<<

RESOURCE-EFFICIENT FOR UBIQUITOUS COMPUTING Lightweight Cryptographic Primitives from a Hardware & Software Perspective

DISSERTATION

for the degree of Doktor-Ingenieur of the Faculty of Electrical Engineering and Information Technology at the Ruhr University Bochum, Germany

by Elif Bilge Kavun Bochum, December 2014 Copyright © 2014 by Elif Bilge Kavun. All rights reserved. Printed in Germany. Anneme ve babama...

Elif Bilge Kavun Place of birth: Izmir,˙ Turkey Author’s contact information: [email protected] www.emsec.rub.de/chair/ staff/elif bilge kavun

Thesis Advisor: Prof. Dr.-Ing. Christof Paar Ruhr-Universit¨atBochum, Germany Secondary Referee: Prof. Christian Rechberger Danmarks Tekniske Universitet, Denmark Thesis submitted: December 19, 2014 Thesis defense: February 6, 2015

v

Abstract

Technological advancements in the semiconductor industry over the last few decades made the mass production of very small-scale computing devices possible. Thanks to the compactness and mobility of these devices, they can be deployed “pervasively”, in other words, everywhere and anywhere – such as in smart homes, logistics, e-commerce, and medical technology. Em- bedding the small-scale devices into everyday objects pervasively also indicates the realization of the foreseen “ubiquitous computing” concept. However, ubiquitous computing and the mass deployment of the pervasive devices in turn brought some concerns – especially, security and privacy. Many people criticize the security and privacy management in the ubiquitous context. It is even believed that an inadequate level of security may be the greatest barrier to the long-term success of ubiquitous computing. For ubiquitous computing, the adversary model and the se- curity level is not the same as in traditional applications due to limited resources in pervasive devices – area, power, and energy are actually harsh constraints for such devices. Unfortunately, the existing cryptographic solutions are generally quite heavy for these ubiquitous applications. In order to address the security problem of the resource-constrained devices, “lightweight cryp- tography” has been defined over a decade ago and many different lightweight cryptographic primitives have already been proposed. The published work so far mostly deals with hardware cost reduction. However, this is not the only important metric for such devices. Depending on the application, resource-constrained devices may need lightweight ciphers to be executed in one clock cycle, which still achieve a certain security level and a small footprint. Furthermore, as most of the pervasive computing applications are implemented in software on embedded mi- crocontrollers, there is also a need for lightweight ciphers that result in efficient code size and execution time. In this thesis, we understand lightweight cryptography also as “resource-efficient cryptogra- phy” and we aim to provide new “resource-efficient” solutions for resource-constrained devices, which address the mentioned gaps in lightweight cryptography. We start with initial investiga- tions on existing lightweight primitives, where we present efficient implementations on different platforms, their applications, and comparisons. In the light of our initial investigations, we first propose a new low-latency and low-area lightweight PRINCE. Following PRINCE, we change our direction to the software side – targeting the software implementations on mi- crocontrollers. As a first step, we come up with a hardware/software co-design approach, the Non-linear/Linear Unit (NLU) Instruction Set Extension (ISE), which targets the 8-bit AVR instruction set of widely-used Atmel microcontrollers. After that, we extend our approach more on the primitive design side, where we define another new lightweight cipher, the “software- oriented” lightweight cipher PRIDE. In addition to our contributions on efficient lightweight primitive implementations presented in the first part of this thesis, the two novel lightweight block cipher designs achieve the targets and present the best academic results published so far. In the ISE design, our good results encourage further block cipher extensions on different microcontrollers in order to get a better code size and execution time. However, it is of course not easy to overcome all the gaps in lightweight cryptography in one work. Therefore, other designs and solutions addressing different metrics still remain as an open research problem left for future works.

Keywords. Lightweight cryptography, resource-efficient cryptography, design, ubiquitous computing, symmetric- cryptography, block cipher, , hardware implementation, software implementa- tion, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), microcontroller, Atmel AVR, IT Security

viii Kurzfassung

Ressourcen-effiziente Kryptographie f¨urUbiquitous Computing: Kryptographische Primitive aus einer Hardware & Software Perspektive

Technologische Fortschritte in der Halbleiterindustrie in den letzten Jahrzehnten haben die Massenproduktion von Mikrorechnern m¨oglich gemacht. Dank der Kompaktheit und Mobilit¨at dieser Ger¨ate k¨onnen diese “durchdringend” eingesetzt werden, mit anderen Worten, uberall¨ und an jedem Ort – wie in Smart Homes, Logistik, E-Commerce, und Medizintechnik. Das Einbetten der kleinen Controller in Alltagsgegenst¨ande zeigt die zunehmende Realisierung des prophezeiten “Ubiquitous Computing” Konzepts. Allerdings bringt dieses neue Paradigma und sein Massenseinsatz auch neue Bedrohungen mit sich – vor allem mit Blick auf Datenschutz und Privatsph¨are. Viele Menschen kritisieren aktuelle Sicherheitskonzepte in diesem allgegenw¨artigen Kontext. Es wird sogar angenommen, dass ein unzureichendes Sicherheitsniveau die gr¨oßte Barriere fur¨ den langfristigen Erfolg des Ubiquitous Computing sein k¨onnte. Anders als in traditionellen Bereichen der Datenverarbeitung sind das Angreifermodell und die Sicherheitslevel nicht die gleichen. Dies liegt vor allem an den begrenzten Ressourcen in Mikrokontrollern und Chips – Chipfl¨ache, Rechenleistung, und Energie sind ¨okonomisch bedingte Einschr¨ankungen fur¨ sol- che Ger¨ate. Aus diesem Grund sind die vorhandenen Verschlusselungsl¨ ¨osungen fur¨ die neuen, allgegenw¨artigen Anwendungen in der Regel nicht tauglich. Um das Sicherheitsproblem der ressourcenbeschr¨ankte Ger¨ate zu adressieren wurde das Forschungsgebiet der “Hocheffizienten Kryptographie” vor uber¨ einem Jahrzehnt definiert. In diesem Gebiet wurden bereits viele ver- schiedene kryptographische Primitive vorgeschlagen. Die ver¨offentlichten Arbeiten besch¨aftigen sich bisher meist mit der Kostenreduktion bei Implementierungen in Hardware (d.h. wenn ein eigener Chip gebaut werden soll). Dies ist jedoch nicht die einzige wichtige Metrik fur¨ sol- che Ger¨ate. Je nach Anwendung brauchen ressourcenbeschr¨ankte Ger¨ate eine Chiffre, die in einem Taktzyklus ausgefuhrt¨ werden kann und trotzdem ein hohes Sicherheitsniveau bietet. Ferner, da die meisten durchdringend Anwendungen in Software auf Mikrocontrollern imple- mentiert werden, gibt es auch hier Bedarf fur¨ geeignete Chiffren mit effizienter Codegr¨oße und Ausfuhrungszeit.¨ In dieser Arbeit bezeichnen wir die Hocheffiziente Kryptographie auch als “Ressourcen- effiziente Kryptographie” und wir wollen neue “Ressourcen-effiziente” L¨osungen fur¨ ressourcen- beschr¨ankte Ger¨ate bieten, die die genannten Lucken¨ adressieren. Wir beginnen mit Untersu- chungen von bestehenden Primitiven, deren Characteristika wir auf verschiedenen Plattformen pr¨asentieren. Angesichts unserer ersten Untersuchungen schlagen wir zun¨achst eine neue Block- chiffre mit dem Namen PRINCE vor, die besonders wenig Chipfl¨ache und Ausfuhrungszeit¨ ben¨otigt. Danach zielen wir auf Softwareimplementierungen auf Mikrocontrollern. Der erste Schritt in diese Richtung ist ein Hardware/Software Codesign genannt NLU ISE, das sich an den 8-Bit-AVR-Befehlssatz der weit verbreiteten Atmel Mikrocontroller-Familie richtet. Danach definieren wir eine weitere neue Chiffre optimiert fur¨ den Einsatz in Software (ohne notwendige Hardwareerweiterung) die den Namen PRIDE tr¨agt. Zus¨atzlich zu unserem Beitrag zu effizienten Implementierungen im ersten Teil dieser Arbeit pr¨asentieren wir damit zwei neuartige Designs; die Chiffren PRINCE und PRIDE erreichen besten bisher ver¨offentlichten akademischen Ergebnisse. Ergebnisse zu Untersuchungen von Har- ware/Software Codesign ermutigen weitere Codesigns von verschiedenen Chiffren auf verschie- denen Mikrocontroller. Naturlich¨ ist es nicht einfach, alle aktuellen Lucken¨ in der Hocheffizienten Kryptographie in nur einer Arbeit zu uberwinden,¨ daher sind andere Konzepte und L¨osungen sowie die Adressierung von verschiedene Metriken offene Forschungsprobleme fur¨ Nachfolgear- beiten.

Schlagworte. Hocheffiziente Kryptographie, Ressourcen-effiziente Kryptographie, Entwurf, Ubiquitous Com- puting, symmetrische Kryptographie, Blockchiffre, Hashfunktion, Hardware Implementierung, Software Implementierung, ASIC, FPGA, Mikrocontroller, Atmel AVR, IT-Sicherheit

x Acknowledgements

This thesis is the outcome of wonderful three and a half years at the Embedded Security (EmSec) Group of the Horst G¨ortzInstitute for IT-Security (HGI), Ruhr University Bochum (RUB). EmSec was always like (sometimes literally) a second home to me. The warm, friendly, and inspiring environment of EmSec (and of course, SHA1) group was always motivating, thanks a lot to you all! There are many people that I would like to express my gratitude, without whom the results of this thesis would not be possible. First of all, I would like to thank my supervisor Christof Paar for giving me this Ph.D. opportunity. I have always felt very lucky for being his student, he is open to new ideas and provides the research freedom a student needs. He always encouraged networking, which gave me the opportunity to meet and work with the leading researchers in IT Security field – I learned a lot from them! Secondly, I am very grateful to Tolga Yal¸cınand Gregor Leander for their friendly and cooperative guidance during my Ph.D. – they have been great teachers and mentors. Special thanks to Christian Rechberger, for being the secondary referee of my thesis and the fruitful research projects we had. My co-authors (in alphabetical order) – Martin. R. Albrecht, Lejla Batina, Andrey Bogdanov, Julia Borghoff, Anne Canteaut, Amitabh Das, Benedikt Driessen, Barı¸sEge, Susanne Engels, Tim G¨uneysu,Miroslav Kneˇzevi´c, Lars R. Knudsen, Nele Mentens, Hristina Mihajloska, Oliver Mischke, Ventislav Nikov, Thomas P¨oppelmann, Peter Rombouts, Søren S. Thomsen, Elmar Tischhauser, Ingrid Verbauwhede; it was an honor working with you, thanks a lot! Many thanks to our project partner NXP for pointing us to the exciting low-latency cipher design idea, which is now a very important part of this thesis. Furthermore, I would like to thank the UbiCrypt (Ubiquitous Cryptography) Research Training Group (DFG Graduiertenkolleg) for funding my research, and supporting me throughout my Ph.D. with many personal and career development seminars and workshops. All UbiCrypt members, it was great to work with you, thanks a lot for all the comments and productive discussions. Thanks to all the people I met during my internship at Qualcomm, it was a great experience working with you! Special thanks to our team assistant Irmgard K¨uhnand our technical assistant Horst Edel- mann, for their kind help with all administrative and technical stuff. Also, thanks a lot to our UbiCrypt coordinator Dominik Baumgarten for answering all my UbiCrypt-related questions and especially for his help during CrossFyre workshop organization. My office-mate Christian Zenger – it was awesome to share an office with you, keep the awesomeness! Vielen Dank an Benedikt, especially for helping me with the translation of the thesis abstract into German. All the loved ones; my family and friends – you provided me the support I need all the way along... This work would not be possible without you, thank you for being there! Last but not the least, to all other people I forgot to mention here: You know I owe you, thanks!

1Sichere Hardware Arbeitsgruppe

Table of Contents

Imprint ...... v Abstract ...... vii Kurzfassung ...... ix Acknowledgements ...... xi

I Preliminaries 1

1 Introduction 3 1.1 Ubiquitous Computing and Security Concerns ...... 3 1.2 Resource-constrained (Lightweight) Cryptography ...... 4 1.3 Motivation ...... 5 1.4 Contributions and Structure of the Thesis ...... 6

2 Overview 9 2.1 Cryptological Background ...... 9 2.2 Block Ciphers ...... 9 2.2.1 Substitution-Permutation Network (SPN) ...... 10 2.2.2 Notation and Mathematical Background ...... 11 2.3 Hash Functions ...... 13

3 Tools 15 3.1 Hardware Platforms ...... 15 3.1.1 ASIC ...... 15 3.1.2 Reconfigurable Hardware – FPGA ...... 16 3.2 Software Platforms ...... 17 3.2.1 ATmega8A ...... 17 3.2.2 STM32F4DISCOVERY ...... 17 3.2.3 EnergyMicro EF32GG-STK3700 ...... 17

II Resource-efficient Implementations of Existing Ciphers 19

4 An Analysis of Recently Developed Lightweight Block Ciphers 21 4.1 Introduction ...... 21 4.2 Background ...... 22 4.2.1 Fully-parallel Implementations ...... 22 Table of Contents

4.2.2 Implementation of KATAN ...... 24 4.3 Analysis Strategy ...... 24 4.3.1 Architectural Decisions ...... 24 4.3.2 Evaluation of Design Parameters ...... 25 4.4 Results ...... 25 4.5 Conclusions and Future Work ...... 26

5 Random Access Memory (RAM)-Based Ultra-Lightweight FPGA Implementation of PRESENT 29 5.1 Introduction ...... 29 5.2 Previous Work ...... 30 5.3 The Proposed PRESENT Implementations on FPGA ...... 30 5.3.1 State Processing ...... 31 5.3.2 ...... 32 5.3.3 Final Round ...... 33 5.3.4 Overall Design ...... 34 5.4 Results and Discussion ...... 37 5.5 Conclusion and Future Work ...... 37

6 On the Suitability of SHA-3 Finalists for Lightweight Applications 39 6.1 Introduction ...... 39 6.2 Implementations of the Finalist Hash Functions ...... 40 6.2.1 BLAKE ...... 40 6.2.2 Grøstl ...... 44 6.2.3 JH ...... 46 6.2.4 Keccak ...... 52 6.2.5 ...... 55 6.3 Interface ...... 59 6.4 Results and Discussion ...... 60 6.5 Conclusion ...... 61

7 IPSecco: A Lightweight and Reconfigurable Internet Protocol Security (IPSec) Core 63 7.1 Introduction ...... 63 7.2 Related Work ...... 64 7.3 Implementation ...... 65 7.3.1 ...... 66 7.3.2 Authenticated ...... 68 7.3.3 ...... 71 7.4 Results and Discussion ...... 72 7.5 Conclusion and Future Work ...... 74

III New Resource-efficient Constructions for the Ubiquitous Computing Era 77

xiv Table of Contents

8 Targeting Low-latency: A Lightweight Block Cipher PRINCE 79 8.1 Introduction ...... 79 8.2 PRINCE Background ...... 81 8.2.1 Highlights of PRINCE ...... 81 8.2.2 Cipher Description ...... 82 8.2.3 Design Decisions ...... 84 8.2.4 Security Analysis ...... 87 8.3 Hardware Investigations ...... 88 8.3.1 Searching for the Optimal Sbox ...... 88 8.3.2 Low-cost Linear Layer ...... 89 8.4 Hardware Implementations ...... 89 8.4.1 Unrolled Architecture ...... 90 8.4.2 Round-based Architecture ...... 91 8.4.3 Performance of Both Architectures on FPGA ...... 92 8.5 Software Implementations ...... 93 8.6 Investigations on Countermeasures Against Potential Side-Channel Attacks (SCA) on PRINCE ...... 94 8.6.1 Adversary Model ...... 94 8.6.2 Potential Countermeasures for PRINCE Against SCA ...... 95 8.7 Concluding Remarks ...... 96

9 Towards Software-efficiency: NLU Instruction Set Extension 97 9.1 Introduction ...... 97 9.2 Related Work and the Proposed ISE Model ...... 98 9.2.1 Realization of Substitution ...... 99 9.2.2 Realization of Permutation ...... 100 9.2.3 Loading the Substitution and Permutation Coefficients ...... 101 9.2.4 NLU Instructions ...... 101 9.3 Unified Non-linear/Linear Hardware Unit ...... 101 9.3.1 Non-linear Unit ...... 103 9.3.2 Linear Unit ...... 103 9.3.3 Area and Power Consumption of NLU Module ...... 103 9.4 Applications ...... 105 9.4.1 PRESENT Software Implementation ...... 105 9.4.2 CLEFIA Software Implementation ...... 109 9.4.3 Serpent Software Implementation ...... 110 9.4.4 AES Software Implementation ...... 110 9.5 Results and Discussion ...... 114 9.6 Conclusion and Future Directions ...... 115

10 Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE 117 10.1 Introduction ...... 117 10.1.1 Our Contribution ...... 118 10.2 The Interleaving Construction ...... 119 10.2.1 The General Construction ...... 119

xv Table of Contents

10.2.2 Searching for L0 ...... 121 10.3 The Search for “Software-friendly” Linear Layers ...... 121 10.3.1 Proposed Search Model ...... 122 10.3.2 Overall Architecture ...... 124 10.3.3 Evaluation of Search Results ...... 125 10.4 The Software-oriented Lightweight Block Cipher PRIDE ...... 126 10.4.1 The Extremely Efficient Linear Layer ...... 126 10.4.2 Sbox Selection ...... 128 10.4.3 Description of PRIDE ...... 130 10.4.4 Performance Analysis ...... 131 10.4.5 Security Analysis ...... 133 10.5 Conclusion ...... 133

IV Conclusion 135

11 Concluding Remarks 137 11.1 Implementations and Applications of Lightweight Primitives ...... 137 11.2 Lightweight Block Cipher Design ...... 138 11.3 ISE Design for (Lightweight) Crypto Primitives ...... 138

12 Directions for Future Research 141 12.1 Further Lightweight Crypto Implementations and Applications ...... 141 12.2 Potential Lighweight Cipher Design Ideas ...... 142 12.3 Further (Lightweight) Crypto ISE Ideas ...... 142

V Appendix 143

PRINCE Diffusion Matrices 145

PRINCE Test Vectors 149

PRIDE Diffusion Matrices 153

PRIDE Test Vectors 157

Abbreviations 161

Bibliography 167

List of Figures 183

xvi Table of Contents

List of Tables 187

About the Author 189

Author’s Publications 191

xvii

Part I

Preliminaries

Chapter 1 Introduction

This chapter provides some background information about ubiquitous computing and “resource-constrained” (lightweight) cryptography, gives a brief overview of the avail- able lightweight constructions and implementations, and discusses the unaddressed points in lightweight solutions. Following that, the research motivation of this the- sis is explained. Finally, the structure of the thesis is outlined and the research contributions are summarized.

Contents of this Chapter 1.1 Ubiquitous Computing and Security Concerns ...... 3 1.2 Resource-constrained (Lightweight) Cryptography ...... 4 1.3 Motivation ...... 5 1.4 Contributions and Structure of the Thesis ...... 6

1.1 Ubiquitous Computing and Security Concerns

Modern computing devices have been in our lives now for more than 50 years. At the beginning, they used to cover the size of a large room, and they were consuming as much power as several hundred modern PCs. Over the last few decades, the semiconductor industry has shown great technological advancements, which made the mass production of very small-scale computing devices possible. The compactness, power-efficiency, and mobility of these devices let them to be deployed pervasively, in other words, everywhere and anywhere. For instance, pervasive devices such as embedded microcontrollers, sensors, and terminals can now be found in many different daily life applications – smart homes, logistics, e-commerce, and medical technology are to name but a few of these. In Figure 1.1, chart (b) demonstrates the development of computing devices (a) visually1. Embedding the small-scale devices into our everyday objects pervasively also indicates the realization of the foreseen “ubiquitous computing” [Wei99] concept, which is becoming a more important part of our lives day by day. However, ubiquitous computing and the mass deployment of pervasive devices in turn brought some concerns – especially, security and privacy. Despite the security and privacy attributes mentioned in the ubiquitous computing definition of Massachusetts Institute of Technology (MIT)’s Project Oxygen [MIT], many people criticize the security and privacy management in the ubiquitous context. It is even believed that an inadequate level of security may be the

1Figure adapted from [Wal07].

3 Chapter 1. Introduction

Network

Interfaces Processor Storage

Energy

Computing Device (a)

n

o

i

s

a

v

r

e P Ubiquitous systems ~ 2000 /2005

Ambient intelligence ~ 2005 /2010 Mobility emerged ~ 1990

Computers communicate ~ 1970

Information technology born ~ 1960

(b) Time

Figure 1.1: Development of computing devices over the last 50 years greatest barrier to the long-term success of ubiquitous computing. However, for ubiquitous com- puting, the adversary model and the security level is not the same as for traditional applications due to the resource-constrained nature of the small-scale devices. Area, power, and energy are harsh constraints as a result of the limited resources. Unfortunately, many existing “traditional” cryptographic solutions are generally quite heavy for such applications. “Lightweight cryptogra- phy” has been defined to provide solutions for the security problems of these embedded devices. As lightweight cryptography is specifically tailored for resource-constrained environments, we will also call it “resource-efficient cryptography” in this thesis.

1.2 Resource-constrained (Lightweight) Cryptography

Lightweight cryptography has been attracting researchers despite being a relatively new topic. The main question of this field of cryptography is achieving high or sufficient levels of security by consuming small area and less computing power, which means being “resource-efficient”. The

4 1.3. Motivation focus of lightweight cryptography lies in new lightweight primitive designs and adaptations and efficient implementations of existing cryptographic primitives. In the area of lightweight cryptography, several primitives for symmetric and asymmetric cryptography, and hash functions have been proposed within the last decade. Among the best- studied symmetric algorithms are the block ciphers CLEFIA [SSA+07], HEIGHT [HSH+06], KATAN, KTANTAN [CDK09], KLEIN [GNL11], mCrypton [LK06], LED [GPPR11], Pic- colo [SIH+11], and PRESENT [BKL+07], as well as the stream ciphers Grain [HJMM08], Mickey [BD08], and Trivium [CP08]. While NTRU [NTR] being the only lightweight example for asymmetric cryptography (note that Elliptic Curve Cryptography (ECC) [Kob87, Mil86] per- forms well compared to RSA [RSA78] and public-key based on discrete logarithm problem [PP10a] and is hence used in lightweight applications, but it is not developed specifi- cally for lightweight); PHOTON [GPP11], QUARK [AHMNP13], and SPONGENT [BKL+11] have been proposed as lightweight hash functions. There have also been many implementa- tion/adaptation attempts – different hardware and software implementations of traditional and lightweight proposals have been extensively studied in many versions – speed-optimized, area- optimized, serial, parallel, etc. – by the researchers. Out of the primitives that we mentioned above, two prominent lightweight block cipher exam- ples, the PRESENT lightweight block cipher and Sony’s CLEFIA, have already been accepted as lightweight cryptography standards by International Organization for Standardization (ISO) (as ISO/IEC Standard 29192-2:2012) in 2012 [ISO]. This standardization also points out industry demand for lightweight ciphers. Somewhat surprisingly, the focus in the lightweight crypto literature has almost been entirely on hardware-efficiency. However, this is not the only important metric for pervasive devices. Depending on the application, resource-constrained devices may need lightweight crypto prim- itives that are designed according to other metrics.

1.3 Motivation

So far, metrics other than hardware-efficiency have barely been addressed in lightweight cryp- tography solutions. While hardware-efficiency is certainly a valid optimization objective, its relevance to real-world applications is limited. Nowadays, there are several interesting and strong proposals available that feature a very small area but simultaneously neglect other, im- portant real-world constraints. For example, depending on the application, resource-constrained devices may need lightweight ciphers to be executed in one clock cycle (e.g., instant authen- tication and read/write access to memory devices), while still guaranteeing a certain security level and a small footprint. Unfortunately, recent proposals achieve the goal of a small chip area by sacrificing execution speed to such an extent that even in applications where speed is supposedly uncritical, the ciphers are getting too slow2. Furthermore, as the vast majority of pervasive devices is equipped with (small) embedded CPUs as opposed to dedicated ASICs, and the pervasive computing applications – including security applications – are implemented in software on these embedded microcontrollers, there is also a need for lightweight ciphers that result in efficient code size and execution time. Interestingly, lightweight solutions with respect to software-efficiency have barely been addressed so far. Figure 1.2 visualizes this situation in

2See also [KNR12], which asks “Is lightweight = light + wait?”.

5 Chapter 1. Introduction lightweight cryptography: There have been numerous block cipher proposals on the hardware area side; however, the other three parts are nearly left unaddressed. The software column de- picts the attempts to come up with ciphers partially-tailored for low-cost processors; but except SPECK [BSS+13], these proposals generally result in high execution time and/or use a large amount of program memory storage3.

Hardware Software

lo o PRESENT KA HT c K TAN HIG ITUBee KL Area/ ic LEIN EIN P TAN TAN SIMON ? K Code size TWINE K m SE PEC Cryp CLE A S LED LBlock ton FIA

E L Latency/ IN Bloc ? TW ? k Execution time SP ECK KLEIN

Figure 1.2: The gaps in lightweight cryptography

In this thesis, we aim to address the mentioned gaps in lightweight cryptography. We start with initial investigations of existing lightweight primitives and solutions, where we present efficient implementations of them together with comparisons. We also present efficient imple- mentation of ciphers specific to different platforms. In light of our initial investigations, we propose the new low-latency and low-area lightweight block cipher PRINCE, which fills the first column of the table in Figure 1.2. After PRINCE, we change our direction to software- oriented solutions – targeting software implementations on microcontrollers. We first propose a hardware/software co-design approach, the NLU ISE, which targets the 8-bit AVR instruction set of widely-used Atmel microcontrollers. Following that, we extend our approach to the prim- itive design side, where we try to find solutions for the second column of the table in Figure 1.2. We propose another new lightweight crypto primitive, the “software-oriented” lightweight block cipher PRIDE. Note that, in this thesis, we will focus on the lightweight block cipher and lightweight hash function parts of lightweight cryptography.

1.4 Contributions and Structure of the Thesis

This thesis consists of four parts: Preliminaries (Part I) – together with this introduction, we present a cryptographic overview and a description of our tools in this part; Appetizers (Part II) – here we explain our initial investigations and findings in lightweight cryptography; Resource- efficient Solutions for Ubiquitous Computing (Part III) – we introduce our novel designs in this part; Conclusion (Part IV) – we finally conclude the thesis and outline future directions.

3See Section 10.1 for more information.

6 1.4. Contributions and Structure of the Thesis

Note that, due to the interdisciplinary nature of the thesis topic, we had collaborations with different research groups during the course of this thesis. However, the thesis presents only the contributions of the thesis author. We refer to the original publications where necessary. In the following, we list the four parts and their chapters together with the respective contri- butions and the regarding publications.

 Part I

 Chapter 2: A general overview of block ciphers and hash functions is presented here, in order to provide the basic knowledge needed for the following chapters.

 Chapter 3: Here we present the hardware and software tools used during the course of this thesis.

 Part II

 Chapter 4: This chapter presents an area, power, and energy analysis of some of the recently developed lightweight block ciphers and compares them to different implementations of the standard AES [NIS01] algorithm. This work is a guide for the selection of algorithms based on application requirements. The results of this collaborative work were later published at the International Workshop on RFID Security and Privacy (RFIDSec) in 2013 [BDE+13].

 Chapter 5: Here we propose two different FPGA implementations of the lightweight cipher PRESENT. Our main design strategy in this chapter is the utilization of ex- isting RAM blocks in FPGAs for the storage of internal states, in order to reduce the slice count. Our results were published at the International Conference on Re- configurable Computing and FPGAs (ReConFig) in 2011 [KY11].

 Chapter 6: This chapter investigates the suitability of SHA-3 finalists for lightweight applications. For each finalist, we try to achieve the lowest reported gate count while maintaining a respectable throughput. We mainly favor a word-serial approach in our designs to achieve a low gate count. This work was presented at the 3rd SHA-3 Conference in 2012 [KY].

 Chapter 7: In this chapter, we propose a reconfigurable lightweight IPSec hardware core, which is implemented on FPGA. We evaluate efficient implementations of stan- dardized and/or well-known lightweight and hardware-friendly algorithms. Here we make use of some of the designs presented in 5 and 6. The results of this collaborative work were published at ReConFig in 2012 [DGK+12].

 Part III

 Chapter 8: The design of a new lightweight block cipher PRINCE, which specifi- cally targets low-latency encryption, is investigated in this chapter. The main design target is to keep the gate count below the existing ciphers in the literature, and the latency at exactly one clock cycle. Within the scope of this thesis, we only focus on the specific parts of this design process, on which the author mainly worked. This work was published at ASIACRYPT in 2012 [BCG+12a].

 Chapter 9: This chapter is a step forward towards the software-efficiency of the cryptographic algorithms. We introduce a non-linear/linear ISE, namely NLU, which

7 Chapter 1. Introduction

is capable of implementing non-linear operations, namely Sboxes, expressed in their Algebraic Normal Form (ANF) and linear operations expressed as binary matrix multiplication, “multiply-and-add”, form. We base our NLU design only on the widely-used 8-bit AVR instruction set. Our results were published at IEEE Sympo- sium on Computer Arithmetic (ARITH) in 2013 [EKM+13].

 Chapter 10: This chapter focuses on “software-efficient” algorithm design. In order to get an efficient cipher, we have a look at the software-costly SPN component linear layer. On a hardware-accelerated reconfigurable search platform, we search for efficient linear layers regarding software performance. Our software-oriented block cipher proposal PRIDE uses the resulting linear layer. PRIDE is optimized for 8-bit microcontrollers and outperforms all academic solutions both in terms of code size and cycle count. The results of this collaborative work were published at ReConFig in 2013 [KLY13] and at CRYPTO in 2014 [ADK+14a].

 Part IV

 Chapter 11: Here we summarize our results and discuss the contributions of the thesis.

 Chapter 12: In this chapter, we point out the parts that might need further inves- tigation and present some ideas for future research.

Note that the author of this thesis has contributed to other topics, such as hardware accel- eration for and authenticated encryption; however, they are not included in this thesis due to its focus on lightweight cryptography. A detailed list of all publications can be found at the end of this thesis.

8 Chapter 2 Overview

Lightweight cryptographic primitives possess the properties of traditional crypto primitives but are optimized for resource-constrained platforms. Therefore, here we present a general overview of block ciphers and hash functions, in order to provide the basic knowledge needed for the following chapters.

Contents of this Chapter 2.1 Cryptological Background ...... 9 2.2 Block Ciphers ...... 9 2.3 Hash Functions ...... 13

2.1 Cryptological Background

Cryptology consists of two main parts, one being the effort of building primitives – cryptography, the other being the effort of analyzing these primitives – cryptanalysis. Actually, in the ideal case, these two have a mutual relationship: Cryptanalysis investigates the security of a primitive and provides feedback; then cryptography comes up with a more secure primitive design, which gives cryptanalysis the chance to apply better analysis techniques to break the primitive. In this thesis, our focus is on cryptographic primitives. As depicted in Figure 2.1, it has three main branches; symmetric (secret-key) crypto, asymmetric (public-key) crypto, and hash functions. Within the scope of this thesis, we are interested in the symmetric-key primitives and hash functions. Block ciphers and stream ciphers are the basic building blocks of symmetric cryptography. Out of these, we will give further information on block ciphers in the following.

2.2 Block Ciphers

Block ciphers are one of the most prominently used cryptographic primitives and account for the largest portion of data encrypted today. This was facilitated by the introduction of Rijn- dael [DR98] as the AES, which was a major step forward in the field of block cipher design. Not only does AES offer strong security, but its structure also inspired many cipher designs ever since. There are two main design strategies that can be identified for block ciphers: Sbox-based constructions and constructions without Sboxes. Sbox-less constructions are most prominently

9 Chapter 2. Overview

Cryptology

Cryptographic Cryptanalysis Primitives

Secret-key Hash Public-key Cryptography Functions Cryptography

Figure 2.1: Branches of cryptology and cryptographic primitives those using addition, rotation, and Exclusive OR (XOR)s (Addition-Rotation-XOR (ARX) designs). Furthermore, Sbox-based designs can be split into Feistel-ciphers and SPN. Both concepts have been successfully used in practice, the most prominent example of an SPN cipher being AES and the most prominent Feistel-cipher being the former DES [NIS99]. In the scope of this thesis, we will focus on SPN constructions.

2.2.1 SPN An SPN is an iterated cipher structure where the round function is built from invertible function layers such as substitution and permutation, as can be seen in Figure 2.2.

S S S

k kn1 k n k 0 S 1 S n S n o o i i t t a a t t

plaintext u S S u S m m r r e e P S S P S

S S S

Figure 2.2: Illustration of SPN

Besides the most prominent example AES, it is also worth mentioning that the concept of SPN has not only been used in the design of block ciphers but also for designing crypto-

10 2.2. Block Ciphers graphic permutations, most prominently for the design of several sponge-based hash functions including SHA-3 [NIS14]. In Substitution-Permutation (SP) networks, the round function con- sists of a non-linear layer composed of small Sboxes working in parallel on small chunks of the state, and a linear layer that mixes those chunks. Thus, designing an SPN block cipher essentially reduces to choosing one (or several) Sboxes and a linear layer.

Substitution (Non-linear) Layer

Substitution is the non-linear part of a cipher and introduces confusion. It generally consists of the parallel application of functions S : {0, 1}m → {0, 1}m that operate on small chunks of data. The S functions are called Sboxes (Substitution boxes) and can be implemented as a lookup table with 2m entries. Note that the input and output size of Sboxes may also be different. However, in such cases the Sbox is not bijective. A lot of research has been devoted to the study of Sboxes. All Sboxes of size up to 4 bits have been classified (indeed, more than once – cf. [BCBP03, LP07, Saa11]). Moreover, Sboxes with optimal resistance against differential and linear attacks have been classified up to dimension 5 [BL08]. In general, several constructions are known for good and optimal Sboxes in arbitrary dimensions. Starting with the work of Nyberg [Nyb93], this has evolved into its own field of research in which those functions are studied in great detail. A nice survey of the main results of this line of study is provided by Carlet [Car10].

Permutation (Linear) Layer

The permutation layer is an affine transformation that operates on the complete block and introduces diffusion. A well-chosen linear layer is not only crucial for the security and efficiency of a block cipher, but also allows to argue in a simple and thereby convincing way about its security. AES and its predecessor SQUARE [DKR97] are good examples of that. Research efforts have been focused on Sboxes as mentioned in the previous section; however, the situation for the linear layer is less clear. For the design of the linear layer, two general ap- proaches can be identified. One still widely-used method is to design the linear layer in a rather ad-hoc fashion, without following general design guidelines. While this might lead to very secure and efficient algorithms (cf. Serpent [ABK98] and SHA-3 as prominent examples), it is not very satisfactory from a scientific point-of-view. The second general design strategy is the wide-trail strategy introduced by Daemen in [Dae95] (see also [DR01]). Especially for the security against linear [Mat93] and differential [BS90] attacks, the wide-trail strategy usually results in simple and strong security arguments. It is therefore not surprising that this concept has found its way in many recent designs (e.g., Khazad [BR00b], Anubis [BR00a], Grøstl [GKM+11], PHOTON, LED, PRINCE, mCrypton to name but a few). In a nutshell, the main idea of the wide-trail strategy is to link the number of active Sboxes for linear and differential cryptanalysis to the minimal distance of a certain linear code associated with the linear layer of the cipher. In turn, choosing a good code (with some additional constraints) results in a large number of active Sboxes.

2.2.2 Notation and Mathematical Background

In this part, we fix some basic notation and furthermore recall the ideas of the wide-trail strategy.

11 Chapter 2. Overview

We deal with SPN block ciphers where the Sbox layer consist of n Sboxes of size b each. Thus the block size of the cipher is n × b. The linear layer will be implemented by applying k binary matrices in parallel. n We denote by F2 the field with two elements and by F2 the n-dimensional vector space over b F2. Note that any finite extension field F2b over F2 can be viewed as the vector space F2 of n dimension b. Along these lines, the vector space (F2b ) can be viewed as the (nested) vector n  b  space F2 . n  b  b 1 Given a vector x = (x1, . . . , xn) ∈ F2 , where each xi ∈ F2, we define its weight as

wtb(x) = |{1 ≤ i ≤ n | xi =6 0}|.

b n b n Following [DR01], given a linear mapping L :(F2) → (F2) , its differential branch number is defined as n  b  Bd(L) := min{wtb(x) + wtb(L(x)) | x ∈ F2 , x =6 0}. The cryptographic significance of the branch number is that the branch number corresponds to the minimal number of active Sboxes in any two consecutive rounds. Here an Sbox is called active Sbox if it gets a non-zero input difference in its input. Given an upper bound p on the differential probability for a single Sbox along with a lower bound of active Sboxes immediately allows to deduce an upper bound for any differential char- acteristic2 using

average probability for any non-trivial characteristic ≤ p#active Sboxes.

For linear cryptanalysis, the linear branch number is defined as n ∗  b  Bl(L) := min{wtb(x) + wtb(L (x)) | x ∈ F2 , x =6 0} where L∗ is the adjoint linear mapping. That is, with respect to the standard inner product, L∗ corresponds to the transposed matrix of L. In terms of correlation (cf., for example, [Dae95]), an upper bound c on the absolute value of the correlation for a single Sbox results in a bound for any linear trail (or linear characteristic, linear path) via absolute correlation for a trail ≤ c#active Sboxes.

The differential branch number corresponds to the minimal distance of the F2-linear code C b over F2 with generator matrix G = [I | LT ] where I is the n × n identity matrix. The length of the code is 2n and its dimension is n (here n dimension corresponds to log2b (|C|) as it is not necessarily a linear code). Thus, C is a (2n, 2 ) b additive code over F2 with minimal distance d = Bd(L). The linear branch number corresponds in the same way to the minimal distance of the F2- linear code C⊥ with generator matrix

G∗ = [L | I].

1 b n nb Of course F2 is isomorphic to F2 , but the weight is defined differently on each. 2Averaging over all keys, assuming independent round keys.

12 2.3. Hash Functions

Note that C⊥ is the dual code of C and in general the minimal distances of C⊥ and C do not need to be identical. Finally, given linear maps L1 and L2, we denote by L1 × L2 the direct sum of the mappings, i.e. (L1 × L2)(x, y) := (L1(x),L2(y)).

2.3 Hash Functions

Hash functions are vital cryptographic primitives and they are used in many protocols. A hash function computes a digest of a message, which is a short and fixed-length bit-string. Unlike other cryptographic algorithms, hash functions do not use key. In order to be secure, hash functions need to possess three main properties, namely pre-image resistance (one-wayness), second pre-image resistance (weak collision resistance), and collision resistance (strong collision resistance) [PP10b]. Hash functions are used in applications such as digital signatures, Message Authentication Code (MAC)s, key derivation, etc. The primary application of hash functions in cryptography is message integrity. The hash value provides a digital fingerprint of message contents, ensuring that the message is not changed by an adversary. Hash algorithms are effective because of the extremely low probability that two different plaintext messages will yield the same hash value. There are several well-known hash functions in use today, e.g., MD5 [MD5], National Institute of Standards and Technology (NIST) standards SHA-1, SHA- 256 [NIS08] and new SHA-3, PHOTON, and SPONGENT to name but a few of these.

13

Chapter 3 Tools

In this chapter, we present the hardware and software tools used during the course of this thesis. We use the hardware and software tools mainly for two reasons – as implementation platform, and as search platform. Here, we explain the use case of each of these tools, and briefly give their properties.

Contents of this Chapter 3.1 Hardware Platforms ...... 15 3.2 Software Platforms ...... 17

3.1 Hardware Platforms

In our work, we have used hardware platforms for two basic purposes; as implementation plat- form, or as hardware accelerator in order to realize our search algorithms. In the former case, we have used both ASIC and FPGA platforms. However, for the latter case, we made use of only FPGA platforms, due to the reconfigurability they offer. In the following, we give some information on the hardware tools we have used in our research.

3.1.1 ASIC In our work, we implemented many different hardware architectures of the existing and new lightweight cryptographic primitives. In the implementation process, Mentor Graphics Model- sim [Gra] and Cadence NCVerilog 06.20-p001 [Cad] are used for functional verification/sim- ulation. The RTL description is then synthesized with Cadence Encounter RTL Compiler v10.1 [Cad13], which was also used to generate the area, timing, and power estimation reports. Since the gate count and delay parameters are heavily technology dependent, the implemen- tations have been synthesized for different standard-cell technology libraries. Three different technology libraries are used throughout this thesis: 130 nm and 90 nm low-leakage Faraday Libraries from UMC [UMCa, UMCb], and 45 nm generic NANGATE [Nan] Open Cell Library. All ciphers are implemented in Verilog-Hardware Description Language (HDL) and typical op- erating conditions were assumed in all syntheses. Area results are depicted in µm2 in area reports; however, this value is highly-dependent on the fabrication technology and the standard-cell library used in the synthesis. Therefore, in order to have fair comparisons, chip area is typically measured in Gate Equivalence (GE), i.e., the cipher area normalized to the area of a two-input NAND gate in a given standard-cell

15 Chapter 3. Tools library. The area in GE is derived by dividing the silicon area of the implementation by the silicon area of the two-input NAND gate in the same technology library.

3.1.2 Reconfigurable Hardware – FPGA

Although ASICs still have lower costs in large volumes, for customers willing to produce small amounts of sensor nodes or Radio Frequency IDentification (RFID)s, low-cost and low- power FPGAs seem to be the ideal solution. With the latest advances in the FPGA technologies and architectures, they are expected to be preferable for battery-powered applications like wire- less sensor nodes, presenting a popular platform for lightweight applications. Furthermore, the reconfigurable nature of FPGAs makes them even more attractive in applications that may re- quire regular hardware updates and/or modifications. Therefore, some of our implementations are also synthesized on FPGA. In addition to the regular implementations, we also implemented our reconfigurable search architecture (see Chapter 10) on the FPGA platform. Using an FPGA was indispensable due to the reconfigurability and flexibility it provided, unlike an ASIC platform. This was important as we needed to modify the search parameters and implement architectural modifications easily. As the search platform, we used Xilinx Virtex-6 ML605 Evaluation Board (with XC6VLX240T FPGA on it) [Xil12] (Figure 3.1). We also made use of Xilinx Spartan-3 FPGA Family includ- ing Spartan-3A XC3S50 FPGA [Xil11a] and Xilinx Spartan-6 FPGA Family [Xil11c] together with ML605 Evaluation Board for implementations purposes. All the cores were synthesized, mapped, and place & routed for the corresponding FPGA using Xilinx ISE Design Tool [Xil].

Figure 3.1: Xilinx Virtex-6 ML605 Evaluation Board

16 3.2. Software Platforms

3.2 Software Platforms

We also implemented our designs on embedded microcontrollers. A microcontroller is a small computer on a chip composed of a processor, programmable peripherals, memory (and in most cases program memory), and RAM as well. It is widely-used in embedded systems like remote controls, washing machines, or control systems of a car. In this section, we give brief information on the microcontrollers we have used in our research.

3.2.1 ATmega8A The ATmega8A [Atm13] belongs to the high-performance, low-power Atmel AVR 8-bit micro- controller series. ATmega8A uses 8-bit AVR [Atm10] instruction set and has an advanced Reduced Instruction Set Computer (RISC) architecture with 130 instructions. Most of these instructions are executable in a single clock cycle. ATmega8A reaches up to 16 Million Instruc- tions per Second (MIPS) at a frequency of 16 MHz. Moreover, it comes with three different memory segments: 8 kB of program memory, 512 bytes of EEPROM, and 1 kB internal Static Random Access Memory (SRAM). For 8-bit microcontrollers, AVR is dominating the market along with PIC [Mic] (see [Fri]), thus it was meaningful to choose an ATmega8A microcon- troller for our lightweight crypto primitive implementations. In order to simulate the behavior of our implementations, we also made use of Atmel Studio Integrated Development Environ- ment (IDE) [Atm].

3.2.2 STM32F4DISCOVERY The STM32F4DISCOVERY [STM13], as can be seen in Figure 3.2, is based on an ARM 32-bit Cortex-M4 CPU and has a frequency up to 168 MHz. Furthermore, it is equipped with 192 kB of SRAM and 1 MB of Flash memory. The ST-LINK/V2 with SWD connector makes program- ming and debugging via USB possible. As ARM dominates the market for 32-bit microcon- trollers and widely in use, we chose this microcontroller for the 32-bit assembly implementation of PRIDE(see Chapter 10 for details).

3.2.3 EnergyMicro EF32GG-STK3700 We synthesized PRINCE and PRIDE C implementations on EnergyMicro EF32GG-STK3700 [Sil13] Evaluation Board. It has an ARM 32-bit CORTEX-M3 microprocessor and contains sensors together with peripherals that demonstrate some of the microcontroller’s many capa- bilities. The board features an on-board SEGGER J-Link debugger and an Advanced Energy Monitoring system, allowing to program, debug, and perform real-time current profiling of the application without using external tools. It comes with 1 MB Flash and 128 kB RAM. As the compiler, we selected SimplicityStudio IDE [Sil14], which is based on Eclipse [Ecl] and uses GNU ARM C compiler.

17 Chapter 3. Tools

Figure 3.2: STM32F4DISCOVERY

18 Part II

Resource-efficient Implementations of Existing Ciphers

Chapter 4 An Analysis of Recently Developed Lightweight Block Ciphers

This chapter presents an area, power, and energy analysis of some of the recently developed lightweight block ciphers and different implementations of the AES algo- rithm. We show that area is not always correlated to power and energy consumption, which is of importance for mobile battery-powered devices. Our work can be used as a guide while selecting algorithms based on the area, energy, and key/block length requirements of applications.

Contents of this Chapter 4.1 Introduction ...... 21 4.2 Background ...... 22 4.3 Analysis Strategy ...... 24 4.4 Results ...... 25 4.5 Conclusions and Future Work ...... 26

4.1 Introduction

The purpose of this investigation is to perform an area, power, and energy analysis of some of the most recently developed lightweight block ciphers along with the NIST standard AES, which can be helpful for both the engineering and theoretical communities concerned with lightweight cryptography. Given that the lightweight algorithms are particularly interesting for battery- powered or passive systems such as RFID tags, a valid energy consumption prediction is very desirable. Furthermore, we want to present a comparative study on the performance of the existing lightweight primitives and AES. This is a necessary step for us in order to draw our road map for further research in lightweight cryptography area, which will be presented in the following parts of this thesis. There have been several investigations on performances of lightweight block ciphers and AES. In addition to software performances and hardware area consumption figures of lightweight al- gorithms given in their corresponding publications (e.g., [BKL+07, GNL11, CDK09, HSH+06, SMMK12]), comparative studies on software [CMM13, EGG+12] and hardware [EKU+07, KSPS12, Cry14, KDH+12] platforms have also been provided. Except the recent work of Ker- ckhof et al. [KDH+12], these have focused on area and power consumption figures, and they

21 Chapter 4. An Analysis of Recently Developed Lightweight Block Ciphers do not discuss energy consumption. In [KDH+12], energy consumptions of six different block ciphers are evaluated as well as the area, power consumption, and throughput figures. Con- clusions are drawn with respect to round unrolling, parallelism, and pipelining. Other related works in the past were focused mainly on other, more specific aspects of low-cost applications. For instance, the work of Singel´eeet al. [SSBV11] focuses on the computation and commu- nication energy budget of authentication protocols for active RFID tags. Accordingly, they consider other cryptographic primitives that can be used for authentication, i.e., ECC-based protocols. On the other hand, the AES algorithm as the main standard for encryption in the past decade, has been already evaluated with regard to energy consumption in several previous studies [HV06, dMGSP08, TFs08]. In particular the work of Tillich et al. [TFs08] examines sev- eral different AES Sbox implementations that are based on three different design strategies. The results addressed the consequences of different design strategies on critical path delay, silicon area, and power consumption. All those works contributed to the better understanding of low-cost design principles. As mentioned above, the requirements of extreme low-cost applications of today require more lightweight solutions as advocated by the new block ciphers’ proposals. Our work extends those studies with a suite of the recent lightweight symmetric schemes. Our work covers six different lightweight block cipher architectures and five different implementations of AES block cipher. The differences between the AES architectures are based on the different implementations of Sbox. Note that this chapter presents a collaborative work, which was later published at RFIDSec in 2013 [BDE+13]; however, only the parts with author’s contribution are included in this thesis.

4.2 Background

In this section, we provide some background information on the evaluated block cipher archi- tectures. The information is grouped according to similarities in the architectures. Note that we do not present the algorithmic details of the analyzed ciphers in this thesis, but refer to their original publications.

4.2.1 Fully-parallel Implementations

We have implemented fully-parallel (round-based) versions of AES-128, CLEFIA-128, PRESENT- 80, LED-128, KLEIN-64, and mCrypton-96, where the number next to each cipher represents the chosen key length for the implementation. Among these, AES and CLEFIA are 128-bit block ciphers, whereas all others are 64-bit. For a fair comparison, we have implemented the encryption-only version of each cipher. All the implemented block ciphers share the same struc- ture. Any of these block ciphers can be implemented as shown in Figure 4.1, where the round function is instantiated only once. In this case, the initial input (upon a start signal) of the round function is the sum of the input key and the plaintext (i.e., the initial state). It is pro- cessed by the combinational round function and the next state is generated. It is then stored in the state register, whose output becomes the input to the round function in the next cycle. The iteration count is as many rounds as the cipher is defined for. Finally, the ciphertext can be taken either from the state register output or some internal node of the round function block. The datapath width inside the round function block is equal to the cipher block size, i.e., 128

22 4.2. Background bits for AES and CLEFIA, and 64 bits for all others, while the datapath width of the key scheduling block depends on the selected key size. In the case of LED, we use a fixed key (fixed means either the original input key, or a function of it). Figure 4.2 illustrates the no key-update case for such a round-based implementation.

key sch key round ctext ptext round function

state init round update

Figure 4.1: Round-based implementation of a generic block cipher

key ctext ptext round function

state init round update

Figure 4.2: Round-based implementation of a generic block cipher – no key schedule

This is basically the design strategy we used in all our round-based implementations. In order to keep the round numbers minimum, we got the ciphertext from internal nodes inside the round function instead of the outputs of the state registers. This way, we ensured that the cipher could be run in maximal throughput, that is the distance between two consecutive starts is equal to the number of rounds. We furthermore implemented fully-parallel AES in various flavors. Mainly, we focused on the Sbox, which is the most area consuming unit inside the AES algorithm. The first implementation (AES lut 128) uses Look-Up Table (LUT)-based Sboxes. The second version implements the Sbox in the composite field GF ((24)2) (AES 4 2 128), and the third version in the composite field GF (((22)2)2) (AES 2 2 2 128), which are explained in [Can05]. We have also observed that

23 Chapter 4. An Analysis of Recently Developed Lightweight Block Ciphers the isomorphic transform matrices used for composite implementations are very area-consuming due to the high Hamming Weight (HW), that correspond to XOR gates in hardware. In order to reduce this, it is possible to use other isomorphic matrices, that are affine equivalent to the original matrices. Of course, this requires that the corresponding affine and inverse affine transformations have to be applied to the plaintext and ciphertext, respectively. Similarly, the key scheduling also needs to be carried out in the affine equivalent “domain”. Depending on the choice of the affine transformation, it is possible to reduce the area or the power (or even both) consumption of the overall design. We chose transforms that would minimize the power consumption, and have implemented two more flavors of AES, namely affine-transformed in the composite field GF ((24)2) (AES iso 4 2 128), and affine-transformed in the composite field GF (((22)2)2) (AES iso 2 2 2 128).

4.2.2 Implementation of KATAN KATAN is a block cipher that belongs to a family of small and efficient hardware-oriented block ciphers. KATAN ciphers include KATAN32, KATAN48, and KATAN64. All three ciphers use 80-bit keys and have a different block size (KATANn has an n-bit block size). All three block ciphers are highly compact, with the one having the smallest block size resulting in the smallest circuit area of only 802 GEs [CDK09]. In KATAN, the plaintext is loaded in two registers. In each round, several bits are taken from the registers and entered in two non-linear Boolean functions. The output of the Boolean functions is loaded to the least significant bits of the registers (after they are shifted). This is done in an invertible manner. To ensure sufficient mixing, 254 rounds of the cipher are executed. A round counting Linear Feedback Shift Register (LFSR) is used instead of a counter, for counting the rounds to stop the encryption after 254 rounds, and to introduce more diffusion as well. As there are 254 rounds, an 8-bit LFSR with a sparse feedback polynomial can be used. The LFSR is initialized with some state, and the cipher has to stop running the moment the LFSR arrives at the predetermined state. The key schedule of the KATAN cipher loads the 80-bit key into another LFSR (the least significant bit of the key is loaded to position 0 of the LFSR). In each round, positions 79 and 78 of the LFSR are generated as the round’s subkey, and the LFSR is clocked twice [CDK09]. Note that the base hardware implementations of KATAN were provided by our co-authors from the Katholieke Universiteit Leuven and Radboud University Nijmegen. Then, their im- plementations were adjusted to be in accordance with our general synthesis structure and syn- thesized using our tools.

4.3 Analysis Strategy

This section explains the architectural decisions for the implementation of the ciphers and how the design parameters are evaluated.

4.3.1 Architectural Decisions In order to perform a fair comparison, we make the following architectural decisions:

 All inputs and outputs of the cipher are buffered by a flip-flop.

24 4.4. Results

 We consider encryption-only architectures. This is justified by the fact that the most pop- ular modes of operation do not need decryption in the constrained environment [Dwo01].

4.3.2 Evaluation of Design Parameters

The area reports are generated after hierarchical synthesis in Cadence Encounter RTL Compiler v10.1 using UMC 130 nm low-leakage Faraday technology library. The area numbers in the tables are given in terms of GE. Our evaluation method consists of estimating the pre-layout power consumption and the derived energy using RTL Compiler and ModelSim simulations. Each module is first synthesized for the best power. We used Cadence Encounter RTL Compiler v10.1 in this step in order to get the power results (and the above-mentioned area figures). The generated netlists are then used to simulate the actual module with 100 random keys together with 10 random plaintexts per key to get the best statistics. From these ModelSim simulations, SAIF files are generated. In the last step, the SAIF files are sent back to the synthesis tool together with the netlist from the initial synthesis to run power analysis.

4.4 Results

In this section, we list our results for the different cipher implementations. Further, we detect anomalies based on the fact that we expect the dynamic power consumption to be larger for de- signs with a larger area. The anomalies represent architectures of which (some) gates contribute less to the static and/or dynamic power consumption than the gates of other architectures. This is mainly related to the number of transistors that are conducting (for the static power con- sumption) and the number of nodes that switch (for the dynamic power consumption). Energy per bit is calculated by dividing the total power by the clock frequency and then multiplying by the cycle count, followed by dividing by the block length. A comparison of the AES implementations are given in Table 4.1. The parallel architectures only differ in the way the Sbox was implemented. The table shows that the architectural differences in the Sboxes cause significant differences in area, power, and energy. The reason for some architectures to have a larger dynamic power consumption compared to others while they have a smaller area is probably caused by the fact that these architectures give rise to more internal glitches. In addition, some architectures have a larger static power consumption compared to others while they have a smaller area, which is probably because of the fact that parts of the architecture are not being used all the time during the computation. In the KATAN designs, the block size changes while the key size is the same. Therefore a slight change in area and also in static power consumption is observed in Table 4.2. But the dynamic power is quite low due to the simplistic round operations included in KATAN. When comparing LED and CLEFIA, both with block length 128, it is noticeable that the area ratio is much smaller than the power consumption ratio, in favor of LED. The reason is that LED is based on a fixed key and a simpler round computation compared to CLEFIA. The energy comparison also turns out in favor of LED. In the same way, the round function of PRESENT is much simpler in terms of logic depth compared to mCrypton, resulting in a similar trend. The energy comparison turns out in favor of mCrypton though. This is due to the fact that PRESENT needs more cycles.

25 Chapter 4. An Analysis of Recently Developed Lightweight Block Ciphers

Table 4.1: Performance and energy numbers of various AES realizations all using 128-bit key lengths

Cipher Key Encryption Freq. Area StaticDynamic Energy Energy length time power power per bit (# cycles) (KHz) (GE) (µW) (µW) (pJ/bit) (nJ) AES 2 2 2 128 128 10 100 12405 24.46 210.15 183.29 23.46 AES 4 2 128 128 10 100 11453 21.37 135.26 122.37 15.66 AES iso 2 2 2 128 128 10 100 15442 30.41 52.85 65.05 8.33 AES iso 4 2 128 128 10 100 13052 25.19 37.06 48.63 6.23 AES lut 128 128 10 100 19591 30.81 96.11 99.16 12.69

Table 4.2: Performance and energy numbers of lightweight block ciphers implementations

Cipher Block/KeyEncryption Freq. Area StaticDynamic Energy Energy length time power power per bit (# cycles) (kHz)(GE)(µW) (µW) (pJ/bit) (nJ) CLEFIA 128/128 18 100 6941 13.24 37.09 70.78 9.06 KLEIN 64/64 12 100 2760 4.88 2.18 13.24 0.85 LED 64/128 48 100 3194 5.62 2.34 59.70 3.82 mCrypton 64/96 13 100 3197 5.80 2.50 16.86 1.08 PRESENT 64/80 31 100 2195 3.75 1.14 23.69 3.82 Katan 32 32/80 254 100 801 1.52 0.43 154.78 4.94 Katan 48 48/80 254 100 925 1.71 0.49 116.42 5.60 Katan 64 64/80 254 100 1048 1.94 0.56 99.22 6.34

Note that all the implementations listed in Tables 4.1 and 4.2 are implemented and synthesized from scratch using our resources.

4.5 Conclusions and Future Work

In this chapter, we evaluated the area, power consumption, and energy of lightweight block cipher implementations and different AES architectures. We discussed the differences in dy- namic power consumption in relation to the area. The results show that the parallel AES core with LUT-based Sboxes is the largest in terms of area, but consumes the least dynamic power of all parallel cores. For the other ciphers, the difference depends on the algorithm, mostly on the complexity of the round function. It is evident from the results presented in this chapter, dynamic power consumption plays an important role on the energy/power consumption of cryptographic chips. Although in this work, industrial tools are used to generate dynamic power consumption results, one can also investigate the HDL implementation of a cipher and estimate the dynamic power consumption

26 4.5. Conclusions and Future Work through measuring the toggle activity. The basic idea is to measure the total number of bit toggles, i.e., bit flips, that happen during the encryption of a single block of a given algorithm. This metric can be in particular helpful when considering energy. The energy consumption in Complementary Metal Oxide Semiconductor (CMOS) circuits is dominated by dynamic power dissipation. This is directly proportional to the number of bit toggles and hence provides a good prediction for the energy consumption of a cipher. As future work, the relation between the number of toggles from an HDL implementation and the power reports from the synthesis of this implementation can be investigated.

27

Chapter 5 RAM-Based Ultra-Lightweight FPGA Implementation of PRESENT

This chapter proposes two different FPGA implementations of the lightweight cipher PRESENT. The main design strategy for both designs is the utilization of exist- ing RAM blocks in FPGAs for the storage of internal states, thereby reducing the slice count. In the first design, Sboxes are realized within the slices, while in the sec- ond design they are also integrated into the same RAM block used for state storage. Both designs are well-suited for lightweight applications, which are implemented on low-cost FPGA devices. Besides low-area, a reasonable throughput is also obtained even though it is not the first concern. In addition to a single block RAM, the two designs occupy only 83 and 85 slices and produce a throughput of 6.03 and 5.13 Kbps at 100 KHz system clock on a Xilinx Spartan XC3S50 device, respectively.

Contents of this Chapter 5.1 Introduction ...... 29 5.2 Previous Work ...... 30 5.3 The Proposed PRESENT Implementations on FPGA ...... 30 5.4 Results and Discussion ...... 37 5.5 Conclusion and Future Work ...... 37

5.1 Introduction

So far, several implementations of lightweight cryptographic algorithms have been realized on ASICs, which can also be seen in Chapter 4. While being very desirable for mass pro- duction and low-power consumption, ASICs also suffer from high costs associated with high Non-recurring Engineering (NRE) costs and long manufacturing times, which may be the killing factor for low-volume products. FPGAs present attractive alternatives for this portion of the market with their low or zero NRE costs, and short time-to-market properties. Block RAMs on FPGAs can be considered as one of their major advantages over ASICs, as RAMs require the utilization of special memory generators in ASICs that add further to manufacturing costs. However, block RAMs can also be considered as a curse for FPGAs, especially in massively-computational applications (such as block ciphers implementations), where registers are dominantly-used for state data storage and RAMs are left unutilized.

29 Chapter 5. RAM-Based Ultra-Lightweight FPGA Implementation of PRESENT

In this study, we aim to demonstrate that block RAM-based implementations of block ciphers are also possible and viable on FPGAs, which result in “more slices left” for other applications. As a proof-of-concept, the lightweight block cipher PRESENT is selected as our target algorithm mainly due to its lightweight nature and wide utilization. The PRESENT block cipher, whose suitability for compact implementations on ASICs has been demonstrated several times, can also be implemented on FPGAs efficiently. Until now, there have been register-based implementations of PRESENT on FPGAs [YK09, SSPP09]. However, none of them made use of the already existing block RAMs on FPGAs. In this work, 128-bit key length for PRESENT is considered for comparison with previous works. We propose two different RAM-based implementations of PRESENT in the following sections.

5.2 Previous Work

Due to the relatively high area and power consumption of FPGAs with respect to ASICs, they have not been the target application platform for lightweight applications until recently. As a result, there have been few publications on the design and implementation of lightweight ciphers on FPGAs. Thereby, most of the existing PRESENT implementations generally target ASIC platform, as can be seen in [BKL+07, EKU+07, KSPS12, Cry14, KDH+12]. Another line of research dealt with design strategies for better FPGA implementations of ciphers [RSQL03]. This work presented new descriptions of DES and Triple-DES in order to have better results on FPGA. Out of the few PRESENT implementations on FPGAs (including [Pos09]), we focus on two promising works [YK09, SSPP09]. The first one is mainly a comparison of the PRESENT cipher to the HIGHT cipher on FPGAs. The second one is a detailed survey of PRESENT implementations on FPGAs using different logic optimizers. Both designs rely on registers for state storage. Therefore, they both result in slice counts above 100, while offering high data throughputs. However, in most lightweight applications, lower slice count is more important than high throughput. Hence, we set our goal to have a lower slice count.

5.3 The Proposed PRESENT Implementations on FPGA

As mentioned before, in order to achieve a smaller slice count in our PRESENT implementation, we opted for using one of the existing block RAMs on the FPGA for state and key storage instead of using registers. We identified two different RAM-based approaches: One is to implement the Sboxes on slices, and the second approach is to store the Sboxes in RAM as a lookup table. The second technique is expected to give a smaller slice count. The drawback of the second technique is the increased cycle count and complexity of its control block compared to the first approach. For both approaches, a 64-bit state is represented as 4 × 4 state matrix with 4-bit entries. In order to store data in a 4×4 matrix, 4×16-bit entries are required for state and 8×16-bit entries are required for the key (as we implement the 128-bit key PRESENT in this work). However, this requirement is doubled by using memory in a ping-pong fashion: Rounds of PRESENT are named as even and odd rounds, and the result of an even round is written into an odd-

30 5.3. The Proposed PRESENT Implementations on FPGA round address. Likewise, the result of the following odd round is written into the following even address. By doing so, existing memory space is used instead of registers. This scheme is shown in Figure 5.1.

key (k1) (k2) (k3) << K << K << K (k0)

ptxt (b1) (s1) (b2) (s2) (b3) (s3) ctxt S P S P S P (s0) (s4)

round-1 round-2 round-3 final

initial round-1 round-2 round-3 final (s0,k0) (s1,k1) (s2,k2) (s3,k3) (s4)

S0 s0,0 s0,1 s0,2 s0,3 s0,0 s0,1 s0,2 s0,3 s2,0 s2,1 s2,2 s2,3 s2,0 s2,1 s2,2 s2,3 s4,0 s4,1 s4,2 s4,3

S1 s0,4 s0,5 s0,6 s0,7 s0,4 s0,5 s0,6 s0,7 s2,4 s2,5 s2,6 s2,7 s2,4 s2,5 s2,6 s2,7 s4,4 s4,5 s4,6 s4,7

S2 s0,8 s0,9 s0,10 s0,11 s0,8 s0,9 s0,10 s0,11 s2,8 s2,9 s2,10 s2,11 s2,8 s2,9 s2,10 s2,11 s4,8 s4,9 s4,10 s4,11

S3 s0,12 s0,13 s0,14 s0,15 s0,12 s0,13 s0,14 s0,15 s2,12 s2,13 s2,14 s2,15 s2,12 s2,13 s2,14 s2,15 s4,12 s4,13 s4,14 s4,15

S4 s1,0 s1,1 s1,2 s1,3 s1,0 s1,1 s1,2 s1,3 s3,0 s3,1 s3,2 s3,3 s3,0 s3,1 s3,2 s3,3

S5 s1,4 s1,5 s1,6 s1,7 s1,4 s1,5 s1,6 s1,7 s3,4 s3,5 s3,6 s3,7 s3,4 s3,5 s3,6 s3,7

S6 s1,8 s1,9 s1,10 s1,11 s1,8 s1,9 s1,10 s1,11 s3,8 s3,9 s3,10 s3,11 s3,8 s3,9 s3,10 s3,11

S7 s1,12 s1,13 s1,14 s1,15 s1,12 s1,13 s1,14 s1,15 s3,12 s3,13 s3,14 s3,15 s3,12 s3,13 s3,14 s3,15

K0 k0,0 k0,1 k0,2 k0,3 k0,0 k0,1 k0,2 k0,3 k2,0 k2,1 k2,2 k2,3 k2,0 k2,1 k2,2 k2,3 ......

K7 k0,28 k0,29 k0,30 k0,31 k0,28 k0,29 k0,30 k0,31 k2,28 k2,29 k2,30 k2,31 k2,28 k2,29 k2,30 k2,31

K8 k1,0 k1,1 k1,2 k1,3 k1,0 k1,1 k1,2 k1,3 k3,0 k3,1 k3,2 k3,3 ......

K15 k1,28 k1,29 k1,30 k1,31 k1,28 k1,29 k1,30 k1,31 k3,28 k3,29 k3,30 k3,31

Figure 5.1: PRESENT data flow scheme in the RAM-based design for a 3-round example

5.3.1 State Processing While Sboxes can be applied to each 16-bit entry in parallel; due to the nature of permutation layer, which is shown in Figure 5.2, it is not possible to do the same thing for permutation. Therefore, a different approach is applied. This approach can be best explained by a toy version of PRESENT with only 3 rounds as shown in Figure 5.1. In the first round (Round-1), the first 16-bit of input data and key are read from address S0 and K0, then XORed and Sboxed in the following cycle. The result is sent to a temporary register to appear in the next cycle. Also, the first target address S4 is read in the same cycle. In the next cycle, permutation is performed by using the first bit of every nibble in temporary register (which is the XORed and Sboxed value) and stored to S4 via shifting it by 4 bits to left. At the same time, temporary register is shifted left by 1-bit, in order to get the correct permutation values for the next cycle. The same operation is performed for addresses S5, S6

31 Chapter 5. RAM-Based Ultra-Lightweight FPGA Implementation of PRESENT

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7 8 9 10 11 8 9 10 11 12 13 14 15 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7 8 9 10 11 8 9 10 11 12 13 14 15 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7 8 9 10 11 8 9 10 11 12 13 14 15 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 0 1 2 3 4 5 6 7 4 5 6 7 8 9 10 11 8 9 10 11 12 13 14 15 12 13 14 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Figure 5.2: PRESENT permutation scheme and S7. At the end of 6 cycles, first nibbles of S4, S5, S6, and S7 are ready. The procedure is repeated for all other input addresses S1, S2, and S3. The cycle count for each of these is also 6 cycles. In total, this processing of state takes 4 × 6 = 24 cycles for each round. This scheme can be seen in Figure 5.3. For the on-RAM Sbox version of the implementation, the only difference is the additional cycle for reading Sbox values from memory, which can be seen in Figure 5.4. Therefore, the total cycle count of state processing in each round is 28 cycles.

5.3.2 Key Schedule

At the end of every round, key scheduling of PRESENT is performed. Again, the first address (for first round, K0) is read to appear in the following cycle. Then, it is shifted by 3 bits and stored in the corresponding address in order to perform key scheduling. At the same time, least significant 3 bits of the read key data is stored in a 3-bit key temporary register for the shifting part of scheduling. The value kept in key temporary register is also written into the

32 5.3. The Proposed PRESENT Implementations on FPGA

PORT-A PORT-A PORT-B PORT-B PORT-B PH CYC TMP REGISTER OPERATION OUTPUT OPERATION OUTPUT INPUT 0 READ S0 READ K0

1 READ S4 s0,0 s0,1 s0,2 s0,3 k0,0 k0,1 k0,2 k0,3

2 READ S5 ...... b1,0 b1,1 b1,2 b1,3 WRITE S4 ...... s1,0 0 3 READ S6 ...... tmp << 1 WRITE S5 ...... s1,4

4 READ S7 ...... tmp << 1 WRITE S6 ...... s1,8

5 ...... tmp << 1 WRITE S7 ...... s1,12 0 READ S1 READ K1

1 READ S4 s0,4 s0,5 s0,6 s0,7 k0,4 k0,5 k0,6 k0,7

2 READ S5 ...... b1,4 b1,5 b1,6 b1,7 WRITE S4 ...... s1,0 s1,1 1 3 READ S6 ...... tmp << 1 WRITE S5 ...... s1,4 s1,5

4 READ S7 ...... tmp << 1 WRITE S6 ...... s1,8 s1,9

5 ...... tmp << 1 WRITE S7 ...... s1,12 s1,13 0 READ S2 READ K2

1 READ S4 s0,8 s0,9 s0,10 s0,11 k0,8 k0,9 k0,10 k0,11

2 READ S5 ...... b1,8 b1,9 b1,10 b1,11 WRITE S4 . . . s1,0 s1,1 s1,2 2 3 READ S6 ...... tmp << 1 WRITE S5 . . . s1,4 s1,5 s1,6

4 READ S7 ...... tmp << 1 WRITE S6 . . . s1,8 s1,9 s1,10

5 ...... tmp << 1 WRITE S7 . . . s1,12 s1,13 s1,14 0 READ S3 READ K3

1 READ S4 s0,12 s0,13 s0,14 s0,15 k0,12 k0,13 k0,14 k0,15

2 READ S5 ...... b1,12 b1,13 b1,14 b1,15 WRITE S4 s1,0 s1,1 s1,2 s1,3 3 3 READ S6 ...... tmp << 1 WRITE S5 s1,4 s1,5 s1,6 s1,7

4 READ S7 ...... tmp << 1 WRITE S6 s1,8 s1,9 s1,10 s1,11

5 ...... tmp << 1 WRITE S7 s1,12 s1,13 s1,14 s1,15

Figure 5.3: PRESENT cycle count for the state processing phases of each round (on-slice Sbox version) corresponding address with the read and 3-bit shifted key data at the same cycle. This process is performed for all 8 addresses, which results in 10 cycles as seen in Figure 5.5. For the on-RAM Sbox version, again the Sbox value read part is added to this flow, which results in an increase of 2 cycles as shown in Figure 5.6.

5.3.3 Final Round

At the last round, Round-32, only XOR operation is performed and the result is written-back to the original input addresses. To perform this, an XOR and write cycle comes after every read data and key read. This results in 8 cycles for both approaches, as there is no Sbox operation in this phase. This scheme is shown in Figure 5.7.

33 Chapter 5. RAM-Based Ultra-Lightweight FPGA Implementation of PRESENT

PORT-A PORT-A PORT-B PORT-B PORT-B PH CYC TMP REGISTER OPERATION OUTPUT OPERATION OUTPUT INPUT 0 READ S0 READ K0

1 READ x0,0x0,1 s0,0 s0,1 s0,2 s0,3 READ x0,2x0,3 k0,0 k0,1 k0,2 k0,3

2 READ S4 b1,0 b1,1 b1,0 b1,1 b1,2 b1,3 b1,2 b1,3

0 3 READ S5 ...... b1,0 b1,1 b1,2 b1,3 WRITE S4 ...... s1,0

4 READ S6 ...... tmp << 1 WRITE S5 ...... s1,4

5 READ S7 ...... tmp << 1 WRITE S6 ...... s1,8

6 ...... tmp << 1 WRITE S7 ...... s1,12

Figure 5.4: PRESENT cycle count for the first state processing phase of each round (on-RAM Sbox version)

PORT-A PORT-A KEY TMP PORT-B PORT-B PH CYC OPERATION OUTPUT REGISTER OPERATION INPUT 0 READ K0

1 READ K1 k0,0-3 WRITE K12 k0,0-3 >> 3

2 READ K2 k0,4-7 k0,0-3 << 13 WRITE K13 k0,0-3 << 13 | k0,4-7 >> 3 = k1,20-23

3 READ K3 k0,8-11 k0,4-7 << 13 WRITE K14 k0,4-7 << 13 | k0,8-11 >> 3 = k1,24-27

4 READ K4 k0,12-15 k0,8-11 << 13 WRITE K15 k0,8-11 << 13 | k0,12-15 >> 3 = k1,28-31 4 5 READ K5 k0,16-19 k0,12-15 << 13 WRITE K8 k0,12-15 << 13 | k0,16-19 >> 3 = k1,0-311

6 READ K6 k0,20-23 k0,16-19 << 13 WRITE K9 k0,16-19 << 13 | k0,20-23 >> 3 = k1,4-711

7 READ K7 k0,24-27 k0,20-23 << 13 WRITE K10 k0,20-23 << 13 | k0,24-27 >> 3 = k1,8-111

8 READ K0 k0,28-31 k0,24-27 << 13 WRITE K11 k0,24-27 << 13 | k0,28-31 >> 3 = k1,12-15

9 k0,0-3 k0,28-31 << 13 WRITE K12 k0,28-31 << 13 | k0,0-3 >> 3 = k1,16-19

Figure 5.5: PRESENT cycle count for the key expansion phase of each round (on-slice Sbox version)

5.3.4 Overall Design

According to the data flow explained above, the hardware circuit of the design is defined. As can be seen from Figure 5.8, the regular on-slice Sbox implementation has rather simple logic; however, it uses 4 Sboxes that increase the slice count. In Figure 5.9, the block diagram of on-RAM Sbox design is shown. It is easy to see that the control logic of this implementation is more complex, compared to the first approach. Therefore, even though it has no on-slice Sboxes,

34 5.3. The Proposed PRESENT Implementations on FPGA

PORT-A PORT-A KEY TMP PORT-B PORT-B PH CYC OPERATION OUTPUT REGISTER OPERATION INPUT 0 READ K0

1 READ K1 k0,0-3 WRITE K12 k0,0-3 >> 3

2 READ K2 k0,4-7 k0,0-3 << 13 WRITE K13 k0,0-3 << 13 | k0,4-7 >> 3 = k1,20-23

3 READ K3 k0,8-11 k0,4-7 << 13 WRITE K14 k0,4-7 << 13 | k0,8-11 >> 3 = k1,24-27

4 READ K4 k0,12-15 k0,8-11 << 13 WRITE K15 k0,8-11 << 13 | k0,12-15 >> 3 = k1,28-31

5 READ K5 k0,16-19 k0,12-15 << 13 WRITE K8 k0,12-15 << 13 | k0,16-19 >> 3 = k1,0-311 4 6 READ K6 k0,20-23 k0,16-19 << 13 WRITE K9 k0,16-19 << 13 | k0,20-23 >> 3 = k1,4-711

7 READ K7 k0,24-27 k0,20-23 << 13 WRITE K10 k0,20-23 << 13 | k0,24-27 >> 3 = k1,8-111

8 READ K0 k0,28-31 k0,24-27 << 13 WRITE K11 k0,24-27 << 13 | k0,28-31 >> 3 = k1,12-15

9 READ K8 k0,0-3 k0,28-31 << 13 WRITE K12 k0,28-31 << 13 | k0,0-3 >> 3 = k1,16-19

10 READ k1,0-1 k1,0-3

11 WRITE K8 k1,0-1 k1,0-3

Figure 5.6: PRESENT cycle count for the key expansion phase of each round (on-RAM Sbox version)

PORT-A PORT-A PORT-A PORT-B PORT-B PH CYC OPERATION OUTPUT INPUT OPERATION OUTPUT 0 READ S4 READ K8

1 WRITE S0 s3,0 s3,1 s3,2 s3,3 s4,0 s4,1 s4,2 s4,3 k3,0 k3,1 k3,2 k3,3 2 READ S5 READ K9

3 WRITE S1 s3,4 s3,5 s3,6 s3,7 s4,4 s4,5 s4,6 s4,7 k3,4 k3,5 k3,6 k3,7 0 4 READ S6 READ K10

5 WRITE S2 s3,8 s3,9 s3,10 s3,11 s4,8 s4,9 s4,10 s4,11 k3,8 k3,9 k3,10 k3,11 6 READ S7 READ K11

7 WRITE S3 s3,12 s3,13 s3,14 s3,15 s4,12 s4,13 s4,14 s4,15 k3,12 k3,13 k3,14 k3,15

Figure 5.7: PRESENT cycle count for the final round (same in both on-slice and on-RAM Sbox versions)

it ended up with a higher slice count than the first approach, which is not in accordance with our initial guess.

35 Chapter 5. RAM-Based Ultra-Lightweight FPGA Implementation of PRESENT

Out-A

0:15 10:14

Out-A S 10:14 0:9,15

Inp-A << 13 >> 3 DPRAM Out-B TMP << 1 KTMP 0:2 3:15 KEXP Inp-B 0:7 PERM S

0:7 8:15 0:15

Key

Figure 5.8: PRESENT block diagram (key expansion module shown on the right) for on-slice Sbox version

cadr-A Out-A

Adr-A 0:15 10:14 Out-A

Inp-A 10:14 0:9,15 DPRAM Adr-B cadr-B << 13 >> 3

Out-B KTMP 0:2 3:15 KEXP TMP << 1 0:15 Inp-B PERM Key

Figure 5.9: PRESENT block diagram (key expansion module shown on the right) for on-RAM Sbox version

36 5.4. Results and Discussion

5.4 Results and Discussion

Table 5.1 shows the performance comparison of both designs to prior work. As shown in the table, our designs achieve the lowest slice counts with the addition of a block RAM. On the cycle count, our designs are slower, but still offer respectable throughput for lightweight applications. Our figure of merit (throughput per slice) can reach one third of the previous most compact work.

Table 5.1: Performance comparison

Key # of slices Cycle Max Tput Tput/slice Design size & count freq. @100KHz @max.freq. RAMs (MHz) (kbps) (Mbps/slice) Our work 128 83 / 1 1062 129.77 6.03 0.094 (on-slice Sbox) Our work 128 85 / 1 1248 135.15 5.13 0.082 (on-RAM Sbox) [YK09] 128 117 / 0 256 114 24.96 0.24 [SSPP09] 128 202 / 0 32 254 200 2.51

5.5 Conclusion and Future Work

As the cost of FPGAs get lower together with their power consumption, they become more and more attractive alternatives for lightweight security applications that are realized on sensor nodes, RFIDs, etc. It is crucial to implement the lightweight cryptography on such devices in a way to ensure the maximum possible amount of resources left for other circuitry that are sharing the device. Utilization of the RAMs on FPGAs allows this at least to a certain limit. In this work, we presented two different RAM-based implementations of the lightweight cipher algorithm PRESENT on FPGAs. With their lowest reported slice counts, both implementations are suitable for lightweight applications. While the first implementation is fully-implemented on slices, the second one integrates the Sboxes into the block RAM as well. Our on-slice Sbox and on-RAM Sbox designs are good candidates for the application of the shared Sbox technique in [PMK+11] and the block memory content scrambling technique given in [GM11] for SCA-resistance. As the next step of this study, these SCA countermeasures can be applied to our designs and their effects on the area as well as their resistance to SCA attacks can be observed.

37

Chapter 6 On the Suitability of SHA-3 Finalists for Lightweight Applications

In this chapter, we investigate the suitability of SHA-3 finalists for lightweight ap- plications. For each finalist, we try to achieve the lowest reported gate count while maintaining a respectable throughput. We mainly favor a word-serial approach in our designs to achieve low gate count, where the word size varies from 32 to 64 bits depending on the structure of the hash function and the trade-off between the through- put and area. All hash function cores are realized in Verilog-HDL, synthesized using 90 nm UMC CMOS standard-cell library and optimized for area for prototyping. A generic First In First Out (FIFO)-based Input/Output (IO) interface is also built in order to establish data transfer between an external controller and the active hash function core.

Contents of this Chapter 6.1 Introduction ...... 39 6.2 Implementations of the Finalist Hash Functions ...... 40 6.3 Interface ...... 59 6.4 Results and Discussion ...... 60 6.5 Conclusion ...... 61

6.1 Introduction

NIST announced a public competition in November 2007 in order to develop a new crypto- graphic hash algorithm. The winning algorithm was planned to be named as “SHA-3” and the hash algorithms specified in Federal Information Processing Standard (FIPS) 180-3, Secure Hash Standards [NIS08], were planned to be augmented. There were numerous submissions to the SHA-3 competition. It proceeded round by round, and in the last round there were only 5 finalists left; BLAKE [AHMP10], Grøstl, JH [Wu11], Keccak [BDPA11] and Skein [FLS+10]. In October 2012, during the course of this thesis work, the winner of the competition was announced as Keccak. The SHA-3 competition had been a very attractive process, there were many studies and discussions on the submitted algorithms. Implementation of the algorithms was an important part of these investigations. Several software and hardware implementations dealt with effec- tive and high performance realization of the candidates on a wide range of platforms from

39 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications embedded processors to custom ASICs. Out of these, [KKI+11, TFI+09, GSH+12] compare (sometimes compact) hardware implementations of different SHA-3 candidates, including the finalists, on ASIC. Other works provided comparative analyses of “compact and lightweight” implementations of the finalists on FPGA platform [KDVC+11, KYS+11, JA11]. However, the suitability of the SHA-3 candidates for lightweight applications was not comprehensively studied. In this work, we investigated the suitability of the SHA-3 finalists for lightweight applications and tried to come up with suggestions for possible lightweight extensions. The main target of this study is to present efficient compact implementations of SHA-3 finalists offering the lowest possible gate count (and therefore the lowest power consumption), whereas the resultant throughput is still within the limits desirable for lightweight applications. One approach to achieve this target is to replace registers by RAM(s) and implement minimal combinational circuitry necessary for the realization of computational operations. Another approach is to keep the registers, but perform computational operations serially, thereby saving logic and interconnection area. We opted for the latter option, mainly because of the non- standard block memory interfaces and performances offered by different process technologies. We also believe that the structures we propose for each hash function can be easily modified and used within a hybrid approach. In our study, we chose the 256-bit message digest option for all finalists. Our designs are both suitable for ASIC and FPGA platforms. However, we have synthesized our implementations in 90 nm UMC CMOS technology. Area-optimized synthesis results show that Grøstl offers the lowest gate count, while BLAKE offers the best throughput and throughput/area numbers. We have also compared the finalists with each other in order to observe the overall performance.

6.2 Implementations of the Finalist Hash Functions

Before giving the details of each implementation, we describe each hash function briefly in order to clarify our architectures.

6.2.1 BLAKE

Algorithm

BLAKE is a family of four hash functions: BLAKE-224, BLAKE-256, BLAKE-384, and BLAKE-512, which follows the HAIFA iteration mode [BD07]. The compression function de- pends on a and the number of bits hashed so far (as counter): A large inner state is initialized from the initial value, the salt and the counter; and it is injectively updated by message-dependent rounds until it is finally compressed to return the next chain value, as shown in Figure 6.1. The inner state of the compression function is represented as a 4 × 4 matrix of words. In one round of BLAKE-256, all four columns and then all four disjoint diagonals are updated independently. In the update of each column or diagonal, two message words are input according to a round-dependent permutation as shown in Figure 6.2. Table 6.1 shows the specification of BLAKE for 256-bit message digest.

40 6.2. Implementations of the Finalist Hash Functions

chain value initialization rounds finalization next chain value

salt counter message chain value salt

Figure 6.1: BLAKE compression function

G G 0 2 G5 csr(2i+1) csr(2i) V V V2 3 V V G4 V0 1 0 V1 2 V3 msr(2i) msr(2i+1) V V6 V V V V5 5 4 V5 5 V6 4 a A V V11 V8 V V10 V V V9 10 9 11 8 >>> >>> b 12 7 B V V V V15 12 V13 14 V15 V V13 14 12 c C

G7 d >>> >>> D G1 G3 G6 16 8

Figure 6.2: One round of BLAKE and the underlying Gi function

Table 6.1: BLAKE specifications

Algorithm Word Message Block Salt Rounds Digest BLAKE-256 32-bit < 264-bit 512-bit 128-bit 14 256-bit

Implementation

The serialized architecture for BLAKE is given in Figure 6.3. The first operation is the ini- tialization, where the data is written into the state registers as 32-bit words in 16 clock cycles. The salt, hash and message registers, which are also shown in Figure 6.3, store the salt, the hash and the message, respectively. The state words are then processed by the half Gi function block shown in Figure 6.4, together with the corresponding values from the other registers, and written back onto the state register. The Gi function module operates on each column for G0−3, and then four disjoint diagonals for G4−7 twice because of its “half” structure. This structure, while reducing the area, doubles the cycle count. As shown in Figure 6.5, G0−3 is processed at first, in halves (namely H1 and H2); followed by the processing of G4−7, again in halves. The multiplexers are switched in order to make sure that the sequence of the serially-processed words gives the same result as a parallel implementation. This process is repeated for 14 rounds, and a new message block is injected after the 14th round

41 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications

state register

a A

b B G

c C

d D

CNST ROM IV ROM

salt register hash register

HASH OUTPUT

MESSAGE / SALT INPUT message register

Figure 6.3: BLAKE serial architecture

cnst

msg

a A >>> 12 b >>> B 7 c C >>> 16 d >>> D 8

Figure 6.4: Gi half function

42 6.2. Implementations of the Finalist Hash Functions

4 4 i i I l t s e a o o o o 1 1 u u g u u n n n n n n d d d d F R R R R N w M 1 1 2 2 2 2 2 2 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 1 1 1 1 1 1 3 3 3 3 3 3 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 1 1 1 1 1 s e 0 4 0 4 0 4 0 4 0 4 0 4 2 6 2 6 2 6 2 6 u 2 6 2 6 1 1 1 1 1 1 1 1 1 1 1 1 l a 1 5 1 5 1 5 1 5 1 5 1 5 v 3 7 3 7 3 7 3 7 3 7 3 7

1 1 1 1 1 1 1 1 1 1 1 1 n i

a 7 2 7 2 7 2 7 2 G G G G H H H H h c

1 1 5 1 5 5 1 1 5 1 5 5 t 3 7 3 7 3 7 3 7 3 7 3 7 1 1 1 1 1 1 1 1 1 1 1 1 x e 2 2 2 2 2 2 0 4 8 0 0 4 8 0 4 8 4 8 0 4 8 0 4 8 1 1 1 1 1 1 N

: 3 3 3 3 3 3 1 5 9 1 1 5 9 1 5 9 5 9 1 5 9 1 5 9 1 1 1 1 1 1 5 0 4 0 4 0 4 0 4 0 4 0 4 1 2 6 2 6 2 6 2 6 2 6 2 6 1 1 1 1 1 1 1 1 1 1 1 1 .

.

6 2 6 2 6 2 6 2 G G G G H H H H . 0 0 4 0 4 4 0 0 4 0 4 4 2 6 2 6 2 6 2 6 2 6 2 6 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 5 5 1 1 5 1 5 5 3 7 3 7 3 7 3 7 3 7 3 7 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 0 4 8 0 0 4 8 0 4 8 4 8 0 4 8 0 4 8 1 1 1 1 1 1 s 3 3 3 3 3 3 t 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 1 1 1 1 1 u p

t 5 2 5 2 5 2 5 2 G G G G H H H H u o

3 3 3 3 3 3 2 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 1 1 1 1 1 H / 0 0 4 0 4 4 0 0 4 0 4 4 2 6 2 6 2 6 2 6 2 6 2 6 7 1 1 1 1 1 1 1 1 1 1 1 1 G

1 1 5 1 5 5 1 1 5 1 5 5 3 7 3 7 3 7 3 7 3 7 3 7 – 1 1 1 1 1 1 1 1 1 1 1 1

2 2 2 2 2 2 2 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 H 1 1 1 1 1 1 /

4 4 2 4 2 4 2 4 2 G G G G H H H H G

: 2 2 2 2 2 2 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 1 1 1 1 1 1 5 1 3 3 3 3 3 3 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 1 1 1 1 1 .

. 0 0 4 0 4 4 0 0 4 0 4 4

2 6 2 6 2 6 2 6 2 6 2 6 . 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 5 5 1 1 5 1 5 5 3 7 3 7 3 7 3 7 3 7 3 7 0 1 1 1 1 1 1 1 1 1 1 1 1

7 1 7 1 7 1 7 1 G G G G H H H H 1 5 1 5 1 5 1 5 1 5 1 5 3 7 3 7 3 7 3 7 3 7 3 7 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 1 1 1 1 1 1 s 3 3 3 3 3 3 t 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 1 1 1 1 1 u p 0 0 4 0 4 4 0 0 4 0 4 4 t 2 6 2 6 2 6 2 6 2 6 2 6 1 1 1 1 1 1 1 1 1 1 1 1 u o

6 1 6 1 6 1 6 1 G G G G H H H H 1 H / 0 4 0 4 0 4 0 4 0 4 0 4 2 6 2 6 2 6 2 6 2 6 2 6 4 1 1 1 1 1 1 1 1 1 1 1 1 G

1 5 1 5 1 5 1 5 1 5 1 5 3 7 3 7 3 7 3 7 3 7 3 7 – 1 1 1 1 1 1 1 1 1 1 1 1

1 2 2 2 2 2 2 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 H 1 1 1 1 1 1 / 4 3 3 3 3 3 3 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 1 1 1 1 1 G

:

5 1 5 1 5 1 5 1 G G G G H H H H 5 1 3 3 3 3 3 3 1 5 1 5 9 1 5 9 9 1 5 1 5 9 1 5 9 9 1 1 1 1 1 1 .

. 0 4 0 4 0 4 0 4 0 4 0 4

2 6 2 6 2 6 2 6 2 6 2 6 . 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 5 1 5 1 5 1 5 1 5 3 7 3 7 3 7 3 7 3 7 3 7 0 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 0 4 8 1 1 1 1 1 1

4 1 4 1 4 1 4 1 G G G G H H H H 2 2 2 2 2 2 0 4 0 4 8 0 4 8 8 0 4 0 4 8 0 4 8 8 1 1 1 1 1 1 s 3 3 3 3 3 3 t 1 5 1 5 9 1 5 9 9 1 5 1 5 9 1 5 9 9 1 1 1 1 1 1 u p 0 4 0 4 0 4 0 4 0 4 0 4 t 2 6 2 6 2 6 2 6 2 6 2 6 1 1 1 1 1 1 1 1 1 1 1 1 u o 1 5 1 5 1 5 1 5 1 5 1 5

3 7 3 7 3 7 3 7 3 7 3 7 1 1 1 1 1 1 1 1 1 1 1 1 2 H

/ 3 2 3 2 3 2 3 2 G G G G H H H H 3 G

1 5 1 5 1 5 1 5 1 5 1 5 3 3 7 3 7 7 3 3 7 3 7 7 – 1 1 1 1 1 1 1 1 1 1 1 1

2 2 2 2 2 2 2 0 4 0 4 8 0 4 8 8 0 4 0 4 8 0 4 8 8 H 1 1 1 1 1 1 / 0 3 3 3 3 3 3 1 5 1 5 9 1 5 9 9 1 5 1 5 9 1 5 9 9 1 1 1 1 1 1 G

: 0 4 0 4 0 4 0 4 0 4 0 4 2 6 2 6 2 6 2 6 2 6 2 6 1 1 1 1 1 1 1 1 1 1 1 1 5

1 2 2 2 2 2 2 2 2 G G G G H H H H .

. 0 4 0 4 0 4 0 4 0 4 0 4

2 2 6 2 6 6 2 2 6 2 6 6 . 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 5 1 5 1 5 1 5 1 5 3 3 7 3 7 7 3 3 7 3 7 7 0 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 0 4 0 4 8 0 4 8 8 0 4 0 4 8 0 4 8 8 1 1 1 1 1 1 3 3 3 3 3 3 1 5 1 5 9 1 5 9 9 1 5 1 5 9 1 5 9 9 1 1 1 1 1 1

1 2 1 2 1 2 1 2 G G G G H H H H s 3 3 3 3 3 3 t 1 1 5 9 1 5 9 5 9 1 1 5 9 1 5 9 5 9 1 1 1 1 1 1 u p 0 4 0 4 0 4 0 4 0 4 0 4 t 2 2 6 2 6 6 2 2 6 2 6 6 1 1 1 1 1 1 1 1 1 1 1 1 u o 1 5 1 5 1 5 1 5 1 5 1 5

3 3 7 3 7 7 3 3 7 3 7 7 1 1 1 1 1 1 1 1 1 1 1 1 1 H 2 2 2 2 2 2 / 0 4 0 4 8 0 4 8 8 0 4 0 4 8 0 4 8 8 1 1 1 1 1 1 3

G

0 2 0 2 0 2 0 2 G G G G H H H H –

1 2 2 2 2 2 2 0 0 4 8 0 4 8 4 8 0 0 4 8 0 4 8 4 8 H 1 1 1 1 1 1 / 0 3 3 3 3 3 3 1 1 5 9 1 5 9 5 9 1 1 5 9 1 5 9 5 9 1 1 1 1 1 1 G

: 0 4 0 4 0 4 0 4 0 4 0 4 2 2 6 2 6 6 2 2 6 2 6 6 1 1 1 1 1 1 1 1 1 1 1 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 3 3 7 3 7 7 3 3 7 3 7 7 1 1 1 1 1 1 1 1 1 1 1 1 .

.

3 1 3 1 3 1 3 1 G G G G H H H H . 1 5 1 5 1 5 1 5 1 5 1 5 3 7 3 7 3 7 3 7 3 7 3 7 0 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 0 0 4 8 0 4 8 4 8 0 0 4 8 0 4 8 4 8 1 1 1 1 1 1 3 3 3 3 3 3 1 1 5 9 1 5 9 5 9 1 1 5 9 1 5 9 5 9 1 1 1 1 1 1 0 4 0 4 0 4 0 4 0 4 0 4 s 2 2 6 2 6 6 2 2 6 2 6 6 1 1 1 1 1 1 1 1 1 1 1 1 e

u l 2 1 2 1 2 1 2 1 G G G G H H H H a v

0 4 0 4 0 4 0 4 0 4 0 4 n 2 6 2 6 2 6 2 6 2 6 2 6 i 1 1 1 1 1 1 1 1 1 1 1 1 a 1 5 1 5 1 5 1 5 1 5 1 5 h 3 7 3 7 3 7 3 7 3 7 3 7 1 1 1 1 1 1 1 1 1 1 1 1 c

2 2 2 2 2 2 s 0 0 4 8 0 4 8 4 8 0 0 4 8 0 4 8 4 8 1 1 1 1 1 1 u o i 3 3 3 3 3 3 1 1 5 9 1 5 9 5 9 1 1 5 9 1 5 9 5 9 v 1 1 1 1 1 1 e

r 1 1 1 1 1 1 1 1 G G G G H H H H P

: 3 3 3 3 3 3 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 5 9 1 1 1 1 1 1 5 1 0 4 0 4 0 4 0 4 0 4 0 4 2 6 2 6 2 6 2 6 2 6 2 6 1 1 1 1 1 1 1 1 1 1 1 1 .

. 1 5 1 5 1 5 1 5 1 5 1 5

3 7 3 7 3 7 3 7 3 7 3 7 . 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 0 0 4 8 0 4 8 4 8 0 0 4 8 0 4 8 4 8 0 1 1 1 1 1 1

0 1 0 1 0 1 0 1 G G G G H H H H

Figure 6.5: BLAKE serial data flow

43 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications

clock

start pulse

pop

push

phase 0 1 2 3 4 3 4

round 0 ... 13 0 ... 13

cycle 0 ... 4 ... 7 0 ... 3 0 ... 15 0 ... 15 ... 0 ... 15 0 ... 15 0 ... 15 ... 0 ... 15 0 1 ... 8 9 ... 15

last

active

ready

Figure 6.6: BLAKE timing diagram

(if it exists). Injection of the message blocks continues until the last block. The finalization process returns the next chain value (or message digest, if it is the last message block). The whole process is explained in phase-round-cycle concept in Figure 6.6. In phase-0, the salt is read in 8 cycles. In the following 4 cycles, the length of the message block is read, which is phase-1. Following the length, the first message block is read in phase-2 in 16 cycles. In phase-3, the data processing is performed for 14 rounds (each round in 16 cycles). The next message block is read in phase-4. However, after the last message block, the message digest is written back in the first 8 cycles of phase-4.

6.2.2 Grøstl Algorithm Grøstl is a collection of hash functions, which can return message digests from 8 to 512 bits in 8-bit steps. The variant returning n bits is called Grøstl-n. Hashing starts by padding the input message M and splitting it into l-bit messages m1, . . . , mt. Each message block is then processed sequentially by the iterative compression function f, whose other input is the l-bit chaining input with an initial value of h0 = iv, as shown in Figure 6.7. For Grøstl variants with n up to 256 (which covers our case), l is defined to be 512. After the processing of the last message block, the output H(M) of the hash function is computed as H(M) = Ω(ht). Here Ω is the output transformation, whose output size is n bits, where n ≤ 2l.

m1 m2 m3 mt

IV ƒ ƒ ƒ ƒ Ω H(m) l l n

Figure 6.7: Grøstl compression function

The compression function f is based on two l-bit permutations P and Q, which is defined as f(h, m) = P (h ⊕ m) ⊕ Q(m) ⊕ h; and the output function is defined by Ω(x) = truncn(P (x) ⊕

44 6.2. Implementations of the Finalist Hash Functions

x),where truncn(x) discards all but the trailing n bits of x. Both functions are illustrated in Figure 6.8. Figure 6.9 shows details of P and Q permutations.

mi Q

hi-1 P hi ht P H(m)

Figure 6.8: Grøstl construction function f (left) and output function Ω (right)

SubBytes ShiftBytes for P P/Q Composition S(.)

AddRoundConstant MixBytes SubBytes ShiftBytes circ(02,02,03,04,05,03,05,07) ShiftBytes for Q MixBytes

Figure 6.9: P and Q permutations

Table 6.2 shows the specification of Grøstl for 256-bit message digest.

Table 6.2: Grøstl specifications

Algorithm Word Message Block Rounds Digest Grøstl-256 32-bit < (273 − 577)-bit 512-bit 10 256-bit

45 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications

Implementation

The serialized architecture for Grøstl is shown in Figure 6.10. There exists only a single block for both P and Q operations in order to save area, which also allows us to use the same block for both f and Ω functions. For the f function, message and previous hash result (which is iv at the first round) are selected as input. For the output function Ω, the only input comes from the hash register and zero is selected instead of the message. While the message is processed inside the P /Q module in P mode, it is also stored inside the temp register. In the Q mode, the result of P is stored inside the temp register while the message is restored. It is then processed in Q mode, and its result is combined with the P result (restored from the temp register) and the previous hash value. The detailed block diagram of P /Q module is shown in Figure 6.11. It basically implements a modified version of the serial AES-like data flow in [HAHH06] via SubBytes, ShiftBytes and MixBytes functions. The data flow for a 4 × 4 toy version of ShiftBytes and MixBytes are given in Figure 6.12, note that ShiftBytes operation is different for P and Q.

MESSAGE 0 P/Q Block

temp register

IV HASH

hash register

Figure 6.10: Grøstl serial architecture

The whole process is explained in phase-half round-round-cycle concept in Figure 6.13. In phase-0, the length is read in 10 cycles. Phase-1 is for reading the initialization vector iv. Following this, the message blocks are read and processed. Finally, in phase-3, the message digest is written back.

6.2.3 JH Algorithm

JH is a family of four hash algorithms — JH-224, JH-256, JH-384, and JH-512. In the design of JH, a compression function is constructed from a large block cipher with constant key. The generalized d-dimensional AES design methodology is applied in the design of the large cipher. In our case of 256-bit digest, d is set to 8, hence the compression function is named as F8. It sequentially processes the padded and split message blocks m1, . . . , mt, starting with an initial vector (iv), as shown in Figure 6.14. F8 is bijective due to the block cipher, whose block size is 2m bits. Its structure is shown in Figure 6.15, together with the internal function E8. The 2m-bit hash value H(i−1) and the m-bit

46 6.2. Implementations of the Finalist Hash Functions

SubByte

in S ×8 ×8 ×8 ×8

ShiftBytes

7 5 2 2 : 1 byte 0 0 0 0 MixBytes

out 0

Figure 6.11: Details of P /Q block

0 4 8 12 0 4 8 12 0 4 8 12 4 8 12 0 ShiftBytes 1 5 9 13 5 9 13 1 ShiftBytes 1 5 9 13 13 1 5 9 for P 2 6 10 14 10 14 2 6 for Q 2 6 10 14 2 6 10 14 3 7 11 15 15 3 7 11 3 7 11 15 11 15 3 7

12 11 10 9 8 7 6 5 4 3 2 1 0 0 12 11 10 9 8 7 6 5 4 3 2 1 0 4 MixBytes 13 12 11 10 9 8 7 6 5 4 3 2 1 5 13 12 11 10 9 8 7 6 5 0 3 2 1 13 14 13 12 11 10 9 8 7 6 1 4 3 2 10 14 1 12 11 10 9 8 7 6 5 0 3 2 2 circ(02,02,03,04) 15 14 13 12 11 2 9 8 7 6 1 4 3 15 15 14 1 12 11 10 9 8 7 6 5 0 3 11 3 14 13 12 11 2 9 8 7 6 1 4 4 15 14 1 12 3 10 9 8 7 6 5 0 8 3 14 13 12 11 2 9 8 7 6 1 9 15 14 1 12 3 10 9 0 7 6 5 1 3 14 13 12 11 2 1 8 7 6 14 15 14 5 12 3 10 9 0 7 6 6 0 4 8 12 0 4 8 12 3 6 13 12 11 2 1 8 7 3 15 14 5 12 3 10 9 0 7 15 1 5 9 13 1 5 9 13 7 6 13 12 11 2 1 8 8 7 14 5 12 3 10 9 0 12 2 6 10 14 2 6 10 14 7 6 13 12 11 2 1 13 7 14 5 0 3 10 9 5 3 7 11 15 3 7 11 15 7 6 1 12 11 2 2 7 14 9 0 3 10 10 7 6 1 12 11 7 7 14 9 0 3 3 11 6 1 12 12 7 14 9 0 0 11 6 1 1 7 14 9 9 : output bytes 11 6 6 : recycled bytes 7 14 14 11 11 7 7

4 02× 4 02× 4 03× 4 04× 4

5 04× 4 + 02× 5 02× 4 + 02× 5 02× 4 + 03× 5 03× 4 + 04× 5

6 03× 4 + 04× 5 + 02× 6 04× 4 + 02× 5 + 02× 6 02× 4 + 02× 5 + 03× 6 02× 4 + 03× 5 + 04× 6

7 02× 4 + 03× 5 + 04× 6 + 02× 7 03× 4 + 04× 5 + 02× 6 + 02× 7 04× 4 + 02× 5 + 02× 6 + 03× 7 02× 4 + 02× 5 + 03× 6 + 04× 7

Figure 6.12: Data flow for 4 × 4 toy version

message block M(i) are compressed into the 2m-bit H(i). E8 is also bijective and applies SPN and Maximum Distance Separable (MDS) constructions to the bit array. MDS is applied before

47 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications 3 6 . . 9 . 0 ...... 3 6 . . 0 . 0 3 6 3 6 . . 9 . . . 9 . 0 0 ...... 3 6 3 6 . . 1 . . . 0 . 0 0 3 6 3 6 . . . . . 9 . 0 6 0 ...... 6 5 3 6 . . . . . 1 . 2 5 0 . . . 1 0 3 6 8 4 . . . 3 . 1 . 6 . 0 6 . . 4 . 4 . . . 1 . . 6 . 6 5 . . 0 . 4 . . . 7 . . 5 . 2 5 . . 6 . 3 . . . 3 . . 5 . 8 4 . . 2 . 0 3 . 2 . . 9 . 0 . 4 . 4 4 . . 8 . 2 . . . 5 . . 4 . 0 4 . . 4 . 2 . 3 . . 1 . . 4 . 6 3 . . 0 . 2 . . . 7 . . 3 . 2 0 3 . . 6 . 1 . . . . 0 . . 8 2 3 2 6 1 ...... 9 . . . 4 2 0 8 ...... 0 2 3 6 4 ...... 0 . . . 6 1 0 0 . . . 3 6 9 2 1 . . 9 . 8 . . . . 0 . . 0 8 ...... 4 2 . . . 3 . . 6 . 4 . . 0 . 0 . . . 0 0 e y e y t t t e h d e d e e h d e d p e e e h d k e d p y v e k p k v d d s l i s l s l i s s s s s v s n n s s s o c n n o c n n d o c t a t a i l c a l c a l c a a u a u a u t u o u u p l u o c a u p l u o p l c e e y u y u y l u l l h p h c r o p o h o p o a e r o o a c c c p c p c p c r r r r r r r p a

p

p

t t t f f f r l r l r l a a a a a a t t t h h h s s s

Figure 6.13: Grøstl timing diagram

48 6.2. Implementations of the Finalist Hash Functions

m1 m2 m3 mt

IV F8 F8 F8 F8 H(m)

Figure 6.14: JH compression function

the first and after the last rounds. The round function R8 consists of an Sbox layer (selected via round constants), a linear transformation layer (applied on bytes), and a permutation layer P8 (composed of three permutations), whose details can be seen in Figure 6.16. R8 is repeated 42 times.

H(i-1) group M(i)

R8 S L E8 P 42×

ungroup M(i)

H(i)

Figure 6.15: Structure of F8 compression function (left) and E8 function (right)

Table 6.3 shows the specification of JH for 256-bit message digest.

Table 6.3: JH specifications

Algorithm Word Message Block Rounds Digest JH-256 32-bit < 264-bit 512-bit 42 256-bit

Implementation The serialized architecture for JH is given in Figure 6.17. A 32-bit datapath is used in the seri- alized implementation of JH. The state register is filled with the sum (XOR) of the initialization vector and the message block at the beginning of the process, while the message is also backed up in the message register for post-processing. Upon completion of the rounds, the output of

49 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications

C0 1 0 1 0 0 1 0 0 A0

C1 1 1 0 1 0 0 1 0 A1

C2 1 1 1 0 1 0 0 1 A2 S S C 0 1 0 1 1 0 0 0 A 0 1 3 = × 3 D0 0 1 0 0 1 0 0 0 B0 RC D1 0 0 1 0 0 1 0 0 B1

D2 1 0 0 1 0 0 1 0 B2

D3 1 0 0 0 0 0 0 1 B3

Pd P4

πd π4

P’d P’4

f d f 4

Figure 6.16: Three layers of round function

the E8 block is combined with the backed up message to form the next value of the state register (hash), which in turn is summed with the next message block. This process continues until all the message blocks are processed.

GROUP / DE-GROUP BLOCK IV state g u g u g u g u g u g u g u g u register HASH fd S L πd

P’d Permutation

message MESSAGE register

Figure 6.17: JH serial architecture

The group/de-group block realizes the grouping and de-grouping steps of E8 function. It only performs grouping/de-grouping at word-level. Instead of implementing bit-level grouping/de- grouping, E8 round function is modified in order to support operation on the word-level-grouped input and produce output compatible with word-level de-grouping. Serialized E8 round function 0 consists of an Sbox, the linear transformation block, and the πd, Pd, and φd partial permutation 0 blocks. All, except the Pd-module, operate on 32 bits. The serial data flow of JH is shown in Figure 6.18. It starts with the grouping round, which lasts for 32 cycles. This round is followed by R8 round function for 42 rounds (each of them is again 32 cycles). After R8 process, de-grouping round is performed. These grouping and de-grouping operations result in two additional rounds, which make 44 rounds in total. For the last message block, one extra quarter round is required for squeezing the output.

50 6.2. Implementations of the Finalist Hash Functions

0 1 0 2 1 0

GROUPING 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 round 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 29 7 27 26 25 6 23 22 21 5 19 18 17 4 15 14 13 3 11 10 9 2 7 6 5 1 3 2 1 0 30 15 7 27 26 14 6 23 22 13 5 19 18 12 4 15 14 11 3 11 10 10 2 7 6 9 1 3 2 8 0 31 23 15 7 27 22 14 6 23 21 13 5 19 20 12 4 15 19 11 3 11 18 10 2 7 17 9 1 3 16 8 0 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 8 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 16 8 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 round - 1 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 23 15 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 23 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 8 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 16 8 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 round - 2 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 23 15 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 23 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31

0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 8 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 16 8 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 round - 42 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 23 15 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 23 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 8 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 16 8 0 31 23 15 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 DE- GROUPING 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 23 15 7 30 round 7 30 22 14 6 29 21 13 5 28 20 12 4 27 19 11 3 26 18 10 2 25 17 9 1 24 16 8 0 31 23 15 7 15 28 30 22 14 24 29 21 13 20 28 20 12 16 27 19 11 12 26 18 10 8 25 17 9 4 24 16 8 0 31 23 15 23 29 28 30 22 25 24 29 21 21 20 28 20 17 16 27 19 13 12 26 18 9 8 25 17 5 4 24 16 1 0 31 23 31 30 29 28 30 26 25 24 29 22 21 20 28 18 17 16 27 14 13 12 26 10 9 8 25 6 5 4 24 2 1 0 31 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 1

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 6 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 7

Figure 6.18: JH serial flow

51 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications

The whole process is explained in phase-half round-round-cycle concept. In phase-0, the length of the message block is read. Then, in phase-1, initialization vector is read and stored in the state register. In phase-2, the message blocks are read in every round-0 and these message blocks are processed from round-1 to round-44. Also, the message digest is written back in round-44 of the last message block, again in phase-2. This scheme can be seen in Figure 6.19.

clock

start pulse

pop

push

phase 0 1 2

round 0 1 ... 44 0 1 ... 44

cycle 0 ... 4 ... 8 ... 12 ... 15 0 ... 31 0 ... 15 ... 31 0 ... 31 ... 0 ... 31 0 ... 15 ... 31 0 ... 31 ... 0 ... 25 ... 31

last

active

ready

Figure 6.19: JH timing diagram

6.2.4 Keccak Algorithm Keccak, the winning hash function, is a family of hash functions based on the sponge construction [BDPA07]. The fundamental function is the Keccak-f[b] permutation, which consists of a number of simple rounds with logical operations and bit permutation. b ∈ 25, 50, 100, 200, 400, 800, 1600 is the width of both the permutation and the state in the sponge construction. In our work, we concentrate on Keccak-f[1600] with 256-bit message digest. The state of Keccak is organized in 5×5 lanes, each with w-bits, where w ∈ 1, 2, 4, 8, 16, 32, 64, and b = 25w. The Keccak[r, c, d] (Figure 6.20) is obtained by applying the sponge construction to Keccak-f[r + c] with the parameters capacity c, bit rate r (which are 512 and 1088, respectively, for Keccak-f[1600]). The flow of Keccak-f and the details of the steps are given in Figure 6.21. The number of rounds nr depends on the permutation width, l which is calculated by nr = 12 + 2 × l, where 2 = w. This yields 24 rounds for Keccak-f[1600]. Table 6.4 shows the specification of Keccak-f[1600] for 256-bit message digest.

Table 6.4: Keccak specifications

Algorithm Word Message Block Rounds Digest Keccak-256 64-bit < 2128-bit 1088-bit 24 256-bit

Implementation The serialized architecture for Keccak is given in Figure 6.22. In the serial design, the data is processed in lanes, which is 1/25 of the whole state. The state registers, numbered 24 − 0,

52 6.2. Implementations of the Finalist Hash Functions

m1 m2 mt z1 z2

r 0 ƒ ƒ ƒ ƒ

c 0

ABSORBING SQUEEZING

Figure 6.20: Sponge construction of Keccak

q r p c i

q Step: ∑ p Step: ∑

r Step: c Step:

Figure 6.21: Keccak-f function and steps of the function

53 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications are used to store the internal state, and the four summation registers (rightmost registers numbered 4 − 0) store the row sums. The operational blocks that implement a Keccak round are the θ, ρ, π, χ, ι-modules. All, but π-module, operate on a single lane. π-step is executed in parallel on all 25 lanes. It is a fixed permutation operation, and the only area cost comes from the additional multiplexers and routing. There is additional area cost caused by sum registers (required for θ-step) and two temporary registers (required for χ-step). However, this additional area is compensated by the huge area saving of the serialized processing and the resulting single

lane combinational blocks.

0/1 ρ θ MESSAGE 0/1 HASH 4 1 0

24 23 2 1 0 0/1

χ ι RC

π 1 0

Figure 6.22: Keccak serial architecture

The processing starts with round 31, where the length of the message block is read. Then round 0 comes, where the data is written in lanes into the state registers and each row sum is accumulated inside the sum registers. The first incoming data is lane(0, 0) and it is shifted into the state register 24 while sum register 4 is filled with the same value. In the next cycle, state register 24 is shifted into state register 23 and filled with the incoming lane(1, 0). In parallel, sum register 4 is shifted into the sum register 3, and re-initialized with lane(1, 0). At the end of the first 5 cycles, the first 5 lanes of data are in state registers 24 to 20, while sum registers 4 to 0 have the first lanes of each column. In the following cycles, incoming data are added onto the sum registers and shifted into the state registers. At the end of the first 25 cycles, state registers contain the full state and sum registers contain the row sums. Starting with the next cycle, θ and ρ operations are run in parallel from lane(0, 0) until lane(4, 4), covering the whole state. These operations are completed in 25 cycles. It is followed by another 25 cycles, where π, χ, and ι operations are performed. Since π can only be executed on the whole state, it is done in parallel with the first lane of χ. The ι operation (round constant addition) is also done in the same cycle. In the following 24 cycles, the χ operation is performed on the remaining lanes, completing the first round. Each of these 25 cycles are named as half rounds. The row summations for the following round are also performed in parallel with π, χ, and ι operations of the current round, as an additional optimization. A full round takes 50 cycles to complete. At the end of the 24 rounds, the second half round of the “last” round is used for “squeezing” the message digest. The timing diagram in Figure 6.23 shows the round, half round, and cycles for processing of two message blocks.

54 6.2. Implementations of the Finalist Hash Functions

clock

start pulse

pop

push

round 31 0 1 ... 24 0 1 ... 23 24

half round ......

cycle 0 ... 8 ... 15 0 ... 24 0 ... 24 0 ... 24 ... 0 ... 24 0 ... 24 0 ... 24 0 ... 24 0 ... 24 ... 0 ... 24 0 ... 24 0 ... 24 0 1 ... 4 ... 24

last

active

ready

Figure 6.23: Keccak timing diagram

The whole data processing in each half round is explained by a 3 × 3 lanes toy-version of Keccak in Figure 6.24, instead of the actual 5 × 5 lanes configuration.

ρ-Step θ-Step State Registers Temp S-Registers Regs θ-Step State Registers S8 S7 S6 S5 S4 S3 S2 S1 S0 S2 S1 S0 T1 T0 S-Registers

χ S8 S7 S6 S5 S4 S3 S2 S1 S0 S2 S1 S0 ι θi 0,0 2 1 0 2 1 0 2 1 0 1,0 0 ∑0 0 2 1 0 2 1 0 2 1 ∑0 0 2,0 1 0 ∑1 ∑0 1 0 2 1 0 2 1 0 2 ∑1 ∑0 1 0 0,1 2 1 0 ρ ∑2 ∑1 ∑0 2 1 0 2 1 0 2 1 0 ∑2 ∑1 ∑0 π, χ, ι and 1,1 0 2 1 0 ∑00 ∑2 ∑1 θ-init (θ ) 0 2 1 0 2 1 0 2 1 ∑00 ∑2 ∑1 0 i θ-init 2,1 1 0 2 1 0 ∑11 ∑00 ∑2 half 1 0 2 1 0 2 1 0 2 ∑11 ∑00 ∑2 1 0 (π+χ+ι+θ ) round i 0,2 2 1 0 2 1 0 θ ∑22 ∑11 ∑00 2 1 0 2 1 0 2 1 0 ∑22 ∑11 ∑00 half round 1,2 0 2 1 0 2 1 0 ∑000 ∑22 ∑11 0 2 1 0 2 1 0 2 1 ∑000 ∑22 ∑11 0 2,2 1 0 2 1 0 2 1 0 ∑111 ∑0 ∑22 1 0 2 1 0 2 1 0 2 ∑111 ∑0 ∑22 1 0 2 1 0 2 1 0 2 1 0 S∑222 ∑1 ∑0 2 1 0 2 1 0 2 1 0 ∑222 ∑1 ∑0 0 2 1 0 2 1 0 2 1 ∑0 ∑2 ∑1 0 2 1 0 2 1 0 2 1 ∑0 ∑2 ∑1 1 0 2 1 0 2 1 0 2 ∑1 ∑0 ∑2 1 0 2 1 0 2 1 0 2 ∑1 ∑0 ∑2 2 1 0 2 1 0 2 1 0 ∑2 ∑1 ∑0 2 1 0 2 1 0 2 1 0 ∑2 ∑1 ∑0 0 2 1 0 2 1 0 2 1 ∑0 S2 ∑1 θ and ρ 0 2 1 0 2 1 0 2 1 ∑0 S2 ∑1 θ and ρ 1 0 2 1 0 2 1 0 2 ∑1 ∑0 ∑2 (θ+ρ) half 1 0 2 1 0 2 1 0 2 ∑1 ∑0 ∑2 (θ+ρ) half 2 1 0 2 1 0 2 1 0 ∑2 ∑1 ∑0 round 2 1 0 2 1 0 2 1 0 ∑2 ∑1 ∑0 round 0 2 1 0 2 1 0 2 1 ∑0 ∑2 ∑1 0 2 1 0 2 1 0 2 1 ∑0 ∑2 ∑1 1 0 2 1 0 2 1 0 2 ∑1 ∑0 ∑2 1 0 2 1 0 2 1 0 2 ∑1 ∑0 ∑2 2 1 0 2 1 0 2 1 0 0 ∑0 π ι χ

Figure 6.24: Keccak data flow

6.2.5 Skein Algorithm

Skein is a family of hash functions with three different internal state sizes: 256, 512 and 1024 bits, where Skein-512 is the primary hash function and can be used for all current hashing applications. Skein hash function is built out of a tweakable block cipher (ThreeFish), which

55 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications allows hashing configuration data along with the input text in every block, making every instance of the compression function unique. In addition to the ThreeFish tweakable block cipher (256, 512 and 1024-bit block sizes) at the core, Skein is built up of a Unique Block Iteration (UBI), which maps an arbitrary input size to a fixed output size, and an optional argument system that allows supporting different optional features. The normal (straightforward) hashing option we use can be seen in Figure 6.25. The first block is for configuration, the following instances are for message processing, and the last block is for output processing.

config m1 mt 0

Three Three Three Three 0 Fish Fish Fish Fish H(m)

type:CFG type:MSG type:MSG type:OUT

Figure 6.25: Skein normal hashing scheme

ThreeFish tweakable block cipher is defined for 256, 512 and 1024-bit block sizes. The key is the same size as the block, and the tweak value is 128 bits for all block sizes. Each one of Skein-512’s 72 rounds consists of four MIX functions followed by a permutation of the eight 64-bit words. A subkey is added every four rounds. The word permutation is the same for every round, and the rotation constants repeat themselves every eight rounds. A key schedule is also performed for generating subkeys from the original key and the tweak. Figure 6.26 shows ThreeFish-512 construction for four rounds together with the internal details of the MIX function, which is an ARX construction. Table 6.5 shows the specification of Skein for 256-bit message digest.

Table 6.5: Skein specifications

Algorithm Word Message Block Rounds Digest Skein-256 64-bit < 264-bit 512-bit 72 256-bit

Implementation

The serialized architecture for Skein is given in Figure 6.27. In round-0, the rightmost eight key expansion registers are filled with input key in 8 cycles, while all input key words are accumulated in the leftmost key register. This practically implements the key expansion process defined for ThreeFish. Following this round, state register is filled the sum of the input message block and the subkey generated in the previous round. In parallel, key expansion process continues within the key registers. At the same time, the message block is backed up inside the message register for post-processing following the completion of all ThreeFish rounds.

56 6.2. Implementations of the Finalist Hash Functions

Plaintext

SubKey 0

MIX MIX MIX MIX

PERMUTE

MIX MIX MIX MIX <<< RC

PERMUTE

MIX MIX MIX MIX

PERMUTE

MIX MIX MIX MIX

PERMUTE

SubKey 1

Figure 6.26: Four rounds of ThreeFish-512

message register MESSAGE

state MIX register

Tweak PERMUTE Gen

HASH IV

Kcnst

key expansion register

Figure 6.27: Skein serial architecture

57 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications

IN mix mix perm OUT R7 R6 R5 R4 R3 R2 R1 R0 OUT REG OUT IN R8 R7 R6 R5 R4 R3 R2 R1 R0 OUT

e0 k0 e1 e0+k0 f1 f0 k1 ∑k0 k0 e2 e1+k1 f0 f1 k2 ∑k1k0 k1 k0 e3 e2+k2 f1 f0 f3 f2 k3 ∑k2...k0 k2 k1 k0 Round e4 e3+k3 f2 f1 f0 f3 Round k4 ∑k3...k0 k3 k2 k1 k0 0 e5 e4+k4 f3 f2 f1 f0 f5 f4 1 k ∑k ...k k k k k k 5 4 0 4 3 2 1 0 e e +k f f f f f f k ∑k ...k k k k k k k 6 5 5 4 3 2 1 0 5 6 5 0 5 4 3 2 1 0 e e +k f f f f f f f f v ...v k ∑k ...k k k k k k k k 7 6 6 5 4 3 2 1 0 7 6 7 0 7 6 0 6 5 4 3 2 1 0 v v v v v v v v v f ∑k ...k k k k k k k k k k 0 7 6 5 4 3 2 1 0 7 7 0 7 6 5 4 3 2 1 0 0 v v f f k k k k k k k k k k 1 0 1 0 0 8 7 6 5 4 3 2 1 1 v v f f Rounds k k k k k k k k k k 2 1 0 1 1 0 8 7 6 5 4 3 2 2 2,6,10,… k2 k1 k0 k8 k7 k6 k5 k4 k3 k3 Round 3,7,11,… k3 k2 k1 k0 k8 k7 k6 k5 k4 k4 1 v v f f f f f f f f v ...v k k k k k k k k k k 7 6 5 4 3 2 1 0 7 6 7 0 4,8,12,... 4 3 2 1 0 8 7 6 5 5 v v v v v v v v v f k k k k k k k k k k 0 7 6 5 4 3 2 1 0 7 5 4 3 2 1 0 8 7 6 6 v s +v f f k k k k k k k k k k 1 K0 0 1 0 6 5 4 3 2 1 0 8 7 7 v s +v f f k k k k k k k k k 2 K1 1 0 1 7 6 5 4 3 2 1 0 8 Rounds k k k k k k k k k 8 7 6 5 4 3 2 1 0 5,9,13,... k0 k8 k7 k6 k5 k4 k3 k2 k1 Rounds v7 sK6+v6 f5 f4 f3 f2 f1 f0 f7 f6 v7...v0 k0 k8 k7 k6 k5 k4 k3 k2 k1 2,6,10,… v0 v7 v6 v5 v4 v3 v2 v1 v0 f7

k0 k8 k7 k6 k5 k4 k3 k2 k1 v1 v0 f1 f0 k0 k8 k7 k6 k5 k4 k3 k2 k1 Rounds v2 v1 f0 f1 3,7,11,… Round 4,8,12,... 42 k0 k8 k7 k6 k5 k4 k3 k2 k1 v7 v6 f5 f4 f3 f2 f1 f0 f7 f6 v7...v0 k0 k8 k7 k6 k5 k4 k3 k2 k1 k1 e0 v7 v6 v5 v4 v3 v2 v1 v0 v0 k1 k0 k8 k7 k6 k5 k4 k3 k2 k2 Rounds e1 e0+k0v7 v6 v5 v4 v3 v2 v1 f1 f0 v1 5,9,13,... Round 43 k7 k6 k5 k4 k3 k2 k1 k0 k8 k8 v7 v7

Figure 6.28: Skein serial flow

ThreeFish processing inside the state register is done via a 128-bit MIX block and a fully parallel 512-bit permutation block, which is a fixed 64-bit word based permutation. Its only cost is multiplexers. The 128-bit MIX block requires an additional 64-bit temporary register in order to collect 128-bits of data. At the end of round-42, ThreeFish operation is completed, and round-43 is used to add the stored messages onto the ThreeFish result (UBI operation) in order to obtain the next state of the hash. The operation is repeated until all message blocks are processed. The serial data flow of Skein is shown in Figure 6.28.

The whole process is explained in phase-round-cycle concept. In phase-0, the length of the message block is read. Then, in phase-1, 512-bit initialization vector is directly read from RAM, which makes additional ThreeFish run not necessary. In phase-2, the message blocks are read and processed. Following this, the hash value is updated in phase-3. Phase-2 and phase-3 are repeated in series, until all message blocks are processed. After the processing of the last message block, the message digest is written back in that block’s phase-3. This scheme can be seen in Figure 6.29.

58 6.3. Interface

clock

start pulse

pop

push

phase 0 1 2 3 2 3 2 3

round 0 ... 72 0 ... 72 0 ... 72

cycle 0 ... 4 ... 7 0 ... 7 0 ... 7 ... 0 ... 7 0 ... 7 0 ... 7 ... 0 ... 7 0 ... 7 0 ... 7 ... 0 ... 7 0 1 ... 4

last

active

ready

Figure 6.29: Skein timing diagram

6.3 Interface

All five hash modules are connected to a 32-bit FIFO-based IO interface module in order to connect to the external world in an Integrated Circuit (IC). Internal interface with modules is 64-bit for Keccak and 32-bit for all other blocks. The FIFO is organized as an even/odd couple in order to provide 64 bits necessary for the Keccak block (both FIFOs active) and 32 bits for the others (odd FIFO active in odd cycles, even FIFO active in even cycles). A simple REQuest/ACKnowledge signalling scheme is implemented, where REQ signal is set when FIFO is almost empty and ACK is set when the result is ready. 2016 bytes of memory exist for MESSAGE/DATA and 32 bytes are present for HASH result (message digest). The architecture enables only the selected module, and disables the others via clock gating. This interface architecture is shown in Figure 6.30.

Lightweight SHA-3 ASIC KECCAK DATA INPUT BLAKE REQ EVEN ODD PUSH GRØSTL FIFO FIFO ACK (MESSAGE) (MESSAGE) POP JH

(HASH) (HASH) DATA SKEIN OUTPUT MODULE SEL/ENB

Figure 6.30: Interface model

59 Chapter 6. On the Suitability of SHA-3 Finalists for Lightweight Applications

Table 6.6: Comparison of our work with previous works

Message Cycles Tput Tput/ Techn. Area block size Freq. per (kbps area Finalist (nm) (kGE) (bits) (MHz) block @100kHz) (bps per GE) BLAKE [HAMP11] 180 13.58 512 215 816 63 4.64 BLAKE [HAMP11] 180 8.6a 512 100 N.A. 63 7.33 Our BLAKE 90 11.3 512 N.A. 240 213 18.88 Grøstl [TFI+09] 350 14.622 512 56 N.A. 261 17.85 Our Grøstl 90 9.2 512 N.A. 1280 40 4.32 JH [TFK+09] 180 58.832 512 380.22 39 1313 22.32 JH [KKI+11] 90 31.864 512 353 N.A. 1314 41.24 Our JH 90 13.6 512 N.A. 1440 36 2.61 Keccak [BDPA11] 130 9.3b 1088 200 5160 20 2.15 Our Keccak 90 15.2 1088 N.A. 1200 91 5.96 Skein [TFI+09] 350 12.890c 512 80 N.A. 25 1.94 Skein [KKI+11] 90 22.562d 512 50 10 2694 119.40 Our Skein 90 15.5 512 N.A. 592 86 5.58 a) This compact core uses an external memory to hold the message block and does not provide salted hashing. b) This value includes the area of the RAM. The coprocessor uses 5 kGE with external RAM (as reported in Keccak documentation). Including RAM area yields 9.3 kGE. c) Skein-256-256 d) Skein-512-256

6.4 Results and Discussion

In our study, we achieved better results than most of the previous works in terms of area and throughput. Grøstl and BLAKE give the best gate counts. Best throughput numbers are presented by BLAKE and Keccak, while the best results are provided by BLAKE and Keccak in terms of throughput/area.

Note that, except for Keccak, all hash functions have half the internal state size with respect to 512-bit message digest option. Such a normalization for Keccak will result in Keccak-800 − 256, and will yield the best gate count and worst throughput. It is also worth mentioning that the throughput of Grøstl can be quadrupled at the expense of an additional 2 kGE (estimated), making it the second best in terms of throughput, while preserving its top position with the smallest area.

Table 6.6 lists our results for all finalists as well as comparison with previous works.

60 6.5. Conclusion

6.5 Conclusion

Since the existing security standards and primitives are most of the time not suitable for de- ployment in lightweight devices, there have already been several creative implementations of these standards targeted for lightweight applications, in addition to the new lightweight algo- rithm proposals. While these algorithms have mostly focused on block ciphers, there are also recent examples on lightweight hash functions. Unfortunately, these studies were independent of the SHA-3 standardization process, where the suitability of the candidates for lightweight applications was almost neglected. We believe that the two efforts should somehow be combined or at least associated. In this work, we made such an association possible through a suitability analysis of the SHA- 3 finalists for lightweight applications. We limited our focus on reaching lightweight figures for area, which also results in good numbers for average power consumption in most applica- tions; and tried to get the lowest possible recorded gate counts for all five finalists. Use of block memories is avoided for compatibility on different platforms. We have been successful in reaching our target of lowest gate count, and we have presented very compact implementations of SHA-3 finalists. However, despite our lightweight approach, SHA-3 finalists are still larger compared to lightweight block ciphers (see Chapter 4). We believe that this is a result of hav- ing no specific lightweightness target for SHA-3 candidates. Other hash functions, which are specially-tailored for lightweight applications, such as PHOTON, QUARK, and SPONGENT (which follow sponge construction like Keccak), result in better figures compared to our serial implementations of SHA-3 finalists.

61

Chapter 7 IPSecco: A Lightweight and Reconfigurable IPSec Core

In this chapter, we propose a reconfigurable lightweight IPSec hardware core, which is implemented on FPGA. Our architecture supports the main IPSec protocols; namely Authentication Header (AH), Encapsulating Security Payload (ESP), and Internet Key Exchange (IKE). Instead of re-implementing the common “too heavy” IPSec configurations, which are not suitable for pervasive devices, we evaluate efficient im- plementations of standardized and/or well-known lightweight and hardware-friendly algorithms. In particular, we examine different versions of PRESENT, hash func- tions Grøstl and PHOTON, and a very compact ECC core. As a consequence, we present IPSecco, an IPSec core with adequate security and only moderate resource requirements, making it suitable for lightweight devices.

Contents of this Chapter 7.1 Introduction ...... 63 7.2 Related Work ...... 64 7.3 Implementation ...... 65 7.4 Results and Discussion ...... 72 7.5 Conclusion and Future Work ...... 74

7.1 Introduction

With the technological development in today’s world, more and more computing devices enter the market, as mentioned previously in Chapter 1. Many of them are connected over the Internet or local networks in order to communicate with each other. Naturally, an increasing number of these devices are also resource-constrained devices and used in pervasive computing applications. These lightweight devices also need to connect to the Internet for device-to- device communication and product updates – mostly over a wireless network. There must be a common language for devices across networks to share information with each other. The Internet Protocol (IP) [RFCh, RFCa, RFCb] is the primary communication protocol for transferring data between parties across a network. It defines datagram structures that are encapsulating the data to be delivered. For resource-constrained environments like embedded systems, there is even a lightweight version of IP, namely lwIP [lwI], which reduces the resource utilization.

63 Chapter 7. IPSecco: A Lightweight and Reconfigurable IPSec Core

Nevertheless, the rapid increase in the utilization of these computing devices led to security problems also in communication over networks. Nowadays, many devices require authentication and encryption of the data they receive and send. The same is valid for area-constrained devices, many of them provide critical data. For devices with no resource limitation, a security enhance- ment to IP, called IPSec [RFCc, RFCd, RFCg], has already been proposed. IPSec defines a family of protocols to provide security services such as confidentiality – to prevent undesired access attempts to the data transmission, data integrity – to make sure that the transferred data is not changed, and authentication – to identify the information source. In IPSec, there are different protocols to provide mentioned services. For instance, the AH protocol provides data authentication. The ESP protocol defines mechanisms for confidentiality and data in- tegrity. Finally, the IKE protocol is used for establishing secure connections. These protocols use different cryptographic primitives such as encryption, hashing, and modular arithmetic in order to provide security services. A minimum set of algorithms, which must be supported in an IPSec implementation for AH, ESP, and IKE protocols, was defined in “Cryptographic Suites for IPSec” [RFCd, RFCg] for standardization purposes. For example, in “Cryptographic Suite B” [RFCg], the AES cipher is used in Galois/Counter Mode (GCM) [Dwo07] to provide authenticated encryption. The Hash-based Message Authentication Code (HMAC) [RFCe] con- struction is used with the SHA for AH services. For exchanging keys between parties, IKE uses the Diffie-Hellman key exchange [RFCf]. However, none of these recommendations are actu- ally targeted at resource-constrained devices. To the best of our knowledge, there exists no standardized lightweight IPSec protocol in the literature. In this work, in order to have a lightweight IPSec core, we exchange regular algorithms with lightweight ones – standardized and/or well-known lightweight algorithms. For instance, we use PRESENT instead of AES. As hash functions, we evaluate SHA-3 finalist Grøstl and the lightweight hash function proposal PHOTON, which uses the PRESENT Sbox. Due to the mentioned lightweight device constraints, a lightweight IPSec core must be low-cost and low-power. Software solutions suffer from low performance (when compared to hardware) and some hardware implementations, such as ASIC implementations, lack the flexibility and programmability offered by software. Hence, using an FPGA platform for the designed hardware core seems to be the perfect solution to achieve our goals and stay reconfigurable. We implement both the encryption/hashing algorithms and modes of operation in hardware. Selecting Xilinx Spartan FPGAs as the target platform provided us with reconfigurability, we therefore can switch between different lightweight algorithms or implementations depending on our needs with less effort. For these platforms, we evaluate and propose the IPSecco-80 and IPSecco- 128 cores, which are configured to achieve 80-bit and 128-bit symmetric security.

7.2 Related Work

Hardware-related implementations of IPSec-specific functionality have unfortunately been rel- atively scarce, except for a spark of recent activity. In [SRK11], the authors propose an architecture for implementing IPSec on a Xilinx Virtex- 4 FPGA board. Cryptographic primitives are realized as reconfigurable coprocessors, which are attached to a MicroBlaze soft-core [Xil09]. The MicroBlaze is responsible for handling the protocol layer and reconfiguring the coprocessor according to the type of primitive that is required. The supported primitives are programmed into the coprocessor on-demand, i.e., the

64 7.3. Implementation coprocessor supports only one primitive at a time, which allows for a lower overall resource consumption. However, since partial reconfiguration comes with a time penalty for switching the crypto core, this approach does not allow for extremely high throughput in a typical setting. The authors of [WBC10] propose an architecture that targets IPSec and Secure Sockets Layer (SSL). The architecture is optimized for throughput rather than area and is implemented on a Xilinx Spartan-3 FPGA board. A soft-core handles the protocol layer and distributes cryp- tographic operations to an array of crypto engines, which are instantiated in parallel. Equipped with adequate buses, this system allows for an extremely high throughput while maximizing the device resource utilization. The authors of [NWW+11] implement their design on a Xilinx Virtex-5 FPGA board. They propose to use four different types of cores, designed to handle the AH and ESP packets of IPSec. In contrast to the previously discussed works, the system in this work allows one to flexibly choose how many cores of which type are to be used in the IPSec processor. This feature is enabled by a flexible bus architecture connecting these cores. The key exchange phase of IPSec is not considered in this design, hence there are no cores using asymmetric cryptography. In [LL05], the authors propose an IPSec implementation on a Xilinx Virtex-2 Pro FPGA board. Again, the protocol layer is handled by a soft-core and reconfigurable hardware imple- ments AES and HMAC. Asymmetric cryptography is also not considered in this work.

7.3 Implementation

Our IPSecco implementation supports a tight coupling with a soft-core, such as Xilinx’ Mi- croBlaze or PicoBlaze [Xil11b], as depicted in Figure 7.1. The main idea here is to let the soft-core handle the protocol layer of IPSec, whereas our core executes all cryptographic oper- ations, thus accelerating the supported cryptographic primitives. Contrary to the mentioned related work, our focus is on providing a small core that is able to handle all IPSec requirements while maintaining a very low footprint.

Figure 7.1: Integrating IPSecco

65 Chapter 7. IPSecco: A Lightweight and Reconfigurable IPSec Core

Table 7.1: IPSecco configurations

AH (HMAC) ESP (GCM) IKE IPSecco-80 PHOTON PRESENT-80R ECC (secp160r1) IPSecco-128 Grøstl PRESENT-128S ECC (secp256r1) R: RAM-based S: Serial

Our core supports PRESENT-80 and PRESENT-128, Grøstl, PHOTON, and ECC. For both PRESENT variants we have implemented RAM-based and serial versions. The RAM- based implementation is taken from the approach that has already been explained in Chapter 5 – in this chapter we also implement the 80-bit key version of that idea. The RAM-based version is targeted for extra low area utilization, while the serial version is optimized for higher speeds (compared to the RAM-based version). To complement the respective version with a hash function, we use a RAM-based imple- mentation of PHOTON and a serial implementation of Grøstl. Considering the key sizes and different versions, the overall IPSecco core is available in two different configurations (see Ta- ble 7.1). IPSecco-80 provides a security level equivalent to 80-bit symmetric security, while the IPSecco-128 core achieves 128-bit security. The design choices and implementation details of the three distinct blocks (enabling AH, ESP, IKE of IPSec) of our core are given in the following.

7.3.1 Message Authentication

In order to provide data integrity and authenticity in the AH mode, IPSec defines a symmetric message authentication code (MAC). The biggest advantage of a MAC is its relatively high speed compared to digital signatures. However, MACs are symmetric so that both parties have to agree on a joint key and also do not ensure confidentiality of the transmitted information. IPSec requires the utilization of the HMAC construction, which provides a fixed size au- thentication tag for arbitrary messages and can be provably secure under some circumstances. The HMAC value of a message X is computed as

 + +  HMACK (X) = H (K ⊕ opad)||H(K ⊕ ipad)||X where opad and ipad are padding constants, and K+ is the key K padded to a length equal to the block length of the hash function. In order to make the construction secure, it is required that the key length used is equal or greater than the output length of the employed hash function H. The implementation of an HMAC wrapper, given a hash function H, is rather straightforward as can be seen in Figure 7.2. We only require a shift register, implemented as Shift Logical Right (SLR), that stores the key (as it is needed twice) and also the hash output after the first call to the hash function. We have implemented two modern, lightweight hash functions, to be used in our wrapper.

66 7.3. Implementation

Figure 7.2: Architecture of the HMAC core

PHOTON

One implementation option we provide for IPSecco is the lightweight hash function PHOTON, which is designed for hardware efficiency and low area consumption. PHOTON is based on the well-established and well-analyzed sponge framework [BDPA07] and supports several parameter sets for different levels of security. In this work, we have implemented PHOTON-160/36/36, which provides 80-bit collision resistance and has an input block length of 36 bits. The state is represented as a 7 × 7 nibble matrix, which is first initialized with some constants. For each message block, the message is XORed with the content of the state and afterwards a permutation is applied (12 times for every input block). This permutation consists of adding constants, applying the PRESENT Sbox to every entry of the state, rotating some rows and finally mixing the columns of the state. After all message blocks have been absorbed, the squeezing phase begins, in which the hash output is generated by further applications of the permutation and the extraction of the parts of the state. Note that this module is not implemented by the author of this thesis. Our co-author Thomas P¨oppelmann implemented the PHOTON module as an Application-Specific Processor (ASP) that stores its program code and also the state in a dual-port block RAM in order to save slice resources. Some operations are directly applied to the state (e.g., Sbox or XORing of constants) and do not involve any registers. Other instructions, like shifting and re-ordering of entries of the state, are performed with the help of two 4-bit registers. One register is also able to execute Galois field arithmetic, which is necessary in the MixColumns layer. Constants are generated by LFSRs, which are also used to trigger the execution of static branch instructions and hold the state (e.g., the number of applications of the permutation layer to a message block).

Grøstl

Our serialized Grøstl architecture is taken from the Grøstl design given in Chapter 6 (see Section 6.2.2).

67 Chapter 7. IPSecco: A Lightweight and Reconfigurable IPSec Core

Instead of using our Grøstl implementation as part of HMAC, another possibility is to use a different message authentication method. In their submission to NIST, the authors of Grøstl state that an envelope construction such as

+ + MACK (X) = H(K ||X ||K) offers security similar to HMAC, but requires only one iteration of the hash function and no wrapper.

7.3.2 Authenticated Encryption GCM provides authenticated encryption with low overhead and was originally designed [Dwo07] for 128-bit block ciphers, e.g., AES. However, for lightweight block ciphers such as PRESENT, a modification was proposed [MV05] to extend GCM to ciphers with a block size of only 64-bit. Encryption with GCM allows for four different inputs: an initialization vector IV , a plaintext P , a key K, and additional data A, which is not encrypted but authenticated. In our notation, A is split into m blocks A1,A2, ..., Am of 64 bits and P (and thus also C) in n blocks P1, ..., Pn. The output of the encryption step is a ciphertext C1, ..., Cn and an authentication tag T , which can be used to validate the integrity of C and A. We have adapted and slightly modified the GCM proposal for 64-bit block sizes in order to correct what we believe1 is a flaw in the document. Accordingly, GHASH64(H, A, C) is defined by the recursive relation

 0 i = 0,   (Xi−1 ⊕ Ai) ∀i = 1, ..., m − 1,  ∗ Xi =H · (Xi−1 ⊕ P(Am)) i = m,  (Xi−1 ⊕ Ci) ∀i = m + 1, ..., m + n − 1,   ∗ (Xi−1 ⊕ P(Cn)) i = m + n, and finally GHASH64(H, A, C) = (((Xm+n ⊕ L(A)) · H ⊕ L(C)) · H. Here, P(·) returns a zero-padded string of 64 bits, L(·) returns the zero-padded bit length of its input. Building on the construction of GHASH64, the authenticated encryption operations are defined by the following equations:

H = E(K, 064) ( IV ||031||1 if len(IV) = 32 Y0 = GHASH64(H, {},IV ) otherwise

Yi = I(Yi−1) ∀i = 1, ..., n

Ci = Pi ⊕ E(K,Yi) ∀i = 1, ..., n − 1

Cn = Pn ⊕ Mu(E(K,Yn))

T = Mt(GHASH64(H, A, C) ⊕ E(K,Y0))

1We have tried to contact the authors of the respective document but received no answer.

68 7.3. Implementation

Here, I(·) denotes the 32-bit incrementation while Mx(·) returns the x most significant bits. In the original proposal [MV05] we find

( 031||1||IV if len(IV) = 32 Y0 = GHASH64(H, {},IV ) otherwise which, in case the length of the IV is 32 bits, severely limits the possible counter values Yi to 232 instead of 264 possibilities. Figure 7.3 shows the architecture of the PRESENT/GCM core. Since GCM is basically a combination of the well-known Counter (CTR) mode and multiplication/addition in Galois Field (GF) GF(264), this is also reflected in our design. Our design can be used for encryption as well as for decryption – both modes include calculating the authentication tag T . In case of decryption, the computed T has to be verified by the user of our core. Note that Figure 7.3 depicts a simplified view of the implemented architecture.

Figure 7.3: Architecture of the PRESENT/GCM core

PRESENT-RAM For IPSecco, we implement PRESENT for both key options, PRESENT-80 and PRESENT- 128.

69 Chapter 7. IPSecco: A Lightweight and Reconfigurable IPSec Core

The RAM-based implementation of PRESENT is taken from the approach presented in Chap- ter 5. However, in Chapter 5, only the design for PRESENT-128 is presented. For our work, we also implement PRESENT-80 – the difference of this design is in the last phase of the operation flow (key schedule). The overall data flow of PRESENT-80 is similar to PRESENT-128; how- ever, in the key scheduling part we have less clock cycles as a result of the shorter key length. The main modification here is in the addressing scheme. As mentioned before, the cycle count is less than PRESENT-128; however, the non-symmetric nature of PRESENT-80’s addressing logic does not allow for a significant reduction of the slice count. In total, PRESENT-80 takes 969 clock cycles while PRESENT-128 takes 1062 clock cycles (note that we have implemented the on-slice Sbox version in this chapter, not the on-RAM Sbox version). As stated above, the data flow for both PRESENT-80 and PRESENT-128 is more or less the same except the key scheduling part. The general block diagram for both versions is given by Figure 7.4.

Figure 7.4: Architecture of the RAM-based PRESENT implementation

PRESENT-SERIAL In the serial implementations of PRESENT-80 and PRESENT-128, instead of the regular 4-bit datapath as depicted in [RPLP08], we use a 16-bit datapath approach. Using a 4-bit datapath would certainly decrease the slice count as the overall circuit uses less combinational circuitry. However, we also want to have a better performance compared to regular serial implementations. Therefore, we selected the 16-bit datapath for the serial implementations of PRESENT. The data is processed in 16-bit chunks, which means we only update 16 bits of the state in each clock cycle. As can be seen in Figure 7.5, we have four 16-bit state registers and five 16-bit key registers in serial PRESENT-80 implementation. These registers normally act like shift registers, except during the data load and permutation/key scheduling. We take the data and key input in the

70 7.3. Implementation

first round and we start processing at the same time. The key addition and substitution are performed at the same time in each cycle on the output of registers SR0 and KR0. Note that we have four 4-bit Sboxes for our 16-bit state in the substitution step. At the end of each round, the permutation is applied on the full state as it is not possible to serialize the permutation step in a cheap way. The same is true for key scheduling, we therefore update the whole key registers at the end of each round, in parallel to permutation.

Figure 7.5: Architecture of the serial PRESENT-80 implementation

The serial implementations for PRESENT-80 and PRESENT-128 are similar. The main difference of PRESENT-128 is the additional three key registers for key storage. As a result of having a larger key size, the PRESENT-128 implementation takes more clock cycles than PRESENT-80. One round of PRESENT-80 is processed in 5 clock cycles while it takes 8 cycles for PRESENT-128. Furthermore, there is an additional multiplexer at the input of SR1 to select between key-added & Sboxed and unprocessed SR0 output, as we have to wait for a few cycles before the permutation process at the end of each round (which is done in parallel with the key scheduling to preserve the data flow). Figure 7.6 shows the serial PRESENT-128 implementation.

7.3.3 Key Exchange RFC 6379 [RFCg] defines the recommended cryptographic primitives for IPSec for security levels of 128-bit and above. For key exchange, and a security level equivalent to AES-128, either RSA or use of prime field elliptic curve cryptography is required. We opted for ECC, the given curves are the same as recommended by NIST and earlier by SECG [Sta00]. The naming conventions are slightly different, so the NIST curve ECC-p256 was formerly known as secp256r1. While NIST only standardized the secp curves from 224-bit upwards, there also exist recommended curves for 160-bit and 192-bit arithmetic, named secp160r1 and secp192r1. In 2011, a reconfigurable ECC-p core [VGM11] using a microcode approach [VMG+10] was introduced at ReConFig2. The design uses a very small arithmetic and logic unit (Arithmetic 2The co-author of the publication related to this chapter, Oliver Mischke, is also a co-author of [VGM11]

71 Chapter 7. IPSecco: A Lightweight and Reconfigurable IPSec Core

Figure 7.6: Architecture of the serial PRESENT-128 implementation

Logic Unit (ALU)) and a control unit, which is able to read and decode certain instruction sequences from RAM. It is able to perform modular arithmetic from point addition and doubling up to point multiplication. Because the program code (e.g., how to perform doubling, addition, a fast NIST reduction, etc.) and all necessary constants (base point, etc.) are stored in RAM, it is possible to switch from one curve to another just by changing the stored program code and data in the RAM. This can happen on the fly after the initialization of the FPGA. The ability to change the prime curve for ECC operations just by swapping the memory content makes this design very suitable for less expensive, older FPGAs. This directly is beneficial for the Xilinx Spartan-3 Series, which do not support partial reconfiguration. We have therefore chosen to implement a similar design as [VGM11], but instead of implement- ing ECC-p224/secp224r1, we implemented secp160r1 alongside ECC-p256/secp256r1 to cover a wider range of security levels. As ECC-p256/secp256r1 provides an equivalent security level as AES-128 or PRESENT-128, secp160r1 is a very good match for lower security cryptographic primitives like the implemented PRESENT-80. Note that the ECC block described here is implemented by our co-author Oliver Mischke. The architecture of our ECC module is depicted in Figure 7.7 and is, as mentioned before, very similar to the one of [VGM11]. A control unit (PRG CTL) reads and decodes instructions from one RAM and passes that information to the modular ALU. That unit features a small arithmetic unit using a 16-bit datapath, which can perform basic operations like multiplication of two 16-bit values, addition, subtraction, and comparison. It also contains two dual-port RAMs to be able to read both operands and store the result of a computation in a single clock cycle. Compared to [VGM11] our design is slightly smaller, most likely because we focused only on Xilinx Spartan FPGAs with dedicated Multipliers/Digital Signal Processing (DSP)s. Thus, there was no need to optimize the datapath to make the same design fast on different FPGA architectures.

7.4 Results and Discussion

Table 7.2 and Table 7.3 show our synthesis results for the two representatives of Xilinx’ Spartan family. We have evaluated two major design strategies, RAM-based implementations and serial implementations. The RAM-based approach makes use of the FPGA’s on-board RAMs in

72 7.4. Results and Discussion

Figure 7.7: Architecture of the ECC core

order to reduce the slice count. It also comes with a performance penalty and is therefore more suitable for low throughput applications. Our serial implementations are also implemented with low area requirements in mind but still allow for faster processing of data (when compared to the RAM versions). They are suitable for medium throughput applications.

As expected, the resource consumption of our design is significantly lower than that of previ- ously published work on IPSec FPGA implementations. For example, in [SRK11], AES, SHA- 256, and a modular exponentiation core for elliptic curve Diffie-Hellmann was implemented on a Xilinx Virtex-4 FPGA. While the Virtex-4 series is significantly more powerful then our Spartan-3 target, it employs a similar 4-input LUT architecture. Therefore, the slice count as well as LUT and flip-flop usage can be compared to each other. Their AES core uses 1862 slices, while our similarly secure PRESENT-128 implementation in the smallest option (RAM) only requires 84 slices. The area difference is still large when looking at the hash functions; their SHA-256 implementation requires 924 slices, while our similarly secure serial Grøstl only requires 349 slices.

Reducing the security requirements to an equivalent symmetric security of 80 bits would further reduce the necessary slice count, especially since we then could use a low area, RAM- based implementation of PHOTON instead of Grøstl. Note that these numbers alone of course do not allow a fair comparison, since the cores of [SRK11] are most likely designed to allow a higher throughput while we primarily focused on minimizing the slice count. For the modular exponentiation core, the slice count is quite comparable; namely, 499 slices and three DSP48s in [SRK11] versus 569 slices and one 18-bit block multiplier in our design.

73 Chapter 7. IPSecco: A Lightweight and Reconfigurable IPSec Core

Table 7.2: Characteristics of our implementation for Xilinx Spartan-3

#Slices #LUTs #FFs #RAMs #MULs PRESENT-80R 84 135 38 1 0 PRESENT-80S 131 222 153 0 0 PRESENT-80/GCMR 356 630 448 3 0 PRESENT-80/GCMS 402 714 561 2 0 PRESENT-128R 84 130 38 1 0 PRESENT-128S 161 295 201 0 0 PRESENT-128/GCMR 356 632 450 3 0 PRESENT-128/GCMS 443 771 609 2 0 PHOTONR 91 160 50 1 0 PHOTON/HMACR 133 235 80 1 0 GrøstlS 349 647 376 0 0 Grøstl/ 601 1119 693 0 0 ECC-160 569 1028 507 3 1 ECC-256 569 1028 507 3 1 IPSecco-80 1107 1986 1123 7 1 IPSecco-128 1613 2918 1809 5 1 R: RAM-based S: Serial

7.5 Conclusion and Future Work

In this chapter, we have presented how lightweight building blocks for the execution of the popular IPSec protocol can efficiently be realized on reconfigurable hardware. Our research shows that even a complex protocol suite, which requires several cipher standards and modes of operation, can be implemented even on very low-cost and energy-efficient Spartan FPGAs. By selecting algorithms that are generally designed for efficient realization in hardware, and by adapting them to the specification of our target device, we were able to achieve a significant reduction in terms of area usage compared to related work. With our results we show that standardized security can be achieved on reconfigurable hardware with only a modest investment on slice resources. This leaves more space for other functionality on the target hardware. As a lightweight IPSec core that will probably be used mostly on low-power devices deployed in the field, an attacker could employ side-channel analysis to obtain secret keys or introduce faults into the crypto cores or management modules. As protection mechanisms against such attacks usually require large amounts of device resources and make analysis hard, we did not consider them within the scope of this work. However, generic protection layers [GM11] can be integrated in future as well as countermeasures that are specifically designed to use in lightweight applications. Another interesting topic would be the investigation of different algorithms and their combi- nations, such as CLEFIA, Keccak, HIGHT, etc., and also the primitives that can make use of resource-sharing. Note that a generic IPSec core requires hash functions, block ciphers, as well

74 7.5. Conclusion and Future Work

Table 7.3: Characteristics of our implementation for Xilinx Spartan-6

#Slices #LUTs #FFs #RAMs #MULs PRESENT-80R 26 83 38 1 0 PRESENT-80S 48 167 153 0 0 PRESENT-80/GCMR 138 455 451 3 0 PRESENT-80/GCMS 164 515 572 2 0 PRESENT-128R 31 82 38 1 0 PRESENT-128S 62 231 201 0 0 PRESENT-128/GCMR 141 454 459 3 0 PRESENT-128/GCMS 174 512 624 2 0 PHOTONR 30 89 50 1 0 PHOTON/HMACR 60 152 80 1 0 GrøstlS 136 411 385 0 0 Grøstl/HMACS 275 808 760 0 0 ECC-160 221 630 482 3 1 ECC-256 221 630 482 3 1 IPSecco-80 424 1301 1101 7 1 IPSecco-128 670 1950 1866 5 1 R: RAM-based S: Serial

as an asymmetric construction. These primitives are occupying valuable resources on the device but are not always executed simultaneously (e.g., HMAC is not needed when only ESP mode with authenticated encryption enabled is used). One common solution for efficient resource- sharing is partial reconfiguration, which is only available on more powerful but also more ex- pensive and larger FPGAs. Therefore, further effort can be put into the integration and sharing of components between different ciphers (e.g., Sbox lookup tables, block RAMs, or registers for keys) in order to further reduce the size of our design.

75

Part III

New Resource-efficient Constructions for the Ubiquitous Computing Era

Chapter 8 Targeting Low-latency: A Lightweight Block Cipher PRINCE

In this chapter, the design of a new lightweight block cipher that specifically targets low-latency encryption is investigated. The main design target is to keep the gate count below the existing ciphers in the literature (actually, as low as possible while keeping a certain level of security), and the latency at exactly one clock cycle. The entire development process was a major effort that involved many people. However, this chapter focuses on the contributions of the thesis author, which are investigations on hardware cost, implementation of different hardware architectures, performance on ASIC and FPGA platforms and microcontrollers (together with comparison to the performance of other prominent block ciphers in the literature), and hardware cost estimation for possible side-channel resistant constructions of the cipher. In the light of these investigations, we identify a hardware-friendly low-latency cipher, namely PRINCE.

Contents of this Chapter 8.1 Introduction ...... 79 8.2 PRINCE Background ...... 81 8.3 Hardware Investigations ...... 88 8.4 Hardware Implementations ...... 89 8.5 Software Implementations ...... 93 8.6 Investigations on Countermeasures Against Potential SCA on PRINCE .. 94 8.7 Concluding Remarks ...... 96

8.1 Introduction

The majority of the lightweight ciphers, as mentioned previously, have been optimized according to a dominant metric, which is chip area. This is certainly a valid optimization objective in cases where there are extremely tight power or cost constraints, in particular for passive RFID tags. However, in many other applications, there are several other implementation parameters according to which a cipher should have lightweight characteristics. For instance, instant au- thentication and block-wise read/write access to memory devices (e.g., solid-state hard disks) need low-latency encryption and instant response time.

79 Chapter 8. Targeting Low-latency: A Lightweight Block Cipher PRINCE

For such cases, there is a certain demand for symmetric ciphers that can instantaneously encrypt/decrypt a given plaintext/ciphertext. Here, instantaneous means that the entire en- cryption and decryption should take place within the shortest possible delay. This seemingly simple problem poses a considerable challenge with today’s cryptosystems – in particular if encryption and decryption should both be available on a given platform. Software implemen- tations of virtually all strong ciphers take hundreds or thousands of clock cycles, making them ill-suited for a designer aiming for low-latency cryptography. In the case of stream ciphers im- plemented in hardware, the high number of clock cycles for the initialization phase makes them unsuitable for this task, especially when secret keys need to be regularly changed. Moreover, if we want to encrypt small blocks selected at random (e.g., encryption of sectors on solid-state disks), stream ciphers are not convenient1. This leaves block ciphers as the remaining viable solution. However, the round-based, i.e., iterative, nature of virtually all existing block ciphers, as shown for the case of AES, makes low-latency implementation a non-trivial task. A round-based hardware architecture of AES- 128 requires ten clock cycles to output a ciphertext, which we do not consider instantaneous as it is still too long for some applications. As a remedy, the ten rounds can be loop-unrolled, which means the circuit realizing a single round is repeated ten times. Now, the cipher returns a ciphertext within a single clock cycle – but at the cost of a very long critical path. This yields a very slow absolute response time and clock frequency, e.g., in the range of a few MHz. Furthermore, the unrolled architecture has a high gate count in the range of several tens of thousand GE, implying a high power consumption and resource utilization. Both of these are undesirable, especially when the intended applications are in the embedded domain. Following the same motivation and reasoning with this work, Kneˇzevi´cet al. [KNR12] com- pared several lightweight ciphers with respect to latency and as a conclusion they called for new designs that are optimized for low-latency. As a summary of the above discussions, the following criteria are of importance for such new low-latency cipher designs:

(1) The cipher should perform instantaneous encryption, i.e., a ciphertext should be computed within a single clock cycle. There should be no warm-up phase.

(2) When implemented in modern chip technology, low delays (resulting in moderately high clock rates) should be be achieved.

(3) The hardware costs should be moderate (i.e., considerably lower than fully-unrolled ver- sions of AES or PRESENT).

(4) Encryption and decryption should both be possible with low-cost and low-overhead.

Here, it is important to point out that the existing lightweight block ciphers, such as PRESENT, do not fulfill Criteria 2 and 3 (low-latency, low-area) due to their large number of rounds. Therefore, the number of rounds should be set in a way to address these two criteria without sacrificing security. Another crucial point worth mentioning is the need for (almost) identical hardware utilization for encryption and decryption in order to fulfill Criterion 4. This is an important requirement since the unrolled nature of instantaneous ciphers leads to circuits

1A possible exception are random-access stream ciphers such as Salsa [Ber08]

80 8.2. PRINCE Background that are large and it is thus clearly advantageous if large parts of the implementation can be used both for encryption and decryption. Taking all of these into account, we set our goal to design a new block cipher optimized with respect to the given criteria when implemented in hardware. In the following sections, we introduce our low-latency lightweight block cipher proposal PRINCE and present the details of the specific parts of the design process.

8.2 PRINCE Background

The development process of PRINCE was a major effort involving many people from Ruhr University Bochum, Technical University of Denmark, and NXP Semiconductors. We present the specification of the PRINCE cipher and explain its necessary design details briefly in this section. The details of the individual contributions of the co-authors can be found in the respective publications [BCG+12a, BCG+12b].

8.2.1 Highlights of PRINCE Before going into the details, we would like to highlight some important features of PRINCE. Besides being a new lightweight cipher for the first time optimized with respect to the goals above, PRINCE has several innovative features. First of all, a fully-unrolled design increases the possible design choices enormously. With a fully-unrolled cipher, the traditional need for a cipher to be iterative with very similar round functions disappears. In turn, that allowed us to efficiently implement a cipher where decryption with one key corresponds to encryption with a related key. This property, we refer to as α- reflection, is of independent interest and its soundness against generic attacks is proven in the original PRINCE publication [BCG+12a], which was presented at ASIACRYPT in 2012 (also, a full version is available on IACR Cryptology ePrint Archive [BCG+12b]). As a consequence of the α-reflection, the overhead of implementing decryption over encryption became negligible. Note that the previous approaches to minimize the overhead of decryption over encryption, for example in the ciphers NOEKEON [DPAR00] and ICEBERG [SPR+04], usually require a multiplexer in each round. While for a round-based implementation this does not make a difference, our approach is clearly preferable for a fully-unrolled implementation, as we require a multiplexer only once at the beginning of the circuit. Another difference to known lightweight ciphers like PRESENT is that we balanced the cost of the substitution layer and the linear layer. Our investigations on hardware cost, which will be explained in detail in the following sections, showed that optimizing the cost of the chosen Sbox has a major influence on the overall cost of the cipher. As an Sbox that performs well in one technology does not necessarily perform well in another technology, we propose the PRINCE- family of ciphers that allows to choose the Sbox freely within a (large) set of Sboxes fulfilling the certain criteria. In our work, we favored one Sbox, which was one of the smallest in all three different technologies we used. Our choice for the linear layer can be seen as a solution “in-between” the bit-permutation layers of PRESENT (implemented with wires only) and AES (implemented with considerable combinatorial logic). With the expense of only 2 additional XOR gates per bit over a simple bit-permutation layer, we achieved an almost- MDS property that helps to prove much better

81 Chapter 8. Targeting Low-latency: A Lightweight Block Cipher PRINCE bounds against various classes of attacks. This in turn allowed us to significantly reduce the number of rounds, hence the latency. As a result of all these properties, PRINCE compares very favorable to existing ciphers. For the same time constraints and technologies, PRINCE uses approximately 6 times less area than PRESENT-80 and 14-15 times less area than AES-128. In addition to this, our design uses about 4-5 times less area than the other ciphers in the literature (see Section 8.4 and in particular Tables 8.2 and 8.3 for a detailed comparison in different technologies). To facilitate further study and fairer comparisons, we reported our synthesis results using also the open- source standard-cell library NANGATE. Note that, although it is not the main objective of the cipher, PRINCE compares reasonably well to other lightweight ciphers also when implemented in a round-based fashion.

8.2.2 Cipher Description PRINCE is a 64-bit block cipher with a 128-bit key. The key is split into two parts of 64 bits each, k = k0||k1 and extended to 192 bits by the mapping 0 (k0||k1) → (k0||k0||k1) := (k0||((k0 ≫ 1) ⊕ (k0  63))||k1).

PRINCE is based on the so-called FX-construction [Bir05, KR96]: The first two subkeys k0 0 and k0 are used as whitening keys, while the subkey k1 is the 64-bit key for a 12-round block cipher we refer to as PRINCEcore. This scheme is depicted in the figure below.

0 k0 k0

m PRINCEcore c

The detailed test vectors for PRINCE encryption and decryption are given in the appendix (Part V).

Specification of PRINCEcore

The following figure shows the whole encryption process of PRINCEcore.

RC1 RC2 RC3 RC4 RC5 RC6 RC7 RC8 RC9 RC10

R R R S M’ -1 R 1 R 1 R 1 R 1 R 1 R1 R2 3 4 5 S 6 7 8 9 10

k1 RC0 k1 k1 k1 k1 k1 k1 k1 k1 k1 k1 RC11 k1

S M M -1 S -1

RC RC i k1 k1 i

82 8.2. PRINCE Background

As can be seen from the figure, PRINCEcore consists of 5 regular rounds, a middle layer, 5 inverse rounds, together with round constant and subkey additions at the beginning and at the end. Regular round and inverse round are similar, only the functions and their order are inverted. Each round of PRINCEcore consists of a subkey addition, a substitution layer, a linear layer, and the addition of a round constant.

 Subkey Addition Here the 64-bit state is XORed with the 64-bit subkey k1.

 Substitution Layer In the substitution layer, the state is divided into 16 nibbles and these nibbles go through 16 same 4-bit Sboxes . Inverse of this Sbox is used for the inverse round. The PRINCE Sbox and its inverse in hexadecimal notation is given in the following table. x 0 1 2 3 4 5 6 7 8 9 A B C D E F S[x] B F 3 2 A C 9 1 6 7 8 0 E 5 D 4 S−1[x] B 7 3 2 F D 8 9 A 6 4 0 5 E C 1

 Linear Layer In the round linear layer, the 64-bit state is multiplied with a 64 × 64 matrix M, which is defined in Section 8.2.3. Inverse of this matrix, M −1, is used for the inverse round.

 Round Constant Addition In this step, a 64-bit round constant RCi is XORed with the state. We define PRINCE round constants below in hexadecimal format.

RC0 0000000000000000 RC1 13198a2e03707344 RC2 a4093822299f31d0 RC3 082efa98ec4e6c89 RC4 452821e638d01377 RC5 be5466cf34e90c6c RC6 7ef84f78fd955cb1 RC7 85840851f1ac43aa RC8 c882d32f25323c54 RC9 64a51195e0e3610d RC10 d3b5a399ca0c2399 RC11 c0ac29b7c97c50dd

Note that, for all 0 ≤ i ≤ 11, RCi ⊕ RC11−i is the constant α = c0ac29b7c97c50dd. We define RC0 = 0 and the first five round constants RC1,...,RC5 together with α are derived from the fraction part of π = 3.141..., in order to ensure a random-like behavior while at the same time being publicly verifiable. The second issue can nevertheless be ignored, but any randomly-generated round constants should be a valid choice.

 Middle Layer In the middle layer, the 64-bit state first goes through a substitution layer, then gets multiplied with a 64 × 64 matrix M 0 defined in Section 8.2.3, and finally goes through an inverse substitution layer.

83 Chapter 8. Targeting Low-latency: A Lightweight Block Cipher PRINCE

8.2.3 Design Decisions This section discusses the design decisions of PRINCE. In our design, we opt for a traditional Sbox-based – and especially an SP-network – construc- tion. Note that in our case, which is a low-latency design, an SPN-cipher is preferable over a Feistel-cipher, since a Feistel-cipher operates only on half the state resulting often in a higher number of rounds. In order to minimize the number of rounds and still achieve security against linear and differential attacks, we adopted the wide-trail strategy [Dae95]. As not all round functions have to be identical for a cipher aiming for a fully-unrolled imple- mentation as PRINCE, it is very tempting to directly use the concept of “code-concatenation” [DR09] to achieve a high number of active Sboxes over 4 rounds of the cipher. However, sim- ilar round functions are not only beneficial for a serial implementation, but also very helpful for ensuring a minimum number of active Sboxes. Assume that, using the code-concatenation approach, one can ensure that rounds Ri to Ri+3 have at least 16 active Sboxes. While this is nice, the problem is that it does not ensure that rounds Ri−1 to Ri+2 or Ri+1 to Ri+4 have 16 active Sboxes as well if the individual rounds are very different in nature. We therefore decided to follow a design that on one hand allows to use the freedom given by a fully-unrolled design and on the other hand still keeps the round functions similar enough to prove some bounds on the resistance against linear and differential attacks. In this context, one of the main features of the design is that the decryption can be im- plemented on top of the encryption with a minimal overhead. This is achieved by designing a cipher that is symmetric around the middle round, has a very simple key scheduling and a special choice of round constants. 0 As the round constants satisfy RCi ⊕ RC11−i = α and M is an involution (see Section 8.2.3), we deduce that the core cipher is such that the inverse of “PRINCEcore parametrized with k” is equal to “PRINCEcore parametrized with (k⊕α)”. We call this property of PRINCEcore the α- reflection property. For the details and analysis of this property, we refer to our main PRINCE publications [BCG+12a, BCG+12b]. Note that this is also a work in progress that is investigated by the co-authors of the PRINCE publication, more details might be available in near future. 0 From α-reflection property, it follows that for any expanded key (k0||k0||k1),

D 0 (·) = E 0 (·) (k0||k0||k1) (k0||k0||k1⊕α) where α is the 64-bit constant α = c0ac29b7c97c50dd. Thus, one only has to do a very cheap change to the master key and afterwards reuse the exact same circuit for decryption.

Aligning Encryption with Decryption The use of a core cipher having the α-reflection property together with two additional whitening keys offers a nice alternative to the usual design strategy, which consists of using involutional components – NOEKEON, Khazad, Anubis, ICEBERG, or SEA [SPGQ06] are some examples of such ciphers with involutional components. Actually, the general construction used in PRINCE has the following advantages:

 It allows a much larger choice of Sboxes, which may lead to a lower implementation cost, since the Sbox is not required to be an involution. It is worth noticing that the fact that

84 8.2. PRINCE Background

both the Sbox and its inverse are involved in the encryption function and it does not affect the cost of the fully-unrolled implementations we consider.

 In ciphers with involutional components, the overhead due to the implementation of the inverse key scheduling can be reduced by adding some symmetry in the subkey sequence. But this may introduce weak keys or potential slide attacks. The fact that all components are involutions may also introduce some regularities in the cyclic structure of the cipher, which can be exploited in some attacks [Bir03]. The resistance of PRINCE to these type of attacks are extensively discussed in our main PRINCE publications [BCG+12a, BCG+12b].

 It is an open problem to prove the security of ciphers with ideal, involutional components against generic attacks. The main PRINCE publications [BCG+12a, BCG+12b] show that the ciphers with the α-reflection property (for α =6 0) has a proof of security similar to that of the FX-construction.

 Previous approaches to minimizing the overhead of decryption over encryption usually require multiplexers in each round while our approach requires a multiplexer only once at the beginning of the circuit.

The impact of these advantages can be seen better in the hardware implementations, which are presented in Section 8.4.

Choosing the Sbox The cost of the Sbox, i.e., its area and critical path, is a substantial part of the overall cost (see Section 8.3). Thus, choosing an Sbox that minimizes those costs is crucial for obtaining com- petitive results. As the cost of an Sbox depends on various parameters, such as the technology, the synthesis tool, and the library used, one cannot expect that there is one optimal Sbox for all environments. In fact, in order to achieve optimal results, it is preferable to choose your favorite Sbox. 4 4 In order to ensure the security of the resulting design, an Sbox S : F2 → F2 for the “PRINCE- family” has to fulfill the following criteria:

(1) The maximal probability of a differential is 1/4.

(2) There are exactly 15 differentials with probability 1/4.

(3) The maximal absolute bias of a linear approximation is 1/4.

(4) There are exactly 30 linear approximations with absolute bias 1/4.

(5) Each of the 15 non-zero component functions has algebraic degree 3.

As it can be deduced, e.g., from [LP07], there are only 8 Sboxes fulfilling those criteria up to affine equivalence. Thus, another way of defining an Sbox for the PRINCE-family is to say that it has to be affine equivalent to one of the eight Sboxes Si given in Table 8.1. Note that S0 is equivalent to the inverse function in F16 and the defined PRINCE Sbox is equivalent to S7.

85 Chapter 8. Targeting Low-latency: A Lightweight Block Cipher PRINCE

Table 8.1: All Sboxes for the PRINCE-family up to affine equivalence

S0 0x0, 0x1, 0x2, 0xD, 0x4, 0x7, 0xF, 0x6, 0x8, 0xC, 0x5, 0x3, 0xA, 0xE, 0xB, 0x9 S1 0x0, 0x1, 0x2, 0xD, 0x4, 0x7, 0xF, 0x6, 0x8, 0xC, 0x9, 0xB, 0xA, 0xE, 0x5, 0x3 S2 0x0, 0x1, 0x2, 0xD, 0x4, 0x7, 0xF, 0x6, 0x8, 0xC, 0xB, 0x9, 0xA, 0xE, 0x3, 0x5 S3 0x0, 0x1, 0x2, 0xD, 0x4, 0x7, 0xF, 0x6, 0x8, 0xC, 0xB, 0x9, 0xA, 0xE, 0x5, 0x3 S4 0x0, 0x1, 0x2, 0xD, 0x4, 0x7, 0xF, 0x6, 0x8, 0xC, 0xE, 0xB, 0xA, 0x9, 0x3, 0x5 S5 0x0, 0x1, 0x2, 0xD, 0x4, 0x7, 0xF, 0x6, 0x8, 0xE, 0xB, 0xA, 0x5, 0x9, 0xC, 0x3 S6 0x0, 0x1, 0x2, 0xD, 0x4, 0x7, 0xF, 0x6, 0x8, 0xE, 0xB, 0xA, 0x9, 0x3, 0xC, 0x5 S7 0x0, 0x1, 0x2, 0xD, 0x4, 0x7, 0xF, 0x6, 0x8, 0xE, 0xC, 0x9, 0x5, 0xB, 0xA, 0x3

Construction of Diffusion Matrices

As mentioned before, the 64-bit state is multiplied with a 64 × 64 matrix M in the round linear layer, and M 0 in the middle layer, which are defined below. We have different requirements for these two different linear layers. The M 0-layer is only used in the middle round, thus M 0 has to be an involution to ensure the α-reflection property. This requirement does not apply for the M-layer used in the round functions. Here we want to ensure full diffusion after two rounds. To achieve this, we combine the M 0-mapping with an application of matrix SR that behaves like the AES ShiftRows and permutes the 16 nibbles in the following way:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 −→ 0 5 10 15 4 9 14 3 8 13 2 7 12 1 6 11

Thus, M = SR ◦ M 0. In addition, as the implementation costs should be minimized, the number of ones in the matrices M 0 and M should be minimal, while at the same time guaranteeing that at least 16 Sboxes are active in 4 consecutive rounds (see [BCG+12a, BCG+12b] for details). So, trivially each output bit of an Sbox has to influence 3 Sboxes in the next round and therefore the minimum number of ones per row and column is 3. Hence, we can use the following four 4 × 4 binary matrices as building blocks for the M 0-layer:

 0000   1000   1000   1000           0100   0000   0100   0100  M0 =   ,M1 =   ,M2 =   ,M3 =   .  0010   0010   0000   0010  0001 0001 0001 0000

In the next step, we generate 4 × 4 block matrices Mˆ (i), where each row and column is a permutation of the four 4 × 4 matrices M0,...,M3. The row permutations are chosen such that we obtain a symmetric block matrix. The choice of the building blocks and the symmet- ric structure ensures that the resulting 16 × 16 binary “AES MixColumns-like” matrices are involutions. We define:

86 8.2. PRINCE Background

    M0 M1 M2 M3 M1 M2 M3 M0  M M M M   M M M M  ˆ (0)  1 2 3 0  ˆ (1)  2 3 0 1  M =   M =   .  M2 M3 M0 M1   M3 M0 M1 M2  M3 M0 M1 M2 M0 M1 M2 M3 Note once again that the following criteria were taken into account for the construction of these 16 × 16 binary MixColumns-like matrices:

 Branch number 4.

 Exactly 3 ones in each row and column.

 Exactly 3 ones in each 4 × 4 sub-block.

 Involution.

In order to obtain a permutation for the full 64-bit state, we construct a 64×64 block diagonal matrix M 0 with (Mˆ (0), Mˆ (1), Mˆ (1), Mˆ (0)) as diagonal blocks. The matrix M 0 is an involution with 232 fixed points, which is average for a randomly chosen involution [FS09, Page 596]. The linear layer M is not an involution any more due to the composition of M 0 and ShiftRows. Both matrices are given in the appendix (Part V). Also, more details on linear layer design can be found in the original publications [BCG+12a, BCG+12b].

The Key Expansion

0 The 128-bit key (k0||k1) is extended to a 192-bit key (k0||k0||k1) by a linear mapping of the form (k0||k1) 7→ (k0||P (k0)||k1) . This expansion should make the peeling of rounds (both at the beginning and at the end) by partial key guessing difficult for the attacker. A hardware-optimal choice for P is

P (x) = (x ≫ 1) ⊕ (x  63) , i.e., P (x63, . . . , x0) = (x0, x63, . . . , x2, x1 ⊕ x63). More details on the design decisions of key expansion is given in the original publication [BCG+12a, BCG+12b].

8.2.4 Security Analysis The security analysis of PRINCE is not in the scope of this thesis. Due to the interdisciplinary nature of this work, the author was focused on the selection and design of efficient building blocks for the cipher, together with the implementations. Therefore, we leave the security discussion of PRINCE to the original publications [BCG+12a, BCG+12b], in which the security claims for PRINCE and a detailed security evaluation of its general construction regarding classical attacks are provided. Note that these attacks do not only include linear, differential, and algebraic attacks, but also the recently-introduced biclique attacks [BKR11]. It is worth mentioning that the performed security analysis together with third-party analyses have not revealed any critical and/or practical weaknesses of PRINCE so far.

87 Chapter 8. Targeting Low-latency: A Lightweight Block Cipher PRINCE

8.3 Hardware Investigations

Besides the main target low-latency, low-cost hardware implementation is one of the design ob- jectives of PRINCE. To achieve low-latency, a fully-unrolled design is considered for PRINCE, which makes it very difficult to come up with a low-cost implementation if the existing solu- tions for the block cipher layers are used. However, with “carefully-tailored” new solutions for PRINCE components, the given targets can be achieved. Therefore, during the design pro- cess of PRINCE, the cost of each function was investigated exhaustively and each component was designed in order to get the lowest possible gate count without compromising security. One of the most critical and expensive operations of the cipher is the substitution, where we use the same Sbox 16 times (rather than having 16 different Sboxes). Therefore, the design efforts of PRINCE started with an Sbox search, to find the most suitable one for the target design specifications. We analyzed many Sbox instances to identify a candidate with optimal combinational logic and propagation paths. In order to achieve an implementation with low- latency and a low gate count, the targeted unrolled design is implemented with the resulting optimal Sbox. The hardware cost of the second critical part, linear layer, is also investigated carefully. The underlying diffusion matrices are designed with the least possible number of ones, in turn XOR operations, in order to keep the gate count as low as possible while keeping the security at the desired level. In the following sections, we explain both processes in detail.

8.3.1 Searching for the Optimal Sbox As mentioned before, Sboxes are the most critical building blocks in existing block cipher implementations, in terms of both area and delay. In our design, we tried to overcome this area and propagation delay problem. Unfortunately, neither of the existing approaches on Sboxes and their implementations [SSA+07, HAHH06, RPP08, KY10, Rij00, MPL+11, GC09, Can05] is applicable to PRINCE, where the design target is an unrolled cipher with low-area and low-latency. The only viable solution is to implement a very low gate count combinational 4-bit Sbox with optimal propagation paths. Therefore, a brute-force-like search is performed to find such an Sbox. We followed a two-step approach in our “search for an optimal Sbox for PRINCE”. Firstly, we generated 64000 candidate Sboxes, which are based on the 8 core Sboxes satisfying the properties introduced in Section 8.2.3. These 64000 Sboxes are simply the affine equivalents of the core Sboxes. Then, all Sboxes and their inverses (in total 128000) are synthesized for minimum area. As PRINCE is mainly targeted for ASIC platforms, the synthesis is performed on ASIC with different technology libraries. In the analysis process, Cadence Encounter RTL Compiler v10.1 is used for synthesis. Since the gate count and delay parameters are heavily technology-dependent, the implementations have been synthesized for two different technology libraries: 90 nm low-leakage Faraday Library from UMC, and 45 nm generic NANGATE Open Cell Library. In all syntheses, typical operating conditions are assumed. The synthesis took approximately 15 days (nearly one week for synthesis in each technology) on our cluster. Fig- ure 8.1 shows the distribution of the Sboxes, inverse Sboxes, and the average of both with respect to gate counts in GE for 45 nm NANGATE and 90 nm UMC technologies. Note that the vertical axis “Number of Sboxes” denotes the Sbox frequency for the corresponding GE.

88 8.4. Hardware Implementations

Figure 8.1: Sbox distributions with respect to gate counts

Following the first – so-called – elimination step, the smallest 100 Sboxes on both technologies are selected and then re-synthesized for minimum power and minimum delay. The reason for the elimination was mainly the synthesis time of the Sboxes – it would take even longer with the given restrictions. This process is repeated for three different process technologies, this time also including 130 nm low-leakage Faraday Library from UMC. The only Sbox that is in top 10 in terms of area, power, and speed in all three technologies is selected to be used in PRINCE, which was previously given in Section 8.2.2.

8.3.2 Low-cost Linear Layer

The linear layer applications in PRINCE can actually be defined as binary matrix multiplica- tions. As a result, the HW of the multiplied matrix gives the number of XOR operations, which estimates the cost of the linear layer (excluding the synthesizer optimizations). We therefore tried to keep the number of ones in our diffusion matrices as low as possible. As already de- scribed in Section 8.2.3, the two 64 × 64 diffusion matrices that we use in PRINCE are made up of 4 16 × 16 binary matrices. Each of these matrices have 3 ones per row and column, which is a requirement for the needed security level, as explained before. In total, the HW of each diffusion matrix is 192, which means that the cost of our linear layer is approximately equal to the cost of 192 XOR gates. The cost in GE of course differs for each technology library; however, the cost of an XOR gate is nearly 2.25 − 2.5 GE for most of the technology libraries. Hence, each application of the linear layer costs approximately 450 GE. Except for the zero-cost wiring, which is the case for PRESENT’s permutation layer in hardware implementations, our linear layer is one of the cheapest designs for the given security level. More realistic results (synthesis output) is presented in the following section.

8.4 Hardware Implementations

In this section, we provide the hardware implementations of PRINCE. Both unrolled (which is the real target), and round-based architectures of PRINCE are considered for implementation. Together with the results, a comparison of PRINCE to other block ciphers in literature is also provided.

89 Chapter 8. Targeting Low-latency: A Lightweight Block Cipher PRINCE

In the implementation process, Cadence NCVerilog 06.20-p001 is used for simulation and Cadence Encounter RTL Compiler v10.1 is used for synthesis. Since the gate count and delay parameters are heavily technology dependent, the implementations have been synthesized for three different technology libraries: 130 nm and 90 nm low-leakage Faraday Libraries from UMC, and 45 nm generic NANGATE Open Cell Library. All ciphers are implemented in Verilog-HDL and typical operating conditions were assumed in all syntheses.

8.4.1 Unrolled Architecture

The unrolled version of PRINCE is a direct mapping of the cipher defined in Section 8.2.2 to hardware. Multiplexers, which select the encryption and decryption keys accordingly, are the only encryption/decryption overhead. The only costs associated with the key whitening stages are XOR gates and multiplexers used for whitening key selection. However, in practice, due to the unrolled nature of the implementation, these additions reduce to XOR operations with constants, which in turn reduce to inverters or no additional gates at all. Furthermore, these inverters are combined with the preceding or following matrix multiplications, which are implemented with cascaded XOR gates. In cases where an XOR is sourced by the output from an inverter, or is sourcing the input of an inverter, it is simply replaced by an Exclusive NOR (XNOR) gate and the sourced/sourcing inverter is removed. Since both XOR and XNOR have the same gate count, the overall effect of the round constant addition on area reduces to zero. Unrolled implementation results are listed in Table 8.2 for different technologies with respect to different timing constraints. Together with PRINCE, we also implemented PRESENT-80, PRESENT-128, LED-128, and AES-128. The same metrics are also applied to these ciphers in order to adequately evaluate the achievements of our new cipher (note that in some cases the key size – and also our security claim – is different: PRINCE does not claim to offer 128-bit security and security against related key-attacks). In order to achieve both encryption and decryption capability in PRESENT and LED, both Sboxes and inverse Sboxes are implemented and their output is selected by a multiplexer. This, in turn, doubled the Sbox area with respect to an encryption-only implementation. For AES, we just had to implement the inverse affine transform since the finite field inversion module could be shared between encryption and decryption. In addition to this comparison, Table 8.3 shows the extrapolated results (which are calculated by removing register and control logic area from the total gate count, and multiplying the rest by the number of rounds) for other unrolled cipher instances obtained from round-based cipher implementations provided by previous works. Note that all ciphers in the table include encryption and decryption functionality with 128-bit key size, however the comparison is difficult as the block size is different in some cases (also note that the ciphers having 128-bit block size are obviously much bigger and more power consuming than a 64-bit block cipher). We also measured the maximum frequencies achievable by the unrolled versions of PRINCE under two different conditions: The frequency where the area of the synthesized design starts to deviate from the unconstrained area – 158.9, 38.4 and 35.5 MHz, and the frequency where the timing slack becomes zero – 212.8, 71.8 and 54.3 MHz. Both figures are given for 45 nm, 90 nm, and 130 nm, respectively.

90 8.4. Hardware Implementations

Table 8.2: Area/power comparison of unrolled versions of PRINCE and other ciphers

Tech. Nangate 45 nm Generic UMC 90 nm Faraday UMC 130 nm Faraday Constr.(ns) 1000 3162 10000 1000 3162 10000 1000 3162 10000 PRINCE Area(GE) 8260 8263 8263 7996 7996 7996 8679 8679 8679 Power(mW) 38.5 17.9 8.3 26.3 10.9 3.9 29.8 11.8 4.1 PRESENT-80 Area(GE) 63942 51631 50429 113062 49723 49698 119196 51790 51790 Power(mW) 1304.6 320.9 98.0 1436.9 144.9 45.5 1578.4 134.9 42.7 PRESENT-128 Area(GE) 68908 56668 55467 120271 54576 54525 126351 56732 56722 Power(mW) 1327.1 330.4 99.1 1491.1 149.9 47.8 1638.7 137.4 43.6 LED-128 Area(GE) 109811 109958 109697 281240 286779 98100 236770 235106 111496 Power(mW) 2470.7 835.7 252.3 5405.0 1076.3 133.7 5274.8 1133.9 163.6 AES-128 Area (GE) 135051 135093 118440 421997 130835 118522 347860 141060 130764 Power (mW) 3265.8 1165.7 301.6 8903.2 587.4 186.8 8911.2 876.8 229.1

Table 8.3: Extrapolated area of unrolled versions of other ciphers against PRINCE

Area* (GE) CLEFIA-128 [AH11] 28035 (18 rounds unrolled, 130nm CMOS) HIGHT-128 [HSH+06] 42688 (32 rounds unrolled, 250nm CMOS) mCrypton-128 [LK06] 37635 (13 rounds unrolled, 130nm CMOS) Piccolo-128 [SIH+11] 25668 (31 rounds unrolled, 130nm CMOS) * Area requirements extrapolated from round-based implementations.

8.4.2 Round-based Architecture

PRINCE is also synthesized as a round-based implementation to make a fair comparison with the existing works in the literature. Figure 8.2 shows the block diagram for the round-based implementation. To get a low-cost round-based implementation, we tried to maximize shared use of the operational blocks. This way, double use of the resources can be avoided. In our case, matrix multiplications in the linear layer gave a larger gate count than both the Sbox and inverse Sbox layers; therefore, by taking this layer in the middle of the round function (instead of putting Sboxes in the middle) and building the other blocks accordingly, we have achieved the smallest possible area. A comparison of PRINCE round-based implementation with PRESENT-80, PRESENT-128, LED-128, AES-128, which are implemented in our technology with the same metrics that we use for PRINCE, is shown in Table 8.4. We also provide a comparison of PRINCE with the reported results of other ciphers in the literature, such as CLEFIA, HIGHT, mCrypton, and Piccolo, which follows in Table 8.5. As in the unrolled case, all implementations are for encryption and decryption functionality with 128-bit key size.

91 Chapter 8. Targeting Low-latency: A Lightweight Block Cipher PRINCE

' k0 k0 

k1

S -1 ptext S M’ ctext SR-1 SR

k1 k1 k1 ' RC+α RC RC RC+α k0  k0

Figure 8.2: Round-based implementation of PRINCE

Table 8.4: Performance comparison of round-based versions of PRINCE and other ciphers

Nangate 45 nm Generic UMC 90 nm Faraday UMC 130 nm Faraday Area Freq. Power Tput Area Freq. Power Tput Area Freq. Power Tput (GE) (MHz) (mW) (Gbps) (GE) (MHz) (mW) (Gbps) (GE) (MHz) (mW) (Gbps) PRINCE 3779 666.7 5.7 3.56 3286 188.7 4.5 1.00 3491 153.8 5.8 0.82 PRESENT-80 3105 833.3 1.2 1.67 2795 222.2 2.1 0.44 2909 196.1 2.5 0.39 PRESENT-128 3707 833.3 1.6 1.67 3301 294.1 3.4 0.59 3458 196.1 2.9 0.39 LED-128 3309 312.5 0.5 0.41 3076 103.1 1.9 0.13 3407 78.13 2.4 0.10 AES-128 15880 250.0 5.8 2.91 14691 78.1 14.3 0.91 16212 61.3 18.8 0.71

Table 8.5: Performance of round-based versions of other ciphers against PRINCE

Technology Area Tput @100kHz (nm) (GE) (kbps) PRINCE 130 3491 533.3 CLEFIA-128 [AH11] 130 2678 73.0 HIGHT-128 [HSH+06] 250 3048 188.3 mCrypton-128 [LK06] 130 4108 492.3 Piccolo-128 [SIH+11] 130 1260 237.0

8.4.3 Performance of Both Architectures on FPGA

In order to observe the performance of the above-mentioned architectures (Sections 8.4.1 and 8.4.2) on reconfigurable devices, we also mapped them to an FPGA platform. Al- though PRINCE is targeted for ASIC platforms, the performance of PRINCE on FPGA could be of interest for researchers, e.g., for prototyping of different cryptographic applications,

92 8.5. Software Implementations for utilization in reconfigurable applications, etc. The cores were synthesized, mapped, and place & routed on Xilinx Virtex-6 ML605 Evaluation Board (with XC6VLX240T FPGA on it) using Xilinx ISE Design Tool. Note that the presented figures are only the place & route numbers, and we do not provide any comparisons – these numbers are given just to provide some informative results. Our results are shown in Table 8.6 and Table 8.7 for unrolled and round-based implementations, respectively.

Table 8.6: Performance comparison of unrolled versions of PRINCE and other ciphers on FPGA

# of slice LUTs # of (all used as logic) occupied slices PRINCE 1178 out of 150720 504 out of 37680 PRESENT-80 5662 out of 150720 2192 out of 37680 PRESENT-128 5611 out of 150720 2224 out of 37680 LED-128 15560 out of 150720 5034 out of 37680 AES-128 19929 out of 150720 6591 out of 37680

Table 8.7: Performance comparison of round-based versions of PRINCE and other ciphers on FPGA

# of slice regs. # of slice LUTs # of (all used as FF) (all used as logic) occupied slices PRINCE 74 out of 301440 571 out of 150720 224 out of 37680 PRESENT-80 150 out of 301440 347 out of 150720 108 out of 37680 PRESENT-128 198 out of 301440 399 out of 150720 143 out of 37680 LED-128 79 out of 301440 506 out of 150720 176 out of 37680 AES-128 269 out of 301440 3031 out of 150720 1133 out of 37680

8.5 Software Implementations

In order to observe the software performance, PRINCE is also implemented both in C and 8-bit AVR assembly. The runtime, code size, and energy consumption of the C implementation is evaluated on the evaluation board EnergyMicro EF32GG-STK3700, which has a 32-bit ARM CORTEX-M3 microprocessor2. As the compiler, we selected SimplicityStudio IDE, which is based on Eclipse and uses the GNU ARM C compiler. The results are given in Table 8.8. The performance of PRINCE on Atmel ATmega8A microcontroller is presented in Table 8.9. Note that the Sbox in AVR implementation is implemented in its ANF, which is used in bitsliced

2Our C code was run on this microcontroller by Jan Zimmer.

93 Chapter 8. Targeting Low-latency: A Lightweight Block Cipher PRINCE implementations in order to achieve better software performance (see following chapters for bitslicing).

Table 8.8: Performance of PRINCE on 32-bit ARM CORTEX-M3

Run time Code size Energy (clock cycles) (bytes) (µJ) [Enc/Dec] [Enc/Dec] PRINCE 42881/43980 1447 102.4/102.8

Table 8.9: Performance of PRINCE on 8-bit Atmel ATmega128

Run time Code size (clock cycles) (bytes) [Enc] PRINCE 3614 1108

8.6 Investigations on Countermeasures Against Potential SCA on PRINCE

Besides the security of the algorithm itself, susceptibility of its implementation to SCA is one of the main concerns when the algorithm is implemented in a hostile environment. In this section, we define the possible side-channel adversary model that we take into account while evaluating the vulnerability of an implementation of PRINCE and state the potential countermeasures together with a discussion of their feasibility with respect to the additional resource requirements on area and time.

8.6.1 Adversary Model According to the definition given in [MM12]:

 An attack that combines v different time instances – usually in v different clock cycles – of each power trace is called v-variate attack, and

 the order of an attack – regardless of v – is defined by the order of the statistical moments that are considered in the attack.

Since an implementation of PRINCE is supposed to perform, e.g., an encryption in one clock cycle, the attack domain is restricted to univariate attacks only. On the other hand, if we suppose that a masking scheme is applied to avoid a dependency of the side-channel leakage with

94 8.6. Investigations on Countermeasures Against Potential SCA on PRINCE processed (intermediate) values, all shares, e.g., the mask and masked data, must be processed at the same time. This poses another restriction to the attack domain of first-order attacks. According to the scenarios, in which PRINCE is intended for use, we consider two cases where the adversary measures the corresponding side-channel leakage with either plaintext or ciphertext selected as input. Therefore, we define the adversary model as a side-channel attacker, who is able to perform a first-order univariate attack on a set of collected side-channel observations of either random or selected inputs/outputs of the cipher.

8.6.2 Potential Countermeasures for PRINCE Against SCA Due to the operational conditions of PRINCE, the options to deploy a side-channel counter- measure are significantly restricted. For instance, shuffling is clearly not an option since all the operations are performed in one clock cycle. In the domain of masking, a possible solution can be threshold implementation [NRS11], which is based on – minimum 3-shared – secret sharing (Boolean masking) scheme and multi-party computation technique. If implemented correctly, it can provide security against first-order attacks. Below we discuss the possible options to realize a threshold implementation of PRINCE.

Choosing an Sbox for a Threshold Implementation Using a threshold implementation to encrypt a block in a single clock cycle, we cannot di- rectly apply the principle of decomposing an Sbox into several layers. We have to focus on 4 threshold implementations in a single stage. The PRINCE Sbox corresponds to the class C231 in [BNN+12], resulting in a cost of approximately 2000 GE due to 5 shares while avoiding de- composition. Thus, even the protection of only a single round at the beginning and at the end of the algorithm leads to a cost that is prohibitively high. We therefore investigated the case of a 3-bit Sbox instead of a 4-bit Sbox. We consider 3 3 3 S : F2 → F2. A component function is a Boolean function Sb : F2 → F2 defined as a linear combination (speficied by b) of output bits of S, i.e., Sb(x) = hb, S(x)i. Up to affine equivalence, there are only 4 classes of 3-bit Sboxes. The four classes can actually be classified by the number of linear component functions they have. One equivalence class can be classified by having 8 linear component functions, the second one has 4 linear component functions, the third has 2 linear component functions and the last has no linear component functions. The last class is clearly the only one that is cryptographically strong and therefore the only one of interest for us. This function corresponds in particular to the inverse function in F23 . The main cryptographic criteria of this class are as follows.

(1) The maximal probability of a differential is 1/4.

(2) The maximal absolute bias of a linear approximation is 1/4.

(3) Each of the 7 non-zero component functions has algebraic degree 2.

3 + The function furthermore corresponds to the class Q3 [BNN 12]. In order to reduce the hardware cost of a masked 3-bit Sbox to be used in an SCA-resistant implementation, we need to select one of the smallest Sboxes of this class. We therefore synthe- sized all Sboxes in the class, in a similar fashion to the syntheses presented in Section 8.3.1 for

95 Chapter 8. Targeting Low-latency: A Lightweight Block Cipher PRINCE

4-bit Sboxes. The area of 3-bit Sboxes ranged from 6.5 to 11.75 GE – out of these, 36 Sboxes were with an area requirement of 6.5 GE. Note that even if we suppose that we can decompose a 3-bit Sbox into two stages, the cost of implementing a 3-shared version of a single Sbox results roughly in 100 GE (possibly higher for the 4-shared version without decomposition).

Further Design Issues Using 3-bit Sboxes in special masking-rounds is non-trivial, since it complicates the security analysis significantly. Our considered adversary is able to choose the plaintexts, hence he can extend the attack to further rounds by partially fixing the plaintexts. Furthermore, in order to make it more difficult for the attacker to attack later rounds, we have to use a rather complex linear layer for high diffusion after two Sbox layers. We therefore considered a linear layer as a 64 × 64 binary matrix with 9 ones per row and column (11 for its inverse), which ensures a maximal diffusion. The operations of a protected round of PRINCE are listed in the following together with their estimated cost (estimations are for 90 nm low-leakage Faraday Library from UMC).

(1) A key addition (150 GE)

(2) 21 3-bit protected Sboxes: 1-bit goes through the Sbox layer unmodified (2100 GE)

(3) A linear layer: A 64 × 64 binary matrix with 9 ones per row and column – 11 ones per row and column for its inverse (1300 and 1600 GE, respectively)

(4) A second key addition (150 GE)

(5) Again the same Sbox layer (2100 GE)

Note that the linear layer and key addition costs are tripled after sharing. Therefore, one protected-round of PRINCE results in at least 7 kGE, which leads to a total cost of approx- imately 23 kGE for an unrolled implementation of the overall cipher – excluding the cost as- sociated with the for the masks. Nevertheless, even this cost is prohibitively high compared to the original cipher.

8.7 Concluding Remarks

In this chapter, we described the new low-latency and low-cost cipher PRINCE, together with a variety of its implementations on different platforms, ranging from ASICs and FPGAs to software implementations. PRINCE results in a very small footprint, while achieving lowest latency compared to the existing lightweight solutions without compromising security. The compactness of PRINCE is the result of an extensive search on the layers of the cipher, es- pecially on the Sbox. We also considered potential SCA-resistant constructions for PRINCE; however, such countermeasures increase the cipher area significantly. While PRINCE so far has been resistant to mathematical attacks and analysis, it is of course open to further analysis and comments.

96 Chapter 9 Towards Software-efficiency: NLU Instruction Set Extension

In this chapter, we take a step towards the software-efficiency of cryptographic algo- rithms. We address the open question of efficient cipher implementations on CPUs by introducing a non-linear/linear ISE, namely NLU, which is capable of imple- menting non-linear operations, namely Sboxes, expressed in their ANF and linear operations expressed as binary matrix multiplication, “multiply-and-add”, form. We propose a modular 8-bit architecture for NLU, which allows its adoption to dif- ferent CPU sizes for possible future extensions. However, in this work, we base our NLU design only on the widely-used 8-bit AVR instruction set. In order to demonstrate the coding advantage of the proposed ISE, we present 8-bit AVR im- plementations of standard cryptographic algorithms both “with NLU” and “with- out NLU”.

Contents of this Chapter 9.1 Introduction ...... 97 9.2 Related Work and the Proposed ISE Model ...... 98 9.3 Unified Non-linear/Linear Hardware Unit ...... 101 9.4 Applications ...... 105 9.5 Results and Discussion ...... 114 9.6 Conclusion and Future Directions ...... 115

9.1 Introduction

Most of the progress in lightweight cryptography has been on the efficiency of the hardware implementations so far. However, lightweight designs with respect to software have barely been addressed. This is in contrast to the fact that many pervasive devices are equipped with embedded CPUs instead of dedicated ASICs and a large percentage of ciphers are implemented in software on these CPUs. Furthermore, even if a cipher is implemented in hardware on an end-device, e.g., on an RFID tag, it is not uncommon that a corresponding embedded software implementation is needed on the other end of the communication link, e.g., within an RFID reader. As most of such devices come without a support for cryptographic operations, and due to the lack of software-oriented cryptographic solutions, the realizations of these applications unfortunately result in costly implementations.

97 Chapter 9. Towards Software-efficiency: NLU Instruction Set Extension

As a result of their hardware-efficient nature, the existing lightweight ciphers are generally based on hardware-friendly but software-unfriendly operations. For instance, bit-permutations with zero costs in hardware require several number of CPU cycles and/or lookup tables in software. Also, Sboxes in the substitution layer often require relatively large lookup tables in software, such as the T-tables in the case of AES [Ko¸c09]. This development has somewhat led us to the paradox situation: Many of the “lightweight” and/or widely-used ciphers are not particularly lightweight with respect to their software implementations. In this work, we try to improve the cryptographic block cipher implementations on embedded CPUs by proposing an extension unit. Our approach is based on a non-linear/linear operation unit, namely NLU, which is capable of computing common cipher operations substitution (non-linear) and permutation (linear). In NLU, non-linear operations (Sboxes) are expressed in their ANF and linear operations are expressed as binary matrix multiplications (we name it as multiply-and-add). NLU targets embedded microcontrollers and is, hence, 8 bits wide1. We designed NLU as an extension to the widely-used Atmel AVR instruction set. The modular architecture of NLU allows its adoption also to 16, 32, 64, and even 4-bit CPUs, giving the flexibility to be used with different microcontrollers and microprocessors in future. For demonstration purposes, we show that NLU can result in time-area product reductions between 20-68% for the widely-used and standardized ciphers PRESENT, CLEFIA, Serpent, and AES. Note that Serpent and AES do not claim to be lightweight, but they are selected for different reasons. For instance, Serpent is a bitsliced design (see [ABK98]), which results in better software implementations. AES is the NIST standard and widely-used in many embedded applications. Interestingly, it also has better software performance in comparison to most of the existing lightweight ciphers. Therefore, for both ciphers, it is of interest to see the amount of cost reduction when implemented using NLU.

9.2 Related Work and the Proposed ISE Model

The performance enhancement of software implementations of cryptographic algorithms is not entirely a new idea. There have already been several studies on this subject. However, most of the previous works rely on implementation of cipher-specific instructions, which – in hardware – maps to plugging the specific cipher as a coprocessor into the main module [OE10, GGP08]. This approach, while providing the best performance improvement in terms of execution time, results in a considerable increase in microprocessor area. For example, in [OE10], the authors state that the area increase in the total area of the processor to be less than 65%. Furthermore, such an approach limits the performance improvement to a few specific ciphers. For instance, in addition to well-known Intel AES ISE [Sha12], there have been specific ISE proposals for AES target- ing better performance on 32-bit processors [TsS05, Ts06] and 8-bit microcontrollers [TH08]. Moreover, [Ts05] suggests using ECC extensions for AES acceleration, which supports more cryptographic operations compared to others. However, these works are still limited to a few specific algorithms, such as AES and ECC, and they do not support many different crypto al- gorithms at the same time. Other works try to address the software performance enhancement issue by introducing relatively complex instruction utilization [LSY01, ML01, SYL08, Rub07].

1Note that 8-bit CPUs still hold a huge share of the worldwide microprocessor market.

98 9.2. Related Work and the Proposed ISE Model

Note that many of these works, except [TH08], do not really target lightweight cryptography or resource-constrained devices. The resource limitation on constrained devices is two-fold: In terms of hardware, it should oc- cupy the smallest possible silicon area; in terms of software, it should be realized with minimum number of instructions. These two requirements usually contradict each other. The smallest possible silicon area is achieved by implementing all cryptographic operations in software on a small-footprint embedded processor; while the smallest possible code size can be achieved by dedicated hardware – either in the form of specific crypto-units, or crypto-coprocessors attached to an embedded processor. Unfortunately, smallest silicon area results in the maximum number of instructions and longest execution time, while the shortest code results in the highest silicon area. In our case, we bridge the gap between these two approaches, which also comes with a compromise: We can neither reach the minimal silicon area nor the shortest code. Instead, we try to find a compromise point by introducing minimum additional silicon area that results in a code size reduction (especially time-area product reduction of about 20-68%, which can be considered as “high” for most applications). We also target a generic solution that would apply to (nearly) any cryptographic block cipher. We take the standard ALUs, which exist even in the simplest microcontrollers, as our base approach. In fact, this is also the basis for Floating Point Unit (FPU)s, where floating point instructions are realized on specific hardware modules. These modules are physically-connected to the main data bus as the ALU and logically-connected to the instruction decoder, which forwards each floating point instruction to the FPU. The previous works on cryptographic instruction set extensions have also proposed additional “crypto” units that function exactly like the ALU or the FPU. However, unlike the previous cipher-specific extensions, we propose a non-linear/linear op- eration unit. We design the NLU as a single input (source) and single output (destination) unit, where the additional operands (ANF or binary matrix coefficients) are stored inside a configuration register. The choice of the operations relies on investigation of several different lightweight and regular ciphers (both block and stream) [BKL+07, BCG+12a, SSA+07, NIS01, ABK98, HSH+06, CDK09, GNL11, LK06, GPPR11, SIH+11, HJMM08, BD08, CP08] as well as recently-introduced lightweight hash functions that use block-cipher-like internal permuta- tions [BDPA11, BKL+11, GPP11, AHMNP13]. The two common operations in all of these ciphers are substitution (the non-linear operation to introduce confusion) and permutation (the linear operation to introduce diffusion). In the following, we explain the realization of these two operations.

9.2.1 Realization of Substitution

Substitution can be realized in software either by table-lookup or by its ANF expression. For lightweight ciphers, generally 4-bit Sboxes are chosen, unlike the 4-to-6 transformation in DES [NIS99] or the 8-to-8 transformation in AES. This led us to support only 4-bit substi- tutions in NLU, along with the fact that 8-bit substitutions are more efficiently implemented as lookup tables. In a lookup table implementation, a 4-bit substitution means a maximum of 16 different 4-bit output words (for a bijective substitution) selected by 4-bit input, resulting in a total storage

99 Chapter 9. Towards Software-efficiency: NLU Instruction Set Extension of 16 × 4 = 64 bits. In an ANF-based implementation, each output bit is expressed in terms of the 1st to 4th degree product terms of input bits masked with 16 bits of ANF coefficients, which again results in a total storage of 16 × 4 = 64 bits. In other words, both approaches result in the same storage cost for 4-bit substitutions. However, at the end of our area cost analyses, we have observed that ANF gives the better area results; so, we opted for ANF in the non-linear part of NLU as it might also provide additional flexibility for the implementations of different ciphers (in turn, different Sboxes). Furthermore, it could even be used in the implementation of non-substitution type of operations, which might still require ANF form. Another important parameter regarding the substitution operation is the input/output width. While we have chosen to support 4-bit substitutions only, the input operand width is enforced by the target microcontroller architecture. For most of the lightweight embedded applications, this starts from 4 bits (very low-end processors) and might go up to 32 bits (high-end embedded processors). Since the operations within the substitution layer of a cipher can be executed in parallel, it is possible to put one or more of the 4-bit Sboxes within the same substitution module, where nibbles of the input operand are distributed to each of these Sboxes, and their output nibbles are concatenated to form the output operand of 4 to 32 bits. In this case, only the number of the Sboxes inside the substitution module will change (one for 4-bit, eight for 32- bit). Apart from the configuration of the ANF coefficients, substitution requires an instruction with source and destination registers/memory addresses as operands. We accomplish this functionality via the NNL (non-linear operation) instruction of NLU, which accepts source and destination as input and output arguments, respectively. In our case of 8-bit embedded processors, 8-bit content of the source is split into two nibbles, on which the non-linear operation described by the ANF coefficients is performed in parallel. The two 4-bit results are then combined to form the output for the destination.

9.2.2 Realization of Permutation

Permutation requires input/output widths larger than (almost always multiples of) Sbox widths. Therefore, in our case of 4-bit Sboxes, the permutation layer should be realized at least 8 bits wide. However, the upper limit for that is also enforced by the target microcontroller’s datapath width. For instance, with an operand width of 8 bits, it is infeasible to implement a 64-bit generic permutation block directly in one instruction. Hence, we decided to design our “permutation” hardware module as an 8 × 8 binary matrix multiplication, where we can use the 8-bit operand effectively for many different datapath widths. The choice of 8 × 8 is mainly due to the 64 bits storage required for the matrix coefficients, so that the storage required for the ANF coefficients can be reused. In our scheme, it is possible to implement the “multiplication with a large matrix” by means of repetitive multiply-and-adds with the small 8×8 matrices. In other words, the permutation layer can be realized with a matrix multiplication unit and an accumulator. Since loading coefficients for each of these smaller matrices requires additional clock cycles and code space, it also makes sense to make reuse of each small matrix. This can be made possible by implementing a FIFO register of depth d instead of a single register (d = 1) for the accumulator part of permutation unit. FIFO register depth d is yet another parameter for the design. Our investigations on several different ciphers showed us that most permutation operations require inputs from a maximum

100 9.3. Unified Non-linear/Linear Hardware Unit of 32 bits (8 Sboxes), which means that choosing d = 4 would be sufficient for most applications. It should be possible to accumulate on to any desired part of the FIFO register, which means choosing from one of the four bytes of the 32 bits. Furthermore, the permutation operation should also be run in multiply-only mode. In order to realize this, there are two options: We need either a single instruction with an input and output operand, 1-bit for the mode selection of (multiply-only or multiply-and-add), and 2 bits for the FIFO output selection in the multiply-and-add mode; or two separate instructions – one for multiply-only, and one for multiply-and-add. In our design, we have opted for the latter case and implemented two new instructions – NMU (multiply-only) and NMA (multiply-and-add). Both instructions take the usual source and destination registers as input and output arguments, and perform binary matrix multiplication on the input byte. For NMU, this is the final result. In the case of NMA, the result of the multiplication is added onto the content of the sth register in the FIFO in order to obtain the final result, where s is the second input argument for this instruction. In both cases, the final result is both sent to the destination register and pushed into the FIFO.

9.2.3 Loading the Substitution and Permutation Coefficients

The configuration of the bits for the ANF or binary matrix coefficients is the last issue in our instruction set architecture. In parallel to our choice of 8-bit operands, we can perform this configuration by 8 bits at a time. It can be done either by writing 8 bits into a target portion of 64 bits selected via 3-bit address, or by shifting 8 bits into the 64-bit configuration register. The latter option is a better choice when supported with the selection of the number of bits to be shifted. In our case, the shifting changes from 1 to 8 bits, which is selected by 3 bits, resulting in an instruction with an 8-bit immediate value for the configuration and 3 bits of length selection. This configuration is accomplished via the NLD (load NLU configuration) instruction. Its input arguments are the 8-bit input stream, K, and the 3-bit value for the number of bits to be shifted, n. NLD has no output arguments, as the NLU configuration register is the default target.

9.2.4 NLU Instructions

Details of the NLU instructions (supported by hardware choices) are given in Table 9.1. These instructions assume the 8-bit AVR microcontroller architecture with 16-bit instructions, which is very common for most of the embedded processors. However, they can also be modified for larger (or smaller) instruction widths. Note that Rs stands for the source register and Rd stands for the destination register. Other parameters are already mentioned in the previous sections.

9.3 Unified Non-linear/Linear Hardware Unit

In order to realize the explained NLU instructions, we implement a unified hardware unit performing non-linear and linear operations as explained in Section 9.2. In this unified structure, we have non-linear and linear operational sub-units, a 64-bit configuration register to store the ANF/matrix coefficients, and also an accumulation module for the add part of the linear

101 Chapter 9. Towards Software-efficiency: NLU Instruction Set Extension

Table 9.1: NLU instructions

Instr. Syntax Operation Description NLD NLD n , K Load NLU CONF ← CONF  K[MSB−n], if n>0 configuration CONF ← CONF  K , else NNL NNL Rd, Rs NLU non-linear Rd(7:4) ← ANF[Rs(7:4)] operation Rd(3:0) ← ANF[Rs(3:0)] NMU NMU Rd, Rs NLU multiply Rd ← M × Rs operation FIFO ← FIFO  M × Rs NMA NMA s , Rd, RsNLU multiply-and-addRd ← M × Rs ⊕ FIFO(s) operation FIFO ← FIFO  [M × Rs ⊕ FIFO(s)]

mac acc 64-bit Register 3 64 sel push dinp 8 0 0 0 0 1 1 1 1 Non-linear Unit Linear Unit 1 0 0 1 2 3 0 mode 0 1

2 dout sro 8

Figure 9.1: Overall structure of NLU

multiply-and-add operation along with depth-4 FIFO registers. The overall block diagram of this structure is given in Figure 9.1.

In our NLU proposal, the push input is activated whenever an NLD instruction is issued. The immediate value part of the instruction is pushed from right to left in n bits, which results in t clock cycles (where 1 ≤ t ≤ 8) to store a new 64-bit coefficient data. During the configuration phase, the register acts like a shift-register. sel selects the number of most significant bits of the immediate value to be pushed. This is accomplished by an 8-to-1 barrel-shifter of 64 bits (not shown on the circuit schematic). After the configuration of the coefficient values, NLU can be operated in either non-linear or linear mode with respect to the issued instruction. The 64-bit coefficient data from the configuration register is connected to both the non-linear and linear units together with the 8-bit input data (operand). The output of the NLU is selected by the mode input – non-linear result or linear result. mode is set to 0 for NNL instruction, and to 1 for NMU and NMA instructions.

102 9.3. Unified Non-linear/Linear Hardware Unit

9.3.1 Non-linear Unit

The non-linear unit consists of many AND and XOR gates, which actually express the Sbox in ANF, as mentioned previously. The input (masking) coefficient values define the ANF coeffi- cients of the Sbox used in the cipher. In our case of 8-bit operand, we use two identical Sboxes in parallel, thereby sharing the same ANF masking coefficients. The most significant 4 bits of the input data go into the first Sbox and the least significant 4 bits to the second Sbox. The outputs of the Sboxes are concatenated to form the 8-bit output in the same order as the input bits. For each Sbox output, the first 16 bits of the mask value define the first bit of the Sbox result, the second 16 bits define the second bit, and so on. A detailed schematic of the non-linear unit with all the masking coefficients (m0...63) is shown in Figure 9.2.

9.3.2 Linear Unit

The linear operations unit partly performs bitwise operations expressed in multiply-and-add form. The linear unit handles the multiply-only step by multiplying the 8-bit input data with an 8 × 8 binary matrix, whose coefficients are taken from the 64-bit configuration register. This matrix multiplication can be expressed as follows.

out7 = ( m63 ∧ in7 ) ⊕ ... ⊕ ( m56 ∧ in0 ) out6 = ( m55 ∧ in7 ) ⊕ ... ⊕ ( m48 ∧ in0 ) . . . . out1 = ( m15 ∧ in7 ) ⊕ ... ⊕ ( m8 ∧ in0 ) out0 = ( m7 ∧ in7 ) ⊕ ... ⊕ ( m0 ∧ in0 )

Figure 9.3 depicts this multiplication scheme and Figure 9.4 presents the detailed circuit dia- gram of the linear matrix multiplication unit. An additional accumulation circuit performs the add step of multiply-and-add structure. When acc is set (i.e., NMA instruction is issued), the selected FIFO register output is added to the matrix multiplication result. sro input (corre- sponding to the n parameter of the NMA instruction) decides the register output to be taken. The output of each multiply-only or multiply-and-add operation is pushed into the FIFO via the mac input to allow the following accumulations, which is activated with each NMU or NMA instruction. The additional accumulation circuit can be seen in Figure 9.1 (on the right side of the figure, next to the linear unit).

9.3.3 Area and Power Consumption of NLU Module

In order to observe the impact of the NLU unit on area (which would be added to the overall microcontroller area cost), we implemented the unit as synthesizable RTL code and synthesized in UMC 90 nm low-leakage Faraday library. Our implementation occupies 1752 GE, and the (simulational) power consumption of the unit is 28.59 µW @100 kHz (which is the case for smart cards). Therefore, our design is very compact and low-cost as an extension unit for Atmel 8-bit AVR microcontrollers.

103 Chapter 9. Towards Software-efficiency: NLU Instruction Set Extension ’ ’ ’ l l l ’ l a b d c l l l l d d d d l l l l c c c c l l l l b b b b l l l l 1 3 a a a 5 7 a 3 6 1 4 m m m m l l l l c c c c l l l l b b b b l l l l a a a a 0 2 4 6 3 6 1 4 m m m m l l l l d d d d l l l l b b b b l l l l a a a 1 a 3 9 5 6 1 2 4 m m m m l l l l b b b b l l l l a a a 0 a 2 8 4 6 1 2 4 m m m m l l l l d d d d l l l l c c c c l l l l 9 a a a a 3 7 1 5 4 2 1 m m m m l l l l c c c c l l l l a a a a 2 8 0 6 4 5 1 2 m m m m l l l l d d d d l l l l a a a a 1 7 9 5 4 5 2 m m m m l l l l a a a a 4 0 6 2 4 5 8 m m m m l l l l d d d d l l l l c c c c l l l l } b b b 3 9 5 b 7 ’ 2 3 5 l m m m m d

, l l l l c c c ’ c l c l l l l

b b b 4 b 2 8 , 6 5 2 3 ’ m l m m m b l l l l

d d d , d ’ l l l l l 3 b b b 1 b 5 a 7 5 2 3

m , m m m ’ h l l l d l

b b b b , 6 2 0 ’ 3 5 2 4 h m m m m c

, l l l l d d d ’ d h l l l l b c c c c 1 9 5 3

5 1 3 , m m m m ’ h a l l l l { c c c c 4 0 8 3 5 2 1 m m m m 8 l l l l d d d d 3 9 7 3 t 1 4 1 i m } m m m 3 6 n m 1 1 1 1

U , 8 2 6 4 0 3 2

1 6 m m m m r m

a ,

e ’ ’ ’ ’ h h h h 4 c a d b … 6 n

i , 1 l h h h h - d d d d m h h h h

c c c c , n h h h h 0 b b b b 1 h h h 5 7 3 h 3 o 1 4 6 a a a m a m m m m { N h h h h c c c c h h h h b b b b h h h h 0 6 a a a a 4 2 3 4 8 1 6 m m m m h h h h d d d d } h h h h l b b b b h h h h d 9 3 1 a a a a 5

2 1 6 4 , l m m m m c

h h h h , l b b b b b h h h h

0 2 8 4 a a a a , 6 1 2 4 l m m m m a

, h h h h h d d d d h h h h d c c c c h h h h

, a a a a 1 7 3 9 1 2 4 5 h m m m m c

, h h h h c c c c h b h h h h

6 2 8 a a a a 0 , 2 4 5 1 h m m m m a { h h h h d d d d h h h h 1 a a a a 5 7 9 4 2 5 m m m m h h h h a a a a 4 0 6 8 2 4 5 m m m m h h h h d d d d h h h h c c c c h h h h b b b b 5 3 9 7 5 2 3 m m m m h h h h c c c c h h h h b b b b 6 8 4 2 3 5 2 m m m m h h h h d d d d h h h h 3 b b b b 1 7 5 5 2 3 m m m m h h h h b b b b 6 0 2 3 2 4 5 m m m m h h h h d d d d h h h h 1 c c c c 9 5 3 5 1 3 m m m m h h h h c c c c 8 4 0 2 1 3 5 m m m m h h h h d d d d 9 7 3 4 1 3 1 m m m m 1 1 1 1 6 8 2 1 4 0 3 m m m m

Figure 9.2: Non-linear unit

104 9.4. Applications

{m0, m1, … , m62, m63} 64

{in , in , in , in , in , in , in , in } Linear Unit {out , out , out , out , out , out , out , out } 7 6 5 4 3 2 l 0 8 8 7 6 5 4 3 2 l 0

m63 m62 . . . m57 m56 in7 out7 m m . . . m m in out 5.5 54 49 .48 . 6 . 6 ...... m15 m14 . . . m9 m8 in1 out1 m7 m6 . . . m1 m0 in0 out0

Figure 9.3: Matrix multiplication

9.4 Applications

We demonstrate the effectiveness of the NLU in terms of code space and execution time by implementing two lightweight ciphers and two full size ciphers using the target instruction set, 8-bit AVR. As the lightweight ciphers, we chose PRESENT and CLEFIA, which are the lightweight ISO standards; and as the full size ciphers, we chose Serpent and AES. We imple- mented two different versions of each of these ciphers on Atmel’s 8-bit AVR microcontroller in assembly using both “standard instruction set” and “standard instruction set together with NLU instruction set extension” in order to have a fair comparison. The implementation details for these ciphers are given in the following.

9.4.1 PRESENT Software Implementation

The permutation layer of PRESENT is extremely simple for hardware implementations: It is only bit-shuffling, which has zero cost. However, the same arguments are not valid for software implementations. The bit-shuffling is a very costly operation when implemented with a standard instruction set. 4-bit Sbox also makes the direct software implementation harder in “non-4-bit” processors. Programmers usually prefer to use 8-bit T -tables, where the 16 entries of the Sbox are repeated 16 times at the low nibbles of 256 bytes, whereas the high nibble changes once in every 16 entries. This implementation makes it possible to run two parallel Sbox operations at a time, while costing 256 bytes instead of the ideal case of 8 bytes for Sbox storage. In an optimized PRESENT implementation in AVR assembly, each 1-byte Sbox operation takes 6 instructions, one of which is a program memory load instruction with 3 cycles while others are all single cycle instructions. Therefore, total cost for the substitution layer (i.e., 64 bits) is 64 cycles and 48 instructions. There is also 256 bytes of T -table storage within the program memory. Permutation is much worse, where it takes 16 cycles to complete for each byte, resulting in a total of 128 cycles (plus overhead) per round for the permutation layer. Similar arguments are also valid for the key expansion, where the 80-bit key has to be rotated by 61 bits in every round. In total, this implementation occupies 660 bytes of program (Flash)

105 Chapter 9. Towards Software-efficiency: NLU Instruction Set Extension

m63...56 1 8 1 1 1 8 1 1 1 1 1

m55...48 1 8 1 1 1 8 1 1 1 1 1

m47...40 1 8 1 1 1 8 1 1 1 1 1

m39...32 1 8 1 1 1 8 1 1 1 1 1 in out 8 1 8 1 1 1 8 1 1 1 8 1 1 m31...24 1 1 1 1 8 1 1 1 8 1 1 m23...16 1 1 1 1 8 1 1 1 8 1 1 m15...8 1 1 1 1 8 1 1 1 8 1 1 m7...0

Figure 9.4: Linear unit – detailed circuit

memory and takes 10792 cycles to complete the encryption of a single 64-bit input block with an 80-bit key. In our implementation of PRESENT using the NLU ISE, it takes 8 instructions (and in turn, 8 clock cycles) to load the ANF coefficients into the NLU configuration register. Then it takes 1 cycle per byte for the Sbox operation, as shown in the following assembly code segment:

106 9.4. Applications

; load ANF bits for the PRESENT Sbox NLD 0, 0xb3 NLD 0, 0x92 NLD 0, 0x67 NLD 0, 0x0b NLD 0, 0xde NLD 0, 0x43 NLD 0, 0x4a NLD 0, 0x80

; perform non-linear Sbox operation NNL r18, r18 NNL r19, r19 NNL r20, r20 NNL r21, r21 NNL r22, r22 NNL r23, r23 NNL r24, r24 NNL r25, r25

We also use NLU for the permutation layer – this time in linear mode. We first write the expressions for the bit permutations as follows (where Yi is the output bit and Aj is the input bit in big-endian format):

y{63...56} = a{63,59,55,51,47,43,39,35} y{55...48} = a{31,27,23,19,15,11,7,3} y{47...40} = a{62,58,54,50,46,42,38,34} y{39...32} = a{30,26,22,18,14,10,6,2} y{31...24} = a{61,57,53,49,45,41,37,33} y{23...16} = a{29,25,21,17,13,9,5,1} y{15...8} = a{60,56,52,48,44,40,36,32} y{7...0} = a{28,24,20,16,12,8,4,0}

In the next step, we convert these expressions into matrix form (shown only for the first output byte):           y63 10000000 a63 00000000 a39 y    a    a   62 00001000  62 00000000  38           y61 00000000 a61 00000000 a37           y60 00000000 a60 00000000 a36   =     ⊕ · · · ⊕     y  00000000 a  00000000 a   59    59    35           y58 00000000 a58 00000000 a34           y57 00000000 a57 10000000 a33 y56 00000000 a56 00001000 a32 Writing this for all the bytes, we get:

107 Chapter 9. Towards Software-efficiency: NLU Instruction Set Extension

Y7 = M00A7 ⊕ M01A6 ⊕ M02A5 ⊕ M03A4 Y6 = M00A3 ⊕ M01A2 ⊕ M02A1 ⊕ M03A0 Y5 = M10A7 ⊕ M11A6 ⊕ M12A5 ⊕ M13A4 Y4 = M10A3 ⊕ M11A2 ⊕ M12A1 ⊕ M13A0 Y3 = M20A7 ⊕ M21A6 ⊕ M22A5 ⊕ M23A4 Y2 = M20A3 ⊕ M21A2 ⊕ M22A1 ⊕ M23A0 Y1 = M30A7 ⊕ M31A6 ⊕ M32A5 ⊕ M33A4 Y0 = M30A3 ⊕ M31A2 ⊕ M32A1 ⊕ M33A0

As can be seen in the matrix form, an output byte is obtained by the sum of multiplications of four input bytes with four different binary matrices, where each matrix is a 2-row shifted version of the neighboring matrix. In total, we need 16 different matrices, which can be ob- tained from each other in a sequence of row shifts. In the case of NLU, each byte written into the configuration register corresponds to left shift of the coefficients by one byte, which in fact is a 1-row shift upwards. Therefore, we start with multiplication of the rightmost byte and continue towards the leftmost byte by consecutive multiply-and-adds. However, since each byte pair uses the identical four matrices, we can also make use of this property by performing the rightmost multiplications of two bytes. We then continue with the multiply-and-adds in pairs, using the second register output from the accumulator FIFO. The resultant assembler code for the permutation layer is given as follows (only initialization and permutation for the first two bytes are shown):

; state is in registers r18 to r25

; write M03 to NLU NLD 0, 0x00 NLD 0, 0x00 NLD 0, 0x00 NLD 0, 0x00 NLD 0, 0x00 NLD 0, 0x00 NLD 0, 0x80 NLD 0, 0x40

; temporary registers r10, r11 NMU r10, r21 NMU r11, r25

; write M02 in NLU NLD 0, 0x00 NLD 0, 0x00 NMA 2, r10, r20 NMA 2, r11, r24

108 9.4. Applications

; write M01 in NLU NLD 0, 0x00 NLD 0, 0x00 NMA 2, r10, r19 NMA 2, r11, r23

; write M00 in NLU NLD 0, 0x00 NLD 0, 0x00 NMA 2, r10, r18 NMA 2, r11, r22

; write M13 in NLU NLD 0, 0x40 NLD 0, 0x04 ...

As can be seen in the code, it takes 16 cycles in average for the permutation of two bytes (or 8 cycles per byte), which is only the half of the implementation without NLU instructions. After applying a similar use of NLU for the key expansion block as well, we end up with 6017 clock cycles and 406 bytes of program memory to complete the encryption of a single 64-bit input block. The reduction in both code space and execution time is considerable with respect to a raw assembly implementation.

9.4.2 CLEFIA Software Implementation

CLEFIA is based on a 4-branch generalized Feistel structure, which uses two different diffusion functions and two different 8-bit Sboxes. On the key schedule part, the same diffusion functions are used to generate the intermediate key, L, from the input key, K. Both keys, K and L are then expanded to round keys using a permutation called DoubleSwap function. The fastest and most compact version of the cipher, the 128-bit key version, uses 18 rounds for both encryption and decryption. In our comparison, we only implemented the fast and compact versions of the 128-bit key encryption function with minimal optimization in both standard assembly and NLU ISE cases. The 8-bit Sboxes are most efficiently implemented as lookup tables in either case. However, the situation is the opposite for the diffusion and permutation functions. Since both of them are based on linear combinations of shifted bytes, they are perfect examples to utilize the linear mode operation of the NLU. For example, the basic operation of the diffusion function is a multiplication by 2 over finite field GF(28) defined by the primitive polynomial p(x) = x8 + x4 + x3 + x2 + 1. It is repeated several times inside both diffusion operations. In the standard assembly, it is implemented as a subroutine, which takes 15 or 17 cycles (depending on the most significant bit of the input byte) including the subroutine calls. It is also possible to reduce the cycle count to 7 or 9 at the expense of repeating the corresponding program code several times, adding hundreds of bytes to the occupied program memory.

109 Chapter 9. Towards Software-efficiency: NLU Instruction Set Extension

On the other hand, the whole multiplication is only a single NMU instruction for NLU, following a one-time initialization of the internal product matrix (via NLD instruction). Similar approach is also used for the DoubleSwap function. The resultant NLU-based code is 1912 bytes vs. 2170 and 3046 bytes in the compact and fast standard codes, respectively. Execution time is reduced to 15268 cycles from 28684 cycles in the fast version and 42124 cycles in the subroutine- based compact version.

9.4.3 Serpent Software Implementation

Serpent is an interesting case study for several reasons: It is the cipher from which PRESENT has originated. Therefore, they have several similarities. However, Serpent is defined for bit- sliced implementations. Therefore, its code can be optimized for different targets: Either for code space (compactness) or for execution time. In the compact case, Sboxes are implemented as lookup tables, which requires un-bitslicing and re-bitslicing the state bits before and after Sbox operations. Such an implementation is much slower than a bitsliced implementation. Since NLU instruction set is optimized for ANF-based implementations, it is not directly applicable to a bitsliced cipher. However, in the case of Serpent, un-bitslicing operation is equivalent to the permutation layer of PRESENT, only with a larger block size – 128 bits instead of 64. Therefore, PRESENT permutation layer code can be reused together with the substitution layer codes. Finally, re-bitslicing has to be done, which is the inverse permutation of PRESENT and can be implemented using multiply-and-adds. We have applied this strategy in re-writing the Serpent code, resulting in substantial gains both in code space and execution time. We have also re-written the bitsliced linear layer of Serpent using NLU linear operations, which provided nominal gain. As a result, we came up with a 10% faster software implementation, which has a less code size than the half of the fastest Serpent’s code size. It is also faster more than twice the most compact implementation, but slightly larger in code size.

9.4.4 AES Software Implementation

Since AES is the most widely-used and still unbroken block cipher, we have also investigated the possibility of implementing AES using NLU instructions. In this case, as in CLEFIA, we were not able to implement the Sboxes more efficiently than lookup tables. However, there is space for improving the implementation of MixColumns diffusion operator. In an optimized implementation of the AES cipher, MixColumns operation takes 117 instruc- tions, 16 of which are 3-cycle program memory load instructions, while all the others are single cycle instructions. Therefore, total execution time for the MixColumns diffusion operator is 149 cycles, whereas it requires a total of 492 bytes of program memory – 236 bytes for 118 instructions and 256 bytes for an auxiliary lookup table. In our implementation using the NLU ISE, the same operation requires only 112 instructions (224 bytes of program memory), which also corresponds to 112 cycles of execution time. In order to implement MixColumns via NLU instructions, we use the linear mode of NLU. We first recall the expressions for MixColumns:

110 9.4. Applications

Y0 = {02}A0 ⊕ {03}A1 ⊕ A2 ⊕ A3 Y1 = A0 ⊕ {02}A1 ⊕ {03}A2 ⊕ A3 Y2 = A0 ⊕ A1 ⊕ {02}A2 ⊕ {03}A3 Y3 = {03}A0 ⊕ A1 ⊕ A2 ⊕ {02}A3

This can also be expressed in matrix form as follows (shown only for the first output Y0):               y7 01000000 a7 11000000 a15 a23 a31 y    a    a  a  a   6 00100000  6 01100000  14  22  30               y5 00010000 a5 00110000 a13 a21 a29               y4 10001000 a4 10011000 a12 a20 a28   =     ⊕     ⊕   ⊕   y  10000100 a  10001100 a  a  a   3    3    11  19  27               y2 00000010 a2 00000110 a10 a18 a26               y1 10000001 a1 10000011  a9  a17 a25 y0 10000000 a0 10000001 a8 a16 a24 The matrix form representation allows us to implement it using the linear mode of NLU. The most expensive part is the initializations of the matrices with coefficients for multiplication by 2 and 3. Re-initialization with two matrices (×2 and ×3) for every column corresponds to eight initializations, which is optimized by reusing the last initialized matrix. Only the initialization for the first column requires both matrices, while the other columns require only a single matrix initialization. The resultant assembly code is very similar to that of CLEFIA.

; temporary results are in r16 - r19

; first column: state is in r0 - r3 (32 bits)

; initialization for multiplication by 2 NLD 0, 0x40 NLD 0, 0x20 NLD 0, 0x10 NLD 0, 0x88 NLD 0, 0x84 NLD 0, 0x02 NLD 0, 0x81 NLD 0, 0x80

; multiplication by 2 NMU r16, r0 NMU r17, r1 NMU r18, r2 NMU r19, r3

; initialization for multiplication by 3 NLD 0, 0xC0

111 Chapter 9. Towards Software-efficiency: NLU Instruction Set Extension

NLD 0, 0x60 NLD 0, 0x30 NLD 0, 0x98 NLD 0, 0x8C NLD 0, 0x06 NLD 0, 0x83 NLD 0, 0x81

; multiplication by 3 and accumulation NMA 0, r16, r1 NMA 0, r17, r2 NMA 0, r18, r3 NMA 0, r19, r0

; multiplication by 1 and accumulation EOR r16, r2 EOR r16, r3 EOR r17, r0 EOR r17, r3 EOR r18, r0 EOR r18, r1 EOR r19, r1 EOR r19, r2

; transfer temp results to state registers MOVW r17:r16, r1:r0 MOVW r19:r18, r3:r2

; second column: state is in r4 - r7 (32 bits)

; multiplication by 3 ...

; initialization for multiplication by 2 ...

; multiplication by 2 and accumulation ...

; multiplication by 1 and accumulation ...

; transfer temp results to state registers ...

112 9.4. Applications

; third column: state is in r8 - r11 (32 bits)

; multiplication by 2 ...

; initialization for multiplication by 3 ...

; multiplication by 3 and accumulation ...

; multiplication by 1 and accumulation ...

; transfer temp results to state registers ...

; fourth column: state is in r12 - r15 (32 bits)

; multiplication by 3 ...

; initialization for multiplication by 2 ...

; multiplication by 2 and accumulation ...

; multiplication by 1 and accumulation ...

; transfer temp results to state registers ...

Our implementation using NLU instructions reduces execution time of the MixColumns op- eration by 24%, and its code size by 55%. In the overall AES encryption and decryption, this corresponds to a total reduction of 12% (2406 vs. 2739 cycles) and 10% (3246 vs. 3579 cycles) in execution time, respectively. The program memory size reduction for the combined encryption and decryption is 11% (1402 vs. 1570 bytes).

113 Chapter 9. Towards Software-efficiency: NLU Instruction Set Extension

Table 9.2: Performance comparison of ciphers implemented with/without NLU

Number Memory Time-Area TAP Implementation of Clock Utilization Product (TAP) Gain Cycles (cycles·bytes) (%) PRESENT (LUT) 10792 660 bytes 7.1 × 106 0 PRESENT (NLU) 6017 406 bytes 2.4 × 106 66 CLEFIA (compact) 42124 2170 bytes 91.4 × 106 0 CLEFIA (fast) 28684 3046 bytes 87.4 × 106 4 CLEFIA (NLU) 15268 1912 bytes 29.2 × 106 68 Serpent (ANF) 49314 7220 bytes 356.0 × 106 0 Serpent (LUT) 106338 2620 bytes 278.6 × 106 22 Serpent (NLU) 45431 2960 bytes 134.5 × 106 62 AES (LUT) 3159 1570 bytes 4.96 × 106 0 AES (NLU) 2826 1402 bytes 3.96 × 106 20

9.5 Results and Discussion

The performance that can be achieved via an NLU unit on an AVR microcontroller is shown in Table 9.2 with a comparison to the software implementations of PRESENT, CLEFIA, Serpent, and AES without the utilization of NLU. PRESENT is first implemented in a LUT-based fashion, which is slower and larger than the NLU-based implementation. In our NLU-based approach, we have 44% less clock cycles and we use 39% less Flash memory. In the case of CLEFIA, we compare our results to a fast implementation and a compact implementation. Both use lookup for Sboxes and logic operations for diffusion layers. The main difference is the subroutines in the compact version, which are replaced by repeated code in the fast version. We were able to optimize both versions using the NLU in linear mode to implement diffusions and permutations. Our execution time reduction is between 46-64%, while the code space reduction is 37-12% against fast and compact versions, respectively. For Serpent, we have two different implementations: One is using ANF representations for the substitution part and is implemented in bitsliced form, and the other is again implemented in bitsliced form except for the Sbox operation. In the latter, we have a LUT-based substi- tution layer as in the case of PRESENT. Our NLU-based Serpent design is even faster than the ANF-based Serpent implementation and uses approximately the same amount of memory with the LUT-based implementation. In order to compare our AES results, we use an op- timized software implementation, which uses lookup tables for both Sboxes and MixColumns operation. We were only able to optimize the MixColumns operation, which alone reduced both the code size and execution time. Depending on the implementation (encryption/decryption or encryption-only), code size reduction is between 10-18%, while execution time reduction is around 10% for both cases.

114 9.6. Conclusion and Future Directions

We also list the time-area (execution time and program memory) products for each case in order to highlight the NLU ISE advantage. The execution times are given as average times. RAM utilization is independent of the implementation in all four ciphers.

9.6 Conclusion and Future Directions

In this work, we were able to improve the software implementations of cryptographic block ciphers on small CPUs. We came up with a non-linear/linear ISE, NLU, which can implement non-linear and linear operations of these ciphers. Using the four new NLU instructions, we show that we achieve time-area product reductions of 20-68% for the widely-used ciphers PRESENT, Serpent, CLEFIA, and AES. The area overhead for the extension unit is only 1.8 kGE – almost the same as the area of a parallel PRESENT core designed in the same fashion. In future, the extension unit design can be extended to different (16-bit, 32-bit, 64-bit, etc.) microcontroller architectures. Furthermore, more comparisons and cipher support can be added to the current and future ISEs.

115

Chapter 10 Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE

This chapter addresses the software performance bottleneck of block cipher imple- mentations by proceeding on the cipher design side. In order to achieve an efficient cipher, we focus on the “software-costly” SPN core component “linear layer”. We first briefly present a general methodology to construct good, sometimes optimal, lin- ear layers allowing for a large variety of trade-offs. Then, we search for efficient representatives of such linear layers regarding software performance. Our search is performed on a reconfigurable hardware platform (FPGA), which accelerates the pro- cess via a carefully-designed hardware search architecture. The value of the linear layer construction is underlined by using the resulting software-efficient linear layer in our new “software-oriented” block cipher proposal. PRIDE is optimized for 8-bit microcontrollers and significantly outperforms all academic solutions both in terms of code size and cycle count.

Contents of this Chapter 10.1 Introduction ...... 117 10.2 The Interleaving Construction ...... 119 10.3 The Search for “Software-friendly” Linear Layers ...... 121 10.4 The Software-oriented Lightweight Block Cipher PRIDE ...... 126 10.5 Conclusion ...... 133

10.1 Introduction

As already discussed in Chapter 8, the dominant metric according to which the vast majority of lightweight ciphers have been optimized is the chip area, which contradicts with the fact that we also need to consider other design metrics depending on the application. Software-efficiency is one of these important metrics (as mentioned in Chapter 9), as low-end embedded processors actually dominate the world of embedded systems and dedicated hardware is a comparably small fraction. Hence, most of the existing “hardware-friendly”-“software-unfriendly” crypto- graphic solutions are implemented on those devices in software. Unfortunately, programming in embedded systems is a skill game that heavily depends on the personal skills and experience of the programmer. Some specific applications, such as our focus lightweight cryptography,

117 Chapter 10. Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE especially require specialized knowledge and specially-tailored designs in order to get efficient software implementations. Considering this fact, it is quite puzzling that software-efficient solu- tions for lightweight crypto applications on low-cost processors have been (nearly) disregarded for so long. Certainly, there were a few exceptions: Besides the practical examples that we already presented in Chapter 9, several theoretical studies have been published in this field. There have been attempts to come up with ciphers that are (partially) tailored for low-cost processors [SMMK12, SPGQ06, WZ11, GNL11, BSS+13, KDH13]. Of these, execution times of both SEA and ITUbee are rather high, mostly due to the high number of rounds. Furthermore, ITUbee uses 8-bit Sboxes, which occupy a vast amount of program memory storage. However, SPECK, which is the lightweight cipher proposal of US National Security Agency (NSA) (to- gether with SIMON) [BSS+13], seems to be an excellent lightweight software cipher in terms of both speed and program memory1. It is obvious that there are quite some challenges to be overcome in this relatively untouched area of lightweight software cryptography. A software cipher for embedded devices should not only be compact in terms of program memory, but also be relatively fast in execution time. It should clearly be secure and, preferably, its security should be easily analyzed and verified. The latter can possibly be achieved by building on conservative structures, which are conventionally costly in software implementations, thereby posing even harder challenges. One major component influencing all or at least most of those criteria outlined above is the linear layer. Thus, it is important to have general constructions for linear layers that allow to explore and make optimal use of the possible trade-offs. The general design approaches for linear layers are already given in Section 2.2.1. Out of these, the wide-trail strategy usually results in simple and strong security arguments. While the wide-trail strategy does provide a powerful tool for arguing about the security of a cipher, it does not help in actually designing an efficient linear layer (or the corresponding linear code) with a suitable number of active Sboxes. Here, with the exception of early designs in [Dae95] and later PRINCE and mCrypton, most ciphers following the wide-trail strategy simply choose an MDS matrix as the core component. This might guarantee an (partially) optimal number of active Sboxes, but usually comes at the price of a less efficient implementation. The only excep- tion here is that, in the case of MDS matrices, the authors of PHOTON and LED made the ob- servation that implementing such matrices in a serialized fashion improves hardware-efficiency. This idea was further generalized in [SDMS12, WWW12], and more recently in [AF14]. It is our belief that, in many cases, it is advantageous to use a near-MDS matrix (or in general a matrix with a sub-optimal branch number) for the overall design. Furthermore, it is, in our opinion, utmost surprising that there are virtually no general constructions or guidelines that would allow an SPN design to benefit from security vs. efficiency trade-offs.

10.1.1 Our Contribution

As a part of this work, we take steps towards a better understanding of possible trade-offs for linear layers. In the following sections, we briefly present a general construction that allows to combine several strong linear mappings on a few number of bits into a strong linear layer for

1Note that SIMON and SPECK was announced during the course of our work and they were not published in academic proceedings/journals. We therefore claim that our results are superior to the “academic” proposals.

118 10.2. The Interleaving Construction a larger number of bits2. From a coding theory perspective, this construction corresponds to a construction known as block-interleaving (see [LC04], pages 131–132). While this idea is rather simple, its applicability is powerful. Implicitly, a specific instance of our construction is already implemented in AES. Furthermore, special instances of this construction are recently used in [BNN+10] and [GLSV14]. We show that this construction leads to very strong linear layers with respect to efficiency on embedded 8-bit microcontrollers. For this, we adopt a search strategy from [UCI+11] to find the most efficient linear layer possible within our constraints. We implemented this search as a carefully-designed hardware architecture on an FPGA platform in order to overcome the big computational effort involved and have the advantage of reconfigurability. Note that our target platform is Atmel’s AVR microcontroller, as it is dominating the market along with PIC; and many implementations in the literature are also implemented using AVR 8-bit instruction set. We therefore opted for this platform in order to provide a better comparison to other ciphers – especially to SIMON and SPECK. Following the search, we use the resulting efficient linear layers in the design of a new software- oriented block cipher named PRIDE which significantly outperforms all existing block cipher proposals of similar key sizes – with the exception of SIMON and SPECK. One of the key points here is that our construction of strong linear layers is nicely in line with a bitsliced implementation of the Sbox layer. Our cipher is comparable, both in speed and memory size, to the new NSA block ciphers SIMON and SPECK, dedicated for the same platform. We would like to note that while in this work we focus on SPN ciphers, most of the results translate to the design of Feistel ciphers as well. Note that this work is the output of a collaboration and was published at CRYPTO in 2014. In this chapter, we only focus on the parts that the author was involved in; therefore, we refer to the original publications [ADK+14a, ADK+14b] where necessary.

10.2 The Interleaving Construction

Following the wide-trail strategy described in Chapter 2, we build linear layers by constructing n b (2n, 2 ) additive codes with minimal distance d over F2. The code needs to have a generator matrix G in standard form, i.e., G = [I | LT ] where the submatrix L is invertible, and corresponds to the linear layer we are using. Hence, the main question is how to construct “efficient” matrices L with a given branch number. Our construction allows to combine small matrices into bigger ones. We hereby drastically reduce the search-space of possible linear layers. This in turn makes it possible to construct efficient linear layers for various trade-offs. For further information on interleaving construction and an illustrative example of our ap- proach, we refer to our original publications [ADK+14a, ADK+14b].

10.2.1 The General Construction In order to give a formal description of our approach, we first define the following isomorphism. 2As this part is not the personal contribution of the thesis author, we refer to the original publica- tions [ADK+14a, ADK+14b] for further details of this construction.

119 Chapter 10. Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE

 n  n  n  n P n b1 × b2 × · · · × bk → b1 × b2 × · · · × bk b1,...bk : F2 F2 F2 F2 F2 F2  (1) (1)  (k) (k) (x1, . . . , xn) 7→ x1 , . . . , xn ,..., x1 , . . . , xn

 (1) (k) (j) bj where xi = xi , . . . , xi with xi ∈ F2 .

This isomorphism performs the transformation of mapping Sbox outputs to our small linear layers Li. Note that, for our purpose, there are in fact many possible choices for P . In particular, we bi n may permute the entries within (F2 ) . Given this isomorphism we can now state our main theorem. The construction of P follows the idea of a diffusion-optimal mapping as defined in [DR01, Definition 5].

T n Theorem 10.2.1 Let Gi = [I | Li ] be the generator matrix for an F2-linear (2n, 2 ) code with bi T minimal distance di over F2 for 0 ≤ i < k. Then the matrix G = [I | L ] with

 −1 L P n ◦ L × L × · · · × L ◦ P n = b1,...bk ( 0 1 k−1) b1,...bk

n b is the generator matrix of an F2-linear (2n, 2 ) code with minimal distance d over F2 where X d = min di and b = bi. i i

For the proof, please see our publications [ADK+14a, ADK+14b]. A special case of the construction above is implicitly already used in AES. In the case of AES, 32 8 it is used to construct a [8, 4, 5] code over F2 from 4 copies of the [8, 4, 5] code over F2 given by the MixColumns operation. In the Superbox view on AES, the ShiftRows operation plays the role of the mapping P (and its inverse) and MixColumns corresponds to the mappings Li. In the following, we use this construction to design efficient linear layers. Besides the differ- ential and linear branch number, we hereby focus mainly on three criteria:

(1) Maximize the diffusion (via ensuring high dependency)

(2) Minimize the density of the matrix

(3) Software-efficiency

In this thesis, we will explain only the third bullet-point, software-efficiency, in detail and refer to the original publications [ADK+14a, ADK+14b] for the details of the first two criteria. n The strategy we employ is as follows. We first find candidates for L0, i.e., (2n, 2 ) additive b0 codes with minimal distance d0 over F2 . In this stage, we ensure that the branch number is d0 and our efficiency constraints are satisfied. We then apply permutations to L0 to produce Li for i > 0. This stage maximizes diffusion.

120 10.3. The Search for “Software-friendly” Linear Layers

10.2.2 Searching for L0 The following lemma (which is a rather straightforward generalization of Theorem 4 in [WWW12]) gives a necessary and sufficient condition that a given matrix L has branch number b d over F2.

Lemma 10.2.2 Let L be a bn × bn binary matrix, decomposed into b × b submatrices Li,j.   L0,0 L0,1 ...L0,n−1    L1,0 L1,1 ...L1,n−1  L =  . . .  (10.1)  . . .. .   . . . .  Ln−1,0 Ln−1,1 ...Ln−1,n−1

b Then, L has differential branch number d over F2 if and only if all i × (n − d + i + 1) block submatrices of L have full rank for 1 ≤ i < d − 1. Moreover, L has linear branch number d if and only if all (n − d + i + 1) × i block submatrices of L have full rank for 1 ≤ i < d − 1.

Based on Lemma 10.2.2 we may instantiate various search algorithms, which we will describe in the following sections. In our search we focus on cyclic matrices, i.e. matrices where row i > 0 is constructed by cyclic shifting row 0 by i indices. These matrices have the advantage of being efficient both in software and hardware. Furthermore, since these matrices are symmetric, considering the dual code C⊥ to C = [I | LT ] is straightforward.

10.3 The Search for “Software-friendly” Linear Layers

In this section, we give the details of our hardware-based search in order to find efficient linear layers for our “software-oriented” cipher. We explain our search constraints and present our hardware architecture design in detail. We are, unsurprisingly, making use of the construction given in Theorem 10.2.1. We decided on a linear layer with high dependency and a linear and differential branch number of 4. One key-observation is that the construction of Theorem 10.2.1 fits naturally with a bitsliced imple- mentation of the cipher, in particular with the Sbox layer. As a bitsliced implementation of the Sbox layer is advantageous on 8-bit microcontrollers, in any case this is a nice match. A natural choice in terms of Theorem 10.2.1 is to choose k = 4 and b1 = b2 = b3 = b4 = 1. Thus, the task reduces to finding four 16 × 16 matrices forming one 64 × 64 matrix (to permute the whole state) of the following form.   L0 0 0 0    0 L1 0 0     0 0 L2 0  0 0 0 L3

Each of these four 16×16 matrices should provide branch number 4 and together achieve high dependency with the least possible number of instructions. Instead of searching for an efficient implementation for a given matrix, we decided to search for the most efficient solution fulfilling our criteria. However, we need to simplify this search process in order to make it realizable.

121 Chapter 10. Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE

In a software implementation, the part of the state, which is stored in two 8-bit registers X and Y , is multiplied with each Li as a part of the linear layer application. Hence, we can split each 16 × 16 matrix as in the following in order to come up with a structured search.

! ! ! X0 AB X = Y 0 CD Y where ! AB L = . i CD

Note that X,Y are 8-bit registers and A, B, C, D are all 8×8 cyclic matrices. In the following, we explain our search effort to find instructions resulting in “good” 8 × 8 binary matrices, i.e., A, B, C, D, on a hardware-based search architecture.

10.3.1 Proposed Search Model

Our proposed search architecture is highly-parallel and operates on a limited set of instructions pertinent to implementation of linear layer diffusion matrices. A predetermined number of instructions is executed in a pipelined manner and the resultant output register contents are checked either for match to a target matrix or for certain cryptographic properties. As already mentioned, we are looking for 8 × 8 binary matrices in this search; however, we will combine the resulting matrices in the end in order to come up with 16 × 16 diffusion matrices. Similar previous attempts targeted to reduce the number of instructions for substitution layer [UCI+11]. In their work, the authors wanted to come up with the least number of in- structions (in turn clock cycles) for 4-bit Sboxes. However, their search was run on a software platform, which significantly slowed down the search process. In our work, we focus on the lin- ear layer and try to find diffusion matrices giving the least possible number of instructions. To this end, we make use of a hardware platform (Xilinx Virtex-6 ML605 Evaluation Board), which inherently possesses both parallelism and pipelining properties that significantly accelerate the search process. First of all, in order to have an idea of the number of instructions that we need, we started our search effort with analyzing the best figures in the literature, i.e., cycle counts of NSA lightweight ciphers SIMON and SPECK. A naive back-the-envelope calculation of a possible software implementation revealed that we have to come up with a linear layer of 7-9 instruc- tions/byte for a 64-bit block cipher. Our manual attempts provided us with BN = 4 diffusion matrices of 9-11 instructions/byte. Clearly, in order to reach our targets, these figures had to be reduced. A natural idea is to search through all instruction combinations (for a limited set and number of instructions) and see if we end up with good diffusion matrices. In other words, the problem can be summarized as follows: Given N instructions3 (say N = 8) and only linear operations (exclusive OR, rotate, shift, move, and masking) with 2-3 temporary registers, which instruction

3More importantly, clock cycles. It can actually differ from the number of instructions, but each instruction of our limited instruction set costs 1 clock cycle. So, the number of instructions are the same with the clock cycle count in our case.

122 10.3. The Search for “Software-friendly” Linear Layers combinations would result in good matrices? And more importantly, how can we go through all N-instruction combinations? In our search model, the starting (initial) value is the identity matrix, i.e., Y = A. We execute instructions step by step, similar to a software implementation. Each “linear” instruction will act as a matrix multiplied with the initial matrix, i.e., Y = M × A, which will finally yield a certain value in the output register. After all instructions are executed, the quality of this M will be checked. Here, by quality, we mean the branch number and non-singularity – the cryptographic properties – of the matrix. This scheme is depicted in Figure 10.1.

Y I Q detect

M1 M2 M7 M8 Instruction 1 matrix Instruction 2 matrix Instruction 7 matrix Instruction 8 matrix

Figure 10.1: Instruction flow with quality check – an example with 8 instructions

As mentioned before, our search is limited only to instructions that can result in a “linear” operation, which are listed in Table 10.1. In the overall structure, there should be a counter that would go through all possible combi- nations for each one of the targeted N instructions. Depending on these counter value(s), we have to decide on the instruction’s equivalent matrix, the source and destination registers; and then apply them. An important question is the number of source/destination registers. Clearly, one register is needed as the IO, and 2-3 registers are needed for temporary storage. Although manual coding attempts revealed that 1-2 registers are sufficient, we use 3 temporary registers in our design. The next question is if we need all 4 registers at every stage. The answer is actually no, as we can skip some of the registers depending on the instruction order. This scheme is depicted in Figure 10.2. Such a reduction of course means a drop in the time complexity, as we try less number of possibilities. This scheme means, we should define five different instruction module types, one for each of 1st, 2nd, 3rd, regular (intermediate), and last instructions. Furthermore, not every instruction is possible at every cycle. For example, the first instruction cannot be EOR, since a second

0 0 0 0 0 0 0

1 1 1 1 Instruction 1 Instruction Instruction Instruction Instruction Instruction last-1 last 1 2 3 4 (L-1) (L) 2 2 2 2

3 3 3

Figure 10.2: Overall instruction flow – showing source and destination skips at each stage

123 Chapter 10. Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE

Table 10.1: Limited instruction set

Instr. Operation Description Input → Output EOR D ← D ⊕ S Multiply destination matrix - with identity matrix (I) and then add (EOR) source matrix. MOV D ← S Copy source matrix to destination.-

MOVW D + 1 : D ← S + 1 : SCopy two source matrices - to two destination matrices. SWAP D ← D3...0,D7...4 Swap upper and lower x7x6x5x4x3x2x1x0 → nibbles of destination. x3x2x1x0x7x6x5x4 ROL D ← D ≪ 1 Rotate destination by 1-bit x7x6x5x4x3x2x1x0c → to left using carry register. x6x5x4x3x2x1x0cx7 ROR D ← D ≫ 1 Similar to ROL, x7x6x5x4x3x2x1x0c → in right direction instead. cx7x6x5x4x3x2x1x0 LSL D ← D  1 Logical shift left, x7x6x5x4x3x2x1x0c → 0 is pushed in instead of carry. x6x5x4x3x2x1x00x7 LSR D ← D  1 Logical shift right, x7x6x5x4x3x2x1x0c → LSL in right direction. 0x7x6x5x4x3x2x1x0 ASR D ← D ÷ 2 Arithmetic shift right, x7x6x5x4x3x2x1x0c → divide destination by 2. x7x7x6x5x4x3x2x1x0 CLC - Clear carry. x7x6x5x4x3x2x1x0c → x7x6x5x4x3x2x1x00 CLR - Clear register. x7x6x5x4x3x2x1x0c → 00000000c

operand does not exist yet. Also, there should not be any bit destructing last instruction, like LSL, LSR, ASR. All of these criteria further reduce the time complexity of our search model.

10.3.2 Overall Architecture

Our resultant reconfigurable architecture is shown in Figure 10.3. We have implemented different types of blocks to check the properties of the generated linear layer matrices, such as non- singularity (validity) and linear/differential branch numbers (cryptographic properties) of the matrix. Non-singularity is checked via the determinant. For binary matrices, a non-zero determinant guarantees non-singularity. Finding the determinant is usually a problem for matrices over 4 × 4. However, since we are interested in the determinant of a binary matrix, we can simply use U of LU-decomposition. For that, we turn the matrix into an upper triangle. If it can be done, then the matrix is simply non-singular. This operation is realized via line-by-line elimination in a total of 8 steps. On the other hand, the branch number can be found by exact realization of “minimum of the sum of input and output Hamming distances” algorithm in hardware, implemented as a 256 − to − 1 binary tree.

124 10.3. The Search for “Software-friendly” Linear Layers

carry I/O

Instruction 1 cnt1

Instruction 2 cnt2

Instruction 3 cnt3

Instruction 4 cnt4

Instruction L-1 cntL-1

Instruction L cntL

Determinant check

det & (c = 0)

Branch number check

Compare

Results Match cnt FIFO

Figure 10.3: Reconfigurable “optimal linear layer” search architecture

10.3.3 Evaluation of Search Results

We performed our search using the Xilinx ML605 Evaluation Board, which has a Virtex-6 XC6VLX240T FPGA. We tried different combinations of instructions from 7 to 11 instructions. The resulting matrices are checked for the mentioned properties. Using our parallel architecture, we were able to find several good permutation layer matrices starting from 7 instructions with branch number 4. We were able to search up to 11 instructions and cover matrices with branch number 6 as well.

125 Chapter 10. Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE

For further design, time complexity reduction, and reconfigurable hardware architecture de- tails, we refer to our publication [KLY13], as the topic is too broad to cover within the scope of this chapter. Note that, following the search results, we combined the 8 × 8 matrices manually to form 16 × 16 matrices. During this effort, we made further optimizations. For example, we used MOVW instead of MOV where necessary, in order to come up with a better number of instructions/cycle count. Such a replacement resulted in one more temporary register; however, this was not a problem as we had enough registers left.

10.4 The Software-oriented Lightweight Block Cipher PRIDE

In this section, we describe our new lightweight software-cipher PRIDE, a 64-bit block cipher that uses a 128-bit key. It has an efficient linear layer and an involution Sbox as well as a low-cost key schedule. Note that the key schedule design and security analysis of the cipher are realized in cooperation with the co-authors of the original publications [ADK+14a, ADK+14b]. The author of this thesis, as mentioned before, has contributed to the efficient linear layer search, Sbox selection, and implementations of the cipher.

10.4.1 The Extremely Efficient Linear Layer

As a result of the search explained in previous sections, we achieved an extremely efficient linear layer. The cheapest solution provided by our search needed 36 cycles for the complete linear layer, which is what we opted for. The optimal matrices forming the linear layer are given in the appendix (Part V). Of these four matrices, L0 and L3 are involutions with the cost of 7 instructions (in turn, clock cycles), while L1 and L2 require 11 and 13 instructions for true and inverse matrices, respectively. The assembly codes are given in the following.

; State s0, s1, s2, s3, s4, s5, s6, s7 ; Temporary registers t0, t1, t2, t3

; Linear Layer and Inverse Linear Layer: L0 movw t0, s0 ; t1:t0 = s1:s0 swap s0 swap s1 eor s0, s1 eor t0, s0 mov s1, t0 eor s0, t1

; Linear Layer: L1 swap s3 movw t0, s2 ; t1:t0 = s3:s2 movw t2, s2 ; t3:t2 = s3:s2

126 10.4. The Software-oriented Lightweight Block Cipher PRIDE lsl t0 rol t2 lsr t1 ror t3 eor s2, t3 mov t0, s2 eor s2, t2 eor s3, t0

; Inverse Linear Layer: L1 movw t0, s2 ; t1:t0 = s3:s2 movw t2, s2 ; t3:t2 = s3:s2 lsr t0s ror t2 lsr t1 ror t3 eor t3, t2 eor s3, t3 swap s3 mov s2, t3 lsr t3 ror s2 eor s2, t2

; Linear Layer: L2 swap s4 movw t0, s4 ; t1:t0 = s5:s4 movw t2, s4 ; t3:t2 = s5:s4 lsl t0 rol t2 lsr t1 ror t3 eor s4, t3 mov t0, s4 eor s4, t2 eor s5, t0

; Inverse Linear Layer: L2 movw t0, s4 ; t1:t0 = s5:s4 movw t2, s4 ; t3:t2 = s5:s4 lsr t0 ror t2 lsr t1 ror t3 eor t3, t2

127 Chapter 10. Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE eor s5, t3 mov s4, t3 lsr t3 ror s4 eor s4, t2 swap s4

; Linear Layer and Inverse Linear Layer: L3 movw t0, s6 ; t1:t0 = s7:s6 swap s6 swap s7 eor s6, s7 eor t1, s6 mov s7, t1 eor s6, t0

Comparing to linear layers of other SPN-based ciphers clearly demonstrated the benefit of our approach. However, note that these comparisons have to be taken with care as not all linear layers operate on the same state size and do not offer the same security level. The linear layer of PRESENT costs 144 cycles (derived from the total cycle count given in [EKM+13]). MixColumns operation of AES4 costs 117 instructions (but 149 cycles because of 3-cycle data load instruction utilizations, as MixColumns constants are implemented as lookup table – which means additional 256 bytes of memory, too) [AVR]. Note that ShiftRows operation was merged with the lookup table of Sbox in this implementation, so we take only MixColumns cost as the linear layer cost. The linear layer of CLEFIA (again 128-bit cipher) costs 146 instructions and 668 cycles. The bitslicing-oriented design Serpent (128-bit cipher) linear layer costs 155 instructions and 158 cycles. Other lightweight proposals, KLEIN and mCrypton linear layers cost 104 instructions (100 cycles) and 116 instructions (342 cycles), respectively [EGG+12]. Finally, the linear layer cost of PRINCE is 357 instructions and 524 cycles5, which is even worse than AES. One of the reasons for this high cost is the non-cyclic 4 × 4 matrices forming the linear layer. The other reason is the ShiftRows operation applied on 4-bit state words, which makes coding much more complex than that of AES on an 8-bit microcontroller.

10.4.2 Sbox Selection

For our bitsliced design, we decided to use a very simple – in terms of software-efficiency – 10-instruction Sbox (which makes 10 × 2 = 20 instructions in total for the whole state). The formulation for this Sbox is given below.

4It is of course not fair to compare a 128-bit cipher with a 64-bit cipher. However, we provide AES numbers as a reference due to the fact that it is a widely-used standard cipher and its cost is much better compared to many lightweight ciphers. 5Result taken from Chapter 8.

128 10.4. The Software-oriented Lightweight Block Cipher PRIDE

A = c ⊕ (a&b) B = d ⊕ (b&c) C = a ⊕ (A&B) D = b ⊕ (B&C)

It is at the same time an involution Sbox, which prevents the encryption/decryption overhead. Besides being very efficient in terms of cycle count, this Sbox is also optimal with respect to linear and differential attacks. The maximal probability of a differential is 1/4 and the best correlation of any linear approximation is 1/2. The PRIDE Sbox is is depicted in the following.

x 0 1 2 3 4 5 6 7 8 9 A B C D E F S[x] 0 4 8 F 1 5 E 9 2 7 A C B D 6 3

The corresponding assembly code for the Sbox is as follows.

; State s0, s1, s2, s3, s4, s5, s6, s7 ; Temporary registers t0, t1, t2, t3

; Substitution Layer movw t0, s0 movw t2, s2 and s0, s2 eor s0, s4 and s2, s4 eor s2, s6 and s1, s3 eor s1, s5 and s3, s5 eor s3, s7 movw s4, s0 movw s6, s2 and s4, s6 eor s4, t0 and s6, s4 eor s6, t2 and s5, s7 eor s5, t1 and s7, s5 eor s7, t3

129 Chapter 10. Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE

10.4.3 Description of PRIDE Similar to PRINCE, the cipher makes use of the FX-construction [Bir05]. A pre-whitening key k0 and post-whitening key k2 are derived from one half of k, while the second half serves as basis k1 for the round keys, i.e.,

k = k0||k1 with k2 = k0. Moreover, in order to allow an efficient bitsliced implementation, the cipher starts and ends with a bit-permutation. This clearly does not influence the security of PRIDE in any way. Note that in a bitsliced implementation, none of the permutations P nor P −1 used in PRIDE has to be actually implemented explicitly. The cipher has 20 rounds, of which the first 19 are identical. Subkeys are different for each round, i.e., the subkey for round i is given by fi(k1). We define (0) (1) (2) (3) fi(k1) = k10 ||gi (k11 )||k12 ||gi (k13 )||k14 ||gi (k15 )||k16 ||gi (k17 ) as the subkey derivation function with four byte-local modifiers of the key as (0) (1) gi (x) = (x + 193i) mod 256, gi (x) = (x + 165i) mod 256, (2) (3) gi (x) = (x + 81i) mod 256, gi (x) = (x + 197i) mod 256, which simply add one of four constants to every other byte of k1. The overall structure of the cipher is depicted in the following.

The round function R of the cipher shows a classical SPN: The state is XORed with the round key, fed into 16 parallel 4-bit Sboxes and then permuted and processed by the linear layer.

130 10.4. The Software-oriented Lightweight Block Cipher PRIDE

The difference between R and R0 is that in the latter no more diffusion is necessary, therefore the last round ends after the substitution layer. With the software-friendly matrices we have found, the linear layer is defined as follows (cf. Theorem 10.2.1):

−1 16 L := P ◦ (L0 × L1 × L2 × L3) ◦ P where P := P1,1,1,1.

The test vectors for the cipher are provided in the appendix (Part V).

10.4.4 Performance Analysis

As already mentioned, one round of our proposed cipher PRIDE consists of a linear layer, a substitution layer, a key addition, and a round constant addition (key update). In a software implementation of PRIDE on a microcontroller, we also perform branching in each round of the cipher in addition to the previously listed layers. Adding up all these costs gives us the total implementation cost for one round of the cipher. The total cost can roughly be calculated by multiplying the number of rounds with the cost of each round. Note that we should subtract the cost of one linear layer from the overall cost, as PRIDE has no linear layer in the last round. The software implementation cost of the round function of PRIDE on Atmel AVR ATmega8A 8-bit microcontroller is presented in the following. Key update Key addition Sbox Layer Linear Layer Total Time (cycles) 4 8 20 36 68 Size (bytes) 8 16 40 72 136

Comparing PRIDE to existing ciphers in literature, we can see that it outperforms many of them significantly both in terms of cycle count and code size. Note that we are not using any lookup tables in our implementation, in turn no RAMs6. The comparison with existing implementations is given in Table 10.2. In the table, the first row is the time (performance) in clock cycles, the second row is the code size in bytes, and the third row is the equivalent rounds. The third row expresses the number of rounds for the given ciphers that would result in a total running time similar to PRIDE. Note that we do not list the RAM utilization for the ciphers under comparison in the table. In the implementation of PRIDE, our target was to be fast and at the same time compact. Note that we do not exclude data & key read and data write back as well as the whitening steps in our results (these are omitted in SIMON and SPECK numbers). Although the given numbers are just for encryption, decryption overhead is also acceptable: It costs 1570 clock cycles and 282 bytes. A cautionary note is indicated for the above comparison for several reasons. AES, Serpent, CLEFIA, and NOEKOEN are working on 128-bit blocks; so, for a cycle per byte comparison, their cycle count has to be divided by a factor of two. Moreover, the ciphers differ in the claimed

6Which has the additional advantage of increased resistance against cache-timing attacks.

131 Chapter 10. Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE

Table 10.2: Performance comparison of PRIDE to other block ciphers 12] 13] + 13] 13] + 13] 13] + + 13] + + + 13] + -128 AES-128 [EKM Serpent-128 [EKM PRESENT-128 [EKM CLEFIA-128 [EKM SEA-96 [SPGQ06] NOEKEON-128 [EGG PRINCE ITUbee-80 [KDH13] SIMON-64/128 [BSS SPECK-64/96 [BSS SPECK-64/128 [BSS PRIDE t(cyc) 3159 49314 10792 28648 17745 23517 3614 2607 2000 1152 1200 1514 bytes 1570 7220 660 3046 386 364 1108 716 282 182 186 266 eq.r. 5/10 1/32 4/31 1/18 8/92 1/16 5/12 12/20 33/44 34/26 34/27

security level and key-size. PRIDE does not claim any resistance against related-key attacks (and actually can be distinguished trivially in this setting) and also generic time-memory trade- offs are possible against PRIDE in contrast to most other ciphers. Besides those restrictions, the security margin in PRIDE in terms of the number of rounds is sufficient. One can see that PRIDE is comparable to SPECK-64/96 and SPECK-64/128 (members of NSA’s software-cipher family), which are based on a Feistel structure and use modular addi- tions as the main source of non-linearity. In addition to the above table, the recent work of Grosso et al. [GLSV14] presents LS- Designs. This is a family of block ciphers that can systematically take advantage of bitslicing in a principled manner. In this work, the authors make use of lookup tables. Therefore, a direct comparison with PRIDE is not fair as the use of lookup tables does not minimize the linear layer cost. However, to have an idea, we can try to estimate the cost of the 64-bit case of this family. They suggest two options: The first uses a 4-bit Sbox with 16-bit Lbox, and the second uses 8-bit Sbox with 8-bit Lbox. The first option has 8 rounds, which results in 64 non-linear operations, 128 XORs, and 128 table lookups in total. The second one has 6 rounds, which takes 72 non-linear operations, 144 XORs, and 48 table lookups. For linear layer cost, we consider the XOR cost together with table lookups. Unfortunately, it is not easy to estimate the overall cost of the given two options on AVR platform as the table lookups take more than one cycle compared to the non-linear and linear operations. Another important point here to mention is that the use of lookup tables result in a huge memory utilization. In order to observe the software performance of PRIDE on different microcontrollers, it is also implemented in C (a reference implementation) and its runtime, code size, and energy consumption is evaluated on the evaluation board EnergyMicro EF32GG-STK37007. As the compiler, we selected SimplicityStudio IDE as in the case of PRINCE. The results are given in Table 10.3.

7Our C code was run on this microcontroller by Jan Zimmer.

132 10.5. Conclusion

Table 10.3: Performance of PRIDE on 32-bit ARM CORTEX-M3

Run time Code size Energy (clock cycles) (bytes) (µJ) [Enc/Dec] [Enc/Dec] PRIDE 3921/4267 863 19.1/20.3

Note that this is not the most efficient 32-bit design as it is a direct mapping of the 8-bit assembly implementation of PRIDE to C. We therefore come up with an optimized 32-bit implementation on another 32-bit ARM platform, STM32F4DISCOVERY8. The results are given in Table 10.4.

Table 10.4: Performance of PRIDE on 32-bit STM32F4DISCOVERY

Run time Code size (clock cycles) (bytes) [Enc/Dec] [Enc/Dec] PRINCE 1477/1541 354/440

Finally, we note that, despite its target being software implementations, PRIDE is also efficient in hardware. It can be considered a hardware-friendly design, due to its cheap linear and Sbox layers.

10.4.5 Security Analysis The security analysis of PRIDE is not in the scope of this thesis. Due to the interdisciplinary nature of this work, the author was focused on the search and design of efficient building blocks for the cipher, together with the implementations. Therefore, we leave the security discussion of PRIDE to the original publications [ADK+14a, ADK+14b], in which the security claims for PRIDE and a security evaluation of its general construction regarding classical attacks (linear, differential) are provided. It is worth mentioning that the performed security analysis together with third-party analyses have not revealed any critical and/or practical weaknesses of PRIDE so far; however, further analysis is of course encouraged.

10.5 Conclusion

In this work, we presented a framework, which can be extended in the future, in order to construct linear layers for block ciphers that allows to trade security against efficiency. Then, we proposed a hardware architecture to search for very efficient examples of this construction

8Implemented by our student Jascha Kolberg.

133 Chapter 10. Software-efficiency – Going Further: A Lightweight Block Cipher PRIDE especially for software implementations of cryptographic block ciphers. Using this proposed highly-parallel and pipelined hardware search architecture, we found diffusion matrices with smallest possible cycle counts and code sizes. Moreover, we presented a new “software-oriented” cipher PRIDE dedicated for 8-bit microcontrollers, which makes use of the resulting linear layer diffusion matrices. Our proposal PRIDE significantly outperforms all academic solutions both in terms of code size and cycle count. In addition to our software-efficient cipher proposal, our work – to the best of our knowledge – is the first attempt to search for efficient cipher linear layers for software implementations using a hardware architecture. Our highly parallel architecture made it possible to search for various parameters in a very short amount of time, which would not be possible on a software search platform. For the hardware-powered search for linear layers, we can extend our work to cover a larger spectrum of microcontroller instruction sets, including the highly popular PIC and ARM [ARM] instruction sets, as a natural result of using FPGAs. Finally, regarding PRIDE, we obviously encourage further cryptanalysis.

134 Part IV

Conclusion

Chapter 11 Concluding Remarks

In this thesis, we have presented efficient implementations of lightweight primitives on different platforms and proposed new lightweight block ciphers targeting different metrics. In the following, we summarize our results and discuss the contributions of this thesis.

Contents of this Chapter 11.1 Implementations and Applications of Lightweight Primitives ...... 137 11.2 Lightweight Block Cipher Design ...... 138 11.3 ISE Design for (Lightweight) Crypto Primitives ...... 138

11.1 Implementations and Applications of Lightweight Primitives

In this thesis, we have presented efficient implementations of some prominent lightweight symmetric-key primitives and hash functions and their alternative implementations on differ- ent platforms for better performance. Furthermore, we have also shown some special use case examples (i.e., IPSecco) of lightweight cryptography. Firstly, we evaluated area, power, and energy consumption of lightweight block cipher im- plementations and different AES architectures. We discussed the differences in dynamic power consumption in relation to the area. Our results showed that the dynamic power consumption has an important effect on the energy and power consumption of these cryptographic imple- mentations. Following that, we switched our implementation focus to FPGAs, as they become more and more attractive alternatives for lightweight security applications. We made use of the existing RAMs on the FPGA in order to get a better slice count. We presented two differ- ent RAM-based implementations of the lightweight cipher algorithm PRESENT. With their lowest reported slice counts, both implementations are suitable for lightweight applications. Then, we performed a suitability analysis of the SHA-3 finalists for lightweight applications. We limited ourselves to reach lightweight figures for area, which also results in good numbers for average power consumption in most applications; and tried to get the lowest possible recorded gate counts for all five finalists. We avoided the use of block memories in order to be compatible with different platforms. We successfully reached our target of lowest gate count, and presented very compact implementations of SHA-3 finalists.

137 Chapter 11. Concluding Remarks

We later combined some results of Chapter 5 and 6, and presented an example lightweight cryptography application. We implemented lightweight building blocks for the execution of the popular IPSec protocol on reconfigurable hardware. We showed that even a complex protocol suite, which requires several cipher standards and modes of operation, can be implemented even on very low-cost and energy-efficient FPGAs. We selected algorithms that are generally designed for efficient realization in hardware and adapted them to the specification of our target device. We were therefore able to achieve a significant reduction in terms of area utilization compared to related work. Our results show that standardized security can be achieved on reconfigurable hardware with only a modest investment on slice resources. This leaves us with more space for other functionality on the target hardware.

11.2 Lightweight Block Cipher Design

This thesis proposes two novel “lightweight” ciphers, namely PRINCE and PRIDE. We first presented the new low-latency and low-cost cipher PRINCE, together with a vari- ety of its implementations on different platforms, ranging from ASICs and FPGAs to software implementations. PRINCE has a very small footprint and achieves the lowest latency com- pared to the existing lightweight solutions – without compromising security. The compactness of PRINCE is the result of an extensive search on the substitution and permutation layers of the cipher. Potential SCA-resistant constructions for PRINCE are also considered; however, we observed that such countermeasures increase the cipher area significantly. Later, we proceeded in the “software-efficiency” direction. We started with presenting a framework to construct linear layers for block ciphers that allows to trade security against efficiency. Then, we proposed a hardware architecture to search for very efficient examples of this construction especially for software implementations of cryptographic block ciphers. Using our proposed highly-parallel and pipelined hardware search architecture, we found diffusion matrices with smallest possible cycle counts and code sizes. Furthermore, we presented a new “software- oriented” cipher PRIDE dedicated for 8-bit microcontrollers, which uses these resulting linear layer diffusion matrices. Our proposal PRIDE significantly outperforms all academic solutions both in terms of code size and cycle count. In addition to our software-efficient cipher proposal, our work – to the best of our knowledge – is the first attempt to search for efficient cipher linear layers for software implementations using a hardware architecture. Our highly parallel architecture made it possible to search for various parameters in a very short amount of time, which would not be possible on a software search platform.

11.3 ISE Design for (Lightweight) Crypto Primitives

There have been many different ISE proposals for cryptographic applications so far. However, most of the previous works have been on the implementation of cipher-specific instructions. Furthermore, the previous works also do not target (lightweight) cryptography for resource- constrained devices. In this thesis, we were able to improve the software implementations of cryptographic block ciphers on small CPUs through a new crypto ISE proposal. We came up with a non-linear/linear ISE, NLU, which can implement non-linear and linear operations of block ciphers. Using the four new NLU instructions, we show that we achieve time-area product

138 11.3. ISE Design for (Lightweight) Crypto Primitives reductions of 20-68% for the widely-used ciphers PRESENT, Serpent, CLEFIA, and AES with a very low area overhead for the hardware extension unit.

139

Chapter 12 Directions for Future Research

This chapter points out the open problems, topics that might need further investiga- tion, and ideas for future research based on the problems and results of the research carried out during the course of this thesis.

Contents of this Chapter 12.1 Further Lightweight Crypto Implementations and Applications ...... 141 12.2 Potential Lighweight Cipher Design Ideas ...... 142 12.3 Further (Lightweight) Crypto ISE Ideas ...... 142

12.1 Further Lightweight Crypto Implementations and Applications

Within the scope of this thesis, we evaluated a number of well-known lightweight ciphers. An extended evaluation including any other existing and upcoming (lightweight) block and stream ciphers would definitely be more beneficial to serve as a design/selection guide for cryptogra- phers. In our evaluation, we used industrial tools in order to generate dynamic power consump- tion results; however, one can also investigate the HDL implementation of a cipher and estimate the dynamic power consumption through measuring the toggle activity. The main idea of this approach is to measure the total number of bit toggles that happen during the encryption of a single block of a given algorithm. Such a metric can be especially helpful when we consider energy. The energy consumption in CMOS circuits is dominated by the dynamic power dissi- pation and it is directly proportional to the number of bit toggles. Hence, it provides a good prediction for the energy consumption of a cipher. In the future, the relation between “the number of toggles from an HDL implementation” and “the power reports from the synthesis of this implementation” can be investigated. For our RAM-based PRESENT implementations on FPGA, further capabilities can be pro- posed. For instance, our on-slice Sbox and on-RAM Sbox designs are good candidates for the application of the shared Sbox technique in [PMK+11] and the block memory content scrambling technique given in [GM11] for SCA-resistance. In an extended version of this study, these SCA countermeasures can be applied and their effects on the area can be observed together with their resistance to SCA attacks. We have used some of our proposed implementations in an example application, IPSecco, in this thesis. We can apply SCA-resistant techniques also to IPSecco. Generic protection

141 Chapter 12. Directions for Future Research layers [GM11] can be integrated in future as well as countermeasures that are specifically de- signed to use in lightweight applications. Another interesting topic would be the investigation of different algorithms and their combinations, such as CLEFIA, Keccak, HIGHT, etc., and also the primitives that can make use of resource-sharing. One common solution for efficient resource-sharing is partial reconfiguration, which is only available on more powerful, expensive, and larger FPGAs. Therefore, further effort can be put into the integration and sharing of components between different ciphers (e.g., Sbox lookup tables, block RAMs, or registers for keys) in order to further reduce the size of our design. Furthermore, similar to the case of IPSecco, our other designs and implementations can also contribute to different applications. Moreover, we can also come up with realizations of different applications and prove the real-life use cases of lightweight cryptography.

12.2 Potential Lighweight Cipher Design Ideas

In our work, we identified some “unaddressed” metrics, such as low-latency and software- efficiency, in lightweight cryptography and proposed new lightweight block ciphers in order to provide solutions for these gaps. However, more design metrics can be identified and novel cryptographic primitives and design techniques can be suggested, also making use of the ideas and techniques that we suggested in our work. Especially, for the case of PRIDE, our frame- work for linear layer construction can be extended in the future. For the hardware-powered search for linear layers, we can extend our work to cover a larger spectrum of microcontroller instruction sets, including the highly popular PIC and ARM instruction sets, as a natural result of using FPGAs. Nevertheless, further implementations of PRINCE and PRIDE on different hardware and software platforms are encouraged together with further cryptanalysis.

12.3 Further (Lightweight) Crypto ISE Ideas

As a future work, we can address the necessary compiler modifications for the AVR instruction set in order to include NLU. This would be beneficial for a fully-running “AVR with NLU” toolchain. We can also propose a very simple custom microcontroller including NLU in its instruction set. In this case, we can come up with our own toolchain, which includes NLU in its compiler from scratch. The NLU ISE design can also be extended to different (4-bit, 16-bit, 32-bit, 64-bit, etc.) microcontroller architectures. 32-bit ARM instruction set can be a good starting point due to its wide utilization. Moreover, our proposed and future NLU instructions can be used to implement many other (lightweight) primitives. An extended evaluation of a large set of crypto primitives using NLU would be very beneficial in order to observe time-area product reductions.

142 Part V

Appendix

PRINCE Diffusion Matrices

In this appendix, we present the two 64 × 64 diffusion matrices (M 0 and M) used in PRINCE middle layer and round linear layers. PRINCE Diffusion Matrices

Matrix M 0

 0000100010001000000000000000000000000000000000000000000000000000  0100000001000100000000000000000000000000000000000000000000000000  0010001000000010000000000000000000000000000000000000000000000000   0001000100010000000000000000000000000000000000000000000000000000   1000100010000000000000000000000000000000000000000000000000000000   0000010001000100000000000000000000000000000000000000000000000000   0010000000100010000000000000000000000000000000000000000000000000   0001000100000001000000000000000000000000000000000000000000000000     1000100000001000000000000000000000000000000000000000000000000000   0100010001000000000000000000000000000000000000000000000000000000   0000001000100010000000000000000000000000000000000000000000000000   0001000000010001000000000000000000000000000000000000000000000000   1000000010001000000000000000000000000000000000000000000000000000   0100010000000100000000000000000000000000000000000000000000000000     0010001000100000000000000000000000000000000000000000000000000000   0000000100010001000000000000000000000000000000000000000000000000   0000000000000000100010001000000000000000000000000000000000000000   0000000000000000000001000100010000000000000000000000000000000000   0000000000000000001000000010001000000000000000000000000000000000   0000000000000000000100010000000100000000000000000000000000000000     0000000000000000100010000000100000000000000000000000000000000000   0000000000000000010001000100000000000000000000000000000000000000   0000000000000000000000100010001000000000000000000000000000000000   0000000000000000000100000001000100000000000000000000000000000000   0000000000000000100000001000100000000000000000000000000000000000   0000000000000000010001000000010000000000000000000000000000000000     0000000000000000001000100010000000000000000000000000000000000000   0000000000000000000000010001000100000000000000000000000000000000   0000000000000000000010001000100000000000000000000000000000000000   0000000000000000010000000100010000000000000000000000000000000000   0000000000000000001000100000001000000000000000000000000000000000  0  0000000000000000000100010001000000000000000000000000000000000000  M =    0000000000000000000000000000000010001000100000000000000000000000   0000000000000000000000000000000000000100010001000000000000000000   0000000000000000000000000000000000100000001000100000000000000000   0000000000000000000000000000000000010001000000010000000000000000   0000000000000000000000000000000010001000000010000000000000000000   0000000000000000000000000000000001000100010000000000000000000000     0000000000000000000000000000000000000010001000100000000000000000   0000000000000000000000000000000000010000000100010000000000000000   0000000000000000000000000000000010000000100010000000000000000000   0000000000000000000000000000000001000100000001000000000000000000   0000000000000000000000000000000000100010001000000000000000000000   0000000000000000000000000000000000000001000100010000000000000000     0000000000000000000000000000000000001000100010000000000000000000   0000000000000000000000000000000001000000010001000000000000000000   0000000000000000000000000000000000100010000000100000000000000000   0000000000000000000000000000000000010001000100000000000000000000   0000000000000000000000000000000000000000000000000000100010001000   0000000000000000000000000000000000000000000000000100000001000100     0000000000000000000000000000000000000000000000000010001000000010   0000000000000000000000000000000000000000000000000001000100010000   0000000000000000000000000000000000000000000000001000100010000000   0000000000000000000000000000000000000000000000000000010001000100   0000000000000000000000000000000000000000000000000010000000100010   0000000000000000000000000000000000000000000000000001000100000001     0000000000000000000000000000000000000000000000001000100000001000   0000000000000000000000000000000000000000000000000100010001000000   0000000000000000000000000000000000000000000000000000001000100010   0000000000000000000000000000000000000000000000000001000000010001   0000000000000000000000000000000000000000000000001000000010001000   0000000000000000000000000000000000000000000000000100010000000100  0000000000000000000000000000000000000000000000000010001000100000 0000000000000000000000000000000000000000000000000000000100010001

146 PRINCE Diffusion Matrices

Matrix M

 0000100010001000000000000000000000000000000000000000000000000000  0100000001000100000000000000000000000000000000000000000000000000  0010001000000010000000000000000000000000000000000000000000000000   0001000100010000000000000000000000000000000000000000000000000000   0000000000000000100010000000100000000000000000000000000000000000   0000000000000000010001000100000000000000000000000000000000000000   0000000000000000000000100010001000000000000000000000000000000000   0000000000000000000100000001000100000000000000000000000000000000     0000000000000000000000000000000010000000100010000000000000000000   0000000000000000000000000000000001000100000001000000000000000000   0000000000000000000000000000000000100010001000000000000000000000   0000000000000000000000000000000000000001000100010000000000000000   0000000000000000000000000000000000000000000000001000000010001000   0000000000000000000000000000000000000000000000000100010000000100     0000000000000000000000000000000000000000000000000010001000100000   0000000000000000000000000000000000000000000000000000000100010001   0000000000000000100010001000000000000000000000000000000000000000   0000000000000000000001000100010000000000000000000000000000000000   0000000000000000001000000010001000000000000000000000000000000000   0000000000000000000100010000000100000000000000000000000000000000     0000000000000000000000000000000010001000000010000000000000000000   0000000000000000000000000000000001000100010000000000000000000000   0000000000000000000000000000000000000010001000100000000000000000   0000000000000000000000000000000000010000000100010000000000000000   0000000000000000000000000000000000000000000000001000100000001000   0000000000000000000000000000000000000000000000000100010001000000     0000000000000000000000000000000000000000000000000000001000100010   0000000000000000000000000000000000000000000000000001000000010001   1000000010001000000000000000000000000000000000000000000000000000   0100010000000100000000000000000000000000000000000000000000000000   0010001000100000000000000000000000000000000000000000000000000000   0000000100010001000000000000000000000000000000000000000000000000  M =    0000000000000000000000000000000010001000100000000000000000000000   0000000000000000000000000000000000000100010001000000000000000000   0000000000000000000000000000000000100000001000100000000000000000   0000000000000000000000000000000000010001000000010000000000000000   0000000000000000000000000000000000000000000000001000100010000000   0000000000000000000000000000000000000000000000000000010001000100     0000000000000000000000000000000000000000000000000010000000100010   0000000000000000000000000000000000000000000000000001000100000001   1000100000001000000000000000000000000000000000000000000000000000   0100010001000000000000000000000000000000000000000000000000000000   0000001000100010000000000000000000000000000000000000000000000000   0001000000010001000000000000000000000000000000000000000000000000     0000000000000000000010001000100000000000000000000000000000000000   0000000000000000010000000100010000000000000000000000000000000000   0000000000000000001000100000001000000000000000000000000000000000   0000000000000000000100010001000000000000000000000000000000000000   0000000000000000000000000000000000000000000000000000100010001000   0000000000000000000000000000000000000000000000000100000001000100     0000000000000000000000000000000000000000000000000010001000000010   0000000000000000000000000000000000000000000000000001000100010000   1000100010000000000000000000000000000000000000000000000000000000   0000010001000100000000000000000000000000000000000000000000000000   0010000000100010000000000000000000000000000000000000000000000000   0001000100000001000000000000000000000000000000000000000000000000     0000000000000000100000001000100000000000000000000000000000000000   0000000000000000010001000000010000000000000000000000000000000000   0000000000000000001000100010000000000000000000000000000000000000   0000000000000000000000010001000100000000000000000000000000000000   0000000000000000000000000000000000001000100010000000000000000000   0000000000000000000000000000000001000000010001000000000000000000  0000000000000000000000000000000000100010000000100000000000000000 0000000000000000000000000000000000010001000100000000000000000000

147

PRINCE Test Vectors

In this appendix, we present five different test vectors for PRINCE encryption/de- cryption, with intermediate values. PRINCE Test Vectors

Encryption Test Vector 1 Test Vector 2 Test Vector 3 Test Vector 4 Test Vector 5 Plaintext 0000000000000000 0000000000000000 0000000000000000 ffffffffffffffff 0123456789abcdef Key (k0) 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000 0000000000000000 Key (k1) 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000 fedcba9876543210 k0 prime 0000000000000000 fffffffffffffffe 0000000000000000 0000000000000000 0000000000000000 Round key 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000 fedcba9876543210 After k0 add 0000000000000000 ffffffffffffffff 0000000000000000 ffffffffffffffff 0123456789abcdef After k1 add 0000000000000000 ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff After RC0 add 0000000000000000 ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff After Round1 Sbox bbbbbbbbbbbbbbbb 4444444444444444 4444444444444444 4444444444444444 4444444444444444 After Round1 Mix bbbbbbbbbbbbbbbb 4444444444444444 4444444444444444 4444444444444444 4444444444444444 Round1 constant 13198a2e03707344 13198a2e03707344 13198a2e03707344 13198a2e03707344 13198a2e03707344 After Round1 RC add a8a23195b8cbc8ff 575dce6a47343700 575dce6a47343700 575dce6a47343700 575dce6a47343700 After Round1 key add a8a23195b8cbc8ff 575dce6a47343700 a8a23195b8cbc8ff 575dce6a47343700 a98174f231600510 After Round2 Sbox 86832f7c06e0e644 c1c5ed98a12a21bb 86832f7c06e0e644 c1c5ed98a12a21bb 876f1a432f9bbcfb After Round2 Mix 81ccd0e7aed047ae d911f080ba9e00a9 81ccd0e7aed047ae d911f080ba9e00a9 9c5ce4904a3dcc3a Round2 constant a4093822299f31d0 a4093822299f31d0 a4093822299f31d0 a4093822299f31d0 a4093822299f31d0 After Round2 RC add 25c5e8c5874f767e 7d18c8a293013179 25c5e8c5874f767e 7d18c8a293013179 3855dcb263a2fdea After Round2 key add 25c5e8c5874f767e 7d18c8a293013179 da3a173a78b08981 7d18c8a293013179 c689662a15f6cffa After Round3 Sbox 3cecd6ec61a4191d 15f6e68372bf2f17 5828f128160b676f 15f6e68372bf2f17 e9679938fc49e448 After Round3 Mix b59d7610dd3755cb b9215fcae80eec12 57fe8cfe667efb59 b9215fcae80eec12 da322c482a11a223 Round3 constant 082efa98ec4e6c89 082efa98ec4e6c89 082efa98ec4e6c89 082efa98ec4e6c89 082efa98ec4e6c89 After Round3 RC add bdb38c8831793942 b10fa5520440809b 5fd076668a3097d0 b10fa5520440809b d21cd6d0c65fceaa After Round3 key add bdb38c8831793942 b10fa5520440809b a02f899975cf682f b10fa5520440809b 2cc06c48b00bfcba After Round4 Sbox 05026e662f1727a3 0fb48cc3baab6b70 8b3467771ce49634 0fb48cc3baab6b70 3eeb9ea60bb04e08 After Round4 Mix 3e1eec758e6eb76e 63b6ba98add6835b e6a96bcc7b166771 63b6ba98add6835b f78a99273c256202 Round4 constant 452821e638d01377 452821e638d01377 452821e638d01377 452821e638d01377 452821e638d01377 After Round4 RC add 7b36cd93b6bea419 269e9b7e9506902c a3814a2a43c67406 269e9b7e9506902c b2a2b8c104f57175 After Round4 key add 7b36cd93b6bea419 269e9b7e9506902c 5c7eb5d5bc398bf9 269e9b7e9506902c 4c7e025972a14365 After Round5 Sbox 1029e572090d8af7 397d701d7cb97b3e ce1d0c5c0e276047 397d701d7cb97b3e ae1db3c7138fa29c After Round5 Mix 9dc4b16bc9a01285 3d0a636d4e7a39ae 3113cf3691a1419c 3d0a636d4e7a39ae 5c6d1830f6ea5344 Round5 constant be5466cf34e90c6c be5466cf34e90c6c be5466cf34e90c6c be5466cf34e90c6c be5466cf34e90c6c After Round5 RC add 2390d7a4fd491ee9 835e05a27a9335c2 8f47a9f9a5484df0 835e05a27a9335c2 e2397effc2035f28 After Round5 key add 2390d7a4fd491ee9 835e05a27a9335c2 70b856065ab7b20f 835e05a27a9335c2 1ce5c467b4576d38 After S layer 327b518a45a7fdd7 62cdbc8318722ce3 1b06c9b9c80103b4 62cdbc8318722ce3 fedcea910ac19526 After Mix prime layer e6faaf5681eb5d77 4bb1c671dc3e439d ce97aefc15dcef1c 4bb1c671dc3e439d cfedb6455fb668db After Inverse S layer c81441d8a7c0de99 f0075897e52cf26e 5c694c157de5c175 f0075897e52cf26e 51ce08fdd1088ae0 After IRound6 key add c81441d8a7c0de99 f0075897e52cf26e a396b3ea821a3e8a f0075897e52cf26e af12b265a75cb8f0 IRound6 constant 7ef84f78fd955cb1 7ef84f78fd955cb1 7ef84f78fd955cb1 7ef84f78fd955cb1 7ef84f78fd955cb1 After IRound6 RC add b6ec0ea05a558228 8eff17ef18b9aedf dd6efc927f8f623b 8eff17ef18b9aedf d1eafd1d5ac9e441 After IRound6 Inv Mix 4fe93516350bee8c d9e8036edb118bdd d3d63951a491a114 d9e8036edb118bdd 544df06a3ea070d5 After IRound6 Inv Sbox f1c62d782db0cca5 e6cab28ce077a0ee e2e826d74f67477f e6cab28ce077a0ee dffe1b842c4b9bed After IRound7 key add f1c62d782db0cca5 e6cab28ce077a0ee 1d17d928b098b880 e6cab28ce077a0ee 2122a11c5a1fa9fd IRound7 constant 85840851f1ac43aa 85840851f1ac43aa 85840851f1ac43aa 85840851f1ac43aa 85840851f1ac43aa After IRound7 RC add 74422529dc1c8f0f 634ebadd11dbe344 9893d1794134fb2a 634ebadd11dbe344 a4a6a94dabb3ea57 After IRound7 Inv Mix 5f392caef64ea44e 4aa1d40e9f6b413a 339177a9bf461f9d 4aa1d40e9f6b413a dd8e9ae5e145f109 After IRound7 Inv Sbox d126354c18fc4ffc f447efbc6180f724 2267994601f8716e f447efbc6180f724 eeac64cdc7fd17b6 After IRound8 key add d126354c18fc4ffc f447efbc6180f724 dd9866b9fe078e91 f447efbc6180f724 1070de55b1a925a6 IRound8 constant c882d32f25323c54 c882d32f25323c54 c882d32f25323c54 c882d32f25323c54 c882d32f25323c54 After IRound8 RC add 19a4e6633dce73a8 3cc53c9344b2cb70 151ab596db35b2c5 3cc53c9344b2cb70 d8f20d7a949b19f2 After IRound8 Inv Mix cf4afc99dd61ecf5 33bbb85c0c809184 4525782a99d1b888 33bbb85c0c809184 faec542f80676107 After IRound8 Inv Sbox 51f41566ee87c51d 22000ad5b5ab67af fd3d9a3466e70aaa 22000ad5b5ab67af 14c5df31ab8987b9 After IRound9 key add 51f41566ee87c51d 22000ad5b5ab67af 02c265cb9918f555 22000ad5b5ab67af ea1965a9ddddb5a9 IRound9 constant 64a51195e0e3610d 64a51195e0e3610d 64a51195e0e3610d 64a51195e0e3610d 64a51195e0e3610d After IRound9 RC add 355104f30e64a410 46a51b40554806a2 6667745e79fb9458 46a51b40554806a2 8ebc743c3d3ed4a4 After IRound9 Inv Mix 5205150401555906 2062fd430fd41410 5ed50d8a1f8bb37d 2062fd430fd41410 da73336b969e12a6 After IRound9 Inv Sbox d3bd7dbfb7ddd6b8 3b831ef2b1ef7f7b dcedbea471a0029e 3b831ef2b1ef7f7b e4922280686c7348 After IRound10 key add d3bd7dbfb7ddd6b8 3b831ef2b1ef7f7b 2312415b8e5ffd61 3b831ef2b1ef7f7b 1a4e98181e384158 IRound10 constant d3b5a399ca0c2399 d3b5a399ca0c2399 d3b5a399ca0c2399 d3b5a399ca0c2399 d3b5a399ca0c2399 After IRound10 RC add 0008de267dd1f521 e836bd6b7be35ce2 f0a7e2c24453def8 e836bd6b7be35ce2 c9fb3b81d43462c1 After IRound10 Inv Mix af38aef5ea1d64b1 8bd9e415e80dc8b1 a1497849ac46f6e5 8bd9e415e80dc8b1 7bcc2df2f1534db3 After IRound10 Inv Sbox 412a4c1dc47e8f07 a0e6cf7dcabe5a07 47f69af645f818cd a0e6cf7dcabe5a07 90553e1317d2fe02 After RC11 add 818665aa0d02dfda 604ae6ca03c20ada 875ab3418c844810 604ae6ca03c20ada 50f917a4deaeaedf After k1 add 818665aa0d02dfda 604ae6ca03c20ada 78a54cbe737bb7ef 604ae6ca03c20ada ae25ad3ca8fa9ccf After k0 prime add 818665aa0d02dfda 9fb51935fc3df524 78a54cbe737bb7ef 604ae6ca03c20ada ae25ad3ca8fa9ccf

150 PRINCE Test Vectors

Decryption Test Vector 1 Test Vector 2 Test Vector 3 Test Vector 4 Test Vector 5 Ciphertext 818665aa0d02dfda 9fb51935fc3df524 78a54cbe737bb7ef 604ae6ca03c20ada ae25ad3ca8fa9ccf Key (k0) 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000 0000000000000000 Key (k1) 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000 fedcba9876543210 k0 prime 0000000000000000 fffffffffffffffe 0000000000000000 0000000000000000 0000000000000000 Round key c0ac29b7c97c50dd c0ac29b7c97c50dd 3f53d6483683af22 c0ac29b7c97c50dd 3e70932fbf2862cd After k0 prime add 818665aa0d02dfda 604ae6ca03c20ada 78a54cbe737bb7ef 604ae6ca03c20ada ae25ad3ca8fa9ccf After k1 add 412a4c1dc47e8f07 a0e6cf7dcabe5a07 47f69af645f818cd a0e6cf7dcabe5a07 90553e1317d2fe02 After RC0 add 412a4c1dc47e8f07 a0e6cf7dcabe5a07 47f69af645f818cd a0e6cf7dcabe5a07 90553e1317d2fe02 After Round1 Sbox af38aef5ea1d64b1 8bd9e415e80dc8b1 a1497849ac46f6e5 8bd9e415e80dc8b1 7bcc2df2f1534db3 After Round1 Mix 0008de267dd1f521 e836bd6b7be35ce2 f0a7e2c24453def8 e836bd6b7be35ce2 c9fb3b81d43462c1 Round1 constant 13198a2e03707344 13198a2e03707344 13198a2e03707344 13198a2e03707344 13198a2e03707344 After Round1 RC add 131154087ea18665 fb2f374578932fa6 e3be68ec4723adbc fb2f374578932fa6 dae2b1afd7441185 After Round1 key add d3bd7dbfb7ddd6b8 3b831ef2b1ef7f7b dcedbea471a0029e 3b831ef2b1ef7f7b e4922280686c7348 After Round2 Sbox 5205150401555906 2062fd430fd41410 5ed50d8a1f8bb37d 2062fd430fd41410 da73336b969e12a6 After Round2 Mix 355104f30e64a410 46a51b40554806a2 6667745e79fb9458 46a51b40554806a2 8ebc743c3d3ed4a4 Round2 constant a4093822299f31d0 a4093822299f31d0 a4093822299f31d0 a4093822299f31d0 a4093822299f31d0 After Round2 RC add 91583cd127fb95c0 e2ac23627cd73772 c26e4c7c5064a588 e2ac23627cd73772 2ab54c1e14a1e574 After Round2 key add 51f41566ee87c51d 22000ad5b5ab67af fd3d9a3466e70aaa 22000ad5b5ab67af 14c5df31ab8987b9 After Round3 Sbox cf4afc99dd61ecf5 33bbb85c0c809184 4525782a99d1b888 33bbb85c0c809184 faec542f80676107 After Round3 Mix 19a4e6633dce73a8 3cc53c9344b2cb70 151ab596db35b2c5 3cc53c9344b2cb70 d8f20d7a949b19f2 Round3 constant 082efa98ec4e6c89 082efa98ec4e6c89 082efa98ec4e6c89 082efa98ec4e6c89 082efa98ec4e6c89 After Round3 RC add 118a1cfbd1801f21 34ebc60ba8fca7f9 1d344f0e377bde4c 34ebc60ba8fca7f9 d0dcf7e278d5757b After Round3 key add d126354c18fc4ffc f447efbc6180f724 2267994601f8716e f447efbc6180f724 eeac64cdc7fd17b6 After Round4 Sbox 5f392caef64ea44e 4aa1d40e9f6b413a 339177a9bf461f9d 4aa1d40e9f6b413a dd8e9ae5e145f109 After Round4 Mix 74422529dc1c8f0f 634ebadd11dbe344 9893d1794134fb2a 634ebadd11dbe344 a4a6a94dabb3ea57 Round4 constant 452821e638d01377 452821e638d01377 452821e638d01377 452821e638d01377 452821e638d01377 After Round4 RC add 316a04cfe4cc9c78 26669b3b290bf033 ddbbf09f79e4e85d 26669b3b290bf033 e18e88ab9363f920 After Round4 key add f1c62d782db0cca5 e6cab28ce077a0ee e2e826d74f67477f e6cab28ce077a0ee dffe1b842c4b9bed After Round5 Sbox 4fe93516350bee8c d9e8036edb118bdd d3d63951a491a114 d9e8036edb118bdd 544df06a3ea070d5 After Round5 Mix b6ec0ea05a558228 8eff17ef18b9aedf dd6efc927f8f623b 8eff17ef18b9aedf d1eafd1d5ac9e441 Round5 constant be5466cf34e90c6c be5466cf34e90c6c be5466cf34e90c6c be5466cf34e90c6c be5466cf34e90c6c After Round5 RC add 08b8686f6ebc8e44 30ab71202c50a2b3 633a9a5d4b666e57 30ab71202c50a2b3 6fbe9bd26e20e82d After Round5 key add c81441d8a7c0de99 f0075897e52cf26e 5c694c157de5c175 f0075897e52cf26e 51ce08fdd1088ae0 After S layer e6faaf5681eb5d77 4bb1c671dc3e439d ce97aefc15dcef1c 4bb1c671dc3e439d cfedb6455fb668db After Mix prime layer 327b518a45a7fdd7 62cdbc8318722ce3 1b06c9b9c80103b4 62cdbc8318722ce3 fedcea910ac19526 After Inverse S layer 2390d7a4fd491ee9 835e05a27a9335c2 70b856065ab7b20f 835e05a27a9335c2 1ce5c467b4576d38 After IRound6 key add e33cfe1334354e34 43f22c15b3ef651f 4feb804e6c341d2d 43f22c15b3ef651f 229557480b7f0ff5 IRound6 constant 7ef84f78fd955cb1 7ef84f78fd955cb1 7ef84f78fd955cb1 7ef84f78fd955cb1 7ef84f78fd955cb1 After IRound6 RC add 9dc4b16bc9a01285 3d0a636d4e7a39ae 3113cf3691a1419c 3d0a636d4e7a39ae 5c6d1830f6ea5344 After IRound6 Inv Mix 1029e572090d8af7 397d701d7cb97b3e ce1d0c5c0e276047 397d701d7cb97b3e ae1db3c7138fa29c After IRound6 Inv Sbox 7b36cd93b6bea419 269e9b7e9506902c 5c7eb5d5bc398bf9 269e9b7e9506902c 4c7e025972a14365 After IRound7 key add bb9ae4247fc2f4c4 e632b2c95c7ac0f1 632d639d8aba24db e632b2c95c7ac0f1 720e9176cd8921a8 IRound7 constant 85840851f1ac43aa 85840851f1ac43aa 85840851f1ac43aa 85840851f1ac43aa 85840851f1ac43aa After IRound7 RC add 3e1eec758e6eb76e 63b6ba98add6835b e6a96bcc7b166771 63b6ba98add6835b f78a99273c256202 After IRound7 Inv Mix 05026e662f1727a3 0fb48cc3baab6b70 8b3467771ce49634 0fb48cc3baab6b70 3eeb9ea60bb04e08 After IRound7 Inv Sbox bdb38c8831793942 b10fa5520440809b a02f899975cf682f b10fa5520440809b 2cc06c48b00bfcba After IRound8 key add 7d1fa53ff805699f 71a38ce5cd3cd046 9f7c5fd1434cc70d 71a38ce5cd3cd046 12b0ff670f239e77 IRound8 constant c882d32f25323c54 c882d32f25323c54 c882d32f25323c54 c882d32f25323c54 c882d32f25323c54 After IRound8 RC add b59d7610dd3755cb b9215fcae80eec12 57fe8cfe667efb59 b9215fcae80eec12 da322c482a11a223 After IRound8 Inv Mix 3cecd6ec61a4191d 15f6e68372bf2f17 5828f128160b676f 15f6e68372bf2f17 e9679938fc49e448 After IRound8 Inv Sbox 25c5e8c5874f767e 7d18c8a293013179 da3a173a78b08981 7d18c8a293013179 c689662a15f6cffa After IRound9 key add e569c1724e3326a3 bdb4e1155a7d61a4 e569c1724e3326a3 bdb4e1155a7d61a4 f8f9f505aadead37 IRound9 constant 64a51195e0e3610d 64a51195e0e3610d 64a51195e0e3610d 64a51195e0e3610d 64a51195e0e3610d After IRound9 RC add 81ccd0e7aed047ae d911f080ba9e00a9 81ccd0e7aed047ae d911f080ba9e00a9 9c5ce4904a3dcc3a After IRound9 Inv Mix 86832f7c06e0e644 c1c5ed98a12a21bb 86832f7c06e0e644 c1c5ed98a12a21bb 876f1a432f9bbcfb After IRound9 Inv Sbox a8a23195b8cbc8ff 575dce6a47343700 a8a23195b8cbc8ff 575dce6a47343700 a98174f231600510 After IRound10 key add 680e182271b79822 97f1e7dd8e4867dd 97f1e7dd8e4867dd 97f1e7dd8e4867dd 97f1e7dd8e4867dd IRound10 constant d3b5a399ca0c2399 d3b5a399ca0c2399 d3b5a399ca0c2399 d3b5a399ca0c2399 d3b5a399ca0c2399 After IRound10 RC add bbbbbbbbbbbbbbbb 4444444444444444 4444444444444444 4444444444444444 4444444444444444 After IRound10 Inv Mix bbbbbbbbbbbbbbbb 4444444444444444 4444444444444444 4444444444444444 4444444444444444 After IRound10 Inv Sbox 0000000000000000 ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff After RC11 add c0ac29b7c97c50dd 3f53d6483683af22 3f53d6483683af22 3f53d6483683af22 3f53d6483683af22 After k1 add 0000000000000000 ffffffffffffffff 0000000000000000 ffffffffffffffff 0123456789abcdef After k0 add 0000000000000000 0000000000000000 0000000000000000 ffffffffffffffff 0123456789abcdef

151

PRIDE Diffusion Matrices

In this appendix, we present PRIDE’s software-friendly four 16 × 16 matrices L0, L1, L2, L3 and their inverses. PRIDE Diffusion Matrices

 0000100010001000  0000010001000100    0000001000100010   0000000100010001     1000000010001000     0100000001000100   0010000000100010   0001000000010001  L & L−1 =   0 0  1000100000001000     0100010000000100     0010001000000010   0001000100000001     1000100010000000   0100010001000000   0010001000100000  0001000100010000

 1100000000010000  0110000000001000    0011000000000100   0001100000000010     0000110000000001     0000011010000000   0000001101000000   1000000100100000  L =   1  1000000000011000     0100000000001100     0010000000000110   0001000000000011     0000100010000001   0000010011000000   0000001001100000  0000000100110000

 0000001100000010  1000000100000001    1100000010000000   0110000001000000     0011000000100000     0001100000010000   0000110000001000   0000011000000100  L−1 =   1  0001000000011000     0000100000001100     0000010000000110   0000001000000011     0000000110000001   1000000011000000   0100000001100000  0010000000110000

154 PRIDE Diffusion Matrices

 0000110000000001  0000011010000000    0000001101000000   1000000100100000     1100000000010000     0110000000001000   0011000000000100   0001100000000010  L =   2  0000100010000001     0000010011000000     0000001001100000   0000000100110000     1000000000011000   0100000000001100   0010000000000110  0001000000000011

 0011000000100000  0001100000010000    0000110000001000   0000011000000100     0000001100000010     1000000100000001   1100000010000000   0110000001000000  L−1 =   2  0000000110000001     1000000011000000     0100000001100000   0010000000110000     0001000000011000   0000100000001100   0000010000000110  0000001000000011

 1000100000001000  0100010000000100    0010001000000010   0001000100000001     1000100010000000     0100010001000000   0010001000100000   0001000100010000  L & L−1 =   3 3  0000100010001000     0000010001000100     0000001000100010   0000000100010001     1000000010001000   0100000001000100   0010000000100010  0001000000010001

155

PRIDE Test Vectors

In this appendix, we present five different test vectors for PRIDE encryption/de- cryption, with intermediate values. PRIDE Test Vectors

Encryption Test Vector 1 Test Vector 2 Test Vector 3 Test Vector 4 Test Vector 5 Plaintext 0000000000000000 0000000000000000 0000000000000000 ffffffffffffffff 0123456789abcdef Key (k0) 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000 0000000000000000 Key (k1) 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000 fedcba9876543210 After Pre-whitening 0000000000000000 ffffffffffffffff 0000000000000000 ffffffffffffffff 0123456789abcdef Round1 key 00c100a5005100c5 00c100a5005100c5 ffc0ffa4ff50ffc4 00c100a5005100c5 fe9dba3d76a532d5 After Round1 Key add 00c100a5005100c5 ff3eff5affaeff3a ffc0ffa4ff50ffc4 ff3eff5affaeff3a ffbeff5aff0eff3a After Round1 Sbox 00d000c4000100a5 00b40030ff0eff5a 00d000c4ff00ffa4 00b40030ff0eff5a 00140030ffaeff7a After Round1 Linear layer dd0d266a80815aff ff4b818207f6a500 dd0d266a00ff4a11 ff4b818207f6a500 554181825706a722 Round2 key 0082004a00a2008a 0082004a00a2008a ff81ff49ffa1ff89 0082004a00a2008a fe5ebae276f6329a After Round2 Key add dd8f262080235a75 ffc981c80754a58a 228cd923ff5eb598 ffc981c80754a58a ab1f3b6021f095b8 After Round2 Sbox 84235a55dd8e7e24 869ca4ca7b41a188 ff5e6c9a4e9695b1 869ca4ca7b41a188 0af0b4d8abcf9ba8 After Round2 Linear layer 59fe44a52114db81 3d27bb5e7856331a 44e560116639d7f3 3d27bb5e7856331a 5fa51bff2892a89b Round3 key 004300ef00f3004f 004300ef00f3004f ff42ffeefff2ff4e 004300ef00f3004f fe1fba877647325f After Round3 Key add 59bd444a21e7dbce 3d64bbb178a53355 bba79fff99cb28bd 3d64bbb178a53355 a1baa1785ed59ac4 After Round3 Sbox 61efdb8c18315c4a 41850bf43ce0b351 026cb176bbc32ebd 41850bf43ce0b351 ffed9a943b3ebb6c After Round3 Linear layer 078908771a283d2b c90dbae334539d7f 8ae461652d991784 c90dbae334539d7f ccde0b77cb92c611 Round4 key 0004009400440014 0004009400440014 ff03ff93ff43ff13 0004009400440014 fee0ba2c76983224 After Round4 Key add 078d08e31a6c3d3f c909ba7734179d6b 75e79ef6d2dae897 c909ba7734179d6b 323eb15bbd0af435 After Round4 Sbox 1aed355f17c01da3 bc16ad7c651d9f6b c63c7a4537e3acb7 bc16ad7c651d9f6b 8d10453f372eb475 After Round4 Linear layer 9265a53af3d1f648 bc16158974c5d024 9369a40464611d06 bc16158974c5d024 c954364f824aa869 Round5 key 00c50039009500d9 00c50039009500d9 ffc4ff38ff94ffd8 00c50039009500d9 fea1bad176e932e9 After Round5 Key add 92a0a503f344f691 bcd315b07450d0fd 6cad5b3c9bf5e2de bcd315b07450d0fd 37f58c9ef4a39a80 After Round5 Sbox 73445791c1a0e483 60c0c4edfc13d1b1 d3d9f9eabd65e25c 60c0c4edfc13d1b1 f0371e0227f78a9c After Round5 Linear layer 370075c274ec92f5 ca6a2275d955d7b7 79735d00de0c09b7 ca6a2275d955d7b7 4b8c322e6d7eebfd Round6 key 008600de00e6009e 008600de00e6009e ff85ffddffe5ff9d 008600de00e6009e fe62ba76763a32ae After Round6 Key add 3786751c740a926b caec22abd9b3d729 86f6a2dd21e9f62a caec22abd9b3d729 b5ee88581b44d953 After Round6 Sbox 410ee6637784131c db1bd78a19e63329 a33dd6e304d7a61e db1bd78a19e63329 9b0cd11324ee885a After Round6 Linear layer fab530cbdbb1e3ec 17d72c2bc1049288 d44a64f72b7c2d95 17d72c2bc1049288 75e2ea78b1dba577 Round7 key 0047008300370063 0047008300370063 ff46ff82ff36ff62 0047008300370063 fe23ba1b768b3273 After Round7 Key add faf23048db86e38f 17902ca8c13392eb 2b0c9b75d44ad2f7 17902ca8c13392eb 8bc15063c7509704 After Round7 Sbox ebc6f38f1974214c c5b392cb9713beab df4e42b7690adb77 c5b392cb9713beab c711d7444cc11423 After Round7 Linear layer 1439687788dff79a d4a2e97002e3effa 57c67b84be9911bd d4a2e97002e3effa 7caa5ab1ade56750 Round8 key 0008002800880028 0008002800880028 ff07ff27ff87ff27 0008002800880028 fee4bac076dc3238 After Round8 Key add 1431685f8857f7b2 d4aae958026befd2 a8c184a3411eee9a d4aae958026befd2 824ee071db395568 After Round8 Sbox 8846ffe59c75f43a c263ef9a16a8efd0 c19fee986859ecbb c263ef9a16a8efd0 5b79955993177160 After Round8 Linear layer aa642f8ee00618d6 79d8e492f79d1c23 7a24f7a3277399ce 79d8e492f79d1c23 5b7974cac0a56071 Round9 key 00c900cd00d900ed 00c900cd00d900ed ffc8ffccffd8ffec 00c900cd00d900ed fea5ba65762d32fd After Round9 Key add aaad2f43e0df183b 7911e45ff7441cce 85ec086fd8ab6622 7911e45ff7441cce a5dcceafb688528c After Round9 Sbox cade3878a2f50f33 9755f88ae9110c5f d8c76e09cded4466 9755f88ae9110c5f 3204d404b5d85aaf After Round9 Linear layer 9f8b8b7c8425ccf0 79bb5d042b07396a 3629fab693c76644 79bb5d042b07396a 67515db481ef05f0 Round10 key 008a0072002a00b2 008a0072002a00b2 ff89ff71ff29ffb1 008a0072002a00b2 fe66ba0a767e32c2 After Round10 Key add 9f018b0e840fcc42 79315d762b2d39d8 c9a005c76cee99f5 79315d762b2d39d8 9937e7bef7913732 After Round10 Sbox 0f0f4c4c930d8b02 721d30fc492d5d5a 6d6e9d33c48281c5 721d30fc492d5d5a 76a7d0a2c995273e After Round10 Linear layer 0f0fb6eacdb2139a eb84b7182b2f2d2a 5e5d3f37958fc581 eb84b7182b2f2d2a ba6b64ef6fc3b6af Round11 key 004b0017007b0077 004b0017007b0077 ff4aff16ff7aff76 004b0017007b0077 fe27baaf76cf3287 After Round11 Key add 0f44b6fdcdc913ed ebcfb70f2b542d5d a117c0216af53af7 ebcfb70f2b542d5d 444cde40190c8428 After Round11 Sbox cb8d97248c4032fd 885b0e59e396b51f eaf47ad6cbc38ae3 885b0e59e396b51f 5d4c9c285844c640 After Round11 Linear layer e9af99f479a8ce01 66b5d85109e31fb5 150b38a1249e1c75 66b5d85109e31fb5 5d4ce45face3ae28 Round12 key 000c00bc00cc003c 000c00bc00cc003c ff0bffbbffcbff3b 000c00bc00cc003c fee8ba547620324c After Round12 Key add e9a399487964ce3d 66b9d8ed092f1f89 ea00c71adb55e34e 66b9d8ed092f1f89 a3a45e0bdac39c64 After Round12 Sbox f064d77d39c7880d 498617a4673ddfc9 1955205eea54e74e 498617a4673ddfc9 d8c3c66763e71c6c After Round12 Linear layer 2db993eb57b7d055 7ab51c7804d5bea8 91dd9237d9d07dd4 7ab51c7804d5bea8 7269708ba9221b6b Round13 key 00cd0061001d0001 00cd0061001d0001 ffccff60ff1cff00 00cd0061001d0001 fea9baf976713211 After Round13 Key add 2d74938a57aad054 7a781c1904c8bea9 6e116d5726cc82d4 7a781c1904c8bea9 8cc0ca72df53297a After Round13 Sbox 56aac3de6ffed054 1cd0baa162f83eb9 4adda6906c8149d7 1cd0baa162f83eb9 5713e328cfc00972 After Round13 Linear layer 6599b2d86477981c 1cd0c2ad16a246c1 a4336f2b8b87a03e 1cd0c2ad16a246c1 57136520655cbec5 Round14 key 008e0006006e00c6 008e0006006e00c6 ff8dff05ff6dffc5 008e0006006e00c6 fe6aba9e76c232d6 After Round14 Key add 6517b2de641998da 1c5ec2ab16cc4607 5bbe902e74ea5ffb 1c5ec2ab16cc4607 a979dfbe139e8c13 After Round14 Sbox 440fb8c2651592de 16c6448f18d8c223 64c44fd11f7e9f7e 16c6448f18d8c223 9aa69f8d33fdcc33 After Round14 Linear layer bbf0df8270c9561a cb1bb0c0ee35dc3d ce6e5fdc2db08160 cb1bb0c0ee35dc3d 6559cc2bab3033cc Round15 key 004f00ab00bf008b 004f00ab00bf008b ff4effaaffbeff8a 004f00ab00bf008b fe2bba437613329b After Round15 Key add bbbfdf2970765691 cb54b06bee8adcb6 3120a076d20e7eea cb54b06bee8adcb6 9b727668dd230157 After Round15 Sbox eb5f06b1b9aedf89 6eca7cbca7dc94f7 f22efeecc30c627a 6eca7cbca7dc94f7 cf435577de312259 After Round15 Linear layer 14a08790fb62baec 80246152e0c8a2c1 e33f64574236e3fb 80246152e0c8a2c1 8b074499ae4495ee Round16 key 0010005000100050 0010005000100050 ff0fff4fff0fff4f 0010005000100050 feecbae876643260 After Round16 Key add 14b087c0fb72babc 80346102e0d8a291 1c309b18bd391cb4 80346102e0d8a291 75ebfe71d820a78e After Round16 Sbox fff239fc2d40ae80 e0d8c29140a42182 a52985ac99181a10 e0d8c29140a42182 ac417fae59eba7db After Round16 Linear layer 222fac1157b24c62 5b63cb575ef21bb8 e16deb2aa68dbab0 5b63cb575ef21bb8 9f72f4e04b8b601c Round17 key 00d100f500610015 00d100f500610015 ffd0fff4ff60ff14 00d100f500610015 feadba8d76b53225 After Round17 Key add 22feace457d34c77 5bb2cba25e931bad 1ebd14de59ed45a4 5bb2cba25e931bad 61df4e6d3d3e5239 After Round17 Sbox 773748b762c9ec65 1531512f4a938ba1 4d7155685bdd4596 1531512f4a938ba1 7d735e153dce5269 After Round17 Linear layer 3373658e8e0b74fd 73578ada24fe2903 b28ebc90308678ab 73578ada24fe2903 939d4aa7137ae1da Round18 key 0092009a00b200da 0092009a00b200da ff91ff99ffb1ffd9 0092009a00b200da fe6eba32760632ea After Round18 Key add 33e165148eb97427 73c58a40244c29d9 4d1f4309cf378772 73c58a40244c29d9 6df3f095657cd330 After Round18 Sbox afb9703713d07504 260c299953cd8bc9 8e3ec473c92d8328 260c299953cd8bc9 05edb3246cd7d091 After Round18 Linear layer d8ce29ba3b896213 ae84b77cb91eafed 3585d66833273992 ae84b77cb91eafed 638bf5d0a0fac485 Round19 key 0053003f0003009f 0053003f0003009f ff52ff3eff02ff9e 0053003f0003009f fe2fbad7765732af After Round19 Key add d89d29853b8a628c aed7b743b91daf72 cad72956cc25c60c aed7b743b91daf72 9da44f07d6adf62a After Round19 Sbox 330f4b0cdb916285 1f5e1e73b085a742 c473ce080ed72756 1f5e1e73b085a742 dba9b02f0d8d4f0a After Round19 Linear layer ccf0bdeb0ee41cfb 4a0bb9b2df4cf91c 08bf130ecadc3041 4a0bb9b2df4cf91c 8efca83bb79b1b5e Round20 key 001400e400540064 001400e400540064 ff13ffe3ff53ff63 001400e400540064 fef0ba7c76a83274 After Round20 Key add cce4bd0f0eb01c9f 4a1fb956df18f978 f7aceced358fcf22 4a1fb956df18f978 700c1247c133292a After Round20 Sbox 82b4109fcc70bd1f d70e60680a17b956 d123ebaf368fce62 d70e60680a17b956 d1372929712d336e After Post-whitening 82b4109fcc70bd1f 28f19f97f5e846a9 d123ebaf368fce62 d70e60680a17b956 d1372929712d336e

158 PRIDE Test Vectors

Decryption Test Vector 1 Test Vector 2 Test Vector 3 Test Vector 4 Test Vector 5 Ciphertext 82b4109fcc70bd1f 28f19f97f5e846a9 d123ebaf368fce62 d70e60680a17b956 d1372929712d336e Key (k0) 0000000000000000 ffffffffffffffff 0000000000000000 0000000000000000 0000000000000000 Key (k1) 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000 fedcba9876543210 After Post-whitening 82b4109fcc70bd1f d70e60680a17b956 d123ebaf368fce62 d70e60680a17b956 d1372929712d336e After IRound1 Sbox cce4bd0f0eb01c9f 4a1fb956df18f978 f7aceced358fcf22 4a1fb956df18f978 700c1247c133292a IRound1 key 001400e400540064 001400e400540064 ff13ffe3ff53ff63 001400e400540064 fef0ba7c76a83274 After IRound1 Key add ccf0bdeb0ee41cfb 4a0bb9b2df4cf91c 08bf130ecadc3041 4a0bb9b2df4cf91c 8efca83bb79b1b5e After IRound2 Linear layer 330f4b0cdb916285 1f5e1e73b085a742 c473ce080ed72756 1f5e1e73b085a742 dba9b02f0d8d4f0a After IRound2 Sbox d89d29853b8a628c aed7b743b91daf72 cad72956cc25c60c aed7b743b91daf72 9da44f07d6adf62a IRound2 key 0053003f0003009f 0053003f0003009f ff52ff3eff02ff9e 0053003f0003009f fe2fbad7765732af After IRound2 Key add d8ce29ba3b896213 ae84b77cb91eafed 3585d66833273992 ae84b77cb91eafed 638bf5d0a0fac485 After IRound3 Linear layer afb9703713d07504 260c299953cd8bc9 8e3ec473c92d8328 260c299953cd8bc9 05edb3246cd7d091 After IRound3 Sbox 33e165148eb97427 73c58a40244c29d9 4d1f4309cf378772 73c58a40244c29d9 6df3f095657cd330 IRound3 key 0092009a00b200da 0092009a00b200da ff91ff99ffb1ffd9 0092009a00b200da fe6eba32760632ea After IRound3 Key add 3373658e8e0b74fd 73578ada24fe2903 b28ebc90308678ab 73578ada24fe2903 939d4aa7137ae1da After IRound4 Linear layer 773748b762c9ec65 1531512f4a938ba1 4d7155685bdd4596 1531512f4a938ba1 7d735e153dce5269 After IRound4 Sbox 22feace457d34c77 5bb2cba25e931bad 1ebd14de59ed45a4 5bb2cba25e931bad 61df4e6d3d3e5239 IRound4 key 00d100f500610015 00d100f500610015 ffd0fff4ff60ff14 00d100f500610015 feadba8d76b53225 After IRound4 Key add 222fac1157b24c62 5b63cb575ef21bb8 e16deb2aa68dbab0 5b63cb575ef21bb8 9f72f4e04b8b601c After IRound5 Linear layer fff239fc2d40ae80 e0d8c29140a42182 a52985ac99181a10 e0d8c29140a42182 ac417fae59eba7db After IRound5 Sbox 14b087c0fb72babc 80346102e0d8a291 1c309b18bd391cb4 80346102e0d8a291 75ebfe71d820a78e IRound5 key 0010005000100050 0010005000100050 ff0fff4fff0fff4f 0010005000100050 feecbae876643260 After IRound5 Key add 14a08790fb62baec 80246152e0c8a2c1 e33f64574236e3fb 80246152e0c8a2c1 8b074499ae4495ee After IRound6 Linear layer eb5f06b1b9aedf89 6eca7cbca7dc94f7 f22efeecc30c627a 6eca7cbca7dc94f7 cf435577de312259 After IRound6 Sbox bbbfdf2970765691 cb54b06bee8adcb6 3120a076d20e7eea cb54b06bee8adcb6 9b727668dd230157 IRound6 key 004f00ab00bf008b 004f00ab00bf008b ff4effaaffbeff8a 004f00ab00bf008b fe2bba437613329b After IRound6 Key add bbf0df8270c9561a cb1bb0c0ee35dc3d ce6e5fdc2db08160 cb1bb0c0ee35dc3d 6559cc2bab3033cc After IRound7 Linear layer 440fb8c2651592de 16c6448f18d8c223 64c44fd11f7e9f7e 16c6448f18d8c223 9aa69f8d33fdcc33 After IRound7 Sbox 6517b2de641998da 1c5ec2ab16cc4607 5bbe902e74ea5ffb 1c5ec2ab16cc4607 a979dfbe139e8c13 IRound7 key 008e0006006e00c6 008e0006006e00c6 ff8dff05ff6dffc5 008e0006006e00c6 fe6aba9e76c232d6 After IRound7 Key add 6599b2d86477981c 1cd0c2ad16a246c1 a4336f2b8b87a03e 1cd0c2ad16a246c1 57136520655cbec5 After IRound8 Linear layer 56aac3de6ffed054 1cd0baa162f83eb9 4adda6906c8149d7 1cd0baa162f83eb9 5713e328cfc00972 After IRound8 Sbox 2d74938a57aad054 7a781c1904c8bea9 6e116d5726cc82d4 7a781c1904c8bea9 8cc0ca72df53297a IRound8 key 00cd0061001d0001 00cd0061001d0001 ffccff60ff1cff00 00cd0061001d0001 fea9baf976713211 After IRound8 Key add 2db993eb57b7d055 7ab51c7804d5bea8 91dd9237d9d07dd4 7ab51c7804d5bea8 7269708ba9221b6b After IRound9 Linear layer f064d77d39c7880d 498617a4673ddfc9 1955205eea54e74e 498617a4673ddfc9 d8c3c66763e71c6c After IRound9 Sbox e9a399487964ce3d 66b9d8ed092f1f89 ea00c71adb55e34e 66b9d8ed092f1f89 a3a45e0bdac39c64 IRound9 key 000c00bc00cc003c 000c00bc00cc003c ff0bffbbffcbff3b 000c00bc00cc003c fee8ba547620324c After IRound9 Key add e9af99f479a8ce01 66b5d85109e31fb5 150b38a1249e1c75 66b5d85109e31fb5 5d4ce45face3ae28 After IRound10 Linear layer cb8d97248c4032fd 885b0e59e396b51f eaf47ad6cbc38ae3 885b0e59e396b51f 5d4c9c285844c640 After IRound10 Sbox 0f44b6fdcdc913ed ebcfb70f2b542d5d a117c0216af53af7 ebcfb70f2b542d5d 444cde40190c8428 IRound10 key 004b0017007b0077 004b0017007b0077 ff4aff16ff7aff76 004b0017007b0077 fe27baaf76cf3287 After IRound10 Key add 0f0fb6eacdb2139a eb84b7182b2f2d2a 5e5d3f37958fc581 eb84b7182b2f2d2a ba6b64ef6fc3b6af After IRound11 Linear layer 0f0f4c4c930d8b02 721d30fc492d5d5a 6d6e9d33c48281c5 721d30fc492d5d5a 76a7d0a2c995273e After IRound11 Sbox 9f018b0e840fcc42 79315d762b2d39d8 c9a005c76cee99f5 79315d762b2d39d8 9937e7bef7913732 IRound11 key 008a0072002a00b2 008a0072002a00b2 ff89ff71ff29ffb1 008a0072002a00b2 fe66ba0a767e32c2 After IRound11 Key add 9f8b8b7c8425ccf0 79bb5d042b07396a 3629fab693c76644 79bb5d042b07396a 67515db481ef05f0 After IRound12 Linear layer cade3878a2f50f33 9755f88ae9110c5f d8c76e09cded4466 9755f88ae9110c5f 3204d404b5d85aaf After IRound12 Sbox aaad2f43e0df183b 7911e45ff7441cce 85ec086fd8ab6622 7911e45ff7441cce a5dcceafb688528c IRound12 key 00c900cd00d900ed 00c900cd00d900ed ffc8ffccffd8ffec 00c900cd00d900ed fea5ba65762d32fd After IRound12 Key add aa642f8ee00618d6 79d8e492f79d1c23 7a24f7a3277399ce 79d8e492f79d1c23 5b7974cac0a56071 After IRound13 Linear layer 8846ffe59c75f43a c263ef9a16a8efd0 c19fee986859ecbb c263ef9a16a8efd0 5b79955993177160 After IRound13 Sbox 1431685f8857f7b2 d4aae958026befd2 a8c184a3411eee9a d4aae958026befd2 824ee071db395568 IRound13 key 0008002800880028 0008002800880028 ff07ff27ff87ff27 0008002800880028 fee4bac076dc3238 After IRound13 Key add 1439687788dff79a d4a2e97002e3effa 57c67b84be9911bd d4a2e97002e3effa 7caa5ab1ade56750 After IRound14 Linear layer ebc6f38f1974214c c5b392cb9713beab df4e42b7690adb77 c5b392cb9713beab c711d7444cc11423 After IRound14 Sbox faf23048db86e38f 17902ca8c13392eb 2b0c9b75d44ad2f7 17902ca8c13392eb 8bc15063c7509704 IRound14 key 0047008300370063 0047008300370063 ff46ff82ff36ff62 0047008300370063 fe23ba1b768b3273 After IRound14 Key add fab530cbdbb1e3ec 17d72c2bc1049288 d44a64f72b7c2d95 17d72c2bc1049288 75e2ea78b1dba577 After IRound15 Linear layer 410ee6637784131c db1bd78a19e63329 a33dd6e304d7a61e db1bd78a19e63329 9b0cd11324ee885a After IRound15 Sbox 3786751c740a926b caec22abd9b3d729 86f6a2dd21e9f62a caec22abd9b3d729 b5ee88581b44d953 IRound15 key 008600de00e6009e 008600de00e6009e ff85ffddffe5ff9d 008600de00e6009e fe62ba76763a32ae After IRound15 Key add 370075c274ec92f5 ca6a2275d955d7b7 79735d00de0c09b7 ca6a2275d955d7b7 4b8c322e6d7eebfd After IRound16 Linear layer 73445791c1a0e483 60c0c4edfc13d1b1 d3d9f9eabd65e25c 60c0c4edfc13d1b1 f0371e0227f78a9c After IRound16 Sbox 92a0a503f344f691 bcd315b07450d0fd 6cad5b3c9bf5e2de bcd315b07450d0fd 37f58c9ef4a39a80 IRound16 key 00c50039009500d9 00c50039009500d9 ffc4ff38ff94ffd8 00c50039009500d9 fea1bad176e932e9 After IRound16 Key add 9265a53af3d1f648 bc16158974c5d024 9369a40464611d06 bc16158974c5d024 c954364f824aa869 After IRound17 Linear layer 1aed355f17c01da3 bc16ad7c651d9f6b c63c7a4537e3acb7 bc16ad7c651d9f6b 8d10453f372eb475 After IRound17 Sbox 078d08e31a6c3d3f c909ba7734179d6b 75e79ef6d2dae897 c909ba7734179d6b 323eb15bbd0af435 IRound17 key 0004009400440014 0004009400440014 ff03ff93ff43ff13 0004009400440014 fee0ba2c76983224 After IRound17 Key add 078908771a283d2b c90dbae334539d7f 8ae461652d991784 c90dbae334539d7f ccde0b77cb92c611 After IRound18 Linear layer 61efdb8c18315c4a 41850bf43ce0b351 026cb176bbc32ebd 41850bf43ce0b351 ffed9a943b3ebb6c After IRound18 Sbox 59bd444a21e7dbce 3d64bbb178a53355 bba79fff99cb28bd 3d64bbb178a53355 a1baa1785ed59ac4 IRound18 key 004300ef00f3004f 004300ef00f3004f ff42ffeefff2ff4e 004300ef00f3004f fe1fba877647325f After IRound18 Key add 59fe44a52114db81 3d27bb5e7856331a 44e560116639d7f3 3d27bb5e7856331a 5fa51bff2892a89b After IRound19 Linear layer 84235a55dd8e7e24 869ca4ca7b41a188 ff5e6c9a4e9695b1 869ca4ca7b41a188 0af0b4d8abcf9ba8 After IRound19 Sbox dd8f262080235a75 ffc981c80754a58a 228cd923ff5eb598 ffc981c80754a58a ab1f3b6021f095b8 IRound19 key 0082004a00a2008a 0082004a00a2008a ff81ff49ffa1ff89 0082004a00a2008a fe5ebae276f6329a After IRound19 Key add dd0d266a80815aff ff4b818207f6a500 dd0d266a00ff4a11 ff4b818207f6a500 554181825706a722 After IRound20 Linear layer 00d000c4000100a5 00b40030ff0eff5a 00d000c4ff00ffa4 00b40030ff0eff5a 00140030ffaeff7a After IRound20 Sbox 00c100a5005100c5 ff3eff5affaeff3a ffc0ffa4ff50ffc4 ff3eff5affaeff3a ffbeff5aff0eff3a IRound20 key 00c100a5005100c5 00c100a5005100c5 ffc0ffa4ff50ffc4 00c100a5005100c5 fe9dba3d76a532d5 After IRound20 Key add 0000000000000000 ffffffffffffffff 0000000000000000 ffffffffffffffff 0123456789abcdef After Pre-whitening 0000000000000000 0000000000000000 0000000000000000 ffffffffffffffff 0123456789abcdef

159

Abbreviations

AES Advanced Encryption Standard ANF Algebraic Normal Form ASIC Application Specific Integrated Circuit CHES International Workshop on Cryptographic Hardware and Embedded Systems CMOS Complementary Metal Oxide Semiconductor CPU Central Processing Unit CTR Counter (mode of operation) DES Data Encryption Standard DSP Digital Signal Processing ECC Elliptic Curve Cryptography EEPROM Electrically Erasable Programmable Read-Only Memory EmSec Embedded Security FIFO First In First Out (memory) FIPS Federal Information Processing Standard FPGA Field Programmable Gate Array HDL Hardware Description Language HGI Horst G¨ortzInstitute for IT-Security HMAC Hash-based Message Authentication Code HW Hamming Weight IC Integrated Circuit IDE Integrated Development Environment ISO International Organization for Standardization LFSR Linear Feedback Shift Register LUT Look-Up Table MAC Message Authentication Code NIST National Institute of Standards and Technology Abbreviations

PC Personal Computer RAM Random Access Memory RISC Reduced Instruction Set Computer RFID Radio Frequency IDentification RSA Rivest Shamir and Adleman RUB Ruhr University Bochum SCA Side-Channel Attacks SHA Secure Hash Algorithm SRAM Static Random Access Memory USB Universal Serial Bus XOR Exclusive OR ACM Association for Computing Machinery ACNS International Conference on Applied Cryptography and Network Security AFRICACRYPT Annual International Conference on the Theory and Applications of Cryptol- ogy AH Authentication Header ALU Arithmetic Logic Unit APN Almost Perfect Non-linear ARX Addition-Rotation-XOR ARITH IEEE Symposium on Computer Arithmetic ASAP IEEE International Conference on Application-specific Systems Architectures and Pro- cessors ASIACRYPT Annual International Cryptology Conference ASP Application-Specific Processor CARDIS Smart Card Research and Advanced Application Conference CIS International Conference on Computational Intelligence and Security CMS IFIP TC-6 TC-11 Conference on Communications and Multimedia Security CRYPTO Annual International Conference on the Theory and Application of Cryptology and Information Security CT-RSA RSA Conference Cryptographers’ Track DATE Design, Automation & Test in Europe DSD EUROMICRO Conference on Digital System Design

162 Abbreviations

EUROCRYPT Annual International Conference on the Theory and Applications of Crypto- graphic Techniques ESP Encapsulating Security Payload FPL International Conference on Field Programmable Logic and Applications FPU Floating Point Unit FSE International Workshop on Fast Software Encryption GCM Galois/Counter Mode GE Gate Equivalence GF Galois Field IACR International Association for Cryptologic Research ICCD International Conference on Computer Design ICCSA International Conference on Computational Science and Its Applications IEEE Institute of Electrical and Electronics Engineers IKE Internet Key Exchange IMA Institute of Mathematics and its Applications INDOCRYPT International Conference on Cryptology in India IO Input/Output IP Internet Protocol IPDPS IEEE International Parallel and Distributed Processing Symposium IPSec Internet Protocol Security ISE Instruction Set Extension LightSec International Workshop on Lightweight Cryptography for Security and Privacy LNCS Lecture Notes in Computer Science MDS Maximum Distance Separable MIPS Million Instructions per Second MIT Massachusetts Institute of Technology NESSIE New European Schemes for Signatures, Integrity, and Encryption NLU Non-linear/Linear Unit NRE Non-recurring Engineering NSA US National Security Agency ReConFig International Conference on Reconfigurable Computing and FPGAs

163 Abbreviations

RFIDSec International Workshop on RFID Security and Privacy SAC International Workshop on Selected Areas in Cryptography SecPriWiMob IEEE International Workshop on Security and Privacy in Wireless and Mobile Computing, Networking, and Communications SECSI Secure Component and System Identification Workshop SLR Shift Logical Right SP Substitution-Permutation SPL Southern Programmable Logic Conference SPN Substitution-Permutation Network SSL Secure Sockets Layer UBI Unique Block Iteration VLSI Very Large Scale Integration WAIFI International Workshop on the Arithmetic of Finite Fields WISA International Workshop on Information Security Applications WiSec ACM Conference on Wireless Network Security XNOR Exclusive NOR

164 Bibliography

Bibliography

[ABK98] Ross Anderson, Eli Biham, and Lars Knudsen. Serpent: A Proposal for the Advanced Encryption Standard. Submission to NIST AES Competition, August 1998.

[ADK+14a] Martin R. Albrecht, Benedikt Driessen, Elif Bilge Kavun, Gregor Leander, Christof Paar, and Tolga Yal¸cın. PRIDE – Focus on the Linear Layer (feat. PRIDE. In Proceedings of the 34th Annual International Conference on the Theory and Application of Cryptology and Information Security (CRYPTO), volume 8616 of Lecture Notes in Computer Science (LNCS), pages 57–76. Springer, 2014.

[ADK+14b] Martin R. Albrecht, Benedikt Driessen, Elif Bilge Kavun, Gregor Leander, Christof Paar, and Tolga Yal¸cın. PRIDE – Focus on the Linear Layer (feat. PRIDE: Full Version. International Association for Cryptologic Research (IACR) Cryptology ePrint Archive, 2014:453, 2014.

[AF14] Daniel Augot and Matthieu Finiasz. Direct Construction of Recursive MDS Dif- fusion Layers using Shortened BCH Codes. IACR Cryptology ePrint Archive, 2014:566, 2014.

[AH11] Toru Akishita and Harunaga Hiwatari. Very Compact Hardware Implementations of the Blockcipher CLEFIA. In Proceedings of the 18th International Workshop on Selected Areas in Cryptography (SAC), volume 7118 of LNCS, pages 278–292. Springer, 2011.

[AHMNP13] Jean-Philippe Aumasson, Luca Henzen, Willi Meier, and Mar´ıaNaya-Plasencia. QUARK: A Lightweight Hash. Journal of Cryptology, 26(2):313–339, 2013.

[AHMP10] Jean-Philippe Aumasson, Luca Henzen, Willi Meier, and Raphael C.-W. Phan. SHA-3 Proposal BLAKE. Submission to NIST SHA-3 Competition, 2010.

[ARM] ARM. The ARM 32-bit Instruction Set. http://simplemachines.it/doc/arm_ inst.pdf.

[Atm] Atmel. Atmel Studio IDE. http://www.atmel.com/tools/ATMELSTUDIO.aspx.

[Atm10] Atmel. 8-bit AVR Instruction Set Datasheet, 2010. http://www.atmel.com/ images/doc0856.pdf.

[Atm13] Atmel. ATmega8A 8-bit Microcontroller Datasheet, 2013. http://www.atmel. com/images/atmel-8159-8-bit-avr-microcontroller-atmega8a_datasheet. pdf. Bibliography

[AVR] AVRAES. AVRAES: The AES block cipher on AVR controllers. http:// point-at-infinity.org/avraes/.

[BCBP03] Alex Biryukov, Christophe De Canni`ere, An Braeken, and Bart Preneel. A Tool- box for Cryptanalysis: Linear and Affine Equivalence Algorithms. In Proceedings of the 22nd Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), volume 2656 of LNCS, pages 33–50. Springer, 2003.

[BCG+12a] Julia Borghoff, Anne Canteaut, Tim G¨uneysu, Elif Bilge Kavun, Miroslav Kneˇzevi´c,Lars R. Knudsen, Gregor Leander, Ventzislav Nikov, Christof Paar, Christian Rechberger, Peter Rombouts, Søren S. Thomsen, and Tolga Yal¸cın. PRINCE – A Low-latency Block Cipher for Pervasive Computing Applications: Extended Version. In Proceedings of the 18th Annual International Cryptology Conference (ASIACRYPT), volume 7658 of LNCS, pages 208–225. Springer, 2012.

[BCG+12b] Julia Borghoff, Anne Canteaut, Tim G¨uneysu, Elif Bilge Kavun, Miroslav Kneˇzevi´c,Lars R. Knudsen, Gregor Leander, Ventzislav Nikov, Christof Paar, Christian Rechberger, Peter Rombouts, Søren S. Thomsen, and Tolga Yal¸cın. PRINCE – A Low-latency Block Cipher for Pervasive Computing Applications: Full Version. IACR Cryptology ePrint Archive, 2012:529, 2012.

[BD07] Eli Biham and Orr Dunkelman. A Framework for Iterative Hash Functions – HAIFA. IACR Cryptology ePrint Archive, 2007:278, 2007.

[BD08] Steve Babbage and Matthew Dodd. The MICKEY Stream Ciphers. In New Designs – The eSTREAM Finalists, volume 4986 of LNCS, pages 191–209. Springer, 2008.

[BDE+13] Lejla Batina, Amitabh Das, Barı¸sEge, Elif Bilge Kavun, Nele Mentens, Christof Paar, Ingrid Verbauwhede, and Tolga Yal¸cın. Dietary Recommendations for Lightweight Block Ciphers: Power, Energy and Area Analysis of Recently De- veloped Architectures. In Proceedings of the 9th RFIDSec, volume 8262 of LNCS, pages 103–112. Springer, 2013.

[BDPA07] Guido Bertoni, Joan Daemen, Micha¨el Peeters, and Gilles Van Assche. Sponge Functions. ECRYPT Hash Workshop, 2007.

[BDPA11] Guido Bertoni, Joan Daemen, Micha¨elPeeters, and Gilles Van Assche. The Keccak Sponge Function Family. Submission to NIST SHA-3 Competition, 2011.

[Ber08] Daniel J. Bernstein. The Salsa20 Family of Stream Ciphers. In New Stream Cipher Designs – The eSTREAM Finalists, volume 4986 of LNCS, pages 84–97. Springer, 2008.

[Bir03] Alex Biryukov. Analysis of Involutional Ciphers: Khazad and Anubis. In Pro- ceedings of the 10th International Workshop on Fast Software Encryption (FSE), volume 2887 of LNCS, pages 45–53. Springer, 2003.

168 Bibliography

[Bir05] Alex Biryukov. DES-X (or DESX). Encyclopedia of Cryptography and Security, D:146–147, 2005.

[BKL+07] Andrey Bogdanov, Lars R. Knudsen, Gregor Leander, Christof Paar, Axel Poschmann, Matthew J. B. Robshaw, Yannick Seurin, and Charlotte Vikkelsø. PRESENT: An Ultra-Lightweight Block Cipher. In Proceedings of the 9th Inter- national Workshop on Cryptographic Hardware and Embedded Systems (CHES), volume 4727 of LNCS, pages 450–466. Springer, 2007.

[BKL+11] Andrey Bogdanov, Miroslav Kneˇzevi´c,Gregor Leander, Deniz Toz, Kerem Varıcı, and Ingrid Verbauwhede. SPONGENT: A Lightweight Hash Function. In Pro- ceedings of the 13th CHES, volume 6917 of LNCS, pages 312–325. Springer, 2011.

[BKR11] Andrey Bogdanov, Dmitry Khovratovich, and Christian Rechberger. Biclique Cryptanalysis of the Full AES. In Proceedings of the 17th ASIACRYPT, volume 7073 of LNCS, pages 344–371. Springer, 2011.

[BL08] Marcus Brinkmann and Gregor Leander. On the Classification of Almost Perfect Non-linear (APN) Functions Up to Dimension Five. Designs, Codes and Cryptog- raphy, 49(1–3):273–288, 2008.

[BNN+10] Paulo S. L. M. Barreto, Ventzislav Nikov, Svetla Nikova, Vincent Rijmen, and Elmar Tischhauser. Whirlwind: A New Cryptographic Hash Function. Designs, Codes and Cryptography, 56(2–3):141–162, 2010.

[BNN+12] Beg¨ul Bilgin, Svetla Nikova, Ventzislav Nikov, Vincent Rijmen, and Georg St¨utz. Threshold Implementations of All 3 × 3 and 4 × 4 Sboxes. In Proceedings of the 14th CHES, volume 7428 of LNCS, pages 76–91. Springer, 2012.

[BR00a] Paulo S.L.M. Barreto and Vincent Rijmen. The ANUBIS Block Cipher. Sub- mission to New European Schemes for Signatures, Integrity, and Encryption (NESSIE) European Research Project, October 2000.

[BR00b] Paulo S.L.M. Barreto and Vincent Rijmen. The Khazad Legacy-level Block Ci- pher. Submission to NESSIE European Research Project, October 2000.

[BS90] Eli Biham and Adi Shamir. Differential Cryptanalysis of DES-like Cryptosystems. In Proceedings of the 10th CRYPTO, volume 537 of LNCS, pages 2–21. Springer, 1990.

[BSS+13] Ray Beaulieu, Douglas Shors, Jason Smith, Stefan Treatman-Clark, Bryan Weeks, and Louis Wingers. The SIMON and SPECK Families of Lightweight Block Ci- phers. IACR Cryptology ePrint Archive, 2013:414, 2013.

[Cad] Cadence. NC-Verilog Simulator.

[Cad13] Cadence. Encounter RTL Compiler, 2013. http://www.cadence.com/rl/ resources/datasheets/encounter_rtlcompiler.pdf.

169 Bibliography

[Can05] David Canright. A Very Compact S-Box for AES. In Proceedings of the 7th CHES, volume 3659 of LNCS, pages 441–455. Springer, 2005.

[Car10] Claude Carlet. Boolean Methods and Models, chapter Vectorial Boolean Functions for Cryptography, pages 398–469. Cambridge University Press, 2010.

[CDK09] Christophe De Canni`ere,Orr Dunkelman, and Miroslav Kneˇzevi´c. KATAN and KTANTAN – A Family of Small and Efficient Hardware-Oriented Block Ciphers. In Proceedings of the 11th CHES, volume 5747 of LNCS, pages 272–288. Springer, 2009.

[CMM13] Micka¨elCazorla, Kevin Marquet, and Marine Minier. Survey and Benchmark of Lightweight Block Ciphers for Wireless Sensor Networks. IACR Cryptology ePrint Archive, 2013:295, 2013.

[CP08] Christophe De Canni`ereand Bart Preneel. Trivium. In New Stream Cipher De- signs – The eSTREAM Finalists, volume 4986 of LNCS, pages 244–266. Springer, 2008.

[Cry14] CryptoLUX. Lightweight Block Ciphers, 2014. https://www.cryptolux.org/ index.php/Lightweight_Block_Ciphers.

[Dae95] Joan Daemen. Cipher and Hash Function Design, Strategies Based on Linear and Differential Cryptanalysis. PhD thesis, Katholieke Universiteit Leuven, 1995.

[DGK+12] Benedikt Driessen, Tim G¨uneysu, Elif Bilge Kavun, Oliver Mischke, Christof Paar, and Thomas P¨oppelmann. IPSecco A Lightweight and Reconfigurable IPSec Core. In Proceedings of the 7th ReConFig, pages 1–7. Institute of Electrical and Electronics Engineers (IEEE) Computer Society, 2012.

[DKR97] Joan Daemen, Lars Knudsen, and Vincent Rijmen. The Block Cipher SQUARE. In Proceedings of the 4th FSE, volume 1267 of LNCS, pages 149–165. Springer, 1997.

[dMGSP08] Giacomo de Meulenaer, Fran¸coisGosset, Fran¸cois-Xavier Standaert, and Olivier Pereira. On the Energy Cost of Communications and Cryptography in Wireless Sensor Networks (Extended Version). In Proceedings of the 1st IEEE International Workshop on Security and Privacy in Wireless and Mobile Computing, Network- ing, and Communications (SecPriWiMob), pages 580–585. IEEE, 2008.

[DPAR00] Joan Daemen, Micha¨elPeeters, Gilles Van Assche, and Vincent Rijmen. NESSIE Proposal: NOEKEON. Submission to NESSIE European Research Project, Oc- tober 2000.

[DR98] Joan Daemen and Vincent Rijmen. AES Proposal: Rijndael, 1998.

[DR01] Joan Daemen and Vincent Rijmen. The Wide Trail Design Strategy. In Proceed- ings of the 3rd Institute of Mathematics and its Applications (IMA) International Conference on Cryptography and Coding, volume 2260 of LNCS, pages 222–238. Springer, 2001.

170 Bibliography

[DR09] Joan Daemen and Vincent Rijmen. Codes and Provable Security of Ciphers. In En- hancing Cryptographic Primitives with Techniques from Error Correcting Codes, volume 1807 of NATO Science for Peace and Security Series D – Information and Communication Security 23, pages 60–80. IOS Press, 2009.

[Dwo01] Morris Dworkin. NIST Recommendation for Block Cipher Modes of Operation: Methods and Techniques. NIST Special Publication 800-38A, 2001.

[Dwo07] Morris Dworkin. NIST Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. NIST Special Publication 800-38D, 2007.

[Ecl] Eclipse. Eclipse IDE. https://eclipse.org/ide/.

[EGG+12] Thomas Eisenbarth, Zheng Gong, Tim G¨uneysu,Stefan Heyse, Sebastiaan In- desteege, St´ephanieKerckhof, Fran¸cois Koeune, Tomislav Nad, Thomas Plos, Francesco Regazzoni, Fran¸cois-Xavier Standaert, and Loic van Oldeneel tot Olden- zeel. Compact Implementation and Performance Evaluation of Block Ciphers in ATtiny Devices. In Proceedings of the 5th Annual International Conference on the Theory and Applications of Cryptology (AFRICACRYPT), volume 7374 of LNCS, pages 172–187. Springer, 2012.

[EKM+13] Susanne Engels, Elif Bilge Kavun, Hristina Mihajloska, Christof Paar, and Tolga Yal¸cın. A Non-Linear/Linear Instruction Set Extension for Lightweight Ciphers. In Proceedings of the 21st ARITH, pages 67–75. IEEE Computer Society, 2013.

[EKU+07] Thomas Eisenbarth, Sandeep Kumar, Leif Uhsadel, Christof Paar, and Axel Poschmann. A Survey of Lightweight-Cryptography Implementations. IEEE De- sign & Test of Computers, 24(6):522–533, 2007.

[FLS+10] Niels Ferguson, Stefan Lucks, Bruce Schneier, Doug Whiting, Mihir Bellare, Ta- dayoshi Kohno, Jon Callas, and Jesse Walker. The Skein Hash Function Family. Submission to NIST SHA-3 Competition, 2010.

[Fri] Limor Fried. PIC vs. AVR. http://www.ladyada.net/library/picvsavr.html.

[FS09] Philippe Flajolet and Robert Sedgewick. Analytic Combinatorics. Cambridge University Press, 2009.

[GC09] Kris Gaj and Pawel Chodowiec. Cryptographic Engineering, chapter FPGA and ASIC Implementations of AES, pages 235–294. Springer, 2009.

[GGP08] Philipp Grabher, Johann Großsch¨adl,and Dan Page. Light-Weight Instruction Set Extensions for Bit-Sliced Cryptography. In Proceedings of the 10th CHES, volume 5154 of LNCS, pages 331–345. Springer, 2008.

[GKM+11] Praveen Gauravaram, Lars R. Knudsen, Krystian Matusiewicz, Florian Mendel, Christian Rechberger, Martin Schl¨affer,and Søren S. Thomsen. Grøstl – A SHA-3 Candidate. Submission to NIST SHA-3 Competition, 2011.

171 Bibliography

[GLSV14] Vincent Grosso, Ga¨etanLeurent, Fran¸cois-Xavier Standaert, and Kerem Varıcı. LS-Designs: Bitslice Encryption for Efficient Masked Software Implementations. In Proceedings of the 21st FSE, LNCS. Springer, 2014. to appear. [GM11] Tim G¨uneysuand Amir Moradi. Generic Side-Channel Countermeasures for Re- configurable Devices. In Proceedings of the 13th CHES, volume 6917 of LNCS, pages 33–48. Springer, 2011. [GNL11] Zheng Gong, Svetla Nikova, and Yee Wei Law. KLEIN: A New Family of Lightweight Block Ciphers. In Proceedings of the 7th RFIDSec, volume 7055 of LNCS, pages 1–18. Springer, 2011. [GPP11] Jian Guo, Thomas Peyrin, and Axel Poschmann. The PHOTON Family of Lightweight Hash Functions. In Proceedings of the 31st CRYPTO, volume 6841 of LNCS, pages 222–239. Springer, 2011. [GPPR11] Jian Guo, Thomas Peyrin, Axel Poschmann, and Matthew J. B. Robshaw. The LED Block Cipher. In Proceedings of the 13th CHES, volume 6917 of LNCS, pages 326–341. Springer, 2011. [Gra] Mentor Graphics. ModelSim Simulator. http://www.mentor.com/products/fv/ modelsim/. [GSH+12] Xu Guo, Meeta Srivastav, Sinan Huang, Dinesh Ganta, Michael B. Henry, Leyla Nazhandali, and Patrick Schaumont. ASIC Implementations of Five SHA-3 Final- ists. In Proceedings of the Design, Automation & Test in Europe (DATE), pages 1006–1011. IEEE, 2012. [HAHH06] Panu Hamalainen, Timo Alho, Marko Hannikainen, and Timo D. Hamalainen. Design and Implementation of Low-Area and Low-Power AES Encryption Hard- ware Core. In Proceedings of the 9th EUROMICRO Conference on Digital System Design (DSD), pages 577–583. IEEE Computer Society, 2006. [HAMP11] Luca Henzen, Jean-Philippe Aumasson, Willi Meier, and Raphael C.-W. Phan. Very Large Scale Integration (VLSI) Characterization of the Cryptographic Hash Function BLAKE. IEEE Transactions on VLSI Systems, 19(10):1746–1754, 2011. [HJMM08] Martin Hell, Thomas Johanssen, Alexander Maximov, and Willi Meier. The Grain Family of Stream Ciphers. In New Stream Cipher Designs – The eSTREAM Finalists, volume 4986 of LNCS, pages 179–190. Springer, 2008. [HSH+06] Deukjo Hong, Jaechul Sung, Seokhie Hong, Jongin Lim, Sangjin Lee, Bonseok Koo, Changhoon Lee, Donghoon Chang, Jaesang Lee, Kitae Jeong, Hyun Kim, Jongsung Kim, and Seongtaek Chee. HIGHT: A New Block Cipher Suitable for Low-Resource Devices. In Proceedings of the 8th CHES, volume 4249 of LNCS, pages 46–59. Springer, 2006. [HV06] Alireza Hodjat and Ingrid Verbauwhede. Sensor Network Operations, chapter The Energy Cost of Embedded Security for Wireless Sensor Networks, pages 510–522. Wiley-IEEE Press, 2006.

172 Bibliography

[ISO] ISO. Information technology – Security techniques – Lightweight cryptography – Part 2: Block ciphers (ISO/IEC 29192-2:2012). http://www.iso.org/iso/iso_ catalogue/catalogue_tc/catalogue_detail.htm?csnumber=56552.

[JA11] Bernhard Jungk and J¨urgenApfelbeck. Area-Efficient FPGA Implementations of the SHA-3 Finalists. In Proceedings of the 6th ReConFig, pages 235–241. IEEE Computer Society, 2011.

[KDH+12] St´ephanieKerckhof, Fran¸coisDurvaux, C´edricHocquet, David Bol, and Fran¸cois- Xavier Standaert. Towards Green Cryptography: A Comparison of Lightweight Ciphers from the Energy Viewpoint. In Proceedings of the 14th CHES, volume 7428 of LNCS, pages 390–407. Springer, 2012.

[KDH13] Ferhat Karako¸c,H¨useyinDemirci, and Emre Harmancı. ITUbee: A Software Ori- ented Lightweight Block Cipher. In Proceedings of the 2nd International Workshop on Lightweight Cryptography for Security and Privacy (LightSec), volume 8162 of LNCS, pages 16–27. Springer, 2013.

[KDVC+11] St´ephanie Kerckhof, Fran¸cois Durvaux, Nicolas Veyrat-Charvillon, Francesco Regazzoni, Guerric Meurice de Dormale, and Fran¸cois-Xavier Standaert. Com- pact FPGA Implementations of the Five SHA-3 Finalists. In Proceedings of the 10th Smart Card Research and Advanced Application Conference (CARDIS), vol- ume 7079 of LNCS, pages 217–233. Springer, 2011.

[KKI+11] Miroslav Kneˇzevi´c,Kazuyuki Kobayashi, Jun Ikegami, Shin’ichiro Matsuo, Akashi Satoh, Unal¨ Kocaba¸s,Junfeng Fan, Toshihiro Katashita, Takeshi Sugawara, Kazuo Sakiyama, Ingrid Verbauwhede, Kazuo Ohta, Naofumi Homma, and Takafumi Aoki. Fair and Consistent Hardware Evaluation of Fourteen Round Two SHA-3 Candidates. IEEE Transactions on VLSI Systems, 20(5):827–840, 2011.

[KLY13] Elif Bilge Kavun, Gregor Leander, and Tolga Yal¸cın. A reconfigurable architec- ture for searching optimal software code to implement block cipher permutation matrices. In Proceedings of the 8th ReConFig, pages 1–8. IEEE Computer Society, 2013.

[KNR12] Miroslav Kneˇzevi´c,Ventzislav Nikov, and Peter Rombouts. Low-Latency Encryp- tion – Is “Lightweight = Light + Wait”? In Proceedings of the 14th CHES, volume 7428 of LNCS, pages 426–446. Springer, 2012.

[Kob87] Neal Koblitz. Elliptic Curve Cryptosystems. Mathematics of Computation, 48(177):203–209, 1987.

[Ko¸c09] C¸etin Kaya Ko¸c. Cryptographic Engineering. Springer, 2009.

[KR96] Joe Kilian and Phillip Rogaway. How to Protect DES Against Exhaustive Key Search. In Proceedings of the 16th CRYPTO, volume 1109 of LNCS, pages 252– 267. Springer, 1996.

173 Bibliography

[KSPS12] Paris Kitsos, Nicolas Sklavos, Maria Parousi, and Athanassios N. Skodras. A Comparative Study of Hardware Architectures for Lightweight Block Ciphers. Computers & Electrical Engineering, 38(1):148–160, 2012.

[KY] Elif Bilge Kavun and Tolga Yal¸cın. On the Suitability of SHA-3 Finalists for Lightweight Applications. http://csrc.nist.gov/groups/ST/hash/sha-3/ Round3/March2012/documents/papers/KAVUN_paper.pdf.

[KY10] Elif Bilge Kavun and Tolga Yal¸cın. A Pipelined Camellia Architecture for Com- pact Hardware Implementation. In Proceedings of the 21st IEEE International Conference on Application-specific Systems Architectures and Processors (ASAP), pages 305–308. IEEE, 2010.

[KY11] Elif Bilge Kavun and Tolga Yal¸cın. RAM-Based Ultra-Lightweight FPGA Im- plementation of PRESENT. In Proceedings of the 6th ReConFig, pages 280–285. IEEE Computer Society, 2011.

[KYS+11] Jens-Peter Kaps, Panasayya Yalla, Kishore Kumar Surapathi, Bilal Habib, Susheel Vadlamudi, Smriti Gurung, and John Pham. Lightweight Implementations of SHA-3 Candidates on FPGAs. In Proceedings of the 12th International Con- ference on Cryptology in India (INDOCRYPT), volume 7107 of LNCS, pages 270–289. Springer, 2011.

[LC04] Shu Lin and Daniel J. Costello. Error Control Coding (2nd Edition). Prentice Hall, 2004.

[LK06] Chae Hoon Lim and Tymur Korkishko. mCrypton – A Lightweight Block Cipher for Security of Low-Cost RFID Tags and Sensors. In Proceedings of the 6th International Workshop on Information Security Applications (WISA), volume 3786 of LNCS, pages 243–258. Springer, 2006.

[LL05] Jing Lu and John Lockwood. IPSec Implementation on Xilinx Virtex-II Pro FPGA and Its Application. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 158–159. IEEE Computer So- ciety, 2005.

[LP07] Gregor Leander and Axel Poschmann. On the Classification of 4 Bit S-Boxes. In Proceedings of the 1st International Workshop on the Arithmetic of Finite Fields (WAIFI), volume 4547 of LNCS, pages 159–176. Springer, 2007.

[LSY01] Ruby B. Lee, Zhijie Shi, and Xiao Yang. Efficient Permutation Instructions for Fast Software Cryptography. IEEE Micro Magazine, 21(6):56–69, 2001.

[lwI] lwIP. http://savannah.nongnu.org/projects/lwip/.

[Mat93] Mitsuru Matsui. Linear Cryptoanalysis Method for DES Cipher. In Proceedings of the 12th EUROCRYPT, volume 765 of LNCS, pages 386–397. Springer, 1993.

[MD5] RFC-1321. https://www.ietf.org/rfc/rfc1321.txt.

174 Bibliography

[Mic] Microchip. 8-bit PIC MCUs. http://www.microchip.com/pagehandler/en-us/ family/8bit/.

[Mil86] Victor S. Miller. Use of Elliptic Curves in Cryptography. In Proceedings of the 4th CRYPTO, volume 218 of LNCS, pages 417–426. Springer, 1986.

[MIT] MIT. MIT Project Oxygen: Overview. http://oxygen.csail.mit.edu/ Overview.html.

[ML01] John Patrick McGregor and Ruby B. Lee. Architectural Enhancements for Fast Subword Permutations with Repetitions in Cryptographic Applications. In Pro- ceedings of the 19th International Conference on Computer Design (ICCD), pages 453–461. IEEE, 2001.

[MM12] Amir Moradi and Oliver Mischke. How Far Should Theory Be from Practice? – Evaluation of a Countermeasure. In Proceedings of the 14th CHES, volume 7428 of LNCS, pages 92–106. Springer, 2012.

[MPL+11] Amir Moradi, Axel Poschmann, San Ling, Christof Paar, and Huaxiong Wang. Pushing the Limits: A Very Compact and A Threshold Implementation of AES. In Proceedings of the 30th CRYPTO, volume 6632 of LNCS, pages 69–88. Springer, 2011.

[MV05] David A. McGrew and John Viega. The Galois/Counter Mode of Operation (GCM), 2005.

[Nan] NanGate. The NanGate 45nm Opencell Library. http://www.nangate.com/ ?page_id=2325.

[NIS99] NIST. DES. NIST FIPS Publication 46-3, October 1999.

[NIS01] NIST. AES. NIST FIPS Publication 197, November 2001.

[NIS08] NIST. Secure Hash Standard. NIST FIPS Publication 180-3, October 2008.

[NIS14] NIST. SHA-3. NIST Draft FIPS Publication 202, May 2014.

[NRS11] Svetla Nikova, Vincent Rijmen, and Martin Schl¨affer.Secure Hardware Implemen- tation of Nonlinear Functions in the Presence of Glitches. Journal of Cryptology, 24(2):292–321, 2011.

[NTR] NTRU. Quantum-resistant Cryptography. http://tbuktu.github.io/ntru/.

[NWW+11] Yun Niu, Liji Wu, Li Wang, Xiangmin Zhang, and Jun Xu. A Configurable IPSec Processor for High Performance In-Line Security Network Processor. In Pro- ceedings of the 7th International Conference on Computational Intelligence and Security (CIS), pages 674–678. IEEE Computer Society, 2011.

[Nyb93] Kaisa Nyberg. Differentially Uniform Mappings for Cryptography. In Proceedings of the 12th EUROCRYPT, volume 765 of LNCS, pages 55–64. Springer, 1993.

175 Bibliography

[OE10] Sean O’Melia and Adam J. Elbirt. Enhancing the Performance of Symmetric- Key Cryptography via Instruction Set Extensions. IEEE Transactions on VLSI Systems, 18(11):1505–1518, 2010.

[PMK+11] Axel Poschmann, Amir Moradi, Khoongming Khoo, Chu-Wee Lim, Huaxiong Wang, and San Ling. Side-Channel Resistant Crypto for Less than 2,300 GE. Journal of Cryptology, 24(2):322–345, 2011.

[Pos09] Axel Poschmann. Lightweight Cryptography – Cryptographic Engineering for a Pervasive World. PhD thesis, Ruhr-Universit¨atBochum, 2009.

[PP10a] Christof Paar and Jan Pelzl. Understanding Cryptography, chapter Public- Key Cryptosystems Based on the Discrete Logarithm Problem, pages 205–238. Springer, 2010.

[PP10b] Christof Paar and Jan Pelzl. Understanding Cryptography, chapter Hash Func- tions, pages 293–317. Springer, 2010.

[RFCa] RFC-1349. http://www.ietf.org/rfc/rfc1349.txt.

[RFCb] RFC-2474. http://www.ietf.org/rfc/rfc2474.txt.

[RFCc] RFC-4301. http://www.ietf.org/rfc/rfc4301.txt.

[RFCd] RFC-4308. http://www.ietf.org/rfc/rfc4308.txt.

[RFCe] RFC-4868. http://www.ietf.org/rfc/rfc4868.txt.

[RFCf] RFC-5903. http://www.ietf.org/rfc/rfc5903.txt.

[RFCg] RFC-6379. http://www.ietf.org/rfc/rfc6379.txt.

[RFCh] RFC-791. http://www.ietf.org/rfc/rfc791.txt.

[Rij00] Vincent Rijmen. Efficient Implementation of the Rijndael S-box. ESAT, Katholieke Universiteit Leuven, Belgium, 2000.

[RPLP08] Carsten Rolfes, Axel Poschmann, Gregor Leander, and Christof Paar. Ultra- lightweight Implementations for Smart Devices – Security for 1000 Gate Equiva- lents. In Proceedings of the 8th CARDIS, volume 3928 of LNCS, pages 89–103. Springer, 2008.

[RPP08] Carsten Rolfes, Axel Poschmann, and Christof Paar. Security for 1000 Gate Equivalents. In Proceedings of Secure Component and System Identification Work- shop (SECSI). ECRYPT, 2008.

[RSA78] Ron Rivest, Adi Shamir, and Leonard Adleman. A Method for Obtaining Digital Signatures and Public-key Cryptosystems. Communications of the Association for Computing Machinery (ACM), 21(2):120–126, 1978.

176 Bibliography

[RSQL03] Ga¨elRouvroy, Fran¸cois-Xavier Standaert, Jean-Jacques Quisquater, and Jean- Didier Legat. Design Strategies and Modified Descriptions to Optimize Ci- pher FPGA Implementations: Fast and Compact Results for DES and Triple- DES. In Proceedings of the 13th International Conference on Field Programmable Logic and Applications (FPL), volume 2778 of LNCS, pages 181–193. Springer, 2003.

[Rub07] Ruby B. Lee and Murat Fı¸skıranand Michael Wang and Yedidya Hilewitz and Yu-Yuan Chen. PAX: A Cryptographic Processor with Parallel Table Lookup and Wordsize Scalability, 2007. Princeton University Department of Electrical Engineering Technical Report CE-L2007-010.

[Saa11] Markku-Juhani O. Saarinen. Cryptographic Analysis of All 4 × 4-Bit S-Boxes. In Proceedings of the 18th SAC, volume 7118 of LNCS, pages 118–133. Springer, 2011.

[SDMS12] Mahdi Sajadieh, Mohammad Dakhilalian, Hamid Mala, and Pouyan Sepehrdad. Recursive Diffusion Layers for Block Ciphers and Hash Functions. In Proceedings of the 19th FSE, volume 7549 of LNCS, pages 385–401. Springer, 2012.

[Sha12] Shay Gueron. Intel AES New Instructions Set – White Paper, 2012. https://software.intel.com/sites/default/files/article/165683/ aes-wp-2012-09-22-v01.pdf.

[SIH+11] Kyoji Shibutani, Takanori Isobe, Harunaga Hiwatari, Atsushi Mitsuda, Toru Ak- ishita, and Taizo Shirai. Piccolo: An Ultra-Lightweight Blockcipher. In Proceed- ings of the 13th CHES, volume 6917 of LNCS, pages 342–357. Springer, 2011.

[Sil13] Silicon Labs. EFM32GG-STK3700 EFM32 Giant Gecko 32-bit Microcon- troller Starter Kit User Manual, 2013. http://www.silabs.com/Support% 20Documents/TechnicalDocs/efm32gg-stk3700-ug.pdf.

[Sil14] Silicon Labs. Simplicity Studio, 2014. http://www.silabs.com/Marcom% 20Documents/Resources/simplicity-studio-solutions.pdf.

[SMMK12] Tomoyasu Suzaki, Kazuhiko Minematsu, Sumio Morioka, and Eita Kobayashi. TWINE: A Lightweight Block Cipher for Multiple Platforms. In Proceedings of the 19th SAC, volume 7707 of LNCS, pages 339–354. Springer, 2012.

[SPGQ06] Fran¸cois-Xavier Standaert, Gilles Piret, Neil Gershenfeld, and Jean-Jacques Quisquater. SEA: A Scalable Encryption Algorithm for Small Embedded Appli- cations. In Proceedings of the 7th CARDIS, volume 3928 of LNCS, pages 222–236. Springer, 2006.

[SPR+04] Fran¸cois-Xavier Standaert, Gilles Piret, Ga¨elRouvroy, Jean-Jacques Quisquater, and Jean-Didier Legat. ICEBERG : An Involutional Cipher Efficient for Block Encryption in Reconfigurable Hardware. In Proceedings of the 11th FSE, volume 3017 of LNCS, pages 279–299. Springer, 2004.

177 Bibliography

[SRK11] Ahmad Salman, Marcin Rogawski, and Jens-Peter Kaps. Efficient Hardware Ac- celerator for IPSec Based on Partial Reconfiguration on Xilinx FPGAs. In Pro- ceedings of the 6th ReConFig, pages 242–248. IEEE Computer Society, 2011.

[SSA+07] Taizo Shirai, Kyoji Shibutani, Toru Akishita, Shiho Moriai, and Tetsu Iwata. The 128-bit Block Cipher CLEFIA (Extended Abstract). In Proceedings of the 14th FSE, volume 4593 of LNCS, pages 181–195. Springer, 2007.

[SSBV11] Dave Singel´ee, Stefaan Seys, Lejla Batina, and Ingrid Verbauwhede. The Com- munication and Computation Cost of Wireless Security – Extended Abstract. In Proceedings of the 4th ACM Conference on Wireless Network Security (WiSec), pages 1–3. ACM, 2011.

[SSPP09] Mohamad Sbeiti, Michael Silbermann, Axel Poschmann, and Christof Paar. De- sign Space Exploration of PRESENT Implementations for FPGAs. In Proceedings of the 5th Southern Programmable Logic Conference (SPL), pages 141–145. IEEE, 2009.

[Sta00] Standards for Efficient Cryptography Group. SEC 2: Recommended Elliptic Curve Domain Parameters. http://www.secg.org/collateral/sec2_final. pdf, 2000.

[STM13] STMicroelectronics. STM32F4DISCOVERY Microcontroller with 32-bit ARM Cortex-M4F Core Datasheet, 2013. http://www.st.com/st-web-ui/static/ active/en/resource/technical/document/data_brief/DM00037955.pdf.

[SYL08] Zhijie Jerry Shi, Xiao Yang, and Ruby B. Lee. Alternative Application-Specific Processor Architectures for Fast Arbitrary Bit Permutations. International Jour- nal of Embedded Systems, 3(4):219–228, 2008.

[TFI+09] Stefan Tillich, Martin Feldhofer, Wolfgang Issovits, Thomas Kern, Hermann Kureck, Michael M¨uhlberghuber, Georg Neubauer, Andreas Reiter, Armin K¨ofler, and Mathias Mayrhofer. Compact Hardware Implementations of the SHA-3 Can- didates ARIRANG, BLAKE, Grøstl, and Skein. IACR Cryptology ePrint Archive, 2009:349, 2009.

[TFK+09] Stefan Tillich, Martin Feldhofer, Mario Kirschbaum, Thomas Plos, J¨orn-Marc Schmidt, and Alexander Szekely. High-speed Hardware Implementations of BLAKE, Blue Midnight Wish, CubeHash, ECHO, Fugue, Grøstl, Hamsi, JH, Keccak, Luffa, Shabal, SHAvite-3, SIMD, and Skein. IACR Cryptology ePrint Archive, 2009:510, 2009.

[TFs08] Stefan Tillich, Martin Feldhofer, and Johann Großsch¨adl.Area, Delay, and Power Characteristics of Standard-Cell Implementations of the AES S-box. Journal of Signal Processing Systems, 50(2):251–261, 2008.

[TH08] Stefan Tillich and Christoph Herbst. Boosting AES Performance on a Tiny Proces- sor Core. In Proceedings of the RSA Conference Cryptographers’ Track (CT-RSA), volume 4964 of LNCS, pages 170–186. Springer, 2008.

178 Bibliography

[Ts05] Stefan Tillich and Johann Großsch¨adl.Accelerating AES Using Instruction Set Ex- tensions for Elliptic Curve Cryptography. In Proceedings of the 5th International Conference on Computational Science and Its Applications (ICCSA), volume 3481 of LNCS, pages 665–675. Springer, 2005.

[Ts06] Stefan Tillich and Johann Großsch¨adl. Instruction Set Extensions for Effi- cient AES Implementation on 32-bit Processors. In Proceedings of the 8th CHES, volume 4249 of LNCS, pages 270–284. Springer, 2006.

[TsS05] Stefan Tillich, Johann Großsch¨adl,and Alexander Szekely. An Instruction Set Extension for Fast and Memory-Efficient AES Implementation. In Proceedings of the 9th IFIP TC-6 TC-11 Conference on Communications and Multimedia Security (CMS), volume 3677 of LNCS, pages 11–21. Springer, 2005.

[UCI+11] Markus Ullrich, Christophe De Canni`ere,Sebastiaan Indesteege, Ozg¨ulK¨u¸c¨uk,¨ Nicky Mouha, and Bart Preneel. Finding Optimal Bitsliced Implementations of 4 × 4-Bit S-boxes. Symmetric Key Encryption Workshop, 2011.

[UMCa] UMC. 130 nm Low-leakage Faraday Library. http://freelibrary. faraday-tech.com/doc/0.13um/LL/FSC0L_D_GENERIC_CORE_ProductBrief. pdf.

[UMCb] UMC. 90 nm Low-leakage Faraday Library. http://www.umc.com/english/pdf/ 90nm_dm.pdf.

[VGM11] Michal Varchola, Tim G¨uneysu,and Oliver Mischke. MicroECC: A Lightweight Reconfigurable Elliptic Curve Crypto-processor. In Proceedings of the 6th ReConFig, pages 204–210. IEEE Computer Society, 2011.

[VMG+10] Jo Vliegen, Nele Mentens, Jan Genoe, An Braeken, Serge Kubera, Abdellah Touhafi, and Ingrid Verbauwhede. A Compact FPGA-based Architecture for El- liptic Curve Cryptography over Prime Fields. In Proceedings of the 21st ASAP, pages 313–316. IEEE, 2010.

[Wal07] Jean-Baptiste Waldner. Nano-informatique et Intelligence Ambiante. Hermes Science Publishing, 2007.

[WBC10] Haixin Wang, Guoqiang Bai, and Hongyi Chen. A Gbps IPSec SSL Security Processor Design and Implementation in an FPGA Prototyping Platform. Journal of Signal Processing Systems, 58(3):311–324, 2010.

[Wei99] Mark Weiser. The Computer for the 21st Century. ACM SIGMOBILE Mobile Computing and Communications Review Special Issue, 3(3):3–11, 1999.

[Wu11] Hongjun Wu. The Hash Function JH. Submission to NIST SHA-3 Competition, 2011.

[WWW12] Shengbao Wu, Mingsheng Wang, and Wenling Wu. Recursive Diffusion Layers for (Lightweight) Block Ciphers and Hash Functions. In Proceedings of the 19th SAC, volume 7707 of LNCS, pages 355–371. Springer, 2012.

179 Bibliography

[WZ11] Wenling Wu and Lei Zhang. LBlock: A Lightweight Block Cipher. In Proceedings of the 9th International Conference on Applied Cryptography and Network Security (ACNS), volume 6715 of LNCS, pages 327–344. Springer, 2011.

[Xil] Xilinx. ISE Design Suite. http://www.xilinx.com/products/design-tools/ ise-design-suite.html.

[Xil09] Xilinx. MicroBlaze Soft Processor Core, 2009. http://www.xilinx.com/ support/documentation/user_guides/ug486.pdf.

[Xil11a] Xilinx. Extended Spartan-3A Family Overview, 2011. http://www.xilinx.com/ support/documentation/data_sheets/ds706.pdf.

[Xil11b] Xilinx. PicoBlaze 8-bit Microcontroller, 2011. http://www.xilinx.com/ support/documentation/ip_documentation/ug129.pdf.

[Xil11c] Xilinx. Spartan-6 Family Overview, 2011. http://www.xilinx.com/support/ documentation/data_sheets/ds160.pdf.

[Xil12] Xilinx. ML605 Hardware User Guide, 2012. http://www.xilinx.com/support/ documentation/boards_and_kits/ug534.pdf.

[YK09] Panasayya Yalla and Jens-Peter Kaps. Lightweight Cryptography for FPGAs. In Proceedings of the 4th ReConFig, pages 225–230. IEEE Computer Society, 2009.

180 List of Figures

List of Figures

1.1 Development of computing devices over the last 50 years ...... 4 1.2 The gaps in lightweight cryptography ...... 6

2.1 Branches of cryptology and cryptographic primitives ...... 10 2.2 Illustration of SPN ...... 10

3.1 Xilinx Virtex-6 ML605 Evaluation Board ...... 16 3.2 STM32F4DISCOVERY ...... 18

4.1 Round-based implementation of a generic block cipher ...... 23 4.2 Round-based implementation of a generic block cipher – no key schedule . . . . . 23

5.1 PRESENT data flow scheme in the RAM-based design for a 3-round example . . 31 5.2 PRESENT permutation scheme ...... 32 5.3 PRESENT cycle count for the state processing phases of each round (on-slice Sbox version) ...... 33 5.4 PRESENT cycle count for the first state processing phase of each round (on-RAM Sbox version) ...... 34 5.5 PRESENT cycle count for the key expansion phase of each round (on-slice Sbox version) ...... 34 5.6 PRESENT cycle count for the key expansion phase of each round (on-RAM Sbox version) ...... 35 5.7 PRESENT cycle count for the final round (same in both on-slice and on-RAM Sbox versions) ...... 35 5.8 PRESENT block diagram (key expansion module shown on the right) for on-slice Sbox version ...... 36 5.9 PRESENT block diagram (key expansion module shown on the right) for on- RAM Sbox version ...... 36

6.1 BLAKE compression function ...... 41 6.2 One round of BLAKE and the underlying Gi function ...... 41 6.3 BLAKE serial architecture ...... 42 6.4 Gi half function ...... 42 6.5 BLAKE serial data flow ...... 43 6.6 BLAKE timing diagram ...... 44 6.7 Grøstl compression function ...... 44 6.8 Grøstl construction function f (left) and output function Ω (right) ...... 45 6.9 P and Q permutations ...... 45 6.10 Grøstl serial architecture ...... 46 List of Figures

6.11 Details of P /Q block...... 47 6.12 Data flow for 4 × 4 toy version ...... 47 6.13 Grøstl timing diagram ...... 48 6.14 JH compression function ...... 49 6.15 Structure of F8 compression function (left) and E8 function (right) ...... 49 6.16 Three layers of round function ...... 50 6.17 JH serial architecture ...... 50 6.18 JH serial flow ...... 51 6.19 JH timing diagram ...... 52 6.20 Sponge construction of Keccak ...... 53 6.21 Keccak-f function and steps of the function ...... 53 6.22 Keccak serial architecture ...... 54 6.23 Keccak timing diagram ...... 55 6.24 Keccak data flow ...... 55 6.25 Skein normal hashing scheme ...... 56 6.26 Four rounds of ThreeFish-512 ...... 57 6.27 Skein serial architecture ...... 57 6.28 Skein serial flow ...... 58 6.29 Skein timing diagram ...... 59 6.30 Interface model ...... 59

7.1 Integrating IPSecco ...... 65 7.2 Architecture of the HMAC core ...... 67 7.3 Architecture of the PRESENT/GCM core ...... 69 7.4 Architecture of the RAM-based PRESENT implementation ...... 70 7.5 Architecture of the serial PRESENT-80 implementation ...... 71 7.6 Architecture of the serial PRESENT-128 implementation ...... 72 7.7 Architecture of the ECC core ...... 73

8.1 Sbox distributions with respect to gate counts ...... 89 8.2 Round-based implementation of PRINCE ...... 92

9.1 Overall structure of NLU ...... 102 9.2 Non-linear unit ...... 104 9.3 Matrix multiplication ...... 105 9.4 Linear unit – detailed circuit ...... 106

10.1 Instruction flow with quality check – an example with 8 instructions ...... 123 10.2 Overall instruction flow – showing source and destination skips at each stage . . 123 10.3 Reconfigurable “optimal linear layer” search architecture ...... 125

184 List of Tables

List of Tables

4.1 Performance and energy numbers of various AES realizations all using 128-bit key lengths ...... 26 4.2 Performance and energy numbers of lightweight block ciphers implementations . 26

5.1 Performance comparison ...... 37

6.1 BLAKE specifications ...... 41 6.2 Grøstl specifications ...... 45 6.3 JH specifications ...... 49 6.4 Keccak specifications ...... 52 6.5 Skein specifications ...... 56 6.6 Comparison of our work with previous works ...... 60

7.1 IPSecco configurations ...... 66 7.2 Characteristics of our implementation for Xilinx Spartan-3 ...... 74 7.3 Characteristics of our implementation for Xilinx Spartan-6 ...... 75

8.1 All Sboxes for the PRINCE-family up to affine equivalence ...... 86 8.2 Area/power comparison of unrolled versions of PRINCE and other ciphers . . . 91 8.3 Extrapolated area of unrolled versions of other ciphers against PRINCE ..... 91 8.4 Performance comparison of round-based versions of PRINCE and other ciphers . 92 8.5 Performance of round-based versions of other ciphers against PRINCE ..... 92 8.6 Performance comparison of unrolled versions of PRINCE and other ciphers on FPGA ...... 93 8.7 Performance comparison of round-based versions of PRINCE and other ciphers on FPGA ...... 93 8.8 Performance of PRINCE on 32-bit ARM CORTEX-M3 ...... 94 8.9 Performance of PRINCE on 8-bit Atmel ATmega128 ...... 94

9.1 NLU instructions ...... 102 9.2 Performance comparison of ciphers implemented with/without NLU ...... 114

10.1 Limited instruction set ...... 124 10.2 Performance comparison of PRIDE to other block ciphers ...... 132 10.3 Performance of PRIDE on 32-bit ARM CORTEX-M3 ...... 133 10.4 Performance of PRIDE on 32-bit STM32F4DISCOVERY ...... 133

About the Author

About the Author

Author information as of February 2015.

Personal Information

Elif Bilge Kavun Born in Izmir,˙ Turkey on 22/12/1986

Education

 06/2011–02/2015: Ph.D. (Dr.-Ing.) in Electrical Engineering and Information Technol- ogy, Ruhr-Universit¨atBochum, Germany

 08/2008–09/2010: M.Sc. in Cryptography, Middle East Technical University, Ankara, Turkey

 09/2004–07/2008: B.Sc. in Electronics and Telecommunications Engineering, Izmir˙ Institute of Technology, Turkey

 01/2008–06/2008: Erasmus Exchange Student at Universit´eJoseph Fourier - Grenoble, France

 09/1997–06/2004: High School Diploma in Science Field, Konak Anadolu Lisesi, Izmir/Turkey˙

Professional and Academic Experience

 12/2014–Present: Digital Design Engineer, Infineon Technologies AG, Munich, Ger- many

 06/2011–11/2014: Research Assistant, Embedded Security Group, Horst G¨ortzInsti- tute for IT-Security, Ruhr-Universit¨atBochum, Germany

189 About the Author

 03/2014–05/2014: Interim Engineering Intern, Product Security Group, Qualcomm Technologies, Inc., San Diego, CA, USA

 01/2011–05/2011: Research Assistant, Hardware Security Group, Fraunhofer - AISEC (former Fraunhofer - SIT), Munich, Germany

 06/2009–12/2010: Independent Consultant – Hardware Design, Izmir,˙ Turkey

 03/2008–06/2008: Intern, TIMA Laboratory, Institut Polytechnique de Grenoble, France

 08/2007–09/2007: Intern, Crypto Group, Aselsan Electronics, Ankara, Turkey

 05/2007–07/2007: Intern, Synplicity, Ankara, Turkey

 06/2006–07/2006: Intern, R & D Safety Group, Vestel Electronics, Manisa, Turkey

Academic Awards

 10/2012–11/2014: UbiCrypt (Ubiquitous Cryptography) Research Training Group (DFG Graduiertenkolleg – DFG RTG GRK 1817/1) Scholarship

 12/2011: METU Mustafa N. Parlar Research Foundation, M.Sc. Best Thesis Award

 06/2011: Middle East Technical University, M.Sc. Best Thesis Award

190 Author’s Publications

Author’s Publications

Author’s publications as of February 2015.

Peer-Reviewed Publications in Journals

 E. B. Kavun, H. Mihajloska, and T. Yal¸cın. A Survey On Authenticated Encryption – Hardware Designer’s Perspective. Under revision in ACM Computing Surveys. ACM, 2014.

 A. Bogdanov, E. B. Kavun, E. Tischhauser, and T. Yal¸cın.Large-Scale High-Resolution Computational Validation of Novel Complexity Models in Linear Cryptanalysis. In Jour- nal of Computational and Applied Mathematics, volume 259, part B, pages 592-598. El- sevier, 2014.

Peer-Reviewed Publications in the Proceedings of International Conferences and Workshops

 M. R. Albrecht, B. Driessen, E. B. Kavun, G. Leander, C. Paar, and T. Yal¸cın. Block Ciphers – Focus On The Linear Layer (feat. PRIDE). In 34th International Cryptology Conference (CRYPTO’14), volume 8616 of Lecture Notes in Computer Science, pages 57-76. Springer, 2014.

 E. B. Kavun, G. Leander, and T. Yal¸cın. A Reconfigurable Architecture for Searching Optimal Software Code to Implement Block Cipher Permutation Matrices. In Inter- national Conference on ReConFigurable Computing and FPGAs (ReConFig’13). IEEE Computer Society, 2013.

 L. Batina, A. Das, B. Ege, E. B. Kavun, N. Mentens, C. Paar, I. Verbauwhede, and T. Yal¸cın.Dietary Recommendations for Lightweight Block Ciphers: Power, Energy and Area Analysis of Recently Developed Architectures. In 9th Workshop on RFID Secu- rity (RFIDSec’13), volume 8262 of Lecture Notes in Computer Science, pages 103-112. Springer, 2013.

 S. Engels, E. B. Kavun, H. Mihajloska, C. Paar, and T. Yal¸cın. A Non-Linear/Linear Instruction Set Extension for Lightweight Ciphers. In 21st IEEE Symposium on Computer Arithmetic (ARITH 21). IEEE Computer Society, 2013.

 J. Borghoff, A. Canteaut, T. G¨uneysu, E. B. Kavun, M. Kneˇzevi´c,L. R. Knudsen, G. Leander, V. Nikov, C. Paar, C. Rechberger, P. Rombouts, S. S. Thomsen, and T. Yal¸cın.

191 Author’s Publications

PRINCE – A Low-Latency Block Cipher for Pervasive Computing Applications (Extended Abstract). In 18th Annual International Conference on the Theory and Application of Cryptology and Information Security (ASIACRYPT’12), volume 7658 of Lecture Notes in Computer Science, pages 208-225. Springer, 2012.

 B. Driessen, T. G¨uneysu, E. B. Kavun, O. Mischke, C. Paar, and T. P¨oppelmann. IPSecco: A Lightweight and Reconfigurable IPSec Core. In International Conference on ReConFigurable Computing and FPGAs (ReConFig’12). IEEE Computer Society, 2012.

 A. Bogdanov, E. B. Kavun, E. Tischhauser, and T. Yal¸cın. Efficient Reconfigurable Hardware Architecture for Accurately Computing Success Probability and Data Com- plexity of Linear Attacks. In International Conference on ReConFigurable Computing and FPGAs (ReConFig’12). IEEE Computer Society, 2012.

 T. Yal¸cınand E. B. Kavun. On the Implementation Aspects of Sponge-Based Authen- ticated Encryption for Pervasive Devices. In 11th Smart Card Research and Advanced Application Conference (CARDIS’12), volume 7771 of Lecture Notes in Computer Sci- ence, pages 141-157. Springer, 2012.

 A. Bogdanov, E. B. Kavun, E. Tischhauser, and T. Yal¸cın. Experimental Evaluation of Success Probability and Data Complexity of Linear Attacks on Hardware. In International Conference on Applied and Computational Mathematics (ICACM). 2012.

 E. B. Kavun and T. Yal¸cın. RAM-Based Ultra-Lightweight FPGA Implementation of PRESENT. In International Conference on ReConFigurable Computing and FPGAs (ReConFig’11). IEEE Computer Society, 2011.

 B.Ege, E. B. Kavun, and T. Yal¸cın. Memory Encryption for Smart Cards. In 10th Smart Card Research and Advanced Application Conference (CARDIS’11), volume 7079 of Lecture Notes in Computer Science, pages 199-216. Springer, 2011.

 B.Bilgin, E. B. Kavun, and T. Yal¸cın.Towards an Ultra Lightweight Crypto Processor. In 1st Workshop on Lightweight Security and Privacy (LightSec’11). 2011.

 E. B. Kavun and T. Yal¸cın.A Lightweight Implementation of Keccak Hash Function for Radio-Frequency Identification Applications. In 6th Workshop on RFID Security (RFID- Sec’10), volume 6370 of Lecture Notes in Computer Science, pages 258-269. Springer, 2010.

 E. B. Kavun and T. Yal¸cın.A Pipelined Camellia Architecture for Compact Hardware Implementation. In 21st IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP’10), pages 305 - 308. IEEE, 2010.

 T. Yal¸cın, E. B. Kavun, and R. Carmon. A Compact Cryptographic Processor with Flexible Coprocessor Plug-in Interface. In 4th International Information Security and Cryptology Conference (ISC Turkey’10). 2010.

192 Author’s Publications

Other Publications

 M. R. Albrecht, B. Driessen, E. B. Kavun, G. Leander, C. Paar, and T. Yal¸cın. Block Ciphers – Focus On The Linear Layer (feat. PRIDE). In IACR Cryptology ePrint Archive 2014:453, 2014.

 E. B. Kavun, M. M. Lauridsen, G. Leander, C. Rechberger, P. Schwabe, and T. Yal¸cın. Prøst. CAESAR Proposal ( http://proest.compute.dtu.dk ). 2014.

 L. Batina, A. Das, B. Ege, E. B. Kavun, N. Mentens, C. Paar, I. Verbauwhede, and T. Yal¸cın.Dietary Recommendations for Lightweight Block Ciphers: Power, Energy and Area Analysis of Recently Developed Architectures. In IACR Cryptology ePrint Archive 2013:753, 2013.

 J. Borghoff, A. Canteaut, T. G¨uneysu, E. B. Kavun, M. Kneˇzevi´c,L. R. Knudsen, G. Leander, V. Nikov, C. Paar, C. Rechberger, P. Rombouts, S. S. Thomsen, and T. Yal¸cın. PRINCE – A Low-latency Block Cipher for Pervasive Computing Applications (Full Version). In IACR Cryptology ePrint Archive 2012:529, 2012.

Accepted Posters and Presentations at International Conferences and Workshops (without proceedings)

 S. Engels, E. B. Kavun, H. Mihajloska, C. Paar, and T. Yal¸cın. A Non-Linear/Linear Instruction Set Extension for Lightweight Ciphers. Presentation at 3rd Workshop on Cryptography, Robustness, and Provably Secure Schemes for Female Young Researchers (CrossFyre’13). 2013.

 E. B. Kavun and T. Yal¸cın.On Energy and Cryptographic-Effort of Lightweight Ciphers. Poster at 8th Workshop on RFID Security (RFIDSec’12). 2012.

 E. B. Kavun, C. Paar, and T. Yal¸cın. A Linear Transform Based Power Reduction Technique for Lightweight Block Ciphers. Presentation at 2nd Workshop on Cryptography, Robustness, and Provably Secure Schemes for Female Young Researchers (CrossFyre’12). 2012.

 T. Yal¸cınand E. B. Kavun. On the Suitability of SHA-3 Finalists for Lightweight Applications. Presentation at 3rd SHA-3 Conference. 2012.

 A. Bogdanov, E. B. Kavun, D. Khovratovich, C. Paar, C. Rechberger, and T. Yal¸cın. Practical Biclique Cryptanalysis: Low Data Complexity Key Search for AES-128 on FPGA: Extended Abstract. Presentation at 5th Workshop on Special-Purpose Hardware for Attacking Cryptographic Systems (SHARCS’12). 2012.

 E. B. Kavun and T. Yal¸cın.Ultra-Compact Reconfigurable NTRUEncrypt Public Key Core. Presentation at 9th International Workshop on Cryptographic Archi- tectures Embedded in Reconfigurable Devices (CryptArchi’11). 2011.

 B. Bilgin, E. B. Kavun, and T. Yal¸cın.Serialized Keccak Architecture for Lightweight Applications. Presentation at 2nd SHA-3 Conference. 2010.

193 Author’s Publications

Invited Talks

 25/05/2012: Lightweight Cryptography – What do we really mean by “Lightweight”?. IT-Sicherheits-Stammtisch, Industrie- und Handelskammer Mittleres Ruhrgebiet, Bochum, Germany.

 17/05/2012: Lightweight Cryptography – What do we really mean by “Lightweight”?. 5th International Conference on Information Security and Cryptology (ISCTurkey’12), Ankara, Turkey.

Attended Conferences and Workshops During Ph.D.

 CRYPTO’14, Santa Barbara, CA, US (August 2014)

 CrossFyre’14, Bochum, Germany (July 2014)

 ReConFig’13, Canc´un,Mexico (December 2013)

 CHES’13, Santa Barbara, CA, US (August 2013)

 CRYPTO’13, Santa Barbara, CA, US (August 2013)

 RFIDSec’13, Graz, Austria (July 2013)

 CrossFyre’13, Leuven, Belgium (June 2013)

 ARITH 21, Austin, Texas, US (April 2013)

 ASIACRYPT’12, Beijing, China (December 2012)

 ICACM, Ankara, Turkey (October 2012)

 CHES’12, Leuven, Belgium (September 2012)

 RFIDSec’12, Nijmegen, the Netherlands (July 2012)

 CrossFyre’12, Eindhoven, the Netherlands (June 2012)

 ISCTurkey’12, Ankara, Turkey (May 2012)

 3rd SHA-3 Conference, Washington, D.C., US (March 2012)

 SHARCS’12, Washington, D.C., US (March 2012)

 ReConFig’11, Canc´un,Mexico (December 2011)

 ECRYPT Workshop on Lightweight Cryptography, Louvain-la-Neuve, Belgium (Novem- ber 2011)

 CARDIS’11, Leuven, Belgium (September 2011)

 CryptArchi’11, Bochum, Germany (June 2011)

194