
CONSTRUCTIVE AND DESTRUCTIVE REVERSE ENGINEERING ASPECTS OF DIGITAL SYSTEMS

DISSERTATION

zur Erlangung des Grades eines Doktor-Ingenieurs der Fakultät für Elektrotechnik und Informationstechnik an der Ruhr-Universität Bochum

by Marc Fyrbiak
Bochum, May 2018

© 2018 by Marc Fyrbiak. All rights reserved. Printed in Germany.

To my beloved family.

Marc Fyrbiak
Place of birth: Braunschweig, Germany
Author’s contact information: [email protected]
www.emsec.rub.de

Thesis Advisor: Prof. Dr.-Ing. Christof Paar, Ruhr-Universität Bochum, Germany
Secondary Referee: Prof. Russell Tessier, University of Massachusetts Amherst, MA, USA
Thesis submitted: May 30, 2018
Thesis defense: September 21, 2018
Last revision: April 26, 2019


Abstract

Our modern digital society stands on the foundation of digital hardware systems implemented in a myriad of interconnected smart-X devices as well as traditional computing systems realizing technological trends such as cloud computing. Since these devices and systems are an integral part of our everyday life, targeted manipulations of digital hardware systems have devastating consequences for safety, security, and privacy. In today’s production chains, hardware designs are transparent to numerous untrusted stakeholders and thus inevitably prone to manipulation and Intellectual Property (IP) piracy. In many attack scenarios, an adversary merely has access to the low-level, potentially obfuscated gate-level netlist and faces the costly and time-consuming task of reverse engineering the proprietary design to identify security-critical circuitry, followed by the insertion of a meaningful hardware Trojan. However, these challenges of destructive aspects have so far been considered only in passing by the research community. In addition to destructive aspects, hardware reverse engineering facilitates various constructive applications and can thus safeguard digital hardware systems. For example, security engineers are forced to leverage reverse engineering in order to detect hardware Trojans in untrusted third-party IP cores, since the source code is typically not available. Moreover, understanding the complexity of hardware reverse engineering facilitates sound countermeasures for hardware protection schemes. So far, reverse engineering has commonly been neglected in the security analysis of various schemes since it is still an opaque and poorly understood process. In order to systematically investigate constructive and destructive aspects of gate-level netlist reverse engineering, we first address the lack of gate-level netlist reverse engineering and manipulation frameworks in the open literature and then present a holistic framework called HAL.
HAL supports and automates custom reverse engineering tasks and enables tailored manipulations with ease. Based on its extensibility, we investigate diverse technical aspects of gate-level netlist reverse engineering and provide several research contributions, such as the many-faceted workflow of semi-automated hardware Trojan insertion, the costs associated with reverse engineering based on graph similarity algorithms, and novel insights into the (in-)security of several Finite State Machine (FSM)-based hardware obfuscation schemes. Moreover, we investigate cognitive aspects of reverse engineering and develop a methodology, based on problem-solving research and research on the acquisition of expertise, to quantify the crucial human factor in hardware reverse engineering. As an important finding, we show that the development of automated custom tools for reverse engineering and hardware Trojan injection is not as challenging and time-consuming as previously thought.

Keywords. Gate-level Netlist Reverse Engineering, Cognitive Aspects of Reverse Engineering, Hardware Obfuscation, Hardware Trojans, Graph Similarity

Kurzfassung

Unsere moderne digitale Gesellschaft steht auf dem Fundament von digitalen Hardware-Systemen, die in einer Vielzahl von verbundenen smart-X Geräten und traditionellen Computersystemen verbaut sind. Da diese Geräte und Systeme integraler Bestandteil unseres täglichen Lebens geworden sind, haben gezielte Manipulationen dieser Hardware-Systeme verheerende Auswirkungen für die Sicherheit und Privatheit. In heutigen Produktionsketten sind Hardware-Designs transparent für viele nicht vertrauenswürdige Stakeholder und damit zwangsläufig anfällig für Manipulation und IP-Piraterie. In vielen Angriffsszenarien hat ein Angreifer allerdings nur Zugang zur low-level, potentiell obfuskierten, gate-level Netzliste und steht daher dem kosten- und zeitaufwendigen Reverse Engineering gegenüber: Zuerst müssen sicherheitsrelevante Elemente im Schaltkreis identifiziert werden und anschließend muss ein sinnvoller Hardware-Trojaner eingefügt werden. Jedoch wurden diese Herausforderungen bisher nur flüchtig von der Forschungsgemeinschaft behandelt. Zusätzlich zu destruktiven Aspekten ermöglicht Hardware Reverse Engineering verschiedene konstruktive Applikationen, um Hardware-Systeme abzusichern. So sind Sicherheitsingenieure dazu gezwungen, nicht vertrauenswürdige Designs von Drittparteien per Reverse Engineering zu analysieren, um Hardware-Trojaner zu identifizieren, da der Quellcode typischerweise nicht verfügbar ist. Außerdem ermöglicht ein genaueres Verständnis von Hardware Reverse Engineering die Entwicklung solider Schutzmaßnahmen. Jedoch wurde Reverse Engineering in der Sicherheitsanalyse von verschiedenen Schutzmaßnahmen vernachlässigt, da Reverse Engineering immer noch ein undurchsichtiger und kaum verstandener Prozess ist.
Um systematisch konstruktive und destruktive Aspekte von gate-level Netlist Reverse Engineering zu erforschen, beschreiben wir zuerst das Fehlen eines Netlist-Reverse-Engineering- und Manipulationsframeworks in der öffentlich zugänglichen Literatur und präsentieren ein ganzheitliches Framework namens HAL. HAL unterstützt und automatisiert maßgeschneiderte Reverse-Engineering-Aufgaben und ermöglicht zielgerichtete Manipulationen. Basierend auf der Erweiterbarkeit von HAL untersuchen wir verschiedene technische Aspekte von gate-level Netlist Reverse Engineering und liefern mehrere Forschungsbeiträge, z. B. den facettenreichen Arbeitsablauf der teilautomatisierten Hardware-Trojaner-Einfügung, Reverse-Engineering-Kosten basierend auf Graph-Ähnlichkeitsanalysen sowie neue Erkenntnisse bzgl. der (Un-)Sicherheit von mehreren Obfuskationsschemata basierend auf endlichen Zustandsautomaten. Außerdem untersuchen wir kognitive Aspekte beim Reverse Engineering und entwickeln eine Methodik basierend auf Problemlöse- und Expertiseforschung, um entscheidende menschliche Faktoren beim Hardware Reverse Engineering zu untersuchen. Gegensätzlich zur bisherigen Meinung zeigen wir, dass die Entwicklung von automatisierten maßgeschneiderten Werkzeugen zum Reverse Engineering und zur Hardware-Trojaner-Einfügung nicht so anspruchsvoll und zeitaufwendig ist wie bisher angenommen.

Schlagworte. Gate-level Netlist Reverse Engineering, Kognitive Aspekte von Reverse Engineering, Hardware-Obfuskation, Hardware-Trojaner, Graph-Ähnlichkeit

Acknowledgements

I want to express my sincere gratitude to my Doktorvater Christof Paar for his continuous excellent advice far beyond the scope of research. His positive and inspiring nature provided an outstanding education for both research and teaching. In particular, I want to thank him for pushing Philipp, Benjamin, and myself to found our start-up. Moreover, my sincere thanks go to Russell Tessier for being like a second Doktorvater to me, for his excellent guidance in the field of computer architecture, and for improving my writing.

Moreover, I want to thank Irmgard for her overall support with any office and administration issues, and for our office (chit)chat. Furthermore, thanks to Horst for solving all technical issues and satisfying our technical requirements. Many thanks to all my colleagues at EMSEC, SYSSEC, and SecHuman for numerous interesting conversations about security research and about research in general. Special thanks to Nicolai for his kind introduction to statistics research and our conversations about technology. Another special thanks goes to Nikol, Malte, Sebastian, and Carina for being so patient in teaching me psychology. I want to thank our Chaos WG with Philipp, Benjamin, and Christian for exploring IT-security research together, as well as for the spare time we enjoyed together. In addition, I want to thank my office mates (in chronological order) Ingo, Christian, Pawel, Max, Sebastian, and Nils for numerous inspiring technical and non-technical discussions. My sincere gratitude to all my co-authors and colleagues for spending nights in the office right before submission deadlines, and to all the students I supervised during my research time; I was very lucky to supervise you and learned a lot from you!

Last but not least, I want to thank my family and friends for their love and support throughout the years.

Table of Contents

Imprint ...... v
Preface ...... vii
Abstract ...... vii
Kurzfassung ...... ix
Acknowledgements ...... xiii

I Introduction 1

1 Introduction 3

II Background 7

2 Technical Background 9
2.1 Hardware Design Process ...... 9
2.2 Gate-level Netlists ...... 10
2.3 Adversary Model ...... 11
2.4 Hardware Reverse Engineering ...... 11
2.4.1 Chip-level Reverse Engineering ...... 11
2.4.2 FPGA Bitstream Reverse Engineering ...... 12
2.4.3 Gate-level Netlist Reverse Engineering ...... 12
2.5 Hardware Trojan Detection and Manipulation ...... 13
2.6 Hardware Protection ...... 13

III Gate-level Netlist Reverse Engineering 15

3 HAL – Gate-level Netlist Reverse Engineering and Manipulation Framework 17
3.1 Introduction ...... 18
3.2 HAL – Design and Implementation ...... 19
3.2.1 System Architecture ...... 19
3.2.2 Building Blocks ...... 20
3.2.3 Implementation ...... 22
3.3 Hardware Trojan Detection Technique ANGEL ...... 23
3.3.1 Boolean Function and Graph Neighborhood Analysis ...... 23

3.3.2 Evaluation ...... 25
3.3.3 Discussion ...... 26
3.4 Hardware Trojan Injection ...... 27
3.4.1 Case Study: Disarm Cryptographic Self-Tests ...... 28
3.4.2 Case Study: Wiretapping Keys in IP Cores ...... 31
3.5 IP Watermarking ...... 33
3.5.1 LUT-based Watermarking ...... 33
3.5.2 Opaque Predicates ...... 34
3.6 Human Factors in Hardware Reverse Engineering ...... 35
3.6.1 Setting: A Learning Perspective ...... 36
3.6.2 Problem Solving and Expertise Research ...... 36
3.6.3 Open Challenges: Quantification of Human Factors ...... 37
3.6.4 Dichotomies for Human Factor Quantification ...... 37
3.6.5 Research Designs, Data Collection, and Challenges ...... 38
3.7 Conclusion ...... 40

4 Graph Similarity and Its Application to Hardware Security 41
4.1 Introduction ...... 42
4.2 The Graph Similarity Problem ...... 42
4.2.1 Preliminaries ...... 42
4.2.2 Hardware Characteristics and Optimizations ...... 44
4.2.3 Graph Similarity Analysis Strategy ...... 44
4.2.4 Graph Edit Distance Approximation ...... 46
4.2.5 Neighbour Matching ...... 48
4.2.6 Multiresolutional Spectral Analysis ...... 50
4.3 Evaluation ...... 51
4.3.1 Implementation ...... 52
4.3.2 Case Study I: Gate-level Netlist Reverse Engineering ...... 52
4.3.3 Case Study II: Trojan Detection ...... 56
4.3.4 Case Study III: Obfuscation Assessment ...... 58
4.4 Discussion ...... 60
4.5 Conclusion ...... 61

5 On the Difficulty of FSM-based Hardware Obfuscation 67
5.1 Introduction ...... 68
5.2 Automated FSM Reverse Engineering ...... 68
5.2.1 Phase 1: Topological Analysis ...... 69
5.2.2 Phase 2: Boolean Function Analysis ...... 72
5.3 Reverse Engineering and Deobfuscation of FSM Obfuscation Schemes ...... 73
5.3.1 HARPOON ...... 74
5.3.2 Dynamic State Deflection ...... 75
5.3.3 Active Hardware Metering ...... 77
5.3.4 Interlocking Obfuscation ...... 79
5.3.5 Lessons Learned ...... 80


5.4 Evaluation ...... 81
5.4.1 Case Study: Cryptographic Designs ...... 82
5.4.2 Case Study: Communication Interfaces ...... 88
5.5 Discussion ...... 90
5.6 Conclusion ...... 92

IV Hardware-Assisted Instruction Set Architecture Obfuscation 95

6 Hybrid Obfuscation to Protect Intellectual Property on Embedded Microprocessors 97
6.1 Introduction ...... 98
6.2 Technical Background and Related Work ...... 99
6.2.1 System and Adversary Model ...... 100
6.2.2 Instruction Set Architecture ...... 102
6.3 Hardware-level Obfuscation ...... 103
6.3.1 Opcode Substitution ...... 103
6.3.2 Operand Permutation ...... 105
6.3.3 Hardware-enforced Access Control ...... 105
6.3.4 Hardware-level Booby Traps ...... 106
6.4 Software-level Obfuscation ...... 107
6.5 Implementation ...... 108
6.6 Performance Evaluation ...... 108
6.7 Security Analysis ...... 111
6.8 Security Metrics for Obfuscation ...... 112
6.8.1 Similarity Metric ...... 112
6.8.2 Case Study – SPREE Benchmark Suite ...... 114
6.9 Discussion ...... 121
6.10 Conclusion ...... 123

V Conclusion 125

7 Conclusion 127
7.1 Impact of Gate-level Netlist Reverse Engineering ...... 127
7.2 Future Research Directions ...... 128

VI Appendix 131

Bibliography 133

List of Abbreviations 149


List of Figures 150

List of Tables 153

About the Author 157

Part I

Introduction

Chapter 1 Introduction

Computing systems and communication technologies have become an integral part of our everyday life and nowadays connect billions of users world-wide. Essential parts of our Information Age are digital hardware systems implemented in a myriad of interconnected smart-X devices and traditional computing systems, which lay the foundation for applications such as the Internet of Things (IoT) or autonomous driving. Since digital hardware systems handle personal and security-critical information, understanding how proprietary, third-party digital hardware systems work in detail is indispensable for protecting our modern digital society, as malicious manipulations of the underlying hardware components can have catastrophic consequences for security and safety. In particular, this profound understanding of digital hardware systems, a.k.a. reverse engineering, enables both constructive and destructive applications.

Reverse Engineering—Destructive Aspects. Modern digital hardware system design and manufacturing processes are heavily globalized with various (untrusted) stakeholders, including IP providers and off-shore foundries [1]. Both are able to reverse engineer the proprietary, potentially obfuscated design and subsequently inject malicious circuitry, a.k.a. hardware Trojans, prior to fabrication, with devastating consequences for system security, safety, and privacy, either for their own advantage or as a requirement imposed by nation-state adversaries. For market-dominating Static Random Access Memory (SRAM)-based Field Programmable Gate Arrays (FPGAs), the situation is similarly gloomy, since protective bitstream encryption can be invalidated for the majority of currently employed families [2, 3], allowing adversaries to reverse engineer the design and inject malicious circuitry post-manufacturing [4]. Note that most current low-cost FPGA families do not offer bitstream encryption at all.

Reverse Engineering—Constructive Aspects.
Even though reverse engineering is often associated with destructive, illegitimate actions, it is a general tool for various constructive applications such as failure analysis and the detection of counterfeit products and hardware Trojans [5]. In particular, security engineers are forced to resort to reverse engineering to detect the latter, since the Register Transfer Level (RTL) source code is typically not available in these scenarios [6]. In typical real-world scenarios (e.g., detection of hardware Trojans in third-party IP cores or analysis of a competitor’s product), a reverse engineer has access to a flattened, unstructured gate-level netlist of the design. Reverse engineering is also relevant in non-security scenarios (e.g., the US Air Force wanted to recover obsolete hardware designs of military systems [7]). In spite of the fact that reverse engineering has attracted little scrutiny from the scientific community, understanding its considerable complexity is essential to develop sound threat assessments of hardware Trojans. Note that there has only been scant treatment of the practicability of actually inserting hardware Trojans into designs. The vast majority of scientific works assume access to the Hardware Description Language (HDL) source code or

neglects the crucial step of reverse engineering, see [8, 9, 10, 11, 12, 13, 14]. Moreover, insights into reverse engineering provide valuable guidelines for developing countermeasures such as hardware obfuscation [15, 16] or countermeasures against the pressing problem of IP piracy. We want to emphasize that Integrated Circuit (IC) companies are estimated to face losses in the range of a billion US dollars in global revenue due to piracy [5], and incompletely operative, low-quality counterfeits may have similarly catastrophic consequences for the security and safety of target systems. Unfortunately, reverse engineering is a time-consuming task even for a team of analysts, thus automated and reliable techniques are indispensable to reduce the invested time and costs. Similar to the hardware design process, hardware reverse engineering requires frameworks and tools to automate custom tasks and simplify the steps for a human analyst. Given that nation-state adversaries, who arguably pose the most credible threat regarding hardware Trojans, might have developed hardware reverse engineering frameworks and tools, a better understanding of such frameworks is crucial for the security community at large. While hardware reverse engineering frameworks and tools exist in the industrial sector [17, 18], to the best of our knowledge, no publicly available reverse engineering and manipulation framework for gate-level netlists exists so far; however, such a framework is a key requirement for understanding the capabilities and limitations of reverse engineering.

Security Limitations of Hardware Obfuscation. Modern System-on-Chip (SoC) design often involves the use of numerous reusable IP cores to reduce both time-to-market and cost [19]. The economic advantages of IP use are accompanied by increased security risks for both IP owners and consumers.
Since such threat potential is an increasing concern for practical applications [16], numerous hardware protection schemes have been proposed; see Shakya et al. [20] for a comprehensive survey. Most common strategies to realize hardware protection focus on RTL or gate-level transformations in order to conceal the overall internal functionality [21, 22, 23, 24, 25, 26, 27]. However, reverse engineering and subsequent manipulation are often neglected in the security analysis of such schemes, hence their security can be limited in realistic settings.

Goal. Despite research on hardware reverse engineering [28, 16] and companies that perform on-demand services [29, 17], it is still an opaque and poorly understood process. The question is not whether analysts are able to reverse engineer a given design, since with sufficient resources they will always succeed. Rather, the fundamental research question is:

How time-consuming and thus costly is the reverse engineering process of a proprietary and potentially obfuscated gate-level netlist for successfully extracting crucial information (e.g., hardware Trojans or algorithmic implementation details)?

In order to advance towards an answer to this long-standing research question, we investigate constructive and destructive applications of hardware reverse engineering, including both technical and cognitive aspects.

Contributions and Organization.

• Chapter 2 provides the fundamental background information on hardware security and reverse engineering including our threat model used throughout this thesis.

• Chapter 3 addresses the lack of gate-level netlist reverse engineering and manipulation frameworks in the open literature and presents HAL, an interactive framework that handles netlists of virtually any gate library, including Application Specific Integrated Circuit (ASIC) and FPGA libraries, and that supports and automates reverse engineering tasks and enables tailored manipulations with ease. Based on the extensibility of HAL, we detail the many-faceted workflow of semi-automated hardware Trojan insertion and IP infringement, with accompanying reverse engineering of third-party gate-level netlists under realistic assumptions. Moreover, we present a novel hardware Trojan detection technique based on Boolean function analysis and graph neighborhood analysis. In particular, we demonstrate that graph neighborhood analysis reduces the crucial false-positive rate by several factors and can detect Trojans armed with obfuscation. Finally, this chapter presents an outlook on the quantification of human factors in hardware reverse engineering, based on a methodology leveraging problem-solving research and research on the acquisition of expertise.

• Chapter 4 presents the use of graph similarity in the hardware security domain and provides novel insights into the costs associated with reverse engineering, with sound mathematical underpinnings and an extensive evaluation. We show a broad spectrum of applications, ranging from reverse engineering of security-relevant circuitry, through hardware Trojan detection, to the assessment of hardware obfuscation schemes.

• Chapter 5 reveals several shortcomings of allegedly secure FSM-based hardware obfuscation schemes, enabling semi-automatic IP infringement of protected hardware designs. In concert with realistic reverse engineering and manipulation capabilities, we provide comprehensive insights into published security metrics and previous (erroneous) assumptions about reverse engineering, to serve as an educational basis for future obfuscation designers and implementers.

• Chapter 6 presents a novel hybrid obfuscation strategy combining hardware-assisted and software-assisted obfuscation techniques to hinder software reverse engineering of deeply embedded systems. We provide a detailed analysis of various information disclosure sources for an adversary with dynamic access to the instruction stream. Moreover, we discuss the generic shortcomings of state-of-the-art Instruction Set Architecture (ISA) randomization defenses in our adversary model. In addition, we propose novel evaluation tools to quantify the efficacy of hybrid obfuscation systems.

• Chapter 7 summarizes the impact of the presented research results and proposes future directions for hardware security research.


Part II

Background

Chapter 2 Technical Background

We now introduce fundamental background information on hardware security and reverse engineering to allow understanding of the following chapters of this dissertation. To this end, we first sketch the basic principles of the hardware design process. We then introduce our threat model and describe the state of the art in hardware reverse engineering and its related strands of research.

Contents of this Chapter

2.1 Hardware Design Process ...... 9
2.2 Gate-level Netlists ...... 10
2.3 Adversary Model ...... 11
2.4 Hardware Reverse Engineering ...... 11
2.4.1 Chip-level Reverse Engineering ...... 11
2.4.2 FPGA Bitstream Reverse Engineering ...... 12
2.4.3 Gate-level Netlist Reverse Engineering ...... 12
2.5 Hardware Trojan Detection and Manipulation ...... 13
2.6 Hardware Protection ...... 13

2.1 Hardware Design Process

Modern hardware design processes consist of various steps to translate a hardware design described in an HDL into a bitstream or a manufactured chip. We want to emphasize that we sketch the hardware design process for FPGAs [30] hereinafter, and we refer the interested reader to [31] for ASIC processes. In general, the first step is to describe a behavioral specification model of the hardware design with an HDL such as VHDL or Verilog. Note that High-Level Synthesis (HLS) [32] introduces another layer of indirection by accepting classical software programming languages such as C/C++ as input to generate a behavioral model described in an HDL. Afterwards, the hardware design is synthesized to a gate-level netlist representation (see Section 2.2), i.e., the HDL description is translated to a lower-level digital circuit schematic. In addition to the translation step, the design is optimized for a specific goal such as area or power, and moreover mapped (for a pre-defined device) onto available FPGA resources such as Look-Up Tables (LUTs) or dedicated memory blocks. From a high-level point of view, logic synthesis is a lossy process in which valuable information is lost from the viewpoint of

reverse engineering. For example, meaningful descriptive information such as names, module boundaries, and hierarchy information is typically not available in a gate-level netlist. Moreover, original high-level information is typically pruned due to logic-level, gate-level, and boundary optimizations. After the hardware design has been transformed into a gate-level netlist of a target gate library, the netlist is processed by placement and routing algorithms to place gates and route interconnections on the FPGA grid. Finally, proprietary vendor tools generate a bitstream, which is a proprietarily encoded version of the placed-and-routed gate-level netlist. In contrast to FPGAs, ASIC place-and-route tools generate a layout which is given to an (off-shore) foundry for manufacturing.
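The lossiness of logic synthesis can be made concrete with a toy example (a sketch in Python rather than actual synthesis tooling; the signal names `sel` and `a` are hypothetical). Synthesis only has to preserve input/output behavior, so a designer's multiplexer structure, and even an entire input, can disappear from the optimized netlist:

```python
from itertools import product

def behavioral(sel: bool, a: bool) -> bool:
    # RTL intent: a 2-to-1 multiplexer that happens to select 'a' on both branches.
    return (sel and a) or ((not sel) and a)

def synthesized(sel: bool, a: bool) -> bool:
    # After Boolean minimization the mux collapses to o = a; the 'sel'
    # input and the designer's mux structure leave no trace in the netlist.
    return a

# Logic synthesis preserves only the truth table, not the structure:
assert all(behavioral(s, a) == synthesized(s, a)
           for s, a in product([False, True], repeat=2))
print("functionally equivalent, structurally different")
```

A reverse engineer inspecting the optimized circuit sees only `o = a` and has no way to recover the original mux intent, which illustrates why names, boundaries, and structure cannot simply be "un-synthesized".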

2.2 Gate-level Netlists

As noted in the previous section, logic synthesis tools transform a hardware design, typically written at RTL, into a gate-level netlist representation in terms of a list of logic gates of a target gate library and their interconnections; see the right part of Figure 2.1. We want to highlight that the XOR and INV gates of the gate-level netlist in Figure 2.1 are typically realized as LUTs in an FPGA. Note that gate-level netlists may contain hierarchy information of higher-level modules; however, we focus on flattened gate-level netlists since hierarchy information is typically not available in real-world settings.

[Figure 2.1, upper and lower left parts: state transition graph and graph-based netlist representation (not reproduced here). The textual Verilog representation (right part) is:]

    module FSM(RST, CLK, I, O);
        input RST, CLK, I;
        output O;

        wire o_G1, o_G2;

        XOR G1(
            .IN1(o_G2), .IN2(I),
            .O(o_G1)
        );
        DFF G2(
            .CLK(CLK), .RST(RST),
            .D(o_G1), .Q(o_G2)
        );
        INV G3(
            .IN(o_G2), .O(O)
        );
    endmodule

Figure 2.1: Example Moore FSM circuit as state transition graph (upper left part) with associated gate-level netlist in (1) visual graph-based representation (lower left part), and (2) textual representation with an exemplary gate library in Verilog (right part).
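The cycle behavior of the example netlist in Figure 2.1 can be reproduced in a few lines of simulation code (a minimal Python sketch for illustration, not part of any tool described in this thesis; gate and net names follow the figure):

```python
def simulate_fsm(inputs):
    """Cycle-accurate simulation of the Moore FSM from Figure 2.1.

    DFF G2 holds the single state bit, XOR G1 computes the next state
    from the current state and input I, and INV G3 drives output O.
    """
    state = 0  # DFF G2 is cleared while RST is asserted
    trace = []
    for i in inputs:
        o_g1 = state ^ i         # XOR gate G1: next-state logic
        state = o_g1             # rising clock edge: DFF G2 latches D
        trace.append(1 - state)  # INV gate G3: O = NOT Q
    return trace

# With I held at 1, the state bit toggles every cycle, so O alternates.
print(simulate_fsm([1, 1, 1, 1]))  # [0, 1, 0, 1]
```

Even for this three-gate design, recovering the state transition graph from the netlist already requires identifying the flip-flop, its feedback logic, and the output logic, which foreshadows the FSM reverse engineering discussed in Chapter 5.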


2.3 Adversary Model

We assume an adversary with access to the potentially obfuscated, flattened gate-level netlist who has no a priori knowledge of the design’s internal workings. More precisely, the adversary has no information about module hierarchies, synthesis options, or the names of gates and signals. The high-level goal of the adversary is to retrieve information about the design’s internal workings for a specific purpose (e.g., to commit IP infringement or implant hardware Trojans). The gate-level netlist can be obtained through several means: (1) chip-level or layout reverse engineering in the case of ASICs (see Section 2.4.1), (2) bitstream-level reverse engineering in the case of FPGAs (see Section 2.4.2), or (3) directly from the (firm/hard) IP provider [1]. Note that our threat model is consistent with prior research on hardware security [1, 24, 25, 20].

2.4 Hardware Reverse Engineering

We now systematically survey the fundamental background of hardware reverse engineering for both ASICs (Section 2.4.1) and FPGAs (Section 2.4.2), and we review the state of the art in gate-level netlist reverse engineering (Section 2.4.3).

2.4.1 Chip-level Reverse Engineering

In order to access gate-level netlists of an ASIC, a reverse engineer has to perform chip-level reverse engineering and deprocess the chip; see Torrance et al. [17] and Quadir et al. [28] for comprehensive surveys. To this end, the original chip has to be (1) depackaged and mechanically preprocessed, (2) delayered and imaged layer by layer, and (3) all images have to be post-processed to eventually assemble the gate-level netlist.

Depackaging and Mechanical Preprocessing. First, the chip has to be depackaged by wet-chemical or mechanical means. In particular, the die has to be protected from any harm, and thus wet-chemical depackaging is typically chosen since the die is protected by a front-side seal layer (e.g., an SiO2 passivation). Moreover, the backside usually offers enough silicon to withstand careful depackaging processes. Bonding wires are of special interest since they connect the embedded die to the package pins.

Delayering and Imaging. Once the die is exposed, it is delayered and digitized by optical means (e.g., with the aid of a Scanning Electron Microscope (SEM) or Focused Ion Beam (FIB), since modern technology sizes hit the diffraction limit of optical microscopes). In practice, delayering involves a combination of different wet chemicals, plasma etching, and mechanical polishing steps to achieve a planar removal. To this end, different conductors, semiconductors, and dielectrics have to be investigated (due to the variety of chip manufacturing processes, optimizations, and technology sizes).

Software Post-Processing. To generate a functional representation of the chip, the digitized images have to be stitched and vectorized. Since each IC is built from elements of a standard-cell library, each cell has to be recognized in the post-processed images. Therefore, each cell of the standard-cell library has to be analyzed first in order to extract its functional meaning.
After each standard cell is recognized in the post-processed images, we obtain the gate-level netlist.


2.4.2 FPGA Bitstream Reverse Engineering

In order to access gate-level netlists of an FPGA, a reverse engineer has to analyze the configuration bitstream file that defines its behavior [28]. To this end, the reverse engineer has to (1) access the bitstream, (2) decrypt the bitstream (in case a bitstream encryption scheme is deployed), and (3) reverse engineer the proprietary bitstream file format to retrieve the netlist. Note that the following description focuses on the market-dominating SRAM-based FPGA technology; for other technologies, the interested reader is referred to Wanderley et al. [33].

Bitstream Access. Due to the underlying SRAM technology, SRAM-based FPGAs require external non-volatile memory such as flash to store the bitstream. Hence, a reverse engineer can either access the non-volatile memory and dump its content, or wiretap the communication between FPGA and non-volatile memory upon boot-up, cf. [4].

Bitstream Decryption. In order to provide confidentiality of the bitstream, FPGA manufacturers deployed bitstream encryption schemes for various device series using strong cryptographic primitives. However, several works have demonstrated that various FPGA series are vulnerable to side-channel attacks which recover the secret encryption keys [2, 3]. Thus, even if bitstream encryption is deployed, the bitstream can be decrypted for the majority of series. Note that most low-cost series do not offer bitstream encryption at all.

Bitstream Reverse Engineering. Since the bitstream file format is proprietary, a reverse engineer has to analyze the file format in order to transform the (decrypted) bitstream into a readable gate-level netlist description. To this end, several works developed automated file-format reverse engineering strategies to recover (partial) netlist information, cf. [33, 34, 4]. Note that other strategies may focus on reverse engineering parts of the toolchain, particularly bitstream generation tools, to uncover specifics of the proprietary file format.
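To picture the final step, the sketch below parses a deliberately invented toy bitstream layout (a sync word followed by fixed-size frames, each holding one 16-bit LUT configuration; real vendor formats are proprietary and far more involved). Once such a layout has been inferred, extracting LUT truth tables is mechanical:

```python
# Toy bitstream format, entirely hypothetical and for illustration only:
# a sync word marks the start of the payload, which consists of
# fixed-size frames, each holding one little-endian 16-bit LUT init value.
SYNC = b"\xAA\x66"
FRAME_SIZE = 2  # bytes per LUT configuration frame

def extract_lut_inits(bitstream: bytes):
    """Recover LUT truth tables from the toy bitstream format."""
    pos = bitstream.find(SYNC)
    if pos < 0:
        raise ValueError("sync word not found")
    payload = bitstream[pos + len(SYNC):]
    inits = []
    for off in range(0, len(payload) - FRAME_SIZE + 1, FRAME_SIZE):
        inits.append(int.from_bytes(payload[off:off + FRAME_SIZE], "little"))
    return inits

# A 16-bit LUT init of 0x6996 implements the parity (XOR) of four inputs.
bs = b"\x00" + SYNC + (0x6996).to_bytes(2, "little")
print([hex(i) for i in extract_lut_inits(bs)])  # ['0x6996']
```

The hard part of bitstream reverse engineering is inferring the layout (sync word, frame size, bit-to-resource mapping) in the first place; the extraction code itself is trivial once that knowledge exists.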

2.4.3 Gate-level Netlist Reverse Engineering

In order to infer high-level structures (e.g., security-relevant circuitry) of a gate-level netlist, the adversary has to reverse engineer the interplay of the sea of gates. However, valuable information such as module boundaries and hierarchy information, which structure the design, is lost during the synthesis step. Moreover, a variety of (logic-level) optimization strategies is performed to achieve a pre-defined optimization goal (e.g., area reduction or minimal latency), yielding deeply entangled and potentially overlapping gate structures that cannot easily be reverted to their high-level counterparts. In a case study, Hansen et al. [35] described several best practices for a human reverse engineer, such as the detection of recurrent modules and common library structures. Chisholm et al. [36] presented a workflow on how to reverse engineer module-level descriptions from gate-level netlists, addressing the synergy of the human analyst's creativity and the computer's ability to solve repetitive tasks. Doom et al. [37] proposed a technique to automatically identify functionally equivalent subcircuits in a combinational gate-level netlist. White et al. [7] built on this approach and extended it with a subcircuit enumeration technique. Chowdhary et al. [38] developed an approach to extract functional regularities from datapath circuits based on graph templates. Shi et al. [39] reported a technique to algorithmically extract FSM gates and signals from ASIC gate-level netlists. Later, Shi et al. [40] described a method to extract diverse functional modules from a gate-level netlist via Boolean function analysis. Meade et al. [41] extended FSM reverse engineering by retrieving the state transition function from ASIC gate-level netlists.


In another notable work, Meade et al. [42] followed a similarity-based approach, examining the similarity of a netlist's graph topology to identify the control registers of FSMs. Li et al. [43] developed a technique to match unknown subcircuits against library components based on pattern mining of simulation traces and model checking. In further work, Li et al. [44] described how word-level structures can be algorithmically uncovered. Subramanyan et al. [45] extended the algorithmic reverse engineering arsenal by extracting functional components such as register files or adders. Since functional identification requires the input signals of a component to be in a specific order, a reverse engineer must examine all orderings to find the correct permutation. Gascón et al. [46] addressed this problem with a template-based solution. Soeken et al. [47] proposed a dynamic reverse engineering approach based on subgraph isomorphism of permutation-invariant simulation vectors.
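The input-permutation problem sketched above can be illustrated with a toy permutation-invariant signature (an illustrative sketch, not any of the cited algorithms): computing a subcircuit's truth table under every input ordering and keeping the minimum yields a canonical value that can be matched against a component library.

```python
from itertools import permutations

def truth_table(f, n):
    # Truth table of an n-input Boolean function, packed into an integer.
    tt = 0
    for x in range(2 ** n):
        bits = [(x >> i) & 1 for i in range(n)]
        if f(*bits):
            tt |= 1 << x
    return tt

def canonical_signature(f, n):
    # Minimal truth table over all input permutations: a simple
    # permutation-invariant signature for matching unknown subcircuits
    # against known library components.
    sigs = []
    for perm in permutations(range(n)):
        g = lambda *bits, p=perm: f(*[bits[p[i]] for i in range(n)])
        sigs.append(truth_table(g, n))
    return min(sigs)

# A 2:1 multiplexer matches itself even with swapped data inputs.
mux = lambda s, a, b: (a if s else b)
mux_swapped = lambda s, b, a: (a if s else b)
print(canonical_signature(mux, 3) == canonical_signature(mux_swapped, 3))  # True
```

For realistic input counts the factorial blow-up makes this naive enumeration impractical, which is exactly the motivation for the template-based and simulation-based solutions cited above.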

2.5 Hardware Trojan Detection and Manipulation

Since an initial report by the US DoD in 2005 [48], the scientific community has extensively researched destructive and constructive aspects of hardware Trojans, see Bhunia et al. [49] for a comprehensive overview. In general, a hardware Trojan consists of a payload circuit delivering the malicious functionality (e.g., leakage of cryptographic keys or denial of service) and an optional trigger circuit that activates the payload (e.g., a counter or sensor). Constructive research focuses on the detection of hardware Trojans based on diverse characteristics such as physical attributes, trigger features, and payload features [50]. In order to detect a Trojan's characteristics, various approaches based on side-channel analysis [49] and design analysis [51, 52, 53, 54, 55] have been proposed. Several works targeted manipulations at layout-level design methodologies, such as dopant-level Trojans [11], analog Trojans [14], or parametric Trojans [13]. In addition, several destructive works focus on applications of hardware Trojans or on methodologies thwarting automated detection, see [8, 9, 12].
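The trigger/payload structure can be modeled abstractly as follows (a toy behavioral sketch with hypothetical names, not a netlist-level Trojan): a counter trigger stays dormant for a number of clock cycles, after which the payload multiplexes a secret key bit onto an otherwise genuine output.

```python
class CounterTriggerTrojan:
    """Toy model of a counter-triggered hardware Trojan (illustrative only).

    The trigger circuit counts clock cycles; once a rare threshold is
    reached, the payload forwards a secret key bit instead of the
    legitimate output signal.
    """
    def __init__(self, threshold):
        self.count = 0
        self.threshold = threshold

    def clock(self, genuine_output, key_bit):
        self.count += 1
        triggered = self.count >= self.threshold
        # Payload: leak the key bit only after activation.
        return key_bit if triggered else genuine_output

trojan = CounterTriggerTrojan(threshold=4)
outputs = [trojan.clock(genuine_output=0, key_bit=1) for _ in range(6)]
print(outputs)  # [0, 0, 0, 1, 1, 1]: genuine for three cycles, then leaking
```

Real triggers use far rarer conditions than this tiny threshold, which is precisely what makes them hard to hit during functional testing.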

2.6 Hardware Protection

In general, protection schemes against a reverse engineer with access to the gate-level netlist can be broadly categorized as (1) logic locking and (2) FSM-based obfuscation. Logic locking refers to the addition of key-dependent logic gates (e.g., based on XORs or multiplexers), introduced by Roy et al. [56], in order to hide the original functionality from any untrusted party that is not in possession of the key. Since the initial work by Roy et al., various attacks [57, 58] and adapted obfuscation schemes [59, 60, 61] to thwart these attacks have been published. Since FSMs are commonly selected to implement a circuit's control path, they constitute security-critical circuitry and thus are a relevant protection target. Several works focused on FSM-based obfuscation in order to facilitate post-manufacturing control of valuable IP [25, 24, 62, 63]. From a high-level point of view, the obfuscated FSMs are augmented with a key-based activation mechanism so that the original functionality is achieved only for the correct key. Even though numerous techniques for IP protection have been proposed, the security of these techniques is often questionable since reverse engineering capabilities are often neglected or not properly discussed. In anticipation of Chapter 5, we reveal several shortcomings of these allegedly secure


FSM-based obfuscation strategies. Li et al. [64] developed a structural obfuscation technique for sequential circuits; however, in anticipation of Chapter 4, we demonstrate that several key fragments of an obfuscated design can still be identified by means of similarity analyses. Other strategies exist to hamper component identification [65, 66, 67]. Further IP protection techniques below the gate-level netlist, such as layout-level obfuscation or gate camouflaging, are out of the scope of this thesis; we refer the interested reader to Shakya et al. [20] and Vijayakumar et al. [16].
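The key-dependent XOR insertion of logic locking described above can be sketched abstractly (an illustrative model only; actual schemes insert the key gates into the netlist and resynthesize the design so that the locking logic cannot be trivially located):

```python
def lock_netlist(outputs, key):
    # XOR-based logic locking in the spirit of Roy et al.: each locked
    # output is XORed with one key bit, so the circuit computes the
    # original function only under the correct key.
    return [o ^ k for o, k in zip(outputs, key)]

def unlock(locked_outputs, key):
    # Applying the same key again cancels the XOR masking.
    return [o ^ k for o, k in zip(locked_outputs, key)]

original = [1, 0, 1, 1]      # outputs of some combinational block (toy values)
key      = [0, 1, 1, 0]      # secret key (hypothetical)
locked   = lock_netlist(original, key)

print(unlock(locked, key) == original)           # True: correct key restores it
print(unlock(locked, [1, 1, 0, 0]) == original)  # False: wrong key corrupts it
```

The security argument rests on the reverse engineer being unable to separate key gates from genuine logic, which is exactly the assumption later chapters scrutinize.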

Part III

Gate-level Netlist Reverse Engineering

Chapter 3
HAL – Gate-level Netlist Reverse Engineering and Manipulation Framework

Motivation. Hardware manipulations pose a serious threat to numerous systems, ranging from a myriad of smart-X devices to military systems. In many attack scenarios an adversary merely has access to the low-level, potentially obfuscated gate-level netlist. Since the attacker possesses minimal information, the adversary (e.g., a malicious foundry) faces the daunting tasks of (1) reverse engineering high-level information from a third-party gate-level netlist, (2) overcoming possible IP protection mechanisms, (3) identifying the security-critical modules, and (4) inserting the actual Trojan into the target design. These challenges have been considered only in passing by the research community. Although reverse engineering is commonplace in practice, the quantification of its complexity is an unsolved problem to date since both technical and human factors have to be accounted for.

Contents of this Chapter

3.1 Introduction ...... 18
3.2 HAL – Design and Implementation ...... 19
3.2.1 System Architecture ...... 19
3.2.2 Building Blocks ...... 20
3.2.3 Implementation ...... 22
3.3 Hardware Trojan Detection Technique ANGEL ...... 23
3.3.1 Boolean Function and Graph Neighborhood Analysis ...... 23
3.3.2 Evaluation ...... 25
3.3.3 Discussion ...... 26
3.4 Hardware Trojan Injection ...... 27
3.4.1 Case Study: Disarm Cryptographic Self-Tests ...... 28
3.4.2 Case Study: Wiretapping Keys in IP Cores ...... 31
3.5 IP Watermarking ...... 33
3.5.1 LUT-based Watermarking ...... 33
3.5.2 Opaque Predicates ...... 34
3.6 Human Factors in Hardware Reverse Engineering ...... 35
3.6.1 Setting: A Learning Perspective ...... 36
3.6.2 Problem Solving and Expertise Research ...... 36


3.6.3 Open Challenges: Quantification of Human Factors ...... 37
3.6.4 Dichotomies for Human Factor Quantification ...... 37
3.6.5 Research Designs, Data Collection, and Challenges ...... 38
3.7 Conclusion ...... 40

Contribution. The research presented in this chapter was joint work with Sebastian Wallat (affiliated with the University of Massachusetts Amherst, USA), who focused on the design and implementation of the Graphical User Interface (GUI), Python shell, HDL writers, and build process parts of HAL, and Max Hoffmann (affiliated with the Ruhr-Universität Bochum, Germany), who focused on the design and implementation of the database, HDL parsers, and core functionality parts of HAL. Section 3.2 and Section 3.4 are part of a publication in the IEEE Transactions on Dependable and Secure Computing [68]. Section 3.6 was joint work with Sebastian Strauss and Malte Elson (both affiliated with the Ruhr-Universität Bochum, Germany) and has been previously published at the IEEE International Verification and Security Workshop [5]. Section 3.5 was also previously published at the IEEE International Verification and Security Workshop [69].

3.1 Introduction

Goals and Contributions. In this chapter, we focus on reverse engineering of high-level information from gate-level netlists. Our goal is to demonstrate actual netlist reverse engineering and manipulation capabilities for constructive and destructive applications under realistic assumptions. To this end, we first address the lack of netlist-level reverse engineering frameworks in the open literature and present HAL, a holistic framework to support and automate custom time-consuming tasks such as Trojan insertion and detection, or assessment of IP violations. HAL’s primary purpose is to facilitate hardware security research, allowing researchers to focus on the innovative aspects of their work and unify collectively-acquired knowledge. Moreover, we present our novel static analysis hardware Trojan detection technique, ANGEL, which outperforms a previously-proposed static analysis detection method, FANCI [52]. In particular, we demonstrate that we are able to automatically find Trojan triggers which were obfuscated with the Trojan obfuscation scheme DeTrust [12], as well as k-XOR-LFSR Trojans [54]. We then present multiple case studies with a focus on how gate-level netlists of cryptographic designs can be surreptitiously weakened. In particular, we demonstrate how security-relevant parts can be semi-automatically reverse engineered, and how custom-tailored malicious circuitry can be injected to invalidate the system security. Furthermore, we review the security of a constraint-based IP watermarking scheme specifically tailored for FPGAs and subsequently we show general improvements to increase the security against reverse engineering. We then survey problem solving research and research on the acquisition of expertise, and briefly summarize what these approaches can provide to quantify the so-far neglected human factors in reverse engineering. Finally, we discuss how interdisciplinary research may be able to quantify the complexity of reverse engineering. 
In summary, our main contributions are:

• Netlist Reverse Engineering Framework. We present the design and implementation of HAL, an interactive framework for virtually any user-defined gate library, including ASIC and FPGA libraries. HAL supports and automates reverse engineering tasks and


enables tailored manipulations with ease. HAL also assists users in testing and structural analyses to make sense of large and complex gate-level netlists. A core feature of HAL is its extendability, and we demonstrate that the development of custom tools for reverse engineering, Trojan detection, and Trojan injection is surprisingly fast and efficient.

• Hardware Trojan Detection. We present our novel hardware Trojan detection technique ANGEL which is based on Boolean function analysis and graph neighborhood analysis. In particular, we demonstrate that graph neighborhood analysis considerably reduces the crucial false-positive rate by several factors and can detect Trojans armed with the obfuscation scheme DeTrust as well as k-XOR-LFSR Trojans.

• Low-level Hardware Trojan Insertion. We detail the many-faceted workflow of semi-automated hardware Trojan insertion with accompanying reverse engineering under realistic assumptions. In several case studies, we demonstrate how meaningful Trojans can be injected into third-party gate-level netlists. Our custom-tailored hardware Trojans semi-automatically invalidate security measures of cryptographic implementations, including bypassing self-tests and leaking crypto keys.

• IP Infringement. We carefully analyze the security of constraint-based watermarking tailored to FPGAs and show how to automatically identify and tamper with watermarks. Additionally, we present novel improvements to mitigate reverse engineering by use of opaque predicates. Moreover, we demonstrate flaws in proposed hardware opaque predicate implementations.

• Novel Reverse Engineering Techniques. Our case studies include novel reverse engi- neering techniques to algorithmically disclose security-relevant parts such as cryptographic self-tests and interfaces of cryptographic implementations. We provide results for nu- merous cryptographic implementations, a variety of FPGA families, and several design optimization goals.

• Human Factors Quantification Overview. To the best of our knowledge, we are the first to propose problem solving research and research on the acquisition of expertise to quantify human factors in hardware reverse engineering. Finally, we discuss how interdisciplinary research with technical and humanistic perspectives may facilitate a sound quantification.

3.2 HAL – Design and Implementation

We now describe HAL's overall architecture, workflow, and implementation. We want to stress that HAL itself is not a single tool but a comprehensive framework that can be used to create tools, a common task during reverse engineering.

3.2.1 System Architecture

HAL was written according to modern software design and architecture standards to achieve easy maintainability, extendability, and high modularity. Therefore, HAL consists of several separated building blocks, each focusing on a logical feature set. In the following we outline the workflow


of HAL before providing more detail on the main building blocks. To guide the reader, we refer to the numbered circles of Figure 3.1, which provides an overview of HAL's workflow.

Figure 3.1: Overview of HAL's architecture and workflow.

HAL – General Workflow. The user invokes HAL with a gate-level netlist (1). HAL uses one of its parsers (2) (e.g., VHDL or Verilog) to transform the input netlist into its internal graph representation (3). After this translation step, user-defined plugins (5) can be invoked via the plugin manager (4) to automatically analyze and possibly manipulate the gate-level netlist. All changes to the graph throughout the plugin operations, including meta data added by plugins or the user (e.g., meaningful names and hierarchy information), are synchronized with a local database (0). To further support the user, the whole workflow is also accessible via an interactive GUI (9) and an interactive Python shell (10). When all requested plugins and tasks have been processed, the graph may be written back (6) to a gate-level netlist for synthesis or simulation (7) in any of the supported HDL languages.
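The plugin-driven part of this workflow can be sketched as follows (class and function names are illustrative only, not HAL's actual API): a parsed netlist graph is handed to the plugin manager, which invokes each registered plugin on it.

```python
# Hypothetical sketch of the plugin workflow; names are illustrative.
class NetlistGraph:
    """Internal graph representation of a parsed gate-level netlist."""
    def __init__(self, gates=None):
        self.gates = gates or {}          # gate name -> gate type

class Plugin:
    """Base class for user-defined analyses loaded by the plugin manager."""
    def run(self, graph):
        raise NotImplementedError

class GateTypeHistogram(Plugin):
    """Toy analysis plugin: count how often each gate type occurs."""
    def run(self, graph):
        hist = {}
        for gate_type in graph.gates.values():
            hist[gate_type] = hist.get(gate_type, 0) + 1
        return hist

def run_plugins(graph, plugins):
    # The plugin manager invokes each registered plugin on the graph.
    return [p.run(graph) for p in plugins]

graph = NetlistGraph({"g0": "AND2", "g1": "AND2", "g2": "DFF"})
print(run_plugins(graph, [GateTypeHistogram()]))  # [{'AND2': 2, 'DFF': 1}]
```

The separation of a stable graph core from dynamically loaded analyses is the design choice that lets new tools be added without recompiling the framework.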

3.2.2 Building Blocks

HDL Parser and Writer. HAL transforms an input netlist (1) (e.g., in VHDL or Verilog) into a directed multi-graph representation of the design. To be more precise, a gate-level netlist includes a series of gates (nodes) and how they are connected via nets (directed edges), where a gate A may have two output ports that both connect to inputs of a gate B (multi-digraph). Typically, a netlist includes a set of atomic gates defined in a gate-level library which specifies their behavior. However, in cases of chip-level reverse engineering the gate-level library is often not known beforehand and is disclosed during the process itself. To support the incorporation of custom netlist libraries and even HDL languages, we developed an extensible parser and writer interface which is independent of the gate library and source language. Currently, we support all Xilinx FPGA gate libraries as well as the ASIC gate libraries of the TrustHub benchmark suite [70]. Adding libraries is straightforward and can be done without recompiling HAL.
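The multi-digraph representation described above can be sketched in a few lines (an illustrative model, not HAL's internal data structure): gates are nodes and each net is a directed edge between specific ports, so two parallel edges between the same pair of gates are allowed.

```python
from collections import defaultdict

class Netlist:
    """Minimal multi-digraph netlist model (illustrative only)."""
    def __init__(self):
        self.gates = {}                  # gate name -> gate type
        self.nets = []                   # (src_gate, src_port, dst_gate, dst_port)
        self.fan_out = defaultdict(list) # gate name -> successor gates

    def add_gate(self, name, gate_type):
        self.gates[name] = gate_type

    def add_net(self, src, src_port, dst, dst_port):
        # Nets are directed edges between ports; duplicates are allowed,
        # which is exactly what makes the graph a multi-digraph.
        self.nets.append((src, src_port, dst, dst_port))
        self.fan_out[src].append(dst)

nl = Netlist()
nl.add_gate("A", "FULL_ADDER")
nl.add_gate("B", "AND2")
nl.add_net("A", "sum", "B", "in0")    # both outputs of A drive B:
nl.add_net("A", "carry", "B", "in1")  # two parallel edges A -> B
print(len(nl.nets), nl.fan_out["A"])  # 2 ['B', 'B']
```

Tracking ports on each edge is what preserves the information a plain digraph would lose, namely which of several parallel connections carries which signal.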


After the analysis and manipulation step, a reverse engineer may be interested in a synthesizable netlist containing the modified graph (7). To provide this functionality, HAL includes multiple HDL file writers (6), including VHDL and Verilog, which transform the modified design back to its netlist representation in the user-preferred supported HDL language. For example, if the input is a VHDL netlist, we can also output a Verilog version, which is handy since several open-source tools such as Verilator do not support VHDL.

Graph Core. The operational heart of HAL is the graph core (3), which allows netlist exploration. Our graph representation is a high-level abstraction of an arbitrary gate-level netlist independent of the underlying gate library or HDL language, analogous to intermediate representations used in software program analysis. Note that such an abstraction step is favorable in practice since automated techniques can be developed once on the high-level representation rather than separately for each gate library. Furthermore, dedicated algorithms from graph theory can be employed, leveraging a large body of existing research. In essence, the graph core provides graph traversal functionalities and methods to edit the graph itself. The graph can be reshaped by adding new gates and nets or removing existing ones. Moreover, the grouping and annotation of gates and nets into subgraphs (modules in HDL) is supported, which is important for the reconstruction of a design's module hierarchy. Furthermore, we implemented a dynamic and generic decorator interface to add user-defined behavior and state to gates and submodules during HAL execution. For example, we developed decorators to provide Binary Decision Diagram (BDD) representations of gates and special memory access functionality for LUT gates. Since we are using C++, static inheritance (instead of decorators) is not a favorable solution to dynamically attach additional functionality.
On top of these fundamental operations, the Application Programming Interface (API) provides the user with access to high-level graph analysis algorithms (e.g., Dijkstra's shortest path, strongly connected components). Since all plugins typically make heavy use of the graph core, it is optimized for speed and failure safety. To enable persistent storage of revealed information and to facilitate joint reverse engineering of a specific design, we integrated a database synchronization engine (0). The database not only stores the graph itself, but also any user-defined meta data such as names and hierarchy information. Since textual representations of flattened gate-level netlists are in the range of several megabytes, we decided to employ a performance-optimized NoSQL database and implemented a custom key format to store arbitrary user data. Our database engine enables collaborative analysis of the same netlist as well as snapshot generation, since the database files can easily be shared among a team of users or saved as a backup.

Plugin System. To separate the development of the core framework from user-defined applications and algorithms, we use a plugin-based system architecture to dynamically include external code. A plugin-based approach is favorable since it is highly extensible and HAL itself does not have to be recompiled when new plugins are implemented. We use a plugin manager (4) which handles registration, loading, usage, and eventual unloading of plugins during run-time. Multiple plugins can be executed in parallel or consecutively invoked to automatically engage each other.

Analysis Report. Using a sophisticated logging system, the user can create rich analysis report files (8) for result logging or as debugging information. The logging system supports multiple channels and severity levels, as well as intuitive output formatting.


GUI. An engineer can already accomplish a lot by using HAL only via its command line interface with the plugin system. However, with textual output alone it can be difficult to make sense of huge designs with potentially billions of gates. To mitigate handling overwhelming amounts of textual information, we included an interactive GUI that visually represents the gates and nets of the processed graph (9). For optimal presentation, we created a unique graph view layout tool from the ground up to meet our high requirements for interactive design exploration. Note that layout planning of large gate-level netlist graphs is a challenging problem, since the vertex and edge arrangement has to be computed quickly and focused on comprehensibility (e.g., to support the mental map for manual reverse engineering). Therefore, we integrated a generic interface to support multiple graph layout algorithms, as different layout techniques focus on different visual representations. For example, we integrated an orthogonal layouter, which arranges a graph in a rectangular 2D grid, as well as a hierarchical layouter that leverages additional information about the distance of nodes to I/O ports.

The GUI is also interactive: the graph can be visually traversed by moving from pin to pin, by clicking on the graphical representation, or by using keyboard shortcuts. Additionally, multiple docks provide rich information about selected components and plugin-generated annotations. For example, a reverse engineering plugin that automatically detects a cryptographic module can adjust the color of identified gates and nets to aid the analyst. We stress that the GUI is of immense value for a human reverse engineer, since making sense of a complex design is notably easier with a visual representation than with textual information alone.

Python Shell. To allow simple execution of arbitrary core functionalities, for example for testing or even batch execution of long-running analysis tasks, we integrated an interactive Python3-based shell console widget (10) into HAL. Our Python shell allows access to the fast C++ HAL API from within the Python interpreter by mapping each core function from C++ to Python. This enhances the static plugin system with interactive code execution to improve usability during semi-automated design exploration.

Since we believe that HAL aids researchers in analyzing their designs, and because of its rich set of features and optimized core, we plan to publicly release HAL to the research community.

3.2.3 Implementation

We implemented HAL in C++14 due to its efficiency and high performance, which are especially critical for processing large hardware designs consisting of hundreds of thousands of gates. To keep HAL maintainable and extensible, we focused on clean and well-documented code, as well as predominant software design principles.

Software Libraries. We employ several components of the Boost library (version 1.58), namely the Boost Graph Library and Boost Filesystem. The Boost Graph Library forms the backbone of our graph core as it already provides a rich set of graph algorithms. To ease functional analysis, we use the BuDDy library (version 2.4) to automatically generate BDD representations for single gates or entire combinational subgraphs. For database serialization, we decided to use Kyoto Cabinet, a collection of fast, server-less, NoSQL database types which operates cross-platform so that database files can be easily exchanged among engineers. The GUI is built on top of the Qt5 (version 5.6) application framework. Qt is also platform independent and integrates well with C++14. The interactive Python shell is built using Python3 (version 3.6) and pybind11 (version 2.1.1) to connect the C++ functions of the HAL API with Python.


To manage the build process and dependencies platform-independently, we employ the cross-platform build management tool CMake (version 3.6), which generates configured build files for GNU Make and other build systems such as Ninja. For the build process itself, we support both GCC (version 6.3) and LLVM (version 4). Supporting multiple compilers also results in more robust code, as the compilers perform different optimizations and provide differing output. Currently, HAL runs on Ubuntu, Arch Linux, and macOS; support for further platforms is prepared, but not fully functional yet.

3.3 Hardware Trojan Detection Technique ANGEL

We now present our novel static analysis technique ANGEL (Analyzing the Neighborhood of Graphs to Expose Leakers), which is based on (1) Boolean function analysis and (2) graph neighborhood analysis.

3.3.1 Boolean Function and Graph Neighborhood Analysis

ANGEL builds on previous state-of-the-art research in static analysis hardware Trojan detection, namely FANCI [52, 71]. Similar to FANCI, we focus on the detection of weakly-affecting inputs through Boolean function analysis, but we additionally consider the neighborhood of combinational gates for each gate to address fundamental limitations of FANCI. To this end, we first sketch the idea of the Boolean function analysis and subsequently present the novel idea of incorporating the graph neighborhood. In addition, we want to emphasize that static analysis is a powerful tool since it does not rely on a golden model or verification tests, which is favorable in practice since fewer potential weak-points for attackers have to be trusted.

Boolean Function Analysis. To estimate the impact of an input signal on a corresponding output of a gate g, FANCI proposed a so-called control value: for an output o, the control vector CV_o collects, for each input i, the Boolean difference standardized by the number of inputs I, i.e., CV_o(i) = BD_o(i) * 2^(-I), see [12]. The Boolean difference BD_o(i) is the total number of input patterns under which flipping input i results in a change of the output o. For example, consider a simple AND gate with 3 input signals (i_0, i_1, i_2) and 1 output signal: the control value is 2/2^3 = 0.25 for each input signal, since there is a difference for exactly two input patterns, i.e., 011 and 111 for i_0, 101 and 111 for i_1, and 110 and 111 for i_2. Since the complexity of ANGEL (and FANCI) grows exponentially with the number of inputs of a graph cut (or of a gate for FANCI), we utilize the approximation technique developed in the original FANCI work: we choose a certain number of input patterns uniformly at random (e.g., 2^10) and compute the Boolean difference only for these values to keep the analysis time practical.

Graph Neighborhood Analysis.
Since a Boolean difference computation on the fan-in combinational logic cone, as used in FANCI [52, 71], results in a high false-positive rate, as demonstrated by Zhang et al. [12], we consider the local neighborhood of each combinational gate. To this end, we determine a d-feasible graph cut [45] for each gate g; in other words, we apply a backward breadth-first search of depth d starting from g. Afterwards, we compute the Boolean difference on the overall neighborhood of each gate. Note that the consideration of local predecessors is favorable since a Trojan trigger is typically implemented with multiple

gates. Hence, an analysis of the Boolean difference of multiple coherent gates will increase the detection probability in case of low controllability and simultaneously decrease the chance that genuine gates are flagged as malicious.

Ignoring Sequential Stage Boundaries. Due to advances in hardware Trojan design research, a Trojan trigger may be spread across multiple sequential stages [12, 54]. To cope with such obfuscation, we simply ignore sequential registers and latches and build the graph cut using the predecessors of their data input ports. In this way, we are able to track the Boolean difference of local combinational gates even if they are (intentionally) placed in different sequential stages. Note that genuine circuits typically do not possess low controllability across sequential stages, hence we expect that the false-positive rate does not increase if we ignore sequential stages. We want to emphasize that a similar idea was noted as FANCIX by Haider et al. [54], but they did not perform an evaluation since they proposed "to monitor the circuits up to multiple sequential stages at a time, while ignoring any FFs in between", which consequently results in a high computational complexity. We build upon this idea and demonstrate that by incorporating the local neighborhood of a predefined depth d, the computational complexity remains practical on commodity hardware while simultaneously providing a low false-positive detection rate.
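The d-feasible cut with transparent sequential elements can be sketched as follows (a toy netlist and one possible depth accounting; the exact bookkeeping of the implementation is not specified here): a backward breadth-first search where stepping through a flip-flop does not consume depth.

```python
from collections import deque

def feasible_cut(predecessors, gate_type, start, depth):
    """Backward breadth-first search of bounded depth from gate `start`.

    Illustrative sketch of the d-feasible graph cut: sequential elements
    (type "FF" here) are traversed transparently, i.e. they do not consume
    depth, so a trigger spread across register stages ends up in one cut.
    """
    cut, frontier = {start}, deque([(start, 0)])
    while frontier:
        gate, d = frontier.popleft()
        for pred in predecessors.get(gate, []):
            # Stepping through a FF is free; a combinational predecessor
            # consumes one unit of depth.
            nd = d if gate_type.get(pred) == "FF" else d + 1
            if nd <= depth and pred not in cut:
                cut.add(pred)
                frontier.append((pred, nd))
    return cut

# Toy netlist: g1 feeds the flip-flop ff1, which feeds the output gate.
predecessors = {"out": ["ff1", "g2"], "ff1": ["g1"], "g1": ["a"], "g2": ["b"]}
gate_type = {"ff1": "FF"}
# g1 lies behind a register stage but is still part of the depth-1 cut.
print(sorted(feasible_cut(predecessors, gate_type, "out", depth=1)))
```

In this example the gate g1 behind the register is reached at effective depth 1, which is exactly the behavior that defeats triggers hidden across sequential stages.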

Algorithm 1 ANGEL
Input: D – design gate-level netlist
Input: d – depth
Input: t – threshold
Output: S – set of suspicious gates
 1: for each gate g ∈ D do
 2:     cut c ← get_feasible_graph_cut(D, g, d)
 3:     truth table T ← determine_truth_table(c)
 4:     for each output o of g do
 5:         control vector CV ← ∅
 6:         for each input i of c do
 7:             CV.push_back(compute_heuristic(c, i, o))
 8:         if check_heuristic(CV) < t then
 9:             S ← S ∪ {g}
10: return S
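The core of Algorithm 1 can be sketched on toy cut functions (illustrative only; cut extraction is omitted, and the minimum control value stands in for the mean-and-median heuristic for brevity):

```python
def min_control_value(f, n):
    """Minimum control value over all n inputs of Boolean function f.

    f maps an n-bit input pattern (an int) to 0/1. The control value of
    input i is BD(i) * 2^(-n), where the Boolean difference BD(i) counts
    patterns under which flipping bit i flips the output.
    """
    cvs = []
    for i in range(n):
        bd = sum(f(x) != f(x ^ (1 << i)) for x in range(2 ** n))
        cvs.append(bd / 2 ** n)
    return min(cvs)

def angel(cut_functions, t):
    # Flag gates whose cut exhibits a weakly-affecting input, i.e. whose
    # minimum control value falls below the threshold t (the heuristic
    # check of lines 4-9, simplified to a minimum).
    return {g for g, (f, n) in cut_functions.items()
            if min_control_value(f, n) < t}

# The 3-input AND example from the text: control value 2 / 2^3 = 0.25.
print(min_control_value(lambda x: int(x == 0b111), 3))  # 0.25

# An 8-input AND-style trigger switches on 1 of 256 patterns, while a
# genuine parity gate has control value 1.0 on every input.
cuts = {
    "trigger": (lambda x: int(x == 0xFF), 8),
    "genuine": (lambda x: bin(x).count("1") & 1, 4),
}
print(angel(cuts, t=0.05))  # {'trigger'}
```

The exhaustive pattern loop here is exactly what the uniform sampling approximation replaces for larger cuts.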

Algorithm 1 formalizes the idea of ANGEL. In line 2, we compute a feasible graph cut of depth d while ignoring any Flip Flops (FFs) or latches in between. Lines 3 to 9 determine the Boolean difference for each cut; to this end, determine_truth_table combines all Boolean functions of the graph cut into a single large truth table. We use the mean-and-median heuristic in line 7, since it performs best [52].

Implementation. We now highlight several steps that accelerate the computation of ANGEL. First, each gate (line 1) can be analyzed in parallel, as can the computation of the Boolean difference for each input signal (line 6); hence, the approach scales with additional computation power. Second, we store the heuristics for each cut so that we can perform the heuristic check (line 8)


for multiple threshold values, since this only involves comparisons of real values (which is faster than recomputing the Boolean difference).

Table 3.1: Evaluation of the identification accuracy of ANGEL compared to FANCI [52] for hardware Trojans equipped with DeTrust (see Section 3.3.2 for details). A ✓ indicates that (parts of) the Trojan were identified; a ✗ indicates that no part of the Trojan was identified.

Design  | Defense       | Threshold to  | Computation | Suspicious gates in % for threshold t
        |               | detect Trojan | time        | t = 0.01 | t = 0.04 | t = 0.075 | t = 0.13
--------|---------------|---------------|-------------|----------|----------|-----------|----------
s15850  | FANCI         | 2^-3          | 2.37 h      | 8.4  ✗   | 15.1 ✗   | 25.1 ✗    | 36.1 ✓
        | ANGEL (d = 2) | 2^-7          | 25.1 s      | 1.6  ✓   | 3.7  ✓   | 6.7  ✓    | 10.3 ✓
        | ANGEL (d = 3) | 2^-∞          | 42.1 s      | 3.4  ✓   | 5.5  ✓   | 8.5  ✓    | 11.5 ✓
s35932  | FANCI         | 2^-3          | 48.1 m      | 0    ✗   | 0    ✗   | 0    ✗    | 0.1  ✓
        | ANGEL (d = 2) | 2^-8          | 1.03 m      | 0.01 ✓   | 0.01 ✓   | 0.01 ✓    | 0.2  ✓
        | ANGEL (d = 3) | 2^-4          | 1.71 m      | 0.01 ✗   | 0.1  ✓   | 0.2  ✓    | 18.4 ✓
s38417  | FANCI         | 2^-3          | 17.4 h      | 8.3  ✗   | 22.3 ✗   | 31.9 ✗    | 45.1 ✓
        | ANGEL (d = 2) | 2^-7          | 1.48 m      | 0.7  ✓   | 3.0  ✓   | 6.1  ✓    | 10.2 ✓
        | ANGEL (d = 3) | 2^-∞          | 2.98 m      | 2.2  ✓   | 5.1  ✓   | 7.6  ✓    | 13.5 ✓
s38584  | FANCI         | 2^-3          | 8.24 h      | 4.3  ✗   | 12.0 ✗   | 27.3 ✗    | 44.0 ✓
        | ANGEL (d = 2) | 2^-6          | 1.48 m      | 0.3  ✗   | 1.1  ✓   | 2.5  ✓    | 5.9  ✓
        | ANGEL (d = 3) | 2^-6          | 1.61 m      | 0.5  ✗   | 1.3  ✓   | 2.9  ✓    | 5.5  ✓

Before we present and discuss the results of our evaluation, we briefly sketch two state-of-the-art Trojan design strategies that aim for increased stealthiness by applying special constructions to the Trojan so that automatic detection algorithms are deceived.

DeTrust Hardware Trojans. Zhang et al. [12] presented DeTrust, a systematic way to design hardware Trojans that could not be detected by FANCI. The general idea is to hide the combinational Trojan trigger in several sequential stages. Zhang et al. reported that a threshold of around 0.1 exposes several Trojan-related gates; however, this threshold suffers from a large number of false-positive gate detections. Note that ignoring sequential stages in our graph cut determination specifically handles these DeTrust-obfuscated Trojans.
Analogous to Salmani [55], we realized the DeTrust obfuscation by inserting additional FFs at the output of each Trojan trigger gate.
k-XOR-LFSR Hardware Trojans. Haider et al. [54] presented the k-XOR-LFSR hardware Trojan design strategy to construct stealthy triggers with implicit malicious behavior. Basically, a Linear Feedback Shift Register (LFSR) consisting of k registers is leveraged to design a counter such that several selective connections of the LFSR state determine the trigger condition for the Trojan. Hence, the adversary is able to design more complex trigger conditions with a higher signal dimension, i.e., number of wires used to trigger the payload. For comparison with Salmani [55], we use the functionally identical 4-XOR-LFSR Trojan that leaks data for a specific LFSR state, see Figure 10 (a) in [55], which was originally described by Haider et al. [54]. Note that we utilize the same LFSR structure and a semantically equivalent payload circuit.
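The k-XOR-LFSR trigger idea can be illustrated with a small executable model. All concrete parameters below (register width, feedback taps, selected state bits, trigger pattern) are hypothetical and chosen merely for illustration; they are not taken from [54].

```python
# Illustrative model of a k-XOR-LFSR-style Trojan trigger: an LFSR steps every
# clock cycle, and the trigger fires only when selected bits of the LFSR state
# match a chosen pattern, yielding a rare, counter-like trigger condition.

def lfsr_step(state, width=8, taps=(7, 5, 4, 3)):
    """Fibonacci LFSR: shift left, feed back the XOR of the tap bits."""
    fb = 0
    for t in taps:
        fb ^= (state >> t) & 1
    return ((state << 1) | fb) & ((1 << width) - 1)

def trigger(state, selected_bits=(0, 2, 5, 7), pattern=(1, 0, 1, 1)):
    """Trojan fires when the selected state bits equal the chosen pattern."""
    return all(((state >> b) & 1) == p for b, p in zip(selected_bits, pattern))

# Count how rarely the trigger condition occurs over one full LFSR period.
state, period, fires = 1, 0, 0
while True:
    state = lfsr_step(state)
    period += 1
    if trigger(state):
        fires += 1
    if state == 1:          # back at the seed: one full period traversed
        break
```

With the (maximal-length) tap set above, the LFSR cycles through all 255 nonzero states, and only the 16 states matching the four selected bits fire the trigger, illustrating why such triggers rarely activate during functional testing.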

3.3.2 Evaluation

To demonstrate the efficiency of ANGEL, we used benchmarks from the popular TrustHub suite [70]. Table 3.1 shows the results of our evaluation for four TrustHub designs, (1) s15850,

25 Chapter 3. HAL- Gate-level Netlist Reverse Engineering and Manipulation Framework

(2) s35932, (3) s38417, and (4) s38584. Design (1) has 1666 combinational gates and 517 sequential gates, (2) has 3717 combinational and 1729 sequential gates, (3) has 3739 combinational and 1602 sequential gates, and (4) has 5202 combinational and 1282 sequential gates prior to obfuscation. Overall, we see that the false-positive rate decreases by several orders of magnitude when we increase the depth, so that the number of suspicious gates is in a range that can be analyzed manually. Our results for FANCI are similar to the evaluation results of Zhang et al. [12] for the threshold 0.13 (false-positive rate around 40%), cf. Fig. 10 (a) in [12]. Note that deviations occur for two reasons: the control value approximation is non-deterministic and we did not have access to the original implementations; furthermore, we skipped the control value computation of combinational subgraphs with more than 120 gates due to the high computational effort, resulting in at most around 4%-6% of the gates not being analyzed. The table compares the identification accuracy of FANCI to the accuracy of ANGEL. It can easily be seen that ANGEL outperforms FANCI in both computation time and accuracy. Additionally, our results for the thresholds t = 0.01 and t = 0.04 show that ANGEL successfully detects the malicious circuitry even for thresholds where FANCI is not able to detect any Trojan. We selected the exemplary threshold value 0.13 for two reasons: First, FANCI requires a threshold of > 0.125 to identify the Trojans, thus we rounded this value up to report the number of false-positive detected gates. Second, this threshold is crucial for Trojans designed with the DeTrust obfuscation [12], which we briefly discuss in Section 3.3.3.
k-XOR-LFSR Hardware Trojans. We also evaluated ANGEL with respect to the 4-XOR-LFSR Trojan (described in the previous section).
Note that the number of suspicious gates is similar to the DeTrust designs for FANCI and for ANGEL with depth d = 3 and the heuristic median; since we only change the relatively small Trojan, we deliberately did not provide an additional table. FANCI was only able to detect the selective connections of the 4-XOR-LFSR Trojan with a threshold of 0.13; however, for this threshold the number of suspicious gates is rather high (similar to the DeTrust Trojans). ANGEL was able to successfully detect the Trojan for depth d = 3 for all threshold values while resulting in a low false-positive rate, e.g., 3.6% for s15850. Note that an additional DeTrust obfuscation of the selective connections would not increase the stealthiness, since ANGEL simply ignores sequential stage boundaries. As demonstrated in Table 3.1, ANGEL significantly reduces the false-positive rate and thus enables automatic detection by static analysis for DeTrust-obfuscated Trojans. This observation also holds for the k-XOR-LFSR Trojan.

3.3.3 Discussion

DeTrust and k-XOR-LFSR Trojans. As noted before, our evaluation results regarding FANCI are consistent with those provided by Zhang et al. [12], cf. Figure 10 (a) in [12]. Note that minor variations exist since the control value approximation is non-deterministic and hence results differ across multiple executions. For higher depths, ANGEL significantly outperforms FANCI. In addition, we want to highlight another static analysis strategy for k-XOR-LFSR Trojans: in Section 3.5 we demonstrate how LFSRs can be automatically extracted from gate-level netlists. Since we implemented this LFSR detection using HAL, any LFSR structure is automatically and reliably exposed in seconds. As HAL supports the development of multiple analyses and manual inspection afterwards, an analyst can examine the reports from several plugins (e.g., both ANGEL and the LFSR detection) to verify whether a Trojan is present.


Threshold. A major concern for both FANCI and ANGEL is finding an appropriate threshold value. Even though it should be a small value, to the best of our knowledge, no automated means to approximate an optimal threshold exists. Our implementation aids the identification of a threshold for a given design since all control values are computed before the heuristic checking.

Comparison to Other Static Schemes. We now discuss similarities and differences between ANGEL and two state-of-the-art static analysis Trojan detection techniques, namely (1) correlation-based clustering [72], and (2) COTD [55]. Similar to ANGEL, both strategies attempt to identify crucial Trojan trigger logic by computing a form of controllability, since Trojan triggers are typically associated with gates that possess low controllability. However, both techniques are orthogonal to ANGEL. For correlation-based clustering, simulation data of tests for manufacturing faults are analyzed and a correlation-based similarity weight is determined for input/output gate values. Afterwards, a density-based clustering algorithm is used to flag outliers, i.e., potential Trojan trigger gates. As pointed out by Salmani [55], the accuracy depends on observing sufficient signal activity. For COTD, a controllability and observability value is determined by SCOAP and afterwards an unsupervised clustering analysis splits the signals into malicious and genuine lists. The complexities of the different strategies are: ANGEL/FANCI: O(g · m · 2^m), and COTD: O(n), for g = number of gates, m = number of (sub)circuit inputs, and n = number of wires [55]. Note that the dynamic scheme HaTCh possesses a complexity of O((2n^2)^d), with d = signal trigger dimension. The runtime of COTD is considerably faster than that of ANGEL; however, ANGEL's runtime is not impractical. Even though ANGEL possesses the exponential factor 2^m, we leverage the approximation of the control value computation, thus this exponential factor is bounded (e.g., by 2^10) and the overall complexity is bounded by O(g · m). Lastly, we want to emphasize that with the assistance of HAL, we were able to implement a high-performance, gate-level library agnostic version of ANGEL with merely several hundred lines of C++ code.
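The bounded control-value computation can be sketched as follows. This is an illustrative model, not the HAL/C++ implementation: `control_values`, the exhaustive/sampling cutoff, and the example gate are all hypothetical choices for exposition.

```python
# Sketch of the approximated control-value computation that bounds the
# exponential factor 2^m: instead of enumerating all 2^m assignments of a
# gate's fan-in cone, at most 2^10 random assignments are sampled and each
# input's Boolean difference is estimated from that sample.
import itertools
import random

MAX_EXHAUSTIVE = 10  # beyond 2^10 assignments, fall back to random sampling

def control_values(func, m, rng=random.Random(0)):
    """Estimate, for each of the m inputs, the fraction of assignments where
    flipping that input flips func's output (its control value)."""
    if m <= MAX_EXHAUSTIVE:
        assignments = list(itertools.product((0, 1), repeat=m))
    else:
        assignments = [tuple(rng.randint(0, 1) for _ in range(m))
                       for _ in range(1 << MAX_EXHAUSTIVE)]
    flips = [0] * m
    for a in assignments:
        out = func(a)
        for i in range(m):
            flipped = a[:i] + (1 - a[i],) + a[i + 1:]
            if func(flipped) != out:   # Boolean difference w.r.t. input i
                flips[i] += 1
    return [f / len(assignments) for f in flips]

# Example: a 4-input AND gate, whose inputs all have a low control value
# (each input only matters when the other three inputs are logical 1).
and4 = lambda a: int(all(a))
cv = control_values(and4, 4)
```

Here every input of the AND gate controls the output for only 2 of the 16 assignments, i.e., a control value of 0.125, which is the kind of low value that flags potential trigger logic.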

3.4 Hardware Trojan Injection

We now present two destructive case studies to demonstrate HAL's applications for gate-level Trojan injection into benign third-party gate-level netlists. In particular, we introduce generic semi-automatic reverse engineering and manipulation techniques, implemented with the assistance of HAL, to inject hardware Trojans into gate-level netlists of cryptographic designs. More precisely, we show how to (1) trick and disarm cryptographic power-up self-tests (Section 3.4.1), and (2) subtly wiretap and leak cryptographic keys via unused Input/Output (I/O) pins (Section 3.4.2).
Cryptographic Designs. To realize security properties such as confidentiality or integrity, it is of crucial importance that the deployed cryptographic module is not compromised. Given that most cryptographic primitives in use, such as the Suite B ciphers [73], are robust against traditional attacks, i.e., brute-force and cryptanalytic attacks, adversaries are often forced to exploit implementation attacks to undermine the security of systems and applications. The most prominent implementation attacks are Side-Channel Analysis (SCA) and Fault Injection (FI), which have been investigated in great detail in the context of hardware security by the scientific and industrial communities, see [74, 75] for comprehensive overviews. Even though SCA and FI countermeasures do not solve all problems, there is a sound understanding of attacks and countermeasures.


We focus on hardware Trojans that weaken security by manipulating the underlying hardware. To evaluate the severe consequences of low-level hardware manipulations at a larger scale, we obtained numerous publicly available, third-party Advanced Encryption Standard (AES) implementations from OpenCores and an NSA website to achieve variability. Each AES IP core provides an interface to set a key and encrypt or decrypt user data. To communicate with the environment, we extended each design with a Universal Asynchronous Receiver Transmitter (UART)/RS-232 interface that can be used as part of an FPGA implementation. To be as close as possible to a practical scenario, we furthermore augmented each AES IP core with a self-test1, see Section 3.4.1. Note that we made no changes to the underlying AES designs, but merely integrated the hardware components necessary to form a full-fledged IP core.
SRAM-based FPGAs. In our case studies, we focus on SRAM-based FPGAs. As noted in Section 2.4.2, the majority of currently deployed SRAM-based FPGAs from Xilinx and other vendors can be affected by post-deployment hardware Trojan injections since the bitstream is either not protected or the cryptographic protection can be circumvented by means of SCA attacks. Thus, an adversary can read out and transform the proprietary bitstream file format into a readable gate-level netlist, perform reverse engineering and manipulation of security-critical design parts, and re-generate and deploy the bitstream.
Notation. We use the following notation: p - plaintext (16 bytes), k - key (16 bytes), c = AESk(p) - ciphertext (16 bytes), (pref, cref) - plaintext/ciphertext pair for the self-test, kst - key for the self-test, ku - key for user data.

3.4.1 Case Study: Disarm Cryptographic Self-Tests

Several works have demonstrated that targeted manipulations of third-party cryptographic hardware implementations have serious consequences, ranging from key leakage to surreptitiously weakened ciphers, even for high-security real-world devices, see [77, 78, 79, 80]. Such manipulation attempts are usually detected by the mandatory self-tests in cryptographic IP cores [76]. To utilize these powerful attacks against realistic IP cores, an adversary must therefore disarm the self-test prior to the manipulation. In this case study, we demonstrate for the first time how the self-test circuitry can be both algorithmically reverse engineered and manipulated in a way that allows the aforementioned attacks to be performed.
Detailed System Model. We assume the following generic workflow for the cryptographic IP core:

1. Upon initialization, the IP core conducts a power-up self-test with an internally stored self-test key kst and a reference plaintext pref.

2. The IP core checks whether the computed ciphertext equals the internally stored reference ciphertext, i.e., whether cref = AESkst(pref).

3. If the self-test is successful, a user key ku is passed to the AES core and used to encrypt or decrypt user data. Otherwise, the cryptographic module enters an error state. Note that we do not make assumptions about how the IP core is integrated into a larger system or whether the user key ku is stored internally or supplied from the external environment. 1“A cryptographic module shall perform power-up self-tests and conditional self-tests to ensure that the module is functioning properly”, see FIPS PUB 140-2 [76].
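The three-step workflow can be captured in a minimal executable model. This is a deliberately simplified sketch: the toy XOR "cipher" stands in for AES purely to keep the example self-contained, and all constants (K_ST, P_REF) are hypothetical.

```python
# Minimal model of the generic power-up self-test workflow (steps 1-3).

def toy_encrypt(key, plaintext):
    """Toy stand-in for AES_k(p): byte-wise XOR (illustration only)."""
    return bytes(k ^ p for k, p in zip(key, plaintext))

K_ST  = bytes(range(16))          # internally stored self-test key k_st
P_REF = bytes([0xAA] * 16)        # internally stored reference plaintext p_ref
C_REF = toy_encrypt(K_ST, P_REF)  # precomputed reference ciphertext c_ref

class CryptoCore:
    def __init__(self):
        self.state = "INIT"

    def power_up(self):
        # Steps 1+2: run the cipher on the reference data and compare the
        # result against the stored reference ciphertext.
        self.state = "OPERATIONAL" if toy_encrypt(K_ST, P_REF) == C_REF \
                     else "ERROR"

    def encrypt(self, k_u, p):
        # Step 3: the user-key datapath is released only after a passing test.
        if self.state != "OPERATIONAL":
            raise RuntimeError("self-test failed: core is in error state")
        return toy_encrypt(k_u, p)

core = CryptoCore()
core.power_up()
ct = core.encrypt(bytes([1] * 16), bytes([2] * 16))
```

The adversary's problem, addressed next, is that any manipulation of the cipher implementation changes the self-test result and drives such a core into its error state.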


Table 3.2: Evaluation results of self-test reverse engineering across seven Xilinx devices (Spartan-3E xc3s1600e, Spartan-3 xc3s1000, Spartan-6 xc6slx16, Virtex-4 xc4vlx25, Virtex-5 xc5vlx50, Virtex-6 xc6vlx75, and 7-series xc7k70t), three synthesis options each (area, balanced, speed), and the thirteen AES designs (1-13). Full circle: successfully reverse engineered; half circle: reverse engineering required minor manual netlist inspection; n.a.: the given AES design could not be implemented for the device; blank: reverse engineering did not yield a result.

Adversary’s Goal. The high-level goal of the adversary is to perform a targeted manipulation of the cryptographic computation to reveal the employed key ku or to weaken the cipher (e.g., by an S-box substitution [77, 78, 79, 80]). To perform a successful manipulation, the adversary is required to disarm the self-test circuit and trick it in a way that it always reports success regardless of the manipulated cryptographic implementation.

Algorithmic Reverse Engineering of Self-Test Circuits

To disarm and manipulate the self-test, the adversary first has to reverse engineer which gates and signals implement this functionality. To this end, we first detail how cryptographic self-tests are usually implemented and second present our novel technique to automatically identify such structures.
Cryptographic Power-Up Self-Tests. A self-test of a cryptographic module is usually realized as several additional states in the design's FSM that run the cryptographic algorithm on a priori computed, internally stored reference data. If the reference value and the dynamically computed value are not equal, the design transitions into an error state (and does not perform any further operation), cf. [73].


Hereinafter, we describe the automated reverse engineering strategies for two distinct device family series from Xilinx. In particular, we highlight how the different FPGA architectures (designed for 4-input LUTs or 6-input LUTs) affect our reverse engineering strategy. We want to emphasize that our search mainly focuses on the crucial 128-bit comparator which checks the equivalence of cref and the dynamically computed AESkst(pref).
Reverse Engineering. The general idea of our automated reverse engineering technique is to search for the comparator circuit that computes the equivalence of cref and AESkst(pref). Even though comparators can be realized by numerous FPGA gates (e.g., LUTs, multiplexers, AND, or carry gates), we developed a generic approach shown in Algorithm 2.

Algorithm 2 Self-Test Circuit Reverse Engineering
Input: D - design gate-level netlist
Output: S - set of self-test circuits
1: set S ← ∅
2: list L ← ∅
3: for gate g ∈ D do
4:     if check_hamming_weight(g) = true then
5:         L.append(g)
6: set of comparators C ← merge(L)
7: return S ← merge_comparators(C)

In lines 3-5, the function check_hamming_weight analyzes whether the gate's Boolean function over all active input pins (neither static GND nor VCC) returns a distinctive output bit (e.g., logical 1 or logical 0) for exactly one input assignment and the complementary bit for all other active input pin assignments. We then analyze the neighborhood of each candidate gate and merge connected candidates (line 6). To this end, we first check whether candidate gates are direct successors or predecessors of each other, as well as whether the outputs of several candidate gates merge in further target gates. We repeat this step until no further candidate or target gate can be added to the comparator. We then merge the identified comparators to determine whether they form a multi-comparator (line 7), i.e., one that checks more than one value. Note that such a multi-comparator is not identified in line 6, since the gate that combines two comparators usually implements a Boolean OR. Finally, we output all identified (multi-)comparators together with their respective bit-widths, which are derived by counting the number of distinct inputs. With the assistance of HAL, we developed a plugin that yields a list of comparators and their corresponding widths. Prior to the large-scale evaluation of the comparator detection algorithm, we briefly describe how such identified comparators can be manipulated.
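The candidate test of Algorithm 2 can be sketched as follows. This is a minimal illustrative model, not HAL's C++ plugin; the example gates are hypothetical.

```python
# Sketch of the check_hamming_weight candidate test: a gate qualifies if, over
# all assignments of its active input pins, its Boolean function produces one
# output value for exactly one assignment and the complementary value for all
# others -- the behavior of an equality check against a constant.
import itertools

def check_hamming_weight(func, n_active_pins):
    """True if func outputs a distinctive bit for exactly one assignment."""
    outputs = [func(a) for a in itertools.product((0, 1), repeat=n_active_pins)]
    ones = sum(outputs)
    # exactly one 1 (or exactly one 0) among all 2^n outputs
    return ones == 1 or ones == len(outputs) - 1

# A 4-input LUT that compares its inputs against the constant 0b1010 is a
# comparator slice; a plain 4-input XOR gate is not.
eq_1010 = lambda a: int(a == (1, 0, 1, 0))
xor4    = lambda a: a[0] ^ a[1] ^ a[2] ^ a[3]
```

The subsequent merging step (line 6) then groups such candidate gates by connectivity into wide comparators.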

Manipulation of a Self-Test

We present two manipulation strategies to bypass a self-test: (1) always bypassed, and (2) conditionally bypassed.
Always Bypassed. To bypass the self-test for any input, we manipulate the final gate of the comparator, which decides whether an input value matches the expected one. For example, if the final gate is a LUT, we change its configuration to always output a logical 1 (or 0, depending on what the comparator interprets as true). Otherwise, we simply alter each identified LUT in the comparator appropriately so that it always (erroneously) outputs true.
Conditionally Bypassed. If the comparator of the self-test is utilized by other circuitry, we can simply add a trigger condition to the design that checks whether the comparator is currently used for the self-test or not. Typically, this requires some additional reverse engineering of the design's control logic, see [39, 41].
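The always-bypassed LUT rewrite can be sketched as follows, assuming the comparator interprets logical 1 as "match". The init values and the LUT model are hypothetical illustrations, not vendor-specific formats.

```python
# Sketch of the "always bypassed" manipulation: the final LUT of the
# identified comparator is reconfigured so that every entry of its init
# (truth-table) value outputs logical 1, i.e., the self-test always "passes".

def always_true_init(k):
    """Init value of a k-input LUT that outputs 1 for every assignment."""
    return (1 << (1 << k)) - 1

def lut_eval(init, inputs):
    """Evaluate a LUT: the input bits form the address of one init bit."""
    addr = 0
    for i, bit in enumerate(inputs):
        addr |= bit << i
    return (init >> addr) & 1

# Hypothetical 6-input LUT acting as an equality-to-zero (NOR-like) compare:
# only init bit 0 is set, so it outputs 1 iff all inputs are 0.
original_init = 0x0000000000000001
trojaned_init = always_true_init(6)   # 2^64 - 1: constant logical 1
```

After this rewrite the comparator output is constant, so the downstream FSM always takes its "self-test passed" transition.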

Evaluation

For the large-scale evaluation of our proposed self-test detection, we target a range of Xilinx FPGA families and several synthesis options for each of the AES designs. Note that we employed the ATHENa framework [81] to automate the process of gathering diverse variants of each design. The results of our large-scale analysis are depicted in Table 3.2. Note that we targeted 128-bit comparators due to the state width of AES. The vast majority of the comparators were detected across the diverse FPGA families, synthesis options, and designs. In the cases marked as requiring minor manual inspection, the recovered comparator width deviates more distinctly from 128, or the comparator was only found in two distinct parts. This is caused by LUTs in the comparator that implement additional logic, which complicates the algorithmic reverse engineering; however, these cases can be easily resolved with manual analysis. In the small number of cases where the comparator reverse engineering was not successful, the self-test comparator was split up into more than two single comparators or the recovered bit-width exceeded 128.
Practical Verification. To confirm that we manipulated the self-test circuit correctly, we performed an S-box substitution attack [78] and verified the erroneous AES computation for the xc6slx16 on a sample basis. Note that we used the always-bypassed manipulation. For the other FPGA families, we simulated the overall design's behavior and verified the successful manipulation of the different AES IP cores on a sample basis.

Discussion

We acknowledge that the self-test used for the evaluation was implemented by us; however, to the best of our knowledge, there is no openly available FIPS-certified AES implementation that includes the crucial self-test. Our approach is easy to adapt to different implementation structures, and with HAL the identification process can be automated with little effort.

3.4.2 Case Study: Wiretapping Keys in IP Cores

In contrast to the prior case study, we now demonstrate that we are able to semi-automatically insert hardware Trojan circuits into an existing IP core to wiretap and leak the utilized AES keys.
Detailed System Model. For this case study, we assume the same system model as described in Section 3.4.1.
Adversary’s Goal. The primary goal is to insert a Trojan that leaks sensitive data while the AES core is actively used. To this end, the adversary attempts to wiretap the entire state after the SubBytes transformation in the first round. Subsequently, the adversary aims to leak the sensitive data to the external environment to enable key recovery.


Figure 3.2: Xilinx sp601 development board used for experimentation. A wiretap Trojan was inserted into an existing low-level netlist. Sensitive data is leaked via an additional Trojan UART connected to a ttl2rs232 USB module.

Algorithmic Key Detection

Since the adversary intends to leak the state after the SubBytes transformation in the first round, we used HAL to perform the S-Box detection techniques described by Swierczynski et al. [78]. The algorithm outputs a list of S-Box instances and the bit order of the S-Box outputs. Note that the algorithm does not reveal which state byte belongs to which S-Box instance, so the correct permutation is unknown.


Key Recovery

In the following, we explain the key recovery for our AES core under attack. The core is a round-based implementation and therefore utilizes sixteen S-Box instances in the SubBytes step and four S-Box instances to implement the AES key schedule. Note that we confirmed this architecture by means of the number of identified S-Box instances. Furthermore, it is noteworthy that our procedure is not limited to this AES core, but is applicable to other designs with minor modifications. To gain knowledge of the correct byte permutation of the leaked 16-byte state, we analyzed by means of simulation which S-Box sub-circuit processes which byte of the state. Additionally, we observed the AES state over time. With this information we know which data buses are the relevant ones to wiretap and at what point in time we need to wiretap them. Once the wiretapping is done, the retrieved data must be forwarded to the external environment. To this end, we inserted a UART/RS-232 core and merged it with the given low-level netlist. The UART transmitter was connected to an unused I/O pin. To retrieve the wiretapping result, we attached an additional ttl2rs232 USB receiver to this pin, cf. Figure 3.2. Once the wiretapped state is obtained, the adversary can easily compute the employed AES key by applying the inverse S-Box operation followed by an XOR with the plaintext.
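The final key-recovery step can be sketched as a functional model (this is not the HAL plugin; the plaintext and example key below are hypothetical). Given the wiretapped state after SubBytes of round one and the known plaintext, the round-one key follows as k = InvSubBytes(leaked_state) XOR p.

```python
# Key recovery from a wiretapped post-SubBytes round-one state.

def _gf_mul(a, b):  # multiplication in GF(2^8) with the AES polynomial 0x11B
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def _build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = [0] * 256
    for x in range(256):
        inv = next(y for y in range(256) if _gf_mul(x, y) == 1) if x else 0
        b = inv
        for shift in (1, 2, 3, 4):  # b ^= rotl(inv, 1..4)
            b ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox[x] = b ^ 0x63
    return sbox

SBOX = _build_sbox()
INV_SBOX = [0] * 256
for i, v in enumerate(SBOX):
    INV_SBOX[v] = i

plaintext = bytes(range(16))                                    # known to adversary
key       = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")   # example key
leaked    = bytes(SBOX[p ^ k] for p, k in zip(plaintext, key))  # wiretap point
recovered = bytes(INV_SBOX[s] ^ p for s, p in zip(leaked, plaintext))
```

Since AddRoundKey in round one uses the original cipher key, recovering this round key directly yields the full AES-128 key.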

Discussion

With HAL we are able to add any plain high-level IP core to an existing low-level design. Using the FPGA vendor's toolchain, a new bitstream can be generated to verify the functionality of the modified design. Although we only analyzed one specific core in this case study, our results are transferable to different designs since our technique is generic. A more elegant technique for leaking the data would include the design and use of an antenna circuit in the FPGA design [82]. As long as the user does not probe each I/O pin or the near-field, both leakage approaches remain relatively stealthy.

3.5 IP Watermarking

In this case study, we present how IP infringement can be performed for watermark-protected designs. To demonstrate the efficacy of gate-level netlist reverse engineering, we analyze a scheme which aims to protect valuable FPGA IP cores at the netlist level.

3.5.1 LUT-based Watermarking

Various works have addressed IP protection by the use of watermarks; constraint-based watermarking [83] in particular is suited for FPGAs since the additional satisfiability constraints map well onto the LUT-based FPGA structures [84].
Watermarking Scheme [84]. The high-level idea of the scheme by Schmid et al. is to exploit non-addressable LUT memory space, see Figure 3.3. Since input pins I4 and I5 are connected to GND, there are 48 bits that can be arbitrarily changed without altering any functionality of the design. Thus, these LUT memory bits can be used to embed the watermark.

Figure 3.3: Example of a LUT-6 with input pins I0 to I5 and output pin O. Pins I4 and I5 are connected to GND. The 48 bits marked with X can be used for watermarks.

Automated Reverse Engineering. In order to identify the LUT memory bits which implement the watermark, we analyze the Boolean function of each LUT. More precisely, if there exists any input assignment for which a GND or VCC input signal would have to be logical 1 or 0, respectively, we have identified a watermark bit. We practically verified that we are able to automatically disclose the watermark of our target design (an AES IP core synthesized for an xc6slx16 with Xilinx ISE 14.7). Furthermore, we are able to automatically remove and alter the watermark in the gate-level netlist. Note that Schmid et al. [84] proposed to utilize LUTs configured as shift register LUTs to prevent optimization, but since the shift-enable input pin is assigned to GND (to ensure the design's functionality), we treat them as general LUTs in our reverse engineering algorithm.
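The watermark-bit disclosure can be sketched as follows. This is an illustrative model under the scheme's assumptions (a LUT-6 with pins I4 and I5 tied to GND); the function names are hypothetical.

```python
# Sketch of watermark-bit identification: for a LUT-6 whose pins I4 and I5
# are tied to GND, only init entries whose address has bits 4 and 5 cleared
# are ever addressed at runtime; the remaining 48 entries are free storage
# that the scheme of Schmid et al. uses to embed the watermark.

def watermark_bit_positions(lut_size, gnd_pins):
    """Indices of init bits that can never be addressed (watermark space)."""
    free = []
    for addr in range(1 << lut_size):
        if any((addr >> pin) & 1 for pin in gnd_pins):
            free.append(addr)
    return free

def extract_watermark(init, positions):
    """Read the watermark bits out of a 64-bit LUT init value."""
    return [(init >> p) & 1 for p in positions]

positions = watermark_bit_positions(6, gnd_pins=(4, 5))
```

For this configuration the addressable entries are exactly addresses 0-15, so the 48 entries at addresses 16-63 carry the watermark; zeroing or rewriting those init bits removes or alters it without changing the design's functionality.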

3.5.2 Opaque Predicates

Even though Schmid et al. [84] noted that the security can be increased by the use of bogus constant-generating signals (instead of the plain connection to GND), it remains unclear how such signals can be implemented so that they withstand a reasonable reverse engineer. To this end, we propose to leverage opaque predicates, which have mainly been used in the context of software watermarking and obfuscation [85]. An opaque predicate is an expression that evaluates to either always true or always false regardless of the input; thus, this function implements a constant-generating signal. Hence, instead of GND, the unused LUT input pins (e.g., I4 and I5 in Figure 3.3) are connected to the output of an opaque predicate, which defeats our reverse engineering approach presented in the previous section. Despite numerous works that address opaque predicates for software, there is, to the best of our knowledge, only one work by Sergeichik et al. [86] applying them in the hardware reverse engineering context.
Implementation [86]. To implement opaque predicates, Sergeichik et al. suggest exploiting LFSRs as constant signal generators. Their general idea is to connect all registers of the LFSR to an OR or NOR gate to generate a constant high or low signal, respectively, see Figure 3.4. Note that a typical LFSR with XOR feedback never enters the zero state (where all FFs hold a logical '0'); thus, there is always at least one FF which holds a logical '1'.
Automated Reverse Engineering. Even though the reverse engineering strategy from the previous section is generally prevented by opaque predicates, the proposed opaque predicate instantiation by LFSRs is not sufficient to mitigate reverse engineering, and thus IP infringement is again possible. The high-level idea of our automated reverse engineering strategy is to identify LFSRs and subsequently determine whether they are used to implement constant generators.
Figure 3.4: Example of a LUT-6 with input pins I0 to I5 and output pin O. Pins I4 and I5 are connected to a NOR gate which constantly generates a logical '0' for every LFSR state (except the all-zero state).

For the detection of LFSRs, we exploit their typical characteristic: a chain of FF elements that stores and shifts the current state. Note that we skip any pass-through LUTs or buffers in each step. To find the FF chains, we search for the initial FF (which stores the next state bit) by checking whether its preceding gate is a FF. For any candidate, we execute a modified Depth-First Search (DFS) and search for cycles, considering the taps defined by the underlying feedback polynomial of the LFSR. Since we can identify the position of the initial FF and the taps, we are able to algorithmically recover the underlying feedback polynomial as well. After the LFSRs are automatically reverse engineered, we topologically analyze each LFSR for the constant generator part, i.e., an OR or NOR gate that is connected to all FFs of the LFSR. Note that the OR and NOR gates might be implemented across multiple LUTs depending on the type and size of the LFSR. To identify the final constant signal generation gate, we manually inspected our target design (an AES IP core synthesized for an xc6slx16 with Xilinx ISE 14.7), which typically requires only several minutes but can be easily automated. In case other types of LFSRs, such as XNOR-based LFSRs [86], are used, the search for the final constant generation gate has to be adapted. In summary, LUT-based watermarking is a promising approach to protect valuable FPGA IP, and particularly in combination with advanced hardware obfuscation techniques it might provide an adequate security level to hamper reverse engineering.
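The chain-following step of the LFSR detection can be sketched on a toy netlist. This is a hedged illustration, not the HAL plugin: the netlist representation (gate name to predecessor list) and all gate names are hypothetical.

```python
# Sketch of the LFSR detection idea: follow flip-flop (FF) predecessors from a
# start FF; a single combinational level (the XOR feedback or a pass-through
# buffer) may be skipped, mirroring the "skip pass-through LUTs or buffers"
# rule. A chain that closes back on the start FF is an LFSR candidate.

def trace_chain(netlist, ff_gates, start):
    """netlist maps gate -> list of predecessor gates; returns (chain, closed)."""
    chain, gate = [start], start
    while True:
        preds = netlist.get(gate, [])
        ff_preds = [p for p in preds if p in ff_gates]
        if not ff_preds:  # skip one combinational level (feedback XOR, buffer)
            ff_preds = [q for p in preds
                        for q in netlist.get(p, []) if q in ff_gates]
        if not ff_preds:
            return chain, False          # chain runs into primary inputs
        gate = ff_preds[0]
        if gate == start:
            return chain, True           # cycle closed: LFSR candidate
        if gate in chain:
            return chain, False
        chain.append(gate)

# Toy 4-bit LFSR: ff3 <- ff2 <- ff1 <- ff0 <- xor(ff2, ff3), plus one
# unrelated FF "ffa" fed from a primary input.
NETLIST = {"ff0": ["xor"], "ff1": ["ff0"], "ff2": ["ff1"],
           "ff3": ["ff2"], "xor": ["ff2", "ff3"], "ffa": ["pin"]}
FFS = {"ff0", "ff1", "ff2", "ff3", "ffa"}
```

In a full analysis, each closed chain would then be checked for an OR/NOR gate fed by all of its FFs, which exposes the constant-generator part of the opaque predicate.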

3.6 Human Factors in Hardware Reverse Engineering

We now present problem solving and expertise research in the context of hardware reverse engineering, in particular for the quantification of gate-level netlist reverse engineering. We first highlight why this reverse engineering task is a problem solving process (Section 3.6.1). Then we provide a general introduction to problem solving research and research on the acquisition of expertise (Section 3.6.2). Finally, we propose how the human factor can be quantified using the aforementioned psychological research fields (Section 3.6.3). We want to emphasize that this section does not provide concrete results; however, it lays the foundation for future research in this direction.


3.6.1 Setting: A Learning Perspective

As stated in Section 2.3, we assume a reverse engineer with access to a gate-level netlist and the goal to understand parts of the design’s inner workings. To this end, the analyst chooses actions which reduce the difference between the initial state (no high-level information) and the goal state (design’s inner workings successfully extracted). During this process the person draws on prior knowledge (e.g., knowledge from past instances of gate-level netlist reverse engineering or textbooks). Thus, knowledge generated during reverse engineering can be utilized in future attempts. Perspectives on the Human Factor. This setting points out two separate but intertwined mechanisms: (1) gate-level netlist reverse engineering can be viewed as a problem solving process, and (2) reverse engineers can acquire new knowledge or skills and store them in long-term memory [87]. In particular, reverse engineers gain expertise by performing reverse engineering repeatedly in different contexts, i.e. formal (e.g., school, university) and/or informal (e.g., learning or training on-the-job, self-study, exchange with peers) educational settings [88]. Both mechanisms, problem solving and acquisition of expertise, also describe the arms race of reverse engineering and obfuscation, since reverse engineers are able to break obfuscation strategies and use their gained experience for future reverse engineering attempts. Designers then have to implement a new obfuscation, which presents a novel problem to reverse engineers.

3.6.2 Problem Solving and Expertise Research

Based on the learning perspective of the previous section, we survey problem solving and expertise research for a general audience. While problem solving research focuses on problem properties and the adaptation of strategies to overcome obstacles, expertise research conceptualizes the development of knowledge required for successful reverse engineering, changes in problem solving strategies with accumulating experience, mental representations of problems, and the increasing automation of complex and initially effortful behaviors (experts vs. novices) [89, 90].
Problem Solving Research. Problem solving can be defined as a sequence of directed cognitive operations that are employed in a situation where the individual does not possess a suitable routine operation that allows a transition from a given initial state to the desired goal state; such a situation is termed a problem [91, 89]. During problem solving, knowledge is manipulated in order to attain the desired goal state. Due to differences in prior knowledge and problem solving skills, a situation might pose a problem to one person, but not to another. Further, as soon as a person has solved a problem and is able to fully reproduce the solution schema, the situation loses its problem character and simply represents a task to this individual [92]. Thus, learning from problem solving as an ongoing process should be taken into consideration as well.
Expertise Research. Ongoing experience within one field, combined with deliberate practice [93], results in the acquisition of expertise. Deliberate practice refers to a specific practice or training activity in which a person willingly and repeatedly produces an action (often under supervision). The trainee receives feedback on the quality of the production of the action, with the ultimate goal of improving performance [93, 94]. After an extensive amount of practice, a person is capable of repeatedly exhibiting superior performance with minimal variation.
Note that this separates the acquisition of expertise from the acquisition of (everyday) skills, which eventually reaches an autonomous stage where performance no longer improves [95]. In contrast to novices, experts perform superiorly due to their improved working memory, cf. [87], which allows them to process large amounts of information at a time. For example, chess masters are able to quickly perceive and evaluate complex configurations and choose promising options for further moves [96]. Due to repeated problem solving, experts also possess problem schemas, which allow them to identify the deep structure of a problem, retrieve multiple solutions, and select the best one [97, 98].

3.6.3 Open Challenges: Quantification of Human Factors

Quantification of human factors in reverse engineering is a major open challenge in hardware security research. For a quantification based on problem solving and expertise research, several aspects can be investigated, such as the reverse engineering process itself or the design of the human experiment. In the following, we provide novel interdisciplinary perspectives that systematically capture the different aspects of human factor quantification for reverse engineering. First, we propose two dichotomies which can guide the quantification of reverse engineering, namely (1) process vs. result, and (2) human vs. task (Section 3.6.4). Second, we discuss possible research designs and methods of data collection to investigate the human factor (Section 3.6.5).

3.6.4 Dichotomies for Human Factor Quantification

We present a systematic overview of quantifiable aspects arranged in two dichotomies. Note that the two dichotomies are not mutually exclusive, but rather represent different perspectives on gate-level netlist reverse engineering.

Dichotomy: Process vs. Results Quantification

For the quantification of human factors using problem solving and expertise research, it appears reasonable to distinguish between quantification of the process itself and quantification of its results.

Process Quantification Dimension. The primary scope of the process quantification dimension is not whether analysts are able to reverse engineer a netlist, but how they solve the problems they encounter. Usually, analysts identify high-level steps and define a set of main goals to complete. These steps represent meaningful units that guide analysts and pose specific challenges to them. In order to complete these steps, analysts might need to employ a number of different strategies. The process dimension also focuses on learning gains and the time required to complete a given task, i.e. how fast an analyst can learn to master a task and how long it takes to accomplish a task of a given complexity. These processes change over time as individuals repeatedly encounter problems of a similar topography and their actions become automated [99]. With regard to expertise and its acquisition, the characterization of expertise-specific problem solving strategies and problem representations is expedient for quantification, since experts have different ways of perceiving problems and employ qualitatively different problem solving strategies due to their superior knowledge organization compared to novices (this has been shown in domains such as chess [100], physics [98], and symbolic drawings [101, 102]).


Result Quantification Dimension. The primary scope of the result quantification dimension is to investigate whether analysts were successful in reverse engineering and what they have learned during problem solving, i.e. what new knowledge or skills they acquired. Analysts acquire new (domain-specific) problem solving strategies or reach an improved proficiency in utilizing already learned strategies. Considering the acquisition of expertise, it is also important to assess whether analysts can reproduce their solution on similar problems. This asserts whether or not a class of challenges still poses a problem to the analysts. Moreover, it is necessary to investigate to what extent analysts can transfer their knowledge about the solution of a solved problem to a structurally similar problem.

Dichotomy: Quantification of Human vs. Task Properties

Another dichotomy for the quantification is to distinguish between properties of the analyst and properties of the task, i.e. the hardware design.

Human Property Dimension. The primary scope of the human property dimension is the analysis of characteristics required for reverse engineering (e.g., domain knowledge, technical skills) and broader human traits such as general intelligence. These factors determine how and with what result an analyst is able to solve a reverse engineering task. The comparison of such capacities between subjects may yield a more sophisticated understanding of the characteristics that distinguish experts from novices, and the identification of useful predictors (and less relevant factors, or even obstacles) for successful reverse engineering.

Task Property Dimension. The primary scope of the task property dimension is to analyze the characteristics of the target hardware design, i.e. the number of gates and the complexity of their interconnections. Whereas the difficulty of so-called simple problems (e.g., the Tower of Hanoi [103]) merely determines the amount of time required to solve them (due to an increasing number of incremental steps or iterations required), increasing the difficulty of complex problems [104] (i.e. the amount of relevant information to be considered or processed simultaneously) may further diminish the problem solving performance, i.e. the quality of the solution. Analyzing which components of a netlist should be considered simple, and which complex, is key to quantifying the human factor in reverse engineering. Further, since analysts might use tools to transform the original gate-level netlists into a graph-based representation to conduct visual pattern matching search strategies, research on insight problems [105] might indicate design characteristics that facilitate or hinder reverse engineering when using visualizations.

3.6.5 Research Designs, Data Collection, and Challenges

In addition to the quantifiable aspects discussed in the previous section, we now present (1) aspects of research designs, (2) methods of data collection used to investigate the human factor, and (3) challenges for future research. In order to quantify the human factor, researchers collect data on the process and outcome of reverse engineering attempts and on which human and task characteristics influence reverse engineering success. By choosing between different research designs and methods of data collection, the researcher selects which parts of reality are under investigation and which are excluded. Therefore, carefully choosing designs and methods of data collection with regard to the research question and their respective advantages and disadvantages is an important task.

Research Designs: Laboratory vs. Field

Research on human factors can be carried out either in laboratory studies or in the field. Field studies are observations at the places where analysts naturally perform hardware reverse engineering. This allows gaining insight into the complexity of the processes in their respective context. Conversely, in laboratory experiments researchers control the context and observe single aspects in great detail and with reduced external influences compared to observations in the field. These studies take place at the researchers' laboratories, i.e. reverse engineering is carried out under artificial conditions. Please note that the term laboratory refers to the artificiality of the context and must not be confused with the reverse engineer's laboratory, which constitutes the site of a field study. In the laboratory, researchers may underestimate or mischaracterize reverse engineering processes, as they impose unrealistic boundary conditions, restrict access to resources analysts might normally use, or fail to capture relevant strategies not available in a laboratory setting. The strict control of the context, however, is key to investigating and isolating the effect or role of particular variables. Research in the laboratory requires researchers to use a formal description of the reverse engineering process (see Section 3.6.5).

Data Collection

In order to quantify human factors, researchers have to collect data on how analysts reverse engineer, what results from these processes, and which properties of the human analyst and the task influence reverse engineering success. In the following, we briefly describe two types of data: (1) behavioral, and (2) verbal data.

Behavioral Data. Behavior observations such as log files from human-computer interaction, eye tracking, screen captures, or video recordings allow a detailed analysis of actions and strategies and their respective development over time. Behavioral data is not affected by shortcomings concerning memory, introspection, or response biases. Meaningfully reconstructing behavioral sequences from log files, however, requires a sophisticated system to be set up a priori. Behaviors the researcher did not expect to occur may simply not be reconstructible from such automated recordings, rendering them arguably incomplete or even useless.

Verbal Data. Having analysts verbalize their thoughts while they are problem solving (e.g., think-aloud) allows insights into mental models, deliberations, and intentions behind the strategies employed (e.g., a sequence of goals) [106]. An alternative to think-aloud is stimulated recall [106]. With this data collection technique, the problem solving process itself is not verbalized during its course, but the process is recorded. After problem solving, the problem solvers are presented with a section of their problem solving behavior and are asked to explain their actions. While the information revealed that way is valuable to understand a phenomenon, the quality of such self-report data may be limited when respondents are unwilling or unable to provide an accurate account [107, 108].


Challenges

In addition to the diverse aspects above, there are several open challenges which have to be addressed by interdisciplinary research.

Process Description. A major challenge for research is the lack of a formal description of how analysts carry out reverse engineering. Understanding the structure of a problem is an important prerequisite for investigating the cognitive processes involved in solving it [89]. Applying a formalized description during research on reverse engineering makes the results of different research groups comparable, facilitates an integration of findings, and allows meaningful research synthesis.

Sampling. Meaningful research on reverse engineering requires sampling subjects trained in reverse engineering, a highly domain-specific skill which presumably only relatively few people possess. This dramatically reduces the population to draw samples from. Among those, a substantial proportion will be unwilling to follow an invitation to a university laboratory (as their reverse engineering usually pursues destructive applications), and some might be bound by non-disclosure or other agreements that prevent them from participating. In addition, establishing contact with the remaining potential participants and thus recruiting research participants may be challenging, and researchers interested in studying the human factor in reverse engineering are advised to pool their resources.

3.7 Conclusion

In this chapter, we closed several important research gaps. First, we introduced our generic netlist reverse engineering and manipulation framework HAL. The framework allows for the development of custom tools to automate time-consuming and complex reverse engineering tasks. Second, we presented our hardware Trojan detection technique ANGEL, which is based on Boolean function analysis and graph neighborhood analysis. Third, we demonstrated the manifold applicability of HAL in a variety of case studies that focus on the real-world threat posed by hardware Trojans in cryptographic hardware designs. We extended the adversary's tool arsenal to demonstrate how to automatically invalidate cryptographic self-tests and wiretap cryptographic key circuitry. We then reviewed a constrained IP watermarking scheme for FPGAs and improved its security by use of opaque predicates. Additionally, we revealed that previously proposed hardware opaque predicate implementations are not sufficiently secure against reverse engineering. Moreover, we broadened the scope of reverse engineering through a combination of technical and human-centered perspectives and provided suggestions for future research directions to holistically capture the complexity of hardware reverse engineering. To this end, we surveyed problem solving research and research on the acquisition of expertise for a general audience, which facilitates the quantification of decisive human factors.

Chapter 4 Graph Similarity and Its Application to Hardware Security

Motivation. Hardware reverse engineering is a powerful and universal tool for both security engineers and adversaries. From a constructive perspective, it allows for the detection of intellectual property infringements and hardware Trojans, while it can simultaneously be used for product piracy and malicious circuit manipulations. From a designer's perspective, it is crucial to have an estimate of the costs associated with reverse engineering based on algorithms with sound mathematical underpinnings, yet little is known about this, especially when dealing with obfuscated hardware.

Contents of this Chapter

4.1 Introduction ...... 42
4.2 The Graph Similarity Problem ...... 42
    4.2.1 Preliminaries ...... 42
    4.2.2 Hardware Characteristics and Optimizations ...... 44
    4.2.3 Graph Similarity Analysis Strategy ...... 44
    4.2.4 Graph Edit Distance Approximation ...... 46
    4.2.5 Neighbour Matching ...... 48
    4.2.6 Multiresolutional Spectral Analysis ...... 50
4.3 Evaluation ...... 51
    4.3.1 Implementation ...... 52
    4.3.2 Case Study I: Gate-level Netlist Reverse Engineering ...... 52
    4.3.3 Case Study II: Trojan Detection ...... 56
    4.3.4 Case Study III: Obfuscation Assessment ...... 58
4.4 Discussion ...... 60
4.5 Conclusion ...... 61

Contribution. The research presented in this chapter was joint work with Sebastian Wallat (affiliated with the University of Massachusetts Amherst, USA), who assisted in the evaluation on the Google cluster, and Nicolai Bissantz (affiliated with the Ruhr-Universität Bochum, Germany), who developed, implemented, and evaluated the spectral analysis. This work is currently under submission.


4.1 Introduction

Goals and Contributions. In this chapter, we focus on the detection of similarities between gate-level netlists rather than exact matchings via Boolean function analysis or subgraph isomorphism. Our goal is to examine the suitability of graph similarity heuristics for the hardware security domain. This approach seems promising since such heuristics have been used successfully in several other settings, including malware detection [109, 110], grading of programming assignments [111], bioinformatics, and data mining [112]. To this end, we first analyze characteristics of hardware to improve state-of-the-art similarity heuristics through tailored optimizations. We then introduce our novel approach based on spectral analysis of adjacency matrices. Finally, in three case studies we demonstrate the efficacy of similarity analysis for large and complex hardware designs. In summary, our main contributions are:

• Graph Similarity for Hardware Security. To the best of our knowledge, we are the first to apply the graph similarity problem in the hardware security domain. We show a broad spectrum of graph similarity applications by means of three case studies, namely (1) gate-level reverse engineering of security-relevant circuitry, (2) detection of hardware Trojans, and (3) assessment of hardware obfuscation. To this end, we improve state-of-the- art similarity heuristics in terms of accuracy and computation time by optimizations and novel preprocessing techniques tailored to the hardware setting.

• Novel Similarity Heuristic. We present a novel graph similarity heuristic based on spectral graph analysis. More precisely, we analyze spectral information of two graphs in a multiresolutional way to determine their similarity. Eigenvalues of the graph’s adjacency matrices are computed and a suitable distance measure between respective eigenvalue distributions is determined.

• Extensive Evaluation. Our evaluation demonstrates the efficacy of graph similarity heuristics for large hardware benchmarks while keeping analysis time practical. In addition to exploring a variety of algorithms, our evaluation covers different FPGA families and several design optimization goals to emphasize the reliability of our approach for each case study.
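The core of the spectral heuristic in the second contribution can be illustrated with a minimal sketch (not the multiresolutional procedure of Section 4.2.6): compute the eigenvalues of both adjacency matrices and compare their sorted magnitude spectra with a simple distance measure. The function name `spectral_distance`, the zero-padding, and the Euclidean distance are illustrative assumptions, not the thesis implementation; the sketch assumes NumPy is available.

```python
import numpy as np

def spectral_distance(A1, A2):
    """Toy spectral comparison: sort the eigenvalue magnitudes of two
    adjacency matrices (zero-padded to equal length) and return the
    Euclidean distance between the two spectra (0 means isospectral)."""
    e1 = np.sort(np.abs(np.linalg.eigvals(A1)))[::-1]
    e2 = np.sort(np.abs(np.linalg.eigvals(A2)))[::-1]
    n = max(len(e1), len(e2))
    e1 = np.pad(e1, (0, n - len(e1)))
    e2 = np.pad(e2, (0, n - len(e2)))
    return float(np.linalg.norm(e1 - e2))

# A directed 3-cycle (eigenvalue magnitudes all 1) vs. a 3-vertex path
# (nilpotent adjacency matrix, eigenvalues all 0).
cycle = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
path  = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float)
assert spectral_distance(cycle, cycle) < 1e-9
assert spectral_distance(cycle, path) > 0.5
```

The sketch already shows the appeal of the approach: it is invariant under vertex relabeling, since the spectrum does not depend on vertex order.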

4.2 The Graph Similarity Problem

Before we detail state-of-the-art graph similarity heuristics and our hardware-specific improvements, we provide essential background on graph similarity and the notation used throughout the rest of this chapter. Note that we represent gate-level netlists as graphs and leverage similarity algorithms between graphs for hardware reverse engineering under the adversary model of Section 2.3.

4.2.1 Preliminaries

Definition 1 (Directed Graph). A digraph G = (V, E) is a pair where V is a set of vertices, and E ⊆ V × V is a set of edges (ordered pairs of vertices). d−G(v) denotes the outgoing edges of a vertex v in G, and d+G(v) denotes the ingoing edges, respectively. cG(v) is the set of child vertices of a vertex v in G, i.e. the projection cG(v) = π1(d−G(v)) := {π1(v, va), . . . , π1(v, vz)} = {va, . . . , vz}. pG(v) is the set of parent vertices of a vertex v in G, i.e. the projection π0(d+G(v)). The function label: V → N determines the label of a vertex.

Two relationships are relevant for our work: (1) isomorphism, and (2) similarity. From a high-level perspective, graph isomorphism captures whether two graphs are structurally equivalent or not. Graph similarity relaxes this binary decision to a real number indicating a level of similarity, see Figure 4.1.
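Definition 1 can be sketched as a small data structure; the class name `Digraph` and the method names `d_out`, `d_in`, `children`, and `parents` are illustrative stand-ins for d−G(v), d+G(v), cG(v), and pG(v).

```python
# Minimal sketch of Definition 1: a digraph with labeled vertices.
class Digraph:
    def __init__(self, vertices, edges, labels):
        self.V = set(vertices)          # vertex set V
        self.E = set(edges)             # E ⊆ V × V, ordered pairs
        self.label = dict(labels)       # label: V → N

    def d_out(self, v):                 # d−G(v): outgoing edges of v
        return {(a, b) for (a, b) in self.E if a == v}

    def d_in(self, v):                  # d+G(v): ingoing edges of v
        return {(a, b) for (a, b) in self.E if b == v}

    def children(self, v):              # cG(v) = π1(d−G(v))
        return {b for (_, b) in self.d_out(v)}

    def parents(self, v):               # pG(v) = π0(d+G(v))
        return {a for (a, _) in self.d_in(v)}

g = Digraph({1, 2, 3}, {(1, 2), (1, 3), (2, 3)}, {1: 0, 2: 1, 3: 1})
assert g.children(1) == {2, 3}
assert g.parents(3) == {1, 2}
```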

Figure 4.1: Example of the difference between isomorphism and similarity. G1 and G2 are isomorphic (f(1) = C, f(2) = A, f(3) = B, f(4) = D), whereas G2 and G3 are not isomorphic, even though they are topologically similar. The missing edge for an isomorphism from G2 to G3 is (α, δ).

Definition 2 (Graph Isomorphism). Let G1 = (V1, E1) and G2 = (V2, E2) be two graphs. G1 and G2 are isomorphic if there exists a bijection f: V1 → V2 such that (u, v) ∈ E1 ⇐⇒ (f(u), f(v)) ∈ E2 for all u, v ∈ V1.

Definition 3 (Graph Similarity Algorithm). Let G1 and G2 be two graphs. A graph similarity algorithm A: (G1, G2) → [0, 1] computes a real-valued similarity score for G1 and G2. A similarity score of 1 indicates that G1 and G2 are identical.

In order to effectively measure similarity, we use the notion of graph edit distance, see Definition 4. The edit distance measures the smallest number of edit operations transforming one graph into another.

Definition 4 (Graph Edit Distance). Let G1 and G2 be two graphs. The graph edit distance is a function GED(G1, G2) → N which computes the smallest number of edit operations to transform G1 into G2. The edit operations are: adding an isolated vertex e+v, deleting an isolated (without connecting edges) vertex e−v, adding an edge e+e, deleting an edge e−e, and relabeling a vertex erv. Each edit operation has a specific cost, defined by cost: {e+v, e−v, e+e, e−e, erv} → N, typically 1 for vertex edit operations and 2 for edge edit operations.

Even though the problems of graph isomorphism and graph edit distance are conceptually easy to understand, both are hard to solve in a generic way: computation of the graph edit distance is NP-hard [113] (exponential in the number of vertices/edges), the graph isomorphism problem is in the low hierarchy of NP, and the subgraph isomorphism problem is NP-complete [114]. Over the years, various heuristic algorithms have been proposed to provide a similarity measure. These heuristics often involve scenario-specific optimizations (1) to increase accuracy, and (2) to reduce computation time via reduction of the analyzed graphs.
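Turning an edit distance into a similarity score in the sense of Definition 3 only requires normalizing by the worst-case number of edit operations. A minimal sketch under the cost model above (the same normalization reappears later in Algorithm 3); the function name is made up for illustration:

```python
def similarity_from_ged(ged, n1, m1, n2, m2):
    """Map a graph edit distance to a similarity score in [0, 1].
    n1/m1 and n2/m2 are vertex/edge counts of G1 and G2. The
    denominator is the worst case: delete all of G1 and add all of
    G2, with vertex edits costing 1 and edge edits costing 2
    (cf. Definition 4)."""
    worst = n1 + n2 + 2 * (m1 + m2)
    return 1.0 - ged / worst

# Identical graphs: edit distance 0 yields a score of 1.
assert similarity_from_ged(0, 4, 3, 4, 3) == 1.0
# One edge edit (cost 2) between two 4-vertex/3-edge graphs.
assert abs(similarity_from_ged(2, 4, 3, 4, 3) - 0.9) < 1e-9
```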


4.2.2 Hardware Characteristics and Optimizations

Since we deal with graphs representing hardware, we now describe algorithm-specific optimizations based on general hardware characteristics, namely (1) vertex labeling, and (2) subgraph analysis. Our optimizations increase the accuracy and reliability of graph similarity heuristics and simultaneously enable a major reduction of computation times. Note that graph similarity algorithms may have to be adapted to incorporate these optimizations.

Vertex Labeling. Since we target graphs representing gate-level netlists, vertices represent gates that implement Boolean functions. To effectively distinguish vertices, each gate type is assigned a specific label (e.g., an XOR gate is labeled xor, an AND gate is labeled and, etc.). To be more precise, we use natural numbers to represent labels, cf. Definition 1. Note that typical hardware primitive libraries may contain hundreds of atomic gates [115]. Thus, we may have to adapt similarity algorithms to support labels.

Subgraph Analysis. Since modern hardware designs are typically assembled from a variety of modules such as reused IP cores [19], our main focus is to identify these small (e.g., 100 vertices) subgraphs in a large (e.g., 5000 vertices) design graph rather than to perform a similarity analysis of two large graphs. Therefore, we may have to adapt similarity algorithms to support a subgraph analysis rather than two equally sized graphs.
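The vertex labeling optimization amounts to a simple dictionary pass over the netlist. A hedged sketch, where the gate-type strings and the function name `label_netlist` are made up for illustration:

```python
# Sketch: map each gate type of a netlist onto a natural number
# (label: V → N), so similarity heuristics can distinguish e.g. XOR
# vertices from AND vertices.
def label_netlist(gates):
    """gates: dict mapping gate name -> gate type string.
    Returns (labels, type_to_id): one integer label per gate, plus the
    gate-type-to-integer dictionary used to assign them."""
    type_to_id = {}
    labels = {}
    for name, gate_type in sorted(gates.items()):
        if gate_type not in type_to_id:
            type_to_id[gate_type] = len(type_to_id)
        labels[name] = type_to_id[gate_type]
    return labels, type_to_id

gates = {"g0": "xor", "g1": "and", "g2": "xor", "g3": "dff"}
labels, ids = label_netlist(gates)
assert labels["g0"] == labels["g2"]      # same gate type, same label
assert labels["g0"] != labels["g1"]      # different types differ
```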

4.2.3 Graph Similarity Analysis Strategy

We now provide a high-level overview of our two-phased graph similarity analysis using different netlist preprocessing steps. From a high-level point of view, phase 1 represents a fast, coarse-grained similarity analysis; however, it is inevitably prone to falsely identified similarities (false positives). To overcome this fundamental limitation, we perform a slower, but fine-grained and thus more reliable phase 2.

Phase 1: Combinational Logic Subgraphs

A typical hardware design consists of combinational logic implementing Boolean functions to transform data stored in FFs forming registers. In particular, the combinational logic between register stages is interesting for the human analyst since it implements crucial Boolean functions (e.g., a hardware Trojan trigger). Therefore, we analyze graph similarity among combinational logic subgraphs rather than the whole graph. This approach yields both increased accuracy, since registers are a potential pitfall for false positives, and reduced computation time, since combinational logic subgraphs are significantly smaller and can be analyzed in parallel. To determine combinational logic subgraphs, we process the design in a two-phased approach. First, we determine so-called register groups [39]. To be more precise, we group all FFs which have equal control signals (e.g., clock, chip enable, or (a)synchronous (re)set). Second, for each FF in each register group we perform a reverse breadth-first search until we reach an FF. Here, reverse means that we change the direction of each edge. Each combinational logic gate visited during the reverse breadth-first search is added to the combinational logic group for the register group. Note that we also report the register group size, since it provides the human analyst with valuable information about the design's architecture. For example, the register grouping identifies general-purpose registers of a Central Processing Unit (CPU) or the datapath of a crypto implementation, see Section 4.3.
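The two steps above can be sketched as follows, under a deliberately simplified netlist model: gate names, the `("FF", control, predecessors)` tuple encoding, and both function names are illustrative assumptions, not the HAL implementation.

```python
from collections import deque

def register_groups(netlist):
    """Step 1: group all FFs that share identical control signals.
    netlist: gate name -> (type, control-signal tuple or None, preds)."""
    groups = {}
    for gate, (gtype, control, _) in netlist.items():
        if gtype == "FF":
            groups.setdefault(control, set()).add(gate)
    return list(groups.values())

def comb_subgraph(netlist, group):
    """Step 2: reverse breadth-first search from each FF of a register
    group, collecting combinational gates until another FF is reached."""
    comb, queue = set(), deque(group)
    while queue:
        gate = queue.popleft()
        for pred in netlist[gate][2]:      # follow edges backwards
            if netlist[pred][0] == "FF":
                continue                    # stop at the register stage
            if pred not in comb:
                comb.add(pred)
                queue.append(pred)
    return comb

netlist = {
    "ff0": ("FF", ("clk",), ["x1"]), "ff1": ("FF", ("clk",), ["x2"]),
    "x1":  ("XOR", None, ["ff1"]),   "x2":  ("AND", None, ["ff0", "x1"]),
}
groups = register_groups(netlist)
assert groups == [{"ff0", "ff1"}]                  # one clock domain
assert comb_subgraph(netlist, groups[0]) == {"x1", "x2"}
```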


Figure 4.2: Combinational logic subgraph (marked as cloud) between two registers.

In summary, phase 1 analyzes similarities among combinational logic subgraphs of both hardware designs.

Phase 2: Combinational Logic Bitslices and LUT Decomposition

Combinational Logic Bitslices. Even though combinational logic subgraphs are significantly smaller than the original graph, they can still consist of numerous gates (e.g., the datapath of a CPU or crypto design). In order to further reduce the size of the subgraphs, we analyze so-called bitslices [116] of combinational logic subgraphs. More precisely, a bitslice is a Boolean function with one output and multiple inputs. Hence, each output signal of a combinational logic subgraph yields a single bitslice. Our bitslice analysis is based on the observation that similar combinational logic subgraphs share similar bitslice subgraphs, and the analysis of bitslices provides a more fine-grained similarity value. Analogously to our combinational logic subgraph generation, we perform a reverse breadth-first search for each output signal until we reach the inputs of the subgraph. Each visited gate is added to the bitslice. For each combinational logic subgraph, we thus obtain not one but multiple (= number of bitslices) similarity values, since we compare multiple bitslices. Even though a human analyst is capable of analyzing such a vector of similarity values, we found it practical to reduce this vector back to a single value. To this end, we simply determine the arithmetic mean of the similarity values. Note that we also report the standard deviation in case the similarity values are spread over a wide range.

FPGA LUT Decomposition. We now describe a netlist preprocessing technique tailored to FPGA designs which on the one hand significantly increases the accuracy of graph similarity algorithms, but on the other hand increases the size of the analyzed graphs. A substantial building block of FPGAs are LUTs, typically small truth tables (with 2 to 6 inputs) which implement Boolean functions and thus form the combinational logic of a hardware design (along with other dedicated multiplexers or carry gates).
From a graph theory perspective, each LUT is treated as a single vertex regardless of the Boolean function it implements: even if a LUT L1 implements a simple Boolean OR and a LUT L2 implements (parts of) a highly non-linear cryptographic Sbox, L1 and L2 are treated equally, even when labeling is used. To address this fundamental limitation, we preprocess the target gate-level netlist and replace each LUT with its implemented Boolean function. More precisely, we determine the minimal form of a Boolean function with the Quine-McCluskey algorithm [117] in order to represent each LUT with the minimum number of AND-OR-INV logic gates. Note that the runtime of the Quine-McCluskey algorithm grows exponentially with the number of variables; however, typical LUTs have a small input size (≤ 6), hence this is not a limitation in practice. This preprocessing step naturally increases the netlist size, but it enables us to address the aforementioned issue in a generic way, i.e. independently of any graph similarity heuristic. Furthermore, this step can be seen as lifting the netlist to a higher-level representation, since it unifies netlists of different LUT architectures.

In summary, phase 2 analyzes similarities among combinational logic bitslices of both hardware designs whose LUTs have been decomposed.
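The LUT replacement step can be illustrated as follows. Note that this sketch emits the canonical (unminimized) sum of products; the actual preprocessing first minimizes the function with the Quine-McCluskey algorithm, which is omitted here for brevity, and the function name is made up.

```python
from itertools import product

def lut_to_sop(n_inputs, truth):
    """Expand a LUT truth table into an AND-OR-INV structure.
    truth: maps each input tuple to 0/1. Returns a list of minterms,
    each a list of (input_index, inverted?) pairs feeding an AND gate;
    the AND outputs are OR-ed together to form the LUT output."""
    minterms = []
    for assignment in product((0, 1), repeat=n_inputs):
        if truth[assignment]:
            minterms.append(
                [(i, bit == 0) for i, bit in enumerate(assignment)])
    return minterms

# A 2-input LUT implementing XOR.
xor_truth = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
sop = lut_to_sop(2, xor_truth)
# Two minterms: (NOT a AND b) and (a AND NOT b).
assert sop == [[(0, True), (1, False)], [(0, False), (1, True)]]
```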

4.2.4 Graph Edit Distance Approximation

Although the graph edit distance effectively measures the similarity of two graphs, its computational complexity is a fundamental drawback. Hu et al. [110] proposed an algorithm that approximates the graph edit distance to measure similarities among software function-call graphs for malware detection. The key idea is to compute the edit cost of mapping each vertex of one graph onto a vertex of the other, and then to leverage the Hungarian method, which solves this assignment problem¹ in O(|V|³) polynomial time [118]. The Hungarian method finds the optimal assignment, i.e. a matching between the vertex sets with minimal cost.
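The assignment problem solved by the Hungarian method can be illustrated by brute force on a toy cost matrix. This is feasible only for tiny instances, since it enumerates all bijections; the cost values and the function name are made up for illustration.

```python
from itertools import permutations

def optimal_assignment(cost):
    """Return (min_cost, assignment) where assignment[i] is the column
    matched to row i, minimizing the summed cost over all bijections.
    The Hungarian method finds the same optimum in O(n^3) instead of
    the O(n!) enumeration used here."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return sum(cost[i][best[i]] for i in range(n)), list(best)

cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
total, match = optimal_assignment(cost)
assert total == 5           # 1 + 2 + 2
assert match == [1, 0, 2]   # row 0 -> col 1, row 1 -> col 0, row 2 -> col 2
```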

label | subgraph | edit cost ed
------+----------+------------------------------------------------------------
false | false    | ed ← max(|d+G1(vi)|, |d+G2(vj)|) − min(|d+G1(vi)|, |d+G2(vj)|) + max(|d−G1(vi)|, |d−G2(vj)|) − min(|d−G1(vi)|, |d−G2(vj)|)
true  | false    | ed ← |pG1(vi)| + |pG2(vj)| − 2|C(pG1(vi)) ∩ C(pG2(vj))| + |cG1(vi)| + |cG2(vj)| − 2|C(cG1(vi)) ∩ C(cG2(vj))|
false | true     | ed ← |d+G1(vi)| − min(|d+G1(vi)|, |d+G2(vj)|) + |d−G1(vi)| − min(|d−G1(vi)|, |d−G2(vj)|)
true  | true     | ed ← |pG1(vi)| − |C(pG1(vi)) ∩ C(pG2(vj))| + |cG1(vi)| − |C(cG1(vi)) ∩ C(cG2(vj))|

Table 4.1: Edit cost equations for the parameter configurations label and subgraph.

Optimizations Tailored To Hardware. Hu et al. [110] described an optimization similar to vertex labeling to increase accuracy. In our case, we incorporate vertex labeling described in Section 4.2.1. Also, more importantly for our case, we facilitate a subgraph search to measure similarity for a small subgraph rather than two equally sized graphs. Note that in the subgraph search, we assume that G1 is the small subgraph and G2 is the larger one. In case the subgraph search is not used, but G1 is significantly smaller than G2 we obtain a small similarity value. Note that we parameterized optimizations to measure its impact on the heuristic’s accuracy in our evaluation, see Section 4.3. Algorithm 3 shows the graph edit distance approximation extended with our optimizations. First, the quadratic cost matrix (cij) is initialized ( 1 - 19). Note that |V2| dummy vertices are added to V1, and |V1| dummy vertices to V2. The quadratic cost matrix is split into four equally sized parts. The top left part denotes the edit distance to transform a vertex from V1 into a vertex from V2 ( 2 - 4 ). Note that the function edit distance is described below. The top right part denotes cost to transform a vertex from V1 to a dummy vertex from V2, i.e. deletion − + of ingoing |dG(v)| and outgoing |dG(v)| edges as well as vertex deletion ( 5 - 11). Similarly, the bottom left part denotes the cost to transform a vertex from V2 to a dummy-vertex from V1 ( 13 - 19). The bottom right part denotes cost to transform dummy vertices from V1 into

1 The assignment problem is defined as follows: Let S, T be two sets of equal size and let c: S × T → R be a cost function. The goal is to find a bijection f: S → T such that the cost Σs∈S c(s, f(s)) is minimized.


Algorithm 3 Graph Edit Distance Approximation

Input: graphs G1 = (V1, E1) and G2 = (V2, E2), Booleans label and subgraph
Output: similarity score s ∈ [0, 1] for G1 and G2, vertex matching m ∈ N^((|V1|+|V2|) × (|V1|+|V2|))

// initialization of cost matrix
 1: matrix (cij) ∈ N^((|V1|+|V2|) × (|V1|+|V2|)), initialized with 0
 2: for vertex vi ∈ V1 do
 3:     for vertex vj ∈ V2 do
 4:         cij ← edit distance(vi, vj, label, subgraph)
 5: for row index i with 0 ≤ i < |V1| do
 6:     for column index j with 0 ≤ j < |V1| do
 7:         column index k ← j + |V2|
 8:         if i = j then
 9:             cik ← |d+G1(vi)| + |d−G1(vi)| + 1
10:         else
11:             cik ← ∞
12: if subgraph = false then
13:     for row index i with 0 ≤ i < |V2| do
14:         for column index j with 0 ≤ j < |V2| do
15:             row index k ← i + |V1|
16:             if i = j then
17:                 ckj ← |d+G2(vi)| + |d−G2(vi)| + 1
18:             else
19:                 ckj ← ∞
// search optimal vertex matching
20: vertex matching m ← hungarian((cij))
// similarity computation
21: if subgraph = false then
22:     return s ← 1 − (Σ(i=0 to |V1|+|V2|−1) ci,mi) / (|V1| + |V2| + 2(|E1| + |E2|))
23: else
24:     return s ← 1 − (Σ(i=0 to |V1|+|V2|−1) ci,mi) / (2|V1| + 4|E1|)

47 Chapter 4. Graph Similarity and Its Application to Hardware Security

dummy vertices from V2, which has zero associated cost (line 1). Second, the Hungarian method is used to find the optimal vertex matching with smallest cost between the vertex sets V1 and V2 (line 20). Third, the graph similarity is computed (lines 21 - 24). Note that the original algorithm computed an approximation of the graph edit distance and not a similarity score; Chan et al. [119] defined a formula to compute this value for a given edit distance (line 22). In case of a subgraph search (subgraph = true), we omit the bottom left part (line 12), i.e. the cost to transform a vertex from V2 into a dummy vertex from V1. Thus, we can remove these vertices from V2 at no cost, since we are only interested in measuring the cost of identifying the small subgraph G1 in G2. Furthermore, we adjust the denominator in the similarity computation (line 24) to the highest number of edit operations to transform a subgraph of G2 into G1, i.e. deleting and adding |V1| vertices (cost of 2|V1|) as well as deleting and adding |E1| edges (cost of 4|E1|). Note that the cost of an edge edit operation is 2, see Definition 1.
Edit Distance Computation. To compute the edit distance (line 4), we use distinct strategies for all 4 different parameter possibilities, see Table 4.1. Equation 4.1 states the original edit distance cost formula by Chan et al. [119].

ed ← max(|d+G1(vi)|, |d+G2(vj)|) − min(|d+G1(vi)|, |d+G2(vj)|)
   + max(|d−G1(vi)|, |d−G2(vj)|) − min(|d−G1(vi)|, |d−G2(vj)|)        (4.1)

In case both parameters label and subgraph are false, Equation 4.1 computes the edit distance. For the vertex labeling optimization (label = true, subgraph = false), we count occurrences of vertex labels in the parent and child sets instead of the number of ingoing |d−G| and outgoing |d+G| edges. Formally, the input of the function C in Table 4.1 is a set of vertices, and it returns a multiset of vertex labels with their multiplicity. For the subgraph optimization (label = false, subgraph = true), we omit these edit distance costs for G2, since we are only interested in the subgraph G1. If both parameters are true, both strategies are combined.
In our implementation, we also return the vertex matching m to analyze similar vertices in both graphs. To detect multiple matchings, for example if a hardware unit is instantiated multiple times, the algorithm is executed again, but we initialize the costs for already matched vertices to ∞. Furthermore, we implemented all for loops in lines 2 - 19 in a parallel fashion to speed up execution.
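To make the cost-matrix construction and the similarity formulas concrete, the following Python sketch re-implements the approximation for graphs given as degree maps (vertex → (indegree, outdegree)). The function name, the degree-map input format, and the use of SciPy's Hungarian solver instead of munkres-cpp are our own assumptions, and a large constant stands in for the ∞ entries:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 10**9  # stands in for the algorithm's infinity entries

def ged_similarity(G1, G2, subgraph=True):
    """Approximate graph-edit-distance similarity (illustrative sketch).

    G1, G2: dicts mapping vertex -> (indegree, outdegree); G1 is the
    small reference subgraph, G2 the larger target graph.
    """
    V1, V2 = list(G1), list(G2)
    n1, n2 = len(V1), len(V2)
    m1 = sum(d_out for (_, d_out) in G1.values())   # |E1|
    m2 = sum(d_out for (_, d_out) in G2.values())   # |E2|
    C = np.zeros((n1 + n2, n1 + n2))
    # top left: vertex-to-vertex edit distance (degree-based, Eq. 4.1)
    for i, u in enumerate(V1):
        for j, v in enumerate(V2):
            (in1, out1), (in2, out2) = G1[u], G2[v]
            C[i, j] = (max(out1, out2) - min(out1, out2)
                       + max(in1, in2) - min(in1, in2))
    # top right: delete a vertex of G1 (all its edges plus the vertex)
    C[:n1, n2:] = BIG
    for i, u in enumerate(V1):
        C[i, n2 + i] = sum(G1[u]) + 1
    # bottom left: insert a vertex of G2 -- skipped for subgraph search,
    # i.e. unmatched vertices of G2 are removable at no cost
    if not subgraph:
        C[n1:, :n2] = BIG
        for j, v in enumerate(V2):
            C[n1 + j, j] = sum(G2[v]) + 1
    # bottom right stays 0: dummy-to-dummy has no cost
    rows, cols = linear_sum_assignment(C)
    cost = C[rows, cols].sum()
    denom = 2 * n1 + 4 * m1 if subgraph else n1 + n2 + 2 * (m1 + m2)
    return 1.0 - cost / denom
```

In subgraph mode, the small reference graph embedded perfectly in a larger graph yields a similarity of 1, whereas the full comparison penalizes every extra vertex and edge of the larger graph.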

4.2.5 Neighbour Matching
In addition to the graph edit distance approximation, another strategy to address the graph similarity problem was proposed by Vujošević-Janičić et al. [111]. The key idea is to analyze the graph topology and match neighboring vertices. To this end, a similarity submatrix is built that compares the topology of a vertex pair, and the Hungarian method is then leveraged to solve this assignment problem. Similar to the graph edit distance approximation, the Hungarian method finds the optimal assignment for this matrix, i.e. a matching between the vertex sets with minimal cost.
Optimizations Tailored To Hardware. Since Vujošević-Janičić et al. [111] developed an algorithm to compare software implementations, their vertex labeling focuses on instructions. In our case, we incorporate the vertex labeling described in Section 4.2.1 and we extend the similarity score computation with a subgraph search. As noted before, we assume that G1 is the small subgraph and G2 the large one, and we parameterize each optimization to measure its impact on the algorithm's accuracy.


Algorithm 4 Neighbour Matching

Input: Graphs G1 = (V1, E1), G2 = (V2, E2), Boolean label, Boolean sg (subgraph), ε > 0
Output: Similarity score s ∈ [0, 1] for G1 and G2

//initialization of the similarity matrix
1: matrix (simij) ∈ R^(|V1|×|V2|)
2: for vertex vi in V1 do
3:   for vertex vj in V2 do
4:     if label = false or label(vi) = label(vj) then
5:       simij ← 1
6:     else
7:       simij ← 0.5
8: matrix (tmpij) initialized with 0
9: while ∃ indices i, j : |simij − tmpij| ≥ ε do
10:   (tmpij) ← (simij)
11:   for vertex vi in V1 do
12:     for vertex vj in V2 do
        //compute parent neighborhood similarity
13:       matrix (inlk) ∈ R^(|pG1(vi)|×|pG2(vj)|)
14:       for row index 0 ≤ l < |pG1(vi)| do
15:         for column index 0 ≤ k < |pG2(vj)| do
16:           row index l′ ← l-th index in pG1(vi)
17:           column index k′ ← k-th index in pG2(vj)
18:           inlk ← 1 − tmpl′k′
          //search optimal matching
19:       vertex matching mp ← hungarian((inlk))
20:       simin ← (Σ_{l=0}^{min(|pG1(vi)|,|pG2(vj)|)−1} (1 − in_{l,mp_l})) / max(|pG1(vi)|, |pG2(vj)|)
        //compute child neighborhood similarity
21:       matrix (outlk) ∈ R^(|cG1(vi)|×|cG2(vj)|)
22:       for row index 0 ≤ l < |cG1(vi)| do
23:         for column index 0 ≤ k < |cG2(vj)| do
24:           row index l′ ← l-th index in cG1(vi)
25:           column index k′ ← k-th index in cG2(vj)
26:           outlk ← 1 − tmpl′k′
          //search optimal matching
27:       vertex matching mc ← hungarian((outlk))
28:       simout ← (Σ_{l=0}^{min(|cG1(vi)|,|cG2(vj)|)−1} (1 − out_{l,mc_l})) / max(|cG1(vi)|, |cG2(vj)|)
29:       if label = false then
30:         simij ← (simin + simout) / 2
31:       else
32:         simij ← sqrt(tmpij · (simin + simout) / 2)
//search optimal vertex matching
33: vertex matching m ← hungarian((simij))
//similarity computation
34: if sg = false then
35:   return s ← (Σ_{i=0}^{min(|V1|,|V2|)−1} sim_{i,mi}) / max(|V1|, |V2|)
36: else
37:   return s ← (Σ_{i=0}^{min(|V1|,|V2|)−1} sim_{i,mi}) / |V1|

Algorithm 4 shows the neighbour matching extended with our optimizations. First, the topological similarity matrix (simij) is initialized (lines 1 - 7). In case of vertex labeling (label = true), we utilize the experimentally determined value 0.5 to distinguish vertices with different labels, i.e. label(vi) ≠ label(vj). Note that values in the range [0.1, 0.3] resulted in a higher false negative rate and a smaller false positive rate, while values in the range [0.7, 0.9] caused a higher false positive rate. Second, the similarity matrix is iteratively updated until each element differs from the temporary matrix (tmpij) by less than some chosen precision ε (line 9), i.e. ε = 10^−4 [119, 120]. In each iteration (lines 10 - 32), the new similarity matrix (simij) is determined as follows: for each vertex pair (vi, vj) the optimal vertex matching is computed with the Hungarian method [118] (line 19). Based on this vertex matching, the vertex similarity value is determined for the parent (input) vertices simin (line 20) and the child (output) vertices simout (line 28). Note that in case both vertices have no parents or no children, we set the respective vertex similarity value to 1. Finally, the new similarity value simij is determined (lines 29 - 32). Third, after the similarity matrix stabilizes, the similarity value s is computed (lines 33 - 37). To this end, the Hungarian method is applied again to find the optimal vertex matching. For the subgraph search (sg = true), we adjust the denominator since we are only interested in determining a subgraph in G2 similar to graph G1.
In our implementation, we also return the vertex matching m (line 33) to analyze similar vertices in both graphs. To detect multiple matchings, we again execute the algorithm, but we initialize the similarity of already matched vertices to 0. Furthermore, we implemented all for loops in a parallel fashion to speed up execution.
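To make the fixed-point iteration concrete, the following Python sketch implements the unlabeled neighbour matching for graphs given as parent/child adjacency lists. The list-based input format and the use of SciPy's Hungarian solver are our own assumptions; the labeled variant and parallelization are omitted:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_sim(nb1, nb2, tmp):
    """Similarity of two neighborhoods: optimally match the neighbors
    and average their similarity over the larger neighborhood."""
    if not nb1 and not nb2:
        return 1.0          # both vertices have no neighbors here
    if not nb1 or not nb2:
        return 0.0
    cost = np.array([[1.0 - tmp[u, v] for v in nb2] for u in nb1])
    r, c = linear_sum_assignment(cost)
    return (1.0 - cost[r, c]).sum() / max(len(nb1), len(nb2))

def neighbour_matching(parents1, children1, parents2, children2,
                       eps=1e-4, subgraph=True):
    """parents1[i]/children1[i]: neighbor index lists of vertex i in G1."""
    n1, n2 = len(parents1), len(parents2)
    sim = np.ones((n1, n2))
    tmp = np.zeros((n1, n2))
    while np.max(np.abs(sim - tmp)) >= eps:   # iterate to a fixed point
        tmp = sim.copy()
        for i in range(n1):
            for j in range(n2):
                sim[i, j] = (matched_sim(parents1[i], parents2[j], tmp) +
                             matched_sim(children1[i], children2[j], tmp)) / 2
    # final optimal matching: minimizing 1 - sim maximizes similarity
    rows, cols = linear_sum_assignment(1.0 - sim)
    total = sim[rows, cols].sum()
    return total / (n1 if subgraph else max(n1, n2))
```

Comparing a two-gate chain with itself converges after two iterations and yields a similarity of 1.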

4.2.6 Multiresolutional Spectral Analysis Next, we present our novel graph similarity heuristic based on spectral analysis of adjacency matrices. Our key idea is that eigenvalues of adjacency matrices exhibit two important properties from spectral graph theory [121]:

(1) If the eigenvalues of two adjacency matrices differ, the graphs are different. The converse does not hold: different graphs may share the same eigenvalues (so-called cospectral graphs), but we expect this to happen only with small probability for typical graphs of interest.

(2) Eigenvalues are invariant under permutation of the vertex ordering. This is an important feature, since we only want to compare the graph topology (and obviously not its vertex labels).
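Property (2) is easy to verify numerically; the snippet below relabels the vertices of a small undirected graph with a permutation matrix and checks that the adjacency spectrum is unchanged (the path graph and the permutation are arbitrary examples of our own choosing):

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])     # path graph a - b - c
P = np.eye(3)[[2, 0, 1]]      # permutation matrix for a relabeling
B = P @ A @ P.T               # same graph with renamed vertices
ev_a = np.sort(np.linalg.eigvalsh(A))
ev_b = np.sort(np.linalg.eigvalsh(B))
# ev_a and ev_b agree although the adjacency matrices A and B differ
```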

In contrast to other related works on similarity measures based on spectral analysis (e.g., Crawford et al. [122]), we are mainly interested in localized (non-global) similarity information to identify small modules (e.g., a hardware Trojan) in a potentially erroneous graph. Therefore, we use a multiresolutional strategy to search at all local positions.
Notation. Let A be the adjacency matrix of a graph G = (V, E). Moreover, let λi and vi, i = 1, ..., |V|, be the eigenvalues (arranged in decreasing order) and corresponding eigenvectors which form the spectral decomposition of A, i.e. A vi = λi vi, i = 1, ..., |V|. Once again, G1 is the small reference subgraph and G2 the large targeted one.
Algorithm 5 shows our graph similarity approach based on multiresolutional spectral analysis. First, we generate local k-subgraphs (lines 1 - 6) for G1 and G2. Since a cross-comparison of

50 4.3. Evaluation

Algorithm 5 Spectral Analysis

Input: Graphs G1 = (V1, E1), G2 = (V2, E2), Integer k
Output: Spectral distance matrix (dij) ∈ R^(k×|V2|) for G1 and G2

//generate local k-subgraphs
1: list of subgraphs S1 ← ∅
2: for vertex v ∈ ranked_vertices(G1) do
3:   S1.append(local_subgraph(G1, v, k))
4: list of subgraphs S2 ← ∅
5: for vertex v ∈ V2 do
6:   S2.append(local_subgraph(G2, v, k))
//compute spectral distance matrix
7: row index i ← 0
8: for subgraph s1 ∈ S1 do
9:   vector (λ1^s1, ..., λm^s1) ← eigenvalues(s1)
10:  column index j ← 0
11:  for subgraph s2 ∈ S2 do
12:    vector (λ1^s2, ..., λn^s2) ← eigenvalues(s2)
13:    spectral distance dij ← Σ_{k=1}^{max(m,n)} | λk^s2 / λ1^s2 − λk^s1 / λ1^s1 |
14:    j ← j + 1
15:  i ← i + 1
16: return (dij)

all k-subgraphs for both G1 and G2 is computationally expensive, we limit the number of k-subgraphs for G1. To this end, we make the following assumption: if the small subgraph G1 is present in G2, this should turn up in a comparison for basically any vertex of G1. Hence, we select some representative vertices in G1 (ranked_vertices in line 2), e.g., determined by Google's PageRank algorithm [123]. Second, we compute the spectral distance matrix (lines 7 - 15). In particular, we compute the eigenvalue vectors of the subgraphs s1 and s2 by means of the function eigenvalues (lines 9 and 12). Note that we assume that the eigenvalues are arranged in decreasing order, i.e. λ1 is the largest eigenvalue. We then compute the spectral distance of the normalized eigenvalue sequences (line 13). Finally, matching vertices in G2 are identified by the smallest spectral distance to any of the vertices in G1. Note that, in contrast to the aforementioned algorithms, our spectral analysis returns a spectral distance matrix rather than a single similarity value. Since our approach measures a spectral distance, values close to 0 indicate a high similarity.
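The normalized spectral comparison of line 13 can be sketched as follows. Taking eigenvalue magnitudes (to handle the complex spectra of directed subgraphs) and zero-padding the shorter eigenvalue sequence are our own assumptions; the subgraphs are assumed to have a nonzero largest eigenvalue:

```python
import numpy as np

def spectral_distance(A1, A2):
    """Distance between the normalized adjacency spectra of two
    (sub)graphs -- an illustrative sketch of the comparison step."""
    ev1 = np.sort(np.abs(np.linalg.eigvals(A1)))[::-1]  # decreasing order
    ev2 = np.sort(np.abs(np.linalg.eigvals(A2)))[::-1]
    ev1 = ev1 / ev1[0]              # normalize by the largest eigenvalue
    ev2 = ev2 / ev2[0]
    n = max(len(ev1), len(ev2))
    ev1 = np.pad(ev1, (0, n - len(ev1)))   # pad the shorter sequence
    ev2 = np.pad(ev2, (0, n - len(ev2)))
    return np.abs(ev2 - ev1).sum()
```

Identical subgraphs yield a distance of 0, and increasingly different topologies yield larger distances.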

4.3 Evaluation

We now provide results of our three case studies, namely gate-level netlist reverse engineering (Section 4.3.2), Trojan detection (Section 4.3.3), and obfuscation assessment (Section 4.3.4). In addition, we provide implementation-specific details for previously mentioned graph similarity

algorithms. Moreover, we report our negative findings on other graph similarity and subgraph isomorphism algorithms (Section 4.3.1).

4.3.1 Implementation
Graph Similarity Algorithms. We implemented the graph edit distance approximation (Section 4.2.4) and the neighbour matching (Section 4.2.5) in C++14, using Boost for graph processing, BuDDy for BDDs, and munkres-cpp for the Hungarian method. In particular, we used OpenMP for parallelization to significantly accelerate computation, see Section 4.2 for which steps can be parallelized in each algorithm. The spectral analysis (Section 4.2.6) is implemented in R and Python. For our experiments, we utilized several Google Cloud Platform instances with 64 vCPUs, which cost around $3/h per instance [124].
Gate-level Netlist Generation. We generated gate-level netlists for numerous designs using the XST suite from Xilinx ISE 14.7 [125], see Table 4.2. Our evaluation targets 3 Xilinx FPGA families, namely Spartan-6 (xc6slx16), Virtex-6 (xc6vlx75t), and Kintex-7 (xc7k70t). Furthermore, we consider both available XST synthesis optimization goals, speed and area. We want to emphasize that we also evaluated each case study for other families and different FPGA devices within the same families, yielding similar results. In addition to the aforementioned preprocessing techniques in Section 4.2.2, we also remove all (for our case irrelevant) buffers from each analyzed gate-level netlist to reduce the graph size and thus the computation time.
Other Similarity and Isomorphism Algorithms. In addition to the presented graph similarity algorithms, we evaluated the applicability of maximum common subgraph [126] and subgraph isomorphism [127]. However, the subgraph isomorphism search either identifies within seconds that no subgraph is found (even using combinational subgraph preprocessing), or its computation time exceeds several hours (in phase 1); thus we found both approaches to be impractical for our evaluation. Note that such a mismatch occurs due to multi-level circuit minimization.
Furthermore, we implemented and evaluated the applicability of a labeled transition systems approach by Sokolsky et al. [128] and the k-subgraph analysis by Krügel et al. [109]. However, the computation time of the labeled transition system approach exceeds several days for larger graphs and is thus impractical for our evaluation. Moreover, the k-subgraph analysis provides inaccurate similarity results for our case studies even though its computation time is practical (several minutes up to hours for selected hardware designs). Note that we adapted the original subgraph generation by Krügel et al. to cope with hardware graphs where nodes may have more than two successors.
We want to remark that we additionally evaluated the reliability against potential errors in the netlist graphs (occurring due to imperfect image processing in chip-level reverse engineering). To this end, we randomly changed or deleted around 5% of all edges, yielding similar results. Moreover, we investigated whether the grouping of nets (bundled wires connecting one or more gates) had an influence on accuracy; however, we observed that the accuracy is only marginally affected in the decimal places of the similarity value. Thus, we deliberately omitted this grouping in both the algorithms' description and the practical evaluation.
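The buffer-removal preprocessing mentioned above can be sketched on a toy netlist representation (a dict mapping each gate to its set of successor gates; the representation, gate names, and function name are our own illustration, not HAL's actual API):

```python
def remove_buffers(netlist, buffers):
    """Remove buffer gates and reconnect their drivers to their loads."""
    for buf in buffers:
        succs = netlist.pop(buf)          # loads driven by the buffer
        for gate, outs in netlist.items():
            if buf in outs:               # gate drives the buffer ...
                outs.discard(buf)
                outs |= succs             # ... so it now drives the loads
    return netlist
```

For chains of buffers the loop would have to be repeated until a fixed point is reached; a single pass suffices for isolated buffers.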

4.3.2 Case Study I: Gate-level Netlist Reverse Engineering In our first case study, we evaluated the use of graph similarity algorithms for gate-level netlist reverse engineering with a particular focus on security-critical circuits. To this end, we examined


Table 4.2: Hardware design description and resource utilization synthesized for xc6slx16 with optimization goal area. We selected the xc6slx16 FPGA as a representative, since the resource utilization deviates only slightly for other FPGA families.

Design  Description                  Source  #LUTs  #FFs  Freq. (MHz)
0       Composite-field Sbox         [129]     63      0  -
1       AES (composite-field Sbox)   [129]   2049    587  225
2       AES (table-based Sbox)       [129]    781    584  162
3       AES (table-based Sbox)       [129]   1006    586  177
4       AES (table-based Sbox)       [130]   5101    723  91
5       AES (PPRM-based Sbox)        [129]   2230    587  91
6       AES (ANF-based Sbox)         [129]   6469    587  89
7       AES (table-based Sbox)       [131]    517    692  125
8       I2C Bus                      [132]    293    154  214
9       12-bit PIC CPU               [133]    514    248  64
10      16-bit MSP430 CPU            [134]   2235    686  43
11      AES (no Trojan)              [135]   1917   1077  181
12      AES (with Trojan)            [135]   1992   1162  181
13      Trojan (in design 12)        [135]     48      1  -
14      GCD (no obfuscation)         [64]     234    128  152
15      GCD (with obfuscation)       [64]     325     96  171

to what extent graph similarity algorithms can be used to reliably derive implementation specifics of cryptographic hardware designs, i.e. whether a given AES design implements an Sbox with a composite-field optimization [136]. We want to emphasize that this case study represents a typical instance of gate-level netlist reverse engineering, since we focus on the identification of submodules in a flattened netlist, which subsequently enables the retrieval of hierarchy information and parts of the original high-level design implementation goals. In addition, knowledge about the internal architecture of a cryptographic design provides valuable information for other scenarios, for example, to enable the injection of cryptographic Trojans through Sbox tampering [77, 78, 79, 80] or to improve the assessment of physical attacks such as fault injection or side-channel analysis [74, 75].

Hardware Designs

We obtained numerous publicly available third-party AES implementations and other hardware designs such as CPUs (e.g., from OpenCores). Note that our considered AES designs (1 - 7) utilize different Sbox implementation strategies (e.g., precomputed lookup table or composite-field) [137]. To demonstrate the reliability of the graph similarity algorithms, we also provide results for non-cryptographic general-purpose designs, i.e. design 8 (I2C), design 9 (12-bit PIC CPU), and design 10 (MSP430 CPU). Table 4.2 provides further details for each hardware design, such as its resource consumption and origin.
We want to emphasize that design 1 (composite-field Sbox implementation) should exhibit the highest similarity among all designs in this case study, since the composite-field Sbox of this design is our small reference subgraph G1. Also, we expect design 5 (PPRM-based Sbox implementation) and design 6 (ANF-based Sbox implementation) to cause high similarity values, since both Sbox

implementation strategies are mainly based on AND-XOR gates. In addition, note that both phase 1 and phase 2 are used in this case study.

Device     Synthesis Option  Algorithm  Hardware Design: 1      2      3      4      5      6      7      8      9     10
xc6slx16   speed             GED                     0.921  0.822  0.779  0.896  0.945  0.908  0.705  0.768  0.826  0.952
xc6slx16   area              GED                     0.915  0.763  0.768  0.898  0.942  0.903  0.704  0.719  0.791  0.942
Computation Time                                     3.22s  1.81s  2.66s  10.8s  3.44s  19.0s  0.87s  0.30s  2.31s  25.1s
xc7k70t    speed             GED                     0.944  0.766  0.758  0.900  0.945  0.912  0.695  0.707  0.799  0.951
xc7k70t    area              GED                     0.950  0.762  0.768  0.896  0.953  0.915  0.695  0.719  0.791  0.942
Computation Time                                     3.51s  1.74s  3.32s  10.7s  3.77s  20.1s  1.00s  0.29s  2.60s  29.7s
xc6vlx75t  speed             GED                     0.910  0.763  0.758  0.897  0.942  0.920  0.695  0.748  0.799  0.950
xc6vlx75t  area              GED                     0.922  0.762  0.768  0.898  0.941  0.916  0.695  0.719  0.791  0.943
Computation Time                                     3.61s  1.81s  3.37s  9.15s  4.26s  19.4s  0.79s  0.26s  2.27s  24.7s
xc6slx16   speed             NM                      0.770  0.610  0.571  0.699  0.779  0.783  0.526  0.615  0.641  0.773
xc6slx16   area              NM                      0.763  0.567  0.573  0.665  0.752  0.757  0.541  0.575  0.622  0.745
Computation Time                                     54.7s  26.9s  36.9s  4.01m  59.3s  10.7m  14.2s  6.96s  36.6s  4.05m
xc7k70t    speed             NM                      0.805  0.578  0.564  0.710  0.783  0.774  0.425  0.583  0.629  0.782
xc7k70t    area              NM                      0.809  0.567  0.573  0.665  0.757  0.765  0.420  0.576  0.622  0.754
Computation Time                                     49.6s  11.8s  28.6s  3.75m  58.5s  9.18m  7.84s  5.33s  30.7s  4.16m
xc6vlx75t  speed             NM                      0.755  0.579  0.564  0.708  0.786  0.778  0.425  0.605  0.629  0.780
xc6vlx75t  area              NM                      0.773  0.567  0.573  0.665  0.758  0.768  0.420  0.576  0.622  0.749
Computation Time                                     53.1s  16.1s  30.5s  3.98m  7.84s  8.22m  14.1s  7.12s  36.5s  3.83s
GED - Graph edit distance approximation; NM - Neighbour matching using ε = 0.0001

Table 4.3: Gate-level netlist reverse engineering case study results (phase 1). The AES Sbox (composite field) is synthesized for xc6slx16 with optimization goal area. Parameters subgraph and label are true for all experiments, and only the combinational logic subgraph preprocessing technique is used. Computation time is the arithmetic mean over both synthesis options for each algorithm.
A similarity score of 1 indicates that both graphs are identical, whereas a score of 0 indicates that both graphs have no similarity.

Device     Synthesis Option  Algorithm  Hardware Design: 1      5      6     10
xc6slx16   speed             GED                     0.899  0.914  0.885  0.818
xc6slx16   area              GED                     0.899  0.899  0.887  0.818
Computation Time                                     10.3h  11.4h  11.5h    68h

Table 4.4: Gate-level netlist reverse engineering case study results (phase 2). The AES Sbox (composite field) is synthesized for xc6slx16 with optimization goal area. Parameters subgraph and label are true for all experiments, and the combinational logic bitslice and LUT decomposition preprocessing techniques are used. Computation time is the arithmetic mean over both synthesis options. A similarity score of 1 indicates that both graphs are identical, whereas a score of 0 indicates that both graphs have no similarity.

Results (Phase 1)

Our evaluation results for the phase 1 analysis (combinational logic subgraph) are summarized in Table 4.3, Figure 4.3, and Figure 4.4.


Table 4.3 shows the results of the graph edit distance approximation and the neighbour matching for designs 1 - 10, three Xilinx FPGA families, and both synthesis optimization goals speed and area. The graph edit distance approximation indicates a high similarity ≥ 0.9 to the composite-field Sbox for designs 1, 4, 5, 6, and 10, independent of FPGA family and optimization goal. The neighbour matching indicates a high similarity ≥ 0.77 for designs 1, 5, 6, and 10, independent of FPGA family and optimization goal.
Figure 4.3 shows the results of our multiresolutional spectral analysis for several representative designs. Similar to graph edit distance and neighbour matching, we see that the distance matrices indicate a high similarity for design 1, since it has sufficiently small spectral distances (< 0.05), and that designs 5, 6, and 10 exhibit some similarity to the composite-field Sbox. Note that we selected three vertices after ranking: (1) the top-ranked vertex (marked in black), (2) the 75%-quantile vertex (marked in blue), and (3) the 50%-quantile vertex (marked in green). Moreover, the top-10 candidate vertices with the smallest spectral distance to the three chosen vertices are actual Sbox vertices with an accuracy of 100% (top-ranked), 100% (75%-quantile), and 70% (50%-quantile). Thus, a statistical test on the 27 out of 30 true-positive vertices rejects the null hypothesis with high significance (binomial test, p-value < 10^−7).
Figure 4.4 shows distance matrices where the designs have been synthesized with the other optimization goal. Overall, we see that design 1 exhibits a high similarity, since its spectral distance matrices possess most vertices with distance < 0.1 for the top-ranked and 75%-quantile vertices. Similar to the graph edit distance approximation, designs 5, 6, and 10 show similarities for several vertices, but clearly less than design 1.
Both lookup-table-based Sbox designs 3 and 4 scarcely show similarities to the composite-field Sbox. We want to emphasize that the results for the other FPGA families are similar to Figure 4.4, hence we deliberately did not provide further evaluation figures. A majority vote over graph edit distance, neighbour matching, and spectral analysis thus yields high similarities for designs 1, 5, 6, and 10, and these three similarity algorithms together provide an accurate and reliable measure. Note that the similarity scores are determined within seconds to minutes for all algorithms.
We want to emphasize that the combinational logic subgraph which exhibits the highest similarity for design 10 is a register of 81 FFs handling memory access. Obviously, such a register size is implausible for an AES implementation, hence a reverse engineer would put this result aside in practice and focus on designs 1, 5, and 6, since their Sbox candidate gates are identified for a 128-bit subgraph, i.e. the actual AES datapath registers. Note that our graph similarity analysis yields 63 gates that have to be analyzed further (e.g., with Boolean function analysis or manually).
Negative Results for Other Parameters. In addition to the aforementioned results, we present several counterexamples which demonstrate why our selected parameters and preprocessing techniques perform best.
If we compute the similarity of the composite-field Sbox to design 1 (AES using the composite-field Sbox) using the GED algorithm in phase 1 with subgraph = false and label = true, we obtain a low similarity value of 0.239 in 15s (compared to the high similarity value of 0.915 for subgraph = true). Even though GED determines the correct Sbox gates in design 1, the low similarity value occurs due to the original similarity computation equation, see Section 4.2.4.
If we compute the similarity of the composite-field Sbox to design 1 and all other AES designs (designs 2 - 7) using the GED algorithm in phase 1 with subgraph = true and label = false, we obtain high similarity values ranging between 0.963 and 1.000. Thus, without a check of the vertex labels,

similarity values cannot be used for effective distinction. If we compute the similarity of the composite-field Sbox to design 1 using the GED algorithm without any preprocessing technique, with subgraph = true and label = true, we obtain a high similarity value of 0.943 in 2s (compared to the similarity value of 0.915 in 3s). Even though the similarity value is higher, the matched gates in design 1 contain around 10% registers which are erroneously matched to combinational logic gates of the Sbox circuit. Additionally, for the parameters subgraph = false and label = false, we obtain a similarity value of 0.0035 in 5m. Hence, our phase 1 preprocessing generally prevents a mismatch between combinational and synchronous gates and thus increases accuracy and reduces computation time.

Results (Phase 2)

Since the phase 1 analysis indicates a false-positive similarity score for design 10, we now provide the results of our more robust phase 2 (LUT decomposition and combinational logic bitslice) similarity analysis. Note that our LUT decomposition unifies the different FPGA families, hence we deliberately selected Spartan-6 as a representative, as the other families yield similar results. Also, we selected the graph edit distance approximation since it requires the least computation time in phase 1.
Our evaluation results are summarized in Table 4.4. We see that the graph edit distance approximation indicates a high similarity ∼ 0.9 to bitslices of the composite-field Sbox for designs 1, 5, and 6, while design 10 exhibits a similarity of only ∼ 0.8; thus we have evidence that the composite-field Sbox is unlikely to be present in design 10. Hence, we have a certain degree of confidence that a composite-field Sbox gate structure is within designs 1, 5, and 6, but not in design 10.
In summary, we demonstrated that graph similarity algorithms can indeed be utilized for automated and reliable gate-level netlist reverse engineering. To this end, graph edit distance approximation, neighbour matching, and spectral analysis should be used in concert to report reliable and accurate similarity. In case the phase 1 analysis yields high similarities for more than one design, the phase 2 analysis should be used to obtain more reliable and accurate similarity results. To the best of our knowledge, we are the first to demonstrate automatic reverse engineering of composite-field-based Sboxes. So far it had only been demonstrated that precomputed LUT Sboxes can be automatically identified in third-party IP cores [78, 79].
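The "use in concert" rule can be phrased as a simple majority vote; the thresholds below mirror the phase 1 observations in this section (GED ≥ 0.9, neighbour matching ≥ 0.77, spectral distance < 0.05), but wrapping them in a voting function is our own illustrative choice:

```python
def majority_match(ged, nm, spectral_dist,
                   t_ged=0.9, t_nm=0.77, t_spec=0.05):
    """Report a candidate match only if at least two of the three
    similarity algorithms agree (spectral analysis votes via distance)."""
    votes = [ged >= t_ged, nm >= t_nm, spectral_dist < t_spec]
    return sum(votes) >= 2
```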

4.3.3 Case Study II: Trojan Detection

Over the past decade, numerous works have addressed the emerging threat of hardware Trojans, since current IC design and fabrication practices rely on untrusted entities (e.g., untrusted third-party IP cores or an untrusted offshore fab). To counteract this threat, and inspired by control-flow-graph-based malicious software detection approaches [110], we evaluated whether graph similarity algorithms can be leveraged to reliably detect hardware Trojans in gate-level netlists of potentially untrusted third-party IP cores.

Hardware Designs

We obtained the publicly available hardware Trojan benchmark AES-T1000 from the Trust-Hub benchmark suite [70]. Design 11 refers to the AES-T1000 without the Trojan, design 12 refers to the AES-T1000 including the Trojan, and design 13 is the Trojan itself. The Trojan leaks


Device     Synthesis Option  Algorithm  Hardware Design: 11     12      1      5      6      7      8      9     10
xc6slx16   speed             GED                     0.779  1.000  0.699  0.696  0.753  0.702  0.867  0.865  0.958
xc6slx16   area              GED                     0.722  1.000  0.699  0.691  0.699  0.699  0.870  0.865  0.945
Computation Time                                     28.0s  29.8s  2.53s  2.51s  13.0s  0.74s  0.19s  1.87s  24.7s
xc7k70t    speed             GED                     0.779  1.000  0.696  0.691  0.751  0.702  0.808  0.865  0.958
xc7k70t    area              GED                     0.715  1.000  0.699  0.699  0.702  0.699  0.870  0.865  0.943
Computation Time                                     8.02s  9.65s  2.70s  2.58s  14.9s  0.69s  0.19s  2.04s  25.0s
xc6vlx75t  speed             GED                     0.772  1.000  0.696  0.694  0.756  0.702  0.865  0.865  0.958
xc6vlx75t  area              GED                     0.753  1.000  0.699  0.699  0.699  0.699  0.870  0.865  0.945
Computation Time                                     6.72s  9.73s  2.67s  2.42s  12.7s  0.71s  0.21s  1.94s  26.2s
xc6slx16   speed             NM                      0.692  0.825  0.630  0.615  0.670  0.474  0.691  0.614  0.767
xc6slx16   area              NM                      0.639  0.825  0.619  0.592  0.647  0.475  0.681  0.609  0.811
Computation Time                                     2.08m  3.05m  25.6s  31.8s  10.4m  7.42s  3.84s  19.5s  2.85m
xc7k70t    speed             NM                      0.595  0.825  0.631  0.615  0.669  0.468  0.642  0.610  0.766
xc7k70t    area              NM                      0.541  0.825  0.627  0.611  0.645  0.471  0.681  0.609  0.811
Computation Time                                     45.5s  1.06m  30.3s  34.8s  8.75m  7.32s  3.12s  19.8s  2.67m
xc6vlx75t  speed             NM                      0.580  0.825  0.629  0.618  0.669  0.468  0.689  0.610  0.762
xc6vlx75t  area              NM                      0.541  0.825  0.618  0.615  0.645  0.471  0.681  0.609  0.811
Computation Time                                     43.6s  1.02m  27.9s  36.3s  6.53m  8.09s  2.56s  19.2s  2.58m
GED - Graph edit distance approximation; NM - Neighbour matching using ε = 0.0001

Table 4.5: Trojan detection case study results (phase 1). The Trojan (design 13) is synthesized for xc6slx16 with optimization goal area. Parameters subgraph and label are true in all experiments, and only the combinational logic subgraph preprocessing technique is used. Computation time is the arithmetic mean over both synthesis options for each algorithm. A similarity score of 1 indicates that both graphs are identical, whereas a score of 0 indicates that both graphs have no similarity.
the AES key for a predefined input plaintext through a covert power side-channel using a code-division multiple access sequence. More specifically, an LFSR-based Pseudo Random Number Generator (PRNG), initialized with the input plaintext, is used to XOR-modulate the secret key, and the output of the XOR gate is connected to 8 identical FF gates to mimic a large capacitance.
To demonstrate the reliability of the graph similarity algorithms, we also provide results on whether the Trojan design exhibits high similarities to other benign designs, i.e. designs 1, 5, 6, and 7 (AES circuits using different Sbox implementation strategies), design 8 (I2C), design 9 (12-bit PIC), and design 10 (MSP430). Table 4.2 provides the resource consumption of each hardware design. We want to emphasize that design 12 (AES-T1000 including the Trojan) should exhibit the highest similarity to the Trojan among all designs in this case study.
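The leakage mechanism can be illustrated with a toy model: an LFSR-based PRNG seeded by the plaintext XOR-modulates the key bits. The 4-bit width and feedback taps below are hypothetical illustration values, not the actual AES-T1000 parameters:

```python
def lfsr_step(state, taps=0b1100, width=4):
    """One step of a Fibonacci LFSR: shift left and feed back the
    parity of the tapped bits (toy parameters, maximal period 15)."""
    fb = bin(state & taps).count('1') & 1
    return ((state << 1) | fb) & ((1 << width) - 1)

def modulate(key_bits, seed):
    """CDMA-style modulation: XOR each key bit with one PRNG bit."""
    state, out = seed, []
    for k in key_bits:
        state = lfsr_step(state)
        out.append(k ^ (state & 1))
    return out
```

Since the PRNG stream is fully determined by the seed, XOR-ing twice with the same stream recovers the key, which is exactly what makes the covert channel decodable by the attacker.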

Results (Phase 1)

Our evaluation results are summarized in Table 4.5 and Figure 4.5. Table 4.5 shows the results of the graph edit distance approximation and the neighbour matching for designs 1, 5, 6, 7, 8, 9, 10, 11, and 12, three Xilinx FPGA families, and both synthesis optimization goals speed and area. The graph edit distance approximation indicates the highest similarity of 1.0 for design 12, independent of FPGA family and optimization goal, followed by the MSP430 processor with ∼ 0.95. The

neighbour matching also indicates a high similarity of 0.825 for design 12, independent of FPGA family and optimization goal.
Figure 4.5 shows the results of our multiresolutional spectral analysis for several representative designs. We selected three vertices after ranking: (1) the top-ranked vertex (marked in black), (2) the 75%-quantile vertex (marked in blue), and (3) the 50%-quantile vertex (marked in green). Based on the results in Table 4.5, we selected designs 9 - 12, and similar to the graph edit distance approximation and the neighbour matching, the spectral analysis does identify a higher similarity to the hardware Trojan for design 12 and less similarity for designs 9 and 11. Note that the vertices with the smallest distance to the 50%-quantile vertex (marked in green) in design 12 belong to the actual Trojan. Since we have already underpinned the robustness of our spectral analysis regarding different FPGA families and optimization goals, see Section 4.3.2, we deliberately omit these results here. Note that the similarity scores are determined within seconds to minutes for all algorithms. Furthermore, other parameter configurations for the graph similarity algorithms and preprocessing techniques yield similar negative results, see Section 4.3.2.
Since our evaluation results indicate that design 12 is the most similar design in the phase 1 analysis (and we actually identify the hardware Trojan), we deliberately did not evaluate phase 2. In summary, we demonstrated that graph similarity can indeed be utilized for automated and reliable hardware Trojan detection in untrusted third-party IP cores.
To this end, graph edit distance approximation, neighbor matching, and our spectral analysis should be used in concert to report accurate and reliable similarity values for hardware Trojan detection.
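The "in concert" strategy above can be sketched as a simple majority vote over the three similarity scores. The function name and the decision thresholds below are illustrative assumptions, not the thesis implementation:

```python
# Hypothetical sketch: combine three graph-similarity scores by majority vote.
# A design is flagged as a likely Trojan match only if at least two of the
# three heuristics exceed their (assumed) decision thresholds.

def majority_vote(ged: float, nm: float, spectral: float,
                  thresholds=(0.9, 0.8, 0.5)) -> bool:
    """Return True if at least two heuristics report high similarity.

    ged      - graph edit distance approximation score in [0, 1]
    nm       - neighbour matching score in [0, 1]
    spectral - spectral-analysis match score mapped to [0, 1]
    """
    votes = [score >= t for score, t in zip((ged, nm, spectral), thresholds)]
    return sum(votes) >= 2

# Scores in the spirit of the design 12 results (GED 1.0, NM >= 0.852):
print(majority_vote(1.0, 0.852, 0.7))   # Trojan-infected design -> flagged
print(majority_vote(0.74, 0.62, 0.3))   # unrelated design -> not flagged
```

Requiring agreement of at least two heuristics avoids a false report when a single algorithm produces an outlier score.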

4.3.4 Case Study III: Obfuscation Assessment

Over the past decades, numerous hardware obfuscation transformations have been developed to protect valuable IP from reverse engineering and other threats; see Shakya et al. [20] for a comprehensive overview. However, the development of practical and sound obfuscation schemes is challenging, since metrics for obfuscation are hard to quantify (e.g., how much effort has to be invested to break the obfuscation) and require quantifying the capabilities of a human reverse engineer, which is a challenging and still unsolved problem [5]. In our third case study, we evaluate the application of graph similarity analysis for the assessment of obfuscation transformations. The ideal goal of an obfuscation transformation is to destroy any relation between an unobfuscated circuit C1 and its obfuscated version O(C1), so a graph similarity analysis of C1 and O(C1) should not yield a higher similarity than for other circuits C2, ..., Cn implementing a different functionality. Otherwise (if there is a significant similarity of C1 to O(C1)), we can derive critical information from the obfuscated circuit and thus circumvent the obfuscation. We acknowledge that graph similarity analysis is obviously not sufficient to entirely quantify a degree of obfuscation; however, it supports obfuscation designers with a valuable metric indicating the topological difference induced by a transformation.

Hardware Obfuscation Transformations. In order to demonstrate a typical obfuscation assessment, we selected the obfuscation transformation proposed by Li et al. [64] targeting obfuscation of sequential circuits. In particular, we re-implemented the conditional stuttering and sweep transformations for the 32-bit Greatest Common Divisor (GCD) circuit as proposed. Note that we deliberately did not choose FSM-based obfuscation transformations [24, 23, 138], since they do not or only marginally alter a circuit's datapath and thus do not induce large topological differences; our approach evidently works for and detects datapath circuits such as an Sbox, see Section 4.3.2.
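The assessment criterion above can be expressed as a small check: a transformation leaks topological information if the unobfuscated circuit C1 is noticeably more similar to O(C1) than every unrelated circuit. The similarity values below are in the spirit of the case study, and the decision margin is an illustrative assumption:

```python
# Hedged sketch of the obfuscation-assessment criterion: an obfuscation
# transformation O leaks information if sim(C1, O(C1)) clearly exceeds
# sim(Ci, O(C1)) for every unrelated circuit Ci.

def obfuscation_leaks(sim_to_obfuscated: dict, original: str,
                      margin: float = 0.02) -> bool:
    """sim_to_obfuscated maps design name -> similarity to O(C1) in [0, 1].

    Returns True if the unobfuscated original stands out by at least
    `margin` over all unrelated designs (i.e., the obfuscation is weak).
    """
    others = [v for k, v in sim_to_obfuscated.items() if k != original]
    return sim_to_obfuscated[original] >= max(others) + margin

# Illustrative scores for the obfuscated GCD (design 15) comparison:
scores = {"design 14 (GCD)": 0.975, "design 1": 0.741, "design 10": 0.926}
print(obfuscation_leaks(scores, "design 14 (GCD)"))  # True: original stands out
```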

Hardware Design To assess the obfuscation scheme, we evaluated the two GCD circuits (design 14, unobfuscated, and design 15, obfuscated). Moreover, we selected several cryptographic designs (designs 1, 5, 6, and 7) and several general-purpose hardware designs: design 8 (I2C), design 9 (12-bit PIC), and design 10 (MSP430). Table 4.2 provides the resource consumption for each hardware design. Design 14 (unobfuscated) should exhibit the highest similarity among all designs in this case study, since we compare all designs with the obfuscated GCD circuit (design 15).

Device     Synthesis  Algorithm        Hardware Design
           Option                      14     1      5      6      7      8      9      10
xc6slx16   speed      GED              0.975  0.741  0.754  0.759  0.643  0.816  0.841  0.926
xc6slx16   area       GED              0.979  0.738  0.749  0.750  0.000  0.808  0.831  0.937
           Computation Time            2.52s  11.8s  14.3s  66.1s  1.93s  1.91s  16.6s  2.51m
xc7k70t    speed      GED              0.975  0.755  0.756  0.764  0.000  0.000  0.840  0.925
xc7k70t    area       GED              0.979  0.750  0.750  0.751  0.000  0.808  0.831  0.942
           Computation Time            3.46s  13.1s  14.2s  72.7s  1.67s  0.44s  15.5s  2.56m
xc6vlx75t  speed      GED              0.975  0.734  0.754  0.762  0.000  0.801  0.839  0.926
xc6vlx75t  area       GED              0.979  0.728  0.750  0.750  0.000  0.808  0.831  0.942
           Computation Time            3.45s  13.3s  13.3s  63.8s  1.61s  1.87s  16.3s  2.56m
xc6slx16   speed      NM               0.973  0.624  0.627  0.687  0.000  0.584  0.573  0.717
xc6slx16   area       NM               0.970  0.603  0.611  0.680  0.000  0.532  0.566  0.692
           Computation Time            22.2s  6.11m  6.58m  44.3s  1.21s  23.2s  2.28m  24.9m
xc7k70t    speed      NM               0.973  0.623  0.627  0.687  0.000  0.000  0.570  0.712
xc7k70t    area       NM               0.970  0.613  0.613  0.682  0.000  0.531  0.566  0.686
           Computation Time            10.8s  4.96m  5.08m  37.1m  0.89s  19.4s  2.21m  23.5m
xc6vlx75t  speed      NM               0.973  0.607  0.649  0.688  0.000  0.522  0.570  0.710
xc6vlx75t  area       NM               0.970  0.610  0.627  0.683  0.000  0.531  0.566  0.688
           Computation Time            9.58s  5.90m  6.08m  36.3m  1.32s  18.8s  2.25m  23.2m

GED - graph edit distance approximation; NM - neighbour matching using ε = 0.0001

Table 4.6: Hardware obfuscation assessment case study results (phase 1). GCD (design 15) is synthesized for xc6slx16 with optimization goal area. Parameters subgraph and label are true, and only the combinational logic subgraph preprocessing technique is used. Computation time is the arithmetic mean over both synthesis options for each algorithm. A similarity score of 1 indicates that both graphs are identical, whereas a score of 0 indicates that both graphs have no similarity.

Results (Phase 1) Our evaluation results are summarized in Table 4.6 and Figure 4.6. Table 4.6 shows results for graph edit distance approximation and neighbour matching. Overall, we see that both algorithms indicate design 14 as highly similar to the obfuscated GCD (similarity values in [0.97, 1.0]), independent of FPGA family and synthesis optimization goal.

Figure 4.6 shows results for our spectral analysis. Based on the results in Table 4.6, we selected designs 14 and 10 as representatives, since design 10 also exhibits high similarities for graph edit distance approximation. We selected three vertices after ranking: (1) the top-ranked vertex (marked in black), (2) the 75%-quantile vertex (marked in blue), and (3) the 50%-quantile vertex (marked in green). Similar to graph edit distance approximation, both designs exhibit high similarities, since both spectral distance matrices possess vertices with distances in [0.5, 0.7]. Since we have already underpinned the robustness of our spectral analysis regarding different FPGA families and optimization goals, see Section 4.3.2, we deliberately omit these results here. Similarity scores are determined within seconds to minutes for all algorithms. Furthermore, other parameter configurations for the graph similarity algorithms and preprocessing techniques yield similar negative results, see Section 4.3.2.

Since our evaluation results indicate design 14 as the most similar design in the phase 1 analysis, we deliberately do not further investigate phase 2. In summary, we demonstrated that graph similarity algorithms provide a valuable metric to indicate an obfuscation degree, supporting both designers of obfuscation transformations and engineers instantiating them in their designs. For the selected GCD circuit, we see that the topological difference induced by the obfuscation transformation may not be sufficient to hamper reverse engineering. Furthermore, we want to emphasize that this assessment approach scales to larger designs including multiple IP cores, since we analyze register groups with our combinational logic subgraph preprocessing.
To report a reliable and accurate metric, graph edit distance approximation, neighbor matching and spectral analysis should be used in concert.

4.4 Discussion

Implications. In the previous case studies, we have demonstrated that graph similarity has a variety of applications in hardware security. We have shown that graph similarity heuristics indeed provide accurate and reliable hot spots while keeping analysis time practical. Our case studies demonstrated that, in general, graph edit distance approximation, neighbor matching, and spectral analysis should be used in concert (with a majority vote) to report reliable and accurate similarity values, using labeled vertices, subgraph adjustments, and our two-phased analysis. As noted in Section 4.3.1, we also evaluated other similarity heuristics and subgraph isomorphism algorithms; however, their computation time or accuracy turned out to be impractical for larger graphs. We want to emphasize that we deliberately did not perform a pair-wise comparison of large designs, since our main focus is to find small modules (e.g., hardware Trojans or datapath circuits) rather than comparing the similarity of two large designs.

Generality. Our approach scales even to larger designs including numerous IP cores, because only the number of combinational logic subgraphs increases but not their size. Moreover, subgraphs can be processed in parallel. In addition, our graph similarity heuristics are not specific to FPGAs and can be applied to ASIC gate-level netlists as well. Note that similarity analyses can also be leveraged during IC design, simulation, verification, and testing phases to improve designers' productivity for IP reuse [139, 140, 141]. Such approaches are orthogonal to reverse engineering since high-level information is already available.


Theoretical Limitations. We acknowledge that our work has limitations with respect to statements regarding theoretical bounds or proofs of convergence. Proofs or statements of soundness for the presented graph similarity heuristics are highly desirable from a theoretical point of view. However, they are an open challenge and out of scope of our work. In particular, for our multiresolutional spectral analysis they must consider the distribution of the test statistics "under the alternative" (a non-zero local difference between the graphs). Moreover, in practice, a reverse engineer is interested in accurate and reliable hot spots determined in practical computation time, so that he can investigate an identified subcircuit in more detail.

Future Work. Research on graph similarity can be extended in several directions. For example, similarity analysis might be used to analyze the security of split-manufacturing schemes, since the isomorphism property might be too rigid as a security criterion, as explained before. Furthermore, our evaluation could be extended to a variety of synthesizers to examine reliability across different synthesis tools. As noted before, future work may also explore theoretical bounds or proofs of convergence to support statements about similarity algorithms.

4.5 Conclusion

In this chapter, we presented the graph similarity problem for the first time in the domain of hardware security, particularly for reverse engineering, Trojan detection, and assessment of obfuscation. To this end, we significantly improved graph similarity heuristics with optimizations tailored to hardware designs. Furthermore, we introduced a new technique based on a multiresolutional spectral graph analysis. In our three case studies, we demonstrated the practical feasibility of graph similarity for different FPGA families and several design optimization goals. Particularly, our results showed that graph edit distance approximation, neighbor matching, and our spectral analysis should be used in concert to report accurate and reliable similarity scores.


(a) Design 1 (AES composite-field Sbox). (b) Design 3 (AES table-based Sbox).

(c) Design 4 (AES table-based Sbox). (d) Design 10 (MSP430 processor).

(e) Design 5 (AES PPRM-based Sbox). (f) Design 6 (AES ANF-based Sbox).

Figure 4.3: Gate-level netlist reverse engineering case study results for our multiresolutional spectral analysis without any preprocessing techniques. The AES Sbox (composite field) is synthesized for xc6slx16 with optimization goal area; the representative designs in a) - f) are also synthesized for xc6slx16 with optimization goal area. Shown are the top-ranked vertex (marked in black), the 75%-quantile vertex (marked in blue), and the 50%-quantile vertex (marked in green). The y-axis shows the spectral distance, and the x-axis shows the vertex labels.


(a) Design 1 (AES composite-field Sbox). (b) Design 3 (AES table-based Sbox).

(c) Design 4 (AES table-based Sbox). (d) Design 10 (MSP430 processor).

(e) Design 5 (AES PPRM-based Sbox). (f) Design 6 (AES ANF-based Sbox).

Figure 4.4: Gate-level netlist reverse engineering case study results for our multiresolutional spectral analysis without any preprocessing techniques. The AES Sbox (composite field) is synthesized for xc6slx16 with optimization goal area; the representative designs in a) - f) are also synthesized for xc6slx16 with optimization goal speed. Shown are the top-ranked vertex (marked in black), the 75%-quantile vertex (marked in blue), and the 50%-quantile vertex (marked in green). The y-axis shows the spectral distance, and the x-axis shows the vertex labels.


(a) Design 9 (12-bit PIC). (b) Design 11 (AES without Trojan).

(c) Design 12 (AES with Trojan).

Figure 4.5: Trojan detection case study results for our multiresolutional spectral analysis without any preprocessing techniques. The Trojan (design 13) is synthesized for xc6slx16 with optimization goal speed; the representative designs 9, 11, and 12 are also synthesized for xc6slx16 with optimization goal area. Shown are the top-ranked vertex (marked in black), the 75%-quantile vertex (marked in blue), and the 50%-quantile vertex (marked in green). The y-axis shows the spectral distance, and the x-axis shows the vertex labels.


(a) Design 14 (GCD with obfuscation). (b) Design 10 (MSP430 processor).

Figure 4.6: Hardware obfuscation assessment case study results for our multiresolutional spectral analysis without any preprocessing techniques. GCD (design 15) is synthesized for xc6slx16 with optimization goal area; designs 14 and 10 are also synthesized for xc6slx16 with optimization goal area. Shown are the top-ranked vertex (marked in black), the 75%-quantile vertex (marked in blue), and the 50%-quantile vertex (marked in green). The y-axis shows the spectral distance, and the x-axis shows the vertex labels.


Chapter 5 On the Difficulty of FSM-based Hardware Obfuscation

Motivation. In today's IC production chains, a designer's valuable IP is transparent to diverse stakeholders and thus inevitably prone to piracy. To protect against this threat, numerous defenses based on the obfuscation of a circuit's control path, i.e., its FSM, have been proposed and are commonly believed to be secure. However, the security of these sequential obfuscation schemes is doubtful, since realistic capabilities of reverse engineering and subsequent manipulation have been commonly neglected in their security analysis.

Contents of this Chapter

5.1 Introduction
5.2 Automated FSM Reverse Engineering
 5.2.1 Phase 1: Topological Analysis
 5.2.2 Phase 2: Boolean Function Analysis
5.3 Reverse Engineering and Deobfuscation of FSM Obfuscation Schemes
 5.3.1 HARPOON
 5.3.2 Dynamic State Deflection
 5.3.3 Active Hardware Metering
 5.3.4 Interlocking Obfuscation
 5.3.5 Lessons Learned
5.4 Evaluation
 5.4.1 Case Study: Cryptographic Designs
 5.4.2 Case Study: Communication Interfaces
5.5 Discussion
5.6 Conclusion

Contribution. Parts of this chapter have been previously published in IACR Transactions on Cryptographic Hardware and Embedded Systems [142].


5.1 Introduction

Goals and Contributions. In this chapter, we focus on reverse engineering and obfuscation of FSMs in third-party, gate-level netlists. Our goal is to demonstrate the shortcomings of allegedly secure, state-of-the-art FSM obfuscation schemes by (semi-)automatic reverse engineering and manipulation. To this end, we address reverse engineering techniques that deduce high-level information under realistic adversarial capability assumptions. We then carefully review obfuscation schemes and show how their protection can be defeated. In summary, our main contributions are:

• Deobfuscation of FSM Obfuscation Schemes. We practically demonstrate the semi-automated deobfuscation of several allegedly secure FSM-based obfuscation schemes. In concert with realistic reverse engineering capabilities, we provide comprehensive insights into published security metrics and previous (erroneous) assumptions about reverse engineering to serve as an educational basis for future obfuscation designers and implementers.

• Novel Technique and Comprehensive Evaluation. We augment state-of-the-art reverse engineering algorithms to disclose high-level FSM information from FPGA gate-level netlists. We show that the algorithm is effective for several hardware designs while keeping analysis times practical.

5.2 Automated FSM Reverse Engineering

Preliminaries. From a high-level perspective, an FSM is a computational model that can be in exactly one of a finite number of states at any time. An FSM switches state depending on inputs and its current state and generates outputs to control the operations of other units. Two FSM types can be distinguished: the output of a Moore machine depends solely on the current state, and the output of a Mealy machine depends on both the current state and FSM input. To be more precise, we use the notation in Definition 5 throughout the rest of this chapter. Definition 5 (Finite State Machine). We define a Finite State Machine by a 6-tuple (S, I, δ, s0, O, λ): S is a finite set of states, I is the input alphabet, δ : S × I → S is the state transition function, s0 ∈ S is the initial state, O is a finite set of output symbols, and λ is the output function (λ: S → O for a Moore machine, λ: S × I → O for a Mealy machine).
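Definition 5 can be mirrored in executable form. The class layout and the two-state toggle machine below are illustrative assumptions; a Moore machine is shown, since its output depends only on the current state:

```python
# Minimal sketch of Definition 5 as Python types (names are illustrative).
# For a Mealy machine, lam would map (state, input) pairs instead of states.
from dataclasses import dataclass

@dataclass
class MooreFSM:
    states: set      # S  - finite set of states
    inputs: set      # I  - input alphabet
    delta: dict      # δ: (state, input) -> next state
    s0: str          # initial state
    outputs: set     # O  - finite set of output symbols
    lam: dict        # λ: state -> output (Moore machine)

    def run(self, word):
        """Return the output sequence produced while consuming `word`."""
        s = self.s0
        out = [self.lam[s]]
        for i in word:
            s = self.delta[(s, i)]
            out.append(self.lam[s])
        return out

# Two-state toggle FSM: input '1' switches the state, '0' keeps it.
fsm = MooreFSM(
    states={"A", "B"}, inputs={"0", "1"},
    delta={("A", "0"): "A", ("A", "1"): "B",
           ("B", "0"): "B", ("B", "1"): "A"},
    s0="A", outputs={0, 1}, lam={"A": 0, "B": 1},
)
print(fsm.run("101"))  # [0, 1, 1, 0]
```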

[Figure: block diagram with Input, State Transition Logic, State Memory, and Output Logic blocks]

Figure 5.1: Block diagram of a hardware FSM (dashed line in the case of a Mealy machine).

Figure 5.1 illustrates the high-level structure of an FSM in hardware. An FSM consists of three parts: (1) the state transition logic that implements δ, (2) the memory storing the current state that implements S, and (3) the output logic that implements λ.

68 5.2. Automated FSM Reverse Engineering

State Encoding. Several FSM state encoding styles exist to satisfy diverse optimization goals such as speed or power consumption. Since the encoding affects the hardware implementation (and consequently reverse engineering), we summarize the most common styles:

• Binary. Each state is numbered sequentially in order of appearance (starting from 0). Thus, all states can be represented with a ⌈log2(|S|)⌉-bit register. Consequently, the amount of utilized registers is minimized, but the amount of state transition logic is increased.

• Gray. Similar to binary state encoding, Gray-encoded states can be represented with a ⌈log2(|S|)⌉-bit register. Based on the employed Gray code, consecutive state values only differ by one bit, which reduces the amount of combinational logic needed for state transitions and minimizes power consumption.

• One-Hot. Each state is represented with a |S|-bit register. Hence, all register bits except one are equal to 0 at any time. This encoding increases the amount of registers, but simplifies state transition logic in each path to achieve higher clock frequency.

Note that from an obfuscation point of view, binary-like encodings are preferable over one-hot encodings, since the latter grow linearly with the number of states while the former grow logarithmically. While reverse engineering the chosen state encoding, an analyst can gather valuable information about design strategies. For example, power consumption minimization can be assumed in the case of Gray encoding.
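The three encoding styles can be sketched as small generators. The helper names are illustrative; the register widths follow the ⌈log2(|S|)⌉ vs. |S| discussion above:

```python
# Sketch of the three state encoding styles for |S| states.
from math import ceil, log2

def binary_encoding(n):
    """Sequential codes in a ceil(log2(n))-bit register."""
    width = max(1, ceil(log2(n)))
    return [format(i, f"0{width}b") for i in range(n)]

def gray_encoding(n):
    """Reflected Gray code: consecutive codes differ in exactly one bit."""
    width = max(1, ceil(log2(n)))
    return [format(i ^ (i >> 1), f"0{width}b") for i in range(n)]

def one_hot_encoding(n):
    """One flip-flop per state: an n-bit register with a single 1 bit."""
    return [format(1 << i, f"0{n}b") for i in range(n)]

states = 5
print(binary_encoding(states))   # 3-bit register (ceil(log2(5)) = 3)
print(gray_encoding(states))     # consecutive codes differ in one bit
print(one_hot_encoding(states))  # 5-bit register, one bit per state
```

For 5 states, binary and Gray need a 3-bit register while one-hot needs 5 bits, illustrating the linear vs. logarithmic growth noted above.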

[Figure: work flow from gate-level netlist through topological analysis (FSM candidates) to Boolean function analysis (FSMs)]

Figure 5.2: Overview of our FSM reverse engineering work flow. Starting with a third-party, gate-level netlist, we first determine the gates of each FSM candidate using the topological analysis. Afterwards, each candidate is processed by the Boolean function analysis to determine the state transition graph.

FSM Reverse Engineering. Figure 5.2 provides an overview of our two-step reverse engineering technique. First, a topological analysis of the gate-level netlist (Section 5.2.1) generates a set of FSM candidates consisting of gates and signals that may form an FSM. Second, each candidate is processed by Boolean function analysis (Section 5.2.2) that determines the set of states S, the input alphabet I, the state transition function δ, and the initial state s0.

5.2.1 Phase 1: Topological Analysis

To disclose FSM gates and signals from a gate-level netlist, we transform the netlist into a multi-digraph using an approach similar to the technique described by Shi et al. [39]. Using this representation, we analyze the graph topology for the FSM characteristics described hereinafter.

Algorithm 6 Topological Analysis
Input: D - design netlist
Output: C - set of FSM candidates
 // initialization
 1: C ← ∅
 // ensure property I
 2: set of sets RS ← registers(D)
 // ensure property II
 3: set of sets SCC ← strongly connected components(D)
 4: for set register ∈ RS do
 5:   for gate g ∈ register do
 6:     if SCC.find element(g) == false then
 7:       register ← register \ {g}
 8: for set register ∈ RS do
 9:   if is splittable(register) == true then
10:     RS ← (RS \ register) ∪ split(register)
 // ensure property III
11: CLFP ← combinational logic feedback path(D, RS)
 // ensure property IV
12: for set register ∈ RS do
13:   for gate g ∈ register do
14:     if |Ir ∩ Dr| ≤ 1 then
15:       register ← register \ {g}
 // ensure property V
16: for set register ∈ RS do
17:   if compute control behavior(register) == 0 then
18:     RS ← RS \ {register}
19:   else
20:     C ← C ∪ {c(register, CLFP)}
21: return C

Even though the block diagram in Figure 5.1 might appear to be quite elementary, it inherently considers various fundamental properties.

Property I: Registers. Typically, FSM state memory registers are controlled by the same set of signals. Therefore, we group all FFs with the same clock, enable, and (a)synchronous (re)set signals into a register (represented as a set of FF sets) in line 2 of Algorithm 6. Note that these identified registers are important for further (manual) netlist reverse engineering, since they disclose crucial module-boundary information which partitions the design into easier-to-analyze, functionally-related units.

Property II: Strongly Connected Components. FSM memory and state transition logic form a strongly connected component, i.e., a path exists in each direction between each pair of vertices, see Figure 5.1. Thus, we first sort out FFs that are not in any strongly connected component with more than 2 vertices (lines 4 - 7), since state memory FFs should exhibit a cyclic structure. In addition, we split state registers whose FFs are in different strongly connected components (lines 8 - 10), since FSM FFs should influence all other FFs. Note that we use Tarjan's algorithm to identify strongly connected components [143].

Property III: Combinational Logic Feedback Paths. FSM state memory register output signals possess a feedback to their inputs via a series of combinational gates, forming a combinational logic feedback path, see Figure 5.1. All state memory FFs that do not reach themselves through combinational gates are removed from the register (line 11). In this step, we also determine all state transition logic gates. Therefore, we add all combinational gates in the feedback paths. Subsequently, we augment the set by adding predecessors of all logic gates until we reach a global input or a register gate.

Property IV: Influence/Dependence Metric. FSM candidates with only one FF in the register are rejected to minimize the number of potential FSMs. In addition, we sort out FFs where the intersection of influenced and dependent FFs is smaller than or equal to 1 (lines 12 - 15). More precisely, for each register FF r we determine the set of dependent registers Dr and the set of influenced registers Ir. We then compute the intersection Ir ∩ Dr. To measure the influence and dependence among all FFs in the register, we compute the mean of all |Ir ∩ Dr| for each register FF r ∈ R, normalized by dividing by |R|². This metric is used since, in a typical FSM, each state register FF influences and depends on all state register FFs. A value of 1 implies a strong coherence between the FFs, whereas a value of 0 implies a loose coherence.

Property V: Control Behavior Metric. Typically, FSM state memory register output signals connect to gates which are not in the strongly connected component (implementing λ). These control signals define the circuit's behavior.
Since LUTs are the building blocks used to realize combinational logic in FPGAs, we cannot directly use a previous approach (Shi et al. [39]) which is based on gate type analysis, i.e., whether a control signal connects to the select pin of a multiplexer. Moreover, a technique based on the presence of specific gate types is not reliable for netlists equipped with hardware obfuscation. We solve these issues in a generic way by using a metric to quantify the control behavior (line 17). To this end, we retrieve the FSM output logic gates which are either (1) successors of state memory FFs that are not in the state transition logic, or (2) transition logic gates that connect to gates outside of the strongly connected component. Note that the latter case occurs due to multi-level logic optimization. To measure the control behavior, we approximate the Boolean difference of a state FF output signal that connects to the output logic. We retrieve the minimal Boolean function representation using the Quine-McCluskey algorithm [117]. Then, we count how many minimized clauses are affected by the control signal, normalized by dividing by the number of clauses, yielding a real value in [0, 1]. Note that a value of 1 implies a strong control behavior, whereas a value of 0 implies no control behavior. We remove any candidate which possesses a value of 0 for all control signals. Finally, an FSM candidate is added to C (line 20).
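The influence/dependence metric (Property IV) can be sketched via graph reachability. Under the (assumed) reading that the score is the sum of |Ir ∩ Dr| over all register FFs normalized by |R|², a minimal sketch looks as follows; the adjacency-dict graph representation and helper names are illustrative, not the thesis implementation:

```python
# Hedged sketch of the influence/dependence metric: for each register FF r,
# I_r is the set of register FFs it can reach (influence, forward search) and
# D_r is the set that can reach it (dependence, backward search).

def reachable(adj, start):
    """Depth-first search: all vertices reachable from `start` (excl. start
    unless it lies on a cycle)."""
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def influence_dependence_metric(adj, register):
    """Sum of |I_r ∩ D_r| over all r in the register, divided by |R|^2.
    1 implies strong coherence between the FFs, 0 a loose coherence."""
    rev = {}
    for v, succs in adj.items():
        for w in succs:
            rev.setdefault(w, set()).add(v)
    total = 0
    for r in register:
        i_r = reachable(adj, r) & register   # influenced register FFs
        d_r = reachable(rev, r) & register   # dependent register FFs
        total += len(i_r & d_r)
    return total / len(register) ** 2

# Three FFs in a cycle influence and depend on each other: metric 1.0.
ring = {"ff0": {"ff1"}, "ff1": {"ff2"}, "ff2": {"ff0"}}
print(influence_dependence_metric(ring, {"ff0", "ff1", "ff2"}))  # 1.0
```

A feed-forward chain of FFs, by contrast, yields a score of 0, matching the intuition that FSM state bits mutually influence each other while datapath registers do not.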

Use of Metrics for Reverse Engineering. We want to emphasize that the (1) control behavior and (2) influence/dependence metrics are especially useful for a human reverse engineer, since multiple FSM candidates are typically retrieved. In such cases, the metrics and a detailed report of the topological analysis (e.g., which FFs influence and depend on each other, or the Boolean functions of control signals) are advantageous for manual analysis, see Section 5.4.


In summary, a topological analysis determines a set of FSM candidates consisting of register gates, combinational logic gates, and input and control signals that behave similarly to FSMs. Although the analysis discloses relevant gates and logic signals, it does not determine the key elements for reverse engineering of the hardware design, i.e. the state transition function δ. Based on δ, we can deduce the set of (reachable) states S and analyze the output function λ.

5.2.2 Phase 2: Boolean Function Analysis

To determine the state transition function δ, the set of states S and the output function λ for each FSM candidate, we analyze its combinational logic gates with an approach similar to Meade et al. [41]. The key idea is similar to a Breadth-First Search (BFS): we start at the initial state s0 and for each possible input value we determine the reachable states before moving on to the next level states. Typically, the initial state can be determined from gate configuration values, i.e. initial register values or (re)set signals. To determine the next state from a given current state and input configuration, we evaluate the Boolean functions of the combinational state transition logic. More precisely, each state memory FF data input is represented by a Boolean function whose input variables consist of the state value (FF data output) and FSM input signal values.

Algorithm 7 Boolean Function Analysis
Input: c ∈ C - FSM candidate with register R, combinational logic gates L, and input signals IS
Output: m - FSM with set of finite states S, state transition function δ, and initial state s0
 // initialization
 1: m.s0 ← initial state(c.R)
 2: m.S ← {m.s0}
 3: Q ← Queue()
 4: Q.enqueue(m.s0)
 // determine set of reachable states
 5: while Q ≠ ∅ do
 6:   s ← Q.dequeue()
 7:   for input signal configuration i ∈ c.IS do
 8:     snew ← evaluate(c.L, s, i)
 9:     m.δ(s, i) ← snew
10:     if snew ∉ m.S then
11:       Q.enqueue(snew)
12:       m.S ← m.S ∪ {snew}
 // determine input independent state registers
13: IIS ← input independent state series(m)
14: for gate r ∈ c.R do
15:   if has constant value(r, IIS) == false then
16:     c.R ← c.R \ {r}
17: return m


Algorithm 7 shows our technique to retrieve the state transition function δ from an FSM candidate independently of the state encoding. First, we initialize m and Q with the initial state determined by the set of registers R (lines 1 - 4). Second, we determine the set of reachable states S and δ (lines 5 - 12). In line 7, we iterate through each input signal configuration (e.g., for a 10-bit input signal we enumerate all 2^10 possible assignments). To compute the evaluate function in line 8, we use BDDs to represent the Boolean functions. Third, we analyze δ for any input-independent state register series and remove candidate state registers that behave like a counter (lines 13 - 16). Overall, the time complexity is O(|S| · 2^i), where i is the bit width across all input signals.

Property: Input Independent State Series. FSM reverse engineering faces several challenges in practice: the similarity of FSMs to counter circuits and non-standard implementation styles by designers. Counters are simplistic FSMs, and the topological analysis misclassifies them as FSM candidates even though they might not be used for design control. Additionally, counters can be utilized in FSMs even though integrating datapath units into the control path can be considered bad design practice. We provide an example of such an FSM implementation in Listing 5.5 in the Appendix.
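The BFS-style exploration of Algorithm 7 can be sketched in a few lines, with the netlist's Boolean transition logic replaced by a plain Python next-state function (an assumption for illustration; the thesis evaluates BDDs instead). Every input assignment is evaluated per state, matching the O(|S| · 2^i) complexity:

```python
# Sketch of the reachable-state exploration (lines 5 - 12 of Algorithm 7).
from collections import deque
from itertools import product

def explore(next_state, s0, input_bits):
    """BFS from s0: returns the reachable states S and the transition map δ."""
    states, delta = {s0}, {}
    queue = deque([s0])
    while queue:
        s = queue.popleft()
        # enumerate all 2^input_bits input assignments for this state
        for i in product((0, 1), repeat=input_bits):
            s_new = next_state(s, i)
            delta[(s, i)] = s_new
            if s_new not in states:
                states.add(s_new)
                queue.append(s_new)
    return states, delta

# Illustrative transition logic: a 2-bit saturating counter with 1 enable bit.
def next_state(s, i):
    return min(3, s + i[0])

states, delta = explore(next_state, 0, input_bits=1)
print(sorted(states))  # [0, 1, 2, 3]
```

Replacing `next_state` with an evaluator over the candidate's combinational logic (e.g., BDD-based) yields the algorithm described above.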

[Figure: two state series examples over states sg, sh, si, sj, sk: (a) Branch and (b) Merge]

Figure 5.3: Input independent state series starting point si (series marked in dashed red): (a) Branch: si has one successor and one predecessor which is a branch (multiple successors). (b) Merge: si has one successor and multiple predecessors.

To separate counter registers from state registers, we analyze the state transition function δ. In contrast to FSMs, counters are typically input-independent (except for enable or reset signals). Hence, we search for input-independent state series (line 13). These series start with either a branch or a merge state, cf. Figure 5.3, and end in a state with more than one successor. Since only the counter registers toggle for each register in the series, we remove them in line 16.

Differences to Related Work. As noted earlier in this section, our FSM reverse engineering algorithms are based on previous work by Shi et al. [39] and Meade et al. [42]. Overall, the structure and properties of our algorithms (phases 1 and 2) are similar to both works. However, both previous works specifically target ASIC gate-level netlists, and we had to augment them to work on FPGAs and improve their reliability using Boolean function analysis. Neither work considers FSMs equipped with obfuscation strategies, so we introduced two metrics which aid the human reverse engineer during manual inspection. Also, the separation of counters and FSM circuitry was not tackled by the previous approaches.

5.3 Reverse Engineering and Deobfuscation of FSM Obfuscation Schemes

On the basis of how FSM circuitry can be (semi-)automatically reverse engineered, we now analyze several FSM obfuscation schemes and review their claimed security with a particular focus on realistic reverse engineering and manipulation capabilities (Section 5.3.1 - Section 5.3.4). We then summarize the different issues in Section 5.3.5 to serve as an educational basis for future obfuscation designers and implementers.

5.3.1 HARPOON

HARPOON is a design methodology to obfuscate FSMs and to provide a form of authentication; it was utilized in a series of works [21, 22, 23, 24]. The HARPOON threat model is equivalent to ours, see Section 2.3.

Design Principle

In general, HARPOON augments an FSM with a series of states that form a preceding obfuscation mode and an authentication mode, see Figure 5.4. More precisely, the obfuscation mode consists of several states s0^O, . . . , sl^O which have to be traversed in a suitable sequence to reach the original initial state s0 so that the FSM operates as intended. The new initial state of the obfuscated FSM is s0^O. Note that the input sequence (i0, . . . , im) required to perform the correct transitions leading to s0 is called the enabling key and it is only known to honest parties. Without knowledge of this key, it should be challenging for an adversary to sell and enable unauthorized design copies. The authentication mode consists of several states s0^A, . . . , sn^A for which the output function λ generates outputs serving as watermarks. Similar to the obfuscation mode, the input sequence required to traverse the authentication states is called the authentication key. To deter simulation-based reverse engineering of the design, HARPOON leverages modification cells so that the design performs incorrect calculations when the FSM is not in the post-validation operating mode.

Figure 5.4: HARPOON design methodology example. The original FSM (dashed blue part) is augmented by an obfuscation mode s0^O, s1^O, s2^O, s3^O, s4^O and an authentication mode s0^A, s1^A, s2^A. The enabling key to reach the original initial state s0 is (i0, i1, i2).

Security Analysis

Chakraborty et al. assessed the security of HARPOON with a “purely random approach” [24] (p. 1497) similar to a brute-force attack on the enabling key. This attack does not reflect a realistic adversarial proceeding. We found several strategies to enable unauthorized design activation, including: (1) disclosure of the enabling key, and (2) initial state / watermark patching. We acknowledge that a similar attack strategy for enabling key disclosure has been described in a theoretical sense by Meade et al. [144] without experimentation. For both strategies mentioned above, we must reverse engineer the state transition function δ of the FSM using our aforementioned method from Section 5.2. Note that Chakraborty et al. [24] claimed a security level of 10^-47 for 30 state memory FFs and 4 FSM inputs. Hence, our Boolean function analysis determines the state transition function after 2^34 steps (worst-case).

Disclosure of the Enabling Key. To disclose the enabling key (i0, . . . , im) from a gate-level netlist, the state transition function δ is analyzed using the state transition graph. An important observation is that there is no path from the original FSM states s0, s1, . . . back to the preceding states in obfuscation mode, see Figure 5.4. Additionally, the state transition function of the original FSM typically consists of a cyclic structure and thus forms a strongly connected component which can be identified with Tarjan’s algorithm [143]. Subsequently, we can disclose the enabling key (i0, . . . , im) by examining which inputs lead to the original initial state s0 (e.g., by using Dijkstra’s shortest path algorithm).

Initial State Patching. Based on the observations for enabling key disclosure, we can also patch the state memory to entirely skip the obfuscation mode. To do so, we have to alter the initial values of the FFs to s0 (derived via δ and the strongly connected component property). For Xilinx FPGAs, FFs and latches include an initialization attribute INIT which sets the initial values of state outputs after configuration [115]. For other gate libraries, the (a)synchronous (re)set signals may be rerouted to GND or VCC depending on s0. Moreover, typical FFs in ASIC gate libraries offer Q and QN (negated Q) output pins, so these signals can be multiplexed on reset to model either a logic 0 or 1 for the state transition function.

Watermark Manipulation. Similar to initial state patching, we can manipulate the watermark to invalidate the design’s authenticity and survive post-silicon authentication where scan-FFs can be used to set the state memory to s0^A. Therefore, we must alter the output function λ so that the output values are changed for the authentication states s0^A, . . . , sn^A (e.g., by negating each output). For Xilinx FPGAs, output logic is typically implemented in LUTs, so we can simply change a LUT's INIT value to alter its functionality. For other gate libraries, we may add inverter gates to alter the functionality and thus the watermark.
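The enabling key disclosure described above can be sketched on a toy HARPOON-style state transition graph (all names and values below are hypothetical; the actual analysis runs on graphs recovered from the netlist): Tarjan's algorithm separates the original FSM, which forms a strongly connected component, from the acyclic obfuscation-mode states, and a shortest-path search then yields the required input sequence, i.e., the enabling key.

```python
# Sketch (hypothetical toy graph) of the enabling-key disclosure described
# above: Tarjan's algorithm separates the original FSM (a strongly connected
# component) from the acyclic obfuscation-mode states, then a shortest-path
# search yields the input sequence, i.e., the enabling key.
from collections import deque

def tarjan_sccs(graph):
    """Tarjan's SCC algorithm over {state: {input: successor}}."""
    index, low = {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = 0

    def strongconnect(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v].values():
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:   # v is the root of an SCC
            scc = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.add(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

def enabling_key(graph, init, original_states):
    """BFS from the obfuscated initial state; the inputs along the shortest
    path into the original FSM form the enabling key."""
    queue, seen = deque([(init, [])]), {init}
    while queue:
        state, key = queue.popleft()
        if state in original_states:
            return key
        for inp, nxt in graph[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, key + [inp]))
    return None

# Toy HARPOON-style graph: obfuscation states o0..o2 precede the original
# cyclic FSM s0 -> s1 -> s2 -> s0; wrong inputs get stuck in o2.
graph = {
    "o0": {"i0": "o1", "ix": "o2"},
    "o1": {"i1": "s0", "ix": "o2"},
    "o2": {"ix": "o2"},
    "s0": {"a": "s1"}, "s1": {"a": "s2"}, "s2": {"a": "s0"},
}
original = max(tarjan_sccs(graph), key=len)   # largest SCC = original FSM here
print(enabling_key(graph, "o0", original))    # -> ['i0', 'i1']
```

Identifying the original FSM as the largest SCC is a simplification valid for this toy example; the dissertation's analysis additionally uses the absence of paths from original states back into obfuscation mode.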

5.3.2 Dynamic State Deflection

Dynamic State Deflection is an FSM obfuscation technique designed to prevent unauthorized overwriting of the FSM state memory register; our proposed initial state patching uses exactly this type of overwriting. The threat model employed by Dofe et al. [145, 63] is equivalent to our threat model, see Section 2.3.

Design Principle

The general principle of Dynamic State Deflection is to protect each original FSM state by verifying for every state transition whether the correct enabling key ik is present. In case any invalid key i ≠ ik is present, the modified state transition function δ deflects to so-called black hole clusters sb0, . . . , sbn, see Figure 5.5, and thus protects the FSM state memory register from overwriting. A key feature of its construction is that once an invalid key i ≠ ik is assigned to the design,

it never reaches an original state again. Since each state transition verifies the presence of a valid enabling key part, the FSM has to be augmented with a dedicated enabling key port. Note that the scheme also builds upon existing techniques such as HARPOON [24] to protect the design with a preceding obfuscation mode and enabling key (i0, . . . , im, ik). To be more precise, the (i0, . . . , im) part refers to the enabling key of the preceding obfuscation mode, while the latter ik refers to the key validated for every original state transition.
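The deflection behavior can be illustrated with a minimal behavioral model (all state names and the key value below are hypothetical): every original transition checks the key part ik, and a single invalid key traps the FSM in a black hole cluster from which no sequence of inputs, even correct keys, leads back.

```python
# Minimal behavioral model (hypothetical values) of the deflecting state
# transition function: every original transition checks the enabling key part
# ik; any invalid key deflects the FSM permanently into a black hole cluster.
IK = 0xA   # per-transition enabling key part (hypothetical)

def delta(state, inp):
    if state.startswith("b"):              # black holes never lead back
        return "b1" if state == "b0" else "b0"
    if inp != IK:                          # invalid key -> deflect
        return "b0"
    return {"s0": "s1", "s1": "s2", "s2": "s0"}[state]

# With the correct key the FSM cycles through its original states ...
s = "s0"
for _ in range(3):
    s = delta(s, IK)
assert s == "s0"

# ... but a single invalid key locks it into the black hole cluster: even
# subsequently supplying the correct key cannot recover an original state.
s = delta("s1", 0x0)
for _ in range(10):
    s = delta(s, IK)
print(s)  # a black-hole state
```

This one-way property is exactly what makes the black hole clusters a separate strongly connected component in the recovered state transition graph, which the attack below exploits.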


Figure 5.5: Dynamic State Deflection design methodology example. The original FSM (dashed blue part) is augmented by a HARPOON obfuscation mode (dotted red part) and each original state is protected by a black hole (states marked in black).

Security Analysis

Dofe et al. examined the security of the Dynamic State Deflection scheme and claimed to raise the bar for reverse engineering since the transition between states is based on a complex structure. However, we show an automated strategy to enable unauthorized design activation by disclosure of the enabling key. Although a similar attack strategy for key recovery has been described theoretically by Meade et al. [144], we provide a practical implementation. For this strategy, we have to reverse engineer the state transition function of the FSM using our aforementioned method from Section 5.2. Note that Dofe et al. [63] used a 12-bit enabling key size for their evaluation. Our Boolean function analysis is able to practically determine the state transition function in a similar context.

Disclosure of the Enabling Key. Similar to HARPOON, we exploit the characteristic construction of the state transition function δ to retrieve the enabling key (i0, . . . , im, ik) from the gate-level netlist. In particular, the black hole clusters (forming a strongly connected component) can be distinguished from original states using Tarjan’s algorithm [143], see Figure 5.5. Note that the original FSM states do not form a strongly connected component with the black hole clusters, since there exists no path from the black hole clusters back to any original state. After distinguishing original from obfuscation mode states, we can analyze δ for the inputs leading to s0 (e.g., by using Dijkstra’s shortest path algorithm).

On State Transition Function Patching. To increase security, Dofe et al. proposed increasing the key size for ik (e.g., to 64 bits). The key is checked prior to an original state

transition. Even though this key size increase appears to improve protection against a naive brute-force of δ, a synthesizer will implement a characteristic comparator that can be automatically identified and manipulated, especially for large key sizes, see Section 3.4.1. Note that after reverse engineering the location of the comparator circuit, we may simply patch the state transition function δ (e.g., to report true for each value of ik) and ultimately re-enable initial state patching. For Xilinx FPGAs, state transition function logic is typically realized in LUTs, so we can simply change their INIT values to alter the functionality and report true for each ik input. Hence, either the key can be reverse engineered by a naive brute-force evaluation of δ for smaller key sizes, or an in-depth analysis of the comparator circuits present for larger key sizes can be used if a brute-force evaluation is not efficient (e.g., ≥ 2^40 steps on a standard computer).
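The LUT patching idea can be sketched with a simple truth-table model (a hypothetical abstraction, not a vendor API): a k-input LUT stores its truth table as a 2^k-bit INIT value, a key comparator has exactly one '1' row, and overwriting INIT with all ones makes the comparator report true for every key.

```python
# Sketch (hypothetical LUT model): a k-input LUT stores its truth table as a
# 2^k-bit INIT value. A key comparator occupies exactly one '1' row; patching
# INIT to all-ones makes the comparator report true for every key ik, which
# re-enables initial state patching.
def comparator_init(k, key):
    """INIT value of a k-input LUT implementing (input == key)."""
    return 1 << key                     # a single '1' row in the truth table

def patch_always_true(k):
    """INIT value of a constant-true k-input LUT."""
    return (1 << (1 << k)) - 1          # all 2^k rows set to '1'

def lut_eval(init, inp):
    """Evaluate the LUT: row `inp` of the truth table."""
    return (init >> inp) & 1

k, key = 4, 0b1010
init = comparator_init(k, key)
assert lut_eval(init, key) == 1 and lut_eval(init, 0) == 0
patched = patch_always_true(k)
assert all(lut_eval(patched, i) == 1 for i in range(1 << k))
print(f"{patched:#x}")  # -> 0xffff for a 4-input LUT
```

The single-'1' truth table is also what makes wide comparators automatically identifiable: a minimized Boolean function that is true for exactly one input assignment is a strong comparator signature.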

5.3.3 Active Hardware Metering

Hardware Metering [146, 147, 138] refers to a collection of security mechanisms and protocols that facilitate post-manufacturing control of designed IP cores. The collection of metering techniques can be broadly separated into (1) passive, and (2) active techniques. Passive metering facilitates unique chip identification, whereas active metering additionally enables designers to (un)lock chips. We focus on active hardware metering. Note that the threat model for active hardware metering is equivalent to our threat model, see Section 2.3.

Design Principle

Active hardware metering augments an original FSM with preceding states which must be traversed in the correct order to reach the original initial state s0. To this end, the original state register with s FFs is augmented by l FFs, which results in 2^l additional states s0^O, . . . , s(2^l-1)^O for a binary-like state encoding, see Section 5.2. In particular, the new initial state s0^O is determined by a device-unique and unpredictable Random Unique Block (RUB), see Figure 5.6. The state transition function δ is generated using one or more small ring counters (e.g., 16 or 32 states) which are then modified by randomly reconnecting and adding several edges. These small ring counters are then combined to form the obfuscated FSM, and original states are also randomly connected to additional black hole states similar to the structure explained for Dynamic State Deflection in Section 5.3.2. To unlock a design, the initial register value of s0^O is read out by the user and sent to the IP provider who determines the enabling key (i0, . . . , im). Without the enabling key it should be challenging for an adversary to sell and enable unauthorized copies of the design.

Security Analysis

The security of active hardware metering was previously examined with respect to reverse engineering and manipulation, i.e., control signal capture and replay¹. To increase security, the authors propose to alter the FSM so that the reset state is RUB-dependent as well; hence an FSM only operates correctly when it receives a specific stream of signals from the RUB after reset. Thus, the authors conclude that this renders reverse engineering “much more difficult” and the control

¹ “In this attack, Bob [the adversary] attempts to bypass the FSM by learning the control signals and attempting to emulate them. Bob may completely bypass the FSM by creating a new FSM that provides control signals to all functional units, and control logic (e.g. MUXs and FFs) in the datapath.” [25] (p. 299)



Figure 5.6: Active Hardware Metering technique example. The RUB response determines the initial state of the FSM. The enabling key then determines the transition from the obfuscation mode states (marked in red) to the original initial state of the FSM (marked in blue).

signal capture and replay “almost impossible”. Koushanfar [26] proved the security of the scheme for a related construction. The implementation details are dedicated to the state transition function δ, but the output function λ is hardly considered. It is noted that “the BFSM inputs are Primary Inputs (PI) and its outputs are Primary Outputs (PO) since they are the same as the PI and PO in the original design” [26] (p. 57). Despite these claims and the associated proof, we now provide an attack that performs unauthorized design activation by means of initial state patching.

Initial State Patching. To disclose the original initial state s0 from the gate-level netlist, we carefully analyze the output generation logic λ. An important observation is that, by construction of Active Hardware Metering, λ is only affected by original states, and thus we can infer original states by analyzing the Boolean functions of the output logic gates. To be more precise, the output logic will typically implement control signals that become active (e.g., logical ’1’) only for specific states. Based on such comparator circuitry we can directly read out original state configurations from the Boolean functions of the control signals. For example, a comparator circuit is added to the output logic λ to ensure that the l obfuscation FFs hold a pre-defined value (e.g., zero) to safeguard correct functionality of the design. Note that large comparators can also be automatically identified due to their characteristic functionality, see Section 3.4.1. We then initialize the Boolean function analysis with the original states extracted from the output logic λ. Due to the construction of Active Hardware Metering, we can only transit from an original state to either other original states or black hole states. Therefore, the complexity of our Boolean function analysis is not 2^(s+l) for an (l + s)-bit state register but rather bound by the (usually linear) number of original and black hole cluster states. Similar to Dynamic State Deflection in Section 5.3.2, we can automatically distinguish between original states and black hole states using Tarjan’s algorithm, as the black hole states form a strongly connected component in the state transition graph. In contrast to HARPOON and Dynamic State Deflection, we cannot directly read the original initial state s0 from the state transition graph in each case.
Nevertheless, typical designs implement a reset state which initializes the data path registers and thus, by analysis of which control signal causes such a characteristic reset behavior, we can recover the


original initial state s0. Otherwise, the analyst has to reverse engineer (parts of) the datapath to identify an initial state that makes sense. We want to emphasize that recovery of the original initial state s0 via the aforementioned reset behavior can be performed independently of the Boolean function analysis. Thus, neither a large (e.g., > 64) number of input signals nor a large number of state memory FFs prevents initial state patching.

On Enabling Key Disclosure. We want to highlight that for a small number of states |S| and a small number of input signals i, an enabling key disclosure is possible. Since the initial value of the state memory FFs (defined by the RUB) can be read out by the adversary, he is able to perform Boolean function analysis of the FSM circuit. For example, Alkabani et al. concluded that a brute-force attack on FSMs with up to 18 FFs and 8 input signals does not yield success, cf. Table 3 in [25]. However, targeted reverse engineering of the state transition function δ is possible for these values (2^26 steps in the worst case, which takes a couple of minutes on commodity hardware, see Section 5.4). It is then possible to analyze δ to identify the original FSM states by investigating which states affect the output function λ and control other parts of the circuit. In further work, Koushanfar [138] evaluated this technique for 20 FFs and 64 input signals. Enabling key disclosure would not be effective for such designs, but initial state patching may still be performed.
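The O(|S| · 2^i) bound of the Boolean function analysis can be sketched as a reachability exploration (the toy next-state function below is hypothetical): starting from a known initial state, the recovered next-state function is evaluated for every input assignment, and only states actually reached are expanded, so the cost is the number of reachable states times 2^i rather than the full 2^(s+l) state space.

```python
# Sketch of the O(|S| * 2^i) Boolean function analysis bound: starting from a
# known initial state, evaluate the (reverse engineered) next-state function
# for every input assignment and only expand states that are actually reached.
# The toy next_state function below is hypothetical.
from collections import deque

def explore(next_state, init, num_inputs):
    """Returns the reachable state transition graph and the number of
    next-state evaluations performed (|S_reachable| * 2^num_inputs)."""
    graph, queue, evals = {}, deque([init]), 0
    while queue:
        s = queue.popleft()
        if s in graph:
            continue
        graph[s] = {}
        for inp in range(1 << num_inputs):   # 2^i evaluations per state
            evals += 1
            nxt = next_state(s, inp)
            graph[s][inp] = nxt
            if nxt not in graph:
                queue.append(nxt)
    return graph, evals

# Toy FSM: a modulo-5 counter that only advances when input bit 0 is set.
def next_state(s, inp):
    return (s + (inp & 1)) % 5

graph, evals = explore(next_state, 0, num_inputs=4)
print(len(graph), evals)  # 5 reachable states, 5 * 2^4 = 80 evaluations
```

For the Alkabani et al. parameters (18 FFs, 8 inputs), the same loop performs at most 2^18 · 2^8 = 2^26 evaluations, matching the worst-case figure above.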

5.3.4 Interlocking Obfuscation

Interlocking Obfuscation is an FSM obfuscation technique that is used to provide anti-tamper hardware [62]. The threat model used by Desai et al. is equivalent to our threat model, see Section 2.3.

Design Principle

Similar to HARPOON, the Interlocking Obfuscation scheme augments the original FSM with an obfuscation mode s0^O, . . . , sl^O and a code-word, see Figure 5.7. The code-word is interwoven with the state transition function δ, so that δ is not only dependent on the current state and input but also on the value of the code-word. Moreover, δ modifies the code-word, i.e., δ : S × I × C → S × C for the set of possible code-words C. Hence, without knowledge of the correct initial code-word value c0 ∈ C (which is only available to honest parties) it should be challenging for an adversary to unlock and tamper with the design.
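The interlocked transition function δ : S × I × C → S × C can be illustrated with a minimal behavioral model (all state and code-word values below are hypothetical toy choices): each transition both checks the current code-word against a per-state expectation and updates the code-word, so only the correct initial code-word c0 keeps the FSM on its original states.

```python
# Minimal model (hypothetical values) of the interlocked transition function
# delta: S x I x C -> S x C. A transition only proceeds to the next original
# state when the current code-word matches; the code-word itself is also
# updated on every step.
expected = {0: 0x7, 1: 0x6, 2: 0x3, 3: 0xA}   # per-state code-word (toy)

def delta(state, inp, code):
    good = (code == expected.get(state))
    nxt = (state + 1) % 4 if good else 99     # 99 models a trap state
    new_code = (code * 3 + inp) & 0xF         # code-word evolves each step
    return nxt, new_code

# The correct initial code-word c0 = 0x7 keeps the FSM on original states.
s, c = 0, 0x7
for _ in range(3):
    s, c = delta(s, 1, c)
assert s == 3

# A wrong initial code-word derails the very first transition.
s, c = delta(0, 1, 0x0)
print(s)  # -> 99
```

Note how the state and code-word registers are indistinguishable at the netlist level in this construction, which is exactly the separation problem the security analysis below addresses.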

Security Analysis

The security of Interlocking Obfuscation was assessed with a brute-force approach [62] where the adversary tries all combinations to separate the code-word FFs from the actual state memory FFs and subsequently find the correct initial code-word c0. Since Desai et al. chose 56 FFs in their evaluation (8 state memory + 48 code-word FFs), the number of combinations is (56 choose 8) · 2^48 ≈ 2^78 (in case the adversary knows the number of states) and therefore is impractical to compute. Before we detail two generic problems with the Interlocking Obfuscation scheme, we note that Meade et al. [144] theoretically described a key recovery approach; however, their approach requires the state transition graph, necessitating a separation of the code-word and state memory FFs. Meade et al. did not provide information on how to separate the FFs and thus enable the



Figure 5.7: Interlocking Obfuscation design methodology example. The original FSM (dashed blue part) is augmented by an obfuscation mode and a code-word (dotted red part).

computation of the decisive state transition graph. Thus, their attack would only work for small code-word sizes.

Initial State Patching. A key feature of Interlocking Obfuscation is the interwoven structure of the code-word and the state memory FFs, which should be challenging to reverse engineer. To patch the initial state, we use a strategy that is similar to the one we used for Active Hardware Metering, see Section 5.3.3. Since the output function λ is not affected by the code-word, we can simply read out FF assignments from the Boolean functions and thus separate code-word and state memory FFs. Since the “code-word is not needed to compute a correct next state for all original state” [62] (p. 3), we can identify the original initial state s0 either by reset behavior or by Boolean function analysis similar to Active Hardware Metering.

On Anti-Tamper Hardware. Another issue is the effectiveness of the anti-tamper protection. Even though the FSM is obfuscated and only a valid code-word enables the design, the tamper resistance of this scheme is questionable since it only protects the design control path. Desai et al. [62] chose an AES design for their evaluation; however, several attacks on the AES datapath Sboxes have been published [77, 78] (the first reference appeared several years prior to the publication by Desai et al.). These attacks leak the secret key and can be performed regardless of any obfuscated control path, rendering the anti-tamper property questionable.

5.3.5 Lessons Learned

We now summarize the different issues of the schemes to provide future obfuscation designers and implementers with a clear picture of what is and is not currently possible with respect to automatic reverse engineering.

Topological Analysis Discloses FSM Gates. FSMs exhibit several characteristics (e.g., a cyclic structure with a combinational logic feedback path), so state memory FFs as well as transition and output logic gates can be automatically extracted from a gate-level netlist, independent of the number of inputs or state memory FFs.

Separation of Obfuscated Parts. A common denominator of the presented schemes is the possible separation of original and obfuscation circuitry, since the obfuscation circuitry is not logically entangled with the remaining logic. For example, the output function λ does not depend on obfuscation states in Active Hardware Metering [25] or Interlocking Obfuscation [62]. This

observation enables separation and ultimately key recovery or initial state patching. Bogus output generation for obfuscation states (as realized in HARPOON [24]) indeed provides an effective countermeasure against the separation of obfuscation and original FSM circuitry. However, adding the capability to generate a bogus output is not a generic solution (e.g., in case the FSM steers an actuator).

Complexity of Boolean Function Analysis. As noted earlier in this section, the complexity to retrieve the state transition function δ is O(|S| · 2^i), where i is the combined bit width of all input signals. If the state space and the number of input signals are small, δ can be reverse engineered, yielding potential enabling keys. Moreover, this approach provides the state transition graph, which is valuable for manual analysis. Scaling the FSM (e.g., to 2^20 states) is not trivial since a straightforward implementation will require a large amount of combinational logic for the state transitions. Scaling the FSM input signal count (e.g., to 40) may be prohibitive since the signals must be meaningfully connected to external devices.

Eavesdropping on the Enabling Key. Eavesdropping on the enabling key may also be a realistic attack vector; however, it requires access to a benign device and a valid enabling key. Moreover, further reverse engineering is required to understand how the enabling key is transferred from a communication interface to the FSM and subsequently processed.

5.4 Evaluation

We now provide an evaluation of our new automated FSM reverse engineering and manipulation strategies to underline the insecurity of the allegedly secure FSM-based obfuscation strategies presented in Section 5.3. We first present how we implemented the different obfuscation strategies.

Implementation. We generated the gate-level netlists for the selected hardware designs using Xilinx ISE (version 14.7). For HARPOON we selected 14 additional states and 8-bit enabling key input signals. For Dynamic State Deflection we used the HARPOON obfuscation mode and added black hole clusters of 5 states for each original state. We deliberately omit an isolated evaluation of HARPOON since this technique is already included in Dynamic State Deflection. Our selected parameters for HARPOON are similar to the original work [24], as the authors chose an FSM with 4 states and a 10-bit enabling key size, cf. Section V C [24]. Our parameters for Dynamic State Deflection are also similar to the original work [145], which utilized a 12-bit enabling key. We realized Active Hardware Metering using 256 additional states and an 8-bit enabling key, as well as a black hole cluster of 5 states per original state. This parameter choice is similar to the original work [25], which evaluated obfuscated hardware designs with 18 FFs and 8-bit input and thus has a comparable worst-case complexity, see Table 3 [25]. Interlocking Obfuscation was generated with a 4-bit code-word. Our FSM reverse engineering algorithms, Algorithm 6 and Algorithm 7, are implemented with the assistance of HAL, see Chapter 3. We want to note that all evaluated obfuscation strategies are realized with a binary state encoding since results for Gray-encoded FSMs are similar. As explained in the previous section, one-hot state encodings are not reasonable for FSM obfuscation since the number of utilized FFs grows linearly with the number of states, yielding a large area overhead.


5.4.1 Case Study: Cryptographic Designs

Even though most cryptographic primitives in use today are resilient against traditional cryptanalytic attacks, adversaries leverage implementation attacks (e.g., side-channel analysis) to undermine vital security goals such as confidentiality and integrity. To counteract such attacks, numerous strategies have been investigated by the research community and industry, which has yielded highly-optimized and secure implementations [74]. Thus, from an economic point of view, cryptographic implementations are valuable IP that is worth protecting from IP infringement (e.g., by using FSM obfuscation). For this case study, we selected two cryptographic hardware designs: (1) an iterative AES IP core, and (2) an iterative Secure Hash Algorithm (SHA)-3 IP core, since both cryptographic building blocks are widely deployed in practice. State transition graphs of both FSMs are depicted in Figure 5.12 in the Appendix. We obfuscated each hardware design with Dynamic State Deflection, Active Hardware Metering, and Interlocking Obfuscation as described in the previous section.

HARPOON and Dynamic State Deflection

AES. Our topological analysis identifies 2 FSM candidates after about 2 minutes of computation time on a standard laptop:

(1) 32 FFs, 131 input signals, control value 0.971, influence/dependence value 0.625

(2) 8 FFs, 21 input signals, control value 0.519, influence/dependence value 0.625

Our analysis indicates that each FF in the first candidate only connects to 4 successor gates (an unlikely scenario for control signals that steer a complex data path) and there is no FF subset where all FFs depend on each other. Based on these properties it seems unlikely that this candidate implements an FSM, and a quick manual analysis reveals that this circuit actually implements a shift register. Since the influence/dependence value of the second candidate is 0.625 ≠ 1.0, we analyze the topological analysis report and see that the FFs form two groups where all FFs influence and depend on each other: one group with 2 FFs and the other one with 6 FFs, see Listing 5.1. We omit the group with 2 FFs since we are searching for an obfuscated FSM with far more than 4 states. Boolean function analysis of the 6 FFs yields the state transition graph shown in Figure 5.8. As described in Section 5.3.2, an analysis of the strongly connected components yields three parts: the obfuscation mode of HARPOON, the black hole clusters of Dynamic State Deflection, and the original FSM. With the state transition function δ we can generate the enabling key by searching for a path from the initial state to the original initial state (marked as red boxes in Figure 5.8). Similarly, we can perform an initial state patching attack by altering the INIT value of each state memory FF accordingly. The 2^22 (6 FFs and 16 input signals) computations for the Boolean function analysis took about 5 minutes on a standard laptop.


Figure 5.8: State transition graph of AES IP core obfuscated with Dynamic State Deflection (including HARPOON ). Tarjan’s algorithm splits all states into 3 strongly connected components: obfuscation mode of HARPOON (marked in red), black hole cluster states (marked in black), and original FSM (marked in blue). Rectangle nodes mark the sequence from initial state 000000 to the original initial state 010001. Input values for each state transition are deliberately left out for readability.


Listing 5.1: Excerpt of the topological analysis report of the AES IP core obfuscated with Dynamic State Deflection (including HARPOON). Gate names have been blinded so that no information can be inferred from names. FFs U1, U2, . . . , U8 form 2 distinct groups where all FFs influence and depend on each other.

...
[+] U1 influences and depends on: U1, U2, U3, U4, U5, U6
[+] U2 influences and depends on: U1, U2, U3, U4, U5, U6
[+] U3 influences and depends on: U1, U2, U3, U4, U5, U6
[+] U4 influences and depends on: U1, U2, U3, U4, U5, U6
[+] U5 influences and depends on: U1, U2, U3, U4, U5, U6
[+] U6 influences and depends on: U1, U2, U3, U4, U5, U6
[+] U7 influences and depends on: U7, U8
[+] U8 influences and depends on: U7, U8
...

SHA-3. Results for the SHA-3 IP core are similar to those from the AES core. Our topological analysis detects 2 FSM candidates after around 24 minutes on a standard laptop:

(1) 128 FFs, 2573 input signals, control value 0.803, influence/dependence value 0.016

(2) 6 FFs, 19 input signals, control value 0.408, influence/dependence value 1.0

The first candidate belongs to the iterative SHA-3 data path and does not implement an FSM, as indicated by the low influence/dependence value. The latter candidate has the characteristics of an FSM circuit. Boolean function analysis of this candidate yields a state transition graph that is similar to the one generated for AES, shown in Figure 5.8. The graph leads to the disclosure of the enabling key and enables initial state patching. The 2^25 (6 FFs and 19 input signals) computations for the Boolean function analysis took about 27 minutes on a standard laptop. For both obfuscated hardware designs (AES and SHA-3), our input-independent state series analysis reports that all FSM FFs yield non-state behavior (triggered by the black hole clusters). Hence, we can use this information to determine whether an FSM incorporates black hole state clusters, provided they are implemented with input-independent state transitions.

Active Hardware Metering and Interlocking Obfuscation

AES. Our topological analysis identifies 3 FSM candidates after about 4 minutes of computation on a standard laptop:

(1) 32 FFs, 131 input signals, control value 0.971, influence/dependence value 0.625

(2) 2 FFs, 7 input signals, control value 0.463, influence/dependence value 1.000

(3) 15 FFs, 17 input signals, control value 0.845, influence/dependence value 0.742

As the first candidate does not implement an FSM, see Section 5.4.1, and the second candidate is too small to implement any obfuscated FSM, we focus on the third FSM candidate. Since the influence/dependence value is 0.742 ≠ 1.0, the topological analysis report is needed to examine the influence/dependence of each FF. Analogous to Listing 5.1, we can exclude 6 FFs since they

do not form a subgroup where all FFs influence and depend on each other. As a result, we focus on the remaining 9 FFs of the third candidate.

Listing 5.2: Excerpt of the topological analysis report of the AES IP core obfuscated with Active Hardware Metering. Gate and net names have been blinded so that no information can be inferred from names. FFs U1, U2, . . . , U9 connect to control signals O1 and O2. The Boolean functions of O1 and O2 are determined by Quine-McCluskey Boolean function minimization; ˜ refers to a negated literal, + to a disjunction, and * to a conjunction. Based on the Boolean functions, we infer 2 states where we assign each FF the Boolean value in the formula.

...
[Control Signal] O1 = U1*˜U2*˜U3*˜U4*˜U5*˜U6* U7*˜U8* U9
[Control Signal] O2 = U1*˜U2*˜U3*˜U4*˜U5* U6*˜U7*˜U8*˜U9
...

Listing 5.2 shows an excerpt of the topological analysis report of the third candidate. The characteristic control signals O1 and O2 only depend on the 9 FFs (and not on the 6 FFs excluded via influence/dependence analysis). This information confirms that the 6 FFs do not belong to the FSM state memory. In the listing, we minimized the Boolean functions of 2 control signals and can infer 2 potential states, namely s1 = U1 . . . U9 = 100000101 and s2 = U1 . . . U9 = 100001000, which we feed to the Boolean function analysis, yielding the state transition graph in Figure 5.9 (similar to the adapted original in the Appendix). Note that the Boolean function analysis took around 4 minutes on a standard laptop. Based on the state transition graph, we deduce the original initial state s0 = 000110111 as it behaves as a fallback state (in case a specific condition holds) while the other states form a strongly connected component (in case this specific condition does not hold). As explained in Section 5.3.4, an attack on the Interlocking Obfuscation scheme is the same as for the Active Hardware Metering scheme and thus we omit the results for the former approach.
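Reading candidate states out of minimized control-signal functions, as done above for Listing 5.2, amounts to a small parser. The sketch below (a hypothetical helper, using ASCII '~' for the negation symbol of the report format) fixes each FF mentioned in a product term to '1' (plain literal) or '0' (negated literal) and defaults absent FFs to '0'.

```python
# Sketch: turn a minimized control-signal product term (report format as in
# Listing 5.2; here with ASCII '~' for negation and '*' for conjunction) into
# a candidate state encoding. A plain literal fixes the FF to '1', a negated
# literal to '0'; FFs absent from the term default to '0'.
def term_to_state(term, ffs):
    bits = dict.fromkeys(ffs, "0")
    for lit in term.replace(" ", "").split("*"):
        bits[lit.lstrip("~")] = "0" if lit.startswith("~") else "1"
    return "".join(bits[ff] for ff in ffs)

ffs = [f"U{i}" for i in range(1, 10)]
o1 = "U1*~U2*~U3*~U4*~U5*~U6* U7*~U8* U9"
o2 = "U1*~U2*~U3*~U4*~U5* U6*~U7*~U8*~U9"
print(term_to_state(o1, ffs))  # -> 100000101
print(term_to_state(o2, ffs))  # -> 100001000
```

The two decoded encodings match the states s1 and s2 inferred from Listing 5.2.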

85 Chapter 5. On the Difficulty of FSM-based Hardware Obfuscation

Figure 5.9: State transition graph of the AES IP core obfuscated with Active Hardware Metering. Tarjan’s algorithm splits all states into 2 strongly connected components: black hole cluster states (marked in black) and original FSM states (marked in blue). Input values for each state transition are deliberately left out for readability.


SHA-3. The results for the SHA-3 IP core are similar to those found for AES in the previous section. Our topological analysis detects 2 FSM candidates after about 38 minutes of computation on a standard laptop:

(1) 128 FFs, 2574 input signals, control value 0.250, influence/dependence value 0.016

(2) 9 FFs, 19 input signals, control value 0.406, influence/dependence value 1.000

Similar to the results for Dynamic State Deflection, the first candidate belongs to the iterative SHA-3 data path while the latter candidate shows characteristics of an FSM circuit.

Listing 5.3: Excerpt of the topological analysis report of the SHA-3 IP core obfuscated with Active Hardware Metering. Gate and net names have been blinded so that no information can be inferred from names. FFs U1, U2, . . . , U9 connect to control signals O1, O2, and O3. The Boolean functions of O1, O2, and O3 are determined by Quine-McCluskey Boolean function minimization; ˜ refers to a negated literal, + to a disjunction, and * to a conjunction. Based on the Boolean functions, we infer 3 potential states, where we assign each FF the Boolean value it takes in the formula and all FFs not appearing in a formula (e.g., U2 in O1) are assigned logical ’0’.

...
[Control Signal] O1 = U1* ˜U5* U6*˜U7*˜U8*˜U9
[Control Signal] O2 = U1* ˜U5*˜U6* U7* ˜U9
[Control Signal] O3 = ˜U2*˜U3* U4* U5*˜U6* U7* U8* U9
...

Listing 5.3 shows an excerpt of the topological analysis report of the second FSM candidate. We see the minimized Boolean functions of the 3 control signals O1, O2, and O3. Each signal implements characteristic control behavior since each becomes a logical ’1’ under exactly one state memory configuration. Based on these 3 signals, we infer 3 potential states, namely s1 = U1 . . . U9 = 100001000, s2 = U1 . . . U9 = 100000100, and s3 = U1 . . . U9 = 000110111, and feed them to the Boolean function analysis, which yields the state transition graph in Figure 5.10. This graph is similar to the original one (depicted in Figure 5.12 in the Appendix). As described in Section 5.3.3, the Boolean function analysis does not depend on the number of FFs but rather on the number of original states due to the specific construction of the Active Hardware Metering scheme: original states transition either to other original states or to black hole cluster states. The Boolean function analysis took around 17.5 minutes on a standard laptop. Based on the state transition graph, we deduced the original initial state s = 000110111 to perform initial state patching of the obfuscation and state memory FFs. Initial state patching to the original reset state s is also possible without Boolean function analysis: control signal O3 implements a data path reset functionality since it only becomes active in the initial state and connects to the reset ports of 1608 FFs (1600 SHA-3 data path FFs and the 8-bit LFSR FFs). As explained in Section 5.3.4, an attack on the Interlocking Obfuscation scheme is the same as one on the Active Hardware Metering scheme, and thus we omit the results for the former approach.
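The observation that O3 drives the reset ports of 1608 FFs suggests a simple heuristic for spotting such a data path reset signal without any Boolean function analysis. The sketch below is illustrative only: the netlist representation, signal names, and threshold are assumptions, not HAL's actual data structures.

```python
# Hypothetical netlist fragment: control signal -> list of (FF, port) it drives.
fanout = {
    "O1": [("FF_17", "D")],
    "O2": [("FF_903", "D")],
    "O3": [(f"FF_{i}", "RST") for i in range(1608)],  # 1600 datapath + 8 LFSR FFs
}

def reset_candidates(fanout, threshold=100):
    """Flag control signals that drive the reset port of many FFs --
    a strong hint for a datapath-reset signal active in the initial state."""
    return [sig for sig, ports in fanout.items()
            if sum(1 for _, port in ports if port == "RST") >= threshold]

print(reset_candidates(fanout))  # -> ['O3']
```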


Figure 5.10: State transition graph of the SHA-3 IP core obfuscated with Active Hardware Metering. Tarjan’s algorithm splits all states into 3 strongly connected components: black hole cluster states (marked in black), original reset state (marked in blue), and original FSM states (marked in blue). Input values for each state transition are deliberately left out for readability.

5.4.2 Case Study: Communication Interfaces

Virtually all hardware designs implement dedicated communication interfaces to exchange information with peripheral devices. These interfaces range from simplistic parallel interfaces to complex serial interfaces. Since communication interfaces receive potentially untrusted data from peripheral devices, design activation measures implemented in these building blocks can be used to reject untrusted data before it is forwarded to internal hardware modules, thus protecting valuable IP. For this case study, we selected a serial UART interface since this interface is widely deployed in real-world embedded systems. We augmented the UART with our previously-described FSM obfuscation schemes and evaluated each approach separately.

HARPOON and Dynamic State Deflection UART. Our topological analysis detects 5 FSM candidates after about 17 seconds of computation time on a standard laptop:

(1) 6 FFs, 8 input signals, control value: 1.000, influence/dependence value: 1.000

(2) 16 FFs, 6 input signals, control value: 0.500, influence/dependence value: 0.445

(3) 4 FFs, 18 input signals, control value: 0.400, influence/dependence value: 1.000

(4) 17 FFs, 6 input signals, control value: 0.785, influence/dependence value: 0.495

(5) 5 FFs, 14 input signals, control value: 0.698, influence/dependence value: 0.520

Based on the high control and influence/dependence values, we identify the first candidate as a potential FSM. Boolean function analysis of the 6 FFs and 8 input signals yields a state transition graph similar to the one depicted in Figure 5.8. Analogously, Tarjan’s algorithm splits the states into three sets (obfuscation mode states, black hole states, and original states) and thus enables disclosure of the enabling key and initial state patching. The 2^14 (6 FFs and 8 input signals) computations for the Boolean function analysis took about 2 s on a standard laptop.

We now briefly describe the other four identified circuits and why they are marked as potential FSMs by the topological analysis. The second candidate represents the transmission FSM (2 FFs) together with a clock divider (8 FFs) and a countdown counter (6 FFs). Based on the topological analysis report, we can separate the transmission FSM FFs from the datapath FFs since each of the 16 FFs depends on the 2 FSM state memory FFs. The third candidate refers to a 4-bit counter, which naturally exhibits a high internal influence/dependence but a low overall controllability since counters and FSMs behave similarly. Analogous to the transmission FSM, the fourth candidate refers to the receiving FSM (3 FFs), which includes a clock divider (8 FFs) and a countdown counter (6 FFs). The last candidate refers to the state machines for the empty/full logic of both the receiving and transmission FIFO circuits used in the design. Each candidate thus refers (at least partially) to genuine FSM circuits.
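The state-set separation rests on standard strongly connected component computation. A minimal sketch of Tarjan's algorithm follows, with a toy transition graph and hypothetical state names standing in for the real graphs of Figures 5.8-5.11:

```python
def tarjan_scc(graph):
    """Tarjan's algorithm: return the strongly connected components of a
    directed graph given as {node: [successor, ...]}."""
    index, low = {}, {}
    stack, on_stack, sccs = [], set(), []
    counter = [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    return sccs

# Toy state transition graph: the original FSM states form one cycle,
# the black hole cluster another; a one-way edge leads into the cluster.
stg = {
    "s0": ["s1"], "s1": ["s2"], "s2": ["s0", "b0"],
    "b0": ["b1"], "b1": ["b0"],
}
print(tarjan_scc(stg))  # two SCCs: {s0,s1,s2} (original) and {b0,b1} (black hole)
```

Because no edge leaves the black hole cluster back into the original states, the SCC containing the reachable reset behavior identifies the original FSM.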

Active Hardware Metering and Interlocking Obfuscation UART. Similar to the results for Dynamic State Deflection, our topological analysis detects 5 FSM candidates:

(1) 9 FFs, 8 input signals, control value: 1.000, influence/dependence value: 1.000

(2) 16 FFs, 6 input signals, control value: 0.500, influence/dependence value: 0.445

(3) 4 FFs, 18 input signals, control value: 0.400, influence/dependence value: 1.000


(4) 17 FFs, 6 input signals, control value: 0.785, influence/dependence value: 0.495

(5) 5 FFs, 14 input signals, control value: 0.698, influence/dependence value: 0.520

Since the control and influence/dependence values are 1.0, we identify the first candidate as a potential FSM. Candidates 2–5 are the same ones discussed in the previous section.

Listing 5.4: Excerpt of the topological analysis report of the UART IP core obfuscated with Active Hardware Metering. Gate and net names have been blinded so that no information can be inferred from names. FFs U1, U2, . . . , U9 connect to control signals O1, O2, O3, and O4. The Boolean functions of the control signals are determined by Quine-McCluskey Boolean function minimization; ˜ refers to a negated literal, + to a disjunction, and * to a conjunction. Based on the mutual configuration of the FFs across all control signals, we identify FFs U3, . . . , U9 as obfuscation state memory, since the control signals only change their output in case these FFs hold a pre-defined all-zero value; consequently, FFs U1 and U2 are marked as original state memory since they yield changing outputs for the control signals.
...

[Control Signal] O1 = U1* *˜U5*˜U6* U7*˜U8* U9
[Control Signal] O2 = ˜U1* *˜U5*˜U6* U7*˜U8*˜U9
[Control Signal] O3 = U1* *˜U5*˜U6*˜U7* U8*˜U9
[Control Signal] O4 = ˜U2*˜U3* U4* U5*˜U6*˜U7*˜U8*˜U9
...

Listing 5.4 shows an excerpt of the topological analysis report of the selected FSM candidate. We see the minimized Boolean functions of the control signals O1, . . . , O4 and can infer 4 potential states s1 = U1 . . . U9 = 100000101, s2 = U1 . . . U9 = 100000100, s3 = U1 . . . U9 = 100000010, and s4 = U1 . . . U9 = 000110000, which we feed to the Boolean function analysis, yielding the state transition graph in Figure 5.11. The Boolean function analysis took around 2 s on a standard laptop. Using Tarjan’s algorithm, we were able to distinguish original states from black hole cluster states by virtue of the specific construction of Active Hardware Metering. As explained in Section 5.3.4, the attack on the Interlocking Obfuscation scheme is the same as for Active Hardware Metering, and thus we deliberately omit the results for the former approach.

In summary, we observed that (semi-)automatic reverse engineering and targeted manipulation of FSM obfuscation perform well in practice for several obfuscation schemes and hardware designs. We also performed experiments where we integrated a UART communication interface with the cryptographic cores. The results for these experiments were similar to the ones presented in Section 5.4.1 and Section 5.4.2, so they have been omitted.

5.5 Discussion

Figure 5.11: State transition graph of the UART IP core obfuscated with Active Hardware Metering. Tarjan’s algorithm splits all states into 2 strongly connected components: black hole cluster states (marked in black) and original FSM states (marked in blue). Input values for each state transition are deliberately left out for readability.

Applications of FSM Reverse Engineering. We have demonstrated that automated reverse engineering of FSMs provides valuable high-level information, i.e., the design of the control path and its controlled datapath units. Since reverse engineering is a tool that enables both constructive and destructive applications, we discuss the implications for both.

From a defender’s point of view, high-level FSM information can be used to identify hardware Trojans implemented with sequential logic or Trojans which incorporate FSM-based obfuscation for increased stealthiness [70, 42, 12]. Moreover, our insights into the capabilities of FSM reverse engineering can support the assessment of future hardware design obfuscation strategies. Reverse engineering is also beneficial in the case of source code loss, faulty product design detection, and competitor analysis [16].

From an adversary’s point of view, high-level FSM information offers an attractive entry point for further reverse engineering of datapath units, such as details of a cryptographic implementation or microarchitecture specifics. More importantly, register grouping discloses crucial module boundary information, which partitions the design into easier-to-analyze, functionally-related units.

On the Difficulty of Using Obfuscation Metrics. Modeling the security of practical obfuscation schemes against reverse engineering is challenging, see Section 5.3. We observed that a realistic appreciation of reverse engineering capabilities and consideration of the system

context are often neglected. The capabilities of (automated) gate-level netlist reverse engineering are often not considered, since this topic is not well studied, see Section 2.4.3. Designers of obfuscation techniques often do not detail why the development of automated reverse engineering is challenging or what steps a rational reverse engineer would need to take to defeat the obfuscation. In Section 5.3.5, we provided a concise overview of the lessons learned to provide future obfuscation designers with valuable information on adversarial approaches.

Focusing solely on FSM obfuscation is not enough to prevent an attacker from gaining meaningful high-level design information; the general system context must be considered as well. Consider an FPGA hardware design with a communication interface, a cryptographic algorithm, and an FSM controlling both cores. The FSM is obfuscated and employs a key-based activation, so that it is not possible to algorithmically or manually reverse engineer its state transition function (e.g., a 64-bit input signal). Since the communication interface and the cryptographic core remain unobfuscated, this implementation still enables the adversary to obtain high-level design information without analyzing the highly-obfuscated FSM. This example shows that it is necessary to consider obfuscation not just for the FSM, but for the system as a whole. This directive extends to people researching obfuscation as well as to system designers responsible for implementing obfuscation techniques. Designers need to consider the non-obfuscated parts of a design, even in the presence of FSM obfuscation.

Future Work. Several directions can be explored in future research. Reverse engineering of FPGA designs with dynamic reconfiguration should be explored to quantify the complexity increase (compared to static designs). In addition, further work should explore automated techniques for general-purpose reverse engineering of security-relevant circuitry.
It would be desirable to quantify the human factor in reverse engineering or to set up a reverse engineering competition. The evaluation could also be extended to diverse (open-source and closed-source) synthesizers to potentially improve the reliability of the topological analysis.

5.6 Conclusion

In this chapter, we carefully reviewed the security of several state-of-the-art FSM obfuscation schemes. In concert with realistic reverse engineering capabilities, we demonstrated several generic strategies to bypass these schemes on FPGA gate-level netlists while keeping analysis times practical. We augmented netlist reverse engineering algorithms to disclose high-level FSM information in FPGA gate-level netlists. Our rigorous evaluation demonstrates the effectiveness of FSM reverse engineering, and the automatically disclosed information supports a human analyst in further reverse engineering a design for constructive as well as destructive purposes. Our insights on realistic reverse engineering capabilities invite a rethinking of future hardware design obfuscation.

Appendix

Listing 5.5 shows an FSM with a merged counter (highlighted lines in yellow). Note that the FSM is implemented in an iterative AES core to control the key schedule processing. Since the round counter signal CV_RUNUP_STEP is processed inside the control path, the counter and the state signal RUNUP_STATE are merged into a single FSM candidate. Our input-independent state series


analysis (presented in Section 5.2.2) addresses this issue and splits the counter from the FSM part.

Listing 5.5: FSM incorporating a counter (see key_sched_iterative.vhdl [130]).

...
constant LAST_ECVRUNUP_STEP : integer := 1; --# of steps for cv runup
constant LAST_DCVRUNUP_128  : integer := 9; --# of steps for cv runup

signal CV_RUNUP_STEP : integer range 0 to 255;
type RUNUP_STATE_TYPE is (HOLD, CV_RUNUP, CV_EXPAND, DONE);
signal RUNUP_STATE : RUNUP_STATE_TYPE;
...
RUNUP_FLOW: process(clock, reset)
begin
  if reset = '1' then
    CV_RUNUP_STEP <= 0;
    RUNUP_STATE <= HOLD;
  elsif clock'event and clock = '1' then
    case RUNUP_STATE is
      ...
      when CV_RUNUP =>
        if ( CV_RUNUP_STEP /= LAST_ECVRUNUP_STEP and KS_ENC = '1' )
           or ( CV_RUNUP_STEP /= LAST_DCVRUNUP_128 and KS_ENC = '0' ) then
          CV_RUNUP_STEP <= CV_RUNUP_STEP + 1;
          RUNUP_STATE <= RUNUP_STATE;
        else
          RUNUP_STATE <= DONE;
          CV_RUNUP_STEP <= 0;
        end if;
      ...
    end case;
  end if; -- reset = '1'
end process; -- RUNUP_FLOW

93 Chapter 5. On the Difficulty of FSM-based Hardware Obfuscation

Figure 5.12 shows the reduced state transition graphs of the hardware designs utilized in our evaluation, see Section 5.4. We deliberately omitted the input signals yielding state transitions for improved readability. Note that we retained the original state names for clarity, as our Boolean function analysis only recovers the state memory values but obviously not the original meaning of the states.

(a) SHA-3 FSM. (b) AES encryption FSM.

Figure 5.12: Original state transition graph diagrams of hardware designs utilized in Section 5.4.

Part IV

Hardware-Assisted Instruction Set Architecture Obfuscation

Chapter 6

Hybrid Obfuscation to Protect Intellectual Property on Embedded Microprocessors

Motivation. Analogously to hardware, the risk of reverse engineering is particularly acute for software of embedded processors since they often have limited available resources to protect program information. Previous efforts involving code obfuscation provide some additional security against reverse engineering of programs, but the security benefits are typically limited and not quantifiable due to human factors. Hence, new approaches to code protection and creation of associated metrics are highly desirable.

Contents of this Chapter

6.1 Introduction ...... 98
6.2 Technical Background and Related Work ...... 99
6.2.1 System and Adversary Model ...... 100
6.2.2 Instruction Set Architecture ...... 102
6.3 Hardware-level Obfuscation ...... 103
6.3.1 Opcode Substitution ...... 103
6.3.2 Operand Permutation ...... 105
6.3.3 Hardware-enforced Access Control ...... 105
6.3.4 Hardware-level Booby Traps ...... 106
6.4 Software-level Obfuscation ...... 107
6.5 Implementation ...... 108
6.6 Performance Evaluation ...... 108
6.7 Security Analysis ...... 111
6.8 Security Metrics for Obfuscation ...... 112
6.8.1 Similarity Metric ...... 112
6.8.2 Case Study – SPREE Benchmark Suite ...... 114
6.9 Discussion ...... 121
6.10 Conclusion ...... 123


Contribution. The research presented in this chapter was joint work with Simon Rokicki (affiliated with IRISA, University of Rennes, France), who implemented the SPREE processor, adapted Obfuscator-LLVM, and evaluated the performance of both hardware and software, and Nicolai Bissantz (affiliated with Ruhr-Universität Bochum, Germany), who developed the statistical analysis of the instruction memory. Parts of this chapter have been previously published in IEEE Transactions on Computers [148].

6.1 Introduction

Embedded microprocessors are vital resources in a wide array of low-end computing platforms. With the continuing growth of the IoT, simple processors are widely found in vehicles, appliances, health sensors, and infrastructure monitors, among other systems [149]. Often, these processors must operate in severely constrained environments with stringent power and performance demands. However, in many cases, predictable software security and reliability must be maintained.

Most previous efforts at providing software protection at the hardware level have included secure processors [150, 151, 152]. Although provably secure, these implementations focus more on preventing illegal execution of code and data security than on obscuring software control flow and algorithm implementation. For many real-world embedded systems, the storage of program code in external, untrusted memories and the knowledge of the ISA provide an unfortunate attack vector [153] regarding reverse engineering and intellectual property protection. Hence, as soon as the ISA can be concealed from an adversary, the effort level needed to disclose critical information from the program code (e.g., an algorithm implementation) rises.

Increasingly, microprocessors and other circuitry are implemented in devices which include field-programmable logic [154], providing an attractive opportunity to customize a processor’s control logic. Instruction decoding implemented in field-programmable logic offers a flexible solution to randomize instruction encoding among numerous embedded systems. Furthermore, such hardware-level alteration is agnostic to other processor architecture features.

Goals and Contributions. In this chapter, we focus on a hybrid approach to security via hardware-level obfuscation on the microarchitecture level and the use of software-level obfuscation techniques suited for embedded systems.
Our approach tackles the shortcomings of existing state-of-the-art ISA randomization defenses for an adversary with physical access to the target device. In particular, we address various generic disclosure attacks for embedded systems that can be exploited to extract critical instruction encoding information. We categorize these information disclosure sources as ranging from general hardware access capabilities, to program-specific characteristics, to the employed instruction encoding format. We then introduce our hybrid obfuscation design, which defeats the different attacks and thus prevents reverse engineering and software execution on an illegitimate platform. In summary, our main contributions are:

• Hardware-level Obfuscation. We introduce an ISA randomization scheme which prevents information disclosure by a broad range of attacks. The scheme alters the microprocessor decode unit using hardware-efficient transformations. Furthermore, we restrict memory accesses using a fine-grained, hardware-level policy and actively counteract manipulation attacks using hardware-level booby traps.


• Software-level Obfuscation. In concert with the augmented hardware, we employ software-level obfuscation techniques including Control Flow Graph (CFG)-level and instruction-level obfuscation to remove diverse program characteristics and thus overcome the general limitations of our cost-efficient ISA randomization.

• Coverage of Dynamic Adversaries. We provide a detailed analysis of various information disclosure sources for a physical adversary with dynamic access to the instruction bus. We also discuss the generic shortcomings of state-of-the-art ISA randomization defenses in our adversary model.

• Novel Evaluation Methodology. We present a novel metric for our hybrid obfuscation approach to express the effects of different obfuscation transformations. In particular, we incorporate statistical analysis of the instruction-level distributions and similarity analysis of the dynamic CFGs.
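To give a flavor of the first contribution, opcode substitution can be modeled as a secret permutation of the opcode space that the augmented decode unit inverts before regular decoding. The sketch below is a toy model only: the opcode list, seed derivation, and helper names are assumptions, and the actual scheme is specified in Section 6.3.1.

```python
import random

def make_substitution(opcodes, seed):
    """Toy model of opcode substitution: a device-specific secret seed
    derives a random permutation of the opcode space. The toolchain
    encodes programs with the permuted opcodes; the decode unit holds
    the inverse mapping."""
    rng = random.Random(seed)
    shuffled = opcodes[:]
    rng.shuffle(shuffled)
    encode = dict(zip(opcodes, shuffled))        # applied at program-install time
    decode = {v: k for k, v in encode.items()}   # realized in the decode logic
    return encode, decode

opcodes = ["nop", "jmp", "beq", "ld", "st", "add", "sub", "xor"]
enc, dec = make_substitution(opcodes, seed=0xC0FFEE)
program = ["ld", "add", "st", "jmp"]
obfuscated = [enc[op] for op in program]   # meaningless without the secret seed
recovered = [dec[op] for op in obfuscated]
assert recovered == program
```

A program encoded for one device-specific permutation executes as garbage on any processor holding a different inverse mapping, which is the intended diversity effect.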

6.2 Technical Background and Related Work

Our work builds on previous research in secure and attack-resistant processor design. The relationship between these works and our approach is detailed below.

Secure Processors. Over the past fifteen years, a number of secure processors have been developed. XOM [150] protects against piracy and manipulation of software by only allowing an application to execute code from specific memory locations. AEGIS [151] uses instruction set extensions to protect execution and access to segments of memory. The OASIS processor [152] also uses ciphered data, with cryptographic keys provided by Physical Unclonable Functions (PUFs). Other processors [155] do not include cryptographic cores but instead use PUFs to obfuscate portions of processor function (e.g., instruction opcodes). In order to mask memory accesses, Oblivious Random Access Memory (ORAM) can be utilized [156]. While these approaches provide provable security, the presence and use of cryptographic cores adds significant area and performance overheads and thus non-negligible costs [157] for the overall system. Our processor architectural changes are considerably more modest compared to cryptographic approaches and do not impact processor performance.

Side-Channel Attacks. As a consequence of physical adversarial access to embedded devices, SCA using power consumption or electromagnetic emanation can be exploited. For example, SCA attacks on the market-dominating Xilinx and Altera SRAM-FPGA families [2, 3] have been shown. SCA is not limited to cryptographic implementations: it can also be leveraged to extract the code of embedded processors based on electromagnetic emanation [158, 159]. However, this attack technique has various limitations, such as imperfect recognition rates and a restriction to opcode rather than operand detection.

Software Reverse Engineering and Obfuscation. A variety of software obfuscation and deobfuscation approaches have been developed over decades.
Research on software security to hamper reverse engineering has yielded various transformations that restrict static and dynamic analysis [160, 161, 162]. These transformations include code flattening, data encoding, on-demand code decryption, and virtual-machine based techniques [163]. To automatically reverse engineer programs equipped with obfuscation transformations, automatic deobfuscation techniques have

been developed [164, 165]. Note that these approaches generally require knowledge of the ISA to emulate or statically analyze the targeted program.

Statistical analyses of the targeted program are employed for malware detection and classification. For example, frequency analysis, entropy, and hidden Markov models of instructions are able to classify malware among several families [166, 167, 168]. Also, several algorithms have been proposed to measure the similarity between CFGs with the goal of malware detection [119]. In particular, different approaches were developed based on subgraphs [109] and graph edit distance [110]. Software exploitation [169] and software diversity [170] are related topics, but focus on different adversary models. In our evaluation, we merge the concepts of statistical analysis and CFG similarity to demonstrate the influence of obfuscation techniques and to define a security metric.

ISA Randomization. As software obfuscation suffers from the fundamental limitation of a known ISA, various approaches were developed to randomize the ISA with the goal of code injection mitigation [171, 172, 173, 174, 175, 176, 177]. These transformations modify the original instruction encoding and employ additional hardware circuitry to retrieve the original instruction prior to the decode phase. Contrary to our work, these related works focus on a different adversary model without physical access capabilities.

6.2.1 System and Adversary Model

In the following, we specify the assumptions for our hybrid obfuscation defense and identify the generic disclosure attacks in this setup.

System Model

We assume a simple and low-cost Reduced Instruction Set Computer (RISC)-based processor architecture. In particular, we assume the use of secure internal Random Access Memory (RAM) for stored data and insecure in-system (external) Read-Only Memory (ROM) for program memory. Due to the processor’s simplicity, we preclude the use of caches. We further assume Memory Mapped Input/Output (MMIO) to perform communication with external peripheral devices. Overall, we assume that the underlying CPU hardware is trustworthy. In contrast, the external ROM and all external bus interfaces are untrustworthy. Hardware debugging features (e.g., JTAG) that reveal values of internal registers or RAM are excluded from the device or are disabled.

Adversary Model

We suppose that the adversary has physical access to the target device. His main goal is the reverse engineering of high-level information from the program such as protocols, cryptographic keys, or the algorithm(s) itself. Based on the physical access, the adversary can read and write arbitrary values in the untrusted program memory. Furthermore, the adversary is capable of dynamic read/write access to the external bus interfaces via probing and tampering with low-speed and high-speed buses. Passive side-channel analysis can be leveraged by the adversary, however, invasive attacks or gate-level modifications are outside the scope of this work.


Adversarial Disclosure Attacks

Instruction execution results in a variety of actions such as register and memory reads/writes and status flag register updates. Hence, the attacker can obtain execution information even without knowledge of the instruction encoding.

Control Flow Attack. The adversary obtains the address of the next fetched instruction, and thus the instruction pointer, by passive access to external interfaces. Hence, all control flow instructions and thus the Dynamic Control Flow Graph (DCFG) are immediately revealed, and the type (unconditional, conditional, call/return) of each executed control flow instruction is disclosed. Even in the case of an obfuscated instruction encoding, the adversary can exploit control flow instructions to reveal the encoding by manipulation of the instruction operand and observation of the next fetched instruction. Similarly, branch instructions can be exploited to check whether two register values meet a condition (depending on the specific branch opcode) at runtime through modification of the source data in the instruction operand.

Input/Output Attack. The adversary can observe communication from the CPU to peripheral devices through passive access to the external interfaces. Even if the instruction encoding is obfuscated, the adversary can exploit I/O instructions to disclose internal, dynamic values by manipulation of the source data encoding in the instruction operand.
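The bit-flipping probe behind the control flow attack can be sketched as follows. The 16-bit word size, the bit-reversal "encoding", and the CPU model below are purely illustrative assumptions; the point is that any bit whose flip changes the observed next fetch address must feed the jump's operand field, regardless of how the encoding is hidden.

```python
def locate_target_bits(execute, stored_word, width=16):
    """Flip one bit of the stored (possibly obfuscated) jump word at a
    time and observe the next fetch address on the bus; bits whose flip
    moves the target belong to the operand field of the encoding."""
    base = execute(stored_word)
    return [b for b in range(width)
            if execute(stored_word ^ (1 << b)) != base]

# Toy CPU whose secret transformation is a bit reversal of the 16-bit
# word; the jump target is the low 12 bits of the de-permuted word.
def cpu_next_fetch(word):
    plain = int(f"{word:016b}"[::-1], 2)  # secret permutation, unknown to the attacker
    return plain & 0xFFF                  # next fetch address = 12-bit target

stored = 0b1010000000000000
print(locate_target_bits(cpu_next_fetch, stored))  # -> [4, 5, ..., 15]
```

The probe needs no knowledge of the permutation: observing which flips move the target directly reveals which stored bits map into the operand field.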

The following three attacks may not directly reveal the instruction encoding, as they depend on the underlying CPU and instruction set, but they can aid the adversary’s reverse engineering.

System Configuration Attack. In general, the adversary can deduce valuable system information by means of the system configuration. Any displayed errors can be used to deduce sensitive information [178]. Even if the instruction encoding is obfuscated, a manipulated system configuration instruction supports the adversary’s reverse engineering efforts to understand the system (e.g., sleep activation, interrupt deactivation, or timer unit configuration). For example, if a manipulated instruction leads to the CPU entering sleep mode, the adversary can deduce which bits in the obfuscated encoding belong to the opcode field. Similarly, any displayed error discloses valuable information regarding the instruction encoding to the adversary. For example, if a manipulated instruction leads to an invalid opcode error, the adversary can deduce which bits belong to the opcode field. We want to highlight that such an attack was leveraged to reverse engineer x86 processor microcode [179].

Instruction Timing Attack. Different instruction groups can include instructions which consume a varying number of clock cycles. For example, arithmetic instructions may require more cycles than logical instructions. Based on precise measurements of the consumed clock cycles per instruction, the adversary can disclose the targeted instruction’s group [178]. For example, if a manipulated instruction leads to a different execution time after modification, the adversary can deduce which bits in the obfuscated encoding belong to the opcode.

Correctness Attack. The deterministic behaviour of a targeted part of a program may be determined by examining the program’s output. Even if the instruction encoding is obfuscated, the adversary can gain information.
In particular, if a deterministic part of a program is altered in such a way that its semantics are preserved, the program output remains the same; see Section 6.7 for a detailed attack description.

The related works in Table 6.1 do not assume a physical and dynamic adversary. Thus, all publicly known ISA randomization schemes can be circumvented by use of these attacks.

101 Chapter 6. Hybrid Obfuscation to Protect Intellectual Property on Embedded Microprocessors

Approach ISA Randomization Technique CF Attack√ I/O Attack√ [171] XOR √ √ [172] XOR / Instr. Permutation √ √ [173] XOR √ √ [177] Instr. Perm. and Operand Subst. √ √ [180] XOR / Instr. Permutation Ours see Section 6.3––

Table 6.1: Overview of ISA randomization schemes and their susceptibility to control flow and I/O attacks. example, the key k of XOR-based schemes can be obtained via the control flow attack as follows: an unconditional control flow instruction i ⊕ k is executed and the adversary obtains the next fetched instruction and hence obtains the operand of the instruction i. For permuted instructions, the adversary simply toggles each bit per trial to obtain the manipulated next instruction address and thereby reveals the permutation.
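The key recovery against XOR-based schemes sketched above can be expressed in a few lines. This is an illustrative sketch: the 32-bit jmp encoding and the key value are hypothetical, but the XOR structure of the attack is as described.

```python
# Sketch of the control flow attack against an XOR-based ISA randomization
# scheme. JMP_NATIVE is a hypothetical 32-bit jmp encoding chosen purely for
# illustration; real encodings differ, but the XOR structure is the same.
JMP_NATIVE = (0b000010 << 26) | 0x3FFFFFF  # assumed opcode, operand all ones

def recover_key(randomized_word: int, native_word: int = JMP_NATIVE) -> int:
    """XOR is an involution: k = (i ^ k) ^ i."""
    return randomized_word ^ native_word

k = 0x5A5A5A5A                 # the defender's secret key (example value)
randomized = JMP_NATIVE ^ k    # the word the adversary observes in program memory
assert recover_key(randomized) == k
```

A single known plaintext/ciphertext instruction pair therefore suffices to recover the full key word.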

6.2.2 Instruction Set Architecture

The assembly language format of a microprocessor is a crucial component for an obfuscation scheme. Listing 6.1 defines a generic assembly language in Backus-Naur Form (BNF). This language can be mapped to virtually any assembly language.

⟨Program⟩ ::= ⟨Program⟩ ⟨Inst⟩
⟨Inst⟩    ::= ⟨M⟩ | ⟨CF⟩ ⟨imm⟩ | ⟨DT⟩ ⟨reg⟩ ⟨reg⟩ ⟨imm⟩ | ⟨DTC⟩ ⟨imm⟩ | ⟨AL⟩ ⟨reg⟩ ⟨reg⟩ ⟨val⟩
⟨M⟩       ::= 'nop' | 'sleep'
⟨CF⟩      ::= 'call' | 'ret' | 'jmp' | 'beq' | 'bne'
⟨DT⟩      ::= 'ld' | 'st'
⟨DTC⟩     ::= 'ldc' | 'stc'
⟨AL⟩      ::= 'and' | 'or' | 'not' | 'xor' | 'sll' | 'srl' | 'slt' | 'add' | 'sub' | 'mul' | 'div'
⟨val⟩     ::= ⟨imm⟩ | ⟨reg⟩

Listing 6.1: Generic RISC-based assembly language in BNF clustered into instruction groups.

In general, the instructions of the assembly language can be grouped into four distinct groups:

• Control Flow ⟨CF⟩: Instructions which cause an unconditional control flow change (jmp), branch instructions (beq, bne), and function calls (call, ret).


• Data Transfer ⟨DT⟩, ⟨DTC⟩: Instructions which transfer data via load (ld) and store (st) between the different memories such as the register file and the RAM.

• Arithmetic/Logical ⟨AL⟩: Instructions that modify the data contents by means of logical and arithmetic functions such as add and xor.

• Miscellaneous ⟨M⟩: All remaining instructions such as nop and sleep.

Note that the rationale for the missing registers in the branch instructions and the additional ldc and stc instructions is stated in Section 6.3.3.
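The grouping above can be sketched as a simple lookup. This is an illustrative helper, not part of the thesis framework; data transfer covers both the ⟨DT⟩ and ⟨DTC⟩ nonterminals.

```python
# Illustrative grouping of the mnemonics from Listing 6.1 into the groups above.
GROUPS = {
    "CF":  {"call", "ret", "jmp", "beq", "bne"},
    "DT":  {"ld", "st"},
    "DTC": {"ldc", "stc"},
    "AL":  {"and", "or", "not", "xor", "sll", "srl", "slt",
            "add", "sub", "mul", "div"},
    "M":   {"nop", "sleep"},
}

def group_of(mnemonic: str) -> str:
    for group, mnemonics in GROUPS.items():
        if mnemonic in mnemonics:
            return group
    raise ValueError(f"unknown mnemonic: {mnemonic}")

assert group_of("beq") == "CF"
assert group_of("stc") == "DTC"
assert group_of("sll") == "AL"
```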

Instruction Format

We employ the Microprocessor without Interlocked Pipeline Stages (MIPS) instruction format due to its wide use in embedded systems and its simplicity. The latter characteristic is particularly advantageous for security analysis and implementation.

Type   Instruction Format (bit index 31 … 0)
R      opcode(6)  rsrc1(5)  rsrc2(5)  rdst(5)   shamt(5)  funct(6)
I      opcode(6)  rsrc(5)   rdst(5)   immediate(16)
J      opcode(6)  address(26)

Figure 6.1: MIPS instruction format with 32-bit width. The bit-width of each instruction field is given in parentheses after the field name.

Each MIPS instruction is encoded in a 32-bit vector with a 6-bit opcode. There are three distinct types of instruction formats: R-type instructions encode two source registers rsrc1, rsrc2 and a destination register rdst, each 5-bit wide, a 5-bit shift amount field shamt, and a 6-bit function field funct. The funct field further specifies instruction operation beyond the opcode. I-type instructions employ a source register rsrc, a destination register rdst, and a 16-bit immediate value. The operand of J-type instructions only consists of a 26-bit address field.
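The field boundaries above translate directly into shift-and-mask operations. A sketch using the field names of Figure 6.1 (bit 31 is the most significant bit):

```python
# Decoding the three MIPS instruction formats of Figure 6.1 from a 32-bit word.
def decode(word: int) -> dict:
    fields = {"opcode": (word >> 26) & 0x3F}          # bits 31..26
    fields["r"] = {
        "rsrc1": (word >> 21) & 0x1F,                 # bits 25..21
        "rsrc2": (word >> 16) & 0x1F,                 # bits 20..16
        "rdst":  (word >> 11) & 0x1F,                 # bits 15..11
        "shamt": (word >> 6) & 0x1F,                  # bits 10..6
        "funct": word & 0x3F,                         # bits 5..0
    }
    fields["i"] = {
        "rsrc": (word >> 21) & 0x1F,
        "rdst": (word >> 16) & 0x1F,
        "immediate": word & 0xFFFF,                   # bits 15..0
    }
    fields["j"] = {"address": word & 0x3FFFFFF}       # bits 25..0
    return fields

# Example R-type word: opcode 0, rsrc1=3, rsrc2=4, rdst=5, shamt=0, funct=0b100000
word = (3 << 21) | (4 << 16) | (5 << 11) | 0b100000
f = decode(word)
assert f["opcode"] == 0 and f["r"]["rdst"] == 5 and f["r"]["funct"] == 0b100000
```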

6.3 Hardware-level Obfuscation

A fundamental limitation of software-level obfuscation techniques is the adversary's knowledge of the instruction encoding. Our hardware-level obfuscation targets this encoding knowledge while simultaneously accounting for the disclosure attacks.

6.3.1 Opcode Substitution

The principal idea of our proposed opcode substitution transformation is to employ a randomized encoding of the opcode field so that the adversary does not know which opcode maps to which operation. To enhance the effect of the substitution, we consider homophonic ciphers.


[Figure 6.2 depicts the CPU datapath: a trusted area containing the general purpose register file, ALU, RAM, and instruction memory, augmented with the operand permutation and opcode substitution units, the memory sanitizer, the interface registers rI/O, rb0, rb1 in the border control (BC) area, the MMIO interface, and the booby trap trigger logic of the control unit (ldc, stc, branch).]

Figure 6.2: CPU datapath augmented with the hardware-level obfuscation features (marked in orange).

Homophonic Substitution Ciphers. A homophonic substitution cipher maps each plaintext symbol to one or more ciphertext symbols, called homophones, thereby flattening the ciphertext symbol distribution and obstructing ciphertext symbol frequency analysis compared to simple substitution ciphers. Since their appearance, homophonic substitution ciphers have been successfully attacked by exploiting inherent characteristics of human language [181, 182, 183]. In contrast to human language, an instruction sequence can be arbitrarily altered to hide relevant statistical information. For example, the instruction add r1,r2,2 can be split up into an arbitrary combination of multiple add, sub, or shift instructions so that the semantics are preserved.

Based on the concept of homophones, we informally define our employed opcode substitution as follows: the opcode substitution transformation randomly replaces the native instruction opcode according to a pre-defined relation. In particular, the relation is right-total and left-unique with respect to the native ISA encoding. To recover the native opcode during execution, the decode unit of the CPU implements the inverse mapping of the right-total, but not right-unique, and left-unique opcode substitution relation¹. This transformation is scalable, as the number of relations from one native opcode to the codomain elements can be chosen freely. For an implementation, the upper bound depends on the employed instruction format and the number of supported instructions. In the case of the MIPS instruction format, the 6-bit funct field is also affected by the opcode substitution, as it encodes opcode information (jointly with the 6-bit opcode field).

Example. We assume an ISA with a 2-bit opcode that only employs the two opcodes 00 and 01. Opcode substitution maps the original opcodes as follows: opcode 00 is related to 10

¹Properties of a binary relation R between the sets S and T:
Right-total: ∀t ∈ T ∃s ∈ S: (s, t) ∈ R
Right-unique: ∀s ∈ S ∀t, t′ ∈ T: (s, t) ∈ R ∧ (s, t′) ∈ R ⇒ t = t′
Left-unique: ∀s, s′ ∈ S ∀t ∈ T: (s, t) ∈ R ∧ (s′, t) ∈ R ⇒ s = s′

and 01, and opcode 01 is related to 11 and 00. Thus, all 2-bit values are employed by the opcode substitution (right-total), and any substituted opcode relates to only one original opcode (left-unique). Since opcode 00 relates to more than one value, the relation is not right-unique.
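The 2-bit example can be sketched directly: a right-total, left-unique substitution table and the inverse mapping implemented by the decode unit.

```python
import random

# Sketch of the 2-bit example: opcode 00 maps to {10, 01}, opcode 01 to {11, 00}.
SUBSTITUTION = {0b00: [0b10, 0b01], 0b01: [0b11, 0b00]}
INVERSE = {h: op for op, hs in SUBSTITUTION.items() for h in hs}  # left-unique

def obfuscate(opcode: int, rng=random) -> int:
    return rng.choice(SUBSTITUTION[opcode])  # pick a homophone at random

def decode_opcode(homophone: int) -> int:
    return INVERSE[homophone]  # the decode unit implements the inverse mapping

# Right-total: every 2-bit value occurs as a homophone; decoding is unambiguous.
assert sorted(INVERSE) == [0b00, 0b01, 0b10, 0b11]
assert all(decode_opcode(obfuscate(op)) == op
           for op in SUBSTITUTION for _ in range(10))
```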

6.3.2 Operand Permutation

The instruction operand field(s) encode the quantities of the operation and thus enable data flow analysis. For example, immediate values can define branch decisions, memory accesses, or constants. Even without knowledge of the opcode (which defines the interpretation of the operand), the operand field can yield meaningful information. Therefore, we apply an operand permutation transformation to the instruction operand field as follows: the operand permutation transformation permutes the bit indices of the operand field by a pre-defined random bit permutation. Program memory is split into chunks, and each chunk of instructions uses its own randomly chosen permutation. To recover the native operand during execution in the CPU, the inverse permutation is selected based on the instruction address, see Figure 6.2. This transformation is scalable, as the number of permutations can be chosen freely. For the employed MIPS format, the whole 26-bit operand is permuted independently of the instruction type. As a result, the 6-bit funct field of R-type instructions is spread across the operand.

Unobfuscated Instructions. As a consequence of the disclosure attacks, we do not obfuscate the whole instruction set. The following instructions are affected by neither the opcode substitution nor the operand permutation.

• Control flow instructions: jmp, call, ret, beq, bne.

• System configuration instructions: sleep.

• Data transfer instructions: ldc, stc.
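The operand permutation transformation above can be sketched as follows. This is an illustrative model; the chunk selection by instruction address is omitted.

```python
import random

# Sketch: permuting the 26 operand bits with a randomly chosen permutation and
# inverting it at decode time (one such permutation per program memory chunk).
def make_permutation(rng: random.Random, width: int = 26) -> list:
    perm = list(range(width))
    rng.shuffle(perm)
    return perm

def permute(operand: int, perm: list) -> int:
    out = 0
    for dst, src in enumerate(perm):
        out |= ((operand >> src) & 1) << dst  # move bit src to position dst
    return out

def invert(perm: list) -> list:
    inv = [0] * len(perm)
    for dst, src in enumerate(perm):
        inv[src] = dst
    return inv

perm = make_permutation(random.Random(42))
operand = 0x2ABCDEF & ((1 << 26) - 1)
# The decode unit applies the inverse permutation to recover the native operand.
assert permute(permute(operand, perm), invert(perm)) == operand
```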

6.3.3 Hardware-enforced Access Control

Hardware-level obfuscation of the instruction encoding does not prevent instruction tampering via the I/O or control flow attacks described in Section 6.2.1. The vulnerability arises from the lack of hardware-level access control on the data transfer interface between the CPU and peripheral devices. To protect against I/O and branch instruction manipulation, we propose a lightweight hardware-based access control mechanism. Techniques and conditions to detect and respond to tampering are described in Section 6.3.4.

Hardware-enforced access control restricts the operand registers and memory addresses of I/O and branch instructions to a certain memory area rather than to the general purpose registers. The unobfuscated data transfer (ldc, stc) and branch instructions (beq, bne) are used in conjunction with three interface registers rI/O, rb0, rb1 to transfer data to or from insecure memory areas (e.g., memory addresses used for I/O). These registers are isolated from the general purpose registers and cannot be accessed as general purpose register operands by hardware-level obfuscated instructions.

Access Control Policies. To realize the access control, several hardware-level policies are employed. The internal addressable memory of the microprocessor, which includes the three interface

registers, is divided into three parts: the general-purpose GP, the border control BC, and the MMIO areas.

• Hardware-level obfuscated instructions are forced to operate on locations in the GP and BC areas, but not the MMIO area.

• Unobfuscated data-flow and branch instructions are forced to operate in the BC and MMIO area, but not the GP area.

• The interface register rI/O (accessible via an address in the BC area) is implicitly employed by the ldc and stc instructions². The register is accessed by the CPU using its memory address or implicitly by the ldc and stc instructions.

• The interface registers rb0 and rb1 (accessible via addresses in the BC area) are implicitly utilized by branch instructions. The registers are accessed by the CPU using memory addresses or implicitly by branch instructions.
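The policies above can be sketched as a simple address check. The concrete memory layout below is an assumption chosen for illustration; the thesis leaves the layout user-defined.

```python
# Sketch of the access control policies with an illustrative memory layout:
# general-purpose (GP) area, border control (BC) area holding the interface
# registers, and the MMIO area. All boundary addresses are assumed values.
GP   = range(0x0000, 0x8000)
BC   = range(0x8000, 0x8010)    # holds rI/O, rb0, rb1
MMIO = range(0x8010, 0x10000)

def access_allowed(instr: str, address: int) -> bool:
    if instr in ("ld", "st"):     # obfuscated data transfer: GP and BC only
        return address in GP or address in BC
    if instr in ("ldc", "stc"):   # unobfuscated data transfer: BC and MMIO only
        return address in BC or address in MMIO
    raise ValueError(f"not a data transfer instruction: {instr}")

assert access_allowed("st", 0x8000)       # write to an interface register
assert not access_allowed("st", 0x9000)   # obfuscated st into MMIO: booby trap
assert not access_allowed("ldc", 0x0100)  # unobfuscated ldc into GP: booby trap
```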

I/O Instructions. To transfer data between the CPU and peripheral devices, the following steps are used:

• Read from MMIO. To transfer data from peripheral devices to the CPU, the data is first read via an ldc instruction and placed in rI/O. The data is then read from rI/O and placed in the internal register file via an ld instruction.

• Write to MMIO. To transfer data from the CPU to the peripheral devices, the data is first written to rI/O via an st instruction. The data is then written from rI/O to a MMIO address via a stc instruction.

Control Flow Instructions. Before a branch instruction is executed, the two values to be compared are loaded by the CPU into rb0 and rb1 via st instructions. Afterwards, the branch instruction, which implicitly accesses the registers rb0 and rb1, is executed.

For the employed MIPS ISA, we augment the CPU with a further dedicated register for the call/return mechanism. The MIPS return instruction jr ra jumps to the value in the register ra, which could be exploited to reveal dynamic values in the general purpose register file. Hence, the additional register is hardware-enforced to only operate on the jal/jr instructions that store and restore the program counter upon execution of a call and return, respectively.

With hardware-level access control in place, the goal of the adversary is to find an appropriate obfuscated ld/st instruction to construct an attack. The key idea is that the obfuscated st and ld instructions (which write and read data to and from the memory-mapped registers) can be hidden by the software-level obfuscation techniques described in Section 6.4.

6.3.4 Hardware-level Booby Traps

In addition to hardware-enforced access control, a hardware-level tamper response mechanism is needed to prevent disclosure of the ISA encoding. This response mechanism is added to our defense arsenal to protect against a variety of attacks, see Section 6.7.

²For example, an ldc 0xabcd instruction loads the value from address 0xabcd into the register rI/O. A stc 0xabcd instruction stores the value from the register rI/O to address 0xabcd.


A booby trap is an active defense that is directly triggered by a detected attack. Such approaches have been developed for software protection [184] and for the protection of high-security hardware devices. Hardware systems with tamper detection and response include hardware firewalls in smart cards [185] and the erasure of cryptographic material in response to an attack [186].

Booby Trap Triggers. For our system, we employ the following triggers to detect an attack:

• Invalid Memory Access. The access control unit triggers a booby trap once an invalid memory access by the ld/st or ldc/stc instructions is executed.

• Dedicated Opcodes. Several dedicated opcodes (particularly R-type instructions) are reserved to trigger a booby trap on execution.

• Malformed Operands. The instruction format is leveraged to detect malformed operands and subsequently trigger a booby trap on execution. For example, a non-shift R-type instruction that encodes a non-zero shamt value triggers a booby trap.

Booby Trap Tamper Response. To prevent the disclosure of the instruction encoding, the instruction decode unit can be cleared in a non-volatile manner to prevent further operation. For practical purposes, a counter could be employed to detect multiple attack attempts prior to the burning of a fuse that disables the unit.
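The coverage of the malformed-operand trigger can be checked numerically: with 8 non-shift R-type opcodes, every operand encoding with a non-zero 5-bit shamt field fires the trap. The figures below match those quoted in the security analysis (Section 6.7).

```python
# For each of the 8 non-shift R-type opcodes, 2^26 operand encodings exist,
# of which a fraction of 31/32 carries a non-zero 5-bit shamt field.
NON_SHIFT_R_OPCODES = 8
triggering = NON_SHIFT_R_OPCODES * 2**26 * 31 // 32
assert triggering == 520_093_696          # count quoted in Section 6.7

# Probability that a uniformly random 32-bit instruction fires this trap:
probability = triggering / 2**32
assert round(probability, 2) == 0.12      # around 12%
```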

6.4 Software-level Obfuscation

A limitation of hardware-level obfuscation is the preservation of the program structure. Despite an obfuscated instruction encoding, the CFG and the instruction sequencing provide valuable information sources, as demonstrated in Section 6.8. Hence, we employ well-established software obfuscation transformations to hide these information sources.

Control Flow Graph-Level Obfuscation. The control flow of a call graph and the associated basic block topology provide viable information for a reverse engineer. Looping and conditional structures can be exploited to deduce information regarding the high-level algorithm implementation. To obscure these traces, code flattening and procedure merging are employed [187]. Code flattening inserts a switch statement and a dispatcher that controls the basic block execution sequence. Procedure merging scrambles control at both the function and the interprocedural level by inlining functions within each other.

Basic Block-Level Obfuscation. CFG-level obfuscation effectively disrupts control flow assumptions and splits large basic blocks; however, small basic blocks remain unchanged. To hide small basic blocks, we utilize basic block normalization. The number of instructions per basic block is normalized by filling space with garbage code that does not affect the semantic behavior of the basic block. A range of basic block sizes is used, and the basic block positioning is permuted in memory to hinder static analysis.

Instruction-Level Obfuscation. CFG and basic block level transformations do not remove all program characteristics. Dedicated instruction sequences and memory accesses are not affected by these obfuscation techniques. To remove potential code structure assumptions, such as the stack clean-up after a function call, instruction substitution is used, which replaces specific instruction sequences by other functionally equivalent but more complicated instruction sequences [187]. This technique is a vital component of our statistical evaluation.


Instruction substitution is used to hide the critical ld/st instructions employed for the hardware-level access control, see Section 6.3.3. For example, an instruction substitution rule can easily be defined for (virtually) any architecture which replaces an ld instruction by a stack pointer displacement, a pop, and a subsequent recovery of the original stack pointer. The stack pointer displacement can itself consist of multiple instructions, so that the basic block normalization spreads this information across diverse basic blocks. Note that the register value that is added to the immediate value of an ld/st instruction can be arbitrarily changed as long as the sum of both values (the target address) remains the same. This effect greatly increases the number of distinct representations.
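The register/immediate split noted above can be sketched as follows. This is an illustrative model of the diversification, not the Obfuscator-LLVM pass itself.

```python
import random

# Sketch: hiding an ld/st target address by splitting it into a base register
# value and an immediate; any pair summing to the same address is equivalent.
def randomize_ld(address: int, rng: random.Random) -> tuple:
    base = rng.randrange(0, address + 1)   # value placed in the base register
    return base, address - base            # (register value, immediate)

rng = random.Random(1)
for _ in range(100):
    base, imm = randomize_ld(0x8000, rng)
    assert base + imm == 0x8000            # the target address is preserved
```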

6.5 Implementation

To demonstrate our hybrid obfuscation approach, we have developed a soft microprocessor generation framework to automatically implement protected CPUs. This framework is supported by a full compilation flow which performs our structural transformations.

Hardware Implementation. Our prototype hardware implementation employs the SPREE soft processor generation framework [154]. Processors generated with the framework use the MIPS instruction format. An RTL description of a specific processor instantiation is generated from textual descriptions of the ISA and the processor data path. An RTL generator was written to include hardware for a randomized ISA encoding and to insert the hardware units described in Section 6.3. The number of opcode homophones and operand permutations in the processor hardware implementation is user-defined with respect to the instruction format. Similarly, the number of booby trap triggering opcodes and the memory layout are also user-defined.

Software Implementation. Our prototype compiler implementation leverages the LLVM/Clang compiler and the Obfuscator-LLVM environment developed by Junod et al. [187]. The latter framework provides transformations such as code flattening and instruction substitution. We extended the Obfuscator-LLVM environment to implement basic block normalization. To allow for late code address determination, we modified the GNU assembler to generate two non-linked .o object files. Each object file is created with a different ISA encoding: ISA A and ISA B. The object files are then linked to their libraries, which were also encoded with ISAs A and B. Relocatable instructions are then identified by examining instructions that differ between the two files after the obfuscation is removed. Once all relocatable instructions have been processed, an object file is linked and the address-dependent operand permutation is applied. The final executable file is fully linked and the ISA encoding is employed.

6.6 Performance Evaluation

In this section, we present area and run-time performance results for the hardware and software obfuscation transformations.

Area Overhead. We evaluated the area overheads of the dedicated hardware units for varying hardware-level obfuscation parameters, based on the SPREE processor generation framework. The hardware designs were synthesized for an Altera Cyclone V A2 FPGA. All designs were tested via simulation to verify their correct behavior.


Design                     LUTs    FFs     Mem. Bits
SPREE + BT                 915     204     2048
SPREE + BT + OS + OP(1)    933     204     2304
SPREE + BT + OS + OP(2)    936     204     2304
SPREE + BT + OS + OP(4)    976     204     2304
SPREE + BT + OS + OP(8)    1015    204     2304
SPREE                      889     202     2048
Available Resources        18868   112960  1802000

Table 6.2: Hardware area overhead for the additional obfuscation elements.

Table 6.2 lists the utilized hardware resources in terms of LUTs and FFs for the different hardware units. The booby trap (BT) unit realizes the logic to trigger a booby trap for incorrect memory accesses, malformed instructions, and dedicated opcodes, as described in Section 6.3.4. We considered two dedicated opcode triggers and one MMIO address. The opcode substitution (OS) and the operand permutation (OP) implement the hardware-level transformations described in Section 6.3.1 and Section 6.3.2, respectively. The OS designs include a homophonic substitution which employs each possible value for the opcode and funct fields, the extended register file, and support for the ldc/stc instructions. OP(x) denotes that x distinct operand permutations are implemented.

The hardware overhead of the BT circuitry is lightweight in terms of additional LUTs and FFs, and the overhead of the OS consists of only several LUTs. Increasing the number of operand permutations OP results in a hardware overhead of up to 14% in LUTs compared to the original SPREE processor. The 36 registers (= 32 general purpose registers + rI/O + rb0 + rb1 + 1 call/ret register) are implemented in the memory blocks of the FPGA. Similarly, the internal 64 kB RAM is implemented in the memory blocks of the FPGA.

The processor speed is not affected by the hardware-level augmentations listed in Table 6.2. The additional elements do not affect the critical path of the design in the execute stage of the three-stage processor (ALU and regFile); our modifications are restricted to the instruction fetch and memory stages.

Performance and Memory Overhead. We evaluated the performance overhead of the different obfuscation transformations on the SPREE embedded benchmark suite [154]. The obfuscation strategies were applied to the programs, and the clock cycle counts for 100 versions of each program were measured.
In the following discussion, the employed obfuscation transformations are abbreviated as follows: opcode substitution (OS), operand permutation (OP), instruction substitution (IS), code flattening (CF), and basic block normalization (BBN). Procedure merging was not considered for the evaluation, as the targeted benchmark programs generally consist of only one function.

Table 6.3 lists the performance results of the different obfuscation strategies for the SPREE benchmark programs. The average performance slowdown of IS ranges between approx. 1.1× and 2× for all programs except des. For des, the influence of IS is significant, as the cryptographic algorithm employs numerous xor instructions that are swapped for more complex representations. The CF transformation leads to a performance slowdown that depends on the underlying program structure. Since the implementation of the des program is loop-free, CF does not affect its performance. For non-loop-free programs, the slowdown depends on the high-level program structure.


                   bubbl.   crc      CRC32     des    fact.   fft      fir     iquant   quant
OS+OP              8012     24611    568886    1097   139     3067     1085    3401     3697
OS+OP+IS           8308     45930    1085654   4428   143     3784     1286    4485     4425
OS+OP+CF           36687    41514    1165379   1097   454     18187    7519    11217    10020
OS+OP+CF+IS        38105    61237    1641889   4429   455     19659    7801    12710    11854
OS+OP+CF+IS+BBN    48293    191189   6609719   4439   1732    119486   22160   35451    42932
Unobfuscated       8012     24611    568886    1097   139     3067     1085    3401     3697

Table 6.3: Software performance evaluation for the obfuscation strategies. Each result indicates the number of cycles arithmetically averaged over 100 programs.

                   bubbl.   crc    CRC32   des     fact.   fft    fir    iquant   quant
OS+OP              1.42     1.40   1.56    5.92    1.35    2.04   1.94   1.93     1.98
OS+OP+IS           1.44     1.57   1.92    20.62   1.36    2.20   2.07   2.12     2.15
OS+OP+CF           1.87     1.73   2.03    5.92    1.86    2.95   2.68   2.62     2.71
OS+OP+CF+IS        1.89     1.90   2.34    20.50   1.86    3.15   2.83   2.96     3.03
OS+OP+CF+IS+BBN    2.39     2.77   3.33    20.69   2.33    4.83   3.73   4.11     4.33
Unobfuscated       1.42     1.40   1.56    5.92    1.35    2.04   1.94   1.93     1.98

Table 6.4: Software size evaluation for the obfuscation strategies. Each result indicates the program memory size in kB arithmetically averaged over 100 programs.

The measured slowdown ranges between approx. 2× and 6×. For the combined OS+OP+CF+IS strategy, the effects of both transformations are additive. The CF transformation adds a switch statement to the target program and uses a variable to control the execution order; this transformation is not affected by the IS implementation. The combined obfuscation approach OS+OP+CF+IS+BBN degrades performance more significantly: the CF transformation splits up basic blocks, which are then padded by BBN. Hence, performance is slowed down by a factor of approximately 4× to 39× compared to the unobfuscated version.
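The quoted 4×–39× range can be cross-checked directly against the cycle counts of Table 6.3 (unobfuscated vs. the full OS+OP+CF+IS+BBN strategy):

```python
# Cycle counts taken from Table 6.3.
unobfuscated = {"bubbl.": 8012, "crc": 24611, "CRC32": 568886, "des": 1097,
                "fact.": 139, "fft": 3067, "fir": 1085, "iquant": 3401,
                "quant": 3697}
full = {"bubbl.": 48293, "crc": 191189, "CRC32": 6609719, "des": 4439,
        "fact.": 1732, "fft": 119486, "fir": 22160, "iquant": 35451,
        "quant": 42932}

slowdown = {p: full[p] / unobfuscated[p] for p in unobfuscated}
assert round(min(slowdown.values())) == 4    # des: ~4x
assert round(max(slowdown.values())) == 39   # fft: ~39x
```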

Table 6.4 states the program sizes for the different obfuscation strategies. For the IS transformation, most targeted programs grow by only several hundred bytes. For reasons similar to those given in the performance evaluation, des is strongly influenced by IS. The CF transformation affects the program size depending on the underlying program structure. Since the targeted programs of the suite contain a small number of loops and branches, the program size increases by less than 1 kB compared to the OS+OP obfuscated programs. The combined obfuscation strategy OS+OP+CF+IS+BBN increases the binary size, as each basic block is padded with instructions. A binary size overhead of up to 3 kB is observed for the targeted programs compared to the OS+OP obfuscated programs; the des program size increases by 4× compared to the OS+OP obfuscated program.

In summary, software protection naturally affects code performance and binary size depending on the degree of obfuscation. Our program slowdown and binary size results are in line with several other related works in this field, see Section 6.2. The performance and binary size effects of hardware obfuscation are much more limited.


6.7 Security Analysis

In this section, a security analysis is provided with respect to the attacks available to the adversary as presented in Section 6.2.1.

Hardware-level Booby Traps. The booby trap mechanism provides a crucial anchor to prevent the disclosure of the processor's instruction encoding in response to an attack. To the best of the authors' knowledge, current SRAM-FPGA families only support fuses which can be programmed via external FPGA access. Relying on this type of fuse programming risks disruption by the adversary. Alternatively, a booby trap can be used in which the adversary triggers the device to reload its entire configuration before the soft microprocessor is reactivated. This action requires a time-consuming process. For example, the smallest Cyclone V device (A2) requires 21,061,028 configuration bits [188]. Configuration can be performed 16 bits at a time at 125 MHz. Thus, configuration requires at least 10.5 ms per booby trap trigger.

Control Flow Attack. jmp, call, and ret instructions cannot be exploited to disclose instruction encoding information, as these instructions are not affected by the hardware-level obfuscation. We illustrate our resistance to the control flow attack via an example. In a possible attack, the attacker would write values from the register file to the interface registers rb0 and rb1 in order to compare the values using an unobfuscated beq instruction. Therefore, the attacker must craft two store instructions st r0,r0,imm0 and st r0,r0,imm1, where imm0 and imm1 are the addresses of rb0 and rb1, respectively. To accomplish the attack, the attacker must guess the opcode (2^6 − k possibilities, where k is the number of known opcodes). In addition, he must guess the operand permutation for the 16-bit immediate value (C(26,16) steps, i.e., the choices of the 16 immediate bit positions among the 26 permuted operand bits) and the two addresses (2^16 · (2^16 − 1) steps). To algorithmically verify the guessed instruction encoding, the attacker could attempt to transfer values from all 32 registers and check for equality.
Even without consideration of the booby trap mechanism, the worst-case attack complexity for k = 10 known opcodes is approximately 2^65. Note that for the MIPS architecture, register r0 always holds the constant zero.

Input/Output Attack. Similar to the branch instruction attack, the adversary could leverage writes to MMIO addresses to reveal the instruction encoding, although the amount of time required to perform the attack is prohibitive. He must craft a store instruction st r0,r0,imm, where imm is the address of rI/O. Hence, the attacker must guess the opcode (2^6 − k steps), the operand permutation for the 16-bit immediate value (C(26,16) steps), and the address value (2^16 steps). Even without consideration of the booby trap mechanism, the worst-case attack complexity for k = 10 known opcodes is approximately 2^49 (including the 31 checks to determine whether the written values are distinct). Furthermore, the evaluation of all possible 32-bit instructions will trigger 520,093,696 ≈ 2^28.95 booby traps due to the invalid shamt field (for 8 non-shift R-type instructions). Hence, the reconfiguration time on an A2 FPGA is approx. 63 days, as each reconfiguration requires at least 10.5 ms. Even for a single randomly chosen instruction value, the probability that it triggers the shamt field booby trap is around 12%.

Correctness Attack. As a consequence of the homophones in the opcode substitution, the adversary can leverage the correctness attack to reveal parts of the instruction encoding. The homophones are mainly implemented in the funct field of R-type instructions. For simplicity, we assume that the targeted program is deterministic, so that the same inputs compute the same output values, and that the adversary is able to observe both.


The attacker would like to substitute the homophones of an instruction with just one representative value to reveal parts of the instruction encoding. If the adversary alters the funct field to a correct homophone opcode, the output of a deterministic algorithm will not change. An incorrect homophone yields an incorrect opcode or operand and hence a different output. To complete the attack, the attacker must guess the opcode (2^6 − k steps), the operand permutation for the 6-bit funct value (C(26,6) steps), and the funct value itself (2^6 steps). Even without consideration of the booby trap mechanism, the worst-case attack complexity for k = 10 known opcodes is approximately 2^29 deterministic program executions.

System Configuration / Instruction Timing Attack. Since the system configuration instructions are not affected by the hardware-level obfuscation and all obfuscated instructions consume the same number of clock cycles, neither attack can be exploited.

Cautionary Note. All the attack strategies noted above require the use of attacker-controlled values to perform a hypothesis test. If the targeted program deliberately contains an st ra,rb,imm instruction in which imm is the address of an interface register and rb is attacker-controlled (r0 in the case of MIPS), the attack complexities are significantly reduced. The adversary only has to guess the ra encoding in the operand (C(26,6) ≈ 2^17.8 steps) and analyze the register file (2^5 − 1 steps). Hence, the system designer must analyze whether such an st ra,rb,imm instruction exists and obfuscate the instruction accordingly.

In summary, hardware-level obfuscation increases the adversary's efforts. A booby trap is likely to be triggered via an invalid memory access for a guessed st instruction or via a dedicated opcode for R-type instructions. To increase the probability that an average-case attack also triggers a booby trap, the system designer can adjust the parameters of the booby trap triggers prior to the processor generation.
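The complexity figures quoted in this section can be reproduced directly; the multiplicative composition of the guessing steps follows the arguments above.

```python
import math

# Reproducing the attack complexity estimates of this section (k = 10 known
# opcodes). math.comb(n, r) counts the choices of r bit positions among the
# n = 26 permuted operand bits.
k = 10
opcodes = 2**6 - k                                           # 54 opcode guesses

cf = opcodes * math.comb(26, 16) * 2**16 * (2**16 - 1) * 32  # + 32 register checks
assert round(math.log2(cf)) == 65                            # control flow: ~2^65

io = opcodes * math.comb(26, 16) * 2**16 * 31                # + 31 distinctness checks
assert round(math.log2(io)) == 49                            # I/O attack: ~2^49

corr = opcodes * math.comb(26, 6) * 2**6
assert int(math.log2(corr)) == 29                            # correctness: ~2^29

# Booby-trap cost of the exhaustive I/O attack: one reconfiguration per trigger.
seconds_per_trigger = (21_061_028 / 16) / 125e6   # A2 bitstream, 16 bits @ 125 MHz
assert seconds_per_trigger > 10.5e-3              # at least 10.5 ms per trigger
days = 520_093_696 * 0.0105 / 86_400
assert round(days) == 63                          # ~63 days of reconfiguration
```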

6.8 Security Metrics for Obfuscation

Obfuscation deters algorithm reverse engineering and analysis of the algorithm's internal architecture. A current limitation of obfuscation is the lack of a metric to measure the degree of obfuscation of different approaches. The generic concept of indistinguishability provides a formal treatment and offers provable arguments for obfuscation from a theoretical point of view. However, cryptographic program obfuscation schemes are still far away from being deployable [189], especially for embedded systems with constrained resources. A practical measure of the obfuscation degree for software-only obfuscation has been limited by the attacker's knowledge of the targeted ISA and, thus, the ability to emulate and analyze the targeted program. This situation changes for hardware-level obfuscation systems due to the concealed ISA encoding. In the following, we propose a novel evaluation methodology to provide an obfuscation metric for hybrid obfuscated systems.

6.8.1 Similarity Metric

A key characteristic of obfuscation is to (virtually) destroy any correlation between an obfuscated and an unobfuscated program. For example, suppose there are two distinct programs P1, P2 and their obfuscated versions O(P1), O(P2). If there exists a significant correlation of a characteristic between P1 and O(P1), but no significant correlation for this feature between P1 and O(P2), the obfuscated program O(P1) can be matched to its unobfuscated counterpart. For example, the similarity of the DCFG or the entropy of the instruction opcodes could be used as such characteristics. Thus, a goal of obfuscation is to make obfuscated programs as similar as possible so that their measurable quantities do not allow a distinction between them.

Methodology. To examine the program similarity created by obfuscation strategies and, thus, the degree of obfuscation, we implement the following: First, a set of programs P = {P1,..., Pn} is selected. Obfuscated versions for each program in P are then generated. For an obfuscation strategy O, multiple versions per program are generated to increase the coverage of the assessment. Then, a set of similarity comparison algorithms A = {A1,..., Am} are employed and the similarities between the obfuscated and unobfuscated programs are examined.
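The methodology above can be sketched as a small driver loop. The function and parameter names below are our own placeholders; `obfuscate` stands for a concrete obfuscation strategy O (one fresh ISA encoding per version) and `comparisons` for the algorithm set A:

```python
def evaluate_obfuscation(programs, obfuscate, comparisons, versions=100):
    """For every program, generate several obfuscated versions and score
    their similarity to every unobfuscated program in the set."""
    results = {}
    for name, prog in programs.items():
        for v in range(versions):
            obf = obfuscate(prog, seed=v)  # one fresh encoding per version
            for cmp_name, cmp_fn in comparisons.items():
                for ref_name, ref in programs.items():
                    # similarity score of obfuscated `name` vs. unobfuscated `ref_name`
                    results[(name, v, cmp_name, ref_name)] = cmp_fn(obf, ref)
    return results
```

A good obfuscation strategy would yield scores for the true pair (name == ref_name) that are indistinguishable from the scores for all other pairs.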

We assume that the adversary can implicitly obtain the function call graph and, hence, some functions of the program. For example, several program functions may be provided by an adversary-accessible open-source or closed-source library. Hence, the adversary is able to examine the similarity of numerous functions in a targeted program (instead of the whole program) to break the obfuscation. This examination allows for testing of function-level similarity that is more fine-grained than similarity analysis for a larger program. Furthermore, this evaluation methodology uses a concept which is similar to computational indistinguishability. Nevertheless, without the use of strong cryptographic primitives, the property of computational indistinguishability cannot be guaranteed. Our goal is not indistinguishability from a uniform distribution, but rather eliminating similarity between obfuscated programs and their unobfuscated counterparts, which is sufficient for practical purposes.

Advantages. Our proposed evaluation methodology is particularly beneficial for hardware- level obfuscation schemes as the adversary cannot emulate and thus reverse engineer the obfuscated program without the corresponding ISA encoding. Furthermore, this assessment roadmap is generic in the sense that new algorithms can be developed and added to the set of similarity comparison algorithms. In this way we can (automatically) examine obfuscation benefits against a defined set of attacks and provide a measure of the degree of obfuscation for a specific obfuscation strategy applied to certain programs. Notably, we can identify programs for which the selected obfuscation strategy might not be sufficient to bring the program set to a specific measurable obfuscation level.

Limitations. Despite various advantages, we acknowledge that the measurability approach does have certain limitations. Similar to ORAM [156] and cryptographic obfuscation [190], the I/O behaviour cannot be modeled. However, embedded systems are generally equipped with less I/O than general-purpose systems with a rich-featured operating system. Note that we cannot conclude from the statistics that a particular obfuscation strategy prevents a successful attack. However, the statistics do indicate that at least certain global properties can be successfully obfuscated and that certain strategies generally lead to poor obfuscation. A further arguable issue is the selection of our target program set. In our case, we exclude specially crafted programs that would still have measurable similarity after the obfuscation, since we want to provide a measure for programs more typically deployed by users.

Similarity measures of external data memory are outside the scope of our metrics. Data randomization schemes [170] could be utilized to dynamically encode/decode the internal, trusted RAM before it is stored to/loaded from the external, untrusted data memory.

Chapter 6. Hybrid Obfuscation to Protect Intellectual Property on Embedded Microprocessors

6.8.2 Case Study – SPREE Benchmark Suite

We evaluated the programs of the SPREE benchmark suite [154] with the evaluation methodology described in the previous section. The statistical distributions of the instruction memory (Section 6.8.2) and the dynamic CFG (Section 6.8.2) for the different obfuscation strategies were investigated. It should be noted that almost all programs in this benchmark suite consist of only one vital function, making the program set an ideal candidate suite for our evaluation methodology.

Statistical Background

To allow the reader to better interpret our results, we provide a concise summary of our measures and the rationale behind their use. Our evaluation measures are based on the number of appearances of a certain 6-bit opcode $o \in \{0,1\}^6$ in a program $P$, denoted by $N_P(o)$. Similarly, opcode triples and the number of their appearances are denoted by $o^3 \in \{0,1\}^6 \times \{0,1\}^6 \times \{0,1\}^6$ and $N_P(o^3)$, respectively. Moreover, $F_P(o)$ is the empirical distribution for the opcode $o$ of a program $P$, i.e.

\[
F_P(o) = \frac{N_P(o)}{\sum_{i \in \{0,1\}^6} N_P(i)}
\]

For completeness, we performed instruction operand analysis for our statistical measures. However, the 26-bit operand distributions did not provide meaningful results. Subsequently, we evaluated hashes of the operands using the measures. However, the results for the least significant 6/7/8 bits of the operand hashes were similar to the opcode results, hence we omit these results from the following discussion.

Entropy. The Shannon entropy is a measure of the information content of a random variable. In our case, we are interested in the entropy of the opcode and operand distributions as we expect that a larger entropy hints at better obfuscation (due to less pronounced peaks).

Definition 6 (Shannon Entropy). The Shannon entropy of a program is determined by:
\[
E(P) = -\sum_{o} F_P(o) \cdot \log_2(F_P(o))
\]

Standard Deviation. The standard deviation is a measure of the inhomogeneity of a random variable. In our case, we evaluate the standard deviation for the frequency of opcode triples rather than a single opcode value, as triplet distributions are also employed for the frequency analysis of simple cryptographic ciphers. The larger the difference in frequency of certain triples, the larger the standard deviation. This metric indicates which unobfuscated programs relate to an obfuscated one.

Definition 7 (Adapted Standard Deviation). The adapted standard deviation $sd_3(P)$ is determined by:
\[
sd_3(P) = \sqrt{\frac{1}{n_+ - 1} \sum_{o^3 : N_P(o^3) > 0} \left( N_P(o^3) - \nu(P) \right)^2}, \qquad
n_+ := |\{o^3 : N_P(o^3) > 0\}|, \qquad
\nu(P) := \frac{1}{n_+} \sum_{o^3 : N_P(o^3) > 0} N_P(o^3)
\]

We additionally performed analysis for the adapted standard deviations sd2 and sd4 based on pairs and 4-tuples of consecutive opcodes. However, these results were similar to the results of sd3, hence we omit these results from the following discussion.
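Definitions 6 and 7 translate directly into code. The sketch below is our own illustration (not the thesis tooling) and assumes a program is represented as a list of opcode values:

```python
from collections import Counter
from math import log2, sqrt

def entropy(opcodes):
    """Shannon entropy E(P) of the empirical opcode distribution F_P (Definition 6)."""
    counts = Counter(opcodes)
    total = len(opcodes)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def sd3(opcodes):
    """Adapted standard deviation over observed opcode triples (Definition 7);
    only triples with N_P(o^3) > 0 enter the sum."""
    triples = Counter(zip(opcodes, opcodes[1:], opcodes[2:]))
    n_plus = len(triples)                # number of distinct observed triples
    nu = sum(triples.values()) / n_plus  # mean count over observed triples
    return sqrt(sum((n - nu) ** 2 for n in triples.values()) / (n_plus - 1))
```

For instance, a program whose four distinct opcodes appear equally often has the maximal entropy of 2 bits, whereas pronounced opcode peaks lower the entropy.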


E-sd3 Information. The combined information of entropy E and sd3 was also considered. For two distinct programs P1 and P2, the marginal distributions of E and sd3 may strongly overlap so that the programs cannot be told apart. However, E and sd3 may be strongly positively correlated for P1, but anticorrelated for P2. As a result, the programs form distinct clusters of points in an E-sd3 diagram and hence offer a distinction between the programs P1 and P2.

Correlation. The correlation of the distributions between the unobfuscated and obfuscated programs was also considered as a statistical measure. The Spearman correlation was employed as this correlation measures the amount by which two variables are connected by a monotonous trend. This measure contrasts with the more restrictive assumption of linearity in the case of the Pearson correlation. The Spearman correlation also achieves increased robustness against possible outliers which heavily impact the Pearson correlation, i.e., opcodes having a (close to) zero frequency. The Spearman correlation uses the ranks of the observations as opposed to the observations themselves (Pearson correlation). In our case, we examined the correlation between the ranked opcode frequency distributions for obfuscated and unobfuscated programs.

Definition 8 (Spearman Correlation). The Spearman correlation for two programs Px and Py is determined by:

\[
\rho(N_{P_x}, N_{P_y}) = 2^{-6} \cdot \sum_{o \in \{0,1\}^6} \frac{rk(N_{P_x}(o)) - \hat{\mu}_{P_x}}{S_{P_x}} \cdot \frac{rk(N_{P_y}(o)) - \hat{\mu}_{P_y}}{S_{P_y}}
\]

Here, $\hat{\mu}_P := 2^{-6} \cdot \sum_{o \in \{0,1\}^6} rk(N_P(o))$ defines the mean and $S_P := \sqrt{2^{-6} \cdot \sum_{o \in \{0,1\}^6} (rk(N_P(o)) - \hat{\mu}_P)^2}$ defines the standard deviation, where the pairs $(rk(N_{P_x}(o)), rk(N_{P_y}(o)))$ are the ranks of the observed numbers of appearances, determined separately for each program.
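Definition 8 can be sketched as follows. This is our own illustration; ties are given average ranks, a standard convention that the definition does not spell out:

```python
from math import sqrt

def ranks(values):
    """1-based average ranks; tied observations share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rk = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for idx in order[i:j + 1]:
            rk[idx] = avg
        i = j + 1
    return rk

def spearman(nx, ny):
    """Spearman correlation of two opcode-frequency vectors N_Px, N_Py."""
    rx, ry = ranks(nx), ranks(ny)
    n = len(nx)
    mx, my = sum(rx) / n, sum(ry) / n        # rank means (mu-hat)
    sx = sqrt(sum((r - mx) ** 2 for r in rx) / n)  # rank standard deviations (S_P)
    sy = sqrt(sum((r - my) ** 2 for r in ry) / n)
    return sum((a - mx) * (b - my) for a, b in zip(rx, ry)) / (n * sx * sy)
```

A perfectly monotone relation between the two frequency vectors yields +1, a perfectly inverse one yields −1.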

Statistical Analysis of the Instruction Memory

Using the above statistical measures, evaluation results for the instruction memory were generated. Hybrid obfuscation was applied to each program. A total of 100 different ISA encodings per obfuscation technique were applied. The obfuscation transformations in the following discussion are abbreviated as in Section 6.6.

Entropy. Figure 6.3 depicts the entropy of the instruction opcode distributions for the increasingly sophisticated obfuscation strategies (a) - (d). The blue points below 3.0 depict the entropy of the unobfuscated programs in each figure (a) - (d). The boxes depict a sketch of the entropy distribution of the obfuscated programs, where the thick horizontal line within the box is the median and the box extends from the lower 25% to the upper 75% quantile, i.e., ≈ 25% of the programs have entropies below the lower edge and ≈ 25% have entropies above the upper edge of the box. The whiskers extend to the smallest and largest entropy, and the circles outside of the region covered by the whiskers mark even more extreme entropies (only contained in some figures). OS+OP performs the poorest with many outliers, small boxes, and medians spread over a range of approximately 4.1 − 5.0 in entropy. For OS+OP+IS and OS+OP+CF+IS, there are sharp distributions without significant outliers (visible as the isolated points below and above the boxes). OS+OP+CF+IS+BBN combines the best of both groups: the boxes are nearly as large as for OS+OP+CF+IS and there is approximately the same homogeneity of box positions (approximately 50% less variable for a large fraction of programs). This obfuscation strategy generates a significant number of outliers, which complicates the determination of which program is under inspection. The entropy provides an effective measure to quantify the extent of program information loss due to obfuscation. Here, we use the degree of homogeneity of the box locations as well as the percentage of entropy values which are far away from the typical values, i.e., the outliers.

Standard Deviation. Figure 6.4 shows the boxplots of the standard deviation sd3 of the instruction opcode distribution for the increasingly sophisticated obfuscation strategies (a) - (d). Note that a favourable obfuscation strategy should result in boxes which overlap for different programs and do not allow programs to be uniquely distinguished. This goal is best achieved by the OS+OP+CF+IS+BBN obfuscation, which produces the largest boxes and the greatest overlap in comparison to the other obfuscation methods. In contrast to the entropy metric, sd3 considers the variability in the distribution of opcode triples. A large variability in sd3 for different obfuscations of the same program illustrates a large variability in the homogeneity of the frequencies in which certain opcodes appear in consecutive order rather than the homogeneity of the frequencies in which certain opcodes appear overall in the code. Hence, sd3 provides important supplementary information to entropy to quantify the extent of the information loss due to obfuscation and also to specify the variability of different obfuscations of the same program.

E-sd3 Information. Figure 6.5 combines the entropy information and the sd3 values for the opcodes by displaying a point in an entropy-sd3 coordinate system for each program. A small dot is shown for each obfuscated program and a star is shown for an unobfuscated program.
The colors for the dots indicate the program from which the obfuscated program was generated. We see that the point clouds and the overlap between clouds for different programs are much larger for the OS+OP+CF+IS+BBN obfuscation technique than for all others. While this effect is also discernible in the boxplots of the marginal distributions of the entropies and the sd3 values, the E-sd3 diagram demonstrates that these quantities are not correlated or anticorrelated in a way which would allow the determination of a program under consideration. Such a correlation would show that a large entropy value coincides with a large sd3 value and would result in disjoint clusters of points (alignment parallel to the main diagonal in the diagram). Thus, the E-sd3 diagram provides essential information of the combined measures to depict the information loss due to obfuscation.

Correlation. Figure 6.6 depicts the results of the correlations for the opcodes of the fft program, a typical case. Each panel shows the correlations between the opcode frequencies for a specific obfuscation. The boxes illustrate the distribution of the correlations between the obfuscated programs and the unobfuscated fft, and the stars show the correlations between the unobfuscated programs and the unobfuscated program fft. The correlations between the obfuscated programs and the true underlying program fft are not significantly different across all obfuscation approaches. Hence, the correlations do not identify the true underlying program, since the hardware-level obfuscation is sufficient to hamper program distinction.

Dynamic Control Flow Graph Similarity

Since the adversary has access to the DCFG, the similarity between obfuscated and unobfuscated DCFGs was evaluated for the SPREE benchmark suite. For each obfuscation strategy, we


Figure 6.3: Entropy of the opcode distributions for the different obfuscation strategies: (a) OS+OP, (b) OS+OP+IS, (c) OS+OP+CF+IS, (d) OS+OP+CF+IS+BBN.


Figure 6.4: sd3-distributions for the different obfuscation strategies: (a) OS+OP, (b) OS+OP+IS, (c) OS+OP+CF+IS, (d) OS+OP+CF+IS+BBN.


Figure 6.5: E-sd3-diagrams for the different obfuscation strategies: (a) OS+OP, (b) OS+OP+IS, (c) OS+OP+CF+IS, (d) OS+OP+CF+IS+BBN. Colour legend for all figures: bubbl. •, crc •, CRC32 •, des •, fact. •, fft •, fir •, iquant •, quant •.


Figure 6.6: Correlations of the opcodes for the fft program for the different obfuscation strategies: (a) OS+OP, (b) OS+OP+IS, (c) OS+OP+CF+IS, (d) OS+OP+CF+IS+BBN.

generated 100 programs per benchmark and extracted each DCFG. As described in Section 6.2, several algorithms have been proposed to measure CFG similarity. In a recent evaluation, Chan et al. [119] demonstrated that the graph edit distance algorithm proposed by Hu et al. [110] performs best in terms of accuracy and run time. This approach was used to determine the similarity scores of our recorded DCFGs.
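The graph edit distance algorithm of Hu et al. is beyond a short sketch; as an illustrative stand-in only (not the algorithm used in this evaluation), a normalized edge-set overlap already yields a similarity score in [0, 1] of the kind plotted below:

```python
def edge_similarity(g1, g2):
    """Jaccard similarity of two DCFGs given as sets of (src, dst) edges.
    1.0 means identical edge sets, 0.0 means no shared edges."""
    e1, e2 = set(g1), set(g2)
    if not e1 and not e2:
        return 1.0  # two empty graphs are trivially identical
    return len(e1 & e2) / len(e1 | e2)
```

Any such score can be plugged into the comparison-algorithm set A of Section 6.8.1.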

Figure 6.7: Dynamic control flow graph similarity evaluation for the benchmark programs and different obfuscation strategies: (a) DCFG similarity for the OS+OP obfuscation strategy compared to the unobfuscated programs; (b) DCFG similarity for the OS+OP+CF+IS+BBN obfuscation strategy compared to the unobfuscated programs. Each panel plots the similarity score per benchmark program (bubbl., crc, CRC32, des, fact., fft, fir, iquant, quant).

Figure 6.7 shows the results of our DCFG similarity evaluation. Obfuscated programs are compared to their unobfuscated versions in (a) and (b). A similarity score close to 1 implies that the graphs are similar, whereas a score close to 0 implies the opposite. The figure shows that a targeted program can be uniquely distinguished among the set of obfuscated programs if the DCFG is not affected by the obfuscation (Figure 6.7 (a)). DCFG similarity between obfuscated and unobfuscated programs decreases as more obfuscation techniques are combined, hampering unique identification (Figure 6.7 (b)). For example, the factorial program in Figure 6.7 (b) cannot be distinguished from the set of programs.

Based on the statistical and DCFG evaluation results, it is apparent that the information characteristic for the des cryptographic algorithm stands out compared to other general-purpose embedded programs. Overall, we see that hardware-level obfuscation alone is not sufficient to hide crucial program characteristics. It must be combined with software-level transformations in order to prevent unique distinction by the various measures.

6.9 Discussion

In the following we analyze the diverse properties of our hybrid obfuscation scheme and discuss its security.


In general, hardware I/O mechanisms and the instruction encoding format have crucial impacts on the security of hardware-level obfuscation. Our proposed lightweight processor augmentations mitigate generic attacks via adversary-accessible interfaces to hide the vital ISA encoding from a physical adversary. Furthermore, we are able to detect and respond to tampering attempts by the use of a booby trap mechanism. We have demonstrated how ISA encoding diversification can be implemented so that processors augmented with hardware-level features can be automatically generated. As a consequence of the randomized ISA encoding, the adversary is not able to directly disassemble and reverse engineer a targeted program. Nevertheless, the ISA encoding itself is not sufficient in our adversary model, as crucial program characteristics remain observable, and hence we employ diverse software-level obfuscation transformations ranging from the CFG-level to the instruction-level. Note that this approach does not affect testability during development as the obfuscation can be selectively turned off, so that general-purpose user code can be debugged. Since we employ an integrated and automated compilation flow for the hardware-level and software-level obfuscation, a developer has full access to all compiler log files as well as the hardware-level instruction encoding mapping. Perhaps most importantly, the benefits of hybrid obfuscation transformations have been evaluated with statistical evaluation metrics. It has been demonstrated that it is not possible to match an obfuscated program to one of a group of unobfuscated ones by considering a selection of statistical metrics (Section 6.8.1). The hardware overhead of our approach is about 14% of processor logic area.

Future Work

Hardware Issues. The underlying concept of an obfuscated ISA encoding could potentially be applied to ASICs that provide several field-programmable hardware elements (e.g., the Writable Instruction Set Computer (WISC) concept [191]). For example, such a device could include a user-defined ISA encoding, memory layout, and access control. In particular, such an ASIC could be generated using just one mask to reduce manufacturing costs. Thus, a fleet of embedded systems could be diversified to counteract the "break one, break all" principle. A further interesting direction is dynamic instruction encoding update. This approach would be particularly attractive as a moving target defense. The disabling of an FPGA-based soft processor in response to an attack is challenging since current SRAM-based FPGAs do not offer the ability to permanently set one or more non-volatile fuses at run-time. The addition of this feature would help in preventing deobfuscation attacks. The security analysis of our approaches for processors with dedicated cache memories is left for future research.

Software Issues. The performance overhead of software-level obfuscation is significant for embedded systems. However, our evaluation results are in line with reported results from other work published in this area (e.g., [187]). The analysis of further obfuscation transformations such as anti-emulation, code tamper proofing, and self-modifying code [162, 187] in combination with Application Specific Instruction-Set Processor techniques to decrease software performance overhead is left for future research.


Security Metric Issues. Our security metrics could be expanded to include new statistical tests that evaluate hybrid obfuscation systems. A new metric which considers the limited I/O of embedded systems would be particularly advantageous.

6.10 Conclusion

ISA randomization provides a viable approach for obfuscation and exploit mitigation for embedded processors. However, for embedded systems, various disclosure sources can be leveraged to reveal crucial ISA information. Once the ISA is revealed, the targeted software can be reverse engineered. This issue is particularly worrisome for low-cost IoT systems with limited cryptographic protection. In this chapter we have presented a hybrid obfuscation scheme consisting of hardware-level and software-level obfuscation transformations to prevent a variety of disclosure attacks. We combined the obfuscation transformations with dedicated hardware booby traps to detect and respond to manipulation attempts. Finally, we demonstrated a novel evaluation methodology to assess the twofold diversification. This methodology provides a quantitative way to measure the benefits of our approaches. The lack of quantitative metrics has been a long-standing issue in the software obfuscation domain. A performance evaluation of our prototype implementation demonstrates a lightweight hardware overhead of up to 14% for a simple, low-cost embedded processor.


Part V

Conclusion

Chapter 7 Conclusion

In this thesis we advanced towards an answer to the question of the complexity of gate-level netlist reverse engineering and manipulation. We closed several research gaps for constructive and destructive applications: we designed and implemented the holistic gate-level netlist reverse engineering and manipulation framework HAL, and we provided several research contributions on semi-automated hardware Trojan insertion, costs associated with reverse engineering based on graph similarity algorithms, and novel insights on the (in-)security of several FSM-based hardware obfuscation schemes. In concert with HAL and our methodology based on problem-solving research and research on the acquisition of expertise, we provide a fundamental basis for the quantification of human factors in hardware reverse engineering.

Contents of this Chapter

7.1 Impact of Gate-level Netlist Reverse Engineering ...... 127
7.2 Future Research Directions ...... 128

7.1 Impact of Gate-level Netlist Reverse Engineering

Impact. Since hardware security covers a broad research landscape, HAL can be utilized to accompany and support various directions for constructive and defensive intents and purposes. With the support of HAL, we demonstrated several destructive, semi-automated reverse engineering and manipulation aspects which can indeed be carried out for potentially obfuscated FPGA and ASIC designs after a device or design has been tested, its code reviewed, or formally verified. Moreover, HAL facilitates constructive (automated) applications such as Trojan detection and the assessment of obfuscation or watermarking schemes. In addition, HAL lays the foundation to investigate human factors in reverse engineering as user interactions can be traced and analyzed. We want to note that gate-level netlist analysis also improves the understanding of SCA attacks and countermeasures, since more information about a hardware design implementation facilitates a more fine-grained security evaluation. Generally, one could argue that instead of reverse engineering and manipulating a design, it would be easier to completely replace it. However, this is not a realistic scenario since the reverse engineer does not know the exact design specifications beforehand. Thus, fully replacing the design without prior reverse engineering typically leads to a design which deviates from the original implementation or requires massive effort.


Other Vendors. Although our case studies targeted Xilinx FPGAs, our research is not specific to Xilinx devices and can be adapted to devices from other FPGA vendors as well. For example, the bitstream encryption scheme of Intel's Stratix II and Stratix III SRAM-based FPGA families can also be circumvented by means of SCA attacks [192]. To the best of our knowledge, bitstream file format reverse engineering for these families has so far not been practically demonstrated, but we expect that this step can be conducted as well. Project IceStorm [193] demonstrated successful bitstream reverse engineering to a human-readable netlist for iCE40 FPGAs. In 2012, Skorobogatov et al. [194] demonstrated key extraction from an Actel/Microsemi ProASIC3 device. This attack can result in the decryption and extraction of bitstream information.

7.2 Future Research Directions

Various constructive and destructive research directions may be considered in the future, including advanced gate-level netlist reverse engineering strategies, human factor quantification to support cognitive design obfuscation, or new design primitives for hardware Trojans based on microcode.

Comparative Analysis of Gate-level Trojan Detection Schemes. With the hardware Trojan detection technique ANGEL, see Section 3.3, we demonstrated that even advanced Trojans armed with obfuscation features can be reliably detected by using static Boolean function and graph neighborhood analysis. As other strategies based on dynamic design analysis [55] and machine learning [72] exist, a comparative investigation with an extensive evaluation would provide insights on capabilities and limitations for different Trojan design strategies and Trojan obfuscation techniques. Moreover, such an investigation would yield valuable insights on the current state of research, i.e., which Trojan design strategy evades automatic detection and thus is favorable for an attacker.

Advanced Reverse Engineering Strategies. Our gate-level netlist reverse engineering strategies were mostly based on static design analysis (e.g., graph similarity, see Chapter 4) and to a certain extent on dynamic analysis (e.g., Boolean function analysis for FSMs, see Section 5.2.2). Since combined static and dynamic analysis has been shown to be effective for the analysis of FSMs, see Chapter 5, more research on such advanced (semi-)automated strategies may yield powerful reverse engineering capabilities. In addition, static and dynamic design information may be a useful resource for machine learning, as clustering has been shown to be effective for Trojan detection [72, 55]. Future strategies may also explore reverse engineering of digital hardware systems designed with high-level synthesis.

Quantification of Human Factors.
Since the quantification of human factors is crucial in order to estimate the costs associated with reverse engineering, exploration of this interdisciplinary area is essential to design sound countermeasures. Based on the building block HAL and our initial work on cognitive aspects [5], future research should investigate human factors. We want to highlight that the author of this thesis was involved in user studies on this topic, see the publication list on page 158. As noted before, understanding human factors together with a profound understanding of advanced reverse engineering strategies may yield strong hardware obfuscation methods to effectively hinder subsequent attacks such as hardware Trojans.

Design Space for CPU Microcode Trojans. In a recent work, we demonstrated how microcode—an abstraction layer between the user-visible ISA and physical CPU hardware components—can be reverse engineered and manipulated in order to realize microcoded Trojans.


Such Trojans exhibit interesting properties such as post-manufacturing versatility, which is indispensable given the heterogeneity of operating systems and applications running on general-purpose CPUs [179]. Future research should explore the capabilities and limitations of microcode for the Trojan design space.


Part VI

Appendix

Bibliography

[1] Masoud Rostami, Farinaz Koushanfar, and Ramesh Karri. A primer on hardware security: Models, methods, and metrics. Proceedings of the IEEE, 102(8):1283–1295, 2014.

[2] Amir Moradi, Alessandro Barenghi, Timo Kasper, and Christof Paar. On the vulnerability of FPGA bitstream encryption against power analysis attacks: extracting keys from Xilinx Virtex-II FPGAs. In Chen et al. [195], pages 111–124.

[3] Amir Moradi, David Oswald, Christof Paar, and Pawel Swierczynski. Side-channel attacks on the bitstream encryption mechanism of Altera Stratix II: facilitating black-box analysis using software reverse-engineering. In Brad L. Hutchings and Vaughn Betz, editors, The 2013 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’13, Monterey, CA, USA, February 11-13, 2013, pages 91–100. ACM, 2013.

[4] Pawel Swierczynski, Marc Fyrbiak, Philipp Koppe, Amir Moradi, and Christof Paar. Interdiction in practice - hardware trojan against a high-security USB flash drive. J. Cryptographic Engineering, 7(3):199–211, 2017.

[5] Marc Fyrbiak, Sebastian Strauss, Christian Kison, Sebastian Wallat, Malte Elson, Nikol Rummel, and Christof Paar. Hardware reverse engineering: Overview and open challenges. In IVSW [196], pages 88–94.

[6] Karsten Nohl, David Evans, Starbug, and Henryk Plötz. Reverse-engineering a cryptographic RFID tag. In Paul C. van Oorschot, editor, Proceedings of the 17th USENIX Security Symposium, July 28-August 1, 2008, San Jose, CA, USA, pages 185–194. USENIX Association, 2008.

[7] Jennifer L. White, Anthony S. Wojcik, Moon-Jung Chung, and Travis E. Doom. Candidate subcircuits for functional module identification in logic circuits. In ACM Great Lakes Symposium on VLSI, pages 34–38. ACM, 2000.

[8] Samuel T. King, Joseph Tucek, Anthony Cozzie, Chris Grier, Weihang Jiang, and Yuanyuan Zhou. Designing and implementing malicious hardware. In LEET. USENIX Association, 2008.

[9] Lang Lin, Markus Kasper, Tim Güneysu, Christof Paar, and Wayne Burleson. Trojan side-channels: Lightweight hardware trojans through side-channel engineering. In CHES, volume 5747 of Lecture Notes in Computer Science, pages 382–395. Springer, 2009.

[10] Jeyavijayan Rajendran, Vinayaka Jyothi, and Ramesh Karri. Blue team red team approach to hardware trust assessment. In ICCD, pages 285–288. IEEE Computer Society, 2011.

[11] Georg T. Becker, Francesco Regazzoni, Christof Paar, and Wayne P. Burleson. Stealthy dopant-level hardware trojans. In CHES, volume 8086 of Lecture Notes in Computer Science, pages 197–214. Springer, 2013.

[12] Jie Zhang, Feng Yuan, and Qiang Xu. Detrust: Defeating hardware trust verification with stealthy implicitly-triggered hardware trojans. In ACM Conference on Computer and Communications Security, pages 153–166. ACM, 2014.

[13] Samaneh Ghandali, Georg T. Becker, Daniel Holcomb, and Christof Paar. A design methodology for stealthy parametric trojans and its application to bug attacks. In CHES, volume 9813 of Lecture Notes in Computer Science, pages 625–647. Springer, 2016.

[14] Kaiyuan Yang, Matthew Hicks, Qing Dong, Todd M. Austin, and Dennis Sylvester. A2: analog malicious hardware. In IEEE Symposium on Security and Privacy, pages 18–37. IEEE Computer Society, 2016.

[15] Jeyavijayan Rajendran, Michael Sam, Ozgur Sinanoglu, and Ramesh Karri. Security analysis of integrated circuit camouflaging. In Sadeghi et al. [197], pages 709–720.

[16] Arunkumar Vijayakumar, Vinay C. Patil, Daniel E. Holcomb, Christof Paar, and Sandip Kundu. Physical design obfuscation of hardware: A comprehensive investigation of device and logic-level techniques. IEEE Trans. Information Forensics and Security, 12(1):64–77, 2017.

[17] Randy Torrance and Dick James. The state-of-the-art in IC reverse engineering. In Christophe Clavier and Kris Gaj, editors, Cryptographic Hardware and Embedded Systems - CHES 2009, 11th International Workshop, Lausanne, Switzerland, September 6-9, 2009, Proceedings, volume 5747 of Lecture Notes in Computer Science, pages 363–381. Springer, 2009.

[18] Texplained. https://www.texplained.com/process. [Online; accessed 19-May-2017].

[19] Resve A. Saleh, Steven J. E. Wilton, Shahriar Mirabbasi, Alan J. Hu, Mark R. Greenstreet, Guy Lemieux, Partha Pratim Pande, Cristian Grecu, and André Ivanov. System-on-chip: Reuse and integration. Proceedings of the IEEE, 94(6):1050–1069, 2006.

[20] Bicky Shakya, Mark M. Tehranipoor, Swarup Bhunia, and Domenic Forte. Introduction to Hardware Obfuscation: Motivation, Methods and Evaluation, pages 3–32. Volume 1 of Forte et al. [198], 1st edition, 2017.

[21] Rajat Subhra Chakraborty and Swarup Bhunia. Hardware protection and authentication through netlist level obfuscation. In ICCAD, pages 674–677. IEEE Computer Society, 2008.

[22] Rajat Subhra Chakraborty and Swarup Bhunia. Security against hardware trojan through a novel application of design obfuscation. In Jaijeet S. Roychowdhury, editor, 2009 International Conference on Computer-Aided Design, ICCAD 2009, San Jose, CA, USA, November 2-5, 2009, pages 113–116. ACM, 2009.


[23] Rajat Subhra Chakraborty and Swarup Bhunia. Security through obscurity: An approach for protecting register transfer level hardware IP. In Mohammad Tehranipoor and Jim Plusquellic, editors, IEEE International Workshop on Hardware-Oriented Security and Trust, HOST 2009, San Francisco, CA, USA, July 27, 2009. Proceedings, pages 96–99. IEEE Computer Society, 2009.

[24] Rajat Subhra Chakraborty and Swarup Bhunia. HARPOON: an obfuscation-based soc design methodology for hardware protection. IEEE Trans. on CAD of Integrated Circuits and Systems, 28(10):1493–1502, 2009.

[25] Yousra Alkabani and Farinaz Koushanfar. Active hardware metering for intellectual property protection and security. In USENIX Security Symposium. USENIX Association, 2007.

[26] Farinaz Koushanfar. Provably secure active IC metering techniques for piracy avoidance and digital rights management. IEEE Trans. Information Forensics and Security, 7(1):51–63, 2012.

[27] Sarah Amir, Bicky Shakya, Domenic Forte, Mark Tehranipoor, and Swarup Bhunia. Comparative analysis of hardware obfuscation for IP protection. In ACM Great Lakes Symposium on VLSI, pages 363–368. ACM, 2017.

[28] Shahed E. Quadir, Junlin Chen, Domenic Forte, Navid Asadizanjani, Sina Shahbazmohamadi, Lei Wang, John A. Chandy, and Mark Tehranipoor. A survey on chip to system reverse engineering. JETC, 13(1):6:1–6:34, 2016.

[29] Jean Kumagai. Chip detectives. IEEE Spectr., 37(11):43–49, November 2000.

[30] Ian Kuon, Russell Tessier, and Jonathan Rose. FPGA architecture: Survey and challenges. Foundations and Trends in Electronic Design Automation, 2(2):135–253, 2007.

[31] Neil Weste and David Harris. CMOS VLSI Design: A Circuits and Systems Perspective. Addison-Wesley Publishing Company, USA, 4th edition, 2010.

[32] Philippe Coussy, Daniel D. Gajski, Michael Meredith, and Andrés Takach. An introduction to high-level synthesis. IEEE Design & Test of Computers, 26(4):8–17, 2009.

[33] E. Wanderley, R. Vaslin, J. Crenne, P. Cotret, G. Gogniat, J.-P. Diguet, J.-L. Danger, P. Maurine, V. Fischer, B. Badrignans, L. Barthe, P. Benoit, and L. Torres. Security FPGA Analysis, pages 7–46. Springer Netherlands, Dordrecht, 2011.

[34] Jean-Baptiste Note and Éric Rannaud. From the bitstream to the netlist. In FPGA, page 264. ACM, 2008.

[35] Mark C. Hansen, Hakan Yalcin, and John P. Hayes. Unveiling the ISCAS-85 benchmarks: A case study in reverse engineering. IEEE Design & Test of Computers, 16(3):72–80, 1999.

[36] Gregory H. Chisholm, Steven T. Eckmann, Christopher M. Lain, and Robert Veroff. Understanding integrated circuits. IEEE Design & Test of Computers, 16(2):26–37, 1999.


[37] Travis E. Doom, Jennifer L. White, Anthony S. Wojcik, and Gregory H. Chisholm. Identifying high-level components in combinational circuits. In Great Lakes Symposium on VLSI, pages 313–318. IEEE Computer Society, 1998.

[38] Amit Chowdhary, Sudhakar Kale, Phani K. Saripella, Naresh Sehgal, and Rajesh K. Gupta. Extraction of functional regularity in datapath circuits. IEEE Trans. on CAD of Integrated Circuits and Systems, 18(9):1279–1296, 1999.

[39] Yiqiong Shi, Chan Wai Ting, Bah-Hwee Gwee, and Ye Ren. A highly efficient method for extracting FSMs from flattened gate-level netlist. In International Symposium on Circuits and Systems (ISCAS 2010), May 30 - June 2, 2010, Paris, France, pages 2610–2613. IEEE, 2010.

[40] Yiqiong Shi, Bah-Hwee Gwee, Ye Ren, Thet Khaing Phone, and Chan Wai Ting. Extracting functional modules from flattened gate-level netlist. In ISCIT, pages 538–543. IEEE, 2012.

[41] Travis Meade, Shaojie Zhang, and Yier Jin. Netlist reverse engineering for high-level functionality reconstruction. In ASP-DAC, pages 655–660. IEEE, 2016.

[42] Travis Meade, Yier Jin, Mark Tehranipoor, and Shaojie Zhang. Gate-level netlist reverse engineering for hardware security: Control logic register identification. In ISCAS, pages 1334–1337. IEEE, 2016.

[43] Wenchao Li, Zach Wasson, and Sanjit A. Seshia. Reverse engineering circuits using behavioral pattern mining. In HOST, pages 83–88. IEEE, 2012.

[44] Wenchao Li, Adrià Gascón, Pramod Subramanyan, Wei Yang Tan, Ashish Tiwari, Sharad Malik, Natarajan Shankar, and Sanjit A. Seshia. WordRev: Finding word-level structures in a sea of bit-level gates. In 2013 IEEE International Symposium on Hardware-Oriented Security and Trust, HOST 2013, Austin, TX, USA, June 2-3, 2013 [199], pages 67–74.

[45] Pramod Subramanyan, Nestan Tsiskaridze, Wenchao Li, Adrià Gascón, Wei Yang Tan, Ashish Tiwari, Natarajan Shankar, Sanjit A. Seshia, and Sharad Malik. Reverse engineering digital circuits using structural and functional analyses. IEEE Trans. Emerging Topics Comput., 2(1):63–80, 2014.

[46] Adrià Gascón, Pramod Subramanyan, Bruno Dutertre, Ashish Tiwari, Dejan Jovanovic, and Sharad Malik. Template-based circuit understanding. In FMCAD, pages 83–90. IEEE, 2014.

[47] Mathias Soeken, Baruch Sterin, Rolf Drechsler, and Robert K. Brayton. Reverse engineering with simulation graphs. In Roope Kaivola and Thomas Wahl, editors, Formal Methods in Computer-Aided Design, FMCAD 2015, Austin, Texas, USA, September 27-30, 2015., pages 152–159. IEEE, 2015.

[48] Defense Science Board Washington DC. Report of the defense science board task force on high performance microchip supply, February 2005.


[49] Swarup Bhunia, Michael S. Hsiao, Mainak Banga, and Seetharam Narasimhan. Hardware trojan attacks: Threat analysis and countermeasures. Proceedings of the IEEE, 102(8):1229–1247, 2014.

[50] Mohammad Tehranipoor and Farinaz Koushanfar. A survey of hardware trojan taxonomy and detection. IEEE Design & Test of Computers, 27(1):10–25, 2010.

[51] Matthew Hicks, Murph Finnicum, Samuel T. King, Milo M. K. Martin, and Jonathan M. Smith. Overcoming an untrusted computing base: Detecting and removing malicious hardware automatically. In IEEE Symposium on Security and Privacy, pages 159–172. IEEE Computer Society, 2010.

[52] Adam Waksman, Matthew Suozzo, and Simha Sethumadhavan. FANCI: identification of stealthy malicious logic using boolean functional analysis. In ACM Conference on Computer and Communications Security, pages 697–708. ACM, 2013.

[53] Kento Hasegawa, Masaru Oya, Masao Yanagisawa, and Nozomu Togawa. Hardware trojans classification for gate-level netlists based on machine learning. In IOLTS, pages 203–206. IEEE, 2016.

[54] S. K. Haider, C. Jin, M. Ahmad, D. Shila, O. Khan, and M. van Dijk. Advancing the state-of-the-art in hardware trojans detection. IEEE Transactions on Dependable and Secure Computing, pages 1–1, 2017.

[55] Hassan Salmani. COTD: reference-free hardware trojan detection and recovery based on controllability and observability in gate-level netlist. IEEE Trans. Information Forensics and Security, 12(2):338–350, 2017.

[56] Jarrod A. Roy, Farinaz Koushanfar, and Igor L. Markov. EPIC: ending piracy of integrated circuits. In DATE, pages 1069–1074. ACM, 2008.

[57] Pramod Subramanyan, Sayak Ray, and Sharad Malik. Evaluating the security of logic encryption algorithms. In HOST, pages 137–143. IEEE Computer Society, 2015.

[58] M. Yasin, B. Mazumdar, O. Sinanoglu, and J. Rajendran. Removal attacks on logic locking and camouflaging techniques. IEEE Transactions on Emerging Topics in Computing, pages 1–1, 2017.

[59] Yang Xie and Ankur Srivastava. Mitigating SAT attack on logic locking. In CHES, volume 9813 of Lecture Notes in Computer Science, pages 127–146. Springer, 2016.

[60] Muhammad Yasin, Bodhisatwa Mazumdar, Jeyavijayan J. V. Rajendran, and Ozgur Sinanoglu. SARLock: SAT attack resistant logic locking. In HOST, pages 236–241. IEEE Computer Society, 2016.

[61] Muhammad Yasin, Abhrajit Sengupta, Mohammed Thari Nabeel, Mohammed Ashraf, Jeyavijayan Rajendran, and Ozgur Sinanoglu. Provably-secure logic locking: From theory to practice. In CCS, pages 1601–1618. ACM, 2017.


[62] Avinash R. Desai, Michael S. Hsiao, Chao Wang, Leyla Nazhandali, and Simin Hall. Interlocking obfuscation for anti-tamper hardware. In CSIIRW, page 8. ACM, 2013.

[63] Jaya Dofe and Qiaoyan Yu. Novel dynamic state-deflection method for gate-level design obfuscation. IEEE Trans. on CAD of Integrated Circuits and Systems, 37(2):273–285, 2018.

[64] Li Li and Hai Zhou. Structural transformation for best-possible obfuscation of sequential circuits. In 2013 IEEE International Symposium on Hardware-Oriented Security and Trust, HOST 2013, Austin, TX, USA, June 2-3, 2013 [199], pages 55–60.

[65] J. D. Parham, J. T. McDonald, Y. C. Kim, and M. R. Grimaila. Hiding circuit components using boundary blurring techniques. In Proceedings of IEEE Annual Symposium on VLSI, 2010.

[66] Jeffrey Todd McDonald, Yong C. Kim, and Daniel J. Koranek. Deterministic circuit variation for anti-tamper applications. In CSIIRW, page 68. ACM, 2011.

[67] Jeffrey Todd McDonald, Yong C. Kim, Daniel J. Koranek, and James D. Parham. Evaluating component hiding techniques in circuit topologies. In ICC, pages 1138–1143. IEEE, 2012.

[68] M. Fyrbiak, S. Wallat, P. Swierczynski, M. Hoffmann, S. Hoppach, M. Wilhelm, T. Weidlich, R. Tessier, and C. Paar. HAL—the missing piece of the puzzle for hardware reverse engineering, trojan detection and insertion. IEEE Transactions on Dependable and Secure Computing, 2018, to appear.

[69] Sebastian Wallat, Marc Fyrbiak, Moritz Schlögel, and Christof Paar. A look at the dark side of hardware reverse engineering - a case study. In IVSW [196], pages 95–100.

[70] Hassan Salmani, Mohammad Tehranipoor, and Ramesh Karri. On design vulnerability analysis and trust benchmarks development. In 2013 IEEE 31st International Conference on Computer Design, ICCD 2013, Asheville, NC, USA, October 6-9, 2013, pages 471–474. IEEE Computer Society, 2013.

[71] Adam Waksman, Jeyavijayan Rajendran, Matthew Suozzo, and Simha Sethumadhavan. A red team/blue team assessment of functional analysis methods for malicious circuit identification. In DAC, pages 175:1–175:4. ACM, 2014.

[72] Burçin Çakır and Sharad Malik. Hardware trojan detection for gate-level ICs using signal correlation based clustering. In DATE, pages 471–476. ACM, 2015.

[73] U.S. Department of Defense/National Security Agency (NSA). Suite B cryptography, 2001.

[74] Stephen Trimberger and Jason Moore. FPGA security: Motivations, features, and applications. Proceedings of the IEEE, 102(8):1248–1265, 2014.

[75] Benoit Badrignans, Jean Luc Danger, Viktor Fischer, Guy Gogniat, and Lionel Torres. Security Trends for FPGAs: From Secured to Secure Reconfigurable Systems. Springer Publishing Company, Incorporated, 1st edition, 2011.


[76] U.S. Department of Commerce/National Institute of Standards and Technology (NIST). FIPS PUB 140-2, security requirements for cryptographic modules, 2001.

[77] Tim Kerins and Klaus Kursawe. A cautionary note on weak implementations of block ciphers. In In 1st Benelux Workshop on Information and System Security (WISSec 2006), page 12, 2006.

[78] Pawel Swierczynski, Marc Fyrbiak, Philipp Koppe, and Christof Paar. FPGA trojans through detecting and weakening of cryptographic primitives. IEEE Trans. on CAD of Integrated Circuits and Systems, 34(8):1236–1249, 2015.

[79] Alejandro Cabrera Aldaya, Alejandro Cabrera Sarmiento, and Santiago Sánchez-Solano. AES T-box tampering attack. J. Cryptographic Engineering, 6(1):31–48, 2016.

[80] Pawel Swierczynski, Georg T. Becker, Amir Moradi, and Christof Paar. Bitstream fault injections (BiFI) - automated fault attacks against SRAM-based FPGAs. IEEE Trans. Computers, 67(3):348–360, 2018.

[81] Kris Gaj, Jens-Peter Kaps, Venkata Amirineni, Marcin Rogawski, Ekawat Homsirikamol, and Benjamin Y. Brewster. ATHENa - automated tool for hardware evaluation: Toward fair and comprehensive benchmarking of cryptographic hardware using FPGAs. In FPL, pages 414–421. IEEE Computer Society, 2010.

[82] Jacob Couch and Peter Athanas. An analysis of implanted antennas in Xilinx FPGAs. In ReConFig, pages 1–6. IEEE Computer Society, 2011.

[83] Andrew B. Kahng, John Lach, William H. Mangione-Smith, Stefanus Mantik, Igor L. Markov, Miodrag Potkonjak, Paul Tucker, Huijuan Wang, and Gregory Wolfe. Constraint- based watermarking techniques for design IP protection. IEEE Trans. on CAD of Integrated Circuits and Systems, 20(10):1236–1252, 2001.

[84] Moritz Schmid, Daniel Ziener, and Jürgen Teich. Netlist-level IP protection by watermarking for LUT-based FPGAs. In FPT, pages 209–216. IEEE, 2008.

[85] Ginger Myles and Christian Collberg. Software watermarking via opaque predicates: Implementation, analysis, and attacks. Electronic Commerce Research, 6(2):155–171, April 2006.

[86] Vladimir Sergeichik and Alexander Ivaniuk. Implementation of opaque predicates for FPGA designs hardware obfuscation. Journal of Information, Control and Management Systems, 12(2), 2014.

[87] R. C. Atkinson and R. M. Shiffrin. Human memory: A proposed system and its control processes. In Kenneth W. Spence and Janet Taylor Spence, editors, The psychology of learning and motivation, volume 2, pages 89–195. Academic Press, New York, 1968.

[88] Patrick Werquin. Terms, concepts and models for analysing the value of recognition programmes: RNFIL - third meeting of national representatives and international organisations.


[89] Michael Öllinger. Problemlösen. In Jochen Müsseler and Martina Rieger, editors, Allgemeine Psychologie, pages 587–618. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017.

[90] T. J. Nokes, C. D. Schunn, and M.T.H. Chi. Problem solving and human expertise. In Penelope Peterson, Eva Baker, and Barry McGaw, editors, International Encyclopedia of Education (Third Edition), pages 265–272. Elsevier, Oxford, 2010.

[91] Richard E. Mayer. A taxonomy for computer-based assessment of problem solving. Computers in Human Behavior, 18(6):623–632, 2002.

[92] Walter Hussy and Herbert Selg. Denken und Problemlösen, volume 557 of Urban-Taschenbücher. Kohlhammer, Stuttgart, 2nd revised and expanded edition, 1998.

[93] K. Anders Ericsson, Ralf T. Krampe, and Clemens Tesch-Römer. The role of deliberate practice in the acquisition of expert performance. Psychological Review, 100(3):363–406, 1993.

[94] K. Anders Ericsson and Tyler J. Towne. Expertise. Wiley interdisciplinary reviews. Cognitive science, 1(3):404–416, 2010.

[95] K. Anders Ericsson. The acquisition of expert performance as problem solving: Construction and modification of mediating mechanisms through deliberate practice. In Janet E. Davidson and Robert J. Sternberg, editors, The psychology of problem solving, pages 31–83. Cambridge University Press, Cambridge, UK and New York, 2003.

[96] Adriaan D. de Groot. Thought and Choice in Chess, volume 4 of Psychological Studies. De Gruyter Mouton, Berlin/Boston, 2nd edition, reprint 2014, 1978.

[97] David H. Jonassen. Toward a design theory of problem solving. Educational Technology Research and Development, 48(4):63–85, 2000.

[98] Michelene T. H. Chi, Paul J. Feltovich, and Robert Glaser. Categorization and representation of physics problems by experts and novices. Cognitive Science, 5(2):121–152, 1981.

[99] K. Anders Ericsson. The influence of experience and deliberate practice on the development of superior expert performance. In K. Anders Ericsson, editor, The Cambridge Handbook of Expertise and Expert Performance, pages 683–704. Cambridge University Press, 2006.

[100] W. G. Chase and H. A. Simon. Perception in chess. Cognitive Psychology, 4(1):55–81, 1973.

[101] Dennis E. Egan and Barry J. Schwartz. Chunking in recall of symbolic drawings. Memory & Cognition, 7(2):149–158, 1979.

[102] Katherine B. McKeithen, Judith S. Reitman, Henry H. Rueter, and Stephen C. Hirtle. Knowledge organization and skill differences in computer programmers. Cognitive Psychology, 13(3):307–325, 1981.

[103] John R. Anderson. Problem solving and learning. American Psychologist, 48(1):35–44, 1993.


[104] Joachim Funke. Complex problem solving. In Norbert M. Seel, editor, Encyclopedia of the Sciences of Learning, pages 682–685. Springer US, Boston, MA, 2012.

[105] Jean E. Pretz, Adam J. Naples, and Robert J. Sternberg. Recognizing, defining, and representing problems. In Janet E. Davidson and Robert J. Sternberg, editors, The psychology of problem solving, pages 3–30. Cambridge University Press, Cambridge, UK and New York, 2003.

[106] Nancy J. Cooke. Varieties of knowledge elicitation techniques. International Journal of Human-Computer Studies, 41(6):801–849, 1994.

[107] K. Anders Ericsson and Herbert A. Simon. Verbal reports as data. Psychological Review, 87(3):215–251, 1980.

[108] Richard E. Nisbett and Timothy D. Wilson. Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84(3):231–259, 1977.

[109] Christopher Krügel, Engin Kirda, Darren Mutz, William K. Robertson, and Giovanni Vigna. Polymorphic worm detection using structural information of executables. In Alfonso Valdes and Diego Zamboni, editors, Recent Advances in Intrusion Detection, 8th International Symposium, RAID 2005, Seattle, WA, USA, September 7-9, 2005, Revised Papers, volume 3858 of Lecture Notes in Computer Science, pages 207–226. Springer, 2005.

[110] Xin Hu, Tzi-cker Chiueh, and Kang G. Shin. Large-scale malware indexing using function- call graphs. In Ehab Al-Shaer, Somesh Jha, and Angelos D. Keromytis, editors, Proceedings of the 2009 ACM Conference on Computer and Communications Security, CCS 2009, Chicago, Illinois, USA, November 9-13, 2009, pages 611–620. ACM, 2009.

[111] Milena Vujosevic-Janicic, Mladen Nikolic, Dusan Tosic, and Viktor Kuncak. Software verification and graph similarity for automated evaluation of students’ assignments. Information & Software Technology, 55(6):1004–1016, 2013.

[112] Danai Koutra, Aaditya Ramdas, Ankur Parikh, and Jing Xiang. Algorithms for graph similarity and subgraph matching, 2011.

[113] Zhiping Zeng, Anthony K. H. Tung, Jianyong Wang, Jianhua Feng, and Lizhu Zhou. Comparing stars: On approximating graph edit distance. Proc. VLDB Endow., 2(1):25–36, August 2009.

[114] Uwe Schöning. Graph isomorphism is in the low hierarchy. In Franz-Josef Brandenburg, Guy Vidal-Naquet, and Martin Wirsing, editors, STACS 87, 4th Annual Symposium on Theoretical Aspects of Computer Science, Passau, Germany, February 19-21, 1987, Proceedings, volume 247 of Lecture Notes in Computer Science, pages 114–124. Springer, 1987.

[115] Xilinx. Spartan-6 Libraries Guide for HDL Designs, 2009. v 11.4.

[116] Pramod Subramanyan, Nestan Tsiskaridze, Kanika Pasricha, Dillon Reisman, Adriana Susnea, and Sharad Malik. Reverse engineering digital circuits using functional analysis. In Enrico Macii, editor, Design, Automation and Test in Europe, DATE 13, Grenoble, France, March 18-22, 2013, pages 1277–1280. EDA Consortium San Jose, CA, USA / ACM DL, 2013.

[117] Willard V. Quine. The problem of simplifying truth functions. The American Mathematical Monthly, 59(8):521–531, 1952.

[118] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, March 1955.

[119] Patrick P. F. Chan and Christian S. Collberg. A method to evaluate CFG comparison algorithms. In 2014 14th International Conference on Quality Software, Allen, TX, USA, October 2-3, 2014, pages 95–104. IEEE, 2014.

[120] Mladen Nikolic. Measuring similarity of graph nodes by neighbor matching. Intell. Data Anal., 16(6):865–878, 2012.

[121] Andries E. Brouwer and Willem H. Haemers. Spectra of Graphs. Springer, New York, NY, 2012.

[122] Brian Crawford, Ralucca Gera, Jeffrey House, Thomas Knuth, and Ryan Miller. Graph structure similarity using spectral graph theory. In Hocine Cherifi, Sabrina Gaito, Walter Quattrociocchi, and Alessandra Sala, editors, Complex Networks & Their Applications V, pages 209–221, Cham, 2017. Springer International Publishing.

[123] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107–117, 1998.

[124] Google Inc. Google cloud platform. https://cloud.google.com, 2011. Accessed: April 26, 2019.

[125] Xilinx. XST User Guide, 2009. v 11.3.

[126] Boost C++ Libraries. Boost McGregor common subgraphs implementation. http://www.boost.org/doc/libs/1_53_0/libs/graph/doc/mcgregor_common_subgraphs.html. [Online] Accessed: April 26, 2019.

[127] Boost C++ Libraries. Boost VF2 implementation. http://www.boost.org/doc/libs/master/libs/graph/doc/vf2_sub_graph_iso.html. [Online] Accessed: April 26, 2019.

[128] Oleg Sokolsky, Sampath Kannan, and Insup Lee. Simulation-based graph similarity. In Holger Hermanns and Jens Palsberg, editors, Tools and Algorithms for the Construction and Analysis of Systems, 12th International Conference, TACAS 2006 Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2006, Vienna, Austria, March 25 - April 2, 2006, Proceedings, volume 3920 of Lecture Notes in Computer Science, pages 426–440. Springer, 2006.

[129] Tohoku University Aoki Laboratory. AES encryption/decryption circuits with different S-box implementation strategies. http://www.aoki.ecei.tohoku.ac.jp/crypto/web/cores.html. [Online] Accessed: April 26, 2019.


[130] National Security Agency. Rijndael128 implementation. http://csrc.nist.gov/archive/aes/round2/r2anlsys.htm#NSA, 2000. [Online] Accessed: April 26, 2019.

[131] Pascal Sasdrich and Tim Güneysu. A grain in the silicon: SCA-protected AES in less than 30 slices. In 27th IEEE International Conference on Application-specific Systems, Architectures and Processors, ASAP 2016, London, United Kingdom, July 6-8, 2016, pages 25–32. IEEE Computer Society, 2016.

[132] R. Herveille. I2C controller core. https://opencores.org/project,i2c, 2017. [Online] Accessed: April 26, 2019.

[133] Mike J. Risc5x implementation. https://opencores.org/project,risc5x, 2011. [Online] Accessed: April 26, 2019.

[134] O. Girard. openMSP430 implementation. https://opencores.org/project,openmsp430, 2016. [Online] Accessed: April 26, 2019.

[135] H. Salmani. AES-T1000, 2013. https://www.trust-hub.org/aes-t1000.php.

[136] Christof Paar, Peter Fleischmann, and Peter Roelse. Efficient multiplier architectures for Galois fields GF((2^4)^n). IEEE Trans. Computers, 47(2):162–170, 1998.

[137] Francisco Rodríguez-Henríquez, N. A. Saqib, Arturo Díaz Pérez, and Çetin Kaya Koç. Cryptographic Algorithms on Reconfigurable Hardware. Springer Publishing Company, Incorporated, 1st edition, 2010.

[138] Farinaz Koushanfar. Active Hardware Metering by Finite State Machine Obfuscation, pages 161–187. Volume 1 of Forte et al. [198], 1st edition, 2017.

[139] Nikolay Rubanov. SubIslands: the probabilistic match assignment algorithm for subcircuit recognition. IEEE Trans. on CAD of Integrated Circuits and Systems, 22(1):26–38, 2003.

[140] Kevin Zeng and Peter Athanas. Enhancing productivity with back-end similarity matching of digital circuits for IP reuse. In 2013 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2013, Cancun, Mexico, December 9-11, 2013, pages 1–6. IEEE, 2013.

[141] Kevin Zeng and Peter M. Athanas. A q-gram birthmarking approach to predicting reusable hardware. In DATE, pages 1112–1115. IEEE, 2016.

[142] Marc Fyrbiak, Sebastian Wallat, Jonathan Déchelotte, Nils Albartus, Sinan Böcker, Russell Tessier, and Christof Paar. On the difficulty of FSM-based hardware obfuscation. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2018(3):293–330, Aug. 2018.

[143] Robert Endre Tarjan. Depth-first search and linear graph algorithms. SIAM J. Comput., 1(2):146–160, 1972.

[144] Travis Meade, Zheng Zhao, Shaojie Zhang, David Z. Pan, and Yier Jin. Revisit sequential logic obfuscation: Attacks and defenses. In ISCAS, pages 1–4. IEEE, 2017.


[145] Jaya Dofe, Yuejun Zhang, and Qiaoyan Yu. DSD: A dynamic state-deflection method for gate-level netlist obfuscation. In ISVLSI, pages 565–570. IEEE Computer Society, 2016.

[146] Farinaz Koushanfar and Gang Qu. Hardware metering. In DAC, pages 490–493. ACM, 2001.

[147] F. Koushanfar. Hardware Metering: A Survey, pages 103–122. Springer Publishing Company, Incorporated, 2012.

[148] M. Fyrbiak, S. Rokicki, N. Bissantz, R. Tessier, and C. Paar. Hybrid obfuscation to protect against disclosure attacks on embedded microprocessors. IEEE Transactions on Computers, 67(3):307–321, March 2018.

[149] Jim Chase. The evolution of the internet of things. Technical report, Texas Instruments, 2013.

[150] David Lie, Chandramohan A. Thekkath, Mark Mitchell, Patrick Lincoln, Dan Boneh, John C. Mitchell, and Mark Horowitz. Architectural support for copy and tamper resistant software. In Larry Rudolph and Anoop Gupta, editors, ASPLOS-IX Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, USA, November 12-15, 2000, pages 168–177. ACM Press, 2000.

[151] G. Edward Suh, Charles W. O’Donnell, and Srinivas Devadas. Aegis: A single-chip secure processor. IEEE Design & Test of Computers, 24(6):570–580, 2007.

[152] Emmanuel Owusu, Jorge Guajardo, Jonathan M. McCune, James Newsome, Adrian Perrig, and Amit Vasudevan. OASIS: on achieving a sanctuary for integrity and secrecy on untrusted platforms. In Sadeghi et al. [197], pages 13–24.

[153] Jonas Zaddach, Luca Bruno, Aurélien Francillon, and Davide Balzarotti. AVATAR: A framework to support dynamic security analysis of embedded systems’ firmwares. In 21st Annual Network and Distributed System Security Symposium, NDSS 2014, San Diego, California, USA, February 23-26, 2014. The Internet Society, 2014.

[154] Peter Yiannacouras, Jonathan Rose, and J. Gregory Steffan. The microarchitecture of FPGA-based soft processors. In Thomas M. Conte, Paolo Faraboschi, William H. Mangione-Smith, and Walid A. Najjar, editors, Proceedings of the 2005 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES 2005, San Francisco, California, USA, September 24-27, 2005, pages 202–212. ACM, 2005.

[155] Jason Xin Zheng, Dongfang Li, and Miodrag Potkonjak. A secure and unclonable embedded system using instruction-level PUF authentication. In 24th International Conference on Field Programmable Logic and Applications, FPL 2014, Munich, Germany, 2-4 September, 2014, pages 1–4. IEEE, 2014.

[156] Oded Goldreich and Rafail Ostrovsky. Software protection and simulation on oblivious RAMs. Journal of the ACM, 43(3):431–473, 1996.


[157] Christopher W. Fletcher, Ling Ren, Albert Kwon, Marten van Dijk, Emil Stefanov, Dimitrios N. Serpanos, and Srinivas Devadas. A low-latency, low-area hardware oblivious RAM controller. In 23rd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2015, Vancouver, BC, Canada, May 2-6, 2015, pages 215–222. IEEE Computer Society, 2015.

[158] Daehyun Strobel, David Oswald, Bastian Richter, Falk Schellenberg, and Christof Paar. Microcontrollers as (in)security devices for pervasive computing applications. Proceedings of the IEEE, 102(8):1157–1173, 2014.

[159] Daehyun Strobel, Florian Bache, David Oswald, Falk Schellenberg, and Christof Paar. Scandalee: a side-channel-based disassembler using local electromagnetic emanations. In Wolfgang Nebel and David Atienza, editors, Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, DATE 2015, Grenoble, France, March 9-13, 2015, pages 139–144. ACM, 2015.

[160] Christian Collberg and Jasvir Nagra. Surreptitious Software: Obfuscation, Watermarking, and Tamperproofing for Software Protection. Addison-Wesley Professional, 1st edition, 2009.

[161] Sebastian Schrittwieser and Stefan Katzenbeisser. Code obfuscation against static and dynamic reverse engineering. In Tomáš Filler, Tomáš Pevný, Scott Craver, and Andrew Ker, editors, Information Hiding, pages 270–284, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

[162] Carsten Willems and Felix C. Freiling. Reverse code engineering - state of the art and countermeasures. it - Information Technology, 54(2):53–63, 2012.

[163] Bertrand Anckaert, Mariusz H. Jakubowski, and Ramarathnam Venkatesan. Proteus: virtualization for diversified tamper-resistance. In Moti Yung, Kaoru Kurosawa, and Reihaneh Safavi-Naini, editors, Proceedings of the Sixth ACM Workshop on Digital Rights Management, Alexandria, VA, USA, October 30, 2006, pages 47–58. ACM, 2006.

[164] Kevin Coogan, Gen Lu, and Saumya K. Debray. Deobfuscation of virtualization-obfuscated software: a semantics-based approach. In Chen et al. [195], pages 275–284.

[165] Babak Yadegari, Brian Johannesmeyer, Ben Whitely, and Saumya Debray. A generic approach to automatic deobfuscation of executable code. In 2015 IEEE Symposium on Security and Privacy, SP 2015, San Jose, CA, USA, May 17-21, 2015, pages 674–691. IEEE Computer Society, 2015.

[166] Neha Runwal, Richard M. Low, and Mark Stamp. Opcode graph similarity and metamorphic detection. Journal in Computer Virology, 8(1-2):37–52, 2012.

[167] Ivan Sorokin. Comparing files using structural entropy. Journal in Computer Virology, 7(4):259–265, 2011.

[168] Srilatha Attaluri, Scott McGhee, and Mark Stamp. Profile hidden Markov models and metamorphic virus detection. Journal in Computer Virology, 5(2):151–169, 2009.


[169] Laszlo Szekeres, Mathias Payer, Tao Wei, and Dawn Song. Sok: Eternal war in memory. In 2013 IEEE Symposium on Security and Privacy, SP 2013, Berkeley, CA, USA, May 19-22, 2013, pages 48–62. IEEE Computer Society, 2013.

[170] Per Larsen, Stefan Brunthaler, and Michael Franz. Automatic software diversity. IEEE Security & Privacy, 13(2):30–37, 2015.

[171] Elena Gabriela Barrantes, David H. Ackley, Trek S. Palmer, Darko Stefanovic, and Dino Dai Zovi. Randomized instruction set emulation to disrupt binary code injection attacks. In Jajodia et al. [200], pages 281–289.

[172] Gaurav S. Kc, Angelos D. Keromytis, and Vassilis Prevelakis. Countering code-injection attacks with instruction-set randomization. In Jajodia et al. [200], pages 272–280.

[173] Bernhard Fechner, Jörg Keller, and Andreas Wohlfeld. Web server protection by customized instruction set encoding. In 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Proceedings, 25-29 April 2006, Rhodes Island, Greece. IEEE, 2006.

[174] Elena Gabriela Barrantes, David H. Ackley, Stephanie Forrest, and Darko Stefanovic. Randomized instruction set emulation. ACM Trans. Inf. Syst. Secur., 8(1):3–40, 2005.

[175] Georgios Portokalidis and Angelos D. Keromytis. Fast and practical instruction-set randomization for commodity systems. In Carrie Gates, Michael Franz, and John P. McDermott, editors, Twenty-Sixth Annual Computer Security Applications Conference, ACSAC 2010, Austin, Texas, USA, 6-10 December 2010, pages 41–48. ACM, 2010.

[176] Jean-Luc Danger, Sylvain Guilley, and Florian Praden. Hardware-enforced protection against software reverse-engineering based on an instruction set encoding. In Suresh Jagannathan and Peter Sewell, editors, Proceedings of the 3rd ACM SIGPLAN Program Protection and Reverse Engineering Workshop 2014, PPREW 2014, January 25, 2014, San Diego, CA, pages 5:1–5:11. ACM, 2014.

[177] Ziyi Liu, Weidong Shi, Shouhuai Xu, and Zhiqiang Lin. Programmable decoder and shadow threads: Tolerate remote code injection exploits with diversified redundancy. In Gerhard P. Fettweis and Wolfgang Nebel, editors, Design, Automation & Test in Europe Conference & Exhibition, DATE 2014, Dresden, Germany, March 24-28, 2014, pages 1–6. European Design and Automation Association, 2014.

[178] David Molnar, Matt Piotrowski, David Schultz, and David A. Wagner. The program counter security model: Automatic detection and removal of control-flow side channel attacks. In Dongho Won and Seungjoo Kim, editors, Information Security and Cryptology - ICISC 2005, 8th International Conference, Seoul, Korea, December 1-2, 2005, Revised Selected Papers, volume 3935 of Lecture Notes in Computer Science, pages 156–168. Springer, 2005.

[179] Philipp Koppe, Benjamin Kollenda, Marc Fyrbiak, Christian Kison, Robert Gawlik, Christof Paar, and Thorsten Holz. Reverse engineering x86 processor microcode. In USENIX Security Symposium, pages 1163–1180. USENIX Association, 2017.


[180] Shuichi Ichikawa, Takashi Sawada, and Hisashi Hata. Diversification of processors based on redundancy in instruction set. IEICE Transactions, 91-A(1):211–220, 2008.

[181] Thomas Jakobsen. A fast method for the cryptanalysis of substitution ciphers. Cryptologia, 19(3):265–274, 1995.

[182] David Oranchak. Evolutionary algorithm for decryption of monoalphabetic homophonic substitution ciphers encoded as constraint satisfaction problems. In Conor Ryan and Maarten Keijzer, editors, Genetic and Evolutionary Computation Conference, GECCO 2008, Proceedings, Atlanta, GA, USA, July 12-16, 2008, pages 1717–1718. ACM, 2008.

[183] Amrapali Dhavare, Richard M. Low, and Mark Stamp. Efficient cryptanalysis of homophonic substitution ciphers. Cryptologia, 37(3):250–281, 2013.

[184] Stephen Crane, Per Larsen, Stefan Brunthaler, and Michael Franz. Booby trapping software. In Mary Ellen Zurko, Konstantin Beznosov, Tara Whalen, and Tom Longstaff, editors, New Security Paradigms Workshop, NSPW ’13, Banff, AB, Canada, September 9-12, 2013, pages 95–106. ACM, 2013.

[185] NXP Semiconductors. NXP J3A040 and J2A040 Secure Controller, May 2011. Rev. 01.03.

[186] NXP Semiconductors. i.MX 6Dual/6Quad Applications Processor Reference Manual, July 2015. Rev. 3, Document Number: IMX6DQRM.

[187] Pascal Junod, Julien Rinaldini, Johan Wehrli, and Julie Michielin. Obfuscator-LLVM - software protection for the masses. In Paolo Falcarin and Brecht Wyseur, editors, 1st IEEE/ACM International Workshop on Software Protection, SPRO 2015, Florence, Italy, May 19, 2015, pages 3–9. IEEE Computer Society, 2015.

[188] Altera Corporation. Cyclone V Device Datasheet, December 2015. CV-51002, 2016.12.09.

[189] Daniel Apon, Yan Huang, Jonathan Katz, and Alex J. Malozemoff. Implementing cryptographic program obfuscation. IACR Cryptology ePrint Archive, 2014:779, 2014.

[190] Sanjam Garg, Craig Gentry, Shai Halevi, and Mariana Raykova. Two-round secure MPC from indistinguishability obfuscation. In Yehuda Lindell, editor, Theory of Cryptography - 11th Theory of Cryptography Conference, TCC 2014, San Diego, CA, USA, February 24-26, 2014. Proceedings, volume 8349 of Lecture Notes in Computer Science, pages 74–94. Springer, 2014.

[191] Philip Koopman. Writable instruction set, stack oriented computers: The wisc concept. In Proceedings of the 1987 Rochester Forth Conference, pages 49–71, 1987.

[192] Pawel Swierczynski, Amir Moradi, David Oswald, and Christof Paar. Physical security evaluation of the bitstream encryption mechanism of Altera Stratix II and Stratix III FPGAs. TRETS, 7(4):34:1–34:23, 2015.

[193] Clifford Wolf and Mathias Lasser. Project IceStorm. http://www.clifford.at/icestorm/, 2015.


[194] Sergei Skorobogatov and Christopher Woods. In the blink of an eye: There goes your AES key. IACR Cryptology ePrint Archive, 2012:296, 2012.

[195] Yan Chen, George Danezis, and Vitaly Shmatikov, editors. Proceedings of the 18th ACM Conference on Computer and Communications Security, CCS 2011, Chicago, Illinois, USA, October 17-21, 2011. ACM, 2011.

[196] IEEE 2nd International Verification and Security Workshop, IVSW 2017, Thessaloniki, Greece, July 3-5, 2017. IEEE, 2017.

[197] Ahmad-Reza Sadeghi, Virgil D. Gligor, and Moti Yung, editors. 2013 ACM SIGSAC Conference on Computer and Communications Security, CCS’13, Berlin, Germany, November 4-8, 2013. ACM, 2013.

[198] Domenic Forte, Swarup Bhunia, and Mark M. Tehranipoor, editors. Hardware Protection Through Obfuscation, volume 1. Springer Publishing Company, Incorporated, 1st edition, 2017.

[199] 2013 IEEE International Symposium on Hardware-Oriented Security and Trust, HOST 2013, Austin, TX, USA, June 2-3, 2013. IEEE Computer Society, 2013.

[200] Sushil Jajodia, Vijayalakshmi Atluri, and Trent Jaeger, editors. Proceedings of the 10th ACM Conference on Computer and Communications Security, CCS 2003, Washington, DC, USA, October 27-30, 2003. ACM, 2003.

List of Abbreviations

AES    Advanced Encryption Standard
API    Application Programming Interface
ASIC   Application Specific Integrated Circuit
ASIP   Application Specific Instruction-Set Processor
BDD    Binary Decision Diagram
BFS    Breadth-First Search
BNF    Backus-Naur Form
CFG    Control Flow Graph
CPU    Central Processing Unit
DCFG   Dynamic Control Flow Graph
DFS    Depth-First Search
FF     Flip Flop
FI     Fault Injection
FIB    Focused Ion Beam
FPGA   Field Programmable Gate Array
FSM    Finite State Machine
GCD    Greatest Common Divisor
GUI    Graphical User Interface
HDL    Hardware Description Language
HLS    High-Level Synthesis
IC     Integrated Circuit
I/O    Input/Output
IoT    Internet of Things
IP     Intellectual Property
ISA    Instruction Set Architecture
LFSR   Linear Feedback Shift Register
LUT    Look-Up Table
MIPS   Microprocessor without Interlocked Pipeline Stages
MMIO   Memory Mapped Input/Output
ORAM   Oblivious Random Access Memory
PRNG   Pseudo Random Number Generator
PUF    Physical Unclonable Function
RAM    Random Access Memory
RISC   Reduced Instruction Set Computer
ROM    Read-Only Memory
RTL    Register Transfer Level
RUB    Random Unique Block
SCA    Side-Channel Analysis
SEM    Scanning Electron Microscope
SHA    Secure Hash Algorithm
SoC    System-on-Chip
SRAM   Static Random Access Memory
UART   Universal Asynchronous Receiver Transmitter
WISC   Writable Instruction Set Computer

List of Figures

2.1 Example Moore FSM circuit as state transition graph (upper left part) with associated gate-level netlist in (1) visual graph-based representation (lower left part), and (2) textual representation with an exemplary gate library in Verilog (right part)...... 10

3.1 Overview of HAL’s architecture and workflow. ...... 20
3.2 Xilinx SP601 development board used for experimentation. A wiretap Trojan was inserted into an existing low-level netlist. Sensitive data is leaked via an additional Trojan UART connected to a TTL2RS232 USB module. ...... 32
3.3 Example of a LUT-6 with input pins I0 to I5 and output pin O. Pins I4 and I5 are connected to GND. 48 bits marked with X can be used for watermarks. ...... 34
3.4 Example of a LUT-6 with input pins I0 to I5 and output pin O. Pins I4 and I5 are connected to a NOR gate which constantly generates a logical ’0’ for each LFSR state (except the all-zero state). ...... 35

4.1 Example of the difference between isomorphism and similarity. G1 and G2 are isomorphic (f(1) = C, f(2) = A, f(3) = B, f(4) = D), whereas G2 and G3 are not isomorphic, even though they are topologically similar. The missing edge for an isomorphism from G2 to G3 is (α, δ). ...... 43
4.2 Combinational logic subgraph (marked as cloud) between two registers. ...... 45
4.3 Gate-level netlist reverse engineering case study results for our multiresolutional spectral analysis without any preprocessing techniques. AES Sbox (composite field) is synthesized for xc6slx16 with optimization goal area; the representative designs in a) - f) are also synthesized for xc6slx16 with optimization goal area. The top-ranked vertex is marked in black, the 75%-quantile vertex in blue, and the 50%-quantile vertex in green. The y-axis shows the spectral distance, and the x-axis shows the vertex labels. ...... 62
4.4 Gate-level netlist reverse engineering case study results for our multiresolutional spectral analysis without any preprocessing techniques. AES Sbox (composite field) is synthesized for xc6slx16 with optimization goal area; the representative designs in a) - f) are also synthesized for xc6slx16 with optimization goal speed. The top-ranked vertex is marked in black, the 75%-quantile vertex in blue, and the 50%-quantile vertex in green. The y-axis shows the spectral distance, and the x-axis shows the vertex labels. ...... 63

4.5 Trojan detection case study results for our multiresolutional spectral analysis without any preprocessing techniques. Trojan (design 13) is synthesized for xc6slx16 with optimization goal speed; the representative designs 9, 11, and 12 are also synthesized for xc6slx16 with optimization goal area. The top-ranked vertex is marked in black, the 75%-quantile vertex in blue, and the 50%-quantile vertex in green. The y-axis shows the spectral distance, and the x-axis shows the vertex labels. ...... 64
4.6 Hardware obfuscation assessment case study results for our multiresolutional spectral analysis without any preprocessing techniques. GCD (design 15) is synthesized for xc6slx16 with optimization goal area; the designs 14 and 10 are also synthesized for xc6slx16 with optimization goal area. The top-ranked vertex is marked in black, the 75%-quantile vertex in blue, and the 50%-quantile vertex in green. The y-axis shows the spectral distance, and the x-axis shows the vertex labels. ...... 65

5.1 Block diagram of a hardware FSM (dashed line in the case of a Mealy machine). ...... 68
5.2 Overview of our FSM reverse engineering workflow. Starting with a third-party, gate-level netlist, we first determine the gates of each FSM candidate using the topological analysis. Afterwards, each candidate is processed by the Boolean function analysis to determine the state transition graph. ...... 69
5.3 Input-independent state series starting point si (series marked in dashed red): (a) Branch: si has one successor and one predecessor which is a branch (multiple successors). (b) Merge: si has one successor and multiple predecessors. ...... 73
5.4 HARPOON design methodology example. The original FSM (dashed blue part) is augmented by an obfuscation mode s0^O, s1^O, s2^O, s3^O, s4^O and an authentication mode s0^A, s1^A, s2^A. The enabling key to reach the original initial state s0 is (i0, i1, i2). ...... 74
5.5 Dynamic State Deflection design methodology example. The original FSM (dashed blue part) is augmented by a HARPOON obfuscation mode (dotted red part) and each original state is protected by a black hole (states marked in black). ...... 76
5.6 Active Hardware Metering technique example. The RUB response determines the initial state of the FSM. The enabling key then determines the transition from the obfuscation mode states (marked in red) to the original initial state of the FSM (marked in blue). ...... 78
5.7 Interlocking Obfuscation design methodology example. The original FSM (dashed blue part) is augmented by an obfuscation mode and a code-word (dotted red part). ...... 80
5.8 State transition graph of AES IP core obfuscated with Dynamic State Deflection (including HARPOON). Tarjan’s algorithm splits all states into 3 strongly connected components: obfuscation mode of HARPOON (marked in red), black hole cluster states (marked in black), and original FSM (marked in blue). Rectangle nodes mark the sequence from initial state 000000 to the original initial state 010001. Input values for each state transition are deliberately left out for readability. ...... 83
5.9 State transition graph of AES IP core obfuscated with Active Hardware Metering. Tarjan’s algorithm splits all states into 2 strongly connected components: black hole cluster states (marked in black) and original FSM states (marked in blue). Input values for each state transition are deliberately left out for readability. ...... 86


5.10 State transition graph of SHA-3 IP core obfuscated with Active Hardware Metering. Tarjan’s algorithm splits all states into 3 strongly connected components: black hole cluster states (marked in black), original reset state (marked in blue), and original FSM states (marked in blue). Input values for each state transition are deliberately left out for readability. ...... 88
5.11 State transition graph of UART IP core obfuscated with Active Hardware Metering. Tarjan’s algorithm splits all states into 2 strongly connected components: black hole cluster states (marked in black) and original FSM states (marked in blue). Input values for each state transition are deliberately left out for readability. ...... 91
5.12 Original state transition graph diagrams of hardware designs utilized in Section 5.4. ...... 94

6.1 MIPS instruction format with 32-bit width. The bit-width of each instruction field is denoted by the number put in brackets after the field name. ...... 103
6.2 CPU datapath augmented with the hardware-level obfuscation features (marked in orange). ...... 104
6.3 Entropy of the opcode distributions for the different obfuscation strategies. ...... 117
6.4 sd3-distributions for the different obfuscation strategies. ...... 118
6.5 E-sd3-diagrams for the different obfuscation strategies. Colour legend for all figures is: bubbl. •, crc •, CRC32 •, des •, fact. •, fft •, fir •, iquant •, quant •. ...... 119
6.6 Correlations of the opcodes for the fft program for the different obfuscation strategies. ...... 120
6.7 Dynamic control flow graph similarity evaluation for the benchmark programs and different obfuscation strategies. ...... 121


List of Tables

3.1 Evaluation of the identification accuracy of ANGEL compared to FANCI [52] for hardware Trojans equipped with DeTrust (see Section 3.3.2 for details). The X symbol indicates that (parts of) the Trojan were identified, the  symbol indicates that no part of the Trojan was identified. ...... 25
3.2 Evaluation results of self-test reverse engineering. : successfully reverse engineered, : reverse engineering required minor manual netlist inspection, n.a.: the given AESG# could not be implemented for the device, blank: reverse engineering did not yield a result. ...... 29

4.1 Edit cost equations for parameter configurations label and subgraph. ...... 46
4.2 Hardware design description and resource utilization synthesized for xc6slx16 with optimization goal area. We selected the xc6slx16 FPGA as a representative, since resource utilization deviates only slightly for other FPGA families. ...... 53
4.3 Gate-level netlist reverse engineering case study results (phase 1). AES Sbox (composite field) is synthesized for xc6slx16 with optimization goal area. Parameters subgraph and label are true for all experiments, and only the combinational logic subgraph preprocessing technique is used. Computation time is the arithmetic mean for both synthesis options for each algorithm. A similarity score of 1 indicates that both graphs are identical, whereas a score of 0 indicates that both graphs have no similarity. ...... 54
4.4 Gate-level netlist reverse engineering case study results (phase 2). AES Sbox (composite field) is synthesized for xc6slx16 with optimization goal area. Parameters subgraph and label are true for all experiments, and the combinational logic bitslice and LUT decomposition preprocessing techniques are used. Computation time is the arithmetic mean for both synthesis options. A similarity score of 1 indicates that both graphs are identical, whereas a score of 0 indicates that both graphs have no similarity. ...... 54
4.5 Trojan detection case study results (phase 1). Trojan (design 13) is synthesized for xc6slx16 with optimization goal area. Parameters subgraph and label are true in all experiments, and only the combinational logic subgraph preprocessing technique is used. Computation time is the arithmetic mean for both synthesis options for each algorithm. A similarity score of 1 indicates that both graphs are identical, whereas a score of 0 indicates that both graphs have no similarity. ...... 57
4.6 Hardware obfuscation assessment case study results (phase 1). GCD (design 15) is synthesized for xc6slx16 with optimization goal area. Parameters subgraph and label are true, and only the combinational logic subgraph preprocessing technique is used. Computation time is the arithmetic mean for both synthesis options for each algorithm. A similarity score of 1 indicates that both graphs are identical, whereas a score of 0 indicates that both graphs have no similarity. ...... 59

6.1 Overview of ISA randomization schemes and their susceptibility to control flow and I/O attacks. ...... 102
6.2 Hardware area overhead for the additional obfuscation elements. ...... 109
6.3 Software performance evaluation for the obfuscation strategies. Each result indicates the number of cycles arithmetically averaged over 100 programs. ...... 110
6.4 Software size evaluation for the obfuscation strategies. Each result indicates the program memory size in kB arithmetically averaged over 100 programs. ...... 110

About the Author

Curriculum Vitae

Personal Data

Name: Marc Fyrbiak
Address: Chair for Embedded Security, ID 2/645, Universitätsstraße 150, 44801 Bochum, Germany
E-mail: [email protected]
Date of Birth: August 17th, 1990
Place of Birth: Braunschweig, Germany

Education

since 11/2014 Ph.D. Student, Ruhr-Universität Bochum, Germany. Chair for Embedded Security, Prof. Dr.-Ing. Christof Paar.

09/2014 M. Sc. IT-Security/Networks and Systems, Ruhr-Universität Bochum, Germany.

09/2012 B. Sc. Computer Science, Technische Universität Braunschweig, Germany.

07/2009 Abitur, Gaußschule, Gymnasium am Löwenwall, Braunschweig, Germany.

Peer-Reviewed Publications in Journals

• P. Swierczynski, M. Fyrbiak, P. Koppe, and C. Paar. FPGA trojans through detecting and weakening of cryptographic primitives. IEEE Trans. on CAD of Integrated Circuits and Systems, 34(8):1236–1249, 2015.

• P. Swierczynski, M. Fyrbiak, P. Koppe, A. Moradi, and C. Paar. Interdiction in practice - hardware trojan against a high-security USB flash drive. J. Cryptographic Engineering, 7(3):199–211, 2017.

• M. Fyrbiak, S. Rokicki, N. Bissantz, R. Tessier, and C. Paar. Hybrid obfuscation to protect against disclosure attacks on embedded microprocessors. IEEE Transactions on Computers, 67(3):307–321, 2018.

• M. Fyrbiak, S. Wallat, P. Swierczynski, M. Hoffmann, S. Hoppach, M. Wilhelm, T. Weidlich, R. Tessier, and C. Paar. HAL—the missing piece of the puzzle for hardware reverse engineering, trojan detection and insertion. IEEE Transactions on Dependable and Secure Computing, 2018, to appear.

• C. Kison, O. Awad, M. Fyrbiak, and C. Paar. Security implications of intentional capacitive crosstalk. To appear in IEEE Transactions on Information Forensics and Security.

Peer-Reviewed Publications in Conferences / Workshops

• P. Swierczynski, M. Fyrbiak, C. Paar, C. Huriaux, and R. Tessier. Protecting against cryptographic trojans in FPGAs. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, 2015, pages 215–222.

• M. Fyrbiak, S. Strauss, C. Kison, S. Wallat, M. Elson, N. Rummel, and C. Paar. Hardware reverse engineering: Overview and open challenges. In International Verification and Security Workshop, 2017, pages 88–94.

• S. Wallat, M. Fyrbiak, M. Schlögel, and C. Paar. A look at the dark side of hardware reverse engineering - a case study. In International Verification and Security Workshop, 2017, pages 95–100.

• P. Koppe, B. Kollenda, M. Fyrbiak, C. Kison, R. Gawlik, C. Paar, and T. Holz. Reverse engineering x86 processor microcode. In USENIX Security Symposium, 2017, pages 1163–1180.

• C. Wiesen, M. Elson, N. Rummel, M. Fyrbiak, S. Becker, and C. Paar. Hardware Reverse Engineering als eine spezielle Art des Problemlösens. In Kongress der Deutschen Gesellschaft für Psychologie, 2018.

• M. Fyrbiak, S. Wallat, J. Déchelotte, N. Albartus, S. Böcker, R. Tessier, and C. Paar. On the difficulty of FSM-based hardware obfuscation. In IACR Transactions on Cryptographic Hardware and Embedded Systems, 2018, volume 3, pages 293–330.


• B. Kollenda, P. Koppe, M. Fyrbiak, C. Kison, C. Paar, and T. Holz. An exploratory analysis of microcode as a building block for system defenses. In ACM Conference on Computer and Communications Security, 2018, pages 1649–1666.

• C. Wiesen, S. Becker, M. Fyrbiak, N. Albartus, M. Elson, N. Rummel, and C. Paar. Teaching hardware reverse engineering: Educational guidelines and practical insights. In IEEE International Conference on Engineering, Technology and Education, 2018.

• C. Wiesen, S. Becker, N. Albartus, M. Hoffmann, S. Wallat, M. Fyrbiak, N. Rummel, and C. Paar. Towards cognitive obfuscation: Impeding hardware reverse engineering based on psychological insights. In Asia and South Pacific Design Automation Conference, 2019.

Publications in Book Chapters

• G. T. Becker, M. Fyrbiak, and C. Kison. Hardware obfuscation: Techniques and open challenges. Foundations of Hardware IP Protection, Springer International Publishing, 105–123, 2017.

Publications Under Review

• M. Fyrbiak, S. Wallat, S. Reinhard, N. Bissantz, and C. Paar. Graph similarity and its applications to hardware security. Under Review.

Participation in Selected Conferences and Workshops

• IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM) 2015, Vancouver, Canada.

• IEEE International Verification and Security Workshop (IVSW) 2017, Thessaloniki, Greece.

• USENIX Security Symposium 2017, Vancouver, Canada.

• CHES 2018, Amsterdam, The Netherlands.
