ENERGY-EFFICIENT AND SECURE RECONFIGURABLE COMPUTING ARCHITECTURE

By

ROBERT KARAM

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2017

© 2017 Robert Karam

To my parents, Hani and Diana, my sister, Christina, and my wife and best friend, Ran

ACKNOWLEDGMENTS

I would like to express the deepest appreciation for my advisor and committee chair, Dr. Swarup Bhunia. Without his advice, expertise, and understanding, this work would not be where it is today. I would like to acknowledge my committee members, Dr. Mark Tehranipoor, Dr. Greg Stitt, and Dr. Kevin Butler, for their expertise, comments, and suggestions during this process. I would like to recognize the Department of Electrical and Computer Engineering at the University of Florida, and the students and staff of the Florida Institute of Cyber Security (FICS) lab, many of whom are close friends and collaborators.

I would also like to thank friends and family, who have always been there for me, ready to help with anything, to celebrate the wins, and offer support when things were not so great. Finally, a very special and personal thank you to my wife and best friend, Ran. Despite all the things that have changed these past few years, your support and motivation have never wavered.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND AND MOTIVATION
  2.1 Efficient Reconfigurable Computing
    2.1.1 Hardware Architecture
    2.1.2 MAHA Architecture
    2.1.3 Comparison to FPGA
  2.2 Security Concerns for Reconfigurable Architectures

3 FPGA BITSTREAM SECURITY FOR MODERN DEVICES
  3.1 Background
  3.2 FPGA Bitstream Security Issues
  3.3 FPGA Dark Silicon
  3.4 Bitstream Protection Methodology
    3.4.1 Design Obfuscation
    3.4.2 Key Generation
    3.4.3 Initial Design Mapping
    3.4.4 Security-Aware Mapping
    3.4.5 Communication Protocol and Usage Model
  3.5 Overhead and Security Analysis
    3.5.1 Experimental Setup
    3.5.2 Overhead Analysis
    3.5.3 Security Analysis
      3.5.3.1 Brute force attack
      3.5.3.2 Known design and bitstream tampering attacks
  3.6 Summary

4 SECURITY FOR NEXT-GENERATION FPGA DEVICES
  4.1 Background
  4.2 FPGA Hardware Security
    4.2.1 Mutable FPGA Architecture
      4.2.1.1 Physical layer
      4.2.1.2 Logical layer
    4.2.2 Secure FPGA Mapper
    4.2.3 Correctness of Mapped Design
  4.3 Results
    4.3.1 Security Analysis
      4.3.1.1 Brute force attacks
      4.3.1.2 Known design attack and bitstream tampering
      4.3.1.3 Side channel attack (SCA)
      4.3.1.4 Destructive reverse engineering (DRE)
    4.3.2 Secure Mapping Results
  4.4 Cost Analysis Based on a Case Study
  4.5 Summary

5 ARCHITECTURAL DIVERSITY FOR SECURITY
  5.1 Background
  5.2 Motivation for IoT Device Security
    5.2.1 Related Work
    5.2.2 Attack Vectors in IoT
      5.2.2.1 Firmware reverse engineering
      5.2.2.2 Targeted malicious modification
      5.2.2.3 Malware propagation
  5.3 Security through Diversity
    5.3.1 Keyed Permutation Networks
    5.3.2 Mixed-Granular Permutations
      5.3.2.1 Instruction encoding
      5.3.2.2 Dependent instruction reordering
    5.3.3 Wireless Reconfiguration
  5.4 Results and Discussion
    5.4.1 Overhead Analysis
    5.4.2 Security Analysis
      5.4.2.1 Brute force
      5.4.2.2 Known design and side channel attacks
    5.4.3 Relationship to encryption
  5.5 Summary

6 RECONFIGURABLE ACCELERATOR FOR GENERAL ANALYTICS APPLICATIONS
  6.1 Background
  6.2 Requirements of Big Data Analytics Systems
    6.2.1 Overview of Big Data Analytics
    6.2.2 Analytics Applications
      6.2.2.1 Operations
      6.2.2.2 Communication
    6.2.3 Disk-to-Accelerator Compression
  6.3 In-Memory Analytics Acceleration
    6.3.1 System Organization
    6.3.2 Accelerator Hardware Architecture
      6.3.2.1 Processing element architecture
      6.3.2.2 Interconnect architecture
    6.3.3 Application Mapping
    6.3.4 Parallelism and Execution Models
    6.3.5 System-level Memory Management
  6.4 Methods
    6.4.1 Experimental Setup
    6.4.2 Functional Verification
    6.4.3 Analytics Kernels
      6.4.3.1 Classification
      6.4.3.2 Neural network
      6.4.3.3 Clustering
    6.4.4 Dataset Compression
  6.5 Results
    6.5.1 Throughput
    6.5.2 Energy Efficiency
  6.6 Discussion
    6.6.1 Performance and Parallelism
    6.6.2 Scalability
    6.6.3 Transfer Energy and Latency
    6.6.4 Memory-Centric Processing
    6.6.5 Analytics and Machine Learning
  6.7 Related Work
    6.7.1 FPGA Analytics
    6.7.2 GPGPU Analytics
  6.8 Summary

7 RECONFIGURABLE ACCELERATOR FOR TEXT MINING APPLICATIONS
  7.1 Background
  7.2 Related Work
    7.2.1 Existing Work
      7.2.1.1 Interfaces and retrofitting
      7.2.1.2 Distance to the data
      7.2.1.3 Flexibility overhead
    7.2.2 Application Survey
  7.3 Hardware and Software Framework
    7.3.1 Processing Elements and Functional Units
      7.3.1.1 Term frequency counting
      7.3.1.2 Classification
    7.3.2 Interconnect Network
    7.3.3 Control and Data Engines
    7.3.4 Application Mapping
    7.3.5 System Architecture
  7.4 Lucene: A Case Study
    7.4.1 Lucene Optimizations
    7.4.2 Lucene Indexing Profile
  7.5 Results
    7.5.1 Emulation Platform
    7.5.2 Experimental Setup
    7.5.3 Indexing Acceleration
      7.5.3.1 Downcasting and tokenizing
      7.5.3.2 Frequency counting
      7.5.3.3 Classification
    7.5.4 Lucene Indexing Acceleration
    7.5.5 Iso-Area Comparison
  7.6 Discussion
    7.6.1 Performance
    7.6.2 Memory-Centric Architecture
    7.6.3 In-Memory Computing
    7.6.4 Extensions to Other Languages
    7.6.5 Application Scope
  7.7 Summary

8 SECURE RECONFIGURABLE COMPUTING ARCHITECTURE
  8.1 Combining Diversity Techniques for MAHA
  8.2 Implementation Details
  8.3 Security Analysis
    8.3.1 Security against Brute Force Attacks
    8.3.2 Side-Channel and Known Design Attacks
  8.4 Conclusion

9 CONCLUSION
  9.1 Research Accomplishments
  9.2 Future Work

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 Cumulative percentage of 1 - 7 input LUTs
3-2 Example LUTs with 2 primary inputs and 1 key input. The true function is Z = X ⊕ Y, which is only selected when K = 0.
3-3 Original and secure mapping results for small combinational benchmarks
3-4 Original, intermediate, and secure mapping results for three large IP blocks
4-1 Properties and qualitative comparison of physical and logical keys
4-2 Key allocation for various FPGA resources
4-3 Mapping results and quantitative comparison between original and secure bitstreams
5-1 Summary of proposed diversifications for IoT security
5-2 Area, power, and performance overhead due to modifications of OR1200 CPU
5-3 Brute force complexity relative to fine-grained permutation network dimension
6-1 Configuration for the general analytics accelerator
7-1 Configuration of the text analytics accelerator
7-2 Resource utilization for FPGA-based emulator

LIST OF FIGURES

3-1 Key-based security-aware FPGA application mapping using the proposed bitstream obfuscation technique
3-2 Security-aware mapping procedure for modern FPGA bitstreams
3-3 Software flow leveraging FPGA dark silicon for design security through key-based obfuscation
3-4 Remote upgrade of secure and obfuscated bitstreams
4-1 Overview of the time-varying FPGA architecture
4-2 Overall design flow for MUTARCH FPGAs
4-3 Secure FPGA mapping, with modifications to the VTR mapping flow denoted by the shaded box
5-1 Devices sharing the same hardware are vulnerable to the same attacks once an exploit is discovered
5-2 Taxonomic overview of existing techniques for security through diversity
5-3 Non-blocking networks are appropriate for time-varying permutations
5-4 Implementing architectural diversity in a generic RISC microprocessor
5-5 Software flow for securing compiled firmware binaries in microprocessor systems
6-1 Proposed reconfigurable accelerator for general analytics kernels
6-2 Trade-off between average compression routine length and achievable compression ratio
6-3 Memory-centric processing element microarchitecture
6-4 The routerless, two-level hierarchy with an 8-PE cluster and 2D mesh
6-5 Software flow for the modified MAHA mapper tool
6-6 Schematic of the system-level memory management
6-7 The hardware emulation platform. Photograph courtesy of author
6-8 Instruction mix in the mapped analytics applications
6-9 Comparison of the power, performance, and efficiency of the three platforms for general analytics applications
7-1 System architecture showing the location of the accelerator and the last level memory device
7-2 System architecture for the text mining accelerator
7-3 Hardware implementation of term frequency counting using content addressable memory and SRAM for storage
7-4 Interconnect fabric of the accelerator, with eight PEs sharing data through dedicated bus lines in a cluster, and clusters connected in a 2D mesh
7-5 Major operations conducted in the Lucene text indexing software flow
7-6 Profiling results on Lucene for a 1 GB and 50 GB dataset
7-7 FPGA-based emulation platform for the text mining accelerator
7-8 Comparison of the throughput (Mbps) and energy (J) for the four application kernels among the four platforms
7-9 Comparison of Energy Delay Product (EDP) for the application kernels among the four platforms
7-10 Comparison of the transfer and compute energy among the four platforms
7-11 Iso-area comparison (per mm²) of data processing throughput (Mbps)
8-1 Securing the MAHA architecture using architectural diversification

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ENERGY-EFFICIENT AND SECURE RECONFIGURABLE COMPUTING ARCHITECTURE

By

Robert Karam
August 2017

Chair: Swarup Bhunia
Major: Electrical and Computer Engineering

With billions of smart, connected devices -- and counting -- the Internet of Things (IoT) is poised to be the "next big thing" in computing. Recent events such as the hacking of connected vehicles or the mounting of massive Distributed Denial-of-Service (DDoS) attacks using IoT-enabled botnets demonstrate that, all too often, security of these systems is either overlooked or inadequately ensured. Compounding this problem is the fact that many of these devices are controlled by the same off-the-shelf hardware. This hardware homogeneity has emerged as a major security risk for these devices, which are set to be produced and deployed in the millions, since the discovery of a vulnerability in one system can affect many more which are based on the same underlying hardware. These devices tend to have long in-field lifetimes, providing adversaries with ample time to mount diverse physical attacks on hardware and firmware, and also tend to support remote upgrade, which introduces additional vulnerabilities for network-based attacks. This work investigates the major security issues in the IoT regime and presents a powerful solution, based on architectural diversity, to protect against several major attacks. Just as genetic diversity in nature can help bolster survival of a species, architectural diversity represents a fundamental counter to many of the vulnerabilities introduced by hardware homogeneity. A number of case studies with FPGAs and microcontrollers demonstrate how diversity can be realized with both physical and logical architectural variations, and how it can help curtail the propagation of maliciously modified bitstreams in deployed FPGA-based devices, and the propagation of malware in microprocessor-based devices. This work also introduces an alternative reconfigurable computing framework, designed to be more energy-efficient, scalable, and secure, three critical design goals for the emerging IoT domain.

CHAPTER 1
INTRODUCTION

Energy efficiency, reconfigurability, and security are three seemingly contradictory requirements of today's electronic systems. Modern devices, especially those that are part of the "Internet of Things" (IoT), are often battery powered, portable, and have small form factors. Many IoT edge devices act as more than just sensor nodes: they also have computation requirements for complex tasks such as machine learning, and if real-time decisions are required, they may need significant local computing resources. Today's strict time-to-market requirements often result in deployment with little to no security. Devices are in the field for long periods of time, and are often capable of much more than the design for which they are intended, especially when mass-produced commercial off-the-shelf processors are used. Reconfigurable devices are often power hungry, or else sacrifice performance to reduce power consumption. Meanwhile, secure devices typically require additional circuitry, such as encryption modules, which consume additional area and power. In short, the unique requirements of next-generation devices necessitate a paradigm shift in the design of efficient and secure hardware that is capable of adapting to new and emerging needs.

Reconfigurable hardware devices, such as Field Programmable Gate Arrays (FPGAs), provide numerous benefits to system developers, especially the ease of prototyping, the flexibility to realize the desired functionality, and the ability to leverage inherent parallelism in the design through judicious function duplication in the spatial application mapping. However, compared to dedicated hardware, i.e., Application Specific Integrated Circuits (ASICs), reconfigurable hardware is generally far from area, power, or delay optimal. On the other hand, ASIC development is generally expensive, and FPGAs typically represent the more economical choice for low-volume production applications.

Thus, despite the fact that ASICs provide better area, power, and delay characteristics, recent years have seen a rapid proliferation in the use of Field Programmable Gate Arrays (FPGAs) in diverse areas, including automotive, defense, and the emerging Internet of Things (IoT) domain. With estimates in the tens of billions of IoT devices in the coming years, FPGAs are expected to provide the required flexibility to implement critical functions like machine learning algorithms in many of these devices, while consuming less energy than microcontroller-based designs. However, the FPGA configuration file, or bitstream, is susceptible to a variety of attacks, which can lead to malicious modification, unauthorized reprogramming, reverse-engineering, and intellectual property (IP) piracy. Modern devices often include on-board encryption hardware, but encryption alone is not sufficient to guarantee IoT system security. While standard encryption algorithms are known to be highly secure against brute-force attacks, their physical implementations are often susceptible to hardware-based attacks. Because devices are mass produced during fabrication, a vulnerability found in the design or architecture of one device will be present in all others. In the case of FPGAs, a valid configuration for one device -- even if it has been maliciously modified -- will function on another just as well. For example, an attacker could feasibly decrypt, reverse engineer, and modify the configuration of a vehicle's safety-critical FPGA, and deploy this maliciously modified bitstream to other similar vehicles, leading to property damage and risking human life.

While FPGAs are highly flexible, it should be noted that they generally suffer from poor scalability due to the expansive programmable interconnect (PI) network, and greater leakage power dissipation due to the highly configurable distributed memory architecture. Previous work has demonstrated that alternative architectures, such as Malleable Hardware (MAHA), a memory-based computing framework, can be used in place of FPGAs for a wide range of applications. This framework operates under a spatio-temporal paradigm, significantly reducing the PI network requirements, and employs dense, 2D memory arrays for both computation and data storage. Crucially, it is amenable to hardware architectural diversification due to its memory-centricity, the reconfigurable datapath, and the interconnect.

The goal of this work is twofold: firstly, investigating the security issues surrounding reconfigurable and reprogrammable architectures, using the FPGA and microcontroller as case studies, given their widespread (in the case of microcontrollers) and increasing use (in the case of FPGAs) in today's IoT market. The goal is to find a solution for these issues which addresses the root cause, identified as "hardware homogeneity," rather than adding additional hardware to treat a "symptom" without truly addressing the problem at hand. This solution, referred to as "architectural diversity," is a promising technique which can be applied both to existing and future designs of reconfigurable and reprogrammable systems. These solutions are presented for modern FPGAs, next-generation FPGAs, and finally for next-generation microprocessors, with detailed security analyses provided at the end of each chapter. The level of security, defined as the number of brute force attempts required to fool the protection scheme, as well as arguments for why the technique is more robust against known design and side channel attacks, are provided. The second aspect of this work is developing techniques for designing an energy-efficient and scalable reconfigurable computing architecture whose efficiency comes, at least in part, from a level of domain specificity. Two different implementations of such a framework are presented, one for accelerating machine learning and general analytics applications, and the other for performing text mining or text analytics operations. These case studies demonstrate how energy efficiency and scalability can be achieved in a reconfigurable framework, and provide a foundation for exploring how security can be built into the design. To conclude, I discuss some of the new research pathways made available by this work.

CHAPTER 2
BACKGROUND AND MOTIVATION

This chapter provides the required background on the two main aspects of this work, namely energy-efficiency and scalability as they pertain to reconfigurable architectures, and security of reconfigurable or reprogrammable hardware.

2.1 Efficient Reconfigurable Computing

There are numerous examples in the literature for improving the efficiency of FPGAs, but these techniques essentially provide incremental improvements in hardware structures in the FPGA fabric, such as hardened resources / IP selection, application mapping strategies, or the programmable interconnect (PI) network.

Network architectures such as systolic arrays were shown to improve energy efficiency in FPGAs for specific applications (1). Similarly, customizing hardened resources in the FPGA fabric could reduce interconnect usage, reducing power consumption (2). A form of approximate computing implemented as a heterogeneous computing system within the FPGA has also shown promise for reducing power consumption; lookup tables (LUTs), digital signal processing (DSP) blocks, and embedded memory blocks (EMBs) are used in a way that reduces power consumption. For example, responses to complex functions may be stored in the EMBs and looked up, rather than being implemented in logic, if the cost of a memory lookup is less than implementing in LUTs or using the fabric's DSP blocks (3).

However, the PI network accounts for nearly 80% of the power and 60% of the delay in FPGAs (4), and still dominates performance scalability for FPGAs and other reconfigurable systems.

There are many recent examples in the literature which show improvements in the electrical characteristics of the PI network. For example, supply voltage programmability can improve efficiency (5). A low-swing, near or sub-threshold interconnect enables designers to leverage well-known power reduction techniques such as dynamic voltage scaling (DVS) and power gating to improve energy efficiency (6). Finally, adding tristate buffers into multiplexers (muxes) used in the routing switches can reduce dynamic (active) power by 25%, and static (leakage) power by 81% (7).

None of these approaches, however, reduces the FPGA's reliance on the PI network, because at its core, the FPGA employs a purely spatial application mapping strategy, whereby individual logic gates, represented in LUTs and fused when appropriate, are placed in the FPGA fabric and connected together via the PI network. This purely spatial mapping strategy is the primary reason the PI network is so expansive. An alternative, used by other reconfigurable fabrics including the Malleable Hardware Accelerator (MAHA), incorporates temporal computing, where functions are evaluated over multiple cycles. This can alleviate the PI requirement, and the power savings improve as the PI overhead is reduced. Meanwhile, the energy efficiency only improves as long as performance is not significantly impacted, such that the additional power due to multicycle computations is sufficiently low. Energy efficiency, therefore, can be written more formally as the Energy Delay Product (EDP): the product of the energy used for a single operation and its latency. When comparing two systems' EDPs, the lower of the two is considered to be more energy efficient.
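Written out as a formula (the subscripted symbols below are introduced here only for illustration; they simply name the per-operation energy and latency from the definition above):

\[
\mathrm{EDP} = E_{\mathrm{op}} \times t_{\mathrm{op}},
\qquad
\text{platform A is more energy efficient than platform B} \iff \mathrm{EDP}_A < \mathrm{EDP}_B .
\]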

The following subsections will describe the MAHA architecture, including the processing elements (PEs), the general interconnect strategy, and the software flow used to map applications to the framework.

2.1.1 Hardware Architecture

MAHA is a spatio-temporal, mixed-granular reconfigurable hardware framework. It consists of a set of single-issue, RISC-style processing elements (PEs). Each PE is referred to as a Memory Logic Block (MLB) and works independently. Different MLBs are connected through a multi-level hierarchy that balances bandwidth with scalability.

A single MLB contains blocks typically found in a Reduced Instruction Set Computer (RISC) microprocessor, such as a datapath, register file, instruction memory (called a schedule table), and data memory. One of the main differences is that a significant portion of the data memory is reserved for implementing multi-input, multi-output LUTs, and a primary strategy when mapping applications to the framework is to leverage these LUTs for more energy-efficient function evaluation. Furthermore, a custom datapath block supports common operations such as add/sub, exclusive-or, and shift, all of which are common in most applications, though as shown in the following chapters, this datapath can be customized depending on the target application domain to achieve greater energy and area efficiency. The schedule table holds the instructions to be executed by the MLB, while a decoder unit is responsible for decoding the instructions fetched from the schedule table. Finally, a local register file holds the intermediate results from the application during processing. Thus, a single MLB represents the temporal aspect of the MAHA function evaluation -- functions are evaluated over multiple cycles in an ultra-lightweight PE designed to compute with either a logic datapath or lookup tables, whichever is more efficient.

Typical applications cannot be accommodated by a single MLB -- it would be inefficient to store all instructions of a real-world application in an appropriately sized schedule table, while simultaneously supporting multi-threaded execution, multiported memories, multiple decoders, etc. Therefore, to enable the application mapping procedure to leverage available parallelism in the application, multiple copies of the MLB are grouped and interconnected in a scalable manner. To accommodate various applications' communication requirements, MAHA uses a multi-level hierarchical interconnect, the specifics of which may vary among implementations. Therefore, the number of MLBs found in each level of the hierarchy, as well as the size of the local interconnects within each level of hierarchy, depends strongly on the target applications. As an example, the lower level may consist of groups of four fully-connected MLBs, with increasingly sparse connections at higher levels of the hierarchy, e.g. tree-based or mesh-based interconnects. Since input applications are statically scheduled, all communication is deterministic, and there is no need for dedicated routers or more complex networks.

Each PE in the array can function independently of the others, and for most applications will have to share intermediate results on the data buses between PEs. This defines the spatial computing aspect of MAHA's function evaluation, enabling larger, more complex application kernels to be executed in a manner that leverages any parallelism, while also enabling re-use of code and LUTs within a single MLB.
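Purely for illustration, one possible MLB and cluster configuration can be pictured with the sketch below; the field names and sizes are assumptions chosen for this example, not parameters of the actual MAHA implementation.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MLBConfig:
    """Illustrative resources of a single Memory Logic Block (MLB)."""
    schedule_table_entries: int = 64          # instructions evaluated over multiple cycles
    lut_memory_bits: int = 8 * 1024           # data memory reserved for multi-input, multi-output LUTs
    data_memory_bits: int = 8 * 1024          # remaining data storage
    register_file_words: int = 16             # holds intermediate results during processing
    datapath_ops: Tuple[str, ...] = ("add", "sub", "xor", "shift")  # customizable per domain

@dataclass
class ClusterConfig:
    """Lowest level of the hierarchy: a small group of fully connected MLBs."""
    mlbs: List[MLBConfig] = field(default_factory=lambda: [MLBConfig() for _ in range(4)])
    # Higher hierarchy levels (tree- or mesh-based) would connect clusters with sparser links.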

Striking the right balance between spatially-distributed function evaluation and temporal computing is a difficult task that depends on the mapping strategy and constraints, and it ultimately requires a software toolflow capable of both.

2.1.2 MAHA Software Architecture

In order to optimize the resource usage and efficiency of the MAHA framework, it is necessary to have a powerful and comprehensive software toolflow for application mapping. A set of benchmark applications from various domains has been used to verify the software flow. Applications are represented as Control and Dataflow Graphs, or CDFGs. Each node can be considered an instruction, thus the first step in the mapping process is to decompose the CDFG into a set of smaller subgraphs which are either parallelizable or minimize dependencies across instances. Complex operations which target lookup tables, as well as datapath instructions, are placed into specific MLBs depending on resource availability and timing requirements. Statically scheduled inter-MLB communication is driven by timing requirements, as well as the specific interconnect hierarchy for the specific variant or customization of the MAHA architecture. The final placement and routing is checked to ensure it meets area, power, and delay requirements, and may be run again with alternative placement if necessary until closure.
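The sketch below is a toy, NetworkX-based illustration of this partition-place-schedule structure; the decomposition and placement heuristics here are deliberately simplistic stand-ins and not the actual MAHA mapper.

import networkx as nx

def map_application(cdfg: nx.DiGraph, num_mlbs: int = 4, mlb_capacity: int = 64):
    """Toy MAHA-style mapper: partition a CDFG and place the pieces onto MLBs.

    Every node is treated as one instruction; the real flow additionally
    distinguishes LUT-bound complex operations from datapath instructions.
    """
    # 1. Decompose the CDFG into loosely coupled subgraphs (weakly connected
    #    components stand in for the real dependency-aware decomposition).
    subgraphs = [cdfg.subgraph(c).copy() for c in nx.weakly_connected_components(cdfg)]

    # 2. Greedy placement: fill the least-loaded MLB first, within a capacity budget.
    placement = {i: [] for i in range(num_mlbs)}
    for sg in sorted(subgraphs, key=len, reverse=True):
        target = min(placement, key=lambda i: len(placement[i]))
        if len(placement[target]) + len(sg) > mlb_capacity:
            raise RuntimeError("design does not fit; retry with more MLBs or capacity")
        placement[target].extend(sg.nodes)

    # 3. Static schedule: every CDFG edge that crosses MLBs becomes a
    #    deterministic inter-MLB transfer assigned to a fixed time slot.
    node_to_mlb = {n: i for i, nodes in placement.items() for n in nodes}
    crossing = [e for e in cdfg.edges if node_to_mlb[e[0]] != node_to_mlb[e[1]]]
    schedule = [(u, v, slot) for slot, (u, v) in enumerate(crossing)]
    return placement, schedule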

2.1.3 Comparison to FPGA

The MAHA architecture is amenable to data-intensive applications due to its low PI requirements, hierarchical interconnect, and local and in-place computation. Power consumption from data movement through PIs can be minimized with targeted mapping mechanisms. Moreover, since MAHA consists of memory arrays to do the computing, optimizing memory access energy and latency can have a significant positive impact on the framework. In particular, the MAHA framework differs from conventional FPGAs in the following ways:

• FPGA is a spatial-only framework, while MAHA is spatio-temporal. Such an architecture can significantly reduce the PI requirement, improving the energy-efficiency and performance scalability.

• Unlike FPGAs, which use a large number of one-dimensional LUTs, MAHA consists of high-density, 2D memory arrays to hold the data and LUT results.

• MAHA uses a hierarchical interconnect architecture, which promotes local data computation and scalability, whereas FPGAs use mesh-based connections which are more suited to purely spatial application mapping.

• MAHA is a mixed-granular reconfigurable framework which can be customized to improve energy and performance in a certain domain of applications. Both fine-grained and coarse-grained MAHA frameworks are described in this work.

A framework based on MAHA can trade off between power and performance and between spatial and temporal mappings, with potentially customized datapaths and precomputed results for complex functions, making it potentially highly energy efficient. MAHA has been explored for general data-intensive applications (8) using emerging nonvolatile memory technologies (9), as well as in a variant targeted at fine-grained operations (10).

2.2 Security Concerns for Reconfigurable Architectures

Security has emerged as a primary concern in modern electronic devices, but despite considerable effort, vulnerabilities are still commonly exploited, often on a large scale. The era of IoT has only heightened these security concerns, and solutions are required with increasing urgency. By their very nature, reconfigurable architectures have traditionally been considered more robust against supply chain attacks, such as hardware Trojan or malicious circuit insertion, because the design details -- the actual circuits that will be mapped to the chip -- are not known until after fabrication. Recent work has shown that this is not necessarily the case; Trojans may be inserted into the FPGA hardware, and they may significantly impact eventual functionality or reduce reliability, even when final design details are not known (11). Furthermore, recent studies have shown that it may be possible to attack the Computer Aided Design (CAD) tools, which are used to map designs onto the target device (12).

Another consideration is the security of the bitstream, or configuration file, itself. Compared to traditional hardware reverse engineering, where fabricated chips must undergo an extensive series of steps involving chemical or other physical processes and highly sophisticated and detailed imaging techniques in order to expose design details, reverse engineering a compiled bitstream requires significantly fewer resources. Even so, these bitstreams contain representations of valuable intellectual property, and if properly processed, may be stolen (IP piracy) and used in any other design, even on a different FPGA, or as a starting point for dedicated hardware. Therefore, the security of the FPGA can be compromised when the FPGA is fabricated, when designs are being mapped, or when the bitstream is acquired by the attacker. These issues are exacerbated in the emerging IoT domain, where devices are in the field for many years, such as in vehicles or home appliances, since prolonged physical access provides an attacker with the time required to mount many physical and side channel attacks.

Finally, the network connectivity of IoT devices, especially those that have long in-field lifetimes, is of great concern. Firmware upgrades -- whether bitstreams for FPGAs or compiled code for microcontroller-based systems -- are a basic requirement for such devices. Over time, upgrades can be used to deploy new features, fix existing bugs, or patch security vulnerabilities when they are found. Encryption is the de facto method for protecting new configurations for devices in the field. However, the encryption hardware itself is often susceptible to hardware-based attacks, which leaves the new configuration vulnerable to reverse engineering, piracy, and other costly attacks. For example, FPGA bitstreams can be encrypted before a remote upgrade, but in order for the device to reconfigure itself, it is necessary to send the decryption key along with the bitstream. A motivated attacker with network access can snoop on such remote upgrades, decrypt the bitstream, and sell or reverse engineer the IP contained therein. If the bitstream is maliciously modified, there are few safeguards to prevent such maliciously modified bitstreams, or malstreams, from propagating to other devices that use the same underlying hardware. Similarly, if one microprocessor-based device is infected with malicious software, or malware, it can easily propagate to other devices with that particular microprocessor.

In fact, this exposes the central, underlying security issue -- the problem of hardware homogeneity. Especially in today's economy of scale, devices are designed, fabricated, assembled, and tested in mass production facilities, and many of these devices share the same underlying hardware, whether it is an FPGA, microprocessor, or ASIC. This naturally increases efficiency and profits, but is not ideal for security. Analogous to the mass production of uniform hardware is the agricultural practice of monoculture, where the same variety of crops or plants is grown in the same field year after year. This practice leads to the buildup of pests and diseases, which eventually spread rapidly with devastating consequences. In nature, this has historically resulted in the demise of certain crops or plants, such as the Gros Michel banana in the mid-1900s. Addressing the hardware homogeneity in modern devices requires a paradigm shift in the design, production, and deployment practices of today's semiconductor industry, and will be central to system security in the coming years.

CHAPTER 3
FPGA BITSTREAM SECURITY FOR MODERN DEVICES

In order to study the issues surrounding the security of reconfigurable systems, it is worthwhile to first consider an existing framework, namely the FPGA, to see what the real-world security challenges are, what potential solutions may be available now for existing devices, and what changes can be made to the architecture in the future to truly address these issues. This chapter presents one such case study on FPGA security, though the issues raised are not FPGA-specific; rather, they can potentially impact any current or future reconfigurable architecture unless proper safeguards are in place. This chapter previously appeared* in the 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig).

3.1 Background

System security is becoming an increasingly important design consideration in modern computing systems, especially network-connected mobile and Internet of Things (IoT) devices. With estimates of tens of billions of connected systems in the coming years, it is imperative to have technologies, architectures, and protocols that pair device efficiency with hardware security, data security, and privacy. The use of reconfigurable hardware platforms, such as Field Programmable Gate Arrays (FPGAs), is a common practice that helps designers to satisfy the growing demands on the area, performance, cost, and power requirements of next-generation devices (106).

In particular, FPGAs are well-suited to after-market reconfigurability, enabling them to adapt to changing requirements in functionality, energy-efficiency, and security during the lifetime of a device (107). This is a critical design consideration in many application domains, including military, automotive, IoT, and data centers, among others.

* R. Karam, T. Hoque, S. Ray, M. Tehranipoor, and S. Bhunia. "Robust Bitstream Protection in FPGA-based Systems Through Low-Overhead Obfuscation," in Proceedings of the 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2016.

The role of FPGAs in computing system security has recently seen significant interest from diverse sectors (108; 109). However, existing research primarily focuses on realizing functional security primitives in FPGA. We note that in addition to serving as an engine for security primitives, FPGAs are inherently more secure against supply chain attacks. In particular, the post-silicon reconfigurability of FPGAs implies that the true circuit functionality is not realized until after chip production, and design details are not exposed to untrusted parties in the foundry and the supply chain. However, the security of the FPGA bitstream (or configuration file) can still be at risk, both during initial configuration in untrusted third-party system integration facilities as well as during wireless and in-field reconfiguration. These bitstreams are susceptible to various attacks, including unauthorized programming of third-party hardware, reverse-engineering of the design, malicious modifications such as hardware Trojan insertion (110), and cloning/piracy of the valuable Intellectual Property (IP) blocks. Such attacks can cause significant monetary loss for the IP designer, and unwanted, unreliable, and even potentially catastrophic operation in the field.

One existing approach to defend against these attacks is bitstream encryption (111), which is typically available for high-end FPGAs. Unfortunately, for many situations, encryption alone is not sufficient for comprehensive system security:

1. FPGAs aimed at applications constrained by aggressive low-energy and cost requirements or tiny form-factors may not include dedicated encryption blocks due to area or energy constraints.

2. During remote upgrade, encryption keys must be sent along with the bitstream, making designs vulnerable to leakage to an attacker with network access.

3. Devices that remain in the field for many years, such as those in long-life military, automotive, and IoT applications, are susceptible to physical attacks that can result in leaking the encryption key via side channels (e.g. power, timing) (111).

Figure 3-1. Key-based security-aware FPGA application mapping using the proposed bitstream obfuscation technique.

Consequently, there is a critical need for a robust, low-cost approach to FPGA bitstream security that goes beyond the protection offered by standard encryption techniques.

This chapter presents a novel, low-overhead FPGA bitstream obfuscation solution, which maintains mathematically provable robustness against major attacks. Central to this contribution is the identification of FPGA dark silicon, i.e., unused LUT memory already available in design-mapped FPGAs, which is exploited to achieve bitstream security. It helps to drastically reduce the overhead of the obfuscation mechanism. The approach does not introduce additional complexity in design verification and incurs a low performance and negligible power penalty. In particular, the mechanism proposed here permits the creation of logically varying architectures for an FPGA, so that there is a unique correspondence between a bitstream and the target FPGA. Fig. 3-1 shows a high-level overview of our approach. Compared to existing logic obfuscation techniques (112), we do not require design-time changes to the FPGA architecture or expensive on-chip public key cryptography. Note that in addition to obfuscation of design functionality, our approach also enables locking a particular bitstream to a specific FPGA device, helping to prevent piracy of the valuable IP blocks incorporated in a design. Therefore, it goes well beyond standard bitstream encryption in FPGA security. Furthermore, it is targeted to the protection of FPGA bitstreams, rather than hardware metering of integrated circuits (112). Finally, the procedure seamlessly integrates into existing CAD tool flows for programming FPGA devices. To our knowledge, this is the first comprehensive technique to protect against major attacks on FPGA bitstreams.

Figure 3-2. Security-aware mapping procedure for modern FPGA bitstreams. A) Mapping a secure key generator enables device-specific bitstream locking using a modified CAD synthesis flow. B) Empty space in each LUT is used to store obfuscation functions.

The chapter makes the following important contributions:

• We propose a novel, low-overhead key-based bitstream obfuscation technique for FPGAs, which can be used with off-the-shelf FPGA hardware. This approach leverages unused FPGA resources within an existing design to minimize the hardware overhead required for the obfuscation.

27 • We analyze the properties of unused FPGA resources for various (small (< 2000 LUT) to large (> 40000 LUT)) combinational and sequential benchmarks, demonstrating that this approach can be used with almost any design.

• We introduce a custom software tool for the proposed obfuscation process. This tool is fast, processing even large FPGA designs in seconds, and is integrated into a complete CAD flow for application mapping built around Quartus II, a commercial software package for mapping designs to FPGAs.

• We provide a detailed security analysis that shows the mathematical robustness of the approach against reverse engineering of a design using both brute-force and known design methods, as well as malicious design modification (e.g. hardware Trojan insertion). We also provide a comprehensive evaluation of the approach for diverse FPGA designs in terms of area, power, and performance.

The rest of the chapter is organized as follows. Section 3.2 describes the threats to FPGA bitstream security. Section 3.3 defines the concept of dark silicon in FPGAs with benchmark results on a modern FPGA. Section 3.4 explains how to exploit FPGA dark silicon for bitstream security and describes the software architecture, tool flow, and the proposed authentication protocol for remote reconfiguration. Section 3.5 gives a detailed security analysis and mapping overhead results on physical FPGA hardware. We conclude in Section 3.6.

3.2 FPGA Bitstream Security Issues

Our approach is targeted to protect FPGA-based systems against the following two major categories of attacks.

1. IP Piracy: For designs implemented in FPGAs, the attacker can obtain the IP by simply converting the unencrypted bitstream to a netlist (113). Attacks during wireless reconfiguration and through physical access to devices in the field can lead to resale of valuable IP cores and unauthorized use in third party products.

2. Targeted Malicious Modification (TMM): Once reverse engineered, insertion of a targeted malicious modification can give an attacker complete access, leading to reduced device lifetime, unreliable or unwanted functionality, decreased battery life, or sharing of private data. Note that we differentiate such targeted attacks from a random malicious modification, in which the attacker blindly modifies portions of the bitstream, intending primarily to render the bitstream non-functional; protecting against such random malicious modifications is outside the direct scope of our work.

In the first case, a remote attacker can intercept the communication between the vendor and the FPGA during remote upgrade. Decryption keys are commonly sent along with the encrypted bitstream to enable the target device to perform the reconfiguration. For devices in the field for many years, such as military or automotive systems, long-term physical access can additionally compromise system security, giving an attacker time to discover vulnerabilities in the security architecture.

In the second case, an attacker can maliciously modify a bitstream, inserting a Trojan to alter the functionality of the device. Even if the true functionality is unknown, it is sometimes possible to make a targeted malicious modification:

1. Unused Resources: This attack operates on unencrypted (or decrypted) bitstreams, inserting hardware Trojans using only public domain information (110). For a given bitstream, the attacker identifies sequences of zeros as unused resources into which the Trojan can be integrated alongside the original design.

2. Mapping Rule Extraction: Another attack that does not rely on a priori knowledge of the bitstream format is the known design attack. This is used to reverse engineer the bitstream format, which enables a targeted malicious modification, such as leaking the secret key from a cryptographic module (114). A number of functional variants are mapped to the device, and the attacker observes how the resultant bitstream changes, enabling the extraction of mapping rules. Once all mapping rules are determined, the knowledge can be used to both reverse-engineer and maliciously alter any bitstream generated for the same FPGA series.

3.3 FPGA Dark Silicon

The typical island-style FPGA architecture consists of an array of multi-input, single-output lookup tables (LUTs). Generally, LUTs of size n can be configured to implement any function of n variables, and require 2^n bits of storage for function responses. Programmable Interconnects (PIs) can be configured to connect LUTs to realize a given hardware design. Additional resources, including embedded memories, multipliers/DSP blocks, or hardened IP blocks, can be reached through the PI network and used in the design.

The nature of FPGA architecture requires that sufficient resources be available for the worst case. For example, some newer FPGAs may support 6-input functions, requiring 64 bits of storage for the LUT content. However, typical designs are more likely to use 5 or fewer inputs, while less frequently utilizing all 6. Note that each unused input results in a 50% decrease in the utilization of the available content bits. This leads to an effect that resembles dark silicon in multicore processors (115), where only a limited amount of silicon real estate and parallel processing can be used at a given time. To make this analogy explicit, we refer to the unused space in FPGAs as "FPGA dark silicon". Note that in spite of the nomenclature, the causes behind dark silicon in the two cases are different. For multicore processors, it is typically due to physical limitations or limited parallelism; for FPGAs, it is the reality of having sufficient resources available for the worst case, which may occur infrequently, if at all. This approach critically depends on the presence of FPGA dark silicon to be exploited for obfuscation needs.
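Stated as a formula (the symbols m and k are introduced here only for illustration: m is the number of inputs a mapped function actually uses and k is the physical LUT size):

\[
\text{fraction of content bits exercised} = \frac{2^m}{2^k} = 2^{m-k},
\]

so, for example, a 4-input function mapped to a 6-input LUT exercises only 16 of the 64 available content bits (25%).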

To characterize the extent of this unused space, benchmarks of varying size were mapped to a modern FPGA. Small combinational circuits (< 2000 LUTs) were taken from the MCNC benchmark suite (117), including alu4, apex2, apex4, ex5p, ex1010, misex3, pdc, seq, and spla. Large combinational circuits (3600 to 46000 LUTs) were taken from the EPFL Arithmetic Benchmark suite (118), and include div, hyp, log2, mult, sqrt, and square. Large IP cores (2300 to 41000 LUTs, and several hundreds to thousands of registers) were taken from various open source repositories, including OpenCores.org and GitHub, and include AES (119), AltOR32 (120), BTCMiner (121), JPEGE (122), and Salsa20 (123). All benchmarks were mapped to an Altera Cyclone V device (124). The Cyclone V contains two 6-input Adaptive LUTs (ALUTs) per Adaptive Logic Module (ALM), and 10 such ALMs per Logic Array Block (LAB).

The evaluation shows the availability of significant unused space across the diversity of benchmarks. Even for small combinational circuits (less than 2000 LUTs), roughly 50% of the LUTs mapped use 4 inputs or fewer, while 82% of the LUTs mapped use 5 inputs or fewer. The effect is more pronounced for large sequential benchmarks, where 69% of LUTs use 4 inputs or fewer, and 82% use 5 inputs or fewer.

Table 3-1. Cumulative percentage of 1 - 7 input LUTs (cumulative % of LUTs with n or fewer inputs)
Circuit    ≤2    3     4     5     6     7     Total LUTs
alu4       10.6  26.1  48.4  77.7  97.9  100   188
apex2      11.4  26.0  52.3  91.0  99.1  100   669
apex4      16.7  27.4  50.3  89.4  97.6  100   574
ex5p       41.0  42.1  58.7  84.5  98.4  100   373
ex1010     16.9  24.2  46.4  84.8  98.3  100   711
misex      14.0  27.7  46.9  84.0  97.5  100   480
pdc        16.3  28.5  51.9  77.7  98.4  100   1588
seq        16.6  51.9  51.9  89.1  99.0  100   727
spla       17.8  53.1  53.1  79.9  98.7  100   1509
Avg.       17.9  29.0  51.1  84.2  98.3  100   758
div        7.8   13.1  32.7  60.1  100   --    12.4k
hyp        0.9   28.8  42.6  64.0  100   --    45.3k
log2       7.0   17.2  39.5  59.7  99.0  100   7894
mult       2.5   25.0  50.5  59.0  99.2  100   5553
sqrt       5.8   5.0   43.5  84.5  100   --    3685
square     5.6   55.9  60.2  74.6  100   --    4066
Avg.       4.5   24.2  44.8  67.0  99.7  100   13.1k
AES        39.7  64.2  72.0  100   --    --    4112
AOR32      20.7  22.9  31.5  46.8  97.8  100   2299
BTCM       32.5  95.3  99.8  100   100   --    41.0k
JPEGE      45.2  37.6  48.4  67.0  99.4  100   5154
Salsa20    59.9  57.4  93.8  93.9  100   --    2836
Avg.       39.2  55.5  69.1  81.5  99.4  100   11.1k

This study of FPGA dark silicon is consistent with, albeit different from, existing studies. For example, previous research has shown that the point of minimum area utilization for a typical FPGA is achieved with less than 100% logic utilization (125) due to routing constraints. However, that work refers to utilization in terms of logic, i.e. ALM usage, rather than the percentage of content bits within the ALMs used to implement the design. Clearly, the number and type of LUTs can vary widely among specific designs (cf. Table 3-1). While increased content utilization can be a byproduct of more advanced algorithms, it is not necessarily the only objective, as designs can be optimized for other attributes (e.g. timing or power) during synthesis. Note also that the phenomenon of unused LUTs has been exploited in other contexts, e.g., for low-overhead insertion of scan flip flops in the context of Design for Testability (DfT) (116).

To quantify the role of dark silicon, we define a metric, the Occupancy of the FPGA, as the percentage of content bits used per LUT, divided by the total number of available bits in the LUTs which are used. We use the Cyclone V device architecture as a case study. In Eqn. 3-1, the number of n-input LUTs, #(LUT_n), is multiplied by the content bits used for that LUT (2^n); this value is divided by the LUT capacity 2^p times the total number of LUTs used; the variable p indicates the maximum number of LUT inputs, which in this case is 6. This yields the ALUT Occupancy. Next, the ALM Occupancy is computed in Eqn. 3-2 as the average number of ALUTs per ALM; in this case, ALM_MAX_CAP is 2. The LAB Occupancy is computed in Eqn. 3-3 as the average number of ALMs per LAB; LAB_MAX_CAP is 10 for the Cyclone V. Finally, the product of these three terms gives the overall occupancy (Eqn. 3-4), indicating the true percentage of fine-grained resource utilization at the content bit level for the given FPGA architecture.

\[
O_{\mathrm{ALUT}} = \frac{\sum_{n=1}^{p} \#(\mathrm{LUT}_n) \times 2^n}{\#(\mathrm{LUT}) \times 2^p} \tag{3-1}
\]

\[
O_{\mathrm{ALM}} = \frac{\#(\mathrm{ALUT})}{\text{ALM\_MAX\_CAP} \times \#(\mathrm{ALM})} \tag{3-2}
\]

\[
O_{\mathrm{LAB}} = \frac{\#(\mathrm{ALM})}{\text{LAB\_MAX\_CAP} \times \#(\mathrm{LAB})} \tag{3-3}
\]

\[
O_{\mathrm{Total}} = O_{\mathrm{ALUT}} \times O_{\mathrm{ALM}} \times O_{\mathrm{LAB}} \tag{3-4}
\]
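A small sketch of how Eqns. 3-1 through 3-4 could be evaluated from a mapping report is shown below; the input format (a histogram of used-LUT input counts plus ALUT/ALM/LAB totals) is an assumption made here for illustration, not the interface of the actual tool.

def occupancy(lut_hist, num_aluts, num_alms, num_labs,
              p=6, alm_max_cap=2, lab_max_cap=10):
    """Compute O_Total from O_ALUT, O_ALM, and O_LAB (Eqns. 3-1 to 3-4).

    lut_hist maps n (number of used LUT inputs) -> count of such LUTs.
    Defaults reflect the Cyclone V case study: 6-input ALUTs, 2 ALUTs/ALM, 10 ALMs/LAB.
    """
    total_luts = sum(lut_hist.values())
    used_bits = sum(count * 2**n for n, count in lut_hist.items())
    o_alut = used_bits / (total_luts * 2**p)              # Eqn. 3-1
    o_alm = num_aluts / (alm_max_cap * num_alms)          # Eqn. 3-2
    o_lab = num_alms / (lab_max_cap * num_labs)           # Eqn. 3-3
    return o_alut * o_alm * o_lab                         # Eqn. 3-4

# Example (made-up design): 100 four-input and 50 six-input LUTs in 80 ALMs and 10 LABs.
print(occupancy({4: 100, 6: 50}, num_aluts=150, num_alms=80, num_labs=10))  # -> 0.375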

We computed O_Total for a set of 9 combinational benchmark circuits (117) and found the average occupancy to be 26% ± 4%, leaving nearly three-quarters of the available content bits within the used LUTs empty. This same phenomenon extends to designs which require more resources, e.g. large arithmetic circuits (118), for which the occupancy is slightly higher (31% ± 4%), and the previously listed IP cores, for which the occupancy is significantly lower with higher variance (12% ± 8%).

3.4 Bitstream Protection Methodology

In this section, we describe the proposed bitstream protection methodology and its integration into the design flow. A simplistic approach to creating a keyed LUT may be to explicitly define two different LUTs and select between them using a key bit, e.g. with a decoder that has one key bit input. This approach will result in significant overhead, because it effectively creates an overlay, in which each "LUT" is in effect two device LUTs and a decoder. However, the FPGA "dark silicon" can be used to obfuscate designs and improve system security in modern FPGAs using significantly fewer resources than the overlay LUT approach.

3.4.1 Design Obfuscation

As previously described, most of the LUTs used to implement a given design do not require full utilization of the available memory bits. This leaves open spaces where additional function responses can be inserted to obfuscate the true functionality of the design, which in turn makes it more difficult for an adversary to make a Targeted Malicious Modification.

For example, consider a 3-input LUT, which contains 8 content bits, used to implement a 2-input function, Z = X ⊕ Y. A third input K can be added at either position 1, 2, or 3, leaving the original function in either the top or bottom half of the truth table, or interleaved with the obfuscation function. An example of this is shown in the 4-LUT design of Fig. 3-2(b), as well as in Table 3-2. In this case, the correct output is selected when K = 0; if K = 1, a response from the incorrect function (Z = X ∧ Y) is selected.

However, if it is not known that this truth table is obfuscated, the stored content could plausibly be read as any of three distinct three-input functions -- one for each possible key position shown in Table 3-2 -- with distinctly different responses. The security of this approach depends on the number of LUTs that are mapped for a given design; with more LUTs obfuscated in this manner, the security increases dramatically. For real-world designs, this is not likely to be a limitation, since designs will typically implement several hundred to several thousand device resources. Further analysis of this security is presented in Section 3.5.3.

Table 3-2. Example LUTs with 2 primary inputs and 1 key input. The true function is Z = X ⊕ Y, which is only selected when K = 0.
X Y K | Z      X K Y | Z      K X Y | Z
0 0 0 | 0      0 0 0 | 0      0 0 0 | 0
0 0 1 | 0      0 0 1 | 1      0 0 1 | 1
0 1 0 | 1      0 1 0 | 0      0 1 0 | 1
0 1 1 | 0      0 1 1 | 0      0 1 1 | 0
1 0 0 | 1      1 0 0 | 1      1 0 0 | 0
1 0 1 | 0      1 0 1 | 0      1 0 1 | 0
1 1 0 | 0      1 1 0 | 0      1 1 0 | 0
1 1 1 | 1      1 1 1 | 1      1 1 1 | 1
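For illustration, the keyed-LUT construction of Table 3-2 can be reproduced with a few lines of Python. This is a stand-alone sketch; the actual tool manipulates BLIF-level LUT contents rather than Python callables.

from itertools import product

def keyed_lut(true_fn, decoy_fn, key_pos, n_inputs=2):
    """Build the content bits of an obfuscated (n_inputs+1)-input LUT.

    The key input is spliced in at position key_pos (0-based); when the key
    bit is 0 the true function responds, otherwise the decoy responds.
    """
    content = {}
    for bits in product((0, 1), repeat=n_inputs + 1):
        key = bits[key_pos]
        data = bits[:key_pos] + bits[key_pos + 1:]   # the original primary inputs
        content[bits] = (true_fn if key == 0 else decoy_fn)(*data)
    return content

# Reproduce the first column of Table 3-2: inputs ordered (X, Y, K),
# true function X xor Y, decoy function X and Y.
table = keyed_lut(lambda x, y: x ^ y, lambda x, y: x & y, key_pos=2)
for row, z in sorted(table.items()):
    print(*row, z)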

3.4.2 Key Generation

The first step for the secure bitstream mapping is a low-overhead key generator, such as a nonlinear feedback shift register (NLFSR), which is resistant to cryptanalysis. A Physical Unclonable Function (PUF) can also be used; though this requires an additional enrollment stage for each device, it has the added benefit of not requiring key storage. Various PUF-based key generators have been proposed, including PUFKY (126), which are amenable to FPGA implementation. Furthermore, using a PUF-based key generator requires that FPGA vendor tools provide floorplanning and/or enable assignment to specific device resources for reproducibility. In general, we refer to the key generator as the system's CSPRNG, or cryptographically secure pseudorandom number generator. The specific CSPRNG used depends on the application requirements.
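As a toy illustration only -- the taps and nonlinear term below are arbitrary choices for the example and not the cryptanalysis-resistant parameters a real key generator would require -- an NLFSR-style keystream can be sketched as follows:

def nlfsr_keystream(state, length, width=16):
    """Toy nonlinear feedback shift register; feedback function chosen arbitrarily."""
    bits = []
    for _ in range(length):
        bits.append(state & 1)
        # Nonlinear feedback: XOR of two taps plus an AND term (illustrative only).
        fb = ((state >> 0) ^ (state >> 5) ^ ((state >> 3) & (state >> 9))) & 1
        state = (state >> 1) | (fb << (width - 1))
    return bits

key_bits = nlfsr_keystream(state=0xACE1, length=32)
print("".join(map(str, key_bits)))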

3.4.3 Initial Design Mapping

The second step is the synthesis of the HDL design into LUTs. This can be performed by freely available tools (e.g. ODIN II (127)); it is also possible to configure commercial tools, e.g. Altera Quartus II, by including specific commands in the project settings file (*.qsf) before compilation:

set_global_assignment -name INI_VARS "no_add_opts=on; opt_dont_use_mac=on; dump_blif_after_lut_map=on"
set_global_assignment -name AUTO_SHIFT_REGISTER_RECOGNITION OFF
set_global_assignment -name DSP_BLOCK_BALANCING "LOGIC ELEMENTS"
set_global_assignment -name IGNORE_CARRY_BUFFERS ON

Figure 3-3. Software flow leveraging FPGA dark silicon for design security through key-based obfuscation.

This generates a Berkeley Logic Interchange Format (BLIF) (117) file with technology-mapped LUTs. With Quartus Prime 15.1, the compilation will fail following analysis and synthesis, but the file will be in the project root directory.

Note that this also forces the mapper to implement certain functions in LUTs which would otherwise have used hardened IP blocks (such as a DSP block, FIFO, etc.). This is unlikely to have a significant impact on circuits that are purely combinational, but can significantly increase the LUT count for sequential circuits which use those resources. For example, the Cyclone V supports "arithmetic mode" LUTs; when converted to BLIF, addition operations are instead implemented using numerous LUTs, greatly increasing the size of the circuit.

3.4.4 Security-Aware Mapping

The security-aware mapping leverages FPGA dark silicon (Section 3.4.1) for key- based design obfuscation. The software flow is shown in Fig. 3-3. The tool is written in C# with .NET Framework 4.6. The following is a brief description of the processing stages. In the analysis stage, inputs include the BLIF design, as well as the maximum size of LUT supported by the target technology. The circuit is parsed, analyzed, and assembled into a hypergraph data structure. The analysis also determines the current occupancy. In the partitioning stage, inputs include the hypergraph data structure, as well as the key length. The hypergraph is partitioned into a set of subgraphs which share common inputs/outputs using a breadth-first traversal. Nodes are marked as belonging to a particular subgraph such that those with the greatest commonality are grouped into partitions. The number of partitions is directly proportional to the size of the key. During the obfuscation stage, for a device supporting k-input LUTs, every

LUT with at most (k − 1)-inputs is obfuscated by implementing a second function using the unoccupied LUT content bits. One additional input is added to the LUT which corresponds to the key bit used to select the correct half of the LUT during operation.

The second function can be either template-derived, such as basic logic operations (nand, nor, xor, etc.), or functions implemented in other LUTs in the same design. This has the benefit of not introducing obvious randomness, which could potentially be tested for in larger LUTs with statistical analysis. During the optimization stage, individual LUTs are optimized using the Espresso Logic Minimizer (128). Each BLIF-LUT is converted into a PLA representation, which is input into a dynamically launched Espresso process.

The stdin and stdout streams are routed to the mapping application. The optimized Espresso output is converted back into the internal representation. This process significantly reduces both the output file size, as well as eventual compilation time in

Figure 3-4. Remote upgrade of secure and obfuscated bitstreams.

the FPGA mapping tool. Because the correct key bit is not known to the Espresso tool, the LUT can be optimized, but the obfuscation logic will not be removed. During the output generation stage, the output file generation can take one of two formats: either structural, which implements the circuit as a series of assignment statements, or as a device-specific code file using native LUT primitive functions. The second option is preferred because using low-level primitives ensures that the design will be mapped with the specified LUTs. In this tool, these primitives are specific to Altera devices, though other low-level macros are available in devices as well (129).

Nevertheless, using the structural Verilog output has the benefit of being portable to other commercial tool flows, and requires significantly less compilation time. Finally, a report is generated which details the number of LUTs per partition, the original occupancy, new occupancy, and Levenshtein distance between original and secured versions of the LUT content bits.

The number of LUTs per partition is an especially important metric, as it has a direct impact on both the overhead and the level of security. Furthermore, the partitioning and sharing of key bits need to be done judiciously, as a random assignment can potentially dramatically increase area overhead (see Section 5.4.1). Thus, key sharing, when paired with the LUT output generation, is intended to (a) reduce overhead, and (b) strongly suggest to the physical placement and routing algorithms used by the commercial mapping tool to group certain LUTs in a given ALM and/or LAB, and thus minimize area overhead. Ideally, this process could be integrated into a commercial tool itself to enable technology-dependent optimizations.
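To illustrate the optimization stage described above, the following Python sketch pipes a single LUT through an external Espresso process in PLA form. It assumes an espresso binary is available on the PATH; it is not the C# implementation used in the actual tool.

import subprocess

def minimize_lut(num_inputs, on_set_cubes):
    """Minimize one single-output LUT; on_set_cubes lists input cubes (e.g. '01-') whose output is 1."""
    pla = [".i %d" % num_inputs, ".o 1"] + ["%s 1" % c for c in on_set_cubes] + [".e"]
    result = subprocess.run(["espresso"], input="\n".join(pla) + "\n",
                            capture_output=True, text=True, check=True)
    # Keep only the minimized cube lines, dropping Espresso's dot-directives.
    return [line for line in result.stdout.splitlines()
            if line and not line.startswith(".")]

print(minimize_lut(3, ["010", "100", "111"]))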

3.4.5 Communication Protocol and Usage Model

The security-aware mapping procedure creates a one-to-one association between the hardware design and a specific FPGA device, since selection of the correct LUT function responses depends on the CSPRNG output. This means that OEMs must have one unique bitstream for each key in their device database. Therefore, it is critical that the correct bitstream is used with the correct device. Modern FPGAs contain device

IDs which can be used for this purpose; alternatively, if a PUF is used as the CSPRNG, the ID can be based on the PUF response. Using existing FPGA mapping software, generating a large number of bitstreams will take considerable time; however, with modifications to the CAD tools, the security-aware mapping can be done just prior to bitstream generation, so that the design does not need to be rerouted.

The initial device programming, prior to distribution in-field, may be done by a

(potentially untrusted) third party. The third party is able to read the device ID, but does not require access to the key database. Similarly, device testers do not need access to the key, merely the ability to read the ID. This allows OEMs to keep the ID/key relation secret. Once the device is in field, the remote upgrade procedure differs slightly from the initial in-house programming. The typical upgrade flow is shown in Fig. 3-4. After finalizing the updated hardware design, it is synthesized using the security-aware

mapping procedure. Target devices are queried to retrieve the FPGA ID; if the device supports encryption, the bitstream can be encrypted. Next, the bitstream is transmitted to the device, and the device reconfigures itself using its built-in reconfiguration logic.

Table 3-3. Original and secure mapping results for small combinational benchmarks
                    Original Mapping*                               Secured Mapping*
Name     2    3    4    5    6    7   W     X     Y    Z       2    3    4    5    6    7   W     X     Y    Z
alu4     18   31   42   55   38   4   185   115   14   30.3    12   17   38   61   63   0   191   128   16   33.1
apex2    74   100  176  259  54   6   664   336   49   25.6    24   93   119  243  308  6   793   594   87   27.5
apex4    78   79   132  224  47   14  569   325   45   24.5    29   59   159  171  143  0   562   388   57   24.3
ex5p     96   61   62   96   52   6   371   192   26   25.6    38   33   89   112  93   2   367   245   37   24.3
ex1010   110  62   158  273  96   12  704   418   57   26.2    11   116  106  191  211  1   636   464   66   26.5
misex3   54   79   92   178  65   12  475   271   32   31.5    24   46   105  134  122  1   432   283   37   30.2
pdc      241  212  371  410  328  26  1585  978   140  24.7    56   225  374  625  685  17  2002  1462  228  25.0
seq      93   125  159  271  72   7   723   396   53   25.9    38   73   194  248  153  2   708   465   61   27.8
spla     237  204  360  405  284  19  1502  916   130  24.4    46   234  310  543  663  14  1810  1339  206  25.7
Avg.     111  106  172  241  115  12  758   439   61   26.5    28   102  166  259  271  5   833   596   88   27.2
* W, X, Y, and Z are the total ALUTs, ALMs, LABs, and Total Occupancy (%), respectively. Columns labeled 2-7 are the total number of LUTs into which that size function has been mapped.

3.5 Overhead and Security Analysis

In this section, we describe the experimental setup, present the hardware overhead, and analyze the level of security.

3.5.1 Experimental Setup

To obtain area, power, and latency overhead results, BLIF files were generated following the procedure described in Section 3.4.3 for an Altera Cyclone V device. Note that BLIF files cannot be generated if any encrypted IP cores are used in the design.

Moreover, they do not support certain hardware elements (e.g. asynchronous reset), so any design containing these signals will not be functionally equivalent to the original implementation. The Quartus mapper will produce a warning if this occurs during BLIF generation. Therefore, the experiments are limited to those files without the offending signals or encrypted IP. Additionally, for designs which utilize hardened IP blocks within the

FPGA, such as adders or shift registers, these resources are instead implemented in

LUTs when written to BLIF. We refer to this as the intermediate representation. Note

that this can result in significant overhead when mapped back to the FPGA, even before undergoing the security-aware mapping procedure. The effect is not seen in

purely combinational circuits, and therefore the intermediate overhead values are not

listed. However, for larger ``IP'' cores mapped to FPGA, the effect is significant, so the

intermediate overhead values are reported; this represents the overhead due to the conversion from the original HDL code to a technology mapped BLIF file. For these

cores, the security-aware mapping overhead should be compared to the intermediate

mapping results, and not the original mapping.

Once the BLIF was generated, the circuits were then processed by the secure

mapping tool to create obfuscated LUTs, and then written to file using the original (unsecured) and mapped (obfuscated) Verilog outputs. For the obfuscation, random

functions were used. Both outputs, along with the original benchmarks, were simulated

using Altera ModelSim and the same test vectors. All benchmarks were found to be

functionally equivalent when the correct key was provided to the obfuscated versions; incorrect keys resulted in different output, as expected. The Verilog files were then

mapped to the same Cyclone V device, from which we obtained the power (estimated

using PowerPlay Power Analyzer), performance (estimated using TimeQuest Timing

Analyzer), and area (obtained from the compilation report). These results were then

compared with the other mapping results to find the overhead.

3.5.2 Overhead Analysis

Table 3-3 lists the initial (pre-obfuscation) and resulting (post-obfuscation) technology

mappings for LUT usage by function input, the number of ALUTs, ALMs, LABs, and

the Occupancy, as computed by Eqn. 3--4. Note that in both mappings, the number of ALUTs (Column W) may differ slightly from the sum of LUTs in that row; this is because the number of 1 to 7 input LUTs is obtained from post-synthesis results, whereas the total number of ALUTs is post-fit. Both sets of results are shown for comparison purposes.

The addition of a single key bit input to LUTs 1 to 5 resulted in an increase of ALUT Occupancy from an average 41% to 54%. However, the ALM Occupancy decreased,

from 88% to 72%. LAB Occupancy also decreased, though less significantly, from 74%

to 71%. Therefore, while the ALUT Occupancy did increase, the Overall Occupancy

remained nearly equivalent (Table 3-3). The decrease in ALM and LAB Occupancy also implies a moderate area overhead, which in terms of ALUTs was under 10% for the combinational benchmarks. In terms of ALMs, however, the area overhead was higher, roughly 36%. Less dense packing also implies increased routing/interconnect delay, which manifests as an average 1.2x reduction in fmax . However, the effect on power consumption was very low, requiring an average 1.02x more power for the secured version.

For larger designs, BLIF conversion was possible for 3 of the 5 ``IP'' cores, specifically AES, Salsa20, and the AltOR32. The version of AES mapped here differs from the full encryption core presented in the dark silicon study, in that it utilizes a number of 6-input LUTs, rather than embedded memory blocks, to hold the AES substitution box (SBOX) values. This allows the tool to produce the BLIF representation, since no encrypted IP (i.e. the memory controller) is required. Similarly, we use a ``Lite'' version of the AltOR32 which does not instantiate memories for instruction and data caches, and refer to this as ``AOR32-L''. Results for pre- and post-secure mapping for these three cores are shown in Table 3-4. Compared to the purely combinational circuits, we observed larger percent

increases in the area overhead, even when comparing to the intermediate mapping result

(Section 3.5.1). For the AES core, this was 25%, 21%, and 18% for ALUTs, ALMs,

and LABs, respectively. Unlike the combinational benchmarks, AES had only a 6%

reduction in fmax , and 2% increase in power consumption, similar to the combinational circuits. Salsa20 and AOR32-L both had significantly higher area overhead (e.g. 54%

Table 3-4. Original, intermediate, and secure mapping results for three large IP blocks
IP        Mapping     2¹    3     4     5     6     7    W²    X     Y    Z     fmax (MHz)  Power (mW)
AES       Original    181   78    53    66    905   22   1306  1117  153  32.5  168.8       545.8
          Intermed.³  169   76    58    76    914   33   1326  1139  154  33.1  167.8       545.8
          Secured     29    170   214   152   1067  21   1656  1376  182  34.1  158.3       555.5
Salsa20   Original    1620  8     1031  5     171   1    2836  1656  224  11.9  152.6       550.0
          Intermed.   916   658   507   1254  1381  46   4463  2929  367  31.6  112.0       549.4
          Secured     289   732   1434  1255  3156  52   6869  5023  642  33.5  100.4       560.0
AOR32-L   Original    199   141   153   190   973   65   1624  1424  204  29.0  94.7        531.1
          Intermed.   337   234   221   355   995   64   2211  1616  244  29.8  82.6        528.8
          Secured     421   253   1091  518   1315  87   3686  2496  326  30.2  74.0        538.1
¹ Columns labeled 2..7 represent the total number of LUTs into which that size function has been mapped.
² W, X, Y, and Z represent the total number of ALUTs, ALMs, LABs, and Total Occupancy (%), respectively.
³ Intermed. results show the overhead due to the HDL to BLIF conversion; this gives ``Secured'' a fair comparison.

more ALUTs in Salsa20 and 67% more ALUTs in AOR32-L), and 12% reduction in fmax for both (Table 3-4). We believe that the larger increase in ALUT usage for the large IP cores compared to the small combinational circuits is due primarily to the conversion of arithmetic mode

ALUTs, which use two 4-input LUTs with dedicated hardware full adders (124) to normal

mode ALUTs with adders realized in numerous other LUTs. This was not an issue in

the combinational benchmarks, as they used only normal or occasionally extended mode ALUTs. Based on these results, we believe that tighter integration of the secure

mapping process with the mapping tool would yield further improved occupancy, full

utilization of hard IP resources within the FPGA, and lower latency overhead from

additional routing/interconnect delay:

• The relatively low routing resource utilization strongly implies that the reduction in ALM and LAB occupancy is not entirely due to routability constraints.

• The current hypergraph partitioning stage can only make suggestions to the placement tool by grouping certain nodes; ultimately this cannot be directly controlled unless the two tools are tightly integrated.

• The conversion from original design to BLIF and back can incur significant overhead and running the secure mapping process on the FPGA mapping tool's internal circuit representation could avoid this issue entirely.

3.5.3 Security Analysis

For security analysis, we assume the attacker intends to reverse engineer the design or perform malicious modification and reprogram the device.

3.5.3.1 Brute force attack

A brute force attack represents the most challenging and time-consuming attack. The basic requirements of the brute force attack on the obfuscated bitstream differ from a typical cipher, because the attacker needs not only the bitstream (and a means to apply various keys) but also knowledge of the bitstream structure and the test patterns with known responses. Knowledge of the bitstream structure is required to identify LUT content bits and LUT interconnection through FPGA routing. Therefore, if a design has 128 LUTs and a 128-bit key is used, the attacker must try 2^128 combinations to determine the key and reverse engineer the bitstream. If bitstream encryption was used, the attacker must break the bitstream encryption and the security-aware mapping before reverse engineering the bitstream. Furthermore, a larger number of LUTs (e.g. >1000) are typical for many designs, leading to a much larger search space.

If the attacker does not know the entire bitstream format (e.g. they can identify the location of the LUT content bits, but not how LUTs are connected), and therefore does not know which input represents the key bit, the search space increases dramatically.

Finding the location of the LUT content bits is feasible through template attacks (as described earlier). The difficulty lies in the number of potential combinations of the various LUT content bit organizations, which can be counted as follows:

1. For each k-input LUT, the ordering is important; this contributes a factor of k!

2. For each LUT, it is necessary to select the correct half, depending on if the value of the key bit is ``0'' or ``1'', multiplying this result by 2.

3. These factors are multiplied by the number of LUTs in the design.

However, this is not the complete picture. Recall that some fraction of LUTs which currently have 6 inputs are in fact 5-input LUTs with 1 additional key bit. The remaining

6-input LUTs did not receive a key bit, because they were already at maximum occupancy for the given technology. This requires the attacker to differentiate between keyed and un-keyed maximally-occupied LUTs. We denote the total number of k-input

LUTs as #(LUT)_k, and the number of keyed (5+1)-input LUTs as #(LUT)_{k-1}, which multiply the previous factors to yield Eqn. 3--5.

TC = Σ_{k=2}^{N} C(#(LUT)_k, #(LUT)_{k-1}) × 2 × k! × #(LUT)_k    (3--5)

where C(n, m) denotes the binomial coefficient ``n choose m''. It follows that an attacker will have significantly more difficulty reverse engineering the design when the complete bitstream format is not known and the design is obfuscated than when using template attacks to determine the format of an unobfuscated bitstream. Even if the format is known, the previous analysis shows that the difficulty of brute forcing the key depends on the key length, as long as there are sufficient LUTs on which to apply a key bit input (e.g. ≥ 128 LUTs).
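For a sense of scale, the fragment below evaluates Eqn. 3--5 in Python for a made-up distribution of LUT sizes; the counts are illustrative and are not taken from the benchmarks in Table 3-3.

from math import comb, factorial

total_luts = {2: 20, 3: 30, 4: 50, 5: 60, 6: 70}   # #(LUT)_k per LUT size k
keyed_luts = {2: 8, 3: 15, 4: 25, 5: 30, 6: 40}    # keyed (k-1)+1 input LUTs among them

TC = sum(comb(total_luts[k], keyed_luts[k]) * 2 * factorial(k) * total_luts[k]
         for k in total_luts)
print("TC spans roughly %d bits" % TC.bit_length())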

3.5.3.2 Known design and bitstream tampering attacks

A known design attack can enable an attacker to reverse engineer the bitstream format, and potentially the IP, due to insufficient protection offered by bitstream encryption. This may lead to not only IP piracy, but also malicious modification and unauthorized system reprogramming with the tampered bitstream.

A moving target defense is known to provide robust protection against known design attacks, since they rely on an underlying assumption of consistency between subsequent trials -- in this case, compilations of the bitstream. For FPGA, this means assuming that the meaning of a particular bit does not change over time. Using the proposed technique, we note several aspects that do change between compilations, violating the assumption of consistency and making known design attacks impractical:

• If a strong PUF is used as a key generator, then the key will change each time the design is compiled by issuing a different challenge vector. If an alternative CSPRNG is used, a different seed can be used each time to generate a different key sequence. Using a different key will produce drastically differing bitstreams.

• The location of the key input to each LUT also changes each time, leading to a number of different configurations of the content bits equal to the number of inputs. This information is encoded in the routing bits, and therefore its security is independent of the key generator.

• The second function mapped into spare LUT content bits also changes each time. Thus, not only do half the content bits change each time, they are also permuted in different ways. This also does not depend on the key generator, providing another independent layer of security.

In summary, the key, the LUT content bits, and their ordering will mutate each time the design undergoes the security-aware mapping procedure. This approach therefore provides robust protection against known design attacks. This also prevents targeted malicious modifications, because the meaning of individual bits changes during each compilation.

3.6 Summary

This chapter has presented a novel, low-overhead design obfuscation technique aimed at securing FPGA bitstreams. The approach does not require any modification in

FPGA architecture and hence can be readily used with existing FPGA devices -- both before in-field deployment, and subsequently via standard remote upgrade procedures.

It is therefore attractive for both FPGA vendors as well as system designers, who pursue system integration with FPGA devices already in the market. Moreover, the tool flow can seamlessly integrate with commercial FPGA mapping tools, such as Altera

Quartus II. While the process does minimally affect the design optimization process for FPGA, and hence incurs modest performance and negligible power overhead, it provides mathematically robust security against several major threats to FPGA-based systems, which are not fully protected against by traditional bitstream encryption ap- proaches. Nevertheless, it can still be used in conjunction with encryption for additional security. The technique capitalizes on the unoccupied space in an FPGA's lookup tables -- which we call the FPGA's dark silicon, making area-efficient use of existing

resources, rather than consuming a large number of additional logic elements for security enhancement. The approach is scalable to larger designs. Future work will focus on refining the tool to further reduce overhead, as well as on built-in support for outputting low-level primitives for different FPGA platforms.

CHAPTER 4
SECURITY FOR NEXT-GENERATION FPGA DEVICES

Chapter 3 outlined some of the major issues facing modern FPGAs in terms of ensuring device and system security, and protecting synthesized intellectual property in FPGA configuration files. As previously mentioned, the issue of protecting designs in the configuration goes beyond FPGAs, and speaks to a serious security concern in any reconfigurable device. The presented solution -- obfuscating the design, using empty space so as to reduce overhead due to obfuscation -- provided a convenient method for securing designs on modern FPGA devices. If, however, the FPGA can be redesigned, while maintaining the spatial-only computing paradigm central to its operation, what circuitry would have to be added, and how would that impact chip area, power, and the design mapping process? This chapter aims to answer these questions, providing architectural details for physical changes to the FPGA fabric that can, in effect, make every device unique, while providing insight into the specific software modifications required to the traditional FPGA synthesis flow. This chapter previously appeared* in the 2017 Asia South Pacific Design Automation Conference (ASP-DAC).

4.1 Background

Recent years have seen a rapid proliferation in the use of Field Programmable Gate Arrays (FPGAs) in diverse domains, including automotive, defense, networking, health care, and consumer electronics. For many devices, a key requirement is the need for in-field hardware reconfigurability to adapt to changing requirements in functionality, energy-efficiency, and security. FPGAs have emerged as a popular electronic component for addressing this reconfigurability demand (107), as they provide high flexibility compared to custom ASICs, while entailing significantly higher energy-efficiency and

* R. Karam, T. Hoque, S. Ray, M. Tehranipoor, and S. Bhunia. ``MUTARCH: Architectural diversity for FPGA device and IP security,'' in Proceedings of the 2017 Asia South Pacific Design Automation Conference (ASP-DAC), 2017.

performance than designs based on firmware/software running in processors. FPGAs often provide significant benefits in real-time performance, making them attractive as

hardware accelerators. Furthermore, FPGA-based designs are known to be more secure

than both ASIC and processor against supply-chain attacks, since design details

are not exposed to untrusted foundries or design houses. However, the FPGA configuration file, also called the bitstream, is susceptible

to a variety of attacks, which can potentially lead to unauthorized reprogramming,

reverse-engineering, and/or piracy of the intellectual property (IP). Modern high-end

FPGA devices often include on-board decryption hardware, allowing for some measure

of security; however, encrypted bitstreams are generally transmitted along with the decryption key, which creates a significant vulnerability. Furthermore, even dedicated

decryption hardware can incur significant hardware overhead for area and energy-

constrained systems, e.g. Internet-of-Things (IoT) edge devices. Mathematically,

encryption algorithms are known to be highly secure against brute-force attacks. However, in many cases, attackers can have physical access, and most on-board

encryption techniques are susceptible to side-channel attacks, e.g. by key extraction through power profile signatures (107; 111; 130).

Unless additional countermeasures are in place, an adversary can convert the

bitstream to a netlist (113), enabling targeted malicious modifications (e.g. Trojan insertion) as well as IP piracy. The conversion step may not be necessary for Trojan insertion; techniques such as Unused Resource Utilization (110), which inserts Trojans

in empty spaces in the configuration file, and Mapping Rule Extraction (114), a type of

known design attack, can be mounted on a bitstream. Alternatively, if the hardware itself

is cloned (131), a pirated bitstream could be used with counterfeit hardware. Such attacks are made possible by the fact that all FPGAs of a given family have

physically identical architectures. In other words, the decrypted bitstream taken from

one specific product can just as easily be mapped as a blackbox to another (identical)

product. Similarly, a maliciously modified bitstream that is successfully deployed on one system can be mapped to another. This is analogous to a computer virus (the maliciously-modified bitstream) which infects a particular version of an operating system

(the FPGA), and can propagate to other computers with the same OS (FPGA) because program execution (the architecture) is identical. For products which are intended to remain in the field for a long time (10-30 years), such as automotive systems, an attacker could feasibly modify the configuration of a safety-critical FPGA, and deploy this to other vehicles of the same year, make, and model. Therefore, though encryption offers a security layer, when used alone, it is not sufficient to protect devices with long in-field lifetime, especially those with network connectivity.

Table 4-1. Properties and qualitative comparison of physical and logical keys
Key Type      Time Var.  Storage  Area Ovhd.  In-field Upgrades  Known Design  Destructive RE
Physical (P)  No         Fuses    Low         Not Secure         Weak          Strong
Logical (L)   Yes        Runtime  Mod         Secure             Strong        Weak
Combined      Yes        Mixed    High        Secure             Strong        Strong

This chapter presents MUTARCH, a novel architecture with associated CAD technology for FPGA security which provides provably robust protection against in-field bitstream reprogramming and IP piracy. Figure 4-1 illustrates the overall approach. MUTARCH

permits wireless reconfiguration that does not rely on encryption, though it can still be used in conjunction as an extra layer of defense when available. The proposed approach presents a fundamental departure from existing protection approaches that rely on cryptographic techniques. It is rooted in the idea of ``security through diversity'', where each FPGA device will have a unique architecture, despite being manufactured with existing processes and techniques, that can additionally mutate over time.

This supplies robust protection against brute force attack, as well as security against known design attacks through a moving target defense. This also results in a unique bitstream-to-device mapping which has several major benefits: (1) device identification is an intrinsic requirement, which ensures that only valid devices receive upgrades;

Table 4-2. Key allocation for various FPGA resources
Architectural Level      Physical Configuration  Logical Configuration
Output Inversion         No                      Yes
LUT Input Reordering     Yes                     Yes
LUT Content Inversion    No                      Yes
Switch Box Config. Bits  Yes                     Yes
Mux Selection Bits       No                      Yes

(2) upgrades sent to unauthorized or counterfeit devices will not function because the bitstream cannot be mapped correctly, mitigating an attacker's economic motivation for device cloning, and preventing reverse engineering of valuable IP blocks; and (3) mali- ciously modified bitstreams cannot be mapped to another device, so that breaking one device does not put others at risk. It is distinct from existing logic encryption or hardware metering techniques (112) because: (1) it applies to FPGA bitstreams rather than an

ASIC design; (2) it allows changing the architecture over time; (3) it is integrated into the

FPGA's reconfigurable fabric and the application mapping tool flow; and (4) it does not require any expensive on-chip resource, e.g. support for public key cryptography. In particular, the chapter (1) investigates the concept of mutable FPGA architectural fabric for the purposes of device and IP security, including efficient hardware modifications to enable unique and time-variant mappings, as well as the communication protocol for remote in-field upgrades; (2) presents a detailed security analysis for the proposed approach, considering all possible attack models; and (3) demonstrates the viability of this approach using a complete CAD framework that we have developed based on a widely-used open-source FPGA mapping tool called VTR (132) and evaluates bitstream security as well as overhead for a set of common benchmark circuits.

4.2 FPGA Hardware Security

In this section, we describe the proposed mutable FPGA architecture in detail, including the static and time-variant architecture configuration layers. Next we present

Figure 4-1. Overview of the time-varying FPGA architecture. A) A two-part (logical and physical) key is used to perform a device-specific obfuscation transform. B) The secure bitstream is mapped to the appropriate FPGA. Resources such as C) switch/connection boxes and D) LUTs are augmented with logic which implements the inverse transform. This ensures that bitstreams mapped to unauthorized devices will be nonfunctional. Because the logical key is time-varying, the architecture is mutable and thus prevents known design attacks.

the secure FPGA mapper, which guarantees functional correctness when mapping a design.

4.2.1 Mutable FPGA Architecture

The security of the MUTARCH architecture arises from two separate ``layers''

-- one physical, and the other logical -- with one configuration key for each layer.

Qualitatively, these keys differ in terms of time variance, storage, overhead, and resilience to various attacks, as shown in Table 4-1. Together, these keys can also be considered as the FPGA's unique architectural configuration (Fig. 4-1), because they are used as input to the secure bitstream transform process. Hence, prior to upgrading, the device must be identified to ensure that the correct bitstream is sent to a system.

The actual bitstream transformation can therefore occur towards the back-end of a vendor's tool flow (e.g. after place & route, but before bitstream generation), reducing overall compilation time. Because of the unique bitstream-to-device association, device authentication is necessary for the process. The design flow for MUTARCH is illustrated in Fig. 4-2.

A typical mapping function F(S(K, B), I(K, B)) operates on disjoint subsets of the key K and bitstream B to modify the bitstream in such a way that, when mapped

to the target FPGA, the internal logic implements an appropriate inverse transform,

resulting in a functionally correct mapping. This is appropriate for modern FPGAs, in

which bitstreams are generated such that they implement the desired functionality for a given device architecture. However, rather than a bitstream functioning on all devices

of a given family, it will only work on one specific device. Therefore, it creates a unique

bitstream-to-device association to prevent piracy of the IP mapped to the FPGA.

Given the highly flexible nature of FPGAs, there are many internal components

which can be subject to physical configuration that, when changed from device to device, would represent a unique device architecture. A subset of these components

is listed in Table 4-2. It gives a designer the flexibility to balance overhead with the

required level of security by implementing changes in as few or as many separate

structures as needed. Additional related details are given in the case study (Section 4.4). Note that the physical key should not be allocated to the output inversion or the multiplexer select lines, which would induce a static change to a bitstream and therefore can be exploited by an attacker to gain knowledge of the architecture. Conversely, the logical key can be applied to any of the listed resources because it changes each reprogram cycle.

4.2.1.1 Physical layer

The first of two security layers is based on physical architectural modifications to

the underlying FPGA fabric. This layer is comprised of a network of fuses judiciously

placed on different configurable components and programmed after fabrication using

techniques commonly applied for defect and fault tolerance. This programming should be performed by the manufacturer and not at the fabrication facility, rendering it less

susceptible to supply chain attacks. In addition, because each FPGA must eventually

Figure 4-2. Overall design flow for MUTARCH FPGAs: A) Device identification enables the system designer to obtain device configuration keys; B) The unique bitstream for a specific device is used in the upgrade process.

be programmed with its vendor's specific toolset, the physical modification prevents the fabrication facility from overproduction or cloning attacks at the foundry.

As a concrete example of the physical security layer, consider the inputs to a given lookup table (LUT). Inputs can come from other LUTs in the design, and both the values of the inputs and their order are crucial to proper functionality. By inserting a switch network whose inputs can be programmed by a fuse, the order of inputs to a given LUT can be permanently modified (Fig. 4-1, right). This must be factored in during bitstream generation, so that proper functionality is preserved.
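The effect of such a fuse-programmed switch network can be modeled in a few lines of Python; this is only a behavioral sketch of the idea in Fig. 4-1, not the hardware or the mapper implementation.

def permute_lut_inputs(content, perm):
    """Reindex LUT content bits when logical input i is routed to physical input perm[i]."""
    n = len(perm)
    permuted = [0] * len(content)
    for idx in range(len(content)):
        bits = [(idx >> (n - 1 - i)) & 1 for i in range(n)]          # logical input values
        phys = sum(bits[i] << (n - 1 - perm[i]) for i in range(n))   # physical address
        permuted[phys] = content[idx]
    return permuted

# The mapper must apply the same (device-unique) permutation when generating the
# bitstream, otherwise the function realized on that device is wrong.
print(permute_lut_inputs([0, 0, 1, 0, 1, 0, 0, 1], [2, 0, 1]))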

4.2.1.2 Logical layer

The second security layer is based on in-field architectural modifications rather than physical changes. This enables the underlying FPGA fabric to effectively change with time, making it mutable during its lifetime. This property of time-variance is

essential to security against known design attacks. This layer is realized through the run-time configuration of permutation and inversion networks which modify the functions

mapped to the FPGA (e.g. the lookup table contents) and how the LUTs are connected

together. This layer takes as input a subset of the bitstream, as well as a key that is

generated internally using, for example, a cryptographically secure pseudorandom number generator (CSPRNG) such as PUFKY (126). PUFKY is attractive because it

leverages Physical Unclonable Functions (PUFs) for challenge/response-based seeding

of the CSPRNG, which fits well with the desired properties for a remotely reconfigurable

and upgradeable system (i.e. enables device identification through unique signatures).

Just like in the physical layer, the logical network requires that the bitstream be modified during the vendor tool flow, applying the transform to various structures,

such as the LUT content, connection and switch boxes, and other supported FPGA

resources. Because known design attacks require a large number of mappings of

designs to a device, changing the logical key during each reprogramming cycle presents a robust moving target defense.

4.2.2 Secure FPGA Mapper

To evaluate MUTARCH comprehensively, we have implemented a complete

secure FPGA mapping tool based on VTR (132), a popular academic tool for FPGA architecture research. An architecture description file implementing a variable number of 4-input LUTs is used for mapping a series of benchmarks. The VTR tool takes as input a file in Verilog HDL, parses and decomposes the circuit into appropriately sized LUTs (as defined by the architecture description file), and then with the Versatile Place and Route

(VPR) tool (132), performs packing, placement, and routing.

We modified VPR as shown in Fig. 4-3 to perform the function described in Algorithm 4.1. This function takes as input a design, a pre-determined physical key, and

the seed for the logical key. The number of inputs (num_inputs) and the original truth

table (tt) are calculated. Next, a number of bits equal to the LUT size are obtained from

Figure 4-3. Secure FPGA mapping, with modifications to the VTR mapping flow denoted by the shaded box.

the CSPRNG, after which the tt undergoes addition with the key modulo 2, followed

by a bitwise permutation as defined by first the physical key, then the logical key. Next,

the current LUT's output inversion status (OI_Status) is set based on the logical key, and the status of the output inversion for all fan-in nodes is checked. This affects the permutation of the current LUT's content, and the current tt must undergo one last output inversion transform (oiXform) defined by the fan-in output inversion status

(FIOIS).

This function results in two bitstreams defined by their LUT content bits. It allows us to determine the level of security, as determined by the Hamming Distance between the original and secure bitstream, and the difference in pairwise intra-bitstream distance.

The mapping tool additionally modifies the existing Verilog writing functionality in VTR to enable functional simulation. The existing LUT primitives, defined in Verilog, are modified to support the bitstream transforms through the addition of LUT input switch networks, and XOR gates on the LUT content bits and output bit. The top level module generation code, which instantiates the CLBs and interconnects, is modified as well, to support input of the physical and logical key networks. A test bench is also generated by

55 the tool which instantiates both the original and secure FPGA mapping, generates the test patterns from known input patterns for the benchmark circuits, and finally compares the output of both the original and secure mappings.

4.2.3 Correctness of Mapped Design

An important aspect of the FPGA synthesis process is to ensure that the mapped design is functionally correct. In the proposed scenario, correctness is guaranteed by construction. The mapper tool is cognizant of the architecture (both physical and logical) of the target device as defined by the configuration key. Device-specific modifications in the bitstream are done by the mapper tool in such a way that they correspond to the specific architectural mutations. For example, if the order of LUT inputs is changed, the interconnect bits in the bitstream are correspondingly reordered for a functionally correct mapping.

Algorithm 4.1. Secure Bitstream Transform

Input: Circuit C, Physical Key Kp, Logical Key Seed Ks
Output: Original Bitstream Bo, Secure Bitstream Bs

InitCSPRNG(Ks)
for each Block B in C do
    for each Primitive P in B do
        if P is of type LUT then
            FIOIS ← 0
            numInputs ← getLUTinputs(P)
            tt ← getTruthTable(numInputs, P)
            Bo ← append(Bo, tt)
            subKey ← getNextKey(1 << numInputs)
            tt ← physicalXform(tt, Kp)
            tt ← logicalXform(tt, subKey)
            OI_Status ← oiXform(P, subKey)
            for each fan-in fi of P do
                FIOIS ← FIOIS | getStatus(fi)
            end for
            tt ← oiXform(tt, FIOIS)
            Bs ← append(Bs, tt)
        end if
    end for
end for
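A simplified software model of the per-LUT portion of Algorithm 4.1 is given below in Python. It collapses the physical and logical transforms into a bit permutation, an XOR with the sub-key, and an optional output inversion, and it omits the fan-in bookkeeping; the real transform runs inside the modified VPR flow.

def secure_lut_transform(tt, physical_perm, sub_key, invert_output=False):
    """tt: list of LUT content bits; returns the transformed (secured) content bits."""
    tt = [b ^ k for b, k in zip(tt, sub_key)]            # addition with the key modulo 2
    tt = [tt[physical_perm[i]] for i in range(len(tt))]  # device-unique bit permutation
    if invert_output:                                    # output inversion from the logical key
        tt = [b ^ 1 for b in tt]
    return tt

original = [0, 0, 1, 0, 1, 0, 0, 1]
secured = secure_lut_transform(original,
                               physical_perm=[3, 1, 0, 2, 7, 5, 4, 6],
                               sub_key=[1, 0, 1, 1, 0, 0, 1, 0])
print(original, secured)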

Table 4-3. Mapping results and quantitative comparison between original and secure bitstreams
Benchmark  # CLBs  Crit. Path Nodes  Bitstream Size (Bytes)  D1    D2 (Original)  D2 (Secured)  Latency (×)
alu4       430     4                 6878                    8.00  1.68           8.00          1.14
apex2      520     13                8316                    7.99  1.70           8.00          1.12
apex4      249     8                 3974                    8.05  1.32           8.00          1.14
des        973     12                15558                   7.95  1.69           8.00          1.12
ex5p       159     4                 2540                    8.01  1.00           8.00          1.13
ex1010     387     9                 6192                    8.05  1.02           8.00          1.14
misex3     384     9                 5554                    8.01  1.61           8.00          1.14
pdc        996     6                 15922                   7.99  1.48           8.00          1.10
seq        506     8                 8096                    7.95  1.68           8.00          1.15
spla       894     12                14296                   8.05  1.42           8.00          1.11

4.3 Results

In this section, we provide a thorough security analysis considering brute force, side channel, destructive reverse engineering, as well as known design attacks. Using the secure FPGA mapping tool described in Section 4.2.2, we generate results for a set of

10 benchmark circuits.

4.3.1 Security Analysis

We provide a security analysis for four possible attack scenarios, namely 1) brute force, 2) known design, 3) side channel, and 4) destructive reverse engineering. We assume that the attacker has knowledge of the bitstream format, and has access to the obfuscated bitstream.

4.3.1.1 Brute force attacks

A brute force attack represents the most challenging and time consuming attack on the system. For a given design, there can be thousands of feasible interconnected

LUTs. Modern FPGAs typically support up to 6-7 input functions for each LUT. Thus, there are a huge number of possible combinations, which can be represented by even a small number of LUTs and which grows rapidly as the number of LUTs increases. When factoring in the potential content bit inversion, the programmable interconnect network inversion, and other architectural modifications, both static and time variant, the process of modifying some LUT bits, their input ordering, and the connections between them, mapping to FPGA, and testing for proper functionality becomes intractable. For

example, consider a small design with 128 LUTs with 6 inputs (64 content bits). Each content bit may or may not be inverted, and the correct ordering of bits is unknown. This

can be represented as the number of permutations (64!) multiplied by the number of

ways in which the LUTs can be connected (the binomial coefficient C(128, 6)). Hence, even with known input and output pairs, mounting such a brute force attack is not feasible in a reasonable time frame with current technology.
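A quick back-of-the-envelope computation of the search space quoted above (64! content-bit orderings times C(128, 6) interconnection choices) can be done directly in Python:

from math import comb, factorial, log2

search_space = factorial(64) * comb(128, 6)
print("about %.0f bits of search space" % log2(search_space))   # on the order of 300+ bits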

4.3.1.2 Known design attack and bitstream tampering

Known design attacks utilize small benchmark circuits (e.g. a single AND gate)

mapped to the target FPGA, which enables an attacker to observe how the resulting

bitstream changes. By launching this attack repeatedly, it is possible to reverse engineer the bitstream format -- as well as the IP -- in modern devices. It is also possible

to tamper with the design for targeted malicious modifications. However, our approach

can protect against known design attacks because it provides a moving target defense,

whereby the architecture's logical security layer changes each time the bitstream is recompiled. Therefore, even if a small benchmark is repeatedly mapped to the target

device, no new information about the architectural configuration is leaked.

4.3.1.3 Side channel attack (SCA)

Compared with brute force attacks, SCA is a more refined and powerful attack.

We first assume the attacker has used power analysis to discover the key to the logical security layer, which is generated at runtime. However, because the key generation

uses non-linear functions, and so is not susceptible to machine learning attacks, the

next key will not be known, and therefore the moving target defense due to the time-

varying architecture will prevent deobfuscation of the bitstream. Furthermore, even in

the unlikely case that the next key can be guessed correctly, the ability to deobfuscate one bitstream does not enable the attacker to maliciously modify the design so that

they can map it to another device, because all other devices will have their own unique

physical and time-varying security keys. This effectively eliminates the economic motivation for an attacker.

4.3.1.4 Destructive reverse engineering (DRE)

DRE is an expensive and time consuming process, but it can reveal the inner workings of the device. We present two scenarios of using DRE attacks. In the first case, DRE is used to reveal the structure of the physical security layer by identifying which fuses have been programmed. This alone would not compromise bitstream security, since the logical key is generated at runtime, and different keys are generated during each re-programming. Furthermore, this will only reveal the physical key network for one specific device, and therefore is not economical, given the high cost of DRE. In the second case, DRE is used to reveal the CSPRNG structure. However, this too would not be sufficient, because every device still has a different physical key network.

4.3.2 Secure Mapping Results

We used the modified VTR mapper tool to investigate the effect of the secure bitstream transform on a set of common benchmarks from the MCNC benchmark suite (117). With the VTR mapping flow, Verilog files were parsed into their BLIF representation using an architecture description file that defines a CLB as consisting of a 4-input LUT, a flip flop, and interconnect logic. Therefore, all benchmarks were mapped into a series of interconnected CLBs, with 16 content bits per LUT. These LUT content bits were taken as the bitstream for this particular FPGA. Unlike commercial

FPGAs, which have a fixed-size bitstream for each device, the bitstream sizes vary among benchmarks, since the number of available CLBs is a function of the resources required at run time by the tool. We present results in terms of inter- and intra-bitstream

Hamming Distance. Inter-bitstream distance (D1) is defined as the average distance between LUTs in the original bitstream (B_O) and the secured bitstream (B_S), as shown in

Eqn 4--1. The intra-bitstream distance (D2) is defined as the average pairwise distance between LUTs in a given bitstream.

D1 = ( Σ_{i=0}^{N} HD(B_{O,i}, B_{S,i}) ) / N    (4--1)

D2 = ( Σ_{i=0}^{N} Σ_{j=i+1}^{N} HD(B_i, B_j) ) / (0.5 × N × (N − 1))    (4--2)

While a high quality CSPRNG -- especially one that is amenable to efficient hardware implementation -- should be used in the final design, we have used the standard mt19937 generator for evaluation purposes. The average D1 value was found to be normally distributed, with a mean and standard deviation of 8.00 ± 0.03

for the 16 bit LUTs. The value for D2 for the original bitstream was found to be 1.5 ±

0.3, whereas the transformed bitstream D2 result was nearly equivalent to D1. This implies that, even with designs where the pairwise intra-bitstream LUT content only

differs by 1 or 2 bits, the functionality can be effectively obscured. The addition of LUT

content and output inversion logic will also affect the critical path delay. We assume the

additional interconnect delay within the ALM is not significant compared to the XOR gate

delay of 1.02 ns (133). This gives us an increase of around 2 ns per ALM. This yields moderate latency overhead for all benchmarks, with an average reduction of 1.14x in

the maximum operating frequency.
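The D1 and D2 metrics of Eqns. 4--1 and 4--2 are straightforward to compute; the Python sketch below operates on lists of LUT content vectors and is meant only to make the definitions concrete, not to reproduce the experimental flow.

from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def d1(original, secured):
    """Average Hamming distance between corresponding LUTs of two bitstreams (Eqn. 4--1)."""
    return sum(hamming(o, s) for o, s in zip(original, secured)) / len(original)

def d2(bitstream):
    """Average pairwise Hamming distance between LUTs within one bitstream (Eqn. 4--2)."""
    pairs = list(combinations(bitstream, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)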

4.4 Cost Analysis Based on a Case Study

There is an inherent trade-off between the area/power/delay overhead and the level of security provided by the architectural modifications. The results presented in

Section 4.3.2 represent a very high level of security, where every content bit in every

LUT can be selectively inverted through an additional XOR gate, the LUT output can

be selectively inverted, and the select inputs to the LUT are permuted using

a switching network. We can estimate the area overhead from such architectural modifications by adding the approximate area of the additional XOR gates, plus one

switching network, to the area of a given ALUT. For the 65 nm Altera Stratix III FPGA, the area of one Logic Array Block (LAB) is estimated to be 0.0211 mm2, and the core of

the largest Stratix III (EP3SL340) has 13,500 LABs, comprising 72.4% of the total die

area (134). Thus, for a 4 µm2 XOR gate implementation (133), additional XOR gates

result in an overall 8.5% increase in die area (411 mm2 to 447 mm2). For SRAM FPGAs, this could potentially be reduced with a custom design leveraging the existing Q and

Q̄ signals and using pass transistors to select between them. Using Design

Compiler and a 90 nm cell library (with results scaled to 65 nm), we estimate the area

of the switch network for the LUT select inputs to be roughly 135.5 µm2, increasing the

total die area to 465 mm2. This does not consider the area of the programmable fuses used in physical key storage. As reported in Table 4-2, physical key storage for the LUT

select input ordering is appropriate; therefore, with a 6-input LUT, at most ⌈log2(6)⌉ = 3 fuses can be used. Assuming an area of 15 µm2 per fuse (135), the total area would

increase to 466 mm2, or 13% area overhead. In practice, programming every LUT content switch with a static value may not provide the highest level of security, since the content ordering will not vary with time. Instead, a smaller number of fuses (e.g. 128 or

256) can be used on certain inputs (with remaining inputs connected to the logical key

network). This will help reduce overhead from fuse area and increase security against

known design attacks.

4.5 Summary

This chapter has presented MUTARCH, a distinctive approach for next-generation

FPGA security that enables secure wireless reprogramming and protects diverse

FPGA-based systems against piracy and tampering attacks in the field. The central

idea of MUTARCH is to create architecturally unique devices so that adversaries cannot interpret a bitstream, or use knowledge about one device architecture to break into

another device. MUTARCH provides a low-cost, low-overhead, and scalable protection

mechanism against field attacks on FPGA-based systems. Furthermore, it can be

61 used in conjunction with existing encryption techniques for additional security, and as a means to lock a specific bitstream to a specific FPGA instance, preventing potential economic gains from IP piracy. Although MUTARCH requires minor architectural change and device programming during manufacturing test, it does not impact the functional behavior of a mapped design. It also does not impact routing and FPGA resource utilization during application mapping, reducing compile-time overhead during firmware upgrades.

With increasing usage of FPGA devices in diverse application domains, including the emerging IoT space, effective protection of FPGA-based systems, and the IPs mapped in them, is paramount. MUTARCH is particularly attractive for the emerging IoT regime, which involves a large number of identical, connected devices, where using knowledge of one device architecture to attack another device is more feasible.

Although we focus on bitstream tampering and piracy in this work, the proposed approach is promising in preventing side-channel attacks. This is because architectural mutation can effectively obfuscate the correlation between secret key and the side-channel signature. Finally, the overhead of the proposed approach, although modest, can be further reduced through enhanced security versus hardware overhead trade-off analysis.

CHAPTER 5
ARCHITECTURAL DIVERSITY FOR MICROPROCESSOR SECURITY

5.1 Background

The Internet of Things (IoT) refers to the growing phenomenon of devices able to share data via network connectivity. Already, these devices outnumber humans on Earth, a trend which is expected to continue and even grow in the coming years

(136; 137). Generally, these systems, including smartphones and tablets, smart home appliances, security systems, medical devices, and automotive systems, are controlled by System-on-Chip (SoC) based around the same or similar processors and instruction set architectures (ISAs), and are created by relatively few manufacturers. In many cases, the security of these devices is overlooked, leading to potentially disastrous consequences (138; 139).

By their very nature, products manufactured in large volumes will utilize the same underlying hardware and firmware. This is beneficial for design, fabrication, assembly, and test, but is not ideal for security, since they will also be vulnerable to the same hardware and software level exploits, as shown in Fig. 5-1. In other words, this hardware homogeneity leads to a ``break one, break all'' scenario, which can put millions of consumers at risk, and can place a large financial burden on the manufacturer if/when an exploit is found. This was recently demonstrated by the mass recall of 1.4 million vehicles (140), costing the manufacturer upwards of $100 million USD. In this case, part of the attack was enabled by decompiling the firmware, leading to the discovery of the WPA2 wireless key (140). Other firmware attacks, including tampering / malicious modification (141; 142) and reverse engineering / cloning (143; 144), are feasible and must be actively defended against. For IoT devices, physical access may not even be required to mount an attack. This represents a major threat to embedded system security in the IoT, especially during remote firmware upgrade. Therefore, any proposed security measure must offer robust protection against attacks in this context.

Figure 5-1. Devices sharing the same hardware are vulnerable to the same attacks once an exploit is discovered.

Previously, techniques such as randomizing instruction encoding (Instruction Set

Randomization (ISR) (145), Instruction-Level Authentication (ILA) (146)), Address

Space Layout Randomization (ASLR) and related address obfuscation techniques (147), and software-level diversity (N-variant Systems (NVS) (148)) have been explored for processor security. These techniques all share the concept of ``diversity,'' albeit at different levels of abstraction. These diversification approaches can be categorized into instruction, memory, and execution randomization, as depicted in the taxonomy in

Fig. 5-2. Both ISR and ILA attempt to diversify the instruction set so that it becomes difficult to compile valid code for a target microcontroller. However, ISR has been shown to be weak against certain attacks, specifically when a weak encryption is used (149). Even a strong encryption, however, may not be sufficient on its own, since hardware imple- mentations of ciphers like Advanced Encryption Standard (AES) may be vulnerable to physical or side channel attacks such as Differential Power Analysis (DPA) that can leak the key. ILA depends on the security assurance of a strong Physical Unclonable

Function (PUF); however, ILA's continuous use of the PUF to decode each instruction will accelerate the aging effects from negative bias temperature instability, hot carrier injection, oxide breakdown, and electromigration, causing performance and reliability degradation over time. For many embedded and IoT systems, including those used in automotive and industrial control systems, lifetimes upwards of 10 -- 20 years are common, making accelerated aging unacceptable from a safety and reliability standpoint.

Numerous exploits have been found in ASLR, including software (150) and side-channel

(151) attacks. N-variant systems, which use multiple, automatically-generated variants of the same process to protect against a wide range of attacks, would have high overhead for many embedded/IoT systems, and therefore may not be viable. Recently, a form of hardware-level diversification has been investigated in the context of side channel attack resilience (152). By randomly reordering certain independent instructions, side channels such as power profiles or electromagnetic emissions that are associated with certain instructions or certain data are no longer time correlated in consecutive iterations. This greatly increases the difficulty in recovering encryption keys.

However, this method only calls for independent instructions to be shuffled, so the same firmware could feasibly be loaded on another device with no execution errors; therefore, while it does diversify execution, it does not lock an executable to a particular device, and cannot mitigate malware propagation. Another form of diversification has also been demonstrated for Field Programmable Gate Arrays (FPGAs), which aimed to make logical and physical changes in the underlying architecture so that the configuration file, or bitstream, would only function properly on one specific device (153). It remains to be seen if such an approach can be adapted to microcontrollers. We propose that such an approach to diversify the microcontroller hardware can provide the following benefits:

1. It will be more difficult for attackers to understand the underlying hardware architecture, which is necessary for finding and exploiting vulnerabilities.

2. It will mitigate the propagation of malware among networked devices, since malware designed to execute on one device will, with very high probability, not execute properly on other target devices in the network. The ability to mitigate the effect of malware propagation is particularly important for IoT devices, which are often unwitting participants in botnets designed to mount massive

Distributed Denial of Service (DDoS) attacks.

This chapter presents a number of modifications to a generic microprocessor architecture which provide hardware-level diversity for the purpose of IoT system security. I demonstrate how diversification at this level offers superior system security against common attacks and helps to mitigate the effects of malware propagation.

These modifications give original equipment manufacturers (OEMs) the ability to make in-field logical changes to the underlying device architecture as frequently as required to maintain security. This ability is invaluable for devices with long in-field lifetimes. Moreover, this approach is compatible with existing security techniques including firmware encryption, which can still provide another layer of defense. In short, this chapter makes the following novel contributions:

• It presents a comprehensive taxonomy which categorizes existing diversification approaches, covering both software- and hardware-level diversity.

• It describes a low-overhead approach to microprocessor security using architectural diversity. This can help prevent typical firmware attacks, such as targeted malicious modification and malware propagation.

• It presents a case study in which we modify the OpenRISC 1200 CPU, and report the associated area, power, and delay overhead.

• It analyzes the security of the proposed modifications in the context of brute force, known design, and side channel attacks.

To my knowledge, this is the first example of a mixed-granular, low-overhead technique that secures microprocessor-based systems against typical firmware attacks, while helping to improve side-channel attack resilience, and mitigating malware propagation in IoT devices. The rest of the chapter is organized as follows: Section 5.2

Figure 5-2. Taxonomic overview of existing techniques for security through diversity

provides additional background on system diversity and describes the threat model in

the IoT design space; Section 5.3 describes the architectural diversification approach;

Section 5.4 provides area, power, and delay overhead analysis, as well as a security analysis of the approach; finally, Section 5.5 concludes the chapter.

5.2 Motivation for IoT Device Security

This section describes some of the more common approaches to system security

through diversity, and then describes the threat model with possible attack vectors in IoT.

5.2.1 Related Work

Fig. 5-2 provides a taxonomic overview for existing diversification approaches for microprocessors. Broadly, these can be divided into two categories, software-level and hardware-level diversity. Software-level diversity aims to change execution patterns, randomize memory addresses, change the order of system calls, or use virtualization (i.e. virtual machines) to help mitigate a slew of software exploits. Memory / address space randomization, for example, aims to prevent such attacks as return-to-libc, by randomizing the location of different memory spaces for user code, stack, etc.

Virtualization provides a number of well-known security benefits, but a virtual machine is impractical for a constrained device. It should be noted that hardware-level attacks have been successfully mounted against software-level protections, such as ASLR

(151). Determining whether software-level protections are more or less susceptible to hardware-level attacks is itself an interesting research area.

Though most diversification approaches have been applied to software, a number of hardware-level approaches have emerged in recent years. Instruction randomization, including Instruction Set Randomization (ISR) and Instruction Level Authentication (ILA), requires hardware-level support for proper instruction decoding. ILA requires a strong

PUF implementation for authenticating instructions, and this constant usage of the

PUF can affect its reliability. A hardware/software partitioning approach where certain vulnerable code sections are offloaded to an FPGA device is presented in (154). This is a promising approach for providing a moving target defense against side channel and tampering attacks, but ideally requires a microprocessor with built-in FPGA fabric supporting dynamic partial reconfiguration; otherwise, new variants require a system restart to take effect. Furthermore, there is no explicit mechanism which prevents mapping the same software/hardware to a second device with the same CPU/FPGA, meaning that the hardware/software can be copied and mapped to a cloned device.

Another promising method for increasing resistance to side-channel attacks is presented in (152). This approach involves a customized hardware instruction shuffling unit which randomizes the order of independent instructions during execution.

Instructions can be identified using a VLIW compiler and then randomly permuted at runtime, so as to decorrelate certain instructions in time. To evaluate the effectiveness against differential power analysis (DPA) attacks, the authors implemented AES-128 on a MicroBlaze soft processor, and showed that the resultant plots of difference-of-means generated from a large number of power traces had no distinguishable peaks. However, because the instructions are independent (and therefore, the order of execution is irrelevant), the same software would still execute properly on any device with the same underlying hardware. So, while this reduces side channel information leakage in a single device instance, it cannot be used to mitigate malware propagation.

It is nevertheless an important result because it shows that shuffling independent instructions can reduce side channel information leakage, and that the instruction shuffling hardware itself does not contribute significantly to side channel leakage.

5.2.2 Attack Vectors in IoT

Several threats for embedded system / IoT firmware security have been identified in the past, but to our knowledge, there is no existing technique that is both sufficiently low-overhead, and also provides protection from firmware reverse engineering, malicious modification, and malware propagation. The following items highlight how understanding the target architecture is fundamental to facilitating firmware reverse engineering and code generation.

5.2.2.1 Firmware reverse engineering

Tools for reverse engineering embedded system firmware such as Binwalk (155) are common and freely available. In addition, there exist a number of standard Linux utilities which may be used to assist in the analysis of a device binary. Other utilities like the Firmware Reverse Analysis Konsole (FRAK) (156) have been described, and still others, such as DynamoRIO (157) are available for dynamic/runtime code manipulation.

Note that DynamoRIO is an exemplar of the risk of hardware homogeneity; because it is targeted at the x86 instruction set, it can therefore be used to monitor and/or manipulate code for various platforms, including Windows, Linux, and Android. In other words, investing the effort to develop an ISA-specific tool can provide vast returns, since it can be used on billions of systems worldwide. From a software perspective, firmware encryption will render these attacks infeasible. However, certain area- and energy- constrained IoT edge devices may not be able to support encryption due to aggressive constraints. Even in devices that do support encryption, long in-field lifetimes render them susceptible to physical and side channel attacks, which circumvent the protections offered by firmware encryption. In the exponentially expanding IoT, all devices, including those for which encryption is an option, can benefit from an additional layer of security,

especially one that is relatively low overhead and does not dramatically disrupt design flows.

5.2.2.2 Targeted malicious modification

In (141), the authors demonstrate how the unencrypted (decrypted) firmware can be modified to insert malicious code with full read, write, and execute privileges into the firmware ELF file header. This enables attackers to perform undetected network reconnaissance and data exfiltration. The malware can even autonomously propagate to other networked devices, despite the existing security measures in the built-in remote firmware update (RFU) utility. They also demonstrate that digital signatures of the firmware are not foolproof and should not be relied upon. Therefore, if a vulnerability is found in one device, it can be similarly exploited in a number of other devices running the same hardware and firmware, as shown in Fig. 5-1. Consider by extension the same attack, except mounted on firmware in a safety-critical automobile subsystem. The firmware could be reverse engineered, modified, and, if a remote firmware upgrade utility exists -- a likely scenario in the IoT regime -- this modified firmware could be flashed into other automobiles with the same hardware. Because a targeted modification is facilitated by reverse engineering and recompiling the firmware for the target architecture, it follows that preventing RE and increasing the complexity for code generation can help protect microprocessor-based devices.

5.2.2.3 Malware propagation

Malware, such as that responsible for recent IoT botnet-enabled distributed denial of service (DDoS) attacks, can presently be cross-compiled for nearly any system using freely available software tools (e.g. GCC). Besides poor deployment security (i.e., neglecting to change passwords for root-level access), the primary enabler for such attacks is the homogeneity of the underlying hardware, even if the devices themselves differ (e.g. printers, set-top boxes, etc.). As demonstrated in (141), devices supporting remote upgrade are vulnerable to automatic propagation of malware among networked

devices. Coarse-grained architectural changes, such as permuting the order of dependent instructions, and fine-grained architectural changes, such as permuting opcodes,

ALU inputs, branch target addresses, etc. can make functionally correct execution on an arbitrary target device infeasible. In other words, malware which functions on one device will not perform the same functions, or may not function at all, on another, thereby mitigating the potential damage.

5.3 Security through Diversity

As described in the previous section, architectural diversity can make it more difficult for attackers to reverse engineer firmware and make targeted malicious modifications, as well as help prevent malware propagation among networked devices. In this and later sections, we describe in detail the techniques we can use to implement this diversity in a generic microprocessor, in particular, a single-issue, in-order execution

RISC CPU, and describe the remote firmware upgrade model. We then examine the security provided by these techniques. Note that, while some IoT devices may use more complex processors, it is worth studying this type of basic microprocessor, which is often used in low-power IoT edge devices where encryption is too expensive to support. Such techniques can still apply to more complex processors, and as noted earlier, can still be used in conjunction with firmware encryption when available.

5.3.1 Keyed Permutation Networks

Numerous studies have explored the theory behind hardware permutation networks, which find uses in many domains in which multiple input nodes must communicate with one or more output nodes, where the required connections may change over time. The Beneš network (158) is a provably congestion-free network that can realize any permutation of the input. In other words, for n inputs, the network can realize any of the n! possible permutations using edge-disjoint paths, so there is no contention for particular wires at any given time. In such a network, the number of stages equals

Figure 5-3. Non-blocking networks are appropriate for time-varying permutations. A) Basic building block of the keyed switch elements. B) The Beneš network topology, comprised of the basic keyed switch elements; any of the input's n! permutations can be realized at the output.

2 log2(N) − 1. The network is built on a simple 2x2 switching element, which accepts two packet inputs of 1 or more bytes, and a key input k, which describes the behavior of the switch. Typically, when k = 0, the inputs 0, 1 are mapped to the outputs 0, 1.

When k = 1, the inputs are instead mapped 0 → 1, 1 → 0. These switch elements are

depicted in Fig. 5-3(a), and the complete 16-input network is shown in Fig. 5-3(b). The full permutation is defined by the combination of key inputs, one to each switch element.

Therefore, the input/output connections can be changed over time, simply by updating

the key.
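To make the switching behavior concrete, the following C sketch models the keyed 2x2 element and wires six of them into the smallest (4-input) Beneš network listed later in Table 5-3. The stage wiring and the particular key values are illustrative assumptions for exposition, not the exact netlist used in the hardware.

    #include <stdio.h>
    #include <stdint.h>

    /* Keyed 2x2 switch: k = 0 passes inputs straight through, k = 1 crosses them. */
    static void switch2(uint8_t in0, uint8_t in1, int k, uint8_t *out0, uint8_t *out1)
    {
        *out0 = k ? in1 : in0;
        *out1 = k ? in0 : in1;
    }

    /* 4-input Beneš network: 2*log2(4) - 1 = 3 stages, 6 keyed switches total.
     * key[0..5] select the state of each switch; the butterfly-style wiring
     * below is one standard arrangement (illustrative only).                 */
    static void benes4(const uint8_t in[4], const int key[6], uint8_t out[4])
    {
        uint8_t a[4], m[4];
        /* input stage */
        switch2(in[0], in[1], key[0], &a[0], &a[1]);
        switch2(in[2], in[3], key[1], &a[2], &a[3]);
        /* middle stage: one input from each input-stage switch */
        switch2(a[0], a[2], key[2], &m[0], &m[1]);
        switch2(a[1], a[3], key[3], &m[2], &m[3]);
        /* output stage: one input from each middle switch */
        switch2(m[0], m[2], key[4], &out[0], &out[1]);
        switch2(m[1], m[3], key[5], &out[2], &out[3]);
    }

    int main(void)
    {
        uint8_t in[4] = { 'A', 'B', 'C', 'D' }, out[4];
        int key[6] = { 1, 0, 1, 0, 0, 1 };   /* one of the 2^6 switch settings */
        benes4(in, key, out);
        for (int i = 0; i < 4; i++)
            printf("%c", out[i]);            /* prints a permutation of ABCD  */
        printf("\n");
        return 0;
    }

Because every stage is a bijection of its inputs, the output is always a permutation of the input regardless of the key value, which is the property the later security analysis relies on.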

5.3.2 Mixed-Granular Permutations

The Beneš network can be used to provide a high degree of architectural diversity to microprocessors by operating at multiple levels of granularity. We begin by describing

the two main diversity classes, namely, fine and coarse grained. Table 5-1 summarizes

these, and qualitatively provides relative overhead for area, power, and delay, as

well as the relative security benefits against side channel attacks (e.g. DPA), reverse engineering (RE) and targeted malicious modification (TMM), and against malware

Table 5-1. Summary of proposed diversifications for IoT security

Granularity   Examples        Area   Power   Delay   SCA    RE/TMM   MWP
Fine          Opcode, DReg.   Low    Low     Mod.    Low    Mod.     Yes
Coarse        Instr. Perm.    Mod.   Low     Low     High   Mod.     Yes
Both          --              Mod.   Low     Mod.    High   High     Yes

(Area, Power, and Delay give the relative overhead; SCA, RE/TMM, and MWP give the relative security benefit.)

propagation (MWP). More quantitative results for overhead and security are provided in

Section 5.4.

5.3.2.1 Instruction encoding

Fine-grained modifications refer to changes made at the bit- or byte-level in a

single instruction word. This effectively diversifies the instruction set, similar to ISR, but

does so in a fine grained fashion. Previous implementations of ISR aimed to mitigate

the efficacy of code injection attacks. In principle, this was done by creating artificial diversity in the instruction set architecture by changing the instruction encoding based

on a key at compile-time, i.e., I_n = I_0 ⊕ K_0, where I_n is the new instruction encoding, I_0 is the original instruction, and K_0 is the key. By performing the same operation within the processor, the original instruction is recovered, and can execute as usual within the unmodified datapath. In other words, the only required hardware modification is the addition of the inverse transform (the XOR operation) during instruction fetch. Previous implementations of ISR (145) had fundamental flaws which enabled an attacker to

determine the key piecemeal. In particular, the authors note that the use of XOR to

encrypt the instructions will expose portions of the key, which can be learned from just one ciphertext-plaintext pair. Furthermore, the variable instruction length of the x86 architecture used in

that particular ISR implementation is also beneficial for attackers -- shorter instructions,

such as the single byte RET instruction, or near return to calling procedure, can be guessed in 2^8 attempts (149). A random 32-bit transposition was also noted to be less

vulnerable to this approach, since multiple plaintext-ciphertext pairs would be needed, even when a short instruction is available.
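The key-leakage problem with XOR-based ISR can be seen in a few lines of C: one known plaintext/ciphertext instruction pair directly reveals the corresponding key bits. The 32-bit values below are purely illustrative, not drawn from any real ISA or device.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t key   = 0xDEADBEEF;   /* secret per-device key (illustrative)   */
        uint32_t instr = 0x15000000;   /* a known/guessed instruction encoding    */

        /* XOR-based ISR: encode at compile time, decode in the fetch stage.     */
        uint32_t cipher = instr ^ key;

        /* An attacker who knows one plaintext instruction and observes its
         * encoded form recovers the key for those bit positions directly --
         * no search is required.                                                */
        uint32_t recovered = cipher ^ instr;
        printf("recovered key: 0x%08X (%s)\n", recovered,
               recovered == key ? "matches" : "differs");
        return 0;
    }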

Therefore, a transposition of the instruction bits is the preferred method for our implementation; and, unlike x86 and other CISC machines, all RISC instructions are the same length, so there are no short instructions that can be used for obtaining the key piecemeal. A keyed Beneš permutation network, as previously described, can realize any of the permutations of the instructions. For a full 32 bit permutation, a total of N log2(N) − N/2 = 144 switches are required. This can provide the greatest amount of diversity, but also has the highest cost in hardware. To reduce this overhead, but still sufficiently permute bits from different instruction encoding fields, we use two 16-bit networks in parallel, each with a network delay of 7 units (2 × log2(16) − 1) and 56 switches, for 112 switches in total (about a 22% reduction). This will permute bits from the top and bottom 16 instruction bits. For example, in a register ADD instruction, this will mix bits from the opcode, one source register, and one destination register, which could produce any opcode, potentially from instructions which have inherently different formats. Therefore, from a reverse engineering perspective, it will be difficult to determine not just the intended opcode, but also the intended instruction type (ALU,

MEM, etc.) and corresponding format. As shown in (152), hardware permute units will not contribute significantly to side channel information leakage, so using such permutation networks will reduce the device's susceptibility to side channel attack as compared to key addition. Logically, this is because addition with a key will result in output bits which may have a different number of 1's and 0's than the original input value, as dictated by the key itself, whereas with the permute unit, only the position of the bits varies -- the number of 1's and 0's will not, regardless of the key.
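As a minimal sketch of the fine-grained scheme, the following C code applies a keyed bit transposition independently to the upper and lower 16 bits of an instruction word. The permutation tables stand in for the switch settings of the two 16-bit Beneš networks and are illustrative values, not a real device key; the in-device fetch logic would apply the inverse permutations to recover the original encoding.

    #include <stdio.h>
    #include <stdint.h>

    /* Move bit i of x to bit position perm[i]; perm must be a permutation of 0..15. */
    static uint16_t permute16(uint16_t x, const uint8_t perm[16])
    {
        uint16_t y = 0;
        for (int i = 0; i < 16; i++)
            if (x & (1u << i))
                y |= (uint16_t)(1u << perm[i]);
        return y;
    }

    /* Diversify a 32-bit instruction: independent transposition of each half. */
    static uint32_t diversify(uint32_t instr, const uint8_t perm_hi[16],
                              const uint8_t perm_lo[16])
    {
        uint16_t hi = permute16((uint16_t)(instr >> 16), perm_hi);
        uint16_t lo = permute16((uint16_t)(instr & 0xFFFF), perm_lo);
        return ((uint32_t)hi << 16) | lo;
    }

    int main(void)
    {
        /* Hypothetical key-derived switch settings, one table per half word. */
        const uint8_t perm_hi[16] = { 3, 7, 0, 12, 9, 1, 15, 4, 11, 2, 13, 6, 8, 10, 5, 14 };
        const uint8_t perm_lo[16] = { 6, 1, 14, 9, 0, 12, 3, 8, 15, 5, 2, 11, 7, 13, 10, 4 };
        uint32_t instr = 0xE0432806;   /* arbitrary example encoding */
        uint32_t enc   = diversify(instr, perm_hi, perm_lo);
        printf("0x%08X -> 0x%08X (popcount is preserved)\n", instr, enc);
        return 0;
    }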

Note that, for each hardware modification, a corresponding modification in the compiler is required which takes as input a key, and produces an appropriate binary for the specific device. This procedure is outlined in Fig. 5-4. Also note that in Fig. 5-4, the example permutation networks are placed within the datapath to demonstrate how the instruction encoding is effectively changed between devices depending on the network

configuration. In practice, this can be done outside the datapath, when instructions are loaded from the cache, to avoid unnecessary performance degradation.

Figure 5-4. Implementing architectural diversity in a generic RISC microprocessor. A) A code file is compiled with respect to the target device architecture. B) An example microprocessor with modified controller, register file, ALU, and data memory.

5.3.2.2 Dependent instruction reordering

Previous work has explored randomizing the execution order of independent instructions to reduce information leakage through side channels (152). An example of independent instructions would be the substitution box (SBOX) lookups in the AES cipher, which can be done in any order at the byte level without affecting the result of the state matrix. However, because the order of such instructions does not matter, the code remains portable, and could execute properly on any other platform with the same microprocessor. If dependent instructions were permuted, such a scheme would effectively lock the firmware to a single unique device, since functionally correct execution on any other device would fail.

The length of the instruction cache line (e.g. 16 bytes for the standard OR1200 CPU) may limit the efficacy of full, 32-bit instruction word reordering, since each line

stores only four 32-bit instructions. Here, the number of reorderings is 4! = 24, which is easily brute forced. Instead, byte-level, inter-instruction permutations are an efficient method to reorder instructions within a cache line. For example, the 16-byte cache line, upon being fetched, will be reordered in 1 of 16! ≈ 2^44 ways, using a 7-stage Beneš network similar to that described previously. As before, such byte-level inter-instruction reordering will be less susceptible to key leakage via side channel, since once again the same 16 bytes from the input will be present at the output.
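A corresponding sketch of the coarse-grained step models the byte reordering as an explicit 16-entry permutation of a cache line, applied when the binary is prepared and inverted by the matching device; the permutation table here is a hypothetical stand-in for the key-derived Beneš configuration.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 16

    /* Forward permutation applied before transmission: out[perm[i]] = in[i]. */
    static void permute_line(const uint8_t in[LINE_BYTES], uint8_t out[LINE_BYTES],
                             const uint8_t perm[LINE_BYTES])
    {
        for (int i = 0; i < LINE_BYTES; i++)
            out[perm[i]] = in[i];
    }

    /* Inverse permutation applied when the line is fetched on the device. */
    static void unpermute_line(const uint8_t in[LINE_BYTES], uint8_t out[LINE_BYTES],
                               const uint8_t perm[LINE_BYTES])
    {
        for (int i = 0; i < LINE_BYTES; i++)
            out[i] = in[perm[i]];
    }

    int main(void)
    {
        /* Hypothetical key-derived byte permutation for this device. */
        const uint8_t perm[LINE_BYTES] =
            { 5, 12, 0, 9, 14, 3, 7, 11, 1, 15, 6, 2, 10, 4, 13, 8 };
        uint8_t line[LINE_BYTES], shuffled[LINE_BYTES], restored[LINE_BYTES];

        for (int i = 0; i < LINE_BYTES; i++)   /* pretend cache line (4 instructions) */
            line[i] = (uint8_t)i;

        permute_line(line, shuffled, perm);        /* what is stored / sent over the air */
        unpermute_line(shuffled, restored, perm);  /* what the matching device recovers  */

        printf("round trip %s\n",
               memcmp(line, restored, LINE_BYTES) == 0 ? "ok" : "failed");
        return 0;
    }

A device with a different permutation would recover a different byte order, so the same binary does not execute correctly elsewhere, which is precisely the locking property used against malware propagation.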

5.3.3 Wireless Reconfiguration

When a firmware update is required for an IoT device, additional steps are needed to support the hardware architectural diversity. Some of these steps must be taken before device deployment, and others must occur each time the device is upgraded. Prior to deployment, devices must first be configured and registered into the manufacturer's database at production time. Each device contains a unique identifier which has a one-to-one correspondence with the architectural configuration key. Next, when an update is required, the manufacturer first compiles the firmware as usual using standard tools, such as GCC. This is done to save time, because the initial compilation process will take significantly longer than the subsequent modifications to the binary, which consist of bit transpositions and byte-level inter-instruction permutations. The manufacturer must then communicate with the target device to obtain its identity. The device-specific architecture is retrieved from the database, and the binary is modified using the device-specific configuration. Individual bits from the top and bottom halves of each instruction must be permuted, so that when loaded by the target device, the inverse permutation reverts the instructions back to their original encoding. Similarly, every n instructions, which are stored in a single cache line, must be reordered at the byte granularity, so that when loaded in the target device, the inverse permutation will revert these instructions to their original ordering. Finally, device-specific executables are transmitted using the existing

network infrastructure to the correct device, enabling the remote firmware upgrade. These steps are outlined in Fig. 5-5.

Figure 5-5. Software flow for securing compiled firmware binaries in microprocessor systems; additions to the typical remote upgrade flow are highlighted in the shaded boxes.
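The sequence can be summarized in code as follows. Every function here is a hypothetical stand-in introduced only to illustrate the ordering of steps in Fig. 5-5 (compile, identify, look up, transform, transmit); none belongs to an existing tool, and the keyed rotation is a placeholder for the bit and byte permutations shown earlier.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static size_t compile_firmware(const char *src, uint8_t *bin, size_t max)
    {
        /* Stand-in for the ordinary GCC flow: just copy the "source" bytes. */
        size_t n = strlen(src) < max ? strlen(src) : max;
        memcpy(bin, src, n);
        return n;
    }

    static uint64_t query_device_id(const char *device_addr)
    {
        (void)device_addr;
        return 0x1234ABCDu;            /* device reports its unique identifier   */
    }

    static uint8_t lookup_device_key(uint64_t id)
    {
        return (uint8_t)(id & 0xFF);   /* manufacturer database lookup (toy)     */
    }

    static void apply_device_transform(uint8_t *bin, size_t len, uint8_t key)
    {
        /* Placeholder for the fine-grained bit and coarse-grained byte
         * permutations; a trivial keyed rotation stands in for them here.      */
        for (size_t i = 0; i < len; i++)
            bin[i] = (uint8_t)((bin[i] << (key & 7)) | (bin[i] >> (8 - (key & 7))));
    }

    int main(void)
    {
        uint8_t bin[256];
        size_t len = compile_firmware("example firmware image", bin, sizeof bin);
        uint64_t id = query_device_id("device-42.local");
        uint8_t key = lookup_device_key(id);

        apply_device_transform(bin, len, key);   /* device-specific binary */
        printf("prepared %zu bytes for device %llx\n",
               len, (unsigned long long)id);
        /* a transmit step would then hand the buffer to the existing RFU path */
        return 0;
    }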

5.4 Results and Discussion

In this section, we describe the overhead results, namely the area, performance, and power impact of the architectural modifications. We also provide a thorough security analysis of the approach.

5.4.1 Overhead Analysis

Overhead results were obtained by implementing the permutation networks in Verilog, integrating them in the open source OR1200 CPU, and comparing to an unmodified

OR1200 (120). The functionality was verified in simulation, and demonstrated that, given the correct key, the final processor state (e.g. register contents) will match the original, while an incorrect key results in a detectable error (exception), and therefore in an incorrect processor state, since the code could not execute to completion.

Synthesis results were obtained using Synopsys Design Compiler with a 90nm library (SAED 90nm EDK) for the logic portion of the CPU, and using CACTI to estimate

the size of the OR1200's default 8 kB instruction and data caches (159). The area, power, and delay for each fine-grained (FG) permutation network were found to be roughly 1300 µm², 150 µW, and 1.93 ns, respectively. The two FG networks operate on half instruction words in parallel and are outside the critical path. Therefore, they do not impact processor performance. The total area increase is 2600 µm², with a total power increase of 300 µW. The coarse grained (CG) network is larger, occupying an area of roughly 10625 µm² and consuming 1190 µW of power, with a total delay of 2.01 ns.

In both cases, the vast majority of this power is due to transistor leakage, even when using high Vt cells, making power gating an attractive option for reducing this overhead (160). In particular, each network is only active for a fraction of the overall cycle time (about 6.7%). We therefore employ power gating to reduce the overall power consumption, from 300 µW to about 20 µW , and from 1189 µW to 80 µW .

The area, power, and delay for the original OR1200 processor, also synthesized at

90 nm, were found to be 200,000 µm², 1.1 mW, and 30 ns, respectively. Added to this are the area and power results for the caches taken from CACTI (8192 B size, 16 B line size, direct mapped cache), which are 0.41 mm² each, consuming 0.046 nJ of dynamic energy per read. At the cycle time of 30 ns, this gives roughly 1.5 mW per cache, or 3.0 mW for both. While the actual power consumption will vary with particular benchmarks, we assume an average 50% of this total power, or 1.5 mW, in our calculations. Including the modifications in the OR1200 yields an area and power overhead of roughly 1.30% and 3.80%, respectively. However, both permutation networks are outside the critical path, and were not observed to affect the cycle time of the processor.

These results are summarized in Table 5-2.

5.4.2 Security Analysis

The proposed mixed-granular architectural modifications offer a high degree of security with manageable overhead for typical designs. Here, we analyze the level of security offered by this approach.

Table 5-2. Area, power, and performance overhead due to modifications of OR1200 CPU

Component                  Area (µm²)   Power (µW)   Latency (ns)
Fine Grained Perm. (x2)    2,600        19           1.93
Coarse Grained Perm.       10,625       79           2.01
8 kB I-Cache & D-Cache     820,000      1,529        30.0
OR1200 (Original)          1,020,000    2,629        30.0
OR1200 (Modified)          1,033,214    2,727        30.0
Overhead                   1.3%         3.8%         0.0%

5.4.2.1 Brute force

Security against brute force attack stems from a combination of the diversification

approaches described in Section 5.3. First we consider the fine-grained, intra-instruction permutation. Because these networks will only transpose the bits, the same number of 1's and 0's will be present at the output. Therefore, although there are 16! possible permutations, they are not necessarily unique. The number of possible unique permutations therefore depends on the number of 1's (0's) in the input, and can be defined using the binomial coefficient

\binom{n}{r} = \frac{n!}{r!\,(n-r)!} \qquad (5-1)

where n is the number of bits, and r is the number of 1's present in the input (or equivalently, the number of 0's). Thus, having a relatively equal number of 1's and 0's in the original instruction will lead to the highest number of possible combinations. One potential outcome is that, when designing an ISA, certain encodings may lend themselves better to security via bit permutations. In existing ISAs, this may still be feasible, for example, by adding compiler support in selecting certain registers in particular instructions to maximize the potential encodings.

For the proposed 16-bit permutation networks, the maximum number of combinations is given by \binom{16}{8} = 12870 for both the top 16 bits and bottom 16 bits in the worst case. The average case is given by

Table 5-3. Brute force complexity relative to fine-grained permutation network dimension

Dimensions   # Switches   # Networks   Total Switches   # Combinations
32           144          1            144              2^29.2
16           56           2            112              2^27.3
8            20           4            80               2^24.5
4            6            8            48               2^20.7
2            1            16           16               2^16.0

\frac{\sum_{r=0}^{n}\binom{n}{r}}{n} = \frac{2^n}{n} \qquad (5-2)

which for n = 16 is 2^12. Since different permutations are defined for each half of the instruction word (via separate switch network controlling bits), they are multiplicative, roughly equaling 2^27 in the worst case, and 2^24 in the average case. As previously discussed, there is an inherent trade-off between the number of potential encodings, and the area overhead (due to additional switching elements). This trade-off is summarized in Table 5-3.

At the same time, the coarse-grained, byte-level inter-instruction permutation will be implemented. Because 8 bits are treated as a group, it is far less likely for two of the bytes to be identical, unlike when individual bits were permuted. Therefore, we assume all 16! orderings of the 8-bit groups will be unique. The complexity is again multiplicative, and is given as

C = \binom{n}{r}^2 \times n! = \frac{(n!)^3}{(r!)^2\,((n-r)!)^2} \qquad (5-3)

which for n = 16 and r = 8 (taking the fine-grained worst case for both instruction halves) yields roughly 2^72 combinations for every 4 instructions, or 2^68 in the average case. This means that, for any given binary, there is a less than 1 in 10^20 probability of compiling valid code without knowing the target architecture.
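The quoted orders of magnitude can be reproduced with a short numerical check. The C sketch below combines the worst-case fine-grained count for both instruction halves, the squared 2^n/n average from Eq. 5-2, and the 16! byte orderings, and prints their base-2 logarithms; it is only a sanity check, not part of the proposed hardware or toolflow.

    /* build: cc combos.c -lm */
    #include <stdio.h>
    #include <math.h>

    /* Binomial coefficient as a double (exact enough for these small n). */
    static double binom(int n, int r)
    {
        double c = 1.0;
        for (int i = 1; i <= r; i++)
            c = c * (n - r + i) / i;
        return c;
    }

    int main(void)
    {
        double fine_worst = binom(16, 8) * binom(16, 8);   /* both 16-bit halves    */
        double fine_avg   = pow(pow(2, 16) / 16.0, 2);     /* Eq. 5-2, both halves  */

        double fact16 = 1.0;                               /* coarse: 16! orderings */
        for (int i = 2; i <= 16; i++)
            fact16 *= i;

        printf("fine worst  ~ 2^%.1f\n", log2(fine_worst));           /* ~2^27 */
        printf("fine avg    ~ 2^%.1f\n", log2(fine_avg));             /* ~2^24 */
        printf("coarse      ~ 2^%.1f\n", log2(fact16));               /* ~2^44 */
        printf("combined    ~ 2^%.1f\n", log2(fine_worst * fact16));  /* ~2^72 */
        return 0;
    }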

Finally, because the permutation network is rearrangeable and non-blocking, it can, at any time, be reprogrammed to produce a different output. This presents a

moving target defense, and with proper key management, can be done as frequently as required to maintain system security.

5.4.2.2 Known design and side channel attacks

A known design attack is mounted when an attacker uses knowledge about the input to observe patterns in the output binary. For example, if the attacker has access to the development tools, writing a simple program and compiling it for the target device may enable them to eventually determine the configuration of that device's permutation networks, but not for any other device. Therefore, knowledge of one device's archi- tecture does not enable an attacker to compile valid code on any other device. This method does not, however, protect against reverse engineering the firmware for the purpose of piracy -- other techniques such as firmware encryption would be necessary in that regard. However, it does prevent maliciously modified code (i.e. firmware that appears valid, but has a software Trojan or other malicious functionality hidden inside) from being flashed to other devices. Similarly, it also prevents the spreading of malware, e.g. if the attacker designs malware to automatically propagate among networked de- vices, because successfully deploying on one device does not enable it to spread to others on the network.

If a known design attack does not work, or if the attacker does not have access to the development tools to produce their own test binaries, a side channel attack may be an alternative technique. Attacks like power analysis or electromagnetic emission analysis can be used to leak secret information from a system, such as an encryption key, given that an attacker has physical access. This is possible because repetitive operations, such as rounds during encryption, operate in a regular, known manner, and analyzing the system's side channels can enable attackers to identify time-correlated changes in, for example, the power tracing. This can in turn be used to extract the secret key used in encryption. However, using only permutation networks, there is no key-dependent power draw, because the same (reordered) data will be present in the

output, and the same number of transistors will switch regardless of the value of the network control bits.

5.4.3 Relationship to encryption

Encryption has traditionally been considered the de facto solution in the field of information security. However, for achieving certain security goals, encryption could become expensive or even inadequate. As a motivational example, consider the prevention of hardware intellectual property (IP) theft at untrusted design houses. This cannot be solved with encryption alone, because only the decrypted design can be manufactured. Alternatives such as hardware obfuscation and locking mechanisms have been developed and are considered useful in such a scenario, since designs can be fabricated and tested in obfuscated form while still achieving the security goals not met by encryption alone (161).

Another major security concern is in-field malicious reprogramming. Even though one could try to prevent this problem using encryption and authentication, the ability to use or reprogram a desired encryption (and authentication) key on the target device allows the attacker to execute unwanted malicious software. Moreover, leakage of the encryption key and encrypted authentication key would allow the device to be maliciously programmed.

However, the presence of hardware diversity vastly increases the complexity of mounting such an attack. One needs to reverse engineer individual details of the underlying hardware that impact the corresponding software compilation for that device. Since reverse engineering one instance of the hardware does not reveal any information about another, the same exhaustive process must be followed for each device to be reprogrammed. Additionally, extracting certain architectural secrets may require destructive reverse engineering of the device, which could eventually make the device unusable. The increase in complexity of malicious programming is likely to cost more than the benefits of successful programming of a single device. Therefore, while

encryption provides unparalleled advantages in solving data oriented security issues, certain hardware related vulnerabilities may be solved more efficiently with solutions like obfuscation and diversification.

5.5 Summary

This chapter has presented a novel, low-overhead approach to embedded system security for IoT devices. The approach employs architectural diversification to counter the threat of hardware homogeneity and the break-one/break-all paradigm. This makes the approach strong against common firmware attacks such as targeted malicious modification, as well as network-based malware propagation. We have described the permutation networks, which are area, power, and delay efficient, and shown how they represent relatively low overhead when implemented in an example processor. Our security analysis shows that the security benefits of this technique provide mathematically provable protection due to the complexity of brute force attacks, and the lack of side-channel information leakage. Though this is not a replacement for encryption, and does not aim to replace encryption hardware, it does provide a solution to the problem of hardware homogeneity which is less susceptible to side channel attacks. Furthermore, this technique does not preclude the possibility of using firmware encryption as an additional layer of security. By implementing the proposed modifications, designers of future microprocessors for embedded/IoT systems can find an appropriate trade-off between the hardware overhead and security benefits depending on the application.

This can include implementing permutation networks for fine- and coarse-grained instruction set and instruction order diversification, and designing the ISA and compiler to produce instructions with near-balanced binary encodings (i.e. roughly 50% 0's and 1's) to maximize the efficacy of the permutations. The methods described in the previous chapters provide a pathway for designing secure reconfigurable computing frameworks. The following chapters will change focus

to energy-efficiency, and show how a general spatio-temporal computing framework may be made more energy-efficient for certain applications.

CHAPTER 6
RECONFIGURABLE ACCELERATOR FOR GENERAL ANALYTICS APPLICATIONS

Chapter 2 provided background on a generalized MAHA architecture, and described how the specifics of the framework -- the architectural details within each MLB such as the supported datapath operations, the size and organization of the memory, and the type and organization of the interconnection network, among others, can depend greatly on the specific application to which the accelerator is targeted. In this chapter, which previously appeared* in the ACM Journal on Emerging Technologies in

Computing Systems (JETC), Special Issue on Hardware and Algorithms for Learning On-a-chip and Special Issue on Alternative Computing Systems, I describe a series of architectural customizations to the general MAHA framework, derived from careful analysis of typical analytics application kernels and operations, aimed at improving the performance and efficiency of the reconfigurable MAHA architecture.

6.1 Background

Big data has become a ubiquitous part of everyday life (13; 14; 15). For companies that amass large datasets - whether from sensor data or software logs, social networks, scientific experiments, network monitoring applications, or other sources - the problem is not acquiring the data, but rather analyzing it to identify patterns or provide insight for operational decisions (16). Often, these decisions are time-sensitive, such as identifying network intrusions in real time, or reacting to fluctuations in stock indicators. Meeting such low-latency requirements on a limited power budget will become increasingly difficult as data streams increase in size and velocity (13), especially when leveraging machine learning (ML) techniques for data processing.

* R. Karam, S. Paul, R. Puri, and S. Bhunia. ''Memory-Centric Reconfigurable Accelerator for Classification and Machine Learning Applications,'' ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 13, no. 3, 2017.

To address this issue, many have turned to hardware acceleration, relying primarily on either the parallel-processing capabilities of General Purpose Graphics Processing

Unit (GPGPU) platforms like NVIDIA CUDA (17; 18; 19) or Field Programmable Gate

Array (FPGA) implementations of certain time-intensive kernels (20; 21; 22). Such

solutions are promising for simple analytics as they generally accelerate relational database queries on platforms implementing the MapReduce framework (23), exploiting

the straightforward data parallelism in such applications (15). However, there are two

main issues with these techniques:

Firstly, the problem of accelerating more complex analytics, where the parallelism

may not be as readily apparent and the communication requirements vary significantly between applications, remains to be addressed in the context of increasingly stringent

power and latency budgets. In fact, it has been shown that conventional multicore

(both CPU and GPU) scaling is unlikely to meet the performance demands in future

technology nodes, even with a practically unlimited power budget, and that additional cores yield little increased performance due to limited application parallelism (24).

Secondly, while these accelerators generally connect to the host via high speed

interconnects (e.g. PCIe), they are still at the mercy of the data transfer energy bottleneck. Regardless of the speed and/or efficiency of the accelerator, data transfer

requirements vastly reduce their efficacy when processing massive datasets (25). These issues motivate us to find an alternative, ultralow power domain-specific

architecture, focusing on energy-efficient execution of machine learning applications,

especially in the context of big data analytics. Such an architecture must be scalable,

to deal with growing datasets, and reconfigurable, to readily implement a broad range

of kernels as needed by the various applications. Furthermore, domain-specificity will improve area efficiency while supporting diverse applications in the target domain (26).

This chapter presents a framework which addresses these issues using a memory-centric spatio-temporal reconfigurable accelerator (8; 27) customized for analytics

applications, which operates at the last level of memory. As a memory-centric accelerator, it leverages the high-speed, high-density cell integration found in current memory devices to efficiently map entire analytics kernels to the accelerator. Furthermore, as a memory-centric framework, advances in promising emerging CMOS-compatible memory technologies can be leveraged to improve performance while reducing power consumption (28; 29; 30).

To validate this approach, three classes of analytics applications are analyzed for their typical operations and communication patterns, then mapped to the framework: classification (Naïve Bayes Classifier (NBC)), artificial neural networks (Convolutional

Neural Network (CNN)), and clustering (K-Means Clustering (KMC)). Data is streamed to the framework using a back-buffering technique in both uncompressed and compressed formats, employing a Huffman-based entropy coding scheme to effectively increase data bandwidth and system storage capabilities, while significantly reducing transfer energy. A Verilog model of the framework is synthesized, and the power, performance, and area are extracted, and the energy efficiency is calculated. Finally, these values are compared with CPU and GPGPU implementations using the same datasets.

In summary, this chapter describes the following novel contributions:

1. It presents the processing element microarchitecture and hierarchical interconnect for a domain-specific, memory-centric reconfigurable accelerator for machine learning kernels in the context of big data analytics. PEs support multi-input, multi-output lookup tables and are highly tailored to the application domain.

2. It explores the processing and communication requirements for three classes of analytics applications, and uses these observations in the design of the processing elements and interconnect.

3. It considers the role of data compression for big data analytics in the context of disk-to-accelerator bandwidth and overall power savings, and demonstrates the feasibility of in-memory decompression of datasets in the accelerator framework.

4. It validates the approach using a multi-FPGA hardware emulation framework which provides functional and timing verification for the design. It studies the power, performance, and area results from synthesis, and compares with implementations on CPU and GPGPU. Overall, the accelerator demonstrates excellent

energy efficiency when compared with CPU and GPU for the machine learning and classification domain of applications.

The rest of the chapter is organized as follows: Section 6.2 provides a brief

overview on existing work in the related fields, and summarizes the domain-specific requirements for the target applications; Section 6.3 describes the system organization,

hardware architecture, the application mapping procedure, parallelism and execution

models, and the data management techniques; Section 6.4 provides accelerator-

specific implementation details for the benchmark applications, the dataset compression

methodology, and the experimental setup; Section 6.5 reports the experimental results, including raw throughput, energy efficiency, and the effects of dataset compression; Section 6.6 discusses the performance, parallelism, scalability and application scope; Section 6.7 covers a broad range of related works for the hardware acceleration

field, including FPGA, GPGPU, CGRA, and other manycore architectures, and contrasts these with the proposed accelerator; finally, Section 6.8 concludes with future directions

for the research.

6.2 Requirements of Big Data Analytics Systems

In this section, we explore the requirements of a big data analytics accelerator and

discuss the state-of-the-art systems. We note the shortcomings of these systems, which motivates us to find a new solution.

6.2.1 Overview of Big Data Analytics

In general, analytics aims to identify meaningful patterns in data by employing a

variety of statistical or machine learning methods. "Big Data" analytics presents a slew

of problems beyond the basic algorithms and methods employed (31), including storage and retrieval of data, and division and balancing of workloads, among others. These

problems are multifaceted and complex, but are wholly contingent on the distributed

computing paradigm (e.g. MapReduce) that is commonly used. A growing body of

research aims to reduce processing time under these conditions using general purpose

Figure 6-1. Proposed reconfigurable accelerator for general analytics kernels.

CPUs, GPUs, or FPGAs for acceleration (23; 22; 32; 33; 34; 18; 35; 36; 37; 38; 39;

40; 21; 41; 42; 43), but generally prioritize raw performance over energy efficiency due to the rapidly increasing data volume and generation velocity. We mainly compare with these platforms, rather than other CGRAs or manycore systems, because CPU,

GPGPU, and FPGA tend to be more commonly used in larger scale and enterprise- level big data systems. A more in-depth discussion of these platforms is offered in

Section 6.7. While these approaches tend to give excellent results once the data is transferred to the CPU/accelerator memory, the data transfer itself cannot be ignored. In fact, for most big data applications, data transfer represents the single largest component of the total processing power, potentially taking half the total processing time just in the transfer (25). Bringing the processing or acceleration closer to the data is therefore crucial for big data processing. Several examples of integrating processing in the main memory exist (44; 45; 46). However, this brings up two issues: (1) compared to the last

level of memory, main memory is limited in capacity, and (2) data must still be brought to the main memory, which unnecessarily expends energy. By pushing the processing further down the hierarchy, a system is no longer limited by the main memory capacity, and data movement is greatly reduced. Therefore, an accelerator which is situated at the last level of memory, as shown in Fig. 6-1, can take full advantage of the analytics domain-specific architecture.

6.2.2 Analytics Applications

Common analytics applications, including classification (47; 48) and clustering (49), are evaluated in order to identify typical operations and communication requirements.

Neural networks, which are capable of both classification and clustering, are also evaluated. Finally, the concept of in-accelerator dataset compression/decompression, which can potentially reduce overall power consumption, is also explored.

6.2.2.1 Operations

In general, the most common datapath operations were addition, multiplication, shift, and comparison, which were common to classification and clustering applications, as well as neural network based approaches. Furthermore, both the neural networks and the clustering algorithms require evaluation of analytic functions which are readily implemented in lookup tables (LUTs). In particular, the logistic (sigmoid) and hyperbolic tangent functions are popular for non-linear activation stages in neural networks. For clustering, the distance computation (e.g. distance from the cluster mean/median/centroid/etc., or locating a nearest neighbor in hierarchical clustering) is critical, since it is repeatedly applied to all points in the dataset. Two common distance metrics are Euclidean (LUT, square and square root) and Manhattan distance (datapath, subtraction). In all instances studied, no more than four different non-datapath functions were required by the applications. In general, a very lightweight datapath, paired with sufficient lookup memory, enables the mapping of a wide range of analytics kernels.

6.2.2.2 Communication

To some extent, all applications exhibit a mix of Instruction Level Parallelism (ILP)

and Data Level Parallelism (DLP) which can be exploited by an amenable multicore

environment. Here, we refer to each core more generally as a Processing Element

(PE), and we assume inter-PE communication is available on-demand for illustrative purposes.

Classification: For NBC, training can be parallelized at the instruction level

among several PEs, as long as there is sufficient inter-PE bandwidth to communicate

copies of the partial probabilities. Once each PE has a copy of the probabilities,

independent classification (DLP) of each vector in a dataset is readily parallelized, and requires no inter-PE communication.

Clustering: For KMC, the PE communication requirement is higher than for NBC,

and could be significantly higher using another technique (e.g. agglomerative hier-

archical clustering). From the algorithm definition, we know the number of clusters is predetermined, and we assume that the initial locations of the cluster centers are

distributed along with a portion of the data among the PEs. Summary statistics, for ex-

ample, the local cluster means and number of cluster members, must be communicated

among the PEs so that each has a locally up-to-date copy of the current cluster means.

Neural Network: The PE communication requirement is highest for the neural networks. In the convolutional neural network model evaluated, at least two mappings are possible, exploiting either ILP or DLP. For example, individual PEs can perform the two dimensional convolution and nonlinear activation function independently for each feature map (DLP) given sufficient memory resources. Alternatively, a group of PEs can simultaneously evaluate the same network, as long as there is sufficient inter-PE bandwidth for data transfer .

In general, the applications evaluated are highly parallel, with varying communication requirements which depend strongly on the particular mapping strategy. With

smaller internal memories, individual kernel iterations must be mapped to multiple PEs, indicating a greater reliance on ILP, PE communication, and therefore higher bandwidth

requirements. In contrast, larger internal memories allow individual kernel iterations to

be mapped in individual PEs, reducing the overall inter-PE bandwidth requirements.

For a fixed die area, this translates to fewer PEs. From the communication analysis, the preferred model depends on the application mapping: classifications are more likely to

achieve higher throughput using a large number of small PEs, while KMC and CNN can

use either model depending on the desired latency and throughput. Finally, we consider

the data network: in the first case, for example, a classifier will only achieve its maxi-

mum throughput when sufficient data is available for processing. With a large number of PEs, this data distribution network increases in size and complexity. By having a smaller

number of PEs with larger memories, the complexity of this network decreases. For a

system to retain high performance and remain energy efficient, these factors must be

balanced.

6.2.3 Disk-to-Accelerator Compression

Next, we consider the role of dataset filtering or compression in the context of

big data analytics. Data filtering has been used previously at an enterprise level (20) to intelligently reduce the data volume by selectively transferring rows of structured data which met the query requirements. For unstructured data, it is unknown a priori where the salient data resides; this motivates us to instead use dataset compression to increase the effective disk-to-accelerator bandwidth. This is especially important in big data applications, where a reduction in data volume can result in lower transfer energy and latency if certain conditions are met, as discussed here.

Assuming that the data can be effectively compressed, decompression at the accelerator end must be lightweight and not negate the latency and energy saved from transferring less data. Determining if the application can benefit from in-accelerator dataset decompression therefore requires consideration of several variables, including

the amount of data to be transferred, the associated transfer energy per bit, and the energy and latency associated with the decompression routine, as summarized by the inequality Pcx + Pdr < Pdx, where Pcx refers to the compressed dataset transfer power, Pdr is the decompression routine power, and Pdx is the full dataset transfer power. Furthermore, we assume the data is compressed once, but accessed and processed an indefinite number of times, so the initial data compression energy is amortized and not considered in the inequality.

Figure 6-2. Trade-off between average compression routine length and achievable compression ratio

Using representative values obtained from the synthesis results (Section 5.4), we can observe the relation between compression routine cycles, compression ratio, and estimated overall energy savings in Fig. 6-2. Here, individual area plots represent different compression routines, using (from bottom to top) 15, 50, 100, 250, and 500 cycles on a given platform. Depending on the compression ratio, expressed as a percentage of space saved, the different routines achieve varying levels of total energy savings (I/O + Compute). Energy savings is achieved when the sum of transfer energy and decompression energy are less than the transfer energy for the uncompressed

dataset. In this case, 1 GB is transferred at 13 nJ/bit (25). This analysis is used when choosing the dataset compression algorithm for the accelerator.
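The inequality Pcx + Pdr < Pdx can be turned into a quick feasibility check such as the C sketch below. Only the 13 nJ/bit transfer energy and the 1 GB dataset size come from the text; the compression ratio and the per-byte decompression energy are illustrative assumptions.

    #include <stdio.h>

    int main(void)
    {
        /* The 13 nJ/bit figure is from the text; the other two parameters
         * are assumed values used only to exercise the inequality.          */
        const double dataset_bits   = 8e9;     /* 1 GB                        */
        const double e_transfer_bit = 13e-9;   /* J per bit transferred       */
        const double compression    = 0.40;    /* 40% space saved (assumed)   */
        const double e_decomp_byte  = 5e-9;    /* J per decompressed byte (assumed) */

        double p_dx = dataset_bits * e_transfer_bit;                       /* full transfer */
        double p_cx = dataset_bits * (1.0 - compression) * e_transfer_bit; /* compressed    */
        double p_dr = (dataset_bits / 8.0) * e_decomp_byte;                /* decompression */

        printf("uncompressed transfer: %.1f J\n", p_dx);
        printf("compressed + decode:   %.1f J\n", p_cx + p_dr);
        printf("worth compressing:     %s\n", (p_cx + p_dr) < p_dx ? "yes" : "no");
        return 0;
    }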

Huffman coding (50) is chosen to compress the data prior to transfer. Decompression is implemented as a table lookup, and can be completed in a single cycle, while data setup and processing contribute an extra 13 cycles/byte. Furthermore, in cases where the data output size is of the same magnitude as the data input size, it is beneficial to compress the result using a viable technique. As a reconfigurable framework,

several compression algorithms (among them Huffman coding) are supported.
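To illustrate why table-lookup decoding is cheap, the C sketch below decodes a toy four-symbol Huffman code (A=0, B=10, C=110, D=111) with one table lookup per symbol. The code, table contents, and bit packing are invented for illustration only; the accelerator's actual LUT contents and per-byte cycle counts are those reported above, not derived from this sketch.

    #include <stdio.h>
    #include <stdint.h>

    /* Decode table indexed by the next 3 bits of the stream: each entry gives
     * the decoded symbol and how many bits to consume -- one lookup per symbol. */
    typedef struct { char sym; uint8_t len; } entry_t;

    static const entry_t table[8] = {
        {'A',1},{'A',1},{'A',1},{'A',1},   /* 0xx       -> A */
        {'B',2},{'B',2},                   /* 10x       -> B */
        {'C',3},{'D',3}                    /* 110, 111  -> C, D */
    };

    int main(void)
    {
        /* "ABACD" encoded as 0 10 0 110 111, packed MSB-first. */
        const uint8_t stream[] = { 0x4D, 0xC0 };
        const int nsyms = 5;
        uint32_t bitpos = 0;

        for (int s = 0; s < nsyms; s++) {
            uint32_t idx = 0;
            for (int b = 0; b < 3; b++) {              /* peek next 3 bits */
                uint32_t p = bitpos + b;
                uint32_t bit = (stream[p / 8] >> (7 - (p % 8))) & 1u;
                idx = (idx << 1) | bit;
            }
            entry_t e = table[idx];
            putchar(e.sym);
            bitpos += e.len;
        }
        putchar('\n');   /* prints ABACD */
        return 0;
    }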

6.3 In-Memory Analytics Acceleration

In this section, we provide a broad system-level overview, and describe in detail the hardware architecture, including the PE microarchitecture, the hierarchical interconnect, the application mapping strategy, supported parallelism and execution models, and finally system-level memory management.

6.3.1 System Organization

From a system level perspective, the in-memory analytics accelerator is situated at

the last level of memory, between the last level memory device, and the I/O controller,

as shown in Fig. 6-1. This is in contrast to previous work which situates processing at

the main memory level (46), or within the last level memory itself, requiring a memory

redesign (27). Sitting at the I/O interface, the accelerator intercepts and relays I/O commands as needed for typical operation. Acceleration mode is initiated when the host issues a read command to a specific disk location containing configuration information. The host can specify a set of file locations or directories which contain the data to be processed.

Using standard I/O transfers, the accelerator can read these files, distribute data, and resume processing as specified by the particular configuration.

6.3.2 Accelerator Hardware Architecture

The accelerator is comprised of a set of single-issue RISC-style processing elements. Multiple PEs are interconnected with a two level hierarchy which balances bandwidth with scalability, as described here:

6.3.2.1 Processing element architecture

Each PE contains a lightweight integer datapath, data memory, scratch memory, an instruction (``schedule") table, and support for multi-input, multi-output lookup operations, as shown in Fig. 6-3. The datapath supports typical operations, including addition/subtraction, shifting, comparison and equality operators, and fixed point multiplication, all of which are common to the evaluated kernel classes, making this lightweight datapath tailored to the analytics domain. Instructions are encoded in 32 bits, and 256 instructions can be mapped to each PE. Based on the communication and data distribution requirements discussed in Section 6.2.2, a total of 4 kB lookup and data memory is available. At least 1 kB is reserved for four LUTs, another 1 kB each to the dedicated data input and output buffers, and up to 1 kB for scratch memory. A portion of the scratch memory can also be used for LUTs if more functions are needed. Data input/output buffers and scratch memories are addressed using a two bit bank offset and eight bit address in memory instructions. Similarly, a LUT offset is determined at compile time by the particular function it implements and its physical placement during kernel mapping. The PE specification is summarized in Table 6-1.
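A small sketch of the bank-plus-offset addressing described above is given below. The bank assignment, the 32-bit word size, and the enum names are assumptions made only for illustration; the text specifies just the 2-bit bank offset, the 8-bit address, and the roughly 1 kB bank sizes.

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed bank encoding for the 2-bit bank offset. */
    enum { BANK_IN = 0, BANK_OUT = 1, BANK_SCRATCH = 2, BANK_LUT = 3 };

    typedef struct {
        uint32_t in_buf[256];    /* 1 kB data input buffer (assuming 32-bit words) */
        uint32_t out_buf[256];   /* 1 kB data output buffer                        */
        uint32_t scratch[256];   /* 1 kB scratch / extra LUT space                 */
        uint32_t lut[256];       /* 1 kB, holding four 256 B LUTs                  */
    } pe_mem_t;

    /* A load decodes the 2-bit bank offset and the 8-bit word address. */
    static uint32_t pe_load(const pe_mem_t *m, uint8_t bank, uint8_t offset)
    {
        switch (bank & 0x3) {
        case BANK_IN:      return m->in_buf[offset];
        case BANK_OUT:     return m->out_buf[offset];
        case BANK_SCRATCH: return m->scratch[offset];
        default:           return m->lut[offset];
        }
    }

    int main(void)
    {
        static pe_mem_t m = { .lut = { [42] = 0xCAFE } };
        printf("0x%X\n", (unsigned)pe_load(&m, BANK_LUT, 42));   /* prints 0xCAFE */
        return 0;
    }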

6.3.2.2 Interconnect architecture

From the application analysis, we observed a range of communication requirements, and consequently chose a suitable interconnect architecture to match. With

4 kB/PE, applications like clustering and neural networks will require significant inter-PE communication, while applications like classification can achieve high throughput. To accommodate the various communication requirements, we use a two level hierarchical

Figure 6-3. Memory-centric processing element microarchitecture

Table 6-1. Configuration for the general analytics accelerator.

Property                      Description
Processing Elements           Single issue
  Integer ALU                 Add/Sub/Shift/Comp.
  Fixed Point Multiplier      8 bit
  General Purpose Registers   32 x 32 bit
  Instruction Memory          256 x 32 bit
Memory Organization
  I/O Memory Buffers          2 kB
  LUT Memory                  1 kB
  Shared LUT/Scratch Memory   1 kB
Interconnect                  Two level
  Cluster                     4 PE, Fully Connected
  Intercluster                Mesh

interconnect, as shown in Fig. 6-4. At the lowest level, groups of four PEs are fully connected in a cluster, which provides maximum on-demand communication between local

PEs, satisfying the bandwidth requirements for clustering and neural networks. Each PE has a dedicated 8 bit output, to which all other PEs in the cluster connect, giving us a 32 bit intra-cluster interconnect. Data written to the bus is available on the next

clock cycle; at the maximum operating frequency of 1.27 GHz, this provides a maximum ~40.6 Gbps local link between PEs.

Figure 6-4. The routerless, two level hierarchy with a 4 PE cluster and 2D mesh

For the second hierarchical layer, we implement a routerless two dimensional mesh interconnect; individual PEs within the cluster are responsible for communication with their inter-cluster neighbor. Specifically, the top left PE in one cluster communicates with the bottom right PE in the other cluster; similarly, the top right PE in one cluster communicates with the bottom left PE in the other, as shown in Fig. 6-4. Communication is two way, with one dedicated input and output bus from each PE to the inter-cluster mesh. For the applications evaluated, clustering and neural networks are the most likely to utilize the inter-cluster bus; other analytics applications too large to map into one cluster, or for which lower kernel latency is required, can also make use of this resource. Furthermore, we note that the mesh reduces physical distances and wire length between clusters, making it scalable to larger numbers of clusters. It is preferable to global bus lines, which are commonly used in other frameworks.

6.3.3 Application Mapping

We developed a software framework for automatic application mapping into

the proposed framework by modifying the general purpose MAHA mapper tool (27), accounting for differences in the interconnect architecture and supported instruction set.

The tool is written in ``C" and maps applications according to the flow shown in Fig. 6-5.

An input kernel is first decomposed into a set of subgraphs, which are opportunistically fused to reduce overall instruction count. The fused operations are mapped to each

PE under the given resource constraints using Maximum Fanout Free Cone and Maximum Fanout Free Subgraph heuristics (51). Operation groups (including LUTs and

datapath operations) are mapped to individual PEs during placement, and distributed

in such a way that communication between PEs is minimized, following the specific

interconnect architecture. Finally, the communication schedule is generated which statically schedules intra- and inter-cluster communications among PEs. In this manner,

the software tool generates the configuration file used to program the accelerator.

Several operation classes are supported, including bit-sliceable operations such

as logic or arithmetic, as well as complex (LUT-based) operations. Memory access

instructions, including those for requesting new data from the data manager, are inserted in appropriate locations in individual PEs as well as the gateways between

clusters. Thus, this flexible software framework complements the reconfigurable nature

of the hardware, enabling the mapping of different sizes of applications to different

hardware configurations.

6.3.4 Parallelism and Execution Models

We note that the described hardware architecture offers a high degree of flexibility for kernel execution. Firstly, because each PE retains its own local instruction and data caches, there is implicit support for both SIMD and MIMD execution. Secondly, there is flexibility in how the analytics applications can be mapped. As noted from the application analysis, these applications can generally exploit different types of parallelism, including instruction-level (ILP) and data-level (DLP). The particular execution model is in part determined by the application mapping. For example, four iterations of NBC can be mapped to a single cluster in several ways; assuming four vectors comprised of four 8 bit variables ⟨A0B0C0D0, ..., A3B3C3D3⟩ executing on ⟨PE0, ..., PE3⟩, we can have:

1. Each PE independently processes each vector (DLP).

2. Two PEs ⟨PE0, PE1⟩ coordinate to process a single vector (ILP/DLP).

3. Four PEs ⟨PE0, ..., PE3⟩ coordinate to process a single vector (ILP).

Cases 2 and 3 require use of the intra-cluster bus, while Case 1 does not. Furthermore, the MIMD execution model also enables task-level parallelism. For example, multiple PEs can be pipelined, where one is responsible for dataset decompression or preprocessing, and the other three within the cluster handle the primary kernel using any of the above execution models.

Figure 6-5. Software flow for the modified MAHA mapper tool

Figure 6-6. Schematic of the system-level memory management

6.3.5 System-level Memory Management

Data management is a crucial aspect of big data processing, and ensuring that PEs

are not starved while maintaining efficient and scalable on-chip data transfer is key. In

the proposed accelerator, a system memory manager is responsible for transferring data streamed in by the onboard SATA/SAS controller (Fig. 6-6). Configurable Read and

Write memories contain the data schedule, determined at compile time. The memory

manager intercepts SATA commands from the OS, including what files or memory

locations to retrieve, and where results can be safely written back.

Within each cluster, a simple data router called the Data I/O Block (DIOB) (Fig. 6-6) accepts a directed data transfer and distributes it to the correct Cluster and PE. Directed data transfers are preceded by a 9 bit header containing an MSB of '1' (header ID), a 5 bit Cluster ID, and a 2 bit PE ID; after that, each data byte is automatically forwarded to the appropriate PE. Note that 1 bit is reserved as a ``continuation" bit which will allow expansion by using multiple bytes to specify the Cluster ID. Data transfer is localized to rows of Clusters; DIOBs cannot route data to clusters in their columns. Instead, the main data manager, shown in Fig. 6-4, initiates row-wise transfers as data is received.
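A minimal C model of the directed-transfer header is sketched below. Only the field widths come from the description above; the exact bit ordering within the 9 bit header (header ID in the MSB, then the continuation bit, the 5 bit Cluster ID, and the 2 bit PE ID) is an assumption made for illustration.

#include <stdint.h>
#include <stdio.h>

/* Assumed layout of the 9 bit directed-transfer header (illustrative only):
 *   bit 8    : header ID, always 1
 *   bit 7    : continuation bit (1 = more Cluster ID bits follow)
 *   bits 6-2 : 5 bit Cluster ID
 *   bits 1-0 : 2 bit PE ID                                             */
static uint16_t pack_header(unsigned cluster, unsigned pe, unsigned cont)
{
    return (uint16_t)((1u << 8) | ((cont & 1u) << 7) |
                      ((cluster & 0x1Fu) << 2) | (pe & 0x3u));
}

static void unpack_header(uint16_t h, unsigned *cluster, unsigned *pe, unsigned *cont)
{
    *cont    = (h >> 7) & 1u;
    *cluster = (h >> 2) & 0x1Fu;
    *pe      = h & 0x3u;
}

int main(void)
{
    unsigned c, p, k;
    uint16_t h = pack_header(18, 3, 0);
    unpack_header(h, &c, &p, &k);
    printf("cluster=%u pe=%u cont=%u\n", c, p, k); /* cluster=18 pe=3 cont=0 */
    return 0;
}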

Internally, dedicated data lines to and from the PEs are used for I/O, rather than reusing the intra-cluster bus, allowing data transfer to overlap with execution. Each PE contains two data buffers, Front and Back (labeled ``F" and ``B" in Fig. 6-6), which are reserved for I/O. If the data can be processed in a single iteration, and the output data size matches the input data size, the results can be read from the back buffer while data in the front buffer is processed. When ready, the PE swaps the roles of the buffers.

However, for clustering and neural networks, the intermediate values may not be of interest, or in other cases, the data output sizes may be significantly smaller than data

input. In these cases, scratchpad memory can hold intermediate results, and the data

manager does not need to read back values between computations. This allows the

system to efficiently coordinate the parallel processing of a large volume of data. This memory management scheme is flexible enough to contend with the various

uses of the fabric, including cases where PE data transfer requirements differ between

Clusters or even between PEs (e.g. ILP or TLP modes). Static scheduling is leveraged

to ensure PEs have sufficient data to process, and that there are no contentions for the

data I/O bus. By having a single DIOB per cluster and not relying on global data lines, the architecture can scale to an arbitrary number of clusters.
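The front/back buffering described above can be modeled in a few lines of C. The buffer size matches the PE specification, but the explicit pointer swap is illustrative only; in hardware the roles are exchanged with a mode bit rather than by exchanging pointers.

#include <stdint.h>
#include <string.h>

#define BUF_BYTES 1024            /* 1 kB per buffer, as in the PE spec */

struct pe_io {
    uint8_t buf_a[BUF_BYTES];
    uint8_t buf_b[BUF_BYTES];
    uint8_t *front;               /* currently being processed          */
    uint8_t *back;                /* being drained/filled by the DIOB   */
};

static void pe_io_init(struct pe_io *io)
{
    memset(io, 0, sizeof *io);
    io->front = io->buf_a;
    io->back  = io->buf_b;
}

/* Called when the PE finishes one iteration: results now sit in what was
 * the front buffer, so the roles are exchanged and the data manager can
 * drain the new back buffer while processing continues on fresh input.  */
static void pe_io_swap(struct pe_io *io)
{
    uint8_t *t = io->front;
    io->front  = io->back;
    io->back   = t;
}

int main(void)
{
    struct pe_io io;
    pe_io_init(&io);
    pe_io_swap(&io);              /* overlap I/O with the next iteration */
    return 0;
}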

6.4 Methods

In this section, we describe the experimental setup and the FPGA-based hardware emulator used for functional and timing validation, and provide implementation details of the three representative kernels on the accelerator.

6.4.1 Experimental Setup

We obtain initial functional and timing verification of the accelerator using the hardware emulation platform described in Section 6.4.2. For CPU and GPU implementations, we use a desktop system running Ubuntu server 12.04 x64, with an Intel Core2 Quad Q8200 2.33 GHz CPU (52), 8 GB of 800 MHz DDR2 memory, and a 384-core NVIDIA Quadro K2000D GPU (53). The same datasets were used for functional verification on the emulator, the CPU, and the GPU.

For CPU results, NBC and KMC were written in C and optimized with processor-specific -march flags and -O3 compiler optimization. Latency results were obtained using the C function clock_gettime, with the CLOCK_PROCESS_CPUTIME_ID high-resolution clock specifier (54). CNN code was obtained from Theano (55) for the sample network provided in the documentation. For GPU results, the NBC and KMC kernels were written in C with the CUDA extensions and compiled with the NVIDIA C Compiler (NVCC). Latency results were obtained using the built-in NVIDIA profiling tools for high-precision kernel and PCIe data transfer timing (56). CNN latency was obtained using high-resolution timing functions available in Python.
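For reference, the CPU latency measurements follow the standard clock_gettime pattern; the sketch below shows only the timing harness (the kernel call is a placeholder, not the actual benchmark code).

#include <stdio.h>
#include <time.h>

static void run_kernel(void) { /* placeholder for the NBC or KMC kernel */ }

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    run_kernel();
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("kernel latency: %.3f ms\n", ms);
    return 0;
}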

Energy results were derived from the latency and the estimated power consumption of the CPU, based on 25% of the maximum TDP (95 W), scaled by the number of active cores and threads (1 of 4), for a conservative power estimate of 6 W. This estimate favors the CPU, as the actual power consumption is likely to be greater than 6 W. Similarly, we estimate GPU power consumption based on the number of active cores and threads; since GPU kernels were launched with sufficient blocks and threads per block to ensure maximum occupancy and core utilization, the full TDP (51 W) is used. For isoarea comparisons, the CPU area is taken to be 1/4 of the die shown in (57), which we estimate to be 25 mm2 at 45 nm.
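Spelled out, the conservative CPU power estimate is simply the TDP scaled by the assumed activity factor and the fraction of active cores:

    P_CPU ~ 0.25 x 95 W x (1 active core / 4 cores) ~ 5.9 W ~ 6 W

while the GPU estimate uses the full 51 W TDP, since the kernels were launched to keep all cores occupied.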

The accelerator was also implemented in Verilog (the same used in the emulation platform). This was synthesized using Synopsys Design Compiler and a 90 nm cell library. The maximum clock frequency of 1.27 GHz, as reported by the synthesis tool, was used. Power, performance, and area estimates for the non-memory components were extracted from this model. CACTI (58) was used for the power, performance, and area of the memories, also estimated for a 90 nm process.

Figure 6-7. The hardware emulation platform. Photograph courtesy of author.

6.4.2 Functional Verification

The proposed accelerator was functionally verified using the hardware emulation

platform pictured in Fig. 6-7. Two separate FPGA boards were used: first, a Nios II/f softcore processor (59) was mapped to a Terasic DE0 with Cyclone III FPGA, labeled

``Host"; second, the accelerator and flash interface were mapped to a Terasic DE4 with

Stratix IV FPGA, labeled ``SSD". The flash interface facilitates communication between

the ``LLM", a 2 GB SD card, and the accelerator. The two boards communicate over SPI

using the GPIO ports. The hardware was developed in Verilog and compiled using the Altera Quartus

II software (60) with optimizations for speed. A single cluster (4 PEs) is used in the

hardware prototype, though for synthesis and simulation results, a two cluster (8 PE)

implementation was used. The three kernels, NBC, CNN, and KMC, were mapped to the framework using the techniques highlighted in Section 6.3.3. For NBC and KMC, randomly-generated 1 GB datasets were processed; for CNN, samples from the MNIST handwritten digit dataset were used instead.

Figure 6-8. Instruction mix in the mapped analytics applications. A) Breakdown of lookup table and datapath instructions for kernels mapped to the framework. B) Breakdown of the four most common datapath (non-memory access) operations.

For the emulated CPU processing, data is transferred from the ``SSD" to the

``Host" over the SPI connection, emulating the functionality of the SATA interface. After processing, data is similarly written back to the ``SSD". The results are then offloaded to a desktop system, the SD card is reformatted, and the original dataset is reloaded.

Accelerator-based processing follows, initiated by the ``Host" by sending the kernel specification and data location. This initiates the transfer and processing stages. Data is written back to designated locations once processing is complete, and results are again offloaded for analysis. For all three kernels, accelerator output and CPU output were compared and found to match. Fig. 6-8(a) shows the average instruction breakdown for the LUT versus Datapath, and Fig. 6-8(b) shows the same for the four most frequent datapath (non-memory access) instructions.

6.4.3 Analytics Kernels

6.4.3.1 Classification

For testing, we implement Naïve Bayesian Classification (NBC), though the principles apply similarly to classification using SVM (61), especially the linearly separable case. NBC makes heavy use of the available datapath operations, but does not require any of the scratchpad memory; the register file is used to store intermediate results. Training is performed off-line (e.g. on the host) on a small subset of the data, and the probabilities are stored as 8-bit fixed point. No complex functions are used, and no inter-PE or inter-cluster communications are required.

6.4.3.2 Neural network

As with NBC, the neural network implementation (specifically, Convolutional Neural

Network (CNN)) also requires two stages, one for training, and the other for evaluating the input. The input is a set of small gray scale images containing handwritten digits from the MNIST database (62). The network consists of convolutional layers with nonlinear subsampling operations immediately following. The test network is based on the example CNN from Theano (55).

Input images are stored in 8-bit grayscale in the data memory, and a copy of each image, along with the filter coefficients, is distributed to each PE. The full convolution operation is evaluated over multiple cycles using a multiply-accumulate unit. A non-linear activation function, tanh, is approximated with lookups, using the most significant bit as the sign bit and the remaining bits for the fractional part. Portions of the same image are processed in parallel within a single 4 PE cluster, and output data are shared among the PEs upon completion of a single stage. Like KMC, a mix of datapath and lookup operations is used. While this implementation requires significantly more intra-cluster communication, individual network evaluation latency is decreased, and clusters can independently evaluate different input data. We note that the approach applied to CNN can be applied to a more general Artificial Neural Network (ANN), where the primary difference will be increased utilization of the intra-cluster bus.
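A software model of this style of LUT-based tanh approximation is sketched below. The 8 bit input format (sign bit plus seven fractional bits, inputs assumed clamped to the interval (-1, 1)) and the Q0.7 output format are assumptions chosen to match the description, not the exact hardware encoding.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define FRAC_BITS 7
#define LUT_SIZE  (1 << FRAC_BITS)

static uint8_t tanh_lut[LUT_SIZE];   /* tanh(x) for x in [0,1), stored as Q0.7 */

static void build_lut(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        double x = (double)i / LUT_SIZE;
        tanh_lut[i] = (uint8_t)lround(tanh(x) * LUT_SIZE);
    }
}

/* Input: sign bit in the MSB, magnitude fraction in the low 7 bits.
 * Output: signed Q0.7 approximation of tanh, exploiting tanh(-x) = -tanh(x). */
static int16_t tanh_approx(uint8_t in)
{
    int16_t y = tanh_lut[in & (LUT_SIZE - 1)];
    return (in & 0x80) ? (int16_t)-y : y;
}

int main(void)
{
    build_lut();
    printf("tanh(0.5) ~ %d/128 (exact %.3f)\n", tanh_approx(64), tanh(0.5));
    return 0;   /* compile with -lm */
}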

6.4.3.3 Clustering

Parallel implementations of KMC have been investigated with respect to large

datasets using MapReduce (43), GPUs (35), and other multi-processor, multi-node

machines (63). We implement parallel k-means clustering (KMC) on the accelerator, testing two scenarios, in which (1) one processing cluster is used, and (2) multiple processing clusters are used. The number of processing clusters is not solely a function of the number of vectors in the dataset; additional consideration is given to the desired balance between spatial and temporal computing, in particular, the energy/latency trade- off. In the first case, communication within the cluster occurs after each point in local

PE memory has been assigned to a group and the group summary statistics have been computed. The same principles apply to the second case, except summary statistics must be additionally shared between processing clusters.

During the KMC operations, a mix of datapath and lookup operations is used. Namely, the Euclidean distance measure requires square and sqrt functions, which are stored in fixed point in the LUTs, in addition to datapath addition. Magnitude comparisons are also performed using datapath operations.
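A short C sketch of this lookup-based distance computation is given below. It models only the squared-difference lookup and the datapath accumulation; the sqrt step used on the accelerator (also a fixed point LUT) is omitted for brevity, and the 4-dimensional 8 bit vectors are an assumption for illustration.

#include <stdint.h>
#include <stdio.h>

#define DIMS 4                      /* assumed: 4 x 8 bit features per vector */

static uint16_t sq_lut[256];        /* x*x for every 8 bit x, built once      */

static void build_sq_lut(void)
{
    for (int x = 0; x < 256; x++)
        sq_lut[x] = (uint16_t)(x * x);
}

/* Squared Euclidean distance using the square LUT plus datapath addition. */
static uint32_t dist2(const uint8_t *a, const uint8_t *b)
{
    uint32_t acc = 0;
    for (int d = 0; d < DIMS; d++) {
        int diff = a[d] - b[d];
        if (diff < 0) diff = -diff;
        acc += sq_lut[diff];
    }
    return acc;
}

int main(void)
{
    const uint8_t point[DIMS]    = { 10, 20, 30, 40 };
    const uint8_t centroid[DIMS] = { 12, 18, 33, 44 };
    build_sq_lut();
    printf("squared distance = %u\n", dist2(point, centroid)); /* 4+4+9+16 = 33 */
    return 0;
}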

6.4.4 Dataset Compression

Dataset compression using Huffman-based entropy encoding is employed to

reduce data transfer volume. Decompression is straightforward, implemented as a table lookup, in total requiring 14 cycles, 1 for the actual lookup, and the remainder for

data management and formatting for the NBC, KMC, or CNN routines. Compression

is slightly modified, using padding to enable the efficient lookup. For example, in the case of the NBC data, the longest Huffman code was determined to be 4 bits, and the shortest was 2 bits. Codes shorter than 4 bits are left-aligned in a half byte, and the entries for all remaining bit combinations are padded with the same symbol, so a single 4 bit lookup resolves any code. For code lengths greater than 4 bits, an extra 2 cycles are needed to track intermediate values.
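The padded-table decoding scheme can be illustrated with the short C sketch below. The 3-symbol code and the bit ordering are hypothetical; only the idea of left-aligning short codes and replicating them across the unused 4 bit index combinations comes from the description above.

#include <stdint.h>
#include <stdio.h>

#define MAX_LEN 4                    /* longest code, as in the NBC example */

struct entry { uint8_t symbol; uint8_t len; };

static struct entry table[1 << MAX_LEN];

/* Fill every 4 bit index whose leading bits match the left-aligned code;
 * this replication is the padding step described in the text.           */
static void add_code(uint8_t code, uint8_t len, uint8_t symbol)
{
    unsigned base = (unsigned)code << (MAX_LEN - len);
    for (unsigned pad = 0; pad < (1u << (MAX_LEN - len)); pad++)
        table[base + pad] = (struct entry){ symbol, len };
}

int main(void)
{
    /* Hypothetical code: A=00, B=01, C=10 (2 bits), D=110 (3), E=1110, F=1111 (4) */
    add_code(0x0, 2, 'A'); add_code(0x1, 2, 'B'); add_code(0x2, 2, 'C');
    add_code(0x6, 3, 'D'); add_code(0xE, 4, 'E'); add_code(0xF, 4, 'F');

    const uint8_t bits[] = { 1,1,0, 0,0, 1,1,1,0 };  /* encodes D A E */
    unsigned pos = 0, n = sizeof bits;
    while (pos < n) {
        unsigned idx = 0;
        for (unsigned i = 0; i < MAX_LEN; i++)       /* peek up to 4 bits */
            idx = (idx << 1) | (pos + i < n ? bits[pos + i] : 0);
        struct entry e = table[idx];
        putchar(e.symbol);
        pos += e.len;                                /* consume only the code */
    }
    putchar('\n');                                   /* prints DAE */
    return 0;
}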

As a reconfigurable framework, the user has the freedom to choose one of several compression/decompression algorithms and implementations; the choice depends on the data and which algorithm can provide the best compression ratio. Run length coding or an LZ variant (64; 65) can be used, as long as the inequality (Section 6.2.3) holds.

6.5 Results

The implementations of all three applications resulted in identical outputs for the proposed accelerator, CPU, and GPU, confirming functional correctness. The following is a comparison of throughput (at iso-area) and energy-efficiency (at iso-throughput) among the three platforms.

6.5.1 Throughput

We begin the throughput analysis by comparing the raw throughput of applications

on the target platforms (Fig. 6-9(a)). Both the GPU and accelerator outperform the CPU

considerably, about 6x on average. While the CPU implementation was single threaded, a multi-threaded/multi-core implementation is unlikely to bridge the performance gap;

even if we assume CPU performance to scale linearly with the number of cores for all

applications -- an unlikely scenario -- this would at most achieve 66% of the performance of the other systems (assuming a quad-core CPU), and would certainly increase power consumption. Interestingly, when considering raw throughput, the GPU outperforms the accelerator by about 1.2x for NBC, and the two are nearly equivalent for CNN, but for KMC, the

accelerator outperforms the GPU by about 1.6x. This can be explained by considering

the GPU warp-based execution model: NBC achieves higher performance through the

use of coalesced memory accesses; meanwhile, KMC suffers from both uncoalesced memory accesses and significant branch divergence, leading to poor performance (66).

Iso-Comparison: Given the architectural differences between the processing elements in the three platforms, we provide two iso-comparisons in Figs. 6-9(b) and (c), which show Throughput per Core and isoarea throughput, respectively, in Megabits per second (Mbps). The main takeaway is the effect of core size and domain specificity: the

CPU (a ``heavyweight" core) yields the highest per-core throughput, while the GPU (a

``lightweight" core) yields the lowest. Between them is the accelerator, which provides

moderate per-core throughput. Considering area efficiency, we observe relatively low average isoarea throughput

for the general purpose platforms (CPU, GPU), whereas the accelerator yields significantly higher results. This is primarily due to the domain specificity, both the inclusion of functional units which are related and useful to the particular domain, and the

exclusion of functional units that are less relevant. In other words, the accelerator cores are more focused towards the target domain, resulting in a higher percentage of useful

area on the chip.

Application Level: We generally observe the highest average isoarea throughput for NBC in all platforms, followed by KMC, and finally CNN. There are two reasons for this difference: first, the average number of operations applied to each byte during processing varies from NBC (least) to CNN (most); second, the inter-cluster communication overhead, which similarly varies from NBC (no inter-cluster communication) to KMC (frequent communication, but a small amount of data), and finally CNN (frequent, with a large amount of data). Generally, applications highly amenable to parallelization (DLP) will benefit the most from a larger number of cores, up to a point (24), which in this case favors the GPU. The mesh inter-cluster interconnect

supports greater throughput for small, frequent data transfers, leading to a significant

performance increase for KMC when using the accelerator.

Effect of Compression: We conclude our discussion of throughput by observing the effect of dataset compression on performance. Fig. 6-9(d) shows that on average,

the CPU was adversely affected, the GPU was unaffected, and that only the proposed

accelerator experienced an increase in raw throughput by compressing the dataset.

Figure 6-9. Comparison of the power, performance, and efficiency of the three platforms for general analytics applications. A) Comparing raw throughput shows the accelerator outperforms both the CPU and the GPU, but architectural differences such as B) throughput per core shows the effect of core complexity on performance. C) Isoarea throughput provides a normalized performance per unit area comparison that demonstrates the effects of architectural specialization on efficiency; D) shows the improvement in raw throughput from compression, while E) and F) show the accelerator is the only architecture that has an improvement in energy efficiency from compression.

Generally, we can expect to see an increase in throughput only when the decompression routine overhead is less than the difference in transfer latency between the compressed and uncompressed datasets (Fig. 6-2). This was not the case for any of the CPU kernels, and therefore the CPU experienced the greatest decline in performance among the three platforms. Note that among the kernels, the lowest decline was experienced by NBC; in fact, across all three platforms, NBC experienced the highest performance improvement relative to the other kernels, while KMC experienced the lowest. This is primarily due to the high compression ratio achieved for the NBC dataset and the low compression ratio achieved for KMC.

The GPU saw a significant increase in throughput for the NBC kernel, and a steep decline for the others. Compared with CPU, the decline in performance for CNN and

KMC is far more drastic on the GPU. The same factors affecting the raw throughput

without compression are exacerbated by the addition of the decompression routine,

which itself suffers from warp divergence. Finally, the accelerator demonstrates the only positive average improvement due to dataset compression among the three

platforms, which indicates the decompression routine was sufficiently low overhead on

this platform.

6.5.2 Energy Efficiency

For big data analytics, throughput is only one aspect. Arguably the more critical design consideration, especially in the coming years, is the energy efficiency of the

system. Fig. 6-9(e) shows the energy delay product (EDP) of the applications on the three platforms. Note that a lower EDP implies greater energy efficiency. The CPU appears as the least efficient platform on average, followed by the GPU. The accelerator achieves the design goal of high energy efficiency, and is on average two orders of magnitude more energy efficient than either the CPU or GPU.

From an application standpoint, the most energy efficient on all three platforms was NBC, which is primarily due to the small size of the kernel. For the accelerator, this results in 61x and 11x improvement versus the CPU and GPU; for CNN, we observe improvements of 84x and 89x versus CPU and GPU; and for KMC, we observe improvements of 798x and 67x versus CPU and GPU, respectively. The relative EDP improvement over CPU, even when comparing to the GPU, is quite large, and primarily stems from the already significant difference in compute energy, and ultimately from the difference in compute latency, between the CPU and other platforms. On the accelerator, the mix of lookup operations and customized datapath, as well as the reduced transfer energy and latency, result in an average 212x improvement over single-threaded CPU and 74x improvement over GPU. On average, the majority of this

improvement (> 80%) is attributed to the low power, domain-specific architecture and interconnect network, while the remainder derives from energy saved in data transfer.

Efficiency and Dataset Compression: Finally, we consider the effect of dataset compression on the overall energy efficiency of the system (Fig. 6-9(f)). Similar to the effect on raw throughput, we observe the proposed accelerator is the only platform to benefit from the compression. This is primarily driven by the EDP improvement of

NBC, but CNN also demonstrates a slight efficiency improvement. However, the low compression ratio for KMC yields no increase in efficiency on any platform.

6.6 Discussion

In this section, we discuss different aspects of the experimental results with regard to benchmark performance, the memory-centric architecture, and the application scope, and provide a thorough differentiation from related work.

6.6.1 Performance and Parallelism

All three benchmark applications performed well on the accelerator in terms of throughput and energy efficiency. Moreover, the accelerator can rival the much larger

GPU, even in raw throughput, while using significantly less power. Depending on the mapping, the application kernels are amenable to ILP and DLP. The high-bandwidth intra-cluster communication network allows greater flexibility in the mapping, so a wide range of kernels can take advantage of any inherent parallelism. This was observed for NBC, which can exploit ILP in training, and DLP in classification. CNN and KMC can both benefit from a mix of these two models as well. Finally, task-level parallelism can also be exploited; for example, pipelining the dataset decompression and dataset processing in adjacent PEs or clusters as needed. This task-level parallelism can even be taken one step further, running multiple classifiers (NBC and SVM, for example) or other statistical models simultaneously for forecasting.

6.6.2 Scalability

The results presented used an eight PE (two cluster) version of the accelerator. With more clusters, we expect to see a performance increase while maintaining comparable levels of energy efficiency, given the scalable interconnect architecture. In contrast to other Coarse Grained Reconfigurable Arrays (CGRAs) such as RaPiD (67), MATRIX (68), or MorphoSys (69), which contain some form of global bus along rows or columns at the top level of hierarchy, the second level interconnect of the accelerator uses a routerless mesh to provide adequate bandwidth for the analytics applications while keeping physical connections between PEs short.

Generally, the performance of any accelerator for big data will have some inherent limitations due to available I/O bandwidth, since it remains constant even when adding additional memory capacity. However, one can assume that the data already resides in the last level memory of the system. With a typical platform, this data will have to be read out of the disk, passing through the system I/O controller, to be processed by the CPU or GPU. Conversely, the proposed system places the accelerators between the last level of memory and the I/O controller. This results in two use cases: 1) each accelerator can operate independently of the others with limited host intervention, using data on the disk to which they are connected (typical of kernels amenable to map/reduce), or 2) accelerators must communicate intermediate results, in which case host intervention is required. In the first case, the physical limitations of the system

(e.g. pin count) would not cause a bottleneck, because processing is distributed among the disks. In the second case, system performance would be reduced only if the communication requirement exceeds that available for the system. If possible, applications should be mapped in such a way that only intermediate results (e.g. summary statistics) need to be communicated, in order to minimize the negative impact.

Furthermore, note that effective data compression can go a long way in addressing the limitation imposed by memory bandwidth. Since our reconfigurable accelerator is

amenable to efficient implementation of on-chip decompression/compression logic, as described in Section 6.4.4, it can significantly help in reducing the bandwidth requirement for diverse application kernels. Our results show that even a simple compression routine can significantly reduce the input data size, leading to an improvement in both energy-efficiency and scalability for handling larger datasets.

6.6.3 Transfer Energy and Latency

Hiding data transfer latency is ubiquitous in modern computing, used in caching systems and employed by off-chip accelerators (FPGAs and GPUs) to overlap processing with data transfer. Frameworks like CUDA Streams (70) provide higher levels of

abstraction to programmers for hiding transfer latency. Modern GPU hardware contains on-board DMA engines which can transfer data directly from the last level memory

without CPU intervention; when paired with streams, this greatly reduces or entirely

eliminates transfer latency after the initial transfer. While this does help improve performance, it cannot hide the data transfer energy, which has been shown to consume a large percentage of overall power consumption for high performance systems (25).

Therefore, bringing computation closer to the memory has been investigated as

a means to reduce transfer energy and latency. With the advent of 3D integration,

the efficiency and cost-effectiveness of processor-in-memory (PIM) architectures have improved (71). However, there are several important considerations: 1) data must still be transferred into the memory, which must pass through the I/O controller, and would therefore be subject to the physical bandwidth limitations mentioned previously; 2) the amount of physical memory available to PIM architectures is significantly less than the last level memory available to a given system through the use of port multipliers or expanders, meaning that applications whose data size exceeds hundreds of gigabytes will need to be processed piecemeal and sequentially in the memory; 3) while PIM solutions can reduce data transfer latency and energy, they do not aim at developing a tailored accelerator architecture for data-intensive kernels by customizing the datapath elements and interconnection fabric; 4) as general purpose processors, PIM architectures are unlikely to perform sufficiently low-overhead data compression/decompression, making it difficult to reduce the already limited memory bandwidth in modern computing systems. By comparison, the proposed accelerator offers greater scalability and energy-efficiency than PIM architectures for future growth of big data analysis.

In the ultimate case, processing can be moved into the last level memory itself, removing the need for external data transfer entirely. This, however, would require a fully custom designed last level memory, and the replacement of existing drives. The proposed accelerator overcomes this difficulty by operating at the interface, rather than inside the memory itself. Nevertheless, with the headway in ``universal memory" research (i.e. a memory technology that can be used for anything from CPU cache to main memory to secondary storage), it is plausible to integrate efficient processing at other levels of the memory hierarchy to further reduce data movement.

6.6.4 Memory-Centric Processing

The proposed architecture is memory-centric in that the primary component within each PE is the memory. Synthesis results estimate upwards of 90% of the PE area and power consumption is taken by the memory array. Therefore, higher density, lower power, and more reliable memories have the potential to vastly improve the energy efficiency and reliability of the accelerator. Many emerging memory technologies are not only promising in terms of integration density, efficiency, and reliability, but also offer non-volatility.

6.6.5 Analytics and Machine Learning

Analytics is a rapidly evolving field, and for many businesses, the ability to make more informed operational decisions is driven by smarter data processing. A reconfigurable accelerator tuned to the analytics domain can potentially reduce the costs associated with big data analysis. The three benchmark applications represent a diverse range of algorithms, from statistical classification to biologically-inspired neural networks, but by tailoring the accelerator design to common execution patterns and providing a highly flexible interconnect, the accelerator is general enough to handle other

applications from this domain, including basic statistics (e.g. mean, median, variance, histograms, etc.), other classifiers (e.g. SVM and decision trees), different clustering

techniques (e.g. hierarchical), as well as standard Artificial Neural Networks (ANN).

The framework can also be expanded to include support for single precision floating

point operations, catering to applications where more precision is required than can be

offered by the fixed point datapath.

6.7 Related Work

There are many examples in the literature of accelerating big data analytics workloads using different hardware frameworks, including FPGA and GPGPU, as well as distributed computing systems with MapReduce. Here, we describe the related work and how the proposed architecture differs, primarily focusing on FPGA and GPGPU rather

than other CGRAs or manycore systems, as the former tend to be more commonly used

in larger scale enterprise systems.

6.7.1 FPGA Analytics

One of the most famous uses of FPGA to accelerate analytics is the IBM Netezza platform (20), where database queries are sifted such that only the relevant results are

transferred to the host for processing, effectively compressing the data. Other query

processing and join processing systems exist (41; 38), along with more general regular

expression acceleration for text analytics (32), or FPGA systems for general data mining or search applications (72; 40). Additionally, FPGAs have been used to implement a custom accelerator within a Hadoop MapReduce framework (39).

One common theme among FPGA-based systems is the focus on raw performance. In many cases the significantly reduced latency associated with FPGA acceleration will reduce power consumption over a comparable CPU implementation, but this is not a scalable architecture in the context of overall power draw. More efficient reconfigurable architectures, with domain specific features such as that presented here, can serve to improve performance while maintaining or even reducing overall power consumption. Finally, the high flexibility of the FPGA systems is such that any future change to the application/kernel/implementation will require a redesign of the underlying hardware description (e.g. Verilog or VHDL). Frameworks with higher-level software support can compile high level descriptions into hardware, making them somewhat immune to this. Nevertheless, an ideal system would be programmable as in software, and potentially easier to update and manage.

6.7.2 GPGPU Analytics

Like FPGAs, GPUs have also been used to accelerate analytics operations in the context of big data. While the architecture is not as flexible as FPGA, the intrinsic parallel processing in GPGPU makes them attractive for a wide variety of analytics and mining applications, including various clustering algorithms (35; 73; 42) and general MapReduce functions (18). In addition, GPU database acceleration has been investigated (37; 33), as well as accelerating Network Intrusion Detection Systems (34), graph processing (74; 36), and deep learning from neural networks (75). High bandwidth PCIe interconnects, as are standard for modern GPUs, enable multiway data streaming, which is complemented by the GPU hardware architecture and memory hierarchy.

However, like FPGAs, GPUs are not domain specific. Fitting an analytics problem, including required computations and communication between parallel elements, to the

GPU architecture may not result in optimal energy efficiency, as we observed in the

KMC and CNN benchmarks.

6.8 Summary

This chapter has presented a domain-specific, memory-centric, adaptive hardware acceleration framework which operates at the last level of memory and is capable of running advanced analytics applications on a large volume of data. The architecture is tailored to the functional and inter-PE communication requirements of the common kernels. A software tool for automatic application mapping is developed that takes advantage of the features of the underlying architecture. The framework is functionally validated using a hardware emulation platform with implementations of three common analytics kernels. Synthesis results and subsequent comparison with CPU and GPU implementations of these kernels demonstrate excellent improvements in energy efficiency, which we attribute primarily to the domain-specificity and the proximity to the data. In-accelerator dataset decompression is also shown to be a viable option by improving overall system energy efficiency. Using the proposed accelerator, reduced latency and power requirements will translate into faster results and lower energy costs, both of which are crucial to many business applications. Finally, as a memory-centric hardware acceleration framework, it can leverage promising properties of emerging memory technologies (e.g. high density, low access energy, and non-volatility) to further improve area- and energy-efficiency.

CHAPTER 7
RECONFIGURABLE ACCELERATOR FOR TEXT MINING APPLICATIONS

Chapter 6 described one implementation of the MAHA framework targeted at

accelerating general analytics applications for large datasets. This chapter offers another implementation of the MAHA framework which instead aims to accelerate text

analytics, or text mining, specifically. This differs from general analytics in many ways,

and requires more specialized hardware for completing certain tasks. This chapter

previously appeared* in IEEE Transactions on Very Large Scale Integration (VLSI)

Systems.

7.1 Background

Text mining, or text analytics, is a growing field which employs statistical methods to

find relevant information within data sources (76). Analysis of data sources is of utmost

importance to many businesses (77; 78), and it is widely accepted that the majority of data containing business intelligence is unstructured. Meanwhile, the amount of data needing analysis grows exponentially, with estimates reaching the exabyte (10^18) or zettabyte (10^21) levels in the coming years. Using distributed computing systems with map-reduce style applications has become the de facto method for storing and analyzing large data sets. However, due to physical limitations -- power consumption, space, and cooling requirements, among others -- as well as processor-to-memory bandwidth bottlenecks within the system itself, the practice of increasing computational power simply by adding more of the same processing elements (PEs) will inevitably reach a fundamental limit (79; 25). Therefore, the growing volume, variety, and velocity

of data generation necessitates a new approach to expedite the preprocessing and analysis of data while reducing the overall power requirements of the system.

* R. Karam, R. Puri, and S. Bhunia. ''Energy-Efficient Adaptive Hardware Accelerator for Text Mining Application Kernels,'' IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 12, pp. 3526-3537, 2016.

A dedicated, yet flexible, accelerator for text analytics is therefore attractive; however, due to the breadth of the text analytics field, the number and complexity of different algorithms, approaches, and techniques for performing analysis, and the existing infrastructure of data warehouses and processing centers, the options for such an accelerator are limited. We note three critical requirements for such a system:

1. It must support acceleration of multiple kernels found in a variety of common text analytics techniques.

2. It must be amenable to hardware retrofitting and seamless operating system integration with systems at existing data warehouses, while requiring minimal host-side software development and compiler overhead.

3. It must function independently of the input data representation (encoding) and potentially accept a variety of languages as input.

This chapter presents a reconfigurable hardware accelerator residing at the interface to the last level memory device, as shown in Fig. 7-1, which is designed specifically to accelerate text analytics data processing applications. The accelerator connects to the LLM and host system using the existing peripheral interface, such as

SATA or SAS. Due to the close proximity of the two, the system bus is only used to transfer results to the CPU, rather than an entire dataset. This also allows the use of port multipliers without significant bandwidth requirements, as data processing occurs before the communication bottleneck is reached. By employing massively parallel kernel execution in close proximity to the data, the processor-to-memory bandwidth bottleneck, which plagues data-intensive processing systems today, is effectively mitigated by minimizing required data transfers, thus reducing transfer latency and energy requirements. Several architectural customizations, including content addressable memory (CAM), tokenizing hardware, a shuffle engine for bit-specific memory accesses, and fixed point multipliers for classification enable efficient, hardware accelerated text processing for ubiquitous kernels. Specifically, the hardware is capable of accelerating character downcasting and filtering for string tokenization, as well as token frequency analysis, a basic operation found in many text analytics and natural language processing techniques, including Term Frequency - Inverse

Document Frequency (TF-IDF) (80) and Latent Semantic Analysis (LSA) (81). A data engine, which interfaces the LLM to the accelerator, also ensures the input matches the expected or supported languages and encodings, while mapping characters to an internal compressed representation to reduce power consumption.

In particular, this chapter makes the following contributions:

1. It presents a reconfigurable architecture with processing elements leveraging functional units tailored to text analytics acceleration, including content addressable memory, shuffled bit-specific memory access, and fast fixed-point multipliers. The PEs and interconnect architecture are optimized based on an analysis of common text analytics kernels. The accelerator is capable of significantly improving the energy-efficiency of diverse text-analytics kernels. To our knowledge, this is the first example of an accelerator designed specifically for text analytics applications which operates at the last level of memory.

2. It describes custom hardware units and identifies optimizations in their implementations for several text analytics primitives, kernel functions that are common to many text analytics applications.

3. It presents a custom data engine, which augments the framework's capabilities by accepting both ASCII and UTF-8 encoding, and losslessly compressing input data when possible (e.g. for English-only text).

4. It validates the proposed architecture and its integration with a processor-based computing framework using an FPGA-based hardware emulator for the accelerator, and provides area, power, and delay estimates from synthesis at the 32nm technology node. It demonstrates a significant improvement in energy-efficiency with a case study using a widely-used open source text search and index library called Lucene (82).

The rest of the chapter is organized as follows: Section 7.2 provides a brief overview of existing work in the related fields, and summarizes the domain-specific requirements for the target applications. Section 7.3 provides architectural details for the proposed accelerator framework in the context of big data text analytics, as well as the mapping procedure for host and accelerator application programming. Section 7.4 shows how the accelerator can be used with Lucene, a popular text index and search application, and provides background, application profiling, and kernel analysis. Section 7.5 provides power and performance results for the selected kernels, and compares to CPU and GPU. Section 7.6 offers a discussion of these results, with performance analysis, accelerator interfacing, and accelerator application scope. Finally, the chapter concludes with future directions for the research.

Figure 7-1. System architecture showing the location of the accelerator and the last level memory device.

7.2 Related Work

Text analytics uses statistical techniques and pattern recognition to discover meaning in unstructured data. Existing frameworks built to tackle this problem on a large scale are able to intelligently divide tasks across PEs in a system. However, at the current pace of data generation, these systems will, at some point, reach a fundamental limit for power consumption, space, cooling, or even cost, making energy efficiency, either measured as Energy Delay Product (EDP) or Performance per Watt, one of the key drivers in designing such systems for the exascale regime. As the majority of business intelligence is found in unstructured sources, it is critical that new techniques are developed to analyze the data in a time- and cost-effective manner.

Figure 7-2. System architecture for the text mining accelerator. A) The overall accelerator architecture, showing 8 interconnected PEs (2x4 configuration), the controller for SATA/SAS packet snooping and configuration, as well as an on-chip data distribution engine. B) Microarchitecture of the processing element, showing separate lookup and data memories with an output Shuffle Engine (SE) for bit-specific access. The core consists of an instruction table, register file, datapath, and a small content addressable memory.

7.2.1 Existing Work

The design and development of text mining accelerators have largely revolved around general purpose acceleration platforms, especially General Purpose Graphics Processing Units (GPGPUs) and Field Programmable Gate Arrays (FPGAs) using MapReduce-style computations (83; 84; 85; 86; 87). We note several limitations of these techniques that reduce their scalability, making them less effective for future Big Data text processing:

7.2.1.1 Interfaces and retrofitting

In most cases, the interfaces used for existing accelerators still necessitate a large amount of data movement between the last level of memory, the main memory, and the accelerator memory. This increases power consumption and negatively impacts performance, reducing the benefits gained even for energy-efficient acceleration.

Additionally, FPGAs and GPUs which utilize the PCIe bus for data transfer may not be easily retrofitted to existing systems without additional hardware.

7.2.1.2 Distance to the data

Text analytics applications, which generally have a large input dataset and small output result set, would largely benefit from a framework in which processing occurs in close proximity to the data, so that data movement between the processor and memory is minimized. Though FPGAs have been used in close proximity to the last level of memory as data compression and search engines (88), they generally do not perform the analytics operations themselves. Reducing data transfer by bringing computing closer to the main memory has also been investigated (46; 44; 89); however, we note that bringing computation to the main memory is not as scalable as bringing computation to the last level of memory, because data must still be transferred up the memory hierarchy. It is also not as effective for big data processing, since the amount of main memory available to a system is significantly more limited than the amount of local last level storage that can be connected through the use of SATA/SAS expanders.

7.2.1.3 Flexibility overhead

GPUs and FPGAs are both general-purpose accelerators; neither is specifically geared towards text analytics. The FPGA's spatial reconfigurability has high interconnect overhead and generally low operating frequencies, while the GPGPU's instruction-based execution allows for higher operating frequencies, but does not provide customized analytics-centric datapaths. Both platforms contain extra functional

units and memories which are required for general application execution, but may be

extraneous to the text analytics domain. These hardware resources increase the area

and leakage power without necessarily improving performance. For an accelerator located at the last level of memory (Fig. 7-1), including domain-specific hardware while

excluding extraneous peripherals can drastically improve performance and scalability for

future data intensive applications.

In comparison to other Coarse Grained Reconfigurable Array (CGRA)-style accelerators such as MorphoSys (69) or AsAP (90), we note the improvements in performance and energy-efficiency can be attributed to the following: (i) accelerator proximity to the last level memory, with intelligent data transfer between the two; (ii) text analytics-inspired functional units; (iii) multilevel hierarchical interconnect based on typical dataflows for text mining applications. The proposed accelerator capitalizes on these in addition to flexibly supporting task-level or application-level parallelism. Other platforms like TOTEM (91) enable the automated design of domain-specific reconfigurability, but domain specificity is achieved at a lower logic circuit level, rather than as domain-targeted kernel functional units.

7.2.2 Application Survey

We present a survey of common text analytics techniques in which specific kernels

are identified as amenable to acceleration in an off-chip framework. Generally, these are

kernels which are found in many analytics applications and are easily parallelized (87).

We note that many text mining applications, including text indexing and search, pattern

mining, summarization, sentiment analysis, and classification, among others, require similar preprocessing steps:

• Tokenization: dividing a text stream into a series of tokens based on specific punctuation rules

• Change of case: changing all characters to lower case

• Stemming: removal of suffixes from tokens, leaving the stem

• Stop word removal: removal of common, potentially unimportant words from the document

• Frequency analysis: counting the number of times each (stemmed) token has appeared in an individual document and the corpus as a whole

Tokenization, change of case, and frequency analysis represent three kernels which are trivial relative to the complexity of the full application, but are nevertheless necessary and time-consuming tasks. Stemming and stop word removal can be considered extensions of frequency analysis, and are potentially more complex depending on the target language. In addition, while the classification kernel is not considered to be a preprocessing step, it is common to several text mining applications, and was therefore also considered in the architecture.

7.3 Hardware and Software Framework

In this section, we describe the proposed text analytics accelerator hardware in detail, including the processing elements, the functional units, the interconnect network, the controller, and the data engine (Fig. 7-2). We also describe the application mapping techniques applied to both the host side and the accelerator side, and consider the implications for I/O throughput when using the proposed accelerator.

7.3.1 Processing Elements and Functional Units

The accelerator contains an array of interconnected processing elements (PEs), each capable of executing instructions for datapath and lookup operations, similar to the architecture described in (79). However, we note that several distinctions exist which enable the acceleration of text mining applications, as opposed to a more general framework. Specifically, the two designs differ in the size of the local interconnects (8 PE vs. 4 PE), the levels of hierarchy (2 level vs. 3 level), the PE micro-architecture, and the memory design (Flash versus SRAM).

Table 7-1. Configuration of the text analytics accelerator.
Property               Description
Processing Element     Single issue
                       8 bit Integer ALU (Add/Sub/Shift)
                       Custom Datapath
                       256 x 32 bit instruction memory
                       32 x 8 bit registers
Memory Organization    1 kB LUT memory
                       3 kB data memory
                       1024 x 38 bit CAM/SRAM array
Interconnect           Two-level
                       Cluster: 8 PE, Shared Bus
                       Intercluster: Mesh

The architecture details for the text mining accelerator are provided in Table 7-1.

PEs are designed to accelerate a wide range of text analytics tasks while minimizing

data movement, as shown in Fig. 7-2. This is accomplished by accelerating individual text mining primitives, listed in Section 7.2.2. Here, we describe how one of the more complex operations, term frequency counting, can be accelerated.

7.3.1.1 Term frequency counting

Analyzing how often a term appears in a document is one of the key operations performed on textual data. It enables statistical analysis of a given document, and

can be used in the TF-IDF and LSA algorithms. Because it is so common, dedicated

hardware is added to each PE which is capable of efficiently counting the occurrence of terms in the dataset (92). Two kinds of memory are used for this task: first, a Content Addressable Memory (CAM), which enables single-cycle lookup of terms already encountered (93; 94; 95), and second, an SRAM array which stores the corresponding term counts. When presented with an input term, the CAM searches all

memory locations in parallel; if a match is found, the match line (ML) for that word goes

high, and if no match is found, a signal is generated which stores the search term in the

CAM memory. A high ML is detected by the match line sense amplifier (MLSA), which enables the bitline precharge circuitry and the corresponding wordline. The value at the specified SRAM address is automatically read, incremented, and written back to the same location. This process, as shown in Fig. 7-3, reduces the number of instructions required to count terms in a document considerably. Rather than relying on a software hash table or similar data structure, the CAM enables single-cycle, parallel lookup, without collisions or requiring support for chaining or other collision handling techniques.

Furthermore, the adder has a very small hardware overhead, and no special hardware is required for hash computation, a process which would otherwise require several software cycles to compute. The CAM, however, is a limited hardware resource, and so must be carefully sized in order to avoid overflow.
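In software terms, the CAM/SRAM pair behaves like a small, collision-free exact-match table. The following C model captures that behavior; the parallel single-cycle search is modeled here as a sequential scan, and the 1024-entry capacity matches Table 7-1, while the packed key format is left abstract.

#include <stdint.h>
#include <stdio.h>

#define CAM_ENTRIES 1024                  /* matches the 1024 x 38 bit array */

static uint32_t cam_key[CAM_ENTRIES];     /* stored (packed) terms           */
static uint16_t sram_count[CAM_ENTRIES];  /* per-term frequency counts       */
static unsigned used;

/* One search-and-update: in hardware the match is found in a single cycle
 * across all entries in parallel; here the scan is sequential.            */
static int count_term(uint32_t key)
{
    for (unsigned i = 0; i < used; i++) {
        if (cam_key[i] == key) {          /* match line asserted            */
            sram_count[i]++;              /* read, increment, write back    */
            return 0;
        }
    }
    if (used == CAM_ENTRIES)
        return -1;                        /* full: consolidate or spill     */
    cam_key[used] = key;                  /* miss: allocate a new entry     */
    sram_count[used++] = 1;
    return 0;
}

int main(void)
{
    uint32_t stream[] = { 7, 42, 7, 7, 42 };
    for (unsigned i = 0; i < sizeof stream / sizeof stream[0]; i++)
        count_term(stream[i]);
    printf("term 7 -> %u, term 42 -> %u\n", sram_count[0], sram_count[1]); /* 3, 2 */
    return 0;
}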

An analysis of a typical English text dataset reveals interesting properties of the text; specifically, while a large number of tokens may be present, relatively few of them are unique. For example, in an English text dataset containing 1.8 million tokens, roughly 37,000 (2%) were found to be unique. While this can differ between languages and datasets, the general case of a ``full" CAM can be handled by (1) consolidating terms/frequencies when a maximum count is reached by any one PE within a cluster, and (2) writing terms and frequencies to file and merging later. Both methods incur some overhead, but it is significantly less than comparable software methods (e.g. hash table chaining or re-hashing when full).

Figure 7-3. Hardware implementation of term frequency counting using content addressable memory and SRAM for storage.

The limited CAM resource can additionally be viewed as beneficial. Before counting terms, it is common to perform a stemming operation, which removes suffixes like "-ing" or "-ment" from terms. Here, the bitwidth of each CAM entry plays an important role in how individual terms are counted. Using a 30 bit wide CAM, and using only 5 bits to represent each lowercase ASCII character in the Latin alphabet, up to 6 characters can be stored. These physical limitations perform a kind of naïve stemming operation, where the ends of words longer than 6 characters are truncated. In practice, this will not perform as well as a more complicated stemming algorithm, but will still result in a similar frequency table for a large dataset. Preliminary analysis with Rank Biased Overlap (RBO) (96) shows little difference in TF-IDF score lists generated from input text based on this method, or the more complex Porter stemming algorithm (97), for a large body of English text. With a persistence parameter (denoted as p) of 0.95 (that is, a 95% chance a person comparing the lists will compare two entries and continue on to the next line), the lists share a similarity index of 92%; for p = 0.90, this increases to 95%; for p = 0.75, they are nearly identical, at 99% similarity. In many applications, especially mining large datasets on a low power budget, this method offers a trade-off between speed and accuracy.
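The truncation-based stemming falls out of the key packing itself. A hedged C sketch of packing the first six characters of a lowercased token into a 30 bit CAM key (5 bits per character) is shown below; padding short tokens with the all-ones non-letter code is an assumption made for illustration.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CHAR_BITS 5
#define MAX_CHARS 6                       /* 6 x 5 bits = 30 bit CAM key      */
#define PAD_CODE  0x1F                    /* 5'b11111, the non-letter code    */

/* Pack the first six characters of a lowercase token into a 30 bit key;
 * longer tokens are silently truncated, which is the naive stemming
 * behavior described above.                                                 */
static uint32_t pack_key(const char *token)
{
    uint32_t key = 0;
    size_t len = strlen(token);
    for (int i = 0; i < MAX_CHARS; i++) {
        uint32_t code = PAD_CODE;                      /* pad short tokens    */
        if ((size_t)i < len && token[i] >= 'a' && token[i] <= 'z')
            code = (uint32_t)(token[i] - 'a');         /* 'a'..'z' -> 0..25   */
        key = (key << CHAR_BITS) | code;
    }
    return key;
}

int main(void)
{
    /* "computer" and "computing" both truncate to "comput" -> same key. */
    printf("%08x\n%08x\n", pack_key("computer"), pack_key("computing"));
    return 0;
}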

7.3.1.2 Classification

Another common task for text mining is classification, which can be used in sentiment analysis (for example, detecting if a product review is positive or negative based on the content) or categorizing a document by subject. The Naïve Bayes Classification

128 (NBC) (49) technique is commonly applied to such problems because it is relatively easy to implement and in some cases can have results comparable to other, more complex methods, such as decision trees or support vector machines. NBC requires a large number of computations of probabilities during both training and classification.

Assuming training has been done beforehand and a lookup table of partial probabilities is available to each PE, the actual classification can be done by simple fixed point multiplication and comparison. As such, the datapath of each PE contains a fast fixed point multiplier and comparator to perform the necessary calculations. This is also an easily parallelized task, as different documents may be classified by different PEs simultaneously.
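A minimal fixed point model of the classification step is given below. The Q0.8 probability format, the class and feature counts, the example probability values, and the renormalizing shift are all assumptions for illustration; they are not the accelerator's actual parameters.

#include <stdint.h>
#include <stdio.h>

#define N_CLASSES  2
#define N_FEATURES 3

/* Per-class prior and per-feature conditional probabilities in Q0.8
 * (256 == 1.0); these would come from the offline training step.      */
static const uint16_t prior[N_CLASSES] = { 128, 128 };          /* 0.5, 0.5 */
static const uint16_t cond[N_CLASSES][N_FEATURES] = {
    { 230, 180, 200 },                                          /* class 0  */
    { 100, 220, 150 },                                          /* class 1  */
};

/* Multiply-and-compare classification: each product is renormalized back
 * to Q0.8 with a shift, and the class with the largest score wins.       */
static int classify(void)
{
    uint32_t best_score = 0;
    int best_class = 0;
    for (int c = 0; c < N_CLASSES; c++) {
        uint32_t score = prior[c];
        for (int f = 0; f < N_FEATURES; f++)
            score = (score * cond[c][f]) >> 8;
        if (score > best_score) { best_score = score; best_class = c; }
    }
    return best_class;
}

int main(void)
{
    printf("predicted class: %d\n", classify());   /* prints 0 for this data */
    return 0;
}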

7.3.2 Interconnect Network

PEs are organized in a two level hierarchy. The lowest level of hierarchy consists of eight PEs in a cluster connected by a shared bus, with 8 bits dedicated per PE (Fig. 7-4). The choice of an 8 PE cluster helps accommodate the frequency counting application, which benefits from high bandwidth intra-cluster communication. However, inter-cluster communication is far less frequent, and so the architecture may be scaled by adding multiple clusters connected by a 16-bit routerless mesh interconnect fabric.

Previous works have proposed similar interconnect hierarchies (98), but these generally rely on routers to transfer data where needed. Due to the infrequent inter-cluster bus usage, this is not required in many text mining applications. The routerless design is made possible by providing a dedicated connection for 1 of every 4 PEs in the cluster to the shared bus, making it responsible for reading and writing data.

7.3.3 Control and Data Engines

Figure 7-4. Interconnect fabric of the accelerator, with eight PEs sharing data through dedicated bus lines in a cluster, and clusters connected in a 2D mesh.

The control and data engines are integral parts of the accelerator which are used in data I/O and preprocessing. The control engine is responsible for intercepting and relaying commands from the SATA or SAS port to the LLM device. Upon receiving a special command (e.g. accessing a particular location in memory where the accelerator configuration is stored), the control engine initiates the required transfers from the LLM. Meanwhile, the data engine is responsible for distributing data among the PEs.

In addition, the data engine contains specialized hardware for data preprocessing. As the majority of data is encoded using UTF-8, it is important to verify that the incoming data is compatible with the framework. The presented architecture is specific to English -- therefore, if only English is expected, it is sufficient to check the most significant bit (MSB) of every incoming byte to filter ASCII-coded characters (which are identical to single byte UTF-8 characters) from multi-byte UTF-8 characters, which always contain a 1 in the MSB. In addition, if enabled, the data engine is capable of changing the case of every incoming byte by setting the 5th bit position to 1. Finally, the constant value of 0x61 is subtracted, mapping every valid character (a through z) to the lower 5 bits. As a special case, punctuation and whitespaces are mapped to a constant 5'b11111, which is later used by the Tokenization primitive when identifying tokens.
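In software terms, the per-byte preprocessing described above could be sketched as follows; the function name, return convention, and handling of non-letter bytes are assumptions, since the data engine implements this step directly in hardware.

\begin{verbatim}
#include <stdint.h>

/* Illustrative per-byte preprocessing (English/ASCII input assumed):
 * filter multi-byte UTF-8, force lowercase, map 'a'..'z' to 0..25 and
 * everything else to the 5'b11111 delimiter code. Sketch only.        */
#define DELIM_CODE 0x1F              /* 5'b11111: punctuation/whitespace */

static int preprocess_byte(uint8_t b, uint8_t *out)
{
    if (b & 0x80)                    /* MSB set: multi-byte UTF-8 byte   */
        return -1;                   /* filtered by the data engine      */

    b |= 0x20;                       /* set bit 5: force lowercase       */

    if (b >= 'a' && b <= 'z')
        *out = (uint8_t)(b - 0x61);  /* map to the lower 5 bits (0..25)  */
    else
        *out = DELIM_CODE;           /* token delimiter                  */
    return 0;
}
\end{verbatim}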

7.3.4 Application Mapping

Applications are mapped to the framework using a custom software tool. Because applications are based on a combination of the available primitives, application mapping is simplified to a series of instruction macros which operate on all data in parallel. These instructions are parsed and compiled into a bitstream which can be loaded into the PE instruction memory. Examples are provided here; lower level instructions (e.g. datapath or memory access) are also available for application programming.

• getNewData(): instructs the data manager to read out and save the entire data memory and retrieve new data for processing.

• toLowerAll(): sets the 5th bit position of every stored byte to 1, forcing lowercase.

• countFreqAll(): iteratively transfers data stored from the memory start (up to a null terminator) to the frequency counting functional unit, implicitly tokenizing during execution.

• classifyAll(): performs Naïve Bayes Classification on stored data.

Specifically, these macros are translated into RISC assembly instructions using optimized code templates which contain all the instructions, looping constructs, and registers used for the control flow for the particular function. These assembly instructions are output primarily for debugging and readability; they are subsequently compiled into the machine instructions for the PEs. Sets of 256 × 32 bit words, one set per PE, are loaded sequentially at runtime into the PE's instruction memory. The order in which PEs are programmed is hardcoded into the framework, and corresponds with the compiler output.

On the host side, small modifications to existing applications must be made to facilitate acceleration. In the host code, a filestream is opened to a specific memory location, and these commands are intercepted by the accelerator. Data written to this stream is used for initialization and configuration.
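From the host's perspective, the modification can be as small as the following hedged sketch: open a stream to the reserved location and write a small configuration payload. The path, payload format, and function name are hypothetical; the accelerator-side interception is described in Sections 7.3.5 and 7.6.3.

\begin{verbatim}
#include <stdio.h>

/* Hypothetical host-side setup: the reserved path, payload format, and
 * function name are illustrative only; writes to this location are
 * intercepted by the accelerator rather than stored on disk.          */
int configure_accelerator(const char *reserved_path, const char *dataset_uri)
{
    FILE *accel = fopen(reserved_path, "wb");
    if (!accel)
        return -1;

    /* Command payload: kernel to run plus the URI of the data to process. */
    fprintf(accel, "KERNEL=countFreqAll\n");
    fprintf(accel, "DATA=%s\n", dataset_uri);

    fclose(accel);        /* flush; the underlying write is intercepted */
    return 0;
}
\end{verbatim}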

7.3.5 System Architecture

As previously mentioned, the proposed accelerator is designed to operate at the interface between the I/O controller and the LLM. A common setup, shown in Fig. 7-1, may include port multipliers or expanders used to increase a system's storage capabilities. When designing this system, it is important to consider the maximum read/write speeds from the LLM such that the communication channel does not become a bottleneck. For example, SATA III (6 Gbps) may not be sufficient for more than one high speed solid state drive.

With the proposed accelerator, more drives could be attached to the same channel, greatly expanding the storage capabilities of the system. There will still be a bottleneck for writes from the CPU; however, once the data is in storage, it can be processed before reaching the bottleneck, and can vastly outperform another system with equivalent storage which relies on data transfer between the CPU, the main memory, and the LLM.

7.4 Lucene: A Case Study

To evaluate the effectiveness of the text mining accelerator on a real-world application, we analyzed the kernel functions used in Lucene, one of the most popular open source full-text search software libraries (82). Though Lucene's development was initially targeted for text search applications, with the advent of big data, it is now used largely for text analytics and data mining applications (99).

7.4.1 Lucene Optimizations

Generally, attempts to optimize Lucene will focus on software optimizations or, if more significant improvements are required, improving disk performance or I/O bandwidth by updating the LLM devices or increasing the available system memory.

For mining very large datasets, however, small software optimizations may not be sufficient, and hardware upgrades may not be feasible, especially for enterprise-level, data warehouse applications.

Figure 7-5. Major operations conducted in the Lucene text indexing software flow

Improving search performance often comes at the cost of more complicated indexing strategies. Internally, Lucene utilizes an inverted index (100) which tracks, among other things, in what documents certain words are located. This task, though usually done once for a dataset, is compute- and data-intensive, and must be done for each incoming document. However, the searching of an inverted index is essentially a lookup and merge operation, and therefore runs very fast once the data is in the main memory.

Figure 7-6. Profiling results on Lucene for a 1 GB and 50 GB dataset

7.4.2 Lucene Indexing Profile

From an architectural standpoint, the major operations undertaken by the standard Lucene indexing software flow are shown in Fig. 7-5, and include tokenization, downcasting, frequency counting, sorting, and writing to disk. By running HPROF (101), the built-in Java profiler, on Lucene 5.3.1 (102), indexing can be categorized into five major categories at the implementation level: downcasting, tokenizing, I/O, frequency counting, and other Lucene operations.

Depending on the size of the dataset indexed, the percent breakdown can change.

For example, Fig. 7-6 shows results for the 1 GB and 50 GB English text datasets used in the experimental setup. For both datasets, the majority of time (~40%) is spent in the native I/O methods, specifically the file reads and directory reads, if any. This is approximately equal to the time spent in downcasting, tokenizing, and frequency counting combined. Regardless of data size, the I/O time will be drastically reduced by operating at the last level of memory, especially considering that port multipliers or expanders may be used, as described in Section 7.3.5.

Table 7-2. Resource utilization for FPGA-based emulator
Component Name                    Combinational   Registers   Memory (Bytes)
Datapath and program counter                 23          11                0
LUT & data memory                            25           0             4096
Output multiplexers                          11           0                0
Register file                                44         256                0
Schedule table                               74          64             1024
Interconnect and data interface             207           0                0
Accelerator Controller                       12          15                0
PE Memory Controller                          2          78                0
Nios II/e Core                             2520        2543          851,712
System totals                              5609        5207          892,672

When including I/O acceleration, nearly 80% of the operations done by the standard Lucene indexer can be accelerated using the proposed architecture. While NBC is not included as part of the standard indexing strategy, recent versions of Lucene include an NBC API which can use the document index for classification; therefore, NBC acceleration is also considered in the results.

7.5 Results

In this section, we describe how the proposed accelerator architecture was evaluated using a series of benchmarks taken from Lucene that exercise the functions performed by the accelerator. These kernel applications were first run on a hardware emulation platform to verify timing and functional correctness. Then, results for a modern quad core CPU and a workstation GPU were compared with area, power, and delay estimates from a synthesized 32 nm (SAED 32 nm EDK) implementation.

7.5.1 Emulation Platform

The hardware emulation platform is a stand-alone system intended to emulate the proposed accelerator and functionally verify its operation and timing. It consists of two separate FPGA boards, a Terasic DE4 (Stratix IV EP4SGX230 device), and a DE0 (Cyclone III EP3C16F484 device). The DE4 serves as the LLM device with built-in text mining accelerator, while the DE0 serves as the host CPU, as shown in Fig. 7-7. This is in contrast to the typical accelerator setup, where the DE4 would be configured as an accelerator only, and connected to commodity hardware. Because the DE4 serves as both the accelerator and last level memory device, the host will not be involved beyond the initial control signal generation and reading back results.

In addition to the last level (Flash) memory device, the emulated LLM system also contains a small CPU serving as a flash controller, and the accelerator itself, consisting of eight interconnected PEs with a memory managing data interface, operating at a maximum frequency of 112 MHz. Resource utilization, as reported by the Quartus II software, is provided for reference in Table 7-2.

The accelerator also contains a module which decodes the input text and verifies that it is in a compatible encoding, as described in Section 7.3.3. Control signals in the accelerator are monitored using SignalTap II Logic Analyzer.

Meanwhile, the host CPU is a Nios II/f core, which is programmed in C using the Eclipse IDE with Nios Software Build Tools, and executes the target applications using the data and processing resources of the LLM + Accelerator. The host CPU has interfaces to 8 MB of main memory (SDRAM), a CFI Flash interface for storage, two timers (CPU and high-resolution stopwatch timer), and a JTAG/UART interface for user and program I/O. The two boards are connected and communicate using a standard 3-wire SPI protocol. Applications running in the accelerator do so with no host CPU intervention, so the host is free to perform other tasks while the accelerator performs all processing functions. Note that the PE programming is performed on the emulator just as described in Section 7.3.4. When processing is complete, the host can read results back from the memory.

7.5.2 Experimental Setup

The CPU and GPU experiments were conducted on a server running Ubuntu 12.04 x64, with an Intel Core2 Quad Q8200 2.33 GHz (52), 8 GB of 800 MHz DDR2 memory, a 1 TB WDC HDD, and an NVIDIA Quadro K2000D GPU with 384 CUDA cores and 2 GB of GDDR5 memory. Because the CPU was implemented with 45nm technology, relevant area, frequency, and power values are scaled down to 32nm to match the GPU and synthesized accelerator. CPU applications were compiled with the latest GCC toolchain, and used maximum optimization and processor-specific flags to achieve the highest possible performance. For GPP measurements, timestamps were obtained using the clock_gettime POSIX interface (54), and multiple iterations were averaged to reduce the effect of noise from background system processes. GPU measurements were obtained using high resolution CUDA event timers for both data transfer and execution latencies (103), and kernels were executed using launch parameters that were found to maximize thread occupancy for the entire core. Energy measurements include I/O transfer energy to more fairly compare the platforms in a real-world setting. We assume a value of 13 nJ for transferring a bit from the last level memory to beyond the main memory (25). I/O transfer rates are fixed at the ideal 6.0 Gbps (SATA III) rate in order to remove any potential variations, including background I/O and differences in seek time between kernel launches on both CPU and GPU. In practice, the transfer energy and latency constitute a major percentage of the results for all preprocessing kernels due to the size of the data being transferred, and the complexity of the memory hierarchy between the LLM, the CPU, and the GPU, compared with the low instruction count of the kernels. For the accelerator, we assume a value of 15% of the CPU/GPU data transfer latency. This is a possible scenario considering a combination of port multiplier usage and the associated reduction in physical distance between LLM and the processing elements, in addition to the fact that the data no longer needs to pass through the I/O controller and main memory before processing occurs. Finally, the results are compared on raw throughput, iso-area throughput, and energy-delay product (EDP), measured as the product of energy and time (J × s). The testing dataset used on all platforms comprised a number of UTF-8 encoded English Wikipedia pages, whose total uncompressed size is 1 GB.
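As a point of reference, the CPU-side timing method can be sketched as follows; the wrapper name and iteration handling are illustrative, but clock_gettime with a monotonic clock is the POSIX interface referenced above.

\begin{verbatim}
#include <time.h>

/* Sketch of the CPU timing methodology: wall-clock timestamps via the
 * POSIX clock_gettime() interface, averaged over several iterations to
 * suppress noise from background system processes.                    */
static double elapsed_s(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) + (b->tv_nsec - a->tv_nsec) * 1e-9;
}

double time_kernel(void (*kernel)(void), int iters)
{
    struct timespec start, stop;
    double total = 0.0;

    for (int i = 0; i < iters; i++) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        kernel();                      /* e.g. a downcasting or tokenizing pass */
        clock_gettime(CLOCK_MONOTONIC, &stop);
        total += elapsed_s(&start, &stop);
    }
    return total / iters;              /* mean latency in seconds */
}
\end{verbatim}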

7.5.3 Indexing Acceleration

Several kernels from the Lucene library were executed on all three platforms in order to evaluate the power and performance improvements of the proposed accelerator. Specifically, Downcasting (DNC), Tokenization (TKN), Frequency Counting (FRQ), and Classification (CLA) were tested. All kernels were verified for functional correctness by comparing with the output of other platforms. Raw throughput (in Megabits per second) is shown in Fig. 7-8(a); results for total energy (in J/GB) are shown in Fig. 7-8(b), and the EDP for each application is shown in Fig. 7-9. Note that the throughput values, displayed in Megabits per second (Mbps), represent kernel execution throughput, because at the given interconnect speeds, none of the platforms are limited by the transfer bandwidth. This effectively facilitates a comparison based solely on architectural differences among the platforms (e.g. datapath customizations and interconnect fabric). Finally, the energy breakdown is shown in Fig. 7-10.

7.5.3.1 Downcasting and tokenizing

The process of downcasting and tokenizing characters is an integral preprocessing step in the vast majority of text analytics kernels. Both of these kernels are easily parallelized -- in the case of downcasting, this can be done at the byte level regardless of word breaks, and for tokenizing, this can be done by intelligently transferring data up to a whitespace or other delimiter. In either case, the serial CPU implementation was found to be inefficient relative to the other platforms, as the GPU implementations improved the CPU performance per Watt by 4.6x and 4.2x for tokenizing and downcasting, respectively. Meanwhile, the performance per Watt of the accelerator was found to be two orders of magnitude greater than the serial CPU implementation -- around 443x for tokenizing and 262x for downcasting. This is primarily due to the accelerator's proximity to the LLM and the lightweight core architecture. In most cases, the throughput of the accelerator and the GPU platforms are comparable, while the accelerator is significantly more energy efficient, as shown in Fig. 7-9.

7.5.3.2 Frequency counting

Frequency counting was directly implemented on CPU and GPU using the available resources. However, the proposed framework utilizes a custom hardware functional unit for frequency counting. Because power and performance values reported in literature vary widely for various CAM sizes, types, implementations, and technology nodes, we elected to simulate the operation of a custom 1024 word × 30 bit (CAM) and 8 bit (SRAM) memory, along with match line sense amplifiers and other interfacing logic, using 32nm low power models (104). The SRAM array was based on the standard 6T SRAM cell, and the CAM array included 3 match transistors per cell implementing NAND logic for the match operation.

Power and performance of the SRAM cells were compared with CACTI (58) output for the same technology node, and were found to agree well with the estimates. The power estimates for the decoder and output logic are added to the results. Finally, area estimates are taken directly from CACTI for a 1024 word × 64 bit memory. Note that this is an overestimate, as the accelerator memory is only 1024 words × 38 bits, with 30/38 bits incurring a 3 transistor area overhead per cell.

The simulated operation is as follows: for a single frequency count operation, search data is driven onto the search line SL and its complement. A match at a given row causes the match line ML to go high, driving the bitline precharge circuitry and corresponding wordline WL of the adjacent SRAM row. This value is read by an 8 bit adder which increments the value before writing it back to the array.

The total dynamic read energy for the 8 bit SRAM word was found to be 1.6E-10 J, while a CAM match was found to require 0.4 fJ/bit/search or about 1.2E-11 J for the array, consuming 25 mW at the nominal 2GHz frequency. Finally we estimate the 8 bit adder energy, also synthesized with the 32nm low power model, to be 1.4E-15 J, for a total of 172 pJ/operation. This is in addition to the necessary accelerator operations for the given kernel. Since the number of instructions varies based on the dataset, we assume a worst-case constant 6 characters per word, requiring 6 cycles to move data, two cycles to process and store the result, and several cycles for data transfer and looping. In total this requires under 12 seconds per GB of data for the worst case. In comparison, the CPU and GPU required the use of hash tables to realize the same functionality as the CAM/SRAM array described above. The serial CPU implementation required 59.6 seconds per GB of data. In contrast, the GPU finished the task in 11.8 seconds per GB. Though the raw performance of the GPU was higher than that of the proposed accelerator, it was far less energy efficient -- the accelerator improves the EDP by 200x vs CPU, and 30x vs GPU. This is once again due to the close proximity to data and the custom term frequency counting hardware.
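The per-operation energy figure quoted above follows directly from summing the simulated component energies (a simple consistency check using only the values reported in this section):

E_{op} = E_{SRAM} + E_{CAM} + E_{adder} \approx 1.6\times10^{-10}\,\mathrm{J} + 1.2\times10^{-11}\,\mathrm{J} + 1.4\times10^{-15}\,\mathrm{J} \approx 1.72\times10^{-10}\,\mathrm{J} \approx 172\,\mathrm{pJ}.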

7.5.3.3 Classification

Unlike Downcasting, Tokenizing, or Frequency Counting, the Classification routine was not a low instruction count preprocessing kernel. Instead, it was a full routine, using hundreds of instructions on all platforms, including many branch statements and multiplications. Fixed point multiplication was used on all platforms. No custom hardware was used to implement the classifier in the accelerator, so the total raw throughput strongly favored the GPU, although the accelerator still demonstrated an improvement in EDP.

7.5.4 Lucene Indexing Acceleration

We estimate the percent performance improvement in running the Lucene indexer by using the accelerator compared with the serial CPU and parallel GPU implementations. As mentioned in Section 7.3.5, the I/O latency improvement can vary depending on a number of factors. In Section 7.5.2, we assume 15% of the CPU/GPU latency, resulting from a large ratio between total required bandwidth and available link bandwidth in addition to the physical distance reduction between LLM and the PEs. Under these conditions, the largest single improvement would come from the reduction of data transfer latency, reducing the percent time taken from 39% to an estimated 6%, compared to a similarly loaded commodity system. Once the data is transferred to the accelerator, the custom hardware and parallel execution of kernels for downcasting, tokenization, and frequency counting can further improve individual kernel performance, resulting in a total estimated improvement of 70%.
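As a rough consistency check, assuming the I/O time simply scales with the 15% latency factor from Section 7.5.2:

0.39 \times 0.15 \approx 0.06,

i.e., the I/O share of the baseline runtime drops from roughly 39% to roughly 6%, matching the estimate above.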

7.5.5 Iso-Area Comparison

An iso-area comparison is useful for comparing the performance while accounting for the area overhead from various extraneous functional units. Area measurements for the CPU and GPU are estimated from the die area of each platform, scaled to the same technology node. In the case of the CPU, the area is scaled by the fraction of active cores (1/4 of the die), as the code was single threaded and ran on only one core. Because GPU occupancy was maximized (as described in Section 7.5.2), the full die area is used.

For FPGA (emulator) area, this is estimated using the resources used in the mapping divided by the total amount of available resources. Finally, the projected accelerator size is based on the estimate provided by Synopsys Design Compiler combined with memory area estimates generated using CACTI for the same technology node, totaling 0.62 mm². The comparison, shown in Fig. 7-11, reveals an average 10x improvement of the accelerator vs the GPU, and over 100x improvement vs the CPU. This is consistent with the small footprint design of the accelerator datapath. Furthermore, the accelerator demonstrates energy efficiency (Mb/s/W) improvements on the order of 64x vs CPU, and over 10x vs GPU.

7.6 Discussion

In this section we discuss the results of simulation and hardware emulation, and the comparison with CPU and GPU for the series of text analytics benchmark applications.

7.6.1 Benchmark Performance

By using hardware acceleration, it is possible to significantly reduce the power consumption and latency of text analytics applications. This was demonstrated using kernels from the Lucene text search and indexing application. Note that all tested benchmarks were amenable to SIMD-style parallelization, and the 8-PE accelerator, even running on a significantly lower speed FPGA, still outperformed the single-threaded CPU in terms of energy efficiency. The simulated custom implementation of the accelerator can further improve energy efficiency for both data transfer and computation.

Additionally, the frequency counting simulation results show great promise for a hardware implementation of term frequency counting. Note that the design of the SRAM/CAM cells, decoders, muxes, and sense amplifiers is a proof-of-concept, and while functional, it is far from optimized. Using more advanced low power circuit design techniques, it can consume even less power and improve the EDP. For example, the current timing slack in the memory array can be capitalized on by lowering the supply voltage to take advantage of the quadratic reduction in power consumption.

One important consideration is the correlation between the results of individual kernel acceleration, compared to the profile results provided in Section 7.4. Each of the steps from Fig. 7-5 will be performed sequentially, and we know from profile results the relative duration of each step. If, for example, the input text is already in lower case, then the downcasting becomes unnecessary and can be wholly omitted. This can be demonstrated by preprocessing a dataset to all lowercase, and omitting the downcasting calls in the Lucene StandardAnalyzer. Doing so results in an equivalent output, but a 5.7% reduction in overall execution latency (48.8 seconds down to 44.4 seconds for 1 GB data), which agrees well with the profiled values, where DNC was found to take around 6% of the total runtime.

Similarly, the IndexWriter uses hash codes to track term occurrences in a document. A frequency table could instead be integrated which provides ready-to-use term frequencies for each document. This has the added benefit of significantly less data transfer, since only unique (stemmed) terms and their frequencies need to be transferred for each document. Using this method can also replace the typical stop-word removal procedure (e.g. using the StopFilter in the StandardAnalyzer); instead of comparing tokens to a predetermined list of English stop words, those words which are most common (according to a sorted frequency table) can be removed from the index.

We have also considered the effect of the ``simple'' (truncation-based) stemming on search accuracy. It is known that a simple stemming will not yield the same search results as a more complex stemming algorithm, or no stemming at all; however, the actual effect it has depends on many factors. To study this, we implemented a ``SimpleStemmer'' class in Lucene, and ran the indexing on 1 GB worth of English text files. When compared to search results which used the EnglishAnalyzer with the PorterStemmer, the results were indeed different, but the top 3-5 hits were generally the same; occasionally one item was swapped for another in the ranking, and the lists tended to diverge after the first few results. However, these differences could potentially be addressed or mitigated using a different scoring or sorting approach for the SimpleStemmer.

7.6.2 Memory-Centric Architecture

Memory constitutes the vast majority of the area of each PE in the proposed framework, upwards of 90%, and nearly all the power, dynamic and static, consumed in the PE. This is primarily due to the relatively low-cost (power and area) datapath and control logic. As such, the accelerator will benefit greatly from the implementation of novel, non-volatile memory technologies such as RRAM or STTRAM. Many of these technologies promise lower power, faster access times, smaller footprints, and greater robustness and reliability, for both traditional storage and content addressable memories (92). As these technologies mature, there is a huge potential for improvement in on-chip PE density that will not be limited by the size, complexity, or power requirements of the datapath or interconnect.

7.6.3 In-Memory Computing

The proposed accelerator architecture can be situated in one of two locations for effective operation. In one case, it can be located at the interface between the last level of memory and the main memory, as previously described. This is a similar setup as the emulation platform, which has the accelerator between the last level (flash memory) storage and the host CPU on a separate board. Alternatively, it may be possible to augment last level memory devices in the future to contain the processing elements themselves, a feature which would entirely remove the need for external data movement. Such an architecture has been previously proposed (79; 105) and represents the extreme case of many-core off-chip processing. Note that this scenario would require a complete redesign of the LLM, and does not work with traditional hard disks. In the meantime, the proposed accelerator is designed to be situated at the interface, enabling immediate retrofitting into existing data warehouses, and accelerating data processing with a special datapath, lookup operations, and an interconnect fabric customized to the specific needs of text analytics kernels.

Additionally, this method works with all LLM technologies, regardless of the underlying storage mechanism, because the disk's I/O interface provides a level of abstraction from the lower level LLM implementation details. This fulfills one of the requirements for a readily-adoptable text acceleration platform.

The second of these requirements is realized by the nature of the accelerator/OS interface. In order to perform the necessary operations, the proposed accelerator can interface between the OS and LLM using the existing connections, as shown in Fig. 7-2(a). Minimizing the interface overhead is critical for widespread adoption of the framework. As it sits between the main memory and last level storage, the accelerator will need to relay requests from the OS to the LLM in normal operation, making a separate connection from the LLM to the system bus unnecessary. It is also assumed that the data for processing will generally be stored on dedicated disks, not the system disks or boot drive, and therefore the OS will not typically need to access data in parallel. To initiate accelerator operation, kernels identified by the programmer, as well as any supplementary information such as URIs to files or folders for processing, are sent as a SATA DMA_WRITE command to a special region of the LLM (i.e. D:\.accel). This region is known a priori by the accelerator, and rather than being relayed to the LLM, the command is intercepted and written to registers in the accelerator which configure the operation. Because the OS is expecting an asynchronous DMA operation, it can continue its own tasks while the accelerator configures itself and performs the required functions on the data stored in the LLM. In the case where the CPU needs to access data on a disk currently in use by the accelerator, the CPU can preempt the accelerator once the SATA command FIFO (in the accelerator) is full. A batch relay of commands will occur, including transfer of any required data to the main memory. Thus, the interfacing is accomplished with minimal software overhead, a small modification to the compiler, and no further OS intervention.

7.6.4 Extensions to Other Languages

As presented, the accelerator is a proof-of-concept for off-chip text analytics acceleration. While the current design operates best on English language documents represented by 8-bit/character ASCII/UTF-8, it is feasible to extend the accelerator to other languages. However, there are still some specific requirements which must be met to use the accelerator in its current state for other languages. Specifically, the target language must be easily tokenizable. The use of the Latin alphabet is not a strict requirement, as long as only one language is input at a time and the target language contains fewer than 256 characters in its alphabet. Through the use of lookup tables, the data engine can itself be made reconfigurable and capable of mapping arbitrary symbols of one or more bytes (decoded from UTF-8) to an 8 bit representation. While this may not make a truly language independent accelerator, a large number of languages meet the above requirements, and so could be used with the framework.

In addition, for languages other than English where the truncation approximation to stemming is not appropriate, or if a more accurate result for English is desired, the framework will require a larger CAM resource. For example, a 128 bit CAM can be used to store 16 8-bit characters in their (non-stemmed) form. Afterwards, the host can perform the desired stemming operation on the individual terms, and merge the counts. This will greatly reduce the overhead for performing software stemming, making this technique viable for a number of languages. For languages with very long average word length or compound words, special rules (e.g. omitting every nth character from CAM storage), when paired with post-analysis error correction on the host, can increase the storage capacity, though such a scheme must also handle the risk of conflating some tokens.

7.6.5 Application Scope

While the proposed framework shows great promise for text analytics applications, it is important to note that some of these kernels are not strictly limited to text analytics, and can actually be employed in other domains. For example, the frequency counting can be applied to counting the occurrence of numbers using the same CAM/SRAM array. Any binary value, whether it represents an integer, a fixed point, or floating point value up to the CAM width, can be counted. This has implications in many analysis applications, including basic statistics on large datasets. Alternatively, the frequency count functionality can be applied to genetics and bioinformatics for counting the occurrence of DNA sequences, for example, with sequences of interest encoded as one of 2^30 combinations. Additionally, the classification capabilities are not restricted to documents, and can be applied to many application domains, including healthcare decision support systems, medical informatics, bioinformatics, network security, and many other domains in which a predictive model is required. Finally, as a reconfigurable accelerator, the framework is capable of supporting novel applications as they arise, without the need for updating or modifying the underlying hardware, making this a robust and adaptive solution to current and future problems.

7.7 Summary

This chapter has presented a novel reconfigurable computing framework tailored towards text analytics. The framework consists of a large number of ultra-lightweight, interconnected processors with custom functional units that support rapid execution of text analytics-inspired kernel applications. The functionality was demonstrated using an FPGA-based hardware emulation platform. Synthesis results of the framework at the 32nm technology node demonstrate iso-area throughput and energy efficiency improvements over both CPU and GPU for a variety of kernels. A case study was presented which shows how such an accelerator could improve the performance and energy efficiency in a real-world application.

Future work will first focus on improving the on-chip data movement to reduce the number of cycles required in larger, non-preprocessing type applications such as classification. We will explore the feasibility of integrating higher level compression/decompression hardware to further reduce data transfer energy and latency and support compressed database formats. We will explore different interconnect architectures to improve scalability of the framework. Finally, we will implement the accelerator in custom hardware and interface with a large capacity SSD and host CPU system to demonstrate real-world hardware retrofitting.

Figure 7-7. FPGA-based emulation platform for the text mining accelerator. A) System level architecture of the accelerator emulation platform. A Terasic DE4 FPGA board emulates the last level memory flash storage with built-in accelerator, while a Terasic DE0 houses a host CPU with large main memory (SDRAM), its own flash storage (CFI Flash), two timers (CPU and high resolution), and finally a JTAG/UART port for terminal communication. B) Photograph of the emulation platform, showing the DE4, DE0, and the interconnect via general purpose I/O ports (photograph courtesy of author).

Figure 7-8. Comparison of the A) throughput (Mbps) and B) energy (J) for the four application kernels among the four platforms

Figure 7-9. Comparison of Energy Delay Product (EDP) for the application kernels among the four platforms

Figure 7-10. Comparison of the transfer and compute energy among the four platforms

Figure 7-11. Iso-area comparison (per mm²) of data processing throughput (Mbps)

CHAPTER 8 SECURE RECONFIGURABLE COMPUTING ARCHITECTURE

In the first half of this dissertation, I described a number of techniques for diversifying hardware architectures in both modern and next-generation systems. In the second half, I described how domain-specific customizations to a spatio-temporal computing fabric can provide significant improvements in scalability and energy-efficiency compared to other commodity platforms. As a spatio-temporal platform, the MAHA architecture has aspects of both FPGAs and CPUs, and suffers from some of the same vulnerabilities due to hardware homogeneity. Chapters 3, 4, and 5 described these security issues in the context of IoT, and proposed solutions for securing these systems against attack.

While some of the security techniques already proposed will have an analog in the MAHA framework, certain hardware properties provide additional security in different ways. This chapter focuses on the components of MAHA that can be diversified, and discusses the security implications of these changes. Portions of this chapter have been accepted for publication* in IEEE Embedded Systems Letters (ESL).

8.1 Combining Diversity Techniques for MAHA

The previous studies on diversification extend naturally to the MAHA framework, which, as a spatio-temporal framework, has properties of both FPGAs and microcontrollers. The next step is therefore to apply the most effective of these techniques to create a secure MAHA framework. Three such techniques, namely Instruction Set Randomization, Instruction Order Randomization, and Interconnect Randomization, have been applied to MAHA as an FPGA overlay (162), but not as a standalone device.

Implementing diversity on an overlay is significantly less challenging, because it is possible to make these and other modifications to the RTL, or even directly to the bitstream if the format is known. However, as a standalone architecture, there are additional considerations on how to implement the diversity, which will be discussed here.

* G. Stitt, R. Karam, K. Yang, and S. Bhunia. ``A Uniquified Virtualization Approach to Hardware Security,'' IEEE Embedded Systems Letters (ESL), 2017.

A permutation-network based approach, similar to that employed in Chapter 5, enables instructions to be permuted as they are fetched from the schedule table. Only fine-grained permutation is applicable here, since only one instruction is fetched at a time. However, it is possible to have every MLB in a given device use a different input to a local permutation network, greatly increasing the number of possible encodings within a single device.

While ISR has been previously applied to microprocessors, IOR is much more difficult to realize in a typical system due to the requirement of instruction caching. In the FPGA overlay implementation, instruction caching was not an issue because the schedule table size could be modified to match the application requirements, and applications could be mapped spatially, using additional MLBs as necessary, only limited by the number of physical resources on the FPGA. In a standalone system, it is feasible to have a large array of MLBs, due to the highly scalable interconnect described in previous chapters. Therefore, just as an FPGA may come in different sizes for different applications, the same can be said about a secure MAHA architecture. This has the added benefit of providing scalable security with increasing application size.

Finally, the ICR implementation depends on the particular interconnect architecture. As previously described, different interconnects work better for different applications. For example, the 8-PE cluster in the text analytics accelerator required a costly intra-cluster interconnect, which was only justified due to the high intra-cluster communication requirements in text mining applications. The secure MAHA framework instead uses the 4 PE cluster, 2D cluster mesh interconnect from the general analytics accelerator. The caveat to using ICR is that the inter-MLB communication network is different among all devices, which raises questions on application mappability as well as meeting timing and bandwidth requirements across all devices. These issues, as well as the security implications of this choice, are discussed in Section 8.3.

Figure 8-1. Securing the MAHA architecture using architectural diversification. A) Overview of the MAHA architecture, with B) a set of interconnected processing elements. C) Individual processing elements and the interconnect fabric D-G) are highly customizable, which can be exploited for security.

8.2 Implementation Details

These uniquification approaches can be implemented within each MLB by making small modifications to the MLB structure, shown in Fig. 8-1(A). In the overlay implementation, ISR was achieved by either using a permutation network to permute the order of inputs to certain functional units, which did not require resynthesis of the overlay, but did increase area and delay, or the encoding was uniquified at the RTL level, which did not affect timing, but did require resynthesis. In the standalone version, a permutation network is required, since this will allow the encoding to potentially change with time.

This provides the powerful moving target defense as described in Chapters 3 and 4.

IOR can be implemented in a number of ways. The important aspect of IOR is to introduce randomness into the instruction order sequence, such that no two MLBs have the same program counter activity, and no two devices share the same instruction ordering among all MLBs. While this can be achieved using a cryptographically secure sequence generator such as a stream cipher, it is sufficient in this case to use a maximum period linear (or nonlinear) feedback shift register ((N)LFSR). There are 2^(2^(n-1) - n + 1) different n-bit NLFSRs (163), providing a source of diversity which complements the other randomization approaches.
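A minimal sketch of such a sequence generator is given below, using a standard maximal-period 8-bit Galois LFSR (period 255) to walk a 256-entry schedule table; the tap mask, seed handling, and function names are illustrative assumptions, and slot 0 would be handled separately in practice.

\begin{verbatim}
#include <stdint.h>

/* Illustrative per-MLB program counter sequence generator: an 8-bit
 * Galois LFSR with tap mask 0xB8 (x^8 + x^6 + x^5 + x^4 + 1), which is
 * maximal with period 255. The seed provides the per-device rotation
 * discussed above. Sketch only; not the actual MLB hardware.           */
uint8_t lfsr_next(uint8_t state)
{
    uint8_t lsb = state & 1u;
    state >>= 1;
    if (lsb)
        state ^= 0xB8;
    return state;
}

/* Walk the schedule table in the shuffled order (slot 0 excluded here). */
void run_schedule(void (*issue)(uint8_t pc), uint8_t seed)
{
    uint8_t pc = seed ? seed : 1;   /* avoid the all-zero lock-up state */
    for (int i = 0; i < 255; i++) {
        issue(pc);                  /* fetch/execute schedule slot 'pc' */
        pc = lfsr_next(pc);
    }
}
\end{verbatim}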

Finally, ICR can be implemented by beginning with a fully-connected network of MLBs at the cluster level, and a two dimensional mesh of clusters, and then cutting specific connections. This is shown for a 4-PE cluster in Fig. 8-1D-G. Once again, this raises questions on whether applications can actually be mapped if too many connections are cut, and how the performance between different devices may differ if one has a less restrictive interconnection network than the other. In general, the goal is that, as long as the mapping tool which generates the MLB configurations is cognizant of the modifications (ISR, IOR, and ICR) and timing requirements, it can create a functionally correct and latency-aware mapping with little variation in performance across all uniquified overlays.

Finally, because the configuration for the secure platform is not portable among devices, this approach realizes the goal of ``node-locking'' a configuration file to a single device. In other words, an attacker will not be able to generate a valid configuration for more than one device, and OEMs will need to follow the typical remote upgrade procedure outlined in Chapters 3 and 4 for in-field upgrades. Thus, the secure framework realizes the best properties of the FPGA diversity and microprocessor diversity.

8.3 Security Analysis

This section presents an analysis of the impact on system security from the proposed diversification approaches.

8.3.1 Security against Brute Force Attacks

Quantitatively, the level of security against brute force attacks is defined as the number of attempts required to reverse engineer the IP mapped to the overlay. Security due to ISR can be borrowed from Chapter 5, but IOR and ICR require additional analysis in the standalone secure MAHA framework.

For ISR, the number of possible per-MLB encodings can be expressed using the binomial coefficient

C_{ISR} = \binom{n}{r}^{2} = \left( \frac{n!}{(r!)(n - r)!} \right)^{2}        (8-1)

where n is the number of bits in the instruction, and r is the number of 1's present in the input (or equivalently, the number of 0's). Note that the result is squared, because there are two such networks operating in parallel with different configurations, since the instructions are 32 bits wide. This is the worst-case number of combinations for an attacker, and will be used in the rest of the analysis.

In addition, the placement of the instructions, as in, which MLB has which permutation configuration, matters. For example, if the framework has two MLBs, A and B, the way the instructions in each MLB will be permuted depends on the configuration of each MLB's permutation network. If the instructions intended for MLB-A are placed into MLB-B, the resulting encodings will not be correct, and the application will not execute properly.
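For a sense of scale (an illustrative evaluation, not a figure quoted in the original text), taking each 16-bit half of a 32-bit instruction with r = 8 gives

C_{ISR} = \binom{16}{8}^{2} = 12870^{2} \approx 1.66\times10^{8} \approx 2^{27.3},

before the instruction-order and placement factors discussed next are taken into account.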

Meanwhile, for the m instructions in each schedule table, there are m! possible orderings. However, realizing all m! orderings will be very expensive in hardware. One option is to use a sequence generator in each MLB that will produce a different program counter sequence at startup, and an 8-bit LUT will translate the normal PC sequence to one for that particular MLB. While this is feasible, a dynamically loaded sequence will be in the critical path, and would severely limit performance. A sequence generator, such as an NLFSR, could also be used, but this limits the diversity since starting at a different point in the sequence only provides a rotation, not a permutation. In other words, if the attacker can determine the sequence, the only remaining question is where in the sequence the first instruction is located. For a schedule table with m = 256 entries, this can be in 1 of 256 locations for each of the MLBs in the system. While this does limit the diversity if used alone, when paired with ISR, it provides a low overhead alternative that is viable for security. And, just as the MLB instruction placement matters for ISR, a similar argument can be made for IOR -- if instructions in MLB-A are placed in MLB-B, and the program counter sequence generated by MLB-A differs from that generated by MLB-B, then the instructions will not execute in the proper order, and the application will not execute properly. Thus, for each individual MLB, the number of possible configurations is

C_{MLB}(n, r, m) = C_{ISR}(n, r) \times m = \left( \frac{n!}{(r!)(n - r)!} \right)^{2} \times m        (8-2)

Furthermore, the particular implementations of ISR and IOR used can differ among all K MLBs in the system. This impacts the security from IOR, since the PC sequence for each MLB can now begin in 1 of 256 ways. During application mapping, nodes from the control/dataflow graph (CDFG) are distributed among all available MLBs. A functionally correct mapping can only be realized with proper execution within each MLB, so the overall security from these two approaches also depends on K, the number of MLBs:

C_{MAHA-1}(n, r, m, K) = \left( \frac{n!}{(r!)(n - r)!} \right)^{2} \times \frac{m!}{(K!)(m - K)!} \times K!        (8-3)

This expression assumes that the size of the schedule table and the instruction widths are constant across all MLBs. The worst-case combinations for C_{ISR} are n = 16 and r = 8; with a schedule table size of m = 64 and K = 4 fully-connected MLBs, this yields roughly 2^81.8 possible combinations. With 8 MLBs, this increases to 2^105.2, and with 16 MLBs, this increases to 2^120.3.

Meanwhile, the security provided by ICR depends not only on the number of MLBs, but also on their particular interconnect configuration, variations in which will result in different mappings and different power/performance/area tradeoffs. To demonstrate this, consider K identical, fully connected MLBs, and an application which is perfectly parallelizable. For each application mapping, the total number of possible placements is K!, because for any given mapping, a particular subgraph of the original CDFG may be placed into any MLB, and only the routing, encoded within the instructions, needs to be updated to match. In other words, the fully connected network of identical MLBs is isomorphic, and therefore the placement algorithm is free to assign any subgraph to any MLB. For the purpose of security, the isomorphism is not ideal, because the overlay bitfiles will not differ significantly.

However, other mappings based on different interconnections are possible. This will change not just the routing portion of the instructions, but also the application mapping itself. For example, in Fig. 8-1(e), there exists a path from M1 ↔ M3 → M2 ↔ M0. Compared with an equivalent mapping on Fig. 8-1(d), removal of the M1 → M0 adjacency will either change the mapping entirely, or modify it by using M3 and M2 to pass required data.

Given that different interconnect configurations will result in different mappings, ICR has profound implications for system security through overlay diversification, as long as there exists at least one functionally correct mapping for the various ICR-based interconnect configurations. If this is true, then the total number of possible mappings would be equal to the number of interconnect configurations for K MLBs. Computing this is nontrivial, but is given by S_2(K) = A[K], where A is the OEIS sequence A035512 (164), the 13th term of which is roughly 2^123. For reference, the exponent more than doubles (≈ 2^253) with just five additional MLBs. We assume the digraph is unlabeled, because as with the example of K fully connected MLBs, isomorphic configurations do not contribute significantly to security.

In fact, it can be shown that for every interconnect configuration, there exists at least one functionally correct application mapping, given that the particular configuration satisfies the requirements for a strongly connected digraph, and the per-MLB schedule and LUT memory size constraints are relaxed. By extension, if the interconnect is only weakly connected, this holds as long as there exists a path from the MLB processing the CDFG's primary input (PI) to the MLB processing its primary output (PO), as shown in Fig. 8-1(F) and (G).

To prove this, first consider the case of one MLB (K = 1). By definition, a single MLB is a connected graph, and with sufficient schedule table and lookup table memory, the entire application can be mapped into the single MLB. For K > 1, it follows that either 1) a portion of the application can be parallelized, or 2) that the application is implemented in a pipeline fashion. In the first case, the particular subgraphs can be mapped to any available MLB, as long as partial or intermediate results may be communicated between any two MLBs (even over multiple cycles), which is true if the network is strongly connected. If instead multiple MLBs are used for pipelining, then the application can be divided into sequential subgraphs, each of which can be placed in adjacent MLBs along the direction of the given edge. Thus, pipelining requires only a weakly connected network of MLBs with an extant path from PI to PO.

Therefore, regardless of the application properties, there is at least one functionally correct mapping for every interconnect configuration.

8.3.2 Side-Channel and Known Design Attacks

The typical goal for a side channel attack is to obtain secret information, such as an encryption key, by carefully observing certain time-varying system properties, such as power consumption or electromagnetic radiation. The reason such attacks are effective is that these side channels inadvertently leak information because certain operations take more or less power, depending on whether the bits involved are 1 or 0. By comparison, the permutation networks have the same number of transitions (e.g. from 0 to 1 or 1 to 0) in the input as the output, and the power and timing profile do not depend on the key input. The ability to change the permutation network configuration, the starting point of the instruction order, and the interconnect configuration together provides a moving target defense which makes the framework highly effective against side-channel attacks. In fact, this also works well against known design attacks, since each time an attacker attempts to map a known design to the framework, the placement will effectively appear random, since the instruction encoding, instruction reordering, and physical placement of the instructions in the MLBs will change each time.

8.4 Conclusion

This chapter has presented implementation details for diversification of a standalone MAHA framework. The hardware structures such as permutation networks, sequence generators, and configurable interconnect provide ample opportunities for customization and diversification, leading to robust, low-overhead security against typical attacks in IoT. These techniques join together approaches developed separately for FPGA and microcontroller security, achieving high level, mathematically provable security. In addition, the level of security depends, at least in part, on the number of MLBs that are part of the framework; therefore, if larger designs are also more valuable designs (e.g. contain more IP, more complexity, etc.), they will also be mapped to hardware that intrinsically offers more security. This does not compromise security for smaller designs. Hence, energy-efficiency and scalability refer not only to the application kernel execution, but also to the system security, making this framework very valuable in the future of IoT.

CHAPTER 9 CONCLUSION

The primary goal of this research was to develop an energy-efficient and secure reconfigurable computing architecture. This was accomplished by first understanding how existing reconfigurable computing architectures can be improved, while considering the security implications of design choices. MAHA laid the groundwork for energy efficiency and scalability in reconfigurable computing, but as a general-purpose framework, numerous optimizations in the datapath, functional units, and interconnect remained available for domain customization. Furthermore, domain customization provided a means to begin investigation of how diverse architectures would impact the security landscape if such devices were found in the field. Diversity in FPGA architectures was investigated, and the lessons learned were applied to microcontrollers, and finally the MAHA framework.

9.1 Research Accomplishments

The primary research goal of developing an energy-efficient and secure reconfigurable computing architecture was pursued in a number of stages. Firstly, I investigated the security challenges facing modern reconfigurable computing devices. Such devices are increasingly common in numerous domains, and so require special consideration. It was determined that the configuration file is a weak point in the security protocols, despite prevalent use of bitstream encryption, given the new side channel and physical threats utilized by attackers. Hardware homogeneity was identified as the root cause of the vulnerabilities in these devices, and several advances were made in dealing with homogeneity in a practical manner. On one hand, domain specificity provides some level of diversity, but this is insufficient to protect against attackers wishing to exploit homogeneity among similar devices. True diversity in modern FPGAs was implemented using a secure key generator, such as a physical unclonable function mapped into the FPGA fabric, to provide a device specific configuration key. This key is then used in a modified application mapping flow to implement obfuscation functions alongside the existing application. In order to limit the area, power, and performance overhead of the obfuscation, this required innovations in the usage of empty space in the FPGA LUT resources, which has been termed FPGA dark silicon. This provided a means to diversify existing FPGAs, but only on a logical level. I then developed the design for a next-generation FPGA with actual, post-fabrication, physical mutability, and simulated the functionality using VPR. This showed how physical changes to the architecture, paired with time-varying logical changes, can provide highly robust protection against attacks stemming from hardware homogeneity, such as side-channel attacks, or malicious bitstream propagation.

The techniques developed for FPGA bitstream security can partly be applied to MAHA. However, the MAHA framework does not have purely spatial application mapping, so the same fine-grained LUT obfuscation strategy is not directly applicable. In many ways, MAHA is similar to a large array of interconnected RISC CPUs. Therefore, methods for diversifying a generic RISC microcontroller were explored, again in the context of IoT security. The use of mixed-granular permutation networks and dependent instruction reordering using side-channel attack resistant permutation networks was investigated, and found to provide strong protection against brute-force code generation, side-channel attacks, and known design attacks. These principles were further explored using a secured version of MAHA as an FPGA overlay, which used the FPGA's flexibility to implement a mutable MAHA architecture.

Unlike previous approaches to security, such as the addition of an encryption block, which result in significant area, power, and often performance overhead, the built-in diversity in the secure MAHA architecture offers robust protection against brute force, side channel, and known design attacks, while also preventing the spread of malware among networked devices.

In addition, I developed a more energy-efficient alternative to traditional reconfigurable computing platforms, such as FPGAs, by implementing domain-specific customizations to the MAHA spatio-temporal framework. Two such implementations focused either on general analytics or on a more specific text mining accelerator. General analytics applications cover a broad range of computing workloads and therefore had to remain flexible, limiting potential customizations of the datapath. However, the interface between the last-level memory device and the host processor provided an opportunity for improving accelerator scalability, which is of critical importance for big data applications. This also required innovations in the compiler support for leveraging the available resources for complex analytics and machine learning applications. The text mining accelerator was much more domain-specific and provided a great opportunity for customization. This included an innovative use of content addressable memory for term frequency counting and adjustments to the interconnect hierarchy, which increased the local cluster bandwidth to accommodate the additional data transfer observed in typical text mining applications.
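The CAM-assisted term-frequency counting can be summarized with the behavioural Python sketch below, where a dictionary lookup stands in for the single-cycle parallel CAM search; the class name, capacity, and overflow handling are illustrative assumptions rather than the accelerator's actual microarchitecture.

from collections import defaultdict

class TermFrequencyCAM:
    """Behavioural model of CAM-assisted term-frequency counting: each
    incoming term is searched in the CAM; a hit increments the counter of
    the matching line, a miss writes the term to a free line.  In hardware
    the search is a parallel compare across all lines; a dict models it here."""
    def __init__(self, lines=1024):
        self.lines = lines                 # assumed CAM capacity
        self.table = {}                    # term -> line index
        self.counts = defaultdict(int)     # per-line occurrence counters

    def insert(self, term):
        line = self.table.get(term)        # 'CAM search'
        if line is None:
            if len(self.table) >= self.lines:
                raise RuntimeError("CAM full; a real design would spill to memory")
            line = len(self.table)
            self.table[term] = line        # 'CAM write' on a miss
        self.counts[line] += 1             # increment the matched line's counter

    def frequencies(self):
        return {term: self.counts[line] for term, line in self.table.items()}

cam = TermFrequencyCAM()
for token in "the quick brown fox jumps over the lazy dog the".split():
    cam.insert(token)
print(cam.frequencies()["the"])  # -> 3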

Combining the security benefits of architectural diversity, which relies on hardware structures that are primarily intrinsic to the target framework, with the numerous benefits of the MAHA framework provides the basis for an energy-efficient and secure reconfigurable computing architecture. The specific diversification approaches, learned through investigation of existing commodity architectures, provide design-time customizable security that can be traded off against power and area overhead in a given device.

9.2 Future Work

The work presented here opens numerous research pathways for future work. Firstly, one may investigate how emerging non-volatile nanoscale memories can be used to further improve the efficiency of the MAHA framework, including ways to leverage the unique properties of these memories for tasks such as more efficient learning-on-a-chip or filtering applications.

Secondly, the diversification approaches presented here may be applied to more complex CPUs and SoCs, but numerous challenges remain on both the hardware and software sides. These include OS integration, software interoperability, compatibility with other IPs in an SoC, and several others. A secure cloud service for facilitating device-specific application downloads and software updates will require innovations in numerous areas, and will ultimately be necessary to support true hardware diversity for more complex processor-based devices. In summary, this work has provided a starting point for investigations into the use of secure hardware for the purpose of secure software, which represents a paradigm shift in how future researchers will consider system-level security challenges.

BIOGRAPHICAL SKETCH

Robert Karam received the B.S.E. and M.S. degrees in computer engineering from Case Western Reserve University (Cleveland, OH, USA) in 2012 and 2016, respectively, and the Ph.D. degree in computer engineering from the University of Florida (Gainesville, FL, USA) in 2017. His research focuses on hardware security for the Internet of Things (IoT), energy-efficient and domain-specific reconfigurable computing architectures, and algorithm/hardware co-design for ultra-constrained bioimplantable systems.

During his graduate career, Robert authored or co-authored over 20 peer-reviewed articles and abstracts in top international journals and premier conferences, and served as a reviewer for several IEEE and ACM publications. He received the Best Paper Award at the 2016 IEEE Biomedical Circuits and Systems (BioCAS) Conference, the NSF Award for Young Professionals Contributing to Smart and Connected Health at the 2016 IEEE Engineering in Medicine and Biology Conference (EMBC), and the 2016 Attributes of a Gator Engineer Award for Professional Excellence from the University of Florida Herbert Wertheim College of Engineering.
