NETLIST SECURITY ALGORITHM ACCELERATION USING OPENCL ON FPGAS

Thesis

Submitted to

The School of Engineering of the

UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for

The Degree of

Master of Science in Computer Engineering

By

Nicholas Michael Pelini

UNIVERSITY OF DAYTON

Dayton, Ohio

August, 2017

NETLIST SECURITY ALGORITHM ACCELERATION USING OPENCL ON FPGAS

Name: Pelini, Nicholas Michael

APPROVED BY:

Eric Balster, Ph.D.
Advisor Committee Chairman
Associate Professor, Department of Electrical and Computer Engineering

Frank Scarpino, Ph.D.
Committee Member
Professor, Department of Electrical and Computer Engineering

John Weber, Ph.D.
Committee Member
Professor, Department of Electrical and Computer Engineering

Robert J. Wilkens, Ph.D., P.E.
Assoc. Dean for Research & Innovation
Professor, School of Engineering

Eddy M. Rojas, Ph.D., M.A., P.E.
Dean, School of Engineering

Copyright by

Nicholas Michael Pelini

All rights reserved

2017

ABSTRACT

NETLIST SECURITY ALGORITHM ACCELERATION USING OPENCL ON FPGAS

Name: Pelini, Nicholas Michael

University of Dayton

Advisor: Dr. Eric Balster

Integrated circuits continue to grow in transistor count and design complexity. Production of many of these components is also outsourced to facilities in a number of countries. Therefore, there is a need to ensure that all parts within a system are reliable and free from modification. Verification tools must be able to assess circuits down to the gate level and must also scale to complex designs. In response to this problem, an accelerated version of verification software is proposed to determine whether a manufacturer design is the same as a known reference design by comparing the circuits’ netlists. Optimizations are made to the Python code, and an FPGA hardware-accelerated version of the code is created using OpenCL. Results of the OpenCL implementation show an 18x to 24x speedup across various netlists. Additionally, a netlist previously too large for the verification tools to run can be tested by the OpenCL algorithm.

ACKNOWLEDGMENTS

I want to thank everyone who supported me over the years in my life and education. I would especially like to thank the following people:

• Dr. Eric Balster: For serving as my adviser during graduate school and serving on my thesis committee.

• Kerry Hill, Chris Taylor, and Air Force Research Laboratory: For providing funding and research to make this thesis possible.

• Andrew Kordik, Jonathon Skeans, and everyone at University of Dayton Research Institute Sensor APEX: For introducing me to OpenCL and answering my questions.

• Dr. Frank Scarpino: For serving on my thesis committee.

• Dr. John Weber: For serving on my thesis committee.

• My family: For always supporting me.

TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGMENTS

LIST OF FIGURES

LIST OF TABLES

I. INTRODUCTION

1.1 Starting Code and Approach
1.2 Thesis Objective and Organization

II. CIRCUIT BACKGROUND

2.1 Integrated Circuit Security
2.2 Netlist Description
2.3 Circuit Component Background

III. SOFTWARE AND HARDWARE BACKGROUND

3.1 Python
3.2 Stratix V FPGA
3.3 OpenCL
3.4 Ctypes Package

IV. INTEGRATED CIRCUIT VERIFICATION SOFTWARE

4.1 Read Netlists and DFFs
4.2 Fan In
4.3 Fan Out
4.4 Comparison of Fan In and Fan Out Signatures

V. PYTHON OPTIMIZATIONS

5.1 Flatten Gate List into Separate Tuple Lists
5.2 Hashing Strings into Integers
5.3 Python Optimization Results

VI. OPENCL IMPLEMENTATION

6.1 General Algorithm Overview
6.2 Storing Inputs / Outputs in Memory
6.3 Fan In OpenCL
6.4 Fan Out OpenCL
6.5 Matching Golden Netlist to Manufacturer Netlist

VII. RESULTS

7.1 Timings of FPC Netlist
7.2 Timings of FPU Netlist
7.3 Timings of RISC Processor Netlist
7.4 Timings of Four Core RISC Processor Netlist

VIII. CONCLUSIONS AND FUTURE WORK

8.1 Conclusions
8.2 Future Work

BIBLIOGRAPHY

LIST OF FIGURES

2.1 Diagram of One Gate in a Test Netlist

2.2 Circuit Symbol for a D-type Flip-flop [1]

3.1 Diagram of a Stratix V FPGA [2]

4.1 High Level Overview of the Algorithm

4.2 Diagram of the Original Fan In Function

4.3 Diagram of the Original Fan Out Function

5.1 Flattening the Netlist into Separate Lists Improves Efficiency by Removing Nested List Access

5.2 Integer Comparisons Improve Memory Bandwidth and Comparison Speed

5.3 Python Optimizations Resulted in a 9.7x Speedup of the Algorithm

6.1 High Level Overview of the OpenCL Version of the Algorithm

6.2 Diagram of the Fan In Function in the OpenCL Version of the Algorithm

6.3 Diagram of the FPGA Portion of the Fan In Function

6.4 Diagram of the Fan Out Function in the OpenCL Version of the Algorithm

6.5 Diagram of the FPGA Portion of the Fan Out Function

7.1 Timing Comparison of All ICVS Running the FPC Netlist

7.2 Timing Comparison of All ICVS Algorithms Running the FPU Netlist

7.3 Timing Comparison of All ICVS Algorithms Running the RISC Processor Netlist

7.4 Timing of the OpenCL ICVS Implementation Running the Four Core RISC Processor Netlist

LIST OF TABLES

7.1 Average Runtime and Speedup of All ICVS Algorithms Running the FPC Netlist

7.2 Average Runtime and Speedup of All ICVS Algorithms Running the FPU Netlist

7.3 Average Runtime and Speedup of All ICVS Algorithms Running the RISC Netlist

7.4 Average Runtime of the OpenCL ICVS Algorithm Running the Four Core RISC Processor Netlist

CHAPTER I

INTRODUCTION

Electronic systems perform integral roles in many sectors, such as technology, medicine, and government. These systems are responsible for handling sensitive data that must remain secure.

There is a great need to verify that electronic components within a system are reliable and free from alteration. Verification tools must be able to assess circuits down to a low level to certify electronics as reliable and free from modifications. These tools must be developed to perform faster than current capabilities which require manual support on complex designs. Also, improved tools must be scalable to analyze exponentially growing circuit designs.

1.1 Starting Code and Approach

The first implementation of the Integrated Circuit Verification Software (ICVS) is a Python application supplied by the Air Force Research Laboratory (AFRL). The program parses and maps two netlist files: a golden reference design and an unknown design commissioned by a manufacturer. The netlists are text files that detail every component in an electronic circuit along with the component’s input and output connections. Bioinformatics algorithms used in deoxyribonucleic acid (DNA) sequencing are applied to accomplish the mapping of the netlists. Two functions named fan in and fan out generate signatures for each data (delay) flip-flop (DFF) by performing a breadth-first search of depth two in the netlist. This search begins at each DFF and builds a signature containing all gates connected to the DFF within two levels. The golden and unknown netlists’ signatures for each DFF are then compared to provide an initial mapping of the two netlists.

1.2 Thesis Objective and Organization

The objective of this thesis is to accelerate the current Python program and implement a faster, more scalable version using the Open Computing Language (OpenCL) framework. The field-programmable gate array (FPGA) chosen to run the OpenCL portion of the code is the Intel Stratix V.

Chapter II of this thesis provides an overview of the necessity of circuit security, the test netlists, and relevant circuit components. Chapter III provides an overview of the programming languages and hardware used. Chapter IV walks through the starting code provided by AFRL. Chapter V details the Python optimizations performed to improve the application’s performance. Chapter VI describes the OpenCL accelerated version of the program. Chapter VII presents the results, comparing the runtimes of all versions of the application. Lastly, Chapter VIII offers conclusions and proposes future work to continue the netlist mapping.

CHAPTER II

CIRCUIT BACKGROUND

The ICVS provides a method to ensure a circuit’s reliability and security. As the need for and availability of more complex circuit designs continue to rise, the integrity of these parts must be maintained. Four netlists with information about every component in a circuit are used to test the ICVS. DFFs in the netlist are the basis of the signatures generated to map the golden and unknown netlists to each other.

2.1 Integrated Circuit Security

Modern integrated circuits (ICs) are extremely complex and are composed of billions of transistors. [3] Vulnerabilities potentially exist within these intricate designs that could leak information, allow unauthorized access, or disable a device without the user’s knowledge. [4] Complicating this issue has been the rise of overseas production. These facilities prove difficult to monitor for trust and security. [5]

In many cases, functional testing can reveal a flaw in a modified circuit through an incorrect output. However, purely testing the functionality does not reveal the precise component in the circuit that is the root cause of the behavior modification. Moreover, other vulnerabilities are not detectable by functional testing at all. In these cases, a modified circuit still produces the expected output, which poses a major challenge in identifying circuits with vulnerabilities. For example, a NOR gate inserted into a full adder circuit can generate a superfluous output along with the expected output. [6] Cases like the malicious full adder circuit demonstrate that gate level checking must be performed in order to ensure a circuit is secure.

Due to the growing concern over IC security, verification tools are in development to assess the integrity of ICs returned from manufacturers. These tools must be accurate, quick, and scalable to be able to identify potential modifications in devices whose transistor counts are growing exponentially.

2.2 Netlist Description

Four netlists of varying size have been made available by the Defense Advanced Research Projects Agency (DARPA) Trust Program in order to test the integrity of ICs. These test netlists vary in size and complexity to test the speed and scalability of verification tools. Each netlist consists of a number of logic gates listed in a text file. The gates include AND, OR, NOT, and other logic gates along with data (delay) flip-flops (DFFs). Each gate in the list has four fields: a unique gate name, the gate type, the gate inputs, and the gate outputs. A diagram showcasing one gate in the circuit, or one line of a test netlist, can be seen below in Figure 2.1.

Figure 2.1: Diagram of One Gate in a Test Netlist
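As a minimal sketch of the four fields described above, one gate can be modeled as a small Python record. The text layout of a netlist line is not given here, so the whitespace-separated format with comma-separated nets, the `Gate` type, and the `parse_gate_line` helper are all hypothetical.

```python
from typing import NamedTuple, Tuple


class Gate(NamedTuple):
    name: str                 # unique gate name
    gate_type: str            # e.g. "AND", "OR", "NOT", "DFF"
    inputs: Tuple[str, ...]   # nets that drive the gate
    outputs: Tuple[str, ...]  # nets driven by the gate


def parse_gate_line(line: str) -> Gate:
    # Assumed layout: "<name> <type> <in1,in2,...> <out1,out2,...>"
    name, gate_type, ins, outs = line.split()
    return Gate(name, gate_type, tuple(ins.split(",")), tuple(outs.split(",")))


gate = parse_gate_line("U42 AND n1,n2 n3")
```

A real netlist parser would need to follow the actual file grammar, which may differ considerably from this one-line format.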

The floating point controller (FPC) netlist contains 8,097 gates and 720 DFFs. The floating point unit (FPU) netlist has 10,570 gates and 2,331 DFFs. The reduced instruction set computer (RISC) netlist has 59,269 gates and 10,594 DFFs. The four core RISC netlist is the largest netlist with 384,955 gates and 76,983 DFFs.

2.3 Circuit Component Background

DFFs provide the starting point for the netlist mapping. All DFFs have two inputs: a data input and a clock. The DFF is also treated as having just one output, because NOR or NAND gates within the DFF, combined with cycles, can provide the negated output. [7] An example of a DFF can be seen below in Figure 2.2.

Figure 2.2: Circuit Symbol for a D-type Flip-flop [1]

For the purpose of generating signatures from the netlist, the clock input is ignored and the DFF is essentially treated as a gate with one input and one output. This fact plays an important role when beginning the fan in and fan out functions. At the start of the first level, only one input or output needs to be examined because the DFF is the origin, but later in the process gates potentially have more than one input that needs to be examined.

CHAPTER III

SOFTWARE AND HARDWARE BACKGROUND

The ICVS is created using the Python programming language. Python’s flexibility and ease of use provide many benefits such as fast development time and efficient data structures. Python also has the benefit of access to a vast number of packages such as Ctypes. Ctypes provides a method to interface Python data to C and in turn to an FPGA through the use of OpenCL. The ICVS employs a Stratix V FPGA in order to accelerate performance.

3.1 Python

Python is an interpreted language and the primary programming language of this project. The language provides a faster development time than other languages such as C because there is no compile-time checking before runtime. Some features that set Python apart from other languages include no type declarations, code grouping by whitespace indentation instead of curly braces, and automatic memory management.

Python’s features provide a number of developmental advantages. First, the reduced syntax makes code easier to read and write. Combined with the lack of a compile-time check, Python code is quicker to write and get running. Also, features such as list comprehensions and dictionaries provide an easy way to parse file data into a fast data structure. List comprehensions are a concise method to operate on each element in a list, and dictionaries are a data structure with unordered key: value pairs. Lookups in Python dictionaries are constant time on average, O(1), compared to list lookups which are O(n). [8]
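The difference between the two lookups can be illustrated with a short sketch. The gate names and types below are made-up sample data, not taken from the test netlists.

```python
gate_names = ["U1", "U2", "U3", "U4"]
gate_types = ["AND", "OR", "NOT", "DFF"]

# List lookup: index() scans the elements one by one, O(n) in the worst case.
pos = gate_names.index("U3")
type_from_list = gate_types[pos]

# Dictionary lookup: the key is hashed, O(1) on average.
type_by_name = {name: gtype for name, gtype in zip(gate_names, gate_types)}
type_from_dict = type_by_name["U3"]
```

Both lookups return the same gate type; only the cost of finding it differs as the netlist grows.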

Python also has a few disadvantages worth mentioning. Not having compile-time checking leads to quicker development, but this can also increase errors encountered during runtime. Errors that would have been detected by a compiler can surface abruptly in the middle of runtime, which can cost development time. Another issue that can occur in Python is operating on incompatible data types. Variables are not explicitly given a data type as in some other languages, so caution needs to be taken to ensure operations are performed on compatible types or runtime errors will occur. Lastly, Python has great readability and simplicity compared to a language such as C, but this can also be a negative because it can result in slower execution due to the extra storage needed to record each datum’s type and a lack of primitives. However, Python is growing in popularity for scientific applications because packages such as numpy and scipy have increased Python’s viability in scientific computing. [9]

3.2 Stratix V FPGA

The OpenCL portion of the ICVS targets an Intel Stratix V FPGA. The Stratix V was introduced in 2010 and has 28 nm process technology. The FPGA has integrated transceivers up to 28.05 gigabits per second and is optimized for bandwidth focused tasks. A diagram of the Stratix V can be seen below in Figure 3.1.

Figure 3.1: Diagram of a Stratix V FPGA [2]

3.3 OpenCL

OpenCL is a nonproprietary framework for writing programs that can run on CPUs, GPUs, FPGAs, and other devices in order to accelerate performance. OpenCL is managed by the Khronos Group, with developers from Apple Inc., Intel Corporation, NVIDIA, Advanced Micro Devices, and more. The OpenCL standard is not a standalone language because its data types, structures, and functions are employed on top of C and C++.

OpenCL code is compiled at runtime in order to detect the host device which allows for great portability. The C or C++ portion of the program detects the OpenCL compatible devices. Next, the function which runs on the device, called a kernel, is deployed to the device along with any OpenCL function parameters. The kernel executes the given task and returns the results. [10]

The major advantages of OpenCL are portability, standardized vector processing, and parallel programming. OpenCL code is written once but runs on any compliant hardware, making development much more streamlined. Vendors provide the tools that allow the OpenCL code to run across multiple types of hardware. Standardized vector processing allows OpenCL code to run the vector routines across different devices despite vector instructions normally being specific to a device’s particular vendor. Lastly, OpenCL kernels can be run in parallel across multiple devices. The host machine assigns the devices a program, which contains the kernel functions. Each device has a command queue of instructions given by the host.

OpenCL 1.2 is the version used for this project as later versions such as 2.0 and above do not currently support FPGAs.

3.4 Ctypes Package

Ctypes is an FFI Python package that enables a Python program to interface with a C program. A foreign function interface (FFI) is a method by which a program written in one language can communicate with another language. The FFI allows one language to create data structures and make calls according to the other language’s conventions. The Ctypes package allows functions from dynamic-link libraries (DLLs) and shared libraries to be called in a Python wrapper. [11] The Ctypes package provides a bridge between Python and the OpenCL language.
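To illustrate how Ctypes creates C-compatible data from Python, the sketch below converts a Python list into a C `int` array and obtains a pointer to it. The list contents are sample values, and the library name `libfan.so` and function `run_fan` in the commented lines are hypothetical placeholders rather than the thesis code.

```python
import ctypes

net_ids = [7, 3, 12, 5]

# (c_int * n) is a C array type of n ints; calling it copies the Python values in.
IntArray = ctypes.c_int * len(net_ids)
c_net_ids = IntArray(*net_ids)

# A pointer to the first element is what a C function taking `int *` would receive.
ptr = ctypes.cast(c_net_ids, ctypes.POINTER(ctypes.c_int))

# Loading and calling a shared library would follow this pattern
# (hypothetical names, shown only for orientation):
# lib = ctypes.CDLL("./libfan.so")
# lib.run_fan(ptr, ctypes.c_int(len(net_ids)))

# Read the values back through the pointer to confirm the layout.
round_trip = [ptr[i] for i in range(len(net_ids))]
```

Because the array is a contiguous block of C integers, it can be handed to a C library without any further copying on the Python side.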

CHAPTER IV

INTEGRATED CIRCUIT VERIFICATION SOFTWARE

The ICVS Python code reads in two netlists, parses the netlist data into structures, runs the fan in and fan out functions on both netlists, and performs a comparison analysis of the signatures. This code is the starting point of the project and lays the groundwork for future development and acceleration. The original ICVS was developed by AFRL.

4.1 Read Netlists and DFFs

The starting Python code reads in a netlist and parses the unique gate name, gate type, inputs, and outputs into an array. To simulate a comparison of two matching netlists, the program reverses the input netlist so that the netlist is compared against a reordered copy of itself. This provides a way to test against an exactly matching netlist. As the netlists are parsed, a list is created to track the D flip-flops (DFFs). The DFFs are the starting points for the fan in and fan out functions in the next section of the program. An overview of the algorithm can be seen below in Figure 4.1.

Figure 4.1: High Level Overview of the Algorithm

Once the data is parsed and read into the necessary arrays, the forward and reversed versions of the netlist are separately sent to both the fan in and fan out functions.

4.2 Fan In

The fan in process takes place for each DFF in the netlist. The clock input is ignored while the DFF’s other input serves as the starting point for the function. The input is compared against the entire list of outputs in the netlist to find matches. The gate type of each matched output is noted. This is known as level one of the fan in process.

Level two begins with the gates matched from level one. These gates’ inputs are compared against the entire list of outputs to find matches. The matches’ gate types are again noted. The entire process of the fan in function including the level one and level two processes can be seen below in Figure 4.2.

Figure 4.2: Diagram of the Original Fan In Function

The list of gate types of the matches is known as the fan in signature of the particular starting DFF. The fan in process is repeated for each DFF in the netlist so each one has a signature.
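The two-level fan in search described in this section can be sketched in a few lines of Python. This is an illustrative reconstruction, not the AFRL code: the tiny netlist, the tuple layout, and the convention that a DFF lists its data input before its clock are all assumptions.

```python
# Each gate: (name, type, inputs, outputs). Sample data for illustration only.
netlist = [
    ("G1", "AND", ["a", "b"], ["n1"]),
    ("G2", "NOT", ["c"], ["n2"]),
    ("G3", "OR", ["n1", "n2"], ["n3"]),
    ("F1", "DFF", ["n3", "clk"], ["q"]),
]


def drivers(net, gates):
    """Return every gate that lists `net` among its outputs (linear scan)."""
    return [g for g in gates if net in g[3]]


def fan_in_signature(dff, gates):
    data_input = dff[2][0]                    # the clock input is ignored
    level_one = drivers(data_input, gates)    # gates feeding the DFF
    level_two = [d for g in level_one for net in g[2] for d in drivers(net, gates)]
    return [[g[1] for g in level_one], [g[1] for g in level_two]]


signatures = {g[0]: fan_in_signature(g, netlist) for g in netlist if g[1] == "DFF"}
```

Every call to `drivers` scans the whole gate list, which mirrors the comparison against the entire list of outputs and hints at why runtime grows quickly with netlist size.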

4.3 Fan Out

The fan out process also takes place for each DFF in the netlist. The DFF’s output serves as the starting point for the function. The output is compared against the entire list of inputs in the netlist to find matches. The gate type of each matched input is noted. This is known as level one of the fan out process.

Level two begins with the gates matched from level one. These gates’ outputs are compared against the entire list of inputs to find matches. The matches’ gate types are again noted. The entire process of the fan out function including the level one and level two processes can be seen below in Figure 4.3.

Figure 4.3: Diagram of the Original Fan Out Function

The list of gate types of the matches is known as the fan out signature of the particular starting DFF. The fan out process is repeated for each DFF in the netlist so each one has a signature.

4.4 Comparison of Fan In and Fan Out Signatures

At the end of the fan functions, the signatures are compared to map the DFFs from the two netlists. First, each golden, or reference, fan in signature is compared against every unknown fan in signature. If the fan in signatures match, the fan out signatures are compared. If both the fan in and fan out signatures match, the unknown DFF is appended to a dictionary of possible matches for that golden DFF. The fan process only builds signatures of depth two from the circuit in order to maintain a lower runtime. As a result, however, a golden DFF’s signature can match multiple unknown DFF signatures. The fan process eliminates X percent of matches, but the final goal is to have all golden DFFs mapped to exactly one unknown DFF to ensure the netlists are identical. Thus, DFFs that are not mapped one to one must move on to more thorough testing.
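The comparison step can be sketched as follows. The function name `map_dffs`, the sample signatures, and the use of exact list equality as the matching rule are assumptions made for illustration; the original code may compare signatures differently.

```python
def map_dffs(golden, unknown):
    """golden / unknown: dict of DFF name -> (fan_in_signature, fan_out_signature)."""
    candidates = {}
    for g_name, (g_in, g_out) in golden.items():
        candidates[g_name] = [
            u_name
            for u_name, (u_in, u_out) in unknown.items()
            # `and` short-circuits, so fan out is compared only if fan in matches
            if u_in == g_in and u_out == g_out
        ]
    return candidates


golden = {
    "F1": ([["OR"], ["AND", "NOT"]], [["AND"], ["DFF"]]),
    "F2": ([["NOT"], ["DFF"]], [["OR"], ["DFF"]]),
}
unknown = {
    "X1": ([["NOT"], ["DFF"]], [["OR"], ["DFF"]]),
    "X2": ([["OR"], ["AND", "NOT"]], [["AND"], ["DFF"]]),
    "X3": ([["OR"], ["AND", "NOT"]], [["AND"], ["DFF"]]),
}

candidates = map_dffs(golden, unknown)
# Golden DFFs without exactly one candidate need more thorough testing.
unresolved = [g for g, matches in candidates.items() if len(matches) != 1]
```

In this sample, `F2` maps one to one while `F1` has two candidates, which is the situation that depth-two signatures cannot resolve on their own.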

CHAPTER V

PYTHON OPTIMIZATIONS

The original version of the ICVS accurately processes smaller netlists, but the program is slow and unable to process the largest of the four test netlists. Consequently, an optimized version of the ICVS is necessary to improve efficiency. Two major changes are made to the original ICVS to reduce runtime and prepare the netlist data for an OpenCL implementation. All testing for the ICVS implementations is performed on a server running an Intel Xeon Processor E7-4870 with 256 GB of memory.

5.1 Flatten Gate List into Separate Tuple Lists

One major Python optimization is a change in the data structure in which the netlist data is stored. Initially, the netlist data, which includes the gate type, inputs, and outputs, is parsed into a multi-dimensional array. The format is convenient to code and saves disk space, but proves time inefficient during runtime. In order to increase the program speed, the netlist data is read into flattened list structures. A visual representation of the process can be seen below in Figure 5.1.

Figure 5.1: Flattening the Netlist into Separate Lists Improves Efficiency by Removing Nested List Access

Figure 5.1 depicts how the data is no longer nested; instead, the gate types, inputs, and outputs are separated into their own list structures. The unique gate name is repeated across the multiple lists in order to keep track of which piece of data belongs to which gate. While this increases the memory required to run the program, the speed increase is worth the tradeoff. Removing the nesting also allows for easier migration from a Python only implementation to an OpenCL version.
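A minimal sketch of this flattening step is shown here. The nested sample data and the `(gate name, value)` tuple layout are assumptions chosen to match the description of repeating the gate name across the separate lists.

```python
# Nested form: one record per gate, with inner lists for inputs and outputs.
nested = [
    ("G1", "AND", ["a", "b"], ["n1"]),
    ("G3", "OR", ["n1", "n2"], ["n3"]),
]

# Flattened form: one list per field. Each entry repeats the gate name
# so that every piece of data can still be traced back to its gate.
gate_types = [(name, gtype) for name, gtype, _, _ in nested]
gate_inputs = [(name, net) for name, _, ins, _ in nested for net in ins]
gate_outputs = [(name, net) for name, _, _, outs in nested for net in outs]
```

Searching `gate_outputs` for a net now touches a single flat list instead of indexing into each gate record and then into its inner output list.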

5.2 Hashing Strings into Integers

Another optimization of the Python code is the change from representing data with strings to representing data with integers. As the netlist is parsed at the beginning of the program, the strings read are hashed into integer lists. A comparison of the representation of a string versus an integer can be seen in Figure 5.2 below.

Figure 5.2: Integer Comparisons Improve Memory Bandwidth and Comparison Speed

The shift from a string to integer representation is able to reduce a five byte string down to two bytes. The change is a significant reduction in memory requirements for the program when applied across the entire netlist worth of data. The conversion to integers also enables the code to move from Python to OpenCL.
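One simple way to turn net names into integers is to assign each distinct string a small integer ID through a dictionary, as sketched below. The exact hashing scheme used by the ICVS is not described here, so the `intern_ids` helper and the sequential ID assignment are assumptions.

```python
def intern_ids(strings, table):
    """Map each string to a small integer, reusing the ID of repeated strings."""
    # len(table) is evaluated before setdefault inserts, so new strings
    # receive the next unused ID and known strings keep their existing one.
    return [table.setdefault(s, len(table)) for s in strings]


table = {}
output_nets = intern_ids(["n1", "n2", "n3"], table)
input_nets = intern_ids(["n3", "n1", "clk"], table)
```

Because the same table is shared by both lists, a net that appears as one gate's output and another gate's input receives the same integer, so connectivity can be tested with integer comparisons alone.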

5.3 Python Optimization Results

The Python optimizations to the fan in and fan out algorithms improve the performance of the program while maintaining accuracy. Flattening the netlist and hashing the lists are the two major improvements. The original program and the optimized program are tested on an Intel Xeon Processor E7-4870 system with 32 GB of memory. The changes are purely Python improvements that also aid in streamlining the OpenCL version. Both programs are tested on the system and the runtimes compared. The results for the two versions can be seen below in Figure 5.3.

Figure 5.3: Python Optimizations Resulted in a 9.7x Speedup of the Algorithm

As seen in Figure 5.3, the flattened and hashed version of the program runs substantially quicker than the original program. The original fan program completes processing the FPU controller netlist in 893 seconds. The flattened version of the fan program completes the same netlist in 92 seconds which is approximately a 9.7x speedup. As seen from the runtime results, the optimizations greatly increase the speed at which the fan algorithm performs. These modifications simultaneously prepare the data for a smooth transition to an OpenCL implementation.

CHAPTER VI

OPENCL IMPLEMENTATION

Improvements to the Python ICVS version provide a smooth transition to the OpenCL implementation on a BittWare Stratix V FPGA. The goal of the OpenCL integration is to decrease the ICVS runtime and provide the capability to test larger netlists.

6.1 General Algorithm Overview

An OpenCL kernel is most commonly called from a C program, but the fan program is coded entirely in Python. Thus, a method to make C calls from Python is necessary to create an OpenCL application of the fan algorithm without rewriting the entire program in C. As a solution, the foreign function library Ctypes is utilized to wrap C libraries in Python. Ctypes allows the Python program to call a C shared library. The C library can then call an OpenCL kernel to run and return data through the pipeline back to the Python program. The chain of program calls can be seen below in Figure 6.1.

Figure 6.1: High Level Overview of the OpenCL Version of the Algorithm

The process seen above in Figure 6.1 is performed eight times throughout the course of the OpenCL implementation. Four functions are run: the golden netlist’s fan in, the golden netlist’s fan out, the unknown netlist’s fan in, and the unknown netlist’s fan out. Each of these four functions runs one kernel for the level one process and one for the level two process, giving eight kernel calls. The OpenCL kernel run each time is identical as only the function parameters change.
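The eight calls are simply every combination of netlist, direction, and level. The short sketch below only enumerates those combinations; the tuple layout is an assumption for illustration and no kernel is invoked.

```python
from itertools import product

netlists = ("golden", "unknown")
directions = ("fan in", "fan out")
levels = (1, 2)

# One kernel call per (netlist, direction, level) combination.
kernel_calls = list(product(netlists, directions, levels))
```

Since the kernel itself is identical for all eight calls, only the arguments derived from each combination would differ between invocations.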

6.2 Storing Inputs / Outputs in Memory

Before the program begins, the C shared library is compiled. Once the Python script begins, the library is loaded with a Ctypes call. The entire lists of inputs and outputs read from the netlist are created as C style arrays and stored in the board’s memory. The two fan in function calls make use of the list of outputs, while the two fan out function calls use the list of inputs. These two lists are sent to the board once at the beginning to prevent sending redundant data during the kernel calls. The input and output lists are unchanged throughout the program while the other kernel parameters change depending on the netlist and on whether fan in or fan out is run.

6.3 Fan In OpenCL

The OpenCL fan in function operates similarly to the original Python implementation but with a few notable variations. A graphic of the fan in function running on a host machine in conjunction with an FPGA can be seen below in Figure 6.2.

Figure 6.2: Diagram of the Fan In Function in the OpenCL Version of the Algorithm

First on the host machine, the list of DFFs is traversed and each DFF’s input is obtained and stored into a list. This list of DFF inputs is converted using Ctypes into a C style array and sent to the FPGA along with a double pointer that will return the matches for the current level. From the C shared library, the OpenCL fan kernel is called and begins searching the entire list of outputs for a match to each DFF’s input. A depiction of the fan in function running on the FPGA can be seen below in Figure 6.3.

Figure 6.3: Diagram of the FPGA Portion of the Fan In Function

The entire fan in function runs at once, so a single array holds all the matches. When a match is found, the integer index of the match is stored in the array. Every DFF input will have exactly one output match during the level one process.
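The level one search can be emulated on the host in plain Python to show what the returned array looks like. This is a stand-in for the OpenCL kernel, not the kernel itself; the function name, the sample net IDs, and the placement of the -2 end marker are assumptions based on the description in this section.

```python
END = -2  # marks the end of all matches in the returned array


def fan_level_one(queries, search_list):
    """Host-side emulation: for each query net ID, scan the entire search
    list and record the index of every position that matches."""
    matches = []
    for q in queries:
        for idx, net in enumerate(search_list):
            if net == q:
                matches.append(idx)
    matches.append(END)
    return matches


all_outputs = [10, 11, 12, 13]   # integer net IDs of every gate output
dff_inputs = [12, 10]            # data input net ID of each DFF
level_one = fan_level_one(dff_inputs, all_outputs)
```

With one match per DFF input, the position of each index in the returned array identifies the DFF it belongs to, so no separators are needed at this level.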

Once the level one array of matched indices is returned to the Python program, dictionary lookups are performed to build the DFFs’ signatures. The signatures are two dimensional lists composed of the gate types of the matched indices returned from the kernel. The first element is created with the results of the level one kernel. Each DFF’s input is a gate’s output, and the kernel returns the index of that match. The index is used to perform a lookup in the Python dictionary to determine what type of gate the output belongs to.

While traversing the level one matches, the Python program also starts preparing the level two kernel. The inputs to each matched gate are appended to a new Ctypes C array with a -1 integer separator to maintain grouping by gate. This array of new inputs is sent to the OpenCL kernel where the same fan in process occurs. However, in the level two kernel, the inputs no longer necessarily belong to DFFs, so each gate may have multiple inputs whose grouping must be maintained. Thus, a -1 integer separator is appended between every gate’s group of matched indices. The separator enables one array returned from one kernel to accomplish the entire level two process because the matches maintain their grouping. A -2 integer separator is appended to the end of the array to indicate the end of all matches, just as in the level one kernel.

The level two array is returned to the Python program, and dictionary lookups are once again performed. The list of gate types belonging to the matched outputs is appended into the second element of the fan in signature list. By the end of the level one and level two kernels, each DFF has a signature, the two dimensional list, composed of all gates in the netlist connected to the DFF’s input to a depth of two.
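The separator scheme used for level two can be sketched as a pack, search, and unpack round trip. The three helper functions are hypothetical host-side stand-ins written for illustration, and placing a -1 after every group (rather than strictly between groups) is an assumption.

```python
SEP, END = -1, -2


def pack_groups(groups):
    """Flatten per-gate input lists into one array, closing each group with SEP."""
    flat = []
    for group in groups:
        flat.extend(group)
        flat.append(SEP)
    return flat


def fan_level_two(packed, search_list):
    """Stand-in for the kernel: pass separators through and replace each
    net ID with the indices where it occurs in search_list."""
    result = []
    for value in packed:
        if value == SEP:
            result.append(SEP)
        else:
            result.extend(i for i, net in enumerate(search_list) if net == value)
    result.append(END)
    return result


def unpack_groups(result):
    """Split the returned array back into one list of indices per gate."""
    groups, current = [], []
    for value in result:
        if value == END:
            break
        if value == SEP:
            groups.append(current)
            current = []
        else:
            current.append(value)
    return groups


all_outputs = [10, 11, 12, 13]
packed = pack_groups([[10, 11], [13]])   # inputs of two level one gates
returned = fan_level_two(packed, all_outputs)
grouped = unpack_groups(returned)
```

Negative values work as markers because valid match indices are never negative, which lets a single flat integer array carry the grouping across the Python, C, and kernel boundary.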

6.4 Fan Out OpenCL

The OpenCL fan out function operates similarly to the original Python implementation. The fan out process is essentially the same process as fan in except that the process begins at each DFF’s output. A graphic of the fan out function running on a host machine in conjunction with an FPGA can be seen below in Figure 6.4.

Figure 6.4: Diagram of the Fan Out Function in the OpenCL Version of the Algorithm

First on the host machine, the list of DFFs is traversed and each DFF’s output is obtained and stored into a list. This list of DFF outputs is converted using Ctypes into a C style array and sent to the board along with a double pointer that will return the matches for the current level. From the C shared library, the OpenCL fan kernel is called and begins searching the entire list of inputs for a match to each DFF’s output. A depiction of the fan out function running on the FPGA can be seen below in Figure 6.5.

Figure 6.5: Diagram of the FPGA Portion of the Fan Out Function

The entire fan out function runs at once, so a single array holds all the matches. When a match is found, the integer index of the match is stored in the array. Every DFF output will have exactly one input match during the level one process.

Once the level one array of matched indices is returned to the Python program, dictionary lookups are performed to build the DFFs’ signatures. The signatures are two dimensional lists composed of the gate types of the matched indices returned from the kernel. The first element is created with the results of the level one kernel. Each DFF’s output is a gate’s input, and the kernel returns the index of that match. The index is used to perform a lookup in the Python dictionary to determine what type of gate the input belongs to.

While traversing the level one matches, the Python program also starts preparing the level two kernel. The outputs of each matched gate are appended to a new Ctypes C array with a -1 integer separator to maintain grouping by gate. This array of new outputs is sent to the OpenCL kernel where the same fan out process occurs. A -1 integer separator is appended between every gate’s group of matched indices. The separator enables one array returned from one kernel to accomplish the entire level two process because the matches maintain their grouping. A -2 integer separator is appended to the end of the array to indicate the end of all matches, just as in the level one kernel.

The level two array is returned to the Python program, and dictionary lookups are once again performed. The list of gate types belonging to the matched inputs is appended as the second element of the fan in signatures list. By the end of the level one and level two kernels, each DFF has a signature, the two dimensional list, composed of all gates in the netlist connected to the DFF's input for a depth of two.
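Putting both levels together, a depth-two signature could be assembled as in the sketch below. All names and sample values are hypothetical, and the level two groups are assumed to have already had their integer separators removed.

```python
def build_signatures(level_one, level_two_groups, gate_types):
    """Build one [level one types, level two types] signature per DFF."""
    signatures = []
    for index, group in zip(level_one, level_two_groups):
        first = [gate_types[index]]             # gate driving the DFF input
        second = [gate_types[i] for i in group]  # gates one level further back
        signatures.append([first, second])
    return signatures

# Hypothetical lookup table and kernel results for two DFFs.
gate_types = {0: "NAND2", 1: "INV", 2: "NOR2", 3: "AND2", 4: "XOR2"}
level_one = [3, 1]
level_two_groups = [[0, 4], [2]]

signatures = build_signatures(level_one, level_two_groups, gate_types)
print(signatures)
```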

6.5 Matching Golden Netlist to Manufacturer Netlist

The OpenCL implementation of the program uses the same comparisons and the same data structures as the original code in order to ensure compatibility with future work. A few functions remove the integer separators and place the signatures into the same list structure as the original code. DFFs that are not exactly mapped after the fan process must undergo more thorough testing, so rearranging the data back into the original code's structure keeps the fan results usable by that later testing. After this data conversion, the same comparisons from the original code take place between the golden netlist's and the unknown netlist's fan in and fan out signatures. Both the conversion of the list structures and the comparison of signatures take place entirely in Python.
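One way to picture the comparison step is the sketch below. It is not the original code's comparison: the DFF names are hypothetical, and treating two signatures as equal when each level holds the same gate types regardless of order is an assumption made for illustration.

```python
def normalize(signature):
    """Make a signature hashable and order-insensitive within each level."""
    return tuple(tuple(sorted(level)) for level in signature)

def candidate_matches(golden, unknown):
    """Map each golden DFF to the unknown DFFs that share its signature."""
    by_signature = {}
    for name, signature in unknown.items():
        by_signature.setdefault(normalize(signature), []).append(name)
    return {
        name: by_signature.get(normalize(signature), [])
        for name, signature in golden.items()
    }

# Hypothetical signatures for a golden and an unknown netlist.
golden = {
    "g_dff0": [["AND2"], ["NAND2", "XOR2"]],
    "g_dff1": [["INV"], ["NOR2"]],
}
unknown = {
    "u_dff7": [["AND2"], ["XOR2", "NAND2"]],
    "u_dff8": [["INV"], ["NOR2"]],
    "u_dff9": [["INV"], ["NOR2"]],
}

candidates = candidate_matches(golden, unknown)
print(candidates)
```

In this sample, `g_dff1` keeps two candidates, which is the situation where a DFF is not exactly mapped and needs more thorough testing.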

CHAPTER VII

RESULTS

All testing for the Python applications is performed on a server running an Intel Xeon Processor E7-4870 with 32 GB of memory. The three versions of the code, original, flattened, and OpenCL, run the fan in and fan out process on four netlists of varying size to test for accuracy and speed. In order of increasing size, the netlists are the floating point controller (FPC), floating point unit (FPU), RISC processor, and four core RISC processor. The Python portion of the OpenCL version is performed on the same server setup, while the OpenCL kernel is run on a Stratix V FPGA. Each version of the ICVS is tested on each netlist five times to obtain an average runtime.

7.1 Timings of FPC Netlist

The FPC netlist is the smallest and least complex of the four netlists. Table 7.1 below provides the trial runtimes, the average runtime, and speedup compared to the original code for the FPC netlist.

Table 7.1: Average Runtime and Speedup of All ICVS Algorithms Running the FPC Netlist

Algorithm        Run 1 (s)  Run 2 (s)  Run 3 (s)  Run 4 (s)  Run 5 (s)  Average (s)  Speedup
Original ICVS    203.67     222.51     210.03     217.08     208.83     212.42
Optimized ICVS   19.87      19.97      19.70      20.06      19.68      19.86        10.70
OpenCL ICVS      8.62       8.61       8.71       8.59       8.65       8.64         24.60

A graph comparing the performance of the FPC netlist run by all three algorithms can be seen below in Figure 7.1.

[Bar chart: Time (s) for Original ICVS, Optimized ICVS, and OpenCL ICVS]

Figure 7.1: Timing Comparison of All ICVS Algorithms Running the FPC Netlist

As seen above in Figure 7.1, the OpenCL implementation of the ICVS is the fastest of the three versions. On average, the OpenCL ICVS achieves a 24.6x speedup over the original ICVS.
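The averages and speedups reported for the FPC netlist can be reproduced from the five trial runtimes in Table 7.1. The short sketch below assumes that speedup is defined as the original average runtime divided by each version's average runtime; the variable names are illustrative.

```python
# Trial runtimes in seconds, copied from Table 7.1.
runs = {
    "Original ICVS": [203.67, 222.51, 210.03, 217.08, 208.83],
    "Optimized ICVS": [19.87, 19.97, 19.70, 20.06, 19.68],
    "OpenCL ICVS": [8.62, 8.61, 8.71, 8.59, 8.65],
}

# Mean runtime per version.
averages = {name: sum(times) / len(times) for name, times in runs.items()}

# Speedup relative to the original code (assumed definition).
baseline = averages["Original ICVS"]
speedups = {name: baseline / average for name, average in averages.items()}

for name in runs:
    print(f"{name}: average {averages[name]:.2f} s, speedup {speedups[name]:.2f}x")
```

Dividing the unrounded averages gives roughly 10.70x for the optimized version and 24.60x for the OpenCL version, matching the table.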

7.2 Timings of FPU Netlist

The FPU netlist is the second smallest of the four testing netlists. Table 7.2 below provides the trial runtimes, the average runtime, and speedup compared to the original code for the FPU netlist.

Table 7.2: Average Runtime and Speedup of All ICVS Algorithms Running the FPU Netlist

Algorithm        Run 1 (s)  Run 2 (s)  Run 3 (s)  Run 4 (s)  Run 5 (s)  Average (s)  Speedup
Original ICVS    867.93     910.43     210.03     887.42     903.28     886.34
Optimized ICVS   85.47      85.51      85.66      85.61      85.52      85.55        10.36
OpenCL ICVS      47.69      47.34      47.47      47.39      48.16      47.61        18.62

A graph comparing the performance of the FPU netlist run by all three algorithms can be seen below in Figure 7.2.

[Bar chart: Time (s) for Original ICVS, Optimized ICVS, and OpenCL ICVS]

Figure 7.2: Timing Comparison of All ICVS Algorithms Running the FPU Netlist

As seen above in Figure 7.2, the OpenCL implementation of the ICVS is the fastest of the three versions. On average, the OpenCL ICVS achieves an 18.62x speedup over the original ICVS.

7.3 Timings of RISC Processor Netlist

The RISC processor netlist is the second largest of the four netlists. Table 7.3 below provides the trial runtimes, the average runtime, and speedup compared to the original code for the RISC processor netlist.

Table 7.3: Average Runtime and Speedup of All ICVS Algorithms Running the RISC Netlist

Algorithm        Run 1 (s)  Run 2 (s)  Run 3 (s)  Run 4 (s)  Run 5 (s)  Average (s)  Speedup
Original ICVS    21776.09   20604.43   20741.42   22071.84   20238.32   21086.42
Optimized ICVS   2446.64    3286.43    3301.03    2978.71    2679.18    2938.40      7.18
OpenCL ICVS      1051.52    1049.19    1060.61    1043.84    1066.96    1054.42      20.00

A graph comparing the performance of the RISC processor netlist run by all three algorithms can be seen below in Figure 7.3.

[Bar chart: Time (s), axis scaled by 10^4, for Original ICVS, Optimized ICVS, and OpenCL ICVS]

Figure 7.3: Timing Comparison of All ICVS Algorithms Running the RISC Processor Netlist

As seen above in Figure 7.3, the OpenCL implementation of the ICVS is the fastest of the three versions. On average, the OpenCL ICVS achieves a 20x speedup over the original ICVS.

7.4 Timings of Four Core RISC Processor Netlist

The four core RISC processor netlist is the largest and most complex of the four netlists. Table 7.4 below provides the trial runtimes and the average runtime for the OpenCL implementation. The original and flattened Python code are unable to run the four core RISC processor netlist. Both run for an indefinite amount of time or crash while running.

Table 7.4: Average Runtime of the OpenCL ICVS Algorithm Running the Four Core RISC Processor Netlist

Algorithm        Run 1 (s)  Run 2 (s)  Run 3 (s)  Run 4 (s)  Run 5 (s)  Average (s)
OpenCL ICVS      73353      79012      70642      73216      74536      74151.80

A graph with the performance of the OpenCL algorithm running the four core RISC processor netlist can be seen below in Figure 7.4.

[Bar chart: Time (s), axis scaled by 10^4, for Original ICVS, Optimized ICVS, and OpenCL ICVS]

Figure 7.4: Timing of the OpenCL ICVS Implementation Running the Four Core RISC Processor Netlist

As seen above in Figure 7.4, the OpenCL implementation of the ICVS is the only version of the ICVS to complete processing of the largest netlist. Processing the four core RISC processor netlist takes approximately 20.6 hours on average (74151.80 s).

CHAPTER VIII

CONCLUSIONS AND FUTURE WORK

8.1 Conclusions

The goal of this thesis is to accelerate an application which maps a reference netlist to an unknown netlist to ensure security. To accomplish this, a Python application's performance is improved by flattening list structures and hashing strings to integers to improve comparison speeds. This Python program coordinates with an OpenCL kernel, through the use of Ctypes, to accelerate the comparisons in the fan in and fan out functions. These changes result in an 18x to 24x speedup across multiple netlists. Additionally, the four core RISC processor netlist is able to be processed by the OpenCL version of the code. The original and flattened versions of the code are unable to finish the largest netlist. Finally, the OpenCL version of the code performs the mapping of the two netlists and the comparisons using the same method and the same data structures as the original code. This ensures compatibility in the future with any code that needs the results from the fan in and fan out process.

8.2 Future Work

The fan in and fan out functions provide a quick method to map the DFFs in the golden netlist to the unknown netlist. After the comparisons at the end of the process, a number of golden DFFs are each still potential matches for multiple DFFs in the unknown netlist. Thus, more testing is necessary to obtain a one to one mapping between the netlists' DFFs. This testing comes in the form of a process called a k-mer search, normally a DNA sequencing method. The k-mer search is a much more thorough procedure, which is why the fan in and fan out functions are performed first to greatly narrow potential matches in a timely manner. The k-mer search performance may also benefit from optimization and OpenCL acceleration in the future.

33 BIBLIOGRAPHY

[1] SFUptownMaker. Digital logic. [Online]. Available: https://cdn.sparkfun.com/assets/3/3/c/0/ 9/51e5c089ce395fc915000000.png

[2] [Online]. Available: https://www.altera.com/content/dam/altera-www/global/en US/images/ products/devkits/altera/images/sv gs diagram.jpg

[3] “Inside pascal: Nvidias newest computing platform,” https://devblogs.nvidia.com/ parallelforall/inside-pascal/, accessed: 2017-06-14.

[4] M. Tehranipoor and F. Koushanfar, “A survey of hardware trojan taxonomy and detection,” IEEE design & test of computers, vol. 27, no. 1, 2010.

[5] M. K. Seery, “Complex vlsi feature comparison for commercial microelectronics verification,” AIR FORCE INSTITUTE OF TECHNOLOGY WRIGHT-PATTERSON AFB OH GRADU- ATE SCHOOL OF ENGINEERING AND MANAGEMENT, Tech. Rep., 2014.

[6] L. A. Hsia, M. Y. Lanzerotti, M. K. Seery, and L. Orlando, “Gate-level commercial microelec- tronics verification with standard cell recognition,” in Aerospace and Electronics Conference, NAECON 2014-IEEE National. IEEE, 2014, pp. 374–378.

[7] C. Lin. (2003) Introduction to flip flops: D and t.

[8] (2017) The python tutorial data structures.

[9] W. McKinney et al., “Data structures for statistical computing in python,” in Proceedings of the 9th Python in Science Conference, vol. 445. SciPy Austin, TX, 2010, pp. 51–56.

[10] M. Scarpino, “Opencl in action: how to accelerate graphics and computations,” 2011.

[11] (2017) ctypes a foreign function library for python.

34