TRANSISTOR-LEVEL PROGRAMMABLE FABRIC

by

Jingxiang Tian

APPROVED BY SUPERVISORY COMMITTEE:

______Carl Sechen, Chair

______Yiorgos Makris, Co-Chair

______Benjamin Carrion Schaefer

______William Swartz

Copyright 2019

Jingxiang Tian

All Rights Reserved

TRANSISTOR-LEVEL PROGRAMMABLE FABRIC

by

JINGXIANG TIAN, BS

DISSERTATION

Presented to the Faculty of

The University of Texas at Dallas

in Partial Fulfillment

of the Requirements

for the Degree of

DOCTOR OF PHILOSOPHY IN

ELECTRICAL ENGINEERING

THE UNIVERSITY OF TEXAS AT DALLAS

December 2019

ACKNOWLEDGMENTS

I would like to express my deep sense of gratitude for the consistent and dedicated support of my advisor, Dr. Carl Sechen, throughout my entire PhD study. You have been a great mentor, and your help in my research and career is beyond measure. I sincerely thank my co-advisor, Dr. Yiorgos Makris, who brought ideas and funding to this project. I am grateful that they made it possible for me to work on this amazing research topic that is of great interest to me.

I would also like to thank my committee members, Dr. Benjamin Schaefer and Dr. William Swartz, for serving on my committee and giving me much advice on my research. I would like to give my special thanks to the tech support staff member, Steve Martindell, who patiently helped me a thousand times, if not more.

With the help of so many people, I have become who I am. My parents and my parents-in-law have given me tremendous support. My husband, Tianshi Xie, always stands by my side when I am facing challenges. Ada, my precious little gift, is the motivation and faith that keep me moving forward.

“I didn’t come this far to only come this far.” Graduation is not only the end of my student career, but also the starting point of more possibilities. I am looking forward to uncovering the future.

October 2019


TRANSISTOR-LEVEL PROGRAMMABLE FABRIC

Jingxiang Tian, PhD

The University of Texas at Dallas, 2019

ABSTRACT

Supervising Professors: Carl Sechen, Chair
Yiorgos Makris, Co-Chair

We introduce a CMOS computational fabric consisting of carefully arranged regular rows and columns of transistors which can be individually configured and appropriately interconnected in order to implement a target digital circuit. This TRAnsistor-level Programmable (TRAP) fabric allows simultaneous storage of four independent configurations, along with the ability to dynamically switch between them in a small fraction of a clock cycle. We term this board-level virtualization in that each configuration, in effect, implements an independent chip. TRAP also supports chip-level virtualization, in which a single IC design is partitioned over a set of configurations and the computation cycles from one configuration to the next in the set. This allows a design that requires more computational logic than is physically available on the TRAP chip to be nonetheless executable. TRAP also features rapid partial or full modification of any one of the stored configurations, in a time proportional to the number of modified configuration bits, through the use of hierarchically arranged, high-throughput, pipelined memory buffers. TRAP supports libraries of cells of the same height and variable width, just as in a typical standard cell circuit. We developed a complete Computer-aided Design (CAD) tool flow for programming TRAP chips. A prototype 3mm X 3mm TRAP chip was fabricated using the Global Foundries 55nm process. We show that TRAP has substantially better area efficiency compared to a leading industrial FPGA and would, therefore, be ideal for embedded FPGA (eFPGA) applications.


TABLE OF CONTENTS

ACKNOWLEDGMENTS...... iv

ABSTRACT ...... v

LIST OF FIGURES...... xi

LIST OF TABLES ...... xiv

CHAPTER 1 INTRODUCTION...... 1

1.1 Industrial Architectures ...... 2

1.2 Novel Architectures in Research...... 8

CHAPTER 2 TRAP OVERVIEW ...... 11

2.1 A Brief Introduction to TRAP ...... 11

2.2 TRAP Features and Applications ...... 12

2.2.1 Virtualization ...... 12

2.2.2 Rapid Selective Reprogramming ...... 13

2.2.3 Seamless Transition with ASIC Flow ...... 13

2.2.4 Mixed TRAP and ASIC Design ...... 14

CHAPTER 3 DESIGN DETAIL OF TRAP ...... 15

3.1 Logic Cell & Built-in Gates ...... 15

3.1.1 Logic Cell ...... 15

3.1.2 Built-in Gates ...... 19

3.1.3 Advantages ...... 22

3.2 Board-level Virtualization and Chip-level Virtualization ...... 23

3.2.1 Board-level Virtualization ...... 23

3.2.2 Chip-level Virtualization ...... 24


3.3 Interconnect Architecture ...... 27

3.3.1 Switch ...... 27

3.3.2 Segment ...... 30

3.3.3 Challenge in Routing ...... 32

3.3.4 Area Benefit and Structure Superiority ...... 33

3.4 Local Memory Cell ...... 33

3.4.1 Memory Cell Architecture ...... 33

3.4.2 Dynamic Reconfiguration ...... 35

3.5 Hierarchical Function Blocks ...... 37

3.5.1 Unit...... 37

3.5.2 Group...... 38

3.6 Core Programming ...... 39

3.6.1 Partial/Full Core Programming...... 39

3.6.2 Hierarchical Bi-directional Memory Buffer ...... 40

3.6.3 Asynchronous Pipelined Control ...... 44

3.7 I/O Design and Programming ...... 48

3.7.1 Power Pad Design ...... 48

3.7.2 Digital I/O Pad and Bi-directional I/O Pad Design ...... 49

3.7.3 Metal Track Selecting ...... 53

3.7.4 I/O Programming ...... 55

3.8 Clock Tree Design...... 62

3.9 Power Network Design...... 63

CHAPTER 4 DESIGN DECISIONS ...... 66

4.1 Logic Cell Design Decisions ...... 66


4.2 Programmable Interconnect Design Decisions ...... 67

4.3 Design of Virtualizations ...... 71

4.4 Memory Design Decisions ...... 71

4.5 Core Programming Design Decisions ...... 73

4.6 I/O Programming Design Decision ...... 73

CHAPTER 5 FABRICATION AND PACKAGE ...... 75

5.1 FPTA Design Layout...... 75

5.2 TRAP Design Layout ...... 76

5.3 TRAP Package ...... 79

CHAPTER 6 TEST SETUPS AND TEST RESULTS...... 82

6.1 CAD Tool Flow for TRAP Design ...... 82

6.2 Test Setups ...... 85

6.2.1 Test Setup Version I ...... 85

6.2.2 Test Setup Version II ...... 87

6.3 Test Results ...... 88

6.3.1 FPTA and Commercial FPGA Area Efficiency Comparison...... 89

6.3.2 FO1 Delay ...... 90

6.3.3 Single-cycle Switching Between Configurations ...... 93

6.3.4 Selective Partial Dynamic Reconfiguration ...... 95

6.3.5 Implementation Examples ...... 97

6.3.6 TRAP Area Efficiency Evaluation...... 103

CHAPTER 7 FUTURE WORK ...... 105

7.1 Design Improvements for TRAP ...... 105

7.1.1 One-layer TRAP ...... 105


7.1.2 TRAP as an eFPGA ...... 109

7.2 Routing Algorithm Improvement...... 110

7.3 Design Obfuscation ...... 113

REFERENCES ...... 115

BIOGRAPHICAL SKETCH ...... 120

CURRICULUM VITAE ...... 121


LIST OF FIGURES

Figure 1.1. General FPGA architecture ...... 1

Figure 1.2. Classic LUT architecture ...... 4

Figure 1.3. 7 series CLB structure ...... 5

Figure 1.4. Routing architecture example ...... 6

Figure 1.5. 3D SpaceTime architecture ...... 7

Figure 3.1. Logic cell schematic with multiplex inputs ...... 16

Figure 3.2. I/Os of an LC connect to programmable interconnect ...... 17

Figure 3.3. NAND2 and AOI22 gates implemented with LC ...... 18

Figure 3.4. Built-in DFF ...... 20

Figure 3.5. Built-in FA ...... 21

Figure 3.6. Built-in MUX ...... 22

Figure 3.7. Board-level virtualization ...... 24

Figure 3.8. Chip-level virtualization not using built-in DFF ...... 25

Figure 3.9. Chip-level virtualization using built-in DFF ...... 26

Figure 3.10. Interconnect network ...... 28

Figure 3.11. Switch and half keeper ...... 30

Figure 3.12. (a) Pass transistor, and (b) Bi-directional repeater ...... 31

Figure 3.13. (a) Switch in TRAP, (b) Switch in FPGA ...... 32

Figure 3.14. Memory cell structure ...... 34

Figure 3.15. Unit structure ...... 37

Figure 3.16. Group structure ...... 38


Figure 3.17. Chip programming architecture...... 39

Figure 3.18. L0 memory buffer connecting with the MCs in a unit ...... 41

Figure 3.19. Memory buffers: (a) Data register, (b) Address register ...... 42

Figure 3.20. Muller C element schematic and symbol ...... 44

Figure 3.21. Asynchronous write-in control scheme ...... 45

Figure 3.22. Asynchronous read-out control scheme ...... 47

Figure 3.23. Bi-directional I/O schematic ...... 51

Figure 3.24. Input ESD protection ...... 52

Figure 3.25. RC-clamp for power pad ESD protection ...... 53

Figure 3.26. 16:1 metal track selecting MUX with its programming scheme ...... 54

Figure 3.27. Top-side metal tracks selections for primary I/Os ...... 57

Figure 3.28. Right-side metal tracks selections for primary I/Os ...... 58

Figure 3.29. Bottom-side metal tracks selections for primary I/Os ...... 59

Figure 3.30. Left-side metal tracks selections for primary I/Os ...... 60

Figure 3.31. H-trees clock distribution ...... 63

Figure 4.1. The optimized interconnect pattern ...... 70

Figure 5.1. FPTA design layout ...... 75

Figure 5.2. TRAP layout...... 77

Figure 5.3. TRAP die photo ...... 78

Figure 5.4. PGA132M topside view and underside view ...... 79

Figure 5.5. (a) PGA132M bonding connection to TRAP, (b) I/O pad name labeled corresponding to package ...... 80

Figure 6.1. CAD tool flow for TRAP ...... 84


Figure 6.2. First test setup: FPGA-PCB-TRAP ...... 86

Figure 6.3. Second test setup: FPGA-PCB-TRAP ...... 87

Figure 6.4. FO1 delay measurement ...... 91

Figure 6.5. (a) Up counter, (b) Down counter ...... 93

Figure 6.6. Single cycle switching between configurations ...... 94

Figure 6.7. Selective partial reconfiguration ...... 96

Figure 6.8. FrontPanel SDK ...... 97

Figure 6.9. FrontPanel GUI ...... 98

Figure 6.10. GCD calculator interface ...... 99

Figure 6.11. (a) GUI for the multiplier implementation, (b) A list of multiplier configurations 100

Figure 6.12. (a) GUI for the branch jumper implementation, (b) Truth table for the branch jumper ...... 101

Figure 6.13. (a) GUI for the decoder implementation, (b) Truth table for the decoder ...... 102

Figure 6.14. Overhead of TRAP & FPGA over ASIC ...... 103

Figure 7.1. New Built-in DFF ...... 106

Figure 7.2. A shift-register cell ...... 107

Figure 7.3. A routing path with faulty result ...... 111

Figure 7.4. An optimized routing path ...... 111

Figure 7.5. Sources of uncertainty in TRAP-based obfuscation ...... 113

Figure 7.6. TRAP-based design obfuscation ...... 114


LIST OF TABLES

Table 1.1. Comparison of programming technologies ...... 3

Table 3.1. The site coordinates for the programmable I/Os in VPR ...... 61

Table 5.1. TRAP chip summary...... 76

Table 6.1. Area utilization compared to a commercial FPGA ...... 89

Table 6.2. FO1 delay measurement from HSPICE and fabric ...... 92


CHAPTER 1

INTRODUCTION

In the past decades, Field-Programmable Gate Arrays (FPGAs) and other programmable devices have been particularly popular. Benefiting from programmability and reconfigurability, they have been widely used to prototype new designs and have even been directly marketed as a System on Chip (SoC). Benefiting from this reconfigurability, embedded FPGAs (or eFPGAs) have enabled SoCs to have flexibility in critical areas where algorithm, protocol, or market needs change frequently. The premium that companies pay for a programmable chip is more affordable than the development resources spent on creating an Application-Specific Integrated Circuit (ASIC) counterpart.

Figure 1.1. General FPGA architecture

However, FPGAs have very significant overhead due to their configurability. The common FPGA architecture consists of three core parts: configurable logic blocks, programmable interconnect, and programmable I/Os. The configurable logic blocks, which contain the common Look-Up Tables (LUTs), can be entirely allocated to relatively simple logic functions; this structure underutilizes transistor resources. The programmable interconnect incurs much more delay than the corresponding interconnect in an ASIC. In an ASIC design, the I/Os are clearly defined and connected, whereas in an FPGA the programmable I/Os consume much more area in order to support flexibility. All of these architectural features in FPGAs, which enable reconfiguration, also result in more area, lower speed, and higher power consumption compared to an ASIC. FPGAs typically require 20X more silicon area, 3X larger delay, and 10X more dynamic power consumption compared with a functionally equivalent ASIC [1].

1.1 Industrial Architectures

There are three competing technologies for programming FPGAs. SRAM-based FPGAs have the highest density, while flash-based and antifuse-based FPGAs consume less power. Antifuse-based FPGAs are programmed only once, obviously are not reconfigurable, and require a complex fabrication process [1]. Microsemi (previously Actel), acquired by Microchip, produces antifuse-based and flash-based mixed-signal FPGAs [2] [3]. Lattice Semiconductor, which acquired SiliconBlue Technologies, provides low-power SRAM-based FPGAs [4]. QuickLogic provides SRAM FPGA families and antifuse FPGA families [5] [6]. Achronix supplies SRAM-based FPGAs and embedded FPGAs [7]. The leaders of the FPGA market, Xilinx and Altera (acquired by Intel), also use SRAM-based technology. SRAM-based technology uses a standard CMOS manufacturing process, while antifuse and flash-based technologies require more expensive fabrication processes. A comparison of these three FPGA technologies is shown in Table 1.1 [1].


Table 1.1. Comparison of programming technologies

Comparison                     SRAM               Flash              Antifuse

Volatile                       Yes                No                 No
Reprogrammable                 Yes                Yes                No
Area (storage element size)    High               Moderate           Low
                               (6 transistors)    (1 transistor)     (0 transistors)
Manufacturing process          Standard CMOS      Flash process      Special development

Each vendor has its own FPGA architecture, but in general, they are all a variation of that shown in Fig. 1.1. The architecture consists of configurable logic blocks (CLBs), programmable interconnect, and configurable I/O blocks. The basic CLB typically consists of a look-up table

(LUT) to generate arbitrary combinational logic functions for a small number of inputs (typically up to 6), a D-flip-flop (DFF) to store state information, a full , and for connecting the to external resources. Essentially, a LUT consists of a programmable cascade of multiplexers, arranged tournament style, to produce an output (winner). Each of the 2N inputs to the first column of multiplexers, where N is the number of logic signal inputs to the LUT, are programmed to be either 0 or 1 and represent the value of a specific minterm. The number of multiplexers needed for an N-input LUT is 2N – 1. Each pass transistor based (including a half- keeper) multiplexer requires at least 5 transistors, so the number of transistors needed for N-input

LUT is at least (2N – 1)*5. So, for a 6-input LUT, at least 315 transistors are needed for the multiplexers. In addition, an is needed for each input, and after every two series multiplexers, a buffer including a half keeper is needed since the multiplexer pass transistors do

N-2 N-4 not pass a full VDD. The number of such buffers needed is 2 + 2 + … . For a 6-input LUT, the

3

number of buffers is 16+4+1 = 21, and each buffer requires 5 transistors, so a grand total of 105 transistors implementing buffers. Including the 12 transistors required for the 6 input inverters, overall, a LUT requires 315+105+12 = 432 transistors.
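This transistor-count arithmetic generalizes to any LUT size. The following Python sketch (illustrative only; the function is not part of any FPGA or TRAP tool flow) recomputes the totals derived above:

```python
def lut_transistor_count(n):
    """Estimate the transistor count of an n-input LUT, following the
    arithmetic in the text: 5 transistors per pass-transistor MUX
    (including its half-keeper), a 5-transistor buffer after every two
    MUX levels, and a 2-transistor inverter per input."""
    mux = (2 ** n - 1) * 5                               # 2^n - 1 MUXes in the tree
    buffers = sum(2 ** k for k in range(n - 2, -1, -2))  # 2^(n-2) + 2^(n-4) + ...
    inverters = n * 2                                    # one inverter per input
    return mux + buffers * 5 + inverters

print(lut_transistor_count(6))  # 315 + 105 + 12 = 432
```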

Figure 1.2. Classic LUT architecture

As shown in Fig. 1.2, a conventional 4-input LUT structure contains 2^4 − 1 = 15 2:1 MUXs. The 16 inputs of the MUXs are supplied with programming bits. In general, a k-input LUT contains 2^k − 1 MUXs and needs 2^k programming bits for the inputs. A LUT can be entirely allocated to a relatively simple logic function. This structure is generally used because of its simplicity. However, area efficiency is generally poor since the transistor utilization is low on average.

The Xilinx 7 series CLB provides advanced, high-performance FPGA logic. Four 6-input LUTs, eight flip-flops, multiplexers, and arithmetic carry logic form a slice [8]. In a slice, the extra logic apart from the LUTs is designed to improve arithmetic and memory implementation efficiency. As shown in Fig. 1.3, two slices form a CLB, and the CLBs are connected through a switch matrix.

Figure 1.3. Xilinx 7 series CLB structure

In addition to the logic block architecture, another key FPGA feature is the routing architecture.

Xilinx and Altera (acquired by Intel) have appreciable differences with respect to the structure of their programmable interconnections. The Xilinx routing structure is a combination of wires of different lengths connected by switches. Shorter wires are used for local connections, and longer wires are used for longer connections to minimize the number of switch nodes traversed. Altera uses a hierarchical interconnect structure to maximize connectivity and performance. The lowest level provides high-speed connectivity between different clusters of logic blocks, called logic array blocks (LABs). The routing resources around LABs are organized as wires in several rows and columns, with vertical or horizontal switches located at the crossings. The Cyclone and Stratix II families use a three-sided routing architecture, as shown in Fig. 1.4 [9]. This means that a LAB can drive or listen to all of the wires on one horizontal (H) channel above it and two vertical (V) channels to the left and right of it. The channels contain wires of length 4, 8, 16, and 24, and signals can get off at any LAB along the length of the wire.

Figure 1.4. Routing architecture example

Tabula, Inc., a privately-held fabless semiconductor company, developed a creative architecture that uses time as a third dimension, which improves routing and performance in spacetime [10] [11]. The company designed and built three-dimensional field programmable gate arrays (3-D FPGAs) and ranked third on the Wall Street Journal's annual "Next Big Thing" list in 2012 [12].

Tabula had a product line called the ABAX family, which introduced the 3D SpaceTime architecture to the FPGA world, as shown in Fig. 1.5 [10]. In conventional FPGAs, each LUT performs a single function. In an ABAX chip, each LUT performs up to 8 functions in a time-multiplexed manner. The second generation, ABAX2, in 22nm Tri-Gate technology, allowed implementation of up to a 12-level architecture. Each time slice is called a fold. Transparent latches are embedded in the interconnection network to handle both routing and temporary state storage between folds.

Figure 1.5. 3D SpaceTime architecture

Tabula's architecture was a novel attempt to solve the interconnect bottleneck in FPGAs. In this approach, a design larger than the fabric size can time-share the fabric. By using each LUT and memory block multiple times, Tabula claimed that less interconnect (wire length) is required. Unfortunately, Tabula shut down in 2015.

Xilinx supports time-sharing and instant switching between multiple configurations [13]. Xilinx offers a tool called PlanAhead to aid in the implementation of partial reconfiguration (PR) [14]. The FPGA device is divided into a mission-critical region and a reconfiguration region. The main advantage of this methodology is that mission-critical operations can be preserved while another part of the FPGA device is reconfigured, as opposed to full reconfiguration of the FPGA device, which does not allow for uninterrupted operation. However, due to the strict guidelines, significant effort is required to implement PR. Intel uses a more software-based approach called software programmable reconfiguration (SPR) [15]. Instead of reconfiguring the device with multiple bitstreams, the application integrates all the functions into just a single bitstream.

1.2 Novel Architectures in Research

Transistor-level reconfiguration has been investigated in the past in the context of evolvable hardware. Specifically, several Field-programmable Transistor Array (FPTA) architectures were designed in the late ‘90s and early ‘00s [16] [17] [18]. These FPTAs focused on the analog domain and created programmable analog cells, with a partial exception being Stoica’s work [19], which also focused on mixed-signal circuits.

There have been many efforts to reduce the power/area/performance gap between an FPGA and an ASIC. An FPGA logic core contains three fundamental blocks, namely, a logic block, an interconnection scheme, and programmable I/Os. The LUT is the most commonly used architecture for the logic block in commercial FPGAs. As the number of inputs to a LUT increases, its area and the number of functions it can implement increase exponentially. In turn, fewer LUTs are needed to implement a design; however, the required routing area goes up. A study [20] has argued that the optimal number of inputs for a LUT is between three and four, largely independent of the programming technology. That being said, contemporary FPGAs have largely turned to the use of 6-input LUTs [21].

If the programmable logic density is improved, communications tend to become localized. There has been research using controllable-polarity devices to build an ultrafine-grain logic cell aimed at reduced area [15]. In [22], the hybrid CMOS/nanotechnology reconfigurable architecture called NATURE increased logic density by an order of magnitude. The high logic density provided by logic folding makes it possible to use much less expensive chips with smaller capacities to execute the same applications. However, this architecture adds extra DFFs to retain states between each fold, and each fold incurs a reconfiguration delay. The clock cycle of the implemented circuit is composed of the reconfiguration delay, the LUT computation delay, and the interconnect delay. Furthermore, nanotechnology is not easy to develop and is currently not mature with respect to fabrication.

The bottleneck for the area of an FPGA is the interconnection network. An FPGA contains a sea of mesh interconnects and switches which contribute substantially to the area. A multi-granularity FPGA with hierarchical interconnects has been developed to improve area efficiency [23]. Instead of the conventional interconnect architecture, this work is based on the idea of the Beneš network [24]. It aims at providing a flexible and efficient eFPGA for use in SoCs, primarily focused on an improved interconnect architecture. Flex Logix has sought to provide eFPGAs with superior interconnect efficiency [25].

Recently, FPGAs have been further improved by coarse-grain functional blocks, e.g., memories, multipliers, and processors, in combination with the fine-grain LUT-based logic blocks [26]. This can improve density and performance if the flexibility of the interface between the coarse-grain and fine-grain resources is optimized.

By raising reconfiguration granularity to coarse macros and functional blocks as complex as Digital Signal Processors (DSPs), Coarse-grained Reconfigurable Architectures (CGRAs) have been proposed as a method for accelerating applications which contain large amounts of parallelism and computationally intensive tasks [27] [28] [29] [30]. CGRAs are large, slow, power-hungry, and of limited agility, drawing their acceleration benefits mostly from dynamic reconfiguration.

To improve hardware security, bitstream encryption has been attempted via the Physical Unclonable Function (PUF) response [31]. The bitstream contains the configuration information and is a vulnerability in FPGA design. By inserting a PUF cell along with every programming bit, the reconfigurable logic and interconnect in the FPGA are deeply hardware-entangled. The PUF cells are tightly integrated with the memory cells retaining the programming bits. Significant security benefits are achieved at the cost of an area overhead.

The remainder of this dissertation is organized as follows. Chapter 2 is a brief overview of TRAP features. Chapter 3 presents the design details of all the functional architectures of TRAP. In addition, Chapter 4 records all the design decisions that were made during the development of the architectures. The fabrication and packaging of TRAP are described in Chapter 5. Chapter 6 presents the test setups and test results, as well as describes the capabilities of TRAP. Finally, Chapter 7 is devoted to future work and concluding remarks.


CHAPTER 2

TRAP OVERVIEW

2.1 A Brief Introduction to TRAP

In this work, we developed a novel TRAnsistor-level Programmable fabric (TRAP) aimed at improving area efficiency. TRAP is comprised of carefully arranged regular rows and columns of transistors, which can be individually configured to implement logic gates. This is a fine-grain architecture compared to the commonly used LUT: FPGAs that use LUTs are programmable at the gate level, while TRAP is programmable at the transistor level. By dynamically allocating individual transistors as constituents of logic gates, this fabric improves transistor utilization. Specifically, instead of allocating entire LUTs for relatively simple functions, our fabric uses only the precise number of transistors needed to implement the function. TRAP uses a standard CMOS manufacturing process, which makes it easy to use and affordable.

The interconnection architecture that is commonly used in industry is a matrix of switches, comprised of bi-directional repeaters, at every intersection. We have aggressively simplified the switch, using only a single pass transistor. As the interconnection architecture takes up 80% of the area of an FPGA [23], this modification can greatly reduce area consumption.

TRAP also allows simultaneous storage of four independent configurations, along with the ability to dynamically switch between them within a clock cycle. The logic blocks, the interconnection network, the programmable IOs, and other peripheral blocks are all time-shared between different configurations. Only the bitstream storage is increased by 4X. When multiple bitstreams are saved, the average overhead per configuration of the programmable architecture is significantly reduced.


TRAP supports libraries of cells of the same height and variable width, just as in a typical standard cell circuit. We developed a complete Computer-aided Design (CAD) tool flow for programming TRAP chips. A prototype 3mm X 3mm TRAP chip was fabricated using the Global Foundries 55nm process. We show that TRAP has substantially better area efficiency compared to a leading industrial FPGA and would, therefore, be ideal for embedded FPGA (eFPGA) applications.

2.2 TRAP Features and Applications

2.2.1 Virtualization

Our fabric supports simultaneous storage of up to four configurations, rather than the single configuration of standard FPGAs. TRAP stores four bitstreams, which correspond to four configurations, or four virtual layers of logic. Switching among the stored bitstreams is controlled by a multiplexer, and therefore can be achieved within a small fraction of a clock cycle. While one configuration is active, a new bitstream can be sent in to update one of the other, non-active configurations stored on the fabric.

When the four stored configurations are independent, TRAP is being used in board-level virtualization mode. We term this board-level virtualization in that each configuration, in effect, implements an independent chip. The computational state of a layer is retained before switching to another configuration; because the computational states are retained and independent, TRAP can switch dynamically between any of the stored configurations and continue execution when switched back.

TRAP also supports chip-level virtualization, in which a single IC design is partitioned over a set of configurations and the computation cycles from one configuration to the next in the set. By time-sharing the fabric, TRAP can implement a design that requires more computational logic than is physically available on the TRAP chip.

TRAP is capable of supporting either board-level virtualization or chip-level virtualization, as explained in Chapter 3.
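A behavioral sketch of board-level virtualization is given below: four configuration slots share one fabric, the active slot is simply a select value, and only non-active slots may be overwritten. All names here are hypothetical, chosen for illustration only:

```python
class TrapFabricModel:
    """Toy model of board-level virtualization: four stored
    configurations (virtual layers), one active at a time."""

    def __init__(self):
        self.layers = [None] * 4   # four stored bitstreams
        self.active = 0            # which copy-select is on

    def load(self, slot, bitstream):
        # Loading never disturbs the running layer: only a
        # non-active slot may be overwritten.
        assert slot != self.active
        self.layers[slot] = bitstream

    def switch(self, slot):
        # In hardware this is a mux-select change, completing in a
        # small fraction of a clock cycle; each layer keeps its own
        # computational state in its per-layer DFFs.
        self.active = slot

fabric = TrapFabricModel()
fabric.load(1, "...bitstream B...")
fabric.switch(1)                     # layer 1 now runs, while ...
fabric.load(2, "...bitstream C...")  # ... layer 2 is updated in the background
```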

2.2.2 Rapid Selective Reprogramming

TRAP also features rapid partial or full modification of any one of the stored configurations in a time proportional to the number of modified configuration bits. Bitstreams are loaded in a word-based fashion using hierarchically arranged and addressed memory buffers. This high-throughput scheme supports selective bitstream editing, which enables updating only a portion of a configuration. Partial editing in this manner significantly speeds up the programming process compared to loading the full configuration when only a few specific regions of the fabric need to be updated.
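The cost model for selective reprogramming can be sketched as follows. Assuming a flat bitstream and a fixed word size (simplifications for illustration; the real scheme uses hierarchical, addressed memory buffers, detailed in Section 3.6), only words that differ need to be transmitted:

```python
def partial_update(stored, new, word_bits=8):
    """Return (word_address, word) pairs for words that changed,
    mimicking TRAP's addressed, word-based configuration loading.
    Programming time scales with len(result), not with total bits."""
    assert len(stored) == len(new)
    updates = []
    for addr in range(0, len(new), word_bits):
        word = new[addr:addr + word_bits]
        if stored[addr:addr + word_bits] != word:
            updates.append((addr // word_bits, word))
    return updates

old = "01010101" * 4
mod = old[:8] + "11111111" + old[16:]   # change only the second word
print(partial_update(old, mod))          # [(1, '11111111')]
```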

2.2.3 Seamless Transition with ASIC Flow

FPGAs are extensively used for prototyping ASICs. Before volume production of an ASIC, the FPGA is used primarily in the design phase and preproduction phase to emulate and test the design and thereby lower risk. Our design aims to minimize the transition effort.

Unlike the basic configurable logic block (CLB) in conventional FPGAs, which employs look-up tables (LUTs) to generate combinational logic functions, TRAP relies on a carefully arranged configurable array of transistors, which can be interconnected to implement the topologies of standard library cells. As a result, logic synthesis and placement of this transistor-level granularity fabric are fully compatible with an ASIC design flow.


2.2.4 Mixed TRAP and ASIC Design

Moreover, our fabric can easily be co-designed along with a conventional standard-cell block on an ASIC. It can be included as an Intellectual Property (IP) block and be embedded into an SoC design through a Computer-Aided Design (CAD) tool flow that is highly compatible with a typical ASIC design flow. For example, a design may be partitioned into a TRAP portion and an ASIC portion, where the TRAP portion is a sensitive netlist that needs to be obfuscated. Both the TRAP and ASIC portions can use the same (or a similar) cell placement program.

Embedded Field Programmable Gate Arrays (eFPGAs) are used as IP blocks in a System-on-Chip (SoC) or Application Specific Integrated Circuit (ASIC) to enable flexibility for applications that may change quickly. Compared to FPGAs, eFPGAs do not have I/O pads, which consume a lot of area. For FPGAs, the maximum number of I/Os is limited by the available space on the die. The I/Os of an eFPGA are directly connected with the ASIC boundary, and can be far more numerous. With a CAD tool flow that is compatible with an ASIC, TRAP should be ideal for eFPGA applications.


CHAPTER 3

DESIGN DETAIL OF TRAP

TRAP's transistor-level architecture consists of regular rows and columns of transistors. By means of programming bits, certain transistors are interconnected to form logic gates, which resemble gates in an ASIC's standard cell library in that the logic gates are of the same height and variable width. The interconnect is programmed to connect the various instantiated logic gates. Memory cells are used to store four configurations simultaneously. TRAP has a word-based, pipelined, high-throughput configuration bit loading scheme. This scheme also passes an address along with each word of configuration bits to enable a partial configuration update. This chapter describes these innovations and design details in sequence.

3.1 Logic Cell & Built-in Gates

3.1.1 Logic Cell

The TRAP architecture consists of numerous long rows of transistors. A cell library is defined in the same way as it is for a typical standard cell circuit, where each cell has the same height and variable width. In TRAP, the granularity of the logic cells is one column of transistors. As shown in Fig. 3.1, a column consists of two pMOS transistors above two nMOS transistors, which connect to the left terminals of two horizontally placed pMOS transistors and two horizontally placed nMOS transistors. The right terminals of these four transistors connect to the next column. This basic column is replicated in the horizontal direction, forming a continuous row. The actual hardware implementation of the row-based TRAP groups four columns together and terms this a logic cell (LC). Each column consists of four vertical transistors used for connecting, isolating, or logic (two pMOS and two nMOS transistors), as well as two outer horizontal transistors used for connecting or isolating and two inner horizontal transistors used for connecting, isolating, or logic.

Figure 3.1. Logic cell schematic with multiplex inputs

As shown in Fig. 3.1, there are four columns of transistors in an LC. The eight innermost horizontally placed transistors (blue-colored) can be programmed to be always on or always off; they are strictly used for connection or isolation. All the other transistors can also receive logic signals rather than only being programmed always on or always off. These transistors can be used to construct logic functions that require up to three transistors in series. The transistors that can be used either for connection/isolation or to receive logic signals are controlled by a multiplexer, whose ctrl input selects between the two inputs memory_in and signal_in. The memory_in programming bit is sent from a memory cell; the polarity of this bit determines whether the transistor is used for isolation or connection. The logic signal signal_in is propagated from the interconnection network.

Figure 3.2. I/Os of an LC connect to programmable interconnect

To be specific, the 12 nMOS logic/connecting/isolation inputs in an LC (four columns) are directly connected to individual vertical M2 tracks. Similarly, the 12 pMOS logic/connecting/isolation inputs in an LC (four columns) are directly connected to individual vertical M2 tracks, as shown in Fig. 3.2. Either a programming bit or a logic signal from the interconnection network is on each of these tracks. A potential output is illustrated by means of a small green rectangle in Fig. 3.1. Each potential output is connected to an M2 track, which leads into the interconnection network, as shown in Fig. 3.2.
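Behaviorally, each configurable transistor input reduces to the selection step sketched below (a software rendering of the ctrl multiplexer from Fig. 3.1; the function name is illustrative):

```python
def transistor_gate_input(ctrl, memory_in, signal_in):
    """The per-transistor input mux: ctrl = 1 routes a live logic
    signal from the interconnect to the transistor's gate; ctrl = 0
    applies the static programming bit (connection or isolation)."""
    return signal_in if ctrl else memory_in

# Programmed always-on (connection): ctrl = 0, memory_in = 1.
assert transistor_gate_input(0, 1, None) == 1
# Carrying a logic signal from the interconnect: ctrl = 1.
assert transistor_gate_input(1, 0, 1) == 1
```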


A study suggested that an ASIC standard cell library should contain cells with at most two transistors in series for the pull-up and pull-down networks in order to achieve the best power efficiency [32]. However, for FPGAs, the vast majority of delay and power are due to the interconnection network rather than logic gates. Thus, we decided to reduce the number of nets by allowing more complex cells. In particular, a decision was made to allow up to three transistors in series.

Figure 3.3. NAND2 and AOI22 gates implemented with LC

We characterized and compiled a 55nm cell library for TRAP using Synopsys SiliconSmart and Library Compiler. We also created a functionally equivalent cell library for the ASIC portion of a design. The base library has 24 gates whose series limit is 3 transistors, plus 3 built-in gates (DFF, full adder, and multiplexer) with their different input and output polarities.

The 24 base library components are: INV, NAND2, NOR2, AOI12, AOI22, OAI12, OAI22, NAND3, NOR3, AOI31, OAI31, AOI41, OAI41, AOI32, OAI32, AOAI311, OAOI311, AOAI211, OAOI211, AOOAI212, OAAOI212, AOAAI2111, OAOOI2111, and MAJI (which is the inverted mirror carry). In addition, the following custom cells are built into the LCs: FA, FAI, DFF, MUX, and MUXI.

Fig. 3.3 depicts how two cells, a NAND2 and an AOI22 gate, are programmed in TRAP. The dark black transistors with signal names on their gates receive the primary inputs and implement the logic function, while the other dark black transistors are enabled (turned on with programming bits) to complete the gate topology. The light gray transistors are turned off with programming bits. The dark blue transistors are turned on with programming bits to form the output node of the logic function.

3.1.2 Built-in Gates

In general-market FPGAs, a logic block (e.g., a slice in a Xilinx FPGA) contains a few built-in gates to improve logic efficiency. Similarly, in addition to the four columns of transistors, each LC comes with three custom built-in gates: a D flip-flop (DFF), a full adder (FA), and a multiplexer (MUX). The inputs and outputs (I/Os) of the built-in gates are optionally connected to LC I/Os. A programming bit labeled en in Fig. 3.4, Fig. 3.5, and Fig. 3.6 selects, via a de-multiplexer (DEMUX), whether to send the logic signal as an input to a transistor or to a built-in gate. Accordingly, when a built-in gate is in use, en is logic 1. When en is logic 0, it disables the path which connects the built-in gate to the LC output.

In Fig. 3.4, a potential output is illustrated by means of the small green rectangle. Each potential output is optionally connected to a vertical metal2 (M2) track by means of a programmed switch. In addition, an optional custom D flip-flop (DFF) is provided in each LC, with its output optionally fed into the first column and its D input taken from the first column as well, as shown in Fig. 3.4. While it is possible to construct such a flip-flop from two LCs (8 columns), it would be far less area efficient, which is a significant issue since DFFs are frequently used. In Fig. 3.4, the D input (Signal_in) either passes to a logic transistor in the first column of the LC (if ctrl = 1 and en = 0) or passes to the D input of the DFF (if en = 1) via the demultiplexer. Note that all the DFFs provided in the LCs are connected in a scan chain.

Figure 3.4. Built-in DFF

While an FA could be implemented in TRAP's programmable columns, it would occupy 8 columns (two LCs). Meanwhile, providing a custom FA cell in each logic block takes up only two columns. The three inputs of the FA, as shown in Fig. 3.5, span the second and third columns of the LC. The carry and sum outputs are provided at the outputs of the second and third columns, respectively. The MUX I/O connects to the fourth column of the LC, as shown in Fig. 3.6. The LC and built-in gates are interconnected using the Metal1 (M1) layer. The custom MUX occupies only one column, saving 3 columns compared to programming it using an entire LC. Since many digital signal processing blocks use large numbers of FAs and multiplexers, the area reduction is very significant.

Figure 3.5. Built-in FA

To make the built-in gates more area efficient, all of their I/Os can be modified to either polarity. For example, the MUX logic is (S*A+S_bar*B), and it can be implemented as an XNOR (S*A+S_bar*A_bar) when the B input connects to inverted A. By inverting the output, it becomes an XOR. With small changes in programming bits, the MUX can thus be constructed to provide other logic functions. The polarities are realized by adding the inverter and DEMUX logic shown in Fig. 3.6. One programming bit called po is involved in selecting the desired polarity.

Figure 3.6. Built-in MUX
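The polarity identities can be checked exhaustively. The snippet below verifies that the MUX function S*A + S_bar*B computes XNOR(S, A) when B is driven with inverted A, and XOR(S, A) when the output is inverted as well (a truth-table check only; signal names follow the text):

```python
def mux(s, a, b):
    """2:1 multiplexer: S*A + S_bar*B."""
    return (s & a) | ((1 - s) & b)

for s in (0, 1):
    for a in (0, 1):
        assert mux(s, a, 1 - a) == (1 if s == a else 0)  # XNOR(S, A)
        assert 1 - mux(s, a, 1 - a) == (s ^ a)           # XOR(S, A)
print("MUX polarity identities hold")
```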

3.1.3 Advantages

The way that FPGAs execute Boolean algebra is by using Look-Up Tables (LUTs). The LUT consists of a cascade of multiplexers that can be programmed, as shown in Fig. 1.2. Though the FPGA is called a programmable logic device, it does not physically contain discrete logic gates (AND, OR, etc.). To construct a logic gate, the LUT uses the truth-table concept to relate outputs to inputs, which is very flexible but inefficient. In general, a k-input LUT contains 2^k − 1 MUXs and needs 2^k programming bits for the inputs. A LUT can be entirely allocated to a relatively simple logic function. On the other hand, the LC architecture in TRAP is programmable at the transistor level, enabling a logic gate to use only the transistor columns necessary to implement the gate. The unused columns in an LC can be used to map other gates or portions of other gates, thus significantly improving the area efficiency/logic density of a programmable architecture.
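To make the comparison concrete, a k-input LUT is simply a 2^k-entry truth table indexed by its inputs, as in the toy sketch below (illustrative naming, not vendor code). Every entry must be stored even when the target function is trivial, which is exactly the underutilization that TRAP's column-granular LCs avoid:

```python
def lut_eval(program_bits, inputs):
    """Evaluate a k-input LUT: the k inputs form an index selecting
    one of the 2^k stored programming bits (truth-table entries)."""
    assert len(program_bits) == 2 ** len(inputs)
    index = 0
    for bit in inputs:            # inputs given as a tuple of 0/1 values
        index = (index << 1) | bit
    return program_bits[index]

# Even a 2-input AND consumes a full 4-entry table: only input 11 -> 1.
and2 = [0, 0, 0, 1]
print(lut_eval(and2, (1, 1)))     # 1
```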

In the current TRAP prototype, for each column in an LC, there is one built-in gate which selectively connects to that column. That built-in gate only occupies one column of LC resources.

3.2 Board-level Virtualization and Chip-level Virtualization

3.2.1 Board-level Virtualization

We designed four virtual layers of logic in TRAP, in which four configurations are stored simultaneously. One option enables switching between multiple independent configurations [33]. This is called board-level virtualization. The current TRAP fabric also enables chip-level virtualization. This feature enables switching between dependent configurations, each of which is a partition of a design larger than the size of one layer (or configuration).

In board-level virtualization, there are four DFFs which operate in parallel, where each is available to a particular respective layer and logically resides in the first column of each LC. As highlighted in Fig. 3.7, the four DFFs share the I/O of the first column of the LC. Each blue-encircled DFF serves as the built-in DFF of that LC on its respective virtual layer. Among the control signals sa, sb, sc, and sd, only the one corresponding to the active virtual layer is on at a time. The four layers have their own computational states, and they are strictly independent.

Figure 3.7. Board-level virtualization

3.2.2 Chip-level Virtualization

In chip-level virtualization, TRAP executes a partitioned circuit, in which the dependent partial configurations are executed on different virtual layers. When implementing the partitioned configurations across multiple virtual layers, all four potential gate outputs in an LC are stored in DFFs to prepare for the transition to another layer. The four DFFs serve to retain the four computational states. Each potential gate output labeled O is captured by a respective DFF. Each of these DFF outputs drives the extra line labeled E in Fig. 3.8 and Fig. 3.9, which serves as a primary input for the next layer. Note that the number of meaningful computational states varies from zero to four, as it depends on the exact gates that have been implemented on the current layer. Only valid computational states will propagate to the interconnect network to be used on the next layer.

Figure 3.8. Chip-level virtualization not using built-in DFF

The circuitry highlighted in Fig. 3.8 and Fig. 3.9 is active in the chip-level virtualization mode. As shown in Fig. 3.8, when the built-in DFF is not used in the first column, the control signal en2 is turned on, and sa, sb, sc, and sd are turned off. After the control signal capture is applied, the four temporary computational states are available on the E lines. If the current layer needs to implement a built-in DFF, sa is turned on to enable it. The built-in DFF labeled 1 (DFF1) corresponds to the system clock and holds the overall circuit state. The control difference is clearly shown in Fig. 3.8 and Fig. 3.9.

Figure 3.9. Chip-level virtualization using built-in DFF

The circuit state might update multiple times in accordance with clk, whereas the DFF labeled 1' (DFF1') only captures a logic gate output right before exiting the current layer. DFF1' and DFF1 carefully isolate the computational states of the current layer and the next layer, so DFF1 is immediately available for an update. In general, when the current layer has finished its logic computations and all computational states are saved, the configuration for the next layer is instantiated, at which point the next layer becomes the current layer. Global control is responsible for executing the virtual layers in the right sequence.


TRAP implements a circuit partitioned over up to four configurations with negligible time overhead. For a very large circuit that requires more than four layers to fit on the fabric, a non-active partition (layer) can be rapidly overwritten using TRAP's word-based, high-speed, pipelined configuration bit loading scheme. The programming bit loading scheme (explained in Section 3.6) applied to a specific layer does not interrupt execution on the other layers.
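The chip-level execution flow can be summarized by the toy scheduling loop below. The layer functions and dictionary-based E lines are illustrative stand-ins for the hardware's capture DFFs and global layer-sequencing control:

```python
def run_chip_level(layers, primary_inputs):
    """Cycle a partitioned design through its virtual layers, carrying
    saved computational states (the E lines) from layer to layer."""
    e_lines = {}
    inputs = dict(primary_inputs)
    outputs = {}
    for layer in layers:          # global control sequences the layers
        inputs.update(e_lines)    # previous layer's saved states feed in
        outputs = layer(inputs)   # evaluate the logic mapped to this layer
        e_lines = outputs         # 'capture' stores states for the next layer
    return outputs

# Hypothetical two-layer partition of y = (a AND b) OR c:
layer1 = lambda i: {"t": i["a"] & i["b"]}
layer2 = lambda i: {"y": i["t"] | i["c"]}
print(run_chip_level([layer1, layer2], {"a": 1, "b": 1, "c": 0}))  # {'y': 1}
```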

3.3 Interconnect Architecture

The programmable interconnect is one of the essential portions of a field programmable architecture and consumes most of its area. In TRAP, the programmable interconnect is designed to construct the routing paths between the logic gates implemented in the LCs. The programmable interconnect network includes two fundamental elements, which are the switch and the segment.

3.3.1 Switch

As shown in Fig. 3.10, the programmable interconnect has two symmetric halves, and each LC connects to both halves. The upper half of the programmable interconnect connects to the pMOS inputs in an LC, and the lower half connects to the nMOS inputs in the LC. Each small square box (of various colors) comprises an nMOS switch that can be turned on or off to connect or disconnect two track segments going in perpendicular directions on adjacent layers.

M2 and Metal4 (M4) are the vertical routing resources, while Metal3 (M3) and Metal5 (M5) are the horizontal routing resources. Metal6 (M6) is primarily used to route the programming bits. The labels in Fig. 3.10 give the names of the metal tracks: Mx_y denotes track y on layer x, where x ranges from 2 to 5 and y ranges over the total number of tracks on Mx.


Figure 3.10. Interconnect network


The pink vertical M2s in the right portion of the upper half of the programmable interconnect terminate at the upper boundary. These 12 M2s (labeled M2_1t to M2_12t) are used for sending inputs from the interconnection network down to the 12 pMOS transistors in the LC. Symmetrically, in the lower half of the programmable interconnect associated with an LC, the 12 M2s (labeled M2_1b to M2_12b) are used to apply inputs to the 12 nMOS transistors in the LC. All other vertical wire segments extend through the entire programmable interconnect of the LC, and each has a bidirectional repeater to the neighboring LC above and below the current LC. The middle 4 M2s (labeled M2_O1 to M2_O4) are designed for propagating the logic gate outputs to the programmable interconnect. To support the chip-level virtualization mode, TRAP has four short M2s (labeled M2_S1 to M2_S4); the tracks inside the dashed rectangle are designed to send the saved computational states from the previous layer to the current layer. The blue and maroon horizontal tracks are M3 and M5, respectively. The 12 vertical tracks on the left side are M4. The M2s, M3s, M4s, and M5s, with pass transistor switches at the designated intersections, constitute the programmable interconnect. The pass transistor switches in blue are at the intersections of M2 and M3, the switches in gray are between M3 and M4, and the red switches are between M4 and M5.

The lower half of the programmable interconnect is slightly different from the upper half, insofar as there is one more switch located at the intersection of M2_O1 and M3_28. In chip-level virtualization mode, it may occur that a saved computational state from the previous layer may not be used in the current layer but rather in a subsequent layer. This is especially common for the state generated from a built-in DFF whose output is S1. Therefore, we add one extra switch directly constructing a potential path from S1 to O1 to improve routing efficiency.


3.3.2 Segment

Figure 3.11. Switch and half keeper

Each switch is implemented by an nMOS transistor controlled by a programming bit, with the source and drain connecting the two perpendicular metal lines (on different layers), as shown in Fig. 3.11 (with the programming bit called ctrl in this case). Since the pass transistors do not pass a full Vdd, a half-keeper, as shown in Fig. 3.11, is added to each metal segment. The half-keeper is only active (i.e., pulls up the voltage) when Vdd is being passed onto a metal track segment.

The routing architecture of TRAP features the advantages of a hierarchical structure as well as local connections. M2 is used primarily to connect inputs and outputs of the LCs to the programmable interconnect. The remaining higher metal layers (M3, M4, and M5) alternate perpendicularly in direction. At the boundary of the programmable interconnect associated with an LC, a metal track on M3, M4, or M5 is segmented by either a pass transistor or a repeater, as shown in Fig. 3.12(a) and (b). A pass transistor is a configurable nMOS with drain and source connected to two adjacent wire segments. A bi-directional repeater is controlled by two programming bits to select the right signal propagation direction, as shown in Fig. 3.12(b).


Figure 3.12. (a) Pass transistor, and (b) Bi-directional repeater

Vertical M2 segments terminate at the top and bottom of the LC. However, vertical M4 segments can be connected to the M4 segments of other LCs above and below, in either direction, using the optional bi-directional repeater, shown in Fig. 3.12(b), at the top and bottom ends of the M4 segments. The M3 and M5 horizontal metal track segments terminate at the left and right boundaries but connect to adjacent M3 and M5 segments in the neighboring programmable interconnect via a pass transistor, as shown in Fig. 3.12(a). When the aggregate length of the pass-transistor-connected segments reaches a calculated limit, a bi-directional repeater is inserted in place of the pass transistor to improve delay. In TRAP, the pass transistor is replaced with a repeater after every 4 LCs in the horizontal direction. One-quarter of the horizontal M3 and M5 lines have a repeater, shown in Fig. 3.12(b), instead of a pass transistor at a given right boundary of an LC, and thus the maximum M3 or M5 length between repeaters is the width of 4 LCs. Detailed simulation determined that this length of 4 LCs was optimal.


3.3.3 Challenge in Routing

The programmable interconnect becomes the critical limiter of performance as an FPGA gets bigger, and the amount of interconnect typically needs to grow to avoid routing problems. As 70-80% of the fabric of an FPGA is the programmable interconnect network which routes the signals between the logic blocks, it is both necessary and challenging to improve the interconnect structure for better area without negatively affecting performance.

Figure 3.13. (a) Switch in TRAP, (b) Switch in FPGA

As shown in Fig. 3.13, the switches used by TRAP and those in conventional FPGAs are substantially different. Conventional FPGAs commonly use two types of switches at intersections [23]. The simpler type is a single switch at the crossing, which will either pass or not pass the signal between the two metal tracks, as shown in Fig. 3.13(a); this is the same type of switch used in TRAP. The other type contains six switches at the crossings and two sets of bidirectional repeaters. Fig. 3.13(b) is an example of this fundamental type of switch in FPGAs. This structure can support a signal going from any one direction to any other direction, as well as disable a metal track segment that is not to be used.


3.3.4 Area Benefit and Structure Superiority

An FPGA has a sea of mesh interconnect with a matrix of switches at every intersection. As shown in Fig. 3.13(b), six nMOS transistors and two bi-directional buffers together construct the switch structure in an FPGA. TRAP only uses a bi-directional repeater to connect metal segments going in the same direction and on the same layer, and only after a certain aggregate metal segment length. The switch in TRAP is simplified to the extreme: a single nMOS transistor.

In conventional FPGAs, switch-boxes connect each of m vertical wires to each of n horizontal wires, with each such connection requiring a programmable bi-directional buffer. Bi-directional buffers consume a great deal of layout area, requiring a minimum of 16 transistors, and one would be needed for each of the m × n connections.

The TRAP routing architecture does not use any switch-boxes. Instead, only single nMOS pass transistors are used to connect a vertical (horizontal) wire segment to a vertical (horizontal) wire segment. Since the nMOS devices do not pass a full logic 1, a half-keeper is attached to each wire segment. Of course, our routing architecture does require a small number of bidirectional repeaters, namely, when a vertical segment connects to another vertical segment. Also, a bidirectional repeater is used after every four horizontal-to-horizontal wire segments, with pass transistors used otherwise. As a pass transistor is approximately 20X smaller than a bi-directional buffer, we surmise that our routing architecture should be about an order of magnitude smaller in area.
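A back-of-the-envelope comparison illustrates the order-of-magnitude claim. It assumes 16 transistors per bi-directional buffer (the minimum stated above) and roughly 3 transistors per half-keeper, which attach per wire segment rather than per crossing (the half-keeper figure and channel widths are assumptions for illustration):

```python
def fpga_switchbox_transistors(m, n):
    # One programmable bi-directional buffer (>= 16 transistors)
    # per connection in an m x n switch-box.
    return m * n * 16

def trap_crossing_transistors(m, n):
    # One nMOS pass transistor per crossing, plus one half-keeper
    # (~3 transistors, assumed) per wire segment.
    return m * n * 1 + (m + n) * 3

m, n = 12, 24   # illustrative channel widths
print(fpga_switchbox_transistors(m, n))  # 4608
print(trap_crossing_transistors(m, n))   # 396, roughly 12x smaller
```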

3.4 Local Memory Cell

3.4.1 Memory Cell Architecture

Programming bits are saved in on-chip memory. Each cell of the memory is called a local memory cell (MC), as shown in Fig. 3.14. As TRAP supports the storage of four configurations, each MC has the structure to store four bits, one for each configuration. Thus, an MC consists of four latches driven by transmission-gate switches. The four latches in parallel in one MC share the same input line and output line, labeled BL and out.

Figure 3.14. Memory cell structure

Consider the top latch as an example. Data is sent into the top latch when clka is on, and the data is retained when clka is off. The signal cpa controls whether to send the stored bit to the interconnect network and the LCs to form the desired circuit. In general, one of the global control signals cpa, cpb, cpc, or cpd is used to select one of the four stored bitstreams to configure TRAP. This enables TRAP to be reconfigured (in other words, re-programmed) from one virtual layer to another in a small fraction of a clock cycle.


In general, when writing programming bits in from the bitline (BL), one of the global control signals clka, clkb, clkc, or clkd selects the proper latch to receive the bit. The control signal is global in the sense that every MC on TRAP receives the same control signal value. One of the global control signals cpa, cpb, cpc, or cpd is used to select which of the four stored bitstreams is currently in use to configure TRAP, and every memory cell on TRAP receives the same copy signal. That is, for each memory cell, only one of cpa, cpb, cpc, or cpd is high at any given time (meaning that its transmission gate is conducting), and that signal selects the latch that is supplying the current programming bit to the array.

The MC is able both to write a programming bit in from the bit line (BL) and to read a programming bit out to the BL. The extra feedback path controlled by re (read enable) is for transferring the data out during the reading process. The BL has a large capacitance as it travels a long distance. When reading data out, the driver is the minimum-size latch, which fails to fully pull up the BL. Thus, in TRAP the BL is precharged to a middle voltage before reading is executed, and an evaluation stage is added to help the MC drive the BL. This scheme is described in detail later in this chapter.

3.4.2 Dynamic Reconfiguration

As the control (so-called copy) signals (cpa, cpb, cpc, and cpd) and clock signals (clka, clkb, clkc, and clkd) are separately defined, maximum flexibility is available for retaining and deploying the bitstreams. This design enables dynamic reconfiguration, within a fraction of a clock cycle, by turning off the copy signal that is currently on (among cpa, cpb, cpc, and cpd) and globally turning a different one on. Thus, the fabric can be executing one virtual layer and switch to another layer with negligible delay. Furthermore, while one layer is executing, new programming bits can be loaded for a different, non-active layer in TRAP without disturbing the executing layer.

Consider a board-level virtualization example in TRAP, where each layer executes a separate (unrelated) digital logic system. Suppose that cpa is on and thus supplies the current configuration for TRAP. While the system configured by cpa is running, clkb may be on, and the programming bits comprising an alternative system are loaded into the second latch of each memory cell. Since cpb is off, the current state of TRAP is not impacted by the loading of the alternate system. After this alternate system is completely loaded using clkb, a third system can be loaded by shutting off clkb and turning on clkc. While clkc is being used to load the third system configuration, the second system can start (or resume) running, within a fraction of a clock cycle, by turning off cpa and turning on cpb.

In order to properly enable toggling between running systems, not unlike swapping jobs in a computer, separate finite state machine flip-flops must be available for each copy signal (cpa, cpb, cpc, and cpd). Along those lines, with respect to Fig. 3.7, there are four DFFs whose output may optionally appear at the first column of each LC. The four DFFs are enabled by cpa, cpb, cpc, and cpd, respectively.

In summary, it is possible to store the programming for four separate systems and toggle among them, each toggle taking place within a fraction of a clock cycle. Alternatively, one can toggle between two systems (within a fraction of a clock cycle) while loading another system. Thus, this design enables dynamic reconfiguration with negligible time overhead.


3.5 Hierarchical Function Blocks

3.5.1 Unit

Figure 3.15. Unit structure

All the functional blocks described in previous sections comprise one basic module of TRAP, called a unit. A unit contains one LC with built-in gates, upper and lower programmable interconnect, repeaters and pass transistors, local MCs, and a decoder. The decoder selects an MC column from the local MC array, which has 8 columns of 85 MCs each. The local memory array is separated into an upper half and a lower half in the layout to optimize the connecting wire length. The MCs inside a unit store all the programming bits needed for that unit. The left portion of Fig. 3.15 shows the detailed connections from the out lines of the MCs to the switches and the LC. As the unit structure is approximately symmetric, only the upper half of the unit is expanded on the left of the figure.

3.5.2 Group

Four units are combined to form the basic replicated block, termed a group, in TRAP. As mentioned in Section 3.3.2, a pass transistor used to segment horizontal metal tracks is replaced by a bi-directional repeater every four units. Fig. 3.16 shows how the repeaters are inserted within a group, where the repeaters are represented by the small square boxes. The horizontal repeaters are inserted every four units in a staircase sequence, and every horizontal track has one repeater within the range of a group. The first unit also shows that there is a repeater on every vertical track at the upper boundary of the programmable interconnect. The figure omits the vertical tracks in the other units, as they are defined in the same way as for the first unit. The decoder at the bottom selects the destination unit for the programming bits that are temporarily stored in the left-side memory buffer.

Figure 3.16. Group structure

3.6 Core Programming

3.6.1 Partial/Full Core Programming

Figure 3.17. Chip programming architecture


We use hierarchically arranged memory buffers to update configurations in TRAP. We designed an asynchronous memory buffer pipeline, controlled by a handshake scheme, to load programming bits into TRAP. Through these hierarchically connected memory buffers, TRAP can be selectively updated to support rapid partial reconfiguration. Our prototype has 11 rows and 7 columns of groups, labeled Ga_b (in which a and b are the row and column indices), and has 3 hierarchical levels of memory buffers, as shown in Fig. 3.17. The Level 2 memory buffer (L2) contains a 97-bit word of programming bits, including 12 address bits. L2 sends 94 bits to one of the 7 Level 1 memory buffers (L1), selected using 3 address bits. Then L1 sends 90 bits to one of the 11 Level 0 memory buffers (L0), selected by 4 address bits. As shown in Fig. 3.16 and Fig. 3.17, the selected L0 memory buffer uses two of the remaining address bits to select one of the 4 units in the group and the remaining 3 address bits to select the destination column within the local memory array of that unit. Thus, for every configuration word sent in, 12 address bits are consumed and 85 programming bits are saved in local MCs.

A complete set of programming bits (or bitstream) in TRAP comprises 2464 words (8 words per unit X 4 units per group X 77 groups) for one virtual layer. Since there are 85 programming bits per word, the total number of programming bits for one virtual layer in TRAP is 85 times 2464, which equals 209,440. Using the addressing feature, TRAP can partially and selectively update a stored configuration.
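As a concrete illustration of this word format and bitstream arithmetic, the following Python sketch packs one 97-bit configuration word; the field ordering within the word is an assumption made for illustration, not the actual on-chip bit order.

    # Pack one 97-bit configuration word: 12 address bits (3 L1-select +
    # 4 L0-select + 2 unit-select + 3 column-select) followed by 85 data bits.
    def pack_word(l1, l0, unit, column, data_bits):
        assert 0 <= l1 < 7 and 0 <= l0 < 11 and 0 <= unit < 4 and 0 <= column < 8
        assert len(data_bits) == 85
        addr = (l1 << 9) | (l0 << 5) | (unit << 3) | column
        word = addr
        for bit in data_bits:          # append the 85 programming bits
            word = (word << 1) | bit
        return word                    # 97-bit integer

    # One virtual layer: 8 words/unit x 4 units/group x 77 groups.
    words_per_layer = 8 * 4 * 77
    print(words_per_layer, words_per_layer * 85)   # 2464 209440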

3.6.2 Hierarchical Bi-directional Memory Buffer

The bidirectional memory buffers L2, L1, and L0 receive and send a payload of up to 97 bits, of which 12 are address bits. L2 uses 3 of the 12 address bits to locate the addressed L1 memory buffer, which then receives the payload. This L1 memory buffer then uses 4 address bits to route the payload to an addressed L0 memory buffer. Next, 2 address bits select one of the 4 units in the group. In each unit, the final 3 address bits are applied to the decoder to select which column of memory cells receives the payload.

Figure 3.18. L0 memory buffer connecting with the MCs in a unit

The L2, L1, and L0 memory buffers comprise both data registers (DR) and address registers (AR), as shown in Fig. 3.19. When writing configuration bits, a 5-bit address is sent to the L0 address registers. As discussed in previous sections, four units form a group. Among the 5 address bits, 2 bits (labeled BUS88 and BUS89) locate the addressed unit. The other 3 bits control the decoder and select the proper column of local memory cells (shown as small rectangular boxes in Fig. 3.18). In a unit, there is an array of 8 X 85 MCs. The L0 memory buffer sends an 85-bit word simultaneously to a column of MCs. The destination column of MCs is selected by the decoder, which consumes the 3 address bits labeled BUS85, BUS86, and BUS87.

Figure 3.19. Memory buffers: (a) Data register, (b) Address register

Fig. 3.19(a) shows the structure of the programming bit register (which we term the data register). When WE (write enable) is set high, the write-in path is ready, and clk triggers the DFF to pass the data from the bus to the BL. The DFFs in the L2, L1, and L0 memory buffers are controlled by distinct clk signals. These clk signals are derived from an asynchronous pipeline control unit, which is discussed in the next section.

Read-out is performed for an entire group and is initiated from off-chip, wherein a 7-bit address is supplied that identifies the group to be read out. These 7 address bits also identify the L1 and L0 memory buffers used to accomplish the read-out of the group. The read-out process for the identified group proceeds in an asynchronous pipelined manner, in the same way (albeit directionally reversed) as the write-in process.

In Fig. 3.19(a), when RE (read enable) is set high, the other path is opened to enable a read-out of the programming bit. The signal RE_bus is set high only after the DFF has updated the data during the read-out operation; there is a separate RE_bus signal for each BL. In the read-out direction, 11 L0 memory buffers share the load on a bus and connect to the same L1 memory buffer. On the next level, 7 L1 memory buffers share the load on a bus and connect to the single L2 memory buffer. For a read-out operation, the memory cell is too small to drive the load on the BL. Therefore, as shown in the dashed box in Fig. 3.19(a), we found that the optimum read-out time was achieved when the BL was pre-charged to 0.9V before performing the read operation.

As shown in Fig. 3.18, during a read, the L0 memory buffers collect programming bits from the local memory cells into the L0 data registers. Five-bit counter outputs (labeled CNTs in Fig. 3.18) are applied to the L0 address registers, and this address selects the proper column of memory cells. Eventually, the values are available on the BLs for the L0 data registers to capture. The address registers in the L0 memory buffer are defined in Fig. 3.19(b), including the dashed branch. A built-in counter sequentially generates the addresses of the group's local memory cells that are to be read out. Each address is applied to the L0 address registers and sent through Address Lines (ALs) to the decoder inside the group, which selects one column of local memory cells at a time. After the values are passed from the local memory cells to the BLs, the L0 data registers capture them; the L1 and then L2 data registers follow. This is how one 85-bit payload is read out.

After each read-out is completed, the counter increments the address until the read-out process is complete for the entire set of payloads for a unit, before advancing to the next unit. It is redundant to pass the address bits out, as the address pattern is determined by the counter and the group-level 7-bit address is supplied from off-chip. The only meaningful information is the 85-bit payload passed through the data registers.
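A sketch of the counter's address sequence for one group, under the assumption (stated above) that all 8 columns of a unit are read before advancing to the next of the 4 units:

    # Generate the 32 (unit, column) read-out addresses for one group,
    # matching a 5-bit counter that walks columns first, then units.
    def group_readout_order():
        for unit in range(4):
            for column in range(8):
                yield unit, column

    addresses = list(group_readout_order())
    print(len(addresses), addresses[:3])   # 32 [(0, 0), (0, 1), (0, 2)]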

3.6.3 Asynchronous Pipelined Control

Figure 3.20. Muller C element schematic and symbol


We use an asynchronous pipeline to achieve a high programming rate. For example, after the L2 memory buffer receives an 85-bit payload from off-chip in our prototype, it sends that payload (along with the address) to the respective L1 memory buffer once that L1 memory buffer indicates it is ready to receive it. When the L1 memory buffer receives the address and payload, the L2 memory buffer is freed to accept a new address and payload from off-chip.

A bounded-delay asynchronous pipeline control scheme was designed [34]. A Muller C element is used to send a request out, as shown in Fig. 3.20. If a memory buffer stage (L2, L1, or L0 in our prototype) receives a request in (thus, the request is high), and the acknowledge (actually the request) of the successor stage is low, then this stage issues a request to the successor stage. Conversely, if the request into a stage is low and the successor stage is currently sending a request (its request is high, corresponding to a high acknowledge), then the stage drives its acknowledge low (request low).
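The following Python sketch models the Muller C element and the request rule just described; the modeling style is our own illustration of the handshake, not the circuit itself.

    # A Muller C element: the output follows the inputs only when they agree,
    # and holds its previous value otherwise.
    class MullerC:
        def __init__(self):
            self.out = 0

        def update(self, a, b):
            if a == b:
                self.out = a
            return self.out

    # Stage rule from the text: request out = C(request in, NOT successor's
    # request). With req_in high and the successor idle, a request is issued;
    # with req_in low and the successor busy, the request is withdrawn.
    stage = MullerC()
    print(stage.update(1, 1 - 0))   # req_in=1, successor_req=0 -> issue (1)
    print(stage.update(1, 1 - 1))   # successor now busy -> hold (still 1)
    print(stage.update(0, 1 - 1))   # req_in low, successor busy -> drop (0)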

Figure 3.21. Asynchronous write-in control scheme


The asynchronous control scheme governs the programming bits written to the local memories of the units, as shown in Fig. 3.21. A request to write programming bits (RQ_WR) comes from off-chip. Local clock clk1_wr controls the DFF in the L2 memory buffer. The L2 memory buffer feeds any of seven L1 memory buffers in the prototype, depending on the address appended to the payload. Acknowledges from the seven L1 memory buffers are OR'd to form the acknowledge for Muller C element R1. There are seven local clocks, labeled clk2_wr in the figure, and each controls an L1 memory buffer by triggering its corresponding DFF (see Fig. 3.19). Since each L1 memory buffer in turn feeds any of eleven L0 memory buffers, depending on the address bits appended to the payload, acknowledges from the eleven L0 memory buffers are OR'd to form the acknowledge for Muller C element R2. There are eleven local clocks, labeled clk3_wr in the figure, and each controls an L0 memory buffer through its corresponding DFF. The local memory driven by an L0 memory buffer is controlled by local clock clk4_wr. Note that the last stage uses the delayed request as its acknowledge.

The delay elements, shown as boxes labeled 'D' in Fig. 3.21, are set based on careful worst-case simulation of the extracted layout of the prototype. Since this pipeline is used for programming bits and is not a signal data path, maximum throughput is not required. Therefore, we conservatively double the worst-case simulated delay to set each delay element value, leaving sufficient margin to handle process, voltage, and temperature (PVT) variations.

The asynchronous control scheme also governs the programming bits read from a group to off-chip, as shown in Fig. 3.22. A request to read the programming bits (RQ_RD) comes from off-chip, along with the 7-bit address that selects the group to be read out.

The clk1_rd signal is also used to trigger the pre-charging of the selected local memory's BL. As the BL is 320um long (it runs through 4 units) and connects to a significant number of transmission gates in the local memory cells, there is a large load on the BL compared to the driving ability of the minimum-sized memory cell. A pre-charge voltage of 0.9V minimized the worst-case delay and maximized the noise margin.

Figure 3.22. Asynchronous read-out control scheme

Local clock clk2_rd controls the local memory, causing it to place the payload onto the BL. Local clock clk3_rd controls the L0 memory buffer, causing it to place the payload onto the bus to which the L1 memory buffers are also connected. Local clock clk4_rd controls an L1 memory buffer, which receives the payload from one of eleven L0 memory buffers; therefore, requests from the eleven L0 memory buffers are OR'd in Fig. 3.22 to form the request into the L1 memory buffer. Local clock clk5_rd controls the L2 memory buffer, which receives the payload from one of seven L1 memory buffers; therefore, requests from the seven L1 memory buffers are OR'd in Fig. 3.22 to form the request into the L2 memory buffer.


3.7 I/O Design and Programming

The GF65LPE/GF55LPE PDK does not provide ready-to-use power pads or digital I/O pads in its module library. However, the PDK does provide multiple ESD modules that designers can use when constructing basic I/O pads. The digital input pad, digital output pad, bi-directional inout pad, VDD pad, and GND pad were carefully designed following the basic rules. Certain I/O design principles from Chapter 13 of 'CMOS VLSI Design: A Circuits and Systems Perspective' [35] were adopted.

3.7.1 Power Pad Design

VDD and GND pads are normally built using a stack of metal layers along with the vias between them. The top-level aluminum metal layer is called LB in the process we chose. The lowest metal layer used for the VDD and GND pads usually corresponds to the lowest layer needed for routing VDD and GND inside the chip. For ESD protection, an RC-based power rail ESD clamp is inserted under each power pad; the theory and circuitry of power rail ESD protection are explained in Section 3.7.3.2. Power and ground pads are connected to the on-chip power grid. A sufficient number of power pads (and ground pads) is necessary to supply the requisite power to all portions of a chip; otherwise, voltage levels may droop at locations that are starved of the needed charge. This typically manifests itself as an L di/dt voltage drop: too few power pads lead to a higher inductance, and a large demand for charge leads to a large di/dt. Adding as many power pads as the chip I/O budget allows certainly helps. In addition, adding as much decoupling capacitance as possible between power and ground provides a vast, widely distributed source of charge, which greatly reduces the effective inductance to the supply and suppresses supply and ground noise. In TRAP, there are 12 pairs of power/ground supply pads, distributed evenly around the chip boundary.

Separating the core power supply from the I/O power supply was considered but not adopted. Separation is essential if the I/O runs at a different voltage than the core, and it also isolates the noisy I/O power ring from the quieter core power supply. In TRAP, the core and I/O supplies are both 1.2V, so separation is not required. The benefit of reducing the coupling of I/O noise into the core matters most in a system where many pins switch simultaneously, which is not the case for the TRAP design. Note that for a system with multiple power supplies, the GNDs of the two domains are connected with a footer in between [36].

3.7.2 Digital I/O Pad and Bi-directional I/O Pad Design

3.7.2.1 Digital Input Pad Design

An input pad contains an inverter or buffers to convert the signal from the noisy external world into a valid logic level for the core circuitry. Using a Schmitt trigger instead of an inverter helps to produce a cleaner signal. For the TRAP digital input pad design, a Schmitt trigger at the first stage suppresses the impact of a noisy external input signal. The Schmitt trigger switches at different thresholds depending on its current output state, which increases the noise margin of the next input buffer. This hysteresis moves the switching point in the direction opposite to the input transition, filtering out the false pulses that might occur if the input rises slowly or is very noisy. Note that the input pad also contains ElectroStatic Discharge (ESD) protection circuitry, described in Section 3.7.3.
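A minimal behavioral sketch of this hysteresis in Python; the threshold voltages are illustrative assumptions, not the PDK values.

    # Schmitt trigger behavior: the output flips high only above v_rise and
    # flips low only below v_fall, so noise between the thresholds is ignored.
    def schmitt_trigger(samples, v_rise=0.8, v_fall=0.4, state=0):
        out = []
        for v in samples:
            if state == 0 and v > v_rise:       # must exceed the high threshold
                state = 1
            elif state == 1 and v < v_fall:     # must drop below the low threshold
                state = 0
            out.append(state)
        return out

    # A noisy ramp around a single-inverter trip point (~0.6V) would chatter,
    # but the hysteresis yields exactly one clean transition:
    noisy_ramp = [0.3, 0.55, 0.65, 0.5, 0.7, 0.85, 0.75, 0.9]
    print(schmitt_trigger(noisy_ramp))   # [0, 0, 0, 0, 0, 1, 1, 1]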


3.7.2.2 Digital Output Pad Design

The most important ability of an output pad is sufficient drive capability for a given capacitive load. Typically, the circuitry for an output pad has multiple levels of buffers sized by logical effort. The first stage of the buffer chain is very small to reduce the load seen by the core. The later-stage transistors become very large and are folded into many fingers. Latchup is a particular problem at output pads. To avoid latchup, the big nMOS and pMOS transistors should be surrounded by guard rings. In the TRAP output pad design, the output transistors, which are those whose drains connect directly to external circuitry, are isolated by double guard rings.

3.7.2.3 Bi-directional I/O Pad Design

The bi-directional I/O pad merges the previous input and output pads together. As shown in Fig. 3.23, the upper branch is for data propagating out of the chip, and the bottom branch is for data coming into the chip. The two branches on the left share a pin pad connected to a metal track (M_track) from the core. The shared pad on the right is a metal pad connecting to the package.

The upper circuitry uses a buffer chain to drive the data to the pad. The logical effort formulation was used to derive the number of stages; in this design, there are two buffers. The last stage in the chain must be tri-stated to enable or disable the output path. The circuitry in blue shows a clever variation in which the NAND and NOR are merged into a single six-transistor network with two outputs. Such a tristate buffer is smaller and presents less input capacitance [35].

An ESD protection module is attached right next to the metal pad; its principles are discussed in the next section. A Schmitt trigger is inserted at the first stage of the input circuitry. A Schmitt trigger has hysteresis that raises the switching point when the input is low and lowers the switching point when the input is high. This helps filter out glitches that might occur if the input rises slowly or is rather noisy [35].

Figure 3.23. Bi-directional I/O schematic

3.7.3 ESD Protection

3.7.3.1 Input Pad ESD Protection

One of the well-known reliability issues in the integrated circuit industry is ElectroStatic Discharge (ESD). The main goal of ESD protection circuits is to provide a low-resistance discharge path for voltages outside the desired range of GND to VDD. The essence of ESD protection is to provide a controlled path to discharge high voltages without damaging the gate oxides. The path consists of extra circuit elements that clamp the I/O pins to safe levels.


Figure 3.24. Input ESD protection

As shown in Fig. 3.24, the thin gate oxide is the hardware we aim to protect. A human directly or indirectly touching an input pad can apply a high electrostatic voltage to that input pad. As shown in Fig. 3.24, a typical ESD input protection circuit consists of diode clamps and a current-limiting resistor. When the input pad voltage becomes greater than about VDD + 0.7V or less than -0.7V, one of the diodes turns on and shunts the ESD current into the power network, avoiding damage to the thin gate oxide.

3.7.3.2 Power Pad ESD Protection

The power supply pads have a different ESD protection module. The previously described ESD protection circuit constructs a path that shunts the offending current into the power rails. If that module were used between the VDD and GND pads, it would simply short them together, which is functionally incorrect, let alone helpful for ESD protection.

As shown in Fig. 3.25, the power pad ESD protection circuit is the classic RC-based power rail ESD clamp circuit. The RC-clamp shunts current from the power supply line to ground. When the voltage on the power rail is stable at VDD, the capacitor is fully charged, so the voltage at the node between the resistor and the capacitor, labeled A, is VDD. This logic 1 is inverted to 0 after 3 inverters and applied to the nMOS, which keeps the power rail and ground rail isolated. When a voltage spike occurs on the power rail, the RC network needs time to respond to the change, while the trip point of the first inverter, which tracks the rising rail voltage, moves up immediately. If the spike in VDD is sufficiently large, the voltage at node A falls below the trip point of the first inverter, which causes it to output a logic 1. As a consequence, the output of the inverter chain becomes logic 1, and the nMOS turns on, constructing a path that shunts current to the ground rail. The RC-triggered power clamp is an efficient implementation of this protection goal. An RC-clamp is present inside each VDD pad in TRAP.
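A crude first-order sketch of this trigger mechanism in Python; R, C, the trip fraction, the nMOS threshold, and the waveforms are all illustrative assumptions, not extracted values.

    # Node A low-passes the rail through RC; a fast spike leaves A below the
    # rail-referenced inverter trip point and turns the clamp nMOS on.
    R, C = 100e3, 10e-12     # 100 kOhm, 10 pF -> tau = 1 us (assumed values)
    TRIP = 0.5               # inverter trip point, as a fraction of the rail
    VTH = 0.4                # nMOS needs enough gate drive to actually conduct
    dt = 10e-9               # 10 ns simulation step

    def clamp_fires(rail):
        v_a = rail[0]
        for v_rail in rail:
            v_a += dt / (R * C) * (v_rail - v_a)       # RC response of node A
            if v_a < TRIP * v_rail and v_rail > VTH:   # inverter flips, nMOS on
                return True
        return False

    slow_ramp = [1.2 * min(1.0, i / 1000) for i in range(2000)]  # ~10 us power-up
    esd_spike = [0.0] * 5 + [4.0] * 200                          # abrupt 4 V event
    print(clamp_fires(slow_ramp), clamp_fires(esd_spike))        # False True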

Figure 3.25. RC-clamp for power pad ESD protection

3.7.4 Metal Track Selecting Multiplexer

The programmable architecture has three basic functional modules: logic blocks, programmable interconnect, and programmable I/Os. The programmable interconnect is described in Section 3.3. At the core boundary, there are a large number of metal tracks that can be routed off-chip through the programmable I/Os, which use the bidirectional I/O pad described in Section 3.7.2.3. There are two challenges: how to connect the many metal tracks in the programmable interconnect to the limited number of programmable I/Os, and how to program these bi-directional I/O pads.

Figure 3.26. 16:1 metal track selecting MUX with its programming scheme

TRAP has 72 programmable bi-directional I/Os in total. At the boundary of the chip core, selected metal segments are connected to a 16:1 multiplexer and then passed to a primary I/O pad. These 16 metal segments are the candidate paths for that I/O pad and can directly send/receive a signal to/from off-chip. Whenever one track segment is turned on among the 16 tracks connecting to the same multiplexer, the remaining 15 metal tracks can no longer be candidates for primary input or output. However, these 15 metal tracks can still be used for internal signal paths.

The right half of Fig. 3.26 shows the 16:1 bi-directional multiplexer, with one side connecting to the 16 candidate metal track segments and the other side connecting to the bi-directional programmable I/O pad.

3.7.5 I/O Programming

The 72 bi-directional I/Os are programmed serially. The I/O programming input pad is at the top left corner of the chip. A large shift register aligned around the boundary of the chip shifts the programming bits in a clockwise direction. As shown in Fig. 3.26, each programmable bidirectional I/O pin pad uses 5 programming bits: one bit selects the I/O direction (input or output), and the other 4 bits control the 16:1 multiplexer to select the metal track segment to be used. Two pads connect to each group. As the chip consists of 11 rows and 7 columns of groups, the left and right sides each have 22 programmable I/O pads, and the top and bottom sides each have 14 programmable I/O pads. The total number of programmable I/O pads is 72, and the number of I/O programming bits is 360.
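A sketch of assembling the 360-bit I/O programming stream from these 5-bit fields; the per-pad bit order and the pad ordering are assumptions for illustration, since the real ordering follows the physical shift-register ring.

    # 72 pads x 5 bits each (1 direction bit + 4 MUX-select bits) = 360 bits.
    def io_word(is_output, track_sel):
        assert 0 <= track_sel < 16
        return [int(is_output)] + [(track_sel >> b) & 1 for b in (3, 2, 1, 0)]

    def io_bitstream(pad_configs):
        assert len(pad_configs) == 72
        bits = [b for cfg in pad_configs for b in io_word(*cfg)]
        assert len(bits) == 360
        return bits

    # Example: every pad configured as an input listening on metal track 5.
    stream = io_bitstream([(False, 5)] * 72)
    print(len(stream))   # 360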

As the speed of loading programming bits is not critical, the clock tree of the I/O programming shift register is not designed as a balanced distribution network. To save area and for design simplicity, this clock tree is an evenly buffered long wire without branches; the delay accumulates with the distance the clock has propagated. Since the shift register is implemented with each DFF feeding the next directly, with no gates in between, the combinational propagation delay is very small, so hold time violations are the main concern in this approach. To alleviate this potential hold time problem, we propagate the clock signal in the direction opposite to the data path: the clock tree is buffered in a counterclockwise direction while the data shifts clockwise. In this way, the clock transition arrives slightly later at the sender DFF than at the receiver DFF, so the receiver DFF sees a slightly negative clock skew, which eliminates the hold time concern.

In Chapter 4, we will discuss the CAD tool flow for implementing a design on the TRAP fabric. Here we record the information about the 72 programmable I/Os needed by the placement and routing tools in that flow. On the left side of the chip, the pin pads are named left1, left2, ..., left21, and left22 from top to bottom. On the right side, the pin pads are named right1, right2, ..., right21, and right22 from top to bottom. On the top side, the pin pads are named up1, up2, ..., up13, and up14 from left to right. On the bottom side, the pin pads are named bottom1, bottom2, ..., bottom13, and bottom14 from left to right.

As mentioned in the last section, the programmable interconnect at the core boundary has optional connections to the programmable I/Os. The detailed pattern of which metal tracks are candidates and which are discarded for the programmable I/Os is shown in Fig. 3.27 through Fig. 3.30; we discuss the patterns on the four sides of the boundary separately. A portion of the I/O programming shift register is included in each figure. In Fig. 3.27, the M4 metal track segments that are candidates for passing to a programmable I/O are highlighted in red for a group at the top side of the core boundary. A group with 4 units (labeled UNIT_1, UNIT_2, UNIT_3, and UNIT_4) is shown in the figure. Of the 48 vertical M4 metal track segments, 32 are candidates, and the remaining 16 metal tracks, labeled in black, terminate at the top boundary. Fig. 3.28 shows the particular fourth unit within a group on the right side of the core boundary. The design detail of a unit should look familiar, as we discussed it in Section 3.5.1.


Figure 3.27. Top-side metal track selections for primary I/Os


Figure 3.28. Right-side metal track selections for primary I/Os


Figure 3.29. Bottom-side metal track selections for primary I/Os


Figure 3.30. Left-side metal track selections for primary I/Os


Two programmable pads correspond to this group, and in this case they connect to the metal tracks in the fourth unit: 16 horizontal M3 tracks and 16 M5 tracks from the fourth unit are candidates for passing to a programmable I/O and are labeled in red. Fig. 3.29 looks like a symmetric copy of Fig. 3.27; however, the shift register proceeds in a clockwise sequence from the top left corner, which means that in these two figures the shift registers run in opposite directions. The same difference applies to Fig. 3.28 and Fig. 3.30. These figures not only explain the I/O programming scheme from a chip-level view but also show the proper connections. The placement and routing tools must be aware of the programmable I/O structure.

We use a customized version of the open-source tool VPR (Versatile Place and Route) for routing TRAP designs. VPR is a coordinate-based tool with its origin at the bottom left corner. In the VPR definitions, each I/O module has two sites (site_0 and site_1). There are 11 I/O modules on the left side. The left22 pin pad is programmed as (0,1) at site_0, and the left21 pin pad as (0,1) at site_1. The left1 pin pad is (0,11) at site_1, and the left2 pin pad is (0,11) at site_0. The full naming pattern is shown in Table 3.1.

Table 3.1. The site coordinates for the programmable I/Os in VPR

I/O pad   Location in VPR   I/O pad    Location in VPR   I/O pad   Location in VPR   I/O pad    Location in VPR
Left1     0, 11, 1          Right1     29, 11, 1         Up1       1, 12, X          Bottom1    1, 0, X
Left2     0, 11, 0          Right2     29, 11, 0         Up2       3, 12, X          Bottom2    3, 0, X
Left21    0, 1, 1           Right21    29, 1, 1          Up13      25, 12, X         Bottom13   25, 0, X
Left22    0, 1, 0           Right22    29, 1, 0          Up14      27, 12, X         Bottom14   27, 0, X
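The pattern in Table 3.1 can be captured by a small lookup helper; this sketch assumes the patterns implied by the listed corner entries extend to the unlisted pads as well.

    # Map a pad name like 'left3' or 'up7' to its (x, y, site) in VPR.
    def vpr_site(pad):
        side = pad.rstrip("0123456789")
        n = int(pad[len(side):])
        if side in ("left", "right"):
            x = 0 if side == "left" else 29
            return (x, 12 - (n + 1) // 2, n % 2)   # odd-numbered pads at site 1
        if side in ("up", "bottom"):
            y = 12 if side == "up" else 0
            return (2 * n - 1, y, None)            # site not distinguished (X)
        raise ValueError(pad)

    assert vpr_site("left1") == (0, 11, 1)
    assert vpr_site("left22") == (0, 1, 0)
    assert vpr_site("up14") == (27, 12, None)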


3.8 Clock Tree Design

In a synchronous digital chip, the clock should ideally arrive at all sequential logic elements in the system simultaneously. In practice, unbalanced clock paths, non-uniform load locations and amounts, power rail spikes at the destination, process variations, and various sources of noise cause the clock to arrive at different times at the various destination DFFs. The magnitude of the difference in the arrival time of the rising clock edge at the various locations is defined as clock skew. Clock skew is unwanted because it subtracts from the portion of the clock period during which logic computations must complete, straining the ability to meet setup times; hold time violations can also result from unanticipated clock skew. At the chip level, the global clock must be distributed across the chip so that it reaches all the clocked elements at nearly the same time. Global clock distribution networks can be classified as grids, H-trees, spines, ad hoc, or hybrid [37]. In TRAP, we chose the H-tree structure for global clock distribution.

The H-tree is a fractal structure built by recursively drawing H shapes. The vertices of the H at the current level connect to the center of the H at the next level, as shown in Fig. 3.31. Buffers are inserted to maintain a good slew rate along the signal propagation path. The external clock signal starts its propagation from an I/O pad. The H-tree clock path is evenly distributed, meaning that the path length from the clock pad to any destination is identical.

The H-tree structure is simple and regular, which works perfectly if the clock loads are uniformly distributed; in that case, an H-tree can have zero systematic skew. The rectangles in Fig. 3.31 represent DFFs. All DFFs connected to the clock network theoretically receive a clock pulse at the same time. Due to its simple and efficient branching, an H-tree tends to use less wire than a clock grid.


Figure 3.31. H-trees clock distribution

TRAP has an extremely regular layout, as it consists of replications of the group in an array. The leaves of the clock tree are the built-in DFFs in every unit: one built-in DFF per unit, four units per group, and 7 by 11 groups in TRAP. However, it rarely happens that the number of H-tree leaves and the number of sequential logic elements are exactly the same; in the TRAP case, neither 7 nor 11 is a power of two. For the 7 columns of groups in TRAP, we generate 3 levels of H-trees, which have 8 leaves. One leaf does not connect to a DFF but is retained for balancing the load on the other paths. The TRAP design simulation results show near-zero skew.
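A sketch of generating H-tree leaf points by recursive binary splitting with alternating direction, which is one way to view the 3 levels and 8 leaves described above; the coordinates and spans are illustrative, not the actual TRAP geometry.

    # Each level splits a point into two, alternating horizontal and vertical
    # branches and halving the span; every root-to-leaf path length is equal,
    # which is what yields near-zero systematic skew.
    def h_tree_leaves(x, y, span, levels, horizontal=True):
        if levels == 0:
            return [(x, y)]
        dx, dy = (span, 0) if horizontal else (0, span)
        return (h_tree_leaves(x - dx, y - dy, span / 2, levels - 1, not horizontal)
              + h_tree_leaves(x + dx, y + dy, span / 2, levels - 1, not horizontal))

    leaves = h_tree_leaves(0.0, 0.0, 8.0, 3)
    print(len(leaves))   # 8 leaves; TRAP loads 7 of them and balances the 8th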

3.9 Power Network Design

The power network is so complicated that manual analysis is hard to accomplish; it typically must be modeled in a finite element simulation [35]. This section therefore introduces and discusses the problem qualitatively, without numerical computation. It focuses on the power integrity of the power network; the return paths in signal wires that are caused by the power network are not discussed.

The two fundamental power supply noise sources are IR drops and L di/dt. The power supply comes from an off-chip source and proceeds through the printed circuit board (PCB), the package, and the on-chip power pads, eventually reaching the on-chip power networks. Because the package and PCB typically use copper that is much thicker and wider, their resistance is negligible compared to on-chip resistance. The chip center, which is farthest from the power pads, normally sees the largest resistive drop. The inductance of the power supply is dominated by the inductance of the bond wires from the die to the package. The resistance is reduced by implementing a power grid network, while the inductance can be reduced by carefully choosing a package. The largest sources of current transients are switching I/O signals and changes between idle and active mode in the chip core; current transients are largest in high-performance circuits. The key to eliminating noise from current spikes is to provide adequate decoupling capacitance. A substantial amount of decoupling capacitance should be placed between power and ground across the chip, so that a local need for charge can be supplied by the decoupling capacitance instead of through the power grid.

Current flowing along a signal wire must return through the power/ground network. The area of the resulting current loop sets the inductance of the signal. A discontinuity in the power grid can force return current to find a path far from the signal wire, greatly increasing the inductance, which increases delay and noise. Because signal inductance is usually not modeled, this delay and noise will not be discovered until after fabrication [35].

In TRAP, the power/ground grid network generously uses the top two metal layers (the EA and EB layers in the GF55LPE PDK). These thick, low-resistance layers carry the bulk of the current. Many vias and metal straps run from the top layers down to the bottom layers to carry the current down to the transistors. Moreover, any spare space is filled with large decoupling capacitors.


CHAPTER 4

DESIGN DECISIONS

4.1 Logic Cell Design Decisions

The commonly used LUT architecture in FPGAs is a programmable cascade of multiplexers arranged tournament style to produce an output (the winner). In addition, numerous inverters, buffers, and half-keepers are needed to complete the multiplexer cascade, collectively requiring a large number of transistors. Also, when a logic function is mapped to a LUT, the entire LUT is occupied even if the function has fewer inputs than the LUT can accommodate. This causes significant fragmentation and wastes resources.

TRAP was motivated by the desire to create a programmable fabric comprising a sea of transistors. We decided to arrange the transistors in identical rows of arbitrary length, with an arbitrary number of rows. From there, we decided that gates could be constructed to mimic a standard cell library, where all cells have the same height (essentially equal to the row height) and variable width, depending on the complexity of the gate. Most of the transistors can either receive logic signals (inverted or not) or be programmed to perform isolation or connection. As there are four choices for each transistor gate, 4-to-1 multiplexers are connected to the inputs of these transistors. We wanted to implement gates with up to 3 transistors in series in their pull-up or pull-down networks. We found that this could be accomplished by a column having two nMOS and two pMOS devices in series, whereby the third device in series is obtained by using a transistor whose drain and source connect between two columns.

Our trade-off was that we wanted to minimize the number of transistors that would go unused in any given design, while still enabling a reasonably large library of functions that could be implemented in TRAP. Going to 3 nMOS and 3 pMOS devices in series in each column surely would have enabled many more library gates to be defined, but for any given design, many of those transistors would not receive a logic signal at their gates, since in our experience the simpler gates are used much more often.

Conventional FPGAs have a few built-in gates in their logic blocks to improve logic efficiency. As our basic logic cell (LC) has four columns of transistors, we decided to multiplex those 4 outputs with the outputs of 4 custom built-in gates. The first selection was the DFF, since it is typically the most complex library cell. Next, the full adder was selected, as it is used quite often in digital ICs and the combination of the sum and carry units takes up quite a bit of area; the two full adder outputs (carry and sum) use two of the four LC outputs. Finally, our experience was that a multiplexer is also frequently used and consumes considerable area, so we added the MUX as well. The I/Os of these built-in gates are shared with the LC I/Os. Using the built-in DFF takes only one column of an LC, compared to 8 columns if directly mapped to the transistor array. The FA using the transistor array requires 8 columns, whereas the custom FA occupies only 2 columns. Finally, the MUX is reduced from 2 columns to 1 using the built-in version.
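The column savings quoted above can be tallied directly (a trivial sketch using only the numbers from the text):

    # LC columns needed via the transistor array vs. the built-in version.
    savings = {
        "DFF": (8, 1),
        "FA":  (8, 2),   # sum + carry use two of the four LC outputs
        "MUX": (2, 1),
    }
    for gate, (array_cols, builtin_cols) in savings.items():
        print(f"{gate}: {array_cols} -> {builtin_cols} columns "
              f"({array_cols / builtin_cols:.0f}x denser)")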

Adding an optional polarity change to the inputs of the transistor array and the built-in gates eliminates the need to ever use the transistor array merely to implement an inverter.

4.2 Programmable Interconnect Design Decisions

In ASIC standard cell design, pMOS and nMOS transistors are connected in pairs, with each pair receiving a logic signal. In TRAP, however, it is not possible to permanently pair up pMOS and nMOS transistors: the various gates that can be configured in TRAP pair up nMOS and pMOS transistors in different ways. As a result, it is necessary to route a logic signal to the pMOS section of an LC separately from the nMOS section. That is, the route from a driver to a logical input of a target gate splits at some distance before it reaches the two terminals (the relevant pMOS and nMOS devices). This is not a significant challenge for routing, aside from a slightly increased demand for routing tracks; however, it does present (solvable) challenges for timing analysis.

In one LC, there are 24 inputs and 4 outputs, enabling full flexibility in mapping any gate in the library at any starting point. All inputs and outputs connect to the programmable interconnect. The relatively large number of inputs puts pressure on the programmable interconnect architecture.

A novel programmable interconnect architecture was developed to support transistor-level flexibility while consuming less area. In conventional FPGAs, switch-boxes connect vertical and horizontal wires; these switch-boxes consist of switches and 2 programmable bi-directional buffers, and each requires at least 38 transistors, as shown in Fig. 3.13(b). A transistor, on the other hand, is the simplest switch, and an nMOS is stronger than a pMOS of the same size because of the mobility difference. We therefore selected a single nMOS to connect horizontal-to-vertical and vertical-to-horizontal wire segments. Different nMOS pass transistor sizes (1um, 1.5um, and 2um) were implemented in the interconnect to trade off path delay against area. In this architecture, increasing the pass transistor size yields very little delay improvement but proportionally increases the area. As the nMOS pass transistor cannot pass a full VDD, a half-keeper is attached to each wire segment.

The other option would have been to use a transmission gate (T-gate) as the switch. A T-gate consists of an nMOS and a pMOS connected in parallel (plus an inverter to generate the complementary control signal for the pMOS), which consumes much more area than a single nMOS pass transistor. As switches are used at a great many intersections in the interconnect, the single nMOS transistor is the area-efficient solution compared to the T-gate.

The programmable interconnect provides candidate routing paths from primary I/Os to LC I/Os and between LC I/Os. As shown in Fig. 4.1, the upper half of the interconnect network inside a unit is highlighted as 5 parts. The parts labeled A, B, C, and D form the original version of the programmable interconnect, while the part in the dashed rectangle is designed for passing states between virtual layers. Part D provides tracks for all pMOS inputs as well as the outputs of an LC. Part C contains tracks between M4 and M3, which can be programmed to propagate LC I/O signals. Part A contains switches between M3 and M4, and between M4 and M5; M4 and M5 tracks are used for long vertical and horizontal paths. Part B passes the input signals down to the I/Os of an LC.

The number of metal tracks and the number of switches on each track affect the routability of the programmable interconnect. The basic rule followed was to balance the number of switches across tracks. The number of switches on each metal track, together with the number of tracks, is termed the switch pattern. It turns out that VPR does model a number of the A, B, C, and D patterns. The efficiency of an interconnect pattern is evaluated by implementing benchmarks. The resources in the dashed rectangle support passing computational states between virtual layers: they contain the tracks from the extra output directly to the input of an LC in the same unit and can also be connected to the original programmable interconnect. As shown in Fig. 4.1, this interconnect pattern was optimized based on a number of experiments. The optimization of the switch pattern depends on many factors and is complicated enough to be a separate research topic.

Figure 4.1. The optimized interconnect pattern

A metal track is segmented by a pass transistor at the boundary of a unit. When the aggregate wire segment length exceeds a certain limit, the pass transistor is replaced by a bi-directional repeater.


A switch-box in the conventional interconnect architecture contains 6 pass transistors and two bi-directional repeaters, whereas a switch in the TRAP architecture is a single transistor. One half-keeper is added per metal track, and a small number of repeaters are inserted as required. We surmise that this programmable interconnect architecture uses about an order of magnitude less area than the conventional architecture.

Note that the tracks in TRAP segment only at the boundary of a unit, not at every switch. Whenever any portion of a track segment is used, the whole segment becomes unavailable.

4.3 Design of Virtualizations

Board-level virtualization was developed first. By storing a number of configurations on-chip and selecting among them with multiplexers, the reconfiguration delay between independent configurations becomes virtually negligible. Board-level virtualization was designed so that while one configuration is programmed on the fabric, another configuration may be in the process of being loaded. The original version of the programmable transistor array (which we referred to as an FPTA) stored up to 3 independent configurations. The second version (TRAP) increased the number of independent configurations to 4 and added chip-level virtualization. The latter feature required the addition of four DFFs per LC in order to pass output states from the current virtual layer to a subsequent virtual layer. Also needed were additional interconnect resources, as shown in the dashed rectangle in Fig. 4.1.

4.4 Memory Design Decisions

In any type of field-programmable array, including TRAP, the output of every memory cell must have a direct and active connection to a programmable transistor, a programmable logic gate input, or another programmable load (such as a LUT input). The 6-T memory cell requires both polarities of the BL in order to reliably write new values into the cell. In addition to the six transistors of the so-called 6-T memory cell, at least one inverter must be added to drive the programmable wire and corresponding load, so the basic memory cell grows to at least 8 transistors. An entire row of these 8-T memory cells can be written at the same time. A further critical challenge was how to support writing new programming bits while programming bits are concurrently being read: while one configuration is running on the fabric, it must be possible to write a new configuration into the memory. We also did not have access to the layout of the 6-T memory cell from GlobalFoundries; typically that layout is about 1.5-2X smaller than what can be achieved using standard design rules. The increased transistor count (8) for each memory cell of the so-called 6-T SRAM caused us to rule out this approach for TRAP.

TRAP's memory cell instead contains 4 conventional level-sensitive latches in parallel to store up to 4 programming bits for the 4 possible configurations. The feedback tri-state inverter is turned off when a new programming bit is being written. The 4 latches share the input line (BL) and output line (out), as shown in Fig. 3.14. The memory cells are organized in arrays within a group: a row of memory cells in a group (8 memory cells/unit X 4 units/group), which is 32 memory cells, shares one BL.

TRAP physically stores up to 4 bitstreams simultaneously, where each stored bitstream comprises N programming bits, with N equal to 239,368. The number of wires (BLs) for writing the memory cells is N/32, and the number of wires (out) for reading the memory cells is N.


4.5 Core Programming Design Decisions

Because of the limited number of I/Os, the bitstream is serially loaded into the TRAP chip. The programming bits are accumulated into a word by a shift register. The storage of a group comprises 32 words, and the destination of a word can be any one of the 7 X 11 groups. This requires the datapath to fan out across the chip, so hierarchical memory buffers were added to the design. A bounded-delay asynchronous control scheme was developed, since it was not difficult to obtain the bounded delay for each stage in the pipeline. Furthermore, the bounded delays of the pipeline stages are very different, meaning that a clocked pipeline scheme would not be efficient; it would also require another global clock to be distributed through the TRAP chip. In the asynchronous control scheme, a word of programming bits passes to one of the 7 L1 memory buffers, then to one of the 11 L0 memory buffers, by address decoding. Each group has an L0 memory buffer on its left side, which drives the BLs of the memory cells in the group.

4.6 I/O Programming Design Decision

At the boundary of the chip core, there are 2,256 terminals from the programmable interconnect, but TRAP has only 72 programmable I/Os, so only a small fraction of the metal tracks can actually be connected to programmable I/Os. To give routing more choices, each I/O connects to 16 metal tracks through a bi-directional MUX. Thus, 1,152 of the 2,256 metal tracks have direct paths to primary I/Os. To save area, the bi-directional MUX consists of 16 T-gates rather than 16 bi-directional repeaters.

The 16 T-gates add a large load to the path. When a signal comes into the chip, the input pad driver is close to the bi-directional MUX. When a signal propagates off the chip, the driver (the last repeater the signal has passed) can be as far away as the width of a group. Careful simulations have shown that there is sufficient driving ability in both cases.

The programming bits for the I/Os are loaded by a shift register, which differs from the hierarchical memory buffers used for the core. The bitstream for the I/Os and the bitstream for the core are in different formats and are loaded into TRAP separately. Most of the controls are shared between loading the I/O and core bitstreams to reduce the number of pins used for control.


CHAPTER 5

FABRICATION AND PACKAGE

5.1 FPTA Design Layout

The first transistor array we designed was called a Field Programmable Transistor Array (FPTA). We designed and laid out a prototype for fabrication using the IBM130 1.2V process; the layout is shown in Fig. 5.1. The core design was completed, but the I/O ring design was not. The design includes a 6 X 4 array of groups, each containing eight units. As one unit contains one LC, the FPTA layout has a total of 192 LCs. The overall size of the layout is 4113.41um X 2769.50um.

Figure 5.1. FPTA design layout


The FPTA design was not taped out, in part because the metal stack we used was no longer supported for fabrication. However, it was our first trial of a chip-level layout of the design, and much experience was gained in evaluating our ideas and design features.

5.2 TRAP Design Layout

After checking with MOSIS, we decided to use the GF55LPE process. The metal stack consists of 6 thin metal layers, 2 thick metal layers, and 1 aluminum layer (6_00_02_00_LB). GF55LPE is a sub-node of GF65LPE; the poly length and DRC rules are exactly the same, so from the designer's perspective there is very little difference. After the GDSII file is sent to the foundry, a 15% shrink is applied at the fabrication step.

Table 5.1. TRAP chip summary

Technology                  55nm 10LPE CMOS
Core VDD                    1.2V
I/Os                        72 Bidirectional + 60 Bonded
Core Size                   2.6mm x 2.7mm
Transistor Count            > 10 Million
I/O Bitstream Bit Count     360 / Layer
Core Bitstream Bit Count    239,008 / Layer


Figure 5.2. TRAP layout

The completed layout of TRAP is shown in Fig. 5.2. The two thick top layers are not displayed, to make the structures on the lower layers visible. The TRAP die size is 3mm X 3mm, and the core size is 2.6mm X 2.7mm. The TRAP core contains a 7 X 11 array of groups, each containing four units. As one unit contains one LC, the TRAP layout has a total of 308 LCs. Table 5.1 gives a brief list of the TRAP chip properties. The supply for the TRAP core and I/O is the same, 1.2V.

Figure 5.3. TRAP die photo

The chip has 132 I/Os: 72 programmable bi-directional I/Os, 24 power pads, and 36 global control pads. The I/O bitstream contains 360 bits, and the core bitstream contains 239,008 bits. Together, the two bitstreams can program a circuit of up to 1,232 gates and 72 bidirectional I/Os.


The programming bits are stored in MCs on TRAP, which has a total of 239,368 MCs. As described in Chapter 3, each MC has four latches in parallel, so the TRAP chip can store up to four configurations at the same time, i.e., 957,472 programming bits. The estimated transistor count for TRAP exceeds 10 million.

5.3 TRAP Package

The TRAP die photo is shown in Fig. 5.3; it matches the layout shown in Fig. 5.2. The packaging vendor took this die photo in order to ask how the die orientation should be defined. The outline of the chip is called the image bevel. A label should have been placed in the top right corner inside the image bevel using the top metal layer; such a label identifies the orientation of the die and avoids misplacing the die inside the package.

Given a chip size of 3mm X 3mm, we can calculate the maximum number of I/Os that fit on the die. Within that limit, we assigned 72 pads to programmable bi-directional I/Os, 36 pads to global controls, and 24 pads to the power supply. The resulting I/O plan has 132 I/Os.

Figure 5.4. PGA132M topside view and underside view


Figure 5.5. (a) TRAP, PGA132M pad bonding to corresponding package connection, (b) Connection to labeled I/O name


A pin grid array (PGA) is a type of integrated circuit package. In a PGA, the package is square or rectangular, and the pins are arranged in a regular array on the underside of the package. PGAs are often mounted on printed circuit boards (PCBs) using the through-hole method or inserted into a socket. Based on these features, PGA132M was chosen as the package for TRAP. The topside and underside views of the TRAP package are shown in Fig. 5.4.

As shown in Fig. 5.5(a), the inner square represents the die, and the drawn lines represent bonding wires. The small rectangles on the four sides are labeled with the PGA pin indices. In Fig. 5.5(b), the blue names are the I/O pin names for TRAP; each name is placed close to its bonding pad.

The TRAP PGA package is inserted into a socket. The socket is mounted on a PCB. The PCB connects TRAP with an FPGA as the test setup. The details will be discussed in Chapter 6.


CHAPTER 6

TEST SETUPS AND TEST RESULTS

6.1 CAD Tool Flow for TRAP Design

This section describes the CAD tool flow developed for TRAP programming bit generation. First, we characterized and compiled a 55nm cell library for TRAP using Synopsys' SiliconSmart and Library Compiler. The base cell library has 24 gates whose series limit is 3 transistors, plus built-in gates (DFF, full adder, and multiplexer) with their different input and output polarities.

The 24 base library components are: INV, NAND2, NOR2, AOI12, AOI22, OAI12, OAI22, NAND3, NOR3, AOI31, OAI31, AOI41, OAI41, AOI32, OAI32, AOAI311, OAOI311, AOAI211, OAOI211, AOOAI212, OAAOI212, AOAAI2111, OAOOI2111, and MAJI (the inverted mirror carry). In addition, the following custom cells are built into the LCs: FA, FAI, DFF, MUX, and MUXI.

Synopsys' Design Compiler is used to synthesize a gate-level netlist from the Register-Transfer Level (RTL) description of the design; any application can be synthesized from this library in the same way.

TimberWolf features a complete timing-driven placement and routing tool. It is applicable to row-based design styles, namely standard cell, gate array, and sea-of-gates circuits [38]. We leverage the TimberWolf placement and partitioning tool for the TRAP architecture.

The TimberWolf placer is very effective for row-based placement and was extended to handle the specific TRAP architecture. Each built-in gate in an LC sits at a fixed column. If a built-in gate is placed, its column is labeled as occupied, and that column of transistors will not be used to construct any other logic gates. The 72 programmable I/Os are also handled accordingly in the TimberWolf placer. The synthesized circuit netlist is applied to TimberWolf, and the output file contains the placement information.

For TRAP's chip-level virtualization mode, the virtual layers can be generated by high-level synthesis [39]. TimberWolf was augmented to simultaneously optimize the placements over the multiple virtual layers: it seeks to place outputs from one layer close to their fanouts on the subsequent layer, greatly reducing overall x-y wire length.

The open-source tool Versatile Place and Route (VPR) from the University of Toronto was used for routing TRAP designs. VPR is a placement and routing tool for array-based FPGAs and was written to allow circuits to be placed and routed on a wide variety of FPGAs, facilitating comparisons of different architectures [40]. It takes two input files: a netlist describing the circuit to be placed and routed, and a description of the FPGA architecture. Optionally, VPR accepts an existing placement as a third input file if one desires that the placement only be routed.

We customized VPR for the routing of TRAP designs. The TRAP architecture was carefully described in the VPR format, and the placement file generated by TimberWolf is used as an input file for VPR.

We developed a custom Python framework that translates the netlist, placement, and routing files to generate the bitstream for the TRAP. The core programming bit file and I/O programming bit file are both generated in this bitstream generation step.
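The record formats and bit addressing of this framework are internal to our flow, so the following is only a minimal sketch of the translation step. The helper functions lc_bit_address and switch_bit_address, the constants, and the simplified input structures are hypothetical stand-ins for the real ones.

```python
# Minimal sketch of the programming-bit generation step. The addressing
# helpers and record formats are hypothetical; only the overall shape of
# the netlist/placement/routing -> bitstream translation is illustrated.
from typing import Dict, List, Tuple

def lc_bit_address(row: int, col: int, offset: int) -> int:
    """Hypothetical flat address of a gate-configuration bit in an LC."""
    return (row * 64 + col) * 32 + offset

def switch_bit_address(track_h: int, track_v: int) -> int:
    """Hypothetical flat address of a pass-transistor switch bit."""
    return 200_000 + track_h * 256 + track_v

def generate_bitstream(placement: Dict[str, Tuple[str, int, int]],
                       routing: List[Tuple[int, int]],
                       gate_patterns: Dict[str, List[int]]) -> Dict[int, int]:
    """Merge gate bits (from placement) and switch bits (from routing)
    into one sparse address -> bit map for serialization."""
    bits: Dict[int, int] = {}
    for inst, (cell, row, col) in placement.items():
        # Stamp the cell's transistor-column pattern at its LC location.
        for offset, b in enumerate(gate_patterns[cell]):
            bits[lc_bit_address(row, col, offset)] = b
    for track_h, track_v in routing:
        # Close the pass-transistor switch at every used track crossing.
        bits[switch_bit_address(track_h, track_v)] = 1
    return bits
```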

The CAD tool flow for TRAP is shown in Fig. 6.1. The oval bubbles are the input files applied to and output files generated from the CAD tools. The rectangle boxes are the CAD tools being used.

Precisely as in the first step of an ASIC design, the target circuit in a Hardware Description Language (HDL) is synthesized with the liberty file for the TRAP cell library. We use Synopsys' Design Compiler for synthesis; note that any CAD tool capable of ASIC design synthesis is applicable for this step. The gate-level netlist is sent to TimberWolf for placement, and that placement result is sent to the customized VPR for routing. The gate-level netlist, placement file, and routing file are together sent to the programming bit generator, which yields the target bitstream for TRAP.

Figure 6.1. CAD tool flow for TRAP

As we use a commercial FPGA to program TRAP, TRAP's bitstream is embedded into the FPGA testbench. The testbench is a state machine that programs TRAP and then tests the target circuit, as modeled in the sketch below. The commercial FPGA synthesis tool generates the bitstream for the FPGA in the test setup. We will go into detail about the test setup in the next section.
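Conceptually, the testbench implements the following control flow, modeled here in Python only for clarity; the real testbench is HDL on the FPGA, and the trap object with its load_word and apply methods is an illustrative stand-in, not a name from the actual design.

```python
# Python model of the FPGA testbench's control flow only; the real
# testbench is HDL. The 'trap' interface is an illustrative stand-in.
def run_test(trap, bitstream, stimuli):
    state, results = "PROGRAM", []
    while state != "DONE":
        if state == "PROGRAM":
            for word in bitstream:      # clock the configuration into TRAP
                trap.load_word(word)
            state = "TEST"
        elif state == "TEST":
            for vector in stimuli:      # drive inputs, capture outputs
                results.append(trap.apply(vector))
            state = "DONE"
    return results
```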

6.2 Test Setups

Continuing from the last section, in which the CAD tool flow generated the bitstreams, we use commercial FPGAs to program and test our TRAP fabric. We built a test setup to realize the connection between the FPGA and TRAP.

6.2.1 Test Setup Version I

In the test setup, the FPGA connects to TRAP through a custom-designed interposer printed circuit board (PCB). As shown in Fig. 6.2, the first version of the test setup uses the Opal Kelly integration module based on a 1,500,000-gate Xilinx Spartan-3 FPGA (XEM3010-1500P) [41].

There are two black 80-pin connectors on the bottom side of the Opal Kelly development board, and the signals in the test setup communicate through these connectors between the FPGA and the PCB. The interposer PCB is designed with a socket that holds TRAP, has many test points, and facilitates communication with the FPGA board. The FPGA serves a dual role in this platform: first, it acts as a controller for programming the TRAP chip; second, it sends stimulus to the active configuration and captures its output for further analysis.


Figure 6.2. First test setup: FPGA-PCB-TRAP

As mentioned in Chapter 5, the supply voltage for TRAP is 1.2V. However, the pins on the development board are powered at 1.8V by default; this 1.8V is applied from a pin called VCCO. To change the I/O supply, four ferrite beads, each of which was used to keep VCCO at 1.8V, were removed from the development board. The development board has three supplies available, namely 1.2V, 1.8V, and 3.3V. A path is constructed on the PCB to apply 1.2V to the VCCO pin, and the TRAP chip is also powered from the development board's 1.2V supply. The jumpers in the middle connect the 1.2V supply to the pin connectors on the development board and to the VDD pads on TRAP. The supply voltage of the development board I/Os is therefore changed to 1.2V. Keeping the I/O supplies at the same level is essential for sending and receiving correct values between the two systems; otherwise, a level-shifter must be inserted.


6.2.2 Test Setup Version II

Figure 6.3. Second test setup: FPGA-PCB-TRAP

As shown in Fig. 6.3, in the second version of the test setup we replaced the Opal Kelly XEM3010-1500P development board with the Xilinx VC707 Evaluation Kit [42]. The FPGA part of the test setup was thus upgraded from the Xilinx Spartan-3 FPGA to the Xilinx Virtex-7 FPGA. The Spartan-3 uses ISE to generate its bitstream, while the Virtex-7 uses Vivado, and the formats of the constraint files differ, so a significant effort was required to migrate to the newer version. However, this upgrade makes it possible to run and test far more complex designs.

This new PCB enables TRAP to be powered either from an external power source or from the FPGA. In Fig. 6.3, the black jumper on the PCB is used to switch between the two sources, and the DC-DC adapter located close to the jumper converts the external power supply (5V) to 1.2V. The connector between the PCB and the VC707 board is called the FMC, and it is powered at 1.8V by default, so we needed to solve the same voltage difference problem as in the first test setup. This was done by removing the jumper on J51 (turning off FMC_VADJ) on the VC707 board. The TI USB interface adapter was then connected to the PMBUS socket to access the TI power controller [43], and the TI Fusion Digital Power Designer was used to set the FMC_VADJ rail to 1.2V. Before quitting the software, it was necessary to save the changes to flash so that the power supply settings remain at 1.2V across board restarts.

6.3 Test Results

To date, using the rudimentary CAD capabilities described in Section 6.1, we have successfully programmed, verified, and characterized the TRAP chip with a broad range of designs, proving the correctness of its fundamental concepts and functionality. The results are described in this section.


6.3.1 FPTA and Commercial FPGA Area Efficiency Comparison

The first version of our transistor array was called a Field Programmable Transistor-level Array (FPTA). The FPTA was laid out in a 6-thin-layer IBM130 process that is no longer supported for fabrication, but it was a meaningful starting point for further exploration of our design ideas. In order to determine the area utilization efficiency of our FPTA, we compared it with a commercial FPGA, the Altera Stratix EP1S10, which uses the same 130nm technology and has a core size of 23mm X 23mm [44].

Table 6.1. Area utilization compared to a commercial FPGA

Benchmark        Cell Count   FPTA Utilization   Altera Stratix Utilization
B04                     317        1.02%                2%
B05                     353        1.29%                2%
B12                     539        2.02%                4%
SPI                    1240        4.27%                8%
B14                    2123        8.69%               10%
Tv80                   3077       11.13%               19%
B15                    3461       12.18%               22%
B20                    4407       17.78%               20%
B21                    4635       19.07%               20%
B22                    6702       26.85%               29%
B17                   10942       37.75%               68%
AES_cipher             9422       38.91%               47%
AES_inv_cipher        13578       52.07%               72%
WB_conmax             16436       70.00%              148%
B18*                  25303       87.85%              140%

(∗) One instance of B15 is removed from B18 to reduce the size of the benchmark, ensuring that it fits within the available resources.


To make a fair comparison, we scaled up our FPTA to the same size, resulting in an array of 51 X 35 groups of 8 LCs, for a total of 14,280 LCs.

We then implemented various benchmarks from ITC'99 [45] and OpenCores [46] on both our FPTA and the Altera chip. A comparison of resource utilization is presented in Table 6.1 [33]; in this comparison, we count only the utilization of logic resources.

Despite the additional area overhead due to having three memory cells per programming bit, the density (or utilization) of the FPTA is quite competitive with the Altera chip. We attribute this to the fact that, for logic outside the custom cells (e.g., full adders, carry units, flip-flops, multiplexers) that both designs possess, the transistor utilization of the LCs in the FPTA is higher than the transistor utilization of the LUTs in the Altera design. Essentially, even a relatively simple logic function can take up an entire LUT, whereas in the FPTA only the precise number of columns needed to implement the gate is used. Thus, simple gates such as NAND2, NAND3, NOR2, NOR3, and up to three- or even four-input AOI and OAI gates are comparatively very area-efficient in the FPTA.

6.3.2 FO1 Delay

The fabricated TRAP design has a built-in module reserved for evaluating the differences between HSPICE-simulated delays and measurements of the actual TRAP chip. This module consists of a long ring oscillator followed by an 8-bit divider. The ring oscillator is formed by 51 inverting gates in a loop: the output of the last stage of a 50-inverter chain connects to a NOR2, the other input of the NOR2 is an enable signal, and the output of the NOR2 feeds the inverter chain.

The extracted netlist with R+C+CC is simulated in HSPICE. The simulated frequency of the ring oscillator is 517 MHz. An 8-bit asynchronous counter attached to the ring oscillator reduces the frequency to 2.02 MHz. From this, the inverter fan-out-of-1 (FO1) delay is 38.0 ps. This module is placed at the top right corner of the TRAP design. An input pad called ring_osc_en connects to one of the inputs of the NOR2, and an output pad called ring_osc_out connects to the output of the frequency divider.

Figure 6.4. FO1 delay measurement

As shown in Fig. 6.4, the test point connected to ring_osc_out was probed on the interposer PCB to measure the frequency of the built-in module, which was measured at 1.99 MHz. The frequency reading varies within ±2% over time. The oscilloscope measures the frequency from pulse to pulse; however, the pulse, as shown in Fig. 6.4, has considerable ringing. The ringing is a result of undesired inductance and capacitance on the signal propagation path: TRAP is held by a socket attached to the interposer PCB, and the package, socket, and PCB all contribute inductance. A ringing power supply also has an effect, which is why a low-dropout regulator (LDO) is commonly inserted into a DC power line and large capacitors are attached to the power line.

As shown in Table 6.2, the FO1 delay is calculated from the measured frequency of the built-in module (fout). The frequency of the ring oscillator is fosc = fout * 2^8 = 1.99 MHz * 256 = 509 MHz. The delay of the 51-stage inverting chain is Tosc = 1/fosc = 1.96 ns. The FO1 delay, which is the delay of each inverter in the chain, is Tinv = Tosc/51 = 38.5 ps. Compared with the FO1 delay from HSPICE simulation, the difference is 1.2%. Note that the extracted netlist used in simulation was based on the native 65nm design rules, while the fabrication applies a 15% shrink (to 55nm) at tape-out; the difference would be larger if the HSPICE simulation could take the shrink into account.

Table 6.2. FO1 delay measurement from HSPICE and fabric

Timing Item   HSPICE Measurement   Chip Measurement
fout          2.02 MHz             1.99 MHz
fosc          517 MHz              509 MHz
Tosc          1.94 ns              1.96 ns
Tinv          38.0 ps              38.5 ps
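For convenience, the chip-measurement column of Table 6.2 can be reproduced with a few lines of Python; this is simply the arithmetic described above expressed in code.

```python
# Reproduces the FO1 delay arithmetic for the chip measurement (Table 6.2).
DIVIDER_BITS = 8          # 8-bit asynchronous frequency divider
STAGES = 51               # inverting stages in the ring oscillator

f_out = 1.99e6                          # measured at the divider output (Hz)
f_osc = f_out * 2**DIVIDER_BITS         # 509 MHz
t_osc = 1 / f_osc                       # 1.96 ns, 51-stage chain delay
t_inv = t_osc / STAGES                  # 38.5 ps, FO1 delay per stage
print(f"f_osc = {f_osc/1e6:.0f} MHz, Tinv = {t_inv*1e12:.1f} ps")
```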


6.3.3 Single Cycle Switching Between Configurations

Figure 6.5. (a) Up counter, (b) Down counter

TRAP can switch between configurations within a single cycle and can be partially reconfigured. To verify these two features, we developed the simple test shown in Fig. 6.5. There are two configurations: the detailed implementations of an up counter and a down counter. The two configurations differ only in the red-dashed box: the up counter contains an XOR while the down counter contains an XNOR. Only one LC differs between the two designs, so only a small number of programming bits need to be updated; a quick behavioral model of this difference appears below.

The single-cycle reconfiguration capability of TRAP is demonstrated using these two 2-bit counters. Two bitstreams are generated for the down counter and the up counter and are loaded into the D layer and A layer of TRAP's local memory, respectively. Fig. 6.6 shows a screenshot of the resulting waveforms, obtained from an oscilloscope using the test setup described in Section 6.2. TRAP finished loading the two bitstreams at time t0, labeled with the first red dashed line; the programming_clk is off after bitstream loading has completed.

(∗) The COUNT annotations were manually inserted to help in reading the counter result

Figure 6.6. Single cycle switching between configurations

The cpd and cpa waveforms, shown in black, select between the two configurations.

The blue waveform, counter_clk, triggers the counter, and the counter output waveforms are labeled counter LSB and MSB. From time t0 to t1, between the first and second red dashed lines, the down counter is activated as the cpd pulse is provided. The counter DFFs are reset at the very beginning, so the counter counts down as 0-3-2-1 at the rising edges of the counter_clk pulses. At time t1, as soon as cpa goes high and cpd goes low, the bitstream corresponding to the up counter is activated, within a negligible delay, and the counter counts up from 0 to 3. Between the third and fourth red dashed lines (from time t2 to t3), TRAP is configured back to layer D, the down counter, which continues counting down from the previous state '1'. The DFFs in the counters retain the states for each layer during reconfiguration.

Thus, leaving a virtual layer pauses a computation, and that computation resumes when the virtual layer becomes active again. After the fourth dashed line (t3), TRAP stays in the layer A configuration: the up counter counts on rising clock edges, and when the counter_clk pulse is de-asserted, the counter retains its state. The waveforms confirm that the TRAP resources can be time-shared between two different bitstreams with a single-cycle toggle.

6.3.4 Selective Partial Dynamic Reconfiguration

The partial/selective dynamic reconfiguration capability is demonstrated using the same two 2-bit counter configurations. In this case, a virtual layer is initially configured as a down counter and, after a partial bitstream update, that layer is reconfigured as an up counter. The schematics of the two counters in Fig. 6.5 show that they differ only in the red-dashed part, so by selectively changing only the bits corresponding to the LC in the middle, its functionality is changed.

Fig. 6.7 contains a screenshot of the waveform results obtained from an oscilloscope. The counter LSB and MSB are in black. Before t0, a bitstream implementing a down counter has been loaded into TRAP. Initially, the down counter counts down in the sequence 0-3-2-1-0-3-2 soon after receiving the cpa pulse. The counter is stopped at the count '2' of its second counting cycle, labeled by the second red dashed line at time t1. Between t1 and t2, a small portion of the bitstream is updated, shown by the blue programming_clk waveform: TRAP is being partially reconfigured, converting it from a down counter into an up counter.


(∗) The COUNT annotations were manually inserted to help in reading the counter result

Figure 6.7. Selective partial reconfiguration

Note that the bitstream is on layer A and the partial reconfiguration is also applied to layer A; cpd is always low, which means layer D is not used in this case. At time t2, the up counter starts counting up from the same state (count '2') where the down counter had stopped, so the counting sequence is now 2-3-0-1-2-3-0. This partial/selective reconfiguration mode therefore also allows the retention and transfer of computational state between configurations, as illustrated in the waveforms of Fig. 6.7. Selective reconfiguration eliminates the need to reload the entire bitstream for a small design change; hence, the time required for reconfiguring TRAP is proportional to the number of changed bits in the bitstream, as the following sketch illustrates.


6.3.5 Implementation Examples

Many designs have been implemented on TRAP. Besides the basic gate implementations and the inverter chain implementation used to verify the correctness of the TRAP fabric, we also implemented several functional configurations. We take advantage of an interface tool called FrontPanel to apply inputs to, and receive outputs from, the implemented system for verification.

6.3.5.1 GCD Implementation Example

Figure 6.8. FrontPanel SDK

Using the first version of the test setup, we implemented part of an 8-bit Greatest Common Divisor (GCD) circuit on TRAP and the remaining parts on the FPGA. Only a small portion of the GCD design is implemented on TRAP; this portion could be a sensitive netlist that needs to be obfuscated. The implementation is a simple setup to prove that TRAP can be used for design obfuscation,


which will be discussed in Section 7.3. TRAP works as an eFPGA in this application. The Opal Kelly XEM3010 serves two roles in this platform: first, it acts as a controller for programming the TRAP chip; second, it implements the remaining parts of the GCD design.

Figure 6.9. FrontPanel GUI

The Opal Kelly XEM3010 development kit connects to a host PC through USB with the help of the FrontPanel firmware. As shown in Fig. 6.8 [47], the development kit has built-in FrontPanel modules (blue-colored blocks). Many endpoints on the FPGA are reserved for inputs and outputs, which can be controlled directly from the FrontPanel GUI; the green-colored memory module stores these endpoints. The HDL adds an instantiation of the 'okHost' module, and scripting the .xfp file customizes the GUI with the needed user interface, shown as the blue 'FrontPanel API' block. The stand-alone FrontPanel application enables one to quickly and easily define a graphical user interface that communicates with TRAP, and the FrontPanel SDK dramatically accelerated the development of this FPGA-based TRAP test procedure.

The test setup connects to a computer through USB. The system is then powered up and the FrontPanel software is opened. As shown in Fig. 6.9, the left side displays the information of the connected development kit. The first button is clicked and the .xfp file is executed; the GUI defined by the .xfp file pops up after the FrontPanel profile loads successfully, as shown in Fig. 6.10. The second button loads the programming bitstream to the FPGA. When the FPGA has finished programming, any two numbers from 0 to 255 are typed in as inputs. The 'Compute' button is pushed, and the result is shown in the window below the GCD label. As an example, Fig. 6.10 shows that the GCD of 255 and 17 is 17.

Figure 6.10. GCD calculator interface
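For reference, driving the same test from a Python script rather than the FrontPanel GUI would look roughly as follows. The FrontPanel API calls (okCFrontPanel, SetWireInValue, and so on) are the standard Opal Kelly ones, but the endpoint addresses, the trigger, and the bitfile name are assumptions for illustration; the real assignments live in the testbench HDL and the .xfp script.

```python
# Rough sketch of scripting the GCD test with the Opal Kelly FrontPanel
# Python API. Endpoint addresses (0x00, 0x01, 0x20, 0x40) and the bitfile
# name are assumptions, not values from the actual testbench.
import ok  # Opal Kelly FrontPanel bindings

dev = ok.okCFrontPanel()
dev.OpenBySerial("")                       # open the first attached device
dev.ConfigureFPGA("gcd_testbench.bit")     # hypothetical bitfile name

dev.SetWireInValue(0x00, 255)              # 'Number 1' input
dev.SetWireInValue(0x01, 17)               # 'Number 2' input
dev.UpdateWireIns()
dev.ActivateTriggerIn(0x40, 0)             # hypothetical 'Compute' trigger

dev.UpdateWireOuts()
print("GCD =", dev.GetWireOutValue(0x20))  # expect 17 for inputs 255 and 17
```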

6.3.5.2 Multiplier Implementation Example

Following the same steps described in the last section, a 2-bit multiplier and a 3-bit multiplier were implemented on TRAP. The 3-bit multiplier implementation helped us notice a flaw in the current routing algorithm, which is discussed in detail in Section 7.2.


We debugged the routing paths of a malfunctioning configuration and found the problematic route. Eventually, we developed 5 different functioning multiplier configurations, each with a different route after manual modification. The testing GUI is shown in Fig. 6.11(a). Two decimal inputs (0 to 7) are typed into the blanks labeled 'Number 1' and 'Number 2', and the bottom blocks show the decimal and binary calculation results. The 5 versions of the multiplier configurations are listed in Fig. 6.11(b).

Figure 6.11. (a) GUI for the multiplier implementation, (b) A list of multiplier configurations


6.3.5.3 Branch Jumper Implementation Example

The branch jumper is synthesized as a MUX, where 'imjbrupperi' is the select input: when the select is low it chooses 'opbnfi', and when the select is high it chooses 'obfi'. The testing GUI is shown in Fig. 6.12(a). Three binary inputs are typed into the blanks labeled 'opbnfi', 'opbfi', and 'immjbrupperi9', and the bottom block shows the one-bit branch selection result. The truth table of the branch jumper is listed in Fig. 6.12(b).

Figure 6.12. (a) GUI for the branch jumper implementation, (b) Truth table for the branch jumper


6.3.5.4 A Decoder Implementation Example

The implemented decoder has a 6-bit input and a 1-bit output. The testing GUI is shown in Fig. 6.13(a). A binary input is typed into the blank labeled 'opcin', and the bottom block labeled 'decodeoplsu' shows the one-bit decoder output; the 'n7' block is reserved for debugging. The truth table of the decoder is listed in Fig. 6.13(b). Note that only the four listed vectors trigger a high output; the output is low for all other vectors.

Figure 6.13. (a) GUI for the decoder implementation, (b) Truth table for the decoder


6.3.6 TRAP Area Efficiency Evaluation

Moreover, we have successfully implemented hybrid designs in which a portion of the circuit is programmed on the TRAP chip while the rest is programmed on the FPGA (emulating an ASIC).

Among them, we highlight the implementation of FabScalar [48] [49] on our platform. FabScalar is an open-source, superscalar microprocessor whose functionality has been verified through multiple fabricated ICs. The design implements an out-of-order, superscalar core and uses the Portable ISA (PISA) as its instruction set architecture. Three control-oriented modules, namely the Architectural Map Table (AMT), the Result Shift Register (RSR), and the Branch Predictor (BP), were implemented on the TRAP chip using our rudimentary CAD flow, while the rest of the design was mapped onto the FPGA using the standard Xilinx CAD tool flow.

Figure 6.14. Overhead of TRAP & FPGA over ASIC

The comparative advantage of TRAP over an FPGA is illustrated in Fig. 6.14. We report the overhead (area, performance, and power) of a TRAP-based implementation of the three FabScalar control-oriented modules (AMT+RSR+BP), normalized to an ASIC implementation, as well as the overhead of an FPGA-based implementation normalized to the same ASIC implementation.

Specifically, we compare our 65nm TRAP fabric against a prominent commercial FPGA in the same technology node, namely the Xilinx Virtex 5. Area and latency data were collected from Xilinx's post-synthesis design report, while power results were acquired from the post place-and-route simulation model (XPower Analyzer). We note that the Virtex 5 FPGA contains large block RAMs (BRAMs) to facilitate specific applications (e.g., DSP); these occupy more than 50% of the FPGA area and consume static power even when unused [50]. To make the overhead comparison fair, we disabled BRAM usage during synthesis, subtracted the total BRAM area from the die size, and adjusted the power consumption accordingly. Additionally, to be conservative and keep the playing field level, we contrast the FPGA only against the single-layer (TRAP-1L) version of our fabric.

As shown in Fig. 6.14, in comparison to TRAP-1L, the Virtex 5 leaves a 70X larger footprint, results in 2X slower performance, and consumes 34X more power when implementing AMT+RSR+BP. These results corroborate our conjecture that TRAP can be far more cost-effective in the realization of hybrid programmable/ASIC designs than conventional LUT-based FPGAs and eFPGAs.


CHAPTER 7

FUTURE WORK

7.1 Design Improvements for TRAP

7.1.1 One-layer TRAP

The current version of TRAP has four virtual layers, which enable TRAP to implement board-level virtualization and chip-level virtualization. Going forward, one of the key applications for TRAP appears to be design obfuscation. In this regime, area overhead is critical, and sufficient obfuscation is already achieved with a single-layer TRAP design. We therefore plan to design and fabricate a one-layer version of TRAP, likely using a fabrication process similar to that of the current TRAP chip. The design obfuscation application also requires neither the word-based high-throughput pipelining scheme for entering programming bits nor rapid partial reconfiguration.

7.1.1.1 Changes

There are two fundamental changes we anticipate for the new TRAP design. First, the functionality of the built-in DFF will be combined with that of the scan DFF. The current TRAP has a complex DFF structure to support virtualization, as shown in Fig. 3.7; the one-layer design greatly simplifies this structure and reduces the controls, as shown in Fig. 7.1. The data input and scan input are selected by the global control signal scan. When the module works as a built-in DFF, D_in is captured by the DFF and Q_out updates (labeled in blue). Q_out passes to the output of the first column of the LC, and a tri-state guarantees that the DFF drives the output line only when a built-in DFF is in use. The scan chain path is labeled in pink: when scan = 1, 'scan_in' connects directly to the buffered output 'scan_out' of the previous stage. In this case, all the DFFs are connected in a chain, and all the states can be shifted out for verification. The gated clock at the bottom left of Fig. 7.1 disables the clock when the module is not in use (in neither built-in DFF mode nor scan DFF mode).

Figure 7.1. New built-in DFF

Secondly, the current TRAP design has hierarchical memory buffers to load the bitstream, as shown in Fig. 3.17. The new one-layer design will remove the levels of memory buffers and replace the current 4-latch memory cells with a shift register. Compared to the 4-latch memory cell of Fig. 3.14, the DFF-like shift-register cell is completely different. The schematic of a shift-register cell is shown in Fig. 7.2: its structure is a DFF with an extra output branch. The output branch, a tri-state inverter, is controlled by a global signal copy. Asserting copy turns on the output branches of all the shift-register cells, so all the programming bits are applied simultaneously.

Figure 7.2. A shift-register cell

The new scheme for loading the bitstream uses one long shift register in which the DFF-like cells are connected head to tail. To change the configuration, a full bitstream is loaded into the system. An advantage of the shift-register scheme is that all the programming bits can also be scanned out. A behavioral model of this loading scheme follows.
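The model below is our sketch, not the circuit netlist: bits are shifted through the chain on shift_clk edges, and a single global copy pulse then applies every programming bit at once.

```python
# Behavioral model of the proposed shift-register configuration chain.
class ShiftChain:
    def __init__(self, length):
        self.cells = [0] * length          # DFF-like shift-register cells
        self.programming_bits = [0] * length

    def shift(self, bit_in):
        """One shift_clk edge: a new bit enters, everything moves one cell."""
        self.cells = [bit_in] + self.cells[:-1]

    def copy(self):
        """The global 'copy' pulse applies all bits simultaneously."""
        self.programming_bits = list(self.cells)

chain = ShiftChain(8)
for b in [1, 0, 1, 1, 0, 0, 1, 0]:         # shift a full toy bitstream in
    chain.shift(b)
chain.copy()                                # configuration takes effect here
print(chain.programming_bits)               # [0, 1, 0, 0, 1, 1, 0, 1]
```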

7.1.1.2 Improvements

With the experience gained from designing and taping out the current version of TRAP, our suggestions for improving the design are summarized as follows.

1) Add a buffer at every LC output for better drive on the output tracks. The sizes of the transistors in the array can then be reduced significantly, as an implemented gate drives the buffer instead of the output track.

2) Provide both polarities for all transistor inputs. The current design has both polarities available only for built-in gates. It is a straightforward extension to allow all transistor inputs to receive either polarity of an incoming logic signal; likewise, the output of an implemented gate can be of either polarity. With this extension, an inverter never needs to be implemented in the transistor array.

3) The internal power and ground buses on metal1 in the cells should be wider than the minimum to provide lower resistance and better electromigration immunity.

4) A power grid should use generous amounts of the top two metal layers running in orthogonal directions; most of the current-handling capability is provided in these upper two layers, which have the lowest resistance. The grid must extend down to metal1 or metal2 to provide easy connection to cells. The power grid should use many narrow wires interdigitated with the signals, rather than a few wide wires, to avoid large current return loops. The grid should also avoid slots and other discontinuities that might lead to large current loops and high inductance.

5) In the current design, decoupling capacitors are placed only between the core and the pad ring. The new design should place decoupling capacitors throughout the core of the chip. Chips need a substantial amount of capacitance between power and ground to supply their instantaneous current demands. Distributing the bypass capacitance across the chip allows a local spike in current to be supplied from nearby bypass capacitance rather than through the resistance of the overall power grid; it also greatly reduces the di/dt drawn from the package.

6) Add a logo next to the image bevel, which helps the packaging vendor determine the chip orientation.

7.1.2 TRAP as an eFPGA

An eFPGA has significant advantages in applications with frequently changing architectures. Nowadays, eFPGA architectures are used, for example, in the quickly developing 5G technology and in Artificial Intelligence (AI) accelerators for machine learning.

FPGAs can face a significant bottleneck in the number of available I/Os. For a fabricated FPGA die, the periphery limits the space for I/Os. The I/O pads must meet size requirements to guarantee accurate data transmission between the core and off-chip circuitry, as well as accommodate the large area needed for the actual bond pads. As the size of I/O pads is relatively large, the number of programmable I/Os that can be put on an FPGA die is far smaller than the number of interconnection tracks. For an embedded FPGA (eFPGA), this limit disappears, as there are no physical I/O pads at the boundary of the eFPGA; the eFPGA also saves the power consumed by the programmable I/Os.

The current TRAP fabric, as described in Chapter 6, shows that TRAP can be far more cost-effective in the realization of hybrid programmable/ASIC designs than conventional LUT-based FPGAs and eFPGAs. We therefore plan to embed TRAP into SoCs going forward.

Another change we want to try is to add a few more frequently used gates as built-in gates. Based on the 4-column logic cell module, we propose 12 built-in gates, each producing an output at one of 2 columns. Each column in some sense has 3 built-in gates, but their outputs can go to that column or the next column (with column 4 also going to column 1). The built-in gates are: ND2, ND3, NR2, NR3, AOI21, AOI22, OAI21, OAI22, DFF, MAJI (carry), XOR, and MUX. With this setup, we can map designs mostly onto built-in gates efficiently, and the routing will be much less complex since there will be only one input (no separate pMOS and nMOS inputs) for each built-in gate input.

A more aggressive idea is to replace the current transistor-level logic cell module with pure standard cells: instead of a sea of transistors, the programmable logic would contain a sea of standard cells. This leads to a new direction for optimizing the area efficiency of a programmable architecture.

7.2 Routing Algorithm Improvement

During the extensive testing process, we noticed a faulty result on a specific routing path, shown in Fig. 7.3. In the figure, the black square is an open switch and the black circles are pass transistors. The used horizontal tracks are in red, and the used vertical tracks are in blue. The output in purple is generated by the left unit and has four fanouts labeled 1, 2, 3, and 4. Signals 1, 2, and 4 all change, following the output; however, net 3 is stuck-at-1.

The routing path generated by VPR, as shown in Fig. 7.3, is not optimized. We can easily construct a better solution manually for the exact same output and fanouts, as shown in Fig. 7.4: the metal tracks used by the routing path are greatly reduced, and the simulation result of the optimized routing path is correct.

From this faulty routing path example, we identified a problem with the routing tool. VPR is designed for a general FPGA architecture, in which each crossing of interconnects has a switchbox, as shown in Fig. 3.13. The routing tracks form a grid network, every switchbox is a node, and the cost of a routing path is calculated as the number of steps from a source node to a target node.


Figure 7.3. A routing path with faulty result

Figure 7.4. An optimized routing path


However, the interconnect architecture of TRAP is substantially different, since it has only one pass-transistor switch at each crossing, which controls whether or not the two crossing metal tracks are connected. When two metal tracks are connected, the whole length of both track segments is in use.

Also, it is necessary to keep track of the aggregate routing length (or capacitance) attached to a driver gate or repeater and ensure that it never exceeds a set limit; the router must find and insert a repeater whenever the limit would otherwise be exceeded.

Existing routing algorithms are mostly node-based, e.g., Dijkstra's algorithm, Lee's algorithm, and the A* algorithm. In most routing tools, the Manhattan distance from a start point to a target is the cost being optimized. This is not sufficient for TRAP, because the interconnect tracks are segmented only at unit boundaries, not at every switch. The Manhattan distance is the absolute distance of a routing path, but the resources actually occupied exceed the edges that the Manhattan distance covers: whenever any portion of a track segment is used, the whole segment becomes unavailable. A sketch of a segment-aware cost model follows.
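The sketch below is our illustration, not the production router: a Dijkstra-style search in which the nodes are whole track segments, an edge is a pass-transistor switch between two crossing segments, and the cost counts consumed segments rather than Manhattan steps.

```python
# Sketch of a segment-aware routing cost model for TRAP (illustrative only).
# Nodes are whole track segments; an edge is a pass-transistor switch
# between two crossing segments. Using any part of a segment consumes it.
import heapq

def route(switches, source_seg, target_seg, used):
    """Dijkstra over track segments. 'switches' maps a segment to the
    segments it can connect to; 'used' holds segments already consumed."""
    dist = {source_seg: 1}               # cost = number of segments claimed
    heap = [(1, source_seg, [source_seg])]
    while heap:
        cost, seg, path = heapq.heappop(heap)
        if seg == target_seg:
            return path                  # every segment in path is consumed
        for nxt in switches.get(seg, []):
            if nxt in used or nxt in dist and dist[nxt] <= cost + 1:
                continue
            dist[nxt] = cost + 1
            heapq.heappush(heap, (cost + 1, nxt, path + [nxt]))
    return None

switches = {"h0": ["v0", "v1"], "v0": ["h1"], "v1": ["h1"], "h1": []}
print(route(switches, "h0", "h1", used={"v0"}))  # ['h0', 'v1', 'h1']
```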

The transient delay in an ASIC depends on the driver's size and the load on the routing path. In a reconfigurable architecture, the interconnect delay dominates the transient delay. A routing path in TRAP is very short compared to the interconnects in an FPGA; moreover, the switches on the interconnect contribute a large share of the delay.

We substantially modified the source code of VPR to serve as a rudimentary routing tool for our architecture, and we are developing and optimizing a modified algorithm that properly accounts for the TRAP interconnect architecture.


7.3 Design Obfuscation

Finally, we discuss one potential application of the TRAP fabric: structurally obfuscating a sensitive design to deter reverse engineering. TRAP has unique advantages in design obfuscation. Compared to an FPGA implementation, TRAP-based obfuscation offers superior resistance against both brute-force and oracle-guided SAT attacks, while incurring an order of magnitude less area, power, and delay overhead [51].

Figure 7.5. Sources of uncertainty in TRAP-based obfuscation

Hardware security is of ever-increasing concern, and the TRAP architecture makes inferring an obfuscated design extremely challenging. As shown in Fig. 7.5 [51], TRAP-based obfuscation introduces the following sources of uncertainty: gate obfuscation, interconnect obfuscation, state obfuscation, and virtual layering. These multiple levels of uncertainty make TRAP-based obfuscation a better solution for hardware security than a commercial eFPGA.


TRAP-based obfuscation is realized with a TRAP + ASIC tool flow. The portion of the design that contains a highly sensitive structure is replaced by the TRAP layout, so TRAP works as an eFPGA in the ASIC design. When the combined GDSII is sent to an untrusted fab, there is no way to reverse-engineer the structure hidden in the TRAP portion, as neither the logic gates nor the placement and routing information is revealed. Only the trusted customer knows how the TRAP portion is configured, as shown in Fig. 7.6.

Figure 7.6. TRAP-based design obfuscation

Several directions for improving the TRAP design have been discussed in this chapter. The TRAP fabric research established a foundation for exploring the possibilities of this novel architecture. The TRAP design will keep improving as long as creative ideas are brought into it, and more and more applications will be served by current and future versions of TRAP.


REFERENCES

[1] I. Kuon, R. Tessier, and J. Rose, “FPGA Architecture: Survey and Challenges,” Found. Trends® Electron. Des. Autom., vol. 2, no. 2, pp. 135–253, Apr. 2008.

[2] Actel, “Press Release: Actel Achieves Key Milestone with its Cost-Effective, Flash-Based FPGAs; Company Ships More Than 1 Million Units,” 29-Aug-2008. [Online]. Available: https://web.archive.org/web/20080829235012/http://www.actel.com/company/press/2004/3/29/. [Accessed: 06-Oct-2019].

[3] Microsemi, “Antifuse FPGAs | Microsemi.” [Online]. Available: https://www.microsemi.com/product-directory/fpga-soc/1641-antifuse-fpgas. [Accessed: 06-Oct-2019].

[4] Lattice Semiconductor, “Products - Lattice Semiconductor.” [Online]. Available: https://www.latticesemi.com/Products.aspx#_D5A173024E414501B36997F26E842A31. [Accessed: 06-Oct-2019].

[5] QuickLogic, “FPGAs - SRAM,” 02-Aug-2018. [Online]. Available: https://www.quicklogic.com/products/fpga/fpgas-sram/. [Accessed: 06-Oct-2019].

[6] QuickLogic, “FPGAs - Antifuse,” 02-Aug-2018. [Online]. Available: https://www.quicklogic.com/products/fpga/fpgas-antifuse/. [Accessed: 06-Oct-2019].

[7] Achronix, “SpeedcoreTM Embedded FPGA (eFPGA) – Achronix Semiconductor Corp.” [Online]. Available: https://www.achronix.com/product/speedcore/. [Accessed: 06-Oct-2019].

[8] Xilinx, “7 Series FPGAs Configurable Logic Block User Guide (UG474),” p. 74, 2016. [Online]. Available: https://www.xilinx.com/support/documentation/user_guides/ug474_7Series_CLB.pdf. [Accessed: 06-Oct-2019].

[9] Altera, “wp-01003.pdf,” July 2006. [Online]. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/wp/wp-01003.pdf. [Accessed: 14-Oct-2019].

[10] Tabula, "Going Beyond the FPGA with Spacetime," 31 August 2012. [Online]. Available: http://www.fpl2012.org/Presentations/Keynote_Steve_Teig.pdf. [Accessed: 14-Oct-2019].

[11] A. Rohe and S. Teig, “Concurrent optimization of physical design and operational cycle assignment,” US7496879B2, 24-Feb-2009.


[12] “Here’s 30 Bay Area startups pegged as ‘Next Big Thing,’” Silicon Valley Business Journal. [Online]. Available: https://www.bizjournals.com/sanjose/blog/2012/09/bay-area-startups-top-next-big-thing.html. [Accessed: 06-Oct-2019].

[13] S. Trimberger, D. Carberry, A. Johnson, and J. Wong, “A time-multiplexed FPGA,” in Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (Cat. No.97TB100186), 1997, pp. 22–28.

[14] Xilinx, “PlanAhead Design and Analysis Tool.” [Online]. Available: https://www.xilinx.com/products/design-tools/planahead.html. [Accessed: 06-Oct-2019].

[15] Intel, “Intel® Quartus® Prime Software Features Partial Reconfiguration.” [Online]. Available: https://www.intel.com/content/www/us/en/programmable/products/design-software/fpga-design/quartus-prime/features/partial-reconfiguration.html. [Accessed: 06-Oct-2019].

[16] P. Layzell, “A new research tool for intrinsic hardware evolution,” in Evolvable Systems: From Biology to Hardware, 1998, pp. 47–56.

[17] D. Keymeulen, R. S. Zebulum, Y. Jin, and A. Stoica, “Fault-tolerant evolvable hardware using field-programmable transistor arrays,” IEEE Trans. Reliab., vol. 49, no. 3, pp. 305–316, Sep. 2000.

[18] J. Langeheine, J. Becker, S. Fölling, K. Meier, and J. Schemmel, “Initial Studies of a New VLSI Field Programmable Transistor Array,” in Evolvable Systems: From Biology to Hardware, 2001, pp. 62–73.

[19] A. Stoica, D. Keymeulen, R. Tawel, C. Salazar-Lazaro, and Wei-Te Li, “Evolutionary experiments with a fine-grained reconfigurable architecture for analog and digital CMOS circuits,” in Proceedings of the First NASA/DoD Workshop on Evolvable Hardware, 1999, pp. 76–84.

[20] J. Rose, R. J. Francis, D. Lewis, and P. Chow, “Architecture of field-programmable gate arrays: the effect of logic block functionality on area efficiency,” IEEE J. Solid-State Circuits, vol. 25, no. 5, pp. 1217–1225, Oct. 1990.

[21] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep-submicron FPGA performance and density,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 12, no. 3, pp. 288–298, Mar. 2004.

[22] T. Lin, W. Zhang, and N. K. Jha, “A Fine-Grain Dynamically Reconfigurable Architecture Aimed at Reducing the FPGA-ASIC Gaps,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 22, no. 12, pp. 2607–2620, Dec. 2014.


[23] F. Yuan, C. C. Wang, T. Yu, and D. Marković, “A Multi-Granularity FPGA With Hierarchical Interconnects for Efficient and Flexible Mobile Computing,” IEEE J. Solid-State Circuits, vol. 50, no. 1, pp. 137–149, Jan. 2015.

[24] “High Performance Computer Architectures: A Historical Perspective.” [Online]. Available: http://homepages.inf.ed.ac.uk/cgi/rni/comp-arch.pl?Networks/benes.html,Networks/benes-f.html,Networks/menu-dyn.html. [Accessed: 06-Oct-2019].

[25] Flex Logix, “Flex Logix™ is the leading provider of embedded FPGA hard IP and software.” [Online]. Available: https://flex-logix.com/. [Accessed: 14-Oct-2019].

[26] C. W. Yu, J. Lamoureux, S. J. E. Wilton, P. H. W. Leong, and W. Luk, “The Coarse-Grained/Fine-Grained Logic Interface in FPGAs with Embedded Floating-Point Arithmetic Units,” International Journal of Reconfigurable Computing, 2008. [Online]. Available: https://www.hindawi.com/journals/ijrc/2008/736203/. [Accessed: 06-Oct-2019].

[27] R. W. Hartenstein and R. Kress, “A datapath synthesis system for the reconfigurable datapath architecture,” in Proceedings of ASP-DAC’95/CHDL’95/VLSI’95 with EDA Technofair, 1995, pp. 479–484.

[28] H. Singh, Ming-Hau Lee, Guangming Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho, “MorphoSys: an integrated reconfigurable system for data-parallel and computation-intensive applications,” IEEE Trans. Comput., vol. 49, no. 5, pp. 465–481, May 2000.

[29] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “ADRES: An Architecture with Tightly Coupled VLIW Processor and Coarse-Grained Reconfigurable Matrix,” in Field Programmable Logic and Application, 2003, pp. 61–70.

[30] Y. Kim, M. Kiemb, C. Park, J. Jung, and K. Choi, “Resource Sharing and Pipelining in Coarse-Grained Reconfigurable Architecture for Domain-Specific Optimization,” in Proceedings of the Conference on Design, Automation and Test in Europe - Volume 1, Washington, DC, USA, 2005, pp. 12–17.

[31] B. Erbagci, M. Bhargava, R. Dondero, and K. Mai, “Deeply hardware-entangled reconfigurable logic and interconnect,” in 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig), 2015, pp. 1–8.

[32] M. Rahman, R. Afonso, H. Tennakoon, and C. Sechen, “Power reduction via separate synthesis and physical libraries,” in 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC), 2011, pp. 627–632.

[33] J. Tian, G. R. Reddy, J. Wang, W. Swartz, Y. Makris, and C. Sechen, “A field programmable transistor array featuring single-cycle partial/full dynamic reconfiguration,” in Design, Automation Test in Europe Conference Exhibition (DATE), 2017, 2017, pp. 1336–1341.


[34] G. N. Hoyer, G. Yee, and C. Sechen, “Locally clocked pipelines and dynamic logic,” IEEE Trans. Very Large Scale Integr. VLSI Syst., vol. 10, no. 1, pp. 58–62, Feb. 2002.

[35] N. H. E. Weste and D. M. Harris, CMOS VLSI design: a circuits and systems perspective, 4th ed. Boston: Addison Wesley, 2011.

[36] O. Semenov, H. Sarbishaei, and M. Sachdev, ESD Protection Device and Circuit Design for Advanced CMOS Technologies. Springer Netherlands, 2008.

[37] P. J. Restle and A. Deutsch, “Designing the best clock distribution network,” in 1998 Symposium on VLSI Circuits. Digest of Technical Papers (Cat. No.98CH36215), 1998, pp. 2–5.

[38] TimberWolf System, “Company Overview.” [Online]. Available: http://www.twolf.com/overview.html. [Accessed: 06-Oct-2019].

[39] B. Hu et al., “Functional Obfuscation of Hardware Accelerators Through Selective Partial Design Extraction Onto an Embedded FPGA,” in Proceedings of the 2019 on Great Lakes Symposium on VLSI, New York, NY, USA, 2019, pp. 171–176.

[40] “VPR and T-VPack: Versatile Packing, Placement and Routing for FPGAs.” [Online]. Available: http://www.eecg.toronto.edu/~vaughn/vpr/vpr.html. [Accessed: 06-Oct-2019].

[41] Opalkelly.com, “XEM3010,” Opal Kelly. [Online]. Available: https://opalkelly.com/products/xem3010/. [Accessed: 07-Oct-2019].

[42] Xilinx, “Xilinx Virtex-7 FPGA VC707 Evaluation Kit.” [Online]. Available: https://www.xilinx.com/products/boards-and-kits/ek-v7-vc707-g.html. [Accessed: 07-Oct- 2019].

[43] Xilinx, “VC707 Power Bus Reprogramming,” p. 15, 2015. [Online]. Available: https://www.xilinx.com/Attachment/VC707_Power_Controllers_Reprogramming_Steps.pdf. [Accessed: 07-Oct-2019].

[44] Intel, “stratix_handbook.pdf,” 2006. [Online]. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stx/stratix_handbook.pdf. [Accessed: 14-Oct-2019].

[45] F. Corno, M. S. Reorda, and G. Squillero, “RT-level ITC’99 benchmarks and first ATPG results,” IEEE Des. Test Comput., vol. 17, no. 3, pp. 44–53, Jul. 2000.

[46] “Home: OpenCores.” [Online]. Available: https://opencores.org/. [Accessed: 06-Oct-2019].

[47] Opalkelly.com, “FrontPanel®,” Opal Kelly. [Online]. Available: https://opalkelly.com/products/frontpanel/. [Accessed: 07-Oct-2019].


[48] N. K. Choudhary et al., “FabScalar: Composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template,” in 2011 38th Annual International Symposium on Computer Architecture (ISCA), 2011, pp. 11–22.

[49] N. Choudhary et al., “FabScalar: Automating Superscalar Core Design,” IEEE Micro, vol. 32, no. 3, pp. 48–59, May 2012.

[50] M. Lin, A. El Gamal, Y.-C. Lu, and S. Wong, “Performance Benefits of Monolithically Stacked 3-D FPGA,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 2, pp. 216–229, Feb. 2007.

[51] M. M. Shihab et al., “Design Obfuscation through Selective Post-Fabrication Transistor-Level Programming,” in 2019 Design, Automation Test in Europe Conference Exhibition (DATE), 2019, pp. 528–533.


BIOGRAPHICAL SKETCH

Jingxiang Tian was born in Shijiazhuang, China. She enrolled in the Nanjing University of Aeronautics and Astronautics, China, in 2008. After receiving her bachelor's degree in 2012, Jingxiang joined The University of Texas at Dallas for her master's degree; shortly after, in the summer of 2013, she enrolled in the PhD program. She received a prestigious merit-based scholarship, the Ericsson Graduate Fellowship, for the 2014-2015 academic year. Her paper “A field programmable transistor array featuring single-cycle partial/full dynamic reconfiguration” was nominated for the best paper award at the Design, Automation, and Test in Europe Conference (DATE) 2017. She is currently a PhD candidate working in the Nanometer Design Laboratory (NDL) at The University of Texas at Dallas.


CURRICULUM VITAE

Jingxiang (Amelie) Tian
October 22, 2019

CONTACT INFORMATION
Department of Electrical and Computer Engineering
The University of Texas at Dallas
800 W. Campbell Rd.
Richardson, TX, 75080, U.S.A.
Email: [email protected]

EDUCATIONAL HISTORY
B.S., Electrical Engineering, Nanjing University of Aeronautics and Astronautics, 2012
Ph.D. Dissertation: Transistor-level Programmable Fabric
Electrical and Computer Engineering, The University of Texas at Dallas
Advisor: Dr. Carl Sechen

EMPLOYMENT HISTORY
Research Assistant/Teaching Assistant, The University of Texas at Dallas, May 2013-present

PROFESSIONAL RECOGNITION AND HONORS
Reviewer, Journal of Circuits, Systems, and Computers (JCSC), 2017
Ericsson Graduate Fellowship, The University of Texas at Dallas, 2014-2015
Merit Student and Scholarship, Nanjing University of Aeronautics and Astronautics (NUAA), 2008-2011
University Top Hundred Students, Nanjing University of Aeronautics and Astronautics (NUAA), 2012

PROFESSIONAL MEMBERSHIPS
Institute of Electrical and Electronics Engineers (IEEE), 2019