

The Pennsylvania State University

The Graduate School

College of Engineering

TUNNEL FET BASED FIELD PROGRAMMABLE GATE ARRAYS

A Thesis in

Computer Science and Engineering

Ravindhiran Mukundrajan

© 2011 Ravindhiran Mukundrajan

Submitted in Partial Fulﬁllment

of the Requirements

for the Degree of

Master of Science

December 2011

The thesis of Ravindhiran Mukundrajan was reviewed and approved∗ by the following:

Vijaykrishnan Narayanan
Professor of Computer Science and Engineering
Thesis Advisor

Mary Jane Irwin
Evan Pugh Professor of Computer Science and Engineering
A. Robert Noll Chair in Engineering

Mahmut Taylan Kandemir
Graduate Officer of the Department of Computer Science and Engineering

∗Signatures are on file in the Graduate School.

Abstract

The proliferation of mobile computing systems has created a new segment in the semiconductor ecosystem where energy efficiency is the most critical design parameter. Sustaining the growth trajectory of this segment is difficult because of the lengthy design turnaround times associated with custom design. These difficulties are further exacerbated by the consumer expectation of rapid improvements in functionality within the same energy budget. To cope with these twin challenges, it is critical to explore energy-efficient emerging technologies that can outperform CMOS and to construct design frameworks that significantly reduce the design turnaround time. Commercially available CMOS-based FPGAs provide a flexible platform for rapid prototyping and implementation; however, they are too energy inefficient for use in mobile systems. In this thesis, the design of an energy-efficient FPGA based on Tunnel FETs (TFETs), a prospective CMOS replacement device, is presented. Novel circuit designs are showcased to overcome idiosyncrasies unique to TFETs that prevent them from being direct replacements for MOSFETs in FPGAs. The impact of TFET usage at the system level is characterized by simulating a FPGA architecture that demonstrates a significant reduction (≈ 1x) in critical path delay at reduced operating voltages, compared to traditional FinFET based FPGAs at the 22nm node.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1 Introduction
  1.1 Scope and Organization of Thesis

Chapter 2 Circuit Design Challenges for Tunneling FETs
  2.1 The Tunnel FET Device
    2.1.1 Characteristics of TFETs
  2.2 Circuit Design using TFETs
    2.2.1 Static and Dynamic Logic Circuits
    2.2.2 Pass Transistor Logic for TFETs
      2.2.2.1 Sense Amplifier Pass Transistor Logic
      2.2.2.2 Dynamic Discharge Design
      2.2.2.3 Dynamic Pre-charge Design
      2.2.2.4 Bi-directional Switch based Pass Transistor Logic

Chapter 3 Tunneling FET based FPGA
  3.1 FPGA Architectures
    3.1.1 Logic Block Architecture
    3.1.2 Routing Architecture
      3.1.2.1 Single-Driver Routing Architecture
    3.1.3 Heterogeneous FPGAs
  3.2 Logical Architecture & Circuit Design
    3.2.1 Circuit Assumptions
    3.2.2 Circuit Implementation of FPGA Components
      3.2.2.1 Logic Block
      3.2.2.2 Connection Blocks
    3.2.3 Switch Blocks
  3.3 FPGA Simulation Infrastructure & Results
    3.3.1 Critical Path Delay Reduction
  3.4 FPGA Area-Delay Product Improvement

Chapter 4 Conclusions & Future Work
  4.1 Future FPGA Explorations

Bibliography

List of Figures

2.1 Structure of an ultra-thin body NTFET
2.2 Generic Band Diagram of a NTFET
2.3 InGaAs Homojunction NTFET with its Band Diagram
2.4 GaSb-InAs Heterojunction NTFET with its Band Diagram
2.5 Id − Vgs Characteristics of NTFET
2.6 Id − Vds Characteristics of NTFET
2.7 Sneak leakage paths in PTL [1]
2.8 Structure of a SAPTL circuit [2]
2.9 Sense amplifier used in SAPTL [2]
2.10 Energy-Delay characteristics of a 6-input XOR gate with α = 10%
2.11 Energy-Delay characteristics of a 6-input XOR gate with α = 1%
2.12 A 4:1 MUX implemented using Dynamic Discharge Pass Transistor Logic. Direction of current flow in the PTL stack is indicated by the dotted arrow
2.13 A 4:1 MUX implemented using Pre-Charge (Type-1) pass transistor logic. Direction of current flow in the PTL stack is indicated by the dotted arrow
2.14 A 4:1 MUX implemented using Pre-Charge (Type-2) pass transistor logic. Direction of current flow in the PTL stack is indicated by the dotted arrow
2.15 4:1 Multiplexer implemented using bi-directional switches. Direction of current flow through NTFETs in bi-directional switch shown in inset

3.1 Island style FPGA Architecture
3.2 Logic Block Architecture [3]
3.3 Energy-Delay characteristics of an Inverter
3.4 Energy-Delay characteristics of a TSPC Flip-Flop
3.5 Energy-Delay characteristics of a 16:1 fully encoded MUX implemented in different technologies and circuit styles
3.6 Delay improvements for different implementations of a 16:1 MUX with respect to simple FinFET based PTL
3.7 Critical path delay reduction for TFET FPGAs compared to baseline FinFET FPGA
3.8 Area-Delay Product improvements for TFET FPGAs compared to baseline FinFET FPGA

List of Tables

3.1 FPGA Architectural Parameters

Acknowledgments

Words are never going to be enough to express my gratitude to my parents. Amma and Appa have always been pillars of support and have encouraged me at every step to make my own decisions. They are the best parents a child can hope for and I pray to the Almighty to provide me with the ability and strength to uphold the dharma expected of a dutiful son and keep them happy and healthy at all times. I also express my heartfelt thanks to my sister, Dheeptha, for all her help and encouragement.

My heartfelt thanks go out to my adviser at Penn State, Prof. Vijaykrishnan. Dr. Vijay, thank you for providing me with this opportunity and supporting me throughout my time at Penn State. I regret the fact that I cannot pursue my PhD under your guidance.

I would probably never be in this field if not for my Guru, Prof. Venkateswaran. I first met Waran as a naive undergraduate and his constant motivation made me choose a research career in Computer Engineering. I shall always remain indebted to him given the amount of time and energy he has spent on me. Professionally and personally, I have imbibed a lot from him and that will stay with me through my lifetime.

I owe a huge debt of gratitude to Niranjan Soundarajan, who has been a great mentor and friend. Having him as a mentor on my very first project helped me understand and cope with the rigors of grad school. As a friend, he has helped me out on numerous occasions and supported me through tough times. I shall always be grateful for all that he has done for me.

It is also an opportunity to put in ink my gratitude towards all my teachers and professors - right from pre-school to grad school. Special thanks to Prof. Chita Das, Prof. R. Narayanan, Prof. Ganesh Vaidyanathan, Prof. Srikanth Dath, Mrs. Srimathi and Mr. K. K. Anand. Finally, a big thanks to all my friends - there are at least 100 of you that I need to acknowledge, so I will let this pass.

Chapter 1

Introduction

Tremendous growth has been observed for the past few years in the ubiquitous and mobile computing devices segment and market indicators predict that this growth trajectory will be sustained for the foreseeable future. The key challenges that are encountered by designers in this segment are improving cost-effectiveness, reducing form factor and achieving higher energy-efficiency. Most of these devices are battery operated and hence the fundamental tradeoff made in this segment is with regard to performance and energy-efficiency.

Current day systems are predominantly designed using the 4-decade old workhorse, Complementary Metal Oxide Semiconductor (CMOS) technology. CMOS technology has proved to be an ideal framework to realize designs due to its desirable performance, power, cost and reliability characteristics. However, as we scale down to smaller feature sizes, fundamental limits are being breached and this in turn causes transistors and wires in the nanometer regime to behave in a manner that is far from ideal [4]. The continued scaling of the MOSFET device is leading to increased leakage or OFF state current due to short channel effects, such as Drain Induced Barrier Lowering (DIBL), and the supply voltage cannot be reduced further due to the subthreshold slope being limited to 60 mV/decade at room temperature. These challenges have brought the future of CMOS into question [5] and researchers have begun their quest for the next digital switch [6].

Technological limitations aside, ubiquitous and mobile computing systems are generally custom designed to suit the application characteristics, which minimizes area, maximizes energy efficiency and reduces delay. The scope and application of these devices range from Ultra Low Power (ULP) miniature devices, custom designed for health care and implantable bio-electronics [7], to chipsets in mobile phones that are optimized for a particular function. Custom design enhances the energy efficiency of the system, but the flip side is the long design turnaround time coupled with limited applicability of the design in other environments. These drawbacks lead to reduced cost efficiency, thus necessitating the exploration of better implementation strategies.

A flexible design framework is required to overcome the deficiencies encountered with regard to applicability and cost-effectiveness in current day ubiquitous systems. However, the fundamental requirements of energy efficiency and performance cannot be compromised in the quest for this flexible framework. Commercially available processors and Field Programmable Gate Arrays (FPGAs) are not ideal for this design space as they are energy inefficient. Research labs have demonstrated ULP designs operating at sub-threshold and near-threshold voltages that greatly enhance energy efficiency. Sub-Vt microprocessors have been realized with very low energy dissipation per instruction [8] [7], but they require a large number of instructions to complete even simple computing tasks. A sub-Vt FPGA along with a custom CAD flow was presented recently [9] [10] to provide a flexible, energy efficient and low cost framework for designing ubiquitous systems. While this is a step in the right direction for certain kinds of ubiquitous systems like implantable bio-electronics, the performance offered by a sub-Vt FPGA renders it ineffective for more regular computing devices like mobile phones, which require an operating frequency that is at least in the megahertz range.

The employment of FPGAs in the mobile domain has been suggested recently by both industry and the academic community in order to meet the stringent requirements with regard to design turnaround time and cost effectiveness. FPGAs for mobile applications have been recently showcased by vendors like Actel [11] and SiliconBlue [12], and improvements in energy efficiency for certain mobile applications have been reported in [13]. However, the reluctance to adopt FPGAs in mainstream designs can be attributed to the fact that the energy efficiency offered by a FPGA implementation falls well short of that of custom design. The flexibility offered by a FPGA is also the cause of its energy inefficiency, as it requires circuit elements that are typically not used in custom design. Technological advancements combined with sophisticated CAD tools that minimize resource wastage will help improve energy efficiency for FPGAs and render them more useful in the embedded systems design space.

1.1 Scope and Organization of Thesis

The primary focus of this thesis is to evaluate the feasibility of designing a Tunnel FET (TFET) based FPGA. The Tunnel FET (TFET) is a prospective CMOS replacement device [6] [14] shown to have attractive operation characteristics compared to CMOS at future technology nodes [15]. This thesis is an attempt to design a FPGA by overcoming circuit design challenges that are unique to Tunnel FETs.

This thesis is organized as follows. The structure, operating principle, classifications and the working characteristics of TFETs are presented in Chapter 2. Further, circuit design challenges encountered due to the operating characteristics of TFETs are enumerated and solutions are presented to overcome them. Chapter 2 also evaluates the standing of TFET based circuits in terms of energy consumption and delay when compared to FinFET based CMOS designs. Chapter 3 provides a brief overview of modern FPGA architectures and explains the various architectural parameters. Further, a logical higher level FPGA architecture is translated to a circuit level implementation and the impact of TFET based designs at the system-level is evaluated. The final chapter is used to provide conclusions and explore scope for future work.

Chapter 2

Circuit Design Challenges for Tunneling FETs

The previous chapter provided a brief introduction to the key challenges faced by the semiconductor industry with regard to CMOS scaling and energy efficiency. The energy consumed by a CMOS gate is given by Equation 2.1.

\[
\mathrm{Energy}_{gate} = \frac{1}{2} \, C_{gate} \, V_{dd}^{2} \, \alpha + I_{Leak} \, V_{dd} \, \tau \qquad (2.1)
\]

From Equation 2.1, it can be inferred that the switching energy can be quadratically reduced by lowering the supply voltage. However, the threshold voltage (Vt) of the MOSFET must also be reduced in order to maintain a high on-state drive current (ION) and avoid large circuit delays [16]. When the threshold voltage (Vt) is reduced, the off-state leakage current (IOFF) increases exponentially, which in turn results in larger static energy consumption. Thus, there is a fundamental limit to scaling of MOSFET threshold voltage and subsequently the supply voltage. This is attributed to the subthreshold slope of MOSFETs being limited to 60 mV/decade at room temperature [17].
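As a rough numerical illustration of Equation 2.1, the sketch below evaluates the switching and leakage terms separately. The function name and all numeric values are hypothetical and chosen only for illustration; α is taken to be the activity factor and τ the time interval over which leakage is accumulated, which is an assumption since τ is not defined explicitly in the text.

```python
def gate_energy(c_gate, v_dd, alpha, i_leak, tau):
    """Return the (switching, leakage) energy terms of Equation 2.1 in joules.

    alpha is assumed to be the activity factor and tau the interval over
    which the leakage current flows (assumptions, not stated in the text).
    """
    switching = 0.5 * c_gate * v_dd ** 2 * alpha
    leakage = i_leak * v_dd * tau
    return switching, leakage


# Hypothetical values, for illustration only (not taken from the thesis).
C_GATE = 1e-15   # 1 fF gate capacitance
ALPHA = 0.1      # 10% activity factor
I_LEAK = 1e-9    # 1 nA leakage current, held fixed here
TAU = 1e-9       # 1 ns interval

for v_dd in (0.8, 0.4):
    sw, lk = gate_energy(C_GATE, v_dd, ALPHA, I_LEAK, TAU)
    print(f"Vdd = {v_dd:.1f} V: switching = {sw:.3e} J, leakage = {lk:.3e} J")
```

With the leakage current held fixed, halving the supply voltage reduces the switching term by a factor of four but the leakage term only by a factor of two. In a real MOSFET the leakage current would itself rise once Vt is lowered to preserve drive current, which this sketch does not model.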

This necessitates the exploration of alternate devices that can outperform the MOSFET at nanometer dimensions [6] [14]. A promising alternative to MOSFETs, which does not suffer from the limitations discussed previously, is the Tunnel FET (TFET) [15], which works on the principle of interband tunneling. From a design perspective, however, TFETs are not direct replacements for MOSFETs, and certain idiosyncrasies unique to TFETs must be overcome before they can be integrated into mainstream designs.

2.1 The Tunnel FET Device

A Tunnel FET (TFET) is a transistor that works on the principle of inter-band tunneling. A TFET, shown in Figure. 2.1, is basically a p-i-n diode with a gate-oxide over the intrinsic semiconductor region. The gate action induces a strong band bending at the source-channel interface such that the length of the tunneling path decreases. The band structure of a NTFET at OFF and ON states is shown in Figure. 2.2. The tunneling current has an exponential dependence on the tunnel path length. Thus, in essence, the TFET can be defined as “a semiconductor device in which the gate controls the source-drain current through modulation of Band-to-Band Tunnelling (BTBT)”. Band-to-Band Tunnelling (BTBT) is a process in which electrons tunnel from the valence band through the semiconductor bandgap to the conduction band or vice-versa [18].

Figure 2.1. Structure of an ultra-thin body NTFET

Figure 2.2. Generic Band Diagram of a NTFET

Tunnel FETs promise sub-60mV/decade subthreshold slopes and resilience to short channel effects [19]. These attractive characteristics provide an opportunity to obtain better performance at lower supply voltages without impacting the OFF-state current. However, it must be noted that at higher supply voltages, the ION of MOSFETs is much larger than that of TFETs and, as a consequence, CMOS based designs perform better [20].
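The practical meaning of the subthreshold slope can be illustrated with simple arithmetic: a slope of S mV/decade means that a gate-voltage change of S mV changes the drain current by a factor of ten. The sketch below assumes an idealized, constant slope over the whole range; the 30 mV/decade value used for the TFET is a hypothetical figure for illustration and is not a result from this thesis.

```python
def gate_swing_mv(slope_mv_per_decade, decades):
    """Gate-voltage swing (mV) needed to change the current by `decades`
    orders of magnitude, assuming a constant subthreshold slope."""
    return slope_mv_per_decade * decades


def leakage_increase(delta_vt_mv, slope_mv_per_decade):
    """Factor by which the OFF current grows when the threshold voltage
    is lowered by delta_vt_mv, under the same constant-slope assumption."""
    return 10.0 ** (delta_vt_mv / slope_mv_per_decade)


# 60 mV/decade is the room-temperature MOSFET limit quoted in the text;
# 30 mV/decade is a hypothetical sub-60 mV/decade device.
for name, slope in (("MOSFET limit", 60.0), ("hypothetical TFET", 30.0)):
    swing = gate_swing_mv(slope, 4)
    print(f"{name}: {swing:.0f} mV of gate swing for 4 decades of current")

print("Lowering Vt by 120 mV at 60 mV/decade raises IOFF by",
      leakage_increase(120.0, 60.0), "x")
```

Under these assumptions, a steeper slope reaches the same ON/OFF current ratio with a smaller voltage swing, which is why a sub-60 mV/decade device permits a lower supply voltage for the same OFF-state current.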

Tunnel FETs can be fabricated either as a homojunction, which employs the same material system throughout the device, or as a heterojunction employing diﬀerent material systems within the device. Ideally, silicon based TFETs are most attractive as they would allow a full re-use of the expertise acquired over decades and the existing fab infrastructure. However, the small band-to-band tunneling eﬃciency in large-bandgap silicon results in low ON currents for silicon TFETs [14].

This led to the exploration of III-V material system based homojunction TFETs and heterojunction TFETs as shown in Figure. 2.3 and Figure. 2.4. Compared to homojunction TFETs, a higher ION can be obtained in heterojunction TFETs because the staggered P-N heterojunction at the source-channel interface provides a higher critical-field strength for efficient interband tunneling [21]. Further, the heterojunction used in this study employs InAs, a lower bandgap material system compared to the InGaAs used in the homojunction TFET.

Figure 2.3. InGaAs Homojunction NTFET with its Band Diagram

Figure 2.4. GaSb-InAs Heterojunction NTFET with its Band Diagram

Recently, a number of TFETs have been experimentally demonstrated [22] [23], which reflects the improvements made in process flows. A key advantage of TFETs is that, unlike other alternative devices, their fabrication is completely compatible with standard CMOS processing. However, from a design perspective, it must be understood that TFETs are not direct replacements for MOSFETs in digital designs, and overcoming certain idiosyncrasies associated with the device is a key challenge.

2.1.1 Characteristics of TFETs

The transfer (ID − VGS) characteristics of a 22nm GaSb-InAs heterojunction NTFET are shown in Figure. 2.5 and the output (ID − VDS) characteristics are shown in Figure. 2.6. The steep subthreshold slope is clearly observed in the transfer characteristics. The output characteristics provide some interesting insights about the device. Unlike MOSFETs, TFETs exhibit asymmetric current conduction characteristics, with conduction currents present only in the reverse-bias region. Thus, the device acts like a unidirectional switch, with minimal conduction currents observed under moderate forward bias. Under high forward bias, there is significant IDS irrespective of the applied gate voltage.
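The three conduction regimes described above can be summarized as a purely qualitative switch model. The sketch below is a toy abstraction, not a device model: the sign convention (positive VDS taken as the reverse-biased region of the NTFET) and both threshold values are assumptions introduced only for illustration.

```python
def ntfet_conducts(v_gs, v_ds, v_on=0.2, v_forward_limit=-0.5):
    """Toy on/off abstraction of the NTFET behavior described in the text.

    Assumptions (hypothetical, for illustration only):
      - v_ds >= 0 corresponds to the reverse-biased region of the p-i-n diode,
      - v_on is the gate voltage above which the device is considered ON,
      - v_forward_limit is the v_ds below which forward conduction dominates.
    """
    if v_ds >= 0.0:
        # Reverse bias: conduction is controlled by the gate.
        return v_gs >= v_on
    if v_ds <= v_forward_limit:
        # High forward bias: significant current irrespective of the gate.
        return True
    # Moderate forward bias: minimal conduction, gate has little effect.
    return False


for v_gs, v_ds in ((0.5, 0.5), (0.0, 0.5), (0.5, -0.2), (0.0, -0.7)):
    print(f"VGS = {v_gs:+.1f} V, VDS = {v_ds:+.1f} V ->",
          "conducts" if ntfet_conducts(v_gs, v_ds) else "blocks")
```

The third case is what distinguishes the device from a MOSFET pass transistor: with the gate ON but the terminal voltages reversed, the switch still blocks. The fourth case shows why the range of operating voltages has to be bounded in pass transistor circuits.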

2.2 Circuit Design using TFETs

This section will focus on identifying circuit design opportunities where the attractive characteristics of TFETs can be exploited and also identify potential challenges that must be overcome to successfully commercialize TFETs. The scope of this discussion is restricted to the design of digital circuits. Digital circuits can be classified into logic circuits, which are predominantly used to do computation and logic operations, and memory circuits like SRAM, which are used to store digital data. Several SRAM designs for TFETs have been proposed and evaluated [24] [25] [26] [20] and hence, the focus will be on logic circuit families for the remainder of this section.

Figure 2.5. Id − Vgs Characteristics of NTFET

2.2.1 Static and Dynamic Logic Circuits

With regard to static and dynamic logic, TFETs are a drop-in replacement for MOSFETs. Static circuits use pull-up and pull-down networks where the current flow in each device is uni-directional, and dynamic circuits utilize a pull-down network similar to static circuits and a clocked pull-up transistor. All the devices used in these circuit styles are operated only in the reverse-bias region and hence the asymmetric current conduction characteristics, as observed in the device output characteristics (Figure. 2.6), do not affect static and dynamic circuit styles.

Figure 2.6. Id − Vds Characteristics of NTFET

2.2.2 Pass Transistor Logic for TFETs

Pass Transistor Logic (PTL) [4] is widely used to implement many important logic functions and circuits such as XOR and MUX. This is because PTL can implement a logic circuit with fewer transistors than its static counterpart. In PTL, the logic operation is performed by connecting and disconnecting the input signal to the output, and the same pass transistor stack is used to perform both pull-up and pull-down operations. The reduced capacitance in the network in turn reduces the latency and switching energy of the circuit.

In MOSFET based Pass Transistor Logic (PTL), input signals are provided at the gate and drain of a nMOS transistor and the output node is charged and discharged based on the inputs. Unlike MOSFETs, the source and drain architecture of a TFET is asymmetric, which results in an asymmetric current flow between the two nodes as observed in the output characteristics shown in Figure. 2.6. This limits the utilization of TFETs as bi-directional pass transistors and renders conventional pass-gate logic unusable for TFET based circuits. It also necessitates that the orientation of the transistor, i.e. the location of source and drain, be determined at design time. Thus, the uni-directional conduction characteristic is one of the major challenges that must be overcome for successful utilization of TFETs in the mainstream. The following sections provide some circuit techniques that can be employed to design functional pass transistor circuits.

2.2.2.1 Sense Ampliﬁer Pass Transistor Logic

Sense Amplifier Pass Transistor Logic (SAPTL) is a special type of pass transistor logic where the focus is on limiting leakage energy consumption in pass transistor circuits, especially at very low supply voltages. While there are no explicit supply and ground connections in PTL, unlike static and dynamic CMOS circuits, the inputs of the PTL stack can create temporary sneak leakage paths as shown in Figure. 2.7. In order to overcome such sneak leakage paths, a new topology was presented [1] [2] where a single driver, typically an inverter, is used to drive an inverted Binary Decision Diagram (BDD) based PTL stack as shown in Figure. 2.8. The pseudo dual rail outputs of the stack are then evaluated using a latch based sense amplifier.

Figure 2.7. Sneak leakage paths in PTL [1]

Figure 2.8. Structure of a SAPTL circuit [2]

The structure of SAPTL provides an opportunity to drastically reduce the threshold voltage of MOSFETs in the stack, as there is no leakage path, while using high threshold devices for the driver and sense amplifier. The sense amplifier used is a simple cross-coupled latch based one with an input pre-amplifier stage, as presented in Figure. 2.9. As expected, SAPTL is more efficient for designing structures with large fan-ins, since the considerable energy consumed by the sense amplifier is then amortized over a larger logic function. SAPTL suffers from the following flaws:

1. There is no path to discharge the PTL stack. Hence, discharge transistors must be provided at the output of the stack and this creates a leakage path.

2. The use of low-Vt transistors in the PTL stack causes the leakage current to charge the internal node capacitances in the off state path and thus waste energy.

3. The CMOS based sense amplifier is not very sensitive at ultra-low voltages.

Figure 2.9. Sense amplifier used in SAPTL [2]

SAPTL is a style where TFETs can be used as drop-in replacements for MOSFETs. TFETs are inherently uni-directional and hence are ideal devices to use in this topology. Further, using TFETs in SAPTL overcomes the problems associated with CMOS based SAPTL. The TFET based sense amplifier remains sensitive because TFETs outperform CMOS at ultra low voltages, and energy is not wasted in charging the off-state path capacitances. Further, since there is only a small voltage drop at the output of the stack when TFETs are used, a simple buffer can be used to provide the required drive strength at the output if further energy reduction is required.

Figure. 2.10 and Figure. 2.11 show the energy-delay characteristics obtained for a 6-input XOR gate with activity factors of 10% and 1% respectively. Two CMOS based configurations are presented: the first employs high Vt devices in the driver and sense amplifier and the other utilizes only low Vt devices. The PTL stack is designed using low Vt devices in both cases. The supply voltage is swept from 700mV down to 300mV and the simulation is performed for 1000 cycles. Based on the energy-delay characteristics, it can be concluded that TFET based SAPTL clearly beats its CMOS counterpart. Further, the leakage energy dominance observed in CMOS designs at ultra low supply voltages is absent in TFET based designs.

Figure 2.10. Energy-Delay characteristics of a 6-input XOR gate with α = 10%

2.2.2.2 Dynamic Discharge Design

Figure 2.11. Energy-Delay characteristics of a 6-input XOR gate with α = 1%

The SAPTL is at best an esoteric design style which is not suited for mainstream applications. Better solutions to implement pass transistor logic using TFETs are required. It is obvious from the output characteristics that TFETs in a pass transistor stack can be oriented only to charge or discharge the output node. An easy way to overcome this hurdle is to use a clocked discharge transistor to discharge the output before every computation. While this seems straightforward, there are concerns that must be addressed. Some internal nodes can be charged up and still be disconnected from the end output node. This necessitates discharge transistors at every internal node. Further, a set of blocking transistors is needed to disconnect the inputs from the stack while the internal nodes are being discharged. These extra transistors increase the area of the stack and the net capacitance that needs to be charged, and hence directly affect the delay and energy consumption. A 4:1 multiplexer using dynamic discharge design is shown in Figure. 2.12. The blocking transistors which isolate the inputs from the PTL stack are shown in dotted lines.
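The need for the clocked discharge phase can be seen in a small behavioral sketch. The class below is a hypothetical logic-level abstraction of a 4:1 MUX whose pass devices can only charge the output node; it lumps the stack into a single node and ignores the internal nodes and blocking transistors discussed above, so it illustrates the principle rather than the circuit of Figure. 2.12.

```python
class ChargeOnlyMux:
    """Logic-level toy model of a MUX built from charge-only pass devices."""

    def __init__(self):
        self.out = 0  # logic value currently stored on the output node

    def discharge(self):
        """Clocked discharge phase: reset the output node to logic 0."""
        self.out = 0

    def evaluate(self, inputs, select):
        """Evaluate phase: the selected input can pull the node up,
        but a uni-directional device can never pull it back down."""
        self.out = max(self.out, inputs[select])
        return self.out


inputs = [1, 0, 0, 0]
mux = ChargeOnlyMux()

mux.evaluate(inputs, 0)                      # selects a logic 1
stale = mux.evaluate(inputs, 1)              # selects a logic 0, no discharge
mux.discharge()
correct = mux.evaluate(inputs, 1)            # same selection after discharge

print("without discharge:", stale)
print("with discharge:   ", correct)
```

Without the discharge phase the node keeps the charge from the previous computation and the MUX reports a stale 1. Discharging before every evaluation restores correct behavior, at the cost of the extra transistors and capacitance described in the text.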

2.2.2.3 Dynamic Pre-charge Design

Figure 2.12. A 4:1 MUX implemented using Dynamic Discharge Pass Transistor Logic. Direction of current flow in the PTL stack is indicated by the dotted arrow

An alternative method, which negates the need to discharge all internal nodes before every cycle, is the dynamic pre-charge design. In this case, the TFETs in the pass transistor stack are oriented to only discharge the output node, which is pre-charged to Vcc every cycle. The inputs of the PTL stack must be isolated from the output node while pre-charging to prevent the chance of a direct Vcc to GND short circuit. Figure. 2.13 & Figure. 2.14 show two implementations of a 4:1 MUX using pre-charge based PTL. In Figure. 2.13, the transistors shown with dotted lines are used to isolate the stack from the inputs during the pre-charge cycle. This is similar to the dynamic discharge design and increases the circuit area considerably. A better design to achieve this purpose is presented in Figure. 2.14. This design consumes only one extra transistor to isolate the output node from the inputs and hence is more area efficient. However, as with any pre-charge and evaluate design, there exists the possibility of degradation of the output voltage due to sharing of charge among internal nodes during the evaluate phase.

Traditionally, this drawback is overcome by using a pull-up transistor based keeper that restores the output to Vcc. However, in TFET based designs, a NTFET can be used as a keeper with the output node driving the keeper by itself as shown in Figure. 2.13 & Figure. 2.14. This is possible as NTFETs can be envisioned as near-zero threshold voltage devices that cause only a minute drop (≈ 50mV) in the output voltage when used as a pass transistor. This attractive property of NTFETs allows them to be employed as a level restorer. Using a NTFET as a keeper improves the response of the keeper circuitry and helps in improving the energy and delay characteristics. The degradation in output voltage does not result in any extra leakage power consumed by circuits downstream as it is very small.
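The output degradation caused by charge sharing, which the keeper is meant to correct, follows from charge conservation between the pre-charged output node and the internal node capacitance it becomes connected to during evaluation. The sketch below is a minimal illustration of that relation; the capacitance and voltage values are hypothetical and are not taken from the simulated circuits.

```python
def charge_sharing_voltage(v_cc, c_out, c_internal, v_internal=0.0):
    """Voltage on the output node after it shares charge with internal
    node capacitance, assuming no keeper and no other current path."""
    total_charge = c_out * v_cc + c_internal * v_internal
    return total_charge / (c_out + c_internal)


# Hypothetical values, for illustration only.
V_CC = 0.5          # volts
C_OUT = 4e-15       # 4 fF on the pre-charged output node
C_INTERNAL = 1e-15  # 1 fF of initially discharged internal capacitance

v_after = charge_sharing_voltage(V_CC, C_OUT, C_INTERNAL)
print(f"output after charge sharing: {v_after * 1e3:.0f} mV "
      f"(drop of {(V_CC - v_after) * 1e3:.0f} mV)")
```

The larger the internal capacitance relative to the output capacitance, the larger the drop, which is why a level-restoring keeper is attached to the output node in pre-charge designs.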

For all its advantages, a subtle drawback inherent in this type of design is that the range of operating voltages is limited. This is to prevent the NTFETs in the PTL stack from becoming forward-biased and consequently conducting large currents irrespective of the gate voltage.

2.2.2.4 Bi-directional Switch based Pass Transistor Logic

Figure 2.13. A 4:1 MUX implemented using Pre-Charge (Type-1) pass transistor logic. Direction of current flow in the PTL stack is indicated by the dotted arrow

Figure 2.14. A 4:1 MUX implemented using Pre-Charge (Type-2) pass transistor logic. Direction of current flow in the PTL stack is indicated by the dotted arrow

Figure 2.15. 4:1 Multiplexer implemented using bi-directional switches. Direction of current flow through NTFETs in bi-directional switch shown in inset

The dynamic PTL designs presented previously overcome the problem of designing PTL circuits with uni-directional switches. However, these styles require a re-design of pass transistor logic based standard cells and the development of new synthesis methodologies and tools. A novel way of overcoming uni-directional conduction is to use two NTFETs, with their drains oriented in opposite directions, to implement a bi-directional switch as shown in the inset in Figure. 2.15. The bi-directional switch operates just like a nMOS pass transistor and hence allows the re-use of existing PTL synthesis methods and tools. The obvious drawback of this implementation is that the area of circuits is doubled. Further, the range of operating voltages must be limited in order to make sure that the NTFETs in the PTL stack do not become forward-biased. A 4:1 multiplexer designed using bi-directional switches is shown in Figure. 2.15.
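The area doubling can be made concrete by counting pass devices. The sketch below assumes a fully encoded binary-tree MUX, in which each select bit controls one level of the tree; whether Figure. 2.15 uses exactly this topology is an assumption, and select-line inverters and output buffers are ignored.

```python
def tree_mux_pass_devices(num_inputs, devices_per_switch=1):
    """Number of pass devices in a fully encoded binary-tree MUX.

    devices_per_switch is 1 for a single pass transistor and 2 for the
    bi-directional switch built from two oppositely oriented NTFETs.
    """
    if num_inputs < 2 or num_inputs & (num_inputs - 1):
        raise ValueError("num_inputs must be a power of two, at least 2")
    # Levels contain num_inputs, num_inputs / 2, ..., 2 switches.
    switches = 2 * num_inputs - 2
    return switches * devices_per_switch


for n in (4, 16):
    single = tree_mux_pass_devices(n)
    paired = tree_mux_pass_devices(n, devices_per_switch=2)
    print(f"{n}:1 MUX: {single} pass transistors, "
          f"{paired} NTFETs with bi-directional switches")
```

Under this assumption, a 4:1 MUX grows from 6 to 12 pass devices and a 16:1 MUX from 30 to 60, in exchange for being able to re-use existing PTL synthesis flows unchanged.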

Based on the circuit designs presented in this chapter, a tile based FPGA is designed and the impact of TFET based circuits is evaluated at the system level. A FPGA is predominantly made up of different varieties of multiplexer designs, which are most efficient in terms of delay, area and energy when pass transistor logic is utilized. Hence, the impact of above discussed circuit styles can be best evaluated by constructing a FPGA. The architecture and the circuit level implementation of the different FPGA blocks and their characterization are presented in the following chapter.

Chapter 3

Tunneling FET based FPGA

Field Programmable Gate Arrays (FPGAs) are re-configurable hardware systems that can be used to implement arbitrary logic designs. FPGAs have evolved considerably since their introduction in 1984 and are now used in a wide range of markets including communications, consumer electronics, automotive electronics, defence systems and as accelerators in high performance computing [27]. FPGAs provide an attractive platform for rapid implementation of designs and thus achieve fast turnaround times at reduced costs compared to Application Specific Integrated Circuits (ASICs). However, the flip side of the field-programmable characteristic of a FPGA is the area, energy and performance penalty incurred by providing the reconfigurable fabric compared to an ASIC. It was estimated that a fine-grained, pure soft-logic based FPGA, when compared against a standard cell based ASIC design at the same technology node, might be 4x slower and 35x larger, and may consume 14x more dynamic power [28]. This necessitates the exploration of more efficient circuit techniques and better architectures to bridge the gap between ASICs and FPGAs. Further, the static power consumed by a FPGA increases with technology scaling and is anticipated to be a major obstacle for utilization of FPGAs in battery powered embedded applications [29] [30].

Commercial FPGAs can be broadly classiﬁed into three categories:

1. SRAM based

2. Anti-fuse based

3. Flash based

Anti-fuse based FPGAs are one-time programmable FPGAs. The connections between logic blocks are made by burning anti-fuses on the routing tracks. This type of FPGA is predominantly used for satellite and space electronics [3], and scaling to smaller technology nodes has been a critical challenge [31]. In flash based FPGAs, the configuration bits and the look-up table (LUT) memory are made up of flash memory. The use of flash memory reduces the area consumed by the configuration bits and guarantees better area-efficiency. However, the critical challenges faced by this type of FPGA are the manufacturing cost overhead of integrating flash memory with a standard CMOS process, the write endurance of flash memory and the scaling challenges associated with flash memory. The LUT memory and configuration bits of a SRAM based FPGA are volatile. However, they can be reprogrammed an almost unlimited number of times over the life-cycle of the FPGA. Further, since they are purely CMOS based, the manufacturing process is highly optimized.

3.1 FPGA Architectures

A basic FPGA consists of numerous look-up table (LUT) based logic blocks connected to each other through an interconnection network of programmable routing switches and a set of I/O blocks that provide the oﬀ-chip interface. A logic design is implemented on a FPGA by utilizing the logic blocks to implement parts of the design and conﬁguring the interconnection network to interconnect the logic blocks.

Almost all current day FPGAs use an architecture known as Island-style architecture [27]. As depicted in Figure 3.1, the logic blocks are surrounded by routing elements in this type of architecture and hence the name. The island style FPGA is designed by repeatedly instantiating a single tile [32] numerous times. Each tile consists of a Logic Block (LB), two Connection Blocks (CBs), a Switch Block (SB) and interconnect wires. The functionality of these components along with other definitions that define the logical architecture of a FPGA is presented in the following sections.

3.1.1 Logic Block Architecture

Figure 3.1. Island style FPGA Architecture

The logic block is used to provide the customizable logic functionality that is expected of a FPGA. The implementation of the logic block significantly impacts the delay, area and energy consumption of a FPGA. Each logic block consists of a cluster of Basic Logic Elements (BLEs) as depicted in Figure 3.2, each of which typically consists of a K-input look-up table paired with a flip-flop. The BLEs are normally clustered in order to share input and output signals within the cluster. It has been reported in [33] that for a logic cluster of size N employing K-input BLEs, the ideal input vector size is given by Equation 3.1. A full crossbar architecture capable of connecting any logic block input or BLE output to any BLE input, as shown in Figure 3.2, is assumed for the intra-cluster routing.

$$ I = \frac{K}{2} (N + 1) \qquad (3.1) $$
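As a quick check of Equation 3.1 against the parameters assumed later in this chapter (K = 4, N = 10), the relation can be evaluated with a minimal Python sketch. The function name `cluster_inputs` is hypothetical, and rounding down for odd products is an assumption of this sketch rather than something stated in [33].

```python
def cluster_inputs(k: int, n: int) -> int:
    """Number of logic-block inputs I for a cluster of N BLEs with K-input LUTs."""
    # Equation 3.1: I = (K / 2) * (N + 1).
    # Multiplying before dividing keeps the arithmetic in integers; integer
    # division rounds down when K * (N + 1) is odd (an assumption of this sketch).
    return (k * (n + 1)) // 2


if __name__ == "__main__":
    # K = 4 and N = 10 are the values used for the architecture in this chapter.
    print(cluster_inputs(4, 10))
```

For K = 4 and N = 10 this evaluates to 22, which agrees with the number of cluster inputs listed among the architectural parameters.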

Current day FPGAs sometimes employ fracturable LUTs [34] in their BLEs, which can be split into multiple smaller LUTs if required. The use of fracturable LUTs minimizes the number of unused inputs in each BLE and hence optimizes CLB utilization. Further, it was found in [35] that the flexibility offered by the full crossbar intra-cluster routing architecture described previously is typically not required, and full crossbars are no longer common in modern day FPGAs [36].

Figure 3.2. Logic Block Architecture [3]

3.1.2 Routing Architecture

In an island-style FPGA architecture, the LBs are surrounded by programmable routing wires in routing channels that are used to interconnect LBs. The routing tracks in the channels are typically segmented and can be routed using programmable switches found in the switch block at the intersection of channels. The routing tracks are connected to the logic block through the connection block (CB).

The channel width, W, is the number of tracks in the channel and the logical length of each routing segment, L, is defined as the number of logic blocks spanned by the segment. The number of segments that any segment can connect to in a switch block is the switch block flexibility, Fs. The input connection block flexibility, Fcin, is the number of tracks within the channel that can connect to a logic block input and the number of tracks to which a logic block output can connect is the output connection block flexibility, Fcout. Some commercial architectures merge the output connection block with the switch block [34].

3.1.2.1 Single-Driver Routing Architecture

One significant attribute of the routing architecture is the nature of the connections driving each routing segment. In the past, approaches that allow each routing segment to be driven from multiple points along the segment were common [27]. These multi-driver designs required some form of tri-state mechanism on all potential drivers. A single-driver approach is now widely used instead [37]. The single-driver approach, while reducing the flexibility of the individual routing segments, is advantageous for both area and performance reasons because it allows standard inverters to drive each routing segment instead of the tri-state buffers or pass transistors required for the multi-driver approaches. Single-driver routing is the only type of routing that will be considered in this work.

3.1.3 Heterogeneous FPGAs

Some modern FPGAs employ hard-designed blocks like multipliers and block RAMs in addition to the soft logic in order to improve performance and reduce the area footprint. These components provide resource heterogeneity within a FPGA. An investigation presented in [28] found hard logic blocks useful in narrowing down the area gap in comparison to an ASIC to 4.7x. However, these blocks had only a moderate impact on power and almost no impact on delay. The other types of heterogeneity that are found in FPGAs are the presence of multiple different tiles and denser routing tracks in selective regions.

3.2 Logical Architecture & Circuit Design

In the previous section, the various parameters that characterize a FPGA architecture were defined. The logical architecture of a FPGA defines the logical behavior of a FPGA in terms of the LUT size, cluster size and the structure of routing segments and switches. In this section, the logical architecture of a FPGA is translated into an electrical circuit level netlist, which determines the area, delay and energy consumption. For the purpose of this discussion, the architectural parameters that have been assumed are presented in Table 3.1. These are consistent with other FPGA architecture explorations performed in academia [38] [39].

3.2.1 Circuit Assumptions

An island-style tiled architecture, comprising only pure soft logic blocks, is the focus of this study. This is because soft-logic is the most important factor in determining the area, performance and delay of an FPGA [38]. As a consequence of this constraint, the circuit implementation of the FPGA architecture under study will consist purely of inverters, configuration memory, flip-flops and different varieties of multiplexers.

Table 3.1. FPGA Architectural Parameters

Parameter                       Value
LUT Size (K)                    4
Cluster Size (N)                10
Number of Cluster Inputs (I)    22
Tracks per Channel (W)          104
Track Length (L)                4
Interconnect Style              Unidirectional
Driver Style                    Single Driver
Fcin                            0.15
Fcout                           0.10
Pads per row/column             4

The above mentioned circuits are designed and characterized for 22nm planar-CMOS, FinFET CMOS and Tunnel FET technologies. Predictive technology models [40] are used for simulating planar-CMOS designs. For simulating FinFET and TFET designs, look up table based Verilog-A models [21] are utilized. Verilog-A based modeling is an efficient and accurate way of simulating emerging devices, like TFETs, for which compact or SPICE models are not available [?].

Inverters are implemented using the standard static design style. The energy-delay characteristics of a size one inverter operating with various activity factors are shown in Figure 3.3 for different technologies. An obvious observation that can be made is that the FinFET-CMOS inverter clearly beats the planar-CMOS inverter across the design space, as expected. Further, we observe that the FinFET inverter performs better than the TFET inverter at higher voltages. This is again expected, as the ION of FinFETs is much higher than that of TFETs at higher voltages. The crossover point where TFETs start to beat FinFETs lies between 600mV and 500mV. Further, the energy-delay characteristics of the TFET inverter do not exhibit the leakage energy dominance at lower supply voltages and activity factors, unlike planar-CMOS and FinFET-CMOS inverters. Since FinFET based designs clearly beat planar-CMOS designs, only FinFET based designs are used for comparison with TFET designs for other circuit elements.

Figure 3.3. Energy-Delay characteristics of an Inverter

For the configuration memory, a 6T TFET SRAM presented in [25] is used. This SRAM uses a novel circuit design to overcome the unidirectional conduction property of TFETs. However, it must be noted that these SRAMs are written into only once while programming and the reads from them are not through a sense amplifier. Instead, the outputs of the cross coupled inverter are directly hardwired to logic and routing block inputs, negating the concern for the strong read margin requirements which arise if the access transistors are used.

Figure 3.4. Energy-Delay characteristics of a TSPC Flip-Flop

The flip-flops in FPGAs contribute very little to the FPGA area (5% of total area) and performance. The flip-flops used in this study are True Single-Phase Clock (TSPC) flip-flops discussed in [4]. The energy-delay characteristics of this flip-flop for different activity factors are presented in Figure 3.4.

The major component in a FPGA is the multiplexer, which is used both in logic blocks, to create LUTs, and in the routing fabric, as switches. Multiplexers can be efficiently designed using pass transistor logic when compared to other design styles and hence all multiplexers in this work are constructed using nMOS/NTFET transistors. Circuit design solutions presented in the previous chapter are used to implement NTFET based PTL. Fully encoded MUX designs are used for the LUTs whereas partially decoded 2-level MUX structures, which exhibit a reduced latency by using more select signals, are used in routing switches [3]. The width of each MUX is determined by its position in the FPGA and will be explained in the following subsection.
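The trade-off between the two MUX organizations mentioned above can be illustrated with a small Python sketch. It assumes a common textbook sizing that is not spelled out in this thesis: the fully encoded tree uses one select signal per level of 2:1 stages, while the 2-level structure uses one-hot selects with roughly the square root of the width at each level. The function names are hypothetical.

```python
import math


def encoded_tree_mux(width: int) -> dict:
    """Fully encoded tree MUX: one select signal per level of 2:1 stages."""
    levels = (width - 1).bit_length()  # ceil(log2(width)) for width >= 1
    return {"select_signals": levels, "series_pass_transistors": levels}


def two_level_mux(width: int) -> dict:
    """Partially decoded 2-level MUX with one-hot selects (assumed sizing)."""
    first = math.isqrt(width - 1) + 1   # ceil(sqrt(width)): inputs per first-level group
    second = -(-width // first)         # ceil(width / first): number of groups
    return {"select_signals": first + second, "series_pass_transistors": 2}


if __name__ == "__main__":
    for w in (16, 32):
        print(w, encoded_tree_mux(w), two_level_mux(w))
```

Under these assumptions a 16:1 MUX needs 4 select signals and 4 series pass transistors in the tree form, against 8 select signals and 2 series pass transistors in the 2-level form, which is the sense in which latency is traded for additional select signals.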

3.2.2 Circuit Implementation of FPGA Components

The primitive circuits described in the previous subsection need to be combined together appropriately to obtain the tile implementation, which in turn is repeatedly instantiated to construct the FPGA. This subsection deals with the design of different sub-components like the CLB, CB and SB within the tile of a unidirectional single driver FPGA architecture.

3.2.2.1 Logic Block

The logic block contains a cluster of Basic Logic Elements (BLEs) interconnected using a full crossbar routing switch (Figure 3.2). The BLE consists of a LUT and a flip-flop. To implement a 4-input LUT as specified in Table 3.1, a 16:1 fully encoded tree MUX is required. The 4 LUT inputs are the select signals of the MUX, which select the required data inputs stored in LUT memory.
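The way a fully encoded tree MUX reads a LUT can be modelled behaviourally. The following Python sketch is only a logical model (it says nothing about the pass-transistor circuit), and the function name `lut_read` and the bit ordering are assumptions made for illustration.

```python
def lut_read(lut_memory, inputs):
    """Evaluate a K-input LUT as a 2^K:1 tree MUX over the stored bits.

    lut_memory: list of 2^K stored bits, indexed by address.
    inputs: list of K select bits, inputs[0] being the least significant.
    """
    assert len(lut_memory) == 2 ** len(inputs)
    level = list(lut_memory)
    # Each LUT input controls one level of 2:1 selections, halving the
    # number of candidate bits until a single output remains.
    for sel in inputs:
        level = [level[i + sel] for i in range(0, len(level), 2)]
    return level[0]


if __name__ == "__main__":
    # A 4-input AND function: only address 15 (all inputs high) stores a 1.
    and4 = [0] * 15 + [1]
    print(lut_read(and4, [1, 1, 1, 1]))
    print(lut_read(and4, [1, 0, 1, 1]))
```

With K = 4 the loop runs four times over 16 stored bits, mirroring the four levels of the 16:1 tree.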

Multiplexer designs are most efficient when implemented using pass-transistor logic. In the previous chapter, some circuit design options to implement PTL using TFETs were presented. These circuit implementations are used to design the fully encoded 16:1 MUX. Further, in order to obtain insight on the impact of the change in circuit design along with that of the technology change to TFETs, the dynamic pre-charge design (Type 2) is also implemented using FinFETs. The energy-delay characteristics for a fully encoded 16:1 MUX are presented in Figure 3.5 for an activity factor of 10% over 1000 clock cycles for each design.

Many interesting insights are obtained from Figure 3.5. The first key observation is that the energy consumed by the FinFET designs, both the traditional PTL and the dynamic PTL implementations, is almost the same across the design space. However, it must be noted that in the traditional PTL design, dynamic energy is determined only by the number of 0→1 transitions of the output node, whereas in the case of dynamic PTL both 0→1 and 0→0 transitions are contributors. Secondly, the leakage power dissipated by both circuits is the same; however, the leakage energy consumed is marginally different. This can be observed clearly at lower operating voltages, where the reduced delay of the dynamic design implies a reduced cycle time and hence reduced leakage and total energy consumption. With regard to TFET based designs, the energy-delay characteristics of the pre-charge based designs are most impressive. Better delay characteristics can be attributed to the isolated charging of just the output capacitance during the pre-charge phase, resulting in an enhanced rise time and thus a lower overall delay. With regard to energy consumption, the orientation of devices in the PTL stack of the pre-charge based designs results in reduced leakage energy consumption, as there are no sneak leakage paths created by data inputs as shown in Figure 2.7. The bi-directional switch based designs suffer from the presence of sneak leakage paths, which result in higher leakage energy consumption compared to the pre-charge based designs. In the case of the pre-discharge design, the presence of discharge transistors at internal nodes creates more leakage paths, resulting in higher static energy dissipation. Further, the dynamic energy consumption of the pre-charge based designs is also lower, as the internal node capacitances present in the PTL stack are much smaller in these circuits.

Figure 3.5. Energy-Delay characteristics of a 16:1 fully encoded MUX implemented in different technologies and circuit styles
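The difference in which output transitions contribute to dynamic energy in traditional and dynamic PTL can be made concrete with a simplified event count. The Python sketch below is an illustrative model only: it counts energy-drawing cycles on the output node, assumes each such event costs the same energy, and ignores internal node capacitances. The function names and the example sequence are hypothetical.

```python
def static_ptl_events(outputs):
    """Cycles drawing dynamic energy in traditional PTL: 0 -> 1 output transitions."""
    return sum(1 for prev, cur in zip(outputs, outputs[1:]) if prev == 0 and cur == 1)


def precharge_ptl_events(outputs):
    """Cycles drawing dynamic energy in pre-charged PTL: 0 -> 1 and 0 -> 0 transitions."""
    # Whenever the previous evaluation left the output at 0, the next
    # pre-charge phase has to recharge the output node, whatever comes next.
    return sum(1 for prev in outputs[:-1] if prev == 0)


if __name__ == "__main__":
    seq = [0, 0, 1, 1, 0, 1, 0, 0]  # hypothetical output values, one per cycle
    print(static_ptl_events(seq), precharge_ptl_events(seq))
```

For the example sequence the traditional design sees 2 energy-drawing events and the pre-charged design sees 4, so similar measured energies for the two FinFET designs do not imply identical switching behaviour.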

Based on Figure 3.5, it is obvious that the combined influence of technology (TFETs) and new circuit styles provides far greater benefits than what novel circuit designs alone can offer. The improvement obtained in delay for different circuit implementations with respect to traditional PTL implemented using FinFETs is presented in Figure 3.6.

The CLB also contains a full crossbar intra-cluster routing that provides the functionality to connect any input of the logic block or any output of the BLEs to the required LUT input. The width of this routing MUX is determined by Equation 3.2. Since this is a routing MUX, a 2-level structure is used to reduce delay. Further, since there are 10 BLEs in a cluster and 4 inputs per BLE, 40 such MUXes are required. The energy-delay characteristics of the 2-level MUXes are very similar to those of the fully encoded structure and are hence not presented here.

Figure 3.6. Delay improvements for different implementations of a 16:1 MUX with respect to simple FinFET based PTL

$$ \mathrm{Width}_{\mathrm{BLE\,routing\,MUX}} = I + N = 22 + 10 = 32 \qquad (3.2) $$
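The crossbar sizing above follows directly from the cluster parameters. A minimal Python sketch, with hypothetical function names, reproduces both the MUX width and the MUX count:

```python
def crossbar_mux_width(cluster_inputs: int, cluster_size: int) -> int:
    """Width of each intra-cluster routing MUX (Equation 3.2): I + N."""
    # Every LUT input can select any logic-block input or any BLE output.
    return cluster_inputs + cluster_size


def crossbar_mux_count(cluster_size: int, lut_size: int) -> int:
    """One routing MUX per LUT input: N BLEs times K inputs each."""
    return cluster_size * lut_size


if __name__ == "__main__":
    print(crossbar_mux_width(22, 10), crossbar_mux_count(10, 4))
```

With I = 22, N = 10 and K = 4 this gives 32-input MUXes and 40 of them per logic block.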

3.2.2.2 Connection Blocks

The Connection Block is used to connect the routing tracks to CLB input pins. This functionality is again implemented using a 2-level multiplexer. The parameter Fc,in is defined as the number of tracks within the channel that can connect to a logic block input, and is expressed in Table 3.1 as a fraction of the channel width W. Using the value of Fc,in from Table 3.1, the CB MUXes should have a width given by Equation 3.3. Since there are 22 inputs to the CLB, 22 MUXes are grouped to form the Connection Block (CB).

$$ \mathrm{Width}_{\mathrm{CB\,MUX}} = F_{c,in} \cdot W = 0.15 \cdot 104 \approx 16 \qquad (3.3) $$
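Equation 3.3 can be evaluated in the same way. The sketch below assumes that the fractional product is rounded to the nearest integer, which is one reading of the "≈" in the equation; the function name is hypothetical.

```python
def cb_mux_width(fc_in: float, channel_width: int) -> int:
    """Width of each connection-block MUX (Equation 3.3): Fc,in * W."""
    # 0.15 * 104 = 15.6 tracks; rounding to the nearest integer is an
    # assumption of this sketch.
    return round(fc_in * channel_width)


if __name__ == "__main__":
    print(cb_mux_width(0.15, 104))
```

With Fc,in = 0.15 and W = 104 the product is 15.6, which rounds to the 16-input MUX used here.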

3.2.2.3 Switch Blocks

The Switch Blocks (SB) connect the routing segments across tiles. A single-driver routing architecture with a switch block flexibility (FS) of 3 has been assumed in this work. The width of the SB routing MUX [3] is given by Equation 3.4. The switch block is constructed from 52 such 2-level MUXes.

$$ \mathrm{Width}_{\mathrm{SB\,MUX}} = \frac{\frac{2W}{L} F_S + \left(2W - \frac{2W}{L}\right)(F_S - 1) + F_{c,out} W N}{\frac{2W}{L}} = 11 \qquad (3.4) $$
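The switch block figures can be checked numerically. The following Python sketch evaluates Equation 3.4 term by term for the assumed parameters; the function names are hypothetical, and the comments describing each term are this sketch's reading of the equation rather than statements taken from [3].

```python
def sb_mux_count(w: int, l: int) -> float:
    """Number of routing MUXes per switch block: 2W / L."""
    return 2 * w / l


def sb_mux_width(w: int, l: int, fs: int, fc_out: float, n: int) -> float:
    """Average number of inputs per switch-block routing MUX (Equation 3.4)."""
    drivers = sb_mux_count(w, l)
    total_inputs = (
        drivers * fs                       # first term: (2W/L) * FS
        + (2 * w - drivers) * (fs - 1)     # second term: (2W - 2W/L) * (FS - 1)
        + fc_out * w * n                   # third term: Fc,out * W * N
    )
    return total_inputs / drivers


if __name__ == "__main__":
    # W = 104, L = 4, FS = 3, Fc,out = 0.10, N = 10 as assumed in this chapter.
    print(sb_mux_count(104, 4), round(sb_mux_width(104, 4, 3, 0.10, 10), 6))
```

For these parameters the three terms are 156, 312 and 104, giving 572 inputs shared among 52 MUXes, or a width of 11 each.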

3.3 FPGA Simulation Infrastructure & Results

The observations presented in the previous section clearly showcase the fact that at the component level, TFET based circuits outperform their FinFET counterparts at reduced supply voltages. However, the performance benefits obtained at the component level need not necessarily be reflected at the system level when a circuit is mapped onto the FPGA. In order to characterize the impact of TFET based FPGAs in terms of critical path reduction, a simulation framework based on the Versatile Place and Route (VPR) [27] tool, which is essentially a FPGA simulator, is used to place and route 20 of the largest MCNC benchmark circuits [41]. The inputs needed for simulating a FPGA fabric using VPR are extracted from the component level circuit simulations presented in the previous section.

3.3.1 Critical Path Delay Reduction

Figure 3.7 shows the average reduction in critical path delay for circuits mapped to TFET based FPGAs, employing different multiplexer implementations, compared to those mapped on the baseline FinFET FPGA which employs traditional PTL based multiplexers. The dynamic discharge based multiplexer implementation is not evaluated as it is area-inefficient. The best results are obtained when the operating voltage is 300mV, and the benefit reduces as the operating voltage is increased. The TFET based FPGAs become slower than their FinFET counterparts when the operating voltage is past 700mV.

Further, it can be seen that the improvements seen at the component level are closely reflected in critical path delay improvements for operating voltages of 500mV, 400mV and 300mV. However, this is not the case for 600mV. This is because at 600mV, the delay through interconnect wires is the dominant factor in determining the critical path. However, at lower voltages, the performance of the FinFET circuits degrades rapidly and hence the circuit delay determines the critical path.

Figure 3.7. Critical path delay reduction for TFET FPGAs compared to baseline FinFET FPGA

3.4 FPGA Area-Delay Product Improvement

The average improvements obtained in the Area-Delay Product (ADP) for different operating points are presented in Figure 3.8. The ADP improvements obtained for the TFET FPGAs employing pre-charge based multiplexers are much greater than those achieved with bi-directional switch based multiplexers. This is because the area penalty is greater for designs employing bi-directional switches.
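How an area penalty erodes a delay gain in the ADP metric can be shown with a short calculation. The Python sketch below assumes that "improvement" means the ratio of the baseline ADP to the candidate ADP, which is not defined explicitly in this chapter, and the numbers in the example are hypothetical rather than results from Figure 3.8.

```python
def adp_improvement(base_area, base_delay, new_area, new_delay):
    """Ratio of baseline to candidate area-delay product (> 1 favours the candidate)."""
    return (base_area * base_delay) / (new_area * new_delay)


if __name__ == "__main__":
    # Hypothetical normalized values: the candidate is 4x faster in both
    # cases, but the second design doubles the area.
    print(adp_improvement(1.0, 1.0, 1.0, 0.25))
    print(adp_improvement(1.0, 1.0, 2.0, 0.25))
```

In this hypothetical case the same delay reduction yields a 4x ADP improvement at equal area but only 2x when the area doubles, which is the qualitative effect described for the bi-directional switch based designs.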

Based on these results, it can be argued that TFET designs create a new spot in the design space where enhanced energy-efficiency can be obtained, by virtue of voltage scaling, without severely compromising on performance. However, more characterizations and rigorous explorations need to be performed to be able to accurately determine the potential of TFET based FPGAs in the ubiquitous and mobile computing ecosystem. These issues are discussed further in the following chapter.

Figure 3.8. Area-Delay Product improvements for TFET FPGAs compared to baseline FinFET FPGA

Chapter 4

Conclusions & Future Work

Based on the explorations presented in this thesis, it can be concluded that TFETs are a promising solution to achieve enhanced energy-efficiency without compromising on performance. Circuit design challenges in designing TFET based systems are identified and novel circuit solutions are proposed and characterized. Further, a FPGA architecture utilizing TFETs is simulated to evaluate the system level impact of TFETs. A significant reduction in critical path delay is observed at reduced operating voltages, along with improvements in the area-delay product. The current simulation framework needs to be extended to estimate energy consumption at the FPGA level and to simulate larger benchmark circuits that are representative of current day mobile systems.

4.1 Future FPGA Explorations

In this work, the different components of a FPGA, like LUTs and routing switches, were redesigned using Tunnel FETs without any major change in the basic architecture of the FPGA. The exploration presented here provides some insight as to how an emerging technology like TFETs can inspire future generation embedded system designs. However, further explorations are required to create FPGA architectures that will benefit further from the attractive characteristics of TFETs and allow them to be utilized in energy-constrained computing systems. Further, the TFET SRAM based configuration memory has a large area footprint and hence co-exploration of TFET based LUTs and routing switches combined with emerging memory technologies like STT-RAM, which have a much smaller area footprint, might lead to enhanced benefits due to the reduction obtained in the length of the programmable interconnect wires.

Bibliography

[1] Alarcon, L. P., T.-T. Liu, M. D. Pierson, and J. M. Rabaey (2007) “Exploring Very Low-Energy Logic: A Case Study,” Journal of Low Power Electronics, 3, pp. 223–233. URL http:///php/pubs/pubs.php/151.html

[2] Markovic, D., C. Wang, L. Alarcon, T.-T. Liu, and J. Rabaey (2010) “Ultralow-Power Design in Near-Threshold Region,” Proceedings of the IEEE, 98(2), pp. 237–252. URL http:///php/pubs/pubs.php/1286.html

[3] Kuon, I. and J. Rose (2010) Quantifying and Exploring the Gap Between FPGAs and ASICs, Springer, New York, NY, USA.

[4] Rabaey, J. M., A. Chandrakasan, and B. Nikolic (2004) Digital integrated circuits: A design perspective, 2nd ed., Prentice Hall.

[5] Horowitz, M., E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein (2005) “Scaling, power, and the future of CMOS,” in Electron Devices Meeting, 2005. IEDM Technical Digest. IEEE International, pp. 7 pp. –15.

[6] Bernstein, K., R. Cavin, W. Porod, A. Seabaugh, and J. Welser (2010) “Device and Architecture Outlook for Beyond CMOS Switches,” Proceedings of the IEEE, 98(12), pp. 2169–2184.

[7] Zhai, B., L. Nazhandali, J. Olson, A. Reeves, M. Minuth, R. Helfand, S. Pant, D. Blaauw, and T. Austin (2006) “A 2.60pJ/Inst Subthreshold Sensor Processor for Optimal Energy Efficiency,” in VLSI Circuits, 2006. Digest of Technical Papers. 2006 Symposium on, pp. 154–155.

[8] Jocke, S., J. Bolus, S. N. Wooters, A. D. Jurik, A. F. Weaver, T. N. Blalock, and B. H. Calhoun (2009) “A 2.6-µW Sub-threshold Mixed-signal ECG SoC,” in Symposium on VLSI Circuits.

[9] Ryan, J. F. and B. H. Calhoun (2010) “A Sub-Threshold FPGA with Low-Swing Dual-VDD Interconnect in 90nm CMOS,” in Custom Integrated Circuits Conference (CICC).

[10] Calhoun, B. H., J. Ryan, S. Khanna, M. Putic, and J. Lach (2010) “Flexible Circuits and Architectures for Ultra Low Power,” Proceedings of the IEEE, 98, pp. 267–282.

[11] Belhadj, H., V. Aggrawal, A. Pradhan, and A. Zerrouki (2009) “Power-Aware FPGA Design,” in Actel White Paper.

[12] SiliconBlue (2009) “Ultra Low-Power iCE FPGAs,” in SiliconBlue White Paper.

[13] Todman, T., G. Constantinides, S. Wilton, O. Mencer, W. Luk, and P. Cheung (2005) “Reconﬁgurable computing: architectures and design methods,” Computers and Digital Techniques, IEE Proceedings -, 152(2), pp. 193 – 207.

[14] Heyns, M., F. Bellenger, G. Brammertz, M. Caymax, M. Cantoro, S. D. Gendt, B. D. Jaeger, A. Delabie, G. Eneman, G. Groeseneken, G. Hellings, M. Houssa, F. Iacopi, D. Leonelli, D. Lin, W. Magnus, K. Martens, C. Merckling, M. Meuris, J. Mitard, J. Penaud, G. Pourtois, M. Scarrozza, E. R. Simoen, B. Soree, S. V. Elshocht, W. Vandenberghe, A. Vandooren, P. Vereecke, A. Verhulst, and W.-E. Wang (2010) “Shaping the future of nanoelectronics beyond the Si roadmap with new materials and devices,” vol. 7640, SPIE, p. 764003. URL http://link.aip.org/link/?PSI/7640/764003/1

[15] Seabaugh, A. and Q. Zhang (2010) “Low-Voltage Tunnel Transistors for Beyond CMOS Logic,” Proceedings of the IEEE, 98(12), pp. 2095 –2110.

[16] Taur, Y. and T. H. Ning (2009) Fundamentals of modern VLSI devices, 2 ed., Cambridge University Press. URL http://www.worldcat.org/isbn/0521832942

[17] Kam, H., T.-J. King-Liu, E. Alon, and M. Horowitz (2008) “Circuit- level requirements for MOSFET-replacement devices,” in Electron Devices Meeting, 2008. IEDM 2008. IEEE International, p. 1.

[18] Vandenberghe, W., A. Verhulst, G. Groeseneken, B. Soree, and W. Magnus (2008) “Analytical model for a tunnel field-effect transistor,” in Electrotechnical Conference, 2008. MELECON 2008. The 14th IEEE Mediterranean, pp. 923–928.

[19] Verhulst, A. S., W. G. Vandenberghe, D. Leonelli, R. Rooyackers, A. Vandooren, S. D. Gendt, M. M. Heyns, and G. Groeseneken (2009) “Tunnel Field-Effect Transistors for Future Low-Power Nano-Electronics,” ECS Transactions, 25(7), pp. 455–462. URL http://link.aip.org/link/abstract/ECSTF8/v25/i7/p455/s1

[20] Saripalli, V., S. Datta, V. Narayanan, and J. P. Kulkarni (2011) “Variation-tolerant ultra low-power heterojunction tunnel FET SRAM design,” Nanoscale Architectures, IEEE International Symposium on, 0, pp. 45–52.

[21] Saripalli, V., A. K. Mishra, S. Datta, and V. Narayanan (2011) “An energy-eﬃcient heterogeneous CMP based on hybrid TFET-CMOS cores,” in DAC, pp. 729–734.

[22] Gandhi, R., Z. Chen, N. Singh, K. Banerjee, and S. Lee (2011) “Vertical Si-Nanowire n-Type Tunneling FETs With Low Subthreshold Swing (<=50mV/decade) at Room Temperature,” Electron Device Letters, IEEE, 32(4), pp. 437–439.

[23] Ford, A. C., C. W. Yeung, S. Chuang, H. S. Kim, E. Plis, S. Krishna, C. Hu, and A. Javey (2011) “Ultrathin body InAs tunneling ﬁeld-eﬀect transistors on Si substrates,” Applied Physics Letters, 98(11), pp. 113105 – 113105–3.

[24] Kim, D., Y. Lee, J. Cai, I. Lauer, L. Chang, S. J. Koester, D. Sylvester, and D. Blaauw (2009) “Low power circuit design based on heterojunction tunneling transistors (HETTs),” in Proceedings of the 14th ACM/IEEE international symposium on Low power electronics and design, ISLPED ’09, ACM, New York, NY, USA, pp. 219–224. URL http://doi.acm.org/10.1145/1594233.1594287

[25] Singh, J., K. Ramakrishnan, S. Mookerjea, S. Datta, N. Vijaykrishnan, and D. Pradhan (2010) “A novel Si-Tunnel FET based SRAM design for ultra low-power 0.3V VDD applications,” in Design Automation Conference (ASP-DAC), 2010 15th Asia and South Pacific, pp. 181–186.

[26] Yang, X. and K. Mohanram (2011) “Robust 6T Si tunneling transistor SRAM design,” in DATE, pp. 740–745.

[27] Betz, V., J. Rose, and A. Marquardt (1999) Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, Norwell, MA, USA.

[28] Kuon, I. and J. Rose (2007) “Measuring the Gap Between FPGAs and ASICs,” IEEE Trans. on CAD of Integrated Circuits and Systems, 26(2), pp. 203–215.

[29] Tuan, T., S. Kao, A. A. Rahman, S. Das, and S. Trimberger (2006) “A 90nm low-power FPGA for battery-powered applications,” in FPGA, pp. 3–11.

[30] Srinivasan, S., A. Gayasen, N. Vijaykrishnan, and T. Tuan (2005) “Leakage control in FPGA routing fabric,” in ASP-DAC, pp. 661–664.

[31] Kuon, I., R. Tessier, and J. Rose (2007) “FPGA Architecture: Survey and Challenges,” Foundations and Trends in Electronic Design Automation, 2(2), pp. 135–253.

[32] Chow, P., S. O. Seo, J. Rose, K. Chung, G. Paez-Monzon, and I. Rahardja (1999) “The design of an SRAM-based field-programmable gate array - Part I: Architecture,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 7(2), pp. 191–197.

[33] Ahmed, E. and J. Rose (2004) “The effect of LUT and cluster size on deep-submicron FPGA performance and density,” Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 12(3), pp. 288–298.

[34] Lewis, D., E. Ahmed, G. Baeckler, V. Betz, M. Bourgeault, D. Cashman, D. Galloway, M. Hutton, C. Lane, A. Lee, P. Lev- entis, S. Marquardt, C. McClintock, K. Padalia, B. Pedersen, G. Powell, B. Ratchev, S. Reddy, J. Schleicher, K. Stevens, R. Yuan, R. Cliff, and J. Rose (2005) “The Stratix II logic and routing architecture,” in Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, FPGA ’05, ACM, New York, NY, USA, pp. 14–20. URL http://doi.acm.org/10.1145/1046192.1046195

[35] Lemieux, G. and D. Lewis (2001) “Using sparse crossbars within LUT,” in Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays, FPGA ’01, ACM, New York, NY, USA, pp. 59–68. URL http://doi.acm.org/10.1145/360276.360299

[36] Lewis, D. M., V. Betz, D. Jefferson, A. Lee, C. Lane, P. Leventis, S. Marquardt, C. McClintock, B. Pedersen, G. Powell, S. Reddy, C. Wysocki, R. Cliff, and J. Rose (2003) “The Stratix routing and logic architecture.” in FPGA’03, pp. 12–20.

[37] Lemieux, G., E. Lee, M. Tom, and A. Yu (2004) “Directional and single-driver wires in FPGA interconnect,” in Field-Programmable Technology, 2004. Proceedings. 2004 IEEE International Conference on, pp. 41–48.

[38] Kuon, I. and J. Rose (2011) “Exploring Area and Delay Tradeoﬀs in FPGAs With Architecture and Automated Transistor Design,” IEEE Trans. VLSI Syst., 19(1), pp. 71–84.

[39] Chen, C., R. Parsa, N. Patil, S. Chong, K. Akarvardar, J. Provine, D. Lewis, J. Watt, R. T. Howe, H.-S. P. Wong, and S. Mitra (2010) “Eﬃcient FPGAs using nanoelectromechanical relays,” in Proceedings of the 18th annual ACM/SIGDA international symposium on Field programmable gate arrays, FPGA ’10, ACM, New York, NY, USA, pp. 273–282. URL http://doi.acm.org/10.1145/1723112.1723158

[40] Zhao, W. and Y. Cao (2006) “New generation of predictive technology model for sub-45nm design exploration,” in Quality Electronic Design, 2006. ISQED ’06. 7th International Symposium on, pp. 6 pp. –590. URL http://ptm.asu.edu/

[41] Yang, S. (1991), “Logic Synthesis and Optimization Benchmarks User Guide Version 3.0,” .