REVERSIBLE COMPUTING: THE DESIGN OF AN ADIABATIC

MICROPROCESSOR

A Thesis

Submitted to the Graduate School

of the University of Notre Dame

in Partial Fulfillment of the Requirements

for the Degree of

Master of Science in Electrical Engineering

by

Rene Celis Cordova

Gregory L. Snider, Co-Director

Alexei O. Orlov, Co-Director

Graduate Program in Electrical Engineering

Notre Dame, Indiana

November 2019

© Copyright 2019

Rene Celis Cordova

REVERSIBLE COMPUTING: THE DESIGN OF AN ADIABATIC

MICROPROCESSOR

Abstract

by

Rene Celis Cordova

Despite exponential progress over the past fifty years, the increase in performance has recently come to a rather abrupt end. Modern microprocessors are limited by heat dissipation. Speeds have been capped around 4 GHz since 2004 to limit heat generation. Such speeds are well below the RC time constant limit of the circuits, sacrificing speed to prevent the chips from melting. Reversible computing is a viable alternative to traditional circuit implementations since it reduces heat generation by avoiding unnecessary dissipation. Traditional CMOS circuits dissipate power in the form of heat on every switching event. Adiabatic reversible computing uses reversible and quasi-adiabatic transitions to reduce heat generation by introducing a trade-off between speed and power. By using reversible logic and switching the circuits slowly, relative to their RC time constants, power can be dramatically reduced. In this thesis, I will present the design of a 16-bit adiabatic microprocessor that operates at 0.5 GHz while dissipating an order of magnitude less energy than its CMOS counterpart.

The adiabatic microprocessor is a multi-cycle RISC processor with a 16-bit datapath, and it follows the MIPS architecture. Only a subset of instructions is implemented, but they are sufficient for universal computation. Adiabatic reversible logic is achieved using split-level charge recovery logic (SCRL) with Bennett clocking, which consists of clocks that only ramp up once the previous stage has a valid steady state. An adiabatic SCRL implementation allows a circuit to operate both in adiabatic and in standard CMOS mode. The adiabatic microprocessor design includes on-chip temperature sensors that directly measure heat dissipation and provide a direct comparison of the power between the two modes of operation.

The 16-bit adiabatic microprocessor is successfully implemented in 90 nm technology with an operating frequency of 0.5 GHz, which demonstrates the design of a real-life circuit using adiabatic reversible logic and shows a promising future for energy-efficient computing.

This is for my family, who encouraged me to pursue my dreams.

CONTENTS

Figures
Tables
Acknowledgments

Chapter 1: Introduction
1.1 Reversible Computing
1.1.1 Heat Production in Modern Computing
1.1.2 Landauer’s Principle
1.1.3 Quasi-adiabatic Clocking
1.2 Adiabatic Reversible Computing
1.2.1 Split-level Charge Recovery Logic (SCRL)
1.2.2 Bennett Clocking

Chapter 2: Adiabatic Reversible Computing Implementation
2.1 Reversible Energy Production Compared to CMOS
2.1.1 Shift Register Using SCRL
2.2 Choice of Technology Node
2.3 Implementing Adiabatic Digital Circuits
2.3.1 Adiabatic Standard Cell Library
2.4 SCRL Quasi-adiabatic Analysis

Chapter 3: Reversible 16-bit Microprocessor Design
3.1 Architecture
3.1.1 Multi-cycle Operation
3.1.2 MIPS and Bennett Clocking
3.1.3 Optimizations for Adiabatic Computing
3.2 Novel Memory Elements for Adiabatic Computing
3.2.1 Adiabatic Flip-Flops
3.2.2 Adiabatic SRAM Cells
3.3 Microprocessor Physical Implementation
3.3.1 ALU and Critical Path
3.3.2
3.3.3 Instruction and Data Fetch
3.3.4
3.4 Microprocessor Verification

Chapter 4: Reversible CMOS Heat Dissipation Measurements
4.1 On-chip Thermocouples
4.1.1 Metallic Thermocouples
4.1.2 Silicon Thermocouples
4.2 On-chip Temperature Sensors

Chapter 5: Conclusion and Future Work
5.1 Future Work
5.1.1 Fabrication and Testing
5.2 Microprocessor Scalability

Bibliography

FIGURES

Figure 1.1: Number of transistors per chip and their clock speeds compared to the year of introduction. Reproduced from [1].

Figure 1.2: Block diagram of reversible computing system. Reproduced from [8].

Figure 1.3: (a) Adiabatic inverter implemented with SCRL. (b) Timing diagram.

Figure 1.4: Timing diagram of three-level Bennett clocking.

Figure 2.1: (a) CMOS shift register. (b) Adiabatic shift register. (c) Energy lost per cycle SPICE simulation comparing both modes of operation using the 90nm technology.

Figure 2.2: Comparison of energy dissipated by an adiabatic shift register implemented in 90nm and 28nm.

Figure 2.3: (a) Standard CMOS digital logic circuit. (b) Conversion to adiabatic logic of the circuit.

Figure 2.4: (a) Adiabatic two input NAND circuit schematic. (b) NAND physical implementation using 90nm technology.

Figure 2.5: (a) Ordinary SCRL logic NAND gate. (b) SPICE simulation shows voltage drop across nodes Output and Vx which leads to undesired dissipation.

Figure 2.6: (a) Fixed SCRL logic NAND gate. (b) SPICE simulation shows node Vx follows the Output, therefore preventing undesired dissipation.

Figure 2.7: Energy losses of ordinary SCRL NAND gate, and fixed-SCRL gate.

Figure 3.1: Multi-cycle MIPS microprocessor main components. Adapted from [36].

Figure 3.2: MIPS multi-cycle operation state machine, and corresponding states.

Figure 3.3: Adiabatic microprocessor main components showing three separate Bennett blocks to implement adiabatic logic with Bennett clocking.

Figure 3.4: Kogge-Stone adder using propagate-generate logic. Adapted from [36].

Figure 3.5: Adiabatic 16-bit Kogge-Stone adder layout implemented in 90nm technology.

Figure 3.6: Adiabatic master-slave flip-flop. a) Circuit schematic. b) Timing diagram. c) Layout implemented in 90nm technology.

Figure 3.7: Adiabatic SRAM. a) Circuit schematic. b) Timing diagram. c) Layout implemented in 90nm technology.

Figure 3.8: Physical layout of the microprocessor’s critical path using 90nm technology.

Figure 3.9: Physical layout of the register implemented in 90nm technology.

Figure 3.10: Physical layout of the instruction/data fetch using 90nm technology.

Figure 3.11: Physical layout of the control unit implemented in 90nm technology.

Figure 3.12: Physical layout of the 16-bit adiabatic microprocessor in 90nm technology.

Figure 3.13: Control unit simulation verifies the expected behavior of the microprocessor.

Figure 4.1: Gold-nickel thermocouple operated at 300K, showing open circuit voltage at reference junction generated by a small temperature change (in the range of millikelvin) in measurement junction.

Figure 4.2: Cross section diagram of on-chip thermocouples fabricated on top of CMOS transistors.

Figure 4.3: (a) Gold-nickel thermocouples on top of CMOS circuits. (b) Current supplied to the circuit. (c) Temperature difference measured by metallic thermocouple.

Figure 4.4: Silicon thermocouple implemented in 90nm technology within densely populated region.

Figure 4.5: Adiabatic microprocessor with on-chip temperature sensors.

TABLES

Table 3.1 Control unit encoding for adiabatic microprocessor

Table 3.2 Control signals for multi-cycle MIPS

Table 4.1 Energy losses per cycle of digital gates in microprocessor

ACKNOWLEDGMENTS

I would like to acknowledge my advisors Dr. Snider and Dr. Orlov who have taught me the value of scientific research and the impact it has on people. I thank you for your guidance and for always teaching me something new.

I would also like to thank the U.S. Air Force Research Laboratory for their generous grant which allowed me to pursue my work: STTR AF18B-T013 clearance number AFMC-2019-0527. Approved for public release: distribution unlimited.

Finally, I’d like to acknowledge our collaborators Tian Lu and Jason M. Kulick from Indiana Integrated Circuits, LLC, who worked with us on this project.

CHAPTER 1:

INTRODUCTION

1.1 Reversible Computing

Modern microprocessors are limited by heat dissipation. Speeds have been capped around 4 GHz since 2004 to limit heat generation [1]. Such speeds are well below the RC time constant limit of the circuits, sacrificing speed to prevent the chips from melting.

Reversible computing is a viable alternative to traditional circuit implementations since it reduces heat generation.

1.1.1 Heat Production in Modern Computing

Moore’s law states that the number of transistors on a microprocessor doubles every two to three years [2]. Up to the year 2002, the clock speeds of microprocessors increased alongside the number of transistors, as shown in Figure 1.1, reproduced from [1].

For almost three decades transistors were run as fast as possible, restricted only by their internal RC time constant. However, transistors in modern microprocessors are not run as fast as their RC time constant allows, but at a much slower speed imposed by heat dissipation. Furthermore, heat production forces a considerable number of transistors inside a microprocessor to remain inactive. This strategy is known as dark silicon, and estimates [3] predict that by 2020 the amount of dark silicon on a chip will be around 90%. This means that only 10% of a chip’s transistors will be active at a time, placing an asterisk on the validity of Moore’s law. Even though the dark silicon strategy, alongside multi-core processing, has proven profitable, many efforts have emerged to pursue novel forms of energy-efficient computing. Some of these efforts focus on reducing the passive power of a chip, which is dissipated simply by turning it on; examples include steep devices [4,5] and nano-electro-mechanical relays [6,7]. Reversible computing, on the other hand, reduces the active power of a chip, which is caused by the energy thrown away in every logic transition.

Figure 1.1: Number of transistors per chip and their clock speeds compared to the year of introduction. Reproduced from [1].

1.1.2 Landauer’s Principle

At the heart of reversible computing lies the somewhat controversial Landauer’s Principle. Landauer’s Principle (LP) states that an energy of at least kBT ln2 is necessarily dissipated as heat only when information is destroyed; if information is conserved, there is no fundamental lower limit to the heat dissipated [8]. Reversible computing preserves information and therefore, following LP, has no lower limit to the heat it must dissipate. The “Ultimate Shannon Limit” (USL) of kBT ln2, where kB is the Boltzmann constant and T is the temperature, was the center of arguments against the LP in the past.

Some theoretical papers [9, 10] argued that the USL was a fundamental constraint of charge-based computing and that reversible computing was flawed. However, experimental tests have shown that there is no fundamental lower limit to energy dissipation in moving charge [11, 12]. Since experimental evidence shows that LP is in fact correct, reversible computing could represent considerable energy savings.
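To give a sense of scale for the kBT ln2 bound discussed above, the following minimal Python sketch evaluates it at a temperature of 300 K. The temperature value and the function name `landauer_limit` are assumptions chosen for illustration and do not come from the thesis.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_limit(temperature_kelvin: float) -> float:
    """Return k_B * T * ln(2) in joules for the given temperature."""
    return K_B * temperature_kelvin * math.log(2)


if __name__ == "__main__":
    # 300 K is an assumed room-temperature value used only for illustration.
    energy_joules = landauer_limit(300.0)
    print(f"k_B T ln2 at 300 K: {energy_joules:.3e} J")
```

At 300 K the bound is roughly 2.9e-21 J per bit, which illustrates how small this energy scale is.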

Reversible computing requires both physical and logical reversibility. Logical reversibility implies that the logical operations performed during computation can be reversed, as is the case for an inverter gate. Other logical operations are not inherently reversible, but they can be implemented using reversible gates. A universal reversible gate [13], for example, can implement all Boolean operations reversibly, allowing for universal computing.
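As a concrete illustration of logical reversibility, the sketch below uses the Toffoli (controlled-controlled-NOT) gate, a well-known universal reversible gate. Choosing the Toffoli gate here is an assumption made for illustration, and the function names are hypothetical. The code checks that the gate is its own inverse and that it can compute NAND, a non-reversible operation, without discarding its inputs.

```python
from itertools import product


def toffoli(a: int, b: int, c: int) -> tuple:
    """Controlled-controlled-NOT: flip c only when both a and b are 1."""
    return a, b, c ^ (a & b)


def reversible_nand(a: int, b: int) -> int:
    """NAND computed reversibly: with the target preset to 1, the third
    output equals NOT (a AND b), and the inputs a and b are kept."""
    return toffoli(a, b, 1)[2]


if __name__ == "__main__":
    # Applying the gate twice restores every input, so no information is lost.
    for bits in product((0, 1), repeat=3):
        assert toffoli(*toffoli(*bits)) == bits
    for a, b in product((0, 1), repeat=2):
        print(a, b, "->", reversible_nand(a, b))
```

Because the inputs are carried through to the outputs, the operation can always be undone, at the cost of extra signal lines.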

Physical reversibility, on the other hand, requires the devices to avoid energy losses simply due to their operation. Fully-fledged reversible physical systems are hard to achieve. For example, consider the simple model for reversible computing presented in Figure 1.2, reproduced from [8]. If a computing process is to be reversible, then the computing blocks can borrow bits of information from a “reservoir” of logical bits. Erasure of information can be avoided by keeping track of the information borrowed, either 1’s or 0’s. Once the reversible computing process is complete, the information is given back to the appropriate reservoir. The compute and de-compute phases, where information is borrowed and given back, are set by a controller. This compute/de-compute controller introduces an overhead to the system in order to implement logical reversibility. However, in order to implement physical reversibility, the process would have to use a truly reversible adiabatic process, which is impossible to implement in practice. Therefore, practical implementations approximate this physically reversible process by using quasi-adiabatic transitions.

Figure 1.2: Block diagram of reversible computing system. Reproduced from [8].

1.1.3 Quasi-adiabatic Clocking

Complementary MOSFET logic (CMOS) is the most widely implemented logic technology and led to the transistor becoming the most manufactured object in human history [14]. CMOS devices switch with sharp transitions, which yields the following expression for the bit energy, the energy required to compute one bit of information:

$$E_{bit} = \frac{1}{2} C V_{dd}^{2} \tag{1.1}$$

Here, C is the load capacitance of the logic gate, and Vdd is the power supply voltage. The bit energy is discarded and dissipated as heat after each switching event, which imposes the current speed limit of the technology. Reducing the device size translates to a lower capacitance, but we are reaching the physical limits of miniaturization [15, 16]. In addition, the undesired leakage current of transistors imposes a limit on the threshold, or activation, voltage of the devices. This subthreshold leakage current is minimized by increasing the threshold voltage, which prevents further scaling of the power supply voltage Vdd.

In contrast, reversible computing uses quasi-adiabatic switching to implement physical reversibility, introducing a trade-off between speed and power [17]. By slowly ramping up the power supply, the energy dissipation of the circuit changes to:

$$E_{reversible} = C V_{t}^{2} \, \frac{RC}{T} \tag{1.2}$$

Here, C is the load capacitance of the logic gate, Vt is the voltage of the ramping power supply, RC is the time constant of the gate, and T is the ramping time of the power supply. The extra factor, which contains the RC time constant of the gate and the ramping time of the power supply, allows for further reduction in the dissipation. When T is much greater than the RC time constant of the gate, the energy dissipation can be reduced considerably.
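The trade-off between ramping time and dissipation can be sketched numerically by evaluating equations (1.1) and (1.2) side by side. The load capacitance, RC time constant, and ramping times below are illustrative assumptions, not values from the thesis, and the function names are hypothetical. Equation (1.2) is only meaningful when T is well above RC.

```python
def cmos_bit_energy(c_load: float, vdd: float) -> float:
    """Equation (1.1): energy dissipated per abrupt switching event (J)."""
    return 0.5 * c_load * vdd ** 2


def adiabatic_energy(c_load: float, v_t: float, rc: float, ramp_time: float) -> float:
    """Equation (1.2): energy dissipated when the supply ramps over ramp_time (J)."""
    return c_load * v_t ** 2 * (rc / ramp_time)


if __name__ == "__main__":
    C_LOAD = 1e-15  # assumed 1 fF load capacitance
    V = 1.0         # 1 V supply
    RC = 1e-12      # assumed 1 ps gate time constant
    for ramp_time in (1e-11, 1e-10, 1e-9):
        ratio = adiabatic_energy(C_LOAD, V, RC, ramp_time) / cmos_bit_energy(C_LOAD, V)
        print(f"T = {ramp_time:.0e} s: E_reversible / E_bit = {ratio:.4f}")
```

With equal voltages the ratio reduces to 2RC/T, so every tenfold increase in ramping time lowers the dissipated energy by a factor of ten.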

Reversible computing is not a novel concept [18], but in the past it was dismissed as slow and thus impractical, since a trade-off between speed and power did not make sense. However, as discussed above, modern computing already makes a trade-off between speed and power. Furthermore, the RC time constant of modern CMOS gates is orders of magnitude shorter than the clock period at which such gates are operated [19]. Therefore, reversible computing has regained attention as a path to reduce energy dissipation.

It must be noted that the quasi-adiabatic switching introduced above makes the physical process of computation asymptotically adiabatic, with no lower limit to heat dissipation [20]. The ramping time of the power supply, T, could in principle be made infinitely long, effectively dropping the dissipation to zero. Of course, this is not practical, since it makes no sense to wait an infinite amount of time for a computation to be performed. Furthermore, when using transistors, the physical process of computation generally becomes quasi-adiabatic, introducing unavoidable dissipation. Therefore, a trade-off between speed and energy must be taken into consideration when implementing quasi-adiabatic reversible circuits. Even though all classical quasi-adiabatic computing is simply referred to as “adiabatic computing” or “adiabatic logic”, the differences in the level of adiabaticity are relevant and will be discussed in Chapter 2.

1.2 Adiabatic Reversible Computing

Adiabatic reversible computing can reduce the power dissipation of a circuit by using clocks that slowly ramp up and down as voltage supplies. When implementing complex adiabatic logic, with multiple gates, the following power dissipation is obtained:

$$P_{Total} = P_{passive} + N \alpha V_{dd}^{2} C f \, \frac{2RC}{T} \tag{1.3}$$

Here, the first term is the passive power wasted simply by applying a voltage to the circuit, N is the number of gates, α is the activity factor, Vdd is the voltage of the ramping power supply, C is the load capacitance, f is the operating frequency, RC is the time constant of the gate, and T is the rising/falling time of the power supply. When T is much greater than the RC time constant, the active power dissipation can be much lower than that of its CMOS counterpart. The passive power is independent of the operating frequency and is unaffected by adiabatic logic. Yet the efforts to reduce passive power, such as low temperature operation [21] and gate-all-around devices [22], can also be applied to an adiabatic reversible system. However, circuits dominated by active power, such as microprocessors, benefit the most from adiabatic logic, as it can dramatically reduce energy dissipation.
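A minimal sketch of equation (1.3) is given below to show how the passive and active terms combine. All parameter values (gate count, activity factor, capacitance, passive power, ramping times) are illustrative assumptions rather than figures from the thesis, and `total_power` is a hypothetical helper name.

```python
def total_power(p_passive: float, n_gates: int, activity: float, vdd: float,
                c_load: float, freq: float, rc: float, ramp_time: float) -> float:
    """Equation (1.3): passive power plus adiabatic active power, in watts."""
    active = n_gates * activity * vdd ** 2 * c_load * freq * (2 * rc / ramp_time)
    return p_passive + active


if __name__ == "__main__":
    # Every value below is an assumption chosen only to exercise the formula.
    params = dict(p_passive=1e-6, n_gates=10_000, activity=0.1, vdd=1.0,
                  c_load=1e-15, freq=0.5e9, rc=1e-12)
    for ramp_time in (2e-12, 2e-11, 2e-10):
        power = total_power(ramp_time=ramp_time, **params)
        print(f"T = {ramp_time:.0e} s: P_total = {power:.3e} W")
```

Setting T equal to 2RC makes the last factor equal to one, which leaves the active term as N α Vdd² C f; longer ramping times scale that term down until the passive power dominates.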

1.2.1 Split-level Charge Recovery Logic (SCRL)

Adiabatic computing uses reversibility to recover energy supplied to a circuit, and can be implemented as split-level charge recovery logic (SCRL) [23]. An inverter using adiabatic SCRL logic is shown in Figure 1.3(a), where the power supply and ground have been replaced by ramping clocks. As seen in Figure 1.3(b), both ramping clocks start at a null state (0 V), then ramp up for time T until reaching a valid logic state, retain the logic value for some period, and ramp back down to the null state, recovering energy. It should be noted that since the logic values are not valid when the clocks are ramping up and down or during the null state, the following stages require clocks with different phases.

Figure 1.3: (a) Adiabatic inverter implemented with SCRL. (b) Timing diagram.

SCRL implements reversible computing in a similar fashion to the example presented in Figure 1.2. Here, the positive and negative clocks act as the information reservoirs, providing either a logical ‘1’ or a logical ‘0’. When the computation is done, the information borrowed is given back to the appropriate clock. This reversible operation, combined with the quasi-adiabatic switching, makes SCRL an asymptotically adiabatic implementation, with no lower limit to dissipation. Other adiabatic approaches, such as “Positive Feedback Adiabatic Logic” (PFAL) [24], are by nature only quasi-adiabatic, since stranded charge in transistors leads to unavoidable heat dissipation. SCRL also allows the circuit to be operated both in adiabatic mode and in CMOS mode by simply changing the voltage power supplies. This presents a direct comparison between CMOS and adiabatic reversible logic, making SCRL the choice of technology for this thesis.

1.2.2 Bennett Clocking

A chain of three SCRL inverters is shown in Figure 1.4, which illustrates the use of phased clocks to power different stages of the circuit. This clocking scheme is known as Bennett clocking [25], which consists of clocks that only ramp up once the previous stage has a valid state. Similarly, during the energy recovery process the last level of logic ramps down first, and once it reaches the null state the previous stage follows.

Bennett clocking ensures that the logical input values of the circuit are valid during energy recovery and presents a straightforward implementation of adiabatic reversible systems.
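The ordering that Bennett clocking imposes can be sketched as a simple event list: stages ramp up from first to last and ramp down from last to first. The following Python sketch is a hypothetical illustration with assumed function names; it models only the order of events, not the clock waveforms or their timing.

```python
def bennett_schedule(n_stages: int) -> list:
    """Return clock events in Bennett order: ramp up stage 1..N, then
    ramp down stage N..1."""
    ramp_up = [("ramp_up", stage) for stage in range(1, n_stages + 1)]
    ramp_down = [("ramp_down", stage) for stage in range(n_stages, 0, -1)]
    return ramp_up + ramp_down


def valid_stages_after(events):
    """Yield each event with the set of stages holding a valid value after it."""
    valid = set()
    for action, stage in events:
        if action == "ramp_up":
            valid.add(stage)
        else:
            valid.discard(stage)
        yield (action, stage), frozenset(valid)


if __name__ == "__main__":
    for event, valid in valid_stages_after(bennett_schedule(3)):
        print(event, sorted(valid))
```

In this ordering, whenever a stage ramps up or down, the stage that drives it is still valid, which is the property that keeps the inputs valid during energy recovery.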

Figure 1.4: Timing diagram of three-level Bennett clocking.

Bennett clocking is not limited to circuits using transistors but can be implemented in beyond-CMOS devices such as molecular quantum-dot cellular automata (QCA) [25], which perform reversible computing using molecular cells. However, CMOS circuits using Bennett clocking and SCRL logic could be a more immediate implementation of reversible computing.

CHAPTER 2:

ADIABATIC REVERSIBLE COMPUTING IMPLEMENTATION

2.1 Reversible Energy Production Compared to CMOS

Reversible logic is at the core of novel computing approaches such as quantum computing [26], where quantum systems require reversibility due to the nature of their operation. Quasi-adiabatic implementations even exist for quantum annealing systems [27], where slow transitions help to find the ground state of a quantum system.

However, practical implementations of quantum computing and other beyond-CMOS approaches are a few years away. On the other hand, adiabatic reversible classical computing implemented with CMOS transistors using split-level charge recovery logic and Bennett clocking could be implemented right away as a replacement for standard CMOS logic. This chapter will show the methods required to do so. Here, quasi-adiabatic reversible classical computing will be simply referred to as adiabatic computing.

2.1.1 Shift Register Using SCRL

An adiabatic SCRL implementation allows a circuit to operate both in adiabatic and in standard CMOS mode, providing a direct comparison of the power between the two modes. To evaluate the benefits of adiabatic computing, a three-bit shift register is analyzed. The shift register shown in Figure 2.1(a) uses transmission gates and inverters to control and shift three bits of information. The transmission gates control the feedback to avoid the uncontrolled erasure of information, and the inverters can hold one bit. The energy dissipation of the middle inverter in the shift register is analyzed as it operates. The same circuit is operated in standard CMOS mode and in adiabatic reversible mode. As seen in Figure 2.1(b), adiabatic operation with Bennett clocking requires ramping clocks only for the inverters, since the transmission gates simply let information through. SPICE simulations are performed using the predictive technology model for 90nm developed by Arizona State University (ASU) [28] in order to avoid the use of sensitive foundry-specific SPICE models. The simulations use equally sized transistors, with a ratio of W/L = 6 and a 1 V power supply.

The SPICE simulation of the shift register energy over one cycle of operation at different frequencies is presented in Figure 2.1(c). The adiabatic energy losses increase with frequency, as with other adiabatic implementations [29], whereas the CMOS energy remains relatively constant. At high enough frequencies, the adiabatic energy dissipation would be the same as in standard CMOS mode. However, the range of frequencies presented shows that up to 5 GHz the adiabatic operation of the circuit still dissipates orders of magnitude less energy than CMOS operation.

Figure 2.1: (a) CMOS shift register. (b) Adiabatic shift register. (c) Energy lost per cycle SPICE simulation comparing both modes of operation using the 90nm technology.

2.2 Choice of Technology Node

The choice of technology node for an adiabatic circuit involves a number of trade-offs. The 90nm node technology has been used for low power CMOS applications [30] and as shown by the shift register simulations it is a viable technology for an adiabatic implementation. Other advanced node technologies could be used, but an energy analysis must be made in order to evaluate their benefits. To demonstrate this, a shift register was simulated using both the 90nm technology and the 28nm technology using the predictive technology models from ASU. Since both technologies can operate with a 1V power source the same architecture and power supply was used, and the transistors were scaled appropriately keeping the ratio of W/L=6. The results are presented in Figure 2.2.

Figure 2.2: Comparison of energy dissipated by an adiabatic shift register implemented in 90nm and 28nm.

As seen in Figure 2.2, the energy per cycle of the adiabatic shift register changes with frequency depending on the dominating power dissipation mechanism, either passive power dissipation or active power dissipation. At low frequencies passive power dominates, until the energy reaches a minimum point, and then at higher frequencies active power takes over. The 90nm technology presents a lower energy dissipation at low frequencies due to the high passive power of the 28nm node, which is mostly related to leakage. However, at higher operating frequencies, where active power dominates, the energy plots cross, and above 300 MHz the 28nm technology presents a lower dissipation. This is the expected behavior, as more advanced nodes like 28nm have a lower bit energy, since they use smaller devices, which translates into lower active power dissipation.
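The shape of these curves, with a minimum between the passive-dominated and active-dominated regions, can be reproduced with a simple model based on equation (1.3). The sketch below divides the total power by the frequency to obtain energy per cycle and assumes that the ramping time is a fixed fraction of the clock period, so that T = ramp_fraction / f. That assumption, the function names, and all parameter values are hypothetical and are not fitted to Figure 2.2.

```python
import math


def energy_per_cycle(freq: float, p_passive: float, e_switch: float,
                     rc: float, ramp_fraction: float) -> float:
    """Energy per cycle (J) = P_total / f, with T = ramp_fraction / f.
    e_switch stands for the product N * alpha * Vdd^2 * C."""
    ramp_time = ramp_fraction / freq
    return p_passive / freq + e_switch * (2 * rc / ramp_time)


def min_energy_frequency(p_passive: float, e_switch: float,
                         rc: float, ramp_fraction: float) -> float:
    """Frequency where the passive term p/f equals the active term,
    which is where the energy per cycle is minimal."""
    return math.sqrt(p_passive * ramp_fraction / (2 * rc * e_switch))


if __name__ == "__main__":
    # Assumed values, chosen only to make the two regimes visible.
    args = dict(p_passive=1e-9, e_switch=1e-15, rc=1e-12, ramp_fraction=0.25)
    f_min = min_energy_frequency(**args)
    for freq in (f_min / 10, f_min, f_min * 10):
        print(f"f = {freq:.3e} Hz: E = {energy_per_cycle(freq, **args):.3e} J")
```

Under this model the passive contribution falls as 1/f while the active contribution grows linearly with f, so a node with higher leakage loses at low frequency and a node with lower bit energy wins at high frequency.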

A circuit operating frequency of 0.5 GHz represents a useful speed as well as a low enough frequency to benefit greatly from adiabatic logic. At this frequency, the adiabatic energy dissipation shown by the shift register is still orders of magnitude lower than in CMOS operation. The 90nm technology is an appropriate choice, since the 90nm and 28nm technologies have very similar energy dissipation at this frequency. Therefore, the 90nm technology is a viable option to implement an adiabatic computing circuit such as a microprocessor.

2.3 Implementing Adiabatic Digital Circuits

In order to implement adiabatic computing circuits, the standard CMOS digital gates are adapted to SCRL and Bennett clocking. Any digital design can be transformed into adiabatic logic by using Bennett clocks instead of DC power supplies. For example, a simple digital block is presented in Figure 2.3(a), which uses CMOS technology and only requires DC power supplies. Figure 2.3(b) shows that in order to use adiabatic logic the DC power supplies are changed to Bennett clocks. Since Bennett clocks have specific phases to recover energy efficiently, multiple clocks are needed to avoid heat dissipation. The logical depth of a circuit indicates the number of Bennett clocks required. This is not a limitation of the technology, but it is desirable to use the lowest possible number of clocks to reduce complexity in terms of implementation, timing, and testing.
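The relation between logical depth and the number of Bennett clock levels can be sketched by assigning each gate a level one higher than its deepest input. The netlist format, the gate names, and the function name below are hypothetical and serve only to illustrate the idea; the netlist is assumed to be acyclic.

```python
from functools import lru_cache


def bennett_levels(netlist: dict) -> dict:
    """Map each gate to its Bennett level.
    netlist maps a gate name to the names of its inputs; names that are
    not keys of netlist are treated as primary inputs at level 0."""

    @lru_cache(maxsize=None)
    def level(node: str) -> int:
        if node not in netlist:
            return 0
        return 1 + max((level(src) for src in netlist[node]), default=0)

    return {gate: level(gate) for gate in netlist}


if __name__ == "__main__":
    # Hypothetical two-level circuit: two gates feeding a third one.
    example = {"n1": ("a", "b"), "n2": ("b", "c"), "out": ("n1", "n2")}
    levels = bennett_levels(example)
    print(levels)
    print("Bennett clock levels required:", max(levels.values()))
```

Gates that share a level can share the same pair of power clocks, so the maximum level gives the number of distinct clock phases the circuit needs.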

Figure 2.3: (a) Standard CMOS digital logic circuit. (b) Conversion to adiabatic logic of the circuit.

The adiabatic computing implementation using SCRL and Bennett clocking is straightforward but not trivial for complex circuits such as a microprocessor. To address this an adiabatic standard cell library of digital gates is developed using the 90nm technology.

2.3.1 Adiabatic Standard Cell Library

The design of a library containing adiabatic digital gates is presented. The digital gates are used in a bottom-up approach to create more complex circuits, eventually implementing an adiabatic microprocessor. Careful design of the standard cells allows them to be easily interconnected and generate all the building blocks of the processor. For example, the design of an adiabatic SCRL NAND gate is shown in Figure 2.4(a). It includes inputs and outputs, substrate connections for both n-channel and p-channel transistors, and the positive and negative power clocks which follow Bennett clocking. It should be noted that the n and p wells of the devices are connected to DC voltage supplies, while the source and drain terminals of the transistors are connected to the power clocks to implement adiabatic logic.

The generic 90nm process design kit by Cadence, gpdk090, is used to implement the physical layout of the microprocessor. As seen in Figure 2.4(b), the physical implementation of the NAND gate includes inputs at the top and outputs at the bottom, which creates a downward flow of information using Bennett clocking. Previous logical stages are placed above the gate and following stages below. Other gates that share the same Bennett level can simply be placed to the side. The floor-planning of complex circuits is straightforward using standard cells, the only limitation being the power rails, each of which carries a specific Bennett clock.

Figure 2.4: (a) Adiabatic two input NAND circuit schematic. (b) NAND physical implementation using 90nm technology.

The physical design is verified using a layout versus schematic (LVS) check, and a design rule check (DRC) included in the generic 90nm kit [31]. The LVS requires a schematic with matching architecture, which was also developed for the standard cell library. The design rules used in the DRC verification do not correspond to any foundry and are conservative in terms of the minimum distance between features. This allows the design to be adjusted for a specific foundry fabrication technology if needed.

Furthermore, the devices are implemented leaving some flexibility to more restrictive design rules, while still passing the DRC checks from the generic model. This extra flexibility might not yield the optimal minimum area for the design but allows a circuit to be easily adjusted for future fabrication.

2.4 SCRL Quasi-adiabatic Analysis

The use of multiple-input SCRL gates, such as the NAND gate described above, introduces non-idealities that produce undesired dissipation. For example, the SCRL NAND gate was analyzed by M. P. Frank in [32], showing that part of the circuit may be conducting even if the whole gate is not. Figure 2.5 illustrates the problem, where two different inputs create a voltage drop at node Vx, causing charge to flow and hence dissipation. It must be pointed out that the problem only arises under certain conditions, and not with every possible input. However, this non-ideality, or bug, in the ordinary SCRL logic described by Younis and Knight [23] limits the “adiabaticity” of the technology, making the implementation quasi-adiabatic instead of asymptotically adiabatic. The dissipation of complex SCRL gates will therefore have a lower bound caused by these non-idealities, though single-input gates such as an inverter will not suffer from this.

Figure 2.5: (a) Ordinary SCRL logic NAND gate. (b) SPICE simulation shows voltage drop across nodes Output and Vx which leads to undesired dissipation.

A fix for this non-ideality in SCRL logic was proposed by M. P. Frank in order to create asymptotically adiabatic computing [32]. The proposed correction to SCRL logic is presented in Figure 2.6, where an extra transistor is introduced so that the problematic node Vx no longer has a voltage drop, therefore preventing undesired dissipation. It is important to note that this fix is relevant when trying to reduce the energy losses as much as possible and when trying to reduce dissipation by scaling down the frequency of operation.

Figure 2.6: (a) Fixed SCRL logic NAND gate. (b) SPICE simulation shows node Vx follows the Output, therefore preventing undesired dissipation.

Quasi-adiabatic implementations of reversible computing have undesired energy losses, which are normally accepted as a trade-off against complexity and speed. On the other hand, asymptotically adiabatic implementations try to avoid all undesired losses since they have no lower limit to dissipation. For example, superconducting reversible computing [33] pursues asymptotically adiabatic computing by implementing “ballistic” devices to minimize losses due to friction when processing information.

CMOS transistors, however, are limited by passive power dissipation, which imposes a lower bound on dissipation when operating at room temperature. Even low-temperature SCRL CMOS gates would be limited by non-idealities unless the fix proposed by M. P. Frank [32] is implemented. However, quasi-adiabatic implementations of reversible computing could outperform asymptotically adiabatic implementations when operating at relatively high frequencies.

The two-input NAND gate is analyzed in order to compare ordinary SCRL logic and “fixed” SCRL logic. SPICE simulations using the 90nm technology were performed for both architectures of the NAND gate at different frequencies. The energy losses per cycle for both implementations are presented in Figure 2.7.


Figure 2.7: Energy losses of ordinary SCRL NAND gate, and fixed-SCRL gate.

As seen in Figure 2.7, low operating frequencies benefit from the fixed-SCRL logic and have lower dissipation. However, at higher frequencies the dissipation of the ordinary SCRL logic is lower. The extra transistor introduced in the fixed-SCRL logic increases the bit energy of the gate, producing a higher dissipation at relatively high frequencies. Furthermore, the non-idealities, or bug, only affect certain ordinary SCRL gates, and only for certain inputs. Therefore, the choice of SCRL architecture is not trivial and must be made considering the circuit requirements.

The operating frequency is extremely relevant for a real-life circuit implementation of SCRL logic, such as an adiabatic microprocessor. If adiabatic reversible computing is to replace standard CMOS logic at a useful operating frequency such as 0.5 GHz, then ordinary SCRL logic is a viable implementation.


CHAPTER 3:

REVERSIBLE 16-BIT MICROPROCESSOR DESIGN

This chapter presents the design of an adiabatic 16-bit datapath microprocessor implemented in 90nm technology using SCRL and Bennett clocking. The microprocessor is a multi-cycle RISC processor and it follows the MIPS architecture described by Harris [34]. Only a subset of instructions is implemented but they are enough for universal computation. The instructions are: addition, subtraction, bitwise AND, bitwise OR, set less than, add immediate, branch if equal, jump, load byte and store byte. This microprocessor represents an improvement over previous implementations [35] since it has a wider datapath and it is designed using a 90nm technology node.

3.1 Architecture

The microprocessor is composed of a control unit, an external memory, and a 16-bit datapath with multiple combinational and sequential elements as shown in Figure 3.1, adapted from [36]. The datapath of the microprocessor implements instructions in multiple cycles determined by a control unit labeled in blue in Figure 3.1. The control unit generates control signals, also highlighted in blue, that determine the path to be taken by information in order to execute an instruction. It must be noted that the microprocessor uses 32-bit instructions, but the datapath is only 16 bits. Therefore, two cycles are used to fetch the instruction, which goes into the control unit and determines the correct control signals to execute the corresponding instruction.

Figure 3.1: Multi-cycle MIPS microprocessor main components. Adapted from [36].

3.1.1 Multi-cycle Operation

The multi-cycle operation is summarized in Figure 3.2, where a state machine is employed to execute instructions. The states are determined by the opcode of the instruction, which corresponds to bits 26 to 31 of the instruction fetched from memory.

Thirteen different states, including the initial state after reset, are enough to implement the ten instructions of the MIPS microprocessor, which are determined by the opcode. As seen in Figure 3.2, the different opcodes of the state machine are: load byte (LB), store byte (SB), add immediate (ADDI), r-type instruction (R-type), branch if equal (BEQ), and jump. These correspond to standard processor architecture opcodes, where for example r-type instructions implement arithmetic and logic operations.
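The two-cycle fetch and the opcode dispatch described above can be sketched in a few lines of Python. This is only an illustrative model, not the thesis implementation: the opcode values are the standard MIPS encodings for the six instruction classes named in the text, while the function names and the assumption that the upper 16-bit half arrives first are hypothetical.

```python
# Standard MIPS opcode values (the 6-bit field in bits 31..26 of the word).
OPCODES = {
    0x00: "R-type",
    0x02: "J",
    0x04: "BEQ",
    0x08: "ADDI",
    0x20: "LB",
    0x28: "SB",
}


def assemble_word(first_half: int, second_half: int) -> int:
    """Join two 16-bit fetches into one 32-bit instruction word.

    Assumption: the first fetch carries the upper half of the word.
    The fetch order is not stated in the text.
    """
    return ((first_half & 0xFFFF) << 16) | (second_half & 0xFFFF)


def opcode_class(instruction: int) -> str:
    """Return the instruction class selected by the opcode field."""
    opcode = (instruction >> 26) & 0x3F
    return OPCODES.get(opcode, "unsupported")


# Example: addi $t0, $zero, 5 is encoded as 0x20080005 in standard MIPS.
print(opcode_class(assemble_word(0x2008, 0x0005)))
```

The lookup table plays the role of the branch taken out of the decode state in Figure 3.2; any opcode outside the implemented subset is reported as unsupported.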

Figure 3.2: MIPS multi-cycle operation state machine, and corresponding states.

3.1.2 MIPS and Bennett Clocking

Since full instructions are divided into multiple cycles, the different blocks in the datapath can be divided into three different zones which can be activated in parallel when implementing adiabatic computing. Three different Bennett blocks are defined in order to implement adiabatic logic with Bennett clocking, and each Bennett block shares Bennett clocks with the others and with the control unit. The definition of the Bennett blocks and their main components is presented in Figure 3.3. The system controller, or control unit, also shares Bennett clocks with the datapath. The instruction and data memory, however, is implemented as an external memory, and a 16-bit bus interconnects the microprocessor with the memory.


Figure 3.3: Adiabatic microprocessor main components showing three separate Bennett blocks to implement adiabatic logic with Bennett clocking.

It is desirable to implement the Bennett blocks in the lowest possible number of clocks to reduce complexity in terms of implementation, timing, and testing. By implementing the microprocessor in three different Bennett blocks, which share clocks, the number of Bennett clocks required is minimized. Further optimizations to the architecture can be performed to reduce the number of clocks. For example, the critical path of the microprocessor, which contains the longest delay in the Bennett blocks, can be optimized in order to reduce the logical depth of the circuit and reduce the number of Bennett clocks.


3.1.3 Optimizations for Adiabatic Computing

The critical path determines the minimum number of Bennett clocks required for energy recovery. The critical path of the microprocessor lies between the register file and the Arithmetic and Logic Unit (ALU) output register, and it must be optimized for adiabatic computing.

The biggest block in the critical path is the ALU, which is composed of an addition unit, bitwise AND, bitwise OR, a comparator, and a result . The dedicated comparator unit is introduced into the ALU in order to implement the “set less than” instruction in the lowest possible number of Bennett clocks, in contrast to a regular CMOS implementation, which uses the addition unit to perform this instruction [36]. Although a CMOS design would avoid the extra unit, introducing a dedicated comparator reduces the complexity of an adiabatic implementation.

The addition unit inside the ALU is the most complex block within the critical datapath, therefore a fast addition architecture is chosen in order to optimize the number of Bennett clocks used. Fast adders using adiabatic logic have shown benefits over standard CMOS, such as the 16-bit carry-look-ahead adder by Lim et al. [37]. Yet in order to reduce the number of Bennett clocks, a Kogge-Stone architecture [38] is selected, which only requires eight Bennett clocks to perform 16-bit addition. As seen in Figure 3.4, the Kogge-Stone adder uses propagate-generate logic, similarly to other fast adder architectures [39], in order to optimize the carry out signals generated by bitwise addition.


Figure 3.4: Kogge-Stone adder using propagate-generate logic. Adapted from [36].
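To make the propagate-generate idea concrete, the following Python sketch models a Kogge-Stone prefix network at the behavioral level. It is a minimal illustration and not the thesis netlist: the function name is hypothetical, the carry-in is assumed to be zero, and the mapping of the prefix stages onto the eight Bennett levels is not modeled.

```python
def kogge_stone_add(a: int, b: int, width: int = 16):
    """Return (sum, carry_out) for two unsigned operands of `width` bits."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    propagate = a ^ b   # bit i passes an incoming carry along
    generate = a & b    # bit i produces a carry on its own

    group_g, group_p = generate, propagate
    distance = 1
    while distance < width:
        # Merge every group with the group `distance` bits below it.
        group_g = group_g | (group_p & (group_g << distance) & mask)
        group_p = group_p & (group_p << distance) & mask
        distance *= 2

    carries = (group_g << 1) & mask          # carry into bit i (carry-in is 0)
    total = propagate ^ carries
    carry_out = (group_g >> (width - 1)) & 1
    return total, carry_out


print(kogge_stone_add(1234, 4321))
print(kogge_stone_add(0xFFFF, 0x0001))
```

For a 16-bit operand the loop runs four times (distances 1, 2, 4, and 8), which is the logarithmic carry depth that makes this architecture attractive when every logic level costs a Bennett clock.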

The Kogge-Stone adder is implemented as a physical layout using the techniques described in Chapter 2, where the adiabatic standard cell library developed with the Cadence generic 90nm technology is used. The adder architecture is achieved by placing the adiabatic standard cells manually. The physical layout for the Kogge-Stone 16-bit adder illustrates the use of the standard cell library and is presented in Figure 3.5. The adder is composed of 972 transistors, and it occupies an area of 80 um by 60 um. Eight Bennett levels can be seen from top to bottom, and each level contains both a positive and a negative clock. A bit slice shows the path of single-bit addition, and the full adder contains 16 bit slices. Only four layers of metal are used to interconnect this fast adder, but another two metal layers are required to interconnect it with the rest of the ALU.


Figure 3.5: Adiabatic 16-bit Kogge-Stone adder layout implemented in 90nm technology.

The adder implemented using only eight Bennett levels allows the critical path of the microprocessor to use a total of 12 Bennett clocks. These levels include the register source operands, , the ALU, and the ALU output register. Optimizing the architecture of the ALU for adiabatic computing shows that the number of clocks needed to implement this 16-bit microprocessor can be equal to the number required for an earlier 8-bit adiabatic microprocessor. [35] The only extra complexity of the circuit is related to the increased size of the datapath, and the effort to interconnect the main subsystems of the microprocessor.


3.2 Novel Memory Elements for Adiabatic Computing

Sequential elements, such as registers, are some of the main components of the microprocessor, but they must retain information between the microprocessor operating cycles, which represents a challenge when implementing reversible computing. The reversible computing combinational blocks described so far borrow information from Bennett clocks and then give it back each cycle. Therefore, sequential elements using SCRL and Bennett clocking generally do not lend themselves to energy recovery.

In the proposed adiabatic microprocessor, all the Bennett clocks are in a null state between cycles in order to recover energy, which means loss of information. Therefore, information must be stored in sequential elements that do not lose such information when the Bennett clocks are in the null state. We present a novel design for adiabatic latches and SRAM memory cells that retain information and partially recover energy using adiabatic logic.

3.2.1 Adiabatic Flip-Flops

An adiabatic master-slave flip-flop is presented in Figure 3.6(a). It includes a master latch controlled by a Bennett clock (MClk), and a slave latch controlled by a regular irreversible clock (SClk). As seen in Figure 3.6(b), the master latch stores the incoming information by activating MClk. The information is passed to the slave latch by activating SClk, and then the master latch performs energy recovery returning MClk to a null state. The slave latch retains the information and provides it to the next Bennett block. Even though the slave latch presents non-adiabatic transitions, it ensures that the information is retained between microprocessor cycles.


Figure 3.6: Adiabatic master-slave flip-flop. a) Circuit schematic. b) Timing diagram. c) Layout implemented in 90nm technology.

3.2.2 Adiabatic SRAM Cells

The second adiabatic sequential element design is an adiabatic SRAM cell, as seen in Figure 3.7(a). The design is based on a standard six-transistor SRAM cell, but with four extra transistors that control the power supplied to the . Bennett clocks (Clk+ and Clk-) are used to write the SRAM cell in order to perform energy recovery. When the Bennett clocks go to a null state, information is lost; therefore, the Bennett clocks are swapped for DC voltage supplies in order to retain the information stored in the SRAM cell. This is accomplished by switching between the Bennett clocks and the DC voltage supplies using the four control transistors. The timing diagram of the adiabatic SRAM control signals is presented in Figure 3.7(b). The SRAM cell remains connected to the DC voltage supplies (Vdd and Vss) during the read operation. The SRAM cell only connects to the Bennett clocks when it is being written.


Figure 3.7: Adiabatic SRAM. a) Circuit schematic. b) Timing diagram. c) Layout implemented in 90nm technology.

The novel adiabatic sequential elements described above achieve energy recovery in adiabatic computing and are compatible with an SCRL and Bennett clocking implementation. These sequential elements retain information between adiabatic cycles, in contrast to other adiabatic memory approaches such as PFAL latches [40]. The trade-off between dissipation and the circuit overhead required to avoid information destruction is minimal, as seen in Figure 3.6(c) and Figure 3.7(c). Furthermore, simulations show that the novel SRAM cell dissipates an energy one order of magnitude lower than CMOS when operating at 0.5 GHz, making these novel sequential elements a viable implementation for an adiabatic microprocessor.


3.3 Microprocessor Physical Implementation

Due to the lack of automated tools for adiabatic computing, the physical implementation of the microprocessor was done by hand. The 90nm standard cell library developed for adiabatic computing, discussed above in Chapter 2, was used to implement the full microprocessor design. The implemented devices leave some margin for more restrictive design rules, while still passing the DRC checks from the generic model. This extra flexibility might not yield the optimal minimum area for the design but allows the microprocessor to be easily adjusted for future fabrication.

3.3.1 ALU and Critical Path

The physical implementation of the critical path is presented in Figure 3.8. This layout was implemented using the standard cell library for a generic 90nm technology as described above. The input and output registers are composed of adiabatic master-slave flip-flops. The ALU is dominated by interconnects since it is composed of the 16-bit Kogge-Stone adder and a 16-bit logic unit, with bitwise AND, bitwise OR, and comparator functions. Six layers of metal are required to interconnect the blocks inside the ALU. The full critical path contains 3078 transistors and occupies an area of 130 um by 140 um. This physical design realizes adiabatic logic successfully using Bennett clocking and it shows that logic can be compactly placed even with the separate logical Bennett levels.


Figure 3.8: Physical layout of the microprocessor’s critical path using 90nm technology.

3.3.2 Register File

A register file stores data within the microprocessor in order to increase the throughput of executing instructions. The instructions are executed faster by using the information located in the register file and avoiding fetching information from memory in every cycle. The register file used in this adiabatic implementation uses the novel SRAM design described above. The register file is composed of two dedicated registers named A and B that are used separately by the microprocessor. Since the datapath of the microprocessor is 16 bits, the register file is composed of 16-bit words. Each word of information is stored in sixteen single-bit adiabatic SRAM cells. The layout of the complete register file is presented in Figure 3.9.

Figure 3.9: Physical layout of the register file implemented in 90nm technology.

As seen in Figure 3.9, the register file is composed of two identical blocks, one for register A and another for register B. Each block is composed of a 32 by 16 SRAM bank, a word decoder, and column circuitry for inputs and outputs. The column circuitry is composed of a write driver, only used during the write operation, and a column output which includes initial conditioning of the SRAM information. The complete register file occupies an area of 220 um by 215 um and is composed of 11456 transistors.

3.3.3 Instruction and Data Fetch

The remaining part of the datapath is the instruction/data fetch. The 32-bit instruction is fetched from memory using registers and multiplexers. Since the datapath of the microprocessor is only 16 bits, two separate registers are required to store the incoming instruction. A separate 16-bit register is used to store data coming from memory. The registers of the instruction and data fetch block are presented in Figure 3.10. It must be pointed out that the wires to the left of the registers correspond to the interconnection with memory. The wires need to go straight into input/output pins, since an external memory is used.

Figure 3.10: Physical layout of the instruction/data fetch using 90nm technology.


3.3.4 Control Unit

The last block in the adiabatic microprocessor is the control unit. The control unit implements a state machine, presented above in Figure 3.2, and it also generates the control signals that enable the datapath to compute an instruction. The state machine is encoded using 8 bits since it is optimized for adiabatic logic. By using redundant bits, the states can be implemented with a low logical depth, which translates to fewer Bennett clocks. The encoding of the different states is presented in Table 3.1, while the control signal activation is presented in Table 3.2.

TABLE 3.1

CONTROL UNIT ENCODING FOR ADIABATIC MICROPROCESSOR

Current state        Encoding
Number  Name         S7  S6  S5  S4  S3  S2  S1  S0
1       INIT         0   0   0   0   0   0   0   0
2       FETCH1       1   0   0   0   0   0   0   1
3       FETCH2       1   0   0   0   1   0   0   1
4       DECODE       1   0   0   0   0   0   1   1
5       MEMADR       1   0   0   0   0   1   1   0
6       LBRD         1   0   0   0   1   0   0   0
7       LBWR         1   0   0   1   0   0   0   0
8       SBWR         1   0   0   1   1   0   0   0
9       RTYPEX       1   1   0   0   1   1   0   0
10      RTYPEWR      1   0   1   0   0   0   0   0
11      BEQEX        0   0   0   0   0   1   0   0
12      JEX          1   0   1   0   1   0   0   0
13      ADDIWR       1   0   1   1   0   0   0   0
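The encoding of Table 3.1 can be captured as a small lookup table, which makes the redundancy mentioned in the text easy to check. The sketch below is only an illustration: the dictionary name and helper function are hypothetical, and the bit patterns are transcribed from the table with S7 as the most significant bit.

```python
# State encodings from Table 3.1, written as S7..S0 (S7 is the MSB).
STATE_ENCODING = {
    "INIT":    0b00000000,
    "FETCH1":  0b10000001,
    "FETCH2":  0b10001001,
    "DECODE":  0b10000011,
    "MEMADR":  0b10000110,
    "LBRD":    0b10001000,
    "LBWR":    0b10010000,
    "SBWR":    0b10011000,
    "RTYPEX":  0b11001100,
    "RTYPEWR": 0b10100000,
    "BEQEX":   0b00000100,
    "JEX":     0b10101000,
    "ADDIWR":  0b10110000,
}


def minimum_state_bits(number_of_states: int) -> int:
    """Smallest register width that can give every state a distinct code."""
    return max(1, (number_of_states - 1).bit_length())


codes = list(STATE_ENCODING.values())
all_unique = len(set(codes)) == len(codes)
redundant_bits = 8 - minimum_state_bits(len(codes))

print(f"states: {len(codes)}, unique codes: {all_unique}")
print(f"redundant bits in the 8-bit encoding: {redundant_bits}")
```

Thirteen states would fit in four bits with a dense binary encoding, so the 8-bit encoding spends four extra bits in exchange for the shallower next-state logic described above.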


TABLE 3.2

CONTROL SIGNALS FOR MULTI-CYCLE MIPS

The physical design of the control unit, containing the state machine and control signals, is presented in Figure 3.11. It consists of 632 transistors and occupies an area of 100 um by 100 um.


Figure 3.11: Physical layout of the control unit implemented in 90nm technology.

The physical design of the complete microprocessor is presented in Figure 3.12. It contains 17,508 transistors and has a 16-bit bus on the left side to be connected to an external memory containing data and instructions. It uses seven layers of metal to interconnect all the transistors and it occupies an area of 800um by 400um.


Figure 3.12: Physical layout of the 16-bit adiabatic microprocessor in 90nm technology.

The physical layout of the adiabatic microprocessor matches the schematic diagram presented above in Figure 3.3. Three separate Bennett blocks are defined to implement the datapath, and they share Bennett clocks with the control unit. The adiabatic design of the microprocessor using SCRL logic and Bennett clocking shows that reversible computing can be implemented in a straightforward manner. The physical implementation using the 90nm technology presents little overhead as compared to CMOS. While this microprocessor only implements a subset of instructions, they are enough for universal computation. Furthermore, the MIPS architecture is still relevant for computation, such as applications of embedded real-time processing [41].


3.4 Microprocessor Verification

Once the physical layout was done, it was verified using logical simulations. The verification process started with the physical layout, which was directly extracted into high-level Verilog code. The netlist generated by Cadence is a transistor-level Verilog file consisting only of transistors and interconnections. This file was then adapted to an adiabatic-compatible set of Verilog tools and the verification was done using well-known methods. For example, Figure 3.13 shows a screenshot of the control unit simulation where the next state and the ALU control signals were calculated from a starting state.

These simulations are extremely useful since they allow the verification of the logical behavior of the final physical design, therefore increasing the chances of a successful fabrication run.

Figure 3.13: Control unit Verilog simulation verifies the expected behavior of the microprocessor.


CHAPTER 4:

REVERSIBLE CMOS HEAT DISSIPATION MEASUREMENTS

The measurement of power dissipation on a chip for standard CMOS dissipative logic is typically obtained by conventional electrical techniques, such as measuring the current supplied to the circuit by a DC power supply. However, power measurement of adiabatic switching circuits becomes a complicated task, not only because the adiabatic circuits are powered by time-varying voltages, but also because power dissipation is masked by a large reactive power component, due to the nature of reversible adiabatic logic where charge is recycled. [8]

However, power dissipation estimates can be obtained by performing SPICE simulations of reversible circuits. Table 4.1 presents the energy losses per cycle of digital logic gates operating at 0.5 GHz, driving a load of 1 fF, comparing CMOS and adiabatic modes of operation. Here, different digital gates were simulated for 2 ns using the 90nm technology. The voltage of the power supplies, or Bennett clocks for adiabatic operation, was multiplied by the current supplied to the gate in order to calculate the power consumption. The time-dependent power signal was then integrated over the full cycle, 2 ns, and the results are presented in Table 4.1. The number of digital gates was estimated for the full MIPS microprocessor design in order to obtain a total energy estimate. As seen in Table 4.1, the estimated energy losses for an adiabatic operation of the microprocessor are roughly one order of magnitude lower than for standard CMOS operation.
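The energy calculation described above, multiplying supply voltage by supply current and integrating over one cycle, can be sketched as follows. This is a hedged illustration rather than the post-processing actually used with the SPICE output: the function name is hypothetical, trapezoidal integration is an assumed choice, and the waveform is synthetic test data, not a simulation result.

```python
def energy_per_cycle(times, voltages, currents):
    """Integrate p(t) = v(t) * i(t) with the trapezoidal rule; result in joules."""
    power = [v * i for v, i in zip(voltages, currents)]
    energy = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        energy += 0.5 * (power[k] + power[k - 1]) * dt
    return energy


# Synthetic example (not thesis data): a constant 1 V supply delivering
# a constant 1 uA over one 2 ns cycle, sampled at 201 points.
samples = 201
times = [2e-9 * k / (samples - 1) for k in range(samples)]
voltages = [1.0] * samples
currents = [1e-6] * samples

print(energy_per_cycle(times, voltages, currents))
```

For an adiabatic gate the same integral would be evaluated for each Bennett clock rail and the results summed, so that energy returned to a clock during the ramp down enters with a negative sign and only the net loss remains.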

TABLE 4.1

ENERGY LOSSES PER CYCLE OF DIGITAL GATES IN MICROPROCESSOR

Digital gate   CMOS           Adiabatic      MIPS       MIPS CMOS      MIPS adiabatic
name           energy/cycle   energy/cycle   number     energy/cycle   energy/cycle
               [Joules]       [Joules]       of gates   [Joules]       [Joules]
NOT            2.32E-12       2.09E-13       186        4.32E-10       3.89E-11
NAND           2.41E-15       4.16E-16       78         1.88E-13       3.24E-14
NOR            4.55E-15       4.85E-16       68         3.09E-13       3.30E-14
XOR            1.25E-14       1.79E-15       48         6.00E-13       8.59E-14
AndOrInvert    1.26E-14       1.65E-15       28         3.53E-13       4.62E-14
OrAndInvert    1.04E-14       1.02E-15       30         3.12E-13       3.06E-14
FlipFlop       1.86E-14       2.41E-15       112        2.08E-12       2.70E-13
Mux2to1        4.10E-15       1.08E-16       64         2.62E-13       6.91E-15
Mux4to1        8.26E-15       2.12E-16       48         3.96E-13       1.02E-14
SRAM           3.28E-12       3.76E-13       512        1.68E-09       1.93E-10

Total:                                                  2.12E-09       2.32E-10
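The totals in Table 4.1 follow from multiplying each per-gate energy by the estimated gate count and summing. The short Python sketch below reproduces that arithmetic from the tabulated values; the variable and function names are illustrative only.

```python
# (CMOS energy/cycle [J], adiabatic energy/cycle [J], number of gates),
# transcribed from Table 4.1.
TABLE_4_1 = {
    "NOT":         (2.32e-12, 2.09e-13, 186),
    "NAND":        (2.41e-15, 4.16e-16, 78),
    "NOR":         (4.55e-15, 4.85e-16, 68),
    "XOR":         (1.25e-14, 1.79e-15, 48),
    "AndOrInvert": (1.26e-14, 1.65e-15, 28),
    "OrAndInvert": (1.04e-14, 1.02e-15, 30),
    "FlipFlop":    (1.86e-14, 2.41e-15, 112),
    "Mux2to1":     (4.10e-15, 1.08e-16, 64),
    "Mux4to1":     (8.26e-15, 2.12e-16, 48),
    "SRAM":        (3.28e-12, 3.76e-13, 512),
}


def total_energies(table):
    """Sum per-gate energy times gate count for both modes of operation."""
    cmos = sum(e_cmos * count for e_cmos, _, count in table.values())
    adiabatic = sum(e_adia * count for _, e_adia, count in table.values())
    return cmos, adiabatic


cmos_total, adiabatic_total = total_energies(TABLE_4_1)
ratio = cmos_total / adiabatic_total

print(f"CMOS total:      {cmos_total:.2e} J")
print(f"Adiabatic total: {adiabatic_total:.2e} J")
print(f"CMOS / adiabatic: {ratio:.1f}")
```

The ratio comes out at roughly nine, which is the "one order of magnitude" quoted in the text, and both totals are dominated by the NOT and SRAM rows.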

On-chip temperature sensors are implemented in order to experimentally measure the heat dissipation of the microprocessor operating in adiabatic mode. Two types of temperature sensors are presented: thermocouples and p-n junctions used as temperature sensors. Thermocouples are widely used in semiconductor manufacturing equipment [42], and the simplicity of their design allows them to be fabricated using thin films. On the other hand, p-n junctions are one of the main components of silicon technologies and can be used as temperature sensors due to the relationship between their voltage, current, and temperature, which is shown in formula 4.1.

$$V_D = \frac{\eta k_B T}{q}\,\ln\!\left(\frac{I_D}{I_0} + 1\right) \qquad (4.1)$$

Here, $V_D$ is the diode voltage, $I_D$ is the diode current, $I_0$ is the saturation current, $\eta$ is the ideality factor of the p-n junction, $k_B$ is the Boltzmann constant, $T$ is the temperature in Kelvin, and $q$ is the elementary charge. Since, for a fixed current, the diode voltage varies approximately linearly with the temperature of the device, the diode can be used as a temperature sensor when a constant current is applied through it.
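Formula 4.1 can be evaluated and inverted numerically to show how a measured diode voltage maps back to a temperature. The sketch below is illustrative only: the function names are hypothetical, the bias current, saturation current, and ideality factor are made-up example values, and the saturation current is treated as a constant even though in a real junction it also depends on temperature.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant [J/K]
Q = 1.602176634e-19     # elementary charge [C]


def diode_voltage(i_d, i_0, eta, temperature):
    """Formula 4.1: diode voltage for a given current and temperature."""
    return eta * K_B * temperature / Q * math.log(i_d / i_0 + 1.0)


def temperature_from_voltage(v_d, i_d, i_0, eta):
    """Invert formula 4.1 for temperature, assuming I_0 is constant."""
    return v_d * Q / (eta * K_B * math.log(i_d / i_0 + 1.0))


# Hypothetical operating point: 10 uA bias, 1 fA saturation current, eta = 1.
I_D, I_0, ETA = 10e-6, 1e-15, 1.0

v_300 = diode_voltage(I_D, I_0, ETA, 300.0)
print(f"V_D at 300 K: {v_300:.4f} V")
print(f"Recovered T:  {temperature_from_voltage(v_300, I_D, I_0, ETA):.2f} K")
```

In practice the sensor would be calibrated against a known temperature rather than relying on nominal values of the ideality factor and saturation current.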

4.1 On-chip Thermocouples

Thermocouples use the Seebeck effect [43] to measure the difference in temperature between two junctions of dissimilar conductors. Thermocouples have two different conductors that are connected with a measurement junction and a reference junction. When there is a small difference in temperature between the two junctions, the difference in absolute Seebeck coefficients S1 and S2 produces a voltage at the reference junction following the equation:

$$V_{TC} = (S_1 - S_2)\,\Delta T \qquad (4.2)$$

The Seebeck coefficients are determined by the material of the conductors, and are a function of temperature and the material Fermi energy. For small temperature changes in the range of milli-Kelvin, a linear approximation to the Seebeck coefficients is valid, generating the expression in formula 4.2. For example, the Seebeck coefficient of gold is 6.5 uV/K at 300 K, while the Seebeck coefficient of nickel is -15 uV/K at 300 K [44]. Therefore, a thermocouple made of gold-nickel junctions will yield a Seebeck coefficient of 21.5 uV/K when operated at room temperature. Even though the linear approximation to the Seebeck coefficients introduces some error, when dealing with changes of temperature between the junctions in the range of milli-Kelvin this error is negligible. A gold-nickel thermocouple is illustrated in Figure 4.1.
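The gold-nickel example can be worked through numerically using the linear thermocouple relation with the coefficients quoted above. This is a minimal sketch; the function names and the 1 mK temperature difference are illustrative assumptions.

```python
S_GOLD = 6.5e-6       # Seebeck coefficient of gold at 300 K [V/K], from the text
S_NICKEL = -15e-6     # Seebeck coefficient of nickel at 300 K [V/K], from the text


def thermocouple_voltage(s1, s2, delta_t):
    """Open-circuit voltage for a temperature difference delta_t [K]."""
    return (s1 - s2) * delta_t


def temperature_difference(s1, s2, voltage):
    """Temperature difference [K] inferred from a measured voltage [V]."""
    return voltage / (s1 - s2)


sensitivity = S_GOLD - S_NICKEL                          # 21.5 uV/K
v_1mk = thermocouple_voltage(S_GOLD, S_NICKEL, 1e-3)     # 1 mK difference

print(f"Sensitivity: {sensitivity * 1e6:.1f} uV/K")
print(f"Voltage for 1 mK: {v_1mk * 1e9:.1f} nV")
```

A signal of a few tens of nanovolts per milli-Kelvin shows why electrical pickup is a concern for this kind of measurement.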

Figure 4.1: Gold-nickel thermocouple operated at 300K, showing open circuit voltage at reference junction generated by a small temperature change (in the range of milli-Kelvin) in measurement junction.

4.1.1 Metallic Thermocouples

Metallic thermocouples (metal TC) can be implemented as on-chip temperature sensors when fabricating CMOS circuits. For example, gold-nickel on-chip thermocouples can be placed on top of the gates of transistors, where the heat dissipation is greatest [45]. Here, the measurement junction is located on top of CMOS transistors, and the reference junction is hundreds of micrometers away, effectively measuring the heat generated by the circuit.

The cross-section diagram of a metal TC is presented in Figure 4.2, where the gold-nickel thermocouple is placed on top of the n-channel transistor. The layers shown in Figure 4.2 correspond to a standard CMOS process using polysilicon as the gate material. Once the metallization is finished, an insulating layer of silicon nitride (SiNx) is deposited on top of the CMOS circuits to avoid shorting the terminals of the thermocouples.

Figure 4.2: Cross section diagram of on-chip thermocouples fabricated on top of CMOS transistors.

These on-chip thermocouples were fabricated using a full custom CMOS process in our facilities at Notre Dame. Figure 4.3(a) presents an SEM micrograph of the devices.

The heat generated by CMOS transistors is proportional to the current supplied to the circuit. Figure 4.3(c) presents the heat dissipated by the CMOS transistors, which matches well the current supplied to the circuit shown in Figure 4.3(b).

Figure 4.3: (a) Gold-nickel thermocouples on top of CMOS circuits. (b) Current supplied to the circuit. (c) Temperature difference measured by metallic thermocouple.


Due to the flexibility of our fabrication process at Notre Dame, the gold and nickel were easily deposited, and the implementation of on-chip thermocouples was successful. However, other materials with higher Seebeck coefficients could generate a much stronger signal that benefits the experimental measurement of heat in CMOS circuits. Electrical pickup is one of the main challenges when using thermoelectric devices to measure heat in electric circuits. A careful layout of the devices helps avoid undesired capacitive coupling and reduce electrical resistance, therefore minimizing electrical pickup. In addition, a thermocouple with a higher Seebeck coefficient would increase the signal-to-noise ratio. Therefore, silicon thermocouples are proposed to implement on-chip thermocouples for the adiabatic microprocessor described above.

4.1.2 Silicon Thermocouples

Highly doped silicon has been reported to have a high Seebeck coefficient [46], with 236 uV/K for p-type silicon and -230 uV/K for n-type silicon, both at 300 K [47]. Because their sensitivity is much higher than that of metal TC, highly doped silicon thermocouples are a viable implementation for on-chip temperature sensors. Moreover, the thermocouples do not have to be placed atop the transistors' gates; they can be placed to the side, as close as possible to the active devices. Most of the heat dissipation occurs through the silicon substrate, and areas with multiple transistors would increase the signal measured by the thermocouple. Figure 4.4 presents the design of a silicon thermocouple implemented in the 90nm technology to be used as an on-chip temperature sensor for the adiabatic microprocessor. Only the measurement junction is shown, which consists of highly doped n-type silicon (n+) and highly doped p-type silicon (p+). The reference junction is not shown but is located several microns away from the active devices.
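As a rough check on the sensitivity gain, the coefficients quoted in this chapter can be combined in the same way as for the gold-nickel pair. This is only a sketch: it assumes the n+/p+ pair behaves as an ideal thermocouple, reuses the 300 K values from the text, and picks a 1 mK temperature difference purely for illustration.

```latex
\begin{align*}
S_{p^+} - S_{n^+} &= 236\ \mu\mathrm{V/K} - (-230\ \mu\mathrm{V/K}) = 466\ \mu\mathrm{V/K} \\
S_{\mathrm{Au}} - S_{\mathrm{Ni}} &= 6.5\ \mu\mathrm{V/K} - (-15\ \mu\mathrm{V/K}) = 21.5\ \mu\mathrm{V/K} \\
\frac{S_{p^+} - S_{n^+}}{S_{\mathrm{Au}} - S_{\mathrm{Ni}}} &= \frac{466}{21.5} \approx 21.7 \\
V_{TC}\big|_{\Delta T = 1\,\mathrm{mK}} &= 466\ \mu\mathrm{V/K} \times 10^{-3}\ \mathrm{K} = 466\ \mathrm{nV}
\end{align*}
```

Under these assumptions the silicon pair would produce a signal about twenty times larger than the metallic pair for the same temperature difference.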

Figure 4.4: Silicon thermocouple implemented in 90nm technology within densely populated region.

4.2 On-chip Diode Temperature Sensors

Not every modern foundry allows for the fabrication of overlapping highly doped silicon regions of different types. Since on-chip silicon thermocouples require a junction of both highly doped n-type silicon and highly doped p-type silicon, another kind of on-chip temperature sensor may be required. The specific foundry to implement the adiabatic microprocessor of this work is still to be determined. Therefore, diodes are proposed as an on-chip temperature sensor that could complement or even replace thermocouples.


Diodes are readily available in every state-of-the-art CMOS fabrication facility, therefore they represent a viable implementation for on-chip temperature sensors. If a diode has a constant current applied, then the voltage across its p-n junction is proportional to the temperature of the device. [48] In order to measure the heat generated by transistors a reference “cold” area can be measured in parallel to compare to the temperature measured in the CMOS devices. Therefore, the adiabatic microprocessor design includes reference diodes away from the actual microprocessor in order to have a reliable temperature measurement. The final design of the 16-bit adiabatic microprocessor with on-chip temperature sensors is presented in Figure 4.5.

Figure 4.5: Adiabatic microprocessor with on-chip temperature sensors.


On-chip thermocouples and diodes are used as temperature sensors in order to measure the power dissipation of an adiabatic reversible microprocessor. Different temperature sensors are implemented and can be interchanged in order to comply with the design rules of the chosen foundry for future fabrication. The design presented in Figure 4.5 includes an 80-pin pad frame with ESD protection. It must be pointed out that the pad frame can be adjusted to a smaller area if required, since the MIPS microprocessor uses only a fraction of the total area. The relatively high number of input/output pins is due to the different Bennett clocks required for energy recovery.

The final design of the adiabatic microprocessor including on-chip temperature sensors presents a real-life implementation of reversible computing with the capability of performing direct heat dissipation measurements. Using the on-chip temperature sensors the energy estimates of Table 4.1 can be corroborated providing a useful comparison between CMOS and adiabatic operations.


CHAPTER 5:

CONCLUSION AND FUTURE WORK

A 16-bit adiabatic microprocessor was successfully implemented in 90nm technology using split-rail charge recovery logic. The simulation of a subsystem of the microprocessor, a shift register, shows the benefits of using adiabatic logic over standard CMOS and confirms that the 90nm node is a viable option for its implementation. Optimizing the architecture of the microprocessor for adiabatic computing pays off by reducing the number of clocks, and therefore reducing the complexity of the system. The physical design of the microprocessor implements a newly developed standard cell library with Bennett clocking. The final layout contains 17,508 transistors and has an area of 800 um by 400 um. The implementation of the microprocessor in 90nm technology with an operating frequency of 0.5 GHz shows a real-life circuit using adiabatic reversible logic and shows a promising future for energy-efficient computing.


5.1 Future Work

5.1.1 Fabrication and Testing

The adiabatic microprocessor design presented in this work uses a generic 90nm model which can be easily adapted to a specific foundry. As future work the design will need to be adapted to the foundry-specific requirements before fabrication. This should be a straightforward task since the current design leaves some flexibility for future optimizations.

Testing is to be performed for both functionality and performance, once the microprocessor is fabricated. The logical functionality of the microprocessor could be tested using a simple piece of code that executes several instructions. One of the advantages of an adiabatic SCRL implementation is that the circuit can be operated in both CMOS and reversible mode. Therefore, the initial testing of the microprocessor could be performed in standard CMOS mode.

The challenge of testing the microprocessor arises when operating in adiabatic mode. First, multiple signal generators must be employed in order to implement Bennett clocking. Then, the Bennett clocks and the external memory providing instructions must be carefully timed to execute instructions correctly. Finally, the adiabatic reversible clocks must run at the proposed operating frequency of 0.5 GHz. This means that a full cycle of operation, where all the Bennett clocks are activated and de-activated, must happen within 2 nanoseconds. Specific electrical equipment is required to test the adiabatic operation of the microprocessor at this frequency. However, the external memory containing instructions and data can be emulated using an FPGA platform.

Software development will be required to set up the instructions to be performed. A compiler for the MIPS microprocessor will translate high-level code, such as C++, into low-level machine code that will be loaded to the microprocessor through the FPGA. The compiler will generate 32-bit instructions, the same instruction width used by many microprocessors currently on the market. Therefore, the software development required for testing should be straightforward.
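To illustrate what a 32-bit machine instruction looks like, the sketch below packs a register-type instruction using the standard MIPS32 R-type field layout (6-bit opcode, three 5-bit register fields, 5-bit shift amount, 6-bit function code). The helper name `encode_r_type` is hypothetical, and it is an assumption that the instruction subset implemented in this microprocessor uses the standard MIPS encodings unchanged.

```python
def encode_r_type(rs: int, rt: int, rd: int, shamt: int, funct: int,
                  opcode: int = 0) -> int:
    """Pack the fields of a MIPS32 R-type instruction into a 32-bit word.

    Layout (most to least significant bits):
    opcode[6] | rs[5] | rt[5] | rd[5] | shamt[5] | funct[6]
    """
    fields = (("opcode", opcode, 6), ("rs", rs, 5), ("rt", rt, 5),
              ("rd", rd, 5), ("shamt", shamt, 5), ("funct", funct, 6))
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return ((opcode << 26) | (rs << 21) | (rt << 16)
            | (rd << 11) | (shamt << 6) | funct)


if __name__ == "__main__":
    # add $t0, $t1, $t2  ->  rd = $t0 (8), rs = $t1 (9), rt = $t2 (10),
    # funct = 0x20 (add) in the standard MIPS32 encoding.
    word = encode_r_type(rs=9, rt=10, rd=8, shamt=0, funct=0x20)
    print(f"0x{word:08X}")  # prints 0x012A4020
```

Words of this form are what the FPGA-emulated memory would supply to the microprocessor during testing.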

The on-chip temperature sensors will provide direct heat dissipation measurements of the areas chosen during the design and corroborate the dissipation estimates presented in this work. Other types of heat measurement could also be performed to complement the experimental data obtained. A different approach by Solomon and Frank [49] uses a thermoelectric cooler as a thermocouple to measure the dissipated heat. This approach is insensitive to the reactive power component of the circuits and can measure heat from large adiabatic reversible circuits. This technique could be applied here, since the full adiabatic microprocessor dissipates power in the milliwatt range, which is measurable through the chip carrier.

5.2 Microprocessor Scalability

Larger microprocessors, with more instructions and increased , could be implemented using the techniques described in this work. More complex microprocessor designs could require more Bennett clocks, but careful optimizations can reduce the number of clocks to a desired value. Such future optimization efforts could be eased by the development of automated tools for adiabatic computing. Furthermore, larger microprocessor designs can also be optimized at the architecture level to limit the number of Bennett clocks in the system. For example, pipelining allows the implementation of more complex designs with a larger datapath at a fixed number of Bennett clocks.

The 16-bit adiabatic microprocessor was implemented by hand in a few months, but automated tools could dramatically reduce the design time. These automated tools could translate high-level code into transistors and physical layouts using SCRL and Bennett clocking. This means that adiabatic computing is not limited by the complexity of microprocessor designs, since optimizations could be performed by automated tools in the future. If the automated tools used in industry are adapted to reversible computing, larger microprocessors could be easily implemented, and adiabatic reversible computing could replace standard CMOS logic. Whether this will happen remains to be seen.

It must be pointed out that this work deals only with reducing the energy losses due to heat dissipation in the switching devices. Energy recycling, on the other hand, is not addressed, but it is a crucial part of practical implementations of reversible computing [50]. For the work presented here, the whole system, including the clocks, could have considerable dissipation. The adiabatic microprocessor can still have important applications such as cryogenic CMOS. While the transistors could have low dissipation, preventing the increase of temperature in the cryostat, the Bennett clocks could present considerable dissipation. In applications such as cryogenic CMOS this is an acceptable trade-off, since the clocks would be located outside of the cryostat. However, full systems trying to break through the Landauer-Shannon limit should include Bennett clocks that can recycle energy. Clocks that recycle energy could be implemented using MEMS resonators to drive the logic gates and recycle the bit energies.
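For a sense of scale, the following sketch compares textbook first-order energy estimates: roughly (1/2)CV^2 dissipated per transition in conventional CMOS switching, roughly (RC/T)CV^2 for an ideal linear ramp of duration T much longer than RC, and the k_B T ln 2 limit from Landauer's principle. The component values (1 kΩ, 1 fF, 1 V, 250 ps ramp, 300 K) are hypothetical and are not taken from this design. The estimates cover only the switching devices, ignore leakage and other non-adiabatic losses, and say nothing about dissipation in the clock generators.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def cmos_energy(c_farads: float, v_volts: float) -> float:
    """First-order energy dissipated per transition in standard CMOS."""
    return 0.5 * c_farads * v_volts ** 2


def adiabatic_energy(r_ohms: float, c_farads: float, v_volts: float,
                     t_ramp_s: float) -> float:
    """First-order dissipation for a linear ramp; valid only for T >> RC."""
    return (r_ohms * c_farads / t_ramp_s) * c_farads * v_volts ** 2


def landauer_limit(temp_kelvin: float) -> float:
    """Minimum energy to erase one bit of information: k_B * T * ln 2."""
    return K_B * temp_kelvin * math.log(2)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    r, c, v, t_ramp = 1e3, 1e-15, 1.0, 250e-12
    print(f"CMOS:      {cmos_energy(c, v):.3e} J")
    print(f"Adiabatic: {adiabatic_energy(r, c, v, t_ramp):.3e} J")
    print(f"Landauer:  {landauer_limit(300.0):.3e} J")
```

Even under these idealized assumptions the adiabatic estimate remains orders of magnitude above k_B T ln 2, which is consistent with the point that approaching that limit requires the whole system, clocks included, to recycle energy.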

Lastly, other approaches to asymptotically adiabatic reversible computing should be pursued. For example, adiabatic capacitive logic (ACL) presents an implementation of reversible computing that eliminates leakage current [51]. ACL uses variable capacitors as pull-up and pull-down networks instead of transistors to implement reversible computing. Devices such as these could be operated reversibly and will not be limited by the ultimate Shannon limit discussed in Landauer’s principle. While this work presents a short-term implementation of reversible computing by using CMOS transistors operated reversibly, other approaches to adiabatic reversible computing, such as ACL, could be implemented as beyond-CMOS technology.

BIBLIOGRAPHY

[1] M. M. Waldrop, “The chips are down for Moore’s Law,” Nature: News Feature, vol. 530, iss. 7589, Feb 9, 2016.

[2] G. E. Moore, “Cramming More Components onto Integrated Circuits,” Electronics, pp. 114–117, April 19, 1965.

[3] A. Kanduri, A.M. Rahmani, P. Liljeberg, A. Hemani, A. Jantsch, H. Tenhunen, “A Perspective on Dark Silicon,” In: A. Rahmani, P. Liljeberg, A. Hemani, A. Jantsch, H. Tenhunen (eds) The Dark Side of Silicon. Springer, Cham, 2017.

[4] H. Lu and A. Seabaugh, "Tunnel Field-Effect Transistors: State-of-the-Art," IEEE Journal of the Electron Devices Society, vol. 2, pp. 44-49, Jul 2014.

[5] H. Kam, T. King Liu, E. Alon, “Design Requirements for Steeply Switching Logic Devices,” IEEE Transactions on Electron Devices, Vol 59, No. 2, Feb 2012.

[6] A. Peschot, C. Qian, and T.J.K. Liu, "Nanoelectromechanical Switches for Low-Power Digital Computing". Micromachines-Basel, 6, p. 1046-65, 2015.

[7] M. Ramezani, S. Severi, A. Moussa, H. Osman, H. A. C. Tilmans, K. De Meyer, "Contact reliability improvement of a poly-SiGe based nano-relay with titanium nitride coating", Proc. 18th IEEE Transducers, pp. 576-579, Jun. 2015.

[8] C. S. Lent, A. O. Orlov, W. Porod, G. L. Snider, “Energy limits in computation: A Review of Landauer’s principle, Theory and Experiments,” New York: Springer, 2018.

[9] R.K. Cavin, V.V. Zhirnov, J.A. Hutchby, and G.I. Bourianoff, "Energy barriers, demons, and minimum energy operation of electronic devices," Fluctuation and Noise Letters, 5, p. C29-C38, 2005.

[10] V.V. Zhirnov, R.K. Cavin, J.A. Hutchby, and G.I. Bourianoff, “Limits to Binary Logic Scaling—A Gedanken Model,” Proceedings of the IEEE, Vol. 91, No. 11, Nov 2003.

[11] A.O. Orlov, C.S. Lent, C.C. Thorpe, G.P. Boechler, and G.L. Snider, "Experimental Test of Landauer’s Principle at the Sub-kBT Level," Jpn. J. Appl. Phys, 51, p. 06FE10-1-5, 2012.

[12] G.P. Boechler, J.M. Whitney, C.S. Lent, A.O. Orlov, and G.L. Snider, “Fundamental limits of energy dissipation in charge-based computing,” Applied Physics Letters 97, 103502, 2010.

[13] T. Toffoli, “Reversible computing,” In: de Bakker J., van Leeuwen J. (eds) Automata, Languages and Programming, ICALP 1980, Lecture Notes in Computer Science, vol 85. Springer, Berlin, Heidelberg 1980.

[14] B. C. David, “How Moore’s Law came to be” Computer History Museum CORE, pp. 31-33, 2015.

[15] K. J. Kuhn, “Considerations for ultimate CMOS scaling” IEEE Trans. Electron Devices, 59, 1813–1828, 2012.

[16] K. Kuhn, "CMOS and Beyond CMOS: Scaling Challenges." High Mobility Materials for CMOS Applications. Woodhead Publishing, 2018. 1-44.

[17] P. Teichman, “Adiabatic Logic: Future Trend and System Level Perspective.” New York: Springer, 2012.

[18] C. H. Bennett, “The Thermodynamics of Computation-A Review,” Int. J. Theor. Phys., vol. 21, no. 12, pp. 905-940, 1982.

[19] A. Pandey, et al., "Effect of load capacitance and input transition time on FinFET inverter capacitances," IEEE Transactions on Electron Devices, 61.1: 30-36, 2013.

[20] V.I. Starosel'skii, "Adiabatic logic circuits: A review". Russian Microelectronics, 31, p. 37-58, 2001.

[21] R. M. Incandela, et al., "Nanometer CMOS characterization and compact modeling at deep-cryogenic temperatures," 2017 47th European Solid-State Device Research Conference (ESSDERC), IEEE, 2017.

[22] N. Singh, et al., "High-performance fully depleted silicon nanowire (diameter ≤ 5 nm) gate-all-around CMOS devices," IEEE Electron Device Letters 27.5: 383-386, 2006.

[23] S. G. Younis and T. F. Knight, "Practical Implementation of Charge Recovering Asymptotically Zero Power CMOS," in Research On Integrated Systems, Seattle, WA, 1993, pp. 234-250.

[24] A. K. Bakshi, and M. Sharma, "Design of basic gates using ECRL and PFAL," 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2013.

[25] C. S. Lent, M. Liu, and Y. H. Lu, “Bennett clocking of quantum-dot cellular automata and the limits to binary logic scaling,” Nanotechnology, vol. 17, pp. 4240-4251, Aug 28 2006.

[26] D. Mermin, “Quantum Computer Science”. University Printing House, Cambridge, UK 2007.

[27] M. W. Johnson, et al., "Quantum annealing with manufactured spins," Nature 473.7346 194, 2011.

[28] W. Zhao, Y. Cao, "New generation of Predictive Technology Model for sub-45nm early design exploration," IEEE Transactions on Electron Devices, vol. 53, no. 11, pp. 2816-2823, Nov 2006.

[29] V. G. Oklobdzija, D. Maksimovic and F. Lin, "Pass-transistor adiabatic logic using single power-clock supply," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 44, no. 10, pp. 842-846, Oct. 1997.

[30] M. Meijer, et al., “Ultra-Low-Power Digital Design with Body Biasing for Low Area and Performance-Efficient Operation,” Journal of Low Power Electronics, vol. 6, No. 4, 2011.

[31] Cadence Reference Manual “Specification for 90nm Generic Process Design Kit (gpdk090)”, 2008.

[32] M. P. Frank and T. F. Knight, “Reversibility for efficient computing,” Ph.D. dissertation, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.

[33] N. Takeuchi, Y. Yamanashi, and N. Yoshikawa, "Reversible Computing Using Adiabatic Superconductor Logic", Lect. Notes Comput. Sc, 8507, p. 15-25, 2014.

[34] D. Harris and S. Harris, “Digital design and computer architecture”. Boston: Morgan Kaufmann Publishers, 2007.

[35] C. O. Campos-Aguillón, et al., "A Mini-MIPS microprocessor for adiabatic computing," 2016 IEEE International Conference on Rebooting Computing (ICRC), San Diego, CA, 2016, pp. 1-7.

[36] N. Weste and D. Harris, “CMOS VLSI Design: A Circuits and Systems Perspective,” 4th ed.: Addison-Wesley, 2010.

[37] J. Lim, et al., “A 16-bit carry-lookahead adder using reversible energy recovery logic for ultra-low-energy systems”, IEEE Journal of Solid-State Circuits, vol. 34, iss. 6, Jun 1999.

[38] S. Ghosh, P. Ndai, and K. Roy, "A novel low overhead fault tolerant Kogge-Stone adder using adaptive clocking," Proceedings of the conference on Design, automation and test in Europe, ACM, 2008.

[39] D. Esposito, et al., "Variable Latency Speculative Parallel Prefix Adders for Unsigned and Signed Operands," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 63, iss. 8, 2016.

[40] S. Maheshwari, V. A. Bartlett and I. Kale, "Adiabatic flip-flops and sequential circuit design using novel resettable adiabatic buffers," 2017 European Conference on Circuit Theory and Design (ECCTD), Catania, 2017, pp. 1-4.

[41] K. D. Kissell, "MIPS MT: A multithreaded RISC architecture for embedded real-time processing," International Conference on High-Performance Embedded Architectures and Compilers. Springer, Berlin, Heidelberg, 2008.

[42] J. F. O'Hanlon, “A user's guide to vacuum technology,” John Wiley & Sons, 2005.

[43] D. Pollock, “Thermocouples: Theory and properties,” Florida: CRC Press, 1991.

[44] D. M. Rowe, “CRC handbook of thermoelectrics,” CRC press, 1995.

[45] M. M. McConnell, et al., “Heat Dissipation in Adiabatic Reversible CMOS Circuits Measured by On-chip Nanothermocouples,” 2018 IEEE Silicon Nanoelectronics Workshop, Honolulu, HI, 2018, P2-24.

[46] W. Fulkerson, et al., "Thermal conductivity, electrical resistivity, and Seebeck coefficient of silicon from 100 to 1300 K," Physical Review 167.3, 765, 1968.

[47] A. Stranz, J. Kähler, A. Waag, et al., “Thermoelectric Properties of High-Doped Silicon from Room Temperature to 900 K”, Journal of Elec. Materials 42: 2381, 2013.

[48] S. M. Sze and K. K. Ng, “Physics of semiconductor devices,” John Wiley & Sons, 2006.

[49] P. M. Solomon and D. J. Frank, "Power measurements of adiabatic circuits by thermoelectric technique," 1995 IEEE Symposium on Low Power Electronics. Digest of Technical Papers, San Jose, CA, USA, 1995, pp. 18-19.

[50] M. McConnell, “Adiabatic Reversible Computing: Measurement of Heat Dissipation Using Nanothermocouples and Fabrication of MEMS Power Clocks”, University of Notre Dame, 2018.

[51] G. Pillonnet, H. Fanet, S. Houri, "Adiabatic Capacitive Logic: a paradigm for low-power logic," IEEE International Symposium on Circuits and Systems, IEEE: Baltimore, MD, 2017.
