ULTRA LOW-POWER ELECTRONICS AND DESIGN
Edited by Enrico Macii Politecnico di Torino, Italy
KLUWER ACADEMIC PUBLISHERS NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW
eBook ISBN: 1-4020-8076-X Print ISBN: 1-4020-8075-1
©2004 Springer Science + Business Media, Inc.
Print ©2004 Kluwer Academic Publishers Dordrecht
All rights reserved
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher
Created in the United States of America
Visit Springer's eBookstore at: http://www.ebooks.kluweronline.com and the Springer Global Website Online at: http://www.springeronline.com
Contents

CONTRIBUTORS
PREFACE
INTRODUCTION
1. ULTRA-LOW-POWER DESIGN: DEVICE AND LOGIC DESIGN APPROACHES
2. ON-CHIP OPTICAL INTERCONNECT FOR LOW-POWER
3. NANOTECHNOLOGIES FOR LOW POWER
4. STATIC LEAKAGE REDUCTION THROUGH SIMULTANEOUS Vt/Tox AND STATE ASSIGNMENT
5. ENERGY-EFFICIENT SHARED MEMORY ARCHITECTURES FOR MULTI-PROCESSOR SYSTEMS-ON-CHIP
6. TUNING CACHES TO APPLICATIONS FOR LOW-ENERGY EMBEDDED SYSTEMS
7. REDUCING ENERGY CONSUMPTION IN CHIP MULTIPROCESSORS USING WORKLOAD VARIATIONS
8. ARCHITECTURES AND DESIGN TECHNIQUES FOR ENERGY EFFICIENT EMBEDDED DSP AND MULTIMEDIA PROCESSING
9. SOURCE-LEVEL MODELS FOR SOFTWARE POWER OPTIMIZATION
10. TRANSMITTANCE SCALING FOR REDUCING POWER DISSIPATION OF A BACKLIT TFT-LCD
11. POWER-AWARE NETWORK SWAPPING FOR WIRELESS PALMTOP PCS
12. ENERGY EFFICIENT NETWORK-ON-CHIP DESIGN
13. SYSTEM LEVEL POWER MODELING AND SIMULATION OF HIGH-END INDUSTRIAL NETWORK-ON-CHIP
14. ENERGY AWARE ADAPTATIONS FOR END-TO-END VIDEO STREAMING TO MOBILE HANDHELD DEVICES
Contributors

A. Acquaviva, Università di Urbino
L. Benini, Università di Bologna
D. Bertozzi, Università di Bologna
D. Blaauw, University of Michigan, Ann Arbor
A. Bogliolo, Università di Urbino
A. Bona, STMicroelectronics
C. Brandolese, Politecnico di Milano
W.C. Cheng, University of Southern California
G. De Micheli, Stanford University
N. Dutt, University of California, Irvine
W. Fornaciari, Politecnico di Milano
F. Gaffiot, Ecole Centrale de Lyon
J. Gautier, CEA-DRT–LETI/D2NT–CEA/GRE
A. Gordon-Ross, University of California, Riverside
R. Gupta, University of California, San Diego
C. Heer, Infineon Technologies AG
M. J. Irwin, Pennsylvania State University
I. Kadayif, Canakkale Onsekiz Mart University
M. Kandemir, Pennsylvania State University
B. Kienhuis, Leiden
I. Kolcu, UMIST
E. Lattanzi, Università di Urbino
D. Lee, University of Michigan, Ann Arbor
A. Macii, Politecnico di Torino
S. Mohapatra, University of California, Irvine
I. O’Connor, Ecole Centrale de Lyon
K. Patel, Politecnico di Torino
M. Pedram, University of Southern California
C. Pereira, University of California, San Diego
C. Piguet, CSEM
M. Poncino, Università di Verona
F. Salice, Politecnico di Milano
P. Schaumont, University of California, Los Angeles
U. Schlichtmann, Technische Universität München
D. Sylvester, University of Michigan, Ann Arbor
F. Vahid, University of California, Riverside and University of California, Irvine
N. Venkatasubramanian, University of California, Irvine
I. Verbauwhede, University of California, Los Angeles and K.U.Leuven
N. Vijaykrishnan, Pennsylvania State University
V. Zaccaria, STMicroelectronics
R. Zafalon, STMicroelectronics
B. Zhai, University of Michigan, Ann Arbor
C. Zhang, University of California, Riverside
Preface
Today we are beginning to have to face up to the consequences of the stunning success of Moore’s Law, that astute observation by Intel’s Gordon Moore which predicts that integrated circuit transistor densities will double every 12 to 18 months. This observation has now held true for the last 25 years or more, and there are many indications that it will continue to hold true for many years to come. This book appears at a time when the first examples of complex circuits in 65nm CMOS technology are beginning to appear, and these products already must take advantage of many of the techniques to be discussed and developed in this book. So why then should our increasing success at miniaturization, as evidenced by the success of Moore’s Law, be creating so many new difficulties in power management in circuit designs?
The principal physical origin of the problem lies in the differential scaling rates of the many factors that contribute to power dissipation in an IC: the transistor speed/density product rises faster than the energy per transition falls, so power dissipation per unit area increases in a general sense as the technology evolves.
Secondly, the “natural” transistor switching speed increase from one generation to the next is being eroded by the greater parasitic losses in the wiring of the devices. The technologists are offsetting this problem to some extent by introducing lower-permittivity dielectrics (“low-k”) and lower-resistivity conductors (copper) – but nonetheless, to get the needed circuit performance, higher speed devices using techniques such as silicon-on-insulator (SOI) substrates, enhanced carrier mobility (“strained silicon”) and higher field (“overdrive”) operation are driving power densities ever upwards. In many cases, these new device architectures are increasingly leaky, so static power dissipation becomes a major headache in power management, especially for portable applications.

A third factor is system or application driven – having all this integration capability available encourages us to combine many different functional blocks into one system IC. This means that in many cases, a large part of the chip’s required functionality will come from software executing on and between multiple on-chip execution units; how the optimum partitioning between hardware architecture and software implementation is obtained is a vast subject, but clearly some implementations will be more energy efficient than others. Given that, in many of today’s designs, more than 50% of the total development effort is on the software that runs on the chip, getting this partitioning right in terms of power dissipation can be critical to the success of (or instrumental in the failure of!) the product.
A final motivation comes from the practical and environmental consequences of how we design our chips – state-of-the-art high performance circuits are dissipating up to 100W per square centimeter – we only need 500 square meters of such silicon to soak up the output of a small nuclear power station. A related argument, based on battery lifetime, shows that the “converged” mobile phone application combining telephony, data transmission, multimedia and PDA functions that will appear shortly is demanding power at the limit of lithium-ion or even methanol-water fuel cell battery technology. We have to solve the power issue by a combination of design and process technology innovations; examples of current approaches to power management include multiple transistor thresholds, triple gate oxide, dynamic supply voltage adjustment and memory architectures.
Multiple transistor thresholds is a technique, practiced for several years now, that allows the designer to use high performance (low Vt) devices where he needs the speed, and low leakage (high Vt) devices elsewhere. This benefits both static power consumption (through less sub-threshold leakage) and dynamic power consumption (through lower overall switching currents). High threshold devices can also be used to gate the supplies to different parts of the circuit, allowing blocks to be put to sleep until needed.
Similar to the previous technique, triple gate oxide (TGO) allows circuit partitioning between those parts that need performance and other areas of the circuit that don’t. It has the additional benefit of acting on both sub-threshold leakage and gate leakage. The third oxide is used for I/O and possibly mixed-signal. It is expected over the next few years that the process technologists will eventually replace the traditional silicon dioxide gate dielectric of the CMOS devices by new materials such as rare earth oxides with much higher dielectric constants that will allow the gate leakage problem to be completely suppressed.
Dynamic supply voltage adjustment allows the supply voltage to different blocks of the circuit to be adjusted dynamically in response to the immediate performance needs for the block – this very sophisticated technique will take some time to mature.
Finally, many, if not most, advanced devices use very large amounts of memory whose contents may have to be maintained during standby; this consumes a substantial amount of power, either through refreshing dynamic RAM or through the array leakage of static RAM. Traditional non-volatile memories have writing times that are orders of magnitude too slow to allow them to substitute for these on-chip memories. New developments, such as MRAM, offer the possibility of SRAM-like performance coupled with unlimited endurance and data retention, making them potential candidates to replace the traditional on-chip memories and remove this component of standby power consumption.
Most of the approaches to power management described briefly above will be employed in 65nm circuits, but there are a lot more good ideas waiting to be applied to the problem, many of which you will find clearly and concisely explained in this book.
Mike Thompson, Philippe Magarshack
STMicroelectronics, Central R&D Crolles, France
Introduction
ULTRA LOW-POWER ELECTRONICS AND DESIGN
Enrico Macii Politecnico di Torino
Power consumption is a key limitation in many electronic systems today, ranging from mobile telecom to portable and desktop computing systems, especially when moving to nanometer technologies. Power is also a showstopper for many emerging applications like ambient intelligence and sensor networks. Consequently, new design techniques and methodologies are needed to control and limit power consumption. The 2004 edition of the DATE (Design Automation and Test in Europe) conference devoted an entire Special Focus Day to the power problem and its implications on the design of future electronic systems. In particular, keynote presentations and invited talks by outstanding researchers in the field of low-power design, as well as several technical papers from the regular conference sessions, addressed the difficulties ahead and advanced strategies and principles for achieving ultra low-power design solutions. The purpose of this book is to integrate into a single volume a selection of these contributions, duly extended and transformed by the authors into chapters offering a mix of tutorial material and advanced research results.

The book consists of 14 chapters, addressing different aspects of ultra low-power electronics and design. Chapter 1 opens the volume by providing insight into innovative transistor devices that are capable of operating with a very low threshold voltage, thus contributing to a significant reduction of the dynamic component of power consumption. Solutions for limiting leakage power during stand-by mode are also discussed. The chapter closes with a quick overview of low-power design techniques applicable at the logic level, including multi-Vdd, multi-Vth and hybrid approaches. Chapter 2 focuses on the problem of reducing power in the interconnect network by investigating alternatives to traditional metal wires.
In fact, according to the 2003 ITRS roadmap, metallic interconnections may not be able to provide enough transmission speed and to keep power under control for the upcoming technology nodes (65nm and below). A possible solution, explored in the chapter, consists of the adoption of optical interconnect networks. Two applications are presented: clock distribution and data communication using wavelength division multiplexing.

In Chapter 3, the power consumption problem is faced from the technology point of view by looking at innovative nano-devices, such as single-electron or few-electron transistors. The low-power characteristics and potential of these devices are reviewed in detail. Other devices, including carbon nanotube transistors, resonant tunnelling diodes and quantum cellular automata, are also treated.
Chapter 4 is entirely dedicated to advanced design methodologies for reducing sub-threshold and gate leakage currents in deep-submicron CMOS circuits by properly choosing the states to which gates have to be driven when in stand-by mode, as well as the values of the threshold voltage and of the gate oxide thickness. The authors formulate the optimization problem for simultaneous state/Vth and state/Vth/Tox assignments under delay constraints and propose both an exact method for its optimal solution and two practical heuristics with reasonable run-time. Experimental results obtained on a number of benchmark circuits demonstrate the viability of the proposed methodology.
Chapter 5 is concerned with the issue of minimizing power consumption of the memory subsystem in complex, multi-processor systems-on-chip (MPSoCs), such as those employed in multi-media applications. The focus is on design solutions and methods for synthesizing memory architectures containing both single-ported and multi-ported memory banks. Power efficiency is achieved by casting the memory partitioning design paradigm to the case of heterogeneous memory structures, in which data need to be accessed in a shared manner by different processing units.
Chapter 6 addresses the relevant problem of minimizing the power consumed by the cache hierarchy of a microprocessor. Several design techniques are discussed, including application-driven automatic and dynamic cache parameter tuning, adoption of configurable victim buffers and frequent-value data encoding and compression.
Power optimization for parallel, variable-voltage/frequency processors is the subject of Chapter 7. Given a processor with such an architecture, this chapter investigates the energy/performance tradeoffs that can be spanned in parallelizing array-intensive applications, taking into account the possibility that individual processing units can operate at different voltage/frequency levels. In assigning voltage levels to processing units, compiler analysis is used to reveal heterogeneity between the loads of the different units in parallel execution.

Chapter 8 provides guidelines for the design and implementation of DSP and multimedia applications onto programmable embedded platforms. The RINGS architecture is first introduced, followed by a detailed discussion on power-efficient design of some of the platform components, namely, the DSPs. Next, design exploration, co-design and co-simulation challenges are addressed, with the goal of offering to the designers the capability of including into the final architecture the right level of programmability (or reconfigurability) to guarantee the required balance between system performance and power consumption.

Chapter 9 targets software power minimization through source code optimization. Different classes of code transformations are first reviewed; next, the chapter outlines a flow for the estimation of the effects that the application of such transformations may have on the power consumed by a software application. At the core of the estimation methodology is the development of power models that allow the decoupling of processor-independent analysis from all the aspects that are tightly related to processor architecture and implementation. The proposed approach to software power minimization is validated through several experiments conducted on a number of embedded processors for different types of benchmark applications.
Reduction of the power consumed by TFT liquid crystal displays, such as those commonly used in consumer electronic products is the subject of Chapter 10. More specifically, techniques for reducing power consumption of transmissive TFT-LCDs using a cold cathode fluorescent lamp backlight are proposed. The rationale behind such techniques is that the transmittance function of the TFT-LCD panel can be adjusted (i.e., scaled) while meeting an upper bound on a contrast distortion metric. Experimental results show that significant power savings can be achieved for still images with very little penalty in image contrast.
Chapter 11 addresses the issue of efficiently accessing remote memories from wireless systems. This problem is particularly important for devices such as palmtops and PDAs, for which local memory space is at a premium and networked memory access is required to support virtual memory swapping. The chapter explores performance and energy of network swapping in comparison with swapping on local microdrives and FLASH memories. Results show that remote swapping over power-manageable wireless network interface cards can be more efficient than local swapping and that both energy and performance can be optimized by means of power-aware reshaping of data requests. In other words, dummy data accesses can be preemptively inserted in the source code to reshape page requests in order to significantly improve the effectiveness of dynamic power management.

Chapter 12 focuses on communication architectures for multi-processor SoCs. The network-on-chip (NoC) paradigm is reviewed, touching upon several issues related to power optimization of such kinds of communication architectures. The analysis proceeds on a layer-by-layer basis, and particular emphasis is given to customized, domain-specific networks, which represent the most promising scenario for communication-energy minimization in multi-processor platforms.
Chapter 13 provides a natural follow up to the theory of NoCs covered in the previous chapter by describing an industrial application of this type of communication architecture. In particular, the authors introduce an innovative methodology for automatically generating the power models of a versatile and parametric on-chip communication IP, namely the STBus by STMicroelectronics. The methodology is validated on a multi-processor hardware platform including four ARM cores accessing a number of peripheral targets, such as SRAM banks, interrupt slaves and ROM memories.
The last contribution, offered in Chapter 14, proposes an integrated end-to-end power management approach for mobile video streaming applications that unifies low-level architectural optimizations (e.g., CPU, memory, registers), OS power-saving mechanisms (e.g., dynamic voltage scaling) and adaptive middleware techniques (e.g., admission control, trans-coding, network traffic regulation). Specifically, interaction parameters between the different levels are identified and optimized to achieve a reduction in the power consumption.
Closing this introductory chapter, the editor would like to thank all the authors for their effort in producing their outstanding contributions in a very short time. A special thanks goes to Mike Thompson and Philippe Magarshack of STMicroelectronics for their keynote presentation at DATE 2004 and for writing the foreword to this book. The editor would also like to acknowledge the support offered by Mark De Jongh and the Kluwer staff during the preparation of the final version of the manuscript. Last, but not least, the editor is grateful to Agnieszka Furman for taking care of most of the “dirty work” related to book editing, paging and preparation of the camera-ready material.
Chapter 1 ULTRA-LOW-POWER DESIGN: DEVICE AND LOGIC DESIGN APPROACHES
Christoph Heer (Infineon Technologies AG) and Ulf Schlichtmann (Technische Universität München)
Abstract Power consumption is increasingly becoming the bottleneck in the design of ICs in advanced process technologies. We give a brief introduction to the major causes of power consumption. Then we report on experiments in an advanced process technology with ultra-low threshold voltage (Vth) devices. It turns out that, in contrast to older process technologies, this approach is becoming less suitable for industrial usage in advanced process technologies. Next, we describe methodologies to reduce power consumption by optimizations in logic design, specifically by utilizing multiple levels of supply voltage Vdd and threshold voltage Vth. We evaluate them from an industrial product development perspective. We also give a brief outlook on proposals at other levels of the design flow and on future work.
Keywords: Low-power design, dynamic power reduction, leakage power reduction, ultra-low-Vth devices, multi-Vdd, multi-Vth, CVS
1.1 INTRODUCTION
The progress of silicon process technology marches on relentlessly. As predicted by Gordon Moore decades ago, silicon process technology continues to achieve improvements at an astonishing pace [1]. The number of transistors that can be integrated on a single IC approximately doubles every 2 years [2,3]. This engineering success has created innovative new industries (e.g. personal computers and peripherals, consumer electronics) and revolutionized other industries (e.g. communications).

Today, however, it is becoming increasingly difficult to achieve improvements at the pace that the industry has become accustomed to. More and more technical challenges appear that require increasing resources to be solved [4]. One such problem is the increasing power consumption of integrated circuits. It becomes even more critical as an increasing number of today’s high-volume consumer products are battery-powered. In the following, we will consider the sources of power consumption and their development over time. We will show why reduction of power consumption is becoming critical to product success and will review traditional approaches in Sections 1.2 and 1.3. In Section 1.4 we will then analyze a potential solution based on the introduction of an optimized transistor with a very low threshold voltage Vth. Thereafter, we will present and discuss logic-level design optimizations for power reduction in Section 1.5. Also, we will briefly point out potential optimizations at higher levels. Our observations are made from the perspective of industrial IC product development, where technical optimizations must be carefully evaluated against the cost associated with achieving and implementing them. Most of the presented methodologies are already being utilized in leading-edge industrial ICs.
1.2 POWER CONSUMPTION BECOMES CRITICAL
Depending on the type of end-product and its application, different aspects of power consumption are the primary concern: dynamic power or leakage power. Reduction of dynamic power consumption is a concern for almost all IC products today. For battery-powered products, reduced power consumption directly results in longer operating time for the product, which is a very desirable characteristic. Even for non-battery-powered products, reduced power consumption brings many advantages, such as reduced cost because of cheaper packaging or higher performance because of lower temperatures. Finally, reduced power consumption often leads to lower system cost (no fans required; no or cheaper air conditioning for data/telecom centers, etc.). Dynamic power consumption is caused by the charging and discharging of capacitances when a circuit switches. In addition, during switching a short-circuit current flows, but this current is typically much smaller and will therefore be neglected in the following. The dynamic current due to capacitance charging and discharging is determined by the following well-known relationship:
P_dyn ~ f · C_L · V_dd^2
Based on constant electrical field scaling, Vdd and C_L are each reduced by 30% in each successive process generation. Delay also decreases by 30%, resulting in a 43% increase in frequency. Therefore, the dynamic power consumption per device is reduced by 50% from one process generation to the next. As scaling also doubles the number of devices that can be implemented in a given die area, dynamic power consumption per area should stay roughly constant. However, historically frequency has increased by significantly more than 43% from one process generation to the next (e.g. in microprocessors, it has roughly doubled, due to architectural optimizations such as deeper pipelines), and in addition, die sizes have increased with each new process technology, further increasing power consumption due to the larger number of active devices [5]. For these reasons, dynamic power consumption has increased exponentially, as shown in Figure 1-1 for the example of microprocessors.

Reduction of leakage power consumption today is primarily a concern for products that are powered by battery and spend most of their operating hours in some type of standby mode, such as cell phones. For many process generations, however, leakage has increased roughly by a factor of 10 for every two process nodes [6]. Due to this dramatic increase, leakage is becoming a significant contribution to overall IC power consumption even in normal operating mode, as can also be seen in Figure 1-1. Leakage was estimated to increase from 0.01% of overall power consumption in a 1.0µm technology to 10% in a 0.1µm technology [6]. For a microprocessor, Intel estimated leakage power consumption at more than 50W for a 100nm technology node [3].
This figure probably is extreme, and leakage depends strongly on a number of factors, such as the threshold voltage (Vth) of the transistor, gate oxide thickness and environmental operating conditions (supply voltage Vdd, temperature T). Nevertheless, for an increasing number of products leakage power consumption is turning into a problem, even when they are not battery-powered.
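The constant-field scaling arithmetic above can be checked in a few lines. This is a sketch of the idealized argument only: the 0.7x per-generation factor and the doubling of device density are the textbook scaling assumptions from the text, not measured data.

```python
# Idealized constant-field scaling check (assumed 0.7x factor per generation).
S = 0.7

V_dd = S         # supply voltage: -30% per generation
C_L = S          # load capacitance: -30%
delay = S        # gate delay: -30%
f = 1.0 / delay  # frequency: ~ +43%

# Dynamic power per device: P_dyn ~ f * C_L * V_dd^2
P_device = f * C_L * V_dd ** 2

# Device density doubles, so power per unit area is roughly unchanged.
P_area = 2 * P_device

print(f"frequency: {f:.2f}x, power/device: {P_device:.2f}x, power/area: {P_area:.2f}x")
```

With these idealized numbers, power per device comes out at roughly 0.49x (halved) and power per area at roughly 0.98x, matching the 50% and "roughly constant" figures in the text.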
Figure 1-1. Development of dynamic and leakage power consumption over time [3,7]
1.3 TRADITIONAL APPROACHES TO POWER REDUCTION
As outlined above, dynamic power consumption is governed by:
P_dyn ~ f · C_L · V_dd^2
with f denoting the switching frequency, C_L the capacitance being switched, and V_dd the supply voltage. This formula immediately identifies the key levers to reduce dynamic power:
• Reduce the operating frequency
• Reduce the switched capacitance
• Reduce the supply voltage
Traditionally, reduction in supply voltage Vdd has been the most often followed strategy to reduce power consumption. Unfortunately, lowering Vdd has the side effect of reducing performance as well, primarily because gate overdrive (the difference between Vdd and Vth) diminishes if the threshold voltage Vth is kept constant. Based on the alpha power law model [8], the delay td of an inverter is given by
t_d = (C_L · V_dd) / (V_dd − V_th)^α
with α denoting a fitting constant. As supply voltages are driven below 1.0V, the reductions in gate overdrive are more pronounced than previously. In addition, newer process technologies give significantly less of a performance boost compared to the previous process generation than has traditionally been the case, therefore a further reduction in performance is highly undesirable. Finally, the power reduction achieved by moving to a new process generation has trended down over time, since supply voltages have been scaled by increasingly less than the 30% prescribed by the constant electrical field scaling paradigm. Consequently, more advanced approaches are required. In the following, our main focus will be on dynamic power consumption, but we will also consider leakage power consumption.
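The effect of shrinking gate overdrive can be illustrated with the alpha-power-law delay model just quoted. The parameter values below (Vth = 0.3V, alpha = 1.3, unit load) are illustrative assumptions, not data from the chapter.

```python
# Relative inverter delay per the alpha-power law:
#   t_d = C_L * V_dd / (V_dd - V_th)^alpha
def inverter_delay(v_dd, v_th=0.3, c_l=1.0, alpha=1.3):
    return c_l * v_dd / (v_dd - v_th) ** alpha

ref = inverter_delay(1.2)  # reference delay at Vdd = 1.2 V
for v_dd in (1.2, 1.0, 0.8, 0.6):
    print(f"Vdd = {v_dd:.1f} V -> relative delay {inverter_delay(v_dd) / ref:.2f}x")
```

The delay penalty grows superlinearly as Vdd approaches Vth, which is why lowering Vdd alone, with Vth held constant, costs so much performance once supplies drop below 1.0V.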
1.4 ZERO-VTH DEVICES
The concept of zero-Vth devices was developed in the mid-1990s. It overcomes the diminishing gate overdrive by radically setting the threshold voltage of the active devices to zero. It has been shown [9] that the optimum power dissipation is obtained if Pleak (leakage contribution) is of the same order of magnitude as Pdyn (dynamic switching contribution). This can be achieved for transistors with Vth close to 0V (‘zero-Vth transistors’). Such devices never completely switch off, but from an overall power perspective the gain in active power consumption is tremendous. Using these transistors, the supply voltage of 130nm circuits can be reduced to values below 0.3V to achieve a Pdyn reduction of 90% without performance degradation. Alternatively, the circuit can be operated at twice the clock frequency when keeping the supply voltage at 1.2V, as shown in Figure 1-2. The corresponding Ion/Ioff ratio for the zero-Vth transistor is about 10-100, instead of >10^5 for the standard transistor options. During standby, the complete circuit is switched off or set into a low-leakage mode to cope with the very high leakage contribution. The low-leakage mode is achieved by ‘active well’ control, which denotes the use of the body effect: the well potentials of the PFETs and NFETs are altered to change Vth. To achieve a lower leakage current, the absolute value of Vth is increased by
reverse back biasing: a negative well-to-source voltage Usb is used. Therefore, voltages below Vss for NFETs and above Vdd for PFETs have to be generated. Furthermore, active well control is required to compensate for lot-to-lot and wafer-to-wafer variations of Vth. The initial ‘zero-Vth’ concept assumed constant junction temperatures Tj below 40°C. For some high-end computer equipment, the cost of active chip cooling is affordable to achieve this junction temperature, but this is definitely not the case for cost-driven consumer products. For this application domain, Tj in active mode ranges between 85°C and 125°C, and in some applications the specified worst-case ambient temperature is as high as 80°C. The proposed zero-Vth concept is therefore not applicable without changes and adaptations.
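A first-order sketch of the ‘active well’ mechanism described above: reverse back biasing shifts Vth through the body effect, and leakage falls exponentially with that shift. The body-effect coefficient and the ~100 mV/decade subthreshold swing used here are assumed, illustrative values, not extracted from the chapter.

```python
# Reverse back biasing sketch: Vth shift via the body effect, with an
# exponential subthreshold leakage model.
K_BODY = 0.090  # assumed Vth shift per volt of reverse well-to-source bias [V/V]
S_SUB = 0.100   # assumed swing: ~100 mV of Vth per decade of leakage [V/decade]

def leakage_relative(v_th, v_th_ref):
    """Leakage relative to a reference Vth (exponential subthreshold model)."""
    return 10 ** (-(v_th - v_th_ref) / S_SUB)

v_th0 = 0.150  # nominal ultra-low Vth [V]
for v_sb in (0.0, 0.5, 1.0):
    v_th = v_th0 + K_BODY * v_sb  # reverse bias raises |Vth|
    print(f"Vsb = {v_sb:.1f} V -> Vth = {v_th * 1000:.0f} mV, "
          f"leakage {leakage_relative(v_th, v_th0):.2f}x")
```

With these numbers, 1V of reverse bias raises Vth by 90 mV and cuts leakage by roughly 8x; the same mechanism, applied per process corner, can also compensate lot-to-lot Vth variations.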
Figure 1-2. Simulated performance curves of transistors with ultra-low Vth. Compared to low-Vth, either a performance gain or a Vdd reduction can be achieved. Curves for reg-Vth and high-Vth transistors of a 130 nm technology are included.
A more conservative approach with respect to zero-Vth, but still aggressive compared to current devices, had to be chosen. An ultra-low-Vth device with a threshold voltage of about 150mV proved to be the best compromise between zero-Vth and the current low-Vth of about 300mV within a 130nm CMOS technology. To identify the optimal choice of Vth and Vdd in combination with the higher junction temperature Tj, simulations with modified parameters of the 130nm low-Vth transistor were performed. In Figure 1-3 the power dissipation is shown for a high-activity circuit (switching activity of 20%) with various options for the transistor threshold voltages: reg-Vth, low-Vth, and transistors whose Vth is reduced to 200mV, 150mV, 100mV and 50mV. The reg-Vth circuit performance was used as the reference (Vdd = 1.5V), and the supply voltages for the other transistor options were reduced to meet that reference performance.
Figure 1-3. Power dissipation at T=125°C in active mode for several transistor options with reduced Vth. Minimum power consumption is achieved at Vth = 150mV. (At T=55°C the minimum is achieved for the same option, but process variations show less impact.)
The reduced supply voltage leads to lower overall active power consumption Pactive. A minimum power consumption is reached at Vth = 150mV. With even lower threshold voltages, Pactive starts to increase again because of the increase in leakage current. The steep rise of Pactive originates from the exponential relation between Vth and leakage current. As a rule of thumb, a 100mV reduction of the threshold voltage allows for a Vdd reduction by ≈ 0.15V, but on the other hand results in a tenfold increase of the leakage current. Figure 1-3 also shows the impact of technology variations. Due to the high leakage contribution, a power reduction of only 25% is achieved under fast process conditions. Using back biasing in reverse mode, the high performance of fast transistors can be reduced by increasing Vth. The corresponding leakage current therefore decreases and allows a power reduction of 50% (stippled arrow).
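The rule of thumb above can be turned into a back-of-the-envelope calculation. The sketch below (illustrative Python, not the chapter's simulation) combines the quadratic Vdd dependence of dynamic power with the tenfold leakage increase per 100mV of Vth reduction; the 1.5V baseline matches the text, while the baseline leakage share (1% of dynamic power) is an assumption.

```python
# Back-of-the-envelope sketch of the stated rules of thumb: a 100 mV Vth
# reduction permits roughly a 0.15 V Vdd reduction at constant performance,
# but multiplies the leakage current tenfold. Baseline leakage share is an
# assumed value for illustration only.

def scaled_power(dvth_mv, vdd0=1.5, p_dyn0=1.0, p_leak0=0.01):
    """Return (dynamic, leakage, total) relative power after lowering Vth."""
    vdd = vdd0 - 0.15 * (dvth_mv / 100.0)      # ~0.15 V per 100 mV of Vth
    p_dyn = p_dyn0 * (vdd / vdd0) ** 2         # dynamic power scales with Vdd^2
    # leakage current grows 10x per 100 mV; leakage power also scales with Vdd
    p_leak = p_leak0 * 10 ** (dvth_mv / 100.0) * (vdd / vdd0)
    return p_dyn, p_leak, p_dyn + p_leak

for dvth in (0, 100, 150, 200, 250):
    d, l, t = scaled_power(dvth)
    print(f"dVth = {dvth:3d} mV: dynamic {d:.3f}, leakage {l:.3f}, total {t:.3f}")
```

With these assumed numbers the total power passes through a minimum and then rises steeply, mirroring the qualitative shape of Figure 1-3; the exact location of the minimum depends on the assumed leakage share.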
A process modification has been developed to manufacture devices with a threshold voltage of 150 mV, which proves to be the most efficient for the target application domain of mobile consumer products [10]. In Table 1-1 the key transistor parameters of our ultra-low-Vth FETs (ulv-FETs) and of the standard low-Vth transistors are listed. The Vth values are 165mV and 161mV for the ulv-NFET and ulv-PFET respectively; Ion increases by 35% and 22%, which translates into an average decrease of the CV/I delay metric by 29%. Circuit simulations showed a performance increase of 25%. Concerning Vth, performance, and Ioff, the target values have been nearly met.
Table 1-1. Extracted key parameters of the ulv-FETs in comparison with the target values and the low-Vth FETs

                                   130nm low-Vt    130nm ulv-FET   Target
                                   NFET / PFET     NFET / PFET
  Ion [µA/µm]                      560 / 240       755 / 295
  Ioff [nA/µm]                     1.2 / 1.2       48 / 17         ≈35
  Vth [mV]                         295 / 260       165 / 160       150
  Body effect [mV/V]               150 / 135       60 / 65         90
  Vth @ ΔL=10nm [mV]               35 / 30         65 / 30
  Vth @ ΔL=15nm [mV]               65 / 70         100 / 90
  Simulated gate delay [rel.]      1               0.8             0.75
The sensitivity of Vth to gate length variation (roll-off) is expressed as the Vth shift per 10nm or 15nm gate length decrease. A comparison with the low-Vth FETs shows a pronounced increase. Therefore, in addition to temperature compensation, back biasing also has to be used to compensate for this strong technology variation.
The values of the body effect are also included in Table 1-1. The body effect is expressed as the Vth shift per 1V of well bias. The ulv-FETs yield values which are lower by more than 50% compared to the low-Vth transistors. The decrease of the body effect in combination with the increased roll-off reduces the leverage of back biasing for ulv-FETs very significantly. The leverage is not even sufficient to compensate the technology variation, since the value of the roll-off is higher than that of the body effect. As an example, the ulv-NFET shows roll-off values of 65mV/10nm and 100mV/15nm and a body effect of only 60mV/V. To investigate the migration potential of the ulv-FETs for future technology generations, Ioff measurement results obtained from recent 90nm hardware were used. Based on this measurement data, the leverage of active well with the standard reg-Vth and low-leakage transistor options has been analyzed. For supply voltages of 1.2V and 0.75V, a reverse back-biasing voltage of 0.5V has been applied. For the NFET, back biasing results in a leakage reduction of 50% to 70% for all transistor widths and for both values of Vdd. In the case of the PFET, the leakage reduction values are similar (60% to 80%) for transistors with W > 0.5µm. For very narrow PFETs with Vdd = 1.2V, the reduction is only 20% or even less. Since narrow FETs are used within SRAMs, which contribute a major part of a circuit's standby current, this small reduction for narrow transistors further diminishes the leverage of active well. The root cause is an additional leakage mechanism based on tunnelling currents across the drain-well junction, which limits the reverse back bias to 0.5V. This tunnelling current depends exponentially on the drain-well voltage and works against any reduction of the sub-threshold current via active well. At Vdd = 0.75V the drain-well voltage is reduced and the tunnelling current is therefore lower.
In this case the effect of back biasing is not compensated by a rising tunnelling current, and a leakage current reduction of 70% is still achieved. For a 90nm technology, the 0.5V limit on the well potential swing restricts the reduction of the leakage currents to a factor between 2 and 4. This is still a major contribution among all feasible measures to reduce standby power consumption, but the leverage has become quite small compared to the reduction ratios of several orders of magnitude obtained in previous technologies [11,12]. In future technologies, Ileak will become more strongly affected by the emerging tunnelling current Igate through the gate of the FET. This is due to the ever-decreasing gate oxide thickness and also due to the fact that even on-state transistors show gate leakage. Igate is not affected by well biasing, reducing the leverage of active well even further.
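The reduced back-biasing leverage of the ulv-FETs can be estimated directly from the figures in the text. The sketch below shifts Vth by the body effect times the applied well bias and converts that shift into a leakage reduction via the exponential Ioff(Vth) relation; the subthreshold swing (~100 mV/decade) is an assumed typical value, not given in the chapter.

```python
# Back-biasing leverage estimate: Vth shift = body effect x reverse bias,
# leakage drops by 10^(dVth / S). Body-effect values (60 and 135 mV/V) and
# the 0.5 V bias limit are from the text; S = 100 mV/dec is an assumption.

def leakage_reduction(body_effect_mv_per_v, vsb_v, swing_mv_per_dec=100.0):
    """Factor by which subthreshold leakage drops under reverse back bias."""
    dvth_mv = body_effect_mv_per_v * vsb_v        # Vth shift from body effect
    return 10 ** (dvth_mv / swing_mv_per_dec)     # exponential Ioff(Vth)

ulv = leakage_reduction(60.0, 0.5)    # ulv-NFET: weak body effect
lv = leakage_reduction(135.0, 0.5)    # low-Vth PFET: stronger body effect
print(f"ulv-FET reduction: {ulv:.1f}x, low-Vth FET reduction: {lv:.1f}x")
```

With the ulv body-effect value this yields only about a 2x leakage reduction at the 0.5V bias limit, consistent with the small leverage (factor 2 to 4) described above.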
In summary, zero-Vth devices have become very susceptible to process and temperature variations. Significant yield is only achievable with back biasing via active well control and with active cooling; the latter approach is not feasible for mobile applications. Therefore, a more conservative approach with respect to zero-Vth, but still aggressive compared to current devices, had to be chosen. An ultra-low-Vth device with a threshold voltage of about 150mV proved to be the best compromise between zero-Vth and the current low-Vth of about 300mV within a 130 nm CMOS technology. But even though fabrication of this ultra-low-Vth device is possible, it affects some standard methods used to overcome short-channel effects. The so-called halo (or pocket) implantation had to be removed to bring the threshold voltage down. Unfortunately, short-channel effects are thereby heavily increased, leading, as shown, to a very strong Vth roll-off at slight variations of the channel length. In the end, this effect was prohibitive for the overall approach and led to the cancellation of many zero-Vth projects in industry [13].
1.5 DESIGN APPROACHES TO POWER REDUCTION
As outlined above, solutions from process technology by themselves will not suffice to provide sufficient power reduction. Therefore, solutions must be found in algorithms, product architecture and logic design. Increasingly, differentiated device options provided by process technology are utilized on these levels in the search for optimized power consumption. For leading-edge products which need to optimize both power consumption and system performance, optimization techniques on the architecture and design level have been proposed and partly already been implemented. While academic research often focuses on the tradeoff between power consumption and performance, industrial product development must also take other variables into consideration:
• Product cost: power optimization design techniques often increase die area, directly affecting manufacturing cost. Utilization of additional devices (e.g. different Vth devices) increases mask count and consequently manufacturing cost, and additionally requires up-front expenditures for the development of such devices. Finally, increased manufacturing complexity poses the risk of lowered manufacturing yield.
• Product robustness: it must be ensured that optimized products still work across the specified range of operating conditions, also taking manufacturing variations into account.
1.5.1 Multi-Vdd Design
As outlined in the introduction, the supply voltage Vdd quadratically impacts dynamic switching power consumption. Thus, lowering Vdd is the preferred option to reduce dynamic power consumption. However, as discussed in Section 1.2, lowering Vdd reduces the system performance, so the incentive to lower Vdd is kept in check by the need to maintain performance. Reduction of Vdd can be applied on different abstraction levels of a design. Most effective regarding power reduction, and also easiest to implement, is to lower Vdd for the entire IC. As this directly impacts the performance of the IC, this often is not an option. On a lower abstraction level, it is possible to lower Vdd for an entire module. This is still rather simple to implement, but if only modules are chosen such that overall IC performance is not impacted, the achieved gains in power reduction will often be very moderate. Finally, a reduction in supply voltage can be applied specifically to individual gates, such that the overall system performance is not reduced. This approach, as shown in Figure 1-4, recognizes that in a typical design, most logic paths are not critical. They can be slowed down, often significantly, without reducing the overall system performance. This slowing down is achieved by lowering the supply voltage Vdd for gates on the non-critical paths, which results in lowered power consumption.
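The leverage of this per-gate approach can be estimated from the quadratic Vdd dependence alone. The sketch below is illustrative (the voltages and gate fractions are assumptions, not values from the chapter): total dynamic power scales with the fraction of switched capacitance moved to Vdd_low.

```python
# Relative dynamic power when a fraction of the gates runs at Vdd_low.
# Dynamic power ~ C * Vdd^2 * f, so moved gates save quadratically.
# Voltages below are illustrative assumptions.

def multi_vdd_power(frac_low, vdd=1.2, vdd_low=0.8):
    """Relative dynamic power when frac_low of switched capacitance is at Vdd_low."""
    return (1 - frac_low) + frac_low * (vdd_low / vdd) ** 2

for frac in (0.0, 0.3, 0.6, 0.9):
    print(f"{frac:.0%} of gates at Vdd_low: P = {multi_vdd_power(frac):.2f} of original")
```

Moving 60% of the switched capacitance to a Vdd_low of two thirds of Vdd saves roughly a third of the dynamic power, which falls within the range of published results discussed below.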
[Figure: two register-to-register path diagrams. Top: a critical path with 10ns delay and a non-critical path with 5ns delay that may be delayed. Bottom: the gates on the non-critical path are supplied with Vdd_low, increasing its delay to 8ns, while the critical path remains at 10ns]
Figure 1-4. Multi-Vdd design
This technique will modify the distribution of path delays in a design to a distribution skewed towards paths with higher delay, as indicated in Figure 1-5 [14].
[Figure: path-delay histograms (number of paths versus delay td, up to the clock period 1/f) for a single supply voltage (SSV) and multiple supply voltages (MSV); the MSV distribution is skewed towards longer delays, with the critical paths close to 1/f]
Figure 1-5. Distribution of path delays under single and multiple supply voltages
A number of studies have shown significant variation in dynamic power reduction results from implementing a multi-Vdd design strategy, ranging from less than 10% up to almost 50%, with 40% being the average [15,16]. Rules of thumb for selecting appropriate supply voltage levels have been developed. When using two supply voltages, the lower Vdd was proposed to be 0.6x-0.7x of the higher Vdd [17]. The optimal supply voltage level also depends on Vth [18]. The benefit of using multiple supply voltages quickly saturates. The major gain is obtained by moving from a single Vdd to dual-Vdd. Extending this to ever more supply voltage levels yields only small incremental benefits [18,19], even when the overhead introduced by multiple supply voltages (see below) is not taken into consideration. The power reduction achieved by this technique roughly depends on two parameters: the difference between the regular supply voltage Vdd and the lowered supply voltage Vdd_low, and the percentage of gates to which Vdd_low is applied. Regarding the first parameter, it was pointed out some years ago that the leverage of this concept decreases as process technologies are scaled down further [18]. Recent work has analyzed this in more detail [14]. At least for high-Vth devices, which are essential for low standby power design due to their lower leakage current, Vth has scaled much slower than Vdd recently. The gate overdrive (Vdd - Vth) is therefore diminished, negatively impacting performance, and even a small reduction in Vdd has a very significant impact on performance. Thus, the potential to lower Vdd while maintaining overall system performance is greatly reduced. It is shown that from 0.25µm down to 0.09µm, the effectiveness of dual-Vdd decreases by a factor of 2 (from 60% dynamic power reduction to 30%) for high-Vth designs, whereas it stays about constant for low-Vth designs.
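The gate-overdrive argument can be made concrete with the alpha-power law delay model of [8]. In the sketch below, alpha = 1.3 and all voltages are illustrative assumptions, not values from the text; the same relative Vdd reduction costs far more delay when the overdrive is already small.

```python
# Alpha-power law delay model: delay ~ Vdd / (Vdd - Vth)^alpha [8].
# alpha and the voltages are assumed values for illustration.

def rel_delay(vdd, vth, alpha=1.3):
    return vdd / (vdd - vth) ** alpha

def delay_penalty(vdd, vdd_low, vth):
    """Delay increase factor when a gate moves from vdd to vdd_low."""
    return rel_delay(vdd_low, vth) / rel_delay(vdd, vth)

# the same 10% Vdd reduction, with generous vs. diminished gate overdrive:
print(f"low Vth (0.25 V): x{delay_penalty(1.2, 1.08, 0.25):.2f} slower")
print(f"high Vth (0.50 V): x{delay_penalty(1.2, 1.08, 0.50):.2f} slower")
```

The high-Vth gate pays roughly twice the relative delay penalty of the low-Vth gate for the same Vdd reduction, which is the mechanism behind the reduced dual-Vdd effectiveness reported for high-Vth designs.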
This can, however, be countered by the introduction of variable threshold voltages, as will be seen later. Regarding the second parameter, experience has shown that especially in designs using the multi-Vth technique outlined below, path delays tend to be skewed towards higher delays already, reducing the number of gates that can be slowed down further [14]. For the selection of the gates which will receive the lower supply voltage Vdd_low, a number of techniques have been proposed. Most prevalent is the concept of clustered voltage scaling (CVS). It recognizes that it is desirable to have clusters of gates assigned to the same voltage, since between the output of a gate supplied by Vdd_low and the input of a gate supplied by Vdd a level shifter is required to avoid static current flow [20]. This concept has been enhanced by extended clustered voltage scaling (ECVS) [17], which essentially allows an arbitrary assignment of supply voltage levels to gates. This strategy implies more frequent insertion of level shifters into the design. However, usually only power consumption and delay are considered in the literature; the additional area cost is neglected. In industry, this certainly is not feasible.
While conceptually simple, the implementation of a multi-Vdd concept poses a number of challenges:
• The additional supply voltage Vdd_low needs to be created on-chip by a dc-to-dc converter, unless the voltage already exists externally. This results in area overhead and in power consumption for the converter.
• The additional supply voltage Vdd_low must be distributed across the chip.
• Level shifters are required between different supply domains. It is feasible to integrate level shifters into flip-flops [21].
The penalties in area, power consumption and delay resulting from these effects are not always taken into account by work published in the literature. Studies indicate that a 10% area overhead will result from implementing a dual-Vdd design [22]. An additional consideration for industrial IC product development is that EDA tool support for implementing a dual-Vdd design is still only rudimentary. It is not sufficient to have a single point tool which can perform power-performance tradeoffs. Instead, the methodology needs to encompass the entire design flow (e.g. power distribution in layout, automated insertion of level shifters, etc.).
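As a toy illustration of the CVS idea (a simplified sketch, not the published algorithm), the code below walks a small netlist in reverse topological order and moves a gate to Vdd_low only if all of its fanouts are already at Vdd_low, so that no level shifter is needed inside the combinational logic, and only if the gate's timing slack absorbs the extra delay. The netlist, slacks and delay penalties are invented values.

```python
# Simplified CVS-style assignment: Vdd_low gates must form clusters towards
# the outputs (a low gate never drives a high gate directly). All numbers
# are illustrative assumptions.

def cvs_assign(gates, fanout, slack, extra_delay):
    """Return the set of gates assigned to Vdd_low under the CVS constraint."""
    low = set()
    for g in reversed(gates):                  # reverse topological order
        fanouts_ok = all(f in low for f in fanout.get(g, []))
        if fanouts_ok and slack[g] >= extra_delay[g]:
            low.add(g)                         # joins the Vdd_low cluster
            # a full flow would re-run timing and update upstream slacks here
    return low

gates = ["g1", "g2", "g3", "g4"]               # topological order g1 -> g4
fanout = {"g1": ["g3"], "g2": ["g3"], "g3": ["g4"], "g4": []}
slack = {"g1": 0.0, "g2": 2.0, "g3": 1.5, "g4": 1.0}   # ns
extra_delay = {g: 0.5 for g in gates}                  # ns at Vdd_low
print(sorted(cvs_assign(gates, fanout, slack, extra_delay)))
```

Here g1 stays at the full Vdd because it has no slack, while g2, g3 and g4 form a Vdd_low cluster reaching the outputs, so level shifters are only needed at register boundaries.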
1.5.2 Multi-Vth Design
Another essential technique is the use of different transistor threshold voltages (multi-Vth design). Primarily, this technique reduces leakage power consumption, thus increasing the standby time of battery-powered ICs. As leakage becomes an increasingly important component of overall power consumption in modern process technologies, this technique also helps to reduce overall power consumption significantly as designs move to more advanced process technologies. The idea is similar to multi-Vdd design: paths that do not need highest performance are implemented with special leakage-reduced transistors (typically higher-Vth transistors, but also thicker gate oxide Tox), as shown in Figure 1-6.
[Figure: the same two-path diagram as Figure 1-4; in the lower part the non-critical path is implemented with high-Vth gates, increasing its delay from 5ns to 8ns, while the critical path remains at 10ns]
Figure 1-6. Multi-Vth design
A typical industrial approach today is to first create a design using lower-Vth transistors to achieve the required performance and then to selectively replace gates off the critical path with higher-Vth (or thicker-Tox) transistors to reduce leakage. Studies in the literature have reported leakage reductions of around 50% up to 80%. Some approaches assume that different Vth levels are provided by the process technology (through doping variations) and propose algorithms to optimally assign Vth levels to transistors, ensuring that performance is not compromised [23,24]. Recently, it has also been proposed to achieve modifications in Vth by modifying transistor length or gate oxide thickness Tox [25]. Design-tool support for this technique is also rudimentary at best. While it is becoming established to design different modules of an IC with different Vth transistors, it is very challenging to do this on the level of individual transistors within a module. The primary reason is that the entire design flow must be able to handle cells with identical functionality and size which differ in their electrical properties. This poses no principal algorithmic problems, but must be consistently implemented in all EDA tools within a design flow.
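A minimal sketch of this replace-off-critical-gates flow is shown below, with invented slack, delay and leakage numbers; a real flow would re-run static timing analysis after each swap instead of using static slacks.

```python
# Greedy multi-Vth assignment sketch: swap gates to high-Vth cells, best
# leakage saving first, wherever the timing slack absorbs the added delay.
# All numbers are illustrative assumptions.

def assign_high_vth(gates):
    """gates: dict name -> (slack_ns, delay_penalty_ns, leakage_saving_nA).
    Returns the set of gates swapped to high-Vth cells."""
    swapped = set()
    # try the most profitable swaps first
    for name, (slack, penalty, saving) in sorted(
            gates.items(), key=lambda kv: -kv[1][2]):
        if slack >= penalty:              # off the critical path: safe to slow
            swapped.add(name)
            # a full flow would update downstream slacks here
    return swapped

gates = {
    "u1": (0.0, 0.2, 50.0),   # on the critical path: must stay low-Vth
    "u2": (1.5, 0.2, 40.0),
    "u3": (0.8, 0.3, 30.0),
    "u4": (0.1, 0.2, 20.0),   # not enough slack for the swap
}
print(sorted(assign_high_vth(gates)))
```

Only u2 and u3 are swapped: u1 would violate timing and u4 lacks the slack, which mirrors the constraint that performance must not be compromised.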
1.5.3 Hybrid Approaches
Recently, approaches have been suggested in the literature which combine multiple supply voltages and multiple threshold voltages for further power reduction. Especially for designs where minimization of total power consumption is key (as compared to, e.g., minimization of standby power for mobile products), it is possible to trade off leakage and dynamic power, as originally proposed in the zero-Vth concept. Studies in the literature indicate a total power optimum when leakage power contributes 10% to 30% [26,12]. This ratio depends significantly on the process technology, operating environment, and clock frequency of a design. For applications where leakage power minimization is critical (e.g. mobile products), this approach usually is not feasible, as it requires a relatively low Vth, which causes high leakage currents [14]. With the increasing significance of gate leakage currents, variations of gate oxide thickness Tox have also been proposed. An overall framework for using two supply voltages and two threshold voltages has been presented [19]. Theoretically, it is shown that more than 60% of total power consumption can be saved this way (not considering required overhead such as level shifters, routing, etc.). Rules of thumb are proposed, and it is shown that the optimal second Vdd is about 50% of the original Vdd in this case. It is also argued that the usefulness of multi-Vdd strategies is not diminished but actually increased in more advanced technologies if a multi-Vth strategy is also followed, since this allows leakage to be traded off against dynamic power consumption by changing Vth and Vdd, while maintaining the required timing performance. This approach has been applied to the practical example of an ARM processor in [27]. Due to specific layout considerations it was not possible to implement all four intended combinations of Vdd and Vth. Instead, three different libraries were implemented.
Using a CVS algorithm, a reduction in dynamic power of 15% was achieved for a 0.18µm process technology. Leakage power was reduced by 40%. As leakage power was more than 1000x smaller than dynamic power, the overall active power reduction was 15%. To achieve this, a 14% increase in area was required. A very recent approach also considers transistor width sizing in addition to Vdd and Vth assignment [28]. Using a two-stage, sensitivity-based approach, total power savings of 37% on average over a suite of benchmark circuits are reported. In this study, the threshold voltage is chosen rather low, so that leakage represents 20-50% of total power consumption. Therefore, optimization of both leakage and dynamic power consumption is essential, which is achieved with the presented approach.
An enhanced approach for leakage power reduction considers multiple gate oxide thicknesses Tox in addition to multi-Vth [29]. It is motivated by the fact that gate leakage increases dramatically with newer process technologies; at the 90nm process node, gate leakage is of the same order of magnitude as subthreshold leakage. Their relationship also depends significantly on the operating temperature T. The key observation, that an OFF transistor suffers from subthreshold leakage while an ON transistor suffers from gate leakage, motivates the approach of analyzing transistor states in standby mode and assigning Vth and Tox such that leakage power consumption is minimized. Leakage reductions of 5-6x are obtained on benchmark circuits, compared to designs using a single Vth and Tox. Previous approaches that included Tox in the optimization varied Tox only for different design modules, not on critical paths within modules. These newer approaches promise further reductions in power consumption. This will come, however, at a price (as seen, e.g., in the ARM example): design complexity increases significantly when variations in many parameters are made available at the same time. In some studies, the resulting overhead is not considered.
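The state-based observation can be illustrated with a toy decision model (the leakage numbers, the 10x knob strengths, and the one-knob-per-device delay budget are all assumptions for illustration, not values from [29]):

```python
# Toy state-based Vth/Tox assignment: in the known standby state, an OFF
# transistor leaks mainly by subthreshold current (cut by high Vth), an ON
# transistor mainly by gate tunnelling (cut by thick Tox). All numbers are
# illustrative assumptions.

def standby_leakage(state, vth, tox):
    """Illustrative leakage (nA) of one transistor in its standby state."""
    sub = (40.0 if state == "off" else 0.0) / (10.0 if vth == "high" else 1.0)
    gate = (30.0 if state == "on" else 5.0) / (10.0 if tox == "thick" else 1.0)
    return sub + gate

def best_assignment(state):
    # assume the delay budget allows at most one non-nominal knob per device
    options = [("low", "thin"), ("high", "thin"), ("low", "thick")]
    return min(options, key=lambda vt: standby_leakage(state, *vt))

print("OFF transistor ->", best_assignment("off"))   # high Vth wins
print("ON  transistor ->", best_assignment("on"))    # thick Tox wins
```

Even this crude model picks the knob that matches the dominant leakage mechanism of each standby state, which is the essence of the simultaneous state/Vth/Tox assignment described above.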
1.5.4 Cost Tradeoffs
This overhead must be considered, however, since it is quite significant:
• Multi-Vdd: level shifters (area, power consumption, delay); routing of additional supply voltages (area).
• Multi-Vth: additional masks (manufacturing cost); potentially special design rules at the boundary between different Vth devices (area).
• Multi-Tox: additional masks (manufacturing cost).
• In addition, IC development costs increase due to more complex design flows.
Also, special process options (Vth, Tox) must be developed, qualified and continuously monitored. For each such option, the design library must be electrically characterized, modelled for all EDA tools, and potentially optimized regarding circuit design and layout. It must also be maintained and regularly updated (changes in electrical parameters, changes in tools in the design flow) over a long period of time. If a very specialized manufacturing flow is developed to fully optimize a given product, it will be very difficult to shift manufacturing of this product to a different fab (e.g. a foundry, in case additional capacity is required). For these and potentially other reasons, we are not yet aware of industrial products that have implemented such proposals in a fine-grained manner (i.e. different Vth, Vdd and Tox combined within one design module).
Some approaches in the literature also determine optimum levels of threshold voltages depending on a given design. In industry, this is rarely feasible: typically, a manufacturing process has to be taken as given, with only predefined values of Vth (and Tox) being available.
1.6 APPROACHES ON HIGHER ABSTRACTION LEVELS
The approaches outlined above on gate level and device level can be (and often must be) supported by measures on higher levels of abstraction. Some of the most promising concepts are as follows:
• partitioning the system such that large areas can be powered off for significant periods of time (block turnoff)
• partitioning memory systems in particular, such that large parts can be turned off in standby mode
• clock gating, an essential method which reduces dynamic power consumption by locally switching off the clock to inactive logic
• coding strategies (e.g. for buses), which can reduce switching and thus dynamic power consumption
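As one classical example of such a bus coding strategy (bus-invert coding, used here purely as an illustration; the chapter does not name a specific code): if more than half the bus lines would toggle, the inverted word is transmitted together with one extra "invert" line, capping the transitions per transfer at N/2 + 1.

```python
# Bus-invert coding sketch for an 8-bit bus. The encoder counts the lines
# that would toggle relative to the previously driven value and inverts the
# word when more than half would switch.

def bus_invert(prev, word, width=8):
    """Return (value driven on the bus, invert-line state)."""
    toggles = bin(prev ^ word).count("1")
    if toggles > width // 2:
        return word ^ ((1 << width) - 1), 1    # send the inverted word
    return word, 0

prev = 0b00000000
for word in (0b11111110, 0b11111111):
    driven, inv = bus_invert(prev, word)
    print(f"word {word:08b} -> bus {driven:08b}, invert={inv}")
    prev = driven                              # transitions occur on the bus
```

In this example, sending 0b11111110 after an all-zero bus would toggle seven lines; with inversion only one data line plus the invert line switch, reducing the dynamic power drawn by the bus drivers.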
1.7 CONCLUSION AND FUTURE CHALLENGES
There is no single “silver bullet” to solve the challenge of power reduction. While ultra-low-voltage logic based on special ultra-low-Vth devices is conceptually very attractive, its widespread implementation is hindered by manufacturing concerns. An extrapolation of current technology trends indicates that such a concept will become even more difficult to realize in the future. Today, design techniques are the most promising approach to reduce power, both dynamic and leakage. The concepts outlined here can be further extended; it is feasible, for example, to dynamically adjust supply and threshold voltages. These are theoretically promising concepts which, however, still require more investigation, especially with regard to feasibility under industrial boundary conditions. Quite likely, in the future even more emphasis than today will have to be placed on power reduction schemes on the algorithmic and system level. On these levels, the levers to reduce power consumption are largest.
Acknowledgement
The authors wish to acknowledge and thank Jörg Berthold and Tim Schönauer for their contributions and fruitful discussions.
References
[1] G. Moore, Cramming More Components onto Integrated Circuits, Electronics Magazine, Vol. 38, No. 8, 1965, pp. 114-117.
[2] ITRS, International Technology Roadmap for Semiconductors, 2003, http://public.itrs.net.
[3] F. Pollack, New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies, Micro32 Keynote, 1999.
[4] U. Schlichtmann, Systems are Made from Transistors: UDSM Technology Creates New Challenges for Library and IC Development, IEEE Euromicro Symposium on Digital System Design, 2002, pp. 1-2.
[5] S. Borkar, Design Challenges of Technology Scaling, IEEE Micro, July/August 1999, pp. 23-29.
[6] S. Thompson, P. Packan, M. Bohr, MOS Scaling: Transistor Challenges for the 21st Century, Intel Technology Journal, Q3 1998.
[7] N. Kim et al., Leakage Current: Moore's Law Meets Static Power, IEEE Computer, Vol. 36, No. 12, December 2003, pp. 68-75.
[8] T. Sakurai, A. R. Newton, Alpha-Power Law MOSFET Model and its Application to CMOS Inverter Delay and Other Formulas, IEEE Journal of Solid-State Circuits, Vol. 25, No. 2, 1990, pp. 584-594.
[9] J. B. Burr, J. Schott, A 200 mV Self-Testing Encoder/Decoder Using Stanford Ultra-Low-Power CMOS, IEEE International Solid-State Circuits Conference, 1994.
[10] J. Berthold, R. Nadal, C. Heer, Optionen für Low-Power-Konzepte in den sub-180-nm-CMOS-Technologien (in German), U.R.S.I. Kleinheubacher Tagung, 2002.
[11] V. Svilan, M. Matsui, J. B. Burr, Energy-Efficient 32 x 32-bit Multiplier in Tunable Near-Zero Threshold CMOS, ISLPED 2000, pp. 268-272.
[12] V. Svilan, J. B. Burr, L. Tyler, Effects of Elevated Temperature on Tunable Near-Zero Threshold CMOS, ISLPED 2001, pp. 255-258.
[13] C. Heer, Designing Low-Power Circuits: An Industrial Point of View, PATMOS 2001.
[14] T. Schoenauer, J. Berthold, C. Heer, Reduced Leverage of Dual Supply Voltages in Ultra Deep Submicron Technologies, International Workshop on Power and Timing Modeling, Optimization and Simulation PATMOS 2003, pp. 41-50.
[15] K. Usami, M. Igarashi, Low-Power Design Methodology and Applications Utilizing Dual Supply Voltages, Proceedings of the Asia and South Pacific Design Automation Conference, 2000, pp. 123-128.
[16] M. Donno, L. Macchiarulo, A. Macii, E. Macii, M. Poncino, Enhanced Clustered Voltage Scaling for Low Power, Proceedings of the 12th ACM Great Lakes Symposium on VLSI, 2002, pp. 18-23.
[17] K. Usami et al., Automated Low-Power Technique Exploiting Multiple Supply Voltages Applied to a Media Processor, IEEE Journal of Solid-State Circuits, Vol. 33, No. 3, March 1998, pp. 463-472.
[18] M. Hamada, Y. Ootaguro, T. Kuroda, Utilizing Surplus Timing for Power Reduction, Proceedings of the IEEE Custom Integrated Circuits Conference CICC, 2001, pp. 89-92.
[19] A. Srivastava, D. Sylvester, Minimizing Total Power by Simultaneous Vdd/Vth Assignment, Proceedings of the Asia and South Pacific Design Automation Conference, 2003, pp. 400-403.
[20] K. Usami, M. Horowitz, Clustered Voltage Scaling Technique for Low-Power Design, Proceedings of the International Symposium on Low Power Design ISLPD, 1995, pp. 3-8.
[21] K. Usami et al., Design Methodology of Ultra Low-Power MPEG4 Codec Core Exploiting Voltage Scaling Techniques, Proceedings of the 35th Design Automation Conference, 1998, pp. 483-488.
[22] C. Yeh, Y.-S. Kang, Layout Techniques Supporting the Use of Dual Supply Voltages for Cell-Based Designs, Proceedings of the 36th Design Automation Conference, 1999, pp. 62-67.
[23] Q. Wang, S. Vrudhula, Algorithms for Minimizing Standby Power in Deep Submicrometer, Dual-Vt CMOS Circuits, IEEE Transactions on CAD, Vol. 21, No. 3, March 2002, pp. 306-318.
[24] L. Wei, Z. Chen, K. Roy, M. Johnson, Y. Ye, V. De, Design and Optimization of Dual-Threshold Circuits for Low-Voltage Low-Power Applications, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 7, No. 1, March 1999, pp. 16-24.
[25] N. Sirisantana, K. Roy, Low-Power Design Using Multiple Channel Lengths and Oxide Thicknesses, IEEE Design & Test of Computers, January-February 2004, pp. 56-63.
[26] K. Nose, T. Sakurai, Optimization of VDD and VTH for Low-Power and High-Speed Applications, Proceedings of the Asia and South Pacific Design Automation Conference, 2000, pp. 469-474.
[27] R. Bai, S. Kulkarni, W. Kwong, A. Srivastava, D. Sylvester, D. Blaauw, An Implementation of a 32-bit ARM Processor Using Dual Power Supplies and Dual Threshold Voltages, IEEE International Symposium on VLSI, 2003, pp. 149-154.
[28] A. Srivastava, D. Sylvester, D. Blaauw, Concurrent Sizing, Vdd and Vth Assignment for Low-Power Design, Proceedings of the Design, Automation and Test in Europe Conference DATE, 2003, pp. 718-719.
[29] D. Lee, H. Deogun, D. Blaauw, D. Sylvester, Simultaneous State, Vt and Tox Assignment for Total Standby Power Minimization, Proceedings of the Design, Automation and Test in Europe Conference DATE, 2003, pp. 494-499.
Chapter 2
ON-CHIP OPTICAL INTERCONNECT FOR LOW-POWER
Ian O’Connor and Frédéric Gaffiot
Ecole Centrale de Lyon
Abstract: It is an accepted fact that process scaling and operating frequency both contribute to increasing integrated circuit power dissipation due to interconnect. Extrapolating this trend leads to a red brick wall which only radically different interconnect architectures and/or technologies will be able to overcome. The aim of this chapter is to explain how, by exploiting recent advances in integrated optical devices, optical interconnect within systems on chip can be realised. We describe our vision for heterogeneous integration of a photonic “above-IC” communication layer. Two applications are detailed: clock distribution and data communication using wavelength division multiplexing. For the first application, a design method will be described, enabling quantitative comparisons with electrical clock trees. For the second, more long-term, application, our views will be given on the use of various photonic devices to realise a network on chip that is reconfigurable in terms of the wavelength used.
Keywords: Interconnect technology, optical interconnect, optical network on chip
2.1 INTRODUCTION

In the 2003 edition of the ITRS roadmap [17], the interconnect problem was summarised thus: “For the long term, material innovation with traditional scaling will no longer satisfy performance requirements. Interconnect innovation with optical, RF, or vertical integration ... will deliver the solution”. Continually shrinking feature sizes, higher clock frequencies, and growth in complexity are all negative factors as far as switching charges on metallic interconnect is concerned. Even with low-resistance metals such as copper and low-dielectric-constant materials, bandwidths for long interconnect will be insufficient for future operating frequencies. Already the use of metal tracks to transport a signal over a chip has a high cost in terms of power: clock distribution, for instance,
requires a significant part (30-50%) of total chip power in high-performance microprocessors. A promising approach to the interconnect problem is the use of an optical interconnect layer, which could increase the ratio between data rate and power dissipation. At the same time it would enable synchronous operation within the circuit and with other circuits, relax constraints on thermal dissipation and sensitivity, signal interference and distortion, and also free up routing resources for complex systems. However, this comes at a price. Firstly, high-speed and low-power interface circuits are required, whose design is not easy and has a direct influence on the overall performance of optical interconnect. Another important constraint is that all fabrication steps have to be compatible with future IC technology and that the additional cost incurred remains affordable. Additionally, predictive design technology is required to quantify the performance gain of optical interconnect solutions, where information is scant and disparate concerning not only the optical technology, but also the CMOS technologies for which optics could be used (post-45nm node). In section 2.2, we describe the “above-IC” optical technology. Sections 2.3 and 2.4 describe an optical clock distribution network and a quantitative electrical-optical power comparison, respectively. A proposal for a novel optical network on chip is discussed in section 2.5.
2.2 OPTICAL INTERCONNECT TECHNOLOGY

Various technological solutions may be proposed for integrating an optical transport layer in a standard CMOS system. In our opinion, the most promising approach makes use of hybrid (3D) integration of the optical layer above a complete CMOS IC, as shown in fig. 2.1. The basic CMOS process remains the same, since the optical layer can be fabricated independently. The weakness of this approach is in the complex electrical link between the CMOS interface circuits and the optical sources (via stack and advanced bonding).

In the system shown in fig. 2.1, a CMOS source driver circuit modulates the current flowing through a biased III-V microsource through a via stack making the electrical connection between the CMOS devices and the optical layer. III-V active devices are chosen in preference to Si-based optical devices for high-speed and high-wavelength operation. The microsource is coupled to the passive waveguide structure, where silicon is used as the core and SiO2 as the cladding material. Si/SiO2 structures are compatible with conventional silicon technology, and silicon is an excellent material for transmitting wavelengths above 1.2µm (mono-mode waveguiding with attenuation as low as 0.8dB/cm has been demonstrated [10]). The waveguide structure transports the optical signal to a III-V photodetector (or possibly to several, as in the case of a broadcast function) where it is converted to an electrical photocurrent, which flows through another via stack to a CMOS receiver circuit which regenerates the digital output signal. This signal can then if necessary be distributed over a small zone by a local electrical interconnect network.

Figure 2.1. Cross-section of the hybridised interconnection structure: III-V laser source and photodetector above a Si photonic waveguide (n=3.5) with SiO2 cladding (n=1.5), connected through the metallic interconnect structure to the CMOS driver and receiver circuits.
2.3 AN OPTICAL CLOCK DISTRIBUTION NETWORK

In this section we present the structure of the optical clock distribution network, and detail the characteristics of each component part in the system: active optoelectronic devices (external VCSEL source and PIN detector), passive waveguides, and interface (driver and receiver) circuits. The latter are extremely critical parts to the operation of the overall link and require particularly careful design.

An optical clock distribution network, shown in fig. 2.2, requires a single photonic source coupled to a symmetrical waveguide structure routing to a number of optical receivers. At the receivers the high-speed optical signal is converted to an electrical one and provided to local electrical networks. Hence the primary tree is optical, while the secondary tree is electrical. It is not feasible to route the optical signal all the way down to the individual gate level, since each drop point requires a receiver circuit which consumes area and power. The clock signal is thus routed optically to a number of drop points, each of which covers a zone over which the last part of the clock distribution is carried out by the electrical secondary clock tree. The size of the zones is determined by calculating the power required to continue in the optical domain and comparing it to the power required to distribute over the zone in the electrical domain. The number of clock distribution points (64 in the figure) is a particularly crucial parameter in the overall system. The global optical H-tree was optimised to achieve minimal optical losses by designing the bend radii to be as large as possible. For a 20mm die width and 64 output nodes in the H-tree at the 70nm technology node, the smallest radius of curvature (r3 in fig. 2.2) is 625µm, which leads to negligible pure bending loss.
Figure 2.2. Optical H-tree clock distribution network (OCDN) with 64 output nodes, showing the optical source, waveguides, optical receivers and local electrical clock trees. r1 = D/8, r2 = D/16 and r3 = D/32 are the bend radii linked to the chip width D. The labelled losses are: LCV, source-waveguide coupling loss; LW, waveguide transmission loss; LB, bending loss; LY, Y-coupler loss; LCR, waveguide-receiver coupling loss.
2.3.1 VCSEL sources

VCSELs (Vertical Cavity Surface Emitting Lasers) are certainly the most mature emitters for on-chip or chip-to-chip interconnections. Commercial VCSELs, when forward biased at a voltage well above 1.5V, can emit optical power of the order of a few mW around 850nm, with an efficiency of some 40%. Threshold currents are typically in the mA range. However, the fundamental requirements for integrated semiconductor lasers in optical interconnect applications are small size, low-threshold lasing operation and single-mode operation (i.e. only one mode is allowed in the gain spectrum). Additionally, the fact that VCSELs emit light vertically makes coupling more difficult. It is clear that significant effort is required from the research community if VCSELs are to compete seriously in the on-chip optical interconnect arena: wavelength and efficiency must be increased, and threshold current reduced, in the same device. Long-wavelength and low-threshold VCSELs are only just beginning to emerge (for example, a 1.5µm, 2.5Gb/s tuneable VCSEL [5], and an 850nm, 70µA threshold current, 2.6µm diameter CMOS compatible VCSEL [11] have been reported). Ultimately however, optical interconnect is more likely to make use of integrated microsources as described in section 2.5, as these devices are intrinsically better suited to this type of application.
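As a rough illustration of the figures quoted above, the drive current needed for a given optical output can be estimated from a first-order laser light-current model, P_out = η·(I − I_th). The slope efficiency and threshold values below are illustrative assumptions in the typical commercial range mentioned in the text, not data for a specific device.

```python
def drive_current(p_out_w, slope_eff_w_per_a, i_th_a):
    """First-order laser L-I model: P_out = eta * (I - I_th) above threshold."""
    return i_th_a + p_out_w / slope_eff_w_per_a

# Illustrative figures: ~40% slope efficiency and a 1 mA threshold current.
i = drive_current(p_out_w=1e-3, slope_eff_w_per_a=0.4, i_th_a=1e-3)
print(f"{i * 1e3:.2f} mA")  # 3.50 mA of drive current for 1 mW optical output
```

The same model makes clear why the 70µA-threshold device of [11] matters: the bias term I_th is paid continuously, whether or not data is being transmitted.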
2.3.2 PIN photodetectors

In order to optimise the frequency and power dissipation performance of the overall link, photodetectors must exhibit high quantum efficiency, large intrinsic bandwidth and small parasitic capacitance. Photodetector performance is measured by the bandwidth-efficiency product. Conventional III-V PIN devices suffer from two main limitations. On one hand, their relatively high capacitance per unit area leads to limitations in the design of the transimpedance amplifier interface circuit. On the other hand, due to their vertical structure, there is a tradeoff between frequency performance and efficiency (the quantum efficiency increases and the bandwidth decreases with the absorption intrinsic layer thickness) [9].

Metal-semiconductor-metal (MSM) photodetectors offer an alternative to conventional PIN photodetectors. An MSM photodetector consists of interdigitated metal contacts on top of an absorption layer. Because of their lateral structure, MSM photodetectors have very high bandwidths due to their low capacitance and the possibility of reducing the carrier transit time. However, their responsivity is usually low compared to PIN photodetectors [4]. MSM photodiodes with bandwidths greater than 100GHz have been reported.
2.3.3 Waveguides

Optical waveguides are at the heart of the optical interconnect concept. In the Si/SiO2 approach, the high relative refractive index difference ∆ = (n1² − n2²)/2n1² between the core (n1 ≈ 3.5 for Si) and cladding (n2 ≈ 1.5 for SiO2) allows the realisation of a compact optical circuit with dimensions compatible with DSM technologies. For example, it is possible to realise monomode waveguides less than 1µm wide (waveguide width of 0.3µm for wavelengths of 1.55µm), with bend radii of the order of a few µm [15].

However, the performance of the complete optical system depends on the minimum optical power required by the receiver and on the efficiency of the passive optical devices used in the system. The total loss in any optical link is the sum of the losses (in decibels) of all optical components:

Ltotal = LCV + LW + LB + LY + LCR    (2.1)

where
LCV is the coupling loss between the photonic source and the optical waveguide. There are currently several methods to couple the beam emitted from the laser into the optical waveguide. In this analysis we assumed a 50% coupling efficiency from the source to a single-mode waveguide.

LW is the rectangular waveguide transmission loss per unit distance of the optical power. Due to the small waveguide dimensions and the large index change at the core/cladding interface in the Si/SiO2 waveguide, side-wall scattering is the dominant source of loss (fig. 2.3a). For the waveguide fabricated by Lee [10] with a roughness of 2nm, the calculated transmission loss is 1.3dB/cm.

LB is the bending loss, highly dependent on the refractive index difference ∆ between the core and cladding medium. In Si/SiO2 waveguides ∆ is relatively high, and due to this strong optical confinement, bend radii as small as a few µm may be realised. As can be seen from fig. 2.3b, the bending losses associated with a single-mode strip waveguide are negligible if the radius of curvature is larger than 3µm.

LY is the Y-coupler loss, which depends on the reflection and scattering attenuation into the propagation path and surrounding medium. For high index difference waveguides the losses for the Y-branch are significantly smaller than for low-∆ structures, and the simulated losses are less than 0.2dB per split [14].

LCR is the coupling loss from the waveguide to the optical receiver. Using currently available materials and methods it is possible to achieve an almost 100% coupling efficiency from waveguide to optical receiver. In this analysis the coupling efficiency is assumed to be 87% (LCR = 0.6dB) [16].
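Equation 2.1 can be applied directly to the H-tree of fig. 2.2. The sketch below uses the per-component figures quoted in this section (3dB for 50% input coupling, 1.3dB/cm transmission over an assumed ~1cm path, 0.2dB excess loss per Y-branch plus the intrinsic 3dB power division per split, 0.6dB output coupling); the ~1.8dB bend-loss figure is taken from the power budget presented later in table 2.2, and is an approximation here since it actually varies slightly with the number of nodes.

```python
import math

def h_tree_loss_db(nodes, l_cv=3.0, l_w=1.3, l_b=1.8, l_y_excess=0.2, l_cr=0.6):
    """Total optical loss (eq. 2.1) for an H-tree with the given number of
    output nodes. Each 1-to-2 split costs 3 dB of intrinsic power division
    plus a small excess Y-branch loss."""
    splits = int(math.log2(nodes))
    l_y = splits * (3.0 + l_y_excess)
    return l_cv + l_w + l_b + l_y + l_cr

print(f"{h_tree_loss_db(64):.1f} dB")  # ~25.9 dB for 64 drop points
```

The dominant term is the splitting loss: each doubling of the number of drop points adds a further 3.2dB to the budget, which is why the drop-point count is such a critical system parameter.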
2.3.4 Interface circuits

High-speed CMOS optoelectronic interface circuits are crucial building blocks in the optical interconnect approach. The electrical power dissipation of the link is defined by these circuits, but it is the receiver circuit that poses the most serious design challenges. The power dissipated by the source driver is mainly determined by the source bias current and is therefore device-dependent. On the receiver side however, most of the receiver power is due to the circuit, while only a small fraction is required for the photodetector device.
Figure 2.3a. Simulated transmission loss for varying sidewall roughness in a 0.5µm × 0.2µm Si/SiO2 strip waveguide.

Figure 2.3b. Simulated pure bending loss for various bend radii in a 0.5µm × 0.2µm Si/SiO2 strip waveguide.
2.3.4.1 Driver circuits. Source driver circuits generally use a current modulation scheme for high-speed operation. The source always has to be biased above its threshold current by a MOS current sink to eliminate turn-on delays, which is why low-threshold sources are so important (figures of the order of 40µA [7] have been reported). A switched current sink modulates the current flowing through the source, and consequently the output optical power injected into the waveguide. As with most current-mode circuits, high bandwidth can be achieved since the voltage over the source is held relatively constant and parasitic capacitances at this node have reduced influence on the speed.
2.3.4.2 Receiver circuits. A typical structure for a high-speed photoreceiver circuit consists of: a transimpedance amplifier (TIA) to convert the photocurrent of a few µA into a voltage of a few mV; a comparator to generate a rail-to-rail signal; and a data recovery circuit to eliminate jitter from the restored signal. Of these, the TIA is arguably the most critical component for high-speed performance, since it has to cope with a generally large photodiode capacitance situated at its input.

The basic transimpedance amplifier structure in a typical configuration is shown in fig. 2.4 [8]. The bandwidth/power ratio of this structure can be maximised by using small-signal analysis and mapping of the individual component values to a filter approximation of Butterworth type. It is then possible to develop a synthesis procedure which, from the desired transimpedance performance criteria (gain Zg0, bandwidth and pole quality factor Q) and operating conditions (photodiode and load capacitances, Cd and Cl respectively), generates component values for the feedback resistance Rf and the voltage amplifier (voltage gain Av and output resistance Ro). Circuits with a high Ro/Av ratio (≈ 1/gm) require the least quiescent current and area, and this quantity therefore constitutes an important figure of merit in design-space exploration (fig. 2.5a).

Figure 2.4. CMOS transimpedance amplifier structure, annotated with the design equations relating the closed-loop bandwidth ω0, pole quality factor Q and transimpedance gain Zg0 to the feedback resistance Rf, the amplifier gain Av and output resistance Ro, and the capacitance ratios formed from Cd, Ci, Co and Cl.

To reach a sized transistor-level circuit, approximate equations for the small-signal characteristics and bias conditions of the circuit are sufficient to allow a first-cut sizing of the amplifier, which can then be fine-tuned by numerical or manual optimisation, using simulation for exact results. The complete process is described in [13].
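The full Butterworth-based synthesis is given in [13]; as a rough sanity check, a first-order single-pole approximation of a resistive-feedback TIA (our simplification, not the chapter's procedure) already shows how the required amplifier gain grows with target bandwidth, feedback resistance and photodiode capacitance:

```python
import math

def required_gain(f_3db_hz, rf_ohm, c_in_f):
    """Single-pole resistive-feedback TIA approximation:
    f_3dB ≈ (1 + Av) / (2*pi*Rf*C_in), solved for the voltage gain Av.
    C_in lumps the photodiode and amplifier input capacitance."""
    return 2 * math.pi * f_3db_hz * rf_ohm * c_in_f - 1

# Hypothetical operating point: 1 kOhm feedback resistance (transimpedance
# gain ≈ Rf), 400 fF photodiode capacitance, 5.6 GHz clock frequency.
av = required_gain(5.6e9, 1e3, 400e-15)
print(f"Av ≈ {av:.1f}")
```

Even in this crude model, the product of bandwidth, gain and input capacitance fixes the amplifier gain requirement, which is why reducing photodetector capacitance directly relaxes the receiver design.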
Figure 2.5a. TIA Ro/Av design space with varying bandwidth and transimpedance gain requirements (Ci = 500fF, Cl = 100fF, 1THzΩ gain-bandwidth).

Figure 2.5b. Evolution of TIA characteristics (power, area, noise) with technology node (Cd = 400fF, Cl = 150fF).
Using this methodology with industrial transistor models for technology nodes from 350nm to 180nm, and predictive BSIM3v3/BSIM4 models for technology nodes from 130nm down to 45nm [3], we generated design parameters for 1THzΩ transimpedance amplifiers to evaluate the evolution in critical characteristics with technology node. Fig. 2.5b shows the results of transistor-level simulation of fully generated photoreceiver circuits at each technology node.
2.4 QUANTITATIVE POWER COMPARISON BETWEEN ELECTRICAL AND OPTICAL CLOCK DISTRIBUTION NETWORKS

2.4.1 Design methodology

In an optical link there are two main sources of electrical power dissipation: (i) power dissipated by the optical receiver(s) and (ii) energy needed by the optical source(s) to provide the required optical output power. To estimate the electrical power dissipated in the system we developed the methodology shown in fig. 2.6.
Figure 2.6. Methodology used to estimate the electrical power dissipation in an optical clock distribution network: from the BER specification (SNR requirement) and the photodiode characteristics (R, Cd, Idark), the transimpedance amplifier design yields the minimum optical power at the receiver; adding the losses in the passive waveguide network gives the minimum optical power at the source, and hence, via the emitter efficiency, the total electrical power dissipated in the optical system.
The first criterion for defining the performance of the optoelectronic link is the required signal transmission quality, represented by the bit error rate (BER) and directly linked to the photoreceiver signal-to-noise ratio. For an on-chip interconnect network, a BER of 10⁻¹⁵ is acceptable. To calculate the required signal power at the receiver, the characteristics of the receiver circuit have to be extracted from the transistor-level schematic, which is generated from the photodetector characteristics (responsivity R, Cd, dark current Idark) and from the required operating frequency using the method described in section 2.3. For the given BER and for the noise signal associated with the photodiode and transimpedance circuit, the minimum optical power required by the receiver to operate at the given error probability can be calculated using the Morikuni formula [12].

With this figure, and knowing the layout and therefore the optical losses that will be incurred in the waveguides, the minimum required optical power at the source can be estimated. The total electrical power dissipated in the optical link is the sum of the power dissipated by the optical receivers and the energy needed by the source to provide the required optical power. The electrical power dissipated by the receivers can be extracted from transistor-level simulations. To estimate the energy needed by the optical source, the laser light-current characteristics given by Amann [1] were used.
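The chapter computes receiver sensitivity with the Morikuni formula; a simplified Gaussian-noise estimate (our approximation, with an assumed input-referred noise current) illustrates the chain from BER target to minimum optical power:

```python
import math

def q_for_ber(ber):
    """Solve BER = 0.5*erfc(Q/sqrt(2)) for the Gaussian Q-factor by bisection."""
    lo, hi = 0.0, 20.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if 0.5 * math.erfc(mid / math.sqrt(2)) > ber:
            lo = mid  # error rate still too high: need a larger Q
        else:
            hi = mid
    return (lo + hi) / 2

def sensitivity_dbm(ber, responsivity_a_per_w, noise_rms_a):
    """Average optical power for on-off keying with ideal extinction:
    P = Q * sigma / R (a textbook Gaussian-noise approximation)."""
    p_w = q_for_ber(ber) * noise_rms_a / responsivity_a_per_w
    return 10 * math.log10(p_w / 1e-3)

q = q_for_ber(1e-15)  # Q ≈ 7.9 for a BER of 1e-15
# With R = 0.95 A/W (table 2.1b) and an assumed 0.7 µA rms noise current,
# the sensitivity comes out near the -22.3 dBm figure quoted in section 2.4.
print(f"Q ≈ {q:.2f}, sensitivity ≈ {sensitivity_dbm(1e-15, 0.95, 0.7e-6):.1f} dBm")
```

The 0.7µA noise figure is reverse-engineered for illustration; in the chapter's flow it comes out of the transistor-level TIA design and the photodiode parameters.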
2.4.2 Design performance

Our aim in this work was to quantitatively compare the power dissipation in electrical and optical clock distribution networks for a number of cases, including technology node prediction. For both the electrical and optical cases we used technology parameters from the ITRS roadmap (wire geometry, material parameters). For transistor models we used predictive model parameters from Berkeley (BSIM3v3 down to 70nm and BSIM4 down to 45nm). The power dissipated in the electrical system can be attributed to the charging and discharging of the wiring and load capacitance and to the static power dissipated by the buffers. In order to calculate this power we used an internally developed simulator, which allows us to model and calculate the electrical parameters of clock networks for future technology nodes [18]. For optical performance predictions we used existing technology characteristics, while for the optoelectronic devices we took the datasheets of two real devices and used these figures. The power dissipated in clock distribution networks was analysed in both systems at the 70nm technology node. Power dissipation figures for electrical and optical CDNs were calculated based on the system performance summarised in tables 2.1a and 2.1b.
Table 2.1a. Electrical CDN characteristics

Electrical system parameter        Value
Technology (nm)                    70
Vdd (V)                            0.9
Tox (nm)                           1.6
Chip size (mm²)                    400
Global wire width (µm)             1
Metal resistivity (µΩ-cm)          2.2
Dielectric constant                3
Optimal segment length (mm)        1.7
Optimal buffer size (µm)           90

Table 2.1b. Optical CDN characteristics

Optical system parameter           Value
Wavelength λ (nm)                  1550
Waveguide core index (Si)          3.47
Waveguide cladding index (SiO2)    1.44
Waveguide thickness (µm)           0.2
Waveguide width (µm)               0.5
Transmission loss (dB/cm)          1.3
Loss per Y-junction (dB)           0.2
Input coupling coefficient (%)     50
Photodiode capacitance (fF)        100
Photodiode responsivity (A/W)      0.95
What follows are the results of comparisons of the power dissipation in electrical and optical clock distribution networks. These were carried out quantitatively for varying chip size, operating frequency, number of clock distribution points, technology node, and finally sidewall roughness. This last characteristic is the only one that is not system-driven, but it gives some important design information to technology groups working on optical interconnect.

Fig. 2.7a shows a power comparison where we vary square die size from 10 to 37mm width. This analysis was carried out for the 70nm node at a distribution frequency of 5.6GHz (the clock frequency associated with this node) and 256 drop points. Electrical CDN power rises almost linearly with die size, which is understandable since the line lengths increase and therefore require more buffers to drive them. Optical CDN power rises much more slowly, since all that is really changing is the transmission loss, and this has a quite minor effect on the overall power dissipation.

When we vary the clock frequency for constant chip width (fig. 2.7b), we observe a similar effect for the electrical CDN. Again, the number of buffers has to increase, since the segment lengths have to be reduced in order to attain the lower RC time constants. For the optical CDN, what is changing is the receiver power dissipation: the transimpedance amplifier requires a lower output resistance in order to operate at higher frequencies, and this translates to a higher bias current.

In fig. 2.7c, we vary the number of drop points and see that both electrical and optical CDN power dissipation rise, but optical rises much faster than electrical. There are two reasons for this: firstly, every time the number of drop points is doubled, so is the number of receivers, and these account for a large part of the power dissipation; secondly, the number of splitters is doubled, which in turn means that the power at emission also has to be doubled. These two factors cause the optical power to catch up with the electrical power at around 4000 drop points.

Fig. 2.7e shows a comparison for varying technology node. Not only is the technology changing here; we are also changing the clock frequency associated with each node. We can see that at the 70nm node there is a five-fold difference between electrical and optical clock distribution. As the technology node advances, this difference becomes even more marked.

A final analysis, fig. 2.7f, shows how technological advances are required to improve system performance, concerning in this case waveguide sidewall roughness. A 5nm roughness translates to a transmission loss of around 8dB/cm, which in turn corresponds to a power dissipation figure of around 500mW for the 70nm node at 5.6GHz and 20mm chip width. Looking at the 2nm roughness point, achieved at MIT [10] and corresponding to a transmission loss of 1.3dB/cm, we obtain a power dissipation figure of about 10mW: a fifty-fold decrease in overall power dissipation by going from 5nm roughness to 2nm roughness. This demonstrates the importance of optimising the passive waveguide technology for the whole system.
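The two quoted loss points are consistent with the common modelling assumption (e.g. the Payne-Lacey treatment of side-wall scattering) that transmission loss scales roughly with the square of the RMS roughness. The extrapolation below is ours, anchored at the measured 2nm point, not a claim from the chapter:

```python
def scattering_loss_db_per_cm(roughness_nm, ref_roughness_nm=2.0, ref_loss_db=1.3):
    """Side-wall scattering loss, assuming loss ∝ roughness² and anchoring
    the model at the 2 nm / 1.3 dB/cm point reported in [10]."""
    return ref_loss_db * (roughness_nm / ref_roughness_nm) ** 2

print(f"{scattering_loss_db_per_cm(5.0):.1f} dB/cm")  # ≈ 8.1, close to the ~8 dB/cm quoted
```

The quadratic dependence explains why a seemingly modest fabrication improvement (5nm to 2nm roughness) buys a fifty-fold system-level power reduction.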
Figure 2.7a. Comparison of power dissipation in electrical and optical clock distribution networks for varying chip size (70nm technology, 5.6GHz, 256 drop points).

Figure 2.7b. Comparison of power dissipation in electrical and optical clock distribution networks for varying clock frequency (70nm technology, 400mm², 256 and 128 drop points).

Figure 2.7c. Comparison of power dissipation in electrical and optical clock distribution networks for varying number of drop points (70nm technology, 5.6GHz, 400mm²).

Figure 2.7d. Relative power gain of optical over electrical clock distribution networks for varying number of drop points (70nm technology, 5.6GHz, 400mm²).

Figure 2.7e. Comparison of power dissipation in electrical and optical clock distribution networks for varying technology nodes.

Figure 2.7f. Evaluation of power dissipation in optical clock distribution networks for varying waveguide sidewall roughness (70nm technology, 5.6GHz, 400mm²).
For a BER of 10⁻¹⁵ the minimum power required by the receiver is -22.3dBm (at 3GHz). The losses incurred by passive components for various node counts in the H-tree are summarised in table 2.2.
Table 2.2. Optical power budget for 20mm die width at 3GHz
Number of nodes in H-tree     16      32      64      128
Loss in straight lines (dB)   1.3     1.3     1.3     1.3
Loss in curved lines (dB)     1.53    1.66    1.78    1.85
Loss in Y-dividers (dB)       12      15      18      21
Loss in Y-couplers (dB)       0.8     1       1.2     1.4
Output coupling loss (dB)     0.6     0.6     0.6     0.6
Input coupling loss (dB)      3       3       3       3
Total optical loss (dB)       19.2    22.5    25.8    29.1
Min. receiver power (dBm)     -22.3   -22.3   -22.3   -22.3
Laser optical power (mW)      0.5     1.1     2.30    4.85
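The last row of table 2.2 follows directly from the rows above it: the required laser power is the receiver sensitivity plus the total optical loss, converted from dBm to mW. A quick cross-check (small rounding differences against the table are expected):

```python
def laser_power_mw(total_loss_db, rx_min_dbm=-22.3):
    """Source power needed so that the receiver still sees its minimum
    optical power after all passive losses: P(dBm) = P_rx(dBm) + L_total."""
    return 10 ** ((rx_min_dbm + total_loss_db) / 10)

# Total losses per node count, taken from table 2.2.
for nodes, loss in [(16, 19.2), (32, 22.5), (64, 25.8), (128, 29.1)]:
    print(f"{nodes:4d} nodes: {laser_power_mw(loss):.2f} mW")
```

Because each doubling of the node count adds roughly 3.3dB of loss, the required laser power slightly more than doubles at every step, which is exactly the trend visible in fig. 2.7c.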
We can conclude from this analysis that the power dissipation of optical clock distribution networks is lower than that of electrical clock distribution networks, by a factor of five for example at the 70nm technology node. This factor will become larger in the future, for two reasons: firstly due to improvements in optical fabrication technology, and secondly with the rise in operating frequencies. However, this figure is probably not sufficient to convince semiconductor manufacturers to introduce such large technological and methodological changes for this application. To improve the figure, weak points can be identified in each main part of an integrated optical link. For the source, the efficiency of the conversion between electrical and optical power is relatively low; this needs to be improved, possibly through integrated microsources. For the waveguide structures, most of the losses need to be reduced, especially transmission loss and coupling loss; sidewall roughness in particular has a direct and considerable impact on the power dissipation of the global system. Finally, at the receiver end, the transimpedance amplifier power dissipation is too high: better circuit structures must be devised, or the photodetector parasitic capacitance needs to be reduced.
2.5 OPTICAL NETWORK ON CHIP

In current SoC architectures, global data throughput between functional blocks can reach up to tens of gigabits per second, the load being shared by several communication buses. In the future the constraints acting on such data exchange networks will continue to increase: the number of IP blocks in an integrated system could be as high as several hundred, and the global throughput could reach the Tb/s scale. To provide this level of performance, the communication system itself is designed as an IP block into which the various functional units are connected. This type of standardised hardware communication architecture is called a network on chip (NoC).
Using wavelength division multiplexing (WDM) techniques, photonics and optoelectronics may offer new solutions for realising reconfigurable optical networks on chip (ONoC). An ONoC, in which routing is based on wavelength λ, is in effect a circuit-switching topology and can thus ensure data exchanges between IP blocks with very low contention. The advantages of using an optical network are many: independence of interconnect performance from distance and data rate, crosstalk reduction, connectivity increase, interconnect power dissipation reduction, increase in the size of isochronous tiles, and use of communication protocols.

Figure 2.8 shows a 4 × 4 ONoC with all its electronic interfaces: photodetectors and lasers in III-V technology and the optical network in SOI technology, using heterogeneous integration techniques similar to those described in section 2.2. The intellectual property (IP) blocks shown can be processor cores, memory blocks, functional units etc., with standard interfaces to the communication network. This is a multi-domain device with high-speed optoelectronic circuits (modulation of the laser current, and photodetectors) and passive optics (waveguides and passive filters). In the figure, M are masters (processor, IP, ...) which can communicate with targets T (memory, ...). The network comprises 4 stages, each associated with a single resonant wavelength. The operation of the 4 × 4 network is summarised in table 2.3. This system is a fully passive circuit-switching network based on wavelength routing, and is non-blocking: from Mi to Tj there exists only one physical path, associated with one wavelength. At any one time, single-wavelength emitters can make 4 connections and multi-wavelength emitters can make 12 connections. The network is in principle scalable to an infinite number of connections; in practice, this number is severely limited by lithography and etching precision. For a 5nm tolerance on the size of the microdisk, corresponding to state-of-the-art CMOS process technology, the maximum size of the network is 8 × 8.
Table 2.3. Truth table for optical network on chip

      T1   T2   T3   T4
M1    λ2   λ3   λ1   λ4
M2    λ3   λ4   λ2   λ1
M3    λ1   λ2   λ4   λ3
M4    λ4   λ1   λ3   λ2
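The non-blocking property can be checked mechanically: since each master-target pair is served by exactly one wavelength, the routing table must use every wavelength exactly once in each row and in each column (a Latin square), so that no two connections from the same master, or to the same target, contend for a wavelength. A small sketch over table 2.3:

```python
ROUTING = {  # table 2.3: wavelength index used from master Mi to target Tj
    "M1": {"T1": 2, "T2": 3, "T3": 1, "T4": 4},
    "M2": {"T1": 3, "T2": 4, "T3": 2, "T4": 1},
    "M3": {"T1": 1, "T2": 2, "T3": 4, "T4": 3},
    "M4": {"T1": 4, "T2": 1, "T3": 3, "T4": 2},
}

def is_non_blocking(table):
    """Every row and every column must be a permutation of the wavelengths."""
    wavelengths = set(range(1, len(table) + 1))
    rows_ok = all(set(row.values()) == wavelengths for row in table.values())
    targets = next(iter(table.values())).keys()
    cols_ok = all({row[t] for row in table.values()} == wavelengths
                  for t in targets)
    return rows_ok and cols_ok

print(is_non_blocking(ROUTING))  # True
```

The same check applies to any larger instance (e.g. the 8 × 8 limit mentioned above): any Latin-square wavelength assignment yields a passive non-blocking network of this type.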
The basic element of the network is an optical filter, described in the next section. Ports 1-4 correspond to the inputs/outputs of the optical filter. Its operation is the same as that of an electronic crossbar: the cross function (output at port 4) is activated when the wavelength injected at port 1 does not correspond to a resonant ring wavelength, and the bar function is activated (output at port 3) when the injected wavelength does correspond to a resonant ring wavelength.
Figure 2.8. Architecture of the 4 × 4 optical network on chip: master IP blocks with their interfaces (driver, laser), the passive optical network built from elementary optical filters, and target IP blocks with their interfaces (detector, receiver).
Operation is symmetrical: the same phenomenon occurs if the wavelength is injected at port 4.
2.5.1 Microresonators

Microring resonators are ideal device candidates for integrated photonic circuits. Because they make it possible to add or extract signals from a waveguide based on wavelength in a WDM flow, they can be considered as basic building blocks for complex communication networks. The use of standard SOI technology leads to high compactness (structures with radii as small as 4µm have been reported) and the possibility of low-cost photonic integration.

Figure 2.9 shows the structure of an elementary add-drop filter based on microring resonators. The size of the structure is typically a few hundred µm². It consists of two identical disks evanescently side-coupled to two signal waveguides which cross at near right angles to facilitate signal directivity. The microdisks make up a selective structure: the electromagnetic field propagates in the rings for discrete propagation modes corresponding to specific wavelengths. The resonant wavelengths depend on geometric and structural parameters (indices of the substrate and of the microrings, thickness and diameter of the disks).

The basic function of a microresonator can be thought of as a wavelength-controlled switching function. If the wavelength of an optical signal passing through a waveguide in proximity to the resonator (for example injected at port 1) is close enough to a resonant wavelength λ1 (the tolerance is of the order of a few nm, depending on the coupling strength between the disk and the waveguide), then the electromagnetic field is coupled into the microrings and then out along the second waveguide (in the example, the optical signal is transmitted to the output port 3, as shown in fig. 2.10a). If the wavelength of the optical signal does not correspond to the resonant wavelength, then the electromagnetic field continues to propagate along the waveguide and not through the structure (in the example, the optical signal would then be transmitted to the output port 4, as shown in fig. 2.10b). This device thus operates as an elementary router, the behaviour of which is summarised in the table in fig. 2.9.

Figure 2.9. Micro-disk realisation of an add-drop filter, with its port-level truth table.
Figure 2.10a. FDTD simulation of the add-drop filter in the on-state.

Figure 2.10b. FDTD simulation of the add-drop filter in the off-state.
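The elementary router behaviour just described can be captured in a few lines. The resonance-matching tolerance is an assumed parameter here (the text only quotes an order of a few nm, depending on coupling strength):

```python
def route(wavelength_nm, resonant_nm, tol_nm=2.0):
    """Elementary add-drop filter as a wavelength-controlled crossbar:
    'bar' (coupled into the rings and out at port 3) when the input is
    within the resonance tolerance, 'cross' (straight through to port 4)
    otherwise."""
    return "bar" if abs(wavelength_nm - resonant_nm) <= tol_nm else "cross"

print(route(1550.0, 1550.5))  # on resonance: coupled to port 3 ("bar")
print(route(1540.0, 1550.5))  # off resonance: continues to port 4 ("cross")
```

Cascading such filters, one resonant wavelength per network stage, is what turns the wavelength assignment of table 2.3 into a purely passive routing fabric.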
The first structures have been realised, and preliminary results are promising. Fig. 2.11a shows an IR photograph of the structure in the cross state (top) and in the bar state (bottom), while fig. 2.11b shows the transmission coefficient at the cross output: the transmitted power at the cross output reaches 100% for wavelengths corresponding to the resonant frequencies of the microdisk.
Figure 2.11a. Infra-red photograph of structure in both cross (top) and bar (bottom) states
Figure 2.11b. Transmission coefficient on cross output for varying wavelength

2.5.2 Microsource lasers

From the viewpoint of mode field confinement and mirror reflection, microdisk lasers operate on the principle of total internal reflection, as opposed to multiple reflection, as is the case in VCSELs for example. This gives this type of source two distinct advantages over VCSELs for on-chip optical interconnect. Firstly, light emission is in-plane (as opposed to vertical), meaning that emitted light can be injected directly into a waveguide with minimal loss [6]. Secondly, for communication schemes requiring multiple wavelengths, it is easier from a technological point of view to control the radius of such a device than to control the thickness of an air gap in a VCSEL. In any case such devices, to be compatible with dense photonic integration, must satisfy the requirements of small volume and high optical confinement, with low threshold current and emission in the 1.3-1.6µm range. Although these devices are not as mature as VCSELs, they seem extremely promising for optical interconnect applications. An overview of microcavity semiconductor lasers can be found in [2].
2.5.3 Demonstration of principle

Behavioural models enable us to verify the operation of the 4 × 4 ONoC through high-level simulation. Four wavelengths (λ1, λ2, λ3 and λ4) are injected simultaneously at port 1 (shown in figure 2.12). The input signal format is a matrix. Figure 2.12 is a three-dimensional representation with wavelength on the X-axis (representing the 4 channels), time on the Y-axis and normalised power on the vertical axis. Each injected wavelength carries two Gaussian pulses in time. The behavioural simulation analyses the 4 outputs T1, T2, T3 and T4 (T2 shown in fig. 2.12). As predicted in table 2.3, only λ3 is detected at the output T2.
Figure 2.12. Simulation of 4x4 optical network on chip
2.6 CONCLUSION

Integrated optical interconnect is one potential technological solution to alleviate some of the more pressing issues involved in moving volumes of data between circuit blocks on integrated circuits. In this chapter, we have shown how novel integrated photonic devices can be fabricated above standard CMOS ICs, designed concurrently with EDA tools and used in clock distribution and NoC applications. The feasibility of on-chip optical interconnect is no longer really in doubt. We have given some partial results to quantitatively demonstrate the advantages of optical clock distribution. Although lower power can be achieved (of the order of a five-fold decrease), more work is required to explore new solutions that benefit from advances at both the architectural and the technological level. The existing basic building blocks also need to be integrated together to physically demonstrate on-chip optical links; research to this end is well under way in several groups around the world. Looking further ahead, the use of multiple wavelengths in on-chip communication networks and in reconfigurable computing is an extremely promising and exciting field of research.
References
[1] M. Amann, M. Ortsiefer, and R. Shau: 2002, ‘Surface-emitting Laser Diodes for Telecommunications’. In: Proc. Symp. Opto- and Microelectronic Devices and Circuits.
[2] T. Baba: 1997, ‘Photonic Crystals and Microdisk Cavities Based on GaInAsP-InP System’. IEEE J. Selected Topics in Quantum Electronics 3.
[3] Y. Cao, T. Sato, D. Sylvester, M. Orchansky, and C. Hu: 2000, ‘New Paradigm of Predictive MOSFET and Interconnect Modeling for Early Circuit Design’. In: Proc. Custom Integrated Circuit Conference.
[4] S. Cho et al.: 2002, ‘Integrated detectors for embedded optical interconnections on electrical boards, modules and integrated circuits’. IEEE J. Sel. Topics in Quantum Electronics 8.
[5] A. Filios et al.: 2003, ‘Transmission performance of a 1.5-µm 2.5-Gb/s directly modulated tunable VCSEL’. IEEE Phot. Tech. Lett. 15.
[6] M. Fujita, A. Sakai, and T. Baba: 1999, ‘Ultrasmall and ultralow threshold GaInAsP-InP microdisk injection lasers: Design, fabrication, lasing characteristics and spontaneous emission factor’. IEEE J. Sel. Topics in Quantum Electronics 5.
[7] M. Fujita, R. Ushigome, and T. Baba: 2000, ‘Continuous wave lasing in GaInAsP microdisk injection laser with threshold current of 40µA’. IEE Electron. Lett. 36.
[8] M. Ingels and M. S. J. Steyaert: 1999, ‘A 1-Gb/s, 0.7µm CMOS Optical Receiver with Full Rail-to-Rail Output Swing’. IEEE J. Solid-State Circuits 34(7).
[9] I. Kimukin et al.: 2002, ‘InGaAs-Based High-Performance p-i-n Photodiodes’. IEEE Phot. Tech. Lett. 26(3).
[10] K. Lee et al.: 2001, ‘Fabrication of ultralow-loss Si/SiO2 waveguides by roughness reduction’. Optics Letters 26.
[11] J. Liu et al.: 2002, ‘Ultralow-threshold sapphire substrate-bonded top-emitting 850-nm VCSEL array’. IEEE Phot. Lett. 14.
[12] J. Morikuni et al.: 1994, ‘Improvements to the standard theory for photoreceiver noise’. IEEE J. Lightwave Technology 12.
[13] I. O’Connor, F. Mieyeville, F. Tissafi-Drissi, G. Tosik, and F. Gaffiot: 2003, ‘Predictive design space exploration of maximum bandwidth CMOS photoreceiver preamplifiers’. In: Proc. IEEE International Conference on Electronics, Circuits and Systems.
[14] A. Sakai, T. Fukazawa, and T. Baba: 2002, ‘Low Loss Ultra-Small Branches in a Silicon Photonic Wire Waveguide’. IEICE Trans. Electron. E85-C.
[15] A. Sakai, G. Hara, and T. Baba: 2001, ‘Propagation Characteristics of Ultrahigh-∆ Optical Waveguide on Silicon-on-Insulator Substrate’. Jpn. J. Appl. Phys. – Part 2 40.
[16] S. Schultz, E. Glytsis, and T. Gaylord: 2000, ‘Design, Fabrication, and Performance of Preferential-Order Volume Grating Waveguide Couplers’. Applied Optics-IP 39.
[17] Semiconductor Industry Association: 2003, ‘International Technology Roadmap for Semiconductors’.
[18] G. Tosik, F. Gaffiot, Z. Lisik, I. O’Connor, and F. Tissafi-Drissi: 2004, ‘Power dissipation in optical and metallic clock distribution networks in new VLSI technologies’. IEE Elec. Lett. 4(3).
Chapter 3 NANOTECHNOLOGIES FOR LOW POWER
Jacques Gautier CEA-DRT – LETI/D2NT – CEA/GRE
Abstract The conventional approach to improving the performance of circuits is to scale down the devices and technologies. This also conveniently lowers the power consumption per function. In this chapter, we review the potential of nanotechnologies for this purpose, with emphasis on few-electron devices in the case of room-temperature operation. Other devices, especially carbon nanotube transistors, resonant tunnelling diodes and quantum cellular automata, are briefly discussed.
Keywords: nanotechnologies; Single Electron Transistor; SET; molecular electronics; RTD; QCA; low power; Coulomb blockade
3.1 INTRODUCTION
In addition to packing-density increase and speed improvement, the downscaling of technologies comes with a reduction of the power consumption per function. However, this gain is offset by the tremendous increase in the number of transistors per chip. A possible solution is to go further, towards nano-scale devices where a lower amount of charge is needed to code a bit. This is the basis of what is known as single electronics. The use of molecules could be a realistic way to fabricate these tiny devices and other useful nanostructures. In this chapter we overview the potential of nanodevices for low power electronics, with emphasis on few-electron electronics in the case of room-temperature (RT) operation. Other devices, especially carbon nanotube transistors, resonant tunnelling diodes (RTD) and quantum cellular automata (QCA), are briefly discussed.
3.2 SINGLE ELECTRONICS
In CMOS circuits, the total power consumption is the sum of the dynamic power and of the contribution of leakage. For advanced technology generations the latter is rising rapidly, but it is still less than the former. So we will focus on the dynamic power consumption, which is given by the usual expression
Pd = a · Ngate · (Cgate + Cinter) · VDD² · fc    (1)
where a is the activity factor, Ngate is the number of gates, (Cgate + Cinter) is the load capacitance (gate and interconnect contributions), and fc is the clock frequency. This equation shows that the power is proportional to the amount of charge used in transistors and interconnects to code a bit of information. For dense circuits with local interconnects, the dominant contribution is usually the one related to the gate capacitance of the transistors, so the power can also be expressed as Pd = a.Ngate.Q.VDD.fc, where Q is the channel charge. There is therefore a strong motivation to reduce this charge for power saving. This is currently obtained by the downscaling of technologies. From the extrapolation of the historical trend and from the ITRS roadmap anticipation [1], we can expect a value of only 10-20 electrons for sub-10nm MOSFETs. This is much less than the hundreds to thousands of electrons present in current devices. Is it possible to go still further, towards only one electron, using what is called a single electron transistor or SET [2]? That would be advantageous for power consumption, knowing that the reduction of power per function due to scaling is more or less balanced by the tremendous increase in the number of transistors per chip. However, this gain would be effective only if the capacitances of the interconnects are not too large.

Another factor in expression (1) is the electrostatic potential at which the charge Q is brought. At present, there is a strong incentive to reduce it. Whereas the supply voltage of current high performance circuits is in the range 1.2-1.8V, operation at only 0.3V has already been demonstrated on experimental circuits [3], which is close to the bottom limit anticipated by the ITRS. For a lower value the device is not in well-defined On or Off states, which results in either leakage or poor performance. What can be expected from SETs? Before answering this question, their properties and modes of operation are briefly recalled.
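Equation (1) and its charge-based form can be checked numerically. This is a minimal sketch; the operating-point values used in the example (10⁸ gates, a = 0.1, 15 electrons per device, 0.5 V, 1 GHz) are round illustrative numbers chosen to match the sub-10nm regime discussed in the text, not figures from the chapter.

```python
E = 1.602e-19  # elementary charge (C)

def dynamic_power(a, n_gate, c_gate, c_inter, v_dd, f_clk):
    """Equation (1): Pd = a * Ngate * (Cgate + Cinter) * VDD^2 * fc (SI units, watts)."""
    return a * n_gate * (c_gate + c_inter) * v_dd**2 * f_clk

def dynamic_power_from_charge(a, n_gate, q, v_dd, f_clk):
    """Equivalent form Pd = a * Ngate * Q * VDD * fc, with Q = Cgate * VDD
    the channel charge (interconnect contribution neglected)."""
    return a * n_gate * q * v_dd * f_clk

# 10^8 gates, each switching ~15 electrons at 0.5 V and 1 GHz with a = 0.1:
p = dynamic_power_from_charge(0.1, 1e8, 15 * E, 0.5, 1e9)  # ~1.2e-2 W
```

The two forms agree whenever Cinter is neglected and Q is taken as Cgate·VDD, which is the substitution the text makes.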
3.2.1 Background on single electron transistors
A SET is a device which comprises a Source and a Drain reservoir of electrons and a control gate, like a MOSFET. In between, there is an island where carriers are confined [2] (see Fig. 3-1). A common solution to obtain this effect is to insert tunnelling or potential barriers between the reservoirs and the island. This is the main structural difference from MOSFETs, but it is essential for the operation of SETs. Due to this confinement, there is always an integer number of electrons in the island. However, the probability of having a given amount of charge is a continuous function of the device bias, so there is also a continuous variation of the average charge versus the external bias.
Figure 3-1. Schematics of a SET: the source S and drain D are coupled to the island through tunnel junctions of capacitance Cj and resistance RT >> RQ; the gate G is coupled to the island through Cg and biased at Vg.
Provided that just one electron more or less has a significant effect on the electrostatic energy of the device, it can be shown that, for a given device bias, only a limited number of charge states are possible in the island [2]. In particular, there are bias domains for which only one charge state is possible. In this case, there is no exchange of charge with the electron reservoirs and the device is in the Off state. This is the Coulomb blockade effect. In the other cases, the number of electrons oscillates between the most probable charge states, leading to a flux of carriers between source and drain. For instance, when the two states n and n+1 are possible, the current is due to the repetition of the sequence: one electron comes from the source to the island, then leaves the island to the drain. As shown in Fig. 3-2, the electrical characteristics of SETs are very different from those of MOSFETs. The ID(VG) curves have periodic oscillations of current, and the output characteristics look like a resistance (or a staircase for a non-symmetrical device) with a low drain voltage domain where the device is periodically Off and On as a function of VG. The period of the Coulomb Blockade Oscillations (CBO) is given by e/Cg. Between two successive oscillations, the only difference is that the average number of electrons in the island is incremented or decremented by one. At a peak of current, the two dominant charge states have equal probability and, on average, there is a half-integer number of electrons in the island.
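The geometry of the oscillations described above can be sketched numerically: the CBO period in gate voltage is e/Cg, and, since current peaks occur where the average island charge is a half-integer, they sit at Vg = (n + 1/2)·e/Cg. The value Cg = 0.2 aF matches the simulation parameters of Fig. 3-2; everything else is a direct restatement of the text.

```python
E = 1.602e-19   # elementary charge (C)
CG = 0.2e-18    # gate-to-island capacitance (F), as in Fig. 3-2

period = E / CG                                  # CBO period: ~0.8 V
peaks = [(n + 0.5) * period for n in range(3)]   # first three current peaks (V)
```

With these values the period is about 0.8 V, consistent with the 0-2 V gate-voltage span of the ID(VG) curves in Fig. 3-2 containing a few oscillations.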
Figure 3-2. Typical ID(VG) and ID(VD) characteristics of a SET, obtained by simulation with the following parameters: Cj=0.1aF, Cg=0.2aF, RT=1MΩ, T=300K.
To observe the previous typical characteristics, there are two important conditions to meet. Firstly, the charging energy, which is the electrostatic energy increase due to the arrival of one electron in the island, should be large in comparison to the thermal energy kT:
EC = e² / (2CΣ) >> kT    (2)
where e is the electron charge (absolute value) and CΣ is the total capacitance of the island, CΣ = 2Cj + Cg, where Cj is the junction capacitance and Cg is the gate-to-island capacitance. For room-temperature operation, CΣ should be less than 0.3 aF (Ec = 10 kT, T = 300K), which requires an island smaller than a few nm. The second condition is related to the confinement of the electron wave function in the island, which is essential to quantize the charge in this island: the resistance of the tunnel barriers should exceed the quantum resistance RK = h/e² ~ 25.8 kΩ. For the fabrication of SETs, there are many possibilities, since any kind of conductive material can be used for the island, metallic as well as semiconductor and even molecular. However, silicon is advantageous for CMOS compatibility and also for the stability of devices [4].
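Both numerical limits quoted above follow directly from the constants. This sketch simply inverts condition (2) for CΣ at a given temperature and margin, and evaluates RK = h/e²; the helper name and the 10 kT margin parameter are illustrative.

```python
E = 1.602e-19      # elementary charge (C)
K_B = 1.381e-23    # Boltzmann constant (J/K)
H = 6.626e-34      # Planck constant (J.s)

def max_island_capacitance(temperature_k, margin=10.0):
    """Largest CΣ (F) satisfying EC = e^2 / (2*CΣ) >= margin * kT."""
    return E**2 / (2.0 * margin * K_B * temperature_k)

c_max = max_island_capacitance(300.0)  # ~3.1e-19 F, i.e. ~0.3 aF as in the text
r_k = H / E**2                         # quantum resistance, ~25.8 kΩ
```

At 300 K the bound comes out at about 0.31 aF, confirming the "less than 0.3 aF" figure, and RK evaluates to roughly 25.8 kΩ.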
3.2.2 Designing a low VDD inverter
With regard to the power consumption of digital circuits, we consider in this part the case of a simple inverter, since this is a convenient reference for comparisons with CMOS. The design of a SET inverter has been discussed by many authors [5,6,7]. They pointed out that, since there is only one kind of SET, the complementary action of the pull-up and pull-down devices is not as easy to obtain as in CMOS, where two types of transistor exist. A first solution is to choose the supply voltage such that both of these devices are On or Off in a complementary way in the switching part of the transfer characteristic. An example of such a situation is shown in Fig. 3-3. The shaded area displays the Coulomb blockade domains of the pull-up and pull-down transistors at zero temperature. Based on that, the transfer characteristic has been schematically drawn. Contrary to CMOS, we can observe that the voltage swing is less than rail-to-rail and that the DC current is minimal at the transition point.
Figure 3-3. Theoretical Coulomb blockade domains, also known as Coulomb diamonds (shaded areas), at 0K for the pull-down and pull-up SETs of an inverter. At RT they are a little narrower. Cj=0.1aF, Cg=0.2aF, VDD=0.53V. The bold line is a drawing of the transfer characteristic.
Since a low VDD is advantageous for low power applications, we now discuss the possibility of minimizing it for this simple SET inverter, taking the design constraints into account and aiming at room-temperature operation:
• Cg + 2Cj < 0.3aF for RT operation (for Ec ~ 10kT)
• Cg / Cj > 1 for voltage gain
• VDD = e / (Cg + Cj) for complementary action of the transistors
As a result, a very low VDD and RT operation are difficult to achieve simultaneously. In fact, with the previous equations and for a ratio of gate to junction capacitances of 2, the minimum VDD would be equal to 0.7V! However, for temperatures above 0K, the switching of the SET from the Off to the On state is not abrupt, since there is an exponential variation of the current, equivalent to the subthreshold current of MOSFETs. Consequently, the real Coulomb blockade diamonds are narrower than those shown in Fig. 3-3 and it is possible to reduce VDD. This is demonstrated in Fig. 3-4, where the DC voltage gain and the DC current at the transition point of an inverter have been plotted versus VDD. Note also that the constraint on CΣ has been slightly relaxed. As thoroughly discussed by A. Korotkov [6], the acceptable VDD window is quite narrow. Too low a VDD value would be detrimental to the noise margin and to the speed, since the DC current at the transition point decreases exponentially with VDD. On the contrary, a higher value would increase the power consumption.
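The 0.7 V figure follows from combining the three constraints above. This sketch makes the arithmetic explicit: with Cg/Cj = 2 and Cg + 2Cj pinned at the 0.3 aF room-temperature limit, the smallest achievable VDD = e/(Cg + Cj) is about 0.71 V.

```python
E = 1.602e-19           # elementary charge (C)
C_SIGMA_MAX = 0.3e-18   # RT limit on Cg + 2*Cj (F), for Ec ~ 10 kT

ratio = 2.0                        # chosen Cg / Cj, satisfying Cg/Cj > 1
cj = C_SIGMA_MAX / (ratio + 2.0)   # solve Cg + 2*Cj = C_SIGMA_MAX with Cg = ratio*Cj
cg = ratio * cj
v_dd_min = E / (cg + cj)           # complementary-action condition: ~0.71 V
```

Any smaller VDD would require a larger total capacitance, violating the 10 kT charging-energy condition at 300 K, which is exactly the tension the text describes.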
Figure 3-4. DC voltage gain (solid line) and DC current Imin (dashed line) at the transition point of a SET inverter at room temperature, versus VDD = e/(Cj+Cg). Cj=0.1aF, Cg=0.2aF, RT=1MΩ, T=300K.
To go further in reducing VDD, a solution is to add control gates to each SET (Fig. 3-5). Based on this approach, NTT has demonstrated a quasi-CMOS operation inverter at a supply voltage as low as 20 mV [8], which is very advantageous for the power consumption, but in this case the temperature was only 27K. The bias of the control gates shifts the CBO, making it possible to select the optimal part of the ID(VG) characteristic of each SET for complementary action. In this way, the equivalent of two types of transistors can be obtained, as in CMOS. In addition, their equivalent threshold voltages can be tuned, balancing the influence of any parasitic (background) charges in the neighbourhood of the SET:
ΔVg = −(Cgc / Cg) · ΔVgc    (3)
To get a symmetrical transfer characteristic, it can easily be demonstrated from the Coulomb diamonds that the sum of the control gate voltages should be equal to VDD:
Vgcss + Vgcdd = VDD    (4)
As a result, there is one more degree of freedom in designing an inverter, in comparison to the case without control gates. That gives flexibility in fixing the value of VDD. In fact, there is now one optimal supply voltage, leading to complementary states of the pull-up and pull-down transistors, for each bias of the control gates. Taking equation (4) into account, it is given by:
VDDopt = (e − 2Cgc·Vgcss) / (Cg + Cj)    (5)
The control gates thus allow a substantial reduction of VDDopt, but it is important to note that the constraint on the total capacitance (equation 2) should also take the contribution of Cgc into account: CΣ = 2Cj + Cg + Cgc. A drawback of this approach is the requirement of extra lines to distribute the control gate voltages. However, this can be avoided in the particular case where Vgcss=VDD and Vgcdd=0V (VSS=0V). For this condition, the optimum value of VDD is given by:
VDDopt = e / (Cg + Cj + 2Cgc)    (6)
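Equations (5) and (6) can be cross-checked numerically: (6) is the fixed point of (5) when Vgcss is set equal to VDD itself. This is an illustrative sketch using the capacitance values of the control-gate inverter of Fig. 3-5 (Cj=0.05aF, Cg=Cgc=0.1aF); the function names are assumptions.

```python
E = 1.602e-19  # elementary charge (C)

def v_dd_opt(cj, cg, cgc, v_gcss):
    """Equation (5): VDDopt = (e - 2*Cgc*Vgcss) / (Cg + Cj)."""
    return (E - 2.0 * cgc * v_gcss) / (cg + cj)

def v_dd_opt_self_biased(cj, cg, cgc):
    """Equation (6): special case Vgcss = VDD, Vgcdd = 0 (no extra bias lines)."""
    return E / (cg + cj + 2.0 * cgc)

v6 = v_dd_opt_self_biased(0.05e-18, 0.1e-18, 0.1e-18)  # ~0.46 V
v5 = v_dd_opt(0.05e-18, 0.1e-18, 0.1e-18, v6)          # same value: fixed point
```

Substituting the result of (6) back into (5) as Vgcss returns the same voltage, confirming that the two equations are consistent, and the ~0.46 V result is well below the ~0.7 V minimum obtained without control gates.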
As discussed previously for the case without control gates, at RT the Coulomb blockade area of the SET is narrower than at 0K, which makes possible a reduction of VDD, or a change of bias of the control gates for a given VDD. However, it also results in a change of the SET current, which may affect the speed of circuits. Consequently, there is a design trade-off. To illustrate it, in Fig. 3-5 we have plotted the variations of the DC voltage gain and of the propagation delay along a chain of SET inverters versus the DC current at the transition point of the transfer characteristic. The shaded area shows the most advantageous design window. In this example, the load capacitance is equal to 0.5fF, but for another value the design window would be the same, since the propagation delay scales directly with this capacitance. This is a difference from CMOS, where the dominant load capacitance of dense logic is due to the gate capacitance of MOSFETs. Here, the gate capacitance of SETs is extremely small and the dominant load capacitance comes from the local interconnects. In fact, the latter should be much larger than e/2VDD to avoid any detrimental effects of shot noise. Regarding the dynamic power consumption of SET logic, as long as a CMOS output buffer is not implemented, the major contribution would also come from the load capacitance due to the interconnects.
Figure 3-5. Variations of the DC voltage gain (solid line) and of the propagation delay along a chain of SET inverters (dashed line) versus the DC current at the transition point of the transfer characteristics, from master-equation simulation. VDD=0.3V, T=300K, Cj=0.05aF, CG=CGC=0.1aF, Cload=0.5fF, RT=1MΩ, VGCss+VGCdd=VDD. The control gate voltages are varied as follows: 0.7V …

An example of switching characteristics is shown in Fig. 3-5 for VDD=0.3V, T=300K and a load capacitance of 0.5fF. The corresponding switching energy is 5.2x10⁻²fJ. In the same figure, transfer characteristics have been plotted for a lower supply voltage, showing that a voltage gain higher than 1 can be obtained down to VDD=0.2V at RT.

Figure 3-5. On the left, switching characteristics for a SET inverter with control gates. VDD=0.3V, T=300K, CL=0.5fF, Cj=0.05aF, Cg=Cgc=0.1aF, RT=1MΩ, Vgcss=0.3V, Vgcdd=0. On the right, transfer characteristics for the same capacitances and VDD=0.2V.

3.2.3 Designing gates with increased functionality

Another approach to lowering the power consumption is to build logic gates with increased functionality, in order to reduce the count of transistors needed to obtain a given function. This can be done by taking advantage of both the existence of the CBO and the possibility of designing SETs with multiple inputs [9]. The principle is to choose the logic levels such that the multiple-input SETs are biased at either the minima or the peaks of current, depending on the combination of input signals. This is illustrated in Fig. 3-7 for a double-input X-OR function. The logic level "1" is equal to the CBO period of the equivalent single-input SET, e/(2Cin), where Cin is the input gate capacitance.
From this equivalent SET, it is obvious that the device is Off when both of the inputs are either "0" or "1", and that the device is On when one and only one of the inputs is "1".

Figure 3-7. Principle of design of a X-OR gate with a double-input SET (ID(VGeff) simulated at VD=100mV, T=300K): the output current is a X-OR function of VA and VB. The figure also shows the equivalent input circuit and the associated equations CGeff = 2Cin and VGeff = (VA + VB)/2.

This can be used to design pass-gate logic functions, as demonstrated by Y. Ono [9]. For instance, for an input current signal C in the previous X-OR gate, the output pass current is given by C.(A ⊕ B). Furthermore, one of the inputs of this gate, B for instance, can be viewed as a control input, leading to the output pass current C.A if B is "1". Applying this technique to a gate comprising such a control input in addition to the inputs A and B, we get the function C.(A ⊕ B) for the output current when the control input is "1" (see Fig. 3-8). In this way, cascading SET structures, NTT has designed a 4-b adder with only 40 SETs for operation at 30K [9]. In comparison with CMOS there are fewer transistors and no crossing of pass signal routes, thanks to the high functionality of the SET gates. Furthermore, there is a low-level signal on the pass route. Consequently, a lower dynamic power is expected. Nevertheless, this power gain has not yet been evaluated.

Figure 3-8. Design of complex gates using multiple-input SETs, here producing the output current C.(A⊕B) from the inputs A and B, a control gate and an input current C. These gates can be cascaded.

Moreover, there are important issues, especially concerning the control of the phase of the CBO. Since it will probably be impossible to avoid the existence of any parasitic charges, charge-tolerant solutions are required.
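The multiple-input X-OR principle described above can be sketched as a small model: the two input gates merge into an equivalent gate with CGeff = 2Cin and VGeff = (VA + VB)/2, and the logic level "1" is chosen equal to the CBO period e/(2Cin) of the equivalent SET. Equal inputs then land on current minima (Off), while a single "1" lands half a period away, on a current peak (On). The value Cin = 0.1 aF and the peak-detection threshold are illustrative assumptions.

```python
E = 1.602e-19    # elementary charge (C)
C_IN = 0.1e-18   # input gate capacitance (F), assumed

LEVEL_ONE = E / (2.0 * C_IN)  # logic "1" voltage = CBO period of equivalent SET
PERIOD = E / (2.0 * C_IN)     # CBO period e / CGeff, with CGeff = 2*Cin

def is_on(v_a, v_b):
    """True when VGeff = (VA+VB)/2 sits near a current peak of the CBO."""
    v_geff = (v_a + v_b) / 2.0
    phase = (v_geff % PERIOD) / PERIOD  # position within one CBO period
    return abs(phase - 0.5) < 0.25      # peaks sit at half-period offsets

truth_table = {(a, b): is_on(a * LEVEL_ONE, b * LEVEL_ONE)
               for a in (0, 1) for b in (0, 1)}
# On exactly when one and only one input is "1": X-OR behaviour
```

With both inputs at "0" the effective gate voltage is zero (a minimum); with both at "1" it is one full period (again a minimum); a single "1" gives half a period, i.e. a current peak, reproducing the X-OR truth table.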
A first approach consists in incorporating redundancy into the circuit design in order to replace the defective gates by reconfiguration [9]. This is valuable only if a reasonable amount of spares is needed and if the area overhead is not too large. Another solution would be to balance the influence of parasitic offset charges with opposite charges stored near the island of the SET. This concept has been demonstrated in the case of SETs in which nanostructures have been embedded [10-12]. The resulting device is a merge of a SET and of a Non Volatile Memory function. Further, a feedback loop can be implemented to automatically control the phase of the CBO [13]. The loop is closed to adjust the amount of charge in a memory node, then it is opened for the use of the device. There are other potential applications of the possibility to tune and memorize the phase of the CBO. A first example has been the demonstration of a hybrid SET-MOSFET gate which can be programmed to be inverting or non-inverting [11]. This feature has been obtained thanks to a SET active device which can operate either in a positive or in a negative transconductance region, depending on the amount of charge stored in a nearby nanostructure. In this case, the SET was fabricated in a very thin undulated SOI film in which a narrow source-drain percolation channel and an electron pocket working as a memory node can be naturally formed for a range of bias. In the hybrid gate, the MOSFET was just used as a load. Since the output voltage swing was only 10mV, an output buffer has been implemented. The reproducibility of the structure is not obvious, but RT operation and a peak-to-valley current ratio (PVCR) as high as 10² were obtained. Most important is the concept of programmable logic, which is feasible with SET-based devices, since it has a high potential for low power and high packing density. The design of SET programmable logic arrays (PLA) has also been reported by K. Uchida [11].
It is important to note that many other functions can be designed with few-electron devices, taking advantage of their specific features. In particular, several memory structures that are promising for low power consumption have been reported [14-17]. For spiking neuron circuits, it has been proposed to combine NVM MOSFET devices and single-electron circuits based on multi-nanodot floating-gate arrays [18]. Some analog applications and devices have also been studied, such as CCDs [8], ADCs [19], metrology [20] and NEMS [21]. However, although some have been demonstrated, most of them are still at the proof-of-concept level.

3.3 MOLECULAR ELECTRONICS

For the fabrication of SETs, any kind of conducting material can be used. Whereas the basic research was done on metallic SETs [2], circuit demonstrations are performed mainly on silicon [8-13], for complementarity with MOSFETs and to benefit from the huge investment in silicon technology. However, it could be advantageous to use molecules for real applications, due to the size requirement discussed previously and because the reproducibility of nanoscale structures is very challenging. In addition, the load capacitance of circuits should be very low, for power consumption and speed considerations, which implies short and narrow interconnects. The most promising way to achieve this is the bottom-up approach, using naturally formed tiny structures or self-assembly methods. The best example is the carbon nanotube (CNT), which can be used to fabricate FETs [22-23], SETs [24], interconnects [25] and even non-volatile memory arrays [26]. CNTs are long cylinders of carbon atoms consisting of rolled-up sheets of graphite. For single-wall CNTs the diameter is as small as 1-5 nm. Depending on their chirality, they are semiconducting or metallic.
Their mobility is much higher than that of silicon, and ballistic transport has been demonstrated for lengths below a few hundred nm, but the subthreshold characteristics of CNFETs are not better than those of MOSFETs. Worldwide, several teams are conducting research on the selective growth or deposition of CNTs with the right chirality, and on the evaluation of CNFETs as potential candidates to replace MOSFETs in the future. For low power applications, thanks to their excellent transport properties [22], it should be possible to reduce the gate overdrive and VDD while meeting the ITRS specifications [1]. Different kinds of molecules are also currently being investigated to make nanometer-scale electronic components and circuits, but a single-molecule transistor has not yet been obtained. To date, one of the most advanced achievements is a 1µm², 64-cell crossbar matrix fabricated by HP Labs [27], in which the switching units are bundles of rotaxane molecules. The operation of such molecules is not yet clear, and other mechanisms, such as the formation of tiny filaments across the molecular gap between the electrodes, could explain the switching [28]. However, in the long term, this research subject has great potential for high density, low cost and probably ultra low power electronics. True single-molecule devices will require interconnects at a similar scale. This is also essential to reduce parasitic capacitances and the power consumption. Since the needed resolution is far beyond the possibilities of lithographic tools, including NGL, the solution will come from the bottom-up approach. An example is the realization by Caltech of a Pt nanowire lattice with width and pitch of 8 nm and 16 nm respectively [29]. Biology can also come to the rescue for the self-assembly of nano-circuits [30]. A very different approach, also mitigating the arduous task of nanoscale patterning, is the concept of self-assembled nanocells proposed by J. Tour [31].
These nanocells are disordered arrays of metallic islands that are interlinked with molecules and accessed by metallic input/output leads. Switching-type functions have been observed but, as for the work of HP [27], the creation and dissolution of metal filaments is probably responsible for the behaviour. In fact, the behaviour of electrically active molecules is strongly influenced by the surrounding electrodes and other materials, which makes a difference between molecular nanotechnology and bulk or solution-phase chemistry.

3.4 DISCUSSION

For CNT devices, as well as for nano-MOSFETs, the supply voltage reduction is limited by the effects of subthreshold leakage on the static power, leading to a trade-off with the speed of circuits. For SETs, the steepness of the On-Off switching is not better, but they offer increased functionality and low-charge operation. However, there are important issues, especially the sensitivity to offset charges and the fabrication of nanoscale structures with a sufficient level of reproducibility, which require a lot of work. Although it is not yet clear whether they could achieve a lower VDD, there are other candidates, such as resonant tunnelling diodes (RTD). Their operation is based on electron transport via discrete energy levels in double-barrier quantum well structures, leading to the existence of a negative differential resistance. This implies fabrication in suitable materials and a perfect control of the geometry, since the output characteristics are extremely sensitive to the dimensions. A promising approach is the implementation of RTDs along semiconductor nanowires [31]. There are also prospective
For instance, consider two blocks of low-capacitance molecular devices doing the same task as one block of conventional devices, but at half the clock frequency. Since the Ngate·fc product is unchanged, equation 1 shows that the power consumption is directly related to the C·V² product, the gate switching energy, which can be strongly reduced thanks to lower capacitances and to the possibility of using devices with a lower on current, since fc is divided by 2 in this case. Going further, quantum-dot cellular automata (QCA) is an attractive, yet speculative, approach to reducing the power consumption, since there is no flow of current, only Coulomb interactions [33]. The principle is to encode binary information in the charge configuration of electrostatically coupled cells containing two extra electrons. It has been shown that a clock field is needed to control the direction of propagation of information along the cells and to enable power gain. This clock could also be used for quasi-adiabatic switching, leading to extremely low power consumption. To date, experimental demonstrations have been performed at low temperature on metallic structures, but molecular implementations are being investigated in view of room-temperature operation [34].

References

[1] Semiconductor Industry Association, International Technology Roadmap for Semiconductors, 2001 Edition, http://public.itrs.net, 2003
[2] H. Grabert and M. H. Devoret, Single Charge Tunneling: Coulomb Blockade Phenomena in Nanostructures, Vol. 294 of NATO ASI Series B, Plenum Press, New York and London, 1992
[3] T. Douseki, T. Shimamura and N. Shibata, A 0.3V 3.6GHz 0.3mW frequency divider with differential ED-CMOS/SOI circuit technology, in Proc. ISSCC, February 2003
[4] N. M. Zimmerman and W. H. Huber, Excellent charge offset stability in a Si-based single-electron tunneling transistor, APL, Vol. 79, No. 19, pp. 3188-3190, 2001
[5] J. R. Tucker, Complementary digital logic based on the Coulomb blockade, JAP, Vol. 72, No. 9, pp. 4399-4413, 1992
[6] A. N. Korotkov, R. H. Chen and K. K. Likharev, Possible performance of capacitively coupled single-electron transistors in digital circuits, JAP, Vol. 78, No. 4, pp. 2520-2529, 1995
[7] M.-Y. Jeong, B.-H. Lee and Y.-H. Jeong, Design considerations for low-power single-electron transistor logic circuits, JJAP, Vol. 40, pp. 2054-2057, 2001
[8] Y. Takahashi, Y. Ono, A. Fujiwara and H. Inokawa, Silicon single-electron devices for logic applications, in Proc. ESSDERC, Florence, September 2002, pp. 61-68
[9] Y. Ono, H. Inokawa and T. Takahashi, Binary adders of multigate single-electron transistors: specific design using pass-transistor logic, IEEE Trans. on Nanotech., Vol. 1, pp. 93-99, 2002
[10] N. Takahashi, H. Ishikuro and T. Hiramoto, A directional current switch using silicon single-electron transistors controlled by charge injection into silicon nano-crystal floating dots, in Proc. IEDM, pp. 371-374, 1999
[11] K. Uchida, J. Koga, R. Ohba and A. Toriumi, Programmable single-electron transistor logic for future low-power intelligent LSI: proposal and room-temperature operation, IEEE Trans. on Elec. Dev., Vol. 50, pp. 1623-1630, 2003
[12] G. Molas, X. Jehl, M. Sanquer, B. De Salvo, M. Gely, D. Lafond and S. Deleonibus, Manipulation of periodic Coulomb blockade oscillations in ultra-scaled memories by single electron charging of silicon nanocrystal floating gates, Silicon Nano Workshop, Honolulu, June 2004
[13] K. Nishiguchi, H. Inokawa, Y. Ono, A. Fujiwara and Y. Takahashi, Automatic control of the oscillation phase of a single-electron transistor, IEEE EDL, Vol. 25, No. 1, pp. 31-33, 2004
[14] K. Yano, T. Ishii, T. Hashimoto, T. Kobayashi, F. Murai and K. Seki, Room-temperature single-electron memory, IEEE Trans. on Elec. Dev., Vol. 41, No. 9, pp. 1628-1638, 1994
[15] Z. A. K. Durrani, A. Irvine and H. Ahmed, Coulomb blockade memory using integrated single-electron transistor/metal-oxide-semiconductor transistor gain cells, IEEE Trans. on Elec. Dev., Vol. 47, pp. 2334-2339, 2000
[16] H. Sunamura, H. Kawaura, T. Sakamoto and T. Baba, Multiple-valued memory operation using a single-electron device: a proposal and an experimental demonstration of a ten-valued operation, JJAP, Vol. 41, pp. L93-L95, 2002
[17] G. Molas, B. De Salvo, D. Mariolle, G. Ghibaudo, A. Toffoli, N. Buffet and S. Deleonibus, Single electron charging phenomena at room temperature in a silicon nanocrystal memory, in Proc. WODIM 2002, Grenoble
[18] T. Morie, T. Matsuura, M. Nagata and A. Iwata, A multinanodot floating-gate MOSFET circuit for spiking neuron models, IEEE Trans. on Nanotechnology, Vol. 2, No. 3, pp. 158-164, 2003
[19] H. Inokawa, A. Fujiwara and Y. Takahashi, A multiple-valued logic and memory with combined single-electron and metal-oxide-semiconductor transistors, IEEE Trans. on Elec. Dev., Vol. 50, No. 2, pp. 462-470, 2003
[20] H. E. van den Brom et al., Counting electrons one by one – overview of a joint European research project, IEEE Trans. on Inst. and Meas., Vol. 52, No. 2, pp. 584-588, 2003
[21] S. Mahapatra, V. Pott, S. Ecoffey, A. Schmid, C. Wasshuber, J. W. Tringe, Y. Leblebici, M. Declercq, K. Banerjee and A. Ionescu, SETMOS: a novel true hybrid SET-CMOS high current Coulomb blockade oscillation cell for future nano-scale analog ICs, in Proc. IEDM 2003, pp. 703-706
[22] A. Javey, H. Kim, M. Brink, Q. Wang, A. Ural, J. Guo, P. McIntyre, P. McEuen, M. Lundstrom and H. Dai, High-k dielectrics for advanced carbon-nanotube transistors and logic gates, Nature Materials, Vol. 1, pp. 241-246, December 2002
[23] P. Avouris, Carbon nanotube electronics, Chemical Physics, Vol. 281, pp. 429-445, 2002
[24] K. Matsumoto, S. Kinoshita, Y. Gotoh, K. Kurachi, T. Kamimura, M. Maeda, K. Sakamoto, M. Kuwahara, N. Atoda and Y. Awano, Single-electron transistor with ultra-high Coulomb energy of 5000K using position-controlled grown carbon nanotube as channel, JJAP, Vol. 42, Part 1, No. 4B, pp. 2415-2418, 2003
[25] J. Li, Q. Ye, A. Cassell, H. T. Ng, R. Stevens, J. Han and M. Meyyappan, Bottom-up approach for carbon nanotube interconnects, APL, Vol. 82, No. 15, pp. 2491-2493, 2003
[26] T. Rueckes, K. Kim, E. Joselevich, G. Y. Tseng, C.-L. Cheung and C. Lieber, Carbon nanotube-based nonvolatile random access memory for molecular computing, Science, Vol. 289, pp. 94-97, 7 July 2000
[27] Y. Chen, G.-Y. Jung, D. A. Ohlberg, X. Li, D. R. Stewart, J. O. Jeppesen, K. A. Nielsen, J. F. Stoddart and R. S. Williams, Nanoscale molecular-switch crossbar circuits, Nanotechnology, Vol. 14, pp. 462-468, 2003
[28] R. F. Service, Next-generation technology hits an early midlife crisis, Science, Vol. 302, pp. 556-559, 24 October 2003
[29] N. Melosh, A. Boukai, F. Diana, B. Gerardot, A. Badolato, P. M. Petroff and J. R. Heath, Ultrahigh-density nanowire lattices and circuits, Science, Vol. 300, pp. 112-115, 4 April 2003
[30] P. Fairley, Germs that build circuits, IEEE Spectrum, pp. 37-41, November 2003
[31] M. T. Björk, B. J. Ohlsson, C. Thelander, A. I. Persson, K. Deppert, L. R. Wallenberg and L. Samuelson, Nanowire resonant tunneling diodes, APL, Vol. 81, No. 23, pp. 4458-4460, December 2002
[32] M. Saitoh and T. Hiramoto, Room-temperature operation of highly functional single-electron transistor logic based on quantum mechanical effect in ultra-small silicon dot, in Proc. IEDM 2003, pp. 753-756
[33] G. Bernstein, Quantum-dot cellular automata: computing by field polarization, in Proc. DAC 2003, June 2-6, Anaheim (CA), pp. 268-273
[34] C. Lent and B. Isaksen, Clocked molecular quantum-dot cellular automata, IEEE Trans. on Elec. Dev., Vol. 50, No. 9, pp. 1890-1896, 2003

Chapter 4

STATIC LEAKAGE REDUCTION THROUGH SIMULTANEOUS VT/TOX AND STATE ASSIGNMENT

Dongwoo Lee, Bo Zhai, David Blaauw and Dennis Sylvester
University of Michigan, Ann Arbor

Abstract: Standby leakage current minimization is a pressing concern for mobile applications that rely on standby modes to extend battery life.
In this paper, we propose new leakage current reduction methods for standby mode. First, we propose a combined approach of sleep-state assignment and threshold voltage (Vt) assignment in a dual-Vt process for subthreshold leakage (Isub) reduction. Second, for the minimization of gate oxide leakage current (Igate), which has become comparable to Isub in 90nm technologies, we extend the above method to a combined sleep-state, Vt and gate oxide thickness (Tox) assignment approach in a dual-Vt and dual-Tox process to minimize both Isub and Igate. By combining Vt or Vt/Tox assignment with sleep-state assignment, leakage current can be dramatically reduced, since the circuit is in a known state in standby mode and only certain transistors are responsible for leakage current and need to be considered for high-Vt or thick-Tox assignment. A significant improvement in the leakage/performance trade-off is therefore achievable using such combined methods. We formulate the optimization problem for simultaneous state/Vt and state/Vt/Tox assignments under delay constraints and propose both an exact method for its optimal solution and two practical heuristics with reasonable run time. We implemented and tested the proposed methods on a set of synthesized benchmark circuits and show substantial leakage current reduction compared to previous approaches using only state assignment or Vt assignment alone.

Keywords: Leakage current, reduction, performance, dual threshold voltage, oxide thickness, algorithm.

4.1 INTRODUCTION

There is a growing need for high-performance and low-power systems, especially for portable and battery-powered applications. Since these applications often remain in standby mode significantly longer than in active mode, their standby (or leakage) current has a dominant impact on battery life.
Standby mode leakage current reduction has therefore been a concern for some time, and a number of methods have been proposed to address this problem [1]-[7][9]-[18]. However, with continued process scaling, lower supply voltages necessitate reduction of threshold voltages to meet performance goals, resulting in a dramatic increase in subthreshold leakage current. New methods for reducing the leakage current in standby mode are therefore critically needed. In dual-Vt technology, the MTCMOS approach [1] was proposed, where a high-Vt sleep transistor is inserted between the power supply and the circuit logic. In standby mode, this sleep transistor is turned off, which dramatically reduces leakage due to its high Vt. However, the method requires routing of an additional set of power supply lines in the layout, as well as substantially sized sleep transistors to maintain good supply integrity and circuit performance [2]. Also, special latches that maintain state in standby mode need to be used [3]. In addition, the method does not scale well into sub-1V technologies due to the increased delay penalty of the high-Vt sleep device [4]. A different approach to standby mode leakage reduction has been proposed that leverages the state dependence of leakage current due to the so-called stack effect [5][6]. In [7], the circuit input state that minimizes leakage current is determined and special flip-flops are inserted in the design to produce this state in standby mode. The flip-flops in the design are modified to produce a predetermined state in standby mode while also maintaining the previously latched state. The required modification to a flip-flop is minor and can be incorporated in the feedback path of the slave latch with minimal impact on performance [8]. In general, determining the minimum-leakage sleep state is a difficult problem due to the inherent logic correlations in the circuit.
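Conceptually, the sleep-state problem is a search over all input vectors, exponential in the number of circuit inputs. A minimal brute-force sketch (not the heuristics from the literature; the single-gate "circuit" and its leakage numbers are made up for illustration):

```python
# Exhaustive minimum-leakage sleep-state search (illustrative only).
# Real flows use heuristics because the state space is 2^n.
from itertools import product

# Hypothetical state-dependent leakage of one NAND2 gate, in pA.
# State "00" benefits from the stack effect: both series NMOS are OFF.
NAND2_LEAK = {"00": 37.8, "01": 100.3, "10": 95.7, "11": 454.5}

def best_sleep_state(num_inputs, evaluate):
    """Return the input vector (as a bit string) minimizing circuit leakage."""
    best = min(product("01", repeat=num_inputs),
               key=lambda bits: evaluate("".join(bits)))
    return "".join(best)

# Two-input "circuit" consisting of a single NAND2 gate:
state = best_sleep_state(2, lambda s: NAND2_LEAK[s])
# -> "00", the stack-effect state with the lowest leakage
```

With real netlists, `evaluate` would propagate the input vector through the logic and sum per-gate table leakage, which is exactly what makes the problem hard: gate inputs are logically correlated and cannot be set independently.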
However, a number of efficient heuristics for this problem have been proposed [9][10]. The limitation of this approach is that for larger circuits, the reduction in leakage current is typically only in the range of 10 to 30% [9]. The above techniques are aimed primarily at subthreshold leakage current reduction, which has been the dominant component of leakage in CMOS technologies to date. However, in 90nm technologies the magnitude of the gate tunneling leakage, Igate, in a device is comparable to the subthreshold leakage, Isub, at room temperature. With difficulties in achieving manufacturable high-k insulator solutions to address the gate leakage problem, the burden of addressing this problem falls primarily on circuit designers and EDA tools. As a result, there has been recent work in the area of gate leakage analysis and reduction techniques, including pin reordering, PMOS sleep transistors, and the use of NAND implementations rather than NOR implementations [11]-[13]. Also, the MTCMOS technique was extended to combat gate leakage by using a thick-oxide I/O device with a larger gate drive than the logic transistors as the inserted sleep transistor [14]. Another previous approach to leakage reduction that targets only subthreshold leakage is the individual assignment of transistor threshold voltages in a dual-Vt process [15]-[18]. In these approaches, the trade-off between high-Vt transistors with low leakage/low performance and low-Vt transistors with high leakage/high performance is exploited. Circuit paths that are non-critical are assigned high Vt, while critical circuit portions are given low-Vt assignments. The method therefore provides a trade-off between circuit performance and leakage reduction. It was demonstrated that with a modest performance reduction of 5-10%, a significant leakage reduction of 3-4X could be obtained over a circuit with all low-Vt transistors [17].
In these approaches, high/low-Vt assignments are performed without knowledge of the state of the circuit. Therefore, in order to obtain sufficient leakage reduction under all possible circuit states, all or most of the transistors in a particular gate must be set to high Vt, and hence the gate incurs a substantial performance degradation. While such dual-Vt processes have been commonplace for several generations, the availability of multiple oxide thicknesses in a single process has only become relevant at the 90nm node due to the rise of Igate [19]. Given a process technology with dual oxide thicknesses for logic devices, the dual-Vt approach can be easily extended to also consider gate leakage by assigning thick-oxide transistors to non-critical paths as well. However, similar to the dual-Vt assignment approach, a simultaneous dual-Vt and dual oxide thickness assignment with unknown circuit states will set all or most of the transistors in a particular gate to both high Vt and thick oxide, to ensure that leakage current is acceptable under all possible circuit states in standby mode. However, transistors that are simultaneously assigned a high Vt and a thick oxide have a dramatic delay penalty compared to low-Vt transistors with thin oxide. Therefore, this approach carries a significant delay penalty for process technologies where both Isub and Igate need to be addressed. In this paper, we therefore propose new methods to reduce standby mode leakage current. We can divide our new methods into two categories: 1) simultaneous dual-Vt and sleep state assignment for Isub reduction, for technologies in which Isub is dominant in standby mode, and 2) simultaneous dual-Vt, dual oxide thickness and sleep state assignment for both Isub and Igate minimization, for technologies in which Igate is comparable to Isub. First, we combine the concepts of Vt assignment and sleep state assignment.
This approach is based on the key observation that, given a known input state for a gate, the leakage of that gate can be dramatically reduced by setting only a single OFF transistor on each path from Vdd to Gnd to high Vt. Since all other transistors in the gate are kept at low Vt and continue to have high drive current, the performance degradation is limited while a significant gain in leakage current is obtained. This approach therefore provides a much better trade-off between leakage and performance compared to Vt assignment with an unknown input state, where most or all of the transistors must be set to high Vt before a significant improvement in the leakage current is observed. The link between the effectiveness of Vt assignment and state assignment was previously observed for Domino logic [8], since these circuits are by their very nature in a known state in standby mode. However, we extend this concept to general CMOS circuits by actively controlling the circuit state in standby mode, thereby dramatically increasing the effectiveness of leakage reduction. The second proposed approach minimizes the total leakage current (Isub and Igate) by simultaneous assignment of sleep state, high-Vt and thick-oxide transistors. In this approach, a key observation is that given a known input state, a transistor need not be assigned both a high Vt and a thick oxide, since Isub occurs only in transistors that are OFF, while significant Igate occurs only in transistors that are ON. Furthermore, depending on the input state of a circuit, only a subset of transistors needs to be considered for either high-Vt or thick-oxide assignment. Therefore, the impact on the delay of the gate is significantly reduced while obtaining leakage reductions comparable to when all transistors are assigned both high Vt and thick oxides.
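The two observations above can be captured in a few lines: for a gate with a known input state, the OFF transistors are the only high-Vt candidates (Isub) and the ON transistors are the only thick-Tox candidates (Igate). A sketch for a NAND2, assuming the conventional topology where input A gates tp1/tn1 and input B gates tp2/tn2 (tp parallel PMOS, tn series NMOS):

```python
# Classify NAND2 transistors into leakage-reduction candidate sets for a
# known input state. OFF devices leak Isub -> high-Vt candidates; ON devices
# leak Igate -> thick-Tox candidates. (Illustrative sketch, not the paper's
# optimizer; the Vt-group refinement of Section 4.3 narrows this further.)

def nand2_candidates(a, b):
    """Return (high_vt_candidates, thick_tox_candidates) for inputs a, b."""
    off, on = set(), set()
    # A PMOS conducts when its gate input is 0; an NMOS when its input is 1.
    for name, is_pmos, inp in (("tp1", True, a), ("tp2", True, b),
                               ("tn1", False, a), ("tn2", False, b)):
        conducting = (inp == 0) if is_pmos else (inp == 1)
        (on if conducting else off).add(name)
    return off, on

high_vt, thick_tox = nand2_candidates(0, 1)
# AB = 01: tp2 and tn1 are OFF (high-Vt candidates), tp1 and tn2 are ON.
```

Note that no transistor ever appears in both sets, which is precisely why a known state removes the need for combined high-Vt/thick-Tox assignments.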
The proposed method is compatible with existing library-based design flows, and we explore different trade-offs between the number of Vt and Tox variations for each library cell and the obtained leakage reduction. In addition, we compare the leakage reduction obtained when Vt (the first method) and Vt/Tox (the second method) assignments can be made individually for transistors in a stack, as opposed to when an entire stack is restricted to a uniform assignment due to manufacturing or area considerations. Since the circuit state/Vt and the circuit state/Vt/Tox assignments interact, it is necessary to consider their optimization simultaneously. The state/Vt and state/Vt/Tox assignment task is to find a simultaneous assignment that minimizes the leakage current in standby mode while meeting a user-specified delay constraint. We formulate this problem as an integer optimization problem under delay constraints. The search space consists of all input state/Vt and input state/Vt/Tox assignments and hence is very large. Therefore, in addition to an exact solution, we also propose a number of heuristics. The proposed methods were implemented on benchmark circuits synthesized using an industrial cell library in 0.18µm technology for Isub minimization and in a predictive 65nm technology for both Isub and Igate minimization. On average, the proposed Isub minimization method using simultaneous state/Vt assignment improves leakage current by a factor of 6X over the traditional approach using Vt assignment only. The second proposed method, which minimizes both Isub and Igate by simultaneous state/Vt/Tox assignment, achieves an average leakage reduction of 5-6X over an all low-Vt and thin-oxide design at a 5% delay range point, and more than a 2X improvement over the first proposed approach using Vt and state assignment only (i.e., without dual Tox). The remainder of this paper is organized as follows.
In Section 4.2, we discuss the leakage model used and the characteristics of the Isub and Igate leakage currents. In Section 4.3, we present the approach using simultaneous Vt and state assignment for Isub reduction. In Section 4.4, we present the second approach, which also addresses Igate by performing simultaneous Vt, Tox and state assignment. In Section 4.5, we present our results on benchmark circuits, and in Section 4.6 we present our conclusions.

4.2 LEAKAGE MODEL AND CHARACTERISTICS

In this section, we discuss our leakage current model and briefly review the general characteristics of gate leakage current in CMOS gates. Since the proposed leakage optimization approach is library-based, we use precharacterized leakage current tables for each library cell, with specific leakage table entries for each possible input state of a library cell. The precharacterized tables were constructed using SPICE simulation with BSIM3 models from a 0.18µm technology for the Isub minimization approach. In order to represent both the Isub and Igate components for the state/Vt/Tox assignment approach, BSIM4 models were used to generate the precharacterized tables. The device simulation parameters were obtained using leakage estimates from a predictive 65nm process [20], and had a gate leakage component that was approximately 36% of the total leakage at room temperature (at which all analysis is performed).1 (Detailed numbers are given in Section 4.5.2.) The different high- and low-Vt versions of a cell, as well as the Tox and Vt versions of a cell, are explained further in Section 4.4.2. Also, the delay and output slope as a function of cell input slope and output loading were stored in precharacterized tables. The total gate leakage for a library cell consists of several different components, depending on the input state of the gate, as illustrated for the inverter cell in Figure 4.1.
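The library-based model just described amounts to a per-cell leakage table indexed by input state, summed over the netlist. A minimal sketch (the values are placeholders, not the SPICE-characterized numbers used in the paper):

```python
# State-indexed precharacterized leakage lookup (illustrative values, pA).
# A real flow would hold one such table per (cell, Vt/Tox variant) and
# analogous tables for delay and output slope.

LEAKAGE_PA = {
    ("INV",   "0"):  40.0,
    ("INV",   "1"):  55.0,
    ("NAND2", "00"): 7.2,
    ("NAND2", "01"): 26.6,
    ("NAND2", "10"): 25.7,
    ("NAND2", "11"): 14.2,
}

def circuit_leakage(instances):
    """Sum table leakage over (cell_type, input_state) instances."""
    return sum(LEAKAGE_PA[(cell, state)] for cell, state in instances)

total = circuit_leakage([("NAND2", "00"), ("INV", "1")])  # 7.2 + 55.0
```

Because the table is keyed by input state, the same lookup serves both the state-assignment search and the Vt/Tox optimization: changing a cell's variant simply swaps in a different table.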
The maximum gate tunneling current occurs when the input is at Vdd and Vs = Vd = 0V for the NMOS device. In this case, Vgs = Vgd = Vdd and Igate is at its maximum for the NMOS device. At the same time, the PMOS device exhibits substantial subthreshold leakage current. When the input is at Gnd, the output rises to Vdd and Vgs = 0 while Vgd becomes -Vdd for the NMOS device, resulting in a reverse gate tunneling current from the drain to the gate node. In this case, tunneling is restricted to the gate-to-drain overlap region, due to the absence of a channel. Since this overlap region is much smaller than the channel region, the reverse tunneling current is significantly reduced compared to the forward tunneling current [21]. Note that BSIM4 intrinsically considers this reverse tunneling current, so it is included in the precharacterized tables described above.

Figure 4.1. Inverter circuit with NMOS oxide leakage current

When the input voltage is Gnd, the PMOS device also exhibits gate current from the channel to the gate, since its Vgs = Vgd = -Vdd. The relative magnitude of the PMOS gate current in comparison to the NMOS gate current differs for different process technologies. If standard SiO2 is used as the gate oxide material, then the Igate for a PMOS device is typically one order of magnitude smaller than that for an NMOS device with identical Tox and Vdd [19][22]. This is due to the much higher energy required for hole tunneling in SiO2 compared to electron tunneling. However, in alternate dielectric materials, the energy required for electron and hole tunneling can be completely different.

1. Since this work aims at standby mode leakage, we expect junction temperatures during these idle periods to be lower than under normal operating conditions, making room temperature analysis more valid.
In the case of nitrided gate oxides, in use today in a few processes, PMOS Igate can actually exceed NMOS Igate for higher nitrogen concentrations [23][24]. In this paper, we assume that standard SiO2 gate oxide material is used and that the PMOS gate current is negligible. However, the presented methods can be easily extended to include appreciable PMOS gate leakage as well.

4.3 SUBTHRESHOLD LEAKAGE REDUCTION

4.3.1 Simultaneous Vt and State Assignment

Consider the leakage and performance of the simple NAND2 circuit shown in Figure 4.2 under different input states and Vt assignments. It is clear that, given a particular input state, only those transistors that are OFF need to be considered for high-Vt assignment, as the ON transistors are not leaking. For instance, in state AB = 01, only transistor tn1 needs to be considered for high-Vt assignment. Assigning other transistors to high Vt will only decrease the performance of the gate with no reduction in leakage current. On the other hand, in state 11 both tp1 and tp2 must be assigned high Vt in order to reduce leakage, since they are parallel devices. We can partition the transistors into so-called Vt-groups, corresponding to the minimum sets of transistors that need to be set to high Vt to reduce leakage in a particular state assignment. For the 2-input NAND gate in Figure 4.2, three Vt-groups exist.

Figure 4.2. The concept of Vt-groups for a NAND2 gate (Group 1: tp1 and tp2; Group 2: tn1; Group 3: tn2)

The concept of Vt-groups can be easily applied to more complex structures, in which case it may be possible that a transistor belongs to more than one Vt-group. It is clear that we can restrict ourselves to setting only entire Vt-groups to either high or low Vt. By considering only Vt-groups, instead of individual transistors, we therefore significantly reduce the number of possible Vt assignments and the optimization complexity.
In Table 4.1, we show the leakage current of the NAND2 gate in Figure 4.2 for different input states and Vt-group assignments. Column 3 shows the leakage current when we use high Vt for one or more Vt-groups that are OFF in a particular input state. Columns 4 and 5 show the leakage current with all transistors assigned to high Vt and to low Vt, respectively.

Table 4.1. Leakage current of a NAND2 gate

  Input state   Assigned group   Leakage current [pA]
                                 group assign.   all high-Vt   all low-Vt
  00            2                24.9            7.2           286.7
                3                 9.8
                2 and 3           7.2
  01            2                26.6            26.6          1054.0
  10            3                25.7            24.4          922.6
  11            1                14.2            14.2          357.2

We can see that in states 01, 10, and 11 only a single Vt-group is a candidate for high-Vt assignment. Also, setting only this one Vt-group to high Vt results in equal or nearly equal leakage compared with the leakage when all transistors are assigned high Vt, demonstrating the effectiveness of the approach. In state 00, three high-Vt assignments are possible: group 2, group 3, and both groups 2 and 3. However, the leakage current with both groups assigned to high Vt is only slightly better than that with only one group set to high Vt, and assigning group 3 to high Vt reduces leakage somewhat more than assigning group 2. Hence, it is clear that we need only consider assignment of group 3 to high Vt without significant loss in optimality. Table 4.1 shows that the leakage current varies considerably as different groups associated with different input states are set to high Vt. At the same time, the impact of different high-Vt group assignments on the performance of the circuit must be considered. By setting only a single group to high Vt, the performance degradation is restricted to only a single signal transition direction and is also reduced compared to high-Vt assignments where most or all transistors are set to high Vt.
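How well a single-group assignment does can be quantified as the fraction of the all-high-Vt leakage saving it captures. A sketch using the Table 4.1 rows that have a single candidate group:

```python
# Fraction of the possible leakage saving captured by one Vt-group,
# computed from Table 4.1 (leakage values in pA).

TABLE_4_1 = {
    # state: (leakage with single-group assignment, all high-Vt, all low-Vt)
    "01": (26.6, 26.6, 1054.0),  # group 2
    "10": (25.7, 24.4, 922.6),   # group 3
    "11": (14.2, 14.2, 357.2),   # group 1
}

def captured_fraction(state):
    """(all-low leakage - group leakage) / (all-low - all-high)."""
    grp, hi, lo = TABLE_4_1[state]
    return (lo - grp) / (lo - hi)

# States 01 and 11 capture 100% of the saving; state 10 over 99.8%.
```

This is the quantitative content of the observation above: one group per state delivers essentially the full all-high-Vt benefit at a fraction of the delay cost.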
Therefore, the performance/power trade-off of Vt assignment with a known input state is much improved compared to that with an unknown input state. The input state of a gate affects which transition direction is degraded by a high-Vt group assignment to the gate. Also, the position of the high-Vt group in a stack of transistors changes the impact of a high-Vt group assignment on the different input-to-output gate delays. Therefore, the input state of a gate must be chosen such that its associated high-Vt group results in the least degradation of the critical paths in the circuit. However, only the input state of the circuit as a whole can be controlled, and the logic correlations of the circuit restrict the possible assignments of gate input states. Therefore, the selection of the circuit input state and of which gates are assigned a high-Vt group must be made simultaneously to obtain the maximum improvement in leakage current with minimum loss in performance.

4.3.2 Exact Solution to Vt and State Assignment

The size of the input state space is 2^n, where n is the number of circuit inputs. For each input state assignment, there are two possible Vt assignments for each gate (one high-Vt group, which is pre-determined by its input state, and all low-Vt). The total number of possible Vt assignments is therefore 2^m, where m is the number of gates in the circuit, and the total size of the search space is 2^(n+m). In order to find an exact solution to the problem, we developed an efficient branch-and-bound method that simultaneously explores the state and Vt assignments and that exploits the characteristics of the problem to prune the search space efficiently and improve the run time. Due to the exponential nature of the problem, an exact solution is only possible for very small circuits. However, the exact approach is still useful, as the proposed heuristics are based on it. We use two types of branch-and-bound trees.
The first branch-and-bound tree determines the input state of the circuit and is referred to as the state tree. The nodes of the state tree correspond to the input variables of the circuit. Each node of the state tree is associated with a so-called gate tree, which is searched to determine the group Vt assignment. In other words, for a state tree with k nodes, there exist k copies of the gate tree. Each node in a particular gate tree corresponds to a gate in the circuit, as shown in Figure 4.3. Each node has two fanout edges, representing the assignment of that gate with all low-Vt groups (left branch) or with one high-Vt group, as determined by the input state of the gate (right branch). At the root of the state tree, the state of all input variables is unknown. As the algorithm proceeds down the tree, the state of one input variable becomes defined with each level that is traversed. At each node in the state tree, a leakage current solution can be obtained by traversing the gate tree. Note that the gate tree may be traversed both with a completely known input state, at the bottom of the state tree, and with a partially or completely unknown input state, at higher levels of the state tree. For each node in the state and gate trees, an upper and a lower bound on the leakage current are computed incrementally, as explained in Section 4.3.2.1. Note that early in the state tree the bounds on leakage will be very loose, since the state of the circuit is only partly defined. As the algorithm traverses down the state tree, the input state becomes more defined and the leakage bounds become tighter. Similarly, the leakage bounds are very wide at the top of each gate tree, as the Vt assignments of all gates are unknown, and become progressively tighter as the algorithm traverses down the tree. Only at the bottom of both the state tree and its associated gate tree do the upper and lower bounds on leakage coincide.

Figure 4.3. State tree with a gate tree at each node

The algorithm first traverses down to the bottom of the tree and then returns back up, traversing down unvisited branches in DFS manner. During the search, a tree branch is pruned when it has a lower bound on leakage that is worse than the best upper bound on leakage observed so far. In addition to pruning based on leakage bounds, we also compute a lower bound on the circuit delay at each node in the gate tree traversal and prune all branches whose lower bound exceeds the specified delay constraint. Computation of the delay bounds is also performed incrementally and is discussed in Section 4.3.2.2. Also, early in the state tree, computation of the exact minimum Vt assignment by traversing the gate tree is not meaningful, since even at the bottom of the gate tree there is considerable uncertainty in the leakage current due to the unknown input state. Therefore, the gate tree is searched only partially at the higher levels of the state tree, which results in slightly more conservative bounds but an overall improvement in the run time of the algorithm. The gate tree is also searched in DFS manner, and edges are pruned based on the computed leakage bounds. During the downward traversal of the gate tree, the high-Vt branch is always selected, provided it meets the delay constraint. This is due to the fact that the high-Vt branch always has less leakage current than the low-Vt branch. Only if the lower bound on the delay of the high-Vt branch exceeds the delay constraint is the low-Vt branch selected and the high-Vt branch pruned. Finally, the gates in the circuit are assigned to nodes in the gate tree in topological order to enable incremental delay computation. Gates of equal topological level are further sorted by decreasing leakage to improve the pruning of the search space.
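A much-simplified sketch of the gate-tree branch-and-bound: delay is modeled here as a single additive budget rather than per-path slack, bounds are recomputed from partial sums rather than incrementally, and the high-Vt branch is explored first since it always has less leakage:

```python
# Branch-and-bound over per-gate Vt choices (illustrative simplification).
# Each gate is a (leak_low, leak_high, delay_penalty_if_high) tuple.

def branch_and_bound(gates, delay_budget):
    best = {"leak": float("inf"), "assign": None}

    def dfs(i, leak_so_far, delay_so_far, assign):
        # Lower bound: assume every remaining gate takes its high-Vt group.
        lb = leak_so_far + sum(g[1] for g in gates[i:])
        if lb >= best["leak"] or delay_so_far > delay_budget:
            return  # prune on leakage bound or delay constraint
        if i == len(gates):
            best["leak"], best["assign"] = leak_so_far, tuple(assign)
            return
        lo, hi, dpen = gates[i]
        # High-Vt branch first: it always leaks less than the low-Vt branch.
        dfs(i + 1, leak_so_far + hi, delay_so_far + dpen, assign + ["H"])
        dfs(i + 1, leak_so_far + lo, delay_so_far,        assign + ["L"])

    dfs(0, 0.0, 0.0, [])
    return best["leak"], best["assign"]

# Three gates; the delay budget of 1 admits only one high-Vt assignment:
leak, assign = branch_and_bound([(10, 2, 1), (8, 1, 1), (5, 4, 1)], 1)
# -> leakage 15.0 with assignment ("H", "L", "L")
```

The real algorithm nests this inside the state tree and prunes with both leakage bounds simultaneously, but the pruning logic has the same shape.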
The input signals of the circuit are also assigned to nodes in the state tree in a specific order. We want to place inputs whose state assignment strongly influences the total leakage of the circuit near the top of the state tree. We estimate the influence of each input signal on the circuit leakage by taking the sum of the leakage currents of all gates connected to the input signal. This input variable ordering is similar to that used in [25].

4.3.2.1 Incremental leakage bound computation

During the traversal of the gate tree, some of the gates will have a known Vt assignment and others, which have not been visited, will have an unknown Vt assignment. As shown in Figure 4.4, a lower bound on the leakage is computed by assuming all unknown gates have a high-Vt group assignment, and an upper bound is computed by assuming all unknown gates have a low-Vt group assignment. When the high branch is taken in the downward traversal, only the upper bound is updated (decreased), while when a low branch is taken, only the lower bound must be updated and is increased.

Figure 4.4. Incremental leakage bound computation:
  UB(i-1) = leak(g1..g(i-1) = known gates) + leak(gi..gn = L)
  LB(i-1) = leak(g1..g(i-1) = known gates) + leak(gi..gn = H)
  Low branch (gi = L):  UB(i) unchanged;  LB(i) = LB(i-1) - leak(gi = H) + leak(gi = L);  delay unchanged
  High branch (gi = H): UB(i) = UB(i-1) - leak(gi = L) + leak(gi = H);  LB(i) unchanged;  delay increased

4.3.2.2 Incremental delay bound computation

Similar to the leakage current bounds, a lower bound on the delay is computed assuming all unknown gates have low-Vt group assignments. Delay is changed only when a high branch is taken in the traversal and is computed incrementally. We first compute the slack of the circuit for all circuit nodes at the start of the tree traversal, with all Vt assignments assumed to be low-Vt. When a group changes from a low-Vt to a high-Vt group assignment during the traversal, the slack of that gate is updated.
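The update rules of Figure 4.4 amount to a constant-time bound adjustment per visited gate. A hypothetical sketch (leakage values are abstract numbers, not from the text):

```python
def take_branch(ub, lb, leak_low, leak_high, branch):
    """Update the (upper, lower) leakage bounds when gate g_i is assigned.

    Before the assignment, g_i contributes leak_low to the upper bound
    (all unknown gates assumed low-Vt) and leak_high to the lower bound
    (all unknown gates assumed high-Vt), per Figure 4.4.
    """
    if branch == 'H':
        # g_i is now known high-Vt: only the upper bound changes,
        # dropping from the assumed low-Vt leakage to the high-Vt leakage.
        ub = ub - leak_low + leak_high
    else:
        # g_i is now known low-Vt: only the lower bound changes (increases).
        lb = lb - leak_high + leak_low
    return ub, lb
```

Applying the rule once per gate down a branch, the two bounds meet at the bottom of the gate tree, exactly as the text describes.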
However, the Vt change of the gate will affect not only the gate itself but also the delays of fanout gates, due to the slope change at the output of the changed gate. Since the slope at the output of the changed gate will become slower due to its high-Vt assignment, the delay of all fanout gates will increase, resulting in an overall increased circuit delay. Ignoring the effect of slope change on fanout gates will therefore result in the computation of an optimistic lower bound, which ensures that the optimal solution is not accidentally pruned. It also enables incremental delay computation, given that the gates are visited in topological order. As gates are visited, the changed input slope due to high-Vt assignments of a fanin gate is processed to ensure that an exact delay bound is computed at the bottom of the gate tree.

4.3.3 Heuristic Solution to Vt and State Assignment

We propose two fast heuristics that can be applied to large circuits and that produce high-quality solutions. The proposed heuristics are based on the exact method described in Section 4.3.2 and are discussed below.

Heuristic 1

In this heuristic, the state and gate tree search is limited to only one downward traversal. Note that while only a single traversal of the state tree is performed, at each node of the state tree the decision to follow the left or right child node is based on the computed bounds of the leakage using the gate tree. Each downward traversal of the gate tree visits m nodes, where m is the number of gates in the circuit. We perform exactly two such traversals at each state tree node, leading to a total run time complexity of O(nm), where n is the number of circuit inputs. Since the number of inputs generally grows approximately as sqrt(m), the total complexity of this heuristic is O(m*sqrt(m)).
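A minimal sketch of the single downward gate-tree traversal that heuristic 1 performs at each state-tree node. It uses the simplifying, hypothetical assumption that each high-Vt assignment adds an independent, additive delay penalty against a global budget; the real algorithm instead uses the incremental slack-based delay bound of Section 4.3.2.2.

```python
def greedy_gate_traversal(gates, delay_budget):
    """One downward gate-tree traversal (the core of heuristic 1):
    always prefer the high-Vt branch, per the text, unless it would
    violate the delay budget; otherwise fall back to low-Vt.

    gates: list of (high_vt_leak, low_vt_leak, delay_penalty) in
    topological order; all numbers are illustrative."""
    total_leak, delay_used, assignment = 0.0, 0.0, []
    for high_leak, low_leak, delay_penalty in gates:
        if delay_used + delay_penalty <= delay_budget:
            total_leak += high_leak      # high-Vt: less leakage, more delay
            delay_used += delay_penalty
            assignment.append('H')
        else:
            total_leak += low_leak       # low-Vt: keeps the path fast
            assignment.append('L')
    return total_leak, assignment
```

Since this visits each of the m gates once, and two such traversals are made at each of the n state-tree nodes, the O(nm) complexity quoted above follows directly.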
Heuristic 2

In the second heuristic, the state tree is searched more extensively, subject to a fixed run time constraint, while the gate tree search is kept to a single downward traversal for each state tree node. Experimentally, it was found that the solution at the first bottom node reached in the gate tree search is close to the optimal Vt assignment. This is due to the fact that the gate tree always chooses the high-Vt child in its downward traversal, which tends to produce a high-quality result. This is in contrast to the state tree, where choosing the correct child during the downward traversal was found to be much more difficult. Therefore, the solution quality was found to improve most by searching the state tree more extensively, subject to a run time constraint, while limiting the gate tree search to a single downward traversal.

4.3.4 Vt Assignment Control within Stacks

We assume the ability to assign Vt on an individual basis within stacks of transistors. Although it is generally possible to assign the Vt of each transistor in a stack individually, this may result in the need for increased spacing between the transistors in order not to violate design rules and to ensure manufacturability [26]. Hence, at times it may be desirable to restrict the assignment of Vt such that all transistors in a stack are uniform. In this case, less flexibility exists in the assignment of Vt, and hence the obtained trade-off between delay and leakage will degrade to some extent. In Section 4.5.1, we present results showing the impact on the leakage optimization when uniform stack assignments are enforced in the library.

4.4 LEAKAGE REDUCTION METHOD FOR BOTH SUBTHRESHOLD AND GATE LEAKAGE CURRENT

4.4.1 Leakage Reduction Approach

The proposed leakage optimization method performs simultaneous assignment of standby mode state and high-Vt and thick-oxide transistors.
The proposed method is based on the key observation that, given a known input state, a transistor need not be assigned both a high-Vt and a thick oxide. This is due to the fact that for a transistor that is OFF, gate leakage is significantly reduced, and hence the transistor only needs to be considered for high-Vt assignment. Conversely, a transistor that, given a particular input state, is ON may exhibit significant Igate but does not impact Isub. Hence, conducting transistors only need to be considered for thick-oxide assignment. If the input state is unknown in standby mode, it cannot be predicted at design time which transistors will be ON or OFF, and therefore all or most transistors must be assigned both high-Vt and thick oxide in order to significantly reduce the total average leakage. However, given a known input state, we can avoid assignment of transistors to both high-Vt and thick oxide, thereby significantly improving the obtained leakage / delay trade-off. Furthermore, depending on the input state of a circuit, only a subset of transistors needs to be considered for high-Vt or thick oxide, as discussed in Section 4.3.1. For instance, in a stack of several transistors that are OFF, only one transistor needs to be assigned high-Vt to effectively reduce the total Isub. Similarly, Igate for transistors in a stack also has a strong dependence on their position. If a conducting transistor is positioned above a non-conducting transistor in a stack, its Vgs and Vgd will be small and its gate leakage will be reduced. Hence, depending on the input state, only a small subset of all ON transistors needs to be assigned thick oxide and only a subset of all OFF transistors needs to be considered for high-Vt assignment. We illustrate the advantage of high-Vt and thick-oxide assignment with a known input state for a 2-input NAND and NOR gate in Figure 4.5. In Figure 4.5(a), a 2-input NOR gate is shown with input state 01.
Since only PMOS transistor p2 is OFF in the pull-up stack, it is the only transistor that needs to be set to high-Vt to reduce the subthreshold leakage of the gate. Similarly, only NMOS transistor n2 exhibits gate leakage and needs to be assigned thick oxide to reduce Igate. Hence, only two out of four transistors are affected, while the total leakage current is reduced by nearly the same amount as when all transistors in the gate are set to high-Vt and thick oxide simultaneously.

Figure 4.5. High-Vt and thick-oxide assignments at different input states

As a result, the delay of the rising input transition at input i1 is unaffected by the high-Vt and thick-oxide assignments, while the other transitions are affected only moderately. In Figure 4.5(b), the worst-case input state for a NOR2 gate is shown, which is when both inputs are 1. In this case, both NMOS devices must be assigned thick oxide to reduce Igate, while at least one PMOS device is set to high-Vt. Depending on the delay requirements, the best input state is either the state 01, shown in Figure 4.5(a), or the state 00, shown in Figure 4.5(c), which requires only two transistors to be set to high-Vt. Hence, it is clear that the input state significantly impacts the ability to effectively assign high-Vt and thick oxides without degrading the performance of the circuit. This leads to the need for a simultaneous optimization approach, where both the input state and the high-Vt and thick-oxide assignments are considered simultaneously under delay constraints. In addition to high-Vt and thick-oxide assignment, we also take advantage of the Igate dependence on input pin ordering to reduce leakage current [11]. This is illustrated in Figure 4.5(d) for a 2-input NAND gate with input state 01.
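The ON/OFF case analysis above can be captured compactly for the series NMOS stack of a NAND2. The function below is an illustrative model consistent with Figures 4.5 and 4.6; the device names n1 (top, driven by input a) and n2 (bottom, driven by input b) follow the text, but the encoding is ours, not a library API.

```python
def nand2_stack_assignment(a, b):
    """Decide which devices in the NAND2 NMOS stack need high-Vt
    (to block Isub) or thick oxide (to cut Igate) for a known
    standby state (a, b). Simplified model from the text's rules:
    one high-Vt device suffices per OFF stack, and an ON device
    stacked above an OFF one has negligible Igate."""
    n1_on, n2_on = bool(a), bool(b)
    high_vt, thick_ox = set(), set()
    if n1_on and n2_on:
        # Whole stack conducts: both devices see full gate leakage.
        thick_ox = {'n1', 'n2'}
    elif not n1_on and not n2_on:
        # Whole stack OFF: one high-Vt device blocks Isub for the stack;
        # gate leakage is negligible in an OFF stack.
        high_vt = {'n1'}
    elif n1_on and not n2_on:
        # n1 (ON, above the OFF n2) has small Vgs/Vgd -> Igate negligible;
        # n2 alone gets high-Vt to block Isub.
        high_vt = {'n2'}
    else:
        # n1 OFF above a conducting n2: n1 blocks Isub (high-Vt);
        # n2 conducts with full Vgs -> needs thick oxide for Igate.
        high_vt, thick_ox = {'n1'}, {'n2'}
    return high_vt, thick_ox
```

The (0, 1) case reproduces the Figure 4.5(d) assignment that pin reordering then improves upon, by converting it into the cheaper (1, 0) case.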
In order to effectively reduce the leakage under this input state, NMOS transistor n1 must be assigned high-Vt and NMOS transistor n2 must be assigned thick oxide. However, if input pins i1 and i2 are reordered, with i1 positioned at the bottom of the stack, as shown in Figure 4.5(e), the Vgs and Vgd voltages of NMOS transistor n1 will be reduced from Vdd to approximately one Vt drop. Hence, the gate leakage current of n1 will be substantially reduced and can be ignored. After reordering the input pins, it is only necessary to set NMOS transistor n2 to high-Vt, without further assignment of thick-oxide transistors. It should be noted that pin reordering will impact the delay of the circuit, and hence some performance penalty might be incurred. However, this penalty will be readily offset by the elimination of the thick-oxide assignment in the pull-down stack. In this work, we therefore consider combined input state assignment with pin reordering and Vt / Tox assignment.

4.4.2 Cell Library Construction

In order to perform simultaneous Vt, Tox and state assignment, it is necessary to develop a library in which, for each cell, the necessary Vt and Tox versions are available. After such a library has been constructed, Vt and Tox assignments can be performed by simply swapping cells from the library. Since different Vt and Tox variations do not alter the footprint of a cell, the leakage optimization can be performed either before or after final placement and routing. For each gate and input state, a number of different Tox and Vt assignments are possible, providing different delay / leakage trade-off points. For the fastest and highest-leakage trade-off point, all transistors are assigned low-Vt and thin oxides, as in the NAND2 gate shown in Figure 4.6(a). On the other hand, for the slowest and lowest-leakage version of the cell, all transistors contributing to leakage are assigned either high-Vt or thick oxide.
For instance, for the NAND2 gate with input state 11, shown in Figure 4.6(b), all transistors affect the leakage current, and both NMOS transistors are assigned thick Tox while both PMOS transistors are assigned high-Vt to obtain the minimum leakage / maximum delay trade-off point. In addition to the fastest version and the minimum leakage version of the cell, a number of other intermediate trade-off points can be constructed by assigning only some of the transistors that contribute to leakage to high-Vt or thick Tox. These cell versions have lower leakage than the fastest cell version but are faster than the lowest-leakage version. It is clear that a large number of possible cell versions can be constructed if all possible trade-off points are considered for each possible input state. While a larger set of cell versions provides the optimization algorithm with more flexibility, and hence a better leakage result, it also increases the size of the library, which is undesirable. Therefore, we initially restrict our library to at most 4 different trade-off points for each input state of a library cell, which are: 1) minimum delay, shown in Figure 4.6(a); 2) minimum leakage, shown in Figure 4.6(b); 3) fast falling transition but slow rising transition, with intermediate leakage, shown in Figure 4.6(c); and 4) fast rising transition but slow falling transition, with intermediate leakage, shown in Figure 4.6(d).

Figure 4.6. Complete Vt-Tox versions of NAND2 gate

Although other possible trade-off points could be considered, we empirically found that these four points yield good optimization results and provide a systematic approach for constructing all versions of a cell.
In principle, using four possible trade-off points for each input combination could result in as many as 16 (4x4) cell versions for a 2-input gate. However, in practice, many of the cell versions are shared between different input states. Also, in some cases not all 4 trade-off points are realizable, and hence the total number of cell versions is significantly less. We illustrate this for the NAND2 gate for input state 00. The fastest cell version is again shown in Figure 4.6(a) and is shared for all input combinations, and the minimum leakage version is shown in Figure 4.6(e). Note that only one transistor needs to be set to high-Vt to achieve minimum leakage for this input state. This results from the fact that PMOS devices have negligible gate leakage in the target technology and only one transistor in a stack needs to be set to high-Vt to reduce the leakage through the entire stack. Hence, for the input state 00, only two trade-off points are needed and only one additional cell version is added to the library. Input state 10 again requires the assignment of only a single transistor to high-Vt for the minimum leakage version, as shown in Figure 4.6(f). This is due to the fact that the gate leakage through the top NMOS transistor n1 is negligible, since its Vgs and Vgd are reduced to approximately one Vt drop. Only two trade-off points are therefore required for this input state, and both versions are shared with the 00 state. Finally, if the 01 state occurs in the circuit, the optimization will automatically perform input pin swapping for all but the fastest trade-off point, thereby resulting in no additional cell version. The NAND2 gate therefore requires a total of 5 cell versions to provide up to 4 trade-off points for each input state. In Table 4.2, we show the delay / leakage trade-offs obtained for each input state using the described approach for the NAND2 gate.

Table 4.2. Trade-offs for different Vt-Tox versions of NAND2 gate

                                Total leakage   Normalized rise delay   Normalized fall delay
State  Cell                     current [nA]      pin A     pin B         pin A     pin B
11     Minimum delay (a)            270.4          1.00      1.00          1.00      1.00
       Fast rise delay (d)          109.1          1.00      1.36          1.27      1.27
       Fast fall delay (c)           91.4          1.36      1.36          1.00      1.00
       Minimum leakage (b)           19.5          1.36      1.37          1.27      1.27
00     Minimum delay (a)             41.2          1.00      1.00          1.00      1.00
       Minimum leakage (e)           14.0          1.00      1.00          1.12      1.16
10     Minimum delay (a)             91.8          1.00      1.00          1.00      1.00
       Minimum leakage (f)           13.3          1.00      1.00          1.12      1.16

The same process can be applied to each cell in the library to construct the full set of cell versions for the leakage optimization method. Table 4.3 shows the number of cell versions required for several common gates. Note that the number of cell versions is higher for NOR gates than for NAND gates. Since for a full library the total number of cells would increase significantly, we also explored reducing the number of cells by allowing only two trade-off points for each cell (minimum delay and minimum leakage) instead of 4. In this case, the number of cells for the NAND2 gate reduces to only 3 versions. The number of cell versions required for two trade-off points for different cell types is shown in Table 4.3, column 3. In column 4, we add one more cell library option: two trade-off points with a reduced number of cells. In order to minimize the number of needed library cells, one or two cells of NOR2 and NOR3, respectively, are removed from the library with only a small degradation of the leakage/delay trade-off. Therefore, all gates have only three cells in this option. In Section 4.5.2 we compare the final leakage results using the full library with 4 trade-off points, the reduced library with only two trade-off points, and the minimum-size library with two trade-off points.
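One plausible bookkeeping of the NAND2 cell versions described above, written as a small sketch. The version labels (a)-(e) follow Figure 4.6; treating the state-10 minimum-leakage cell as shared with the state-00 version (e), as the sharing discussion suggests, and letting state 01 reuse state 10 via pin swapping, reproduces the text's count of five distinct versions. This accounting is our illustration, not part of the original library.

```python
# Hypothetical map from standby input state to the cell versions the
# optimizer may choose among (labels per Figure 4.6).
NAND2_VERSIONS = {
    '11': ['a', 'd', 'c', 'b'],  # min delay, fast rise, fast fall, min leak
    '00': ['a', 'e'],            # only two distinct trade-off points
    '10': ['a', 'e'],            # min-leakage version shared with state 00
    '01': ['a', 'e'],            # pin swapping reuses the state-10 versions
}

def library_size(versions_by_state):
    """Distinct cell versions the library must actually contain."""
    return len({v for versions in versions_by_state.values() for v in versions})
```

Counting the distinct labels gives the five NAND2 versions stated in the text, while still offering up to four trade-off points for state 11.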
Finally, we consider Vt and Tox assignment control within stacks, similar to the discussion of Vt stack control in Section 4.3.4. However, Tox assignment differs from Vt assignment in that the assignment of Tox to transistors in a stack is already uniform, due to the use of pin swapping. This is evident from the 5 added cell versions for the NAND2 in Figure 4.6 and can easily be shown to be true for all cell versions generated under the proposed approach. This is a significant advantage, since spacing design rules for different Tox assignments are expected to be more severe than those for spacing between different Vt assignments [26]. However, the Vt assignment is not always uniform, as shown in Figure 4.6(e), where only a single transistor in a stack is assigned high-Vt. In the event that a uniform stack is required, both transistors in the stack need to be set to high-Vt, resulting in a slightly worsened delay / leakage trade-off.

Table 4.3. The number of needed library cells

            4 trade-off   2 trade-off   2 trade-off points with
Gate          points        points      reduced number of cells
Inverter        5             3                   3
NAND2           5             3                   3
NAND3           5             3                   3
NOR2            8             4                   3
NOR3            9             5                   3

Leakage current comparison results between individual and uniform stack assignment control will be shown in Section 4.5.2.

4.4.3 Optimization Approach and Heuristics

In this section, we present an exact solution and two heuristics to the problem of finding simultaneous input state, high-Vt and thick-Tox assignments for a circuit under delay constraints. As mentioned, the leakage minimization problem can be formulated as an integer optimization problem under delay constraints. The size of the input state space is 2^n, where n is the number of circuit inputs. As discussed in Section 4.4.2, for each input state assignment, there are up to four possible Vt-Tox assignments for each gate.

Figure 4.7. State tree with gate tree at each node
Note that while the total number of cell versions can be larger than 4, only 4 of them need to be considered for each specific input state. For instance, for the NAND2 gate in Figure 4.6, only versions (a)-(d) are considered for the 11 input state. Therefore, the total number of possible Vt-Tox assignments is 4^m, where m is the number of gates in the circuit, and the total size of the search space is 2^n * 4^m = 2^(n+2m). In order to find an exact solution to the problem, we extend the branch-and-bound method of Section 4.3.2. The branch-and-bound algorithm for Vt-Tox and state assignment uses two interdependent search trees: a state tree and a gate tree. The state tree is searched to determine the input state of the circuit, and the gate tree is searched to determine the Vt-Tox assignment of the circuit, as shown in Figure 4.7. The only difference from Section 4.3.2 is the gate tree. Each node in a particular gate tree corresponds to a gate in the circuit. Since there are four possible Vt-Tox assignments for a gate, each node of the gate tree has four edges: minimum delay, minimum leakage, fast fall delay with intermediate leakage, and fast rise delay with intermediate leakage. The exponential nature of the problem makes it impossible to obtain an exact solution for substantial circuits, just as for the Isub minimization approach of Section 4.3.2. Therefore, we also use the two heuristics discussed in Section 4.3.3.

4.5 RESULTS

4.5.1 Subthreshold Leakage Reduction

The proposed methods for simultaneous state and Vt assignment were tested on the ISCAS benchmark circuits [27] and a 64-bit ALU circuit, synthesized using a 0.18µm industrial library with Synopsys. This technology has a difference of 14X (10X) in Isub and 16% (15%) in delay between low-Vt and high-Vt NMOS (PMOS) devices. The leakage current for each Vt version of a cell was computed using SPICE simulation and stored in precharacterized tables.
Delay computation was performed based on the Synopsys table delay model and was verified to match Synopsys timing analysis delay reports. In addition to the proposed methods, traditional methods using only state or only Vt assignment were also implemented for comparison. The state-only assignment was implemented using the approach discussed in [25], while for Vt-only assignment a method similar to the sensitivity-based approach of [17] was used. Table 4.4 compares the leakage results obtained by the two proposed heuristics under three delay constraints to the average leakage computed using 10,000 random input vectors. The columns marked 0%, 5%, and 10% refer to leakage minimization results when the delay constraint was set at 0%, 5%, and 10%, respectively, of the full delay range between the all-low-Vt and all-high-Vt circuit delays, as illustrated in Figure 4.8. The 0% column is therefore the most stringently constrained optimization, as it corresponds to the best obtainable delay for the circuit (no performance penalty).

Figure 4.8. Delay point from all low-Vt to all high-Vt range

Table 4.4. Leakage current comparison between heuristics
(Minimized leakage current [nA]; X = reduction factor vs. average leakage current; Time in seconds)

         Avg. Ileak    0% of low-Vt/high-Vt range    5% of low-Vt/high-Vt range    10% of low-Vt/high-Vt range
         by random      Heuristic 1     Heuristic 2    Heuristic 1     Heuristic 2    Heuristic 1     Heuristic 2
Circuit  (10K) vectors  Ileak  X  Time   Ileak   X     Ileak  X  Time   Ileak   X     Ileak  X  Time   Ileak   X
C432        32.9          7.7  4.3    1    4.3  7.7      4.9  6.7    1    3.6  9.2      4.7  7.0    1    3.6  9.1
C499        94.0         13.2  7.1    3   11.3  8.3     13.1  7.2    2   11.6  8.1      9.7  9.6    2    9.7  9.6
C880        73.4          9.7  7.5    4    8.9  8.3      8.9  8.2    3    8.3  8.8      8.9  8.3    4    8.3  8.8
C1355       85.1         19.0  4.5    3   12.7  6.7     14.6  5.8    3   11.7  7.3     12.0  7.1    3   11.0  7.7
C1908       82.8         19.0  4.3    2   15.1  5.5     15.5  5.3    2   12.2  6.8     13.4  6.2    2   10.3  8.0
C2670      162.5         12.7 12.8   58   12.5 13.0     12.7 12.8   55   12.4 13.1     14.3 11.3   55   12.2 13.3
C3540      173.1         20.1  8.6   10   16.4 10.6     20.5  8.4   10   14.6 11.8     17.4 10.0    9   14.5 11.9
C5315      309.1         26.4 11.7  169   25.9 11.9     27.5 11.2  164   25.2 12.3     28.5 10.9  165   25.2 12.2
C6288      451.5        157.5  2.9   47  153.9  2.9    145.5  3.1   44  141.4  3.2    135.8  3.3   43  128.4  3.5
C7552      385.8         31.0 12.4  330   30.6 12.6     30.8 12.5  330   30.1 12.8     30.7 12.6  328   29.6 13.0
alu64      332.3         46.0  7.2  405   43.6  7.6     47.2  7.0  408   44.5  7.5     43.0  7.7  406   42.0  7.9
AVG (X)                        7.6              8.6            8.0              9.2            8.5              9.6

Note that a simple replacement of all low-Vt devices with high-Vt ones would yield a ~20% circuit delay increase. Thus, when interpreting the results in this section, a 10% delay point indicates that the circuit after Vt assignment has a delay that is approximately 2% larger than that of the original fastest implementation. Since the average leakage current with 10,000 random input vectors is computed with all low-Vt transistors, it also corresponds to the 0% delay criterion. Runtimes for heuristic 1 are given in Table 4.4 in seconds. Heuristic 2 was limited to a runtime of 1800 seconds (30 minutes). We report the reduction factor relative to the average leakage current over the 10,000 random vectors.
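The delay-point convention can be made concrete with a one-line conversion. This is only a sanity-check sketch; the ~20% full-range figure is the approximate value quoted in the text for the 0.18µm library.

```python
def absolute_slowdown(delay_point_pct, full_range_pct):
    """Convert a delay-range point (e.g. 10%) into the absolute slowdown
    over the fastest all-low-Vt implementation, given the full
    low-Vt-to-high-Vt delay spread in percent."""
    return delay_point_pct * full_range_pct / 100.0
```

For the 0.18µm library, absolute_slowdown(10, 20) gives the ~2% slowdown quoted above; the same formula with the ~70% spread of the 65nm library in Section 4.5.2 gives ~3.5% at the 5% point, matching the "approximately 4%" statement there.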
Heuristic 2 achieves ~10% lower leakage than heuristic 1 at the 5% delay point across the benchmark circuits. However, heuristic 2 has a 4-5X runtime overhead for large circuits (~1000X for small circuits) over heuristic 1. In Table 4.5, we compare the proposed approach with traditional techniques, including state-only and Vt-only assignment methods. The state-only assignment method was limited to a runtime of 1800 seconds (30 minutes).

Table 4.5. Leakage current comparison with traditional techniques
(Minimized leakage current [nA]; X = reduction factor vs. average leakage current)

                           Avg. Ileak                    0% of delay range       5% of delay range       10% of delay range
         Number of         by random    State-only       Vt-only   Heuristic 1   Vt-only   Heuristic 1   Vt-only   Heuristic 1
Circuit  Inputs  Gates     (10K) vect.  Ileak    X       Ileak  X   Ileak   X    Ileak  X   Ileak   X    Ileak  X   Ileak   X
C432        36     177        32.9       26.3  1.25       30.8 1.1    7.7  4.3    29.5 1.1    4.9  6.7    29.2 1.1    4.7  7.0
C499        41     519        94.0       86.1  1.09       85.0 1.1   13.2  7.1    57.2 1.6   13.1  7.2    40.4 2.3    9.7  9.6
C880        60     364        73.4       63.7  1.15       64.6 1.1    9.7  7.5    63.9 1.1    8.9  8.2    20.2 3.6    8.9  8.3
C1355       41     528        85.1       81.4  1.04       94.0 0.9   19.0  4.5    65.1 1.3   14.6  5.8    53.4 1.6   12.0  7.1
C1908       33     432        82.8       74.6  1.11       67.0 1.2   19.0  4.3    46.5 1.8   15.5  5.3    30.3 2.7   13.4  6.2
C2670      233     825       162.5      146.2  1.11       44.7 3.6   12.7 12.8    39.7 4.1   12.7 12.8    27.8 5.8   14.3 11.3
C3540       50     940       173.1      155.7  1.11      161.9 1.1   20.1  8.6   148.4 1.2   20.5  8.4    82.4 2.1   17.4 10.0
C5315      178    1627       309.1      283.1  1.09      290.6 1.1   26.4 11.7   289.7 1.1   27.5 11.2   108.3 2.9   28.5 10.9
C6288       32    2470       451.5      412.4  1.09      417.0 1.1  157.5  2.9   259.5 1.7  145.5  3.1   233.0 1.9  135.8  3.3
C7552      207    1994       385.8      352.3  1.10      360.2 1.1   31.0 12.4   353.5 1.1   30.8 12.5   350.9 1.1   30.7 12.6
alu64      131    1803       332.3      294.5  1.13      312.8 1.1   46.0  7.2   288.5 1.2   47.2  7.0   230.1 1.4   43.0  7.7
Avg. (X)                                       1.12            1.3         7.6         1.6         8.0         2.4         8.5
The results demonstrate that a substantial improvement in standby leakage current can be obtained using the proposed methods, with an average improvement of ~80% (5-6X) over Vt-only assignment for the 0% and 5% delay constraints. Table 4.6 compares leakage current results for both individual and uniform stack control. Since uniform stack control degrades the delay/leakage trade-off, as discussed in Section 4.3.4, the results for uniform stack assignment exhibit less leakage reduction than those of individual stack control. It is interesting to note, however, that the leakage current degradation when moving to a less fine-grained threshold voltage assignment scheme is not overly large, implying that even with manufacturing constraints the proposed technique provides significant leakage savings.

Table 4.6. Leakage current comparison between individual and uniform stack control
(Minimized leakage current [nA] at the 5% point of the low-Vt/high-Vt delay range; X = reduction factor vs. average leakage current)

         Avg. Ileak                     Heuristic 1          Heuristic 1
         by random     Vt-only        Individual control   Uniform control
Circuit  (10K) vect.   Ileak    X       Ileak    X           Ileak    X
C432        32.9        29.5   1.1        4.9   6.7            6.8   4.8
C499        94.0        57.2   1.6       13.1   7.2           12.5   7.5
C880        73.4        63.9   1.1        8.9   8.2            9.1   8.1
C1355       85.1        65.1   1.3       14.6   5.8           23.7   3.6
C1908       82.8        46.5   1.8       15.5   5.3           15.7   5.3
C2670      162.5        39.7   4.1       12.7  12.8           12.9  12.6
C3540      173.1       148.4   1.2       20.5   8.4           24.1   7.2
C5315      309.1       289.7   1.1       27.5  11.2           28.5  10.9
C6288      451.5       259.5   1.7      145.5   3.1          163.1   2.8
C7552      385.8       353.5   1.1       30.8  12.5           31.3  12.3
alu64      332.3       288.5   1.2       47.2   7.0           44.6   7.5
Avg. (X)                       1.6              8.0                  7.5

Finally, Figure 4.9 plots the leakage results for the proposed method and the two traditional methods as a function of the delay constraint for circuit c6288. The optimization was performed for a range of delay constraints. The proposed method provides its largest improvements at tight delay constraints. This is due to the fact that, as the delay constraint becomes looser, more transistors can be set to high-Vt in both approaches, and the relative advantage of the proposed approach reduces. However, leakage reduction is most challenging under tight performance constraints, at which the proposed technique holds promise.

Figure 4.9. Leakage current comparison for c6288 (total leakage current [nA] versus the delay point from the all-low-Vt to all-high-Vt range [%], for the average current with low-Vt, state assignment only with low-Vt, dual-Vt assignment only, the proposed heuristic 1, and state assignment only with high-Vt)

4.5.2 Leakage Reduction for both Subthreshold and Gate Leakage

The proposed methods for simultaneous state, Vt and Tox assignment were implemented on a number of benchmark circuits [27] synthesized using a library based on a predictive 65nm process [20]. In this technology, the difference in Igate between the thin-oxide and thick-oxide NMOS devices is 11X, whereas Isub is reduced by 17.8X (16.7X) when replacing a low-Vt NMOS (PMOS) device with a high-Vt version. Table 4.7 shows relative leakage and delay values for the four possible Vt and Tox assignments of NMOS devices in this technology.

Table 4.7. Comparison of leakage and delay between the four possible Vt-Tox assignments for NMOS
(normalized values)

Assignment                        Leakage
Vt      Oxide thickness   Isub   Forward Igate   Reverse Igate   Delay
Low     Thin              1.00       0.41            0.22         1.00
High    Thin              0.06       0.31            0.22         1.33
Low     Thick             0.73       0.04            0.00         1.26
High    Thick             0.05       0.03            0.02         1.69

A comparison of our first and second heuristics, along with the average leakage computed using 10,000 random input vectors, is shown in Table 4.8. The total leakage current is given in µA and the runtime in seconds. For heuristic 2, we set the runtime limit to 1800 seconds (30 minutes).

Table 4.8. Leakage current comparison between heuristics with the 4-option, individual stack control library
(Ileak in µA; X = reduction factor vs. average leakage current; Time in seconds)

         Avg. Ileak    0% of best-worst delay range   5% of best-worst delay range   10% of best-worst delay range
         by random      Heuristic 1     Heuristic 2     Heuristic 1     Heuristic 2     Heuristic 1     Heuristic 2
Circuit  (10K) vectors  Ileak  X  Time   Ileak   X      Ileak  X  Time   Ileak   X      Ileak  X  Time   Ileak   X
c432        24.5          8.2  3.0    3    5.4  4.6       7.7  3.2    2    3.2  7.6       5.5  4.5    2    3.0  8.2
c499        65.8         32.2  2.0    7   31.1  2.1      26.1  2.5    7   24.6  2.7      22.7  2.9    6   20.8  3.2
c880        50.1         10.3  4.9    8    9.2  5.5       8.5  5.9    7    8.3  6.1       8.5  5.9    7    7.0  7.1
c1355       70.8         20.4  3.5    8   20.4  3.5      15.8  4.5    6   13.1  5.4       9.9  7.1    6    9.9  7.1
c1908       56.7         17.4  3.3    5   16.9  3.4      14.8  3.8    4   13.6  4.2      13.2  4.3    5   10.5  5.4
c2670      104.7         14.9  7.0   82   14.7  7.1      12.3  8.5   78   12.2  8.6      13.5  7.8   78   11.3  9.3
c3540      128.5         27.7  4.6   20   23.7  5.4      22.1  5.8   18   19.9  6.4      18.6  6.9   17   17.4  7.4
c5315      221.2         36.6  6.0  219   35.9  6.2      30.0  7.4  213   30.0  7.4      28.4  7.8  202   27.6  8.0
c6288      346.8        153.6  2.3   75  146.0  2.4     112.2  3.1   64  101.4  3.4      84.1  4.1   59   75.6  4.6
c7552      270.0         34.9  7.7  410   33.4  8.1      32.2  8.4  404   31.8  8.5      30.3  8.9  399   30.2  8.9
alu64      260.0         48.7  5.3  468   46.8  5.6      43.4  6.0  464   41.6  6.3      34.3  7.6  458   33.1  7.9
AVG (X)                        4.5              4.9             5.4              6.0             6.2              7.0

The average leakage computed using the random vectors can be used to approximate the standby mode leakage if state assignment as well as dual-Vt and dual-Tox techniques were not employed. Again, the delay range points used in all results are defined by a percentage of the maximum possible delay increase associated with moving from an all low-Vt, thin-oxide design to an all high-Vt, thick-oxide implementation.
Note that a simple replacement of all fast devices with their slowest counterparts would yield a ~70% circuit delay increase. Thus, when interpreting the results in this section, a 5% delay point indicates that the circuit after Vt and Tox assignment has a delay approximately 4% larger than that of the original, fastest implementation. As shown in Table 4.8, heuristic 2 generally provides somewhat better results, but at much greater runtimes. On average, heuristic 2 provides ~10% lower leakage current than heuristic 1 across these benchmarks at the 5% delay point, similar to the results in Section 4.5.1. The improvement of the two proposed heuristics over the average leakage without state, Vt, or Tox assignment is dramatic, approaching 7X at the 10% point in the best-worst delay range. More aggressively, with just a 5% delay penalty the reduction in total standby leakage is 5.3-6X, with a maximum improvement of 8.6X for heuristic 2 on circuit c2670.

In Table 4.9 we compare our results to other standby-mode techniques, including state assignment alone and simultaneous state and Vt assignment (as in the previous section). The total leakage current values are given in μA. Again, for consistency we report the reduction factor relative to the average leakage current over 10,000 random vectors.

Table 4.9. Leakage current comparison with the 4-option, individual stack control library (Ileak in μA)

            Avg. Ileak,   State Assign.   0% delay point          5% delay point          10% delay point
            10K random    Only            Vt & State   Heu1       Vt & State   Heu1       Vt & State   Heu1
  Circuit   vectors       Ileak   X       Ileak  X     Ileak  X   Ileak  X     Ileak  X   Ileak  X     Ileak  X
  c432         24.5        22.7  1.08      13.3  1.8     8.2 3.0   12.5  2.0     7.7 3.2   12.7  1.9     5.5 4.5
  c499         65.8        63.9  1.03      41.9  1.6    32.2 2.0   35.7  1.8    26.1 2.5   32.2  2.0    22.7 2.9
  c880         50.1        46.0  1.09      18.9  2.6    10.3 4.9   17.5  2.9     8.5 5.9   16.9  3.0     8.5 5.9
  c1355        70.8        67.4  1.05      39.9  1.8    20.4 3.5   33.0  2.1    15.8 4.5   29.8  2.4     9.9 7.1
  c1908        56.7        54.8  1.04      27.6  2.1    17.4 3.3   25.8  2.2    14.8 3.8   22.9  2.5    13.2 4.3
  c2670       104.7       101.4  1.03      33.3  3.1    14.9 7.0   32.7  3.2    12.3 8.5   31.9  3.3    13.5 7.8
  c3540       128.5       121.8  1.05      54.5  2.4    27.7 4.6   51.5  2.5    22.1 5.8   48.5  2.7    18.6 6.9
  c5315       221.2       215.1  1.03      81.2  2.7    36.6 6.0   77.1  2.9    30.0 7.4   73.7  3.0    28.4 7.8
  c6288       346.8       306.7  1.13     209.3  1.7   153.6 2.3  180.4  1.9   112.2 3.1  153.7  2.3    84.1 4.1
  c7552       270.0       262.6  1.03      88.9  3.0    34.9 7.7   86.6  3.1    32.2 8.4   86.1  3.1    30.3 8.9
  alu64       260.0       237.2  1.10      90.7  2.9    48.7 5.3   86.1  3.0    43.4 6.0   81.1  3.2    34.3 7.6
  AVG                            1.06            2.3         4.5         2.5         5.4         2.7         6.2

We first point out that state assignment alone, which we accomplish by searching the state tree only, achieves very little improvement in standby-mode leakage: about 6%. Adding Vt assignment, the algorithm of the first proposed method shows an average reduction of 58% beyond state assignment alone at a 5% delay point. The full Vt, Tox, and state assignment approach provides an additional 53% reduction in current beyond state and Vt assignment at the 5% delay point.

Table 4.10 compares results for the various cell library options: 4 and 2 trade-off points with individual stack control, and the same options with uniform stacks.

Table 4.10. Leakage current comparison between cell library options (current unit: μA), at the 5% point in the best-worst delay range

            Avg. Ileak,   Individual stack control                 Uniform stack control
            10K random    4-option    2-option    2-option,        4-option    2-option    2-option,
  Circuit   vectors                               3 cell versions                          3 cell versions
                          Ileak  X    Ileak  X    Ileak  X         Ileak  X    Ileak  X    Ileak  X
  c432         24.5         7.7 3.2     7.4 3.3     7.1 3.4          7.3 3.4     7.9 3.1     8.6 2.8
  c499         65.8        26.1 2.5    26.7 2.5    27.8 2.4         26.0 2.5    28.0 2.3    28.9 2.3
  c880         50.1         8.5 5.9     9.7 5.2     8.0 6.3         10.0 5.0    10.7 4.7    10.8 4.6
  c1355        70.8        15.8 4.5    16.2 4.4    14.1 5.0         23.4 3.0    25.2 2.8    23.9 3.0
  c1908        56.7        14.8 3.8    14.9 3.8    14.3 4.0         15.9 3.6    15.3 3.7    16.8 3.4
  c2670       104.7        12.3 8.5    12.1 8.7    12.4 8.4         16.1 6.5    15.4 6.8    16.5 6.3
  c3540       128.5        22.1 5.8    24.2 5.3    25.3 5.1         27.1 4.7    25.8 5.0    29.2 4.4
  c5315       221.2        30.0 7.4    30.9 7.2    30.7 7.2         32.1 6.9    32.9 6.7    33.8 6.6
  c6288       346.8       112.2 3.1   114.2 3.0   114.2 3.0        134.0 2.6   147.8 2.3   145.4 2.4
  c7552       270.0        32.2 8.4    31.4 8.6    30.6 8.8         31.8 8.5    31.1 8.7    31.1 8.7
  alu64       260.0        43.4 6.0    44.0 5.9    43.2 6.0         42.0 6.2    47.0 5.5    46.1 5.6
  AVG                           5.4         5.3         5.4              4.8         4.7         4.6

The main result in Table 4.10 is that there is very little leakage current penalty when moving from a full 4-option library to a simpler 2-option library. There are several cases where the smaller library outperforms the larger one, due to the heuristic nature of the algorithm (heuristic 1 is used in this table). Since the library size required in the 2-option scenario is roughly half that of the 4-option one, we conclude that the 2-option library represents a very good trade-off between library complexity and potential leakage reduction. Moreover, the simplest cell library, 2-option with a reduced number of cells, still provides good leakage reduction. In general, a reduced number of cells degrades the leakage/delay trade-off, as discussed in Section 4.4.2.
However, we find that only complex and infrequently used cells, such as 3-input NORs, require appreciable reductions in cell variants, which limits the impact on total leakage reduction. Therefore, very good leakage current minimization can be obtained even with libraries having 3 cell versions for each cell type. Also, the restriction that each stack of transistors must use the same Vt and Tox is shown in Table 4.10 to have only a minor impact on leakage. For instance, the uniform-stack 4-option case shows a 10.6% average power increase compared to the individual-stack 4-option case; this still represents a nearly 5X reduction in standby leakage compared to the average case. Note that library complexity is not reduced in moving from individual to stack-based control; such a change would be dictated by manufacturing issues as well as the trade-off between standby power (lower for individual control) and cell area (expected to be slightly lower for stack-based control). Finally, Figure 4.10 plots the leakage current results for the proposed method and the traditional methods as a function of the delay constraint for circuit c6288.

Figure 4.10. Leakage current comparison for c6288. [Plot of total leakage current (μA, 0-350) versus the delay point, from 0% (best) to 100% (worst) of the delay range, for: average current with low-Vt/thin-Tox; state assignment only with low-Vt/thin-Tox; dual-Vt and state assignment; the proposed method (heuristic 1); and state assignment only with high-Vt/thick-Tox.]

Here, a 100% delay point implies a complete replacement of low-Vt, thin-oxide devices with high-Vt, thick-oxide ones. This is clearly the lowest-leakage solution, but it is also very slow.
The key point in Figure 4.10 is that the proposed approaches (heuristic 2 results are not shown, but are nearly identical to those of heuristic 1) provide substantial improvement beyond the average leakage or the use of state assignment alone, and that these gains are achievable with very small, and even zero, delay penalties. The rapid saturation of the gains as the delay point increases beyond 10% implies that the new approach is best suited to achieving low-leakage standby states with very little performance overhead (e.g., 5% or even less). Note that the leakage current achieved by our proposed method does not converge to that of state assignment using all high-Vt and thick-oxide devices. The reason is that the selected library cells include only a limited number of thick-oxide assignments, in order to simplify the library. Many additional library cells would be needed to achieve convergence to the minimal-leakage solution; instead, the bulk of this leakage savings can be achieved with very little performance penalty.

4.6 CONCLUSIONS

In this chapter, we have proposed new approaches for standby leakage current minimization under delay constraints. Our approaches use simultaneous state assignment and Vt or Vt/Tox assignment. Efficient methods for computing the simultaneous state and Vt or Vt/Tox assignments leading to the minimum standby-mode leakage current were presented. The proposed methods were implemented and tested on a set of synthesized benchmark circuits. The new state and Vt assignment technique demonstrates 6X lower leakage than previous Vt-only assignment approaches and 5X lower than state assignment alone (at the 5% delay point). In cases where gate leakage is prominent, as in 90nm CMOS technologies, these improvements are increased by an additional factor of 2 using state and Vt/Tox assignment.
We also investigate the leakage/complexity trade-off for various cell library configurations, and demonstrate that results are still very good even when only 2 additional variants are used for each cell type.

Acknowledgement

The authors would like to thank Harmander Deogun for his work on the leakage current model. This work has been supported by NSF, SRC, GSRC/DARPA, IBM, and Intel.
Chapter 5

ENERGY-EFFICIENT SHARED MEMORY ARCHITECTURES FOR MULTI-PROCESSOR SYSTEMS-ON-CHIP

Kimish Patel¹, Alberto Macii¹ and Massimo Poncino²
¹Politecnico di Torino; ²Università di Verona

Abstract

Most current multi-processor system-on-chip (MPSoC) platforms rely on a shared-memory architectural paradigm. The shared memory, typically used for the storage of shared data, is a significant performance bottleneck, because it requires explicit synchronization of memory accesses that can potentially occur in parallel. Multi-port memories are a widely used solution to this problem: they allow these potentially parallel accesses to occur simultaneously. However, they are not very energy-efficient, since their performance improvement comes at an increased energy cost per access. We propose an energy-efficient architecture for the shared memory that can be used as an alternative to multi-port memories, combining their performance advantage with a much smaller energy cost. The proposed scheme is based on application-driven partitioning of the shared address space into a multi-bank architecture. Thanks to simple analytical models of performance and energy, this optimization can be used to quickly explore different power-performance trade-offs. Experiments on a set of standard parallel benchmarks show energy-delay product (EDP) savings of 50% on average.

Keywords: Multi-Processor Systems, Shared Memory, Systems-on-Chip.

5.1 INTRODUCTION

Modern design paradigms for MPSoCs are pushing towards architectures which are fully distributed, work as general networks based on a modular, layered architecture, and are able to support non-deterministic communications.
Such architectures, called Networks-on-Chip (NoCs) [1], have been devised as an answer to the scaling of SoC complexity, especially in terms of the increased number of hosted processing elements and the decreased reliability of the communication medium.

In spite of these scalability challenges, most current SoCs are still based on a shared-medium architecture and, consequently, on a shared-memory paradigm. One reason for this slow migration to more complex architectures is cost. Shared on-chip buses represent a convenient, low-overhead interconnect, and they do not require special handling during the physical design flow. Another reason is the limited support provided by system software for such architectures. Although current silicon technology makes it possible to build SoCs with a large number of embedded cores, the capabilities currently offered by the embedded software (e.g., in terms of OS primitives) do not allow the full exploitation of all the potential computational power; therefore, most SoC implementations consist of a few (seldom more than 16) processor cores, for which a shared interconnect is perfectly suitable.

The architecture of these MPSoC platforms is thus reminiscent of traditional multi-processor systems, where inter-processor communication and/or synchronization is provided through the exchange of data through shared memories of different types. Generally speaking, accesses to the shared memories are significantly slower than accesses to local ones. First, shared memories are placed farther away from the processors than private memories; in fact, the latter are often tightly coupled to the cores by means of dedicated local buses, while shared memories are necessarily connected to a shared bus. Moreover, accesses to the shared buses by the processors require some form of arbitration, which may force the insertion of wait cycles in case of simultaneous accesses.
As a consequence, the shared memories tend to become a major bottleneck for the bandwidth of the overall system, especially for applications in which parallelism is built around shared data.

Caching of shared data might be a solution, but it raises the well-known issue of cache coherence, i.e., the possible inconsistency between data stored in the caches of different processors. Cache coherence can be solved in hardware, yet with an extra overhead that may not be affordable in small-scale, low-cost SoCs such as those considered in this work. Software-based cache coherence is also a viable solution, but it essentially consists of limiting the caching of shared data to safe times [2]. For applications in which parallelism is built around shared data, this basically amounts to avoiding the caching of shared data. In this chapter, this will be our assumption: all accesses to shared data will always imply an access to the shared memory.

Providing sufficient memory bandwidth to sustain fast program execution and data communication/transfers is mandatory for most embedded applications. Increased memory bandwidth can be achieved by making use of different types of on-chip embedded memories, which provide shorter latencies and wider interfaces [3-5]. One typical solution used to match the computational bandwidth with that of memory is to use multi-port memories. This solution increases the sustainable bandwidth by construction, since a P-port memory allows up to P accesses in parallel (i.e., in a single memory cycle). Therefore, by properly choosing the number of memory ports versus the number of processors, the issue of synchronizing simultaneous accesses can easily be solved.

The adoption of multi-port memories, however, comes at the price of a significant increase in area, wiring resources, and energy consumption.
On the other hand, architectures based on multi-port memories seem to be the only viable option in cases where bandwidth optimization has absolute priority.

In this work we propose an alternative architecture for the shared memory which combines the advantages, in terms of bandwidth, of the multi-port approach with the advantages, in terms of energy consumption and access time, of partitioned memories [5]. We propose the use of small, single-port memory blocks as a way to achieve a memory bandwidth increase together with a low energy demand. In our scheme, the memory address space is mapped over single-port banks that can be simultaneously accessed by different processors, so as to mimic, for a large fraction of the execution time, the behavior of a dual-port memory. Energy efficiency is enforced by two facts: first, the single-port blocks have an energy access cost smaller than that of monolithic (either single- or dual-port) memories; second, address mapping is application-driven, and cell access frequency data is thus used to determine the optimal sizes of the memory blocks.

Based on analytical expressions for performance and energy consumption that allow us to explore the energy-performance trade-off, we present experimental results showing that the new architecture guarantees energy savings as high as 69% with respect to a dual-port memory configuration (54% with respect to the baseline, single-ported architecture), with a comparable improvement of the memory bandwidth.

The rest of the chapter is organized as follows. Section 5.2 provides some background material on memory energy modeling, multi-port memories, and application-driven memory partitioning. Section 5.3 describes how partitioned memories can be used to achieve an energy-efficient shared memory architecture. Section 5.4 illustrates the analytical models used to drive the energy-performance exploration engine, which is discussed in Section 5.5.
Section 5.6 presents the optimization results for a set of standard parallel applications. Finally, some concluding remarks are provided in Section 5.7.

5.2 BACKGROUND

5.2.1 Modeling Memory Energy

Unlike that of generic hardware modules, the energy consumption of memories is basically independent of the input activity. What matters, in fact, is whether we are reading or writing a value from or to the memory, regardless of the value. This property allows us to model memory energy consumption in a very abstract way, by explicitly exposing the two independent variables affecting it: the cost of an access and the total number of accesses. This translates into the following formula:

    e_tot = \sum_{i=1}^{c_tot} e_i     (1)

where c_tot is the total number of memory accesses and e_i is the cost of each access. For the sake of simplicity, we weigh all accesses equally (i.e., we do not distinguish the cost of a read from that of a write). Equation 1 exposes the two quantities we can act on to reduce the energy consumption of a memory system, and it will be used throughout the chapter as a reference. Techniques for reducing memory energy can thus be classified according to which variable is optimized [6].

5.2.2 Multi-Port Memories

A multi-port memory is simply a memory that allows multiple simultaneous accesses, for reads and writes, to any location in memory. Multi-port memories are typically employed as shared memories in multiprocessor designs, and are especially popular as dual-ended FIFO buffers for bus interfacing, or for video/graphics buffering. Multiple simultaneous accesses are made possible by duplicating some of the resources required to access a cell: the address and data pins, the word-lines, and the bit-lines. Figure 5.1 shows the structure of a typical dual-port SRAM cell, and in particular the extra word-line (with the corresponding transistors) and the extra bit-line.

Figure 5.1. Structure of a Dual-Port SRAM Cell.
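Equation 1 can be sketched directly. The snippet below is a minimal illustration of the model, with made-up per-access costs (the chapter does not give concrete numbers here):

```python
# Minimal sketch of the energy model of Equation 1: total memory energy is
# just the sum of the per-access costs e_i, independent of the data values
# read or written. Costs are made-up illustrative units.

def memory_energy(access_costs):
    """e_tot = sum of e_i over all c_tot accesses (reads and writes weighted equally)."""
    return sum(access_costs)

# 1,000 accesses to a memory costing 0.5 units each:
energy = memory_energy([0.5] * 1000)
print(energy)  # -> 500.0
```

The two optimization levers named in the text map directly onto this model: shrinking the list (fewer effective memory cycles) or shrinking the individual costs (smaller banks).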
In some devices, additional overhead is also required to handle the synchronization of multiple writes to the same cell; this is managed through a sort of hardware semaphore which serializes the concurrent accesses.

The increase in bandwidth provided by multi-port memories comes at the price of increased area, wiring resources, and power consumption. Because of this considerable overhead, multi-port memories are usually limited to a few ports (often 2, and seldom more than 4). One noticeable exception is represented by register files (although they are not strictly SRAMs), which are typically highly multi-ported (with even 16 or more ports) to provide very high bandwidth in superscalar processors.

Multi-port memories can also be characterized by the flexibility of their ports. In some memory devices, some of the ports can be specialized, i.e., they allow only one type of access (read or write). This fact can be expressed by writing the number of ports as P = p_r + p_w + p_rw, where the three terms denote the number of read, write, and read/write ports, respectively. In this work, without loss of generality, we will assume that p_r = p_w = 0 and p_rw = P, that is, all ports can be used for any type of access at any time.

When analyzing multi-port memories from the energy point of view, we must take into account the two following non-idealities, supported by data from several multi-port memory providers ([7], [8], [9]).

a) The energy consumption of multi-port memories does not scale linearly with the number of ports. For instance, the energy cost for accessing a dual-port memory is more than twice the energy required for accessing a single-port memory of the same size.

b) When a multi-port memory is used as a shared memory in a multiprocessor system, there are cases in which not all the ports are used simultaneously.
It may in fact happen that the access pattern of the application does not allow a set of accesses (from the processors) to be grouped into a single, multi-port access. In these cases, we must consider the fact that energy consumption does not scale linearly with the number of ports that are accessed simultaneously. For instance, the energy cost for accessing a single port of a dual-port memory is larger than that for accessing a single-port memory of the same size.

With reference to the model of Equation 1, the use of multi-port memories reduces c_tot, but it implies a sizable increase of the access cost e_i.

5.2.3 Application-Driven Memory Partitioning

Partitioning a memory block into multiple blocks, based on the memory access profile, was originally proposed by Benini et al. [10]. Their technique exploits the fact that, due to the high locality exhibited by embedded applications, the distribution of memory references is not uniform. As a consequence, some memory locations will be accessed more frequently than others. The partitioning is realized by splitting the address space (stored in a single, monolithic memory block) into non-overlapping contiguous sub-spaces (stored in several, smaller memory blocks).
Memory partitioning specifically targets the reduction of the access cost ei, and it does not change ctot, since it does not modify the access patterns. 5.3 PARTITIONED SHARED MEMORY ARCHITECTURE The target MPSoC architecture considered is this work is depicted in Fig- ure 5.2. Each processor core has a cache and a private memory (PM) containing private data and code, which is accessed through a local bus. Processors are also connected to another memory (SM), through a common global bus con- taining the data that are shared between the various threads executing on the processors. We do not consider here other types of interconnections, such as point-to-point ones (i.e., crossbars). gure 5.2. Generic Architectural Template. In this work, starting from the assumption that the shared memory is imple- mented as a conventional on-chip, single-port memory, we aim at improving the performance of the accesses to the shared memory, yet in a more energy- efficient way than resorting to a multi-port memory. The proposed shared memory architecture combines the bandwidth advan- tages of multi-port memories (and thus the reduction of ctot) with the advan- TLFeBOOK 90 tages, in terms of energy consumption and access time, of partitioned, single- port memories (and thus the reduction of ei). In our scheme, the memory address space is mapped over single-port banks that can be simultaneously accessed by the different processors, so as to mimic the behavior of a multi-port memory for a large fraction of the execution time. Each bank covers a subset of the address space, with no replication of memory words; therefore, the address sub-spaces are non-overlapping. The latter issue is essential to understand why the partitioned scheme can only approach the performance of the multi-port architecture. 
Since the memory blocks are single-ported and contain non-overlapping subsets of addresses, simultaneous accesses from the processors can be parallelized only if they fall into different memory blocks. Otherwise, the potentially parallel accesses must take place in two consecutive memory cycles.

Energy efficiency is enforced by two facts: first, the single-port blocks have an energy access cost that is by far smaller than that of monolithic (either single- or dual-port) memories; second, address mapping is application-driven, and it thus accounts for the cell access frequency in determining the size of the memory blocks that is most suitable for energy minimization.

In the following, we will restrict our analysis to systems with two processors. Consequently, we will consider dual-port memories, and the partitioned architecture will also consist of two blocks at most. Although the concepts that will be discussed apply in principle to an arbitrary number of processors (with multi-port memories and multi-bank architectures), the quantitative analysis of energy and performance strictly refers to the case of two processors (with a dual-port memory and two memory blocks).

Figure 5.3. Dual-Port (a) and Partitioned Single-Port (b) Architectures. [In (a), a dual-port memory (DPM) exposes Port1 and Port2, bound to processors P1 and P2; in (b), two single-port blocks (SPM1, SPM2) are reached through address/data multiplexing.]

Figure 5.3 shows a conceptual view of the dual-port and partitioned single-port schemes. Label Ai refers to addresses from processor i, while Di refers to data to/from processor i. In the dual-port scheme (Figure 5.3-(a)), the existence of two read/write ports allows each processor to be bound to one port, realizing in fact a point-to-point interconnection. In the partitioned architecture (Figure 5.3-(b)), addresses and data must be properly multiplexed (from processor to memory) or de-multiplexed (from memory to processor) to connect each processor to the required memory block.
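The routing and serialization rule can be mimicked in a few lines. This is a hedged sketch only: the bank boundary (SPLIT) and the example addresses are invented for illustration and are not taken from the chapter.

```python
# Sketch of the address decoding implied by the partitioned scheme of
# Figure 5.3(b): each shared address maps to exactly one single-port bank,
# and two simultaneous requests proceed in parallel only when they target
# different banks. SPLIT is a hypothetical bank boundary.

SPLIT = 0x4000  # hypothetical boundary: addresses below go to bank 1

def bank_of(addr):
    return 1 if addr < SPLIT else 2

def cycles_for(addr_p1, addr_p2):
    """Memory cycles needed to serve two simultaneous requests."""
    # Different banks: both accesses complete in one memory cycle.
    # Same bank: the single-port block forces serialization (two cycles).
    return 1 if bank_of(addr_p1) != bank_of(addr_p2) else 2

print(cycles_for(0x1000, 0x5000))  # different banks -> 1
print(cycles_for(0x1000, 0x2000))  # same bank -> 2
```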
The block diagram of Figure 5.3 just shows the high-level flow of data and addresses; the actual implementation of the decoder is more complex, and will be discussed in the experimental section.

5.3.1 Related Work

The literature on the energy optimization of embedded memories is quite rich (see [6] for a comprehensive survey); however, most techniques deal with the optimization of caches, scratch-pad memories, or off-chip memories, and multi-port memories are seldom addressed.

Most energy optimizations for multi-port memories are concerned with the mapping of data structures (typically, arrays) to multi-port memories, based on the access profiles of the applications. From these profiles, these techniques evaluate simultaneous array accesses (e.g., whether two or more arrays are accessed in the same cycle) and build a so-called compatibility graph, which expresses the potential parallelization of accesses. The various approaches then differ in how this graph is used to decide the optimal allocation of array accesses to memory ports [3, 11-13]. One technique closer to the one proposed in this work has been discussed by Lewis and Brackenbury [14]. Their approach is based on the typical access patterns of DSP applications, and splits highly multi-ported register files into multiple banks of predefined sizes.

5.4 PERFORMANCE AND ENERGY CHARACTERIZATION

In this section we derive analytical expressions for the number of memory accesses and for the total energy consumption of the architectures of Figure 5.3, for the case of a system consisting of two processors (hereafter denoted P1 and P2).

5.4.1 Performance Characterization

Let c1 and c2 be the number of memory accesses required by the execution of the application on processors P1 and P2, respectively. In the following, we will use the term memory cycle instead of memory access; we adopt this terminology in order to distinguish accesses to the shared memory that can occur in parallel.
In fact, the total number of memory accesses by a processor is fixed (determined by the memory access pattern of the application, which we do not modify); what actually changes is the time (in cycles) required to serve these accesses. Furthermore, we will denote sets with bold symbols and their cardinalities with lowercase ones.

Our reference performance figure is the total number of memory cycles for the case where the shared memory is implemented as a monolithic single-port memory. This value is c_spm = c1 + c2.

5.4.1.1 Dual-Port Memory. When the shared memory is implemented by a monolithic dual-port memory, the total number of memory cycles will be smaller than c_spm because of the possibility of simultaneous accesses. Only a fraction of the accesses, however, will occur simultaneously. As Figure 5.4 shows, this fraction can be represented in set notation. We denote with C_par the set of memory cycles that can access memory simultaneously; C_par consists of the union of two subsets, C_par = C_par,1 ∪ C_par,2, where C_par,1 ⊆ C1 and C_par,2 ⊆ C2. These two subsets have the same cardinality (i.e., c_par,1 ≡ c_par,2), because each element of one set matches one of the other set to make a parallel access.

Figure 5.4. Classification of Execution Cycles.

The number of cycles for the dual-port configuration is therefore:

    c_dpm = (c1 − c_par,1) + (c2 − c_par,2) + c_par/2     (2)

where c_par = c_par,1 + c_par,2 denotes the total number of parallel cycles. The division by two in the last term reflects the fact that parallel cycles are actually grouped in pairs, with each pair corresponding to a single memory access. Equation 2 simplifies to c_dpm = c1 + c2 − c_par/2, exposing the fact that the magnitude of c_par directly translates into a performance improvement.

5.4.1.2 Partitioned Memory. In the case of the partitioned memory, the two memory banks now host two non-overlapping subsets of the address space.
This implies that only a subset of the cycles in Cpar can be parallelized; in particular, accesses that fall in the same subset of addresses now need to be serialized, since the two memory blocks are single-ported. This further sub-setting of the cycles is depicted in Figure 5.5, using the same set notation as above. We can notice that C1 and C2 are now both split into two subsets, where Ci,j denotes the cycles of processor i that fall into block j.

Figure 5.5. Classification of Execution Cycles for the Partitioned Architecture.

This induces a partition of Cpar, as follows. The shaded areas labeled A and D in Figure 5.5 denote parallel accesses that fall into different memory blocks: in region A (D), P1 accesses Block 1 (Block 2), and P2 accesses Block 2 (Block 1). Conversely, the regions labeled B and C denote accesses that fall into the same memory block (Block 2 for region B, and Block 1 for region C). Cycles belonging to regions B and C cause a performance penalty because, although they could potentially occur in parallel, they must be serialized (and thus require two memory cycles). These subsets can be characterized by a quantity λ, denoting the fraction of the cycles in Cpar that fall in distinct memory blocks (and can thus be made parallel). λ will be used in the following as a compact metric to evaluate the cost of the partition. In fact, λ depends on how the partition has been made, that is, on how many addresses fall in each block. Therefore, Cpar consists of λcpar cycles that can be parallelized, and (1 − λ)cpar cycles that require two separate accesses. The number of cycles of the partitioned-memory architecture cspm,part is therefore:

cspm,part = (c1 − cpar,1) + (c2 − cpar,2) + λcpar/2 + (1 − λ)cpar     (3)

The formula simplifies to cspm,part = c1 + c2 − λcpar/2, exposing the fact that cspm,part ≥ cdpm, since λ ≤ 1.
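To make the two cycle counts concrete, here is a minimal Python sketch of Equations 2 and 3. The function and parameter names are ours, not from the text, and we use the fact, noted above, that cpar,1 = cpar,2 = cpar/2.

```python
def single_port_cycles(c1, c2):
    """Baseline c_spm: a monolithic single-port memory serializes all accesses."""
    return c1 + c2

def dual_port_cycles(c1, c2, c_par):
    """Equation 2: each pair of parallel cycles costs a single memory cycle."""
    half = c_par / 2                    # c_par,1 = c_par,2 = c_par / 2
    return (c1 - half) + (c2 - half) + c_par / 2

def partitioned_cycles(c1, c2, c_par, lam):
    """Equation 3: lam (λ) is the fraction of C_par hitting distinct blocks."""
    half = c_par / 2
    return ((c1 - half) + (c2 - half)   # inherently serial cycles
            + lam * c_par / 2           # still-parallel pairs
            + (1 - lam) * c_par)        # conflicting pairs, serialized
```

At λ = 1 the partitioned count collapses to the dual-port one, and at λ = 0 to the single-port baseline, matching the simplified forms c1 + c2 − cpar/2 and c1 + c2 − λcpar/2 derived above.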
Analyzing the dependency of cspm,part on λ, we notice that cspm,part (and thus the performance penalty of the partitioned scheme) is minimized when λ is maximized, as expected. In particular, when λ = 1, all accesses in Cpar are parallelized, and the partitioned scheme is equivalent to the dual-port memory, performance-wise. When λ = 0, all accesses in Cpar overlap on the same memory block, and the partitioned scheme is equivalent to the single-port memory architecture.

5.4.2 Energy Characterization

To compute energy, we stick to the high-level model of Equation 1; energy is thus simply obtained by multiplying each access by its cost.

5.4.2.1 Dual-Port Memory. In this case we have to consider two types of access cost, depending on whether one or both ports are accessed. Total energy is thus obtained by properly weighing the terms of Equation 2:

edpm = (c1 − cpar,1) · edpm,1 + (c2 − cpar,2) · edpm,1 + cpar/2 · edpm,2     (4)

The term edpm,x denotes the energy per access to the memory, where the subscript x = {1, 2} denotes the number of ports used in the access.

5.4.2.2 Partitioned Memory. In the case of the partitioned memory, total energy cannot be conveniently expressed by a closed formula, for two reasons. First, the energy per access depends on the size of the memory block that is accessed; the sizes of the blocks, however, are precisely the variables of the partitioning problem we are trying to solve. Second, we have two single-port memories, and each memory access from either processor will fall into one of the two memory blocks. This implies that the energy per access can only be approximated by an "average" cost (i.e., the number of accesses to Block 1 weighted by its energy cost, plus the number of accesses to Block 2 weighted by its energy cost).
The accurate evaluation of energy for the partitioned architecture thus requires a simulation of the dynamic address trace of the two processors, and the application of Equation 1 on an access-by-access basis. Nevertheless, we can derive an approximate expression of total energy that can be used for a rough comparison with Equation 4:

espm,part = (c1 − cpar,1) · espm + (c2 − cpar,2) · espm + (1 − λ)cpar · espm + λcpar/2 · (espm1 + espm2)     (5)

The first two terms use the above-mentioned average access cost espm and represent the non-parallel memory accesses. The third term, also weighted by espm (the cost of accessing either Block 1 or Block 2, depending on the subset of addresses), accounts for accesses that are potentially parallel but must be serialized. The last term represents the subset of potentially parallel accesses that access Block 1 and Block 2 simultaneously (at cost espm1 + espm2). Although approximate, Equation 5 allows a rough comparison with the dual-port scheme. First, all energy costs in Equation 5 are smaller than edpm,2, and, in most cases (when the two blocks are of comparable size), also smaller than edpm,1. This implies that all terms of Equation 5 are smaller than the corresponding ones in Equation 4, and energy is potentially smaller than in the dual-port memory case, regardless of the value of λ. The actual dependency of espm,part on λ is not easily observable from Equation 5. A large value of λ increases the probability of accessing both blocks in the same cycle (this corresponds to the largest term, espm1 + espm2). Therefore, energy should in principle be reduced by choosing partitions which minimize λ: in this case, in fact, only one of the two blocks (each one smaller than the monolithic memory) will be accessed in each cycle, thus using less energy. A small value of λ, however, tends to increase the number of cycles, as already observed.
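The two energy models (Equations 4 and 5) can be sketched in the same style. All cost parameters are assumed to come from a memory characterization such as the one described in the experimental section; the names are ours, and the partitioned expression is, as discussed above, only an approximation of a trace-driven evaluation.

```python
def dual_port_energy(c1, c2, c_par, e_dpm1, e_dpm2):
    """Equation 4: e_dpm1/e_dpm2 = energy per access using one/both ports."""
    half = c_par / 2                    # c_par,1 = c_par,2 = c_par / 2
    return (c1 - half) * e_dpm1 + (c2 - half) * e_dpm1 + half * e_dpm2

def partitioned_energy(c1, c2, c_par, lam, e_avg, e_blk1, e_blk2):
    """Equation 5 (approximate).

    e_avg:  'average' single-block access cost, weighted by how often
            each block is hit (the approximation discussed above).
    e_blk1, e_blk2: per-access costs of Block 1 and Block 2.
    """
    half = c_par / 2
    non_parallel = (c1 - half) * e_avg + (c2 - half) * e_avg
    serialized = (1 - lam) * c_par * e_avg      # same-block pairs, serialized
    parallel = lam * half * (e_blk1 + e_blk2)   # both blocks active together
    return non_parallel + serialized + parallel
```

Since e_avg and the per-block costs refer to blocks smaller than the monolithic memory, each term here is bounded by its counterpart in the dual-port expression, which is the basis of the comparison made in the text.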
5.5 EXPLORATION FRAMEWORK

The models described in Section 5.4 show that there exists a tradeoff between energy and performance in partitioning the shared memory. Although we are searching for energy-efficient memory architectures, we cannot ignore performance implications; therefore, in order to search for the best energy/performance tradeoff, we use the energy/delay product (EDP) as a metric, and choose to minimize EDP during the design space exploration.

Thanks to the simple models of Section 5.4, the optimization space is relatively small, since λ is the only parameter of the models. λ is a function of the access pattern of the application, but it also depends on how the address space is partitioned. Partitions can be characterized by the boundary address B that splits the address space [0, ..., N − 1] into two sub-spaces [0, ..., B − 1] and [B, ..., N − 1]. Therefore, λ is also a function of B. As an example, Figure 5.6 shows the behavior of λ versus B for a parallel FFT kernel; we can observe that the curve is not monotonic, showing the sensitivity of λ to the access pattern.

Figure 5.6. Behavior of λ(B) vs. B.

These observations lead us to the following exploration procedure, for a shared memory of N words:

1. Compute epm(λ) and cpm(λ) as in Section 5.4;

2. For all possible values of B = 0, ..., N − 1, compute EDPpm(λ) as epm · cpm. EDPpm(λ(B)) is not a function, since there may be multiple values of B (and thus of EDP) for a given value of λ. An example of such a curve is shown in Figure 5.7, for the parallel FFT benchmark.

3. Compute the function EDPpm^pareto(λ), obtained by selecting, for each value of λ, the smallest value of EDPpm(λ). EDPpm^pareto contains the Pareto points of EDPpm(λ), and can possibly contain some discontinuities. Figure 5.8 shows the resulting curve for the FFT benchmark.

Figure 5.7. Behavior of EDP(λ(B)) vs. λ.

Figure 5.8. Pareto Points of EDP(λ(B)).
4. Compute the minimum EDPmin of this function, and let λmin be the corresponding value of λ;

5. On the λ vs. B plot, identify the corresponding value Bmin of B. In case of multiple values of B, choose the one that makes the partitions as equal in size as possible.

5.6 EXPERIMENTAL RESULTS

5.6.1 Experimental Setup

We have implemented our partitioned memory scheme in ABSS [15]. ABSS is an execution-driven architectural simulator for multiprocessor systems developed at Stanford University, which extends the ideas implemented in the AUGMINT simulator. ABSS is based on the idea of augmentation, that is, the instrumentation of the assembly code with various hooks that allow context switches to the simulator; augmentation translates the program into a functionally equivalent program that runs on the simulated version of the processor. The memory architecture provided by ABSS includes both private and shared memory. All the memories are connected through a single shared bus. However, ABSS does not provide any specific predefined cache or shared-bus model; rather, it defines a specific interface to which user-defined cache and bus models can easily be hooked. We have integrated Dinero [16] into ABSS in order to provide accurate cache simulation data, and we have derived performance and energy models for the shared memory (both single- and dual-port) by interpolation of the results obtained from an industrial memory generator by STMicroelectronics. The target technology for all the models is 0.18µm. Concerning the benchmarks, we have used Stanford's SPLASH suite [17], which includes a set of kernels and parallel applications widely used in the parallel computing community.

5.6.2 Energy/Performance Tradeoff Analysis

Table 5.1 shows energy-delay product (EDP) results for the above benchmarks, for the monolithic, single-port architecture (EDPmm) and the partitioned one (EDPpm), obtained using the exploration procedure of Section 5.5.
The EDP reduction (Column ∆) ranges from 40.5% to 62.3% (50.2% on average).

The exploration procedure also allows us to compute the best performance and energy points; these are summarized in Table 5.2, where performance improvements (number of cycles) and energy savings with respect to the monolithic, single-port architecture are reported (Columns Best Performance and Best Energy). The comparison of Tables 5.1 and 5.2 shows that the savings in EDP are due to energy savings more than to performance improvements. Minimum EDP points are in fact very close to minimum energy points for most of the benchmarks, while performance improvements are less significant. Notice also that only benchmarks that exhibit a sizable amount of parallel cycles (e.g., FFT, LU-Cont, Radix) result in a sizable performance improvement. Conversely, energy does not seem to be that sensitive to the amount of parallel cycles.

Table 5.1. Energy-Delay Product Results.

Application    EDPmm       EDPpm       ∆ [%]
Barnes         24987.8     11357.5     54.6
FFT            6.4         3.7         41.2
FMM            853.4       389.6       54.4
LU             3931.3      2339.2      40.5
LU-CONT        3734.4      2073.1      44.5
Radix          59512.5     23180.5     61.0
Volrend        869794.2    453283.8    47.9
Water-N2       150460.7    56710.0     62.3
Water-S        10581.2     5770.5      45.5
Average                                50.2

Table 5.2. Optimal Performance and Energy Points.

Application    Best Performance [%]    Best Energy [%]
Barnes         1.5                     54.4
FFT            34.0                    37.2
FMM            2.1                     54.4
LU             10.9                    40.3
LU-CONT        19.8                    40.5
Radix          25.4                    60.9
Volrend        0.3                     50.3
Water-N2       13.9                    62.3
Water-S        8.7                     45.9
Average        13.0                    49.6

Figure 5.9 shows the energy savings of the partitioned architecture with respect to the dual-port case. Numbers refer to best-performance points, since we want to reduce the performance penalty as much as possible. The savings do not include the cost of the decoding logic. The partitioned architecture results in an average energy saving of 56% (maximum 70%). This energy saving is achieved at an increase of the total number of memory cycles of 2.4% on average (10.1% maximum).
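As a recap of how these design points were obtained, the boundary sweep of Section 5.5 can be sketched as follows. The trace representation (a list of address pairs, one per cycle in Cpar) and the function names are our own illustration, not the actual ABSS-based implementation; edp_of_lambda stands for an application-specific EDP model built from the cycle and energy expressions of Section 5.4.

```python
def lambda_of_B(parallel_pairs, B):
    """λ(B): fraction of the potentially-parallel address pairs whose two
    addresses fall on opposite sides of boundary B (distinct blocks)."""
    if not parallel_pairs:
        return 1.0
    distinct = sum((a1 < B) != (a2 < B) for a1, a2 in parallel_pairs)
    return distinct / len(parallel_pairs)

def best_boundary(parallel_pairs, n_words, edp_of_lambda):
    """Sweep B = 0..N-1, evaluate EDP(λ(B)), and return the minimizing
    boundary; ties are broken toward the most balanced partition
    (step 5 of the exploration procedure)."""
    return min(range(n_words),
               key=lambda B: (edp_of_lambda(lambda_of_B(parallel_pairs, B)),
                              abs(B - n_words // 2)))
```

Because λ(B) is not monotonic in B (Figure 5.6), the sweep must visit every boundary rather than bisect; with N words and one λ evaluation per boundary, the exploration stays linear in the memory size.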
5.6.3 Decoder Implementation

The partitioned architecture requires an ad-hoc decoder which implements the conceptual scheme of Figure 5.3. The decoder must provide two main functionalities. First, it must drive the selectors that decide to which block a given memory access is directed; to do this, it must contain the information about the boundary of the partition of the address space. Second, and more important,