Quick viewing(Text Mode)

ULTRA LOW-POWER ELECTRONICS and DESIGN Tlfebook

TLFeBOOK TLFeBOOK

ULTRA LOW-POWER ELECTRONICS AND DESIGN TLFeBOOK

This page intentionally left blank TLFeBOOK Ultra Low-Power Electronics and Design

Edited by Enrico Macii Politecnico di Torino, Italy

KLUWER ACADEMIC PUBLISHERS NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW TLFeBOOK

eBook ISBN: 1-4020-8076-X Print ISBN: 1-4020-8075-1

©2004 Springer Science + Business Media, Inc.

Print ©2004 Kluwer Academic Publishers Dordrecht

All rights reserved

No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher

Created in the United States of America

Visit Springer's eBookstore at: http://www.ebooks.kluweronline.com and the Springer Global Website Online at: http://www.springeronline.com TLFeBOOK

Contents

CONTRIBUTORS…………………………………………………………………….VII PREFACE…………………………………………………………….………………...IX INTRODUCTION……………………………………………………………………XIII

1. ULTRA-LOW-POWER DESIGN: DEVICE AND LOGIC DESIGN APPROACHES……………………………………….………………………………….1

2. ON-CHIP OPTICAL INTERCONNECT FOR LOW-POWER……………………21

3. NANOTECHNOLOGIES FOR LOW POWER……………….…………………….40

4. STATIC LEAKAGE REDUCTION THROUGH SIMULTANEOUS

Vt/Tox AND STATE ASSIGNMENT………………………………………………….56

5. ENERGY-EFFICENT SHARED MEMORY ARCHITECTURES FOR MULTI- SYSTEMS-ON-CHIP…………………………………...…..84

6. TUNING CACHES TO APPLICATIONS FOR LOW-ENERGY EMBEDDED SYSTEMS……………………………………………………………………………..103

7. REDUCING ENERGY CONSUMPTION IN CHIP MULTIPROCESSORS USING WORKLOAD VARIATIONS……………………………………………....123

8. ARCHITECTURES AND DESIGN TECHNIQUES FOR ENERGY EFFICIENT EMBEDDED DSP AND MULTIMEDIA PROCESSING……….….141

9. SOURCE-LEVEL MODELS FOR SOFTWARE POWER OPTIMIZATION…..156

10. TRANSMITTANCE SCALING FOR REDUCING POWER DISSIPATION OF A BACKLIT TFT-LCD…………………………………………………………..172 TLFeBOOK vi

11. POWER-AWARE NETWORK SWAPPING FOR WIRELESS PALMTOP PCS…………………………………………………………………………………… 198

12. ENERGY EFFICIENT NETWORK-ON-CHIP DESIGN…………………………214

13. SYSTEM LEVEL POWER MODELING AND SIMULATION OF HIGH-END INDUSTRIAL NETWORK-ON-CHIP……………………………….233

14. ENERGY AWARE ADAPTATIONS FOR END-TO-END VIDEO STREAMING TO MOBILE HANDHELD DEVICES…………………………….255 TLFeBOOK

vii Contributors

A. Acquaviva Università di Urbino L. Benini Università di Bologna D. Bertozzi Università di Bologna D. Blaauw University of Michigan, Ann Arbor A. Bogliolo Università di Urbino A. Bona STMicroelectronics C. Brandolese Politecnico di Milano W.C. Cheng University of Southern G. De Micheli N. Dutt , Irvine W. Fornaciari Politecnico di Milano F. Gaffiot Ecole Centrale de Lyon J. Gautier CEA-DRT–LETI/D2NT–CEA/GRE A. Gordon-Ross University of California, Riverside R. Gupta University of California, San Diego C. Heer Infineon Technologies AG M. J. Irwin Pennsylvania State University I. Kadayif Canakkale Onsekiz Mart University M. Kandemir Pennsylvania State University B. Kienhuis Leiden I. Kolcu UMIST E. Lattanzi Università di Urbino D. Lee University of Michigan, Ann Arbor A. Macii Politecnico di Torino S. Mohapatra University of California, Irvine I. O’Connor Ecole Centrale de Lyon K. Patel Politecnico di Torino M. Pedram University of Southern California C. Pereira University of California, San Diego C. Piguet CSEM M. Poncino Università di Verona F. Salice Politecnico di Milano P. Schaumont University of California, Los Angeles U. Schlichtmann Technische Universität München D. Sylvester University of Michigan, Ann Arbor TLFeBOOK viii F. Vahid University of California, Riverside and University of California, Irvine N. Venkatasubramanian University of California, Irvine I. Verbauwhede University of California, Los Angeles and K.U.Leuven N. Vijaykrishnan Pennsylvania State University V. Zaccaria STMicroelectronics R. Zafalon STMicroelectronics B. Zhai University of Michigan, Ann Arbor C. Zhang University of California, Riverside TLFeBOOK

ix Preface

Today we are beginning to have to face up to the consequences of the stunning success of Moore’s Law, that astute observation by ’s Gordon Moore which predicts that transistor densities will double every 12 to 18 months. This observation has now held true for the last 25 years or more, and there are many indications that it will continue to hold true for many years to come. This book appears at a time when the first examples of complex circuits in 65nm CMOS technology are beginning to appear, and these products already must take advantage of many of the techniques to be discussed and developed in this book. So why then should our increasing success at , as evidenced by the success of Moore’s Law, be creating so many new difficulties in power management in circuit designs?

The principal source and the physical origin of the problem lies in the differential scaling rates of the many factors that contribute to power dissipation in an IC – transistor speed/density product goes up faster than the energy per transition comes down, so the power dissipation per unit area increases in a general sense as the technology evolves.

Secondly, the “natural” transistor switching speed increase from one generation to the next is becoming downgraded due to the greater parasitic losses in the wiring of the devices. The technologists are offsetting this problem to some extent by introducing lower permittivity dielectrics (“low- k”) and lower resistivity conductors (copper) – but nonetheless to get the needed circuit performance, higher speed devices using techniques such as silicon-on-insulator (SOI) substrates, enhanced carrier mobility (“strained silicon”) and higher field (“overdrive”) operation are driving power densities ever upwards. In many cases, these new device architectures are increasingly leaky, so static power dissipation becomes a major headache in power management, especially for portable applications. TLFeBOOK x A third factor is system or application driven – having all this integration capability available encourages us to combine many different functional blocks into one system IC. This means that in many cases, a large part of the chip’s required functionality will come from software executing on and between multiple on-chip execution units; how the optimum partitioning between hardware architecture and software implementation is obtained is a vast subject, but clearly some implementations will be more energy efficient than others. Given that, in many of today’s designs, more than 50% of the total development effort is on the software that runs on the chip, getting this partitioning right in terms of power dissipation can be critical to the success of (or instrumental in the failure of!) the product.

A final motivation comes from the practical and environmental consequences of how we design our chips – state-of-the-art high performance circuits are dissipating up to 100W per square centimeter – we only need 500 square meters of such silicon to soak up the output of a small nuclear power station. A related argument, based on battery lifetime, shows that the “converged” mobile phone application combining telephony, data transmission, multimedia and PDA functions that will appear shortly is demanding power at the limit of lithium-ion or even methanol-water fuel cell battery technology. We have to solve the power issue by a combination of design and process technology innovations; examples of current approaches to power management include multiple transistor thresholds, triple gate oxide, dynamic supply voltage adjustment and memory architectures.

Multiple transistor thresholds is a technique, practiced for several years now, that allows the designer to use high performance (low Vt) devices where he needs the speed, and low leakage (high Vt) devices elsewhere. This benefits both static power consumption (through less sub-threshold leakage) and dynamic power consumption (through lower overall switching currents). High threshold devices can also be used to gate the supplies to different parts of the circuit, allowing blocks to be put to sleep until needed.

Similar to the previous technique, triple gate oxide (TGO) allows circuit partitioning between those parts that need performance and other areas of the circuit that don’t. It has the additional benefit of acting on both sub-threshold leakage and gate leakage. The third oxide is used for I/O and possibly mixed-signal. It is expected over the next few years that the process technologists will eventually replace the traditional silicon dioxide gate dielectric of the CMOS devices by new materials such as rare earth oxides with much higher dielectric constants that will allow the gate leakage problem to be completely suppressed. TLFeBOOK

xi Dynamic supply voltage adjustment allows the supply voltage to different blocks of the circuit to be adjusted dynamically in response to the immediate performance needs for the block – this very sophisticated technique will take some time to mature.

Finally, many, if not most, advanced devices use very large amounts of memory for which the contents may have to be maintained during standby; this consumes a substantial amount of power, either through refreshing dynamic RAM or through the array leakage for static RAM. Traditional non- volatile memories have writing times that are orders of magnitude too slow to allow them to substitute these on-chip memories. New developments, such as MRAM, offer the possibility of SRAM-like performance coupled with unlimited endurance and data retention, making them potential candidates to replace the traditional on-chip memories and remove this component of standby power consumption.

Most of the approaches to power management described briefly above will be employed in 65nm circuits, but there are a lot more good ideas waiting to be applied to the problem, many of which you will find clearly and concisely explained in this book.

Mike Thompson, Philippe Magarshack

STMicroelectronics, Central R&D Crolles, France TLFeBOOK

This page intentionally left blank TLFeBOOK

xiii Introduction

ULTRA LOW-POWER ELECTRONICS AND DESIGN

Enrico Macii Politecnico di Torino

Power consumption is a key limitation in many electronic systems today, ranging from mobile telecom to portable and desktop computing systems, especially when moving to nanometer technologies. Power is also a showstopper for many emerging applications like ambient intelligence and sensor networks. Consequently, new design techniques and methodologies are needed to control and limit power consumption. The 2004 edition of the DATE (Design Automation and Test in Europe) conference has devoted an entire Special Focus Day to the power problem and its implications on the design of future electronic systems. In particular, keynote presentations and invited talks by outstanding researchers in the field of low-power design, as well as several technical papers from the regular conference sessions have addressed the difficulties ahead and advanced strategies and principles for achieving ultra low-power design solutions. Purpose of this book is to integrate into a single volume a selection of these contributions, duly extended and transformed by the authors into chapters proposing a mix of tutorial material and advanced research results. The manuscript consists of a total of 14 chapters, addressing different aspects of ultra low-power electronics and design. Chapter 1 opens the volume by providing an insight to innovative transistor devices that are capable of operating with a very low threshold voltage, thus contributing to a significant reduction of the dynamic component of power consumption. Solutions for limiting leakage power during stand-by mode are also discussed. The chapter closes with a quick overview of low-power design techniques applicable at the logic level, including multi-Vdd, multi-Vth and hybrid approaches. Chapter 2 focuses on the problem of reducing power in the interconnect network by investigating alternatives to traditional metal wires. In fact, according to the 2003 ITRS roadmap, metallic interconnections may not be able to provide enough transmission speed and to keep power under control for the upcoming technology nodes (65nm and below). A possible solution, explored in the chapter, consists of the adoption of optical interconnect networks. Two applications are presented: Clock distribution and data communication using wavelength division multiplexing. TLFeBOOK xiv In Chapter 3, the power consumption problem is faced from the technology point of view by looking at innovative nano-devices, such as single-electron or few-electron transistors. The low-power characteristics and potential of these devices are reviewed in details. Other devices, including carbon nano- tube transistors, resonant tunnelling diodes and quantum cellular automata are also treated.

Chapter 4 is entirely dedicated to advanced design methodologies for reducing sub-threshold and gate leakage currents in deep-submicron CMOS circuits by properly choosing the states to which gates have to be driven when in stand-by mode, as well as the values of the threshold voltage and of the gate oxide thickness. The authors formulate the optimization problem for simultaneous state/Vth and state/Vth/Tox assignments under delay constraints and propose both an exact method for its optimal solution and two practical heuristics with reasonable run-time. Experimental results obtained on a number of benchmark circuits demonstrate the viability of the proposed methodology.

Chapter 5 is concerned with the issue of minimizing power consumption of the memory subsystem in complex, multi-processor systems-on-chip (MPSoCs), such as those employed in multi-media applications. The focus is on design solutions and methods for synthesizing memory architectures containing both single-ported and multi-ported memory banks. Power efficiency is achieved by casting the memory partitioning design paradigm to the case of heterogeneous memory structures, in which data need to be accessed in a shared manner by different processing units.

Chapter 6 addresses the relevant problem of minimizing the power consumed by the cache hierarchy of a . Several design techniques are discussed, including application-driven automatic and dynamic cache parameter tuning, adoption of configurable victim buffers and frequent-value data encoding and compression.

Power optimization for parallel, variable-voltage/frequency processors is the subject of Chapter 7. Given a processor with such an architecture, this chapter investigates the energy/performance tradeoffs that can be spanned in parallelizing array-intensive applications, taking into account the possibility that individual processing units can operate at different voltage/frequency levels. In assigning voltage levels to processing units, compiler analysis is used to reveal hetherogeneity between the loads of the different units in parallel execution. TLFeBOOK xv Chapter 8 provides guidelines for the design and implementation of DSP and multi-media applications onto programmable embedded platforms. The RINGS architecture is first introduced, followed by a detailed discussion on power-efficient design of some of the platform components, namely, the DSPs. Next, design exploration, co-design and co-simulation challenges are addressed, with the goal of offering to the designers the capability of including into the final architecture the right level of programmability (or reconfigurability) to guarantee the required balance between system performance and power consumption. Chapter 9 targets software power minimization through source code optimization. Different classes of code transformations are first reviewed; next, the chapter outlines a flow for the estimation of the effects that the application of such transformations may have on the power consumed by a software application. At the core of the estimation methodology there is the development of power models that allow the decoupling of processor- independent analysis from all the aspects that are tightly related to processor architecture and implementation. The proposed approach to software power minimization is validated through several experiments conducted on a number of embedded processors for different types of benchmark applications.

Reduction of the power consumed by TFT liquid crystal displays, such as those commonly used in consumer electronic products is the subject of Chapter 10. More specifically, techniques for reducing power consumption of transmissive TFT-LCDs using a cold cathode fluorescent lamp backlight are proposed. The rationale behind such techniques is that the transmittance function of the TFT-LCD panel can be adjusted (i.e., scaled) while meeting an upper bound on a contrast distortion metric. Experimental results show that significant power savings can be achieved for still images with very little penalty in image contrast.

Chapter 11 addresses the issue of efficiently accessing remote memories from wireless systems. This problem is particularly important for devices such as palmtops and PDAs, for which local memory space is at a premium and networked memory access is required to support virtual memory swapping. The chapter explores performance and energy of network swapping in comparison with swapping on local microdrives and FLASH memories. Results show that remote swapping over power-manageable wireless network interface cards can be more efficient than local swapping and that both energy and performance can be optimized by means of power- aware reshaping of data requests. In other words, dummy data accesses can be preemptively inserted in the source code to reshape page requests in order to significantly improve the effectiveness of dynamic power management. TLFeBOOK xvi Chapter 12 focuses on communication architectures for multi-processor SoCs. The network-on-chip (NoC) paradigm is reviewed, touching upon several issues related to power optimization of such kinds of communication architectures. The analysis goes on a layer-by-layer basis, and particular emphasis is given to customized, domain-specific networks, which represent the most promising scenario for communication-energy minimization in multi-processor platforms.

Chapter 13 provides a natural follow up to the theory of NoCs covered in the previous chapter by describing an industrial application of this type of communication architecture. In particular, the authors introduce an innovative methodology for automatically generating the power models of a versatile and parametric on-chip communication IP, namely the STBus by STMicroelectronics. The methodology is validated on a multi-processor hardware platform including four ARM cores accessing a number of peripheral targets, such as SRAM banks, interrupt slaves and ROM memories.

The last contribution, offered in Chapter 14, proposes an integrated end-to- end power management approach for mobile video streaming applications that unifies low-level architectural optimizations (e.g., CPU, memory, registers), OS power-saving mechanisms (e.g., dynamic voltage scaling) and adaptive middleware techniques (e.g., admission control, trans-coding, network traffic regulation). Specifically, interaction parameters between the different levels are identified and optimized to achieve a reduction in the power consumption.

Closing this introductory chapter, the editor would like to thank all the authors for their effort in producing their outstanding contributions in a very short time. A special thank goes to Mike Thompson and Philippe Magarshack of STMicroelectronics for their keynote presentation at DATE 2004 and for writing the foreword to this book. The editor would also like to acknowledge the support offered by Mark De Jongh and the Kluwer staff during the preparation of the final version of the manuscript. Last, but not least, the editor is grateful to Agnieszka Furman for taking care of most of the “dirty work” related to book editing, paging and preparation of the camera-ready material. TLFeBOOK 1

Chapter 1 ULTRA-LOW-POWER DESIGN: DEVICE AND LOGIC DESIGN APPROACHES

Christoph Heer1 and Ulf Schlichtmann2 1Infineon Technologies AG; 2Technische Universität München

Abstract Power consumption increasingly is becoming the bottleneck in the design of ICs in advanced process technologies. We give a brief introduction into the major causes of power consumption. Then we report on experiments in an advanced process technology with ultra-low threshold voltage (Vth) devices. It turns out that in contrast to older process technologies, this approach increasingly is becoming less suitable for industrial usage in advanced process technologies. Following, we describe methodologies to reduce power consumption by optimizations in logic design, specifically by utilizing multiple levels of supply voltage Vdd and threshold voltage Vth. We evaluate them from an industrial product development perspective. We also give a brief outlook to proposals on other levels in the and to future work.

Keywords: Low-power design, dynamic power reduction, leakage power reduction, ultra- low-Vth devices, multi-Vdd, multi-Vth, CVS

1.1 INTRODUCTION

The progress of silicon process technology marches on relentlessly. As predicted by Gordon Moore decades ago, silicon process technology continues to achieve improvements at an astonishing pace [1]. The number of transistors that can be integrated on a single IC approximately doubles every 2 years [2,3]. This engineering success has created innovative new industries (e.g. personal computers and peripherals, consumer electronics) and revolutionized other industries (e.g. communications). Today, however, it is becoming increasingly difficult to achieve improvements at the pace that the industry has become accustomed to. More and more technical challenges appear that require increasing resources to be TLFeBOOK

2 solved [4]. One such problem is the increasing power consumption of integrated circuits. It becomes even more critical as an increasing number of today’s high-volume consumer products are battery-powered. In the following, we will consider the sources of power consumption and their development over time. We will show why reduction of power consumption increasingly is becoming critical to product success and will review traditional approaches in Sections 1.1 and 1.2. In Section 1.3 we will then analyze a potential solution based on introduction of an optimized transistor with a very low threshold voltage Vth. Thereafter, we will present and discuss logic-level design optimizations for power reduction in Section 1.4. Also, we will briefly point out potential optimizations on higher levels. Our observations are made from the perspective of industrial IC product development where technical optimizations must be carefully evaluated against the cost associated with achieving and implementing them. Mostly, the presented methodologies are already being utilized in leading-edge industrial ICs.

1.2 POWER CONSUMPTION BECOMES CRITICAL

Depending on the type of end-product and its application, different aspects of power consumption are the primary concern: dynamic power or leakage power. Reduction of dynamic power consumption is a concern for almost all IC products today. For battery-powered products, reduced power consumption directly results in longer operating time for the product, which is a very desirable characteristic. Even for non-battery-powered products, reduced power consumption brings many advantages, such as reduced cost because of cheaper packaging or higher performance because of lower temperatures. Finally, reduced power consumption often leads to lower system cost (no fans required; no or cheaper air conditioning for data / telecom center etc.). Dynamic power consumption is caused by the charging and discharging of capacitances when a circuit switches. In addition, during switching a short-circuit current flows, but this current is typically much smaller, and will therefore be neglected in the following. The dynamic current due to capacitance charging and discharging is determined by the following well- known relationship:

•• 2 dyn ~ VCfP ddL TLFeBOOK

3

Based on constant electrical field scaling, Vdd and CL each are reduced by 30% in each successive process generation. Also, delay decreases by 30%, resulting in 43% increase in frequency. Therefore, the dynamic power consumption per device is reduced by 50% from one process generation to the next. As scaling also doubles the number of devices that can be implemented in a given die area, dynamic power consumption per area should stay roughly identical. However, historically frequency has increased by significantly more than 43% from one process generation to the next (e.g. in , it has roughly doubled, due to architectural optimizations, such as deeper pipeline stages), and in addition, die sizes have increased with each new process technology, further increasing the power consumption, due to an increased number of active devices [5]. For these reasons, dynamic power consumption has increased exponentially, as is shown in Figure 1-1 for the example of microprocessors. Reduction of leakage power consumption today is primarily a concern for products that are powered by battery and spend most of their operating hours in some type of standby mode, such as cell phones. For many process generations, however, leakage has increased roughly by a factor of 10 for every two process nodes [6]. Due to this dramatic increase with newer process generations, leakage is becoming a significant contribution to overall IC power consumption even in normal operating mode, as can be seen in Figure 1-1 as well. Leakage was estimated to increase from 0.01% of overall power consumption in a 1.0µm technology, to 10% in a 0.1µm technology [6]. For a microprocessor, Intel estimated leakage power consumption at more than 50W for a 100nm technology node[3]. This figure probably is extreme, and leakage depends strongly on a number of factors, such as threshold voltage (Vth) of the transistor, gate oxide thickness and environmental operating conditions (supply voltage Vdd, temperature T). Nevertheless, for an increasing number of products leakage power consumption is turning into a problem, even when they are not battery-powered. TLFeBOOK 4

Figure 1-1. Development of dynamic and leakage power consumption over time [3,7]

1.3 TRADITIONAL APPROACHES TO POWER REDUCTION

As outlined above, dynamic power consumption is governed by:

•• 2 dyn ~ VCfP ddL

with f denoting the switching frequency, CL the capacitance being switched, and Vdd the supply voltage . This formula immediately identifies the key levers to reduce dynamic power: • Reduce operating frequency • Reduce driven capacity • Reduce supply voltage Traditionally, reduction in supply voltage Vdd has been the most often followed strategy to reduce power consumption. Unfortunately, lowering Vdd has the side effect of reducing performance as well, primarily because gate TLFeBOOK 5 overdrive (the difference between Vdd and Vth) diminishes if the threshold voltage Vth is kept constant. Based on the alpha power law model [8], the delay td of an inverter is given by

•VC t = ddL d ()− α VV thdd

with α denoting a fitting constant. As supply voltages are driven below 1.0V, the reductions in gate overdrive are more pronounced than previously. In addition, newer process technologies give significantly less of a performance boost compared to the previous process generation than has traditionally been the case, therefore a further reduction in performance is highly undesirable. Finally, the power reduction achieved by moving to a new process generation has trended down over time, since supply voltages have been scaled by increasingly less than the 30% prescribed by the constant electrical field scaling paradigm. Consequently, more advanced approaches are required. In the following, our main focus will be on dynamic power consumption, but we will also consider leakage power consumption.

1.4 ZERO-VTH DEVICES

The concept of zero-Vth devices was developed in the mid 90-ies. It overcomes the diminishing gate overdrive by radically setting the threshold voltage of the active devices to zero. It has been shown [9], that the optimum power dissipation is obtained, if Pleak (leakage contribution) is in the same order of magnitude as Pdyn (dynamic switching contribution). This can be achieved for transistors with Vth close to 0V (‘zero-Vth transistor‘). Therefore the devices will never completely switch off. But from an overall power perspective the gain in active power consumption is tremendous. Using these transistors the supply voltage of 130nm circuits can be reduced to values below 0.3V to achieve a Pdyn reduction by 90% without performance degradation. Alternatively, the circuit can be operated at twice the clock frequency when keeping the supply voltage at 1.2V, as shown in Figure 1-2. The corresponding Ion/Ioff-ratio for the zero-Vth transistor is about 10-100 instead of >105 for the standard transistor options. During standby, the complete circuits are switched-off or are set into a low leakage mode to cope with the very high leakage contribution. The low leakage mode is achieved by ‘active well’ control, which denotes the use of the body effect. The well potentials of the PFETs and NFETs are altered to change Vth. To achieve a lower leakage current, the absolute value of Vth is increased by TLFeBOOK

6

reverse back biasing: a negative well-to-source voltage Usb is used. Therefore voltages below Vss for NFETs and above Vdd for PFETs have to be generated. Furthermore, active well is required to compensate the lot-to- lot or wafer-to-wafer variations of Vth. The initial ‘zero-Vth’ concept assumed constant junction temperatures Tj below 40°C. For some high-end computer equipment the costs for active chip cooling are affordable to achieve this junction temperature. But this is definitely not the case for cost-driven consumer products. For this application domain Tj in active mode ranges between 85°C and 125°C, and in some applications the specified worst-case ambient temperature is even 80°C. The proposed zero-Vth concept is therefore not applicable without changes and adaptations.

Figure 1-2. Simulated performance curves of transistors with ultra-low Vth. Compared to low- Vth, either a performance gain or a Vdd reduction can be achieved. Curves for reg-Vth and high-Vth transistors of a 130 nm technology are included

A more conservative approach with respect to zero-Vth, but still aggressive compared to current devices, had to be chosen. An ultra-low Vth device with about 150mV threshold voltage proved to be the best TLFeBOOK

7 compromise between zero-Vth and current low-Vth of about 300mV within a 130 nm CMOS technology. To identify the optimal choice of Vth and Vdd in combination with the higher junction temperature Tj, simulations with modified parameters of the 130nm low-Vth transistor are performed. In Figure 1-3 the power dissipation is shown for a high activity circuit (ı= 20%) with various options for the transistor threshold voltages: reg-Vth, low-Vth, and transistors whose Vth are reduced to 200mV, 150mV, 100mV and 50mV. The reg-Vth circuit performance was used as the reference (Vdd = 1.5V), and the supply voltages for the other transistor options were reduced to meet that reference performance.

3,5E-05 V = 1.5V dd T= 125°C fast 3,0E-05

2,5E-05 nom = target 2,0E-05

1.2V slow 1,5E-05 Power [W] 1.0V 0.6V 1,0E-05 0.8V 0.7V 5,0E-06

0,0E+00 reg-Vt low-Vt 200mV 150mV 100mV 50mV Device Option / Vth (mV)

Figure 1-3. Power dissipation at T=125°C in active mode for several transistor options with reduced Vth. A minimum power consumption is achieved at 150mV Vth. (At T=55°C the minimum is achieved for the same option but process variations show less impact).

The reduced supply voltage leads to lower overall active power consumption Pactive. A minimum power consumption is reached at Vth = 150mV. With even lower threshold voltages Pactive starts to increase again because of the increase of the leakage current. The steep rise of Pactive originates from the exponential relation between Vth and leakage current. As a rule of thumb a 100mV reduction of the threshold voltage allows for a Vdd TLFeBOOK

8 reduction by § 0.15V but on the other hand results in a tenfold increase of the leakage current. From Figure 1-3 also the impact of technology variations is visible. Due to the high leakage contribution a power reduction of only 25% is achieved under fast process conditions. Using back biasing in reverse mode, the high performance of fast transistors can be reduced through increasing Vth. The corresponding leakage current therefore decreases and allows a power reduction by 50% (stippled arrow).

A process modification has been developed to manufacture devices with the threshold voltage of 150 mV, which proves to be the most efficient for the target application domain of mobile consumer products [10]. In Table1-1 the key transistor parameters of our ultra-low-Vth FETs (ulv) and of the standard low-Vth transistor are listed. The Vth values are 165mV and 161mV for the ulv-NFET and ulv-PFET respectively, Ion increases by 35% and 22%, which translates into an average decrease of the CV/I-metric delay by 29%. Circuit simulations showed a performance increase of 25%. Concerning Vth, performance, and Ioff the target values have been nearly met.

Table 1-1. Extracted key parameters of the ulv-FETSs in comparison with the target values and the low- Vth FETs 130nm low-Vt 130nm ulv-FET Target NFET / PFET NFET / PFET

Ion 560 / 240 755 / 295 [µA/µm]

Ioff 1.2 / 1.2 48 / 17 §35 [nA/µm]

Vth 295 / 260 165/160 150 [mV] body effect 150 / 135 60/65 90 [mV/V]

Vth@¨L=10nm 35 / 30 65/30 [mV]

Vth@¨L=15nm 65 / 70 100/90 [mV] Simulated gate delay 1 0.8 0.75 [relative units]

The sensitivity of Vth to gate length variation (roll-off) is expressed in Vth-shift per 10nm or 15nm gate length decrease. A comparison with low- Vth-FETs shows a pronounced increase. Therefore in addition to temperature compensation, back biasing has also to be used to compensate for this strong technology variation. TLFeBOOK

9 The values of the body effect are also included in Table 1-1. The body effect is expressed in Vth-shift per 1V well bias. The ulv-FETs yield values, which are lower by more than 50% compared to the low-Vth transistors. The decrease of body effect in combination with the increased roll-off reduces the leverage of back biasing for ulv-FETs very significantly. The leverage is not even sufficient to compensate the technology variation, since the value of the roll-off is higher than that of the body effect. As an example, the ulv- NFET shows roll-off values of 65mV/10nm and 100mV/15nm and a body effect of only 60mV/V. To investigate the migration potential of the ulv-FETs for future technology generations Ioff measurement results, obtained from a recent 90nm hardware, were used. Based on this measurement data the leverage of active well with the standard reg-Vth and low leakage transistor options has been analyzed. For supply voltages of 1.2V and 0.75V a reverse back biasing voltage of 0.5V has been applied. For the NFET, the back biasing results in a leakage reduction by 50% to 70% for all transistor widths and for both values of Vdd. In the case of the PFET, the leakage reduction values are similar (60% to 80%) for transistors with W> 0.5µm. For very narrow PFETs with Vdd = 1.2V, the reduction is only 20% or even less. Since narrow FETs are used within SRAMs, which contribute a major part of the circuit’s standby current, this small reduction for narrow transistors in addition reduces significantly the leverage of active well. The root cause is an additional leakage mechanism based on tunnelling currents across the drain-well junction, which limits the reverse back biasing to 0.5V. This tunnelling current depends exponentially on the drain-well voltage and is working against any reduction of the sub-threshold current via active well. At Vdd = 0.75V the drain-well voltage is reduced and the tunnelling current is therefore lower. In this case the effect of back biasing is not compensated by a rising tunnelling current and a leakage current reduction by 70% is still achieved. For a 90nm technology the limit of 0.5V for the well potential swing limits the reduction of the leakage currents to a factor between 2 and 4. This is still a major contribution of all feasible measures to reduce standby power consumption, but the leverage becomes quite small compared to the reduction ratios of several orders of magnitude obtained in previous technologies [11,12]. In future technologies, Ileak will become more strongly affected by the emerging tunnelling current Igate through the gate of the FET. This is due to the ever decreasing gate oxide thickness and also due to the fact, that even the on-state transistors shows gate leakage. Igate is not affected by well biasing reducing the leverage of active well even further. TLFeBOOK

10

In summary the zero-Vth-devices have become very susceptible to process and temperature variations. Significant yield is only achievable with back biasing via active well control and with active cooling. The latter approach is not feasible for mobile applications. Therefore a more conservative approach with respect to zero-Vth, but still aggressive compared to current devices, had to be chosen. An ultra-low-Vth device with about 150mV threshold voltage proved to be the best compromise between zero- Vth and current low-Vth of about 300mV within a 130 nm CMOS technology. But even though fabrication of this ultra-low-Vth device is possible, it affects some standard methods to overcome short-channel effects. The so called halo- or pocket-implantation had to be removed to bring the threshold voltage down. Unfortunately short-channel effects are now heavily increased, leading as shown to a very strong Vth roll-off at slight variations of the channel length. Finally this effect was prohibitive for the overall approach and led to cancellation of many zero-Vth projects in the industry[13].

1.5 DESIGN APPROACHES TO POWER REDUCTION

As outlined above, solutions from process technology by itself will not suffice to provide sufficient power reduction. Therefore, solutions must be found in algorithms, product architecture and logic design. Increasingly, differentiated device options provided by process technology are utilized on these levels in the search for optimization of power consumption. For leading-edge products which need to optimize both power consumption and system performance, optimization techniques on architecture and design level have been proposed and partly already been implemented. While academic research often focuses on the tradeoff between power consumption and performance, industrial product development must also take other variables into consideration. • Product cost: often, power optimization design techniques increase die area, directly affecting manufacturing cost. Also, utilization of additional devices (e.g. different Vth devices) increases mask count and consequently manufacturing cost, and additionally requires up-front expenditures for the development of such devices. Finally, increased manufacturing complexity poses the risk of lowered manufacturing yield. • Product robustness: it must be ensured that optimized products still work across the specified range of operating conditions, also taking manufacturing variations into account. TLFeBOOK

11

1.5.1 Multi-Vdd Design

As outlined in the introduction, the supply voltage Vdd quadratically impacts dynamic switching power consumption. Thus, lowering Vdd is the preferred option to reduce dynamic power consumption. However, as discussed in Section 1.2, lowering Vdd reduces the system performance. Thus, the incentive to lower Vdd to reduce power consumption is kept in check by the need to maintain performance. Reduction of Vdd can be applied on different abstraction levels of a design. Most effective regarding power reduction, and also easiest to implement is to lower Vdd for an entire IC. As this will directly impact the performance of the IC design, this often is not an option. On a lower abstraction level, it is possible to lower Vdd for an entire module. This is still rather simple to implement, but if only modules are chosen such that overall IC performance is not impacted, the achieved gains in power reduction will often be very moderate. Finally, a reduction in supply voltage can be applied specifically to individual gates, such that the overall system performance is not reduced. This approach, as shown in Figure 1-4, recognizes that in a typical design, most logic paths are not critical. They can be slowed down, often significantly, without reducing the overall system performance. This slowing down is achieved by lowering the supply voltage Vdd for gates on the non- critical paths, which results in lowered power consumption. TLFeBOOK

12 10ns

SET SET D Q D Q

Q Q CLR CLR

SET SET D Q D Q

Q Q CLR 5ns CLR Non-critical path may be delayed

10ns

SET SET D Q D Q

Vdd_low Q Vdd_low Q CLR Vdd_low CLR

SET SET D Q D Q

Q Q CLR 8ns CLR Non-critical path runs with reduced supply voltage

Figure 1-4. Multi-Vdd design

This technique will modify the distribution of path delays in a design to a distribution skewed towards paths with higher delay, as indicated Figure 1-5 [14].

Single Supply Voltage SSV Multiple Supply Voltages MSV MSV SSV

crit. paths

td td 1/f 1/f

Figure 1-5. Distribution of path delays under single and multiple supply voltages TLFeBOOK

13 A number of studies have shown significant variation in dynamic power reduction results from implementing a multi-Vdd design strategy, ranging from less than 10% up to almost 50%, with 40% being the average [15,16]. Rules of thumb for selecting appropriate supply voltage levels have been developed. When using two supply voltages, the lower Vdd was proposed to be 0.6x-0.7x of the higher Vdd [17]. The optimal supply voltage level also depends on Vth [18]. The benefit of using multiple supply voltages quickly saturates. The major gain is obtained by moving from a single Vdd to dual-Vdd. Extending this to ever more supply voltage levels yields only small incremental benefits [18,19], even when the overhead introduced by multiple supply voltages (see below) is not taken into consideration. The power reduction achieved by this technique roughly depends on two parameters: the difference between the regular supply voltage Vdd and the lowered supply voltage Vdd_low, and the percentage of gates to which Vdd_low is applied. Regarding the first parameter, it has been pointed out some years ago that the leverage of this concept decreases as process technologies are scaled down further [18]. Recent work has analyzed this in more detail [14]. At least for high-Vth devices, which are essential for low standby power design due to their lower leakage current, Vth has scaled much slower than Vdd recently. Therefore, gate overdrive (Vdd - Vth) is diminished, negatively impacting performance. Thus, even a little reduction in Vdd will have a very significant impact on performance. Therefore, the potential to lower Vdd while maintaining overall system performance is greatly reduced. It is shown that from 0.25µm down to 0.09µm, the effectiveness of dual-Vdd decreases by a factor of 2 (from 60% dynamic power reduction to 30%) for high-Vth designs, whereas it stays about constant for low-Vth designs. This can however be countered by introduction of variable threshold voltages, as will be seen later. Regarding the second parameter, experience has shown that especially in designs using the multi-Vth technique outlined below, path delays tend to be skewed to higher delays already, thus reducing the number of gates that can be slowed down further [14]. For the selection of those gates which will receive the lower supply voltage Vdd_low, a number of techniques have been proposed. Most prevalent is the concept of clustered voltage scaling (CVS). It recognizes that it is desirable to have clusters of gates assigned to the same voltage, since between the output of a gate supplied by Vdd_low and the input of a gate supplied by Vdd a level shifter is required to avoid static current flow [20]. This concept has been enhanced by extended clustered voltage scaling (ECVS)[17] which essentially allows an arbitrary assignment of supply TLFeBOOK

14 voltage levels to gates. This strategy implies more frequent insertion of level shifters into the design. However, usually only power consumption and delay are considered in the literature. The additional area cost is neglected. In industry, this certainly is not feasible. While conceptually simple, the implementation of a multi-Vdd concept poses a number of challenges. • The additional supply voltage Vdd_low needs to be created on-chip by a dc- to-dc converter, unless the voltage already exists externally. This results in area overhead, and in power consumption for the converter. • The additional supply voltage Vdd_low must be distributed across the chip. • Level-shifters are required between different supply domains. It is feasible to integrate level shifters into flip-flops [21]. The penalties in area, power consumption and delay resulting from these effects are not always taken into account by work published in the literature. Studies indicate that a 10% area overhead will result from implementing a dual-Vdd design [22]. An additional consideration for industrial IC product development is that EDA tool support for implementing a dual-Vdd design is still only rudimentary. It is not sufficient to have a single point tool which can perform power-performance tradeoffs. Instead, this methodology needs to encompass the entire design flow (e.g. power distribution in layout; automated insertion of level shifters etc.).

1.5.2 Multi-Vth Design

Another essential technique is the use of different transistor threshold voltages (multi-Vth design). Primarily this technique reduces leakage power consumption, thus increasing standby time of battery-powered ICs. As leakage power consumption becomes an increasingly important component of overall power consumption in modern process technologies, this technique increasingly also helps to reduce overall power consumption significantly, as design moves to more advanced process technologies. The idea is similar to multi-Vdd design: paths that do not need highest performance are implemented with special leakage-reduced transistors (typically higher Vth transistors, but also thicker gate-oxide Tox), as shown in Figure 1-6. TLFeBOOK

15

10ns

SET SET D Q D Q

Q Q CLR CLR

SET SET D Q D Q

Q Q CLR 5ns CLR Non-critical path may be delayed

10ns

SET SET D Q D Q

high V Q t high Vt Q CLR CLR high Vt

SET SET D Q D Q

Q Q CLR 8ns CLR Non-critical path runs with increased threshold voltage

Figure 1-6. Multi-Vth design

A typical industrial approach today is to first create a design using lower Vth transistors to achieve the required performance and then to selectively replace gates off the critical path with higher Vth (or thicker Tox) transistors to reduce leakage. Studies in the literature have reported reductions in leakage of around 50% up to 80%. Some approaches assume that different Vth levels are provided by the process technology (through doping variations) and propose algorithms to optimally assign Vth levels to transistors, ensuring that performance is not compromised [23, 24]. Recently, it has also been proposed to achieve modifications in Vth by modifying transistor length or gate oxide thickness Tox [25]. Design-tool support for this technique is also rudimentary at best. While it is becoming established to design different modules of an IC with different Vth transistors, it is very challenging to do this on the level of individual transistors within a module. The primary reason is that the entire design flow must be able to handle cells with identical functionality and size, which differ in their electrical properties. This poses no principal algorithmic problems, but must be consistently implemented in all EDA tools within a design flow. TLFeBOOK

16 1.5.3 Hybrid Approaches

Recently approaches have been suggested in the literature which combine implementation of multiple supply voltages and multiple threshold voltages for further power reduction. Especially for designs where minimization of total power consumption is key (as compared to e.g. minimization of standby power for mobile products), it is possible to trade off leakage and dynamic power, as originally proposed in the zero-Vth concept. Studies in the literature indicate a total power optimum when leakage power contributes 10% to 30% [26,12]. This ratio depends significantly on the process technology, operating environment, and clock frequency of a design. For applications where leakage power minimization is critical (e.g. mobile products), this approach usually is not feasible, as it requires a relatively low Vth which causes high leakage currents [14]. With the increasing significance of gate leakage currents, variations of gate oxide thickness Tox have also been proposed. An overall framework for using two supply voltages and two threshold voltages as well has been presented [19]. Theoretically, it is shown that more than 60% of total power consumption can be saved this way (not considering required overhead such as level shifters, routing etc.). Rules of thumb are proposed and it is shown that the optimal second Vdd is about 50% of the original Vdd in this case. It is also argued that the usefulness of multi- Vdd strategies is not diminished, but actually increased in more advanced technologies, if also a multi-Vth strategy is followed, since this strategy allows to trade off leakage vs. dynamic power consumption by changing Vth and Vdd to optimize power consumption, while maintaining a required timing performance. This approach has been applied to the practical example of an ARM processor in [27]. Due to specific layout considerations it was not possible to implement all four intended combinations of Vdd and Vth. Instead, three different libraries were implemented. Using a CVS algorithm, a reduction in dynamic power by 15% was achieved for a 0.18µm process technology. Leakage power was reduced by 40%. As leakage power was more than 1000x smaller than dynamic power, overall active power reduction was 15%. To achieve this, a 14% increase in area was required. A very recent approach considers also transistor width sizing in addition to Vdd and Vth assignment [28]. Using a two stage, sensitivity-based approach, total power savings of 37% on average over a suite of benchmark circuits are reported. In this study, the threshold voltage is chosen rather low, so that leakage represents 20-50% of total power consumption. Therefore, optimization of both leakage and dynamic power consumption is essential, which is achieved with the presented approach. TLFeBOOK

17 An enhanced approach for leakage power consumption considers multiple gate oxide thicknesses Tox in addition to multi-Vth [29]. It is motivated by the fact that gate leakage increases very dramatically with newer process technologies. Gate leakage is of the same order of magnitude as subthreshold leakage at the 90nm process node. Their relationship also depends significantly on the operating temperature T. The key observation that an OFF transistor suffers from subthreshold leakage, an ON transistor from gate leakage, motivates the approach to analyze transistor states in standby mode and assign Vth and Tox such that leakage power consumption is minimized. Leakage reductions of 5-6x are obtained on benchmark circuits, compared to designs using a single Vth and Tox. Previous approaches that included Tox into the optimization varied Tox only for different design modules, not on critical paths within modules. These newer approaches promise further reductions in power consumption. This will come, however, at a price (as seen e.g. in the ARM example). Design complexity increases significantly when variations in many parameters are made available at the same time. In some studies, the resulting overhead is not considered.

1.5.4 Cost Tradeoffs

This overhead must be considered, however, since it is quite significant: • Multi-Vdd: level-shifter (area, power consumption, delay), routing of additional supply voltages (area). • Multi-Vth: additional masks (manufacturing costs); potentially special design rules at the boundary between different Vth devices (area). • Multi-Tox: additional masks (manufacturing costs). • In addition, IC development costs increase due to more complex design flows. Also, special process options (Vth, Tox) must be developed, qualified and continuously monitored. For each such option, the design library must be electrically characterized, modelled for all EDA tools, and potentially optimized regarding circuit design and layout. It must be maintained and regularly updated (changes in electrical parameters, changes in tools in the design flow) over a long period of time as well. If a very specialized manufacturing flow is developed to fully optimize a given product, it will be very difficult to shift manufacturing of this product to a different fab (e.g. a foundry in case additional capacity is required). For these and potentially other reasons, we are not yet aware of industrial products that have implemented such proposals in a fine-grained manner (i.e. different Vth, Vdd and Tox combined within one design module). TLFeBOOK

18 Some approaches in the literature also determine optimum levels of threshold voltages depending on a given design. In industry, this is rarely feasible. Typically, a manufacturing process has to be taken as given, with only predefined values of Vth (and Tox) being available.

1.6 APPROACHES ON HIGHER ABSTRACTION LEVELS

The approaches outlined above on gate level and device level can be (and often must be) supported by measures on higher levels of abstraction. Some of the most promising concepts are as follows: • partitioning the system such that large areas can be powered off for significant periods of time (block turnoff) • especially partitioning memory systems such that large parts can be turned off in standby mode • clock gating is an essential method which reduces dynamic power consumption by local off-switching of non-active gates • coding strategies (e.g. for buses) can reduce switching and thus dynamic power consumption

1.7 CONCLUSION AND FUTURE CHALLENGES

There is no single “silver bullet” to solve the challenge of power reduction. While ultra-low voltage logic based on special ultra-low-Vth devices is a conceptually very convincing concept, its widespread implementation is hindered by manufacturing concerns. An extrapolation of current technology trends indicates that such a concept will become even more difficult in the future. Today, design techniques are the most promising approach to reduce power – both dynamic and leakage. The concepts outlined here can be further extended. It is feasible to dynamically adjust supply and threshold voltages. These are theoretically promising concepts which however still require more investigation especially with regard to feasibility under industrial boundary conditions. Quite likely, in the future even more emphasis than today will have to be placed on power reduction schemes on algorithmic and system level. On these levels, the levers to reduce power consumption are largest.

Acknowledgement

The authors wish to acknowledge and thank Jörg Berthold and Tim Schönauer for their contributions and fruitful discussions. TLFeBOOK

19 References

[1] G. Moore, Cramming More Components onto integrated circuits, Electronics Magazine, Vol. 38, No. 8, 1965, pp. 114-117. [2] ITRS, International Technology Roadmap for Semiconductors, 2003, http://public.itrs.net. [3] F. Pollack, New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies, Micro32 Keynote, 1999. [4] U. Schlichtmann, Systems are Made from Transistors: UDSM Technology Creates New Challenges for Library and IC Development, IEEE Euromicro Symposium on Digital System Design, 2002, pp. 1-2. [5] S. Borkar, Design Challenges of Technology Scaling, IEEE Micro, July/August 1999, pp. 23-29. [6] S. Thompson, P. Packan, and M. Bohr, MOS Scaling: Transistor Challenges for the 21st Century, , Q3 1998. [7] N. Kim et al., Leakage Current: Moore's Law Meets Static Power, IEEE Computer, Vol. 36, No. 12, December 2003, pp. 68-75. [8] S. Sakurai, A. R. Newton, Alpha-Power Law MOSFET Model and its Application to CMOS Inverter Delay and Other Formulas, IEEE Journal of Solid-State Circuits, Vol. 25, No. 2, 1990, pp. 584-594. [9] J.B. Burr, J. Schott, A 200 mV self-testing encoder/decoder using Stanford ultra-low- power CMOS, 1994 IEEE International Solid-State Circuits Conference [10] J. Berthold, R. Nadal, C. Heer, Optionen für Low-Power-Konzepte in den sub-180-nm- CMOS-Technologien (In German), U.R.S.I. Kleinheubacher Tagung 2002. [11] V. Svilan, M. Matsui, J. B. Burr, Energy-Efficient 32 x 32-bit Multiplier in Tunable Near-Zero Threshold CMOS, ISLPED 2000, pp. 268-272. [12] V. Svilan, J. B. Burr, L. Tyler, Effects of Elevated Temperature on Tunable Near-Zero Threshold CMOS, ISLPED 2001, pp. 255-258. [13] C. Heer, Designing low-power circuits: an industrial point of view, PATMOS 2001 [14] T. Schoenauer, J. Berthold, C. Heer, Reduced Leverage of Dual Supply Voltages in Ultra Deep Submicron Technologies, International Workshop on Power And Timing Modeling, Optimization and Simulation PATMOS 2003, pp. 41-50. [15] K. Usami, M. Igarashi, Low-Power Design Methodology and Applications utilizing Dual Supply Voltages, Proceedings of the Asia and South Pacific Design Automation Conference 2000, pp. 123-128. [16] M. Donno, L. Macchiarulo, A. Macii, E. Macii, M. Poncino, Enhanced Clustered Voltage Scaling for Low Power, Proceedings of the 12th ACM Great Lakes Symposium on VLSI, 2002, pp. 18-23. [17] K. Usami et al., Automated Low-Power Technique Exploiting Multiple Supply Voltages Applied to a Media Processor, IEEE Journal of Solid-State Circuits, Vol. 33, No. 3, March 1998, pp. 463-472. [18] M. Hamada, Y. Ootaguro, T. Kuroda, Utilizing Surplus Timing for Power Reduction, Proceedings IEEE Custom Integrated Circuits Conference CICC, 2001, pp. 89-92. [19] A. Srivastava, D. Sylvester, Minimizing Total Power by Simultaneous Vdd/Vth Assignment, Proceedings of the Asia and South Pacific Design Automation Conference 2003, pp. 400-403. [20] K. Usami, M. Horowitz, Clustered Voltage Scaling Technique for Low-Power Design, Proceedings of the International Symposium on Low Power Design ISLPD, 1995, pp. 3- 8. TLFeBOOK

20

[21] K. Usami et al., Design Methodology of Ultra Low-power MPEG4 Codec Core Exploiting Voltage Scaling Techniques, Proceedings of the 35th Design Automation Conference 1998, pp. 483-488. [22] C. Yeh, Y.-S. Kang, Layout Techniques Supporting the Use of Dual Supply Voltages for Cell-Based Designs, Proceedings of the 36th Design Automation Conference 1999, pp. 62-67. [23] Q. Wang, S. Vrudhula, Algorithms for Minimizing Standby Power in Deep Submicrometer, Dual-Vt CMOS Circuits, IEEE Transactions on CAD, Vol. 21, No. 3, March 2002, pp. 306/318. [24] L. Wei, Z. Chen, K. Roy, M. Johnson, Y. Ye, V. De, Design and Optimization of Dual- Threshold Circuits for Low-Voltage Low-Power Applications, IEEE Transactions on Very Large Scale Integration (VLSI), Vol. 7, No. 1, March 1999, pp. 16-24. [25] N. Sirisantana, K. Roy, Low-Power Design Using Multiple Channel Lengths and Oxide Thicknesses, IEEE Design & Test of Computers, January-February 2004, pp. 56-63. [26] K. Nose, T. Sakurai, Optimization of VDD and VTH for Low-Power and High-Speed Applications, Proceedings of the Asia and South Pacific Design Automation Conference 2000, pp. 469-474. [27] R. Bai, S. Kulkarni, W. Kwong, A. Srivastava, D. Sylvester, D. Blaauw, An Implementation of a 32-bit ARM Processor Using Dual Power Supplies and Dual Threshold Voltages, IEEE International Symposium on VLSI, 2003, pp. 149-154. [28] A. Srivastava, D. Sylvester, D. Blaauw, Concurrent Sizing, Vdd and Vth Assignment for Low-Power Design, Proceedings of the Design, Automation and Test in Europe Conference DATE, 2003, pp. 718-719. [29] D. Lee, H. Deogun, D. Blaauw, D. Sylvester, Simultaneous State, Vt and Tox Assignment for Total Standby Power Minimization, Proceedings of the Design, Automation and Test in Europe Conference DATE, 2003, pp. 494-499. TLFeBOOK

21

Chapter 2

ON-CHIP OPTICAL INTERCONNECT FOR LOW-POWER

Ian O’Connor and Fred´ eric´ Gaffiot EcoleCentraledeLyon

Abstract It is an accepted fact that process scaling and operating frequency both contribute to increasing integrated circuit power dissipation due to interconnect. Extrapolat- ing this trend leads to a red brick wall which only radically different interconnect architectures and/or technologies will be able to overcome. The aim of this chap- ter is to explain how, by exploiting recent advances in integrated optical devices, optical interconnect within systems on chip can be realised. We describe our vision for heterogeneous integration of a photonic “above-IC" communication layer. Two applications are detailed: clock distribution and data communication using wavelength division multiplexing. For the first application, a design method will be described, enabling quantitative comparisons with electrical clock trees. For the second, more long-term, application, our views will be given on the use of various photonic devices to realize a network on chip that is reconfigurable in terms of the wavelength used.

Keywords: Interconnect technology, optical interconnect, optical network on chip

2.1 INTRODUCTION In the 2003 edition of the ITRS roadmap [17], the interconnect problem was summarised thus: “For the long term, material innovation with traditional scal- ing will no longer satisfy performance requirements. Interconnect innovation with optical, RF, or vertical integration ... will deliver the solution”. Continu- ally shrinking feature sizes, higher clock frequencies, and growth in complexity are all negative factors as far as switching charges on metallic interconnect is concerned. Even with low resistance metals such as copper and low dielectric constant materials, bandwidths for long interconnect will be insufficient for fu- ture operating frequencies. Already the use of metal tracks to transport a signal over a chip has a high cost in terms of power: clock distribution for instance TLFeBOOK

22 requires a significant part (30-50%) of total chip power in high-performance microprocessors. A promising approach to the interconnect problem is the use of an optical interconnect layer, which could empower an increase in the ratio between data rate and power dissipation. At the same time it would enable synchronous op- eration within the circuit and with other circuits, relax constraints on thermal dissipation and sensitivity, signal interference and distortion, and also free up routing resources for complex systems. However, this comes at a price. Firstly, high-speed and low-power interface circuits are required, design of which is not easy and has a direct influence on the overall performance of optical inter- connect. Another important constraint is the fact that all fabrication steps have to be compatible with future IC technology and also that the additional cost incurred remains affordable. Additionally, predictive design technology is re- quired to quantify the performance gain of optical interconnect solutions, where information is scant and disparate concerning not only the optical technology, but also the CMOS technologies for which optics could be used (post-45nm node). In section 2.2, we will describe the “above-IC” optical technology. Sections 2.3 and 2.4 describe an optical clock distribution network and a quantitative electrical-optical power comparison respectively. A proposal for a novel optical network on chip in discussed in section 2.5.

2.2 OPTICAL INTERCONNECT TECHNOLOGY Various technological solutions may be proposed for integrating an optical transport layer in a standard CMOS system. In our opinion, the most promising approach makes use of hybrid (3D) integration of the optical layer above a complete CMOS IC, as shown in fig. 2.1. The basic CMOS process remains the same, since the optical layer can be fabricated independently. The weakness of this approach is in the complex electrical link between the CMOS interface circuits and the optical sources (via stack and advanced bonding). In the system shown in fig. 2.1, a CMOS source driver circuit modulates the current flowing through a biased III-V microsource through a via stack making the electrical connection between the CMOS devices and the optical layer. III-V active devices are chosen in preference to Si-based optical devices for high-speed and high-wavelength operation. The microsource is coupled to the passive waveguide structure, where silicon is used as the core and SiO2 as the cladding material. Si/SiO2 structures are compatible with conventional silicon technology and silicon is an excellent material for transmitting wave- lengths above 1.2µm (mono-mode waveguiding with attenuation as low as 0.8 dB/cm has been demonstrated [10]). The waveguide structure transports the optical signal to a III-V photodetector (or possibly to several, as in the case of TLFeBOOK

23

III−V III−V laser source photodetector

electrical contact Si photonic waveguide (n=3.5)

SiO2 waveguide cladding (n=1.5) metallic interconnect structure driver CMOS IC receiver circuit circuit

Figure 2.1. Cross-section of hybridised interconnection structure a broadcast function) where it is converted to an electrical photocurrent, which flows through another via stack to a CMOS receiver circuit which regenerates the digital output signal. This signal can then if necessary be distributed over a small zone by a local electrical interconnect network.

2.3 AN OPTICAL CLOCK DISTRIBUTION NETWORK In this section we present the structure of the optical clock distribution net- work, and detail the characteristics of each component part in the system: ac- tive optoelectronic devices (external VCSEL source and PIN detector), passive waveguides, interface (driver and receiver) circuits. The latter represent ex- tremely critical parts to the operation of the overall link and require particularly careful design. An optical clock distribution network, shown in fig. 2.2, requires a single photonic source coupled to a symmetrical waveguide structure routing to a number of optical receivers. At the receivers the high-speed optical signal is converted to an electrical one and provided to local electrical networks. Hence the primary tree is optical, while the secondary tree is electrical. It is not feasible to route the optical signal all the way down to the individual gate level since each drop point requires a receiver circuit which consumes area and power. The clock signal is thus routed optically to a number of drop points which will cover a zone over which the last part of the clock distribution will be carried out TLFeBOOK

24 by the electrical secondary clock tree. The size of the zones is determined by calculating the power required to continue in the optical domain and comparing it to the power required to distribute over the zone in the electrical domain. The number of clock distribution points (64 in the figure) is a particularly crucial parameter in the overall system. The global optical H-tree was optimised to achieve minimal optical losses by designing the bend radii to be as large as possible. For 20mm die width and 64 output nodes in the H-tree at the 70nm technology mode, the smallest radius of curvature (r3 in fig. 2.2) is 625µm, which leads to negligible pure bending loss.

die width, D

LCR

L CV : source−waveguide coupling loss LY L W : waveguide transmission loss

L B : bending loss

L Y : Y−coupler loss

L CR : waveguide−receiver r1 LB coupling loss

r3

LW r2

L CV optical source electrical optical optical r1=D/8, r2=D/16, r3=D/32 clock trees waveguides receivers

Figure 2.2. Optical H-tree clock distribution network (OCDN) with 64 output nodes. r1−3 are the bend radii linked to the chip width D

2.3.1 VCSEL sources VCSELs (Vertical Cavity Surface Emitting Lasers) are certainly the most mature emitters for on-chip or chip-to-chip interconnections. Commercial VC- SELs, when forward biased at a voltage well above 1.5V, can emit optical power of the order of a few mW around 850nm, with an efficiency of some 40%. Threshold currents are typically in the mA range. However, fundamental requirements for integrated semiconductor lasers in optical interconnect appli- cations are small size, low threshold lasing operation and single-mode operation (i.e. only one mode is allowed in the gain spectrum). Additionally, the fact that VCSELs emit light vertically makes coupling less easy. It is clear that TLFeBOOK

25 significant effort is required from the research community if VCSELs are to compete seriously in the on-chip optical interconnect arena, to increase wave- length, efficiency and threshold current in the same device. Long wavelength, and low-threshold VCSELs are only just beginning to emerge (for example, a 1.5µm, 2.5Gb/s tuneable VCSEL [5], and an 850nm, 70µA threshold current, 2.6µm diameter CMOS compatible VCSEL [11] have been reported). Ulti- mately however, optical interconnect is more likely to make use of integrated microsources as described in section 2.5, as these devices are intrinsically better suited to this type of application.

2.3.2 PIN photodetectors In order to optimise the frequency and power dissipation performance of the overall link, photodetectors must exhibit high quantum efficiency, large intrinsic bandwidth and small parasitic capacitance. The photodetector performance is measured by the bandwidth efficiency product. Conventional III-V PIN devices suffer from two main limitations. On one hand, their relatively high capacitance per unit area leads to limitations in the design of the transconductance amplifier interface circuit. On the other hand, due to its vertical structure, there is a tradeoff between its frequency performance and its efficiency (the quantum efficiency increases and the bandwidth decreases with the absorption intrinsic layer thickness) [9]. Metal-semiconductor-metal (MSM) photodetectors offer an alternative over conventional PIN photodetectors. An MSM photodetector consists of interdig- itated metal contacts on top of an absorption layer. Because of their lateral structure, MSM photodetectors have very high bandwidths due to their low capacitance and the possibility to reduce the carrier transit time. However, the responsivity is usually low compared to PIN photodetectors [4]. MSM photodiodes with bandwidth greater than 100GHz have been reported.

2.3.3 Waveguides Optical waveguides are at the heart of the optical interconnect concept. In the Si/SiO2 approach, the high relative refractive index difference ∆= 2 2 2 (n1 − n2)/2n1 between the core (n1 ≈ 3.5 for Si) and cladding (n2 ≈ 1.5 for SiO2) allows the realisation of a compact optical circuit with dimensions com- patible with DSM technologies. For example, it is possible to realise monomode waveguides less than 1µm wide (waveguide width of 0.3µm for wavelengths of 1.55µm), with bend radii of the order of a few µm [15]. However, the performance of the complete optical system depends on the minimum optical power required by the receiver and on the efficiency of passive optical devices used in the system. The total loss in any optical link is the sum TLFeBOOK

26 of losses (in decibels) of all optical components:

Ltotal = LCV + LW + LB + LY + LCR (2.1) where

LCV is the coupling coefficient between the photonic source and optical waveguide. There are currently several methods to couple the beam emitted from the laser into the optical waveguide. In this analysis we assumed 50% coupling efficiency LCV from the source to a single mode waveguide. LW is the rectangular waveguide transmission loss per unit distance of the optical power. Due to small waveguide dimensions and large in- dex change at the core/cladding interface in the Si/SiO2 waveguide the side-wall scattering is the dominant source of loss (fig. 2.3a). For the waveguide fabricated by Lee [10] with roughness of 2nm the calculated transmission loss is 1.3dB/cm. LB is the bending loss, highly dependent on the refractive index difference ∆ between the core and cladding medium. In Si/SiO2 waveguides, ∆ is relatively high and so due to this strong optical confinement, bend radii as small as a few µm may be realised. As can be seen from fig. 2.3b, the bending losses associated with a single mode strip waveguide are negligible if the radius of curvature is larger then 3µm. LY is the Y-coupler loss, and depends on the reflection and scattering attenuation into the propagation path and surrounding medium. For high index difference waveguides the losses for the Y-branch are significantly smaller than for low ∆ structures and the simulated losses are less then 0.2dB per split [14]. LCR is the coupling loss from the waveguide to the optical receiver. Using currently available materials and methods it is possible to achieve an almost 100% coupling efficiency from waveguide to optical receiver. In this analysis the coupling efficiency is assumed to be 87% (LCR = 0.6dB) [16].

2.3.4 Interface circuits High-speed CMOS optoelectronic interface circuits are crucial building blocks to the optical interconnect approach. The electrical power dissipation of the link is defined by these circuits, but it is the receiver circuit that poses the most serious design challenges. The power dissipated by the source driver is mainly determined by the source bias current and is therefore device-dependent. On the receiver side however, most of the receiver power is due to the circuit, while only a small fraction is required for the photodetector device. TLFeBOOK

27

60 100

1 50 0.01 40 0.0001

30 1e-06

1e-08 20

Pure bending loss (dB) 1e-10 Transmission loss (dB/cm) 10 1e-12

0 1e-14 1 2 3 4 5 6 7 8 9 10 11 12 2 3 4 5 6 7 8 9 Sidewall roughness (nm) Bend radius (um)

Figure 2.3a. Simulated transmission loss Figure 2.3b. Simulated pure bending loss for varying sidewall roughness in a for various bend radii in a 0.5µm× 0.2µm 0.5µm× 0.2µm Si/SiO2 strip waveguide Si/SiO2 strip waveguide

2.3.4.1 Driver circuits. Source driver circuits generally use a current modulation scheme for high-speed operation. The source always has to be biased above its threshold current by a MOS current sink to eliminate turn-on delays, which is why low-threshold sources are so important (figures of the order of 40µA [7] have been reported). A switched current sink modulates the current flowing through the source, and consequently the output optical power injected into the waveguide. As with most current-mode circuits, high bandwidth can be achieved since the voltage over the source is held relatively constant and parasitic capacitances at this node have reduced influence on the speed.

2.3.4.2 Receiver circuits. A typical structure for a high-speed pho- toreceiver circuit consists of: a transimpedance amplifier (TIA) to convert the photocurrent of a few µA into a voltage of a few mV; a comparator to gener- ate a rail-to-rail signal; and a data recovery circuit to eliminate jitter from the restored signal. Of these, the TIA is arguably the most critical component for high-speed performance, since it has to cope with a generally large photodiode capacitance situated at its input. The basic transimpedance amplifier structure in a typical configuration is shown in fig. 2.4 [8]. The bandwidth/power ratio of this structure can be max- imised by using small-signal analysis and mapping of the individual component values to a filter approximation of Butterworth type. It is then possible to develop a synthesis procedure which, from desired transimpedance performance criteria (gain Zg0, bandwidth and pole quality factor Q) and operating conditions (photodiode and load capacitances, Cd and Cl respectively) generates component values for the feedback resistance Rf and the voltage amplifer (voltage gain Av and output resistance Ro). Circuits with high Ro/Av ratio (≈ 1/ gm) require the least quiescent current and area and this quantity constitutes therefore an important figure of merit in design space TLFeBOOK

28

Rf

ω 1 1 + Av I i 0 = R o Cy M (M + M (1 + M )) −Av f x m x

M (M + M (1 + M ))(1 + A ) Q = f x m x v Cd Cl V 1 + M x (1 + M f ) + M m M f (1 + A v ) Vdd o

M M 2 3 R f − Rov /A M f = R f / Ro Z g0 = − V V M = C x / C i o ( 1 + 1/A ) i y Cm v M = C / C M1 mym Ci Co

C x = C di + C C y = C l + Co

Figure 2.4. CMOS transimpedance amplifier structure

exploration (fig. 2.5a). To reach a sized transistor-level circuit, approximate equations for the small-signal characteristics and bias conditions of the circuit are sufficient to allow a first-cut sizing of the amplifier, which can then be fine- tuned by numerical or manual optimisation, using simulation for exact results. The complete process is described in [13].

Amplifier Ro/Av requirement Ci=500fF Cl=100fF 1THzohm Transimpedance amplifier characteristics against technology node Ro/Av Cd = 400fF, Cl = 150fF 300 250 100 200 Area / um2 150 Quiescent power / 100uW 100 400 50 350 300 250 10 200 150 100 50 0 1

1 10000

3 3000 Bandwidth Transimpedance requirement gain 0.1 /GHz 10 1000 requirement 350 180 130 100 70 45 /ohms Technology node (nm)

Figure 2.5a. TIA Ro/Av design Figure 2.5b. Evolution of TIA character- space with varying bandwidth and istics (power, area, noise) with technology transimpedance gain requirements node

Using this methodology with industrial transistor models for technology nodes from 350nm to 180nm and predictive BSIM3v3/BSIM4 models for tech- nology nodes from 130nm down to 45nm [3], we generated design parameters for 1THzΩ transimpedance amplifiers to evaluate the evolution in critical char- acteristics with technology node. Fig. 2.5b shows the results of transistor level simulation of fully generated photoreceiver circuits at each technology node. TLFeBOOK

29 2.4 QUANTITATIVE POWER COMPARISON BETWEEN ELECTRICAL AND OPTICAL CLOCK DISTRIBUTION NETWORKS 2.4.1 Design methodology In an optical link there are two main sources of electrical power dissipation: (i) power dissipated by the optical receiver(s) and (ii) energy needed by the optical source(s) to provide the required optical output power. To estimate the electrical power dissipated in the system we developed the methodology shown in fig. 2.6.

BER specification photodiode characteristics (SNR requirement) (R,C ,I ) d dark

minimum optical transimpedance power at receiver amplifier

losses in passive minimum optical waveguide network power at source

source emitter receiver efficiency power power

electrical power dissipated in optical system

Figure 2.6. Methodology used to estimate the electrical power dissipation in an optical clock distribution network

The first criterion for defining the performance of the optoelectronic link is the required signal transmission quality, represented by the bit error rate (BER) and directly linked to the photoreceiver signal to noise ratio. For an on-chip interconnect network, a BER of 10−15 is acceptable. To calculate the required signal power at the receiver, the characteristics of the receiver circuit have to be extracted from the transistor-level schematic, which is generated from the photodetector characteristics (responsivity R, Cd, dark current Idark) and from the required operating frequency using the method described in section 2.3. For the given BER and for the noise signal associated with the photodiode and transimpedance circuit the minimum optical power required by the receiver to operate at the given error probability can be calculated using the Morikuni formula [12]. With this figure, and knowing the layout and therefore the optical losses that will be incurred in the waveguides, the minimum required optical power at the source can be estimated. The total electrical power dissipated in the optical TLFeBOOK

30 link is the sum of the power dissipated by the number of optical receivers and the energy needed by the source to provide the required optical power. The electrical power dissipated by the receivers can be extracted from transistor- level simulations. To estimate the energy needed by the optical source, laser light-current characteristics given by Amann [1] were used.

2.4.2 Design performance Our aim in this work was to quantitatively compare the power dissipation in electrical and optical clock distribution networks for a number of cases, in- cluding technology node prediction. For both electrical and optical cases we used technology parameters from the ITRS roadmap (wire geometry, material parameters). For transistor models we used predictive model parameters from Berkeley (BSIM3V3 down to 70nm and BSIM4 down to 45nm). The power dissipated in the electrical system can be attributed to the charging and discharg- ing of the wiring and load capacitance and to the static power dissipated by the buffers. In order to calculate the power we used an internally developed simu- lator, which allows us to model and calculate the electrical parameters of clock networks for future technology nodes [18]. For optical performance predictions we used existing technology characteristics while for the optoelectronic devices we took datasheets from two real devices and used these figures. The power dissipated in clock distribution networks was analysed in both systems at the 70nm technology node. Power dissipation figures for electrical and optical CDNs were calculated based on the system performance summarised in tables 2.1a and 2.1b.

Table 2.1a. Electrical CDN characteris- Table 2.1b. Optical CDN characteristics tics Optical system parameter Electrical system parameter Wavelength λ (nm) 1550 Technology (nm) 70 Waveguide core index (Si) 3.47 Vdd (V) 0.9 Waveguide cladding index (SiO2)1.44 Tox (nm) 1.6 Waveguide thickness (µm) 0.2 2 Chip size (mm ) 400 Waveguide width (µm) 0.5 Global wire width (µm) 1 Transmission loss (dB/cm) 1.3 Metal resistivity (Ω-cm) 2.2 Loss per Y-junction (dB) 0.2 Dielectric constant 3 Input coupling coefficient (%) 50 Optimal segment length (mm) 1.7 Photodiode capacitance (fF) 100 Optimal buffer size (µm) 90 Photodiode responsivity (A/W) 0.95

What follows is the results of comparisons of the power dissipation in elec- trical and optical clock distribution networks. This was quantitatively carried out for varying chip size, operating frequency, number of clock distribution points, technology node, and finally sidewall roughness. This latter perfor- TLFeBOOK

31 mance characteristic is the only non system-driven characteristic, but it gives some important design information to technology groups working on optical interconnect. Fig. 2.7a shows a power comparison where we vary square die size from 10 to 37 mm width. This analysis was carried out for the 70nm node at a distribution frequency of 5.6GHz (which is the clock frequency associated with this node) and 256 drop points. Electrical CDN power rises almost linearly with die size, which is understandable since the line lengths increase and therefore require more buffers to drive them. Optical CDN power rises much more slowly since all that is really changing is transmission loss and this has a quite minor effect on the overall power dissipation. When we vary clock frequency for constant chip width, fig. 2.7b we observe a similar effect for the electrical CDN. Again, the number of buffers has to increase since the segment lengths have to be reduced in order to attain the lower RC time constants. For the optical CDN, what is changing is the receiver power dissipation. The transimpedance amplifier requires a lower output resistance in order to operate at higher frequencies and this translates to a higher bias current. In fig. 2.7c, we vary the number of drop points and see that both electrical and optical CDN power dissipation rises, but optical rises much faster than electrical. There are two reasons for this: firstly, every time the number of drop points is doubled, so is the number of receivers and this accounts for a large part of the power dissipation; secondly, the number of splitters is doubled, which in turn means that the power at emission also has to be doubled. These two factors cause the optical power to catch up with the electrical power at around 4000 drop points. Fig. 2.7e shows a comparison for varying technology node. Not only the technology is changing here, we are also changing the clock frequency asso- ciated with the node. We can see that at the 70nm node there is a five-fold difference between electrical and optical clock distribution. As the technology node advances, this difference becomes even more marked. A final analysis, fig. 2.7f, shows how technological advances are required to improve system performance, concerning in this case waveguide sidewall roughness. 5nm roughness translates to a transmission loss of around 8dB/cm, which in turn corresponds to a power dissipation figure of around 500mW for the 70nm node at 5.6GHz and 20mm chip width. Looking at the 2nm roughness point, achieved at MIT [10] and corresponding to a transmission loss of 1.3dB/cm, we obtain a power dissipation figure of about 10mW, a fifty- fold decrease in the overall power dissipation by going from 5nm roughness to 2nm roughness. This demonstrates the importance of optimising the passive waveguide technology for the whole system. TLFeBOOK

32

1600 1400 Electrical CDN Optical CDN Electrical CDN 256 1400 1200 Optical CDN 256 Electrical CDN 128 Optical CDN 128 1200 1000

1000 800

800 600

600

Power dissipation (mW) Power dissipation (mW) 400

400 200

200 0 100 300 500 700 900 1 3 5 7 Die size (mm2) Clock frequency (GHz)

Figure 2.7a. Comparison of power dissi- Figure 2.7b. Comparison of power dissi- pation in electrical and optical clock dis- pation in electrical and optical clock dis- tribution networks for varying chip size tribution networks for varying clock fre- (70nm technology, 5.6GHz, 256 drop quency (70nm technology, 400mm2, 256 points) drop points) 10000 30

Electrical CDN 20 Optical CDN 1000 10

0

100 -10

-20

Power dissipation (mW) 10 -30 Power gain optical/electrical (%) -40

1 -50 4 32 256 2048 8172 4 32 256 2048 8172 Number of drop points (nodes) Number of drop points (nodes)

Figure 2.7c. Comparison of power dissi- Figure 2.7d. Comparison of power dissi- pation in electrical and optical clock dis- pation in electrical and optical clock dis- tribution networks for varying number of tribution networks for varying number of drop points (70nm technology, 5.6GHz, drop points (70nm technology, 5.6GHz, 400mm2) 400mm2) 1200

Electrical CDN 1000 Optical CDN 256 Optical CDN 1000 Optical CDN 128

800

100 600

400

Power dissipation (mW) Power dissipation (mW) 10 200

0 130 100 70 45 1 3 5 7 9 Technology node (nm) Waveguide transmission loss (dB/cm)

Figure 2.7e. Comparison of power dissi- Figure 2.7f. Evaluation of power dissipa- pation in electrical and optical clock dis- tion in optical clock distribution networks tribution networks for varying technology for varying waveguide sidewall roughness nodes (70nm technology, 5.6GHz, 400mm2)

For a BER of 10−15 the minimal power required by the receiver is -22.3dBm (at 3GHz). Losses incurred by passive components for various nodes in the H-tree are summarised in table 2.2. TLFeBOOK

33

Table 2.2. Optical power budget for 20mm die width at 3GHz

Number of nodes in H-tree 16 32 64 128 Loss in straight lines (dB) 1.3 1.3 1.3 1.3 Loss in curved lines (dB) 1.53 1.66 1.78 1.85 Loss in Y-dividers (dB) 12 15 18 21 Loss in Y-couplers (dB) 0.8 1 1.2 1.4 Output coupling loss (dB) 0.6 0.6 0.6 0.6 Input coupling loss (dB) 3 3 3 3 Total optical loss (dB) 19.2 22.5 25.8 29.1 Min. receiver power (dBm) -22.3 -22.3 -22.3 -22.3 Laser optical power (mW) 0.5 1.1 2.30 4.85

We can conclude from this analysis that power dissipation in optical clock distribution networks is lower than that of electrical clock distribution networks, by a factor of five for example at the 70nm technology node. This factor will in the future become larger due to two reasons: firstly due to improvements in opti- cal fabrication technology; and secondly with the rise in operating frequencies. However, this figure is probably not sufficient to convince semiconductor manu- facturers to introduce such large technological and methodological changes for this application. To improve the figure, weak points can be identified for each main part of an integrated optical link. For the source, the efficiency between electrical and optical power conversion is relatively low. This needs to be im- proved and one area is possibly in integrated microsources. For the waveguide structures, most of the losses need to be improved, especially transmission loss and coupling loss. Sidewall roughness especially has a direct and considerable impact on the power dissipation in the global system. Finally at the receiver end, the transimpedance amplifier power dissipation is too high. Better circuit structures must be devised, or the photodetector parasitic capacitance needs to be reduced.

2.5 OPTICAL NETWORK ON CHIP In current SoC architectures, global data throughput between functional blocks can reach up to tens of gigabits per second, the load being shared by several communication buses. In the future the constraints acting on such data exchange networks will continue to increase: the number of IP blocks in an integrated system could be as high as several hundred and the global throughput could reach the Tb/s scale. To provide this level of performance, the communi- cation system itself is designed as an IP block into which the various functional units will be connected. This type of standardised hardware communication architecture is called a network on a chip (NoC). TLFeBOOK

34 Using wavelength division multiplexing (WDM) techniques, photonics and optoelectronics may offer new solutions to realise reconfigurable optical net- works on chip (ONoC). An ONoC, as an electronic router with routing based on wavelength λ, is actually a circuit-switching based topology and can thus ensure data exchanges between IP blocks with very low contention. The advantages of using an optical network are many: independence of interconnect perfor- mance from distance and data rate, crosstalk reduction, connectivity increase, interconnect power dissipation reduction, increase in the size of isochronous tiles, use of communication protocols. Figure 2.8 shows a 4 × 4 ONoC with all electronic interfaces: photodetector and laser in III-V technology and optical network in SOI technology, using similar heterogeneous integration techniques as described in section 2.2. Intellectual property (IP) blocks shown can be pro- cessor cores, memory blocks, functional units etc. with standard interfaces to the communication network. This is a multi-domain device with high speed optoelectronic circuits (modulation of the laser current and photodetectors) and passive optics (waveguides and passive filters). In the figure, M are masters (processor, IP, ...) which can communicate with targets T (memory, ...). The network is comprised of 4 stages, each associated with a single resonant wave- length. The operation of the 4×4 network is summarised in the table of figure 2.3. This system is a fully passive circuit-switching network based on wave- length routing and is a non-blocking network. From Mi to Tj, there exists only one physical path associated with one wavelength. At any one time, single- wavelength emitters can make 4 connections and multi-wavelength emitters can make 12 connections. The network is in principle scalable to an infinite number of connections. In practice, this number is severely limited by lithog- raphy and etching precision. For a 5nm tolerance on the size of the microdisk, corresponding to state of the art CMOS process technology, the maximum size of the network is 8 × 8.

Table 2.3. Truth table for optical network on chip T1 T2 T3 T4 M1 λ2 λ3 λ1 λ4 M2 λ3 λ4 λ2 λ1 M3 λ1 λ2 λ4 λ3 M4 λ4 λ1 λ3 λ2

The basic element of the network is an optical filter, described in the next section. The ports 1 − 4 correspond to inputs/outputs of the optical filter. Its operation is the same as an electronic cross-bar: the cross function (output in 4) is activated when the injected wavelength in 1 does not correspond to a resonant ring wavelength and the bar function is activated (output in 3) when the injected wavelength in 1 corresponds to a resonant ring wavelength. TLFeBOOK

35

IPM1 M1 T1 IPT1

1 3 IPM2 M2 T2 IPT2

2 4 IPM3 M3 T3 IPT3

1 3 IPM4 M4 T4 IPT4

#1 #3 master master passive optical i i = n target target IP blocks interface network on chip interface IP blocks (driver, (detector, elementary optical n laser) filter operation receiver) #2 #4 i = n

Figure 2.8. Architecture of 4x4 optical network on chip

Operation is symmetrical: the same phenomena happens if the wavelength injection is placed in the port 4.

2.5.1 Microresonators Microring resonators are ideal device candidates for integrated photonic cir- cuits. Because they render possible the addition or extraction of signals from a waveguide based on wavelength in a WDM flow, they can be considered as basic building blocks to build complex communication networks. The use of standard SOI technology leads to high compactness (structures with radii as small as 4µm have been reported) and the possibility of low-cost photonic in- tegration. Figure 2.9 shows the structure of an elementary add-drop filter based on microring resonators. The size of the structure is typically a few hundred µm2. It consists of two identical disks evanescently side-coupled to two signal waveguides which are crossed at near right angles to facilitate signal direc- tivity. The microdisks make up a selective structure: the electromagnetic field propagates in the rings for discrete propagation modes corresponding to specific wavelengths. The resonant wavelengths depend on geometric and structural pa- rameters (indices of the substrate and of the microrings, thickness and diameter of the disks). The basic function of a microresonator can be thought of as a wave-length- controlled switching function. If the wavelength of an optical signal passing through a waveguide in proximity to the resonator (for example injected at port 1) is close enough to a resonant wavelength λ1 (tolerance is of the order of a few nm, depending on the coupling strength between the disk and the waveguide), then the electromagnetic field is coupled into the microrings and then out along the second waveguide (in the example, the optical signal is transmitted to the TLFeBOOK

36

#1 #3 = r #3 #4

#1 r i 10um #2 i r = r #2 r #4

30um

Figure 2.9. Micro-disk realisation of an add-drop filter output port 3, as shown in fig. 2.10a). If the wavelength of the optical signal does not correspond to the resonant wavelength, then the electromagnetic field continues to propagate along the waveguide and not through the structure (in the example, the optical signal would then be transmitted to the output port 4, as shown in fig. 2.10b). This device thus operates as an elementary router, the behaviour of which is summarised in the table in fig. 2.9.

Figure 2.10a. FDTD simulation of add- Figure 2.10b. FDTD simulation of add- drop filter in on-state drop filter in off-state

First structures have been realised and preliminary results are promising. Fig. 2.11a shows an IR photograph of the structure in the cross state (top) and in the bar state (bottom), while fig. 2.11b represents the transmission coefficient on the cross output: the transmitted power on the cross output reaches 100% for wavelengths corresponding to the resonant frequencies of the microdisk.

2.5.2 Microsource lasers From the viewpoint of mode field confinement and mirror reflection, mi- crodisk lasers operate on the principle of total internal reflection, as opposed to multiple reflection, as is the case in VCSELs for example. This fact gives this type of source two distinct advantages over VCSELs for on-chip optical inter- connect. Firstly, light emission is in-plane (as opposed to vertical), meaning TLFeBOOK

37

Figure 2.11a. Infra-red photograph of Figure 2.11b. Transmission coefficient on structure in both cross (top) and bar (bot- cross output for varying wavelength tom) states that emitted light can be injected directly into a waveguide with minimum loss [6]. Secondly, for communication schemes requiring multiple wavelengths, it is easier from a technological point of view to control the radius of such a device than it is to control the thickness of an air gap in a VCSEL. In any case such devices, to be compatible with dense photonic integration, must satisfy the requirements of small volume and high optical confinement, with low threshold current and emitting in the 1.3-1.6µm range. Although these devices are not as mature as VCSELs, they seem extremely promising for optical interconnect applications. An overview of microcavity semiconductor lasers can be found in [2].

2.5.3 Demonstration of principle Behavioural models enable us to verify the operation of the 4 × 4 ONoC at high level simulation. An injection of 4 wavelengths is realised (λ1, λ2, λ3, and λ4) at the port 1 at the same moment (shown in figure 2.12). The input signal format is a matrix. Figure 2.12 is a 3-dimensional representation with wavelength on the X-axis (representing the 4 channels), time on the Y-axis and power (normalised) on the vertical axis. Each injected wavelength has two pulses (Gaussian) in time. The behavioural simulation analyses the 4 outputs T1,T2,T3 and T4 (T2 shown in fig. 2.12). As predicted in table 2.3, only λ3 is detected at the output T2.

Figure 2.12. Simulation of 4x4 optical network on chip TLFeBOOK

38 2.6 CONCLUSION Integrated optical interconnect is one potential technological solution to al- leviate some of the more pressing issues involved in moving volumes of data between circuit blocks on integrated circuits. In this chapter, we have shown how novel integrated photonic devices can be fabricated above standard CMOS ICs, designed concurrently with EDA tools and used in clock distribution and NoC applications. The feasibility of on-chip optical interconnect is no longer really in doubt. We have given some partial results to quantitatively demon- strate the advantages of optical clock distribution. Although lower power can be achieved (of the order of a five-fold decrease), more work is required to explore new solutions that benefit from advances both at the architectural and at the technological level. Also the existing basic building blocks need to be integrated together to physically demonstrate on-chip optical links. Research is well under way in several research groups around the world to do this. Looking further ahead, the use of multiple wavelengths in on-chip communication net- works and in reconfigurable computing is an extremely promising and exciting field of research.

References

[1] M. Amann, M. Ortsiefer, and R. Shau: 2002, ‘Surface-emitting Laser Diodes for Telecommunications’. In: Proc. Symp. Opto- and Microelec- tronic Devices and Circuits. [2] T. Baba: 1997, ‘Photonic Crystals and Microdisk Cavities Based on GaInAsP-InP System’. IEEE J. Selected Topics in Quantum Electronics 3. [3] Y. Cao, T. Sato, D. Sylvester, M. Orchansky, and C. Hu: 2000, ‘New Paradigm of Predictive MOSFET and Interconnect Modeling for Early Cir- cuit Design’. In: Proc. Custom Integrated Circuit Conference. [4] S. Cho et al.: 2002, ‘Integrated detectors for embedded optical interconnec- tions on electrical boards, modules and integrated circuits’. IEEE J. Sel. Topics in Quantum Electronics 8. [5] A. Filios et al.: 2003, ‘Transmission performance of a 1.5-µm 2.5-Gb/s directly modulated tunable VCSEL’. IEEE Phot. Tech. Lett. 15. [6] M. Fujita, A. Sakai, and T. Baba: 1999, ‘Ultrasmall and ultralow threshold GaInAsP-InP microdisk injection lasers: Design, fabrication, lasing charac- teristics and spontaneous emission factor’. IEEE J. Sel. Topics in Quantum Electronics 5. [7] M. Fujita, R. Ushigome, and T. Baba: 2000, ‘Continuous wave lasing in GaInAsP microdisk injection laser with threshold current of 40µA’. IEE Electron. Lett. 36. TLFeBOOK

39 [8] M. Ingels and M. S. J. Steyaert: 1999, ‘A 1-Gb/s, 0.7µm CMOS Optical Receiver with Full Rail-to-Rail Output Swing’. IEEE J. Solid-State Circuits 34(7). [9] I. Kimukin et al.: 2002, ‘InGaAs-Based High-Performance p-i-n Photodi- odes’. IEEE Phot. Tech. Lett. 26(3). [10] K. Lee et al.: 2001, ‘Fabrication of ultralow-loss Si/SiO2 waveguides by roughness reduction’. Optics Letters 26. [11] J. Liu et al.: 2002, ‘Ultralow-threshold sapphire substrate-bonded top- emitting 850-nm VCSEL array’. IEEE Phot. Lett. 14. [12] J. Morikuni et al.: 1994, ‘Improvements to the standard theory for pho- toreceiver noise’. IEEE J. Lightwave Technology 12. [13] I. O’Connor, F. Mieyeville, F. Tissafi-Drissi, G. Tosik, and F. Gaffiot: 2003, ‘Predictive design space exploration of maximum bandwidth CMOS photoreceiver preamplifiers’. In: Proc. IEEE International Conference on Electronics, Circuits and Systems. [14] A. Sakai, T. Fukazawa, and T. Baba: 2002, ‘Low Loss Ultra-Small Branches in a Silicon Photonic Wire Waveguide’. IEICE Tran. Electron. E85-C. [15] A. Sakai, G. Hara, and T. Baba: 2001, ‘Propagation Characteristics of Ultrahigh-∆ Optical Waveguide on Silicon-on-Insulator Substrate’. Jpn. J. Appl. Phys. – Part 2 40. [16] S. Schultz, E. Glytsis, and T. Gaylord: 2000, ‘Design, Fabrication, and Performance of Preferential-Order Volume Grating Waveguide Couplers’. Applied Optics-IP 39. [17] Semiconductor Industry Association: 2003, ‘International Technology Roadmap for Semiconductors’. [18] G. Tosik, F. Gaffiot, Z. Lisik, I. O’Connor, and F. Tissafi-Drissi: 2004, ‘Power dissipation in optical and metallic clock distribution networks in new VLSI technologies’. IEE Elec. Lett. 4(3). TLFeBOOK

40

Chapter 3 NANOTECHNOLOGIES FOR LOW POWER

Jacques Gautier CEA-DRT – LETI/D2NT – CEA/GRE

Abstract The conventional approach to improve the performance of circuits is to scale down the devices and technologies. This is also convenient to lower the power consumption per function. In this chapter, we overview the potential of nanotechnologies for this purpose, with emphasis on few-electron devices in the case of room-temperature operation. Other devices, especially carbon nanotube transistors, resonant tunnelling diodes and quantum cellular automata, are briefly discussed.

Keywords: nanotechnologies; Single Electron Transistor; SET; molecular electronics; RTD; QCA; low power; Coulomb blockade

3.1 INTRODUCTION

In addition to packing-density increase and speed improvement, the downscaling of technologies comes with a reduction of the power consumption per function. However this gain is offset by the tremendous increase in the number of transistors per chip. A possible solution is to go further towards nano-scale devices where a lower amount of charge is needed to code a bit. This is the basis of what is known as single electronics. The use of molecules could be a realistic way to fabricate these tiny devices and other useful nanostructures. In this chapter we overview the potential of nanodevices for low power electronics with emphasis on few-electron electronics in the case of room- temperature (RT) operation. Other devices, especially carbon nanotube transistors, resonant tunnelling diodes (RTD) and quantum cellular automata (QCA), are briefly discussed. TLFeBOOK

41 3.2 SINGLE ELECTRONICS

In CMOS circuits, the total power consumption is the sum of the dynamic power and of the contribution of leakages. For advanced technology generations the later is rapidly rising, but it is still less than the former. So, we will focus on this dynamic power consumption which is given by the usual expression

()2 ⋅⋅+⋅⋅= d gate gate inter DD fVCCNaP c (1)

where a is the activity factor, Ngate is the amount of gates, (Cgate + Cinter) is the load capacitance, gate and interconnect contributions, and f is the clock frequency. This equation shows that the power is proportional to the amount of charge in transistors and interconnects for coding a bit of information. For dense circuits with local interconnects, the dominant contribution is usually the one related to the gate capacitance of transistors which can also be expressed as Pd=a.Ngate.Q.VDD.fc, where Q is the channel charge. So there is a strong motivation to reduce it for power saving. This is currently obtained by the downscaling of technologies. From the extrapolation of the historical trend and from the ITRS roadmap anticipation[1], we can expect a value of only 10-20 electrons for sub-10nm . This is much less than the hundreds to thousands of electrons present in current devices. Is it possible to go still further, towards only one electron, using what is called a single electron transistor or SET [2]? That would be advantageous for power consumption, knowing that the reduction of power per function due to the scaling is more or less balanced by the tremendous increase of the number of transistors per chip. However this gain would be effective only if the capacitances of interconnects are not too large. Another factor in expression (1) is the electrostatic potential at which the charge Q is brought. At present, there is a strong incentive for reducing it. Whereas the supply voltage of current high performance circuits is in the range 1.2-1.8V, operation at only 0.3V on experimental circuits has already been demonstrated [3], which is close to the bottom limit anticipated by the ITRS. For a lower value the device is not in well defined On or Off states which results in either leakage or poor performance. What can be expected from SET's ? Before giving an answer to this question, their properties and modes of operation are briefly recalled.

3.2.1 Background on single electron transistors

A SET is a device which comprises a Source and a Drain reservoir of electrons and a control gate, like MOSFET's. In between, there is an island TLFeBOOK

42 where carriers should be confined [2] (see Fig. 3-1). A common solution to obtain this effect is to insert tunnelling or potential barriers between the reservoirs and the island. This is the main structural difference from MOSFET's, but it is essential for the operation of SET's. Due to this confinement, there is always an integer number of electron in the island. However, the probability to have a given amount of charge is a continuous function of the device bias, such that there is also a continuous variation of the average charge versus the external bias.

RT>>RQ

RT RT S D Cj Cj

island G Cg

Vg

Figure 3-1. Schematics of a SET

Provided that just one electron more or less has a significant effect on the electrostatic energy of the device, it is shown that, for a given device bias, there are limited possible states of charge in the island [2]. Especially, there are bias domains for which only one state of charge is possible. In this case, there is no exchange of charge with the electron reservoirs and the device is in Off state. This is the Coulomb blockade effect. For the other cases, the number of electron oscillates between the highest probable states of charge leading to a flux of carriers between source and drain. For instance, when the two states n and n+1 are possible, the current is due to the repetition of the sequence: one electron coming from the source to the island then leaving the island to the drain. As shown in Fig. 3-2, the electrical characteristics of SET's are very different from those of MOSFET's. The ID(VG) curves have periodic oscillations of current and the output characteristics look like a resistance (or staircase for non-symmetrical device) with a low drain voltage domain where the device is periodically Off and On as a function of VG. The period of Coulomb Blockade Oscillations, CBO, is given by e/Cg. Between two successive oscillations, the only difference is that the average number of TLFeBOOK

43 electron in the island is incremented or decremented by one. At a peak of current, two dominant states of charge have equal probability and, on the average, there is a half integer number of electron in the island.

2 10-7 1,2 10-7 )

A( ID (A) 1 10-7 DI 1,5 10-7 VD=0.4V 8 10-8

-7 -8 1 10 6 10 200mV 4 10-8 Vg=0.45V 5 10-8 100mV 2 10-8 Vg=0.1V 20mV 0 0 00,511,52 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 Vg (V) Vd (V)

Figure 3-2. Typical ID(VG) and ID(VD) characteristics of a SET. They have been obtained by simulation with the following parameters: Cj=0.1aF, Cg=0.2aF, RT=1MΩ, T=300K

To observe the previous typical characteristics, there are two important conditions to meet. Firstly, the charging energy, which is the electrostatic energy increase due to the arrival of one electron in the island, should be large in comparison to the thermal energy kT:

2 = e >> (2) EC kT 2CΣ

where e is the electron charge (absolute value), CΣ is the total capacitance of the island, CΣ=2Cj+Cg, where Cj is the junction capacitance and Cg is the gate to island capacitance. For room-temperature operation, CΣ should be less than 0.3 aF (Ec=10 kT, T=300K), which requires an island smaller than a few nm. The second condition is related to the confinement of the electron wave function in the island, which is essential to quantize the charge in this island. The resistance of the tunnel barriers should exceed the quantum 2 resistance RK=h/e ~25.8 kΩ. For the fabrication of SET’s, there are many different possibilities since any kind of conductive material can be used for the island, metallic as well as semiconductor and even molecular. However silicon is advantageous for CMOS compatibility and also for the stability of devices [4]. TLFeBOOK

44

3.2.2 Designing a low VDD inverter

With regard to the power consumption of digital circuits, we consider in this part the case of a simple inverter, since this is a convenient reference to make comparisons with CMOS. The design of a SET inverter has been discussed by many authors [5,6,7].They pointed out that, since there is only one kind of SET, the complementary action of the pull-up and pull-down devices is not as easy to obtain as in CMOS where two types of transistor exist. A first solution is to choose the supply voltage in order that both of these devices are On or Off in a complementary way in the switching part of the transfer characteristic. An example of such situation is shown in Fig. 3-3. The shaded area displays the Coulomb blockage domains of the pull-up and pull-down transistors at zero temperature. Based on that, the transfer characteristics has been schematically drawn. Contrary to CMOS, we can observe that the voltage swing is less than rail-to-rail and that the DC current is minimal at the transition point.

Vout (V) 1

0,8

VDD 0,6

0,4

0,2

0 1,00 0,00 0,20 0,40 0,60 0,80 -0,2

-0,4

-0,6 Vin (V)

Figure 3-3. Theoretical Coulomb blockade domains, also known as Coulomb diamonds, (shaded areas) at 0K, for the pull-down and pull-up SET's of an inverter. At RT they are a little narrower. Cj=0.1aF, Cg=0.2aF, VDD=0.53V. The bold line is a drawing of the transfer characteristics. TLFeBOOK

45

Since a low VDD is advantageous for low power applications, we discuss now the possibility to minimize it for this simple SET inverter, taking account of the design constraints and aiming room-temperature operation: • Cg + 2.Cj < 0.3aF for RT operation (for Ec~10kT) • Cg / Cj > 1 for voltage gain • VDD = e / (Cg + Cj) for complementary action of transistors As a result, a very low VDD and RT operation would be difficult to achieve simultaneously. In fact, with the previous equations and for a ratio of gate to junction capacitances of 2, the minimum VDD would be equal to 0.7V ! However, for temperatures above 0K, the switching of the SET from Off to On state is not abrupt since there is an exponential variation of the current, equivalent to the subthreshold current of MOSFET’s. Consequently, the real Coulomb blockade diamonds are narrower than those shown in Fig. 3-3 and it is possible to reduce VDD. This is demonstrated in Fig. 3-4, where the DC voltage gain and the DC current at the transition point of an inverter have been plotted versus VDD. Note also that the constraint on CΣ has been a little relaxed. As thoroughly discussed by A. Korotkov [6], the acceptable VDD window is quite narrow. A too low VDD value would be detrimental for the noise margin and for the speed since the DC current at the transition point is exponentially decreasing with VDD. On the contrary, a higher value would increase the power consumption.

1,6 10-7 Imin (A) Imin

1,4

Voltage gain 1,2 10-8 V =e/(Cj+Cg) 1 DD

0,8 -9 Cj=0.1aF 10 0,6 Cg=0.2aF R =1MΩ 0,4 T T=300K 0,2 10-10 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 V (V) DD

Figure 3-4. DC voltage gain (solid line) and DC current (dashed line) at the transition point of a SET inverter at room temperature. TLFeBOOK

46

To go further in reducing VDD, a solution is to add control gates to each SET (Fig. 3-5). Based on this approach, NTT has demonstrated a quasi- CMOS operation inverter at a supply voltage as low as 20 mV8, which is very advantageous for the power consumption, but in this case the temperature was only 27K. The bias of the control gates shifts the CBO, making possible to select the optimal part of the ID(VG) characteristics of each SET for complementary action. In this way, the equivalent of two types of transistors can be obtained, like for CMOS. In addition, their equivalent threshold voltages can be tuned, balancing the influence of eventual parasitic (background) charges in the neighbourhood of the SET:

C gc ∆−=∆ (3) Vg Vgc Cg

To get a symmetrical transfer characteristic, from the Coulomb diamonds, it can be easily demonstrated that the sum of the control gate voltages should be equal to VDD:

=+ gcss gcd d VVV DD (4)

As a result there is one more degree of freedom to design an inverter, in comparison to the case without control gates. That gives flexibility to fix the value of VDD. In fact, there is now one optimal supply voltage, leading to complementary states of pull-up and pull-down transistors, for each bias of the control gates. Taking equation (4) into account, it is given by:

− 2 VCe V = gcssgc (5) DDopt + CC jg

There is a consequent reduction of VDDopt thanks to the control gates, but it is important to note that the constraint on the total capacitance (equation 2) should also take account of the contribution of Cgc : CΣ=2Cj+Cg+Cgc A drawback of this approach is the requirement of extra lines to distribute the control gate voltages. However, this can be avoided in the particular case where Vgcss=VDD and Vgcdd=0V (VSS=0V). For this condition, the optimum value of VDD is given by:

e V = (6) DDopt ++ jg 2CCC gc TLFeBOOK

47 As discussed previously for the case without control gates, at RT the Coulomb blockade area of SET is narrower than at 0K which makes possible a reduction of VDD or a change of bias of the control gates for a given VDD. However, it also results in a change of the SET current which may affect the speed of circuits. Consequently, there is a design trade-off. To illustrate it, in Fig. 3-5 we have plotted the variations of the DC voltage gain and of the propagation delay along a chain of SET inverters versus the DC current at the transition point of the transfer characteristics. The shaded area shows the most advantageous design window. In this example, the load capacitance is equal to 0.5fF, but for another value the design window would be the same, since the propagation delay directly scales with this capacitance. This is a difference with CMOS where the dominant load capacitance of dense logic is due to the gate capacitance of MOSFETs. Here, the gate capacitance of SETs is extremely small and the dominant load capacitance comes from the local interconnects. In fact the later should be much larger than e/2VDD to avoid any detrimental effects of the shot noise. Regarding the dynamic power consumption of SET logic, as long as a CMOS output buffer is not implemented, the major contribution would also come from the load capacitance due to the interconnects.

-8 propagation delay (s) 2 5 10 Master Equation Cj=0.05aF C =0.1aF G 4 10-8 G C =0.1aF V GC C =0.5fF 1,5 load V DC voltageDC gain R =1MΩ 3 10-8 DD T CG VGdd

τ -8 Vout p 2 10 Vin in CG 1 CL V =0.3V V DD 1 10-8 Gss T=300K V +V =V GCss GCdd DD 0,5 0 10-10 10-9 10-8 10-7 DC current @ transition point (A)

Figure 3-5. Variations of the DC voltage gain (solid line) and of the propagation delay along a chain of SET inverters (dashed line) versus the DC current at the transition point of the transfer characteristics. VDD=0.3V. The control gate voltages are varied as follow: 0.7V

48 An example of switching characteristics is shown in Fig. 3-5 for VDD=0.3V, T=300K and a load capacitance of 0.5fF. The corresponding switching energy is 5.2x10-2fJ. In the same figure, transfer characteristics have been plotted for a lower supply voltage, showing that a voltage gain higher than 1 can be obtained down to VDD=0.2V at RT.

-9 0,5 5 10 0,2

I I Current (A) pull-down pull-up

0,4 4 10-9 Output (V) Vin, Vout (V) Vin, Vout 0,15

0,3 3 10-9

0,1

0,2 2 10-9 T=300K 0,05 0,1 1 10-9

V V out in T=150K 0 0 0 05 10-8 1 10-7 1,5 10-7 2 10-7 00,050,10,150,2 time (s) input (V)

Figure 3-5. On the left, switching characteristics for a SET inverter with control gates. VDD=0.3V, T=300K, CL=0.5fF, Cj=0.05aF, Cg= Cgc =0.1aF, RT=1MΩ, Vgcss=0.3V, Vgcdd=0. On the right, transfer characteristics for the same capacitances and VDD=0.2V.

3.2.3 Designing gates with increased functionality

Another approach to lower the power consumption is to build logic gates with increased functionality in order to reduce the count of transistor needed to obtain a given function. This can be done by taking advantage of both the existence of CBO and the possibility to design SET's with multiple inputs [9]. The principle is to choose the logic levels such as the multiple inputs SET's are biased at either the minima or at the peaks of current, depending on the combination of input signals. This is illustrated in Fig. 3-7 in the case of a double input X-OR function. The logic level "1" is equal to the CBO period of the equivalent single input SET, e/(2.Cin), where Cin is the input gate capacitance. From this equivalent SET, it is obvious that the device is TLFeBOOK

49 Off when both of the inputs are either "0" or "1" and that the device is On when one and only one of the inputs is "1".

3 10-8 V =100mV T=300K 2,5 10-8 D ID (A) VDD Cin 2 10-8 VA

1,5 10-8 VB -8 1 10 Cin

5 10-9

VGeff 0 CGeff 00,511,52 Vgeff (V) = 2CC A="0" A ="1" A ="1" Geff in X-OR and + VV and V = BA B="0" B ="1" B ="1" Geff 2

Figure 3-7. Principle of design of a X-OR gate with a double input SET. The output current is a X-OR function of VA and VB. The figure shows also the equivalent input circuit and associated equations.

This can be used to design pass-gate logic functions, as demonstrated by Y. Ono [9]. For instance, for an input current signal C in the previous X-OR gate, the output pass current is given by C.(A ⊕ B). Furthermore, one of the inputs of this gate, B for instance, can be viewed as a control input, leading to the output pass current .AC if B is "1". Applying this technique to a gate comprising such a control input in addition to the inputs A and B, we get the function .()⊕BAC for the output current, when the control input is "1" (see Fig. 3-8). This way, cascading SET structures, NTT has designed a 4-b adder with only 40 SET's for operation at 30K9. In comparison with CMOS there are less transistors and no crossing of pass signal routes thanks to the high functionality of the SET gates. Furthermore, there is a low level signal on the pass route. Consequently, a lower dynamic power is expected. Nevertheless, this power gain is not yet evaluated. TLFeBOOK

50

C.(A⊕B) A Control gate B C

Figure 3-8. Design of complex gates using multiple input SETs. These gates can be cascaded.

Moreover, there are important issues, especially about the control of the phase of CBO. Since it will be probably impossible to avoid the existence of any parasitic charges, charge-tolerant solutions are required. A first approach consists in incorporating redundancy into the circuit design in order to replace the defective gates by reconfiguration [9]. This is valuable only if a reasonable amount of spares is needed and if the area-overhead is not too large. Another solution would be to balance the influence of parasitic offset charges by opposite charges stored near the island of the SET. This concept has been demonstrated in the case of SET in which nanostructures have been embedded [10-12]. The resulting device is a merge of a SET and of a Non Volatile Memory function. Further, a feedback loop can be implemented to automatically control the phase of CBO [13]. The loop is closed to adjust the amount of charges in a memory node then it is opened for the use of the device. There are other potential applications about the possibility to tune and memorize the phase of CBO. A first example has been the demonstration of a hybrid SET-MOSFET gate which can be programmed to be inverting or non-inverting [11]. This feature has been obtained thanks to a SET active device which can operate either in a positive or in a negative transconductance region, depending on the amount of charge stored in a nearby nanostructure. In this case, the SET was fabricated in a very thin undulated SOI film in which a narrow source-drain percolation channel and an electron pocket working as a memory node can be naturally formed for a range of bias. In the hybrid gate, the MOSFET was just used as a load. Since the output voltage swing was only 10mV, an output buffer has been implemented. The reproducibility of the structure is not obvious, but RT operation and peak-to-valley current ration (PVCR) as high as 102 were obtained. The most important is the concept of programmable logic which is feasible with SET based devices, since it has a high potential for low-power and high packing density. The design of SET programmable logic array (PLA) has been also reported by K. Uchida [11]. TLFeBOOK

51 It is important to note that many other functions can be designed with few-electron devices, taking advantage of their specific features. Especially, several memory structures that are promising for low power consumption have been reported [14-17]. For spiking neuron circuits, it has been proposed to combine NVM MOSFET devices and single-electron circuit based on multinanodot floating-gate arrays [18]. Also, some analog applications and devices have been studied, like CCD [8], ADC [19], metrology [20] and NEMS [21]. However, although some have been demonstrated, most of them are still at the proof of concept level.

3.3 MOLECULAR ELECTRONICS

For the fabrication of SET's, any kinds of conducting materials can be used. Whereas the basic research was done on metallic SET's [2], circuit demonstrations are performed mainly on silicon [8-13], for complementary with MOSFET's and to benefit of the huge investment in silicon technology. However, it could be advantageous to use molecules for real applications due to the size requirement discussed previously and because the reproducibility of nanoscale structures is very challenging. In addition, the load capacitance of circuits should be very low, for power consumption and speed considerations, which implies short and narrow interconnects. The most promising way to achieve it is the bottom-up approach, using naturally formed tiny structures or self-assembling methods. The best example is the carbon nanotube (CNT) which can be used to fabricate FET's [22-23], SET's [24], interconnects [25] and even non volatile memory arrays [26]. CNT's are long cylinder of carbon atoms consisting of rolled-up sheets of graphite. For Single Wall CNT's the diameter is as small as 1-5 nm. Depending on their chirality, they are semiconductor or metallic materials. Their mobility is much higher than the one of silicon, and a ballistic transport has been demonstrated for lengths less than a few hundreds of nm, but the subthreshold characteristics of CNFET's are not better than those of MOSFET's. Worldwide, several teams are conducting research on the selective growth or deposition of CNT's that would have the right chirality and on the evaluation of CNFET's as potential candidate to replace MOSFET's in the future. For low power applications, thanks to their excellent transport properties [22], it should be possible to reduce the gate overdrive and VDD, while meeting the ITRS specifications [1]. Different kinds of molecules are also currently investigated to make nanometer-scale electronic components and circuits, but a single molecule transistor has not yet been obtained. To date, one of the most advanced achievement is a 1µm2 64 cells crossbar matrix fabricated by HP Labs [27] in which the switching units are bundle of rotaxane molecules. The operation TLFeBOOK

52 of such molecules is not yet clear and others mechanisms like the formation of tiny filaments across the molecule gap between the electrodes could explain the switching [28]. However, on the long term, this research subject has a great potential for high density, low cost and probably ultra low power electronics. True single molecule device will require interconnects at a similar scale. This is also essential to reduce parasitic capacitances and the power consumption. Since the needed resolution is far beyond the possibility of lithographic tools, including NGL, the solution will come from the bottom- up approach. An example is the realization by Caltech of a Pt nanowire lattice with width and pitch of 8 nm and 16 nm respectively [29]. Biology can also come to the rescue for the self-assembly of nano-circuits [30]. A very different approach, also mitigating the arduous task of nanoscale patterning, is the concept of self-assembled nanocells proposed by J. Tour [31]. These nanocells are disordered arrays of metallic islands that are interlinked with molecules and that are accessed by metallic input/output leads. Switching-type functions have been observed, but like for the work of HP [27], the creation and dissolution of metal filaments is probably responsible for the behaviour. In fact, the behaviour of electrically active molecules is strongly influenced by surrounding electrodes and other materials, which make a difference between molecular nanotechnology and bulk or solution-phase chemistry.

3.4 DISCUSSION

For CNT devices, as well as for nano MOSFET's, the supply voltage reduction is dependent upon the effects of subthreshold leakage on the static power, leading to trade-off with the speed of circuits. For SET's, the steepness of the On-Off switching is not better, but they offer an increased functionality and low charge operation. However, there are important issues, especially about the sensibility to offset charges and the fabrication of nano- scale structures with sufficient level of reproducibility, which require a lot of work. Although it is not yet clear if they could achieve a lower VDD, there are other candidates, like the resonant tunnelling diodes (RTD). Their operation is based on electron transport via discrete energy levels in double barrier quantum well structures, leading to the existence of a negative differential resistance. This implies the fabrication in suitable materials and a perfect control of the geometry, since the output characteristics are extremely sensitive to the dimensions. A promising approach is the implementation of RTD along semiconductor nanowires [31]. There are also prospective TLFeBOOK

53 studies for a molecular version and for structures mixing Coulomb blockade and resonant effects [32]. One of the most important features of nanodevices, especially for molecular ones, is their size. That offers the possibility to lower the power consumption by parallel processing. For instance, consider two blocks of low capacitance molecular devices doing the same task as one block of conventional devices, but at half the clock frequency. The Ngate . fc product being unchanged, equation 1 shows that the power consumption is directly related to the C.V [2] product, the gate switching energy, which can be strongly reduced thanks to lower capacitances and to the possibility to have devices with lower On current, since fc is divided by 2 in this case. Going further, quantum-dot cellular automata (QCA) is an attractive approach, yet speculative, to reduce the power consumption, since there is no flow of current but only Coulomb interactions [33]. The principle is to encode binary information by charge configuration in electrostatically coupled cells in which there are two extra electrons. It has been shown that a clock field is needed to control the direction of propagation of information along the cells and to enable power gain. This clock could also be used for a quasi-adiabatic switching, leading to extremely low power consumption. To date, experimental demonstrations are performed at low temperature on metallic structures, but molecular implementations are being investigated in view of RT operation [34].

References

[1] Semiconductor Industry Association, International Technology Roadmap for Semiconductors 2001 Edition, http://public.itrs.net, 2003 [2] H. Grabert and M. H. Devoret, Single charge tunnelling Coulomb blockade phenomena in nanostructures, volume 294 of NATO ASI Series B, Plenum Press New York and London, 1992 [3] T. Douseki, T. Shimamura, N. Shibata, A 0.3V 3.6GHz 0.3mW frequency divider with differential ED-CMOS/SOI circuit technology, in Proc. ISSCC, February 2003 [4] N. M. Zimmerman and W. H. Huber, Excellent charge offset stability in a Si-based single-electron tunneling transistor, APL Vol. 79, N.19, pp. 3188-3190, 2001 [5] J. R. Tucker, Complementary digital logic based on the Coulomb blockade, JAP 72 (9), 1, pp. 4399-4413, 1992 [6] A. N. Korotkov, R. H. Chen and K. K. Likharev, Possible performance of capacitively coupled single-electron transistors in digital circuits, JAP 78 (4), pp. 2520-2529, 1995 [7] M-Y. Jeong, B-H. Lee and Y-H. Jeong, Design considerations for low-power single- electron transistor logic circuits, JJAP. Vol.40, pp. 2054-2057, 2001 [8] Y. Takahashi, Y. Ono, A. Fujiwara and H. Inokawa, Silicon Single-Electron Devices for logic applications, in Proc. ESSDERC September 2002, Florence, pp. 61-68 [9] Y. Ono, H. Inokawa and T. Takahashi, Binary adders of multigate Single-Electron Transistors: specific design using Pass-Transistor Logic, IEEE Trans. on Nanotech. Vol.1 pp. 93-99, 2002 TLFeBOOK

54

[10] N. Takahashi, H. Hishikuro and T. Hiramoto, A directional current switch using silicon Single Electron Transistors controlled by charge injection into silicon nano-crystal floating dots, in Proc. IEDM, pp.371-374, 1999 [11] K. Uchida, J. Koga, R. Ohba and A. Toriumi, Programmable Single-Electron Transistor logic for future low-power intelligent LSI: proposal and room-temperature operation, IEEE Trans. on Elec. Dev. Vol.50, pp.1623-1630, 2003 [12] G. Molas, X. Jehl, M. Sanquer, B. de Salvo, M. Gely, D. Lafond and S. Deleonibus, Manipulation of periodic Coulomb Blockade Oscillations in ultra-scaled memories by single electron charging of silicon nanocrystals floating gates, Silicon Nano Workshop, Honolulu, June 2004 [13] K. Nishiguchi, H. Inokawa, Y. Ono, A. Fujiwara and Y. Takahashi, Automatic control of the oscillation phase of a Single-Electron Transistor, IEEE EDL25 (1), pp. 31-33, 2004 [14] K. Yano, T. Ishii, T. Hashimoto, T. Kobayashi, F. Murai and K. Seki, Room- Temperature Single-Electron Memory, IEEE Trans. on Elec. Dev. Vol.41, NO.9,pp.1628-1638, 1994 [15] Z. A. K. Durrani, A. Irvine and H. Ahmed, Coulomb blockade memory using integrated Single-Electron Transistor/Metal-Oxide-Semiconductor transistor gain cells, IEEE Trans. on Elec. Dev. Vol.47, pp.2334-2339, 2000 [16] H. Sunamura, H. Kawaura, T. Sakamoto and T. Baba, Multiple-valued memory operation using a Single-Electron Device: a proposal and an experimental demonstration of a ten-valued operation, JJAP Vol. 41, pp. L93-L95, 2002 [17] G. Molas, B. de Salvo, D. Mariolle, G. Ghibaudo, A. Toffoli, N. Buffet and S. Deleonibus, Single electron charging phenomena at room temperature in a silicon nanocrystal memory, in Proc. WODIM 2002, Grenoble [18] T. Morie, T. Matsuura, M. Nagata and A. Iwata, A multinanodot floating-gate MOSFET circuit for spiking neuron models, IEEE Trans. On Nanotechnology, Vol. 2, NO. 3, pp. 158-164, 2003 [19] H. Inokawa, A. Fujiwara and Y. Takahashi, A multiple-valued logic and memory with combined Single-Electron and Metal-Oxide-Semiconductor transistors, IEEE Trans. on Elec. Dev. Vol.50, NO.2, pp. 462-470, 2003 [20] H. E. van den Brom et al., Counting electrons one by one – overview of a joint european research project, IEEE Trans. on Inst. and Meas. Vol. 52, NO.2, pp. 584-588, 2003 [21] S. Mahapatra, V. Pott, S. Ecoffey, A. Schmid, C. Wasshuber, J. W. Tringe, Y. Leblebici, M. Declercq, K. Banerjee and A. Ionescu, SETMOS: a novel true hybrid SET-CMOS high current Coulomb Blockade Oscillation cell for future nano-scale analog ICs, in Proc. IEDM 2003, pp. 703-706 [22] A. Javey, H. Kim, M. Brink, Q. Wang, A. Ural, J. Guo, P. McIntyre, P. McEuen, M. Lundstrom and H. Dai, High-K dielectrics for advanced carbon-nanotube transistors and logic gates, Nature Materials, Vol 1, pp. 241-246, December 2002 [23] P. Avouris, Carbon nanotube electronics, Chemical Physics, 281 (2002), pp. 429-445 [24] K. Matsumoto, S. Kinoshita, Y. Gotoh, K. Kurachi, T. Kamimura, M. Maeda, K. Sakamoto, M. Kuwahara, N. Atoda and Y. Awano, Single-Electron Transistor with ultra-high Coulomb energy of 5000K using position controlled grown carbon nanotube as channel, JJAP Vol.42 Part 1 N°4B, pp. 2415-2418, 2003 [25] J. Li, Q. Ye, A. Cassell, H. T. Ng, R. Stevens, J. Han and M. Meyyappan, Bottom-up approach for carbon nanotube interconnects, APL Vol. 82, N°15, pp. 2491-2493, 2003 [26] Rueckes, K. Kim, E. Joselevich, G. Y. Tseng, C-L. Cheung and C. Lieber, Carbon nanotubes-based nonvolatile Random Access Memory for molecular computing, Science, Vol. 289, pp. 94-97, 7 July 2000 TLFeBOOK

55

[27] Y. Chen, G-Y. Jung, D. A. Ohlberg, X. Li, D. R. Stewart, J. O. Jeppesen, K. A. Nielsen, J. F. Stoddart and R. S. Williams, Nanoscale molecular-switch crossbar circuits, Nanotechnology 14 (2003) 462-468 [28] R. F. Service, Next-generation technology hits an early midlife crisis, Science Vol. 302, pp. 556-559, 24 October 2003 [29] N. Melosh, A. Boukai, F. Diana, B. Gerardot, A. Badolato, P. M. Petroff and J. R. Heath, Ultrahigh-density nanowire lattices and circuits, Science, Vol. 300, pp.112-115, 4 April 2003 [30] P. Fairley, Germs that build circuits, IEEE Spectrum, pp. 37-41, November 2003 [31] M. T. Björk, B. J. Ohlsson, C. Thelander, A. I. Persson, K. Deppert, L. R. Wallenberg and L. Samuelson, Nanowire resonant tunneling diodes, APL, Vol. 81, N°23, pp. 4458- 4460, December 2002 [32] M. Saitoh and T. Hiramoto, Room-temperature operation of highly functional Single- Electron Transistor logic based on quantum mechanical effect in ultra-small silicon dot, in Proc. IEDM IEDM 2003, pp. 753-756 [33] G. Bernstein, Quantum-dot Cellular Automata: computing by field polarization, in Proc. DAC 2003, June 2-6, Anaheim (CA), pp. 268-273 [34] C. Lent and B. Isaksen, Clocked molecular Quantum-dot Cellular Automata, IEEE Trans. on Elec. Dev. Vol.50, NO.9, pp. 1890-1896, 2003 TLFeBOOK

56 Chapter 4 STATIC LEAKAGE REDUCTION THROUGH SIMULTENEOUS VT/TOX AND STATE ASSIGN- MENT

Dongwoo Lee, Bo Zhai, David Blaauw and Dennis Sylvester University of Michigan, Ann Arbor

Abstract: Standby leakage current minimization is a pressing concern for mobile applica- tions that rely on standby modes to extend battery life. In this paper, we propose new leakage current reduction methods in standby mode. First, we pro- pose a combined approach of sleep-state assignment and threshold voltage (Vt) assignment in a dual-Vt process for subthreshold leakage (Isub) reduction. Sec- ond, for the minimization of gate oxide leakage current (Igate) which has become comparable to Isub in 90nm technologies, we extend the above method to a combined sleep-state, Vt and gate oxide thickness (Tox) assignments approach in a dual-Vt and dual-Tox process to minimize both Isub and Igate. By combining Vt or Vt / Tox assignment with sleep-state assignment, leakage cur- rent can be dramatically reduced since the circuit is in a known state in standby mode and only certain transistors are responsible for leakage current and need to be considered for high-Vt or thick-Tox assignment. A significant improve- ment in the leakage/performance trade-off is therefore achievable using such combined methods. We formulate the optimization problem for simultaneous state/Vt and state/Vt/Tox assignments under delay constraints and propose both an exact method for its optimal solution as well as two practical heuristics with reasonable run time. We implemented and tested the proposed methods on a set of synthesized benchmark circuits and show substantial leakage current reduc- tion compared to the previous approaches using only state assignment or Vt assignment alone.

Keywords: Leakage current, reduction, performance, dual threshold voltage, oxide thick- ness, algorithm.

4.1 INTRODUCTION

There is a growing need for high-performance and low-power system, especially for portable and battery-powered applications. Since these applica- tions often remain in stand-by mode significantly longer than in active mode, TLFeBOOK

57 their stand-by (or leakage) current has a dominant impact on battery life. Standby mode leakage current reduction therefore has been a concern for some time and a number of such methods have been proposed to address this problem [1]-[7][9]-[18]. However, with continued process scaling, lower sup- ply voltages necessitate reduction of threshold voltages to meet performance goals and result in a dramatic increase in subthreshold leakage current. New methods for reducing the leakage current in standby mode are therefore criti- cally needed. In dual-Vt technology, the MTCMOS approach [1] was proposed where a high-Vt sleep transistor is inserted between the power supply and the circuit logic. In standby mode, this sleep transistor is turned off which dramatically reduces leakage due to its high-Vt. However, the method requires routing of an additional set of power supply lines in the layout as well as substantially sized sleep transistors to maintain good supply integrity and circuit perfor- mance [2]. Also, special latches that maintain state in standby mode need to be used [3]. In addition, the method does not scale well into sub-1V technolo- gies due to the increased delay penalty for the high-Vt sleep device [4]. A different approach to standby mode leakage reduction has been pro- posed that leverages the state dependence of a leakage current due to the so- called stack effect [5][6]. In [7], the circuit input state that minimizes leakage current is determined and special flip-flops are inserted in the design to pro- duce this state in standby mode. The flip-flops in the design are modified to produce a predetermined state in standby mode while also maintaining the previously latched state. The required modification to a flip-flop is minor and can be incorporated in the feedback path of the slave latch with minimal impact on performance [8]. In general, determining the minimum sleep state is a difficult problem due to the inherent logic correlations in the circuit. How- ever, a number of efficient heuristics for this problem have been proposed [9][10]. The limitation of this approach is that for larger circuits, the reduction in leakage current is typically only in the range of only 10 to 30% [9]. The above techniques are aimed primarily at subthreshold leakage current reduction which has been the dominant component of leakage in CMOS tech- nologies to date. However, in 90nm technologies the magnitude of gate tunneling leakage, Igate, in a device is comparable to the subthreshold leakage, Isub, at room temperature. With difficulties in achieving manufacturable high- k insulator solutions to address the gate leakage problem, the burden address this problem is primarily on circuit designers and EDA tools. As a result, there has been recent work in the area of gate leakage analysis and reduction techniques including pin reordering, PMOS sleep transistors, and the use of NAND implementations rather than NOR implementations [11]-[13]. Also, the MTCMOS technique was extended to combat gate leakage by using a TLFeBOOK

58 thick-oxide I/O device with a larger gate drive than the logic transistors as the inserted sleep transistor [14]. Another previous approach to leakage reduction that targets only sub- threshold leakage is to use individual assignment of transistor threshold voltages in a dual-Vt process [15]-[18]. In these approaches, the trade-off between high-Vt transistors with low leakage/low performance and low-Vt transistors with high leakage/high performance is exploited. Circuit paths that are non-critical are assigned high-Vt while critical circuit portions are given low-Vt assignments. The method therefore provides a trade-off between cir- cuit performance and leakage reduction. It was demonstrated that with a modest performance reduction of 5–10%, significant reduction of 3-4X in leakage could be obtained over a circuit with all low-Vt transistors [17]. In these approaches, high/low-Vt assignments are performed without knowledge of the states of the circuit. Therefore, in order to obtain sufficient leakage reduction under all possible circuit states, all or most of the transistors in a particular gate must be set to high-Vt and hence the gate incurs a substantial performance degradation. While such dual-Vt processes have been commonplace for several genera- tions, the availability of multiple oxide thicknesses in a single process has only become relevant at the 90nm node due to the rise of Igate [19]. Given a process technology with dual oxide thicknesses for logic devices, the dual-Vt approach can be easily extended to also consider gate leakage by assigning thick-oxide transistors to non-critical paths as well. However, similar to the dual-Vt assignment approach, a simultaneous dual-Vt and dual oxide thickness assignment with unknown states of the circuit will set all or most of the tran- sistors in a particular gate to both high-Vt and thick-oxide, to ensure that under all possible circuit states in standby mode leakage current is acceptable. How- ever, transistors that are simultaneously assigned a high-Vt and a thick-oxide have a dramatic delay penalty compared to low-Vt transistors with thin oxide. Therefore, this approach carries with it a significant delay penalty for process technologies where both Isub and Igate need to be addressed. In this paper, we therefore propose new methods to reduce standby mode leakage current. We can divide our new methods into two categories: 1) simultaneous dual-Vt and sleep state assignment for Isub reduction for technol- ogies in which Isub is dominant in standby mode and 2) simultaneous dual-Vt, dual oxide thickness and sleep state assignment for both Isub and Igate minimi- zation for technologies which have comparable amount of Igate to Isub. First, we combine the concepts of Vt assignment and sleep state assignment. This approach is based on the key observation that, given a known input state for a gate, the leakage of that gate can be dramatically reduced by setting only a single OFF-transistor on each path from Vdd to Gnd to high-Vt. Since all other TLFeBOOK

59 transistors in the gate are kept at low-Vt and continue to have high drive cur- rent, the performance degradation is limited while significantly gain in leakage current is obtained. This approach therefore provides a much better trade-off between leakage and performance compared to Vt assignment with unknown input state where most or all of the transistors must be set to high-Vt before a significant improvement in the leakage current is observed. The link between the effectiveness of Vt assignment and state assignment was previ- ously observed for Domino logic [8], since these circuits are by their own nature in a known state in standby mode. However, we extend this concept to general CMOS circuits by actively controlling the circuit state in standby mode, thereby dramatically increasing the effectiveness of leakage reduction. The second proposed approach minimizes the total leakage current (Isub and Igate) by simultaneous assignment of sleep state, high-Vt and thick-oxide transistors. In this approach, a key observation is that given a known input state, a transistor need not be assigned both a high-Vt and a thick oxide since Isub only occurs in transistors that are OFF while significant Igate occurs only in transistors that are ON. Furthermore, depending on the input state of a cir- cuit, only a subset of transistors need to be considered for either high-Vt or thick-oxide. Therefore, the impact on the delay of the gate is significantly reduced while obtaining leakage reductions comparable to when all transis- tors are assigned to both high-Vt and thick-oxides. The proposed method is compatible with existing library-based design flows, and we explore different trade-offs between the number of Vt and Tox variations for each library cell and the obtained leakage reduction. In addition, we compare the obtained leakage reduction when Vt (the first method) and Vt / Tox (the second method) assignments can be made individually for transistors in a stack as opposed to when an entire stack is restricted to a uniform assignment due to manufactur- ing or area considerations. Since the circuit state / Vt and the circuit state / Vt / Tox assignments inter- act, it is necessary to consider their optimization simultaneously. The state / Vt and state / Vt / Tox assignment task is to find a simultaneous assignment that minimizes the leakage current in standby mode while meeting a user specified delay constraint. We formulate this problem as an integer optimization prob- lem under delay constraints. The search space consists of all input state / Vt and input states / Vt / Tox assignments and hence is very large. Therefore, in addition to an exact solution, we also propose a number of heuristics. The pro- posed methods are implemented on benchmark circuits synthesized using an industrial cell library in 0.18Pm technology for Isub minimization and in a pre- dictive 65nm technology for both Isub and Igate minimization. On average, the proposed Isub minimization method by simultaneous state / Vt assignment approach improves leakage current by a factor of 6X over the traditional TLFeBOOK

60 approach using Vt assignment only. The second proposed method that mini- mizes both Isub and Igate by simultaneous state / Vt / Tox assignment has an average leakage reduction of 5-6X over an all low-Vt and thin-oxide design solution with a 5% delay range point and achieves more than a 2X improve- ment over the first proposed approach using Vt and state assignment only (i.e., without dual-Tox). The remainder of this paper is organized as follows. In Section 4.2, we dis- cus the used leakage model and the characteristics of Isub and Igate leakage current. In Section 4.3, we present the approach using simultaneous Vt and state assignment for Isub leakage reduction. In Section 4.4, we present the sec- ond approach that also addresses Igate by performing simultaneous Vt, Tox, and state assignment. In Section 4.5, we present our results on benchmark cir- cuits and in Section 4.6 we present our conclusions.

4.2 LEAKAGE MODEL AND CHARACTERISTICS

In this section, we discuss our leakage current model and briefly review the general characteristics of gate leakage current in CMOS gates. Since the proposed leakage optimization approach is library-based, we use precharacterized leakage current tables for each library cell, with specific leakage table entries for each possible input state of a library cell. The pre- characterized tables were constructed using SPICE simulation with BSIM3 models from 0.18Pm technology for Isub minimization approach. In order to represent both Isub and Igate components for the state / Vt / Tox assignment approach, BSIM4 models were used to generate the precharacterization of tables. The device simulation parameters were obtained using leakage esti- mates from a predicted 65nm processes [20], and had a gate leakage component that was approximately 36% of the total leakage at room tempera- ture (at which all analysis is performed).1 (Detailed numbers will be shown at Section 4.5.2.) Different high- and low-Vt versions of a cell as well as Tox and Vt versions of a cell will be explained further in Section 4.4.2. Also, the delay and output slope as a function of cell input slope and output loading were stored in precharacterized tables. The total gate leakage for a library cell consists of several different com- ponents, depending on the input state of the gate, as illustrated for the inverter cell in Figure 4.1. The maximum gate tunneling current occurs when the input is at Vdd and Vs = Vd = 0V for the NMOS device. In this case, Vgs = Vgd = Vdd and the Igate is at its maximum for the NMOS device. At the same time, the

1. Since this work aims at standby mode leakage, we expect junction temperatures during these idle periods to be lower than under normal operating conditions, making room temperature analysis more valid. TLFeBOOK

61

Igate Isub Vdd Gnd Igate 0V Igate Vdd

Igate Isub

Figure 4.1. Inverter circuit with NMOS oxide leakage current. PMOS device exhibits substantial subthreshold leakage current. When the input is at Gnd, the output rises to Vdd and Vgs = 0 while Vgd will become -Vdd for the NMOS device, resulting in a reverse gate tunneling current from the drain to the gate node. In this case, tunneling is restricted to the gate-to-drain overlap region, due to the absence of a channel. Since this overlap region is much smaller than the channel region, reverse tunneling current is signifi- cantly reduced compared to the forward tunneling current [21]. Note that BSIM4 intrinsically considers this reverse tunneling current so it is included in the precharacterized tables described above. When the input voltage is Gnd, the PMOS device also exhibits gate cur- rent from the channel to the gate since its Vgs = Vgd = -Vdd. The relative magnitude of the PMOS gate current in comparison to the NMOS gate current differs for different process technologies. If standard SiO2 is used as the gate oxide material, then the Igate for a PMOS device is typically one order of mag- nitude smaller than that for an NMOS device with identical Tox and Vdd [19][22]. This is due to the much higher energy required for hole tunneling in SiO2 compared to electron tunneling. However, in alternate dielectric materi- als, the energy required for electron and hole tunneling can be completely different. In the case of nitrided gate oxides, in use today in a few processes, PMOS Igate can actually exceed NMOS Igate for higher nitrogen concentra- tions [23][24]. In this paper, we assume that standard SiO2 gate oxide material is used and the PMOS gate current is negligible. However, the presented methods can be easily extended to include appreciable PMOS gate leakage as well.

4.3 SUBTHRESHOLD LEAKAGE REDUCTION

4.3.1 Simultaneous Vt and State Assignment Consider the leakage and performance of the simple NAND2 circuit shown in Figure 4.2 under different input states and Vt assignments. It is clear that given a particular input state, only those transistors that are OFF need to TLFeBOOK

62

t t B p1 Group 1

t A n1 Group 2

t n2 Group 3

Figure 4.2. The concept of groups for a NAND2 gate be considered for high-Vt assignment as the ON-transistors are not leaking. For instance, in state AB = 01, only transistor tn1 needs to be considered for high-Vt assignment. Assigning other transistors to high-Vt will only decrease the performance of the gate with no reduction in leakage current. On the other hand, in state 11 both tp1 and tp2 must be assigned high-Vt in order to reduce leakage, since they are parallel devices. We can partition the transistors into so-called Vt-groups, corresponding to the minimum sets of transistors that need to be set to high-Vt to reduce leak- age in a particular state assignment. For the 2-input NAND gate in Figure 4.2, three Vt-groups exist as shown. The concept of Vt-groups can be easily applied to more complex structures in which case it may be possible that a transistor belongs more than one Vt-group. It is clear that we can restrict our- selves to setting only entire Vt-groups to either high or low-Vt. By considering only Vt-groups, instead of individual transistors, we therefore significantly reduce the number of possible Vt assignment and the optimization complexity. In Table 4.1, we show the leakage current for the NAND2 in Figure 4.2 for different input states and Vt-group assignments. Column 3 shows the leakage current when we use high-Vt for one or more Vt-groups that are OFF in a par- ticular input state. In column 4 and 5, the leakage current with all transistors assigned to, respectively, high-Vt and low-Vt is shown. We can see that in states 01, 10, and 11 only a single Vt-group is a candidate for high-Vt assign-

Table 4.1. Leakage current of NAND2 gate

Input Assigned Leakage current [pA] State Group with Group Assign. with All High Vt with All Low Vt 224.9 00 39.8 7.2 286.7 2 and 3 7.2 01 2 26.6 26.6 1054.0 10 3 25.7 24.4 922.6 11 1 14.2 14.2 357.2 TLFeBOOK

63 ment. Also, setting only this one Vt-group to high-Vt results in equal or nearly equal leakage compared with the leakage when all transistors are assigned high-Vt demonstrating the effectiveness of the approach. In state 00, three high-Vt assignments are possible: group 2, group 3, and both group 2 and 3. However, the leakage current with both groups assigned to high-Vt is only slightly better than that with only one group set to high-Vt, and assigning group 3 to high-Vt reduces leakage somewhat more than assigning group 2 to high-Vt. Hence, it is clear that we need to only consider assignment of group 3 to high-Vt without significant loss in optimality. Table 4.1 shows that the leakage current varies considerably as different groups associated with different input states are set to high-Vt. At the same time, the impact of different high-Vt group assignments on the performance of the circuit must be considered. By setting only a single group to high-Vt, the performance degradation is restricted to only a single signal transition direc- tion and is also reduced compared to high-Vt assignments where most or all transistors are set to high-Vt. Therefore, the performance/power trade-off of Vt assignment with known input state is much improved compared with that with unknown input state. The input state of a gate effects which transition direction is degraded by a high-Vt group assignment to a gate. Also, the position of the high-Vt group in a stack of transistors changes the impact of a high-Vt group assignment on the different input to output gate delays. Therefore, the input state of a gate must be chosen such that its associated high-Vt group results in the least degrada- tion of the critical paths in the circuit. However, only the input state of the circuit as a whole can be controlled and the logic correlations of the circuit restrict the possible assignments of gate input states. Therefore, selection of the circuit input state and of which gate is assigned a high-Vt group must be made simultaneously to obtain the maximum improvement in leakage current with minimum loss in performance.

4.3.2 Exact Solution to Vt and State Assignment The size of the input state space is 2n, where n is the number of circuit inputs. For each input state assignment, there are two possible Vt assignments for each gate (one high-Vt group which is pre-determined by its input state, m and all low-Vt). The total number of possible Vt assignment is therefore 2 , where m is the number of gates in the circuit and the total size of the search space is 2n+m. In order to find an exact solution to the problem, we developed an efficient branch-and-bound method that simultaneous explores the state and Vt assign- ments and that exploits the characteristics of the problem to obtain efficient pruning of the search space to improve the run time. Due to the exponential TLFeBOOK 64 nature of the problem, an exact solution is only possible for very small cir- cuits. However, the exact approach is still useful as the proposed heuristics are based on it. We use two types of branch and bound trees. The first branch-and-bound tree determines the input state of the circuit and is referred to as the state tree. The nodes of the state tree correspond to the input variables of the circuit inputs. Each node of the state tree is associated with a so-called gate tree which is searched to determine the group Vt assignment. In other words, for a state tree with k nodes, there exist k copies of the gate tree. Each node in a par- ticular gate tree corresponds to a gate in the circuit, as shown in Figure 4.3. Each node has two fanout edges, representing the assignment of that gate with all low-Vt groups (left branch) or with one high-Vt group, as determined by the input state of the gate (right branch). At the root of the state tree, the state of all input variables is unknown. As the algorithm proceeds down the tree, the state of one input variable becomes defined with each level that is traversed. At each node in the state tree, a solu- tion of leakage current can be obtained by traversing the gate tree. Note that the gate tree may be traversed both with a completely known input state at the bottom of the state tree as well as with a partially or completely unknown input state, at higher levels of the state tree. For each node in the state and gate tree, an upper and low bound on the leakage current is computed incrementally as explained in Section 4.3.2.1. Note that early in the state tree the bounds on leakage will be very loose since the state of the circuit is only partly defined. As the algorithm traverses down the state tree, the input state becomes more defined and the leakage bounds become closer. Similarly, the leakage bounds are very wide at the top of each gate tree, as the Vt assignment of all gates are unknown, and becomes progres- sively tighter as the algorithm traverses down the tree. Only at the bottom of both the state tree and its associated gate tree do the upper and low bounds on

Gate tree

g1 LH

g2 g2 LH LH State tree

gm gm s1 LH LH 01 Sol.

s2 s2 01 01

sn sn 01 01 Sol. Figure 4.3. State tree with gate tree at each node TLFeBOOK

65 leakage coincide. The algorithm first traverses down to the bottom of the tree and then returns back up, to traverse down unvisited branches in DFS manner. During the search, a tree branch is pruned when if it has a lower bound on leakage that is worse than the best upper bound on leakage that has been observed so-far. In addition to pruning based on leakage bounds, we also compute a lower bound on the circuit delay at each node in the gate tree tra- versal and prune all branches whose lower bound exceeds the specified delay constraint. Computation of the delay bounds is also performed incrementally and is discussed in Section 4.3.2.2. Also, early in the state tree, computation of the exact minimum Vt assign- ment by traversing the gate tree is not meaningful since even at that bottom of the gate tree there is considerable uncertainty in the leakage current due to the unknown input state. Therefore, the gate tree is searched only partially at the higher levels of the state-tree which results in slightly more conservative bounds, but an overall improvement in the run time of the algorithm. The gate tree is also searched in DFS manner and edges are pruned based on the computed leakage bounds. During the downward traversal of the gate tree, the high-Vt branch is always selected, provide it meets the delay con- straint. This is due to the fact that the high-Vt branch always has less leakage current than the low-Vt branch. Only if the lower bound on the delay of the high-Vt branch exceeds the delay constraint, is the low-Vt branch selected and is the high-Vt branch pruned. Finally, the gates in the circuit are assigned to nodes in the gate tree in topological order to enable incremental delay computation. Gates of equal topological level are further sorted by decreasing leakage to improve the pruning of the search space. The input signals of the circuit are also assigned to nodes in the state tree in specific order. We want to place inputs whose state assignment strongly influences the total leakage of the circuit near the top of the state tree. We estimate the influence of each input signal on the circuit leakage by taking the sum of the leakage current of all gates connected to the input signal. This input variable ordering is similar to that used in [25].

4.3.2.1 Incremental leakage bound computation During the traversal of the gate tree, some of the gates will have a known Vt assignment and others, which have not been visited, will have an unknown Vt assignment. As shown in Figure 4.4, a lower bound on the leakage is com- puted by assuming all unknown gates have a high-Vt group assignment and an upper bound is computed by assuming all unknown gates have a low-Vt group assignment. As the high branch is taken in the downward traversal, only the upper bound is update (decreased) while when a low branch is taken, only the lower bound must be updated and is increased. TLFeBOOK

66

UBi-1 = leak(g1~gi-1=known gates) + leak(gi~gn=L)

LBi-1 = leak(g1~gi-1=known gates) + leak(gi~gn=H)

gi L H

gi+1 gi+1 UBi = (unchanged) UBi = UBi-1 - leak(gi=L) + leak(gi=H)

LBi = LBi-1 - leak(gi=H) + leak(gi=L) LBi = (unchanged) Delay = (unchanged) Delay = (increased) Figure 4.4. Incremental leakage bound computation

4.3.2.2 Incremental delay bound computation Similar to the leakage current bounds, a lower bound on the delay is com- puted assuming all unknown gates have low-Vt group assignments. Delay is changed only when a high branch is taken in the traversal and is computed incrementally. We first compute the slack of the circuit for all circuit nodes at the start of the tree traversal with all Vt assignments assumed to be low-Vt. When a group changes from a low to a high-Vt group assignment during the traversal, the slack of that gate will be updated. However, the Vt change of the gate will affect not only the gate itself but also the delays of fanout gates due to the slope change at the output of the changed gate. Since the slope at the output of the changed gate will become slower due to its high-Vt assignment, the delay of all fanout gates will increase, resulting an overall increased cir- cuit delay. Ignoring the effect of slope change on fanout gates will therefore result in the computation of an optimistic lower bound which ensures that the optimal solution is not accidentally pruned. It also enables incremental delay computation, given that the gates are visited in topological ordering. As gates are visited, the changed input slope, due to high-Vt assignments of a fanin gate, is processed to ensure that an exact delay bound is computed at the bot- tom of the gate tree.

4.3.3 Heuristic Solution to Vt and State Assignment We propose two fast heuristics that can be applied to large circuits and that produce high quality solutions. The proposed heuristic are based on the exact method described in Section 4.3.2, and are discussed below. Heuristic 1 In this heuristic, the state and gate tree search is limited to only one down- ward traversal. Note that while only a single traversal of the state tree is performed, at each node of the state tree the decision to follow the left or right child node is based on the computed bounds of the leakage using the gate tree. TLFeBOOK

67 Each downward traversal of the gate tree visits m nodes, where m is the num- ber of gates in the circuit. We perform exactly two such traversals at each state tree node, leading to a total run time complexity that is O(nm), where n is the number of circuit inputs. Since the number of inputs is generally thought to grow approximately as the sqrt(m), the total complexity of this heuristic is O(msqrt(m)). Heuristic 2 In the second heuristic, the state tree is searched more extensively, subject to a fixed run time constraint, while the gate tree search is kept to a single downward traversal for each state tree node. Experimentally, it was found that the quality of the first bottom node reached in the gate tree search is near the optimal Vt assignment. This is due to the fact that the gate tree always chooses the high-Vt child in its downward traversal which tends to produce a high quality result. This is in contrast to the state tree, where choosing the correct child during the downward traversal was found to be much more difficult. Therefore, the solution quality was found to improve most by searching the state tree more extensively, subject to a run time constraint, while limiting the gate tree search to a single downward traversal.

4.3.4 Vt assignment Control within Stacks

We assume the ability to assign Vt on an individual basis within stacks of transistors. Although it is generally possible to assign the Vt of each transistor in a stack individually, this may result in the need for increased spacing between the transistors in order not to violate design rules and ensure manu- facturability [26]. Hence, at times it may be desirable to restrict the assignment of Vt such that all transistors in a stack are uniform. In this case, less flexibility exists in the assignment of Vt, and hence the obtained trade-off in delay and leakage will degrade to some extent. In Section 4.5.1, we present results showing the impact on the leakage optimization when uniform stack assignments are enforced in the library.

4.4 LEAKAGE REDUCTION METHOD FOR BOTH SUBTHRESHOLD AND GATE LEAKAGE CURRENT

4.4.1 Leakage Reduction Approach The proposed leakage optimization method performs simultaneous assign- ment of standby mode state and high-Vt and thick-oxide transistors. The TLFeBOOK

68 proposed method is based on the key observation that given a known input state, a transistor need not be assigned both a high-Vt and a thick oxide. This is due to the fact that if a transistor that is OFF, gate leakage is significantly reduced and hence the transistor only needs to be considered for high-Vt assignment. Conversely, a transistor that, given a particular input state, is ON may exhibit significant Igate, but does not impact Isub. Hence, conducting tran- sistors only need to be considered for thick oxide assignment. If the input state is unknown in standby mode, it cannot be predicted at design time which tran- sistors will be ON or OFF and therefore all or most transistors must be assigned to both high-Vt and thick-oxide in order to significantly reduce the total average leakage. However, given a known input state, we can avoid assignment of transistors to both high-Vt and thick oxide, thereby significantly improving the obtained leakage / delay trade-off. Furthermore, depending on the input state of a circuit, only a subset of transistors needs to be considered for high-Vt or thick-oxide, as discussed in Section 4.3.1. For instance, in a stack of several transistors that are OFF, only one transistor needs to be assigned to high-Vt to effectively reduce the total Isub. Similarly, Igate for transistors in a stack also has strong dependence on their position. If a conducting transistor is positioned above a non-conducting transistor in a stack, its Vgs and Vgd will be small and gate leakage will be reduced. Hence, depending on the input state, only a small subset of all ON transistors needs to be assigned thick-oxide and only a subset of all OFF tran- sistors need to be considered for high-Vt assignment. We illustrate the advantage of high-Vt and thick-oxide assignment with a known input state for a 2-input NAND and NOR gate in Figure 4.5. In Figure 4.5(a) a 2-input NOR gate is shown with input state 01. Since only PMOS transistors p2 is OFF in the pull-up stack, it is the only transistor that needs to

p p p 1 0 1 1 1 0 i1 i1 i1

p p p 2 1 2 1 2 0 i2 i2 i2

n 1 n 2 n 1 n 2 n 1 n 2

(a) (b) (c)

p 1 p 2 p 1 p 2 0 1 i1 i2 High-Vt Transistor

n 1 n 1 Thick Oxide n 2 n 2 Transistor 1 0 i2 i1

(d) (e)

Figure 4.5. High Vt and thick oxide assignments at different input states TLFeBOOK

69 be set to high-Vt to reduce the subthreshold leakage of the gate. Similarly, only NMOS transistor n2 exhibits gate leakage and needs to be assigned thick oxide to reduce Igate. Hence only two out of four transistors are affected while the total leakage current is reduced by nearly the same amount as when all transistors in the gate are set to high-Vt and thick oxide simultaneously. As a result, the delay of the rising input transition at input i1 is unaffected by the high-Vt and thick-oxide assignments, while the other transitions are affected only moderately. In Figure 4.5(b), the worst-case input state for a NOR2 gate is shown, which is when both inputs are 1. In this case, both NMOS devices must be assigned to thick-oxide to reduce Igate, while at least one PMOS device is set to high-Vt. Depending on the delay requirements, the best input state is either the state 01 shown in Figure 4.5(a), or the state 00, shown in Figure 4.5(c), which requires only two transistors to be set to high-Vt. Hence, it is clear that the input state significantly impacts the ability to effectively assign high-Vt and thick-oxides without degrading the performance of the circuit. This leads to the need for a simultaneous optimization approach where both the input state and the high-Vt and thick-oxide assignments are considered simulta- neously under delay constraints. In addition to high-Vt and thick-oxide assignment, we also take advantage of the Igate dependence on input pin ordering to reduce leakage current [11]. This is illustrated in Figure 4.5(d), for a 2-input NAND gate with input state 01. In order to effectively reduce the leakage under this input state, NMOS transistor n1 must be assigned to high-Vt and NMOS transistor n2 must be assigned to thick-oxide. However, if input pins i1 and i2 are reordered, with i1 positioned at the bottom of the stack, as shown in Figure 4.5(e), the Vgs and Vgd voltage of NMOS transistor n1 will be reduced from Vdd to approximately one Vt drop. Hence, the gate leakage current of n1 will be substantially reduced and can be ignored. After reordering the input pins, it is necessary to only set NMOS transistor n2 to high-Vt without further assignments of thick- oxide transistors. It should be noted that pin reordering will impact the delay of the circuit and hence some performance penalty might be incurred. How- ever, this penalty will be readily offset by the elimination of the thick-oxide assignment in the pull-down stack. In this paper, we therefore consider com- bined input state assignment with pin-reordering and Vt / Tox assignment. 4.4.2 Cell Library Construction

In order to perform simultaneous Vt, Tox and state assignment, it is neces- sary to develop a library where for each cell the necessary Vt and Tox version are available. After such a library has been constructed, the process of assign- ing Vt and Tox assignments can be performed by simply swapping cells from TLFeBOOK

70 the library. Since different Vt and Tox variations do not alter the footprint of a cell, the leakage optimization can be performed either before or after final placement and routing. For each gate and input state, a number of different Tox and Vt assignments is possible, providing different delay / leakage trade-off points. For the fastest and highest leakage trade-off point, all transistors are assigned to low-Vt and thin oxides, such as the NAND2 gate shown in Figure 4.6(a). On the other hand, for the slowest and lowest leakage version of the cell all transistors con- tributing to leakage are assigned either high-Vt or thick oxide. For instance, for the NAND2 gate with input state 11, shown in Figure 4.6(b), all transistors affect the leakage current and both NMOS transistors are assigned thick Tox while both PMOS transistors are assigned high-Vt to obtain the minimum leakage / maximum delay trade-off point. In addition to the fastest version and minimum leakage version of the cell, a number of other intermediate trade-off points can be constructed for a cell by assigning only some of the transistors that contribute to leakage to high-Vt or thick-Tox. These cell versions would have lower leakage than the fastest cell version but would be faster than the lowest leakage version. It is clear that a large number of possible cell versions can be constructed if all possible trade-off points are considered for each possible input state. While a larger set of cell versions provides the optimization algorithm with more flexibility, and hence a more optimal leakage result, it also increases the size of the library, which is undesirable. Therefore, we initially restrict our library to at most 4 different trade-off points for each input state of a library cell, which are: 1) the minimum delay, shown in Figure 4.6(a), 2) minimum leakage, shown in Fig-

t t t t t t A p1 p2 1 A p1 p2 1 A p1 p2

tn1 tn1 tn1

t t t B n2 1 B n2 1 B n2

(a) (b) (c)

tp1 tp2 tp1 tp2 tp1 tp2 1 A 0 A 1 A

tn1 tn1 tn1

tn2 tn2 tn2 1 B 0 B 0 B

(d) (e) (f)

Figure 4.6. Complete Vt-Tox versions of NAND2 gate TLFeBOOK

71 ure 4.6(b), 3) fast falling transition but slow rising transition, with intermediate leakage, shown in Figure 4.6(c), and 4) fast rising transition but slow falling transition with intermediate leakage, shown in Figure 4.6(d). Although other possible trade-off points could be considered, we empirically found that these four points yield good optimization results and provide a sys- tematic approach for constructing all versions of a cell. In principle, using four possible trade-off points for each input combina- tion could result in as many as 16 (4x4) cell versions for a 2 input gate. However, in practice, many of the cell versions are shared between different input states. Also, in some cases not all 4 trade-off points are realizable and hence the total number of cell versions is significantly less. We illustrate this for the NAND2 gate for input state 00. The fastest cell version is again shown in Figure 4.6(a) and is shared for all input combinations, and the minimum leakage version is shown in Figure 4.6(e). Note that only one transistor needs to be set to high-Vt to achieve minimum leakage for this input state. This results from the fact that PMOS devices have negligible gate leakage in the target technology and only one transistor in a stack needs to be set to high-Vt to reduce the leakage through the entire stack. Hence, for the input state 00, only two trade-off points are needed and only one additional cell version is added to the library. Input state 10 again requires the assignment of only a single transistor to high-Vt for the minimum leakage version, as shown in Figure 4.6(f). This is due to the fact that the gate leakage through the top NMOS transistor n1 is negligible since its Vgs and Vgd is reduced to approximately one Vt drop. Only two trade-off points are therefore required for this input state and both ver- sions are shared with the 00 state. Finally, if the 01 state occurs in the circuit, the optimization will automatically perform input pin swapping for all but the fastest trade-off point, thereby resulting in no additional cell version. The NAND2 gate therefore requires a total of 5 cell versions to provide up to 4 trade-off points for each input state. In Table 4.2, we show the delay / leakage

Table 4.2. Trade-offs for different Vt-Tox versions of NAND2 gate Total Normalized Normalized leakage rise delay fall delay State Cell current [nA] pin A pin B pin A pin B Minimum delay (a) 270.4 1.00 1.00 1.00 1.00 Fast rise delay (d) 109.1 1.00 1.36 1.27 1.27 11 Fast fall delay (c) 91.4 1.36 1.36 1.00 1.00 Minimum leakage (b) 19.5 1.36 1.37 1.27 1.27 Minimum delay (a) 41.2 1.00 1.00 1.00 1.00 00 Minimum leakage (e) 14.0 1.00 1.00 1.12 1.16 Minimum delay (a) 91.8 1.00 1.00 1.00 1.00 10 Minimum leakage (f) 13.3 1.00 1.00 1.12 1.16 TLFeBOOK

72 trade-offs obtained for each input state using the described approach for the NAND2 gate. The same process can be applied to each cell in the library to construct the full set of cell versions for the leakage characterization method. Table 4.3, shows the number of cell version required for several common gates. Note that the number of cell version is higher for NOR gates than NAND gates. Since for a library the total number of cells would increase significantly, we also explored reducing the number of cells by allowing only two trade-off points for each cell (minimum delay, and minimum leakage), instead of 4 trade-off points. In this case, the number of cells for the NAND2 gate reduces to only 3 versions. The number of cell version required for two trade-off points for different cell types is shown in Table 4.3, column 3. In column 4, we add one more cell library version - two trade-off points with reduced num- ber of cells. In order to minimize the number of needed library cells, one or two cells of NOR2 or NOR3, respectively, are removed from library with small degradation of leakage/delay trade-off. Therefore, all gates have only three cells in this option. In Section 4.5.2 we compare the final leakage results using the full library with 4 trade-off points, the reduced library with only two trade-off points, and minimum number of cell library with two trade-off points. Finally, we consider Vt and Tox assignment control within stacks similar to the discussion for Vt stack control in Section 4.3.4. However, Tox assignment differs from Vt assignment in that the assignment of Tox to transistors in a stack is already uniform due to the use of pin-swapping. This is evident from the 5 added cell versions for the NAND2 in Figure 4.6, and can be easily shown to be true for all cell versions generated under the proposed approach. This is a significant advantage since spacing design rules for different Tox assignments are expected to be more severe that those for spacing between different Vt assignments [26]. However, the Vt assignment is not always uni- form as shown in Figure 4.6(e), where only a single transistor in a stack is assigned to high-Vt. In the event that a uniform stack is required, both transis- tors in the stack need to be set to high-Vt, resulting in a slightly worsened

Table 4.3. The number of needed library cells

2 trade-off points 4 trade-off points 2 trade-off points with reduced number of cells Inverter 5 3 3 NAND2 5 3 3 NAND3 5 3 3 NOR2 8 4 3 NOR3 9 5 3 TLFeBOOK

73

Gate tree

g1

g g State tree 2 2

s1

01 gm

s2 s2 01 01 Sol

sn sn 01 01 Sol

Figure 4.7. State tree with gate tree at each node delay / leakage trade-off. Leakage current comparison results between indi- vidual vs. uniform stack assignment control will be shown in Section 4.5.2.

4.4.3 Optimization - Approach and Heuristics In this section, we present an exact solution and two heuristics to the prob- lem of finding a simultaneous input state, high-Vt and thick-Tox assignments for a circuit under delay constraints. As mentioned, the leakage minimization problem can be formulated as a integer optimization problem under delay constraints. The size of the input state space is 2n, where n is the number of circuit inputs. As discussed in Section 4.4.2, for each input state assignment, there are up to four possible Vt-Tox assignments for each gate. Note that while the total number of cell versions can be larger than 4, only 4 of them need to be considered for each specific input state. For instance, for the NAND2 gate in Figure 4.6, only versions (a)-(d) are considered for a 11 input state. There- m fore, the total number of possible Vt-Tox assignments is 4 , where m is the number of gates in the circuit and the total size of the search space is 2n+2m. In order to find an exact solution to the problem, we extend the branch- and-bound method with Section 4.3.2. The branch and bound algorithm for Vt-Tox and state assignment uses two interdependent search trees: state tree and gate tree. The state tree is searched to determine the input state of the cir- cuit and the gate tree is searched to determine the Vt-Tox assignment of the circuit, as shown in Figure 4.7. The only difference from Section 4.3.2 is the gate tree. Each node in a particular gate tree corresponds to a gate in the cir- cuit. Since there are four possible Vt-Tox assignments for a gate, each node of the gate tree has four edges: minimum delay, minimum leakage, fast fall delay with intermediate leakage, and fast rise delay with intermediate leakage. The exponential nature of the problem makes it impossible to obtain an exact solu- TLFeBOOK

74 tion for substantial circuits, such as Isub minimization approach in Section 4.3.2. Therefore, we also use the two heuristics discussed in Section 4.3.3.

4.5 RESULTS

4.5.1 Subthreshold Leakage Reduction

The proposed methods for simultaneous state and Vt assignment were tested on the ISCAS benchmark circuits [27] and a 64-bit ALU circuit, syn- thesized using a 0.18Pm industrial library with . This technology has a difference of 14X (10X) in Isub and 16% (15%) in delay between low-Vt and high-Vt NMOS (PMOS) devices. The leakage current for each Vt version of a cell was computed using SPICE simulation and stored in precharacterized tables. Delay computation was performed based on the Synopsys table delay model and was verified to match with Synopsys timing analysis delay reports. In addition to the proposed methods, traditional methods using only state or Vt assignment were also implemented for comparison. The state-only assign- ment was implemented using the approach discussed in [25] while for Vt-only assignment a method similar to the sensitivity-based approach of [17] was used. Table 4.4 compares the leakage results obtained by the three proposed heuristics for three delay constraints to the average leakage computed using 10,000 random input vectors. The columns marked 0%, 5%, and 10% refer to leakage minimization results when the delay constraints were set at 0%, 5%, and 10% respectively, of the full delay range between all low-Vt and all high- Vt circuit delay, as illustrated in Figure 4.8. The 0% column is therefore the Table 4.4. Leakage current comparison between heuristics

Minimized leakage current [nA] (reduction factor: vs. average leakage current)

0% in low Vt/high Vt delay range 5% in low Vt/high Vt delay range 10% in low Vt/high Vt delay range Avg. Ileak by random Heuristic 1 Heuristic 2 Heuristic 1 Heuristic 2 Heuristic 1 Heuristic 2 (10K)vectors Ileak X Time Ileak X Ileak X Time Ileak X Ileak X Time Ileak X C432 32.9 7.7 4.3 1 4.3 7.7 4.9 6.7 1 3.6 9.2 4.7 7.0 1 3.6 9.1 C499 94.0 13.2 7.1 3 11.3 8.3 13.1 7.2 2 11.6 8.1 9.7 9.6 2 9.7 9.6 C880 73.4 9.7 7.5 4 8.9 8.3 8.9 8.2 3 8.3 8.8 8.9 8.3 4 8.3 8.8 C1355 85.1 19.0 4.5 3 12.7 6.7 14.6 5.8 3 11.7 7.3 12.0 7.1 3 11.0 7.7 C1908 82.8 19.0 4.3 2 15.1 5.5 15.5 5.3 2 12.2 6.8 13.4 6.2 2 10.3 8.0 C2670 162.5 12.7 12.8 58 12.5 13.0 12.7 12.8 55 12.4 13.1 14.3 11.3 55 12.2 13.3 C3540 173.1 20.1 8.6 10 16.4 10.6 20.5 8.4 10 14.6 11.8 17.4 10.0 9 14.5 11.9 C5315 309.1 26.4 11.7 169 25.9 11.9 27.5 11.2 164 25.2 12.3 28.5 10.9 165 25.2 12.2 C6288 451.5 157.5 2.9 47 153.9 2.9 145.5 3.1 44 141.4 3.2 135.8 3.3 43 128.4 3.5 C7552 385.8 31.0 12.4 330 30.6 12.6 30.8 12.5 330 30.1 12.8 30.7 12.6 328 29.6 13.0 alu64 332.3 46.0 7.2 405 43.6 7.6 47.2 7.0 408 44.5 7.5 43.0 7.7 406 42.0 7.9 AVG 7.6 8.6 8.0 9.2 8.5 9.6 TLFeBOOK

75

Delay with all low-Vt Delay with all high-Vt 0% 100%

5% 10% 50%

Figure 4.8. Delay point from all low-Vt to all high-Vt range most stringently constrained optimization as it corresponds to the best obtain- able delay for the circuit (no performance penalty). Note that a simple replacement of all low-Vt devices with all high-Vt ones would yield a ~20% circuit delay increase. Thus, when interpreting the results in this section, a 10% delay point indicates that the circuit after Vt assignment has a delay that is approximately 2% larger than the original fastest implementation. Since the average leakage current with 10,000 random input vectors is computed with all low-Vt transistors, it also corresponds to a 0% delay criteria. Runtimes for heuristic 1 are given in Table 4.4 in seconds. Heuristic 2 was limited to a runt- ime of 1800 seconds (30 minutes). We report the reduction factor relative to the average leakage current over the 10,000 random vectors. Heuristic 2 has ~10% lower leakage results than heuristic 1 at 5% delay point across the benchmark circuits. However, heuristic 2 has a 4-5X runtime overhead for large circuits (~1000X for small circuits) over heuristic 1. In Table 4.5, we compare the proposed approach with traditional tech- niques, including state-only and Vt-only assignment methods. The state-only

Table 4.5. Leakage current comparison with traditional techniques

Circuits Minimized leakage current [nA]

Avg. Vt only & proposed heuristic (reduction factor: vs. average leakage current) Number of State-only Ileak by assignment 0% in the delay range 5% in the delay range 10% in the delay range random (10K) Vt-only Heuristic 1 Vt-only Heuristic 1 Vt-only Heuristic 1 Input Gate vectors Ileak X Ileak X Ileak X Ileak X Ileak X Ileak X Ileak X C432 36 177 32.9 26.3 1.25 30.8 1.1 7.7 4.3 29.5 1.1 4.9 6.7 29.2 1.1 4.7 7.0 C499 41 519 94.0 86.1 1.09 85.0 1.1 13.2 7.1 57.2 1.6 13.1 7.2 40.4 2.3 9.7 9.6 C880 60 364 73.4 63.7 1.15 64.6 1.1 9.7 7.5 63.9 1.1 8.9 8.2 20.2 3.6 8.9 8.3 C1355 41 528 85.1 81.4 1.04 94.0 0.9 19.0 4.5 65.1 1.3 14.6 5.8 53.4 1.6 12.0 7.1 C1908 33 432 82.8 74.6 1.11 67.0 1.2 19.0 4.3 46.5 1.8 15.5 5.3 30.3 2.7 13.4 6.2 C2670 233 825 162.5 146.2 1.11 44.7 3.6 12.7 12.8 39.7 4.1 12.7 12.8 27.8 5.8 14.3 11.3 C3540 50 940 173.1 155.7 1.11 161.9 1.1 20.1 8.6 148.4 1.2 20.5 8.4 82.4 2.1 17.4 10.0 C5315 178 1627 309.1 283.1 1.09 290.6 1.1 26.4 11.7 289.7 1.1 27.5 11.2 108.3 2.9 28.5 10.9 C6288 32 2470 451.5 412.4 1.09 417.0 1.1 157.5 2.9 259.5 1.7 145.5 3.1 233.0 1.9 135.8 3.3 C7552 207 1994 385.8 352.3 1.10 360.2 1.1 31.0 12.4 353.5 1.1 30.8 12.5 350.9 1.1 30.7 12.6 alu64 131 1803 332.3 294.5 1.13 312.8 1.1 46.0 7.2 288.5 1.2 47.2 7.0 230.1 1.4 43.0 7.7 Avg. 1.12 1.3 7.6 1.6 8.0 2.4 8.5 TLFeBOOK

76 assignment method was limited to a runtime of 1800 seconds (30 minutes). The results demonstrate that substantial improvement in standby leakage cur- rent can be obtained using the proposed methods, with an average improvement of ~80% (5-6X) for the 0% and 5% delay constraints over Vt- only assignment. Table 4.6 compares leakage current results for both individual and uni- form stack control. Since uniform stack control degrades the delay/leakage trade-off as discussed in Section 4.3.4, the results for uniform stack assign- ment exhibit less leakage reduction than those of individual stack control. It is interesting to note, however, that the leakage current degradation by moving to a less fine-grained threshold voltage assignment scheme is not overly large implying that even with manufacturing constraints, the proposed technique provides significant leakage savings. Finally, Figure 4.9 plots the leakage results for the proposed method and the two traditional methods as a function of the delay constraint for circuit c6288. The optimization was performed for a range of delay constraints. The proposed method provides its largest improvements at tight delay constraints. This is due to the fact that, as the delay constraint becomes looser, more tran- sistors can be set to high-Vt in both approaches, and the relative advantage of the proposed approach reduces. However, leakage reduction is most challeng- ing under tight performance constraints at which the proposed technique holds promise.

Table 4.6. Leakage current comparison between individual and uniform stack control.

Minimized leakage current [nA]

5% in low Vt/high Vt delay range (reduction factor: vs. average leakage current) Average Heuristic 1 V -only assignment I by t leak Individual control Uniform control random (10K)vector Ileak X Ileak X Ileak X C432 32.9 29.5 1.1 4.9 6.7 6.8 4.8 C499 94.0 57.2 1.6 13.1 7.2 12.5 7.5 C880 73.4 63.9 1.1 8.9 8.2 9.1 8.1 C1355 85.1 65.1 1.3 14.6 5.8 23.7 3.6 C1908 82.8 46.5 1.8 15.5 5.3 15.7 5.3 C2670 162.5 39.7 4.1 12.7 12.8 12.9 12.6 C3540 173.1 148.4 1.2 20.5 8.4 24.1 7.2 C5315 309.1 289.7 1.1 27.5 11.2 28.5 10.9 C6288 451.5 259.5 1.7 145.5 3.1 163.1 2.8 C7552 385.8 353.5 1.1 30.8 12.5 31.3 12.3 alu64 332.3 288.5 1.2 47.2 7.0 44.6 7.5 Avg. 1.6 8.0 7.5 TLFeBOOK

77

450 Average Current with Low-V t

400 State Assignment Only with Low-Vt Dual-Vt Assignment only 350 Our proposed method - Heuristic 1 State Assignment Only with High-V t 300

250

200

150

100 Total Leakage Current [nA] Current Leakage Total

50

0 0 102030405060708090100 Delay Point from All Low-V to All High-V Range [%] t t

Figure 4.9. Leakage current comparison for c6288

4.5.2 Leakage Reduction for both Subthreshold and Gate Leakage

The proposed methods for simultaneous state, Vt and Tox assignment were implemented on a number of benchmark circuits [27] synthesized using a library based on a predictive 65nm process [20]. In this technology, the differ- ence in Igate for the thick-oxide NMOS devices vs. the thin-oxide device is 11X, whereas Isub is reduced by 17.8X (16.7X) when replacing a low-Vt NMOS (PMOS) device with a high-Vt version. Table 4.7 shows relative leak- age and delay values at the four possible Vt and Tox assignments for NMOS devices in this technology. A comparison of our first and second heuristics along with average leakage computed using 10,000 random input vectors is shown in Table 4.8. The total leakage current value is given in PA and runt- ime is given in seconds. In heuristic 2, we set the runtime limit as 1800

Table 4.7. Comparison of leakage and delay between four possible Vt-Tox assignment for NMOS

Assignment Normalized values

Leakage Vt Oxide thickness Delay Isub Forward Igate Reverse Igate Low Thin 1.00 0.41 0.22 1.00 High Thin 0.06 0.31 0.22 1.33 Low Thick 0.73 0.04 0.00 1.26 High Thick 0.05 0.03 0.02 1.69 TLFeBOOK

78

Table 4.8. Leakage current comparison between heuristics with 4-option, individual stack control library

Average 0% in the best-worst delay range 5% in the best-worst delay range 10% in the best-worst delay range Ileak by random Heu1 Heu2 Heu1 Heu2 Heu1 Heu2 (10K) vectors Ileak X Time Ileak X Ileak XTimeIleak X Ileak XTimeIleak X c432 24.5 8.2 3.0 3 5.4 4.6 7.7 3.2 2 3.2 7.6 5.5 4.5 2 3.0 8.2 c499 65.8 32.2 2.0 7 31.1 2.1 26.1 2.5 7 24.6 2.7 22.7 2.9 6 20.8 3.2 c880 50.1 10.3 4.9 8 9.2 5.5 8.5 5.9 7 8.3 6.1 8.5 5.9 7 7.0 7.1 c1355 70.8 20.4 3.5 8 20.4 3.5 15.8 4.5 6 13.1 5.4 9.9 7.1 6 9.9 7.1 c1908 56.7 17.4 3.3 5 16.9 3.4 14.8 3.8 4 13.6 4.2 13.2 4.3 5 10.5 5.4 c2670 104.7 14.9 7.0 82 14.7 7.1 12.3 8.5 78 12.2 8.6 13.5 7.8 78 11.3 9.3 c3540 128.5 27.7 4.6 20 23.7 5.4 22.1 5.8 18 19.9 6.4 18.6 6.9 17 17.4 7.4 c5315 221.2 36.6 6.0 219 35.9 6.2 30.0 7.4 213 30.0 7.4 28.4 7.8 202 27.6 8.0 c6288 346.8 153.6 2.3 75 146.0 2.4 112.2 3.1 64 101.4 3.4 84.1 4.1 59 75.6 4.6 c7552 270.0 34.9 7.7 410 33.4 8.1 32.2 8.4 404 31.8 8.5 30.3 8.9 399 30.2 8.9 alu64 260.0 48.7 5.3 468 46.8 5.6 43.4 6.0 464 41.6 6.3 34.3 7.6 458 33.1 7.9 AVG 4.5 4.9 5.4 6.0 6.2 7.0 seconds (30 minutes). The average leakage computed using the random vec- tors can be used to approximate the standby mode leakage if state assignment as well as dual-Vt and dual-Tox techniques were not employed. Again, the delay range points used in all results are defined by a percentage of the maxi- mum possible delay that is associated with moving from an all low-Vt and thin-oxide design to an all high-Vt and thick-oxide implementation. Note that a simple replacement of all fast devices with their slowest counterparts would yield a ~70% circuit delay increase. Thus, when interpreting the results in this section, a 5% delay point indicates that the circuit after Vt and Tox assignment has a delay that is approximately 4% larger than the original fastest implementation. As shown in Table 4.8, heuristic 2 generally provides somewhat better results but at much greater runtimes. On average, heuristic 2 provides ~10% lower leakage current than heuristic 1 across these benchmarks at the 5% delay point, similar to the results in Section 4.5.1. The improvement of the two proposed heuristics compared to the average leakage without state, Vt or Tox assignment is dramatic and approaches 7X at the 10% delay point in the best-worst delay range. More aggressively, with just a 5% delay penalty the reduction in total standby leakage is 5.3-6X with a maximum improvement of 8.6X for heuristic 2 in circuit c2670. In Table 4.9 we compare our results to other standby mode techniques, including state assignment alone and simultaneous state and Vt assignment (as in the previous section). The total leakage current value is given in PA. Again, we report the reduction factor in relation to the average leakage current with 10,000 random vectors for consistency. We first point out that state assign- TLFeBOOK

79

Table 4.9. Leakage current comparison with 4-option, individual stack control library

Average State 0% in the delay range 5% in the delay range 10% in the delay range Ileak by Assignment random Only Vt & State Heu1 Vt & State Heu1 Vt & State Heu1 (10K) vectors Ileak X Ileak X Ileak X Ileak X Ileak X Ileak X Ileak X c432 24.5 22.7 1.08 13.3 1.8 8.2 3.0 12.5 2.0 7.7 3.2 12.7 1.9 5.5 4.5 c499 65.8 63.9 1.03 41.9 1.6 32.2 2.0 35.7 1.8 26.1 2.5 32.2 2.0 22.7 2.9 c880 50.1 46.0 1.09 18.9 2.6 10.3 4.9 17.5 2.9 8.5 5.9 16.9 3.0 8.5 5.9 c1355 70.8 67.4 1.05 39.9 1.8 20.4 3.5 33.0 2.1 15.8 4.5 29.8 2.4 9.9 7.1 c1908 56.7 54.8 1.04 27.6 2.1 17.4 3.3 25.8 2.2 14.8 3.8 22.9 2.5 13.2 4.3 c2670 104.7 101.4 1.03 33.3 3.1 14.9 7.0 32.7 3.2 12.3 8.5 31.9 3.3 13.5 7.8 c3540 128.5 121.8 1.05 54.5 2.4 27.7 4.6 51.5 2.5 22.1 5.8 48.5 2.7 18.6 6.9 c5315 221.2 215.1 1.03 81.2 2.7 36.6 6.0 77.1 2.9 30.0 7.4 73.7 3.0 28.4 7.8 c6288 346.8 306.7 1.13 209.3 1.7 153.6 2.3 180.4 1.9 112.2 3.1 153.7 2.3 84.1 4.1 c7552 270.0 262.6 1.03 88.9 3.0 34.9 7.7 86.6 3.1 32.2 8.4 86.1 3.1 30.3 8.9 alu64 260.0 237.2 1.10 90.7 2.9 48.7 5.3 86.1 3.0 43.4 6.0 81.1 3.2 34.3 7.6 AVG 1.06 2.3 4.5 2.5 5.4 2.7 6.2 ment alone, which we accomplish by searching the state tree only, achieves very little improvement in standby mode leakage; about 6%. By adding Vt assignment, the algorithm of the first proposed method shows an average reduction of 58% beyond state assignment alone at a 5% delay point. The full Vt, Tox, and state assignment approach provides an additional 53% reduction in current beyond state and Vt assignment for the 5% delay point. Table 4.10 provides a comparison of results using the various cell library options; 4 and 2 trade-off points with individual stack control, and also with uniform stacks. The main result in Table 4.10 is that there is very little leak-

Table 4.10. Leakage current comparison between cell library options (current unit: PA)

5% in the best-worst delay range

Average Individual stack control Uniform stack control Ileak by random 2-option 2-option (10K) 4-option 2-option 3 cell versions 4-option 2-option 3 cell versions vectors only only

Ileak X Ileak X Ileak X Ileak X Ileak X Ileak X c432 24.5 7.7 3.2 7.4 3.3 7.1 3.4 7.3 3.4 7.9 3.1 8.6 2.8 c499 65.8 26.1 2.5 26.7 2.5 27.8 2.4 26.0 2.5 28.0 2.3 28.9 2.3 c880 50.1 8.5 5.9 9.7 5.2 8.0 6.3 10.0 5.0 10.7 4.7 10.8 4.6 c1355 70.8 15.8 4.5 16.2 4.4 14.1 5.0 23.4 3.0 25.2 2.8 23.9 3.0 c1908 56.7 14.8 3.8 14.9 3.8 14.3 4.0 15.9 3.6 15.3 3.7 16.8 3.4 c2670 104.7 12.3 8.5 12.1 8.7 12.4 8.4 16.1 6.5 15.4 6.8 16.5 6.3 c3540 128.5 22.1 5.8 24.2 5.3 25.3 5.1 27.1 4.7 25.8 5.0 29.2 4.4 c5315 221.2 30.0 7.4 30.9 7.2 30.7 7.2 32.1 6.9 32.9 6.7 33.8 6.6 c6288 346.8 112.2 3.1 114.2 3.0 114.2 3.0 134.0 2.6 147.8 2.3 145.4 2.4 c7552 270.0 32.2 8.4 31.4 8.6 30.6 8.8 31.8 8.5 31.1 8.7 31.1 8.7 alu64 260.0 43.4 6.0 44.0 5.9 43.2 6.0 42.0 6.2 47.0 5.5 46.1 5.6 AVG 5.4 5.3 5.4 4.8 4.7 4.6 TLFeBOOK

80 age current penalty when moving from a full 4-option library to a simpler 2- option library. There are several cases where the smaller library outperforms the larger library due to the heuristic nature of the algorithm used (heuristic 1 is used in this table). Since the library size required in the 2-option scenario is roughly half that of 4-option, we conclude that the use of 2-option represents a very good trade-off between library complexity and potential leakage reduc- tion. Moreover we can see that the simplest cell library of 2-option with a reduced number of cells provides good leakage reduction results. In general, a reduced number of cells degrades the leakage/delay trade-off as discussed in Section 4.4.2. However we find that only complex, and infrequently used cells, such as 3-input NORs require appreciable reductions in cell variants which limits the impact on total leakage reduction. Therefore, very good leak- age current minimization can be obtained even with libraries with 3 cell versions for each cell. Also, the restriction that each stack of transistors must use the same Vt and Tox is shown in Table 4.10 to have only a minor impact on leakage. For instance, the uniform stack 4-option case shows a 10.6% average power increase compared to the individual stack 4-option case; this still repre- sents a nearly 5X reduction in standby leakage compared to the average case. Note that library complexity is not reduced in moving from individual to stack-based control; such a change would be dictated by manufacturing issues as well as the trade-off between standby power (lower for individual control) and cell area (expected to be slightly lower for stack-based control). Finally, Figure 4.10 plots the leakage current results for the proposed method and traditional methods as a function of the delay constraint for cir-

350 Average Current with Low-V /Thin-T t ox State Assignment Only with Low-V /Thin-T 300 t ox Dual-Vt & State Assignment Our proposed method - Heuristic 1 State Assignment Only with High-V /Thick-T 250 t ox

200

150

100 Total Leakage Current [uA] Leakage Current Total 50

0 0 102030405060708090100 Delay Point from the best to the worst range [%]

Figure 4.10. Leakage current comparison for c6288 TLFeBOOK

81 cuit c6288. Here, a 100% delay point implies a complete replacement of low- Vt and thin-oxide devices with high-Vt and thick-oxide. This is clearly the lowest leakage solution but is also very slow. The key point in Figure 4.10 is that the proposed approaches (heuristic 2 results are not shown but are nearly identical to heuristic 1) provide substantial improvement beyond the average leakage or the use of state assignment alone and that these gains are achiev- able with very small and even zero delay penalties. The rapid saturation of the gains as the delay point increases beyond 10% implies that the new approach is best suited for achieving low-leakage standby states with very little perfor- mance overhead (e.g., 5% or even less). Note that the leakage current achieved by our proposed method does not converge to that by state assign- ment using all high-Vt and thick-oxide devices. The reason is that the selected library cells include only a limited number of thick-oxide assignments in order to simplify the library. Many additional library cells would be needed to achieve convergence to the minimal leakage solution; instead the bulk of this leakage savings can be achieved with very little performance penalty.

4.6 CONCLUSIONS

In this paper, we propose new approaches for standby leakage current minimization under delay constraints. Our approaches use simultaneous state assignment and Vt or Vt / Tox assignment. Efficient methods for computing the simultaneous state and Vt or Vt / Tox assignments leading to the minimum standby mode leakage current were presented. The proposed methods were implemented and tested on a set of synthesized benchmark circuits. Using the new state and Vt assignment technique demonstrates 6X lower leakage than previous Vt-only assignment approaches and 5X lower than state assignment alone (at 5% delay point). In cases where gate leakage is prominent, as in 90nm CMOS technologies, these improvements are increased by an addi- tional factor of 2 using state and Vt / Tox assignment. We also investigate the leakage/complexity trade-off for various cell library configurations and dem- onstrate that results are still very good even when only 2 additional variants are used for each cell type.

Acknowledgement The authors would like to thank Harmander Deogun for his work in leak- age current model. The work has been supported by NSF, SRC, GSRC/ DARPA, IBM, and Intel. TLFeBOOK

82 References

[1] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu and J. Yamada, “1-V power supply high-speed digital circuit technology with multithreshold voltage CMOS,” IEEE Journal of Solid-State Circuits, vol. 30, pp. 847-854, Aug. 1995. [2] J. Kao, A. Chandrakasan, and D. Antoniadis, “Transistor sizing issues and tool for multi-threshold CMOS technology,” Proc. Design Automation Conference, pp. 409-414, 1997. [3] S. Shigematsu, S. Mutoh, Y. Matsuya, Y. Tanabe and J. Yamada, “A 1-V high- speed MTCMOS circuit scheme for power-down application circuits,” IEEE Journal of Solid-State Circuits, vol. 32, pp. 861-869, June 1997. [4] H. Kawaguchi, K. Nose and T. Sakurai, “A super cut-off CMOS (SCCMOS) scheme for 0.5V supply voltage with picoampere standby current,” IEEE Journal of Solid-State Circuits, vol. 35, pp. 1498-1501, October 2000. [5] R. X. Gu and M. I. Elmasry, “Power dissipation analysis and optimization of deep submicron CMOS digital circuits,” IEEE Journal on Solid-State Circuits, vol. 31, no. 5, pp. 707-713, May 1996. [6] Z. Chen, M. C. Johnson, L. Wei and K. Roy, “Estimation of standby leakage power in CMOS circuit considering accurate modeling of transistor stacks,” Proc. International Symposium on Low Power Electronics Design, pp. 239-244, 1998. [7] J. Halter and F. Najm, “A gate-level leakage power reduction method for ultra- low-power CMOS circuits,” Proc. CICC, pp. 475-478, 1997. [8] V. De, Y. Ye, A. Keshavarzi, S. Narendra, J. Kao, D. Somasekhar, R. Nair and S. Borkar, “Techniques for leakage power reduction,” in Design of High-Perfor- mance Microprocessor Circuits, New York: IEEE Press, 2001. [9] M.C. Johnson, D. Somasekhar and K. Roy, “Models and algorithms for bounds on leakage in CMOS circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 18, pp. 714-725, June 1999. [10] A. Fadi, S. Hassoun, K. A. Sakallaha and D. Blaauw, “Robust SAT-based search algorithm for leakage power reduction,” Proc. International Workshop on Power and Timing Modeling, Optimization and Simulation, 2002. [11] D. Lee, W. Kwong, D. Blaauw and D. Sylvester, “Analysis and minimization techniques for total leakage considering gate oxide leakage,” Proc. Design Auto- mation Conference, pp. 175-180, 2003. [12] R.S. Guindi and F.N. Najm, “Design techniques for gate-leakage reduction in CMOS circuits,” Proc. ISQED, pp.61-65, 2003. [13] F. Hamzaoglu and M.R. Stan, “Circuit-level techniques to control gate leakage for sub-100nm CMOS,” Proc. International Symposium on Low Power Electron- ics and Design, pp. 60-63, 2002. TLFeBOOK

83 [14] T. Inukai, M. Takamiya, K. Nose, H. Kawaguchi, T. Hiramoto and T. Sakurai, “Boosted Gate MOS (BGMOS): Device/circuit cooperation scheme to achieve leakage-free giga-scale integration,” Proc. Custom Integrated Circuit Confer- ence, pp. 409-412, 2000. [15] Q. Wang and S.B.K. Vrudhula, “Static power optimization of deep submicron CMOS circuits for dual Vt technology,” International Conference on Computer- Aided Design, pp. 490-496, 1998. [16] L. Wei, Z. Chen, M. C. Johnson, K. Roy and V. De, “Design and optimization of low voltage high performance dual threshold CMOS circuits,” Proc. Design Automation Conference, pp. 489-494, 1998. [17] S. Sirichotiyakul, T. Edwards, C. Oh, R. Panda and D. Blaauw, “Duet: an accu- rate leakage estimation and optimization tool for dual Vt circuits,” IEEE Transac- tions on Very Large Scale Integration (VLSI) Systems, vol. 10, pp. 79-90, April 2002. [18] M. Ketkar and S. Sapatnekar, “Standby power optimization via transistor sizing and dual threshold voltage assignment,” Proc. ICCAD, 2002, pp. 375-378. [19] S. Stiffler, “Optimizing performance and power for 130nm and beyond,” IBM Technology Group New England Forum, 2003. [20] International Technology Roadmap for Semiconductors, 2002. [21] N. Yang, W. K. Henson, and J. J. Wortman, “A comparative study of gate direct tunneling and drain leakage currents in N-MOSFETs with sub-2nm gate oxides,” IEEE Trans. Electron Devices, vol. 47, pp. 1636-1644, Aug. 2000. [22] B. Yu, H. Wang, C. Riccobene, Q. Xiang and M.-R. Lin “Limits of gate oxide scaling in nano-transistors,” Proc. Symposium on VLSI Tech., pp. 90-91, 2000. [23] Y.-C. Yeo, Q. Lu, W.-C. Lee, T.-J. King, C. Hu, X. Wang, X. Guo and T. P. Ma, “Direct tunneling gate leakage current in transistors with ultra thin silicon nitride gate dielectric,” IEEE Electron Device Letters, vol. 21, pp. 540-542, Nov. 2000. [24] Q. Xiang, J. Jeon, P. Sachdey, B. Yu, K. C. Saraswat and M.-R. Lin, “Very high performance 40nm CMOS with ultra-thin nitride/oxynitride stack gate dielectric and pre-doped dual poly-Si gate electrodes,” Proc. International Electron Devices Meeting, pp. 860-862, 2000. [25] H. Kriplani, F. N. Najm and I. N. Hajj, “Pattern independent maximum current estimation in power and ground buses of CMOS VLSI circuits: algorithms, sig- nal correlations, and their resolution,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 14, pp. 998-1012, Aug. 1995. [26] Ruchir Puri, IBM T.J. Watson Research, personal communication. [27] F. Brglez and H. Fujiwara, “A Neutral Netlist of 10 Combinatorial Benchmark Circuits”, Proc. ISCAS, 1985, pp.695-698. TLFeBOOK

84

Chapter 5

ENERGY-EFFICIENT SHARED MEMORY ARCHITECTURES FOR MULTI-PROCESSOR SYSTEMS-ON-CHIP

Kimish Patel1, Alberto Macii1 and Massimo Poncino2 1 2 Politecnico di Torino; Universit`a di Verona

Abstract Most current multi-processor systems-on-chip (MPSoC) platforms do rely on a shared-memory architectural paradigm. The shared memory, typically used for storage of shared data, is a significant performance bottleneck because it re- quires explicit synchronization of memory accesses which can potentially occur in parallel. Multi-port memories are a widely-used solution to this problem; they allow these potentially parallel accesses to occur simultaneously. However, they are not very energy-efficient, since their performance improvement comes at an increased energy cost per access. We propose an energy-efficient architecture for the shared memory that can be used as an alternative to multi-port mem- ories, and combines their performance advantage with a much smaller energy cost. The proposed scheme is based on the application-driven partitioning of the shared address space into a multi-bank architecture. This optimization can be used to quickly explore different power-performance tradeoffs, thanks to simple analytical models of performance and energy. Experiments on a set of paral- lel benchmarks show energy-delay product (EDP) savings of 50% on average, measured on a set of standard parallel benchmarks. Keywords: Multi-Processor Systems, Shared Memory, Systems-on-Chip.

5.1 INTRODUCTION Modern design paradigms for MPSoCs are pushing towards architectures which are fully distributed and that work as general networks, based on a mod- ular layered architecture, and that are able to support non-deterministic com- munications. Such architectures, called Networks-on-Chips (NoCs) [1], have been devised as an answer to the scaling of SoC complexity, especially in terms of the increased number of hosted processing elements, and of the decreased reliability of the communication medium. TLFeBOOK

85 In spite of these scalability challenges, most current SoCs are still based on a shared-medium architecture, and, consequently on a shared-memory paradigm. One reason for this slow migration to more complex architectures is cost. Shared on-chip buses represent a convenient, low-overhead interconnection, and they do not require special handling during the physical design flow. Another reason is a consequence of the limited support provided by system software for such architectures. Although current silicon technology allows to build SoCs with a large number of embedded cores, the capabilities currently offered by the embedded software (e.g., in terms of OS primitives) does not allow to fully exploit all the potential computational power; therefore, most implementations of SoC consist of few (seldom more than 16) processor cores, for which a shared interconnect is perfectly suitable. The architecture of these MPSoC platforms is thus reminiscent of tradi- tional multi-processor systems, where inter-processor communication and/or synchronization is provided through the exchange of data through shared mem- ories of different types. Generally speaking, accessing the shared memories are significantly slower than accesses to local ones. First, they are placed farther away from the processors than private memories; in fact, the latter are often tightly coupled to the cores by means of dedicated local buses, while shared memories are forcedly connected to a shared bus. Moreover, accesses to the shared buses by the processors requires some form of arbitration, which may require the insertion of wait cycles in case of simultaneous accesses. As a consequence, the shared memories tend to become a major bottleneck for the bandwidth of the overall system, especially for applications in which parallelism is built around shared data. Caching of shared data might be a solution, but it raises the well-know issue of cache coherence, i.e., the possible inconsistence between data stored in caches of different processors. Cache coherence can be solved in hardware, yet with an extra overhead that may not be affordable in small-scale, low-cost SoC as those considered in this work. Software-based cache coherence is also a viable solution, but it essentially consists of limiting the caching of shared data to safe times [2]. For applications in which parallelism is built around shared data, this basically amounts to avoid caching of shared data. In this paper, this will be our assumption: all accesses to shared data will always imply an access to the shared memory. Providing sufficient memory bandwidth to sustain fast program execution and data communication/transfers is mandatory for most embedded applications. Increasing memory bandwidth can be achieved by making use of different types of on-chip embedded memories, which provide shorter latencies and wider interfaces [3–5]. One typical solution used to match the computational bandwidth with that of memory is to use multi-port memories. This solution increases the sustainable bandwidth by construction, since a P -port memory TLFeBOOK

86 allows in fact up to P accesses in parallel (i.e., in a single memory cycle). Therefore, by properly choosing the number of ports of the memory versus the number of processors, the issue of synchronization of simultaneous accesses can be easily solved. The adoption of multi-port memories, however, comes at the price of a sig- nificant increase in area, wiring resources, and energy consumption. On the other hand, architectures based on multi-port memories seem to be the only viable option in the cases where bandwidth optimization has absolute priority. In this work we propose an alternative architecture for the shared memory which combines the advantages, in terms of bandwidth, of the multi-port ap- proach, with the advantages, in terms of energy consumption and access time, of partitioned memories [5]. We propose the use of small, single-port mem- ory blocks as a way to achieve memory bandwidth increase together with low energy demand. In our scheme, the memory addressing space is mapped over single-port banks that can be simultaneously accessed by different processors, so as to mimic for a large fraction of the execution time the behavior of a dual- port memory. Energy efficiency is enforced by two facts: First, the single-port blocks have an energy access cost which is smaller than that of monolithic (either single or dual-port) memories; second, address mapping is application- driven, and cell access frequency data is thus used to determine the optimal sizes of the memory blocks. Based on analytical expressions for performance and energy consumption that allow to explore the energy-performance tradeoff, we present experimental results showing that the new architecture guarantees energy savings as high as 69% with respect to a dual-port memory configuration (54% with respect to the baseline, single-ported architecture), with comparable improvement of the memory bandwidth. The rest of the chapter is organized as follows. Section 5.2 provides some background material on memory energy modeling, multi-port memories, and application-driven memory partitioning. Section 5.3 describes how partitioned memories can be used to achieve an energy-efficient shared memory architec- ture. Section 5.4 illustrates the analytical models used to drive the energy- performance exploration engine, which is discussed in Section 5.5. Section 5.6 presents the optimization results for a set of standard parallel applications. Fi- nally, some concluding remarks are provided in Section 5.7.

5.2 BACKGROUND 5.2.1 Modeling Memory Energy Unlike generic hardware modules, the energy consumption of memories is basically independent of the input activity. What matters, in fact, is whether we are reading or writing a value from or to the memory, regardless of the value. TLFeBOOK

87 This property allows to model memory energy consumption in an very abstract way, by explicitly exposing two independent variables affecting it: the cost of an access and the total number of accesses. This translates into the following formula: ctot etot = ei (1) i=1 where ctot is the total number of memory accesses, and ei is the cost of each access. For the sake of simplicity, we equally weigh all accesses (i.e., we do not distinguish the cost of a read from that of a write). Equation 1 exposes the two quantities we can consider to reduce the energy consumption of a memory system and will be used throughout the paper as a reference. Techniques for reducing memory energy can be thus classified according to which variable is optimized [6].

5.2.2 Multi-Port Memories A multi-port memory is simply a memory that allows multiple simultaneous accesses for reads and writes to any location in memory. Multi-port memo- ries are typically employed as shared memories in multiprocessor designs, and are especially popular as dual-ended FIFO buffers for bus interfacing, or for video/graphics buffering. Multiple simultaneous accesses are made possible by duplicating some of the resources required to access a cell: the address and data pins, the word-lines, and the bit-lines. Figure 5.1 shows the structure of a typical dual-port SRAM cell, and in particular the extra word-line (with the corresponding transistors) and extra bit-line.

    

        

Figure 5.1. Structure of a Dual-Port SRAM Cell.

In some devices, additional overhead is also required to handle the synchro- nization of multiple writes to the same cell; this is managed through a sort of hardware semaphore which serializes the concurrent accesses. The increase in bandwidth provided by multi-port memories comes at the price of increased area, wiring resources and power consumption. Because of this considerable overhead, multi-port memories are usually limited to a TLFeBOOK

88 few ports (often 2, and seldom more than 4). One noticeable exception is represented by register files (although they are not strictly SRAMs), that are typically highly multi-ported (even 16 or more ports) to provide very high bandwidth in superscalar processors. Multi-port memories can also be characterized by the flexibility of the ports. In some memory devices, some of the ports can be specialized, i.e., they allow only some type of access (read or write). This fact can be expressed by writing the number of ports P = pr + pw + prw, where the three terms denote the number of read, write, and read/write ports, respectively. In this work, without loss of generality, we will assume that pr = pw =0,andprw = P , that is, all ports can be used for any type of access at any time. When analyzing multi-port memories from the energy point of view, we must take into account the two following non-idealities, supported by data from several multi-port memory providers ([7],[8],[9]).

a) Energy consumption of multi-port memories does not scale linearly with the number of ports. For instance, the energy cost for accessing a dual- port memory is more than twice the energy required for accessing a single-port memory of the same size.

b) When a multi-port memory is used as a shared memory in a multiproces- sor system, there are cases in which not all the ports are used simulta- neously. It may in fact happen that the access pattern of the application does not allow to group a set of accesses (from the processors) into a single, multi-port access. In these cases, we must consider the fact that energy consumption does not scale linearly with the number of ports that are accessed simultaneously. For instance, the energy cost for accessing a single port in a dual-port memory is larger than the one for accessing a single-port memory of the same size.

With reference to the model of Equation 1, the use of multi-port memories reduces ctot, but it implies a sizable increase of the access cost ei. 5.2.3 Application-Driven Memory Partitioning Partitioning a memory block into multiple blocks, based on the memory ac- cess profile, was originally proposed by Benini et al. [10]. Their technique exploits the fact that, due to the high locality exhibited by embedded applica- tions, the distribution of memory references is not uniform. As a consequence, some memory locations will be accessed more frequently than others. The partitioning is realized by splitting the address space (stored onto a single, monolithic memory block) into non-overlapping contiguous sub-spaces (stored onto several, smaller memory blocks). TLFeBOOK

89 Reduction of energy consumption is achieved because of two facts. First, each block is smaller than the monolithic one, and thus it has a smaller access cost (ei). Second, and more relevant, only one of the blocks is active at a time. By properly partitioning the address space, it should be possible to access the smallest blocks most of the times, and access the largest ones only occasionally. The original implementation of [10] employs a sophisticated recursive algo- rithm to determine the optimal partition with an arbitrary granularity. In this work, we will exploit their idea, yet without employing the same partitioning engine. As a matter of fact, in our case partitioning is driven by the access patterns of more than one processor. Memory partitioning specifically targets the reduction of the access cost ei, and it does not change ctot, since it does not modify the access patterns. 5.3 PARTITIONED SHARED MEMORY ARCHITECTURE The target MPSoC architecture considered is this work is depicted in Fig- ure 5.2. Each processor core has a cache and a private memory (PM) containing private data and code, which is accessed through a local bus. Processors are also connected to another memory (SM), through a common global bus con- taining the data that are shared between the various threads executing on the processors. We do not consider here other types of interconnections, such as point-to-point ones (i.e., crossbars).



   



gure 5.2. Generic Architectural Template.

In this work, starting from the assumption that the shared memory is imple- mented as a conventional on-chip, single-port memory, we aim at improving the performance of the accesses to the shared memory, yet in a more energy- efficient way than resorting to a multi-port memory. The proposed shared memory architecture combines the bandwidth advan- tages of multi-port memories (and thus the reduction of ctot) with the advan- TLFeBOOK

90 tages, in terms of energy consumption and access time, of partitioned, single- port memories (and thus the reduction of ei). In our scheme, the memory address space is mapped over single-port banks that can be simultaneously accessed by the different processors, so as to mimic the behavior of a multi-port memory for a large fraction of the execution time. Each bank covers a subset of the address space, with no replication of memory words; therefore, the address sub-spaces are non-overlapping. The latter issue is essential to understand why the partitioned scheme can only approach the performance of the multi-port architecture. Since the memory blocks are single- ported and contain non-overlapping subsets of addresses, simultaneous accesses from the processor can be parallelized only if they fit into different memory blocks. Otherwise, the potentially parallel access must take place into two consecutive memory cycles. Energy efficiency is enforced by two facts: First, the single-port blocks have an energy access cost which is by far smaller than that of monolithic (either single or dual-port) memories; second, address mapping is application-driven, and it accounts thus for the cell access frequency to determine the size of the memory blocks which is most suitable for memory minimization. In the following, we will restrict our analysis to systems with two proces- sors. Consequently, we will consider dual-port memories, and the partitioned architecture will also consists of two blocks at most. Although the concepts that will be discussed apply in principle to an arbitrary number of processors (with multi-port memories and multi-bank architectures), the quantitative analysis of energy and performance strictly refers to the case of two processors (with dual-port memory and two memory blocks).

A1 P1 A

A1 SPM1 Port1 D D P1 1

D1 DPM A2 A P2 Port2 2 D 2 P2 A SPM2 D D2

(a) (b)

Figure 5.3. Dual-Port (a) and Partitioned Single-Port (b) Architectures. Figure 5.3 show a conceptual architecture of the dual-port and the partitioned single-port schemes. Label Ai refers to addresses from processor i, while Di refer to data to/from processor i. In the dual-port scheme (Figure 5.3-(a)), the TLFeBOOK

91 existence of two read/write ports allows to bind each processor to one port, realizing in fact a point-to-point interconnection. In the partitioned architecture (Figure 5.3-(b)), addresses and data must be multiplexed (from processor to memory) or de-multiplexed (from memory to processor) properly, to connect the processor to the required memory block. This block diagram just shows the high-level flow of data and addresses; the actual implementation of the decoder is actually more complex, and will be discussed in the experimental section.

5.3.1 Related Work The literature on energy optimization of embedded memories is quite rich (see [6] for a comprehensive survey); however, most techniques deal with the optimization of caches, scratch-pad memories, or off-chip memories, and multi- port memories are seldom addressed. Most energy optimizations for multi-port memories are concerned with the issue of the mapping of data structures (typically, arrays) to multi-port mem- ories, based on the access profiles of the applications. From these profiles, these techniques evaluate simultaneous array accesses (e.g., whether two or more arrays are accessed in the same cycle), and build a so-called compatibility graph, which expresses the potential parallelization of accesses. The various approaches differ then in how this graph is used to decide the optimal allocation of array accesses to memory ports [3, 11–13]. One technique closer to the one proposed in this work has been discussed by Lewis and Brackenbury [14]. Their approach is based on the typical access patterns of DSP applications, and splits highly-multiported register files into multiple banks of predefined sizes.

5.4 PERFORMANCE AND ENERGY CHARACTERIZATION In this section we will derive analytical expressions for the number of memory accesses and for the total energy consumption for the architectures of Figure 5.3, referred to the case of a system consisting of two processors (hereafter denoted with P1 and P2).

5.4.1 Performance Characterization

Let c1 and c2 be the number of memory accesses required by the execution of the application on processors P1 and P2, respectively. In the following, we will use the term memory cycle instead of memory access; we adopt this terminology in order to distinguish accesses to the shared memory that can occur in parallel. In fact, the total number of memory accesses by a processor TLFeBOOK

92 is fixed (and determined by the memory access pattern of the application, which we do not modify); What actually changes is the time (in cycles) required to serve these accesses. Furthermore, we will denotes sets with bold symbols, and their cardinalities with lowercase ones. Our reference performance figure is the total number of memory cycles for the case where shared memory is implemented as a monolithic single-port memory. This value is cspm = c1 + c2.

5.4.1.1 Dual-Port Memory. When the shared memory is implemented by a monolithic dual-port memory, the total number of memory accesses will be smaller than cspm because of the possibility of simultaneous accesses. Only a fraction of the accesses, however, will occur simultaneously. As Figure 5.4 shows, this fraction can be represented in terms of set notation. We denote with Cpar the set of memory cycles that can access memory simul- taneously; Cpar consists of the union of two subsets Cpar = Cpar,1 ∪ Cpar,2, where Cpar,1 ⊆ C1 and Cpar,2 ⊆ C2. These two subsets have same cardinality (i.e., cpar,1 ≡ cpar,2) because each element of one set matches one of the other set to make a parallel access.

 

  

Figure 5.4. Classification of Execution Cycles.

The number of cycles for the dual-port configuration is therefore:

cdpm =(c1 − cpar,1)+(c2 − cpar,2)+cpar/2 (2) where cpar = cpar,1 + cpar,2, denotes the total number of the parallel cycles. The division by two in the last term denotes the fact that parallel cycles are actually grouped in pairs, with each pair corresponding to a single memory access. Equation 2 simplifies to cdpm = c1 + c2 − cpar/2, exposing the fact that the magnitude of cpar directly translates into a performance improvement.

5.4.1.2 Partitioned Memory. In the case of partitioned memory, the two memory banks now host two non-overlapping subsets of the address space. This implies that only a subset of the cycles in Cpar can be parallelized;in particular, accesses that fall in the same subset of addresses now need to be serialized, since the two memory blocks are single-ported. TLFeBOOK

93 This further sub-setting of the cycles is depicted in Figure 5.5, using the same set notation as above. We can notice that C1 and C2 are now both split into two subsets, where Ci,j denotes the cycles of processor i that fall into block j.

     

   

      

Figure 5.5. Classification of Execution Cycles for the Partitioned Architecture.

This induces a partition onto Cpar, as follows. The shaded areas labeled A and D in Figure 5.5 denote parallel accesses that fall into different memory blocks: In region A (D), P1 accesses Block 1 (Block 2), and P2 accesses Block 2 (Block 1). Conversely, the regions labeled B and C denote accesses that fall in the same memory block (Block 2 for region b, and Block 1 for region c). Cycles belonging to region B and C cause a performance penalty, because, although they can potentially occur in parallel, they must be serialized (and thus require two memory accesses). These subsets can be characterized by using a quantity λ, that denotes the percentage of the cycles in Cpar that fall in distinct memory blocks (and can thus be made parallel). λ will be used in the following as a compact metric to evaluate the cost of the partition. In fact, λ depends on where how the partition has been made, that is, how many addresses fall in each block. Therefore, Cpar consists of λcpar cycles that can be parallelized, and (1 − λ)cpar that requires two separate accesses. The number of cycles of the partitioned-memory architecture cspm,part is therefore:

cspm,part =(c1 − cpar,1)+(c2 − cpar,2)+λcpar/2+(1− λ)cpar (3)

The formula simplifies to cspm,part = c1 + c2 − λcpar/2, exposing the fact that cspm,part ≥ cdpm,sinceλ ≤ 1. Analyzing the dependency of cspm,part versus λ, We notice that cspm,part (and thus) the performance penalty of the partitioned scheme is minimized when λ is maximized, as expected. In par- ticular, when λ =1, all accesses in Cpar are parallelized, and the partitioned scheme is equivalent to the dual-port memory, performance-wise. When λ =0, all accesses by Cpar overlap on the same memory block, and the partitioned scheme is equivalent to the single-port memory architecture. TLFeBOOK

94 5.4.2 Energy Characterization To compute energy, we stick to the high-level model of Equation 1; energy is thus simply obtained by multiplying each access for its cost.

5.4.2.1 Dual-Port Memory. In this case we have to consider two types of access costs, depending on whether one or both ports are accessed. Total energy is obtained thus by properly weighing the terms of Equation 2: In formula:

edpm =(c1 − cpar,1) · edpm,1 +(c2 − cpar,2) · edpm,1 + cpar/2 · edpm,2 (4)

The term edpm,x denotes the energy per access to the memory, in which the term x = {1, 2} in the subscript denotes the number of ports used in the access.

5.4.2.2 Partitioned Memory. In the case of the partitioned memory, total energy cannot be conveniently expressed by a closed formula, for two reasons. First, the energy per access depends on the size of the memory block that is accessed; the sizes of the blocks, however, are precisely the variables of the partitioning problem we are trying to solve. Second, we have two single- port memories, and each memory access from either processor will fall into one of the two memory blocks. This implies that the energy per access can only be approximated by a “average” cost (i.e., the number of accesses to Block 1 weighted by its energy cost, plus number of accesses to Block 2 weighted by its energy cost). The accurate evaluation of energy for the partitioned architecture requires thus a simulation of the dynamic address trace of the two processors, and the application of Equation 1 on an access-by-access basis. Nevertheless, we can derive an approximate expression of total energy that can be used for a rough comparison with Equation 4: =( − ) · +( − ) · + espm,part c1 cpar,1 espm c2 cpar,2 espm (1 − ) · + 2 · ( + ) (5) λ cpar espm λcpar/ espm1 espm2 The first two term (espm and espm) are the above mentioned average access costs and represent the non-parallel memory accesses. espm is the cost of accessing either Block 1 or Block 2 (depending on the subset of addresses), when accesses are potentially parallel but must be serialized. The last term represents the subset of potentially parallel accesses that will access Block 1 and Block 2 simultaneously (espm1 + espm2). Although approximate, Equation 5 allows to do some rough comparison with the dual-port scheme. First, all energy costs in Equation 5 are smaller than edpm,2, and, in most cases (when the sizes of the two blocks are of comparable size), also smaller than edpm,1. This implies that all four terms of Equation 5 TLFeBOOK

95 are smaller than the corresponding ones in Equation 4, and energy is potentially smaller than the dual-port memory case, regardless of the value of λ. The actual dependency of espm,part on λ is not easily observable from Equa- tion 5. A large value of λ increases the probability of accessing both blocks in the same cycle (this corresponds to the largest term (espm1 +espm2)). Therefore, energy should be in principle reduced by choosing partitions which minimize λ. In this case, in fact, only one of the two blocks (each one smaller than the monolithic memory) will be accessed in each cycle, thus using less energy; a small value of λ, however, tends to increase the number of cycles, as already observed.

5.5 EXPLORATION FRAMEWORK The models described in Section 5.4 show that there exists a tradeoff be- tween energy and performance in partitioning the shared memory. Although we are searching for energy-efficient memory architectures, we cannot ig- nore performance implications; therefore, in order to search for the best en- ergy/performance tradeoff, we use energy/delay product (EDP) as a metric, and choose to minimize EDP during the space exploration. Thanks to the simple models of Section 5.4, the optimization space is rela- tively small, since λ is the only parameter of the models. λ is a function of the access pattern of the application, but it also depends on how the address space is partitioned. Partitions can be characterized by the boundary address B that splits the address space [0,...,N − 1] into two sub-spaces [0,...,B− 1] and [B,...,N−1]. Therefore, λ is also a function of B. As an example, Figure 5.6 shows the behavior of λ versus B for a parallel FFT kernel; we can observe that the curve is not monotonic, showing the sensitivity of λ to the access pattern.

                            

Figure 5.6. Behavior of λ(B) vs. B. These observations leads us to the following exploration procedure, for a shared memory of N words:

1 Compute epm(λ) and cpm(λ) as in Section 5.4; TLFeBOOK

96

2 For all possible values of B =0,...,N − 1, Compute EDPpm(λ) as epm · cpm. EDPpm(λ(B)) is not a function, since there may be more values of B (and thus of EDP)foragivenvalueofλ. An example of such curve is shown in Figure 5.7, for the parallel FFT benchmark.

pareto( ) 3 Compute the function EDPpm λ , obtained by selecting, for each ( ) pareto value of λ, the smallest value of EDPpm λ . EDPpm contains the Pareto points of EDPpm(λ), and can possibly contain some discontinu- ities. Figure 5.8 shows the resulting curve for the FFT benchmark.









  



      

Figure 5.7. Behavior of EDP(λ(B)) vs. λ.











   



       

Figure 5.8. Pareto Points of EDP(λ(B)).

4 Compute the minimum EDPmin, of this function, and let λmin the cor- responding value of λ;

5Ontheλ vs. B plot, identify the corresponding value Bmin of B.In case of multiple values of B, choose the one that makes the partitions as equal (in size) as possible. TLFeBOOK

97 5.6 EXPERIMENTAL RESULTS 5.6.1 Experimental Setup We have implemented our partitioned memory scheme in ABSS [15]. ABSS is an execution-driven architectural simulator for multiprocessor systems devel- oped at Stanford University, that extends the ideas implemented in the AUG- MINT simulator. ABSS is based on the idea of augmentation,thatis,the instrumentation of the assembly code with various hooks that allow to make context switches to the simulator; augmentation translates the program into a functionally equivalent program that runs on the simulated version of the processor. The memory architecture provided by ABSS includes both private and shared memory. All the memories are connected through a single shared bus. Yet, ABSS does not provide any specific predefined cache or shared bus model; rather, it a defines a specific interface to which user-defined cache and bus models can be easily hooked. We have integrated Dinero [16] into ABSS, in order to provide accurate cache simulation data, and we have derived performance and energy models for the shared memory (both single- and dual-port) by interpolation of the results obtained from an industrial memory generator by ST Microelectronics. The target technology for all the models is 0.18µ. Concerning the benchmarks, we have used Stanford’s SPLASH suite [17] which includes a set of kernels and parallel applications widely used in the parallel computing community. 5.6.2 Energy/Performance Tradeoff Analysis Table 5.1 shows energy-delay product (EDP) results for the above bench- marks, for the monolithic, single-port architecture (EDPmm) and the parti- tioned one (EDPpm), obtained using the exploration procedure of Section 5.5. The EDP reduction (Column ∆) ranges from 40.5% to 62.3% (50.2% on aver- age).The exploration procedure also allows to compute the best performance and energy points; these are summarized in Table 5.2, where performance improve- ments (number of cycles) and energy saving with respect to the monolithic, single port architecture are reported (Columns Best Performance and Best En- ergy). The comparison of Tables 5.1 and 5.2, shows that savings in the EDP is mostly due to energy savings than to performance savings. Minimum EDP points are in fact very close to minimum energy points, for most of the bench- marks, while performance improvements are less significant. Notice also that only benchmarks that exhibit a sizable amount of parallel cycles (e.g., FFT, LU-Cont, Radix) results in a sizable performance improvement. Conversely, energy does not seem to be that sensitive to the amount of parallel cycles. TLFeBOOK

98

Table 5.1. Energy-Delay Product Results.

Application EDPmm EDPpm ∆ [%] Barnes 24987.8 11357.5 54.6 FFT 6.4 3.7 41.2 FMM 853.4 389.6 54.4 LU 3931.3 2339.2 40.5 LU-CONT 3734.4 2073.1 44.5 Radix 59512.5 23180.5 61.0 Volrend 869794.2 453283.8 47.9 Water-N2 150460.7 56710.0 62.3 Water-S 10581.2 5770.5 45.5 Average 50.2

Table 5.2. Optimal Performance and Energy Points.

Application Best Performance [%] Best Energy [%] Barnes 1.5 54.4 FFT 34.0 37.2 FMM 2.1 54.4 LU 10.9 40.3 LU-CONT 19.8 40.5 Radix 25.4 60.9 Volrend 0.3 50.3 Water-N2 13.9 62.3 Water-S 8.7 45.9 Average 13.0 49.6

Figure 5.9 shows the energy savings of the the partitioned architecture with respect to the dual-port case. Numbers refer to best-performance points, since we want to reduce the performance penalty as much as possible. The savings do not include the cost of the decoding logic. The partitioned architecture results in an average energy saving of 56% (maximum 70%). This energy saving is achieved at an increase of the total number of memory cycles of 2.4% on average (10.1% maximum).

5.6.3 Decoder Implementation The partitioned architecture requires an ad-hoc encoder which implements the conceptual scheme of Figure 5.3. The encoder must provide two main functionalities. First, it must drive the selectors that decide to which block a given memory access is directed; to do this, it must contain the information about the boundary of the partition of the address space. Second, and more important, TLFeBOOK

99



















           

Figure 5.9. Energy Savings of the Partitioned Architecture vs. the Multi-Port One. it must handle the connection between processors and memory blocks; this requires a sort of arbitration mechanism that allows to serialize accesses that are potentially parallel, but fall in the same subset of addresses (i.e., memory block). Figure 5.10 shows a more detailed block diagram of the encoder. It takes as inputs the addresses A1 and A2 from the two processors, the corresponding request signals Reqi, and the value B of the address corresponding to the parti- tion. It then generates the addresses to be sent to each memory block AB1 and AB2 , and the signals used to allow the processors to access memory Granti. The latter are both active but in the cases where potentially parallel accesses must be serialized. The decoder contains two main blocks. The first block (RH, Request Han- dler) checks the two addresses A1 and A2, and generates the Busyi outputs as well as a signal that determines whether the accesses can be parallelized or not (S/NS). The other block (SEL), uses three inputs to decide to what memory block to send what address: the S/NSinput, and the outputs A1i and A2i of two comparators (the boxes labeled with “=”) which determine in which block A1 and A2 are falling, respectively. By using the value of B as an external input, it is possible to make the decoder application-independent, and therefore to have one single encoder for any application. We have implemented the decoder in VHDL, and synthesized it on a 0.18µm technology library by ST Microelec- tronics, using Synopsys Design Compiler. When applying the memory access trace of the FFT benchmark, the dissipation of the decoder is 0.35 µJ, about 1.7% of total memory energy consumption (19.8 µJ). TLFeBOOK

100

    

    

      

 

     

    

Figure 5.10. Block Diagram of the Decoder.

Concerning delay, although the decoder is on the critical path (its delay adds up to the memory access time), this is not really an issue in the par- titioned architecture. In fact, the memory cycle time in this case is smaller than that of the dual-port case, since we are accessing smaller memory blocks. Quantitatively, the partitioned architecture results in a slack equal to ddpm − max( dspm,1,dspm,2),wheredi denotes access time to the corresponding mem- ory block. The delay of the decoder obtained from synthesis is 310ps,well within this slack.

5.7 CONCLUSIONS We have proposed an energy-efficient alternative to multi-port memories suit- able for the implementation of the shared memory of multi-processor systems- on-chip. The architecture is based on application-driven partitioning of the address space into multiple banks. The target of the architecture is to achieve little or no performance penalty with respect to multi-port memories; therefore, we pursue maximum perfor- mance partitioning solutions, corresponding the case where the chance of par- allelizing the accesses is maximized. The architecture can be enhanced so that zero performance penalty is achieved, thank to the use of an extra memory buffer. Experiments on a set of parallel benchmarks has shown average energy-delay product (EDP) reductions of 50% on average, with respect to the baseline case of a single-port memory, and energy savings of 56%, with respect to the case of a multi-port memory, with an average 2% performance penalty. TLFeBOOK

101 References

[1] L. Benini, G. De Micheli, “Networks on Chips: A New SoC Paradigm,” IEEE Computer, Vol. 35, No. 1, pp. 70–78, January 2002. [2] P.Stenstrom,¨ “A Survey of Cache Coherence Schemes for Multiprocessors,” IEEE Computer, Vol. 23, No. 6, June 1990, pp. 12–24. [3] F. Catthoor, et al. Custom Memory Management Methodology Explo- ration for Memory Optimization for Embedded Multimedia System Design, Kluwer Academic Publishers, 1998. [4] P. Panda, N. Dutt, Memory Issues in Embedded Systems-on-Chip Optimiza- tion and Exploration, Kluwer Academic Publishers, 1999. [5] A. Macii, L. Benini, M. Poncino, Memory Design Techniques for Low- Energy Embedded Systems, Kluwer Academic Publishers, 2002. [6] L. Benini, A. Macii, M. Poncino, “Energy-Aware Design of Embedded Memories: A Survey of Technologies, Architectures and Optimization Techniques”, ACM Transactions on Embedded Computing Systems,Vol. 2, No. 1, Feb. 2003, pp. 5–32. [7] Cypress Semiconductor, http://www.cypress.com/products. [8] Integrated Devices Technology, http://www.idt.com/products/ multi port.html. [9] Artisan Components, http://www.artisan.com/products/ memory.html. [10] L. Macchiarulo, A. Macii, L. Benini, M. Poncino, “Layout-Driven Mem- ory Synthesis for Embedded Systems-on-Chip," IEEE Transactions on Very Large Scale Integration (VLSI), Vol. 10, No. 2, pp. 96-105, April 2000 [11] P.R. Panda, N.D. Dutt, “Behavioral Array Mapping into Multiport Mem- ories Targeting Low-Power,” VLSI’97: International Conference on VLSI Design, Jan. 1997, pp. 268–272. [12] P.R. Panda, L. Chitturi, “An Energy-Conscious Algorithm for Memory Port Allocation,” ICCAD’02: International Conference on Computer Aided Design, Nov. 2002, pp. 572–576. [13] W.-T. Shiue, C. Chakrabarti, “Low-Power Multi-Module, Multi-Port Memory Design for Embedded Systems,” Journal of VLSI Signal Process- ing, pp.167-178, Nov 2001. [14] M. Lewis, L. Brackenbury, “Exploiting Typical DSP Data Access Patterns and Asynchrony for a Low-Power Multi-ported Register Bank,” ASYNC’01: International Symposium on Asynchronous Circuits and Systems, March 2001, pp. 4–14. [15] D. Sunada, D. Glasco, M. Flynn, ABSS v2.0: A SPARC Simulator, Tech- nical Report CSL-TR-98-755, CSL, Stanford University, April 1998. TLFeBOOK

102 [16] M. D. Hill, J. Elder, DineroIV Trace-Driven Uniprocessor Cache Simula- tor, www.cs.wisc.edu/markhill/DineroIV, 1998. [17] J. P. Singh, W.-D. Weber, A. Gupta, “SPLASH: Stanford Parallel Appli- cations for Shared-Memory”, News, Vol. 20, No. 1, pages 5-44, March 1992. TLFeBOOK

103

Chapter 6 TUNING CACHES TO APPLICATIONS FOR LOW-ENERGY EMBEDDED SYSTEMS

Ann Gordon-Ross1, Chuanjun Zhang1, Frank Vahid1,2, and Nikil Dutt2 1University of California, Riverside;2 University of California, Irvine

Abstract The power consumed by the memory hierarchy of a microprocessor can contribute to as much as 50% of the total microprocessor system power, and is thus a good candidate for power and energy optimizations. We discuss four methods for tuning a microprocessors’ cache subsystem to the needs of any executing application for low-energy embedded systems. We introduce on- chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune a configurable level-one cache’s total size, associativity and line size to an executing application. We extend the single-level cache tuning heuristic for a two-level cache using a methodology applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. We show that a victim buffer can be very effective as a configurable parameter in a memory hierarchy. We reduce static energy dissipation of on-chip data cache by compressing the frequent values that widely exist in a data cache memory.

Keywords: Cache; configurable; architecture tuning; low power; low energy; embedded systems; on-chip CAD; dynamic optimization; cache hierarchy; cache exploration; cache optimization; victim buffer; frequent value.

6.1 INTRODUCTION

The power consumed by the memory hierarchy of a microprocessor can contribute to 50% or more of total microprocessor system power [1]. Such a large contributor to power is a good candidate for power and energy optimization. The design of the caches in a memory hierarchy plays a major role in the memory hierarchy’s power and performance. Tuning cache design parameters to the needs of a particular application or program region can save energy. Cache design parameters include: cache size, meaning the total number of data byte storage; cache associativity, meaning the number of tag and data ways simultaneously read per cache TLFeBOOK

104 access; cache line size, meaning the number of bytes in a block when moving data between cache and the next memory level; and victim buffer use, meaning a small fully-associative buffer storing recently-evicted cache data lines. Every application has different cache requirements that cannot be efficiently satisfied with one predetermined cache configuration. For instance, different applications have vastly different spatial and temporal locality and thus have different requirements [2] with respect to cache size, cache line size, cache associativity, victim buffer configuration, etc. In addition to tunable cache parameters, widely existing frequent values in data caches for some applications can enable data encoding within the cache for reduced power consumption. We define cache tuning as the task of choosing the best configuration of cache design parameters for a particular application, or for a particular phase of an application, such that performance, power and/or energy are optimized. New technologies enable cache tuning. Core-based processors allow a designer to choose a particular cache configuration [3-7]. Some processor designs allow caches to be configured during system reset or even during runtime [2,8,9]. Manual tuning of the cache is hard. A single-level cache may have many tens of different cache configurations, and interdependent multi-level caches may have thousands of cache configurations. The configuration space gets even larger if other dependent configurable architecture parameters are considered, such as bus and processor parameters. Exhaustively searching the space may be too slow even if fully automated. With possible average energy savings of over 40% through tuning [2,10], we sought to develop automated cache tuning methods. In this chapter, we discuss four methods of cache tuning for energy savings. We discuss an in-system method for automatically, transparently, and dynamically tuning a level-one cache; an automatic tuning methodology for two-level caches applicable to both a simulation-based exploration environment or a hardware-based prototyping environment; a configurable victim buffer; and a data cache that encodes frequent data values.

6.2 BACKGROUND – TUNABLE CACHE PARAMETERS

Many methods exist for configuring a single level of cache to a particular application during design time and in-system during runtime. Cache configuration can be specified during design time for many commercial soft cores from MIPS [6], ARM [5], and Arc [4] and for environments such as Tensilica’s Xtensa processor generator [7] and ’s Nios embedded processor system [3]. TLFeBOOK

105 Configurable cache hardware also exists to assist in cache configuration. Motorola’s M*CORE [9] processors offer way configuration which allows the ways of a unified data/instruction cache to individually be specified as either data or instruction ways. Additionally, ways may be shut down entirely. Way shut-down is further explored by Albonesi [8] to reduce dynamic power by an average of 40%. An adaptive cache line size methodology is proposed by Veidenbaum et al.[11] to reduce memory traffic by more than 50%. Exhaustive search methods may be used to find optimal cache configurations, but the time required for an exhaustive search is often prohibitive. Several tools do exist for assisting designers in tuning a single level of cache. Platune [12] is a framework for tuning configurable system- on-a-chip (SOC) platforms. Platune offers many configurable parameters beyond just cache parameters, and prunes the search space by isolating interdependent parameters from independent parameters. The level one cache parameters, being dependent, are explored exhaustively. Heuristic methods exist to prune the search space of the configurable cache. Palesi et al. [13] improves upon the exhaustive search used in Platune by using a genetic algorithm to produce comparable results in less time. Zhang et al. [14] presents a cache configuration exploration methodology wherein a cache exploration component searches configurations in order of their impact on energy, and produces a list of Pareto-optimal points representing reasonable tradeoffs in energy and performance. Ghosh et al.[15] uses an analytical model to efficiently explore cache size and associativity and directly computes a cache configuration to meet the designers’ performance constraints. Few methods exist for tuning multiple levels of a cache hierarchy. Balasubramonian et al. [10] proposes a hardware-based cache configuration management algorithm to improve memory hierarchy performance while considering energy consumption. An average reduction in memory hierarchy energy of 43% can be achieved with a configurable level two and level three cache hierarchy coupled with a conventional level one cache.

6.3 A SELF-TUNING LEVEL ONE CACHE ARCHITECTURE

Tuning a cache to a particular application can be a cumbersome task left for designers even with the advent of recent computer-aided design (CAD) tuning aids. Large configuration spaces may take a designer weeks or months to explore and with a small time-to-market, lengthy tuning iterations may not be feasible. We propose to move the CAD environment on-chip, eliminating designer effort for cache tuning. We introduce on-chip hardware TLFeBOOK

106 implementing an efficient heuristic that automatically, transparently, and dynamically tunes the cache to the executing program to reduce energy [16].

6.3.1 Configurable Cache Architecture

The on-chip hardware tunes four cache parameters in the level-one cache: cache line size (64, 32, or 16 bytes), cache size (8, 4, or 2 Kbytes), associativity (4, 2, or 1-way), and cache way prediction (on or off). Way prediction is a method for reducing set-associative cache energy, in which one way is initially accessed, and other ways accessed only upon a miss.

I$ Micro- Off chip Tuner processor Memory D$

Figure 6-1. Self-tuning cache architecture

The exploration space is quite large, necessitating an efficient exploration heuristic implemented with specialized tuning hardware, as illustrated in Figure 6-1. The tuning phase may be activated during a special software- selected tuning mode, during startup of a task, whenever a program phase change is detected, or at fixed time intervals. The choice of approach is orthogonal to the design of the self-tuning architecture itself. The cache architecture supports a certain range of configurations [2]. The base level-one cache of 8 Kbytes consists of four banks that can operate as four ways. A special configuration register allows the ways to be concatenated to form either a direct-mapped or 2-way set associative 8 Kbyte cache. The configuration register may also be configured to shut down ways, resulting in a 4 Kbyte direct-mapped or 2-way set associative cache or a 2 Kbyte direct-mapped cache. Specifically, due to the bank layout for way shut down, 2 Kbyte 2- or 4-way set associative and 4 Kbyte 4-way set associative caches are not possible using the configurable cache hardware.

6.3.2 Heuristic Development Through Analysis

A naïve tuning approach would simply try all possible combinations of configurable parameters in an arbitrary order. For each configuration, the miss rate can be measured and used to estimate the energy consumption of the particular cache configuration. After all configurations are executed, the approach would simply choose the configuration with the lowest energy TLFeBOOK

107 consumption. However, such an exhaustive method may involve the inspection of too many configurations. Therefore, we wish to develop a cache tuning heuristic that minimizes the number of configurations explored. When developing a good heuristic, the parameter (cache size, line size, associativity, or way prediction) with the largest impact in performance and energy would likely be the best parameter to search first. We analyzed each parameter to determine the parameter’s impact on miss rate and energy by fixing three parameters and varying the third. We observed that varying the cache size had the largest average impact on energy and miss rate – changing the cache size can impact the energy by a factor of two or more. From our analysis, we developed a search heuristic that first determines the best cache size, determines the best line size, then the best associativity, and finally, if the best associativity is greater than one, our heuristic determines whether to use way prediction or not.

6.3.3 Search Heuristic

The heuristic developed based on the importance of parameters is summarized below:

1. Begin with a 2 Kbyte, direct-mapped cache with a 16 byte line size. Increase the cache size to 4 Kbytes. If the increase in cache size causes a decrease in energy consumption, increase the cache size to 8 Kbytes. Choose the cache size with the best energy consumption. 2. For the best cache size determined in step 1, increase the line size from 16 bytes to 32 bytes. If the increase in line size causes a decrease in energy consumption, increase the line size to 64 bytes. Choose the line size with the best energy consumption. 3. For the best cache size determined in step 1 and the best line size determined in step 2, increase the associativity to 2 ways. If the increase in associativity causes a decrease in energy consumption, increase the associativity to 4 ways. Choose the associativity with the best energy consumption. 4. If step (3) determined the best associativity to be greater than 1, determine if enabling way prediction results in energy savings.

The cache tuning heuristic can be implemented in either software or hardware. In a software-based approach, the system processor would execute the search heuristic. Executing the heuristic on the system processor would not only change the runtime behavior of the application but also affect the cache behavior, possibly resulting in the search heuristic choosing a non- TLFeBOOK

108 optimal cache configuration. Therefore, we prefer a hardware-based approach that does not significantly impact overall area or power.

6.3.4 Experiments and Results

We simulated numerous Powerstone [9] and MediaBench [18] benchmarks using SimpleScalar [19], a cycle-accurate simulator that includes a MIPS-like microprocessor model, to obtain the number of cache accesses and cache misses for each benchmark and configuration explored. For power dissipation, we considered both static power dissipation due to leakage current and dynamic power dissipation due to logic switching current and the charging and discharging of the load capacitance. We obtain the energy of a cache hit from our own CMOS 0.18 µm layout of our configurable cache (we found our energy values correspond closely with CACTI values). We obtain the off-chip memory access energy from a standard Samsung memory, and the stall energy from a 0.18 µm MIPS microprocessor. Furthermore, we obtained the power consumed by our cache tuner, through simulation of a synthesized version of our cache tuner written in VHDL.

Table 6-1. Results of search heuristic. Ben. is the benchmark considered, cfg. is the cache configuration selected, No. is the number of configurations examined by our heuristic, and E% is the energy savings of both the I-cache and D-cache. Ben. I-cache cfg No. D-cache cfg No. I-cache E% D-cache E% padpcm 8K_1W_64B 7 8K_1W_32B 7 23% 77% crc 2K_1W_32B 4 4K_1W_64B 6 70% 30% auto 8K_2W_16B 7 4K_1W_32B 6 3% 97% bcnt 2K_1W_32B 4 2K_1W_64B 4 70% 30% bilv 4K_1W_64B 6 2K_1W_64B 4 64% 36% binary 2K_1W_32B 4 2K_1W_64B 4 54% 46% blit 2K_1W_32B 4 8K_2W_32B 8 60% 40% brev 4K_1W_32B 6 2K_1W_64B 4 63% 37% g3fax 4K_1W_32B 6 4K_1W_16B 5 60% 40% fir 4K_1W_32B 6 2K_1W_64B 4 29% 71% jpeg 8K_4W_32B 8 4K_2W_32B 7 6% 94% pjpeg 4K_1W_32B 6 4K_1W_16B 5 51% 49% optimal 4K_2W_64B ucbqsort 4K_1W_16B 6 4K_1W_64B 6 63% 37% tv 8K_1W_16B 7 8K_2W_16B 7 37% 63% adpcm 2K_1W_16B 5 4K_1W_16B 5 64% 36% epic 2K_1W_64B 5 8K_1W_16B 6 39% 61% g721 8K_4W_16B 8 2K_1W_16B 3 15% 85% pegwit 4K_1W_16B 5 4K_1W_16B 5 37% 63% mpeg2 4K_1W_32B 6 4K_2W_16B 6 40% 60% optimal 8K_2W_16B Average 5.8 Average: 5.4 45% 55% TLFeBOOK

109

Table 6-1 shows the results of our search heuristic, for instruction and data cache configurations. Our search heuristic is quite effective: it searches on average only 5.8 configurations, compared to 27 configurations for an exhaustive approach. Furthermore, our heuristic finds the optimal configuration in nearly all cases. For the two data cache configurations where the heuristic does not find the optimal, pjpeg and mpeg2, the configuration found is only 5% and 12% worse than the optimal, respectively. On average, the dynamic self-tuning cache can reduce memory- access energy by 45% to 55%. Additionally, be observed that way prediction is only beneficial for instruction caches and that only a 4-way set associative instruction cache has lower energy consumption when way prediction is used. However, for the benchmarks we examined, the cache configurations with the lowest energy dissipation were mostly direct mapped caches where way prediction is not applicable. To determine the area and power overhead of our cache tuner, we designed the cache tuner hardware using VHDL and synthesized the tuner using Synopsys Design Compiler. The total tuner size was about 4,000 gates, or 0.039 mm2 in 0.18 µm CMOS technology. Compared to the reported size of the MIPS 4Kp with caches [20], this represents an increase in area of just over 3%. The power consumption of the cache tuner is 2.69 mW at 200 MHz, which is only 0.5% of the power consumed by a MIPS processor. Furthermore, we only use the tuning hardware during the tuning stage; the tuner can be shutdown after the best configuration is determined, thereby minimizing the effects of additional static power dissipation due to the tuner.

6.4 AUTOMATIC TUNING OF A TWO-LEVEL CACHE ARCHITECTURE – THE TCAT

In the previous section, we described an automatic method for tuning a single level of cache in system during run-time. We extend the single level cache tuner to tune two-level caches to embedded applications for reduced energy consumption [21]. This method is applicable to both a simulation- based exploration environment and a hardware-based prototyping environment. We present the two-level cache tuner, or TCaT – a heuristic for searching the huge solution space of possible configurations. The heuristic interlaces the exploration of the two cache levels and searches the various cache parameters in a specific order based on their impact on energy. TLFeBOOK

110 6.4.1 Configurable Cache Architecture

The configurable caches in each of the two cache levels explored here are based on the configurable cache architecture described for a single level configurable cache in Section 6.3.1. The target architecture for our two-level cache tuning heuristic contains separate level one instruction and data caches and separate level two instruction and data caches. For the first level cache, we explore the same search space as the single level cache tuner: cache line size (64, 32, or 16 bytes), cache size (8, 4, or 2 Kbytes), and associativity (4, 2, or 1-way). For the second level of cache, we expand the cache size to a possible 64, 32, or 16 Kbytes while the line size and associativity parameters are the same. We do not explore way prediction with the TCaT. An exhaustive exploration of all cache configurations for a two level cache hierarchy is too costly. For a single level separate instruction and data cache design, an exhaustive exploration would explore a total of 28 different cache configurations. However, the addition of a second level of hierarchy raises the number of cache configurations to 432. Nevertheless, for comparison purposes, we determined the optimal cache configuration for each benchmark by generating exhaustive data. It took over one month of continual simulation time on an UltraSparc compute server to generate the data for our nine benchmarks. In addition, we have chosen a base cache hierarchy configuration consisting of an 8 Kbyte, 4-way set associative level-one cache with a 32 byte line size, and a 64 Kbyte 4-way set associative level two cache with a 64 byte line size – a reasonably common configuration.

6.4.2 Initial Two-Level Cache Tuning Heuristic – Search Each Level Independently

Initially, we extended the heuristic described in Section 6.3.3 for a two- level cache by tuning the level-one cache while holding the level-two cache at the smallest size, then tuning the level-two cache using the same heuristic. We applied the initial heuristic to the benchmarks and found that this heuristic did not perform well for two levels (the original heuristic was intended for only one level, where it works well). The cache configuration determined by our initial heuristic consumed, on average over all benchmarks, 1.41 times more energy than the optimal configuration. In the worst case, our initial heuristic found a cache configuration using 2.7 times more energy than the optimal configuration. In one benchmark, the initial heuristic found a cache configuration that was worse than the base cache. The naïve assumption that the two levels of cache could be configured independently was the reason that our initial heuristic did not perform well TLFeBOOK

111 for a two level system. In a two-level cache hierarchy, the behavior of each cache level directly affects the behavior of the other level. For example, the miss rate of the level one cache does not solely determine the performance of the level two cache. The performance of the level two cache is also determined by what values are missing in the level one cache. To fully explore the dependencies between the two levels, we decided to explore both levels simultaneously.

6.4.3 The Two-Level Cache Tuner - TCaT

To more fully explore the dependencies between the two cache levels, we expanded our initial heuristic to interlace the exploration of the level one and level two caches. Instead of entirely configuring the level one cache before configuring the level two cache, the interlaced heuristic explores one parameter for both levels of cache before exploring the next parameter, while adhering to the parameter ordering of the initial heuristic. The basic intuition behind our heuristic is that interlacing the exploration allows for better modeling and tuning of the interdependencies between the different levels of cache hierarchy. We applied the interlaced heuristic to the benchmarks and found that the interlaced heuristic performed much better than the initial heuristic, but there was still much room for improvement. We examined the cases where the interlaced heuristic did not yield the optimal solution. We discovered that in these cases, the optimal was not being reached for two reasons. First, the initial heuristic did not fully explore each parameter. For instance, if an increase from a 2 Kbyte to 4 Kbyte cache size did not yield an improvement in energy, an 8 Kbyte cache size was not examined. The second reason the optimal configuration was not being found was not due to a failure in the heuristic, but rather due to the limitations set on certain cache configurations by the configurable cache itself. For example, in the level two cache, if a 16 Kbyte cache is chosen as the best size, the only associativity available is a direct-mapped cache. With no energy improvement by increasing the cache from a 16 Kbyte direct-mapped to a 32 Kbyte direct-mapped cache, no other associativities are searched by the previous heuristics. To allow for all associativities to be searched, we added a final adjustment to the associativity search step of the interlaced heuristic with full parameter exploration. The final adjustment allows the cache size to be increased for both the level one and level two caches in order to search larger associativities. We refer to this final heuristic as the two-level cache tuner - the TCaT. TLFeBOOK

112 6.4.4 Experiments and Results

The experimental setup and energy calculations are the same as those described in Section 6.3.4. We explored nine different benchmarks obtained from MediaBench [18] and EEMBC [22] benchmarks suites.

1.2 1 Base Cache 0.8 0.6 Initial Heuristic 0.4 TCaT 0.2 0 Optimal g721 pegwit average AIFIRF01 AIFFTR01 rawcaudio BITMNP01 IDCTRN01 TTSPRK01 PNTRCH01

Figure 6-2. Energy consumption for the initial heuristic cache configuration, the TCaT cache configuration, and the optimal cache configuration, normalized to the base cache configuration for each benchmark.

Figure 6-2 shows the results for the initial heuristic and the TCaT for each benchmark. The energy consumptions have been normalized to the base cache configuration for each benchmark’s cache hierarchy. The results show that the TCaT finds the optimal cache configuration in most cases. Compared to the base cache configuration and averaged over all benchmarks, the initial heuristic achieves an average energy savings of 32% while the TCaT achieves an average energy savings of 53%. Additionally, we found that for every benchmark, there is no loss of performance due to cache configuration for optimal energy consumption. In fact, the benchmarks receive an average of a 28% speedup, which we found was due to the tuning of the cache line size. Furthermore, the TCaT reduces the configuration search space significantly. The exhaustive approach for separate instruction and data caches for a two level cache hierarchy explores 432 cache configurations. The improved heuristic explores only 28 cache configurations, or only 6.5% of the search space. This reduction in the search space speeds up both a simulation approach and a hardware-based prototyping platform approach. TLFeBOOK

113 6.5 USING A VICTIM BUFFER IN AN APPLICATION SPECIFIC MEMORY HEIRARCHY

In addition to tuning cache parameters such as cache size, line size, and associativity, the cache subsystem can include a configurable victim buffer which can be beneficial in systems with a direct-mapped cache. Direct- mapped caches are popular in embedded microprocessor architecture due to their simplicity and good hit rates for many applications. A victim buffer is a small fully-associative cache, whose size is typically 4 to 16 cache lines, residing between a direct-mapped L1 cache and the next level of memory. The victim buffer holds lines discarded after an L1 cache miss. The victim buffer is checked whenever there is an L1 cache miss, before going to the next level memory. If the desired data is found in the victim buffer, the data in the victim buffer is swapped back to the L1 cache. Jouppi [23] reported that a four-entry victim buffer could reduce 20% to 95% of the conflict misses in a 4 Kbyte direct-mapped data cache. Albera and Bahar [24] evaluated the power and performance advantages of a victim buffer in a high performance superscalar, speculative, out-of-order processor. They showed that adding a victim buffer to an 8 Kbyte direct-mapped data cache results in 10% energy savings and 3.5% performance improvements on average for the Spec95 benchmark suite. A victim buffer improves the performance and energy of a direct-mapped cache on average, but for some applications, a victim buffer actually degrades performance without much or any energy savings, as we will show later. Such degradation occurs when the victim buffer hit rate is low. Checking a victim buffer requires an extra cycle after an L1 miss. If the victim buffer hit rate is high, that extra cycle actually prevents dozens of cycles for accessing the next level memory. But if the buffer hit rate is low, that extra cycle does not save much and thus is wasteful. Whether a victim buffer’s hit rate is high or low is dependent on what application is running. Such performance overhead may be one reason that victim buffers are not always included in embedded processor cache architectures. In this section, we will show that treating the victim buffer as a configurable memory parameter to a direct-mapped cache is superior to either using a direct-mapped cache without a victim buffer or using a direct- mapped cache with an always-on victim buffer [25]. Furthermore, we show that a victim buffer parameter is even useful with a cache that itself is highly parameterized. TLFeBOOK

114 6.5.1 Victim Buffer as a Cache Parameter

We consider adding a victim buffer to both core-based and pre-fabricated platform based design situations. A core-based approach involves incorporating a processor (core) into a chip before the chip has been fabricated, either using a synthesizable core (soft core) or a layout (hard core). In either case, most core vendors allow a designer to configure the level 1 cache’s total size (typical sizes range from no cache to 64 Kbyte), associativity (ranging from direct mapped to 4 or 8 ways), and sometimes line size (ranging from 16 bytes to 64 bytes). Other parameters include use of write through, write back, and write allocate policies for writing to a cache, as well as the size of a write buffer. Adding a victim buffer to a core-based approach is straightforward, involving simply including or not including a buffer into the design. A pre-fabricated platform is a chip that has already been designed, but is intended for use in a variety of possible applications. To perform efficiently for the largest variety of applications, recent platforms come with parameterized architectures that a designer can configure for his/her particular set of applications. Recent architectures include cache parameters [2,8,9] that can be configured by setting a few configuration register bits. We therefore developed a configurable victim buffer that could be turned on or off by setting bits in a configuration register.

6.5.2 Experiments and Results

The experimental setup and energy calculations are the same as those described in Section 6.3.4. The benchmarks examined include programs from the Powerstone [9], MediaBench [18], and Spec2000 [26] benchmark suites.

6.5.2.1 Victim Buffer with a Direct-Mapped Cache Figure 6-3 shows the performance and energy improvements when adding an always-on victim buffer to a direct-mapped cache. Performance is the program execution time. Energy is estimated as described in section 6.3.4. 0% represents the performance and energy consumption of an 8 Kbyte direct-mapped cache. From Figure 6-3, we see that a victim buffer improves both performance and energy for some benchmarks, like mpeg, epic, and adpcm. For other benchmarks, energy is not improved but performance is degraded, as for vpr, fir, and padpcm. A victim buffer should be excluded or turned off for these benchmarks. Some benchmarks, like jpeg, parser, and auto2, yield some energy savings at the expense of some performance degradation using a victim buffer – a designer might choose whether to TLFeBOOK

115 include/exclude or turn on/off the buffer in these cases depending on whether energy or performance is more important.

16% 21% 24% 38% 43% 60% 12% performance 8% energy 4% 0% -4% fir art blit vpr crc bilv v42 mcf jpeg epic pjeg bcnt brev g721 mpeg auto2 g3fax binary parser pegwit adpcm cbqsort padpcm

Figure 6-3. Performance and energy improvements when adding a victim buffer to an 8 Kbyte direct-mapped cache. Positive values mean the victim buffer improved performance or energy, with 0% representing an 8 Kbyte direct-mapped cache without a victim buffer. Benchmarks with both bars positive should turn on the victim buffer, while those with negative performance improvement and little or no energy improvement should turn off the victim buffer.

6.5.2.2 Victim Buffer with a Parameterized Cache Figure 6-4 shows the performance and energy improvement of adding a victim buffer to a parameterized cache having the same configurability described by Zhang et. Al. [2] 0% represents the performance and energy of the original configurable cache when tuned optimally to a particular application. The bars represent the performance and energy of the configurable cache when optimally tuned to an application assuming a victim buffer exists and is always on. The optimal cache configurations for a given benchmark are usually different for each of the two cases (no victim buffer versus always-on victim buffer). We see that, even though the configurable cache already represents significant energy savings compared to either a 4-way or direct-mapped cache [2], a victim buffer extends the savings of a configurable cache by a large amount for many examples. For example, a victim buffer yields an additional 32%, 43%, and 23% energy savings for benchmarks adpcm, epic, and mpeg2. The savings of adpcm and epic come primarily from the victim buffer that reduces the visits to off-chip memory. The saving of epic comes primarily from the victim buffer enabling us to configure the configurable cache to use less associativity without increasing accesses to the next memory level. Yet, for other benchmarks, like adpcm, auto2 and vpr, the TLFeBOOK 116 victim buffer yields performance overhead with no energy savings and thus should be turned off.

12% performance energy 32% 43% 23% 8% 4% 0% -4% blit vpr bilv v42 mcf pjeg epic jpeg auto2 g3fax pegwit padpcm

Figure 6-4. Performance and energy improvements when adding a victim buffer to an 8 Kbyte configurable cache. 0% represents a configurable cache without a victim buffer, tuned optimally to the particular benchmark.

6.6 LOW STATIC-POWER FREQUENT-VALUE DATA CACHES

Recently, a frequent value (FV) low power data cache design was proposed based on the observation that a major portion of data cache accesses involves frequent values, which can be dynamically captured [27]. Frequent values are encoded in the cache, occupying only a few bits. We improve upon previous FV data caches by reducing static power by shutting off the unused bits in the larger sub-array for encoded frequent values [28]. Since frequent values are stored in encoded form using only the few bits in the smaller sub-array, the remaining bits in the larger sub-array serve no purpose as long as the value stays frequent. Such shutoff may be beneficial since FVs occupy many words in data caches [27]. Furthermore, the original FV low power cache design suffers from an extra cycle when reading non-FVs [27], which account for 68% of all data cache accesses, resulting in a 5% increase in execution time. We used circuit design to remove the extra cycle.

6.6.1 Overview of Original FV Cache Design

In this section, we give a brief overview of the original FV data cache designed by Yang and Gupta [27]. The FV cache was proposed based on the observation that a small number of distinct frequently occurring data values often occupy a large portion of program memory data spaces and therefore account for a large portion of memory accesses [27]. This frequent value phenomenon was TLFeBOOK

117 exploited in designing a data cache that trades off performance with energy efficiency. From the perspective of the frequent value cache, data values are divided into two categories: a small number of frequent values, in our case 32 FVs, and all remaining values that are referred to as non-frequent values. The frequent values are stored in encoded form and therefore can be represented in 5 bits; the non-frequent values are stored in unencoded form in 32 bit words. Additionally, a flag bit is needed for each word in the cache to determine if the value stored in that location is encoded or not. The set of frequent values remains fixed for a given program run. When reading a word from the cache, initially we simply read from the low-bit array. Since every word read out contains a flag bit, the flag is examined to determine what comes next. The flag being 1 means the desired word is in un-encoded form, so the remaining bits should be read out from the high-bit array to form the original value. On the other hand, the flag being 0 means that the desired word is a frequent value and stored in encoded form. In this case, the access proceeds to decode the value. Since the access to the high-bit array is avoided, cache activity is reduced. A write to the FV cache is performed as follows. Before a value is written, it is first encoded through an encoder. If encoding is successful, it means that the value is a frequent value and thus a 5-bit code is stored in the low-bit array and the flag bit is cleared. In this case, accessing the high-bit array is avoided. If the encoding fails, the value to be written is a non- frequent value and thus both low-bit and high-bit data arrays are accessed as well as the flag bit being set. Note that writing non-FVs does not need to take two cycles as does reading non-FVs, because the value is encoded early in the pipeline and thus the decision of driving one array or two is clear before the access.

6.6.2 Improving the FV Cache Design

The FVs are not only accessed frequently, but also distributed widely in caches [29]. This phenomenon provides a good opportunity for reducing static power. Our approach is the following. Since the 32-bit FVs are encoded in 5 bits, the remaining 27 bits do not store any useful information. Therefore, they can be shut down to save static power and as long as a value stays frequent, static power is saved. The overall savings depend on the occupancy of FVs in the cache. Our studies show that on average nearly half of the cache content contains FVs, which indicates the benefit of reducing static power through finding FVs. The flag bits are initially set to 1, which means initially all words are non-FVs. Any data to the data cache is checked with the FV encoder. If the TLFeBOOK

118 word is an FV, the corresponding flag bit is set to 0 and this cache word is encoded and stored in the 5-bit array. At the same time, the flag bit turns off the 27-bit portion of the word. Similarly, on reading FVs, only the 5-bit portion is read and the 27-bit portion is gated off using the flag bit. On a non-FV read or write, the flag bit is set to 0 and the original 32 bits are written into the cache as usual. Our new circuit design improves the original FV cache design in that there is no extra delay in determining accesses of the 27-bit portion.

6.6.3 Designers’ Choices of Using the FV Cache

We have described a low static power FV cache. When utilized into a processor system, the FV cache can be designed with different degrees of complexity and flexibility. In this section, we provide three approaches that are suitable for a variety of processors targeting different types of applications. Essentially, the complexity comes from how FVs are identified and if they are allowed to vary for different applications. As always, the more flexibility the processor provides, the more complex the FV cache is. The first approach is appropriate to application specific processors. Since only a single type of application runs on the processor, its FVs tend to be stable over time. In such cases, the FVs can be first obtained from a profiling run through simulations, and then synthesized into the cache as part of the cache . The advantage of this approach is that once the FVs are hard coded on-chip, the cache does not perform operations other than reads. Thus, the logic of this component is simple and can be designed to consume minimum power. The second approach extends the first one with the ability of changing the FVs according to different applications. This approach is suitable for a multi-task environment in which the processor runs multiple programs instead of single program. Each program’s FVs are still obtained off-line. Instead of synthesizing the FVs on-chip, a register file may be used to store FVs so that they can be rewritten on each activation of a different program. The size of the register file depends on the number of FVs of interest to the designer, which is heavily dependent on each program’s behavior. The third approach provides the maximum flexibility in maintaining FVs. According to a previous study [29], some programs’ FVs are sensitive to different inputs. This suggests that another dimension of varying FVs might be added into the design. Since it is infeasible to profile every program on all possible inputs to catch FVs, detecting FVs on-line would be useful. Thus, on top of the second approach, the register file could be extended to dynamically capture FVs using extra logic. In the scheme proposed by Yang and Gupta [27], an inexpensive hardware FV finder was developed that TLFeBOOK

119 monitored cache accesses. The FV finder was turned on for only the first 5% of memory accesses assuming that the total memory access numbers are known a priori. After that, the FVs were captured in the finder and transmitted to the cache so that the cache starts operating as an FV cache. The energy overhead of the finder was estimated to be 0.3%-6.1% of the L1 D-cache (8 Kbyte to 64 Kbyte caches were tested). The area overhead is similar to our second approach, and thus modest. One potential issue is that the FV finder described detects frequently accessed values, which may or may not correspond to frequently distributed values in memory, though they usually are the same. We leave an FV finder for frequently distributed values for future work.

6.6.4 Experiments and Results

To determine the benefits of our FV cache architecture in reducing static energy, we ran 11 SPEC2000 [26] benchmarks through the SimpleScalar tool set [19]. We used a 4-issue out-of-order processor simulator with a 32 Kbyte L1 instruction and data cache. The benchmarks were fast-forwarded for 1 billion instructions and executed for 500 million instructions afterwards, using reference inputs.

6.6.4.1 Static Energy Savings Our main goal is to reduce the static energy consumed by the data cache without losing performance. As mentioned earlier, the overall static energy saving depends on the average coverage of FVs inside data cache. Through experiments, we found that there are abundant FVs in the L1 data cache at any time for Spec 2000 benchmarks, as shown in Figure 6-5. The percentage shown is the average for the 500 million instructions execution time. On average, 49.2% of the total words are FVs, with the highest being 77.0% for benchmark mcf and the lowest 9.4% for benchmark ammp. The static energy savings are proportional to the number of FVs in the data cache. Thus, the corresponding static energy savings on average are 35% (49.2%×27/33×86%) considering that 27 bits out of 33 bits (we need a flag bit per 32-bit word) are shut off and 86% of static power can be saved using a pMOS Gated-Vdd. When compared with the conventional 32-bit per word cache, the static energy savings can be calculated as 100%- (100%- 35%)*33/32 = 33%. TLFeBOOK

120

80% 60% 40% 20% 0% art vpr gcc mcf gzip bzip Ave mesa votex ammp parser equake

Figure 6-5. Percentage of data cache words that are FVs

6.6.4.2 Performance Improvement Our second achievement is the performance improvement over the original FV data cache design. Recall that the original FV cache performance overhead was due to the prolonged non-FV accesses. The more non-FV accesses, the slower the execution and the less the overall power savings (less energy savings), since the system would consume more energy when the program runs longer. We measured the average percentage of cache hits that are FVs, as shown in Figure 6-6(a). On average, the hit rate on data FVs is 32% with the highest being 62.7% for votex and the lowest 11.4% for mcf. Therefore, we can see that on average, 68% of cache accesses are non-FVs.

75% 2.0 Normal Cache 2-cycle FVC 50% 1.5 25% 1.0 IPC 0.5 0% 0.0 art gcc vpr art gcc mcf vpr gzip Ave bzip mcf gzip Ave mesa mesa votex bzip2 parser ammp parser equake ammp vortex equake (a) (b)

Figure 6-6. (a) Hit rate of FVs in data cache; (b) Performance (IPC) degradation of two-cycle FV cache

With our improved circuitry (1-cycle latency for non-FVs as well as for FVs), we are able to maintain the same execution speed as the base case. To see how much performance we have gained over the original FV cache, we measured the IPCs for a normal cache and a 2-cycle FV cache and plot them in Figure 6-6(b). The IPC for our improved design is the same as the normal cache. Figure 6-6(b) shows the slowdowns of the original FV cache design, which is the same value as our performance improvement. We can see that there is a 5.2% difference in the averaged IPCs between the original FV cache and our improved version. This also means that in addition to the static energy we saved by shutting off partial FV words, we also saved more dynamic energy than the original FV cache design. TLFeBOOK

121 Another feature in our new design is that it is safe in the sense that it does not increase power consumption significantly even when FVs are not abundant. Thus, our improved FV cache design is an appealing approach in reducing both static and dynamic energy of caches.

Acknowledgements

This work was supported by the National Science Foundation (CCR- 0203829, CCR-9876006) and by the Semiconductor Research Corporation (2003-HJ-1046G).

References

[1] S. Segars. Low power design techniques for microprocessors, International Solid State Circuit Conference, February 2001. [2] C. Zhang, F. Vahid, and W. Najjar. A highly-configurable cache architecture for embedded systems. 30th Annual International Symposium on Computer Architecture, June 2003. [3] Altera, Nios Embedded Processor System Development, http://www.altera.com/corporate/ news_room/releases/products/nr-nios_delivers_goods.html. [4] Arc International, www.arccores.com. [5] ARM, www.arm.com. [6] MIPS Technologies, www.mips.com. [7] Tensilica, Xtensa Processor Generator, http://www.tensilica.com/. [8] D. H. Albonesi. Selective Cache Ways: On Demand Cache Resource Allocation. Journal of Instruction Level Parallelism, May 2002. [9] A. Malik, W. Moyer, and D. Cermak. A Low Power Unified Cache Architecture Providing Power and Performance Flexibility. International Symposium on Low Power Electronics and Design, 2000. [10] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory Heirarchy Reconfiguration For Energy and Performance in General-Purpose Processor Architecture. 33rd International Symposium on Microarchitecture, December 2000. [11] A. Veidenbaum, W. Tang, R. Gupta, A. Nicolau, and X. Ji. Cache Access and Cache Time Model. IEEE Journal of Solid-State Circuits, Vol 31, No 5, 1996. [12] T. Givargis and F. Vahid. Platune: A Tuning Framework For System-On-a-Chip Platforms. IEEE Transactions on Computer Aided Design, November 2002. [13] M. Palesi and T. Givargis. Multi-Objective Design Space Exploration Using Genetic Algorithms. International Workshop on Hardware/Software Codesign, May 2002. [14] C. Zhang and F. Vahid. Cache Configuration Exploration on Prototyping Platforms. 14th IEEE International Workshop on Rapid System Prototyping , June 2003. [15] A. Ghosh and T. Givargis. Cache Optimization For Embedded Processor Cores: An Analytical approach. International Conference on Computer Aided Design, November 2003. [16] C. Zhang, F. Vahid, and R. Lysecky. A Self-Tuning Cache Architecture for Embedded Systems. Design Automation and Test in Europe Conference (DATE), February 2004. [17] M. Powell, A.Agarwal, T. Vijaykumar, B. Falsafi, and K. Roy. Reducing Set-Associative Cache Energy via Way-Prediction and Selective Direct Mapping, 34th International Symposium on Microarchitecture, 2001. TLFeBOOK

122

[18] C. Lee, M. Potkonjak, and W.H. Mangione-Smith. MediaBench: A Tool For Evaluating and Synthesizing Multimedia and Communication Systems. Proc 30th Annual International Symposium on Microarchitecture, December 1997. [19] D. Burger, T. Austin, and S. Bennet. Evaluating Future Microprocessors: The Simplescalar Toolset. University of Wisconsin-Madison. Department Tech. Report CS-TR-1308, July 2000. [20] http://www.mips.com/products/s2p3.html, 2003. [21] A. Gordon-Ross, F. Vahid, and N. Dutt. Automatic Tuning of Two-Level Caches to Embedded Applications. Design Automation and Test in Europe Conference (DATE), February 2004. [22] EEMBC, the Embedded Microprocessor Benchmark Consortium, www.eembc.org. [23] N. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, Proceedings of International Symposium on Computer Architecture, 1990. [24] G. Albera and R. Bahar. Power/performance Advantages of Victim Buffer in High- Performance Processors, IEEE Alessandro Volta Memorial Workshop on Low-Power Design, 1999. [25] C. Zhang and F. Vahid. Using a Victim Buffer in an Application-Specific Memory Hierarchy. Design Automation and Test in Europe Conference (DATE), February 2004. [26] http://www.specbench.org/osg/cpu2000. [27] J. Yang and R. Gupta. Energy Efficient Frequent Value Data Cache Design, Int. Symp. on Microarchitecture, Nov. 2002. [28] C. Zhang, J. Yang, and F. Vahid. Low Static-Power Frequent-Value Data Caches. Design Automation and Test in Europe Conference (DATE), February 2004. [29] J. Yang and R. Gupta. “Frequent Value Locality and its Applications,” ACM Transactions on Embedded Computing Systems (inaugural issue), Vol. 1, No. 1, pages 79-105, November 2000. TLFeBOOK

123

Chapter 7

REDUCING ENERGY CONSUMPTION IN CHIP MULTIPROCESSORS USING WORKLOAD VARIATIONS

I. Kadayif1, M. Kandemir2, N. Vijaykrishnan2, M. J. Irwin2 and I. Kolcu3 1 2 3 Canakkale Onsekiz Mart University; Pennsylvania State University; UMIST

Abstract Advances in semiconductor technology are enabling designs with several hundred million transistors. Since building sophisticated single processor based systems is a complex process from design, verification, and software development per- spectives, the use of chip multiprocessing is inevitable in future microprocessors. In fact, the abundance of explicit loop-level parallelism in many embedded ap- plications helps us identify chip multiprocessing as one of the most promising directions in designing systems for embedded applications. Another architectural trend that we observe in embedded systems, namely, multi-voltage processors, is driven by the need of reducing energy consumption during program execution. Practical implementations such as Transmeta’s Crusoe and Intel’s XScale tune processor voltage/frequency depending on current execution load. Considering these two trends, chip multiprocessing and voltage/frequency scaling, this chapter presents an optimization strategy for an architecture that makes use of both chip parallelism and voltage scaling. In our proposal, the compiler takes advantage of heterogeneity in parallel execution between the loads of different processors and assigns different voltages/frequencies to different processors if doing so reduces energy consumption without increasing overall execution cycles significantly. Our experiments with a set of applications show that this optimization can bring large energy benefits without much performance loss.

Keywords: Chip multiprocessing, voltage scaling, loop-level parallelism, embedded systems, optimizing compilers. TLFeBOOK

124 7.1 INTRODUCTION Rising development costs motivate computer architecture companies to de- sign fewer systems-on-chip, but to make each one they do design more flex- ible and programmable. Doing so makes it possible to reuse designs to take advantage of economies of scale and shorten time-to-market. Moreover, pro- grammability allows companies to keep products in the market longer, boosting integrated profits. High-performance embedded processors have traditionally relied mainly on clock frequency and superscalar instruction issue to boost performance. While frequency and superscalarity have served the industry well and will continue to be used, we believe that they have limitations that will diminish the gains they will deliver in the future. The gains in operating frequencies, which have historically come at a rate of about 35 percent per year, are attributable to two major factors: semiconductor feature scaling and deeper pipelining. But each of these factors is approaching the point of diminishing returns. Similarly, superscalar processing is nearing its limits, mainly due to the exponential in- crease in complexity in dispatch logic with increasing issue width. In addition, superscalar processing is limited by the inherent instruction-level parallelism in the code. Although VLIW implementations are less complex than their superscalar counterparts (since most of execution decisions are made by the compiler), they still employ power-hungry components and are limited by the available instruction-level parallelism. It should also be noted that both super- scalar and VLIW architectures are not efficient from an energy consumption viewpoint. Therefore, it is not clear whether current architectures will be suffi- cient for meeting continuously increasing power and performance demands of applications. These observations motive system designers to investigate different archi- tectures. When one looks at computer architecture industry today, two different trends in system design can easily be observed: on-chip multi-processing and multi-voltage processors. On-chip multi-processors take advantage of high- level, coarse-grain parallelism that exists due to the natural independence of separate program fragments (e.g., functions and loops). As compared to super- scalar and VLIW architectures, they are much more suitable for array-intensive embedded applications. Another advantage of using an on-chip multiprocessor, instead of a more powerful and sophisticated uniprocessor, is that there is less difficulty in designing a smaller, less complex chip. This also speedups chip verification and validation. Thus, time required to put the chip in the market becomes shorter. One can see several examples of on-chip multi-processing today in both academia and industry. For example, the four-core Hydra from Stanford University [14] is built around Integrated Device Technology Inc.’s RC32364 processor, which uses a 0.25-micron process, and runs at 250 MHz. TLFeBOOK

125 As manufacturing processes keep getting refined, it becomes even easier to replicate the core several times on a single die. The MAJC architecture from [11] allows one to four processors to share the same die, and for each to run separate threads. Each processor is limited to four functional units (each of which are able execute both integer and floating point operations, making the MAJC architecture more flexible). Another example of an on-chip multi-processor from industry is the Power4 processor from IBM [15], where two processors are placed into the same die. The second trend, multi-voltage processors, is mainly driven by the need to reduce energy consumption during program execution. Practical implementa- tions such as Transmeta’s Crusoe [10] and Intel’s XScale [8] scale processor voltage/frequency depending on execution load. Observing that one rarely needs an application to exercise a processor’s maximum performance and the unused extra performance usually represents wasted energy, Crusoe designers try to match the operating level of the processor (in terms of voltage and fre- quency) to the performance requirements of the application being executed. Depending on the voltage regulator, a Crusoe processor can change its voltage in steps of 25mV and its frequency in steps of 33MHz. Considering the continuously pressing power and performance demands, we can expect these two techniques to be co-exist in the future embedded archi- tectures. Specifically, we believe that future architectures will be based on on-chip multi-processors, where each on-chip processor can be individually voltage/frequency scaled. Considering such an architecture, this paper investi- gates the energy/performance tradeoffs in parallelizing array-intensive applica- tions taking into account the possibility that individual processors can operate in different voltage/frequency levels. In assigning voltage levels to processors, we make use of compiler analysis that reveals heterogeneity between the loads of different processors in parallel execution. Our experiments with a set of ap- plications show that the proposed optimization can bring large energy benefits without much performance penalty. The rest of this chapter is organized as follows. The next sections describes our chip multiprocessor. Section 7.3 discusses why we may be experiencing load imbalance across on-chip processors at runtime. Section 7.4 discusses the necessary compiler analysis for determining workloads (on a loop nest ba- sis) of individual processors participating in parallel computation. Section 7.5 discusses additional optimizations to further enhance our power savings. Sec- tion 7.6 describes our implementation, experimental platforms, and presents performance and energy numbers. Section 7.7 presents our concluding re- marks. TLFeBOOK

126

CPU0 CPU1 CPU2 CPU3

Cache Cache Cache Cache 0 1 2 3

L2 Cache Optional

Off-Chip

Figure 7.1. Chip multiprocessor under consideration.

7.2 CHIP MULTIPROCESSOR ARCHITECTURE AND EXECUTION MODEL The chip multiprocessor we consider here is a shared-memory architecture; that is, the entire address space is accessible by all processors. Each processor has a private L1 cache, and shared memory is assumed to be off-chip. Option- ally, we may include a (shared) L2 cache as well. Note that several architectures from academia and industry fit in this description [1, 14, 11, 12]. We keep the subsequent discussion simple by using a shared bus as the interconnect (though one could use fancier/higher bandwidth interconnects as well). We also use the MESI [19] protocol (the choice is orthogonal to the focus of this paper) to keep the caches coherent across the CPUs. We assume that voltage level and frequency of each processor in this architecture can be set independently of the others, and this is the main mechanism through which we save power. This paper focuses on a single-issue, five-stage (instruction fetch (IF), instruc- tion decode/operand fetch (ID), execution (EXE), memory access (MEM), and write-back (WB) stages) pipelined datapath for each on-chip processor. Cur- rently, this is the only architectural model for which our compiler estimates processor workload. Note that progress in VLSI technology has allowed chip-makers to pack millions of transistors in a single die. Rather than throwing all these resources into a single, powerful processing core and making this core very complex to design and verify, chip-multiprocessors consisting of several simpler proces- sor cores can offer a more cost-effective and simpler way of exploiting these higher levels of integration. Chip multiprocessors also offer a higher granu- larity (thread/process level) at which parallelism in programs can be exploited by compiler/runtime support, rather than leaving it to the hardware to extract the parallelism at the instruction level on a single (larger) multiple-issue core. All these compelling reasons motivate the trends toward chip multiprocessor TLFeBOOK

127 architectures, and there is clear evidence of this trend in the several commercial offerings and research projects [1, 14, 11, 12]. Our application execution strategy can be summarized as follows. We focus on array-based applications that are constructed from loop nests. Typically, each loop nest in such an application is small but executes a large number of iterations and accesses/manipulates large datasets (typically multidimensional arrays). We employ a loop nest based application parallelization strategy. More specifically, each loop nest is parallelized independently of the others. In this context, parallelizing a loop nest means distributing its iterations across proces- sors and allowing processors to execute their portions in parallel. For example, a loop with 1000 iterations can be parallelized across 10 processors by allo- cating 100 iterations to each processor. We also assume that after each loop nest execution, all processors get synchronized before they start executing the next loop nest. Note that dropping this requirement would necessitate a so- phisticated compiler analysis to identify the cases under which a processor that finishes its portion of iterations from the previous loop nest can go ahead and start executing its portion from the next loop nest without waiting for the others. Nevertheless, in our experiments to be presented later, we also evaluate such an alternative strategy. There are many proposals for power management of a dynamic voltage scaling-capable processor. Most of them are at operating system level and are either task-based [13, 17] or interval-based [21, 5]. While some proposals aim at reducing energy without compromising performance, a recent study by Grunwald et al [6] observed noticeable performance loss for some interval- based algorithms using actual measurements. The existing compiler based studies such as [7, 16] target single processor architectures. In comparison, our work targets at a chip multiprocessor based environment.

7.3 LOAD IMBALANCE IN PARALLEL EXECUTION We can broadly divide loop nest parallelization techniques into two cate- gories: static and dynamic. In the static case, the compiler (or the user) decides a suitable parallelization strategy for each loop nest at compile time. The idea is to assign each loop iteration to a processor. There are at least two ways of doing this. In block assignment, a group of consecutive loop iterations are as- signed to the same processor. Since such iterations typically access data stored in consecutive memory locations, this type of assignment can also be expected to be data locality friendly. In cyclic assignment, the iterations assigned to processors are interleaved using some stride. While this type of assignment is known to be good from a load balance viewpoint, it generally exhibits poor data locality. Consider, as an example, the loop nest shown below and the array reference in it: TLFeBOOK

128

(a) (b) (c)

  P   0   P  0  P  P  0 1     P   2   P  1  P  P  1 3     P  P  0   2  P  P   2 1    P   2   P  P  3 P   3  3  

(d) (e)

     V  P   0  0         V  P    1   1         P    V  2        2               V  P      3     3          

P P2 P P V0 V1 V2 >>> V3 3 1 0

Figure 7.2. Different array accesses imposed by different iteration assignments (the array is assumed to be row-major).

for i: 1..1024 for j: 1..1024 ..X[i,j]..

Assuming that only the i-loop is parallelized across four processors (P0 through P3), Figure 7.2(a) illustrates how array X is accessed by the processors when block iteration assignment is used. In this assignment, each processor executes 256 × 1024 iterations, and accesses a group of consecutive rows of the array as depicted in Figure 7.2(a). However, it is also possible to parallelize this loop (i) by distributing its iterations cyclicly across processors using some regular stride. For example, we can give the first 128 × 1024 iterations to the first processor, the next 128 × 1024 to the second one and so on, and when we give its quota to the last processor, we can repeat the whole process (until all loop iterations have been assigned) starting over with the first processor. Figure 7.2(b) shows how array X is accessed by the processors under this cyclic iteration assignment scheme. Note that the cyclic iteration distribution is flexi- ble in the sense that it can work with any stride. For example, instead of using 128 × 1024 iteration chunks, we could have easily used 16 × 1024 or even 1 × 1024 iteration chunks. In comparison, in a dynamic parallelization strategy, the assignment of iter- ations to processors is performed dynamically during the course of execution by a central controller. Typically, this controller gives a new set of loop it- erations to a processor when that processor is done with executing its current set of assigned iterations. While the dynamic strategy is expected to balance TLFeBOOK

129 the workloads of processors better than static strategies (as it can take run- time constraints into account), it also incurs a much higher runtime cost — in terms of both execution cycles and power consumption — (as compared to the static parallelization schemes) since decisions regarding iteration assignments are made at runtime. Therefore, our focus in this study is on static loop nest parallelization. Consider now the following loop nest:

for i: 1..1024 for j: i..1024 ..X[i,j]..

While this loop nest is similar to the previous one considered above, there is one significant difference: the lower bound of the inner loop (j)isi (instead of 1). Figure 7.2(c) shows how the four processors access the array in question when block iteration assignment is employed. Clearly, there is a significant load imbalance across the processors. Assuming that each iteration of this loop nest has the same cost (in terms of execution cycles) and all processors should synchronize following the execution of the nest, there is not any advantage for the processors with the light load to finish their set of iterations as soon as pos- sible. Instead, they can delay their executions (by reducing their frequencies) and lower their voltages to save energy while making sure that their execution does not take more time than that of the processor with the largest load (op- erating with the highest voltage level). Figure 7.2(d) illustrates such a voltage assignment, assuming that V0 is the highest voltage level available. The work presented in this paper performs such a voltage-to-processor assignment for each loop nest of a given array-based application. In a sense, in our frame- work the job of the compiler is not just to decide which loop iterations should be assigned to which processors but also which supply voltage/frequency each processor needs to use. Our objective is to save as much power as possible without incurring much performance penalty. At this point, someone might claim that it would be better in this case (Fig- ure 7.2(c)) to use cyclic assignment instead of block assignment as this would eliminate the load imbalance problem introduced by the latter to a large extent. However, this may not be a viable option in general. Consider, for example, the scenario depicted in Figure 7.2(e), where the direction of parallelization is reversed (due to data dependences for example). In this case, cyclic assignment would be very costly in terms of data locality (cache behavior), assuming that the array in question is stored as row-major. Considering the fact that off-chip memory accesses are getting more and more expensive in terms of processor cycle times, one may not want to degrade data locality. TLFeBOOK

130 7.4 COMPILER SUPPORT As mentioned earlier, the compiler’s job in our setting is to assign not only iterations to processors but also come up with a suitable voltage level for each processor. To do this, the compiler needs to estimate the workload of each processor and match it with an appropriate voltage/frequency level. Without loss of generality, we assume that there are s voltage/frequency levels available to the compiler. Our compiler-based approach proceeds as follows: • Parallelization Step. In this step, the compiler parallelizes an applica- tion in a loop nest basis. That is, each loop nest is parallelized independently considering the intrinsic data dependences it has. Since we are targeting a chip multiprocessor, our parallelization strategy tries to achieve (for each nest) outer-loop parallelism to the best extent possible. In other words, we parallelize the outermost loop (in the nest) that carries no data dependence. Our baseline results are obtained using this parallelization strategy. Later in our experiments, we change our parallelization strategy to conduct a sensitivity analysis. • Processor Load Estimation. In this step, the compiler estimates the load of each processor in each nest. To do this, it performs two calculations: (a) iteration count estimation and (b) per-iteration cost estimation. Since in most array-based embedded applications bounds of loops are known before execution starts, estimating the iteration count for each loop nest is not very difficult. The challenge is in determining the cost (in terms of execution cycles) of a single iteration (for a given loop nest). Since the processors employed in our chip multiprocessor are simple single-issue cores, our cost computation is closely dependent on the number and types of the assembly instructions that will be generated for the loop body. Specifically, we associate a base execution cost with each type of assembly instruction. In addition, we also estimate the number of cache misses. Since loop-based embedded applications exhibit very good instruction locality (as they spend most of their execution cycles within loop nests and there are not too many conditional-if executions), we focus on data cache and estimate data cache misses using the method proposed by Carr et al [2]. An important issue is to estimate (at the source level) what assembly instructions will be generated for the loop body in question. We attack this problem as follows. The constructs that are vital to the studied codes include a typical loop, a nested loop, assignment statements, array references, and scalar variable references within and outside loops. Our objective is to estimate the number of assembly instructions of each type associated with the actual execution of these constructs. To achieve this, the assembly equivalents of several codes were obtained using our back-end compiler (a variant of gcc) with the O2-level optimization. Next, the portions of the assembly code were correlated with corresponding high-level constructs to extract the number and type of each instruction associated with the construct. In order to simplify the TLFeBOOK

131 correlation process and to partially isolate the impact of instruction choice due to low-level optimizations, the assembly instructions with similar functionality and energy consumption are grouped together. For example, both branch- if-not-equal (bne) and branch-if-equal (beq) are grouped as a generic branch instruction (denoted bra). To illustrate our parameter extraction process in more detail, we focus on some specifics of the following example constructs. First, let us focus on a loop construct. Each loop construct is modeled to have a one-time overhead to load the loop index variable into a register and initialize it. Each loop also has an index comparison and an index increment (or decrement) overhead whose costs are proportional to the number of loop iterations (called trip count or trip). From correlating the high-level loop construct to the corresponding assembly code, each loop initialization code is estimated to execute one load (lw) and one add (add) instruction (in general). Similarly, an estimate of trip+1 load (lw), store-if-less-than (stl), and branch (bra) instructions is associated with the index variable comparison. For index variable increment (resp. decrement), 2×trip addition (resp. subtraction) and trip load, store, and jump instructions are estimated to be performed. Next, we consider extracting the number of instructions associated with ar- ray accesses. First, the number and types of instructions required to compute the address of the element are identified. This requires the evaluation of the base address of the array and the offset provided by the subscript(s). Our cur- rent implementation considers the dimensionality of the array in question, and computes the necessary instructions for obtaining each subscript value. Com- putation of the subscript operations is modeled using multiple shift and addi- tion/subtraction instructions (instead of multiplications) as this is the way our back-end compiler generates code when invoked with the O2 optimization flag. Finally, an additional load/store instruction was associated to read/write the corresponding array element. Note that these correlations between high-level constructs and low-level assembly instructions are a first-level approximation for our simple architecture and array-dominated codes with the O2-level op- timization and obtained through extensive analysis of a large number of code fragments. Based on the process outlined above, the compiler estimates iteration count for each processor and per-iteration cost. Then, by multiplying these two, it calculates the estimated workload for each processor. While this workload estimation may not be 100% accurate, it allows the compiler to rank processors according to their workloads and assign suitable voltage levels and frequencies to them as will be described in the next item. As an example consider the second loop nest shown above, parallelized using 4 processors. Assuming that our estimator estimates the cost of loop body as L instructions, the loads of TLFeBOOK

132 processors P0, P1, P2,andP3 are 256 × 1024 × L, 256 × (1024-257+1) × L, 256 × (1024-513+1) × L, and 256 × (1024-769+1) × L, respectively. • Voltage Assignment. In this step, the compiler first orders the proces- sors according to non-increasing workloads. After that, the highest voltage is assigned to the processor with the largest workload (the objective being not to affect the execution time to the greatest extent possible). Then, the processor with the second highest workload gets assigned to the minimum voltage level Vk available (where 1 ≤ k ≤ s) that does not cause its execution time to exceed that of the processors with the largest workload. In this way, each processor gets the minimum voltage level (to save maximum amount of power) without increasing overall parallel execution time of the nest (which is determined by the processor with the largest workload). Continuing with the example above, suppose that we have two voltage/frequency levels (that is, V1/f1 and V2/f2, assuming s =2andV1/f1 >V2/f2), we first determine the execution time taken by processor P0 (denoted T0). Then, for each other processor, we use V2/f2 if doing so does not cause their execution times to exceed T0. If any of these execution times exceeds T0 (when using V2/f2), we switch back to V1/f1 for that processor. The success of our strategy critically depends on two important factors. First, there should be some load imbalance to exploit between different processors. This is because if there is no such imbalance then it is reasonable to execute each processor with the highest voltage/frequency. Second, the compiler-based workload estimation should be reasonably accurate. If this is not the case, then we may assign a wrong voltage level/frequency to a processor, which may in turn impact overall execution time. In fact, in this scheme, the only time we pay some penalty is when our compiler-based workload estimation is not very accurate. In our experiments, we quantify this penalty in detail.

7.5 ADDITIONAL OPTIMIZATIONS In this section, we discuss how the effectiveness of our strategy can be further increased using additional optimizations.

7.5.1 Inter-Nest Optimization In the description of our strategy above, we assumed that the processors will synchronize at the end of each loop nest (before they start executing the next loop nest). As noted by Tseng [20], such a global synchronization presents two major problems. First, to implement such a synchronization, the compiler needs to generate extra (synchronization) code and insert it in the application code. Obviously, this code presents extra performance and power overhead at runtime. Second, since this synchronization requires all processors to wait for the slowest one, it makes poor use of available resources (from the performance TLFeBOOK

133 angle). Consequently, allowing a processor to continue without waiting for the slower ones can allow small perturbations in processor execution times to even out, thereby improving overall performance (by taking advantage of the loosely-coupled nature of chip multiprocessors). However, determining when it is safe to allow a processor to continue without synchronization requires extra compiler analysis. In this study, we implemented a strategy that takes a number (called b) as a parameter, and for each loop nest, allows a processor to continue for at most b next nests if doing so does not violate any data dependences.

7.5.2 Voltage/Frequency Reuse Another optimization can be performed by being more careful in voltage assignment. Up to this point in our discussion we assumed that the processor assignment for each loop nest is done independently of the other nests. As a result of this, as we move from one loop nest to another the same processor can get assigned different voltage levels. Consequently, we pay a penalty (in terms of both performance and energy consumption) for changing voltage levels. This penalty can be minimized by reusing the same voltage as much as possible for the same processor throughout the execution. This can be achieved as follows. Suppose that in loop nest i, we used voltage level Vk for processor j.When we move to loop nest i +1if we need to assign voltage level Vk to a processor, we use processor j for that. This can be repeated for each neighboring loop nest pair, and in this way, the processors reuse their voltage levels as much as possible.

7.5.3 Adaptive Parallelization So far in our treatment of the subject, we have assumed that we use all available processors in execution of all nests in the application. However, it is known from prior research [9] that, in some cases using fewer processors (and shutting off the unused ones along with their L1 caches) can result in a better energy consumption behavior. We also conducted experiments with an adaptive strategy, where each loop nest is first profiled using different number of processors in conjunction with our optimization strategy. After the profiling, for each loop nest, we identified the ideal number of processors, and used it in the actual execution. It should be noted that in adaptive parallelization we use fewer number of processors than available (this means some performance loss); however, turning off unused processors along with their L1 caches can bring energy benefits.

7.5.4 Combining Cyclic and Block Iteration Allocations As has been discussed earlier in the paper, one may also opt to use cyclic distribution of loop iterations across processors. Since our framework is able to TLFeBOOK

134

Table 7.1. Base simulation parameters used in our experiments.

Parameter Default Value Number of Voltage/Frequency Levels 8 Lowest/Highest Voltage Levels 0.8V/1.4V Frequency Step Size 30MHz Voltage/Frequency Transition Penalty 10 cycles/2.10nJ L1 Size 8KB L1 Line Size 32 bytes L1 Associativity 4-way L1 Latency 1cycle L2 Size (Shared) 2MB L2 Associativity 4-way L2 Line Size 64 bytes L2 Latency 10 cycles Memory Access Latency 100 cycles Bus Arbitration Delay 5cycles Replacement Policy Strict LRU L1 Energy (per access) 1.14nJ L2 Energy (per access) 2.56nJ Main Memory Energy (per access) 23.10nJ estimate the number of cache misses, we can potentially have a better strategy as follows. For each loop nest, we can calculate the number of misses for both block and cyclic wise allocations and select the strategy that generates the best energy savings under a performance (execution cycles) constraint. We can refer to such a strategy as hybrid since it makes use of both block and cyclic wise allocation.

7.6 EXPERIMENTS We tested the effectiveness of our algorithm in reducing energy consump- tion of chip multiprocessor using six array-intensive programs: 3D, DFE, LU, SPLAT, MGRID, and WAVE5. 3D is an image-based modeling application that simplifies the task of building 3D models and scenes. DFE is a digital image filtering and enhancement code. LU is an LU decomposition program. SPLAT is a volume rendering application which is used in multi-resolution vol- ume visualization through hierarchical wavelet splatting. Finally, MGRID and WAVE5 are C versions of two Spec95FP applications. These C programs are written in such a fashion that they can operate on inputs of different sizes. The default configuration parameters used in our experiments are given in Table 7.1, and these are the values that are used unless explicitly stated/varied in the sensitivity experiments. To conduct our experiments, we modified Simics [18]. Simics is a full system simulation platform that can simulate both uniprocessor and multiprocessor TLFeBOOK

135

Figure 7.3. Normalized energy consumption with different number of processors (8 voltage levels). machines. All energy results reported in this section include the energy spent in CPUs, their caches, and main memory and have been normalized with respect to the energy consumption when no voltage scaling is used and each processor is operated with maximum supply voltage and frequency. The graph in Figure 7.3 gives the normalized energy consumptions with different number of processors. We can make two main observations from this graph. First, all our six applications get some energy benefit from our approach with all processor sizes experimented. Second, our energy savings get better with increased number of processors. This is because a larger number of processors means more load imbalance to optimize, and our approach takes advantage of it. When considering individual applications, one can see that MGRID and WAVE5 perform poorly as compared to the others, mainly because these applications have very few cases where our approach is applicable. In comparison, LU benefits much from increasing the number of processors since most of its few loops exhibit significant amount of load imbalance. Overall, the average savings across all six applications are between 16.03% (for the two processor case) and 41.80% (for the thirty-two processor case). To evaluate the impact of the number of voltage levels on energy savings, we also performed experiments with different number of voltage levels. The results are presented in Figure 7.4 for the 8 processor case. One can easily see from this graph that the number of voltage levels has a significant impact on energy behavior. In particular, the difference in going from 4 levels to 8 levels is dramatic; the corresponding savings are 6.63% and 29.02%. Increasing the number of voltage levels further (to 16) does not bring too much additional energy benefits since there is little scope left to be optimized (beyond what could be optimized using TLFeBOOK

136

Figure 7.4. Normalized energy consumption with different voltage levels (8 processors).

8 levels). It should also be mentioned that when we have only 2 levels, the average saving across all applications is only 2.40%. This poor results is due to the fact that our strategy tries not to increase execution cycles as much as possible. Consequently, in many cases (when we have only 2 voltage levels) the compiler cannot use the lower voltage for a processor (even though the processor has low workload) since doing so would increase execution cycles dramatically. Recall that in Section 7.5 we discussed four different optimization strategies that can further increase energy savings. The graph shown in Figure 7.5 gives normalized energy consumptions with these optimizations. The first bar for each application corresponds to our strategy when none of these four optimiza- tions have been activated. Our first observation is that each application benefits from one or more of these optimizations. Second, not every optimization is ef- fective for each benchmark. For example, using the hybrid iteration allocation brings energy benefits in only 3D and DFE (since the nests in other allocations exhibit a uniform behavior and prefer only one type of iteration allocation for the best energy behavior). Similarly, adaptive parallelization is useful only for 3D and SPLAT. To further study the impact of inter-nest optimization (one of the optimizations discussed in Section 7.5), we also performed experiments with different values for b (nore that the default value that we used in Figure 7.5 is 4). We see from the graph in Figure 7.6 that for the applications that benefit from this optimization, a b value of 4 seems to be reasonable. This is because in many cases the data dependences in the application prevent a processor from going beyond the next four nests to execute without waiting for the slower processors. While the energy savings reported in this section are significant, for a fair comparison one also needs to consider the impact of our approach on perfor- TLFeBOOK

137

Figure 7.5. Impact of different optimizations on energy consumption (8 processors; 8 voltage levels; and b =4).

Figure 7.6. Impact of b on energy consumption (8 processors and 8 voltage levels). mance. As has been pointed out earlier, our approach can lead to an increase in execution cycles only if the compiler analysis is largely inaccurate. The graph in Figure 7.7 shows that the performance overhead incurred by our approach is below 2% in all but one (SPLAT) application. The reason that we have a relatively large performance penalty in SPLAT is the fact that this application exhibits a large number of conflict misses (over 68%, and rest are cold and ca- pacity misses), which cannot be captured by the cache miss estimation scheme currently employed by our implementation. Consequently, the compiler is not very successful in attaching suitable voltage levels to processors, and this in turn causes performance degradation. It is conceivable that a more accurate TLFeBOOK

138

Figure 7.7. Percentage increase in execution cycles (8 processors; 8 voltage levels). cache miss estimation strategy (e.g., [4]) can help improve the behavior of this benchmark. This will be part of our future research on this topic.

7.7 CONCLUDING REMARKS A chip multiprocessor lowers the number of functional units per processor, and distributes separate tasks/threads to each processor. This paper has evalu- ated a compiler-directed strategy that allows different processors to use different voltage levels/frequencies to take advantage of the load imbalances stemming from loop parallelization. Our results with six applications clearly demonstrate the effectiveness of our strategy and makes a case for voltage-sensitive loop parallelization. Our results also show that it is possible to increase energy savings further by employing voltage/frequency reuse, adaptive parallelization, and inter-nest optimization.

Acknowledgments This work was supported in part by NSF Career Awards #0093082 and #0093085, and a grant from GSRC PAS.

References

[1] L.A.Barroso,K.Gharachorloo,R.McNamara,A.Nowatzyk,S.Qadeer,B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Proceedings of International Sym- posium on Computer Architecture, Vancouver, Canada, June 12–14 2000. [2] S. Carr, K. S. McKinley, and C. Tseng. Compiler Optimizations for Im- proving Data Locality. Proceedings of the Sixth International Conference TLFeBOOK

139 on Architectural Support for Programming Languages and Operating Sys- tems, San Jose, October 1994. [3] DAC’02 Sessions: Design Methodologies Meet Network Applications and System on Chip Design, New Orleans, LA, June 2002. [4] S. Ghosh, M. Martonosi, and S. Malik. Cache Miss Equations: An An- alytical Representation of Cache Misses. Proceedings of the 11th ACM International Conference on Supercomputing, July, 1997. [5] K. Govil, E. Chan, and H. Wasserman. Comparing Algorithms for Dynamic Speed-Setting of a Low-Power CPU. Proceedings of the 1st ACM Interna- tional Conference on Mobile Computing and Networking, November 1995. [6] D. Grunwald, P. Levis, K. Farkas, C. Morrey III, and M. Neufeld. Poli- cies for Dynamic Clock Scheduling. Proceedings of the 4th Symposium on Operating System Design and Implementation, October 2000. [7] C.-H. Hsu and U. Kremer. Dynamic Voltage and Frequency Scaling for Scientific Applications. Proceedings of the 14th Workshop on Languages and Compilers for Parallel Computing, August 2001. [8] Intel XScale Technology. http://www.intel.com/design/intelxscale/. [9] I. Kadayif, M. Kandemir, and U. Sezer. An Integer Linear Programming Based Approach for Parallelizing Applications in On-Chip Multiproces- sors. In Proc. Design Automation Conference, New Orleans, LA, June 2002. [10] A. Klaiber. The Technology Behind Crusoe Pro- cessors. Transmeta White Paper, January 2000. http://www.transmeta.com/about/press/white papers.html. [11] MAJC-5200. http://www.sun.com/microelectronics/MAJC/5200wp .html [12] MP98: A Mobile Processor. http://www.labs.nec.co.jp/MP98/top-e.htm. [13] T. Okuma, T. Ishihara, and H. Yasuura. Real-Time Task Scheduling for a Variable Voltage Processor. Proceedings of the 12th International Sympo- sium on System Synthesis, 1999. [14] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The Case for a Single Chip Multiprocessor. Proceedings of the 7th Intl Confer- ence on Architectural Support for Programming Languages and Operating Systems, ACM Press, New York, 1996, pp. 2–11. [15] POWER4 System Microarchitecture, White Paper, http://www-1.ibm .com/servers/eserver/pseries/hardware/whitepapers/power4 .html [16] H. Saputra, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, J. S. Hu, C- H. Hsu, and U. Kremer. Energy-Conscious Compilation Based on Voltage TLFeBOOK

140 Scaling. Proceedings of ACM SIGPLAN Joint Conference LCTES’02 and SCOPES’02, Berlin , Germany, June, 2002. [17] Y. Shin, K. Choi, and T. Sakurai. Power Optimization of Real-Time Em- bedded Systems on Variable Speed Processors. Proceedings of the Inter- national Conference on Computer-Aided Design, November 2000. [18] SIMICS. http://www.virtutech.com/simics/simics.html. [19] J. P. Singh and D. Culler. Parallel Computer Architecture: A Hardware- Software Approach, Morgan-Kaufmann, 1998. [20] C.-W. Tseng. Compiler Optimizations for Eliminating Barrier Synchro- nization. Proceedings of 5th ACM Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995. [21] M. Weiser, B. Welch, A. Demers, and S. Shenker. Scheduling for Reduced CPU Energy. Proceedings of the 1st Symposium on Operating Systems Design and Implementation, November 1994. TLFeBOOK

141

Chapter 8 ARCHITECTURES AND DESIGN TECHNIQUES FOR ENERGY EFFICIENT EMBEDDED DSP AND MULTIMEDIA PROCESSING

Ingrid Verbauwhede1,2, Patrick Schaumont1, Christian Piguet3, Bart Kienhuis4 1University of California, Los Angeles; 2K.U.Leuven; 3CSEM; 4Leiden

Abstract Energy efficient embedded systems consist of a heterogeneous collection of very specific building blocks, connected together by a complex network of many dedicated busses and interconnect options. The trend to merge multiple functions into one device makes the design and integration of these “systems- on-chip” (SOC’s) even more challenging. Yet, specifications and applications are never fixed and require the embedded units to be programmable. The topic of this chapter is to give the designer architectures and design techniques to find the right balance between energy efficiency and flexibility. The key is to include programmability (or reconfiguration) at the right level of abstraction and tuned to the application domain. The challenge is to provide an exploration and programming environment for this heterogeneous architecture platform.

Keywords: Embedded systems, architectures, low power, design tools, design exploration

8.1 INTRODUCTION

Embedded systems (e.g. a cell phone, a GPS receiver, a portable DVD player, a HDD ) use an architecture that is a heterogeneous collection of very specific building blocks, connected together by a complex network of many dedicated busses and interconnect options. General- purpose programmable processors are not used for energy efficiency reasons. Typically, multiple small embedded processor cores with accelerators, IP cores, etc. are used. The trend to merge multiple functions TLFeBOOK

142 into one device (e.g. a cell phone with video capabilities) makes the design and integration of these “systems-on-chip” (SOC’s) even more challenging. Yet, specifications and applications are never fixed and require the embedded units to be programmable. A good balance between energy efficiency and programmability can be obtained by using programmable domain-specific processors. A well known example are the programmable digital signal processors (DSPs). DSPs are developed for wireless communication systems (mostly driven by cellular standards). In a first generation this meant that DSPs were adapted to execute many types of filters (e.g. FIR, IRR), later communication algorithms such as Viterbi decoding and more recently Turbo decoding are added. A first trend we notice is that more applications and multiple applications run in parallel or on demand on the device, e.g. video decoding, data processing, multiple standards, etc. A second trend we notice is that these new applications tend to run either on a separate domain specific programmable processor or on a hardware accelerator (the distinction between the two being rather blurry) next to the embedded DSP or micro- controller instead of being tightly coupled into the instruction set of the host processor. A third trend we notice is that general-purpose programming environments are getting more heterogeneous and domain-specific. The general-purpose solutions are for energy efficiency reasons augmented with domain specific units, accelerators, IP cores, etc. This is clearly visible in FPGA’s, as the new generations now include specialized blocks such as embedded core’s, block RAM’s and large numbers of multipliers. One successful example is the Virtex-Pro family of Xilinx [17]. These devices contain up to four Power PC cores, multiple columns of SRAM, multiple columns of multipliers, Gbits IO transceivers, etc. The architecture design of this heterogeneous SOC is a search in a three dimensional design space, which we call the reconfiguration hierarchy [12]. First in the Y direction: at what level of abstraction should the programming be introduced? Secondly in the X direction: which component of the architecture should be programmable? Thirdly in the Z direction: what is the timing relation between processing and the configuration/programming? Programming can be introduced at multiple levels of abstraction. When it is introduced at the instruction set level, it is called a “programmable processor”. When it is introduced at the CLB level of an FPGA, it is called a reconfigurable device. Regarding components, a processor has four basic components: data paths, control, memory and interconnect. One has a choice of making some or all of them programmable. Then the third question is to compare the processing activity to the binding time. It makes a system configurable, reconfigurable, or dynamic reconfigurable. TLFeBOOK

143 The challenge is to develop a design environment to navigate in this three dimensional design space. Several SOC platforms have been presented in literature. Most of them focus on general -purpose regular architectures, e.g. [2]. Very few focus on the low power issue and the need to tune the architecture towards the application. One example is the low power Maya platform [18]. Unique to our design approach is that we combine the design and programming of the architecture with an environment to explore the best options. The chapter is organized as follows. Section 8.2 and 8.3 look at the architecture design, while section 8.4 and 8.5 discuss the design exploration, co-design and co-simulation challenges.

8.2 ENERGY EFFICIENT HETEROGENEOUS SOC’S

The system designer needs an architecture platform that gives him the lowest energy consumption, but at the same time provides enough flexibility to allow re-programming or re-configuration. The key to energy efficiency is to tune the architecture to the application domain. This means freezing flexibility in the X (components) and Y (level of abstraction) direction of the reconfiguration hierarchy. A hierarchy of so-called “Y charts” allows us to do this in a top-down fashion [5]. A complex SOC will consist of multiple domain specific processing engines. Each processor is programmable to a more or less degree. It can be highly programmable if the processor is a micro-controller or a DSP engine or a blank box of CLB units. The efficiency goes up as domain specific instructions are added. An example of this is the addition of a MAC instruction to a DSP processor. Loosely coupled co-processors will be more energy efficient but less flexible as they fit a narrower application domain. An example is the Turbo coder acceleration unit. The ultimate energy efficient block is the optimized hard IP unit. Yet, it does not provide any flexibility. In SOC a range and collection of these blocks are used. Similarly arguments can be made for the interconnect component of a SOC. Currently, we see only two extreme options: either dedicated one-to- one connections and specialized busses, which have the lowest power consumption (to a first order) or general-purpose global busses or inter- connect, as provided by FPGA’s [17] or networks on chip [2]. The latter two are both general-purpose solutions at different levels of abstraction to give the designer a maximum flexibility and programmability. TLFeBOOK

144

Software ProtocolStandard Algorithm Architecture Domain- MicroArchitectre Specific Circuit Hardware Hardware Networking SecurityVideo Signal Proc

MEMORY RF Video Signal Baseband Crypto Signal Baseband Engine Processing CPU Processing Engine

Reconfigurable Interconnect

Figure 8-1. Example RINGS Architecture.

The proposed RINGS architecture [16] is an architecture platform that gives the designer the option to explore the energy flexibility trade-offs. An example is shown in Fig. 8-1. A RINGS architecture contains a heterogeneous set of building blocks: programmable cores, both DSP’s and micro-controllers, programmable and/or reconfigurable hardware accelerator units, specialized IP building blocks, front-end blocks, and so on. When designing a solution based on RINGS, it is important that the domain expert has freedom to select the appropriate level of flexibility, ranging from fully programmable approaches, such as embedded micro controllers or FPGA blocks to highly optimized IP blocks. For different domains, the flexibility will be supported in different ways as domains have different characteristics. This domain specific flexibility can be expressed as a do-main specific abstraction pyramid as shown for Networking, Video, and Signal Processing on Fig. 8-1. In case of Video, the engine will consist of elements expressed in the Video pyramid, for example dedicated co-processors. The SOC is connected together at the top level by a supervising software program, which typically runs on an embedded micro-controller. At the bottom level, the reconfigurable interconnect glues it together. The programming paradigm used in RINGS is a reconfigurable network-on-chip. Also in this network, flexibility can be traded for energy efficiency at different levels of abstraction. Designers can instantiate an arbitrary network of 1D and 2 D router modules leading to an architecture illustrated in Fig. 8-2. TLFeBOOK

145

Proc A Proc B Proc A

2D 1D 2D router router router

2D 2D router router

Proc X Proc Y

Figure 8-2. Example of Network-on-chip.

This network illustrates the three binding time concepts. At the level of configuration, the static network architecture with routers is instantiated. Reconfiguration is done by means of reprogramming the routing tables and programming by giving each packet a target address. A traditional reconfiguration is obtained by reprogramming the routing tables in each node. An alternative approach is to use an easy to reconfigure physical channel. One example of this is a CDMA based reconfigurable interconnect [6][16]. Fig. 8-3 shows a conceptual picture of a source-synchronous CDMA implementation. Each sender and receiver gets a unique spreading code. By changing the Walsh code, a different configuration is obtained. Traditional busses, which are a TDMA channel, require hardware switches for reconfiguration. CDMA interconnect has the advantage that reconfiguration can occur “on-the-fly.”

. . .

MOD1 MOD3

. . .

X X X

Figure 8-3. Reconfigurable Interconnect (a) TDMA (b) SS-CDMA Bus Interface [1]. TLFeBOOK

146 8.3 ULTRA LOW POWER COMPONENTS

The focus of this section is on the architecture design options to design ultra low power processor components, in many cases without losing performance. DSP processors have real-time constraints or need to maximize their throughput for a given task while at the same time minimize the power or energy consumption. Therefore, the design of DSP processors is very challenging, as it has to take into account contradictory goals: an increased throughput request at a reduced energy budget. On top there are new issues due to very deep submicron technologies such as interconnect delays and leakage. For instance, hearing aids used analog filters 15 years ago and were designed as digital ASIC-like circuits 5 years ago. Today they are designed with powerful DSP processors below 1 Volt and 1 mW of power consumption [8]. Hearing aids companies require DSP processors just because they require flexibility, i.e. to program the applications in-house. The design of ultra-low power DSP cores has to be performed at all design levels, i.e. system, architecture, circuit and technology levels. We will focus in this section to DSP architectures, but VHDL implementations as well as cell libraries are important too. Latch-based implementations including gated clocks described in VHDL or Verilog, low-power standard cell libraries and leakage reduction circuit techniques are necessary to reduce power consumption at these low levels. Various DSP architectures can be and have been proposed to reduce significantly the power consumption while keeping the largest throughput. Beyond the single MAC DSP core of 5-10 years ago, it is well known that parallel architectures with several MAC working in parallel allow the designers to reduce the supply voltage and the power consumption at the same throughput. It is why many VLIW or multitask DSP architectures have been proposed and used even for hearing aids. The key parameter to benchmark these architectures is the number of simple operations executed per clock cycle, up to 50 or more. However, there are some drawbacks. The very large instruction words up to 256 bits increase significantly the energy per memory access. Some instructions in the set are still missing for new better algorithms. Finally the growing core complexity and transistor count becomes a problem because leakage is roughly proportional to the transistor count. To be significantly more energy efficient, there are basically two ways, however impacting either flexibility or the ease of programming: 1. To design specific very small DSP engines for each task, in such a way that each DSP task is executed in the most energy efficient way on the smallest piece of hardware [9]. For N DSP tasks within a given TLFeBOOK

147 application, the resulting architecture will be N co-processors or hardware accelerators around a controller or a simple DSP core as illustrated on Fig. 8-1.

Memory Memory Memory Memory

x + - x

Figure 8-4. Hardware Reconfiguration Example [3].

2. To design reconfigurable architectures such as the DART cluster [3], in which configuration bits allow the user to modify the hardware in such a way that it can much better fit to the executed algorithms. Fig. 8-4 shows an example. Option 1 is definitively the best one regarding power consumption. Each DSP task uses the minimal number of transistors and transitions to perform its work. The control code unavoidable in every application is also efficiently executed on the controller or on the simple DSP, and some unexpected DSP tasks can be executed on the simple DSP if no accelerator is available. However, the main issue is the software mapping of a given application onto so many heterogeneous processors and co-processors (see Section 4). Transistor count could be high and some co-processors fully useless for some applications. Regarding leakage, unused engines have to be cut off from the supply voltages, resulting in complex procedures to start/stop them. Reconfigurable DSP architectures are much more power efficient than FPGAs. The key point is to reconfigure only a limited number of units within the DSP core, such as some execution units and addressing units [11]. The latter are interesting, as the operands fetch from memory is generally a severe bottleneck in parallel machines for which 8-16 operands are required each clock cycle. So, sophisticated addressing modes can be dynamically reconfigured depending on the DSP task to be executed. Fig. 8-5 shows an example in which several addressing modes can reconfigured depending on the user’s algorithms. This AGU (Address Generation Unit) contains 4 index registers (a0 to a3), 4 offset registers (o0 to o3) and 4 modulo registers (m0 TLFeBOOK

148 to m3). All these registers could be used to generate a given addressing mode and to compute AGU registers updates. The VLIW AGU operation register (AGUOP) is controlled by an AGU reconfiguration register (i0 to i3) that could be reconfigured at any time and allows the programmer to generate new addressing modes. Fig. 8-5 shows two examples of AGU computations. In the first example, register i0 contains configuration data such as the multiplexers and the PREAD adder are configured to generate address a0 + (02>>1), while at the same time registers a1, a3 and o3 are updated with new values computed through POSAD1, POSAD2 and PREADR ALUs. The POSAD1 ALU is used to generate WP1 = (a1+o3) modulo m2, while the POSAD2 ALU is used to generate WP2= m3 + 02<<2, and the result of PREADR is used to update register a0. The second example (i2) generates WP2 that uses both POSAD1 and POSAD2 ALUs connected in series. The operation (ao-02)%m0 is performed in the POSAD1 ALU, while adding 03 is performed in the POSAD2 ALU. This flexibility allows the programmer to generate very complex addressing modes that cannot be available in conventional DSP cores with addressing modes only defined in their instruction sets.

VLIW AGU Reconfigurable instruction registers WP1 WP2 WP3 I AOM i0

WP1 a0 o0 m0 SELAGUOP i1 a1 o1 m1 i2 WP2 a2 o2 m2 i3 a3 o3 m3 RP1 RP2 RP3 RP4 RP5 RP1 RP2 RP3 RP4 RP5 RP6 in = AGUOP RP7 (n=0..3) PREAD POSAD1 PREAD POSAD1 Examples of in operations:

i0: DM ADDR = a0+(o1>>1), P2A RES1RP6 RP7 RES2

WP1: a1 = (a1+o3)%m2, P2B

WP2: o3 = m3 + o2<<2 POSAD2 WP3: a0 = a0+(o1>>1), P2A P2B

i2: DM ADDR = a2+o1, AGUOP WP1: none WP2: a0 = (a0-o2)%m0+o3 WP3: a2 = a2+o1 POSAD2

DM ADDR

Figure 8-5. Addressing Modes Reconfiguration Example (MACGIC DSP).

However, the power consumption is necessarily increased due to the relatively large number of reconfiguration bits that have to be loaded in the TLFeBOOK

149 configuration registers. Similarly, the reconfigurable units are necessarily more complex that non-reconfigurable units in terms of transistor count and therefore consume more. Software issues are also difficult, as users can define new instructions or new addressing modes that are difficult to support by the development tools.

8.4 DESIGN & ARCHITECTURE EXPLORATION

The way a system behaves depends on the architecture, the way the applications are written, and how these applications are mapped onto the architecture as compactly expressed by the Y-chart [5]. Examples of architectures for low-power have already been given in other sections. On such architecture, mapping is typically done in case of reconfigurable fabrics by the behavioral synthesis tool and the place and route tools. In case of DSPs and CPUs, the mapping is typically performed by C-compilers dedicated to a particular type of DSP or CPU. An important question remains: how to specify the applications that they can take advantage of the architecture in an effective manner. A low-power architecture will typically employ different levels of parallelism like bit-level parallelism, instruction parallelism or task-level parallelism to take advantage of voltage scaling as already explained in the previous section. To successfully map a DSP application at a high level, the applications need to express task-level parallelism. This parallelism is typically not present, as the applications are written in sequential languages like C or Matlab. Therefore, mapping them is often a manual process that is very tedious and time consuming, leading to a sub optimal system. A designer would like to have tool support that converts automatically the sequential specification into a parallel format. Moreover, the tool should allow him to ‘play’ with the amount of parallelism extracted from the specification. In general, such tools are lacking in embedded system design. Some companies, like Pico and Art (ARM/Adelante) try to provide limited commercial solutions but this field is still very much subject to research. The Compaan tool suite [13] aims at providing designers the option to play with parallelism for applications that are so-called “Nested Loop Programs”, a very natural fit for DSP applications. A DSP application is specified in a subset of Matlab and is automatically converted by Compaan into a network of parallel processes. These processes can be specified in “C’ and mapped, using a conventional C compiler, onto a DSP or CPU. On the other hand, they can also be specified in VHDL and mapped using the appropriate tools onto some reconfigurable fabric or realized as a dedicated IP core [19]. Hence, “programming” the RINGS architecture is reduced to putting some TLFeBOOK

150 processes onto the CPUs and DSPs while others are mapped onto FPGAs or use dedicated IP cores. There are many ways we can find parallelism in the application and in the way we partition the processes of the CPUs, DSPs and reconfigurable resources. Being able to explore these options early on in the design phase is crucial to get efficient embedded low-power systems. To allow designers to do this exploration, Compaan is equipped with a suite of techniques [14] like Unfolding, Skewing and Merging, to allow designers to play with the level of parallelism exposed in the derived network of processes. Skewing and Unfolding increase the amount of parallelism, while Merging reduces parallelism. By performing these techniques, many different networks can be created that can be mapped in different ways onto the architecture. When applied in a systematic way, the design space can be explored and the best performing network of processes can be picked. The difference in utilization of the architecture for a particular network can be huge. By rewriting a DSP application (like Beam-forming) using the presented techniques, we are able to achieve performances on a QR algorithm (7 Antenna’s, 21 updates) ranging from 12MFlops to 472MFlops. We realized QR using commercial floating point IP cores from QinetiQ, which include pipelined 55 (Rotate) and 42 (Vectorize) stages. We achieved this performance increase without doing anything to the architecture or mapping tools, but only by playing with the way the QR application is written, effectively improving the way the pipelines of the IP cores are utilized. Using a system like Compaan, an experienced designer should be able to obtain very different performing networks in days, having the opportunity to explore different systems and picking the one that uses the least amount of power.

8.5 DOMAIN-SPECIFIC CO DESIGN ENVIRONMENTS

As discussed in the previous section, parallelism and distributed processing are key to energy efficient architectures. Because the ensemble of architecture elements (processors, busses, memories) cooperate towards a common application, the designer faces a considerable co-simulation and co- design problem. A key requirement is to have a good design model. Such a model allows building of simulation tools, compilers and code generators. We will look at a highly successful design model for programmable systems: the instruction-set architecture (ISA). Next we will consider the approach taken by the RINGS architecture. In a classic Von-Neumann architecture, the instruction-set-architecture (ISA) model maintains a single, consistent and abstracted view to the TLFeBOOK

151 operation of the system. Such a view ties four independent architecture concepts together: control, interconnect, storage, and data operations [15]. This way the ISA becomes a template for the underlying target architecture, for which compiler algorithms (scheduling etc) can be developed. Often however, the ISA is unable to offer the right target template – in terms of parallelism, storage capabilities or other. In the RINGS architecture, we do not use an ISA as an intermediate design model, but approach each of the four components that make up an ISA independently. We enumerate them below and look at the requirements they impose on co-simulation and co-design. • Data Operations: Energy efficient operation requires us to specialize each operator as much as possible. A RINGS system contains multiple processing cores. These can include hardwired or programmable (DSP or RISC) processors. We thus need to be able to combine instruction-set simulation with hardware simulation. • Storage: Energy efficient operation requires us to distribute storage. In addition to the high-level design transformations discussed in the previous section, we target to minimize storage bandwidth and use multiple distributed memories. Each processor in RINGS will work inside of a private memory space. Many operations in multimedia can be implemented with dedicated storage architectures that take only a fraction of the energy cost of a full-blown ISA. Examples are matrix transposition or scan-conversion. Such dedicated storage can be captured as a hardwired processor. • Interconnect: The energy efficient interconnect architecture discussed in section 2 requires explicit expression of interconnect operations – in contrast to an ISA where this is implicitly encoded in the instruction format. A network-on-chip can be modeled as a dedicated hardware architecture [1]. On top of the network-on-chip a suitable network protocol must be implemented, for example message-passing with the MPI standard [7]. However, also this protocol is subject to specialization and/or hard-coding. For example, a hardwired DCT coding unit attached to a DSP core through RINGS will have a fixed communication pattern. This pattern can be hard-coded in a collapsed and optimized protocol stack. • Control: Energy efficient operation requires us to split the data-flow and control-flow in a RINGS architecture and handle them independently. Fig. 8-6 clarifies this point. It shows the effect of moving an AES encryption operation gradually from high-level software (Java) implementation to dedicated hardware implementation, while at the same time maintaining the interface to the high level Java model. It can be seen that the interface overhead goes from 0.8% for a C-accelerated AES to TLFeBOOK

152 8000% for a hardware-accelerated AES! This overhead obviously is caused by all the interfaces moving data from Java to C to hardware and back. With the MPI message passing scheme, we have the freedom to route control flow and a data flow independently as messages. This way, we can eliminate or minimize this interface overhead.

Java Rijndael Interface cycles 301,034 367 Interface 892 C Rijndael cycles ac 44,063 ce le ra Co-processor tio Rijndael n cycles 11

Total Cycles 301, 034 44,430 903

Figure 8-6. Overhead of Tightly Coupled Data/Control Flow.

When we put the elements together, we conclude that the RINGS co- design environment should accommodate multiple instruction-set simulators with user-specified hardware models. All of these must be embedded in a model of an on-chip network. The timing accuracy of the simulation should be precise enough to simulate interactions such as network-on-chip communication conflicts. On the other hand, the simulation must also be fast enough to support reasonable design exploration capabilities. We have built the ARMZILLA environment to evaluate one class of RINGS architectures, namely those that can be built with one or more ARM cores, a network-on-chip, and dedicated hardware processors. Fig. 8-7 illustrates the ARMZILLA setup. There are three components: a hardware simulation kernel (GEZEL), one or more instruction-set simulators (ISS), and a configuration unit. The GEZEL kernel [4] captures hardware models with the FSMD (Finite-State-Machine with Datapath) model-of- computation. It uses a specialized language and a scripted approach to promote interactive design exploration. The cycle-true models of GEZEL can also be automatically converted to synthesizable VHDL. For the ARM ISS we use the cycle-true SimIT-ARM environment [10]. The ARM ISS uses memory-mapped channels to connect to the GEZEL hardware models. Finally, the configuration unit specifies a symbolic name for each ARM ISS, and associates each ISS with an executable. This way the memory-mapped communication channels can be set up, and the hardware GEZEL models can address each ARM memory space uniquely. TLFeBOOK

153

C C Network Hardware C On Chip Processors FDL FDL Cross Compiler EXE EXE Config EXE

Configuration Unit ARM ISS GEZEL ARM ISS ARM ISS Kernel Memory-mapped ARMZILLA Channels

VHDL

Figure 8-7. The ARMZILLA Design Environment for ARM-based RING Processors.

An example of what can be done with the ARMZILLA environment is shown in Table 8-1. This table shows cycle counts that were obtained after partitioning a JPEG encoding algorithm. The reference implementation runs on a single-ARM ISS model. In the second implementation, we separate the chrominance and luminance channels over two ARM processors. This seems a logical partition that splits the data operations roughly in two parts. But, it also creates a communication bottleneck in the on-chip network and the resulting implementation becomes slower then the O3-level optimized single-processor implementation. The third implementation shows a better partitioning. In this case, the data streams are routed out of the ARM and into dedicated hardware processors for JPEG encoder subtasks. These processors can communicate directly amongst themselves. All these simulations are cycle-accurate yet they can run efficiently. For the H.264 decoding on a dual ARM with network-on-chip for example, ARMZILLA offers a simulation speed of 176K cycles per second. The simulation speed varies with the complexity of the hardware model used. A single, stand-alone SimIT-ARM simulator runs at 1 MHz cycle-true on a 3GHz . TLFeBOOK

154

Table 8-1. Multiprocessor JPEG Encoding Performance Partition Cycle count 64x64 block One single ARM 1.223 M Dual ARM using split chrominance/ 1.336 M luminance channels Single ARM with color conversion, 313K transform coding, huffman coding as stand- alone hardware processors

8.6 CONCLUSIONS

In this chapter, we presented architecture design and design exploration for low power systems-on-chip. Low power is obtained by tuning all components of the architecture (datapaths, control, memory and interconnect) to the application. This can occur at different levels of abstraction. The design of this type of SOC requires support by design models and methods. The design environments Compaan and Gezel /Armzilla are illustration of supporting tools for this design space exploration.

References

[1] D. Ching, P. Schaumont, I. Verbauwhede, “Integrated Modeling and Generation of a Reconfigurable Network-On-Chip,” Proc. 11th Reconfigurable Architectures Workshop, RAW 2004, Santa Fe, NM, April 2004. [2] W. Dally, B. Towles, “Route Packets, not wires: on-chip interconnection networks,” Proc. DAC 2001. [3] R. David et al., “Low-Power Reconfigurable Processors”, Chapter 20 in “Low Power E Electronics Design,” edited by C. Piguet, CRC Press, 2004. [4] GEZEL kernel, http://www.ee.ucla.edu/~schaum/gezel [5] B. Kienhuis, et al.,``A Methodology to Design Programmable Embedded Systems'', LNCS, Vol 2268, Nov. 2001. [6] J. Kim, et al., “A 2-Gb/s/pin Source Synchronous CDMA Bus Interface with simultaneous Multi-Chip Access and Reconfigurable I/O capability,” CICC, Sept 2003. [7] MPICH – A portable implementation of MPI, http://www.unix.mcs.anl.gov/mpi/mpich/ [8] P. Mosch et al., “A 720 mW 50 MOPS 1V. DSP for a Hearing Aid Chip Set,” Proc. ISSCC, pp. 238-239, Feb. 2000. [9] Õzgün Paker et al., “A heterogeneous multi-core platform for low power signal processing in systems-on-chip,” ESSCIRC 2002. [10] W. Qin, S. Malik, “Flexible and Formal Modeling of Microprocessors with Application to Retargetable Simulation,” Proceedings of DATE 2003, Mar, 2003, pp.556-561. [11] F. Rampogna et al., “Magic, a Low-Power, re-configurable DSP”, Chapter 21 in “Low Power Electronics Design”, ed. C. Piguet, CRC Press, 2004. [12] P. Schaumont, I. Verbauwhede, M. Sarrafzadeh, K. Keutzer, “A quick safari through the reconfiguration jungle,” Proceedings DAC 2001, pg. 172-177, June 2001. TLFeBOOK

155

[13] T. Stefanov, C. Zissulescu, A. Turjan, B. Kienhuis, E. Deprettere,``System Design using Kahn Process Networks: The Compaan/Laura Approach'', DATE2004, Feb 2004, Paris, France. [14] T. Stefanov, B. Kienhuis, E. Deprettere, “Algorithmic Transformation Techniques for Efficient Exploration of Alternative Application Instances'', Proc. CODES'2002, Colorado, May 2002. [15] I. Verbauwhede, J. M. Rabaey. “Synthesis of Real-Time Systems: Solutions and challenges” Journal of VLSI Signal Processing, Vol. 9, No. 1/2, Jan. 1995, pp. 67-88. [16] I. Verbauwhede, M.C. F. Chang, “Reconfigurable Interconnect for next generation systems”, Proc. SLIP, pp. 71-74, April 2002. [17] Xilinx: Virtex-II-Pro Platform FPGAs: Introduction and Overview and Functional Description, Aug. 2003, Oct. 2003, www.xilinx.com/bvdocs/publications/ds083-1.pdf, ds083-2.pdf. [18] H. Zhang, et al., “A 1V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications,” IEEE Journal on Solid State Circuits, November 2000. [19] C. Zissulescu, et al., ``Laura: Leiden Architecture Research and Exploration Tool'', Proc. FPL 2003. TLFeBOOK

156

Chapter 9

SOURCE-LEVEL MODELS FOR SOFTWARE POWER OPTIMIZATION

Carlo Brandolese, William Fornaciari and Fabio Salice Politecnico di Milano

Abstract This chapter presents a methodology and a set of models supporting energy-driven source-to-source transformations. The most promising code transformation tech- niques have been identified and studied leading to accurate analytical and/or statistical models. Experimental results obtained for some common embedded- system processors over a set of typical benchmarks are discussed, showing the value of the proposed approach as a support tool for embedded software design.

Keywords: Software optimization, Power optimization, Source-level modeling

9.1 INTRODUCTION In a growing number of complex heterogeneous embedded systems the rele- vance of the software component is rapidly increasing. Issues such as develop- ment time, flexibility and reusability are, in fact, better addressed by software based solutions. Another trend that is significantly pushing designers to move as much functionality as possible toward software is the increased interest in platform-based designs. In such systems much of the architecture is fixed and can only be configured to match the design constraints. The greatest part of the application-specific functionality is thus naturally shifted from hardware dedicated components to software programs. In such a scenario it is clear that the importance of software is steadily increasing and poses new problems to designers. Though performance, in the sense of computational efficiency, is still the foremost requirement for many embedded systems, power consump- tion is gaining more and more attention. Optimization of the code is thus one of the key points and is currently addressed almost only by means of compilation techniques. It is still not uncommon for designers to manually code critical sections of the application directly in assembly. The recent technical litera- TLFeBOOK

157 ture proposes a different approach, based on source-to-source transformations aimed at improving code quality either directly or by enabling better compiler optimizations. Source code transformations are extremely complex to auto- mate since they require a thorough semantic analysis of the code fragments to be optimized. This chapter proposes a sound and flexible methodology for the analysis of the effect of source-to-source transformations mostly aimed at allowing rapid and accurate design space exploration. The proposed approach is based on a wide set of models studied to decouple the processor-independent analysis from all technology specific aspects.

9.2 TRANSFORMATIONS OVERVIEW Source-to-source transformation presented in literature, can be grouped in to four main areas according to the code structures they operate on: loops, data structures, procedures, control structures and operators. It is worth noting that not all the transformations are interesting when operating at source-level since some of them can as well be performed at RT or assembly-level and are thus performed by modern compilers. The most promising transformations, either found in literature [1, 2] or studied in the present work, are summarized in the following. Particular attention must be devoted to loop transformations [3–6] since most of the execution time of a program is spent in loops.

Loop unrolling replicates the body of a loop a given number of times U (the unrolling factor), and modifies the iteration step from 1 to U. The trans- formation impacts on energy in two ways: on one hand, it reduces loop overhead by performing less compare and branch instructions; on the other hand, it allows the compiler for better optimization and register usage in the larger loop body.

Loop distribution breaks a single loop into multiple loops with the same it- eration range but each enclosing only a subset of the statements in the original loop. Distribution is used to create sub-loops with fewer depen- dencies, improve instruction cache and instruction TLB locality due to shorter loop bodies, reduce memory requirements by iterating over fewer arrays and improve register usage by decreasing register pressure.

Loop fusion performs the opposite action of distribution, i.e. merging, by re- ducing loop overhead, increasing instruction parallelism, improving reg- ister, data cache, TLB or page locality. It also improves the load balance of parallel loops.

Loop interchange exchanges the position of two loops in a loop nest, gener- ally moving one of the outer loops to the innermost position. It is one of the most valuable transformations and can improve performance in TLFeBOOK

158 many ways: it enables and improves vectorization, increases data access locality and increases the number of loop-invariant expressions in the inner loop. Loop tiling improves memory locality, primarily the at cache level, by access- ing matrices in N ×M sized tiles rather than completely. It also improves processor, register, TLB, and page locality. Software pipelining breaks the operations of a single loop iteration into S stages, and arranges the code in such a way that stage 1 is executed on the instructions originally belonging to iteration i, stage 2 on those of iteration i − 1, etc. Startup code must be generated before the loop to initialize the pipeline for the first S − 1 iterations and cleanup code must be generated after the loop to drain the pipeline for the last S−1 iterations. Loop unswitch is applied when a loop contains a branch with a loop-invariant test condition. The loop is then replicated inside each branch of the con- ditional, saving the overhead of conditional branching inside the loop, reducing the code size of the loop body, and possibly enabling the paral- lelization of one or both branches. The second class collects a number of data-structure and memory access transformations [7, 6]. Local to global array promotion allows compilers to use simpler addressing modes since global arrays address does not depend on the stack pointer. Scratch-pad array introduction has the goal of storing the most frequently accessed array elements in a smaller array (the scratch-pad) to improve spatial locality. Multiple indirection elimination identifies common chains of indirections and stores the address into a temporary variable. The third group gathers those transformations [7] impacting on procedures and functions. Function inlining replaces the most frequently invoked function with the func- tion body. Inline expansion increases the spatial locality and decreases the number of function calls. This transformation increases the number of unique references, which may result in more misses. However, a de- crease in the miss rate may also occur, since, without inlining, the callee code might replace the caller code in the instruction cache. Soft inlining is an intermediate solution between function calling and inlining. The transformation replaces calls and returns with jumps. This reduces the code size w.r.t. inlining and eliminates context switching overheads. TLFeBOOK

159 Code linking directives can be used to suitably reorder the objects of different functions to match as more as possible the dynamic call graph. This potentially leads to a reduction in instruction misses. Most of the transformation in the last group are usually performed by com- pilers. Nevertheless, some of them can still be conveniently considered when operating at source-level [7, 8]. Conditional sub-expression reordering exploits shortcut evaluation of con- ditions usually performed by compilers. The transformation operates by reordering the sub-expressions according to their probability of being true (for OR conditions) or false (for AND conditions). This reduces the number of instructions executed. Special cases pre-evaluation allows avoiding a function call (usually a math- ematic library function) when the argument has a special value for which the result is known. This is done by defining suitable macros testing for the special cases and leads to a reduction of actual calls. Special cases optimization replaces calls to generic library or user-defined functions with optimized versions, suitable for common special cases. As an example, power raising on integers can be coded more efficiently than it can be for real numbers.

9.3 METHODOLOGY Transformations applied to source code might lead to very different results depending on a number of factors: the specific structure of the code, the target architecture, the parameters of the transformations etc. Furthermore, it is not unusual that a transformation applied on the source code as it is, leads to poor or no energy reduction, while, when applied to a pre-transformed code its effectiveness is greatly increased. Thus sequences of transformations should be considered, rather than single transformations. For this reason it is crucial to explore different transformations and sequences of transformations in terms of their energy reduction efficiency. The exploration strategy should allow to easily modify the parameters of the transformation and of the target technology and thus leading to a quick estimate of the expected benefits.

9.3.1 Conceptual Flow Figure 9.1 shows the conceptual scheme of the estimation flow. The source code is processed and its relevant characteristics are extracted by means of a lexical and syntactical analysis leading to the set of code parameters. Typical parameters are code size, loop body size, number of paths, number of loop iterations, etc. TLFeBOOK

160

Source Code ↓ Transformation → Code Analysis parameters ↓ ∆I, ∆Minst, ∆Mdata ↓ Technology → Energy Estimation parameters ↓ ∆E

Figure 9.1. Phases of the methodology flow

The designer then chooses the transformations parameters such as unroll factor, tiling size etc. and, finally, selects the target technology from a set of libraries. Such libraries are collections of technology parameters specifying architectural figures such as cache sizes, bus width etc. and electrical figures such as power supply voltages, average core currents, bus and memory capaci- tances etc. Based on all this data, the estimation models first provide the three dimensionless figures ∆I, ∆Minst and ∆Mdata expressing the variations of number of instructions executed, of number of instruction cache misses and of number of data cache misses, respectively. These figures, though still rather abstract, already provide the designer with an indication of the potential bene- fits of a given transformation. To account for the target technology as well, the variations are fed to a set of models, depending on the technology parameters, leading to an estimate of the energy reduction ∆E deriving from the application of the considered transformation.

9.3.2 Technology Models Experimental results have shown that the energy consumption of an embed- ded system based on a processor executing some programs can be approximated by considering three major contributions: the processor core and its on-chip caches, the system bus and the main memory. All these components can be modeled at different levels of accuracy by means of equations that involve two sets of parameters: those strictly related to the specific technology and those summarizing the properties and the behavior of the code being executed. In particular, as outlined above in the description of the conceptual flow, the en- ergy estimates can be based on three execution parameters only: the number of assembly instructions executed and the number of instruction and data cache misses. Though simple, the adopted models provide satisfactory results, es- pecially when considering energy variations rather than absolute values. The technology parameters considered and used in the models adopted for the CPU, the cache, the bus and the main memory are summarized in Table 9.1. TLFeBOOK

161

Table 9.1. Technology parameters

Symbol Meaning Symbol Meaning

Tck CPU clock period B Cache block size CPI Average CPI1 S Cache size P cpu Average CPU power Edec Memory decode energy Ctot Total capacitance on the bus Erw Memory read/write energy Vsw Bus switching voltage Eref Memory refresh energy Asw Average bus switching activity Vm Memory supply voltage W Bus width Iref Average memory refresh current

The form of the equations, referred to relative energy variations, are reported in the following using the symbols introduced. The processor energy variation is modeled as: ∆Ecpu = TckP cpuCPI∆I (3.1)

The contribution of system bus to energy variation ∆Ebus is: ∆ = 1 2 (∆ +∆ +∆ ) Ebus 2 CtotVsw Nbus,addr Nbus,data Ninst (3.2) where:

∆Nbus,addr = Asw,addrWaddr(∆Mdata +∆Minst) (3.3)

∆Nbus,data = Asw,dataWdataBdata∆Mdata (3.4)

∆Nbus,inst = Asw,instWdataBinst∆Minst (3.5)

Finally, the adopted memory model expresses the energy variation ∆Em as:

∆Em =∆Em,data +∆Em,inst +∆Em,ref (3.6) where:

∆Em,data =(Edec + ErwBdata)∆Mdata (3.7)

∆Em,inst =(Edec + ErwBinst)∆Minst (3.8) ∆Em,ref = TckVmIref CPI∆I (3.9) 9.4 CASE STUDIES In this section, two case studies are reported: Loop unrolling and Loop fusion. For each transformation, the source code parameters and the model equations are reported and discussed.

9.4.1 Loop Unrolling Loop unrolling is a parametric transformation whose results in terms of energy reduction are influenced by the unrolling factor U, i.e. the number of TLFeBOOK

162 times the loop body is replicated to build the modified loop. The parameter U, thus, completely defines the transformation. The effects of loop unrolling clearly depend also on the characteristics of the source code being transformed. Such properties are captured by the set of source code parameters reported in Table 9.2.

Table 9.2. Source code parameters for loop unrolling

Symbol Meaning LI Number of loop instructions LS Size of loop instructions (bytes) LBI Number of loop-body instructions LBS Size of loop-body instructions (bytes) N Loop iterations

The number of instructions of the original loop is:

Io = N · LI (4.1)

The transformed loop executes Nt = N/U times and:

LIt = LI +(U − 1)LBI (4.2) instructions per iteration. Therefore, the total number of instructions executed by the transformed loop is: N I = N · LI = · [LI +(U − 1)LBI] (4.3) t t t U The instructions gain obtained with unrolling is thus: N ∆I = · [LI +(U − 1)LBI] − I (4.4) U o The transformation has also effects on the number of instruction cache misses due to the increased dimension of the loop body. A more accurate analysis leads to the results—summarized in the following—that show a non-linear dependence of the number of misses on the relative values of the loop size 2 LS and the instruction cache size Sinst . Three significant cases have been identified:

LS ≤ Sinst In this case there are no capacity misses since the entire loop code can

2The loop size and number of instructions are linearly related assuming a fixed instruction size. TLFeBOOK

163 be loaded into the cache. Hence, there are only cold misses, during the first iteration. The number of instruction cache misses is thus: LS Minst = (4.5) Binst

Sinst

LS ≥ 2Sinst The number of misses in every iteration is equal to the number of cold misses, i.e.: LS Minst = N (4.7) Binst For all these cases, the relevant figure is the variation of the number of instruction cache misses ∆Minst = IMt − IMo. Such difference depends on the variation of number of instructions due to the transformation:

∆LS = LSt − LSo =(U − 1)LBS (4.8) and must be calculated for all the 32 =9cases. It is worth noting that since the transformed code will always be larger than the original one, only 6 out of the 9 cases are significant. For the sake of conciseness, only the two boundary cases are described in the following.

(LSo ≤ ICS) ∧ (LSt ≤ ICS) In this case both the original and the transformed code completely fit into the cache and thus only cold misses take place. The variation, recalling Equation (4.5), is: LSo LSt (U − 1)LBS ∆Minst = − ≈ (4.9) Binst Binst Binst

(LSo ≥ 2ICS) ∧ (LSt ≥ 2ICS) In this other limiting case, both codes are larger than the double of the cache size and thus each instruction fetch causes a miss. Recalling Equa- tion (4.7), the instruction miss variation is: LS +(U − 1)LBS LS ∆M = N − N (4.10) inst t IBS o IBS TLFeBOOK

164 In a similar manner and referring to Equations (4.5)–(4.7), the variations for the other four cases can be calculated. The last effect to be considered is the variation of data cache misses. Since the transformation does not modify the data access pattern of the code, the term ∆Mdata can be assumed to be 0, at least at a first approximation. A first validation can be performed at this level comparing the dimensionless estimated figures ∆I and ∆Minst with those derived from simulation. Figure 9.2 shows the results for the variation of number of instruction executed. It is worth noting that ∆I does not depend on the cache size but only on the structure of the code and the effectiveness of the optimizations that the compiler can exploit on the modified loop.

Loop Unrolling -300 Actual Estimated -400

-500

Instruction gain -600

-700 0 102030405060708090100 Unroll factor (U)

Figure 9.2. Loop unrolling: ∆I

As far as the variation of instruction cache misses, different scenarios have been considered by varying the cache size from 256 to 4096 bytes. Table 9.3 summarizes the results obtained by averaging the estimation error over the interval U = [2; 100] and Figure 9.3 shows the two boundary cases.

Table 9.3. Loop unrolling: ∆Minst average error and standard deviation

Sinst(bytes) % σ% 256 -1.881 8.026 512 -2.557 7.101 1024 -2.531 6.910 2048 -2.750 9.252 4096 -1.691 5.065

The two contributions ∆I and ∆Minst (remembering that ∆Mdata =0) can now be fed to the technology models to derive the overall energy saving. Table 9.4 reports the average error and the corresponding standard deviation in terms of energy gain for the five cache-size scenarios just considered. These results show that the model tends to underestimate the potential gain deriving from loop unrolling. A possible reason is that unrolling a loop leads to a TLFeBOOK

165

Loop Unrolling 400 Cache size = 256 Bytes

300

200

100 Actual Instruction miss gain Estimated 0 0 102030405060708090100 400 Cache size = 4 KBytes

300

200

100 Actual Instruction miss gain Estimated 0 0 102030405060708090100 Unrolling factor (U)

Figure 9.3. Loop unrolling: ∆Minst

Table 9.4. Loop unrolling: ∆E average error and standard deviation

Sinst(bytes) % σ% 256 -1.754 9.144 512 -4.552 7.322 1024 -7.663 6.966 2048 -6.203 5.777 4096 -4.409 3.011

longer loop body, i.e. a larger basic block where the compiler can better perform optimizations. Despite the light biasing of the model, the overall average error is, in absolute value, approximately 4.9% and this can be considered more than satisfactory when operating at source code level.

9.4.2 Loop Fusion This transformation has the purpose of combining into a new single loop the bodies of different subsequent loops. Some constraint must be satisfied, in particular the loops to be merged need to have the same iteration range and the statements in their bodies must be independent. The only transformation parameter characterizing loop fusion is the number NF of loops to be merged. The source code parameters that influence the effect of this transformation are TLFeBOOK

166 all those considered for loop unrolling (see Table 9.2) plus the number and size of control instructions, defined as:

LCI = LI − LBI (4.11) LCS = LS − LBS (4.12)

In the following the subscript k ∈ [1,NF] is used to indicate a specific loop among those to be fused. An additional useful parameter is the average number of control instructions over all the considered loops:

1 NF LCI = LCI (4.13) NF k k=1 Using the symbols just introduced, the number of instructions in the original and transformed codes are:

NF Io = N (LBIk + LCIk) (4.14) k=1 NF It = N(LCI + LBIk) (4.15) k=1 The variation ∆I is thus given by:

NF NF ∆I = N(LCI + LBIk − (LBIk + LCIk)= =1 =1 k k (4.16) NF = N(LCI − LCIk) k=1

Assuming that LCI = LCIk ∀k yields:

NF NF LCIk = LCI = NF · LCI (4.17) k=1 k=1 and thus Equation (4.16) can be rewritten as:

NF ∆I = N(LCI − LCIk)=N(1 − NF)LCI (4.18) k=1 TLFeBOOK

167 To study the effect of loop fusion with respect to instruction misses, the same cases considered for loop unrolling and expressed by Equations (4.5)– (4.7) turn out to be applicable. Nevertheless, when considering the original code composed of NF loops, the number of instruction misses must be estimated for each single loop according to the three mentioned equations and then summed over all loops. On the other hand, the estimates for the transformed code can be obtained by simply substituting LS with the overall transformed code size LSt, defined as:

NF LSt = LCS + LBSk (4.19) k=1 According to Equations (4.5)–(4.7) and referring to the original code sizes LSo,k and the transformed code size LSt, the number of instruction misses of the original loops IMo,k and the transformed one IMt can be derived. The resulting overall variation is thus:

NF ∆Minst = IMt − IMo,k (4.20) k=1 It is worth noting that the number of possible cases derived from the limiting conditions on the cache size is, in general, 3NF+1. Similar considerations apply to the estimation of data cache misses. Since in most cases the different loops operate on different arrays, data misses tend to be increased, the best- case condition being that all data fit into the cache in which case the number of misses will approximately be invariant. A validation procedure similar to that used for loop unrolling has been applied for loop fusion also, considering the simplest and most common case where NF =2. To analyze the behavior of the transformation, loops with different body sizes have been considered and the results for instruction misses are shown in Figure 9.5, where the x axis is an index related to the loop body size ratio. For the same combinations of loop body sizes and for an instruction cache size varying from 256 to 4096 bytes, the gain in terms of instruction misses have also been estimated and compared with actual results, leading to the data collected in Table 9.5 and the graphs of Figure 9.5 relative to the two limiting cases. Again the accuracy obtained is more than satisfactory since the average abso- lute error is approximately 2.1% with very low standard deviation. Combining dimensionless figures with the energy models of the different component of the considered system led to the energy estimates. Such estimates show a very limited error, as reported in Table 9.6, and are not biased. It is though worth noting that the reported results refer to loops manipulating very small arrays for which the hypothesis of being fully contained in the data cache may be assumed to hold. This translates into the models by assuming ∆Minst =0. TLFeBOOK

168

Loop Fusion -600 Actual -700 Estimated -800 -900 -1000 Instruction gain -1100 -1200 0 102030405060 Loop size index

Figure 9.4. Loop fusion: ∆I

Table 9.5. Loop fusion: ∆Minst

Sinst(bytes) % σ% 256 +2.423 2.701 512 +3.004 2.804 1024 -3.150 4.253 2048 +0.153 1.672 4096 -0.258 1.419

Loop Fusion 1500 Cache size = 64 Bytes

1000 Actual Estimated

500

0 Instruction miss gain

-500 0 102030405060 0 Cache size = 4 KBytes -1

-2

-3

-4 Actual Instruction miss gain Estimated -5 0 102030405060 Loop size index

Figure 9.5. Loop fusion: ∆Minst

More complex cases show higher errors but preliminary experimental results suggest that a 10–15% error is a reasonable and conservative upper bound. TLFeBOOK

169

Table 9.6. Loop fusion: ∆E

Sinst(bytes) % σ% 256 +1.945 3.882 512 +0.177 3.469 1024 -0.194 3.916 2048 +1.592 2.425 4096 +0.168 0.017

9.5 EXPERIMENTAL RESULTS

The estimates of ∆I, ∆Minst and ∆Mdata, combined with the energy models (see Section 9.3.2) adopted to account for the technology-dependent parameters, lead to a new set of results showing the accuracy of the complete methodology in terms of energy reduction (∆E) estimation. The models for 5 transformations have been tested on a set of SPEC95 benchmarks in order to quantify the energy gain estimation error. The actual energy gain has been obtained by simulating both the original and the transformed code and then compared with the estimated gain derived from the models. Experiments have been performed on four architectures based on different processors and operating systems using third-party timing and/or power profiling tools (see Table 9.7).

Table 9.7. Operating environments for validation

Processor Operating system Simulation engine Intel strongARM Linux RedHat 9.0 SimpleScalar 3.0 / SimPAnalyzer IBM PowerPC 405 Linux RedHat 9.0 SimpleScalar 3.0 Sun microSPARC II EP Solaris 8 SpixTools MIPS Tech. MIPS-32 Linux RedHat 9.0 SimpleScalar 3.0

Each benchmark has been analyzed varying both the instruction cache size (Sinst) and the input data and all compatible transformations have been applied in a proper sequence using the predicted optimal values for their parameters (unroll factor, tile size, etc.). Table 9.8 collects the relative error between the estimated gain ∆Eest and the actual value ∆Eact derived from simulation. The results confirm that the models are reliable since they can correctly predict both energy reductions and undesirable energy increases. In conclusion, the average estimation error has shown to be around than 3%. TLFeBOOK

170

Table 9.8. Energy gain estimation relative errors

FIB FIR WAVE-1 WAVE-2 IIR Sinst % σ% % σ% % σ% % σ% % σ% 256 +4.16 3.90 n/a n/a -1.97 2.81 +4.29 3.63 -1.63 1.20 512 +7.18 4.02 -3.67 4.48 -1.83 2.67 +4.63 3.52 -1.82 1.15 1024 +3.31 1.49 -2.11 4.95 -2.87 3.51 +4.81 0.79 -3.93 1.51 2048 -1.42 2.15 +1.03 7.68 -2.37 3.71 +4.20 0.57 -0.53 1.59 4096 -2.08 1.91 +11.25 7.57 -1.86 3.71 +3.74 0.20 +0.03 16.00 Average 3.63 2.69 4.51 6.17 2.18 3.28 4.33 1.74 1.58 4.29

9.6 CONCLUSIONS The presented work has addressed the problem of the fast estimation of the effects induced by a set of specific source code transformations by using a structured methodological approach based on technology-independent models. In particular, the presented analysis flow, by providing an appropriated set of both technological and transformation parameters, allows the designer to an a priori evaluation of the impact of a specific transformation and/or the effect of a sequence of interdependent transformations. Two specific transformations have been accurately described: loop unrolling and loop fusion. As far as loop unrolling is concerned, it has been shown that the proposed model can be con- sidered more than satisfactory since the average error between the estimated gain and the simulated gain is, approximately, 4,9% with a low standard devi- ation. Concerning loop fusion, the model has produced estimates—for a wide set of technological options—displaying an average absolute error of 2,1% with an high level of reliability. Both the methodology and the models has been val- idated on a set of benchmarks showing an overall average error of the estimated energy gain around 3%. This result is more than satisfactory and confirms that the models of the different transformation are sufficiently accurate and the methodology, though subject to further improvements, is promising. TLFeBOOK

171 References

[1] L. Benini and G. De Micheli. System-level power optimization: Tech- niques and tools. Transactions on Design Automation of Electronic Sys- tems, 5:115–192, 2000. [2] F. Catthoor, H. De Man, and C. Hulkarni. Code transformations for low power caching in embedded multimedia processors. Proc. of IPPS/SPDP, pages 292–297, 1998. [3] D.F. Bacon, S.L. Graham, and O.J. Sharp. Compiler transformations for high performance computing. Technical Report N. UCB/CSD-93-781, Uni- versity of California at Berkeley, 1993. [4] M.S. Lam. Software pipelining: An effective scheduling technique for vliw machines. SIGPLAN Conference on Programming Language Design and Implementation, pages 318–328, 1988. [5] M.S. Lam, E.E. Rothberg, and M.E. Wolfe. The cache performance and optimization of blocked algorithms. Conference on Architectural Support for Programming Languages an Operating Systems, pages 63–74, 1991. [6] M.J. Wolfe. More iteration space tiling. ACM Proceedings of Supercom- puting, pages 655–664, 1989. [7] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto. The impact of source code transformations on software power and energy consumption. Journal of Circuits, Systems and Computers, 11(5):477–502, 2002. [8] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto. Library functions timing characterization for source-level analysis. Conference on Design Automation and Testing in Europe, pages 1132–1133, March 2003. TLFeBOOK

172

Chapter 10 TRANSMITTANCE SCALING FOR REDUCING POWER DISSIPATION OF A BACKLIT TFT-LCD

Wei-Chung Cheng and Massoud Pedram University of Southern California

Abstract This chapter presents transmittance scaling; a technique aimed at conserving power in a transmissive TFT-LCD with a cold cathode fluorescent lamp (CCFL) backlight by reducing the backlight illumination while compensating for the luminance loss. This goal is accomplished by adjusting the transmittance function of the TFT-LCD panel while meeting an upper bound on a contrast distortion metric. Experimental results show that an average of 3.7X power saving can be achieved for still images with a mere 10% contrast distortion.

Keywords: CCFL; transmissive LCD; TFT-LCD; backlight luminance dimming; transmittance scaling; concurrent brightness and contrast scaling; power efficiency; low power design.

10.1 INTRODUCTION TFT-LCD is the most popular flat-panel display used in today's consumer electronics and computer systems. TFT stands for "Thin Film Transistor" and describes the control elements that actively control the individual pixels. For this reason, one speaks of so-called "active matrix TFT's". LCD means "Liquid Crystal Display" and stands for monitors that are based on liquid crystals. To obtain a high image quality and low power dissipation in a TFT- LCD, low off-current and high on-current are necessary. Previous studies on battery-powered electronics point out that the display subsystem dominates the energy consumption of the whole system. In the SmartBadge system, for instance, the display consumes 29%, 29%, and 50% of the total power in the active, idle, and standby modes, respectively [1]. Direct-view LCDs can largely be categorized into reflective and transmissive displays which utilize ambient light and light from an artificial light source TLFeBOOK

173 (e.g., fluorescent backlight tube) respectively. In a transmissive TFT-LCD monitor, the backlight contributes more than 50% of the display subsystem when using a cold cathode fluorescent lamp (CCFL) [2]. To reduce the backlight power consumption, Choi et al. proposed a technique called backlight luminance dimming. This technique dims the backlight and compensates for the luminance loss by adjusting the grayscale of the image to increase its brightness or contrast. The grayscale of the image is adjusted by multiplying the pixel values by a scaling factor. In this chapter, we describe the transmittance scaling technique, which compensates for the luminance loss by adjusting the transmittance function of the TFT-LCD panel. More precisely, transmittance scaling means “scaling the transmittance function of the TFT-LCD panel.” This is a general technique that can achieve concurrent brightness and contrast scaling of the whole image to compensate for the effects of the backlight dimming. In the following sections, we explain how CCFL works and show how to model the non-linearity between its backlight illumination and power consumption. Next, we propose a contrast distortion metric to quantify the image quality loss after transmittance scaling. Finally, we formulate and optimally solve the optimal transmittance scaling problem subject to a constraint on the contrast distortion.

10.2 PRELIMINARIES A transmissive LCD uses a dedicated backlight. A reflective LCD uses the ambient light or/and a dedicated frontlight. A transflective LCD uses both the ambient light and backlight. The frontlight and backlight use the same light source. The difference between the two lighting schemes is in the light path from the light source through the LCD panel to the observer. A back-lit or front-lit LCD offers superior contrast ratio compared to the one that is lit by the ambient light. A backlight can be direct or indirect type. A direct backlight is positioned directly beneath the LCD panel. An indirect (or side- lit) backlight is positioned at the side of the LCD panel and requires a carefully designed light-guide and a diffuser to illuminate the LCD panel evenly. Most TFT-LCD monitors use CCFL for backlighting due to its unrivaled luminance density – emitting the most light within the minimum form factor. The CCFL can be designed to generate an arbitrary color, which is critical for reproducing pure white in the backlighting applications. Technology for CCFL manufacturing is mature; therefore, its production cost is rather low. However, compared to power consumption of the TFT-LCD panel, the power consumption of the CCFL backlight is quite high. TLFeBOOK

174 10.2.1 Radiometry and Photometry Terminology

Radiometry refers to the science of measuring light in any portion of the electromagnetic spectrum [3]. In practice, radiometry is usually limited to the measurement of infrared, visible, and ultraviolet light using optical instruments. Light is radiant energy. Electromagnetic radiation transports energy through space. Radiant energy (denoted as Q) is measured in joules. A broadband source such as the Sun emits electromagnetic radiation throughout most of the electromagnetic spectrum, from radio waves to gamma rays. However, most of its radiant energy is concentrated within the visible portion of the spectrum. A single-wavelength laser, on the other hand, is a monochromatic source; all of its radiant energy is emitted at one specific wavelength. Energy per unit time is power, which we measure in joules per second, or watts. A laser beam, for example, has so many watts of radiant power. Light “flows” through space, and so radiant power is more commonly referred to as the “time rate of flow of radiant energy” or radiant flux. It is defined as: ĭ =dQ/dt where Q is radiant energy and t is time. Radiant flux is measured in watts. In terms of a photographic light meter measuring visible light, the instantaneous magnitude of the electric current is directly proportional to the radiant flux. The total amount of current measured over a period of time is directly proportional to the radiant energy absorbed by the light meter during that time. Radiant flux density is the radiant flux per unit area at a point on a surface. There are two possible conditions. The flux can be arriving at the surface, in which case the radiant flux density is referred to as irradiance. The flux can also be leaving the surface due to emission and/or reflection. The radiant flux density is then referred to as radiant exitance. Radiant flux density is measured in watts per square meter. The radiant flux density at a point on a surface due to a single ray of light arriving (or leaving) at a solid angle ș to the surface normal is dĭ/(dA·cosș). The radiance at that point for the same angle is then d2ĭ/(dA·dȦ·cosș), or radiant flux density per unit solid angle. Radiance is measured in watts per square meter per steradian. We can imagine an infinitesimally small point source of light that emits radiant flux in every direction. The amount of radiant flux emitted in a given direction can be represented by a ray of light contained in an elemental cone. This gives us the definition of radiant intensity: I =dĭ/dȦ. Radiant intensity is measured in watts per steradian. Photometry is the science of measuring visible light in units that are weighted according to the sensitivity of the human eye [3]. It is a quantitative science based on a statistical model of the human visual response to light -- that is, our perception of light -- under carefully TLFeBOOK

175 controlled conditions. The human visual system is a complex and highly nonlinear detector of electromagnetic radiation with wavelengths ranging from 380 to 770 nanometers (nm). The sensitivity of the human eye to light varies with wavelength. A light source with a radiance of one watt/m2- steradian of green light (540nm wavelength), for example, appears much brighter than the same source with a radiance of one watt/m2-steradian of red (650nm wavelength) or blue light (450nm wavelength). In photometry, we attempt to measure the subjective impression produced by stimulating the human eye-brain visual system with radiant energy. This task is complicated immensely by the eye’s nonlinear response to light. It varies not only with wavelength but also with the amount of radiant flux, whether the light is constant or flickering, the adaptation of the iris and retina, the spatial complexity of the scene being perceived, the psychological and physiological state of the observer, and a host of other variables [4]. According to studies done by the Commission Internationale d’Eclairage (CIE), the photopic luminous efficiency of the human visual system as a function of wavelength looks like a near-normal distribution as depicted in Figure 10-1 (cf. [5].) The CIE photometric curve thus provides a weighting function that can be used to convert radiometric measurements into photometric measurements. Today the international standard for a light source is a point source that has a luminous intensity of one candela (the Latin word for “candle”). It emits monochromatic radiation with a frequency of 540*1012 Hertz (or approximately 555nm, corresponding with the wavelength of maximum photopic luminous efficiency) and has a radiant intensity (in the direction of measurement) of 1/683 watts per steradian.

Figure 10-1: Photopic luminosity function.

Together with the CIE photometric curve, candela provides the weighting factor needed to convert between radiometric and photometric measurements. Consider, for example, a monochromatic point source with a TLFeBOOK

176 wavelength of 510nm and a radiant intensity of 1/683 watts per steradian. The photopic luminous efficiency at 510nm is 0.503. The source therefore has a luminous intensity of 0.503 candela. Luminous flux is the photometrically-weighted radiant flux (power). Its unit of measurement is the lumen, defined as 1/683 watts of radiant power at a frequency of 540*1012 Hertz. As with luminous intensity, the luminous flux of light with other wavelengths can be calculated using the CIE photometric curve. Luminous energy is photometrically-weighted radiant energy. It is measured in lumen seconds. Luminous flux density is photometrically-weighted radiant flux density. Luminous flux density is measured in lumens per square meter. Illuminance is the photometric equivalent of irradiance, whereas luminous exitance is the photometric equivalent of radiant existence. Illuminance can be used to characterize the luminous flux emitted from a surface. Most photographic light meters measure the illuminance. Luminance is photometrically-weighted radiance. In terms of visual perception, we perceive luminance. It is an approximate measure of how “bright” a surface appears when we view it from a given direction. Luminance is measured in lumens per square meter per steradian. The maximum brightness of a CRT or LCD monitor is described by luminance in its specification. Luminous intensity is photometrically-weighted radiant intensity. It is measured in lumens per steradian (i.e., candelas). Luminous intensity can be used to characterize the optical power emitted from a spot light source, such as a light bulb. There is much more that we have not covered here, such as reflectance, transmittance, absorption, scattering, diffraction, and polarization. We have also ignored the interaction of the human visual system with light, including scoptic and mesopic luminous efficiency, temporal effects such as flicker, and most important, color perception. The study of light and how we perceive it fills volumes of research papers and textbooks.

10.2.2 Cold Cathode Fluorescent Lamp

A transmissive LCD uses a dedicated backlight. A reflective LCD uses the ambient light or/and a dedicated frontlight. A transflective LCD uses both the ambient light and backlight. The frontlight and backlight use the same light source. The difference between the two lighting schemes is in the light path from the light source through the LCD panel to the observer. A back-lit or front-lit LCD offers superior contrast ratio compared to the one that is lit by the ambient light. A backlight can be direct or indirect type. A direct backlight is positioned directly beneath the LCD panel. An indirect (or side- lit) backlight is positioned at the side of the LCD panel and requires a carefully designed light-guide and a diffuser to illuminate the LCD panel TLFeBOOK

177 evenly. Most TFT-LCD monitors use CCFL for backlighting due to its unrivaled luminance density – emitting the most light within the minimum form factor. The CCFL can be designed to generate an arbitrary color, which is critical for reproducing pure white in the backlighting applications. Technology for CCFL manufacturing is mature; therefore, its production cost is rather low. However, compared to power consumption of the TFT-LCD panel, the power consumption of the CCFL backlight is quite high. A CCFL backlight module consists of the fluorescent lamp, the driving DC-AC inverter, and the light reflector. A CCFL is a sealed glass tube with electrodes on both ends. The tube is filled with an inert gas (argon) and mercury. The inner glass surface of the tube is coated with phosphor, which emits visible light when excited by photons. The wavelength of the visible light (i.e., color) depends on the type of the gas and phosphor. In the LCD backlighting application, a proper mix of red, green, and blue phosphors produces the desired three-band white light. In other applications where the pure white is not required, the emitted light is color-shifted (e.g., the bluish or yellowish household hot cathode fluorescent lamps). The CCFL converts electrical energy into visible light by a process known as the gas discharge phenomenon. When a high voltage is applied to the electrodes turning on the lamp, electrical arcs are generated that ionize the gas, which allows the electrical current to flow. The collision among the moving ions injects energy to the mercury atoms. The electrons of the mercury atoms receive energy and jump to a higher energy level followed by emitting ultraviolet photons when falling back to their original energy level. The ionized gas conducts the electrical current as a gas conductor. The impedance of the gas conductor, unlike that of the metal conductor having a linear behavior, decreases as the current increases. Therefore, the CCFL has to be driven by an alternative current (AC). Otherwise, the CCFL explodes due to the heat caused by the runaway current. The CCFL is usually manufactured as a thin long tube, which can be straight or in any different shape. A DC-AC inverter is usually used to drive a CCFL in battery-powered applications. A DC-AC inverter is basically a switching oscillator circuit that supplies high-voltage AC current from a low-voltage battery. The nominal AC frequency of modern CCFL is in the range of 50-100 kHz to avoid flickering. The nominal operate voltage has to be higher than 500 VRMS to keep inert gas ionized. To conserve energy in battery-powered applications, dimming control is a desirable feature for DC-AC inverters. Different methods of dimming the CCFL have been used, including linear current, pulse-width-modulation, and current chopping [6]. In a DC-AC inverter with dimming control, an analog TLFeBOOK

178 or digital input signal is exposed for adjusting the CCFL illumination in addition to the on/off input signal. Most well designed DC-AC inverters have high electrical efficiency (>80%) and linear response of output electrical power to input power. Most fluorescent lamps, however, have low optical efficiency (<20%) and non-linear response of output optical power versus input power [7].

10.2.3 Power Modeling for the CCFL

The CCFL illumination is a complex function of the driving current, ambient temperature, warm-up time, lamp age, driving waveform, lamp dimensions, and reflector design [7]. In the transmissive TFT-LCD application, only the driving current is controllable. Therefore, we model the CCFL illumination as a function of the driving current only and ignore the other parameters.

1

0.9

0.8

0.7

0.6

0.5

0.4

CCFL illuminance CCFL 0.3

0.2

0.1

0 0 0.5 1 1.5 2 2.5 3 Power Consumption (W)

Figure 10-2: Normalized CCFL illuminance (i.e., backlight factor, b) versus driver’s power consumption for a CCFL light source.

Relationship between the CCFL illumination (i.e., luminous flux incident on a surface per unit area) and the driver’s power dissipation for the CCFL in LG Philips transmissive TFT-LCD LP064V1 [8] is shown in Figure 10-2. The CCFL illumination increases monotonically as the driving power increases from 0 to 80% of the full driving power. For values of driving power higher than this threshold, the CCFL illumination starts to saturate. The saturation phenomenon is due to the fact that the increased temperature and pressure inside the tube adversely impact the efficiency of emitting visible light [9]. TLFeBOOK

179 Let backlight factor b∈[0,1] denote the normalized CCFL illumination factor with b=1 representing the maximum backlight illumination and b=0 representing no backlight illumination. Accounting for the saturation phenomenon in the CCFL light source, we use a two-piece linear function to characterize the power consumption of CCFL as a function of the backlight factor:

°­ A .,0bC+≤≤ bB P ()b = lin lin s backlight ® +≤≤ (Watt). (1) ¯°Asat.,bC sat B s b 1 We conduct experiments to derive the regression coefficients, A’s and C’s. A precision luminance meter such as [10] provides accurate absolute illuminance readings. These meters are expensive and can be replaced with the cheaper photographic light meters. Indeed the absolute illuminance readings are not required to characterize the CCFL power consumption as a function of the backlight factor. An accurate photographic light meter can serve the purpose so far as it is capable of sensing minor illuminance changes as described next. We use a light meter in a way similar to how we compare the mass of two objects on a scale (weighing machine). We simultaneously adjust the backlight factor b and the TFT-LCD grayscale, x∈[0,255], while maintaining the same CCFL illuminance. Let q(b) denote the required analog or digital dimming control input of the DC-AC inverter for generating backlight factor b. We start by setting the maximum CCFL illuminance (b=1) and the minimum TFT-LCD grayscale (x=0). The grayscale x is obtained by displaying a pure gray image in which Red=Green=Blue=x for every pixel. The grayscale x is increased until the light meter senses a variation and reports a different reading. We then reduce the backlight factor b (by adjusting the dimming control q) until the meter reports the previous reading. The change in the TFT-LCD grayscale (which determines the transmittance of the TFT-LCD panel) is thus known. Now the change in the backlight factor is assumed to be the equal to the TFT-LCD transmittance. We record q as the dimming control value for the backlight factor b=(255- x)/256. At the same time, the power consumption of the backlight Pbacklight is measured and recorded. We repeat the above procedure for x=0,1...,255. After interpolation, we obtain q(b) and Pbacklight(b). For the CCFL in LG Philips transmissive TFT-LCD LP064V1, the following coefficient values were obtained: A =0.4991, A =0.1489, C =0.1113, C =0.6119, lin sat lin sat (2) Bs=0.8666. TLFeBOOK

180 10.2.4 Transmissive TFT-LCD Monitor

The major components of a transmissive TFT-LCD monitor subsystem include a video controller, a frame buffer, a video interface, a TFT-LCD panel, and the backlight (cfr. Figure 10-3.) The frame buffer is a portion of memory used by software applications to deliver video data to the video controller. The video data from the application is stored in the frame buffer by the CPU. The video controller fetches the video data and generates appropriate analog (e.g., VGA) or digital (e.g., DVI) video signals to the video interface. The video interface carries the video signals between the video controller and the TFT-LCD monitor. The TFT-LCD monitor receives the video data and generates a proper shade – i.e., transmittance – for each pixel according to the corresponding pixel value. All of the pixels on a transmissive LCD panel are illuminated from behind by the backlight. To the observer, a displayed pixel looks bright if its transmittance is high (i.e., it is in the 'on' state), meaning it passes the backlight. On the other hand, a displayed pixel looks dark if its transmittance is low (i.e., it is in the 'off' state), meaning that it blocks the backlight. If the transmittance can be adjusted to more than two levels between the 'on' and 'off' states, then the pixels can be displayed in grayscale. If the shade can be colored as red, green, or blue by using different color filters, then pixels can be displayed in color by mixing three sub-pixels each in a different color and with its own grayscale. In other words, the perceived brightness of a pixel is determined by its transmittance and the backlight illumination.

Grayscale Control Display CPU Line Video data

Frame TFT-LCD Scanner Video Timing panel Backlight

Power supply DC-AC

Figure 10-3: Block diagram of a TFT-LCD monitor. TLFeBOOK

181 A TFT-LCD panel consists of the following ordered layers: front polarizer, color filter, glass, indium tin oxide (ITO), polymide film, liquid crystals, polymide film, ITO, glass, and rear polarizer. The light transmittance is determined by the front and rear polarizers and the orientation of the liquid crystals. A polarizer is a light filter that blocks the light wave in different directions. After passing thought the rear polarizer, the backlight becomes in a single direction. If the front polarizer and rear polarizer are perpendicular to one another and there is no liquid crystal in between, then the backlight will be blocked and the LCD looks dark. Otherwise, the backlight will pass through and the LCD looks bright. The liquid crystals can be considered as tiny lenses between the two polarizers. The direction of light wave can be twisted by changing the orientation of the liquid crystals. Thus, the liquid crystals can be considered as voltage- controlled light switches (cfr. Figure 10-4.)

Figure 10-4: LCD as voltage-controlled light switch.

The liquid crystals are in a phase of matter between liquid and solid states. In the liquid state the molecules can move freely, whereas in solid state the molecules are fixed in certain order. The liquid crystals are in the state that the molecules can move until they form a certain order according to an external force. Three types of liquid crystal phases are used in LCD: nematic, cholesteric, and smectic. The twisted nematic (TN) liquid crystal is the most widely used in today's LCD monitors. TLFeBOOK

182 The nematic liquid crystals can be considered as transparent rods (rod-like lenses.) These rods are locally aligned with their long axes nearly parallel to each other on a two-dimensional plane. Their orientation can be denoted by an angle. When there are two stacked planes of nematic liquid crystals, the molecules on each plane align with each other, but the angle of each plane can be different, or twisted. When more twisted planes of nematic liquid crystals are stacked, a series of twisted rod-like lenses forms a spiral light path, which can twist the direction of the light wave from the backlight. If the polarizers are perpendicular, then the backlight passes and a bright pixel is seen. This configuration of crossed polarizers is called normally white. When an external electrical field is applied, all nematic liquid crystals change their orientation such that their long axes point to the electrodes uniformly. In this case, the direction of the light wave is not affected and forms a dark pixel in a normally white LCD. In other words, a normally white LCD consumes power to generate a dark pixel. By controlling the electrode voltage, the amount of light passing the LCD can be modulated to generate grayscale pixels. The nematic liquid crystals are sandwiched between two glasses. The spacing in between decides the number of planes of nematic liquid crystals. To twist the liquid crystals, parallel groves are produced on the glasses by using polymide film. The groves on the front glass set the orientation of the nearby nematic liquid crystals. The groves on the rear glass set the orientation of the nearby nematic liquid crystals to be perpendicular to the first one. The orientations of other planes in between are twisted by the first and last plane from 0 to 90 degree accordingly when no voltage applied. The color filter determines a sub-pixel to be red, green, or blue.

Vg

Vd

Vcom

Figure 10-5: The electrical waveforms of the gate-source Vg and drain-source voltage Vd of a TFT-LCD.

Each sub-pixel has an individual liquid crystal cell, a thin-film-transistor and a storage capacitor. The layout of the transparent ITO electrodes defines the shape of a sub-pixel. The liquid crystals between the electrodes form a TLFeBOOK

183 conceptual cell. The electrical field of the capacitor controls the transmittance of the cell. The capacitor is charged and discharged by its own TFT. The gate electrode of the TFT controls the timing of charging/discharging when the pixel is scanned (or addressed) for refreshing its content. The (drain-) source electrode of the TFT controls the amount of charge (cfr. Figure 10-5).

Vcom

Vd

Vg Gate bus line G Cgs Vn D S Cst Clc

Vs Vcom

t Source bus line

Figure 10-6: The equivalent circuit of a TFT-LCD sub-pixel. CST is the storage capacitor. CGS and Clc are parasitic capacitances.

The gate electrodes and source electrodes of all TFT’s are driven by a set of gate drivers and source drivers, respectively. A single gate driver (called a gate bus line) drives all gate electrodes of the pixels on the same row. The gate electrodes are enabled at the same time the row is scanned. A single source driver (called a source bus line) drives all source electrodes of the pixels on the same column (cfr. Figure 10-6). The source driver supplies the desired voltage level (called grayscale voltage) according to the pixel value. In other words, ideally, the pixel value transmittance, t(x), is a linear function of the grayscale voltage v(x), which is itself a linear function of the pixel value x. If there are 256 grayscales, then the source driver must be able to supply 256 different grayscale voltage levels. For the source driver to provide a wide range of grayscales, a number of reference voltages are required. The source driver mixes different reference voltages to obtain the desired grayscale voltages. Typically, these different reference voltages are fixed and designed as a voltage divider. For example in [8], an Analog Devices input LCD reference driver [11] is used with a 10-way voltage divider. TLFeBOOK

184

Vdd

rk Vk-1 rk-1 Vk-2 rk-2 Vk-3 rk-3

V2 r2 V1 r 1 V0 r0

Figure 10-7: The voltage divider generating reference voltages for grayscale controller.

Assume that the transmittance of the TFT-LCD is linear and the resistors of the voltage divider are identical. If k+1 identical resistors r0…rk are connected in series between ground and Vdd in that order, then the output voltage seen from the top terminal of ri is (cfr. Figure 10-7):

i +1 = . VVi (3) k +1 dd

1

0.9

t 0.8

0.7

0.6

0.5

0.4 Transmittance 0.3

0.2

0.1

0 0.96 0.965 0.97 0.975 0.98 0.985 0.99 0.995 1 TFT-LCD Panel Power

Figure10- 8: Normalized pixel transmittance, t(x), versus power consumption of a pixel in the LCD panel. TLFeBOOK

185 10.2.5 Power Modeling for the TFT-LCD

The hydrogenated amorphous silicon (a-Si:H) is commonly used to fabricate the TFT in display applications. For a TFT-LCD panel, the a-Si:H TFT power consumption can be modeled by a quadratic function of pixel value x∈[0,255] [13]: 2 PTFT (x)=c0+c1x+c2x (Watt). (4) We performed the current and power measurements on the LG Philips, LP064V1 LCD. The measurement data are shown in Figure10- 8. The regression coefficients are thus determined as:

c0=2.703E-3, c1=2.821E-4, c2=2.807E-5. (5) The power consumption of a normally white TFT-LCD panel decreases as its global transmittance increases. In other words, while maintaining the same luminance, the power consumption of the TFT-LCD decreases when dimming the backlight. The change in the TFT-LCD power consumption is, however, quite small.

10.3 BACKLIGHT AND TRANSMITTANCE SCALING The general framework for the backlight luminance dimming and the transmittance scaling is depicted in Figure 10-9. The pixel illumination is determined by the backlight illumination and the LCD transmittance. The backlight illumination is controlled by adjusting the amplitude of the dimming control signal to the DC-AC inverter. The LCD transmittance is controlled by the pixel values and the grayscale controller. The backlight scaling technique dims the backlight to save power and increases pixel values to compensate for the brightness loss. The transmittance scaling technique dims the backlight to save power and increases the reference grayscale voltages to compensate for the brightness loss. Note that transmittance scaling does not change the pixel values. TLFeBOOK

186

Dimming value Pixel value Reference Control voltage Grayscale controller 10 grayscale voltages DC-AC inverter Source driver

CCFL Illumination * LCD Transmittance = Pixel Illumination

Figure 10-9: A framework for backlight luminance dimming and transmittance scaling. The luminance of a transmissive object is the product of the luminance of the incident light and the transmittance of the object [5]. For a pixel on a transmissive TFT-LCD monitor, its transmittance, t(x), is a function of its pixel value x. More precisely, a pixel value of zero means zero transmittance and hence the perceived shade will be black whereas a pixel value of 255 means a transmittance of one and hence the perceived shade will be white. Other pixel values between 0 and 255, result in various shades of gray. Now, luminance, L(x), of a pixel in the TFT-LCD panel is calculated as: L(x) = b·t(x) (6) The ambient light is not considered here because it has little effect for a transmissive TFT-LCD when compared with a reflective or transflective one. Figure 10-10 depicts the relation in (6) assuming that the TFT-LCD transmissivity is a linear function of the pixel value.

b * =

Backlight TFT-LCD Transmittance Luminance Function Factor Function Figure 10-10: The luminance of a normalized pixel value (right) is the product of the backlight factor b and the TFT-LCD transmissivity function. In a non-backlight-scaled TFT-LCD monitor, the backlight luminance (denoted by the normalized backlight factor b) is fixed at full CCFL driver power (b=1). TLFeBOOK

187 10.3.1 Backlight Luminance Dimming

Reference [2] describes two backlight luminance dimming techniques, which dim the backlight luminance to save power consumption. To compensate for brightness loss, the authors reduce b while increasing the pixel values from x to x' by two mechanisms. The “backlight luminance dimming with brightness compensation” technique uses: x’=x+b. (7) The “backlight luminance dimming with contrast enhancement” technique uses: x’=x/b. (8) The pixel values are adjusted by software before being written into the frame buffer or by hardware after being fetched by the video controller. The distortion after backlight luminance dimming is evaluated by the percentage of saturated pixels that exceed the range of pixel values, i.e., [0,255]. The optimal backlight factor is determined by the backlight luminance dimming policy subject to the given distortion rate. To calculate the distortion rate, a histogram estimator is required for calculating the statistics of the input image.

10.3.2 Programmable LCD Reference Driver

Recall that the pixel value transmittance, t(x), is a linear function of the grayscale voltage v(x). The transmittance scaling approach is to control the mapping of v(x) in order to control the transmittance function t(x). We propose using a programmable LCD reference driver (PLRD) described as follows. The PLRD is implemented by adding an extra logic to the original voltage divider expressed by (6). The logic contains a number of p-channel and n- channel switches and multiplexers. Recall that k+1 identical resistors r0…rk are connected in series between ground and Vdd. The PLRD takes two input arguments gl and gu, and then short circuits the top terminal of rgl to ground and the top terminal of rgu to Vdd. In this way, the output voltage seen from the top terminal of ri becomes:

­ Vguikdd, ≤≤ ° ' ° igl− V = Vgligu≤< iglgu,, ® − dd, . (9) ° gu gl ° ≤≤ ¯ 0, 0 igl

Clearly, the PLRD performs a linear transformation (limited by 0 and Vk) TLFeBOOK

188 on the original reference voltages, and therefore, provides the transmittance scaling policy a mechanism for adjusting the TFT-LCD transmittance function as shown in Figure 10-11a. The luminance function is shown in Figure 10-11b.

1 1 b b * t(x) = L(x)

0 gl gu 1 0 gl gu 1 x x (a) (b) Figure 10-11: (a) The LCD transmittance function (b) and the luminance function when using a programmable LCD reference driver.

The similar concept of PLRD has been implemented in TFT-LCD controllers such as [12] to control contrast or gamma-correction. The PLRD represents a class of linear transformations on the backlight-scaled image. It covers both brightness scaling (adjusting gu and gl simultaneously) and contrast scaling (adjusting gu-gl). On the other hand, non-linear transformations are not desired in transmittance scaling because they cannot preserve the uniformity of contrast.

10.3.3 Contrast Fidelity

The term contrast describes the concept of the differences between the dark and bright pixels. Brightness and contrast are the two most important properties of any image. In the Human Visual System [5][14], which models the perception of human vision as a three-stage processing, the brightness and contrast are perceived in the first two stages. Virtually every single display permits the users to adjust the brightness and contrast settings. For transmissive LCD monitors, the brightness control changes the backlight illumination and the contrast control changes the LCD transmittance function. TLFeBOOK

189

(a) Original image (b) Dim backlight to (c) Backlight (d) Transmittance 50% without luminance dimming scaling compensation Figure 10-12: Luminance functions and visual effects of adjusting brightness (b), contrast (c), and both (d) when the backlight is dimmed to 50%. Figure 10-12 shows how the brightness and contrast control change the luminance function and their visual effects when the maximum brightness is limited to 50%. Figure 10-12a depicts the original image of the USC girl. In Figure 10-12b, when the backlight is reduced to 50%, the image contrast is noticeably reduced. If we compensate for the contrast loss as shown in Figure 10-12c, then the darker (<50%) pixels will preserve their original brightness while the brighter (>50%) pixels will overshoot completely (there will be no contrast present among these pixels.) Figure 10-12d shows how the transmittance scaling generates a better image by balancing the contrast loss and number of overshot pixels. The luminance functions in Figure 10- 12b and Figure 10-12d represent the following class of linear transformations that can be implemented by the PLRD as expressed in (9):

≤≤ −d ­ 0, 0 xgl gl= ° ⋅=cx+≤≤ d,, gl x gu where c btx() ® bd− . (10) ° ≤≤ gu= ¯ bgux,1 c

Here (gl,0) and (gu,b) are the points where y=cx+d intersects y=0 and y=b, respectively. The luminance function consists of three regions: the TLFeBOOK

190 undershot region [0,gl], the linear region [gl,gu], and the overshot region [gu,1]. In other words, the gl and gu are the darkest and the brightest pixel values that can be displayed without contrast distortion (overshooting or undershooting). Notice that the slope of the linear region is very close to that of the original luminance function, which is unity. The image has very few pixels in the undershot and overshot regions. Its histogram is shown in Figure 10-13a. The kernel of transmittance scaling is to find the dissimilarity between the original and backlight-scaled images, which can be solely determined by examining the luminance function b.t(x). We define the contrast fidelity function as the derivative of b.t(x):

­0, 0≤

c is limited between 0 and 1. If c>1, the contrast increases and deviates from that of the original image while the dynamic range [gl,gu] shrinks. The overall contrast fidelity decreases from this point, so we do not include c>1 in our solution space. The contrast fidelity is defined without quantifying contrast itself, which has no universal definition [15] and cannot help solve the optimal transmittance scaling policy problem. However, the definition of contrast fidelity does convey the concept of the classical definitions of contrast such as Weber's or Michelson's that express contrast as the ratio of the luminance difference to the maximum luminance [5][14][15]. If the normalized image histogram providing the probability distribution of the occurrence of pixel value x in the image is given as: p(x)∈[0,1], x=0..255, (12)

Then the global contrast fidelity of the backlight-scaled image will be defined as:

gu F =⋅fx() px (). Cc¦ (13) gl

Fc is a function of p, gl and gu. Finding the optimal solution that minimizes Fc is called the optimal transmittance scaling policy problem. TLFeBOOK

191 More precisely, given the image, we would like to find the optimal backlight factor and the PLRD transformation function, which maximize the global contrast fidelity. The variables of optimization are b, gu, and gl. The global contrast fidelity captures the brightness distortion due to backlight luminance dimming, also. When the backlight is dimmed, the dynamic range [gl,gu] is shrunk accordingly, so that more pixels have contrast fidelity of zero.

Figure 10-13: (a) Histogram of the example image, (b) Optimal values of gl (left curve) and gl+dr (right curve) as functions of dynamic range dr, (c) Global contrast fidelity Fc as a function of dynamic range dr for b=1 (upper curve ) and b=0.5 (lower curve), (d) Optimal solutions < Fc,Pbacklight>, (e) After transmittance scaling with 10% contrast distortion (f) After backlight luminance dimming.

10.3.4 Transmittance Scaling

To simplify the optimal transmittance scaling policy problem, our TLFeBOOK

192 approach is first to find the optimal linear transmittance function for each given backlight factor. This problem is called the contrast fidelity optimization problem. In this simplified version, given the image and backlight factor, we would like to find the optimal PLRD transformation function that maximizes the global contrast fidelity. The variables of optimization are gu and gl. Next, we sweep the backlight factor domain to find the globally optimal solutions. Our goal is to find the optimal gl and gu that maximize the overall contrast fidelity Fc. After that, the optimal coefficients c and d can be calculated from (10). The optimal transmittance function t(x) that should be applied to the LCD can then be determined as:

­ 0, 0≤

The optimal solution to the contrast fidelity optimization problem for an arbitrary histogram can be found by the following procedures. Let dr=gu-gl denote the required dynamic range [gl,gu] and the backlight factor b denote the available dynamic range [0,b]. For each dr, we can find the required gl+ dr dynamic range [gl,gl+dr] that maximizes ¦ px(). The optimal gl is found gl by scanning gl=0*256/k, 1*256/k,…(k-1)*256/k, where k represents the resolution of the PLRD in (9). Based on the histogram shown in Figure 10- 13a, Figure 10-13b shows the optimal gl and gl+dr in the x axis as functions of dr in the y axis. The left and right curves are the optimal gl and gl+dr, respectively for different dr values. This means that when the backlight is dimmed to dr, by using the available dynamic range [0,dr] from the transmissive LCD to display the required dynamic range [gl,gl+dr] by the image, we are able to generate a backlight-scaled image that minimizes the number of undershot or overshot pixels. Now consider the contrast fidelity c in (14). If the available dynamic range is larger or equal to the required dynamic range (dr”b), the optimal contrast fidelity c=1 can be obtained with d”0 and the overall contrast fidelity Fc is gu simply ¦ px(). Otherwise, if dr>b, the highest possible contrast fidelity is gl

c=b/dr with t=1 and d=0. Thus, Fc becomes: TLFeBOOK

193

b gl+ dr ¦ px(). (15) dr gl

Figure 10-13c shows Fc as a function of dr for b=1 (upper) and b=0.5 (lower). The Fc increases as dr increases from dr=0 to dr=0.5. For the b=1 curve, the example image needs no more than 70% of available dynamic range to represent the whole histogram with the best contrast fidelity c=1. For the b=0.5 curve, the Fc decreases from dr=0.5 to dr=1 because in (15) gl+ dr the ¦ p()x increases slower than dr. The optimal Fc happens at dr=0.5 and gl the contrast fidelity c=1 in the region [gl,gl+dr]. Notice that c=1 is not always the optimal solution when dr>b. If the distribution in the histogram is not normal (e.g. has two peaks) the optimal dr can be greater than b, such gl+ dr that ¦ px() can be increased. For each backlight factor b, the complexity of gl 2 finding the optimal Fc, gl and gu is O(k ) with a small k (<12). Given the solution to the contrast fidelity optimization problem for any backlight factor b, the optimal transmittance scaling policy problem can be solved by sweeping the backlight factor range between bmin and bmax, where bmin and bmax are user-specified minimum and maximum backlight factors, respectively. All of the optimal solutions are recorded along with their power consumptions. The inferior solutions, i.e., those with higher or equal power consumptions but lower fidelity, are discarded. The remaining solutions are stored for the transmittance scaling policy to select the most suitable solution according to the user preferences. Figure 10-13d shows the optimal solutions for b=0.8, 0.7,…0.2 from top-right to bottom-left. The x and y coordinates of each solution indicate the global contrast fidelity and backlight power, respectively. The two inferior solutions for b=1.0 and 0.9 are discarded because they have the same fidelity, Fc=1, as that of b=0.8. Results show that more than 50% power savings can be achieved by the transmittance scaling policy while maintaining almost 100% of contrast fidelity at a backlight factor of 70%. The visual effect is shown in Figure 10-13e, in comparison with Figure 10-13f generated from the brightness-invariant policy from (8). A pseudo-code transmittance scaling procedure is provided below. TLFeBOOK

194

transmittance_scaling(p[0..255],k) { cdf[0]=p[0]; for (i=0; i<256; i++) cdf[i]+=p[i]; for (b=bmin; b<=bmax; b+=(1/k)) { Pb=Pbacklight(b); for (dr=1; dr<=255; dr+=(256/k)) { Rmax=-1; for (g=0; g<=255-dr; g+=(256/k)) { R=cdf[g+dr]-cdf[g]; if (R>Rmax) { gl=g; Rmax=R; } } } if (b>=dr) Fc=R; else Fc=(b/dr)*R; gu=gl+dr; Sol = ; Search solution database for and <*,Pb,*,*>; if (Sol is not inferior) Insert Sol into solution database; } }

We use a set of benchmark images from the USC SIPI Image Database (USID) [16]. The USID is considered the de facto benchmark suite in the signal and image processing research field [5]. The results reported here are from 8 color images from volume 3 of USID. All of these images are 256 by 256 pixel images. The color depth is 24 bits, i.e., 8 bits per color-channel in the range of 0-255. Tables 10-1 and 10-2 show the results of the optimal transmittance scaling policies for the benchmarks. We use 90% as the global contrast fidelity threshold to find the minimum backlight factor and the optimal transmittance transformation. The results show an average of 3.7X savings within 10% of contrast distortion. TLFeBOOK

195

Table 10-1. Optimal transmittance scaling solutions to the USID benchmark images.

Image Backlight Contrast Brightness Overall CCFL factor fidelity shift fidelity Power # b c d Fc (mW) 4.1.01 0.51 1 0.00 0.91 803.84 4.1.02 0.38 1 0.00 0.91 549.99 4.1.03 0.65 1 0.00 0.91 1077.21 4.1.04 0.75 1 0.00 0.91 1272.47 4.1.05 0.75 1 0.01 0.91 1272.47 4.1.06 0.84 1 0.04 0.90 1448.21 4.1.07 0.71 1 0.06 0.90 1194.36 4.1.08 0.72 1 0.04 0.92 1213.89

Table 10-2. Original images (upper images) vs. transmittance-scaled images (lower images). TLFeBOOK

196 10.4 SUMMARY This chapter presented a technique for reducing the backlight illumination while compensating for the luminance loss by adjusting the transmittance function of the TFT-LCD panel. First, background information about CCFL and TFT-LCD monitors was presented. Next, a contrast distortion metric to quantify the image quality loss after dimming backlight was described. Finally, the transmittance scaling problem was precisely formulated and solved. Experimental results showed that an average of 3.7X power saving can be achieved with a mere 10% contrast distortion. The proposed transmittance scaling technique was applied to still images only. However, the basic technique can be extended to video streams. For video, the decision about the backlight scaling factor is made for each frame one at a time. Consequently, the backlight factor may change significantly across consecutive frames. The enormous change in the backlight factor may introduce inter-frame brightness distortion to the observer. Therefore, when the transmittance scaling technique is applied to video applications such as an MPEG2 decoder, the inter-frame change in the backlight dimming factor should be controlled carefully such that the change is hardly noticeable to human eyes.

References

[1] T. Simunic et al., “Event-driven power management,” IEEE Tran. Computer-Aided Design of Integrated Circuits and Systems, vol. 20, pp. 840-857, July 2001. [2] I. Choi, H. Shim, and N. Chang, “Low-power color TFT LCD display for hand-held embedded systems,” Proc. of Symp. on Low Power Electronics and Design, Aug. 2002, pp. 112-117. [3] ANSI/IES. 1986. Nomenclature and Definitions for Illuminating Engineering, ANSI/IES RP-16-1986. New York, NY: Illuminating Engineering Society of North America. [4] Radiosity: A Programmer’s Perspective by Ian Ashdown, © October 2002 by Heart Consultants Limited. (Originally published by John Wiley & Sons in 1994.) [5] W. K. Pratt, Digital Image Processing, Wiley Interscience, 1991. [6] Maxim, MAX1610 Digitally Controlled CCFL Backlight Power Supply. [7] J. Williams, “A fourth generation of LCD backlight technology,” Linear Technology Application Note 65, Nov. 1995. [8] LG Philips, LP064V1 Liquid Crystal Display. [9] Stanley Electric Co., Ltd., [CFL] cold cathode fluorescent lamps, 2003. [10] Minolta, Minolta Precision Luminance Meter LS-100. TLFeBOOK

197

[11] Analog Devices, AD8511 11-Channel, Muxed Input LCD Reference Drivers. [12] , HD66753 168x132-dot Graphics LCD Controller/Driver with Bit-operation Functions, 2003. [13] H. Aoki, “Dynamic characterization of a-Si TFT-LCD pixels,” HP Labs 1996 Technical Reports (HPL-96-19), February 21, 1996. [14] S. Daly, “The visible differences predictor: an algorithm for the assessment of image fidelity,” Digital Images and Human Vision, pp. 179-206, Cambridge: MIT Press, 1993. [15] E. Peli, “Contrast in complex images,” J. Opt. Soc. Amer. A, vol. 10, no. 10, pp. 2032- 2040, Oct. 1990. [16] A. G. Weber, “The USC-SIPI image database version 5,” USC-SIPI Report #315, Oct. 1997. Also http://sipi.usc.edu/services/database/Database.html. [17] Toshihisa Tsukada, TFT/LCD, Liquid-Crystal Displays Addressed by Tin-Film Transistors, Amsterdam: Gordon and Breach Publishers, 1996. [18] W. C. O'Mara, Liquid crystal flat panel displays: manufacturing science & technology, New York: Van Nostrand Reinhold, 1993. TLFeBOOK

198

Chapter 11

POWER-AWARE NETWORK SWAPPING FOR WIRELESS PALMTOP PCS

Andrea Acquaviva, Emanuele Lattanzi and Alessandro Bogliolo Universit`adiUrbino

Abstract Virtual memory is considered to be an unlimited resource in desktop or notebook computers with high storage memory capabilities. However, in wireless mobile devices like palmtops and personal digital assistants (PDA), storage memory is limited or absent due to weight, size and power constraints. As a consequence, swapping over remote memory devices can be considered as a viable alternative. Nevertheless, power hungry wireless network interface cards (WNIC) may limit the battery lifetime and application performance if not efficiently exploited. In this chapter we explore performance and energy of network swapping in compar- ison with swapping on local microdrives and flash memories. Our study points out that remote swapping over power-manageable WNICs can be more efficient than local swapping and that both energy and performance can be optimized through power-aware reshaping of data requests. Experimental results show that application-level prefetching can be applied to save up to 60% of swapping energy while also improving performance.

Keywords: Memory management, power management, remote memory swapping.

11.1 INTRODUCTION Mass storage devices provide to desktop and computers the support to implement a virtual memory that can be viewed as an unlimited resource to be used to extend the main memory whenever needed. However, in wireless mobile devices like palmtops and personal digital assistants (PDAs), storage memory is limited or absent due to weight, size and power constraints, thus limiting the application of virtual memory. On the other hand, if a wireless network interface card (WNIC) is available, unlimited swapping space could be found on remote devices made available by a server and managed by the operating system as either network file systems (NFS) or network block devices TLFeBOOK

199 (NBD). However, swapping over a power hungry WNIC may limit the battery lifetime and application performance if not efficiently exploited. In this chapter we report the results of extensive experiments conducted to evaluate and optimize the performance and power efficiency of different local and remote swap devices for wireless PDAs (namely, a compact flash (CF), a micro drive (HD) and two different WNICs). The contribution of the chapter is three-fold. First we characterize all swap devices in terms of time and energy inherently required to swap a single page. Second, we test the effectiveness of the dynamic power management (DPM) support made available by each device. Third, we show that dummy data accesses can be preemptively inserted in the source code to reshape page requests in order to significantly improve the effectiveness of DPM. Experimental results show that WNICs are less efficient than local devices both in terms of energy and time per page. However, the DPM support pro- vided by WNICs is much more efficient than that of local micro drives, making network swapping less expensive than local swapping for real-world applica- tions with non-uniform page requests. Finally, we show that application-level reshaping of page requests can be used in conjunction with DPM to save up to 60% of energy while improving performance. The rest of this chapter is organized as follows. In Section 11.2 we provide some background on remote storage space. In Section 11.3 we discuss the key features of local and remote storage devices that can be used for swapping, and we briefly outline the software support for remote swapping provided by Linux. In Section 11.4 we describe the experimental setup used for our experiments. In Section 11.5 we outline the benchmarks used to characterize each swap device in terms of power and performance and we present characterization results. In Section 11.6 we discuss the support for dynamic power management provided by each swap device and we propose an application-level prefetching strategy for energy-aware swapping. In Section 11.7 we report and discuss the results of extensive experiments conducted on a simple case study. In Section 11.8 we draw conclusions.

11.2 REMOTE STORAGE SPACE The concept of remote storage has been exploited by deeply networked sys- tems for mainly three reasons: to provide extra storage space, to enable file sharing and to enhance swapping capabilities. First, remote memories or magnetic disks are used to store application and data by systems with limited or absent local mass storage space. Diskless and mobile terminals are both computer systems characterized by limited or absent disk capacity. TLFeBOOK

200 Even if the memory is not a constraint, remote storage space is used as repos- itory of data shared among different users working on different machines, as in the case of file servers. Access to remote data can be controlled by network file systems such as NFS. However, mobile networks require suitable proto- cols to handle disconnected and weakly connected operations. To this purpose, dedicated file systems and file hoarding methods have been designed [13, 7, 8]. File hoarding is the technique of preparing disconnections by caching crit- ical data. Differently from traditional caching, the cost of a miss (or failure) can be catastrophic if it occurs when the system is disconnected from the net- work. To identify critical data, LRU policies augmented with user-specified hoard-priority have been proposed as part of the CODA file system [13]. Here, priorities are used to offset the LRU age of an object. In addition, the user is given the possibility to interactively control the hoarding strategy (the so called translucent caching concept). Automated hoarding methods have been also recently presented [8, 7]. Remote memories are also commonly used as swap areas to temporarily park run-time data and application code when the total amount of available system memory is not enough to contain user processes. In computer clusters remote swap areas are designed to replace local swap partitions for performance reasons. In fact, high speed links may provide faster access than local magnetic disks especially under certain workload conditions, due to the high rotational latency [12, 6]. While for remote file systems the main issues is reliability, for remote swapping the performance of data transfer is the key point. For this reason, simpler and more efficient supports have been proposed [5]. Remote swap areas can be also exploited by mobile devices, where local storage space is limited and expensive [3]. However, network swapping in mobile devices does not come for free since they are much more bandwidth and energy constrained than desktop PCs and workstations. Remote swapping for handheld computing systems is a recent research topic that has not been extensively studied so far. The problem of energy consumption of network swapping in mobile devices has been faced by Hom et al. [9]. They proposed a compilation framework aimed at reducing the energy by switching the communication device on and off by means of specific instructions inserted at compile time based on a partial knowledge of the memory footprint of the application.

11.3 SWAP DEVICES We refer to the page-based swapping support provided by the Linux OS. Linux performs a page swap in two situations: i) when a kernel daemon, ac- tivated once per second, finds that the number of free pages has fallen below a given threshold; ii) when a memory request cannot be satisfied. The page TLFeBOOK

201 to be swapped-out is selected in a global way, independently from the pro- cess that made the request. The page replacement algorithm is based on an approximation of least recently used (LRU) policy [4]. Modern operating systems equipping palmtops and PDAs make possible to define heterogeneous support for swapping. Swapping can be performed both locally to the PDA and remotely, by exploiting server storage capabilities and network connections. More than one swap units can be enabled at the same time, with assigned priority. The unit with the highest priority is selected by default until it becomes insufficient.

11.3.1 Local Devices On-board non-volatile memory is usually available in palmtop PCs to store the bootloader and the filesystem. Magnetic disks can be added to extend file storage capabilities. Swap can be made locally in palmtops as in desktop PCs. A dedicated partition can be defined in hard drives or flash memories, where the filesystem resides. Alternatively, some OS’s allow the user to define a swap file that does not need a dedicated partition. Either way, the swap area comes at the price of decreasing the space available for actual storage purposes.

Compact Flash Palmtop PCs are equipped with on-board flash memories, but additional memory chips can be installed as an expansion if an external slot is present. Memory Technology Device (MTD) drivers allow to define swap partitions or swap files on flash memories. However, being read-most devices, flash mem- ories are not the ideal support for swapping. Nevertheless, we evaluate their swapping performance since they are always present in palmtop PCs, being sometimes the only alternative to network swapping.

Hard Disk Today’s technology made available hand-sized magnetic disks (called mini or micro drives) suitable to be installed in palmtop computers. Currently they provide a storage capability up to 5GBytes. Like traditional hard disks (HD), micro drives provide a seek time much longer than the access time to sequential blocks. For this reason, access to these kind of devices is usually performed in bursts whenever possible by exploiting on-board hardware buffers in order to compensate for the initial transfer delay. The OS tries to limit the delays by filtering disk accesses using software caches, whose size is limited by the available space in main memory. When a micro drive used as a swap device, this trade-off is even more critical, since increasing the memory space allocated for caching increases the number of swap requests. TLFeBOOK

202 11.3.2 Network Devices In order to provide the performance required to fully exploit the channel bandwidth, remote swap files can be mapped in the main memory of a remote server. This is the choice we made for our experiments.

Network File System NFS (Network File System) is used in a network to enable file sharing among different machines on a local area. The communication protocol is based on a UDP stack, while data transfers between NFS server and clients are based on Remote Procedure Calls (RPCs). The idea of using NFS to support network swapping is relatively recent [16]. To this purpose, a remote file must be con- figured as a swap area. This is made possible by modern operating systems that allow the user to specify either a device or a file as a swap unit.

Network Block Device A Network Block Device (NBD) offers to the OS and to the applications running on top of it the illusion of using a local block device, while data are not stored locally but sent to a remote server [5]. As in case of NFS, the virtual local device is mapped in a remote file, but the swap unit is viewed as a device, rather than as a file. This is made possible by a kernel level driver (or module) that communicates to a remote user-level server. The first time the network connection is set-up, a NDB user-level client negotiates with the NBD server the size and the access granularity of the exported file. After initialization, the user-level NBD client does not take part to remaining transactions, that directly involve the kernel NBD driver and the NBD server. No RPCs are required in this case, thus reducing the software overhead. Latest releases of NBD driver use an user level network communication, which affects the performance of the protocol, since data must be copied from the kernel to the user space address, but increases flexibility. Differently from NFS, the underlying network stack is TCP instead of UDP. This increases the reliability of network transfers, at the cost of increasing the protocol overhead.

11.4 EXPERIMENTAL SETUP We performed our experiments on a HP’s IPAQ 3600 handheld PC, equipped with a Strong-ARM-1110 processor, 32MB SDRAM and 16MB of FLASH. Our benchmarks were executed on the palmtop on top of the Linux operating system, Familiar release 6.0. The WNICs used to provide network connectivity were a LUCENT (hereafter denoted by NICLucent) and a CISCO AIRONET 350(NICCisco), while the AP connected to the remote swapping server was a TLFeBOOK

203

/*************** Benchmark 1 ***************/ double A[ROW][COL]; initialize(A,ROW,COL); t0 = time(); read_by_column(A,ROW,COL); t1 = time(); /*******************************************/

Figure 11.1. Pseudo-code of the benchmark used to characterize swap devices.

CISCO 350 Series base station [19, 18, 17]. The remote server was installed on a Athlon 4 Mobile 1.2 GHz notebook. For local swap experiments we used a 340 MB IBM Microdrive (HD) and a 64 MB Compaq-Sundisk Compact Flash Memory (CF) [20, 21]. Power consumption of both WNICs and local devices was measured using a Sycard Card Extender that allowed us to monitor the time behavior of the supply current drawn by the card. The current waveforms were then digitized using a National Instruments Data Acquisition Board PCI 6024E [11] connected to a PC. A Labview (version 6.1) [10] software running on the PC was used to coordinate the acquisition and bufferize current samples to compute power and energy consumption. The remote swap NBD server was instrumented in order to collect time- stamped traces of swapping activity during benchmarks execution.

11.5 CHARACTERIZATION OF SWAPPING COSTS To characterize the inherent cost of a page swap we developed a suite of benchmarks accessing data structures much larger than the available main mem- ory, without performing any computation on them. This kind of benchmark is suitable to characterization swapping cost since the computation time is neg- ligible with respect of the time spent in swapping and the devices under char- acterization are always busy serving page requests. The pseudo-code of the benchmark is shown in Figure 11.1. A large matrix is allocated and initial- ized and then read by column in order to maximize the number of page faults. Different benchmarks were generated by changing the number of columns and rows of the matrix in order to change the number of page faults while keep- ing the total size of the matrix unchanged. This allowed us to cross-validate experimental results and reduce characterization errors. A second set of benchmarks was obtained by replacing the read by column procedure with a write by column procedure, and used to characterize the swapping cost in case of write-back. TLFeBOOK

204

Characterization Results Experimental results are reported in Table 11.1 in terms of time, energy and power required by each local and remote device to swap a page of 4096 bytes. Both read-only and write-back results are reported. In general, write-back doubles the cost (in energy and time) of a read-only swap, since it involves two data transfers. As expected, local devices are more efficient than WNICs and CF has an energy-per-page more than 10 times lower than all other devices. It is also worth noting that, for a given WNIC, NBD provides greater perfor- mance than NFS, at the cost of a slightly higher power consumption. Since the time reduction overcomes the additional power consumption, the energy per page required by NBD is lower than that required by NFS.

Table 11.1. Power consumption and performance of local and remote swap devices.

Swap device Read-only Write-back Type Mode Time Energy Power Time Energy Power [ms] [mJ] [mW] [ms] [mJ] [mW] CF local 4.1 0.201 49 8.2 0.402 49 HD local 3.0 1.911 637 6.4 4.061 637 NICCisco NBD 7.0 5.934 848 14.0 10.319 735 NICCisco NFS 8.5 6.123 720 14.6 10.516 720 NICLucent NBD 8.0 5.626 578 15.0 8.599 573 NICLucent NFS 10.0 5.243 524 22.0 10.672 485

11.6 POWER OPTIMIZATION In the previous section we have characterized swap devices in terms of time and energy requirements per swap page. To this purpose we designed a set of benchmarks that simply accessed data structures much larger than the main memory without performing any computation on them. Although useful for characterization purposes, the benchmarks of Figure 11.1 are unrealistic for two main reasons. First, computation time is usually non- negligible, so that page requests are spaced in time according to a distribution that depends on the workload and on the state of the main memory. Second, the total size of the data structures accessed by each application usually does not exceed the size of the main memory, or otherwise the performance degradation would not be acceptable. In most cases of practical interest, swapping is mainly needed after a context switch to bring in main memory data structures the first time they are used by the active process. Moreover, in handheld devices there are often only a few processes running concurrently, so that both main memory and peripherals are TLFeBOOK

205

Table 11.2. Power states of local and remote devices.

Device State Power Timeout WU-time WU-power [mW] [ms] [ms] [mW] CF Read 107 Write 156 Wait 4.5 HD Read 946 Write 991 Wait 600 Sleep 24 2000 4500 ± 1980 1067 NIC Receive 755 (cisco) Transmit 1136 Wait 525 (PSP/PSPCAM) 113 0/850 14/14 400 Power-Off 0 any 370 451 NIC Receive 548 (lucent) Transmit 798 Wait 407 (PSP) 38 100 1 800 Power-Off 0 any 270 357

mainly used by a single process at the time. In this situation, the usage pattern of swapping devices are significantly different from those used for characterizing swapping costs because of the presence of long idle periods between page swaps. Since swapping devices spend power while waiting for page requests, the effective energy per page is larger than that reported in Table 11.1. On the other hand, idleness can be dynamically exploited to save power by putting the devices in low-power operating modes, or by turning them off. Dynamic power management (DPM) significantly impacts the performance and energy trade-off offered by each device under bursty workloads. In this section, we first analyze the DPM supports provided by each swapping device, then we show how to increase their effectiveness by means of software optimization techniques aimed at reshaping the distribution of page requests.

11.6.1 DPM Support Evaluation The DPM support provided by each swap device is schematically represented in Table 11.2. For each device, the key features of active and inactive operating modes are reported. Active modes are characterized only in terms of power consumption, while inactive modes are also characterized in terms of timeout to be waited before entering the inactive state, wake-up time and wake-up power. TLFeBOOK

206 The data reported in the Table have been obtained by analyzing the current profiles provided by the measurement setup described in Section 11.4. First of all we remark that the average power consumptions measured during page swaps (reported in Table 11.1) are not equal to the power consumptions measured for the devices during read/receive or write/transmit. In fact, for instance, a page swap across a wireless link entails the transmission of the page request, a waiting time corresponding to the latency of the remote device, the reception of the page and, possibly, the write-back of a swapped-out page. The average swapping power comes from the weighted average of all these contributions. The CF has no inactive states. This is because its power consumption in wait mode is negligible, making inactive low-power states useless. On the contrary, NICs and HDs consume a large amount of power while waiting for service requests, so that it is worth switching them to low-power inactive states during long idle periods. The Sleep state of the HD has the lowest power consumption, but the highest wakeup cost in terms of power (higher than 1W) and time (in the order of several seconds). Moreover, the wakeup time is highly unpredictable, its measured standard deviation being almost 2 seconds. According to the IEEE802.11b standard, WNICs provide MAC-level DPM support that can be enabled via software [1]. The actual implementation of the DPM support depends on the WNIC. The protocol policy (PSP) consists in placing the card in a low-power state called doze mode, in which it sleeps but wakes-up periodically to keep synchronized with the network and to check the access point (AP) for outstanding data. A polling frame bust be transmitted by the card for each packet to be retrieved. PSP mode provides power savings at a cost of a noticeable performance hit. To increase performance, a variation of this policy is implemented by CISCO cards. They automatically switch from PSP to CAM (Constant Awake Mode) when a large amount of traffic is detected. In this case no polling frame is needed between packets since the reception and transmission happen in active mode. Even if the power consumption in sleep state is low, it is not negligible. Moreover, the card is sensitive to broadcast traffic. A more aggressive policy would require to completely shut-off the card when no needed by any active application in the system. Thus, more power can be saved, at the price of a larger wake-up delay needed by network re-association. OS-level policies can be implemented to this purpose based on a power management infrastructure recently developed for Linux OS [2]. This infrastructure is composed by a power manager module that handles requests from applications and keeps track of their resource needs. On the other side, upon a request, the power manager can directly switch off a peripheral (WNIC in our case) if no other applications are using it. Switch off request may come from user applications through TLFeBOOK

207

/*************** Case study ****************/ double dummy[2048][2048], C[128][128]; double A[128][128], B[128][128]; initialize(A,128,128); initialize(B,128,128); initialize(C,128,128); initialize(dummy,2048,2048); //swap out t0 = time(); compute_product(A,B,C); t1 = time(); /*******************************************/

Figure 11.2. Pseudo-code of the case study.

dedicated APIs or directly by another kernel module. We exploited this feature to let the NBD driver module switch on and off the card between swapping requests. The features of doze and power-off modes are reported for both WNICs in Table 11.2. We observe that the MAC-level DPM support of NICLucent is more efficient that that of NICCisco, but the DPM policy is more conservative (the timeout being 100ms).

11.6.2 Re-shaping of Swap Requests The effectiveness of any DPM strategy strongly depends on workload statis- tics. Regardless of the DPM policy, the higher the burstiness of the workload, the higher the power savings. In fact, long idle periods can be effectively exploited to switch to the deepest inactive states, while long activity bursts amortize the cost of wake up. Although caching and buffering can be performed by the OS to perform a low-level reshaping of page requests, a typical trace of swapping traffic shows small bursts of a few pages followed by short periods of inactivity. Increasing the granularity of page swaps could increase the burstiness of the workload, but also increases the risk of preemptively swapping-in unused pages. On the other hand, in many cases data pre-fetching could be deterministically performed at the application-level in order to reshape swapping traffic. This can be done by inserting dummy accesses to the data structures right before they are used. Dummy accesses generate bursts of page requests for two main reasons: first, they are not delayed by any commutation, second, a single access is sufficient to fetch an entire page. TLFeBOOK

208 11.7 CASE STUDY We use matrix multiplication as a case study to evaluate the effectiveness of the DPM strategies implemented by the swap devices and to demonstrate the feasibility of application-level reshaping of swap requests. The pseudo-code of the case study is reported in Figure 11.2: it simply computes the product of two square matrices A and B and puts the result in a third matrix C. The total size of the three matrices fits in main memory, but we use a dummy matrix, exceeding the size of the physical memory, to force swapping activity. Matrices A, B and C are first allocated and initialized, then the dummy matrix is initialized in order to swap A, B and C out from main memory. In practice, the initialization of the dummy matrix creates boundary conditions similar to those possibly caused by the execution of other applications. Then we monitor the execution time and the swapping energy caused by the execution of the compute product procedure. The distribution of swap requests is shown in Figure 11.3. The expected distribution is also plotted for comparison. The large number of pages requested at the beginning corresponds to the upload of the entire matrix B. In fact, the first column of B has to be read in order to compute the first entry of C. Since matrices are stored in memory by rows, reading the first column entails swapping the entire matrix. Subsequent page requests are spaced in time according to the time required to compute 512 × 128 floating point products. Comparing the actual requests with the theoretical needs, we observe that the OS swaps 8 pages at the time, thus increasing the opportunity for DPM. However, the total number of pages request by the OS is 104, while the three matrices fit into 96 pages. To reshape swap requests we inserted dummy accesses to the three matrices between the computation of initial time t0 and the computation of the matrix pro- duct. Dummy accesses were performed by a routing, called access one per page, that reads one matrix entry every 512 (i.e., one entry per page).

Case Study Results Experimental results obtained by executing the case study with and without traffic reshaping are reported in Table 11.3. Based on the results reported in Table 11.1 we decided to use NBD for remote swapping. In Figure 11.5 we detailed the comparison among different devices. Each device was tested with and without DPM. The two WNICs were tested with both MAC-level DPM (doze) and OS-level DPM (power-off). The DPM mode and the corresponding timeouts are reported in the first column. The performance of the CF and the CPU time obtained by running the application with data available in main memory are also reported in the first two rows for reference. The DPM of the HD was enabled by default, so that data reported TLFeBOOK

209

105 100 95 90 85 80 75 70 65 Theoretical needs 60 Page swapped 55 50 45 40 35 Number of pages 30 25 20 15 10 5 0 0 5 10 15 20 25 30 Execution time [s]

Figure 11.3. Distribution of page requests. on row HD are computed from previous characterization. All other data were obtained from real measurements, by repeating each experiment 4 times. Interestingly, even without DPM, the HD consumes more energy than WNICs. This is because of its higher power consumption when idle. When DPM is en- abled, WNICs become much more convenient than HD. In particular, the DPM of the HD is counterproductive both in terms of time and energy under this traf- fic conditions because of the large wakeup cost. On the contrary, MAC-level DPM of NICCisco and NICLucent saves respectively more than 50% and more than 80% of the swapping energy. If the power-off state is exploited, power savings become of 85% and 94%, respectively, with negligible performance loss. When DPM policies are enabled, traffic reshaping provide further advantages both in terms of energy and execution time. For the HD, traffic reshaping makes DPM effective to save more than 60% of energy consumption. For WNICs, traffic reshaping provides additional energy savings while further reducing the performance loss. The ratios between results obtained with and without traffic reshaping are reported in the table. In particular, when the power-off state is exploited, traffic reshaping leads to additional energy savings around 60%. The overall effect of DPM and traffic reshaping makes network swapping much more energy efficient than local swapping on a HD (the overall energy being almost 10 times lower) with a performance penalty of about 5%. In Figure 11.4 we reported current profiles obtained by running the case study benchmark for the Compact Flash, Cisco card without power management and microdrive. The Figure evidences the strong variance of microdrive’s power profile, due to the variability of the head position and speed at the shutdown TLFeBOOK

210

Table 11.3. Execution time and swapping energy required to run the case study of Figure 11.2 with and without traffic reshaping.

Device Exec. time [s] Original Reshaped Ratio Avg Std Avg Std RAM 25 0 25.25 0.5 1.01 CF 25.5 0.57 25.75 0.5 1.01 HD 25.31 - 25.81 - 1.01 (PM ON) 37.75 5.91 27.75 0.96 0.73 NICCisco 26 - 26.33 0.58 1.01 (PSPCAM) 30.67 2.16 27 1 0.88 (PSP) 43.33 0.58 45 - 1.04 (OFF) 28.75 0.5 26.0 0.82 0.93 NICLucent 30.25 0.5 28.5 1.0 0.94 (PSP) 30.0 0 27.75 0.5 0.92 (OFF) 30.0 0 28.25 0.5 0.94

Energy [mJ] Original Reshaped Ratio Avg Std Avg Std RAM ----- CF 0.14 0.003 0.16 0.02 1.12 HD 15.20 - 15.50 - 1.01 (PM ON) 19.43 5.31 6.21 0.85 0.32 NICCISCO 16.51 0.01 16.73 0.38 1.01 (PSPCAM) 10.59 0.23 6.64 0.61 0.63 (PSP) 8.05 0.23 8.83 0.45 1.09 (OFF) 2.47 0.09 0.89 0.051 0.36 NICLUCENT 13.60 0.56 12.70 0.51 0.93 (PSP) 2.54 0.096 2.19 0.08 0.86 (OFF) 1.76 0.08 0.72 0.045 0.41

instants. In fact, the two bottom traces in Figure are obtained running the same benchmark. As a reference, we marked swapping activity intervals with uppercase latin letters.

11.8 CONCLUSION In conclusions, our experiments demonstrate the feasibility and the energy efficiency of network swapping from wireless palmtop PCs. The effectiveness of the DPM support provided by WNICs makes them more efficient than local HDs and open the field to optimization strategies (like swap reshaping) that may further improve energy efficiency and performance. TLFeBOOK

211

20 Compact Flash 15 A B C D E F G 10 5 0 BC DE F G Cisco WNIC 175 A 150 125 100 A BC D E F G IBM uHD 200 Current [mA] 100

0 A B C D E FG IBM uHD 200

100

0 0 10203040 Time [s]

Figure 11.4. Power profiles of the different swap devices during the execution of the case study of Figure 11.2.

Figure 11.5. Comparison of energy consumptions reported in Table 11.3

References

[1] LAN/MAN Standards Committee of the IEEE Computer Society. Part 11: Wireless LAN MAC and PHY Specifications: Higher-Speed Physical Layer Extension in the 2.4 GHz Band, IEEE, 1999. [2] A. Acquaviva, T. Simunic, V. Deolalikar, S. Roy, "Remote Power Control of Wireless Network Interfaces," Proceedings of PATMOS, Turin, Italy, Sept. 2003. TLFeBOOK

212 [3] I. Bokun, K. Zielinski, "Active Badges–The Next Generation," http://www.linuxjournal.com/article.php?sid=3047, Oct. 1998. [4] D. Bovet, M. Cesati, "Understanding the Linux Kernel," OReally´ & Asso- ciates, Sebastopol, CA, Jan. 2001. [5] P.T. Breuer, A. Marin Lopez, A. Garcia Ares, "The Network Block Device," Linux Journal, Issue 73, May 2000. [6] M. D. Flouris, E. P. Markatos, "The Network RamDisk: Using Remote Memory on Heterogeneous NOWs," Cluster Computing, pp. 281-293, 1999, Baltzer Science Publishers. [7] G. H. Kuenning, G. J. Popek, “Automated Hoarding for Mobile Comput- ing,” Proc. of Symposium on Operating System Principles, pp. 264–275, Oct. 1997. [8] G. Kuenning, W. Ma, P. Reiher, G. J. Popek, “Symplifying Automated Hoarding Methods,” Proc. of MSWiM, pp. 15–21, Sept. 2002. [9] J. Hom, U. Kremer, "Energy Management of Virtual Memory on Diskless Devices," Proceedings of COLP, Barcelona, Spain, Sept. 2001. [10] National Instruments, "Labview User Manual," http://www. ni.com/pdf/manuals/320999d.pdf [11] National Instruments, "NI 6023E/6024E/6025E Family Specifications," http://www.ni.com/pdf/manuals/370719b.pdf [12] T. Newhall, S. Finney, K. Ganchev, M. Spiegel, "Nswap: A Network Swap- ping Module for Linux Clusters," Proceedings of Euro-Par, Klagenfurt, Austria, August 2003. [13] M. Satyanarayanan, "The Evolution of Coda," ACM TOCS, Vol. 20, Is- sue 2, Pages: 85–124, May 2002. [14] A. Silbershatz, P. Galvin, G. Gagne, "Operating System Concepts, 6th Edition," Addison-Wesley, 2002. [15] Sycard Technology, "PCCextend 140 CardBus Extender User’s Manual," http://www.sycard.com/docs/cextman.pdf (1996) [16] "Swapping via NFS for Linux," http://www.nfs-swap.dot-heine.de [17] Cisco System, Cisco Aironet 350 Series Access Points, http://www.cisco.com/univercd/cc/td/doc/product/wireless/ airo 350/accsspts/index.htm, 2003. [18] Cisco System, Cisco Aironet 350 Series Wireless LAN Adapters, http://www.cisco.com/univercd/cc/td/doc/product/wireless/ airo 350/350cards/index.htm, 2003. [19] Agere, 802.11 Wireless Chip Set White Paper, http://www.agere. com/client/docs/multimode white paper.pdf, 2003. TLFeBOOK

213 [20] IBM, 340MB Microdrive Hard Drive, http://www.storage. ibm.com/hddredirect.html?/micro/index.html, 2003. [21] Compaq, compact flash cards, http://www.hp.com/products1/ storage/products/storagemedia/flash cards/index.html, 2003. TLFeBOOK

214

Chapter 12

ENERGY-EFFICIENTNETWORK-ON-CHIP DESIGN

Davide Bertozzi1,LucaBenini1, Giovanni De Micheli2 1 2 University of Bologna; Stanford University

Abstract Performance and power consumption of multi-processor Systems-on-Chip (SoCs) are increasingly determined by the scalability properties of the on-chip communication architecture. Networks-on-Chip (NoCs) are a promising solu- tion for efficient interconnection of SoC components. This chapter focuses on low power NoC design techniques, analyzing the related issues at different lay- ers of abstraction and providing examples taken from the most advanced NoC implementations presented in the open literature. Particular emphasis is given to application-specific NoC architectures, in that they represent the most promising scenario for minimization of communication-energy in multi-processor SoCs.

Keywords: Network-on-Chip, Low Power, Micro-network Stack, Application-Specific

12.1 INTRODUCTION The most critical factor in Systems-on-Chip (SoCs) integration will be related to the communication scheme among components. The challenges for on-chip interconnect stem from the physical properties of the interconnection wires. Global wires will carry signals whose propagation delay will exceed the clock period. Thus signals on global wires will be pipelined. At the same time, the switched capacitance on global wires will constitute a significant fraction of the dynamic power dissipation. Moreover, estimating delays accurately will become increasingly harder, as wire geometries may be determined late in the design flow. Hence, the need for latency insensitive design is critical. The most likely synchronization paradigm for future chips is globally-asynchronous locally-synchronous (GALS), with many different clocks. SoC design will be guided by the principle of consuming the least possible power. This requirement matches the need of using SoCs in portable battery- powered electronic devices and of curtailing thermal dissipation which can make chip operation infeasible or impractical. Whereas computation and stor- TLFeBOOK

215 age energy greatly benefits from device scaling (smaller gates, smaller memory cells), the energy for global communication does not scale down. On the con- trary, projections based on current delay optimization techniques for global wires [20] show that global on-chip communication will require increasingly higher energy consumption. Hence, communication-energy minimization will be a growing concern in future technologies [40]. Energy considerations will impose small logic swings and power supplies, most likely below 1 Volt. Electrical noise due to cross-talk, electro-magnetic interference (EMI) and radiation-induced charge injection (soft errors) will be likely to produce data upsets. Thus, the mere transmission of digital values on wires will be inherently unreliable. To cope with these problems, network design technology can be used to ana- lyze and design SoCs modeled as micro-networks of components (or Networks- on-Chip, NoCs). The SoC interconnect design analysis and synthesis is based upon the micro-network stack paradigm, which is an adaptation of the pro- tocol stack [25] (Figure 12.1) used in networking. This abstraction is useful for layering micro-network protocols and separating design issues belonging to different domains. SoCs differ from wide-area networks because of local proximity and be- cause they exhibit much less non-determinism. In particular, micro-networks have a few distinctive characteristics, namely, energy constraints, design-time specialization and low communication latency. This chapter focuses on low power NoC design techniques, and analyzes specific design issues related to the different layers of abstraction outlined in the micro-network stack in a bottom-up way. The objective is to describe, for each layer, how the system interconnect is progressively abstracted and what the most relevant micro-network design issues are in order to come up with an energy-efficient NoC architecture. Particular emphasis is given to cus- tomized, domain-specific NoCs, which represent the most promising scenario for communication-energy minimization in the context of NoC-based multi- processor SoCs (MPSoCs). In most cases, specific solutions proposed in the literature are outlined, even though it should be clear that many design issues are open and significant progress in this area is expected in the near future.

12.2 PHYSICAL LAYER Global wires are the physical implementation of the communication chan- nels. Traditional rail-to-rail voltage signaling with capacitive termination, as used today for on-chip communication, is definitely not well-suited for high- speed, low-energy communication on future global interconnect [16]. Reduced swing, current-mode transmission, as used in some processor-memory systems, TLFeBOOK

216

Figure 12.1. Micro-network stack can significantly reduce communication power dissipation while preserving speed of data communication [29]. In the case of a simple CMOS driver, low-swing signaling is achieved by lowering the driver’s supply voltage Vdd. This implies a quadratic dynamic = 2 power reduction (because Pdyn KVdd). Unfortunately, swing reduction at the transmitter complicates the receiver’s design. Increased sensitivity and noise immunity are required to guarantee reliable data reception. Differential receivers have superior sensitivity and robustness, but they require doubling the bus width. To reduce the overhead, pseudo-differential schemes have been pro- posed, where a reference signal is shared among several bus lines and receivers, and incoming data is compared against the reference in each receiver. Pseudo- differential signaling reduces the number of signal transitions, but also noise margins with respect to fully differential signaling. Thus, reduced switching activity is counterbalanced by higher swings and determining the minimum- energy solution requires careful circuit-level analysis. Dynamic voltage scaling has been recently applied to busses [23, 26]. In [26] the voltage swing on communication busses is reduced, even though signal integrity is partially compromised. Encoding techniques can be used to detect corrupted data which is then retransmitted. The retransmission rate is an input to a closed-loop DVS control scheme, which sets the voltage swing at a trade-off point between energy saving and latency penalty (due to data retransmission). The On-Chip Network (OCN) for low power heterogeneous SoC platforms illustrated in [9] employs some advanced techniques for low-power physical in- terconnect design. OCN consists of global links connecting clusters of tightly- connected IPs which are several millimeters long. By using overdrivers, clocked sense amplifiers and twisted differential signaling, packets are transmitted reli- ably with less than 600 mV swing. The size of a transceiver and the overdrive voltage are chosen to obtain a 200 mV separation at the receiver end. A 5 mm TLFeBOOK

217 global link of 1.6um wire-pitch can carry a packet at 1.6GHz with 320ps wire- delay and consumes 0.35pJ/bit. On the contrary, a full-swing link consumes up to 3x more power and additional area of repeaters. An on-chip serialization technique [10] is also used in OCN, thus significantly reducing area. However, the number of signal transitions on a link is increased since the temporal locality between adjacent packets is removed. An ad-hoc serialized low-energy transmission coding scheme was therefore designed as an attempt to exploit temporal locality between packets. The encoder generates a ’1’ only when there is difference between a current packet and a previous packet before it is serialized. The decoder then uses this encoded packet to reconstruct the original input, using its previously stored packet. A 13.4% power saving is obtained for a multimedia application. The power overhead associated with the encoder/decoder is only 0.4mW. Nevertheless, as the technology trends lead us to use smaller voltage swings and capacitances, the upset probabilities will rise. Thus the trend toward faster and lower-power communication may decrease reliability as an unfortunate side effect. Reliability bounds as voltages scale can be derived from theoretical (entropic) considerations [19] and can be measured also by experiments on real circuits. Finally, another key physical-layer issue is synchronization. In fact, global synchronization signals (i.e., clocks) are responsible for a significant fraction of the power budget in digital integrated systems. Alternative on-chip synchro- nization protocols that do not require the presence of a global clock have been proposed in the past [30, 11] but their effectiveness has not been studied in detail from the energy viewpoint. In the OCN NoC [9], a programmable power management unit provides four clocks with PLL; 1.6GHz for the OCN, 800 MHz for schedulers, 100MHz for processors and 50 MHz for peripherals. Those clock frequencies are scalable by software for power-mode control and also for optimal operation of each application.

12.3 SYSTEM INTERCONNECT ARCHITECTURE Designing the architecture for an on-chip interconnect requires choices at a higher level of abstraction with respect to physical interconnect design, but also with a stronger impact on energy dissipation. Traditional shared buses try to overcome their energy inefficiency by means of bus splitting [5]. The bus is split into smaller segments and proper bridges are inserted to ensure communication between any two adjacent segments when needed. Thus, the load capacitance charged and discharged at each bus access is reduced. Most commercial shared buses make use of this solution, including AMBA bus [1] and IBM CoreConnect [33]. They split the bus based on the TLFeBOOK

218 characteristics of the connected masters and slaves (e.g. high performance cores versus slow peripherals). More advanced bus specifications (such as AMBA Multi-Layer [34]) allow to group IP cores into clusters, and this can be done based on their interaction during application execution. As an example, OCN [9] exploits locality of IP cores by grouping them into clusters, and a crossbar switch is used for intra-cluster packets, perform- ing buffer-less cut-through switching. A round-robin scheduling of the switch ensures fairness and starvation-freedom to OCN. An n × n crossbar fabric comprises n2 crosspoint junctions which contain NMOS pass-transistors. In a conventional crossbar fabric, each input driver wastes power to charge two long wires (horizontal and vertical) and 2n transistor-junction-capacitors. OCN employs a crossbar partial activation technique. By splitting the crossbar fab- ric into 4 × 4 tiles, input and output wires can be divided into four. A gated input driver at each tile is activated only when the scheduler grants access to the tile. The output signal does not propagate to other tiles to reduce the power consumption on the vertical wire. A 43% power saving is obtained in a 16 × 16 crossbar switch fabric with a negligible area overhead.

12.3.1 Network topology Energy considerations might affect the on-chip network topology selection process, as showed by the architectural choices made in the design of recently proposed NoC solutions. Again, the OCN case is very instructive. In fact, a star topology guaranteeing constant and minimum switch hop counts between every communicating IP was adopted in an early implementation [10]. However, a 1-level flat star topology results in a number of capacitive global wires that may cause long latency and large power dissipation. Therefore, the most recent solution consists of a hierarchical SoC composed of clusters of tightly, star-connected IPs. Octagon [21] on-chip communication architecture consists of 8 nodes and 12 bi-directional links connected according to an octagonal topology. In this way, communication between any pairs of nodes can be performed by at most two hops. Moreover, Octagon exhibits higher aggregate throughput than a shared bus or crossbar interconnect, a simple, shortest past routing algorithm and less wiring than a crossbar interconnect. Octagon and OCN are examples of network topologies that try to provide the highest degree of connectivity between network nodes while trying to minimize the number of hops, therefore targeting high-performance and low-power NoC realizations. Power-aware topology selection is briefly discussed in [31] with respect to the SoCIN NoC architecture. A mesh topology is compared with a torus one: the former exhibits lower costs, while the latter reduces message latency. To avoid the long wrapping-around links, with a very high associated capacitive load, TLFeBOOK

219 a folded torus topology can be used [13]. Such approach reduces the wiring lengths and the power consumption while allowing to improve the operating frequency of network channels. A more detailed comparison between the power efficiency of a mesh and a folded torus topology is addressed in [13]. The power has been decomposed into the power per hop (traversal of input and output controllers) and power per wire distance travelled. The analysis shows that if wire transmission power dominates per-hop power, the mesh is more power efficient. For the 16 tile network considered in [13], the wire transmission power was estimated to be significantly greater than per-hop power, however the power overhead of the torus was small (less than 15%), and was counterbalanced by the benefits of its larger effective bandwidth.

12.4 DATA LINK LAYER The data-link layer abstracts the physical layer as an unreliable digital link, where the probability of bit upsets is non null (and increasing as technology scales down). Furthermore, reliability can be traded off for energy [19]. The main purpose of data-link protocols is to increase the reliability of the link up to a minimum required level, under the assumption that the physical layer by itself is not sufficiently reliable. An effective way to deal with errors in communication is to packetize data. If data is sent on an unreliable channel in packets, error containment and recovery is easier, because the effect of errors is contained by packet boundaries, and error recovery can be carried out on a packet-by-packet basis. For the realization of on-chip micro-networks, several error recovery mech- anisms developed for macroscopic networks can be deployed, but their energy efficiency should be carefully assessed in this context. As a practical example, consider two alternative reliability-enhancement techniques: error-correcting codes and error-detecting codes with retransmission. A set of experiments involved applying error correcting and detecting codes to an AMBA bus and comparing the energy consumption in four cases [12]: 1) original unencoded data; 2) single-error correction, 3) single-error correction and double-error de- tection, 4) multiple-error detection. Hamming codes were used. Note that in case 3, a detected double error requires retransmission. In case 4, using (n, k) linear codes, 2n − 2k errors patterns of length n can be detected. In all cases, some errors may go undetected and be catastrophic. Using the property of the codes, it is possible to map the mean time to failure (MTTF) requirement into bit upset probabilities, and thus comparing the effectiveness of the encoding scheme in a given noisy channel (characterized by the upset probability) in meeting the MTTF target. TLFeBOOK

220

4600 46 0.46 0.0046 4.6e-5 4.6e-7 4.6e-9 MTTF (days)

Figure 12.2. Energy efficiency for various encoding schemes

The energy efficiency of various encoding schemes varies: we summarize here one interesting case, where three assumptions apply. First, wires are long enough so that the corresponding energy dissipation dominates encod- ing/decoding energy. Second, voltage swing can be lowered until the MTTF target is met. Third, upset probabilities are computed using a white Gaussian noise model [18]. Figure 12.2 shows the average energy per useful bit as a function of the MTTF (which is the inverse of the residual word error proba- bility). In particular, for reliable SoCs (i.e., for MTTF = 1 year), multiple-error detection with retransmission is shown to be more efficient than error-correcting schemes. We refer the reader to [12] for results under different assumptions. Another important aspect affecting the energy consumption is the media ac- cess control (MAC) function. Currently, centralized arbitration schemes are widely adopted [1, 14] for the serialization of bus access requests. Unfortu- nately, central arbiters are instance-specific and therefore poorly scalable. In fact, the energy cost of communicating with the arbiter, and hardware complex- ity of the arbiter itself scale up more than linearly with the number of bus masters. The selection of a specific arbitration algorithm impacts both performance and power consumption [32]. Alternative multiplexing approaches, such as code di- vision multiplexing, are actively investigated for on-chip communication [27]. However, research in this area is just burgeoning, and significant work is needed to develop energy-aware media-access-control for future micro-networks. Arbitration mechanisms are required also in the implementation of NoC switches to address contention resolution problems such as: prioritizing one out of multiple input channels whose packets have to be directed to the same TLFeBOOK

221 output channel or multiplexing multiple virtual channels onto the same physical output link.

12.5 NETWORK LAYER At the network layer, packetized data transmission can be customized by the choice of switching and routing algorithms. The former establishes the type of connection while the latter determines the path followed by a message through the network to its final destination. In circuit switching data and control are separated: control is provided to the network just to set up a connection over which all subsequent data is transported in a connection-free fashion. On the contrary, packet-switched on-chip networks naturally offer best ef- fort services, as contention takes place at the granularity of individual packets. Packet arrival cannot be predicted and contention has to be resolved dynami- cally: power-hungry data storage is required at the routers for this purpose, and the provision of guarantees is complicated. However, a better link utilization is achieved and error control is made easier. The most promising packet switching technique for NoC application is worm- hole switching. It was originally designed for parallel computer clusters [17] because it achieves the minimal network delay and requires fewer buffers. In wormhole switching, each packet is further segmented into flits (flow control unit). The header flit reserves the routing channel of each switch, the body flits will then follow the reserved channel, the tail flit will later release the channel reservation. One major advantage of wormhole switching is that it does not require the complete packet to be stored in the switch while waiting for the header flit to be routed to the next stages. Wormhole switching not only reduces the store- and-forward delay at each switch, but it also requires much less buffer spaces. Because of these advantages, wormhole switching is an ideal candidate tech- nique for on-chip interconnect networks [13], although deadlock and livelock are potential problems that need to be taken care of [17, 15]. Routing algorithms can be static (packets injected into the network already include routing information and only minimum header processing is required at the switches) or dynamic (routing decisions are dynamically taken at the switches). These latter policies allow packet routes to adapt to network condi- tions, and therefore trade-off the energy savings obtained in this way with the increased switch complexity and related energy dissipation. Next, a compari- son between energy efficiency of routing techniques is provided as an example of network-level design decisions for low power.

Contention-Look-Ahead Routing A contention-look-ahead routing scheme is the one where the current routing decision is helped by monitoring the adjacent TLFeBOOK

222 switches, thus possibly avoiding or reducing blockages and contention in the coming stages. A contention-aware routing scheme is described in [22]. The routing decision at every node is based on the “stress values” (the traffic loads of the neighbors) that are propagated between neighboring nodes. This scheme is effective in avoiding “hot spots” in the network. The routing decision steers the packets to less congested nodes. To solve the contention problems in wormhole switching schemes, a contention-look-ahead routing algorithm can be used, which “foresees” the contention and delays in the coming stages using a direct connection from the neighboring nodes. The major difference from [22] is that information is han- dled in flits, and thus large and/or variable size packets can be handled with limited input buffers. Furthermore, because it avoids contention between pack- ets and requires much less buffer usage, the latter contention-look-ahead routing scheme can greatly reduce the network power consumption. At every intermediate stage, there may be many alternate routes to go to the next stage. We call the route that always leads the packet closer to the destination a profitable route. Conversely, a route that leads the packet away from the destination is called misroute [17]. In mesh networks, profitable routes and misroutes can be distinguished by comparing the current node ID with the destination node ID. Profitable routes will guarantee a shortest path from source to destination. Nevertheless misroutes do not necessarily need to be avoided. Occasionally, the buffer queues in all available profitable routes are full, or the queues are too long. Thus detouring to a misroute may lead to a shorter delay time. Under these circumstances, a misroute may be more desirable. It is interesting to compare the contention-look-ahead routing algorithm with dimension order routing – a routing scheme that always routes the packets on one dimension first, upon reaching the destination row or column, then switches to the other dimension until reaching the destination. Dimension ordered routing is deterministic and guarantees shortest path, but it cannot avoid contention. The contention-look-ahead routing will reduce the power consumption on the buffers because it can “foresee” the contention in the forthcoming stages and shorten the buffer queue length. On the contrary, dimension-ordered routing always steers the packets along the shortest path, while contention-look-ahead routing may choose the misroute when contention occurs and therefore has a larger average hop count per packet. This translates to more power on the interconnect. Finally, the contention-look-ahead routing switch needs more logic gates than dimension-ordered routing. However, simulation results show that with 16 RISC processors on a 4x4 mesh interconnect, contention-look-ahead rout- TLFeBOOK

223

Figure 12.3. Cache and Memory Energy Decrease as Packet Payload Size Increases ing reduces the total network power by about 15% with 16-flit buffers. The reduction is more significant with larger buffer sizes.

12.6 TRANSPORT LAYER At the transport layer, algorithms deal with the decomposition of messages into packets at the source and their assembly at destination. The choice of information decomposition into packets or flits, as well as the choice of packet size can heavily impact energy efficiency. Next, we will use the shared-memory MPSoC as a case study to analyze the packet size trade-offs both qualitatively and quantitatively. The system architecture consists of an on-chip intercon- nect which provides connectivity to nodes composed by a RISC processor, its caches, a local memory reachable by means of a local bus and the network in- terface. The MPSoC power consumption originates from three sources: 1) the node processor power consumption, 2) the cache and shared memory power consumption, and 3) the interconnect network power consumption. We will start first from the cache and memory analysis.

Cache and memory power consumption Whenever there is a cache miss, the cache block content needs to be encapsulated inside the packet payload and sent across the network. In shared-memory MPSoC, the cache block size correlates with the packet payload size. Larger packet sizes will decrease the cache miss rate, because more cache content can be updated in one memory access. Consequently, both cache energy consumption and memory energy consumption will be reduced. This relationship can be seen from Fig. 12.3. It shows the energy consumption of cache and memory under different packet sizes. The energy in the figure is normalized to the value of 256Byte, which achieves the minimum energy consumption. TLFeBOOK

224

Figure 12.4. Network and Total MPSoC Energy Consumption under Different Packet Payload Sizes

Interconnect network power consumption The power consumption of packetized dataflow on MPSoC network is determined by the following three factors: 1) the number of packets on the network, 2) the energy consumed by each packet on one hop, and 3) the number of hops each packet travels. We summarize these effects and list them below:

1 Packets with larger payload size will decrease the cache miss rate and consequently decrease the number of packets on the network.

2 Larger packet size will increase the energy consumed per packet, because there are more bits in the payload.

3 Larger packets will occupy the intermediate node switches for a longer time, and cause other packets to be re-routed to non-shortest paths. This leads to more contention that will increase the total number of hops needed for packets traveling from source to destination.

Actually, increasing the cache block size will not decrease the cache miss rate proportionally. Therefore, the decrease of packet count cannot compensate for the increase of energy consumed per packet caused by the increase of packet length. Larger packet size also increases the hop counts on the datapath. Fig. 12.4a shows the combined effects of these factors. The values are normalized to the measurement of 16Byte. As packet size increases, energy consumption on the interconnect network will increase. The total energy dissipated on the MPSoC is shown in Fig. 12.4b. It clearly decreases as packet size increases. However, when the packets are too large, as in the case of 256Byte in the figure, the total MPSoC energy will increase. This is because when the packet is too large, the increase of interconnect network energy will outgrow the decrease of energy on cache and memories. TLFeBOOK 225

Core Shared memory

Core−ports memory words North North South South East East counter West West Outputs Inputs Local Config. Line write operation read operation Crossbar Frequency Decoder Controller Voltage

North to South & East PC Example producer P1 P2 consumer Communication Pattern Instruction Memory

(a) ASOC Core Interface (b) Producer-consumer pair

Figure 12.5. Communication-based power management

12.7 SYSTEM AND APPLICATION LAYERS In the context of highly integrated on-chip multi-processors, lowering supply voltage of the cores reduces power quadratically but also results in a perfor- mance degradation which can be tolerated only if it does not impact performance beyond a critical, application-dependent threshold. Given the key role played by on-chip communication with respect to MPSoC performance, the concept of communication-based power management (CBPM) has been introduced. It consists of integrating the system-level power management functionality into the communication architecture, which binds the system components together, thus eliminating the need for separate power management entities. Second, due to its connectivity, the communication architecture can gather information (such as the execution states of system components) required to make power man- agement decisions. Finally, since the communication architecture schedules inter-component communications, it can control the timing of a component’s power modes, thus regulating the component’s (and therefore the system’s) power profile. Multiple implementations of this concept are feasible [38, 6], and two rele- vant examples will be hereafter described. The first one is represented by the Adaptive System-on-Chip (ASOC) illustrated in [6], which has been used to build a backbone for power-aware signal processing cores [39]. ASOC ability to provide dynamic voltage and frequency scaling is due to the architecture of the network interface of the cores, illustrated in Figure 12.5(a) [39]. The core interface uses a synchronized global communication schedule to manage communications through each tile. The instruction memory holds a list of the communication patterns required at run-time. A program counter (PC) fetches these patterns in succession and a decoder converts them into switch settings for TLFeBOOK

226 a crossbar, that routes data between the local core and neighboring tiles (North, East, South or West). Moreover, at each core, frequency and voltage are au- tomatically adjusted. A subsystem uses up/down counters to track the data transfer rate between core and interconnect. Blocked or unsuccessful transfers cause the count to increase, while successful transfers decrease the value. If the core input port is blocked consecutively, the core is running too slowly with respect to its predecessors. If the core output port is consecutively blocked, the core is running too quickly for its successors. In either case, these counters send trigger signals to the core configuration unit to increase or decrease the core clock. The new frequency setting automatically selects a new supply voltage value. A similar approach can be applied to pipelined signal processing applica- tions, wherein a sequence of computation stages exchange the results of their processing in a pipelined fashion. From an hardware viewpoint, the system might consist of cascaded producer-consumer pairs communicating by means of a shared memory. If producer and consumer are not well synchronized, energy-inefficient synchronization mechanisms could be triggered. For in- stance, if the consumer expects input data to process while the producer is not ready to output the result of its computation yet, the consumer keeps polling on a semaphore until its input data is available in the shared memory. This mechanism wastes significant amounts of energy, and should be avoided as much as possible. A solution to keep producer-consumer pairs synchronized is reported in Figure 12.5(b). Shared memory can be abstracted as a queue and a memory access counter keeps track of the queue level. When a lower threshold is crossed, it means that the producer is too slow or the consumer too fast, and frequency/voltage scaling can be applied for balancing data pro- duction or consumption rate. The opposite holds when an upper threshold is crossed. Counter monitoring might be carried out by a proper power man- agement hardware module connected to the bus, with the ability to program (through a register) the clock frequency generator of the cores. This could be done continuously or periodically at discrete times, in order to amortize the frequency switching cost. Worst-case power savings with respect to static frequency selection (power-optimized for a particular application) amounts to 12%.

12.8 APPLICATION-SPECIFIC NETWORKS-ON-CHIP Customizing MPSoC architectures and tailoring them to a specific appli- cation domain is a very promising approach to system energy minimization. It takes its steps from the optimization techniques used in some SoC design methodologies, that explicitly target applications from a specific domain: their TLFeBOOK

227 main features (control flow, data organization and type of processing) are eval- uated and exploited for a power-aware architecture customization [8]. For customized NoCs to be successful, however, developers must select the appropriate domain-specific architecture and map the system’s communication requirements onto it [7]. This is a non-trivial task, in that an optimized net- work instance has to be derived by analyzing the application communication requirements, and by comparing a number of alternative interconnect solutions. Moreover, reusing the components of a given NoC architecture across dif- ferent designs (and therefore network instances) becomes feasible provided the network building blocks (network interfaces, switches, switch-to-switch links) are designed as soft macros. Some NoC architectures proposed in the literature were built around this concept [6, 4, 3, 2]. Only two of them are mentioned for the sake of brevity. Quality of Service NoC (QNoC) [3] is a NoC framework wherein QoS and cost model for communications in SoCs are first defined, and related NoC archi- tecture and design process are then derived. SoC inter-module communication traffic is classified into 4 classes of service: Signaling (control signals), Real- Time, RD/WR (for short data access) and Block-Transfer (for large data bursts). By analyzing the communication traffic of the target SoC, QoS requirements (in terms of delay and throughput) for each of the four service classes are de- rived. A customized QNoC architecture is then created by modifying a generic network architecture (two-dimensional planar mesh, fixed shortest-path multi- class wormhole switching). The customization process minimizes the network cost (in area and power) while maintaining the required QoS and works as follows: the SoC modules are placed so as to minimize spatial traffic density, unnecessary mesh links and switching nodes are removed, and bandwidth is allocated to the remaining links and switches according to their relative load so that link utilization is balanced. Finally, Xpipes NoC architecture [2] is a library of highly parameterizable network components which are design-time tunable and composable to get cus- tomized domain-specific architectures. Xpipes has been designed with high- performance in mind, and this has been achieved by means of deeply pipelined switches, pipelined links to decouple link throughput from link delay, virtual output buffering. The network interface implements OCP standard signaling and the look-up tables required by static routing algorithms. The network inher- ently provides best-effort services and targets multi-gigahertz heterogeneous MPSoCs, wherein irregular network topologies with links of uneven length might be required. Next, an instructive case study about application-specific NoC instances and their potentials for energy savings is reported, leveraging the Xpipes synthesis flow. TLFeBOOK

228

357 70 run vu au med rast vld iquan idct 16 cpu le dec 27 353 190 0.5 60 40 362 362 Arm 600 40 inv 362 acdc up scan pred sdram 250 idct stripe samp 0.5 sram1 sram2 ,etc mem 49 313 300 910 670 500 vop up 173 313 pad rec 16 adsp samp 32 risc vop bab mem 94 500 (a) MPEG4 core graph (b) VOPD core graph

Figure 12.6. Core Graphs of Video Processing Applications

12.8.1 Case study A core graph representation of the application is the input to Xpipes-based synthesis flow (called NetChip). The design and generation of a customized NoC is achieved by means of two tools: SUNMAP, which performs the network topology mapping and selection functions, and ×pipesCompiler, which per- forms the topology generation function. SUNMAP produces a mapping of cores onto various NoC topologies that are defined in a topology library. The map- pings are optimized for the chosen design objective (such as minimizing area, power or latency) and satisfy the design constraints (such as area or bandwidth constraints). SUNMAP uses floorplanning information early in the mapping pro- cess to determine the area-power estimates of a mapping and to produce feasible mappings (satisfying the design constraints). The tool supports various routing functions (dimension ordered, minimum-path, traffic splitting across minimum- paths, traffic splitting across all paths) and chooses the mapping onto the best topology from the library of available ones. A design file describing the chosen topology is input to the ×pipesCompiler, which automatically generates the SystemC description of the network components (switches, links and network interfaces) and their interconnection with the cores. A custom hand-mapped topology specification can also be accepted by the NoC synthesizer, and the network components with the selected configuration can be generated accord- ingly. NetChip was applied to two different video processing applications: Video Object Plane Decoder (VOPD - mapped onto 12 cores), MPEG4 decoder (14 cores). These are high-end video-processing applications and the hardware- software partitioning of the applications is presented in [35, 36]. The core graphs of these applications is presented in Figure 12.6. The maximum link bandwidth for the NoCs is conservatively assumed to be 500 MB/s. The results of mapping VOPD onto various topologies are presented in Fig- ure 12.7. As seen from Figure 12.7(a), the butterfly topology (4-ary 2-fly) has the least communication delay out of all topologies, the least number of switches, TLFeBOOK

229

3 45 60 500 40 Sws Lnks Area 35 400 Avg (mm2) Power No 2 SWs 30 40 (mW) Hops & 300 25 Lnks 20 200 1 15 20 10 100 5

0 0 0 0 Msh Trs Hyp Cls Bfly Mesh Torus Hyp Cls Bfly Msh Trs Hyp Cls Bfly Msh Trs Hyp Cls Bfly (a) Avg hop delay (b) Resource Util (c) Design Area (d) Design Power

Figure 12.7. Mapping Characteristics of VOPD

run inverse run inverse 44 VLD length scan VLD length Cust decoder scan decoder 42 Mesh s1 s2 s1 AC/DC AC/DC iDCT iQuant 40 iDCT iQuant Predict Predict 38 s2 s3 s2 S2 S1 Stripe up VOP Stripe up VOP 36 samp reconstr Mem Mem samp reconstr s2 s3 34 s2 S2 S3 S2 ARM VOP ARM 32

Padding VOP Avg Pack Lat (Cy) core Mem core Padding Mem 30 s1 s2 s1 28 s1 − 3x3 s1 − 3x3 s2 − 3x2 s2 − 4x4 26 s3 − 5x5 s3 − 2x3 1.6 1.8 2 2.2 2.4 2.6 − repeater − repeater BW (GB/s) (a) Mesh NoC (b) Appln Specific NoC (c) VOPD Avg Lat

Figure 12.8. Mesh and Custom Mappings of Video Object Plane Decoder but has more links when compared to mesh, torus or hypercube. The large power savings achieved by the butterfly network (Figure 12.7(d)) is attributed to the fact that there are fewer switches and smaller number of hops for commu- nication. Moreover, all the switches are 4x4, while the direct topologies have 5x5 switches. The average link length in the butterfly network (obtained from floorplanner) was observed to be longer than the link lengths (around 1.5×)of direct networks. However, as the link power dissipation is much lower than the switch power dissipation, we get large power savings for the butterfly network. The smaller number of switches and smaller switch sizes also account for the large area savings achieved by the butterfly network. Thus, butterfly is the best topology for VOPD. The performance gains for the butterfly over other topolo- gies may be surprising, but careful inspection shows that the butterfly network trades-off path diversity for network switches with average hop delay. On the contrary, the same kind of analysis shows that a mesh topology is more suitable for MPEG4 than other topologies.

12.8.2 Generating Custom Topologies For custom topologies, the mapping and generation phases of the tool can be skipped and ×pipesCompiler can be directly invoked on the input design. A custom hand-tuned NoC for the VOPD is presented in Figure 12.8(b). In TLFeBOOK

230 the VOPD, about half the cores communicate to more than a single core. This motivates the configuration of this custom NoC, having less than half the number of switches than the mesh NoC. NetChip area and power reports relative to the custom NoC were automatically obtained. Significant area (5x) and power improvements (2x) were noticed with the custom NoC as fewer, smaller size switches are used with respect to the mesh network. SystemC simulation of the NoC models allowed to assess their performance. The variation of average packet latency (for 64B packets, 32 bit flits and 7 cycle switch delay) with link bandwidth is showed in Figure 12.8(c). Application- specific NoC has lower packet latency as the average number of switch and link traversals is lower. Moreover, the latency increases more rapidly for the mesh NoC with decrease in bandwidth. With the custom NoC, an average of 25% savings in latency (measured at the minimum plotted BW value) is achieved.

12.9 CONCLUSIONS This chapter focuses on low power design techniques for NoC-based gi- gascale MPSoCs. Several open problems were described at various layers of the communication stack, and the basic strategies to effectively tackle them were sketched. Finally, the large potentials for energy savings provided by the implementation of customized, domain-specific NoCs have been discussed. TLFeBOOK

231 References

[1] P. Aldworth, “System-on-a-Chip Bus Architecture for Embedded Applications,” IEEE International Conference on Computer Design, pp. 297-298, 1999. [2] “×pipes: a Latency Insensitive Parameterized Network-on-chip A rchitecture For Multi-Processor SoCs”, M.Dall’Osso, G.Biccari, L.Giovannini, D.Bertozzi, L.Benini, Int. Conf. on Computer Design, pp.536-541, October 2003. [3] E.Bolotin, I.Cidon, R.Ginosar, A.Kolodny, "QNoC: QoS architecture and design process for Network on Chip", Journal on Systems Architecture, Special Issue on Networks on Chip, December 2003. [4] I.Saastamoinen, D.S.Tortosa, J.Nurmi, "Interconnect IP Node for Future System-on-Chip Designs", IEEE Int. Work. on Electronic Design, Test and Applications, pp.116-120, January 2002. [5] C.T. Hsieh, M. Pedram, ”Architectural Energy Optimization by Bus Splitting,” IEEE Trans. CAD, Vol.21, issue 4, pp.408-414, April 2002. [6] J.Liang, S.Swaminathan, R.Tessier,”aSOC: A Scalable, Single-Chip Communication Architecture,” IEEE Int. Conf. on Parallel Architectures and Compilation Techniques, pp.37-46, October 2000. [7] S.Murali, G.De Micheli, "Bandwidth-Constrained Mapping of Cores onto NoC Architectures", De- sign Automation and Testing in Europe, 2004, pp.20896-20901. [8] L.Bisdounis, C.Dre, S.Blionas, D.Metafas, A.Tatsaki, F.Ieromninon, E.Macii, P.Rouzet, R.Zafalon, L.Benini "Low-Power System-on-Chip Architecture for Wireless LANs," IEE Proc.-Comput. Digit. Tech., Vol.151, no1, January 2004. [9] K. Lee, S.J. Lee, S.E. Kim, H.M. Choi, D. Kim, S. Kim, M.W. Lee, H.J. Yoo, "A 51mW 1.6GHz On-Chip Network for Low Power Heterogeneous SoC Platform", IEEE Int.Solid-State Circuits Con- ference, pp.1-3, 2004. [10] S.J. Lee et al., "An 800MHz Star-Connected On-Chip Network for Application to Systems on a Chip ", IEEE Int.Solid-State Circuits Conference, pp.468-469, February 2003. [11] W. Bainbridge, S. Furber, “Delay insensitive system-on-chip interconnect using 1-of-4 data encod- ing,” IEEE International Symposium on Asynchronous Circuits and Systems, pp. 118-126, 2001. [12] D. Bertozzi, L. Benini and G. De Micheli, “Low-Power Error-Resilient Encoding for On-chip Data Busses,” DATE, International Conference on Design and Test Europe Paris, 2000, pp. 102-109. [13] Dally, W.; Towles, B.; “Route Packets, Not Wires: On-Chip Interconnection Networks” 38th Design Automation Conference, 2001. Proceedings [14] B. Cordan, “An efficient bus architecture for system-on-chip design,” IEEE Custom Integrated Cir- cuits Conference, pp. 623–626, 1999. [15] Dally, W.J; Aoki, H. “Deadlock -free adaptive routing in multicomputer networks using virtual channels” IEEE Trans. on Parallel and Distributed Systems, April 1993 [16] W. Dally and J. Poulton, Digital Systems Engineering, Cambridge University Press, 1998. [17] J. Duato, S. Yalamanchili, L. Ni, Interconnection Networks: an Engineering Approach. IEEE Com- puter Society Press, 1997. [18] R. Hegde, N. Shanbhag, “Toward Achieving Energy Efficiency in Presence of Deep Submicron Noise,” IEEE Transactions on VLSI Systems, pp. 379–391, vol. 8, no. 4, August 2000. [19] R. Hegde, N. Shanbhag, “Toward achieving energy efficiency in presence of deep submicron noise,” IEEE Transactions on VLSI Systems, pp. 379–391, vol. 8, no. 4, August 2000. [20] R. Ho, K. Mai, M. Horowitz, “The Future of wires,” Proceedings of the IEEE, January 2001. [21] Karim, F.; Nguyen, A.; Dey, S. “On-chip Communication Architecture for OC-768 Network Proces- sors” 38th Design Automation Conference, 2001. Proceedings [22] E. Nilsson “Design and Implementation of a Hot-Potato Switch in a Network on Chip” Master of Science , LECS, Royal Institute of Technology [23] Li Shang and Li-Shiuan Peh and Niraj K. Jha, “Dynamic Voltage Scaling with Links for Power Optimization of Interconnection Networks,” HPCA - Proceedings of the International Symposium on High Performance Computer Architecture, Anaheim, February 2003, pp. 91-102. [24] Singh, J. P.; Weber, W.; Gupta, A,; “SPLASH: Stanford Parallel Applications for Shared-Memory” Computer Architecture News, vol. 20, no. 1 [25] J. Walrand, P. Varaiya, High-Performance Communication Networks. Morgan Kaufman, 2000. TLFeBOOK

232

[26] F. Worm, P. Ienne, P. Thiran and G. De Micheli, “ An Adaptive Low-power Transmission Scheme for On-chip Networks,” ISSS, Proceedings of the International Symposium on System Synthesis, Kyoto, October 2002, pp. 92-100. [27] R. Yoshimura, T. Koat, S. Hatanaka, T. Matsuoka, K. Taniguchi, “DS-CDMA wired bus with simple interconnection topology for parallel processing system LSIs,” IEEE Solid-State Circuits Conference, pp. 371-371, Jan. 2000. [28] Ye, T. T.; Benini, L.; De Micheli, G.; “Packetized On-Chip Interconnect Communication Analysis for MPSoC” Design Automation and Test in Europe, DATE 2003 Proceedings [29] H. Zhang, V. George, J. Rabaey, “Low-swing on-chip signaling techniques: effectiveness and robust- ness,” IEEE Transactions on VLSI Systems, vol. 8, no. 3, pp. 264-272, June 2000. [30] H. Zhang, M. Wan, V. George, J. Rabaey, “Interconnect architecture exploration for low-energy configurable single-chip DSPs,” IEEE Computer Society Workshop on VLSI, pp. 2-8, 1999. [31] C.H. Zeferino, A.A. Susin, "SoCIN: A Parametric and Scalable Network-on-Chip," Symposium on Integrated Circuits and Systems Design SBCCI’03, pp. 169-174, September 2003. [32] F. Poletti, D. Bertozzi,L. Benini,A. Bogliolo, "Performance Analysis of Arbitration Policies for SoC Communication Architectures," Journal on Design Automation for Embedded Systems,Kluwer, pp. 189-210, 2003. [33] IBM CoreConnect bus architecture, ”http://www-3.ibm.com/chips/products/coreconnect” [34] AMBA Multi-Layer AHB and AHB-Lite, ”http://www.arm.com/products/solutions/AMBAAHBandLite.html” [35] E.B.Vander Tol, E.G.T.Jaspers,"Mapping of MPEG-4 Decoding on a Flexible Architecture Platform", SPIE 2002, pp. 1-13, Jan, 2002. [36] E.G.T.Jaspers, et al.,"Chip-set for Video Display of Multimedia Information", IEEE Trans. on Con- sumer Electronics, Vol.4 5, No. 3, pp. 707-716, Aug, 1999. [37] R.H. Havemann, J.A. Hutchby, ”High-Performance Interconnects: An Integration Overview”, Pro- ceedings of the IEEE, Vol.89, no5, pp.586-601,May 2001 [38] K. Lahiri, A. Raghunathan, S. Dey ,"Communication Architecture Based Power Management for Battery Efficient System Design", Proc. ACM/IEEE DAC, pp.691-696, 2002. [39] A. Laffely, J. Liang, R. Tessier, W. Burleson ,"Adaptive System on Chip (aSoC): a Backbone for Power-Aware Signal Processing Cores", Int. Conf. on Image Processing, pp.105-108 (III), 2003. [40] V.Raghunathan, M.B. Srivastava, R.K. Gupta ,"A Survey of Techniques for Energy Efficient On-Chip Communication", DAC 2003, pp.900-905, June 2003. TLFeBOOK

233

Chapter 13 SYSTEM LEVEL POWER MODELING AND SIMULATION OF HIGH-END INDUSTRIAL NETWORK-ON-CHIP

Andrea Bona , Vittorio Zaccaria and Roberto Zafalon STMicroelectronics

Abstract Today’s System on Chip (SoC) technology can achieve unprecedented computing speed that is shifting the IC design bottleneck from computation capacity to communication bandwidth and flexibility. This chapter presents an innovative methodology for automatically generating the energy models of a versatile and parametric on-chip communication IP (STBus). Eventually, those models are linked to a standard SystemC simulator, running at BCA and TLM abstraction level. To make the system power simulation fast and effective, we enhanced the STBus class library with a new set of power profiling features (“Power API”), allowing performing power analysis either statically (i.e.: total avg. power) or at simulation runtime (i.e.: dynamic profiling). In addition to random patterns, our methodology has been extensively benchmarked with the high-level SystemC simulation of a real world multi-processor platform (MP- ARM). It consists of four ARM7TDMI processors accessing a number of peripheral targets (including several banks of SRAMs, Interrupt’s slaves and ROMs) through the STBus communication infrastructure. The power analysis of the benchmark platform proves to be effective and highly correlated, with an average error of 2% and a RMS of 0.015 mW vs. the reference (i.e. gate level) power figures. The chapter ends presenting a new and effective methodology to minimize the Design of Experiments (DoE) needed to characterize the above power models. The experimental figures show that our DoE optimization techniques are able to trade off power modeling approximation with characterization cost, leading to a 60% average reduction of the sampling space, with 20% of maximum error.

Keywords: Network-on-Chip power analysis, communication based low power design, system-level energy optimization. TLFeBOOK

234 13.1 INTRODUCTION

Embedded computing systems are on the way to provide a number of new services that will arguably become common practice in the next few years. The most important of these are (i) multimedia (audio/video streaming) capabilities in personal communicators, (ii) huge computing power (especially from clusters of processors) and storage size, (iii) high rate accessibility from mobile terminals. Today’s System on Chip (SoC) technology can achieve unprecedented computing speed that is shifting the IC design bottleneck from computation capacity to communication bandwidth and flexibility. SoC’s designers need to leverage on pre-validated components and IPs such as processor cores, controllers and memory arrays. Design methodology will further support IP re-use in a plug-and-play fashion, including buses and hierarchical interconnection infrastructures. SoCs will have to provide a functionally correct, reliable operation under data uncertainty and noisy signaling. The on-chip physical interconnection will be a limiting factor for both performance and energy consumption, also because the demand for component interfaces will steadily scale-up in size and complexity. In this chapter, we will present a thorough methodology for automatically building the energy model of a Network-on-Chip (NoC) IP at the BCA/Transaction level, in order to allow power profiling of an entire platform since the very early stages of the system design, often when only a software model of the system does exist. The chapter is organized as follows: Section 13.2 introduces a short background on Network-on-Chip. Section 13.3 illustrates the STBus versatile interconnect IP as an industrial example of NoC infrastructure. Section 13.4 introduces the overall NoC power characterization and estimation framework while Section 13.5 goes into details about our NoC’s energy model. Section 13.6 presents the Design of Experiment policy and Section 13.7 reports a significant set of figures about the model validation and the experimental results, including a real-world platform simulation case. Eventually, Section 13.8 presents a new and effective methodology to minimize the Design of Experiments (DoE).

13.2 BACKGROUND

Although the main concepts and the terminology of Network-on-Chip design has been introduced quite recently [1][2][3], both the industrial and research communities have been starting to realize the strategic importance of shifting the design paradigm of high-end digital IC from a deterministic, TLFeBOOK

235 wire-based interconnection of individual blocks and IPs, to a thorough communication-based design methodology [4][7][9], aiming to face with data packets and non-deterministic communication protocols in next generation’s SoCs. With the advent of 90nm and 65nm CMOS technology, the challenges to fix the Network-on-Chip (NoC) issue “by design”, will need: – To provide a functionally-correct, reliable operation of the interconnected components by exploiting appropriate network infrastructure and protocols, i.e. interconnections to be intended as “on chip micro- network” [5][6][7], which is an adaptation of the OSI protocol stack [18]. – To achieve a fluid “flexibility vs. energy-efficiency” system exploration, allowing an effective network centric power management [8][11][12]. Unlike computation energy in fact, the energy for global communication does not scale down with technology shrinking [3][4]. This makes energy more and more dominant in communications. Reaching those goals will be crucial to the whole semiconductor industry in the next future, in order to face with the escalating range of signal integrity and physical wiring issues, who are making the target IC reliability harder and exponentially expensive to achieve. As of today, there is a limited availability of tools able to consistently support this emerging design methodology. Indeed, some high level models for functional/performance system simulations (i.e. Bus Cycle Accurate and Transaction) are smoothly coming up [13] across the design community. However, power predictability of NoCs still remains an open issue. Although NoC’s power estimation has been partially addressed in [10], its low level modeling (i.e. gate and device level) and the extremely slow simulation (i.e. 1000 cycle/s) makes it definitely unsuitable to face with any system level SW/HW exploration task, which might easily need for simulation speeds larger than 100 Kcycle/s.

13.3 ON-CHIP NETWORK: STBUS INTERCONNECT

STBus is versatile, high performances interconnect IP, allowing to specify the communication infrastructure in terms of protocol, interface and parametric architectures [14][15]. It comes with an automated environment (STBus generation kit) suitable to support the whole design flow, starting from the system-level parametric network specification, all the way down to the mapped design and global interconnect floor-plan [16]. The protocol modes supported by STBus are compliant with VSIA standard [19]. In fact, they can scale up from Peripheral, to Basic and to Advanced mode, conventionally named Type-1, Type-2 and Type-3, respectively. In this work, we focus on the last 2 protocols (i.e. Type-2 and Type-3) since they TLFeBOOK

236 better fit with the high demanding communication resources required by modern SoCs. More specifically, Type-2 supports pipelined split transactions, where each transaction is composed by a pair of send and receive packets (packet: a sequence of atomic messages called cells). On top of the above features, Type-3 allows to manage out-of-order packet delivery. The datapath’s width can range between 32, 64 and 128 bits. The STBus architecture builds upon the node module, configurable switch fabrics who can be instantiated multiple times to create a hierarchical interconnect structure. The topology of the switch fabric can be selected by choosing the number of resources dedicated to the request and the response packets; for example a shared bus interconnect has only 1 request and 1 response resources at a time, while a full cross-bar has as many request and response resources as the number of initiators and targets connected to the node. Eventually, type converter and size converter modules can be adopted to interface heterogeneous network domains working under different protocols (i.e. Type-1, 2 and 3) and/or different data-path widths.

13.4 ENABLING ENERGY EXPLORATION FOR NOC

When dealing with multi-processors embedded systems, characterized by tens of masters and slaves connected through a complex communication infrastructure, energy estimation and optimization become of utmost importance. As a matter of fact, although more effective than traditional buses, NoCs are expected to make a relevant contribution to the area budget, due to the growing complexity of packet routing and transaction management policies affecting the interconnection’s control-path, and to the switch fabric in charge of supporting the high speed data packet delivery. Such a complexity has a cost in terms of energy consumption that should be traded-off with the performance benefits. Network structures achieving lower packet’s congestion (i.e. higher performance), are usually characterized by larger data-path complexity in terms of number of simultaneous routing resources available for packet broadcasting. For example, a shared bus communication node can be slower (i.e. higher congestion), yet less power consuming than a full crossbar switch-box, or, the slot-reservation arbitration policy may overcome the limitation of Time Division Multiple Access (TDMA) policy in case of asymmetric workloads in a multi-processors platform. These questions need to account for energy metric during the design exploration in order to find out the optimal platform configuration to meet the performance constraints at minimum energy. Exploration and optimization for SoC design are rapidly evolving towards the analysis of abstract description models that mimic the main operations of the system under analysis, including speed and power TLFeBOOK

237 behavior. According to the SystemC modeling scenario depicted in [13], the abstraction levels that can be used to model the function/power/performance of a communication-based system are the Functional un-timed level, the Transaction level (TLM), the Bus Cycle Accurate level (BCA) and the Pin Accurate – Cycle Accurate level (PA-CA). In short, while the Functional level does not give any insight on the timing figures of the system, the Transaction level only gives coarse time hints (e.g. total read/write time slot), with no structural information on actual wires or pins. The BCA level achieves cycle-accurate timing estimates, yet functionally accurate at the boundaries, while the PA-CA goes down to a clock cycle timing with structural pin-accurate description, at the expense of a much slower simulation. In this chapter we introduce a consistent methodology for automatic energy model’s building to fit most of the above abstraction levels (i.e. Transaction, BCA, PA-CA), suitable to support the NoC’s power estimation since the very early stages of the design exploration, when only a C/C++ model of the system is usually available. Eventually, the system simulation (developed in SystemC, in our case) will rely on high-level profiling statistics to figure out the energy cost, by means of an appropriate library of energy views and a dedicated procedural interface (API). In the following, we will explain how the STBus energy models are based on a set of parametric, analytic equations that are individually accessed by the simulator to compute the eventual energy figures (either statically or at simulation runtime).

13.4.1 Energy Characterization Flow

The energy macro-model of the whole STBus interconnection is partitioned into sub-components, corresponding to each micro-architectural block of the interconnection fabrics that are node, type-converter and size- converter. For sake of simplicity, in this chapter we will show the results of the node component. However, the same automatic flow is currently applied to all of the components of STBus architecture. The proposed model relies on the bus utilization rate, i.e. the number of cells traveling across the bus, as well as on the interconnection topology (i.e. the number of masters/targets), which need to be pre-characterized, once and for all, through an accurate gate-level simulation for each target technology. The power characterization flow consists of 4 major steps depicted in Figure 13-1. TLFeBOOK

238

Abstract CoreConsultant/Design Compiler Network Synthesis Topology STBus Generation Kit

Gate-level Netlist

Testbench Generation VCS/PowerCompiler Gate-level power est.

Power & High-level stats

Power Model Characterization Models DB

Figure 13-1. STBus Power Characterization Flow

As already mentioned in section 3, the STBus Generation Kit allows the designer to automatically synthesize a gate-level netlist starting from a system-level parametric network specification. This is done by inferring the corresponding RTL code and, then, synthesizing all the way down to the mapped design [16]. Thus, an extensive set of gate-level power simulations is launched within a Test-bench Generation suite, specifically tuned to fulfill the many requirements imposed by the STBus protocols and, at the same time, to sensitize the node under a wide range of traffic workloads. Actually, the test-benches can be configured in terms of average latency per master request and slave response and type of operations to be performed on the bus. The operations can be divided in two categories (Load and Store) as they can play with different operand sizes (from 1 to 32 bytes). The last step of the flow in Figure 13-1 is the Model Characterization, where each of the coefficients is computed to fit the high-level model (ref. to next section 5 for details). The final models (one for each component and target technology) are stored into a centralized Power Model Database. Sure enough, the choice of experiments, the length of each simulation and the test-benches adopted during the characterization campaign are crucial knobs to be optimized before running the characterization flow, by means of a suitable Design of Experiments (DoE: see section 6). TLFeBOOK

239 13.4.2 Hooking up the energy models into the system simulation

The STBus Generation Kit supports the generation, among the others, of the SystemC model of each component, ready to be plugged into the target SystemC simulation platform. The current release of the STBus Generation Kit is compliant with BCA SystemC v2.0 descriptions [13]. In evolution, the support for TLM is planned soon, according to the STBus roadmap. The overall SystemC power estimation flow is outlined in Figure 13-2. To make the system simulation environment fast an effective, an ad-hoc API has been developed (SystemC Power API), together with a consistent library of functions allowing to enhance the basic SystemC capabilities with a power profiling feature, providing power analysis either statically (i.e.: total avg. power) or at simulation runtime (i.e.: dynamic profiling). The latter is done by computing a moving average power on a given time window (e.g. ten clock cycles).

Abstract Network STBus Generation Kit Topology

SystemC Node Power Profile Energy Enhanced SystemC Node

SystemC Power API

Power Models DB

Figure 13-2. Power Enhanced SystemC Simulation

By deriving the SystemC node classes and hooking them up to the specific SystemC Power API, we achieve the energy flow enhancement. As a matter of fact, energy-enhanced SystemC nodes provide an extremely fast procedural interface to retrieve each set of model’s coefficients out from the power model database as well as an effective power analysis during the SystemC simulation run. TLFeBOOK

240 13.5 STBUS ENERGY MODEL

In this section, we introduce the power model for a generic configuration n of a node. The configuration of an STBus node identifies a specific instance out from the design space S:

{}=<= > (1) L ,,,,,,,| TypedpsCprprrqrtinnS

where i is the number of initiators, t is the number of targets, rqr is the number of request resources, rpr is the number of response resources, p is the type of arbitration policy (STBus has 7 arbitration policies), CL is the output pin capacitance (range: CLmin= 4 Standard Loads ; CLmax=1 pF), dps is the data-path size (range: 32, 64 and 128 bit) and Type is the protocol mode (Type-2 and 3, in this case).

Based on an extensive experimental background, we recognize a fairly linear relationship between node energy and the rate of sent and received packet cells across all of the interconnection node’s ports. Such a behavior matches with a set of random configuration samples across the entire design space and it has been confirmed during the model validation phase (see section 7). The energy model for a generic configuration n of the STBus node is the following:

() () ⋅⋅= (2) TCnPnE clk

where P(n) is the average power consumption of the node during a simulation of C clock cycles, with a clock period of Tclk. The power consumption P(n) is a linear combination of three contributions, according to the following equation:

r r (3) () () ()nPnBnP s ()nP ⋅+⋅+= r sent C rec C

where B(n) is the average base cost depending on the specific configuration n of the node, Psent(n) is the additive power cost due to cell sent from the masters to the slaves and rs is the total number of cells sent, Preq(n) is the power cost due to each packet cells received by the masters, rr is the total number of cells received by the masters and C is the number of clock cycles. In essence, the power model characterization consists in determining the value of the coefficients B(n), Psent(n) and Preq(n) for each TLFeBOOK

241 specific configuration n of the node. As formerly mentioned, this task is performed by means of a polynomial regression over the set of experiments given by DoE (see section 6). Although linear regression has been successfully used to build the model’s coefficients so far, any higher order models can be easily adopted should a better accuracy become an issue. The experimental setup is generated with the goal of properly stressing rs and rr over the whole range of variation. The total avg. switching activity coming out from the test-benches is kept at 0.5. As far as the interconnection capacitive load “CL“ is concerned, our model supports a linear interpolation between CLmin and CLmax in order to provide a quite accurate estimation of the switching power under the actual instance’s output load.

From a global viewpoint, the characterization campaign of STBus across the whole design space may easily become a huge computing task. The computational effort to power characterize STBus is similar or even larger than the characterization of an industrial size ASIC library. The whole comprehensive STBus design-space, in fact, would lead to more than 3.4*105 individual configurations to be characterized (i.e. RTL synthesis + gate-level simulation + power measure). Such a number comes out from the product of all the possible combinations of the STBus design subspaces (i.e. 8 initiators, 8 targets, 8 request and 8 response resources, 7 arbitration policies, 2 load capacitances, 3 data path sizes and 2 types of protocols). Running an exhaustive characterization is far to be feasible in a reasonable time, even by leveraging on distributed computers. We decided to adopt a response surface method approach to solve this problem. In this approach, only a selected set of configurations are synthesized and characterized, making the remaining set of coefficients derivable by accessing an appropriate set of models (either analytic or look-up table) obtained through response surface methods. Although this approach may lead to some inaccuracy with the energy estimation process, the global accuracy can be taken well under control while allowing a remarkable saving in characterization effort.

13.6 OPTIMAL DESIGN OF EXPERIMENTS

The fundamental theory on statistical design has been largely consolidated during the last twenty years or so, for a wide variety of applications [20]. In this context, the Design of Experiments is based on the convergence analysis of some peculiar quality figures such as average power and average prediction error. Converging to the average power figure let us to identify the minimum length necessary for each simulation, by considering when the power consumption gets close to a steady value, given an arbitrary acceptance threshold (see the power–time curve of Figure TLFeBOOK

242 13-3). On the other hand, the minimum number of experiments (i.e.: synthesis + simulations) needed to safely probe the design space and characterize the specific model, strongly depends on the target accuracy (i.e. max prediction error) as well as on the acceptable characterization effort. The regression analysis to fit the model’s coefficients is performed on the raw characterization data. Therefore, the QoR can be analytically measured by the prediction correlation coefficients (R and R2) and the Root Mean Square error (RMS). Eventually the minimum number of experiments is identified by considering both the RMS steady state and the absolute error over a set of significant benchmarks.

13.6.1 Convergence Analysis: Average Power

The minimum simulations length necessary for the model characterization has to be identified through a convergence analysis. While the minimum simulations length of the testbench would not affect the actual power consumption, it is crucial to make sure that the circuit under analysis can always reach a steady state functional activity before measuring the avg. power consumption. To identify the correct simulation length, we minimized a cost-function that is a product of the simulation time and a measure of the derivative of the power consumption. The cost function is the following: ∆ ()tP () ttC 2 ⋅= (4) ()tP where t is the simulation time, P(t) is the power consumption measured at time t and ǻP(t) is the difference between P(t) and P(t-1). Figure 13-3 shows the average behavior of the cost function for all the possible configurations of shared bus, Type-3, 32 bit width nodes. As can be seen, after 5000 ns the difference between power values does not pay for the increased simulation time. Thus, 5000 ns have been selected as the simulation length for all the characterization experiments. TLFeBOOK

243

Accuracy - Time Cost Function

1300000

1100000

900000

700000

Cost function Cost 500000

300000 0 2000 4000 6000 8000 10000 Simulation Time [ns]

Figure 13-3. Avg. Power vs. Simulation-Time convergence analysis for a given STBus node’s configuration

The derivative (i.e. differential ratio) has been sampled every 1000 ns and, then, normalized to the related power values in order to give a percentage variation.

13.6.2 Convergence Analysis: Model Accuracy

According to previous section 5, the model’s coefficients are resulting from the polynomial regression over a given set of experiments. Those experiments are generated according to the DoE’s policy, by stochastically changing the number of data packets sent/transmitted across the bus and the operation modes. The goal is to find out the minimum number of experiments necessary to meet the required accuracy. Given a set of representative STBus nodes, we perform their characterization with an increasing number i of experiments. For each set of i calibration experiments, the Root Mean Square error (RMS) is evaluated. Figure 13-4 shows that, for i>160, the RMS for all the configurations of the design space gets close to the respective asymptotic values, with a maximum value bounded to 0.01 mW. The minimum number of experiments to proceed with the characterization of the STBus nodes has been defined accordingly. TLFeBOOK

244

RMS Error

Steady State 0.012

0.01 0.008

0.006 0.004 RMS [mW] 0.002

0 0 50 100 150 200 n. of experiments

8x8 Shared Bus - Type 3 - 32 bit 8x1 Shared Bus - Type 3 - 32 bit 1x8 Shared Bus - Type 3 - 32 bit 1x1 Shared Bus - Type 3 - 32 bit

Figure 13-4. Power model’s RMS Error vs. Number of calibration experiments, under four different initiators/targets configurations

13.7 STBUS POWER MODEL VALIDATION AND EXPERIMENTAL RESULTS

We present hereafter the results obtained from the validation phase of the proposed power macro modeling. In addition to the validation carried out by applying an extensive set of synthetic test-benches, we extended the test by running a realistic application, featuring mission-mode SystemC simulations of a multi-processors platform. All the characterization and experimental results presented in this chapter are targeted to STMicroelectronics HCMOS9 ASIC library, featuring 8 metal layers and 0.13 µm MOS channel length, operating at 1.2V nominal supply voltage.

13.7.1 Random pattern validation

We carried out a synthetic validation by applying a uniform set of stochastically generated Verilog test-benches, similar to those used during the calibration phase (section 6.2). In Figure 13-5 we illustrate the scatter plot between the model estimation and the reference power measurement (coming from detailed gate-level power analysis). The average error is 1% with a correlation R of 96%. TLFeBOOK

245 Measured Power

Estimated Power 0 Figure 13-5. Scatter plot of Measured vs. Estimated power consumption, for a set of synthetic benchmarks

13.7.2 Mission mode validation through SystemC co-simulation

To extensively validate our methodology into a real world simulation platform, we decided to assess the robustness of the power model by correlating the power estimation coming from a high-level SystemC simulation with respect to the gate-level power measure of the synthesized STBus node subject to the input stream generated at runtime by SystemC. The multi-processor platform is outlined in Figure 13-6. The architecture consists of four ARM7TDMI processors accessing a number of targets (including several banks of SRAMs, Interrupt’s slaves and ROMs) through the STBus communication infrastructure, configured as a 4 initiators, 3 targets, Type-3, shared bus, 32 bit, fixed priority request arbitration policy, dynamic priority response arbitration policy. TLFeBOOK

246

SystemC SystemC SystemC SystemC Wrapper Wrapper Wrapper Wrapper RTEMS OS RTEMS OS RTEMS OS RTEMS OS ARM 7 TDMI ARM 7 TDMI … ARM 7 TDMI ARM 7 TDMI ISS ISS ISS ISS

8x8 shared bus – 32 bit – Type 3 STBUS Interconnect

ROM … Shared RAM Private #1 RAM … Private #4 RAM

Figure 13-6. Multi-processor platform including four ARM7TDMI processors connected through STBus

Indeed, a remarkable amount of SW layers are intended to be executed on top of this HW platform, including a distributed real-time operating system who runs on each individual processor (RTEMS), and a class of multi-tasking DSP applications, featuring intensive integer matrix computations.

Bus Traffic Rate

Sent Cells Received Cells 1 0.9 0.8 0.7 0.6 0.5 Rate 0.4 0.3 0.2 0.1 0 0 102030405060708090 Time - 10 KCycles

Figure 13-7. Data packet rate monitored across the STBus on the target multiprocessor platform

As far as the simulation framework is concerned, each processor’s ISS has been encapsulated with a SystemC wrapper, in charge of managing the interface protocol with the STBus communication node. The whole SW benchmark has total execution duration of 1 Million clock cycles, including the RT-OS booth strap (the initial 200 Kcycles) and the execution of the TLFeBOOK

247 DSP application SW. In Figure 13-7, the data cell’s statistics (i.e. rate of cells sent/received per time unit) across the STBus is reported. The overall SystemC/Verilog co-simulation flow is depicted in Figure 13-8.

Initiator 1 Co-simulation Interchange … Format File. ST proprietary

Initiator n # Cosimulation interchange file - Generated for shared_2x1ICN.Node_1# Cosimulation interchange file - Generated for SystemC # Thisshared_2x1ICN.Node_1 file has been automatically generated by a SystemC# This filesimualtion has been of automatically STbus generated by # STBusa SystemC is a simualtion propietary of IP STbusof STMicroelectronics Target 1 STBus # STBus is a propietary IP of STMicroelectronics .dt init_0_data .vector 32 0x0 0 … init_0_data.dt init_0_data .vector 32 0x0 0 .dtinit_0_data init_0_add .vector 32 0x1 0 Node init_0_add.dt init_0_add .vector 32 0x1 0 .dtinit_0_add init_0_req .vector 1 0x2 0 Target m init_0_req.dt init_0_req .vector 1 0x2 0 .dtinit_0_req init_0_eop .vector 1 0x3 0 init_0_eop.dt init_0_eop .vector 1 0x3 0 .dtinit_0_eop init_0_be .vector 8 0x4 0 init_0_be.dt init_0_be .vector 8 0x4 0 Cosimulation .dtinit_0_be init_0_opc .vector 8 0x5 0 init_0_opc.dt init_0_opc .vector 8 0x5 0 Enhanced Node init_0_opc SystemC AST - Power stats Macro Model

∆ Verilog Co-simulation Synthesized Testbench STBus Node Generator Power saif

Compiler Verilog VCS

Figure 13-8. SystemC/Verilog co-simulation flow

During the SystemC simulation, initiators and targets generate a trace of “mission mode” transactions, monitored through a specific feature of the STBus node. In fact, the node has been enhanced in order to gather the full signals stream out from the SystemC simulation session. The eventual trace file carries comprehensive print-on-change information, sampled on a clock cycle basis. The co-simulation file is then applied to drive the gate-level Verilog simulation (VCS [16]) and, then, feeding the detailed power analysis of the mapped netlist (PowerCompiler [16]). In Figure 13-9 we compare the power predicted by SystemC when running the system simulation (“Power estimated”) vs. the reference power measured by Power Compiler at gate-level (“Power measured”). Please notice that absolute power numbers are hidden for technology confidentiality. The system level estimation proves to be highly correlated to the reference power figure, with an average error of 2% and a RMS of 0.015 mW. TLFeBOOK

248

Cosimulation - Power Report Power Measured [mW] Power Estimated [mW] Power [mW] Power

01234578910 100K Clock Cycles

Figure 13-9. Estimated vs. Measured average power in STBus

13.8 LOW EFFORT, HIGH ACCURACY POWER MACRO MODELING

Due to the huge domain size, the optimization of the Design of Experiments (DoE) to be adopted when characterizing the power macro- model is a key methodology issue to eventually ensure the task’s feasibility. We decided to adopt a response surface method approach to allow this problem to be manageable and actually solvable through an automatic tool flow. Indeed, although this methodology has been developed to cope with STBus, the described solution can be easily applied to a number of generic parametric IP as well as to third party’s on chip Bus architectures.

13.8.1 Surface reconstruction methods

Let us consider Equation 1 and Equation 3 (Section 5). As can be seen, the coefficients are functions over the dominium S, which can be considered the cartesian product of two subspaces:

S = G × X (5) G = { g | g = } (6)

X = { x | x = < rqr, rpr, p, CL, dps, Type> } (7) TLFeBOOK

249

The coefficients B, Psent and Prec can be seen as a function of two variables: f(n) = f(g,x) (8) The variable g belongs to a discretized space called ‘grid’, over the set of possible pairs , while variable x represents the remaining parameters, according to Equation (7). By considering x fixed, we have that each coeffiecient is described by a surface over the set G. Experimentally, the surface shows to have a very smooth behavior; as an example Figure 13-10 shows a representative behavior of the coefficient B.

Energy / Sent Cell

Power

Initiators Targets

Figure 13-10. An example surface for the base cost

The problem of fast characterization of the entire design space can be thought as reconstructing a surface by directly characterizing only a small subset of points Gs ⊂ G. The given methodology must assure that for an increasing ratio z=|Gs|/|G|, (i.e., the characterization effort of the power model library), the approximation error decreases. Ideally, for z=1, the error should be 0. Our approach, can be decomposed in three fundamental steps:

1. Choice of a representative set Xs ⊂ X to be used as a training set for the evaluation of the different reconstruction methods. TLFeBOOK

250

2. Automatic optimization of the surface reconstruction methods over the

space G × Xs. The output of this stage will be a combination of algorithm and sub-grid Gs suitable to minimize fitting error and characterization effort.

3. Perform the actual characterization of Gs× X.

The considered surface reconstruction methods are twofold.

• Regression–based algorithms: Analytic approximation of a surface, usually a polynomial expression, in which each of the coefficients is fitted to minimize an overall square error. The procedure to fit these parameters is often called Least Square Approximation. Three regression methods have been evaluated: linear, quadratic and cubic.

• Interpolation–based algorithms: Surface approximation by using multiple piecewise polynomials or more generic functions. Being based on look-up tables, the response surface always fits with all of the training set points. The following interpolation algorithms have been analyzed: cubic, bicubic, bilinear and linear grid data. Algorithms belonging to the “spline” interpolation class have been experimentally rejected due to a worse accuracy.

As far as the sub-grid constraints are concerned, the regression based methods do not enforce any limitation on the sub-grid topology Gs , while interpolation methods have several topology constraints, for which the reader is referred to the specific literature [20].

13.8.1.1 Choice of a reference training set Xs ⊂ X The training set Xs is composed by a random selection within X such that each value of the parameters rqr, rpr, p, CL, dps, Type (see eq. 5) is evaluated at least once. In this way it is possible to uniformly visit the characterization space. More specifically in our design case, considering 8 request and 8 response resources, 7 arbitration policies, 2 load capacitances, 3 data path sizes and 2 protocols, our method allows to reduce the training set from 3.4*105 down to 30 configurations, without any significative degradation in accuracy.

13.8.1.2 Surface Reconstruction Algorithm Benchmarking Each of the afore mentioned algorithms has been benchmarked by varying the ratio z=|Gs|/|G| and by sampling the design space G in order to TLFeBOOK

251 find an optimal solution Gs. The search for the optimal solution is driven by the following cost function:

C(Gs) = σ (Gs)η(Gs) (9)

Where σ(Gs) is a guess of the characterization effort and η(Gs) is the maximum error between the predicted coefficients and the actual coefficients. Then, for each value of z=|Gs|/|G|, we find out the optimal reconstruction algorithm and the corresponding grid Gs to be adopted for the final characterization campaign.

13.8.2 Experimental results

In this paragraph we show the experimental evaluations supporting the proposed methodology. The experimental flow aims at selecting the optimal grid Gs together with the best reconstruction algorithm for a given grid size |Gs|. The analysis is performed on a subset of the entire design space configurations (Xs ⊂ X), as explained in section 8.1.1. For each algorithm and grid size |Gs|, an iterative random search is performed to find the best grid, by optimizing the cost function specified in Equation 9. While driving the grid-optimization with the full cost function C(Gs) (see eq. 9), we will use the maximum error η(Gs) to discriminate between candidate reconstruction methods, given a fixed grid-size.

13.8.2.1 Combined Analysis To compare the various methods we are going to focus our attention on the best surface reconstruction method for each category (regression and interpolation). As far as the regression is concerned, the quadratic algorithm shows to be the most promising for small grid-sizes, even better than the interpolation-based methods. On the other side, regarding interpolation, the cubic grid data technique is the most accurate in terms of maximum error. Accordingly, Figure 13-11 shows the best pair <σ, η> (see eq. 9) for each grid size |Gs|, with respect to Quadratic Regression (shortened as ‘Regression’) and Cubic Grid Data Interpolation (shortened as ‘Interpolation’). Moreover, the figure shows the Pareto curve highlighting points that are not dominated in both directions by any other point. This curve is useful to screen out the solutions that optimize both the merit figures and, at the same time, meet specific constraints. For example, a quadratic regression method should be used if the characterization effort is a dominating constraint. In fact, in this case we have only 10% of the original TLFeBOOK

252 characterization effort at the expense of a 33% maximum error. For medium- high characterization effort, the Cubic Grid Data interpolation has shown to reduce the maximum error more than the other methods. It leads to a maximum error of about 18% at the expense of 62% of the original characterization effort.

Accuracy/Effort Trade-off

100% 90% 80% 70% 60% Pareto Curve 50% 40% 30%

Characterization effort 20% 10% Interpolation Regression 0% 0% 10% 20% 30% 40% Maximum Error

Figure 13-11. Characterization effort vs. Maximum error Trade-off

13.9 CONCLUSIONS

An innovative methodology for automatically generating the energy models of a versatile and parametric on-chip communication infrastructure (STBus) has been presented in this chapter. The methodology aggressively targets correlated power estimation with efficient SystemC simulation, running at BCA and TLM abstraction level. Among other synthetic benchmarks, the NoC’s power models validation has been extensively addressing the high-level SystemC simulation of a real world multi- processor platform (MP-ARM), which includes four ARM7TDMI processors accessing a number of peripheral targets (including several banks of SRAMs, Interrupt’s slaves and ROMs) through the STBus communication infrastructure. All the characterization and experimental TLFeBOOK

253 results presented in this chapter are targeted to STMicroelectronics HCMOS9 ASIC library, featuring 8 metal layers and 0.13 µm MOS channel length, operating at 1.2V nominal supply voltage. The synthetic validation between the model estimation and the reference power figures (i.e. gate-level power measure of the synthesized NoC) shows an average error of 1% and correlation R of 96%. The power analysis of the MP-ARM benchmark proves to be highly effective and correlated, with an average error of 2% and a RMS of 0.015 mW vs. the reference power. Eventually, we presented a new and effective methodology to minimize the Design of Experiments (DoE) needed to characterize a set of innovative energy macro models of Network-on-Chip architectures. We have shown that, by properly combining regression (polynomial) and interpolation (table- based) techniques, this methodology aims to reduce the overwhelming complexity of sampling the huge, non-linear design space involved with the system operations of an high-end parametric on-chip-network module. The experimental figures show that our DoE optimization techniques are able to trade off power modeling approximation with model building cost, leading to an average 62% reduction of the sampling space with a maximum error of 18%.

Acknowledgements

The authors are grateful to dr. C.Pistritto and his CMG/OCCS team in Catania, for their valuable and synergic support to achieve the leap enhancement of making the STBus Design Flow truly "power aware".

References

[1]J.Duato, S.Yalamanchili, L. Ni, “Interconnection Networks: an Engineering Approach”, IEEE Computer Society Press, 1997. [2] K. Lahiri, S.Dey et al.,”Efficient Exploration of the SOC Communication Architecture Design Space”, Proc. of ICCAD-2000, Nov. 2000, S.Jose`, USA. [3] W. Dally, B. Toles, “Route Packets, not Wires: On-Chip Interconnection Network”, Proceedings of 38th DAC 2001, June 2001, Las Vegas, USA. [4] A. Sangiovanni Vincentelli, J. Rabaey, K. Keutzer et al., “Addressing the System-on-a- Chip Interconnect Woes Through Communication-Based Design”, Proceedings of 38th DAC 2001, June 2001, Las Vegas, USA. [5] F. Karim, A. Nguyen et al., “On Chip Communication Architecture for OC-768 Network Processors”, Proceedings of 38th DAC 2001, June 2001, Las Vegas, USA. [6] K. Lahiri, S.Dey et al.,”Evaluation of the Traffic Performance Characteristics of System- on-Chip Communication Architectures”, Proc. 14th Int’l Conference on VLSI Design 2001, Los Alamitos, USA. [7] L. Benini, G. De Micheli, “Network on Chip: A New SoC Paradigm”, IEEE Computer, January 2002. TLFeBOOK

254

[8] T. Ye, L. Benini, G. De Micheli, “Analysis of power consumption on switch fabrics in network routers”, Proceedings of 39th DAC 2002, June 2002, New Orleans, USA. [9] S. Kumar et al., “A network on chip architecture and design methodology”, International Symposium on VLSI 2002. [10] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik, “Orion: A Power-Performance Simulator for Interconnection Networks”, International Symposium on Microarchitecture, MICRO- 35, November 2002, Istanbul, Turkey. [11] T. Ye, G. De Micheli and L.Benini, “Packetized On-Chip Interconnect Communication Analysis for MPSoC”, Proceedings of DATE-03, March 2003, Munich, Germany, pp. 344-349. [12] J.Hu and R. Marculescu, “Exploiting the Routing Flexibility for Energy/Performance Aware Mapping of Regular NoC Architectures”, Proceedings of DATE-03, March 2003, Munich, Germany, pp. 688-693. [13] T. Grotker, S. Liao, G. Martin and S. Swan, “System Design with SystemC”, Kluwer Academic Publishers, 2002. [14] “STBus Communication System: Concepts and Definitions”, Reference Guide, STMicroelectronics, October 2002. [15] “STBus Functional Specs”, STMicroelectronics, public web support site, http://www.stmcu.com/inchtml-pages-STBus_intro.html, STMicroelectronics, April 2003. [16] Synopsys Inc., “Core Consultant Reference Manual”, “Power Compiler Reference Manual” and “VCS: Verilog Compiled Simulator Reference Manual”, v2003.06, June 2003. [17] C. Patel, S. Chai, S. Yalamanchili, and D. Schimmel, “Power-constrained design of multiprocessor interconnection networks," in Proc. Int. Conf. Computer Design, pp. 408- 416, Oct. 1997. [18] H.Zimmermann, “OSI Reference Model – The ISO model of architecture for Open System Interconnection”, IEEE Trans. on Communication, n 4, April 1980. [19] VSI Alliance Standard, “System-Level Interface Behavioral Documentation Standard Version 1”, Released March 2000. [20] Box, George E. P. and Draper Norman Richard. Empirical model-building and response surfaces, John Wiley & Sons New York, 1987 TLFeBOOK

255

Chapter 14

ENERGY-AWARE ADAPTATIONS FOR END-TO- END VIDEO STREAMING TO MOBILE HANDHELD DEVICES

Shivajit Mohapatra1, Nalini Venkatasubramanian1, Nikil Dutt1, Cristiano Pereira2, Rajesh Gupta2 1 2 University of California, Irvine; University of California, San Diego

Abstract Optimizing user experience for streaming video applications on handheld devices is a significant research challenge. In this chapter, we propose an integrated end- to-end power management approach that unifies low level architectural optimiza- tions(CPU, memory, registers), OS power-saving mechanisms(dynamic voltage scaling) and adaptive middleware techniques(admission control, transcoding, net- work traffic regulation). Specifically, we identify interaction parameters between the different levels and optimize them to reduce power consumption. With knowl- edge of device configurations, dynamic device parameters and changing system conditions, the middleware layer selects an appropriate video quality and fine tunes the architecture for optimized delivery of video. Performance results in- dicate that architectural optimizations that are cognizant of user level parame- ters(e.g. transcoded video quality) can provide energy gains as high as 57.5% for the CPU and memory, when compared to the baseline case that does not employ any energy optimization. Middleware adaptations to changing network noise levels can save as much as 70% of energy consumed by the wireless net- work interface. Our approach to multiple-level and end-to-end management of power/performance has been implemented in a framework, called FORGE. We show how FORGE can substantially enhance the user experience in a mobile multimedia application.

Keywords: low-power optimization, cross-layer adaptation, power-aware middleware, FORGE project

14.1 MOTIVATION Limiting the energy consumption is an important design goal for mobile devices. Designers have explored techniques for minimizing energy usage of TLFeBOOK

256 most components, from CPU, network, display to peripherals of a mobile sys- tem platform. On the other hand, rapid advances in processor and wireless networking technology are ushering in a new class of multimedia applications (e.g. video streaming/conferencing) for mobile handheld devices. Multimedia applications have distinctive Quality of Service(QoS) and processing require- ments which tend to make them extremely resource-hungry. Moreover, the device specific attributes(e.g form factor of handhelds) significantly influence the human perception of multimedia quality. As a result delivering high qual- ity realtime multimedia content to mobile handheld devices remains a difficult challenge. The difficulty here is due to the fact that energy efficient delivery of media content with good quality attributes requires tradeoffs across various layers of system implementation and functionality - from application to system software to networking. Since the optimal energy conditions can change dynamically, these optimizations should also allow for dynamic adaption of system func- tionality and its performance. In order to dynamically adapt to device mobility, systems need to have a high degree of “network awareness" (e.g. congestion rates, mobility patterns etc.) and need to be cognizant of a constantly changing global system state. Efforts are underway to exploit multimedia specific char- acteristics to enable a range of energy optimization techniques that adapt to, and optimize for, changes in application data (video stream), OS/Hardware (CPU, Memory, Reconfigurable logic), network (congestion, noise, node mobility), residual energy (battery) and even the user environment (ambient light, sound). These issues have been aggressively pursued by researchers and numerous interesting power optimization solutions have been proposed at various cross computational levels. For instance, a sampling of optimizations across design domains are: system cache and external memory access optimizations [1, 15], dynamic voltage scaling(DVS) [29, 4], of the CPU, dynamic power manage- ment of disks and network interfaces(NICs) [8, 3], efficient compilers and ap- plication/middleware based adaptations for power management [22]. Interest- ingly, power optimization techniques developed for individual components of a device have remained seemingly incognizant of the strategies employed for other components. While focussing their attention to a single component, re- searchers make a general assumption that no other power optimization schemes are operational for other components. However, the cumulative power gains for incorporating multiple techniques can be potentially significant. This requires careful evaluation of the trade-offs involved and the customizations required for unified operation [21]. The interaction between different layers is even more important in distributed applications where a combination of local and global information helps and improves the control decisions (power, performance and QoS trade-offs) made at runtime. TLFeBOOK

257 For the mobile multimedia applications, Fig. 14.1 presents the different com- putation levels in a typical handheld computer and shows the cross layer inter- actions for optimized power and performance deliverance.

Video Player Other Tasks Client1 Applications

Network Admission Transcoding Management Control Server Clienti Middleware

DVS Scheduler Clientn Operating System

Network CPU Display Cache Memory RegFiles Card H/W

Figure 14.1. Abstraction Layers in Distributed Multimedia Streaming

The FORGE project aims to study the tradeoffs between power, performance and Quality of Service requirements across the various computational layers [6]. The goal of FORGE is to develop and integrate hardware based architectural optimization techniques with high level operating system and middleware ap- proaches (Fig. 14.1), for improvements in power savings and the overall user experience, in the context of video streaming to a low-power handheld device. Multimedia applications heavily utilize the biggest power consumers in mod- ern computers: the CPU, the network and the display(Fig. 14.1). Therefore, in FORGE, we aggregate the hardware and software techniques that lead to power savings for these resources. To maximize power gains for a CPU architecture, we identify the predominant internal units of the architecture that contribute to power consumption. We use higher-level knowledge of the application such as quality and encoding parameters of the video stream to optimize internal cache configurations, CPU registers and the external memory accesses. Similarly, we utilize hardware/design level data (e.g. cache configuration) and user-level information (video quality perception) to optimize middleware and OS compo- nents for improved performance and power savings - through effective video transcoding, power-aware admission control and efficient network transmis- sion. We reduce the power consumption of the network card by switching it to the “sleep" mode during periods of inactivity. An efficient middleware is used to control network traffic for optimal power management of the network interface. To maximize the user experience, we have studied video quality and power trade-offs for handheld computers. These results drive our optimization efforts in FORGE at each computing level. 14.2 RELATED WORK Let us briefly review the optimization techniques used at various levels, such as architecture, OS, middleware and application in the context of multimedia TLFeBOOK

258 applications. We then examine the relationship of FORGE with prior and ongoing approaches in power aware middleware.

14.2.1 Architectural Adaptations To provide acceptable video performance at the hardware level, efforts have concentrated on analyzing the behavior of the decoder software and devising either architectural enhancements or software improvements for the decoding algorithm. Until recently it was believed that caches can bring no potential benefit in the context of MPEG (video) decoding. In fact, due to the poor lo- cality of the data stream, many MPEG implementations viewed video data as “un-cacheable" and completely disabled the internal caches during playback. However, Soderquist and Leeser showed that video data has sufficient locality that can be exploited to reduce cache-memory traffic by 50 percent or more through simple architectural changes [28]. A different way of improving cache performance by reordering frame traversal was proposed in [9]. Register file reconfiguration was applied in [1]. [16] proposes a technique for combining two hardware adaptations (architecture adaptation and dynamic voltage scaling) to reduce energy in multimedia workloads. The algorithm presented chooses be- tween one of the two adaptations or a combination, depending on their relative performance. This approach is similar to ours, in that architectural optimiza- tions are combined with dynamic voltage scaling (DVS). However, instead of a frame-based adaption founded on profiling and prediction, we target tuning an architecture through available architectural parameters to specific video quality requirements. We apply the optimizations globally (for the entire period that a media of constant quality levels are played), rather than at frame granularity.

14.2.2 Operating System & Middleware Adaptations Most power optimization efforts at the operating system level, have been fo- cussed on techniques like dynamic voltage scaling(DVS) [29, 25, 18], and dynamic power management (DPM) [17, 7]. DVS exploits the fact the CMOS logic used in most current processors has a voltage dependent maximum operat- ing frequency. So when used at lower frequencies, the processor can operate at a correspondingly lower voltage, thereby saving battery power. The challenge here is to accurately predict workload execution times for future jobs. While workloads can be predicted heuristically for best-effort applications [29], or based on worst case execution times of real-time applications [25], worst-case based approaches will almost certainly result in sub-optimal solutions, whereas heuristic predictions can cause timing violations for multimedia tasks. In the GRACE project, the authors suggest using an aggregate statistical demand of applications to adjust frequency/voltage for the processor [31]. DVS techniques for reducing energy in MPEG decoding has been studied in [20]. Additionally, TLFeBOOK

259 scheduling techniques like DSRT [5] have been studied to deliver real-time guarantees. At the OS/middleware levels, another primary focus has been to optimize network interface power consumption [8, 2, 3]. A thorough analysis of power consumption of wireless network interfaces has been presented in [8]. ECOSys- tem [32] is an OS level prototype that incorporates energy allocation and ac- counting mechanisms for various power consuming devices. ECOSystem uses the Currentcy [33] model which is an abstraction for formulating energy aware policies. Chandra et al. [2] have explored the wireless network energy consumption of streaming video formats like Windows Media, Real media and Apple Quick Time. Chandra and Vahdat have explored the effectiveness of energy aware traffic shaping closer to a mobile client [3]. In [26], Shenoy suggests performing power friendly proxy based video transformations to reduce video quality in real-time for energy savings. They also suggest an intelligent network streaming strategy for saving power on the network interface. FORGE uses a similar approach, but models a noisy channel. Caching streams of multiple qualities for efficient performance has been suggested in [10]. PowerScope [12] is an interesting tool that maps energy consumption to pro- gram structure. It first profiles the power consumption and system activity of a computer and then generates an energy profile from this data. Odyssey [22] presents an applications aware adaptation scheme for mobile applications. In this approach the system monitors resource levels, enforces resource alloca- tion and provides feedback to the applications. The application then decides on the best possible adaptation strategy. In our approach we try to integrate the the positive aspects of all the three levels: OS, middleware and applica- tion. Application based adaptation will therefore enhance the performance of our framework. However, applications have to be specifically designed for the framework. JouleTrack [27] is a web-based energy measurement tool for profiling software energy consumption of applications based on StrongArm processor.

14.2.3 Cross-Layer Adaptation Frameworks For efficient coordination and management of cross-layer adaptations, it is cru- cial to develop efficient resource allocation mechanisms. Q-RAM [19] models QoS management as a constraint optimization problem for maximizing sys- tem utility while guaranteeing minimum resources to each application. Pup- peteer [11] presents a middleware framework that uses transcoding to achieve energy gains. Using the well defined interface of applications, the framework presents a distilled version of the application to the user, in order to draw en- TLFeBOOK

260 ergy gains. EQoS [24] formulates energy-aware QoS adaptation as a constraint optimization problem and solves it using heuristic algorithms. The GRACE project [31, 30] uses cross-layer adaptations for maximizing system utility at lower energy costs. They suggest both coarse grained and fine grained tuning through global co-ordination and local adaptation of hardware, OS and application layers. The coarse/global adaptations are expensive and less frequent and occur only when global system changes are triggered (e.g task-set changes). The local adaptations are for the local variation in the ex- ecution of tasks. In GRACE, the global and local coordinators exist on the local device and perform the necessary adaptations. GRACE first tries to de- liver highest utility for each application and then optimizes the energy using dynamic voltage scaling. In contrast, FORGE uses a proxy based distributed middleware approach, that integrates cross-layer(architecture, OS, middleware, application) adaptations on the local device with distributed adaptations such as adaptive traffic shaping and transcoding at the proxy for energy gains. While adaptations in GRACE are limited to the local mobile device, our framework design uses a distributed middleware layer to exploit global system knowledge (e.g. device mobility patterns, network noise levels etc.) to facilitate effective power management (e.g. wireless NIC). Moreover, we adopt an end-to-end ap- proach to power optimization, where residual battery power of a mobile device also drives the adaptations. GRACE on the other hand provides a best-effort approach to energy optimization. Additionally, FORGE tries to tune architec- tural level parameters (e.g. cache configurations) to perform optimally for the currently executing application. The distributed middleware co-ordinates the adaptations at each level based on a rule-base and control information from the proxy.

14.3 SYSTEM MODEL Our system model for a wireless mobile multimedia distributed system is shown in Fig. 14.2. The system entities include a multimedia server, a proxy server that utilizes a directory service, a rule base for specific devices and a video transcoder, an ethernet switch, the wireless access point and users with low- power wireless devices. The multimedia servers store the multimedia content and stream videos to clients upon receipt of a request. The users issue requests for video streams on their handheld devices. All communication between the handheld device and the servers are routed through the proxy server, that can transcode the video stream in realtime. The middleware executes on both the handheld device and the proxy, and performs two important functions. On the device, it obtains residual energy availability information from the underlying architecture and feeds it back to the proxy and relates the video stream parame- ters and network related control information to lower abstraction layers. On the TLFeBOOK

261 proxy, it performs feedback based power aware admission control and realtime transcoding of the video stream, based on the feedback from the device. It also regulates the video transmission over the network based on the noise level and the video stream quality. Additionally, the middleware exploits dynamic global state information(e.g mobility info, noise level etc.) available at the directory service and static device specific knowledge (architecture, OS, video quality levels) from the static rule base, to optimally perform its functions. The rate at which feedback are sent by the device is dictated by administrative policies (e.g. periodic feedback). Moreover, we assume that network connectivity is maintained at all times.

Rule Directory base Service noise Transcoder C

S P C U S E R S U S E Server Proxy Switch Access C Point W A N W I R E D E T H E R N E T W I R E L E SS

Figure 14.2. System Model

In the rest of the chapter, we present important research challenges encoun- tered at each level and discuss approaches that involve both distributed proxy based adaptations coupled with coordinated cross-layer energy optimizations at the device. 14.4 HARDWARE/ARCHITECTURAL LEVEL OPTIMIZATIONS The architectural optimizations are particularly important because of the use of microelectronic system-on-chip components used in multimedia platforms. Since most multimedia applications spend a significant amount of time access- ing and transforming audio and video data, the design of the memory subsystem architecture, and compiler support for exploiting the specialized memory struc- tures are critical for meeting the performance, power and cost budgets of such applications. Since the memory subsystem will dominate the cost (area), performance and power, we have to pay special attention to how it can benefit from customiza- tion. For example, the memory can be selectively cached; the cache line size can be determined by the application; the designer can opt to discard the cache completely and choose specialized memory configurations such as FIFOs and stream buffers. The exploration space of different possible memory architec- tures is vast, and there have been attempts to automate or semi-automate this exploration process [13]. TLFeBOOK

262

Data Register Display Memory Cache File

Network CPU card Functional Clock Units a b

Figure 14.3. Main Components of a Handheld Device (a) and CPU Detail (b)

14.4.1 Hardware-level Optimizations for Handheld Devices There are three major sources of power consumption in a handheld device such as a Compaq iPAQ 3650 for which we indicate the corresponding power numbers: display (approximately 1W for full backlight), network hardware (1.4W) and CPU/memory (1-3W, with the additional board circuits). Each of these subsystems also provide opportunities for controlling the power dissi- pation. In case of the display (LCD), the main energy drain comes from the backlight, which is a predefined user setting and therefore has a limited degree of controllability by the system (without affecting the final utility). The network interface allows for efficient power savings if cognizant of the higher level pro- tocol’s behavior and will be explored in a subsequent section. Out of the three components mentioned above, the CPU coupled with the memory subsystem poses the biggest challenge. The dependence on the input data to be processed, the quality of the code generated by the compiler and the organization of its internal architecture make predicting its power consumption profile very hard in general; nevertheless, very good power saving results can be obtained by utilizing the knowledge of the application running on it and through extensive profiling of a representative data input set from the application’s domain. In the rest of this section, we focus our attention on the possible optimizations at the CPU level for a multimedia streaming application (e.g MPEG-1). We identified the subcomponents of the CPU (Fig. 14.3(b)) that consume the most power and observed the power distribution inside the CPU for MPEG decoding. By running the decoder process in a power simulator (Wattch) for videos of various types and by measuring the relative power consumption of each unit in the CPU we generate the internal processor power distribution. We conclude that: • The relative power contribution of the internal units of the CPU do not vary significantly with the nature or quality of the video played. A possible reason for this is the symmetrical and repetitive nature of MPEG decoding, whose processing is done on fixed size blocks or macroblocks. TLFeBOOK

263 • The units that show an important contribution to the overall power con- sumption and are amenable for power optimization are: caches, register files and functional units. Cache behavior greatly affects the memory performance and hence power consumption, so we optimize the entire memory subsystem in an integrated way. We briefly discuss these components, their impact on overall power con- sumption and how it can be affected by these architectural choices: • Caches/Memory: cache configurations are determined by their size, number of sets, and associativity. The size specifies how large a cache should be, while the associativity/number of sets control its internal structure. We identify that most power gains for MPEG are possible through reconfiguration of the data cache and its effect on the memory traffic, thus amplifying the effect of power optimizations through cache reconfiguration. • Frame Traversal: Decompressing MPEG video in its implied order does not leave space for exploiting the limited locality existent between dependent macroblocks. By just changing the frame traversal order algorithm based on the existing locality, faster decompression rates and higher power savings are achieved via reduced memory accesses [9]. Our proxy-based approach allows for a transparent on-the fly traversal reordering at the proxy server. In addition dynamic voltage scaling provides further savings for MPEG streaming as it allows tradeoffs for transforming the frame decoding slack time (CPU idle time) into important power savings. We discuss DVS and investigate the implications of DVS on other power optimizations in the system. All these parameters when fine-tuned for a specific video quality, will provide the best operating point(for power and performance) for a specific video stream.

14.4.2 Quality-driven Cache Reconfiguration Power consumption for the cache depends on the runtime access counts: while hits result in only a cache access, misses add the penalty of accessing the main memory (external). Fortunately, in most applications the inherent lo- cality of data means that cache miss rate is relatively low and so are accesses to external memory. However, MPEG decoding exhibits a relatively poor data locality, which, when combined with the large data sets exercised by the algo- rithm, leads to an increase in the cache memory-traffic. In order to find the best solution point, we resort to extensive simulation and profiling with data that is representative of the video domain. Internal CPU caches are characterized by their size(S), number of sets(NS), line size(LS) and associativity(A). Our cache reconfiguration goal is optimizing energy consumption for a particular video quality level Qk. In general, cache power consumption for a particular configuration and video quality is given by the function Ecache,k(S, A).By profiling this function for the entire search space (S, A) of available cache TLFeBOOK

264

Total Energy (J)

1.7

1.6

1.5

1.4

1.3 4 8 1.2 16

32 1.1 1 2 4 64 Cache 8 16 32 Size Cache Associativity

Figure 14.4. Cache Energy Variation on Size and Associativity configurations, we generate a cache energy variation graph shown in Fig. 14.4. Depending on the video quality Qk played, there will be one optimal operating opt opt point for that video quality: (Sk ,Ak ). We found out that for all video qualities an optimized operating point exists and it improves cache power con- sumption by up to 10-20% (as opposed to a suboptimized configuration). This technique effectively fine-tunes the organization of the cache so that it perfectly matches the application and the data sets to be processed, yielding important power savings. 14.5 OS/MIDDLEWARE LEVEL OPTIMIZATIONS Gains in power reduction and performance improvement from architectural optimizations can be further amplified if the low-level architecture is cognizant of the exact characteristics of the streamed video. An adaptive middleware software at a proxy can dynamically intercept and doctor a video stream to exactly match the video characteristics for which the target architecture has been optimized. It can also regulate the network traffic to induce maximal power savings in a network interface. Additionally, with knowledge of the video stream the operating system can employ an optimized dynamic voltage scaling of the CPU.

14.5.1 Integrated Dynamic Voltage Scaling For a given supply voltage, V, and clock frequency f, the dynamic power due to digital CMOS varies linearly with frequency and quadratically with the supply voltage (which is also the switching voltage). This relationship can be used at the application level [4]. In our case, for MPEG decoding, frames are processed in a fraction of the frame delay (Fd =1/frame rate). The actual frame decoding time D depends on the type of MPEG frame being processed (I, P, B) and is also influenced by the cache configuration (S, A) and DVS setting (f, V ). We assume a buffer based decoding, where the decoded frames TLFeBOOK

265 are placed in a temporary buffer and are only read when the frame is displayed. This allows us to decouple the decoding of the frame from the displaying part; decoding time is still different for different frames, we therefore assume an average D for a particular video stream/quality. The difference between the average frame delay and actual frame decoding time gives us the slack time θ = Fd − D. We can then perform DVS, where we slow down the CPU to use up the slack time. Cache configuration also slightly influences the frame decoding time (due to the cache misses, which translate into external memory traffic), extreme values proving very inefficient. An optimized cache combined with DVS yields the best power saving results. Determination of the best operating point for the DVS/cache reconfiguration requires simulation of the application with the power aware system software that has direct influence on the technology parameters. This is discussed next.

14.5.2 Power Aware Operating System Architecture We view the notion of power ’awareness’ in the application and OS as a capability to carry out a continuous dialogue between the application, the OS, and the underlying hardware. This dialogue establishes the functionality and performance expectations (or even contracts, as in real-time sense) within the available energy constraints. We describe here our implementation of a specific service, namely the task scheduler, that makes the OS power aware. The sched- uler architecture is composed of two software layers and the OS kernel. One layer interfaces applications with operating system and the other layer makes power related hardware “knobs” available to the operating system. Both layers are connected by means of corresponding power aware operating system ser- vices as shown in Figure 14.5. At the topmost level, embedded applications call the API level interface functions to make use of a range of services that ultimately makes the application energy efficient in the context of its specific functionality. The API level is separated into two sub-layers. The PA-API layer provides all the functions available to the applications, while the other layer pro- vides access to operating system services and power aware modified operating system services (PA OS Services). Active entities that are not implemented within the OS kernel are also be implemented at this level (threads created with the sole purpose of assisting the power management of an operating system service). We call this layer the power aware operating system layer (PA-OSL). To interface the modified operating system level and the underlying hardware level, we define a power aware hardware abstraction layer (PA-HAL). The PA-HAL provides access to the power related hardware parameters in a way that makes it independent of the hardware. TLFeBOOK

266

Application Level

Applications

API Level

PA-API

PA-OSL

OS Level Scheduler

POSIX Device PA OS Drivers OS Services Kernel Memory Manager

Hardware Level

OS HAL PA-HAL

HARDWARE

Figure 14.5. Power Aware Operating System Architecture

14.5.3 Middleware based Network Traffic Regulation We now describe a proxy-based traffic regulation mechanism to reduce energy consumption by the device network interface. Our mechanism (a) dynami- cally adapts to changing network(e.g noise) and device conditions(e.g. residual battery energy). (b) accounts for attributes of the wireless access points (e.g. buffering capabilities) and the underlying network protocol (e.g. packet size). (c) uses the proxy to buffer and transmit optimized bursts of video along with control information to the device. However, even though packets are transmit- ted in bursts by the proxy, the device receives packets that are skewed over time Fig. 14.6; this cuts power savings, as the net sleep time of the interface is re- duced. The skew is caused due to the ethernet access protocol(e.g CSMA/CD) and/or the fair queueing algorithms implemented at the wireless access points. Our mechanism optimizes the stream, such that optimal video bursts sizes are sent for a given noise level, thus maximizing energy savings without perfor- mance costs. Wireless network interface(WNIC) cards typically operate in four modes: transmit, receive, sleep and idle. We estimated the power consumption of the Cisco Aironet 350 series WLAN card to have the following power con- sumption characteristics: transmit(1.68W), receive(1.435W), idle (1.34W) and sleep(0.184W) which agree with the measurements made by Havinga et al. in [14]. This observation suggests that considerable energy savings can be achieved by transitioning the network interface from idle to sleep mode during TLFeBOOK

267 periods of inactivity. The use of bursty traffic was first suggested by Chan- dra [2, 3] and control information was used for adaptation in [26]. We analyze the above power saving approach using a realistic network frame- work(Fig. 14.6), in the presence of noise and AP limitations [21]. The proxy middleware buffers the transcoded video and transmits I seconds of video in a single burst along with the time τ=I for the next transmission as control infor- mation. The device then uses this control information to switch the interface to the active/idle mode at time τ + γ × DEtoE,whereγ is an estimate between zero and one and DEtoE is the end-to-end network delay with no noise.

User 1 User N C

Wired C Wired wireless Proxy Wireless device P HTTP/TCP/IP 802.11b C RTP/UDP/IP Access packets Point

t t Figure 14.6. Wireless Network We acknowledge that a QoS aware preferential service algorithm at the access point can impact power management significantly. The above analysis can be used by an adaptive middleware to calculate an optimal I(burst length) for any given video stream and noise level. Note that energy overhead for buffering the video packets is not affected by using our strategy because the number of read and write memory operations remain unchanged irrespective of the memory buffer size. In the previous section, we demonstrated how low level architecture can be optimized using high level information. In this section, we presented two middleware techniques that can be used to compliment the low-level hardware optimizations, lower energy consumption of the NIC and improve the overall utility of the system. We now introduce a middleware based adaptation scheme for backlight power savings in handheld devices.

14.5.4 Reducing Backlight Power Consumption The backlight accounts for considerable energy overheads in a low-power device. However, potentially large energy savings are realizable by operating the device at a lower backlight intensity levels. We explore a more aggressive approach to brightness compensation and device backlight control for stream- ing video. Furthermore, the adaptation is shifted away from the low-power device and performed at a network proxy server, obviating the need for the decoder on the device to be modified. We have found that aggressive bright- ness compensation is possible for streaming video as compared to still images, TLFeBOOK

268 without considerably impacting the video quality. This is because small defects (introduced due to aggressive compensation) that might be noticeable in a still image are less discernable in streaming video where several frames (images) are displayed on the screen every second. We also propose an effective brightness compensation algorithm for optimized power savings [23]. In this approach, we introduce middleware based adaptation schemes which integrate our com- pensation algorithm to achieve low power backlight operation for streaming video content to mobile handheld devices. Our experiments indicate that this approach can provide power reductions of up to 60% of the power consumption attributed to the backlight, depending on the chosen adaptation scheme and the characteristics of the streamed video. We assume that the proxy server has access to a database of profiled lumi- nosity values for various video streams and device specific parameters (e.g. number of backlight levels, average luminosity at each level etc.), a rule base to determine compensation values and a video transcoder(Fig. 14.7); and low- power wireless devices capable of displaying streaming MPEG video content. All communication between the handhelds and the multimedia server are routed through the proxy server that can change the video stream in real-time.

Figure 14.7. Model for Backlight Adaptation Each device/client has an application layer where the video stream is de- coded and a middleware layer which routes the information flowing to and from the video decoder application. The client middleware layer has access to system parameters such as the backlight levels, the current battery level and information identifying the type and make of the handheld (e.g. iPAQ, Jor- nada etc). In addition to accessing these system parameters, the middleware layer on the client can change these parameters (e.g. operating backlight level) through API calls to the underlying OS. The middleware on the proxy performs the dynamic adaptation of the streaming video content (brightness compensa- tion) and communicates control information to the client middleware (operating backlight levels) through the low bandwidth control stream. The proxy main- tains a database of information about the videos available at the server and TLFeBOOK

269 information specific to different handheld types such as the number, luminous intensity and average power consumption of the backlight levels. Additionally, the proxy also employs a static rule base which specifies conditions which de- termine values for backlight and video compensation. The database and certain parameters of the rule base are populated by extensive profiling and subjective assessment of videos on different handhelds. 14.6 APPLICATION LAYER ADAPTATION Improving the service lifetimes of low-power mobile devices through effective power management strategies can facilitate optimization of user experience for streaming video on to handheld devices. To achieve this, a system should be able to dynamically adapt to global system changes, so that the entire duration of a requested video is streamed to the user at the highest possible quality, while meeting the power constraints of the user’s low-power device. We achieve such an optimal balance between power and performance, by introducing a notion of “Utility Factor UF " for a system, and optimizing the UF for the system. This approach precludes the system from aggressively optimizing for power at the expense of performance and vice-versa; thereby providing an optimized operating point for the system at all times. UF is a measure of “user satisfaction" and we specify it as follows: given the residual energy Eres on a handheld device, a threshold video quality level (QA : QMAX ≥ QA ≥ QMIN) acceptable to the user, and the time of the video playback T,theUF of the system is non-negative, if the system can stream the highest possible quality of video to the user such that the time, quality and the power constraints are satisfied; otherwise UF is negative. Let PVID denote the average power consumption rate of the video playback at the handheld and QPLAY be the quality of video streamed to the user by the system. Using the above notation, we define UF as follows:  QPLAY − QMIN IFF PVID ∗ T

270

Table 14.1. Energy-Aware Transformations for Compaq Ipaq 3650 with bright backlight, Cisco 350 Series Aironet WNIC card. (Q1) Terrible, (Q2) Bad, (Q3) Poor, (Q4) Fair, (Q5) Good, (Q6) Very Good, (Q7) Excellent, (Q8) Like Original

Quality Parameters Avg. Power Avg. Power (WinCE) (Linux) (Q8) SIF, 30fps, 650Kbps 4.42W 6.07W (Q7) SIF, 25fps,450Kbps 4.37W 5.99W (Q6) SIF, 25fps, 350Kbs 4.31W 5.86W (Q5) HSIF, 24fps, 350Kbps 4.24W 5.81W (Q4) HSIF, 24fps, 200Kbps 4.15W 5.73W (Q3) HSIF, 24fps, 150Kbps 4.06W 5.63W (Q2) QSIF, 20fps, 150Kbps 3.95W 5.5W (Q1) QSIF, 20fps, 100kbps 3.88W 5.38W

produce a perceptible quality change or power uptake compared to the nearest SIF encoded video with similar bit and frame rates. Based on these conclusions, we identified eight dynamic video stream trans- formation parameters (Table 14.1) for our proxy-based realtime transcoding and use the profiled average power consumption values to perform our adaptations.

14.7 SUMMARY It has been pointed out by several researchers that power optimization across various levels of system functionality and implementation (architecture, OS, middleware, application) can lead to much greater savings than the case when these are individually optimized for power. The challenge is how these op- timizations can be coordinated across layers; what is the right architectural framework that allows this optimization to occur simultaneously and even dy- namically? To answer this question, this paper proposes a proxy-based middle- ware solution to accommodate optimizations across diverse clients with limited computation and battery power by controlling the amount of needed computa- tion and communication to the client device. We showed how such adaptation in the middleware can be used to improve energy efficient delivery of multimedia content in the case of streaming video. User perception of video also plays a vital role in deciding the proxy-based video transformations and in identifying architectural tuning “knobs". However, identifying the various video qualities remains a highly subjective aspect of the study. Identifying video quality levels objectively/programmatically still remains an open research challenge. In prac- tice however, the widespread deployment of such a unified power management framework for mobile devices would require a set of APIs (programming inter- faces) to be implemented at the various computational layers; this API should fa- TLFeBOOK

271 cilitate effective communication between the various levels. Recent approaches towards power management suggest a more open and flexible architecture for mobile devices that allows higher layers to make informed adaptations at lower layers and vice-versa. A prototype implementation of the framework is cur- rently underway as a part of the FORGE(http://www.ics.uci.edu/˜forge) project.

Acknowledgments This work was supported by funding from an ONR MURI Grant N00014- 02-1-0715 and NSF NGS award ACI-0204028.

References

[1] Azevedo, Ana, Cornea, Radu, Issenin, Ilya, Gupta, Rajesh, Dutt, Nikil, Nicolau, Alex, and Veidenbaum, Alex (2001). Architectural and compiler strategies for dynamic power management in the copper project. In Inter- national Workshop on Innovative Architecure. [2] Chandra, Surendar (2002). Wireless Network Interface Energy Consump- tion Implications of Popular Streaming Formats. In Multimedia Computing and Networking. [3] Chandra, Surendar and Vahdat, A. (2002). Application-specific Network Management for Energy-aware Streaming of Popular Multimedia Formats. In Usenix Annual Technical Conference. [4] Choi, Kihwan, Dantu, Karthik, Chen, Wei-Chung, and Pedram, Massoud (2002). Frame-Based Dynamic Voltageand Frequency Scaling for a MPEG Decoder. In International Conference on Computer Aided Design. [5] Chu, H. H. and Nahrstedt, Klara (1999). "Cpu Service classes for multime- dia applications". In International Conference on Multimedia Computing and Systems. [6] Cornea, Radu, Dutt, Nikil, Gupta, Rajesh, Mohapatra, Shivajit, and et. al. (2003). ServiceFORGE: A Software Architecture for Power and Quality Aware Services. In FME Symposium. [7] Douglis, Fred, Krishnan, P., and Marsh, B. (1994). Thwarting the power hungry disk. In WINTER USENIX conference. [8] Feeney, L.M. and Nilsson, M (2001). Investigating the Energy Consumption of a Wireless Network Interface in an Ad Hoc Networking Environment. In IEEE Infocom. TLFeBOOK

272 [9] Feng, Wu-chi and Sechrest, S. (1996). Improving data caching for software mpeg video decompression. In IS&T/SPIE Digital Video Compresssion: Algorithms and Technologies. [10] Flinn, J. and Satyanarayanan, M. (1999a). Energy-Aware Adaptations for Mobile Applications. In In Symposium on Operating Systems Principles. [11] Flinn, Jason, de Lara, Eyal, Satyanarayanan, M., Wallach, Dan S., and Zwaenepoel, Willy (2001). “Reducing the energy usage of office appli- cations". In IFIP/ACM International Conference on Distributed Systems Platforms. [12] Flinn, Jason and Satyanarayanan, M. (1999b). PowerScope: a tool for pro- filing the energy usage of mobile applications. In Second IEEE Workshop on Mobile Computing Systems and Applications. [13] Grun, P., Dutt, N., and Nicolau, A. (2003). "Memory architecture ex- ploration for programmable embedded systems". In Kluwer Academic Publishers, Norwell, MA. [14] Havinga, Paul J. M. (2000). Mobile Multimedia Systems. PhD thesis, University of Twente. [15] Hughes, C. J., Srinivasan, J., and Adve, S. V. (2001a). Saving energy with architectural and frequency adaptations for multimedia applications. In IEEE/ACM International Symposium on Microarchitecture. [16] Hughes, C. J., Srinivasan, J., and Adve, S. V. (2001b). Saving energy with architectural and frequency adaptations for multimedia applications. In IEEE/ACM International Symposium on Microarchitecture. [17] Irani, S., Shukla, S., and Gupta, R. (2002). “Competitive analysis of dynamic power management strategies for systems with multiple power saving states". In Design Automation and Test in Europe. [18] Kumar, Pavan and Srivastava, Mani (2000). Predictive Strategies for Low- Power RTOS Scheduling. In International Conference on Computer De- sign. [19] Lee, C., Lehoczky, J., Siewiorek, D., Rajkumar, R., and et.al (1999). A Scalable solution to the multi-resource QoS problem. In Real-Time Systems Symposium. [20] Mesarina, M. and Turner, Y. (2002). A Reduced Energy Decoding of MPEG Streams. In Multimedia Computing and Networking. [21] Mohapatra, Shivajit, Cornea, Radu, Dutt, Nikil, Nicolau, Alex, and Venkatasubramanian, Nalini (2003). Integrated power management for video streaming to mobile handheld devices. In ACMMM. [22] Noble, B. D., Satyanarayanan, M., D.Narayanan, J.E.Tilton, and Flinn, J. (1997). Agile Application-AwareAdaptation for Mobility. In In Symposium on Operating Systems Principles. TLFeBOOK

273 [23] Pasricha, Sudeep, Mohapatra, Shivajit, and et. al. (2003). "Reducing back- light power consumption for streaming video applications on mobile hand- held devices". In ACM/IEEE/IFIP Workshop on Embedded Systems for Real-Time Multimedia, 2003. [24] Pillai, P., Huang, H., and Shin, K. G. (2003). Energy-Aware quality of Service adaptation. In Technical Report CSE-TR-479-03, Univ. of Michi- gan. [25] Pillai, P. and Shin, K. G. (2001). Real-Time Dynamic Voltage Scaling for Low-Power Embedded Operating Systems. In In Symposium on Operating Systems Principles. [26] Shenoy, Prashant and Radkov, Peter (2003). Proxy-Assisted Power- Friendly Streaming to Mobile Devices. In Multimedia Computing and Networking. [27] Sinha, Amit and Chandrakasan, Anantha (2001). "jouletrack - a web based tool for software energy profiling". In Design Automation Conference. [28] Soderquist, Peter and Leeser, Miriam (1997). Optimizing the data cache performance of a software MPEG-2 video decoder. In ACM Multimedia, pages 291–301. [29] Weiser, M., Welch, B., Demers, A., and Shenker, S. (1994). Scheduling for Reduced CPU Energy. In In Symposium on Operating Systems Design and Implementation. [30] Yuan, W. and Nahrstedt, K. (2004). Process Group Management in Cross- Layer Adaptation. In Multimedia Computing and Networking. [31] Yuan, W., Nahrstedt, K., Adve, S., Jones, Doug., and Kravets, Robin (2003). Design and Evaluation of a Cross-Layer Adaptation Framework for Mobile Multimedia Systems. In Multimedia Computing and Networking. [32] Zeng, H., Ellis, C., Lebeck, A., and Vahdat, A. (2002). "Ecosystem: Man- aging energy as a first class operating system resource". In Architectural Support for Programming Languages and Operating Systems. [33] Zeng, H., Ellis, C., Lebeck, A., and Vahdat,A. (2003). Currentcy: Unifying policies for resource management. In USENIX.