
To all my family

ACKNOWLEDGEMENTS

First of all, I would like to stress that this work would not have been possible without my teacher and now advisor, Professor Jordi Cortadella. Although a fearless and full-time devoted leader in the design of asynchronous circuits, he has always been available for help, always putting me on the right track, always seeing the bright side of my problems.

I would like to thank Enric Pastor, Fermín Sánchez, Oriol Roig, Marco A. Peña and Gianluca Cornetta, all members of the CAD/VLSI group where I have been working these last years, for their help and support. I would also like to thank my colleagues at the Computer Architecture Department.

Special thanks to Rosa M. Badia and Tomás Lang, who have reviewed some parts of this work, and to Joan Figueras. Their technical discussions and suggestions have definitely improved the quality of this work. To all of them, my sincere thanks.

This work has been supported in part by the Ministerio de Educación y Ciencia of Spain under contracts CYCIT TICE and TIC. My bank account is definitely indebted to the Departament d'Ensenyament de la Generalitat de Catalunya for their help financing my doctorate, and recognizes the Amics de Gaspar de Portolà association for providing the funds to present part of this work at a conference in California.

I would like to acknowledge the secretarial staff of the Computer Architecture Department for coping so well with the administrative paperwork. My sincere thanks also to the systems group people, who have perfectly managed the advanced computing resources provided by the Computer Architecture Department.

During the last years this work has absorbed all my time, even the so-called free time. I would like to thank the people who have tried, most of the time unsuccessfully, to take my eyes off the books and my hands off the keyboard for a while. Special thanks to my best friends and the UPC Rowing team for this work.

Last but not least, thanks to my parents, Pilar and Pere, and my brother Ramon, for their unconditional patience and support. Thank you all.

High-level and logic synthesis techniques for low power

CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES

INTRODUCTION
Why low-power design techniques
Reliability
Chip package
Batteries
Applications
Low-power design approach
Overall power reduction
Research objectives
Overview

STATE-OF-THE-ART IN LOW-POWER DESIGN
Power-consumption sources
Static dissipation
Dynamic dissipation
Defining power consumption
System layer
System shutdown
System partitioning
Algorithm selection
Architecture layer
Parallel and pipelined processing
Compiler transformations
Cache design
Data representation
Logic layer
Combinational circuits
Sequential circuits
Circuit layer
Logic style
Synchronous vs. asynchronous design
Design style
Adiabatic computing
Layout layer
Technology layer
Technology scaling and SOI process
Packaging technology
Low-power techniques summary
Conclusions

HIGH-LEVEL SYNTHESIS TECHNIQUES FOR LOW POWER
Introduction
Previous work
Estimating power consumption
Reducing power consumption
Power-consumption model of functional units
Register-transfer level techniques for low power
Loop interchange
Operand reordering
Operand sharing
Operand retaining
Operand similarity
Summary of targeted circuit domains
Scheduling and register binding for low power
High-level synthesis
Previous work on high-level synthesis for low power
Overview
Scheduling for low power
Register binding for low power
Conclusions

ARCHITECTURAL TECHNIQUE FOR LOW POWER
Introduction
Parallel multipliers
Previous work
Potential power reduction in array multipliers
Transition-Retaining Barriers
Implementation of the TRB
Results
Conclusions

LOGIC/CIRCUIT TECHNIQUE FOR LOW POWER
Introduction
Motivation examples
Previous work and overview
Overview of our approach
Power consumption of CMOS gates
Definitions and overview
Transition density
Extended power-consumption model
Power-optimization algorithm
Algorithm overview
Monotonic characteristic
Equilibrium probabilities
Exhaustive exploration of gate configurations
Results
Scenarios for the experiments
Discussion
Conclusions

CONCLUSIONS AND FUTURE RESEARCH
Introduction
Contributions
System layer
Architecture layer
Logic/circuit layer
Future research

A CIRCUIT DESIGN AND ANALYSIS TOOLS
Introduction
Ocean
SLS
SIS

REFERENCES

LIST OF FIGURES

Chapter
State-of-the-art in battery technology [Pow]
Power-consumption trend [Mat]
Different layers in the design flow of a circuit
Power breakdown in the ThinkPad notebook [Ike]

Chapter
Static CMOS inverter: schematic, transistor and switch representations
Short-circuit power dissipation
(a) Input waveform, (b) charging and (c) discharging the capacitance $C_{load}$
(a) Datapath circuit at V, (b) parallel and (c) pipelined versions at V [BCS]
Implementing polynomial $X^3 + AX^2 + BX + C$: (a) straightforward and (b) Horner's scheme
Substituting one multiplication by one addition
(a) Unbalanced and (b) balanced dataflow graph of the addition of four operands
Delaying signals to eliminate hazards
Different subject graphs implementing a input NAND function
(a) Subject graph, (b) available gates, (c) low-area mapping, (d) low-power mapping
(a) FSM example, (b) two different codings with different switching activity among them [RP]
Sequential circuit (a) before and (b) after retiming
Comparator circuit (a) without and (b) with disabled input registers
(a) STG and (b) FSM with the registers enabled/disabled by signal $f_a$
(a) Static and (b) dynamic implementations of the Boolean function Y of $A_1$, $A_2$, $B_1$ and $B_2$
(a) Original synchronous system, (b) asynchronous system with adaptive scaling of the supply voltage [NS]
Design time vs. power consumption of some design styles
Clock tree driving schemes: (a) buffers at the clock source and (b) distributed buffers

Chapter
Activity of a bit data stream for different temporal correlations (data approximated from [LRa])
(a) Average activity in a multiplier as a function of the constant value (data approximated from [CR]); (b) parallel and serial implementations of an tree
Transformations for reducing (a) activity at the address and (b) number of memory references
(a) Coarse-grained and (b) fine-grained power-consumption model for an bit Booth multiplier
(a) Dataflow graph; (b) and (c) are two possible schedules and functional-unit bindings
The input data order in (b) leads to a higher operand AHD than in (b). Big arrows indicate the input data. Thin arrows indicate the AHD between two consecutive input data
Example of application of the loop-interchange technique
(a) Motion-estimation algorithm and (b) motion-estimation algorithm with two loop interchanges
Datapath for executing (a) the motion-estimation algorithm and (b) the matrix-vector product algorithm
Matrix-vector product algorithm when referencing matrix A (a) by rows and (b) by columns
(a) MAC operator for p and (b) DFG of the th-order LMS adaptive filter
Different input-reusing graphs for (a) the MAC operator and (b) the symmetric matrix-vector product operator
(a) DFG of the AR filter; (b, c) two possible schedules and bindings
(a) Original code and (b) after loop unrolling
(a) Low-pass image filter algorithm, (b) DFG of the inner loop of (a) without division by nine and (c) DFG after loop unrolling
(a) DFG of the th-order FIR filter; (b, c) two possible schedules and bindings
DFG of the Differential Equation Solver
High-level synthesis process
Example of peak-power and average-power optimization during the scheduling and functional-unit binding tasks: (a) DFG, (b) adder characteristics, (c, e) different schedules and bindings and (d) results
Register binding for low power
List-scheduling algorithm
Low-power list-scheduling algorithm
(a) DFG of the LMS filter and (b) schedule and FU binding
Register-binding algorithm for low power

Chapter
(a) Circuit structure with glitching; (b) block C presents more activity than block A
bit array multiplier
(a) FA and (b) HA structure
Signal transitions per operation for each basic block in a bit array multiplier
Useful and total number of signal transitions for an and bit array multiplier
(a) Possible locations for one TRB and (b) transitions per input vector for different locations of one TRB in a bit array
Signal transitions per input vector in a bit array multiplier
(a) General view of the design with no TRBs and side view with (b) no TRBs, (c) one TRB and (d) two TRBs
(a) Comparison between static CMOS and C²MOS and (b) avoiding direct-path currents
bit array with two TRBs. The four types of delay cells (TA to TD) are shown
Design of a FA
Layout of an bit array multiplier in a sea-of-gates design style

Chapter
(a) Four implementations of function y a a b and (b) relative power consumption for two different input activity scenarios
SPICE simulations of configurations A and D in Figure for two different input activity scenarios
Transistor-reordering technique in a input NAND gate (example from [SLW])
(a) Static CMOS input NAND gate; (b) relative power consumption for different input activities without varying the equilibrium probabilities of the inputs
(a) Static CMOS gate representation and (b) algorithm to obtain function $H^n_k$
Two BDD representations of function f abc bd cd using different variable order
Capacitances in a MOS transistor
Optimization algorithm
Exhaustive exploration algorithm
Execution example of the exhaustive exploration algorithm
A gate y a d b e b c d a c e without a series-parallel representation
The two scenarios considered

Chapter

Appendix A
(a) Schematic and (b) gate description of a full adder
Sea-of-gates layout architecture
(a) Full-adder netlist description and (b) layout produced by Ocean
(a) Command file and (b) graphical output of the SLS simulator
(a) Library and (b) circuit description in SIS

LIST OF TABLES

Chapter
Power dissipation of high-performance microprocessors [BE]
Power dissipation of high-performance low-power microprocessors (updated from [BE])
Power breakdown for different processors [Keu]

Chapter
Switching power in different types of circuits [Tiw]
Number of multiplications and additions of the matrix DCT execution for different algorithms [BCS]
Normalized area and power for the different architecture designs of the datapath example in Figure [BCS]
Two's complement and sign-magnitude representation for the values to using bits
Area/delay/power/design-simplicity comparison of some logic styles
Technology scale factors for some parameters
Power reductions obtained by different techniques

Chapter
Factors of the coarse-grained model
Two different input reorderings for the input MAC unit. $M_j$ represent the multiplications of the MAC unit
Results obtained by applying the operand-sharing technique
Idle time spent by the functional units
Features of the targeted circuit domains
Results obtained with the LPLS algorithm
Comparison between the traditional resource-binding algorithm (TRB) and its low-power version (LPRB)

Chapter
Comparison between the original design and the new designs with TRBs

Chapter
SPICE and SLS comparison for the configurations in Figure. Numbers in parentheses show the power-consumption ranking
Number of different configurations for some standard gates obtained by reordering their transistors
Results obtained for several MCNC benchmarks for both scenarios considered

Chapter

Appendix A

INTRODUCTION

Historically, power consumption has not been a priority issue in the semiconductor industry. Until recently, the two leading parameters that a designer had to take into account when designing a chip were area and delay.

The area of a circuit has become smaller with the increasing levels of integration, and the performance has improved with higher clock frequencies. By the year the frequency and integration density are expected to increase by a factor of and respectively with respect to

One of the consequences of these high levels of integration and performance is the increase in power consumption. Nowadays, and for certain applications, power consumption is a design parameter as important as area or delay. So far, the designer had to find the best design in a two-dimensional space (area-delay). Now there is a three-dimensional space (area-delay-power) to dive into, with the corresponding increase in complexity.

This chapter presents the motivation for low-power design and the research objectives of this work.


WHY LOW-POWER DESIGN TECHNIQUES

Several factors have pushed low power to the fore as a critical design constraint. The three most important are reliability, chip package cost and battery life in portable systems.

Reliability

The power dissipated by a circuit turns into heat. Excessive heat may decrease the reliability of the circuit, i.e., failure mechanisms such as silicon interconnect fatigue, package-related failure, electrical parameter shift, electromigration, junction fatigue, etc. may occur when high on-chip temperatures are generated. In fact, every °C increase in the on-chip temperature doubles the failure rate [Sma].
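The doubling rule can be turned into a quick estimate. The following is only a minimal sketch: the function name and the `doubling_step_c` parameter are hypothetical, and the step value used in the example is a placeholder chosen for illustration, not a figure taken from this work.

```python
def failure_rate_factor(delta_t_c: float, doubling_step_c: float) -> float:
    """Relative failure rate after a temperature rise of delta_t_c (deg C),
    assuming the failure rate doubles every doubling_step_c (deg C)."""
    if doubling_step_c <= 0:
        raise ValueError("doubling_step_c must be positive")
    return 2.0 ** (delta_t_c / doubling_step_c)


if __name__ == "__main__":
    # Placeholder step (assumption): a rise of three steps means three doublings.
    step = 10.0
    print(failure_rate_factor(3 * step, step))
```

Under this rule the failure rate grows exponentially with temperature, which is why even moderate reductions in dissipated power can matter for reliability.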

Chip package

The heat dissipated in chips increases the cost of packaging. Packages are characterized by their thermal resistance in °C per watt, which means that one watt will raise the temperature by °C. For chips consuming a few watts, a plastic package (°C/watt) may be used. For more power-hungry chips, a ceramic package (°C/watt) is needed [WE]. This may increase the overall cost of the chip in US. Moreover, high-cost packaging might include forced air or liquid cooling through tiny ducts, further increasing the cost of the final chip.
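The role of the thermal resistance can be illustrated with a small calculation. This is a sketch under stated assumptions: the function names are hypothetical, the model is the usual linear one (temperature rise equals power times thermal resistance), and all numeric values are placeholders rather than data from this work.

```python
def chip_temperature_c(ambient_c: float, power_w: float,
                       theta_c_per_w: float) -> float:
    """Chip temperature for a package with thermal resistance
    theta_c_per_w (deg C per watt) dissipating power_w watts."""
    return ambient_c + power_w * theta_c_per_w


def max_power_w(ambient_c: float, max_chip_c: float,
                theta_c_per_w: float) -> float:
    """Largest power the package can dissipate without exceeding max_chip_c."""
    return (max_chip_c - ambient_c) / theta_c_per_w


if __name__ == "__main__":
    # Placeholder values (assumptions): 25 C ambient, 40 C/W package, 2 W chip.
    print(chip_temperature_c(25.0, 2.0, 40.0))
    print(max_power_w(25.0, 105.0, 40.0))
```

For a fixed temperature limit, the allowed power is inversely proportional to the thermal resistance, so a chip that dissipates more needs a package with a lower (and typically more expensive) thermal resistance.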

Batteries

Portable battery-driven applications, such as notebook and computers, represent the fastest growing segment of the computer industry and the driving factor behind the concern for low-power design. In Figure we observe the state of the art in batteries [Pow]. For example, Li-ion batteries currently provide Wh/kg. Li-ion technology will probably improve a over the next years [War]. These improvements cannot keep up with the steep increase in the power consumption of circuits (a factor of four every three years, as shown in Figure).

Figure State-of-the-art in battery technology [Pow] (nominal capacity in Watt-hour/kg for the NiCd, NiMH, Li-ion and LiPE battery technologies)

Figure Power-consumption trend [Mat] (power consumption in W versus year of publication, for Vdd = 5 V and Vdd = 3.3 V; the trend grows 4 times every 3 years)

For example, consider the hypothetical case of a portable application based on a DEC Alpha processor (Watts). Using NiCd batteries, we would need roughly kg of batteries per hour of operation. The problem is not solved by using more powerful batteries: with Li-ion technology we reduce the weight to kg, still too heavy. The solution also relies on applying low-power design techniques during the design of the processor.
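The battery-weight argument reduces to dividing the required energy by the specific energy of the battery. The sketch below assumes that simple relation; the function name is hypothetical and the numbers are placeholders for illustration, not the values of the example above.

```python
def battery_mass_kg(power_w: float, hours: float,
                    specific_energy_wh_per_kg: float) -> float:
    """Battery mass needed to supply power_w watts for the given hours,
    for a battery technology storing specific_energy_wh_per_kg (Wh/kg)."""
    if specific_energy_wh_per_kg <= 0:
        raise ValueError("specific energy must be positive")
    return power_w * hours / specific_energy_wh_per_kg


if __name__ == "__main__":
    # Placeholder values (assumptions): a 30 W load for one hour,
    # compared on two hypothetical battery technologies.
    for wh_per_kg in (25.0, 100.0):
        print(wh_per_kg, battery_mass_kg(30.0, 1.0, wh_per_kg))
```

Since the mass scales linearly with the power drawn, halving the power of the processor has the same effect on weight as doubling the specific energy of the battery.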


APPLICATIONS

Low-power design techniques are suitable for portable applications, where the life of the battery is a primary design constraint, and for high-performance computing, where the power has increased and is still increasing at a high pace.

Battery-operated applications

Battery-operated applications clearly benefit the most from low-power design techniques. Examples of these applications may be found in different fields:

- computing: notebook and laptop computers;
- personal communications: the current generation of digital cellular phones and the Personal Digital Assistants, pocket-sized devices with multimedia access supporting full-motion digital video and control via handwriting or speech recognition [CSB];
- consumer electronics: wrist watches (targeted long ago for low power), digital TV, pocket calculators;
- medicine: hearing aids, implantable cardiac equipment;
- military: night visors, surveillance systems.

Many of the above applications, especially those related to personal communications, demand the same computation capabilities as found in desktop machines, since they execute complex speech and data compression algorithms.

High-performance computing

High-performance systems may also benefit from low-power design techniques. With large integration density and improved speed of operation, processors dissipating to Watts are emerging.

Table shows some high-performance processors with their associated power consumption. We observe that, for example, the DEC Alpha dissipates W. With the integration level expected by the year, several DEC Alphas may be integrated in just one chip, but the power dissipated will certainly have to be reduced.

Therefore, low-power design techniques are needed to implement either new versions of these processors or completely new designs that consume less. Some examples are shown in Table. For instance, the Power PC is the low-power version of the Power PC. The frequency has been reduced to MHz. This frequency reduction reduces power by itself, but most of the reduction obtained with the Power PC processor (more than a factor of) has been obtained by applying low-power techniques during its design.

Processor       Clock (MHz)   Technology (µm)   Vdd (V)   Peak power (W)   Ref.
Intel Pentium                                                              [Rep]
DEC Alpha                                                                  [ea]
DEC Alpha                                                                  [eab]
Power PC                                                                   [eaa]
MIPS R                                                                     [MIP]
UltraSparc                                                                 [eaa]

Table Power dissipation of high-performance microprocessors [BE]

Processor       Clock (MHz)   Technology (µm)   Vdd (V)   Peak power (W)   Ref.
Power PC                                                                   [eaa]
IBM SLC                                                                    [eab]
MIPS R                                                                     [YSS+]
ARM                                                                        [Gwea]
ATT Hobbit                                                                 [Gweb]

Table Power dissipation of high-performance low-power microprocessors (updated from [BE])

LOW-POWER DESIGN APPROACH

The power dissipated by a circuit may be tackled at all the different layers of the design process, from the system layer, where the specifications of the circuit are defined, down to the layout and technology layers (see Figure).

Figure Different layers in the design flow of a circuit (from top to bottom: system, architecture, logic/circuit, layout and technology)

In general, larger power reductions are obtained in the early layers of the design process, i.e., in the architecture and system layers. However, the low-power techniques at the different layers are orthogonal to one another. For example, the power savings obtained at the logic/circuit layer will add up to those already obtained at the architecture layer.

Therefore, it is important to find low-power techniques at each of the layers of the design process of a circuit.

Overall power reduction

An application is not only composed of digital circuits; there are other components that dissipate power. For example, Figure shows the power distribution within the ThinkPad notebook [Ike]. The processor unit accounts for the of the total power budget. In a portable communications terminal, the processing units may account for the of the power [CSBb].

Figure Power breakdown in the ThinkPad notebook [Ike] (pie chart with slices for LCD, 37%; video; DC-DC loss; CPU; and HDD)

Processor    Logic   Clock   Memories   Interface
DEC Alpha
DEC NVAX
MIPS-X
TORCH
average

Table Power breakdown for different processors [Keu]

Moreover, the power of the processor unit may be further broken down, as shown in Table for some processors. We observe that, for example, the logic power component of an average processor is. Assuming that this processor is integrated in the ThinkPad notebook (in fact, it is a Pentium), the complete elimination of the logic power only yields an approximate overall power reduction. Therefore, efforts should be aimed at reducing the power consumption of those components that have the highest impact on the overall power consumption.
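The reasoning above is a product of fractions: the saving seen at the system level is the local saving scaled by the share of each enclosing component. The sketch below assumes exactly that model; the function name is hypothetical and the shares in the example are placeholders, not the percentages of the figure or table.

```python
from typing import Sequence


def overall_saving(shares: Sequence[float], local_saving: float) -> float:
    """Fraction of total system power saved when a nested component is
    reduced by local_saving (0..1).

    shares lists the nested power fractions from the outside in, e.g.
    [processor share of system power, logic share of processor power].
    """
    fraction = local_saving
    for share in shares:
        if not 0.0 <= share <= 1.0:
            raise ValueError("each share must be between 0 and 1")
        fraction *= share
    return fraction


if __name__ == "__main__":
    # Placeholder shares (assumptions): processor = 50% of the system,
    # logic = 40% of the processor, logic power eliminated completely.
    print(overall_saving([0.5, 0.4], 1.0))
```

Even a complete elimination of one component is bounded by the product of its shares, which is why the components with the largest shares deserve the most attention.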

RESEARCH OBJECTIVES

The goal of this work is to contribute to the current state of low-power design with techniques at the higher layers of the design process, namely the logic/circuit, architecture and system layers. These techniques will mainly focus on datapath architectures, such as those used in high-performance real-time systems in telecommunications and in speech, video and image processing. A proper power-consumption model for the system layer will be proposed to evaluate the techniques described. Some of these techniques will be automated to study the area-delay-power trade-off. In these cases, the power savings obtained will be evaluated with existing circuit simulators.


OVERVIEW

This work is divided into chapters and one appendix. This chapter has presented an introduction to the growing concern in system design houses about the power consumption of integrated circuits.

Chapter briefly reviews the state of the art in low-power design: some of the existing techniques for low power will be outlined for each layer of the design process. The goal of this chapter is not to provide a thorough description of these techniques, but rather to give a flavour of the state of the art in low-power design.

The remaining chapters are devoted to the contribution of this work. In Chapter, some techniques for reducing power during the high-level synthesis process are presented. These techniques rely on reducing the activity of the circuit, i.e., reducing the number of transitions at its internal signals. To evaluate the power savings obtained with these techniques, a proper power-consumption model is presented for the functional units. In the same chapter, some of the techniques are automated. The algorithms are presented and the area-delay-power trade-off is evaluated.

In Chapter, a technique at the architecture layer that reduces the generation and propagation of the unnecessary activity of the circuit is described. This technique is especially appropriate for array multiplier circuits. The power reductions for some of these circuits are evaluated and, again, the area-delay-power trade-off is studied.

The effect of the transistor-reordering technique on power consumption is studied in Chapter. A power-consumption model for a static CMOS gate that takes into account its internal capacitances is presented. An optimization algorithm that finds the best order of the transistors for each gate of the circuit is described. This algorithm relies on the power-consumption model of the CMOS gate. The power reductions obtained with this technique and the increase in delay are evaluated.

Finally, Chapter concludes this work, summarizing the main conclusions and indicating future research work in the field of low-power design.

In Appendix A, the tools used to design and evaluate the circuits in this work are briefly reviewed.

STATE-OF-THE-ART IN LOW-POWER DESIGN

This chapter presents an overview of the state of the art in low-power design. The approach described in the last chapter will be followed: some of the existing techniques for low power will be outlined for each layer of the design process. A more thorough insight will be given for the logic/circuit, architecture and system layers, since this work focuses on these steps of the design process.

Some of the techniques have been classified in different layers by other authors. The boundary between the system and architecture layers is especially unclear: decisions taken at the system layer clearly affect the architectural considerations. Moreover, some authors further split or merge some of the layers. Nevertheless, the classification presented in this chapter agrees with the majority of classifications made by other authors.

During optimization for low power, it is necessary to know the power distribution within a circuit. Therefore, the first section of this chapter is devoted to analyzing the power-consumption sources in digital circuits.

Sections to describe some of the existing techniques for low-power design for each layer of the design process. Section presents a summary of the techniques described in the previous sections. Section states the conclusions of this chapter.


Figure Static CMOS inverter: schematic, transistor and switch representations

POWER-CONSUMPTION SOURCES

To minimize the power consumption (measured in Watts) of a CMOS circuit, the different power components and their effects must be identified. The power dissipated in a CMOS circuit is divided into two components [WE]:

- static dissipation, caused by leakage and other static currents;
- dynamic dissipation, caused by the switching transient current (or short-circuit current) and the charging and discharging of load capacitances.

To describe these power-consumption sources, a static CMOS inverter will be used as an example. Figure shows its schematic, transistor and switch representations.

Static dissipation

CMOS circuits ideally present no static power dissipation because there is no direct path from power supply to ground. In practice, however, transistors behave as imperfect switches, thus allowing some leakage currents that make up the static component of CMOS power dissipation.

The static dissipation is divided into two components: the leakage current caused by the parasitic diodes, and the static current, which is a function of the input voltage and the threshold voltage of the transistors.

The contribution of the static dissipation is less than of the total power, but this contribution increases with the scaling of the technology. An important prerequisite of scaling down the device dimensions is the reduction of the power supply, because higher voltages undermine device reliability. Reducing the power supply leads to a reduction of the threshold voltage of the transistors, which implies an increase of the static dissipation [Mei].

Figure Short-circuit power dissipation (input transition and short-circuit current $I_{sc}$ at times $t_0$, $t_1$ and $t_2$)

Dynamic dissipation

This component is caused by the short-circuit current and by the current used for charging and discharging the load capacitances.

Short-circuit power dissipation

Since the transistors do not behave as ideal switches, a short-circuit current arises when the input changes its value, as shown in Figure. At time $t_0$ the input to the inverter is stable and the (static) power dissipation is negligible. At time $t_1$ the input is changing its value and therefore takes an undefined value X. This value causes both transistors to be turned on, and therefore a current $I_{sc}$ flows from power supply to ground. At time $t_2$ the input is again stable.

The power consumption produced by the current $I_{sc}$ depends on the time that both transistors are on. Therefore, the more slowly the input changes, the higher the power. With careful design for balanced input and output rise times, this power component can be kept below of the total power [Vee]. Moreover, this component highly depends on the circuit type, as will be shown in the next section.

Charging/discharging capacitances

The power component resulting from the charging and discharging of parasitic capacitances in the circuit (switching power) dominates the total power consumption. The situation is illustrated in Figure, where the capacitance $C_{load}$ represents all the different parasitic capacitances.

Figure (a) Input waveform, (b) charging and (c) discharging the capacitance $C_{load}$

Let us consider the case of one complete cycle of operation, with two transitions at the input signal (Figure a). When the input signal changes from 1 to 0, a current $I_{char}$ is generated and the capacitance $C_{load}$ is charged (Figure b). It can be easily derived that the energy (measured in Joules) delivered by the power supply is

$$E = C_{load} V_{DD}^2$$

where $V_{DD}$ is the power-supply voltage.

Half of this energy is dissipated through the PMOS transistor as a result of the flow of current $I_{char}$. The other half is stored in the capacitance $C_{load}$. When the input signal changes back to 1, a current $I_{dis}$ is generated and the capacitance $C_{load}$ is discharged. The energy stored in $C_{load}$ during the previous input transition is now dissipated through the NMOS transistor; the power supply does not deliver new energy. Therefore, the average energy of a single output transition is

$$E_{avg} = \frac{1}{2} C_{load} V_{DD}^2$$

Assuming that the output of the inverter changes every cycle, the average power consumption produced by the charge and discharge of parasitic capacitances is

$$W = \frac{1}{2} C_{load} V_{DD}^2 f$$

where $f$ is the operating frequency.
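The energy accounting above can be checked numerically. The following is a minimal sketch, assuming one output transition per cycle and an average energy per transition equal to half of the energy delivered by the supply during charging; the function names and the numeric values are hypothetical.

```python
def supply_energy_j(c_load_f: float, v_dd: float) -> float:
    """Energy (J) delivered by the supply while C_load charges to V_DD."""
    return c_load_f * v_dd ** 2


def avg_transition_energy_j(c_load_f: float, v_dd: float) -> float:
    """Average energy (J) per output transition: one charge plus one
    discharge cost C_load * V_DD^2 in total, i.e. half per transition."""
    return 0.5 * supply_energy_j(c_load_f, v_dd)


def switching_power_w(c_load_f: float, v_dd: float, f_hz: float) -> float:
    """Average power (W), assuming one output transition every cycle."""
    return avg_transition_energy_j(c_load_f, v_dd) * f_hz


if __name__ == "__main__":
    # Placeholder values (assumptions): 1 pF load, 5 V supply, 100 MHz.
    c_load, v_dd, f = 1e-12, 5.0, 100e6
    print(supply_energy_j(c_load, v_dd))
    print(avg_transition_energy_j(c_load, v_dd))
    print(switching_power_w(c_load, v_dd, f))
```

Note that neither the transistor sizes nor the transition times appear in these expressions: under this model the switching energy depends only on the load capacitance and the supply voltage.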


Circuit                                  Switching (mA)   Total (mA)   Switching as % of total power
bit parallel multiplier
bit Manchester adder
Floating-point processor
Signal processor
Sense amplifier portion of static RAM
Microcontroller
Static RAM
Dynamic RAM

Table Switching power in different types of circuits [Tiw]

Since the outputs of the gates do not always switch, the above expression is extended as

$$W_{avg} = \frac{1}{2} \alpha f C_{load} V_{DD}^2$$

where $\alpha$ is the number of times (on average, per cycle) that the capacitance $C_{load}$ is charged or discharged. The product $\alpha C_{load}$ is called the switched capacitance.

Again, the contribution of this power component to the total power depends on the type of circuit. In Table we observe that for datapath circuits (i.e., multipliers, adders, signal processors, etc.) this component accounts roughly for the of the total power [Tiw].

Defining power consumption

Since this work mainly focuses on datapath circuits, the power consumption will be defined by equation. This equation reveals the three degrees of freedom when designing for low power (considering a fixed operating frequency): voltage, physical capacitance and activity.

With its quadratic relationship to power, voltage reduction offers the most promising means for reducing power. Section will address the technique of reducing the power supply to decrease power.
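The relative weight of the three degrees of freedom can be illustrated with a small calculation. This sketch assumes a switching-power expression of the form one half times activity times frequency times capacitance times the square of the supply voltage; the names `alpha`, `switching_power_w` and `relative_power`, as well as the numeric values, are hypothetical.

```python
def switching_power_w(alpha: float, f_hz: float,
                      c_load_f: float, v_dd: float) -> float:
    """Switching power (W) for activity alpha, frequency f_hz,
    physical capacitance c_load_f and supply voltage v_dd."""
    return 0.5 * alpha * f_hz * c_load_f * v_dd ** 2


def relative_power(alpha_scale: float = 1.0, c_scale: float = 1.0,
                   v_scale: float = 1.0) -> float:
    """Power after scaling each degree of freedom, relative to the
    original design, at a fixed operating frequency."""
    return alpha_scale * c_scale * v_scale ** 2


if __name__ == "__main__":
    # Placeholder design point (assumption): alpha 0.5, 50 MHz, 20 pF, 5 V.
    base = switching_power_w(0.5, 50e6, 20e-12, 5.0)
    print(base)
    print(relative_power(v_scale=0.5))      # halving the supply voltage
    print(relative_power(alpha_scale=0.5))  # halving the activity
    print(relative_power(c_scale=0.5))      # halving the capacitance
```

Halving the voltage reduces the power to one quarter, whereas halving the activity or the capacitance only halves it, which is the sense in which voltage reduction is the most promising of the three.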


Reducing physical capacitances can be achieved by using small devices and short interconnection wires. Sections and will briefly address this issue.

But once the supply voltage and the device dimensions have been fixed, it is the switching activity of the signals of the circuit that will ultimately determine its power dissipation. Techniques that reduce this activity will be explained in Sections to. Moreover, this is the degree of freedom targeted by the techniques presented as the contribution of this work.

SYSTEM LAYER

The least explored layer of the design process is the system layer. Although the greatest power reductions may be obtained through a proper algorithm selection or system partitioning, the designer has to trade off contradictory parameters whose impact is mostly unknown in the early phases of the design process.

The best-known low-power design strategies at the system layer are system shutdown, system partitioning and algorithm selection.

System shutdown

An obvious mechanism for saving power is to implement power-management techniques that shut down the parts of the system hardware that are idle. An idle part still consumes power because of the clock line and the meaningless input data, which generate some activity. The system-shutdown strategies are divided into non-predictive and predictive.

Non-predictive shutdown

The non-predictive approaches are based on shutting the system down after the processor has been idle for a certain time interval. This interval is fixed or set by the operating system. The system remains in a power-saving mode until an external event asserting one of the wake-up signals (external interrupts, system-management interrupts, reset, etc.) brings the processor out of that mode. The system may be partially shut down, thus still performing some operations while in one of the power-saving modes. As an example, the Power PC implements the following software-programmable power-saving modes [GIG+]:

- doze mode: the cache coherence is maintained;
- nap mode: all logic except the time base/decrementer is disabled. Since cache coherence cannot be maintained in this mode, a pair of handshake signals enables the system to ensure that the processor goes into this mode only when cache coherency will no longer be a problem;
- sleep mode: in this mode the maximum power savings are obtained. The clocks to all units are disabled.

Predictive shutdown

The length of the idle time may be predicted based on the computation history, and the processor is shut down if the predicted length of idle time justifies the cost of shutting down. In [CSB] a predictive system shutdown has been implemented for a portable X-terminal device.
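The two shutdown strategies can be contrasted with a minimal sketch. The class names, the `break_even` threshold and the exponential-average predictor are illustrative assumptions; they are not the policy implemented in [CSB] or in any particular processor.

```python
from dataclasses import dataclass


@dataclass
class TimeoutPolicy:
    """Non-predictive: shut down once the idle time reaches a fixed timeout."""
    timeout: float

    def should_shut_down(self, idle_so_far: float) -> bool:
        return idle_so_far >= self.timeout


@dataclass
class PredictivePolicy:
    """Predictive: estimate the next idle period from the idle-period history."""
    break_even: float            # assumed: idle time that amortizes the shutdown cost
    smoothing: float = 0.5       # assumed: weight of the most recent observation
    predicted_idle: float = 0.0

    def observe(self, idle_length: float) -> None:
        # Exponential average of past idle periods (an assumed predictor).
        self.predicted_idle = (self.smoothing * idle_length
                               + (1.0 - self.smoothing) * self.predicted_idle)

    def should_shut_down(self) -> bool:
        # Shut down immediately if the predicted idle time justifies the cost.
        return self.predicted_idle > self.break_even
```

The timeout policy always wastes the timeout interval before saving power, whereas the predictive policy can shut down at the start of an idle period at the risk of a wrong prediction.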

System partitioning

System partitioning may be considered at different levels of granularity:

- to decide which components of the system will be realized in hardware and which will be implemented in software (hardware-software codesign). Some preliminary research has been done in estimating the power consumption of a software program [TMW].

- to decide which system functions go to which components. In a traditional system partitioning the primary design goal is to keep the chip count low. In a low-power system partition, the new primary design goal of low power may favor a different partitioning. Furthermore, the new system requires power-management functions that were not required in earlier systems. An example is found in [Wol].

- to decide whether the system functions are distributed or not. Distributed processors, memories and controllers can lead to significant power savings. The drawback is the increase in area. One example is found in [LRb], where a non-distributed and a distributed design of a vector quantizer are implemented and their power compared.

Algorithm selection

The choice of the algorithm used to implement the application may have a dramatic impact on the power consumption. The most straightforward decision is to select, from the different algorithms that implement the same function, the one with the fewest operations to be executed. This criterion may not hold if the algorithm has different types of operations. Since, for example, multiplications typically consume much more power than additions, the algorithm with the fewest multiplications is a candidate.

DCT Algorithm        Multiplications   Additions
Brute Force
Row-Col DCT
Chen's Algorithm
Lee's Algorithm
Feig's Algorithm

Table: Number of multiplications and additions of the matrix DCT execution for different algorithms [BCS].

As an example, consider the Discrete Cosine Transform (DCT) of a matrix. The table above lists five different algorithms that implement the DCT, each with a different number of additions and multiplications [BCS]. Clearly we will choose Feig's DCT algorithm, since it has the smallest number of multiplications.
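The selection criterion can be sketched as a weighted operation count. The operation counts and the relative weights below are hypothetical placeholders, not the values reported in [BCS]; only the idea that a multiplication costs more power than an addition comes from the text.

```python
def weighted_cost(mults, adds, mult_weight=4.0, add_weight=1.0):
    """Power-oriented cost: each operation type has its own (assumed) weight."""
    return mult_weight * mults + add_weight * adds


def select(candidates, **weights):
    """Return the name of the candidate with the lowest weighted cost."""
    return min(candidates,
               key=lambda name: weighted_cost(*candidates[name], **weights))


# Hypothetical (multiplications, additions) per algorithm.
candidates = {
    "algorithm_x": (100, 50),   # fewer operations in total
    "algorithm_y": (60, 120),   # fewer multiplications
}
```

With equal weights the plain operation count favors `algorithm_x`, whereas weighting multiplications more heavily favors `algorithm_y`, which is the situation described above.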

Other features that a low-power algorithm should present are:

- regularity, to minimize the power in the control hardware and the interconnection network.

- modularity, to exploit data locality through distributed processing units, memories and control.

- few memory references, since references to memories are expensive in terms of power.

ARCHITECTURE LAYER

At the architecture layer four topics will be studied: parallel and pipelined processing along with voltage reduction, compiler-related techniques, cache design and data representation.

Figure: (a) Datapath circuit at 5 V; (b) parallel and (c) pipelined versions at 2.9 V [BCS]. Operands A, B and C feed a comparator; the registers are clocked at 1/T in (a) and (c) and at 1/2T in (b).

Parallel and pipelined processing

The most well-known technique for reducing power, i.e. decreasing the power supply, is combined at the architecture layer with parallel and pipelined processing to overcome the loss of performance produced by the reduced power supply.

The power consumption depends quadratically on the power supply $V_{DD}$ (refer to the dynamic power equation introduced earlier). Therefore, a reduction in $V_{DD}$ leads to a significant power-consumption reduction. However, reducing $V_{DD}$ leads to an increase of the delay of the system, since the delay is proportional to the capacitance being charged/discharged and inversely proportional to $(V_{DD} - V_{th})^2 / V_{DD}$, $V_{th}$ being the threshold voltage [CB].

Through parallel and pipelined processing this loss of performance is overcome. To illustrate this, the simple datapath circuit in part (a) of the datapath figure will be used. This circuit compares an operand C and the result of an operation between A and B. The power consumption of this circuit is

$$W = C_{switched} \, V_{DD}^2 \, f$$

where $f$ is the operating frequency and $C_{switched}$ is the capacitance of the circuit signals being switched, i.e. the product of $C_{load}$ and the switching activity in the dynamic power equation.

Let us assume that the $V_{DD}$ of this circuit is 5 V and that, if we reduce it to 2.9 V, the delay is doubled.

Parallel processing

In part (b) of the datapath figure the circuit is duplicated and powered at 2.9 V, therefore allowing each part to work at half the original rate. Since there are two units producing a result at half the original rate, the same throughput is maintained.

The power consumption of the parallel version of the datapath circuit is

$$W_{par} = C_{switched}^{par} \, V_{DD}^2 \, f_{par}$$

with $f_{par} = f/2$ and $V_{DD} = 2.9$ V. The switched capacitance $C_{switched}^{par}$ is at least double the original one, since the circuit has been duplicated, plus some overhead because of the additional logic (the shadowed multiplexer in part (b) of the figure). Ideally, i.e. with $C_{switched}^{par} = 2\,C_{switched}$, $W_{par} = (2.9/5)^2 \, W \approx 0.34\,W$, therefore achieving a power reduction of about 66%.

Pipelined processing

The same ideal power reduction of about 66% can be obtained if the original circuit is pipelined, as shown in part (c) of the datapath figure. Now both stages can operate at half the speed, and thus the supply voltage can again be reduced to 2.9 V with no loss of throughput.

The power consumption of the pipelined version is

$$W_{pip} = C_{switched}^{pip} \, V_{DD}^2 \, f$$

The operating frequency remains the same, and there is an overhead increase in the switched capacitance because of the additional registers. If $C_{switched}^{pip} = C_{switched}$, then $W_{pip} \approx 0.34\,W$.

Architecture                  Voltage   Area   Power
Original datapath             5 V
Pipelined datapath            2.9 V
Parallel datapath             2.9 V
Pipeline-parallel datapath

Table: Normalized area and power for the different architecture designs of the datapath example [BCS].
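The ideal-case figures for the parallel and pipelined versions follow directly from the dynamic power formula. The sketch below only evaluates that formula for the 5 V and 2.9 V supplies of the datapath example; the function names are illustrative, and overhead capacitance is deliberately ignored.

```python
def dynamic_power(c_switched, v_dd, f):
    """W = C_switched * V_DD^2 * f."""
    return c_switched * v_dd ** 2 * f


def normalized_power(cap_ratio, v_dd, freq_ratio, v_ref=5.0):
    """Power relative to the reference design running at v_ref."""
    return cap_ratio * (v_dd / v_ref) ** 2 * freq_ratio


# Parallel: twice the capacitance, half the frequency, 2.9 V supply.
ideal_parallel = normalized_power(cap_ratio=2.0, v_dd=2.9, freq_ratio=0.5)
# Pipelined: same capacitance, same frequency, 2.9 V supply.
ideal_pipelined = normalized_power(cap_ratio=1.0, v_dd=2.9, freq_ratio=1.0)
```

Both ideal cases give the same ratio, $(2.9/5)^2$; any overhead capacitance enters through `cap_ratio` and raises the result accordingly.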

Overhead circuitry

The power reduction obtained previously has neglected the overhead circuitry needed to parallelize or to pipeline the original circuit. In [BCS] this datapath example has been implemented and the overhead has been taken into account. The results are shown in the table above.

The table gives the actual power reductions obtained for the pipelined and parallel versions. Although higher power savings are obtained with the parallel version, its large area overhead makes the pipelined approach more appealing. Note, however, that the higher the level of pipelining, the higher the power associated with clocking the registers.

Both techniques, pipelining and parallelizing, can be used together, as shown in the last row of the table, to further decrease the power supply and the power consumption.

There is a limit to reducing the power supply. At low $V_{DD}$ voltages the static power consumption of the circuit becomes important, and the dynamic power equation is no longer valid to model the power consumed by the circuit. This problem is addressed in a later section.

Figure: Implementing the polynomial $X^3 + AX^2 + BX + C$: (a) straightforward and (b) Horner's scheme.

Compiler transformations

Several well-known compiler techniques used for optimizing area or throughput may reduce the power of the synthesized circuit. These techniques transform the dataflow graph of the application in such a manner that the input-output behavior is preserved.

One straightforward use of these transformations is the following: once the throughput design constraint has been met, further optimize for throughput. Afterwards, the power supply can be reduced to trade the over-optimized throughput for power.

The following compiler transformations may reduce power [CPM+]:

- operand reduction: for example, using Horner's scheme to implement polynomials. Two multiplications are eliminated on a third-order polynomial without increasing the critical path, as shown in the polynomial figure.

- operation substitution: operations that are costly in power may be substituted by other, less expensive operations. This is often used in software compilers [ASU]. The drawback of this transformation is that it often implies an increase in the critical path length. An example is shown in the substitution figure, where a multiplication has been traded off in (b) for an addition, at the expense of an additional adder delay in the critical path.

  When speed is not a major issue, a significant reduction in hardware complexity is achieved by performing multiplications over several clock cycles as a series of shifts and adds. The average number of shifts and adds per multiplication can be minimized by rounding precise coefficients to the nearest canonical signed-digit coefficients [Sam].

- reducing logic depth: reducing logic depth leads to a decrease in useless activity (see the section on path balancing and gate sizing). An example is shown in the dataflow-graph figure, where two different implementations of the addition of four operands are shown. All operands are assumed to arrive at the same time. The implementation in (a) presents more activity because of the unbalanced paths: for example, the second adder first computes the unnecessary addition of C and the previous output of the first adder. When the correct value of the first adder finally propagates, the second adder recomputes the sum. This unnecessary activity is not present in (b).

- resource utilization: by distributing the operations more uniformly over the available time, the scheduling algorithm may reduce the required amount of hardware (i.e. it increases the time-sharing of the resources) while preserving the number of control steps [PR]. Reducing the amount of hardware implies less useless activity, but the amount of interconnection units and control logic may increase.

Figure: Substituting one multiplication by one addition, (a) before and (b) after the transformation. The dataflow graphs compute Yr and Yi from Xr and Xi using the coefficients Ar and Ai, together with the terms Ai-Ar and Ai+Ar.

Figure: (a) Unbalanced and (b) balanced dataflow graph of the addition of four operands A, B, C and D.
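The first two transformations in the list above can be illustrated with a short sketch. The function names are illustrative; the complex multiplication by a constant coefficient is an assumed reading of the substitution figure, with the terms `ai + ar` and `ai - ar` treated as precomputed constants.

```python
def poly_direct(x, a, b, c):
    """X^3 + A*X^2 + B*X + C with 4 multiplications."""
    x2 = x * x          # 1
    x3 = x2 * x         # 2
    return x3 + a * x2 + b * x + c   # 3 and 4


def poly_horner(x, a, b, c):
    """Same polynomial with Horner's scheme: 2 multiplications."""
    return ((x + a) * x + b) * x + c


def cmul_direct(ar, ai, xr, xi):
    """Complex product (ar + j*ai) * (xr + j*xi): 4 multiplications, 2 additions."""
    return ar * xr - ai * xi, ar * xi + ai * xr


def cmul_substituted(ar, ai, xr, xi):
    """Same product with 3 multiplications and 3 additions.

    (ai + ar) and (ai - ar) depend only on the coefficient, so they can be
    precomputed when the coefficient is a constant.
    """
    common = ar * (xr + xi)
    yr = common - (ai + ar) * xi
    yi = common + (ai - ar) * xr
    return yr, yi
```

In the substituted version the extra addition `xr + xi` sits in front of a multiplier, which corresponds to the additional adder delay in the critical path mentioned above.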

Cache design

A cache improves the system performance by reducing memory references. Since memory references are very power consuming, the existing techniques to reduce the miss rate and therefore increase performance (associativity, buffering, sub-array access, etc.) also reduce power [RP]. However, some guidelines should be followed for the design of low-power caches:

- power and hit rate increase with the cache size. Therefore, an increase in size that only obtains a small reduction of the miss rate should be carefully considered, since the power cost may outweigh the power savings obtained with the lower miss rate.

- implement the array part of the cache using a six-transistor SRAM cell design instead of a four-transistor cell. The latter, although it occupies less area, consumes more static power.

Data representation

The representation chosen for the operands also affects the power dissipation, since one representation may inherently present more bit toggles between two consecutive words.

Arithmetic representation and data encoding are two examples where a proper data representation for low power may be used.

Arithmetic representation

There are several arithmetic representations for data: two's complement, sign-magnitude and canonical signed-digit. Among them, two's complement is the most widespread. In this representation the most significant bits are sign bits (1 for negative and 0 for positive). In the sign-magnitude representation, however, only the most significant bit carries the information of the sign. Therefore, if the application usually has to operate with small values but needs to keep a large bit width for some operations involving large values, a high activity may arise in two's complement if the values present a large number of sign transitions.

Two's complement   Value   Sign-magnitude

Table: Two's complement and sign-magnitude representation of a range of values at a fixed bit width.

For example, consider the two's complement and sign-magnitude representations of a range of values at a fixed bit width, as shown in the table. Any transition from a negative value to a positive one, or vice versa, implies a larger average number of bit transitions when using the two's complement representation than with the sign-magnitude representation. The drawback in using the sign-magnitude representation is the additional overhead required to perform the operation, since it may depend on the signs of the operands. This results in a sequence of decisions that have to be made, implying additional logic and execution time [Kor].
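The effect of sign transitions can be checked numerically. The following sketch counts bit toggles between negative and non-negative values in both representations; the 8-bit width and the magnitude range of 3 are assumptions chosen for illustration, not the range used in the table.

```python
def twos_complement(value, bits):
    """Encode a signed integer as an unsigned two's complement word."""
    return value & ((1 << bits) - 1)


def sign_magnitude(value, bits):
    """Encode a signed integer as sign bit plus magnitude (|value| < 2^(bits-1))."""
    sign = (1 << (bits - 1)) if value < 0 else 0
    return sign | abs(value)


def hamming(a, b):
    """Number of bit positions in which two words differ."""
    return bin(a ^ b).count("1")


def average_sign_crossing_toggles(encode, bits, max_magnitude):
    """Average toggles over all (negative, non-negative) value pairs."""
    pairs = [(n, p)
             for n in range(-max_magnitude, 0)
             for p in range(0, max_magnitude + 1)]
    return sum(hamming(encode(n, bits), encode(p, bits))
               for n, p in pairs) / len(pairs)


avg_twos = average_sign_crossing_toggles(twos_complement, bits=8, max_magnitude=3)
avg_sm = average_sign_crossing_toggles(sign_magnitude, bits=8, max_magnitude=3)
```

For small magnitudes in a wide word, every sign crossing in two's complement toggles all the sign-extension bits, while sign-magnitude toggles a single sign bit plus the few magnitude bits that differ.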

Data encoding

Data encoding is especially useful for low power when a large capacitance is driven by the data. This occurs, for example, at the input/output (I/O) pads of the chip. Typically, the load capacitances of the I/O pads are one to three orders of magnitude larger than those in the internal circuit.

Gray encoding is one example of a low-power address encoding. It takes advantage of the sequential memory accesses that often occur in a general processor when executing consecutive instructions in basic blocks. For example, for a sequence of consecutive values there are more bit switches when the values are encoded in binary representation than when they are represented in Gray code. For random access patterns, Gray code and binary code have a similar number of bit switches. The main feature of the Gray code is that contiguous Gray-coded numbers differ in only one bit [Hay].
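The difference between the two encodings on a sequential access pattern can be counted directly. The sketch below uses the standard binary-reflected Gray code; the 16-address sequence is an assumption chosen for illustration.

```python
def to_gray(n):
    """Binary-reflected Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def bit_switches(sequence):
    """Total number of bit toggles between consecutive words of a sequence."""
    return sum(bin(a ^ b).count("1")
               for a, b in zip(sequence, sequence[1:]))


addresses = list(range(16))                       # sequential addresses 0..15
binary_switches = bit_switches(addresses)
gray_switches = bit_switches([to_gray(a) for a in addresses])
```

Each of the 15 increments toggles exactly one bit in Gray code, whereas the binary count also pays for the carry propagation.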

To implement a Gray-code addressing system, the instruction field of the target address in a branch instruction is modified in such a way that the calculated target address is the correct target address in the Gray-code addressing system. Moreover, a Gray-code counter is needed for incrementing the program counter [STD].

LOGIC LAYER

Several low-power design techniques exist at the logic layer or gate layer. All the techniques reviewed in this section aim at reducing the activity of the synthesized circuit. Again, power is traded off in some cases for area and delay. The techniques target combinational and sequential circuits.

Combinational circuits

The power consumed by a combinational circuit can be reduced by a proper path balancing of the signals and during the multilevel logic synthesis and technology mapping process.

Path balancing and gate sizing

The dynamic power equation pointed out that once the supply voltage and the device dimensions have been fixed, it is the switching activity that will determine the power consumption of the circuit. The switching activity of a circuit may be divided into:

- useful activity: those signal transitions needed to change the gate outputs of the circuit from the previous cycle (a function of the previous primary inputs) to the current cycle (a function of the current primary inputs).

- useless activity, also called hazard activity: the rest of the circuit activity.

As an example, consider the circuit in part (a) of the hazard figure. Assume that the critical signal is B. The output Y has the same value at time 0 (the previous cycle) and at time T. Since the correct value of Y at time T equals its previous value, any double transition of Y is considered to be a hazard.

Hazards are generated because of the different arrival times of the signals at the inputs of the gates. One of the reasons for this unmatched delay is the different propagation times of the gates, which vary depending on the input and on the interconnections among gates (see footnote 1). Moreover, these propagation times vary inside each gate depending on the input. Therefore, the amount of useful activity in a gate-level description of a circuit can be calculated considering that the gates follow a zero-delay model, that is, the propagation delays of the gates are 0.

Footnote 1: At submicron design, the metal interconnect contributes most of the net's capacitance. The narrow interconnect offers significant resistance, while the smaller transistors switch faster and cause much less capacitive loading.

Figure: Delaying signals to eliminate hazards. Waveforms of A, B and Y between times 0 and T, (a) in the original circuit and (b) after delaying A.

The power dissipated because of hazards strongly depends on the circuit topology and the applied inputs, but clearly accounts for a significant fraction of the overall power consumption [BFR], or even higher [SGDK].

Hazards at the output of a gate can be minimized if all the paths in the circuit feeding that gate have approximately the same delay. This implies that any potential hazard at its output will be eliminated, since the delay of the gate acts as a filter. Thus, balancing path delays may cause a reduction of the power consumption. One method to balance the paths is gate resizing [LM], i.e. replacing some gates of the circuit with others implementing the same logic function but having smaller area. A smaller gate takes longer to charge/discharge its output load, thus delaying its output signal. In the example of part (a) of the hazard figure, signal A can be delayed as shown in part (b). Now signal Y presents no hazards.

Resizing a gate implies resizing its transistors. Since smaller devices present less capacitance, an efficient low-power strategy is to use minimum-size devices whenever possible. But care has to be taken, since smaller gates have larger transition times at their outputs. A large transition time at the input of a gate affects the width of the short-circuit region of the gate and hence increases the power consumption [BIO]. In any case, along the critical path devices should be sized up to meet performance constraints.

Multilevel logic synthesis

Traditionally, logic optimization based on identifying logic shared among different nodes of a circuit has provided an efficient method for minimizing a cost function: the area of a boolean network. The area is estimated by computing the number of literals in the factored form of the network [BM]. The same process of logic synthesis can be applied to minimize the power of the network simply by incorporating the power measure into the cost function [RP, IP].

As an example, consider the synthesis of the following boolean functions [RP]:

$$f_1 = ad + bcd + ae$$

$$f_2 = a + bc + dh + eh$$

Assume given signal probabilities for all input signals and given activities (i.e. the number of transitions per cycle) for the inputs a, b, c, d, e and h.

The cost function mentioned above is simply the switched capacitance (the product of $C_{load}$ and the switching activity in the dynamic power equation). Therefore, the number of transitions at each node of the circuit should be calculated. The transition density measure [Naj] will be used for this purpose. This measure will be explained in a later chapter.

There are several ways to synthesize the boolean functions. For example, a straightforward two-level implementation has an area of 7 literals for $f_1$ plus 7 literals for $f_2$, with a power cost obtained with the transition density measure. A possible multilevel implementation is

$$f_3 = a + bc$$

$$f_1 = d f_3 + ae$$

$$f_2 = f_3 + dh + eh$$

Another multilevel implementation is

$$f_3 = d + e$$

$$f_1 = a f_3 + bcd$$

$$f_2 = h f_3 + a + bc$$

This implementation has more area but less power.

Technology mapping

Once the logic optimization has been done, the logic functions obtained are transformed into a technology-dependent circuit with minimized power consumption.

The graph-covering-based technology-mapping process can be summarized in the following three steps [WE]:

1. the logic equations are converted into a graph (subject graph) where each node is restricted to one of a set of base functions (e.g. 2-input NAND gates and inverters).

2. each library gate is also represented by a graph (pattern graph) where each node is restricted to one of the base functions.

3. find a minimum-cost covering of the subject graph.

Low-power techniques can be applied in steps 1 and 3 [TAM, TPD]. During step 1, a proper subject graph should be found that minimizes the activity at the outputs of the gates. As an example, consider the subject graphs of the NAND figure, implementing a 4-input NAND function. The input signals are assumed to be independent. With the probabilities and activities of the input signals shown in the figure, the total activity at the gate output nodes is 1.505 for one subject graph and 2.095 for the other; the subject graph with less activity should therefore be chosen. In this example the cost function has been considered to be only the activity. A more accurate cost function may be the switching capacitance.
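The node annotations of the two subject graphs can be reproduced with a small propagation sketch. The propagation rules below (independent inputs, and an output activity equal to each input activity weighted by the probability of the other input) are an assumption that happens to match the (probability, activity) pairs printed in the figure; the graph shapes `chain` and `balanced` are inferred from those annotations.

```python
def nand2(x, y):
    """Propagate (probability, activity) through a 2-input NAND, inputs independent."""
    (px, dx), (py, dy) = x, y
    # The output is sensitive to x when y is 1, and to y when x is 1.
    return 1.0 - px * py, py * dx + px * dy


def inv(x):
    """An inverter complements the probability and keeps the activity."""
    p, d = x
    return 1.0 - p, d


A, B, C, D = (0.3, 0.5), (0.4, 0.5), (0.7, 0.5), (0.5, 0.5)


def chain():
    """NAND(A,B) -> INV -> NAND(.,C) -> INV -> NAND(.,D)."""
    n1 = nand2(A, B)
    n2 = inv(n1)
    n3 = nand2(n2, C)
    n4 = inv(n3)
    return [n1, n2, n3, n4, nand2(n4, D)]


def balanced():
    """NAND(A,B) -> INV and NAND(C,D) -> INV feeding a final NAND."""
    n1 = nand2(A, B)
    n2 = inv(n1)
    n3 = nand2(C, D)
    n4 = inv(n3)
    return [n1, n2, n3, n4, nand2(n2, n4)]


def total_activity(nodes):
    return sum(activity for _, activity in nodes)
```

Up to the rounding of the output node activity, `total_activity(chain())` and `total_activity(balanced())` correspond to the totals of 1.505 and 2.095 printed in the figure.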

Further power reductions during the covering step are obtained based on the fact that internal nodes of a gate have less capacitance than output nodes. Therefore, if a node in the subject graph presents a high activity, it should be hidden as an internal node of a gate of the targeted library.

Figure: Different subject graphs implementing a 4-input NAND function. Nodes are annotated with (probability, activity); the inputs are A (0.3, 0.5), B (0.4, 0.5), C (0.7, 0.5) and D (0.5, 0.5).
  (a) Gate nodes (0.88, 0.35), (0.12, 0.35), (0.916, 0.305), (0.084, 0.305) and (0.958, 0.195). Total node activity: 1.505.
  (b) Gate nodes (0.88, 0.35), (0.12, 0.35), (0.65, 0.60), (0.35, 0.60) and (0.958, 0.195). Total node activity: 2.095.

Figure: (a) Subject graph, (b) available gates, (c) low-area mapping, (d) low-power mapping. The highly active node is marked in (a) and (c) and is hidden inside a gate in (d).

Figure: (a) FSM example with states S1 to S5 and transitions labeled input/output; (b) two different codings of the states, with different switching activity among them [RP].

For example, consider the subject graph in part (a) of the mapping figure and the available library gates in part (b). The mapping in (c) presents less area (counted in transistors in static CMOS), but the highly active node still remains as an output node. The mapping in (d) presents more area, but the highly active node has been hidden inside a gate.

Sequential circuits

The techniques addressed for low-power sequential circuits cover the finite-state machine (FSM) state assignment, the circuit retiming and the disabling of input registers when not needed.

FSM state assignment

The idea behind the FSM state assignment for low power is to assign similar codes to states $S_i$ and $S_j$ if the probability of state transition between these two states is high.

The difference between two codes is the number of non-equal bits. This measure is called the Hamming distance. Therefore, the goal of the FSM state assignment for low power is to minimize a value defined as [RP]

$$\sum_{\text{over all edges}} p_{ij} \, H(S_i, S_j)$$

where $H(S_i, S_j)$ is the Hamming distance between the codes of $S_i$ and $S_j$, and $p_{ij}$ is the conditional state transition probability (the probability that the next state is $S_j$ given that the FSM is in state $S_i$).

The FSM that produces a 1 whenever there is a sequence of five 1s at its input is shown in part (a) of the FSM figure as an illustrative example. Assume a given probability for the input. In part (b) of the figure two different state codings are shown; the second coding reduces the value of the cost function with respect to the first. Both codings can be implemented with the same area [RP].

Figure: Sequential circuit (a) before and (b) after retiming. The combinational delays are T1 and T2 in (a) and T1+T2 in (b).
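The cost function can be evaluated mechanically for any candidate coding. In the sketch below the edge list is an assumed reading of the five-state FSM (each state advances on input 1 and returns to S1 on input 0, with a self-loop on S5), the input probability of 0.5 is an assumption, and both codings are hypothetical examples rather than the codings of [RP].

```python
def hamming(a, b):
    """Number of differing bits between two state codes."""
    return bin(a ^ b).count("1")


def assignment_cost(edges, coding):
    """Sum over all edges of p_ij * H(S_i, S_j)."""
    return sum(p * hamming(coding[src], coding[dst]) for src, dst, p in edges)


P1 = 0.5  # assumed probability that the input is 1
EDGES = [
    ("S1", "S2", P1), ("S2", "S3", P1), ("S3", "S4", P1), ("S4", "S5", P1),
    ("S5", "S5", P1),                       # stays in S5 while the input is 1
    ("S1", "S1", 1 - P1), ("S2", "S1", 1 - P1), ("S3", "S1", 1 - P1),
    ("S4", "S1", 1 - P1), ("S5", "S1", 1 - P1),
]

# Two hypothetical 3-bit codings.
BINARY = {"S1": 0b000, "S2": 0b001, "S3": 0b010, "S4": 0b011, "S5": 0b100}
GRAY_LIKE = {"S1": 0b000, "S2": 0b001, "S3": 0b011, "S4": 0b010, "S5": 0b110}

cost_binary = assignment_cost(EDGES, BINARY)
cost_gray_like = assignment_cost(EDGES, GRAY_LIKE)
```

Both codings use three state bits, so the lower cost of the second one comes only from placing frequently connected states at a smaller Hamming distance.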

Retiming

The goal of the retiming technique [LS] is to move registers across portions of combinational logic in order to minimize the cycle time or the number of registers while maintaining the behavior of the circuit. For example, the sequential circuit in part (a) of the retiming figure can be retimed to eliminate one register, as shown in part (b). Moreover, the retiming technique can be applied along with combinational synthesis techniques to optimize the portions of combinational logic between register boundaries [MSBSV].

The retiming technique for low power is based on the fact that the switching activity at register outputs in a sequential circuit can be significantly less than the activity at the register inputs, because only the last signal transition at the input of a register may propagate to the output when the register is clocked. A register may then save power, especially if its input is very active and its output has a high load capacitance.

Therefore, once the constraint on performance has been met, the retiming technique can be used to decrease power by selecting the set of nodes which, by having a register placed at their output, lead to the minimization of the switching activity-capacitance product in the circuit. For example, in the circuit of part (a) of the retiming figure, the power consumption associated with signals A and A' is proportional to the activity of A times $C_{REG}$ plus the activity of A' times $C_{A'}$, where $C_{REG}$ is the capacitance that A drives at the register input and $C_{A'}$ is the load driven by A'. The activity of A' is significantly lower than that of A, and $C_{A'}$ is higher than $C_{REG}$. Assuming that the contribution of A dominates, the power of both nodes A and A' is approximated by the activity of A times $C_{REG}$. The power associated with signal A in part (b) is proportional to the activity of A times $C_{A'}$, since A now drives that load directly. Since $C_{A'} > C_{REG}$, power is saved by not removing this register during the retiming process.

A drawback of the retiming technique is that the critical path of the circuit may increase. The circuit in part (b) has a critical path of $T_1 + T_2$, whereas in part (a) it is $\max(T_1, T_2)$.

Disabling input registers

In synchronous circuits some useless power dissipation can be eliminated by disabling the clock in parts of the circuit where useful computation is not being performed. It is the same idea as the system shutdown described earlier, but at the logic level.

One approach is to add redundant logic to disable the registers at some of the inputs if the operation that the combinational block has to perform does not depend on those inputs [AMD+]. Therefore, less activity is generated in the combinational block, thus saving power, and some power is also saved in the disabled registers.

As an example, consider the comparator circuit in part (a) of the comparator figure, performing the operation A < B, A and B being the n-bit two's complement operands $a_{n-1} \ldots a_0$ and $b_{n-1} \ldots b_0$. Clearly, if $a_{n-1} = 1$ and $b_{n-1} = 0$ (B greater than A), or vice versa (A greater than B), the $a_{n-2} \ldots a_0$ and $b_{n-2} \ldots b_0$ bits carry no information needed to calculate the operation A < B. Therefore, the input registers of these bits may be disabled. Let $f_a$ be the boolean function that implements these conditions. If $f_a = 1$ the registers should be disabled, as in part (b) of the figure.
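The disabling condition for the comparator can be sketched as follows. The helper names are illustrative, operands are modeled as unsigned words holding n-bit two's complement values, and taking $f_a$ as the exclusive-or of the two sign bits is an assumed reading of the conditions described above.

```python
def to_word(value, bits):
    """Encode a signed integer as an n-bit two's complement word."""
    return value & ((1 << bits) - 1)


def sign_bit(word, bits):
    return (word >> (bits - 1)) & 1


def f_a(a, b, bits):
    """1 when the sign bits differ, i.e. the low-order bits are not needed."""
    return sign_bit(a, bits) ^ sign_bit(b, bits)


def less_than(a, b, bits):
    """A < B for two's complement words, using the low bits only if f_a is 0."""
    if f_a(a, b, bits):
        # Signs differ: the operand with sign bit 1 is negative, hence smaller.
        return sign_bit(a, bits) == 1
    # Same sign: the remaining n-1 bits decide (these registers stay enabled).
    low_mask = (1 << (bits - 1)) - 1
    return (a & low_mask) < (b & low_mask)
```

With uniformly distributed operands the sign bits differ in half of the cycles, so in this model the low-order input registers could be held in about half of the comparisons.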

Figure: Comparator circuit (a) without and (b) with disabled input registers.

Figure: (a) STG and (b) FSM with the registers enabled/disabled by signal $f_a$.

Another approach is to disable all the input registers in a finite-state machine (FSM) when a self-loop in the state-transition graph (STG) has been detected [BSdM, BM]. In a self-loop the state does not change. Therefore, by disabling the input registers no switching activity is generated in the combinational block and power is saved.

As an example, consider the simple STG in part (a) of the STG figure, with two self-loops. One is taken in one state and depends only on the first input. The other is taken in the other state when either the first or the second input has the required value. Again, a boolean function $f_a$ is needed to implement the self-loop conditions.

In all these approaches of disabling the input registers, the area of the synthesized circuit increases because of the logic needed to implement the enabling/disabling function $f_a$. Moreover, the delay also increases, since this logic usually is in the critical path.

CIRCUIT LAYER

At the circuit layer four topics related to low-power design will be addressed: the logic style, the asynchronous design, the design style and the adiabatic computing.

Logic style

The table below presents a comparison in area, delay, power and design simplicity among different logic styles or logic families. The first part of the table shows some static and dynamic logic styles. The second one presents several pass-transistor logic families [BE]. A more accurate comparison in terms of power is reported in [LKB] for some of these logic styles.

Logic Style        Area   Delay   Power   Design
Static CMOS
BiCMOS
BiNMOS
Pseudo-NMOS
NP Domino Logic

CMOS PTL
CPL
DPL
SRPL

Table: Area-delay-power-design simplicity comparison of some logic styles.

Figure: (a) Static and (b) dynamic implementations of a boolean function Y of the inputs A1, A2, B1 and B2.

Static vs dynamic logic style

Static CMOS logic design is very popular for its design simplicity, and it also presents a good low-power behavior because of its negligible static power dissipation. However, it suffers from a high area and low speed. The BiCMOS logic style overcomes the limited current drive of CMOS when the load capacitance is high. This comes at the expense of an increase in area and power (especially for small loads) and a loss of design simplicity. The BiNMOS logic family presents less power consumption than BiCMOS, but it is still higher than in CMOS. The Pseudo-NMOS logic style offers good design simplicity and low area, since the P-block of the static CMOS is replaced by a single PMOS transistor. The problem with this logic style is the non-zero static power dissipation.

To reduce area and improve the speed of CMOS circuits, the dynamic logic style may be used. The NP Domino logic in the logic-style table is a dynamic logic family. A clock is needed in dynamic logic, as shown in the static/dynamic figure, where a static CMOS gate is compared to its dynamic version. When the clock is low, the output node is precharged; when the clock is high, the output node may be discharged depending on the logic performed by the N-logic block. During the evaluation phase, the output node maintains the value stored in its load capacitance if it is not discharged.

In general the dynamic logic families are not suitable for low-power applications. The most important reasons are [Lan]:

- the clocking scheme adds more capacitance to the clock distribution, increasing the power and skew.

- the unnecessary activity generated at the output of the gates.

For example, the standard cells and datapath element libraries of the PowerPC [GIG+] use fully static logic for design simplicity and to avoid the higher power dissipation of dynamic logic caused by the precharge phase. Moreover, a dynamic design also requires saving the state of the chip if clocks are disabled.

Other drawbacks concerning robustness and charge sharing are present in dynamic logic families [Lan, BE]. Nevertheless, a dynamic implementation might actually achieve low power consumption in some specific applications, such as PLAs.

Pass-transistor logic style

The most promising logic families for low power are those based on the pass-transistor logic style. Some of these logic families are shown in the second part of the logic-style table. The most important drawback in using a pass-transistor logic family is the difficulty in designing circuits. There is not much commercial CAD support for the synthesis of pass-transistor gate designs.

Synchronous vs asynchronous design

Asynchronous circuits are typically designed on the basis of an explicit self-timed protocol, which implies that the delay of an asynchronous circuit is data dependent. Ideally, asynchronous circuits should drastically reduce the power consumption of their synchronous counterparts, since there is no global clock and any signal transition generated in the circuit is useful, i.e. the circuit only dissipates when and where necessary.

The main drawback of asynchronous circuits is the area overhead to implement the handshake and completion protocols needed for correct operation. In most cases the power consumption of this overhead outweighs the power savings obtained with the elimination of the clock signal.

However, CAD tools for asynchronous design are still under research, and improvements are expected in the asynchronous arena in the coming years. Some of the results reported so far seem to indicate that it is possible to obtain power reductions using asynchronous techniques, but this depends on the type and characteristics of the circuit. For example, in [KGG] a dynamic synchronous adder and an asynchronous adder are compared. The asynchronous version is more power consuming than the synchronous one. However, in [vBBK+] the asynchronous implementation of a DCC error corrector reduces the power with respect to the synchronous one.

Figure: (a) Original synchronous system; (b) asynchronous system with adaptive scaling of the supply voltage [NS].

Asynchronous design with adaptive scaling of the supply voltage

The data-dependent delay of an asynchronous circuit can be combined with the reduction of the power supply to obtain a low-power circuit that operates in a synchronous environment [vBN, NS].

Figure (a) shows the original synchronous system working at a frequency $1/T_{max}$. $T_{max}$ is defined so that the response time of the data processing circuit is guaranteed for the worst-delay case. An asynchronous design of the circuit has an average response time of $T_{avg}$. The delay overhead $T_{max} - T_{avg}$ can be converted into corresponding power savings by reducing the power supply until the computational delay of the asynchronous circuit plus a safety margin is $T_{max}$.

Figure: Design time vs. power consumption (power/operation) of some design styles: full custom, gate arrays, standard cells, macro cells and programmable logic.

The asynchronous system with adaptive scaling of the supply voltage is shown in Figure (b). The system consists of the data processing circuit itself, two FIFO buffers, a state-detecting circuit and a DC-DC converter for scaling down the supply voltage. The converter can be anything from a resistive device to a more sophisticated lossless device [SBS]. The scaling circuit may consist of a D/A converter, which scales the supply voltage linearly depending on the number of data words in the input FIFO.

Since the system operates in a synchronous environment, the input (output) buffer should never become full (empty). Therefore, synchronization problems will not occur at the synchronous-asynchronous interface. The state-detecting circuit monitors the state of one of the buffers, for example the input buffer. If it is running empty, the circuit is operating too fast and the supply voltage can be reduced. If, on the contrary, the buffer is running full, the supply voltage must be increased.
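The linear scaling rule driven by the input FIFO occupancy can be sketched as follows. This is only an illustration under stated assumptions: the function name `supply_voltage`, the FIFO depth and the voltage limits `v_min` and `v_max` are hypothetical and are not taken from the cited works.

```python
def supply_voltage(words_in_fifo, fifo_depth, v_min, v_max):
    """Map the input FIFO occupancy linearly onto a supply voltage.

    An almost empty FIFO means the circuit runs too fast, so the voltage
    is lowered; an almost full FIFO means it runs too slowly, so the
    voltage is raised. All parameters are illustrative assumptions.
    """
    if not 0 <= words_in_fifo <= fifo_depth:
        raise ValueError("occupancy must lie between 0 and the FIFO depth")
    fill = words_in_fifo / fifo_depth          # 0.0 (empty) .. 1.0 (full)
    return v_min + fill * (v_max - v_min)


# Hypothetical 8-word FIFO and a 1.0 V to 3.0 V supply range.
half_full_supply = supply_voltage(4, 8, v_min=1.0, v_max=3.0)
```

A real controller would also need some hysteresis and a lower bound that keeps the circuit functional; both are left out of this sketch.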

Design style

There is a range of design style options to implement a circuit: full custom, gate arrays, standard cells, macro cells and programmable logic structures. These design style options have contradictory design-time and power-consumption characteristics, as shown in Figure. For example, in a full-custom design, techniques such as transistor sizing can be optimally applied to individual circuits on an ad hoc basis. Unfortunately, this comes at a low design productivity. The gate-array and sea-of-gates design styles consist of predefined transistors, and the designer only needs to wire the different transistors using metalization and contacts. In a standard-cell design style the cells are already predefined. Care has to be taken in the routing (discussed in the next section) to reduce the power. Moreover, different designs of the same cell with different sizes should be available for low-power design.

In general, the more flexible design styles consume more power per operation than the least flexible ones. For example, in a programmable logic structure there is a higher capacitance to be switched per operation than in full custom because of the overhead needed to provide this flexibility.

Adiabatic computing

The idea of adiabatic, or charge-recovery, circuits is to recover some of the energy needed to change the voltage of the capacitors. This is possible since the energy dissipated by a transistor when there is a one-step voltage of V across it is N times the energy dissipated when this voltage step is applied in N sub-steps [RP]. Adiabatic computing becomes attractive only when the delay is not critical, since power is traded for delay [eac].

As an example, consider the energy dissipated by charging a capacitance C through a transistor modeled as a resistor R:

$$E_{diss} = \int_0^T P(t)\,dt = \int_0^T \frac{V^2}{R}\,e^{-2t/RC}\,dt = \frac{1}{2}CV^2\left(1 - e^{-2T/RC}\right)$$

where V is the voltage swing at the capacitance. The same result is obtained for the discharging process.

The asymptotic energy dissipation, i.e., the limit as T approaches infinity, is $\frac{1}{2}CV^2$, which draws a power consumption of $\frac{CV^2}{T}$ for a charge and discharge of the capacitance. This value coincides with the one obtained in Section.

It is important to point out that current, power and energy are exponential in time relative to the RC time constant of the circuit. Current and energy asymptotically approach zero and $\frac{1}{2}CV^2$, respectively. Assuming that, for example, after 3RC the circuit has reached its final output value, the transition time could be doubled to 6RC by applying a voltage step of V/2 from 0 to 3RC and a second voltage step of V/2 from 3RC to 6RC:

$$E_{diss} = \int_0^{3RC} P(t)\,dt + \int_{3RC}^{6RC} P(t)\,dt \approx \frac{1}{4}CV^2$$

Therefore, doubling the transition time T and halving the maximum voltage drop reduces the energy by a factor of two.
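The factor-of-two saving can be checked numerically with a minimal sketch. It assumes that every sub-step starts from an (almost) settled node, so that the single-step expression $\frac{1}{2}C\,\Delta V^2(1 - e^{-2T/RC})$ can be reused per step; the function name and the component values are hypothetical.

```python
import math


def step_charging_energy(c, v, r, total_time, n_steps=1):
    """Energy dissipated in R when C is charged to v in n_steps equal steps.

    Assumption: each sub-step starts from a settled node, so the one-step
    formula 0.5*C*dV^2*(1 - exp(-2*T/(R*C))) is applied to every sub-step.
    """
    step_v = v / n_steps            # voltage increment of each sub-step
    step_t = total_time / n_steps   # time allotted to each sub-step
    per_step = 0.5 * c * step_v ** 2 * (1.0 - math.exp(-2.0 * step_t / (r * c)))
    return n_steps * per_step


# Hypothetical component values, chosen only for illustration.
R, C, V = 1e3, 1e-12, 3.3
one_step = step_charging_energy(C, V, R, total_time=3 * R * C, n_steps=1)
two_steps = step_charging_energy(C, V, R, total_time=6 * R * C, n_steps=2)
ratio = two_steps / one_step   # energy of the two-step scheme relative to one step
```

Under the same assumption, calling the function with a larger `n_steps` and a proportionally longer `total_time` gives the 1/N behaviour mentioned at the beginning of this section.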

LAYOUT LAYER

The power consumption can be tackled at the layout layer during the floorplanning, placement and routing tasks.

The floorplanning task consists of the allocation of space on a chip for a given set of modules. A module may have its dimensions fixed or it may present different shapes. The goal is to find a suitable implementation for each module such that the total area is minimized while meeting a floorplan topology. Now the power consumption adds a new constraint. A floorplanning for low power has to decrease the power by [CW]:

- selecting the less power-consuming modules;
- distributing the power consumption rather than having hot spots.

The traditional placement algorithm tries to minimize the wire capacitance among the different modules. The low-power placement algorithm has to decrease the switched capacitance [VP].

The most well-known techniques for a low-power routing algorithm are:

- keep the wires with high switching activity short;
- use the layer with the lowest parasitic capacitance for the most active wires.
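The difference between minimizing wire capacitance and minimizing switched capacitance can be illustrated with a small sketch. The activities, the capacitance values and the function name below are hypothetical; both routings have the same total wire capacitance, but they differ in which net receives the short wire.

```python
def switched_capacitance(nets):
    """Sum of activity * capacitance over (activity, capacitance) pairs."""
    return sum(activity * capacitance for activity, capacitance in nets)


# Each net is (switching activity per cycle, wire capacitance in fF).
# Routing A gives the long wire to the busy net, routing B to the quiet one.
routing_a = [(0.50, 200.0), (0.05, 100.0)]
routing_b = [(0.50, 100.0), (0.05, 200.0)]

total_wire_a = sum(cap for _, cap in routing_a)   # same total capacitance...
total_wire_b = sum(cap for _, cap in routing_b)
switched_a = switched_capacitance(routing_a)      # ...different switched capacitance
switched_b = switched_capacitance(routing_b)
```

A cost function based on wire capacitance alone cannot distinguish the two routings, whereas the activity-weighted sum favours routing B.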

Clock distribution

A particularly critical signal to route is the clock. The clock signal has to drive large loads with fast rise/fall times. For example, in the case of the DEC Alpha chip [ea], the clock load is nF and this capacitance has to be charged/discharged every ns in rise/fall times of ns. This draws a power consumption for the clock of W, which corresponds to the of the total power consumption of the chip W.

Figure: Clock tree driving schemes: (a) buffers at the clock source and (b) distributed buffers.

Aside from using a low-capacitance layer for routing the clock signal, low-swing drivers at the top level or at intermediate levels of the clock tree distribution may be devised [KaKS]. This scheme reduces the power at the expense of an increase in delay.

To obtain the desired fast rise/fall times in the clock signal, buffers are needed to drive the large load capacitance. These buffers may be grouped at the clock source (Figure (a)) or may be distributed along the clock signal (Figure (b)). The first option has the advantage of avoiding the adjustment of intermediate buffer delays. The second has the advantage that relatively small buffers are used and that they can be flexibly placed across the chip to save layout area. The second option should be used because both wiring capacitance and driver power dissipation are reduced [RP].

Circuit characteristic                   Scale factor
Dynamic and static power consumption     $1/\alpha^2$
Delay                                    $1/\alpha$
Power-delay product                      $1/\alpha^3$
Gate area                                $1/\alpha^2$
Power density                            $1$

Table: Technology scale factors for some parameters.

TECHNOLOGY LAYER

The equation of power consumption in Section showed that the three degrees of freedom for power reduction are power supply, capacitance and activity. In the previous sections, the reduction of the activity has been targeted by the majority of the low-power techniques. At the technology layer, reducing power supply and capacitance is the main goal.

Three topics related to power consumption are addressed at the technology layer: technology scaling, the SOI (Silicon-on-Insulator) process and MCMs (Multi-Chip Modules).

Technology scaling and SOI process

One straightforward way to drastically reduce the power consumption is through technology scaling. Let us suppose that the power supply and the dimensions of the transistors are scaled down by a factor $\alpha$. Table shows what happens to some parameters of the resulting circuit [WE]. The power consumption is significantly reduced, although the power density is maintained since the integration level has increased by $\alpha^2$.

The scaling behavior of Table does not apply when the technology is scaled below the half-micron. The delay is proportional to $\frac{1}{V_{DD} - V_{th}}$, where $V_{th}$ is the threshold voltage of the transistor. When $V_{DD}$ is near $V_{th}$, the delay increases dramatically. Therefore, to maintain the delay scaling by $1/\alpha$ below the half-micron, the threshold voltage of the transistor also needs to be scaled down. The problem of scaling down the threshold voltage is that the static power consumption caused by subthreshold leakage currents is no longer negligible [BE].
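The sensitivity of the delay to a supply voltage close to the threshold voltage can be illustrated with the proportionality used in this section. The sketch below is only an illustration: the function name and the voltage values are assumptions, and the model is the simple $1/(V_{DD} - V_{th})$ dependence, not a calibrated device model.

```python
def relative_delay(vdd, vth, vdd_ref):
    """Delay at vdd relative to the delay at vdd_ref, using delay ~ 1/(vdd - vth)."""
    if vdd <= vth or vdd_ref <= vth:
        raise ValueError("the supply voltage must exceed the threshold voltage")
    return (vdd_ref - vth) / (vdd - vth)


# Hypothetical values: 3.3 V reference supply and two threshold voltages.
slowdown_high_vth = relative_delay(vdd=1.0, vth=0.7, vdd_ref=3.3)
slowdown_low_vth = relative_delay(vdd=1.0, vth=0.3, vdd_ref=3.3)
```

With these assumed values, lowering the threshold voltage makes the slowdown at a 1.0 V supply much smaller, which is the motivation for scaling the threshold voltage together with the supply.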

A proper technology process is therefore necessary for low-power and low-voltage applications. One such technology is a class of Silicon-On-Insulator (SOI) named thin-film SOI, which presents a low cost and improved performance at low voltages. The performance improvement of SOI is mainly caused by the reduction of parasitic capacitances and body effect [SNDD].

Packaging technology

The off-chip capacitances are approximately one to three orders of magnitude larger than those inside the chip. Therefore, the interconnections among the different chips may account for a significant part of the overall power consumption of the system.

To reduce the off-chip wiring capacitance, an obvious solution is to integrate as many chips as possible into a single module. These modules, called Multi-Chip Modules (MCMs), are designed to accommodate to chips and combine the functions of the single-chip package and the printed circuit board [TRDH]. In an MCM, all the chips comprising the system are mounted on a single substrate. The major problem associated with MCMs is the power supply distribution, which is caused by the material properties of the substrate.

LOW-POWER TECHNIQUES SUMMARY

The previous sections have reviewed some of the most well-known techniques for low power. In this section we give an overview of the power reductions obtained with these techniques. Table shows, for the system, architecture, logic and circuit layers of the design process, some results obtained by the low-power research community. We have skipped the lower layers since they are out of the scope of this work.

The power reduction values reported in the table should not be superficially compared to decide which technique is more powerful. Among the reasons:

- the power reduction values have been mainly obtained by applying the techniques reported in Table. However, some other low-power approaches are applied together with the main technique to further reduce the power consumption;
- the type of comparison performed by the researchers. For example, larger power reductions may be obtained by a low-power scheduling technique if it is compared to an initial, poorly optimized register-transfer level description than if it is compared to a description already optimized for area or delay. In the first case, some power reductions may be obtained by simply applying the traditional techniques for low area and high performance.

Layer          Technique                                  Reductions   References
System         System shutdown                                         [CSB], [GIG+]
               System partitioning                                     [Wol]
               Algorithm selection                                     [OY]
Architecture   Voltage scaling with pipelined/parallel                 [CSBa]
               processing
               Cache design                                            [BAF], [PR], [KBN]
               Data representation                                     [STD]
               Compiler transformations                                [CPM+], [WCF+]
Logic          Path balancing                                          [BCH+]
               Logic synthesis                                         [IP], [RP], [TMA], [SGDK]
               Technology mapping                                      [LM], [TAM], [TPD]
               State assignment                                        [TPCD], [HdlR]
               Retiming                                                [MDG]
               Disabling input registers                               [BFR], [AMD+]
Circuit        CPL vs. static CMOS                                     [YYN+]
               Asynchronous vs. synchronous                            [KGG], [vBBK+]
               Asynchronous with adaptive supply voltage               [NS]
               Adiabatic computing                                     [SK]

Table: Power reductions obtained by different techniques.

Nevertheless, we believe that Table provides an overview of the power reductions obtained at the system, architectural, logic and circuit layers of the design process. We observe that the highest power savings are obtained at the architecture and system layers. However, significant power reductions are also achieved at the logic and circuit layers, which add up to those already obtained at the higher layers.

CONCLUSIONS

The power-consumption sources in a CMOS circuit have been described in this chapter. Dynamic power consumption accounts for almost all the power in datapath circuits and, in general, it is the major source in the rest of the circuits.

Three degrees of freedom are considered when designing for low power: supply voltage, capacitance and activity. Once the supply voltage and the device dimensions have been fixed, it is the switching activity of the signals of the circuit that ultimately determines its power consumption. Reducing the activity is the goal of the techniques proposed in the next chapters as the contribution of this work.

This chapter has also reviewed some state-of-the-art low-power design techniques covering all the layers of the design hierarchy, ranging from the technology to the system layer. A more thorough insight has been given for the logic, circuit, architecture and system layers. The highest power savings are obtained at the architecture and system layers. However, significant power reductions are also achieved at the logic and circuit layers, which may add up to those already obtained at the higher layers.

HIGH-LEVEL SYNTHESIS TECHNIQUES FOR LOW POWER

Decisions taken at the earliest steps of the design of an electronic circuit may have a significant impact on the characteristics of the final implementation. This chapter illustrates how power consumption issues can be tackled at the system and architecture layers during the design of application-specific integrated circuits (ASICs) in an scenario. A set of RTL transformations aiming at reducing power consumption are proposed and their potential benefits evaluated.

The common idea behind the transformations is to reduce the activity of the datapath functional units (e.g., adders, multipliers) by minimizing the switching activity of their input operands. Functional units highly contribute to the power consumption of the datapath. Preliminary evaluations obtained by simulation show that significant improvements can be achieved.

Finally, this chapter demonstrates how some of the presented transformations can be automated and incorporated in high-level synthesis tools.

Chapter Highlevel synthesis techniques for low power

INTRODUCTION

An increasing demand for devices executing complex tasks, such as notebook computing, video compression and speech recognition, has emerged. These tasks are often implemented with application-specific integrated circuits (ASICs) that interact with general-purpose processors and memories in an embedded system. In this chapter we focus on estimating and reducing the power consumption of those tasks that are mainly data driven, such as, for example, digital signal processing applications.

We present a set of register-transfer level (RTL) techniques for power reduction, bearing in mind that decisions taken at the highest layers of the design process (i.e., architecture and system) can have a significant impact on the quality of the final implementation.

One of the main problems of proposing RTL techniques is the estimation of power consumption. At this level there is a lack of information on the capacitance and the switching activity of the signals of the circuit. The first part of the chapter (Section) is devoted to proposing power-consumption models for the functional units of the datapath based on the activity of their operands. Rather than accurately estimating the power consumption of the final implementation, we focus on fairly comparing the relative benefits of different RTL descriptions.

Section presents five techniques for power reduction. Some of them have already been proposed by other authors, but focusing on different targets (e.g., increasing data locality). We believe that these techniques can also be applied to reduce the power consumption of the synthesized design. All the techniques presented in this chapter aim at reducing the power consumption of the functional units by means of reducing the activity of the input operands. The benefits in power savings of the different techniques are evaluated.

Thus, the objective of this chapter is to help the designer in deciding which specification of the circuit is more appropriate in terms of power consumption, rather than to fully implement all the techniques.

The following example illustrates the spirit of the techniques.

Assume the average energy dissipated by an array multiplier is 1 unit/operation. If two multiplications, say a × b and c × d, are executed at consecutive cycles, the dissipated energy will be 2 units. On the other hand, if the second multiplication is again a × b instead of c × d, the switching activity of the multiplier will be null during the second multiplication and, therefore, no energy will be dissipated. Our estimations also state that if one of the operands remains unchanged during the second multiplication, say a × d (the first operand does not change), the average energy is units.

Now assume the following two execution orders of a sequence of independent operations executed in the same multiplier:

           Order 1      Energy     Order 2      Energy
Cycle 1    x = a × b               x = a × b
Cycle 2    y = c × d               z = a × d
Cycle 3    z = d × a               y = c × d
Total

By applying simple RTL transformations to the first order (Order 1), a less power-consuming description can be obtained (Order 2). Two techniques have been applied in the example: operand reordering on the operation z = d × a (see Section) and operand sharing to swap the operations y = c × d and z = d × a (see Section). Thus, the multiplier sees no activity of its first operand between cycles 1 and 2, and of its second operand between cycles 2 and 3. A power savings is estimated.
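The comparison of the two execution orders can be sketched with a small script. The energy assigned to an operation with one unchanged operand (`e_one=0.7`) is a hypothetical placeholder, not a value from this work, and the function name is likewise an assumption; only the ranking of the two orders is meaningful.

```python
def sequence_energy(operations, e_both=1.0, e_one=0.7):
    """Energy of a sequence of (left, right) operand pairs on one multiplier.

    e_both: energy when both operands differ from the previous operation.
    e_one:  energy when exactly one operand differs (hypothetical value).
    An operation whose two operands are unchanged is assumed to cost nothing.
    The first operation is charged e_both.
    """
    energy = 0.0
    previous = None
    for left, right in operations:
        if previous is None:
            changed = 2
        else:
            changed = int(left != previous[0]) + int(right != previous[1])
        energy += (0.0, e_one, e_both)[changed]
        previous = (left, right)
    return energy


# Order 1: x = a*b, y = c*d, z = d*a
order_1 = [("a", "b"), ("c", "d"), ("d", "a")]
# Order 2: x = a*b, z = a*d (operands reordered), y = c*d (operations swapped)
order_2 = [("a", "b"), ("a", "d"), ("c", "d")]

energy_1 = sequence_energy(order_1)
energy_2 = sequence_energy(order_2)
```

In Order 1 every operation changes both operands, whereas in Order 2 the second and third operations each keep one operand, so `energy_2` is lower for any `e_one` smaller than `e_both`.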

Each of the techniques presented in the chapter is evaluated with some benchmarks in which there is clear evidence that significant benefits can be obtained. The techniques are complementary, and not every technique produces a tangible improvement on every benchmark.

To cope with the high complexity of the problem of power reduction, all the techniques are based on heuristics which, in some cases, trade power for area and performance.

Finally, some of the transformations for low power have been automated and incorporated in high-level synthesis algorithms, and their complexity, along with the hardware and delay overhead of the synthesized circuit, are given. In particular, a list scheduling algorithm and a clique partitioning approach for register binding aiming at reducing power are presented and evaluated in Section.

PREVIOUS WORK

The efforts at the higher levels of the design process for low-power design have been devoted to estimating the power consumption and to reducing it through transformations.

Estimating power consumption

Landman has proposed two power-estimation models. In [LRa], a model that accounts for the random behavior of the LSB bits and the correlated behavior of the MSB bits is presented for datapath circuits. Figure shows the transition activity of each bit for several different two's-complement data streams. Each curve corresponds to a different temporal (i.e., between successive data) correlation. When the correlation is zero, the data stream is white noise and P(0→1) = P(1→0) = 0.25. When the magnitude of the correlation is close to one, the stream is highly correlated. The MSB region corresponds to the sign bits and, consequently, the transition probabilities of these bits clearly depend on the correlation of the data stream: a positive correlation corresponds to a lower activity, whereas a negative correlation corresponds to a higher activity. Three bit regions are observed in Figure. The MSB and LSB regions present a constant transition activity. The middle region may be modeled by linear interpolation. The breakpoints BP0 and BP1 are determined from the data stream statistics. In [LR] this model is extended to control-path circuits.

Figure: Activity of the bits of a data stream for different temporal correlations (data approximated from [LRa]). Axes: transition probability P(0→1) versus bit position; the curves are labeled with correlation values from −0.99 to 0.99, and BP0 and BP1 mark the breakpoints.

In [BB], different processor models that account for the energy for the major modes of computation are described: fixed throughput, maximum throughput and burst throughput. Techniques for trading off throughput and energy consumption are derived for these modes.

Recently, entropy-based approaches at the register-transfer level have been proposed in [DMP, Naj]. The idea is that the switching activity of a circuit can be predicted without simulation using either entropy or informational energy averages: only the characteristics of the input sequence and some knowledge about the composition of the circuit are used to obtain power estimations.

Figure: (a) Average activity at the output of a multiplier as a function of the value of the constant (data approximated from [CR]). (b) Parallel and serial implementations of an adder tree.

Reducing power consumption

Figure: Transformations for reducing (a) the activity at the address bus and (b) the number of memory references.

Few authors have addressed the set of transformations at the system and architecture layers to obtain lower-power designs. In [CR], the power consumption of additions and constant multiplications as a function of the operand activity is studied. The average activity at the output of the multiplier monotonically increases with the absolute value of the multiplier constant (Figure (a)). The average activity at the output of an adder may be approximated by the maximum of the activities of the operands. From this study, a data flow graph transformation is derived for the typical operations in signal processing applications shown in Figure (b). Values

i

are constants The conclusion is that the minimum average switching activity over

all the mo dules adders and multipliers of the dataow graph in Figure b is

obtained when either or Similar results are

0 1 2 3 0 1 2 3

obtained when the additions are p erformed serially Figure b In this case

the minimum activity is obtained when

0 1 2 3

In [WCF+] some memory transformations for low-power systems are hinted. The aim of these transformations is to reduce both the activity of the address lines and the number of off-chip references. One transformation that reduces the address transitions consists of reordering memory references so that as few bits as possible toggle between successive addresses supplied to the memory. For example, the instruction instr in the left-hand side code of Figure (a) may be moved from the scope of the first if instruction to the scope of the second one, or vice versa, if the switching activity at the address bus is thereby reduced. This transformation is only allowed if the instructions between the original and final positions of the moved instruction are independent.
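The effect of reordering memory references on the address bus can be sketched by counting the bits that toggle between successive addresses. The addresses and the function name below are hypothetical; the example only shows the metric, not the transformation of [WCF+] itself.

```python
def bus_transitions(addresses):
    """Number of address-bus bits that toggle over a sequence of addresses."""
    return sum(bin(a ^ b).count("1") for a, b in zip(addresses, addresses[1:]))


# Hypothetical 8-bit addresses: the same four references in two orders.
original_order = [0x0F, 0xF0, 0x0E, 0xF1]
reordered = [0x0F, 0x0E, 0xF0, 0xF1]

toggles_original = bus_transitions(original_order)
toggles_reordered = bus_transitions(reordered)
```

Grouping the references that are close in the address space reduces the number of toggles, but such a reordering is legal only when the references involved are independent.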

A transformation that reduces the number of off-chip references also consists of reordering memory references, so that locality is exploited. As an example, consider the left-hand side code of Figure (b). If we know that both the A and C arrays are available in a local memory because of previous operations, and that only array C is not needed after the execution of the code, the references to array B in the first for loop may throw some values of array C out of the local memory, values that will have to be fetched again in the second for loop. This can be prevented if both loops are reordered as shown in the right-hand side code of Figure (b).

In [CPRB] the traditional transformations for faster and smaller circuits are applied in order to evaluate the power consumption savings. Some of these transformations are the following: critical-path balancing and reduction, operation count reduction, optimization of datapath and control logic, operation substitution (e.g., simple multiplications by shift-add operations) and bitwidth optimization. Whenever the resulting circuit is faster than the required throughput, power-supply reduction can be applied to take advantage of its quadratic impact on consumption.

POWER-CONSUMPTION MODEL OF FUNCTIONAL UNITS

In high-level synthesis, a fast and accurate power-consumption estimation tool is necessary to narrow down the design space and to finally obtain the design to be implemented. The power-consumption model described in Section of Chapter is not applicable to high-level synthesis because it would require the very time-consuming task of first mapping each implementation down to the gate level and then applying the model to obtain the power estimation.

Power consumption in the datapath accounts for a large fraction of the overall power budget. Among the different types of units, functional units (adders, multipliers, etc.) highly contribute to the power consumption of the datapath. For example, consider a divider based on the Newton-Raphson iterative algorithm. The power consumption of the functional units of the divider has been estimated to be approximately the of the total power of the circuit, whereas only a is because of the registers, wires, control and multiplexers [LR].

We use a high-level power-consumption estimation based on power profiles of different units, with special insight into functional units. It is possible to obtain a profile of the power consumption of each unit by simulating it and averaging the power consumed over many input samples. This method provides a fast way of estimating power consumption and it has been used previously by other authors [MK, PC, LRa, KKRV].

Two power-consumption models of functional units will be described. Both consider the variability of the operands at the inputs of the functional unit. In one of the models the activity of the operands is taken into account, whereas the other model is based on the repetition of the operands.

Operand activity relates to the variability of the bit pattern of one operand from one operation to the next. Power consumption is related to the Hamming distance between consecutive bit patterns. Operand repetition relates to the coarse-grained variability of the operand, i.e., the operand may or may not change between two consecutive operations.

Figure: (a) Coarse-grained and (b) fine-grained power-consumption model for an 8x8-bit radix-4 Booth multiplier. Panel (a): energy (nJ/op.) versus the value of the unchanged operand. Panel (b): energy (nJ/op.) versus H(x) and H(y).

Henceforth, we will call coarse-grained model the model based on the operand repetition and fine-grained model the one based on the operand activity.

Since these power-consumption models target datapath circuits, whenever we refer to the power consumption of a circuit we mean the energy per operation executed by that circuit. Datapath circuits have a fixed throughput and, therefore, the energy/operation metric is the most appropriate to quantify the energy efficiency of this type of circuits [BB].

Coarse-grained model

To illustrate how this model works, consider Figure (a). The plot represents the energy of an 8x8-bit Booth multiplier [Kor] (in nJ/operation) when one operand remains unchanged (x axis) with respect to the previous operation executed on the multiplier and the other operand varies randomly. One line is the average energy of the plot, and another line is the average energy when both operands vary randomly with respect to the previous operation. Comparing both lines, we observe that the average power consumption of the multiplier is lower when one operand remains unchanged. It is interesting to notice the influence of the bit pattern of the unchanged operand on the power consumption, e.g., operands with many zeroes reduce the activity of the multiplier.

Factor       Definition                     bit     bit     bit
rca          $P_{rca1} / P_{rca2}$
cla          $P_{cla1} / P_{cla2}$
amul         $P_{amul1} / P_{amul2}$
booth        $P_{booth1} / P_{booth2}$
rca-amul     $P_{rca2} / P_{amul2}$
rca-booth    $P_{rca2} / P_{booth2}$
cla-amul     $P_{cla2} / P_{amul2}$
cla-booth    $P_{cla2} / P_{booth2}$

Table: Factors of the coarse-grained model.

Absolute measures of the power consumption are very time-consuming in high-level synthesis, whereas relative measures give satisfactory results. Power consumption estimations obtained with the coarse-grained model are relative to the number of operands that remain unchanged at the inputs of the functional units. Relative power estimations are expressed with the two families of factors shown in Table. The notation followed in this table is the following: $P_{fu2}$ is the average energy spent by functional unit fu when its two operands vary randomly at each cycle; $P_{fu1}$ is the energy spent when one operand remains unchanged and the other varies randomly; rca, cla, amul and booth stand for ripple-carry adder, carry-lookahead adder, array multiplier and Booth multiplier, respectively.

The factors of the first family (a single functional unit) hardly change with the operand bitwidth, but those of the second family (adder versus multiplier) clearly change because of the larger increase in power of the multipliers with respect to the adders.

The factors of both models have been obtained by means of switch-level simulations (see Appendix A, Section A) of functional units implemented with the library cells provided with the Ocean system (see Appendix A, Section A).

As a simple example of how the coarse-grained model is applied, assume that the dataflow graph (DFG) shown in Figure (a) is part of a larger DFG scheduled with two carry-lookahead adders (A1 and A2) and two Booth multipliers (M1 and M2). Assume also an operand bitwidth of for all functional units.

Two schedules and functionalunit bindings for the DFG of Figure a are shown

in Figures b and c The p ower consumption b ecause of the schedule and

binding in Figure b is expressed as P P whereas

cla2 booth2 clabooth

in the schedule and binding in Figure c it is expressed as P P P

cla2 cla1 booth2

P Thus the p ower reduction estimated

booth1 booth cla clabooth

Figure: (a) Dataflow graph; (b) and (c) are two possible schedules and functional-unit bindings.

obtained with the schedule and binding in Figure c with respect to the one in

Figure b is

clabooth cla booth

clabooth

which yields a power reduction of

The coarse-grained model provides a fast estimation of the power consumption when no information on the activity of the input data to the functional units is available. Power consumption, however, depends heavily on the input data activity. As an example, consider a bit Booth multiplier with one operand remaining unchanged and the other varying randomly (coarse-grained model). In this case the multiplier observes an average of bit changes on the varying operand. But if the varying operand presents an average of bit changes, the power in the multiplier is reduced a
Thus, whenever the information on the input data activity is known, a more accurate model should be used.

Fine-grained model

The activity of an operand can be measured with the Average Hamming Distance (AHD), defined as

$$AHD(x) = \lim_{n \to \infty} \frac{\sum_{i=1}^{n} H(x_i, x_{i-1})}{n}$$

where $H(p, q)$ is the Hamming distance between $p$ and $q$, $x_i$ is the value of operand $x$ in iteration $i$ of the algorithm, and $n$ is the total number of iterations.
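The definition above is a limit, but in practice the AHD would be estimated from a finite trace of operand values. The following is a minimal sketch of such an estimate, assuming two's-complement operands of a fixed bitwidth; the function names `hamming` and `average_hamming_distance` are illustrative and do not come from the original work.

```python
def hamming(p: int, q: int, width: int) -> int:
    """Number of bit positions in which p and q differ, seen as width-bit words."""
    mask = (1 << width) - 1
    return bin((p ^ q) & mask).count("1")


def average_hamming_distance(values, width):
    """Finite-trace estimate of the AHD: mean distance between consecutive values."""
    if len(values) < 2:
        raise ValueError("at least two values are needed")
    total = sum(hamming(a, b, width) for a, b in zip(values, values[1:]))
    return total / (len(values) - 1)


if __name__ == "__main__":
    # An operand that toggles every bit at every iteration has an AHD equal to its width.
    print(average_hamming_distance([0x00, 0xFF, 0x00, 0xFF], width=8))
```

A constant operand gives an estimate of zero, which corresponds to the "operand remains unchanged" case of the coarse-grained model.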

Registertransfer level techniques for low power

Figure The input data order in (a) leads to a higher operand AHD than in (b). Big arrows indicate the input data. Thin arrows indicate the AHD between two consecutive input data. In both panels an 8x8-bit radix-4 Booth multiplier produces the results 1462, 2100 and 6300.

(a) First operand: -86, -35, 100 (Hamming distances 6, 5). Second operand: -17, -60, 63 (Hamming distances 4, 7).

(b) First operand: -86, -60, 100 (Hamming distances 5, 2). Second operand: -17, -35, 63 (Hamming distances 3, 4).

Figure b illustrates the effect of the operand activity on the power consumption of an bit Booth multiplier. Obviously, power consumption tends to zero when the AHD of both operands tends to zero.

A simple example of how the fine-grained model is applied is shown in Figure b. The multiplier in the figure has to perform three operations (Figure b). Taking advantage of the commutative property of the multiplication operation, a better input data order is obtained in Figure b, where the AHD of both operands has decreased from to
This decrease in the AHD of the operands makes the multiplier a less power consuming when executing the last two operations.
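The effect of commuting one multiplication can be checked numerically. The sketch below uses the operand values that appear in the figure (8-bit two's complement) and computes the Hamming distances seen at each multiplier input; the helper names are illustrative assumptions, not part of the original text.

```python
def hamming(p, q, width=8):
    """Bit positions in which p and q differ, seen as width-bit two's-complement words."""
    mask = (1 << width) - 1
    return bin((p ^ q) & mask).count("1")


def port_distances(operations, width=8):
    """Hamming distances observed at each input port between consecutive operations."""
    left = [a for a, _ in operations]
    right = [b for _, b in operations]
    return (
        [hamming(x, y, width) for x, y in zip(left, left[1:])],
        [hamming(x, y, width) for x, y in zip(right, right[1:])],
    )


# Three multiplications with products 1462, 2100 and 6300.
order_a = [(-86, -17), (-35, -60), (100, 63)]
order_b = [(-86, -17), (-60, -35), (100, 63)]  # second multiplication commuted

if __name__ == "__main__":
    print(port_distances(order_a))  # ([6, 5], [4, 7])
    print(port_distances(order_b))  # ([5, 2], [3, 4])
```

Averaged over both ports, the distance per transition drops from 5.5 to 3.5 bits, and the three products are unchanged because multiplication is commutative.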

REGISTER-TRANSFER LEVEL TECHNIQUES FOR LOW POWER

This section proposes a set of register-transfer level synthesis techniques that focus on reducing the power consumption of functional units. Implementation of some of these techniques is presented in Section

The techniques proposed are summarized as follows:


- loop interchange takes advantage of data locality to reduce the activity of the inputs of the functional units;

- operand reordering seeks an appropriate operand order for commutative operations to reduce the switching activity;

- operand sharing attempts to schedule and bind operations to functional units in such a way that the activity of the input operands is reduced;

- operand retaining attempts to minimize the useless power consumption of the idle units; and

- operand similarity uses the information on the similarity among the operands in the scheduling and register-binding steps.

All the benchmarks used to evaluate the techniques have only additions and multiplications. The power reduction estimations obtained with these techniques are

estimated as a function of and with values and

mul add booth cla

clabooth

resp ectively in Table for bitwide functional units

Loop interchange

The loop-interchange technique has been traditionally implemented in compilers to obtain dependency graphs with a higher degree of parallelism, to reduce stride, or to increase data locality and thus register reuse [ASU, AK].

As an example, consider the algorithm in Figure a and its associated dependence graph in Figure a. The compiler creates a dependence graph to capture the dependency information for a piece of code. Each node in the graph typically represents one statement. An arc between two nodes indicates that there is a dependence between the computations they represent. The dashed line shows the execution order of the iterations of the algorithm.

With the algorithm in Figure b, the execution order varies as shown in Figure b. If matrix A is laid out in memory in column-major form, the execution order in (a) implies more cache misses than the execution order in (b). Thus the compiler chooses algorithm (b) to reduce the running time.

Notice that in this example the loop interchange is a legal transformation, since iterations in (b) are all executed after the iterations they depend upon. But in general the compiler has to verify that the transformation does not change the semantics of the program [Pug, SFCM].


for j to N

for i to N

Ai j Ai j

for i to N

for j to N

Ai j Ai j b

j a j i

i

a b

Figure Example of application of the loop-interchange technique

We apply loop interchange with the goal of minimizing the number of operand changes at the inputs of the functional unit. This technique will be applied to the motion-estimation algorithm for image compression [LK] (Figure a) and to the matrix-vector product algorithm to illustrate its efficiency.

Application of loop interchange

In the algorithm of Figure a we observe three operations in the inner loop: absolute value, addition and subtraction. We assume the datapath depicted in Figure a for the execution of this algorithm. For simplicity, we will consider a subtraction to be the same as an addition in terms of power consumption.

The absolute value in two's complement arithmetic is computed in two steps: (a) check whether the number is negative, and (b) complement the number and add 1 in this case. The first step represents a negligible contribution to the total power consumption (just check whether the MSB is one). On average, the second step will be executed half of the time.
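The two steps of the absolute-value computation can be written out explicitly. The following is a small sketch on fixed-width words; the function name `abs_twos_complement` and the `width` parameter are illustrative assumptions.

```python
def abs_twos_complement(value: int, width: int) -> int:
    """Absolute value of a width-bit two's-complement number, returned as an unsigned word."""
    mask = (1 << width) - 1
    word = value & mask
    # Step (a): the number is negative when its most significant bit is one.
    negative = (word >> (width - 1)) & 1
    if negative:
        # Step (b): complement every bit and add 1 (an increment on an adder).
        word = (~word + 1) & mask
    return word


if __name__ == "__main__":
    print(abs_twos_complement(-5, 8), abs_twos_complement(5, 8))
```

Only step (b) uses the adder, which is why the text counts it as an increment executed for roughly half of the inputs. The most negative value (for example -128 with 8 bits) has no positive counterpart in the same width and maps to the unsigned word 128.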

In the algorithm in Figure a we also observe that both operands of the accumulation usually change with respect to the previous iteration of the algorithm


P, L: bitlength and bitwidth of the current frame

M, N: maximum horizontal and vertical vector coordinates

m, n: bitlength and bitwidth of the current block

CV, RV: current and reference image frame value

CF, RF: current and reference frame

MV(g, h): motion vector of block (g, h)

P

for g to d e

m

L

e for h to d

n

g h

optimal

M M 1

for i b c to b c

2 2

N 1 N

c to b c for j b

2 2

i j

par t

for k to m

for l to n

CV CF m g k n h l

RV RF m g i k n h j l

i j i j jCV RV j

par t par t

if i j g h then

par t optimal

g h i j

optimal par t

T

MV g h i j

a

M 1 M

c to b c for i b

2 2

N N 1

for j b c to b c

2 2

i j

par t

P

e for g to d

m

L

for h to d e

n

for k to m

for l to n

CV CF m g k n h l

M M 1

for i b cto b c

2 2

N N 1

for j b cto b c

2 2

RV RF m g i k n h j l

i j i j jCV RV j

par t par t

g h

optimal

M M 1

for i b c to b c

2 2

N 1 N

c to b c for j b

2 2

if i j g h then

par t optimal

g h i j

optimal par t

T

MV g h i j

i j

par t

b

Figure (a) Motion-estimation algorithm and (b) motion-estimation algorithm with two loop interchanges

and both operands of the subtraction inside the absolute value operator also

Figure Datapath for executing (a) the motion-estimation algorithm and (b) the matrix-vector product algorithm

change, because both are fetched from memory in the inner loop, where the absolute value operation is executed.

But in the algorithm in Figure b we find that one operand of the subtraction inside the absolute value operator remains the same during M N iterations.

With the notation in Table the power-consumption estimation of algorithm (a) is roughly

P

abs

P P LM N P

a add2

and the power-consumption estimation of algorithm (b) is

P

abs

P P LM N P P

b add2 add1

We have estimated by simulation the average power consumption of the increment

op eration executed on an adder as P P Thus the estimated reduction

abs add2

factor on power consumption is

add

R

add

obtaining a reduction of the power consumption of

Power consumption has only been estimated for the functional units of the datapath.

With algorithm (b), both the control logic and the off-chip reference order have changed. Of course, a wise use of local registers is expected to minimize off-chip references. This is important particularly in the motion-estimation algorithm, where


for i to N

for j to N

C i

tmp

for i to N

for i to N

tmp B i

tmp tmp Ai j B i

for j to N

C j tmp

C j C j Ai j tmp

a

b

Figure Matrix-vector product algorithm when referencing matrix A (a) by rows and (b) by columns

its data working set is considerably large. In order to minimize off-chip references, the most frequently referenced data can be stored in an internal cache. This implies that the algorithm must adapt its structure to the size of this internal cache to properly exploit data locality.

Moreover, the loop-interchange technique should be applied together with transformations that optimize the number of off-chip references [WCF+]. Memory-related power consumption is an important factor, especially in multidimensional signal processing systems.

The loop interchange has also been applied to the matrix-vector product algorithm. Two different implementations of the algorithm can be devised depending on how the matrix is referenced: by rows (see Figure a) or by columns (see Figure b). If the matrix is referenced by rows, $N^2$ multiplications are needed in which both operands may change with respect to the previous multiplication. But if the matrix is referenced by columns, only $N$ of those multiplications are necessary; the rest have one operand fixed, which corresponds to the different values of the vector. This yields a power reduction of when referencing the matrix by columns rather than by rows. We assume the datapath depicted in Figure b for the execution of this algorithm.
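The operand counts quoted above can be reproduced with a small simulation of what a single multiplier sees at its two inputs. This is only a sketch: the index expressions of the original figure are not recoverable, so the two traversals below (`by_rows`, `by_columns`) are assumed forms of the row-wise and column-wise algorithms, and operands are tracked by name rather than by value.

```python
def count_multiplier_events(operations):
    """Return (both_change, one_retained) for a sequence of (left, right) operand names."""
    both_change = one_retained = 0
    previous = (None, None)
    for left, right in operations:
        if left != previous[0] and right != previous[1]:
            both_change += 1
        else:
            one_retained += 1
        previous = (left, right)
    return both_change, one_retained


def by_rows(n):
    # Row i of the matrix is multiplied by the whole vector: the vector element
    # presented to the multiplier changes at every multiplication.
    return [(("A", i, j), ("B", j)) for i in range(n) for j in range(n)]


def by_columns(n):
    # Vector element i is held while column i of the matrix is traversed.
    return [(("A", j, i), ("B", i)) for i in range(n) for j in range(n)]


if __name__ == "__main__":
    print(count_multiplier_events(by_rows(4)))      # (16, 0)
    print(count_multiplier_events(by_columns(4)))   # (4, 12)
```

For N = 4 the row-wise order gives 16 multiplications with both inputs changing, while the column-wise order gives 4 such multiplications and 12 with one input retained, matching the $N^2$ versus $N$ counts in the text.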


Operand reordering

The goal of this technique is to find an appropriate input operand order for commutative operators in such a way that switching activity is reduced. This is achieved by reusing the same input operands to the operator among consecutive iterations of the algorithm implemented by the operator.

The operand-reordering technique can be implemented using existing techniques integrated in compilers for the manipulation of dependence graphs [ASU].

In order to estimate its efficiency, this technique will be applied to the multiply-accumulate (MAC) operator and to the low-pass image-filter operator.

The MAC operator

Digital filters are basic components in digital signal processing systems. A typical substructure of a filter is the MAC operator, which performs the operation $\sum_{i=0}^{p-1} x_i y_i$, where $p$ multiplications and $p-1$ additions are executed.

One possible DFG of a 4-input MAC operator is shown in Figure a. Three adders and four multipliers are used to implement this MAC operator. There are other ways to organize the additions, but the balanced structure of Figure a implies less power consumption [CPRB].

Figure b shows a th-order LMS adaptive filter [TJL]. In the LMS filter and in some other digital filters (e.g., FIR and IIR filters) the MAC operator plays an important role, and therefore minimizing its power consumption will decrease the total power consumption of the filter.

Application of operand reordering

For power consumption purposes the MAC operator is classified into three cases: (a) both the x and y values change from one iteration to the next one (the general case), (b) either the x or the y values are constant, and (c) either the x or the y values of iteration $i$ are the same as those of iteration $i-1$ but shifted one position. The IIR and FIR filters follow cases (b) and (c). In the LMS filter the MAC operator follows case (c).

In order to propose a better operand reordering for cases (a) and (b), the activity of the operands is taken into account, whereas for case (c) the repetition of the operands (see Section ) will determine the new operand reordering.

Figure (a) MAC operator for $p = 4$ and (b) DFG of the th-order LMS adaptive filter

Case (b) has been addressed in [CR], and the conclusion is that the minimum

average activity over all nodes of the balanced MAC operator is obtained when the

constant op erands say the y values satisfy y y y or y y

0 1 n 0 1

y

n

We will focus on case (c). As previously explained, operand repetition will determine the new reordering. In the MAC operator of the LMS filter of Figure b we observe that all multipliers receive different operands at each iteration: the x values are shifted one position to the left and the first position is the new operand value; the h values are recalculated at each iteration and therefore are different. This fact is clearly shown in Table (Reordering A).

Table (Reordering B) shows a different operand reordering that takes advantage of the shift-wise behavior of the x values. With this new reordering, each multiplier will have one fixed operand (the x value) during four consecutive iterations.
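The two reorderings can be generated and compared programmatically. The sketch below is an illustration under stated assumptions: iteration 0 stands for iteration i of the table, each operation is written as a pair (sample index, coefficient index), and the function names are invented for this example. Coefficients are recalculated at every iteration, so only the sample operand can be retained.

```python
def reordering_a(t, taps=4):
    """Multiplier m multiplies sample x[t - m] by coefficient h^m of iteration t."""
    return [(t - m, m) for m in range(taps)]


def reordering_b(t, taps=4):
    """Multiplier m keeps the sample it loaded most recently; a new sample is
    loaded into a different multiplier at every iteration."""
    operations = []
    for m in range(taps):
        loaded_at = t - ((t + m) % taps)   # last iteration in which m took a new sample
        operations.append((loaded_at, t - loaded_at))
    return operations


def retained_sample_fraction(reordering, iterations, taps=4):
    """Fraction of multiplications whose sample operand equals the previous one on that multiplier."""
    retained = 0
    previous = reordering(0, taps)
    for t in range(1, iterations):
        current = reordering(t, taps)
        retained += sum(1 for m in range(taps) if current[m][0] == previous[m][0])
        previous = current
    return retained / (taps * (iterations - 1))


if __name__ == "__main__":
    print(retained_sample_fraction(reordering_a, 9))  # 0.0
    print(retained_sample_fraction(reordering_b, 9))  # 0.75
```

Both reorderings compute the same set of products in every iteration; with four multipliers, the second one keeps the sample operand unchanged in three out of four multiplications.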

Reordering A

iter   M0                    M1                    M2                    M3
i      x_{i} h^0_{i}         x_{i-1} h^1_{i}       x_{i-2} h^2_{i}       x_{i-3} h^3_{i}
i+1    x_{i+1} h^0_{i+1}     x_{i} h^1_{i+1}       x_{i-1} h^2_{i+1}     x_{i-2} h^3_{i+1}
i+2    x_{i+2} h^0_{i+2}     x_{i+1} h^1_{i+2}     x_{i} h^2_{i+2}       x_{i-1} h^3_{i+2}
i+3    x_{i+3} h^0_{i+3}     x_{i+2} h^1_{i+3}     x_{i+1} h^2_{i+3}     x_{i} h^3_{i+3}
i+4    x_{i+4} h^0_{i+4}     x_{i+3} h^1_{i+4}     x_{i+2} h^2_{i+4}     x_{i+1} h^3_{i+4}
i+5    x_{i+5} h^0_{i+5}     x_{i+4} h^1_{i+5}     x_{i+3} h^2_{i+5}     x_{i+2} h^3_{i+5}
i+6    x_{i+6} h^0_{i+6}     x_{i+5} h^1_{i+6}     x_{i+4} h^2_{i+6}     x_{i+3} h^3_{i+6}
i+7    x_{i+7} h^0_{i+7}     x_{i+6} h^1_{i+7}     x_{i+5} h^2_{i+7}     x_{i+4} h^3_{i+7}

Reordering B

iter   M0                    M1                    M2                    M3
i      x_{i} h^0_{i}         x_{i-1} h^1_{i}       x_{i-2} h^2_{i}       x_{i-3} h^3_{i}
i+1    x_{i} h^1_{i+1}       x_{i-1} h^2_{i+1}     x_{i-2} h^3_{i+1}     x_{i+1} h^0_{i+1}
i+2    x_{i} h^2_{i+2}       x_{i-1} h^3_{i+2}     x_{i+2} h^0_{i+2}     x_{i+1} h^1_{i+2}
i+3    x_{i} h^3_{i+3}       x_{i+3} h^0_{i+3}     x_{i+2} h^1_{i+3}     x_{i+1} h^2_{i+3}
i+4    x_{i+4} h^0_{i+4}     x_{i+3} h^1_{i+4}     x_{i+2} h^2_{i+4}     x_{i+1} h^3_{i+4}
i+5    x_{i+4} h^1_{i+5}     x_{i+3} h^2_{i+5}     x_{i+2} h^3_{i+5}     x_{i+5} h^0_{i+5}
i+6    x_{i+4} h^2_{i+6}     x_{i+3} h^3_{i+6}     x_{i+6} h^0_{i+6}     x_{i+5} h^1_{i+6}
i+7    x_{i+4} h^3_{i+7}     x_{i+7} h^0_{i+7}     x_{i+6} h^1_{i+7}     x_{i+5} h^2_{i+7}

Table Two different input reorderings for the 4-input MAC unit. M_j represent the multiplications of the MAC unit

The estimated power consumption of the MAC operator with reordering A after p

iterations is

P p p P p P p P p p

A mul2 add2 mul2

and the estimated power consumption with reordering B after p iterations is

P p p P p P p P p p

B mul mul1 add2 mul2 mul

Thus the estimated power-consumption reduction factor from reordering A to B is

p

mul mul

R

mul

p

achieving a power-consumption reduction

The operand-reordering technique has also been applied to the low-pass image-filter operator using adders (its algorithm is shown in the next section). In this operator the accumulation of values of the image is the main operation. Two consecutive iterations of the algorithm have out of the input operands in common. Thus an input operand reordering can be devised where three of the nine adders will have the same operands during three consecutive cycles, which yields an estimated reduction in power of
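The overlap between consecutive iterations can be illustrated by listing the pixels that each output accumulates. The sketch assumes a 3x3 window centred on the output pixel, which is an assumption suggested by the nine-term accumulation of this filter; the function name `window` is illustrative.

```python
def window(i, j):
    """Pixel coordinates accumulated for output (i, j), assuming a 3x3 window."""
    return {(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)}


if __name__ == "__main__":
    current_inputs = window(5, 5)
    next_inputs = window(5, 6)            # next iteration of the inner loop
    shared = current_inputs & next_inputs
    print(len(current_inputs), len(shared))
```

Under this assumption two consecutive inner-loop iterations share two of the three window columns, so six of the nine inputs of one iteration reappear in the next.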

Implementation of the operand reordering

An appropriate operand reordering can be obtained by manipulating, using existing compiler techniques [ASU], a graph similar to a dependence graph associated with the operator. In this graph, an arc between two nodes indicates that an input operand can be reused between the computations they represent. Let us call this graph an input-reusing graph.

To illustrate how a compiler can manipulate an input-reusing graph, consider again the MAC operator. The input reordering shown in Table (Reordering A) corresponds to the input-reusing graph in Figure a. The y-axis represents the iterations and the x-axis corresponds to the functional units being considered (in this case, the four multipliers of the MAC operator). A node in the graph corresponds to a multiplication.

The compiler has to obtain the best mapping of multiplications onto the multipliers. This is accomplished by maximizing the number of input-reusing dependencies among the multiplications executed in each multiplier, i.e., by maximizing the number of vertical input-reusing dependencies. Figure a shows the optimum input reordering, which corresponds to the one represented in Table (Reordering B).

Another example is the matrix-vector product operator when the matrix has the symmetry property (see Figure b). Figure b shows the dataflow graph for one iteration of this operator using four multipliers and three adders. Again we will focus on the multipliers for reducing power. Figure b shows a possible input-reusing graph with no input-reusing dependencies among the multiplications executed in each multiplier, i.e., all possible input-reusing dependencies are shown as dashed arcs, meaning that they are not achieved by the compiler. Figure b obtains an improved input reordering with three input-reusing dependencies.

The compiler can also obtain more input-reusing dependencies if another multiplier is available in the targeted architecture, as can be observed in Figure b.

To synthesize the input reordering obtained by the compiler, it is necessary to route the operands to the desired input of the functional unit. This implies in some cases

Symmetric matrix-vector product used in the figure:

$$\begin{pmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{01} & a_{11} & a_{12} & a_{13} \\ a_{02} & a_{12} & a_{22} & a_{23} \\ a_{03} & a_{13} & a_{23} & a_{33} \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{pmatrix}$$

In each iteration, multipliers M0 to M3 compute the products $a_{i0} x_0$, $a_{i1} x_1$, $a_{i2} x_2$ and $a_{i3} x_3$, which the adders accumulate into $y_i$.

Figure Different input-reusing graphs for (a) the MAC operator and (b) the symmetric matrix-vector product operator

an area overhead in terms of interconnection units and an increase in complexity in the


Operand sharing

The operand-sharing technique attempts to schedule and bind operations to functional units in such a way that the activity of the input operands is reduced. Operations sharing the same operand are bound to the same functional unit and scheduled in such a way that the functional unit can reuse that operand.

This technique is efficient when it is applied to a DFG with variables used by more than one operation. The AR filter [Kun] will be used to illustrate this technique. The DFG of the AR filter is presented in Figure a.

Figure b shows a possible schedule of the AR filter with two adders (one cycle) and one pipelined multiplier (two cycles). We observe that there are some operations whose result is the input for more than one operation (thick lines in Figure a). For example, the result of addition is an input for multiplications and
Assume we bind multiplications and to the same unit U. Assume also that between the execution of multiplication and there is no other use of unit U. Then one of the operands of unit U will not change from multiplication to multiplication
Henceforth, we will call operand reutilization (OPR) the fact that an operand is reused by two operations consecutively executed in the same functional unit. In Figure a, OPRs can be potentially obtained in a multiplier.
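Counting OPRs for a given binding amounts to looking at consecutive operations on each unit. The sketch below is an illustration with hypothetical operand names (`s1`, `c1`, and so on); it treats operations as commutative, so a shared operand counts regardless of the input it arrives on.

```python
def count_oprs(unit_schedule):
    """Number of consecutive operation pairs on one unit that share an input operand.

    `unit_schedule` lists the operations bound to a unit in execution order,
    each given as a pair of operand names.
    """
    oprs = 0
    for previous, current in zip(unit_schedule, unit_schedule[1:]):
        if set(previous) & set(current):
            oprs += 1
    return oprs


if __name__ == "__main__":
    # Two multiplications consume the same sum s1.
    interleaved = [("s1", "c1"), ("s2", "c3"), ("s1", "c2")]
    adjacent = [("s1", "c1"), ("s1", "c2"), ("s2", "c3")]
    print(count_oprs(interleaved), count_oprs(adjacent))  # 0 1
```

The same operations yield zero or one OPR depending only on their order on the unit, which is the degree of freedom that the scheduling and binding steps exploit.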

An alternative schedule and unit binding, with the same latency as the one in (b), is presented in Figure c, with all OPRs achieved. In the schedule and unit binding of Figure b no OPRs are obtained.

Thus the estimated power consumption of one iteration in schedule (b) is

P P P P

b add2 mul2 mul2

and the estimated power consumption of one iteration in schedule (c) is

P P P P P

c mul add2 mul2 mul1 mul2 mul

The estimated power-consumption reduction is

mul

R

mul

achieving an reduction

Figure (a) DFG of the AR filter; (b), (c) two possible schedules and bindings

Application of loop unrolling for operand sharing

The operand-sharing technique is applied when some operations share the same operand in the same iteration of the algorithm. But it can also be applied even if operands feed more than one operation in different iterations. We just need to unroll the loop [BGS]. The technique of loop unrolling replicates the body of a


for i to N step

for i to N

Ai Ai Ai Ai

Ai Ai Ai Ai

Ai Ai Ai Ai

a

b

Figure (a) Original code and (b) after loop unrolling

loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism, and improves register, data cache, or TLB locality. Because of these properties, this technique has been implemented in compilers to optimize the code for performance.

Figure shows an example of loop unrolling. The original code in Figure a is unrolled once in Figure b. Value N is assumed to be even. Loop overhead is cut in half because two iterations of the original loop are performed in each iteration of the unrolled loop. If array elements are assigned to registers, register locality is improved because Ai and Ai are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
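The transformation can be illustrated with a small loop. The loop body below is a hypothetical one chosen for this sketch (the body of the original figure is not recoverable); the point is only that unrolling once, with step 2, preserves the result while placing two consecutive iterations in the same loop body.

```python
def original(a):
    """Hypothetical loop: each interior element becomes the sum of its two neighbours."""
    a = list(a)
    for i in range(1, len(a) - 1):
        a[i] = a[i - 1] + a[i + 1]
    return a


def unrolled_once(a):
    """Same loop unrolled once: two original iterations per loop iteration, step 2."""
    a = list(a)
    if (len(a) - 2) % 2 != 0:
        raise ValueError("the number of original iterations is assumed to be even")
    for i in range(1, len(a) - 1, 2):
        a[i] = a[i - 1] + a[i + 1]        # original iteration i
        a[i + 1] = a[i] + a[i + 2]        # original iteration i + 1
    return a


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6]
    print(original(data))
    print(unrolled_once(data))
```

In the unrolled body, `a[i]` and `a[i + 1]` each appear in both statements, so values that were used in different iterations of the original loop now meet in the same body, where scheduling and binding can exploit them.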

The low-pass image filter [Lim] will be used to illustrate how the operand-sharing and loop-unrolling techniques can be combined for power optimization. A DFG of the low-pass image filter is shown in Figure b. We see that no OPR is possible. But if we unroll the inner loop twice (the loop body now contains three iterations), the DFG of Figure c, with potential OPRs, is obtained.

With one adder, the schedule of the DFG in Figure c has a cycle latency and the one in (b)
Therefore the total latency of the algorithm is the same in both schedules. Moreover, all OPRs are achieved.

The estimated power-consumption reduction is now

R

add add

obtaining a reduction of

There are some drawbacks in unrolling a loop that may diminish the power savings achieved on the functional units. First, the lifetime of the variables may increase. A side effect is the possible increase in the number of registers used. This is not a problem, since an increase in the number of registers does not imply an increase in

Benchmark Power red

th-order Elliptic filter DDN cycles cycle

th-order Daubechies filter PTVF cycles cycle

SHARF TJL p cycles cycle

D input Lee DCT RY cycles cycle

D input Chen DCT RY cycles cycle

matrix multiplier cycles cycle

Table Results obtained by applying the operand-sharing technique

power consumption if the variables of the algorithm are properly assigned to those registers (see Section ), unless

- the number of registers in the targeted architecture is fixed and it is exceeded by the number of registers used. In this case the contents of some of them have to be temporarily stored in memory and loaded back when needed (register spillage); and

- the number of registers in the targeted architecture is not fixed (we can choose it since we design the architecture) but there is an area constraint.

Second, the complexity of the control unit is increased because more cycles are needed to schedule the operations of the unrolled loop.

Application of the technique to other benchmarks

Table shows the results obtained when applying the operand-sharing technique to other high-level synthesis benchmarks.

In all benchmarks except for the Elliptic filter, the results have been obtained by comparing the power-consumption estimation of the schedule with the fewest achieved OPRs and the schedule with the largest number of achieved OPRs, both schedules having the lowest possible latency.

In the Elliptic filter we have detected a tradeoff between the speed and the consumption of the final design: it is possible to obtain a design with a longer latency ( cycles vs. cycles) but also with a higher number of achieved OPRs.


for i to M

for j to N

out

Ai j a

Ai j a

Ai j a

Aij b

Aij b

Aij b

Ai j c

Ai j c

Ai j c a a0 a1a2a3a4 b0 b1b2b3b4 c0 c1c2c3c4 a0 a1 a2 b0 b1 b2 c0 c1 c2

out

out0 out1 out2

b

c

Figure (a) Low-pass image filter algorithm, (b) DFG of the inner loop of (a) without division by nine, and (c) DFG after loop unrolling

Operand retaining

Not all resources of a datapath are used during all cycles. Some remain idle when no operation is available for them. A functional unit consumes useful power when it is executing an operation, and consumes useless power when there is an input operand variation while the functional unit is idle. One of the reasons for this possible variation of the input operands is that the control unit is usually synthesized using don't-care values to minimize area or increase speed. Thus, an idle functional unit may have input operand changes because of the variation of the selection signals of multiplexers.

The operand-retaining technique attempts to minimize the useless power consumption of the idle functional units. Useless power is especially important in datapaths


Benchmark   Idle Power

time reduc

AR lter Kun p c  c

thorder Daubechies lter PTVF c  c

D input Lee DCT RY c  c

 matrix multiplication c  c

unrolled lowpass image lter  c

LMS adaptive lter TJL c  c

pixel interpolation BS  c

thorder El liptic lter DDN c  c

Table Idle time spent by the functional units

synthesized from sparse schedules. A schedule is said to be sparse if its unit occupation is relatively low. Table presents the idle time spent by the functional units for some high-level synthesis benchmarks, for a schedule with the functional units of the fourth column. The total number of operations for each benchmark is shown in the third column.

Some approaches to minimize the useless power consumption of idle units are: (a) a proper register binding that minimizes the activity of the functional units (this technique is addressed in Section ); (b) wisely defining the control signals of the multiplexers during the idle cycles in such a way that the changes at the inputs of the functional units are minimized (this may result in defining some of the don't-care values of the control signals) [RJ]; and (c) latching the operands of those units that will often be idle.

In this section, approach (c) is evaluated. It consists of the insertion of latches at the inputs of the functional units to store the operands only when the unit requires them. Thus, in those cycles in which the unit is idle, no consumption is produced. The control unit has to be redesigned accordingly, in such a way that input latches become transparent during those cycles in which the corresponding functional unit must execute an operation.

The outcome of this technique in terms of power savings is somewhat similar to the well-known technique of putting the functional units (or other larger parts of the circuit) to sleep when they are not needed through gated-clocking strategies [CSB, BSdM, AMD+].

The operand-retaining technique has been evaluated on several benchmarks. For example, the least-mean-square (LMS) filter has been scheduled with two multipliers (two cycles) and one adder (one cycle). In each iteration of this algorithm, the adder becomes idle during cycles and the multipliers during cycles


The power consumption generated by the idle units (useless consumption) is

P P P

useless add mul add1 mul1 add mul

and the power consumption because of the useful calculations (useful consumption)

is

P P P

usef ul add mul add2 mul2

The estimated reduction in power consumption is

P

useless

R

add mul

P P

useless usef ul

achieving a reduction of
This estimation does not take into account the power consumption overhead because of the latches. For simplicity in the evaluation, and to avoid synthesizing every control unit, we have assumed that on average only one of the operands changes in each idle unit at each cycle. This assumption may be optimistic or pessimistic depending on the final implementation. A more realistic assumption would basically depend on the register binding done for the variables of the algorithm.

More results obtained by applying the operand-retaining technique are reported in the last column of Table
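The estimate described above can be written as a short calculation. This is a sketch under the stated simplification (one operand changes per idle unit per cycle), reading the reduction as the useless share of the total; the energy values in the example are hypothetical placeholders and the function name is illustrative. Latch overhead is ignored, as in the text.

```python
def retaining_reduction(units):
    """Estimated fraction of functional-unit power removed by latching idle units.

    Each unit is a tuple (operations, idle_cycles, p_two_changing, p_one_changing):
    every executed operation is charged p_two_changing, and every idle cycle is
    charged p_one_changing because one operand is assumed to change.
    """
    useful = sum(ops * p2 for ops, _, p2, _ in units)
    useless = sum(idle * p1 for _, idle, _, p1 in units)
    return useless / (useless + useful)


if __name__ == "__main__":
    # Hypothetical numbers for one adder and two multipliers (relative energy units).
    adder = (4, 3, 1.0, 0.6)
    multiplier = (4, 2, 10.0, 7.0)
    print(round(retaining_reduction([adder, multiplier, multiplier]), 3))
```

A unit that is never idle contributes nothing to the numerator, so the estimated benefit grows with the sparsity of the schedule.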

Operand similarity

In the techniques previously presented, the main idea was to maximize the operand locality or, in other words, the operand repetition in the functional units. The operand-similarity technique takes into account the operand activity (see Section for the definition of operand repetition and operand activity). This technique uses the information on the similarity among the variables and constants of the algorithm in the scheduling and register-binding steps.

We will show how the activity of the input operands affects the power consumption of the design. Two examples will be presented to illustrate this technique: a low-power schedule for a finite impulse response (FIR) filter [TJL] and a low-power register binding for the Differential Equation Solver [GDWL].

A profiling of the algorithm to be synthesized is needed in order to determine the similarity (measured with the AHD) among its variables and constants.


To verify the effect of the input operand activity on power consumption, we have fed the LMS filter of Figure b with a random signal and with a waveform calculated as the sum of two sines. The AHDs of both signals differ by and this difference leads to a power reduction when the LMS filter is fed with the sum-of-two-sines signal.

Example: scheduling of the FIR filter

A FIR filter follows the equation $\sum_{i=0}^{p-1} x_i c_i$, where $c_i$ are constants.

When speed is not a major issue, a significant reduction in hardware complexity is achieved by performing multiplications over several clock cycles as a series of shift-add operations. When speed is important, the multiplications must be executed by multipliers. We will focus on this case and will show how a different multiplication execution order can influence the final power consumption.

As an example, assume p the values and for the constants $c_0$ to $c_3$ and a bitwidth of
Assume also that the input data is a waveform calculated as a sum of two sines. If the DFG of the th-order FIR filter (Figure a) is scheduled with one multiplier and one adder, different minimum-latency schedules are possible with different multiplication execution orders. In one

of those schedules Figure b the multiplier observes the following changes in

one of its op erands numbers over the arrows indicate the AHD b etween constants

10 7 6 3 10

c c c c c ie an AHD of whereas in the another

0 1 2 3 0

6 11 10

c c schedule Figure c it may observe the following changes c

3 1 0

7 10

c c ie an AHD of

2 0

By means of switch-level simulations, the calculated power consumption for the functional units associated with the first schedule is less than the one associated with the second. This reduction has been achieved only by changing the schedule of two multiplications.
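The effect of the multiplication order on the constant operand can be sketched as follows. The constants below are hypothetical, and the sketch ignores the precedence and latency constraints that a real scheduler has to respect; it only evaluates the Hamming distance seen on the constant input when the same order repeats at every filter iteration.

```python
from itertools import permutations

def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")

def cyclic_activity(order, consts):
    """Total Hamming distance on the constant operand over one filter
    iteration, assuming the same order repeats in the next iteration."""
    n = len(order)
    return sum(hamming(consts[order[i]], consts[order[(i + 1) % n]])
               for i in range(n))

def best_order(consts):
    """Exhaustive search over execution orders; index 0 is kept first
    because rotations of a cyclic order are equivalent."""
    rest = range(1, len(consts))
    candidates = [(0,) + p for p in permutations(rest)]
    return min(candidates, key=lambda o: cyclic_activity(o, consts))

# Hypothetical 4-bit constants c0..c3 (not the values of the example above).
consts = [0b0000, 0b0001, 0b1111, 0b0011]
print(cyclic_activity((0, 1, 2, 3), consts))  # natural order
print(cyclic_activity((0, 2, 1, 3), consts))  # two multiplications swapped
print(best_order(consts))
```

Exhaustive search is only practical for a handful of constants; it is used here to keep the sketch short.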

Example: register binding for the Differential Equation Solver

The experiments done before for the LMS filter of Figure (b) further showed that the AHD among the variables $h_i$ is lower than among the other variables. The same occurs for other groups of variables.

This information can be used in register-binding algorithms to obtain a register set where the activity of individual registers is minimized. As a side effect, those idle units that observe the changes in the registers will also reduce their consumption. Furthermore, the operand-similarity information, along with the commutative property of some operations, can also be used to decrease the power consumption in the non-idle functional units by swapping their operands.

Figure: (a) DFG of the FIR filter; (b), (c) two possible schedules and bindings. Legend: a = addition a, executed in the adder unit; m = multiplication m, executed in the multiplier unit; i/o = input/output variable.

The Differential Equation Solver [GDWL] dataflow graph (see Figure) has been scheduled with one adder (one cycle) and two multipliers (two cycles). The AHD between all pairs of variables has been obtained by means of simulations of the algorithm with different input data. The final AHD used has been obtained as the average of all simulations.

Two different register bindings, A and B, have been obtained. Both bindings use the same number of registers. The reduction of the register activity of binding B produces an average power saving in the functional units with respect to A. This is obtained by the reduction of the operand activity at the inputs of the functional units during the idle cycles.

Figure: DFG of the Differential Equation Solver.

Technique            Applicability range
-------------------  ----------------------------------------------------------
Loop interchange     Multidimensional signal processing (e.g., image processing)
Operand reordering   Descriptions with operations sharing input operands between
                     consecutive cycles (e.g., digital filters)
Operand sharing      Descriptions with operations sharing input operands in the
                     same cycle
Operand retaining    Descriptions where the functional units are idle a large
                     fraction of the time
Operand similarity   Descriptions with a high similarity among the operands of
                     the operations

Table: Features of the targeted circuit domains.

Summary of targeted circuit domains

Not all the techniques produce a tangible improvement in all types of circuits. Table shows the applicability range targeted by each technique.

All the techniques are complementary. There is only one exception: if the operand-retaining technique (i.e., the operands of the functional units are latched to avoid useless power consumption) is applied, then there is no need to use the correlation information to reduce the power during the idle cycles (see Section).

Figure: High-level synthesis process (allocation, scheduling, resource binding, and control synthesis turn an abstract description into a circuit with its control).

The power savings that each of the techniques would obtain if applied alone to the same circuit do not add up when they are applied together to that circuit. For example, if both the operand-reordering and operand-sharing techniques are separately applied to the same circuit and we observe an A% and a B% power reduction, respectively, we cannot assume an (A+B)% power reduction when both techniques are applied together. If the operand-reordering technique is applied first, it may happen that the operand-sharing technique cannot achieve a B% power reduction, because the number of operations that share an operand may have been reduced as a consequence of the transformation performed by the operand-reordering technique. Therefore, the designer should first apply the techniques that obtain the largest power savings.

SCHEDULING AND REGISTER BINDING FOR

LOW POWER

In this section, algorithms for the scheduling and register-binding steps of high-level synthesis are presented. These algorithms implement some of the concepts described in the last three techniques in Section, and therefore target the reduction of power consumption only in the functional units. They do not address the reduction of power in I/O, clocks, or data transfers.

High-level synthesis


High-level synthesis is the design task of converting an abstract description of a digital system into a register-transfer level (RTL) design implementing that behavior (Figure). This abstract description may be composed of programs, algorithms, flowcharts, dataflow graphs, instruction sets, generalized FSMs, etc. The final RTL design consists of functional units (ALUs, multipliers), storage units (memories, register files), and interconnection units (multiplexers, buses).

The high-level synthesis system compiles the abstract description into an internal representation. All synthesis tasks work from this representation. A dataflow graph (DFG) is used to represent descriptions with a large number of arithmetic operations. To represent reactive or embedded systems, in which the control sequence is based on external conditions, the DFG is extended with the control flow (CDFG).

The high-level synthesis process is traditionally decomposed into four different yet interdependent tasks [GR]: allocation, scheduling, binding, and control synthesis.

The allocation task determines the type and quantity of resources used in the RTL design. It also determines the clocking scheme, memory hierarchy, and pipelining style. The goal of allocation is to make appropriate tradeoffs between the design's cost and performance. To perform the required tradeoffs, the allocation task must determine the exact area and performance values. The number of functional units is a first approximation of the cost in area. The number of control steps indicates the performance.

The scheduling task schedules operations and memory references into clock cycles. If the number of clock cycles (number of resources) is a constraint, the scheduler has to produce a design with the fewest functional units (fewest clock cycles).

The binding task assigns operations and memory references within each clock cycle to available hardware units. A resource can be shared by different operations if they are mutually exclusive, i.e., they will never execute simultaneously.

Furthermore, the binding task is decomposed into functional, storage, and interconnection unit steps, all of them tightly related to each other. They are usually ordered and executed sequentially because of the high complexity of the binding task.

The final task, control synthesis, consists of reducing and encoding states and deriving the logic network for next-state and control signals in the control unit.

Chapter Highlevel synthesis techniques for low power

Figure: Example of peak-power and average-power optimization during the scheduling and functional-unit binding tasks: (a) DFG, (b) adder characteristics (area, delay, and power of the RCA and CLA adders), (c-e) different schedules (A, B, C) and bindings, and (d) results (peak power, average power, and area of each schedule).

PREVIOUS WORK ON HIGH-LEVEL SYNTHESIS FOR LOW POWER

High-level synthesis for low power has been addressed in [Mar], [RJ], [DK], [CP], [KKRV], and [RJ].

In [Mar], the scheduling and functional-unit binding tasks are applied to reduce the average and the peak power consumption. Peak-power reduction is important to avoid sudden high currents that may cause voltage drops, electromigration, and other failure mechanisms. A simple example is shown in Figure. Figure (a) shows the datapath to be synthesized. Figure (b) presents the characteristics (area, cycles, and average power) of the two types of adders that can be used (the power values are a simplification of the values presented in [Mar], for the sake of clarity). The ripple-carry adder (RCA) presents more power in the first cycle because there is activity throughout it from the very beginning. In the second cycle, the least-significant part has stabilized and there is activity only in the most-significant part because of carry propagation. Therefore, the average power consumption in the second cycle is lower. Figures (c) to (e) show different schedules with the same delay in cycles. We observe in Table (d) that Schedule B presents both the lowest peak and the lowest average power.
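The comparison of schedules by peak and average power can be sketched as below. The per-cycle power profiles are hypothetical values in arbitrary units (they are not the figures of [Mar]); the only property taken from the text is that the two-cycle RCA draws more power in its first cycle than in its second, while the CLA completes in one cycle.

```python
# Hypothetical per-cycle power profiles, arbitrary units (assumption).
PROFILE = {
    "RCA": [3.0, 1.0],  # two cycles: high activity first, carry propagation second
    "CLA": [5.0],       # one cycle
}

def power_per_cycle(schedule, latency):
    """schedule: list of (start_cycle, adder_type); returns the power per cycle."""
    cycles = [0.0] * latency
    for start, adder in schedule:
        for offset, power in enumerate(PROFILE[adder]):
            cycles[start + offset] += power
    return cycles

def peak_and_average(schedule, latency):
    """Peak and average power of a schedule over `latency` cycles."""
    cycles = power_per_cycle(schedule, latency)
    return max(cycles), sum(cycles) / latency

# Two RCA additions started together versus staggered by one cycle.
together = [(0, "RCA"), (0, "RCA")]
staggered = [(0, "RCA"), (1, "RCA")]
print(peak_and_average(together, 3))
print(peak_and_average(staggered, 3))
```

With these assumed profiles, staggering the two additions lowers the peak while leaving the average over the same window unchanged, which is the kind of tradeoff the scheduler explores.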

In [RJ], a scheduling method that attempts to reduce the transition activity of the synthesized design is presented. The scheduling approach is driven by the activity at the inputs of the functional units: it selects a sequence of operations (variables) for a module (register) such that the transition activity is reduced. The same authors present in [RJ] an integer linear programming formulation for the problem.

Moreover, in [RJ] a method for optimizing the control is described. The controller generates the control signals (load signals for registers, select signals for multiplexers). The inputs to functional units and registers during idle periods do not affect the behavior of the circuit, i.e., they are don't cares that are used to prevent the load of a register in those cycles in which it is idle. The don't cares can also be used to select the registers with the least activity for the idle units, so that the useless power consumption is minimized. Power reductions are reported when applying both scheduling and controller optimization for low power. The study in [RJ] is the closest work to the one presented in this chapter as a contribution.

In [DK], a scheduling and binding technique for reducing the activity in the buses is described. This is accomplished by properly multiplexing the data transfers onto the buses. Buses are high-capacitance signals; therefore, reducing their activity significantly decreases the power of the RTL design. An average power reduction is achieved for three well-known high-level synthesis benchmarks, at the expense of a very high average increase in delay.

In [CP], a register-binding algorithm similar to the one presented in this chapter as a contribution is described. It is also formulated as a clique covering of a compatibility graph. Unfortunately, no power-reduction results are reported for a comprehensive set of benchmarks.

Figure: Register binding for low power. Clique partitioning of the variable compatibility graph (variables v1 to v11) yields a minimum register set (R0 to R3); clique partitioning of the modified variable compatibility graph yields a register set that may not be minimum (R0 to R4).

Other works related to high-level synthesis for low power are found in [KKRV] and [RJ].

Overview

Two traditional approaches for the scheduling and register-binding tasks, modified to obtain lower-power designs, are presented in the following sections.

The scheduling algorithm for low power uses a list-scheduling approach where the priorities of the operations in the ready-operation queue are set in such a way that operations sharing the same operand are bound to the same functional unit and scheduled so that the functional unit can reuse that operand.

The register-binding algorithm for low power is based on a clique partition of a restricted variable-lifetime compatibility graph to obtain a register set that, for each functional unit, reduces the power consumption during idle cycles. A drawback of this algorithm is that larger register sets may be obtained (Figure). Power consumption in the functional units during non-idle cycles is decreased by taking into account the AHD among the variables of the behavioral description and the commutative property of some operations.

Although the scheduling and register-binding techniques for low power are compatible and complementary, the scheduling technique will obtain better improvements if applied to dense schedules (i.e., schedules where the functional-unit occupation is high), whereas the register-binding technique is more suitable for sparse schedules.

Scheduling for low power

The goal of the scheduling algorithm for low power is to increase the potential for a functional unit (FU) to reuse an operand. Henceforth, we will call operand reutilization (OPR) the fact that an operand is reused by two operations consecutively executed in the same FU. The OPR concept has already been explained in Section.

A traditional list-scheduling algorithm (LS) will be compared with a modified list-scheduling algorithm targeting low power (LPLS). In both cases, the FU-binding task is based on a clique partition of a variable-lifetime compatibility graph, but it differs in the number of edges of this graph, as will be shown later.

LPLS also trades off latency for OPRs: assume that an operation happens to be in the critical path of the DFG but is scheduled later than indicated by LS because an OPR is then achieved. In this case, the latency of the synthesized circuit increases.

List-scheduling algorithm

The list-scheduling algorithm [GDWL] is a popular method for scheduling operations when the number of resources is a constraint. The algorithm is shown in Figure.

The list-scheduling algorithm maintains a priority list of ready operations. An operation is ready if all its predecessors have already been scheduled. During each iteration of the algorithm, which corresponds to a control step, the operations at the beginning of the ready list are scheduled until all the resources are used. The priority list is always sorted with respect to a priority function. One such function is the mobility range, which indicates in how many different cycles the operation may be scheduled considering no restriction on the number of resources. A smaller value of mobility indicates a higher urgency for scheduling an operation, since the schedule will run out of alternative control steps earlier. The mobility range can be easily calculated with an ASAP and an ALAP scheduling (refer to [GDWL] for more details). Thus, the priority function resolves the resource usage among the ready operations: the operation with higher priority gets scheduled. Operations with lower priority will be deferred to the next or later control steps.

    V          is the set of operations
    PList_tk   is the priority list for each operation type t_k in T
    Cstep      is the current control step
    m          is |T|
    N_tk       is the number of FUs performing operations of type t_k
    Scurrent   is the current schedule

    INSERT_READY_OPS(V, PList_t1, PList_t2, ..., PList_tm)
    Cstep = 0
    while (PList_t1 not empty) or ... or (PList_tm not empty) do
        Cstep = Cstep + 1
        for k = 1 to m do
            for funit = 1 to N_k do
                if PList_tk not empty then
                    SCHEDULE_OP(Scurrent, FIRST(PList_tk), Cstep)
                    PList_tk = DELETE(PList_tk, FIRST(PList_tk))
                endif
            endfor
        endfor
        INSERT_READY_OPS(V, PList_t1, PList_t2, ..., PList_tm)
    endwhile

Figure: List-scheduling algorithm.
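A minimal runnable sketch of mobility-driven list scheduling follows. It assumes unit-delay operations and a DFG given as predecessor lists in topological order; the function names and the small example graph are hypothetical, and mobility is computed as ALAP minus ASAP (the number of alternative control steps is that value plus one).

```python
def asap(preds, order):
    """Earliest control step of each operation (unlimited resources)."""
    step = {}
    for op in order:  # `order` must be a topological order of the DFG
        step[op] = 1 + max((step[p] for p in preds[op]), default=0)
    return step

def alap(preds, order, latency):
    """Latest control step of each operation for the given latency."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    step = {}
    for op in reversed(order):
        step[op] = min((step[s] for s in succs[op]), default=latency + 1) - 1
    return step

def mobility(preds, order):
    """ALAP minus ASAP step; smaller means more urgent."""
    early = asap(preds, order)
    late = alap(preds, order, max(early.values()))
    return {op: late[op] - early[op] for op in order}

def list_schedule(preds, op_type, num_fus, priority):
    """Resource-constrained list scheduling; returns op -> control step."""
    schedule, unscheduled, step = {}, set(preds), 0
    while unscheduled:
        step += 1
        # An operation is ready when all predecessors ran in earlier steps.
        ready = [op for op in unscheduled
                 if all(schedule.get(p, step) < step for p in preds[op])]
        ready.sort(key=lambda op: (priority[op], op))
        free = dict(num_fus)
        for op in ready:
            if free[op_type[op]] > 0:
                schedule[op] = step
                free[op_type[op]] -= 1
        unscheduled -= set(schedule)
    return schedule

# Hypothetical DFG: three multiplications feeding two additions.
preds = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
order = ["m1", "m2", "m3", "a1", "a2"]
op_type = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add", "a2": "add"}
prio = mobility(preds, order)
print(prio)
print(list_schedule(preds, op_type, {"mul": 1, "add": 1}, prio))
```

Multi-cycle operations and the per-type priority lists of the figure are folded into a single sorted ready list here to keep the sketch short.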

The algorithm uses a priority list PList for each operation type t_k in T. These lists are denoted by the variables PList_t1, PList_t2, ..., PList_tm. The operations in these lists are scheduled into control steps based on N_tk, which is the number of functional units performing operations of type t_k. The function INSERT_READY_OPS scans the set of operations V, determines if any of the operations in the set are ready, deletes each ready operation from the set V, and appends it to one of the priority lists based on its operation type. The function SCHEDULE_OP(Scurrent, o_i, s_j) returns a new schedule after scheduling the operation o_i in control step s_j. The function DELETE(PList_tk, o_i) deletes operation o_i from the specified list.

Initially, all operations that do not have any predecessor are inserted into the appropriate priority list by the function INSERT_READY_OPS, based on the priority function. The while loop extracts operations from each priority list and schedules them into the current step until all the resources are exhausted in that step. Scheduling an operation in the current step makes other successor operations ready. These ready operations are scheduled during the next iterations of the loop. These iterations continue until all the priority lists are empty.

    ASS = CREATE_ALL_SS(V)
    INSERT_READY_OPS(V, PList_t1, PList_t2, ..., PList_tm)
    Cstep = 0
    while (PList_t1 not empty) or ... or (PList_tm not empty) do
        Cstep = Cstep + 1
        for k = 1 to m do
            UPDATE_PRIORITIES(PList_tk)
            while (PList_tk not empty) do
                op = FIRST(PList_tk)
                if IS_OSS(ASS, op) then
                    if not SS_HAS_RESERVED_FU(SS) then
                        funit = GET_FREE_AND_NOT_RESERVED_FU(SS)
                        RESERVE_FU_IN_SS(SS, funit)
                    endif
                    schedule_operation = TRUE
                else
                    funit = GET_FREE_AND_NOT_RESERVED_FU(SS)
                    if funit is empty then
                        schedule_operation = FALSE
                    else
                        schedule_operation = TRUE
                    endif
                endif
                if schedule_operation = TRUE then
                    SCHEDULE_OP(Scurrent, op, Cstep)
                endif
                PList_tk = DELETE(PList_tk, op)
            endwhile
        endfor
        INSERT_READY_OPS(V, PList_t1, PList_t2, ..., PList_tm)
    endwhile

Figure: Low-power list-scheduling algorithm.

Low-power list-scheduling algorithm

The traditional list-scheduling algorithm shown in Figure has been modified by including some heuristics to achieve as many OPRs as possible. A simplified version of the algorithm is presented in Figure. Both algorithms follow the notation in [GDWL].

Those operations that share an operand are grouped in operand-sharing sets, henceforth named SS (CREATE_ALL_SS). All operations of a group are able to be executed on the same FU. An operation of an SS (IS_OSS) is able to reserve the FU where it is going to be assigned for the rest of its SS, in case it does not have one reserved yet (RESERVE_FU_IN_SS). Given an SS and its reserved FU, in the best case $|SS| - 1$ OPRs can be obtained. All these consecutive OPRs on the same FU are called an operand-sharing chain. LPLS attempts to schedule as many operations of the SS as possible on its reserved FU. It also attempts not to execute other operations on this FU in order to prevent breaking the operand-sharing chain (OBTAIN_FREE_AND_NOT_RESERVED_FU). The scheduling of the operations of an SS is guided by giving more priority to the operations in the operand-ready queue whose SS already has a reserved FU (UPDATE_PRIORITIES). The priority of an operation is decreased (i.e., it will be scheduled later) if it is going to be assigned to an FU not reserved by its SS. If the operation scheduled in a later cycle happens to be in the critical path, the final latency is increased.
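The grouping step can be sketched as follows, under the simplifying assumption that an operation is described by its FU type and its two operand names. The function names mirror the routine named in the text but the data layout is hypothetical, and an operation that shares both of its operands simply appears in two sets here. A chain of k operations executed back to back on one FU reuses the shared operand at most k - 1 times.

```python
from collections import defaultdict

def create_all_ss(operations):
    """Group operations of the same FU type that share an operand.

    operations: name -> (fu_type, operand_a, operand_b)
    returns: (fu_type, shared_operand) -> set of operation names
    """
    groups = defaultdict(set)
    for name, (fu_type, a, b) in operations.items():
        for operand in {a, b}:
            groups[(fu_type, operand)].add(name)
    # Only sets with at least two operations can yield an operand reuse.
    return {key: ops for key, ops in groups.items() if len(ops) > 1}

def max_oprs(ss):
    """Best-case number of consecutive operand reuses for one sharing set."""
    return len(ss) - 1

# Hypothetical operations: three multiplications and one addition.
operations = {
    "m1": ("mul", "x", "c0"),
    "m2": ("mul", "x", "c1"),
    "m3": ("mul", "y", "c1"),
    "a1": ("add", "x", "y"),
}
all_ss = create_all_ss(operations)
for key, ss in sorted(all_ss.items()):
    print(key, sorted(ss), max_oprs(ss))
```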

All the information about achieved OPRs gathered during the execution of LPLS is transferred to the FU binder as a set of binding constraints. The FU binder first complies with all these constraints (i.e., achieves all OPRs already obtained by LPLS) and after that proceeds as the traditional FU binder, with weights to minimize the number of interconnection units (multiplexers).

LS has a complexity of $O(n)$, where $n$ is the number of operations. LPLS has a complexity of $O(n^2 m)$, where $m$ is the number of unit types.

Results

The power consumption of the designs synthesized with LS is compared with the power consumption of those obtained with LPLS. To estimate power consumption, the coarse-grained model (see Section) is used, and FUs of a fixed bit width are assumed.

It is important to notice that LS already achieves many of the possible OPRs, because the FU binder already forces some OPRs in its attempt to minimize the number of multiplexers (i.e., interconnections).

Several results are shown in Table. The benchmarks have been scheduled with the resources reported in Table. The fourth column in Table indicates the maximum possible number of OPRs. The fifth and sixth columns show the number of OPRs (for both types of FUs) achieved with LS and LPLS.

The last column of Table accounts for the savings in power consumption in the FUs because of the increase in achieved OPRs obtained with LPLS. The power consumption of an operation of the benchmark depends on the type of FU where this operation is scheduled and on how many operand changes that FU has when it executes the operation. A power reduction is achieved in the Daubechies filter and in the matrix multiplication. The rest of the benchmarks present a small or null power-consumption reduction for the following reasons: (a) the maximum number of possible OPRs is too small compared to the number of operations of the benchmark, and (b) the increase in OPRs achieved by LPLS with respect to LS is small or null.

Table: Results obtained with the LPLS algorithm. Columns: Bench, LS latency, LPLS latency, Max OPRs, OPRs with LS, OPRs with LPLS, Power reduction.

Register binding for low power

Once the scheduling and FU-binding tasks have been done, the register-binding algorithm for low power (LPRB) attempts to further reduce the useful and the useless power consumption of FUs (see Section).

LPRB assumes that the control unit maintains, for each FU, the same registers at its inputs during idle cycles.

The LMS benchmark (see Figure (a) for its DFG) will illustrate how LPRB works.

Reducing useless power consumption in FUs

One way to tackle the reduction of useless power consumption is by building up a register set that minimizes the number of input changes at the idle units. The LPRB algorithm follows this approach (first part of the algorithm in Figure).

A traditional approach for building up a register set is the clique-partitioning method. After applying this method to a lifetime compatibility graph for the variables (CG), each clique of the partition corresponds to one register. LPRB uses the same traditional approach but applied to a different variable-compatibility graph (LPCG). To build up the LPCG, the register binding for low power first constructs the CG (CREATE_CG). In a second step, a set of edges of the CG is removed (REMOVE_EDGE). Each edge removed from the CG connects two compatible variables with the following property: should both be assigned to the same register, an idle FU would have an input change.

Figure: (a) DFG of the LMS filter and (b) schedule and FU binding on one adder (A0) and two multipliers (M0, M1). In (b), the variables in parentheses are those held at the inputs of an FU during its idle cycles; for example, operation a1 is executed in unit A0, reads variables v1 and v5, and writes variable v18.

Figure (b) illustrates this concept. It shows the schedule and FU binding of the DFG of Figure (a) with one adder (one cycle) and two multipliers (two cycles). The shadowed slots represent the cycles in which the FUs are idle. For each FU, the control unit maintains during idle cycles the same registers (containing the variables in parentheses) on its inputs. Let us consider what happens with FU A0 in one of its idle cycles. An input change will occur at the inputs of the idle unit A0 if, for example, a variable held at its inputs and a variable written by a multiplier in that cycle are assigned to the same register, because the multiplier will modify the value of that register in that cycle. The same happens with a second variable pair. But not all the variables of these two pairs have compatible lifetimes. In this example, only one of the pairs does. Thus, for FU A0 in that cycle, this edge is removed from the CG. If the same procedure is applied to all the idle slots of Figure (b), further edges will be removed.
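The edge-removal idea can be sketched with a small compatibility graph and a greedy clique partitioning. The lifetimes, the removed edge, and the greedy heuristic are assumptions made for the example; the thesis does not state which clique-partitioning heuristic is used.

```python
from itertools import combinations

def compatible(l1, l2):
    """Two lifetimes (birth, death) are compatible when they do not overlap."""
    return l1[1] <= l2[0] or l2[1] <= l1[0]

def build_cg(lifetimes):
    """Compatibility graph: one edge per pair of compatible variables."""
    return {frozenset((u, v))
            for u, v in combinations(sorted(lifetimes), 2)
            if compatible(lifetimes[u], lifetimes[v])}

def clique_partition(variables, edges):
    """Greedy clique partitioning; each clique becomes one register."""
    registers = []
    for v in sorted(variables):
        for reg in registers:
            if all(frozenset((v, w)) in edges for w in reg):
                reg.append(v)
                break
        else:
            registers.append([v])
    return registers

# Hypothetical lifetimes: three variables that never overlap.
lifetimes = {"v1": (0, 2), "v2": (2, 4), "v3": (4, 6)}
cg = build_cg(lifetimes)
print(clique_partition(lifetimes, cg))

# Suppose sharing a register between v1 and v3 would change the input
# of an idle FU: that edge is removed to obtain the low-power graph.
lpcg = cg - {frozenset(("v1", "v3"))}
print(clique_partition(lifetimes, lpcg))
```

In this toy case the restricted graph needs two registers instead of one, which illustrates how removing edges can enlarge the register set.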

    /* reduce power consumption in idle functional units */
    CG = CREATE_CG(V)
    for each cycle c (up to MAX_CYCLES) do
        for each FU fu (up to MAX_FUs) do
            if IDLE(fu, c) then
                for each operation op whose result is in cycle c MOD MAX_CYCLES do
                    op_source = OPERATION_IN_FU(fu)
                    REMOVE_EDGE(CG, VAR_DEST(op_source), VAR_A(op))
                    REMOVE_EDGE(CG, VAR_DEST(op_source), VAR_B(op))
                endfor each
            endif
        endfor
    endfor
    REGISTER_BINDING(CG)

    /* reduce power consumption in non-idle functional units */
    for each FU fu do
        OBTAIN_BEST_VARIABLE_ORDER(fu, AHD)
    endfor each
    INTERCONNECTION_UNIT_BINDER

Figure: Register-binding algorithm for low power.

The drawback of removing edges is the possibility of obtaining a larger register set, as will be confirmed later by the results.

Not all the useless power consumption in the idle FUs is eliminated with this technique. As an example, let us consider FU A0 in Figure (b): because the previous operation executed in FU A0 has the same variable both as an operand and as its result, FU A0 has an input change in the following idle cycle.

Further reducing useful power consumption in FUs

By taking into account the commutative property of some operations and the average Hamming distance (AHD) among the variables of the design, useful power consumption in FUs may be further reduced (it was first reduced in the scheduling task). This reduction has to be done after the register set has been derived. This process is shown in the second part of the algorithm in Figure.

As an example, consider two additions of Figure (b) executed one after the other in FU A0. With the variable input order shown, FU A0 observes on each of its inputs the AHD between the corresponding operands of the two additions. Recall from Section how the power consumption of an FU depends on the AHD of its inputs. If the AHD information among the variables is available, the reduction in power obtained by changing the variable order in one of the additions can be evaluated. The problem of obtaining the best variable order for all operations requires an exhaustive exploration. Thus, for simplicity, LPRB follows a greedy approach (OBTAIN_BEST_VARIABLE_ORDER).

By defining a variable order, the degrees of freedom for the interconnection-unit binder are reduced, because the correct variable order (which implies the correct register order) has to be satisfied. This implies that the number of multiplexers will be larger than the number obtained if no useful power is reduced.

Table: Comparison between the traditional resource-binding algorithm (TRB) and its low-power version (LPRB). Columns: Bench; LPLS and TRB; LPLS and LPRB; Power reduction.
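A greedy operand-ordering pass of the kind described above might look like the following sketch. It is one plausible reading of a greedy strategy, not the thesis routine itself: the first operation keeps its order, and each later commutative operation takes the orientation with the smaller summed AHD relative to the operands of the previous operation on the same FU. The variable names and AHD values are hypothetical.

```python
def best_variable_order(ops, ahd):
    """Greedy operand ordering for commutative operations on one FU.

    ops: list of (a, b) operand pairs, in execution order on the FU
    ahd: frozenset({u, v}) -> average Hamming distance between u and v
    """
    def h(u, v):
        return 0.0 if u == v else ahd[frozenset((u, v))]

    ordered = [ops[0]]
    for a, b in ops[1:]:
        prev_a, prev_b = ordered[-1]
        keep = h(prev_a, a) + h(prev_b, b)   # inputs stay as written
        swap = h(prev_a, b) + h(prev_b, a)   # inputs exchanged
        ordered.append((a, b) if keep <= swap else (b, a))
    return ordered

# Hypothetical AHD values obtained from profiling.
ahd = {
    frozenset(("v1", "v19")): 6.0,
    frozenset(("v5", "v18")): 7.0,
    frozenset(("v1", "v18")): 2.0,
    frozenset(("v5", "v19")): 3.0,
}
ops = [("v1", "v5"), ("v19", "v18")]
print(best_variable_order(ops, ahd))
```

Because each decision only looks at the preceding operation, the result is not guaranteed to be the best global order, which matches the tradeoff between exhaustive exploration and simplicity noted in the text.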

Results

TRB is compared with its low-power version LPRB over three datapath benchmarks for which we have representative input data. The AHD among the variables has been obtained by profiling the benchmarks. In all of them, the scheduling and FU-binding tasks have been done with the low-power methods described in Section. The benchmarks have been scheduled with the resources reported in Table. To estimate power consumption, the fine-grained model (see Section) is used, and FUs of a fixed bit width are assumed.

It is important to notice that the AHD among the variables highly depends on the input data. The AHD of the benchmarks related to image processing has been obtained using the well-known Lena picture. We have observed that the AHD values converge fast, within a limited number of iterations of the algorithm.

For both register-binding algorithms, the useful and useless power consumption of FUs and the number of registers and multiplexers (expressed as an equivalent number of fixed-input multiplexers), along with estimations of their power consumption, are reported in Table. All power estimations are in nJ/iteration. The columns are, respectively, number of registers, power of registers, number of multiplexers, power of multiplexers, useless/useful power of FUs, and total datapath power.

In the Lee DCT and pixel interpolation benchmarks, no improvement has been observed when applying the algorithm for reducing the useful power consumption in FUs. The greedy method used did not change the variable order for any FU.

In the pixel interpolation benchmark, only two adders are used. This implies that the power consumption of the registers and multiplexers plays an important role in this benchmark.

It is worth noticing the area-power tradeoff in two benchmarks: the number of registers and multiplexers has increased when applying LPRB. Although the total area has increased, the power consumption has been reduced.

CONCLUSIONS

In this chapter, several strategies to tackle the problem of power consumption at the register-transfer level have been presented. The common idea behind these strategies is to reduce the activity of functional units by reducing the switching activity of their input operands. The potential benefits have been evaluated in different DSP examples. The techniques have been evaluated with a novel two-level power-consumption model of functional units, and power reductions have been obtained.

Algorithms that reduce the activity of the functional units by minimizing the changes of their input operands have been presented for the high-level synthesis tasks of scheduling and register binding. A significant power-consumption reduction is obtained in the scheduling task with little or no increase in latency. Further power reduction is achieved in the idle functional units during the register-binding task by increasing the number of storage and interconnection units and by taking into account both the commutative property of some operations and the average Hamming distance among the variables of the dataflow graph to be synthesized.


ARCHITECTURAL TECHNIQUE FOR LOW

POWER

This chapter presents a method for reducing the power dissipation produced by useless activity, that is, activity that only dissipates power and does not contribute to the calculation of the final result of the circuit. This method is especially indicated for the class of circuits that present a layered structure where each layer is driven by primary inputs from the very beginning, thus generating a high amount of useless activity. Arithmetic circuits such as parallel multipliers belong to this class of circuits.

The technique proposed to reduce the useless activity is based on the insertion of self-timed controlled latches on the highly active paths of the circuit.

In this chapter, this technique has been implemented and evaluated for several configurations of array multipliers.

Chapter Architectural technique for low power

Figure: (a) Circuit structure with glitching (blocks A, B, and C driven by primary inputs); (b) block C presents more activity than block A.

INTRODUCTION

The switching activity of a circuit may be divided into:

- useful activity: those signal transitions needed to change the gate outputs of the circuit from the previous cycle to the current cycle.

- useless activity: the rest of the circuit activity. It consumes power and does not generate correct outputs.

The amount of useless activity depends on the structure of the circuit and varies from circuit to circuit [BFR, SGDK]. Useless activity is generated because of the different arrival times of the signals at the inputs of the gates. The different propagation times of the gates are one reason for these unmatched delays. Reducing the logic depth of the circuit and equalizing paths through gate-sizing techniques decrease the amount of useless activity.

However, some circuits present a regular structure with a fixed logic depth. In these circuits, other, more efficient techniques for reducing the useless activity should be devised. For example, block C in Figure (a) will start computing when the primary inputs arrive (Figure (b)). However, block C only needs to compute its output when the correct output of block B is available.


In this simple example, a large amount of useless activity may be eliminated by delaying the primary inputs to blocks B and C until the outputs of blocks A and B, respectively, have been calculated. This solution is not appropriate if the circuit presents several blocks and the primary inputs go to different blocks, since the area overhead produced by the delay components may be prohibitive. Moreover, the delay components consume power, and this consumption may outweigh the power savings obtained with the reduction of the useless activity.

Array multipliers belong to this type of highly regular circuits. In this chapter, a technique that reduces the useless activity in array multipliers is proposed. It is based on the insertion of self-timed controlled latches on some highly active signals of the circuit.

This chapter is organized as follows. In Section an overview of parallel multiplier circuits is given. In Section previous studies in the area of low-power arithmetic circuits in general, and multipliers in particular, are presented. In Section the potential power reduction in array multipliers is analyzed. In Section the transition-retaining barrier (TRB) technique is proposed. In Section the implementation of the TRBs is described. Finally, the results obtained are presented in Section . In Section some conclusions are presented.

PARALLEL MULTIPLIERS

Multipliers are basic arithmetic circuits for many DSP applications. Among multipliers, parallel multipliers are the fastest, and they are widely used in order to achieve high execution speed. Parallel multipliers present a good regularity but also a high dissipation that increases dramatically with the operand bit-width. Techniques for decreasing the dissipation of this type of circuits are thus suitable and, in some cases, necessary if power dissipation is a major issue.

Multipliers are classified into two basic types: sequential and parallel. The parallel ones are further classified into two subtypes:

- tree multipliers: all partial products are generated in parallel and afterwards a multi-operand adder is used to obtain the final result. The delay of a tree multiplier is O(log n), n being the bit-width of the input data;
- array multipliers: these multipliers do not use two separate circuits for partial product generation and accumulation. A single cell is used in an array structure where the new partial products and the partial accumulation are produced simultaneously. The delay of an array multiplier is O(n).

Schemes for parallel multipliers have been studied for decades, trying to reduce the number of logic levels of the multiplier BW Pez.

Tree multipliers are faster than array multipliers because the former reduce the number of logic levels; however, they also suffer from a poor layout regularity. This poor regularity implies a large routing area.

PREVIOUS WORK

The design of arithmetic circuits aiming at minimizing power dissipation has basically focused on multipliers. Other arithmetic circuits have rarely been studied. The power consumption of adders has been estimated in CS KGG NMO. Some low-power arithmetic designs have been proposed in EL KBL EL.

In SLB the delays in the full-adder cells of the multiplier are balanced by implementing a few delay buffer circuits and, therefore, power savings are obtained since the unnecessary activity is reduced.

In YYN+ a comparison is presented between the design of a static CMOS multiplier and the design using Complementary Pass-Logic (CPL) WE. Because of the fewer transistors used in the CPL design and their lower input capacitance, a of power dissipation reduction is achieved with a frequency of operation of MHz and a power supply of V.

In SA an architecture to implement the data compression in a bit Booth multiplier WE based on tally functions is described. The authors derive a reduction in power of in the compression stage of the multiplier compared with that of a conventional multiplier.

The characterization of unnecessary activity in an array multiplier has been studied in OF. It is shown that the unnecessary activity in a bit array multiplier is the of the total activity.

In MT the switching activity of multipliers is studied as a function of the multiplication algorithm, the compressor circuitry and the circuit design style. A bit modified Booth multiplier is designed where the activity is estimated to be reduced by a factor of


In LvMJ the transition activity in array and tree (Wallace) multipliers is analyzed. This study shows that the number of total transitions in a bit Wallace multiplier is less than half of those in the array design. Moreover, the ratios of useless to useful transitions are and in the array and Wallace multipliers, respectively.

The switching activity during the partial product generation in a radix Booth multiplier is reduced in ZA. This is done by converting the multiplicand from two's complement into sign-magnitude representation. Significant bit transitions are saved because negating the multiplier in sign-magnitude representation implies only one bit toggle. Unfortunately, no power reductions are reported for the entire multiplier in this work.

Asynchronous multipliers have also been studied in HvBPS SNNS. The authors of HvBPS conclude that, for small multipliers, a synchronous design is better in terms of power consumption than its asynchronous counterpart. The take-over point for asynchronous is predicted to be near an input width of . Three small (input width of bits) asynchronous multipliers are also compared to the synchronous design in SNNS, showing that the energy per operation of the synchronous design is less than one third of that of the asynchronous designs.

Finally, in LS SR a technique similar to the one proposed in this chapter is studied. In LS latches are inserted in a x-bit radix Booth multiplier to reduce the switching activity. The authors claim that a of power savings is achieved with a frequency of operation of MHz and a power supply of V. In SR a design of a Booth-encoded array multiplier is presented that uses dynamic CMOS logic together with a self-timed evaluate signal in such a way that each carry-save adder within the array evaluates only after all of its inputs have stabilized. No power reductions obtained with this technique are reported.

POTENTIAL POWER REDUCTION IN ARRAY MULTIPLIERS

The array multiplier structure studied is depicted in Figure. This array multiplier computes the 2n-bit product of two n-bit unsigned numbers. The basic building blocks of the array multiplier are full-adders (FA) and half-adders (HA) (see Figure).

Figure plots the logic transitions per operation that take place in the interconnections among the FA and HA in a bit array. These results have been obtained with a gate-level simulator for a totally inertial gate-delay model [footnote 1]. Each block in the figure represents the sum of the s and cout signal transitions of the corresponding FA or HA.

[Figure: array of HA and FA cells fed by the partial products x0y0 to x4y4, followed by a CPA]
Figure bit array multiplier

[Figure: cell symbols with inputs a, b (and cin for the FA) and outputs s and cout]
Figure (a) FA and (b) HA structure

[Footnote 1] A gate with a delay t_i from input i to the gate output has an inertial degree d if the gate filters all those glitches of length g such that g < d · t_i. If d = 0, we say that the gate follows a pure delay model, i.e. no glitch is filtered. If d = 1, we say that the gate follows a totally inertial delay model, i.e. all glitches shorter than the gate delay t_i are filtered. If 0 < d < 1, we say that the gate follows a partially inertial model.


[Figure: 3-D bar plot titled "16x16-bit / Original / General View"; the X and Y axes index the cells of the array part and of the CPA]
Figure Signal transitions per operation for each basic block in a bit array multiplier

The layout of the figure has been rotated to allow an appropriate observation of the transitions at each cell. The tallest bars of the plot correspond to the carry-propagate adder (CPA) [footnote 2].

Since the average power dissipation of a CMOS circuit is dominated by the switching activity, useless signal transitions, i.e. transitions that only dissipate power and do not contribute to the calculation of the final result, should be prevented.

In the layered structure of array multipliers, where each layer is excited from the very beginning, many useless transitions are generated. These transitions in turn generate more useless transitions, thus creating a snowball effect. We observe in Figure the consequence of this effect: the deepest layers of the array, i.e. those closest to the CPA, are the most power consuming.

The potential power reduction in array multipliers increases dramatically with the operand bit-width. Consider Figure. Both useful and total number of signal transitions for an and bit array multiplier are shown.

It is clear that the larger the array multiplier is, the larger is the possibility for reducing its power dissipation.

[Footnote 2] The CPA is used here for its simplicity and inherent low power, although it has a high power-delay product NMO.


[Figure: plot of total transitions per input vector, with one curve for the minimum (useful) transitions and one for the total transitions]
Figure Useful and total number of signal transitions for an and bit array multiplier

It is interesting to notice that the potential for reducing power consumption in tree multipliers (e.g. Wallace) is lower than in array multipliers. The reason is that tree multipliers generate fewer useless transitions because of the more balanced paths to the basic blocks that compose the multiplier. We have observed, for example, that the number of signal transitions in a bit array multiplier doubles the number in its Wallace counterpart.

TRANSITION-RETAINING BARRIERS

A strategy to avoid useless transitions is to insert self-timed controlled latches on the paths where these transitions occur. A possible solution is to latch every output of each cell. This configuration has some drawbacks that make it not very appealing: area, delay and even power dissipation increase as a result of the latches.

A trade-off is to find, for instance, a barrier of latches that divides the array into two parts: the upper and bottom halves. Thus, the useless transitions that are generated in the upper half of the array will not cross the barrier until the barrier switches into transparent mode. Henceforth, this barrier of latches is called a transition-retaining barrier (TRB).


[Figure: (a) array and CPA with the candidate barrier positions along the rows; (b) plot of total transitions per input vector versus barrier position]
Figure (a) Possible locations for one TRB and (b) transitions per input vector for different locations of one TRB in a bit array

For VLSI arrays, a regular implementation of this barrier is sought. The easiest way to insert latch barriers in the array is by taking advantage of the layered structure of the array.

The enabling signal of the latch barriers is generated by delaying an input signal, for example the clock signal. The delay should be no longer than the critical path from the inputs of the array to the barrier, thus not increasing the delay of the multiplier, and as similar as possible to the critical path, so that no useless transitions are allowed to cross the barrier too early.

The best location for the barrier is the middle of the array, because this balances the useless transitions between the two partitions. Figure (b) shows an analysis of the number of transitions per input vector in the bit array with different locations for the TRB. The optimum partition is in row of Figure (a).

We can also evaluate the switching activity if two or more TRBs are inserted. Similarly, the minimum is obtained when the array is divided into equal parts. Figure shows how the transitions in the array are reduced when inserting one and two TRBs.
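The effect of a barrier can be explored with a small simulation. The following is only a sketch under strong assumptions that are not taken from the original experiments: a carry-save array built exclusively from full-adders, a unit delay per full-adder, zero-delay partial-product AND gates, ideal latches whose own transitions are not counted, and a barrier in front of row r that opens after r unit delays. All names (`settle`, `new_state`, `product`, `barriers`) are hypothetical.

```python
import random


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def new_state(n):
    """All-zero state, i.e. the array settled for x = y = 0."""
    state = {(kind, i, j): 0 for kind in "sc" for i in range(n) for j in range(n)}
    state.update({(kind, k): 0 for kind in ("ps", "pc") for k in range(n)})
    return state


def settle(n, x, y, state, barriers=()):
    """Apply inputs x, y and return the number of signal transitions.

    A barrier r holds the outputs of row r - 1 at their old values
    until step r, when row r - 1 has settled in this unit-delay model.
    """
    xb = [(x >> j) & 1 for j in range(n)]
    yb = [(y >> i) & 1 for i in range(n)]
    held = dict(state)  # values retained by the latches
    transitions = 0
    for step in range(1, 4 * n + 4):
        def seen(kind, i, j):
            # Output of cell (i, j) as seen by the cells of row i + 1.
            if i < 0 or j >= n:
                return 0
            latched = (i + 1) in barriers and step <= i + 1
            return (held if latched else state)[(kind, i, j)]

        new = {}
        for i in range(n):
            for j in range(n):
                new["s", i, j], new["c", i, j] = full_adder(
                    xb[j] & yb[i], seen("s", i - 1, j + 1), seen("c", i - 1, j))
        for k in range(n):  # ripple carry-propagate adder
            a = state["s", n - 1, k + 1] if k + 1 < n else 0
            cin = state["pc", k - 1] if k > 0 else 0
            new["ps", k], new["pc", k] = full_adder(a, state["c", n - 1, k], cin)
        changed = sum(new[key] != state[key] for key in new)
        transitions += changed
        state.update(new)
        if changed == 0 and step > max(barriers, default=0):
            break
    return transitions


def product(n, state):
    """Read the 2n-bit product from the settled state."""
    low = sum(state["s", i, 0] << i for i in range(n))
    high = sum(state["ps", k] << (n + k) for k in range(n))
    return low + high


random.seed(1)
n = 8
for barriers in [(), (4,)]:
    state, total = new_state(n), 0
    for _ in range(200):
        total += settle(n, random.randrange(2 ** n), random.randrange(2 ** n),
                        state, barriers)
    print(barriers, total)
```

The transition counts printed by this sketch depend entirely on the simplified delay model, so they only indicate a trend and are not comparable with the simulation results reported in this chapter.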


[Figure: four 3-D bar plots titled (a) "16x16-bit / Original / General View", (b) "16x16-bit / Original / Side View", (c) "16x16-bit / 1 Barrier / Side View" and (d) "16x16-bit / 2 Barriers / Side View"]
Figure Signal transitions per input vector in a bit array multiplier: (a) general view of the design with no TRBs, and side view with (b) no TRBs, (c) one TRB and (d) two TRBs

IMPLEMENTATION OF THE TRB

Since the cost of latches may be too high, both in area and power dissipation, other structures that perform the same transition-retaining function at the outputs of the multiplier cells have to be devised.

The designer can choose between a static or a dynamic implementation of the latch. The most appropriate dynamic latch structure in terms of area is the Clocked-CMOS or C2MOS latch WE. The area overhead consists of four transistors per cell, two for each output of the full-adder cell. Figure (a) depicts the C2MOS latch.


[Figure: (a) a static CMOS gate and its C2MOS version, with P and N networks, enable signals and output node out; (b) static structure]
Figure (a) Comparison between static CMOS and C2MOS and (b) avoiding direct-path currents

Related work on power dissipation using C2MOS has been presented in SNK BOI.

The main problem with the use of C2MOS structures is that capacitive coupling may arise at the out node in Figure (a), causing direct-path currents in the gates fed by this node. To avoid this problem, the static structure depicted in Figure (b) should be used.

The delay mentioned in Section is now implemented as a chain of inverters. We distinguish four types of delay-cells:

- TA simulates the delay of the first layer of the array,
- TB simulates the delay of a full-adder,
- TC simulates the delay of a full-adder and controls the TRB, and
- TD is the same as TC but has no other delay-cell as fanout.

Figure depicts a bit array with two TRBs where the four types of delay-cells can be observed. All the controlling signals are obtained by delaying the input signal start.

The number of inverters in each delay-cell depends on the delay of the inverter, the input capacitance of the inverter and the capacitance of the C2MOS signals enable and its complement in Figure (b). The number of inverters in the TC and TD delay-cells also depends on the operand bit-width of the array.

[Figure: array of AND and FA cells with inputs X0 to X3 and Y0 to Y3, outputs S0 to S7 and a CPA; the delay-cells TA, TB, TC and TD delay the signal start and control the transition-retaining barriers]
Figure bit array with two TRBs. The four types of delay-cells (TA to TD) are shown

[Figure: transistor-level schematic of a full-adder with inputs a, b, cin and outputs s, cout]
Figure Design of a FA

RESULTS

Four configurations of array multipliers have been implemented for the evaluation of power dissipation, area and delay. The original array multiplier is compared to designs with one, two and three TRBs.


Figure Layout of an bit array multiplier in a sea-of-gates design style

The configurations are implemented in a sea-of-gates (SoG) design style using the Ocean system (see Appendix A, Section A). A layout of an bit array multiplier is shown in Figure. The full-adder cell is implemented as in WE (see Figure). The evaluation of the power dissipation has been obtained with a switch-level simulator vG that uses an RC model for timing calculations.

The scenario of the simulation is the following:

- input patterns are randomly generated with probability of logical at the primary inputs being ,
- all inputs to the multiplier arrive at the same time, and
- between one pseudo-random input vector and the following one there is enough time for the multiplier outputs to settle.

The results are presented in Table. In the bit multipliers with one and two TRBs, the additional logic consumes more power than it saves. Hence, the overall power dissipation is higher than in the original design.

A trade-off exists between the delay of the new designs and their power dissipation. If the delay of the delay-cells is greater than the delay of the full-adder cell, the multipliers will dissipate less power but will be slower. Thus, the increase in the delay of the designs with barriers is due to the additional transistors of the latches and the longer delay of the delay-cells.


Array multiplier | Area | Area incr. | Delay | Delay incr. | Energy | Energy reduc.
bit original
bit TRB
bit TRB
bit original
bit TRB
bit TRB
bit TRB
bit original
bit TRB
bit TRB
bit TRB
bit original
bit TRB
bit TRB
bit TRB

Table Comparison between the original design and the new designs with TRBs

We obtain the best results in power dissipation reduction when three TRBs are inserted. However, it is important to notice that the greatest decrease in power is achieved when inserting one TRB with regard to the original design, rather than when inserting two TRBs with regard to inserting one TRB.

The increase in area when inserting TRBs is small and becomes more negligible as the bit-width increases: there is an increase of when inserting three TRBs in the bit and just a in the bit, also with three TRBs. The overhead in power dissipation, area and delay as a result of the insertion of TRBs has been taken into account in the results in Table.

CONCLUSIONS

In this chapter we have studied the power consumption of a class of circuits, array multipliers, that present a high power consumption because of the generation and propagation of a large amount of useless activity.

The technique of inserting transition-retaining barriers (TRBs) in order to decrease the propagation of the useless transitions in array multipliers has proved effective in reducing the power dissipation of this type of circuits. With three TRBs we achieve a reduction in power of in a bit array multiplier with an increase in area and delay of and , respectively.

The final multiplier remains combinational and highly regular. This implies that the technique presented in this chapter can be easily integrated in existing CAD tools for the generation of array multipliers.

This technique should be evaluated for other types of circuits that present a large amount of useless activity.


LOGIC-CIRCUIT TECHNIQUE FOR LOW POWER

This chapter addresses the optimization of a circuit for low power using transistor reordering, a technique that has traditionally been used to improve performance. In this chapter, this technique is used to find a proper order of the transistors of a gate so that the switching activity at the internal nodes of the gate is minimized.

A stochastic power-consumption model that depends on the switching activity and the signal probabilities of the inputs of the gate is presented. This model takes into account the power at the internal nodes of the gate.

An optimization algorithm that relies on the stochastic model is described. It performs an exhaustive exploration of the different configurations of a gate that are obtained by reordering its transistors. Thus, the best configuration for low power of each gate is selected and the overall power consumption of the circuit is reduced. Power-reduction results are reported for several benchmarks.

Chapter LogicCircuit technique for low power

INTRODUCTION

This chapter addresses the optimization of a circuit for low power using transistor reordering from a gate-level description. The optimization algorithm uses a power-consumption model of a static CMOS gate that takes into account the power of the internal nodes of the gate. This model allows the exploration of the different configurations of a gate that are obtained by reordering its transistors. Thus, the best configuration of each gate is selected and the overall power consumption of the circuit is decreased.

We focus on combinational multilevel circuits, where it has been shown that the power consumption of useless signal transitions, i.e. those transitions that do not contribute to the final result of the circuit, accounts for a large fraction of the overall dynamic power consumption of the circuit. Thus, it is necessary to incorporate the switching activity of the input signals into the power consumption of the gate.

Motivation examples

To illustrate why it is important to incorporate the switching activity information into the power estimation of a gate, consider the four possible configurations of the gate in Figure (a), which implement the same function y of the inputs a1, a2 and b. Different switching activities at the inputs (D_a1, D_a2 and D_b) result in a different optimal transistor reordering for the gate, as shown in Table (b). The equilibrium probability, i.e. the probability for a signal to be 1, of all input signals has been set to .

Table (b) shows the power consumption for two different input switching activity scenarios (cases and ) of the different configurations relative to configuration D in case . Time intervals between two consecutive transitions at input signal k follow an exponential distribution with average 1/D_k. In case , the best transistor ordering is given by configuration A of Figure (a): the power consumption is decreased by with respect to configuration D. In case , the power is decreased by if configuration D is taken instead of A.
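An input waveform with exponentially distributed intervals between transitions can be generated as follows. This is a minimal sketch, assuming the transition density is given in transitions per second; the function name `transition_times` and the parameter values are hypothetical, and the generator used for the original experiments is not described here.

```python
import random


def transition_times(density, t_end, rng):
    """Transition instants of one input over [0, t_end).

    Intervals between consecutive transitions are drawn from an
    exponential distribution with mean 1 / density.
    """
    times = []
    t = rng.expovariate(density)  # expovariate takes the rate, not the mean
    while t < t_end:
        times.append(t)
        t += rng.expovariate(density)
    return times


rng = random.Random(7)
times = transition_times(1e6, 1e-2, rng)  # about 1e4 transitions expected
print(len(times))
```

Because the signal toggles at every instant and the intervals in both logic states follow the same distribution, the long-run fraction of time spent at each logic value tends to one half.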

The reason for these power savings is the reduction of the switching activity at the internal nodes of the gate. This internal activity depends on the switching activity and probability of the inputs and on the order of the transistors.

The power-consumption values in the last experiment have been obtained with an accurate switch-level simulator named SLS (see Chapter A, Section A). Throughout this work we will rely on switch-level simulations instead of SPICE JQN+ simulations because we will deal with relatively large benchmarks and a large number of input vectors for each one. A SPICE simulator is characterized by high accuracy but long simulation time. Simulation time is typically proportional to N^m, where N is the number of nonlinear devices in the circuit and m is between and WE.

[Figure: (a) transistor schematics of configurations (A), (B), (C) and (D), with inputs a1, a2 and b, output y and internal nodes n0 and n1; (b) table with columns Activity (trans/sec), A, B, C, D and Reduction, listing D_a1, D_a2 and D_b (in K and M transitions per second) for each of the two scenarios]
Figure (a) Four implementations of function y and (b) relative power consumption for two different input activity scenarios

However, we have performed an experiment using SPICE simulations to validate the use of switch-level simulations in this work. Consider gate configurations A and D in Figure (a) with an equivalent load capacitance of an inverter. Two new scenarios are considered that produce the same output waveform. The current drawn from the power supply, i_DD(t), for each scenario and configuration is shown in Figure. We observe significant differences in current supply when the output load has to be charged at times ns and ns. Table shows a comparison of the power-consumption estimations provided by SPICE and SLS. The power


[Figure: for Scenario 1 and Scenario 2, the input waveforms a1, a2 and b and the output waveform y for t from 0 to 25 ns, together with the supply current i_DD(t) in mA for configurations A and D]
Figure SPICE simulations of configurations A and D in Figure for two different input activity scenarios

[Table: for each scenario, the SPICE estimate (mW), the SLS estimate (mW) and the relative error, for configurations A and D]
Table SPICE and SLS comparison for the configurations in Figure. Numbers in parentheses show the power-consumption ranking

consumption in SPICE is calculated as

$$\frac{1}{T}\int_{t=0}^{T} i_{DD}(t)\, V_{DD}\, dt \quad \text{(W)}$$


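[Figure: NAND gate with inputs a, b and c, output y with capacitance C_load, series transistors TA, TB and TC, and internal nodes n0 and n1 with capacitances C_0 and C_1; input values are shown for Case 1 and Case 2]
Figure Transistor-reordering technique in a 3-input NAND gate, example from SLW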

where T is the cycle time ( ns) and V_DD is the power supply voltage, while in SLS it is obtained as explained in Chapter A. In this experiment, the relative errors of SLS with respect to SPICE are less than . However, it is important to notice that both simulators agree on the power-consumption ranking for the four cases (shown in parentheses in Table).
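The average-power integral used for the SPICE estimate can be evaluated numerically from a sampled supply-current waveform. The following is a minimal sketch, assuming the current is available as (time, current) samples and using the trapezoidal rule; the function name `average_power` and the sample values are hypothetical and do not correspond to the waveforms of the experiment.

```python
def average_power(times, currents, v_dd):
    """Trapezoidal estimate of (1/T) * integral of i_DD(t) * V_DD dt.

    times are in seconds, currents in amperes and v_dd in volts,
    so the result is in watts. T is taken as times[-1] - times[0].
    """
    period = times[-1] - times[0]
    energy = 0.0
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        # Area of one trapezoid of the instantaneous power i_DD * V_DD.
        energy += 0.5 * (currents[k] + currents[k + 1]) * v_dd * dt
    return energy / period


# A triangular current pulse: 0 A, 2 A, 0 A at t = 0 s, 1 s, 2 s with V_DD = 1 V.
print(average_power([0.0, 1.0, 2.0], [0.0, 2.0, 0.0], 1.0))  # 1.0
```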

We conclude from this experiment that power consumption indeed depends on the order of the transistors and the switching activity of the inputs to the gate, and that SLS simulations are valid to discern the lowest power-consuming configuration.

The ripple-carry adder is another example in which the equilibrium probability does not give enough information to optimize gates for low power. Consider an adder implemented as a chain of full-adders that has to calculate the addition of two n-bit operands with equal equilibrium probability for all bits. The equilibrium probabilities of all inputs of the full-adders are , but it is clear that the switching activity of the inputs of the full-adders corresponding to the operands is low ( transitions per operation), whereas the switching activity of the input corresponding to the propagated carry is higher, especially in those full-adders that compute the most-significant bits, because of the generation and propagation of useless signal transitions.
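The activity of the propagated carries can be observed with a small unit-delay simulation. This is only a sketch, assuming one unit of delay per full-adder, operands that arrive simultaneously and a carry-in of 0 at the least-significant position; the function name `carry_transitions` is hypothetical.

```python
import random


def carry_transitions(n, vectors):
    """Count the transitions on each carry output of an n-bit ripple-carry adder.

    vectors is a sequence of (a, b) operand pairs applied one after another.
    Every full-adder updates its carry from the values of the previous step.
    """
    carries = [0] * n  # carry output of each full-adder
    counts = [0] * n
    for a, b in vectors:
        for _ in range(n):  # n unit delays are enough for the chain to settle
            new = []
            for i in range(n):
                ai, bi = (a >> i) & 1, (b >> i) & 1
                cin = carries[i - 1] if i > 0 else 0
                new.append((ai & bi) | (ai & cin) | (bi & cin))
            for i in range(n):
                counts[i] += new[i] != carries[i]
            carries = new
    return counts


random.seed(3)
vectors = [(random.randrange(256), random.randrange(256)) for _ in range(200)]
print(carry_transitions(8, vectors))
```

In this model the carry of the least-significant full-adder changes at most once per operation, while the carries further along the chain may toggle several times before settling.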

To show how transistor reordering affects the number of transitions at the internal nodes of a gate, consider the 3-input NAND gate in Figure. In this particular example, the techniques of input reordering and input interchange (i.e. interchanging those inputs to the gate that are logically equivalent) coincide. In Case 1, inputs b and c are 1 and a is initially 0. The capacitances associated to internal nodes n0 and n1, C_0 and C_1, are discharged. The output capacitance C_load is charged. When input a changes to 1, only the output capacitance is discharged. In Case 2, inputs a and b are 1 and input c is initially 0. All capacitances are charged. When input c changes to 1, all capacitances will discharge, thus consuming more power than in Case 1. Therefore, if the probabilities of two of the inputs are close to 1 and the other input is very active, we should assign this input to the transistor TA to minimize the power consumption at the internal nodes of the gate.
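The two cases can be reproduced with a toy switch-level model of the NMOS chain. This is a sketch under simplifying assumptions that are not part of the original work: nodes are either fully charged or fully discharged, charge sharing and threshold drops are ignored, V_DD is normalized to 1, and the energy is counted as one capacitance unit each time a node is charged. The function name `nand_chain_energy` and the capacitance values are hypothetical.

```python
def nand_chain_energy(order, vectors, caps):
    """Energy drawn from the supply by a NAND gate, in units of C * V_DD^2.

    order lists the inputs from the output side (TA) down to ground.
    caps[0] is the output capacitance, caps[m] the capacitance of the
    internal node below the m-th transistor of the chain.
    """
    k = len(order)
    node = [0] * k  # node[0] is the output y; all nodes start discharged
    energy = 0.0
    for vec in vectors:
        on = [vec[name] for name in order]
        y = 0 if all(on) else 1
        for m in range(k):
            if m == 0:
                new = y
            elif all(on[m:]):
                new = 0          # conducting path to ground
            elif all(on[:m]):
                new = y          # conducting path to the output, which is at 1
            else:
                new = node[m]    # isolated node keeps its charge
            if new == 1 and node[m] == 0:
                energy += caps[m]
            node[m] = new
    return energy


# Inputs a and b stay at 1 while c toggles.
vectors = [{"a": 1, "b": 1, "c": v} for v in (0, 1, 0, 1, 0)]
caps = [2.0, 1.0, 1.0]  # C_load, C_0, C_1
print(nand_chain_energy(["a", "b", "c"], vectors, caps))  # 12.0, c next to ground
print(nand_chain_energy(["c", "a", "b"], vectors, caps))  # 6.0, c assigned to TA
```

With the active input next to the output, only the output capacitance is charged and discharged, which matches the qualitative argument above.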

PREVIOUS WORK AND OVERVIEW

Transistor reordering is a technique that does not significantly affect layout area, and it may decrease the propagation delay of the gate.

Carlson CC hinted at the possibility of using the transistor-reordering technique to decrease power consumption, and he presented an algorithm for delay-power-area optimization where high speed was synonymous with high power consumption. This approach to measuring power consumption is not sufficiently accurate since it does not consider the probability and switching activity of signals. No power-consumption reductions are reported in CC.

Input-reordering techniques form a subset of transistor-reordering techniques. For example, by reordering the inputs in a input NAND gate, different configurations of the gate are obtained; the same occurs if we apply transistor reordering. But in the gate represented in Figure (a), only two different configurations are obtained by applying input reordering, whereas four are obtained with transistor reordering. Input reordering has been used in TA BIO along with transistor sizing to reduce power consumption. It is not clear in those works what the contribution of the input-reordering technique by itself is. This is especially true in BIO, where the largest power-consumption reduction is achieved in the benchmark with the largest number of high-fanout gates. Input- or transistor-reordering techniques obtain better results when applied to low-fanout gates because these techniques focus on reducing the power consumption of the internal nodes of the gate. If the output capacitance is increased, the overall gate power reduction obtained is decreased HZA.

A study of the potential benefits of transistor reordering for a specific library is presented in GBJ, where an average power-consumption reduction of and up to for complex gates is obtained. A novel binary-decision diagram is used to represent the structure of the gate. An algorithm is presented to obtain all possible transistor reorderings of the gate. No model for the power consumption of a CMOS gate is presented.

In HZA input-reordering techniques are applied to a large variety of CMOS NAND gates ranging from to inputs. The major contribution in HZA is the description of a new model for the power consumption of a MOSFET chain. This model accounts for the consumption of internal nodes and it depends only on the equilibrium probabilities of the inputs of the gate.

The closest works to ours are PR SLW. In PR a clear example of the effect of transistor reordering on the power consumption of a circuit is shown. An estimated average of in power reduction is obtained for some MCNC benchmarks.

Finally, in SLW the authors propose a set of simple transistor-reordering rules for both basic and complex CMOS gates to minimize the switching activity at the internal nodes. These rules are:

For NAND gates:
- place the input signal with high signal probability near ground;
- place the input signal with low switching activity near ground.

For NOR gates:
- place the input signal with low signal probability near power supply;
- place the input signal with low switching activity near power supply.

For complex gates:
- transform the complex gates into NOR or NAND structures;
- apply the rules of NAND and NOR gates to the transformed gate.

Whenever the rules conflict, heuristics are applied based on probability/activity ratios. Average reductions of are reported for several circuits.
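The NAND rules above can be sketched as a simple sort. The exact conflict-resolution heuristic is not given here, so this sketch assumes, purely for illustration, that the probability/activity ratio alone decides the order; the function name `nand_order` and the example values are hypothetical.

```python
def nand_order(inputs):
    """Order the inputs of a NAND gate from the output side down to ground.

    inputs maps an input name to (signal probability, switching activity).
    Assumed heuristic: the higher the probability/activity ratio, the
    closer to ground the input is placed.
    """
    def ground_affinity(name):
        prob, activity = inputs[name]
        # An input that never switches is the best candidate for ground.
        return prob / activity if activity > 0 else float("inf")

    return sorted(inputs, key=ground_affinity)


inputs = {"a": (0.5, 1e6), "b": (0.5, 1e4), "c": (0.9, 1e4)}
print(nand_order(inputs))  # ['a', 'b', 'c']: a next to the output, c next to ground
```

For NOR gates the same idea would apply with the complement of the signal probability and the power supply in place of ground.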

Overview of our approach

The goal of this chapter is to evaluate the power reduction that can be obtained by reordering the transistors that compose the different gates of a circuit. Our work accounts for the transistor-reordering technique by itself: we keep the size of the transistors constant when they are reordered.


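[Figure: (a) static CMOS NAND gate with inputs a, b and c and output y; (b) table of the relative power of the six input orderings (a b c, b a c, a c b, c a b, b c a, c b a) for input activities given in K and M transitions per second]
Figure (a) Static CMOS input NAND gate; (b) relative power consumption for different input activities without varying the equilibrium probabilities of the inputs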

It has been shown that the arrival times of the input signals affect the consumption of the gate CC. However, we do not take into account the arrival times because we are more concerned about the switching activity of an input during the whole cycle time than about just the last transition that settles the correct value at the output of the gate.

We present a power-consumption model of a static CMOS gate that depends on both the static probability and the switching activity of its inputs. Our power-consumption model differs from the one presented in HZA in that we take into account the switching information of the input signals: our model is based on the transition density measure of activity in digital circuits proposed by Najm Naj.

As an example, consider a 3-input NAND gate (Figure (a)). Consider equilibrium
probabilities of and for its inputs a, b and c, respectively. Table (b)
shows the relative power consumption of the gate for different input
switching-activity scenarios. Our model discerns, among the six possible input
reorderings, those that give the maximum and minimum power consumption. The model in
[HZA], on the contrary, cannot discern any of the reorderings and provides the same
power estimate for all of them.

An algorithm that traverses the gate-level description of the circuit and uses the
gate model mentioned above is presented, and results are reported for a wide range
of MCNC benchmarks. A cell library with different instances of a single gate, needed
to obtain all its possible transistor reorderings, has been implemented in a
sea-of-gates design style. The results are based on switch-level simulations and show
that power consumption reductions of up to may be obtained when applying the
transistor reordering technique.

Power consumption of CMOS gates

This chapter is organized as follows: in Section, a power-consumption model of
a static CMOS gate that takes into account the power consumption of the internal
nodes is presented. In Section, an algorithm that traverses the gate-level
description of the circuit and generates an optimized circuit for low power is
described. Results are presented for several MCNC benchmarks in Section. In Section,
some conclusions are drawn for this chapter.

POWER CONSUMPTION OF CMOS GATES

Definitions and overview

Definition (Stochastic process). Let x(t) be the value of a system characteristic
being observed at time t. In most situations, x(t) is not known before time t and may
be viewed as a random variable. A stochastic process is a description of the relation
between the random variables x(t).

Definition (Stationary Markov process). A stationary Markov process is a stochastic
process where the probability law relating the next period's state to the current
state does not change (or remains stationary) over time.

Definition (Equilibrium probability). Let x(t) be a stationary Markov process with
random transition times. The probability that it takes the value 1 at any given
time t is the expected value E[x(t)] at that time, and it is independent of time.
This value is called the equilibrium probability of x(t) and is denoted as P(x).

Definition (Switching activity). The switching activity of a stochastic process x(t)
is the number of 0-to-1 transitions and 1-to-0 transitions of x(t) in a time unit.

Henceforth, we will model logic signals of a circuit as stationary Markov processes,
as in [Naj], to derive a power-consumption model of a static CMOS gate that includes
the switching activity at the output and internal nodes of the gate. The model
depends on both the equilibrium probabilities and the switching activity of the
inputs of the gate.

The switching activity in both input and output nodes of a gate is measured with the
transition density technique described in [Naj]. This technique is briefly reviewed
in the next section.


Transition density

The transition density is a compact measure of the switching activity in digital
circuits. The transition density of a node is the average number of signal transitions
per time unit at that node, and it is defined as

\[ D(y_j) = \sum_{i=0}^{n-1} P\!\left(\frac{\partial y_j}{\partial x_i}\right) D(x_i) \]

where $P(z)$ is the equilibrium probability of a signal (or logic function) $z$,
$D(z)$ is the transition density of a signal $z$, and $x_i$ ($y_j$) are the $n$ ($m$)
gate inputs (outputs). $\partial y_j / \partial x_i$ is named the boolean difference,
and it is a boolean function that may depend on all $x_p$, $0 \le p < n$, $p \ne i$.
The boolean difference is defined as

\[ \frac{\partial y_j}{\partial x_i} = y_j|_{x_i=1} \oplus y_j|_{x_i=0} \]

If $\partial y_j / \partial x_i = 1$, then all the transitions at input $x_i$ are
propagated to output $y_j$. In other words, $\partial y_j / \partial x_i$ expresses
the sensitivity of $y_j$ with respect to input $x_i$.

Thus, the transition density provides a fast way of propagating switching activity
from the primary inputs to the outputs of the gates that compose a circuit.
The main drawback of the transition density is that it does not take into account
the spatial and temporal correlation of the input signals to the gate, although it
considers the signal correlations within the logic module.

As an example, the transition density at the output of a 2-input AND gate is

\[ D(y) = D(a)P(b) + D(b)P(a) \]

where $y$ is the output of the gate and $a$, $b$ are the two inputs. However, when
$a = \bar{b}$ (and hence $P(a) = 1 - P(b)$), $y$ is always 0 and therefore
$D(y) = 0$. Using the above transition density measure, we nevertheless obtain

\[ D(y) = D(a)P(b) + D(b)\,(1 - P(b)) = D(b) \]

which is not correct. This drawback, caused by the simultaneous switching of the
inputs, can be overcome using conditional probabilities [CRP].
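
The propagation rule and its blind spot can be illustrated with a short Python
sketch. It assumes statistically independent inputs and evaluates probabilities by
enumerating the truth table; the helper names (`prob`, `boolean_difference`,
`transition_density`) and the numeric values are hypothetical and chosen only for
illustration.

```python
from itertools import product

def prob(f, names, p):
    """P(f = 1), assuming independent inputs with equilibrium probabilities p."""
    total = 0.0
    for bits in product((0, 1), repeat=len(names)):
        env = dict(zip(names, bits))
        if f(env):
            weight = 1.0
            for name, bit in env.items():
                weight *= p[name] if bit else 1.0 - p[name]
            total += weight
    return total

def boolean_difference(f, x):
    """Return df/dx = f|x=1 XOR f|x=0 as a new boolean function."""
    return lambda env: bool(f({**env, x: 1})) != bool(f({**env, x: 0}))

def transition_density(f, names, p, d):
    """D(y) = sum over the inputs of P(dy/dx_i) * D(x_i)."""
    return sum(prob(boolean_difference(f, x), names, p) * d[x] for x in names)

and2 = lambda env: env["a"] and env["b"]
names = ["a", "b"]

# Independent inputs: D(y) = D(a)P(b) + D(b)P(a) = 2*0.5 + 4*0.5
density = transition_density(and2, names, {"a": 0.5, "b": 0.5}, {"a": 2.0, "b": 4.0})

# If a were the complement of b (same density, P(a) = 1 - P(b)), the true D(y)
# would be 0, but the formula cannot see the correlation and returns D(b).
correlated = transition_density(and2, names, {"a": 0.7, "b": 0.3}, {"a": 4.0, "b": 4.0})
```

The second call reproduces the incorrect value D(b) discussed above, because the
correlation between the inputs is not part of the data the formula receives.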

Extended power-consumption model

Notation


[Figure (a): graph representation of a static CMOS gate with inputs a1, a2 and b,
internal nodes n0 and n1, output node y, and supply nodes vdd and vss.]

[Figure (b):]

    function = CALCULATE_H_FUNCTION(n_k)
        dfs_list = DEPTH_FIRST_SEARCH(n_k, vdd)
        for each element e_k in dfs_list
            function = ADD_INPUT_TO_MINTERM(e_k)
            if source(e_k) = vdd then
                function = CREATE_NEW_MINTERM

Figure: (a) Static CMOS gate representation and (b) algorithm to obtain
function $H_{n_k}$

We represent a static CMOS gate as a directed acyclic graph $(V, E)$.
$V = \{n_0, \ldots, n_{p-1}, y, vdd, vss\}$ is the set of nodes representing the $p$
internal nodes of the gate ($n_0, \ldots, n_{p-1}$), the output node ($y$) and the
power and ground nodes ($vdd$, $vss$). The set of edges representing the $q$
transistors ($q_P$ of type P and $q_N$ of type N) that connect nodes in $V$ is
$E = \{e^P_0, \ldots, e^P_{q_P-1}, e^N_0, \ldots, e^N_{q_N-1}\}$. Figure (a) shows the
graph representation of gate C in Figure (a). Note that this representation retains
the transistor order information of the gate.

The power consumption of a node (output or internal) of a gate is potentially
affected by all the inputs of the gate. In particular, the power consumption of node
$n_k$ produced by input $x_i$, $W_{n_k}|_{x_i}$, is

\[ W_{n_k}|_{x_i} = \frac{1}{2}\,\frac{V_{DD}^2}{T_{cyc}}\, C_{n_k}\, T_{x_i \to n_k} \]

where $T_{x_i \to n_k}$ is the number of transitions at node $n_k$ because of input
$x_i$. In other words, $T_{x_i \to n_k}$ expresses how many signal transitions of
$x_i$ ($D(x_i)$) are propagated to node $n_k$. $C_{n_k}$ is the capacitance of node
$n_k$.

Henceforth, we assume that a node is charged (discharged) only when there is a
direct path from power (ground) supply to the node, i.e., we do not consider charge
sharing among the nodes of the gate.

Computation of the model

To compute $T_{x_i \to n_k}$, the boolean function that represents all possible paths
from power supply to node $n_k$ needs to be calculated. Let us call this function
$H_{n_k}$; similarly, $G_{n_k}$ is the boolean function that represents all possible
paths from $n_k$ to ground.

The algorithm to obtain the $H_{n_k}$ function is depicted in Figure (b). Function
$H_{n_k}$ is obtained by generating a minterm for each possible path from node $n_k$
to supply node $vdd$. A path from node $n_k$ to $vdd$ is a set of $r$ edges $e_{k_i}$
so that $dst(e_{k_0}) = n_k$, $dst(e_{k_1}) = src(e_{k_0})$, ...,
$dst(e_{k_{r-1}}) = src(e_{k_{r-2}})$ and $src(e_{k_{r-1}}) = vdd$, where $src(e_k)$
($dst(e_k)$) is the source (destination) node of edge $e_k$. Using a
depth-first-search approach [CLR], a list of all edges to visit is created
(depth_first_search_list). Afterwards, the edges of this list are added to the current
minterm of the $H_{n_k}$ boolean function (ADD_INPUT_TO_MINTERM) until an edge
$e_{k_j}$ is reached so that $src(e_{k_j}) = vdd$. In this case, a new minterm is
created (CREATE_NEW_MINTERM), sharing with the last created minterm all its edges but
the last one visited.

In the example of Figure (a), the four minterms generated when calculating $H_{n_1}$
are $\{a_1\bar{b},\ a_1\bar{a}_2\bar{a}_1,\ a_2\bar{b},\ a_2\bar{a}_2\bar{a}_1\}$,
leading to $H_{n_1} = \bar{b}\,(a_1 + a_2)$. Similarly, $G_{n_k}$ can also be derived.
In Figure (a), $G_{n_1} = b$. The time complexity of these algorithms is linear in the
number of transistors of the gate.
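
The path enumeration can be sketched in Python as follows. This is a simplified
illustration rather than the algorithm of the figure: it enumerates paths
recursively instead of building a depth-first edge list, and the edge list `EDGES` is
an assumed encoding of the example gate (P-type edges a1: vdd to n0, a2: n0 to y,
b: vdd to y; N-type edges a1 and a2: y to n1, b: n1 to vss). All names are
hypothetical.

```python
from itertools import product

# (source node, destination node, gate input, transistor type)
EDGES = [
    ("vdd", "n0", "a1", "P"),
    ("n0", "y", "a2", "P"),
    ("vdd", "y", "b", "P"),
    ("y", "n1", "a1", "N"),
    ("y", "n1", "a2", "N"),
    ("n1", "vss", "b", "N"),
]

def paths(src, dst, edges=EDGES):
    """Every directed path src -> dst, each one a list of (input, on_value)."""
    if src == dst:
        return [[]]
    found = []
    for s, d, inp, kind in edges:
        if s == src:
            literal = (inp, 1 if kind == "N" else 0)  # value that turns it on
            found += [[literal] + rest for rest in paths(d, dst, edges)]
    return found

def conducts(path_list, env):
    """True if at least one path has all of its transistors turned on."""
    return any(all(env[i] == v for i, v in path) for path in path_list)

def h_function(node):
    """H: some path from the power supply to the node conducts."""
    return lambda env: conducts(paths("vdd", node), env)

def g_function(node):
    """G: some path from the node to ground conducts."""
    return lambda env: conducts(paths(node, "vss"), env)

h_n1, g_n1 = h_function("n1"), g_function("n1")
minterms = paths("vdd", "n1")   # four paths, two of them unsatisfiable
```

Two of the four paths contain both a literal and its complement, so they never
conduct, which is why the function collapses to two product terms.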

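Afterwards, the boolean difference of function $H_{n_k}$ with respect to input $x_i$,
that is $\partial H_{n_k} / \partial x_i$, and the equilibrium probabilities of node
$n_k$ need to be calculated. The boolean function $\partial H_{n_k} / \partial x_i$ is
calculated as explained in Section. The equilibrium probability of node $n_k$ is
obtained as follows [HZA]: the probability of node $n_k$ being 1 at a given instant of
time ($P(n_k)|_c$) is the probability that $n_k$ was 1 the instant before
($P(n_k)|_b$) and it is not discharged
($P(\overline{\partial G_{n_k} / \partial x_i})$), or that it was 0
($P(\overline{n_k})|_b$) and it is charged ($P(\partial H_{n_k} / \partial x_i)$),
i.e.

\[ P(n_k)|_c = P\!\left(\overline{\frac{\partial G_{n_k}}{\partial x_i}}\right) P(n_k)|_b
   + P\!\left(\frac{\partial H_{n_k}}{\partial x_i}\right) P(\overline{n_k})|_b \]

where $P(\overline{f}) = 1 - P(f)$.

Since all signals are assumed to be stationary Markov processes, the steady-state
value of $P(n_k)$ can be derived as

\[ P(n_k) = \frac{P\!\left(\frac{\partial H_{n_k}}{\partial x_i}\right)}
   {P\!\left(\frac{\partial G_{n_k}}{\partial x_i}\right)
    + P\!\left(\frac{\partial H_{n_k}}{\partial x_i}\right)} \]

Note that $G_{n_k}$ and $H_{n_k}$ are complementary functions only when $n_k$ is the
output node $y$ of the gate.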
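
The step from the recurrence to the steady-state value can be written out
explicitly. The shorthand $g = P(\partial G_{n_k} / \partial x_i)$ and
$h = P(\partial H_{n_k} / \partial x_i)$ is introduced here only for brevity, and
$g + h > 0$ is assumed; in steady state the probability before and after a transition
is the same value $P(n_k)$.

```latex
\begin{align*}
P(n_k) &= (1 - g)\,P(n_k) + h\,\bigl(1 - P(n_k)\bigr) \\
P(n_k)\,(g + h) &= h \\
P(n_k) &= \frac{h}{g + h}
\end{align*}
```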


We conclude that $W_{n_k}|_{x_i}$ is

\[ W_{n_k}|_{x_i} = \frac{1}{2}\,\frac{V_{DD}^2}{T_{cyc}}\, C_{n_k}\, D(x_i)
   \left[ P\!\left(\frac{\partial H_{n_k}}{\partial x_i}\right) P(\overline{n_k})
   + P(n_k)\, P\!\left(\frac{\partial G_{n_k}}{\partial x_i}\right) \right] \]

If the contributions of all nodes (output and internal) are taken into account, the
power estimation of the gate is obtained as

\[ P_{gate} = \sum_{i=0}^{n-1} W_{y}|_{x_i}
   + \sum_{k=0}^{p-1} \sum_{i=0}^{n-1} W_{n_k}|_{x_i} \]

where $p$ is the number of internal nodes and $n$ is the number of inputs of the gate.
The power consumption of a circuit is then the sum of the power consumption of
its gates.
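
As a rough numerical illustration, the sketch below evaluates one reading of the
model for a 2-input NAND gate and compares the two orderings of its NMOS stack. It
assumes independent inputs and the per-input expression
$\frac{1}{2} (V_{DD}^2 / T_{cyc})\, C\, D(x_i)\, [P(\partial H / \partial x_i) P(\overline{n}) + P(n) P(\partial G / \partial x_i)]$;
the capacitances, supply voltage, cycle time and all function names are hypothetical.

```python
from itertools import product

INPUTS = ["a", "b"]

def prob(f, p):
    """P(f = 1) for independent inputs with equilibrium probabilities p."""
    total = 0.0
    for bits in product((0, 1), repeat=len(INPUTS)):
        env = dict(zip(INPUTS, bits))
        if f(env):
            weight = 1.0
            for name, bit in env.items():
                weight *= p[name] if bit else 1.0 - p[name]
            total += weight
    return total

def diff(f, x):
    """Boolean difference df/dx = f|x=1 XOR f|x=0."""
    return lambda env: bool(f({**env, x: 1})) != bool(f({**env, x: 0}))

def node_power(H, G, cap, p, d, vdd=1.0, t_cyc=1.0):
    """Sum over the inputs of the per-input power of one node."""
    total = 0.0
    for x in INPUTS:
        h = prob(diff(H, x), p)   # probability that x can charge the node
        g = prob(diff(G, x), p)   # probability that x can discharge the node
        if h + g == 0.0:
            continue              # x never affects this node
        p_node = h / (g + h)      # steady-state probability of the node
        activity = d[x] * (h * (1.0 - p_node) + p_node * g)
        total += 0.5 * vdd ** 2 / t_cyc * cap * activity
    return total

def nand2_power(near_output, near_ground, p, d, c_out=2.0, c_int=1.0):
    """Model power of a 2-input NAND for one ordering of its NMOS stack."""
    H_y = lambda e: not (e["a"] and e["b"])                 # vdd -> y
    G_y = lambda e: e["a"] and e["b"]                       # y -> vss
    H_n = lambda e: e[near_output] and not e[near_ground]   # vdd -> y -> n0
    G_n = lambda e: e[near_ground]                          # n0 -> vss
    return node_power(H_y, G_y, c_out, p, d) + node_power(H_n, G_n, c_int, p, d)

p = {"a": 0.5, "b": 0.5}
d = {"a": 1.0, "b": 10.0}   # b switches ten times more often than a
quiet_near_ground = nand2_power("b", "a", p, d)
busy_near_ground = nand2_power("a", "b", p, d)
```

Under these assumptions the output-node term is identical for both orderings, so the
difference comes entirely from the internal node, and the ordering with the less
active input next to ground gives the lower estimate.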

Computing functions $H_{n_k}$ and $G_{n_k}$

Boolean functions $H_{n_k}$ and $G_{n_k}$ that appear in the model in the last section
may be symbolically calculated using Binary Decision Diagrams (BDDs) [Bry].
There exist several representations of boolean functions: truth table, Karnaugh map,
canonical sum, canonical product, product of sums, sum of products, multilevel form,
etc. Among these, the canonical representations present the desirable property that,
if two functions are equivalent, their representations are the same, and vice versa.
Therefore, the equivalence test is straightforward. A form is canonical if the
representation of a function in that form is unique. The problem of canonical
representations is that they are large, typically exponential in the number of
variables. A BDD is a canonical (and compact for many functions) representation of a
boolean function.

Given a boolean function $f(x_1, \ldots, x_{j-1}, x_j, x_{j+1}, \ldots, x_n)$, the
positive and negative cofactors of $f$ with respect to $x_j$ are defined by
$f_{x_j} = f(x_1, \ldots, x_{j-1}, 1, x_{j+1}, \ldots, x_n)$ and
$f_{\bar{x}_j} = f(x_1, \ldots, x_{j-1}, 0, x_{j+1}, \ldots, x_n)$, respectively. A BDD
of a boolean function is a directed acyclic graph based on the successive Shannon
decompositions of $f$, given a sequence of splitting variables $x_i$. Each nonterminal
$N_{x_j}$ corresponds to a splitting variable $x_j$ and is the origin of two arcs,
$low(N_{x_j})$ and $high(N_{x_j})$. The subgraph rooted at $low(N_{x_j})$ represents
the negative cofactor $f_{\bar{x}_j}$, and the graph rooted at $high(N_{x_j})$
represents the positive cofactor $f_{x_j}$.

Figure (a) shows a BDD that represents the function f(a, b, c, d) = abc + bd + cd.
Edges are assumed to go from the top to the bottom terminal. A negative cofactor is
shown with a circle in the edge. There are one root node (the name of the function)
and two terminal nodes (0 and 1). Therefore, to evaluate function f for some
variable values, we start at the root node and, for each variable, we choose the
edge representing the positive cofactor if the variable is 1 or the negative cofactor
if it is 0. The path explored for one assignment of a, b, c and d is shown in bold in
Figure (a), together with the value obtained.

Figure: Two BDD representations of function f = abc + bd + cd using different
variable order

The complexity, in terms of node count and graph depth, strongly depends on the
choice of the splitting variables. The same function f is represented in Figure (b)
using a different variable order. A good ordering is essential for obtaining a
small graph. Some heuristics have been proposed to find a good ordering for a
BDD [FFK, MWBSV]. An ordered BDD (OBDD) is a BDD where variables appear in the same
order along all paths from the root to the terminal nodes.
Moreover, an OBDD can be reduced by iteratively applying the reductions

identification of isomorphic subgraphs

elimination of redundant nodes

to obtain a reduced OBDD (ROBDD). An ROBDD is a canonical form.
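
The two reductions and the effect of the variable order can be sketched with a small
Python ROBDD builder. This is a didactic illustration that enumerates the whole
truth table, so it is only usable for a handful of variables; it is not the SIS BDD
package, and the example function is a textbook one rather than the function of the
figure. All names are hypothetical.

```python
def build_robdd(f, order):
    """Build a reduced ordered BDD of f by Shannon expansion along `order`.

    Returns (root, nodes), where nodes maps (var, low, high) -> node id.
    """
    nodes = {}

    def make(var, low, high):
        if low == high:                  # elimination of redundant nodes
            return low
        key = (var, low, high)
        if key not in nodes:             # identification of isomorphic subgraphs
            nodes[key] = len(nodes) + 2  # ids 0 and 1 are the terminal nodes
        return nodes[key]

    def expand(level, env):
        if level == len(order):
            return int(bool(f(env)))
        var = order[level]
        low = expand(level + 1, {**env, var: 0})    # negative cofactor
        high = expand(level + 1, {**env, var: 1})   # positive cofactor
        return make(var, low, high)

    return expand(0, {}), nodes

f = lambda e: (e["a1"] and e["b1"]) or (e["a2"] and e["b2"]) or (e["a3"] and e["b3"])
_, interleaved = build_robdd(f, ["a1", "b1", "a2", "b2", "a3", "b3"])
_, separated = build_robdd(f, ["a1", "a2", "a3", "b1", "b2", "b3"])
print(len(interleaved), len(separated))   # non-terminal node counts per order
```

For this function the interleaved order needs 6 non-terminal nodes and the separated
order needs 14, which shows how strongly the graph size depends on the ordering.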

Poweroptimization algorithm

Figure: Capacitances in a MOS transistor (CGS, CGD, CG, CS and CD, between the gate,
source, drain and bulk terminals)

In this work, we have used the subroutines provided with the SIS package (see
Chapter A, Section A) for handling BDDs.

Node capacitances

Several intrinsic capacitances are associated with a MOS transistor (Figure). The
values of these capacitances, and therefore the capacitance of an internal node
of a gate, clearly depend on the technology and design style used. For example,
most gates may be designed using an unbroken row of transistors in which abutting
source-drain connections are made. But, because transistors with the same signal
on their gate terminals should be vertically aligned to reduce internal routing, the
layout algorithm must find the minimum set of transistor chains in order to minimize
the number of diffusion gaps. As a consequence, two internal nodes with the
same number of source-drain connections may actually have different capacitances.
Furthermore, source and drain diffusions are different in PMOS- and NMOS-type
transistors.

Moreover, it is clear that gate output nodes are more capacitive than internal nodes
because of the additional routing and transistor-gate fanout capacitances.

Because modeling the capacitances of the nodes of a gate is difficult, these
capacitances should be extracted and stored for all gates of the library
whenever possible. This is the approach followed in this work.

POWEROPTIMIZATION ALGORITHM

In this section, an algorithm that traverses the gate description of the circuit is
presented. For each gate, it finds the best transistor reordering based on the
power-consumption model explained in Section and an exhaustive exploration of all
possible reorderings.

    OBTAIN_PROBABILITIES(circuit)
    gate_list = DEPTH_FIRST_TRAVERSE(circuit)
    for each gate gate in gate_list do
        info_inputs = OBTAIN_PROB_AND_DENS(gate, circuit)
        FIND_BEST_REORDERING(info_inputs, gate, circuit)
        info_output = CALCULATE_DENS(info_inputs, gate)
        UPDATE_CIRCUIT_INFORMATION(info_output, circuit)

Figure: Optimization algorithm

Algorithm overview

Finding the best transistor reordering implies an exhaustive exploration of each gate.
Since most gates only have a small number of transistors in series, an exhaustive
exploration is feasible. The algorithm to obtain all possible transistor reorderings
of a gate will be addressed later.

A simplified algorithm of the optimization approach for low power is shown in
Figure. The probabilities for all output nodes of the gates of the circuit are
computed in OBTAIN_PROBABILITIES. DEPTH_FIRST_TRAVERSE returns the list of gates
(gate_list) of the circuit (circuit) ordered in a depth-first fashion [CLR] from the
gate outputs, i.e., every gate appears somewhere after all of its transitive fanin
gates. For each gate of this list, the probability and transition density information
for all of its inputs is obtained from the circuit (OBTAIN_PROB_AND_DENS). Afterwards,
the best reordering is derived for the gate (FIND_BEST_REORDERING). Finally, the
transition density of the output node of the gate is calculated (CALCULATE_DENS) and
this information is transferred to the circuit (UPDATE_CIRCUIT_INFORMATION).
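
The control flow of this single-pass traversal can be sketched in Python. The sketch
is only a skeleton under simplifying assumptions: a circuit is a dictionary from gate
name to its list of fanin signals, candidate reorderings are plain permutations of
the inputs, and `toy_cost` and `toy_propagate` are hypothetical stand-ins for the gate
power model and for the transition-density calculation.

```python
from itertools import permutations

def depth_first_traverse(circuit, outputs):
    """Return the gates so that each one appears after all its fanin gates."""
    order, seen = [], set()

    def visit(signal):
        if signal in seen or signal not in circuit:   # primary inputs are not gates
            return
        seen.add(signal)
        for fanin in circuit[signal]:
            visit(fanin)
        order.append(signal)

    for out in outputs:
        visit(out)
    return order

def optimize(circuit, outputs, density, cost, propagate):
    """Greedy single pass: pick the cheapest input order of every gate."""
    best = {}
    for gate in depth_first_traverse(circuit, outputs):
        fanins = circuit[gate]
        best[gate] = min(permutations(fanins), key=lambda o: cost(o, density))
        density[gate] = propagate(fanins, density)    # visible to fanout gates
    return best

circuit = {"g1": ["a", "b"], "g2": ["g1", "c"]}
density = {"a": 1.0, "b": 5.0, "c": 2.0}
toy_cost = lambda order, dens: sum(i * dens[s] for i, s in enumerate(order))
toy_propagate = lambda fanins, dens: sum(dens[s] for s in fanins)
best = optimize(circuit, ["g2"], density, toy_cost, toy_propagate)
```

Because each gate is visited after its fanin gates, the density of every input is
already known when the gate is optimized, so one pass over the list is enough.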

Monotonic characteristic

The algorithm takes advantage of the following property of the model explained
in Section: the reduction of the power in an individual gate always decreases
the power of the circuit. The reason for this monotonic behavior is that all possible
transistor reorderings of a gate lead to the same probability and transition density
at its output node if the model explained in Section is used to compute them. Since

the model relies precisely on the probability and transition density of the inputs
of a gate to decrease its power consumption, and

the power of the circuit is the sum of the power of its gates,

it is clear that the reduction of the power in an individual gate always decreases
the total power of the circuit. This monotonic behavior may not correspond to the
actual behavior of a circuit, but our results show that this local greedy approach
results in an overall power reduction for the whole circuit.

Thus, with only one traversal of the circuit, the optimal reordering (always with
respect to the model) for all gates is obtained.

Equilibrium probabilities

To apply the power-consumption model explained in Section, the equilibrium
probabilities of the inputs of the gate need to be known. The calculation of the
probabilities for all nodes in a circuit that has reconvergent fanout (i.e., there is
correlation among the signals) is an NP-hard problem [Naj]. Without properly
partitioning the circuits [CRP, Kap], the method may not be applicable to
very large circuits.

Since having correct equilibrium probabilities is important to obtain better
accuracy, we have implemented the algorithm proposed in [PM], which is suitable for
computing probabilities when reconvergent fanout exists.

This algorithm traverses the circuit starting at its inputs and computes the logical
expression for the output of each gate as a function of the expressions of its inputs.
Once an expression has been calculated, all its exponents are suppressed to obtain the
correct probability expression for that signal.
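
Exponent suppression can be illustrated with a small symbolic sketch in Python. It
is not the implementation used in this work: probability expressions are stored as
polynomials whose monomials are sets of variables, so that multiplying a variable by
itself leaves the monomial unchanged. The representation and all function names are
hypothetical, and the primary inputs are assumed to be mutually independent.

```python
from collections import defaultdict

ONE = {frozenset(): 1.0}

def p_var(name):
    """Probability polynomial of a primary input."""
    return {frozenset([name]): 1.0}

def add(p, q, sign=1.0):
    out = defaultdict(float)
    for mono, c in p.items():
        out[mono] += c
    for mono, c in q.items():
        out[mono] += sign * c
    return {m: c for m, c in out.items() if c != 0.0}

def mul(p, q):
    out = defaultdict(float)
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[m1 | m2] += c1 * c2   # set union drops exponents: x * x -> x
    return {m: c for m, c in out.items() if c != 0.0}

def p_not(p):
    return add(ONE, p, -1.0)

def p_and(p, q):
    return mul(p, q)

def p_or(p, q):
    return add(add(p, q), mul(p, q), -1.0)

def evaluate(poly, values):
    total = 0.0
    for mono, c in poly.items():
        term = c
        for v in mono:
            term *= values[v]
        total += term
    return total

# y = (a AND b) OR (a AND c): input a reconverges at the OR gate.
a, b, c = p_var("a"), p_var("b"), p_var("c")
y = p_or(p_and(a, b), p_and(a, c))
exact = evaluate(y, {"a": 0.5, "b": 0.5, "c": 0.5})
naive = 0.25 + 0.25 - 0.25 * 0.25   # treats both AND outputs as independent
```

With all input probabilities at 0.5, the suppressed-exponent expression gives 0.375,
the exact probability of a(b + c), whereas treating the two AND outputs as
independent gives 0.4375.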

Exhaustive exploration of gate congurations

Table shows the number of possible configurations obtained by reordering the
transistors of some standard gates. These configurations are obtained by pivoting on
the internal nodes of the gate. More formally, this process of obtaining a new
configuration is described as follows: let BS be the subgraph of the graph
representing the PMOS or NMOS blocks of a gate composed of all edges whose source is
node $n_k$ and whose destination node is either node $n_{bs}$ or another node
$n_{dummy}$. The same definition is recursively applied to node $n_{dummy}$ until all
destination nodes are $n_{bs}$.


Gate Conf Gate Conf Gate Conf

inv aoiAB oaiAB

nand aoi oai

nand aoiAB oaiAB

nand aoiAB oaiAB

nor aoi oai

nor aoiABC oaiABC

nor aoiABC oaiABC

aoi oai

Table: Number of different configurations for some standard gates obtained
by reordering their transistors

    PIVOTE_AND_SEARCH(gate_graph, visited_reords, current_node)
        gate_graph = PIVOTING_ON_NODE(gate_graph, current_node)
        if not VISITED(gate_graph, visited_reorderings) then
            visited_reords = ADD_TO_VISITED_REORDS(gate_graph)
            for index = 0 to number_of_internal_nodes - 1 do
                if index ≠ current_node then
                    PIVOTE_AND_SEARCH(gate_graph, visited_reords, index)

    FIND_ALL_REORDERINGS(gate_graph)
        visited_reords = ∅
        for index = 0 to number_of_internal_nodes - 1 do
            PIVOTE_AND_SEARCH(gate_graph, visited_reords, index)

Figure: Exhaustive exploration algorithm

Similarly, let TS be the subgraph with its associated node $n_{ts}$. A new
configuration of the gate is obtained by interchanging subgraphs BS and TS.

For example, in any configuration of Figure (a), new configurations are obtained
by pivoting on node $n_1$, $n_{ts}$ and $n_{bs}$ being in all cases $y$ and $vss$,
respectively.

For the graph representation of the gate explained in Section, an algorithm
that finds all possible transistor reorderings of a gate is presented in Figure.
The strategy of pivoting on an internal node to obtain a new reordering of
transistors leads to the generation of repeated reorderings. A dynamic programming
approach with memoization [CLR] is used to avoid the generation of overlapping
subproblems.

The algorithm recursively points to an internal node (current_node) and pivots on it
to obtain a new reordering (PIVOTING_ON_INTERNAL_NODE). Further searching for new
reorderings is pruned if the reordering obtained has already been visited (VISITED).

Results

Figure: Execution example of the exhaustive exploration algorithm (successive
pivoting steps on nodes n0 and n1, one of which leads to an already visited
reordering)

If the reordering has not been visited, it is added to the set of transistor
reorderings of the gate already visited (ADD_TO_VISITED_REORDERINGS), and the
algorithm is called again for all internal nodes of the gate except the current one;
this prevents the generation of a reordering that is known beforehand to have been
visited already.

The algorithm in Figure works for gates that can be represented with a
series-parallel graph. Almost all the gates of typical libraries can be represented
with this type of graph. Figure shows a gate that does not admit a series-parallel
representation. The authors are currently working on the formal demonstration that
all possible transistor reorderings of a static CMOS gate are generated with this
algorithm.

To illustrate how this algorithm works, it has been applied to the gate implementing
the function $y = \overline{(a_1 + a_2)\, b}$. Figure shows the execution. The
starting graph representation of the gate is the one in Figure (a). We observe that
all four possible reorderings (those already seen in Figure (a)) are generated.
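
The number of reorderings of a series-parallel gate can be cross-checked with a
short Python sketch. Note that this is not the pivoting algorithm described above:
it enumerates the orderings directly from a nested series-parallel expression, where
a series chain may be permuted freely and the order of parallel branches is
irrelevant. The tuple encoding ("S" for series, "P" for parallel) and the function
names are hypothetical.

```python
from itertools import permutations, product

def orderings(net):
    """All distinct transistor orderings of a series-parallel network."""
    if isinstance(net, str):              # a single transistor
        return {net}
    kind, *children = net
    result = set()
    for combo in product(*(orderings(child) for child in children)):
        if kind == "S":                   # series: every permutation is distinct
            for perm in permutations(combo):
                result.add(("S",) + perm)
        else:                             # parallel: branch order does not matter
            result.add(("P",) + tuple(sorted(combo, key=repr)))
    return result

def configurations(nmos, pmos):
    """Each NMOS ordering can be combined with each PMOS ordering."""
    return len(orderings(nmos)) * len(orderings(pmos))

# Gate y = NOT((a1 + a2) b)
gate = configurations(("S", ("P", "a1", "a2"), "b"), ("P", ("S", "a1", "a2"), "b"))
# 3-input NAND
nand3 = configurations(("S", "a", "b", "c"), ("P", "a", "b", "c"))
```

For the example gate this count is 4, and for a 3-input NAND it is 6, matching the
four reorderings of the execution example and the six input reorderings mentioned for
the NAND gate.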


Figure: A gate ($y = \overline{ad + be + bcd + ace}$) without a series-parallel
representation

RESULTS

Scenarios for the exp eriments

A wide range of MCNC [Yan] circuits have been used as benchmarks. They have
been mapped into the gate library shown in Table. In some cases, to obtain
all transistor reorderings of a gate, it is necessary to have more instances of that
gate. For example, there are two instances of gate oai: oaiA, which is able to
implement configurations A and B of Figure (a), and oaiB, which is able
to implement configurations C and D. All instances of the gates in Table
have been implemented in a sea-of-gates design style (see Chapter A, Section A).

Two scenarios have been considered to evaluate the power-consumption savings
obtained with the transistor reordering technique (see Figure). In Scenario A, the
circuit is considered to be embedded in a larger digital system. Thus,
the equilibrium probability and, especially, the transition density of the inputs of
the circuit may take very different values. In this scenario, the probabilities and
transition densities of the primary inputs of each circuit are randomly set with a
uniform distribution. Probabilities range from to and transition densities range
from to million transitions per second. In Scenario B, the circuit is considered
to be the combinational part of a sequential system. In this scenario, both the
probability and transition density of the primary inputs of the circuit may take
values between and. We have set both values to and transitions per cycle,
respectively.


Figure: The two scenarios considered (Scenario A: circuit alone; Scenario B: circuit
with clock)

In both scenarios, the optimization algorithm has been applied to the original
gate-level description of the circuits to obtain, for each gate, the best instance
and, for each instance, the best input reordering. Because all instances of the same
gate have the same area, the total area of the optimized circuit remains the same.

For each scenario and circuit, two new gate-level descriptions have been created.
One of them contains the best transistor reordering for low power for all gates,
whereas the other contains the worst one. A switch-level simulator (see Appendix A,
Section A) extracts the power consumption of each description. Thus, the maximum
power reduction for each scenario is obtained. The input signals to the circuits
used by the switch-level simulator have been generated with an exponential
distribution, i.e., time intervals between two consecutive transitions of input
signal $k$ to the gate follow an exponential distribution with average $1/D_k$,
$D_k$ being the transition density of input signal $k$. The delay information has
been obtained with the SIS system (see Appendix A, Section A).
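
A stimulus of this kind can be generated with a few lines of Python. The sketch
below assumes that the mean interval between transitions is the reciprocal of the
transition density and uses the standard-library function `random.expovariate`; the
function name `transition_times`, the seed and the numeric values are hypothetical
and unrelated to the simulator used in this work.

```python
import random

def transition_times(density, horizon, rng):
    """Transition instants in [0, horizon) with mean spacing 1/density."""
    times = []
    t = rng.expovariate(density)      # expovariate(lambd) has mean 1/lambd
    while t < horizon:
        times.append(t)
        t += rng.expovariate(density)
    return times

rng = random.Random(0)                # fixed seed, for a repeatable stimulus
times = transition_times(density=5.0, horizon=2000.0, rng=rng)
measured_density = len(times) / 2000.0
```

Over a long enough horizon, the measured number of transitions per time unit
approaches the requested transition density.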

Discussion

Table shows the results obtained. Columns Model and SLS show the power-consumption
reduction (best case compared to worst case for low power) obtained with the model
and with switch-level simulations, respectively. Column Delay shows the increase in
delay (best case for low power compared to a mapping into the original cell library).
The delay increases in most of the benchmarks because the best transistor reorderings
of a gate for low power and for low delay do not always coincide. In fact, the rule of
thumb stating that the critical transistor should always be placed near the output
terminal to obtain a fast gate contradicts the low-power rule of placing it close to
the ground node, as can be observed in the motivation example case in Section and in
[SLW].

It is shown that the average improvement in power consumption in scenario A is
with an average increase in delay of. The estimated average improvement
is. The reason for this lower value in the estimated improvement is that the
model, in general, overestimates the power consumption by an offset, thus leading
to a lower estimated reduction.

The power reduction in scenario B is roughly half the one in scenario A because of
the smaller amount of the circuits in scenario B. The power and delay of latches and
the clock line in scenario B have not been included in the results. In both scenarios,
there is a small average increase in delay.

Thus, significant power consumption reductions can be obtained in both scenarios
with little average increase in delay, and it is possible to achieve power reductions
without increasing the delay of the circuit (see, for example, benchmarks c and
too_large in Table).

CONCLUSIONS

This chapter has shown that the performance optimization technique called transistor
reordering may also be used to reduce the power consumption of a circuit.
Average power reductions of at the expense of increase in delay have been
obtained for a wide range of benchmarks.

An optimization algorithm that uses a stochastic power-consumption model of a
static CMOS gate has been presented. This novel power-consumption gate model
takes into account both the probabilities and the switching activity of the inputs of
the gates that compose the circuit to estimate the activity at the internal nodes of
the gate.

The results obtained suggest that

current libraries may be upgraded with more instances of the gates with different
transistor reorderings, so that an optimization algorithm can choose the best
instance for power reduction, and

it is also possible to obtain power reductions without increasing the delay of the
circuit.

Conclusions

Circuit      Gates    Scenario A              Scenario B
name                  Model   SLS   Delay     Model   SLS   Delay
alu
c
c
c
cht
cma
cma
comp
cordic
i
mux
my_adder
parity
too_large
x
x
pcle
pcler
frg
sct
unreg
zml
fml
symml
apex
count
c
c
c
alu
apex
example
i
i
i
rot
term
ttt
x
Average

Table: Results obtained for several MCNC benchmarks for both scenarios considered


CONCLUSIONS AND FUTURE RESEARCH

This chapter presents the conclusions for the contributions of this work, namely
some low-power design techniques at the system, architecture and logic/circuit
layers of the design process of a circuit. In addition, some research directions
derived from these techniques are discussed.

Chapter Conclusions and future research

INTRODUCTION

Concern about the increasing power consumption of circuits has grown in recent years
in the semiconductor industry. Issues like battery life, chip package and cooling
devices, reliability, etc. are becoming key factors for the successful outcome of the
products for a large sector of the microprocessor market.

The main factor that has driven most of the efforts in low-power design is the
increasing demand for portable battery-operated applications such as notebook and
laptop computers or PDAs (Personal Digital Assistants). In these applications, one
of the primary design constraints is the battery life.

Thus, circuit designers face a new dimension (power) in the traditional
two-dimensional (area and delay) search space, with the associated increase in
complexity. There exist few CAD tools that take into account the power consumption
during the design process; therefore, new low-power design techniques are needed to
help the designer in the complex task of meeting area, delay and power constraints.

There are two main research fields in the low-power design area: power estimation
and low-power design. Both are tightly related, since a previous analysis of the
power consumption of the design to be implemented is needed to understand where
the power goes, i.e., which are the most power-consuming parts that need to be
targeted by the low-power design techniques. In this work, we have focused on
low-power design techniques.

The power consumed by a circuit is divided into static and dynamic. Static power
is consumed when there is no activity. Transistors do not behave as perfect switches,
and therefore leakage currents (caused by parasitic diodes) and static currents (a
function of the input voltage and the threshold voltage of the transistors) arise.
The contribution of the static consumption is less than of the total power, and it is
usually neglected.

Dynamic power is consumed when there is activity. This component of the power
is divided into

short-circuit power consumption: a short-circuit current arises when there is
a path between power supply and ground where all transistors are on. The
power associated with this effect can be kept below of the total power for
datapath circuits.

capacitance charge/discharge power consumption: whenever a circuit node changes
its value, its associated capacitance has to be charged or discharged, thus
consuming power. For datapath circuits, this component accounts roughly for
the of the total power.

Considering only the capacitance charge/discharge power component, the power
consumption of an output node $i$ of a gate is defined as

\[ \alpha\, C_i\, V_{DD}^2\, f \]

where $V_{DD}$ is the power supply, $f$ is the operating frequency and $\alpha$ is
the number of times that the capacitance $C_i$ is charged or discharged per cycle.
This equation reveals the three degrees of freedom when designing for low power:
voltage, physical capacitance and activity. Once the supply voltage and the device
dimensions have been fixed, it is the switching activity of the signals of the circuit
that will ultimately determine its power dissipation.
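
The three degrees of freedom can be made concrete with a small numerical sketch in
Python. It assumes that the expression has the form activity times capacitance times
squared supply voltage times frequency; any constant factor would cancel in the
ratios computed here. The function name and the numeric values are hypothetical.

```python
def dynamic_power(alpha, capacitance, vdd, frequency):
    """Charge/discharge power of one node: alpha * C * Vdd^2 * f."""
    return alpha * capacitance * vdd ** 2 * frequency

base = dynamic_power(alpha=0.5, capacitance=50e-15, vdd=3.3, frequency=100e6)
half_voltage = dynamic_power(alpha=0.5, capacitance=50e-15, vdd=1.65, frequency=100e6)
half_activity = dynamic_power(alpha=0.25, capacitance=50e-15, vdd=3.3, frequency=100e6)

voltage_ratio = half_voltage / base     # quadratic dependence on the supply
activity_ratio = half_activity / base   # linear dependence on the activity
```

Halving the supply voltage divides this component by four, whereas halving the
switching activity divides it by two, which is the lever left once voltage and
device dimensions are fixed.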

Techniques for reducing this activity are the contribution of this work. Several
low-power design techniques have been reviewed, covering all the layers of the design
hierarchy:

system: system shutdown, system partitioning, algorithm selection

architecture: parallel and pipelined processing with voltage scaling, compiler
transformations, cache design, data representation

logic: path balancing and gate sizing, multilevel logic synthesis, technology
mapping, FSM state assignment, retiming, disabling registers

circuit: logic style, asynchronous design, design style, adiabatic computing

layout: floorplanning, placement and routing

technology: technology scaling and SOI process, packaging technology

The highest power savings are obtained at the architecture and system layers.
However, significant power reductions are also achieved at the logic and circuit
layers, which may add up to those already obtained at the higher layers.

CONTRIBUTIONS

This work contributes to low-power design at the system, architecture and
logic/circuit layers of the design process. The contributions have been published in
several international conferences [MCa, MCb, MCc, MC].

Chapter Conclusions and future research

System layer

At the system layer, a set of high-level synthesis techniques (loop interchange, operand reordering, operand sharing, operand retaining, operand similarity) that tackle the problem of power consumption at the register-transfer level (RTL) have been presented and evaluated using an appropriate high-level power-consumption model. Rather than accurately estimating the power consumption of the final implementation, we focus on fairly comparing the relative benefits of different RTL descriptions.

The common idea behind these techniques is to reduce the activity at the inputs of the functional units, the most power-consuming units in an RTL architecture. To cope with the high complexity of the problem of power reduction, all the techniques are based on heuristics which, in some cases, trade area and performance for power.

Each of the techniques presented has been evaluated with some benchmarks, which give clear evidence that significant benefits can be obtained. Up to power reductions in the functional units have been observed. The techniques are complementary, and not every technique produces a tangible improvement on every benchmark. Moreover, the implementation of these techniques may cause an increase in area. In the evaluation of the power-consumption reductions we have not taken into account the power of this additional circuitry. However, we believe that it is negligible compared to the power consumption of the functional units.

The high-level synthesis tasks of scheduling and register binding have been automated, and power-area and power-delay trade-offs have been studied. The scheduling algorithm for low power uses a list-scheduling approach where the priorities of the operations in the ready-operation queue are set in such a way that operations sharing the same operand are bound to the same functional unit and scheduled so that the functional unit can reuse that operand. Significant power reductions are obtained in the scheduling task (up to ) with little or no increase in latency.
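The operand-sharing priority can be illustrated with a small list scheduler. This is only a sketch under simplifying assumptions that are not part of the algorithm described above: all operations take one cycle, all functional units are identical, and the priority function simply counts the operands an operation shares with the one last executed on a unit. The names `list_schedule` and `shared_operands` and the example operations are hypothetical.

```python
def shared_operands(operands, previous):
    """Number of operands an operation has in common with the previous one."""
    return len(set(operands) & set(previous or ()))


def list_schedule(ops, n_units, prefer_sharing=True):
    """Schedule single-cycle operations on n_units identical functional units.

    ops maps an operation name to its pair of operands; an operand that is
    itself an operation name is a data dependence.
    Returns (steps, reused): steps[t] lists (unit, operation) pairs for cycle t,
    and reused counts the operations that found one of their operands already
    present at the inputs of the unit they were bound to.
    """
    done, steps, reused = set(), [], 0
    last = [None] * n_units          # operands last applied to each unit
    pending = list(ops)              # keeps the original priority order
    while pending:
        ready = [n for n in pending
                 if all(x not in ops or x in done for x in ops[n])]
        if not ready:
            raise ValueError("cyclic dependence among operations")
        step = []
        for unit in range(n_units):
            if not ready:
                break
            if prefer_sharing:
                # max() keeps the first candidate on ties, so the original
                # order acts as the tie-breaker.
                pick = max(ready,
                           key=lambda n: shared_operands(ops[n], last[unit]))
            else:
                pick = ready[0]
            reused += shared_operands(ops[pick], last[unit]) > 0
            last[unit] = ops[pick]
            ready.remove(pick)
            pending.remove(pick)
            step.append((unit, pick))
        done.update(name for _, name in step)
        steps.append(step)
    return steps, reused


# Four hypothetical multiplications; m4 shares "a" with m1, m3 shares "c" with m2.
OPS = {"m1": ("a", "b"), "m2": ("c", "d"), "m3": ("c", "e"), "m4": ("a", "f")}
steps_lp, reused_lp = list_schedule(OPS, n_units=2)
steps_base, reused_base = list_schedule(OPS, n_units=2, prefer_sharing=False)
```

In this example both schedules take two cycles, but only the sharing-aware one lets each unit keep one operand unchanged between consecutive operations.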

The register-binding algorithm for low power is based on clique partitioning of a restricted variable-lifetime compatibility graph to obtain a register set that, for each functional unit, reduces the power consumption during its idle cycles. Power consumption in the functional units during non-idle cycles is decreased by taking into account the average Hamming distance among the variables of the behavioral description and the commutative property of some operations. The power reductions obtained consider all units of the RTL design, not only the functional units.
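The role of the Hamming distance and of commutativity can be sketched as follows. The fragment counts the bits that toggle at the two input ports of a functional unit when one operand pair follows another, and swaps the operands of a commutative operation when that lowers the count. It works on single example values, whereas the technique above uses average distances among variables; the function names and bit patterns are illustrative assumptions.

```python
def hamming(x, y):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(x ^ y).count("1")


def input_toggles(prev, cur):
    """Bits toggling at the two input ports when pair cur follows pair prev."""
    return hamming(prev[0], cur[0]) + hamming(prev[1], cur[1])


def best_order(prev, cur):
    """For a commutative operation, pick the operand order with fewer toggles."""
    swapped = (cur[1], cur[0])
    return min(cur, swapped, key=lambda pair: input_toggles(prev, pair))


# Hypothetical 8-bit operand pairs applied in consecutive cycles.
previous = (0b11110000, 0b00001111)
current = (0b00001110, 0b11110001)

as_given = input_toggles(previous, current)
chosen = best_order(previous, current)
after_swap = input_toggles(previous, chosen)
```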


Although the scheduling and register-binding techniques for low power are compatible and complementary, the scheduling technique obtains better improvements if applied to dense schedules (e.g. schedules where the functional-unit occupation is high), whereas the register-binding technique is more suitable for sparse schedules.

Both the scheduling and the register-binding algorithms for low power trade off, in some cases, latency and area for power. They should be modified to achieve first the same performance or area as the original algorithms and then as much power reduction as possible. Moreover, the algorithms proposed are modifications of two traditional approaches for scheduling (list-based) and binding (clique covering). Other approaches for the scheduling and binding tasks (force-directed, greedy, integer programming) should be investigated.

The techniques presented focus only on functional units, since we target data-intensive applications. For memory-intensive circuits, different techniques should be devised that reduce the number of memory references.

Architecture layer

Useless activity does not contribute to the calculation of the final result of the circuit and only dissipates power. Useless activity is the main source of power consumption in a class of circuits that present a regular, layered structure where each layer is driven by primary inputs from the very beginning. Moreover, the useless/useful activity ratio increases with the bit width of the inputs of the circuit. Array multipliers belong to this class of circuits.

The technique proposed to reduce the useless activity is based on the insertion of self-timed controlled latches on the highly active paths of the circuit. The enabling signals for these latches are generated by delaying an input signal. This technique has been implemented and evaluated for several configurations of array multipliers.

An implementation of this technique that retains the regularity of the array structure consists of inserting the latches following the layered structure of the array, i.e. latches are inserted acting as a transition-retaining barrier (TRB) for the useless activity.

Two different trade-offs have been studied:

• area-power: the more latches are inserted, the more useless transitions are eliminated. However, latches and their associated delay cells increase the total area and consume power. As an example, the insertion of a TRB in an bit array multiplier increases the power instead of reducing it.

• delay-power: if the delay of the delay cells used for enabling the TRBs is greater than the delay of the basic cell of the array, the multipliers will consume less power but will be slower. If not, the TRBs will switch into transparent mode too early, allowing some useless transitions to pass through and generate more useless activity in the next part of the array.

The best results are obtained when three TRBs are inserted. However, it is important to notice that the greatest decrease in power is achieved when inserting one TRB (compared to the original design) rather than when inserting two TRBs (compared to inserting just one). There is an optimal number of TRBs for power consumption. This number is a complex function of the size of the multiplier, the useless activity generated and the power consumption of the TRBs and delay cells. The evaluation of this optimal number has not been addressed in this work and is left as future research.

With three TRBs we achieve a reduction in power of in a bit array multiplier, with an increase in area and delay of and respectively. Further power savings are obtained with a proper design of the basic building block of the multiplier (the full-adder cell) with balanced paths, so that the generation of useless transitions is minimized.

The final multiplier remains combinational and highly regular. This implies that the TRB technique can be easily integrated in existing CAD tools for the generation of array multipliers.

Logic/circuit layer

Some complex logic functions admit several static CMOS implementations that differ in the order of the transistors. The transistor-reordering technique has already been proposed to decrease the propagation delay of the gate. We have used this technique to find an order of the transistors of a gate such that the switching activity at the internal nodes of the gate is minimized.

The power consumption at the internal nodes of a gate depends on the signal probabilities and the amount of switching activity at the inputs of that gate, and on the order of the transistors that compose the gate. Therefore, a stochastic power-consumption model of a gate that takes into account these parameters has been proposed. One drawback of this model is that it is based on the transition density measure of the activity. This measure, although fast to evaluate, does not take into account the correlation (spatial or temporal) of the input signals to the gate. Other more accurate but more time-consuming models that take into account the signal correlation should be used to evaluate the power reductions.

This model is the core of an optimization algorithm that traverses the gate description of a circuit and, for each gate, finds the best order of the transistors for low power. This algorithm has the following property: a power reduction in an individual gate always decreases the power of the circuit. The reason for this monotonic behavior is that all possible transistor reorderings of a gate lead to the same probability and switching activity at its output node, if the model mentioned before is used to compute them.

Moreover, this algorithm works for gates that can be represented with a series-parallel graph. Almost all the gates of typical libraries can be represented with this type of graph.

Finding the best transistor reordering implies an exhaustive exploration of each gate. Since most gates have only a small number of transistors in series, an exhaustive exploration is feasible. An algorithm that performs this exploration is presented; it is based on a graph representation of a CMOS gate that preserves the order of the transistors.
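A possible way to enumerate the reorderings of a series-parallel network is sketched below. It is an illustration under stated assumptions, not the graph representation used in this work: a network is a nested tuple whose first element is `"s"` (series) or `"p"` (parallel), leaves are input names, only series connections are permuted (the order of parallel branches is electrically irrelevant), and a single network (for example the pull-down) is handled. A power model would then be evaluated on each ordering to select the best one.

```python
from itertools import permutations, product


def orderings(net):
    """Return every transistor ordering of a series-parallel network.

    net is either an input name (str) or a tuple ("s" | "p", child, child, ...).
    """
    if isinstance(net, str):
        return [net]
    kind, *children = net
    child_options = [orderings(child) for child in children]
    result = []
    if kind == "s":
        # Series children may appear in any order along the stack.
        for perm in permutations(range(len(children))):
            for combo in product(*(child_options[i] for i in perm)):
                result.append(("s", *combo))
    else:
        # Parallel branches are not permuted; only their insides vary.
        for combo in product(*child_options):
            result.append(("p", *combo))
    return result


# Hypothetical pull-down networks.
NAND3 = ("s", "a", "b", "c")                       # three transistors in series
AOI21 = ("s", ("p", "a", "b"), "c")                # (a OR b) in series with c
AOI22 = ("p", ("s", "a", "b"), ("s", "c", "d"))    # two series pairs in parallel

nand3_orders = orderings(NAND3)
aoi21_orders = orderings(AOI21)
aoi22_orders = orderings(AOI22)
```

The number of orderings grows factorially with the length of a series stack, which is why the small stack depth of typical library gates keeps the exhaustive search cheap.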

Two scenarios have been considered to evaluate the power-consumption savings obtained with the transistor-reordering technique. The main difference between these scenarios is the amount of switching activity that the primary inputs present. A standard gate library has been implemented with all the necessary instances of a gate to obtain all transistor reorderings for that gate.

The results obtained show that:

• the power consumption is always decreased;

• an average power reduction of and up to may be achieved;

• lower power reductions, on average, are obtained for the scenario with less switching activity at the primary inputs;

• the delay increases in most of the benchmarks, because the best transistor reordering for low power usually does not coincide with the best one for performance;

• in some circuits both power and delay may be reduced at the same time;

• the reductions predicted by the model underestimate the results obtained with a switch-level simulator, because the model in general overestimates the power consumption by an offset, thus leading to a lower estimated reduction.

FUTURE RESEARCH

The techniques presented as the contribution of this work may be improved to obtain better power reductions and to extend their applicability domain. Some future research lines are outlined for each of the three parts of the contributions of this work.

With regard to the high-level synthesis techniques:

• evaluate the area and delay overhead and implement those techniques that have not been automated (loop interchange, operand reordering);

• focus on the power consumption of memory-intensive applications and propose techniques to reduce the number of memory references; for example, study a mechanism to map the arrays to memory so that the activity at the address bus is minimized;

• modify the scheduling and register-binding algorithms for low power to achieve the same performance and area as the original algorithms;

• investigate other approaches (force-directed, greedy, integer programming) to scheduling and resource binding for low power.

With regard to the transition-retaining barrier technique:

• use a full-adder design with balanced paths so that the generation of useless transitions is minimized;

• reduce the area overhead by not inserting those latches that would eliminate only a small amount of useless activity;

• experimentally calculate the optimal number of TRBs for power;

• evaluate this technique for other types of circuits.

And with regard to the transistor-reordering technique:

• upgrade other standard libraries with more instances of the gates with different transistor reorderings, and evaluate the technique again;

• modify the power-optimization algorithm to apply the transistor-reordering technique only to those gates that are not in the critical path of the circuit, thus not increasing the delay;

• extend the stochastic power-consumption model of the static CMOS gate to be applicable to other logic families and to static CMOS gates that do not admit a series-parallel representation;

• use another, more accurate measure of the switching activity that takes into account the spatial and temporal correlation of the input signals to the gate.


A

CIRCUIT DESIGN AND ANALYSIS TOOLS

The designs proposed in this work have been implemented and evaluated with three circuit design and analysis tools: Ocean, SLS and SIS. In this chapter these tools are reviewed. The design and analysis of a full-adder circuit is presented as an example.


Figure A: (a) Schematic and (b) gate description of a full adder with inputs X, Y, CIN and outputs S, COUT.

A INTRODUCTION

In this work three tools have been used to implement and evaluate the proposed designs. All the tools have been developed at universities.

The Ocean design system [GS] has been used to implement, in a sea-of-gates design style, some of the cells and circuits proposed in this work. It has also provided an accurate estimation of the area at the layout level.

The SLS switch-level simulator [vG] has been used to obtain the power consumption of the circuits, either implemented with the Ocean system or implemented using its gate libraries. This simulator also provides timing analysis.

The SIS synthesis system [SSL+] has been used to calculate the critical path of a gate-level circuit description.

In this chapter we review some of the features of these three tools. The full-adder circuit of Figure A is implemented and analyzed with the tools.

A OCEAN

Ocean [GS] is a chip design package developed at Delft University of Technology, the Netherlands. It includes a set of tools for the synthesis and verification of semi-custom sea-of-gates and gate-array [WE] chips. Its main features are:

• hierarchical, full-custom-like layout style on sea-of-gates;

• automatic tools for placement, routing and extraction.

Figure A: Sea-of-gates layout architecture (Vdd metal, P diffusion, poly gates, N diffusion, Vss metal).

Sea-of-gates [WE] is a design option where the design cost of the integrated circuit is reduced, since the designer only has to define the connections with metals, contacts and vias. The core of the chip already contains a continuous array of N- and P-transistors. Those rows are repeated vertically (Figure A). Routing channels are formed by routing over the top of unused transistors. In contrast to conventional gate arrays, a sea-of-gates image does not have predefined routing channels. This enables a much more compact implementation of structured circuits such as processors.

Most designs choose equally sized transistors to equalize rise and fall times. The absolute size of transistors is a trade-off between drive capability, fan-in loading and the array density required. The size of the transistors also affects the routing tracks.

The package is provided with some standard gate libraries, which have been used to implement the circuits throughout this work. Whenever needed, we have implemented our own gates. The full-adder circuit in Figure A may be described with a network description language as shown in Figure A(a). Gates na and ex belong to the standard digital library provided with the Ocean package.

Once this description of the circuit is compiled and entered into a database, the Ocean system is able to perform an automatic placement and routing. The resulting circuit is shown in Figure A(b). The transistors and capacitances of this layout can be extracted and saved into the database for later simulations.

    extern network na (terminal A, B, Y, vss, vdd)
    extern network ex (terminal A, B, Y, vss, vdd)
    network full_adder (terminal X, Y, CIN, S, COUT, vss, vdd)
    {
        ex (X, Y, V, vss, vdd);
        ex (V, CIN, S, vss, vdd);
        na (X, Y, V, vss, vdd);
        na (V, CIN, V, vss, vdd);
        na (V, V, COUT, vss, vdd);
    }

Figure A: (a) Full-adder netlist description and (b) layout produced by Ocean.
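The netlist can be checked functionally with a few lines of Python before it is placed and routed. The sketch below assumes the usual two-XOR, three-NAND full-adder structure suggested by the netlist; the internal net names `n1`, `n2` and `n3` are hypothetical.

```python
from itertools import product


def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)


def xor(a, b):
    """Two-input XOR on 0/1 values."""
    return a ^ b


def full_adder(x, y, cin):
    """Gate-level full adder: two XOR gates for S, three NAND gates for COUT."""
    n1 = xor(x, y)        # ex (X, Y, n1)
    s = xor(n1, cin)      # ex (n1, CIN, S)
    n2 = nand(x, y)       # na (X, Y, n2)
    n3 = nand(n1, cin)    # na (n1, CIN, n3)
    cout = nand(n2, n3)   # na (n2, n3, COUT)
    return s, cout


# Exhaustive check against integer addition for the eight input combinations.
for x, y, cin in product((0, 1), repeat=3):
    s, cout = full_adder(x, y, cin)
    assert 2 * cout + s == x + y + cin
```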


    print X Y CIN S COUT
    option level
    option sigunit e
    dissipation
    set vdd h
    set vss l
    set X l h
    set Y l h
    set CIN l h

Figure A: (a) Command file and (b) graphical output of the SLS simulator.

A SLS

SLS [vG] is a switch-level simulator developed at Delft University of Technology, the Netherlands. It can be used for simulating the logical and timing behavior of digital MOS circuits at three levels:

• first level: purely logic, without considering the actual circuit parameters;

• second level: logic simulation based on actual circuit parameters (transistor dimensions and interconnection resistances and capacitances);

• third level: same as the second level, but with timing. To find the delay times, the simulator uses approximating piecewise-linear voltage waveforms that are found by performing RC-constant calculations.

One important feature of the SLS simulator is the possibility of performing mixed-level simulations for transistor-level, gate-level and function-level circuits. Therefore, a network may consist of some functional blocks that will be functionally simulated. A functional block is a software model of a digital design. It manipulates its inputs and produces outputs. Function blocks are described mainly in the C programming language.

A separate file, the command file, provides the simulation control commands. A possible command file for the full-adder example that explores all possible combinations at its inputs is shown in Figure A(a). With the dissipation command, the simulator is able to obtain the total dynamic dissipation if the circuit is described at the transistor level. It may also provide the average dissipation for different time intervals. In our example the power dissipation can be obtained since we are simulating the layout of Figure A(b).

The total dissipation is obtained by adding the values $C_i V_{step_i}^{2}$ for all node transitions in the circuit, where $C_i$ is the capacitance of a node $i$ where a voltage change occurs and $V_{step_i}$ is the value of the voltage change at that node $i$. The total dissipation is divided by the simulation time.
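The computation just described can be written in a few lines. The sketch below is not SLS code; it only mirrors the formula, with hypothetical transition data in which the third entry is a partial voltage swing of the kind produced by a glitch.

```python
def average_dissipation(transitions, sim_time):
    """Average dissipation in watts.

    transitions -- iterable of (capacitance in farads, voltage step in volts),
                   one entry per node transition observed during simulation
    sim_time    -- simulated time in seconds
    """
    total = sum(c * v_step ** 2 for c, v_step in transitions)
    return total / sim_time


# Hypothetical transitions: two full 5 V swings and one 2.5 V glitch.
TRANSITIONS = [(10e-15, 5.0), (20e-15, 5.0), (10e-15, 2.5)]
power = average_dissipation(TRANSITIONS, sim_time=100e-9)
```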

Simulation with the delay model (the level with timing) yields more accurate dissipation values than simulation without a delay model (the two purely logic levels). With timing, even the size of transient effects (glitches) is taken into account to compute the dissipation. At the two logic levels glitches are also considered to compute the dissipation, but they may be much less realistic. In this work we have always simulated the circuits at level

The simulation outputs can be visualized either in graphic or text form. Figure A(b) shows the graphical results of the simulation of the full-adder circuit using the command file in Figure A(a).

A SIS

SIS [SSL+] is an interactive tool for synthesis and optimization of sequential circuits, developed at the University of California, Berkeley. Although it focuses on sequential systems, it also implements techniques to handle the combinational blocks of these systems.

The SIS package has several features:

• state-transition graph (STG) manipulation: state minimization, state assignment, STG extraction;

• combinational optimization: node simplification, kernel and cube extraction, test pattern generation, technology mapping, restructuring for performance, delay analysis;

• sequential optimization: retiming and resynthesis, technology mapping, state enumeration;

• asynchronous synthesis: hazard-free synthesis with unbounded gate delay, synthesis with bounded gate delay, removing hazards with bounded wire delay, ensuring complete state coding and persistency, etc.

that have been built on top of the MIS-II [BRSVW] system.

(a)

    LSILOGIC u V
    GATE nd O=!(a*b);
    PIN a INV
    PIN b INV
    GATE eo O=a*!b+!a*b;
    PIN a UNKNOWN
    PIN b UNKNOWN

(b)

    .model full_adder
    .inputs X Y CIN
    .outputs S COUT
    .gate eo a=X b=Y O=V
    .gate eo a=V b=CIN O=S
    .gate nd a=X b=Y O=V
    .gate nd a=V b=CIN O=V
    .gate nd a=V b=V O=COUT
    .end

Figure A: (a) Library and (b) circuit description in SIS.

In this work we have mainly used the delay analysis and technology mapping features for combinational circuits. To perform the technology mapping, a description of the targeted gate library has to be provided to SIS. The genlib format is used for this purpose. For example, Figure A(a) shows the description of a two-input NAND gate and a two-input XOR gate of the LSI LOGIC library [LSI]. The logic function, area and delay information is provided.

The blif format is used to describe a logic-level hierarchical circuit. This format is different from the one used with the Ocean package. The same description of the full adder in blif format is found in Figure A(b).

Once the SIS system has read both descriptions (library and circuit), delay analysis may be performed to obtain, among other information, the critical time of the circuit.
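For a combinational circuit, the critical time is the longest accumulated gate delay from any primary input to any output. The sketch below computes it for the full adder with a memoized longest-path traversal. It does not use SIS or its delay models; the gate delays, the net names `n1` to `n3` and the function name `critical_path` are illustrative assumptions.

```python
def critical_path(gates, delays, primary_inputs):
    """Longest-delay path of a combinational netlist.

    gates  -- dict: output net -> (gate type, list of input nets)
    delays -- dict: gate type -> delay (any consistent time unit)
    Returns (critical time, list of nets from a primary input to the end net).
    """
    arrival = {net: 0.0 for net in primary_inputs}
    worst_input = {}

    def arrive(net):
        # Arrival time of a net = latest input arrival + delay of its gate.
        if net not in arrival:
            gate_type, inputs = gates[net]
            latest = max(inputs, key=arrive)
            worst_input[net] = latest
            arrival[net] = arrive(latest) + delays[gate_type]
        return arrival[net]

    end = max(gates, key=arrive)
    path = [end]
    while path[-1] in worst_input:
        path.append(worst_input[path[-1]])
    return arrival[end], path[::-1]


# Full adder built from two XOR and three NAND gates (hypothetical delays).
FULL_ADDER = {
    "n1": ("xor", ["X", "Y"]),
    "S": ("xor", ["n1", "CIN"]),
    "n2": ("nand", ["X", "Y"]),
    "n3": ("nand", ["n1", "CIN"]),
    "COUT": ("nand", ["n2", "n3"]),
}
DELAYS = {"xor": 2.0, "nand": 1.5}

critical_time, nets = critical_path(FULL_ADDER, DELAYS, ["X", "Y", "CIN"])
```

With these assumed delays the carry output, not the sum, ends the critical path, because it sits behind one XOR and two NAND gates.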


REFERENCES

[AK] J.R. Allen and K. Kennedy. Automatic loop interchange. In Proc. of the SIGPLAN Symp. on Compiler Construction, pages.

[AMD+] M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou. Precomputation-based sequential logic optimization for low power. IEEE Transactions on VLSI Systems, December.

[ASU] A.V. Aho, R. Sethi, and J.D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, Massachusetts.

[BAF] J. Bunda, W.C. Athas, and D. Fussell. Evaluating power implications of CMOS microprocessor design decisions. In Proc. International Workshop on Low Power Design, pages, April.

[BB] T.D. Burd and R.W. Brodersen. Energy efficient CMOS microprocessor design. In Proc. th Annual Hawaii International Conf. on System Sciences, January.

[BCH+] R.I. Bahar, H. Cho, G.D. Hachtel, E. Macii, and F. Somenzi. A symbolic method to reduce power consumption of circuits containing false paths. In Proc. of the IEEE/ACM International Conference on Computer Aided Design, pages, November.

[BCS] R. Brodersen, A. Chandrakasan, and S. Sheng. Low-power signal processing systems. In Proc. of the IEEE VLSI Signal Processing Workshop.

[BE] A. Bellaouar and M.I. Elmasry. Low-power digital VLSI design: circuits and systems. Kluwer Academic Publishers.

[BFR] L. Benini, M. Favalli, and B. Ricco. Analysis of hazard contributions to power dissipation in CMOS ICs. In Proc. International Workshop on Low Power Design, pages, April.

[BGS] D.F. Bacon, S.L. Graham, and O.L. Sharp. Compiler techniques for high-performance computing. Technical Report UCB/CSD, Computer Science Division (EECS), UCB, November.

[BIO] M. Borah, M.J. Irwin, and R.M. Owens. Minimizing power consumption of static CMOS circuits by transistor sizing and input reordering. In Proc. of the International Conference on VLSI Design, pages, January.

Highlevel and logic synthesis techniques for low power

[BM] R. Brayton and C. McMullen. The decomposition and factorization of boolean expressions. In Proc. International Symposium on Circuits and Systems, pages.

[BM] L. Benini and G. De Micheli. Transformation and synthesis of FSMs for low-power gated-clock implementation. In Int. Symposium on Low Power Design, pages, April.

[BOI] M. Borah, R.M. Owens, and M.J. Irwin. High-throughput and low-power DSP using clocked-CMOS circuitry. In Int. Symposium on Low Power Design, pages, April.

[BRSVW] R.K. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A.R. Wang. MIS: a multiple-level logic optimization system. IEEE Transactions on Computer-Aided Design, November.

[Bry] R.E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE Transactions on Computers, C, August.

[BS] C.W. Brown and B.J. Shepherd. Graphics File Formats: reference and guide. Prentice-Hall.

[BSdM] L. Benini, P. Siegel, and G. de Micheli. Saving power by synthesizing gated clocks for sequential circuits. IEEE Design & Test of Computers, pages, winter.

[BW] C.R. Baugh and B.A. Wooley. A two's complement parallel array multiplication algorithm. IEEE Transactions on Computers, C, December.

[CB] A.P. Chandrakasan and R.W. Brodersen. Low Power Digital CMOS Design. Kluwer Academic Publishers.

[CC] B.S. Carlson and C.Y.R. Chen. Performance enhancement of CMOS VLSI circuits by transistor reordering. In Proceedings of the th Design Automation Conference, pages.

[CLR] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. McGraw-Hill.

[CP] J.M. Chang and M. Pedram. Register allocation and binding for low power. In Proceedings of the nd Design Automation Conference, pages.

[CPM+] A.P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, and R.W. Brodersen. Optimizing power using transformations. IEEE Transactions on Computer-Aided Design, January.

[CPRB] A. Chandrakasan, M. Potkonjak, J. Rabaey, and R.W. Brodersen. HYPER-LP: A system for power minimization using architectural transformations. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages, November.


[CR] A. Chatterjee and R.K. Roy. Synthesis of low power linear DSP circuits using activity metrics. In Proc. of the International Conference on VLSI Design, pages, January.

[CRP] T.L. Chou, K. Roy, and S. Prasad. Estimation of circuit activity considering signal correlations and simultaneous switching. In Proc. of the IEEE/ACM International Conference on Computer Aided Design, pages.

[CS] T.K. Callaway and E.R. Swartzlander. Estimating the power consumption of CMOS adders. In Proc. of the Custom Integrated Circuit Conference, pages.

[CSB] A. Chandrakasan, S. Sheng, and R.W. Brodersen. Design considerations for a future multimedia terminal. In WINLAB Workshop, October.

[CSBa] A. Chandrakasan, S. Sheng, and R. Brodersen. Low power CMOS digital design. IEEE Transactions on Solid-State Circuits, April.

[CSBb] A. Chandrakasan, S. Sheng, and R. Brodersen. Low-power techniques for portable real-time DSP applications. VLSI Design, pages.

[CSB] A.P. Chandrakasan, M.B. Srivastava, and R.W. Brodersen. Energy efficient programmable computation. In Proc. of the International Conference on VLSI Design, pages, January.

[CW] K.Y. Chao and D.F. Wong. Low power considerations in floorplan design. In Proc. International Workshop on Low Power Design, pages, April.

[DDN] P. Dewilde, E. Deprettere, and R. Nouta. Parallel and pipelined VLSI implementation of signal processing algorithms, chapter, pages. VLSI and Modern Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.

[DK] A. Dasgupta and R. Karri. Simultaneous scheduling and binding for power minimization during microarchitectural synthesis. In Int. Symposium on Low Power Design, pages, April.

[DMP] R. Marculescu, D. Marculescu, and M. Pedram. Information theoretic measures of energy consumption at register transfer level. In Int. Symposium on Low Power Design, pages, April.

[ea] D.W. Dobberpuhl et al. A MHz b dual-issue CMOS microprocessor. IEEE Journal of Solid-State Circuits, November.

[eaa] G. Gerosa et al. A W MHz superscalar RISC microprocessor. IEEE Journal of Solid-State Circuits, December.


[eab] R. Bechade et al. A b MHz microprocessor. In International Solid State Circuits Conference, pages, February.

[eac] T. Indermaur et al. Evaluation of charge recovery circuits and adiabatic switching for low power CMOS design. In Int. Symposium on Low Power Electronics, October.

[eaa] D. Bearden et al. A MHz b four-issue CMOS microprocessor. In International Solid State Circuits Conference, pages, February.

[eab] W.J. Bowhill et al. A MHz b quad-issue CMOS RISC microprocessor. In International Solid State Circuits Conference, pages, February.

[EL] M.D. Ercegovac and T. Lang. Reducing transition counts in arithmetic circuits. In Int. Symposium on Low Power Electronics, pages, October.

[EL] M.D. Ercegovac and T. Lang. Low-power accumulator (correlator). In Int. Symposium on Low Power Electronics, pages, October.

[FFK] M. Fujita, M. Fujisawa, and N. Kawato. Evaluation and improvements of boolean comparison method based on binary decision diagrams. In Proc. of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages, November.

[GBJ] A.L. Glebov, D. Blaauw, and L.G. Jones. Transistor reordering for low power CMOS gates using an SP-BDD representation. In Int. Symposium on Low Power Design, pages, April.

[GDWL] D. Gajski, N. Dutt, A. Wu, and S. Lin. High-level synthesis: introduction to Chip and System Design. Kluwer Academic Publishers.

[GIG+] S. Gary, P. Ippolito, G. Gerosa, C. Dietz, J. Eno, and H. Sanchez. PowerPC(TM): a microprocessor for portable computers. IEEE Design & Test of Computers, pages, winter.

[GR] D.D. Gajski and L. Ramachandran. Introduction to high-level synthesis. IEEE Design & Test of Computers, winter.

[GS] P. Groeneveld and P. Stravers. Ocean: The Sea-of-Gates design system. Technical report, Delft University of Technology.

[Gwea] L. Gwennap. ARM cuts power, increases performance. Microprocessor Report, November.

[Gweb] L. Gwennap. Hobbit enables personal communications. Microprocessor Report, October.

[Hay] J.P. Hayes. Computer architecture and organization.


[HdlR] G.D. Hachtel and M. Hermida de la Rica. Re-encoding sequential circuits to reduce power dissipation. In Proc. International Workshop on Low Power Design, pages, April.

[HvBPS] Jaco Haans, Kees van Berkel, Ad Peeters, and Frits Schalij. Asynchronous multipliers as combinational handshake circuits. In S. Furber and M. Edwards, editors, Asynchronous Design Methodologies, volume A of IFIP Transactions, pages. Elsevier Science Publishers.

[HZA] R. Hossain, M. Zheng, and A. Albicki. Reducing power dissipation in serially connected MOSFET circuits via transistor reordering. In Proc. of the IEEE International Conference on Computer Design, pages, October.

[Ike] T. Ikeda. Thinkpad low-power evolution. In Int. Symposium on Low Power Electronics, pages, October.

[IP] S. Iman and M. Pedram. Multi-level network optimization for low power. In Proc. of the IEEE/ACM International Conference on Computer Aided Design, pages, November.

[IP] S. Iman and M. Pedram. Logic extraction and factorization for low power. In Proceedings of the th Design Automation Conference, pages.

[JQN+] B. Johnson, T. Quarles, A.R. Newton, D.O. Pederson, and A. Sangiovanni-Vincentelli. SPICE Version e User's Manual. Dept. of Electrical Engineering and Computer Science, Univ. of California, Berkeley, April.

[KaKS] H. Kojima, S. Tanaka, and K. Sasaki. Half-swing clocking scheme for power saving in clocking circuitry. In Symp. on VLSI Circuits, pages, June.

[Kap] B. Kapoor. Improving the accuracy of circuit activity measurement. In Proceedings of the st Design Automation Conference, pages.

[KBL] U. Ko, P.T. Balsara, and W. Lee. Low-power design techniques for high-performance CMOS adders. IEEE Transactions on VLSI Systems, June.

[KBN] U. Ko, P.T. Balsara, and A.K. Nanda. Energy optimization of multi-level processor cache architectures. In Int. Symposium on Low Power Design, pages, April.

[Keu] K. Keutzer. The impact of CAD on the design of low power digital circuits. In Int. Symposium on Low Power Electronics, pages, October.


KGG DJ Kinniment JD Garside and B Gao A comparison of p ower

consumption in some CMOS adder circuit In Proc of the Int Work

shop on Power and Timing Modeling Optimization and Simulation

PATMOS pages

KKRV N Kumar S Katkoori L Rader and R Vemuri Proledriven b e

havioral synthesis for lowpower VLSI systems IEEE Design Test

of Computers pages fall

Kor I Koren Computer Arithmetic Algorithms PrenticeHall

Kun SY Kung On sup ercomputing with systolicwavefront array pro ces

sor In Proc of the IEEE pages July

Lan P Landman Lowpower architectural design methodologies PhD the

sis College of Engineering University of California at Berkeley Au

gust UCBERL M

Lim JS Lim TwoDimentional Signal and Image Processing Signal Pro

cessing Series PrenticeHall

LK C Lin and S Kwatra An adaptive algorithm for motion comp ensated

colour image co ding IEEE Globecom

LKB W Lee U Ko and PT Balsara A comparative study on CMOS dig

ital circuit families for lowpower applications In Proc International

Workshop on Low Power Design pages April

LM B Lin and H De Man Lowpower driven technology mapping under

timing constraints In Proc of the IEEE International Conference on

Computer Design

LRa P.E. Landman and J.M. Rabaey. Black-box capacitance models for
architectural power analysis. In Proc. International Workshop on Low
Power Design, pages, April.

LRb D.B. Lidsky and J.M. Rabaey. Low-power design of memory intensive
functions. In Int. Symposium on Low Power Electronics, pages, October.

LR P.E. Landman and J.M. Rabaey. Activity-sensitive architectural
power analysis for the control path. In Int. Symposium on Low Power
Design, pages, April.

LS C.E. Leiserson and J.B. Saxe. Optimizing synchronous circuitry by
retiming. In Third Caltech Conf. on VLSI.

LS C. Lemmonds and S.S.M. Shetti. A low power by multiplier using
transition reduction circuitry. In Proc. International Workshop on
Low Power Design, pages, April.

LSI LSI LOGIC. LCAK Preliminary Design Manual, November.

LvMJ J. Leijten, J. van Meerbergen, and J. Jess. Analysis and reduction of
glitches in synchronous networks. In Proc. European Conference on
Design Automation (EDAC), March.

Mar R.S. Martin. Optimizing power consumption, area and delay in
behavioral synthesis. PhD thesis, Department of Electronics, Faculty of
Engineering, Carleton University, March.

Mat A. Matsuzawa. Low-power portable design. In Proc. International
Symposium on Advanced Research in Asynchronous Circuits and Systems,
March. Invited lecture.

MCa E. Musoll and J. Cortadella. High-level synthesis techniques for
reducing the activity of functional units. In Int. Symposium on Low
Power Design, pages, April.

MCb E. Musoll and J. Cortadella. Low-power array multipliers with
transition-retaining barriers. In Proc. of the Int. Workshop on Power
and Timing Modeling, Optimization and Simulation (PATMOS), pages,
October.

MCc E. Musoll and J. Cortadella. Scheduling and resource binding for
low power. In Int. Symposium on System Synthesis, pages, September.

MC E. Musoll and J. Cortadella. Optimizing CMOS circuits for low power
using transistor reordering. In Proc. European Design and Test
Conference (EDAC-ETC-EuroASIC), pages, March.

MDG J. Monteiro, S. Devadas, and A. Ghosh. Retiming sequential circuits
for low power. In Proc. of the IEEE International Conference on
Computer Design, pages, November.

Mei J.D. Meindl. Low-power microelectronics: retrospect and prospect.
Proceedings of the IEEE, April.

MIP MIPS. MIPS Press release.

MK R.S. Martin and J.P. Knight. PowerProfiler: Optimizing ASICs power
consumption at the behavioral level. In Proceedings of the nd Design
Automation Conference.

MSBSV S. Malik, E.M. Sentovich, R.K. Brayton, and A. Sangiovanni-Vincentelli.
Retiming and resynthesis: optimizing sequential networks with
combinational techniques. IEEE Transactions on Computer-Aided Design,
January.

MT V.G. Moshnyaga and K. Tamaru. A comparative study of switching
activity reduction techniques for the design of low-power multipliers.
In Proc. International Symposium on Circuits and Systems, pages, April.

MWBSV S. Malik, A. Wang, R. Brayton, and A. Sangiovanni-Vincentelli. Logic
verification using binary decision diagrams in a logic synthesis
environment. In Proc. of the IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), pages, November.

Naj F.N. Najm. Transition density, a stochastic measure of activity in
digital circuits. In Proceedings of the th Design Automation Conference,
pages.

Naj F.N. Najm. Towards a high-level power estimation capability. In Int.
Symposium on Low Power Design, pages, April.

NMO C. Nagendra, U.K. Mehta, and R.M. Owens. A comparison of the
power-delay characteristics of CMOS adders. In Proc. International
Workshop on Low Power Design, pages, April.

NS L.S. Nielsen and J. Sparsø. Low-power operation using self-timed
circuits and adaptive scaling of the supply voltage. In Proc.
International Workshop on Low Power Design, pages, April.

OF M.A. Ortega and J. Figueras. Extra power consumed in static CMOS
circuits due to unnecessary logic transitions. In IV Int. Design
Automation Workshop (Russian Workshop), June.

OY P.W. Ong and R.H. Yan. Power-conscious software design: a framework
for modeling software on hardware. In Int. Symposium on Low Power
Electronics, pages, October.

PC S.R. Powell and P.M. Chau. A model for estimating power dissipation
in a class of DSP VLSI chips. IEEE Transactions on Circuits and
Systems, June.

Pez S.D. Pezaris. A ns bit by bit array multiplier. IEEE Transactions
on Computers, C, April.

PM K.P. Parker and E.J. McCluskey. Probabilistic treatment of general
combinational networks. IEEE Transactions on Computers, C, June.

Pow R.A. Powers. Batteries for low power electronics. Proceedings of the
IEEE, April.

PR M. Potkonjak and J. Rabaey. Optimizing the resource utilization using
transformations. In Proc. of the IEEE/ACM International Conference
on Computer Aided Design, pages, November.

PR S.C. Prasad and K. Roy. Circuit optimization for minimization of
power consumption under delay constraint. In Proc. International
Workshop on Low Power Design, pages, April.

PR R. Panwar and D. Rennels. Reducing the frequency of tag compares
for low power I-cache design. In Int. Symposium on Low Power Design,
pages, April.

PTVF W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery.
Numerical Recipes in C: The Art of Scientific Computing. Cambridge
University Press, second edition.

Pug W. Pugh. The omega test: a fast and practical integer programming
algorithm for dependence analysis. In Proc. Supercomputing.

Rep Special Report. The new contenders. IEEE Spectrum, pages, December.

RJ A. Raghunathan and N.K. Jha. Behavioral synthesis for low power.
In Proc. of the IEEE International Conference on Computer Design,
pages, October.

RJ A. Raghunathan and N.K. Jha. An iterative improvement algorithm
for low power data path synthesis. In Proc. of the IEEE/ACM
International Conference on Computer Aided Design.

RP K. Roy and S.C. Prasad. Circuit activity based logic synthesis for
low power reliable operations. IEEE Transactions on VLSI Systems,
December.

RP J.M. Rabaey and M. Pedram, editors. Low power design methodologies.
Kluwer Academic Publishers.

RY K.R. Rao and P. Yip. Discrete Cosine Transform. Academic Press.

SA M. Song and K. Asada. Design methodology for low power data
compressors based on a window detector in a bit multiplier. In Proc.
International Symposium on Circuits and Systems, pages, April.

Sam H. Samueli. An improved search algorithm for the design of
multiplierless FIR filters with power-of-two coefficients. IEEE
Transactions on Circuits and Systems, July.

SBS A.J. Stratakos, R.W. Brodersen, and S.R. Sanders. High-efficiency
low-voltage DC-DC conversion for portable applications. In Proc.
International Workshop on Low Power Design, pages, April.

SFCM H. Samsom, F. Franssen, F. Catthoor, and H. De Man. System level
verification of video and image processing specifications. In Int.
Symposium on System Synthesis, pages, September.

SGDK A. Shen, A. Ghosh, S. Devadas, and K. Keutzer. On average power
dissipation and random pattern testability of CMOS combinational
logic networks. In Proc. of the IEEE/ACM International Conference
on Computer Aided Design.

SK L.J. Svensson and J.G. Koller. Adiabatic charging without inductors.
In Proc. International Workshop on Low Power Design, pages, April.

SLB T. Sakuta, W. Lee, and P.T. Balsara. Delay balanced multipliers for
low power/low voltage DSP core. In Int. Symposium on Low Power
Electronics, pages, October.

SLW W.Z. Shen, J.Y. Lin, and F.W. Wang. Transistor reordering rules for
power reduction in CMOS gates. In ASP-DAC, July.

Sma C. Small. Shrinking devices put the squeeze on system packaging.
EDN, February.

SNDD G.G. Shahidi, T.H. Ning, R.H. Dennard, and B. Davari. SOI for
low-voltage and high-speed CMOS. In Int. Conf. on Solid-State Devices
and Materials, pages.

SNK Y. Susuki, K. Nogami, and M. Kakumu. Clocked CMOS calculator
circuitry. IEEE Journal of Solid-State Circuits, SC, December.

SNNS J. Sparsø, C.D. Nielsen, L.S. Nielsen, and J. Staunstrup. Design of
self-timed multipliers: A comparison. In S. Furber and M. Edwards,
editors, Asynchronous Design Methodologies, volume A of IFIP
Transactions, pages. Elsevier Science Publishers.

SR G.E. Sobelman and D.L. Raatz. Low-power multiplier design using
delayed evaluation. In Proc. International Symposium on Circuits and
Systems, pages, April.

SSL+ E.M. Sentovich, K.J. Singh, L. Lavagno, C. Moon, R. Murgai,
A. Saldanha, H. Savoj, P.R. Stephan, R.K. Brayton, and
A. Sangiovanni-Vincentelli. SIS: A system for sequential circuit
synthesis. Technical Report Memorandum No. UCB/ERL M, University of
California, Berkeley.

STD C.L. Su, C.Y. Tsui, and A.M. Despain. Saving power in the control
path of embedded processors. IEEE Design & Test of Computers,
pages, winter.

TA C.H. Tan and J. Allen. Minimization of power in VLSI circuits using
transistor sizing, input ordering, and statistical power estimation. In
Proc. International Workshop on Low Power Design, pages, April.

TAM V. Tiwari, P. Ashar, and S. Malik. Technology mapping for low power.
In Proceedings of the th Design Automation Conference, pages.

Tiw G. Tiwary. Below the half-micron mark. IEEE Spectrum, pages,
November.

TJL J.R. Treichler, C.R. Johnson Jr., and M.G. Larimore. Theory and
Design of Adaptive Filters. New York: John Wiley & Sons.

TMA V. Tiwari, S. Malik, and P. Ashar. Guarded evaluation: pushing
power management to logic synthesis/design. In Int. Symposium on
Low Power Design, pages, April.

TMW V. Tiwari, S. Malik, and A. Wolfe. Power analysis of embedded
software: a first step towards software power minimization. IEEE
Transactions on VLSI Systems, December.

TPCD C.Y. Tsui, M. Pedram, C.A. Chen, and A.M. Despain. Low power
state assignment targeting two- and multi-level logic implementations.
In Proc. of the IEEE/ACM International Conference on Computer
Aided Design, November.

TPD C.Y. Tsui, M. Pedram, and A.M. Despain. Technology decomposition
and mapping targeting low power dissipation. In Proceedings of the
th Design Automation Conference, pages, June.

TPD C.Y. Tsui, M. Pedram, and A.M. Despain. Power efficient technology
decomposition and mapping under extended power consumption model.
IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, September.

TRDH I. Turlik, A. Reisman, R. Darveaux, and L.T. Hwang. Multichip
packaging for supercomputers. In Proc. of the NEPCON West, pages,
March.

vBBK+ K. van Berkel, R. Burgess, J. Kessels, M. Roncken, F. Schalij, and
A. Peeters. Asynchronous circuits for low power: a DCC error
corrector. IEEE Design & Test of Computers, pages, summer.

vBN C.H. van Berkel and C. Niessen. An apparatus featuring a feedback
signal for controlling a powering voltage for asynchronous electronic
circuitry therein. European Patent Application, June.

Vee H.J.M. Veendrick. Short-circuit dissipation of static CMOS circuitry
and its impact on the design of buffer circuits. IEEE Journal of
Solid-State Circuits, pages, August.

vG A.J. van Genderen. SLS: An efficient switch-level timing simulator
using min-max voltage waveforms. In Proc. VLSI Conf., pages, August.

VP H. Vaishnav and M. Pedram. PCUBE: A performance driven placement
algorithm for low power design. In Proc. European Design Automation
Conference (EURO-DAC), pages.

War C.A. Warwick. Trends and limits in the talk time of personal
communicators. Proceedings of the IEEE, April.

WCF+ S. Wuytack, F. Catthoor, F. Franssen, L. Nachtergaele, and H. De
Man. Global communications and memory optimizing transformations
for low power. In Proc. International Workshop on Low Power Design,
pages, April.

WE N. Weste and K. Eshraghian. Principles of CMOS VLSI Design: A Systems
Perspective. Addison-Wesley, 2nd edition.

Wol A. Wolfe. A case study in low-power system level design. In Proc.
of the IEEE International Conference on Computer Design, October.

Yan S. Yang. Logic synthesis and optimization benchmarks user guide.
Technical report, Microelectronics Center of North Carolina, Research
Triangle Park, NC, January.

YSS+ N.K. Yeung, Y.H. Sutu, T.Y.F. Su, E.T. Pak, C.C. Chao, S. Akki,
D.D. Yau, and R. Lodenquai. The design of a SPECint RISC
processor under w. In International Solid State Circuits Conference,
pages, February.

YYN+ K. Yano, T. Yamanaka, T. Nishida, M. Saito, K. Shimohigashi, and
A. Shimizu. A ns CMOS xb multiplier using complementary
pass-transistor logic. IEEE Journal of Solid-State Circuits, April.

ZA M. Zheng and A. Albicki. Low power and high speed multiplication
design through mixed number representations. In Proc. of the IEEE
International Conference on Computer Design, October.