Power Reduction Techniques For Microprocessor Systems

VASANTH VENKATACHALAM AND MICHAEL FRANZ

University of California, Irvine

Power consumption is a major factor that limits the performance of computers. We survey the “state of the art” in techniques that reduce the total power consumed by a microprocessor system over time. These techniques are applied at various levels ranging from circuits to architectures, architectures to system software, and system software to applications. They also include holistic approaches that will become more important over the next decade. We conclude that power management is a multifaceted discipline that is continually expanding with new techniques being developed at every level. These techniques may eventually allow computers to break through the “power wall” and achieve unprecedented levels of performance, versatility, and reliability. Yet it remains too early to tell which techniques will ultimately solve the power problem.

Categories and Subject Descriptors: C.5.3 [Computer System Implementation]: Microcomputers; D.2.10 [Software Engineering]: Design—Methodologies; I.m [Computing Methodologies]: Miscellaneous

General Terms: Algorithms, Design, Experimentation, Management, Measurement, Performance

Additional Key Words and Phrases: Energy dissipation, power reduction

1. INTRODUCTION

Computer scientists have always tried to improve the performance of computers. But although today’s computers are much faster and far more versatile than their predecessors, they also consume a lot of power; so much power, in fact, that their power densities and concomitant heat generation are rapidly approaching levels comparable to nuclear reactors (Figure 1). These high power densities impair chip reliability and life expectancy, increase cooling costs, and, for large data centers, even raise environmental concerns.

Parts of this effort have been sponsored by the National Science Foundation under ITR grant CCR-0205712 and by the Office of Naval Research under grant N00014-01-1-0854. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and should not be interpreted as necessarily representing the official views, policies or endorsements, either expressed or implied, of the National Science Foundation (NSF), the Office of Naval Research (ONR), or any other agency of the U.S. Government. The authors also gratefully acknowledge gifts from Intel, Microsoft Research, and Sun Microsystems that partially supported this work.

Authors’ addresses: Vasanth Venkatachalam, School of Information and Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: [email protected]; Michael Franz, School of Information and Computer Science, University of California at Irvine, Irvine, CA 92697-3425; email: [email protected].


Fig. 1. Power densities rising. Figure adapted from Pollack 1999.

At the other end of the performance spectrum, power issues also pose problems for smaller mobile devices with limited battery capacities. Although one could give these devices faster processors and larger memories, this would diminish their battery life even further. Without cost-effective solutions to the power problem, improvements in microprocessor technology will eventually reach a standstill.

Power management is a multidisciplinary field that involves many aspects (i.e., energy, temperature, reliability), each of which is complex enough to merit a survey of its own. The focus of our survey will be on techniques that reduce the total power consumed by typical microprocessor systems.

We will follow the high-level taxonomy illustrated in Figure 2. First, we will define power and energy and explain the complex parameters that dynamic and static power depend on (Section 2). Next, we will introduce techniques that reduce power and energy (Section 3), starting with circuit (Section 3.1) and architectural techniques (Sections 3.2, 3.3, and 3.4), and then moving on to two techniques that are widely applied in hardware and software, dynamic voltage scaling (Section 3.5) and resource hibernation (Section 3.6). Third, we will examine what compilers can do to manage power (Section 3.7). We will then discuss recent work in application-level power management (Section 3.8) and recent efforts (Section 3.9) to develop a holistic solution to the power problem. Finally, we will discuss some commercial power management systems (Section 3.10) and provide a glimpse into some more radical technologies that are emerging (Section 3.11).

2. DEFINING POWER

Power and energy are commonly defined in terms of the work that a system performs. Energy is the total amount of work a system performs over a period of time, while power is the rate at which the system performs that work. In formal terms,

P = W/T,   (1)
E = P · T,   (2)

where P is power, E is energy, T is a specific time interval, and W is the total work performed in that interval. Energy is measured in joules, while power is measured in watts.

These concepts of work, power, and energy are used differently in different contexts. In the context of computers, work involves activities associated with running programs (e.g., addition, subtraction, memory operations), power is the rate at which the computer consumes electrical energy (or dissipates it in the form of heat) while performing these activities, and energy is the total electrical energy the computer consumes (or dissipates as heat) over time.

This distinction between power and energy is important because techniques that reduce power do not necessarily reduce energy. For example, the power consumed by a computer can be reduced by halving the clock frequency, but if the computer then takes twice as long to run the same programs, the total energy consumed will be similar. Whether one should reduce power or energy depends on the context. In mobile applications, reducing energy is often more important because it increases the battery lifetime. However, for other systems (e.g., servers), temperature is a larger issue. To keep the temperature within acceptable limits, one would need to reduce instantaneous power regardless of the impact on total energy.
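To make the distinction concrete, the following short Python sketch (an illustration added here; the task and power numbers are invented) applies Equations (1) and (2) to the frequency-halving example above, under the simplifying assumption that power scales linearly with clock frequency.

    # Sketch: power vs. energy for a fixed amount of work (Equations (1) and (2)).
    # Assumes power scales linearly with clock frequency; all numbers are invented.

    def energy(power_watts, time_seconds):
        # E = P * T  (Equation (2))
        return power_watts * time_seconds

    work_cycles = 1e9          # total work: 10^9 cycles (hypothetical task)
    full_speed_hz = 1e9        # 1 GHz
    full_power_w = 10.0        # 10 W at full speed (hypothetical)

    # Case 1: full clock frequency.
    t_full = work_cycles / full_speed_hz            # 1 second
    e_full = energy(full_power_w, t_full)           # 10 J

    # Case 2: clock frequency halved; power drops linearly, time doubles.
    t_half = work_cycles / (full_speed_hz / 2)      # 2 seconds
    e_half = energy(full_power_w / 2, t_half)       # still 10 J

    print(f"full speed: P = {full_power_w} W, T = {t_full} s, E = {e_full} J")
    print(f"half speed: P = {full_power_w/2} W, T = {t_half} s, E = {e_half} J")

Power is halved but the energy for the job is unchanged; lowering the voltage along with the frequency, as in dynamic voltage scaling (Section 3.5), is what turns a power reduction into an energy reduction.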


Fig. 2. Organization of this survey.

2.1. Dynamic Power Consumption

There are two forms of power consumption, dynamic power consumption and static power consumption. Dynamic power consumption arises from circuit activity such as the changes of inputs in an adder or values in a register. It has two sources, switched capacitance and short-circuit current.

Switched capacitance is the primary source of dynamic power consumption and arises from the charging and discharging of capacitors at the outputs of circuits.

Short-circuit current is a secondary source of dynamic power consumption and accounts for only 10-15% of the total power consumption. It arises because circuits are composed of transistors having opposite polarity, negative or NMOS and positive or PMOS. When these two types of transistors switch current, there is an instant when they are simultaneously on, creating a short circuit. We will not deal further with the power dissipation caused by this short circuit because it is a smaller percentage of total power, and researchers have not found a way to reduce it without sacrificing performance.

As the following equation shows, the more dominant component of dynamic power, switched capacitance (P_dynamic), depends on four parameters, namely the supply voltage (V), clock frequency (f), physical capacitance (C), and an activity factor (a) that relates to how many 0 → 1 or 1 → 0 transitions occur in a chip:

P_dynamic ∼ a C V^2 f.   (3)

Accordingly, there are four ways to reduce dynamic power consumption, though they each have different tradeoffs and not all of them reduce the total energy consumed. The first way is to reduce the physical capacitance or stored electrical charge of a circuit. The physical capacitance depends on low-level design parameters such as transistor sizes and wire lengths. One can reduce the capacitance by reducing transistor sizes, but this worsens performance.

The second way to lower dynamic power is to reduce the switching activity. As computer chips get packed with increasingly complex functionalities, their switching activity increases [De and Borkar 1999], making it more important to develop techniques that fall into this category. One popular technique, clock gating, gates the clock signal from reaching idle functional units. Because the clock network accounts for a large fraction of a chip’s total energy consumption, this is a very effective way of reducing power and energy throughout a processor and is implemented in numerous commercial systems, including the Pentium 4, Pentium M, Intel XScale, and Tensilica Xtensa, to mention but a few.
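As a rough numerical illustration of Equation (3) (a sketch added here; the baseline values are invented and the proportionality constant is dropped), the following Python fragment shows how each of the four parameters contributes, and previews why scaling voltage and frequency together, discussed below, is especially attractive.

    # Sketch: relative effect of each knob in Equation (3), P_dynamic ~ a*C*V^2*f.
    # The constant of proportionality is dropped and all values are invented.

    def p_dynamic(a, c, v, f):
        return a * c * v**2 * f

    base = p_dynamic(a=0.2, c=1e-9, v=1.2, f=2e9)   # hypothetical baseline

    # Halving each parameter in isolation:
    print("halve activity:   ", p_dynamic(0.1, 1e-9,   1.2, 2e9) / base)  # 0.5
    print("halve capacitance:", p_dynamic(0.2, 0.5e-9, 1.2, 2e9) / base)  # 0.5
    print("halve frequency:  ", p_dynamic(0.2, 1e-9,   1.2, 1e9) / base)  # 0.5
    print("halve voltage:    ", p_dynamic(0.2, 1e-9,   0.6, 2e9) / base)  # 0.25

    # Scaling voltage and frequency together (Section 3.5) compounds the savings:
    print("halve V and f:    ", p_dynamic(0.2, 1e-9,   0.6, 1e9) / base)  # 0.125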


The third way to reduce dynamic power consumption is to reduce the clock frequency. But as we have just mentioned, this worsens performance and does not always reduce the total energy consumed. One would use this technique only if the target system does not support voltage scaling and if the goal is to reduce the peak or average power dissipation and indirectly reduce the chip’s temperature.

The fourth way to reduce dynamic power consumption is to reduce the supply voltage. Because reducing the supply voltage increases gate delays, it also requires reducing the clock frequency to allow the circuit to work properly.

The combination of scaling the supply voltage and clock frequency in tandem is called dynamic voltage scaling (DVS). This technique should ideally reduce dynamic power dissipation cubically because dynamic power is quadratic in voltage and linear in clock frequency. This is the most widely adopted technique. A growing number of processors, including the Pentium M, mobile Pentium 4, AMD’s Athlon, and Transmeta’s Crusoe and Efficeon processors, allow software to adjust clock frequencies and voltage settings in tandem. However, DVS has limitations and cannot always be applied, and even when it can be applied, it is nontrivial to apply, as we will see in Section 3.5.

2.2. Understanding Leakage Power Consumption

In addition to consuming dynamic power, computer components consume static power, also known as idle power or leakage. According to the most recently published industrial roadmaps [ITRSRoadMap], leakage power is rapidly becoming the dominant source of power consumption in circuits (Figure 3) and persists whether a computer is active or idle. Because its causes are different from those of dynamic power, dynamic power reduction techniques do not necessarily reduce the leakage power.

Fig. 3. ITRS trends for leakage power dissipation. Figure adapted from Meng et al., 2005.

As the equation that follows illustrates, leakage power consumption is the product of the supply voltage (V) and the leakage current (I_leak), or parasitic current, that flows through transistors even when the transistors are turned off:

P_leak = V I_leak.   (4)

To understand how leakage current arises, one must understand how transistors work. A transistor regulates the flow of current between two terminals called the source and the drain. Between these two terminals is an insulator, called the channel, that resists current. As the voltage at a third terminal, the gate, is increased, electrical charge accumulates in the channel, reducing the channel’s resistance and creating a path along which electricity can flow. Once the gate voltage is high enough, the channel’s polarity changes, allowing the normal flow of current between the source and the drain. The threshold at which the gate’s voltage is high enough for the path to open is called the threshold voltage.

According to this model, a transistor is similar to a water dam. It is supposed to allow current to flow when the gate voltage exceeds the threshold voltage but should otherwise prevent current from flowing. However, transistors are imperfect. They leak current even when the gate voltage is below the threshold voltage.

In fact, there are six different types of current that leak through a transistor. These include reverse-biased-junction leakage, gate-induced-drain leakage, subthreshold leakage, gate-oxide leakage, gate-current leakage, and punch-through leakage. Of these six, subthreshold leakage and gate-oxide leakage dominate the total leakage current.

Gate-oxide leakage flows from the gate of a transistor into the substrate. This type of leakage current depends on the thickness of the oxide material that insulates the gate:

I_ox = K2 W (V / T_ox)^2 e^(−α T_ox / V).   (5)

According to this equation, the gate-oxide leakage I_ox increases exponentially as the thickness T_ox of the gate’s oxide material decreases. This is a problem because future chip designs will require the thickness to be reduced along with other scaled parameters such as transistor length and supply voltage. One way of solving this problem would be to insulate the gate using a high-k dielectric material instead of the oxide materials that are currently used. This solution is likely to emerge over the next decade.

Subthreshold leakage current flows between the drain and source of a transistor. It is the dominant source of leakage and depends on a number of parameters that are related through the following equation:

I_sub = K1 W e^(−V_th / (n T)) (1 − e^(−V / T)).   (6)

In this equation, W is the gate width and K1 and n are constants. The important parameters are the supply voltage V, the threshold voltage V_th, and the temperature T. The subthreshold leakage current I_sub increases exponentially as the threshold voltage V_th decreases. This again raises a problem for future chip designs because, as technology scales, threshold voltages will have to scale along with supply voltages.

The increase in subthreshold leakage current causes another problem. When the leakage current increases, the temperature increases. But as Equation (6) shows, this increases leakage further, causing yet higher temperatures. This vicious cycle is known as thermal runaway. It is the chip designer’s worst nightmare.

How can these problems be solved? Equations (4) and (6) indicate four ways to reduce leakage power. The first way is to reduce the supply voltage. As we will see, supply voltage reduction is a very common technique that has been applied to components throughout a system (e.g., processor, buses, memories).

The second way to reduce leakage power is to reduce the size of a circuit because the total leakage is proportional to the leakage dissipated in all of a circuit’s transistors. One way of doing this is to design a circuit with fewer transistors by omitting redundant hardware and using smaller caches, but this may limit performance and versatility. Another idea is to reduce the effective circuit size dynamically by cutting the power supplies to idle components. Here, too, there are challenges such as how to predict when different components will be idle and how to minimize the overhead of shutting them on or off. This is also a common approach of which we will see examples in the sections to follow.

The third way to reduce leakage power is to cool the computer. Several cooling techniques have been developed since the 1960s. Some blow cold air into the circuit, while others refrigerate the processor [Schmidt and Notohardjono 2002], sometimes even by costly means such as circulating cryogenic fluids like liquid nitrogen [Krane et al. 1988]. These techniques have three advantages. First, they significantly reduce subthreshold leakage. In fact, a recent study [Schmidt and Notohardjono 2002] showed that cooling a memory cell by 50 degrees Celsius reduces the leakage energy by five times. Second, these techniques allow a circuit to work faster because electricity encounters less resistance at lower temperatures. Third, cooling eliminates some negative effects of high temperatures, namely the degradation of a chip’s reliability and life expectancy.


Despite these advantages, there are issues to consider such as the costs of the hardware used to cool the circuit. Moreover, cooling techniques are insufficient if they result in wide temperature variations in different parts of a circuit. Rather, one needs to prevent hotspots by distributing heat evenly throughout a chip.

2.2.1. Reducing Threshold Voltage. The fourth way of reducing leakage power is to increase the threshold voltage. As Equation (6) shows, this reduces the subthreshold leakage exponentially. However, it also reduces the circuit’s performance, as is apparent in the following equation that relates frequency (f), supply voltage (V), and threshold voltage (V_th), and where α is a constant:

f ∝ (V − V_th)^α / V.   (7)

Fig. 4. The transistor stacking effect. The circuits in both (a) and (b) are leaking current between Vdd and Ground. However, Ileak1, the leakage in (a), is less than Ileak2, the leakage in (b). Figure adapted from Butts and Sohi 2000.

One of the less intuitive ways of increasing the threshold voltage is to exploit what is called the stacking effect (refer to Figure 4). When two or more transistors that are switched off are stacked on top of each other (a), they dissipate less leakage than a single transistor that is turned off (b). This is because each transistor in the stack induces a slight reverse bias between the gate and source of the transistor right below it, and this increases the threshold voltage of the bottom transistor, making it more resistant to leakage. As a result, in Figure 4(a), in which all transistors are in the Off position, transistor B leaks less current than transistor A, and transistor C leaks less current than transistor B. Hence, the total leakage current is attenuated as it flows from Vdd to the ground through transistors A, B, and C. This is not the case in the circuit shown in Figure 4(b), which contains only a single off transistor.

Fig. 5. Multiple threshold circuits with sleep transistors.

Another way to increase the threshold voltage is to use Multiple Threshold Circuits With Sleep Transistors (MTCMOS) [Calhoun et al. 2003; Won et al. 2003]. This involves isolating a leaky circuit element by connecting it to a pair of virtual power supplies that are linked to its actual power supplies through sleep transistors (Figure 5). When the circuit is active, the sleep transistors are activated, connecting the circuit to its power supplies. But when the circuit is inactive, the sleep transistors are deactivated, thus disconnecting the circuit from its power supplies. In this inactive state, almost no leakage passes through the circuit because the sleep transistors have high threshold voltages. (Recall that subthreshold leakage drops exponentially with a rise in threshold voltage, according to Equation (6).)
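As a rough illustration of why raising the threshold voltage is so effective, the following Python sketch evaluates the form of Equation (6) with arbitrary placeholder constants (K1, n, W, and the thermal term are invented); only the trend matters.

    # Sketch: exponential sensitivity of subthreshold leakage (Equation (6)) to
    # the threshold voltage. K1, n, W, and the thermal term are arbitrary
    # placeholders chosen only to show the trend.
    import math

    def i_sub(v_supply, v_th, w=1.0, k1=1.0, n=1.5, t=0.026):
        return k1 * w * math.exp(-v_th / (n * t)) * (1.0 - math.exp(-v_supply / t))

    base = i_sub(v_supply=1.0, v_th=0.30)
    for dv in (0.0, 0.05, 0.10, 0.15):
        ratio = i_sub(1.0, 0.30 + dv) / base
        print(f"raise Vth by {dv:.2f} V -> leakage x {ratio:.3f}")

With these constants, raising V_th by 100 mV cuts subthreshold leakage by roughly an order of magnitude, which is the effect that stacking, MTCMOS, dual-threshold circuits, and adaptive body biasing all exploit; Equation (7) gives the corresponding price in circuit speed.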


Fig. 6. Adaptive body biasing.

The MTCMOS technique effectively confines the leakage to one part of the circuit, but it is tricky to implement for several reasons. The sleep transistors must be sized properly to minimize the overhead of activating them. They cannot be turned on and off too frequently. Moreover, this technique does not readily apply to memories because memories lose data when their power supplies are cut.

A third way to increase the threshold voltage is to employ dual threshold circuits. Dual threshold circuits [Liu et al. 2004; Wei et al. 1998; Ho and Hwang 2004] reduce leakage by using high threshold (low leakage) transistors on noncritical paths and low threshold transistors on critical paths, the idea being that noncritical paths can execute instructions more slowly without impairing performance. This is a difficult technique to implement because it requires choosing the right combination of transistors for high threshold voltages. If too many transistors are assigned high threshold voltages, the noncritical paths in the circuit can slow down too much.

A fourth way to increase the threshold voltage is to apply a technique known as adaptive body biasing [Seta et al. 1995; Kobayashi and Sakurai 1994; Kim and Roy 2002]. Adaptive body biasing is a runtime technique that reduces leakage power by dynamically adjusting the threshold voltages of circuits depending on whether the circuits are active. When a circuit is not active, the technique increases its threshold voltage, thus saving leakage power exponentially, although at the expense of a delay in circuit operation. When the circuit is active, the technique decreases the threshold voltage to avoid slowing it down.

To adjust the threshold voltage, adaptive body biasing applies a voltage to the transistor’s body known as a body bias voltage (Figure 6). This voltage changes the polarity of a transistor’s channel, decreasing or increasing its resistance to current flow. When the body bias voltage is chosen to fill the transistor’s channel with positive ions (b), the threshold voltage increases and reduces leakage currents. However, when the voltage is chosen to fill the channel with negative ions, the threshold voltage decreases, allowing higher performance, though at the cost of more leakage.

3. REDUCING POWER

3.1. Circuit And Logic Level Techniques

3.1.1. Transistor Sizing. Transistor sizing [Penzes et al. 2002; Ebergen et al. 2004] reduces the width of transistors to reduce their dynamic power consumption, using low-level models that relate the power consumption to the width. According to these models, reducing the width also increases the transistor’s delay, and thus the transistors that lie away from the critical paths of a circuit are usually the best candidates for this technique. Algorithms for applying this technique usually associate with each transistor a tolerable delay which varies depending on how close that transistor is to the critical path. These algorithms then try to scale each transistor to be as small as possible without violating its tolerable delay.
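The sizing loop these algorithms perform can be sketched as follows (a simplified illustration added here; the linear delay model, step size, and all numbers are invented):

    # Sketch: greedy transistor sizing under per-transistor tolerable delays.
    # The delay/width model and all numbers are invented for illustration.

    transistors = [
        # (name, width, tolerable_delay)  -- widths in arbitrary units
        ("t1", 4.0, 2.0),   # far from the critical path: large slack
        ("t2", 4.0, 1.2),
        ("t3", 4.0, 1.0),   # on the critical path: no slack to exploit
    ]

    def delay(width):
        # Toy model: delay grows as width shrinks.
        return 4.0 / width

    sized = []
    for name, width, tolerable in transistors:
        w = width
        # Shrink in small steps while the delay budget is respected.
        while w > 0.5 and delay(w - 0.5) <= tolerable:
            w -= 0.5
        sized.append((name, w))

    print(sized)   # t1 shrinks the most, t3 keeps its original width

Transistors with the most slack shrink the most, while those on the critical path keep their original width.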


3.1.2. Transistor Reordering. The arrangement of transistors in a circuit affects energy consumption. Figure 7 shows two possible implementations of the same circuit that differ only in their placement of the transistors marked A and B. Suppose that the input to transistor A is 1, the input to transistor B is 1, and the input to transistor C is 0. Then transistors A and B will be on, allowing current from Vdd to flow through them and charge the capacitors C1 and C2. Now suppose that the inputs change and that A’s input becomes 0 and C’s input becomes 1. Then A will be off while B and C will be on. Now the implementations in (a) and (b) will differ in the amounts of switching activity. In (a), current from ground will flow past transistors B and C, discharging both the capacitors C1 and C2. However, in (b), the current from ground will only flow past transistor C; it will not get past transistor A since A is turned off. Thus it will only discharge the capacitor C2, rather than both C1 and C2 as in part (a). Thus the implementation in (b) will consume less power than that in (a).

Fig. 7. Transistor Reordering. Figure adapted from Hossain et al. 1996.

Transistor reordering [Kursun et al. 2004; Sultania et al. 2004] rearranges transistors to minimize their switching activity. One of its guiding principles is to place transistors closer to the circuit’s outputs if they switch frequently, in order to prevent a domino effect where the switching activity from one transistor trickles into many other transistors, causing widespread power dissipation. This requires profiling techniques to determine how frequently different transistors are likely to switch.

3.1.3. Half Frequency and Half Swing Clocks. Half-frequency and half-swing clocks reduce frequency and voltage, respectively. Traditionally, hardware events such as register file writes occur on a rising clock edge. Half-frequency clocks synchronize events using both edges, and they tick at half the speed of regular clocks, thus cutting clock switching power in half. Reduced-swing clocks use a lower voltage signal and thus reduce power quadratically.

3.1.4. Restructuring. There are many ways to build a circuit out of logic gates. One decision that affects power consumption is how to arrange the gates and their input signals.

For example, consider two implementations of a four-input AND gate (Figure 8), a chain implementation (a) and a tree implementation (b). Knowing the signal probabilities (1 or 0) at each of the primary inputs (A, B, C, D), one can easily calculate the transition probabilities (0→1) for each output (W, X, F, Y, Z). If each input has an equal probability of being a 1 or a 0, then the calculation shows that the chain implementation (a) is likely to switch less than the tree implementation (b). This is because each gate in a chain has a lower probability of having a 0→1 transition than its predecessor; its transition probability depends on those of all its predecessors. In the tree implementation, on the other hand, some gates may share a parent (in the tree topology) instead of being directly connected together. These gates could have the same transition probabilities.
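The transition-probability calculation described above can be reproduced in a few lines (an added worked example; it assumes independent, uniformly random inputs as in the text):

    # Sketch: 0->1 transition probabilities for the chain vs. tree AND gates of
    # Figure 8, assuming independent inputs that are 1 with probability 0.5.

    def p_transition(p_one):
        # P(0 -> 1) = P(output is 0 now) * P(output is 1 next)
        return (1.0 - p_one) * p_one

    p = 0.5                      # probability each primary input is 1

    # Chain: W = A.B, X = W.C, F = X.D
    chain = [p * p, p * p * p, p * p * p * p]          # P(=1) for W, X, F
    # Tree:  Y = A.B, Z = C.D, F = Y.Z
    tree  = [p * p, p * p, (p * p) * (p * p)]          # P(=1) for Y, Z, F

    print("chain switching:", sum(p_transition(x) for x in chain))  # ~0.355
    print("tree switching: ", sum(p_transition(x) for x in tree))   # ~0.434

With these assumptions, the chain implementation accumulates a lower expected number of 0→1 transitions per cycle than the tree, matching the argument above.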


Fig. 8. Gate restructuring. Figure adapted from the Pennsylvania State University Microsystems Design Laboratory’s tutorial on low power design.

Nevertheless, chain implementations do not necessarily save more energy than tree implementations. There are other issues to consider when choosing a topology. One is the issue of glitches, or spurious transitions that occur when a gate does not receive all of its inputs at the same time. These glitches are more common in chain implementations, where signals can travel along different paths having widely varying delays. One solution to reduce glitches is to change the topology so that the different paths in the circuit have similar delays. This solution, known as path balancing, often transforms chain implementations into tree implementations. Another solution, called retiming, involves inserting flip-flops or registers to slow down and thereby synchronize the signals that pass along different paths but reconverge to the same gate. Because flip-flops and registers are in sync with the processor clock, they sample their inputs less frequently than logic gates and are thus more immune to glitches.

3.1.5. Technology Mapping. Because of the huge number of possibilities and tradeoffs at the gate level, designers rely on tools to determine the most energy-optimal way of arranging gates and signals. Technology mapping [Chen et al. 2004; Li et al. 2004; Rutenbar et al. 2001] is the automated process of constructing a gate-level representation of a circuit subject to constraints such as area, delay, and power. Technology mapping for power relies on gate-level power models and a library that describes the available gates and their design constraints. Before a circuit can be described in terms of gates, it is initially represented at the logic level. The problem is to design the circuit out of logic gates in a way that will minimize the total power consumption under delay and cost constraints. This is an NP-hard Directed Acyclic Graph (DAG) covering problem, and a common heuristic to solve it is to break the DAG representation of a circuit into a set of trees and find the optimal mapping for each subtree using standard tree-covering algorithms.

3.1.6. Low Power Flip-Flops. Flip-flops are the building blocks of small memories such as register files. A typical master-slave flip-flop consists of two latches, a master latch and a slave latch. The inputs of the master latch are the clock signal and data. The inputs of the slave latch are the inverse of the clock signal and the data output by the master latch. Thus when the clock signal is high, the master latch is turned on and the slave latch is turned off. In this phase, the master samples whatever inputs it receives and outputs them. The slave, however, does not sample its inputs but merely outputs whatever it has most recently stored. On the falling edge of the clock, the master turns off and the slave turns on. Thus the master saves its most recent input and stops sampling any further inputs. The slave samples the new inputs it receives from the master and outputs them.

Besides this master-slave design are other common designs such as the pulse-triggered flip-flop and the sense-amplifier flip-flop. All these designs share some common sources of power consumption, namely power dissipated from the clock signal, power dissipated in internal switching activity (caused by the clock signal and by changes in data), and power dissipated when the outputs change.

Researchers have proposed several alternative low power designs for flip-flops. Most of these approaches reduce the switching activity or the power dissipated by the clock signal. One alternative is the self-gating flip-flop. This design inhibits the clock signal to the flip-flop when the inputs will produce no change in the outputs. Strollo et al. [2000] have proposed two versions of this design. In the double-gated flip-flop, the master and slave latches each have their own clock-gating circuit.


The clock-gating circuit consists of a comparator that checks whether the current and previous inputs are different and some logic that inhibits the clock signal based on the output of the comparator. In the sequential-gated flip-flop, the master and the slave share the same clock-gating circuit instead of having their own individual circuits as in the double-gated flip-flop.

Another low-power flip-flop is the conditional capture flip-flop [Kong et al. 2001; Nedovic et al. 2001]. This flip-flop detects when its inputs produce no change in the output and stops these redundant inputs from causing any spurious internal current switching. There are many variations on this idea, such as the conditional discharge flip-flop and conditional precharge flip-flop [Zhao et al. 2004], which eliminate unnecessary precharging and discharging of internal elements.

A third type of low-power flip-flop is the dual edge triggered flip-flop [Llopis and Sachdev 1996]. These flip-flops are sensitive to both the rising and falling edges of the clock and can do the same work as a regular flip-flop at half the clock frequency. By cutting the clock frequency in half, these flip-flops save total power consumption but may not save total energy, although they could save energy by reducing the voltage along with the frequency. Another drawback is that these flip-flops require more area than regular flip-flops. This increase in area may cause them to consume more power on the whole.

3.1.7. Low-Power Control Logic Design. One can view the control logic of a processor as a Finite State Machine (FSM). It specifies the possible processor states and the conditions for switching between the states and generates the signals that activate the appropriate circuitry for each state. For example, when an instruction is in the execution stage of the pipeline, the control logic may generate signals to activate the correct functional unit.

Although control logic optimizations have traditionally targeted performance, they are now also targeting power. One way of reducing power is to encode the FSM states in a way that minimizes the switching activity throughout the processor. Another common approach involves decomposing the FSM into sub-FSMs and activating only the circuitry needed for the currently executing sub-FSM. Some researchers [Gao and Hayes 2003] have developed tools that automate this process of decomposition, subject to some quality and correctness constraints.

3.1.8. Delay-Based Dynamic-Supply Voltage Adjustment. Commercial processors that can run at multiple clock speeds normally use a lookup table to decide what supply voltage to select for a given clock speed. This table is built ahead of time through a worst-case analysis. Because worst-case analyses may not reflect reality, ARM Inc. has been developing a more efficient runtime solution known as the Razor pipeline [Ernst et al. 2003].

Instead of using a lookup table, Razor adjusts the supply voltage based on the delay in the circuit. The idea is that whenever the voltage is reduced, the circuit slows down, causing timing errors if the clock frequency chosen for that voltage is too high. Because of these errors, some instructions produce inconsistent results or fail altogether. The Razor pipeline periodically monitors how many such errors occur. If the number of errors exceeds a threshold, it increases the supply voltage. If the number of errors is below a threshold, it scales the voltage further to save more energy.

This solution requires extra hardware in order to detect and correct circuit errors. To detect errors, it augments the flip-flops in delay-critical regions of the circuit with shadow latches. These latches receive the same inputs as the flip-flops but are clocked more slowly to adapt to the reduced supply voltage. If the output of a flip-flop differs from that of its shadow latch, this signifies an error. In the event of such an error, the circuit propagates the output of the shadow latch instead of the flip-flop, delaying the pipeline for a cycle if necessary.


Fig. 9. Crosstalk prevention through shielding.

3.2. Low-Power Interconnect

Interconnect heavily affects power consumption since it is the medium of most electrical activity. Efforts to improve chip performance are resulting in smaller chips with more transistors and more densely packed wires carrying larger currents. The wires in a chip often use materials with poor thermal conductivity [Banerjee and Mehrotra 2001].

3.2.1. Bus Encoding And CrossTalk. A popular way of reducing the power consumed in buses is to reduce the switching activity through intelligent bus encoding schemes, such as bus-inversion [Stan and Burleson 1995]. Buses consist of wires that transmit bits (logical 1 or 0). For every data transmission on a bus, the number of wires that switch depends on the current and previous values transmitted. If the Hamming distance between these values is more than half the number of wires, then most of the wires on the bus will switch current. To prevent this from happening, bus-inversion transmits the inverse of the intended value and asserts a control signal alerting recipients of the inversion. For example, if the current binary value to transmit is 110 and the previous was 000, the bus instead transmits 001, the inverse of 110.

Bus-inversion ensures that at most half of the bus wires switch during a bus transaction. However, because of the cost of the logic required to invert the bus lines, this technique is mainly used in external buses rather than the internal chip interconnect.

Moreover, some researchers have argued that this technique makes the simplistic assumption that the amount of power that an interconnect wire consumes depends only on the number of bit transitions on that wire. As chips become smaller, there arise additional sources of power consumption. One of these sources is crosstalk [Sylvester and Keutzer 1998]. Crosstalk is spurious activity on a wire that is caused by activity in neighboring wires. As well as increasing delays and impairing circuit integrity, crosstalk can increase power consumption.

One way of reducing crosstalk [Taylor et al. 2001] is to insert a shield wire (Figure 9) between adjacent wires. Since the shield remains deasserted, no adjacent wires switch in opposite directions. This solution doubles the number of wires. However, the principle behind it has sparked efforts to develop self-shielding codes [Victor and Keutzer 2001; Patel and Markov 2003] resistant to crosstalk. As in traditional bus encoding, a value is encoded and then transmitted. However, the code chosen avoids opposing transitions on adjacent bus wires.
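A minimal sketch of the bus-inversion scheme described in Section 3.2.1 (added here for illustration; the bus width is arbitrary) reproduces the 000 → 110 example from the text:

    # Sketch: bus-inversion coding. If more than half the wires would switch,
    # send the complement and assert an "invert" control line instead.

    def hamming(a, b, width):
        return bin((a ^ b) & ((1 << width) - 1)).count("1")

    def bus_invert(prev, value, width):
        # Returns (lines_to_drive, invert_flag).
        if hamming(prev, value, width) > width // 2:
            return (~value) & ((1 << width) - 1), 1
        return value, 0

    prev = 0b000
    data, inv = bus_invert(prev, 0b110, width=3)
    print(f"transmit {data:03b}, invert={inv}")   # transmit 001, invert=1

The receiver re-inverts the data whenever the flag is set, so at most half of the data wires (plus the one control wire) switch on any transfer.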


Fig. 10. Low voltage differential signaling.

3.2.2. Low Swing Buses. Bus-encoding schemes attempt to reduce transitions on the bus. Alternatively, a bus can transmit the same information but at a lower voltage. This is the principle behind low swing buses [Zhang and Rabaey 1998]. Traditionally, logical one is represented by 5 volts and logical zero is represented by −5 volts. In a low-swing system (Figure 10), logical one and zero are encoded using lower voltages, such as +300 mV and −300 mV. Typically, these systems are implemented with differential signaling. An input signal is split into two signals of opposite polarity bounded by a smaller voltage range. The receiver sees the difference between the two transmitted signals as the actual signal and amplifies it back to normal voltage.

Low Swing Differential Signaling has several advantages in addition to reduced power consumption. It is immune to crosstalk and electromagnetic radiation effects. Since the two transmitted signals are close together, any spurious activity will affect both equally without affecting the difference between them. However, when implementing this technique, one needs to consider the costs of increased hardware at the encoder and decoder.

Fig. 11. Bus segmentation.

3.2.3. Bus Segmentation. Another effective technique is bus segmentation. In a traditional shared bus architecture, the entire bus is charged and discharged upon every access. Segmentation (Figure 11) splits a bus into multiple segments connected by links that regulate the traffic between adjacent segments. Links connecting paths essential to a communication are activated independently, allowing most of the bus to remain powered down. Ideally, devices communicating frequently should be in the same or nearby segments to avoid powering many links. Jone et al. [2003] have examined how to partition a bus to ensure this property. Their algorithm begins with an undirected graph whose nodes are devices, edges connect communicating devices, and edge weights display communication frequency. Applying the Gomory and Hu algorithm [1961], they find a spanning tree in which the product of communication frequency and number of edges between any two devices is minimal.

3.2.4. Adiabatic Buses. The preceding techniques work by reducing activity or voltage. In contrast, adiabatic circuits [Lyuboslavsky et al. 2000] reduce total capacitance. These circuits reuse existing electrical charge to avoid creating new charge. In a traditional bus, when a wire becomes deasserted, its previous charge is wasted. A charge-recovery bus recycles the charge for wires about to be asserted.

Fig. 12. Two bit charge recovery bus.

Figure 12 illustrates a simple design for a two-bit adiabatic bus [Bishop and Irwin 1999]. A comparator in each bitline tests whether the current bit is equal to the previously sent bit. Inequality signifies a transition and connects the bitline to a shorting wire used for sharing charge across bitlines. In the sample scenario, Bitline1 has a falling transition while Bitline0 has a rising transition. As Bitline1 falls, its positive charge transfers to Bitline0, allowing Bitline0 to rise. The power saved depends on transition patterns. No energy is saved when all lines rise. The most energy is saved when an equal number of lines rise and fall simultaneously. The biggest drawback of adiabatic circuits is the delay incurred in transferring shared charge.


3.2.5. Network-On-Chip. All of the above techniques assume an architecture in which multiple functional units share one or more buses. This shared-bus paradigm has several drawbacks with respect to power and performance. The inherent bus bandwidth limits the speed and volume of data transfers and does not suit the varying requirements of different execution units. Only one unit can access the bus at a time, though several may be requesting bus access simultaneously. Bus arbitration mechanisms are energy intensive. Unlike simple data transfers, every bus transaction involves multiple clock cycles of handshaking protocols that increase power and delay.

As a consequence, researchers have been investigating the use of generic interconnection networks to replace buses. Compared to buses, networks offer higher bandwidth and support concurrent connections. This design makes it easier to troubleshoot energy-intensive areas of the network. Many networks have sophisticated mechanisms for adapting to varying traffic patterns and quality-of-service requirements. Several authors [Zhang et al. 2001; Dally and Towles 2001; Sgroi et al. 2001] propose the application of these techniques at the chip level. These claims have yet to be verified, and interconnect power reduction remains a promising area for future research.

Wang et al. [2003] have developed hardware-level optimizations for different sources of energy consumption in an on-chip network. One of the alternative network topologies they propose is the segmented crossbar topology, which aims to reduce the energy consumption of data transfers. When data is transferred over a regular network grid, all rows and columns corresponding to the intersection points end up switching current even though only parts of these rows and columns are actually traversed. To eliminate this widespread current switching, the segmented crossbar topology divides the rows and columns into segments demarcated by tristate buffers. The different segments are selectively activated as the data traverses through them, hence confining current switches to only those segments along which the data actually passes.

3.3. Low-Power Memories and Memory Hierarchies

3.3.1. Types of Memory. One can classify memory structures into two categories, Random Access Memories (RAM) and Read-Only Memories (ROM). There are two kinds of RAMs, static RAMs (SRAM) and dynamic RAMs (DRAM), which differ in how they store data. SRAMs store data using flip-flops and DRAMs store each bit of data as a charge on a capacitor; thus DRAMs need to refresh their data periodically. SRAMs allow faster accesses than DRAMs but require more area and are more expensive. As a result, normally only register files, caches, and high bandwidth parts of the system are made up of SRAM cells, while main memory is made up of DRAM cells. Although these cells have slower access times than SRAMs, they contain fewer transistors and are less expensive than SRAM cells.

The techniques we will introduce are not confined to any specific type of RAM or ROM. Rather, they are high-level architectural principles that apply across the spectrum of memories to the extent that the required technology is available. They attempt to reduce the energy dissipation of memory accesses in two ways, either by reducing the energy dissipated in a memory access, or by reducing the number of memory accesses.

3.3.2. Splitting Memories Into Smaller Subsystems. An effective way to reduce the energy that is dissipated in a memory access is to activate only the needed memory circuits in each access. One way to expose these circuits is to partition memories into smaller, independently accessible components. This can be done at different granularities.

At the lowest granularity, one can design a main memory system that is split up into multiple banks, each of which can independently transition into different power modes. Compilers and operating systems can cluster data into a minimal set of banks, allowing the other unused banks to power down and thereby save energy [Luz et al. 2002; Lebeck et al. 2000].

At a higher granularity, one can partition the separate banks of a partitioned memory into subbanks and activate only the relevant subbank in every memory access. This is a technique that has been applied to caches [Ghose and Kamble 1999]. In a normal cache access, the set-selection step involves transferring all blocks in all banks onto tag-comparison latches. Since the requested word is at a known offset in these blocks, energy can be saved by transferring only words at that offset. Cache subbanking splits the block array of each cache line into multiple banks. During set-selection, only the relevant bank from each line is active. Since a single subbank is active at a time, all subbanks of a line share the same output latches and sense amplifiers to reduce hardware complexity. Subbanking saves energy without increasing memory access time. Moreover, it is independent of program locality patterns.

3.3.3. Augmenting the Memory Hierarchy With Specialized Cache Structures. The second way to reduce energy dissipation in the memory hierarchy is to reduce the number of memory hierarchy accesses. One group of researchers [Kin et al. 1997] developed a simple but effective technique to do this. Their idea was to integrate a specialized cache into the typical memory hierarchy of a modern processor, a hierarchy that may already contain one or more levels of caches. The new cache they were proposing would sit between the processor and first level cache. It would be smaller than the first level cache and hence dissipate less energy. Yet it would be large enough to store an application’s working set and filter out many memory references; only when a data request misses this cache would it require searching higher levels of cache, and the penalty for a miss would be offset by redirecting more memory references to this smaller, more energy efficient cache. This was the principle behind the filter cache. Though deceptively simple, this idea lies at the heart of many of the low-power cache designs used today, ranging from simple extensions of the cache hierarchy (e.g., block buffering) to the scratch pad memories used in embedded systems, and the complex trace caches used in high-end processors.

One simple extension is the technique of block buffering. This technique augments caches with small buffers to store the most recently accessed cache set. If a data item that has been requested belongs to a cache set that is already in the buffer, then one does not have to search for it in the rest of the cache.

In embedded systems, scratch pad memories [Panda et al. 2000; Banakar et al. 2002; Kandemir et al. 2002] have been the extension of choice. These are software controlled memories that reside on-chip. Unlike caches, the data they should contain is determined ahead of time and remains fixed while programs execute. These memories are less volatile than conventional caches and can be accessed within a single cycle. They are ideal for specialized embedded systems whose applications are often hand tuned.
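The block-buffering check described above can be sketched in software as follows (an added illustration; the set mapping and buffer organization are simplified and invented, whereas a real design implements this in the cache hardware):

    # Sketch: block buffering -- check the most recently accessed set before
    # searching the full cache arrays. Set mapping and sizes are invented.

    class BlockBufferedCache:
        def __init__(self, num_sets=64):
            self.num_sets = num_sets
            self.buffer_set = None          # index of the buffered set
            self.buffer_data = {}           # address -> value for that set

        def lookup(self, address):
            s = address % self.num_sets
            if s == self.buffer_set and address in self.buffer_data:
                return self.buffer_data[address], "buffer hit (cache arrays idle)"
            # Otherwise fall back to a normal (more expensive) cache access and
            # refill the buffer with the newly accessed set.
            value = self.read_from_cache_arrays(address)
            self.buffer_set = s
            self.buffer_data = {address: value}
            return value, "full cache access"

        def read_from_cache_arrays(self, address):
            return f"<line for {hex(address)}>"   # placeholder

    c = BlockBufferedCache()
    print(c.lookup(0x1040)[1])   # full cache access
    print(c.lookup(0x1040)[1])   # buffer hit (cache arrays idle)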

General purpose processors of the latest generation sometimes rely on more complex caching techniques such as the trace cache. Trace caches were originally developed for performance but have been studied recently for their power benefits. Instead of storing instructions in their compiled order, a trace cache stores traces of instructions in their executed order. If an instruction sequence is already in the trace cache, then it need not be fetched from the instruction cache but can be decoded directly from the trace cache. This saves power by reducing the number of instruction cache accesses.

In the original trace cache design, both the trace cache and the instruction cache are accessed in parallel on every clock cycle. Thus instructions are fetched from the instruction cache while the trace cache is searched for the next trace that is predicted to execute. If the trace is in the trace cache, then the instructions fetched from the instruction cache are discarded. Otherwise the program execution proceeds out of the instruction cache, and simultaneously new traces are created from the instructions that are fetched.

Low-power designs for trace caches typically aim to reduce the number of instruction cache accesses. One alternative, the dynamic direction prediction-based trace cache [Hu et al. 2003], uses branch prediction to decide where to fetch instructions from. If the branch predictor predicts the next trace with high confidence and that trace is in the trace cache, then the instructions are fetched from the trace cache rather than the instruction cache.

The selective trace cache [Hu et al. 2002] extends this technique with compiler support. The idea is to identify frequently executed, or hot, traces and bring these traces into the trace cache. The compiler inserts hints after every branch instruction, indicating whether it belongs to a hot trace. At runtime, these hints cause instructions to be fetched from either the trace cache or the instruction cache.

3.4. Low-Power Processor Architecture Adaptations

So far we have described the energy saving features of hardware as though they were a fixed foundation upon which programs execute. However, programs exhibit wide variations in behavior. Researchers have been developing hardware structures whose parameters can be adjusted on demand so that one can save energy by activating just the minimum hardware resources needed for the code that is executing.

3.4.1. Adaptive Caches. There is a wealth of literature on adaptive caches, caches whose storage elements (lines, blocks, or sets) can be selectively activated based on the application workload. One example of such a cache is the Deep-Submicron Instruction (DRI) cache [Powell et al. 2001]. This cache permits one to deactivate its individual sets on demand by gating their supply voltages. To decide what sets to activate at any given time, the cache uses a hardware profiler that monitors the application’s cache-miss patterns. Whenever the cache misses exceed a threshold, the DRI cache activates previously deactivated sets. Likewise, whenever the miss rate falls below a threshold, the DRI cache deactivates some of these sets by inhibiting their supply voltages.
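A sketch of the DRI cache’s miss-driven resizing policy described above (added here; the interval, thresholds, and set counts are invented):

    # Sketch: DRI-style set activation driven by a miss-rate threshold.
    # Interval length, thresholds, and sizes are invented.

    def resize(active_sets, misses, accesses,
               total_sets=1024, min_sets=128, high=0.05, low=0.01):
        miss_rate = misses / max(1, accesses)
        if miss_rate > high and active_sets < total_sets:
            return min(total_sets, active_sets * 2)     # reactivate sets
        if miss_rate < low and active_sets > min_sets:
            return max(min_sets, active_sets // 2)      # gate more sets off
        return active_sets

    active = 1024
    for misses, accesses in [(50, 10000), (30, 10000), (900, 10000)]:
        active = resize(active, misses, accesses)
        print(f"miss rate {misses/accesses:.3f} -> {active} active sets")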


A problem with this approach is that dormant memory cells lose data and need more time to be reactivated for their next use. Thus an alternative to inhibiting their supply voltages is to reduce their voltages as low as possible without losing data. This is the aim of the drowsy cache, a cache whose lines can be placed in a drowsy mode where they dissipate minimal power but retain data and can be reactivated faster.

Fig. 13. Drowsy cache line.

Figure 13 illustrates a typical drowsy cache line [Kim et al. 2002]. The wakeup signal chooses between voltages for the active, drowsy, and off states. It also regulates the word decoder’s output to prevent data in a drowsy line from being accessed. Before a cache access, the line circuitry must be precharged. In a traditional cache, the precharger is continuously on. To save power, the drowsy circuitry regulates the wakeup signal with a precharge signal. The precharger becomes active only when the line is woken up.

Many heuristics have been developed for controlling these drowsy circuits. The simplest policy for controlling the drowsy cache is cache decay [Kaxiras et al. 2001], which simply turns off unused cache lines after a fixed interval. More sophisticated policies [Zhang et al. 2002; 2003; Hu et al. 2003] consider information about runtime program behavior. For example, Hu et al. [2003] have proposed hardware that detects when an execution moves into a hotspot and activates only those cache lines that lie within that hotspot. The main idea of their approach is to count how often different branches are taken. When a branch is taken sufficiently many times, its target address is a hotspot, and the hardware sets special bits to indicate that cache lines within that hotspot should not be shut down. Periodically, a runtime routine disables cache lines that do not have these bits set and decays the profiling data. Whenever it reactivates a cache line before its next use, it also reactivates the line that corresponds to the next subsequent memory access.

There are many other heuristics for controlling cache lines. Dead-block elimination [Kabadi et al. 2002] powers down cache lines containing basic blocks that have reached their final use. It is a compiler-directed technique that requires a static control flow analysis to identify these blocks. Consider the control flow graph in Figure 14. Since block B1 executes only once, its cache lines can power down once control reaches B2. Similarly, the lines containing B2 and B3 can power down once control reaches B4.

Fig. 14. Dead block elimination.

3.4.2. Adaptive Instruction Queues. Buyuktosunoglu et al. [2001] developed one of the first adaptive instruction issue queues, a 32-entry queue consisting of four equal size partitions that can be independently activated. Each partition consists of wakeup logic that decides when instructions are ready to execute and readout logic that dispatches ready instructions into the pipeline. At any time, only the partitions that contain the currently executing instructions are activated.

There have been many heuristics developed for configuring these queues. Some require measuring the rate at which instructions are executed per processor clock cycle, or IPC. For example, Buyuktosunoglu et al.’s heuristic [2001] periodically measures the IPC over fixed length intervals. If the IPC of the current interval is a factor smaller than that of the previous interval (indicating worse performance), then the heuristic increases the instruction queue size to increase the throughput. Otherwise it may decrease the queue size.

Bahar and Manne [2001] propose a similar heuristic targeting a simulated 8-wide issue processor that can be reconfigured as a 6-wide issue or 4-wide issue. They also measure IPC over fixed intervals but measure the floating point IPC separately from the integer IPC since the former is more critical for floating point applications. Their heuristic, like Buyuktosunoglu et al.’s [2001], compares the measured IPC with specific thresholds to determine when to downsize the issue queue.

Folegnani and Gonzalez [2001], on the other hand, use a different heuristic. They find that when a program has little parallelism, the youngest part of the issue queue (which contains the most recent instructions) contributes very little to the overall IPC in that instructions in this part are often committed late. Thus their heuristic monitors the contribution of the youngest part of this queue and deactivates this part when its contribution is minimal.
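A sketch in the spirit of the IPC-based resizing heuristics described above (an added illustration; the comparison factor, interval IPC values, and allowed sizes are invented):

    # Sketch: IPC-guided issue-queue resizing. The shrink/grow factor and the
    # queue sizes are invented.

    SIZES = [8, 16, 24, 32]        # allowed queue sizes (four 8-entry partitions)

    def next_size(current, ipc_now, ipc_prev, factor=0.9):
        i = SIZES.index(current)
        if ipc_now < factor * ipc_prev:        # performance degraded noticeably
            return SIZES[min(i + 1, len(SIZES) - 1)]   # upsize
        return SIZES[max(i - 1, 0)]                    # otherwise try downsizing

    size, prev_ipc = 32, 1.8
    for ipc in [1.75, 1.70, 1.20, 1.65]:
        size = next_size(size, ipc, prev_ipc)
        prev_ipc = ipc
        print(f"IPC {ipc:.2f} -> queue size {size}")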

In addition to this work, researchers have also been tackling the problem of reconfiguring multiple hardware units simultaneously. One group [Hughes et al. 2001; Sasanka et al. 2002; Hughes and Adve 2004] has been developing heuristics for combining hardware adaptations with dynamic voltage scaling during multimedia applications. Their strategy is to run two algorithms while individual video frames are being processed. A global algorithm chooses the initial DVS setting and baseline hardware configuration for each frame, while a local algorithm tunes various hardware parameters (e.g., instruction window sizes) while the frame is executing to use up any remaining slack periods.

Another group [Iyer and Marculescu 2001, 2002a] has developed heuristics that adjust the pipeline width and register update unit (RUU) size for different hotspots or frequently executed program regions. To detect these hotspots, their technique counts the number of times each branch instruction is taken. When a branch is taken enough times, it is marked as a candidate branch.

To detect how often candidate branches are taken, the heuristic uses a hotspot detection counter. Initially the hotspot counter contains a positive integer. Each time a candidate branch is taken, the counter is decremented, and each time a noncandidate branch is taken, the counter is incremented. If the counter becomes zero, this means that the program execution is inside a hotspot.

Once the program execution is in a hotspot, this heuristic tests the energy consumption of different processor configurations at fixed length intervals. It finally chooses the most energy saving configuration as the one to use the next time the same hotspot is entered. To measure energy at runtime, the heuristic relies on hardware that monitors usage statistics of the most power hungry units and calculates the total power based on energy per access values.

Huang et al. [2000; 2003] apply hardware adaptations at the granularity of subroutines. To decide what adaptations to apply, they use offline profiling to plot an energy-delay tradeoff curve. Each point in the curve represents the energy and delay tradeoff of applying an adaptation to a subroutine. There are three regions in the curve. The first includes adaptations that save energy without impairing performance; these are the adaptations that will always be applied. The third represents adaptations that worsen both performance and energy; these are the adaptations that will never be applied. Between these two extremes are adaptations that trade performance for energy savings; some of these may be applied depending on the tolerable performance loss. The main idea is to trace the curve from the origin and apply every adaptation that one encounters until the cumulative performance loss from applying all the encountered adaptations reaches the maximum tolerable performance loss.

Ponomarev et al. [2001] develop heuristics for controlling a configurable issue queue, a configurable reorder buffer, and a configurable load/store queue. Unlike the IPC-based heuristics we have seen, this heuristic is occupancy based because the occupancy of hardware structures indicates how congested these structures are, and congestion is a leading cause of performance loss. Whenever a hardware structure (e.g., instruction queue) has low occupancy, the heuristic reduces its size. In contrast, if the structure is completely full for a long time, the heuristic increases its size to avoid congestion.

Another group of researchers [Albonesi et al. 2003; Dropsho et al. 2002] explore the complex problem of configuring a processor whose adaptable components include two levels of caches, integer and floating point issue queues, load/store queues, register files, as well as a reorder buffer. To dodge the sheer explosion of possible configurations, they tune each component based only on its local usage statistics. They propose two heuristics for doing this, one for tuning caches and another for tuning buffers and register files.

The caches they consider are selective way caches; each of their multiple ways can independently be activated or disabled. Each way has a most recently used (MRU) counter that is incremented every time a cache search hits the way. The cache tuning heuristic samples this counter at fixed intervals to determine the number of hits in each way and the number of overall misses. Using these access statistics, the heuristic computes the energy and performance overheads for all possible cache configurations and chooses the best configuration dynamically.
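The way-counter bookkeeping behind such a selective-way heuristic is simple enough to sketch. The fragment below is only an illustration of the idea, not the authors' implementation; the per-way energy cost, miss penalty, and slowdown bound are invented placeholders.

```python
# Sketch of selective-way cache tuning driven by per-way MRU hit counters.
# All cost constants are illustrative, not taken from Albonesi/Dropsho.

ENERGY_PER_WAY = 1.0      # energy charged per active way per access (arbitrary units)
MISS_ENERGY    = 8.0      # extra energy per miss (refill from the next level)
MISS_PENALTY   = 20.0     # extra cycles per miss

def tune_ways(way_hits, misses, accesses, max_slowdown=0.02):
    """way_hits[i] = hits recorded by the MRU counter of way i (most- to
    least-recently used). Returns the number of ways to keep enabled."""
    n = len(way_hits)
    base_cycles = accesses + misses * MISS_PENALTY
    best, best_energy = n, float("inf")
    for ways in range(1, n + 1):
        # Hits to the disabled (colder) ways become misses in this configuration.
        extra_misses = sum(way_hits[ways:])
        new_misses = misses + extra_misses
        energy = accesses * ways * ENERGY_PER_WAY + new_misses * MISS_ENERGY
        cycles = accesses + new_misses * MISS_PENALTY
        if cycles <= base_cycles * (1 + max_slowdown) and energy < best_energy:
            best, best_energy = ways, energy
    return best
```

At each sampling interval, a heuristic of this kind would read the hardware counters, call something like tune_ways, enable that many ways, and clear the counters for the next interval.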

The heuristic for controlling other structures such as issue queues is occupancy based. It measures how often different components of these structures fill up with data and uses this information to decide whether to upsize or downsize the structure.

3.5. Dynamic Voltage Scaling

Dynamic voltage scaling (DVS) addresses the problem of how to modulate a processor's clock frequency and supply voltage in lockstep as programs execute. The premise is that a processor's workloads vary and that when the processor has less work, it can be slowed down without affecting performance adversely. For example, if the system has only one task and it has a workload that requires 10 cycles to execute and a deadline of 100 seconds, the processor can slow down to 1/10 cycles/sec, saving power and meeting the deadline right on time. This is presumably more efficient than running the task at full speed and idling for the remainder of the period. Though it appears straightforward, this naive picture of how DVS works is highly simplified and hides serious real-world complexities.

3.5.1. Unpredictable Nature Of Workloads. The first problem is how to predict workloads with reasonable accuracy. This requires knowing what tasks will execute at any given time and the work required for these tasks. Two issues complicate this problem. First, tasks can be preempted at arbitrary times due to user and I/O device requests. Second, it is not always possible to accurately predict the future runtime of an arbitrary algorithm (cf. the Halting Problem [Turing 1937]). There have been ongoing efforts [Nilsen and Rygg 1995; Lim et al. 1995; Diniz 2003; Ishihara and Yasuura 1998] in the real-time systems community to accurately estimate the worst case execution times of programs. These attempts use either measurement-based prediction, static analysis, or a mixture of both. Measurement involves a learning lag whereby the execution times of various tasks are sampled and future execution times are extrapolated. The problem is that, when workloads are irregular, future execution times may not resemble past execution times. Microarchitectural innovations such as pipelining, hyperthreading, and out-of-order execution make it difficult to predict execution times in real systems statically. Among other things, a compiler needs to know how instructions are interleaved in the pipeline, what the probabilities are that different branches are taken, and when cache misses and pipeline hazards are likely to occur. This requires modeling the pipeline and memory hierarchy. A number of researchers [Ferdinand 1997; Clements 1996] have attempted to integrate complete pipeline and cache models into compilers. It remains nontrivial to develop models that can be used efficiently in a compiler and that capture all the complexities inherent in current systems.

3.5.2. Indeterminism And Anomalies In Real Systems. Even if the DVS algorithm predicts correctly what the processor's workloads will be, determining how fast to run the processor is nontrivial. The nondeterminism in real systems removes any strict relationships between clock frequency, execution time, and power consumption. Thus most theoretical studies on voltage scaling are based on assumptions that may sound reasonable but are not guaranteed to hold in real systems.
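Before turning to these misconceptions, it is worth seeing the idealized arithmetic behind the naive single-task picture of Section 3.5. The sketch below assumes the textbook CMOS model in which dynamic power scales with CV^2f and frequency scales roughly linearly with supply voltage (so power falls cubically with frequency); the idle power figure and all numbers are illustrative assumptions, not measurements.

```python
# Idealized (and, as the following paragraphs argue, oversimplified) DVS math.
# Assumes P_dynamic ~ C * V^2 * f and f ~ V, so P ~ f^3 at matched voltage.

def dynamic_power(freq, f_max=1.0, p_max=1.0):
    """Power at a given frequency, normalized to p_max at f_max."""
    return p_max * (freq / f_max) ** 3

def energy_run_then_idle(work_cycles, deadline, f_max=1.0, p_idle=0.1):
    busy = work_cycles / f_max                      # seconds at full speed
    return dynamic_power(f_max) * busy + p_idle * (deadline - busy)

def energy_stretched(work_cycles, deadline):
    f = work_cycles / deadline                      # just fast enough to finish on time
    return dynamic_power(f) * deadline

# The single-task example from Section 3.5: 10 cycles of work, 100 s deadline.
print(energy_run_then_idle(10, 100))   # 19.0 units: run at full speed, then idle
print(energy_stretched(10, 100))       # 0.1 units under the cubic model
```

The gap looks dramatic only because this model ignores leakage, off-chip power, and the system-level effects discussed next.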

One misconception is that total microprocessor system power is quadratic in supply voltage. Under the CMOS transistor model, the power dissipation of individual transistors is quadratic in their supply voltages, but there remains no precise way of estimating the power dissipation of an entire system. One can construct scenarios where total system power is not quadratic in supply voltage. Martin and Siewiorek [2001] found that the StrongARM SA-1100 has two supply voltages, a 1.5V supply that powers the CPU and a 3.3V supply that powers the I/O pins. The total power dissipation is dominated by the larger supply voltage; hence, even if the smaller supply voltage were allowed to vary, it would not reduce power quadratically. Fan et al. [2003] found that, in some architectures where active memory banks dominate total system power consumption, reducing the supply voltage dramatically reduces CPU power dissipation but has a minuscule effect on total power. Thus, for full benefits, dynamic voltage scaling may need to be combined with resource hibernation to power down other energy hungry parts of the chip.

Another misconception is that it is most power efficient to run a task at the slowest constant speed that allows it to exactly meet its deadline. Several theoretical studies [Ishihara and Yasuura 1998] attempt to prove this claim. These proofs rest on idealistic assumptions such as "power is a convex linearly increasing function of frequency", assumptions that ignore how DVS affects the system as a whole. When the processor slows down, devices may remain activated longer, consuming more power. An important study of this issue was done by Fan et al. [2003] who show that, for specific DRAM architectures, the energy versus slowdown curve is "U" shaped. As the processor slows down, CPU energy decreases but the cumulative energy consumed in active memory banks increases. Thus the optimal speed is actually higher than the processor's lowest speed; any speed lower than this causes memory energy dissipation to overshadow the effects of DVS.

A third misconception is that execution time is inversely proportional to clock frequency. It is actually an open problem how slowing down the processor affects the execution time of an arbitrary application. DVS may result in nonlinearities [Buttazzo 2002]. Consider a system with two tasks (Figure 15). When the processor slows down (a), Task One takes longer to enter its critical section and is thus preempted by Task Two as soon as Task Two is released into the system. When the processor instead runs at maximum speed (b), Task One enters its critical section earlier and blocks Task Two until it finishes this section. This hypothetical example suggests that DVS can affect the order in which tasks are executed. But the order in which tasks execute affects the state of the cache which, in turn, affects performance. The exact relationship between the cache state and execution time is not clear. Thus, it is too simplistic to assume that execution time is inversely proportional to clock frequency.

Fig. 15. Scheduling effects. When DVS is applied (a), task one takes a longer time to enter its critical section, and task two is able to preempt it right before it does. When DVS is not applied (b), task one enters its critical section sooner and thus prevents task two from preempting it until it finishes its critical section. This figure has been adapted from Buttazzo [2002].

All of these issues show that theoretical studies are insufficient for understanding how DVS will affect system state. One needs to develop an experimental approach driven by heuristics and evaluate the tradeoffs of these heuristics empirically.
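The "U"-shaped tradeoff reported by Fan et al. [2003] is easy to reproduce with a toy model. The constants below (the CPU power law and the active-memory power) are invented for illustration; only the shape of the curve, not the numbers, is the point.

```python
# Toy model of the "U"-shaped energy-versus-slowdown curve: as the CPU slows
# down, CPU energy falls, but active memory banks stay powered for longer.
# All constants are hypothetical.

def total_energy(slowdown, work=1.0, p_cpu_max=2.0, p_mem_active=0.5):
    """slowdown >= 1.0; runtime grows linearly, CPU power falls cubically."""
    runtime = work * slowdown
    cpu_energy = p_cpu_max * (1.0 / slowdown) ** 3 * runtime   # ~ 1/slowdown^2
    mem_energy = p_mem_active * runtime                        # grows with runtime
    return cpu_energy + mem_energy

best = min((total_energy(1 + 0.1 * i), 1 + 0.1 * i) for i in range(31))
print(best)  # the minimum sits at an intermediate slowdown, not at the largest one
```

Under these made-up constants the minimum-energy point falls at roughly a 2x slowdown; slowing down further only lets memory energy dominate, which is exactly the effect described above.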


Most of the existing DVS approaches can be classified as interval-based approaches, intertask approaches, or intratask approaches.

3.5.3. Interval-Based Approaches. Interval-based DVS algorithms measure how busy the processor is over some interval or intervals, estimate how busy it will be in the next interval, and adjust the processor speed accordingly. These algorithms differ in how they estimate future processor utilization. Weiser et al. [1994] developed one of the earliest interval-based algorithms. This algorithm, Past, periodically measures how long the processor idles. If it idles longer than a threshold, the algorithm slows it down; if it remains busy longer than a threshold, it speeds it up. Since this approach makes decisions based only on the most recent interval, it is error prone. It is also prone to thrashing since it computes a CPU speed for every interval. Govil et al. [1995] examine a number of algorithms that extend Past by considering measurements collected over larger windows of time. The Aged Averages algorithm, for example, estimates the CPU usage for the upcoming interval as a weighted average of usages measured over all previous intervals. These algorithms save more energy than Past since they base their decisions on more information. However, they all assume that workloads are regular. As we have mentioned, it is very difficult, if not impossible, to predict irregular workloads using history information alone.

3.5.4. Intertask Approaches. Intertask DVS algorithms assign different speeds for different tasks. These speeds remain fixed for the duration of each task's execution. Weissel and Bellosa [2002] have developed an intertask algorithm that uses hardware events as the basis for choosing the clock frequencies for different processes. The motivation is that, as processes execute, they generate various hardware events (e.g., cache misses, IPC) at varying rates that relate to their performance and energy dissipation. Weissel and Bellosa's technique involves monitoring these event rates using hardware performance counters and attributing them to the processes that are executing. At each context switch, the heuristic adjusts the clock frequency for the process being activated based on its previously measured event rates. To do this, the heuristic refers to a previously constructed table of event rate combinations, divided into frequency domains. Weissel and Bellosa construct this table through a detailed offline simulation where, for different combinations of event rates, they determine the lowest clock frequency that does not violate a user-specified threshold for performance loss.

Intertask DVS has two drawbacks. First, task workloads are usually unknown until tasks finish running. Thus traditional algorithms either assume perfect knowledge of these workloads or estimate future workloads based on prior workloads. When workloads are irregular, estimating them is nontrivial. Flautner et al. [2001] address this problem by classifying all workloads as interactive episodes and producer-consumer episodes and using separate DVS algorithms for each. For interactive episodes, the clock frequency used depends not only on the predicted workload but also on a perception threshold that estimates how fast the episode would have to run to be acceptable to a user. To recover from prediction error, the algorithm also includes a panic threshold. If a session lasts longer than this threshold, the processor immediately switches to full speed.

The second drawback of intertask approaches is that they are unaware of program structure. A program's structure may provide insight into the work required for it, insight that, in turn, may provide opportunities for techniques such as voltage scaling. Within specific program regions, for example, the processor may spend significant time waiting for memory or disk accesses to complete. During the execution of these regions, the processor can be slowed down to meet pipeline stall delays.
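A minimal version of the interval-based idea of Section 3.5.3 can be written in a few lines. The sketch below uses an exponentially weighted average in the spirit of the Aged Averages algorithm; the smoothing factor and the frequency table are hypothetical, and the returned setting would be handed to a platform-specific frequency-setting interface.

```python
# Sketch of an interval-based DVS policy in the spirit of Past/Aged Averages.
# FREQUENCIES stands in for the platform's available speed settings.

FREQUENCIES = [300, 600, 800, 1000]   # MHz, illustrative

class IntervalGovernor:
    def __init__(self, alpha=0.5):
        self.alpha = alpha            # weight given to the newest measurement
        self.predicted_util = 1.0     # start pessimistically at full utilization

    def on_interval(self, busy_time, interval_length):
        measured = busy_time / interval_length
        # Weighted average of the new measurement and the running history.
        self.predicted_util = (self.alpha * measured
                               + (1 - self.alpha) * self.predicted_util)
        # Pick the slowest frequency expected to absorb the predicted load.
        target = self.predicted_util * FREQUENCIES[-1]
        for f in FREQUENCIES:
            if f >= target:
                return f
        return FREQUENCIES[-1]

gov = IntervalGovernor()
print(gov.on_interval(busy_time=4, interval_length=10))   # lightly loaded interval
```

As the surrounding discussion notes, a scheme like this inherits the weaknesses of history-based prediction: it reacts late to irregular workloads and can thrash if the interval is too short.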


3.5.5. Intratask Approaches. Intratask approaches [Lee and Sakurai 2000; Lorch and Smith 2001; Gruian 2001; Yuan and Nahrstedt 2003] adjust the processor speed and voltage within tasks. Lee and Sakurai [2000] developed one of the earliest approaches. Their approach, run-time voltage hopping, splits each task into fixed length timeslots. The algorithm assigns each timeslot the lowest speed that allows it to complete within its preferred execution time, which is measured as the worst case execution time minus the elapsed execution time up to the current timeslot. An issue not considered in this work is how to divide a task into timeslots. Moreover, the algorithm is pessimistic since tasks could finish before their worst case execution time. It is also insensitive to program structure.

There are many variations on this idea. Two operating system-level intratask algorithms are PACE [Lorch and Smith 2001] and Stochastic DVS [Gruian 2001]. These algorithms choose a speed for every cycle of a task's execution based on the probability distribution of the task workload measured over previous cycles. The difference between PACE and Stochastic DVS lies in their cost functions. Stochastic DVS assumes that energy is proportional to the square of the supply voltage, while PACE assumes it is proportional to the square of the clock frequency. These simplifying assumptions are not guaranteed to hold in real systems.

Dudani et al. [2002] have also developed intratask policies at the operating system level. Their work aims to combine intratask voltage scaling with the earliest-deadline-first scheduling algorithm. Dudani et al. split each program that the scheduler chooses to execute into two subprograms. The first subprogram runs at the maximum clock frequency, while the second runs slow enough to keep the combined execution time of both subprograms below the average execution time for the whole program.

In addition to these schemes, there have been a number of intratask policies implemented at the compiler level. Shin and Kim [2001] developed one of the first of these algorithms. They noticed that programs have multiple execution paths, some more time consuming than others, and that whenever control flows away from a critical path, there are opportunities for the processor to slow down and still finish the programs within their deadlines. Based on this observation, Shin and Kim developed a tool that profiles a program offline to determine worst case and average cycles for different execution paths, and then inserts instructions to change the processor frequency at the start of different paths based on this information.

Another approach, program checkpointing [Azevedo et al. 2002], annotates a program with checkpoints and assigns timing constraints for executing code between checkpoints. It then profiles the program offline to determine the average number of cycles between different checkpoints. Based on this profiling information and timing constraints, it adjusts the CPU speed at each checkpoint.

Perhaps the most well known approach is by Hsu and Kremer [2003]. They profile a program offline with respect to all possible combinations of clock frequencies assigned to different regions. They then build a table describing how each combination affects execution time and power consumption. Using this table, they select the combination of regions and frequencies that saves the most power without increasing runtime beyond a threshold.

3.5.6. The Implications of Memory Bounded Code. Memory bounded applications have become an attractive target for DVS algorithms because the time for a memory access, on most commercial systems, is independent of how fast the processor is running. When frequent memory accesses dominate the program execution time, they limit how fast the program can finish executing. Because of this "memory wall", the processor can actually run slower and save a lot of energy without losing as much performance as it would if it were slowing down a compute-intensive code.
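Looking back at Section 3.5.5, the per-timeslot speed assignment of run-time voltage hopping reduces to a small calculation. The sketch below is illustrative only: the discrete speed set and the way the remaining budget is charged are assumptions, not details of Lee and Sakurai's implementation.

```python
# Sketch of per-timeslot speed selection in the style of run-time voltage
# hopping: each slot gets the slowest speed that still fits the remaining
# worst-case budget. Speeds are normalized (1.0 = full speed) and illustrative.

SPEEDS = [0.25, 0.5, 0.75, 1.0]

def pick_slot_speed(slot_wcec, wcet_remaining, elapsed, f_max_cycles_per_ms=1000):
    """slot_wcec: worst-case cycles of this timeslot.
    wcet_remaining: worst-case time budget (ms) left for the task.
    elapsed: time (ms) already spent on earlier slots."""
    budget = wcet_remaining - elapsed          # preferred execution time for the slot
    for s in SPEEDS:                           # try the slowest speed first
        slot_time = slot_wcec / (s * f_max_cycles_per_ms)
        if slot_time <= budget:
            return s
    return 1.0                                 # fall back to full speed

# A slot with 2000 worst-case cycles, 10 ms of budget left, 4 ms already used:
print(pick_slot_speed(2000, 10, 4))            # -> 0.5 (2000 cycles in 4 ms)
```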


Fig. 16. Making the best of the memory wall.

Figure 16 illustrates how the memory wall can be exploited. For ease of exposition, we call a sequence of executed instructions a task. Part (a) depicts a processor executing three tasks. Task 1 is followed by a memory load instruction that causes a main memory access. While this access is taking place, task 2 is able to execute since it does not depend on the results of task 1. However, task 3, which comes after task 2, depends on the result of task 1, and thus can execute only after the memory access completes. So immediately after executing task 2, the processor sits idle until the memory access finishes. What the processor could have done is illustrated in part (b). Here, instead of running task 2 as fast as possible and idling until the memory access completes, the processor slows down the execution of task 2 to overlap exactly with the main memory access. This, of course, saves energy without impairing performance.

But there are many idealized assumptions at work here, such as the assumption that a DVS algorithm can predict with complete accuracy a program's future behavior and switch the clock frequency without any hidden costs. In reality, things are far more complex. A fundamental problem is how to detect program regions during which the processor stalls, waiting for a memory, disk, or network access to complete. Modern processors contain counters that measure events such as cache misses, but it is difficult to extrapolate the nature of these stalls from observing these events.

Nevertheless, a growing number of researchers are addressing the DVS problem from this angle; that is, they are developing techniques for modulating the processor frequency based on memory access patterns. Because it is difficult to reason abstractly about the complex events occurring in modern processors, the research in this area has a strong experimental flavor; many of the techniques used are heuristics that are validated through detailed simulations and experiments on real systems.

One recent work is by Kondo and Nakamura [2004]. It is an interval-based approach driven by cache misses. The heuristic periodically calculates the number of outstanding cache misses and increments one of three counters depending on whether the number of outstanding misses is zero, one, or greater than one. The cost function that Kondo and Nakamura use expresses the memory boundedness of the code as a weighted sum of the three counters. In particular, the third counter, which is incremented each time there are multiple outstanding misses, receives the heaviest weight. At fixed length intervals, the heuristic compares the memory boundedness, so computed, with respect to an upper and a lower threshold. If the memory boundedness is greater than the upper threshold, then the heuristic decreases the frequency and voltage by one setting; otherwise, if the memory boundedness is below the lower threshold, then it increases the frequency and voltage settings by one unit.
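The control loop just described amounts to a small amount of bookkeeping per interval. The weights, thresholds, and normalization below are placeholders chosen for illustration, not the values used by Kondo and Nakamura.

```python
# Sketch of an interval-based, cache-miss-driven DVS heuristic in the spirit
# of Kondo and Nakamura [2004]. Weights and thresholds are illustrative.

WEIGHTS = (0.0, 1.0, 4.0)      # counters for 0, 1, and >1 outstanding misses
UPPER, LOWER = 0.6, 0.2        # memory-boundedness thresholds

def end_of_interval(counts, samples, level, num_levels):
    """counts[i]: how often i (or, for i == 2, more than one) misses were
    outstanding when sampled. level: current frequency/voltage setting
    (0 = slowest). Returns the new setting."""
    boundedness = sum(w * c for w, c in zip(WEIGHTS, counts)) / (WEIGHTS[2] * samples)
    if boundedness > UPPER and level > 0:
        return level - 1                # memory bound: step down one setting
    if boundedness < LOWER and level < num_levels - 1:
        return level + 1                # compute bound: step back up
    return level

# Example: 1000 samples in the interval, mostly with multiple outstanding misses.
print(end_of_interval(counts=(100, 200, 700), samples=1000, level=3, num_levels=4))
```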


Stanley-Marbell et al. [2002] propose a similar approach, the Power Adaptation Unit (PAU). This is a finite state machine that detects the program counter values for which memory stalls often occur. When a stall is initially detected, the unit enters a transient state where it counts the number of stall cycles, and after the stall cycles exceed a threshold, it marks the initial program counter value as hot. Thus the next time the program counter assumes the same value, the PAU automatically reduces the clock frequency and voltage.

Another recent work is by Choi et al. [2004]. Their work is based on an execution time model where a program's total runtime is split into two parts, time that the program spends on-chip and time that it spends in main memory accesses. The on-chip time can be calculated from the number of on-chip instructions, the average cycles for each of these instructions, and the processor clock frequency. The off-chip time is similarly a function of the number of off-chip (memory reference) instructions, the average cycles per off-chip instruction, and the memory clock frequency.

Using this model, Choi et al. [2004] derive a formula for the lowest clock frequency that would keep the performance loss within a tolerable threshold. They find that the key to computing this formula is determining the average number of cycles for an on-chip instruction, CPU_avg^on; this essential information determines the ratio of on-chip to off-chip instructions. They also found that the average cycles per instruction (CPI_avg) is a linear function of the average stall cycles per instruction (SPI_avg). This reduced their problem to calculating SPI_avg. To calculate this, they used data-cache miss statistics provided by hardware counters. Reasoning that the cycles spent in stalls increase with respect to the number of cache misses, Choi et al. built a table that associated different cache miss ranges with different values of SPI_avg; this would allow their heuristic to calculate in real time the workload decomposition and appropriate clock frequency.

Unlike the previous approaches, which attempt to monitor memory boundedness, a recent technique by Hsu and Feng [2004] attempts to monitor CPU boundedness. Their algorithm periodically measures the rate at which instructions are executed (expressed in million instructions per second) to determine how compute-intensive the workload is. The authors find that this approach is as accurate as other approaches that measure memory boundedness, but may be easier to implement because of its simpler cost model.
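The execution-time decomposition used by Choi et al. [2004] lends itself to a compact sketch. The lookup table, the assumed base CPI of one plus the stall component, and the loss bound below are all invented for illustration; the structure (split runtime into on-chip and off-chip parts, then solve for the slowest acceptable frequency) is the point.

```python
# Sketch of picking a CPU frequency from an on-chip/off-chip runtime split,
# in the spirit of Choi et al. [2004]. Table entries and bounds are illustrative.

FREQS_MHZ = [400, 600, 800, 1000]
SPI_TABLE = [(1_000, 0.2), (10_000, 1.0), (100_000, 3.0)]   # (miss count, SPI estimate)

def spi_from_misses(misses):
    """Map a cache-miss count observed in the last interval to an SPI estimate."""
    for bound, spi in SPI_TABLE:
        if misses <= bound:
            return spi
    return SPI_TABLE[-1][1]

def pick_frequency(instr, misses, mem_time_s, max_loss=0.1):
    """instr: instructions retired; mem_time_s: time spent in memory accesses,
    which does not scale with the CPU clock."""
    cpi = 1.0 + spi_from_misses(misses)          # assumed base CPI plus stalls
    t_full = instr * cpi / (FREQS_MHZ[-1] * 1e6) + mem_time_s
    for f in FREQS_MHZ:                          # slowest first
        t = instr * cpi / (f * 1e6) + mem_time_s
        if t <= t_full * (1 + max_loss):
            return f
    return FREQS_MHZ[-1]

# A memory-heavy workload tolerates a lower clock within a 10% loss bound:
print(pick_frequency(instr=50_000_000, misses=80_000, mem_time_s=0.5))   # -> 800
```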

3.5.7. Dynamic Voltage Scaling In Multiple Clock Domain Architectures. A Globally Asynchronous, Locally Synchronous (GALS) chip is split into multiple domains, each of which has its own local clock. Each domain is synchronous with respect to its clock, but the different domains are mutually asynchronous in that they may run at different clock frequencies. This design has three advantages. First, the clocks that power different domains are able to distribute their signals over smaller areas, thus reducing clock skew. Second, the effects of changing the clock frequency are felt less outside the given domain. This is an important advantage that GALS has over conventional CPUs. When a conventional CPU scales its clock frequency, all of the hardware structures that receive the clock signal slow down, causing widespread performance loss. In GALS, on the other hand, one can choose to slow down some parts of the circuit while allowing others to operate at their maximum frequencies. This creates more opportunities for saving energy. In compute bound applications, for example, a GALS system can keep the critical paths of the circuit running as fast as possible but slow down other parts of the circuit.

Nevertheless, it is nontrivial to create voltage scaling algorithms for GALS processors. First, these processors require special mechanisms allowing the different domains to communicate and synchronize. Second, it is unclear how to split the processor into domains. Because communication energy can be large, domains must be chosen so that there is minimal communication between them. One possibility is to cluster similar functionalities into the same domain. In a GALS processor developed by Semeraro et al. [2002], all the integer execution units are in the same domain. If instead the integer issue queue were in a different domain than the integer ALU, these two units would waste more energy communicating since they are often accessed together.

Another factor that makes it hard to design algorithms is that one needs to consider possible interdependencies between domains. For example, if one domain is slowed down, data may take more time to move through it and reach other domains. This may impact the overall congestion in the circuit and affect performance. A DVS algorithm would need to be aware of these global effects.

Numerous researchers have tackled the complex problem of controlling GALS systems [Magklis et al. 2003; Marculescu 2004; Dropsho et al. 2004; Semeraro et al. 2002; Iyer and Marculescu 2002b]. We follow with two recent examples, the first being an interval-based heuristic and the second being a compiler-based heuristic.

Semeraro et al. [2002] developed one of the first runtime algorithms for dynamic voltage scaling in a GALS system. They observed that each domain has an instruction issue queue that indicates how fast instructions are flowing through the pipeline, thus giving insight into the workload. They propose to monitor the issue queue utilization over fixed length intervals and to respond swiftly to sudden changes by rapidly increasing or decreasing the domain's clock frequency, but to decrease the clock frequency very gradually when there are few observable changes in the workload.

One of the first compiler-based algorithms for controlling GALS systems is by Magklis et al. [2003]. Their main algorithm is called the shaker algorithm due to its resemblance to a salt shaker. It repeatedly traverses the Directed Acyclic Graph (DAG) representation of a program from root to leaves and from leaves to root, searching for operations whose energy dissipation exceeds a threshold. It stretches out the workload of such operations until the energy dissipation is below a given threshold, repeating this process until either no more operations dissipate excessive energy or all of the operations have used up their slack. The information provided by this algorithm is then used by a compiler to select clock frequencies for different GALS domains in different program regions.

3.6. Resource Hibernation

Though switching activity causes power consumption, computer components consume power even when idle. Resource hibernation techniques power down components during idle periods. A variety of heuristics have been developed to manage the power settings of different devices such as disk drives, network interfaces, and displays. We now give examples of these techniques.

3.6.1. Disk Drives. Disk drives can account for a large percentage of a system's total energy dissipation, and many techniques have been developed to attack this problem [Gurumurthi et al. 2003; Li et al. 2004]. Most of the power loss stems from the rotating platter during disk activity and idle periods. To save energy, operating systems tend to pause disk rotation after a fixed period of inactivity and restart it during the next disk access. The decision of when to spin down an idle disk involves tradeoffs between power and performance. High idle time thresholds lead to more energy loss but better performance since the disk need not restart to process requests. Low thresholds allow a disk to spin down sooner when idle but increase the probability of an immediate restart to process a spontaneous request. These immediate restarts, called bumps, waste energy and cause delays.


Fig. 17. Predictive disk management.

One of the techniques used at the operating system level is Predictive Dynamic Threshold Adjustment. This technique adjusts at runtime the idleness threshold, or acceptable window of time before an idle disk can be spun down. If a disk immediately spins up after spinning down, this technique increases the idleness threshold. Otherwise, it decreases the threshold. This is an example of a technique that uses past disk usage information to predict future usage patterns. These techniques are less effective at dealing with irregular workloads, as Figure 17 illustrates. Part (a) depicts a disk usage pattern in which active phases are abruptly followed by idle phases, and the idle phases keep getting longer. Dynamic threshold adjustment (b) abruptly restarts the disk every time it spins it down and keeps the disk activated during long idle phases. To improve on this technique, an operating system can cluster disk requests (c), thereby lengthening idle periods (d) when it can spin down the disk. To create opportunities for clustering, the algorithm of Papathanasiou and Scott [2002] delays processing nonurgent disk requests and reduces the number of remaining requests by aggressively prefetching disk data into memory. The delayed requests accumulate in a queue for later processing in a more voluminous transaction. Thus, the disk may remain uninterrupted in a low-power, inactive state for longer periods. However, dependencies between disk requests can sometimes make it impossible to apply clustering.

Moreover, idle periods for spinning down the disk are not always available. For example, large scale servers are continuously processing heavy workloads. Gurumurthi et al. [2003] have recently investigated an alternative strategy for server environments called Dynamic RPM control (DRPM). Instead of spinning down the disk fully, DRPM modulates the rotational speed (in rotations per minute) according to workload demands, thereby eliminating the overheads of transitioning between low power and fully active states. As part of their study, Gurumurthi et al. develop power models for a disk drive that supports 15 different RPM levels. Using these models they propose a simple heuristic for RPM modulation. The heuristic involves some cooperation between the disks and a controller that oversees the disks. The controller initially defines a range of RPM levels for each disk; the disks modulate their RPMs within this range based on their demands.

They periodically monitor their input queues, and whenever the number of requests exceeds a threshold, they increase their RPM settings by one level. Meanwhile, the controller oversees the response times in the system. When the response time is above a threshold, this is a sign of performance degradation, and the controller immediately requests the disks to switch to full power. When, on the other hand, the response times are sufficiently low, this is a sign that the disks could be slowed down further, and the controller increases the minimum RPM threshold setting.

3.6.2. Network Interfaces. Network interfaces such as wireless cards present a unique challenge for resource hibernation techniques because the naive solution of shutting them off would disconnect hosts from other devices with which they may have to communicate. Thus power management strategies also include protocols for synchronizing the communication of these devices with other devices in the network. Kravets et al. [1998] have developed a heuristic for doing this at the operating system level. One of its key features is that it uses timers to track the idleness of devices. If a device should be transmitting but its idle timer expires, it enters a listening mode or goes to sleep. To decide how long devices should remain idle before shifting into listening mode, Kravets et al. apply an adaptive threshold algorithm. When the algorithm detects communication activity, it decreases the acceptable hibernation time. When it detects idle periods, it increases the time.

Threshold-based schemes such as this one allow a network card to remain idle for an extended length of time before shutting down. Hom and Kremer [2003] have argued that compilers can shut down a card sooner using a model of future program behavior. Their compiler-based hibernation technique targets a simplified scenario where a mobile device's virtual memory resides on a server. Page faults require activating the device's wireless card so that pages can be sent from the server to the device. The technique identifies program regions where the card should hibernate based on the nature of array accesses in each region. If an accessed array is not in local memory, the card is activated to transfer it from the remote server. To determine what arrays are in memory during each region's execution, the compiler simulates a least-recently-used memory bank.

3.6.3. Displays. Displays are a major source of power consumption. Normally, an operating system dims the display after a sufficiently long interval with no user response. This heuristic may not coincide with a user's intentions.

Two examples of more advanced techniques that manage displays based on user activity are face-off and zoned backlighting. They have yet to be implemented in commercial systems.

Face-off [Dalton and Ellis 2003] is an experimental face recognition system on an IBM T21 Thinkpad running Red Hat Linux. It periodically photographs the monitor's perspective and detects a face through large regions of skin color. It then controls the display using the ACPI interface. Its developers tested it on a long network transfer during which the user looked away from the screen. Face-off saved more power than the default display manager. A drawback is the repetitive polling for face detection. For future work, the authors suggest detecting a user's presence through low-power heat microsensors. These sensors can trigger the face recognition algorithm when they detect a user.

Zoned backlighting [Flinn and Satyanarayanan 1999] targets future display systems that will allow software to selectively adjust the brightness of different regions of the screen based on usage patterns and battery drain. Researchers at Carnegie Mellon have proposed ideas for exploiting these displays. The underlying assumption is that users are interested in only a small part of the screen. One idea is to make foreground windows with keyboard and mouse focus brighter than


Fig. 18.Performance versus power. back-ground windows. Another is to scale compares the power dissipation and exe- images to allow most of the screen to dim cution time of doing two additions in paral- when the battery is low. lel (as in a VLIW architecture) (a) and do- ing them in succession (b). Doing two ad- ditions in parallel activates twice as many 3.7. Compiler-Level Power Management adders over a shorter period. It induces 3.7.1. What Compilers Can Do. There are higher peak resource usage but fewer exe- many ways a compiler can help reduce cution cycles and is better for performance. power regardless of whether a proces- To determine which scenario saves more sor explicitly supports software-controlled energy, one needs more information such power reduction. Aside from generating as the peak power increase in scenario (a), code that reconfigures hardware units the execution cycle increase in scenario or activates power reduction mechanisms (b), and the total energy dissipation per that we have seen, compilers can ap- cycle for (a) and (b). For ease of illustra- ply common performance-oriented opti- tion, we have depicted the peak power in mizations that also save energy by re- (a) to be twice that of (b), but all of these ducing the execution time, optimizations parameters may vary with respect to the such as Common Subexpression Elimi- architecture. nation, Partial Redundancy Elimination, Compilers can reduce memory accesses and Strength Reduction. However, some to reduce energy dissipated in these ac- performance optimizations increase code cesses. One way to reduce memory ac- size or parallelism, sometimes increas- cesses is to eliminate redundant load ing resource usage and peak power dissi- and store operations. Another is to keep pation. Examples include Loop Unrolling data as close as possible to the processor, and Software Pipelining. Researchers preferably in the registers and lower-level [Givargis et al. 2001] have developed mod- caches, using aggressive register alloca- els for relating performance and power, tion techniques and optimizations improv- but these models are relative to specific ar- ing cache usage. Several loop transforma- chitectures. There is no fixed relationship tions alter data traversal patterns to make between performance and power across all better use of the cache. Examples include architectures and applications. Figure 18 Loop Interchange, Loop Tiling, and Loop


Fusion. The assignment of data to memory While Palm et al. [2002] restrict their locations also influences how long data re- studies to local and remote compilation, mains in the cache. Though all these tech- Chen et al. [2003] also examine the ben- niques reduce power consumption along efits of executing compiled Java code re- one or more dimensions, they might be motely.Their testbed consists of a 750MHz suboptimal along others. For example, to SPARC workstation server and a client exploit the full bandwidth a banked mem- handheld running a microSPARC-IIep ory architecture provides, one may need embedded processor. The programmer ini- to disperse data across multiple memory tially annotates a set of candidate meth- banks at the expense of cache locality and ods that the client may execute remotely. energy. Before calling any of these methods, the client decides where to compile and exe- cute the method based on computation and 3.7.2. Remote Compilation and Remote Exe- communication costs. It also selects opti- cution. A new area of research for compil- mization levels for compilation. If remote ers involves compiling and executing ap- execution is favorable, the client transfers plications jointly between mobile devices all method data to the server via the object and more powerful servers to reduce exe- serialization interface. The server uses re- cution time and increase battery life. This flection to invoke the method and serial- is an area frought with challenges such ization to send back the results. Chen et al. as the decision of what program sections evaluate different compilation and execu- to compile or execute remotely. Other is- tion strategies in this framework through sues include application partitioning be- simulation and analytical models. They tween multiple servers and handhelds, ap- consider seven partitioning strategies, five plication migration, and fault tolerance. of which are static and two of which are Though researchers have yet to address dynamic. While the static strategies fix all these issues, a number of studies pro- the partitioning decision for all methods vide valuable insights into the benefits of of a program, the dynamic ones decide the partitioning. best strategy for each method at call time. Recent studies of low-power Java ap- Chen et al. find the best static strategy to plication partitioning are by Palm et al. vary depending on input size and commu- [2002] and Chen et al. [2003]. Palm et al. nication channel quality.When the quality compare the energy costs of remote and is low, the client’s network interface must local compilation and optimization. They send data at a higher transmission power examine four scenarios in which a hand- to reach the server. For large inputs and held downloads code from a server and good channel quality, static remote execu- executes it locally. In the first two sce- tion prevails over other static strategies. narios, the handheld downloads bytecode, Overall, Chen et al. find the highest en- compiles it, and executes it. In the re- ergy savings to be achieved by a dynamic maining scenarios, the server compiles the strategy that decides where to compile and bytecode and the handheld downloads the execute each method. native codestream. Palm et al. find that, The Proxy Virtual Machine [Venkat- when a method’s compilation time exceeds achalam et al. 2003] reduces the energy the execution time, remote compilation is consumption of mobile devices by combin- favorable to local compilation. 
However, if ing application partitioning with dynamic no optimizations are used, remote compi- optimization. The framework (Figure 19) lation can be more expensive due to the positions a powerful server infrastructure, cost of downloading code from the server. the proxy, between mobile devices and the Overall, the largest energy savings are ob- Internet. The proxy includes a just-in- tained by compiling and optimizing code time compiler and bytecode translator. A remotely but sending results to the hand- high bandwidth, low-latency secure wire- held exactly when needed to minimize idle less connection mediates communication time. between the proxy and mobile devices in
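The partitioning decision that these studies evaluate boils down to comparing the energy of computing locally against the energy of shipping inputs and results over the wireless link and waiting for the server. The sketch below is a deliberately simplified cost model with invented constants; it is not the model used by Palm et al. or Chen et al.

```python
# Toy cost model for deciding whether to execute a method locally or remotely.
# Energy constants (joules per cycle, joules per byte) are illustrative only.

E_CPU_PER_CYCLE = 1.0e-9     # local execution cost on the handheld
E_TX_PER_BYTE   = 4.0e-6     # radio transmit cost
E_RX_PER_BYTE   = 2.0e-6     # radio receive cost
E_IDLE_PER_SEC  = 0.05       # handheld idle power while waiting for the server

def remote_is_cheaper(cycles, args_bytes, result_bytes, server_time_s):
    local  = cycles * E_CPU_PER_CYCLE
    remote = (args_bytes * E_TX_PER_BYTE
              + result_bytes * E_RX_PER_BYTE
              + server_time_s * E_IDLE_PER_SEC)
    return remote < local

# A heavy method with small inputs and outputs favors remote execution:
print(remote_is_cheaper(cycles=2_000_000_000, args_bytes=10_000,
                        result_bytes=2_000, server_time_s=0.5))   # True
```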


Fig. 19. The proxy VM framework. the vicinity.Users can request Internet ap- be ideal for specialized embedded systems plications as they normally would through where the set of programs that will ex- aWeb browser. The proxy intercepts these ecute is determined ahead of time and requests and reissues them to a remote where execution behavior is mostly pre- Internet server. The server sends the dictable, real systems tend to be more com- proxy the application in mobile code for- plex. The events occurring within them mat. The proxy verifies the application, continuously change and compete for pro- compiles it to native code, and sends part cessor and memory resources. For exam- of the generated code to the target. The re- ple, context switches abruptly shift exe- mainder of the code executes on the proxy cution from a region of one program to a itself to significantly reduce energy con- completely different region of another pro- sumption on the mobile device. gram. If a compiler had generated code so that the two regions put a device into different power mode settings, then there 3.7.3. The Limitations of Statically Optimiz- will be an overhead to transition between ing Compilers. The drawback of compiler- these settings when execution flows from based approaches is that a compiler’s one region to another. view is usually limited to the programs it is compiling. This raises two problems. First, compilers have incomplete informa- 3.7.4. Dynamic Compilation. Dynamic tion about how a program will actually be- compilation addresses some of these have, information such as the control flow problems by introducing a feedback paths and loop iterations that will be exe- loop. A program is compiled but is then cuted. Hence, statically optimizing compil- also monitored as it executes. As the ers often rely on profiling data that is col- behavior of the program changes, possibly lected from a program prior to execution, along with other changes in the runtime to determine what optimizations should environment (e.g., resource levels), the be applied in different program regions. program is recompiled to adapt to these A program’s actual runtime behavior may changes. Since the program is continu- diverge from its behavior in these “simu- ously recompiled in response to runtime lation runs”, thus leading to suboptimal feedback, its code is of a significantly decisions. higher quality than it would have been if The second problem is that compiler op- it had been generated by a static compiler. timizations tend to treat programs as if There are a number of scenarios where they exist in a vacuum. While this may continuous compilation can be effective.


For example, as battery capacity de- different performance counter events si- creases, a continuous compiler can ap- multaneously. For each of the core compo- ply more aggressive transformations that nents of this processor Isci and Martonosi trade the quality of data output for re- have developed power models that relate duced power consumption. (For example, their power consumption to their usage a compiler can replace expensive floating statistics. In their actual setup, they take point operations with integer operations, two kinds of power readings in parallel. or generate code that dims the display.) One is based on current measurements However, there are tradeoffs. For example, and the other is based on the performance the cost function for a dynamic compiler counters. These readings give the total needs to weigh the overhead of recompil- power as well as a breakdown of power ing a program with the energy that can be consumed among the various components. saved. Moreover, there is the issue of how Isci and Martonosi find that their perfor- to profile an executing program in a mini- mance counter-based approach is nearly mally invasive, low overhead way. as accurate as the current measurements While there is a large body of work and is also able to distinguish between dif- in dynamic compilation for perfor- ferent phases in program execution. mance [Kistler and Franz 2001; Kistler and Franz 2003], dynamic recompilation 3.8. Application-Level Power Management for power is still a young field with many open problems. One of the earliest Researchers have been exploring how to commercial dynamic compilation systems give applications a larger role in power is Transmeta’s Crusoe processor [Trans- management decisions. Most of the recent meta Corporation 2003; Transmeta work has two goals. The first is to de- Corporation 2001] which we will discuss velop techniques that enable applications in Section 3.10. One of the earliest aca- to adapt to their runtime environment. demic studies of power-aware dynamic The second is to develop interfaces allow- recompilation was done by Unnikrishnan ing applications to provide hints to lower et al. [2002]. Their idea is to instrument layers of the stack (i.e., operating systems, critical program regions with sensitiv- hardware) and likewise exposing useful ity lists that detect changes in various information from the lower layers to appli- energy budgets and hot-swap optimized cations. Although these are two separate versions of these regions with the original goals, the techniques for achieving them versions based on these changes. Each can overlap. optimized version is precomputed ahead of time, using offline energy estimation 3.8.1. Application Transformations and Adap- techniques. The authors target a banked tations. As a starting point, some re- memory architecture and focus on loop searchers [Tan et al. 2003] have adopted transformations such as tiling, unrolling, an “architecture-centric” view of applica- and interchange. Their implementation tions that allows for some high-level trans- is based on Dyninst, a tool that allows formations. These researchers claim that program regions to be patched at runtime. an application’s architecture consists of Though dynamic compilation for power fundamental elements that comprise all is still in its infancy, there is a growing applications, namely processes, event han- body of work in runtime power monitor- dlers, and communication mechanisms. ing. 
The work that is most relevant for Power is consumed when these various el- dynamic compilation is that of Isci and ements interact during activities such as Martonosi [2003]. These researchers have context switches and interprocess commu- developed techniques for profiling the nication. power behavior of programs at runtime They propose a low-power applica- using information from hardware perfor- tion design methodology. Initially, an ap- mance counters. They target a Pentium 4 plication is represented as a software processor which allows one to sample 18 architecture graph (SAG) [Tan et al. 2003]

ACM Computing Surveys, Vol. 37, No. 3, September 2005. Power Reduction Techniques for Microprocessor Systems 225 which captures how it has been built out ful. Originally, Odyssey supported four ap- of processes and events. It is then run plications, a video player, speech recog- through a simulator to measure its base nizer, map viewer, and Web browser. The energy consumption which is then com- video player, for instance, would down- pared to its energy consumption when load images from a server storing multi- transformations are applied to the SAG, ple copies at different compression levels. transformations that include merging pro- When bandwidth is low, it would down- cesses to reduce interprocess communica- load the compressed versions of images tion, replacing expensive IPC mechanisms to transfer less data over the network. As by cheaper mechanisms, and migrating battery resources become scarce, it would computations between processes. also reduce the size of the images to con- To determine the optimal set of transfor- serve energy. mations, the researchers use a greedy ap- One example of an operating system proach. They keep applying transforma- that assists application adaptation is tions that reduce energy until no more ECOSystem [Zeng et al. 2002]. It treats transformations remain to be applied. energy as a commodity assigned to the dif- This results in an optimized version of the ferent resources in a system. The idea is same application that has a more energy to allow resources to operate so long as efficient mix of processes, events, and com- they have enough of the commodity. Thus munication mechanisms. resource management in ECOSystem re- Sachs et al. [2003] have explored a dif- lies on the notion of “currentcy”, an ab- ferent kind of adaptation that involves straction that models the power resource trading the accuracy of computations for as a monetary unit. Essentially, the dif- reduced energy consumption. They pro- ferent resources in a computer (e.g., ALU, pose a video encoder that allows one to memory) have prices for using them, and vary its compression efficiency by selec- applications pay these prices with cur- tively skipping the motion search and dis- rentcy. Periodically, the operating system crete cosine transform phases of the en- distributes a fixed amount of currentcy coding algorithm. The extent to which it among applications, using the desired bat- skips these phases depends on parame- tery drain rate as the basis for deciding on ters chosen by the adaptation algorithm. how much to distribute. Applications can The target architecture is a processor with use specific resources only if they can pay support for DVS and architectural adapta- for them. As an application executes, it ex- tions (e.g., configurable caches). Two algo- pends its currentcy. The application is in- rithms work side by side. One algorithm terrupted once it depletes its currentcy. tunes the hardware parameters at the start of each frame, while another tunes the parameters of the video encoder while 3.8.2. Application Hints. In one of the ear- a frame is being processed. liest works on application hints, Pereira Yet another form of adaptation involves et al. [2002] propose a software architec- reducing energy consumption by trading ture containing two layers of API calls, off fidelity or the quality of data presented one allowing applications to communicate to users. This can take many forms. 
One with the operating system, and the other example of a fidelity reduction system allowing the operating system to com- is Odyssey [Flinn and Satyanarayanan municate with the underlying hardware. 1999]. Odyssey is an operating system The API layer that is exposed to appli- and execution framework for multime- cations allows applications to drop hints dia and Web applications. It continuously to the operating system such as the start monitors the resources used by applica- times, deadlines, and expected execution tions and alerts applications whose re- times of tasks. Using these hints provided sources fall below requested levels. In by the application, the operating system turn, the applications lower their qual- will have a better understanding of the ity of service until resources are plenti- possible future behavior of programs and

ACM Computing Surveys, Vol. 37, No. 3, September 2005. 226 V. Venkatachalam and M. Franz will thus be able to make better DVS popular is how to develop holistic ap- decisions. proaches that integrate information from Similar to this is the work by Anand multiple levels (e.g., compiler, OS, hard- et al. [2004] which considers a setting ware) into power management decisions. where applications adapt by deciding There are already a number of systems be- when to access different I/O devices based ing developed to explore cross-layer adap- on the relative costs of accessing these tations. We give four examples of such sys- devices. For example, it may be expen- tems. sive to access data resident on a disk Forge [Mohapatra et al. 2003] is an in- if the disk is powered down and would tegrated power management framework take a long time to restart. In such cases, for networked multimedia applications. an application may choose instead to ac- Forge targets a typical scenario where cess the data through the network. To aid users request video streams for their in these adaptations, Anand et al. pro- handheld devices and these requests are pose an API layer that gives applications filtered by a remote proxy that transcodes control over all I/O devices but abstracts the streams and transmits them to the away the details of these devices. The users at the most energy efficient QoS API functions as a giant regulating dial levels. that applications can tune to specify their Forge integrates multiple levels of adap- desired power/performance tradeoffs and tations. The hardware level contains a fre- which transparently manages all the de- quency and voltage scaling interface, a vices to achieve these tradeoffs. The API network card interface (allowing the net- also give applications more fine-grained work card to be powered down when idle), control over devices they may want to ac- as well as interfaces allowing various ar- cess. One of its unique features is that it chitectural parameters (e.g., cache ways, allows applications to issue ghost hints: if register file sizes) to be tuned. A level an application chose not to access a de- above hardware are the operating system vice that was deactivated but would have and compiler, which control the architec- saved more energy by accessing that de- tural knobs, and a level above the oper- vice if it had been activated, then it can ating system is a distributed middleware send the device a message to this effect. frame-work, part of which resides on the After a few of these ghost hints, the de- handheld devices and part of which re- vice automatically becomes activated, an- sides on the remote proxy.The middleware ticipating that the application will need to on each handheld device monitors the en- use it. ergy statistics of the device (through the Heath et al. [2004] propose similar work OS), and sends feedback to the middle- to increase opportunities for spinning ware on the proxy. The middleware on the down idle hard disks and saving energy. In proxy uses this feedback to decide how to their framework, a compiler performs var- regulate network traffic and at what QoS ious code transformations to cluster disk levels to stream requested video feeds. accesses and then inserts hints that tell Different heuristics are used for differ- the operating system how long these clus- ent adaptations. To determine the optimal tered disk accesses will take. 
The operat- cache parameters for a video feed, one sim- ing system is then able to serve these disk ulates the feed offline at each QoS level requests in a single batch and power down for all possible variations of cache size and the disk for longer periods of time. associativity and determines the most en- ergy optimal configuration. This configu- ration is automatically chosen when the 3.9. Cross-Layer Adaptations feed executes. The cache configuration, in Because power consumption depends on turn, affects how long it takes to decode decisions spanning the entire stack from each video frame. If the decoding time is transistors to applications, a research less than the frame’s deadline, then the question that is becoming increasingly DVS algorithm slows down the frame to

Meanwhile, on the proxy, the middleware layer continually receives requests for video feeds and information about the resource levels on various devices. When it receives a request from a user on a handheld device, it checks whether there is enough remaining energy on the device allowing it to receive the video feed at an acceptable quality. If there is, then it transmits the feed at the highest quality that does not exceed the energy constraints of the device. Otherwise, it rejects the user's request or sends the feed at a lower quality. When the proxy sends a video feed back to the users, it transmits the feed in bursts of packets, allowing the users' network cards to be shut down over longer idle periods to save energy.

Grace [Yuan and Nahrstedt 2003] is another cross-layer adaptation framework that attempts to integrate dynamic voltage scaling, power-aware task scheduling, and QoS setting. Its target applications are real-time multimedia tasks with fixed periods and deadlines. Grace applies two levels of adaptations, global and local. Global adaptations respond to larger resource variations and are applied less frequently because they are more expensive. These adaptations are controlled by a central coordinator that continuously monitors battery levels and application workloads and responds to widespread changes in these variables. Local adaptations, on the other hand, respond to smaller workload variations within a task. These adaptations are controlled by local adaptors. There are three local adaptors, one to set the CPU frequency, another to schedule tasks, and a third to adapt QoS parameters. Major changes in the runtime environment, such as low battery levels or wide variations in workload, trigger the central coordinator to decide upon a new set of global adaptations. The coordinator chooses the most energy-optimal adaptations by solving a constrained optimization problem. It then broadcasts its decision to the local adaptors, which implement these changes but are free to adjust these initial settings within the execution of specific tasks.
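The division of labor between Grace's coordinator and its local adaptors can be pictured as the control loop below: a large change in battery level triggers an expensive global decision, while each period a cheap local adaptor nudges the CPU speed toward the current task's demand. The thresholds, the canned settings, and the greedy "global" choice are invented; Grace itself formulates the global step as a constrained optimization problem.

```c
#include <math.h>
#include <stdio.h>

/* Invented global setting chosen by the coordinator. */
struct global_setting { double cpu_mhz; int qos_level; };

static struct global_setting current = { 600.0, 2 };

/* Expensive, infrequent global adaptation: here just a canned choice per
 * battery band; Grace would solve a constrained optimization instead.   */
static void global_adapt(double battery_pct)
{
    current.cpu_mhz   = battery_pct > 50.0 ? 600.0 : battery_pct > 20.0 ? 400.0 : 200.0;
    current.qos_level = battery_pct > 20.0 ? 2 : 1;
    printf("global: %.0f MHz, QoS level %d\n", current.cpu_mhz, current.qos_level);
}

/* Cheap, frequent local adaptation: track one task's recent demand
 * without exceeding the globally chosen ceiling.                        */
static void local_adapt(double demanded_mhz)
{
    printf("local : run task at %.0f MHz\n", fmin(current.cpu_mhz, demanded_mhz));
}

int main(void)
{
    double battery = 80.0, last_global = battery;
    for (int period = 0; period < 5; period++) {
        battery -= 15.0;                          /* pretend the battery drains  */
        if (fabs(battery - last_global) > 25.0) { /* widespread change: go global */
            global_adapt(battery);
            last_global = battery;
        }
        local_adapt(300.0 + 50.0 * period);       /* small per-period variation  */
    }
    return 0;
}
```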
Another multilayer framework for multimedia applications has been developed by Fei et al. [2004]. Their framework consists of four layers divided into user space and system space. The system space consists of the target platform and the operating system running on top of it. Directly above the operating system in user space is a middleware layer that decides how to apply adaptations, guided by information it receives from the other layers. All applications execute on top of the middleware coordinator, which is responsible for deciding when to admit different applications into the system and how to select their QoS levels. When applications first enter the system, they store their meta-information inside the middleware layer. This includes information about the QoS levels they offer and the energy consumption for each of these levels. The middleware layer, in turn, monitors battery levels using counters inside the operating system. Based on the battery levels, the middleware decides on the most energy efficient QoS setting for each application. It then admits applications into the system according to user-defined priorities.

A fourth example is the work of AbouGhazaleh et al. [2003], which explores how compilers and operating systems can interact to save energy. This work targets real-time applications with fixed deadlines and worst-case execution times. In this approach, the compiler instruments specific program regions with code that saves the remaining worst-case execution cycles into a register. The operating system periodically reads this register and adjusts the speed of the processor so that the task finishes by its worst-case execution time.
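A minimal sketch of that compiler/OS split is shown below: the "compiler-inserted" call updates a shared cycle budget standing in for the register, and a periodic "OS" routine rescales the clock so that the remaining worst case still fits in the time left before the deadline. The variable names, the 800 MHz ceiling, and the cycle counts are illustrative assumptions, not values from AbouGhazaleh et al.

```c
#include <stdio.h>

/* Stand-in for the register written by compiler-inserted instrumentation. */
static volatile long remaining_wc_cycles;

/* "Compiler-inserted" code at the end of a program region: subtract that
 * region's worst-case cycle count from the remaining budget.              */
static void region_done(long region_wc_cycles)
{
    remaining_wc_cycles -= region_wc_cycles;
}

/* Periodic "OS" handler: choose the lowest speed at which the remaining
 * worst case still finishes within the time left before the deadline.     */
static double os_adjust_speed_hz(double time_left_s, double f_max_hz)
{
    double f = (double)remaining_wc_cycles / time_left_s;
    return f > f_max_hz ? f_max_hz : f;
}

int main(void)
{
    remaining_wc_cycles = 50000000L;   /* task's total worst case: 50 Mcycles */
    double deadline_s = 0.100;

    region_done(30000000L);            /* region 1: 30 Mcycles worst case...  */
    double elapsed_s = 0.020;          /* ...but it actually ran in 20 ms     */

    printf("new speed: %.0f MHz\n",
           os_adjust_speed_hz(deadline_s - elapsed_s, 800e6) / 1e6);
    return 0;
}
```

Because the first region finished faster than its worst case, the remaining 20 Mcycles can be spread over the 80 ms left, and the processor can slow to 250 MHz while still meeting the worst-case finish time.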

3.10. Commercial Systems

To understand what power management techniques industry is adopting, we examine the low-power techniques used in four widely used processors, the Pentium 4, Pentium M, the PXA27x, and Transmeta Crusoe. We then discuss three power management strategies by IBM, ARM, and National Semiconductor that are rapidly gaining in importance.

3.10.1. The Pentium 4 Processor. Though its goal is high performance, the Pentium 4 processor also contains features to manage its power consumption [Gunther et al. 2001; Hinton et al. 2004]. One of Intel's goals in including these features was to prevent the Pentium's internal temperatures from becoming dangerously high due to the increased power dissipation. To meet this goal, the designers included a thermal detection and response mechanism which inhibits the processor clock whenever the observed temperature exceeds a safe threshold. To monitor temperature variations, the processor features a diode-based thermal sensor. The sensor sits at the hottest area of the chip and measures temperature via voltage drops across the diode. Whenever the temperature increases into a danger zone, the sensor issues a STOPCLOCK request, causing the main clock signal to be inhibited from reaching most of the processor until the temperature is no longer in the danger zone. This is to guarantee response time while the chip is cooling. The temperature sensor also provides temperature information to higher levels of software through output signals (some of which are in registers), allowing the compiler or operating system, for instance, to activate other techniques in response to the high temperatures.

The Pentium 4 also supports the low-power operating states defined by the Advanced Configuration and Power Interface (ACPI) specification, allowing software to control the processor power modes. In particular, it features a model-specific register allowing software to influence the processor clock. In addition to stopping the clock, the Pentium 4 features the Intel Basic Speedstep Technology, which allows two settings for the processor clock frequency and voltage, a high setting for performance and a low setting for power. The high setting is normally used when the computer is connected to a wall outlet, and the low setting is normally used when the computer is running on batteries, as in the case of a laptop.
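At its core, the thermal response is a feedback loop with hysteresis: read the sensor, gate the clock when the die is too hot, and release it once the die has cooled. The sketch below models that loop; the trip points, the toy sensor, and the stop_clock() hook are assumptions made for illustration, not the Pentium 4's actual thermal interface.

```c
#include <stdio.h>

#define TEMP_THROTTLE_C 85.0   /* illustrative trip point             */
#define TEMP_RESUME_C   75.0   /* hysteresis: resume only below this  */

/* Toy platform model: the die heats while the clock runs and cools when
 * the clock is gated.  A real system would read a thermal diode instead. */
static int clock_stopped = 0;
static double die_temp_c = 70.0;

static double read_die_temp_c(void)
{
    die_temp_c += clock_stopped ? -4.0 : 3.0;
    return die_temp_c;
}

static void stop_clock(int stopped)
{
    clock_stopped = stopped;
    printf("clock %s\n", stopped ? "gated (STOPCLOCK)" : "running");
}

/* One tick of the thermal monitor. */
static void thermal_monitor_tick(void)
{
    static int throttled = 0;
    double t = read_die_temp_c();

    if (!throttled && t > TEMP_THROTTLE_C) {
        stop_clock(1);              /* too hot: inhibit the main clock */
        throttled = 1;
    } else if (throttled && t < TEMP_RESUME_C) {
        stop_clock(0);              /* cooled down: release the clock  */
        throttled = 0;
    }
    printf("temperature: %.0f C\n", t);
}

int main(void)
{
    for (int i = 0; i < 12; i++)
        thermal_monitor_tick();
    return 0;
}
```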
3.10.2. The Pentium M Processor. The Pentium M is the fruit of Intel's efforts to bring the Pentium 4 to the mobile domain. It carefully balances performance-enhancing features with several power-saving features that increase the battery lifetime [Genossar and Shamir 2003; Gochman et al. 2003]. It uses three main strategies to reduce dynamic power consumption: reducing the total instructions and micro-operations executed, reducing the switching activity in the circuit, and reducing the energy dissipated per transistor switch.

To save energy, the Pentium M integrates several techniques that reduce the total switching activity. These include hardware for predicting idle units and inhibiting their clock signals, buses whose components are activated only when data needs to be transferred, and a technique called execution stacking, which clusters units that perform similar functions into similar regions so that the processor can selectively activate the parts of the circuit that will be needed by an instruction.

To reduce the static power dissipation, the Pentium M incorporates low-leakage transistors in the caches. To further reduce both dynamic and leakage energy throughout the processor, the Pentium M supports an enhanced version of the Intel SpeedStep technology which, unlike its predecessor, allows the processor to transition between six different frequency and voltage settings.

3.10.3. The Intel PXA27x Processors. The Intel PXA27x processors used in many wireless handheld devices feature the Intel Wireless Speedstep power manager [Intel Corporation 2004]. This power manager uses memory boundedness as the criterion for deciding how to manage the processor's power modes. It receives information from an idle profiler, which monitors system idle time, and a performance profiler, which monitors the processor's built-in performance counters to gather statistics pertaining to cache and TLB misses, instructions executed, and processor stalls. Using this information, the power manager determines how memory bounded the workload is and decides how to adjust the processor's power modes as well as its clock frequency and voltage.
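The underlying arithmetic is simple: the larger the fraction of cycles spent stalled on memory, the less a lower clock frequency hurts overall performance, so the workload is mapped to a slower operating point as memory-boundedness rises. The counter fields, the thresholds, and the frequency table below are assumptions made for this sketch rather than Intel's published policy; the listed frequencies are merely representative PXA27x-class operating points.

```c
#include <stdio.h>

/* Counter snapshot over one profiling interval (values are illustrative). */
struct perf_sample {
    unsigned long long cycles;        /* total core cycles                   */
    unsigned long long stall_cycles;  /* cycles stalled on cache/TLB misses  */
};

/* Memory-boundedness = fraction of cycles spent waiting on memory. */
static double mem_boundedness(const struct perf_sample *s)
{
    return s->cycles ? (double)s->stall_cycles / (double)s->cycles : 0.0;
}

/* Map boundedness to one of a few operating points (MHz). */
static int choose_frequency_mhz(double boundedness)
{
    if (boundedness > 0.60) return 208;   /* mostly waiting on memory: run slow */
    if (boundedness > 0.30) return 312;
    return 624;                           /* compute bound: run fast            */
}

int main(void)
{
    struct perf_sample s = { 100000000ULL, 45000000ULL };
    double b = mem_boundedness(&s);
    printf("memory-boundedness %.2f -> %d MHz\n", b, choose_frequency_mhz(b));
    return 0;
}
```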

3.10.4. The Transmeta Crusoe Processor. Transmeta's Crusoe processor [Transmeta Corporation 2001, 2003] saves power not only by means of its LongRun Power Manager but also through its overall design, which shifts many of the complexities in instruction execution from hardware into software. The Crusoe relies on a code morphing engine, a layer of software that interprets or compiles programs into the processor's native VLIW instructions, saving the generated code in a Translation Cache so that it can be reused. This layer of software is also responsible for monitoring executing programs for hotspots and reoptimizing code on the fly. As a result, features that are typically implemented in hardware (e.g., out-of-order execution) are instead implemented in software. This reduces the total on-chip real estate and its accompanying power dissipation. To further reduce the power dissipation, Crusoe's LongRun modulates the clock frequency and voltage according to workload demands. It identifies idle periods of the operating system and scales the clock frequency and voltage accordingly.

Transmeta's later generation Efficeon processor contains enhanced versions of the code morphing engine and the LongRun Power Manager. Some news articles claim that the new LongRun includes techniques to reduce leakage, but the details are proprietary.

3.10.5. IBM Dynamic Power Management. IBM has developed an extensible operating system module [Brock and Rajamani 2003] that allows power management policies to be developed for a wide range of embedded devices. For portability, this module abstracts away from the underlying architecture and provides a simple interface allowing designers to specify device-specific parameters (e.g., available clock frequencies) and to integrate their own policies in a "plug-and-play" manner. The main idea is to model the underlying hardware as a state machine composed of different operating points and operating states. The operating points are clock frequency and voltage settings, while the operating states are power mode settings such as active, idle, and deep sleep. Designers specify the operating points and states for their target architecture and, using this abstraction, write policies for transitioning between the states. IBM has experimented with a wide range of policies for the PowerPC 405LP, from simple policies that adjust the frequency based on whether the processor is idle or active, to more complex policies that integrate information about application workloads and deadlines.
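This state-machine abstraction maps naturally onto a table-driven sketch: the platform designer lists operating points and operating states, and a pluggable policy decides which state to enter. The struct layout, the sample points, and the trivial policy below are invented to illustrate the modeling idea; they are not IBM's actual module interface.

```c
#include <stdio.h>

/* An operating point: a clock frequency / supply voltage pair. */
struct op_point { int mhz; int millivolts; };

/* An operating state: a power mode, each tied here to one operating point. */
enum op_state { STATE_ACTIVE, STATE_IDLE, STATE_SLEEP, NUM_STATES };

/* Platform description supplied by the board designer (illustrative values). */
static const struct op_point points[] = { {266, 1800}, {133, 1300}, {33, 1000} };
static const int state_point[NUM_STATES] = { 0, 1, 2 };  /* state -> point index */

/* A pluggable policy: map a simple observation to an operating state.
 * A fancier policy could also consider workloads and deadlines.        */
static enum op_state simple_policy(int runqueue_len)
{
    return runqueue_len > 0 ? STATE_ACTIVE : STATE_IDLE;
}

static void enter_state(enum op_state s)
{
    const struct op_point *p = &points[state_point[s]];
    /* A real module would program the clock and voltage regulators here. */
    printf("state %d: %d MHz @ %d mV\n", (int)s, p->mhz, p->millivolts);
}

int main(void)
{
    enter_state(simple_policy(3));   /* busy: active state and fast point */
    enter_state(simple_policy(0));   /* idle: slower, lower-voltage point */
    return 0;
}
```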
3.10.6. Powerwise and Intelligent Energy Management. National Semiconductor and ARM Inc. have collaborated to develop a single end-to-end energy management system for embedded processors. Their system brings together two technologies, ARM's Intelligent Energy Management Software (IEM), which modulates the processor speed based on workloads, and National Semiconductor's Powerwise Adaptive Voltage Scaling (AVS), which dynamically adjusts the supply voltage for a given clock frequency based on variations in temperature and other ambient system effects.

The IEM software consists of a three-level decision hierarchy of policies for deciding how fast the processor should run. At the bottom is a baseline algorithm that selects the optimal frequency based on estimating future task workloads from past workloads measured over a sliding history window. At the top is an algorithm more suited for interactive and media applications, which considers how fast these applications would need to run to deliver smooth performance to users. Between these two layers is an interface allowing applications to communicate their workload requirements directly. Each of these three layers combines its performance prediction with a confidence rating, which indicates how to compare its decision with that of other layers. Whenever there is a context switch, the IEM software analyzes the decisions and confidence ratings of each layer to finally decide how fast to run the processor.
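A baseline layer of this kind can be sketched as a sliding average over recent utilization, paired with a confidence value that grows as the history fills; at a context switch, a more confident layer wins. The window size, the confidence heuristic, and the arbitration rule below are assumptions made for illustration; ARM does not publish IEM's internals at this level of detail.

```c
#include <stdio.h>

#define WINDOW 8

/* Sliding history of per-period utilization (busy fraction, 0..1). */
static double history[WINDOW];
static int hpos = 0, hcount = 0;

static void record_period(double utilization)
{
    history[hpos] = utilization;
    hpos = (hpos + 1) % WINDOW;
    if (hcount < WINDOW) hcount++;
}

/* Baseline layer: predict next-period demand as the window average,
 * with confidence growing as the window fills up.                     */
static void baseline_predict(double *demand, double *confidence)
{
    double sum = 0.0;
    for (int i = 0; i < hcount; i++) sum += history[i];
    *demand     = hcount ? sum / hcount : 1.0;
    *confidence = (double)hcount / WINDOW;
}

/* Arbitration at a context switch: a higher layer (e.g., an interactive-
 * performance heuristic) overrides the baseline if it is more confident. */
static double decide_speed(double top_demand, double top_confidence)
{
    double base_demand, base_confidence;
    baseline_predict(&base_demand, &base_confidence);
    return top_confidence > base_confidence ? top_demand : base_demand;
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        record_period(0.30 + 0.05 * i);
    /* The interactive layer asks for 80% speed with high confidence. */
    printf("run at %.0f%% of maximum speed\n", 100.0 * decide_speed(0.80, 0.90));
    return 0;
}
```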

The Powerwise Adaptive Voltage Scaling technology [Dhar et al. 2002] aims to save more energy than conventional voltage-scaled systems by choosing the minimum acceptable voltage for a given clock frequency. Conventional DVS processors determine the appropriate voltage for a clock frequency by referring to a table that is developed offline using a worst-case model of ambient hardware characteristics. As a result, the voltages chosen are often higher than they need to be. Powerwise gets around this problem by adding a feedback loop. It continually monitors variations in temperature and other ambient system effects through performance counters and uses this information to dynamically adjust the supply voltage for each clock frequency. This results in an additional 45% energy savings over conventional voltage scaling.
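The contrast with table-based voltage selection can be shown in a few lines: rather than programming a fixed worst-case voltage, the controller repeatedly reads a timing margin from a hardware monitor and nudges the supply up or down to keep that margin small but safe. The margin model, the step size, and the target band below are invented placeholders for the ambient-condition monitoring the text describes.

```c
#include <stdio.h>

#define V_MIN_MV 950
#define V_MAX_MV 1400

/* Hypothetical hardware monitor: measured timing slack (percent of the
 * clock period) at the current frequency and supply voltage.  This toy
 * model simply says "more voltage and less frequency give more slack". */
static double read_timing_margin_pct(int mhz, int mv)
{
    return (mv - 900) / 10.0 - mhz / 40.0;
}

/* One iteration of the closed loop: keep a small positive margin. */
static int adjust_voltage_mv(int mhz, int mv)
{
    double margin = read_timing_margin_pct(mhz, mv);
    if (margin < 2.0 && mv < V_MAX_MV)      mv += 25;  /* too tight: raise */
    else if (margin > 6.0 && mv > V_MIN_MV) mv -= 25;  /* wasteful: lower  */
    return mv;
}

int main(void)
{
    int mhz = 400, mv = V_MAX_MV;      /* start at the worst-case table entry */
    for (int i = 0; i < 16; i++)
        mv = adjust_voltage_mv(mhz, mv);
    printf("settled at %d mV for %d MHz\n", mv, mhz);
    return 0;
}
```

Starting from the conservative table voltage, the loop steps the supply down until the measured margin enters the target band, which is where the additional savings over open-loop DVS come from.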
3.11. A Glimpse Into Emerging Radical Technologies

Imagine a laptop that runs on hydrogen fuel or a mobile phone that is powered by the same type of engines that propel gigantic aircraft. Such ideas may sound far-fetched, but they are exactly the kinds of innovations that some researchers claim will eventually solve the energy problem. We close our survey by briefly discussing two of the more radical technologies that are likely to gain in importance over the next decade. The focus of these technologies is on improving energy efficiency; they include fuel cells and MEMS systems.

3.11.1. Fuel Cells. Fuel cells [Dyer 2004] are being developed to replace the batteries used in mobile devices. Batteries have a limited capacity. Once depleted, they must be discarded unless they are rechargeable; recharging a battery can take several hours, and even rechargeable batteries eventually die. Moreover, battery technology has been advancing slowly. Building more efficient batteries still means building bigger batteries, and bigger and heavier batteries are not suited for small mobile devices.

Fuel cells are similar to batteries in that they generate electricity by means of a chemical reaction but, unlike batteries, fuel cells can in principle supply energy indefinitely. The main components of a fuel cell are an anode, a cathode, a membrane separating the anode from the cathode, and a link to transfer the generated electricity. The fuel enters the anode, where it reacts with a catalyst and splits into protons and electrons. The protons diffuse through the membrane, while the electrons are forced to travel through the link, generating electricity. When the protons reach the cathode, they combine with oxygen in the air to produce water and heat as by-products. If the fuel cell uses a water-diluted fuel (as some do), then this waste water can be recycled back to the anode.

Fuel cells have a number of advantages. One advantage is that fuels (e.g., hydrogen) are abundantly available from a wide variety of natural resources, and many offer energy densities high enough to allow portable devices to run far longer than they do on batteries. Another advantage is that refueling is significantly faster than recharging a battery; in some cases, it merely involves spraying more fuel into a tank. A third advantage is that there is no limit to how many times a fuel cell can be refueled. As long as it contains fuel, it generates electricity.

Several companies are aggressively developing fuel cell technologies for portable devices, including Micro Fuel Cell Inc., NEC, Toshiba, Medis, Panasonic, Motorola, Samsung, and Neah Systems. Already a number of prototype cells have emerged, as well as prototype mobile computers powered by hydrogen or alcohol-based fuels. Despite these rapid advances, some industry analysts estimate that it will take another 5–10 years for fuel cells to become commonplace in mobile devices.

Fuel cells also have several drawbacks. First, they can get very hot (e.g., 500–1000 degrees Celsius). Second, the metallic materials and mechanical components that they are composed of can be quite expensive. Third, fuel cells are flammable. In particular, fuel-powered devices will require safety measures as well as more flexible laws allowing them inside airplanes.

3.11.2. MEMS Systems. Microelectromechanical systems (MEMS) are miniature versions of large-scale devices commonly used to convert mechanical energy into electrical energy. Researchers at MIT and the Georgia Institute of Technology [Epstein 2004] are exploring a radical way of using them to solve the energy problem. They are developing prototype millimeter-scale versions of the gigantic gas turbine engines that power airplanes and drive electrical generators. These microengines, they claim, will give mobile computers unprecedented amounts of untethered lifetime.

The microengines work using similar principles as their large-scale counterparts. They suck air into a compressor and ignite it with fuel. The compressed air then spins a set of turbines that are connected to a generator to generate electrical power. The fuels used could be hydrogen, diesel-based, or more energy-dense solid fuels.

Made from layers of silicon wafers, these tiny engines are supposed to output the same levels of electrical power per unit of fuel as their large-scale counterparts. Their proponents claim they have two advantages. First, they can output far more power using less fuel than fuel cells or batteries alone. In fact, the ones under development are expected to output 10 to 100 watts of power almost effortlessly and keep mobile devices powered for days. Moreover, as a result of their high energy density, these engines would require less space than either fuel cells or batteries.

This technology, according to researchers, is likely to be commercialized over the next four years. However, it is too early to tell whether it will in fact replace batteries and fuel cells. One problem is that jet engines produce hot exhaust gases that could raise chip temperatures to dangerous levels, possibly requiring new materials for a chip to withstand these temperatures. Other issues include flammability and the power dissipation of the rotating turbines.

4. CONCLUSION

Power and energy management has grown into a multifaceted effort that brings together researchers from such diverse areas as physics, mechanical engineering, electrical engineering, design automation, logic and high-level synthesis, computer architecture, operating systems, compiler design, and application development. We have examined how the power problem arises and how the problem has been addressed along multiple levels ranging from transistors to applications. We have also surveyed major commercial power management technologies and provided a glimpse into some emerging technologies. We conclude by noting that the field is still active, and that researchers are continually developing new algorithms and heuristics along each level as well as exploring how to integrate algorithms from multiple levels. Given the wide variety of microarchitectural and software techniques available today and the astoundingly large number of techniques that will be available in the future, it is highly likely that we will overcome the limits imposed by high power consumption and continue to build processors offering greater levels of performance and versatility. However, only time will tell which approaches will ultimately succeed in solving the power problem.

REFERENCES

ABOUGHAZALEH, N., CHILDERS, B., MOSSE, D., MELHEM, R., AND CRAVEN, M. 2003. Energy management for real-time embedded applications with compiler support. In Proceedings of the ACM SIGPLAN Conference on Languages, Compilers, and Tools for Embedded Systems. ACM Press, 284–293.


ALBONESI,D.,DROPSHO,S.,DWARKADAS,S.,FRIEDMAN, of the ACM/SIGDA 12th International Sympo- E., HUANG, M., KURSUN,V.,MAGKLIS,G.,SCOTT, sium on Field Programmable Gate Arrays. ACM M., SEMERARO,G.,BOSE,P.,BUYUKTOSUNOGLU, A., Press, 109–117. COOK,P.,AND SCHUSTER,S. 2003. Dynamically CHEN,G.,KANG,B.,KANDEMIR, M., VIJAYKRISHNAN,N., tuning processor resources with adaptive pro- AND IRWIN,M. 2003. Energy-aware compila- cessing. IEEE Computer Magazine 36, 12, 49– tion and execution in java-enabled mobile de- 58. vices. In Proceedings of the 17th Parallel and ANAND, M., NIGHTINGALE, E., AND FLINN,J. 2004. Distributed Processing Symposium. IEEE Press, Ghosts in the machine: Interfaces for better 34a. power management. In Proceedings of the Inter- CHOI, K., SOMA, R., AND PEDRAM,M. 2004. Dynamic national Conference on Mobile Systems, Applica- voltage and frequency scaling based on work- tions, and Services. 23–35. load decomposition. In Proceedings of the Inter- AZEVEDO, A., ISSENIN, I., CORNEA, R., GUPTA, R., national Symposium on Low Power Electronics DUTT,N.,VEIDENBAUM, A., AND NICOLAU,A. 2002. and Design. ACM Press, 174–179. Profile-based dynamic voltage scheduling using CLEMENTS,P.C. 1996. A survey of architecture de- program checkpoints. In Proceedings of the Con- scription languages. In Proceedings of the 8th In- ference on Design, Automation and Test in Eu- ternational Workshop on Software Specification rope. 168–175. and Design. IEEE Computer Society, 16–25. BAHAR,R.I.AND MANNE,S. 2001. Power and energy DALLY,W.J.AND TOWLES,B. 2001. Route packets, reduction via pipeline balancing. In Proceedings not wires: on-chip inteconnection networks. In of the 28th Annual International Symposium on Proceedings of the 38th Conference on Design Au- . ACM Press, 218–229. tomation. ACM Press, 684–689. BANAKAR, R., STEINKE,S.,LEE,B.-S., BALAKRISHNAN, M., DALTON,A.B.AND ELLIS,C.S. 2003. Sensing user AND MARWEDEL,P. 2002. : intention and context for energy management. Design alternative for cache on-chip memory in In Proceedings of the 9th Workshop on Hot Topics embedded systems. In Proceedings of the 10th In- in Operating Systems. 151–156. ternational Symposium on Hardware/Software Codesign. 73–78. DE,V.AND BORKAR,S. 1999. Technology and de- sign challenges for low power and high perfor- BANERJEE,K.AND MEHROTRA,A. 2001. Global in- mance. In Proceedings of the International Sym- terconnect warming. IEEE Circuits and Devices posium on Low Power Electronics and Design Magazine 17,5,16–32. ISLPED’99 . ACM Press, 163–168. ISHOP AND RWIN B ,B. I ,M.J. 1999. Databus charge DHAR,S.,MAKSIMOVIC,D.,AND KRANZEN,B. 2002. recovery: practical considerations. In Proceed- Closed-loop adaptive voltage scaling controller ings of the International Symposium on Low for standard-cell asics. In Proceedings of the In- Power Electronics and Design. 85–87. ternational Symposium on Low Power Electron- BROCK,B.AND RAJAMANI,K. 2003. Dynamic power ics and Design ISLPED’02. ACM Press, 103– management for embedded systems. In Proceed- 107. ings of the IEEE International SOC Conference. DINIZ,P.C. 2003. A compiler approach to perfor- 416–419. mance prediction using empirical-based model- BUTTAZZO,G.C. 2002. Scalable applications for ing. In International Conference On Computa- energy-aware processors. In Proceedings of the tional Science. 916–925. 2nd International Conference On Embedded DROPSHO,S.,BUYUKTOSUNOGLU, A., BALASUBRAMONIAN, Software. Springer-Verlag, 153–165. R., ALBONESI,D.H., DWARKADAS,S.,SEMERARO,G., BUTTS,J.A.AND SOHI,G.S. 2000. 
A static power MAGKLIS,G.,AND SCOTT,M.L. 2002. Integrat- model for architects. In Proceedings of the 33rd ing adaptive on-chip storage structures for re- Annual ACM/IEEE International Symposium duced dynamic power. In Proceedings of the In- on . Monterey, CA. 191–201. ternational Conference on Parallel Architectures BUYUKTOSUNOGLU, A., SCHUSTER,S.,BROOKS,D.,BOSE, and Compilation Techniques. IEEE Computer P., C OOK,P.W.,AND ALBONESI,D. 2001. An Society, 141–152. adaptive issue queue for reduced power at high DROPSHO,S.,SEMERARO,G.,ALBONESI,D.H., MAGKLIS, performance. In Proceedings of the 1st Interna- G., AND SCOTT,M.L. 2004. Dynamically trad- tional Workshop on Power-Aware Computer Sys- ing frequency for complexity in a gals micropro- tems. 25–39. cessor. In Proceedings of the 37th International CALHOUN,B.H., HONORE,F.A., AND CHANDRAKASAN, Symposium on Microarchitecture. IEEE Com- A. 2003. Design methodology for fine-grained puter Society, 157–168. leakage control in MTCMOS. In Proceedings DUDANI, A., MUELLER,F.,AND ZHU,Y. 2002. Energy- of the International Symposium on Low Power conserving feedback EDF scheduling for embed- Electronics and Design. ACM Press, 104–109. ded systems with real-time constraints. In Pro- CHEN,D.,CONG,J.,LI,F.,AND HE,L. 2004. Low- ceedings of the Joint Conference on Languages, power technology mapping for FPGA architec- Compilers, and Tools for Embedded Systems. tures with dual supply voltages. In Proceedings ACM Press, 213–222.


DYER,C. 2004. Fuel cells and portable electronics. GIVARGIS,T.,VAHID,F.,AND HENKEL,J.2001. In Symposium On VLSI Circuits Digest of Tech- System-level exploration for pareto-optimal con- nical Papers (2004). 124–127. figurations in parameterized systems-on-a-chip. EBERGEN,J.,GAINSLEY,J.,AND CUNNINGHAM,P. 2004. In Proceedings of the IEEE/ACM International Transistor sizing: How to control the speed and Conference on Computer-Aided Design. IEEE energy consumption of a circuit. In the 10th Press, 25–30. International Symposium on Asynchronous Cir- GOCHMAN,S.,RONEN, R., ANATI, I., BERKOVITS, A., cuits and Systems. 51–61. KURTS,T.,NAVEH, A., SAEED, A., SPERBER, Z., AND EPSTEIN,A.2004. Millimeter-scale, micro- VALENTINE,R. 2003. The Intel Pentium M pro- electromechanical systems gas turbine engines. cessor: Microarchitecture and performance. In- J. Eng. Gas Turb. Power 126, 205–226. tel. Tech. J. 7,2,21–59. ERNST,D.,KIM,N.,DAS,S.,PANT,S.,RAO, R., PHAM, GOMORY,R.E.AND HU,T.C. 1961. Multi-terminal T., Z IESLER,C.,BLAAUW,D.,AUSTIN,T.,FLAUT- network flows. J. SIAM 9,4,551–569. NER, K., AND MUDGE,T. 2003. Razor: A low- GOVIL, K., CHAN, E., AND WASSERMAN,H. 1995. Com- power pipeline based on circuit-level timing paring algorithms for dynamic speed-setting of a speculation. In Proceedings of the 36th Annual low-power CPU. In Proceedings of the 1st Annual IEEE/ACM International Symposium on Mi- International Conference on Mobile Computing croarchitecture. IEEE Computer Society, 7–18. and Networking. ACM Press, 13–25. FAN, X., ELLIS,C.S.,AND LEBECK,A.R. 2003. Syn- GRUIAN,F. 2001. Hard real-time scheduling for ergy between power-aware memory systems and low-energy using stochastic data and DVS pro- processor voltage scaling. In Proceedings of the cessors. In Proceedings of the International Sym- Workshop on Power-Aware Computer Systems. posium on Low Power Electronics and Design. 164–179. ACM Press, 46–51. FEI,Y.,ZHONG, L., AND JHA,N. 2004. An energy- GUNTHER,S.,BINNS,F.,CARMEAN,D.,AND HALL,J. aware framework for coordinated dynamic soft- 2001. Managing the impact of increasing mi- ware management in mobile computers. In Pro- croprocessor power consumption. Intel Tech. J. ceedings of the IEEE Computer Society’s 12th GURUMURTHI,S.,SIVASUBRAMANIAM, A., KANDEMIR, M., Annual International Symposium on Model- AND FRANKE,H. 2003. DRPM: Dynamic speed ing, Analysis, and Simulation of Computer and control for power management in server class Telecommunications Systems. 306–317. disks. SIGARCH Comput. Architect. News 31,2, FERDINAND,C. 1997. Cache behavior prediction for 169–181. real time systems. Ph.D. thesis, Universitat¨ des HEATH,T.,PINHEIRO, E., HOM,J.,KREMER,U., Saarlandes. AND BIANCHINI,R. 2004. Code transformations FLAUTNER, K., REINHARDT,S.,AND MUDGE,T. 2001. for energy-efficient device management. IEEE Automatic performance setting for dynamic volt- Trans. Comput. 53,8,974–987. age scaling. In Proceedings of the 7th Annual HINTON,G.,SAGER,D.,UPTON, M., BOGGS,D.,CARMEAN, International Conference on Mobile Computing D., KYKER, A., AND ROUSSEL,P. 2004. The mi- and Networking. ACM Press, 260–271. croarchitecture of the Intel Pentium 4 processor FLINN,J.AND SATYANARAYANAN,M. 1999. Energy- on 90nm technology. Intel Tech. J. 8,1,1–17. aware adaptation for mobile applications. In HO,Y.-T. AND HWANG,T.-T. 2004. Low power de- Proceedings of the 17th ACM Symposium on Op- sign using dual threshold voltage. In Proceed- erating Systems Principles. ACM Press, 48–63. 
ings of the Conference on Asia South Pacific De- FOLEGNANI,D.AND GONZALEZ,A. 2001. Energy- sign Automation IEEE Press, (Piscataway, NJ,) effective issue logic. In Proceedings of the 28th 205–208. Annual International Symposium on Computer HOM,J.AND KREMER,U. 2003. Energy manage- Architecture. ACM Press, 230–239. ment of on diskless devices. GAO,F.AND HAYES,J.P. 2003. ILP-based optimiza- In Compilers and Operatings Systems for Low tion of sequential circuits for low power. In Pro- Power. Kluwer Academic Publishers, Norwell, ceedings of the International Symposium on Low MA. 95–113. Power Electronics and Design. ACM Press, 140– HOSSAIN, R., ZHENG,M.AND ALBICKI,A. 1996. Re- 145. ducing power dissipation in CMOS circuits by GENOSSAR,D.AND SHAMIR,N. 2003. Intel Pentium signal probability based transistor reording. In M processor power estimation, budgeting, opti- IEEE Trans. Comput.-Aided Design Integrated mization, and validation. Intel. Tech. J. 7,2,44– Circuits Syst. 15,3,361–368. 49. HSU,C.AND FENG,W. 2004. Effective dynamic GHOSE,K.AND KAMBLE,M.B. 1999. Reducing voltage scaling through cpu-boundedness detec- power in caches using sub- tion. In Workshop on Power Aware Computing banking, multiple line buffers, and bit-line seg- Systems. mentation. In Proceedings of the International HSU,C.-H. AND KREMER,U.2003. The design, Symposium on Low Power Electronics and De- implementation, and evaluation of a com- sign. ACM Press, 70–75. piler algorithm for CPU energy reduction. In


Proceedings of the ACM SIGPLAN Conference on ings of the Conference on Design, Automation Programming Language Design and Implemen- and Test in Europe. IEEE Press, 190–196. tation. ACM Press, 38–48. IYER,A.AND MARCULESCU,D.2002a. HU,J.,IRWIN, M., VIJAYKRISHNAN,N.,AND KANDEMIR,M. Microarchitecture-level power management. 2002. Selective trace cache: A low power and IEEE Trans. VLSI Syst. 10,3,230–239. high performance fetch mechanism. Tech. Rep. IYER,A.AND MARCULESCU,D. 2002b. Power ef- CSE-02-016, Department of Computer Science ficiency of voltage scaling in multiple clock, and Engineering, The Pennsylvania State Uni- multiple voltage cores. In Proceedings of versity (Oct.). the IEEE/ACM International Conference on HU,J.,VIJAYKRISHNAN,N.,IRWIN, M., AND KANDEMIR, Computer-Aided Design. ACM Press, 379–386. M. 2003. Using dynamic branch behavior for JONE,W.-B., WANG,J.S., LU, H.-I., HSU,I.P.,AND power-efficient instruction fetch. In Proceedings CHEN,J.-Y. 2003. Design theory and imple- of the IEEE Computer Society Annual Sympo- mentation for low-power segmented bus sys- sium on VLSI. 127–132. tems. ACMTrans. Design Autom. Electr. Syst. HU,J.S., NADGIR, A., VIJAYKRISHNAN,N.,IRWIN,M.J., 8,1,38–54. AND KANDEMIR,M. 2003. Exploiting program KABADI, M., KANNAN,N.,CHIDAMBARAM,P.,NARAYANAN, hotspots and code sequentiality for instruction S., SUBRAMANIAN, M., AND PARTHASARATHI,R. cache leakage management. In Proceedings of 2002. Dead-block elimination in cache: A the International Symposium on Low Power mechanism to reduce i-cache power consump- Electronics and Design. ACM Press, 402–407. tion in high performance microprocessors. In HUANG, M., RENAU,J.,YOO,S.-M., AND TORRELLAS,J. Proceedings of the International Conference on 2000. A framework for dynamic energy effi- High Performance Computing. Springer Verlag, ciency and temperature management. In Pro- 79–88. ceedings of the 33rd Annual ACM/IEEE Inter- KANDEMIR, M., RAMANUJAM,J.,AND CHOUDHARY,A. national Symposium on Microarchitecture. ACM 2002. Exploiting shared scratch pad memory Press, 202–213. space in embedded multiprocessor systems. In HUANG,M.C., RENAU,J.,AND TORRELLAS,J. 2003. Proceedings of the 39th Conference on Design Au- Positional adaptation of processors: application tomation. ACM Press, 219–224. to energy reduction. In Proceedings of the 30th KAXIRAS,S.,HU, Z., AND MARTONOSI,M. 2001. Cache Annual International Symposium on Computer decay: exploiting generational behavior to re- Architecture ISCA ’03. ACM Press, 157–168. duce cache leakage power. In Proceedings of the HUGHES,C.J.AND ADVE,S.V. 2004. A formal ap- 28th Annual International Symposium on Com- proach to frequent energy adaptations for mul- puter Architecture. ACM Press, 240–251. timedia applications. In Proceedings of the 31st KIM,C.AND ROY,K. 2002. Dynamic Vth scaling Annual International Symposium on Computer scheme for active leakage power reduction. In Architecture (ISCA’04). IEEE Computer Society, Proceedings of the Design, Automation and Test 138. in Europe Conference and Exhibition. IEEE HUGHES,C.J.,SRINIVASAN,J.,AND ADVE,S.V. 2001. Computer Society, 0163–0167. Saving energy with architectural and frequency KIM,N.S., FLAUTNER, K., BLAAUW,D.,AND MUDGE, adaptations for multimedia applications. In Pro- T. 2002. Drowsy instruction caches: leakage ceedings of the 34th Annual ACM/IEEE In- power reduction using dynamic voltage scaling ternational Symposium on Microarchitecture and cache sub-bank prediction. In Proceedings of (MICRO’34). 
IEEE Computer Society, 250– the 35th Annual ACM/IEEE International Sym- 261. posium on Microarchitecture. IEEE Computer Intel Corporation. 2004. Wireless Intel Speedstep Society Press, 219–230. Power Manager. Intel Corporation. KIN,J.,GUPTA, M., AND MANGIONE-SMITH,W.H. 1997. ISCI,C.AND MARTONOSI,M. 2003. Runtime power The filter cache: An energy efficient memory monitoring in high-end processors: Methodology structure. In Proceedings of the 30th Annual and empirical data. In Proceedings of the 36th ACM/IEEE International Symposium on Mi- Annual IEEE/ACM International Symposium croarchitecture. IEEE Computer Society, 184– on Microarchitecture (MICRO’36). IEEE Com- 193. puter Society, 93–104. KISTLER,T.AND FRANZ,M. 2001. Continuous pro- ISHIHARA,T.AND YASUURA,H.1998. Voltage gram optimization: design and evaluation. IEEE scheduling problem for dynamically variable Trans. Comput. 50,6,549–566. voltage processors. In Proceedings of the Inter- KISTLER,T.AND FRANZ,M. 2003. Continuous pro- national Symposium on Low Power Electronics gram optimization: A case study. ACMTrans. and Design. ACM Press, 197–202. Program. Lang. Syst. 25,4,500–548. ITRSRoadmap. The International Technology KOBAYASHI AND SAKURAI. 1994. Self-adjusting Roadmap for Semiconductors. Available at threshold voltage scheme (SATS) for low- http://public.itrs.net. voltage high-speed operation. In Proceedings of IYER,A.AND MARCULESCU,D. 2001. Power aware the IEEE Custom Integrated Circuits Confer- microarchitecture resource scaling. In Proceed- ence. 271–274.


KONDO,M.AND NAKAMURA,H. 2004. Dynamic pro- tional Conference on Measurement and Modeling cessor throttling for power efficient computa- of Computer Systems. ACM Press, 50–61. tions. In Workshop on Power Aware Computing LUZ,V.D.L., KANDEMIR, M., AND KOLCU,I. 2002. Systems. Automatic data migration for reducing energy KONG,B.,KIM,S.,AND JUN,Y. 2001. Conditional- consumption in multi-bank memory systems. In capture flip-flop for statistical power reduc- Proceedings of the 39th Conference on Design Au- tion. IEEE J. Solid State Circuits 36,8,1263– tomation. ACM Press, 213–218. 1271. LYUBOSLAVSKY,V.,BISHOP,B.,NARAYANAN,V.,AND IR- KRANE, R., PARSONS,J.,AND BAR-COHEN,A. 1988. WIN,M.J. 2000. Design of databus charge re- Design of a candidate thermal control system covery mechanism. In Proceedings of the Inter- for a cryogenically cooled computer. IEEE Trans. national Conference on ASIC. ACM Press, 283– Components, Hybrids, Manufact. Techn. 11,4, 287. 545–556. MAGKLIS,G.,SCOTT,M.L., SEMERARO,G.,ALBONESI, KRAVETS,R.AND KRISHNAN,P. 1998. Power manage- D. H., AND DROPSHO,S. 2003. Profile-based dy- ment techniques for mobile communication. In namic voltage and frequency scaling for a mul- Proceedings of the 4th Annual ACM/IEEE Inter- tiple clock domain microprocessor. In Proceed- national Conference on Mobile Computing and ings of the 30th Annual International Sympo- Networking. ACM Press, 157–168. sium on Computer Architecture. ACM Press, 14– KURSUN, E., GHIASI,S.,AND SARRAFZADEH,M. 2004. 27. Transistor level budgeting for power optimiza- MAGKLIS,G.,SEMERARO,G.,ALBONESI,D.,DROPSHO,S., tion. In Proceedings of the 5th International DWARKADAS,S.,AND SCOTT,M. 2003. Dynamic Symposium on Quality Electronic Design. 116– frequency and voltage scaling for a multiple- 121. clock-domain microprocessor. IEEE Micro 23,6, LEBECK,A.R., FAN, X., ZENG, H., AND ELLIS,C. 2000. 62–68. Power aware page allocation. In Proceedings of MARCULESCU,D. 2004. Application adaptive en- the 9th International Conference on Architec- ergy efficient clustered architectures. In Pro- tural Support for Programming Languages and ceedings of the International Symposium on Low Operating Systems (2000). ACM Press, 105–116. Power Electronics and Design. 344–349. LEE,S.AND SAKURAI,T. 2000. Run-time voltage MARTIN,T.AND SIEWIOREK,D. 2001. Nonideal bat- hopping for low-power real-time systems. In Pro- tery and main memory effects on cpu speed- ceedings of the 37th Conference on Design Au- setting for low power. IEEE Tran. (VLSI) Syst. 9, tomation. ACM Press, 806–809. 1, 29–34. LI, H., KATKOORI,S.,AND MAK,W.-K. 2004. Power MENG.Y.,SHERWOOD,T.,AND KASTNER,R. 2005. Ex- minimization algorithms for LUT-based FPGA ploring the limits of leakage power reduction in technology mapping. ACMTrans. Design Autom. caches. ACMTrans. Architecture Code Optimiz., Electr. Syst. 9,1,33–51. 1, 221–246. LI, X., LI, Z., DAVID,F.,ZHOU,P.,ZHOU,Y.,ADVE,S., MOHAPATRA,S.,CORNEA, R., DUTT,N.,NICOLAU, A., AND KUMAR,S. 2004. Performance directed en- AND VENKATASUBRAMANIAN,N. 2003. Integrated ergy management for main memory and disks. power management for video streaming to mo- In Proceedings of the 11th International Confer- bile handheld devices. In Proceedings of the 11th ence on Architectural Support for Programming ACM International Conference on Multimedia. Languages and Operating Systems (ASPLOS- ACM Press, 582–591. XI). ACM Press, New York, NY, 271–283. NEDOVIC,N.,ALEKSIC, M., AND OKLOBDZIJA,V. 2001. 
LIM,S.-S., BAE,Y.H., JANG,G.T.,RHEE,B.-D., MIN, Conditional techniques for low power consump- S. L., PARK,C.Y.,SHIN, H., PARK, K., MOON,S.- tion flip-flops. In Proceedings of the 8th Interna- M., AND KIM,C.S. 1995. An accurate worst tional Conference on Electronics, Circuits, and case timing analysis for RISC processors. IEEE Systems. 803–806. Trans. Softw. Eng. 21,7,593–604. NILSEN,K.D.AND RYGG,B. 1995. Worst-case exe- LIU, M., WANG,W.-S., AND ORSHANSKY,M. 2004. cution time analysis on modern processors. In Leakage power reduction by dual-vth designs Proceedings of the ACM SIGPLAN Workshop on under probabilistic analysis of vth variation. In Languages, Compilers and Tools for Real-Time Proceedings of the International Symposium on Systems. ACM Press, 20–30. Low Power Electronics and Design ACM Press, PALM,J.,LEE, H., DIWAN, A., AND MOSS,J.E.B. 2002. New York, NY, 2–7. When to use a compilation service? In Proceed- LLOPIS,R.AND SACHDEV,M.1996. Low power, ings of the Joint Conference on Languages, Com- testable dual edge triggered flip-flops. In Pro- pilers, and Tools for Embedded Systems. ACM ceedings of the International Symposium on Low Press, 194–203. Power Electronics and Design. IEEE Press, 341– PANDA,P.R., DUTT,N.D.,AND NICOLAU,A. 2000. On- 345. chip vs. off-chip memory: the data partitioning LORCH,J.AND SMITH,A. 2001. Improving dynamic problem in embedded processor-based systems. voltage scaling algorithms with PACE. In Pro- ACMTrans. Design Autom. Electr. Syst. 5,3, ceedings of the ACM SIGMETRICS Interna- 682–704.


PAPATHANASIOU,A.E.AND SCOTT,M.L. 2002. In- 2002. Energy-efficient using creasing disk burstiness for energy efficiency. multiple clock domains with dynamic voltage Technical Report 792 (November), Department and frequency scaling. In Proceedings of the 8th of Computer Science, University of Rochester International Symposium on High-Performance (Nov.). Computer Architecture. IEEE Computer Society, PATEL,K.N.AND MARKOV,I.L.2003. Error- 29–40. correction and crosstalk avoidance in dsm SETA, K., HARA, H., KURODA,T.,KAKUMU, M., AND SAKU- busses. In Proceedings of the International Work- RAI,T. 1995. 50% active-power saving without shop on System-Level Interconnect Prediction. speed degradation using standby power reduc- ACM Press, 9–14. tion (SPR) circuit. In Proceedings of the IEEE In- PENZES,P.,NYSTROM, M., AND MARTIN,A. 2002. ternational Solid-State Conference. IEEE Press, Transistor sizing of energy-delay-efficient 318–319. circuits. Tech. Rep. 2002003, Department SGROI, M., SHEETS, M., MIHAL, A., KEUTZER, K., MA- of Computer Science, California Institute of LIK,S.,RABAEY,J.,AND SANGIOVANNI-VENCENTELLI, Technology. A. 2001. Addressing the system-on-a-chip in- PEREIRA,C.,GUPTA, R., AND SRIVASTAVA,M. 2002. terconnect woes through communication-based PASA: A software architecture for building design. In Proceedings of the 38th Conference on power aware embedded systems. In Proceedings Design Automation. ACM Press, 667–672. of the IEEE CAS Workshop on Wireless Commu- SHIN,D.,KIM,J.,AND LEE,S. 2001. Low-energy nication and Networking. intra-task voltage scheduling using static timing POLLACK,F.1999. New microarchitecture chal- analysis. In Proceedings of the 38th Conference lenges in the coming generations of CMOS pro- on Design Automation. ACM Press, 438–443. cess technologies. International Symposium on STAN,M.AND BURLESON,W. 1995. Bus-invert cod- Microarchitecture. ing for low-power i/o. IEEE Trans. VLSI, 49–58. PONOMAREV,D.,KUCUK,G.,AND GHOSE,K. 2001. STANLEY-MARBELL,P.,HSIAO, M., AND KREMER,U. Reducing power requirements of instruction 2002. A Hardware Architecture for Dynamic scheduling through dynamic allocation of mul- Performance and Energy Adaptation. In Pro- tiple resources. In Proceedings of the ceedings of the Workshop on Power-Aware Com- 34th Annual ACM/IEEE International Sympo- puter Systems. 33–52. sium on Microarchitecture. IEEE Computer So- STROLLO, A., NAPOLI, E., AND CARO,D.D. 2000. New ciety, 90–101. clock-gating techniques for low-power flip-flops. POWELL, M., YANG,S.-H., FALSAFI,B.,ROY, K., AND In Proceedings of the International Symposium VIJAYKUMAR,T.N. 2001. Reducing leakage in on Low Power Electronics and Design. ACM a high-performance deep-submicron instruction Press, 114–119. cache. IEEE Trans. VLSI Syst. 9,1,77–90. SULTANIA, A., SYLVESTER,D.,AND SAPATNEKAR,S. 2004. RUTENBAR,R.A., CARLEY,L.R., ZAFALON, R., AND DRAG- Transistor and pin reordering for gate oxide ONE,N. 2001. Low-power technology mapping leakage reduction in dual Tox circuits. In IEEE for mixed-swing logic. In Proceedings of the Inter- International Conference on Computer Design. national Symposium on Low Power Electronics 228–233. and Design. ACM Press, 291–294. SYLVESTER,D.AND KEUTZER,K. 1998. Getting to SACHS,D.,ADVE,S.,AND JONES,D. 2003. Cross- the bottom of deep submicron. In Proceedings layer adaptive video coding to reduce energy on of the IEEE/ACM International Conference on general purpose processors. In Proceedings of the Computer-Aided Design. ACM Press, 203–211. 
International Conference on Image Processing. TAN,T.,RAGHUNATHAN, A., AND JHA,N. 2003. Soft- 109–112. ware architectural transformations: A new ap- SASANKA, R., HUGHES,C.J.,AND ADVE,S.V. 2002. proach to low energy embedded software. In Joint local and global hardware adaptations Design, Automation and Test in Europe. 1046– for energy. SIGARCH Computer Architecture 1051. News 30,5,144–155. TAYLOR,C.N.,DEY,S.,AND ZHAO,Y. 2001. Modeling SCHMIDT,R.AND NOTOHARDJONO,B. 2002. High-end and minimization of interconnect energy dissi- server low-temperature cooling. IBM J. Res. De- pation in nanometer technologies. In Proceed- vel. 46,6,739–751. ings of the 38th Conference on Design Automa- SEMERARO,G.,ALBONESI,D.H., DROPSHO,S.G., MAGK- tion. ACM Press, 754–757. LIS,G.,DWARKADAS,S.,AND SCOTT,M.L. 2002. Transmeta Corporation. 2001. LongRun Power Dynamic frequency and voltage control for a Management: Dynamic Power Management for multiple clock domain microarchitecture. In Pro- Crusoe Processors. Transmeta Corporation. ceedings of the 35th Annual ACM/IEEE Interna- Transmeta Corporation. 2003. Crusoe Processor tional Symposium on Microarchitecture. IEEE Product Brief: Model TM5800. Transmeta Cor- Computer Society Press, 356–367. poration. SEMERARO,G.,MAGKLIS,G.,BALASUBRAMONIAN, R., AL- TURING,A.1937. Computability and lambda- BONESI,D.H., DWARKADAS,S.,AND SCOTT,M.L. definability. J. Symbolic Logic 2,4,153–163.


UNNIKRISHNAN,P.,CHEN,G.,KANDEMIR, M., AND computing. In Proceedings of the 2003 Interna- MUDGETT,D.R. 2002. Dynamic compilation tional Symposium on Low Power Electronics and for energy adaptation. In Proceedings of Design. ACM Press, 110–115. the IEEE/ACM International Conference on YUAN,W.AND NAHRSTEDT,K. 2003. Energy-efficient Computer-Aided Design. ACM Press, 158– soft real-time cpu scheduling for mobile multi- 163. media systems. In Proceedings of the 19th ACM VENKATACHALAM,V.,WANG, L., GAL, A., PROBST,C., Symposium on Operating Systems Principles AND FRANZ,M. 2003. Proxyvm: A network- (SOSP’03). 149–163. based compilation infrastructure for resource- ZENG, H., ELLIS,C.S., LEBECK,A.R., AND VAHDAT,A. constrained devices. Technical Report 03-13, 2002. Ecosystem: managing energy as a first University of California, Irvine. class operating . In Proceedings VICTOR,B.AND KEUTZER,K. 2001. Bus encoding of the 10th International Conference on Archi- to prevent crosstalk delay. In Proceedings of tectural Support for Programming Languages the IEEE/ACM International Conference on and Operating Systems. ACM Press, 123– Computer-Aided Design. IEEE Press, 57–63. 132. WANG, H., PEH, L.-S., AND MALIK,S. 2003. Power- ZHANG,H.AND RABAEY,J. 1998. Low-swing inter- driven design of router in on- connect interface circuits. In Proceedings of the chip networks. In Proceedings of the 36th An- International Symposium on Low Power Elec- nual IEEE/ACM International Symposium on tronics and Design. ACM Press, 161–166. Microarchitecture (MICRO’36). IEEE Computer ZHANG, H., WAN, M., GEORGE,V.,AND RABAEY,J. 2001. Society, 105–116. Interconnect architecture exploration for low- WEI, L., CHEN, Z., JOHNSON, M., ROY, K., AND DE,V. energy reconfigurable single-chip dsps. In Pro- 1998. Design and optimization of low voltage ceedings of the International Symposium on Sys- high performance dual threshold CMOS circuits. tems Synthesis. ACM Press, 33–38. In Proceedings of the 35th Annual Conference on ZHANG,W.,HU,J.S.,DEGALAHAL,V.,KANDEMIR, Design Automation. ACM Press, 489–494. M., VIJAYKRISHNAN,N.,AND IRWIN,M.J. 2002. WEISER, M., WELCH,B.,DEMERS,A.J.,AND SHENKER,S. Compiler-directed instruction cache leakage op- 1994. Scheduling for reduced CPU energy. In timization. In Proceedings of the 35th Annual Proceedings of the 1st USENIX Symposium on ACM/IEEE International Symposium on Mi- Operating Systems Design and Implementation. croarchitecture. IEEE Computer Society Press, 13–23. 208–218. WEISSEL,A.AND BELLOSA,F. 2002. Process cruise ZHANG,W.,KARAKOY, M., KANDEMIR, M., AND CHEN,G. control: event-driven clock scaling for dynamic 2003. A compiler approach for reducing data power management. In Proceedings of the In- cache energy. In Proceedings of the 17th An- ternational Conference on Compilers, Architec- nual International Conference on Supercomput- ture, and Synthesis for Embedded Systems. ACM ing. ACM Press, 76–85. Press, 238–246. ZHAO,P.,DARWSIH,T.,AND BAYOUMI,M. 2004. High- WON, H.-S., KIM, K.-S., JEONG, K.-O., PARK, K.-T., CHOI, performance and low-power conditional dis- K.-M., AND KONG,J.-T. 2003. An MTCMOS de- charge flip-flop. IEEE Trans. VLSI Syst. 12,5, sign methodology and its application to mobile 477–484.

Received December 2003; revised February and June 2005; accepted September 2005
