A Rate Matching-based Approach to Dynamic Voltage Scaling

David Biermann, Emin Gün Sirer, Rajit Manohar
Computer Systems Laboratory
Cornell University
Ithaca, NY 14853, U.S.A.

Abstract

We present a simple rate matching-based mechanism for voltage adaptation in a microprocessor running a multiprogrammed workload. The mechanism incorporates a set of architecture and operating system extensions through which applications can communicate their actual and desired progress to the operating system. Using this feedback, the operating system uses a modified scheduling algorithm to run all applications at a single, globally optimal voltage. We demonstrate that significant energy savings are possible with a simple, practical set of extensions to the architecture and operating system.

1 Introduction

Power conservation is critical in many computational settings. It is well known that improvements in battery capacity have not tracked the increased power requirements in modern processors. Energy efficiency is critical in mobile and ubiquitous computing environments, including sensor networks and hand-held devices, where form factor constrains the total battery capacity. Voltage scaling is a promising mechanism for reducing power consumption, thereby extending the battery life of such devices. Energy efficiency is also important in high-end processors, where thermal limits constrain the maximum power consumption of a processor. Voltage scaling in this scenario would be useful to prevent the processor from exceeding its thermal budget.

There is a wealth of research on voltage scaling algorithms [20, 3, 7, 8, 16, 23, 27]. This work has mostly focused on operating system (OS) techniques for selecting a globally optimal voltage setting. The primary driving factor in this class of selection algorithms has been the total system idle time. The operating system typically scales the voltage (and frequency) down in response to idle periods and increases it during bursts of activity to try to find the lowest possible voltage setting that eliminates idleness. Such schemes are compelling because they only require minor changes to the operating system scheduler and no application-level modifications. However, heuristic-based, operating system driven algorithms tend not to exhibit stable behavior, nor do they robustly converge to a single optimal operating point. A recent experimental study that thoroughly evaluated many previous voltage scaling schemes concluded that “No heuristic policy that we examined achieved [the optimal voltage and frequency].”¹ [8] Part of the reason why these heuristic approaches are limited is that they are driven solely by system idle time and have no application-specific information.

¹ Emphasis in the original.

Other work has examined how to select the optimal voltage given complete information about application start times, deadlines, and computation needs [12, 20]. Given complete application knowledge, these omniscient schemes can optimally pick the operating voltage to minimize energy requirements while meeting application deadlines. However, while such schemes can provide lower bounds on energy requirements, they are hard to use in practice because they require complete application information. Due to data-dependent execution and hardware effects such as cache misses, estimating future execution time for an application is a daunting task.

We contend that the problem with these two extremes is the lack of application-specific information. OS-directed schemes do not take any application-specific deadline information into account, while omniscient schemes assume an impractical level of application knowledge. The problem stems from the lack of an interface by which application writers can inform the hardware of relevant information for making energy-optimal decisions.

In this paper, we propose a new interface through which applications can independently express their power needs to dynamically select the globally optimal operating voltage. An interface for power-aware voltage adaptation should exhibit the following properties:

• Simplicity. It should be practical and intuitive to use. In particular, its use should not be predicated on detailed knowledge of future application behavior.

• Efficiency. It should provide sufficient information for the hardware to make optimal or near-optimal voltage scheduling decisions with minimal run-time overhead.

• Protection. It should allow the operating system to make per-process voltage scheduling decisions. Applications should not be able to override energy limits imposed on them by the operating system.

• Flexibility. It should enable applications to implement any voltage selection algorithm. Variations in application execution bursts necessitate differing voltage adaptation schemes.

• Compatibility. It should not preclude legacy applications from being executed without modification.

We propose a set of simple extensions that achieve these goals via fine-grained rate-matching. Our approach relies on extracting explicit progress information from the application. The hardware can then use this information to tune its voltage and frequency to match application needs. Applications provide their target execution rate on initialization, and include progress indicators in critical locations in the code. The progress indicators specify the actual application execution rate, which is then adjusted to match the target rate by scaling the operating voltage.

This paper makes the following contributions. First, it introduces a simple, efficient, and flexible interface for application-directed voltage control. The interface is easy and intuitive to use; it typically requires the addition of a small amount of code to initialize the system, and a single call in the main application loop. Second, we show that this interface, when used independently by competing applications, leads to a globally optimal dynamic voltage level. Our interface provides a way for each application to pursue the best voltage for its own application-specific goals. In aggregate, the global system converges to the optimal voltage without the operating system having to explicitly derive it from application characteristics or needs. Finally, we show that this interface permits an efficient implementation that achieves significant power savings. For a demanding multiprogrammed workload, our implementation in the Linux operating system is within 3.4% of optimal and achieves a 42% reduction in energy on average.

The rest of the paper is organized as follows. Section 2 relates our contributions to previous work. Section 3 describes our API for voltage management, in the context of a single application. Section 4 shows how the API can be used to achieve globally optimal voltage selection in a multiprogrammed environment. Section 5 describes our Linux implementation, and shows that it achieves performance that is close to the optimal.

2 Related Work

For any dynamic voltage adaptation scheme to be feasible, the hardware must support operation at multiple voltage levels. Current commercial processors support a small number of discrete voltage/frequency adjustment options. Intel's Mobile Pentium III with SpeedStep has two levels of operation [10] and AMD's Mobile Athlon 4 with PowerNow! [1] has five levels. This small range of voltage adjustment can only support very coarse-grained voltage adjustments, such as lowering the voltage when a system switches to battery power. Transmeta's Crusoe processor is one of the few commercial processors to support fine-grain voltage/frequency adjustments. However, unlike our voltage-adaptive scheme, the Crusoe's voltage/frequency adjustments are not directly driven by the application. Crusoe's power management software monitors power consumption by sampling CPU sleep states and using heuristics to adjust the voltage and frequency of the processor [6]. Some low-power embedded processors, such as the Intel 80200 Processor [11], also support fine-grain voltage/frequency adjustments. The existence of these capabilities in modern processors has spurred researchers to examine algorithms for voltage/frequency management.

Previous work on voltage adaptation has focused on operating system techniques that choose an operating voltage which minimizes idle time. Early work by Weiser et al. showed the potential benefits of voltage scaling [27]. They looked at two schemes, FUTURE and PAST, that examine idle time in scheduling windows to determine the voltage setting for the next epoch, and compared these to OPT, the optimal strategy. This work was further extended by Govil et al., who examined many other candidate strategies for voltage scheduling using the same framework as Weiser [7]. They showed that when evaluating a range of applications with a single scheduling policy, simple strategies achieved energy savings that were comparable to those obtained by more sophisticated strategies. Martin examined the effect of non-ideal battery behavior and memory performance, and used this to formulate a more sophisticated model of the effect of voltage/frequency scaling on system lifetime [18]. Grunwald et al. experimentally evaluated different voltage scaling policies on Itsy, a prototype hand-held computer [8]. They conclude that none of the policies proposed to date work well in the general case. These papers share a set of common characteristics: (i) They investigate single system heuristics that have to operate well across a wide range of applications; (ii) They are all coarse grained and interval based. The system re-evaluates the voltage setting only when the scheduler is invoked. The choice of the interval is determined by the scheduler, independent of application needs; (iii) They are driven entirely by system idle time, which is not directly related to application needs. Inferring computation needs from idle time measurements is complicated by phenomena like deceptive idle times [13], where applications might remain idle due to outstanding I/O requests. These three characteristics have limited the efficiency of such schemes. In contrast, our rate-matching based approach enables application-controlled voltage adaptation, facilitates scheduling at finer granularity, and is directly driven by application progress.

There has been much work (both theoretical and simulation based) on optimal voltage scheduling policies based on complete information about application deadlines, arrival times, and computation workload. Hong et al. look at the voltage scheduling problem given full information about periodic tasks [9]. Pering et al. describe the design of a low-power microprocessor system that incorporates dynamic voltage scaling [20]. They build on a real-time OS infrastructure, and assume that application deadlines and computation needs are available to their scheduler. Ishihara et al. make the same assumptions and approach the optimal voltage scheduling problem through linear programming [12]. Manzak et al. provide techniques to compute the optimal task voltages for a number of tasks that have a common, global deadline [16]. More recent work makes similar assumptions when implementing real-time dynamic voltage scaling algorithms [19, 28]. This body of work relies on explicit application deadline information, and total application execution time for at least one reference voltage. In practice, these two metrics, especially the latter, may be difficult to obtain. Furthermore, having the application compute an accurate estimate of future workload is likely to incur an unacceptable performance penalty. In contrast, our rate-matching based approach does not require that the application make any estimates of its future behavior.

Simunic et al. present techniques that use a change-point detection algorithm to detect the difference in arrival and service rates for an MPEG player and MP3 player [23]. They assume the presence of a power manager that can monitor these rates and the number of frames decoded by the two players. Their technique also uses an off-line calculation to determine thresholds for the change-point detection algorithm. Our rate-matching based approach, while similar in spirit, relies only on run-time information and does not require an off-line calculation phase. Further, it is not limited to applications with input and output queues that can be monitored by dedicated hardware.

3 Voltage Scaling API

Our API achieves application-driven adaptivity by rate-matching. We call this API RMVS, for rate-matching voltage scaling. We begin with the minimal set of operations that provide a mechanism by which the application can inform the hardware about its progress. The hardware can then pick the optimal voltage level to reduce energy consumption given a target progress guarantee.

3.1 API Description

The voltage control system is centered about a counter, VCNT, which captures the application's progress. This counter is periodically incremented by the application via the PROGRESS operation, and decremented by the system at a rate specified by the VRATE register. Equilibrium is achieved when the increment and decrement rates balance and keep the counter at a near-constant value. If the program runs too slowly, then the application-controlled increments will occur less often than the system-controlled decrements and the counter will eventually underflow. Likewise, if the application runs too quickly, then the counter will eventually overflow. These conditions are signals to adjust the voltage up or down, respectively.

Such conditions are reflected to the application through exceptions. Throughout this paper, we will refer to these exceptions as counter exceptions. They are handled by an exception handler that picks the new operating voltage for the application. The exception handler does so by writing the VLT register, which in turn updates the actual supply voltage level.

For protection reasons, the exception handler should not be able to adjust VLT to an arbitrary value. Otherwise, malicious applications could consume more power than their allotment and circumvent OS resource management decisions. The OS has the ability to set hard bounds on VLT via the registers VMIN and VMAX. These bounds could be the physical limits at which the processor can operate, or the OS may wish to restrict the voltage level further. For example, to extend battery life when a laptop is not plugged into a wall socket, the OS may set VMAX to something far lower than it would if more power were available.

If the counter hits zero while an application is running and VLT is already equal to VMAX, then the application cannot meet its current performance goal. Depending on the nature of the application, it may then want to exit, continue running at the highest allowable voltage level, or modify its performance requirement, possibly performing a quality-of-service adjustment. For example, a scalable video decoder that cannot meet its frame rate goal, even at VMAX, may choose to use fewer colors, lower resolution, or a slower frame rate. Likewise, if a program is running too quickly while already at VMIN, it may want to do nothing and continue running at this level, or change its quality-of-service by switching to more colors or higher resolution.

The amount of hysteresis in the system can be controlled by limiting the range of the VCNT register. To accomplish this, we provide a new register CMAX that bounds the maximum possible counter value. Note that a CMIN register is unnecessary, as the amount of hysteresis only depends on the range of values the counter can take, not the absolute value of the counter itself. With this modification, exceptions are reported when the counter hits zero or CMAX. To keep the hysteresis symmetric, VCNT is typically initialized to CMAX/2.

To more quickly arrive at the equilibrium voltage level (i.e., the level at which the rate of increments equals the rate of decrements), a counter exception handler needs to know the number of instructions and the total number of counter increments since the last exception. These values are stored in VINS and VINCS, respectively. After the exception handler has changed the voltage, it typically resets the counter to CMAX/2.

By default, an application begins with VRATE set to zero. This has the effect of turning off the voltage control mechanism if the application does not contain any PROGRESS instructions. This allows legacy applications to execute with no modification.

Finally, there are also times when the processor is truly idle and just needs to wait until it receives an interrupt (from a timer, for instance). In this case, the HALT instruction causes the processor to wait until an external interrupt occurs. The HALT instruction has already been adopted and implemented in many modern ISAs, and is simply included here for completeness.

Reg      Description
CMAX     Maximum counter level
VINCS    Increments since last voltage exception
VINS     Instructions since last voltage exception
VLT      Supply voltage level
VMAX     Maximum allowable supply voltage
VMIN     Minimum allowable supply voltage
VRATE    Counter decrement rate (Hz)
VCNT     Counter

Table 1: New registers introduced by the voltage-adaptive ISA extensions.

Instruction    Description
PROGRESS       Increments counter by one
HALT           Stops processor until interrupt

Table 2: New instructions introduced by the voltage-adaptive ISA extensions.

3.2 Example Application

In this section, we illustrate how to integrate our extensions into real applications. We focus on a video decoder with a fixed, average-case throughput goal. An MPEG video decoder with its periodic structure provides a good example of how our extensions can be used to control the average-case behavior of an application. Modifying applications involves two steps: inserting PROGRESS instructions appropriately into a program, and determining application-specific values for VRATE and CMAX.

Even with a regular benchmark like MPEG, it is difficult to specify its future computation requirements based on its immediate inputs. Each frame may have drastically different computational requirements for decoding, with some using as many as three times as many dynamic instructions as others [2]. A graph of the number of instructions needed to decode frames for a 116-frame video clip is shown in Figure 1. This figure shows that even a regular benchmark like MPEG exhibits irregular and time-varying computational needs that are hard to capture.

Figure 1: Number of instructions required to decode frames in an MPEG video sequence. (Plot of Number of Instructions against Frame Number.)

To provide an indication of progress to the hardware, the application writer needs to place PROGRESS operations at appropriate locations. A natural choice for MPEG is to place one operation at the end of the code that decodes a single frame. We found this to be a straightforward modification for the benchmarks we examined in this paper. Generally, the PROGRESS operations were placed in the main dispatch loop of event-driven applications.

The desired rate of progress is determined by the value in VRATE. For MPEG, this corresponds precisely with the designed frame rate. Therefore, VRATE is set to 30 Hz for a 30 frames/sec (fps) target.
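As a rough illustration of the counter mechanism of Section 3.1 applied to a frame-based decoder, the following Python sketch models VCNT in software. It is not the hardware or kernel implementation described in this paper: the function names `simulate_counter` and `next_vlt`, the integer tick time base, the millivolt units, and the 100 mV step are all assumptions made for the example. One PROGRESS is issued per decoded frame, the system decrements the counter once per VRATE period, and an exception is recorded whenever the counter reaches zero or CMAX.

```python
def simulate_counter(frame_ticks, ticks_per_decrement, cmax):
    """Return the counter exceptions raised while decoding a sequence of frames.

    frame_ticks: reference-clock ticks spent decoding each frame (hypothetical input).
    ticks_per_decrement: ticks between system decrements, i.e. the VRATE period.
    cmax: value of CMAX; the counter starts at CMAX/2 and is reset there after an exception.
    """
    vcnt = cmax // 2
    elapsed = 0
    events = []
    for frame, ticks in enumerate(frame_ticks):
        elapsed += ticks
        # System side: one decrement per elapsed VRATE period.
        while elapsed >= ticks_per_decrement:
            elapsed -= ticks_per_decrement
            vcnt -= 1
            if vcnt == 0:                      # application is too slow
                events.append(("underflow", frame))
                vcnt = cmax // 2
        # Application side: PROGRESS at the end of each decoded frame.
        vcnt += 1
        if vcnt == cmax:                       # application is too fast
            events.append(("overflow", frame))
            vcnt = cmax // 2
    return events


def next_vlt(kind, vlt_mv, vmin_mv, vmax_mv, step_mv=100):
    """Fixed-step choice of a new VLT in millivolts, clamped to the OS bounds VMIN and VMAX."""
    delta = step_mv if kind == "underflow" else -step_mv
    return max(vmin_mv, min(vmax_mv, vlt_mv + delta))


# Frames that take two VRATE periods each fall behind; frames that take half a period run ahead.
print(simulate_counter([2] * 7, ticks_per_decrement=1, cmax=8))  # [('underflow', 2), ('underflow', 6)]
print(simulate_counter([1] * 7, ticks_per_decrement=2, cmax=8))  # [('overflow', 6)]
print(next_vlt("underflow", 1500, 300, 1500))                    # 1500 (already at VMAX)
```

When the decode time matches the VRATE period exactly, the counter oscillates around CMAX/2 and no exception is raised, which is the equilibrium the text describes. The clamped return value of `next_vlt` corresponds to the case where the application is already at VMAX or VMIN and must decide for itself whether to change its quality of service.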

Figure 2: Voltage level for an MPEG video decoder with (top) CMAX = 8 and (bottom) CMAX = 16. (Plots of Vdd (V) against time (x2 ns).)

CMAX determines how quickly the rate of PROGRESS operations approaches the equilibrium VRATE. CMAX can be thought of as the size of the window over which the PROGRESS rate is averaged, and hence the sensitivity of the mechanism. Note that if CMAX = 2, then any counter increment or decrement will immediately cause an exception. On the other hand, if CMAX is large (say 2000), then at least a thousand decrements (≥ 33 seconds) must occur before a counter exception could allow the system to adjust its throughput. For the MPEG video decoder, the desired sensitivity depends on the amount of buffering available. In general, we would like to make the mechanism as insensitive as possible within the constraints of the application, since changing voltage levels adds to the energy and time overhead cost. If the number of buffered frames is b, then we should ensure that CMAX/2 [...]

3.3 Implementation Considerations

The API described above can be implemented using instruction set architecture (ISA) extensions, or by using a system call interface with little or no architectural support.

Extending the ISA. The overall impact of our ISA extensions on the main processor execution path is modest, especially when integrating such an ISA extension with a MIPS-like ISA that supports coprocessors [14]. The ISA extension along with its accompanying registers is implemented as a voltage management coprocessor (VU), and the primary decode need only classify the proposed extensions as VU instructions to be routed to the voltage management coprocessor. The only interaction between the primary processor datapath and the VU is through move instructions that transfer data to and from VU registers. Protection can be achieved by only allowing user-mode code to access the progress operation and the supply voltage level register. We implemented this modification to the RTL-level description of an asynchronous MIPS processor, and it had no noticeable impact on the performance of the processor.

Voltage Calculation. When the counter value reaches CMAX or drops to zero, application-level code must be executed to adjust the operating voltage. The simplest strategy we adopt is called INCDEC, where the voltage is increased by a fixed amount if the counter crosses the low threshold (the application is running too slowly), and decreased by a fixed amount if the counter reaches CMAX (the application is running too quickly). In general, an application could adopt more sophisticated strategies when determining its new voltage in response to a counter exception.

Voltage Adaptation. The granularity of our voltage- [...]

² This latency strongly depends both on the capacitance between Vdd and GND, and the direction and amount of the voltage change.

Counter Updates. The reference clock provides a periodic timing signal to the processor. This is required to produce an absolute time reference so that the voltage-adaptive logic can keep track of how fast the processor is running, regardless of its current operating voltage. In our implementation, the reference clock controls how often the voltage counter is decremented. The VRATE register controls the rate at which the counter is decremented by dividing down the reference clock frequency. The reference clock may either be integrated on-chip with the rest of the processor (using an additional fixed supply voltage) or can be an external oscillator.

4 From Local to Global Voltage Selection

The preceding section described how applications express their energy needs in isolation. In this section, we show how this can be combined with scheduling decisions in the operating system to achieve globally optimal voltage selection.

4.1 Optimal Voltage for a Single Application

Our model for the effect of voltage scaling on the performance and energy requirements of a design uses the conventional first-order approximation for energy and delay. Let $k$ be the number of units of work that a processor operating at voltage $V$ can complete in $t$ units of time using $E$ units of energy. We assume that these quantities are related by the following equations:

$$k = \frac{V t}{V_0} \qquad (1)$$

$$E = C V^2 \times k = \frac{C V^3 t}{V_0} \qquad (2)$$

where the quantities $V_0$ and $C$ are constants that depend on the design.³ For the purpose of analysis, a task corresponds to a given amount of work that must be completed by a certain time. Given a set of such tasks, we can determine the voltage schedule that minimizes the total energy required by the system. Some properties this schedule obeys are:

³ This first-order model neglects threshold voltage effects, leakage, and short-circuit current terms in the equation for energy and delay.

Property 1. The optimal voltage schedule is piecewise constant, with changes occurring when tasks are completed. The proof of this follows from (1) and (2). This can also be established for more complex functional forms by posing the following variational problem: Given a fixed amount of work to be completed in a fixed amount of time, what is the optimal $V(t)$ that minimizes the total energy? Since both $k$ and $E$ are functions of only $V(t)$ and other constants, the optimal solution is for $V(t)$ to be a constant.

Property 2. In an optimal voltage schedule, the voltage can increase only when new tasks enter the system; otherwise, the voltage is a non-increasing function with changes in voltage occurring only when tasks leave the system. This can be proved in the same way as Property 1.

These two properties provide us with some intuition regarding the behavior of the optimal voltage curve.

4.2 Globally Optimal Voltage Selection

Consider a simple scheduler with two tasks $A$ and $B$. For simplicity, we assume that task $A$ has $k_A$ units of work that need to be completed in $t$ time slices, and $B$ has $k_B$ units of work that need to be completed in the same number of time slices $t$.

A priority scheduler schedules task $A$ with probability $p$ and $B$ with probability $(1 - p)$. Therefore, $pt$ slices are allocated to task $A$, and $(1 - p)t$ slices are allocated to task $B$. In this situation, we would like to use the following voltages for $A$ and $B$:

$$V_A = \frac{V_0 k_A}{p t}$$

$$V_B = \frac{V_0 k_B}{(1 - p) t}$$

where $V_0$ is the voltage where a process can complete one unit of work in one time slice.

Therefore, the energy used after $t$ slices (normalized by the average capacitance per unit of work) is given by:

$$E_A = k_A V_A^2 = V_0^2 \frac{k_A^3}{p^2 t^2}$$

$$E_B = k_B V_B^2 = V_0^2 \frac{k_B^3}{(1 - p)^2 t^2}$$

$$E = \frac{V_0^2}{t^2} \left( \frac{k_A^3}{p^2} + \frac{k_B^3}{(1 - p)^2} \right)$$

We can complete this joint task using the minimum amount of energy when:

$$\frac{\partial E}{\partial p} = 0$$

Differentiating the expression for $E$ gives $k_A^3 / p^3 = k_B^3 / (1 - p)^3$, i.e.,

$$p = \frac{k_A}{k_A + k_B}$$

i.e.,

$$V_A = V_B = V_0 \frac{k_A + k_B}{t}$$

In other words, the optimal scheduling strategy is to treat the combined job “A + B” as one job running at a single voltage that is chosen precisely so that both tasks can meet the joint deadline. This result easily generalizes to the case of $n$ processes, where the optimal scheduling policy once again equalizes all the voltages, and the scheduling probability is proportional to the amount of work to be performed.

In general, suppose we have $n$ tasks with periodic deadlines that have slice allocations $s_1, s_2, \ldots, s_n$ and operate at steady-state voltages $V_1, \ldots, V_n$, each task completing $k_1, \ldots, k_n$ units of work per deadline. We know that

$$V_i = \frac{V_0 k_i}{s_i}$$

where $V_0$ is the voltage where a task can complete one unit of work per slice. As the operating system scheduler, we can observe the values $V_i$ and $s_i$. Each task should receive a priority that is proportional to its work $k_i$. Therefore, the operating system scheduler can compute the new slice values

$$s_i^{\text{new}} = \frac{k_i}{\sum_{j=1}^{n} k_j} = \frac{V_i s_i / V_0}{\sum_{j=1}^{n} V_j s_j / V_0}$$

i.e.,

$$s_i^{\text{new}} = \frac{V_i s_i}{\sum_{j=1}^{n} V_j s_j} \qquad (3)$$

Therefore, to minimize energy, the operating system should dynamically adjust its scheduling policy using equation (3). Note that the quantities $V_i$ and $s_i$ are both known to the operating system: $V_i$ is maintained in our ISA as a register in the voltage unit; $s_i$ is the proportion of time that is allocated to the process by the scheduler.

We modified the Linux scheduler according to equation (3) to use per-process voltage information to adjust scheduling priorities of the processes that participated in dynamic voltage scheduling. Once the scheduler priorities have been correctly adjusted, each application with a non-zero VRATE will automatically pick its own local voltage to be the same as the globally optimal voltage choice.

Note that as opposed to previous work on idle time minimization, the voltage level in our system is determined by a combination of application information and operating system scheduling. While previous work has examined the problem of finding a single global voltage that satisfies all applications, we enable each application to determine its own voltage level that is appropriate for its own progress metric. The operating system uses the voltage information per application to change scheduling priorities, and this combination results in a globally optimal choice of voltage.

The voltage scheduling that results is quite different from what one would observe with a Weiser-style idle time scheduler. For instance, consider a single, CPU-bound application. An idle time scheduler would always run this application at a high voltage, because each scheduling interval has no idle time. The insertion of PROGRESS instructions in the application gives the hardware additional information about the actual needs of the application which may not always be reflected in the idle time. This enables us to save energy without loss in application performance.

An interesting result of this scheme is that the overall average voltage level of the processor will scale with the load on the system. If there are $m$ copies of the previous video application using the CPU and if the processor's throughput scales linearly with the voltage level, the overall system voltage will also increase linearly with $m$.

5 Results

Due to the absence of hardware that provides an implementation of our API, we evaluated its effectiveness with a simulation-based study. The simulator we use is based on Bochs, an open-source x86 simulator that includes models for the network, disk, and other devices and can boot the Linux operating system. We adopted a full system approach to simulation because our proposed extensions include both operating system and architecture modifications.

The core simulator was modified to support the API as an ISA extension. The effect of voltage scaling only impacts the energy and performance of the processor, and Bochs was modified to correctly account for a selective slowdown/speedup of the processor. We augmented Bochs to also record the energy consumption of the processor.

Name         Description
austin       clip from Austin Powers
bach         clip from a Bach composition
godfather    clip from The Godfather
hal2001      clip from 2001
jesse        Jesse by Joshua Kadison
kennedy      clip of John F. Kennedy speaking
lastresort   clip from Last Resort by Papa Roach
mozart       Eine kleine Nachtmusik by Mozart
pachebel     Canon in D by Pachelbel
rebecca      Rebecca by Pat McGee Band
fear         clip of Roosevelt's nothing to fear speech
infamy       clip of Roosevelt's Pearl Harbor speech

Table 3: Audio files used as input for benchmarks.

The simulator supports voltage levels ranging from 1.5V to 0.3V in 0.1V steps. Our simulator also takes the non-linear dependence between voltage and throughput into account, as well as accounting for those times that do not scale with voltage (cache misses, disk access, etc.). The simulator also calculates the optimal energy that an application could operate at if it had perfect knowledge of future behavior based on the arrival times of tasks using the analysis from Section 4. All the reported results correspond to applications that run on our modified Linux being simulated on a modified version of Bochs.
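The Section 4 analysis that serves as the optimal baseline can be checked numerically on a small case. The sketch below applies the slice update of equation (3) to a hypothetical two-task workload and compares the normalized energy before and after. The function names, the workload values, and the choices V0 = 1 and a period of one time unit are assumptions made for illustration; this is not the modified Linux scheduler itself.

```python
def rebalance_slices(voltages, slices):
    """Equation (3): the new share of task i is V_i * s_i over the sum of all V_j * s_j."""
    weights = [v * s for v, s in zip(voltages, slices)]
    total = sum(weights)
    return [w / total for w in weights]


def normalized_energy(work, slices, v0=1.0):
    """Sum of k_i * V_i^2 with V_i = V0 * k_i / s_i (capacitance normalized out)."""
    return sum(k * (v0 * k / s) ** 2 for k, s in zip(work, slices))


work = [1.0, 3.0]      # k_A, k_B: hypothetical units of work per period
slices = [0.5, 0.5]    # the scheduler starts with an equal split
# Voltages that rate matching would settle on for this split (V0 = 1).
voltages = [k / s for k, s in zip(work, slices)]

new_slices = rebalance_slices(voltages, slices)
print(new_slices)                                   # [0.25, 0.75]
print(normalized_energy(work, slices))              # 112.0
print(normalized_energy(work, new_slices))          # 64.0
```

With the rebalanced shares both tasks need the same voltage, V0 (k_A + k_B) / t = 4 in these units, which is the single-voltage optimum derived in Section 4.2. Note that the update only uses quantities the scheduler can observe (the voltages and the current shares); the work values appear here solely to set up the example and to evaluate the energy.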

We modified a Linux 2.4 kernel to include all the necessary operating system support for our API. The thread structure was modified to include all the values for the registers introduced by the API. In addition, the context switch routine was modified to save and restore the hardware registers that correspond to our API. Processes that operate at constant voltages have VRATE set to zero. The voltage state is shared by child processes so that the time taken by sub-tasks spawned by a parent is correctly accounted for.

Benchmarks. We used a set of six benchmark programs (shown in Table 4), attempting to use a range of different application types to illustrate the applicability of the proposed API. toast and untoast are audio codecs, and mpg123 is an MP3 player. go is a game-tree search, and we modified it to limit the search it performs by regulating the amount of work per move. This adaptively prunes the search tree based on computational requirements. ehgml is a ray-tracer that we modified so that it would render scenes at a fixed frame rate. Finally, abyss is a web server that was modified to respond to web traffic at a fixed rate. The multimedia benchmarks used a variety of input data sets that are described in Table 3.

Benchmark    Description
toast        GSM encoder
untoast      GSM decoder
mpg123       mp3 player
go           simulation of the game of go
ehgml        ray tracer
abyss        web server

Table 4: List of benchmarks.

Modifying each benchmark was a simple task, and it took us less than an hour per benchmark to perform the necessary modifications. Unlike other related work where the quality of service was varied to meet performance requirements [20], our benchmarks provide a fixed quality of service; i.e., each run of a benchmark corresponds to the same amount of work.

5.1 Calibration

We calibrated the time and energy reported by Bochs against the time and energy we measured from a 400 MHz Pentium II-based system with 128MB memory, 4GB disk, and the Intel 440BX chipset. The time taken by each application was measured using the Unix time command on Bochs as well as on the real machine. The runtimes of the applications ranged from 4.43 secs (low) to 24 secs (high). For calibration purposes, we used four of the audio data sets (jesse, mozart, pachebel, rebecca). For energy calibration, we charged a different energy cost for each instruction type (integer, floating-point, memory). The energy reported by the simulator was compared against measurements from the Pentium II system. We measured the current being used by the processor by attaching a probe to the voltage regulator on the motherboard.

Figure 3 shows the results of our calibration runs. The y-axis shows the ratio of the metric reported by the simulator to the measured metric. Both energy and time calibration are reported per benchmark. The largest error in energy measurements we observed was an underestimate by 13.9%, and the largest error we observed in timing measurements was an overestimate by 4.2%. The average of the absolute values of the error percentages was 6.1% for the energy and 1.9% for the time.

Figure 3: Results of calibration against measured data. (Bar chart of energy and time ratios per application: toast, untoast, and mp3 on the jesse, mozart, pachebel, and rebecca inputs, and go.)

5.2 Application Performance

Figure 4 shows the results of RMVS on the six benchmarks from Table 4. For benchmarks with multiple input sets, we took the sum of the energy per input set, which corresponds to a workload that executes each clip from Table 3 once. For each benchmark, we normalize the reported energy against the energy required by the application when no voltage scaling is performed (i.e., the normal energy requirement for each application would be 1.0 in Figure 4). Each application has two bars: one cor-

0.9 0.8

0.8 0.7 0.7

0.6 0.6

RMVS 0.5 0.5 RMVS Optimal Optimal 0.4 0.4

0.3 0.3 0.2 0.2 0.1

0.1 0 3app/gsm 3app/go 3app/mp3 0.44 0.86 0.5 0 RMVS toast untoast mp3 go ehgml abyss Optimal 0.43 0.84 0.46 RMVS 0.44 0.58 0.31 0.82 0.69 0.61 Optimal 0.43 0.54 0.3 0.81 0.63 0.56 Figure 6: RMVS with three applications running simultane- ously. Figure 4: RMVS voltage adaptation with single application. application workload is shown in Figure 6. For each 0.8 workload, the per application energy consumption us- 0.7 ing RMVS is compared against the energy used by the

0.6 application under the optimal voltage schedule for the

0.5 workload. Our implementation of RMVS performs to

RMVS within 3.4% of the optimal, and reduces the energy re- 0.4 Optimal quirements of the workload by 42% on average. 0.3

0.2 Figure 7 shows the voltage as a function of time for mpg123 with the “Bach” dataset. The voltage curve 0.1 shows the effect of using an incremental adjustment in 0 go+mp3/mp3 go+mp3/go gsm-multi/gsm1 gsm-multi/gsm2 the voltage. For this particular run, the optimal voltage RMVS 0.55 0.75 0.47 0.48 Optimal 0.54 0.75 0.46 0.45 lies between 1.0V and 1.1V . The discrete nature of the Figure 5: RMVS with two applications running simultane- voltage adaptation causes the voltage to periodically in- 0.1 1.0 ously. crease by V before stabilizing at V for a further interval. responding to using RMVS, and the other corresponding to optimal voltage scheduling according to Section 4. Figure 8 shows the voltage as a function of time when both go and mpg123 are executing. The mp3 player is RMVS uses 10% more energy than the optimal voltage playing a clip that ends at 12 seconds. The combination scaling strategy in the worst case (abyss), and 5.3% more of the two applications causes the processor to operate at energy on average. Compared to not applying any form 1.5V until the mp3 player completes. Notice how both of voltage scaling, RMVS saves 43% of the total energy applications independently chose the same voltage to op- required on average. erate at due to the modified scheduler. Once the mp3 player completes, go is allowed to use a larger fraction of the processor—immediately lowering its operating volt- 5.3 Multiprogrammed Workloads age and proceeding along the same phases as before. 1.6 We now examine the effects of running multiple appli- 1.4 cations simultaneously. In each case, we keep track of 1.2

the per-application energy as well as the optimal per- 1

application energy. We report results from RMVS runs 0.8

for three different multiprogrammed workloads. Work- voltage 0.6 load go+mp3 corresponds to running the go bench- mark and mpg123 benchmark simultaneously. Work- 0.4 load gsm-multi corresponds to two runs of untoast, 0.2 the GSM decoder. Finally 3app corresponds to go, 0 6 8 10 12 14 16 18 mpg123, and untoast running simultaneously. time (s)

A summary of the results for the two application work- Figure 7: mpg123/Bach loads is provided in Figure 5. Results from the three 1.6 MP3 Player [10] Intel. Intel SpeedStep Technology, Jan 2000. Go 1.5 [11] Intel. Intel 80200 Processor based on Intel XScale Microarchitecture, Nov 1.4 2000.

1.3 [12] T. Ishihara and H. Yasuura. Voltage Scheduling Problem for Dynamically 1.2 Variable Voltage Processors. Proc. of International Symposium on Low Power Electronics and Design, August 1998. voltage 1.1 [13] Sitaram Iyer and Peter Druschel. Anticipatory scheduling: A disk schedul- 1 ing framework to overcome deceptive idleness in synchronous I/O. Proc. 18th ACM Symposium on Operating Systems Principles, 2001. 0.9 [14] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice-Hall, 1992. 0.8 6 8 10 12 14 16 18 20 22 24 time (s) [15] Kuroda et al. Variable supply-voltage scheme for low-power high-speed CMOS. IEEE J. Solid-State Circuits, vol. 33, pp 454-462, March 1998.

Figure 8: go and mpg123 [16] A. Manzak and C. Chakrabarti. Variable voltage task scheduling algorithms for minimizing energy. International Symposium on Low Power Electronics 6 Summary and Design, 2001. [17] A. J. Martin, A. Lines, R. Manohar, M. Nystr¨om, P. Penzes, R. Southworth, U. V. Cummings, and T.K. Lee. The Design of an Asynchronous MIPS This paper presented a simple API for rate matching- R3000. Proceedings of the 17th Conference on Advanced Research in VLSI, based dynamic voltage scaling. The API provides a September 1997. mechanism for applications to choose their own volt- [18] Thomas Martin. Balancing Batteries, Power and Performance: System Is- ages, and competing applications automatically adjust sues in CPU Speed-Setting for Mobile Computing. Ph.D. thesis, Carnegie their voltages to the same, globally optimal value due to Mellon University, 1999. interactions with our proposed operating system sched- [19] P. Pillai and K.G. Shin. Real-Time Dynamic Voltage Scaling for Low- uler. We evaluated our strategy on a calibrated, full sys- Power Embedded Operating Systems. SOSP 2001 tem simulator using a number of applications running un- [20] T. Pering and R. Brodersen. Energy Efficient Voltage Scheduling for Real- Time Operating Systems. 4th IEEE Real-Time Technology and Applications der both single application and multiprogrammed work- Symposium, 1998, Works In Progress Session. loads. Our results show that RMVS achieves an single [21] J. Pouwelse, K. Langendoen, and H. Sips. Dynamic voltage scaling on a application energy reductions that are within 5.3% of op- low-power microprocessor. Proceedings of the 7th Int. Conference on Mo- timal on average, and multiprocessor reductions that are bile Computing and Networking, July 2001. within 3.4% of the optimal. [22] Semtech. Power Supply Controller for Portable Pentium II & III SpeedStep Processors, Aug 2000.

[23] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. De Micheli. Dy- References namic voltage scaling and power management for portable systems. Pro- ceedings of the 38th Design Automation Conference, 2001. [1] AMD, Mobile AMD Athlon 4 Processor Model 6 CPGA Data Sheet, Sep [24] Marshall McKusick, Keith Bostic, Michael Karels, John Quarterman. The 2001. Design and Implementation of the 4.4 BSD Operating System. Addison- Wesley, 1996. [2] A. C. Bavier, A. B. Montz, L. L. Peterson. Predicting MPEG Execution Times. Proceedings of SIGMETRICS, pp. 131-140, 1998. [25] G. Qu and M. Potkonjak. Energy minimization with guaranteed quality of service. Proc. 2000 international symposium on Low Power Electronics and [3] T. D. Burd, T. A. Pering, A. J. Stratakos and R. Brodersen. Dynamic Voltage Design, pp. 43–49, 2000. Scaled Microprocessor System. IEEE J. Solid-State Circuits, vol. 35, pp. 1571-1580, Nov. 2000. [26] R. Wahbe, S. Lucco, T. E. Anderson, and S. L. Graham. Efficient software- based fault isolation. In Proceedings of the 14th ACM Symposium on Oper- [4] T. D. Burd and R. Brodersen. Design issues for dynamic voltage scaling. ating Systems Principles, pages 203–216, December 1993. Proc. 2000 international symposium on Low Power Electronics and Design, pp. 9–14, 2000. [27] Mark Weiser, Brent Welch, Alan J. Demers, Scott Shenker. Scheduling for Reduced CPU Energy. Proc. of Operating System Design and Implementa- [5] J. Douceur and W. Bolosky. Progress-based Regulation of Lowimportance tion, pp. 13–23, 1994. Processes. In Proceedings of the Seventeenth ACM Symposium on Operat- ing Systems Principles, pages 247–258, December 1999. [28] Y. Zhu and F. Mueller. Feedback EDF Scheduling Exploiting Dynamic Voltage Scaling. Proc. Real-Time and Embedded Technology and Appli- [6] M. Fleischmann. LongRun Power Management - Dynamic Power Manage- cations Symposium, May 2004. ment for Crusoe Processors. Transmeta Corporation, 2001.

[7] K. Govil, E. Chan, and H. Wassermann. Comparing algorithms for dynamic speed-setting of a low-power cpu. Proceedings of the 1st Conference on Mobile Computing and Networking, Mar 1995.

[8] Dirk Grunwand, Phillips Levis, Keith Farkas, Charles B. Morrey III, and Michael Neufeld. Policies for Dynamic Clock Scheduling. Proc. Operating Systems Design and Implementation, 2000.

[9] I. Hong, M. Potkonjak, and M. Srivastava. On-line scheduling of hard real- time tasks. Proc. International Conference on Computer-Aided Design, November 1998.
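
As a cross-check on the averages quoted in Sections 5.2 and 5.3, the normalized energies printed in Figures 4 through 6 can be recombined by hand. The sketch below is illustrative only and is not part of the RMVS implementation: the dictionary and function names are hypothetical, and it assumes that the reported averages are unweighted means over applications. Because the figure values are rounded to two decimal places, the results only approximate the numbers reported in the text.

```python
# Normalized energy per application as (RMVS, optimal); no voltage scaling = 1.0.
# Values are copied from Figures 4, 5 and 6 and are rounded to two decimals.
FIG4_SINGLE = {
    "toast": (0.44, 0.43),
    "untoast": (0.58, 0.54),
    "mp3": (0.31, 0.30),
    "go": (0.82, 0.81),
    "ehgml": (0.69, 0.63),
    "abyss": (0.61, 0.56),
}

FIG5_FIG6_MULTI = {
    "go+mp3/mp3": (0.55, 0.54),
    "go+mp3/go": (0.75, 0.75),
    "gsm-multi/gsm1": (0.47, 0.46),
    "gsm-multi/gsm2": (0.48, 0.45),
    "3app/gsm": (0.44, 0.43),
    "3app/go": (0.86, 0.84),
    "3app/mp3": (0.50, 0.46),
}


def mean_savings(data):
    """Average fraction of energy saved by RMVS relative to no scaling."""
    rmvs = [r for r, _ in data.values()]
    return 1.0 - sum(rmvs) / len(rmvs)


def mean_overhead(data):
    """Average per-application excess of RMVS energy over the optimal energy."""
    excess = [r / o - 1.0 for r, o in data.values()]
    return sum(excess) / len(excess)


if __name__ == "__main__":
    for name, data in (("single", FIG4_SINGLE), ("multi", FIG5_FIG6_MULTI)):
        print(f"{name}: savings {mean_savings(data):.1%}, "
              f"overhead vs optimal {mean_overhead(data):.1%}")
```

With these rounded inputs, the multiprogrammed workloads come out at roughly 3.4% above optimal with about 42% savings, in line with Section 5.3. The single-application savings come out at 42.5%, and the average overhead comes out near 5.5% rather than the reported 5.3%, presumably because of the rounding in the figure labels.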