DYNAMIC-COMPILER-DRIVEN CONTROL FOR MICROPROCESSOR ENERGY AND PERFORMANCE

A GENERAL DYNAMIC-COMPILATION ENVIRONMENT OFFERS POWER AND PERFORMANCE CONTROL OPPORTUNITIES FOR MICROPROCESSORS. THE AUTHORS PROPOSE A DYNAMIC-COMPILER-DRIVEN RUNTIME VOLTAGE AND FREQUENCY OPTIMIZER. A PROTOTYPE OF THEIR DESIGN, IMPLEMENTED AND DEPLOYED IN A REAL SYSTEM, ACHIEVES ENERGY SAVINGS OF UP TO 70 PERCENT.

Qiang Wu, Margaret Martonosi, and Douglas W. Clark, Princeton University
Vijay Janapa Reddi and Dan Connors, University of Colorado at Boulder
Youfeng Wu and Jin Lee, Intel Corp.
David Brooks, Harvard University

Energy and power have become primary issues in modern processor design. Processor designers face increasingly vexing power and thermal challenges, such as reducing average and maximum power consumption, avoiding thermal hotspots, and maintaining voltage regulation quality. In addition to static design-time power management techniques, dynamic adaptive techniques are becoming essential because of an increasing gap between worst-case and average-case demand.

In recent years, researchers have proposed and studied many low-level hardware techniques to address fine-grained dynamic power and performance control issues.1-3 At a higher level, the compiler and the application can also take active roles in maximizing microprocessor power control effectiveness and actively managing power, performance, and thermal goals.

In this article, we explore power control opportunities in a general dynamic-compilation environment for microprocessors. In particular, we look at one control mechanism: dynamic voltage and frequency scaling (DVFS). We expect our methods to apply to other control means as well.

A dynamic compiler is a runtime software system that compiles, modifies, and optimizes a program's instruction sequence as it runs. Examples of infrastructures based on dynamic compilers include the HP Dynamo,4 the IBM DAISY (Dynamically Architected Instruction Set from Yorktown),5 the Intel IA32EL,6 and the Intel Pin.7

Figure 1 shows the architecture of a general dynamic-compiler system. A dynamic-compilation system serves as an extra execution layer between the application binary and the operating system and hardware. At runtime, it interacts with application code execution and applies possible optimizations to it. Beyond regular performance optimizations, a dynamic compiler can also apply energy optimizations such as DVFS, because most DVFS implementations allow direct software control through mode set instructions (by accessing special mode set registers). A dynamic compiler can simply insert DVFS mode set instructions into application binary code. If CPU execution slack exists (that is, if CPU idle cycles are waiting for memory), these instructions can scale down the CPU voltage and frequency to save energy with little or no performance impact.

0272-1732/06/$20.00 © 2006 IEEE. Published by the IEEE Computer Society.

MICRO TOP PICKS

Figure 1. General dynamic-compiler system architecture, serving as an extra execution layer between the application binary and the OS and hardware.

Because a dynamic compiler has access to high-level information on program code structure as well as runtime system information, a dynamic-compiler-driven DVFS scheme offers some unique features and advantages not present in existing hardware-based or static-compiler-based DVFS approaches. (See the "Why dynamic-compiler-driven DVFS?" sidebar for details.)

Our work is one of the first efforts to develop dynamic-compiler techniques for microprocessor voltage and frequency control.8 We

Why dynamic-compiler-driven DVFS?

Most existing research efforts on fine-grained DVFS control fall into one of two categories: hardware or OS time-interrupt-based approaches, or static-compiler-based approaches.

Hardware or OS time-interrupt-based DVFS techniques typically monitor particular system statistics, such as issue queue occupancy,1 in fixed time intervals and choose DVFS settings for future time intervals.1-3 Because the time intervals are predetermined and independent of program structure, DVFS control by these methods might not be efficient in adapting to program phase changes. One reason is that program phase changes are generally caused by the invocation of different code regions.4 Thus, hardware or OS techniques might not be able to infer enough about application code attributes and find the most effective adaptation points. Another reason is that program phase changes are often recurrent (that is, loops). In this case, the hardware or OS schemes would need to detect and adapt to the recurring phase changes repeatedly. A compiler-driven DVFS scheme can apply DVFS to fine-grained code regions, adapting naturally to program phase changes. Hardware- or OS-based DVFS schemes with fixed intervals lack this code-aware adaptation.

Existing compiler DVFS work focuses mainly on static-compiler techniques.5,6 Typically, these techniques use profiling to learn about program behavior. Then, offline analysis techniques, such as linear programming,5 decide on DVFS settings for some code regions. A limitation is that the DVFS setting obtained at static-compile time might not be appropriate for the program at runtime, because the profiler and the actual program have different runtime environments. The reason is that DVFS decisions depend on the program's memory-boundedness, which in turn depends on runtime system characteristics such as the machine or architecture configuration and the program's input size and patterns. For example, machine or architecture settings such as cache configuration or memory bus speed can affect how much CPU slack or idle time exists. Also, different program input sizes or patterns can affect how much memory must be used and how it must be used. It is thus inherently difficult for a static compiler to make DVFS decisions that adapt to these factors. In contrast, dynamic-compiler DVFS can use runtime system information to make input-adaptive and architecture-adaptive decisions.

We must also point out that dynamic-compiler DVFS has disadvantages. The most significant is that, just as for any dynamic-optimization technique, every cycle spent on optimization is a cycle lost to execution. Therefore, our challenge is to design simple and inexpensive analysis and decision algorithms that minimize runtime optimization cost.

Sidebar references
1. G. Semeraro et al., "Dynamic Frequency and Voltage Control for a Multiple Clock Domain Microarchitecture," Proc. 35th Ann. Symp. Microarchitecture (Micro-35), IEEE Press, 2002, pp. 356-367.
2. D. Marculescu, "On the Use of Microarchitecture-Driven Dynamic Voltage Scaling," Proc. Workshop on Complexity-Effective Design (WCED 00), 2000, http://www.ece.rochester.edu/~albonesi/wced00/.
3. A. Weissel and F. Bellosa, "Process Cruise Control: Event-Driven Clock Scaling for Dynamic Power Management," Proc. Int'l Conf. Compilers, Architecture and Synthesis for Embedded Systems (CASES 02), ACM Press, 2002, pp. 238-246.
4. M.C. Huang, J. Renau, and J. Torrellas, "Positional Adaptation of Processors: Application to Energy Reduction," Proc. 30th Ann. Int'l Symp. Computer Architecture (ISCA 03), IEEE Press, 2003, pp. 157-168.
5. C.-H. Hsu and U. Kremer, "The Design, Implementation, and Evaluation of a Compiler Algorithm for CPU Energy Reduction," Proc. Conf. Programming Language Design and Implementation (PLDI 03), ACM Press, 2003, pp. 38-48.
6. F. Xie, M. Martonosi, and S. Malik, "Compile-Time Dynamic Voltage Scaling Settings: Opportunities and Limits," Proc. Conf. Programming Language Design and Implementation (PLDI 03), ACM Press, 2003, pp. 49-62.

have developed a design framework for a runtime DVFS optimizer (RDO) in a dynamic-compilation environment. We have implemented a prototype RDO and integrated it in an industrial-strength dynamic-optimization system. We deployed the optimization system in a real hardware platform that allows us to directly measure CPU current and voltage for accurate power and energy readings. To evaluate the system, we experimented with physical measurements for more than 40 SPEC or Olden benchmarks. Evaluation results show that the optimizer achieves significant energy efficiency: for example, up to 70 percent energy savings (with 0.5 percent performance loss) for the SPEC benchmarks.

Design framework
There are several key design issues to consider for the RDO in a dynamic-compilation and optimization environment.

Candidate code region selection
For cost effectiveness, we want to optimize only frequently executed code regions (so-called hot code regions). In addition, because DVFS is a relatively slow process (with a voltage transition rate typically around 1 mV per 1 μs), we want to optimize only long-running code regions. Therefore, in our design, we chose functions and loops as candidate code regions. Most dynamic-optimization systems are already equipped with a lightweight profiling mechanism to identify hot code regions (for example, DynamoRIO profiles every possible loop target9). We extended the existing profiling infrastructure to monitor and identify hot functions or loops.

DVFS decisions
For each candidate code region, an important step is to decide whether applying DVFS is beneficial (that is, whether the region can operate at a lower voltage and frequency to save energy without significant impact on overall performance) and to determine the appropriate DVFS setting. For a dynamic-optimization system, the analysis or decision algorithm must be simple and fast to minimize overhead. The offline analysis techniques used by static-compiler DVFS10 are typically too time consuming and are not appropriate here. For our work, we followed an analytical decision model to design a fast DVFS decision algorithm that uses hardware feedback information.

DVFS code insertion and transformation
If the decision algorithm finds DVFS beneficial for a candidate code region, we insert DVFS mode set instructions at every code region entry point to start DVFS and at every exit point to restore the voltage level. One design question is how many adjusted regions to have in a program. Some existing static-compiler algorithms choose only a single DVFS code region per program, to avoid an excessively long analysis time.10 In our design, we identify multiple DVFS regions to provide more energy-saving opportunities. In addition to code insertion, the dynamic compiler can perform code transformation to create energy-saving opportunities, for example, merging two separate (small) memory-bound code regions into one big one. The DVFS optimizer and the conventional performance optimizer interact to check that this code merging doesn't harm the program's performance or correctness.

Operation block diagram
Figure 2 shows a block diagram of a dynamic-compiler DVFS optimization system's overall operation and the interactions between its components. At the start, the dynamic optimizer dispatches or patches original binary code and delivers it to the hardware for execution. The system is now in cold-code execution mode. While the cold code executes, the dynamic-optimization system monitors and identifies the hot code regions. Then, the RDO applies optimization to the hot code regions, either before or after conventional performance optimizations have occurred. The system is now in hot-code execution mode, in which the now-optimized code is executed. Finally, if a code transformation is desirable, the RDO queries the regular performance optimizer to check the code transformation's feasibility.
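The hot-region identification step can be sketched in a few lines. This is our own illustration of the kind of lightweight execution-count profiling the text describes; the threshold value and callback name are invented, not the RDO's.

```python
# Illustrative sketch of lightweight hot-region profiling: every
# instrumented function or loop entry bumps a counter, and a region
# graduates to "hot" once it crosses a threshold. The threshold here
# is an assumed value for illustration only.

HOT_THRESHOLD = 1000      # assumed execution-count threshold

region_counts = {}
hot_regions = set()

def on_region_entry(region_id):
    """Profiling stub conceptually injected at every candidate region entry."""
    n = region_counts.get(region_id, 0) + 1
    region_counts[region_id] = n
    if n == HOT_THRESHOLD:
        hot_regions.add(region_id)   # hand off to the DVFS decision logic
```

Once a region is in `hot_regions`, the optimizer can replace this cheap counter stub with the heavier DVFS testing and decision code, so the profiling cost is paid only while a region is still cold.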

JANUARY–FEBRUARY 2006

DVFS decision algorithms
To make DVFS decisions, the RDO first inserts testing and decision code at a candidate code region's entry and exit points. The code collects runtime information, such as the number of cache misses or memory bus transactions, for this code region. When sufficient information has been collected, the RDO tests whether the code region is long-running. Then it decides whether applying DVFS is beneficial (saving energy with little or no performance cost) and what the appropriate DVFS setting is for the candidate region.

The key observation supporting beneficial DVFS is the existence of an asynchronous memory system that is independent of the CPU clock and many times slower than the CPU. Therefore, if we can identify CPU execution slack (CPU stall or idle cycles waiting for the completion of memory operations), we can scale down CPU voltage and frequency to save energy without much performance impact.

Figure 2. Operation and interaction block diagram of the dynamic-compiler DVFS optimization system.

Figure 3 shows our analytical decision model for DVFS based on this rationale; it is an extension of the analytical model proposed by Xie, Martonosi, and Malik.11 It categorizes processor operations into memory operations and CPU operations. Because memory is asynchronous with respect to CPU frequency f, we denote memory operation time as t_asyn_mem. CPU operation time can be further divided: Part 1 is CPU operations that can run concurrently with memory operations; part 2 is CPU operations dependent on the final results of pending memory operations. Because CPU operation time depends on CPU frequency f, we denote concurrent CPU operation time as N_concurrent/f, where N_concurrent is the number of clock cycles for concurrent CPU operations. Similarly, we denote dependent CPU operation time as N_dependent/f.

Figure 3. Analytical decision model for DVFS (t_asyn_mem is the asynchronous memory access time, N_concurrent is the number of execution cycles for concurrent CPU operation, N_dependent is the number of cycles for dependent CPU operation, and f is the CPU clock frequency).

Figure 3 shows that if the overlap period is memory-bound, that is, if t_asyn_mem > N_concurrent/f, there is a CPU slack time defined as

CPU slack time = t_asyn_mem − N_concurrent/f

Ideally, the concurrent CPU operation can be slowed down to consume the CPU slack time. Following this model, we want to compute the frequency-scaling factor β for a candidate code region. (Thus, if the original clock frequency is f, the new clock frequency will be βf, and the voltage will be scaled accordingly.) We introduce a new concept called relative CPU slack time, which we define as CPU slack time over total execution time. For the memory-bound case, total_time = t_asyn_mem + N_dependent/f. From Figure 3, we see that the larger the relative CPU slack, the more frequency reduction the system can have without affecting overall performance. So the frequency reduction (1 − β) is proportional to the relative CPU slack time. A derivation we have detailed elsewhere8 gives the following equation for β:

β = 1 − P_loss · k0 · (t_asyn_mem / total_time) + P_loss · k0 · ((N_concurrent/f) / total_time)

where P_loss is the maximum allowed performance loss expressed as a percentage, and k0 is a constant coefficient that depends on the machine configuration.
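As a concreteness check, the equation can be evaluated directly. The sketch below is ours, not the RDO's implementation: the value of k0 and all inputs in the comments are placeholders, and we clamp β at 1 on the reasoning that a CPU-bound region (zero or negative slack) should get no frequency reduction.

```python
def scaling_factor(p_loss, k0, t_asyn_mem, n_concurrent, n_dependent, f):
    """Frequency-scaling factor beta from the analytical model.

    total_time follows the memory-bound case in the text:
    total_time = t_asyn_mem + n_dependent / f. Times are in seconds,
    f in Hz, p_loss a fraction (0.05 for 5 percent), and k0 the
    machine-dependent coefficient (any value used here is assumed,
    not taken from the article).
    """
    total_time = t_asyn_mem + n_dependent / f
    beta = (1.0
            - p_loss * k0 * t_asyn_mem / total_time
            + p_loss * k0 * (n_concurrent / f) / total_time)
    return min(beta, 1.0)   # a CPU-bound region gets no frequency reduction
```

Note the two terms mirror the intuition stated next in the text: the t_asyn_mem term pulls β down as memory intensity grows, while the N_concurrent term pushes it back up as CPU intensity grows.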

Figure 4. Processor power measurement setup: running system (a); signal-conditioning unit (b); data acquisition unit (c); data-logging unit (d).

Intuitively, this equation means that the scaling factor is negatively proportional to the memory intensity level (the term including t_asyn_mem) and positively proportional to the CPU intensity level (the term including N_concurrent).

We can estimate the time ratios in this equation by using hardware feedback information such as hardware performance counter (HPC) events. For example, for an x86 processor, we can estimate the first time ratio in the equation by using the ratio of two HPC events: the number of memory bus transactions and the number of micro-operations retired.8

Implementation and deployment methods and experience
We have implemented a prototype of the proposed RDO and integrated it into a real dynamic-compilation system. To evaluate it, we conducted live-system physical power measurements.

Implementation
To implement our DVFS algorithm and develop the RDO, we used the Intel Pin system as the basic software platform. Pin is a dynamic instrumentation and compilation system and is publicly available (http://rogue.colorado.edu/Pin/index.html). We modified the regular Pin system to make it more suitable and more convenient for dynamic optimizations. We refer to the modified system as O-Pin (Optimization Pin). Compared with the standard Pin package, O-Pin has more features supporting dynamic optimizations, such as adaptive code replacement (the instrumented code can update and replace itself at runtime) and customized trace or code region selection. In addition, unlike the standard Pin, which is JIT (just-in-time compiler) based and executes generated code only, O-Pin takes a partial-JIT approach and executes a mix of original and generated code. For example, we can configure O-Pin to first patch, instrument, and profile the original code at a coarse granularity (such as function calls only). Then, at runtime, it selectively generates JIT code and performs more fine-grained profiling and optimization of the dynamically compiled code (such as all loops inside a function). Therefore, O-Pin has less operation overhead than regular Pin.

We outline the prototype RDO system's operation as follows:

1. The RDO instruments all function calls in the program and all loops in the main function to monitor and identify frequently executed code regions.
2. If the RDO finds that a candidate code region is hot (that is, its execution count is greater than a certain threshold), the DVFS testing and decision code starts to collect runtime information and decide how memory-bound the code region is.
3. If the code region is memory-bound, the RDO removes the instrumentation code, inserts DVFS mode set instructions, and resumes program execution. On the other hand, if the code region is CPU-bound, no DVFS instructions are inserted.
4. For the medium case, in which the candidate code region exhibits mixed memory behavior (probably because it contains both memory-bound and CPU-bound subregions), the RDO checks whether it is a long-running function containing loops. If it is, the RDO dynamically generates a copy of the function and identifies and instruments all loops inside it.
5. The process repeats.8
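The three-way test in steps 2 through 4 can be sketched using the HPC ratio mentioned earlier, memory bus transactions per micro-operation retired. The two cutoff values below are invented for illustration; the article does not give the RDO's actual thresholds.

```python
# Sketch of the memory-boundedness classification in steps 2-4,
# based on the ratio of memory bus transactions to micro-operations
# retired. Both cutoffs are assumed values, not the RDO's.

MEM_BOUND_CUTOFF = 0.010   # assumed: above this ratio => memory-bound
CPU_BOUND_CUTOFF = 0.002   # assumed: below this ratio => CPU-bound

def classify_region(bus_transactions, uops_retired):
    """Return 'memory', 'cpu', or 'mixed' for a profiled candidate region."""
    ratio = bus_transactions / uops_retired
    if ratio >= MEM_BOUND_CUTOFF:
        return "memory"    # step 3: insert DVFS mode set instructions
    if ratio <= CPU_BOUND_CUTOFF:
        return "cpu"       # step 3: leave the region at full speed
    return "mixed"         # step 4: split the region and profile its loops
```

With these assumed cutoffs, a region like 173.applu's jacld() (roughly 24,800 bus transactions per million micro-operations, per Table 1) would classify as memory-bound.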


Deployment in a real system
We deployed our RDO system in a real running system. Figure 4a shows the hardware platform, an Intel development board containing a Pentium M processor. The Pentium M we used has a maximum clock frequency of 1.6 GHz, two 32-Kbyte L1 caches, and a unified 1-Mbyte L2 cache. The board has a 400-MHz front-side bus and 512 Mbytes of double-data-rate RAM.

The Pentium M has six DVFS settings, or "SpeedSteps" (expressed in frequency/voltage pairs): 1.6 GHz/1.48 V, 1.4 GHz/1.42 V, 1.2 GHz/1.27 V, 1.0 GHz/1.16 V, 800 MHz/1.04 V, and 600 MHz/0.96 V. The DVFS voltage transition rate is about 1 mV per 1 μs (according to our own measurements).

The OS is Linux kernel 2.4.18 (with GCC updated to 3.3.2). We implemented two loadable kernel modules to provide user-level support for DVFS control and HPC reading in the form of system calls.

The running system allows accurate power measurements. Figure 4 shows the entire processor power measurement setup, which includes the following four components:

• Running system. This unit (Figure 4a) isolates and measures CPU voltage and current signals. We isolate and measure CPU power rather than the entire board's power because we want more deterministic and accurate results, unaffected by random factors on the board. We use the main voltage regulator's output sense resistors to measure current going to the CPU, and the bulk capacitor to measure CPU voltage.
• Signal-conditioning unit. This unit (Figure 4b) reduces measurement noise for more accurate readings. Measurement noise from sources such as the CPU board is inevitable. Because noise typically has a far higher frequency than the measured signals, we use a two-layer low-pass filter to reduce it.
• Data acquisition (DAQ) unit. This unit (Figure 4c) samples and reads the voltage and current signals. Capturing program behavior variations (especially with DVFS) requires a fast sampling rate. We use National Instruments' DAQPad-6070E data acquisition system, which has a maximum aggregate sampling rate of 1.2 Msamples/s (http://www.ni.com/dataacquisition). We set a sampling rate of 200,000 samples per second for each channel (5-μs sample period), which is more than adequate for our reads.
• Data-logging and -processing unit. This unit (Figure 4d) is the host logging machine, which processes the sampled data. Every 0.1 seconds, the DAQ unit sends collected data to the host logging machine via a high-speed FireWire cable. The logging machine then processes the received data. We use a regular laptop running National Instruments' LabVIEW DAQ software to process the data. We configured LabVIEW for various tasks: monitoring, raw data recording, and power and energy computation.

Experimental results
For all experiments, we chose a performance loss target P_loss of 5 percent. (With a larger P_loss, the resulting frequency settings would be lower, allowing more aggressive energy savings. Conversely, a smaller P_loss would lead to higher and more conservative DVFS settings.) Because the voltage transition time between SpeedSteps is about 100 μs to 500 μs for our machine,12 we set the long-running threshold for a code region to 1.5 ms (or 2.4 million cycles for a 1.6-GHz processor) to make it at least three times the voltage transition time.

For evaluation, we used all the SPECfp2000 and SPECint2000 benchmarks. Because previous static-compiler DVFS work10 used SPECfp95 benchmarks, we also included them in our benchmark suites. In addition, we included several Olden benchmarks,13 which are popular integer benchmarks for studying program memory behavior. For each benchmark, we used the Intel C++/Fortran compiler V8.1 to obtain the application binary (compiled with optimization level 2). We tested each benchmark with the largest reference input set running to completion. The power and performance results reported here are average results obtained from three separate runs.
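The power and energy computation at the logging end reduces to integrating the sampled traces. The sketch below shows the arithmetic only, not the authors' actual LabVIEW configuration: instantaneous power is V × I per sample pair, and energy is the power sum times the 5-μs sample period.

```python
# Turning sampled voltage and current traces into power and energy.
# Samples arrive at 200,000 per second per channel, i.e., 5 us apart.

SAMPLE_PERIOD_S = 5e-6

def power_trace(volts, amps):
    """Instantaneous CPU power, one value per sample pair (watts)."""
    return [v * i for v, i in zip(volts, amps)]

def energy_joules(volts, amps):
    """Energy over the capture: integrate P = V * I over the sample period."""
    return sum(power_trace(volts, amps)) * SAMPLE_PERIOD_S
```

For example, a steady 1.2 V at 5 A reads as 6 W per sample, and four such samples (20 μs) integrate to 1.2 × 10^-4 J.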

Table 1. Information and DVFS settings obtained by RDO for SPEC benchmark 173.applu. Average numbers are per million micro-operations retired. (Total hot regions: 72; total DVFS regions: 5.)

Region   Total micro-ops   Avg. L2 cache misses   Avg. memory transactions   Avg. instructions retired   DVFS setting
name     (millions)        (thousands)            (thousands)                (millions)                  (GHz)
jacld()  208               12.4                   24.8                       0.99                        0.8
blts()   286               5.9                    11.5                       0.99                        1.2
jacu()   156               12.7                   25.6                       0.99                        0.8
buts()   254               7.0                    12.9                       0.99                        1.2
rhs()    188               4.2                    8.2                        1.0                         1.4

Table 1 shows program information and results obtained by the RDO system for SPEC benchmark 173.applu, a program for elliptic partial differential equations. The table lists the total number of hot code regions in the program and the total number of DVFS regions identified. For each DVFS code region, it shows the total number of micro-operations retired for the code region (in a single invocation), the average number of L2 cache misses, the average number of memory bus transactions, the average number of instructions retired, and the obtained DVFS setting.

Figure 5 shows part of the CPU voltage and power traces for 173.applu running with RDO. By inserting DVFS instructions directly into the five code regions indicated in the table, RDO adjusts the CPU voltage and frequency to adapt to program phase changes. Specifically, as program execution enters code region jacld(), RDO scales the CPU clock frequency (and voltage) from the default 1.6 GHz to the selected DVFS setting of 0.8 GHz. When the program enters the next code region, blts(), the clock frequency switches to 1.2 GHz; and so on. The power trace is also interesting. Initially it fluctuates around 11 W (because of various system-switching activities). After program execution enters the DVFS code regions, power drops dramatically, to a level as low as 2.5 W. As the experimental results will show, DVFS optimization applied to the code regions in 173.applu led to considerable energy savings (~35 percent) with little performance loss (~5 percent).

Figure 5. Partial traces of CPU voltage and power for SPEC benchmark 173.applu running with RDO.

Energy and performance results
As Figure 2 shows, we view the RDO as an addition to the regular dynamic (performance) optimization system. So, to isolate the contribution of the DVFS optimization, we report energy and performance results relative to the O-Pin system without DVFS (we don't want to mix the effect of our DVFS optimization with that of the underlying dynamic-compilation and optimization system being developed by researchers at Intel and the University of Colorado). In addition, as a comparison, we report the energy results obtained from static voltage scaling, which scales supply voltage and frequency statically for all benchmarks to get roughly the same amount of average performance loss as that in our results. (We chose f = 1.4 GHz for static voltage scaling, which is the only voltage setting point in our system that produces an average performance loss close to 5 percent.)
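The headline metric that follows, energy delay product, is simple enough to state in code. This sketch is ours and the sample numbers are invented; it only illustrates how an EDP improvement percentage is derived from measured energy and runtime.

```python
# EDP = energy x runtime; improvement is the relative EDP reduction
# versus a baseline run (here, conceptually, O-Pin without DVFS).

def edp(energy_j, runtime_s):
    """Energy delay product in joule-seconds."""
    return energy_j * runtime_s

def edp_improvement_pct(base_energy, base_time, opt_energy, opt_time):
    """Percent EDP improvement of the optimized run over the baseline."""
    base = edp(base_energy, base_time)
    opt = edp(opt_energy, opt_time)
    return 100.0 * (base - opt) / base
```

For instance, with made-up numbers, a run that saves 35 percent energy (100 J down to 65 J) while running 5 percent longer (10 s up to 10.5 s) improves EDP by 31.75 percent, which is why EDP rewards energy savings only when the performance cost stays small.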


Figure 6. Energy delay product (EDP) improvement for SPECint2000 (a), Olden (b), SPECfp95 (c), and SPECfp2000 (d) benchmarks, obtained from RDO and static voltage scaling (StaticScale).

Figure 6 shows the energy delay product (EDP) improvement results obtained for all our benchmarks from RDO and static voltage scaling (StaticScale). These results take into account all DVFS optimization overhead, such as the time required for checking a code region's memory-boundedness.

These results lead to several interesting observations. First, in terms of EDP improvement, RDO outperforms StaticScale by a wide margin for nearly all benchmarks. Second, energy and performance results for individual benchmarks in each suite vary significantly (from −1 percent to 70 percent for RDO; from −14 percent to 20 percent for StaticScale). This is because individual applications vary widely in their degree of memory-boundedness. In particular, the SPECint2000 programs are almost all CPU-bound (except for 181.mcf) and hence are nearly immune to our attempted optimizations.

Table 2 summarizes the average results for each benchmark suite, showing the results from our technique alongside the StaticScale results. The EDP improvement results for RDO represent a three- to fivefold better

improvement than the StaticScale results. (We present additional experimental results, including those for basic O-Pin overhead, in another publication.8)

Table 2. Average results for each benchmark suite: RDO versus StaticScale.

                  Performance degradation (%)   Energy savings (%)    EDP improvement (%)
Benchmark suite   RDO      StaticScale          RDO    StaticScale    RDO    StaticScale
SPECfp95          2.1      7.9                  24.1   13.0           22.4   5.6
SPECfp2000        3.3      7.0                  24.0   13.5           21.5   6.8
SPECint2000       0.7      11.6                 6.5    11.5           6.0    −0.3
Olden             3.7      7.8                  25.3   13.7           22.7   6.3

Overall, the results in Figure 6 and Table 2 show that the proposed technique addresses the microprocessor energy and performance control problem well. We attribute these promising results to our design's efficiency and to the advantages of the dynamic-compiler-driven approach.

Microarchitectural suggestions
The experimental results are promising, but we could improve them if more microarchitectural support were available. For example, logic to identify and predict CPU execution slack, such as that proposed by Fields, Bodik, and Hill,14 would make the DVFS computation easier and more accurate. Another supporting mechanism would be power-aware hardware-monitoring counters and events to keep track of a processor unit's power consumption and voltage variations. In addition, finer-grained DVFS settings would make the intratask DVFS design more effective. In our experiments, RDO was forced to select an unnecessarily high voltage or frequency setting for many code regions because the Pentium M lacks intermediate steps between its six SpeedSteps.

The proposed technique's orthogonal approach and advantages make it an effective complement to existing techniques and will help us work toward a multilayer (software and hardware), collaborative power management scheme. In addition, the design framework and methodology described here are applicable to other emerging microprocessor problems, such as di/dt and thermal control. Such techniques and frameworks offer great promise for adaptive power and performance management in future microprocessors. MICRO

Acknowledgments
We thank Gilberto Contreras, Ulrich Kremer, Chung-Hsing Hsu, C.-K. Luk, Robert Cohn, and Kim Hazelwood for their helpful discussions during the development of our design framework and methodology. We also thank the anonymous reviewers for their useful comments and suggestions about this article. Qiang Wu was supported by an Intel Foundation Graduate Fellowship. Our work was also supported in part by NSF grants CCR-0086031 (ITR), CNS-0410937, and CCF-0429782, and by Intel, IBM, and SRC.

References
1. D. Marculescu, "On the Use of Microarchitecture-Driven Dynamic Voltage Scaling," Proc. Workshop on Complexity-Effective Design (WCED 00), 2000, http://www.ece.rochester.edu/~albonesi/wced00/.
2. G. Semeraro et al., "Dynamic Frequency and Voltage Control for a Multiple Clock Domain Microarchitecture," Proc. 35th Ann. Symp. Microarchitecture (Micro-35), IEEE Press, 2002, pp. 356-367.
3. A. Weissel and F. Bellosa, "Process Cruise Control: Event-Driven Clock Scaling for Dynamic Power Management," Proc. Int'l Conf. Compilers, Architecture and Synthesis for Embedded Systems (CASES 02), ACM Press, 2002, pp. 238-246.
4. V. Bala, E. Duesterwald, and S. Banerjia, "Dynamo: A Transparent Dynamic Optimization System," Proc. Conf. Programming Language Design and Implementation (PLDI 00), ACM Press, 2000, pp. 1-12.
5. K. Ebcioglu and E.R. Altman, "DAISY: Dynamic Compilation for 100% Architectural


Compatibility,” Proc. 24th Ann. Int’l Symp. His research interests include power-aware Computer Architecture (ISCA 97), IEEE microprocessor design and control. Wu has Press, pp. 26-37. an MSc in electrical engineering from the 6. L. Baraz et al., “IA-32 Execution Layer: A University of Sydney. He is a student mem- Two-Phase Dynamic Translator Designed to ber of the IEEE and the ACM. Support IA-32 Applications on Itanium-Based Systems,” Proc. 36th Ann. Symp. Micro- Margaret Martonosi is a professor of electri- architecture (Micro-36), ACM Press, 2003, cal engineering at Princeton University. Her pp. 191-204. research interests include computer architec- 7. C.-K. Luk et al., “Pin: Building Customized ture and hardware-software interfaces, with a Program Analysis Tools with Dynamic focus on power-efficient systems and mobile Instrumentation,” Proc. Conf. Programming computing. Martonosi has a BS from Cornell Language Design and Implementation (PLDI University and an MS and a PhD from Stan- 05), ACM Press, 2005, pp. 190-200. ford University, all in electrical engineering. 8. Q. Wu, et al., “A Dynamic Compilation She is a senior member of the IEEE and a Framework for Controlling Microprocessor member of the ACM. Energy and Performance,” Proc. 38th Ann. Symp. Microarchitecture (Micro-38), ACM Douglas W. Clark is a professor of computer Press, 2005, pp. 271-282. science at Princeton University. His research 9. D. Bruening, T. Garnett, and S. Amarasinghe, interests include computer architecture, low- “An Infrastructure for Adaptive Dynamic power techniques, and clocking and timing Optimization,” Proc. Int’l Symp. Code in digital systems. Clark has a BS in engi- Generation and Optimization (CGO 03), ACM neering and applied science from Yale Uni- Press, 2003, pp. 265-275. versity and a PhD in computer science from 10. C.-H. Hsu and U. Kremer, “The Design, Carnegie-Mellon University. 
Implementation, and Evaluation of a Compiler Algorithm for CPU Energy Vijay Janapa Reddi is pursuing a PhD in the Reduction,” Proc. Conf. Programming Department of Electrical and Computer Language Design and Implementation (PLDI Engineering at the University of Colorado at 03), ACM Press, 2003, pp. 38-48. Boulder. His research interests include the 11. F. Xie, M. Martonosi, and S. Malik, design of dynamic code transformation sys- “Compile-Time Dynamic Voltage Scaling tems for deployment in everyday computing Settings: Opportunities and Limits,” Proc. environments. Reddi has a BSc in computer Conf. Programming Language Design and engineering from Santa Clara University. Implementation (PLDI 03), ACM Press, 2003, pp. 49-62. Dan Connors is an assistant professor in the 12. S. R. Gochman et al., “The Intel Pentium M Department of Electrical and Computer Engi- Processor: Microarchitecture and neering and the Department of Computer Sci- Performance,” Intel Technology J., vol. 7, ence at the University of Colorado at Boulder no. 2, May 2003, pp. 21-36. and the leader of the DRACO research group 13. M.C. Carlisle et al., “Early Experiences with on runtime code transformation. His research Olden,” Proc. 6th Int’l Workshop on interests include modern compiler optimiza- Languages and Compilers for Parallel tion and multithreaded multicore architecture Computing (LCPC 93), 1993, http://www.cs. design. Connors has an MS and PhD in elec- princeton.edu/~mcc/olden.html. trical engineering from the University of Illi- 14. B. Fields, R. Bodik, and M.D. Hill, “Slack: nois at Urbana-Champaign. Maximizing Performance under Tech- nological Constraints,” Proc. 29th Int’l Symp. Youfeng Wu is a principal engineer in Intel’s Computer Architecture (ISCA 02), IEEE Corporate Technology Group and leads a Press, 2002, pp. 47-58. research team on multiprocessor compilation and dynamic binary optimizations. 
His Qiang Wu is a PhD candidate in Princeton research interests include codesigned power- University’s Computer Science Department. efficient computer systems, binary and

128 IEEE MICRO dynamic optimizations, and security and safe- research interests include architectural-level ty enhancement via compiler and binary tools. power modeling and power-efficient design Wu has an MS and PhD in computer science of hardware and software for embedded and from Oregon State University. He is a mem- high-performance computer systems. Brooks ber of the ACM and the IEEE. has a BS from the University of Southern Cal- ifornia and an MA and PhD from Princeton Jin Lee is the president and CEO of Amicus University, all in electrical engineering. He is Wireless Technology. He previously worked at a member of the IEEE. Intel, where he participated in the work described in this article. His research interests Direct questions and comments about this include compiler technology, network proces- article to Qiang Wu, Dept. of Computer Sci- sors, and semiconductor fabrication technolo- ence, Princeton University, Princeton, NJ gies. Lee has an MS and PhD in mechanical 08544; [email protected]. engineering from . For further information on this or any other David Brooks is an assistant professor of com- computing topic, visit our Digital Library at puter science at Harvard University. His http://www.computer.org/publications/dlib.

JANUARY–FEBRUARY 2006 129