Dynamic-Compiler-Driven Control for Microprocessor Energy and Performance
Total Page:16
File Type:pdf, Size:1020Kb
DYNAMIC-COMPILER-DRIVEN CONTROL FOR MICROPROCESSOR ENERGY AND PERFORMANCE A GENERAL DYNAMIC-COMPILATION ENVIRONMENT OFFERS POWER AND PERFORMANCE CONTROL OPPORTUNITIES FOR MICROPROCESSORS. THE AUTHORS PROPOSE A DYNAMIC-COMPILER-DRIVEN RUNTIME VOLTAGE AND FREQUENCY OPTIMIZER. A PROTOTYPE OF THEIR DESIGN, IMPLEMENTED AND Qiang Wu DEPLOYED IN A REAL SYSTEM, ACHIEVES ENERGY SAVINGS OF UP TO 70 PERCENT. Margaret Martonosi Energy and power have become pri- apply to other control means as well. Douglas W. Clark mary issues in modern processor design. A dynamic compiler is a runtime software Processor designers face increasingly vexing system that compiles, modifies, and optimizes Princeton University power and thermal challenges, such as reduc- a program’s instruction sequence as it runs. ing average and maximum power consump- Examples of infrastructures based on dynam- tion, avoiding thermal hotspots, and ic compilers include the HP Dynamo,4 the Vijay Janapa Reddi maintaining voltage regulation quality. In IBM DAISY (Dynamically Architected addition to static design time power manage- Instruction Set from Yorktown),5 Intel Dan Connors ment techniques, dynamic adaptive tech- IA32EL6 and the Intel PIN.7 niques are becoming essential because of an Figure 1 shows the architecture of a gener- University of Colorado at increasing gap between worst-case and aver- al dynamic-compiler system. A dynamic-com- age-case demand. pilation system serves as an extra execution Boulder In recent years, researchers have proposed layer between the application binary and the and studied many low-level hardware tech- operating system and hardware. At runtime, niques to address fine-grained dynamic power a dynamic-compilation system interacts with Youfeng Wu and performance control issues.1-3 At a high- application code execution and applies possi- er level, the compiler and the application can ble optimizations to it. Beyond regular per- Jin Lee also take active roles in maximizing micro- formance optimizations, a dynamic compiler processor power control effectiveness and can also apply energy optimizations such as Intel Corp. actively managing power, performance, and DVFS because most DVFS implementations thermal goals. In this article, we explore power allow direct software control through mode control opportunities in a general dynamic- set instructions (by accessing special mode set David Brooks compilation environment for microproces- registers). A dynamic compiler can simply sors. In particular, we look at one control insert DVFS mode set instructions into appli- Harvard University mechanism: dynamic voltage and frequency cation binary code. If CPU execution slack scaling (DVFS). We expect our methods to exists (that is, CPU idle cycles are waiting for 0272-1732/06/$20.00 © 2006 IEEE Published by the IEEE computer Society 119 MICRO TOP PICKS memory), these instructions can scale down Applicaton binary the CPU voltage and frequency to save ener- gy with little or no performance impact. Because a dynamic compiler has access to Dynamic- DVFS compilation optimization high-level information on program code struc- system Performance ture as well as runtime system information, a optimization dynamic-compiler-driven DVFS scheme offers some unique features and advantages not pre- OS and hardware sent in existing hardware-based or static- compiler-based DVFS approaches. (See the Figure 1. General dynamic-compiler system “Why dynamic-compiler-driven DVFS?” side- architecture, serving as an extra execution bar for detail.) layer between application binary and OS Our work is one of the first efforts to devel- and hardware. op dynamic-compiler techniques for micro- processor voltage and frequency control.8 We Why dynamic-compiler-driven DVFS? Most existing research efforts on fine-grained DVFS control fall into trast, dynamic-compiler DVFS can use runtime system information to make one of two categories: hardware or OS interrupt-based approaches or input-adaptive and architecture-adaptive decisions. static-compiler-based approaches. We must also point out that dynamic-compiler DVFS has disadvan- Hardware or OS time-interrupt-based DVFS techniques typically mon- tages. The most significant is that, just as for any dynamic-optimization itor particular system statistics, such as issue queue occupancy,1 in fixed technique, every cycle spent for optimization is a cycle lost to execution. time intervals and choose DVFS settings for future time intervals.1-3 Therefore, our challenge is to design simple and inexpensive analysis and Because the time intervals are predetermined and independent of pro- decision algorithms that minimize runtime optimization cost. gram structure, DVFS control by these methods might not be efficient in adapting to program phase changes. One reason is that program phase changes are generally caused by the invocation of different code regions.4 References Thus, hardware or OS techniques might not be able to infer enough about 1. G. Semeraro et al., “Dynamic Frequency and Voltage Control application code attributes and find the most effective adaptation points. for a Multiple Clock Domain Microarchitecture,” Proc. 35th Another reason is that program phase changes are often recurrent (that Ann. Symp. Microarchitecture (Micro-35), IEEE Press, 2002, is, loops). In this case, the hardware or OS schemes would need to detect pp. 356-367. and adapt to the recurring phase changes repeatedly. A compiler-driven 2. D. Marculescu, “On the Use of Microarchitecture-Driven DVFS scheme can apply DVFS to fine-grained code regions, adapting nat- Dynamic Voltage Scaling,” Proc. Workshop on Complexity- urally to program phase changes. Hardware- or OS-based DVFS schemes Effective Design (WCED 00), 2000, http://www.ece. with fixed intervals lack this code-aware adaptation. rochester.edu/~albonesi/wced00/. Existing compiler DVFS work focuses mainly on static-compiler tech- 3. A. Weissel and F. Bellosa, “Process Cruise Control: Event- niques.5,6 Typically, these techniques use profiling to learn about program Driven Clock Scaling for Dynamic Power Management,” Proc. behavior. Then, offline analysis techniques, such as linear programming,5 Int’l Conf. Compilers, Architecture and Synthesis for Embedded decide on DVFS settings for some code regions. A limitation is that the Systems (CASES 02), ACM Press, 2002, pp. 238-246. DVFS setting obtained at static-compile time might not be appropriate 4. M.C. Huang, J. Renau, and J. Torrellas, “Positional Adaptation for the program at runtime because the profiler and the actual program of Processors: Application to energy reduction,” Proc. 30th have different runtime environments. The reasoning behind the above Ann. Int’l Symp. Computer Architecture (ISCA 03), IEEE Press, statement is that DVFS decisions depend on the program’s memory-bound- 2003, pp. 157-168. edness. In turn, the program’s behavior in terms of memory-boundedness 5. C.-H. Hsu and U. Kremer, “The Design, Implementation, and depends on runtime system characteristics such as machine or architec- Evaluation of a Compiler Algorithm for CPU Energy ture configuration or program input size and patterns. For example, Reduction,” Proc. Conf. Programming Language Design and machine or architecture settings such as cache configuration or memory Implementation (PLDI 03), ACM Press, 2003, pp. 38-48. bus speed can affect how much CPU slack or idle time exists. Also, dif- 6. F. Xie, M. Martonosi, and S. Malik, “Compile-Time Dynamic ferent program input sizes or patterns can affect how much memory must Voltage Scaling Settings: Opportunities and Limits,” Proc. be used and how it must be used. It is thus inherently difficult for a stat- Conf. Programming Language Design and Implementation ic compiler to make DVFS decisions that adapt to these factors. In con- (PLDI 03), ACM Press, 2003, pp. 49-62. 120 IEEE MICRO have developed a design framework for a run- decision model to design a fast DVFS deci- time DVFS optimizer (RDO) in a dynamic- sion algorithm that uses hardware feedback compilation environment. We have information. implemented a prototype RDO and integrat- ed it in an industrial-strength dynamic opti- DVFS code insertion and transformation mization system. We deployed the optimization If the decision algorithm finds DVFS bene- system in a real hardware platform that allows ficial for a candidate code region, we insert us to directly measure CPU current and volt- DVFS mode set instructions at every code age for accurate power and energy readings. To region entry point to start DVFS and at every evaluate the system, we experimented with exit point to restore the voltage level. One design physical measurements for more than 40 SPEC question is how many adjusted regions we want or Olden benchmarks. Evaluation results show to have in a program. Some existing static-com- that the optimizer achieves significant energy piler algorithms choose only a single DVFS efficiency—for example, up to 70 percent ener- code region for a program (to avoid an exces- gy savings (with 0.5 percent performance loss) sively long analysis time).10 In our design, we for the SPEC benchmarks. identify multiple DVFS regions to provide more energy-saving opportunities. In addition to code Design framework insertion, the dynamic compiler can perform There are several key design issues to con- code transformation to create energy-saving sider for the RDO in a dynamic-compilation opportunities—for example, merging two sep- and optimization environment. arate (small) memory-bound code regions into one big one. The DVFS optimizer and the con- Candidate code region selection ventional performance optimizer interact to For cost effectiveness, we want to optimize check that this code merging doesn’t harm the only frequently executed code regions (so- program’s performance or correctness. called hot code regions). In addition, because DVFS is a relatively slow process (with a volt- Operation block diagram age transition rate typically around 1 mv per Figure 2 shows a block diagram of a dynam- 1 μs), we want to optimize only long-running ic-compiler DVFS optimization system’s over- code regions. Therefore, in our design, we all operation and the interactions between its chose functions and loops as candidate code components.