
DYNAMIC FREQUENCY AND VOLTAGE SCALING FOR A MULTIPLE-CLOCK-DOMAIN MICROPROCESSOR

MULTIPLE CLOCK DOMAINS IS ONE SOLUTION TO THE INCREASING PROBLEM OF PROPAGATING THE CLOCK SIGNAL ACROSS INCREASINGLY LARGER AND FASTER CHIPS. THE ABILITY TO INDEPENDENTLY SCALE FREQUENCY AND VOLTAGE IN EACH DOMAIN CREATES A POWERFUL MEANS OF REDUCING POWER DISSIPATION.

Grigorios Magklis, Intel
Greg Semeraro, Rochester Institute of Technology
David H. Albonesi, Steven G. Dropsho, Sandhya Dwarkadas, and Michael L. Scott, University of Rochester

Published by the IEEE Computer Society. 0272-1732/03/$17.00 © 2003 IEEE

Demand for higher performance has led to a dramatic increase in clock frequency as well as an increasing number of transistors in the processor core. As chips become faster and larger, designers face significant challenges, including global clock distribution and power dissipation.

A multiple clock domain (MCD) microarchitecture [1], which uses a globally asynchronous, locally synchronous (GALS) clocking style [2, 3], permits future aggressive frequency increases, maintains a synchronous design methodology, and exploits the trend of making functional blocks more autonomous. In MCD, each processor domain is internally synchronous, but domains operate asynchronously with respect to one another. Designers still apply existing synchronous design techniques to each domain, but global clock skew is no longer a constraint. Moreover, domains can have independent voltage and frequency control, enabling dynamic voltage scaling at the domain level.

Global dynamic voltage scaling already appears in many systems and can help reduce power dissipation for rate-based and partially idle workloads. An MCD architecture can save power even during intensive computation by slowing domains that are comparatively unimportant to the application’s current critical path, even when it is impossible to completely gate off those domains. The disadvantage is the need for interdomain synchronization, which, because of buffering, out-of-order execution, and superscalar data paths, has a relatively minor impact on overall performance, less than 2 percent [4].

MCD potentially has a significant energy advantage with only modest performance cost, if the frequencies and voltages of the various domains assume appropriate values at appropriate times [1]. Designers can implement this control function completely online in hardware, making it transparent to the user and system software [4]. Online control is useful in environments where legacy applications must run without modification, or where significant user involvement is undesirable. Otherwise, profiling and instrumentation of the application provide a more global view of the program than a hardware implementation can, and have the potential to provide better results, if the behavior observed during the profiling run is consistent with that occurring in production [5]. This article briefly summarizes both of these approaches and compares their performance against a near-optimal offline technique.

MCD

The MCD microarchitecture [1] consists of four different on-chip clock domains, shown in Figure 1, each with independent control of frequency and voltage. In choosing the boundaries among domains, we identified points where

• there already existed a queue structure that decoupled different pipeline functions or
• relatively little interfunction communication occurred.

Main memory is external to the processor, and we can view it as a fifth domain that always runs at full speed.

Figure 1. MCD processor block diagram. (Diagram labels: front end with L1 instruction cache, fetch unit, and ROB, rename, dispatch; integer unit with integer issue queue and integer ALUs and register file; floating-point unit with floating-point issue queue and floating-point ALUs and register file; load/store unit with L1 data cache and L2 cache; external memory with main memory.)

We based our frequency and voltage-scaling model on the Intel XScale processor (as described by L.T. Clark in the short course “Circuit Design of Xscale [...],” at the 2001 Symp. VLSI Circuits). The XScale continues to execute through the voltage/frequency change. There is, however, a substantial delay before the change becomes fully effective.

Key to MCD’s fine-grained adaptation is efficient, on-chip voltage scaling circuitry, a rapidly emerging technology. New microinductor technologies are paving the way for highly efficient on-chip buck converters [6]. This circuit technology should be mature enough for commercialization within the next few years, and the MCD microarchitecture, including the voltage control algorithms we present, will be ready to take advantage of the technology.
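To make the scaling model more concrete, the following minimal sketch estimates how long a frequency change takes to become fully effective and how much dynamic power a slower operating point draws. It uses the frequency range, voltage range, and 55 µs full-range traversal time reported in the Results section of this article. Everything else is an assumption for illustration: the linear frequency-to-voltage mapping, the constant voltage slew rate, the textbook CMOS approximation that dynamic power scales with V²f, and all function names.

```python
F_MIN_HZ, F_MAX_HZ = 250e6, 1e9   # frequency range reported in the Results section
V_MIN, V_MAX = 0.65, 1.2          # corresponding voltage range, in volts
FULL_SWING_S = 55e-6              # time to traverse the entire voltage range


def voltage_for(freq_hz):
    """Assumed linear frequency-to-voltage mapping (not specified in the article)."""
    if not F_MIN_HZ <= freq_hz <= F_MAX_HZ:
        raise ValueError("frequency outside the supported range")
    frac = (freq_hz - F_MIN_HZ) / (F_MAX_HZ - F_MIN_HZ)
    return V_MIN + frac * (V_MAX - V_MIN)


def transition_time_s(f_from_hz, f_to_hz):
    """Assumed constant slew rate: delay is proportional to the voltage distance."""
    dv = abs(voltage_for(f_to_hz) - voltage_for(f_from_hz))
    return FULL_SWING_S * dv / (V_MAX - V_MIN)


def relative_dynamic_power(freq_hz):
    """Textbook approximation P ~ V^2 * f, normalized to the full-speed point."""
    v = voltage_for(freq_hz)
    return (v * v * freq_hz) / (V_MAX * V_MAX * F_MAX_HZ)


if __name__ == "__main__":
    print(f"full swing: {transition_time_s(1e9, 250e6) * 1e6:.1f} us")
    print(f"relative power at 625 MHz: {relative_dynamic_power(625e6):.3f}")
```

Because the processor keeps executing during the transition, a delay of this size costs little by itself; it mainly limits how often a domain can usefully be reconfigured.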

NOVEMBER–DECEMBER 2003 63 MICRO TOP PICKS

Online control algorithm

Analysis of processor resource utilization reveals a correlation, over an interval of instructions, between the valid entries in the input queue (for each of the integer, floating-point, and load/store domains) and the desired frequency for the domain. This correlation follows from considering the instruction processing core as the domain queue’s sink and the front end as the source. Queue utilization indicates the rate at which instructions flow through the core; if utilization increases, instructions are not flowing fast enough. Queue utilization is thus an appropriate metric for dynamically determining the desired domain frequency (except in the front-end domain, which the online algorithm does not attempt to control).

This correlation between issue queue utilization and desired frequency is not without challenges. Notable among them is that changes in a domain’s frequency might affect the issue queue utilization of that domain and possibly others. This interaction among the domains is a potential source of error that might degrade performance beyond acceptable thresholds or lead to lower-than-expected energy savings. Interactions might also lead to instability in domain frequencies, as changes in the other domains influence each particular domain.

The online algorithm consists of two components that act independently but cooperatively. The result is a frequency curve that approximates the envelope of the queue utilization curve, producing a small performance degradation and significant energy savings. In general, an envelope detection algorithm reacts quickly to sudden changes in the input signal (queue utilization, in this case). In the absence of significant changes, this algorithm slowly decreases the controlling parameter. Such an approach represents a control system. For a control system, if the plant (the entity under control) and the control point (the parameter being adjusted) are linearly related, then the system will be stable, and the control point will correctly adjust to changes in the plant. Because of the rapid adjustments necessary for significant changes in utilization and the otherwise slow adjustments, we call the approach an attack/decay algorithm [4]. The attack-decay-sustain-release (ADSR) envelope-generating techniques in signal processing and signal synthesis inspired this algorithm [7].

The MCD architecture employs the attack/decay algorithm independently in each back-end domain. The hardware counts the entries in the domain issue queue over a 10,000-instruction interval. Using that number and the corresponding number from the prior interval, the algorithm determines whether there has been a significant change (a threshold of 1.75 percent), in which case the algorithm uses the attack mode: The frequency changes (up or down as appropriate) by a modest amount (6 percent). If no significant change occurs or if there is no activity in the domain, the algorithm uses the decay mode: It decreases the domain frequency slightly (0.175 percent).

In all cases, if the overall instructions per cycle (IPC) changes by more than a certain threshold (2.5 percent), the frequency remains unchanged for that interval. This convention identifies natural decreases in performance that are unrelated to the domain frequency and prevents the algorithm from reacting to them. Thresholding tends to reduce the interaction of a domain with adjustments in other domains. The IPC performance is the only global information that is available to all domains.

To protect against settling at a local minimum when a global minimum exists, the algorithm forces an attack whenever a domain frequency has been at one extreme or the other for 10 consecutive intervals. This is a common technique to apply when a control system reaches an end point and the plant/control relationship becomes undefined.

Profile-based control algorithm

The profile-based control algorithm has four phases. The first uses standard performance profiling techniques to identify subroutines and loop nests that run long enough to justify reconfiguration. The second constructs a directed acyclic graph (DAG) that represents dependences among domain operations in these long-running fragments of code, and distributes the slack in the DAG to minimize energy.
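The attack/decay rules given in the online control algorithm section can be sketched as a small per-domain controller. This is a software illustration under stated assumptions, not the hardware design: the thresholds and step sizes come from the text and the frequency limits from the Results section, but the class and field names are hypothetical, percentage steps are assumed to apply to the current frequency, a forced attack is assumed to move away from the extreme, and the IPC guard is assumed to take precedence over every other rule.

```python
from dataclasses import dataclass

F_MIN_MHZ, F_MAX_MHZ = 250.0, 1000.0  # frequency range from the Results section
UTIL_THRESHOLD = 0.0175    # significant change in queue utilization (1.75 percent)
ATTACK_STEP = 0.06         # attack mode: 6 percent frequency change
DECAY_STEP = 0.00175       # decay mode: 0.175 percent frequency decrease
IPC_THRESHOLD = 0.025      # IPC guard (2.5 percent)
ENDSTOP_INTERVALS = 10     # forced attack after this many intervals at an extreme


@dataclass
class DomainController:
    """Hypothetical per-domain controller; step() is called once per interval."""
    freq_mhz: float = F_MAX_MHZ
    prev_util: float = 0.0
    prev_ipc: float = 0.0
    endstop_count: int = 0

    def step(self, util: float, ipc: float) -> float:
        at_extreme = self.freq_mhz in (F_MIN_MHZ, F_MAX_MHZ)
        self.endstop_count = self.endstop_count + 1 if at_extreme else 0

        factor = 1.0 - DECAY_STEP  # decay unless a rule below overrides it
        if self.prev_ipc > 0 and abs(ipc - self.prev_ipc) / self.prev_ipc > IPC_THRESHOLD:
            factor = 1.0  # IPC guard: hold the frequency for this interval
        elif self.endstop_count >= ENDSTOP_INTERVALS:
            # Forced attack, assumed to point away from the extreme.
            factor = 1.0 + ATTACK_STEP if self.freq_mhz == F_MIN_MHZ else 1.0 - ATTACK_STEP
            self.endstop_count = 0
        elif util > 0 and self.prev_util > 0:
            # With no activity or no prior baseline, the decay default stands.
            change = (util - self.prev_util) / self.prev_util
            if abs(change) > UTIL_THRESHOLD:
                factor = 1.0 + ATTACK_STEP if change > 0 else 1.0 - ATTACK_STEP

        self.freq_mhz = min(F_MAX_MHZ, max(F_MIN_MHZ, self.freq_mhz * factor))
        self.prev_util, self.prev_ipc = util, ipc
        return self.freq_mhz
```

Feeding the controller a utilization trace that rises sharply and then stays flat produces one 6 percent attack followed by a slow decay, which is the envelope-following behavior the section describes.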

IEEE MICRO

Figure 2. Call tree with associated instruction counts. The shaded nodes are candidates for reconfiguration. (Tree diagram: nodes A through N annotated with instruction counts from 1K to 75K.)

The third phase uses per-domain histograms of operating frequencies to identify, for each long-running code fragment, the minimum frequency for each domain that would permit execution to complete within a fixed slow-down bound. The fourth phase edits the application’s binary code to embed path-tracking and reconfiguration instructions that will instruct the hardware to adopt appropriate frequencies at appropriate times during production runs.

Choosing reconfiguration points

Phase one uses a binary editing tool [8] to instrument subroutines and loops. When the instrumented binary executes, it counts the number of times each subroutine or loop executes in a given context. In the most general case we considered, a call tree that captures all call sites between main and the current point of execution can represent a context. The call tree differs from the static call graph that many compilers construct because it has a separate node for every path over which a given subroutine is reachable (it will also be missing any nodes that the profiling tool did not encounter during its run). The call tree is not a true dynamic call trace, but a compressed one, which superimposes multiple instances of the same path. For example, if the program calls a subroutine from inside a loop, the subroutine will have the same call history every time, and can be represented by a single node in the tree, even though the program might have actually called it many times.

After running the binary code and collecting its statistics, we annotate each tree node with the number of dynamic instances and the total instructions executed, from which we can calculate the average instructions per instance (including instructions executed in the node’s children). We then identify all nodes that run long enough (10,000 instructions or more) for a frequency change to take effect and to have a potential impact on energy consumption. Starting from the leaves and working up, we identify all nodes whose average instance (excluding instructions executed in long-running children) exceeds 10,000 instructions. Figure 2 shows a call tree; the long-running nodes are shaded. Note that these nodes, taken together, are guaranteed to cover almost all of the application history in the profiled run.

The shaker algorithm

To select frequencies and corresponding voltages for long-running tree nodes, we run the application through a heavily modified version of the SimpleScalar/Wattch tool kit [9, 10], with all clock domains at full frequency. During this run, in phase two, we collect a trace of all primitive events (temporally contiguous work performed within a single hardware unit on behalf of a single instruction), and the functional and data dependences among these events. The trace output is a dependence DAG for each long-running node in the call tree. Working from this DAG, the shaker algorithm attempts to “stretch” (make longer in time) individual events that are not on the application’s critical execution path.
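Events that are off the critical path are exactly those with slack, so finding slack is the starting point for any stretching. As a hedged illustration, the sketch below is not the shaker algorithm; it applies the standard critical-path method (a forward pass for earliest start times and a backward pass for latest start times) to a small dependence DAG. The function name, event names, and durations are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def event_slack(durations, deps):
    """Return the slack of each event in a dependence DAG.

    durations: {event: execution time}
    deps: {event: set of events that must finish before it starts}
    """
    graph = {e: deps.get(e, set()) for e in durations}
    order = list(TopologicalSorter(graph).static_order())

    # Forward pass: earliest time each event can start.
    earliest = {}
    for e in order:
        earliest[e] = max((earliest[p] + durations[p] for p in graph[e]), default=0)
    finish = max(earliest[e] + durations[e] for e in order)

    # Invert the dependence arcs so each event knows its successors.
    successors = {e: [] for e in durations}
    for e, preds in graph.items():
        for p in preds:
            successors[p].append(e)

    # Backward pass: latest time each event can start without delaying the finish.
    latest = {}
    for e in reversed(order):
        latest[e] = min((latest[s] for s in successors[e]), default=finish) - durations[e]

    return {e: latest[e] - earliest[e] for e in order}


if __name__ == "__main__":
    durations = {"fetch": 2, "int_op": 3, "fp_op": 1, "commit": 1}
    deps = {"int_op": {"fetch"}, "fp_op": {"fetch"}, "commit": {"int_op", "fp_op"}}
    print(event_slack(durations, deps))
```

In this toy DAG only `fp_op` has nonzero slack, so it is the only event that could be stretched without lengthening the critical path.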


Events off the application’s critical execution path are stretched as if they could run at their own, event-specific, lower frequency.

Whenever an event in a dependence DAG has two or more incoming arcs, it is likely that one arc constitutes the critical path and that the others will have slack. Slack indicates that the previous operation completed earlier than necessary. If all of the outgoing arcs of an event have slack, then we have an opportunity to save energy by performing the event at a lower frequency. With each event in the DAG, we associate a power factor whose initial value is based on the relative power consumption of the corresponding clock domain in our processor model. When we stretch an event, we scale its power factor accordingly.

The shaker tries to distribute slack as uniformly as possible. It begins at the end of the DAG and works backward. When it encounters a stretchable event whose power factor exceeds the current threshold (originally set to be slightly below that of the few most power-intensive events in the graph), the shaker scales the event until it either consumes all the available slack or its power factor drops below the current threshold. If any slack remains, the event moves later, so that as much slack as possible occurs at its incoming edges.

When the shaker reaches the beginning of the DAG, it reverses direction, reduces its power threshold by a small amount, and makes a new pass forward through the DAG, scaling high-power events and moving slack to outgoing edges. It repeats this back-and-forth until all the available slack is consumed, or until all the events adjacent to slack edges have been scaled to the minimum permissible frequency. When it completes its work, the shaker constructs a per-domain summary histogram that indicates, for each of the frequency steps, the total cycles for events in the domain that have been scaled to run at or near that frequency. A combination of the histograms for multiple dynamic instances of the same tree node then becomes the input to the slow-down thresholding algorithm.

Slow-down thresholding

Phase three recognizes that we cannot in practice scale the frequency of individual events: We must scale each domain as a whole. If we are willing to tolerate a small performance degradation, d, we can choose a frequency that causes some events to run slower than ideal. Using the histograms generated by the shaker algorithm, we choose a frequency based on all the events in higher bins of the histogram. For the chosen frequency, the extra time necessary to execute those events must be less than or equal to d percent of the total time required to execute all the events in the node, run at their ideal frequencies.

Application editing

In phase four, to effect the reconfigurations chosen by the slow-down thresholding algorithm, we must insert code at the beginning and end of each long-running subroutine or loop. Although the instrumentation overhead necessary to track the full definition of context is low (about 9 extra cycles for each 10,000-instruction interval, plus 8 more cycles if the frequency requires changing), simply tracking the [...] yields results that are almost as accurate as in this full definition.

The results we present associate a single desired frequency with each long-running subroutine or loop, regardless of calling context. At the beginning of each such code fragment, the instrumented binary writes a statically known frequency into an MCD hardware reconfiguration register. More complex definitions of context require additional instrumentation as well as a lookup table containing the frequencies chosen by the slow-down thresholding algorithm of phase three [5].

Results

We assume a processor microarchitecture similar to that of the Alpha 21264 with a frequency range of 250 MHz to 1 GHz and a corresponding voltage range of 0.65 V to 1.2 V. Traversing the entire voltage range requires 55 µs. We select applications from the MediaBench and SPEC CPU2000 suites. For the profile-based approach, we use a smaller training input data set during profiling, but gather final results using the larger reference data set.

The MCD processor has an inherent performance penalty of less than 2 percent compared to its globally clocked counterpart, and an energy penalty of about 1 percent. Figure 3 shows energy × delay improvements for the online and profile-based algorithms relative to the baseline MCD processor.
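Since the results are reported as energy × delay improvements, it may help to see how an energy saving and a slowdown combine into that single figure. The sketch below assumes the improvement is one minus the ratio of energy-delay products relative to the baseline, and that performance degradation means a fractional increase in execution time. The function name and the sample inputs are illustrative and are not taken from the article's measurements.

```python
def energy_delay_improvement(energy_saving, slowdown):
    """Fractional energy x delay improvement over a baseline.

    energy_saving: fraction of baseline energy saved (0.20 means 20 percent less)
    slowdown: fractional increase in execution time (0.05 means 5 percent slower)
    """
    energy_ratio = 1.0 - energy_saving
    delay_ratio = 1.0 + slowdown
    return 1.0 - energy_ratio * delay_ratio


if __name__ == "__main__":
    # Hypothetical operating point: 20 percent energy saved at a 5 percent slowdown.
    print(f"{energy_delay_improvement(0.20, 0.05):.2%}")
    # A slowdown with no energy saving gives a negative improvement.
    print(f"{energy_delay_improvement(0.0, 0.05):.2%}")
```

Under this definition a control decision that slows execution without a matching energy reduction shows up as a negative improvement, which is how a benchmark can register a slight degradation.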

Figure 3. Energy × delay improvement results. (Bar chart of percentage improvement, on an axis from −5 to 45, for the online, offline, and profile-based algorithms on adpcm_decode, adpcm_encode, epic_decode, epic_encode, g721_decode, g721_encode, gsm_decode, gsm_encode, jpeg_decode, jpeg_encode, mpeg2_decode, mpeg2_encode, gzip, vpr, mcf, swim, applu, art, equake, and the average.)

The baseline for these comparisons is the MCD processor with no voltage control. We obtain the so-called offline results with perfect future knowledge [1].

For all three control strategies, the average performance degradation (not shown) is approximately 7 percent. The online algorithm achieves a significant overall energy × delay improvement, about 17 percent, although its reactive nature results in a slight degradation for one benchmark. As expected, profiling yields better and more consistent results, about a 27 percent overall energy × delay improvement, nearly matching that of the omniscient offline algorithm.

The MCD approach alleviates many of the bottlenecks of fully synchronous systems, while exploiting proven synchronous design methodologies. The union of the MCD microarchitecture with emerging on-chip voltage scaling technology permits fine-grained voltage scaling that is broadly applicable. Both the online and profile-based techniques that we have developed exploit this capability to provide significant energy savings.

References

1. G. Semeraro et al., “Energy-Efficient Processor Design Using Multiple Clock Domains with Dynamic Voltage and Frequency Scaling,” Proc. 8th Int’l Symp. High-Performance Computer Architecture (HPCA 02), IEEE CS Press, 2002, pp. 29-40.
2. D.M. Chapiro, “Globally Asynchronous Locally Synchronous Systems,” PhD thesis, Stanford Univ., 1984.
3. A. Iyer and D. Marculescu, “Power and Performance Evaluation of Globally Asynchronous Locally Synchronous Processors,” Proc. 29th Int’l Symp. Computer Architecture (ISCA 02), IEEE CS Press, 2002, pp. 158-170.
4. G. Semeraro et al., “Dynamic Frequency and Voltage Control for a Multiple Clock Domain Microarchitecture,” Proc. 35th Ann. IEEE/ACM Int’l Symp. Microarchitecture (MICRO-35), IEEE CS Press, 2002, pp. 356-370.
5. G. Magklis et al., “Profile-based Dynamic Voltage and Frequency Scaling for a Multiple Clock Domain Microprocessor,” Proc. 30th Int’l Symp. Computer Architecture (ISCA 03), ACM Press, 2003, pp. 14-27.
6. V. Kursun et al., “Analysis of Buck Converters for On-Chip Integration with a Dual Supply Voltage Microprocessor,” IEEE Trans. VLSI Systems, vol. 11, no. 3, June 2003, pp. 514-522.
7. K. Jensen, “Envelope Model of Isolated Musical Sounds,” Proc. 2nd COST G-6 Workshop on Digital Audio Effects (DAFx99), Norwegian University of Science and Technology, 1999, pp. W99-1–W99-5.
8. A. Eustace and A. Srivastava, “ATOM: A Flexible Interface for Building High Performance Program Analysis Tools,” Proc. Usenix 1995 Technical Conf., Usenix Assoc., 1995, pp. 303-314.
9. D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A Framework for Architectural-Level Power Analysis and Optimizations,” Proc. 27th Int’l Symp. Computer Architecture (ISCA 00), IEEE CS Press, 2000, pp. 83-94.
10. D. Burger and T. Austin, The SimpleScalar Tool Set, Version 2.0, technical report CS-TR-97-1342, Computer Science Dept., Univ. of Wisconsin, June 1997.

Grigorios Magklis is a researcher at the Intel-UPC Barcelona Research Center. His research interests include architecture, operating systems, application analysis, and tools. Magklis is a PhD candidate and has an MSc in computer science from the University of Rochester. He is a member of the ACM and IEEE.

Greg Semeraro is an assistant professor in the Department of Computer Engineering at the Rochester Institute of Technology. His research interests include the modeling, analysis, and simulation of microarchitecture; digital and real-time systems; and nonlinear control systems. Semeraro has a PhD in electrical and computer engineering from the University of Rochester. He is a member of the IEEE Computer Society, the IEEE Education Society, and the American Society for Engineering Education.

David H. Albonesi is an associate professor in the Department of Electrical and Computer Engineering at the University of Rochester. His research interests include microarchitecture with an emphasis on adaptive architectures, power-aware computing, and multithreading. Albonesi has a PhD from the University of Massachusetts at Amherst. He is a senior member of the IEEE, and a member of the IEEE Computer Society and the ACM.

Steven G. Dropsho is a postdoctoral researcher in the Department of Computer Science at the University of Rochester. His research interests include architecture, power efficiency, and parallel and distributed systems. Dropsho has a PhD in computer science from the University of Massachusetts at Amherst. He is a member of the IEEE Computer Society and the ACM.

Sandhya Dwarkadas is an associate professor in the Department of Computer Science at the University of Rochester. Her research interests include parallel and [...], computer architecture, and networks, and the interactions among and interfaces between the compiler, runtime system, and underlying architecture. Dwarkadas has a PhD in electrical and computer engineering from Rice University. She is a member of the IEEE, the IEEE Computer Society, and the ACM.

Michael L. Scott is a professor of computer science at the University of Rochester. His research interests include operating systems, languages, architecture, and tools, with a particular emphasis on parallel and distributed systems. He has a PhD in computer sciences from the University of Wisconsin-Madison. He is a member of the IEEE, the IEEE Computer Society, and the ACM.

Direct questions and comments about this article to David H. Albonesi, Computer Studies Bldg., University of Rochester, PO Box 270231, Rochester, NY 14627-0231; [email protected].

For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.
