Dynamic Frequency and Voltage Scaling for a Multiple-Clock-Domain Microprocessor

Multiple clock domains are one solution to the growing problem of propagating the clock signal across increasingly large and fast chips. The ability to independently scale frequency and voltage in each domain creates a powerful means of reducing power dissipation.

Grigorios Magklis, Intel
Greg Semeraro, Rochester Institute of Technology
David H. Albonesi, Steven G. Dropsho, Sandhya Dwarkadas, and Michael L. Scott, University of Rochester

IEEE Micro, November-December 2003 (Micro Top Picks). Published by the IEEE Computer Society. 0272-1732/03/$17.00 © 2003 IEEE

Demand for higher processor performance has led to a dramatic increase in clock frequency as well as an increasing number of transistors in the processor core. As chips become faster and larger, designers face significant challenges, including global clock distribution and power dissipation.

A multiple clock domain (MCD) microarchitecture [1], which uses a globally asynchronous, locally synchronous (GALS) clocking style [2,3], permits future aggressive frequency increases, maintains a synchronous design methodology, and exploits the trend of making functional blocks more autonomous. In MCD, each processor domain is internally synchronous, but domains operate asynchronously with respect to one another. Designers still apply existing synchronous design techniques to each domain, but global clock skew is no longer a constraint. Moreover, domains can have independent voltage and frequency control, enabling dynamic voltage scaling at the domain level.

Global dynamic voltage scaling already appears in many systems and can help reduce power dissipation for rate-based and partially idle workloads. An MCD architecture can save power even during intensive computation by slowing domains that are comparatively unimportant to the application's current critical path, even when it is impossible to completely gate off those domains. The disadvantage is the need for interdomain synchronization, which, because of buffering, out-of-order execution, and superscalar data paths, has a relatively minor impact on overall performance: less than 2 percent [4].

MCD potentially has a significant energy advantage with only modest performance cost, if the frequencies and voltages of the various domains assume appropriate values at appropriate times [1]. Designers can implement this control function completely online in hardware, making it transparent to the user and system software [4]. Online control is useful in environments where legacy applications must run without modification, or where significant user involvement is undesirable. Otherwise, profiling and instrumentation of the application provide a more global view of the program than a hardware implementation can, and have the potential to provide better results, provided the behavior observed during the profiling run is consistent with that occurring in production [5]. This article briefly summarizes both of these approaches and compares their performance against a near-optimal offline technique.

MCD microarchitecture

The MCD microarchitecture [1] consists of four different on-chip clock domains, shown in Figure 1, each with independent control of frequency and voltage. In choosing the boundaries among domains, we identified points where

• there already existed a queue structure that decoupled different pipeline functions or
• relatively little interfunction communication occurred.

Main memory is external to the processor, and we can view it as a fifth domain that always runs at full speed.

Figure 1. MCD processor block diagram. (Labels in the diagram: front end; L1 instruction cache; fetch unit; ROB, rename, dispatch; integer unit; integer issue queue; integer ALUs and register file; floating-point unit; floating-point issue queue; floating-point ALUs and register file; load/store unit; L1 data cache; L2 cache; memory; external memory; main memory.)

We based our frequency and voltage-scaling model on the Intel XScale processor (as described by L.T. Clark in the short course "Circuit Design of XScale Microprocessors," at the 2001 Symp. VLSI Circuits). The XScale continues to execute through the voltage/frequency change. There is, however, a substantial delay before the change becomes fully effective.

Key to MCD's fine-grained adaptation is efficient on-chip voltage scaling circuitry, a rapidly emerging technology. New microinductor technologies are paving the way for highly efficient on-chip buck converters [6]. This circuit technology should be mature enough for commercialization within the next few years, and the MCD microarchitecture, including the voltage control algorithms we present, will be ready to take advantage of the technology.

Online control algorithm

Analysis of processor resource utilization reveals a correlation, over an interval of instructions, between the valid entries in the input queue (for each of the integer, floating-point, and load/store domains) and the desired frequency for the domain. This correlation follows from considering the instruction processing core as the domain queue's sink and the front end as the source. Queue utilization indicates the rate at which instructions flow through the core; if utilization increases, instructions are not flowing fast enough. Queue utilization is thus an appropriate metric for dynamically determining the desired domain frequency (except in the front-end domain, which the online algorithm does not attempt to control).

This correlation between issue queue utilization and desired frequency is not without challenges. Notable among them is that changes in a domain's frequency might affect the issue queue utilization of that domain and possibly others. This interaction among the domains is a potential source of error that might degrade performance beyond acceptable thresholds or lead to lower-than-expected energy savings. Interactions might also lead to instability in domain frequencies, as changes in the other domains influence each particular domain.

The online algorithm consists of two components that act independently but cooperatively. The result is a frequency curve that approximates the envelope of the queue utilization curve, yielding a small performance degradation and significant energy savings. In general, an envelope detection algorithm reacts quickly to sudden changes in the input signal (queue utilization, in this case). In the absence of significant changes, this algorithm slowly decreases the controlling parameter.

Such an approach represents a feedback control system. For a control system, if the plant (the entity under control) and the control point (the parameter being adjusted) are linearly related, then the system will be stable, and the control point will correctly adjust to changes in the plant. Because of the rapid adjustments necessary for significant changes in utilization and the otherwise slow adjustments, we call the approach an attack/decay algorithm [4]. The attack-decay-sustain-release (ADSR) envelope-generating techniques in signal processing and signal synthesis inspired this algorithm [7].

The MCD architecture employs the attack/decay algorithm independently in each back-end domain. The hardware counts the entries in the domain issue queue over a 10,000-instruction interval. Using that number and the corresponding number from the prior interval, the algorithm determines if there has been a significant change (a threshold of 1.75 percent), in which case the algorithm uses the attack mode: The frequency changes (up or down as appropriate) by a modest amount (6 percent). If no significant change occurs or if there is no activity in the domain, the algorithm uses the decay mode: It decreases the domain frequency slightly (0.175 percent).

In all cases, if the overall instructions per cycle (IPC) changes by more than a certain threshold (2.5 percent), the frequency remains unchanged for that interval. This convention identifies natural decreases in performance that are unrelated to the domain frequency and prevents the algorithm from reacting to them. Thresholding tends to reduce the interaction of a domain with adjustments in other domains. The IPC performance counter is the only global information that is available to all domains.

To protect against settling at a local minimum when a global minimum exists, the algorithm forces an attack whenever a domain frequency has been at one extreme or the other for 10 consecutive intervals. This is a common technique to apply when a control system reaches an end point and the plant/control relationship becomes undefined.

Profile-based control algorithm

The profile-based control algorithm has four phases: It

• uses standard performance profiling techniques to identify subroutines and loop nests that run long enough to justify reconfiguration;
• constructs a directed acyclic graph (DAG) that represents dependences among domain operations in these long-running fragments of code, and distributes the slack in the DAG to minimize energy;
• uses per-domain histograms of operating frequencies to identify, for each long-running code fragment, the minimum frequency for each domain that would [...]

Figure 2. Call tree with associated instruction counts. The shaded nodes are candidates for reconfiguration. (The tree has nodes A through N, annotated with instruction counts from 1K to 75K.)

After running the binary code and collecting its statistics, we annotate each tree node with the dynamic instances and the total instructions executed, from which we can [...]
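
Returning to the online control algorithm, the attack/decay rules described there can be illustrated with a minimal software sketch of one back-end domain's controller. This is not the authors' hardware implementation. The thresholds (1.75, 6, 0.175, and 2.5 percent, and the 10-interval end-stop) are the ones quoted in the text; everything else is an assumption made for illustration: the class and function names, the 250 MHz to 1 GHz frequency range, measuring change relative to the previous interval, checking the IPC hold before the forced attack, and forcing the attack away from whichever extreme the frequency is stuck at.

```python
from dataclasses import dataclass

# Thresholds quoted in the text.
ATTACK_THRESHOLD = 0.0175   # "significant" change in queue utilization
ATTACK_STEP = 0.06          # frequency change in attack mode
DECAY_STEP = 0.00175        # frequency decrease in decay mode
IPC_THRESHOLD = 0.025       # IPC change that freezes the frequency
ENDSTOP_INTERVALS = 10      # intervals at an extreme before a forced attack

# Assumed frequency range (hypothetical; not stated in the text).
F_MIN_MHZ = 250.0
F_MAX_MHZ = 1000.0


def relative_change(current, previous):
    """Signed change relative to the previous interval (an assumed definition)."""
    if previous == 0:
        return 0.0 if current == 0 else float("inf")
    return (current - previous) / previous


def clamp(freq_mhz):
    """Keep the frequency inside the assumed operating range."""
    return max(F_MIN_MHZ, min(F_MAX_MHZ, freq_mhz))


@dataclass
class DomainController:
    """Hypothetical per-domain state; call step() once per 10,000-instruction interval."""
    freq_mhz: float = F_MAX_MHZ
    prev_queue_count: float = 0.0
    prev_ipc: float = 0.0
    intervals_at_extreme: int = 0

    def step(self, queue_count, ipc):
        queue_delta = relative_change(queue_count, self.prev_queue_count)
        ipc_delta = relative_change(ipc, self.prev_ipc)
        self.prev_queue_count, self.prev_ipc = queue_count, ipc

        if abs(ipc_delta) > IPC_THRESHOLD:
            # Large IPC change: hold the frequency for this interval.
            new_freq = self.freq_mhz
        elif self.intervals_at_extreme >= ENDSTOP_INTERVALS:
            # Forced attack, assumed to move away from the extreme.
            direction = -1 if self.freq_mhz >= F_MAX_MHZ else 1
            new_freq = self.freq_mhz * (1 + direction * ATTACK_STEP)
        elif queue_count > 0 and abs(queue_delta) > ATTACK_THRESHOLD:
            # Attack: follow the direction of the utilization change.
            direction = 1 if queue_delta > 0 else -1
            new_freq = self.freq_mhz * (1 + direction * ATTACK_STEP)
        else:
            # Decay: no significant change, or no activity in the domain.
            new_freq = self.freq_mhz * (1 - DECAY_STEP)

        self.freq_mhz = clamp(new_freq)
        at_extreme = self.freq_mhz in (F_MIN_MHZ, F_MAX_MHZ)
        self.intervals_at_extreme = self.intervals_at_extreme + 1 if at_extreme else 0
        return self.freq_mhz
```

One controller instance would be created per back-end domain (integer, floating-point, load/store), each fed its own issue queue count and the shared IPC value. With the default starting state of zero for the previous counts, the first interval registers as a large IPC change and simply holds the frequency.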
