
POWER-AWARE MICROARCHITECTURE: Design and Modeling Challenges for Next-Generation Microprocessors

The ability to estimate power consumption during early-stage definition and trade-off studies is a key new methodology enhancement. Opportunities for saving power can be exposed via microarchitecture-level modeling, particularly through clock gating and dynamic adaptation.

David M. Brooks, Pradip Bose, Stanley E. Schuster, Hans Jacobson, Prabhakar N. Kudva, Alper Buyuktosunoglu, John-David Wellman, Victor Zyuban, Manish Gupta, and Peter W. Cook
IBM T.J. Watson Research Center
0272-1732/00/$10.00 © 2000 IEEE

Power dissipation limits have emerged as a major constraint in the design of microprocessors. At the low end of the performance spectrum, namely in the world of handheld and portable devices or systems, power has always dominated over performance (execution time) as the primary design issue. Battery life and system cost constraints drive the design team to consider power over performance in such a scenario.

Increasingly, however, power is also a key design issue in the workstation and server markets (see Gowan et al.1). In this high-end arena the increasing microarchitectural complexities, clock frequencies, and die sizes push the chip-level—and hence the system-level—power consumption to such levels that traditionally air-cooled multiprocessor server boxes may soon need budgets for liquid-cooling or refrigeration hardware. This need is likely to cause a break point—with a step upward—in the ever-decreasing price-performance ratio curve. As such, a design team that considers power consumption and dissipation limits early in the design cycle, and can thereby adopt an inherently lower power microarchitectural line, will have a definite edge over competing teams.

Thus far, most of the work done in the area of high-level power estimation has been focused at the register-transfer-level (RTL) description in the design flow. Only recently have we seen a surge of interest in estimating power at the microarchitecture definition stage, and specific work on power-efficient microarchitecture design has been reported.2-8

Here, we describe the approach of using energy-enabled performance simulators in early design. We examine some of the emerging paradigms in processor design and comment on their inherent power-performance characteristics.

Power-performance efficiency

See the "Power-performance fundamentals" box. The most common (and perhaps obvious) metric to characterize the power-performance efficiency of a processor is a simple ratio, such as MIPS (million instructions per second)/watts (W). This attempts to quantify efficiency by projecting the performance achieved or gained (measured in MIPS) for every watt of power consumed. Clearly, the higher the number, the "better" the machine is. While this approach seems a reasonable choice for some purposes, there are strong arguments against it in many cases, especially when it comes to characterizing higher end processors. Performance has typically been the key driver of such server-class designs, and cost or efficiency issues have been of secondary importance.

Power-performance fundamentals at the microarchitecture level

At the elementary transistor gate level, we can formulate total power dissipation as the sum of three major components: switching loss, leakage, and short-circuit loss.1-4

  P_device = (1/2) C V_DD V_swing a f + I_leakage V_DD + I_sc V_DD    (A)

Here, C is the output capacitance, V_DD is the supply voltage, f is the chip clock frequency, and a is the activity factor (0 < a < 1) that determines the device switching frequency. V_swing is the voltage swing across the output capacitor. I_leakage is the leakage current, and I_sc is the average short-circuit current. The literature often approximates V_swing as equal to V_DD (or simply V for short), making the switching loss around (1/2)CV²af. Also, for current ranges of V_DD (say, 1 volt to 3 volts) the switching loss, (1/2)CV²af, remains the dominant component.2 So as a first-order approximation for the whole chip we may formulate the power dissipation as

  P_chip = (1/2) Σ_i C_i V_i² a_i f_i    (B)

C_i, V_i, a_i, and f_i are unit- or block-specific average values. The summation is taken over all blocks or units i at the microarchitecture level (instruction cache, data cache, integer unit, floating-point unit, load-store unit, register files, and buses). For the voltage range considered, the operating frequency is roughly proportional to the supply voltage; C remains roughly the same if we keep the same design but scale the voltage. If a single voltage and clock frequency is used for the whole chip, the formula reduces to

  P_chip = V³ (Σ_i K_i^v a_i) = f³ (Σ_i K_i^f a_i)    (C)

where the K's are unit- or block-specific constants. If we consider the worst-case activity factor for each unit i—that is, if a_i = 1 for all i—then

  P_chip = K_v V³ = K_f f³    (D)

where K_v and K_f are design-specific constants. That equation leads to the so-called cube-root rule.2 This points to the single most efficient method for reducing power dissipation for a processor designed to operate at high frequency: reduce the voltage (and hence the frequency). This is the primary mechanism of power control in Transmeta's Crusoe chip (http://www.transmeta.com). There's a limit, however, on how much V_DD can be reduced (for a given technology), which has to do with manufacturability and circuit reliability issues. Thus, a combination of microarchitecture and circuit techniques to reduce power consumption—without necessarily employing multiple or variable supply voltages—is of special relevance.

Performance basics

The most straightforward metric for measuring performance is the execution time of a representative workload mix on the target processor. We can write the execution time as

  T = PL × CPI × CT = PL × CPI × (1/f)    (E)

Here, PL is the dynamic path length of the program mix, measured as the number of machine instructions executed; CPI is the average processor cycles per instruction incurred in executing the program mix; and CT is the processor cycle time (measured in seconds per cycle), whose inverse determines clock frequency f. Since performance increases with decreasing T, we may formulate performance PF as

  PF_chip = K_pf f = K_pv V    (F)

Here, the K's are constants for a given microarchitecture-compiler implementation. The K_pf value stands for the average number of machine instructions executed per cycle on the machine being measured. PF_chip in this case is measured in MIPS.

Selecting a suite of publicly available benchmark programs that everyone accepts as being representative of real-world workloads is difficult. Adopting a noncontroversial weighted mix is also not easy. For the commonly used SPEC benchmark suite (http://www.specbench.org), the SPECmark rating for each class is derived as a geometric mean of execution time ratios for the programs within that class. Each ratio is calculated as the speedup with respect to execution time on a specified reference machine. If we believe in the particular benchmark suite, this method has the advantage of allowing us to rank different machines unambiguously from a performance viewpoint. That is, we can show the ranking as independent of the reference machine used in such a formulation.

References
1. V. Zyuban, "Inherently Lower-Power High-Performance Superscalar Architectures," PhD dissertation, Univ. of Notre Dame, Dept. of Computer Science and Engineering, 2000.
2. M.J. Flynn et al., "Deep-Submicron Microprocessor Design Issues," IEEE Micro, Vol. 19, No. 4, July/Aug. 1999, pp. 11-22.
3. S. Borkar, "Design Challenges of Technology Scaling," IEEE Micro, Vol. 19, No. 4, July/Aug. 1999, pp. 23-29.
4. R. Gonzalez and M. Horowitz, "Energy Dissipation in General Purpose Microprocessors," IEEE J. Solid-State Circuits, Vol. 31, No. 9, Sept. 1996, pp. 1277-1284.
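The scaling behavior captured in Equations B through D is easy to check numerically. The sketch below uses invented per-unit capacitances and activity factors (placeholders, not values from the article) to show that scaling voltage and frequency together by a factor s scales switching power by s³:

```python
# Sketch of Equations (B)-(D): unit-level switching power and the
# cube-root rule under coupled voltage/frequency scaling.
# All unit parameters below are illustrative placeholders.

def chip_power(units, V, f):
    """Equation (B) with a single chip-wide V and f:
    P = (1/2) * sum_i (C_i * V^2 * a_i * f)."""
    return 0.5 * sum(C * V**2 * a * f for (C, a) in units)

# Hypothetical per-unit (capacitance in farads, activity factor) pairs.
units = [
    (1.0e-9, 0.5),   # integer unit
    (1.5e-9, 0.1),   # floating-point unit
    (2.0e-9, 0.6),   # load-store unit
]

V, f = 1.8, 1.0e9            # 1.8 V, 1 GHz baseline
base = chip_power(units, V, f)

# Frequency tracks voltage roughly linearly, so scaling voltage by s
# also scales frequency by s; power then scales by s^3 (Equation D).
s = 0.8
scaled = chip_power(units, s * V, s * f)
print(scaled / base)         # -> 0.512, i.e. 0.8**3
```

Under Equation D's worst-case assumption (a_i = 1 for all i), the same cubic dependence holds; only the design-specific constant changes.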

NOVEMBER–DECEMBER 2000 27 POWER-AWARE MICROARCHITECTURE

[Figure 1: bar charts of improvement over the worst performer, (a) for SpecInt/W, SpecInt²/W, and SpecInt³/W and (b) for SpecFp/W, SpecFp²/W, and SpecFp³/W, across the Sun UltraSPARC-II, Intel Celeron, AMD Athlon, HP PA-8600, IBM Power3, Intel Pentium III, Compaq Alpha 21264, MIPS R12000, Hal Sparc64-III, and Motorola PPC740.]

Figure 1. Performance-power efficiencies across commercial processor products. The newer processors appear on the left.

Specifically, a design team may well choose a higher frequency design point (which meets maximum power budget constraints) even if it operates at a much lower MIPS/W efficiency compared to one that operates at better efficiency but at a lower performance level. As such, (MIPS)²/W or even (MIPS)³/W may be the metric of choice at the high end.

On the other hand, at the lowest end, where battery life (or energy consumption) is the primary driver, designers may want to put an even greater weight on the power aspect than the simplest MIPS/W metric. That is, they may just be interested in minimizing the power for a given workload run, irrespective of the execution time performance, provided the latter doesn't exceed some specified upper limit.

The MIPS metric for performance and the watts value for power may refer to average or peak values, derived from the chip specifications. For example, for a 1-gigahertz (10⁹ cycles/sec) processor that can complete up to 4 instructions per cycle, the theoretical peak performance is 4,000 MIPS. If the average completion rate for a given workload mix is p instructions/cycle, the average MIPS would equal 1,000 times p. However, when it comes to workload-driven evaluation and characterization of processors, metrics are often controversial. Apart from the problem of deciding on a representative set of benchmark applications, fundamental questions persist about ways to boil down performance into a single (average) rating that's meaningful in comparing a set of machines.

Since power consumption varies, depending on the program being executed, the benchmarking issue is also relevant in assigning an average power rating. In measuring power and performance together for a given program execution, we may use a fused metric such as the power-delay product (PDP) or energy-delay product (EDP).9 In general, the PDP-based formulations are more appropriate for low-power, portable systems in which battery life is the primary index of energy efficiency. The MIPS/W metric is an inverse PDP formulation, where delay refers to average execution time per instruction. PDP, being dimensionally equal to energy, is the natural metric for such systems. For higher end systems (workstations) the EDP-based formulation is more appropriate, since the extra delay factor ensures a greater emphasis on performance. The (MIPS)²/W metric is an inverse EDP formulation. For the highest performance server-class machines, it may be appropriate to weight the delay part even more. This would point to the use of (MIPS)³/W, which is an inverse ED²P formulation. Alternatively, we may use (CPI)³ × W (the cube of cycles per instruction times power) as a direct ED²P metric applicable on a per-instruction basis.

The energy × (delay)² metric, or performance³/power formula, is analogous to the cube-root rule,10 which follows from constant voltage-scaling arguments (see Equations A through D in the "Power-performance fundamentals" box). Clearly, to formulate a voltage-invariant power-performance characterization metric, we need to think in terms of performance³/power.
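How the exponent choice changes conclusions can be illustrated with two hypothetical machines, one tuned for battery life and one for peak throughput. The MIPS and watt figures below are made up for illustration, not measurements of real parts:

```python
# How the exponent x in MIPS^x / W reorders machines:
# x=1 is an inverse PDP, x=2 an inverse EDP, x=3 an inverse ED^2P metric.
# Both processors below are hypothetical.
procs = {
    "low-power": {"mips": 400.0, "watts": 1.0},
    "high-perf": {"mips": 2000.0, "watts": 60.0},
}

def efficiency(mips, watts, x):
    """MIPS^x per watt for a given weighting exponent x."""
    return mips**x / watts

for x in (1, 2, 3):
    best = max(procs, key=lambda p: efficiency(**procs[p], x=x))
    print(x, best)
# The low-power part wins under MIPS/W and MIPS^2/W, but the extra
# delay weighting in MIPS^3/W flips the ranking to the faster machine.
```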

28 IEEE MICRO

When we are dealing with the SPEC benchmarks, we may evaluate efficiency as (SPECrating)^x/W, or (SPEC)^x/W for short, where the exponent value x (equaling 1, 2, or 3) may depend on the class of processors being compared.

[Figure 2: block diagram — a program executable or trace and microarchitecture parameters feed a cycle-by-cycle performance simulator, which produces a performance estimate; energy models built from circuit/technology parameters combine with the simulator's activity statistics to produce a power estimate.] Figure 2. PowerTimer high-level block diagram.

Figure 1a,b shows the power-performance efficiency data for a range of commercial processors of approximately the same generation. (See http://www.specbench.org, the Aug. 2000 Microprocessor Report, and individual vendor Web pages, for example, http://www.intel.com/design/pentiumiii/datashts/245264.htm.) In Figure 1a,b the latest available processor is plotted on the left and the oldest on the right. We used SPEC/W, SPEC²/W, and SPEC³/W as alternative metrics, where SPEC stands for the processor's SPEC rating. (See the earlier definitions.) For each category, the worst performer is normalized to 1, and other processor values are plotted as improvement factors over the worst performer.

The data validates our assertion that—depending on the metric of choice and the target market (determined by workload class and/or the power/cost)—the conclusion drawn about efficiency can be quite different. For performance-optimized, high-end processors, the SPEC³/W metric seems fairest, with the very latest Intel Pentium III and AMD Athlon offerings (at 1 GHz) at the top of the class for integer workloads. The older HP PA-8600 (552 MHz) and IBM Power3 (450 MHz) still dominate in the floating-point class. For power-first processors targeted toward integer workloads (such as Intel's mobile Celeron at 333 MHz), SPEC/W seems the fairest.

Note that we've relied on published performance and "max power" numbers; and, because of differences in the methodologies used in quoting the maximum power ratings, the derived rankings may not be completely accurate or fair. This points to the need for standardized methods in reporting maximum and average power ratings for future processors so customers can compare power-performance efficiencies across competing products in a given market segment.

Microarchitecture-level power estimation

Figure 2 shows a block diagram of the basic procedure used in the power-performance simulation infrastructure (PowerTimer) at IBM Research. Apart from minor adaptations specific to our existing power and performance modeling methodology, at a conceptual level it resembles the recently published methods used in Princeton University's Wattch,4 Penn State's SimplePower,5 and George Cai's model at Intel.8 The core of such models is a classical trace- or execution-driven, cycle-by-cycle performance simulator. In fact, the power-performance models4-6,8 are all built upon Burger and Austin's widely used, publicly available, parameterized SimpleScalar performance simulator.11

We are building a tool set around the existing, research- and production-level simulators12,13 used in the various stages of the definition and design of high-end PowerPC processors. The nature and detail of the energy models used in conjunction with the workload-driven cycle simulator determine the key difference for power projections.

During every cycle of the simulated processor operation, the activated (or busy) microarchitecture-level units or blocks are known from the simulation state. Depending on the particular workload (and the execution snapshot within it), a fraction of the processor units and queue/buffer/bus resources are active at any given cycle. We can use these cycle-by-cycle resource usage statistics—available easily from a trace- or execution-driven performance simulator—to estimate the unit-level activity factors. If accurate energy models exist for each modeled resource, then on a given cycle, if unit i is accessed or used, we can estimate the corresponding energy consumed and add it to the net energy spent overall for that unit. So, at the end of a simulation, we can estimate the total energy spent on a unit basis as well as for the whole processor.
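The unit-level accounting loop described above can be sketched in a few lines. The trace format and per-access energy values here are hypothetical stand-ins for what a PowerTimer-style tool would obtain from its cycle simulator and energy models:

```python
# Minimal sketch of unit-level energy accounting driven by cycle-by-cycle
# activity. Per-access energies (in nJ) and the trace are hypothetical.
ENERGY_PER_ACCESS = {"fxu": 0.20, "fpu": 0.45, "lsu": 0.30}

def accumulate_energy(trace):
    """trace: iterable of per-cycle sets naming the active units.
    Returns per-unit energy totals and the cycle count."""
    totals = {unit: 0.0 for unit in ENERGY_PER_ACCESS}
    cycles = 0
    for active_units in trace:
        cycles += 1
        for unit in active_units:            # charge each active unit
            totals[unit] += ENERGY_PER_ACCESS[unit]
    return totals, cycles

# A four-cycle activity snapshot; the empty set is an idle cycle.
trace = [{"fxu", "lsu"}, {"fxu"}, set(), {"fpu", "lsu"}]
totals, cycles = accumulate_energy(trace)
chip_energy = sum(totals.values())           # whole-processor energy
print(totals, chip_energy / cycles)          # per-unit totals, avg energy/cycle
```

Dividing a unit's busy-cycle count by the total cycle count gives exactly the activity factor a_i that feeds the analytical power formulations in the "Power-performance fundamentals" box.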


[Figure 3: bar chart of execution pipe use ratio (0 to 0.6) per benchmark — li, go, gcc, perl, ijpeg, fpppp, vortex, tpcc-db2, m88ksim, and compress — for the FXU, FPU, LSU, BRU, and CRU.] Figure 3. Unit-level usage data for a current-generation server-class processor.

Since the unit-level and overall execution cycle counts are available from the performance simulation, and since the technology implementation parameters (voltage, frequency) are known, we can estimate average and peak power dissipation profiles for the whole chip, in the context of the input workload. Of course, this kind of analysis implicitly assumes that we can use clock-gating techniques at the circuit level (for example, see Tiwari et al.14) to selectively turn off dynamic power consumption in inactive functional units and on-chip storage/communication resources.

Figure 3 shows the unit-level usage data for some of the key functional units, based on a detailed, cycle-accurate simulation of a processor similar in complexity and functionality to a current-generation superscalar such as the Power4 processor.15 The data plot shown is obtained by running simulations using a representative sampled trace repository. This covers a range of benchmark workloads (selections from SPEC95 and TPC-C) commonly used in such designs. The functional units tracked in this graph are the floating-point unit (FPU), integer (fixed-point) unit (FXU), branch unit (BRU), load-store unit (LSU), and "logic on condition register" unit (CRU).

This simulation data indicates the potential power savings gained by selectively "gating off" unused processor segments during the execution of a given workload mix. For example, the FPU is almost totally inactive for gcc, go, ijpeg, li, vortex, and tpcc-db2. The maximum unit use is about 50% (FXU for m88ksim and vortex).

The potential here is especially significant if we consider the distribution of power within a modern processor chip. In an aggressive superscalar processor that doesn't employ any form of clock gating, a design's clock tree, drivers, and clocked latches (pipeline registers, buffers, queues) can consume up to 70% of the total power. Even with the registers and cache structures accounted for separately, the clock-related power for a recent high-performance Intel CPU is reported to be almost half the total chip power.14 Thus, the use of gated clocks to disable latches (pipeline stages) and entire units when possible is potentially a major source of power savings.

Depending on the exact hardware algorithm used to gate off clocks, we can depict different characteristics of power-performance optimization. In theory, for example, if the input clock can be disabled for a given unit (perhaps even at the individual pipeline stage granularity) on every cycle that it's not used, we'd get the best power efficiency, without losing any or much performance. However, such idealized clock gating provides many implementation challenges for designers of high-end processors due to considerations such as

• the need to include additional on-chip decoupling capacitance to deal with large current swings (L × [di/dt] effects), where L stands for inductance and di/dt denotes current gradients over time;
• more-complicated timing analysis to deal with additional clock skew and the possibility of additional gate delays on critical timing paths; and
• more effort to test and verify the microprocessor.

Despite these additional design challenges, some commercial high-end processors have begun to implement clock gating at various levels of granularity, for example, the Intel Pentium series and Compaq Alpha 21264. The need for pervasive use of clock gating in our server-class microprocessors has thus far not outweighed the additional design complexity. For the future, however, we continue to look at methods to implement clock gating with a simplified design cost. Simpler designs may implement gated clocks in localized hot spots or in league with dynamic feedback data. For example, upon detection that the power-hungry floating-point unit has been inactive over the last, say, 1,000 cycles, the execution pipes within that unit may be gated off in stages, a pipe at a time, to minimize current surges. (See Pant et al.16)

Also, for a given resource such as a register file, enabling only the segment(s) in use—depending on the number of ports that are active at a given time—might conserve power. Alternatively, at a coarser grain, an entire resource may be powered on or off, depending on demand.

Of course, even when a unit or its part is gated off, the disabled part may continue to burn a small amount of static power. To measure the overall advantage, we must carefully account for this and other overhead power consumed by the circuitry added to implement gated-clock designs. In Wattch,4 Brooks et al. report some of the observed power-performance variations with different mechanisms of clock-gating control. Other dynamic adaptation schemes (see the "Adaptive microarchitectures" box) for on-chip resources to conserve power are also of interest in our ongoing power-aware microarchitecture research.

Adaptive microarchitectures

Current-generation microarchitectures are usually not adaptive; that is, they use fixed-size resources and a fixed functionality across all program runs. The size choices are made to achieve best overall performance over a range of applications. However, an individual application whose requirements are not well matched to this particular hardware organization may exhibit poor performance. Even a single application run may exhibit enough variability during execution to result in uneven use of the chip resources. The result may often be low unit use (see Figure 3 in the main text), unnecessary power waste, and longer access latencies (than required) for storage resources. Albonesi et al.,1,2 in the CAP (Complexity-Adaptive Processors) project at the University of Rochester, have studied the power and performance advantages in dynamically adapting on-chip resources in tune with changing workload characteristics. For example, cache sizes can be adapted on demand, and a smaller cache size can burn less power while allowing faster access. Thus, power and performance attributes of a (program, machine) pair can both be enhanced via dynamic adaptation.

Incidentally, a static solution that attempts to exploit the fact that most of the cache references can be trapped by a much smaller "filter cache"3 can also save power, albeit at the expense of a performance hit caused by added latency for accessing the main cache.

Currently, as part of the power-aware microarchitecture research project at IBM, we are collaborating with the Albonesi group in implementing aspects of dynamic adaptation in a prototype research processor. We've finished the high-level design (with simulation-based analysis at the circuit and microarchitecture levels) of an adaptive issue queue.4 The dynamic workload behavior controls the queue resizing. The transistor and energy overhead of the added counter-based monitoring and decision control logic is less than 2%.

Another example of dynamic adaptation is one in which on-chip power (or temperature) and/or performance levels are monitored and fed back to control the progress of later instruction processing. Manne et al.5 proposed the use of a mechanism called "pipeline gating" to control the degree of speculation in modern superscalars, which employ branch prediction. This mechanism allows the machine to stall specific pipeline stages when it's determined that the processor is executing instructions in an incorrectly predicted branch path. This determination is made using "confidence estimation" hardware to assess the quality of each branch prediction. Such methods can drastically reduce the count of misspeculated executions, thereby saving power. Manne5 shows that such power reductions can be effected with minimal loss of performance. Brooks and colleagues6 have used power dissipation history as a proxy for temperature, in determining when to throttle the processor. This approach can help put a cap on maximum power dissipation in the chip, as dictated by the workload at acceptable performance levels. It can help reduce the thermal packaging and cooling solution cost for the processor.

References
1. D.H. Albonesi, "The Inherent Energy Efficiency of Complexity-Adaptive Processors," presented at the ISCA-25 Workshop on Power-Driven Microarchitecture, 1998; http://www.ccs.rochester.edu/projects/cap/cappublications.htm.
2. R. Balasubramonian et al., "Memory Hierarchy Reconfiguration for Energy and Performance in General-Purpose Processor Architectures," to be published in Proc. 33rd Ann. Int'l Symp. Microarchitecture (Micro-33), IEEE Computer Society Press, Los Alamitos, Calif., 2000.
3. J. Kin, M. Gupta, and W. Mangione-Smith, "The Filter Cache: An Energy-Efficient Memory Structure," Proc. IEEE Int'l Symp. Microarchitecture, IEEE CS Press, 1997, pp. 184-193.
4. A. Buyuktosunoglu et al., "An Adaptive Issue Queue for Reduced Power at High Performance," IBM Research Report RC 21874, Yorktown Heights, New York, Nov. 2000.
5. S. Manne, A. Klauser, and D. Grunwald, "Pipeline Gating: Speculation Control for Energy Reduction," Proc. 25th Ann. Int'l Symp. Computer Architecture (ISCA-25), IEEE CS Press, 1998, pp. 132-141.
6. D. Brooks and M. Martonosi, "Dynamic Thermal Management for High-Performance Microprocessors," to be published in Proc. Seventh Int'l Symp. High-Performance Computer Architecture (HPCA-7), IEEE CS Press, 2001.

Asynchronous circuit techniques such as interlocked pipelined CMOS (IPCMOS) provide the benefits of fine-grain clock gating as a natural and integrated design aspect. (See the "Low-power, high-performance circuit techniques" box.) Such techniques have the further advantage that current surges are automatically minimized because each stage's clock is generated locally and asynchronously with respect to other pipeline stages. Thus, techniques such as IPCMOS have the potential of exploiting the unused execution pipe resources (depicted in Figure 3) to the fullest. They burn clocked latch power strictly on demand at every pipeline stage, without the inductive noise problem.

The usefulness of microarchitectural power estimators hinges on the accuracy of the underlying energy models. We can formulate such models for given functional unit blocks (the integer or floating-point execution data path), storage structures (cache arrays, register files, or buffer space), or communication

Low-power, high-performance circuit techniques
Stanley Schuster, Peter Cook, Hans Jacobson, and Prabhakar Kudva

Circuit techniques that reduce power consumption while sustaining high-frequency operation are a key to future power-efficient processors. Here, we mention a couple of areas where we've made recent progress.

IPCMOS

Chip performance, power, noise, and clock synchronization are becoming formidable challenges as microprocessor performances move into the GHz regime and beyond. Interlocked Pipelined CMOS (IPCMOS),1 a new asynchronous clocking technique, helps address these challenges. Figure A (reproduced from Schuster et al.1) shows a typical block (D) interlocked with all the blocks with which it interacts. In the forward direction, dedicated Valid signals emulate the worst-case path through each driving block and thus determine when data can be latched within the typical block. In the reverse direction, Acknowledge signals indicate that data has been received by the subsequent blocks and that new data may be processed within the typical block. In this interlocked approach, local clocks are generated only when there's an operation to perform.

The general concepts of interlocking, pipelining, and asynchronous self-timing are not new and have been proposed in a variety of forms.2,3 However, the techniques used in those approaches are too slow, especially for macros that receive data from many separate logic blocks. IPCMOS achieves high-speed interlocking by combining the function of a static NOR and an input to perform a unique cycle-dependent AND function. Every local clock circuit has a strobe circuit that implements the asynchronous interlocking between stages. (See Schuster et al.1 for details.) A significant power reduction results when there's no operation to perform and the local clocks turn off. The clock transitions are staggered in time, reducing the peak di/dt—and therefore noise—compared to a conventional approach with a single global clock. The IPCMOS circuits show robust operation with large variations in power supply voltage, operating temperature, threshold voltage, and channel length.

Measured results on an experimental chip demonstrate robust operation for IPCMOS at 3.3 GHz under typical conditions and 4.5 GHz under best-case conditions in 0.18-micron, 1.5-V CMOS technology. Since the locally generated clocks for each stage are active only when the data sent to a given stage is valid, power is conserved when the logic blocks are idling. Furthermore, with the simplified clock environment it's possible to design a very simple single-stage latch that can capture and launch data simultaneously without the danger of a race.

[Figure A: block-level interlocking — blocks A, B, and C send DATA/VALID signals to block D and receive ACK(D); block D sends DATA/VALID to blocks E and F and receives ACK(E) and ACK(F).] Figure A. Interlocking at block level.

Low-power memory structures

With ever-increasing cache sizes and out-of-order issue structures, the energy dissipated by memory circuitry is becoming a significant part of a microprocessor's total on-chip power consumption.

Random access memories (RAMs) found in caches and register files consume power mainly due to the high-capacitance bitlines on which data is read and written. To allow fast access to large memories, these bitlines are typically implemented as precharged structures with sense amplifiers. Such structures, combined with the high load on the bitlines, tend to consume significant power.

Techniques typically used to improve access time, such as dividing the memory into separate banks, can also be used as effective means to reduce the power consumption. Only the sub-bitline in one bank then needs to be precharged and discharged on a read, rather than the whole bitline, saving energy. Dividing often-unused bitfields (for example, immediates) into segments and gating the wordline drive buffers with valid signals associated with these segments may in some situations further save power by avoiding having to drive parts of the high-capacitance wordline and associated bitlines.

Content addressable memories (CAMs)4 store data and provide an associative search of their contents. Such memories are often used to implement register files with renaming support and issue queue structures that require support for operand and instruction

tag matching. The high energy dissipation of CAM structures physically limits the size of such memories.4 The high power consumption of CAMs is a result of the way the memory search is performed. The search logic for a CAM entry typically consists of a matchline to which a set of wired-XNOR functions are connected. These XNOR functions implement the bitwise comparison of the memory word and an externally supplied data word. The matchline is precharged high and is discharged if a mismatch occurs. In the use of CAMs for processor structures such as the issue queue, mismatches are predominant. Significant power is thus consumed without performing useful work. In addition, precharge-structured logic requires that the high-capacitance taglines that drive the match transistors for the externally supplied word must be cleared before the matchline can be precharged. This results in further waste of power.

AND-structured CAM match logic has been proposed to reduce power dissipation, since it discharges the matchline only if there's actually a match. However, AND logic is inherently slower than wired-XNOR logic for wide memory words. New CAM structures that can offer significantly lower power while offering performance comparable to wired-XNOR CAMs have recently been studied as part of our research in this area.

References
1. S. Schuster et al., "Asynchronous Interlocked Pipelined CMOS Circuits Operating at 3.3-4.5 GHz," Proc. 2000 IEEE Int'l Solid-State Circuits Conf. (ISSCC), IEEE Press, Piscataway, N.J., 2000, pp. 292-293.
2. I. Sutherland, "Micropipelines," Comm. ACM, Vol. 32, No. 6, ACM Press, New York, June 1989.
3. C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, Mass., 1980.
4. K.J. Schultz, "Content-Addressable Memory Core Cells: A Survey," Integration, the VLSI J., Vol. 23, 1997, pp. 171-188.

bus structures (instruction dispatch bus or result bus) using either

• circuit-level or RTL simulation of the corresponding structures with circuit and technology parameters germane to the particular design, or
• analytical models or equations that formulate the energy characteristics in terms of the design parameters of a particular unit or block.

Prior work4,6 discusses the formulation of analytical capacitance equations that describe microarchitecture building blocks. Brooks et al.,4 for example, provide categories of structures modeled for power: 1) array structures, including caches, register files, and branch prediction tables; 2) content-addressable memory (CAM) structures, including those used in issue queue logic and translation look-aside buffers (TLBs); 3) combinational logic blocks and wires, including the various functional units and buses; and 4) clocking structures, including the corresponding buffers, wires, and capacitive loads.

Brooks et al. validated the methodology for forming energy models by comparing the energy models for array structures with schematic- and layout-level circuit extractions for array structures derived from a commercial Intel microprocessor. They were found to be accurate within 10%. This is similar to the accuracy reported by analytical cache delay models that use a similar methodology. On the other hand, since the core simulation occurred at the microarchitectural abstraction, the speed was 1,000 times faster when compared to existing layout-level power estimation tools.

If detailed power and area measurements for a given chip are possible, we could build reasonably accurate energy models based on power density profiles across the various units. We can use such models to project expected power behavior for follow-on processor design points by using standard technology scaling factors. These power-density-based energy models have been used in conjunction with trace-driven simulators for power analysis and trade-off studies in some Intel processors.8

We've developed various types of energy models: 1) power-density-based models for design lines with available power and area measurements from implemented chips; and 2) analytical models in terms of microarchitecture-level design parameters such as issue width, number of physical registers, pipeline stage lengths, misprediction penalties, cache geometry parameters, queue/buffer lengths, and so on. We formulated the analytical energy behav-

NOVEMBER–DECEMBER 2000 33 POWER-AWARE MICROARCHITECTURE

I-TLB1 L1-Icache

I-TLB2

I-fetch L2 cache Decode/ I-buffer expand Rename/ dispatch

Main memory

Issue queue Issue queue Issue queue Issue queue integer load/store floating point branch Issue logic Issue logic Issue logic Issue logic Cast-out queue

Register read Register read Register read Register read

L1-Dcache Floating-point Integer units Load/store units units Branch units

D-TLB1

Load/store Retirement D-TLB2 reorder buffer queue Retirement logic Store queue

Miss queue

Figure 4. Processor organization modeled by the Turandot simulator.

iors based on simple area determinants and Power-performance trade-off result validated the constants with circuit-level sim- In our simulation toolkit, we assumed a ulation experiments using representative test high-level description of the processor model. vectors. Data-dependent transition variations We obtained results using our current version aren’t considered in our initial energy model of PowerTimer, which works with presilicon formulations, but we will factor them into the performance and energy models developed for next simulator version. future, high-end PowerPC processors. We consider validation to be an integrated part of the methodology. Our focus is on Base microarchitecture model building models that rate highly for relative Figure 4 shows the high-level organization of accuracy. That is, our goal is to enable design- our modeled processor. The model details make ers to explore early-stage microarchitecture it equivalent in complexity to a modern, out- options that are inherently superior from a of-order, high-end microprocessor (for exam- power-efficiency standpoint. Absolute accu- ple, the Power4 processor15). Here, we assume racy of early-stage power projection for a a generic, parameterized, out-of-order super- future processor isn’t as important, so long as model adopted in a research tight upper bounds are established early. simulator called Turandot.12,13 (Figure 4 is essen-

34 IEEE MICRO tially reproduced from Moudgill et al.13) Our experimental results are based on the As described in the Turandot validation SPEC95 benchmark suite and a commercial paper, this research simulator was calibrated TPC-C trace. All workload traces were collect- against a pre-RTL, detailed, latch-accurate ed on a PowerPC machine. We generated processor model (referred to here as the R- SPEC95 traces using the Aria tracing facility model13). The R-model served as a validation within the MET toolkit.13 We created the SPEC reference in a real processor development pro- trace repository by using the full reference input ject. The R-model is a custom simulator writ- set, however, we used sampling to reduce the ten in C++ (with mixed VHDL “interconnect total trace length to 100 million instructions code”). There is a virtual one-to-one corre- per benchmark program. In finalizing the spondence of signal names between the R- choice of exact sampling parameters, the per- model and the actual VHDL (RTL) model. formance team also compared the sampled However, the R-model is about two orders of traces against the full traces in a systematic val- magnitude faster than the RTL model and con- idation study. The TPC-C trace is a contiguous siderably more flexible. Many microarchitec- (unsampled) trace collected and validated by ture parameters can be varied, albeit within the processor performance team at IBM Austin restricted ranges. Turandot, on the other hand and is about 180 million instructions long. is a classical trace/execution-driven simulator, written in C, which is one to two orders of Data cache size and effect of scaling techniques magnitude faster than the R-model. It supports We evaluated the relationship between per- a much greater number and range of parame- formance, power, and level 1 (L1) data cache ter values (see Moudgill et al.13 for details). size. 
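As a rough illustration of the analytical energy-model category described earlier, the sketch below models the energy per access of an array structure (a cache or register file) as a function of its geometry and porting, and derives average power from an access rate. The functional form and constants are invented for illustration; they are not the actual equations used in our models.

```python
# Illustrative sketch only: a toy analytical energy model for an array
# structure, in the spirit of the capacitance-based formulations described
# in the text. Constants and form are assumptions, not the paper's models.

def array_energy_per_access(rows, cols, ports, cap_per_cell=1.0, vdd=1.8):
    # Effective switched capacitance grows with the cells hanging off the
    # active wordline and bitlines, and with porting (each extra port adds
    # its own wordlines/bitlines, hence wire and diffusion capacitance).
    wordline_cap = cols * cap_per_cell * ports
    bitline_cap = rows * cap_per_cell * ports
    return 0.5 * (wordline_cap + bitline_cap) * vdd ** 2   # E = C * V^2 / 2

def average_power(energy_per_access, accesses_per_cycle, frequency_hz):
    # Average power = energy per access x utilization x clock rate.
    return energy_per_access * accesses_per_cycle * frequency_hz
```

Even this toy form shows why larger, more heavily ported structures are power-hungry: doubling the number of rows roughly doubles the bitline component of every access.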
Here, we report power-performance results using the same version of the R-model as that used in Moudgill et al.13 That is, we decided to use our developed energy models first in conjunction with the R-model, to ensure accurate measurement of the resource use statistics within the machine. To circumvent the simulator speed limitations, we used a parallel workstation cluster (farm). We also post-processed the performance simulation output and fed the average resource use statistics to the energy models to get the average power numbers. Looking up the energy models on every cycle during the actual simulation run would have slowed the R-model execution even further. While it would've been possible to get instantaneous, cycle-by-cycle energy consumption profiles through such a method, it wouldn't have changed the average power numbers for entire program runs.

Having used the detailed, latch-accurate reference model for our initial energy characterization, we could look at the unit- and queue-level power numbers in detail to understand, test, and refine the various energy models. Currently, we have reverted to using an energy-model-enabled Turandot model for fast CPI versus power trade-off studies with full benchmark traces. Turandot lets us experiment with a wider range and combination of machine parameters.

Data cache size and effect of scaling techniques

We evaluated the relationship between performance, power, and level 1 (L1) data cache size. We varied the cache size by increasing the number of cache lines per set while leaving the line size and cache associativity constant. Figure 5a,b shows the results of increasing the cache size from the baseline architecture. The baseline is represented by the points corresponding to the current design size; these are labeled 1x on the x-axis in those graphs.

Figure 5a illustrates the variation of relative CPI with the L1 data cache size. (CPI and related power-performance metric values in Figures 5 and 6 are relative to the baseline machine values, that is, points 1x on the x-axes.) Figure 5b shows the variation when considering the metric (CPI)^3 × power. From Figure 5b, it's clear that the small CPI benefits of increasing the data cache are outweighed by the increases in power dissipation due to larger caches.

[Figure 5 plots SpecFp, SpecInt, and TPC-C curves against relative cache size (1x to 16x); in (b), each workload appears in -lin and -nonlin power-scaling variants.]
Figure 5. Variation of performance (a) and power-performance (b) with cache size.
Figure 6. Variation of performance (a) and power-performance (b) with core size (ganged parameters).

Figure 5b shows the (CPI)^3 × power metric with two different scaling techniques. The first technique assumes that power scales linearly with the cache size: as the number of lines doubles, the power of the cache also doubles. The second scaling technique is based on data from a study17 of energy optimizations within multilevel cache architectures, which reports cache power dissipation for conventional caches with sizes ranging from 1 Kbyte to 64 Kbytes. In the second scaling technique, which we call non-lin, the cache power is scaled with the ratios presented in that study. The increase in cache power from doubling the cache size using this technique is roughly 1.46x, as opposed to the 2x with the simple linear scaling method. Obviously, the choice of scaling technique can greatly affect the results. However, with either scaling choice, conventional cache organizations (that is, cache designs without partial array shutdowns to conserve power) won't scale in a power-efficient manner. Note that the curves shown in Figure 5b assume a given, fixed circuit/technology generation; they show the effect of adding more cache to an existing design.
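The cache-scaling comparison can be made concrete in a few lines. The power ratios (2x per cache doubling for linear scaling, roughly 1.46x for the non-lin technique) come from the discussion above; the relative CPI of 0.98 for a doubled cache is an assumed, illustrative value of the kind read off a curve like Figure 5a.

```python
# A minimal check of the cache-scaling argument. Power ratios come from
# the text (2x linear, ~1.46x non-lin per doubling); the relative CPI of
# 0.98 for a 2x cache is an assumed, illustrative value.

def efficiency_metric(rel_cpi, rel_power):
    # Relative (CPI)^3 x power, the metric of Figure 5b. Lower is better.
    return rel_cpi ** 3 * rel_power

baseline = efficiency_metric(1.0, 1.0)    # 1.0 by construction
linear = efficiency_metric(0.98, 2.0)     # 2x cache, linear power scaling
non_lin = efficiency_metric(0.98, 1.46)   # 2x cache, non-lin power scaling
```

Both scaled points come out worse (larger) than the baseline, which is the article's conclusion: the small CPI gain does not pay for the added cache power under either scaling assumption.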

Ganged sizing

Out-of-order superscalar processors of the class considered rely on queues and buffers to efficiently decouple instruction execution and increase performance. The pipeline depth and the resource size required to support decoupled execution combine to determine the machine performance. Because of this decoupled execution style, increasing the size of one resource without regard to other machine resources may quickly create a performance bottleneck. Thus, we considered the effects of varying multiple parameters rather than just a single parameter in our modeled processor.

Figure 6a,b shows the effects of varying all of the resource sizes within the processor core. These include issue queues, rename registers, branch predictor tables, memory disambiguation hardware, and the completion table. For the buffers and queues, the number of entries in each resource is scaled by the values specified in the charts (0.6x, 0.8x, 1.2x, and 1.4x). For the instruction cache, data cache, and branch prediction tables, the sizes of the structures are doubled or halved at each data point.

From Figure 6a, we can see that performance increased by 5.5% for SPECfp, 9.6% for SPECint, and 11.2% for TPC-C as the size of the resources within the core is increased by 40% (except for the caches, which are 4x larger). That configuration had a power dissipation 52% to 55% higher than the baseline core. Figure 6b shows that the most power-efficient core microarchitecture is somewhere between the 1x and 1.2x cores.

Power-efficient microarchitecture design ideas and trends

With the means to evaluate power efficiency during early-stage microarchitecture evaluations, we can propose and characterize processor organizations that are inherently energy efficient. Also, if we use the right efficiency metric in the right context (see earlier discussion), we can compare alternate points and rank them using a single power-performance efficiency number. Here, we first examine the power-performance efficiency trend exhibited by the current regime of superscalar processors. The "SMT/CMP differences and energy-efficiency issues" box uses a simple loop-oriented example to illustrate the basic performance and power characteristics of the current superscalar regime. It also shows how follow-on paradigms such as simultaneous multithreading (SMT) may help correct the decline in power-performance efficiency measures.

Single-core, wide-issue, superscalar processor chip paradigm

One school of thought envisages a continued progression along the path of wider, aggressively speculative superscalar paradigms. (See Patt et al.18 for an example.) Researchers continue to innovate in efforts to exploit

SMT/CMP differences and energy-efficiency issues

Consider the floating-point loop kernel shown in Table A. The loop body consists of seven instructions, labeled A through G. The final instruction is a conditional branch that causes control to loop back to the top of the loop body. Labels T through Z are used to tag the corresponding instructions for a parallel thread when considering SMT and CMP. The lfdu/stfdu instructions are load/store instructions with update, where the base address register (say, r1, r2, or r3) is updated to hold the newly computed address.

Table A. Example loop test case.

Parallel tag  Tag  Instruction          Description
T             A    fadd fp3, fp1, fp0   add FPR fp1 and fp0; store into target register fp3
U             B    lfdu fp5, 8(r1)      load FPR fp5 from memory address: 8 + contents of GPR r1
V             C    lfdu fp4, 8(r3)      load FPR fp4 from memory address: 8 + contents of GPR r3
W             D    fadd fp4, fp5, fp4   add FPR fp5 and fp4; store into target register fp4
X             E    fadd fp1, fp4, fp3   add FPR fp4 and fp3; store into target register fp1
Y             F    stfdu fp1, 8(r2)     store FPR fp1 to memory address: 8 + contents of GPR r2
Z             G    bc loop_top          branch back conditionally to top of loop body

We assumed that the base machine (refer to Figure 4 in the main text) is a four-wide superscalar processor with two load-store units supporting two floating-point pipes. The data cache has two load ports and a separate store port. The two load-store units (LSU0 and LSU1) are fed by a single issue queue, LSQ; similarly, the two floating-point units (FPU0 and FPU1) are fed by a single issue queue, FPQ. In the context of the loop just shown, we essentially focus on the LSU-FPU subengine of the whole processor.

Assume the following high-level parameters (latency and bandwidth) characterizing the base superscalar machine:

• Instruction fetch bandwidth fetch_bw of two times W is eight instructions per cycle.
• Dispatch/decode/rename bandwidth equals W, which is four instructions/cycle; dispatch stalls beyond the first branch scanned in the instruction fetch buffer.
• Issue bandwidth from LSQ (reservation station) lsu_bw of W/2 is two instructions/cycle.
• Issue bandwidth from FPQ fpu_bw of W/2 is two instructions/cycle.
• Completion bandwidth compl_bw of W is four instructions/cycle.
• Back-to-back dependent floating-point operation issue delay fp_delay is one cycle.


• The best-case load latency, from fetch to write back, is five cycles.
• The best-case store latency, from fetch to writing in the pending store queue, is four cycles. (A store is eligible to complete the cycle after the address-data pair is valid in the store queue.)
• The best-case floating-point operation latency, from fetch to write back, is seven cycles (when the FPQ issue queue is bypassed because it's empty).

Load and floating-point operations are eligible for completion (retirement) the cycle after write back to rename buffers. For simplicity of analysis, assume that the processor uses in-order issue from issue queues LSQ and FPQ.

In our simulation model, superscalar width W is a ganged parameter, defined as follows:

• W = (fetch_bw/2) = disp_bw = compl_bw.
• The number of LSU units, ls_units; FPU units, fp_units; data cache load ports, l_ports; and data cache store ports, s_ports, vary as follows as W is changed: ls_units = fp_units = l_ports = max[floor(W/2), 1]; s_ports = max[floor(l_ports/2), 1].

We assumed a simple analytical energy model in which the power consumed is a function of the parameters W, ls_units, fp_units, l_ports, and s_ports. In particular, the power (PW) in pseudowatts is computed as

PW = W^0.5 + ls_units + fp_units + l_ports + s_ports.

Figure B shows the performance and performance/power ratio variation with superscalar width. The MIPS values are computed from the CPI values, assuming a 1-GHz clock frequency. The graph in Figure B1 shows that a maximum issue width of four could be used to achieve the best (idealized) CPI performance. However, as shown in Figure B2, from a power-performance efficiency viewpoint (measured as a performance over power ratio in this example), the best-case design is achieved for a width of three. Depending on the sophistication and accuracy of the energy model (that is, how power varies with microarchitectural complexity) and the exact choice of the power-performance efficiency metric, the maximum value point in the curve in Figure B2 will change. However, beyond a certain superscalar width, the power-performance efficiency will diminish continuously. Fundamentally, this is due to the single-thread ILP limit of the loop trace.

Note that the resource sizes (number of rename buffers, reorder buffer size, sizes of various other queues, caches, and so on) are assumed to be large enough that they're effectively infinite for the purposes of our running example. Some of the actual sizes assumed for the base case (W = 4) are

• completion (reorder) buffer size cbuf_size of 32,
• load-store queue size lsq_size of 6,
• floating-point queue size fpq_size of 8, and
• pending store queue size psq_size of 16.

As indicated in the main text, the microarchitectural trends beyond the current superscalar regime are effectively targeted toward the goal of extending the processing efficiency factors. That is, the complexity growth must ideally scale at a slower rate than performance growth. Power consumption is one index of complexity; it also determines packaging and cooling costs. (Verification cost and effort is another important index.) In that sense, striving to ensure that the power-performance efficiency metric of choice is a nondecreasing function of time is a way of achieving complexity-effective designs (see reference 3 in the main text reference list).

Let's examine one of the paradigms discussed in the main text, SMT, to understand how it may affect our notion of power-performance efficiency. Table B shows a snapshot of the steady-state, cycle-by-cycle resource use profile for our example loop trace executing on the baseline four-wide superscalar machine (with an fp_delay of 1). Note the completion (reorder) buffer, CBUF; the issue queues LSQ and FPQ; the execution pipes LSU0, LSU1, FPU0, and FPU1; the cache access ports (C0 and C1: read; C3: write); and the pending store queue. The use ratio for a queue, port, or execution pipe is measured as the number of valid entries/stages divided by the total for that resource. Recall from Figure B that the steady-state CPI for this case is 0.28. The data in Table B shows that the steady-state use of the CBUF is 53%, the LSU0/FPU0 pipe is fully used (100%), and the average use of the


Figure B. Loop performance (1) and performance/power variation (2) with issue width.
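The ganged-width configuration rules and the pseudowatt power model of this box can be sketched directly. The width term is read here as W^0.5, following the formula as printed; everything else follows the definitions in the text.

```python
# Sketch of the ganged-width machine configuration and the simple
# pseudowatt power model from this box. The W^0.5 width term follows the
# formula as printed; unit counts follow the definitions in the text.
from math import floor, sqrt

def config(W):
    ls_units = fp_units = l_ports = max(floor(W / 2), 1)
    s_ports = max(floor(l_ports / 2), 1)
    return ls_units, fp_units, l_ports, s_ports

def pseudowatts(W):
    ls_units, fp_units, l_ports, s_ports = config(W)
    return sqrt(W) + ls_units + fp_units + l_ports + s_ports

def mips_per_pseudowatt(cpi, W, clock_hz=1e9):
    # MIPS computed from CPI at a 1-GHz clock, as in Figure B.
    mips = clock_hz / cpi / 1e6
    return mips / pseudowatts(W)
```

Because the loop's CPI stops improving beyond W = 4 while the unit counts keep growing, `mips_per_pseudowatt` falls with width past that point, which is the downward efficiency trend Figure B2 depicts.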

Table B. Steady-state, cycle-by-cycle resource use profile (W = 4 superscalar). C: cache access port; CBUF: completion (reorder) buffer; PSQ: pending store queue.

Cycle  CBUF  LSQ  LSU0  LSU1  FPQ  FPU0  FPU1  C0  C1  C3  PSQ
n      0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+1    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13
n+2    0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+3    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13
n+4    0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+5    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13
n+6    0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+7    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13
n+8    0.53  0    1     0.5   0    1     0.6   1   1   0   0.13
n+9    0.53  0    1     0.5   0    1     0.4   0   0   1   0.13
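The averaging arithmetic used in this box can be reproduced from the two alternating rows that make up Table B's steady-state pattern:

```python
# Recomputing average use ratios from the repeating two-cycle pattern of
# Table B (even cycles n, n+2, ...; odd cycles n+1, n+3, ...).
even = {"CBUF": 0.53, "LSQ": 0, "LSU0": 1, "LSU1": 0.5, "FPQ": 0,
        "FPU0": 1, "FPU1": 0.6, "C0": 1, "C1": 1, "C3": 0, "PSQ": 0.13}
odd  = {"CBUF": 0.53, "LSQ": 0, "LSU0": 1, "LSU1": 0.5, "FPQ": 0,
        "FPU0": 1, "FPU1": 0.4, "C0": 0, "C1": 0, "C3": 1, "PSQ": 0.13}

def average_use(resource):
    # Steady-state average over the repeating two-cycle pattern.
    return (even[resource] + odd[resource]) / 2
```

For example, the FPU1 average is (0.6 + 0.4)/2 = 0.5, matching the 50% figure quoted in the box.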

LSU1/FPU1 pipe is 50% [(0.6 + 0.4)/2 = 0.5 for FPU1]. The LSQ and FPQ are totally unused (0%) because of the queue bypass facility and the relative lack of dependence stalls. The average read/write port use of the data cache is roughly 33%, and the pending store queue residency is a constant 2 out of 16 (= 0.125, or roughly 13%). Due to fundamental ILP limits, the CPI won't decrease beyond W = 4, while queue/buffer sizes and added pipe use factors will go on decreasing as W is increased. Clearly, power-performance efficiency will be on a downward trend (see Figure B). (Of course, here we assume maximum processor power numbers, without clock gating or dynamic adaptation to bring down power.)

With SMT, assume that we can fetch from two threads (simultaneously, if the instruction cache has two ports, or in alternate cycles if the instruction cache has one port). Suppose two copies of the same loop program (see Table A) are executing as two different threads. So, thread-1 instructions A-B-C-D-E-F-G and thread-2 instructions T-U-V-W-X-Y-Z are simultaneously available for dispatch and subsequent execution on the machine. This facility allows the use factors, and the net throughput performance, to increase without a significant increase in the maximum clocked power. This is because the width isn't increased, but the execution and resource stages or slots can be filled up simultaneously from both threads.

The added complexity in the front end, of maintaining two program counters (fetch streams) and the global register space increase alluded to before, adds to the power a bit. On the other hand, the core execution complexity can be relaxed somewhat without a performance hit. For example, we can increase the fp_delay parameter to reduce core complexity, without performance degradation. Figure C shows the expected performance and power-performance variation with W for the two-thread SMT processor. The power model assumed for the SMT machine is the same as that of the underlying superscalar, except that a fixed fraction of the net power is added to account for the SMT overhead. (Assume the fraction added is linear in the number of threads in an n-thread SMT.) Figure C2 shows that under the assumed model, the power-performance efficiency scales better with W, compared with the base superscalar (Figure B2).

[Figure C plots two-thread SMT curves for fp_delay = 1 and fp_delay = 2.]
Figure C. Performance (1) and power-performance variation (2) with W for a two-thread SMT.

In a multiscalar-type CMP machine, different iterations of a loop program could be initiated as separate tasks or threads on different core processors on the same chip. Thus, in a two-way multiscalar CMP, a global task sequencer would issue threads A-B-C-D-E-F-G and T-U-V-W-X-Y-Z, derived from the same user program, in sequence to two cores. Register values set in one task are forwarded in sequence to dependent instructions in subsequent tasks. For example, the register value in fp1 set by instruction E in task 1 must be communicated to instruction T in task 2. So instruction T must stall in the second processor until the value communication has occurred from task 1. Execution on each processor proceeds speculatively, assuming the absence of load-store address conflicts between tasks. Dynamic memory address disambiguation hardware is required to detect violations and restart task executions as needed.

If the performance can be shown to scale well with the number of tasks, and if each processor is designed as a limited-issue, limited-speculation (low-complexity) core, we can achieve better overall scalability of power-performance efficiency.
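The SMT power assumption in this box, base power plus a fixed fraction per added thread, can be sketched as follows. The 10% per-thread overhead is an assumed, illustrative number; the box does not specify the fraction.

```python
# Sketch of the SMT power assumption from this box: an n-thread SMT
# consumes the base machine's power plus a fixed fraction per added
# thread. The 10% per-thread overhead is an assumed, illustrative value.

def smt_power(base_power, n_threads, overhead_per_thread=0.10):
    return base_power * (1.0 + overhead_per_thread * (n_threads - 1))

def smt_efficiency_gain(base_mips, base_power, smt_mips, n_threads):
    # Ratio > 1 means the SMT point is more power-efficient (MIPS/W)
    # than the single-thread base machine.
    smt_eff = smt_mips / smt_power(base_power, n_threads)
    return smt_eff / (base_mips / base_power)
```

Under this model, SMT wins on efficiency whenever the throughput gain from the second thread exceeds the fixed power overhead, which is the qualitative behavior Figure C2 shows.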


single-thread instruction-level parallelism (ILP). Value prediction advances (see Lipasti et al.18) promise to break the limits imposed by data dependencies. Trace processors (Smith et al.18) ease the fetch bandwidth bottleneck, which can otherwise impede scalability.

Nonetheless, increasing the superscalar width beyond a certain limit seems to yield diminishing gains in performance, while continuously reducing the performance-power efficiency metric (for example, SPEC^3/W). Thus, studies show that for the superscalar paradigm, going for wider-issue, speculative hardware in a core processor ceases to be viable beyond a certain complexity point. (See an illustrative example in the "SMT/CMP differences and energy-efficiency issues" box.) Such a point certainly seems to be upon us in the processor design industry. In addition to power issues, more complicated, bigger cores add to the verification and yield problems, which all add up to higher cost and delays in time-to-market goals.

Advanced methods in low-power circuit techniques and adaptive microarchitectures can help us extend the superscalar paradigm for a few more years. (See the earlier "Low-power, high-performance circuit techniques" and "Adaptive microarchitectures" boxes.) Other derivative designs, such as those for multicluster processors and SMT, can be more definitive paradigm shifts.

Multicluster superscalar processors

Zyuban6 studied the class of multicluster superscalar processors as a means of extending the power-efficient growth of the basic superscalar paradigm. Here, we provide a very brief overview of the main results and conclusions of Zyuban's work (see also Zyuban and Kogge7). We'll use elements of this work in the modeling and design work in progress within our power-aware microarchitecture project.

As we've implied, the desire to extract more and more ILP using the superscalar approach requires the growth of most of the centralized structures. Among them are instruction fetch logic, register rename logic, instruction issue window with wakeup and selection logic, data-forwarding mechanisms, and resources for disambiguating memory references. Zyuban analyzed circuit-level implementation of these structures. His analysis shows that in most cases, the energy dissipated per instruction grows in a superlinear fashion. None of the known circuit techniques solves this energy growth problem. Given that the IPC performance grows sublinearly with issue width (with asymptotic saturation), it's clear why the classical superscalar path will lead to increasingly power-inefficient designs.

One way to address the energy growth problem at the microarchitectural level is to replace a classical superscalar CPU with a set of clusters, so that all key energy consumers are split among clusters. Then, instead of accessing centralized structures in the traditional superscalar design, instructions scheduled to an individual cluster would access local structures most of the time. The primary advantage of accessing a collection of local structures instead of a centralized one is that the number of ports and entries in each local structure is much smaller. This reduces access latency and lowers power.

Current high-performance processors (for example, the Compaq Alpha 21264 and IBM Power4) certainly have elements of multiclustering, especially in terms of duplicated register files and distributed issue queues. Zyuban proposed and modeled a specific multicluster organization. Simulation results showed that to compensate for the intercluster communication and to improve power-performance efficiency significantly, each cluster must be a powerful out-of-order superscalar machine by itself. This simulation-based study determined the optimal number of clusters and their configurations for a specified efficiency metric (the energy-delay product, EDP).

The multicluster organization yields IPC performance inferior to a classical superscalar with centralized resources (assuming equal net issue width and total resource sizes). The latency overhead of intercluster communication is the main reason behind the IPC shortfall. Another reason is that centralized resources are always better used than distributed ones. However, Zyuban's simulation data shows that the multicluster organization is potentially more energy-efficient for wide-issue processors, with an advantage that grows with the issue width.7 Given the same power budget, the multicluster organization allows configurations that can deliver higher performance than the best configurations with the centralized design.

40 IEEE MICRO

VLIW or EPIC microarchitectures
The very long instruction word (VLIW) paradigm has been the conceptual basis for Intel's future thrust using the Itanium (IA-64) processors. The promise here is (or was) that much of the hardware complexity could be moved to the software (compiler). The latter would use global program analysis to look for available parallelism and present explicitly parallel instruction computing (EPIC) execution packets to the hardware. The hardware could be relatively simple in that it would avoid much of the dynamic unraveling of parallelism necessary in a modern superscalar processor. For inferences regarding performance, power, and efficiency metrics for this microarchitecture class, see the September-October 2000 issue of IEEE Micro, which focuses on the Itanium.

Chip multiprocessing
Server product groups such as IBM's PowerPC division have relied on chip multiprocessing as the future scalable paradigm. The Power4 design15 is the first example of this trend. Its promise is to build multiple processor cores on the same die or package to deliver scalable solutions at the system level. Multiple processor cores on a single die can operate in various ways to yield scalable performance in a complexity- and power-efficient manner. In addition to shared-memory chip multiprocessing (see Hammond et al.18), we may consider building a multiscalar processor,19 which can spawn speculative tasks (derived from a sequential binary) to execute concurrently on multiple cores. Intertask communication can occur via register value forwarding or through shared memory. Dynamic memory disambiguation hardware is required to detect memory-ordering violations. On detecting such a violation, the offending task(s) must be squashed and restarted. Multiscalar-type paradigms promise scalable, power-efficient designs, provided the compiler-aided task-partitioning and instruction scheduling algorithms can be improved effectively.

Multithreading
Various flavors of multithreading have been proposed and implemented in the past as a means to go beyond single-thread ILP limits. One recent paradigm that promises to provide a significant boost in performance with a small increase in hardware complexity is simultaneous multithreading.20 In SMT, the processor core shares its execution-time resources among several simultaneously executing threads (programs). Each cycle, instructions can be fetched from one or more independent threads and injected into the issue and execution slots. Because the issue and execution resources can be filled by instructions from multiple independent threads in the same cycle, the per-cycle use of the processor resources can be significantly improved, leading to much greater processor throughput. Compaq has announced that its future 21x64 designs will embrace the SMT paradigm.

For occasions when only a single thread is executing on an SMT processor, the processor behaves almost like a traditional superscalar machine. Thus the same power reduction techniques are likely to be applicable. When an SMT processor is simultaneously executing multiple threads, however, the per-cycle use of the processor resources should noticeably increase, offering fewer opportunities for power reduction via such traditional techniques as clock gating. An SMT processor, when designed specifically to execute multiple threads in a power-aware manner, provides additional options for power-aware design.

The SMT processor generally provides a boost in overall throughput performance, and this alone will improve the power-performance ratio for a set of threads, especially in a context (such as server processors) of greater emphasis on the overall (throughput) performance than on low power. Furthermore, a processor specifically designed with SMT in mind can provide even greater power-performance efficiency gains.

Because the SMT processor takes advantage of multiple execution threads, it could be designed to employ far less aggressive speculation in each thread. By relying on instructions from a different thread to provide increased resource use when speculation would have been used in a single-threaded architecture (and accepting higher throughput over the multiple threads rather than single-thread latency performance), the SMT processor can spend more of its effort on nonspeculative instructions. This inherently implies a greater power efficiency per thread; that is, the power expended in the execution of useful instructions weighs better against misspeculated instructions on a per-thread basis. This also implies a somewhat simpler branch unit design (for example, fewer resources devoted to branch speculation) that can further aid development by reducing design complexity and verification effort.
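The issue-slot argument can be made concrete with a toy model, a minimal sketch rather than any real design: a 4-wide core issues from one thread versus two, where the per-cycle ready-instruction profiles (THREAD_A, THREAD_B) and the simple fill-from-A-then-B issue policy are invented for illustration. Idle slots are where per-cycle clock gating could save power.

```python
# Toy model of issue-slot use: single-thread vs. 2-way SMT.
# Each thread offers a hypothetical, repeating number of ready
# instructions per cycle; the core issues up to WIDTH per cycle.

WIDTH = 4
THREAD_A = [3, 1, 0, 2, 4, 1, 0, 3]  # invented ILP profile, thread A
THREAD_B = [1, 4, 2, 0, 1, 3, 2, 1]  # invented ILP profile, thread B

def issued(per_cycle_ready, width=WIDTH):
    """Slots filled each cycle when issuing from one thread."""
    return [min(r, width) for r in per_cycle_ready]

def smt_issued(a, b, width=WIDTH):
    """Each cycle, fill slots from thread A first, then thread B."""
    out = []
    for ra, rb in zip(a, b):
        take_a = min(ra, width)
        take_b = min(rb, width - take_a)
        out.append(take_a + take_b)
    return out

def utilization(slots, width=WIDTH):
    """Fraction of issue slots kept busy over the run."""
    return sum(slots) / (width * len(slots))

single = issued(THREAD_A)
smt = smt_issued(THREAD_A, THREAD_B)
print(f"single-thread slot use: {utilization(single):.2f}")
print(f"2-way SMT slot use:     {utilization(smt):.2f}")
print(f"idle (gateable) slots per cycle under SMT: {WIDTH - sum(smt)/len(smt):.2f}")
```

Even in this crude sketch, merging two bursty threads raises slot utilization markedly, which is exactly why an SMT core running multiple threads leaves fewer idle resources for clock gating to exploit.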

NOVEMBER–DECEMBER 2000 41 POWER-AWARE MICROARCHITECTURE

SMT implementations require an overhead in terms of additional hardware to maintain multiple-thread state. This increase in hardware implies some increase in the processor's per-cycle power requirements. SMT designers must determine whether this increase in processor resources (and thus per-cycle power) can be well balanced by the reduction of other resources (less speculation) and the increase in performance attained across the multiple threads.

Compiler support
Compilers can assist the microarchitecture in reducing the power consumption of programs. As explained in the "Compiler support" box, a number of compiler techniques developed for performance-oriented optimizations can be exploited (usually with minor modifications) to achieve power reduction. Reducing the number of memory accesses, reducing the amount of switching activity in the CPU, and increasing opportunities for clock gating will help here.

"Compiler support" box (by Manish Gupta)
Figure D shows a high-level organization of the IBM XL family of compilers. The front ends for different languages generate code in a common intermediate representation called W-code. The Toronto Portable Optimizer (TPO) is a W-code-to-W-code transformer. It performs classical data flow optimizations (such as global constant propagation and dead-code elimination), as well as loop and data transformations to improve data locality (such as loop fusion, array contraction, loop tiling, and unimodular loop transformations).1,2 The TPO also performs parallelization of codes based on both automatic detection of parallelism and use of OpenMP directives for parallel loops and sections. The back end for each architecture performs machine-specific code generation and optimizations.

The loop transformation capabilities of the TPO can be used on array-intensive codes (such as SPECfp-type codes and multimedia codes2) to reduce the number of memory accesses and, hence, the energy consumption. Recent research (such as the work by Kandemir et al.3) indicates that the best loop configuration for performance need not be the best from an energy point of view. Among the optimizations applied by the back end, register allocation and instruction scheduling are especially important for achieving power reduction. Power-aware register allocation involves not only reducing the number of memory accesses but also attempting to reduce the amount of switching activity by exploiting any known relationships between variables that are likely to have similar values. We can use power-aware instruction scheduling to increase opportunities for clock gating of CPU components, by trying to group instructions with similar resource use, as long as the impact on performance is kept to a reasonable limit.

[Figure D. Architecture of IBM XL compilers: language front ends (Fortran90, C/C++) emit W-code; the TPO transforms W-code to W-code; machine back ends (PowerPC, S/390) generate machine code. (TPO: Toronto Portable Optimizer)]

References
1. U. Banerjee, Loop Transformations for Restructuring Compilers, Kluwer Academic Publishers, Boston, Mass., 1997.
2. C. Kulkarni et al., "Interaction Between Data-Parallel Compilation and Data Transfer and Storage Cost for Multimedia Applications," Proc. EuroPar 99, Lecture Notes in Computer Science, Vol. 1685, Springer-Verlag, Sep. 1999, pp. 668-676.
3. M. Kandemir et al., "Experimental Evaluation of Energy Behavior of Iteration Space Tiling," to appear in Proc. 13th Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science, Springer-Verlag, 2001, pp. 141-156.

Energy-efficient cache architectures
We've seen several proposals for power-efficient solutions to the cache design.2-5,17 The simplest of these is the filter cache idea first proposed by Kin et al. (referenced in the "Adaptive microarchitectures" box). More recently, Albonesi (see same box) investigated the power-performance trade-offs that can be exploited by dynamically changing the cache sizes and clock frequencies during program execution.
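The dynamic cache-resizing trade-off can be sketched with a small model: shrink the cache whenever a smaller configuration's miss rate stays close enough to the full-size one, so that each access costs less energy. This is only an illustration of the idea, not Albonesi's actual mechanism; every size, per-access energy, and miss rate below is invented for the example.

```python
# Toy sketch of dynamic cache downsizing for energy.
# Each configuration: (size in KB, relative energy per access, miss rate).
# All numbers are hypothetical, chosen only to illustrate the trade-off.
CONFIGS = [
    (8,  0.25, 0.080),
    (16, 0.40, 0.050),
    (32, 0.65, 0.042),
    (64, 1.00, 0.040),  # full-size baseline
]
MISS_PENALTY = 10.0  # energy cost of a miss, relative to a full-size hit

def access_energy(cfg):
    """Expected energy per access: hit energy plus miss-rate-weighted penalty."""
    _size, hit_energy, miss_rate = cfg
    return hit_energy + miss_rate * MISS_PENALTY

def pick_config(configs, tolerance=0.005):
    """Smallest cache whose miss rate is within `tolerance` of the baseline."""
    base_miss = configs[-1][2]
    for cfg in configs:  # configs are sorted smallest-first
        if cfg[2] - base_miss <= tolerance:
            return cfg
    return configs[-1]

best = pick_config(CONFIGS)
print(f"chosen size: {best[0]} KB")
print(f"energy/access: {access_energy(best):.2f} "
      f"(baseline {access_energy(CONFIGS[-1]):.2f})")
```

Under these made-up numbers the 32-Kbyte configuration wins: its slightly higher miss rate costs less energy than the larger array's per-access cost, which is the essence of the trade-off, and a real controller would reevaluate the choice as program phases change.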

Our tutorial-style contribution may help enlighten the issues and trends in this new area of power-aware microarchitectures. Depending on whether the design priority is high performance or low power, and what form of design scaling is used for energy reduction, designers should employ different metrics to compare a given class of competing processors.

As explained by Borkar,21 the static (leakage) device loss will become a much more significant component of total power dissipation in future technologies. This will impose many more difficult challenges in identifying energy-saving opportunities in future designs.

The need for robust power-performance modeling at the microarchitecture level will continue to grow with tomorrow's workload and performance requirements. Such models will enable designers to make the right choices in defining the future generation of energy-efficient microprocessors. MICRO

References
1. M.K. Gowan, L.L. Biro, and D.B. Jackson, "Power Considerations in the Design of the Alpha 21264 Microprocessor," Proc. IEEE/ACM Design Automation Conf., ACM, New York, 1998, pp. 726-731.
2. 1998 ISCA Workshop on Power-Driven Microarchitecture papers; http://www.cs.colorado.edu/~grunwald/LowPowerWorkshop.
3. 2000 ISCA Workshop on Complexity-Effective Design papers; http://www.ece.rochester.edu/~albonesi/ced00.html.
4. D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA), IEEE Computer Society Press, Los Alamitos, Calif., 2000, pp. 83-94.
5. N. Vijaykrishnan et al., "Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower," Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA), 2000, pp. 95-106.
6. V. Zyuban, "Inherently Lower-Power High-Performance Superscalar Architectures," PhD thesis, Dept. of Computer Science and Engineering, Univ. of Notre Dame, Ind., 2000.
7. V. Zyuban and P. Kogge, "Optimization of High-Performance Superscalar Architectures for Energy Efficiency," Proc. IEEE Symp. Low Power Electronics and Design, ACM, New York, 2000.
8. Cool Chips Tutorial, talks presented at MICRO-32, the 32nd Ann. IEEE Int'l Symp. Microarchitecture, Dec. 1999; http://www.eecs.umich.edu/~tnm/cool.html.
9. R. Gonzalez and M. Horowitz, "Energy Dissipation in General Purpose Microprocessors," IEEE J. Solid-State Circuits, Vol. 31, No. 9, Sept. 1996, pp. 1277-1284.
10. M.J. Flynn et al., "Deep-Submicron Microprocessor Design Issues," IEEE Micro, Vol. 19, No. 4, July/Aug. 1999, pp. 11-22.
11. D. Burger and T.M. Austin, "The SimpleScalar Toolset, Version 2.0," Computer Architecture News, Vol. 25, No. 3, June 1997, pp. 13-25.
12. M. Moudgill, J.-D. Wellman, and J.H. Moreno, "Environment for PowerPC Microarchitecture Exploration," IEEE Micro, Vol. 19, No. 3, May/June 1999, pp. 15-25.
13. M. Moudgill, P. Bose, and J. Moreno, "Validation of Turandot, a Fast Processor Model for Microarchitecture Exploration," Proc. IEEE Int'l Performance, Computing and Communications Conf., IEEE Press, Piscataway, N.J., 1999, pp. 451-457.
14. V. Tiwari et al., "Reducing Power in High-Performance Microprocessors," Proc. IEEE/ACM Design Automation Conf., ACM, New York, 1998, pp. 732-737.
15. K. Diefendorff, "Power4 Focuses on Memory Bandwidth," Microprocessor Report, Oct. 6, 1999, pp. 11-17.
16. M. Pant et al., "An Architectural Solution for the Inductive Noise Problem Due to Clock Gating," Proc. Int'l Symp. Low-Power Electronics and Design, ACM, New York, 1999.
17. U. Ko, P.T. Balsara, and A.K. Nanda, "Energy Optimization of Multilevel Cache Architectures for RISC and CISC Processors," IEEE Trans. VLSI Systems, Vol. 6, No. 2, 1998, pp. 299-308.
18. Theme issue, "The Future of Processors," Computer, Vol. 30, No. 9, Sept. 1997, pp. 37-93.
19. G. Sohi, S.E. Breach, and T.N. Vijaykumar, "Multiscalar Processors," Proc. 22nd Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., 1995, pp. 414-425.
20. D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. Computer Architecture, IEEE CS Press, 1995, pp. 392-403.
21. S. Borkar, "Design Challenges of Technology Scaling," IEEE Micro, Vol. 19, No. 4, July/Aug. 1999, pp. 23-29.

Several authors at IBM T.J. Watson Research Center in Yorktown Heights, New York, collaborated on this tutorial. David Brooks, a PhD student at Princeton, worked as a co-op student while at IBM Watson.


Pradip Bose is a research staff member, working on power-aware architectures. He is a senior member of the IEEE. Stanley Schuster is a research staff member, working in the area of low-power, high-speed digital circuits. He is a Fellow of the IEEE. Hans Jacobson, a PhD student at the University of Utah, worked as an intern at IBM Watson. Prabhakar Kudva, a research staff member, works on synthesis and physical design with the Logic Synthesis Group. Alper Buyuktosunoglu, a PhD student at the University of Rochester, New York, worked as an intern at IBM Watson. John-David Wellman is a research staff member, working on high-performance processors and modeling. Victor Zyuban is a research staff member, working on low-power embedded and DSP processors. Manish Gupta is a research staff member and manager of the High Performance Programming Environments Group. Peter Cook is a research staff member and manager of High Performance Systems within the VLSI Design and Architecture Department.

Direct comments about this article to Pradip Bose, IBM T.J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598; [email protected].

