
Power-Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge

Article in IEEE Micro · March 2012 DOI: 10.1109/MM.2012.12


5 authors, including:

Effi Rotem (Intel); Alon Naveh (Technion - Israel Institute of Technology)


All content following this page was uploaded by Effi Rotem on 01 May 2016.


POWER-MANAGEMENT ARCHITECTURE OF THE INTEL MICROARCHITECTURE CODE-NAMED SANDY BRIDGE

Modern microprocessors are evolving into system-on-a-chip designs with high integration levels, catering to ever-shrinking form factors. Portability without compromising performance is a driving market need. An architectural approach that's adaptive to and cognizant of workload behavior and platform physical constraints is indispensable to meeting these performance and efficiency goals. This article describes power-management innovations introduced on Intel's Sandy Bridge microprocessor.

Efraim Rotem, Alon Naveh, Doron Rajwan, Avinash Ananthakrishnan, Eliezer Weissmann

Hot Chips, IEEE Micro, March/April 2012, pp. 20-27. Published by the IEEE Computer Society. 0272-1732/12/$31.00 © 2012 IEEE

Continuous advances in process technology let designers integrate an ever-increasing number of transistors onto a single die. This increased transistor density has enabled system-on-a-chip (SoC) functionality by integrating graphics engines, memory controllers, and other platform components into modern CPU dies. The Sandy Bridge client CPU contains just over 1 billion transistors, and the Sandy Bridge server contains up to 3 times as many on a single monolithic die. The second-generation Sandy Bridge core processor has five main domains: Intel architecture cores, a ring interconnect, a shared last-level cache (LLC), a system agent, and processor graphics. Figure 1 describes the Intel Sandy Bridge die.

Figure 1. The Sandy Bridge die photo.

Sandy Bridge redesign

We significantly redesigned Sandy Bridge from the previous generation, providing increased instruction-level parallelism and multithreading capabilities and improved performance and energy efficiency. We also introduced several architectural changes, most notably the 256-bit Advanced Vector Extension (AVX). In addition, we moved the graphics from the chip set or discrete graphics onto the lead CPU process technology, resulting in much higher transistor counts, lower power, and higher frequencies. We also significantly improved the internal microarchitecture of the processor graphics and media, resulting in much higher-performance execution units and high-performance media functionality. The ring interconnect and LLC architecture provide high modularity and feature-rich general-purpose, graphics, media, and system integration.

The increase in transistor count and the integration of platform components into a monolithic die, together with the increase in core frequency, introduce demanding power and energy challenges to modern microprocessors. Because computer systems' power and thermal envelopes aren't increasing, addressing these challenges requires a highly energy-efficient design. Sandy Bridge power management builds on the existing SpeedStep and Turbo Boost technologies, significantly increasing performance and energy efficiency. Sandy Bridge's power-management features provide the maximum performance possible within the package and system physical constraints when needed, while consuming very low power and energy when full performance isn't needed. Furthermore, we expanded the power-management features implemented in previous generations of Intel CPUs on Sandy Bridge to give the rest of the SoC power budgeting and prioritization capabilities.

Power-management architecture

The Package Control Unit (PCU) is the brain behind the Sandy Bridge power-management features. The power-management architecture is highlighted over the block diagram in Figure 2. The PCU resides in the system agent and is a combination of dedicated hardware state machines and an integrated microcontroller. A power-management link connects the PCU to different cores and functional blocks on the die via power-management agents (PMAs). PMAs collect telemetry information such as power consumption and junction temperature, and perform control functions such as P-state and C-state transitions. The PCU communicates with the external voltage regulator and the embedded controller, which perform system power-management functions. The PCU runs firmware that constantly collects power and thermal information, communicates with the system, and performs various power-management functions and optimization algorithms.

[Figure 2 block diagram. Labels: DMI, PCI Express, system agent, IMC, display, PCU, PMA, four cores with LLC slices, processor graphics, voltage regulator (SVID), embedded controller (PECI).]

Figure 2. Sandy Bridge's power-management architecture. Sandy Bridge block diagram showing the major functional blocks and the power-management control blocks and interconnect. The platform power-management components and the associated platform interconnect are also shown.

Sandy Bridge's package implements two independent variable power planes. One shared power plane feeds all CPU cores, the ring, and the LLC. Embedded power gates turn each core on and off individually. The LLC's power gates can turn on or off portions of the cache in shallow package sleep states or all of the cache in deeper sleep states. All the cores and the ring share the same clock and perform dynamic voltage and frequency scaling together. The graphics processor has an independent power plane, whose voltage and frequency can be varied independently. It can also be turned off completely when the graphics are inactive. Additional fixed power planes control the system agent and I/O.

The Sandy Bridge power management maximizes the user experience under multiple constraints. The user experience has the following attributes:

- throughput performance,
- responsiveness (burst performance),
- CPU and graphics performance,
- battery life and energy bills, and
- ergonomics (acoustic noise, heat, and so on).

To meet user preferences, the power-management algorithms optimize around the following physical constraints:

- silicon capabilities, including voltage, frequency, and power characteristics;
- system thermomechanical capabilities;
- power-delivery capabilities;
- software and operating system explicit control; and
- workload and usage characteristics.

The system designer can control the power-management functionality's behavior and preferences via basic input/output system (BIOS) settings, runtime software, or an on-board embedded controller. At runtime, the system reads and controls parameters such as power, maximum current consumption, and die temperature.

Intel Turbo Boost technology 2.0

The power and frequency of the CPU and processor graphics are defined by a scenario of concurrent CPU and processor graphics running a heavy workload at the same time at worst-case conditions [1]. In most cases, the CPU is running a less-demanding application, and the Intel Turbo Boost technology uses this power headroom to extract higher performance when possible [2,3]. Sandy Bridge's power performance control is performed primarily through dynamic voltage and frequency scaling (DVFS). When the operating system identifies a need for high performance, it issues a high P-state request. Whenever power and thermal headroom exist, the PCU increases the voltage and frequency to the highest point that is at or below the operating system request and still meets all physical constraints.

Sandy Bridge implements architectural power meters. It collects a set of architectural events from each Intel architecture core, the processor graphics, and I/O, and combines them with energy weights to predict the package's active power consumption. Leakage information is coded into the die and is scaled with operating conditions such as voltage and temperature to provide the package's total power consumption. The system uses the architectural power predictor output, which is also exposed externally to software, to decide the amount of turbo upside available for the current workload (turbo upside is the higher frequency that the CPU can go up to and use for higher performance) [4,5]. The architectural power predictor provides a consistent turbo behavior while minimizing the die-to-die variations and the dependency on ambient temperature. Figure 3 describes the actual versus predicted power of the CPU, processor graphics, and total package.

[Figure 3 chart: power (W, 0 to 45) versus time (s, 0 to 250). Series: CPU predicted and actual, PG predicted and actual, package predicted and actual.]

Figure 3. Power meter: predicted and actual power of the CPU, processor graphics, and total package. The figure shows a power snapshot of a combined CPU and graphics workload. The chart presents the actual measured power and the architectural power meter reporting for the IA core, processor graphics, and total package. The actual and reported power correlate accurately.

In addition to the Turbo Boost technology already implemented in previous generations of Intel processors, Sandy Bridge offers two new functionalities: total package power control and responsiveness via dynamic turbo. Sandy Bridge is an SoC monolithic die. The power is specified in terms of the entire package's total power consumption. The real workload uses the die's different computational and communication resources. The PCU continuously monitors the individual functional blocks' power consumption and performs dynamic budget allocation to the various components. One such example is power sharing between the

[Figure 4 diagram: power versus time. During sleep or low-power periods a thermal "energy budget" builds up. The accumulated budget is then used for a performance burst in C0/P0 (Turbo) with P > TDP, up to a maximum power of 1.2 to 1.3x TDP, sustained for 30-60 s. In steady state, power equals TDP, possibly at higher than nominal frequency. Moving-average windows: 5 s / 30-60 s.]

Figure 4. Dynamic behavior of the Intel Turbo Boost. After a period of low power consumption, the CPU and graphics can burst to very high power and performance for 30 to 60 seconds, delivering a responsive user experience. After this period, the power stabilizes back to the rated TDP.
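The energy-budget behavior in Figure 4 can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses a textbook exponentially weighted moving average of granted power and allows any power level that keeps that average at or below TDP. The function name `simulate_turbo`, its parameters (`tdp_w`, `max_w`, `tau_s`, `dt_s`), and the default values are hypothetical choices for illustration and are not taken from the PCU firmware.

```python
import math


def simulate_turbo(demand_w, tdp_w=35.0, max_w=45.5, tau_s=30.0, dt_s=1.0):
    """Return the power granted at each time step for a list of requested powers.

    Assumed model (not Intel's implementation): the controller keeps an
    exponentially weighted moving average of granted power and never lets
    that average rise above TDP.
    """
    alpha = 1.0 - math.exp(-dt_s / tau_s)  # EWMA smoothing factor per step
    avg_w = 0.0                            # running average of granted power
    granted = []
    for want_w in demand_w:
        # Largest power for this step that keeps the updated average <= TDP.
        limit_w = avg_w + (tdp_w - avg_w) / alpha
        power_w = min(want_w, max_w, limit_w)
        avg_w += alpha * (power_w - avg_w)
        granted.append(power_w)
    return granted


if __name__ == "__main__":
    # 60 s of near-idle operation builds up credit, then a long heavy request.
    trace = simulate_turbo([2.0] * 60 + [100.0] * 300)
    burst_s = sum(1 for p in trace[60:] if p > 35.0 + 1e-6)
    print(f"burst above TDP lasted {burst_s} s, final power {trace[-1]:.1f} W")
```

In this model, a long idle period lowers the running average, which is the "energy credit"; the burst then runs at the assumed maximum power until the average climbs back to TDP, after which the granted power settles at TDP. A longer `tau_s` gives a longer burst, which mirrors the configurable averaging window.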

CPU and the processor graphics. In workloads that heavily use the CPU and don't perform graphics operations, the PCU shifts the power budget to the CPU, and vice versa. For balanced workloads, the system designer can use a preference knob provided via BIOS or the runtime driver to set the desired processor graphics and CPU preference policy.

We defined the package's power specifications based on steady-state conditions. A typical heat sink has a thermal mass, and its heat capacity can sustain short periods of high power before it reaches steady-state temperatures. We use this phenomenon in Sandy Bridge to deliver responsiveness on interactive workloads. The PCU uses an exponentially weighted moving average (EWMA) algorithm over multiple seconds to control the power. As a result, the CPU can exceed its rated power for short periods as long as the rolling average stays within the platform's thermal specifications. The averaging time window is a function of the system's thermomechanical design, and therefore the system designer can configure it. Figure 4 describes the new Intel Turbo Boost technology behavior. We achieve a boost in responsiveness by managing energy budgets. The EWMA algorithm tracks energy consumption over a predefined time window; during periods in which the CPU is idle or consuming less than the thermal design point (TDP) power, it accumulates an energy credit. The PCU then uses these energy credits to burst above TDP for short periods until the energy credit is fully consumed. This provides responsiveness and an instantaneous performance boost for the user.

Sandy Bridge's turbo algorithm differs significantly from those of prior-generation CPUs [3] in that the amount of turbo upside is a function of the accrued energy credits over a thermally significant time period. This lets Sandy Bridge turbo up in frequency during periods of high activity while still meeting the long-term system power and thermal constraints. Figure 5 illustrates the measured gain in overall performance between the similarly configured four-core Sandy Bridge platform and the Clarksfield platform. The IPC gains between Sandy Bridge and Clarksfield average between 10 and 15 percent. Sandy Bridge's additional performance gain comes from intelligent power management and turbo algorithms.

[Figure 5 chart: performance gain (%, 0 to 100) for WME9, Sysmark07, Cinebench R10, SpecInt2K6_rate, SpecFP2K6_rate, and H.264 main concept.]

Figure 5. Sandy Bridge performance gain over Clarksfield. The figure shows Sandy Bridge's measured benchmark performance compared to Clarksfield with Turbo enabled. Up to 88 percent performance gains can be observed. Most of the performance gain comes from improvements in the Turbo algorithms and power-management features.

As Figure 5 shows, Intel's Turbo Boost 2.0 technology provides a significant performance boost on short encoding and video editing workloads by allowing the CPU to burst above TDP for 20 to 30 seconds at a time before settling back to run within the rated TDP envelope.

Graphics and core integration

Sandy Bridge is the first Intel microprocessor to fully integrate a graphics engine and a traditional Intel architecture processing core on a single die. As Figure 2 shows, the graphics and Intel architecture cores share the LLC, interconnect resources, and the integrated memory controller present on the die. This integration poses interesting opportunities from both a power-management and a performance-optimization viewpoint.

Because the processor graphics and the Intel architecture cores are integrated on the die, they share a common power and thermal envelope. This implies that running the cores at a higher frequency, and hence at a higher power level, takes away power headroom that we could otherwise allocate to the processor graphics to boost its frequency, and vice versa. The EWMA algorithm implemented in the PCU tracks the energy consumed over a certain configurable time window and the headroom available to the entire die at a given time. On workloads that keep predominantly only one of the two compute engines busy (that is, either the Intel architecture cores or the graphics core), how to allocate the available headroom to maximize performance is a trivial decision. Real-world applications, especially high-performance gaming workloads, use the Intel architecture cores and the processor graphics simultaneously, and the usage of each component varies dynamically over the course of the workload.

The PCU on Sandy Bridge exposes a software interface through which an operating system or graphics driver can choose how to partition the available energy headroom between the Intel architecture cores and processor graphics [5]. Figure 6 illustrates the sensitivity of performance to how power is allocated to processor graphics on two DX9 gaming workloads.

[Figure 6 chart: performance boost over 0-percent graphics budget (%, 0 to 60) versus power allocated to graphics (%, 0 to 100). Series: Crysis-dx9, UT3-dx9.]

Figure 6. Performance versus power balancing. This figure shows the performance of two graphics-intensive games as a function of power budget assigned to the graphics processor. The baseline is all power budget assigned to the IA core and no power budget to the graphics. The more power allocated to the graphics processor, the higher the performance. There is a diminishing return on the power allocation at the high ratios because this budget is taken from the IA core, which needs to feed the graphics processor.

Software performs this repartitioning dynamically, accounting for workload-specific characteristics, and hence further improving performance under a power-constrained envelope. It repartitions unused energy budget again so that it isn't lost.

In addition to the power and thermal dependency between the CPU and the processor graphics, there is a functional dependency. The CPU and processor graphics share the same LLC and ring interconnect. As a result, graphics performance strongly depends on how much interconnect and cache bandwidth is made available to the graphics engine. Furthermore, Intel architecture cores and the ring interconnect are part of a single voltage and frequency domain. This implies that the cores, when active, will run at the same frequency as the ring. Sandy Bridge exposes a set of performance counters for the software and the graphics driver to help select the optimal CPU and interconnect frequency for graphics workloads.

Energy-aware CPU

We designed Sandy Bridge to deliver high performance when needed while consuming minimum active and idle energy when the CPU or processor graphics aren't active. Running at the highest possible frequency delivers the best performance but isn't energy efficient, because power grows roughly with the cube of frequency when voltage is scaled along with it. Some memory-bound workloads that run at a high frequency consume high power but don't fully gain the performance from the increased frequency. If a user sets a balanced power preference, Sandy Bridge activates an energy-aware turbo algorithm that profiles the memory access pattern at runtime and performs a less-aggressive turbo during memory-bound intervals while increasing the frequency during CPU-bound phases. Figure 7 demonstrates a real-time application capture, showing the memory access patterns and predicted performance gain from higher frequency. At low-scalability periods, the PCU performs a DVFS transition to a less-aggressive turbo frequency, saving power and energy for a small performance loss.

[Figure 7 chart: scalability (%, 86 to 100) versus time (s).]

Figure 7. CPU and memory-bound profile. The figure shows a scalability snapshot as a function of time, from SPEC. Scalability of 100 percent means that the IA core gains performance proportional to the frequency increase. Lower values imply that there are memory accesses and that increasing the IA core frequency will only partly improve the overall benchmark runtime.

Smart average power support

During low-activity periods, when there are no active jobs pending to be executed, the operating system puts the processor into a sleep state. In ACPI terminology [6], this sleep state is called the C-state. The operating system controls each core individually, and the PCU coordinates between the cores and threads. Sandy Bridge introduced new PCU-managed C-states (see Figure 8) [2,3].

[Figure 8 diagram: per-core states C0 (active states P0-P1-Pn), C1, C3, and C6 for two cores, coordinated at the thread level and above into package states C3, C6, and C7.]

Figure 8. Sandy Bridge package C-state coordination.

Deeper C-states offer more power savings,


Table 1. Demotion and un-demotion algorithm results.

                   Demotion algorithm benefit     Un-demotion benefit
                   vs. no optimization            over Demotion only
Platform           Performance    Power           Performance    Power
SYSmark 07         +4%            +122 mW         +0.3%          -110 mW
Mobile Mark 07     +35%           -17 mW          -2%            -21 mW
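The trade-off behind Table 1 is that a deep C-state only pays off when the core stays idle long enough for the power saved to exceed the energy spent entering and exiting the state. The following is a minimal break-even sketch of that reasoning. The function names, the input figures, and the decision rule (compare the average recent idle period with the break-even time) are illustrative assumptions and do not describe Sandy Bridge's actual Demotion algorithm.

```python
def break_even_idle_us(transition_energy_uj, shallow_power_mw, deep_power_mw):
    """Idle time in microseconds beyond which the deep state saves net energy."""
    saved_mw = shallow_power_mw - deep_power_mw
    if saved_mw <= 0:
        return float("inf")  # the deep state never pays off
    # 1 mW * 1 us = 1 nJ, so convert the transition energy from uJ to nJ.
    return transition_energy_uj * 1000.0 / saved_mw


def should_demote(recent_idle_us, transition_energy_uj,
                  shallow_power_mw, deep_power_mw):
    """Return True when recent idle periods are too short to justify the deep state."""
    if not recent_idle_us:
        return False  # no history yet, so honor the operating system request
    average_idle_us = sum(recent_idle_us) / len(recent_idle_us)
    threshold_us = break_even_idle_us(
        transition_energy_uj, shallow_power_mw, deep_power_mw)
    return average_idle_us < threshold_us


if __name__ == "__main__":
    # Hypothetical figures: 50 uJ per entry/exit pair, 500 mW shallow, 100 mW deep.
    print(break_even_idle_us(50.0, 500.0, 100.0))
    print(should_demote([40.0, 60.0, 80.0], 50.0, 500.0, 100.0))
    print(should_demote([1000.0, 5000.0], 50.0, 500.0, 100.0))
```

A real controller would also weigh the exit latency against performance requirements; this sketch covers only the energy side of the decision.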

but at the cost of longer latency to enter and exit the C-state. Proper management of the C-states is a fundamental capability for low power consumption and long battery life. The ability to save power by entering into deeper idle states includes two major issues:

- Entering and exiting a deeper core-level state involves microarchitectural operations and voltage transitions, which result in long latency. For example, exiting a deeper C-state in Sandy Bridge typically takes a few tens of microseconds. During this time, the CPU is not executing useful code. Too-frequent transitions could therefore affect the workload's overall performance.
- Entering and exiting idle states also consumes energy. Frequent C-state transitions might cost more energy than the energy saved while in the deep sleep state.

To deal with these two issues, Sandy Bridge includes a Demotion algorithm that demotes requests for a deeper C-state into less aggressive C-states. Typically, the operating system frequently uses the sleep state in bursts. Sandy Bridge's PCU C-state algorithm performs runtime C-state entry and exit profiling. It then calculates the energy savings achieved by using these C-states. If too-frequent entry and exit transitions to a deep C-state result in net performance or energy losses, the Demotion algorithm will override the operating system request and keep the cores at a higher power state. During idle periods following these C-state bursts, the CPU will promote the package again to deeper C-states to minimize energy consumption.

Table 1 shows the added value of the C-state promotion/demotion algorithm. The Demotion algorithm improved Mobile Mark 07's performance by 35 percent but had a negative power effect on SYSmark 07 [7]. The promotion method gained back this power with minimal performance impact.

System power management

The Sandy Bridge power-management architecture enables a hierarchical control for total system power and energy consumption. A new system serial bus (PECI) connects Sandy Bridge to a system-embedded controller [8]. This power-management bus lets the embedded controller read and manage the die's power, energy, and temperature.

A voltage regulator module (VRM) delivers voltage to the package. Sandy Bridge introduced a serial voltage regulator control called SVID, which lets the PCU control multiple voltage regulators such as the CPU and the processor graphics on die [9]. It also offers configuration capabilities and telemetry features. The PCU directly communicates with the external VRM to change the supply voltage for the various frequencies. Furthermore, the PCU performs fine-grained voltage control to optimize the voltage in order to meet the CPU's exact run conditions (that is, the number of active cores, the die temperature, and the maximum power consumption). The PCU also manages the voltage regulator power states to minimize losses and optimize the total power consumption.

Process technology improvements allow an ever-increasing number of transistors to be integrated into modern microprocessors. The power savings and speed gains resulting from process improvement can't keep up with transistor density. Modern microprocessors such as Sandy Bridge are beginning to resemble SoCs, integrating multiple compute components such as a CPU core with a graphics engine and a memory controller, all within the same CPU die. These platforms face significant thermal and power delivery challenges. Maximizing performance under these SoC platforms will require architectural power-management techniques to provide improved performance and energy efficiency. Sandy Bridge's Turbo Boost 2.0 allows the CPU to burst above TDP for a short duration, providing increased performance and responsiveness while remaining compliant with platform electrical and thermal limits. In addition, Sandy Bridge's Dynamic Turbo provides up to 80 percent higher performance over previous-generation CPUs. Finally, Sandy Bridge offers a mechanism that lets software dynamically move power between the traditional Intel architecture and processor graphics computational elements, improving the graphics performance by up to 50 percent. As the industry moves toward eight hours and longer battery life requirements with no compromise on performance, algorithms such as Sandy Bridge's energy-efficient turbo are the first step toward addressing this need.

References

1. Second Generation Processor Family Mobile External Design Specification (EDS), v. 1-2, 2011.
2. S. Gochman et al., "Introduction to Intel Core Duo Processor Architecture," Intel Technology J., vol. 10, no. 2, 2006, pp. 89-97.
3. S. Gunther et al., "Energy-Efficient Computing: Power Management System on the Nehalem Family of Processors," Intel Technology J., vol. 14, no. 3, 2010.
4. Intel 64 and IA-32 Architectures Software Developer's Manual Documentation Changes, 2011; http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developers-manual.html.
5. Intel 64 and IA-32 Architectures Optimization Reference Manual, 2011; http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html.
6. Advanced Configuration and Power Interface (ACPI) Specification, rev. 5.0, Dec. 2011; http://www.acpi.info/spec50.htm.
7. Bapco, SYSMark 2007; www.bapco.com.
8. M. Berktold and T. Tian, "CPU Monitoring with DTS/PECI," white paper, Intel, Sept. 2010.
9. Intersil, Power Management Products - Processor Power, 2012; http://www.intersil.com/processor_power.

Efraim Rotem is a senior principal engineer in the client microprocessor group at Intel, where he's working on next-generation power-management features. His interests include CPUs, systems on chip, and platform power management. Rotem has an MSc in electrical engineering from the Technion, Israel Institute of Technology.

Alon Naveh is a principal engineer in the client microprocessor group at Intel. His work focuses on processor and platform power management. Naveh has a BS in electrical engineering from the Technion, Israel Institute of Technology, and an MBA from San Jose State University.

Doron Rajwan is a power-management architect in the client microprocessor group at Intel. His work has included power-management algorithms and firmware design. Rajwan has an MSc in electronic engineering from Tel Aviv University.

Avinash Ananthakrishnan is a power-management architect in the client microprocessor group at Intel. His work focuses on P-state optimization, turbo, package power sharing, platform power delivery, and power-performance projections on client CPU platforms. Ananthakrishnan has an MSc in electrical engineering from the University of Michigan.

Eliezer Weissmann is a principal engineer in the software and service group at Intel, where he's working on next-generation power-management features. His interests include the architecture definition of power-management features, optimization of idle and performance states, and software and hardware optimization for CPU and graphics computation. Weissmann has an MSc in computer science from the Technion, Israel Institute of Technology.

Direct questions and comments about this article to Efraim Rotem, Intel Israel (74) Ltd., PO Box 1659, Haifa 31015 Israel; [email protected].
