
Architecting for Power Management: The IBM® POWER7™ Approach

Malcolm Ware*, Karthick Rajamani*, Michael Floyd§, Bishop Brock§, Juan C Rubio*, Freeman Rawson*, John B Carter*
§IBM Systems and Technology Group, *IBM Research Austin
{mware,karthick,mfloyd,bcbrock,rubioj,frawson,retrac}@us.ibm.com

987-1-4244-5659-8/09/$25.00 ©2009 IEEE

Abstract

The POWER7 processor is the newest member of the IBM POWER® family of server processors. With greater than 4X the peak performance and the same power budget as the previous generation POWER6®, POWER7 will deliver impressive energy-efficiency boosts. The improved peak energy-efficiency is accompanied by a wide array of new features in the processor and system designs that advance IBM’s EnergyScale™ dynamic power management methodology. This paper provides an overview of these new features, which include better sensing, more advanced power controls, improved scalability for power management, and features to address the diverse needs of the full range of POWER servers from blades to supercomputers. We also highlight three challenges that need attention from a range of systems design and research teams: (i) power management in highly virtualized environments, (ii) power (in)efficiency of systems software and applications, and (iii) memory power costs, especially for servers with large memory footprints.

1. Introduction

The POWER7 processor is the next generation server processor in the IBM POWER family. Each POWER7 chip is 567 mm², contains 1.2 billion transistors, and is built in IBM’s 45 nm (12s CMOS) Cu SOI technology. It is designed to offer better performance, more opportunities for parallelism, and higher levels of power efficiency than its predecessors. It is an 8-core design with each core having up to 4 SMT threads, where a core can run in single-threaded, SMT2 or full SMT4 mode. Cores are out-of-order, allowing them to maximize instruction-level parallelism.

Figure 1: The IBM POWER7 Processor

Each core region, known as a “chiplet,” contains a 32KB 4-way set associative L1 I-cache and a 32KB 8-way set associative L1 D-cache, a private per-core 256KB L2 cache and a 4MB portion of the shared 32MB L3 cache. The L2 is fully inclusive of both the local D/I L1 caches. The L3 cache exploits embedded DRAM technology [8] to maximize area and power efficiency. The clock frequency of each core chiplet may be independently (and asynchronously to the fabric) controlled via an innovative new digital PLL circuit.

Like POWER6, POWER7 integrates memory controllers on chip. It has full support for cache coherence for large SMP configurations and is designed to be used in a wide range of servers, ranging from blades to very large commercial configurations and supercomputers.

Starting with POWER6 [1], IBM’s POWER family machines offer an array of power management capabilities collectively known as EnergyScale [2]. These capabilities include collection of power and performance measurements and a selection of power management modes, including: (i) static power save, which reduces power at a fixed performance cost; (ii) dynamic power save, which constantly adjusts core frequencies to exploit opportunities to save power with a minimal impact on performance; and (iii) a maximum performance variant, which exploits available power and thermal headroom to boost performance. EnergyScale also supports power capping, which ensures the safe and reliable operation of the servers within user-set power limits regardless of workload transients.

EnergyScale relies on a combination of features provided by the components of the POWER7-based system. It is primarily controlled by a dedicated microcontroller, the Thermal and Power Management Device (TPMD), which operates under the control of the Flexible Support Processor (FSP) present on all POWER family servers. The TPMD implements the power management policies for the system. In larger, multi-node POWER7 machines, where each node is a processor-memory-power delivery complex in its own right, there is one TPMD for every node in the system. Hard real-time firmware running on the TPMD implements the selected power management policies and collects data for reporting purposes.

The POWER7 offers a variety of new features in support of EnergyScale that represent a dramatic improvement over those found on the POWER6. These new features allow POWER7 machines to offer more energy savings, better energy proportionality, and finer control over the various components of the system. This paper describes the features and their uses.

2. Architected Processor Idle Modes

The POWER architecture specifications [6] supported by POWER7 define four power-saving modes that can provide a continuum of power savings versus latency and software impact. POWER7 implements two of these modes, Nap and Sleep. Both Nap and Sleep are hypervisor-privileged modes that maintain few of the architected processor resources; exit from a power-saving instruction is similar to a thread-level reset. Power-saving instructions that trigger the entry to these modes also cause dynamic SMT mode switching. The core-level power-saving modes described below are only activated when every thread in the core has executed a Nap or Sleep instruction.

2.1. Nap

Nap is a processor low-power state designed for short processor idle periods. Nap in POWER7 is designed to have lower latency than in POWER6, where it was the only idle power mode supported. The Nap state is entered whenever the hypervisor has executed a power-save instruction on all threads, with at least one thread executing a Nap instruction. In the Nap state, all of the execution units in a core and the L1 cache are clocked off; however, the higher-level caches and certain timing facilities remain functional, allowing low-latency workload resumption in the event of timer or external interrupts. Core RAS and configuration registers remain accessible to firmware during Nap.

The Nap state by itself provides modest power reduction over a software idle loop. Further, the hardware supports the option of automatically lowering cache frequencies while in the Nap state. This feature provides a significant reduction in power for napping core chiplets, at the potential expense of increased access latency for shared data requested by a non-napping core from a napping cache (L2/L3). The latency from the presentation of an interrupt to a napping core to the first instruction completion after Nap exit is typically less than 5 µs. Instruction execution begins immediately upon wakeup regardless of whether the frequency was dropped while in Nap. If the frequency was dropped for Nap, it is slewed back up to the operating point set by the TPMD firmware while instruction execution resumes at Nap exit.

2.2. Sleep

Sleep is a new architectural feature introduced in POWER7. It is a lower-power, higher-latency standby state intended for cores that the hypervisor/OS predicts will be unused for an extended period of time. The Sleep state is entered when every thread on a core executes a Sleep instruction. Upon entering Sleep, hardware state machines purge all data from the core and caches before completely clocking off the entire chiplet. A small logic macro associated with the core chiplet remains awake to handle external interrupts that wake the core out of Sleep.

When all cores in a POWER7 processor enter the Sleep state, the voltage supplied to the core chiplets can be automatically lowered to a retention level, a non-operational voltage sufficient only to maintain static configuration data in the latches and arrays. This mode provides the lowest standby power for a POWER7 processor. Note that firmware running on the FSP or TPMD can temporarily restore operational clocks and voltages to a Sleeping chiplet at any time for maintenance operations, e.g., to access RAS or configuration registers, without restarting instructions on that core.

The latency for entering Sleep varies based on the system design and workload configuration. Sleep exit latency for a single core is typically less than 1 ms, and is dominated by the time required to re-initialize the L3 cache eDRAM. Chip-level Processor Sleep exit is dominated by voltage change latency from the retention voltage, which varies by system depending on the voltage control scheme used in the system design.

Figure 2 shows the relative power reductions for Nap and Sleep as measured on a small, early sample set of POWER7 processors. For these measurements the minimum frequency used, fmin, was about 46% of the maximum frequency used, fmax. Vmax and Vmin refer to the set of voltages corresponding to those frequency levels and Vret to the set of voltages corresponding to the retention level.

eDRAM) cells, while a slightly lower corresponding Vdd voltage significantly reduces leakage for the majority of logic circuits in the core chiplet that perform computation and maintain data coherence.

Table 1: Major POWER7 voltage domains

Rail   Type     Use
Vdd    Dynamic  CPU core, cache logic
Vcs    Dynamic  Cache arrays, other SRAM
Vio    Fixed    Interconnect logic and I/O
Vmem   Fixed    Memory controller I/O

3.1. DPLL (Digital Phase-Locked Loop)
Figure 2: Comparison of processor idle modes. (Bar chart of average power in idle modes, normalized: idle, nap, nap at fmin, sleep at Vmax, sleep at Vmin, sleep at Vret.)

Figure 3: Better energy reduction from DPLL with per-core frequency scaling. (Power normalized to the common-frequency case for each scenario, comparing common frequency with per-core frequency. Scenario 1: 1 core busy, 7 napping; energy savings using fmin for napping cores. Scenario 2: 1 core busy, 7 at low load; energy savings using fmin for cores with low load.)

POWER7 for the first time allows cores in a 3.
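
The entry conditions for the idle modes in Section 2 can be summarized in a short sketch. This is only an illustrative model of the rules stated in the text, not IBM firmware or hypervisor code: the names `ThreadState`, `CoreState`, `core_idle_state`, and `chip_can_use_retention_voltage` are hypothetical, and treating a core with no listed threads as active is an assumption made here for simplicity.

```python
from enum import Enum


class ThreadState(Enum):
    """Last relevant action of a hardware thread (hypothetical model)."""
    RUNNING = "running"  # thread has not executed a power-save instruction
    NAP = "nap"          # thread executed a Nap instruction
    SLEEP = "sleep"      # thread executed a Sleep instruction


class CoreState(Enum):
    ACTIVE = "active"
    NAP = "nap"
    SLEEP = "sleep"


def core_idle_state(threads):
    """Core-level state implied by the per-thread states.

    Rules taken from Section 2:
    - a core-level idle mode is activated only when every thread has
      executed a Nap or Sleep instruction;
    - Sleep requires every thread to have executed Sleep;
    - otherwise (all threads idle, at least one Nap) the core enters Nap.
    """
    if not threads or any(t is ThreadState.RUNNING for t in threads):
        return CoreState.ACTIVE  # assumption: empty thread list means active
    if all(t is ThreadState.SLEEP for t in threads):
        return CoreState.SLEEP
    return CoreState.NAP


def chip_can_use_retention_voltage(cores):
    """True only when all cores on the chip are in the Sleep state."""
    return bool(cores) and all(
        core_idle_state(threads) is CoreState.SLEEP for threads in cores
    )


if __name__ == "__main__":
    smt4_sleeping = [ThreadState.SLEEP] * 4
    mixed = [ThreadState.SLEEP, ThreadState.NAP, ThreadState.SLEEP, ThreadState.SLEEP]
    print(core_idle_state(smt4_sleeping))
    print(core_idle_state(mixed))
    print(chip_can_use_retention_voltage([smt4_sleeping] * 8))
```

The sketch keeps the per-core decision separate from the chip-level check because the text describes them as distinct conditions: a single core can Sleep on its own, while lowering the core chiplet voltage to the retention level requires all cores to be in Sleep.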