IBM POWER5 CHIP: A DUAL-CORE MULTITHREADED PROCESSOR

FEATURING SINGLE- AND MULTITHREADED EXECUTION, THE POWER5 PROVIDES HIGHER PERFORMANCE IN THE SINGLE-THREADED MODE THAN ITS POWER4 PREDECESSOR AT EQUIVALENT FREQUENCIES. ENHANCEMENTS INCLUDE DYNAMIC RESOURCE BALANCING TO EFFICIENTLY ALLOCATE SYSTEM RESOURCES TO EACH THREAD, SOFTWARE-CONTROLLED THREAD PRIORITIZATION, AND DYNAMIC POWER MANAGEMENT TO REDUCE POWER CONSUMPTION WITHOUT AFFECTING PERFORMANCE.

Ron Kalla
Balaram Sinharoy
Joel M. Tendler
IBM

IBM introduced Power4-based systems in 2001.1 The Power4 design integrates two processor cores on a single chip, a shared second-level cache, a directory for an off-chip third-level cache, and the necessary circuitry to connect it to other Power4 chips to form a system. The dual-processor chip provides natural thread-level parallelism at the chip level. Additionally, the Power4's out-of-order execution design lets the hardware bypass instructions whose operands are not yet available (perhaps because of an earlier cache miss during register loading) and execute other instructions whose operands are ready. Later, when the operands become available, the hardware can execute the skipped instruction. Coupled with a superscalar design, out-of-order execution results in higher instruction execution parallelism than otherwise possible.

The Power5 is the next-generation chip in this line. One of our key goals in designing the Power5 was to maintain both binary and structural compatibility with existing Power4 systems, to ensure that binaries continue executing properly and all application optimizations carry forward to newer systems. With that base requirement, we specified increased performance and other functional enhancements of server virtualization, reliability, availability, and serviceability at both chip and system levels. In this article, we describe the approach we used to improve chip-level performance.

Multithreading

Conventional processors execute instructions from a single instruction stream. Despite microarchitectural advances, execution unit utilization remains low in today's microprocessors. It is not unusual to see average execution unit utilization rates of approximately 25 percent across a broad spectrum of environments. To increase execution unit utilization, designers use thread-level parallelism, in which the physical processor core executes instructions from more than one instruction stream. To the operating system, the physical processor core appears as if it is a symmetric multiprocessor containing two logical processors. There are at least three different methods for handling multiple threads.

In coarse-grained multithreading, only one thread executes at any instant. When a thread encounters a long-latency event, such as a cache miss, the hardware swaps in a second thread to use the machine's resources, rather than letting the machine remain idle. By allowing other work to use what otherwise would be idle cycles, this scheme increases overall system throughput. To conserve resources, both threads share many system resources, such as architectural registers. Hence, swapping program control from one thread to another requires several cycles. IBM implemented coarse-grained multithreading in the IBM eServer pSeries Model 680.2

A variant of coarse-grained multithreading is fine-grained multithreading. Machines of this class execute threads in successive cycles, in round-robin fashion.3 Accommodating this design requires duplicate hardware facilities. When a thread encounters a long-latency event, its cycles remain unused.

Finally, in simultaneous multithreading (SMT), as in other multithreaded implementations, the processor fetches instructions from more than one thread.4 What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently. With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each thread if possible, and allowing instructions from one thread to utilize all the execution units if the other thread encounters a long-latency event.

The Power5 design implements two-way SMT on each of the chip's two processor cores. Although a higher level of multithreading is possible, our simulations showed that the added complexity was unjustified. As designers add simultaneous threads to a single physical processor, the marginal performance benefit decreases. In fact, additional multithreading might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.


Power5 system structure

[Figure 1. Power4 (a) and Power5 (b) system structures.]

Figure 1 shows the high-level structures of Power4- and Power5-based systems. The Power4 handles up to a 32-way symmetric multiprocessor. Going beyond 32 processors increases interprocessor communication, resulting in high traffic on the interconnection fabric. This can cause greater contention and negatively affect system scalability. Moving the level-three (L3) cache from the memory side to the processor side of the fabric lets the Power5 more frequently satisfy level-two (L2) cache misses with hits in the 36-Mbyte off-chip L3 cache, avoiding traffic on the interchip fabric. References to data not resident in the on-chip L2 cache cause the system to check the L3 cache before sending requests onto the interconnection fabric. Moving the L3 cache provides significantly more cache on the processor side than previously available, thus reducing traffic on the fabric and allowing Power5-based systems to scale to higher levels of symmetric multiprocessing. Initial Power5 systems support 64 physical processors.

The Power4 includes a 1.41-Mbyte on-chip L2 cache. Power4+ chips are similar in design to the Power4 but are fabricated in 130-nm technology rather than the Power4's 180-nm technology. The Power4+ includes a 1.5-Mbyte on-chip L2 cache, whereas the Power5 supports a 1.875-Mbyte on-chip L2 cache. Power4 and Power4+ systems both have 32-Mbyte L3 caches, whereas Power5 systems have a 36-Mbyte L3 cache.

The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In Power4 and Power4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the Power5's 130-nm technology, we could move the memory controller on chip and eliminate a chip previously needed for the memory controller function. These two changes in the Power5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.

Chip overview

[Figure 2. Power5 chip (FXU = fixed-point execution unit, ISU = instruction sequencing unit, IDU = instruction decode unit, LSU = load/store unit, IFU = instruction fetch unit, FPU = floating-point unit, and MC = memory controller).]

Figure 2 shows the Power5 chip, which IBM fabricates using silicon-on-insulator (SOI) devices and copper interconnect. SOI technology reduces device capacitance to increase transistor performance.5 Copper interconnect decreases wire resistance and reduces delays in wire-dominated chip-timing paths. In 130-nm lithography, the chip uses eight metal levels and measures 389 mm².

The Power5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. The two cores share a 1.875-Mbyte (1,920-Kbyte) L2 cache. We implemented the L2 cache as three identical slices with separate controllers for each. The L2 slices are 10-way set-associative with 512 congruence classes of 128-byte lines. The data's real address determines which L2 slice the data is cached in. Either processor core can independently access each L2 controller.

We also integrated the directory for an off-chip 36-Mbyte L3 cache on the Power5 chip. Having the L3 cache directory on chip allows the processor to check the directory after an L2 miss without experiencing off-chip delays. To reduce memory latencies, we integrated the memory controller on the chip. This eliminates driver and receiver delays to an external controller.
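The L2 geometry described above multiplies out exactly to the stated capacity, and the slice structure can be pictured as a decomposition of the real address. The short Python sketch below checks the arithmetic; note that the article does not specify the slice-selection hash, so the modulo-3 choice here is purely an illustrative stand-in.

    # Back-of-the-envelope check of the Power5 L2 geometry:
    # 3 slices x 512 congruence classes x 10 ways x 128-byte lines.
    LINE = 128          # bytes per cache line
    CLASSES = 512       # congruence classes per slice
    WAYS = 10           # set associativity
    SLICES = 3

    total = SLICES * CLASSES * WAYS * LINE
    print(total // 1024, "Kbytes")      # -> 1920 Kbytes (1.875 Mbytes)

    def l2_lookup_fields(real_addr):
        """Split a real address into the fields an L2 slice would use.
        The slice-selection hash is not given in the article; taking
        the line address modulo 3 is only a stand-in for illustration."""
        line_addr = real_addr // LINE
        return {
            "slice": line_addr % SLICES,                  # assumed hash
            "congruence_class": (line_addr // SLICES) % CLASSES,
            "byte_in_line": real_addr % LINE,
        }

    print(l2_lookup_fields(0x0012_3456))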

Processor core

We designed the Power5 processor core to support both enhanced SMT and single-threaded (ST) operation modes. Figure 3 shows the Power5's instruction pipeline, which is identical to the Power4's. All pipeline latencies in the Power5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the Power4. The identical pipeline structure lets optimizations designed for Power4-based systems perform equally well on Power5-based systems. Figure 4 shows the Power5's instruction flow diagram.

In SMT mode, the Power5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the Power5 uses only one instruction fetch address register and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread.
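A minimal Python sketch of that fetch behavior follows, assuming simple strict alternation and ignoring cache misses and branch redirects (the hardware's actual thread selection is more nuanced); the starting addresses are arbitrary.

    # Fetch-stage sketch: two instruction fetch address registers
    # (IFARs), alternating threads in SMT mode, up to eight
    # instructions per cycle from a single thread.
    FETCH_WIDTH = 8

    def fetch_sequence(smt_mode, cycles):
        ifar = [0x1000, 0x2000]          # one program counter per thread
        for cycle in range(cycles):
            thread = cycle % 2 if smt_mode else 0  # alternate only in SMT
            print(f"cycle {cycle}: thread {thread} fetches "
                  f"{FETCH_WIDTH} instructions at {ifar[thread]:#x}")
            ifar[thread] += 4 * FETCH_WIDTH  # 4-byte PowerPC instructions
            # (a branch redirect would overwrite the IFAR instead)

    fetch_sequence(smt_mode=True, cycles=4)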



[Figure 3. Power5 instruction pipeline (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit).]


[Figure 4. Power5 instruction data flow (BXU = branch execution unit and CRL = condition register logical execution unit).]

The Power5 scans fetched instructions for branches (BP stage), and if it finds a branch, predicts the branch direction using three branch history tables (BHTs) shared by the two threads. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions.6,7 The third BHT predicts which of these prediction mechanisms is more likely to predict the correct direction.7 If the fetched instructions contain multiple branches, the BP stage can predict all the branches at the same time. In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction group. In the PowerPC architecture, the processor can calculate the target of most branches from the instruction's address and offset value. For predicting the target of a subroutine return, the processor uses a return stack, one for each thread. For predicting the target of other branches, it uses a shared target cache. If there is a taken branch, the processor loads the program counter with the branch's target address. Otherwise, it loads the program counter with the address of the next sequential instruction to fetch from.
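References 6 and 7 describe the underlying prediction ideas; the Python sketch below shows the general shape of such a combining ("tournament") predictor, with a selector table trained toward whichever component predictor was right. Table sizes, indexing, and the use of a single global history register are illustrative assumptions, not the Power5's actual design.

    # Combining branch predictor sketch: bimodal + path-correlated
    # components, plus a selector table of 2-bit counters.
    SIZE = 1024

    bimodal  = [1] * SIZE   # 2-bit counters; predict taken if >= 2
    path     = [1] * SIZE   # stand-in for the path-correlated predictor
    selector = [1] * SIZE   # >= 2 means "trust the path predictor"
    history  = 0            # global branch history register

    def predict(pc):
        b = bimodal[pc % SIZE] >= 2
        p = path[(pc ^ history) % SIZE] >= 2
        return p if selector[pc % SIZE] >= 2 else b

    def update(pc, taken):
        global history
        def bump(table, i, up):
            table[i] = min(3, table[i] + 1) if up else max(0, table[i] - 1)
        b = bimodal[pc % SIZE] >= 2
        p = path[(pc ^ history) % SIZE] >= 2
        if b != p:                     # train selector toward the winner
            bump(selector, pc % SIZE, p == taken)
        bump(bimodal, pc % SIZE, taken)
        bump(path, (pc ^ history) % SIZE, taken)
        history = ((history << 1) | taken) & (SIZE - 1)

    for pc, taken in [(0x40, 1), (0x40, 1), (0x80, 0), (0x40, 1)]:
        print(hex(pc), "taken" if predict(pc) else "not taken")
        update(pc, taken)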


After fetching, the Power5 places instructions in the predicted path in separate instruction fetch queues for the two threads (D0 stage). Like the Power4, the Power5 can dispatch up to five instructions each cycle. On the basis of thread priorities, the processor selects instructions from one of the instruction fetch queues and forms a group (D1, D2, and D3 stages). All instructions in a group come from the same thread and are decoded in parallel.

Before a group can be dispatched, the processor must make several resources available for the instructions in the group. Each dispatched group needs an entry in the global completion table (GCT). Each instruction in the group needs an entry in an appropriate issue queue. Each load and store instruction needs an entry in the load reorder queue and store reorder queue, respectively, to detect out-of-order execution hazards.1 When all the resources necessary for dispatch are available for the group, the group is dispatched (GD stage). Instructions flow through the pipeline stages between instruction fetch (IF) and group dispatch (GD) in program order. After dispatch, each instruction flows through the register-renaming (mapping) facilities (MP stage), which map the logical register numbers in the instruction to physical registers. In the Power5, there are 120 physical general-purpose registers (GPRs) and 120 physical floating-point registers (FPRs). The two threads dynamically share the register files. An out-of-order processor can exploit the high instruction-level parallelism exhibited by some applications (such as some technical applications) if a large pool of rename registers is available. To facilitate this, in ST mode, the Power5 makes all physical registers available to the single thread, allowing higher instruction-level parallelism.

After register renaming, instructions enter issue queues shared by the two threads. The Power5, like the Power4, has multiple issue queues: The floating-point issue queue feeds the two floating-point units, the branch issue queue feeds the branch execution unit, the condition register logical queue feeds the condition register logical operation execution unit, and a combined issue queue feeds the two fixed-point execution units and the two load-store execution units. Like the Power4, the Power5 contains eight execution units, each of which can execute an instruction each cycle.1

To simplify the logic for tracking instructions through the pipeline, the Power5 tracks instructions as a group. Each group of dispatched instructions takes an entry in the global completion table at the time of dispatch. The two threads share 20 entries in the GCT. Each GCT entry holds a group of instructions; a group can contain up to five instructions, all from the same thread. The Power5 allocates GCT entries in program order for each thread at the time of dispatch. An entry is deallocated from the GCT when the group is committed. Although the entries in the GCT are in program order and from a given thread, successive entries can belong to different threads.

When all input operands for an instruction are available, it becomes eligible for issue. Among the eligible instructions in the issue queue, the issue logic selects one and issues it for execution (ISS stage). For instruction issue, there is no distinction between instructions from the two threads. When issued, the instruction reads its input physical registers (RF stage), executes on the proper execution unit (EX stage), and writes the result back to the output physical register (WB stage). Each floating-point unit has a six-cycle execution pipe (F1 through F6 stages). In each load-store unit, an adder computes the address to read or write (EA stage), and the data cache is accessed (DC stage). For load instructions, once data is returned, a formatter selects the correct bytes from the cache line (Fmt stage) and writes them to the register (WB stage).

When all the instructions in a group have executed (without generating an exception) and the group is the oldest group of a given thread, the group commits (CP stage). In the Power5, two groups can commit per cycle, one from each thread.
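The GCT bookkeeping described above fits in a few lines of Python. This is a toy model of the mechanism exactly as the text states it (20 shared entries, groups of up to five instructions from one thread, per-thread program order, commit of the oldest finished group); the names and data structures are ours, not IBM's.

    # Toy model of the shared global completion table (GCT).
    from collections import deque

    GCT_ENTRIES = 20
    gct = deque()                 # successive entries may mix threads

    def dispatch(thread, instructions):
        """Dispatch one group (up to five instructions, one thread)."""
        assert 1 <= len(instructions) <= 5
        if len(gct) == GCT_ENTRIES:
            return False          # dispatch stalls: the GCT is full
        gct.append({"thread": thread, "insts": instructions,
                    "done": False})
        return True

    def commit(thread):
        """Commit this thread's oldest group, if it has executed."""
        for group in gct:
            if group["thread"] == thread:     # oldest for this thread
                if group["done"]:
                    gct.remove(group)         # entry deallocated here
                    return group
                return None                   # oldest not finished yet
        return None

    dispatch(0, ["ld", "add", "add"])
    dispatch(1, ["fmul"])
    gct[0]["done"] = True
    gct[1]["done"] = True
    # Two groups can commit per cycle, one from each thread:
    print(commit(0), commit(1))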

To efficiently support SMT, we tuned all resources for improved performance within area and power budget constraints. The L1 instruction and data caches are the same size as in the Power4—64 Kbytes and 32 Kbytes—but their associativity has doubled to two- and four-way. The first-level data translation table is now fully associative, but the size remains at 128 entries.

Enhanced SMT features

To improve SMT performance for various workload mixes and provide robust quality of service, we added two features to the Power5 chip: dynamic resource balancing and adjustable thread priority.

Dynamic resource balancing. The objective of dynamic resource balancing is to ensure that the two threads executing on the same processor flow smoothly through the system. Dynamic resource-balancing logic monitors resources such as the GCT and the load miss queue to determine if one thread is hogging resources. For example, if one thread encounters multiple L2 cache load misses, dependent instructions can back up in the issue queues, preventing additional groups from dispatching and slowing down the other thread. To prevent this, resource-balancing logic detects that a thread has reached a threshold of L2 cache misses and throttles that thread. The other thread can then flow through the machine without encountering congestion from the stalled thread. The Power5 resource-balancing logic also monitors how many GCT entries each thread is using. If one thread starts to use too many GCT entries, the resource-balancing logic throttles it back to prevent its blocking the other thread.

Depending on the situation, the Power5 resource-balancing logic has three thread-throttling mechanisms, sketched in code after this list:

• Reducing the thread's priority is the primary mechanism in situations where a thread uses more than a predetermined number of GCT entries.
• Inhibiting the thread's instruction decoding until the congestion clears is the primary mechanism for throttling a thread that incurs a prescribed number of L2 cache misses.
• Flushing all the thread's instructions that are waiting for dispatch and holding the thread's decoding until the congestion clears is the primary mechanism for throttling a thread executing a long-executing instruction, such as a synch instruction. (A synch instruction orders memory operations across multiple processors.)
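A compact way to express that policy in Python follows; the thresholds are placeholders, since the article does not give the actual values, and the field names are our own.

    # Sketch of the three thread-throttling mechanisms listed above.
    GCT_SHARE_LIMIT = 15   # assumed: GCT entries one thread may hold
    L2_MISS_LIMIT = 4      # assumed: L2 misses before throttling

    def choose_throttle(thread):
        """Pick a throttling mechanism for a misbehaving thread."""
        if thread["long_running_synch"]:
            return "flush pending instructions and hold decode"
        if thread["l2_misses_outstanding"] >= L2_MISS_LIMIT:
            return "inhibit instruction decode until congestion clears"
        if thread["gct_entries_in_use"] > GCT_SHARE_LIMIT:
            return "reduce thread priority"
        return None        # thread is flowing smoothly

    hog = {"gct_entries_in_use": 18, "l2_misses_outstanding": 1,
           "long_running_synch": False}
    print(choose_throttle(hog))   # -> reduce thread priority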
Adjustable thread priority. Adjustable thread priority lets software determine when one thread should have a greater (or lesser) share of execution resources. (All software layers—operating systems, middleware, and applications—can set the thread priority. Some priority levels are reserved for setting by a privileged instruction only.) Reasons for choosing an imbalanced thread priority include the following:

• A thread is in a spin loop waiting for a lock. Software would give the thread lower priority, because it is not doing useful work while spinning.
• A thread has no immediate work to do and is waiting in an idle loop. Again, software would give this thread lower priority.
• One application must run faster than another. For example, software would give higher priority to real-time tasks over concurrently running background tasks.

The Power5 microprocessor supports eight software-controlled priority levels for each thread. Level 0 is in effect when a thread is not running. Levels 1 (the lowest) through 7 apply to running threads. The Power5 chip observes the difference in priority levels between the two threads and gives the one with higher priority additional decode cycles. Figure 5 shows how the difference in thread priority affects the relative performance of each thread. If both threads are at the lowest running priority (level 1), the microprocessor assumes that neither thread is doing meaningful work and throttles the decode rate to conserve power.

[Figure 5. Effects of thread priority on performance.]
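The article does not give the exact decode-allocation function, only that a larger priority difference means more decode cycles for the favored thread and that two level-1 threads are throttled. The Python toy model below therefore assumes one plausible mapping (the low-priority thread decodes once every 2^(difference + 1) cycles) purely for illustration.

    def decode_share(prio0, prio1, cycles=64):
        """Return (thread0, thread1) decode cycles over a window."""
        if prio0 == prio1:
            if prio0 == 1:                   # both at lowest priority:
                budget = cycles // 8         # assumed throttled rate
                return (budget // 2, budget - budget // 2)
            return (cycles // 2, cycles // 2)  # equal priority: split
        period = 2 ** (abs(prio0 - prio1) + 1)  # assumed mapping
        low_share = cycles // period   # loser decodes 1 in `period`
        high_share = cycles - low_share
        return ((high_share, low_share) if prio0 > prio1
                else (low_share, high_share))

    print(decode_share(7, 4))   # -> (60, 4): thread 0 dominates
    print(decode_share(1, 1))   # -> (4, 4): decode throttled for power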


Single-threaded operation

Not all applications benefit from SMT. Having two threads executing on the same processor will not increase the performance of applications with execution-unit-limited performance or applications that consume all the chip's memory bandwidth. For this reason, the Power5 supports the ST execution mode. In this mode, the Power5 gives all the physical resources, including the GPR and FPR rename pools, to the active thread, allowing it to achieve higher performance than a Power4 system at equivalent frequencies.

The Power5 supports two types of ST operation: An inactive thread can be in either a dormant or a null state. From a hardware perspective, the only difference between these states is whether or not the thread awakens on an external or decrementer interrupt. In the dormant state, the operating system boots up in SMT mode but instructs the hardware to put the thread into the dormant state when there is no work for that thread. To make a dormant thread active, either the active thread executes a special instruction, or an external or decrementer interrupt targets the dormant thread. The hardware detects these scenarios and changes the dormant thread to the active state. It is software's responsibility to restore the architected state of a thread transitioning from the dormant to the active state. When a thread is in the null state, the operating system is unaware of the thread's existence. As in the dormant state, the operating system does not allocate resources to a null thread. This mode is advantageous if all the system's executing tasks perform better in ST mode.
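The state machine implied by that description is small. Here is a Python sketch; the state names match the text, while the method names and the OS-awareness flag are our own shorthand.

    # Thread states: active, dormant (awakens on an interrupt or a
    # special instruction), and null (invisible to the OS).
    class HardwareThread:
        def __init__(self):
            self.state = "active"

        def park(self, os_aware=True):
            # OS has no work for this thread.
            self.state = "dormant" if os_aware else "null"

        def interrupt(self):
            # An external or decrementer interrupt awakens a dormant
            # thread only; a null thread ignores it.
            if self.state == "dormant":
                self.state = "active"
                # (software must then restore the architected state)

    t = HardwareThread()
    t.park()          # OS parks the thread in the dormant state
    t.interrupt()     # decrementer interrupt makes it active again
    print(t.state)    # -> active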

Dynamic power management

In current CMOS technologies, chip power has become one of the most important design parameters. With the introduction of SMT, more instructions execute per cycle per processor core, thus increasing the core's and the chip's total switching power. To reduce switching power, Power5 chips use a fine-grained, dynamic clock-gating mechanism extensively. This mechanism gates off clocks to a local clock buffer if dynamic power management logic knows the set of latches driven by the buffer will not be used in the next cycle. For example, if the GPRs are guaranteed not to be read in a given cycle, the clock-gating mechanism turns off the clocks to the GPR read ports. This allows substantial power saving with no performance impact.

In every cycle, the dynamic power management logic determines whether a local clock buffer that drives a set of latches can be clock gated in the next cycle. The set of latches driven by a clock-gated local clock buffer can still be read but cannot be written. We used power-modeling tools to estimate the utilization of various design macros and their associated switching power across a range of workloads. We then determined the benefit of clock gating for those macros, implementing cycle-by-cycle dynamic power management in macros where such management provided a reasonable power-saving benefit. We paid special attention to ensuring that clock gating causes no performance loss and that clock-gating logic does not create a critical timing path. A minimum amount of logic implements the clock-gating function.

In addition to switching power, leakage power has become a performance limiter. To reduce leakage power, the Power5 uses transistors with low threshold voltage only in critical paths, such as the FPR read path. We implemented the Power5 SRAM arrays mainly with high-threshold-voltage devices.

The Power5 also has a low-power mode, enabled when the system software instructs the hardware to execute both threads at the lowest available priority. In low-power mode, instructions dispatch once every 32 cycles at most, further reducing switching power. The Power5 uses this mode only when there is no ready task to run on either thread.
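In essence, each local clock buffer carries a designer-supplied "needed next cycle" predicate. The Python sketch below shows the idea; the class and the enable condition are illustrative, not the actual circuit.

    # Fine-grained clock gating: a local clock buffer is gated off
    # whenever its latches will not be used in the next cycle.
    class LocalClockBuffer:
        def __init__(self, name, needed_next_cycle):
            self.name = name
            # Designer-supplied predicate, e.g. "the GPR read port is
            # used next cycle only if an issuing instruction reads it".
            self.needed_next_cycle = needed_next_cycle

        def clock_enable(self, pipeline_state):
            # Gated latches hold their values: readable, not written.
            return self.needed_next_cycle(pipeline_state)

    gpr_read_port = LocalClockBuffer(
        "gpr_read_port0",
        lambda state: state["issue_reads_gpr"],   # assumed condition
    )
    print(gpr_read_port.clock_enable({"issue_reads_gpr": False}))
    # -> False: the clocks to this read port are gated off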

The Power5's out-of-order execution design, coupled with its two 2-way simultaneous multithreaded processor cores, provides both instruction-level and thread-level parallelism. Future plans call for shrinking the Power5 die by using a 90-nm lithography fabrication process, which should allow even higher performance at lower power.

Acknowledgments

We thank Ravi Arimilli, Steve Dodson, and the entire Power5 team.

References
1. J.M. Tendler et al., "Power4 System Microarchitecture," IBM J. Research and Development, vol. 46, no. 1, Jan. 2002, pp. 5-26.
2. J. Borkenhagen et al., "A Multithreaded PowerPC Processor for Commercial Servers," IBM J. Research and Development, vol. 44, no. 6, Nov. 2000, pp. 885-898.
3. G. Alverson et al., "The Tera Computer System," Proc. 1990 ACM Int'l Conf. Supercomputing (Supercomputing 90), IEEE CS Press, 1990, pp. 1-6.
4. D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. Computer Architecture (ISCA 95), ACM Press, 1995, pp. 392-403.
5. G.G. Shahidi et al., "Partially-Depleted SOI Technology for Digital Logic," Proc. Int'l Solid-State Circuits Conf. (ISSCC 99), IEEE Press, 1999, pp. 426-427.
6. J.E. Smith, "A Study of Branch Prediction Strategies," Proc. 8th Int'l Symp. Computer Architecture (ISCA 81), IEEE CS Press, 1981, pp. 135-148.
7. S. McFarling, Combining Branch Predictors, tech. note TN-36, Digital Equipment Corp. Western Research Laboratory, 1993.

Ron Kalla is the lead engineer for IBM Power5 systems, specializing in processor core development. His research interests include computer microarchitecture and post-silicon hardware verification. Kalla has a BSEE from the University of Minnesota.

Balaram Sinharoy is the chief scientist for the IBM Power5 microprocessor. His research interests include advanced microprocessor design, computer architecture, and performance analysis. Sinharoy has a BS in physics and a BTech in computer science and electrical engineering from the University of Calcutta, and an MS and a PhD in computer science from Rensselaer Polytechnic Institute. He is an IBM Master Inventor and a senior member of the IEEE.

Joel M. Tendler is the program director of technology assessment for the IBM Systems and Technology Group in Austin, Texas. His research interests include computer systems design, architecture, and performance. Tendler has a bachelor's degree in engineering from The Cooper Union and a PhD in electrical engineering from Syracuse University. He is a member of the IEEE and the Computer Society.

Direct questions and comments about this article to Joel Tendler, IBM Corp., 0453B002, 11400 Burnett Road, Austin, TX 78758 or [email protected].
