IBM's POWER5 Micro Processor Design and Methodology

IBM's POWER5 Micro Processor Design and Methodology Ron Kalla IBM Systems Group © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Outline POWER5 Overview Design Process Power UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology POWER Server Roadmap 2001 2002-3 2004* 2005* 2006* POWER4 POWER4+ POWER5 POWER5+ POWER6 90 nm 65 nm 130 nm 130 nm Ultra high 180 nm >> GHz >> GHz frequency cores Core Core 1.7 GHz 1.7 GHz > GHz > GHz L2 caches Core Core Core Core 1.3 GHz 1.3 GHz Shared L2 Advanced System Features Core Core Distributed Switch Shared L2 Shared L2 Distributed Switch Distributed Switch Shared L2 Simultaneous multi-threading Distributed Switch Sub-processor partitioning Reduced size Dynamic firmware updates Enhanced scalability, parallelism Chip Multi Processing Lower power High throughput performance - Distributed Switch Larger L2 Enhanced memory subsystem - Shared L2 More LPARs (32) Dynamic LPARs (16) Autonomic Computing Enhancements *Planned to be offered by IBM. All statements about IBM’s future direction and intent are* subject to change or withdrawal without notice and represent goals and objectives only. UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology POWER5 Technology: 130nm lithography, Cu, SOI 389mm2 276M Transistors Dual processor core 8-way superscalar Simultaneous multithreaded (SMT) core Up to 2 virtual processors per real processor UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Multi-threading Evolution Single Thread Coarse Grain Threading FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL Fine Grain Threading Simultaneous Multi-Threading FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL Thread 0 Executing Thread 1 Executing No Thread Executing UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Thread Priority Single Thread Mode Instances when unbalanced execution desirable No work for opposite thread 2 2 Thread waiting on lock 1 Software determined non uniform 1 balance 1 Power management IPC 1 … 1 Solution: Control instruction decode 0 rate 0 0 Software/hardware controls 8 priority 0,7-5 -3 -1 0 1 3 5 7,0 1,1 levels for each thread Thread 1 Priority - Thread 0 Priority Power Thread 0 IPC Thread 1 IPC Save Mode UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Terminology PowerPC Addresses Effective>(SLB)>Virtual>(Page Table)>Real>(LPAR)>Physical Instruction Execution I-fetch Decode Dispatch Issue Finish Complete UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Multithreaded Instruction Flow in Processor Pipeline Out-of-Order Processing Branch Redirects Instruction Fetch BR MP ISS RF EX WB Xfer LD/ST IFIF IC BP MP ISS RF EA DC Fmt WB Xfer CPCP D0D0 D1 D2 D3 Xfer GD MP ISS RF EX WB Xfer FX Group Formation and MP ISS RF F6F6 Instruction Decode F6F6 FP F6F6 WB Xfer Interrupts & Flushes Branch Prediction Dynamic Instruction Shared Selection Execution Program Branch Return Target Cache Shared Issue Units Counter History Stack Queues Tables LSU0 Alternate FXU0 Instruction LSU1 Buffer 0 Group Formation, Instruction Instruction Decode, FXU1 Group Store Cache Completion Instruction Dispatch FPU0 Queue Instruction Buffer 1 FPU1 Translation BXU Thread Data Data Priority Shared CRL Translation Cache Read Shared Write Shared Register Register Files Register Files Mappers L2 Cache Shared by two threads Resource used by thread 0 Resource used by thread 1 UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Resource Sizes Analysis done to optimize every ST SMT micro-architectural resource size GPR/FPR rename pool size I-fetch buffers Reservation Station IPC SLB/TLB/ERAT I-cache/D-cache Many Workloads examined ~ Associativity also examined ~ 50 60 70 80 90 100 110 120 130 Number of GPR Renames Results based on simulation of an online transaction processing application Vertical axis does not originate at 0 UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Single Thread Operation Advantageous for execution unit limited applications Floating or fixed point intensive workloads Execution unit limited applications provide minimal performance leverage IPC for SMT Extra resources necessary for SMT POWER5 ST POWER4+ provide higher performance benefit POWER5 SMT when dedicated to single thread Determined dynamically on a per Matrix Multiply processor basis UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Modifications to POWER4 System Structure P P P P L3 L2 L2 L3 L3 L3 Fab Ctl Fab Ctl Mem Ctl Mem Ctl L3 L3 Mem Ctl Mem Ctl Memory Memory UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology 16-way Building Block Book Memory I/O Memory I/O Memory I/O Memory I/O MCM L3 L3 L3 L3 POWER5 POWER5 POWER5 POWER5 POWER5 POWER5 POWER5 POWER5 L3 L3 L3 L3 MCM I/O I/O I/O I/O Memory Memory Memory Memory UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology POWER5 Multi-chip Module 95mm % 95mm Four POWER5 chips Four cache chips 4,491 signal I/Os 89 layers of metal UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology 64-way SMP Interconnection Interconnection exploits enhanced distributed switch All chip interconnections operate at half processor frequency and scale with processor frequency UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Design Process Concept Phase (~10 People/4 months) Competitive assessment Customer requirements Technology assessment – Try to match technology introduction with products AREA/POWER estimates – Cost Schedule Staffing Lots or executive/technical reviews UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Design Process (cont.) High Level Design Phase (~50 People/6 months) Micro-architecture Cycle Time closed – Cross section of critical paths – Power closed Area/Power budgets established Arrays architected All Unit I/O defined timing contract in place Performance model meets concept phase targets HLD Exit review. Everything is now in place to effectively engage full team UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Design Process (cont.) Implementation Phase (~200 People/12-18 months) Write VHDL Schematic entry for circuits Simulation – Block>unit>core>chip>system Integration process begins – RLM and Custom macros > unit > core >chip Design rule checking, (can’t trust human’s) Bulk simulation Check recheck and check again RIT (Release Information Tape) UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Bring-up 100’s People 12-18 Months Starts at Wafer test. AVPs Random test programs Boot operating system OS based testing Formal testing (new groups of people ) POWER (Will discuss later) Performance testing Additional RITs if necessary Release to mfg GA (Hurray) Field Support UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Power Micro-Processor Power Complex equation. DC Power (IDDQ) – Voltage – Temperature – Speed of Chip (PSRO) Processor speed Ring Oscillator. AC Power – Function of CV2f X Switching Factor. – Workload Dependent. UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology GR DD2.0 Fmax vs. 1.2v, 25C PSRO Unguardbanded, 1.3v, 85C Fmax Frequency (MHz) PSRO (ps/stage) Faster parts Faster partsParts UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Power 5 Leakage Power vs. PSRO 1.3 V and 85C 9s2 Leakage Power Leakage Power (W) PSRO (ps/stage) UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Power 5 AC Power vs. Frequency 1.3 V and 85 C 9s2 AC Power AC Power (W) Frequency (MHz) UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Chip Power vs Frequency 1.3V 85C Frequency (MHz) Power (W) UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Other SMT Considerations Power Management SMT Increases execution unit utilization Dynamic power management does not impact performance Debug tools / Lab bring-up Instruction tracing Hang detection Forward progress monitor Performance Monitoring Serviceability UT CS352 , April 2005 © 2003 IBM Corporation.

IBM's POWER5 Micro Processor Design and Methodology

Ray Tracing on the Cell Processor

IBM System P5 Quad-Core Module Based on POWER5+ Technology: Technical Overview and Introduction

Implementing Powerpc Linux on System I Platform

From Blue Gene to Cell Power.Org Moscow, JSCC Technical Day November 30, 2005

IBM Powerpc 970 (A.K.A. G5)

IBM Power Systems Performance Report Apr 13, 2021

Microprocessor

Introduction to the Cell Multiprocessor

IBM Power Roadmap

POWER8: the First Openpower Processor

Power8 Quser Mspl Nov 2015 Handout

A Bibliography of Publications in IEEE Micro