IBM's POWER5 Micro Processor Design and Methodology

Ron Kalla IBM Systems Group

© 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Outline

ƒ POWER5 Overview ƒ Design Process ƒ Power

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

POWER Server Roadmap

2001 2002-3 2004* 2005* 2006* POWER4 POWER4+ POWER5 POWER5+ POWER6

90 nm 65 nm 130 nm 130 nm Ultra high 180 nm >> GHz >> GHz frequency cores Core Core 1.7 GHz 1.7 GHz > GHz > GHz L2 caches Core Core Core Core 1.3 GHz 1.3 GHz Shared L2 Advanced System Features Core Core Distributed Switch Shared L2 Shared L2

Distributed Switch Distributed Switch Shared L2 Simultaneous multi-threading Distributed Switch Sub-processor partitioning Reduced size Dynamic firmware updates Enhanced scalability, parallelism Chip Multi Processing Lower power High throughput performance - Distributed Switch Larger L2 Enhanced memory subsystem - Shared L2 More LPARs (32) Dynamic LPARs (16) Autonomic Computing Enhancements

*Planned to be offered by IBM. All statements about IBM’s future direction and intent are* subject to change or withdrawal without notice and represent goals and objectives only. UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

POWER5

ƒ Technology: 130nm lithography, Cu, SOI ƒ 389mm2 276M Transistors ƒ Dual processor core ƒ 8-way superscalar ƒ Simultaneous multithreaded (SMT) core  Up to 2 virtual processors per real processor

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Multi-threading Evolution Single Coarse Grain Threading

FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL

Fine Grain Threading Simultaneous Multi-Threading

FX0 FX0 FX1 FX1 FP0 FP0 FP1 FP1 LS0 LS0 LS1 LS1 BRX BRX CRL CRL Thread 0 Executing Thread 1 Executing No Thread Executing

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Thread Priority Single Thread Mode ƒ Instances when unbalanced execution desirable  No work for opposite thread 2 2  Thread waiting on lock 1  Software determined non uniform 1 balance 1

 Power management IPC 1  … 1 ƒ Solution: Control instruction decode 0 rate 0 0  Software/hardware controls 8 priority 0,7-5 -3 -1 0 1 3 5 7,0 1,1 levels for each thread Thread 1 Priority - Thread 0 Priority Power Thread 0 IPC Thread 1 IPC Save Mode

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Terminology

ƒ PowerPC Addresses  Effective>(SLB)>Virtual>(Page Table)>Real>(LPAR)>Physical ƒ Instruction Execution  I-fetch  Decode  Dispatch  Issue  Finish  Complete

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology Multithreaded Instruction Flow in Processor

Pipeline Out-of-Order Processing Branch Redirects

Instruction Fetch BR MP ISS RF EX WB Xfer LD/ST IFIF IC BP MP ISS RF EA DC Fmt WB Xfer CPCP D0D0 D1 D2 D3 Xfer GD MP ISS RF EX WB Xfer FX Group Formation and MP ISS RF F6F6 Instruction Decode F6F6 FP F6F6 WB Xfer Interrupts & Flushes

Branch Prediction Dynamic Instruction Shared Selection Execution Program Branch Return Target Cache Shared Issue Units Counter History Stack Queues Tables LSU0 Alternate FXU0 Instruction LSU1 Buffer 0 Group Formation, Instruction Instruction Decode, FXU1 Group Store Cache Completion Instruction Dispatch FPU0 Queue Instruction Buffer 1 FPU1 Translation BXU Thread Data Data Priority Shared CRL Translation Cache Read Shared Write Shared Register Register Files Register Files Mappers L2 Cache Shared by two threads Resource used by thread 0 Resource used by thread 1 UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Resource Sizes

ƒ Analysis done to optimize every ST SMT micro-architectural resource size GPR/FPR rename pool size I-fetch buffers

Reservation Station IPC SLB/TLB/ERAT I-cache/D-cache ƒ Many Workloads examined ~ ƒ Associativity also examined ~ 50 60 70 80 90 100 110 120 130 Number of GPR Renames

Results based on simulation of an online transaction processing application Vertical axis does not originate at 0 UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Single Thread Operation

ƒ Advantageous for execution unit limited applications  Floating or fixed point intensive workloads ƒ Execution unit limited applications

provide minimal performance leverage IPC for SMT  Extra resources necessary for SMT POWER5 ST POWER4+ provide higher performance benefit POWER5 SMT when dedicated to single thread ƒ Determined dynamically on a per Matrix Multiply processor basis

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Modifications to POWER4 System Structure

P P P P

L3 L2 L2 L3

L3 L3

Fab Ctl Fab Ctl

Mem Ctl Mem Ctl

L3 L3

Mem Ctl Mem Ctl

Memory Memory

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

16-way Building Block

Book Memory I/O Memory I/O Memory I/O Memory I/O

MCM L3 L3 L3 L3 POWER5 POWER5 POWER5 POWER5

POWER5 POWER5 POWER5 POWER5 L3 L3 L3 L3 MCM

I/O I/O I/O I/O Memory Memory Memory Memory

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

POWER5 Multi-chip Module

ƒ 95mm % 95mm ƒ Four POWER5 chips ƒ Four cache chips ƒ 4,491 signal I/Os ƒ 89 layers of metal

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

64-way SMP Interconnection

Interconnection exploits enhanced distributed switch ƒ All chip interconnections operate at half processor frequency and scale with processor frequency

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Design Process

ƒ Concept Phase (~10 People/4 months)  Competitive assessment  Customer requirements  Technology assessment – Try to match technology introduction with products  AREA/POWER estimates – Cost  Schedule  Staffing  Lots or executive/technical reviews

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Design Process (cont.)

ƒ High Level Design Phase (~50 People/6 months)  Micro-architecture  Cycle Time closed – Cross section of critical paths – Power closed  Area/Power budgets established  Arrays architected  All Unit I/O defined timing contract in place  Performance model meets concept phase targets  HLD Exit review.  Everything is now in place to effectively engage full team

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Design Process (cont.) ƒ Implementation Phase (~200 People/12-18 months)  Write VHDL  Schematic entry for circuits  Simulation – Block>unit>core>chip>system  Integration process begins – RLM and Custom macros > unit > core >chip  Design rule checking, (can’t trust human’s)  Bulk simulation  Check recheck and check again ƒ RIT (Release Information Tape)

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Bring-up 100’s People 12-18 Months

ƒ Starts at Wafer test. ƒ AVPs ƒ Random test programs ƒ Boot operating system  OS based testing ƒ Formal testing (new groups of people )  POWER (Will discuss later)  Performance testing ƒ Additional RITs if necessary ƒ Release to mfg ƒ GA (Hurray) ƒ Field Support

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Power

ƒ Micro-Processor Power Complex equation.  DC Power (IDDQ) – Voltage – Temperature – Speed of Chip (PSRO) Processor speed Ring Oscillator.  AC Power – Function of CV2f X Switching Factor. – Workload Dependent.

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

GR DD2.0 Fmax vs. 1.2v, 25C PSRO Unguardbanded, 1.3v, 85C Fmax Frequency (MHz)

PSRO (ps/stage) Faster parts Faster partsParts

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Power 5 Leakage Power vs. PSRO 1.3 V and 85C

9s2 Leakage Power Leakage Power (W)

PSRO (ps/stage)

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Power 5 AC Power vs. Frequency 1.3 V and 85 C

9s2 AC Power AC Power (W)

Frequency (MHz)

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Chip Power vs Frequency 1.3V 85C Frequency (MHz)

Power (W)

UT CS352 , April 2005 © 2003 IBM Corporation IBM’s POWER5 Micro Processor Design and Methodology

Other SMT Considerations

ƒ Power Management  SMT Increases execution unit utilization  Dynamic power management does not impact performance ƒ Debug tools / Lab bring-up  Instruction tracing  Hang detection  Forward progress monitor ƒ Performance Monitoring ƒ Serviceability

UT CS352 , April 2005 © 2003 IBM Corporation