POWER5 Processor and System Evolution

ScicomP 11 Charles Grassl IBM May, 2005

© 2005 IBM Corporation

Agenda

• pSeries systems
  • Caches
  • Memory
• POWER5 processors
  • Registers
  • Speeds
  • Design features

POWER5™ Design

• POWER4 base: binary and structural compatibility
• Shared memory scalability: up to 64 processors, 128 threads
• High floating point performance
• Server flexibility
• Power efficient design
• Utility: reliability, availability, serviceability

POWER5 Systems

• Second generation dual-core chip
• Intelligent 2-way SMT
• Power management with no performance impact
• 130 nm lithography, 276M transistors, 8 layers of metal
• Multi Chip Modules (MCM): 95 mm on a side
• Micropartitioning: up to 64 physical processors, 1280 virtual processors per system
• Eight-way SMP looks like 16-way to software

[Diagram: virtualized system with AIX 5L, Linux, and OS/400 partitions plus an Integrated xSeries Server (IXA), virtual I/O and virtual LAN over the POWER Hypervisor, managed through a Hardware Management Console (HMC)]

POWER5 System Features

• Dual core chip
• Shared L2 cache
• Shared L3 cache
• Shared memory
• Multiple page size support
• Simultaneous multi-threading

POWER5 Features

[Diagram: POWER5 chip with two processor cores, each running two instruction streams (0 through 3), sharing on-chip L2 cache slices and an L3 cache]

POWER4 Chip (December 2001)

• Technology: 180 nm lithography, Cu, SOI
• POWER4+ shipping in 130 nm today
• Dual processor core
• 8-way superscalar
• Out-of-order execution
  • 2 load/store units
  • 2 fixed point units
  • 2 floating point units
  • Logical operations on the Condition Register
  • Branch execution unit
• > 200 instructions in flight
• Hardware instruction and data prefetch

[Die photo: two cores (FPU, FXU, ISU, IDU, LSU, IFU, BXU units), shared L2 slices, L3 directory and control]

POWER5 Chip

• IBM CMOS, 130 nm
• Copper and SOI
• 8 layers of metal
• Chip: 389 mm², 276M transistors
• I/Os: 2313 signal, 3057 power
• Same technology as POWER4+

POWER5 Multi-chip Module

• 95 mm × 95 mm
• Four POWER5 chips
• Four cache chips
• 4,491 signal I/Os
• 89 layers of metal

Multi Chip Module (MCM) Architecture

POWER4:
• 4 processor chips (2 processors per chip)
• 8 off-module L3 chips; L3 cache is controlled by a processor pair
• 4 memory control chips
• 16 chips total

POWER5:
• 4 processor chips (2 processors per chip)
• 4 L3 cache chips; L3 cache is used by the MCM and logically shared across the node, an "extension" of L2
• Memory controller on chip
• 8 chips total

Dynamic Power Management

• Two components: switching power and leakage power
• Impact of SMT on power: more instructions executed per cycle
• Switching power reduction: extensive fine-grain, dynamic clock-gating
• Leakage power reduction: minimal use of low-Vt devices
• No performance impact
• Low power mode for low priority threads

Dynamic Power Management

[Thermal images: photos taken with a thermal-sensitive camera while a prototype POWER5 chip was undergoing tests, comparing single-threaded and simultaneous multi-threaded operation, with and without dynamic power management]

Simultaneous multi-threading with dynamic power management reduces power consumption below the standard, single-threaded level.

Modifications to POWER4 to create POWER5

• Reduced L3 latency
• Larger SMPs
• Faster access to memory
• Number of chips cut in half

[Diagram: POWER5 chip with processors, L2 caches, fabric controllers, and on-chip L3 and memory controllers attached to off-chip L3 and memory]

64-way SMP Interconnection

• 8-byte buses at 2:1 (half processor frequency)
• Interconnection exploits the enhanced distributed switch
• All chip interconnections operate at half processor frequency and scale with processor frequency

Simultaneous Multi-Threading in POWER5

• Each chip appears as a 4-way SMP to software: 2 processors, 2 threads per processor
• Processor resources optimized for enhanced SMT performance
• Software-controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single-threaded and multi-threaded mode

[Diagram: execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) filled each cycle by thread 0 and thread 1]

Multi-threading Evolution

[Diagrams: execution unit occupancy (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) per cycle under single threading, coarse grain threading, fine grain threading, and simultaneous multi-threading; legend: thread 0 executing, thread 1 executing, no thread executing]

Simultaneous Multi-threading

[Diagram: POWER4 (single threaded) vs. POWER5 simultaneous multi-threading; with SMT the execution unit slots (FX0, FX1, LS0, LS1, FP0, FP1, BRX, CRL) are filled by two active threads, raising throughput]

• Appears as 4 CPUs per chip to the operating system (AIX 5L V5.3 and Linux)
• Utilizes unused execution unit cycles
• Symmetric multiprocessing (SMP) programming model
• Natural fit with a superscalar out-of-order execution core
• Dispatches two threads per processor
• Net result: better processor utilization

POWER5 Performance Expectations

• Higher sustained-to-peak floating point rate ratio compared to POWER4
• Reduction in L3 and memory latency
• Integrated memory controller
• Increased rename resources
• Higher instruction level parallelism in compute intensive applications
• Fast barrier synchronization operation
• Enhanced data prefetch mechanism

Processor Architecture

• CPUs
• Caches
• Performance features

Chip Enhancements

• Caches and translation resources: larger caches, enhanced associativity
• Resource pools: rename registers (GPRs, FPRs) increased to 120 each
• L2 cache coherency engines: increased by 100%
• Memory controller moved on chip
• Dynamic power management

New POWER5 Instructions

• Enhanced data prefetch (eDCBT)
• Floating point:
  • Non-IEEE mode of execution for divide and square root
  • Reciprocal estimate, double precision
  • Reciprocal square-root estimate, single precision
• Population count

Processor Characteristics

• Deep pipelines
• High frequency clocks
• High asymptotic rates
• Superscalar
• Speculative out-of-order instructions
• Up to 8 outstanding cache line misses
• Large number of instructions in flight
• Branch prediction
• Prefetching

POWER4 and POWER5 Storage Hierarchy

                              POWER4                     POWER5
L2 cache
  Capacity, line size         1.44 Mbyte, 128-byte line  1.92 Mbyte, 128-byte line
  Associativity, replacement  8-way, LRU                 10-way, LRU
Off-chip L3 cache
  Capacity, line size         32 Mbyte, 512-byte line    36 Mbyte, 256-byte line
  Associativity, replacement  8-way, LRU                 12-way, LRU
Chip interconnect type        Distributed switch         Enhanced distributed switch
Intra-MCM data buses          1/2 processor speed        Processor speed
Inter-MCM data buses          1/3 processor speed        1/2 processor speed
Memory                        1 Tbyte                    2 Tbyte

Multiprocessor Chip

• 2 CPUs (processors) on one chip
• Each processor: L1 data and instruction caches
• Each chip:
  • Shared memory path
  • Shared L3 cache (32 Mbyte)
  • Shared L2 cache (1.5 Mbyte)

Chip Structure

[Diagram: chip structure with two processor cores over three L2 cache slices; fabric controller with chip-to-chip and MCM-to-MCM links; L3 directory, L3/memory control, and GX bus controller]

Micro Architecture

• 64-bit RISC
• Multiple execution units
• Hardware data prefetch
• Out-of-order execution
• 8 instructions / cycle

Multiple Functional Units

• Symmetric functional units
• Two floating point units (FPU)
• Three fixed point units (FXU): two integer, one control (CR)
• Two load/store units (LSU)
• One branch processing unit (BPU)

[Diagram: two FMA units, two load/store units, two fixed point units, a branch unit, and a CR unit]

Fast Core: Instruction-level Parallelism

• Speculative, out-of-order core organization
• Large rename pools
• 8 instruction issue, 5 instruction complete
• Large instruction window for scheduling
• 8 execution pipelines:
  • 2 load/store units
  • 2 fixed point units
  • 2 DP multiply-add execution units
  • 1 branch resolution unit
  • 1 CR execution unit
• Aggressive branch prediction:
  • Target address and outcome prediction
  • Static prediction / branch hints used
• Fast, selective flush on branch mispredict

[Diagram: IFAR and I-cache feed branch scan and prediction; decode, crack, and group formation; global completion table (GCT); BR/CR, FX/LD 1, FX/LD 2, and FP issue queues; BR, CR, FX1, LD1, LD2, FX2, FP1, FP2 execution units; store queue; D-cache]

Registers

Resource     Logical              POWER4 physical   POWER5 physical
GPRs         32                   80                120
FPRs         32                   72                120
CRs          8 (9) 4-bit fields   32                32
Link/Count   2                    16                16
FPSCR        1                    20                20
XER          4 fields             24                24

Functional Unit Progression

                 POWER2    POWER3    POWER4    POWER5
Clock periods    2         3         6         6
Frequency        125 MHz   375 MHz   1.3 GHz   1.9 GHz
Time (nanosec.)  16        8         4.6       3.2

Registers

• CPU's point of view: 120 FP registers (POWER5)
• User's point of view: 32 FP registers (architecture)
• Rename registers relieve register "pressure"

[Diagram: 32 architected registers mapped onto 72 physical registers (POWER4)]

Register Renaming

• The architecture has 32 registers (legacy)
• Cases which require additional registers:
  • Tight loops: computationally intensive
  • "Broad" loops: many variables involved
  • Deep pipelines
• Rename registers are increasingly important with simultaneous multi-threading

Register Renaming: Read After Write

R13 = R14 + R15
...
R16 = R13 + R12

A true dependence: nothing to be done, the reader must wait for the value.

Register Renaming: Write After Write

Original:             Renamed:
R13 = R14 + R15       R13 = R14 + R15
...                   ...
R13 = R16 + R17       R42 = R16 + R17
R19 = R13 + R18       R19 = R42 + R18

Effect of Registers

               POWER4         POWER5
GP registers   80             120
FP registers   72             120
DGEMM speed    60% of burst   90% of burst

Enhances performance of computationally intensive kernels.

Renaming Example

• Matrix multiply with 4x unrolling:
  • 2 FMAs and 1 LFD per cycle require 3 renames per cycle
  • After 13 cycles, 39 of the 40 FP renames (POWER4: 72 − 32) are allocated
  • Cycles 14, 16, and 18: instructions are rejected due to lack of renames
  • Result: ~70 to 75% of peak
• Software rule of thumb: approx. 13 renames available every 6 cycles
• POWER5 alleviates this with 120 physical FP registers

Effect of Rename Registers

[Bar charts: Mflop/s for MATMUL and polynomial evaluation vs. peak (scale 0 to 7000), POWER4 at 1.5 GHz and POWER5 at 1.65 GHz; POWER5 comes much closer to peak]

Floating Point Functional Units

• Two floating point execution units
• Divide and square root sub-units: NOT pipelined
• Double precision (64-bit) data path

Instruction   Single (cycles)   Double (cycles)
Fma           6                 6
Fdiv          ~25               32
Fsqrt         -                 34

Floating Point Functional Units

• 2 fused multiply-add (FMA) units
• Per instruction: 1 floating point add and 1 floating point multiply
• 4 floating point operations per clock period
• IEEE arithmetic

Arithmetic

• IEEE 754 single and double precision floating point
• Floating multiply-add: the intermediate value is not rounded
• 64-bit integer arithmetic instructions: used only in 64-bit addressing mode (-q64)

Size   Floating point   Integer
16     Yes              No
32     Yes              Yes
64     Yes              Yes (-q64)
128    No               No

Pipelined Functional Units

Operation                          Pipelined   Rate
Multiply-add (A*B+C), multiply,    Yes         12 results / 6 clock periods
or add
Divide (A/B)                       No          2 results / 32 clock periods
Square root                        No          2 results / 34 clock periods

Deep Pipelines

• Operations limited by functional unit transit time:
  • Divide
  • Square root
  • Intrinsic functions
  • Recursion

Translation Lookaside Buffer (TLB)

• 1024 entries
• Page sizes: 4096 bytes, 16 Mbyte

[Diagram: TLB entries mapping page addresses to pages in memory]

TLB Thrashing

• The TLB spans a small amount of memory
• Strategy:
  • Avoid large strides
  • Avoid randomly accessing large constructs
  • Gather and scatter are very bad
• A common problem on RISC processors

Hardware Prefetch

• Detects adjacent cache line references, forward and backward strides
• Prefetches up to two lines ahead per stream
• Up to eight concurrent streams
• Twelve prefetch filter queues
• No prefetch on store misses (when a store instruction causes a cache line miss)
• Ramped initialization:
  • L2 to L1 prefetches
  • L3 to L2 prefetches
  • Memory to L3 prefetches

Memory Access

• Load and store units: two per CPU
• Connect the CPU to memory

[Diagram: processor load/store units connected to memory through prefetch buffers]

Memory Prefetching

• Eight prefetch stream buffers
• Connect the CPU to memory

[Diagram: streams 1 through 8 in the prefetch buffers between the processor's load/store units and memory]

Cache Line Load

• The memory system does not detect access patterns within a cache line
• It detects whether a miss address falls within the first 3/4 or the last 1/4 of the cache line

Stride Pattern Recognition

• Upon a cache miss:
  • A biased guess is made as to the direction of the stream
  • The guess is based upon where in the cache line the miss address falls:
    • First 3/4: the direction is guessed ascending
    • Last 1/4: the direction is guessed descending

Prefetching

[Diagram: prefetch engine fetching cache lines 1 through 8 from memory ahead of the demand references]

Streams

[Diagram: non-streaming vs. streaming access between processor and memory]

Streams

[Diagram: multiple concurrent streams between processor and memory]

Effect of Prefetch Buffers

• Memory load overlap
• Up to 8 streams
• Variables
• Patterns

for (j=0;j

Memory Bandwidth

[Bar chart: memory bandwidth in Mbyte/s (0 to 12000) for ESSL, DCBZ, Copy, Daxpy, Daxpy2, and Daxpy4, with large pages (LP) and small pages (SP); p5-595, 1.9 GHz]

Memory Bandwidth

[Bar chart: memory bandwidth in Mbyte/s (0 to 12000) for Copy, ESSL, DCBZ, Daxpy, Daxpy2, and Daxpy4, comparing POWER5 LP, POWER5 SP, and POWER4 SP; p5-595 1.9 GHz and p690 1.3 GHz]

Memory Bandwidth

• Typical POWER5 bandwidth:
  • 4 Gbyte/s with small pages (SP)
  • 8 Gbyte/s with large pages (LP)
• Twice the bandwidth of POWER4

Memory Bandwidth: Stream Buffers

[Chart: bandwidth in Mbyte/s (0 to 7000) vs. number of right hand sides (1 to 12), large pages (LP) and small pages (SP); p5-595, 1.9 GHz]

Strides

• Cache line size is 128 bytes
  • Double precision: 16 words per line
  • Single precision: 32 words per line

Bandwidth reduction:

Stride   Single   Double
1        1x       1x
2        1/2      1/2
4        1/4      1/4
8        1/8      1/8
16       1/16     1/16
32       1/32     1/16
64       1/32     1/16

Stride Test: Small Strides

[Chart: bandwidth in Mbyte/s (0 to 9000) vs. stride (−16 to 16, double precision); p5-595, 1.9 GHz]

Stride Test: Large Strides

[Chart: bandwidth in Mbyte/s (0 to 450) vs. stride (24 to 1028, double precision), large pages (LP) and small pages (SP); p5-595, 1.9 GHz]

POWER5: Memory

• One or two memory cards per MCM; best bandwidth with two memory cards
• 16 Gbyte/s bandwidth per chip

[Diagram: MCM with four L3 caches and attached memory cards]

Interleaved Memory

• Interleaved 4-way within an MCM (only if the 2 memory cards match in size)
• Pages interleaved within an MCM
• Consecutive pages can be on any MCM

Memory Page Placement

• Default is random page placement (small pages)
• Local page placement optional with the "first touch" policy
• "Round robin" option available with AIX 5.2
• Large pages are also available:
  • Loader option or "tag" binary
  • Large pages are statically allocated
  • Placed at allocation time, not at first reference

Memory Allocation

• Pages are allocated by module
• Approximately uniform distribution
• Approximately round robin

Memory Allocation

• Memory affinity: allocate pages on memory local to the module

Memory Latencies: POWER4 and POWER5

[Chart: latency in nanoseconds (0 to 300) vs. length, POWER4 vs. POWER5 MCM; p655 1.5 GHz, p5-595 1.9 GHz]

Memory Latencies: POWER5 Memory Affinity

[Chart: latency in nanoseconds (0 to 300) vs. length, POWER5 with and without MCM-local memory affinity; p655 1.5 GHz, p5-595 1.9 GHz]

Memory Latencies: L1 Cache – L2 Cache

[Chart: latency in nanoseconds (0 to 10) vs. length (0 to 64 Kbyte), showing the L1 to L2 cache transition; p5-595, 1.9 GHz]

Memory Latencies: L3 Cache – Memory

[Chart: latency in nanoseconds (0 to 250) vs. length (0 to 100 Mbyte), showing the L3 cache to memory transition; p5-595, 1.9 GHz]

Memory Latencies

Region   Size         Time (nanosec.)   Clocks (1.9 GHz)
L1       32 kbyte     1                 2
L2       1.92 Mbyte   9                 17
L3       36 Mbyte     48                92
Memory   -            210               403

p5-595, 1.9 GHz; program: lmbench

Memory Latency

Level    POWER4   POWER4+   POWER5
L1       2        2         2
L2       9        7         9
L3       95       75        48
Memory   295      255       210

Results are in nanoseconds

Chip to Chip Communications

• On chip: 2:1 bus frequency
• Bandwidth: 35 Gbyte/s

Memory Performance

• Bandwidth:
  • 3 – 8 Gbyte/s per single processor
  • 200 Gbyte/s per p5-595
• Latency: ~200 nanoseconds

Bandwidth Considerations

• Factors that affect bandwidth:
  • Right hand side streams: using the 8 stream buffers overlaps cache line loads
  • Page size: small (4 kbyte) or large (16 Mbyte) memory pages

Summary

• POWER5: 8 functional units
• Chip: 2 processors, shared L2 cache
• Module: four chips, 8 processors
• p5-595 system: 8 modules

Summary

• Registers: 32 architected, 88 rename
• Functional units:
  • FMA: 6 clock periods, pipelined
  • FDIV: 32 clock periods, NOT pipelined
• 1024-entry TLB, 4096-byte pages
• 8 prefetch streams
