POWER5 Processor and System Evolution
ScicomP 11 Charles Grassl IBM May, 2005
© 2005 IBM Corporation

Agenda
• pSeries systems
  • Caches
  • Memory
• POWER5 processors
  • Registers
  • Speeds
  • Design features
POWER5™ Design
• POWER4 base
  • Binary and structural compatibility
• Shared memory scalability
  • Up to 64 processors
  • 128 threads
• High floating point performance
• Server flexibility
• Power efficient design
• Utility:
  • Reliability, availability, serviceability
POWER5 Systems
• Second generation dual core chip
• Intelligent 2-way SMT
• Power management with no performance impact
• 130 nm lithography
• 276M transistors
• 8 layers of metal
• Micropartitioning
  • Up to 64 physical processors, 1280 virtual processors per system
• Multi Chip Modules (MCM):
  • Eight way SMP looks like 16-way to software
  • 95 mm on a side

[Figure: system virtualization stack — AIX 5L, Linux, and OS/400 partitions with their kernels, Integrated xSeries Server (IXA), virtual I/O and virtual LAN over the POWER Hypervisor, I/O adapters (IOA), managed by the Hardware Management Console (HMC)]
POWER5 System Features
• Dual core chip
• Shared L2 cache
• Shared L3 cache
• Shared memory
• Multiple page size support
• Simultaneous multi-threading
POWER5 Features
[Figure: POWER5 chip — two processor cores, each running two instruction streams (0–3), sharing the L2 cache, with the L3 cache attached]
POWER4 Chip (December 2001)
• Technology: 180 nm lithography, Cu, SOI
  • POWER4+ shipping in 130 nm today
• Dual processor core
• 8-way superscalar
• Out of order execution
• 2 load/store units
• 2 fixed point units
• 2 floating point units
• Logical operations on Condition Register
• Branch execution unit
• > 200 instructions in flight
• Hardware instruction and data prefetch

[Figure: POWER4 die — FPU, FXU, ISU, IDU, LSU, IFU, and BXU units, L2 cache, L3 directory]
POWER5 Chip
• IBM CMOS 130 nm
• Copper and SOI
• 8 layers of metal
• Chip
  • 389 mm²
  • 276M transistors
  • I/Os: 2313 signal, 3057 power
• Same technology as POWER4+
POWER5 Multi-chip Module
• 95 mm × 95 mm
• Four POWER5 chips
• Four cache chips
• 4,491 signal I/Os
• 89 layers of metal
Multi Chip Module (MCM) Architecture
POWER4:
• 4 processor chips, 2 processors per chip
• 8 off-module L3 chips
• L3 cache is controlled by MCM and logically shared across node
• 4 memory control chips
• 16 chips

POWER5:
• 4 processor chips, 2 processors per chip
• 4 L3 cache chips
• L3 cache is used by processor pair; an "extension" of L2
• 8 chips
Dynamic Power Management
• Two components:
  • Switching power
  • Leakage power
• Impact of SMT on power:
  • More instructions executed per cycle
• Switching power reduction:
  • Extensive fine-grain, dynamic clock-gating
• Leakage power reduction:
  • Minimal use of low Vt devices
• No performance impact
• Low power mode for low priority threads
Dynamic Power Management

[Thermal-camera photos of a prototype POWER5 chip under test, comparing single thread and simultaneous multi-threading, each with no power management and with dynamic power management]

• Simultaneous multi-threading with dynamic power management reduces power consumption below the standard, single threaded level
Modifications to POWER4 to Create POWER5
[Figure: chip block diagram — two processor cores with L2 caches, fabric controller, and on-chip L3 and memory controllers. Changes from POWER4: reduced L3 latency, faster access to memory, larger SMPs, number of chips cut in half]
64-way SMP Interconnection
• Interconnection exploits enhanced distributed switch
• All chip interconnections (8-byte buses at a 2:1 ratio) operate at half processor frequency and scale with processor frequency
IBM Confidential

Simultaneous Multi-Threading in POWER5
• Each chip appears as a 4-way SMP to software
  • 2 processors
  • 2 threads per processor
• Processor resources optimized for enhanced SMT performance
• Software controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single and multithreaded mode

[Figure: execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL), cycles shaded by whether thread 0 or thread 1 is active]
Multi-threading Evolution
[Figure: pipeline occupancy of the execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) under four models — single thread, coarse grain threading, fine grain threading, and simultaneous multi-threading. Shading distinguishes thread 0 executing, thread 1 executing, and no thread executing]
Simultaneous Multi-threading
[Figure: execution unit occupancy (FX0, FX1, LS0, LS1, FP0, FP1, BRX, CRL) for POWER4 (single threaded) versus POWER5 simultaneous multi-threading, showing higher system throughput with both threads active]

• Appears as 4 CPUs per chip to the operating system (AIX 5L V5.3 and Linux)
• Utilizes unused execution unit cycles
• Symmetric multiprocessing (SMP) programming model
• Natural fit with superscalar out-of-order execution core
• Dispatch two threads per processor
• Net result: better processor utilization
POWER5 Performance Expectations
• Higher sustained-to-peak floating point rate ratio compared to POWER4
• Reduction in L3 and memory latency
  • Integrated memory controller
• Increased rename resources
  • Higher instruction level parallelism in compute intensive applications
• Fast barrier synchronization operation
• Enhanced data prefetch mechanism
Processor Architecture
• CPUs
• Caches
• Performance features
Chip Enhancements
• Caches and translation resources
  • Larger caches
  • Enhanced associativity
• Resource pools
  • Rename registers: GPRs, FPRs increased to 120 each
  • L2 cache coherency engines: increased by 100%
• Memory controller moved on chip
• Dynamic power management
New POWER5 Instructions
• Enhanced data prefetch (eDCBT)
• Floating-point:
  • Non-IEEE mode of execution for divide and square-root
  • Reciprocal estimate, double-precision
  • Reciprocal square-root estimate, single-precision
• Population count
Processor Characteristics
• Deep pipelines
• High frequency clocks
• High asymptotic rates
• Superscalar
• Speculative out-of-order instructions
• Up to 8 outstanding cache line misses
• Large number of instructions in flight
• Branch prediction
• Prefetching
POWER4 and POWER5 Storage Hierarchy
                              POWER4                     POWER5
L2 Cache
  Capacity, line size         1.44 Mbyte, 128 byte line  1.92 Mbyte, 128 byte line
  Associativity, replacement  8-way, LRU                 10-way, LRU
Off-chip L3 Cache
  Capacity, line size         32 Mbyte, 512 byte line    36 Mbyte, 256 byte line
  Associativity, replacement  8-way, LRU                 12-way, LRU
Chip interconnect
  Type                        Distributed switch         Enhanced distributed switch
  Intra-MCM data buses        ½ processor speed          Processor speed
  Inter-MCM data buses        ⅓ processor speed          ½ processor speed
Memory                        1 Tbyte                    2 Tbyte
Multiprocessor Chip
• 2 CPUs (processors) on one chip
• Each processor:
  • L1 cache
    • Data
    • Instruction
• Each chip:
  • Shared memory path
  • Shared L3 cache (36 Mbyte)
  • Shared L2 cache (1.92 Mbyte)
Chip Structure
[Figure: chip structure — two processor cores sharing three L2 cache slices, connected through the fabric controller to other chips and MCMs; the L3 directory, L3/memory control, and GX bus controller are also on chip]
Micro Architecture
• 64-bit RISC microprocessor
• Multiple execution units
• Hardware data prefetch
• Out-of-order execution
• Speculative execution
• 8 instructions / cycle
Multiple Functional Units
• Symmetric functional units
• Two floating point units (FPU)
• Three fixed point units (FXU)
  • Two integer
  • One control (CR)
• Two load/store units (LSU)
• One branch processing unit (BPU)

[Figure: two FMA units, two load/store units, two fixed point units, branch unit, and CR unit]
Fast Core: Instruction-level Parallelism
• Speculative superscalar organization
• Out-of-order execution
• Large rename pools
• 8 instruction issue, 5 instruction complete
• Large instruction window for scheduling
• 8 execution pipelines
  • 2 load/store units
  • 2 fixed point units
  • 2 DP multiply-add execution units
  • 1 branch resolution unit
  • 1 CR execution unit
• Aggressive branch prediction
  • Target address and outcome prediction
  • Static prediction / branch hints used
• Fast, selective flush on branch mispredict

[Figure: processor core pipeline — IFAR and I-cache feed branch scan and predict, decode, crack and group formation, and the global completion table (GCT); issue queues (BR/CR, FX/LD 1, FX/LD 2, FP) feed eight execution units, the store queue, and the D-cache]
Registers
Resource     Logical              POWER4 Physical   POWER5 Physical
GPRs         32                   80                120
FPRs         32                   72                120
CRs          8 (9) 4-bit fields   32                32
Link/Count   2                    16                16
FPSCR        1                    20                20
XER          4 fields             24                24
Functional Unit Progression
[Figure: FMA pipeline diagrams for POWER2, POWER3, and POWER4]

                  POWER2    POWER3    POWER4    POWER5
Clock periods     2         3         6         6
Clock rate        125 MHz   375 MHz   1.3 GHz   1.9 GHz
Time (nanosec.)   16        8         4.6       3.2
Registers
• CPU's point of view
  • 120 FP registers (POWER5)
• User point of view
  • 32 FP registers (architecture)
• Rename registers
  • Relieve register "pressure"
[Figure: 32 architecture registers mapped onto 72 physical registers (POWER4)]
Register Renaming
• Architecture has 32 registers
  • Legacy
• Cases which require additional registers:
  • Tight loops: computationally intensive
  • "Broad" loops: many variables involved
  • Deep pipelines
• Rename registers are increasingly important with simultaneous multi-threading
Register Renaming: Read After Write
R13 = R14 + R15
…
R16 = R13 + R12

Nothing to be done: a read-after-write is a true dependence, so renaming cannot remove it.
Register Renaming: Write After Write
Before renaming:          After renaming:
R13 = R14 + R15           R13 = R14 + R15
…                         …
R13 = R16 + R17           R42 = R16 + R17
R19 = R13 + R18           R19 = R42 + R18
Effect of Registers
               POWER4         POWER5
GP registers   80             120
FP registers   72             120
DGEMM speed    60% of burst   90% of burst
Enhances performance of computationally intensive kernels
Renaming Example
• Matrix multiply with 4x unrolling
  • 2 FMAs and 1 LFD per cycle
  • 3 renames per cycle
  • After 13 cycles, 39 of the 40 available FP renames (POWER4: 72 − 32) are allocated
  • Cycles 14, 16, and 18: instructions are rejected due to lack of renames
  • Result: ~70 to 75% of peak
• Software rule of thumb: approx. 13 renames available every 6 cycles
• POWER5 alleviates this with 120 physical FP registers (88 renames)
Effect of Rename Registers
[Bar charts: Mflop/s versus peak for MATMUL and polynomial evaluation, POWER4 @ 1.5 GHz versus POWER5 @ 1.65 GHz]
Floating Point Functional Units
• Two floating point execution units
• Divide and square root sub-units
  • NOT pipelined
• Double precision (64-bit) data path

Instruction   Single (cycles)   Double (cycles)
Fma           6                 6
Fdiv          ~25               32
Fsqrt         -                 34
Floating Point Functional Units
• 2 floating multiply-add (FMA) units
• Per instruction:
  • 1 floating point add
  • 1 floating point multiply
• 4 floating point ops per clock period
• IEEE arithmetic
Arithmetic
• IEEE 754 single and double precision floating-point
• Floating multiply-add:
  • Intermediate value is not rounded
• 64-bit integer arithmetic instructions
  • Used only in 64-bit addressing mode (-q64)

Size   Integer      Floating Point
16     Yes          No
32     Yes          Yes
64     Yes (-q64)   Yes
128    No           No
Pipelined Functional Units
Multiply-add (multiply or add): 12 results / 6 clock periods
Divide: 2 results / 32 clock periods
Square root: 2 results / 34 clock periods

[Figure: two pipelined FMA units computing A*B+C and D*E+F; the divide units computing A/B and D/E are not pipelined]
Deep Pipelines
• Operations limited by functional unit transit time:
  • Divide
  • Square root
  • Intrinsic functions
  • Recursion
Translation Lookaside Buffer (TLB)
• 1024 entries
• Page sizes:
  • 4096 byte
  • 16 Mbyte

[Figure: the TLB maps page addresses to pages in memory]
TLB Thrashing
• TLB spans a small amount of memory
• Strategy:
  • Avoid large strides
  • Avoid randomly accessing large constructs
  • Gather and scatter are very bad
• Common problem on RISC microprocessors
Hardware Prefetch
• Detects adjacent cache line references
  • Forward and backward strides
• Prefetches up to two lines ahead per stream
• Up to eight concurrent streams
• Twelve prefetch filter queues
• No prefetch on store misses (when a store instruction causes a cache line miss)
• Ramped initialization
• L2 to L1 prefetches
• L3 to L2 prefetches
• Memory to L3 prefetches
Memory Access
• Load and store units
  • Two per CPU
  • Connect CPU to memory

[Figure: load/store units connect the processor to memory through prefetch buffers]
Memory Prefetching
• Eight prefetch stream buffers
• Connect CPU to memory

[Figure: prefetch streams 1 through 8 sit between the load/store units and memory]
Cache Line Load
• Memory system does not detect patterns within a cache line
• Detects location within first 3/4 or last 1/4 of a cache line
Stride Pattern Recognition
• Upon a cache miss:
  • A biased guess is made as to the direction of that stream
  • The guess is based upon where in the cache line the address associated with that miss occurred
  • If it is in the first 3/4, the direction is guessed as ascending
  • If it is in the last 1/4, the direction is guessed as descending
Prefetching
[Figure: cache lines 1 through 8 being prefetched from memory ahead of the stream]
Streams
[Figure: non-streaming versus streaming access between processor and memory]
Streams
[Figure: multiple concurrent streams between processor and memory]
Effect of Prefetch Buffers
• Memory load overlap
• Up to 8 streams
  • Variables
  • Patterns
for (j=0;j

Memory Bandwidth

[Bar chart: Mbyte/s for ESSL DCBZ, Copy, Daxpy, Daxpy2, and Daxpy4 with large pages (LP) and small pages (SP); p5-595 1.9 GHz]

Memory Bandwidth

[Bar chart: POWER5 LP, POWER5 SP, and POWER4 SP bandwidth for Copy, ESSL DCBZ, Daxpy, Daxpy2, and Daxpy4; p5-595 1.9 GHz, p690 1.3 GHz]

Memory Bandwidth

• Typical POWER5 bandwidth:
  • 4 Gbyte/s small pages (SP)
  • 8 Gbyte/s large pages (LP)
• Twice the bandwidth of POWER4

Memory Bandwidth: Stream Buffers

[Bar chart: Mbyte/s versus number of right hand sides (1 to 12), LP and SP; p5-595 1.9 GHz]

Strides

• Cache line size is 128 bytes
  • Double precision: 16 words
  • Single precision: 32 words

Bandwidth reduction:

Stride   Single   Double
1        1x       1x
2        1/2      1/2
4        1/4      1/4
8        1/8      1/8
16       1/16     1/16
32       1/32     1/16
64       1/32     1/16

Stride Test: Small Strides

[Chart: Mbyte/s versus stride, -16 to 16, double precision; p5-595 1.9 GHz]

Stride Test: Large Strides

[Chart: Mbyte/s versus stride, 24 to 1028, double precision, LP and SP; p5-595 1.9 GHz]

POWER5: Memory

• One or two memory cards per MCM
  • Best bandwidth with two memory cards
• 16 Gbyte/s bandwidth per chip

Interleaved Memory

• Interleaved 4 way within MCM
  • (Only if 2 memory cards match in size)
• Pages interleaved within an MCM
• Consecutive pages can be on any MCM

Memory Page Placement

• Default is random page placement
  • Small pages
• Local page placement optional with "first touch" policy
• "Round robin" option available with AIX 5.2
• Large pages are also available
  • Loader option or "tag" binary
  • Large pages are statically allocated
  • Placed at allocation time, not first reference

Memory Allocation

• Pages are allocated by module
• Approximately uniform distribution
• Approximately round robin

Memory Allocation

• Memory affinity
  • Allocate pages on memory local to module

Memory Latencies: POWER4 and POWER5

[Chart: time (nanoseconds) versus length for POWER5 MCM and POWER4; p655 1.5 GHz, p5-595 1.9 GHz]

Memory Latencies: POWER5 Memory Affinity

[Chart: time (nanoseconds) versus length for POWER5 and POWER5 MCM; p655 1.5 GHz, p5-595 1.9 GHz]

Memory Latencies: L1 Cache – L2 Cache

[Chart: time (nanoseconds) versus length up to 60,000 bytes; p5-595 1.9 GHz]

Memory Latencies: L3 Cache – Memory

[Chart: time (nanoseconds) versus length up to 100,000,000 bytes; p5-595 1.9 GHz]

Memory Latencies

Region   Size         Time (nanosec.)   Clocks (1.9 GHz)
L1       32 kbyte     1                 2
L2       1.92 Mbyte   9                 17
L3       36 Mbyte     48                92
Memory   -            210               403

p5-595 1.9 GHz; program: lmbench

Memory Latency

Level    POWER4   POWER4+   POWER5
L1       2        2         2
L2       9        7         9
L3       95       75        48
Memory   295      255       210

Results are in nanoseconds

Chip to Chip Communications

• On chip
  • 2:1 bus frequency
• Bandwidth
  • 35 Gbyte/s

Memory Performance

• Bandwidth
  • 3 – 8 Gbyte/s per single processor
  • 200 Gbyte/s per p5-595
• Latency
  • ~200 nanoseconds

Bandwidth Considerations

• Affect bandwidth:
  • Right hand side streams
    • Use of 8 stream buffers
    • Overlap cache line loads
  • Page size
    • Small or large memory pages
    • 4 kbyte or 16 Mbyte

Summary

• POWER5
  • 8 functional units
• Chip:
  • 2 processors
  • Shared L2 cache
• Module
  • Four chips
  • 8 processors
• p5-595 system
  • 8 modules

Summary

• Registers:
  • 32 architecture
  • 88 rename
• Functional units
  • 6 clock periods for FMA (pipelined)
  • 32 clock periods for FDIV (NOT pipelined)
• 1024 entry TLB
  • 4096 byte page
• 8 prefetch streams