POWER5 Processor and System Evolution

ScicomP 11 Charles Grassl IBM May, 2005

© 2005 IBM Corporation

Agenda

• pSeries systems
  • Caches
  • Memory
• POWER5 processors
  • Registers
  • Speeds
  • Design features

POWER5™ Design

• POWER4 base: binary and structural compatibility
• Shared memory scalability: up to 64 processors, 128 threads
• High floating point performance
• Server flexibility
• Power efficient design
• Utility: reliability, availability, serviceability

POWER5 Systems

• Second generation dual-core chip
• Intelligent 2-way SMT
• Power management with no performance impact
• 130 nm lithography, 276M transistors, 8 layers of metal
• Multi Chip Modules (MCM): 95 mm on a side
• Micropartitioning: up to 64 physical processors, 1280 virtual processors per system
• Eight-way SMP looks like 16-way to software

[Diagram: virtualized system with AIX 5L, Linux, and OS/400 partitions plus an Integrated xSeries Server (IXA), virtual I/O and virtual LAN over the POWER Hypervisor, managed through a Hardware Management Console (HMC)]

POWER5 System Features

• Dual core chip
• Shared L2 cache
• Shared L3 cache
• Shared memory
• Multiple page size support
• Simultaneous multi-threading

POWER5 Features

[Diagram: POWER5 chip with two processor cores, each running two instruction streams (0 through 3), sharing on-chip L2 cache slices and an L3 cache]

POWER4 Chip (December 2001)

• Technology: 180 nm lithography, Cu, SOI
• POWER4+ shipping in 130 nm today
• Dual processor core
• 8-way superscalar
• Out-of-order execution
  • 2 load/store units
  • 2 fixed point units
  • 2 floating point units
  • Logical operations on the Condition Register
  • Branch execution unit
• > 200 instructions in flight
• Hardware instruction and data prefetch

[Die photo: two cores (FPU, FXU, ISU, IDU, LSU, IFU, BXU units), shared L2 slices, L3 directory and control]

POWER5 Chip

• IBM CMOS, 130 nm
• Copper and SOI
• 8 layers of metal
• Chip: 389 mm², 276M transistors
• I/Os: 2313 signal, 3057 power
• Same technology as POWER4+

POWER5 Multi-chip Module

• 95 mm × 95 mm
• Four POWER5 chips
• Four cache chips
• 4,491 signal I/Os
• 89 layers of metal

Multi Chip Module (MCM) Architecture

POWER4:
• 4 processor chips (2 processors per chip)
• 8 off-module L3 chips; L3 cache is controlled by a processor pair
• 4 memory control chips
• 16 chips total

POWER5:
• 4 processor chips (2 processors per chip)
• 4 L3 cache chips; L3 cache is used by the MCM and logically shared across the node, an "extension" of L2
• Memory controller on chip
• 8 chips total

Dynamic Power Management

• Two components: switching power and leakage power
• Impact of SMT on power: more instructions executed per cycle
• Switching power reduction: extensive fine-grain, dynamic clock-gating
• Leakage power reduction: minimal use of low-Vt devices
• No performance impact
• Low power mode for low priority threads

Dynamic Power Management

[Thermal images: photos taken with a thermal-sensitive camera while a prototype POWER5 chip was undergoing tests, comparing single-threaded and simultaneous multi-threaded operation, with and without dynamic power management]

Simultaneous multi-threading with dynamic power management reduces power consumption below the standard, single-threaded level.

Modifications to POWER4 to create POWER5

• Reduced L3 latency
• Larger SMPs
• Faster access to memory
• Number of chips cut in half

[Diagram: POWER5 chip with processors, L2 caches, fabric controllers, and on-chip L3 and memory controllers attached to off-chip L3 and memory]

64-way SMP Interconnection

• 8-byte buses at 2:1 (half processor frequency)
• Interconnection exploits the enhanced distributed switch
• All chip interconnections operate at half processor frequency and scale with processor frequency

Simultaneous Multi-Threading in POWER5

• Each chip appears as a 4-way SMP to software: 2 processors, 2 threads per processor
• Processor resources optimized for enhanced SMT performance
• Software-controlled thread priority
• Dynamic feedback of runtime behavior to adjust priority
• Dynamic switching between single-threaded and multi-threaded mode

[Diagram: execution units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) filled each cycle by thread 0 and thread 1]

Multi-threading Evolution

[Diagrams: execution unit occupancy (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) per cycle under single threading, coarse grain threading, fine grain threading, and simultaneous multi-threading; legend: thread 0 executing, thread 1 executing, no thread executing]

Simultaneous Multi-threading

[Diagram: POWER4 (single threaded) vs. POWER5 simultaneous multi-threading; with SMT the execution unit slots (FX0, FX1, LS0, LS1, FP0, FP1, BRX, CRL) are filled by two active threads, raising throughput]

• Appears as 4 CPUs per chip to the operating system (AIX 5L V5.3 and Linux)
• Utilizes unused execution unit cycles
• Symmetric multiprocessing (SMP) programming model
• Natural fit with a superscalar out-of-order execution core
• Dispatches two threads per processor
• Net result: better processor utilization

POWER5 Performance Expectations

• Higher sustained-to-peak floating point rate ratio compared to POWER4
• Reduction in L3 and memory latency
• Integrated memory controller
• Increased rename resources
• Higher instruction level parallelism in compute intensive applications
• Fast barrier synchronization operation
• Enhanced data prefetch mechanism

Processor Architecture

• CPUs
• Caches
• Performance features

Chip Enhancements

• Caches and translation resources: larger caches, enhanced associativity
• Resource pools: rename registers (GPRs, FPRs) increased to 120 each
• L2 cache coherency engines: increased by 100%
• Memory controller moved on chip
• Dynamic power management

New POWER5 Instructions

• Enhanced data prefetch (eDCBT)
• Floating point:
  • Non-IEEE mode of execution for divide and square root
  • Reciprocal estimate, double precision
  • Reciprocal square-root estimate, single precision
• Population count

Processor Characteristics

• Deep pipelines
• High frequency clocks
• High asymptotic rates
• Superscalar
• Speculative out-of-order instructions
• Up to 8 outstanding cache line misses
• Large number of instructions in flight
• Branch prediction
• Prefetching

POWER4 and POWER5 Storage Hierarchy

                              POWER4                     POWER5
L2 cache
  Capacity, line size         1.44 Mbyte, 128-byte line  1.92 Mbyte, 128-byte line
  Associativity, replacement  8-way, LRU                 10-way, LRU
Off-chip L3 cache
  Capacity, line size         32 Mbyte, 512-byte line    36 Mbyte, 256-byte line
  Associativity, replacement  8-way, LRU                 12-way, LRU
Chip interconnect type        Distributed switch         Enhanced distributed switch
Intra-MCM data buses          1/2 processor speed        Processor speed
Inter-MCM data buses          1/3 processor speed        1/2 processor speed
Memory                        1 Tbyte                    2 Tbyte

Multiprocessor Chip

• 2 CPUs (processors) on one chip
• Each processor: L1 data and instruction caches
• Each chip:
  • Shared memory path
  • Shared L3 cache (32 Mbyte)
  • Shared L2 cache (1.5 Mbyte)

Chip Structure

[Diagram: chip structure with two processor cores over three L2 cache slices; fabric controller with chip-to-chip and MCM-to-MCM links; L3 directory, L3/memory control, and GX bus controller]

Micro Architecture

• 64-bit RISC
• Multiple execution units
• Hardware data prefetch
• Out-of-order execution
• 8 instructions / cycle

Multiple Functional Units

• Symmetric functional units
• Two floating point units (FPU)
• Three fixed point units (FXU): two integer, one control (CR)
• Two load/store units (LSU)
• One branch processing unit (BPU)

[Diagram: two FMA units, two load/store units, two fixed point units, a branch unit, and a CR unit]

Fast Core: Instruction-level Parallelism

• Speculative, out-of-order core organization
• Large rename pools
• 8 instruction issue, 5 instruction complete
• Large instruction window for scheduling
• 8 execution pipelines:
  • 2 load/store units
  • 2 fixed point units
  • 2 DP multiply-add execution units
  • 1 branch resolution unit
  • 1 CR execution unit
• Aggressive branch prediction:
  • Target address and outcome prediction
  • Static prediction / branch hints used
• Fast, selective flush on branch mispredict

[Diagram: IFAR and I-cache feed branch scan and prediction; decode, crack, and group formation; global completion table (GCT); BR/CR, FX/LD 1, FX/LD 2, and FP issue queues; BR, CR, FX1, LD1, LD2, FX2, FP1, FP2 execution units; store queue; D-cache]

Registers

Resource     Logical              POWER4 physical   POWER5 physical
GPRs         32                   80                120
FPRs         32                   72                120
CRs          8 (9) 4-bit fields   32                32
Link/Count   2                    16                16
FPSCR        1                    20                20
XER          4 fields             24                24

Functional Unit Progression

                 POWER2    POWER3    POWER4    POWER5
Clock periods    2         3         6         6
Frequency        125 MHz   375 MHz   1.3 GHz   1.9 GHz
Time (nanosec.)  16        8         4.6       3.2

Registers

• CPU's point of view: 120 FP registers (POWER5)
• User's point of view: 32 FP registers (architecture)
• Rename registers relieve register "pressure"

[Diagram: 32 architected registers mapped onto 72 physical registers (POWER4)]

Register Renaming

• The architecture has 32 registers (legacy)
• Cases which require additional registers:
  • Tight loops: computationally intensive
  • "Broad" loops: many variables involved
  • Deep pipelines
• Rename registers are increasingly important with simultaneous multi-threading

Register Renaming: Read After Write

R13 = R14 + R15
...
R16 = R13 + R12

A true dependence: nothing to be done, the reader must wait for the value.

Register Renaming: Write After Write

Original:             Renamed:
R13 = R14 + R15       R13 = R14 + R15
...                   ...
R13 = R16 + R17       R42 = R16 + R17
R19 = R13 + R18       R19 = R42 + R18

Effect of Registers

               POWER4         POWER5
GP registers   80             120
FP registers   72             120
DGEMM speed    60% of burst   90% of burst

Enhances performance of computationally intensive kernels.

Renaming Example

• Matrix multiply with 4x unrolling:
  • 2 FMAs and 1 LFD per cycle require 3 renames per cycle
  • After 13 cycles, 39 of the 40 FP renames (POWER4: 72 − 32) are allocated
  • Cycles 14, 16, and 18: instructions are rejected due to lack of renames
  • Result: ~70 to 75% of peak
• Software rule of thumb: approx. 13 renames available every 6 cycles
• POWER5 alleviates this with 120 physical FP registers

Effect of Rename Registers

[Bar charts: Mflop/s for MATMUL and polynomial evaluation vs. peak (scale 0 to 7000), POWER4 at 1.5 GHz and POWER5 at 1.65 GHz; POWER5 comes much closer to peak]

Floating Point Functional Units

• Two floating point execution units
• Divide and square root sub-units: NOT pipelined
• Double precision (64-bit) data path

Instruction   Single (cycles)   Double (cycles)
Fma           6                 6
Fdiv          ~25               32
Fsqrt         -                 34

Floating Point Functional Units

• 2 fused multiply-add (FMA) units
• Per instruction: 1 floating point add and 1 floating point multiply
• 4 floating point operations per clock period
• IEEE arithmetic

Arithmetic

• IEEE 754 single and double precision floating point
• Floating multiply-add: the intermediate value is not rounded
• 64-bit integer arithmetic instructions: used only in 64-bit addressing mode (-q64)

Size   Floating point   Integer
16     Yes              No
32     Yes              Yes
64     Yes              Yes (-q64)
128    No               No

Pipelined Functional Units

Operation                          Pipelined   Rate
Multiply-add (A*B+C), multiply,    Yes         12 results / 6 clock periods
or add
Divide (A/B)                       No          2 results / 32 clock periods
Square root                        No          2 results / 34 clock periods

Deep Pipelines

• Operations limited by functional unit transit time:
  • Divide
  • Square root
  • Intrinsic functions
  • Recursion

Translation Lookaside Buffer (TLB)

• 1024 entries
• Page sizes: 4096 bytes, 16 Mbyte

[Diagram: TLB entries mapping page addresses to pages in memory]

TLB Thrashing

• The TLB spans a small amount of memory
• Strategy:
  • Avoid large strides
  • Avoid randomly accessing large constructs
  • Gather and scatter are very bad
• A common problem on RISC processors

Hardware Prefetch

• Detects adjacent cache line references, forward and backward strides
• Prefetches up to two lines ahead per stream
• Up to eight concurrent streams
• Twelve prefetch filter queues
• No prefetch on store misses (when a store instruction causes a cache line miss)
• Ramped initialization:
  • L2 to L1 prefetches
  • L3 to L2 prefetches
  • Memory to L3 prefetches

Memory Access

• Load and store units: two per CPU
• Connect the CPU to memory

[Diagram: processor load/store units connected to memory through prefetch buffers]

Memory Prefetching

• Eight prefetch stream buffers
• Connect the CPU to memory

[Diagram: streams 1 through 8 in the prefetch buffers between the processor's load/store units and memory]

Cache Line Load

• The memory system does not detect access patterns within a cache line
• It detects whether a miss address falls within the first 3/4 or the last 1/4 of the cache line

Stride Pattern Recognition

• Upon a cache miss:
  • A biased guess is made as to the direction of the stream
  • The guess is based upon where in the cache line the miss address falls:
    • First 3/4: the direction is guessed ascending
    • Last 1/4: the direction is guessed descending

Prefetching

[Diagram: prefetch engine fetching cache lines 1 through 8 from memory ahead of the demand references]

Streams

[Diagram: non-streaming vs. streaming access between processor and memory]

Streams

[Diagram: multiple concurrent streams between processor and memory]

Effect of Prefetch Buffers

• Memory load overlap
• Up to 8 streams
• Variables
• Patterns

for (j=0;j

Memory Bandwidth

[Bar chart: memory bandwidth in Mbyte/s (0 to 12000) for ESSL, DCBZ, Copy, Daxpy, Daxpy2, and Daxpy4, with large pages (LP) and small pages (SP); p5-595, 1.9 GHz]

Memory Bandwidth

[Bar chart: memory bandwidth in Mbyte/s (0 to 12000) for Copy, ESSL, DCBZ, Daxpy, Daxpy2, and Daxpy4, comparing POWER5 LP, POWER5 SP, and POWER4 SP; p5-595 1.9 GHz and p690 1.3 GHz]

Memory Bandwidth

• Typical POWER5 bandwidth:
  • 4 Gbyte/s with small pages (SP)
  • 8 Gbyte/s with large pages (LP)
• Twice the bandwidth of POWER4

Memory Bandwidth: Stream Buffers

[Chart: bandwidth in Mbyte/s (0 to 7000) vs. number of right hand sides (1 to 12), large pages (LP) and small pages (SP); p5-595, 1.9 GHz]

Strides

• Cache line size is 128 bytes
  • Double precision: 16 words per line
  • Single precision: 32 words per line

Bandwidth reduction:

Stride   Single   Double
1        1x       1x
2        1/2      1/2
4        1/4      1/4
8        1/8      1/8
16       1/16     1/16
32       1/32     1/16
64       1/32     1/16

Stride Test: Small Strides

[Chart: bandwidth in Mbyte/s (0 to 9000) vs. stride (−16 to 16, double precision); p5-595, 1.9 GHz]

Stride Test: Large Strides

[Chart: bandwidth in Mbyte/s (0 to 450) vs. stride (24 to 1028, double precision), large pages (LP) and small pages (SP); p5-595, 1.9 GHz]

POWER5: Memory

• One or two memory cards per MCM; best bandwidth with two memory cards
• 16 Gbyte/s bandwidth per chip

[Diagram: MCM with four L3 caches and attached memory cards]

Interleaved Memory

• Interleaved 4-way within an MCM (only if the 2 memory cards match in size)
• Pages interleaved within an MCM
• Consecutive pages can be on any MCM

Memory Page Placement

• Default is random page placement (small pages)
• Local page placement optional with the "first touch" policy
• "Round robin" option available with AIX 5.2
• Large pages are also available:
  • Loader option or "tag" binary
  • Large pages are statically allocated
  • Placed at allocation time, not at first reference

Memory Allocation

• Pages are allocated by module
• Approximately uniform distribution
• Approximately round robin

Memory Allocation

• Memory affinity: allocate pages on memory local to the module

Memory Latencies: POWER4 and POWER5

[Chart: latency in nanoseconds (0 to 300) vs. length, POWER4 vs. POWER5 MCM; p655 1.5 GHz, p5-595 1.9 GHz]

Memory Latencies: POWER5 Memory Affinity

[Chart: latency in nanoseconds (0 to 300) vs. length, POWER5 with and without MCM-local memory affinity; p655 1.5 GHz, p5-595 1.9 GHz]

Memory Latencies: L1 Cache – L2 Cache

[Chart: latency in nanoseconds (0 to 10) vs. length (0 to 64 Kbyte), showing the L1 to L2 cache transition; p5-595, 1.9 GHz]

Memory Latencies: L3 Cache – Memory

[Chart: latency in nanoseconds (0 to 250) vs. length (0 to 100 Mbyte), showing the L3 cache to memory transition; p5-595, 1.9 GHz]

Memory Latencies

Region   Size         Time (nanosec.)   Clocks (1.9 GHz)
L1       32 kbyte     1                 2
L2       1.92 Mbyte   9                 17
L3       36 Mbyte     48                92
Memory   -            210               403

p5-595, 1.9 GHz; program: lmbench

Memory Latency

Level    POWER4   POWER4+   POWER5
L1       2        2         2
L2       9        7         9
L3       95       75        48
Memory   295      255       210

Results are in nanoseconds

Chip to Chip Communications

• On chip: 2:1 bus frequency
• Bandwidth: 35 Gbyte/s

Memory Performance

• Bandwidth:
  • 3 – 8 Gbyte/s per single processor
  • 200 Gbyte/s per p5-595
• Latency: ~200 nanoseconds

Bandwidth Considerations

• Factors that affect bandwidth:
  • Right hand side streams: using the 8 stream buffers overlaps cache line loads
  • Page size: small (4 kbyte) or large (16 Mbyte) memory pages

Summary

• POWER5: 8 functional units
• Chip: 2 processors, shared L2 cache
• Module: four chips, 8 processors
• p5-595 system: 8 modules

Summary

• Registers: 32 architected, 88 rename
• Functional units:
  • FMA: 6 clock periods, pipelined
  • FDIV: 32 clock periods, NOT pipelined
• 1024-entry TLB, 4096-byte pages
• 8 prefetch streams
