POWER5, Federation and Linux: System Architecture

Charles Grassl, IBM, August 2004

© 2004 IBM

Agenda

• Architecture
  – Processors: POWER4, POWER5
  – System: nodes, systems
• High Performance Switch


Processor Design

POWER4
• Two processors per chip
• L1: 32 kbyte data, 64 kbyte instructions
• L2: 1.45 Mbyte per processor pair
• L3: 32 Mbyte per processor pair, shared by all processors

POWER5
• Two processors per chip
• L1: 32 kbyte data, 64 kbyte instructions
• L2: 1.9 Mbyte per processor pair
• L3: 36 Mbyte per processor pair, shared by the processor pair, an extension of the L2 cache
• Simultaneous multithreading
• Enhanced memory loads and stores
• Additional rename registers


POWER4: Multi-processor Chip

• Each processor:
  – L1 caches
  – Registers
  – Functional units
• Each chip (shared):
  – L2 cache
  – L3 cache, shared across the node
  – Path to memory

[Chip diagram: two processors sharing the on-chip L2 cache.]


POWER5: Multi-processor Chip

• Each processor:
  – L1 caches
  – Registers
  – Functional units
• Each chip (shared):
  – L2 cache
  – L3 cache, NOT shared across the node
  – Path to memory
  – Memory controller

[Chip diagram: two processors sharing the on-chip L2 cache and memory controller.]


POWER4: Features

• Multi-processor chip
• High computation burst rate
• Three cache levels
  – Bandwidth
  – Latency hiding
• Shared memory
  – Large memory size


POWER5 Features

• Performance
  – Registers: additional rename registers
  – Cache improvements: L3 runs at 1/2 processor frequency (POWER4: 1/3 frequency)
  – Simultaneous Multi-Threading (SMT)
  – Memory bandwidth
• Virtualization
  – Partitioning
• Dynamic power management


POWER4, POWER5 Differences

Feature | POWER4 Design | POWER5 Design | Benefit
L1 cache | 2-way associative, FIFO | 4-way associative, LRU | Improved L1 cache performance
L2 cache | 8-way associative, 1.44 MB | 10-way associative, 1.9 MB | Fewer L2 cache misses, better performance
L3 cache | 8-way associative, 32 MB, 118 clock cycles | 12-way associative, 36 MB, ~80 clock cycles | Better cache performance
Memory bandwidth | 4 Gbyte/s per chip | ~16 Gbyte/s per chip | 4X improvement, faster memory access
Simultaneous Multi-Threading | No | Yes | Better processor utilization
Processor addressing | 1 processor | 1/10 of processor | Better usage of processor resources
Chip size | 412 mm² | 389 mm² | 50% more transistors in the same space


Modifications to POWER4 System Structure

[Diagram: POWER4 versus POWER5 system structure. POWER4: processor pair and L2 connect through the fabric controller (Fab Ctl) to the off-chip L3 and memory controller (Mem Ctl), then to memory. POWER5: the L3 attaches directly alongside the L2, and the memory controller is on the chip between the fabric controller and memory.]


POWER5 Challenges

• Scaling from 32-way to logical 128-way
  – Large system effects, 2 Tbyte of memory
• 1.9 MB L2 shared across 4 threads (SMT)
  – L2 miss rates
• 36 Mbyte private L3 versus 128 Mbyte shared L3
  – Will have to watch the multi-programming level
  – Read-only data will get replicated
• Local/remote memory latencies
  – Worst case remote memory latency is 60% higher than local


Effect of Features: Registers

• Additional rename registers
  – POWER4: 72 total registers
  – POWER5: 110 total registers
• Matrix multiply:
  – POWER4: 65% of burst rate
  – POWER5: 90% of burst rate
• Enhances performance of computationally intensive kernels
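For illustration only (not from the original slides): a minimal C sketch of the kind of register-blocked matrix-multiply inner kernel that approaches the burst rate when enough rename registers are available. The 4x4 blocking factor, row-major layout, and the assumption that n is a multiple of 4 are choices made for this example.

    /* Sketch: a 4x4 register-blocked matrix-multiply inner kernel.
     * Keeping 16 partial sums live exposes independent multiply-adds;
     * additional rename registers let the processor keep more of them
     * in flight at once.  Matrices are square (n x n), row-major, and
     * n is assumed to be a multiple of 4.
     */
    void matmul_4x4_blocked(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i += 4) {
            for (int j = 0; j < n; j += 4) {
                double s[4][4] = {{0.0}};          /* 16 accumulators */
                for (int k = 0; k < n; k++)
                    for (int ii = 0; ii < 4; ii++)
                        for (int jj = 0; jj < 4; jj++)
                            s[ii][jj] += a[(i + ii) * n + k] * b[k * n + (j + jj)];
                for (int ii = 0; ii < 4; ii++)
                    for (int jj = 0; jj < 4; jj++)
                        c[(i + ii) * n + (j + jj)] = s[ii][jj];
            }
        }
    }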


Effect of Rename Registers

[Bar chart: Mflop/s (0 to 7000) for Polynomial and MATMUL kernels on 1.5 GHz POWER4 and 1.65 GHz POWER5.]


Effect of Rename Registers

• "Tight" algorithms were previously register bound
• Increased floating point efficiency
• Examples
  – Matrix multiply (LINPACK)
  – Polynomial calculation (Horner's Rule)
  – FFTs

[Chart: Mflop/s, 0 to 7000.]
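As a hedged illustration of the polynomial case (the function below is added for this write-up and is not code from the slides): Horner's Rule evaluates a degree-n polynomial with one multiply-add per coefficient. Each step depends on the previous one, so the gains come from keeping several independent evaluations in flight, which is exactly what the additional rename registers allow.

    /* Sketch: Horner's Rule for p(x) = c[0] + c[1]*x + ... + c[n]*x^n.
     * Each iteration is a dependent multiply-add; evaluating several
     * points or polynomials at once gives the out-of-order core
     * independent work to rename and overlap.
     */
    double horner(const double *c, int n, double x)
    {
        double p = c[n];
        for (int i = n - 1; i >= 0; i--)
            p = p * x + c[i];
        return p;
    }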


Memory Bandwidth

[Chart: memory bandwidth in Mbyte/s (0 to 14000) versus vector length (0 to 400,000) for the kernel s = s + A(i), on a 1.65 GHz POWER5 and a 1.5 GHz POWER4.]
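The curve above is driven by a simple streaming reduction of the form s = s + A(i). A minimal C version of such a kernel is sketched below (the function name and harness are assumptions for illustration); each iteration performs one 8-byte load, so the sustained rate is roughly 8*n bytes divided by the elapsed time, and sweeping n through the cache capacities traces out the regimes shown on the next slides.

    #include <stddef.h>

    /* Sketch of an s = s + A(i) streaming kernel: one 8-byte load per
     * iteration.  Timing it for increasing n maps out the L1, L2 and
     * L3 bandwidth regimes.
     */
    double stream_sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }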


Memory Bandwidth: L1 – L2 Cache Regime

[Chart: the same bandwidth curves (0 to 14000 Mbyte/s, vector length 0 to 400,000) with the L1 and L2 cache regimes marked.]


Memory Bandwidth: L2 Cache Regime

[Chart: bandwidth curves (0 to 14000 Mbyte/s) over vector lengths 0 to 4,000,000, spanning the L2 cache regime.]


L2 and L3 Cache Improvement

• L3 cache enhances bandwidth
  – Previously it only hid latency
  – The new L3 cache acts like a "2.5" cache
• Larger blocking sizes

[Chart: bandwidth curves (0 to 14000 Mbyte/s) over vector lengths 0 to 4,000,000, with the L2 cache regime marked.]
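To make "larger blocking sizes" concrete, here is a hedged sketch (the transpose example and block size are assumptions, not taken from the slides) of a cache-blocked loop nest. The tile is sized so one block's working set stays within the 1.9 MB L2; because the POWER5 L3 now also delivers bandwidth rather than only hiding latency, larger tiles that spill into L3 can remain profitable.

    /* Sketch: cache-blocked matrix transpose.  One BLOCK x BLOCK source
     * tile plus one destination tile is 2 * BLOCK * BLOCK * 8 bytes
     * (1 Mbyte for BLOCK = 256), which fits in the 1.9 Mbyte L2.
     * The value 256 is illustrative, not a tuned choice.
     */
    #define BLOCK 256

    void transpose_blocked(int n, const double *a, double *b)
    {
        for (int i0 = 0; i0 < n; i0 += BLOCK)
            for (int j0 = 0; j0 < n; j0 += BLOCK)
                for (int i = i0; i < i0 + BLOCK && i < n; i++)
                    for (int j = j0; j < j0 + BLOCK && j < n; j++)
                        b[j * n + i] = a[i * n + j];
    }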


Simultaneous Multi-Threading

• Two threads per processor
  – Threads execute concurrently
• Threads share:
  – Caches
  – Registers
  – Functional units


POWER5 Simultaneous Multi-Threading (SMT)

[Diagram: functional-unit occupancy (FX0, FX1, LS0, LS1, FP0, FP1, BRX, CRL) over processor cycles for POWER4 (single threaded) and POWER5 (simultaneous multi-threaded); legend: thread 0 active, thread 1 active, no thread active.]

• Presents an SMP programming model to software
• Natural fit with the superscalar out-of-order execution core


Conventional Multi-Threading

[Diagram: functional units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) over time; in each time slice only thread 0 or thread 1 is active.]

• Threads alternate
  – Nothing shared


Simultaneous Multi-Threading

[Diagram: functional units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL); instructions from thread 0 and thread 1 occupy units in the same cycle.]

• Simultaneous execution
  – Shared registers
  – Shared functional units


Manipulating SMT

• Administrator function
• Dynamic
• System-wide

Function | Command
View | /usr/sbin/smtctl
Off | /usr/sbin/smtctl -m off now
On | /usr/sbin/smtctl -m on now


Dynamic Power Management

[Thermal-camera photos of a prototype POWER5 chip under test: single-thread and simultaneous multi-threading runs, each with and without dynamic power management.]


Multi-Chip Modules (MCMs)

• Four processor chips are mounted on a module
  – Unit of manufacture
  – Smallest freestanding system
• Multiple modules
  – Node
• Also available: Single Chip Modules
  – Lower cost
  – Lower bandwidth


POWER4 Multi-Chip Module (MCM)

• 4 chips
  – 8 processors
  – 2 processors/chip
• >35 Gbyte/s in each chip-to-chip connection
• Chip-to-chip busses run at a 2:1 ratio to the processor frequency
• 6 connections per MCM
• Logically shared L2s, L3s
• Data bus (memory, inter-module): MCM to MCM at a 3:1 ratio


POWER5 Multi-Chip Module (MCM)

• 95 mm × 95 mm
• Four POWER5 chips
• Four cache chips
• 4,491 signal I/Os
• 89 layers of metal


Multi Chip Module (MCM) Architecture

• POWER4:
  – 4 processor chips, 2 processors per chip
  – 8 off-module L3 chips
  – 4 memory control chips
  – L3 cache is controlled by the MCM and logically shared across the node
  – Total: 16 chips
• POWER5:
  – 4 processor chips, 2 processors per chip
  – 4 L3 cache chips
  – L3 cache is used by the processor pair, an "extension" of L2
  – Total: 8 chips

[Diagram: module layouts showing L3 chips and memory controllers.]


System Design with POWER4 and POWER5

• Multi Chip Module (MCM)
  – Multiple MCMs per system node
• Other issues:
  – L3 cache sharing
  – Memory controller


POWER4 Processor Chip

[Chip diagram: two cores with store caches share a 1.4 MB L2 cache; memory controller (MC), fabric bus controller (FBC), NCU and GX bus; off-chip 32 Mbyte L3 cache, a portion of the 128 Mbyte shared L3; fabric ports A, X, Y, Z connect to the rest of the node and to memory.]


POWER4 Multi-chip Module

[Module diagram: four POWER4 chips (CPUs 0 through 7, two per chip), each chip with a shared 1.5 MB L2 and L2/L3 directories; a shared 128 MB L3; memory on both sides; chip-to-chip communication buses within the MCM; GX buses and MCM-to-MCM links at the edges.]


POWER4 32-way Plane Topology

• Uni-directional links
• 2-beat address @ 850 MHz = 425M addresses/s
• 8-byte data @ 850 MHz (2:1) = 7 GB/s

[Diagram: four MCMs (MCM 0 through MCM 3) connected by S, T, U and V links.]


POWER5 Processor Chip

[Chip diagram: two cores, each running threads 0 and 1, with store caches sharing a 1.9 MB L2 cache; 36 MB L3 cache (victim cache); memory attached through the on-chip memory controller and GX bus (MC, GX); fabric bus controller (FBC) and NCU; fabric ports A, B, X, Y, Z.]


POWER5 Multi Chip Module

[Module diagram: POWER5 chips, each with two CPUs and a shared L2, alongside 36 MB L3 chips with L3 directories; memory controllers and memory on either side; chip-to-chip communication buses within the module; GX buses, MCM-to-MCM and book-to-book links at the edges; I/O via RIO-2, the High Performance Switch and InfiniBand.]


POWER5 64-way Plane Topology

• Vertical node links
  – Used to reduce latency and increase bandwidth for data transfers
  – Used as an alternate route for address operations during dynamic ring re-configuration (e.g. concurrent repair/upgrade)
• Uni-directional links
• 2-beat address @ 1 GHz = 500M addresses/s
• 8-byte data @ 1 GHz (2:1) = 8 GB/s

[Diagram: eight MCMs (MCM 0 through MCM 7) connected by S, T, U and V links, with vertical node links between the two stacks.]


Upcoming POWER5 Systems

• Early introductions are iSeries
  – Commercial
• Next:
  – Small nodes
  – 2, 4, 8 processors
  – SF, ML
• Next, next:
  – iH
  – H
  – Up to 64 processors


Planned POWER5 iH

• 2/4/6/8-way POWER5
• 2U rack chassis
• Rack: 24" x 43" deep


Auxiliary


Switch Technology

• Internal network
  – In lieu of, e.g., Gigabit Ethernet
• Multiple links per node
  – Match the number of links to the number of processors

Switch Generation | Processors
HPS Switch | POWER2
SP Switch | POWER2 → POWER3
SP Switch 2 ("Colony") | POWER3 → POWER4
HPS ("Federation") | POWER4 → POWER5


HPS Software

• MPI-LAPI (PE V4.1)
  – Uses LAPI as the reliable transport
  – Library uses threads, not signals, for async activities
• Existing applications binary compatible
• New performance characteristics
  – Improved collective communication


High Performance Switch (HPS)

• Also known as "Federation"
• Follow-on to SP Switch2
  – Also known as "Colony"
• Specifications:
  – 2 Gbyte/s (bidirectional)
  – 5 microsecond latency
• Configuration:
  – Four adaptors per node
  – 2 links per adaptor
  – 16 Gbyte/s per node


HPS Specifications

Switch | Latency [microsec.] | Bandwidth, single [Mbyte/s] | Bandwidth, multiple [Mbyte/s]
SP Switch 2 ("Colony") | 15 | 350 | 550
HPS ("Federation") | 5 | 1700 | 200
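Latency and bandwidth figures like these are conventionally measured with an MPI ping-pong between two tasks. The C sketch below is an illustration only (message size, iteration count and output format are assumptions, and it is not the benchmark behind the table): half the round-trip time of a small message approximates latency, and bytes divided by one-way time for a large message approximates single-message bandwidth.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: two-rank ping-pong.  Rank 0 sends to rank 1 and waits for
     * the echo; one-way time is half the averaged round trip.
     */
    int main(int argc, char **argv)
    {
        const int bytes = 1 << 20;      /* 1 Mbyte test message (assumed) */
        const int iters = 100;
        char *buf = malloc(bytes);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / iters / 2.0;   /* one-way time */

        if (rank == 0)
            printf("one-way time %g s, ~%g Mbyte/s\n", t, bytes / t / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }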


HPS Packaging

• 4U, 24-inch drawer
  – Fits in the bottom position of a p655 or p690 frame
• Switch-only frame (7040-W42 frame)
  – Up to eight HPS switches
• 16 ports for server-to-switch connections
• 16 ports for switch-to-switch connections
• Host attachment directly to the server bus via a 2-link or 4-link Switch Network Interface (SNI)
  – Up to two links per pSeries 655
  – Up to eight links per pSeries 690


HPS Scalability

• Supports up to 16 p690 or p655 servers and 32 links at GA
• Increased support for up to 64 servers
  – 32 can be pSeries 690 servers
  – 128 links planned for July, 2004
• Higher scalability limits available by special order


HPS Performance: Single and Multiple Messages

[Chart: HPS bandwidth in Mbyte/s (0 to 2500) versus message size in bytes (0 to 1,000,000), for single and multiple messages.]


HPS Performance: POWER5 Shared Memory MPI

[Chart: POWER5 shared-memory MPI bandwidth in Mbyte/s (0 to 4000) versus message size in bytes (0 to 1,000,000).]


Summary

• Processor technology:
  – POWER4 → POWER5
  – Multiprocessor chips
  – Enhanced caches
  – SMT
• Cluster systems
  – p655
  – p690
