POWER5, Federation and Linux: System Architecture
Charles Grassl, IBM
August 2004
© 2004 IBM
Agenda
Architecture:
–Processors
  –POWER4
  –POWER5
–System
  –Nodes
  –Systems
High Performance Switch
Processor Design
POWER4:
–Two processors per chip
–L1: 32 kbyte data, 64 kbyte instructions
–L2: 1.45 Mbyte per processor pair
–L3: 32 Mbyte per processor pair, shared by all processors
POWER5:
–Two processors per chip
–L1: 32 kbyte data, 64 kbyte instructions
–L2: 1.9 Mbyte per processor pair
–L3: 36 Mbyte per processor pair, shared by the processor pair, an extension of the L2 cache
–Simultaneous multithreading
–Enhanced memory loads and stores
–Additional rename registers
POWER4: Multi-processor Chip
Each processor has its own:
–L1 caches
–Registers
–Functional units
Each chip (shared):
–L2 cache
–L3 cache, shared across the node
–Path to memory
[Diagram: two processors on a chip sharing the L2 cache.]
POWER5: Multi-processor Chip
Each processor has its own:
–L1 caches
–Registers
–Functional units
Each chip (shared):
–L2 cache
–L3 cache, NOT shared across the node
–Path to memory
–Memory controller
[Diagram: two processors on a chip sharing the L2 cache and the on-chip memory controller.]
POWER4: Features
–Multi-processor chip
–High clock rate: high computation burst rate
–Three cache levels: bandwidth, latency hiding
–Shared memory: large memory size
POWER5 Features
Performance:
–Registers: additional rename registers
–Cache improvements: L3 runs at ½ the processor frequency (POWER4: ⅓ frequency)
–Simultaneous Multi-Threading (SMT)
–Memory bandwidth
Virtualization:
–Partitioning
Dynamic power management
POWER4, POWER5 Differences
–L1 cache: 2-way associative, FIFO replacement (POWER4) vs. 4-way associative, LRU (POWER5): improved L1 cache performance
–L2 cache: 8-way associative, 1.44 MB vs. 10-way associative, 1.9 MB: fewer L2 cache misses, better performance
–L3 cache: 8-way associative, 32 MB, 118 clock cycles vs. 12-way associative, 36 MB, ~80 clock cycles: better cache performance
–Memory bandwidth: 4 Gbyte/s per chip vs. ~16 Gbyte/s per chip: 4X improvement, faster memory access
–Simultaneous multi-threading: no vs. yes: better processor utilization
–Processor addressing: 1 processor vs. 1/10 of a processor: better usage of processor resources
–Size: 412 mm² vs. 389 mm²: 50% more transistors in the same space
Modifications to POWER4 System Structure
[Figure: side-by-side chip structures. POWER4: processors, L2, fabric controller, off-chip L3, memory controller, memory. POWER5: processors, L2, L3, fabric controller, on-chip memory controller, memory. The L3 moves to the processor side of the fabric controller and the memory controller moves onto the chip.]
POWER5 Challenges
–Scaling from 32-way to logical 128-way: large system effects, 2 Tbyte of memory
–1.9 MB L2 shared across 4 threads (SMT): L2 miss rates
–36 Mbyte private L3 versus 128 Mbyte shared L3: will have to watch the multi-programming level; read-only data will get replicated
–Local/remote memory latencies: worst-case remote memory latency is 60% higher than local
Effect of Features: Registers
Additional rename registers:
–POWER4: 72 total registers
–POWER5: 110 total registers
Matrix multiply:
–POWER4: 65% of burst rate
–POWER5: 90% of burst rate
Enhances the performance of computationally intensive kernels.
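Why the registers matter can be seen in a matrix-multiply inner kernel. The following is a minimal C sketch, not IBM's benchmark code: the 2x2 unrolling creates four independent accumulators, and register renaming maps them onto physical registers, which is why the larger POWER5 register file sustains a higher fraction of the burst rate. The dimensions (n assumed even) and the unroll factor are illustrative.

/* Unrolled matrix-multiply micro-kernel (illustrative sketch).
 * Four independent accumulators expose instruction-level parallelism;
 * hardware register renaming maps them onto physical registers, so
 * the deeper POWER5 register file (110 vs. 72) pays off.
 * Assumes n is even; row-major n x n matrices. */
void matmul_2x2(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += 2) {
        for (int j = 0; j < n; j += 2) {
            double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
            for (int k = 0; k < n; k++) {
                double a0 = A[i * n + k], a1 = A[(i + 1) * n + k];
                double b0 = B[k * n + j], b1 = B[k * n + j + 1];
                c00 += a0 * b0;   /* four independent multiply-adds */
                c01 += a0 * b1;
                c10 += a1 * b0;
                c11 += a1 * b1;
            }
            C[i * n + j]           = c00;
            C[i * n + j + 1]       = c01;
            C[(i + 1) * n + j]     = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
    }
}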
Effect of Rename Registers
[Bar chart: Mflop/s for Polynomial and MATMUL kernels on a 1.5 GHz POWER4 and a 1.65 GHz POWER5.]
Effect of Rename Registers
–"Tight" algorithms were previously register bound
–Increased floating point efficiency
–Examples:
  –Matrix multiply (LINPACK)
  –Polynomial calculation (Horner's rule, sketched below)
  –FFTs
[Bar chart: the same Mflop/s comparison as the previous slide.]
Memory Bandwidth
[Chart: sustained Mbyte/s for the sum reduction s = s + A(i) versus vector length (0 to 400,000 elements) on a 1.65 GHz POWER5 and a 1.5 GHz POWER4.]
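A curve like this can be reproduced with a simple microbenchmark; the sketch below is an assumption about the measurement's structure, not the original code. It times the s = s + A(i) reduction over growing vectors so the working set moves from L1 through L2 and L3 into memory. The clock() timer is coarse and the repetition count is illustrative, but the example is self-contained.

/* Sum-reduction bandwidth microbenchmark (illustrative sketch). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const int reps = 100;
    for (long n = 1000; n <= 400000; n *= 2) {
        double *a = malloc(n * sizeof *a);
        for (long i = 0; i < n; i++)
            a[i] = 1.0;

        double s = 0.0;
        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            for (long i = 0; i < n; i++)
                s += a[i];               /* the s = s + A(i) kernel */
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        double mbytes = (double)n * sizeof(double) * reps / 1.0e6;
        /* printing s keeps the compiler from eliminating the loop */
        printf("n=%8ld  %8.1f Mbyte/s  (s=%.0f)\n", n, mbytes / sec, s);
        free(a);
    }
    return 0;
}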
Memory Bandwidth: L1 – L2 Cache Regime
[Chart: the same bandwidth curves annotated with the L1 and L2 cache regimes (vector lengths to 400,000).]
Memory Bandwidth: L2 Cache Regime
[Chart: bandwidth versus vector length (to 4,000,000 elements) in the L2 cache regime.]
L2 and L3 Cache Improvement
L3 cache enhances bandwidth:
–Previously the L3 only hid latency
–The new L3 cache acts like a "2.5 cache"
–Allows larger blocking sizes (see the blocking sketch below)
[Chart: bandwidth versus vector length (to 4,000,000 elements) in the L2 cache regime.]
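A minimal sketch of cache blocking, using matrix transposition as the example; the kernel and the BLK value are illustrative, not taken from the presentation. Each BLK x BLK tile is processed while it is cache resident, and a higher-bandwidth L3 allows the tile to be sized against the 36 Mbyte L3 rather than only the 1.9 Mbyte L2.

/* Blocked (tiled) matrix transpose: each BLK x BLK tile stays
 * cache resident while it is reused. Tune BLK to the target
 * cache level; the value here is illustrative. */
#define BLK 256

void transpose_blocked(int n, const double *A, double *B)
{
    for (int ii = 0; ii < n; ii += BLK)
        for (int jj = 0; jj < n; jj += BLK)
            for (int i = ii; i < ii + BLK && i < n; i++)
                for (int j = jj; j < jj + BLK && j < n; j++)
                    B[j * (long)n + i] = A[i * (long)n + j];
}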
Simultaneous Multi-Threading
Two threads per processor:
–Threads execute concurrently
Threads share:
–Caches
–Registers
–Functional units
POWER5 Simultaneous Multi-Threading (SMT)
[Figure: processor-cycle diagrams comparing POWER4 (single threaded) with POWER5 (simultaneous multi-threaded) occupancy of the FX0, FX1, LS0, LS1, FP0, FP1, BRX, and CRL units. Legend: thread 0 active, thread 1 active, no thread active.]
–Presents an SMP programming model to software
–A natural fit with the superscalar out-of-order execution core
Conventional Multi-Threading
[Figure: functional unit occupancy (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) over time; thread 0 and thread 1 alternate.]
Threads alternate:
–Nothing shared
Simultaneous Multi-Threading
[Figure: thread 0 and thread 1 occupy the FX0, FX1, FP0, FP1, LS0, LS1, BRX, and CRL units within the same cycle.]
Simultaneous execution:
–Shared registers
–Shared functional units
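To make the sharing concrete, here is a minimal pthreads sketch, an illustration rather than anything from the presentation: one thread exercises the fixed-point units while the other exercises the floating-point units, the complementary mix that lets SMT fill otherwise idle issue slots. Thread placement is left to the operating system, and the workloads are illustrative.

/* Two threads with complementary functional-unit demands
 * (illustrative sketch; compile with -lpthread). */
#include <pthread.h>
#include <stdio.h>

static void *int_work(void *arg)      /* integer-heavy: FX units */
{
    unsigned long x = 1;
    for (long i = 0; i < 100000000L; i++)
        x = x * 1664525UL + 1013904223UL;   /* integer recurrence */
    *(unsigned long *)arg = x;
    return NULL;
}

static void *fp_work(void *arg)       /* floating-point-heavy: FP units */
{
    double s = 0.0;
    for (long i = 1; i <= 100000000L; i++)
        s += 1.0 / (double)i;               /* harmonic sum */
    *(double *)arg = s;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    unsigned long xr = 0;
    double fr = 0.0;

    pthread_create(&t1, NULL, int_work, &xr);
    pthread_create(&t2, NULL, fp_work, &fr);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("integer result %lu, harmonic sum %f\n", xr, fr);
    return 0;
}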
Manipulating SMT
–Administrator function
–Dynamic
–System-wide

Function   Command
View       /usr/sbin/smtctl
Off        /usr/sbin/smtctl -m off now
On         /usr/sbin/smtctl -m on now
Dynamic Power Management
[Thermal images of a prototype POWER5 chip under test, taken with a thermal-sensitive camera: single-thread and simultaneous multi-threading operation, each with dynamic power management and with no power management.]
Multi-Chip Modules (MCMs)
Four processor chips are mounted on a module:
–Unit of manufacture
–Smallest freestanding system
Multiple modules form a node.
Also available: Single Chip Modules:
–Lower cost
–Lower bandwidth
POWER4 Multi-Chip Module (MCM)
–4 chips: 8 processors, 2 processors per chip
–>35 Gbyte/s in each chip-to-chip connection
–Chip-to-chip busses run at half the processor clock (2:1)
–6 connections per MCM
–Logically shared L2s and L3s
–Data bus (memory, inter-module) runs at one third the processor clock (3:1), MCM to MCM
POWER5 Multi-Chip Module (MCM)
–95 mm × 95 mm
–Four POWER5 chips
–Four cache chips
–4,491 signal I/Os
–89 layers of metal
Multi Chip Module (MCM) Architecture
POWER4:
–4 processor chips, 2 processors per chip
–8 off-module L3 chips: the L3 cache is controlled by the MCM and logically shared across the node
–4 memory control chips
–16 chips in total
POWER5:
–4 processor chips, 2 processors per chip
–4 L3 cache chips: the L3 cache is used by a processor pair as an "extension" of the L2
–Memory controllers are on the processor chips
–8 chips in total
System Design with POWER4 and with POWER5
Multi Chip Module (MCM):
–Multiple MCMs per system node
Other issues:
–L3 cache sharing
–Memory controller
POWER4 Processor Chip
[Diagram: two cores with store caches share a 1.4 MB L2 cache; the fabric controller (FBC) links the L2, the NCU, the GX bus, and a 32 Mbyte L3 cache (a portion of the node's 128 Mbyte logically shared L3); the memory controller (MC) connects the L3 to memory; A, X, Y, and Z busses connect to other chips.]
POWER4 Multi-chip Module
[Diagram: four POWER4 chips carrying CPUs 0 to 7, each chip with a 1.5 MB shared L2 and L2/L3 directories; chip-to-chip communication busses connect the four chips, which share a 128 MB L3; memory ports on either side; GX busses and MCM-to-MCM links at the corners.]
15 POWER4 32-way Plane Topology
S T MCM 3 V
U Uni-directional links S T MCM2 2-beat Addr @ 850Mhz = V 425M A/s U 8B Data @ 850Mhz (2:1) S T MCM 1 = 7 GB/s V
U S T MCM 0 V
U
31 © 2004 IBM Corporation
POWER5 Processor Chip
[Diagram: two cores, each running threads 0 and 1, with store caches share a 1.9 MB L2 cache; the 36 MB L3 is a victim cache of the L2; the fabric controller (FBC) links the L2, the NCU, and the combined memory controller and GX unit (MC, GX) to memory; A, B, X, Y, and Z busses connect to other chips.]
POWER5 Multi Chip Module
[Diagram: four POWER5 chips carrying CPUs 0 to 15, each chip with a shared L2, a 36 MB L3 with its directory, and its own memory controller and memory port; chip-to-chip communication busses connect the four chips; GX busses at the corners carry MCM-to-MCM and book-to-book links, RIO-2, the High Performance Switch, and InfiniBand.]
POWER5 64-way Plane Topology
[Diagram: eight MCMs (MCM 0 to MCM 7) connected by S, T, U, and V links into two rings joined by vertical node links.]
Vertical node links:
–Reduce latency and increase bandwidth for data transfers
–Serve as an alternate route for address operations during dynamic ring re-configuration (e.g. concurrent repair/upgrade)
Uni-directional links:
–2-beat address @ 1 GHz = 500M addresses/s
–8-byte data @ 1 GHz (2:1) = 8 Gbyte/s
Upcoming POWER5 Systems
Early introductions are iSeries:
–Commercial
Next:
–Small nodes: 2, 4, 8 processors
–SF, ML
Next, next:
–iH, H
–Up to 64 processors
Planned POWER5 iH
–2/4/6/8-way POWER5
–2U rack chassis
–Rack: 24" wide × 43" deep
Auxiliary
Switch Technology
Internal network:
–In lieu of, e.g., Gigabit Ethernet
Multiple links per node:
–Match the number of links to the number of processors

Generation             Processors
HPS Switch             POWER2
SP Switch              POWER2 → POWER3
SP Switch 2 (Colony)   POWER3 → POWER4
HPS (Federation)       POWER4 → POWER5
HPS Software
MPI-LAPI (PE V4.1):
–Uses LAPI as the reliable transport
–The library uses threads, not signals, for asynchronous activities
Existing applications are binary compatible.
New performance characteristics:
–Improved collective communication (illustrated below)
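For illustration, a minimal MPI program whose communication is a collective operation, the class the new library improves. Only standard MPI calls are used; nothing here is specific to PE V4.1.

/* Collective-communication example: global sum via MPI_Allreduce. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks 0..%d = %.0f\n", size - 1, global);

    MPI_Finalize();
    return 0;
}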
High Performance Switch (HPS)
Also known as "Federation".
Follow-on to the SP Switch2 (also known as "Colony").
Specifications:
–2 Gbyte/s (bidirectional)
–5 microsecond latency
Configuration:
–Four adaptors per node
–2 links per adaptor
–16 Gbyte/s per node
HPS Specifications
                         Latency        Bandwidth, single   Bandwidth, multiple
                         [microsec.]    [Mbyte/s]           [Mbyte/s]
SP Switch 2 ("Colony")   15             350                 550
HPS ("Federation")       5              1700                2000
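Latency and single-message bandwidth figures such as these are typically measured with a ping-pong test between two tasks on different nodes. A minimal MPI sketch, with illustrative message sizes and repetition counts; the halved round-trip time is reported as the one-way figure.

/* Ping-pong latency/bandwidth sketch: run with at least 2 MPI tasks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 1000;
    for (int bytes = 8; bytes <= 1 << 20; bytes *= 2) {
        char *buf = calloc(bytes, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {           /* send, then wait for the echo */
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {    /* echo back to rank 0 */
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double oneway = (MPI_Wtime() - t0) / (2.0 * reps);

        if (rank == 0)
            printf("%8d bytes: %7.2f microsec, %8.1f Mbyte/s\n",
                   bytes, oneway * 1.0e6, bytes / oneway / 1.0e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}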
HPS Packaging
4U, 24-inch drawer:
–Fits in the bottom position of a p655 or p690 frame
Switch-only frame (7040-W42):
–Up to eight HPS switches
–16 ports for server-to-switch connections
–16 ports for switch-to-switch connections
Host attachment directly to the server bus via a 2-link or 4-link Switch Network Interface (SNI):
–Up to two links per pSeries 655
–Up to eight links per pSeries 690
HPS Scalability
Supports up to 16 p690 or p655 servers and 32 links at GA.
Increased support for up to 64 servers:
–32 can be pSeries 690 servers
–128 links planned for July, 2004
Higher scalability limits are available by special order.
HPS Performance: Single and Multiple Messages
[Chart: HPS bandwidth in Mbyte/s versus message size (0 to 1,000,000 bytes) for single and multiple messages.]
HPS Performance: POWER5 Shared Memory MPI
[Chart: POWER5 shared-memory MPI bandwidth in Mbyte/s versus message size (0 to 1,000,000 bytes).]
Summary
Processor technology:
–POWER4 → POWER5
–Multiprocessor chips
–Enhanced caches
–SMT
Cluster systems:
–p655
–p690