POWER5, Federation and Linux: System Architecture

Charles Grassl, IBM, August 2004

© 2004 IBM

Agenda

• Architecture
  – Processors: POWER4, POWER5
  – System: nodes, systems
• High Performance Switch


Processor Design

POWER4
• Two processors per chip
• L1: 32 kbyte data, 64 kbyte instructions
• L2: 1.45 Mbyte per processor pair
• L3: 32 Mbyte per processor pair, shared by all processors

POWER5
• Two processors per chip
• L1: 32 kbyte data, 64 kbyte instructions
• L2: 1.9 Mbyte per processor pair
• L3: 36 Mbyte per processor pair, shared by the processor pair, an extension of the L2 cache
• Simultaneous multithreading
• Enhanced memory loads and stores
• Additional rename registers


POWER4: Multi-processor Chip

• Each processor:
  – L1 caches
  – Registers
  – Functional units
• Each chip (shared):
  – L2 cache
  – L3 cache, shared across the node
  – Path to memory

[Chip diagram: two processors sharing the on-chip L2 cache.]


POWER5: Multi-processor Chip

• Each processor:
  – L1 caches
  – Registers
  – Functional units
• Each chip (shared):
  – L2 cache
  – L3 cache, NOT shared across the node
  – Path to memory
  – Memory controller

[Chip diagram: two processors sharing the on-chip L2 cache and memory controller.]


POWER4: Features

• Multi-processor chip
• High computation burst rate
• Three cache levels
  – Bandwidth
  – Latency hiding
• Shared memory
  – Large memory size


POWER5 Features

• Performance
  – Registers: additional rename registers
  – Cache improvements: L3 runs at 1/2 processor frequency (POWER4: 1/3 frequency)
  – Simultaneous Multi-Threading (SMT)
  – Memory bandwidth
• Virtualization
  – Partitioning
• Dynamic power management


POWER4, POWER5 Differences

Feature | POWER4 Design | POWER5 Design | Benefit
L1 cache | 2-way associative, FIFO | 4-way associative, LRU | Improved L1 cache performance
L2 cache | 8-way associative, 1.44 MB | 10-way associative, 1.9 MB | Fewer L2 cache misses, better performance
L3 cache | 8-way associative, 32 MB, 118 clock cycles | 12-way associative, 36 MB, ~80 clock cycles | Better cache performance
Memory bandwidth | 4 Gbyte/s per chip | ~16 Gbyte/s per chip | 4X improvement, faster memory access
Simultaneous Multi-Threading | No | Yes | Better processor utilization
Processor addressing | 1 processor | 1/10 of processor | Better usage of processor resources
Chip size | 412 mm² | 389 mm² | 50% more transistors in the same space


Modifications to POWER4 System Structure

[Diagram: POWER4 versus POWER5 system structure. POWER4: processor pair and L2 connect through the fabric controller (Fab Ctl) to the off-chip L3 and memory controller (Mem Ctl), then to memory. POWER5: the L3 attaches directly alongside the L2, and the memory controller is on the chip between the fabric controller and memory.]


POWER5 Challenges

• Scaling from 32-way to logical 128-way
  – Large system effects, 2 Tbyte of memory
• 1.9 MB L2 shared across 4 threads (SMT)
  – L2 miss rates
• 36 Mbyte private L3 versus 128 Mbyte shared L3
  – Will have to watch the multi-programming level
  – Read-only data will get replicated
• Local/remote memory latencies
  – Worst case remote memory latency is 60% higher than local


Effect of Features: Registers

• Additional rename registers
  – POWER4: 72 total registers
  – POWER5: 110 total registers
• Matrix multiply:
  – POWER4: 65% of burst rate
  – POWER5: 90% of burst rate
• Enhances performance of computationally intensive kernels
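For illustration only (not from the original slides): a minimal C sketch of the kind of register-blocked matrix-multiply inner kernel that approaches the burst rate when enough rename registers are available. The 4x4 blocking factor, row-major layout, and the assumption that n is a multiple of 4 are choices made for this example.

    /* Sketch: a 4x4 register-blocked matrix-multiply inner kernel.
     * Keeping 16 partial sums live exposes independent multiply-adds;
     * additional rename registers let the processor keep more of them
     * in flight at once.  Matrices are square (n x n), row-major, and
     * n is assumed to be a multiple of 4.
     */
    void matmul_4x4_blocked(int n, const double *a, const double *b, double *c)
    {
        for (int i = 0; i < n; i += 4) {
            for (int j = 0; j < n; j += 4) {
                double s[4][4] = {{0.0}};          /* 16 accumulators */
                for (int k = 0; k < n; k++)
                    for (int ii = 0; ii < 4; ii++)
                        for (int jj = 0; jj < 4; jj++)
                            s[ii][jj] += a[(i + ii) * n + k] * b[k * n + (j + jj)];
                for (int ii = 0; ii < 4; ii++)
                    for (int jj = 0; jj < 4; jj++)
                        c[(i + ii) * n + (j + jj)] = s[ii][jj];
            }
        }
    }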


Effect of Rename Registers

[Bar chart: Mflop/s (0 to 7000) for Polynomial and MATMUL kernels on 1.5 GHz POWER4 and 1.65 GHz POWER5.]


Effect of Rename Registers

• "Tight" algorithms were previously register bound
• Increased floating point efficiency
• Examples
  – Matrix multiply (LINPACK)
  – Polynomial calculation (Horner's Rule)
  – FFTs

[Chart: Mflop/s, 0 to 7000.]
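As a hedged illustration of the polynomial case (the function below is added for this write-up and is not code from the slides): Horner's Rule evaluates a degree-n polynomial with one multiply-add per coefficient. Each step depends on the previous one, so the gains come from keeping several independent evaluations in flight, which is exactly what the additional rename registers allow.

    /* Sketch: Horner's Rule for p(x) = c[0] + c[1]*x + ... + c[n]*x^n.
     * Each iteration is a dependent multiply-add; evaluating several
     * points or polynomials at once gives the out-of-order core
     * independent work to rename and overlap.
     */
    double horner(const double *c, int n, double x)
    {
        double p = c[n];
        for (int i = n - 1; i >= 0; i--)
            p = p * x + c[i];
        return p;
    }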


Memory Bandwidth

[Chart: memory bandwidth in Mbyte/s (0 to 14000) versus vector length (0 to 400,000) for the kernel s = s + A(i), on a 1.65 GHz POWER5 and a 1.5 GHz POWER4.]
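The curve above is driven by a simple streaming reduction of the form s = s + A(i). A minimal C version of such a kernel is sketched below (the function name and harness are assumptions for illustration); each iteration performs one 8-byte load, so the sustained rate is roughly 8*n bytes divided by the elapsed time, and sweeping n through the cache capacities traces out the regimes shown on the next slides.

    #include <stddef.h>

    /* Sketch of an s = s + A(i) streaming kernel: one 8-byte load per
     * iteration.  Timing it for increasing n maps out the L1, L2 and
     * L3 bandwidth regimes.
     */
    double stream_sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }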


Memory Bandwidth: L1 – L2 Cache Regime

[Chart: the same bandwidth curves (0 to 14000 Mbyte/s, vector length 0 to 400,000) with the L1 and L2 cache regimes marked.]


Memory Bandwidth: L2 Cache Regime

[Chart: bandwidth curves (0 to 14000 Mbyte/s) over vector lengths 0 to 4,000,000, spanning the L2 cache regime.]


L2 and L3 Cache Improvement

• L3 cache enhances bandwidth
  – Previously it only hid latency
  – The new L3 cache acts like a "2.5" cache
• Larger blocking sizes

[Chart: bandwidth curves (0 to 14000 Mbyte/s) over vector lengths 0 to 4,000,000, with the L2 cache regime marked.]
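To make "larger blocking sizes" concrete, here is a hedged sketch (the transpose example and block size are assumptions, not taken from the slides) of a cache-blocked loop nest. The tile is sized so one block's working set stays within the 1.9 MB L2; because the POWER5 L3 now also delivers bandwidth rather than only hiding latency, larger tiles that spill into L3 can remain profitable.

    /* Sketch: cache-blocked matrix transpose.  One BLOCK x BLOCK source
     * tile plus one destination tile is 2 * BLOCK * BLOCK * 8 bytes
     * (1 Mbyte for BLOCK = 256), which fits in the 1.9 Mbyte L2.
     * The value 256 is illustrative, not a tuned choice.
     */
    #define BLOCK 256

    void transpose_blocked(int n, const double *a, double *b)
    {
        for (int i0 = 0; i0 < n; i0 += BLOCK)
            for (int j0 = 0; j0 < n; j0 += BLOCK)
                for (int i = i0; i < i0 + BLOCK && i < n; i++)
                    for (int j = j0; j < j0 + BLOCK && j < n; j++)
                        b[j * n + i] = a[i * n + j];
    }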


Simultaneous Multi-Threading

• Two threads per processor
  – Threads execute concurrently
• Threads share:
  – Caches
  – Registers
  – Functional units


POWER5 Simultaneous Multi-Threading (SMT)

[Diagram: functional-unit occupancy (FX0, FX1, LS0, LS1, FP0, FP1, BRX, CRL) over processor cycles for POWER4 (single threaded) and POWER5 (simultaneous multi-threaded); legend: thread 0 active, thread 1 active, no thread active.]

• Presents an SMP programming model to software
• Natural fit with the superscalar out-of-order execution core


Conventional Multi-Threading

[Diagram: functional units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL) over time; in each time slice only thread 0 or thread 1 is active.]

• Threads alternate
  – Nothing shared


Simultaneous Multi-Threading

[Diagram: functional units (FX0, FX1, FP0, FP1, LS0, LS1, BRX, CRL); instructions from thread 0 and thread 1 occupy units in the same cycle.]

• Simultaneous execution
  – Shared registers
  – Shared functional units


Manipulating SMT

• Administrator function
• Dynamic
• System-wide

Function | Command
View | /usr/sbin/smtctl
Off | /usr/sbin/smtctl -m off now
On | /usr/sbin/smtctl -m on now


Dynamic Power Management

[Thermal-camera photos of a prototype POWER5 chip under test: single-thread and simultaneous multi-threading runs, each with and without dynamic power management.]


Multi-Chip Modules (MCMs)

• Four processor chips are mounted on a module
  – Unit of manufacture
  – Smallest freestanding system
• Multiple modules
  – Node
• Also available: Single Chip Modules
  – Lower cost
  – Lower bandwidth


POWER4 Multi-Chip Module (MCM)

• 4 chips
  – 8 processors
  – 2 processors/chip
• >35 Gbyte/s in each chip-to-chip connection
• Chip-to-chip busses run at a 2:1 ratio to the processor frequency
• 6 connections per MCM
• Logically shared L2s, L3s
• Data bus (memory, inter-module): MCM to MCM at a 3:1 ratio


POWER5 Multi-Chip Module (MCM)

• 95 mm × 95 mm
• Four POWER5 chips
• Four cache chips
• 4,491 signal I/Os
• 89 layers of metal


Multi Chip Module (MCM) Architecture

• POWER4:
  – 4 processor chips, 2 processors per chip
  – 8 off-module L3 chips
  – 4 memory control chips
  – L3 cache is controlled by the MCM and logically shared across the node
  – Total: 16 chips
• POWER5:
  – 4 processor chips, 2 processors per chip
  – 4 L3 cache chips
  – L3 cache is used by the processor pair, an "extension" of L2
  – Total: 8 chips

[Diagram: module layouts showing L3 chips and memory controllers.]


System Design with POWER4 and POWER5

• Multi Chip Module (MCM)
  – Multiple MCMs per system node
• Other issues:
  – L3 cache sharing
  – Memory controller


POWER4 Processor Chip

[Chip diagram: two cores with store caches share a 1.4 MB L2 cache; memory controller (MC), fabric bus controller (FBC), NCU and GX bus; off-chip 32 Mbyte L3 cache, a portion of the 128 Mbyte shared L3; fabric ports A, X, Y, Z connect to the rest of the node and to memory.]


POWER4 Multi-chip Module

[Module diagram: four POWER4 chips (CPUs 0 through 7, two per chip), each chip with a shared 1.5 MB L2 and L2/L3 directories; a shared 128 MB L3; memory on both sides; chip-to-chip communication buses within the MCM; GX buses and MCM-to-MCM links at the edges.]


POWER4 32-way Plane Topology

• Uni-directional links
• 2-beat address @ 850 MHz = 425M addresses/s
• 8-byte data @ 850 MHz (2:1) = 7 GB/s

[Diagram: four MCMs (MCM 0 through MCM 3) connected by S, T, U and V links.]


POWER5 Processor Chip

[Chip diagram: two cores, each running threads 0 and 1, with store caches sharing a 1.9 MB L2 cache; 36 MB L3 cache (victim cache); memory attached through the on-chip memory controller and GX bus (MC, GX); fabric bus controller (FBC) and NCU; fabric ports A, B, X, Y, Z.]


POWER5 Multi Chip Module

[Module diagram: POWER5 chips, each with two CPUs and a shared L2, alongside 36 MB L3 chips with L3 directories; memory controllers and memory on either side; chip-to-chip communication buses within the module; GX buses, MCM-to-MCM and book-to-book links at the edges; I/O via RIO-2, the High Performance Switch and InfiniBand.]


POWER5 64-way Plane Topology

• Vertical node links
  – Used to reduce latency and increase bandwidth for data transfers
  – Used as an alternate route for address operations during dynamic ring re-configuration (e.g. concurrent repair/upgrade)
• Uni-directional links
• 2-beat address @ 1 GHz = 500M addresses/s
• 8-byte data @ 1 GHz (2:1) = 8 GB/s

[Diagram: eight MCMs (MCM 0 through MCM 7) connected by S, T, U and V links, with vertical node links between the two stacks.]


Upcoming POWER5 Systems

• Early introductions are iSeries
  – Commercial
• Next:
  – Small nodes
  – 2, 4, 8 processors
  – SF, ML
• Next, next:
  – iH
  – H
  – Up to 64 processors


Planned POWER5 iH

• 2/4/6/8-way POWER5
• 2U rack chassis
• Rack: 24" x 43" deep


Auxiliary


Switch Technology

• Internal network
  – In lieu of, e.g., Gigabit Ethernet
• Multiple links per node
  – Match the number of links to the number of processors

Switch Generation | Processors
HPS Switch | POWER2
SP Switch | POWER2 → POWER3
SP Switch 2 ("Colony") | POWER3 → POWER4
HPS ("Federation") | POWER4 → POWER5


HPS Software

• MPI-LAPI (PE V4.1)
  – Uses LAPI as the reliable transport
  – Library uses threads, not signals, for async activities
• Existing applications binary compatible
• New performance characteristics
  – Improved collective communication


High Performance Switch (HPS)

• Also known as "Federation"
• Follow-on to SP Switch2
  – Also known as "Colony"
• Specifications:
  – 2 Gbyte/s (bidirectional)
  – 5 microsecond latency
• Configuration:
  – Four adaptors per node
  – 2 links per adaptor
  – 16 Gbyte/s per node


HPS Specifications

Switch | Latency [microsec.] | Bandwidth, single [Mbyte/s] | Bandwidth, multiple [Mbyte/s]
SP Switch 2 ("Colony") | 15 | 350 | 550
HPS ("Federation") | 5 | 1700 | 200
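Latency and bandwidth figures like these are conventionally measured with an MPI ping-pong between two tasks. The C sketch below is an illustration only (message size, iteration count and output format are assumptions, and it is not the benchmark behind the table): half the round-trip time of a small message approximates latency, and bytes divided by one-way time for a large message approximates single-message bandwidth.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sketch: two-rank ping-pong.  Rank 0 sends to rank 1 and waits for
     * the echo; one-way time is half the averaged round trip.
     */
    int main(int argc, char **argv)
    {
        const int bytes = 1 << 20;      /* 1 Mbyte test message (assumed) */
        const int iters = 100;
        char *buf = malloc(bytes);
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / iters / 2.0;   /* one-way time */

        if (rank == 0)
            printf("one-way time %g s, ~%g Mbyte/s\n", t, bytes / t / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }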


HPS Packaging

• 4U, 24-inch drawer
  – Fits in the bottom position of a p655 or p690 frame
• Switch-only frame (7040-W42 frame)
  – Up to eight HPS switches
• 16 ports for server-to-switch connections
• 16 ports for switch-to-switch connections
• Host attachment directly to the server bus via a 2-link or 4-link Switch Network Interface (SNI)
  – Up to two links per pSeries 655
  – Up to eight links per pSeries 690


HPS Scalability

• Supports up to 16 p690 or p655 servers and 32 links at GA
• Increased support for up to 64 servers
  – 32 can be pSeries 690 servers
  – 128 links planned for July, 2004
• Higher scalability limits available by special order


HPS Performance: Single and Multiple Messages

[Chart: HPS bandwidth in Mbyte/s (0 to 2500) versus message size in bytes (0 to 1,000,000), for single and multiple messages.]


HPS Performance: POWER5 Shared Memory MPI

[Chart: POWER5 shared-memory MPI bandwidth in Mbyte/s (0 to 4000) versus message size in bytes (0 to 1,000,000).]


Summary

• Processor technology:
  – POWER4 → POWER5
  – Multiprocessor chips
  – Enhanced caches
  – SMT
• Cluster systems
  – p655
  – p690
