IBM (STI) Processor

Scientific Computing for Engineers: Spring 2007

Three Performance-Limiting Walls

• Power Wall

Increasingly, performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.

• Frequency Wall

Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.

• Memory Wall

On multi-gigahertz symmetric multiprocessors – even those with integrated memory controllers – latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.


The Memory Wall

"When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available."

H. "Cell Broadband Engine Architecture from 20,000 feet" http://www-128.ibm.com/developerworks/power/library/pa-cbea.html

"Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The stream benchmark performance of Intel's new Woodcrest dual core processor illustrates this point. [...] Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GBs/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem."

Richard B. Walsh, "New Processor Options for HPC", http://www.hpcwire.com/hpc/906849.html


CELL Overview

• $400 million over 5 years
• Sony/Toshiba/IBM alliance known as STI
• STI Design Center – Austin, Texas – March 2001
• Mercury Computer Systems Inc. – dual CELL blades
• Cell Broadband Engine Architecture / CBEA / CELL BE

• Playstation3, dual CELL blades, PCI accelerator cards
• 3.2 GHz, 90 nm SOI
• 234 million transistors, compared to:
  – 165 million –
  – 220 million – Itanium 2 (2002)
  – 1,700 million – Dual-Core Itanium 2 (2006)


CELL Architecture

• PPU – PowerPC 970 core
• SPE – Synergistic Processing Element
• SPU – Synergistic Processing Unit
• LS – Local Store
• MFC – Memory Flow Controller
• EIB – Element Interconnection Bus
• MIC – Memory Interface Controller


Power Processing Element

• Power Processing Element (PPE)
• PowerPC 970 architecture compliant
• 2-way Symmetric Multithreading (SMT)
• 32 KB Level 1 instruction cache
• 32 KB Level 1 data cache
• 512 KB Level 2 cache
• VMX (AltiVec) with 32 128-bit vector registers

• standard FPU
  – fully pipelined DP with FMA
  – 6.4 Gflop/s DP at 3.2 GHz

• AltiVec
  – no DP
  – 4-way fully pipelined SP with FMA
  – 25.6 Gflop/s SP at 3.2 GHz


Synergistic Processing Elements

• Synergistic Processing Elements (SPEs)
• 128-bit SIMD
• 128 vector registers
• 256 KB local memory (Local Store) for instructions and data
• Memory Flow Controller (MFC)

• SIMD widths:
  – 16-way (8-bit integers)
  – 8-way (16-bit integers)
  – 4-way (32-bit integers, single precision FP)
  – 2-way (64-bit double precision FP)

• 25.6 Gflop/s SP at 3.2 GHz (fully pipelined)
• 1.8 Gflop/s DP at 3.2 GHz (7-cycle latency)


SPE Architecture

• Dual-issue (in-order) pipeline
• Even pipeline – arithmetic:
  – integer
  – floating point
• Odd pipeline – data motion:
  – permutations
  – local store
  – branches
  – channel


Element Interconnection Bus

• Element Interconnection Bus (EIB)
• four 16-byte-wide unidirectional channels
• runs at half the system clock (1.6 GHz)
• 204.8 GB/s bandwidth (subject to arbitration)



Main Memory System

• Memory Interface Controller (MIC)
• external dual XDR
• 3.2 GHz max effective frequency (max 400 MHz, Octal Data Rate)
• each channel: 8 banks → max 256 MB
• total: 16 banks → max 512 MB
• 25.6 GB/s


CELL Performance – Double Precision

In double precision:
• every seven cycles each SPE can:
  – process a two-element vector,
  – perform two operations on each element.

• in one cycle the FPU on the PPE can:
  – process one element,
  – perform two operations on the element.

SPEs: 8 x 2 x 2 x 3.2 GHz / 7 = 14.63 Gflop/s
PPE: 2 x 3.2 GHz = 6.4 Gflop/s

Total: 14.63 + 6.4 = 21.03 Gflop/s


CELL Performance – Single Precision

In single precision:
• in one cycle each SPE can:
  – process a four-element vector,
  – perform two operations on each element.

• in one cycle the VMX on the PPE can:
  – process a four-element vector,
  – perform two operations on each element.

SPEs: 8 x 4 x 2 x 3.2 GHz = 204.8 Gflop/s
PPE (VMX): 4 x 2 x 3.2 GHz = 25.6 Gflop/s

Total: 204.8 + 25.6 = 230.4 Gflop/s
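The arithmetic on the two preceding slides can be cross-checked mechanically. The short C program below is only a sanity check: every constant in it (clock rate, SIMD widths, the factor of two for FMA, and the SPE's one double-precision issue per seven cycles) comes from the slides above.

/* Quick cross-check of the peak numbers quoted on the two slides above. */
#include <stdio.h>

int main(void)
{
    const double clock_ghz = 3.2;
    const int    n_spes    = 8;

    /* SPE: 2-way DP SIMD, 2 flops (FMA), one issue every 7 cycles */
    double spe_dp = n_spes * 2 * 2 * clock_ghz / 7.0;
    /* PPE FPU: 1 element, 2 flops (FMA), every cycle */
    double ppe_dp = 2 * clock_ghz;
    /* SPE: 4-way SP SIMD, 2 flops (FMA), every cycle */
    double spe_sp = n_spes * 4 * 2 * clock_ghz;
    /* PPE VMX: 4-way SP SIMD, 2 flops (FMA), every cycle */
    double ppe_sp = 4 * 2 * clock_ghz;

    printf("DP: %5.2f + %4.2f = %5.2f Gflop/s\n", spe_dp, ppe_dp, spe_dp + ppe_dp);
    printf("SP: %5.1f + %4.1f = %5.1f Gflop/s\n", spe_sp, ppe_sp, spe_sp + ppe_sp);
    return 0;
}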


CELL Performance – Bandwidth

Bandwidth at the 3.2 GHz clock:
• each SPU – 25.6 GB/s (compare to 25.6 Gflop/s per SPU)
• main memory – 25.6 GB/s
• EIB – 204.8 GB/s (compare to 204.8 Gflop/s for 8 SPUs)


Performance Comparison – Double Precision

1.6 GHz Dual-Core Itanium 2
• 1.6 x 4 x 2 = 12.8 Gflop/s

3.2 GHz CELL BE (SPEs only)
• 3.2 x 8 x 4/7 = 14.6 Gflop/s


Performance Comparison – Single Precision

1.6 GHz Dual-Core Itanium 2
• 1.6 x 4 x 2 = 12.8 Gflop/s

3.2 GHz SPE
• 3.2 x 8 = 25.6 Gflop/s
• one SPE = 2 dual-core Itanium 2 processors

3.2 GHz CELL BE (SPEs only)
• 3.2 x 8 x 8 = 204.8 Gflop/s
• one CBE = 16 dual-core Itanium 2 processors


CELL Programming Basics

• Programming the SPUs
  – SIMD'ization (vectorization)

• Communication
  – DMAs
  – mailboxes

• Measuring performance
  – SPU decrementer


SPE Register File


SPE SIMD Arithmetic


SPE SIMD Data Motion
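The permute unit on the odd pipeline is exercised through the spu_shuffle intrinsic, which builds an output vector byte-by-byte from a control pattern. A minimal sketch follows; the byte-reversal pattern and the function name are illustrative, not taken from the slides.

/* Minimal spu_shuffle sketch: reverse the byte order of each 32-bit word
 * in a vector.  Pattern bytes 0x00-0x0F select bytes of the first operand. */
#include <spu_intrinsics.h>

vector unsigned int byteswap_words(vector unsigned int v)
{
    /* For each 4-byte word, pick its bytes in reverse order. */
    vector unsigned char pattern = {  3,  2,  1,  0,
                                      7,  6,  5,  4,
                                     11, 10,  9,  8,
                                     15, 14, 13, 12 };
    return spu_shuffle(v, v, pattern);
}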


SPE SIMD Vector Data Types


SPE SIMD Arithmetic Intrinsics
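A minimal sketch of the SPU vector data types and the generic arithmetic intrinsics from spu_intrinsics.h: spu_splats replicates a scalar across a vector, and spu_madd performs a four-way single-precision fused multiply-add. The SAXPY-style loop, array names, length, and alignment assumption are illustrative.

/* 4-way SP SIMD on the SPU: y = a*x + y over vectors of four floats. */
#include <spu_intrinsics.h>

#define N 1024                                   /* multiple of 4 assumed  */

void vec_saxpy(float a, float *x, float *y)
{
    vector float va = spu_splats(a);             /* {a, a, a, a}           */
    vector float *vx = (vector float *) x;       /* assumes 16 B alignment */
    vector float *vy = (vector float *) y;

    for (int i = 0; i < N / 4; i++)
        vy[i] = spu_madd(va, vx[i], vy[i]);      /* 4 FMAs per iteration   */
}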


SPE Scalar Processing
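There are no separate scalar registers on the SPU: a scalar value lives in the "preferred slot" of a 128-bit register, so scalar code is effectively promote / vector operation / extract. A minimal sketch with illustrative names (the compiler generates similar code for ordinary scalar C):

/* Scalar add expressed through the 128-bit registers. */
#include <spu_intrinsics.h>

float scalar_add(float a, float b)
{
    vector float va = spu_promote(a, 0);   /* a into element 0 (preferred slot) */
    vector float vb = spu_promote(b, 0);
    vector float vc = spu_add(va, vb);     /* adds all four slots               */
    return spu_extract(vc, 0);             /* pull the scalar back out          */
}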


SPE Static Code Analysis


SPE DMA Commands


SPE DMA Status
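A minimal sketch of a blocking, SPE-initiated transfer using the spu_mfcio.h interface: mfc_get enqueues the DMA, and the tag-mask/tag-status pair waits for that tag group to complete. The buffer size, tag number, and effective-address argument are illustrative.

/* Blocking DMA get: fetch SIZE bytes from effective address 'ea' in main
 * storage into a local-store buffer, then wait on the DMA tag group. */
#include <spu_mfcio.h>

#define SIZE 4096                               /* multiple of 16, <= 16 KB */
#define TAG  5                                  /* any tag id 0..31         */

static volatile char buffer[SIZE] __attribute__((aligned(128)));

void fetch_block(unsigned long long ea)
{
    mfc_get(buffer, ea, SIZE, TAG, 0, 0);       /* enqueue the get command  */
    mfc_write_tag_mask(1 << TAG);               /* select this tag group    */
    mfc_read_tag_status_all();                  /* block until it completes */
}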


DMA Double Buffering

Prologue:
  Receive tile 1
  Receive tile 2
  Compute tile 1
  Swap buffers
Loop body:
  FOR I = 2 TO N-1
    Send tile I-1
    Receive tile I+1
    Compute tile I
    Swap buffers
  END FOR
Epilogue:
  Send tile N-1
  Compute tile N
  Send tile N
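A sketch of the same scheme in SPU C, under several assumptions not stated on the slide: n >= 3 tiles of TILE bytes each, stored contiguously at effective addresses in_ea and out_ea; one DMA tag per buffer; a fenced get (mfc_getf) so that the prefetch into a buffer is ordered behind the put that drains it; and compute() standing in for the user's kernel.

/* Double-buffered tile processing on the SPU (sketch, 0-based tiles). */
#include <spu_mfcio.h>

#define TILE 16384                                  /* 16 KB per DMA        */

static volatile char buf[2][TILE] __attribute__((aligned(128)));

static void compute(volatile char *tile) { (void) tile; /* placeholder */ }

static void wait_tag(unsigned int t)
{
    mfc_write_tag_mask(1 << t);                     /* select the tag group */
    mfc_read_tag_status_all();                      /* block until done     */
}

void process(unsigned long long in_ea, unsigned long long out_ea, int n)
{
    int cur, nxt;

    /* Prologue: receive tiles 0 and 1, compute tile 0, swap buffers. */
    mfc_get(buf[0], in_ea,        TILE, 0, 0, 0);
    mfc_get(buf[1], in_ea + TILE, TILE, 1, 0, 0);
    wait_tag(0);
    compute(buf[0]);
    cur = 1; nxt = 0;

    /* Loop body: send tile i-1, prefetch tile i+1, compute tile i. */
    for (int i = 1; i < n - 1; i++) {
        mfc_put (buf[nxt], out_ea + (i - 1) * (unsigned long long) TILE,
                 TILE, nxt, 0, 0);
        mfc_getf(buf[nxt], in_ea  + (i + 1) * (unsigned long long) TILE,
                 TILE, nxt, 0, 0);                  /* fenced after the put */
        wait_tag(cur);
        compute(buf[cur]);
        int t = cur; cur = nxt; nxt = t;            /* swap buffers         */
    }

    /* Epilogue: send tile n-2, compute and send tile n-1, drain DMAs. */
    mfc_put(buf[nxt], out_ea + (n - 2) * (unsigned long long) TILE, TILE, nxt, 0, 0);
    wait_tag(cur);
    compute(buf[cur]);
    mfc_put(buf[cur], out_ea + (n - 1) * (unsigned long long) TILE, TILE, cur, 0, 0);
    mfc_write_tag_mask(3);                          /* both tag groups      */
    mfc_read_tag_status_all();
}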


SPE Mailboxes

• FIFO queues
• 32-bit messages
• intended mainly for communication between the PPE and the SPEs
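A minimal sketch of the SPU side of a mailbox handshake, using the spu_mfcio.h mailbox calls; the PPE side would use the corresponding libspe mailbox functions. Treating the 32-bit message as a work-unit id is illustrative.

/* SPU side: block until the PPE drops a 32-bit message into the inbound
 * mailbox, then reply through the outbound mailbox. */
#include <spu_mfcio.h>

void handshake(void)
{
    unsigned int work_id = spu_read_in_mbox();   /* blocks if mailbox empty */
    /* ... process the work unit identified by work_id ... */
    spu_write_out_mbox(work_id);                 /* blocks if mailbox full  */
}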


SPE Decrementer

• 14 MHz – IBM dual CELL blade
• 80 MHz – Sony Playstation3
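A minimal timing sketch built on the decrementer, assuming the spu_mfcio.h convenience wrappers for the decrementer channel; DEC_HZ must be set to the platform's decrementer frequency from the list above.

/* Time a kernel with the SPU decrementer (counts down). */
#include <spu_mfcio.h>

#define DEC_HZ 14.0e6        /* ~14 MHz on the blade; ~80 MHz on the PS3 */

double time_kernel(void (*kernel)(void))
{
    spu_write_decrementer(0x7fffffff);           /* start from a large count */
    unsigned int t0 = spu_read_decrementer();
    kernel();
    unsigned int t1 = spu_read_decrementer();
    return (t0 - t1) / DEC_HZ;                   /* elapsed time in seconds  */
}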


CELL Basic Coding Tips

• Local Store
  – keep in mind the Local Store is 256 KB in size
  – use plug-ins to handle larger codes
• DMA transfers
  – use SPE-initiated DMA transfers
  – use double buffering to hide transfers
  – use fence and barrier to order transfers
• Loops
  – unroll loops to reduce dependency stalls, increase the dual-issue rate, and exploit the SPU's large register file (see the sketch below)
• Branches
  – eliminate non-predicted branches
• Dual issue
  – choose intrinsics to maximize dual issue
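To make the unrolling tip concrete: the single-precision FMA has a multi-cycle latency, so keeping several independent accumulators in flight avoids dependency stalls. A sketch with an unroll factor of four; the function, the names, and the assumption that n is a multiple of 16 floats are illustrative.

/* Unrolled 4-way-SIMD sum of products: four independent partial sums
 * hide the FMA latency and feed the even pipeline every cycle. */
#include <spu_intrinsics.h>

float dot(const float *x, const float *y, int n)
{
    const vector float *vx = (const vector float *) x;   /* 16 B aligned     */
    const vector float *vy = (const vector float *) y;
    vector float s0 = spu_splats(0.0f), s1 = s0, s2 = s0, s3 = s0;

    for (int i = 0; i < n / 4; i += 4) {                  /* 16 floats / iter */
        s0 = spu_madd(vx[i + 0], vy[i + 0], s0);          /* independent FMAs */
        s1 = spu_madd(vx[i + 1], vy[i + 1], s1);
        s2 = spu_madd(vx[i + 2], vy[i + 2], s2);
        s3 = spu_madd(vx[i + 3], vy[i + 3], s3);
    }
    vector float s = spu_add(spu_add(s0, s1), spu_add(s2, s3));
    return spu_extract(s, 0) + spu_extract(s, 1)
         + spu_extract(s, 2) + spu_extract(s, 3);
}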


UT CELL BE Cluster

Ashe.CS.UTK.EDU


UT CELL BE Cluster – Historical Perspective

Connection Machine CM-5 (512 CPUs)
• 512 x 128 Mflop/s ≈ 65 Gflop/s DP

Playstation3 (4 units)
• 4 x 17 Gflop/s ≈ 68 Gflop/s DP

