Sony/Toshiba/IBM (STI) CELL Processor
Scientific Computing for Engineers: Spring 2007
Three Performance-Limiting Walls
► Power Wall
Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.
► Frequency Wall
Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.
► Memory Wall
On multi-gigahertz symmetric multiprocessors – even those with integrated memory controllers – latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.
01/31/07 11:10 2
The Memory Wall
"When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available."
H. Peter Hofstee "Cell Broadband Engine Architecture from 20,000 feet" http://www-128.ibm.com/developerworks/power/library/pa-cbea.html
"Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The stream benchmark performance of Intel's new Woodcrest dual core processor illustrates this point. [...] Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GBs/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem."
Richard B. Walsh "New Processor Options for HPC" http://www.hpcwire.com/hpc/906849.html
CELL Overview
► $400 million over 5 years
► Sony/Toshiba/IBM alliance, known as STI
► STI Design Center – Austin, Texas – March 2001
► Mercury Computer Systems Inc. – dual CELL blades
► Cell Broadband Engine Architecture / CBEA / CELL BE
► Playstation3, dual CELL blades, PCI accelerator cards
► 3.2 GHz, 90 nm SOI
► 234 million transistors
  ► 165 million – Xbox 360
  ► 220 million – Itanium 2 (2002)
  ► 1,700 million – Dual-Core Itanium 2 (2006)
CELL Architecture
► PPU – PowerPC Processor Unit (64-bit PowerPC core)
► SPE – Synergistic Processing Element
► SPU – Synergistic Processing Unit
► LS – Local Store
► MFC – Memory Flow Controller
► EIB – Element Interconnect Bus
► MIC – Memory Interface Controller
Power Processing Element
► Power Processing Element (PPE)
  ► 64-bit PowerPC architecture compliant
  ► 2-way Symmetric Multithreading (SMT)
  ► 32 KB Level 1 instruction cache
  ► 32 KB Level 1 data cache
  ► 512 KB Level 2 cache
  ► VMX (AltiVec) with 32 128-bit vector registers
► Standard FPU
  ► fully pipelined double precision with fused multiply-add (FMA)
  ► 6.4 Gflop/s DP at 3.2 GHz
► AltiVec
  ► no double precision
  ► 4-way, fully pipelined single precision with FMA
  ► 25.6 Gflop/s SP at 3.2 GHz
Synergistic Processing Elements
► Synergistic Processing Elements (SPEs)
  ► 128-bit SIMD
  ► 128 vector registers
  ► 256 KB local store for instructions and data
  ► Memory Flow Controller (MFC)
► SIMD widths
  ► 16-way (8-bit integer)
  ► 8-way (16-bit integer)
  ► 4-way (32-bit integer, single precision FP)
  ► 2-way (64-bit double precision FP)
► Peak performance
  ► 25.6 Gflop/s SP at 3.2 GHz (fully pipelined)
  ► 1.8 Gflop/s DP at 3.2 GHz (7-cycle latency)
SPE Architecture
► Dual-issue, in-order pipeline
  ► Even pipe – arithmetic
    ► integer
    ► floating point
  ► Odd pipe – data motion
    ► permutations
    ► local store loads/stores
    ► branches
    ► channel operations
Element Interconnect Bus
► Element Interconnect Bus (EIB)
  ► four 16-byte-wide unidirectional channels (rings)
  ► clocked at half the system clock (1.6 GHz)
  ► 204.8 GB/s peak bandwidth (subject to arbitration)
Main Memory System
► Memory Interface Controller (MIC)
  ► dual external XDR channels
  ► 3.2 GHz maximum effective frequency (max 400 MHz, Octal Data Rate)
  ► each channel: 8 banks → max 256 MB
  ► total: 16 banks → max 512 MB
  ► 25.6 GB/s bandwidth
CELL Performance – Double Precision
In double precision:
► every seven cycles, each SPE can:
  ► process a two-element vector,
  ► perform two operations on each element.
► every cycle, the FPU on the PPE can:
  ► process one element,
  ► perform two operations on the element.

SPEs: 8 × 2 × 2 × 3.2 GHz / 7 = 14.63 Gflop/s
PPE: 2 × 3.2 GHz = 6.4 Gflop/s
Total: 21.03 Gflop/s
CELL Performance – Single Precision
In single precision:
► every cycle, each SPE can:
  ► process a four-element vector,
  ► perform two operations on each element.
► every cycle, the VMX on the PPE can:
  ► process a four-element vector,
  ► perform two operations on each element.

SPEs: 8 × 4 × 2 × 3.2 GHz = 204.8 Gflop/s
PPE: 4 × 2 × 3.2 GHz = 25.6 Gflop/s
Total: 230.4 Gflop/s
CELL Performance – Bandwidth
Bandwidth: ¾ 3.2 GHz clock: ¾ each SPU – 25.6 GB/s, (compare to 25.6 Gflop/s per SPU) ¾ Main memory – 25.6 GB/s, ¾ EIB – 204.8 GB/s. (compare to 204.8 Gflop/s – 8 SPUs)
Performance Comparison – Double Precision
1.6 GHz Dual-Core Itanium 2
► 1.6 GHz × 4 flops/cycle × 2 cores = 12.8 Gflop/s
3.2 GHz CELL BE (SPEs only)
► 8 SPEs × 4 flops / 7 cycles × 3.2 GHz = 14.6 Gflop/s
Performance Comparison – Single Precision
1.6 GHz Dual-Core Itanium 2
► 1.6 GHz × 4 flops/cycle × 2 cores = 12.8 Gflop/s
3.2 GHz SPE
► 3.2 GHz × 8 flops/cycle = 25.6 Gflop/s
► one SPE = two dual-core Itanium 2s

3.2 GHz CELL BE (SPEs only)
► 8 SPEs × 8 flops/cycle × 3.2 GHz = 204.8 Gflop/s
► one CBE = 16 dual-core Itanium 2s
CELL Programming Basics
► Programming the SPUs
  ► SIMD'ization (vectorization)
► Communication
  ► DMAs
  ► Mailboxes
► Measuring performance
  ► SPU decrementer
SPE Register File
SPE SIMD Arithmetic
SPE SIMD Data Motion
SPE SIMD Vector Data Types
SPE SIMD Arithmetic Intrinsics
SPE Scalar Processing
SPE Static Code Analysis
SPE DMA Commands
SPE DMA Status
DMA Double Buffering
Prologue:
  Receive tile 1
  Receive tile 2
  Compute tile 1
  Swap buffers
Loop body:
  FOR I = 2 TO N-1
    Send tile I-1
    Receive tile I+1
    Compute tile I
    Swap buffers
  END FOR
Epilogue:
  Send tile N-1
  Compute tile N
  Send tile N
SPE Mailboxes
► FIFO queues
► 32-bit messages
► Intended mainly for communication between the PPE and the SPEs
SPE Decrementer
► 14 MHz – IBM dual CELL blade
► 80 MHz – Sony Playstation3
CELL Basic Coding Tips
► Local Store
  ► Keep in mind the Local Store is 256 KB in size
  ► Use plug-ins to handle larger codes
► DMA Transfers
  ► Use SPE-initiated DMA transfers
  ► Use double buffering to hide transfer latency
  ► Use fence and barrier to order transfers
► Loops
  ► Unroll loops to reduce dependency stalls, increase the dual-issue rate, and exploit the SPU's large register file
► Branches
  ► Eliminate non-predicted branches
► Dual-Issue
  ► Choose intrinsics to maximize dual-issue
UT CELL BE Cluster
Ashe.CS.UTK.EDU
UT CELL BE Cluster – Historical Perspective
Connection Machine CM-5 (512 CPUs)
► 512 × 128 Mflop/s ≈ 65 Gflop/s DP

Playstation3 (4 units)
► 4 × 17 Gflop/s ≈ 68 Gflop/s DP