Sony/Toshiba/IBM (STI) CELL Processor
Scientific Computing for Engineers: Spring 2007

Three Performance-Limiting Walls

- Power Wall
  Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.
- Frequency Wall
  Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.
- Memory Wall
  On multi-gigahertz symmetric multiprocessors – even those with integrated memory controllers – latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.

01/31/07 11:10

The Memory Wall

"When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available."
H. Peter Hofstee, "Cell Broadband Engine Architecture from 20,000 feet"

"Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The stream benchmark performance of Intel's new Woodcrest dual core processor illustrates this point. [...]
Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GBs/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem."
Richard B. Walsh, "New Processor Options for HPC"

CELL Overview

- $400 million over 5 years
- Sony/Toshiba/IBM alliance known as STI
- STI Design Center – Austin, Texas – March 2001
- Mercury Computer Systems Inc. – dual CELL blades
- Cell Broadband Engine Architecture / CBEA / CELL BE
- Playstation3, dual CELL blades, PCI accelerator cards
- 3.2 GHz, 90nm SOI
- 234 million transistors, compared with:
  - 165 million – Xbox 360
  - 220 million – Itanium 2 (2002)
  - 1,700 million – Dual-Core Itanium 2 (2006)

CELL Architecture

- PPU – PowerPC 970 architecture compliant core
- SPE – Synergistic Processing Element
- SPU – Synergistic Processing Unit
- LS – Local Store
- MFC – Memory Flow Controller
- EIB – Element Interconnect Bus
- MIC – Memory Interface Controller

Power Processing Element

- Power Processing Element (PPE)
  - PowerPC 970 architecture compliant
  - 2-way Simultaneous Multithreading (SMT)
  - 32KB Level 1 instruction cache
  - 32KB Level 1 data cache
  - 512KB Level 2 cache
  - VMX (AltiVec) with 32 128-bit vector registers
- standard FPU
  - fully pipelined DP with FMA
  - 6.4 Gflop/s DP at 3.2 GHz
- AltiVec
  - no DP
  - 4-way fully pipelined SP with FMA
  - 25.6 Gflop/s SP at 3.2 GHz

Synergistic Processing Elements

- Synergistic Processing Elements (SPEs)
  - 128-bit SIMD
  - 128 vector registers
  - 256KB instruction and data local memory
  - Memory Flow Controller (MFC)
- 16-way SIMD (8-bit integer)
- 8-way SIMD (16-bit integer)
- 4-way SIMD (32-bit integer, single prec. FP)
- 2-way SIMD (64-bit double prec. FP)
- 25.6 Gflop/s SP at 3.2 GHz (fully pipelined)
- 1.8 Gflop/s DP at 3.2 GHz (7 cycle latency)

SPE Architecture

- Dual issue (in order) pipeline
- Even – arithmetic
  - integer
  - floating point
- Odd – data motion
  - permutations
  - local store
  - branches
  - channel

Element Interconnect Bus

- Element Interconnect Bus (EIB)
  - 4 16B-wide unidirectional channels
  - half the system clock (1.6 GHz)
  - 204.8 GB/s bandwidth (arbitration)

Main Memory System

- Memory Interface Controller (MIC)
- external dual XDR
- 3.2 GHz max effective frequency (max 400 MHz, Octal Data Rate)
- each: 8 banks → max 256 MB
- total: 16 banks → max 512 MB
- 25.6 GB/s

CELL Performance – Double Precision

In double precision
- every seven cycles each SPE can:
  - process a two element vector,
  - perform two operations on each element.
- in one cycle the FPU on the PPE can:
  - process one element,
  - perform two operations on the element.

SPEs:   8 x 2 x 2 x 3.2 GHz / 7 = 14.63 Gflop/s
PPE:    2 x 3.2 GHz             =  6.40 Gflop/s
Total:                            21.03 Gflop/s

CELL Performance – Single Precision

In single precision
- in one cycle each SPE can:
  - process a four element vector,
  - perform two operations on each element.
- in one cycle the VMX on the PPE can:
  - process a four element vector,
  - perform two operations on each element.

SPEs:   8 x 4 x 2 x 3.2 GHz = 204.8 Gflop/s
PPE:    4 x 2 x 3.2 GHz     =  25.6 Gflop/s
Total:                        230.4 Gflop/s

CELL Performance – Bandwidth

Bandwidth:
- 3.2 GHz clock:
  - each SPU – 25.6 GB/s (compare to 25.6 Gflop/s per SPU)
  - Main memory – 25.6 GB/s
  - EIB – 204.8 GB/s
    (compare to 204.8 Gflop/s – 8 SPUs)

Performance Comparison – Double Precision

1.6 GHz Dual-Core Itanium 2
- 1.6 x 4 x 2 = 12.8 Gflop/s
3.2 GHz CELL BE (SPEs only)
- 3.2 x 8 x 4 / 7 = 14.6 Gflop/s

Performance Comparison – Single Precision

1.6 GHz Dual-Core Itanium 2
- 1.6 x 4 x 2 = 12.8 Gflop/s
3.2 GHz SPE
- 3.2 x 8 = 25.6 Gflop/s
- One SPE = 2 Dual-Core Itanium 2s
3.2 GHz CELL BE (SPEs only)
- 3.2 x 8 x 8 = 204.8 Gflop/s
- One CBE = 16 Dual-Core Itanium 2s

CELL Programming Basics

- Programming the SPUs
  - SIMD'ization (vectorization)
- Communication
  - DMAs
  - Mailboxes
- Measuring Performance
  - SPU decrementer

SPE Register File [figure]

SPE SIMD Arithmetic [figure]

SPE SIMD Data Motion [figure]

SPE SIMD Vector Data Types [figure]

SPE SIMD Arithmetic Intrinsics [figure]

SPE Scalar Processing [figure]

SPE Static Code Analysis [figure]

SPE DMA Commands [figure]

SPE DMA Status [figure]

DMA Double Buffering

Prologue
  Receive tile 1
  Receive tile 2
  Compute tile 1
  Swap buffers
Loop body
  FOR I=2 TO N-1
    Send tile I-1
    Receive tile I+1
    Compute tile I
    Swap buffers
  END FOR
Epilogue
  Send tile N-1
  Compute tile N
  Send tile N

SPE Mailboxes

- FIFO queues
- 32-bit messages
- Intended mainly for communication between the PPE and the SPEs

SPE Decrementer

- 14 MHz – IBM dual CELL blade
- 80 MHz – Sony Playstation3

CELL Basic Coding Tips

- Local Store
  - Keep in mind the Local Store is 256KB in size
  - Use plug-ins to handle larger codes
- DMA Transfers
  - Use SPE-initiated DMA transfers
  - Use double-buffering to hide transfers
  - Use fence and barrier to order transfers
- Loops
  - Unroll loops to reduce dependency stalls, increase dual-issue rate, and exploit the SPU's large register file
- Branches
  - Eliminate non-predicted branches
- Dual-Issue
  - Choose intrinsics to maximize dual-issue

UT CELL BE Cluster

Ashe.CS.UTK.EDU

UT CELL BE Cluster – Historical Perspective

Connection Machine CM-5 (512 CPUs)
- 512 x 128 Mflop/s ≈ 65 Gflop/s DP
Playstation3 (4 units)
- 4 x 17 Gflop/s = 68 Gflop/s DP
