Sony/Toshiba/IBM (STI) CELL Processor
Scientific Computing for Engineers: Spring 2007
Three Performance-Limiting Walls

- Power Wall
  Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.

- Frequency Wall
  Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.

- Memory Wall
  On multi-gigahertz symmetric multiprocessors – even those with integrated memory controllers – latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.

The Memory Wall

"When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available."
H. Peter Hofstee, "Cell Broadband Engine Architecture from 20,000 feet", http://www-128.ibm.com/developerworks/power/library/pa-cbea.html

"Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The Stream benchmark performance of Intel's new Woodcrest dual-core processor illustrates this point. [...] Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GB/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory-intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem."
Richard B. Walsh, "New Processor Options for HPC", http://www.hpcwire.com/hpc/906849.html
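The Stream benchmark mentioned above is a simple memory-bandwidth test. Below is a minimal sketch of its triad kernel in plain C; this is not the official STREAM code (array sizes, timing, and reporting are simplified assumptions), but it shows why such a loop is limited by bandwidth rather than arithmetic: each iteration performs only two flops while moving 24 bytes.

```c
/* Minimal sketch of a STREAM-style triad kernel, a[i] = b[i] + q*c[i].
   Not the official STREAM benchmark: sizes, timing, and reporting are
   simplified.  N is chosen to be far larger than any cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 23)                      /* 8M doubles per array, ~192 MB total */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    const double q = 3.0;
    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];           /* 2 flops but 24 bytes of traffic */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* The GB/s figure, not the Gflop/s figure, is what limits this loop. */
    printf("triad: %.2f GB/s, %.2f Gflop/s (check %.1f)\n",
           24.0 * N / sec / 1e9, 2.0 * N / sec / 1e9, a[0] + a[N - 1]);

    free(a); free(b); free(c);
    return 0;
}
```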
CELL Overview

- $400 million over 5 years
- Sony/Toshiba/IBM alliance known as STI
- STI Design Center – Austin, Texas – March 2001
- Mercury Computer Systems Inc. – dual CELL blades
- Cell Broadband Engine Architecture / CBEA / CELL BE
- PlayStation 3, dual CELL blades, PCI accelerator cards
- 3.2 GHz, 90 nm SOI
- 234 million transistors
  - 165 million – Xbox 360
  - 220 million – Itanium 2 (2002)
  - 1,700 million – Dual-Core Itanium 2 (2006)

CELL Architecture

- PPU – PowerPC Processor Unit (PowerPC 970-class core)
- SPE – Synergistic Processing Element
- SPU – Synergistic Processing Unit
- LS – Local Store
- MFC – Memory Flow Controller
- EIB – Element Interconnect Bus
- MIC – Memory Interface Controller

Power Processing Element

- Power Processing Element (PPE)
- PowerPC 970 architecture compliant
- 2-way Simultaneous Multithreading (SMT)
- 32 KB Level 1 instruction cache
- 32 KB Level 1 data cache
- 512 KB Level 2 cache
- VMX (AltiVec) with 32 128-bit vector registers
- standard FPU
  - fully pipelined DP with FMA
  - 6.4 Gflop/s DP at 3.2 GHz
- AltiVec (VMX)
  - no DP
  - 4-way fully pipelined SP with FMA
  - 25.6 Gflop/s SP at 3.2 GHz

Synergistic Processing Elements

- Synergistic Processing Elements (SPEs)
- 128-bit SIMD
- 128 vector registers
- 256 KB instruction and data local memory (Local Store)
- Memory Flow Controller (MFC)
- 16-way SIMD (8-bit integer)
- 8-way SIMD (16-bit integer)
- 4-way SIMD (32-bit integer, single-precision FP)
- 2-way SIMD (64-bit double-precision FP)
- 25.6 Gflop/s SP at 3.2 GHz (fully pipelined)
- 1.8 Gflop/s DP at 3.2 GHz (7-cycle latency)

SPE Architecture

- Dual-issue (in-order) pipeline
  - Even pipe – arithmetic
    - integer
    - floating point
  - Odd pipe – data motion
    - permutations
    - local store
    - branches
    - channel

Element Interconnect Bus

- Element Interconnect Bus (EIB)
- 4 16-byte-wide unidirectional channels
- runs at half the system clock (1.6 GHz)
- 204.8 GB/s peak bandwidth (subject to arbitration)

Main Memory System

- Memory Interface Controller (MIC)
- external dual XDR channels
- 3.2 GHz max effective frequency (max 400 MHz, Octal Data Rate)
- each channel: 8 banks → max 256 MB
- total: 16 banks → max 512 MB
- 25.6 GB/s

CELL Performance – Double Precision

In double precision:
- every seven cycles each SPE can:
  - process a two-element vector,
  - perform two operations on each element.
- in one cycle the FPU on the PPE can:
  - process one element,
  - perform two operations on the element.

SPEs: 8 × 2 × 2 × 3.2 GHz / 7 = 14.63 Gflop/s
PPE: 2 × 3.2 GHz = 6.4 Gflop/s
Total: 21.03 Gflop/s

CELL Performance – Single Precision

In single precision:
- in one cycle each SPE can:
  - process a four-element vector,
  - perform two operations on each element.
- in one cycle the VMX on the PPE can:
  - process a four-element vector,
  - perform two operations on each element.

SPEs: 8 × 4 × 2 × 3.2 GHz = 204.8 Gflop/s
PPE (VMX): 4 × 2 × 3.2 GHz = 25.6 Gflop/s
Total: 230.4 Gflop/s
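The "two operations per element per cycle" in these peak numbers comes from the fused multiply-add. The sketch below shows a 4-way single-precision FMA loop written with the SPU C-language intrinsics (spu_intrinsics.h, compiled with spu-gcc from the Cell SDK); the function name saxpy_spu and the alignment/size assumptions are illustrative, not taken from these slides.

```c
/* Sketch of 4-way single-precision FMA on an SPU using the SPU C intrinsics.
   Each spu_madd performs a fused multiply-add on a 4-element vector (8 flops),
   which is where the 25.6 Gflop/s per SPE at 3.2 GHz comes from. */
#include <spu_intrinsics.h>

/* y[i] = a*x[i] + y[i] over n elements; n is assumed to be a multiple of 4
   and the arrays are assumed to be 16-byte aligned in the local store. */
void saxpy_spu(int n, float a, const float *x, float *y)
{
    vector float va = spu_splats(a);             /* broadcast a into all 4 lanes */
    const vector float *vx = (const vector float *)x;
    vector float *vy = (vector float *)y;
    int i;

    for (i = 0; i < n / 4; i++)
        vy[i] = spu_madd(va, vx[i], vy[i]);      /* 4 multiplies + 4 adds */
}
```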
CELL Performance – Bandwidth

Bandwidth at the 3.2 GHz clock:
- each SPU – 25.6 GB/s (compare to 25.6 Gflop/s per SPU),
- main memory – 25.6 GB/s,
- EIB – 204.8 GB/s (compare to 204.8 Gflop/s for 8 SPUs).

Performance Comparison – Double Precision

1.6 GHz Dual-Core Itanium 2:
- 1.6 × 4 × 2 = 12.8 Gflop/s
3.2 GHz CELL BE (SPEs only):
- 3.2 × 8 × 4 / 7 ≈ 14.6 Gflop/s

Performance Comparison – Single Precision

1.6 GHz Dual-Core Itanium 2:
- 1.6 × 4 × 2 = 12.8 Gflop/s
3.2 GHz SPE:
- 3.2 × 8 = 25.6 Gflop/s
- one SPE ≈ 2 Dual-Core Itanium 2 processors
3.2 GHz CELL BE (SPEs only):
- 3.2 × 8 × 8 = 204.8 Gflop/s
- one CBE ≈ 16 Dual-Core Itanium 2 processors

CELL Programming Basics

- Programming the SPUs
  - SIMD'ization (vectorization)
- Communication
  - DMAs
  - Mailboxes
- Measuring performance
  - SPU decrementer

Figure-only slides (diagrams not reproduced here): SPE Register File, SPE SIMD Arithmetic, SPE SIMD Data Motion, SPE SIMD Vector Data Types, SPE SIMD Arithmetic Intrinsics, SPE Scalar Processing, SPE Static Code Analysis, SPE DMA Commands, SPE DMA Status.

DMA Double Buffering

Prologue:
  Receive tile 1
  Receive tile 2
  Compute tile 1
  Swap buffers
Loop body:
  FOR I = 2 TO N-1
    Send tile I-1
    Receive tile I+1
    Compute tile I
    Swap buffers
  END FOR
Epilogue:
  Send tile N-1
  Compute tile N
  Send tile N

(An MFC-based sketch of this scheme appears at the end of these notes.)

SPE Mailboxes

- FIFO queues
- 32-bit messages
- intended mainly for communication between the PPE and the SPEs

SPE Decrementer

- 14 MHz – IBM dual CELL blade
- 80 MHz – Sony PlayStation 3

CELL Basic Coding Tips

- Local Store
  - keep in mind the Local Store is 256 KB in size
  - use plug-ins (overlays) to handle larger codes
- DMA transfers
  - use SPE-initiated DMA transfers
  - use double buffering to hide transfers
  - use fences and barriers to order transfers
- Loops
  - unroll loops to reduce dependency stalls, increase the dual-issue rate, and exploit the SPU's large register file
- Branches
  - eliminate non-predicted branches
- Dual issue
  - choose intrinsics to maximize dual issue

UT CELL BE Cluster

- Ashe.CS.UTK.EDU

UT CELL BE Cluster – Historical Perspective

Connection Machine CM-5 (512 CPUs):
- 512 × 128 Mflop/s ≈ 65 Gflop/s DP
PlayStation 3 (4 units):
- 4 × 17 Gflop/s ≈ 68 Gflop/s DP
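As promised under the DMA Double Buffering slide, here is a minimal sketch of that scheme using the SPE-side MFC DMA calls from spu_mfcio.h in the Cell SDK (mfc_get, mfc_getf, mfc_put, and tag-status waits), compiled with spu-gcc. The tile size, buffer layout, and the stream_tiles/process_tile names are illustrative assumptions, not code from the lecture.

```c
/* Double-buffered streaming of n tiles from main storage through the local
   store and back.  TILE, the address layout, and process_tile() are
   illustrative placeholders. */
#include <spu_mfcio.h>

#define TILE 16384                      /* bytes per tile; a multiple of 128 */

/* two local-store buffers; the buffer index doubles as the DMA tag group */
static volatile unsigned char buf[2][TILE] __attribute__((aligned(128)));

void process_tile(volatile void *tile, int bytes);  /* user-supplied compute kernel */

static void wait_tag(unsigned int tag)
{
    mfc_write_tag_mask(1 << tag);       /* select the tag group...            */
    mfc_read_tag_status_all();          /* ...and block until its DMAs finish */
}

void stream_tiles(unsigned long long in_ea, unsigned long long out_ea, int n)
{
    int cur = 0, nxt = 1, i;

    mfc_get(buf[cur], in_ea, TILE, cur, 0, 0);        /* prologue: fetch tile 0 */

    for (i = 0; i < n; i++) {
        if (i + 1 < n)
            /* fenced get: ordered after the earlier put that used this buffer
               (same tag group), without stalling the SPU */
            mfc_getf(buf[nxt], in_ea + (unsigned long long)(i + 1) * TILE,
                     TILE, nxt, 0, 0);

        wait_tag(cur);                                /* tile i has arrived     */
        process_tile(buf[cur], TILE);                 /* compute on tile i      */

        mfc_put(buf[cur], out_ea + (unsigned long long)i * TILE, TILE, cur, 0, 0);
        cur ^= 1;                                     /* swap buffers           */
        nxt ^= 1;
    }

    wait_tag(0);                                      /* epilogue: drain puts   */
    wait_tag(1);
}
```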
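Likewise, for the "Measuring Performance" bullet and the SPE Decrementer slide above, timing on the SPU is normally done with the decrementer. A minimal sketch follows, assuming the spu_write_decrementer/spu_read_decrementer convenience intrinsics from spu_intrinsics.h; DECR_FREQ is an assumed constant that must be set to the platform's actual timebase (roughly 14 MHz on the IBM dual CELL blades, 80 MHz on the PlayStation 3).

```c
/* Sketch of timing a code section with the SPU decrementer (a down-counter).
   DECR_FREQ is platform-dependent and assumed here, not queried. */
#include <spu_intrinsics.h>

#define DECR_FREQ 14.318e6              /* ticks per second on a blade (assumption) */

double time_section(void (*work)(void))
{
    unsigned int t0, t1;

    spu_write_decrementer(0x7fffffff);  /* start the counter high      */
    t0 = spu_read_decrementer();
    work();                             /* the code being measured     */
    t1 = spu_read_decrementer();

    return (double)(t0 - t1) / DECR_FREQ;   /* the counter counts down */
}
```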