CELL Processor
Sony/Toshiba/IBM (STI) CELL Processor
Scientific Computing for Engineers: Spring 2008

Nec Hercules Contra Plures (not even Hercules can prevail against many)
➢ A chip's single-thread performance grows only roughly with the square root of its area (Pollack's Rule), so spending the same area on several simpler cores can yield more aggregate performance than one large core.
➢ Cray's two oxen vs. Cray's 1024 chickens

Three Performance-Limiting Walls
➢ Power Wall
  Increasingly, microprocessor performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.
➢ Frequency Wall
  Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.
➢ Memory Wall
  On multi-gigahertz symmetric multiprocessors – even those with integrated memory controllers – latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.

Conventional Processors Don't Cut It
... shallower pipelines with in-order execution have proven to be the most area and energy efficient. [...] we believe the efficient building blocks of future architectures are likely to be simple, modestly pipelined (5-9 stages) processors, floating point units, vector, and SIMD processing elements. Note that these constraints fly in the face of the conventional wisdom of simplifying parallel programming by using the largest processors available. [...]
David Patterson, [...] Kathy Yelick, The Landscape of Parallel Computing Research: A View from Berkeley
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
future architectures:
➢ shallow pipelines
➢ in-order execution
➢ SIMD units

Cache Memories Don't Cut It
When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available.
H. Peter Hofstee, Cell Broadband Engine Architecture from 20,000 feet
http://www-128.ibm.com/developerworks/power/library/pa-cbea.html
conventional processor:
➢ bucket brigade
➢ a hundred people
➢ a few buckets

Cache Memories Don't Cut It
Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The stream benchmark performance of Intel's new Woodcrest dual core processor illustrates this point. [...] Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GBs/sec on nodes with two sockets and four cores. Yet, four-threaded runs of the memory intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem.
Richard B. Walsh, New Processor Options for HPC
http://www.hpcwire.com/hpc/906849.html
conventional processor:
➢ latency-limited bandwidth
Conventional Memories Don't Cut It
More flexible or even reconfigurable data coherency schemes will be needed to leverage the improved bandwidth and reduced latency. An example might be large, on-chip, caches that can flexibly adapt between private or shared configurations. In addition, real-time embedded applications prefer more direct control over the memory hierarchy, and so could benefit from on-chip storage configured as software-managed scratchpad memory. [...]
David Patterson, [...] Kathy Yelick, The Landscape of Parallel Computing Research: A View from Berkeley
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
future architectures:
➢ reconfigurable coherency
➢ software-managed scratchpad memory

CELL
➢ multi-core
➢ in-order execution
➢ shallow pipeline
➢ SIMD
➢ scratchpad memory
➢ 3.2 GHz
➢ 90 nm SOI
➢ 234 million transistors
  ➢ 165 million – Xbox 360
  ➢ 220 million – Itanium 2 (2002)
  ➢ 1,700 million – Dual-Core Itanium 2 (2006) – 12.8 Gflop/s (single or double)
➢ 204.8 Gflop/s single precision
➢ 204.8 GB/s internal bandwidth
➢ 25.6 GB/s memory bandwidth

CELL
➢ PPE – Power Processing Element
➢ SPE – Synergistic Processing Element
➢ SPU – Synergistic Processing Unit
➢ LS – Local Store
➢ MFC – Memory Flow Controller
➢ EIB – Element Interconnect Bus
➢ MIC – Memory Interface Controller

Power Processing Element
➢ Power Processing Element (PPE)
➢ PowerPC 970 architecture compliant
➢ 2-way Simultaneous Multithreading (SMT)
➢ 32 KB Level 1 instruction cache
➢ 32 KB Level 1 data cache
➢ 512 KB Level 2 cache
➢ VMX (AltiVec) with 32 128-bit vector registers
➢ standard FPU
  ➢ fully pipelined double precision with FMA
  ➢ 6.4 Gflop/s DP at 3.2 GHz
➢ AltiVec
  ➢ no double precision
  ➢ 4-way fully pipelined single precision with FMA
  ➢ 25.6 Gflop/s SP at 3.2 GHz

Synergistic Processing Elements
➢ Synergistic Processing Elements (SPEs)
➢ 128-bit SIMD
➢ 128 vector registers
➢ 256 KB instruction and data local memory
➢ Memory Flow Controller (MFC)
➢ 16-way SIMD (8-bit integer)
➢ 8-way SIMD (16-bit integer)
➢ 4-way SIMD (32-bit integer, single-precision FP)
➢ 2-way SIMD (64-bit double-precision FP)
➢ 25.6 Gflop/s SP at 3.2 GHz (fully pipelined)
➢ 1.8 Gflop/s DP at 3.2 GHz (7-cycle latency)

SPE
➢ SIMD architecture
➢ two in-order (dual-issue) pipelines
➢ large register file (128 128-bit registers)
➢ 256 KB of scratchpad memory (Local Store)
➢ Memory Flow Controller to DMA code and data from system memory (a minimal SPU-side sketch follows the Main Memory System slide below)

Element Interconnect Bus
➢ Element Interconnect Bus (EIB)
➢ 4 16-byte-wide unidirectional channels
➢ runs at half the system clock (1.6 GHz)
➢ 204.8 GB/s bandwidth (with arbitration)

EIB
➢ 16-byte channels
➢ 4 unidirectional rings
➢ token-based arbitration
➢ half system clock (1.6 GHz)

Main Memory System
➢ Memory Interface Controller (MIC)
➢ external dual XDR
➢ 3.2 GHz max effective frequency (max 400 MHz, Octal Data Rate)
➢ each channel: 8 banks → max 256 MB
➢ total: 16 banks → max 512 MB
➢ 25.6 GB/s
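The SPE programming pattern described above – DMA a block into the Local Store, compute on it with SIMD instructions, DMA the result back – can be summarized in a short SPU-side sketch. This is a minimal illustration, not SDK sample code: the buffer size N, the DMA tag, and the convention that the effective address of the array arrives in argp (set by the PPE program) are assumptions made for the example.

    /* SPU side sketch: DMA in, 4-way SIMD compute, DMA out.
       Build with spu-gcc; headers come with the Cell SDK. */
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define N 1024   /* floats per chunk (assumed); 4 KB fits well under the 16 KB DMA limit */

    /* Local Store buffer; DMA targets should be at least 16-byte (here 128-byte) aligned */
    volatile vector float buf[N / 4] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        unsigned int tag = 1;   /* DMA tag group 0-31 */
        int i;

        /* Pull one chunk from system memory (effective address in argp) into the Local Store */
        mfc_get(buf, argp, N * sizeof(float), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();   /* block until the transfer completes */

        /* 4-way SIMD compute: y = 2*y + 1, one quadword per iteration,
           using the fused multiply-add the SPU pipelines every cycle */
        vector float two = spu_splats(2.0f);
        vector float one = spu_splats(1.0f);
        for (i = 0; i < N / 4; i++)
            buf[i] = spu_madd(two, buf[i], one);

        /* Push the results back to the same effective address and wait again */
        mfc_put(buf, argp, N * sizeof(float), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        return 0;
    }

In real codes the usual next step is double buffering: two Local Store buffers with alternating tags, so the MFC streams the next chunk while the SPU computes on the current one.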
CELL Performance – Double Precision
In double precision:
➢ every seven cycles each SPE can:
  ➢ process a two-element vector,
  ➢ perform two operations (a multiply-add) on each element.
➢ in one cycle the FPU on the PPE can:
  ➢ process one element,
  ➢ perform two operations on the element.
SPEs: 8 × 2 × 2 × 3.2 GHz / 7 = 14.63 Gflop/s
PPE: 2 × 3.2 GHz = 6.4 Gflop/s
Total: 21.03 Gflop/s

CELL Performance – Single Precision
In single precision:
➢ in one cycle each SPE can:
  ➢ process a four-element vector,
  ➢ perform two operations on each element.
➢ in one cycle the VMX on the PPE can:
  ➢ process a four-element vector,
  ➢ perform two operations on each element.
SPEs: 8 × 4 × 2 × 3.2 GHz = 204.8 Gflop/s
PPE (VMX): 4 × 2 × 3.2 GHz = 25.6 Gflop/s
Total: 230.4 Gflop/s

CELL Performance – Bandwidth
Bandwidth at the 3.2 GHz clock:
➢ each SPU – 25.6 GB/s (compare to 25.6 Gflop/s per SPU)
➢ main memory – 25.6 GB/s
➢ EIB – 204.8 GB/s (compare to 204.8 Gflop/s for 8 SPUs)

CELL Performance – Historical Perspective
Connection Machine CM-5 (512 CPUs)
➢ 512 × 128 Mflop/s ≈ 65 Gflop/s DP
PlayStation 3 (4 units)
➢ 4 × 17 Gflop/s ≈ 68 Gflop/s DP

Performance Comparison – Double Precision
1.6 GHz Dual-Core Itanium 2
➢ 1.6 × 4 × 2 = 12.8 Gflop/s
3.2 GHz CELL BE (SPEs only)
➢ 8 × 2 × 2 × 3.2 / 7 = 14.6 Gflop/s

Performance Comparison – Single Precision
1.6 GHz Dual-Core Itanium 2
➢ 1.6 × 4 × 2 = 12.8 Gflop/s
3.2 GHz SPE
➢ 3.2 × 8 = 25.6 Gflop/s
➢ one SPE ≈ two Dual-Core Itanium 2 processors
3.2 GHz CELL BE (SPEs only)
➢ 3.2 × 8 × 8 = 204.8 Gflop/s
➢ one CBE ≈ sixteen Dual-Core Itanium 2 processors

SDK - Platforms
Linux
➢ RPM-based distribution (Fedora Core recommended)
➢ host platforms: x86, x86-64, PPC64, CELL BE
➢ CELL plug-ins for Eclipse available

SDK - Compilers
➢ PPU and SPU are different ISAs – different sets of compilers
➢ PPU: GCC, G++, GFORTRAN, XLC, XLC++, XLF (32-bit and 64-bit)
➢ SPU: GCC, G++, XLC, XLC++ (32-bit)
➢ assembler and linker are common to the GNU and XL compilers
➢ XL compilers require the GNU tool chain for cross-assembling and cross-linking
➢ OpenMP support for both the PPE and the SPE

SDK - Samples
➢ /opt/ibm/cell-sdk/prototype
  ➢ /lib
  ➢ /samples
  ➢ /workloads
➢ examples include: FFT, overlays, matrix operations (MatMul), game math, simple DMA, audio resample, curves and surfaces, software-managed cache, ALF usage, tutorial samples, .....

Compilation and Linking
➢ SPU objects are embedded in PPU objects
➢ SPU code and PPU code are linked into one executable
➢ SDK provides a standard makefile structure (/opt/ibm/cell-sdk/prototype)
  ➢ make.env – compilation options
  ➢ make.footer – build rules – do not modify
  ➢ make.header – definitions – do not modify
  ➢ README_build_env.txt – makefile how-to
➢ to understand the build process, run the default makefile and see what it does

Hello World - libspe2
➢ spe_context_run() is a blocking call
➢ create one POSIX thread for each SPE thread
➢ SPE-side and PPE-side code (a minimal sketch is given below)
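To make the last slide concrete, here is a minimal sketch of the libspe2 pattern it describes: because spe_context_run() blocks, the PPE program spawns one POSIX thread per SPE context. The single-SPE setup, the file names, and the embedded program-handle name hello_spu are assumptions for the example, not the SDK's own sample code.

    /* SPU side (hello_spu.c) -- compiled with spu-gcc and embedded
       into the PPU object, e.g. with the SDK's embedspu tool. */
    #include <stdio.h>

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        printf("Hello from SPE 0x%llx\n", speid);
        return 0;
    }

    /* PPE side (hello_ppu.c) -- compiled with ppu-gcc, linked with -lspe2 -lpthread. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <libspe2.h>

    extern spe_program_handle_t hello_spu;   /* embedded SPU program (handle name assumed) */

    static void *spe_thread(void *arg)
    {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        /* spe_context_run() blocks until the SPE program exits,
           which is why it runs in its own POSIX thread */
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
            perror("spe_context_run");
            exit(1);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t thread;
        spe_context_ptr_t ctx = spe_context_create(0, NULL);   /* one context per SPE */

        spe_program_load(ctx, &hello_spu);                     /* load the embedded SPU image */
        pthread_create(&thread, NULL, spe_thread, ctx);        /* one thread per SPE thread */
        pthread_join(thread, NULL);                            /* wait for the SPE to finish */

        spe_context_destroy(ctx);
        return 0;
    }

Running several SPEs concurrently repeats the same pattern: create one context per SPE (up to 8 on a blade, 6 on a PlayStation 3), spawn one thread per context, and join them all.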