
Sony / Toshiba / IBM (STI)

Scientific Computing for Engineers: Spring 2008
Nec Hercules Contra Plures ("not even Hercules can fight against many")

➢ A chip's performance is related to its cross section: performance grows roughly as the square root of its area (Pollack's Rule), so two half-size cores in the same area deliver about √2 × the performance of one large core

➢ Cray's two strong oxen vs. Cray's 1,024 chickens

Three Performance-Limiting Walls

➢ Power Wall

Increasingly, performance is limited by achievable power dissipation rather than by the number of available integrated-circuit resources (transistors and wires). Thus, the only way to significantly increase the performance of microprocessors is to improve power efficiency at about the same rate as the performance increase.

➢ Frequency Wall

Conventional processors require increasingly deeper instruction pipelines to achieve higher operating frequencies. This technique has reached a point of diminishing returns, and even negative returns if power is taken into account.

➢ Memory Wall

On multi-gigahertz symmetric multiprocessors – even those with integrated memory controllers – latency to DRAM memory is currently approaching 1,000 cycles. As a result, program performance is dominated by the activity of moving data between main storage (the effective-address space that includes main memory) and the processor.

Conventional Processors Don't Cut It

... shallower pipelines with in-order execution have proven to be the most area and energy efficient. [...] we believe the efficient building blocks of future architectures are likely to be simple, modestly pipelined (5-9 stages) processors, floating point units, vector, and SIMD processing elements. Note that these constraints fly in the face of the conventional wisdom of simplifying parallel programming by using the largest processors available.

[...] David Patterson, [...] Kathy Yelick
The Landscape of Parallel Computing Research: A View from Berkeley
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

future architectures:
➢ shallow pipelines
➢ in-order execution
➢ SIMD units

Cache Memories Don't Cut It

When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. [...] Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available.

H. Peter Hofstee
Cell Broadband Engine Architecture from 20,000 feet
http://www-128.ibm.com/developerworks/power/library/pa-cbea.html

conventional processor:
➢ bucket brigade
➢ a hundred people
➢ a few buckets

Cache Memories Don't Cut It

Their (multicore) low cost does not guarantee their effective use in HPC. This relates back to the data-intensive nature of most HPC applications and the sharing of already limited bandwidth to memory. The Stream benchmark performance of Intel's new Woodcrest dual-core processor illustrates this point. [...] Much effort was put into improving Woodcrest's memory subsystem, which offers a total of over 21 GB/s on nodes with two sockets and four cores. Yet, four-threaded runs of the memory-intensive Stream benchmark on such nodes that I have seen extract no more than 35 percent of the available bandwidth from the Woodcrest's memory subsystem.

Richard B. Walsh
New Processor Options for HPC
http://www.hpcwire.com/hpc/906849.html

conventional processor:
➢ latency-limited bandwidth

Conventional Memories Don't Cut It

More flexible or even reconfigurable data coherency schemes will be needed to leverage the improved bandwidth and reduced latency. An example might be large, on-chip caches that can flexibly adapt between private or shared configurations. In addition, real-time embedded applications prefer more direct control over the memory hierarchy, and so could benefit from on-chip storage configured as software-managed scratchpad memory.

[...] David Patterson, [...] Kathy Yelick
The Landscape of Parallel Computing Research: A View from Berkeley
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

future architectures:
➢ reconfigurable coherency
➢ software-managed scratchpad memory

CELL

➢ multi-core
➢ in-order execution
➢ shallow pipeline
➢ SIMD
➢ scratchpad memory
➢ 204.8 GFLOPS single precision
➢ 204.8 GB/s internal bandwidth
➢ 25.6 GB/s memory bandwidth

➢ 3.2 GHz
➢ 90 nm SOI
➢ 234 million transistors, compared to:
➢ 165 million –
➢ 220 million – Itanium 2 (2002)
➢ 1,700 million – Dual-Core Itanium 2 (2006) – 12.8 GFLOPS (single or double)

CELL

➢ PPE – Power Processing Element
➢ SPE – Synergistic Processing Element
➢ SPU – Synergistic Processing Unit
➢ LS – Local Store
➢ MFC – Memory Flow Controller
➢ EIB – Element Interconnection Bus
➢ MIC – Memory Interface Controller

Power Processing Element

➢ Power Processing Element (PPE)
➢ PowerPC 970 architecture compliant
➢ 2-way Symmetric Multithreading (SMT)
➢ 32 KB Level 1 instruction cache
➢ 32 KB Level 1 data cache
➢ 512 KB Level 2 cache
➢ VMX (AltiVec) with 32 128-bit vector registers

standard FPU:
➢ fully pipelined DP with FMA
➢ 6.4 Gflop/s DP at 3.2 GHz

VMX (AltiVec):
➢ no DP
➢ 4-way fully pipelined SP with FMA
➢ 25.6 Gflop/s SP at 3.2 GHz

Synergistic Processing Elements

➢ Synergistic Processing Elements (SPEs)
➢ 128-bit SIMD
➢ 128 vector registers
➢ 256 KB instruction and data local memory
➢ Memory Flow Controller (MFC)

SIMD width:
➢ 16-way (8-bit integers)
➢ 8-way (16-bit integers)
➢ 4-way (32-bit integers, single-precision FP)
➢ 2-way (64-bit double-precision FP)

Peak per SPE:
➢ 25.6 Gflop/s SP at 3.2 GHz (fully pipelined)
➢ 1.8 Gflop/s DP at 3.2 GHz (7-cycle latency)
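To make the 4-way single-precision SIMD path concrete, here is a minimal sketch using the SPU C-language intrinsics from spu_intrinsics.h; the array names, the size N, and the function are invented for the example, and the code would be compiled with spu-gcc as part of an SPU program. Each spu_madd() is one fused multiply-add over a 128-bit vector, i.e. four SP elements and eight flops.

    #include <spu_intrinsics.h>

    #define N 1024                                /* example size, multiple of 4      */

    vector float a[N/4], b[N/4], c[N/4];          /* 128-bit vectors, 16-byte aligned */

    /* c = a * b + c, four single-precision lanes per instruction */
    void vmadd(void)
    {
        int i;
        for (i = 0; i < N/4; i++)
            c[i] = spu_madd(a[i], b[i], c[i]);
    }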

SPE

➢ SIMD architecture
➢ two in-order (dual-issue) pipelines
➢ large register file (128 128-bit registers)
➢ 256 KB of scratchpad memory (Local Store)
➢ Memory Flow Controller (MFC) to DMA code and data from system memory
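The MFC is driven from SPU code through channel commands; a minimal sketch of pulling a block from system memory into the Local Store and waiting for it might look like the following (the buffer name, tag number, and sizes are illustrative; spu_mfcio.h provides the mfc_* macros):

    #include <spu_mfcio.h>

    #define TAG 3                                           /* DMA tag id, 0-31   */

    static char buf[16384] __attribute__((aligned(128)));   /* Local Store target */

    /* Fetch 'size' bytes (max 16 KB per request) from effective address 'ea'
       in main storage into the Local Store, then block until the DMA completes. */
    void fetch(unsigned long long ea, unsigned int size)
    {
        mfc_get(buf, ea, size, TAG, 0, 0);    /* enqueue the get command        */
        mfc_write_tag_mask(1 << TAG);         /* select which tag(s) to wait on */
        mfc_read_tag_status_all();            /* stall until the DMA completes  */
    }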

Element Interconnection Bus

➢ Element Interconnection Bus (EIB)
➢ four 16-byte-wide unidirectional channels
➢ runs at half the system clock (1.6 GHz)
➢ 204.8 GB/s bandwidth (subject to arbitration)

EIB

➢ 16 channels
➢ 4 unidirectional rings
➢ token-based arbitration
➢ half the system clock (1.6 GHz)

Main Memory System

➢ Memory Interface Controller (MIC)
➢ dual external XDR channels
➢ 3.2 GHz max effective frequency (max 400 MHz, Octal Data Rate)
➢ each channel: 8 banks, max 256 MB
➢ total: 16 banks, max 512 MB
➢ 25.6 GB/s

CELL Performance – Double Precision

In double precision:
➢ every seven cycles each SPE can:
➢ process a two-element vector,
➢ perform two operations (FMA) on each element.
➢ in one cycle the FPU on the PPE can:
➢ process one element,
➢ perform two operations (FMA) on the element.

SPEs: 8 x 2 x 2 x 3.2 GHz / 7 = 14.63 Gflop/s
PPE: 2 x 3.2 GHz = 6.4 Gflop/s
Total: 21.03 Gflop/s

CELL Performance – Single Precision

In single precision:
➢ in one cycle each SPE can:
➢ process a four-element vector,
➢ perform two operations (FMA) on each element.
➢ in one cycle the VMX on the PPE can:
➢ process a four-element vector,
➢ perform two operations (FMA) on each element.

SPEs: 8 x 4 x 2 x 3.2 GHz = 204.8 Gflop/s
PPE: 4 x 2 x 3.2 GHz = 25.6 Gflop/s
Total: 230.4 Gflop/s

CELL Performance – Bandwidth

Bandwidth at the 3.2 GHz clock:
➢ each SPU – 25.6 GB/s (compare to 25.6 Gflop/s per SPU)
➢ main memory – 25.6 GB/s
➢ EIB – 204.8 GB/s (compare to 204.8 Gflop/s for 8 SPUs)

CELL Performance – Historical Perspective

Connection Machine CM-5 (512 CPUs):
➢ 512 x 128 Mflop/s ≈ 65 Gflop/s DP

PlayStation 3 (4 units):
➢ 4 x 17 Gflop/s ≈ 68 Gflop/s DP

Performance Comparison – Double Precision

1.6 GHz Dual-Core Itanium 2:
➢ 1.6 GHz x 4 flops/cycle x 2 cores = 12.8 Gflop/s

3.2 GHz CELL BE (SPEs only):
➢ 3.2 GHz x 8 SPEs x 4 flops / 7 cycles ≈ 14.6 Gflop/s

Performance Comparison – Single Precision

1.6 GHz Dual-Core Itanium 2:
➢ 1.6 GHz x 4 flops/cycle x 2 cores = 12.8 Gflop/s

3.2 GHz SPE:
➢ 3.2 GHz x 8 flops/cycle = 25.6 Gflop/s
➢ one SPE ≈ 2 Dual-Core Itanium 2s

3.2 GHz CELL BE (SPEs only):
➢ 3.2 GHz x 8 flops/cycle x 8 SPEs = 204.8 Gflop/s
➢ one CBE ≈ 16 Dual-Core Itanium 2s

SDK - Platforms

➢ Linux, RPM-based distribution (Fedora Core recommended)
➢ host platforms: x86, x86-64, CELL BE
➢ CELL plugins for Eclipse available

SDK - Compilers

➢ PPU and SPU are different ISAs – different sets of compilers

➢ PPU compilers: GCC, G++, GFORTRAN, XLC, XLC++, XLF (32-bit and 64-bit)
➢ SPU compilers: GCC, G++, XLC, XLC++ (32-bit)
➢ assembler and linker are common to the GNU and XL compilers
➢ XL compilers require the GNU tool chain for cross-assembling and cross-linking
➢ OpenMP support for both the PPE and the SPE

SDK - Samples

➢ /opt/ibm/cell-sdk/prototype

➢ /lib – FFT, matrix, game math, audio resample, .....
➢ /samples – overlays, using ALF, simple DMA, tutorial samples, software managed cache, .....
➢ /workloads – FFT, MatMul, curves and surfaces, .....

Compilation and Linking

➢ SPU objects are embedded in PPU objects
➢ SPU code and PPU code are linked into one executable
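By hand, the embedding step looks roughly like the commands below (file and symbol names are made up for the example; the SDK makefiles run the equivalent steps automatically):

    spu-gcc -O3 -o hello_spu hello_spu.c                                  # build the SPU program
    ppu-embedspu hello_spu hello_spu hello_spu-embed.o                    # wrap it in a PPU object
    ppu-gcc -O3 -o hello hello_ppu.c hello_spu-embed.o -lspe2 -lpthread   # link the PPU executable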

➢ SDK provides a standard makefile structure: /opt/ibm/cell-sdk/prototype
➢ make.env – compilation options
➢ make.footer – build rules – do not modify
➢ make.header – definitions – do not modify
➢ README_build_env.txt – makefile howto

➢ to understand the build process, run the default makefile and see what it does
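As an illustration only (the directory layout and variable names below are recalled from the SDK samples and should be checked against README_build_env.txt), a small project following this structure might use Makefiles along these lines:

    # spu/Makefile – builds the SPU program and an embeddable library
    PROGRAMS_spu  := hello_spu
    LIBRARY_embed := lib_hello_spu.a
    include /opt/ibm/cell-sdk/prototype/make.footer

    # Makefile (top level) – builds the PPU executable and links in the SPU code
    DIRS          := spu
    PROGRAM_ppu   := hello
    IMPORTS       := spu/lib_hello_spu.a -lspe2 -lpthread
    include /opt/ibm/cell-sdk/prototype/make.footer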


Hello World - libspe2

➢ spe_context_run() is a blocking call
➢ create one POSIX thread for each SPE thread

SPE:

int main()
{
    // .....
}

PPE:

#include <libspe2.h>
#include <pthread.h>

void* spe_thread()
{
    spe_image_open()
    spe_program_load()
    spe_context_run()
    pthread_exit()
}

int main()
{
    spe_context_create()
    pthread_create()
    pthread_join()
    spe_context_destroy()
}
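Filled in, a minimal compilable PPU side of that skeleton could look like the listing below (single SPE, no error checking; "hello_spu" is an example name for the SPU ELF file opened at run time):

    /* PPU program: one POSIX thread drives one SPE context. */
    #include <pthread.h>
    #include <libspe2.h>

    static void *spe_thread(void *arg)
    {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        spe_program_handle_t *prog;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        prog = spe_image_open("hello_spu");                 /* open the SPU ELF image     */
        spe_program_load(ctx, prog);                        /* copy it into the LS        */
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);  /* blocks until the SPU stops */
        spe_image_close(prog);
        pthread_exit(NULL);
    }

    int main(void)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        pthread_t tid;

        pthread_create(&tid, NULL, spe_thread, ctx);
        pthread_join(tid, NULL);
        spe_context_destroy(ctx);
        return 0;
    }

The SPU side is an ordinary int main() (for example, a single printf) compiled with spu-gcc; spe_context_run() returns once that main() exits.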

SPU Context Switching

SPU context save sequence:
➢ quiesce the SPE
➢ save privileged and problem state to the CSA
➢ save the low 16 KB of LS to the CSA
➢ load and start the SPE context-save sequence
➢ save GPRs and channel state to the LSCSA
➢ save 240 KB of LS to the CSA
➢ save the LSCSA to the CSA

SPU context restore sequence:
➢ harvest (reset) an SPE
➢ load and start the SPE context-restore sequence
➢ copy the LSCSA from the CSA to LS
➢ restore 240 KB of LS from the CSA
➢ restore GPRs and channel state from the LSCSA
➢ restore privileged state from the CSA
➢ restore the remaining problem state from the CSA
➢ restore 16 KB of LS from the CSA

CELL Programming Models / Environments

Gedae – commercial product

CELL Resources

➢ developerWorks CELL Resource Center – http://www.ibm.com/developerworks/power/cell/
➢ Barcelona Supercomputing Center (Computer Sciences work on CELL) – http://www.bsc.es/
➢ Power.org CELL Developer Corner – http://www.power.org/resources/devcorner/cellcorner/
➢ The CELL Project at IBM Research – http://www.research.ibm.com/cell/
➢ CELL BE at IBM alphaWorks – http://www.alphaworks.ibm.com/topics/cell/
➢ GA Tech CELL Workshop – http://www.cc.gatech.edu/~bader/CellProgramming.html
➢ ICL CELL Summit – http://www.cs.utk.edu/~dongarra/cell2006/
➢ CellPerformance – http://www.cellperformance.com/

➢ CELL Broadband Engine Programming Handbook
➢ CELL Broadband Engine Programming Tutorial
➢ SPE Runtime Management Library
➢ C/C++ Language Extensions for CELL Broadband Engine Architecture
➢ SPU Assembly Language Specification
➢ Synergistic Processor Unit Instruction Set Architecture
