
IBM Systems & Technology Group

Introduction to IBM Cell/B.E. SDK v3.1 Programming: IBM PowerXcell 8i / QS22

PRACE Winter School 10-13 February 2009, Athens, Greece

Objectives

• Introduce you to …

– Cell Software Development Kit (SDK) for Multicore Acceleration Version 3.1
– Programming the Cell/B.E. (libSPE2, MFC, SIMD, …)
– Programming Models: DaCS, ALF, OpenMP
– Programming Tips & Tricks
– Performance tools

Trademarks – Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc.

Cell/B.E. Programming Approaches are Fully Customizable!

Three approaches, ordered by decreasing programmer control over Cell/B.E. resources and decreasing attention to architectural details:

1. “Native” Programming → HW resources (intrinsics, DMA, etc.); highest programming effort
2. Assisted Programming → libraries, frameworks
3. Development Tools → user tool-driven, hardware abstraction; lowest programming effort

Advantages
• “Native” Programming: best performance possible; best use of HW resources
• Assisted Programming: greatly reduced development time over “Native” Programming; still allows some custom use of “Native” resources
• Development Tools: minimum development time required; some degree of platform independence

Limitations
• “Native” Programming: requires the most coding work of the three options; requires the highest level of “Native” expertise
• Assisted Programming: performance gains may not be as great as with “Native” Programming; confined to the limitations of the frameworks and libraries chosen
• Development Tools: performance gains may not be as great as with “Native” Programming; the CASE tool determines debugging capabilities and platform support choices

Where it is most useful
• “Native” Programming: embedded hardware / real-time applications; hardware resources / power / space / cost are at a premium
• Assisted Programming: the vast majority of all applications
• Development Tools: simultaneous deployments across multiple hardware architectures; the programmer pool / skill base is restricted to high-level skills


Cell Multi-core Programming Effort Roadmap

Requires mostly the same effort to port to any multi-core architecture.

1. Port app to Power/Linux, run on PPE
2. Begin moving functions to SPEs (porting & optimizing)
3. Optimize functions on SPEs
   • Local store management
   • Exploit parallelism at the task level
4. Tune SIMD, if needed
   • Exploit parallelism at the instruction / data level
   • Data and instruction locality tuning

Writing for Cell BE speeds up code on all multi-core architectures because it uses the same parallel best practices – the Cell architecture just gains more from them because of its design.

Objectives

• Introduce you to …

– Cell Software Development Kit (SDK) for Multicore Acceleration Version 3.1
– Programming the Cell/B.E. (libSPE2, MFC, SIMD, …)
– Programming Models: DaCS, ALF, OpenMP
– Programming Tips & Tricks
– Performance tools



IBM SDK for Multicore Acceleration and related tools

The IBM SDK is a complete tools package that simplifies programming for the Cell Broadband Engine Architecture.

• Eclipse-based IDE

• Simulator
• IBM XL C/C++ compiler* – optimized compiler for creating Cell/B.E.-optimized applications. Offers:
  – improved performance
  – automatic overlay support
  – SPE code generation
  (*The XLC compiler is a complementary product to the SDK)
• Performance tools
• GNU tool chain
• Libraries and frameworks:
  – Data Communication and Synchronization (DaCS)
  – Accelerated Library Framework (ALF)
  – Basic Linear Algebra Subroutines (BLAS)
  – Standardized SIMD math libraries

(Figure legend: denotes the software components included in the SDK for Multicore Acceleration.)


Cell BE SDK for Multicore Acceleration v3.1
• Overview
• Runtime Environment
• Program Development Tools
• Programming Models
• Development Libraries
• Performance Tools





IBM Cell SW Environment

• Development environment: IDE – Integrated Development Environment; VPA – Visual Performance Analyzer; PTP – Parallel Tools Platform
• Programming environment: ALF, DaCS
• Performance tools: spu_timing, asmVis, PDT/PDTR, FDPR-Pro, Cell Perf Counter, OProfile, Code Analyzer
• Application libraries: SIMD math, MASS/MASSV, crypto, Monte Carlo RNG, FFT, BLAS, LAPACK
• gdb – combined debugger
• Examples, demos, benchmarks

• Compilers: GNU C/C++, Fortran, Ada; XL C/C++, Fortran; single-source compiler
• SPE Runtime Management Library (libspe2)
• SPU system library (C99/POSIX, __ea cache, spu_timers)
• Enhanced Linux – RHEL 5.2/5.3, Fedora 9

• GNU binutils
• Hardware – QS21, QS22, Soma CAB
• IBM Full System Simulator

STANDARDS – HW (CBEA), SW (ABI, Language, Assembly, SIMD math, libspe2)


Cell BE SDK for Multicore Acceleration v3.1
• Overview
• Runtime Environment
  – Linux Kernel
  – SPE Runtime Management Library
  – System Simulator
• Program Development Tools
• Programming Models
• Development Libraries
• Performance Tools


Linux Kernel

• Fedora 9
  – Patches made to the Linux 2.6.25 kernel provide the services required to support the Cell BE hardware facilities
  – Patches and pre-built kernel binaries are distributed by the Barcelona Supercomputing Center (BSC-CNS): http://www.bsc.es/projects/deepcomputing/linuxoncell

• RHEL 5.2/5.3
  – Patches are included in the kernel distribution.

• For the QS21/QS22:
  – the kernel is installed into the /boot directory
  – yaboot.conf is modified
  – a reboot is needed to activate this kernel


SPE Runtime Management Library

• The SPE runtime management library (libspe2) contains an SPE programming model for Cell BE applications
• It is used to control SPE program execution from the PPE program
• Handles SPEs as virtual objects called SPE contexts
  – SPE programs can be loaded and executed by operating on SPE contexts
• Licensed under the GNU LGPL
• Fedora 9
  – Packages available from the Barcelona Supercomputing Center (BSC-CNS): http://www.bsc.es/plantillaH.php?cat_id=581
• RHEL 5.2/5.3
  – Packages available on the RHEL extras ISO image.


IBM Full-System Simulator

• Emulates the behavior of a full system that contains a Cell BE processor.
• Can start Linux on the simulator and run applications on the simulated system.
• Supports the loading and running of statically-linked executable programs and standalone tests without an underlying operating system.
• Simulation models:
  – Functional-only simulation: models the program-visible effects of instructions without modeling the time it takes to run these instructions.
    ⇒ For code development and debugging.

  – Performance simulation: models internal policies and mechanisms for system components, such as arbiters, queues, and pipelines. Operation latencies are modeled dynamically to account for both processing time and resource constraints.
    ⇒ For system and application performance analysis.

Simulator Structure and Windows

(Figure: the simulator presents a command window (systemsim% prompt), a GUI window, and a console window ([user@bringup /]# prompt). The stack: a hosting environment (base processor running the base simulator on a Linux operating system) hosts the IBM Full System Simulator, which presents a simulated Cell machine running Linux on the simulated system.)


Cell BE SDK for Multicore Acceleration v3.1
• Overview
• Runtime Environment
• Program Development Tools
  – gcc and GNU Toolchain
  – XL C/C++ Compilers
  – Eclipse IDE
• Programming Models
• Development Libraries
• Performance Tools


GNU Toolchain

• Contains the GCC compiler for the PPU and the SPU.
  – ppu-gcc, ppu-g++, ppu32-gcc, ppu32-g++, spu-gcc, spu-g++
  – For the PPU, GCC replaces the native GCC on PPC platforms and is a cross-compiler on x86. The GCC for the PPU is preferred, and the makefiles are configured to use it when building the libraries and samples.
  – For the SPU, GCC contains a separate SPE cross-compiler that supports the standards defined in the following documents:
    • C/C++ Language Extensions for Cell BE Architecture V2.4
    • SPU Application Binary Interface (ABI) Specification V1.7
    • SPU Instruction Set Architecture V1.2
• Supports additional languages
  – GNU Fortran for PPE and SPE – no SPE-specific extensions (e.g. intrinsics)
  – GNU Ada for PPE only
• The assembler and linker are common to both the GCC and XL C/C++ compilers.
  – ppu-ld, ppu-as, spu-ld, spu-as
  – The GCC-associated assembler and linker additionally support the SPU Assembly Language Specification V1.5.
• GDB support is provided for both PPU and SPU debugging; the debugger client can be in the same process or a remote process.
• GDB also supports combined (PPU and SPU) debugging.
  – ppu-gdb, ppu-gdbserver, ppu32-gdbserver

XL C/C++/Fortran Compilers

• IBM XL C/C++ for Multicore Acceleration for Linux, V10.1 (single/dual-source compiler)
  – Product quality and support
  – ppuxlc, ppuxlc++, spuxlc, spuxlc++
  – Performance improvements in auto-SIMD
  – Improved diagnostic capabilities for detecting SIMD opportunities (-qreport)
  – Enablement of high optimization levels (-O4, -O5) on the SPE
  – Automatic generation of code overlays
  – Auto-SIMDization is enabled at -O3 -qhot, or at -O4 and -O5, by default
  – Allows the programmer to use OpenMP directives to specify parallelism on PPE and SPE (see the sketch below)
    • cbexlc, cbexlc++, cbexlC
    • The compiler hides the complexity of DMA transfers, code partitioning, overlays, etc. from the programmer

• IBM XL Fortran for Multicore Acceleration for Linux, V11.1 (dual-source compiler)
  – Optimized Fortran code generation for PPE and SPE
  – Support for the Fortran 77, 90 and 95 standards as well as many features from the Fortran 2003 standard
  – Auto-SIMD optimizations
  – Automatic generation of code overlays
  – Initial support for SPU intrinsics; no libspe2 Fortran wrapper.
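The OpenMP support mentioned above can look as simple as the following minimal sketch, assuming the single-source compiler (cbexlc, typically with -qsmp); the function and variable names are illustrative, and the code itself is plain OpenMP C with nothing Cell-specific:

    #include <omp.h>

    /* The single-source compiler can spread these iterations across the
       PPE and SPEs; the DMA of a[] and b[] slices is generated for you. */
    void scale(const float *a, float *b, int n, float s)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            b[i] = s * a[i];
    }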

Eclipse Cell IDE Key Features

• Cell C/C++ PPE/SPE managed-make project support
• A C/C++ editor that supports syntax highlighting; a customizable template; and an outline window view for procedures, variables, declarations, and functions that appear in the source code
• Fully configurable build properties
• A rich C/C++ source-level PPE and/or SPE GDB debugger integrated into Eclipse
• Seamless integration of the Cell BE Simulator into Eclipse
• Automatic makefile generator, builder, performance tools, and several other enhancements
• Supported development platforms: x86, x86_64, PowerPC, Cell
• Supported target platforms:
  – Local Cell Simulator
  – Remote Cell Simulator
  – Remote native Cell blade
• Performance tools support
• Automatic embedSPU integration
• ALF programming model support
• SOMA support
(Figure: the Cell IDE sits on top of CDT and Eclipse, integrating the Cell toolchain, simulator, and performance tools.)


Cell BE SDK for Multicore Acceleration v3.1
• Overview
• Runtime Environment
• Program Development Tools
• Programming Models
  – DaCS
  – ALF
• Development Libraries
• Performance Tools


Data Communications and Synchronization Library (DaCS)

• Unified hardware abstraction layer for hybrid systems
  – Consistent APIs at all levels of the system hierarchy
• Focuses on topology and process management, data communication, synchronization, and error handling
• Efficient data transfer to maximize available bandwidth and minimize inherent latency
• Allows a natural division of labor among the different types of processors in a hybrid system
  – Host elements – manage and supply data to the accelerator elements
  – Accelerator elements – process computationally intensive workloads
• Focuses on point-to-point, one-way rDMA communication operations
• Manages endian conversion (needed for Hybrid)
• Simple, low-level, flexible
  – Based on a windows-and-channels architecture

Accelerated Library Framework (ALF) Overview

• Aims at workloads that are highly parallelizable – e.g. ray casting, FFT, Monte Carlo, video codecs
• Provides a simple user-level programming framework for Cell library developers that can be extended to other hybrid systems
• Division-of-labor approach
  – ALF provides wrappers for computational kernels
  – Frees programmers from writing their own architecture-dependent code, including data transfer, task management, double buffering, and data communication
• Manages data partitioning
  – Provides efficient scatter/gather implementations via CBE DMA
  – Extensible to a variety of data partitioning patterns
  – Host and accelerator describe the scatter/gather operations
  – Accelerators gather the input data from, and scatter the output data to, the host's memory
  – Manages input/output buffers to/from the SPEs
• Remote error handling
• Utilizes the DaCS library for some low-level operations (on Hybrid)


Cell BE SDK for Multicore Acceleration v3.1
• Overview
• Runtime Environment
• Program Development Tools
• Programming Models
• Development Libraries
  – SIMD Math / MASS / BLAS / LAPACK / FFT
  – Monte Carlo Random Number Generator
  – SPU Timer
• Performance Tools


SIMD Math Library

• Completed the implementation of the JSRE SIMD Math Library by adding:
  – tgammaf4 (PPU/SPU), tgammad2 (SPU)
  – lgammaf4 (PPU/SPU)
  – erff4 (PPU/SPU), erfcf4 (PPU/SPU)
  – fpclassifyf4 (PPU/SPU), fpclassifyd2 (SPU)
  – nextafterf4 (PPU/SPU), nextafterd2 (SPU)
  – modff4 (PPU/SPU)
  – lldivi2 (SPU), lldivu2 (SPU)
  – iroundf4 (PPU/SPU), irintfr (PPU/SPU)
  – log1pf4 (PPU/SPU), log1pd2 (SPU)
  – expm1f4 (PPU/SPU), expm1d2 (SPU)
  – hypotf4 (PPU/SPU)
  – sincosf4 (PPU/SPU), sincosd2 (SPU)
  – tanhf4 (PPU/SPU), tanhd2 (SPU)
  – atand2dp
  – acoshf4 (PPU/SPU), acoshd2 (SPU)
  – asinhf4 (PPU/SPU), asinhd2 (SPU)
  – atanhf4 (PPU/SPU), atanhd2 (SPU)
  – atan2f4 (PPU/SPU), atan2d2 (SPU)
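As a quick illustration of how these are called, a hedged sketch using two of the functions above (simdmath.h per the JSRE convention; softplus4 is an illustrative name, not part of the library):

    #include <simdmath.h>

    /* Four single-precision softplus values, log(1 + e^x), per call;
       log1pf4 computes log(1 + y) elementwise. */
    vector float softplus4(vector float x)
    {
        return log1pf4(expf4(x));
    }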

MASS and MASS/V Library

• Mathematical Acceleration SubSystem
  – High-performance alternative to standard system math libraries (e.g. libm, SIMD math)
  – Versions exist for PPU and SPU
  – Up to 23x faster than libm functions
• PPU MASS
  – 57 scalar functions, 60 vector functions
  – Both single and double precision
• SPU MASS
  – SDK 2.1 contained 28 SIMD functions and 28 vector functions (SP only)
  – Expanded SPU MASS to include all single-precision functions in PPU MASS
  – Added 8 new SP functions: erf, erfc, expm1, hypot, lgamma, log1p, vpopcnt4, vpopcnt8
  – Improved tuning of existing functions

Basic Linear Algebra Subprograms (BLAS) Library

• BLAS on PPU
  – The BLAS APIs are available as standard ANSI C and standard FORTRAN 77/90 interfaces
  – Conforms to the standard BLAS interface – easy port of existing applications
  – Selected routines are optimized to utilize the SPUs
  – Only real single-precision and real double-precision versions are supported; complex versions are not supported
• BLAS on SPU
  – Offers SPU kernel routines to SPU applications
    • Underlying functionality implemented on the SPE
    • Operates on data (input/output) residing in the local store
  – Similar to the corresponding PPU routines, but not conforming to the standard APIs
• Routines optimized to use the SPEs:
  – BLAS Level 1: SSCAL, DSCAL, SCOPY, DCOPY, ISAMAX, IDAMAX, SAXPY, DAXPY, SDOT, DDOT, DASUM, DNRM2, DROT
  – BLAS Level 2: SGEMV, DGEMV, STRMV, DTRMV, STRSV, DTRSV, DGBMV, SGER, DGER, DSYMV, DTBMV, DSYR
  – BLAS Level 3: SGEMM, DGEMM, SSYRK, DSYRK, STRSM, DTRSM, STRMM, DTRMM, SSYMM, DSYMM, SSYR2K, DSYR2K
• Focus on single-precision optimization
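Because the library conforms to the standard interface, existing BLAS call sites work unchanged. A hedged sketch through the usual FORTRAN-style binding (the sgemm_ symbol and prototype follow the common BLAS convention; the SDK's exact header is not shown on the slide):

    /* Standard FORTRAN-style SGEMM binding (column-major, args by reference) */
    extern void sgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const float *alpha, const float *a, const int *lda,
                       const float *b, const int *ldb,
                       const float *beta, float *c, const int *ldc);

    /* C = A * B for n x n matrices; the SPE-optimized SGEMM does the work. */
    void matmul(const float *a, const float *b, float *c, int n)
    {
        const char no = 'N';
        const float one = 1.0f, zero = 0.0f;
        sgemm_(&no, &no, &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
    }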

Linear Algebra PACKage (LAPACK) Library

• LAPACK on PPU
  – All LAPACK routines are available for use in the SDK
  – These routines include real single-precision, real double-precision, complex single-precision, and complex double-precision versions.

• Routines optimized to use the SPEs:
  – DGETRF – computes the LU factorization of a general matrix
  – DGETRS – linear equation solver
  – DGETRI – computes the inverse of a general matrix using an LU factorization
  – DGEQRF – computes the QR factorization of a general matrix
  – DGELQF – computes the LQ factorization of a general matrix
  – DPOTRF – computes the Cholesky factorization of a symmetric positive definite matrix
  – DPOTRS – linear equation (symmetric positive definite) solver
  – DBDSQR – computes the singular value decomposition of a real bidiagonal matrix using an implicit zero-shift QR algorithm
  – DSTEQR – computes the eigenvalues and eigenvectors of a real symmetric tridiagonal matrix using an implicit QR algorithm
  – DGESVD – computes the singular value decomposition (SVD) of a real matrix using an implicit zero-shift QR algorithm, optionally computing the left and/or right singular vectors
  – DGESDD – computes the singular value decomposition (SVD) of a real matrix using a divide-and-conquer algorithm, optionally computing the left and/or right singular vectors.

FFT Library

• 1D, 2D square, 2D rectangular, 3D cube, 3D rectangular
• Integer, single precision, double precision
• Complex-to-real (C2R), real-to-complex (R2C), complex-to-complex (C2C)
• Row size is a power of 2, a power of low primes, or a factorization of the row size that includes large primes
• SPU, BE, PPU
• In place, out of place
• Forward, inverse

Supported configurations (C2R / R2C / C2C, for power-of-2 and low-primes row sizes):
• 1D single: 2 to 8192 (SPU) and 2 to 8192 (BE) for all six combinations; out of place; C2R and R2C forward only, C2C forward/inverse
• 2D square single: 32 to 2048 (BE), in/out of place, forward/inverse; also 2 to 2048 (BE), out of place, forward
• 2D rectangular single: 2 to 2048 (BE), out of place, forward
• 2D rectangular double: 32 to 2048 (BE), in/out of place, forward/inverse


Monte Carlo Random Number Generator Library

Types of Random Number Generators

– True (hardware random number generator – HW RNG)
  • Samples hardware on the Cell Broadband Engine to generate its values
  • Represents the closest interface to being truly random
  • No seed value is required, and the resulting sequence does not have a predictable pattern
  • These random numbers are not repeatable

– Quasi
  • Repeatable with the same seed
  • Attempts to uniformly fill n-dimensional space
  • Sobol

– Pseudo
  • Repeatable with the same seed
  • Kirkpatrick-Stoll


SPU Timer Library

• Provides virtual clock and timer services for SPU programs

• Virtual clock
  – Software-managed 64-bit timebase counter
  – Built on top of the 32-bit decrementer register
  – Can be used for high-resolution time measurements (see the sketch after this list)

• Virtual timers
  – Interval timers built on the virtual clock
  – A user-registered handler is called at the requested interval
  – Can be used for statistical profiling
  – Up to 4 timers can be active simultaneously, with different intervals
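For context, a hedged sketch of timing a section with the raw 32-bit decrementer that the library's 64-bit virtual clock wraps (channel accessors from spu_mfcio.h; the measured section is illustrative):

    #include <spu_mfcio.h>

    unsigned int ticks_elapsed(void)
    {
        unsigned int start, end;

        spu_write_decrementer(0xFFFFFFFFu);  /* load the decrementer */
        start = spu_read_decrementer();

        /* ... code section to be measured ... */

        end = spu_read_decrementer();
        return start - end;                  /* the decrementer counts down */
    }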


Cell BE SDK for Multicore Acceleration v3.1
• Overview
• Runtime Environment
• Program Development Tools
• Programming Models
• Development Libraries
• Performance Tools
  – Static Analysis
  – Dynamic Analysis


Performance Tools – Static Analysis

• FDPR-Pro
  – Performs global optimization on the entire executable
• SPU Timing
  – Analysis of SPE instruction scheduling

Performance Tools – Dynamic Analysis

• Performance Debugging Tool (PDT)
  – Generalized tracing facility; instrumentation of DaCS and ALF
  – Hybrid system support (e.g. PDT on the host blade)
• PDTR: PDT post-processor
  – Post-processes PDT traces
  – Provides analysis and summary reports (lock analysis, DMA analysis, etc.)
• OProfile
  – PPU time and event profiling
  – SPU time profiling
• Hardware performance monitoring
  – Collects performance monitoring events
  – Perfmon2 support and enablement for PAPI, etc.
• Hybrid system performance monitoring and tracing facility
  – Launch, activate, and dynamically configure tools on Cell blades and the Opteron host blade
  – Synchronize, merge, and aggregate traces
• Visual Performance Analyzer (from alphaWorks)

Objectives

• Introduce you to …

– Cell Software Development Kit (SDK) for Multicore Acceleration Version 3.1
– Programming the Cell/B.E. (libSPE2, MFC, SIMD, …)
– Programming Models: DaCS, ALF, OpenMP
– Programming Tips & Tricks
– Performance tools



Cell Multi-core Programming Effort Roadmap

Requires mostly the same effort to port to any multi-core architecture.

1. Port app to Power/Linux, run on PPE
2. Begin moving functions to SPEs (porting & optimizing)
3. Optimize functions on SPEs
   • Local store management
   • Exploit parallelism at the task level
4. Tune SIMD, if needed
   • Exploit parallelism at the instruction / data level
   • Data and instruction locality tuning

Writing for Cell BE speeds up code on all multi-core architectures because it uses the same parallel best practices – the Cell architecture just gains more from them because of its design.

Basic Programming Concepts

• The PPE is just a PowerPC running Linux.
  – No special programming techniques or compilers are needed.
• The PPE manages SPE processes as POSIX pthreads.
• A provided library (libspe2) handles SPE process management within the threads.
• Compiler tools embed SPE executables into PPE executables: one file provides instructions for all execution units.

PPE core
  – VMX unit
  – 32 KB L1 caches
  – 512 KB L2 cache
  – 2-way SMT

SPE core
  – 128-bit SIMD instruction set
  – Register file – 128 x 128-bit
  – Local store – 256 KB
  – MFC
  – Isolation mode


PPE Programming Environment

• PPE runs PowerPC applications and the operating system
• Handles thread allocation and resource management among SPEs
• The PPE's Linux kernel controls the SPUs' execution of programs
  – Schedules SPE execution independently from regular Linux threads
  – Responsible for runtime loading, passing parameters to SPE programs, notification of SPE events and errors, and debugger support
• The PPE's Linux kernel manages virtual memory, including mapping each SPE's local store (LS) and problem state (PS) into the effective-address space
• The kernel also controls virtual-memory mapping of MFC resources, as well as MFC segment-fault and page-fault handling

SPE Programming Environment

• Each SPE has a SIMD instruction set, 128 vector registers, two in-order execution units, and no operating system
• Data must be moved between main memory and the 256 KB of SPE local store with explicit DMA commands
• Standard compilers are provided
  – GNU and XL compilers; C, C++, and Fortran
  – They will compile scalar code into the SIMD-only SPE instruction set
  – Language extensions provide SIMD types and instructions (see the sketch below)
⇒ The programmer must handle:
  – A set of processors with varied strengths and unequal access to data and communication
  – Data layout and SIMD instructions to maximize SIMD utilization
  – Local store management (data locality, and overlapping communication with computation)
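A minimal sketch of those SIMD language extensions, assuming spu-gcc or spuxlc (the function name is illustrative):

    #include <spu_intrinsics.h>

    /* One call computes four single-precision fused multiply-adds:
       result[i] = a[i] * b[i] + c[i] for i = 0..3. */
    vector float madd4(vector float a, vector float b, vector float c)
    {
        return spu_madd(a, b, c);
    }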

SPE Local Store

• 256 KB, ECC-protected, single-ported, non-caching memory
• Holds instructions and data
• Filled with instructions and data using DMA transfers initiated from SPU or PPE software
• DMA operations are buffered and can access the LS at most once every eight cycles
• Instruction prefetches deliver at least 17 instructions sequentially from the branch target
► The impact of DMA operations on SPU loads, stores, and program-execution times is, by design, limited.


Communication between the PPE and SPEs

• The PPE communicates with the SPEs through MMIO registers supported by the MFC of each SPE
• There are three primary communication mechanisms between the PPE and SPEs:
  – Mailboxes
    • Queues for exchanging 32-bit messages
    • Two mailboxes (the SPU Write Outbound Mailbox and the SPU Write Outbound Interrupt Mailbox) are provided for sending messages from the SPE to the PPE
    • One mailbox (the SPU Read Inbound Mailbox) is provided for sending messages to the SPE
  – Signal notification registers
    • Each SPE has two 32-bit signal-notification registers; each has a corresponding memory-mapped I/O (MMIO) register into which the signal-notification data is written by the sending processor
    • Signal-notification channels, or signals, are inbound (to an SPE) registers
    • They can be used by other SPEs, the PPE, or other devices to send information, such as a buffer-completion synchronization flag, to an SPE (see the sketch below)
  – DMAs
    • To transfer data between main storage and the LS
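A hedged sketch of the signal-notification path (the value 0x1 and the variable names are illustrative; spe_ctx is an SPE context created as in the libspe2 examples later in this deck):

    /* PPE side (libspe2.h): write a 32-bit value to signal register 1 */
    spe_signal_write(spe_ctx, SPE_SIG_NOTIFY_REG_1, 0x1);

    /* SPU side (spu_mfcio.h): blocking read of signal-notification channel 1 */
    unsigned int flag = spu_read_signal1();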


Basic PPE and SPE Program Control and Data Flow

1) (PPE program) Loads the SPE program into the LS.
2) (PPE program) Instructs the SPE to execute the SPE program.
3) (SPE program) Transfers the required data from main memory to the LS.
4) (SPE program) Processes the received data in accordance with the requirements.
5) (SPE program) Transfers the processed result from the LS to main memory.
6) (SPE program) Notifies the PPE program of the termination of processing.

Source: PS3-Cell Programming Tutorial


Example of an SPE Executable Embedded in a PPE Program

PPU code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <libspe2.h>

    /* Declaring a reference to an SPU image */
    extern spe_program_handle_t hello_spu;

    int main(void)
    {
        . . . . .
        // Run the SPE context
        rc = spe_context_run(speid, &entry, 0, argp, envp, &stop_info);
        . . . . .
    }

SPU code:

    #include <stdio.h>

    /* SPE entry point */
    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        printf("Hello world!\n");
        return 0;
    }

Build Process

SPE code → SPE toolchain → SPE objects → embed-SPE utility → PPE object (embedded SPE code)
PPE code → PPE toolchain → PPE objects
PPE toolchain (linker): PPE objects + embedded SPE object → Cell BE executable
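A hedged sketch of this flow with the GNU toolchain for a 32-bit PPE build (file names are illustrative, and exact ppu-embedspu arguments can vary between SDK releases):

    spu-gcc hello_spu.c -o hello_spu                      # SPE toolchain
    ppu-embedspu hello_spu hello_spu hello_spu-embed.o    # embed, symbol 'hello_spu'
    ppu-gcc ppu_main.c hello_spu-embed.o -lspe2 -o hello  # PPE toolchain + link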


SPE Runtime Management Library v2.3


SPE Runtime Management Library v2.3

• The SPE runtime management library (libspe) contains an SPE thread programming model for Cell BE applications
• It is used to control SPE program execution from the PPE program
• Licensed under the GNU LGPL
• Its design is independent of the underlying operating system – it can run on a variety of OS platforms
• The Linux implementation uses the SPU File System (SPUFS) as the Linux kernel interface
• Handles SPEs as virtual objects called SPE contexts
  – SPE programs can be loaded and executed by operating on SPE contexts
• The elfspe utility enables direct SPE executable execution from a Linux shell, without the need for a PPE application creating an SPE thread

SPE Runtime Management Library v2.3 – Terminology

• SPE context:
  – Holds all persistent information about a "logical SPE" used by the application.
• Gang context:
  – Holds all persistent information about a group of SPE contexts that should be treated as a gang, that is, executed together with certain properties.
• Main thread:
  – The application's main thread. A typical Cell BE application consists of a main thread that creates as many SPE threads as needed and "orchestrates" them.
• SPE thread:
  – A regular thread executing on the PPE that actually runs an SPE context.
  – The API call spe_context_run is a synchronous, blocking call from the perspective of the thread using it; that is, while an SPE program is executed, the associated SPE thread blocks and is usually put to "sleep" by the operating system.
• SPE event:
  – A mechanism for asynchronous notification.
  – Typical usage: the main thread sets up an event handler to receive notification about certain events caused by the asynchronously running SPE threads.
  – The current library supports events to indicate that an SPE has stopped execution, that mailbox messages have been written or read by an SPE, and that PPE-initiated DMA operations have completed.

Note: a context's data structure should not be accessed directly; instead, the application uses a pointer to an SPE (or SPE gang) context as an identifier for the SPE (or gang) it is dealing with through libspe API calls.

SPE Runtime Management Library v2.3 – Functionality

• To provide this functionality, the SPE Runtime Management Library consists of various sets of PPE functions:
  1. A set of PPE functions to create and destroy SPE and gang contexts.
  2. A set of PPE functions to load SPE objects into SPE local store memory for execution.
  3. A set of PPE functions to start the execution of SPE programs and to obtain information about the reasons why an SPE has stopped running.
  4. A set of PPE functions to receive asynchronous events generated by an SPE.
  5. A set of PPE functions used to access the MFC (Memory Flow Control) problem-state facilities, which includes:
     a. MFC proxy command issue
     b. MFC proxy tag-group completion facility
     c. Mailbox facility
     d. SPE signal notification facility
  6. A set of PPE functions that enable direct application access to an SPE's local store and problem state areas.
  7. Means to register PPE-assisted library calls for an SPE program.

SPE Runtime Management Library v2.3 – Function List

• SPE context creation: spe_context_create, spe_context_destroy, spe_gang_context_create, spe_gang_context_destroy
• SPE program image handling: spe_image_open, spe_image_close, spe_program_load
• SPE run control: spe_context_run, spe_stop_info_read
• SPE event handling: spe_event_handler_create, spe_event_handler_destroy, spe_event_handler_register, spe_event_handler_deregister, spe_event_wait
• SPE MFC proxy command issue: spe_mfcio_put, spe_mfcio_putb, spe_mfcio_putf, spe_mfcio_get, spe_mfcio_getb, spe_mfcio_getf
• SPE MFC proxy tag-group completion facility: spe_mfcio_tag_status_read
• Mailbox facility: spe_out_mbox_read, spe_out_mbox_status, spe_in_mbox_write, spe_in_mbox_status, spe_out_intr_mbox_read, spe_out_intr_mbox_status
• SPE SPU signal notification facility: spe_signal_write
• Direct SPE access for applications: spe_ls_area_get, spe_ls_size_get, spe_ps_area_get
• PPE-assisted library calls: spe_callback_handler_register, spe_callback_handler_deregister

SPE Runtime Management Library v2.3 – Single Thread

• A simple application uses a single PPE thread, that is, the application's PPE thread
• The basic scheme for a simple application using an SPE is as follows:
  1. Create an SPE context
  2. Load an SPE executable object into the SPE context's local store
  3. Run the SPE context – this transfers control to the operating system, requesting the actual scheduling of the context to a physical SPE in the system
  4. Destroy the SPE context

• Note that step 3 represents a synchronous call to the operating system. The calling application's PPE thread blocks until the SPE stops execution and the operating system returns from the system call invoking the SPE execution.

SPE Runtime Management Library v2.3 – Hello World Single Thread Sample

PPE program:

    #include <stdlib.h>
    #include <stdio.h>
    #include <libspe2.h>

    extern spe_program_handle_t hello_spu;

    int main(void)
    {
        spe_context_ptr_t speid;   // Structure for an SPE context
        unsigned int flags = 0;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        void *argp = NULL;
        void *envp = NULL;
        spe_stop_info_t stop_info;
        int rc;

        // Create an SPE context
        speid = spe_context_create(flags, NULL);
        if (speid == NULL) { perror("spe_context_create"); return -2; }

        // Load an SPE executable object into the SPE context local store
        if (spe_program_load(speid, &hello_spu))
        { perror("spe_program_load"); return -3; }

        // Run the SPE context
        rc = spe_context_run(speid, &entry, 0, argp, envp, &stop_info);
        if (rc < 0) perror("spe_context_run");

        // Destroy the SPE context
        spe_context_destroy(speid);
        return 0;
    }

SPE program (hello_spu):

    #include <stdio.h>

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        printf("Hello World!\n");
        return 0;
    }

API reference:

spe_context_ptr_t spe_context_create(unsigned int flags, spe_gang_context_ptr_t gang)
• flags: a bit-wise OR of modifiers that are applied when the SPE context is created.
• gang: associate the new SPE context with this gang context. If NULL is specified, the new SPE context is not associated with any gang.
• On success, a pointer to the newly created SPE context is returned.

int spe_program_load(spe_context_ptr_t spe, spe_program_handle_t *program)
• spe: a valid pointer to the SPE context for which an SPE program should be loaded.
• program: a valid address of a mapped SPE program.
• On success, 0 (zero) is returned. spe_program_load loads an SPE main program that has been mapped to memory at the address pointed to by program into the local store of the SPE identified by the SPE context spe. This is mandatory before running the SPE context with spe_context_run.

int spe_context_run(spe_context_ptr_t spe, unsigned int *entry, unsigned int runflags, void *argp, void *envp, spe_stop_info_t *stopinfo)
• spe: a pointer to the SPE context that should be run.
• entry: input: the entry point, that is, the initial value of the SPU instruction pointer at which the SPE program should start executing. If the value of entry is SPE_DEFAULT_ENTRY, the entry point for the SPU main program is obtained from the loaded SPE image; this is usually the local store address of the initialization function crt0. Output: the SPU instruction pointer at the moment the SPU stopped execution, that is, the local store address of the next instruction that would have been executed.
• runflags: a bit mask that can be used to request certain specific behavior for the execution of the SPE context; 0 indicates default behavior.
• argp: an (optional) pointer to application-specific data; passed as the second parameter to the SPE program.
• envp: an (optional) pointer to environment-specific data; passed as the third parameter to the SPE program.
• stopinfo: an (optional) pointer to a structure of type spe_stop_info_t.
• On success, 0 (zero) or a positive number is returned.

SPE Runtime Management Library v2.3 – Request Execution of an SPE Context

• An SPE program must be loaded (using spe_program_load) before you can run the SPE context.
• The thread calling spe_context_run blocks and waits until the SPE stops, due to normal termination of the SPE program, an SPU stop-and-signal instruction, or an error condition. When spe_context_run returns, the calling thread must take appropriate actions depending on the application logic.
• spe_context_run returns information about the termination of the SPE program in three ways, which allows applications to deal with termination conditions on various levels:
  – First, the most common usage for many applications is covered by the return value of the function, with the errno value being set appropriately.
  – Second, the optional stopinfo structure provides detailed information about the termination condition in a structured way that allows applications more fine-grained error handling and enables implementation of special scenarios.
  – Third, the stopinfo structure contains the field spu_status, which holds the value of the CBEA SPU Status Register (SPU_Status), as specified in the Cell Broadband Engine Architecture, Version 1, section 8.5.2, upon termination of the SPE program. This can be very useful, especially in conjunction with the SPE_NO_CALLBACKS flag, for applications that run non-standard SPE programs and want to react flexibly to all possible conditions without relying on predefined conventions.
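A short hedged sketch of the second way, inspecting stop_info after the run call returns (variable names as in the earlier Hello World sample):

    spe_stop_info_t stop_info;
    int rc = spe_context_run(speid, &entry, 0, NULL, NULL, &stop_info);

    if (rc == 0 && stop_info.stop_reason == SPE_EXIT)
        printf("SPE exited normally, exit code %d\n",
               stop_info.result.spe_exit_code);
    else if (stop_info.stop_reason == SPE_RUNTIME_ERROR)
        fprintf(stderr, "SPE runtime error\n");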

SPE Runtime Management Library v2.3 – Single Thread Asynchronous Execution Example

• The application creates one PPE thread
• The PPE thread runs one SPE context at a time
• The basic scheme for a simple application running one SPE context asynchronously is:
  1. Create an SPE context
  2. Load the appropriate SPE executable object into the SPE context's local store
  3. Create a PPE thread
     a. Run the SPE context in the PPE thread
     b. Terminate the PPE thread
  4. Wait for the PPE thread to terminate
  5. Destroy the SPE context

SPE Runtime Management Library v2.3 – Single Thread Asynchronous Execution Example

    #include <stdlib.h>
    #include <libspe2.h>
    #include <pthread.h>   /* pthread header file */

    extern spe_program_handle_t hello_spu;

    /* Add pthread_t to the SPE context data structure */
    typedef struct ppu_pthread_data {
        spe_context_ptr_t context;
        pthread_t pthread;
        unsigned int entry;
        unsigned int flags;
        void *argp;
        void *envp;
        spe_stop_info_t stopinfo;
    } ppu_pthread_data_t;

    /* Define a pthread function so we can run the SPE context */
    void *ppu_pthread_function(void *arg)
    {
        ppu_pthread_data_t *datap = (ppu_pthread_data_t *)arg;
        int rc;
        rc = spe_context_run(datap->context, &datap->entry, datap->flags,
                             datap->argp, datap->envp, &datap->stopinfo);
        pthread_exit(NULL);
    }

    int main(void)
    {
        ppu_pthread_data_t data;   /* Structure for an SPE thread */

        /* Define the SPE context and load the SPE program image */
        data.context = spe_context_create(0, NULL);
        spe_program_load(data.context, &hello_spu);

        /* Fill the data structure with the necessary context data */
        data.entry = SPE_DEFAULT_ENTRY;
        data.flags = 0;
        data.argp = NULL;
        data.envp = NULL;

        /* Call ppu_pthread_function to run the context in the PPE thread */
        pthread_create(&data.pthread, NULL, &ppu_pthread_function, &data);

        /* Wait until the PPE thread terminates */
        pthread_join(data.pthread, NULL);

        spe_context_destroy(data.context);
        return 0;
    }

SPE Runtime Management Library v2.3 – Multi-thread

• Many applications need to use multiple SPEs concurrently
• In this case, the application must create at least as many PPE threads as concurrent SPE contexts are required
• Each of these PPE threads may run a single SPE context at a time
• If N concurrent SPE contexts are needed, it is common to have a main application thread plus N PPE threads dedicated to SPE context execution
• The basic scheme for a simple application running N SPE contexts is:
  1. Create N SPE contexts
  2. Load the appropriate SPE executable object into each SPE context's local store
  3. Create N PPE threads
     a. In each of these PPE threads, run one of the SPE contexts
     b. Terminate the PPE thread
  4. Wait for all N PPE threads to terminate
  5. Destroy all N SPE contexts

SPE Runtime Management Library v2.3 – Multi Thread Asynchronous Execution Example

    #include <stdio.h>
    #include <stdlib.h>
    #include <libspe2.h>
    #include <pthread.h>

    extern spe_program_handle_t simple_spu;

    #define SPU_THREADS 6

    /* Define a pthread function so we can run the SPE context */
    void *ppu_pthread_function(void *arg)
    {
        spe_context_ptr_t ctx;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        ctx = *((spe_context_ptr_t *)arg);
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
            perror("Failed running context");
            exit(1);
        }
        pthread_exit(NULL);
    }

SPE Runtime Management Library v2.3 – Multi Thread Asynchronous Execution Example (continued)

    int main(void)
    {
        int i;
        spe_context_ptr_t ctxs[SPU_THREADS];
        pthread_t threads[SPU_THREADS];

        /* Create several SPE-threads to execute 'simple_spu' */
        for (i = 0; i < SPU_THREADS; i++) {
            /* Create context */
            if ((ctxs[i] = spe_context_create(0, NULL)) == NULL) {
                perror("Failed creating context");
                exit(1);
            }
            /* Load program into context */
            if (spe_program_load(ctxs[i], &simple_spu)) {
                perror("Failed loading program");
                exit(1);
            }
            /* Fork a pthread for each SPE context, giving the thread id,
               function, and data */
            if (pthread_create(&threads[i], NULL, &ppu_pthread_function,
                               &ctxs[i])) {
                perror("Failed creating thread");
                exit(1);
            }
        }

        /* The slide is truncated here; per the scheme on the previous
           slide: wait for all threads, then destroy all contexts */
        for (i = 0; i < SPU_THREADS; i++) {
            pthread_join(threads[i], NULL);
            spe_context_destroy(ctxs[i]);
        }
        return 0;
    }

Cell's Primary Communication Mechanisms

• DMA transfers, mailbox messages, and signal notification
• All three are implemented and controlled by the SPE's MFC

Storage Domain Communication Concepts

• An SPE program references its own LS using a Local Store Address (LSA).
• The LS of each SPE is also assigned a Real Address (RA) range within the system's memory map.
  ⇒ This allows privileged software on the PPE to map LS areas into the Effective Address (EA) space, where the PPE, other SPEs, and other devices that generate EAs can access the LS like any regular component of main storage.
  ⇒ Because the local stores may be mapped into main storage, SPEs can use DMA operations to transfer data directly between their LS and another SPE's LS (see the sketch below).
  ⇒ This mode of data transfer is very efficient, because the DMA transfers go directly from SPE to SPE on the high-performance local bus, without involving system memory.
• Code that runs on an SPU can only fetch instructions from its own LS, and loads and stores can only access that LS.
• Data transfers between the SPE's LS and main storage are primarily executed using DMA transfers controlled by the MFC DMA controller for that SPE.
• Each SPE's MFC serves as a data-transfer engine.
• DMA transfer requests contain both an LSA and an EA.
  ⇒ They can therefore address both an SPE's LS and main storage, and thereby initiate DMA transfers between the domains.
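A hedged sketch of setting up such an LS-to-LS transfer: the PPE obtains the effective address of one context's mapped LS and hands it to another SPE (for instance through a mailbox or argp); the receiving SPE can then mfc_get from that address. The pointer arithmetic shown assumes a 64-bit PPE build:

    #include <libspe2.h>

    /* Effective address of an SPE context's mapped local store (PPE side) */
    unsigned long long ls_effective_addr(spe_context_ptr_t ctx)
    {
        void *ls = spe_ls_area_get(ctx);   /* NULL on failure */
        return (unsigned long long)(unsigned long)ls;
    }

    /* This EA (plus any offset within the 256 KB LS) can be passed to
       another SPE, which issues an ordinary mfc_get against it. */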

MFC channels and MMIO interfaces and queues

NOTES:
• Since a channel access remains local within a given SPE, it has low latency (about 6 cycles for non-blocking commands if the channel is not full) and does not consume any EIB bandwidth.
• When an SPE reads or writes a non-blocking channel, the operation executes without delay. However, when SPE software reads or writes a blocking channel, the SPE may stall for an arbitrary length of time if the associated channel count (its remaining capacity) is 0. In this case, the SPE remains stalled until the channel count becomes 1 or more.
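A minimal sketch of avoiding such a stall by polling the channel count first, here for the inbound mailbox (spu_mfcio.h accessors; returning 0 for "no message" is an illustrative convention):

    #include <spu_mfcio.h>

    unsigned int poll_mailbox(void)
    {
        /* spu_stat_in_mbox returns the number of queued inbound messages */
        if (spu_stat_in_mbox() > 0)
            return spu_read_in_mbox();   /* guaranteed not to block now */
        return 0;                        /* nothing available yet */
    }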


MFC Commands

• The main mechanism for SPUs to:
  – access main storage (DMA commands)
  – maintain synchronization with other processors and devices in the system (synchronization commands)
• Can be issued by the SPU via its MFC, or by the PPE or another device, as follows:
  – Code running on the SPU issues an MFC command by executing a series of writes and/or reads using channel instructions: read channel (rdch), write channel (wrch), and read channel count (rchcnt).
  – Code running on the PPE or other devices issues an MFC command by performing a series of stores and/or loads to memory-mapped I/O (MMIO) registers in the MFC.
• MFC commands are queued in one of two independent MFC command queues:
  – MFC SPU Command Queue – for channel-initiated commands by the associated SPU
  – MFC Proxy Command Queue – for MMIO-initiated commands by the PPE or another device

Sequences for Issuing MFC Commands

✓ All operations on a given channel are unidirectional (they can be only read or write operations for a given channel, not bidirectional)
✓ Accesses to channel-interface resources through MMIO addresses do not stall
✓ Channel operations are done in program order
✓ Channel read operations to reserved channels return 0s
✓ Channel write operations to reserved channels have no effect
✓ Reading the channel counts of reserved channels returns 0
✓ Channel instructions use the 32-bit preferred slot in a 128-bit transfer


DMA Overview


DMA Commands

• MFC commands that transfer data are referred to as DMA commands
• The transfer direction for DMA commands is referenced from the SPE:
  – Into an SPE (from main storage to local store) → get
  – Out of an SPE (from local store to main storage) → put


DMA Commands – mfc_get implemented as an MFC command and as composite intrinsics

mfc_get and related MFC commands are defined as macros in spu_mfcio.h, layered over composite intrinsics (spu_mfcdma32, spu_mfcdma64), which are in turn layered over channel-control intrinsics (spu_writech):

    #include "spu_mfcio.h"

    mfc_get(lsa, ea, size, tag, tid, rid);
    // Implemented as the following composite intrinsic:
    spu_mfcdma64(lsa, mfc_ea2h(ea), mfc_ea2l(ea), size, tag,
                 ((tid<<24)|(rid<<16)|MFC_GET_CMD));
    // wait until the DMA transfer is complete (or do other things before that)

The composite intrinsic itself expands to channel-control intrinsics:

    spu_writech(MFC_LSA, lsa);
    spu_writech(MFC_EAH, eah);
    spu_writech(MFC_EAL, eal);
    spu_writech(MFC_Size, size);
    spu_writech(MFC_TagID, tag);
    spu_writech(MFC_CMD, 0x0040);

DMA Get and Put Commands (SPU)

• DMA get from main memory into local store:
  (void) mfc_get(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag, uint32_t tid, uint32_t rid)
• DMA put into main memory from local store:
  (void) mfc_put(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag, uint32_t tid, uint32_t rid)
• To ensure the order of DMA request execution:
  – mfc_putf: fenced (all commands issued before it within the same tag group must finish first; commands issued later may still execute before it)
  – mfc_putb: barrier (the barrier command and all commands issued thereafter are not executed until all previously issued commands in the same tag group have been performed)
• 5-bit DMA tag for all DMA commands (except getllar, putllc, and putlluc)
  – The tag can be used to determine the status of an entire group/command, or to check/wait on the completion of all queued commands in one or more tag groups
  – Tagging is optional but can be useful when using barriers to control the ordering of MFC commands within a single command queue
• Synchronization of DMA commands within a tag group: fence and barrier (see the sketch below)
  – Execution of a fenced command option is delayed until all previously issued commands within the same tag group have been performed
  – Execution of a barrier command option and all subsequent commands is delayed until all previously issued commands in the same tag group have been performed
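A hedged sketch of the fence in use: a result buffer and a "done" flag share a tag group, and the fenced put guarantees the flag cannot land in main storage before the data (buffer size, tag value, and names are illustrative):

    #include <spu_mfcio.h>

    volatile float result[1024] __attribute__((aligned(128)));
    volatile unsigned int done __attribute__((aligned(16))) = 1;

    void put_result_then_flag(unsigned long long result_ea,
                              unsigned long long flag_ea)
    {
        unsigned int tag = 5;

        /* Result data first ... */
        mfc_put(result, result_ea, sizeof(float) * 1024, tag, 0, 0);
        /* ... then a fenced put: executes only after all earlier commands
           in tag group 5 have completed */
        mfc_putf(&done, flag_ea, sizeof(unsigned int), tag, 0, 0);

        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
    }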

Barriers and Fences – Synchronization Commands

DMA Characteristics

• DMA transfers
  – Transfer sizes can be 1, 2, 4, 8, or n*16 bytes (n integer)
  – Maximum is 16 KB per DMA transfer
  – 128-byte alignment is preferable
• DMA command queues per SPU
  – 16-element queue for SPU-initiated requests
  – 8-element queue for PPE-initiated requests
  ► SPU-initiated DMA is always preferable
• DMA tags
  – Each DMA command is tagged with a 5-bit identifier
  – The same identifier can be used for multiple commands
  – Tags are used for polling status or waiting on the completion of DMA commands
• DMA lists
  – A single DMA command can cause execution of a list of transfer requests (in LS)
  – Lists implement scatter/gather functions
  – A list can contain up to 2K transfer requests


PPE – SPE DMA Transfer

PPE to SPE Transfer through MMIO Interface

• Two methods for accessing an SPE from the PPE through the MMIO interface:
  – MFC functions
    • Simple to use
    • Less development time
  – Direct problem state access (or direct SPE access)
    • Offers better performance than the MFC functions

Example of using the MFC functions (spe_ctx is a pointer to the context of the relevant SPE; this context is created when the SPE thread is created):

    #include "libspe2.h"

    spe_mfcio_get(spe_ctx, lsa, ea, size, tag, tid, rid);
    // wait until the data has been transferred to the LS, or do other things...

DMA Example: Read into local store

    #include <spu_mfcio.h>   /* MFC functions header file */

    inline void dma_mem_to_ls(unsigned int mem_addr,
                              volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1;

        /* Initiate a DMA transfer - read the contents of mem_addr
           into ls_addr */
        mfc_get(ls_addr, mem_addr, size, tag, 0, 0);

        /* Set the tag mask to determine which tag ID to notify upon
           completion */
        mfc_write_tag_mask(mask);

        /* Wait until all tagged DMA commands have completed */
        mfc_read_tag_status_all();
    }

DMA Example: Write to Main Memory

    #include <spu_mfcio.h>   /* MFC functions header file */

    inline void dma_ls_to_mem(unsigned int mem_addr,
                              volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1;

        /* Initiate a DMA transfer - write the contents of ls_addr
           into mem_addr */
        mfc_put(ls_addr, mem_addr, size, tag, 0, 0);

        /* Set the tag mask to determine which tag ID to notify upon
           completion */
        mfc_write_tag_mask(mask);

        /* Wait until all tagged DMA commands have completed */
        mfc_read_tag_status_all();
    }


Tips to Achieve Peak Bandwidth for DMAs

• The performance of a DMA data transfer is best when the source and destination addresses have the same quadword offsets within a PPE cache line.
• Quadword-offset-aligned data transfers generate full cache-line bus requests for every unrolling, except possibly the first and last unrolling.
• Transfers that start or end in the middle of a cache line transfer a partial cache line (less than 8 quadwords) in the first or last bus request, respectively.

Supported and recommended values for DMA parameters

• Direction: data may be transferred in either of two directions, as referenced from the perspective of an SPE:
  – get commands: transfer data into the LS from main storage
  – put commands: transfer data out of the LS to main storage
• Size: the transfer size should obey the following guidelines:
  – Supported transfer sizes are 1, 2, 4, 8, or 16 bytes, and multiples of 16 bytes
  – The maximum transfer size is 16 KB
  – Peak performance is achieved when the transfer size is a multiple of 128 bytes
• Alignment: the alignment of the LSA and the EA should obey the following guidelines (see the sketch below):
  – Source and destination addresses must have the same 4 least-significant bits
  – For transfer sizes of less than 16 bytes, the address must be naturally aligned (bits 28 through 31 must provide natural alignment based on the transfer size)
  – For transfer sizes of 16 bytes or greater, the address must be aligned to at least a 16-byte boundary (bits 28 through 31 must be 0)
  – Peak performance is achieved when both source and destination are aligned on a 128-byte boundary (bits 25 through 31 cleared to 0)
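A hedged sketch that follows these rules: a 128-byte-aligned LS buffer and a loop that never exceeds the 16 KB per-command limit (buffer size, tag, and names are illustrative, and total is assumed to be a multiple of 16 bytes):

    #include <spu_mfcio.h>

    #define CHUNK 16384   /* maximum size of a single DMA transfer */

    /* 16 KB local-store buffer, aligned for peak transfer performance */
    volatile float local_buf[4096] __attribute__((aligned(128)));

    void get_large(unsigned long long ea, unsigned int total, unsigned int tag)
    {
        unsigned int off;
        for (off = 0; off < total; off += CHUNK) {
            unsigned int sz = (total - off < CHUNK) ? (total - off) : CHUNK;
            mfc_get(local_buf, ea + off, sz, tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();
            /* ... consume local_buf before the next iteration reuses it ... */
        }
    }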

Data transfer example

PPE program:

    #include <stdio.h>
    #include <libspe2.h>

    extern spe_program_handle_t getbuf_spu;   // spu program

    unsigned char buffer[128] __attribute__ ((aligned(128)));

    int main(void)
    {
        spe_context_ptr_t speid;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop_info;

        /* ... initialize buffer (truncated on the slide) ... */

        speid = spe_context_create(0, NULL);
        spe_program_load(speid, &getbuf_spu);

        /* Pass the effective address of the buffer as argp */
        spe_context_run(speid, &entry, 0, buffer, NULL, &stop_info);
        spe_context_destroy(speid);

        printf("New modified buffer is %s\n", buffer);
        return 0;
    }

SPE program:

    #include <spu_mfcio.h>

    unsigned char buffer[128] __attribute__ ((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        int tag = 31, tag_mask = 1 << tag;

        /* DMA the buffer in from main storage (its EA arrives in argp) */
        mfc_get(buffer, argp, 128, tag, 0, 0);
        mfc_write_tag_mask(tag_mask);
        mfc_read_tag_status_all();

        /* ... modify the buffer (truncated on the slide) ... */

        /* DMA the modified buffer back to main storage */
        mfc_put(buffer, argp, 128, tag, 0, 0);
        mfc_write_tag_mask(tag_mask);
        mfc_read_tag_status_all();

        return 0;
    }

Mailboxes

• For communicating messages up to 32 bits in length, such as buffer-completion flags or program status
  – e.g., when the SPE places computational results in main storage via DMA: after requesting the DMA transfer, the SPE waits for the transfer to complete and then writes to an outbound mailbox to notify the PPE that its computation is complete
• Can be used for any short-data transfer purpose, such as sending storage addresses, function parameters, command parameters, and state-machine parameters
• Can also be used for communication between an SPE and other SPEs, processors, or devices
  – Privileged software needs to allow one SPE to access the mailbox register in another SPE by mapping the target SPE's problem-state area into the EA space of the source SPE. If software does not allow this, then only atomic operations and signal notifications are available for SPE-to-SPE communication.

Mailboxes - Overview

Each MFC provides three mailbox queues of 32 bits each:
• Two outbound mailboxes – to send messages from the local SPE to the PPE or other SPEs:
  – SPU Write Outbound Mailbox (SPU_WrOutMbox)
  – SPU Write Outbound Interrupt Mailbox (SPU_WrOutIntrMbox)
• One inbound mailbox – to send messages to the local SPE from the PPE or other SPEs:
  – SPU ("SPU read inbound") mailbox queue

¾ Each mailbox entry is a fullword

75 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Mailboxes - Characteristics

76 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

Mailboxes API – libspe2 (See more from libspe2 documents)

PPU (libspe2.h)              SPU (spu_mfcio.h)          MFC mailbox     dataflow
spe_out_mbox_status()        spu_stat_out_mbox          out_mbox        SPE Æ PPE
spe_out_mbox_read(, &)       spu_write_out_mbox

spe_out_intr_mbox_status()   spu_stat_out_intr_mbox     out_intr_mbox   SPE Æ PPE (interrupt)
spe_get_event                spu_write_out_intr_mbox

spe_in_mbox_status()         spu_stat_in_mbox           in_mbox         PPE Æ SPE
spe_in_mbox_write(,)         spu_read_in_mbox

77 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Mailboxes MFC Functions

NOTE: To access a mailbox of a local SPU from another SPU, the following steps are needed:
1. The PPU code maps the SPU’s control area to main storage using the spe_ps_area_get function (libspe2.h) with the SPE_CONTROL_AREA flag set.
2. The PPE forwards the SPU’s control-area base address to the other SPU.
3. The other SPU uses ordinary DMA transfers to access the mailbox; a sketch follows below. The effective address is the control-area base plus the offset of the specific mailbox register.
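A hedged sketch of these steps (function names are illustrative; the 0x0C offset of SPU_In_Mbox follows libspe2's spe_spu_control_area_t layout and is an assumption worth checking against your libspe2.h):

/* PPU side: map the target SPE's control area; the context must have been
 * created with the SPE_MAP_PS flag for the mapping to exist. */
#include <libspe2.h>

void *map_target_mailbox(spe_context_ptr_t target_ctx)
{
    return spe_ps_area_get(target_ctx, SPE_CONTROL_AREA);
}

/* SPU side: DMA a 32-bit message into the target's SPU_In_Mbox register,
 * given the control-area EA forwarded by the PPE. */
#include <spu_mfcio.h>

void send_to_other_spe(unsigned long long ctl_ea, unsigned int value)
{
    volatile unsigned int msg __attribute__ ((aligned(16)));
    unsigned int tag = 3;

    msg = value;
    mfc_put(&msg, ctl_ea + 0x0C /* SPU_In_Mbox */, 4, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}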

78 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Example of Using Mailboxes – PPE Code

#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>
#include <pthread.h>

#define ARRAY_SIZE 1024
#define NUM 512

#define MY_ALIGN(_my_var_def_, _my_al_) \
        _my_var_def_ __attribute__((__aligned__(_my_al_)))

MY_ALIGN(float array_a[ARRAY_SIZE], 128);
MY_ALIGN(float array_b[ARRAY_SIZE], 128);
MY_ALIGN(float array_c[ARRAY_SIZE], 128);

extern spe_program_handle_t array_spu_add;
int status;

typedef struct ppu_thread_data {
    spe_context_ptr_t context;
    pthread_t pthread;
    unsigned int entry;
    unsigned int flags;
    void *argp;
    void *envp;
    spe_stop_info_t stopinfo;
} ppu_pthread_data_t;

void init_array()
{
    int i;

    for (i = 0; i < ARRAY_SIZE; i++) {   /* loop body truncated on the
                                            original slide; simple values assumed */
        array_a[i] = (float)i;
        array_b[i] = (float)i;
        array_c[i] = 0.0f;
    }
}

79 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Example of Using Mailboxes – PPE Code

/* New thread */
void *ppu_pthread_function(void *arg)
{
    ppu_pthread_data_t *datap = (ppu_pthread_data_t *)arg;

    spe_context_run(datap->context, &datap->entry, datap->flags,
                    datap->argp, datap->envp, &datap->stopinfo);
    pthread_exit(NULL);
}

int main(void)
{
    int i, retVal;
    ppu_pthread_data_t threadData;
    unsigned int *mbox_dataPtr;
    unsigned int mbox_data[3];

    init_array();

    /* check for working hardware */
    if (spe_cpu_info_get(SPE_COUNT_PHYSICAL_SPES, -1) < 1) {
        fprintf(stderr, "System has no working SPEs. Exiting\n");
        return -1;
    }

    /* Create a spe context that we'll use to load our spe program into */
    threadData.context = spe_context_create(0, NULL);

    /* Load our spe program: array_spu_add */
    spe_program_load(threadData.context, &array_spu_add);

    /* Setup our SPE program parameters */
    threadData.entry = SPE_DEFAULT_ENTRY;
    threadData.flags = 0;
    threadData.argp = NULL;
    threadData.envp = NULL;

    /* Now we can create a thread to run on a SPE */
    pthread_create(&threadData.pthread, NULL, &ppu_pthread_function, &threadData);

    /* Setup mailbox messages */
    mbox_data[0] = (unsigned int)&array_a[0];
    mbox_data[1] = (unsigned int)&array_b[0];
    mbox_data[2] = (unsigned int)&array_c[0];

80 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Example of Using Mailboxes – PPE Code

    /* Send effective addresses of the arrays to the spe through mailboxes */
    while (spe_in_mbox_status(threadData.context) == 0);
    do {
        mbox_dataPtr = &mbox_data[0];
        retVal = spe_in_mbox_write(threadData.context, mbox_dataPtr, 1,
                                   SPE_MBOX_ANY_NONBLOCKING);
    } while (retVal != 1);

    while (spe_in_mbox_status(threadData.context) == 0);
    do {
        mbox_dataPtr = &mbox_data[1];
        retVal = spe_in_mbox_write(threadData.context, mbox_dataPtr, 1,
                                   SPE_MBOX_ANY_NONBLOCKING);
    } while (retVal != 1);

    while (spe_in_mbox_status(threadData.context) == 0);
    do {
        mbox_dataPtr = &mbox_data[2];
        retVal = spe_in_mbox_write(threadData.context, mbox_dataPtr, 1,
                                   SPE_MBOX_ANY_NONBLOCKING);
    } while (retVal != 1);

    /* wait for spe to finish */
    pthread_join(threadData.pthread, NULL);
    spe_context_destroy(threadData.context);
    __asm__ __volatile__ ("sync" : : : "memory");

    printf("Array Addition complete. Verifying results...\n");
    /* verifying */
    for (i = 0; i < ARRAY_SIZE; i++) {
        /* comparison body truncated on the original slide */
    }
    return 0;
}

81 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Example of Using Mailboxes – SPE Code

#include <spu_mfcio.h>

#define ARRAY_SIZE 1024
#define MY_ALIGN(_my_var_def_, _my_al_) \
        _my_var_def_ __attribute__((__aligned__(_my_al_)))

MY_ALIGN(float array_a[ARRAY_SIZE], 128);
MY_ALIGN(float array_b[ARRAY_SIZE], 128);
MY_ALIGN(float array_c[ARRAY_SIZE], 128);

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    int i;
    unsigned int Aaddr, Baddr, Caddr;

    /* Read the EAs of the three arrays from the inbound mailbox
       (assumed order a, b, c, matching the PPE code above) */
    Aaddr = spu_read_in_mbox();
    Baddr = spu_read_in_mbox();
    Caddr = spu_read_in_mbox();

    /* Now that we have the array EAs we can DMA our data over */
    mfc_get(&array_a, Aaddr, (sizeof(float)*ARRAY_SIZE), 31, 0, 0);
    mfc_get(&array_b, Baddr, (sizeof(float)*ARRAY_SIZE), 31, 0, 0);
    mfc_write_tag_mask(1<<31);
    mfc_read_tag_status_all();

    /* array add */
    for (i = 0; i < ARRAY_SIZE; i++)
        array_c[i] = array_a[i] + array_b[i];

    /* DMA the result back and wait for completion */
    mfc_put(&array_c, Caddr, (sizeof(float)*ARRAY_SIZE), 31, 0, 0);
    mfc_write_tag_mask(1<<31);
    mfc_read_tag_status_all();
    return 0;
}

82 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

SPU Write Outbound Mailbox

– The value written to the SPU Write Outbound Mailbox channel SPU_WrOutMbox is entered into the outbound mailbox in the MFC if the mailbox has capacity to accept the value.
– If the mailbox can accept the value, the channel count for SPU_WrOutMbox is decremented by ‘1’.
– If the outbound mailbox is full, the channel count will read as ‘0’.
– If SPE software writes a value to SPU_WrOutMbox when the channel count is ‘0’, the SPU will stall on the write.
– The SPU remains stalled until the PPE or other device reads a message from the outbound mailbox by reading the MMIO address of the mailbox.
– When the mailbox is read through the MMIO address, the channel count is incremented by ‘1’.
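A small sketch (not from the slides) of avoiding the stall described above: check the channel count before writing.

#include <spu_mfcio.h>

void post_status(unsigned int val)
{
    if (spu_stat_out_mbox() > 0)    /* non-zero: the mailbox has room */
        spu_write_out_mbox(val);    /* otherwise this write would stall */
}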

83 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

SPU Write Outbound Interrupt Mailbox

– The value written to the SPU Write Outbound Interrupt Mailbox channel (SPU_WrOutIntrMbox) is entered into the outbound interrupt mailbox if the mailbox has capacity to accept the value.
– If the mailbox can accept the message, the channel count for SPU_WrOutIntrMbox is decremented by ‘1’, and an interrupt is raised in the PPE or other device, depending on interrupt enabling and routing.
– There is no ordering of the interrupt and previously issued MFC commands.
– If the outbound interrupt mailbox is full, the channel count will read as ‘0’.
– If SPE software writes a value to SPU_WrOutIntrMbox when the channel count is ‘0’, the SPU will stall on the write.
– The SPU remains stalled until the PPE or other device reads a mailbox message from the outbound interrupt mailbox by reading the MMIO address of the mailbox.
– When this is done, the channel count is incremented by ‘1’.

84 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

PPU reads SPU Outbound Mailboxes

ƒ PPU must check the Mailbox Status Register first
– check that unread data is available in the SPU Outbound Mailbox or SPU Outbound Interrupt Mailbox
– otherwise, stale or undefined data may be returned
ƒ To determine that unread data is available
– the PPE reads the Mailbox Status register
– and extracts the count value from the SPU_Out_Mbox_Count field
ƒ If the count is
– non-zero Æ at least one unread value is present
– zero Æ the PPE should not read but poll the Mailbox Status register
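With libspe2 the status check and the read look roughly like this (a minimal sketch; ctx is an assumed spe_context_ptr_t):

#include <libspe2.h>

unsigned int read_spe_message(spe_context_ptr_t ctx)
{
    unsigned int data;

    while (spe_out_mbox_status(ctx) == 0)
        ;                              /* poll until an unread entry exists */
    spe_out_mbox_read(ctx, &data, 1);  /* read one 32-bit mailbox entry */
    return data;
}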

85 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

SPU Read Inbound Mailbox Channel ƒ Mailbox is FIFO queue – If the SPU Read Inbound Mailbox channel (SPU_RdInMbox) has a message, the value read from the mailbox is the oldest message written to the mailbox. ƒ Mailbox Status (empty: channel count =0) – If the inbound mailbox is empty, the SPU_RdInMbox channel count will read as ‘0’. ƒ SPU stalls on reading empty mailbox – If SPE software reads from SPU_RdInMbox when the channel count is ‘0’, the SPU will stall on the read. The SPU remains stalled until the PPE or other device writes a message to the mailbox by writing to the MMIO address of the mailbox. ƒ When the mailbox is written through the MMIO address, the channel count is incremented by ‘1’. ƒ When the mailbox is read by the SPU, the channel count is decremented by '1'. ƒ The SPU Read Inbound Mailbox can be overrun by a PPE in which case, mailbox message data will be lost. ƒ A PPE writing to the SPU Read Inbound Mailbox will not stall when this mailbox is full.

86 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

MFC Synchronization Commands

MFC synchronization commands
ƒ Used to control the order in which DMA storage accesses are performed
ƒ Four atomic commands (getllar, putllc, putlluc, and putqlluc)
ƒ Three send-signal commands (sndsig, sndsigf, and sndsigb)
ƒ Three barrier commands (barrier, mfcsync, and mfceieio)
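A hedged sketch (names illustrative) of ordering DMAs with the barrier form of put, mfc_putb from spu_mfcio.h: the flag transfer is ordered behind all earlier commands in the same tag group, so a consumer that sees the flag can trust the result.

#include <spu_mfcio.h>

volatile float result[32] __attribute__ ((aligned(128)));
volatile unsigned int flag[4] __attribute__ ((aligned(16)));

void publish(unsigned long long ea_result, unsigned long long ea_flag)
{
    unsigned int tag = 7;

    flag[0] = 1;
    mfc_put(result, ea_result, sizeof(result), tag, 0, 0);
    mfc_putb(flag, ea_flag, 16, tag, 0, 0);  /* ordered behind the result */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}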

87 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

Cell Multi-core Programming Effort Roadmap

Requires mostly the same effort to port to any multi-core architecture.

Port app Begin Optimizing Port app to Power, Cell BE moving function Porting & to Linux, run on Tune SIMD function on SPE’s Optimizing if needed PPE to SPE’s Optimizing function •Exploit Parallelism at Task - Local Store Level Management •Exploit Parallelism at instruction / data level •Data and Instruction Locality Tuning

Writing for Cell BE speeds up code on all multi-core architectures because it uses the same parallel best practices – Cell architecture just gains more from them because of its design.

88 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group SIMD Architecture

ƒ SIMD = “single-instruction, multiple-data”
ƒ SIMD exploits data-level parallelism
– a single instruction can apply the same operation to multiple data elements in parallel
ƒ SIMD units employ “vector registers”, each of which holds multiple data elements
ƒ SIMD is pervasive in the Cell/B.E.
– PPE includes VMX (SIMD extensions to the PPC architecture)
– SPE is a native SIMD architecture (VMX-like)
ƒ SIMD in VMX and SPE
– 128-bit-wide datapath
– 128-bit-wide registers
– 4-wide fullwords, 8-wide halfwords, 16-wide bytes
– SPE includes support for 2-wide doublewords

89 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group SIMD (Single Instruction Multiple Data) processing

• Data-level parallelism: operate on multiple data elements simultaneously

Scalar code (repeat 4 times):
    read the next instruction and decode it
    get this number
    get that number
    add them
    put the result here

Vector code (do once):
    read instruction and decode it
    get these 4 numbers
    get those 4 numbers
    add them
    put the results here

Single instruction, multiple data:
    Number 1a  Number 1b  Number 1c  Number 1d
       +          +          +          +
    Number 2a  Number 2b  Number 2c  Number 2d
       =          =          =          =
    Result 1   Result 2   Result 3   Result 4

• (Significantly) smaller amount of code => improved execution efficiency
• Number of elements processed in parallel = (size of vector / size of element)

90 Cell Programming Workshop © 2009 IBM Corporation IBM Systems & Technology Group SPU Instruction Set
ƒ Operates primarily on SIMD vector operands, both fixed-point and floating-point
ƒ Supports some scalar operands
ƒ 128 General-Purpose Registers (GPRs), each 128 bits wide, that can be used to store all data types
ƒ Supports big-endian data ordering
– lowest-address byte and lowest-numbered bit are the most-significant (high) byte and bit
ƒ Supports data types
– byte—8 bits
– halfword—16 bits
– word—32 bits
– doubleword—64 bits
– quadword—128 bits

Vector Data Types

91 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group SPE – Pipelines and Dual-Issue Rules
ƒ SPU has two pipelines:
– even (pipeline 0)
– odd (pipeline 1)
ƒ The SPU can issue and complete up to two instructions per cycle, one in each of the pipelines
ƒ Dual-issue occurs when a fetch group has two issue-able instructions in which the first instruction can be executed on the even pipeline and the second instruction can be executed on the odd pipeline.

92 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

SPE and Scalar Code

ƒ SPU only loads and stores a quadword at a time
ƒ The value of scalar operands (including addresses) is kept in the preferred slot of a SIMD register
ƒ Scalar (sub-quadword) loads and stores require several instructions to format the data for use on the SIMD architecture of the SPE
– e.g., scalar stores require a read, scalar insert, and write operation
ƒ Strategies to make operations on scalar data more efficient (see the sketch below):
– Change the scalars to quadword vectors to eliminate the three extra instructions associated with loading and storing scalars
– Cluster scalars into groups, and load multiple scalars at a time using a quadword memory access. Manually extract or insert the scalars as needed. This eliminates redundant loads and stores.
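A minimal sketch of the clustering strategy (function and variable names are illustrative): four floats are brought in with one quadword load and extracted as scalars.

#include <spu_intrinsics.h>

float sum_four(float *p)    /* p assumed quadword aligned */
{
    vector float v = *(vector float *)p;   /* one 16-byte load, four scalars */

    return spu_extract(v, 0) + spu_extract(v, 1)
         + spu_extract(v, 2) + spu_extract(v, 3);
}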

Intrinsics for Changing Scalar and Vector Data Types

93 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Register Layout of Data Types and Preferred Slot

When instructions use or produce scalar operands or addresses, the values are in the preferred scalar slot: the left-most word (bytes 0, 1, 2, and 3) of a register.

94 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group SIMD Programming

ƒ “Native SIMD” programming – algorithm vectorized by the programmer – coding in high-level language (e.g. C, C++) using intrinsics – intrinsics provide access to SIMD assembler instructions • e.g. c = spu_add(a,b) Æ add vc,va,vb

ƒ “Traditional” programming – algorithm coded “normally” in scalar form – compiler does auto-vectorization ¾ but auto-vectorization capabilities remain limited

95 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group SPE C/C++ Language Extensions (Intrinsics)

ƒ Vector datatypes: vector [unsigned] {char, short, float, double}
– e.g. “vector float”, “vector signed short”, “vector unsigned int”, …
ƒ Vector pointers:
– e.g. “vector float *p” (p+1 points to the next vector (16B) after the one pointed to by p)
ƒ Specific intrinsics: intrinsics that have a one-to-one mapping with a single assembly-language instruction (si_assembly-inst-name); provided for all instructions except some branch- and interrupt-related ones.
ƒ Generic / built-in:
– Constant formation (spu_splats)
– Conversion (spu_convtf, spu_convts, ...)
– Arithmetic (spu_add, spu_madd, spu_nmadd, ...)
– Byte operations (spu_absd, spu_avg, ...)
– Compare and branch (spu_cmpeq, spu_cmpgt, ...)
– Bits and masks (spu_shuffle, spu_sel, ...)
– Logical (spu_and, spu_or, ...)
– Shift and rotate (spu_rlqwbyte, spu_rlqw, ...)
– Control (spu_stop, spu_ienable, spu_idisable, ...)
– Channel control (spu_readch, spu_writech, ...)
– Scalar (spu_insert, spu_extract, spu_promote)
ƒ Composite: DMA (spu_mfcdma32, spu_mfcdma64, spu_mfcstat)

96 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Simple SIMD Operations

ƒ Example of basic SIMD code (header assumed: spu_intrinsics.h)

#include <spu_intrinsics.h>
vector float vec1 = {8.0, 8.0, 8.0, 8.0}, vec2 = {2.0, 4.0, 8.0, 16.0};
vec1 = vec1 + vec2;

ƒ Example of SIMD intrinsics

#include <spu_intrinsics.h>
vector float vec1 = {8.0, 8.0, 8.0, 8.0}, vec2 = {2.0, 4.0, 8.0, 16.0};
vec1 = spu_sub( (vector float)spu_splats((float)3.5), vec1);
vec1 = spu_mul( vec1, vec2);

97 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

Vectorizing a Loop – A Simple Example

/* scalar version */

int mult1(float *in1, float *in2, float *out, int N)
{
    int i;

    for (i = 0; i < N; i++) {
        out[i] = in1[i] * in2[i];
    }
    return 0;
}

/* vectorized version */

int vmult1(float *in1, float *in2, float *out, int N)
{
    /* assumes that arrays are quadword aligned */
    /* assumes that N is divisible by 4 */
    int i, Nv;
    Nv = N >> 2; /* N/4 vectors */
    vector float *vin1 = (vector float *) (in1);
    vector float *vin2 = (vector float *) (in2);
    vector float *vout = (vector float *) (out);

    for (i = 0; i < Nv; i++) {
        vout[i] = spu_mul(vin1[i], vin2[i]);
    }
    return 0;
}

ƒ Loop does a term-by-term multiply of two arrays
– arrays are assumed here to remain scalar outside of the subroutine
ƒ If the arrays are not quadword-aligned, extra work is necessary
ƒ If the array size is not a multiple of 4, extra work is necessary (a sketch of one way to handle this follows)
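A hedged sketch (not from the slides) of that extra work: process full vectors first, then finish the remaining 0-3 elements scalarly.

#include <spu_intrinsics.h>

int vmult1_tail(float *in1, float *in2, float *out, int N)
{
    /* still assumes the arrays are quadword aligned */
    int i;
    int Nv = N >> 2;                         /* number of full vectors */
    vector float *vin1 = (vector float *) in1;
    vector float *vin2 = (vector float *) in2;
    vector float *vout = (vector float *) out;

    for (i = 0; i < Nv; i++)
        vout[i] = spu_mul(vin1[i], vin2[i]);

    for (i = Nv << 2; i < N; i++)            /* scalar epilogue */
        out[i] = in1[i] * in2[i];

    return 0;
}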

98 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Objectives

ƒ Introduce you to …

– Cell Software Development Kit (SDK) for Multicore Acceleration Version 3.1 – Programming the Cell/B.E (libSPE2, MFC, SIMD, … ) – Programming Models: DaCS, ALF, OpenMP – Programming Tips & Tricks – Performance tools

Trademarks – Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc.

99 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group ALF and DaCS: IBM’s Software Enablement Strategy for Multicore Memory-Hierarchy Systems

100 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

DaCS - Data Communications and Synchronization
ƒ Focused on data movement primitives
– DMA-like interfaces (put, get)
– Message interfaces (send/recv)
– Mailbox
– Endian conversion
ƒ Provides process management and accelerator topology services
ƒ Based on a remote memory windows and data channels architecture
ƒ Common API – intra-accelerator (Cell BE), host - accelerator
ƒ Double and multi-buffering
– Efficient data transfer to maximize available bandwidth and minimize inherent latency
– Hides the complexity of asynchronous compute/communicate from the developer
ƒ Supports ALF; directly used by a US national lab for a host-accelerator HPC environment
ƒ Thin layer of API support on Cell BE native hardware
ƒ Hybrid DaCS (not applicable to BSC prototype)

101 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

DaCS - APIs

ƒ Init / Term ƒ Data Communication ƒ Locking Primitives – dacs_runtime_init – dacs_remote_mem_create – dacs_mutex_init – dacs_runtime_exit – dacs_remote_mem_share – dacs_mutex_share – dacs_remote_mem_accept – dacs_mutex_accept ƒ Reservation Service – dacs_remote_mem_release – dacs_mutex_lock – dacs_get_num_ – dacs_remote_mem_destroy – dacs_mutex_try_lock avail_children – dacs_remote_mem_query – dacs_mutex_unlock – dacs_reserve_children – dacs_put – dacs_mutex_release – dacs_release_de_list – dacs_get – dacs_mutex_destroy – dacs_put_list ƒ Process Management – dacs_get_list ƒ Error Handling – dacs_de_start – dacs_send – dacs_errhandler_reg – dacs_num_processes_suppor – dacs_recv – dacs_strerror ted – dacs_mailbox_write – dacs_error_num – dacs_num_processes_runnin – dacs_mailbox_read – dacs_error_code g – dacs_mailbox_test – dacs_error_str – dacs_de_wait – dacs_wid_reserve – dacs_error_de – dacs_de_test – dacs_wid_release – dacs_error_pid – dacs_test ƒ Group Functions – dacs_wait – dacs_group_init – dacs_group_add_member – dacs_group_close – dacs_group_destroy – dacs_group_accept – dacs_group_leave – dacs_barrier_wait

102 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

DaCS Example (PPU code)

Fortran main program:

program main
  integer, parameter :: N=64   ! must match DATA_BUFFER_ENTRIES in control_blk.h
  real, dimension(N) :: arr
  integer(kind=8) :: aptr
  integer :: amod
  integer :: j
  interface
    subroutine update_array(x, i)
      real x(*)
      integer :: i
    end subroutine
  end interface

  pointer(aptr, arr)
  aptr = malloc(N*4+16)
  write(*,'(A,z8,A,z8,A)') 'loc arr[', loc(arr), '] aptr[', aptr, ']'
  arr = (/(real(i), i=1, N)/)
  call update_array(arr, N)
  write(*,'(8f6.1)') arr
end program

C glue routine:

#include <dacs.h>
#include "init_cb.h"

extern dacs_program_handle_t array_spu;
uint32_t local_error_handler(dacs_error_t);

void update_array_(float *carr, int *nptr)
{
    DACS_ERR_T rc;
    int32_t status;
    de_id_t rsvd_child_des[MAX_UNIT_COUNT];
    dacs_process_id_t rsvd_child_pids[MAX_UNIT_COUNT];
    dacs_remote_mem_t carr_remote_mem;
    dacs_wid_t ppu_wid[MAX_UNIT_COUNT];

    rc = dacs_init(0);
    ERRCHK("PPE: dacs_init", rc);
    rc = dacs_errhandler_reg(local_error_handler, 0);
    ERRCHK("PPE: dacs_errhandler_reg", rc);

    // reserve the max spes we want
    count_spes = MAX_UNIT_COUNT;
    rc = dacs_reserve_children(DACS_DE_SPE, &count_spes, rsvd_child_des);
    ERRCHK("PPE: dacs_reserve_children", rc);

    // release extra children
    if (count_spes > num_rsvd_child) {
        rc = dacs_release_de_list(count_spes - num_rsvd_child,
                                  &rsvd_child_des[num_rsvd_child]);
    }
    ...
    rc = dacs_remote_mem_create(carr, DATA_BUFFER_SIZE, DACS_READ_WRITE,
                                &carr_remote_mem);
    ERRCHK("PPE: dacs_remote_mem_create", rc);
    ... (continues on the next slide)

103 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

DaCS Example (PPU code) (continued)

    for (i = 0; i < num_rsvd_child; i++) {
        ...
        PRINTF("ppu unit.%i sharing mem\n", index);
        rc = dacs_remote_mem_share(rsvd_child_des[i], rsvd_child_pids[i],
                                   carr_remote_mem);
        ERRCHK("PPE: dacs_remote_mem_share", rc);
    }
    ...

104 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

DaCS Example (SPU code)

Fortran compute kernel:

subroutine compute_f(x, n)
  real :: x(*)
  integer :: n
  real :: factr(n)

  factr = (/ (rand(), i=1, n) /)
  x(1:n) = x(1:n) * factr
end subroutine

C main program:

#include <dacs.h>
#include "init_cb.h"

/* local copy of the data array, to be filled by the DMA */
float *data;
/* local copy of the control block, to be filled by the receive */
control_block cb1 __attribute__ ((aligned(128)));

int main(unsigned long long speid, addr64 argp, addr64 envp)
{
    DACS_ERR_T rc;
    dacs_wid_t spu_wid;
    dacs_remote_mem_t spu_remote_mem;

    rc = dacs_init(0);
    ERRCHK("array_spu: dacs_init", rc);
    rc = dacs_wid_reserve(&spu_wid);
    ERRCHK("array_spu: dacs_wid_reserve", rc);

    PRINTF("array_spu: receiving cb\n");
    rc = dacs_recv(&cb1, sizeof(cb1), DACS_DE_PARENT, DACS_PID_PARENT,
                   STREAM_ID, spu_wid, DACS_BYTE_SWAP_DISABLE);
    ERRCHK("array_spu: dacs_recv", rc);
    rc = dacs_wait(spu_wid);
    ERRCHK("array_spu: dacs_wait", rc);

    rc = dacs_remote_mem_accept(DACS_DE_PARENT, DACS_PID_PARENT, &spu_remote_mem);
    n = cb1.chunk_size / sizeof(float);
    data = (float *)_malloc_align(cb1.chunk_size, 7);
    rc = dacs_get(data, spu_remote_mem, cb1.offset, cb1.chunk_size,
                  spu_wid, DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DISABLE);
    rc = dacs_wait(spu_wid);
    PRINTF("array_spu: unit.%i invoking compute\n", index);
    compute_f_(data, &n);
    rc = dacs_put(spu_remote_mem, cb1.offset, data, cb1.chunk_size,
                  spu_wid, DACS_ORDER_ATTR_NONE, DACS_BYTE_SWAP_DISABLE);
    rc = dacs_wait(spu_wid);
    rc = dacs_remote_mem_release(&spu_remote_mem);
    rc = dacs_wid_release(&spu_wid);
    dacs_exit();
    return 0;
}

105 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF - Accelerator Library Framework

ƒ Aims at workloads that are highly parallelizable
– e.g. raycasting, FFT, Monte Carlo, video codecs
ƒ Provides a simple user-level programming framework for Cell library developers that can be extended to other hybrid systems
ƒ Division-of-labor approach
– ALF provides wrappers for computational kernels
– Frees programmers from writing their own architecture-dependent code, including data transfer, task management, double buffering, and data communication
ƒ Manages data partitioning
– Provides efficient scatter/gather implementations via CBE DMA
– Extensible to a variety of data partitioning patterns
– Host and accelerator describe the scatter/gather operations
– Accelerators gather the input data from, and scatter the output data to, the host’s memory
– Manages input/output buffers to/from SPEs
ƒ Remote error handling
ƒ Utilizes the DaCS library for some low-level operations (on Hybrid)

106 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF Data Partitioning

107 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF – Workblock

ƒ Workblock is the basic data unit of ALF ƒ Workblock = Partition Information ( Input Data, Output Data ) + Parameters

ƒ Input Data and Output Data for a work load can be divided into many work blocks

108 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF – Compute task

ƒ Compute Task processes a Work Block
ƒ A task takes in the input data, context, and parameters, and produces the output data

109 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF – Task Context ƒ Provides persistent data buffer across work blocks ƒ Can be used for all-reduce operations such as min, max, sum, average, etc. ƒ Can have both read-only section and writable section – Writable section can be returned to host memory once the task is finished – ALF runtime does not provide data coherency support if there is conflict in writing the section back to host memory – Programmers should create unique task context for each instance of a compute kernel on an accelerator

110 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF – Data Transfer List ƒ Work Block input & output data descriptions are stored as Data Transfer List ƒ Input data transfer list is used to gather data from host memory ƒ Output data transfer list is used to scatter data to host memory

111 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF – Queues
ƒ Two different queues are important to programmers
– Work block queue for each task
• Pending work blocks will be issued to the work queue
• Task instance on each accelerator node will fetch from this queue
– Task queue for each ALF runtime
• Multiple tasks executed at one time – except where the programmer specifies dependencies
• Future tasks can be issued. They will be placed on the task queue awaiting execution.
ƒ ALF runtime manages both queues

ALF – Buffer management on accelerators
ƒ ALF manages buffer allocation in the accelerators’ local memory
ƒ ALF runtime provides pointers to 5 different buffers to a computational kernel
– Task context buffer (RO and RW sections)
– Work block parameters
– Input buffer
– Input/Output buffer
– Output buffer
ƒ ALF implements a best-effort double buffering scheme
– ALF determines if there is enough local memory for double buffering
ƒ Double buffer scenarios supported by ALF
– 4 buffers: [In0, Out0; In1, Out1]
– 3 buffers: [In0, Out0; In1] : [Out1; In2, Out2]

112 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF – Synchronization Support

ƒ Barrier – All work blocks enqueued before the barrier are guaranteed to be finished before any additional work block added after the barrier can be processed on any of the accelerators

– Programmers can register a callback function once the barrier is encountered. The main PPE application thread cannot proceed until the callback function returns.

ƒ Notification – Allows programmers query for a specific work block completion

– Allows programmers to register a callback function once this particular work block has been finished

– Does not provide any ordering

113 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

Basic structure of an ALF application

114 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF Example (PPU code)

/* Compute PI by Buffon-Needle method based on ALF */ #include "alf.h" #include "pi.h" result_t result; char spu_image_path[PATH_BUF_SIZE]; // Used to hold the complete path to SPU image char library_name[PATH_BUF_SIZE]; // Used to hold the name of spu library char spu_image_name[] = "alf_pi_spu"; char kernel_name[] = "comp_kernel"; char ctx_setup_name[] = "context_setup"; char ctx_merge_name[] = "context_merge";

int main(int argc __attribute__ ((__unused__)), char *argv[] __attribute__ ((__unused__))) {

/* Declare variables for ALF runtime */ void *config_parms = NULL; alf_handle_t alf_handle; alf_task_desc_handle_t desc_info_handle; alf_task_handle_t task_handle; alf_wb_handle_t wb_handle; if (alf_init(config_parms, &alf_handle) < 0) { printf("Failed to call alf_init\n"); return 1; }

if (alf_query_system_info(alf_handle,ALF_QUERY_NUM_ACCEL, ALF_ACCEL_TYPE_SPE, &nodes) < 0) { printf("Failed to call alf_query_system_info.\n"); return 1; } else if( nodes <= 0 ) { printf("Cannot allocate spe to use.\n"); return 1; }

alf_num_instances_set(alf_handle, nodes);

/* Create a task descriptor for the task */ alf_task_desc_create(alf_handle, ALF_ACCEL_TYPE_SPE, &desc_info_handle); … (continues on next page)

115 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF Example (PPU code) (continues from previous page) … alf_task_desc_set_int32(desc_info_handle, ALF_TASK_DESC_WB_PARM_CTX_BUF_SIZE, sizeof(udata_t)); alf_task_desc_set_int32(desc_info_handle, ALF_TASK_DESC_TSK_CTX_SIZE, sizeof(result_t)); alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_IMAGE_REF_L, (unsigned long long) spu_image_name); alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_LIBRARY_REF_L, (unsigned long long) library_name); alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_KERNEL_REF_L, (unsigned long long) kernel_name); alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_CTX_SETUP_REF_L, (unsigned long long) ctx_setup_name); alf_task_desc_set_int64(desc_info_handle, ALF_TASK_DESC_ACCEL_CTX_MERGE_REF_L, (unsigned long long) ctx_merge_name); alf_task_desc_ctx_entry_add(desc_info_handle, ALF_DATA_DOUBLE, 1); /* result_t.value */ alf_task_desc_ctx_entry_add(desc_info_handle, ALF_DATA_INT32, 2); /* result_t.wbs and result_t.pad */

// Create task alf_task_create(desc_info_handle, &result, 0, 0, 1, &task_handle); result.value = 0.0; my_udata.iter = ITERATIONS;

for (i = 0; i < WBS; i++) { alf_wb_create(task_handle, ALF_WB_SINGLE, 1, &wb_handle); my_udata.seed = rand(); alf_wb_parm_add(wb_handle, (void *)&(my_udata.seed), 1, ALF_DATA_INT32, 2); alf_wb_parm_add(wb_handle, (void *)&(my_udata.iter), 1, ALF_DATA_INT32, 2); alf_wb_enqueue(wb_handle); } alf_task_finalize(task_handle);

/* Wait until the task is finished */ alf_task_wait(task_handle, -1);

/* Cleanup */ alf_task_desc_destroy(desc_info_handle); alf_exit(alf_handle, ALF_EXIT_POLICY_FORCE, 0); pi = result.value / result.wbs; printf("PI = %f\n", pi); return 0; }

116 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ALF Example (SPU code) #include "alf_accel.h" int context_merge(void* p_dst_task_ctx, void* p_task_ctx) { #include "../pi.h" result_t *tsk_ctx_ptr = (result_t *)p_task_ctx; int comp_kernel(void *p_task_ctx, result_t *dst_tsk_ctx_ptr = (result_t *)p_dst_task_ctx; void *p_parm_ctx_buffer, tsk_ctx_ptr->value += dst_tsk_ctx_ptr->value; void *p_input_buffer __attribute__ ((unused)), tsk_ctx_ptr->wbs += dst_tsk_ctx_ptr->wbs; void *p_output_buffer __attribute__ ((unused)), return 0; void *p_inout_buffer __attribute__ ((unused)), } unsigned int current_count __attribute__ ((unused)), unsigned int total_count __attribute__ ((unused))) { ALF_ACCEL_EXPORT_API_LIST_BEGIN ALF_ACCEL_EXPORT_API("", comp_kernel); udata_t *my_udata_ptr = (udata_t *)p_parm_ctx_buffer; ALF_ACCEL_EXPORT_API("", context_setup); result_t *tsk_ctx_ptr = (result_t *)p_task_ctx; ALF_ACCEL_EXPORT_API("", context_merge); ALF_ACCEL_EXPORT_API_LIST_END srand(my_udata_ptr->seed); for (i = 0; i < my_udata_ptr->iter; i++) { x = rand() * 1.0 / RAND_MAX; y = rand() * 1.0 / RAND_MAX; if (x*x + y*y <= 1.0) result += 1; } tsk_ctx_ptr->value += (result * 4.0 / ITERATIONS); tsk_ctx_ptr->wbs ++ ; return 0; }

int context_setup(void *p_task_ctx) { result_t *tsk_ctx_ptr = (result_t *)p_task_ctx; tsk_ctx_ptr->value = 0.0; tsk_ctx_ptr->wbs = 0; return 0; }

117 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

OpenMP for Cell B.E.

ƒ Initial OpenMP support provided by the IBM XL C/C++ for Multicore Acceleration for Linux, V10.1 – Single source compiler technology • cbexlc, cbexlc++, cbexlC • -qarch=celledp, -qtune=celledp – Compiler hides complexity of DMA transfers, code partitioning, overlays, etc.. from the programmer – Supports the OpenMP API Version 2.5 specification. • #pragma omp atomic, #pragma omp barrier, #pragma omp critical • #pragma omp flush, #pragma omp for, #pragma omp master • #pragma omp ordered, #pragma omp parallel, #pragma omp parallel for • #pragma omp parallel sections, #pragma omp section, #pragma omp sections • #pragma omp single, #pragma omp threadprivate – OMP_NUM_THREADS Æ Number of SPUs ƒ No Fortran compiler available yet.

118 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group OpenMP for Cell B.E. (Example)

#include <stdio.h>    /* header names were lost on the original slide; */
#include <stdlib.h>   /* these three are assumed */
#include <omp.h>

#define SIZE 10000000 #define NITERS 100

float v[SIZE], v1[SIZE], v2[SIZE];

int main (void) { int i; int iter;

for (i = 0; i < SIZE; i++) { v[i] = 0; v1[i] = i; v2[i] = i; }

#pragma omp parallel private (iter) for (iter = 0; iter < NITERS; iter++) #pragma omp for for (i = 0; i < SIZE; i++) v[i] = v1[i] + v2[i];

for (i = 0; i < SIZE; i++) { if (v[i] != i+i) { break; } }

if (i == SIZE) printf("vecadd Valid\n"); else printf("vecadd Invalid\n"); }

119 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Objectives

ƒ Introduce you to …

– Cell Software Development Kit (SDK) for Multicore Acceleration Version 3.1 – Programming the Cell/B.E (libSPE2, MFC, SIMD, … ) – Programming Models: DaCS, ALF, OpenMP – Programming Tips & Tricks – Performance tools

Trademarks – Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc.

120 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

ƒ Key programming techniques to exploit Cell hardware organization and language features

ƒ SPE Æ SPU Programming Tips
1. Level of programming (assembler, intrinsics, auto-vectorization)
2. Overlap DMA with computation (double, multiple buffering)
3. Dual-issue rate (instruction scheduling)
4. Design for limited local store
5. Branch hints or elimination
6. Loop unrolling and pipelining
7. Integer multiplies (avoid 32-bit integer multiplies)
8. Avoid scalar code
9. Choose the right SIMD strategy
10. Load / store only by quadword

121 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group 1. Programming Levels on Cell BE

Trade-off: performance vs. effort
ƒ Expert level
– Assembler: high performance, high effort
ƒ More ease of programming
– C compiler, vector data types, intrinsics; the compiler schedules instructions and allocates registers
ƒ Auto-SIMDization
– for scalar loops; the user should support it with alignment directives; the compiler provides feedback about SIMDization
ƒ Highest degree of ease of use
– user-guided parallelization necessary; Cell BE looks like a single processor

Requirements for the compiler increase with each level

122 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group 2. Overlap DMA with computation ƒ Double or multi-buffer code or (typically) data ƒ Example for double buffering n+1 data blocks: – Use multiple buffers in local store – Use unique DMA tag ID for each buffer – Use fence commands to order DMAs within a tag group – Use barrier commands to order DMAs within a queue ƒ Use SPE-initiated DMA transfers rather than PPE-initiated DMA transfers, because – there are more SPEs than the one PPE – the PPE can enqueue only eight DMA requests whereas each SPE can enqueue 16 ƒ When using DMA buffers, declare the DMA buffers as volatile to ensure that buffers are not accessed by SPU load or store instructions until after DMA transfers have completed – Channel commands are ordered with respect to volatile-memory accesses. The DMA commands specify LS addresses as volatile, void pointers. By declaring all DMA buffers as volatile, it forces all accesses to these buffers to be performed (that is, they cannot be cached in a register) and ordered. ƒ When coding DMA transfers, exploit DMA list transfers whenever possible

123 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

3. Dual Issue Rate (Instruction Scheduling)

124 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group 4. Design for Limited Local Store

ƒ The Local Store holds up to 256 KB for – the program, stack, local data structures, and DMA buffers. ƒ Most performance optimizations put pressure on local store (e.g. multiple DMA buffers) ƒ Use plug-ins (runtime download program kernels) to build complex function servers in the LS.

125 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group 5. Branch Optimizations

ƒ SPE
– Heavily pipelined Æ high penalty for branch misses (18 cycles)
– Hardware policy: assume all branches are not taken
ƒ Advantage
– Reduced hardware complexity, faster clock cycles, increased predictability
ƒ Solution approaches, for a branchy statement such as
    if (a>b) c += 1; else c = a+b;
– If-conversion: compare and select operations (spu_sel)
    select = spu_cmpgt(a, b);
    c1 = spu_add(c, 1);
    ab = spu_add(a, b);
    c = spu_sel(ab, c1, select);
– Predication/code re-org: compiler analysis, user directives
• Use feedback-directed optimization
– Branch hint instruction (hbr, at least 11 cycles before the branch)
• Use __builtin_expect when the programmer can explicitly direct branch prediction
    if (__builtin_expect(a>b, 0)) c += 1; else c = a+b;  // predict a is not > b

126 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group 6(a). Loop Unrolling – Unroll loops • to reduce dependencies • increase dual-issue rates – This exploits the large SPU register file. – Compiler auto-unrolling is not perfect, but pretty good.

– Example: removing a loop-carried dependence

  Before:
    j = N;
    for (i=1; i<N; i++) {
        a[i] = (b[i] + b[j]) / 2;
        j = i;
    }

  After:
    a[1] = (b[1] + b[N]) / 2;
    for (i=2; i<N; i++) {
        a[i] = (b[i] + b[i-1]) / 2;
    }

– Example: unrolling by a factor of 2

  Before:
    for (i=1; i<100; i++) {
        a[i] = b[i+2] * c[i-1];
    }

  After:
    for (i=1; i<99; i+=2) {
        a[i]   = b[i+2] * c[i-1];
        a[i+1] = b[i+3] * c[i];
    }
    a[99] = b[101] * c[98];   /* leftover odd iteration */

127 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

6(b). SPU – Software Pipeline

Design for balanced pipeline use

128 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group 7. Integer Multiplies

ƒ Avoid integer multiplies on operands greater than 16 bits
– SPU supports only a 16-bit x 16-bit multiply
– a 32-bit multiply requires five instructions (three 16-bit multiplies and two adds)
ƒ Keep array elements sized to a power of 2 to avoid multiplies when indexing
ƒ Cast operands to unsigned short prior to multiplying. Constants are of type int and also require casting.
ƒ Use a macro to explicitly perform 16-bit multiplies. This can avoid inadvertent introduction of sign extends and masks due to casting.

#define MULTIPLY(a, b) \
    (spu_extract(spu_mulo((vector unsigned short)spu_promote(a, 0), \
                          (vector unsigned short)spu_promote(b, 0)), 0))

129 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

8. Avoid Scalar Code

ƒ Use spu_promote and spu_extract to efficiently promote scalars to vectors, or vectors to scalars.

130 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group 9. Choose an SIMD strategy appropriate for your algorithm

ƒ Evaluate the array-of-structures (AOS) organization
– For graphics vertices, this organization (also called vector-across) can have more-efficient code size and simpler DMA needs,
– but less-efficient computation unless the code is unrolled.

ƒ Evaluate the structure-of-arrays (SOA) organization
– For graphics vertices, this organization (also called parallel-array) can be easier to SIMDize,
– but the data must be maintained in separate arrays, or the SPU must shuffle AOS data into an SOA form.

ƒ Consider unrolling effects when picking the SIMD strategy (a small layout illustration follows)
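A small illustration (not from the slides; NUM_VERTS is a hypothetical size) of the two layouts for 4-component vertices:

#include <spu_intrinsics.h>

#define NUM_VERTS 1024

typedef struct {                 /* AOS: one structure per vertex */
    float x, y, z, w;
} vertex_aos;

typedef struct {                 /* SOA: one array per component; each
                                    vector float holds the same component
                                    of four consecutive vertices */
    vector float x[NUM_VERTS/4];
    vector float y[NUM_VERTS/4];
    vector float z[NUM_VERTS/4];
    vector float w[NUM_VERTS/4];
} vertices_soa;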

131 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group 10. Load / Store by Quadword

ƒ Scalar loads and stores are slow, with long latency. ƒ SPUs only support quadword loads and stores. ƒ Consider making scalars into quadword integer vectors. ƒ Load or store scalar arrays as quadwords, and perform your own extraction and insertion to eliminate load and store instructions.

132 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Objectives

ƒ Introduce you to …

– Cell Software Development Kit (SDK) for Multicore Acceleration Version 3.1 – Programming the Cell/B.E (libSPE2, MFC, SIMD, … ) – Programming Models: DaCS, ALF, OpenMP – Programming Tips & Tricks – Performance tools

Trademarks – Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc.

133 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Performance Tools Architecture

[Architecture diagram] On Windows / AIX: Visual Performance Analyzer with a remote data collector and visualization plugins – Profile Analyzer, Counter Analyzer, Code Analyzer, Trace Analyzer.

[On the Cell Blade] Remote tools invocation and data collection; post-processing / analyzers: FDPR-Pro, PDTR, OProfile, Performance Debugging Tool (PDT).

[Data collection layer] CellPerfCount (CPC), perfmon2, instrumentation probes (trace data); hardware and software counters and probes.

134 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group CPC (Cell Performance Counter) - PMU features

• Four 32-bit counters for each Cell processor • Each can also be used as two 16-bit counters

• 1400+ events available to count

• Hardware sampling
• Specify initial counter values and the sampling time interval
• The PMU records and resets counter values after each interval
• Samples available in a hardware trace buffer
• The sample sequence can be annotated through writes to the PPU bookmark register
• This reduces the number of calls that CPC has to make into the kernel

• Seven “logic islands”: PPU, PPSS, SPU, MFC, EIB, MIC, and BEI
• Each island has groups of signals
• Each signal represents a hardware event
• The PMU can monitor two signal groups
► Monitor any number of signals within one group – up to the number of available counters

Man page available in CPC home page http://aixptools.austin.ibm.com/perf/w3_tools/cpc/

135 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

CPC Usage

Command-line tool

Workload mode
Counters are active only during the complete execution of a workload.
Provides a very accurate view of the performance of a single process.

System-wide mode Counters monitor all processes running on specified CPUs for specified duration

Several output formats
Text, HTML, XML (as input to VPA)

136 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group OProfile Overview

ƒ Periodically sample the program counter (PC) – Give a statistical profile of which application, and where in the application the CPU was executing

ƒ Profile based on hardware performance counter events
– Supports collecting profiles on multiple events at a time
– Available performance counter events can be listed with
• opcontrol --list-events
– opcontrol can also specify which event to use, the frequency of sampling, and the mode (user, kernel, or both)

ƒ opreport generate the profiles – XML output is supported, and is accepted by VPA

ƒ User's manual can be found at: – http://oprofile.sourceforge.net/doc/

137 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group OProfile - Summary

ƒ OProfile is a powerful tool for identifying where the time is being spent in the system or in an application. ƒ OProfile can be used to identify where time, cache misses, DTLB misses etc. are occurring. ƒ PPU time/event profile support is available in SDK2.0; SPU time profile support is available in SDK 2.1 and later. ƒ We are exploring the possibility of SPU event profiling for SDK release in 2008

138 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group PDT - introduction

• Performance Debugging Tool
• Instruments events for collection of trace records
• in real time
• at the application level
• minimal interference with the application
• very small amount of code added and memory used on the SPE
• All the relevant Opteron, PPE, and SPE SDK functions are instrumented
• the programmer can specify events of interest
• Application source code does not need to be modified
• code relinking (shared libraries) or rebuilding is needed
► PPE code can only use static libraries, so it must always be rebuilt
• Trace records are collected at the PPE level
• adds SPE/PPE communications

139 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

PDT - categories of traced events

• PPE events
  • SPE control
  • DMA transfers
  • Mailbox usage
  • Sync mutex operations
  • User-defined events

• SPE events
  • DMA control
  • DMA transfers
  • Mailbox usage
  • Sync mutex operations
  • User-defined events

• Libraries instrumented
  • on PPE
    ► DaCS, ALF, libspe2 and libsync
  • on SPE
    ► DaCS, ALF, libsync, spu_mfcio.h and the overlay manager
• RPMs with -trace suffix contain instrumented versions of libraries

140 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

PDTR (PDT Reader) - Overview

ƒ Command line PDT trace post-processor – Support PDT's XML meta file trace format

ƒ Generates text output – Quick PDT trace analysis before using VPA for further GUI based analysis

ƒ Produces various summary output reports – Lock statistics – DMA statistics – Mailbox usage statistics – Overall event profiles

ƒ Also produces sequential report with time-stamped event and its parameters per line

ƒ Provides full SPE and PPE, data and instruction address to name mapping, including for SPE overlays

141 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group PDTR - Example of General Summary Report

General Summary Report
======================

1.107017 seconds in trace
Total trace events: 3975

Count  EvID  Event                              min      avg      max      evmin,evmax
-----  ----  ---------------------------------  -------  -------  -------  -----------
  672  1202  SPE_MFC_READ_TAG_STATUS            139.7ns  271.2ns  977.8ns  26, 3241
  613  0206  _DACS_HOST_MUTEX_LOCK              349.2ns  1.6us    20.3us   432, 2068
  613  0406  _DACS_HOST_MUTEX_UNLOCK
  336  0402  SPE_MFC_GETF
  240  1406  _DACS_SPE_MUTEX_UNLOCK
  239  1206  _DACS_SPE_MUTEX_LOCK               279.4ns  6.6us    178.7us  773, 3152
  224  0102  SPE_MFC_PUTF
   99  0200  HEART_BEAT
   96  0302  SPE_MFC_GET
   64  1702  SPE_READ_IN_MBOX                   139.7ns  3.7us    25.0us   21, 191
   16  0002  SPE_MFC_PUT
   16  2007  _DACS_MBOX_READ_ENTRY
   16  2107  _DACS_MBOX_READ_EXIT_INTERVAL      11.7us   15.2us   25.9us   557, 192
   16  0107  _DACS_RUNTIME_INIT_ENTRY
   16  2204  _DACS_MBOX_WRITE_ENTRY
   16  0207  _DACS_RUNTIME_INIT_EXIT_INTERVAL   6.6us    7.4us    8.2us    29, 2030
   16  2304  _DACS_MBOX_WRITE_EXIT_INTERVAL     4.1us    6.0us    11.3us   2011, 193
   16  0601  SPE_PROGRAM_LOAD
   16  0700  SPE_TRACE_START
   16  0800  SPE_TRACE_END
   16  2A04  _DACS_MUTEX_SHARE_ENTRY
  ...

142 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

PDTR - Sequential Trace Output Example

Reading 20061013093100.1.trace
----- Trace File(s) ------
Event type size: 0  Metafile version: 1
 0  0.000000   0.000ms  PPE_HEART_BEAT         PPU 00000001 TB:0000000000000000
 1  0.005025   5.025ms  PPE_SPE_CREATE_GROUP   PPU F7FC73A0 TB:000000000001190F *** Unprocessed event ***
 2  0.015968  10.943ms  PPE_HEART_BEAT         PPU 00000001 TB:0000000000037D13
 3  0.031958  15.990ms  PPE_HEART_BEAT         PPU 00000001 TB:000000000006FB69
 4  0.047957  15.999ms  PPE_HEART_BEAT         PPU 00000001 TB:00000000000A7A3B
 5  0.053738   5.781ms  PPE_SPE_CREATE_THREAD  PPU F7FC73A0 TB:00000000000BBD90
 6  0.053768   0.030ms  PPE_SPE_WRITE_IN_MBOX  PPU F7FC73A0 TB:00000000000BBF3F *** Unprocessed event ***
 :
20     0   0.000us  SPE_ENTRY                      SPU 1001F348 Decr:00000001
21   163  11.384us  SPE_MFC_WRITE_TAG_MASK         SPU 1001F348 Decr:000000A4 *** Unprocessed event ***
22   170   0.489us  SPE_MFC_GET                    SPU 1001F348 Decr:000000AB Size: 0x80 (128), Tag: 0x4 (4)
23   176   0.419us  SPE_MFC_WRITE_TAG_UPDATE       SPU 1001F348 Decr:000000B1 *** Unprocessed event ***
24   183   0.489us  SPE_MFC_READ_TAG_STATUS_ENTRY  SPU 1001F348 Decr:000000B8
25   184   0.070us  SPE_MFC_READ_TAG_STATUS_EXIT   SPU 1001F348 Decr:000000B9
        >>> delta tics:1 ( 0.070us) rec:24 {DMA done[tag=4,0x4] rec:22 0.978us 130.9MB/s}
26   191   0.489us  SPE_MUTEX_LOCK_ENTRY           SPU 1001F348 Lock:1001E280 Decr:000000C0
 :
33  4523   0.210us  SPE_MUTEX_LOCK_EXIT            SPU 1001F348 Lock:1001E280 Decr:000011AC
        >>> delta tics:3 ( 0.210us) rec:32
34     0   0.000us  SPE_ENTRY                      SPU 1001F9D8 Decr:00000001
35    96   6.705us  SPE_MFC_WRITE_TAG_MASK         SPU 1001F9D8 Decr:00000061 *** Unprocessed event ***
36   103   0.489us  SPE_MFC_GET                    SPU 1001F9D8 Decr:00000068 Size: 0x80 (128), Tag: 0x4 (4)
37   109   0.419us  SPE_MFC_WRITE_TAG_UPDATE       SPU 1001F9D8 Decr:0000006E *** Unprocessed event ***
38   116   0.489us  SPE_MFC_READ_TAG_STATUS_ENTRY  SPU 1001F9D8 Decr:00000075
39   117   0.070us  SPE_MFC_READ_TAG_STATUS_EXIT   SPU 1001F9D8 Decr:00000076
        >>> delta tics:1 ( 0.070us) rec:38 {DMA done[tag=4,0x4] rec:36 0.978us 130.9MB/s}

143 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

PDTR - Example of DMA Report

DMA Report ======

SPE's: ...

DMA Complete time (from mfc_get/put to tagstatus exit)

lspe  Count   min    avg    max  (us)
----  -----  ----  -----  -----
  16     42   0.6    1.7   24.4
  15     42   0.6    1.5   13.9
  14     42   0.8    2.1   39.6
  13     42   0.6    1.2   13.3
  12     42   0.6    2.3   42.3
  11     42   0.6    1.8   29.6
  10     42   0.6    1.2   11.9
   9     42   0.8    2.3   36.2
   8     42   0.6    1.4   19.6
   7     42   0.8    1.9   26.0
   6     42   0.6    1.8   22.7
   5     42   0.8    2.2   30.5
   4     42   0.6    1.6   22.3
   3     42   0.8    2.7   66.4
   2     42   0.6    1.4   14.9
   1     42   0.6    1.7   29.8

DMA transfer rate:

lspe  Count    min    avg     max  (MB/s)
----  -----  -----  -----  ------
  16     42   23.6  691.3   916.4
  15     42   41.4  688.5   916.4
  14     42   14.5  506.4   687.3
  13     42   43.4  685.5   916.4
  12     42   13.6  532.0   916.4
  11     42   19.5  535.2   916.4
  10     42   48.2  695.6  1030.9
   9     42   15.9  508.4   687.3
   8     42   29.5  697.6   916.4
   7     42   22.2  506.3   687.3
   6     42   25.4  681.8   916.4
   5     42   18.9  502.4   687.3
   4     42   25.8  683.8   916.4
   3     42    8.7  513.9   687.3
   2     42   38.5  549.1   916.4
   1     42   19.3  674.1   916.4

144 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

PDTR - Lock Report Output Example

145 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

Feedback Directed Program Restructuring - FDPR

146 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group FDPR-Pro - overview

• Feedback Directed Program Restructuring
• A post-link program optimizer
• optimizations:
  ► code restructuring
    • inlining
    • loop unrolling
  ► hot-cold code motion
  ► many other optimizations
• Supports AIX, Linux and Windows hosts; Power, z, and Cell targets

[Diagram] The standard development toolchain (compiler, linker, debugger) turns source files into an executable program; FDPR-Pro then (1) instruments the program, (2) runs it on a representative workload to collect an execution profile, and (3) uses the profile to produce an optimized program.

148 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group FDPR-Pro for Cell - the challenges
• Memory-constrained SPE architecture
• the SPE only has access to 256K
• counter information is embedded in the program
► the program is its own profile; write it out when done
► profile counters are extracted using pattern-recognition techniques
► ~5%-20% overhead (1.05X – 1.2X) versus the typical 2X
• Overlaid SPE programs
• read the overlay table and detect the overlay structure
• map the overlaid program to a virtual flat space
• perform regular instrumentation/optimization
• map the flat representation back to the real overlay space

• SPE-specific optimizations
• Instruction prefetch optimization (-RC & -BH)
► use code restructuring (-RC) to minimize taken branches
► profile-based insertion/removal of branch hint (BRH) instructions
• FDPR-Pro can hint on highly probable taken IF conditions
• Profile-based instruction re-alignment
► as above; the compiler correctly schedules hot loops
► FDPR-Pro can re-schedule additional hot code

148 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Visual Performance Analyzer - goals

ƒ Provide a platform-independent, easy-to-use, integrated set of graphical application performance analysis tools

ƒ Leverage existing platform-specific non-GUI performance analysis tools to collect a comprehensive set of data, for instance:
– tprof and hpmcount on AIX
– OProfile, pmcount and CellPerfCounter on Linux
– Performance Inspector on Windows and Linux

ƒ Create a consistent set of integrated tools to provide a platform-independent drill-down performance analysis experience

149 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Visual Performance Analyzer - architecture

Eclipse-based tool set, currently including six plug-in applications working collaboratively: Profile Analyzer, Code Analyzer, Pipeline Analyzer, Counter Analyzer, Trace Analyzer, and the experimental Control Flow Analyzer.

[Diagram] The six analyzers run in an Eclipse environment on Windows/AIX/Linux; a remote data collection initiator gathers performance data from platform-specific data collectors: OProfile, Tprof, Performance Inspector, hpmcount/hpmstat, pmcount, and FDPR-Pro on AIX/Linux System p; OProfile, CellPerfCounter, FDPR-Pro, and PDT on Cell Blade systems; Performance Inspector on Windows System x.

150 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group VPA Components

ƒ Profile Analyzer provides graphical and text-based views that allow users to narrow down performance problems to a particular process, thread, module, symbol, offset, instruction, or source line.

ƒ Code Analyzer examines executable files and displays detailed information about functions, basic blocks, and assembly instructions.

ƒ Pipeline Analyzer examines how code is executed on various IBM POWER processors. Pipeline Analyzer displays the pipeline execution of instruction traces generated by a POWER series processor.

ƒ Counter Analyzer analyzes hardware performance counter data.

ƒ Trace Analyzer visualizes Cell BE traces containing information such as DMA communication, locking/unlocking activities, mailbox messages, etc. Trace Analyzer shows this data organized by core, along a common timeline. Extra details are available for each kind of events, for example, lock identifier for lock operations, accessed address for DMA transfers, etc.

ƒ Control Flow Analyzer analyzes call trace data collected by tools such as the Performance Inspector JProf. The call trace data contains information such as when one method calls another, how much time is spent in every invocation, and so on.

151 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Visual Performance Analyzer - current look

Using RCP allows VPA to be built with a custom appearance.

6 tools

1. Windows, AIX, Linux
2. DB2 and HSQLDB support
3. Java profiling via JVMPI/JVMTI
4. Single tool framework using common data models
5. RSE for remote execution of test case and collection of data

152 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

Where to get more Cell BE information?

153 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Cell Resources

ƒ Cell resource center at developerWorks
– http://www-128.ibm.com/developerworks/power/cell/
ƒ Cell developer's corner at power.org
– http://www.power.org/resources/devcorner/cellcorner/
ƒ The Cell project at IBM Research
– http://www.research.ibm.com/cell/
ƒ The Cell BE at IBM alphaWorks
– http://www.alphaworks.ibm.com/topics/cell
ƒ Cell BE at IBM Engineering & Technical Services
– http://www-03.ibm.com/technology/
ƒ IBM Power Architecture
– http://www-03.ibm.com/chips/power/
ƒ Cell BE documentation at IBM Microelectronics
– http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
ƒ Linux info at the Barcelona Supercomputing Center website
– http://www.bsc.es/projects/deepcomputing/linuxoncell/

154 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Cell Education

ƒ Online courses at IBM Education Assistant – http://publib.boulder.ibm.com/infocenter/ieduasst/stgv1r0/index.jsp ƒ Online courses at IBM Learning – http://ibmlearning.ibm.com/index.html ƒ Podcasts at power.org – http://www.power.org ƒ Onsite classes at IBM Innovation Center – https://www-304.ibm.com/jct09002c/isv/spc/events/cbea.html

155 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group SDK 3.1 Documentation

Programming Standards
– C/C++ Language Extensions for Cell Broadband Engine Architecture
– SPU Application Binary Interface Specification
– SIMD Math Library Specification for Cell Broadband Engine Architecture
– Cell Broadband Engine Linux Reference Implementation Application Binary Interface Specification
– SPU Assembly Language Specification

Software Development Kit
– IBM SDK for Multicore Acceleration Installation Guide
– Cell Broadband Engine Programming Handbook
– Cell Broadband Engine Programming Tutorial
– Cell Broadband Engine Programmer's Guide
– Oprofile (SDK Programmer's Guide)
– PDT (SDK Programmer's Guide)
– Security SDK V3.1 Installation and User's Guide

Programming Tools Documentation
– Performance Analysis with the IBM Full-System Simulator
– IBM Full-System Simulator User's Guide
– XL C/C++ Compiler Information: Installation Guide, Getting Started, Compiler Reference, Language Reference, Programming Guide
– XL Fortran Compiler Information: Installation Guide, Getting Started, Compiler Reference, Language Reference, Programming Guide
– Using the single-source compiler
– IBM Visual Performance Analyzer User's Guide
– Cell Broadband Engine Security Software Development Kit - Installation and User's Guide

Programming Library Documentation
– Data Communication and Synchronization Programmer's Guide and API Reference
– Data Communication and Synchronization for Hybrid-x86 Programmer's Guide and API Reference
– SPE Runtime Management Library
– SPE Runtime Management Library Version 1.2 to 2.2 Migration Guide (revised name)
– Accelerated Library Framework Programmer's Guide and API Reference
– Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference
– Software Development Kit 3.1 SIMD Math Library Specifications
– Basic Linear Algebra Subprograms Programmer's Guide and API Reference
– Example Library API Reference
– Cell BE Monte Carlo Library API Reference Manual
– SPU Timer Library
– Mathematical Acceleration Subsystem (MASS)

156 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Special Notices -- Trademarks

This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.

IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice.

IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

Many of the features described in this document are operating system dependent and may not be available on Linux. For more information, please check: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.html

Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.

Revised January 19, 2006 157 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group

Special Notices (Cont.) -- Trademarks

The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter, Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced Micro-Partitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage, VideoCharger, Virtualization Engine.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries, or both.
Rambus is a registered trademark of Rambus, Inc. XDR and FlexIO are trademarks of Rambus, Inc.
UNIX is a registered trademark of The Open Group in the United States, other countries or both.
Linux is a trademark of Linus Torvalds in the United States, other countries or both.
Fedora is a trademark of Red Hat, Inc.
Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both.
Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries.
AMD Opteron is a trademark of Advanced Micro Devices, Inc.
Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries.
TPC-C and TPC-H are trademarks of the Transaction Processing Performance Council (TPC).
SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC).
AltiVec is a trademark of Freescale Semiconductor, Inc.
PCI-X and PCI Express are registered trademarks of PCI SIG.
InfiniBand is a trademark of the InfiniBand Trade Association.
Other company, product and service names may be trademarks or service marks of others.

Revised July 23, 2006

158 PRACE Winter School 2/16/2009 © 2009 IBM Corporation IBM Systems & Technology Group Special Notices - Copyrights

(c) Copyright International Business Machines Corporation 2005. All Rights Reserved. Printed in the United States September 2005.

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both. IBM IBM Logo Power Architecture

Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

IBM Microelectronics Division
1580 Route 52, Bldg. 504
Hopewell Junction, NY 12533-6351

The IBM home page is http://www.ibm.com
The IBM Microelectronics Division home page is http://www.chips.ibm.com

159 PRACE Winter School 2/16/2009 © 2009 IBM Corporation