
Chip Multiprocessing and the Cell Broadband Engine

Michael Gschwind

© 2006 IBM Corporation ACM Computing Frontiers 2006 Cell Broadband Engine

Computer Architecture at the turn of the millennium

• "The end of architecture" was being proclaimed
  – Frequency scaling as the performance driver
  – State of the art:
    • multiple instruction issue
    • out-of-order architecture
    • register renaming
    • deep pipelines
• Little or no focus on compilers
  – Questions about the need for compiler research
• Academic papers focused on microarchitecture tweaks

2 Chip Multiprocessing and the Cell Broadband Engine © 2006 IBM Corporation M. Gschwind, Computing Frontiers 2006 Cell Broadband Engine

The age of frequency scaling

[Figure: Dennard constant-field scaling of a MOSFET (scaled device: gate oxide tox/α, channel length L/α, depletion depth xd/α, wiring width W/α, voltage V/α, substrate doping α·NA), alongside a plot of performance (bips) vs. total FO4 per stage for deeper pipelines. Source: Dennard et al., JSSC 1974]

SCALING: voltage V/α; oxide tox/α; wire width W/α; gate width L/α; diffusion xd/α; substrate doping α·NA.
RESULTS: higher density ~α²; higher speed ~α; power/circuit ~1/α²; power density ~constant.
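The scaling results above follow directly from the rules; this small sketch (illustrative arithmetic, not from the talk) checks that the constant-field consequences are consistent:

```python
def dennard_scale(alpha):
    """Ideal constant-field scaling: all dimensions and voltage shrink by 1/alpha."""
    density = alpha ** 2                      # ~alpha^2 more circuits per unit area
    speed = alpha                             # ~alpha higher switching speed
    power_per_ckt = 1.0 / alpha ** 2          # C*V^2*f per circuit: (1/a)(1/a^2)(a) = 1/a^2
    power_density = density * power_per_ckt   # the two effects cancel: ~constant
    return density, speed, power_per_ckt, power_density
```

For one process generation with α = 2, density quadruples, speed doubles, and power density stays flat, which is why frequency scaling was "free" while ideal scaling held.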


Frequency scaling

• A trusted standby
  – Frequency as the tide that raises all boats
  – Increased performance across all applications
  – Kept the industry going for a decade
• Massive investment to keep it going
  – Equipment cost
  – Power increase
  – Manufacturing variability
  – New materials


… but reality was closing in!

[Figure: bips and bips³/W relative to the optimal FO4 point, together with active and leakage power (log scale), plotted against total FO4 per stage.]


Power crisis

• The power crisis is not "natural"
  – Created by deviating from ideal scaling theory
  – Vdd and Vt were not scaled by α
    • additional performance bought with increased voltage
• Laws of physics
  – Pchip = ntransistors × Ptransistor
• Marginal performance gain per transistor is low
  – significant power increase
  – decreased power/performance efficiency

[Figure: Tox (Å), Vdd, and Vt vs. gate length Lgate (µm), on log scales, showing the departure from classic scaling.]


The power inefficiency of deep pipelining

[Figure: bips and bips³/W relative to the optimal FO4 point vs. total FO4 per stage; the performance-optimal pipeline depth lies well beyond the power-performance-optimal depth. Source: Srinivasan et al., MICRO 2002]

• Deep pipelining increases the number of latches and the switching rate ⇒ power increases with at least f²
• Latch insertion delay limits the gains of pipelining
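A back-of-the-envelope sketch (illustrative, not from the slides) of why latch power grows at least quadratically with frequency: deepening the pipeline adds stages, and hence latches, roughly in proportion to frequency, and every latch then also switches at that higher frequency:

```python
def relative_latch_power(freq_scale):
    """Dynamic power of pipeline latches, relative to the baseline design."""
    latch_count = freq_scale   # deeper pipeline: stage (and latch) count grows ~f
    switch_rate = freq_scale   # each latch also toggles at the higher clock
    return latch_count * switch_rate   # => power grows with at least f^2
```

Doubling frequency by pipelining thus at least quadruples latch power, before counting the extra voltage usually needed to hit the higher clock.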


Hitting the memory wall

• Latency gap
  – Memory speedup lags behind processor speedup
  – limits ILP
  [Figure: normalized MFLOPS vs. memory latency, 1990–2003. Source: McKee, Computing Frontiers 2004]
• Chip I/O bandwidth gap
  – Less bandwidth per MIPS
• Latency gap as application bandwidth gap
  – usable bandwidth = (no. of in-flight requests × avg. request size) / roundtrip latency (Source: Burger et al., ISCA 1996)
  – Typically (much) less than chip I/O bandwidth
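The bandwidth relation above is Little's law; a small sketch with illustrative numbers (the 500-cycle round trip echoes the later "traditional memory usage" slide, the 128-byte request size is an assumption):

```python
def usable_bandwidth(inflight_requests, avg_request_bytes, roundtrip_cycles):
    """Sustained bandwidth in bytes/cycle: in-flight requests * request size / latency."""
    return inflight_requests * avg_request_bytes / roundtrip_cycles

# 8 in-flight 128-byte requests over a 500-cycle round trip:
# only ~2 bytes/cycle, (much) less than chip I/O peak bandwidth.
cache_like = usable_bandwidth(8, 128, 500)

# Few, large, deeply queued transfers over the same latency sustain far more
# (here the formula exceeds any physical peak, so the link itself becomes the limit):
dma_like = usable_bandwidth(16, 16 * 1024, 500)
```

This is the arithmetic behind the Cell approach: to use the chip I/O bandwidth you must raise either the request size or the number of outstanding requests, which is exactly what block DMA transfers do.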


Our Y2K challenge

• 10× the performance of desktop systems

• 1 TeraFlop/s with a four-node configuration

• 1 Byte of bandwidth per 1 Flop
  – the "golden rule of balanced supercomputer design"

• A scalable design across a range of design points

• Mass-produced and low cost


Cell Design Goals

• Provide the platform for the future of computing
  – 10× the performance of desktop systems shipping in 2005
• Computing density as the main challenge
  – Dramatically increase performance per X
    • X = area, power, volume, cost, …
• Single-core designs offer diminishing returns on investment
  – in power, area, design complexity, and verification cost


Necessity as the mother of invention

• Increase power-performance efficiency
  – Simple designs are more efficient in terms of power and area
• Increase memory subsystem efficiency
  – Increase the data transaction size
  – Increase the number of concurrently outstanding transactions
  ⇒ Exploit a larger fraction of chip I/O bandwidth
• Use CMOS density scaling
  – Exploit density instead of frequency scaling to deliver increased aggregate performance
• Use compilers to extract parallelism from applications
  – Exploit application parallelism to translate aggregate performance into application performance


Exploit application-level parallelism

• Data-level parallelism
  – SIMD parallelism improves performance with little overhead
• Thread-level parallelism
  – Exploit application threads with a multi-core design approach
• Memory-level parallelism
  – Improve memory access efficiency by increasing the number of parallel memory transactions
• Compute-transfer parallelism
  – Transfer data in parallel with execution by exploiting application knowledge


Concept phase ideas

[Diagram: word cloud of concept-phase ideas: modular design, design reuse, large architected register set, chip multiprocessor, control of coherency cost, high frequency and low power, reverse Vdd scaling for low power, efficient use of the memory interface.]


Cell Architecture

• Heterogeneous multi-core system architecture
  – Power Processor Element (PPE) for control tasks
  – Synergistic Processor Elements (SPEs) for data-intensive processing
• A Synergistic Processor Element (SPE) consists of
  – a Synergistic Processor Unit (SPU)
  – Synergistic Memory Flow Control (SMF)

[Diagram: eight SPEs (each an SPU/SXU with local store (LS) and SMF) attached by 16B/cycle ports to the EIB (up to 96B/cycle), together with the PPE (PPU with PXU, L1, and L2; 64-bit Power Architecture with VMX), the MIC to dual XDR™ memory, and the BIC to FlexIO™.]


Shifting the Balance of Power with Cell Broadband Engine

• A data processor instead of a control system
  – Control-centric code is stable over time
  – Big growth in data processing needs
    • Modeling
    • Games
    • Digital media
    • Scientific applications

• Today's architectures are built on a 40-year-old data model
  – Efficiency as defined in 1964
  – Big overhead per data operation
  – Parallelism added as an afterthought


Powering Cell – the Synergistic Processor Unit

[Diagram: SPU organization: 64B instruction fetch into the instruction line buffer (ILB); issue of 2 instructions; single-port local store SRAM with a 128B port, shared with the SMF; a vector register file (VRF) feeding the VFPU, VFXU, permute, and load/store units over 16B-wide operand paths. Source: Gschwind et al., Hot Chips 17, 2005]

Density Computing in SPEs

• Today, execution units are only a fraction of core area and power
  – The bigger fraction goes to other functions
    • Address translation and privilege levels
    • Instruction reordering
    • Register renaming
    • Cache hierarchy
• Cell changes this ratio to increase performance per area and power
  – Architectural focus on data processing
    • Wide datapaths
    • More, and wider, architectural registers
    • Data privatization and a single-level processor-local store
    • All code executes at a single (user) privilege level
    • Static scheduling


Streamlined Architecture

• Architectural focus on simplicity
  – Aids achievable operating frequency
  – Optimize circuits for the common performance case
  – The compiler aids in layering traditional hardware functions
  – Leverage 20 years of architecture research
• Focus on statically scheduled data parallelism
  – Focus on data-parallel instructions
    • No separate scalar execution units
    • Scalar operations mapped onto the data-parallel dataflow
  – Exploit wide data paths
    • data processing
    • instruction fetch
  – Address impediments to static scheduling
    • Large register set
    • Reduce latencies by eliminating non-essential functionality


SPE Highlights

• RISC organization
  – 32-bit fixed-length instructions
  – Load/store architecture
  – Unified register file
• User-mode architecture
  – No page translation within the SPU
• SIMD dataflow
  – Broad set of operations (8, 16, 32, 64 bit)
  – Graphics SP-float
  – IEEE DP-float
• 256KB local store
  – Combined I & D
• DMA block transfer
  – using Power Architecture memory translation

[Die photo: SPE floorplan, 14.5mm² in 90nm SOI, showing the local store, even and odd fixed-point pipelines, forwarding network, GPR file, control, channel, DMA, SMM, ATO, RTB, and SBI blocks. Source: Kahle, Spring Processor Forum 2005]

Synergistic Processing

[Diagram: word cloud of synergistic processing themes: DLP, large register file, shorter pipeline, scalar layering, instruction bundling, static scheduling, static prediction, data-parallel select, high frequency, simpler µarch, wide data paths, deterministic-latency optimization, large basic blocks, ILP, local store, compute density, single-port sequential fetch.]


Efficient data-sharing between scalar and SIMD processing

• Legacy architectures separate scalar and SIMD processing
  – Data sharing between SIMD and scalar processing units is expensive
    • Transfer penalty between register files
  – Defeats the data-parallel performance improvement in many scenarios
• A unified register file facilitates data sharing for efficient exploitation of data parallelism
  – Allows exploitation of data parallelism without a data transfer penalty
    • Data parallelism is then always an improvement
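A plain-Python stand-in (illustrative only) for how a scalar operation can be layered on the SIMD dataflow: the scalar lives in one slot of a wide register, so no transfer between separate register files is ever needed:

```python
VECTOR_LANES = 4  # e.g. four 32-bit words in a 128-bit register

def simd_add(va, vb):
    """One data-parallel operation across all lanes."""
    return [a + b for a, b in zip(va, vb)]

def scalar_add(x, y):
    """Scalar op mapped onto the SIMD dataflow: operands sit in lane 0
    (a 'preferred slot'); the remaining lanes carry don't-care values."""
    vx = [x] + [0] * (VECTOR_LANES - 1)
    vy = [y] + [0] * (VECTOR_LANES - 1)
    return simd_add(vx, vy)[0]
```

Because scalar and vector values share one register file, a compiler can mix scalar and SIMD code freely without the transfer penalty the slide describes.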


SPU Pipeline

[Diagram: SPU pipeline. Front end: IF1–IF5 (instruction fetch), IB1–IB2 (instruction buffer), ID1–ID3 (instruction decode), IS1–IS2 (instruction issue). Back end: RF1–RF2 (register file access), then execution (EX) and write-back (WB) stages per instruction class: fixed-point EX1–EX2 + WB, permute EX1–EX4 + WB, load/store and floating-point EX1–EX6 + WB, plus the branch pipeline.]


SPE Block Diagram

[Diagram: SPE block diagram: floating-point, permute, fixed-point, load-store, branch, and channel units around the result forwarding and staging network and the register file; instruction issue unit / instruction line buffer; 256KB single-port SRAM local store with 128B read and 128B write ports; DMA unit attached to the on-chip coherent bus. Datapath widths range from 8 to 128 bytes/cycle.]


Synergistic Memory Flow Control

• The SMF implements memory management and mapping
• The SMF operates in parallel to the SPU
  – Independent compute and transfer
  – Command interface from the SPU
    • a DMA queue decouples the SMF and the SPU
    • an MMIO interface serves remote nodes
• Block transfer between system memory and the local store
• SPE programs reference system memory using the user-level effective address space
  – Ease of data sharing
  – Local store to local store transfers
  – Protection

[Diagram: SMF internals: DMA engine with DMA queue, MMU with RMT, atomic facility, and bus interface control with MMIO, attached to the data, snoop, and control buses.]


Synergistic Memory Flow Control

• Bus interface
  – Up to 16 outstanding DMA requests
  – Requests up to 16KByte
• Element Interconnect Bus
  – Four 16-byte data rings
  – Multiple concurrent transfers
  – Token-based bus access management
  – 96B/cycle peak bandwidth
  – Over 100 outstanding requests
  – 200+ GByte/s @ 3.2+ GHz

[Diagram: EIB topology: PPE, SPE0–SPE7, MIC, BIF/IOIF0, and IOIF1 attached by 16B ports to the four data rings through the data arbiter. Source: Clark et al., Hot Chips 17, 2005]

Memory efficiency as key to application performance

• The greatest challenge: translating peak performance into application performance
  – Peak Flops are useless without a way to feed them data
• A cache miss provides too little data, too late
  – Inefficient for streaming / bulk data processing
  – Initiates the transfer when the transfer's results are already needed
  – Application-controlled data fetch avoids not-on-time data delivery
• The SMF is a better way to look at an old problem
  – Fetch data blocks based on the algorithm
  – Blocks reflect application data structures


Traditional memory usage

• Long-latency memory access operations are exposed
• 500+ cycles cannot be overlapped with computation
• Memory access latency severely impacts application performance

[Figure: execution timeline showing short bursts of computation interleaved with memory protocol, memory idle, and memory contention time.]


Exploiting memory-level parallelism

• Reduce the performance impact of memory accesses through concurrent accesses
• Memory accesses are carefully scheduled in numeric code
• Out-of-order execution increases the chance of discovering more concurrent accesses
  – but overlapping a 500-cycle latency with computation via OoO is illusory
• Bandwidth is limited by queue size and roundtrip latency


Exploiting MLP with Chip Multiprocessing

• More threads ⇒ more memory-level parallelism
  – overlap accesses from multiple cores


A new form of parallelism: CTP

• Compute-transfer parallelism
  – Concurrent execution of compute and transfer increases efficiency
    • Avoids costly and unnecessary serialization
  – An application thread has two threads of control
    • SPU ⇒ computation thread
    • SMF ⇒ transfer thread
• Optimize memory access and data transfer at the application level
  – Exploit programmer and application knowledge about data access patterns
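A minimal sketch of compute-transfer parallelism via double buffering (illustrative; a Python thread stands in for the SMF transfer thread, and `fetch`/`compute` are hypothetical callbacks): while the computation works on one buffer, the next block is transferred into the other:

```python
import threading

def process_double_buffered(block_ids, fetch, compute):
    """Overlap transfer and compute with two buffers (double buffering)."""
    results = []
    if not block_ids:
        return results
    buffers = [fetch(block_ids[0]), None]      # prime the first buffer
    for i in range(len(block_ids)):
        cur, nxt = i % 2, (i + 1) % 2
        transfer = None
        if i + 1 < len(block_ids):
            # "SMF" thread: start the next transfer before computing
            def do_fetch(j=i + 1, slot=nxt):
                buffers[slot] = fetch(block_ids[j])
            transfer = threading.Thread(target=do_fetch)
            transfer.start()
        results.append(compute(buffers[cur]))  # "SPU" computes in parallel
        if transfer is not None:
            transfer.join()                    # wait for the transfer to land
    return results
```

When `compute` time and `fetch` time are comparable, each hides the other almost completely, which is the serialization-avoidance the slide describes.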


Memory access in Cell with MLP and CTP

• Super-linear performance gains observed
  – decouple data fetch and data use


Synergistic Memory Management

[Diagram: word cloud of synergistic memory management themes: parallel compute & transfer, explicit data transfer, reduced coherence cost, timely data delivery, data fetch, deeper queues, decoupled fetch and access, application control, many outstanding transfers, multi-core, high throughput, local store, block fetch, reduced miss cost, improved code generation, shared I & D storage, low-latency access, density, single-port sequential fetch.]


Cell BE Implementation Characteristics

• 241M transistors
• 235mm²
• The design operates across a wide frequency range
  – Optimize for power & yield
• > 200 GFlops (SP) @ 3.2GHz
• > 20 GFlops (DP) @ 3.2GHz
• Up to 25.6 GB/s memory bandwidth
• Up to 75 GB/s I/O bandwidth
• 100+ simultaneous bus transactions
  – 16+8 entry DMA queue per SPE

[Figure: relative frequency increase vs. power consumption as a function of voltage. Source: Kahle, Spring Processor Forum 2005]


Cell Broadband Engine


System configurations

• Game console systems
• Blades
• HDTV
• Home media servers
• Supercomputers

[Diagram: configurations built from a single Cell design: a single Cell BE processor with dual XDR™ memory and IOIF0/IOIF1; a two-processor system coupling two Cell BE processors over the BIF; and a four-processor system connecting four Cell BE processors through a BIF switch.]


Compiling for the Cell Broadband Engine

• The lesson of "RISC computing"
  – The architecture provides fast, streamlined primitives to the compiler
  – The compiler uses those primitives to implement higher-level idioms
  – If the compiler can't target it ⇒ do not include it in the architecture
• Compiler focus throughout the project
  – Prototype compiler soon after the first proposal
  – The Cell compiler team has made significant advances in
    • automatic SIMD code generation
    • automatic parallelization
    • data privatization

[Diagram: the Cell design balances raw hardware performance, programmability, and programmer productivity.]


Single source automatic parallelization

• Single source compiler
  – Addresses legacy code and programmer productivity
  – Partitions a single source program into SPE and PPE functions
• Automatic parallelization
  – across multiple threads
  – vectorization to exploit the SIMD units
• Compiler-managed local store
  – Data and code blocking and tiling
  – "Software cache" managed by the compiler
• Range of programmer input
  – Fully automatic
  – Programmer-guided
    • pragmas, intrinsics, OpenMP directives
• Prototype developed in IBM Research


Compiler-driven performance

• Code partitioning
  – Thread-level parallelism
  – Handle the local store size
  – "Outlining" into SPE overlays
• Data partitioning
  – Data access sequencing
  – Double buffering
  – "Software cache"
• Data parallelization
  – SIMD vectorization
  – Handling of data with unknown alignment

Source: Eichenberger et al., PACT 2005
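A toy direct-mapped "software cache" in the spirit of the compiler-managed local store above (illustrative; on the real hardware the miss handler would issue a DMA into the local store, here a Python list stands in for system memory):

```python
class SoftwareCache:
    """Compiler-managed cache sketch: a tag check per access,
    with a whole-line fill on each miss."""

    def __init__(self, num_lines, line_size, backing):
        self.num_lines = num_lines
        self.line_size = line_size
        self.backing = backing               # stands in for system memory
        self.tags = [None] * num_lines
        self.lines = [None] * num_lines
        self.misses = 0

    def load(self, addr):
        tag, offset = divmod(addr, self.line_size)
        slot = tag % self.num_lines          # direct-mapped placement
        if self.tags[slot] != tag:           # miss: fetch the whole line
            self.misses += 1                 # (a DMA get on the real SPE)
            base = tag * self.line_size
            self.lines[slot] = self.backing[base:base + self.line_size]
            self.tags[slot] = tag
        return self.lines[slot][offset]
```

The compiler emits the tag check inline at each access it cannot prove local, so regular accesses hit in the local store while irregular ones fall back to the miss handler.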


Raising the bar with parallelism…

• Data parallelism and static ILP in both core types
  – result in low overhead per operation
• Multithreaded programming is key to great Cell BE performance
  – Exploit application parallelism with 9 cores
  – regardless of whether the code exploits DLP & ILP
  – a challenge regardless of homogeneity/heterogeneity
• Leverage parallelism between data processing and data transfer
  – A new level of parallelism exploiting bulk data transfer
  – Simultaneous processing on the SPUs and data transfer on the SMFs
  – Offers superlinear gains beyond MIPS scaling


… while understanding the trade-offs

• Uniprocessor efficiency is actually low
  – Gelsinger's Law captures the historic (performance) efficiency
    • 1.4× performance for 2× transistors
  – Marginal uniprocessor (performance) efficiency is 40% (or lower!)
    • and power efficiency is even worse
• The "true" bar is marginal uniprocessor efficiency
  – A multiprocessor "only" has to beat a uniprocessor to be the better solution
  – Much low-hanging fruit remains to be picked in multithreading applications
    • embarrassing application parallelism that has not yet been exploited
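The 40% figure follows from simple arithmetic on Gelsinger's Law (a sketch using the slide's 1.4×-for-2× numbers):

```python
def marginal_uniprocessor_efficiency(perf_ratio=1.4, transistor_ratio=2.0):
    """Performance added by the extra transistor budget, per unit of budget.
    Doubling transistors buys 1.4x performance, so the additional 1x budget
    adds only 0.4x performance, versus ~1.0x for spending it on a second core."""
    extra_perf = perf_ratio - 1.0
    extra_transistors = transistor_ratio - 1.0
    return extra_perf / extra_transistors
```

This is the sense in which a multiprocessor "only" has to beat a 40%-efficient marginal uniprocessor investment to be the better design.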


Cell: a Synergistic System Architecture

• Cell is not a collection of different processors, but a synergistic whole
  – Operation paradigms, data formats, and semantics are consistent
  – Shared address translation and memory protection model
• The SPE is optimized for efficient data processing
  – SPEs share the Cell system functions provided by the Power Architecture
  – The SMF implements the interface to memory
    • copy-in/copy-out to local storage
• The Power Architecture provides system functions
  – Virtualization
  – Address translation and protection
  – External exception handling
• The Element Interconnect Bus integrates the system as its data transport hub


A new wave of innovation

• CMPs everywhere
  – Power4
  – BOA
  – Piranha
  – Cyclops
  – Niagara
  – BlueGene
  – Power5
  – Xbox360
  – Cell BE
• Novel systems
  – same-ISA CMPs
  – different-ISA CMPs
  – homogeneous CMPs
  – heterogeneous CMPs
  – thread-rich CMPs
  – GPUs as compute engines
• Compilers and software must and will follow…


Conclusion

• Established performance drivers are running out of steam
• A new approach is needed to build high-performance systems
  – Application- and system-centric optimization
  – Chip multiprocessing
  – Understand and manage memory access
  – Compilers to orchestrate data and instruction flow
• A new push is needed on understanding workload behavior
• A renewed focus is needed on compilation technology
• An exciting journey is ahead of us!


© Copyright International Business Machines Corporation 2005, 2006. All Rights Reserved. Printed in the United States May 2006.

The following are trademarks of International Business Machines Corporation in the United States, other countries, or both: IBM, the IBM Logo, Power Architecture.

Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.
