IBM STG

Cell Broadband Engine™ Processor

(A multi core design based on Power Architecture™ technology)

Michael Paolini, paolini@us.ibm.com
Solutions Architect (SWG Master Inventor)
IBM Systems & Technology Group
Nov 13, 2006

© 2006 IBM Corporation

Performance Limiters and Challenges in Conventional Microprocessors

(Single Thread Throughput)

. Memory Wall – Latency induced bandwidth limitations
. Power Wall – Must improve efficiency and performance equally
. Frequency Wall – Diminishing returns from deeper pipelines (can be negative if power is taken into account)

What's Causing The Problem?

$P = \frac{1}{2} C V^2 f$

Gate dielectric approaching a fundamental limit: atomic defects matter!

[Figure: Gate stack, 65 nm, 10S Tox=11A]
[Figure: Power density (W/cm²) vs. gate length (microns), 1994 to 2004, showing active power and passive power against the air cooling limit]
[Figure: Frequency increase vs. power consumption (relative power, voltage, frequency)]
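The relation P = (1/2)CV²f from this slide can be made concrete with a small calculation. The sketch below is only an illustration: the function names and the example capacitance are hypothetical, and it models dynamic (active) power only, not the passive (leakage) power shown in the figure.

```c
#include <assert.h>

/* Dynamic (switching) power from the slide's formula P = (1/2) C V^2 f.
   Units: farads, volts, hertz -> watts. */
double dynamic_power(double c_farads, double volts, double hertz)
{
    assert(c_farads >= 0.0 && hertz >= 0.0);
    return 0.5 * c_farads * volts * volts * hertz;
}

/* Relative change in dynamic power when voltage and frequency are each
   scaled by a factor and the switched capacitance C is held fixed. */
double power_ratio(double v_scale, double f_scale)
{
    return v_scale * v_scale * f_scale;
}
```

For example, with an assumed 1 nF of switched capacitance at 1.0 V and 3.2 GHz the formula gives 1.6 W, and raising both voltage and frequency by 10% raises dynamic power by about 33%, which is one way to read the "Frequency Increase vs Power Consumption" curve.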

The Discontinuity

Then (~2003)                        Now
. Scaling drove performance         . Innovation drives performance
. Scaling drove down cost           . Scaling drives down cost
. Performance constrained           . Power constrained
. Active power dominates            . Standby power dominates
. Focus on processor performance    . Focus on system performance

Collaborative Innovation Drivers

. Economics - Investment unaffordable even for large entities

. Technology - Extent of invention and innovation required

. Perspective – Breadth of expertise and knowledge required

. Creativity – Collaboration truly brings great ideas forward

. Focus – Allows companies to best leverage core skills

. Time – Rapid team assembly and execution

Cell Broadband Engine History

. IBM, SCEI/Sony, Toshiba Alliance formed in 2000
. Design Center opens March 2001
. ~$400M Investment, 5 years, 600 people
. February 7, 2005: First technical disclosures
. January 12, 2006: Alliance extended 5 more years

[Map of sites: YKT, EFK, Burlington, Endicott, Rochester, Boeblingen, Tokyo, San Jose, RTP, Austin, India, Israel]

Introducing Cell BE

. Cell BE is an accelerator extension to Power
  – Built on a Power ecosystem
  – Used best known system practices for processor design
. Sets a new performance standard
  – Exploits parallelism while achieving high frequency
  – Supercomputer attributes with extreme floating point capabilities
  – Sustains high memory bandwidth with smart DMA controllers
. Designed for natural human interaction
  – Photo-realistic effects
  – Predictable real-time response
  – Virtualized resources for concurrent activities
. Designed for flexibility
  – Wide variety of application domains
  – Highly abstracted to highly exploitable programming models
  – Reconfigurable I/O interfaces
  – Virtual trusted computing environment for security
. Cell BE is the chip powering the Sony PlayStation 3
  – Ships in volume in the US in Nov '06

First Generation Cell BE
. 90 nm
. 241M transistors
. 235 mm²
. 9 cores, 10 threads
. >200 GFlops (SP)
. >20 GFlops (DP)
. Up to 25 GB/s memory B/W
. Up to 75 GB/s I/O B/W
. >300 GB/s EIB
. Top frequency >4 GHz (observed in lab)

The Cell BE Concept

. Compatibility with 64b Power Architecture™
  – Builds on and leverages IBM investment and community
. Increased efficiency and performance
  – Attacks on the "Power Wall"
    • Non-homogeneous coherent multiprocessor
    • High design frequency at a low operating voltage with advanced power management
  – Attacks on the "Memory Wall"
    • Streaming DMA architecture
    • 3-level memory model: Main Storage, Local Storage, Register Files
  – Attacks on the "Frequency Wall"
    • Highly optimized implementation
    • Large shared register files and software controlled branching to allow deeper pipelines
. Interface between user and networked world
  – Image rich information, virtual reality, shared reality
  – Flexibility and security
. Multi-OS support, including RTOS / non-RTOS
  – Combine real-time and non-real time worlds



Heterogeneous Multi-core Architecture


1 PPE core:
- VMX unit
- L1, L2 cache
- 2 way SMT


8 SPEs:
- 128-bit SIMD instruction set
- Register file – 128x128-bit
- Local store – 256KB
- Dedicated Asynchronous DMA engine
- Isolation mode



Element Interconnect Bus (EIB):
- 96B / cycle bandwidth

Debug Bus

SIMD Architecture

. SIMD = "single-instruction multiple-data"
. SIMD exploits data-level parallelism
  – a single instruction can apply the same operation to multiple data elements in parallel
. SIMD units employ "vector registers"
  – each register holds multiple data elements
. SIMD is pervasive in the BE
  – PPE includes VMX (SIMD extensions to PPC architecture)
  – SPE is a native SIMD architecture (VMX-like)
. SIMD in VMX and SPE
  – 128bit-wide datapath
  – 128bit-wide registers
  – 4-wide fullwords, 8-wide halfwords, 16-wide bytes
  – SPE includes support for 2-wide doublewords
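The lane counts above follow directly from the 128-bit register width. The portable C sketch below is only a model of the idea: `vec4f`, `vec4f_add` and `lanes_per_register` are hypothetical names, and the loop stands in for what VMX or an SPE would do with one vector instruction.

```c
#include <assert.h>

/* Hypothetical stand-in for one 128-bit vector register holding
   four 32-bit floats (4-wide fullwords). */
typedef struct { float lane[4]; } vec4f;

/* Number of elements that fit in one 128-bit (16-byte) register. */
int lanes_per_register(int element_bytes)
{
    assert(element_bytes == 1 || element_bytes == 2 ||
           element_bytes == 4 || element_bytes == 8);
    return 16 / element_bytes;
}

/* Models a single SIMD add: the same operation is applied to every lane.
   Real hardware performs the four additions with one instruction. */
vec4f vec4f_add(vec4f a, vec4f b)
{
    vec4f r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```

With this model, 4-byte fullwords give 4 lanes, 2-byte halfwords 8, bytes 16, and 8-byte doublewords 2, matching the widths listed on the slide.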

Specialized Purpose Processor vs. Traditional General Purpose Processor (Roughly to scale 65nm)

Roughly Half the size & Power @ the frequency, 9 Cores, ~230 SP GFlops

349mm2, 2 Cores, 3.4 GHz @ 150W, ~54.4 SP GFlops

Ideal Cell BE Software Target Areas

. Data Manipulation
  – Digital Media
  – Image processing
  – Video processing
  – Visualization of output
  – Compression/decompression
  – Encryption/decryption
  – DSP
  – Audio processing, language translation?
. Graphics
  – Transformation between domains (viewpoint transformation; time vs space; 2D vs 3D)
  – Lighting
  – Ray Tracing / Ray casting
. Floating Point Intensive Applications (SP)
  – Single precision Physics
  – Single precision HPC
  – Sonar
. Pattern Matching
  – Bioinformatics
  – String manipulation (search engine)
  – Parsing, transformation, translation (XSLT)
  – Audio processing, language translation?
  – Filtering & Pruning
. Offload Engines
  – TCP/IP
  – Compiler for gaming applications
  – XML
  – Network Security, Virus Scan and Intrusion

. Structured
  – Easier for memory fetch & SIMD operations
  – Data prefetch possible
  – Non-branchy instruction pipeline
  – Data more tolerant, but has the same caution
. Multiple Operations on Data
  – Many operations on same data before reloading
. Easy to Parallelize and SIMD
  – Little or no collective communication required
  – No global or shared memory or nested loops
. Compute Intense
  – Determined by ops per byte
. Fits Streaming Model
  – Small computation kernel through which you stream a large body of data
  – Algorithms that fit Graphics Processing Units
  – GPUs are being used for more than just graphics today thanks to PCI Express


Cell BE Processor Isn't Just for Games.
Innovative Chip is best high-performance embedded processor of 2005

We chose the Cell BE as the best high-performance embedded processor of 2005 because of its innovative design and future potential....Even if the Cell BE accumulates no more design wins, the PlayStation 3 could drive sales to nearly 100 million units over the likely five-year lifespan of the console. That would make the Cell BE one of the most successful microprocessors in history.

"…Cell BE could power hundreds of new apps, create a new video-processing industry and fuel a multibillion-dollar build out of tech hardware over ten years." -- Forbes

"It was originally conceived as the microprocessor to power Sony's [PS3], but it is expected to find a home in lots of other broadband-connected consumer items and in servers too." -- IEEE Spectrum

Cell Broadband Engine Architecture™ (CBEA) Technology Competitive Roadmap

[Roadmap figure, 2006 to 2010, with two tracks: Performance Enhancements/Scaling and Cost Reduction]
. Cell BE (1+8), 90nm SOI
. Enhanced Cell BE (1+8eDP SPE), 65nm SOI
. Next Gen (2PPE'+32SPE'), 45nm SOI, ~1 TFlop (est.)

All future dates and specifications are estimations only; subject to change without notice. Dashed outlines indicate concept designs.
Cell BE Roadmap 5.1, 7-Aug-2006.

Cell Broadband Engine™ Blade – The first in a line of planned offerings using Cell Broadband Engine technology

[Blade roadmap figure, performance vs. time, 2006 to 2008. Legend: Hardware, Alpha Software, Beta Software, GA Software]

Hardware:
. Cell BE-Based Blade (GA: 2H06)
  – 2 Cell BE Processors
  – Single Precision Floating Pt Affinity
  – 1 GB Memory
  – Up to 4X PCI Express™
. Advanced Cell BE-Based Blade (Target Availability: 2H07)
  – 2 Cell BE Processors
  – Single Precision Floating Pt Affinity
  – 2 GB Memory
  – Up to 16X PCI Express
. Enhanced Cell BE-based Blade (Target Availability: 1H08)
  – 2 Enhanced Cell BE Processors
  – SP & DP Floating Point Affinity
  – Up to 32 GB Memory
  – Up to 16X PCI Express

Software:
. SDK 1.1 (Available: 17 July 2006)
. SDK 2.0 (Target Availability: 1H07)
. SDK 3.0 (Target Availability: 2H07)

All future dates and specifications are estimations only; subject to change without notice.

IBM Server Strategy

[Figure: Scale Up / SMP Computing (vertical axis) vs. Scale Out / Distributed Computing (horizontal axis), with regions labeled Clusters and Virtualization, Large SMPs, and High Density Racks/Blades]

Cell BE Form Factors

Cell BE is being delivered in a 2way Blade Form Factor today.

(IBM Blade, and BladeCenter)

Additional form factors have been announced from our partners:
•PCI Express Card
•Workstation
•Rack Optimized
•CE Devices such as HDTVs and PlayStation 3

Accelerator Technology & Programming Models

Accelerator programming model choice is dictated by the distance, and the corresponding latency and bandwidth performance, between host and slave.

Accelerator distance from host, from farthest to nearest:
Network interconnect -> I/O Slot -> On Board (Planar) -> System on a Chip (SoC)

Increasing proximity of host & slave(s)
•Reduces latency and support infrastructure (faster / cheaper)

Implementation examples (in the same order): Clusters; nVidia, ATI, Clearspeed, Mercury CAB; Host + (CC) HT + FPGA, Host + (CC) HT + Clearspeed etc; Cell BE

Programming models (the set grows with proximity):
•RPC
•User Directed Shared memory (e.g. SHMEM, OpenMP)
•Library Pass thru
•Software Managed Cache Coherency
•Hardware Managed Cache Coherency (e.g. NUMA, CC-HT)


Accelerators

main()
    // do something
    global (struct A)
    global (int N)
    global (struct M)
    // foo will be executed on the slave
    // it is probably best to set up asynchronous or pipeline processing
    // on the host for max performance and efficiency
    A = foo (N, M)
    // when we are all done with A, N, and M we can release the pinned memory
    // do more stuff

[Diagram: HOST with Pinned Memory and DD; SLAVE with DD, running foo (M,N)]

•A, N, and M are pinned in memory
•The foo() entry point is mapped to the slave by the host DD (addresses are passed to the slave)
•The IO MMU on the slave DMAs data from pinned memory using the address maps
•If a fault occurs on the SLAVE, the IO MMU goes to the pinned memory for a DMA; if a subsequent fault occurs (e.g. address not in pinned memory), the Host DD gets an interrupt and must pin the new address in shared memory
•If the SLAVE does not have permission to read that address, or the pinned memory space is full, a SEGV is generated to the Slave
•NOTES:
  •If we allow the user to not expressly pin shared memory, how is the shared memory managed in terms of caching policies?
  •Are the shared memory APIs applicable to PPE-SPE or SPE-SPE programming? I think this would be a logical fit and allow for a more generalized (and portable) view of accelerator programming (is this just SHMEM or UPC on the SPEs?)

Accelerator Library Framework & Data and Communication Synchronization Library

[Layer diagram, top to bottom]
. Application
. Library
. DaCS: Topology, Process Management, Error Handling, Synchronization, Send / Receive, Remote DMA, Mailbox
. ALF: Data Partitioning, Process Management, Workload Distribution, Error Handling
. Tooling: IDE, gdb, Trace analysis, Others
. Platform: DMA, Mailbox, LAPI, ARMCI, IB Verbs, Open MPI, etc.

DaCS

. Focus on data movement primitives
  – Message Passing
    • send/recv
  – Remote Direct Memory Access (rDMA)
    • put/get
    • put_list/get_list
  – Double and multi-buffering
    • Efficient data transfer to maximize available bandwidth and minimize inherent latency
    • Hide complexity of asynchronous compute/communicate from developer
  – Mailbox
    • write to mailbox/read from mailbox
  – Synchronization
    • Mutex
    • Barrier
    • atomic operations
. Based on windows and channels architecture
. Process Management
  – Supports remote launching and termination of an accelerator's process from a host process
. Error Handling
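Double buffering, as listed above, overlaps the transfer of the next chunk with computation on the current one. The sketch below shows only the control structure in portable C. It does not use the DaCS API: `dma_get` is a hypothetical stand-in implemented with `memcpy`, which completes immediately, whereas a real asynchronous get would be started here and waited on just before the buffer is used.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per transfer; hypothetical, tiny for illustration */

/* Hypothetical stand-in for an asynchronous "get" of up to CHUNK elements.
   Returns the number of elements placed in the local buffer. */
static size_t dma_get(float *local, const float *remote,
                      size_t offset, size_t total)
{
    size_t count = total - offset;
    if (count > CHUNK)
        count = CHUNK;
    memcpy(local, remote + offset, count * sizeof(float));
    return count;
}

/* Sums a "remote" array using two local buffers: while one buffer is being
   processed, the transfer that fills the other has already been issued. */
float double_buffered_sum(const float *remote, size_t total)
{
    float buf[2][CHUNK];
    size_t count[2] = {0, 0};
    size_t offset = 0;
    int cur = 0;
    float sum = 0.0f;

    if (total == 0)
        return 0.0f;
    assert(remote != NULL);

    /* Prime the pipeline with the first chunk. */
    count[cur] = dma_get(buf[cur], remote, offset, total);
    offset += count[cur];

    for (;;) {
        int next = cur ^ 1;

        /* Issue the transfer for the next chunk before computing. */
        count[next] = 0;
        if (offset < total) {
            count[next] = dma_get(buf[next], remote, offset, total);
            offset += count[next];
        }

        /* Compute on the current chunk. */
        for (size_t i = 0; i < count[cur]; i++)
            sum += buf[cur][i];

        if (count[next] == 0)
            break;
        cur = next;  /* swap buffers */
    }
    return sum;
}
```

Multi-buffering generalizes the same loop to more than two buffers when a single outstanding transfer is not enough to hide latency.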

ALF

. Division of labor approach
  – ALF provides wrappers for computational kernels; synthesize kernels with data partitioning
. Initialization and cleanup of a group of accelerators
  – Groups are dynamic
  – Mutex locks provide synchronization mechanism for data and processing
. Remote error handling
. Data partitioning and list creation
  – Efficient scatter/gather implementations
    • Stateless embarrassingly parallel processing, strided partitioning, butterfly communications, etc.
  – Extensible to variety of data partitioning patterns
. Target prototypes: FFT, TRE, Sweep3D
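Two of the simplest data partitioning patterns mentioned above can be sketched in a few lines. This is not ALF code: the function names are hypothetical and the sketch only computes which elements each accelerator would own under a contiguous (block) split and under a strided split.

```c
#include <assert.h>
#include <stddef.h>

/* Block partitioning: first index owned by worker `id` when `n` items are
   split into contiguous ranges across `workers` accelerators. The first
   (n % workers) workers receive one extra item. Calling it with
   id == workers yields n, so [begin(id), begin(id + 1)) is the range. */
size_t block_begin(size_t n, size_t workers, size_t id)
{
    assert(workers > 0 && id <= workers);
    size_t base = n / workers;
    size_t rem = n % workers;
    return id * base + (id < rem ? id : rem);
}

/* Strided partitioning: worker `id` owns items id, id + workers,
   id + 2 * workers, ... This returns how many items that is. */
size_t strided_count(size_t n, size_t workers, size_t id)
{
    assert(workers > 0 && id < workers);
    if (id >= n)
        return 0;
    return (n - id + workers - 1) / workers;
}
```

For 10 items on 4 accelerators, the block split gives ranges of 3, 3, 2 and 2 items, and the strided split gives the same counts with interleaved ownership.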

Cell BE IDE

. Based on Eclipse/CDT
. Develop, build, debug Cell BE and Hybrid applications
. Supports Full System Simulator and remote execution
. Integrates performance analysis tools
. Support for Hybrid Model programming wizard

[Stack diagram: Cell IDE on top of CDT and Eclipse, alongside Cell Tool Chain, Sim, perf]

Delivering the Platform for Emerging Workloads

Market & Solution Specific Assets (sector focused): Digital Media; Information Based Medicine; Home Media Consumer Electronics; Financial Services Sector; Aerospace and Defense; Digital Video Surveillance

Common Workload Characteristics/Requirements:
. Real-time Analytics
. Image/Video Creation/Mgt
. Unstructured Data
. Processing of Data
. Presentation of Data
. Multimodal Search
. Information Synthesis
. Visualization
. Data Transforms
. Analysis
. Imaging
. Pattern Matching


Seeing is believing

Cell BE Demos

3D Visualization via Volumetric Rendering for Medical Images

Courtesy of Mercury Computer Systems

A single scan can result in 2,000 – 3,000 2D images. Shown below is a CT Reconstruction performed on an x86 and a Cell BE blade.

High End x86 Server Solution vs. Cell BE Solution (90nm first gen):
~6 minutes to render entire volume vs. ~2 seconds to render entire volume
~2 seconds per slice

Courtesy of Mercury Computer Systems http://www.mc.com/cell/media/medium.cfm

Ray-Tracing Quaternion Julia Fractals using the floating point power in graphics processing units (GPUs).

This kind of algorithm is pretty much ideal for the GPU - extremely high arithmetic intensity and almost zero bandwidth usage.

Dual Xeon IBM + BFG Nvidia GeForce 7800 GT:
~ 3-4 Frames per Second

Cell BE Solution (Blade 1 bring up board):
~ 18 Frames per Second with no textures added (matching left)
~ 15 Frames per Second with Textures (as shown)

Cg to SPE C + Intrinsics

Cg:

    float4 quatSq( float4 q )
    {
        float4 r;
        r.x = q.x*q.x - dot( q.yzw, q.yzw );
        r.yzw = 2*q.x*q.yzw;
        return r;
    }

SPE C + Intrinsics:

    float4 quatSq ( float4 q )
    {
        float4 r;
        float q_x, rx;
        float4 q_yzw, ryzw;

        q_x = spu_extract (q, 0);
        q_yzw = spu_slqwbyte (q, 4);
        rx = q_x * q_x - dot3 (q_yzw, q_yzw);
        ryzw = spu_mul (spu_splats (2.0f * q_x), q_yzw);
        r = spu_insert (rx, spu_rlmaskqwbyte (ryzw, -4), 0);
        return r;
    }

AOS vs SOA Vectors

Array of Structures (one vector holds all components of one element):

    X Y Z

Structure of Arrays (each vector holds one component of four elements):

    X X X X
    Y Y Y Y
    Z Z Z Z

AOS vs SOA Code

Array of Structures Code:

    float4 quatSq ( float4 q )
    {
        float4 r;
        float q_x, rx;
        float4 q_yzw, ryzw;

        q_x = spu_extract (q, 0);
        q_yzw = spu_slqwbyte (q, 4);
        rx = q_x * q_x - dot3 (q_yzw, q_yzw);
        ryzw = spu_mul (spu_splats (2.0f * q_x), q_yzw);
        r = spu_insert (rx, spu_rlmaskqwbyte (ryzw, -4),0);
        return r;
    }

Structure of Arrays Code:

    void quatSq4 ( float4 qx, float4 qy, float4 qz, float4 qw,
                   float4 *rx, float4 *ry, float4 *rz, float4 *rw )
    {
        float4 qx2;
        float4 dotq;

        qx2 = spu_mul (((float4)(2.0f)), qx);
        dotq = _dot_product3_v (qy, qz, qw, qy, qz, qw);
        *rx = spu_msub (qx, qx, dotq);
        *ry = spu_mul (qx2, qy);
        *rz = spu_mul (qx2, qz);
        *rw = spu_mul (qx2, qw);
    }
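The same quaternion squaring can be written in portable C to show why the SOA layout maps well onto SIMD. This is a scalar sketch, not SPU code: `quat`, `quat4`, `quat_sq` and `quat_sq4` are hypothetical names, and the loop over `i` plays the role of the four SIMD lanes.

```c
#include <assert.h>

/* Array of Structures: one quaternion per struct. */
typedef struct { float x, y, z, w; } quat;

/* Structure of Arrays: four quaternions, one array per component. */
typedef struct { float x[4], y[4], z[4], w[4]; } quat4;

/* AOS: r.x = x*x - dot(yzw, yzw), r.yzw = 2*x*yzw, for one quaternion. */
quat quat_sq(quat q)
{
    quat r;
    r.x = q.x * q.x - (q.y * q.y + q.z * q.z + q.w * q.w);
    r.y = 2.0f * q.x * q.y;
    r.z = 2.0f * q.x * q.z;
    r.w = 2.0f * q.x * q.w;
    return r;
}

/* SOA: the same arithmetic on four quaternions. Every statement in the
   loop body does identical work per lane, so each one corresponds to a
   single vector operation with no extract, shift or insert steps. */
quat4 quat_sq4(quat4 q)
{
    quat4 r;
    for (int i = 0; i < 4; i++) {
        float dot = q.y[i] * q.y[i] + q.z[i] * q.z[i] + q.w[i] * q.w[i];
        float x2 = 2.0f * q.x[i];
        r.x[i] = q.x[i] * q.x[i] - dot;
        r.y[i] = x2 * q.y[i];
        r.z[i] = x2 * q.z[i];
        r.w[i] = x2 * q.w[i];
    }
    return r;
}
```

In the AOS version only part of each vector does useful work at any step, while the SOA version keeps all four lanes busy and produces four results per call.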

Terrain Rendering Engine (TRE)

A 2D height map is converted to a 3D model and covered in texture; lighting and atmosphere are added, and that model is then ray-cast.

This video is broken into 3 segments:
•30 Seconds of Intel performing the process
•30 Seconds of an Apple G5 using VMX
•30 Seconds of Cell BE using 7 SPUs to render, and the 8th to compress and do network operations

IBM's iRT – Ray Trace Images in Interactive Time (First "LANL style" application – AMD & Cell working together)

[Image 1] Ray traced car with up to four levels of reflection and transparency, detailed shadows, and 4x4 jittered multi-sampling, totaling up to 96 rays per pixel.

[Image 2] Ambient occlusion estimate of global illumination for the scene, using 64 random occlusion samples per pixel.

[Image 3] Composite image including reflection, transparency, detailed shadows, ambient occlusion, and 4x4 jittered multi-sampling, totaling up to 288 rays per pixel. The car model is comprised of 1.6 million polygons and the image resolution is 1080p hi-def.

[Chart: Ray Traced, frames/second (0 to 40) vs. blades (1 to 4)]
[Chart: Ambient Occlusion, frames/second (0 to 10) vs. blades (1 to 4)]

Cell BE @ CEATEC 2005

Toshiba’s Magic Mirror Demo

IBM Extreme Blue Intern Team Project – (Summer '06 - AcCellerated Vision)

Project Outline
.Used 2 USB $39 Logitech Cameras (640x480 color @ 15 fps)
.Performed real time depth perception & gesture recognition in an office environment – in this case, it's a shared lab space with 40+ people coming and going in the background of the cameras' field of vision
.This demo uses a "depth barrier" to trigger drawing on screen vs. not drawing
.The blue glove was used as a programming convenience for the students because of the time pressure
.Team consisted of 4 interns for 12 weeks, 8 of which were available programming time (using standard C and OpenCV):
  –Jacob Albertson (CMU, Junior)
  –Kenneth Arnold (Cornell, Senior)
  –Steve Goldman (Stanford, Masters)
  –AJ Sessa (CMU/Tepper, MBA)

–Single SPU per camera stream (total: 2 streams = 2 SPUs)
–Cell BE computation was fast enough that the only bottleneck was at the camera
–Simplistic single-buffered DMA requests were used.
–Function offload model to control SPU execution
–Used mailboxes for communication between PPU and SPU.
–Sent address of function specific control block, and the SPU switched based on the control block type.
–Straightforward SIMD implementation.
–Cell porting took just over 1 week (3 programmers in the "over the shoulder" extreme programming model).
–Stopped optimizing when total time less than 30ms per frame since camera is only 30fps
  • Showed 20x vs similar clocked Duo Core Intel
  • Additional 10x increase easily possible.
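The function offload pattern described above (send the address of a control block through a mailbox, then switch on its type) can be sketched as follows. Everything here is hypothetical and runs in a single address space: the control block layout, the two operation types, and the function names are invented for illustration, and a real SPU would first DMA the control block into its local store, since hardware mailbox entries are only 32 bits wide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operation types carried by a control block. */
typedef enum { CB_THRESHOLD, CB_INVERT } cb_type;

/* Hypothetical function-specific control block. */
typedef struct {
    cb_type type;   /* which function the accelerator should run */
    uint8_t pixel;  /* input value */
    uint8_t param;  /* function-specific parameter */
} control_block;

/* "SPU side": receives one mailbox word holding the control block
   address and dispatches on the control block type. */
int spu_handle_mailbox(uintptr_t mailbox_word)
{
    const control_block *cb = (const control_block *)mailbox_word;
    assert(cb != NULL);

    switch (cb->type) {
    case CB_THRESHOLD:
        return cb->pixel >= cb->param ? 255 : 0;
    case CB_INVERT:
        return 255 - cb->pixel;
    }
    return -1;  /* unknown control block type */
}

/* "PPU side": the only thing sent is the address of the control block. */
int ppu_offload(const control_block *cb)
{
    return spu_handle_mailbox((uintptr_t)cb);
}
```

Adding a new offloaded function in this style means adding one enum value, one case in the switch, and whatever fields the function needs in the control block.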


Areas IBM and Our Partners Are Looking at for Cell BE


.Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation of:
  –Protein folding and related diseases, including Alzheimer's Disease, Huntington's Disease, and certain forms of cancer.
  –By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine.

.Folding@Home utilizes the new Cell processor in Sony's PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers. With this new technology we will likely be able to attain performance on the 100 gigaflop scale per computer.
  –With about 10,000 such machines, we would be able to achieve performance on the petaflop scale.

http://folding.stanford.edu/FAQ-PS3.html
Dr. V. S. Pande, folding@home, Distributed Computing Project, Stanford University


IBM to Build World's First Cell Broadband Engine™ Based Supercomputer
Revolutionary Hybrid Supercomputer at Los Alamos National Laboratory Will Harness Cell Game Chips and AMD Opteron™ Technology

[Diagram: Cell BE Accelerator Linux Cluster (8000+ Blades) connected to an x86 Linux Master Cluster (AMD Opteron™, System x3755)]

. Goal: 1 PetaFlop Double Precision Floating Point Sustained
  – 1.6 PetaFlop Peak DP Floating Point (3.2 SP)
  – 360 server racks that take up around 12,000 square feet, about three basketball courts
  – Hybrid of Opteron X64 AMD processors (System x3755 servers) and Cell BE Blade Servers connected via high speed network
  – Modular approach means the Master Cluster could be made up of any type of system, including Power, Intel

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. Development collaboration between Sony, Toshiba and IBM.

Multigrid Finite Element Solver on Cell

Ported by

using the free SDK

. 235 584 tetrahedra
. 48 000 nodes
. 28 iterations in NKMG solver
. In 3.8 seconds

Sustained performance for large objects: 52 GFLOP/s using all 8 SPUs at once

ls7-www.cs.uni-dortmund.de
www.digitalmedics.de

IBM DVS Vision

[Diagram: video clips flow through Video Compression & Analysis into Storage as objects & meta data; Surveillance Middleware is reached over LAN / VPN]

. IBM DVS watches video & logs activity, generates alerts
. Example query: "Show all red cars that drove North on 10 Ave over the last one month"
. Example alert: car parked in loading zone > 5 mins ("Why is this car parked here?")
. Example outcome: "We should change our security policy"

Smart Surveillance Technology Base:
•Object Detection in the presence of distraction motion
•2D Object Tracking: Multi-object tracking with occlusion resolution.
•Object Classification: View independent object classification.
•3D Object Tracking: Precise 3D location using standard cameras.
•Multi-scale Tracking: Automatic PTZ Camera control to track objects.
•Multi-camera Handoff: The ability to track an object across cameras.
•Face Cataloging: Captures faces at large distances from the camera.
•XML Metadata Representation for object and its motion attributes.
•Extensible Engine Architecture for plug and play video analytics.
•Real Time Event Indexing: Scene events are instantaneously available for searching in a distributed database environment.
•Web service interfaces for Event Search & Retrieval support the rapid application development of customer specific applications.
•Scalable Backend System: COTS database technology allows for both distributed surveillance and scalability.

IBM Smart Surveillance System (Previous PeopleVision Project)

Research Areas:
• Robust Background Subtraction
• Salient Motion Detection
• Object Classification
• 2D Tracking
• 3D Multi-Person Tracking
• Articulated Human Body Tracking
• Active Head Tracking
• Coarse Head Pose Estimation
• Position Independent Absolute Head Pose Estimation
• Face Cataloger
• Video Privacy
• Multi-scale Tracking & Index Browser
• Real Time Alerts
• Middleware for Large Scale Surveillance (MILS)
• Performance Evaluation of Surveillance Systems

Marvel: Semantic Filtering of Audio-Visual Content

[Pipeline diagram]

Video, with associated speech: "Today, jet fighters practiced maneuvers and forces increased military preparations as tensions in Middle East reached …"

. Text Analysis
  –Speech
  –Closed Captions
  –Transcript
  –Output, Text Metadata: "Jet fighters", "military", "Middle East"
. Image & Video Sequence Analysis
  –Visual Features: Motion, Color patterns, Texture, Shapes
  –Output, Visual Feature Metadata
. Automatic Semantic Concept Detection (Ontology-Based Models)
  –Output, Semantics Metadata with associated confidence scores: Airplane (0.8), Indoors (0.8), Protest (0.6), Parade (0.5), Outdoors (0.9), Sky (0.7), Meeting (0.7), People (0.7)

VLSI Design Center Tidex solution - iPlus

. Developer and owner of a revolutionizing technology, capable of transforming any 2D video footage into an accurate 3D model - iPlus™
. The iPlus technology allows the easy creation of 3-dimensional, true-to-reality databases out of regular video footage
. It does so by automatically reconstructing the depth dimension of the movie, and building a depth map of the movie's pixels.
. The video textures are thus an integral part of the final model

A house shot from the air, modeled with textures

Early Data Points for Cell BE in the Financial Services Areas

Benchmark
. European Option risk analytics
. Monte-Carlo simulation
. 200,000,000 random numbers generated
  – Mersenne Twister RNG
  – Cell also has a Hardware RNG
. Log normal distribution
. Min, max, average, 95% quantile for losses

Code Modification
. 3 days of effort
. Split the problem across multiple SPUs
. SIMD-ized core operations using SPU intrinsics
. Unrolled some loops manually

[Chart: Scaling Factor by Number of SPUs. Real Time (sec), 0 to 80, vs. Number of SPEs, 0 to 20. Series: Single Precision FP, Double Precision FP, Linear Speedup]

SPUs   SP Real Time (s)   SP Speedup   DP Real Time (s)   DP Speedup
1      12.132             1.00         77.009             1.00
2      6.074              2.00         38.516             2.00
4      3.051              3.98         19.271             4.00
8      1.549              7.83         9.662              7.97
16     0.817              14.85        4.873              15.80

Single precision: 4 numbers generated per SPU at once
Double precision: 2 numbers generated per SPU at once (today)

Today            SPE Cores   Peak SP GFLOPS   Peak DP GFLOPS
Cell BE Chip     8           204.8            14.7
Cell BE Blade    16          409.6            29.3
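The speedup columns and the peak single precision figures on this slide can be reproduced with simple arithmetic. The sketch below uses hypothetical function names. The peak calculation additionally assumes a 3.2 GHz clock and two floating point operations per lane per cycle (a fused multiply-add); neither value is stated on the slide, but together with 4 SP lanes per SPE they give the 204.8 GFLOPS listed for 8 SPEs.

```c
#include <assert.h>

/* Speedup relative to the single-SPU run: T(1) / T(n). */
double speedup(double t1_seconds, double tn_seconds)
{
    assert(tn_seconds > 0.0);
    return t1_seconds / tn_seconds;
}

/* Parallel efficiency: speedup divided by the number of SPUs used. */
double efficiency(double t1_seconds, double tn_seconds, int spus)
{
    assert(spus > 0);
    return speedup(t1_seconds, tn_seconds) / spus;
}

/* Peak GFLOPS = cores * lanes * flops per lane per cycle * clock in GHz.
   The clock and flops-per-cycle values are assumptions (see lead-in). */
double peak_gflops(int cores, int lanes, int flops_per_lane_cycle, double ghz)
{
    return cores * lanes * flops_per_lane_cycle * ghz;
}
```

Using the table's single precision times, `speedup(12.132, 0.817)` is about 14.85 on 16 SPUs, which is roughly 93% parallel efficiency; the double precision run reaches about 99%.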

Accelerator Hierarchy Example for DCC

•Host system receives request from plug-in
•Scene data assembled
  •Different elements scattered remotely
  •Brought into host memory
•Request partitioned amongst multiple blades
  •Rendering potentially 1000s of frames

•Existing x86 DCC application
  • Plug-ins part of many apps
  • Plug-in drives Cell BE Cluster
•Leverage existing infrastructure
•Cell BE Blades receive render request
•Each blade services one frame

•Rendering Computation
•Frame sliced across SPEs

Accelerator Hierarchy Example for TRE

•Manages input model (i.e. joystick, script)
•Terrain data tiles pre-fetched from disk/net
•Predictive tile pre-fetching algorithms
•Animation sequences dispatched to Cell BE blades

•Display client
•Input device
•Multiple clients

•Interleaved frames sent to each blade
•Data tile cache

•Higher-end visualization
•Drives multiple blades

•Target frame dispatched to blade
•Vertical slices calculated on SPEs

Once problems are parallelized on Cell BE in a micro-acceleration model, cluster based macro-acceleration becomes a natural extension

Remote Visualization continued

•Display client
•Input device
•Multiple clients

.This Rendering and Visualization pattern works for a number of things:
  –Tours: one project is looking at taking detailed models of ancient cities, combining them with the location of the PDA, ray casting them and streaming the video to the PDAs. This allows people to see what the location of the ancient city looked like from their angle, and compare it to the present day.
  –Rendering technical models for buildings and machinery remotely to the PDA, allowing visualization aids for wiring, structural support and so on. For example, an architect or builder on-site might be able to walk around a site visualizing his model at different stages of completion. Maintenance personnel gain instant access to 3D schematics, which can be augmented with history of repairs, common correction steps, and so on.

Terrain Rendering Engine (TRE) and IBM Blades

[Diagram: Aircraft data / Field Data -> Combine Data & Render (CPBW Blade, BladeCenter-1 Chassis, Commodity Cell BE Blade) -> Next-Gen GCS: Add Live Video, Aerial Information, Combat Situational Awareness]

Conversion between Content Protection Schemes

Media protection schemes and standards rely on client device and client software remaining secure and untampered with. Cell BE provides a superior alternative.

[Diagram]
. Sources: HD IP Streams from ISP; HD Disc (Blu-ray, HD-DVD); Other Providers & Methods
. AACS Encrypted Stream enters the Cell BE based Content Player, which acts as trusted transcoder
  1. Decrypted using one standard… (AACS Decrypt)
  2. …Encrypted using another standard (DTCP IP Encrypt)
. Content is in the clear only within the Secure Processing Vault (most valuable form of content, decrypted but still encoded/compressed)
. DTCP IP Encrypted Stream is delivered to HDTV (1080p etc), Portable Devices, Other Devices
. Cell BE acts as the trusted, open platform for various content protection schemes

Cell Broadband Engine Processor Role of the Transformation Tier

Battlegrounds
. Formats & Standards
. Security for Content
– Increased security brings increased:
• Fidelity / resolution of content
• User rights such as time-shifting
. Usage Monitoring / Transaction
. Packaging and Delivery
. Content Customization & Tailoring
. Unique Watermarking to User

[Figure: HD content (Blu-ray, etc.; highest fidelity and security, e.g. 1920 x 1080) with time shifting, sharing, and repurposing, delivered to devices at different fidelity levels:]
. Device A, “Cell BE”: HD content
. Device B, “x86”: AVI / MPEG 4, good fidelity, e.g. 480p
. Device C, TV: standard TV, medium fidelity, e.g. NTSC, PAL, etc.
. Device D, custom (ARM / MIPS / etc.): custom fidelity, e.g. 320, 640, etc.
. … Device n

Cell Broadband Engine Processor

Coherent Optical Processor

Cell Broadband Engine Processor

Increased focus on manufacturability during the design process: Optical Proximity Correction (OPC) and Optical Rules Checking (ORC)

Why OPC?
. Improve time-to-market
. Improve design quality
. Improve manufacturability
. Improve manufacturing yield
. Improve profitability

[Figure: the design compared with the wafer image without OPC and with the wafer image after OPC is applied. Do they agree?]

Cell Broadband Engine Processor User Interaction Drives Innovation in Computing

[Figure: progression of computing eras and the interaction styles that came with them.]
. Eras: Mainframe Batch, Mainframe Multitasking, Mini Computer, Stand Alone PC, Networked, Broadband
. Interaction styles: Punch Cards, Green Screen / Teletype, Word Processing, Graphic User Interface (GUI), WYSIWYG, Cut & Paste, Multitasking, Internet, WWW, Gaming, Natural Interaction

Cell Broadband Engine Processor More information at

. www.ibm.com/developerworks/power/cell
– Full Architecture Spec
– Full System Simulator
– GCC & XLC compilers
– Example applications and libraries
. All Free!

. TIME TO GET IN THE GAME !!

Cell Broadband Engine Processor

Thank You!

Cell Broadband Engine Processor Special Notices

© Copyright International Business Machines Corporation 2006 All Rights Reserved

This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document. Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquires, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions. IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. 
Rates and offerings are subject to change, extension or withdrawal without notice. IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. Many of the features described in this document are operating system dependent and may not be available on Linux. For more information, please check: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.html Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.

Cell Broadband Engine Processor Special Notices (Cont.) -- Trademarks

The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter, Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced Micro- Partitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage, VideoCharger, Virtualization Engine.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries, or both. Rambus is a registered trademark of Rambus, Inc. XDR and FlexIO are trademarks of Rambus, Inc. UNIX is a registered trademark in the United States, other countries or both. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Fedora is a trademark of Redhat, Inc. Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both. Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries. TPC-C and TPC-H are trademarks of the Transaction Performance Processing Council (TPPC). SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC). AltiVec is a trademark of Freescale Semiconductor, Inc. PCI-X and PCI Express are registered trademarks of PCI SIG. InfiniBand™ is a trademark the InfiniBand® Trade Association Other company, product and service names may be trademarks or service marks of others.

Revised July 23, 2006



Cell Broadband Engine Processor Collaborative Innovation Drivers

. Economics - Investment unaffordable even for large entities

. Technology - Extent of invention and innovation required

. Perspective – Breadth of expertise and knowledge required

. Creativity – Collaboration truly brings great ideas forward

. Focus – Allows companies to best leverage core skills

. Time – Rapid team assembly and execution

Need to find some real words here. Thought it might be a place to insert the thought that parallelizing isn’t just for processors, but is the natural way to attack any really large undertaking. Couldn’t decide if an introduction was needed for the next slide, or if this is better as speaker notes for the next slide.

Cell Broadband Engine Processor Cell Broadband Engine History

. IBM, SCEI/Sony, Toshiba Alliance formed in 2000
. Design Center opens March 2001
. ~$400M Investment, 5 years, 600 people
. February 7, 2005: First technical disclosures
. January 12, 2006: Alliance extended 5 more years

[Figure: map of sites: YKT, EFK, Burlington, Endicott, Rochester, Boeblingen, Tokyo, San Jose, RTP, Austin, India, Israel.]

Cell Broadband Engine Processor Introducing Cell BE

. Cell BE is an accelerator extension to Power
– Built on a Power ecosystem
– Used best known system practices for processor design
. Sets a new performance standard
– Exploits parallelism while achieving high frequency
– Supercomputer attributes with extreme floating point capabilities
– Sustains high memory bandwidth with smart DMA controllers
. Designed for natural human interaction
– Photo-realistic effects
– Predictable real-time response
– Virtualized resources for concurrent activities
. Designed for flexibility
– Wide variety of application domains
– Highly abstracted to highly exploitable programming models
– Reconfigurable I/O interfaces
– Virtual trusted computing environment for security
. Cell BE is the chip powering the Sony PlayStation 3
– Ships in volume in the US in Nov ‘06

First Generation Cell BE
. 90 nm
. 241M transistors
. 235 mm2
. 9 cores, 10 threads
. >200 GFlops (SP)
. >20 GFlops (DP)
. Up to 25 GB/s memory B/W
. Up to 75 GB/s I/O B/W
. >300 GB/s EIB
. Top frequency >4 GHz (observed in lab)

Cell Broadband Engine Processor The Cell BE Concept

. Compatibility with 64b Power Architecture™
– Builds on and leverages IBM investment and community
. Increased efficiency and performance
– Attacks on the “Power Wall”
• Non-Homogeneous Coherent Multiprocessor
• High design frequency @ a low operating voltage with advanced power management
– Attacks on the “Memory Wall”
• Streaming DMA architecture
• 3-level Memory Model: Main Storage, Local Storage, Register Files
– Attacks on the “Frequency Wall”
• Highly optimized implementation
• Large shared register files and software controlled branching to allow deeper pipelines
. Interface between user and networked world
– Image rich information, virtual reality, shared reality
– Flexibility and security
. Multi-OS support, including RTOS / non-RTOS
– Combine real-time and non-real time worlds

Add words around ECC etc.

Cell Broadband Engine Processor

Heterogeneous Multi-core Architecture


Cell Broadband Engine Processor

1 PPE core:
– VMX unit
– L1, L2 cache
– 2-way SMT

Cell Broadband Engine Processor

8 SPEs
– 128-bit SIMD instruction set
– Register file: 128 x 128-bit
– Local store: 256KB
– Dedicated asynchronous DMA engine
– Isolation mode

Add words around ECC etc.

Cell Broadband Engine Processor

Element Interconnect Bus (EIB)
– 96B / cycle bandwidth

Cell Broadband Engine Processor Debug Bus


Cell Broadband Engine Processor SIMD Architecture

. SIMD = “single-instruction multiple-data”
. SIMD exploits data-level parallelism
– a single instruction can apply the same operation to multiple data elements in parallel
. SIMD units employ “vector registers”
– each register holds multiple data elements
. SIMD is pervasive in the BE
– PPE includes VMX (SIMD extensions to PPC architecture)
– SPE is a native SIMD architecture (VMX-like)
. SIMD in VMX and SPE
– 128bit-wide datapath
– 128bit-wide registers
– 4-wide fullwords, 8-wide halfwords, 16-wide bytes
– SPE includes support for 2-wide doublewords
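The lane counts on this slide follow from dividing the 128-bit register by the element width. As a minimal sketch in portable C (no VMX or SPU intrinsics; the names `vec128`, `lanes_for`, and `vec_add_f32` are illustrative assumptions, not part of any Cell BE SDK), one register can be modeled as a union, and one SIMD "instruction" as the same operation applied to every lane:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit vector register (not an SDK type). */
typedef union {
    float    f32[4];   /* 4-wide fullwords (single precision) */
    uint16_t u16[8];   /* 8-wide halfwords */
    uint8_t  u8[16];   /* 16-wide bytes */
    double   f64[2];   /* 2-wide doublewords (SPE only, per the slide) */
} vec128;

/* Number of lanes for a given element size in bytes. */
int lanes_for(int element_bytes)
{
    assert(element_bytes > 0 && 16 % element_bytes == 0);
    return 16 / element_bytes;
}

/* Scalar emulation of one SIMD add: the same operation on every lane. */
vec128 vec_add_f32(vec128 a, vec128 b)
{
    vec128 r;
    for (int i = 0; i < 4; i++)
        r.f32[i] = a.f32[i] + b.f32[i];
    return r;
}
```

On real hardware the loop in `vec_add_f32` would be a single vector instruction; the loop here only shows what that instruction computes.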


Cell Broadband Engine Processor Specialized Purpose Processor vs. Traditional General Purpose Processor (Roughly to scale 65nm)

Roughly Half the size & Power @ the frequency, 9 Cores, ~230 SP GFlops

349mm2, 2 Cores, 3.4 GHz @ 150W, ~54.4 SP GFlops

Left: Cell BE. Right: Intel Tulsa (Xeon MP 7100 series), introduced @ $1980 in 1k quantities. http://www.intel.com/products/processor/xeon/index.htm

Cell Broadband Engine Processor Ideal Cell BE Software Target Areas

. Data Manipulation
– Digital Media
– Image processing
– Video processing
– Visualization of output
– Compression/decompression
– Encryption/decryption
– DSP
– Audio processing, language translation?
. Graphics
– Transformation between domains (viewpoint transformation; time vs space; 2D vs 3D)
– Lighting
– Ray Tracing / Ray casting
. Floating Point Intensive Applications (SP)
– Single precision Physics
– Single precision HPC
– Sonar
. Pattern Matching
– Bioinformatics
– String manipulation (search engine)
– Parsing, transformation, translation (XSLT)
– Audio processing, language translation?
– Filtering & Pruning
. Offload Engines
– TCP/IP
– Compiler for gaming applications
– XML
– Network Security, Virus Scan and Intrusion

. Structured
– Easier for memory fetch & SIMD operations
– Data prefetch possible
– Non branchy instruction pipeline
– Data more tolerant, but has the same caution
. Multiple Operations on Data
– Many operations on same data before reloading
. Easy Parallelize and SIMD
– Little or no collective communication required
– No Global or Shared memory or nested loops
. Compute Intense
– Determined by ops per byte
. Fits Streaming Model
– Small computation kernel through which you stream a large body of data
– Algorithms that fit Graphics Processing Units
– GPUs are being used for more than just graphics today thanks to PCI Express

Cell Broadband Engine Processor

Cell BE Processor Isn't Just for Games. Innovative Chip is best high-performance embedded processor of 2005

We chose the Cell BE as the best high-performance embedded processor of 2005 because of its innovative design and future potential....Even if the Cell BE accumulates no more design wins, the PlayStation 3 could drive sales to nearly 100 million units over the likely five-year lifespan of the console. That would make the Cell BE one of the most successful microprocessors in history.

“…Cell BE could power hundreds of new apps, create a new video-processing industry and fuel a multibillion-dollar build out of tech hardware over ten years.” -- Forbes

“It was originally conceived as the microprocessor to power Sony's [PS3], but it is expected to find a home in lots of other broadband-connected consumer items and in servers too.” -- IEEE Spectrum

Cell Broadband Engine Processor Cell Broadband Engine Architecture™ (CBEA) Technology Competitive Roadmap

[Figure: roadmap, 2006 to 2010.]
. Cost Reduction: Cell BE (1+8), 90nm SOI
. Performance Enhancements / Scaling: Enhanced Cell BE (1+8 eDP SPE), 65nm SOI
. Next Gen (2 PPE’ + 32 SPE’), 45nm SOI, ~1 TFlop (est.)

All future dates and specifications are estimations only; Subject to change without notice. Dashed outlines indicate concept designs.

Cell Broadband Engine Processor Cell Broadband Engine™ Blade – The first in a line of planned offerings using Cell Broadband Engine technology

Hardware
. Cell BE-Based Blade (GA: 2H06): 2 Cell BE Processors; Single Precision Floating Pt Affinity; 1 GB Memory; Up to 4X PCI Express™
. Advanced Cell BE-Based Blade (Target Availability: 2H07): 2 Cell BE Processors; Single Precision Floating Pt Affinity; 2 GB Memory; Up to 16X PCI Express
. Enhanced Cell BE-based Blade (Target Availability: 1H08): 2 Enhanced Cell BE Processors; SP & DP Floating Point Affinity; Up to 32 GB Memory; Up to 16X PCI Express

Software
. SDK 1.1 (Available: 17 July 2006)
. SDK 2.0 (Target Availability: 1H07)
. SDK 3.0 (Target Availability: 2H07)
. Legend: Alpha Software, Beta Software, GA Software

[Figure: timeline 2006 to 2008, with performance increasing across the hardware generations.]

All future dates and specifications are estimations only; Subject to change without notice.

Cell Broadband Engine Processor IBM Server Strategy

[Figure: IBM server strategy. Vertical axis: Scale Up / SMP Computing. Horizontal axis: Scale Out / Distributed Computing. Regions: Large SMPs; High Density Racks/Blades; Clusters and Virtualization.]

Systems that take you where you want to go:

Major Points to Make

To offer this choice in deployment options, IBM’s continuing innovation and investment in essential server and storage technology provides both scale-up SMP processors and scale-out distributed processors. Advanced virtualization technology shared across our symmetric multiprocessor systems lets you consolidate multiple workloads to scale “up.”

If you want to scale “out,” the architecture of our rack optimized or new BladeCenter products supports rapid deployment while reducing cost and complexity.

The industry’s most advanced clustering technology offers the best of both worlds, seamlessly interconnecting multiple servers and storage into a unified computing resource.

To support both these flexible strategies, IBM leverages the Capacity on Demand (CoD) Offerings so customers can activate dormant processing or storage capacity as workload requirements increase, thereby investing for the future while paying only for today’s requirements. CoD provides the ability to quickly scale up to meet growth or new application workload requirements without requiring the acquisition of new servers or storage devices, thereby lowering overall server TCO.

Other Points to Make

IBM is unique in the ability to offer these multiple growth paths.

Transition line:

And it’s not just the hardware technology that IBM has developed that offers flexible growth. Let’s take a look at a key strategic initiative that IBM has been a part of – Linux – that helps protect your ability to pick the right business application and run it on the best possible platform.


Cell Broadband Engine Processor Cell BE Form Factors

Cell BE is being delivered in a 2-way Blade form factor today (IBM Blade and BladeCenter).

Additional form factors have been announced from our partners:
• PCI Express Card
• Workstation
• Rack Optimized
• CE Devices such as HDTVs and PlayStation 3

Cell Broadband Engine Processor Accelerator Technology & Programming Models

Accelerator programming model choice is dictated by the distance, and the corresponding latency and bandwidth performance, between host and slave.

Increasing proximity of host & slave(s) reduces latency and support infrastructure (faster / cheaper).

Accelerator distance from host, with implementation examples, in order of increasing proximity:
. Network: Clusters
. I/O Slot: nVidia, ATI, Clearspeed, Mercury CAB
. On Board (Planar) interconnect: Host + (CC) HT + FPGA, Host + (CC) HT + Clearspeed, etc.
. System on a Chip (SoC): Cell BE

Programming models:
. RPC
. User Directed Shared memory (e.g. SHMEM, OpenMP)
. Library Pass thru
. Software Managed Cache
. Hardware Managed Cache Coherency (e.g. NUMA, CC-HT)

Cell Broadband Engine Processor

Accelerators

main()
    // do something
    global (struct A)
    global (int N)
    global (struct M)
    // foo will be executed on the slave
    // it is probably best to set up asynchronous or pipeline
    // processing on the host for max performance and efficiency
    A = foo (N, M)
    // when we are all done with A, N, and M we can
    // release the pinned memory
    // do more stuff

[Figure: HOST with Pinned Memory and DD; SLAVE with DD running foo (M,N).]

• A, N, and M are pinned in memory
• The foo() entry point is mapped to the slave by the host DD (addresses are passed to the slave)
• The IO MMU on the slave DMAs data from pinned memory using the address maps
• If a fault occurs on the SLAVE, the IO MMU goes to the pinned memory for a DMA; if a subsequent fault occurs (e.g. address not in pinned memory), the Host DD gets an interrupt and must pin the new address in shared memory
• If the SLAVE does not have permission to read that address or the pinned memory space is full, a SEGV is generated
• NOTES:
• If we allow for the user to not expressly pin shared memory, how is the shared memory managed in terms of caching policies?
• Are the shared memory APIs applicable to PE-SPE or SPE-SPE programming? I think this would be a logical fit and allow for a more generalized (and portable) view of accelerator programming (is this just SHMEM or UPC on the SPEs?)

Cell Broadband Engine Processor Accelerator Library Framework & Data and Communication Synchronization Library

[Figure: software stack, top to bottom: Application; Library; DaCS and ALF, with Tooling alongside; Platform.]
. DaCS and ALF components: Topology, Process Management, Data Partitioning, Error Handling, Workload Distribution, Synchronization, Send / Receive, Remote DMA, Mailbox
. Tooling: IDE, gdb, Trace analysis, Others
. Platform: DMA, Mailbox, IB Verbs, Open MPI, LAPI, ARMCI, etc.

Cell Broadband Engine Processor DaCS

. Focus on data movement primitives
– Message Passing
• send/recv
– Remote Direct Memory Access (rDMA)
• put/get
• put_list/get_list
– Double and multi-buffering
• Efficient data transfer to maximize available bandwidth and minimize inherent latency
• Hide complexity of asynchronous compute/communicate from developer
– Mailbox
• write to mailbox/read from mailbox
– Synchronization
• Mutex
• Barrier
• atomic operations
. Based on windows and channels architecture
. Process Management
– Supports remote launching and termination of an accelerator’s process from a host process
. Error Handling
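The double-buffering bullet can be illustrated with a small sketch in portable C. It is not DaCS code and uses no DaCS calls: `fetch_chunk`, `sum_stream`, and `CHUNK` are hypothetical names, and the "transfer" is a plain `memcpy` standing in for an asynchronous get. The point is only the control flow: the next chunk is requested before the current one is processed, so on real hardware the transfer would overlap with the computation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per transfer; illustrative only */

/* Stand-in for an asynchronous get. A real transfer would return at once
   and complete in the background; memcpy completes immediately. */
void fetch_chunk(float *dst, const float *src, size_t n)
{
    memcpy(dst, src, n * sizeof(float));
}

/* Sum a stream using two buffers: while buffer `cur` is being computed on,
   buffer `nxt` has already been handed the following chunk. */
float sum_stream(const float *stream, size_t total)
{
    assert(total % CHUNK == 0);
    float buf[2][CHUNK];
    float sum = 0.0f;
    size_t chunks = total / CHUNK;
    int cur = 0;

    if (chunks == 0)
        return sum;
    fetch_chunk(buf[cur], stream, CHUNK);            /* prime first buffer */
    for (size_t c = 0; c < chunks; c++) {
        int nxt = cur ^ 1;
        if (c + 1 < chunks)                          /* request next chunk */
            fetch_chunk(buf[nxt], stream + (c + 1) * CHUNK, CHUNK);
        for (size_t i = 0; i < CHUNK; i++)           /* compute on current */
            sum += buf[cur][i];
        cur = nxt;                                   /* swap buffers */
    }
    return sum;
}
```

With a truly asynchronous transfer, a wait for completion of the pending get would be needed before the buffers are swapped; that step is omitted here because `memcpy` is synchronous.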


Cell Broadband Engine Processor ALF

. Division of labor approach
– ALF provides wrappers for computational kernels; synthesize kernels with data partitioning
. Initialization and cleanup of a group of accelerators
– Groups are dynamic
– Mutex locks provide synchronization mechanism for data and processing
. Remote error handling
. Data partitioning and list creation
– Efficient scatter/gather implementations
• Stateless embarrassingly parallel processing, strided partitioning, butterfly communications, etc.
– Extensible to variety of data partitioning patterns
. Target prototypes: FFT, TRE, Sweep3D
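To make the data partitioning bullet concrete, here is a sketch in portable C of two common patterns, block and strided partitioning across a group of accelerators. This is not the ALF API: `block_range` and `strided_indices` are hypothetical helper names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Block partition: accelerator `id` of `n_acc` gets one contiguous range.
   Leftover elements are spread over the first (total % n_acc) accelerators. */
void block_range(size_t total, size_t n_acc, size_t id,
                 size_t *begin, size_t *count)
{
    assert(n_acc > 0 && id < n_acc);
    size_t base = total / n_acc;
    size_t extra = total % n_acc;
    *count = base + (id < extra ? 1 : 0);
    *begin = id * base + (id < extra ? id : extra);
}

/* Strided partition: accelerator `id` gets elements id, id + n_acc, ...
   Returns how many indices were written to `out` (capacity `cap`). */
size_t strided_indices(size_t total, size_t n_acc, size_t id,
                       size_t *out, size_t cap)
{
    assert(n_acc > 0 && id < n_acc);
    size_t n = 0;
    for (size_t i = id; i < total && n < cap; i += n_acc)
        out[n++] = i;
    return n;
}
```

For 10 elements on 4 accelerators, block partitioning gives ranges of 3, 3, 2, and 2 elements, while strided partitioning gives accelerator 1 the elements 1, 5, and 9.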


Cell Broadband Engine Processor Cell BE IDE

. Based on Eclipse/CDT
. Develop, build, debug Cell BE and Hybrid applications
. Supports Full System Simulator and remote execution
. Integrates performance analysis tools
. Support for Hybrid Model programming wizard

[Figure: Cell IDE layered on CDT and Eclipse, connected to the Cell tool chain, simulator (Sim), and performance tools (perf).]

CDT: Eclipse C/C++ Development Tooling

Cell Broadband Engine Processor Delivering the Platform for Emerging Workloads

Market & Solution Specific Assets (sector focused)
. Digital Media
. Information Based Medicine
. Home Media / Consumer Electronics
. Financial Services Sector
. Aerospace and Defense
. Digital Video Surveillance

Common Workload Characteristics/Requirements
. Real-time Analytics
. Image/Video Creation/Mgt
. Unstructured Data
. Processing of Data
. Presentation of Data
. Multimodal Search
. Information Synthesis
. Visualization
. Data Transforms
. Analysis
. Imaging
. Pattern Matching

Opportunities for multimodal analysis technology arise where the rate of data produced far outpaces the rate at which humans can digest the data, interpret it as information, and apply it to knowledge-based decisions, all in real time.

Unstructured Data refers to information that may serve its purpose for information dissemination, but is not structured in a clean and consistent fashion for use as a data source.

IBM STG

Seeing is believing

Cell BE Demos


Cell Broadband Engine Processor 3D Visualization via Volumetric Rendering for Medical Images

Courtesy of Mercury Computer Systems

A single scan can result in 2,000 – 3,000 2D images. Shown below is a CT reconstruction performed on an x86 server and on a Cell BE blade.

. High End x86 Server Solution: ~6 minutes to render entire volume (~2 seconds per slice)
. Cell BE Solution (90nm first gen): ~2 seconds to render entire volume

http://www.mc.com/cell/media/medium.cfm

Please note this slide requires 2 video files placed in C:\Cell Videos: recon-anim-cell-cvid.avi and Cell PC WMV.wmv

A single scan can create as many as 2,000 – 3,000 images

Cell Broadband Engine Processor Ray-Tracing Quaternion Julia Fractals using the floating point power in graphics processing units (GPUs).

This kind of algorithm is pretty much ideal for the GPU - extremely high arithmetic intensity and almost zero bandwidth usage.

. Dual Xeon IBM + BFG Nvidia GeForce 7800 GT: ~3-4 Frames per Second (matching left)
. Cell BE Solution (Blade 1 bring up board): ~18 Frames per Second with no textures added; ~15 Frames per Second with Textures (as shown)

http://gametomorrow.com/blog/index.php/2005/11/30/gpus-vs-cell/
Blogged under Cell by Barry Minor on Wednesday 30 November 2005 at 7:39 pm

Recently I came across a link on www.gpgpu.org that I found interesting. It described a method of ray-tracing quaternion Julia fractals using the floating point power in graphics processing units (GPUs). The author of the GPU code, Keenan Crane, stated that “This kind of algorithm is pretty much ideal for the GPU - extremely high arithmetic intensity and almost zero bandwidth usage”. I thought it would be interesting to port this Nvidia CG code to the Cell processor, using the public SDK, and see how it performs given that it was ideal for a GPU.

First we directly translated the CG code line for line to C + SPE intrinsics. All the CG code structures and data types were maintained. Then we wrote a CG framework to execute this shader for Cell that included a backend image compression and network delivery layer for the finished images. To our surprise, well not really, we found that using only 7 SPEs for rendering, a 3.2 GHz Cell chip could outrun an Nvidia 7800 GT OC card at this task by about 30%. We reserved one SPE for the image compression and delivery task.

Furthermore, the way CG structures its SIMD computation is inefficient, as it causes large percentages of the code to execute in scalar mode. This is due to the way they structure their vector data, AOS vs SOA. By converting this CG shader from AOS to SOA form, SIMD utilization was much higher, which resulted in Cell outperforming the Nvidia 7800 by a factor of 5 - 6x using only 7 SPEs for rendering.

Given that the Nvidia 7800 GT is listed as having 313 GFLOPs of computational power and seven 3.2 GHz SPEs only have 179.2 GFLOPs this seems impossible, but then again maybe we should start reading more white papers and less marketing hype.

We used a uniprocessor Cell bringup system: 3.2 GHz DD3.1 Cell processor with 8 good SPEs, 512 MB XDR memory, 100 Mb network. Pixels were 128 bit, 32 bit float per color channel. All rendering parameters were the defaults set by the Cg program with the exception of the window size, which was increased to 1024×1024. By the way, the frame rate I saw on the 7800 GT OC was 3-4 fps with a window size of 1024×1024.

http://graphics.cs.uiuc.edu/svn/kcrane/web/project_qjulia.html
http://astronomy.swin.edu.au/~pbourke/fractals/quatjulia/
http://graphics.cs.uiuc.edu/~jch/papers/rtqjs.pdf
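The 179.2 GFLOPs figure quoted in the note can be reproduced with simple arithmetic. The sketch below assumes each SPE completes one 4-wide single-precision fused multiply-add per cycle (4 lanes x 2 flops); that assumption is not stated in the note itself but is consistent with its number. The function name `spe_peak_sp_gflops` is illustrative.

```c
#include <assert.h>

/* Peak single-precision GFLOPS for n_spe SPEs at clock_ghz.
   Assumes 4 SIMD lanes and one fused multiply-add (2 flops) per lane
   per cycle. */
double spe_peak_sp_gflops(int n_spe, double clock_ghz)
{
    assert(n_spe >= 0 && clock_ghz >= 0.0);
    const double lanes = 4.0;
    const double flops_per_lane = 2.0;  /* multiply + add */
    return n_spe * clock_ghz * lanes * flops_per_lane;
}
```

Under the same assumption, seven SPEs at 3.2 GHz give 7 x 3.2 x 8 = 179.2 GFLOPs, and all eight give 204.8 GFLOPs, which lines up with the ">200 GFlops (SP)" figure on the first-generation Cell BE slide.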

Cell Broadband Engine Processor Cg to SPE C + Intrinsics

float4 quatSq( float4 q ) float4 quatSq ( float4 q ) { { float4 r; float4 r; r.x = q.x*q.x - dot( q.yzw, q.yzw ); float q_x, rx; r.yzw = 2*q.x*q.yzw; float4 q_yzw, ryzw; return r; } q_x = spu_extract (q, 0); q_yzw = spu_slqwbyte (q, 4); rx = q_x * q_x - dot3 (q_yzw, q_yzw); ryzw = spu_mul (spu_splats (2.0f * q_x), q_yzw); r = spu_insert (rx, spu_rlmaskqwbyte (ryzw, -4), 0); return r; }

AOS vs SOA Vectors

Array of Structures:

    X Y Z

Structure of Arrays:

    X X X X
    Y Y Y Y
    Z Z Z Z

AOS vs SOA Code

Array of Structures Code:

    float4 quatSq ( float4 q )
    {
        float4 r;
        float q_x, rx;
        float4 q_yzw, ryzw;

        q_x = spu_extract (q, 0);
        q_yzw = spu_slqwbyte (q, 4);
        rx = q_x * q_x - dot3 (q_yzw, q_yzw);
        ryzw = spu_mul (spu_splats (2.0f * q_x), q_yzw);
        r = spu_insert (rx, spu_rlmaskqwbyte (ryzw, -4),0);
        return r;
    }

Structure of Arrays Code:

    void quatSq4 ( float4 qx, float4 qy, float4 qz, float4 qw,
                   float4 *rx, float4 *ry, float4 *rz, float4 *rw )
    {
        float4 qx2;
        float4 dotq;

        qx2 = spu_mul (((float4)(2.0f)), qx);
        dotq = _dot_product3_v (qy, qz, qw, qy, qz, qw);
        *rx = spu_msub (qx, qx, dotq);
        *ry = spu_mul (qx2, qy);
        *rz = spu_mul (qx2, qz);
        *rw = spu_mul (qx2, qw);
    }
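For readers without an SPU toolchain, the SOA idea can be sketched in portable C. This is not the code from the slide: the `QuatSoA` struct, the `LANES` constant, and the function name `quat_sq_soa` are illustrative assumptions, and the loop over `k` stands in for the four SIMD lanes that `quatSq4` handles with one intrinsic per statement.

```c
#include <stddef.h>

#define LANES 4  /* four quaternions processed together, one per SIMD lane */

/* Hypothetical SOA container: component arrays instead of one struct per quaternion. */
typedef struct {
    float x[LANES];
    float y[LANES];
    float z[LANES];
    float w[LANES];
} QuatSoA;

/* Square LANES quaternions at once:
 *   r.x   = x*x - (y*y + z*z + w*w)
 *   r.yzw = 2*x * (y, z, w)
 * Every statement in the loop body applies to all lanes independently,
 * which is what lets a SIMD unit keep all four slots busy. */
void quat_sq_soa(const QuatSoA *q, QuatSoA *r)
{
    for (size_t k = 0; k < LANES; ++k) {
        float x2  = 2.0f * q->x[k];
        float dot = q->y[k] * q->y[k] + q->z[k] * q->z[k] + q->w[k] * q->w[k];
        r->x[k] = q->x[k] * q->x[k] - dot;
        r->y[k] = x2 * q->y[k];
        r->z[k] = x2 * q->z[k];
        r->w[k] = x2 * q->w[k];
    }
}
```

In the AOS version, the scalar `x` term and the three-wide `yzw` term leave SIMD slots idle; in this layout no lane ever needs data from another lane.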

Terrain Rendering Engine (TRE)

A 2D height map is converted to a 3D model and covered in texture; lighting and atmosphere are added, and the model is then ray-cast.

This video is broken into 3 segments:
•30 seconds of Intel performing the process
•30 seconds of an Apple G5 using VMX
•30 seconds of Cell BE using 7 SPUs to render, and the 8th to compress and do network operations

IBM’s iRT – Ray Trace Images in Interactive Time
(First “LANL style” application – AMD & Cell working together)

[Images, left to right: ray traced car with up to four levels of reflection and transparency, detailed shadows, and 4x4 jittered multi-sampling, totaling up to 96 rays per pixel; ambient occlusion estimate of global illumination for the scene, using 64 random occlusion samples per pixel; composite image including reflection, transparency, detailed shadows, ambient occlusion, and 4x4 jittered multi-sampling, totaling up to 288 rays per pixel. The car model is comprised of 1.6 million polygons and the image resolution is 1080p hi-def.]

[Charts: "Ray Traced" and "Ambient Occlusion", frames/second versus number of blades (1 to 4).]


iRT: An Interactive Ray Tracer for the CELL Processor

Barry Minor∗ Mark Nutter Joaquin Madruga IBM Corporation

∗ e-mail: {bminor,mnutter,joaquinm}@us.ibm.com

Abstract

Recent advances in software techniques have renewed interest in ray tracing images in interactive time [3]. These techniques have shown a significant performance improvement for primary and shadow rays, particularly when mapped to the CELL Broadband Engine [1, 2]. Much of this work has focused specifically on ray casting and simple shading, without fully addressing global effects such as reflection or ambient occlusion [5]. We present an interactive ray tracer for CELL which builds on previous work but additionally computes accurate reflections, transparency, detailed shadows, BRDF lighting, and cubic environment mapped textures. The system uses a combination of optimized SIMD ray kernels and a hierarchical scheduler which distributes work across a cluster of CELL-based systems. The optimized ray kernels explicitly cache BVH scene data [1, 2], and achieve 93-97% efficiency rates for the node and vertex caches. We further apply ray-packet techniques to accelerate traversal for ambient occlusion rays, which can be used as an approximation of global illumination [4, 5]. To this we add the novel twist of reversing direction of the ambient occlusion rays to point inward to the scene, which has the benefit of providing an early-exit condition for their packet traversal. The hierarchical scheduler, meanwhile, distributes work dynamically with a high degree of spatial affinity at both the per-CELL and per-node levels. By combining these techniques a CELL-based system can produce 720p ray traced images at interactive frame rates, even for moderately complex scenes containing more than one million polygons. These results demonstrate that ray tracing and real time ambient occlusion are attractive for implementations on CELL-based systems.

Figure 1: Left: Ray traced car with up to four levels of reflection and transparency, detailed shadows, and 4x4 jittered multi-sampling, totaling up to 96 rays per pixel. Middle: Ambient occlusion estimate of global illumination for the scene, using 64 random occlusion samples per pixel. Right: Composite image including reflection, transparency, detailed shadows, ambient occlusion, and 4x4 jittered multi-sampling, totaling up to 288 rays per pixel. The car model is comprised of 1.6 million polygons and the image resolution is 1080p hi-def.

Figure 2: Scalable performance for animated sequence as it is dynamically rebalanced across a cluster of IBM BladeCenter QS20 servers. Left: Frame rates for ray traced images without multi-sampling. Right: Frame rates for ambient occlusion, without multi-sampling. The image resolution is 720p hi-def.

Cell BE @ CEATEC 2005

Toshiba’s Magic Mirror Demo

IBM Extreme Blue Intern Team Project – (Summer ’06 - AcCellerated Vision)

Project Outline
.Used 2 USB $39 Logitech cameras (640x480 color @ 15 fps)
.Performed real-time depth perception and gesture recognition in an office environment – in this case, a shared lab space with 40+ people coming and going in the background of the cameras' field of vision
.This demo uses a “depth barrier” to trigger drawing on screen vs. not drawing
.The blue glove was used as a programming convenience for the students because of the time pressure
.Team consisted of 4 interns for 12 weeks, 8 of which were available programming time (using standard C and OpenCV):
–Jacob Albertson (CMU, Junior)
–Kenneth Arnold (Cornell, Senior)
–Steve Goldman (Stanford, Masters)
–AJ Sessa (CMU/Tepper, MBA)

–Single SPU per camera stream (total: 2 streams = 2 SPUs)
–Cell BE computation was fast enough that the only bottleneck was at the camera
–Simplistic single-buffered DMA requests were used.
–Function offload model to control SPU execution
–Used mailboxes for communication between PPU and SPU.
–Sent address of function-specific control block, and the SPU switched based on the control block type.
–Straightforward SIMD implementation.
–Cell porting took just over 1 week (3 programmers in the “over the shoulder” extreme programming model).
–Stopped optimizing when total time was less than 30ms per frame, since camera is only 30fps
• Showed 20x vs similarly clocked Intel Core Duo
• Additional 10x increase easily possible.
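The function offload model in the bullets above (send the address of a control block, let the SPU switch on its type) can be sketched in portable C. This is only an illustration under stated assumptions: the slides do not show the real control-block layout, so the `control_block` struct, the `cb_type` values, and the worker functions are hypothetical, and a plain function call stands in for the mailbox write and the DMA transfer that real hardware would use.

```c
#include <stdint.h>

/* Hypothetical request types; the actual project's types are not shown in the slides. */
enum cb_type { CB_DEPTH_MAP = 1, CB_GESTURE = 2 };

/* Hypothetical function-specific control block prepared by the PPU. */
struct control_block {
    uint32_t type;    /* selects which offloaded function to run */
    uint32_t width;   /* frame width in pixels */
    uint32_t height;  /* frame height in pixels */
};

/* Placeholder workers: they only report how many pixels they would touch. */
static int run_depth_map(const struct control_block *cb)
{
    return (int)(cb->width * cb->height);
}

static int run_gesture(const struct control_block *cb)
{
    return (int)(cb->width * cb->height);
}

/* Stand-in for the SPU side. On real hardware the control block address
 * would arrive through a mailbox and the block would be fetched by DMA;
 * here the pointer is passed directly. Returns -1 for an unknown type. */
int spu_dispatch(const struct control_block *cb)
{
    switch (cb->type) {
    case CB_DEPTH_MAP: return run_depth_map(cb);
    case CB_GESTURE:   return run_gesture(cb);
    default:           return -1;
    }
}
```

Keeping the request in a small typed block means one resident SPU program can serve several functions without being reloaded, which fits the one-SPU-per-camera-stream setup described above.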


Natural skin tone detection can be done, and is done by S3, but requires a lot more coding effort to isolate, for example finding and ignoring the head and neck. Given only weeks, the blue glove was used.

IBM STG

Areas IBM and Our Partners Are Looking at for Cell BE

.Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation of:
–Protein folding and related diseases, including Alzheimer’s Disease, Huntington's Disease, and certain forms of cancer.
–By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine.

.Folding@Home utilizes the new Cell processor in Sony’s PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers. With this new technology we will likely be able to attain performance on the 100 gigaflop scale per computer.
–With about 10,000 such machines, we would be able to achieve performance on the petaflop scale.

http://folding.stanford.edu/FAQ-PS3.html Dr. V. S. Pande, folding@home, Distributed Computing Project, Stanford University


Dr. V. S. Pande, Distributed Computing Project, Stanford University (permission given for showing the video as well)

Folding@Home on the PS3: the Cure@PS3 project

INTRODUCTION

Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation. By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine. FAH has targeted the study of protein folding and protein folding disease, and numerous scientific advances have come from the project.

Now in 2006, we are looking forward to another major advance in capabilities. This advance utilizes the new Cell processor in Sony’s PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers. With this new technology (as well as new advances with GPUs), we will likely be able to attain performance on the 100 gigaflop scale per computer. With about 10,000 such machines, we would be able to achieve performance on the petaflop scale.

With software from Sony, the PlayStation 3 will now be able to contribute to the Folding@Home project, pushing Folding@Home a major step forward. Our goal is to apply this new technology to push Folding@Home into a new level of capabilities, applying our simulations to further study of protein folding and related diseases, including Alzheimer’s Disease, Huntington's Disease, and certain forms of cancer. With these computational advances, coupled with new simulation methodologies to harness the new techniques, we will be able to address questions previously considered impossible to tackle computationally, and make even greater impacts on our knowledge of folding and folding related diseases.

ADVANCED FEATURES FOR THE PS3

The PS3 client will also support some advanced visualization features. While the Cell microprocessor does most of the calculation processing of the simulation, the graphic chip of the PLAYSTATION 3 system (the RSX) displays the actual folding process in real-time using new technologies such as HDR and ISO surface rendering. It is possible to navigate the 3D space of the molecule using the interactive controller of the PS3, allowing us to look at the protein from different angles in real-time. For a preview of a prototype of the GUI for the PS3 client, check out a screenshot or one of these videos (355K avi, 866K avi, 6MB avi, 6MB avi; more videos and formats to come).

There is also a "bootleg" video of Sony's presentation on FAH that is now on YouTube (although the audio and video quality is pretty bad).


IBM to Build World's First Cell Broadband Engine™ Based Supercomputer
Revolutionary Hybrid Supercomputer at Los Alamos National Laboratory Will Harness Cell Game Chips and AMD Opteron™ Technology

[Diagram: Cell BE Accelerator Linux Cluster (8000+ Blades) and x86 Linux Master Cluster (AMD Opteron™, System x3755)]

. Goal: 1 PetaFlop Double Precision Floating Point Sustained
– 1.6 PetaFlop Peak DP Floating Point (3.2 SP)
– 360 server racks that take up around 12,000 square feet, about three basketball courts
– Hybrid of Opteron X64 AMD processors (System x3755 servers) and Cell BE Blade Servers connected via high speed network
– Modular approach means the Master Cluster could be made up of any type of system, including Power, Intel

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. Development collaboration between Sony, Toshiba and IBM.

The U.S. Department of Energy's National Nuclear Security Administration (NNSA) has selected IBM to design and build the world's first supercomputer to harness the immense power of the Cell Broadband Engine™ (Cell B.E.) processor aiming to produce a machine capable of a sustained speed of up to 1,000 trillion calculations per second, or one petaflop. The 'hybrid' supercomputer, codenamed Roadrunner, will be installed at DOE's Los Alamos National Laboratory. In a first-of-a-kind design, Cell B.E. chips -- originally designed for video game platforms -- will work in conjunction with systems based on x86 processors from Advanced Micro Devices, Inc. (AMD). Designed specifically to handle a broad spectrum of scientific and commercial applications, the supercomputer design will include new, highly sophisticated software to orchestrate over 16,000 AMD Opteron™ processor cores and over 16,000 Cell B.E. processors in tackling some of the most challenging problems in computing today. The revolutionary supercomputer will be capable of a peak performance of over 1.6 petaflops (or 1.6 thousand trillion calculations per second). The machine is to be built entirely from commercially available hardware and based on the Linux® operating system. IBM® System x™ 3755 servers based on AMD Opteron technology will be deployed in conjunction with IBM BladeCenter® H systems with Cell B.E. technology. Each system used is designed specifically for high performance implementations. Designed also with space and power consumption issues in mind, the system will employ advanced cooling and power management technologies and will occupy only 12,000 square feet of floor space, or approximately the size of three basketball courts. 
New Era of Industry Supercomputing

Roadrunner's construction will involve the creation of advanced "Hybrid Programming" software which will orchestrate the Cell B.E.-based system and AMD system and will inaugurate a new era of heterogeneous technology designs in supercomputing. These innovations, created collaboratively among IBM and LANL engineers, will allow IBM to deploy mixed-technology systems to companies of all sizes, spanning industries such as life sciences, financial services, automotive and aerospace design.

Multigrid Finite Element Solver on Cell

Ported by Digital Medics (www.digitalmedics.de) and the University of Dortmund (ls7-www.cs.uni-dortmund.de) using the free SDK

235 584 tetrahedra
48 000 nodes
28 iterations in NKMG solver
In 3.8 seconds

Sustained Performance for large Objects: 52 GFLOP/s using all 8 SPUs at once

http://www.digitalmedics.de/projects/mfes/

Researchers at Digital Medics and the University of Dortmund developed the first ever Finite Element solver on the revolutionary Cell BE™ microprocessor manufactured by STI. The solver is capable of computing dynamic non-linear problems from solid mechanics using a Newton-Krylov Multigrid algorithm with unprecedented performance. For example, on medium- to large-scale problems, the solver reached a sustained floating-point performance of 52 GFLOPS on a single processor (using all 8 SPUs at once). Possible applications for the solver are in biomechanics and in classical civil and mechanical engineering. Work on a fluid-dynamics solver has also been started and is expected to be finished in the next two months.

[Diagram labels: IBM DVS Vision; Video Compression & Analysis; video clips; objects & meta data; Storage; Surveillance Middleware; LAN / VPN]

IBM DVS watches video & logs activity, generates alerts

Example query: "Show all red cars that drove North on 10 Ave over the last one month"
Example alert: "Alert – car parked in loading zone > 5 mins"
"Why is this car parked here?" "We should change our security policy"

Smart Surveillance Technology Base:
•Object Detection in the presence of distraction motion
•2D Object Tracking: Multi-object tracking with occlusion resolution.
•Object Classification: View independent object classification.
•3D Object Tracking: Precise 3D location using standard cameras.
•Multi-scale Tracking: Automatic PTZ Camera control to track objects.
•Multi-camera Handoff: The ability to track an object across cameras.
•Face Cataloging: Captures faces at large distances from the camera.
•XML Metadata Representation for object and its motion attributes.
•Extensible Engine Architecture for plug and play video analytics.
•Real Time Event Indexing: Scene events are instantaneously available for searching in a distributed database environment.
•Web service interfaces for Event Search & Retrieval support the rapid application development of customer specific applications.
•Scalable Backend System: COTS database technology allows for both distributed surveillance and scalability.

IBM Smart Surveillance System (Previous PeopleVision Project)

Research Areas:
• Robust Background Subtraction
• Salient Motion Detection
• Object Classification
• 2D Tracking
• 3D Multi-Person Tracking
• Articulated Human Body Tracking
• Active Head Tracking
• Coarse Head Pose Estimation
• Position Independent Absolute Head Pose Estimation
• Face Cataloger
• Video Privacy
• Multi-scale Tracking & Index Browser
• Real Time Alerts
• Middleware for Large Scale Surveillance (MILS)
• Performance Evaluation of Surveillance Systems

http://www.research.ibm.com/peoplevision/index.html

Marvel: Semantic Filtering of Audio-Visual Content

Video: [image]

Associated speech: Today, jet fighters practiced maneuvers and forces increased military preparations as tensions in Middle East reached …

Text Analysis
–Speech
–Closed Captions
–Transcript
Text Metadata: “Jet fighters”, “military”, “Middle East”

Image & Video Sequence Analysis
Visual Features: Motion, Color patterns, Texture, Shapes
Visual Feature Metadata

Automatic Semantic Concept Detection
Ontology-Based Models

Semantics Metadata*: Airplane (0.8), Indoors (0.8), Protest (0.6), Parade (0.5), Outdoors (0.9), Sky (0.7), Meeting (0.7), People (0.7)
* with associated confidence scores

Source: John Smith

Automatic Video Metadata Extraction (MPEG-7) – Broadcast News Example

Metadata is extracted automatically by analyzing the video content.
(4) Speech terms and text are extracted using automatic speech recognition, closed caption text or transcript
(5) Content-based features are automatically extracted from video content using automatic shot boundary determination, key-frame extraction, and extraction of colors, textures, shapes, motions, edges, etc.
(6) Semantic labels are generated automatically using statistical modeling techniques that “learn” to associate the labels with the extracted features and speech terms

Media: TRECVID, the Text Retrieval Conference video track, a contest run by NIST. IBM Research just won 1st place (CEO milestone) among 40 different competitors.

The Wall Street Journal: THE JOURNAL REPORT: TECHNOLOGY The Best and The Brightest - GEORGE ANDERS, 11/15/04 International Business Machines Corp. of Armonk, N.Y., won for a new indexing and searching technique to help users deal with multimedia files. IBM's approach, known as Marvel, "drastically reduces the work involved in manually labeling, indexing and searching" files that include pictures, audio or movies, said judge Markus Bayegan. He is chief technology officer at Swiss electrical-engineering company ABB Ltd.

IBM said Marvel was built to simultaneously use three different search methods for multimedia files: text, concept models and feature descriptors. None of those approaches alone has been powerful and foolproof enough to satisfy users to date. IBM is betting that its amalgam of all three will have greater appeal.

Early Data Points for Cell BE in the Financial Services Areas

Benchmark
. European Option risk analytics
. Monte-Carlo simulation
. 200,000,000 random numbers generated
– Mersenne Twister RNG
– Cell also has a Hardware RNG
. Log normal distribution
. Min, max, average, 95% quantile for losses

Code Modification
. 3 days of effort
. Split the problem across multiple SPUs
. SIMD-ized core operations using SPU intrinsics
. Unrolled some loops manually

Scaling Factor by Number of SPUs

[Chart: Real Time (sec) versus Number of SPEs (0 to 20) for Single Precision FP and Double Precision FP, with a Linear Speedup reference.]

          SP                          DP
SPUs      Real Time (s)   Speedup     Real Time (s)   Speedup
1         12.132          1.00        77.009          1.00
2         6.074           2.00        38.516          2.00
4         3.051           3.98        19.271          4.00
8         1.549           7.83        9.662           7.97
16        0.817           14.85       4.873           15.80

Single precision: 4 numbers generated per SPU at once
Double precision: 2 numbers generated per SPU at once (today)

                 SPE Cores   Peak SP GFLOPS   Peak DP GFLOPS (Today)
Cell BE Chip     8           204.8            14.7
Cell BE Blade    16          409.6            29.3
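The Speedup columns in the scaling table are the 1-SPU time divided by the n-SPU time. A minimal sketch of that arithmetic follows, together with parallel efficiency (speedup divided by the number of SPUs), which the slide does not report and is added here only as an illustration; both function names are hypothetical.

```c
/* Speedup of an n-SPU run relative to the 1-SPU run of the same benchmark. */
double speedup(double t1_seconds, double tn_seconds)
{
    return t1_seconds / tn_seconds;
}

/* Parallel efficiency: 1.0 means perfectly linear scaling.
 * This metric is not on the slide; it is derived from the same two times. */
double parallel_efficiency(double t1_seconds, double tn_seconds, int spus)
{
    return speedup(t1_seconds, tn_seconds) / spus;
}
```

For the 16-SPU single-precision row, `speedup(12.132, 0.817)` is about 14.85, matching the table; the corresponding efficiency is a little under 0.93.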

Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. Development collaboration between Sony, Toshiba and IBM.

Three full days to convert the code
From 50 lines of C code to ~300 lines of C code with intrinsics
(c) Split the problem across multiple SPUs
(d) Did the core operations in a SIMD-ized fashion (using SP intrinsics)
(e) Unrolled some loops manually
Concentrate on Mersenne Twister RNG, ignore the Cell SDK’s RNG

Cell Blade vs. Competitive 2-socket Servers

                    Frequency (GHz)   Cores         Peak GFLOPS       Cell Blade % vs. Competition
                    4Q06    4Q07      4Q06   4Q07   4Q06    4Q07      4Q06    4Q07
Cell          SP    3.2     3.2       16     16     409.6   409.6
              DP                                    29.3    204.8
Intel Xeon    SP    2.667   2.667     4      8      42.8    85.3      957%    480%
              DP                                    21.3    42.7      138%    480%
AMD Opteron   SP    3       2.6       4      8      48.0    83.2      853%    492%
              DP                                    24.0    41.6      122%    492%
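The percentage columns in this comparison appear to be the Cell blade's peak GFLOPS divided by the competitor's peak GFLOPS for the same precision and quarter, rounded to a whole percent. The sketch below reproduces that reading; the function name `percent_of` is hypothetical, and the rounding rule is an assumption inferred from the printed values.

```c
/* Cell blade peak as a percentage of a competitor's peak,
 * rounded to the nearest whole percent (assumed rounding rule). */
int percent_of(double cell_gflops, double other_gflops)
{
    return (int)(100.0 * cell_gflops / other_gflops + 0.5);
}
```

For example, `percent_of(409.6, 42.8)` gives 957 (single precision vs. Intel Xeon, 4Q06) and `percent_of(29.3, 24.0)` gives 122 (double precision vs. AMD Opteron, 4Q06).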

Accelerator Hierarchy Example for DCC

•Host system receives request from plug-in
•Scene data assembled
•Different elements scattered remotely
•Brought into host memory
•Request partitioned amongst multiple blades
•Rendering potentially 1000s of frames

•Existing x86 DCC application
•Leverage existing infrastructure
• Plug-ins part of many apps
• Plug-in drives Cell BE Cluster

•Cell BE Blades receive render request
•Each blade services one frame

•Rendering Computation
•Frame sliced across SPEs

Accelerator Hierarchy Example for TRE

•Manages input model (i.e., joystick, script)
•Terrain data tiles pre-fetched from disk/net
•Predictive tile pre-fetching algorithms
•Animation sequences dispatched to Cell BE blades

•Display client
•Input device
•Multiple clients

•Interleaved frames sent to each blade
•Data tile cache

•Higher-end visualization
•Drives multiple blades

•Target frame dispatched to blade
•Vertical slices calculated on SPEs

Once problems are parallelized on Cell BE in a micro-acceleration model, cluster based macro-acceleration becomes a natural extension
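The two levels of partitioning named in these hierarchy slides (interleaved frames across blades, vertical slices of one frame across SPEs) can be sketched in portable C. This is an illustration, not the TRE implementation: the `slice` struct and both function names are hypothetical, and an even column split with round-robin frame assignment is assumed.

```c
/* Half-open range of pixel columns [begin, end) assigned to one SPE. */
struct slice {
    int begin;
    int end;
};

/* Split `width` columns into `workers` contiguous vertical slices.
 * The first (width % workers) slices receive one extra column, so the
 * slices cover the frame exactly with no gaps or overlap. */
struct slice vertical_slice(int width, int workers, int index)
{
    int base = width / workers;
    int extra = width % workers;
    struct slice s;
    s.begin = index * base + (index < extra ? index : extra);
    s.end = s.begin + base + (index < extra ? 1 : 0);
    return s;
}

/* Round-robin ("interleaved") assignment of frame numbers to blades. */
int blade_for_frame(int frame, int blades)
{
    return frame % blades;
}
```

With a 1024-column frame and 7 rendering SPEs, slice 0 covers columns 0 to 146 and slice 6 ends at column 1023; the same two functions apply unchanged whether the outer level has one blade or many, which is the sense in which macro-acceleration extends micro-acceleration.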

Remote Visualization continued

•Display client
•Input device
•Multiple clients

.This Rendering and Visualization pattern works for a number of things:
–Tours: one project is looking at taking detailed models of ancient cities, combining them with the location of the PDA, ray-casting them, and streaming the video to the PDAs. This allows people to see what the ancient city looked like from their angle and compare it to the present day.
–Rendering technical models for buildings and machinery remotely to the PDA, allowing visualization aids for wiring, structural support and so on. For example, an architect or builder on-site might be able to walk around a site visualizing his model at different stages of completion. Maintenance personnel gain instant access to 3D schematics, which can be augmented with repair history, common correction steps, and so on.

Terrain Rendering Engine (TRE) and IBM Blades

[Diagram labels: Aircraft data / Field Data; Combine Data & Render; CPBW Blade; Next-Gen GCS; BladeCenter-1 Chassis; Commodity Cell BE Blade; Add Live Video, Aerial Information, Combat Situational Awareness]

-Cell Blade systems compute and compress images. These images are then delivered via the network to clients for decompression and display.
-GPStream framework can be used to deliver the images to mobile clients via wireless.

Conversion between Content Protection Schemes

Media protection schemes and standards rely on client device and client software remaining secure and untampered with. Cell BE provides a superior alternative.

[Diagram: AACS encrypted streams arrive from HD Disc (Blu-ray, HD-DVD), HD IP Streams from ISP, and Other Providers & Methods. A Cell BE based Content Player acts as trusted transcoder: (1) the stream is decrypted using one standard (AACS Decrypt); (2) it is encrypted using another standard (DTCP IP Encrypt). The DTCP IP encrypted stream goes to HDTV (1080p etc), Portable Devices, and Other Devices.]

Cell BE acts as the trusted, open platform for various content protection schemes.

Content is in the clear only within the Secure Processing Vault (most valuable form of content, decrypted but still encoded/compressed).

Helps to Address:
Secure Content Management @ Software/Protocol Level
Secure device-to-device content transfer
Trusted copying of HD content
Network download & burn
Secure DVD/PC to HDTV connection
Device Interoperability

Security designed by Sony – driven by the requirements of Media Market

Role of the Transformation Tier

Battlegrounds
.Formats & Standards
.Security for Content
–Increased Security brings increased:
•Fidelity / Resolution of Content
•User Rights such as Time-shifting
.Usage Monitoring / Transaction
.Packaging and Delivery
.Content Customization & Tailoring
.Unique Watermarking to User

[Diagram: a “Cell BE” Transformation Tier sits between HD content and a range of devices.]
Device A: HD Content / Blu-Ray / etc, “Cell BE”. Highest Fidelity and Security, e.g. 1920 x 1080. Time Shifting, Sharing, Repurposing.
Device B: AVI / MPEG 4, “x86”. Good Fidelity, e.g. 480p.
Device C: TV, Standard TV. Medium Fidelity, e.g. NTSC, PAL, etc.
Device D: Custom, ARM / MIPS / etc. Custom Fidelity, e.g. 320, 640, etc.
… Device n

Note: non Cell BE Transformation Tiers are possible, but should not result in the highest levels of security, fidelity, and feature. It could also watermark the content to the individual user at the time of transformation.

Coherent Optical Processor

Increased focus on manufacturability during the design process
Optical Proximity Correction (OPC) and Optical Rules Checking (ORC)

[Diagram: Design compared with Wafer Image without OPC ("Do they agree?"); OPC is applied; Wafer Image with OPC.]

Why OPC?
 Improve time-to-market
 Improve design quality
 Improve manufacturability
 Improve manufacturing yield
 Improve profitability

Tapeout: The process of converting chip design data to manufacturing data.

Photolithography: Projecting an integrated circuit design layout from a mask, through a complex lens system that shrinks the image onto a wafer, which is later subdivided into individual chips.

Optical Proximity Correction (OPC): OPC is a complex compute-intensive technique used to improve the manufacturability of chips with ever-increasing numbers of smaller and more densely populated features. OPC is the process of applying systematic changes to mask geometries to anticipate and compensate for photolithography distortions in manufacturing.

The goal of OPC is to produce smaller features in an integrated circuit by enhancing the printability of a wafer pattern.

User Interaction Drives Innovation in Computing

[Diagram labels: Punch Cards; Green Screen / Teletype; Mainframe Batch; Mainframe Multitasking; Mini Computer; Word Processing; Stand Alone PC; Cut & Paste; WYSIWYG; Graphic User Interface (GUI); Multitasking; Gaming; Networked; Internet; WWW; Broadband; Natural Interaction]

More information at

. www.ibm.com/developerworks/power/cell
–Full Architecture Spec
–Full System Simulator
–GCC & XLC compiler
–Example applications and libraries
. All Free!

. TIME TO GET IN THE GAME !!


Thank You!

Special Notices

© Copyright International Business Machines Corporation 2006 All Rights Reserved

This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM offerings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document. Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquires, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions. IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. 
Rates and offerings are subject to change, extension or withdrawal without notice. IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. Many of the features described in this document are operating system dependent and may not be available on Linux. For more information, please check: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.html Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.


Special Notices (Cont.) -- Trademarks

The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter, Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced Micro-Partitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage, VideoCharger, Virtualization Engine.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries, or both. Rambus is a registered trademark of Rambus, Inc. XDR and FlexIO are trademarks of Rambus, Inc. UNIX is a registered trademark of The Open Group in the United States, other countries or both. Linux is a trademark of Linus Torvalds in the United States, other countries or both. Fedora is a trademark of Red Hat, Inc. Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both. Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries. TPC-C and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC). AltiVec is a trademark of Freescale Semiconductor, Inc. PCI-X and PCI Express are registered trademarks of PCI SIG. InfiniBand™ is a trademark of the InfiniBand® Trade Association. Other company, product and service names may be trademarks or service marks of others.

Revised July 23, 2006
