An Introduction to the Cell Broadband Engine Architecture
IBM Systems & Technology Group
An Introduction to the Cell Broadband Engine Architecture
Owen Callanan, IBM Ireland, Dublin Software Lab. Email: [email protected]
Cell Programming Workshop, 5/7/2009. © 2009 IBM Corporation

Agenda
– About IBM Ireland development
– Cell Introduction
– Cell Architecture
– Programming the Cell: SPE programming, controlling data movement, tips and tricks
Trademarks: Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc.

Introduction – Development in IBM Ireland
– Lotus Software
– Research
– Tivoli
– Industry Assets & Models
– RFID Center of Competence
– High Performance Computing

Introduction – HPC in the IBM Dublin Lab
– Blue Gene team
– Deep Computing Visualization (DCV)
– Dynamic Application Virtualization (DAV)
– Other areas also, e.g. RDMA communications systems

Cell Introduction

Cell Broadband Engine History
– IBM, SCEI/Sony, Toshiba Alliance formed in 2000
– Design Center opens March 2001
– ~$400M investment, 5 years, 600 people
– February 7, 2005: first technical disclosures
– January 12, 2006: Alliance extended 5 additional years
– Development sites: Yorktown (YKT), East Fishkill (EFK), Burlington, Endicott, Rochester, Boeblingen, Tokyo, San Jose, RTP, Austin, India, Israel

First implementation of the Cell B.E. Architecture

IBM PowerXCell™ 8i – QS22
– PowerXCell 8i processor
– 65 nm
– 9 cores, 10 threads
– 230.4 GFlops peak (SP) at 3.2 GHz
– 108.8 GFlops peak (DP) at 3.2 GHz
– Up to 25 GB/s memory bandwidth
– Up to 75 GB/s I/O bandwidth

Cell Processor: "Supercomputer-on-a-Chip"
Synergistic Processor Elements (SPE):
– 8 per chip
– 128-bit wide SIMD units
– Integer and floating point capable
– 256KB local store per SPE
– Up to 25.6 GF/s per SPE, 200 GF/s total
Power Processor Element (PPE):
– General-purpose, 64-bit RISC processor (PowerPC 2.02)
– 2-way hardware multithreaded
– L1: 32KB I, 32KB D; L2: 512KB
– Coherent load/store, VMX
– 3.2 GHz
Internal interconnect (Element Interconnect Bus):
– Coherent ring structure
– 300+ GB/s total internal interconnect bandwidth
– DMA control to/from SPEs
– Supports >100 outstanding memory requests
External interconnects:
– 25.6 GB/s memory interface (XDR)
– 2 configurable I/O interfaces: coherent interface (SMP, 20 GB/s) and normal I/O interface (I/O and graphics, 5 GB/s)
– Total bandwidth configurable between interfaces: up to 35 GB/s out, up to 25 GB/s in
Memory management and mapping:
– SPE local store aliased into PPE system memory
– MFC/MMU controls and protects SPE DMA accesses
– Compatible with the PowerPC virtual memory architecture
– Software controllable from the PPE via MMIO
– Hardware or software TLB management
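The local store and MFC described above drive the basic SPE programming pattern: an SPE cannot load directly from system memory, so it issues tagged DMA transfers through its MFC and then computes on the local-store copy with 128-bit SIMD instructions. The following is a minimal SPE-side sketch of that pattern using the Cell SDK headers (spu_mfcio.h, spu_intrinsics.h); the buffer size, tag number, and the convention of passing the buffer's effective address in argp are illustrative assumptions, not taken from the workshop material.

/* Minimal SPE-side sketch (assumptions: the effective address of a
 * CHUNK-sized, 128-byte-aligned float buffer is passed in argp; DMA
 * tag 0 is free). Build with spu-gcc. */
#include <spu_intrinsics.h>
#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer (hardware max is 16KB) */

/* Local-store buffer; 128-byte alignment gives best DMA performance. */
static float ls_buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    const unsigned int tag = 0;
    unsigned int i;
    (void)speid; (void)envp;

    /* DMA the chunk from system memory (effective address argp) into
       the 256KB local store, then wait for the tag group to complete. */
    mfc_get(ls_buf, argp, CHUNK, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* SIMD pass over the local-store copy: scale four floats at a time
       in the 128-bit wide SPU registers. */
    {
        vec_float4 *v = (vec_float4 *)ls_buf;
        vec_float4 scale = spu_splats(2.0f);
        for (i = 0; i < CHUNK / sizeof(vec_float4); i++)
            v[i] = spu_mul(v[i], scale);
    }

    /* DMA the result back to system memory and wait again. */
    mfc_put(ls_buf, argp, CHUNK, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    return 0;
}

In practice a real kernel would double-buffer: issue the next mfc_get on a second tag while computing on the current buffer, so DMA latency overlaps with SIMD work.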
IBM BladeCenter QS21: announced August 28, 2007
IBM BladeCenter QS22: announced May 13, 2008

IBM BladeCenter QS21
Core electronics:
– Dual 3.2 GHz Cell/B.E. processor configuration
– 2GB XDR DRAM (1GB per processor, 18x XDR each)
– Dual Gigabit Ethernet (GbE) controllers
– Single-wide blade (uses 1 BladeCenter H slot)
– Optional InfiniBand 4x channel adapters: Cisco Systems 4X InfiniBand HCA Expansion Card for BladeCenter (32R1760)
– Optional Serial Attached SCSI (SAS) daughter card (39Y9190)
(Block diagram: two Cell/B.E. processors with Rambus XIO memory and FlexIO links to a south bridge providing PCI-E, USB, GbE, and optional IB connections to the BladeCenter H midplane.)
BladeCenter chassis configuration:
– Standard IBM BladeCenter H
– Max. 14 QS21 blades per chassis
– 2 Gigabit Ethernet switches
– External IB switches required for the IB option: Cisco Systems 4X InfiniBand Switch Module for BladeCenter (32R1756)
Peak performance:
– Up to 460 GFLOPS per blade
– Up to 6.4 TFLOPS (peak) in a single BladeCenter H chassis
– Up to 25.8 TFLOPS in a standard 42U rack
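A dual-Cell blade such as the QS21 appears to Linux as a single system with 16 SPEs, and it is the PPE-side application that decides how to put them to work. A common pattern, sketched below with the standard libspe2 API, is to create one SPE context per physical SPE and run each in its own pthread; the embedded program handle name (spe_kernel) and the choice to use every physical SPE are illustrative assumptions.

/* Minimal PPE-side sketch using libspe2: run one copy of an embedded
 * SPE program (assumed to be linked in as "spe_kernel") on every
 * physical SPE. Build with gcc, linking -lspe2 -lpthread. */
#include <libspe2.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

extern spe_program_handle_t spe_kernel;   /* assumed embedded SPE image */

static void *spe_thread(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    /* Blocks this pthread until the SPE program stops. */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
        perror("spe_context_run");
        exit(EXIT_FAILURE);
    }
    return NULL;
}

int main(void)
{
    spe_context_ptr_t ctx[16];
    pthread_t tid[16];
    int i, nspes;

    nspes = spe_cpu_info_get(SPE_COUNT_PHYSICAL_SPES, -1);
    if (nspes > 16) nspes = 16;

    for (i = 0; i < nspes; i++) {
        ctx[i] = spe_context_create(0, NULL);
        spe_program_load(ctx[i], &spe_kernel);
        pthread_create(&tid[i], NULL, spe_thread, ctx[i]);
    }
    for (i = 0; i < nspes; i++) {
        pthread_join(tid[i], NULL);
        spe_context_destroy(ctx[i]);
    }
    printf("ran %d SPE contexts\n", nspes);
    return 0;
}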
IBM BladeCenter QS22 – PowerXCell 8i
Core electronics:
– Two 3.2 GHz PowerXCell 8i processors
– SP: 460 GFlops peak per blade
– DP: 217 GFlops peak per blade
– Up to 32GB DDR2 800MHz memory
– Standard blade form factor
– Supports the BladeCenter H chassis
Integrated features:
– Dual 1Gb Ethernet (BCM5704)
– Serial/console port, 4x USB on PCI
Optional features:
– Pair of 1GB DDR2 VLP DIMMs as I/O buffer (2GB total) (46C0501)
– 4x SDR InfiniBand adapter (32R1760)
– SAS expansion card (39Y9190)
– 8GB flash drive (43W3934)
(Block diagram: two PowerXCell 8i processors with DDR2 memory and Rambus FlexIO links to south bridges providing PCI-E, USB, GbE, flash drive, and optional IB connections to the BladeCenter H midplane.)
* The HSC interface is not enabled on the standard products. This interface can be enabled on "custom" system implementations for clients by working with the Cell services organization in IBM Industry Systems.

IBM to Build World's First Cell Broadband Engine™ Based Supercomputer
Revolutionary hybrid supercomputer at Los Alamos National Laboratory will harness Cell PowerXCell 8i and AMD Opteron™ technology.
– x86 Linux master nodes: 6,912 dual-core AMD Opteron™ chips, ~50 TFlop/s, 55 TB memory
– PowerXCell 8i accelerators: 12,960 PowerXCell 8i chips, ~1,330 TFlop/s, 52 TB memory
Goal: 1 PetaFlop double-precision floating point sustained
– 1.4 PetaFlop peak DP floating point (2.8 SP)
– 216 GB/s sustained file system I/O
– 107 TB of memory
– 296 server racks occupying around 5,000 square feet; 3.9 MW power
– Hybrid of AMD Opteron x64 processors (System x3755 servers) and Cell B.E. blade servers connected via a high-speed network
– Modular approach means the master cluster could be made up of any type of system, including Power or Intel
Roadrunner building blocks: Roadrunner tri-blade node, IBM BladeCenter H.

Roadrunner At a Glance
Cluster of 18 connected units (CUs):
– 6,912 AMD dual-core Opterons
– 12,960 IBM Cell eDP accelerators
– 49.8 Teraflops peak (Opteron)
– 1.33 Petaflops peak (Cell eDP)
– 1 PF sustained Linpack
– System-wide GigE network
InfiniBand 4x DDR fabric:
– 2-stage fat-tree; all-optical cables
– Full bi-section bandwidth within each CU: 384 GB/s (bi-directional)
– Half bi-section bandwidth among CUs: 3.45 TB/s (bi-directional)
– Non-disruptive expansion to 24 CUs
80 TB aggregate memory: 28 TB Opteron, 52 TB Cell
216 GB/s sustained file system I/O: 216x2 10G Ethernets to Panasas
Software: RHEL & Fedora Linux, SDK for Multicore Acceleration, xCAT cluster management
Power: 3.9 MW, 0.35 GF/Watt
Area: 296 racks, 5,500 ft²

IBM iRT – Interactive Ray-tracer
– Ray-traces images in interactive time
– GDC 07 configuration: seven QS20 blades, 3.2 TFLOPS; also shown at GDC March 2007
– PS3 head node leveraging seven QS20 blades over a Gigabit network, with SPE partitions across blades 0–6
– 3 million triangle urban landscape, >100 textures, 1080p HDTV output
– Composite image including reflection, transparency, detailed shadows, ambient occlusion, and 4x4 jittered multi-sampling, totaling up to 288 rays per pixel. The car model is comprised of 1.6 million polygons and the image resolution is 1080p hi-def.
(Chart: iRT blade performance scaling, frames/sec versus number of demo blades, 1 to 7.)

Cell Architecture

The new reality of microprocessors
Moore's law, the doubling of transistors on a chip every 18 months (top curve), continues unabated. Three other metrics impacting computer performance (bottom three curves) have leveled off since 2002: the maximum electric power a microprocessor can use, the clock speed, and performance (operations) per clock tick. "The biggest reason for the leveling off is the heat dissipation problem.