Introduction to Cell Broadband Engine Architecture
PRACE Petascale Computing Winter School, 10-13 February 2009, Athens - Greece
Gabriele Carteni, Operations Department, Barcelona Supercomputing Center

A brief history...
• The Cell Broadband Engine Architecture (CBEA) defines a processor structure directed toward distributed processing
• Jointly developed by Sony Computer Entertainment, Toshiba and IBM
• Architecture design and first implementation:
  - 4 years (2001-2005)
  - over 400 engineers from the STI Alliance
  - enhanced versions of the design tools used for the POWER4 processor
• SOI process shrunk from 90 nm to 65 nm (2007) and then to 45 nm (2008)
• In May 2008, IBM introduced the high-performance double-precision floating-point version of the Cell processor: the PowerXCell 8i

Towards a hardware accelerated concept
• Moore's law continues unabated.
• Three other metrics impacting computer performance, however, have levelled off:
  - the maximum electric power a microprocessor can use
  - the clock speed
  - performance (operations) per clock tick
• "The biggest reason for the leveling off is the heat dissipation problem. With transistors at 65-nanometer sizes, the heating rate would increase 8 times whenever the clock speed was doubled" - Ken Koch, Roadrunner project (Los Alamos National Laboratory, USA)
  source: "Roadrunner: Computing in the fast lane", 1663 Los Alamos Science and Technology Magazine, May 2008
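One way to read the "8 times" figure (an illustrative sketch, not part of the original slides): the dynamic power of CMOS logic grows with the switched capacitance, the supply voltage and the clock frequency, and the supply voltage historically had to rise roughly in proportion to the frequency, so

\[
P_{\text{dyn}} \approx \alpha\, C\, V^{2} f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3}, \qquad \frac{P_{\text{dyn}}(2f)}{P_{\text{dyn}}(f)} = 2^{3} = 8 .
\]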
Towards a hardware accelerated concept
• Specialized co-processors
  - augment a standard CPU with many (8-32) small, efficient hardware accelerators (SIMD or vector) with private memories, visible to applications
• Storage hierarchy
  - rewrite code to a highly parallel model with explicit on-chip vs. off-chip memory
• Applications coded for parallelism
  - stream data between cores on the same chip to reduce off-chip accesses

Cell BE Components
• PowerPC Processor Element (PPE)
  - used for operating systems and for the management and allocation of tasks for the SPEs in a system
  - 64-bit Power Architecture compliant core
  - compatibility with the Power Architecture provides a base for porting existing software to Cell
• Synergistic Processor Element (SPE)
  - shares the Cell system functions provided by the Power Architecture
  - less complex computational units
  - Single Instruction, Multiple Data (SIMD) capability
  - enables applications that require a higher computational unit density
  - cost-effective processing over a wide range of applications
• Bus subsystem: the Element Interconnect Bus (EIB)
  - provides (@ 3.2 GHz):
    ๏ aggregate main memory bandwidth: ~25.6 GB/s
    ๏ I/O bandwidth: 35 GB/s (inbound), 40 GB/s (outbound)
    ๏ a fair amount of bandwidth left over for moving data within the processor

Cell BE Components: a deeper view of the SPE
• two storage domains: main storage and local storage
• implements a new instruction-set architecture optimized for power and performance on compute-intensive applications
• operates on a local store memory (256 KB) that holds both instructions and data
• data and instructions are transferred between this local memory and system memory by asynchronous, coherent DMA commands, executed by the memory flow controller (MFC) included in each SPE
• the local store is a new level of memory hierarchy, beyond the registers that provide local storage of data in most processor architectures; it provides a mechanism to combat the "memory wall" limit
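A minimal SPE-side sketch of this DMA pattern is shown below, assuming the IBM Cell SDK environment (the spu_mfcio.h intrinsics and a PPE program that passes the effective address of a 4 KB, 128-byte-aligned buffer in argp); the buffer size, tag number and processing step are illustrative placeholders.

```c
/* SPE-side sketch: pull a buffer from main storage into the local store,
 * process it, and push the result back.  Illustrative only. */
#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA; multiple of 16, at most 16 KB */

/* Local-store buffer; 128-byte alignment gives best DMA performance */
volatile char local_buf[CHUNK] __attribute__((aligned(128)));

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
    unsigned int tag = 1;                 /* MFC tag group, 0..31 */

    /* Asynchronous DMA get: main storage (effective address in argp)
     * -> local store */
    mfc_get(local_buf, argp, CHUNK, tag, 0, 0);

    /* Wait until every transfer in this tag group has completed */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* ... operate on local_buf, typically with SIMD intrinsics ... */

    /* Asynchronous DMA put: local store -> main storage */
    mfc_put(local_buf, argp, CHUNK, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    return 0;
}
```

Real codes normally double-buffer these transfers (issuing the next mfc_get while computing on the current chunk) so that DMA latency overlaps with computation.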
Cell/B.E. from consumer products to HPC
[Diagram: Cell/B.E. products spanning the consumer, business, enterprise and HPC segments - the SCE PlayStation 3 (Cell/B.E. + GPU), the Toshiba SpursEngine (SPUs + host), Sony Cell/B.E. and PowerXCell 8i computing units and PCI cards (Cell/B.E. + host; Cell/B.E. + GPU + AV I/O), the Mercury 1U Dual Cell server (2 Cell/B.E. or PowerXCell 8i), Roadrunner (16,000 PowerXCell 8i, IBM blade servers + AMD) and the MariCel BSC prototype (144 PowerXCell 8i).]

The IBM PowerXCell™ 8i Processor
• CMOS SOI @ 65 nm
• 9 cores, 10 threads (2 threads on the PPE via simultaneous multithreading, SMT)
• 230.4 GFlops peak (SP) @ 3.2 GHz
• 108.8 GFlops peak (DP) @ 3.2 GHz
• up to 25 GB/s memory bandwidth
• up to 75 GB/s I/O bandwidth
• die size: 212 mm²
• maximum power dissipation (est.): 92 W
[Figure: PowerXCell 8i @ 3.2 GHz die photo]

The IBM BladeCenter® QS22 (announced on May 13, 2008)
• Core electronics
  - two 3.2 GHz PowerXCell 8i processors
  - SP: 460 GFlops peak per blade
  - DP: 217 GFlops peak per blade
  - up to 32 GB of DDR2 800 MHz memory
  - standard blade form factor
  - supports the BladeCenter H chassis
• Integrated features
  - dual 1Gb Ethernet (BCM5704)
  - serial/console port, 4x USB on PCI
  - 2x 1GB DDR2 VLP DIMMs as I/O buffer (optional)
  - 4x DDR InfiniBand adapter (optional)
  - SAS expansion card (optional)
[Block diagram: two PowerXCell 8i processors, each with its own DDR2 memory, connected through Rambus® FlexIO™ to two IBM south bridges providing flash/RTC/NVRAM, UART/SPI, USB 2.0, PCI-X/PCI, PCI-E x16/x8, an optional 2-port 4x IB HCA, and dual GbE links to the BladeCenter midplane.]
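The peak figures quoted for the processor and the blade are consistent with the usual accounting of one fused multiply-add per SIMD lane per cycle on each core (a worked sketch, not taken from the slides):

\[
\begin{aligned}
\text{SP peak} &= \underbrace{8 \times 4 \times 2 \times 3.2}_{\text{8 SPEs, 4 lanes, FMA}} + \underbrace{4 \times 2 \times 3.2}_{\text{PPE VMX}} = 204.8 + 25.6 = 230.4\ \text{GFlops},\\
\text{DP peak} &= \underbrace{8 \times 2 \times 2 \times 3.2}_{\text{8 SPEs, 2 lanes, FMA}} + \underbrace{1 \times 2 \times 3.2}_{\text{PPE FPU}} = 102.4 + 6.4 = 108.8\ \text{GFlops},
\end{aligned}
\]

and a QS22 blade carries two such processors, giving roughly 2 × 230.4 ≈ 460 GFlops SP and 2 × 108.8 ≈ 217 GFlops DP.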
MariCel: the PRACE Prototype @BSC

The PRACE Project
• 16 partners from 14 countries
• Establish the PRACE permanent Research Infrastructure as a single legal entity in 2010
• Perform the technical work to prepare operation of the Tier-0 systems in 2009/2010 (deployment and benchmarking of prototypes for petaflops systems, and porting, optimising and peta-scaling of applications)
• Six prototypes for petaflops systems:
  - IBM Blue Gene/P at FZJ
  - IBM POWER6 at SARA
  - Cray XT5 at CSC (joint proposal with ETHZ-CSCS)
  - IBM Cell/POWER6 at BSC
  - NEC SX9/x86 system at HLRS
  - Intel Nehalem/Xeon IB cluster at FZJ/GENCI-CEA
[Diagram: the European HPC ecosystem pyramid - Tier-0 European centres, Tier-1 national centres, Tier-2 regional/university centres]

MariCel: Specs and Performance
• Vendor/Integrator: IBM
• CPUs: IBM PowerXCell 8i @ 3.2 GHz + IBM POWER6 @ 4.0 GHz
• 6 BladeCenter H chassis
• 12 QS22 + 2 JS22 blades per BC-H (84 nodes, 1344 cores)
• 960 GB of memory (12 nodes with 32 GB, 72 nodes with 8 GB)
• InfiniBand 4x DDR direct connection network
• Peak performance: 15.6 Teraflops (LINPACK: 10 Teraflops)
• SAN: 870 GB for global file systems (GPFS)
• SAS external: 90 GB per node
• SAS internal: 146 GB per JS22 node
• Energy consumption: ~20 kW
• Cooling type: air cooling, front to back

MariCel: Hardware Overview
• 6 BladeCenter H chassis, each with 2 JS22 (management) and 12 QS22 (computing) blades
• Hypernode concept: 1 JS22 + 6 QS22 = 1 hypernode
• 3 IBM System Storage DS3200 (1 DS3200 per 2 BC-H)
• 2 IBM System p5 servers:
  - head node
  - virtual I/O server (login node and OS masters)
• 1 IBM System x3650 storage server (Intel Xeon quad-core @ 2.50 GHz)
• 1 Voltaire ISR2004 96-port 4x DDR InfiniBand switch
• 1 Force10 S50N 48-port Gigabit Ethernet switch
• 1 10/100/1000 standard Ethernet switch

MariCel: Operational Model
• The MariCel cluster can be characterised as a standard HPC cluster of thin shared-memory nodes
• Key differences from a standard homogeneous cluster:
  - the use of the Cell processor in the compute nodes, in which Cell-aware code running on the single PPU core is accelerated by the SPU cores (fine-grained hybrid); a minimal PPE-side offload sketch is given at the end of this section
  - the mixture of 2 different node types: the JS22s provide system services and the QS22s run compute tasks
• In practice the JS22 nodes can be considered part of the I/O subsystem, in that they export GPFS over NFS to the QS22s and do not provide any other services
• The original design of the MariCel cluster by IBM was based on exporting more system services from the JS22 to the QS22s within a hypernode (a project called Virtual PowerXCell Environment)

MariCel: The BladeCenter H
• The BC-H manages a set of 14 blade server nodes, providing power and connectivity via a backplane
• The BC-H is split into 2 halves for power supply
• The prototype has 6 such BladeCenters
• Specification:
  - 4 redundant power supplies of 2900 W each
  - management module with Ethernet
  - Nortel Gb Ethernet switch with 6 external ports
  - InfiniBand pass-through module with 14 external ports
  - 2 SAS switches
  - 9U height (4 units fit in a 42U rack)

MariCel: The InfiniBand Network
• A Voltaire ISR2004 96-port 4x DDR switch runs the InfiniBand network connecting the JS22 and QS22 blade servers and the GPFS file servers
• The network uses a switched-fabric (point-to-point serial link) topology.
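As referenced in the operational model above, MariCel's compute-node codes run on the PPU and offload work to the SPUs. A minimal PPE-side sketch of that offload step is shown below, assuming the libspe2 library from the IBM Cell SDK; the embedded SPE image symbol spe_kernel is a hypothetical name for an image produced with the toolchain's ppu-embedspu step, and error handling is reduced to a minimum.

```c
/* PPE-side sketch: load an SPE program and run it on one SPE. */
#include <stdio.h>
#include <stdlib.h>
#include <libspe2.h>

extern spe_program_handle_t spe_kernel;   /* hypothetical embedded SPE image */

int main(void)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx;

    /* Create a context: the schedulable unit that maps onto one SPE */
    ctx = spe_context_create(0, NULL);
    if (ctx == NULL) { perror("spe_context_create"); return EXIT_FAILURE; }

    /* Load the SPE executable image into the context */
    if (spe_program_load(ctx, &spe_kernel) != 0) {
        perror("spe_program_load");
        return EXIT_FAILURE;
    }

    /* Run it; argp/envp (here NULL) would normally carry the effective
     * address of a control block for the SPE's DMA transfers.  The call
     * blocks until the SPE program exits, so real codes issue it from
     * one POSIX thread per SPE. */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
        perror("spe_context_run");
        return EXIT_FAILURE;
    }

    spe_context_destroy(ctx);
    return EXIT_SUCCESS;
}
```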