Introduction to Cell Broadband Engine Architecture
PRACE Petascale Computing Winter School, Athens, Greece, 10-13 February 2009
Gabriele Carteni, Operations Department, Barcelona Supercomputing Center

A brief history...
• The Cell Broadband Engine Architecture (CBEA) defines a processor structure aimed at distributed processing.
• Jointly developed by Sony Computer Entertainment, Toshiba and IBM (the STI Alliance).
• Architecture design and first implementation:
  - 4 years (2001-2005)
  - over 400 engineers from the STI Alliance
  - enhanced versions of the design tools built for the POWER4 processor
• The SOI process shrank from 90 nm to 65 nm (2007) to 45 nm (2008).
• In May 2008, IBM introduced the high-performance double-precision floating-point version of the Cell processor: the PowerXCell 8i.

Towards a hardware-accelerated concept
• Moore's law continues unabated.
• Three other metrics impact computer performance:
  - the maximum electric power a microprocessor can use
  - the clock speed
  - performance (operations) per clock tick
• "The biggest reason for the leveling off is the heat dissipation problem. With transistors at 65-nanometer sizes, the heating rate would increase 8 times whenever the clock speed was doubled" - Ken Koch, Roadrunner project (Los Alamos National Laboratory, USA). Source: "Roadrunner: Computing in the fast lane", 1663 Los Alamos Science and Technology Magazine, May 2008. (The factor of 8 follows from dynamic power scaling roughly as C·V²·f: doubling f typically requires a proportional voltage increase, so power, and hence heat, grows roughly with the cube of the clock speed, 2³ = 8.)

Towards a hardware-accelerated concept (cont.)
• Specialized co-processors: augment a standard CPU with many (8-32) small, efficient hardware accelerators (SIMD or vector) with private memories, visible to applications.
• Storage hierarchy: rewrite code to a highly parallel model with explicit on-chip vs. off-chip memory.
• Applications coded for parallelism: stream data between cores on the same chip to reduce off-chip accesses (a portable sketch of this explicit-staging model follows below).
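To make the explicit on-chip vs. off-chip memory model concrete before turning to the Cell hardware itself, here is a minimal, portable C sketch. It is not Cell-specific, and the sizes, the name process_tile, and the doubling kernel are all illustrative assumptions: a large array in "far" memory is processed by staging fixed-size tiles through a small private buffer, the role the SPE local store plays on Cell.

    /* tile_staging.c - illustrative only: emulates the accelerator memory model
       on a standard CPU by explicitly staging tiles of a large "off-chip" array
       through a small "on-chip" buffer. All sizes and names are hypothetical. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define TILE 4096                     /* pretend local store: 16 KB of floats */

    static void process_tile(float *t, size_t n)
    {
        for (size_t i = 0; i < n; i++)    /* the compute kernel: double each value */
            t[i] *= 2.0f;
    }

    int main(void)
    {
        size_t n = 1 << 20;               /* 1M floats in "main memory" */
        float *far_mem = malloc(n * sizeof *far_mem);
        float local[TILE];                /* explicit private staging buffer */

        for (size_t i = 0; i < n; i++)
            far_mem[i] = (float)i;

        for (size_t off = 0; off < n; off += TILE) {
            size_t len = (n - off < TILE) ? n - off : TILE;
            memcpy(local, far_mem + off, len * sizeof(float));   /* "DMA in"  */
            process_tile(local, len);
            memcpy(far_mem + off, local, len * sizeof(float));   /* "DMA out" */
        }

        printf("far_mem[10] = %.1f\n", far_mem[10]);             /* expect 20.0 */
        free(far_mem);
        return 0;
    }

On Cell the two memcpy calls become asynchronous MFC DMA commands, which is what lets real code overlap transfer and computation (double buffering).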
Cell BE Components
• PowerPC Processor Element (PPE)
  - used for operating systems and for the management and allocation of tasks for the SPEs in a system (a PPE-side loader sketch appears after the QS22 slides below)
  - 64-bit Power Architecture-compliant core
  - compatibility with the Power Architecture provides a base for porting existing software to Cell
• Synergistic Processor Element (SPE)
  - shares the Cell system functions provided by the Power Architecture
  - less complex computational units
  - Single Instruction, Multiple Data (SIMD) capability
  - enables applications that require a higher computational-unit density
  - cost-effective processing over a wide range of applications
• Element Interconnect Bus (EIB)
  - provides (@3.2 GHz):
    ๏ aggregate main-memory bandwidth: ~25.6 GB/s
    ๏ I/O bandwidth: 35 GB/s (inbound), 40 GB/s (outbound)
    ๏ a fair amount of bandwidth left over for moving data within the processor

Cell BE Components: a deeper view of the SPE
• two storage domains: main storage and local storage
• implements a new instruction-set architecture optimized for power and performance on compute-intensive applications
• operates on a local store (256 KB) that holds both instructions and data
• data and instructions are transferred between this local memory and system memory by asynchronous coherent DMA commands, executed by the memory flow controller (MFC) included in each SPE (see the SPU-side sketch after the PowerXCell 8i slide below)
• the local store is a new level of memory hierarchy beyond the registers that provide local storage of data in most processor architectures; it provides a mechanism to combat the "memory wall"

Cell/B.E. from consumer products to HPC
• One architecture spans the consumer, business, enterprise and HPC segments:
  - Sony (SCE) PlayStation 3 (Cell/B.E. + GPU)
  - Toshiba SpursEngine (SPUs + host)
  - Sony Cell/B.E. Computing Unit (Cell/B.E. + GPU + AV I/O)
  - PowerXCell 8i PCI card (Cell/B.E. + host)
  - Mercury 1U Dual Cell (2 Cell/B.E. or PowerXCell 8i)
  - MariCel, the BSC prototype (144 PowerXCell 8i)
  - Roadrunner (16,000 PowerXCell 8i, IBM BladeServer + AMD)

The IBM PowerXCell™ 8i Processor
• CMOS SOI @ 65 nm
• 9 cores, 10 threads (2 threads on the PPE via simultaneous multithreading, SMT)
• 230.4 GFlops peak (SP) @ 3.2 GHz (8 SPEs × 8 flops/cycle × 3.2 GHz = 204.8 GFlops, plus 25.6 GFlops from the PPE's VMX unit)
• 108.8 GFlops peak (DP) @ 3.2 GHz (8 SPEs × 4 flops/cycle × 3.2 GHz = 102.4 GFlops, plus 6.4 GFlops from the PPE)
• up to 25 GB/s memory bandwidth
• up to 75 GB/s I/O bandwidth
• die size: 212 mm²
• maximum power dissipation (est.): 92 W
[Die photo of the PowerXCell 8i @ 3.2 GHz]
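The local-store/DMA discussion above maps directly to code. The following is a minimal SPU-side sketch, assuming the Cell SDK toolchain (spu-gcc) and its spu_mfcio.h and spu_intrinsics.h headers; the 16 KB buffer size and the convention that argp carries the effective address of an aligned float array are illustrative assumptions, not a fixed contract.

    /* spu_scale.c - minimal SPU-side sketch: DMA in, SIMD compute, DMA out.
       Hypothetical example; assumes argp holds the effective address of a
       16 KB, 128-byte-aligned float array prepared by the PPE. */
    #include <spu_mfcio.h>
    #include <spu_intrinsics.h>

    #define N 4096                        /* 4096 floats = 16 KB, the max single DMA */
    static float buf[N] __attribute__((aligned(128)));

    int main(unsigned long long speid,
             unsigned long long argp,
             unsigned long long envp)
    {
        unsigned int tag = 1, i;

        /* Pull the data from main memory into the 256 KB local store. */
        mfc_get(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();        /* block until the DMA completes */

        /* Process four floats per instruction with the SPU SIMD ISA. */
        vector float scale = spu_splats(2.0f);
        vector float *v = (vector float *)buf;
        for (i = 0; i < N / 4; i++)
            v[i] = spu_mul(v[i], scale);

        /* Push the results back to main memory and wait again. */
        mfc_put(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
        return 0;
    }

Because mfc_get/mfc_put are asynchronous, production code typically splits the buffer in two and double-buffers, overlapping the next transfer with the current computation.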
The IBM BladeCenter® QS22
• announced on May 13, 2008
• Core electronics:
  - two 3.2 GHz PowerXCell 8i processors
  - SP: 460 GFlops peak per blade
  - DP: 217 GFlops peak per blade
  - up to 32 GB of DDR2-800 memory
  - standard blade form factor
  - supports the BladeCenter H chassis
• Integrated features:
  - dual 1 Gb Ethernet (BCM5704)
  - serial/console port, 4x USB on PCI
  - 2x 1 GB DDR2 VLP DIMMs as I/O buffer (optional)
  - 4x DDR InfiniBand adapter (optional)
  - SAS expansion card (optional)
[Block diagram of the QS22: two PowerXCell 8i processors, each with its own bank of DDR2 DIMMs, linked over Rambus® FlexIO™ to two IBM South Bridges that provide flash/RTC/NVRAM, UART/SPI, legacy console, USB 2.0, PCI-X/PCI and PCI-E x16/x8 connectivity, the optional InfiniBand 4x HCA, and dual GbE plus USB links to the BladeCenter mid-plane]

MariCel: The PRACE Prototype @BSC

The PRACE Project
• 16 partners from 14 countries
• establish the PRACE permanent research infrastructure as a single legal entity in 2010
• perform the technical work to prepare operation of the Tier-0 systems in 2009/2010 (deployment and benchmarking of prototypes for petaflops systems, and porting, optimising and peta-scaling of applications)
• three-tier pyramid: Tier-0 European centres, Tier-1 national centres, Tier-2 regional/university centres
• six prototypes for petaflops systems:
  - IBM Blue Gene/P at FZJ
  - IBM POWER6 at SARA
  - Cray XT5 at CSC (joint proposal with ETHZ-CSCS)
  - IBM Cell/POWER6 at BSC
  - NEC SX9/x86 system at HLRS
  - Intel Nehalem/Xeon IB cluster at FZJ/GENCI-CEA
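Before turning to the MariCel configuration, here is the PPE-side counterpart to the SPU sketch above: a minimal loader using libspe2, the SPE runtime library from IBM's SDK for Multicore Acceleration. The program handle spu_scale is the hypothetical SPU binary from the earlier sketch (embedded with ppu-embedspu); one POSIX thread is spawned per SPE context because spe_context_run blocks.

    /* ppe_loader.c - minimal PPE-side sketch: run one SPU program per SPE.
       Illustrative; link with -lspe2 -lpthread and the embedded SPU object. */
    #include <libspe2.h>
    #include <pthread.h>
    #include <stdio.h>

    extern spe_program_handle_t spu_scale;    /* hypothetical embedded SPU binary */

    #define NUM_SPES 8
    #define N 4096
    static float data[NUM_SPES][N] __attribute__((aligned(128)));

    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        unsigned int entry = SPE_DEFAULT_ENTRY;

        spe_program_load(ctx, &spu_scale);
        /* argp carries the effective address of this SPE's data slice;
           it arrives on the SPU as the argp parameter of main(). */
        spe_context_run(ctx, &entry, 0, arg, NULL, NULL);
        spe_context_destroy(ctx);
        return NULL;
    }

    int main(void)
    {
        pthread_t thr[NUM_SPES];
        int i, j;

        for (i = 0; i < NUM_SPES; i++)        /* fill each SPE's slice */
            for (j = 0; j < N; j++)
                data[i][j] = (float)j;

        for (i = 0; i < NUM_SPES; i++)        /* one blocking run per thread */
            pthread_create(&thr[i], NULL, run_spe, data[i]);
        for (i = 0; i < NUM_SPES; i++)
            pthread_join(thr[i], NULL);

        printf("data[0][10] = %.1f\n", data[0][10]);   /* expect 20.0 */
        return 0;
    }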
MariCel: Specs and Performance
• vendor/integrator: IBM
• CPUs: IBM PowerXCell 8i @ 3.2 GHz + IBM POWER6 @ 4.0 GHz
• 6 BladeCenter H chassis
• 12 QS22 + 2 JS22 per BC-H (84 nodes, 1344 cores)
• 960 GB of memory (12 nodes with 32 GB, 72 nodes with 8 GB)
• InfiniBand 4x DDR direct-connection network
• peak performance: 15.6 TFlops (LINPACK: 10 TFlops), consistent with 72 QS22 blades × 217 DP GFlops each ≈ 15.6 TFlops
• SAN: 870 GB for global file systems (GPFS)
• external SAS: 90 GB per node
• internal SAS: 146 GB per JS22 node
• energy consumption: ~20 kW
• cooling: front-to-back air cooling

MariCel: Hardware Overview
• 6 BladeCenter H, each with 2 JS22 (management) and 12 QS22 (computing)
• hypernode concept: 1 JS22 + 6 QS22 = 1 hypernode
• 3 IBM System Storage DS3200 (1 DS3200 per 2 BC-H)
• 2 IBM System p5 servers:
  - head node
  - virtual I/O server (login node and OS masters)
• 1 System x3650 storage server (Intel Xeon quad-core @ 2.50 GHz)
• 1 Voltaire ISR2004 96-port 4x DDR InfiniBand switch
• 1 Force10 S50N 48-port Gigabit switch
• 1 10/100/1000 standard Ethernet switch

MariCel: Operational Model
• The MariCel cluster can be characterised as a standard HPC cluster of thin shared-memory nodes.
• Key differences from a standard homogeneous cluster:
  - the use of the Cell processor in the compute nodes, in which Cell-aware code running on the single PPU core is accelerated by the SPU cores (fine-grained hybrid)
  - the mixture of two different node types: the JS22s provide system services and the QS22s run compute tasks
• In practice the JS22 nodes can be considered part of the I/O subsystem, in that they export GPFS over NFS to the QS22s and provide no other services.
• IBM's original design for the MariCel cluster was based on exporting more system services from the JS22 to the QS22s in a hypernode (a project called Virtual PowerXCell Environment).

MariCel: The BladeCenter H
• The BC-H hosts a set of 14 blade-server nodes, providing power and connectivity via a backplane.
• The BC-H is split into two halves for power supply.
• The prototype has 6 such BladeCenters.
• Specification:
  - 4 redundant power supplies, 2900 W each
  - management module with Ethernet
  - Nortel Gb Ethernet switch with 6 external ports
  - InfiniBand pass-through module with 14 external ports
  - 2 SAS switches
  - 9U height (4 units in a 42U rack)

MariCel: The InfiniBand Network
• A Voltaire ISR2004 96-port 4x DDR switch runs the InfiniBand network connecting the JS22 and QS22 blade servers and the GPFS file servers.
• The network uses a switched-fabric (point-to-point serial link) topology.
