Introduction to Broadband Engine Architecture

PRACE Petascale Computing Winter School
Athens, Greece - 10-13 February 2009
Gabriele Carteni, Operations Department, Barcelona Supercomputing Center

A brief history...

• The Cell Broadband Engine Architecture (CBEA) defines a processor structure directed toward distributed processing
• Jointly developed by Sony Computer Entertainment, Toshiba and IBM (the STI Alliance)
• Architecture design and first implementation:
  - 4 years (2001-2005)
  - over 400 engineers from the STI Alliance
  - enhanced versions of the design tools developed for the POWER4 processor
• SOI process shrunk from 90 nm to 65 nm (2007) and to 45 nm (2008)
• In May 2008, IBM introduced the high-performance double-precision floating-point version of the Cell processor: the PowerXCell 8i

Towards a hardware-accelerated concept

• Moore’s law continues unabated.
• Three other metrics impact computer performance:
  - the maximum electric power a chip can use
  - the clock speed
  - performance (operations) per clock tick

• “The biggest reason for the leveling off is the heat dissipation problem. At 65-nanometer sizes, the heating rate would increase 8 times whenever the clock speed was doubled” - Ken Koch, Roadrunner project (Los Alamos National Laboratory, USA)

source: “Roadrunner: Computing in the fast lane”, 1663 - Los Alamos Science and Technology Magazine, May 2008

• Specialized co-processors
  - Extend a standard CPU with many (8-32) small, efficient hardware accelerators (SIMD or vector units) with private memories, visible to applications.

• Storage hierarchy
  - Rewrite code to a highly parallel model with explicit on-chip vs. off-chip memory.

• Applications coded for parallelism
  - Stream data between cores on the same chip to reduce off-chip accesses.

Cell BE Components

• PowerPC Processor Element (PPE)
  - used for operating systems and for the management and allocation of tasks for the SPEs in a system
  - 64-bit Power Architecture compliant core
  - compatibility with the Power Architecture provides a base for porting existing software to Cell

• Synergistic Processor Element (SPE)
  - shares Cell system functions provided by the Power Architecture
  - less complex computational units
  - Single Instruction, Multiple Data (SIMD) capability (see the sketch after this list)
  - enables applications that require a higher computational-unit density
  - cost-effective processing over a wide range of applications
• Bus Subsystem (EIB - Element Interconnect Bus)
  - provides (@3.2GHz):
    - aggregate main memory bandwidth: ~25.6GB/s
    - I/O bandwidth: 35GB/s (inbound), 40GB/s (outbound)
    - a fair amount of bandwidth left over for moving data within the processor
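To make the SIMD bullet above concrete, here is a minimal sketch of SPE vector code, assuming the SPU intrinsics header (spu_intrinsics.h) shipped with the IBM SDK and compilation with spu-gcc; the function name and calling convention are illustrative only:

    /* y[i] = a*x[i] + y[i], operating on 4 single-precision floats per instruction */
    #include <spu_intrinsics.h>

    void saxpy_vec(float a, vector float *x, vector float *y, int nvec)
    {
        int i;
        vector float va = spu_splats(a);        /* replicate the scalar into all 4 lanes */
        for (i = 0; i < nvec; i++)
            y[i] = spu_madd(va, x[i], y[i]);    /* fused multiply-add across 4 lanes */
    }

Each spu_madd performs 8 flops (4 multiplies + 4 adds) per cycle, which is where the per-SPE peak figures quoted later come from.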

Cell BE Components: an in-depth view of the SPE

• 2 storage domains: main storage and local storage
• implements a new instruction-set architecture optimized for power and performance on compute-intensive applications
• operates on a local store memory (256 KB) that holds both instructions and data
• data and instructions are transferred between this local memory and system memory by asynchronous, coherent DMA commands, executed by the Memory Flow Controller (MFC) included in each SPE (see the sketch below)
• the local store is a new level of memory hierarchy beyond the registers that provide local storage of data in most processor architectures; it provides a mechanism to combat the “memory wall” limit
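A minimal SPU-side sketch of this local-store/DMA model, assuming the MFC interface (spu_mfcio.h) from the IBM SDK; the buffer size, tag id and the use of argp as the effective address are illustrative assumptions:

    #include <spu_mfcio.h>

    #define CHUNK 16384   /* bytes; must fit (with code and stack) in the 256 KB local store */
    static char buf[CHUNK] __attribute__((aligned(128)));   /* 128-byte alignment for efficient DMA */

    int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
    {
        unsigned int tag = 3;             /* any DMA tag id in 0..31 */

        /* asynchronous DMA get: pull CHUNK bytes from main storage
           (effective address passed in argp by the PPU) into the local store */
        mfc_get(buf, argp, CHUNK, tag, 0, 0);

        /* wait until all transfers issued with this tag have completed */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        /* ... compute on buf entirely out of the local store,
           then mfc_put() the results back to main storage ... */
        return 0;
    }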

CELL/B.E. from consumer products to HPC

[Figure: the Cell/B.E. family, spanning Consumer, Business, Enterprise and HPC markets - Sony (SCE) PS3 (Cell/B.E. + GPU), Toshiba SpursEngine (SPUs + host), Cell/B.E. PCI cards (Cell/B.E. + host), Mercury 1U Dual Cell Computing Unit, blades with 2 Cell/B.E. or PowerXCell 8i, the MariCel BSC prototype (144 PowerXCell 8i), and Roadrunner (16,000 PowerXCell 8i, IBM blade servers + AMD)]

The IBM PowerXCell™ 8i Processor

• CMOS SOI @ 65 nm
• 9 cores, 10 threads (2 threads on the PPE via simultaneous multithreading, SMT)
• 230.4 GFlops peak (SP) @3.2GHz
• 108.8 GFlops peak (DP) @3.2GHz
• Up to 25 GB/s memory bandwidth
• Up to 75 GB/s I/O bandwidth
• Die size: 212 square mm
• Maximum power dissipation (est.): 92 W
[Figure: PowerXCell 8i @3.2GHz die photo]
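(A quick sanity check on the peak figures, assuming a fused multiply-add counts as 2 flops per lane per cycle: SP = 8 SPEs x 4 lanes x 2 flops x 3.2 GHz + 25.6 GFlops from the PPE's VMX unit = 204.8 + 25.6 = 230.4 GFlops; DP on the PowerXCell 8i = 8 SPEs x 2 lanes x 2 flops x 3.2 GHz + 6.4 GFlops from the PPE FPU = 102.4 + 6.4 = 108.8 GFlops.)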

The IBM BladeCenter® QS22

• Core Electronics (IBM BladeCenter® QS22, announced on May 13, 2008)
  - Two 3.2GHz PowerXCell 8i processors
  - SP: 460 GFlops peak per blade
  - DP: 217 GFlops peak per blade
  - Up to 32GB DDR2 800MHz
  - Standard blade form factor
  - Supports the BladeCenter H chassis

• Integrated features
  - Dual 1Gb Ethernet (BCM5704)
  - Serial/console port, 4x USB on PCI
  - 2x 1GB DDR2 VLP DIMMs as I/O buffer (optional)
  - 4x DDR InfiniBand adapter (optional)
  - SAS expansion card (optional)

[Figure: QS22 block diagram - two PowerXCell 8i processors, each with its own DDR2 memory channels, linked via Rambus® FlexIO™ to two IBM South Bridges (flash/RTC/NVRAM, UART/SPI, legacy connectors, PCI-X/PCI, USB 2.0, 1GbE to the BladeCenter mid-plane), plus an optional 2-port InfiniBand 4x HCA on PCI-E x16 and an optional flash drive]

MariCel - The PRACE Prototype @BSC


The PRACE Project

• 16 partners from 14 countries
• Establish the PRACE permanent Research Infrastructure as a single Legal Entity in 2010
• Perform the technical work to prepare operation of the Tier-0 systems in 2009/2010 (deployment and benchmarking of prototypes for Tier-0 Petaflops systems, and porting, optimising and peta-scaling of applications)
• Six prototypes for Petaflops systems:
  - IBM Blue Gene/P at FZJ
  - IBM POWER6 at SARA
  - Cray XT5 at CSC (joint proposal with ETHZ-CSCS)
  - IBM Cell/POWER6 at BSC
  - NEC SX9/x86 system at HLRS
  - Intel Nehalem/Xeon IB cluster at FZJ/GENCI-CEA

[Figure: the European HPC ecosystem pyramid - Tier-0 European Centres, Tier-1 National Centres, Tier-2 Regional/University Centres]

MariCel: Specs and Performance

• Vendor/Integrator: IBM
• CPUs: IBM PowerXCell 8i @3.2GHz + IBM POWER6 @4.0GHz
• 6 BladeCenter H chassis
• 12 QS22 + 2 JS22 per BC-H (84 nodes, 1344 cores)
• 960 GB of memory (12 nodes with 32GB, 72 nodes with 8GB)
• InfiniBand 4x DDR direct-connection network
• Peak performance: 15.6 Teraflops (LINPACK: 10 Teraflops)
• SAN: 870GB for global file systems (GPFS)
• SAS external: 90GB per node
• SAS internal: 146GB per JS22 node
• Energy consumption: ~20kW
• Cooling type: air cooling, front to back

MariCel: Hardware overview

• 6 BladeCenter H, each with 2 JS22 (management) and 12 QS22 (computing)
• Hypernode concept: 1 JS22 + 6 QS22 = 1 hypernode
• 3 IBM System Storage DS3200 (1 DS3200 per 2 BC-H)
• 2 IBM System p5 servers:
  - Head Node
  - Virtual I/O Server (Login Node and OS masters)
• 1 System x3650 storage server (Intel Xeon quad-core @2.50GHz)
• 1 Voltaire ISR2004 96-port 4x DDR InfiniBand switch
• 1 Force10 S50N 48-port Gigabit switch
• 1 10/100/1000 standard Ethernet switch

MariCel: Operational Model

• The MariCel cluster can be characterised as a standard HPC cluster of thin shared-memory nodes
• Key differences from a standard homogeneous cluster:
  - The use of the Cell processor in the compute nodes, in which Cell-aware code running on the single PPU core is accelerated by the SPU cores (fine-grain hybrid; a minimal PPU-side sketch follows below).
  - The mixture of 2 different node types: the JS22s provide system services and the QS22s run compute tasks.
• In practice the JS22 nodes can be considered part of the I/O subsystem, in that they export GPFS over NFS to the QS22s and do not provide any other services.
• The original design of the MariCel cluster by IBM was based on exporting more system services from the JS22 to the QS22s in a hypernode (a project called Virtual PowerXCell Environment).
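A minimal PPU-side sketch of this fine-grain hybrid model, assuming the libspe2 interface shipped with the IBM SDK; the embedded SPU program handle name (spu_kernel) is an illustrative assumption, and real code would normally run one such context per SPE in its own thread:

    #include <stdio.h>
    #include <libspe2.h>

    extern spe_program_handle_t spu_kernel;   /* SPU binary embedded into the PPU executable at link time */

    int main(void)
    {
        spe_context_ptr_t spe;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop_info;

        spe = spe_context_create(0, NULL);               /* create one SPE context */
        if (spe == NULL) { perror("spe_context_create"); return 1; }

        spe_program_load(spe, &spu_kernel);              /* load the SPU code into it */

        /* run the SPU program; this PPU thread blocks until the SPE stops.
           argp/envp (here NULL) would normally carry effective addresses of the data to process. */
        spe_context_run(spe, &entry, 0, NULL, NULL, &stop_info);

        spe_context_destroy(spe);
        return 0;
    }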

MariCel: The BladeCenter H

• The BC-H manages a set of 14 blade server nodes, providing power and connectivity via a backplane
• The BC-H is split into 2 halves for power supply
• The prototype has 6 such blade centers
• Specification:
  - 4 redundant power supplies, 2900W each
  - Management Module with Ethernet
  - Nortel Gb Ethernet switch with 6 external ports
  - InfiniBand pass-through module with 14 external ports
  - 2 SAS switches
  - 9U height (4 units in a 42U rack)


MariCel: The InfiniBand Network

• A Voltaire ISR2004 96-port 4x DDR switch runs the InfiniBand network connecting the JS22 and QS22 blade servers and the GPFS file servers
• The network uses a switched-fabric (point-to-point serial link) topology
• The InfiniBand network will be used for MPI and trialled with GPFS
• The InfiniBand network is cabled with copper, using 28 x 3m and 56 x 5m CX4 cables

MariCel: Software overview

• Compute nodes’ OS:
  - Red Hat Enterprise Linux 5.2 (kernel 2.6.18) on the JS22
  - Fedora 9 (kernel 2.6.25 with a BSC patch) on the QS22
• The Head Node, Login Node and I/O Node run RHEL 5.1
• The Virtual I/O Server runs AIX 5.3 on a management LPAR
• Yellow Dog Linux 6 and SUSE Linux Enterprise Server 11 beta will be evaluated
• IBM software called DIM (Diskless Image Management) is used to manage the cluster software configuration and maintenance
• DIM is the system management software used in the production environment of MareNostrum
• The IBM SDK for Multicore Acceleration Version 3.1 is installed on the QS22 blades
• OpenMPI 1.3 is supported
• Maui/SLURM batch scheduler for job submission
• 6 QS22 and 1 JS22 are configured as interactive nodes

MariCel: How to gain access

• SSH to the Login Node: ssh [email protected]

• From the Login Node you can...
  - go to an interactive node by typing:
    ssh b1 (for the JS22)
    ssh bx (x from 2..7, for the QS22s)
  - execute a batch job with Maui/SLURM:
    mnsubmit myjob.cmd (submit a job)
    mnq (view job state)
    mncancel (cancel a job)

MariCel: Job Submission Example

• mnsubmit myjob.cmd
• “mnsubmit” is a wrapper command which simplifies the batch execution of a job
• “myjob.cmd” is a bash script containing directives such as:

    initialdir = ./
    job_name = MYJOB
    class = benchmark
    output = out/imb_%j.out
    error = err/imb_%j.err
    wall_clock_limit = 1:00:00
    total_tasks = 4

• After the directive section, an mpirun command is used to launch the application. For example:

    mpirun $HOME/myapps/myjob

Questions and Answers

Special notices and copyrights: this presentation contains images and content protected by © Copyright IBM Corporation.
