IBM eServer pSeries™

From Blue Gene to Power.org
Moscow, JSCC Technical Day, November 30, 2005

Dr. Luigi Brochard, IBM Distinguished Engineer, Deep Computing Architect (luigi.brochard@fr.ibm.com)

Technology Trends

- As frequency increases are limited by power constraints, dual core is one way to:
  - double peak performance per chip (and per cycle)
  - but at the expense of frequency (around 20% lower)
- Another way is to increase the number of flops per cycle (a worked example follows below)
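As a reference for the peak figures quoted throughout the deck, a worked illustration (mine, not from the slides) of how peak performance is obtained:

```latex
\mathrm{Peak} \;=\; N_{\mathrm{cores}} \times \frac{\mathrm{flops}}{\mathrm{cycle \cdot core}} \times f
\qquad\text{e.g. } 2 \times 4 \times 1.9\,\mathrm{GHz} = 15.2\ \mathrm{GFlops\ per\ dual\text{-}core\ POWER5\ chip}
```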

IBM innovations

- POWER:
  - FMA in 1990 with POWER: 2 flops/cycle/chip (a small code sketch follows this list)
  - Double FMA in 1992 with POWER2: 4 flops/cycle/chip
  - Dual core in 2001 with POWER4: 8 flops/cycle/chip
  - Quad-core modules in October 2005 with POWER5: 16 flops/cycle/module
- PowerPC:
  - VMX in 2003 with PPC970FX: 8 flops/cycle/core, 32-bit only
  - Dual VMX + FMA with PPC970MP in 1Q06
- Blue Gene:
  - Low frequency, system-on-a-chip, tight integration of thousands of CPUs
- Cell:
  - 8 SIMD units and a PowerPC (PPE) core on a chip: 64 flops/cycle/chip
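A minimal C sketch (not from the original slides) of the kind of loop that compilers map onto the FMA units counted above; each iteration is one fused multiply-add, i.e. 2 flops:

```c
#include <stddef.h>

/* y[i] = a * x[i] + y[i]: one fused multiply-add per element.
 * On POWER/PowerPC, xlc or gcc typically compile the loop body to a
 * single fmadd instruction, so 2 flops are retired per FP instruction. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```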

Technology Trends

- As needs diversify, systems become heterogeneous and distributed
  - GRID technologies are essential to building standards-based cooperative environments

IBM innovations

- IBM is:
  - a sponsor of the Globus Alliance
  - contributing to the Globus Toolkit open source
  - a founding member of the Globus Consortium
- IBM is extending its products:
  - Global file systems: multi-platform and multi-cluster GPFS
  - Meta-schedulers: multi-platform and multi-cluster LoadLeveler
- IBM is collaborating on major GRID projects:
  - DTF in the US
  - DEISA in Europe

PowerPC: The Most Scalable Architecture

Chart: a single binary-compatible PowerPC architecture spanning three product families:
- Servers: POWER2, POWER3, POWER4, POWER5, POWER6
- Desktop and games: PPC 603e, 750, 750CXe, 750FX, 750GX, 970FX
- Embedded: PPC 401, 405GP, 440GP, 440GX
The PPC 970FX is shown bridging the server and desktop lines.


POWER5 Improves High Performance Computing

- Higher sustained-to-peak FLOPS ratio compared to POWER4
  - Increased rename resources (from 72 to 120) allow higher instruction-level parallelism
- Significant reduction in L3 and memory latency
- Improved memory bandwidth per FLOP
- Simultaneous Multi-Threading (SMT)
- Fast barrier synchronization operation
- Advanced data prefetch mechanism (see the bandwidth-bound kernel sketch below)
- Stronger SMP coupling and scalability
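A small illustration (mine, not from the slides) of the kind of bandwidth-bound kernel that the prefetch engine and the improved memory bandwidth per FLOP are aimed at; the STREAM triad pattern is used here as a generic example:

```c
#include <stddef.h>

/* STREAM-style triad: 2 flops per element but 24 bytes of memory traffic,
 * so sustained performance is set by memory bandwidth and by how well the
 * hardware prefetcher keeps the three streams (a, b, c) flowing. */
void triad(size_t n, double scalar,
           const double *b, const double *c, double *a)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}
```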


POWER5 p575 Overview

POWER5 IH node:

  Architecture          8 or 16 POWER5 processors, 1.9 or 1.5 GHz
  L3 cache              144 MB / 288 MB
  Memory                2 GB to 256 GB DDR1
  Packaging             2U rack chassis; 24" x 43"-deep rack, full drawer (24" rack); 12 nodes per rack
  DASD / bays           2 DASD (hot plug)
  I/O expansion         6 slots (PCI-X)
  Integrated SCSI       Ultra 320
  Integrated Ethernet   4 ports, 10/100/1000
  RIO drawers           Yes (1/2 or 1)
  LPAR                  Yes
  Switch                HPS and Myrinet
  OS                    AIX and Linux

12 servers per rack; 96/192 processors per rack.

p575 Sustained Performance

p5-575 benchmark publications (Sq IH / p5-575):

  8-way node (1.9 GHz)           1-way     8-way
  SPECint_2000                   1,456
  SPECfp_2000                    2,600
  SPECint_rate2000                         167
  SPECfp_rate2000                          282
  Linpack DP (GFlops)            1.776
  Linpack TPP n=1000 (GFlops)    5.872     34.5
  Linpack HPC (GFlops)           7.120     56.6
  STREAM standard (GB/s)                   41.5
  STREAM tuned (GB/s)                      55.7

  16-way node (1.5 GHz)          1-way     16-way
  SPECint_2000                   1,143
  SPECfp_2000                    2,185
  SPECint_rate2000                         238
  SPECfp_rate2000                          385
  Linpack HPC (GFlops)                     87.3
  STREAM standard (GB/s)                   42.6
  STREAM tuned (GB/s)                      55.8

p575 Peak Performance and Memory Bandwidth

  Processor type            Frequency   Peak perf. node / rack (GFlops)   Memory BW (Bytes/Flop)
  p575 single-core, DDR1    1.9 GHz     60 / 720                          1.6
  p575 dual-core, DDR1      1.5 GHz     90 / 1150                         1.1

(a peak-performance check follows below)
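A quick consistency check of these figures (my arithmetic, not on the slide), using 4 flops/cycle per POWER5 core:

```latex
8 \times 4 \times 1.9\,\mathrm{GHz} \approx 60.8\ \mathrm{GFlops/node},\quad 12 \times 60 = 720\ \mathrm{GFlops/rack};\qquad
16 \times 4 \times 1.5\,\mathrm{GHz} = 96\ \mathrm{GFlops/node},\quad 12 \times 96 \approx 1150\ \mathrm{GFlops/rack}
```

The dual-core node is quoted as 90 GFlops on the slide; 16 cores at 1.5 GHz give 96 GFlops peak, which is what makes the ~1150 GFlops rack figure add up.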

HPC Interconnect Taxonomy

Chart: link bandwidth versus timeframe (2005-2006, the POWER5 generation, versus 2007-2008):
- High-end interconnects: Federation/HPS (AIX) at 20 Gb and IB-4x (Galaxy-1) at 16 Gb, growing to 60 Gb and 120 Gb links in 2007-2008
- Commodity scale-out interconnects: 1G Ethernet (1 Gb), Myrinet 2000 (2 Gb), HPC Ethernet (4 Gb), Myrinet 10G (10 Gb), 10G Ethernet (10 Gb)

IBM HCA for POWER5 (Galaxy-1)

Block diagram: the Galaxy-1 host channel adapter attaches to the POWER5 processor bus through a GX interface and uses IB-compliant link and physical layers (LPE/HSS) to drive 4x/12x IB DDR ports. Two transports share the link through mux/steering logic:
- IB transport (QPs, CQs, EQs, memory regions, etc.), conforming to the HCA architecture
- IBM I/O transport, tunneling RIO-G remote I/O packets (load/store protocol) over IB


PowerPC 970

PowerPC 970 and POWER4 Comparison

Diagram and core comparison (a VMX code sketch follows below):

- POWER4+:
  - One or two processor cores per chip; SMP-optimized, balanced system/bus design
  - L2 and L3 caches; 12.8 GB/s memory bandwidth
- PowerPC 970:
  - Single processor core plus SIMD (VMX) units; 512 KB L2 cache, no L3
  - 6.4 GB/s memory bandwidth
- Core execution resources (both designs): 8 execution pipelines, 2 load/store units, 2 fixed-point units, 2 double-precision multiply-add (IEEE floating-point) units, a branch unit (1 branch resolution unit / 1 CR execution unit), a condition register unit, and 8 prefetching streams
- The PowerPC 970 adds 2 SIMD sub-units (VMX, 32-bit)
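A minimal sketch (my illustration, assuming the standard AltiVec/VMX C interface declared in <altivec.h>) of driving the 970's VMX unit with 4-wide single-precision multiply-adds:

```c
#include <altivec.h>   /* build with e.g. gcc -maltivec or xlc -qaltivec */

/* y[i] = a[i] * b[i] + y[i], 4 floats at a time on the VMX unit.
 * Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
void vmx_madd(int n, const float *a, const float *b, float *y)
{
    for (int i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, &a[i]);
        vector float vb = vec_ld(0, &b[i]);
        vector float vy = vec_ld(0, &y[i]);
        vec_st(vec_madd(va, vb, vy), 0, &y[i]);
    }
}
```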

BladeCenter Server Overview

- Enterprise-class shared infrastructure: shared power, cooling, cabling and switches mean reduced cost and improved availability
- Enables consolidation of many servers for an improved utilization curve
- High performance and density: 16 blades in a 7U chassis, i.e. 192 processors in a 42U rack
- Available in 2-way or 4-way Xeon (HS20, HS40), 2-way Opteron (LS20) or 2-way PowerPC (JS20)
- Scale-out platform for growth

BladeCenter JS20 Roadmap* (6/11/04, 10/29/04, 1H05, 2H05)

- GA1 (6/11/04): JS20 1.6 GHz
  - 2x (std.) 1.6 GHz PPC 970
  - 512 MB (std.) to 4 GB memory
  - 0-2 40 GB IDE drives
  - SLES 8, RHEL AS 3, Myrinet, CSM
- GA2 (10/29/04, 11/12/04): JS20 2.2 GHz
  - 2x (std.) 2.2 GHz PPC 970+
  - 512 MB (std.) to 4 GB memory
  - 0-2 40/60 GB IDE drives
  - SLES 8/9, RHEL AS 3, AIX 5L V5.2/5.3, Myrinet, CSM, USB DVD-ROM
- GA2 refresh (2Q05): InfiniBand, iSCSI, 2 GB DIMMs, BCT & NEBS support
- GA3: 1Q06, with GA4/5 to follow

* All statements regarding IBM future directions and intent are subject to change or withdrawal without notice.


Blue Gene

BlueGene/L Project Motivations

- Traditional processor design is hitting power/cost limits
- Complexity and power are major drivers of cost and reliability
- Integration, power and technology directions are driving toward multiple modest cores on a single chip rather than one high-performance processor
  - The optimal design point is very different from the standard approach based on high-end superscalar nodes
  - Watts/FLOP will not improve much with future technologies
- Applications do scale fairly well
  - Growing volume of parallel applications
  - Physics is mostly local

IBM approach with Blue Gene

- Use an embedded system-on-a-chip (SoC) design
  - Significant reduction in complexity: simplicity is critical, there is enough complexity already due to scale
  - Significant reduction in power: critical to achieving a dense and inexpensive packaging solution
  - Significant reduction in time to market, lower development cost and lower risk: much of the technology is already qualified
- Use the PowerPC architecture and a standard messaging interface (MPI)
  - Standard, familiar programming model and mature compiler support
- Integrated and tightly coupled networks
  - To reduce wiring complexity and sustain application performance on a large number of nodes
- Close attention to RAS (reliability, availability and serviceability) at all system levels

BlueGene/L System Packaging

Hierarchy, from chip to full system (a peak-performance check follows below):
- Chip (2 processors): 2.8/5.6 GFLOPS, 4 MB embedded DRAM
- Compute card (2 chips, 2x1x1): 11.2 GFLOPS, 2 x 0.5 GB DDR
- Node board (16 compute cards; 32 chips, 4x4x2): 180 GFLOPS, 16 GB DDR
- Cabinet (32 node boards, 8x8x16): 5.7 TFLOPS, 512 GB DDR
- System (64 cabinets, 64x32x32): 360 TFLOPS, 32 TB DDR
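A quick arithmetic check (mine, assuming the BG/L clock of 700 MHz and 4 flops/cycle per core) that the system peak follows from the per-chip figure:

```latex
2 \times 4 \times 0.7\,\mathrm{GHz} = 5.6\ \mathrm{GFlops/chip},\qquad
64 \times 32 \times 32 = 65{,}536\ \mathrm{chips} \times 5.6\ \mathrm{GFlops} \approx 367\ \mathrm{TFlops}
```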


BlueGene/L Compute System-on-a-Chip ASIC

Block diagram (recoverable details of the figure):
- Two PPC 440 CPU cores, each with 32 KB/32 KB L1 caches and a "double FPU" (2.8 GF per core, 5.6 GF per chip); the second core can serve as an I/O processor
- Small L2 prefetch buffers behind each core, connected over a 4:1 PLB
- 4 MB shared embedded DRAM used as L3 cache or shared memory, with L3 directory, multiported shared SRAM buffer, ECC and snoop support; 22 GB/s peak L3 bandwidth
- Network interfaces on chip: torus (6 links out and 6 in, each at 1.4 Gbit/s), collective (3 out and 3 in, each at 2.8 Gbit/s), 4 global barriers/interrupts, Gbit Ethernet, and JTAG access for control
- 144-bit-wide DDR controller with ECC driving 512 MB of external DDR at 5.5 GB/s

Blue Gene Networks

- 3-dimensional torus (32x32x64 connectivity)
  - Backbone for one-to-one and one-to-some communications
  - 1.4 Gb/s bidirectional bandwidth in all 6 directions (2.1 GB/s total per node)
  - ~100 ns hardware node latency
- Collective network (global operations)
  - 2.8 Gb/s per link, 68 TB/s aggregate bandwidth
  - Arithmetic operations implemented in the tree: integer and floating-point maximum/minimum, integer addition/subtraction, bitwise logical operations
  - Tree latency under 2.5 µs to the top, with additional latency to broadcast the result back down
  - Global sum over 64K nodes in less than 2.5 µs (to the top of the tree)
- Global barriers and interrupts
  - Low-latency barriers and interrupts
- Gbit Ethernet
  - File I/O and host interface, funneled via the collective (tree) network
- Control network
  - Boot, monitoring and diagnostics
(an MPI global-sum sketch follows below)
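A minimal MPI sketch (mine, not from the deck) of the global-sum pattern that BG/L offloads to the collective network; MPI_Allreduce over MPI_COMM_WORLD is the call that maps onto the tree:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each node contributes one value; the reduction is computed in the tree. */
    double local = (double)rank, global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```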

BG/L is a well-balanced system

  System    Memory BW   Memory latency   Network latency   Barrier, 128 CPUs   Byte/Flop   Memory latency   Network latency   Barrier
            (GB/s)      (ns)             (µs)              (µs)                            (cycles)         (cycles)          (cycles)
  BG/L      2.2         110              3.35              6.75                0.39        77               2345              4725
  Opteron   6           120              4.98              39 (InfiniBand)     0.34        264              17900             85800
  POWER5    41          120              4.8               29.5 (Federation)   0.67        228              8160              50150

(*) The POWER5 barrier is measured on an 8-way node; the Opteron figure is based on 4-core blades.
(a Byte/Flop check for the BG/L row follows below)
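For the BG/L row, the Byte/Flop ratio follows directly from the numbers above (my arithmetic, per 5.6 GFlops dual-core node):

```latex
\frac{2.2\ \mathrm{GB/s}}{5.6\ \mathrm{GFlops}} \approx 0.39\ \mathrm{Byte/Flop}
```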

BlueGene/L Software Hierarchical Organization

- Compute nodes are dedicated to running the user application, and almost nothing else: a simple Compute Node Kernel (CNK)
- I/O nodes run Linux and provide a more complete range of OS services: files, sockets, process launch, signaling, debugging and termination
- A service node performs system management services (e.g. heartbeating, error monitoring), transparent to the application software


Programming Models and Development Environment

- Familiar aspects
  - SPMD model: Fortran, C, C++ with MPI (MPI-1 plus a subset of MPI-2)
    - Full language support
    - Automatic SIMD FPU exploitation
  - Linux development environment
    - Users interact with the system through front-end nodes running Linux: compilation, job submission, debugging
    - The Compute Node Kernel provides the look and feel of a Linux environment: POSIX system calls (with restrictions)
  - Tools: support for debuggers (Etnus TotalView), MPI tracer, profiler, hardware performance monitors, visualizers (HPC Toolkit, Paraver, Kojak)
- Restrictions (which lead to significant scalability benefits)
  - Strictly space-shared: one parallel job (user) per machine partition, one process per processor of a compute node
  - Virtual memory constrained to the physical memory size
- Same environment as other IBM POWER systems
(a minimal SPMD skeleton follows below)
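A minimal SPMD skeleton (illustrative only; the block decomposition is a generic example, not from the slides) of the kind of code that runs unchanged under the Compute Node Kernel, one MPI process per processor:

```c
#include <mpi.h>
#include <stdio.h>

#define N 1048576  /* global problem size (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block decomposition: each process owns one contiguous slice of [0, N). */
    int chunk = N / size;
    int lo = rank * chunk;
    int hi = (rank == size - 1) ? N : lo + chunk;

    double local = 0.0;
    for (int i = lo; i < hi; i++)
        local += 1.0 / (1.0 + (double)i);   /* arbitrary per-element work */

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("result on %d processes: %f\n", size, total);

    MPI_Finalize();
    return 0;
}
```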


CPMD - Alessandro Curioni, Salomon Billeter, Wanda Andreoni

Developed at IBM Zurich and other universities from the Car-Parrinello method for molecular dynamics. Uses plane-wave basis functions, FFTs and MPI collectives.

Demo: Si/SiO2 interface, ~60% of peak (VNM). Ongoing project: IBM/LLNL, PdH hydrogen storage, ab-initio molecular dynamics with CPMD, ~1000 atoms, 140 Ry cutoff.

Chart: speedup versus number of processors (256 processors = 1), scaling through 512, 1024, 2048 and 4096 processors out to about 5000 processors.

10 s/step on 2048 BG/L processors versus 25 s/step on a 1400-processor Xeon cluster at LLNL.

Cell

Cell History

- IBM, SCEI/Sony and Toshiba alliance formed in 2000
- Design center opened in March 2001, based in Austin, Texas
- Single Cell BE operational in spring 2004
- 2-way SMP operational in summer 2004
- February 7, 2005: first technical disclosures
- November 9, 2005: open-source SDK published

Cell Highlights

- Supercomputer on a chip
- Multi-core (9 cores)
- 3.2 GHz clock frequency
- 10x performance for many applications
- Initial application: Sony Group PS3

Cell Broadband Engine – 235 mm²


Cell Processor: "Supercomputer-on-a-Chip"

- Power Processor Element (PPE):
  - General-purpose, 64-bit RISC processor (PowerPC AS 2.0), 2-way hardware multithreaded
  - L1: 32 KB I / 32 KB D; L2: 512 KB
  - Coherent load/store, VMX, 3.2 GHz
- Synergistic Processor Elements (SPEs):
  - 8 per chip, each with 128-bit-wide SIMD units, integer and floating-point capable
  - 256 KB local store per SPE
  - Up to 25.6 GF/s per SPE, ~200 GF/s total at 3.2 GHz
- Element Interconnect Bus:
  - Coherent ring structure, 96 B/cycle, 300+ GB/s total internal interconnect bandwidth
  - DMA control to/from the SPEs; supports >100 outstanding memory requests
- External interconnects:
  - 25.6 GB/s memory interface
  - 2 configurable I/O interfaces: a coherent interface (SMP, 20 GB/s) and a normal I/O interface (I/O and graphics, 5 GB/s); total bandwidth configurable between interfaces, up to 35 GB/s out and up to 25 GB/s in
- Memory management and mapping:
  - SPE local stores are aliased into PPE system memory
  - The MFC/MMU controls SPE DMA accesses; compatible with the PowerPC virtual memory architecture
  - Software-controllable from the PPE via MMIO; hardware or software TLB management
  - SPE DMA accesses are protected by the MFC/MMU
(a rough SPE-side code sketch follows below)
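A rough SPU-side sketch (mine, and only a sketch: it assumes the SPU intrinsics and MFC DMA macros from the Cell SDK headers spu_intrinsics.h and spu_mfcio.h) of the pattern described above: DMA a block from main memory into the local store, process it with the 128-bit SIMD units, then DMA the result back.

```c
#include <spu_intrinsics.h>
#include <spu_mfcio.h>

#define N 1024                       /* floats per block (illustrative) */
static vector float buf[N / 4] __attribute__((aligned(128)));

/* Scale a block of floats located at effective address 'ea' in main memory. */
void scale_block(unsigned long long ea, float s, unsigned int tag)
{
    /* DMA the block from system memory into the 256 KB local store. */
    mfc_get(buf, ea, sizeof(buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();       /* wait for the DMA to complete */

    /* Process 4 floats per instruction on the SPE's SIMD unit. */
    vector float vs = spu_splats(s);
    for (int i = 0; i < N / 4; i++)
        buf[i] = spu_mul(buf[i], vs);

    /* DMA the result back out to system memory. */
    mfc_put(buf, ea, sizeof(buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}
```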

Cell BE Processor Initial Application Areas

- Cell excels at processing rich media content in the context of broad connectivity:
  - Digital content creation (games and movies)
  - Game playing and game serving
  - Distribution of dynamic, media-rich content
  - Imaging and image processing
  - Image analysis (e.g. video surveillance)
  - Next-generation physics-based visualization
  - Video conferencing
  - Streaming applications (codecs etc.)
  - Physical simulation and science
- Cell is an excellent match for any application that requires:
  - Parallel processing
  - Real-time processing
  - Graphics content creation or rendering
  - Pattern matching
  - High-performance SIMD capabilities

Cell Alpha Software Development Environment

Distributed on the IBM alphaWorks and Barcelona Supercomputer Center sites since November 9th.

Stack diagram (programmer experience and end-user experience):
- Documentation, samples, workloads, demos
- Development environment: code development tools, debug tools, performance tools, miscellaneous tools
- Execution environment: SPE management library, SPE-optimized function libraries, system-level simulator
- Linux with Cell BE extensions
- Language extensions; standards: SPU ABI, Linux Reference ABI

Cell-Based Blade

Cell processor board: Cell processor (2-way SMP), memory, bridge, high-bandwidth system networks, storage, management I/O.

- First prototype "powered on"
- 16 TFlops in a rack (estimated), i.e. 1 PFlop in 64 racks
- Optimized for digital content creation, including:
  - Computer entertainment
  - Movies
  - Real-time rendering
  - Physics simulation ...

IBM common software solutions for POWER

Same front-end tools for POWER/PowerPC:
- XLF/C/C++, ESSL/pESSL, libmass: compilers and libraries
- Parallel Environment: MPI library and environment
- LoadLeveler: job scheduler within and across AIX/Linux clusters
  > introduction of new features for checkpoint/restart and job migration

Same system tools for all platforms:
- CSM: single management for AIX/Linux clusters
- GPFS: interoperable parallel file system within and across AIX/Linux clusters
  > possibility to license GPFS to any vendor or customer
  > for Linux x86/IA/ppc architectures
- TSM/HSM or HPSS: backup/archive/migrate solutions interfaced with GPFS

IBM integrated heterogeneous solutions

Diagram: a common software stack across heterogeneous server types:
- Common compilers, libraries and schedulers
- Server types: general-purpose servers, highly parallel servers, blade servers, Cell servers
- Common file system
- GRID and multi-cluster middleware


Questions?
