
Multicore and Massive Parallelism at IBM

Luigi Brochard, IBM Distinguished Engineer
IDRIS, February 12, 2009

Agenda

• Multicore and massive parallelism
• IBM multicore and massively parallel systems
• Programming models
• Toward a convergence?


Why Multicore?

[Charts: module heat flux (W/cm²) for bipolar vs. low-power CMOS, 1950-2030, annotated with the transistor, integrated circuit, multicore and 3DI eras; and number of transistors per chip, ~50% CAGR, 1980-2010, annotated with accelerators.]

• Power ~ Voltage² × Frequency ~ Frequency³
• We have a heat problem: power per chip is constant due to cooling, so no further frequency improvement
• But smaller lithography gives more transistors, and decreasing frequency by 20% yields a ~50% power saving
  – So twice as many transistors run at the same power consumption (see the worked check below)
• And simpler "CPUs" use fewer transistors, so more "CPUs" fit per chip
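A worked check of that last claim (added here; not on the original slide), assuming voltage scales roughly linearly with frequency so that power goes as the cube of frequency:

$$
P \propto V^2 f \propto f^3 \quad\Rightarrow\quad \frac{P(0.8\,f)}{P(f)} = 0.8^3 \approx 0.51,
$$

so a 20% frequency reduction costs roughly half the power, and two such cores (twice the transistors) fit in the original power envelope.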


Why Parallelism?

• Parallelism is always the most power-efficient solution
  – 2 processors at frequency F/2 vs. 1 processor at frequency F: they produce the same work (if there is NO overhead) but consume ¼ of the energy (see the worked example below)
• Two types of parallelism
  – Internal parallelism is limited by the number of cores on a chip and by performance per watt
  – External parallelism is limited by floor space and total power consumption
• The future is multicore massively parallel systems
  – How do we manage and program a million-core system/application?
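A worked check of the ¼ figure (added here; not on the original slide), using the same P ∝ f³ relation:

$$
\frac{P_{\text{2 cores at } f/2}}{P_{\text{1 core at } f}} = \frac{2\,(f/2)^3}{f^3} = \frac{1}{4},
$$

and since the two half-frequency cores deliver the same aggregate throughput when there is no overhead, the energy spent per unit of work is also one quarter.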


So far

• Multicore and massive parallelism have developed independently
  – Multicore:
    • POWER4: first dual-core homogeneous microprocessor, in 2001
    • Cell BE: first nine-core heterogeneous microprocessor, in 2004
  – Massive parallelism:
    • Long ago: CM-1, CM-2, Cosmic Cube, ... GF11
    • Recently: BG/L in 2004

Power Efficiency (MF/W) of Top 8 TOP500 Systems

[Bar chart: megaflops per watt for the top eight systems — IBM Roadrunner (LANL), Cray XT5 (ORNL), SGI (NASA Ames), BG/L (LLNL), BG/P (ANL), Sun (TACC), Cray XT4 (NERSC), Cray XT4 (ORNL). The number on each column is the Nov 2008 TOP500 rank; the data appear in the table below.]

Rank | Site       | Mfgr | System                        | MF/W | Relative (vs. #1)
1    | LANL       | IBM  | Roadrunner QS22/LS21          | 445  | 1.00
2    | ORNL       | Cray | Jaguar XT5 2.3 GHz QC Opteron | 152  | 2.92
3    | NASA Ames  | SGI  | QC 3.0 Xeon                   | 233  | 1.91
4    | LLNL       | IBM  | Blue Gene/L                   | 205  | 2.17
5    | ANL        | IBM  | Blue Gene/P                   | 357  | 1.25
6    | TACC       | Sun  | 2.3 GHz QC Opteron            | 217  | 2.05
7    | NERSC/LBNL | Cray | Jaguar XT4 2.3 GHz QC Opteron | 232  | 1.92
8    | ORNL      | Cray | XT4 2.1 GHz QC Opteron        | 130  | 3.43

POWER Processor Roadmap

• POWER4 / 4+ (2001, 180 / 130 nm)
  – Dual 1+ GHz cores, shared L2, distributed switch
  – Chip multiprocessing, dynamic LPARs (32)
• POWER5 / 5+ (2004, 130 / 90 nm)
  – Dual 1.5+ / 1.9 GHz cores, shared L2, distributed switch
  – Simultaneous multi-threading (SMT), enhanced distributed switch, enhanced core parallelism, improved FP performance, increased memory bandwidth, virtualization
• POWER6 (2007, 65 nm)
  – Dual ≥3.5 GHz cores, AltiVec, L2, advanced system features
  – Very high frequencies (4-5 GHz), enhanced virtualization, advanced memory subsystem, AltiVec / SIMD instructions, instruction retry, dynamic energy management, improved SMT, memory protection keys
• POWER7 / 7+ (2010, 45 / 32 nm)
  – Multi-core design, eDRAM / cache, advanced system features
  – Modular design, on-chip eDRAM, power-optimized core, enhanced SMT, enhanced reliability, adaptive power management, enhanced scaling

BINARY COMPATIBILITY

POWER6 Architecture Features

• Ultra-high-frequency dual-core chip
• 7-way superscalar, 2-way SMT core
• 5 instructions per thread
• 8 execution units: 2 LS, 2 FP, 2 FX, 1 BR, 1 VMX
• 790 M transistors, 341 mm² die
• Up to 128-core SMP systems
• 8 MB on-chip L2 – point of coherency
• On-chip L3 controller and memory controller
• Two memory controllers on chip

[Diagram: two P6 cores with AltiVec units, 4 MB L2 per core, L3 controllers, a fabric bus controller, a GX bus controller, and two memory controllers with Memory+ interfaces via a GX+ bridge.]


PERCS

PERCS – Productive, Easy-to-use, Reliable Computing System

High Productivity Computing Systems Overview

Goal: provide a new generation of economically viable high productivity computing systems.

Impact:
• Performance (time-to-solution): speedup by 10X to 40X
• Programmability (idea-to-first-solution): dramatically reduce cost & development time
• Portability (transparency): insulate software from the system
• Robustness (reliability): continue operating in the presence of localized hardware failure, contain the impact of software defects, & minimize likelihood of operator error

Applications:

Climate modeling, nuclear stockpile stewardship, weapons integration, weather prediction, ocean/wave forecasting, ship design

PERCS – Productive, Easy-to-use, Reliable Computing System – is IBM's response to DARPA's HPCS Program.

IBM Hardware Innovations

• Next-generation POWER processor with significant HPCS enhancements
  – Leveraged across IBM's server platforms
• Enhanced POWER Instruction Set Architecture
  – Significant extensions for HPCS
  – Leverages the existing POWER software eco-system
• Integrated high-speed network (very low latency, high bandwidth)
• Multiple hardware innovations to enhance programmer productivity
• Balanced system design to enable extreme scalability
• Significant innovation in system packaging, footprint, power and cooling

IBM Software Innovations

• Productivity focus: tools for all aspects of human/system interaction, providing a 10x improvement in productivity
• Operational efficiency: advanced capabilities for resource management and scheduling, including multi-cluster scheduling, checkpoint/restart, advance reservation, backfill scheduling
• Systems management: significantly reduced complexity of managing petascale systems
  – Simplify the install, upgrade, and bring-up of large systems
  – Non-intrusive and efficient monitoring of the critical system components
  – Management framework for the networks, storage, and resources
  – OS-level management: user ids, passwords, quotas, limits, ...
• Reliability, availability and serviceability: design for continuous operations


IBM PERCS Software Innovations

• Leverage proven software architecture and extend it to petascale systems
  – Robustness: design for continuous operation even in the event of multiple failures, with minimal to no degradation
  – Scaling: all the required software will scale to tens of thousands of nodes while significantly reducing the memory footprint
  – File system: IBM General Parallel File System (GPFS) will continue to drive toward unprecedented performance and scale
  – Application enablement: significant innovation in protocols (LAPI), programming models (MPI, OpenMP), new languages (X10), compilers (C, C++, FORTRAN, UPC), libraries (ESSL, PESSL) and tools for debugging (Rational, PTP) and performance tuning (HPCS toolkit)

Blue Waters

• National Science Foundation Track 1 Award

• Petascale computing environment for science and engineering

• Focus on sustained petascale computing
  – Weather modeling
  – Biophysics
  – Biochemistry
  – Computer science projects ...

• Technology: IBM POWER7 architecture

• Location: NCSA – National Center for Supercomputing Applications, University of Illinois

Roadrunner System Overview

Roadrunner is a hybrid Cell-accelerated petascale system delivered in 2008

• Connected Unit (CU) cluster: 180 compute nodes with Cells + 12 I/O nodes, attached to a 288-port IB 4x DDR switch
• 18 CU clusters: 3,456 Triblade nodes in total
  – 6,912 dual-core Opterons ⇒ 50 TF
  – 12,960 Cell eDP chips ⇒ 1.3 PF
• 12 links per CU to each of eight 2nd-stage 288-port IB 4X DDR switches


Roadrunner Organization

[Diagram: a master service node (x3755) manages the Connected Units (CU-1 ... CU-17), each with its own x3655 service node, 180 LS21 Opteron blades and 360 QS22 Cell blades.]

Roadrunner at a glance

• Cluster of 18 Connected Units
  – 12,960 IBM PowerXCell 8i accelerators
  – 6,912 AMD dual-core Opterons (compute)
  – 408 AMD dual-core Opterons (I/O)
  – 34 AMD dual-core Opterons (management)
  – 1.410 Petaflop/s peak (PowerXCell)
  – 46 Teraflop/s peak (Opteron compute)
• Linpack: 1.105 Petaflop/s sustained, 445 MF/Watt
• InfiniBand 4x DDR fabric
  – 2-stage fat-tree; all-optical cables
  – Full bisection bandwidth within each CU: 384 GB/s
  – Half bisection bandwidth among CUs: 3.4 TB/s
  – Non-disruptive expansion to 24 CUs
• 104 TB memory: 52 TB Opteron, 52 TB Cell eDP
• RHEL & Fedora Linux
• SDK for Multicore Acceleration
• xCAT cluster management; system-wide GigE network
• 2.48 MW power
• Area: 279 racks, 5200 ft²
• Weight: 500,000 lb
• IB cables: 55 miles
• 408 GB/s sustained file-system I/O: 204x2 10G Ethernets to Panasas


BladeCenter® QS22 – PowerXCell 8i

• Core electronics
  – Two 3.2 GHz PowerXCell 8i processors
  – SP: 460 GFlops peak per blade; DP: 217 GFlops peak per blade
  – Up to 32 GB DDR2 800 MHz
  – Standard blade form factor; supports the BladeCenter H chassis
• Integrated features
  – Dual 1 Gb Ethernet (BCM5704)
  – Serial/console port, 4x USB on PCI
  – Flash, RTC & NVRAM
• Optional
  – Pair of 1 GB DDR2 VLP DIMMs as I/O buffer, 2 GB total (46C0501)
  – 4x SDR InfiniBand adapter (32R1760)
  – SAS expansion card (39Y9190)
  – 8 GB Flash drive (43W3934)
• Note: the HSC interface is not enabled on the standard products; it can be enabled on "custom" system implementations for clients by working with the Cell services organization in IBM Industry Systems.

[Block diagram: two PowerXCell 8i processors with DDR2 DIMMs, Rambus FlexIO links to two IBM South Bridges, PCI-E x16/x8, HSDC and 1GbE connections to the BladeCenter midplane and high-speed fabric.]

Roadrunner Triblade node integrates Cell and Opteron blades

• QS22 is the latest-generation IBM Cell blade, containing two new enhanced double-precision (eDP / PowerXCell™) Cell chips
• An expansion blade connects the two QS22s via four PCI-E x8 links to the LS21 and provides the node's ConnectX IB 4X DDR cluster attachment
• LS21 is an IBM dual-socket Opteron blade
• 4-wide IBM BladeCenter packaging
• Roadrunner Triblades are completely diskless and run from RAM disks, with NFS & Panasas I/O only through the LS21
• Node design points:
  – One Cell chip per Opteron core
  – ~400 GF/s double precision & ~800 GF/s single precision
  – 16 GB Cell memory & 16 GB Opteron memory

[Block diagram: the two QS22s (each with two Cell eDP chips and an I/O hub; 2x PCI-E x16 unused) attach through dual PCI-E x8 flex-cables to HT2100 bridges on the expansion blade, which also carries the IB 4x DDR adapter; the LS21's two dual-core Opterons connect via HT x16 links.]

System Configuration

[Diagram: two Triblades attached to the IB 4x DDR fabric. In each Triblade, the LS21's two dual-core Opterons (HyperTransport-connected, each with its own memory) reach two QS22s through HT2100 bridges and PCIe; each QS22 holds two Cell eDP chips with local memory, linked through their AXON I/O chips.]

Headlights down the path: IBM Blue Gene

Blue Gene/P continues this leadership paradigm

BlueGene/P system packaging hierarchy:

• Chip: 4 processors, 13.6 GF/s, 8 MB eDRAM
• Compute card: 1 chip, 13.6 GF/s, 2 or 4 GB DDR
• Node card: 32 chips (4x4x2), 32 compute + 0-4 I/O cards, 435 GF/s, 64 or 128 GB
• Rack: 32 node cards, 14 TF/s, 2 or 4 TB
• System: up to 72 racks, cabled 8x8x16, 1 PF/s, 144 or 288 TB

(See www.top500.org and www.green500.org.)

BGP vs BGL

Node properties:
Property                    | BG/L                      | BG/P
Node processors             | 2x PowerPC 440            | 4x PowerPC 450
Processor frequency         | 0.7 GHz                   | 0.85 GHz (target)
Coherency                   | Software managed          | SMP
L1 cache (private)          | 32 KB/processor           | 32 KB/processor
L2 cache (private)          | 14-stream prefetching     | 14-stream prefetching
L3 cache size (shared)      | 4 MB                      | 8 MB
Main store / node           | 512 MB / 1 GB             | 2 GB
Main store bandwidth        | 5.6 GB/s (16B wide)       | 13.9 GB/s (2x16B wide)
Peak performance            | 5.6 GF/node               | 13.9 GF/node

Network properties:
Torus bandwidth                               | 6*2*175 MB/s = 2.1 GB/s                    | 6*2*435 MB/s = 5.2 GB/s
Torus hardware latency (nearest neighbor)     | 200 ns (32B packet), 1.6 us (256B packet)  | 160 ns (32B packet), 1.3 us (256B packet)
Torus hardware latency (worst case, 64 hops)  | 6.4 us                                     | 5.5 us
Collective bandwidth                          | 2*350 MB/s = 700 MB/s                      | 2*0.87 GB/s = 1.74 GB/s
Collective hardware latency (round trip, worst case) | 5.0 us                              | 4.5 us

System properties (72k nodes):
Peak performance            | 410 TF                    | 1 PF
Total power                 | 1.7 MW                    | 2.5-3.0 MW

Blue Gene/P Architectural Highlights

• Scaled performance through density and frequency bump
  – 2x performance over BlueGene/L through doubling the processors per node
  – 1.2x from frequency bump due to technology (target 850 MHz)
• Enhanced function over BlueGene/L
  – 4-way SMP
  – DMA, remote put/get, user-programmable memory prefetch
  – Greatly enhanced 64-bit performance counters (including the 450 core)
• Keep BlueGene/L packaging as much as possible
  – Improve networks through higher-speed signaling
  – Modest power increase through aggressive power management
• Higher signaling rate
  – 2.4x higher bandwidth, lower latency for torus and tree networks
  – 10x higher bandwidth for Ethernet I/O
• 72Ki nodes in 72 racks for 1.00 PF/s peak
  – Could be as much as 750 TF/s on Linpack
  – Amazing ASC application performance of around 250 TF/s

Sequoia Announcement

• On Feb 3, IBM announced a deal to sell a new supercomputer, Sequoia, to DOE
• 20 PFlops system in 2011
• 20x faster than Roadrunner
• 10 times less energy per calculation than Jaguar
• 18 cores per chip, interconnect on the chip, 1.6 million cores total


Programming Models And Languages Support for Petascale Computing


Different Approaches to Exploit Multi-Core Multi-Function Chips

Systems built around multi-core processor chips are driving the development of new techniques for automatic exploitation by applications

[Diagram: a spectrum of programming intrusiveness, from "no change to customer code" to "rewrite the program". Application styles along the axis: single-thread program, Java program, annotated program, explicit threads, parallel languages. Corresponding compiler innovations: traditional compiler, parallelizing compiler, JIT compiler, directives + compiler, parallel language compiler. Hardware innovations: speculative threads, assist threads, functional threads.]


[Diagram: programming models at each level of the machine.]

• Cluster level (multiple machine address spaces)
  – Multiple logical address spaces: MPI, Shmem
  – Single logical view: PGAS (CAF, X10, UPC), GSM (Fortran, C)
• Node level
  – BG (homogeneous cores): OpenMP, SM-MPI, hardware cache
  – Power/PERCS (homogeneous cores): OpenMP, SM-MPI, hardware cache
  – Roadrunner (heterogeneous cores): OpenMP, OpenCL, software cache

All statements regarding IBM future directions and intent are subject to change or withdrawal without notice and represent goals and objectives only. Any reliance on these Statements of General Direction is at the relying party's sole risk and will not create liability or obligation for IBM.

Two Standards

• Two standards evolving from different sides of the market

OpenMP (CPUs)        | OpenCL (GPUs, Cell SPEs)
MIMD                 | SIMD
Scalar code bases    | Scalar/vector code bases
Parallel for loop    | Data parallel
Shared memory model  | Distributed shared memory model


OpenMP program computing PI
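A minimal sketch of the kind of OpenMP PI program this slide refers to, integrating 4/(1+x²) over [0,1] with a parallel-for reduction; the step count and variable names are illustrative, not the slide's original code:

```c
#include <stdio.h>

int main(void)
{
    const long n = 100000000;            /* number of integration steps (illustrative) */
    const double step = 1.0 / (double)n;
    double sum = 0.0;

    /* Each thread accumulates a private partial sum; the reduction combines them. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * step;     /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi ~= %.15f\n", step * sum);
    return 0;
}
```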

CUDA: computing PI
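A minimal CUDA C sketch of the same midpoint-rule PI computation, using a grid-stride loop and a per-block shared-memory reduction; the kernel name, launch configuration and step count are illustrative, not the slide's original code:

```c
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void pi_kernel(long n, double step, double *block_sums)
{
    extern __shared__ double cache[];                 /* one slot per thread in the block */
    double local = 0.0;
    for (long i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (long)gridDim.x * blockDim.x)           /* grid-stride loop over intervals  */
    {
        double x = (i + 0.5) * step;
        local += 4.0 / (1.0 + x * x);
    }
    cache[threadIdx.x] = local;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {    /* tree reduction in shared memory  */
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_sums[blockIdx.x] = cache[0];
}

int main(void)
{
    const long n = 100000000;
    const double step = 1.0 / (double)n;
    const int threads = 256, blocks = 256;

    double *d_sums, h_sums[256];
    cudaMalloc((void **)&d_sums, blocks * sizeof(double));
    pi_kernel<<<blocks, threads, threads * sizeof(double)>>>(n, step, d_sums);
    cudaMemcpy(h_sums, d_sums, blocks * sizeof(double), cudaMemcpyDeviceToHost);

    double sum = 0.0;                                 /* final reduction on the host */
    for (int b = 0; b < blocks; b++) sum += h_sums[b];
    printf("pi ~= %.15f\n", step * sum);
    cudaFree(d_sums);
    return 0;
}
```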

OpenCL Memory Model

• Shared memory model with release consistency
• Multiple distinct address spaces
  – Address spaces can be collapsed depending on the device's memory subsystem
• Address qualifiers: __private, __local, __constant and __global
• Example: __global float4 *p;

[Diagram: each work-item has private memory; work-items within a compute unit share local memory; compute units access a global/constant memory data cache backed by global memory in compute-device memory.]
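To make the qualifiers concrete, here is a minimal OpenCL C kernel sketch (not from the original deck; the kernel name, arguments and scaling operation are illustrative assumptions):

```c
/* Hypothetical kernel illustrating the OpenCL address-space qualifiers listed above. */
__kernel void scale(__global float *data,        /* __global: device-wide memory        */
                    __local  float *tile,        /* __local: shared by one work-group   */
                    __constant float *factor)    /* __constant: read-only global data   */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    float x = data[gid];                         /* automatic variables are __private   */
    tile[lid] = x * factor[0];                   /* stage the scaled value locally      */
    barrier(CLK_LOCAL_MEM_FENCE);                /* make local writes visible           */

    data[gid] = tile[lid];                       /* write the result back to global     */
}
```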

OpenMP Memory Model

• Shared memory model with strong consistency
• Single, uniform address space
• Pragmas
  – V2.5: omp parallel
  – V3.0: omp tasks
• Example: #pragma omp parallel (see the sketch below)

[Diagram: the same device memory hierarchy as on the OpenCL slide, presented here to the program as a single shared address space.]
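A minimal sketch (not from the original deck) showing both pragma families named above, OpenMP 2.5 parallel regions and OpenMP 3.0 tasks; the loop count and messages are illustrative:

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel                     /* V2.5: start a team of threads         */
    {
        #pragma omp single                   /* one thread generates the tasks        */
        {
            for (int i = 0; i < 4; i++) {
                #pragma omp task firstprivate(i)   /* V3.0: each iteration is a task  */
                printf("task %d executed by thread %d\n", i, omp_get_thread_num());
            }
        }                                    /* implicit barrier waits for all tasks  */
    }
    return 0;
}
```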

What is Partitioned Global Address Space (PGAS)?

[Diagram: processes/threads vs. address spaces under three models. Shared memory (OpenMP): all threads share one address space. Message passing (MPI): each process has its own private address space. PGAS (UPC, CAF, X10): a global address space partitioned across the processes, each owning a local portion.]

• Computation is performed in multiple places.
• A place contains data that can be operated on remotely.
• Data lives in the place it was created, for its lifetime.
• A datum in one place may reference a datum in another place.
• Data structures (e.g. arrays) may be distributed across many places.
• Places may have different computational properties.
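A minimal UPC sketch (not from the original deck) of these ideas: a shared array distributed across places, with each loop iteration running in the place that owns the element it touches; the array size is illustrative and the printed comparison assumes at least two UPC threads:

```c
#include <upc.h>       /* UPC constructs used: shared, upc_forall, upc_barrier, MYTHREAD, THREADS */
#include <stdio.h>

#define N 4                                /* elements per thread (illustrative) */
shared int a[N * THREADS];                 /* distributed cyclically across all places */

int main(void)
{
    int i;
    /* The affinity expression &a[i] runs iteration i in the place that owns a[i]. */
    upc_forall(i = 0; i < N * THREADS; i++; &a[i])
        a[i] = MYTHREAD;                   /* record which place owns this element */

    upc_barrier;                           /* wait for every place to finish */

    if (MYTHREAD == 0)                     /* a[1] may be remote; PGAS allows direct access */
        printf("a[0]=%d  a[1]=%d\n", a[0], a[1]);
    return 0;
}
```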


Current Language Landscape

• Parallel programming models: MPI (1994), OpenMP (1997), PGAS (199x): CAF / UPC / Titanium
• Base languages: Fortran (1954), C/C++ (1973/1983), Java (1994), scripting and domain-specific languages

[Images: target hardware for these languages — a POWER4 die (FPU, FXU, ISU, IDU, IFU, LSU, BXU units, L2 and L3 directory/control), Cell BE, Niagara, clusters, and BlueGene/L.]

Technology Trends and Challenges: Programmer Productivity

[Chart: performance vs. time. Application scaling needs grow along the ideal curve, while actual single-core performance growth flattens, opening a widening GAP.]

Key problem: frequency improvements do not match application needs, putting an increasing burden on the application.
Design objective: provide tools to allow scientists to bridge the gap.


Next stop, exaflop?

www.lanl.gov/roadrunner www.lanl.gov/news www-03.ibm.com/press/us/en/pressrelease/24405.wss www.ibm.com/deepcomputing


IBM Ultra Scale Approaches

• Blue Gene – Maximize Flops Per Watt with Homogeneous Cores by reducing Single Thread Performance

• Power/PERCS – Maximize Productivity and Single Thread Performance with Homogeneous Cores

• Roadrunner – Use Heterogeneous Cores and an Accelerator Software Model to Maximize Flops Per Watt and keep High Single Thread Performance


HPC Cluster Directions

[Chart: system performance capability vs. time, 2007 through 2018-19. Capacity clusters (Linux clusters on Power and x86-64, less demanding communication) sit at the bottom; extended-configurability capability machines (Power AIX/Linux, plus accelerators) above them; targeted-configurability systems (BG/P, PERCS, Roadrunner with accelerators) reach the PF level, leading to BG/Q and accelerator-based systems on the path to ExaScale (ExaF).]