Robert W. Wisniewski
Chief Software Architect, Blue Gene Research
November 15, 2011

Codesign on MPICH

Collaboration with MPICH Team for BG/Q

° Began nearly 4 years ago for BG/Q

° Observed where the Q architecture was headed: 16 cores x 4-way SMT

° Challenges both intranode and internode (see the sketch after this list)
  – Intranode
    • Each node is 16 cores x 4-way SMT = 64 hardware threads
    • Supporting up to 64 MPI tasks per node
  – Internode
    • Sequoia: 98,304 nodes x 64 tasks/node = 6,291,456 (6.3M) MPI tasks

° Partnered with the MPICH team across the messaging stack – MPI, PAMI, and SPI
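To make the intranode/internode scale above concrete, here is a minimal C/MPI sketch (an illustration added here, not taken from the slides) that reports the total task count; on a full Sequoia-class run with 64 tasks per node it would print a count on the order of 6.3M.

/* Minimal sketch: report how many MPI tasks the job is running with.
 * Uses only standard MPI calls; nothing BG/Q-specific is assumed. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* ~6.3M on a full Sequoia run */

    if (rank == 0)
        printf("Running with %d MPI tasks\n", size);

    MPI_Finalize();
    return 0;
}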

Blue Gene/Q

1. Chip: 16+2 µP cores
2. Single Chip Module
3. Compute card: one chip module, 16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s

• Sustained single-node performance: 10x P, 20x L
• MF/Watt: 6x P, 10x L (~2 GF/W, Green 500 criteria)
• Software and hardware support for programming models that exploit node hardware concurrency (see the sketch below)
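As a sketch of the hybrid style the node-concurrency bullet above refers to (my illustration, not IBM's example), the following C program runs each MPI task with OpenMP threads spread across the node's 64 hardware threads (16 cores x 4-way SMT); compile with an MPI wrapper and OpenMP enabled, e.g. mpicc -fopenmp.

/* Hypothetical hybrid MPI+OpenMP sketch: fewer MPI tasks per node, with
 * OpenMP threads covering the remaining hardware threads. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* FUNNELED: the process is multithreaded, but only the main thread
     * makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        #pragma omp single
        printf("Rank %d running %d OpenMP threads on this node\n",
               rank, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}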

Supporting Different Programming/Threading Models on PAMI Can Have Familiar Programming Models for Exascale

[Layered software stack diagram]
• Applications
• Middleware: MPICH2 2.x (via its ADI), IBM MPI 2.x (via MPCI), CAF*, X10*, and UPC* runtimes (via the APGAS runtime), CHARM++*, GASNet*, GA/ARMCI*
• PAMI API
• System software: BG/Q, x86, and PERCS messaging implementations over the MU SPI / HAL API
• Hardware: BG/Q, x86, PERCS

*Not all runtimes are supported on all platforms
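For illustration only, here is a minimal, hypothetical C sketch of a runtime creating a client against the PAMI API, the common layer the runtimes above are built on. This is not IBM code; the call names and signatures follow the public pami.h interface as I recall it and may differ between PAMI releases.

/* Hypothetical PAMI client setup/teardown sketch (signatures assumed from
 * the public pami.h; verify against the installed headers). */
#include <stdio.h>
#include <pami.h>

int main(void)
{
    pami_client_t client;

    /* A runtime layered on PAMI typically begins by creating a named client. */
    pami_result_t rc = PAMI_Client_create("example", &client, NULL, 0);
    if (rc != PAMI_SUCCESS) {
        fprintf(stderr, "PAMI_Client_create failed (%d)\n", (int)rc);
        return 1;
    }

    /* ... contexts, endpoints, and data transfers would be set up here ... */

    PAMI_Client_destroy(&client);
    return 0;
}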

Blue Gene Software Stack Openness

[Software stack diagram spanning I/O and Compute Nodes and Service Node/Front End Nodes]
• Applications and tools: HPC Toolkit, XL Runtime, ESSL, Open Toolchain Runtime, PerfMon, BG Nav, CSM, ISV schedulers and debuggers, mpirun, Bridge API, LoadLeveler, MPI, Charm++, MPI-IO
• System software: DCMF (BG message stack), CNK kernel, CIOD, totalviewd, GPFS, High Level Control System (MMCS) with DB2; partitioning, job management and monitoring, RAS, administrator interface
• Firmware: Messaging SPIs, Node SPIs, Common Node Services and bootloader (hardware init, RAS, recovery, mailbox), Low Level Control System (power on/off, hardware probe, hardware init, parallel monitoring, parallel boot, diags, mailbox)
• Hardware: compute nodes, I/O nodes, node cards, link cards, service card, SN, FENs

Openness legend: new open source reference implementation licensed under CPL; new open source community under CPL license, with active IBM participation; existing open source communities under various licenses, to which BG code will be contributed and/or a new sub-community started; closed, no source provided, not buildable; closed, buildable source available.

Outcome

° Will release Blue Gene/Q software early next year with MPICH

° Messaging rates and latency are meeting goals (a minimal latency microbenchmark sketch follows this list)

° Expanding support across high-end IBM HPC

° The expected integration of Platform Computing technologies will augment, not replace, this capability; for example, we will continue to pursue MPICH as a high-end solution across IBM HPC
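Messaging latency of the kind referred to above is commonly measured with a ping-pong microbenchmark; the following minimal C/MPI sketch (my illustration, not an IBM benchmark) times round trips between ranks 0 and 1.

/* Hypothetical ping-pong latency sketch between ranks 0 and 1. */
#include <stdio.h>
#include <mpi.h>

#define ITERS 1000

int main(int argc, char **argv)
{
    int rank, i;
    char byte = 0;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    t1 = MPI_Wtime();
    if (rank == 0)
        printf("Average one-way latency: %.3f us\n",
               (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}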
