MPI RUNTIMES AT JSC, NOW AND IN THE FUTURE
Which, why and how do they compare on our systems?

08.07.2018 | MUG'18, COLUMBUS (OH) | DAMIAN ALVAREZ

Outline

• FZJ’ mission • JSC’s role • JSC’s vision for Exascale-era computing • JSC’s systems • MPI runtimes at JSC • MPI performance at JSC systems • MVAPICH at JSC

FZJ's mission

[Figure: Forschungszentrum Jülich institutes, including IKP, IBG, IEK, ICS, JSC, JCNS, ZEA and INM]

JSC's role
• Supercomputer operation for:
  • Centre – FZJ
  • Region – RWTH Aachen University
  • Germany – Gauss Centre for Supercomputing, John von Neumann Institute for Computing
  • Europe – PRACE, EU projects
• Application support
  • Unique support & research environment at JSC
  • Peer review support and coordination
• R&D work
  • Methods and algorithms, computational science, performance analysis and tools
  • Scientific Big Data Analytics with HPC
  • Computer architectures, Co-Design, Exascale Labs together with IBM, Intel, NVIDIA
• Education and training

[Diagram: JSC organisation — Communities, Research Groups, Simulation Labs, PADC, Cross-Sectional Teams, Data Life Cycle Labs, Exascale co-Design, Facilities]

[Chart omitted. Source: Jack Dongarra, 30 Years of Supercomputing: History of TOP500, February 2018]


JSC will host the first Exascale system in the EU around 2022

JSC's vision



• Extreme Scale Computing
• Big Data Analytics
• Deep Learning
• Interactivity


[Diagram: Modular Supercomputing Architecture — Cluster, GPU-accelerated, Many-core Booster, Memory (NAM) and Data Analytics modules (Modules 1–5), plus a Graphics Booster (Module 6) and Storage (Module 0)]

General Purpose Cluster:
• IBM Power 4+ JUMP (2004), 9 TFlop/s
• IBM Power 6 JUMP, 9 TFlop/s
• JUROPA, 200 TFlop/s + HPC-FF, 100 TFlop/s
• JURECA Cluster (2015), 2.2 PFlop/s
• JUWELS Cluster Module (2018), 12 PFlop/s

Highly scalable:
• IBM Blue Gene/L JUBL, 45 TFlop/s
• IBM Blue Gene/P JUGENE, 1 PFlop/s
• IBM Blue Gene/Q JUQUEEN (2012), 5.9 PFlop/s
• JURECA Booster (2017), 5 PFlop/s (Modular Pilot System)
• JUWELS Scalable Module (2019/20), 50+ PFlop/s (Modular Supercomputer)

JSC's systems: JURECA

▪ Dual-socket Intel Haswell (E5-2680 v3), 12 cores/socket, 2.5 GHz
▪ ≥ 128 GB main memory
▪ 1,884 compute nodes (45,216 cores)
▪ 75 nodes with 2× NVIDIA K80 GPUs
▪ 12 nodes with 2× NVIDIA K40 GPUs and 512 GB main memory
▪ Peak performance: 2.2 PFlop/s (1.7 PFlop/s without GPUs)
▪ Mellanox InfiniBand EDR
▪ Connected to the GPFS file system on JUST: ~15 PByte online disk and 100 PByte offline tape capacity

JSC's systems: JURECA Booster

▪ Intel Knights Landing (Xeon Phi 7250-F), 68 cores, 1.4 GHz
▪ 16 + 64 GB main memory
▪ 1,640 compute nodes (111,520 cores)
▪ Peak performance: ~5 PFlop/s
▪ Intel Omni-Path
▪ Connected to GPFS
▪ June 2018 (combined with the JURECA Cluster): #14 in Europe, #38 worldwide, #66 in the Green500

JSC's systems: JUWELS

▪ Dual-socket Intel Skylake (Xeon Platinum 8168), 24 cores/socket, 2.7 GHz
▪ ≥ 96 GB main memory
▪ 2,559 compute nodes (122,448 cores)
▪ 48 nodes with 4× NVIDIA V100 GPUs and 192 GB main memory
▪ 4 nodes with 1× NVIDIA P100 GPU and 768 GB main memory
▪ Peak performance: 12 PFlop/s (10.4 PFlop/s without GPUs)
▪ Mellanox InfiniBand EDR
▪ Connected to the GPFS file system on JUST
▪ June 2018: #7 in Europe, #23 worldwide, #29 in the Green500

MPI runtimes

• ParaStationMPI
  • Preferred MPI runtime
  • Developed by a consortium including JSC, ParTec, KIT and the University of Wuppertal
  • Based on MPICH
  • Intel + GCC compilers


• ParaStationMPI
  • For the JURECA Cluster-Booster system, the ParaStation MPI Gateway Protocol bridges between Mellanox EDR and Intel Omni-Path
  • In general, the ParaStation MPI Gateway Protocol can connect any two low-level networks supported by pscom
  • Implemented using the psgw plugin to pscom, working together with instances of psgwd, the ParaStation MPI Gateway daemon
  • Transparent to applications (see the sketch below)
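To make the transparency concrete: the sketch below is not from the slides, just a minimal illustration assuming only standard MPI and the usual mpicc wrapper. It is an ordinary MPI program; if its ranks are spread across Cluster and Booster nodes, messages that cross the module boundary are routed through the gateway daemons without any change to the application code.

/* Minimal sketch: a plain MPI program that can span JURECA Cluster and
 * Booster nodes. No gateway-specific code is needed; ParaStation MPI
 * routes inter-module messages through the psgwd gateway daemons
 * transparently. Compile with mpicc. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* A message between a Cluster rank and a Booster rank crosses the
     * EDR <-> Omni-Path boundary via the gateway, but the MPI code is
     * identical to the intra-module case. */
    int send_token = rank, recv_token = -1;
    MPI_Sendrecv(&send_token, 1, MPI_INT, (rank + 1) % size, 0,
                 &recv_token, 1, MPI_INT, (rank + size - 1) % size, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("Rank %d of %d on %s received token %d\n", rank, size, host, recv_token);
    MPI_Finalize();
    return 0;
}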


• IntelMPI
  • Backup for ParaStation MPI
  • Software stack mirrored between these two MPI runtimes
  • Intel compiler

• MVAPICH2-GDR
  • Primarily for GPU nodes (but can also run on non-GPU nodes)
  • PGI + GCC compilers (and Intel in the past)

MPI performance

• Microbenchmarks: Intel MPI Benchmarks
  • PingPong
  • Broadcast, gather, scatter and allgather

• ParaStationMPI, Intel MPI 2018.2, Intel MPI 2019 beta, MVAPICH2 and OpenMPI

• JURECA, Booster and JUWELS

• 1 MPI process per node, average times, no cache disabling (a sketch of the ping-pong pattern is shown below)
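For readers unfamiliar with the benchmark, the following is a hand-written sketch of the PingPong pattern being measured, not the Intel MPI Benchmarks source; the repetition count, warm-up handling and message size are simplified assumptions.

/* Hand-written ping-pong sketch (not the IMB code): rank 0 and rank 1
 * bounce a message back and forth and report the average one-way latency. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 1000;      /* repetitions, simplified (no warm-up phase) */
    const int msg_size = 1;     /* message size in bytes */
    char *buf = malloc(msg_size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg one-way latency: %.3f us\n", (t1 - t0) / (2.0 * reps) * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

The collective benchmarks follow the same idea, timing MPI_Bcast, MPI_Scatter, MPI_Gather and MPI_Allgather instead of the send/receive pair.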


Runtime          Jureca                                            Booster                     Juwels
ParaStationMPI   Verbs                                             PSM2                        UCX
Intel 2018       Default (DAPL+Verbs)                              Default (PSM2)              Default (DAPL+Verbs)
Intel 2019       Default (libfabric+Verbs)                         Default (libfabric+PSM2)    Default (libfabric+Verbs)
MVAPICH2-GDR     Verbs                                             -                           -
MVAPICH2         Verbs                                             PSM2                        Verbs
OpenMPI          Default (UCX, Verbs and libfabric+Verbs enabled)  -                           -
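With this many runtime/fabric combinations it is easy to lose track of which library a given benchmark binary actually linked against. The helper below is a small sketch using only standard MPI-3 calls; the chosen transport itself is reported through runtime-specific environment variables or debug output and is not queried here.

/* Print which MPI library a benchmark run was actually linked against.
 * MPI_Get_library_version is standard since MPI-3.0. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_library_version(version, &len);
    if (rank == 0)
        printf("MPI library: %s\n", version);
    MPI_Finalize();
    return 0;
}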

MPI performance (PingPong)
[Plots: Jureca, Booster and Juwels]


MPI performance (Broadcast, msg size 1 byte)
[Plots: Jureca, Booster and Juwels]

MPI performance (Broadcast, msg size 4 MB)
[Plots: Jureca, Booster and Juwels]

MPI performance (Scatter, msg size 1 byte)
[Plots: Jureca, Booster and Juwels]

MPI performance (Scatter, msg size 2 MB)
[Plots: Jureca, Booster and Juwels]

MPI performance (Gather, msg size 1 byte)
[Plots: Jureca, Booster and Juwels]

MPI performance (Gather, msg size 2 MB)
[Plots: Jureca, Booster and Juwels]

MPI performance

PEPC performance
• N-body code
• Asynchronous communication based on a communicating thread spawned using pthreads (see the sketch below)
• 2 algorithms tested:
  • Single-particle processing with pthreads
  • Tile processing using OpenMP tasks
• Experiments on JURECA using 16 nodes, 4 MPI ranks per node and 12 compute threads per MPI rank (SMT enabled)
• Violin plots show the average time per timestep and its distribution
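The communicating-thread design requires full multi-threading support from the MPI runtime. The sketch below shows the general pattern only, under the assumption that MPI_THREAD_MULTIPLE is available; the progress loop and the work decomposition are placeholders, not PEPC's actual implementation.

/* Sketch of the PEPC-style pattern: a dedicated pthread drives MPI
 * communication while the main thread (and its worker threads) compute.
 * Requires MPI_THREAD_MULTIPLE. Simplified illustration only: a real code
 * would use proper synchronization instead of a volatile flag. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static volatile int done = 0;

static void *comm_thread(void *arg)
{
    /* Placeholder progress loop: in the real code this thread exchanges
     * tree/particle data asynchronously with other ranks. */
    while (!done) {
        int flag;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag,
                   MPI_STATUS_IGNORE);
        /* ... post receives / sends here ... */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not available\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    pthread_t t;
    pthread_create(&t, NULL, comm_thread, NULL);

    /* ... computational work (single-particle or tile processing) ... */

    done = 1;
    pthread_join(t, NULL);
    MPI_Finalize();
    return 0;
}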


• PEPC performance with single-particle processing (plot: Dirk Brömmel)


• PEPC performance with tile processing (plot: Dirk Brömmel)

MVAPICH at JSC

• Generally it's been a positive experience. But (as the guy in charge of installing all software on our systems)...
  • The fact that MVAPICH2-GDR is a binary-only package poses some annoyances:
    • Need to ask the MVAPICH2 team for binaries for particular compiler versions
    • No icc/icpc/ifort version
    • Compiler wrappers need patching (the xFLAGS used during compilation leak into them)

• The Deep Learning community is pushing for OpenMPI as the CUDA-aware MPI of choice on our systems (what CUDA-awareness means in practice is sketched below)
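For context, "CUDA-aware" means that device pointers can be passed directly to MPI calls. The following is a minimal sketch in C using the CUDA runtime API, assuming an MPI build with CUDA support (such as MVAPICH2-GDR or a CUDA-aware OpenMPI) and at least two ranks, each with access to a GPU; error checking is omitted.

/* Minimal CUDA-aware MPI sketch: GPU buffers are handed straight to
 * MPI_Send/MPI_Recv, letting the runtime move the data (via GPUDirect
 * where possible). Compile with mpicc and link against libcudart. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const int n = 1 << 20;          /* 1M floats */
    int rank;
    float *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   /* device pointer */
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}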

THANK YOU!