Overview of the MVAPICH Project: Latest Status and Future Roadmap

MVAPICH2 User Group (MUG) Meeting

by

Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~panda Parallel Programming Models Overview

P1 P2 P3 P1 P2 P3 P1 P2 P3

Logical Memory Memory Memory Shared Memory Memory Memory Memory

Shared Memory Model Model Partitioned Global Address Space (PGAS) SHMEM, DSM MPI (Message Passing Interface) , UPC, UPC++, Chapel, , CAF, … • Programming models provide abstract machine models • Models can be mapped on different types of systems – e.g. Distributed Shared Memory (DSM), MPI within a node, etc. • PGAS models and Hybrid MPI+PGAS models are gradually receiving importance

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 2 Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

Application Kernels/Applications

Middleware Co-Design Opportunities Programming Models and Challenges MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenMP, across Various OpenACC, , Hadoop (MapReduce), Spark (RDD, DAG), etc. Layers

Communication or Runtime for Programming Models Performance

Point-to-point Collective Energy- Synchronization I/O and Fault Communication Communication Awareness and Locks File Systems Tolerance Resilience

Networking Technologies Multi-/Many-core Accelerators (InfiniBand, 40/100GigE, Architectures (GPU and MIC) Aries, and Omni-Path)

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 3 Designing (MPI+X) at Exascale • Scalability for million to billion processors – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided) – Scalable job start-up – Low memory footprint • Scalable Collective communication – Offload – Non-blocking – Topology-aware • Balancing intra-node and inter-node communication for next generation nodes (128-1024 cores) – Multiple end-points per node • Support for efficient multi-threading • Integrated Support for Accelerators (GPGPUs and FPGAs) • Fault-tolerance/resiliency • QoS support for communication and I/O • Support for Hybrid MPI+PGAS programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI+UPC++, CAF, …) • Virtualization • Energy-Awareness

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 4 Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE) – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.1), Started in 2001, First version available in 2002 – MVAPICH2-X (MPI + PGAS), Available since 2011 – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014 – Support for Virtualization (MVAPICH2-Virt), Available since 2015 – Support for Energy-Awareness (MVAPICH2-EA), Available since 2015 – Support for InfiniBand Network Analysis and Monitoring (OSU INAM) since 2015 – Used by more than 2,925 organizations in 86 countries – More than 485,000 (> 0.48 million) downloads from the OSU site directly – Empowering many TOP500 clusters (Jul ‘18 ranking) • 2nd ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China • 12th, 556,104 cores (Oakforest-PACS) in Japan • 15th, 367,024 cores (Stampede2) at TACC • 24th, 241,108-core (Pleiades) at NASA and many others – Available with software stacks of many vendors and Distros (RedHat and SuSE) – http://mvapich.cse.ohio-state.edu • Empowering Top500 systems for over a decade

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 5 MVAPICH Project Timeline

OSU-INAM

MVAPICH2-EA

MVAPICH2-Virt

MVAPICH2-MIC

MVAPICH2-GDR

MVAPICH2-X

MVAPICH2

OMB

MVAPICH EOL 04 02 04 10 12 15 14 15 15 15 ------Jan - Oct Jan - Nov Nov Sep Jul - Apr Aug Timeline Aug Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 6 MVAPICH2 Release Timeline and Downloads 600000

500000 MIC 2.0 - GDR 2.0b - MV2 MV2 1.9 400000 2.3b 1.8 MV2 X - MV2 2.3 1.7 Virt 2.2 MV2 GDR 2.3a - MV2 MV2 1.6 300000 MV2 OSU INAM 0.9.3 MV2 1.5 MV2 1.4 1.0.3 1.1 MV2 Number of Downloads Number 1.0 1.0 MV2 200000 MV MV2 MV MV2 MV2 0.9.8 MV 0.9.4 MV2 0.9.0

100000

0 Jan-05 Jan-06 Jan-07 Jan-08 Jan-09 Jan-10 Jan-11 Jan-12 Jan-13 Jan-14 Jan-15 Jan-16 Jan-17 Jan-18 Sep-04 Sep-05 Sep-06 Sep-07 Sep-08 Sep-09 Sep-10 Sep-11 Sep-12 Sep-13 Sep-14 Sep-15 Sep-16 Sep-17 May-05 May-06 May-07 May-08 May-09 May-10 May-11 May-12 May-13 May-14 May-15 May-16 May-17 May-18 Timeline

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 7 Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models

Message Passing Interface PGAS Hybrid --- MPI + X (MPI) (UPC, OpenSHMEM, CAF, UPC++) (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime Diverse and Mechanisms

Point-to- Remote Collectives Energy- I/O and Fault Active Introspectio point Job Startup Algorithms Awareness Tolerance Messages n & Analysis Primitives Access File Systems

Support for Modern Networking Technology Support for Modern Multi-/Many-core Architectures (InfiniBand, iWARP, RoCE, Omni-Path) (Intel-Xeon, OpenPOWER, Xeon-Phi (MIC, KNL), NVIDIA GPGPU) Transport Protocols Modern Features Transport Mechanisms Modern Features SR- Multi Shared RC XRC UD DC UMR ODP CMA IVSHMEM XPMEM* NVLink* CAPI* IOV Rail Memory

* Upcoming

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 8 Strong Procedure for Design, Development and Release

• Research is done for exploring new designs • Designs are first presented to conference/journal publications • Best performing designs are incorporated into the codebase • Rigorous Q&A procedure before making a release – Exhaustive unit testing – Various test procedures on diverse range of platforms and interconnects – Test 19 different benchmarks and applications including, but not limited to • OMB, IMB, MPICH Test Suite, Intel Test Suite, NAS, ScalaPak, and SPEC – Spend about 18,000 core hours per commit – Performance regression and tuning – Applications-based evaluation – Evaluation on large-scale systems • All versions (alpha, beta, RC1 and RC2) go through the above testing

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 9 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS MVAPICH2-X with IB, Omni-Path, and RoCE MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 10 MVAPICH2 2.3-GA • Released on 07/23/2018 • Major Features and Enhancements – Based on MPICH v3.2.1 – Enhancements to mapping strategies and automatic architecture/network – Introduce basic support for executing MPI jobs in Singularity detection – Improve performance for MPI-3 RMA operations • Improve performance of architecture detection on high core-count systems – Enhancements for Job Startup • Enhanced architecture detection for OpenPOWER, Intel Skylake and Cavium ARM (ThunderX) systems • Improved job startup time for OFA-IB-CH3, PSM-CH3, and PSM2-CH3 • New environment variable MV2_THREADS_BINDING_POLICY for multi-threaded MPI and • On-demand connection management for PSM-CH3 and PSM2-CH3 channels MPI+OpenMP applications • Enhance PSM-CH3 and PSM2-CH3 job startup to use non-blocking PMI calls • Support 'spread', 'bunch', 'scatter', 'linear' and 'compact' placement of threads • Introduce capability to run MPI jobs across multiple InfiniBand subnets – Warn user if oversubscription of core is detected – Enhancements to point-to-point operations • Enhance MV2_SHOW_CPU_BINDING to enable display of CPU bindings on all nodes • Enhance performance of point-to-point operations for CH3-Gen2 (InfiniBand), CH3-PSM, and • Added support for MV2_SHOW_CPU_BINDING to display number of OMP threads CH3-PSM2 (Omni-Path) channels • Added logic to detect heterogeneous CPU/HFI configurations in PSM-CH3 and PSM2-CH3 • Improve performance for Intra- and Inter-node communication for OpenPOWER architecture channels • Enhanced tuning for OpenPOWER, Intel Skylake and Cavium ARM (ThunderX) systems – Thanks to Matias Cabral@Intel for the report • Improve performance for host-based transfers when CUDA is enabled • Enhanced HFI selection logic for systems with multiple Omni-Path HFIs • Improve support for large processes per node and hugepages on SMP systems • Introduce run time parameter MV2_SHOW_HCA_BINDING to show process to HCA – Enhancements to collective operations bindings • Enhanced performance for Allreduce, Reduce_scatter_block, Allgather, Allgatherv – Miscellaneous enhancements and improved debugging and tools support – Thanks to Danielle Sikich and Adam Moody @ LLNL for the patch • Enhance support for MPI_T PVARs and CVARs • Add support for non-blocking Allreduce using Mellanox SHARP • Enhance debugging support for PSM-CH3 and PSM2-CH3 channels – Enhance tuning framework for Allreduce using SHArP • Update to hwloc version 1.11.9 • Enhanced collective tuning for IBM POWER8, IBM POWER9, Intel Skylake, Intel KNL, Intel • Tested with CLANG v5.0.0 Broadwell

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 11 Highlights of MVAPICH2 2.3-GA Release

• Support for highly-efficient inter-node and intra-node communication • Scalable Start-up • Optimized Collectives using SHArP and Multi-Leaders • Support for OpenPOWER and ARM architectures • Performance Engineering with MPI-T • Application Scalability and Best Practices

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 12 One-way Latency: MPI over IB with MVAPICH2

Small Message Latency Large Message Latency 2 120 TrueScale-QDR 1.8 1.19 100 ConnectX-3-FDR 1.6 1.15 ConnectIB-DualFDR 1.4 80 ConnectX-5-EDR 1.2 Omni-Path 1 60 0.8 1.11 40 Latency (us)

Latency (us) 1.04 0.6 0.98 0.4 20 0.2 0 0

Message Size (bytes) Message Size (bytes)

TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-5-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB Switch Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 13 Bandwidth: MPI over IB with MVAPICH2 Unidirectional Bandwidth 14000 Bidirectional Bandwidth 25000 12,590 24,136 TrueScale-QDR 12000 ConnectX-3-FDR 12,366 20000 10000 12,358 ConnectIB-DualFDR 22,564 ConnectX-5-EDR 21,983 8000 15000 12,161 6,356 Omni-Path 6000 10000 3,373 4000 6,228

Bandwidth (MBytes/sec) 5000 2000 Bandwidth (MBytes/sec)

0 0 4 16 64 256 1024 4K 16K 64K 256K 1M 4 16 64 256 1024 4K 16K 64K 256K 1M

Message Size (bytes) Message Size (bytes)

TrueScale-QDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch ConnectIB-Dual FDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with IB switch ConnectX-5-EDR - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 IB switch Omni-Path - 3.1 GHz Deca-core (Haswell) Intel PCI Gen3 with Omni-Path switch

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 14 Towards High Performance and Scalable Startup at Exascale • Near-constant MPI and OpenSHMEM a b c On-demand initialization time at any process count P M d a Connection • 10x and 30x improvement in startup time b PMIX_Ring of MPI and OpenSHMEM respectively at

P PGAS – State of the art e c PMIX_Ibarrier 16,384 processes • Memory consumption reduced for remote M MPI – State of the art d PMIX_Iallgather

Endpoint Information endpoint information by O(processes per O e Shmem based PMI

Memory Required to Store Store to Memory Required PGAS/MPI – Optimized O node)

Job Startup Performance • 1GB Memory saved per node with 1M processes and 16 processes per node

a On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI. S. Chakraborty, H. Subramoni, J. Perkins, A. A. Awan, and D K Panda, 20th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS ’15) b PMI Extensions for Scalable MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, J. Perkins, M. Arnold, and D K Panda, Proceedings of the 21st European MPI Users' Group Meeting (EuroMPI/Asia ’14) c d Non-blocking PMI Extensions for Fast MPI Startup. S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D K Panda, 15th IEEE/ACM International Symposium on Cluster, Cloud and (CCGrid ’15) e SHMEMPMI – Shared Memory based PMI for Improved Performance and Scalability. S. Chakraborty, H. Subramoni, J. Perkins, and D K Panda, 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid ’16) Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 15 Startup Performance on KNL + Omni-Path

MPI_Init & Hello World - Oakforest-PACS 25 Hello World (MVAPICH2-2.3a) 21s 20 MPI_Init (MVAPICH2-2.3a) 15

10

5 5.8s 22s (Seconds)Time Taken 0 64 1K 2K 4K 8K 128 256 512 16K 32K 64K Number of Processes

• MPI_Init takes 22 seconds on 229,376 processes on 3,584 KNL nodes (Stampede2 – Full scale) • 8.8 times faster than Intel MPI at 128K processes (Courtesy: TACC) • At 64K processes, MPI_Init and Hello World takes 5.8s and 21s respectively (Oakforest-PACS) • All numbers reported with 64 processes per node New designs available since MVAPICH2-2.3a and as patch for SLURM-15.08.8 and SLURM-16.05.1

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 16 Advanced Allreduce Collective Designs Using SHArP 0.35 Avg DDOT Allreduce time of HPCG 0.3 MVAPICH2 12% SHArP Support (blocking and 0.25 non-blocking) is available 0.2 since MVAPICH2 2.3a 0.15 0.1

Latency (seconds) Latency 0.05 0 (4,28) (8,28) (16,28) M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, (Number of Nodes, PPN) Scalable Reduction Collectives with Data 0.2 Mesh Refinement Time of MiniAMR Partitioning-based Multi-Leader Design, MVAPICH2 Supercomputing '17. 0.15 13%

0.1

0.05 Latency (seconds) Latency

0 (4,28) (8,28) (16,28) (Number of Nodes, PPN) Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 17 Intra-node Point-to-Point Performance on OpenPOWER

Intra-Socket Small Message Latency Intra-Socket Large Message Latency 1.2 80 MVAPICH2-2.3 MVAPICH2-2.3 70 1 SpectrumMPI-10.1.0.2 60 SpectrumMPI-10.1.0.2 0.8 OpenMPI-3.0.0 50 OpenMPI-3.0.0 0.6 40 Latency (us) Latency Latency (us) Latency 30 0.4 20 0.2 0.30us 10 0 0 0 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M Intra-Socket Bandwidth Intra-Socket Bi-directional Bandwidth 35000 70000 MVAPICH2-2.3 MVAPICH2-2.3 30000 60000 SpectrumMPI-10.1.0.2 SpectrumMPI-10.1.0.2 25000 50000 OpenMPI-3.0.0 OpenMPI-3.0.0 20000 40000 15000 30000 10000 20000 Bandwidth (MB/s) Bandwidth Bandwidth (MB/s) Bandwidth 5000 10000 0 0 1 8 64 512 4K 32K 256K 2M 1 8 64 512 4K 32K 256K 2M Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 18 Inter-node Point-to-Point Performance on OpenPOWER

Small Message Latency Large Message Latency 3.5 200 MVAPICH2-2.3 MVAPICH2-2.3 3 SpectrumMPI-10.1.0.2 150 SpectrumMPI-10.1.0.2 2.5 OpenMPI-3.0.0 OpenMPI-3.0.0 2 100 1.5 Latency (us) Latency Latency (us) Latency

1 50 0.5 0 0 0 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M Bandwidth Bi-directional Bandwidth 14000 30000 MVAPICH2-2.3 MVAPICH2-2.3 12000 25000 SpectrumMPI-10.1.0.2 SpectrumMPI-10.1.0.2 10000 OpenMPI-3.0.0 20000 OpenMPI-3.0.0 8000 15000 6000 10000 4000 Bandwidth (MB/s) Bandwidth Bandwidth (MB/s) Bandwidth 2000 5000 0 0 1 8 64 512 4K 32K 256K 2M 1 8 64 512 4K 32K 256K 2M Platform: Two nodes of OpenPOWER (Power8-ppc64le) CPU using Mellanox EDR (MT4115) HCA

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 19 Intra-node Point-to-point Performance on ARM Cortex-A72

Small Message Latency Large Message Latency 1.2 700 MVAPICH2-2.3 MVAPICH2-2.3 1 600 500 0.8 400 0.6 0.27 micro-second 300 Latency (us) Latency Latency (us) Latency 0.4 (1 bytes) 200 0.2 100

0 0 0 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M Bandwidth Bi-directional Bandwidth 10000 16000 MVAPICH2-.3 14000 MVAPICH2-2.3 8000 12000 6000 10000 8000 4000 6000 Bandwidth (MB/s) Bandwidth 4000 2000 2000 Bidirectional Bandwidth Bidirectional 0 0 1 2 4 8 16 32 64 1K 2K 4K 8K 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M 1M 2M 4M 128 256 512 32K 64K 128K 256K 512K Platform: ARM Cortex A72 (aarch64) MIPS processor with 64 cores dual-socket CPU. Each socket contains 32 cores.

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 20 Performance Engineering Applications using MVAPICH2 and TAU ● Enhance existing support for MPI_T in MVAPICH2 to expose a richer set of performance and control variables ● Get and display MPI Performance Variables (PVARs) made available by the runtime in TAU ● Control the runtime’s behavior via MPI Control Variables (CVARs) ● Introduced support for new MPI_T based CVARs to MVAPICH2 ○ MPIR_CVAR_MAX_INLINE_MSG_SZ, MPIR_CVAR_VBUF_POOL_SIZE, MPIR_CVAR_VBUF_SECONDARY_POOL_SIZE ● TAU enhanced with support for setting MPI_T CVARs in a non-interactive mode for uninstrumented applications ● S. Ramesh, A. Maheo, S. Shende, A. Malony, H. Subramoni, and D. K. Panda, MPI Performance Engineering with the MPI Tool Interface: the Integration of MVAPICH and TAU, EuroMPI/USA ‘17, Best Paper Finalist ● More details in Sameer Shende’s talk today and poster presentations VBUF usage without CVAR based tuning as displayed by ParaProf VBUF usage with CVAR based tuning as displayed by ParaProf

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 21 Performance of SPEC MPI 2007 Benchmarks (KNL + Omni-Path)

400 6% Intel MPI 18.0.0 350 1% MVAPICH2 2.3 300

250 448 processes 200 10% on 7 KNL nodes of TACC Stampede2 6% 2% 150 (64 ppn) Execution Time in (s) 100 4% 2% 50

0 milc leslie3d pop2 lammps wrf2 tera_tf lu

MVAPICH2 outperforms Intel MPI by up to 10%

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 22 Performance of SPEC MPI 2007 Benchmarks (Skylake + Omni-Path)

120 0% 1% Intel MPI 18.0.0 100 MVAPICH2 2.3 4% 80 480 processes on 10 Skylake nodes 60 -4% of TACC Stampede2 38% (48 ppn) 40 Execution Time in (s) 2% -3% 20 0%

0 milc leslie3d pop2 lammps wrf2 GaP tera_tf lu

MVAPICH2 outperforms Intel MPI by up to 38%

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 23 Application Scalability on Skylake and KNL Cloverleaf (bm64) MPI+OpenMP, MiniFE (1300x1300x1300 ~ 910 GB) NEURON (YuEtAl2012) NUM_OMP_THREADS = 2 60 1200 2000 50 MVAPICH2 1000 MVAPICH2 MVAPICH2 1500 40 800 30 600 1000 20 400 500 10 200

Execution Time (s) 0 0 0 2048 4096 8192 48 96 192 384 768 48 96 192 384 768 1536 3072 No. of Processes (Skylake: 48ppn) No. of Processes (Skylake: 48ppn) No. of Processes (Skylake: 48ppn) 140 3500 1400 120 1200 MVAPICH2 MVAPICH2 3000 MVAPICH2 100 2500 1000 80 2000 800 60 1500 600 40 1000 400 Execution Time (s) 20 500 200 0 0 0 2048 4096 8192 64 128 256 512 1024 2048 4096 68 136 272 544 1088 2176 4352

No. of Processes (KNL: 64ppn) No. of Processes (KNL: 64ppn) No. of Processes (KNL: 68ppn) Courtesy: Mahidhar Tatineni @SDSC, Dong Ju (DJ) Choi@SDSC, and Samuel Khuvis@OSC ---- Testbed: TACC Stampede2 using MVAPICH2-2.3b Runtime parameters: MV2_SMPI_LENGTH_QUEUE=524288 PSM2_MQ_RNDV_SHM_THRESH=128K PSM2_MQ_RNDV_HFI_THRESH=128K Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 24 MVAPICH2 Upcoming Features

• Dynamic and Adaptive Tag Matching • Dynamic and Adaptive Communication Protocols • Support for FPGA-based Accelerators

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 25 Dynamic and Adaptive Tag Matching

Tag matching is a significant A new tag matching design Better performance than other state- overhead for receivers - Dynamically adapt to of-the art tag-matching schemes Existing Solutions are communication patterns Minimum memory consumption Results

- Static and do not adapt dynamically Solution - Use different strategies for different

Challenge to communication pattern ranks - Do not consider memory overhead - Decisions are based on the number Will be available in future MVAPICH2 of request object that must be releases traversed before hitting on the required one

Normalized Total Tag Matching Time at 512 Processes Normalized Memory Overhead per Process at 512 Processes Normalized to Default (Lower is Better) Compared to Default (Lower is Better) Adaptive and Dynamic Design for MPI Tag Matching; M. Bayatpour, H. Subramoni, S. Chakraborty, and D. K. Panda; IEEE Cluster 2016. [Best Paper Nominee] Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 26 Dynamic and Adaptive MPI Point-to-point Communication Protocols

Desired Eager Threshold Eager Threshold for Example Communication Pattern with Different Designs

Default Manually Tuned Dynamic + Adaptive Process Pair Eager Threshold (KB) 4 5 6 7 4 5 6 7 4 5 6 7 0 – 4 32 1 – 5 64 16 KB 16 KB 16 KB 16 KB 128 KB 128 KB 128 KB 128 KB 32 KB 64 KB 128 KB 32 KB 2 – 6 128 0 1 2 3 0 1 2 3 0 1 2 3 3 – 7 32 Process on Node 1 Process on Node 2

Default Poor overlap; Low memory requirement Low Performance; High Productivity Manually Tuned Good overlap; High memory requirement High Performance; Low Productivity Dynamic + Adaptive Good overlap; Optimal memory requirement High Performance; High Productivity

Execution Time of Amber Relative Memory Consumption of Amber 600 9 500 8 7 400 6 5 300 4 200 3 2 100 1 Wall Clock (seconds) Time Clock Wall 0 0

128 256 512 1K Consumption Memory Relative 128 256 512 1K Number of Processes Number of Processes

Default Threshold=17K Threshold=64K Threshold=128K Dynamic Threshold Default Threshold=17K Threshold=64K Threshold=128K Dynamic Threshold

H. Subramoni, S. Chakraborty, D. K. Panda, Designing Dynamic & Adaptive MPI Point-to-Point Communication Protocols for Efficient Overlap of Computation & Communication, ISC'17 - Best Paper Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 27 Support for FPGA-Based Accelerators

• FPGAs are emerging in the market for HPC and Deep Learning • How to exploit FPGA functionalities to accelerate MPI library? • Funded Collaboration with Pattern Computers

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 28 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS MVAPICH2-X with IB, Omni-Path, and RoCE MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 29 MVAPICH2-X for Hybrid MPI + PGAS Applications

High Performance Parallel Programming Models MPI PGAS Hybrid --- MPI + X Message Passing Interface (UPC, OpenSHMEM, CAF, UPC++) (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Unified Communication Runtime Diverse APIs and Mechanisms Collectives Algorithms Introspection & Optimized Point-to- Remote Memory Scalable Job Fault Active Messages (Blocking and Non- Analysis with OSU point Primitives Access Startup Tolerance Blocking) INAM

Support for Efficient Intra-node Communication Support for Modern Networking Technologies Support for Modern Multi-/Many-core Architectures (POSIX SHMEM, CMA, LiMIC, XPMEM…) (InfiniBand, iWARP, RoCE, Omni-Path…) (Intel-Xeon, Intel-, OpenPOWER, ARM…)

• Current Model – Separate Runtimes for OpenSHMEM/UPC/UPC++/CAF and MPI – Possible if both runtimes are not progressed – Consumes more network resource • Unified communication runtime for MPI, UPC, UPC++, OpenSHMEM, CAF – Available with since 2012 (starting with MVAPICH2-X 1.9) – http://mvapich.cse.ohio-state.edu

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 30 Major Features of MVAPICH2-X

• Advanced MPI Features/Support – InfiniBand Features (DC, UMR, ODP, SHArP, and Core-Direct) – Kernel-assisted collectives • Hybrid MPI+PGAS (OpenSHMEM, UPC, UPC++, and CAF) • Allows to combine programming models (Pure MPI, Pure MPI+OpenMP, MPI+OpenSHMEM, MPI+UPC, MPI+UPC++, ..) in the same program to get best performance and scalability

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 31 Optimized CMA-based Collectives for Large Messages

1000000 1000000 1000000 KNL (2 Nodes, 128 Procs) KNL (4 Nodes, 256 Procs) KNL (8 Nodes, 512 Procs) 100000 100000 100000 ~ 3.2x 10000 ~ 2.5x 10000 10000 ~ 4x Better Better Better 1000 1000 1000 MVAPICH2-2.3a MVAPICH2-2.3a MVAPICH2-2.3a

Latency(us) Intel MPI 2017 100 100 ~ 17x 100 Intel MPI 2017 Better Intel MPI 2017 OpenMPI 2.1.0 10 10 OpenMPI 2.1.0 10 OpenMPI 2.1.0 MVAPICH2-X MVAPICH2-X MVAPICH2-X 1 1 1 1K 4K 16K 64K 256K 1M 4M 1K 4K 16K 64K 256K 1M 1K 4K 16K 64K 256K 1M Message Size Message Size Message Size Performance of MPI_Gather on KNL nodes (64PPN) • Significant improvement over existing implementation for Scatter/Gather with 1MB messages (up to 4x on KNL, 2x on Broadwell, 14x on OpenPOWER) • New two-level algorithms for better scalability • Improved performance for other collectives (Bcast, Allgather, and Alltoall) S. Chakraborty, H. Subramoni, and D. K. Panda, Contention Aware Kernel-Assisted MPI Collectives for Multi/Many-core Systems, IEEE Cluster ’17, BEST Paper Finalist Available since MVAPICH2-X 2.3b

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 32 Advanced Allreduce Collective Designs Using SHArP and Multi-Leaders

60 0.6 Lower is better is Lower 23% 50 0.5

40 0.4

30 0.3

Latency (us) Latency 20 40% 0.2 10 0.1

0 0 Communication Latency (Seconds) Latency Communication 4 8 16 32 64 128 256 512 1K 2K 4K 56 224 448 Message Size (Byte) Number of Processes MVAPICH2 Proposed-Socket-Based MVAPICH2+SHArP MVAPICH2 Proposed-Socket-Based MVAPICH2+SHArP OSU Micro Benchmark (16 Nodes, 28 PPN) HPCG (28 PPN)

• Socket-based design can reduce the communication latency by 23% and 40% on Broadwell + IB-EDR nodes • Support is available since MVAPICH2-X 2.3b M. Bayatpour, S. Chakraborty, H. Subramoni, X. Lu, and D. K. Panda, Scalable Reduction Collectives with Data Partitioning-based Multi- Leader Design, Supercomputing '17. Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 33 Performance of MPI_Allreduce On Stampede2 (10,240 Processes)

300 2000 1800 250 1600 1400 200 1200 150 1000 2.4X 800 Latency (us) 100 600

50 400 200 0 0 4 8 16 32 64 128 256 512 1024 2048 4096 8K 16K 32K 64K 128K 256K Message Size Message Size MVAPICH2 MVAPICH2-OPT IMPI MVAPICH2 MVAPICH2-OPT IMPI OSU Micro Benchmark 64 PPN

• MPI_Allreduce latency with 32K bytes reduced by 2.4X

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 34 Application Level Performance with Graph500 and Sort Graph500 Execution Time Sort Execution Time 3000 35 MPI-Simple MPI Hybrid 30 2500 MPI-CSC 25 MPI-CSR 2000 20 51% Hybrid (MPI+OpenSHMEM) 13X 1500 15 Time (s) 10 1000 Time (seconds) 5 7.6X 500 0 4K 8K 16K 0 500GB-512 1TB-1K 2TB-2K 4TB-4K No. of Processes Input Data - No. of Processes • Performance of Hybrid (MPI+ OpenSHMEM) Graph500 Design • Performance of Hybrid (MPI+OpenSHMEM) Sort • 8,192 processes Application - 2.4X improvement over MPI-CSR • 4,096 processes, 4 TB Input Size - 7.6X improvement over MPI-Simple - MPI – 2408 sec; 0.16 TB/min • 16,384 processes - Hybrid – 1172 sec; 0.36 TB/min - 1.5X improvement over MPI-CSR - 51% improvement over MPI-design - 13X improvement over MPI-Simple

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013. J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013 J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 35 MVAPICH2-X Upcoming Features

• MVAPICH2-X 2.3RC1 will be released soon with the following features – Support for XPMEM (Point-to-point, Reduce, and All-reduce) – Efficient Asynchronous Progress with Full Subscription – Contention-aware CMA-based collective support for OpenSHMEM, UPC, and UPC++ • Implicit On-Demand Paging (ODP) Support • Other collectives with XPMEM-based Support

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 36 Shared Address Space (XPMEM)-based Collectives Design OSU_Allreduce (Broadwell 256 procs) OSU_Reduce (Broadwell 256 procs) MVAPICH2-2.3b 100000 MVAPICH2-2.3b 100000 1.8X IMPI-2017v1.132 IMPI-2017v1.132 4X 10000 10000 MVAPICH2-X MVAPICH2-X

1000 1000 73.2 37.9 100 100 Latency (us) Latency

10 10 36.1 16.8 1 1 16K 32K 64K 128K 256K 512K 1M 2M 4M 16K 32K 64K 128K 256K 512K 1M 2M 4M Message Size Message Size • “Shared Address Space”-based true zero-copy Reduction collective designs in MVAPICH2 • Offloaded computation/communication to peers ranks in reduction collective operation • Up to 4X improvement for 4MB Reduce and up to 1.8X improvement for 4M AllReduce J. Hashmi, S. Chakraborty, M. Bayatpour, H. Subramoni, and D. Panda, Designing Efficient Shared Address Space Reduction Collectives for Multi-/Many-cores, International Parallel & Distributed Processing Symposium (IPDPS '18), May 2018. Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 37 Optimized All-Reduce with XPMEM on OpenPOWER 2000 2000 48% MVAPICH2-X 4X MVAPICH2-X SpectrumMPI-10.1.0 SpectrumMPI-10.1.0 1000 OpenMPI-3.0.0 1000 OpenMPI-3.0.0 Latency (us) 2X Latency (us) 34%

(Nodes=1, PPN=20) (Nodes=1, 0 0 16K 32K 64K 128K 256K 512K 1M 2M 16K 32K 64K 128K 256K 512K 1M 2M 4000 4000 MVAPICH2-X 3.3X MVAPICH2-X 3000 2X SpectrumMPI-10.1.0 SpectrumMPI-10.1.0

2000 OpenMPI-3.0.0 2000 OpenMPI-3.0.0 Latency (us) Latency (us) 1000 2X 3X 0 (Nodes=2, PPN=20) (Nodes=2, 0 16K 32K 64K 128K 256K 512K 1M 2M 16K 32K 64K 128K 256K 512K 1M 2M Message Size Message Size • Optimized MPI All-Reduce Design in MVAPICH2 Optimized Runtime Parameters: MV2_CPU_BINDING_POLICY=hybrid MV2_HYBRID_BINDING_POLICY=bunch – Up to 2X performance improvement over Spectrum MPI and 4X over OpenMPI for intra-node

More Details in Poster: Designing Shared Address Space MPI libraries in the Many-core Era - Jahanzeb Hashmi, The Ohio State University Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 38 Enhanced Asynchronous Progress Communication

• Achieve overlaps without using specialized hardware or software resources • Can work with MPI tasks in full-subscribed mode

18% 33% 44% 38% 12% 25% 6%

12% 29% 15% 10% 13%

SPEC MPI : KNL + Omni-Path SPEC MPI : Skylake + Omni-Path P3DFFT : Broadwell + InfiniBand

38% performance improvement for SPECMPI applications on 384 processes 44% performance improvement with the P3DFFT application on 448 processes. A. Ruhela, H. Subramoni, S. Chakraborty, M. Bayatpour, P. Kousha, and D. K. Panda, Efficient Asynchronous Communication Progress for MPI without Dedicated Resources, EuroMPI 2018 Efficient Asynchronous Communication Progress for MPI without Dedicated Resources - Amit Ruhela, The Ohio State University Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 39 SPEC MPI 2007 Benchmarks: Broadwell + InfiniBand

160 Intel MPI 18.1.163 140 MVAPICH2-X 120 29% 5% 100 80 -12% 60 1%

Execution Time in (s) 40 31% 11% 20

0 MILC Leslie3D POP2 LAMMPS WRF2 LU MVAPICH2-X outperforms Intel MPI by up to 31%

Configuration: 448 processes on 16 Intel E5-2680v4 (Broadwell) nodes having 28 PPN and interconnected with 100Gbps Mellanox MT4115 EDR ConnectX-4 HCA Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 40 SPEC MPI 2007 Benchmarks: KNL + Omni-Path

700 10% Intel 18.0.2 600 MVAPICH2-X 10% 500

400

300 -2% 200 22%

Execution Time in (s) 4%

100 -7% 15%

0 MILC Leslie3D POP2 LAMMPS WRF2 GAP Lu MVAPICH2-X outperforms Intel MPI by up to 22%

Configuration : 384 processes on 8 nodes of Intel Xeon Phi 7250(KNL) with 48 processes per node. KNL contains 68 cores on a single socket and interconnects with 100Gb/sec Intel Omni-Path network. Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 41 Implicit On-Demand Paging (ODP)

• Introduced by Mellanox to avoid pinning the pages of registered memory regions • ODP-aware runtime could reduce the size of pin-down buffers while maintaining performance Applications (256 Processes) 100 (s) Pin-down Explicit-ODP Implicit-ODP 10 Time

1

Execution 0.1 CG EP FT IS MG LU SP AWP-ODC Graph500

M. Li, X. Lu, H. Subramoni, and D. K. Panda, “Designing Registration Caching Free High-Performance MPI Library with Implicit On-Demand Paging (ODP) of InfiniBand”, HiPC ‘17

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 42 XPMEM Collectives: osu_bcast and osu_scatter

osu_bcast osu_scatter Intel MPI 2018 10000 100000 Intel MPI 2018 OpenMPI 3.0.1 OpenMPI 3.0.1 MV2-2.3Xrc1 (CMA Coll) 10000 1000 MV2-2.3Xrc1 (CMA Coll) MV2-XPMEM 5X over 1000 MV2-XPMEM 100 Open MPI 100 Latency (us) 10 2.4X over MV2 Latency (us) 10 CMA-coll 1 1 1K 2K 4K 8K 4K 8K 1M 2M 4M 1M 2M 4M 512 16K 32K 64K 16K 32K 64K 128K 256K 512K 128K 256K 512K Message Size (bytes) Message Size (bytes)

• 28 MPI Processes on single dual-socket Broadwell E5-2680v4, 2x14 core processor

More Details in Poster: Designing Shared Address Space MPI libraries in the Many-core Era - Jahanzeb Hashmi, The Ohio State University Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 43 XPMEM Collectives: osu_gather and osu_alltoall

osu_gather osu_alltoall 100000 Intel MPI 2018 100000 Intel MPI 2018 OpenMPI 3.0.1 OpenMPI 3.0.1 10000 10000 MV2X-2.3rc1 (CMA Coll) MV2X-2.3rc1 (CMA Coll) MV2-XPMEM 1000 MV2-XPMEM 1000 5X over Data does not fit in 100 Open MPI 100 shared L3 for 28ppn

Latency (us) Alltoall, leads to same

Latency (us) performance as 10 10 existing designs 1 Up to 6X over IMPI and OMPI 1 4K 8K 1M 2M 4M 16K 32K 64K 1K 2K 4K 8K 1M 2M 4M 128 256 512 128K 256K 512K 16K 32K 64K 128K 256K 512K Message Size (bytes) Message Size (bytes)

• 28 MPI Processes on single dual-socket Broadwell E5-2680v4, 2x14 core processor

More Details in Poster: Designing Shared Address Space MPI libraries in the Many-core Era - Jahanzeb Hashmi, The Ohio State University Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 44 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS MVAPICH2-X with IB, Omni-Path, and RoCE MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 45 CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.3 Releases • Support for MPI communication from NVIDIA GPU device memory • High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU) • High performance intra-node point-to-point communication for multi- GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU) • Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node • Optimized and tuned collectives for GPU device buffers • MPI datatype support for point-to-point and collective communication from GPU device buffers • Unified memory

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 46 MVAPICH2-GDR 2.3a • Released on 11/09/2017 • Major Features and Enhancements – Based on MVAPICH2 2.2 – Support for CUDA 9.0 – Add support for Volta (V100) GPU – Support for OpenPOWER with NVLink – Efficient Multiple CUDA stream-based IPC communication for multi-GPU systems with and without NVLink – Enhanced performance of GPU-based point-to-point communication – Leverage Linux Cross Memory Attach (CMA) feature for enhanced host-based communication – Enhanced performance of MPI_Allreduce for GPU-resident data – InfiniBand Multicast (IB-MCAST) based designs for GPU-based broadcast and streaming applications • Basic support for IB-MCAST designs with GPUDirect RDMA • Advanced support for zero-copy IB-MCAST designs with GPUDirect RDMA • Advanced reliability support for IB-MCAST designs – Efficient broadcast designs for Deep Learning applications – Enhanced collective tuning on Xeon, OpenPOWER, and NVIDIA DGX-1 systems

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 47 Optimized MVAPICH2-GDR Design GPU-GPU Inter-node Latency GPU-GPU Inter-node Bi-Bandwidth 30 6000 25 5000 20 4000 15 10 3000 11X 1.88us 10x 2000 Latency (us) 5 0 1000 Bandwidth (MB/s) 0 1 2 4 8 16 32 64 128 256 512 1K 2K 4K 8K 0 1 2 4 8 16 32 64 1K 2K 4K

Message Size (Bytes) 128 256 512 Message Size (Bytes) MV2-(NO-GDR) MV2-GDR-2.3a MV2-(NO-GDR) MV2-GDR-2.3a GPU-GPU Inter-node Bandwidth 3500 3000 2500 2000 MVAPICH2-GDR-2.3a 1500 9x 1000 Intel Haswell (E5-2687W @ 3.10 GHz) node - 20 cores Bandwidth (MB/s) 500 NVIDIA Volta V100 GPU 0 Mellanox Connect-X4 EDR HCA 1 2 4 8 16 32 64 128 256 512 1K 2K 4K CUDA 9.0 Mellanox OFED 4.0 with GPU-Direct-RDMA Message Size (Bytes)

MV2-(NO-GDR) MV2-GDR-2.3a

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 48 Application-Level Evaluation (HOOMD-blue) 64K Particles 256K Particles 3500 2500 3000 MV2 MV2+GDR 2500 2000 2X 2000 1500 2X

(TPS) 1500 1000 1000 500 500

Average per second TimeSteps Average 0 0

4 8 16 32 Time per Steps second (TPS)Average 4 8 16 32 Number of Processes Number of Processes

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB) • HoomdBlue Version 1.0.5 • GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 49 Application-Level Evaluation (Cosmo) and Weather Forecasting in Switzerland

Wilkes GPU Cluster CSCS GPU cluster

Default Callback-based Event-based Default Callback-based Event-based

1.2 1.2 1 1 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 Normalized Time Execution

Normalized Time Execution 0 4 8 16 32 16 32 64 96

Number of GPUs Number of GPUs

• 2X improvement on 32 GPUs nodes Cosmo model: http://www2.cosmo-model.org/content • 30% improvement on 96 GPU nodes (8 GPUs/node) /tasks/operational/meteoSwiss/ On-going collaboration with CSCS and MeteoSwiss (Switzerland) in co-designing MV2-GDR and Cosmo Application C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee , H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS’16

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 50 Multi-stream Communication using CUDA IPC on OpenPOWER and DGX-1

• Up to 16% higher Device to Device (D2D) bandwidth on OpenPOWER + NVLink inter-connect • Up to 30% higher D2D bandwidth on DGX-1 with NVLink • Pt-to-pt (D-D) Bandwidth: Pt-to-pt (D-D) Bandwidth: Benefits of Multi-stream CUDA IPC Design Benefits of Multi-stream CUDA IPC Design

20000 40000 18000 30% better 35000 16000 16% better 30000 14000 12000 25000 10000 20000 8000 15000 6000 10000 Million Bytes (MB)/second Bytes Million 4000 (MB)/second Bytes Million 1-stream 4-streams 2000 1-stream 4-streams 5000 0 0 16K 32K 64K 128K 256K 512K 1M 2M 4M 128K 256K 512K 1M 2M 4M Message Size (Bytes) Message Size (Bytes)

Available since MVAPICH2-GDR-2.3a

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 51 CMA-based Intra-node Communication Support

• Up to 30% lower Host-to-Host (H2H) latency and 30% higher H2H Bandwidth

INTRA-NODE Pt-to-Pt (H2H) LATENCY INTRA-NODE Pt-to-Pt (H2H) BANDWIDTH 600 16000 14000 500 30% better 12000 30% better 400 10000 300 8000 6000

Latency (us) 200 4000 100 Bandwidth (MBps) 2000 0 0 0 2 8 32 128 512 2K 8K 32K 128K 512K 2M 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M Message Size (Bytes) Message Size (Bytes)

MV2-GDR (w/out CMA) MV2-GDR (w/ CMA) MV2-GDR (w/out CMA) MV2-GDR (w/ CMA)

MVAPICH2-GDR-2.3a Intel Broadwell (E5-2680 v4 @ 3240 GHz) node – 28 cores NVIDIA Tesla K-80 GPU, and Mellanox Connect-X4 EDR HCA CUDA 8.0, Mellanox OFED 4.0 with GPU-Direct-RDMA

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 52 MVAPICH2-GDR: Performance on OpenPOWER (NVLink + Pascal) INTRA-NODE LATENCY (SMALL) INTRA-NODE LATENCY (LARGE) INTRA-NODE BANDWIDTH 20 500 40

15 400 30 300 10 20 200 Latency (us) Latency (us) 5 100 10

0 0 Bandwidth (GB/sec) 0 1 2 4 8 16 32 64 128256512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M 1 8 64 512 4K 32K 256K 2M Message Size (Bytes) Message Size (Bytes) Message Size (Bytes)

INTRA-SOCKET(NVLINK) INTER-SOCKET INTRA-SOCKET(NVLINK) INTER-SOCKET INTRA-SOCKET(NVLINK) INTER-SOCKET Intra-node Latency: 16.8 us (without GPUDirectRDMA) Intra-node Bandwidth: 32 GB/sec (NVLINK)

INTER-NODE LATENCY (SMALL) INTER-NODE LATENCY (LARGE) INTER-NODE BANDWIDTH 35 900 7 30 800 6 700 25 600 5 20 500 4 15 400 3 300 Latency (us) 10 Latency (us) 200 2

5 100 Bandwidth (GB/sec) 1 0 0 0 1 2 4 8 16 32 64 128256512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M 1 8 64 512 4K 32K 256K 2M Message Size (Bytes) Message Size (Bytes) Message Size (Bytes) Inter-node Latency: 22 us (without GPUDirectRDMA) Inter-node Bandwidth: 6 GB/sec (FDR) Platform: OpenPOWER (ppc64le) nodes equipped with a dual-socket CPU, 4 Pascal P100-SXM GPUs, and 4X-FDR InfiniBand Inter-connect

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 53 Deep Learning: New Challenges for MPI Runtimes

• Deep Learning frameworks are a different game Proposed altogether Co- NCCL NCCL2 Designs – Unusually large message sizes (order of megabytes) – Most communication based on GPU buffers cuDNN • Existing State-of-the-art cuBLAS – cuDNN, cuBLAS, NCCL --> scale-up performance MPI – NCCL2, CUDA-Aware MPI --> scale-out performance • For small and medium message sizes only! up Performance - • Proposed: Can we co-design the MPI runtime (MVAPICH2- gRPC GDR) and the DL framework (Caffe) to achieve both? Scale – Efficient Overlap of Computation and Communication Hadoop – Efficient Large-Message Communication (Reductions) – What application co-designs are needed to exploit communication-runtime co-designs? Scale-out Performance A. A. Awan, K. Hamidouche, J. M. Hashmi, and D. K. Panda, S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '17)

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 54 Large Message Optimized Collectives for Deep Learning Reduce – 192 GPUs Reduce – 64 MB 250 200 200 150 150 100 100 50 50 Latency (ms) Latency (ms) • MV2-GDR provides 0 0 2 4 8 16 32 64 128 128 160 192 optimized collectives for Message Size (MB) No. of GPUs Allreduce - 128 MB large message sizes Allreduce – 64 GPUs 300 • Optimized Reduce, 300 200 200 Allreduce, and Bcast 100 100 Latency (ms) Latency (ms) • Good scaling with large 0 0 2 4 8 16 32 64 128 16 32 64 number of GPUs Message Size (MB) No. of GPUs Bcast 128 MB Bcast – 64 GPUs • Available since MVAPICH2- 80 80 GDR 2.2GA 60 60 40 40 20

20 Latency (ms) Latency (ms) 0 0 2 4 8 16 32 64 128 16 32 64 Message Size (MB) No. of GPUs Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 55 Exploiting CUDA-Aware MPI for TensorFlow (Horovod)

3500

3000 • MVAPICH2-GDR offers excellent MVAPICH2-GDR is up to 2500 22% faster than MVAPICH2 performance via advanced designs for 2000 MPI_Allreduce. 1500 • Up to 22% better performance on

Images/sec (higher (higher better) is Images/sec 1000 Wilkes2 cluster (16 GPUs) 500

0 1 2 4 8 16 No. of GPUs (4GPUs/node)

MVAPICH2 MVAPICH2-GDR

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 56 MVAPICH2-GDR on Container with Negligible Overhead GPU-GPU Inter-node Latency GPU-GPU Inter-node Bi-Bandwidth 1000 20000

100 15000

10 10000 Latency (us) 1 5000 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M Bandwidth (MB/s) 0 Message Size (Bytes) 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M Docker Native Message Size (Bytes) GPU-GPU Inter-node Bandwidth Docker Native 12000 10000 8000 6000 MVAPICH2-GDR-2.3a 4000 Intel Haswell (E5-2687W @ 3.10 GHz) node - 20 cores

Bandwidth (MB/s) 2000 NVIDIA Volta V100 GPU 0 Mellanox Connect-X4 EDR HCA 1 4 16 64 256 1K 4K 16K 64K 256K 1M 4M CUDA 9.0 Mellanox OFED 4.0 with GPU-Direct-RDMA Message Size (Bytes)

Docker Native

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 57 MVAPICH2-GDR Upcoming Features

• MVAPICH2-GDR 2.3b will be released soon – Scalable Host-based Collectives – Optimized Support for Deep Learning (Caffe and CNTK) – Support for Streaming Applications • Optimized Collectives for Multi-Rails (Sierra) • Integrated Collective Support with SHArP

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 58 Scalable Host-based Collectives on OpenPOWER (Intra-node Reduce & AlltoAll) Reduce 40 2000 3.2X MVAPICH2-GDR MVAPICH2-GDR 1600 30 SpectrumMPI-10.1.0.2 SpectrumMPI-10.1.0.2 OpenMPI-3.0.0 OpenMPI-3.0.0 5.2X 1200 20 1.4X 3.6X 800 Latency (us) Latency (us) 10 400 (Nodes=1, PPN=20) (Nodes=1, (Nodes=1, PPN=20) (Nodes=1, 0 0 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M Alltoall 3.2X 1.3X 200 3.3X 10000 MVAPICH2-GDR MVAPICH2-GDR 1.2X 150 SpectrumMPI-10.1.0.2 7500 SpectrumMPI-10.1.0.2 OpenMPI-3.0.0 OpenMPI-3.0.0 100 5000

Latency (us) 50 Latency (us) 2500 (Nodes=1, PPN=20) (Nodes=1, 0 0 PPN=20) (Nodes=1, 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M Up to 5X and 3x performance improvement by MVAPICH2 for small and large messages respectively

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 59 MVAPICH2-GDR: Allreduce Comparison with Baidu and OpenMPI

• 16 GPUs (4 nodes) MVAPICH2-GDR vs. Baidu-Allreduce and OpenMPI 3.0

100000 ~10X better 6000000 50000 OpenMPI is ~5X slower 45000 5000000 than Baidu 10000 ~30X better 40000 35000 4000000 MV2 is ~2X better 1000 30000 than Baidu 3000000 25000 100 Latency (us) Latency (us) 20000 2000000 Latency (us) 15000 10 10000 ~4X better 1000000 5000 1 0 0 4 32 256 2K 16K 128K 8M 32M 128M 512M 512K 1M 2M 4M Message Size (Bytes) Message Size (Bytes) Message Size (Bytes) MVAPICH2 BAIDU OPENMPI MVAPICH2 BAIDU OPENMPI MVAPICH2 BAIDU OPENMPI

*Available since MVAPICH2-GDR 2.3a

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 60 MVAPICH2-GDR vs. NCCL2 – Broadcast Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases • MPI_Bcast (MVAPICH2-GDR) vs. ncclBcast (NCCL2) on 16 K-80 GPUs

100000 100000 1 GPU/node 2 GPUs/node 10000 10000

1000 1000 ~4X better 100 100 Latency (us) ~10X better Latency (us)

10 10

1 1 4 32 256 2K 16K 128K 1M 8M 64M 4 32 256 2K 16K 128K 1M 8M 64M Message Size (Bytes) Message Size (Bytes)

NCCL2 MVAPICH2-GDR NCCL2 MVAPICH2-GDR *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 2 K-80 GPUs, and EDR InfiniBand Inter-connect

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 61 MVAPICH2-GDR vs. NCCL2 – Reduce Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases • MPI_Reduce (MVAPICH2-GDR) vs. ncclReduce (NCCL2) on 16 GPUs

1000 100000 ~2.5X better 10000

100 1000

100 Latency (us) ~5X better Latency (us) 10 10

1 1 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K Message Size (Bytes) Message Size (Bytes)

MVAPICH2-GDR NCCL2 MVAPICH2-GDR NCCL2 *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPUs, and EDR InfiniBand Inter-connect

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 62 MVAPICH2-GDR vs. NCCL2 – Allreduce Operation

• Optimized designs in MVAPICH2-GDR 2.3b* offer better/comparable performance for most cases • MPI_Allreduce (MVAPICH2-GDR) vs. ncclAllreduce (NCCL2) on 16 GPUs

1000 100000

10000 ~1.2X better 100 ~3X better 1000

Latency (us) 100 10 Latency (us)

10

1 1 4 8 16 32 64 128 256 512 1K 2K 4K 8K 16K 32K 64K 128K 512K 2M 8M 32M 128M Message Size (Bytes) Message Size (Bytes)

MVAPICH2-GDR NCCL2 MVAPICH2-GDR NCCL2 *Will be available with upcoming MVAPICH2-GDR 2.3b Platform: Intel Xeon (Broadwell) nodes equipped with a dual-socket CPU, 1 K-80 GPUs, and EDR InfiniBand Inter-connect

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 63 Streaming Support (Combining GDR and IB-Mcast)

• Combining MCAST+GDR hardware features for heterogeneous Node 1 configurations: C CPU – Source on the Host and destination on Device IB HCA Source – SL design: Scatter at destination GPU • Source: Data and Control on Host C Data Data • Destinations: Data on Device and Control on Host CPU IB Switch – Combines IB MCAST and GDR features at receivers IB HCA Node N – CUDA IPC-based topology-aware intra-node broadcast GPU C – Minimize use of PCIe resources (Maximizing availability of PCIe Host- CPU Device Resources) IB HCA Multicast steps GPU IB SL step Data C.-H. Chu, K. Hamidouche, H. Subramoni, A. Venkatesh, B. Elton, and D. K. Panda. “Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, “ SBAC-PAD’16, Oct 2016.

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 64 Exploiting GDR+IB-Mcast Design for Deep Learning Applications

• Optimizing MCAST+GDR Broadcast for deep learning: Node 1 – Source and destination buffers are on GPU Device Header Source CPU • Typically very large messages (>1MB) IB – Pipelining data from Device to Host Header Data HCA • Avoid GDR read limit CPU GPU Data • Leverage high-performance SL design IB IB – Combines IB MCAST and GDR features HCA Switch GPU Node N – Minimize use of PCIe resources on the receiver side Data Header • Maximizing availability of PCIe Host-Device Resources CPU IB 1. Data movement 2. IB Gather HCA Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton, and Dhabaleswar 3. IB Hardware Multicast GPU K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning , ” 4. IB Scatter + GDR Write ICPP’17. Data Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 65 Application Evaluation: Deep Learning Frameworks

• @ RI2 cluster, 16 GPUs, 1 GPU/node – Microsoft Cognitive Toolkit (CNTK) [https://github.com/Microsoft/CNTK] Lower is better AlexNet model VGG model MV2-GDR-Knomial MV2-GDR-Ring MCAST-GDR-Opt MV2-GDR-Knomial MV2-GDR-Ring MCAST-GDR-Opt 300 2500 15% 6% 250 24% 2000 15% 200 1500 150 1000 100

Training Time (s) Training 50 Time (s) Training 500 0 0 8 16 8 16 Number of GPU nodes Number of GPU nodes • Reduces up to 24% and 15% of latency for AlexNet and VGG models • Higher improvement can be observed for larger system sizes Ching-Hsiang Chu, Xiaoyi Lu, Ammar A. Awan, Hari Subramoni, Jahanzeb Hashmi, Bracy Elton, and Dhabaleswar K. Panda, "Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning , " ICPP’17. Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 66 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS MVAPICH2-X with IB, Omni-Path, and RoCE MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 67 Can HPC and Virtualization be Combined? • Virtualization has many benefits – Fault-tolerance – Job migration – Compaction • Have not been very popular in HPC due to overhead associated with Virtualization • New SR-IOV (Single Root – IO Virtualization) support available with Mellanox InfiniBand adapters changes the field • Enhanced MVAPICH2 support for SR-IOV • MVAPICH2-Virt 2.2 supports: – Efficient MPI communication over SR-IOV enabled InfiniBand networks – High-performance and locality-aware MPI communication for VMs and containers – Automatic communication channel selection for VMs (SR-IOV, IVSHMEM, and CMA/LiMIC2) and containers (IPC-SHM, CMA, and HCA) – OpenStack, Docker, and Singularity

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14

J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D.K. Panda, High Performance MPI Libray over SR-IOV enabled InfiniBand Clusters, HiPC’14 J. Zhang, X .Lu, M. Arnold and D. K. Panda, MVAPICH2 Over OpenStack with SR-IOV: an Efficient Approach to build HPC Clouds, CCGrid’15 Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 68 Application-Level Performance on Chameleon (SR-IOV Support)

6000 400 MV2-SR-IOV-Def MV2-SR-IOV-Def 350 5000 MV2-SR-IOV-Opt MV2-SR-IOV-Opt 300 4000 2% MV2-Native MV2-Native 250 3000 200 9.5% 150 2000 1%

5% Time (s) Execution

Execution Time (ms) Execution 100 1000 50 0 0 22,20 24,10 24,16 24,20 26,10 26,16 milc leslie3d pop2 GAPgeofem zeusmp2 lu Problem Size (Scale, Edgefactor) Graph500 SPEC MPI2007 • 32 VMs, 6 Core/VM • Compared to Native, 2-5% overhead for Graph500 with 128 Procs • Compared to Native, 1-9.5% overhead for SPEC MPI2007 with 128 Procs

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 69 Application-Level Performance on Chameleon (Containers Support)

100 Container-Def 4000 90 Container-Def Container-Opt 3500 80 11% Container-Opt Native 3000 70 Native 60 2500 50 2000 40 1500

Execution Time (s) Execution 30 1000 20 Time (ms) Execution 500 16% 10 0 0 22, 16 22, 20 24, 16 24, 20 26, 16 26, 20 MG.D FT.D EP.D LU.D CG.D Problem Size (Scale, Edgefactor) NAS Graph 500 • 64 Containers across 16 nodes, pining 4 Cores per Container • Compared to Container-Def, up to 11% and 16% of execution time reduction for NAS and Graph 500 • Compared to Native, less than 9 % and 4% overhead for NAS and Graph 500

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 70 Application-Level Performance on Singularity with MVAPICH2

NPB Class D Graph500 300 3000 Singularity Singularity 6% 250 2500 Native Native 200 2000

150 1500

100 BFS Execution Time (ms) Execution Execution Time (s) 7% 1000

50 500

0 0 CG EP FT IS LU MG 22,16 22,20 24,16 24,20 26,16 26,20 Problem Size (Scale, Edgefactor) • 512 Processes across 32 nodes More Details were presented • Less than 7% and 6% overhead for NPB and Graph500, respectively in yesterday’s Tutorial Session

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 71 MVAPICH2-Virt Upcoming Features

• Support for SR-IOV Enabled VM Migration • SR-IOV and IVSHMEM Support in SLURM • Support for Microsoft Azure Platform

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 72 High Performance SR-IOV enabled VM Migration Support in MVAPICH2 • Migration with SR-IOV device has to handle the challenges of detachment/re-attachment of virtualized IB device and IB connection • Consist of SR-IOV enabled IB Cluster and External Migration Controller • Multiple parallel libraries to notify MPI applications during migration (detach/reattach SR- IOV/IVShmem, migrate VMs, migration status) • Handle the IB connection suspending and reactivating • Propose Progress engine (PE) and migration based (MT) design to optimize VM migration and MPI application performance

J. Zhang, X. Lu, D. K. Panda. High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV enabled InfiniBand Clusters. IPDPS, 2017

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 73 SR-IOV and IVSHMEM in SLURM

• Requirement of managing and isolating the virtualized SR-IOV and IVSHMEM resources
• Such management and isolation is hard to achieve with the MPI library alone, but much easier with SLURM
• Running MPI applications efficiently on HPC clouds needs SLURM support for managing SR-IOV and IVSHMEM
  – Can critical HPC resources be efficiently shared among users by extending SLURM with support for SR-IOV and IVSHMEM based virtualization?
  – Can SR-IOV and IVSHMEM enabled SLURM and MPI library provide bare-metal performance for end applications on HPC Clouds?

J. Zhang, X. Lu, S. Chakraborty, and D. K. Panda. SLURM-V: Extending SLURM for Building Efficient HPC Cloud with SR-IOV and IVShmem. The 22nd International European Conference on Parallel Processing (Euro-Par), 2016.

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 74 Application-Level Performance on Azure

[Charts: NPB Class B (bt, cg, ep, ft, is, lu, mg, sp) and NPB Class C (cg, ep, ft, is, lu, mg) execution time (s); legend: MVAPICH2, IntelMPI]
• NPB Class B with 16 Processes on 2 Azure A8 instances
• NPB Class C with 32 Processes on 4 Azure A8 instances
• Comparable performance between MVAPICH2 and Intel MPI
• Will be available publicly for Azure soon

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 75 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE MVAPICH2-X
MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 76 Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

• MVAPICH2-EA 2.1 (Energy-Aware) • A white-box approach • New Energy-Efficient communication protocols for pt-pt and collective operations • Intelligently apply the appropriate Energy saving techniques • Application oblivious energy saving

• OEMT • A library utility to measure energy consumption for MPI applications • Works with all MPI runtimes • PRELOAD option for precompiled applications • Does not require ROOT permission: • A safe kernel module to read only a subset of MSRs

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 77 MVAPICH2-EA: Application Oblivious Energy-Aware-MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy reduction lever to each MPI call (a sketch of this per-call approach via the PMPI profiling interface follows below)
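For illustration only, the per-call lever idea can be sketched with the standard PMPI profiling interface. Here apply_low_power_lever() and restore_performance() are placeholders for platform mechanisms such as DVFS or core idling; they are assumptions, not MVAPICH2-EA APIs, and EAM itself decides selectively inside the runtime rather than wrapping every call.

/* Sketch of applying an "energy lever" around a blocking MPI call through
 * the standard PMPI profiling interface. */
#include <mpi.h>

static void apply_low_power_lever(void)  { /* e.g. lower CPU frequency */ }
static void restore_performance(void)    { /* restore nominal frequency */ }

/* A "pessimistic" runtime would wrap every blocking call like this. */
int MPI_Recv(void *buf, int count, MPI_Datatype type, int src, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    apply_low_power_lever();                       /* enter low-power wait */
    int rc = PMPI_Recv(buf, count, type, src, tag, comm, status);
    restore_performance();                         /* back to full speed   */
    return rc;
}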

A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 78 MPI-3 RMA Energy Savings with Proxy-Applications

[Charts: Graph500 execution time (s) and energy usage (Joules) at 128, 256, and 512 processes; legend: optimistic, pessimistic, EAM-RMA]
• MPI_Win_fence dominates application execution time in Graph500
• Between 128 and 512 processes, EAM-RMA yields between 31% and 46% savings with no degradation in execution time compared with the default optimistic MPI runtime
• MPI-3 RMA energy-efficient support will be available in an upcoming MVAPICH2-EA release (a minimal fence-synchronized RMA sketch follows below)
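For readers unfamiliar with the fence synchronization that dominates here, this is a minimal MPI-3 RMA sketch (not Graph500 itself): each rank exposes one integer in a window and puts its rank into its right neighbor between two MPI_Win_fence calls, the synchronization points the EAM-RMA design targets.

/* Minimal fence-synchronized MPI-3 RMA sketch; illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    int local = 0;                       /* exposed through the window */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    int value = rank;
    int right = (rank + 1) % size;

    MPI_Win_fence(0, win);                     /* open access epoch  */
    MPI_Put(&value, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                     /* close access epoch */

    printf("rank %d received %d from its left neighbor\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}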

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 79 MVAPICH2 Software Family

Requirements Library

MPI with IB, iWARP, Omni-Path, and RoCE MVAPICH2

Advanced MPI Features/Support, OSU INAM, PGAS and MPI+PGAS with IB, Omni-Path, and RoCE MVAPICH2-X
MPI with IB, RoCE & GPU and Support for Deep Learning MVAPICH2-GDR

HPC Cloud with MPI & IB MVAPICH2-Virt

Energy-aware MPI with IB, iWARP and RoCE MVAPICH2-EA

MPI Energy Monitoring Tool OEMT

InfiniBand Network Analysis and Monitoring OSU INAM

Microbenchmarks for Measuring MPI and PGAS Performance OMB

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 80 Overview of OSU INAM

• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
  – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication
  – Point-to-Point, Collectives and RMA
• Ability to filter data based on type of counters using a "drop down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job, or a set of nodes
• OSU INAM v0.9.3 released on 03/16/2018
  – Enhance INAMD to query end nodes based on a command line option
  – Add a web page to display the size of the database in real time
  – Enhance interaction between the web application and the SLURM job launcher for increased portability
  – Improve packaging of the web application and daemon to ease installation

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 81 OSU INAM Features

[Screenshots: Comet@SDSC clustered view (1,879 nodes, 212 switches, 4,377 network links); finding routes between nodes]
• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links/links in error state
• See the history unfold – play back historical state of the network

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 82 OSU INAM Features (Cont.)

[Screenshots: Visualizing a Job (5 Nodes); Estimated Process Level Link Utilization]
• Job level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node level view – details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g. XmitDiscard, RcvError) per rank/node
• Estimated Link Utilization view
  – Classify data flowing over a network link at different granularity in conjunction with MVAPICH2-X 2.2rc1
    • Job level
    • Process level
More Details in Poster: High-Performance and Scalable Fabric Analysis, Monitoring and Introspection Infrastructure for HPC - Hari Subramoni
Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 83 OSU Microbenchmarks
• Available since 2004
• Suite of microbenchmarks to study communication performance of various programming models (a minimal latency sketch in this spirit follows below)
• Benchmarks available for the following programming models
  – Message Passing Interface (MPI)
  – Partitioned Global Address Space (PGAS)
    • Unified Parallel C (UPC)
    • Unified Parallel C++ (UPC++)
    • OpenSHMEM
• Benchmarks available for multiple accelerator based architectures
  – Compute Unified Device Architecture (CUDA)
  – OpenACC Application Program Interface
• Part of various national resource procurement suites like NERSC-8 / Trinity Benchmarks
• Continuing to add support for newer primitives and features
• Please visit the following link for more information
  – http://mvapich.cse.ohio-state.edu/benchmarks/
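In the spirit of the OSU microbenchmarks (but not the actual OMB source), a stripped-down point-to-point latency test can be sketched as follows; the real osu_latency additionally handles warm-up iterations, a range of message sizes, and accelerator buffers. Run it with exactly two MPI processes.

/* Minimal point-to-point latency sketch; illustrative only, not OMB code. */
#include <mpi.h>
#include <stdio.h>

#define ITER 1000
#define SIZE 8            /* message size in bytes */

int main(int argc, char **argv)
{
    int rank;
    char buf[SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITER; i++) {          /* ping-pong loop */
        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("%d-byte latency: %.2f us\n", SIZE,
               (t1 - t0) * 1e6 / (2.0 * ITER));

    MPI_Finalize();
    return 0;
}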

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 84 Applications-Level Tuning: Compilation of Best Practices

• The MPI runtime has many parameters
• Tuning a set of parameters can help extract higher performance (see the MPI_T sketch below for one way to inspect them)
• Compiled a list of such contributions through the MVAPICH Website
  – http://mvapich.cse.ohio-state.edu/best_practices/
• Initial list of applications
  – Amber
  – HOOMD-blue
  – HPCG
  – Lulesh
  – MILC
  – Neuron
  – SMG2000
• Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu
• We will link these results with credits to you
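One portable way to see which control variables a given MPI library exposes, before consulting the best-practices pages, is the standard MPI_T tools interface. The sketch below simply enumerates whatever control variables the installed library reports; it assumes nothing about their names or meanings.

/* Sketch: list the control variables an MPI library exposes via MPI_T. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, num_cvars, rank;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);

    MPI_T_cvar_get_num(&num_cvars);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        printf("library exposes %d control variables\n", num_cvars);
        for (int i = 0; i < num_cvars; i++) {
            char name[256], desc[256];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;

            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            printf("  %s\n", name);   /* names are library-specific */
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}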

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 85 MVAPICH Team Part of New NSF Tier-1 System

• TACC and MVAPICH partner to win new NSF Tier-1 System – https://www.hpcwire.com/2018/07/30/tacc-wins-next-nsf-funded-major-/ • The MVAPICH team will be an integral part of the effort • An overview of the recently awarded NSF Tier 1 System at TACC will be presented by Dan Stanzione – The presentation will also include discussion on MVAPICH collaboration on past systems and this upcoming system at TACC

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 86 Commercial Support for MVAPICH2 Libraries

• Supported through X-ScaleSolutions (http://x-scalesolutions.com)
• Benefits:
  – Help and guidance with installation of the library
  – Platform-specific optimizations and tuning
  – Timely support for operational issues encountered with the library
  – Web portal interface to submit issues and track their progress
  – Advanced debugging techniques
  – Application-specific optimizations and tuning
  – Obtaining guidelines on best practices
  – Periodic information on major fixes and updates
  – Information on major releases
  – Help with upgrading to the latest release
  – Flexible Service Level Agreements
• Support provided to Lawrence Livermore National Laboratory (LLNL) this year

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 87 MVAPICH2 – Plans for Exascale

• Performance and Memory scalability toward 1M-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …) (see the sketch below)
• MPI + Task*
• Enhanced Optimization for GPUs and FPGAs*
• Taking advantage of advanced features of Mellanox InfiniBand
  – Tag Matching*
  – Adapter Memory*
• Enhanced communication schemes for upcoming architectures
  – NVLINK*
  – CAPI*
• Extended topology-aware collectives
• Extended Energy-aware designs and Virtualization Support
• Extended Support for MPI Tools Interface (as in MPI 3.0)
• Extended FT support
• Support for * features will be available in future MVAPICH2 Releases
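As a hedged illustration of the hybrid MPI+PGAS model that MVAPICH2-X targets with a unified runtime, the sketch below mixes OpenSHMEM one-sided puts with an MPI collective in a single program. It assumes a runtime (such as MVAPICH2-X) that allows MPI and OpenSHMEM to be initialized and used together; the specific initialization order shown is an assumption, not a documented requirement.

/* Hybrid MPI + OpenSHMEM sketch; illustrative only. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    shmem_init();

    int rank = shmem_my_pe();
    int npes = shmem_n_pes();

    /* One-sided PGAS-style exchange: put our rank into the right neighbor's slot. */
    int *slot = (int *) shmem_malloc(sizeof(int));
    *slot = -1;
    shmem_barrier_all();
    shmem_int_p(slot, rank, (rank + 1) % npes);
    shmem_barrier_all();

    /* Two-sided MPI collective over the same set of processes. */
    int sum = 0;
    MPI_Allreduce(slot, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of received ranks = %d\n", sum);   /* npes*(npes-1)/2 */

    shmem_free(slot);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}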

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 88 Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/

Network Based Computing Laboratory MVAPICH User Group Meeting (MUG) 2018 89