Overview of the MVAPICH Project: Latest Status and Future Roadmap

MVAPICH2 User Group (MUG) Meeting

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Drivers of Modern HPC Cluster Architectures

Multi-core Processors | High Performance Interconnects - InfiniBand (<1 usec latency, >100 Gbps bandwidth) | Accelerators / Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip)
• Multi-core processors are ubiquitous
• InfiniBand very popular in HPC clusters
• Accelerators/Coprocessors becoming common in high-end systems
• Pushing the envelope for Exascale computing

Example systems (Top500 rank): Tianhe-2 (1), Titan (2), Stampede (8), Tianhe-1A (24)

Parallel Programming Models Overview

[Diagram: three programming model classes, each shown with processes P1, P2, P3 and their memories - Shared Memory Model (SHMEM, DSM) with logical shared memory, Distributed Memory Model (MPI - Message Passing Interface), and Partitioned Global Address Space (PGAS: UPC, Chapel, X10, CAF, ...)]

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges

Application Kernels/Applications

Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Hadoop, MapReduce, etc.
Co-Design Opportunities and Challenges across Various Layers (Middleware)

Communication Library or Runtime for Programming Models

Point-to-point Communication (two-sided & one-sided) | Collective Communication | Synchronization & Locks | I/O & File Systems | Fault Tolerance

Networking Technologies (InfiniBand, 10/40GigE, Aries, OmniPath) | Multi/Many-core Architectures | Accelerators (NVIDIA and MIC)

Designing (MPI+X) at Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
  – Extremely minimal memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...)
• Balancing intra-node and inter-node communication for next-generation multi-core systems (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Scalable collective communication
  – Offload
  – Non-blocking
  – Topology-aware
  – Power-aware
• Support for the MPI-3 RMA model (see the sketch after this list)
• Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Virtualization support
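As a concrete reference for the MPI-3 RMA item above, here is a minimal sketch of one-sided communication with active-target (fence) synchronization. It is generic MPI-3 code, not MVAPICH2-specific, and the ring-style exchange is purely illustrative.

```c
/* Minimal sketch of MPI-3 one-sided (RMA) communication with active-target
 * synchronization. Illustrative only; not MVAPICH2-specific code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose one integer per process through an RMA window. */
    int *win_buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = -1;

    MPI_Win_fence(0, win);               /* open access epoch              */
    int value = rank;
    MPI_Put(&value, 1, MPI_INT,          /* write my rank ...              */
            (rank + 1) % size,           /* ... into my right neighbor     */
            0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);               /* close epoch; puts are complete */

    printf("Rank %d received %d\n", rank, *win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```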

The MVAPICH2 Software Family

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, RDMA over Converged Enhanced Ethernet (RoCE), and virtualized clusters
  – MVAPICH (MPI-1), available since 2002
  – MVAPICH2 (MPI-2.2, MPI-3.0 and MPI-3.1), available since 2004
  – MVAPICH2-X (Advanced MPI + PGAS), available since 2012
  – Support for GPGPUs (MVAPICH2-GDR), available since 2014
  – Support for MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
• http://mvapich.cse.ohio-state.edu

MVAPICH Project Timeline

[Timeline chart: start dates of the MVAPICH project components - MVAPICH (Oct-02, EOL Jul-15), OMB, MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, and MVAPICH2-Virt]

MVAPICH/MVAPICH2 Release Timeline and Downloads

[Chart: cumulative number of downloads (over 280,000) from Sep-04 through Mar-15, annotated with releases from MV 0.9.4 and MV2 0.9.0 through MV2 2.2a, MV2-X 2.2a, MV2-GDR 2.1rc2, MV2-Virt 2.1rc2, and MV2-MIC 2.0]

The MVAPICH2 Software Family (Cont.)

• Empowering many TOP500 clusters (July '15 ranking)
  – 8th ranked 519,640-core cluster (Stampede) at TACC
  – 11th ranked 185,344-core cluster (Pleiades) at NASA
  – 22nd ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
  – and many others
• Used by more than 2,450 organizations in 76 countries
• More than 282,000 downloads from the OSU site directly
• Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
• Empowering Top500 systems for over a decade
  – From System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
  – to Stampede at TACC (8th in Jun '15, 462,462 cores, 5.168 PFlops)

Usage Guidelines for the MVAPICH2 Software Family

Requirements -> MVAPICH2 library to use
MPI with IB, iWARP and RoCE -> MVAPICH2
Advanced MPI, PGAS and MPI+PGAS with IB and RoCE -> MVAPICH2-X
MPI with IB & GPU -> MVAPICH2-GDR
MPI with IB & MIC -> MVAPICH2-MIC
HPC Cloud with MPI & IB -> MVAPICH2-Virt

Strong Procedure for Design, Development and Release

• Research is done to explore new designs
• Designs are first presented in conference/journal publications
• Best-performing designs are incorporated into the codebase
• Rigorous QA procedure before making a release
  – Exhaustive unit testing
  – Various test procedures on a diverse range of platforms and interconnects
  – Performance tuning
  – Application-based evaluation
  – Evaluation on large-scale systems
• Even alpha and beta versions go through the above testing

MVAPICH2 2.2a

• Released on 08/18/2015
• Major Features and Enhancements
  – Based on MPICH-3.1.4
  – Support for backing on-demand UD CM information with shared memory to minimize memory footprint
  – Dynamic identification of the maximum read/atomic operations supported by the HCA
  – Support for intra-node communication in RoCE mode without shared memory
  – Updated to hwloc 1.11.0
  – Updated sm_20 kernel optimizations for MPI Datatypes
  – Automatic detection and tuning for the 24-core Haswell architecture
  – Enhanced startup performance
    • Support for PMI-2 based startup with SLURM
  – Checkpoint-Restart support with DMTCP (Distributed MultiThreaded CheckPointing)
  – Enhanced communication performance for small/medium message sizes
  – Support for linking with the Intel Trace Analyzer and Collector

One-way Latency: MPI over IB with MVAPICH2

[Charts: Small Message Latency and Large Message Latency (us) vs. Message Size (bytes) for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, and ConnectX-4-EDR; small-message latencies in the 1.05-1.26 us range]

TrueScale-QDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-DualFDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 back-to-back

Bandwidth: MPI over IB with MVAPICH2

[Charts: Unidirectional Bandwidth (peak ~12,465 MB/s) and Bidirectional Bandwidth (peak ~24,353 MB/s) vs. Message Size (bytes, 4 bytes to 1M) for the same four HCAs]

TrueScale-QDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-DualFDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR - 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 back-to-back

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

[Chart: intra-node latency (us) vs. message size (0 bytes to 1K) on Intel Ivy Bridge with the latest MVAPICH2 2.2a: ~0.18 us intra-socket, ~0.45 us inter-socket]
[Charts: intra-socket and inter-socket bandwidth (MB/s) vs. message size for CMA, shared-memory, and LiMIC channels; peaks of ~14,250 MB/s (intra-socket) and ~13,749 MB/s (inter-socket)]

MVAPICH2-X for Advanced MPI, Hybrid MPI + PGAS Applications

MPI, OpenSHMEM, UPC, CAF or Hybrid (MPI + PGAS) Applications

OpenSHMEM Calls CAF Calls UPC Calls MPI Calls

Unified MVAPICH2-X Runtime

InfiniBand, RoCE, iWARP

• Current model: separate runtimes for OpenSHMEM/UPC/CAF and MPI
  – Possible deadlock if both runtimes are not progressed
  – Consumes more network resources
• Unified communication runtime for MPI, UPC, OpenSHMEM, and CAF available with MVAPICH2-X 1.9 onwards!
  – http://mvapich.cse.ohio-state.edu

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on their communication characteristics (a minimal hybrid sketch follows the footnote below)
• Benefits:
  – Best of the Distributed Memory Computing Model (MPI)
  – Best of the Shared Memory Computing Model (PGAS)
• Exascale Roadmap*:
  – "Hybrid Programming is a practical way to program exascale systems"
[Diagram: an HPC Application composed of Kernel 1 (MPI), Kernel 2 (PGAS/MPI), Kernel 3 (MPI), ..., Kernel N (PGAS/MPI)]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
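To make the hybrid model concrete, the following is a minimal MPI + OpenSHMEM sketch. It assumes a unified runtime such as MVAPICH2-X, where MPI ranks and OpenSHMEM PEs refer to the same processes; the exact initialization requirements (whether both MPI_Init and start_pes are needed, and in which order) depend on the installation, so treat this as a sketch rather than a recipe.

```c
/* Hedged sketch of hybrid MPI + OpenSHMEM under an assumed unified runtime
 * (e.g., MVAPICH2-X) where MPI ranks and OpenSHMEM PEs coincide. */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    start_pes(0);                         /* OpenSHMEM 1.0-style start-up */

    int me = _my_pe();
    int npes = _num_pes();

    /* One-sided phase: put my PE id into my neighbor's symmetric heap. */
    int *sym = (int *) shmalloc(sizeof(int));
    *sym = -1;
    shmem_int_put(sym, &me, 1, (me + 1) % npes);
    shmem_barrier_all();

    /* Collective phase handled with MPI. */
    int sum = 0;
    MPI_Allreduce(sym, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (me == 0)
        printf("Sum of received PE ids: %d\n", sum);

    shfree(sym);
    MPI_Finalize();
    return 0;
}
```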

MVAPICH2-X 2.2a

• Released on 08/18/2015
• MVAPICH2-X 2.2a Feature Highlights
  – Based on MVAPICH2 2.2a, including MPI-3 features
    • Compliant with UPC 2.20.2, OpenSHMEM v1.0h and CAF 3.0.39
  – MPI (Advanced) Features
    • Support for the Dynamically Connected (DC) transport protocol
      - Available for pt-to-pt, RMA and collectives
      - Support for hybrid mode with RC/DC/UD/XRC
    • Support for Core-Direct based non-blocking collectives (see the sketch after this list)
      - Available for Ibcast, Ibarrier, Iscatter, Igather, Ialltoall and Iallgather
  – OpenSHMEM Features
    • Support for RoCE
    • Support for the Dynamically Connected (DC) transport protocol
  – UPC Features
    • Based on Berkeley UPC 2.20.2 (contains changes/additions in preparation for the upcoming UPC 1.3 specification)
    • Support for RoCE
    • Support for the Dynamically Connected (DC) transport protocol
  – CAF Features
    • Support for RoCE
    • Support for the Dynamically Connected (DC) transport protocol
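As an illustration of the non-blocking collective interface listed above, the following minimal MPI-3 sketch starts an MPI_Iallgather, overlaps it with independent work, and completes it with MPI_Wait. It is generic MPI code (the loop sizes are arbitrary), not MVAPICH2-X internals.

```c
/* Minimal MPI-3 non-blocking collective sketch: overlap MPI_Iallgather
 * with independent computation. Generic MPI code, not library-specific. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendval = rank;
    int *recvbuf = (int *) malloc(size * sizeof(int));

    MPI_Request req;
    MPI_Iallgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT,
                   MPI_COMM_WORLD, &req);

    /* Independent computation that can overlap with the collective. */
    double local = 0.0;
    for (int i = 0; i < 1000000; i++)
        local += (double) i * 0.5;

    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* collective now complete */

    if (rank == 0)
        printf("Gathered %d values; local work = %f\n", size, local);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```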

Minimizing Memory Footprint Further by DC Transport

• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by a "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Available with MVAPICH2-X 2.2a
[Diagram: DC connections among processes P0-P7 across Nodes 0-3 over the IB network]
[Charts: connection memory (KB) for Alltoall and normalized execution time for NAMD (Apoa1, large data set) vs. number of processes, comparing RC, DC-Pool, UD, and XRC]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14).

Application-Level Performance with Graph500 and Sort

[Charts: Graph500 execution time (s) vs. number of processes (4K, 8K, 16K) for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM); Sort execution time (s) for MPI vs. Hybrid at 500GB-512 through 4TB-4K (input data - no. of processes)]
• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI 2408 sec (0.16 TB/min); Hybrid 1172 sec (0.36 TB/min); 51% improvement over the MPI design
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.

GPU-Direct RDMA (GDR) with CUDA

• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect RDMA
  – Hybrid design using GPU-Direct RDMA
    • GPUDirect RDMA and host-based pipelining
    • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  – Support for communication using multi-rail
  – Support for Mellanox Connect-IB and ConnectX VPI adapters
  – Support for RoCE with Mellanox ConnectX VPI adapters
[Diagram: node layout with CPU, system memory, chipset, IB adapter, GPU and GPU memory (SNB E5-2670 / IVB E5-2680V2)]
P2P bandwidth: SNB E5-2670 - write 5.2 GB/s, read < 1.0 GB/s; IVB E5-2680V2 - write 6.4 GB/s, read 3.5 GB/s

MVAPICH2-GDR 2.1rc2

• Released on 06/24/2015
• Major Features and Enhancements
  – Based on MVAPICH2 2.1rc2
  – CUDA 7.0 compatibility
  – CUDA-Aware support for the MPI_Rsend and MPI_Irsend primitives
  – Parallel intra-node communication channels (shared memory for H-H and GDR for D-D)
  – Optimized H-H, H-D and D-H communication
  – Optimized intra-node D-D communication
  – Optimization and tuning for point-to-point and collective operations
  – Updated sm_20 kernel optimization for Datatype processing
  – Optimized design for GPU-based small message transfers
  – Added R3 support for GPU-based packetized transfers
  – Enhanced performance for small message host-to-device transfers
  – Support for MPI_Scan and MPI_Exscan collective operations from GPU buffers
  – Optimization of collectives with new copy designs
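The features above assume CUDA-aware MPI usage, i.e., passing GPU device pointers directly to MPI calls. The following minimal sketch shows that pattern; it is generic CUDA-aware MPI code (buffer size and message tag are illustrative), not MVAPICH2-GDR internals. With MVAPICH2-GDR such a program is typically run with CUDA support enabled at runtime, e.g., MV2_USE_CUDA=1 as listed among the runtime parameters later in this talk.

```c
/* Minimal CUDA-aware MPI sketch: pass device pointers directly to MPI.
 * Generic usage pattern; with MVAPICH2-GDR this is typically run with
 * MV2_USE_CUDA=1 (see the runtime parameters listed later). */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                 /* 1M floats (illustrative size) */
    float *d_buf;
    cudaMalloc((void **) &d_buf, n * sizeof(float));
    cudaMemset(d_buf, 0, n * sizeof(float));

    /* Device buffers go straight into MPI calls; the library moves the
     * data (GDR, pipelining, etc.) underneath. */
    if (rank == 0)
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```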

Performance of MVAPICH2-GDR with GPU-Direct-RDMA

[Charts: GPU-GPU internode small-message latency (2.15 usec with MV2-GDR 2.1RC2), uni-directional bandwidth, and bi-directional bandwidth vs. message size, comparing MV2-GDR 2.1RC2, MV2-GDR 2.0b, and MV2 without GDR]
Test setup: MVAPICH2-GDR 2.1RC2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

MPI3-RMA Performance of MVAPICH2-GDR with GPU-Direct-RDMA
GPU-GPU internode MPI_Put latency (RMA put operation, device to device)

MPI-3 RMA provides flexible synchronization and completion primitives
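As an example of that flexibility, the following minimal sketch uses passive-target synchronization (MPI_Win_lock_all, MPI_Put, MPI_Win_flush). It is generic MPI-3 code operating on host buffers for simplicity; the chart below measures the same kind of put operation between GPU device buffers with MVAPICH2-GDR.

```c
/* Minimal MPI-3 passive-target RMA sketch: lock_all / put / flush.
 * Generic MPI code with host buffers; not MVAPICH2-GDR internals. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *win_buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = 0;
    MPI_Barrier(MPI_COMM_WORLD);         /* windows are ready everywhere    */

    MPI_Win_lock_all(0, win);            /* passive-target epoch, all ranks */
    int value = rank + 1;
    int target = (rank + 1) % size;
    MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_flush(target, win);          /* put is now complete at target   */
    MPI_Win_unlock_all(win);
    MPI_Barrier(MPI_COMM_WORLD);

    /* Read the locally exposed value inside a local lock epoch so the
     * window memory is synchronized before the load. */
    MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
    printf("Rank %d window value: %d\n", rank, *win_buf);
    MPI_Win_unlock(rank, win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```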

[Chart: small-message MPI_Put latency vs. message size (0 bytes to 8K) for MV2-GDR 2.1RC2 vs. MV2-GDR 2.0b; 2.88 us with MV2-GDR 2.1RC2, about a 7X improvement]

Test setup: MVAPICH2-GDR 2.1RC2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA

MPI Applications on MIC Clusters

• MPI (+X) continues to be the predominant programming model in HPC
• Flexibility in launching MPI jobs on clusters with Xeon Phi
[Diagram: execution modes spanning multi-core centric (Xeon) to many-core centric (Xeon Phi):
  – Host-only: MPI program on the host
  – Offload (/reverse offload): MPI program on the host with offloaded computation on the coprocessor
  – Symmetric: MPI programs on both the host and the coprocessor
  – Coprocessor-only: MPI program on the coprocessor]

MVAPICH2-MIC Design for Clusters with IB and MIC

• Offload Mode
• Intranode Communication
  – Coprocessor-only Mode
  – Symmetric Mode
• Internode Communication
  – Coprocessor-only
  – Symmetric Mode
• Multi-MIC Node Configurations

MVAPICH2-MIC 2.0

• Released on 12/02/2014
• Major Features and Enhancements
  – Based on MVAPICH2 2.0.1
  – Support for native, symmetric and offload modes of MIC usage
  – Optimized intra-MIC communication using SCIF and shared-memory channels
  – Optimized intra-node Host-to-MIC communication using SCIF and IB channels
  – Enhanced mpirun_rsh to launch jobs in symmetric mode from the host
  – Support for proxy-based communication for inter-node transfers
    • Active-proxy, 1-hop and 2-hop designs (actively using the host CPU)
    • Passive-proxy (passively using the host CPU)
  – Support for MIC-aware MPI_Bcast()
  – Improved SCIF performance for pipelined communication
  – Optimized shared-memory communication performance for single-MIC jobs
  – Supports an explicit CPU-binding mechanism for MIC processes
  – Tuned pt-to-pt intra-MIC, intra-node, and inter-node transfers
  – Supports hwloc v1.9
• Running on three major systems
  – Stampede
  – Blueridge (Virginia Tech)
  – Beacon (UTK)

MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Charts: intra-socket and inter-socket P2P latency (large messages, 8K-2M bytes) and bandwidth (1 byte to 1M bytes); peak bandwidth of 5236 MB/sec (intra-socket) and 5594 MB/sec (inter-socket)]

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Charts: 32-node Allgather latency (16H + 16M small messages; 8H + 8M large messages) and 32-node Alltoall large-message latency for MV2-MIC vs. MV2-MIC-Opt, with improvements of 58%, 76% and 55%; P3DFFT communication/computation time on 32 nodes (8H + 8M), size 2K*2K*1K, for MV2-MIC-Opt vs. MV2-MIC]
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014.

MVAPICH2-Virt 2.1rc2

• Enables high-performance communication in virtualized environments
  – SR-IOV based communication for inter-node MPI communication
  – Inter-VM Shared Memory (IVSHMEM) based communication for intra-node, inter-VM MPI communication
  – http://mvapich.cse.ohio-state.edu

MVAPICH2-Virt 2.1rc2

• Released on 06/26/2015
• Major Features and Enhancements
  – Based on MVAPICH2 2.1rc2
  – Support for efficient MPI communication over SR-IOV enabled InfiniBand networks
  – High-performance and locality-aware MPI communication with IVSHMEM
  – Support for IVSHMEM device auto-detection in virtual machines
  – Automatic communication channel selection among SR-IOV, IVSHMEM, and CMA/LiMIC2
  – Support for easy configuration through runtime parameters
  – Tested with Mellanox InfiniBand adapters (ConnectX-3, 56 Gbps)

Application-Level Performance (8 VM * 8 Core/VM)

[Charts: execution time of NAS (Class C: FT-64-C, EP-64-C, LU-64-C, BT-64-C) and Graph500 (problem sizes (scale, edgefactor) 20,10 through 22,20) for MV2-SR-IOV-Def, MV2-SR-IOV-Opt and MV2-Native]

• Compared to Native, 1-9% overhead for NAS
• Compared to Native, 4-9% overhead for Graph500

OSU Micro-Benchmarks (OMB)

• Started in 2004 and continuing steadily
• Allows MPI developers and users to test and evaluate MPI libraries
• Has a wide range of benchmarks
  – Two-sided (MPI-1, MPI-2 and MPI-3)
  – One-sided (MPI-2 and MPI-3)
  – RMA (MPI-3)
  – Collectives (MPI-1, MPI-2 and MPI-3)
  – Extensions for GPU-aware communication (CUDA and OpenACC)
  – UPC (pt-to-pt)
  – OpenSHMEM (pt-to-pt and collectives)
  – Startup
• Widely used in the MPI community
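For readers unfamiliar with what these benchmarks measure, here is a minimal ping-pong latency sketch in the spirit of osu_latency. It is an illustrative reimplementation, not the actual OMB source; the message size and iteration count are arbitrary.

```c
/* Illustrative ping-pong latency measurement (in the spirit of osu_latency,
 * not the actual OMB source). Run with exactly 2 processes. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000
#define MSG_SIZE 8      /* bytes; illustrative */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[MSG_SIZE] = {0};
    MPI_Barrier(MPI_COMM_WORLD);

    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    /* One-way latency = round-trip time / 2, averaged over iterations. */
    if (rank == 0)
        printf("%d-byte one-way latency: %.2f us\n",
               MSG_SIZE, elapsed * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```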

OSU Microbenchmarks v5.0

• OSU Micro-Benchmarks 5.0 (08/18/15)
• Non-blocking collective benchmarks
  – osu_iallgather, osu_ialltoall, osu_ibarrier, osu_ibcast, osu_igather, and osu_iscatter
  – Benchmarks can display the amount of achievable overlap
    • Overlap is defined as the amount of computation that can be performed while the communication progresses in the background (see the sketch below)
    • Additional option "-t" sets the number of MPI_Test() calls made during the dummy computation (set it to 100, 1000, or any number > 0)
• Startup benchmarks
  – osu_init: measures the time taken for an MPI library to complete MPI_Init
  – osu_hello: measures the time taken for an MPI library to complete a simple hello world MPI program
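The following minimal sketch shows the overlap-measurement idea behind these benchmarks: start a non-blocking collective, poll it with MPI_Test a fixed number of times during dummy computation, and finish with MPI_Wait. It is an illustrative pattern, not the actual osu_ibcast implementation; buffer size and the number of test calls are arbitrary.

```c
/* Illustrative overlap pattern behind the non-blocking collective benchmarks:
 * progress an MPI_Ibcast with MPI_Test calls during dummy computation.
 * Not the actual OMB implementation. */
#include <mpi.h>
#include <stdio.h>

#define CALLS 100            /* number of MPI_Test() calls (like "-t") */
#define COUNT 4096           /* illustrative message size in ints */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf[COUNT];
    for (int i = 0; i < COUNT; i++)
        buf[i] = rank;

    MPI_Request req;
    MPI_Ibcast(buf, COUNT, MPI_INT, 0, MPI_COMM_WORLD, &req);

    /* Dummy computation, interleaved with MPI_Test so the library gets a
     * chance to progress the collective in the "background". */
    double work = 0.0;
    int done = 0;
    for (int c = 0; c < CALLS; c++) {
        for (int i = 0; i < 10000; i++)
            work += i * 1e-6;
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("bcast complete, dummy work = %f\n", work);

    MPI_Finalize();
    return 0;
}
```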

Overview of OSU INAM

• OSU INAM monitors IB clusters in real time by querying various subnet management entities in the network
• Major features of the OSU INAM tool include:
  – Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
  – Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (pt-to-pt, collectives and RMA)
  – Remotely monitor CPU utilization of MPI processes at user-specified granularity
  – Visualize data transfers as they happen ("live" view) for:
    • the entire network (live network-level view)
    • a particular job (live job-level view)
    • one or multiple nodes (live node-level view)
  – Capability to visualize data transfers that happened in the network during a time duration in the past (historical view) for:
    • the entire network (historical network-level view)
    • a particular job (historical job-level view)
    • one or multiple nodes (historical node-level view)

OSU INAM – Network Level View

[Screenshots: Full Network (152 nodes); Zoomed-in View of the Network]

• Show the network topology of large clusters
• Visualize traffic patterns on different links
• Quickly identify congested links and links in an error state
• See the history unfold: play back the historical state of the network

OSU INAM – Job and Node Level Views

[Screenshots: Visualizing a Job (5 Nodes); Finding Routes Between Nodes]

• Job-level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node-level view provides details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g. XmitDiscard, RcvError) per rank/node

MVAPICH2-EA & OEMT

• MVAPICH2-EA (Energy-Aware)
  – A white-box approach
  – New energy-efficient communication protocols for pt-pt and collective operations
  – Intelligently applies the appropriate energy-saving techniques
  – Application-oblivious energy saving
• OEMT
  – A library utility to measure energy consumption for MPI applications
  – Works with all MPI runtimes
  – PRELOAD option for precompiled applications
  – Does not require ROOT permission:
    • A safe kernel module to read only a subset of MSRs

MV2-EA: Application-Oblivious Energy-Aware MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy-reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

MVAPICH2 – Plans for Exascale

• Performance and memory scalability toward 500K-1M cores
  – Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features
  – User Mode Memory Registration (UMR)
  – On-demand paging
• Enhanced inter-node and intra-node communication schemes for upcoming OmniPath-enabled Knights Landing architectures
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware point-to-point (one-sided and two-sided) and collectives
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended checkpoint-restart and migration support with SCR

Web Pointers

NOWLAB Web Page: http://nowlab.cse.ohio-state.edu

MVAPICH Web Page: http://mvapich.cse.ohio-state.edu
