Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach

Talk at OpenFabrics Workshop (April 2016) by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Parallel Programming Models Overview

[Figure: three abstract machine models side by side – Shared Memory Model (SHMEM, DSM), Distributed Memory Model (MPI, Message Passing Interface), and Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, ...), each with processes P1-P3 and their logical view of memory]

• Programming models provide abstract machine models
• Models can be mapped on different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and Hybrid MPI+PGAS models are gradually receiving importance

Partitioned Global Address Space (PGAS) Models

• Key features
  – Simple shared memory abstractions
  – Lightweight one-sided communication (see the sketch below)
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
  – Libraries: OpenSHMEM, UPC++, Global Arrays
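To make the lightweight one-sided communication point concrete, here is a minimal OpenSHMEM sketch (not from the talk; it assumes an OpenSHMEM implementation that provides the shmem_init/shmem_finalize interface, and the MAX_PES bound is an arbitrary illustration): every PE deposits a value directly into PE 0's memory with a put, and no matching receive is ever posted on the target.

    #include <shmem.h>
    #include <stdio.h>

    #define MAX_PES 1024   /* arbitrary bound for this illustration */

    int main(void) {
        static long vals[MAX_PES];   /* symmetric: same address on every PE */

        shmem_init();
        int me   = shmem_my_pe();
        int npes = shmem_n_pes();

        long my_val = (long) me * 10;

        /* One-sided put: deposit my value directly into PE 0's copy of
           vals[me]; PE 0 issues no matching receive call. */
        shmem_long_put(&vals[me], &my_val, 1, 0);

        shmem_barrier_all();         /* ensure all puts are complete and visible */

        if (me == 0) {
            for (int i = 0; i < npes && i < MAX_PES; i++)
                printf("vals[%d] = %ld\n", i, vals[i]);
        }

        shmem_finalize();
        return 0;
    }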

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics (see the skeleton after the roadmap citation below)
• Benefits:
  – Best of Distributed Computing Model (MPI)
  – Best of Shared Memory Computing Model (PGAS)
• Exascale Roadmap*:
  – "Hybrid Programming is a practical way to program exascale systems"

[Figure: an HPC application decomposed into kernels (Kernel 1, Kernel 2, Kernel 3, ..., Kernel N), with individual kernels implemented in MPI or PGAS as appropriate]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
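The kernel-by-kernel split can be sketched in code. The skeleton below is illustrative only (kernel names, the data, and the choice of OpenSHMEM as the PGAS side are assumptions, and the initialization order of the two models should follow the runtime's documentation): a kernel with regular, global communication uses an MPI collective, while an irregular kernel uses one-sided OpenSHMEM puts, all inside a single executable.

    /* Hybrid MPI + OpenSHMEM skeleton (illustrative; initialization order is an
       assumption, check the runtime's documentation for the supported pattern). */
    #include <mpi.h>
    #include <shmem.h>

    #define N 4

    static double halo[N];          /* symmetric buffer for one-sided exchange */

    static void kernel_regular(double *local_sum) {
        /* Kernel with regular, global communication: MPI collective fits best */
        double mine = *local_sum;
        MPI_Allreduce(&mine, local_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    }

    static void kernel_irregular(int me, int npes) {
        /* Kernel with irregular, one-sided updates: PGAS (OpenSHMEM) fits best */
        double val = (double) me;
        int neighbor = (me + 1) % npes;
        shmem_double_put(&halo[0], &val, 1, neighbor);  /* no receive on target */
        shmem_barrier_all();
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        shmem_init();

        int me = shmem_my_pe(), npes = shmem_n_pes();
        double sum = 1.0;

        kernel_regular(&sum);          /* MPI part  */
        kernel_irregular(me, npes);    /* PGAS part */

        shmem_finalize();
        MPI_Finalize();
        return 0;
    }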

Overview of the MVAPICH2 Project

• High Performance open-source MPI Library for InfiniBand, 10-40Gig/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for Virtualization (MVAPICH2-Virt), available since 2015
  – Support for Energy-Awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,550 organizations in 79 countries
  – More than 360,000 (> 0.36 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th ranked 519,640-core cluster (Stampede) at TACC
    • 13th ranked 185,344-core cluster (Pleiades) at NASA
    • 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    • and many others
  – Available with software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

MVAPICH2 Architecture

[Architecture diagram]
High Performance Parallel Programming Models:
  – Message Passing Interface (MPI)
  – PGAS (UPC, OpenSHMEM, CAF, UPC++)
  – Hybrid --- MPI + X (MPI + PGAS + OpenMP/...)

High Performance and Scalable Communication Runtime with diverse APIs and mechanisms:
  Point-to-point Primitives, Collectives Algorithms, Job Startup, Remote Memory Access, Energy-Awareness, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, OmniPath) and Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPower, Xeon-Phi (MIC, KNL*), NVIDIA GPGPU):
  – Transport Protocols: RC, XRC, UD, DC
  – Modern IB Features: SR-IOV, Multi-Rail, UMR, ODP*
  – Transport Mechanisms: Shared Memory, CMA, IVSHMEM
  – Modern Node Features: MCDRAM*, NVLink*, CAPI*

* Upcoming

MVAPICH2 Software Family

Requirements                                                    MVAPICH2 Library to use
MPI with IB, iWARP and RoCE                                     MVAPICH2
Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE      MVAPICH2-X
MPI with IB & GPU                                               MVAPICH2-GDR
MPI with IB & MIC                                               MVAPICH2-MIC
HPC Cloud with MPI & IB                                         MVAPICH2-Virt
Energy-aware MPI with IB, iWARP and RoCE                        MVAPICH2-EA

MVAPICH2 2.2rc1

• Released on 03/30/2016
• Major Features and Enhancements
  – Based on MPICH-3.1.4
  – Support for OpenPower architecture
    • Optimized inter-node and intra-node communication
  – Support for Intel Omni-Path architecture
    • Thanks to Intel for contributing the patch
    • Introduction of a new PSM2 channel for Omni-Path
  – Support for RoCEv2
  – Architecture detection for the PSC Bridges system with Omni-Path
  – Enhanced startup performance and reduced memory footprint for storing InfiniBand end-point information with SLURM
    • Support for shared memory based PMI operations
    • Availability of an updated patch from the MVAPICH project website with this support for SLURM installations
  – Optimized pt-to-pt and collective tuning for Chameleon InfiniBand systems at TACC/UoC
  – Enable affinity by default for TrueScale (PSM) and Omni-Path (PSM2) channels
  – Enhanced tuning for shared-memory based MPI_Bcast
  – Enhanced debugging support and error messages
  – Update to hwloc version 1.11.2

MVAPICH2-X 2.2rc1

• Released on 03/30/2016
• Introducing UPC++ Support
  – Based on Berkeley UPC++ v0.1
  – Introduce UPC++ level support for the new scatter collective operation (upcxx_scatter)
  – Optimized UPC++ collectives (improved performance for upcxx_reduce, upcxx_bcast, upcxx_gather, upcxx_allgather, upcxx_alltoall)
• MPI Features
  – Based on MVAPICH2 2.2rc1 (OFA-IB-CH3 interface)
  – Support for OpenPower, Intel Omni-Path, and RoCE v2
• UPC Features
  – Based on GASNet v1.26
  – Support for OpenPower, Intel Omni-Path, and RoCE v2
• OpenSHMEM Features
  – Support for OpenPower and RoCE v2
• CAF Features
  – Support for RoCE v2
• Hybrid Program Features
  – Introduce support for hybrid MPI+UPC++ applications
  – Support the OpenPower architecture for hybrid MPI+UPC and MPI+OpenSHMEM applications
• Unified Runtime Features
  – Based on MVAPICH2 2.2rc1 (OFA-IB-CH3 interface). All the runtime features enabled by default in the OFA-IB-CH3 and OFA-IB-RoCE interfaces of MVAPICH2 2.2rc1 are available in MVAPICH2-X 2.2rc1
  – Introduce support for UPC++ and MPI+UPC++ programming models
• Support for the OSU InfiniBand Network Analysis and Management (OSU INAM) Tool v0.9
  – Capability to profile and report the process-to-node communication matrix for MPI processes at user-specified granularity in conjunction with OSU INAM
  – Capability to classify data flowing over a network link at job-level and process-level granularity in conjunction with OSU INAM

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  – Support for advanced IB mechanisms (UMR and ODP)
  – Extremely minimal memory footprint
• Collective communication
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)

Latency & Bandwidth: MPI over IB with MVAPICH2

[Charts: Small Message Latency and Unidirectional Bandwidth vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-Dual FDR, and ConnectX-4-EDR; small-message latencies fall in the 0.95-1.26 us range (0.95 us on ConnectX-4-EDR), and peak unidirectional bandwidths reach roughly 3,387 MB/s (TrueScale-QDR), 6,356 MB/s (ConnectX-3-FDR), 12,104 MB/s (ConnectIB-Dual FDR), and 12,465 MB/s (ConnectX-4-EDR)]

TrueScale-QDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-3-FDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectIB-Dual FDR: 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch
ConnectX-4-EDR: 2.8 GHz Deca-core (Haswell) Intel PCI Gen3 Back-to-back
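Latency numbers like these come from ping-pong microbenchmarks in the style of the OSU benchmarks (osu_latency). The sketch below is not the benchmark itself; it assumes exactly two ranks and an arbitrary iteration count, and reports one-way latency as half of the averaged round-trip time.

    #include <mpi.h>
    #include <stdio.h>

    #define ITERS    10000
    #define MSG_SIZE 1          /* small message, in bytes */

    int main(int argc, char **argv) {
        char buf[MSG_SIZE];
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {            /* ping */
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* pong */
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t1 = MPI_Wtime();
        if (rank == 0)  /* one-way latency = half of the average round-trip time */
            printf("latency = %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

        MPI_Finalize();
        return 0;
    }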

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))

• Latest MVAPICH2 2.2rc1, Intel Ivy-bridge

[Chart: two-sided latency for 1 byte to 1 KB messages; intra-socket latency is about 0.18 us and inter-socket latency about 0.45 us]

[Charts: intra-socket and inter-socket bandwidth for the CMA, Shmem, and LiMIC channels as a function of message size; peak bandwidth reaches about 14,250 MB/s intra-socket and 13,749 MB/s inter-socket]

User-mode Memory Registration (UMR)

• Introduced by Mellanox to support direct local and remote noncontiguous memory access
  – Avoids packing at the sender and unpacking at the receiver (illustrated with an MPI datatype example below)
• Available since MVAPICH2-X 2.2b

[Charts: small & medium message latency and large message latency (4K to 16M bytes) comparing UMR with the Default scheme]

Connect-IB (54 Gbps): 2.8 GHz Dual Ten-core (IvyBridge) Intel PCI Gen3 with Mellanox IB FDR switch

M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015
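UMR targets the kind of noncontiguous transfer that applications express with MPI derived datatypes. The sketch below is plain, standard MPI (nothing MVAPICH2-specific, and the matrix dimensions are arbitrary): it sends one column of a row-major matrix with MPI_Type_vector, exactly the strided access pattern a UMR-capable library can hand to the HCA directly instead of packing it into a contiguous staging buffer first.

    #include <mpi.h>
    #include <stdio.h>

    #define ROWS 8
    #define COLS 8

    int main(int argc, char **argv) {
        double matrix[ROWS][COLS];
        double column[ROWS];
        int rank;
        MPI_Datatype col_type;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* One column of a row-major matrix: ROWS blocks of 1 double,
           each separated by a stride of COLS doubles (noncontiguous). */
        MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &col_type);
        MPI_Type_commit(&col_type);

        if (rank == 0) {
            for (int i = 0; i < ROWS; i++)
                for (int j = 0; j < COLS; j++)
                    matrix[i][j] = i * COLS + j;
            /* Send column 2 without packing it manually */
            MPI_Send(&matrix[0][2], 1, col_type, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(column, ROWS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received column[3] = %g\n", column[3]);
        }

        MPI_Type_free(&col_type);
        MPI_Finalize();
        return 0;
    }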

On-Demand Paging (ODP)

• Introduced by Mellanox to support direct remote memory access without pinning (see the verbs-level sketch below)
• Memory regions are paged in/out dynamically by the HCA/OS
• Size of registered buffers can be larger than physical memory
• Will be available in a future MVAPICH2 release

[Charts: Graph500 BFS kernel execution time (s) and Graph500 pin-down buffer sizes (MB) for Pin-down vs. ODP at 16, 32, and 64 processes]

Connect-IB (54 Gbps): 2.6 GHz Dual Octa-core (SandyBridge) Intel PCI Gen3 with Mellanox IB FDR switch
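At the verbs level, ODP is requested when a memory region is registered. The fragment below is only a hedged illustration of that idea (it hard-codes the first HCA, trims error handling, and assumes a libibverbs/HCA combination that supports the IBV_ACCESS_ON_DEMAND flag); it is not how MVAPICH2 enables ODP internally.

    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);  /* first HCA */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* A large buffer; with ODP the HCA faults pages in on access instead of
           requiring the whole region to be pinned up front. */
        size_t len = 1UL << 30;             /* 1 GiB */
        void *buf = malloc(len);

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_ON_DEMAND);  /* request ODP */
        if (!mr)
            fprintf(stderr, "ODP registration not supported on this setup\n");
        else
            printf("registered %zu bytes with ODP, lkey=0x%x\n", len, mr->lkey);

        if (mr) ibv_dereg_mr(mr);
        free(buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }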

Minimizing Memory Footprint by Direct Connect (DC) Transport

• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Available since MVAPICH2-X 2.2a

[Figure: four nodes (Node 0-3) hosting processes P0-P7 connected through an IB network using DC transport]
[Charts: connection memory (KB) for Alltoall with RC, DC-Pool, UD, and XRC at 80-640 processes, and normalized execution time of NAMD (apoa1, large data set) at 160-620 processes]

H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication – Offload and Non-blocking
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)

Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload Co-Direct Hardware (Available since MVAPICH2-X 2.2a)

[Charts: normalized performance of modified P3DFFT with Offload-Alltoall vs. the default version for data sizes 512-800; application run-time (s) of HPL-Offload, HPL-1ring, and HPL-Host vs. HPL problem size (N) as % of total memory; run-time (s) of PCG-Default vs. Modified-PCG-Offload at 64-512 processes]

• Modified P3DFFT with Offload-Alltoall does up to 17% better than the default version (128 processes)
• Modified HPL with Offload-Bcast does up to 4.5% better than the default version (512 processes)
• Modified Pre-Conjugate Gradient Solver with Offload-Allreduce does up to 21.8% better than the default version

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS '12
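The gains above come from overlapping application work with a collective that is still in flight. As a generic illustration using standard MPI-3 non-blocking collectives (this is not the offload-specific code from the cited studies, and the buffer sizes are arbitrary), an MPI_Ialltoall is started, independent computation proceeds, and the collective is completed with MPI_Wait before its result is used.

    #include <mpi.h>
    #include <stdlib.h>

    /* Placeholder for the application's independent computation */
    static void do_local_compute(double *data, int n) {
        for (int i = 0; i < n; i++)
            data[i] = data[i] * 2.0 + 1.0;
    }

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int count = 1024;                       /* elements per destination rank */
        double *sendbuf = malloc(sizeof(double) * count * size);
        double *recvbuf = malloc(sizeof(double) * count * size);
        double *work    = malloc(sizeof(double) * count);
        for (int i = 0; i < count * size; i++) sendbuf[i] = rank;
        for (int i = 0; i < count; i++)         work[i] = i;

        MPI_Request req;
        /* Start the all-to-all exchange without blocking */
        MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                      recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

        /* Overlap: compute on data that does not depend on the exchange */
        do_local_compute(work, count);

        /* Complete the collective before touching recvbuf */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        free(sendbuf); free(recvbuf); free(work);
        MPI_Finalize();
        return 0;
    }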

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication – Offload and Non-blocking
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)

MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications

[Figure: MPI, OpenSHMEM, UPC, CAF, UPC++ or Hybrid (MPI + PGAS) applications issue OpenSHMEM calls, CAF calls, UPC calls, UPC++ calls, and MPI calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE, and iWARP]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, UPC++ available with MVAPICH2-X 1.9 onwards! (since 2012)
• UPC++ support is available in MVAPICH2-X 2.2RC1
• Feature Highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
  – Scalable inter-node and intra-node communication – point-to-point and collectives

Application Level Performance with Graph500 and Sort

[Charts: Graph500 execution time (s) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid (MPI+OpenSHMEM) at 4K, 8K, and 16K processes; Sort execution time (s) for MPI vs. Hybrid for input data / process counts from 500GB-512 to 4TB-4K]

• Performance of Hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of Hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size: MPI – 2408 sec (0.16 TB/min); Hybrid – 1172 sec (0.36 TB/min); 51% improvement over the MPI design

J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication – Offload and Non-blocking
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)

GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU

• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from GPU with RDMA transfers

At Sender:
    MPI_Send(s_devbuf, size, …);
At Receiver:
    MPI_Recv(r_devbuf, size, …);
(buffers reside in GPU device memory; the data movement is handled inside MVAPICH2)

High Performance and High Productivity
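Putting the fragment above into a complete program: the hedged sketch below assumes a CUDA-aware MPI build and exactly two ranks, and passes cudaMalloc'd pointers directly to MPI_Send/MPI_Recv; the library recognizes the device memory through UVA and handles the staging or RDMA pipeline internally.

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    #define N (1 << 20)   /* number of floats */

    int main(int argc, char **argv) {
        int rank;
        float *devbuf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cudaMalloc((void **)&devbuf, N * sizeof(float));   /* GPU device memory */

        if (rank == 0) {
            cudaMemset(devbuf, 0, N * sizeof(float));
            /* Device pointer passed directly to MPI; no explicit cudaMemcpy
               to a host staging buffer is needed with a CUDA-aware MPI. */
            MPI_Send(devbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(devbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d floats into GPU memory\n", N);
        }

        cudaFree(devbuf);
        MPI_Finalize();
        return 0;
    }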

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

• Support for MPI communication from NVIDIA GPU device memory
• High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers

Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Charts: GPU-GPU inter-node latency, bandwidth, and bi-directional bandwidth vs. message size for MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR; MV2-GDR 2.2b reaches 2.18 us small-message latency and shows roughly 2X-11X improvements over MV2 without GDR and MV2-GDR 2.0b across the three metrics]

MVAPICH2-GDR-2.2b
Intel Ivy Bridge (E5-2680 v2) node - 20 cores
NVIDIA Tesla K40c GPU
Mellanox Connect-IB Dual-FDR HCA
CUDA 7
Mellanox OFED 2.4 with GPU-Direct-RDMA

Application-Level Evaluation (HOOMD-blue)

[Charts: average time steps per second (TPS) for 64K and 256K particles at 4-32 processes; MV2+GDR delivers about 2X higher TPS than MV2]

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HoomdBlue Version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384



Exploiting GDR for OpenSHMEM

• Introduced CUDA-aware OpenSHMEM
• GDR for small/medium message sizes
• Host-staging for large messages to avoid PCIe bottlenecks
• Hybrid design brings the best of both
• 3.13 us Put latency for 4B (7X improvement) and 4.7 us latency for 4KB
• 9X improvement for Get latency of 4B

[Charts: small-message shmem_put D-D and shmem_get D-D latency (us) for Host-Pipeline vs. GDR at 1 byte to 4 KB (7X and 9X improvements); weak-scaling evolution time (msec) of GPULBM (64x64x64) with Host-Pipeline vs. Enhanced-GDR on 8-64 GPU nodes (45% improvement)]

K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C. Ching and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters, IEEE Cluster 2015. Also accepted for a special issue of the journal PARCO.
Will be available in a future MVAPICH2-X release.

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication – Offload and Non-blocking
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)

MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

• Offload Mode
• Intranode Communication
  – Coprocessor-only and Symmetric Mode
• Internode Communication
  – Coprocessors-only and Symmetric Mode
• Multi-MIC Node Configurations
• Running on three major systems
  – Stampede, Blueridge (Virginia Tech) and Beacon (UTK)

MIC-Remote-MIC P2P Communication with Proxy-based Communication

[Charts: intra-socket and inter-socket P2P latency (large messages, 8K-2M bytes) and bandwidth (1 byte to 1M bytes); peak bandwidth is about 5,236 MB/s intra-socket and 5,594 MB/s inter-socket (lower latency and higher bandwidth are better)]

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Charts: 32-node Allgather small-message latency (16H + 16M) and large-message latency (8H + 8M) for MV2-MIC vs. MV2-MIC-Opt (58% and 76% improvements); 32-node Alltoall large-message latency (8H + 8M) for MV2-MIC vs. MV2-MIC-Opt (55% improvement); P3DFFT communication and computation time on 32 nodes (8H + 8M), size = 2K*2K*1K]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication – Offload and Non-blocking
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)

Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

• MVAPICH2-EA 2.1 (Energy-Aware)
  – A white-box approach
  – New energy-efficient communication protocols for pt-pt and collective operations
  – Intelligently applies the appropriate energy-saving techniques
  – Application-oblivious energy saving

• OEMT
  – A library utility to measure energy consumption for MPI applications
  – Works with all MPI runtimes
  – PRELOAD option for precompiled applications
  – Does not require ROOT permission: a safe kernel module to read only a subset of MSRs

MVAPICH2-EA: Application Oblivious Energy-Aware-MPI (EAM)

• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies the energy reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

• Scalability for million to billion processors
• Collective communication – Offload and Non-blocking
• Unified Runtime for Hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Integrated Support for GPGPUs
• Integrated Support for MICs
• Energy-Awareness
• InfiniBand Network Analysis and Monitoring (INAM)

Overview of OSU INAM

• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
  – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
  – http://mvapich.cse.ohio-state.edu/userguide/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (Point-to-Point, Collectives and RMA)
• Ability to filter data based on type of counters using a "drop down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job, or a set of nodes

OSU INAM – Network Level View

[Screenshots: Full Network (152 nodes); Zoomed-in View of the Network]

• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links / links in error state
• See the history unfold – play back historical state of the network

OSU INAM – Job and Node Level Views

[Screenshots: Visualizing a Job (5 Nodes); Finding Routes Between Nodes]

• Job level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node level view provides details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g. XmitDiscard, RcvError) per rank/node

MVAPICH2 – Plans for Exascale

• Performance and memory scalability toward 1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF ...)
  – MPI + Task*
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
  – On-Demand Paging (ODP)*
  – Switch-IB2 SHArP*
  – GID-based support*
• Enhanced communication schemes for upcoming architectures
  – Knights Landing with MCDRAM*
  – NVLINK*
  – CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for MPI Tools Interface (as in MPI 3.0) (see the sketch below)
• Extended Checkpoint-Restart and migration support with SCR
• Support for * features will be available in future MVAPICH2 releases
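For the MPI Tools Interface item above, the hedged sketch below shows the standard MPI_T calls a tool (or an OSU INAM-style monitor) can use to discover how many control and performance variables an MPI library such as MVAPICH2 exposes; the specific variables reported are implementation-dependent.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, num_cvar = 0, num_pvar = 0;

        /* The MPI_T interface may be initialized independently of MPI_Init */
        MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
        MPI_Init(&argc, &argv);

        /* Ask the library how many control and performance variables it exposes */
        MPI_T_cvar_get_num(&num_cvar);
        MPI_T_pvar_get_num(&num_pvar);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("MPI_T: %d control variables, %d performance variables\n",
                   num_cvar, num_pvar);

        MPI_Finalize();
        MPI_T_finalize();
        return 0;
    }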


Funding Acknowledgments

Funding Support by

Equipment Support by

Personnel Acknowledgments

Current Students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), K. Kulkarni (M.S.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current Research Scientists: H. Subramoni, X. Lu
Current Senior Research Associate: K. Hamidouche
Current Post-Docs: J. Lin, D. Banerjee
Current Programmer: J. Perkins
Current Research Specialist: M. Arnold

Past Students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Past Research Scientist: S. Sur
Past Programmers: D. Bureddy

Thank You! [email protected]

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/
