Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach
Talk at OpenFabrics Workshop (April 2016) by
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Parallel Programming Models Overview
[Diagram: three programming models – Shared Memory (P1-P3 over one shared memory), Distributed Memory (P1-P3 each with private memory), and Partitioned Global Address Space (PGAS: logical shared memory over distributed memories); examples: SHMEM/DSM; MPI (Message Passing Interface); Global Arrays, UPC, Chapel, X10, CAF, …]

• Programming models provide abstract machine models
• Models can be mapped onto different types of systems
  – e.g. Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance

Network Based Computing Laboratory, OpenFabrics-MVAPICH2 (April '16)

Partitioned Global Address Space (PGAS) Models
• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
  – Libraries: OpenSHMEM, UPC++, Global Arrays
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI/PGAS based on communication characteristics
• Benefits:
  – Best of the distributed computing model (MPI)
  – Best of the shared-memory computing model (PGAS)
• Exascale Roadmap*: "Hybrid programming is a practical way to program exascale systems"

[Diagram: an HPC application whose kernels (Kernel 1 … Kernel N) are individually implemented in MPI or PGAS]

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420
Overview of the MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, 10/40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Used by more than 2,550 organizations in 79 countries
  – More than 360,000 (> 0.36 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Nov '15 ranking)
    • 10th-ranked 519,640-core cluster (Stampede) at TACC
    • 13th-ranked 185,344-core cluster (Pleiades) at NASA
    • 25th-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)

MVAPICH2 Architecture
[Architecture diagram]
• High-performance parallel programming models: Message Passing Interface (MPI); PGAS (UPC, OpenSHMEM, CAF, UPC++); Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
• High-performance and scalable communication runtime with diverse APIs and mechanisms: point-to-point primitives, collective algorithms, job startup, energy-awareness, remote memory access, I/O and file systems, fault tolerance, virtualization, active messages, introspection & analysis
• Support for modern networking technology (InfiniBand, iWARP, RoCE, OmniPath) and modern multi-/many-core architectures (Intel Xeon, OpenPower, Xeon Phi (MIC, KNL*), NVIDIA GPGPU)
  – Transport protocols: RC, XRC, UD, DC
  – Modern IB features: SR-IOV, multi-rail, UMR, ODP*
  – Transport mechanisms: shared memory, CMA, IVSHMEM
  – Modern architecture features: MCDRAM*, NVLink*, CAPI*

* Upcoming

MVAPICH2 Software Family
Requirements and the MVAPICH2 library to use:
• MPI with IB, iWARP and RoCE: MVAPICH2
• Advanced MPI, OSU INAM, PGAS and MPI+PGAS with IB and RoCE: MVAPICH2-X
• MPI with IB & GPU: MVAPICH2-GDR
• MPI with IB & MIC: MVAPICH2-MIC
• HPC cloud with MPI & IB: MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP and RoCE: MVAPICH2-EA
MVAPICH2 2.2rc1
• Released on 03/30/2016
• Major features and enhancements
  – Based on MPICH-3.1.4
  – Support for the OpenPower architecture
    • Optimized inter-node and intra-node communication
  – Support for the Intel Omni-Path architecture
    • Thanks to Intel for contributing the patch
    • Introduction of a new PSM2 channel for Omni-Path
  – Support for RoCEv2
  – Architecture detection for the PSC Bridges system with Omni-Path
  – Enhanced startup performance and reduced memory footprint for storing InfiniBand endpoint information with SLURM
    • Support for shared-memory-based PMI operations
    • An updated patch with this support for SLURM installations is available from the MVAPICH project website
  – Optimized point-to-point and collective tuning for the Chameleon InfiniBand systems at TACC/UoC
  – Affinity enabled by default for the TrueScale (PSM) and Omni-Path (PSM2) channels
  – Enhanced tuning for shared-memory-based MPI_Bcast
  – Enhanced debugging support and error messages
  – Update to hwloc version 1.11.2

MVAPICH2-X 2.2rc1
• Released on 03/30/2016
• Introducing UPC++ support
  – Based on Berkeley UPC++ v0.1
  – UPC++-level support for the new scatter collective operation (upcxx_scatter)
  – Optimized UPC++ collectives (improved performance for upcxx_reduce, upcxx_bcast, upcxx_gather, upcxx_allgather, upcxx_alltoall)
• MPI features
  – Based on MVAPICH2 2.2rc1 (OFA-IB-CH3 interface)
  – Support for OpenPower, Intel Omni-Path, and RoCE v2
• UPC features
  – Based on GASNet v1.26
  – Support for OpenPower, Intel Omni-Path, and RoCE v2
• OpenSHMEM features
  – Support for OpenPower and RoCE v2
• CAF features
  – Support for RoCE v2
• Hybrid program features
  – Support for hybrid MPI+UPC++ applications
  – Support for the OpenPower architecture in hybrid MPI+UPC and MPI+OpenSHMEM applications
• Unified runtime features
  – Based on MVAPICH2 2.2rc1 (OFA-IB-CH3 interface); all the runtime features enabled by default in the OFA-IB-CH3 and OFA-IB-RoCE interfaces of MVAPICH2 2.2rc1 are available in MVAPICH2-X 2.2rc1
  – Support for the UPC++ and MPI+UPC++ programming models
• Support for the OSU InfiniBand Network Analysis and Management (OSU INAM) tool v0.9
  – Capability to profile and report the process-to-node communication matrix for MPI processes at user-specified granularity, in conjunction with OSU INAM
  – Capability to classify data flowing over a network link at job-level and process-level granularity, in conjunction with OSU INAM
Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
  – Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  – Support for advanced IB mechanisms (UMR and ODP)
  – Extremely minimal memory footprint
• Collective communication
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, …)
• Integrated support for GPGPUs
• Integrated support for MICs
• Energy-awareness
• InfiniBand Network Analysis and Monitoring (INAM)
Latency & Bandwidth: MPI over IB with MVAPICH2

[Charts: small-message latency and unidirectional bandwidth vs. message size for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR and ConnectX-4-EDR; small-message latencies of roughly 0.95-1.26 µs and peak bandwidths of 3,387, 6,356, 12,104 and 12,465 MB/s]

Platforms: TrueScale-QDR, 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch; ConnectX-3-FDR, 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch; ConnectIB-DualFDR, 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, with IB switch; ConnectX-4-EDR, 2.8 GHz deca-core (Haswell) Intel, PCIe Gen3, back-to-back

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-Based Zero-Copy Support (LiMIC and CMA))

[Chart: intra-node latency vs. message size (1 B to 1 KB) with the latest MVAPICH2 2.2rc1 on Intel Ivy Bridge; 0.18 µs intra-socket, 0.45 µs inter-socket]
[Charts: intra-socket and inter-socket bandwidth vs. message size for the CMA, shared-memory (Shmem) and LiMIC channels; peaks of 14,250 MB/s (intra-socket) and 13,749 MB/s (inter-socket)]

User-mode Memory Registration (UMR)
• Introduced by Mellanox to support direct local and remote noncontiguous memory access
  – Avoids packing at the sender and unpacking at the receiver
• Available since MVAPICH2-X 2.2b

[Charts: MPI datatype latency vs. message size, UMR vs. default, for small/medium (4K-256K bytes) and large (1M-16M bytes) messages]

Connect-IB (54 Gbps): 2.8 GHz dual ten-core (IvyBridge) Intel, PCIe Gen3, with Mellanox IB FDR switch
M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, CLUSTER 2015

On-Demand Paging (ODP)
• Introduced by Mellanox to support direct remote memory access without pinning
• Memory regions are paged in/out dynamically by the HCA/OS
• The size of registered buffers can be larger than physical memory
• Will be available in a future MVAPICH2 release
[Charts: Graph500 BFS kernel execution time (s) and Graph500 pin-down buffer size (MB) vs. number of processes (16, 32, 64), pin-down vs. ODP]

Connect-IB (54 Gbps): 2.6 GHz dual octa-core (SandyBridge) Intel, PCIe Gen3, with Mellanox IB FDR switch

Minimizing Memory Footprint with the Dynamically Connected (DC) Transport
• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by a "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Available since MVAPICH2-X 2.2a

[Diagram: four nodes (processes P0-P7) connected over an IB network. Charts: connection memory (KB) for Alltoall at 80-640 processes and normalized execution time for NAMD (Apoa1, large data set) at 160-620 processes, comparing RC, DC-Pool, UD and XRC; connection memory grows with process count for RC and XRC but stays nearly constant for DC-Pool and UD]
H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14)
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload (CORE-Direct Hardware, Available since MVAPICH2-X 2.2a)

[Charts: modified P3DFFT with Offload-Alltoall performs up to 17% better than the default version (128 processes); modified HPL with Offload-Bcast performs up to 4.5% better than the default version (512 processes); modified pre-conjugate-gradient solver with Offload-Allreduce performs up to 21.8% better than the default version (64-512 processes)]

K. Kandalla, et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011
K. Kandalla, et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011
K. Kandalla, et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12
K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS '12
MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications
[Stack diagram: MPI, OpenSHMEM, UPC, CAF, UPC++ or hybrid (MPI + PGAS) applications -> OpenSHMEM / CAF / UPC / UPC++ / MPI calls -> Unified MVAPICH2-X Runtime -> InfiniBand, RoCE, iWARP]
• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF and UPC++, available since MVAPICH2-X 1.9 (2012)
• UPC++ support is available in MVAPICH2-X 2.2rc1
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
  – MPI-3 compliant; OpenSHMEM v1.0 standard compliant; UPC v1.2 standard compliant (with initial support for UPC 1.3); CAF 2008 standard (OpenUH); UPC++
  – Scalable inter-node and intra-node communication: point-to-point and collectives

Application-Level Performance with Graph500 and Sort

[Charts: Graph500 execution time (MPI-Simple, MPI-CSC, MPI-CSR and hybrid MPI+OpenSHMEM; 4K-16K processes) and Sort execution time (MPI vs. hybrid; 500 GB to 4 TB input)]
• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X over MPI-Simple
• Performance of the hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input size
  – MPI: 2408 s (0.16 TB/min); hybrid: 1172 s (0.36 TB/min), a 51% improvement over the MPI design
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (CUDA 4.0 and later)
• Overlaps data movement from the GPU with RDMA transfers

At the sender:
    MPI_Send(s_devbuf, size, …);
At the receiver:
    MPI_Recv(r_devbuf, size, …);
The staging and pipelining happen inside MVAPICH2.

High Performance and High Productivity
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
• High-performance intra-node point-to-point communication for multi-GPU nodes (GPU-GPU, GPU-Host and Host-GPU)
• Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

[Charts: GPU-GPU inter-node latency, bandwidth and bi-bandwidth vs. message size for MV2-GDR 2.2b, MV2-GDR 2.0b and MV2 without GDR; 2.18 µs small-message latency (about 2X better than 2.0b and 11X better than without GDR), and roughly 10-11X bandwidth and bi-bandwidth improvement over MV2 without GDR]

Platform: MVAPICH2-GDR 2.2b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 7; Mellanox OFED 2.4 with GPU-Direct RDMA
Application-Level Evaluation (HOOMD-blue)

[Charts: average time steps per second (TPS) vs. number of processes (4-32) for 64K and 256K particles, MV2 vs. MV2+GDR; about 2X improvement with GDR]

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Exploiting GDR for OpenSHMEM

[Charts: small-message shmem_put and shmem_get D-D latency vs. message size, Host-Pipeline vs. GDR (7X and 9X improvements at 4 bytes); weak scaling of GPULBM (64x64x64) evolution time on 8-64 GPU nodes, with a 45% improvement for the enhanced GDR design]

• Introduced CUDA-aware OpenSHMEM
• GDR for small/medium message sizes
• Host staging for large messages, to avoid PCIe bottlenecks
• The hybrid design brings the best of both
• 3.13 µs put latency for 4 B (7X improvement) and 4.7 µs latency for 4 KB
• 9X improvement for 4 B get latency

K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C. Ching and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters, IEEE Cluster 2015; also accepted for a special issue of the journal PARCO
Will be available in a future MVAPICH2-X release
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC
• Offload mode
• Intra-node communication
  – Coprocessor-only and symmetric modes
• Inter-node communication
  – Coprocessor-only and symmetric modes
• Multi-MIC node configurations
• Running on three major systems
  – Stampede, Blueridge (Virginia Tech) and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-Based Communication

[Charts: intra-socket and inter-socket point-to-point latency (large messages, 8K-2M bytes) and bandwidth (1 B to 1M bytes); peak bandwidths of 5,236 MB/s (intra-socket) and 5,594 MB/s (inter-socket)]

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

[Charts: 32-node Allgather small-message (16H + 16M) and large-message (8H + 8M) latency, MV2-MIC vs. MV2-MIC-Opt, with improvements of up to 58% and 76%; 32-node Alltoall (8H + 8M) large-message latency, up to 55% better]
[Chart: P3DFFT communication and computation time, MV2-MIC vs. MV2-MIC-Opt; 32 nodes (8H + 8M), size 2K*2K*1K]

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014
Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)
• MVAPICH2-EA 2.1 (Energy-Aware)
  – A white-box approach
  – New energy-efficient communication protocols for point-to-point and collective operations
  – Intelligently applies the appropriate energy-saving techniques
  – Application-oblivious energy saving
• OEMT
  – A library utility to measure energy consumption for MPI applications
  – Works with all MPI runtimes
  – PRELOAD option for precompiled applications
  – Does not require root permission: a safe kernel module reads only a subset of MSRs
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)
• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• A pessimistic MPI applies the energy-reduction lever to each MPI call
A Case for Application-Oblivious Energy-Efficient MPI Runtime, A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]
Overview of OSU INAM
• A network monitoring and analysis tool that can analyze traffic on the InfiniBand network with inputs from the MPI runtime
  – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
  – http://mvapich.cse.ohio-state.edu/userguide/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (point-to-point, collectives and RMA)
• Ability to filter data based on the type of counters using a "drop-down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job Page" to display jobs in ascending/descending order of various performance metrics, in conjunction with MVAPICH2-X
• Visualize data transfers in a "live" or "historical" fashion for the entire network, a job or a set of nodes

OSU INAM – Network Level View
[Screenshots: full network (152 nodes) and zoomed-in view of the network]
• Show the network topology of large clusters
• Visualize traffic patterns on different links
• Quickly identify congested links and links in an error state
• See the history unfold: play back historical states of the network
OSU INAM – Job and Node Level Views
[Screenshots: visualizing a job (5 nodes) and finding routes between nodes]
• Job-level view
  – Show different network metrics (load, errors, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node-level view provides details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (point-to-point, collective, RMA)
  – Network metrics (e.g. XmitDiscard, RcvError) per rank/node
MVAPICH2 – Plans for Exascale
• Performance and memory scalability toward 1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
  – MPI + Task*
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
  – On-Demand Paging (ODP)*
  – Switch-IB2 SHArP*
  – GID-based support*
• Enhanced communication schemes for upcoming architectures
  – Knights Landing with MCDRAM*
  – NVLink*
  – CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended checkpoint-restart and migration support with SCR
• Features marked * will be available in future MVAPICH2 releases
Funding Acknowledgments

[Logos: funding support and equipment support organizations]
Personnel Acknowledgments
Current students: A. Augustine (M.S.), A. Awan (Ph.D.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), K. Kulkarni (M.S.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)
Current research scientists: H. Subramoni, X. Lu
Current senior research associate: K. Hamidouche
Current post-docs: J. Lin, D. Banerjee
Current programmer: J. Perkins
Current research specialist: M. Arnold
Past students: P. Balaji (Ph.D.), S. Bhagvat (M.S.), A. Bhat (M.S.), D. Buntinas (Ph.D.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), K. Kandalla (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), S. Krishnamoorthy (M.S.), R. Kumar (M.S.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), A. Moody (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), H. Subramoni (Ph.D.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)
Past post-docs: H. Wang, E. Mancini, X. Besseron, S. Marcarelli, H.-W. Jin, J. Vienne, M. Luo
Past research scientist: S. Sur
Past programmers: D. Bureddy

Thank You!
[email protected]
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project: http://mvapich.cse.ohio-state.edu/