Overview of the MVAPICH Project: Latest Status and Future Roadmap
MVAPICH2 User Group (MUG) Meeting
by
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Drivers of Modern HPC Cluster Architectures
(Diagram: Multi-core Processors; High Performance Interconnects - InfiniBand, <1 usec latency, >100 Gbps bandwidth; Accelerators / Coprocessors with high compute density, high performance/watt, >1 TFlop DP on a chip)
• Multi-core processors are ubiquitous
• InfiniBand is very popular in HPC clusters
• Accelerators/Coprocessors are becoming common in high-end systems
• Pushing the envelope for Exascale computing
Example systems: Tianhe-2 (#1), Titan (#2), Stampede (#8), Tianhe-1A (#24)
Parallel Programming Models Overview
(Diagram: three abstract machine models, each with processes P1-P3 and their memories)
• Shared Memory Model – SHMEM, DSM
• Distributed Memory Model – MPI (Message Passing Interface)
• Partitioned Global Address Space (PGAS), a logical shared memory over distributed memories – Global Arrays, UPC, Chapel, X10, CAF, ...
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
Supporting Programming Models for Multi-Petaflop and Exaflop Systems: Challenges
(Diagram: co-design middleware stack, with co-design opportunities and challenges across all layers)
• Application Kernels/Applications
• Programming Models: MPI, PGAS (UPC, Global Arrays, OpenSHMEM), CUDA, OpenACC, Cilk, Hadoop, MapReduce, etc.
• Communication Library or Runtime for Programming Models: Point-to-point Communication (two-sided & one-sided), Collective Communication, Synchronization & Locks, I/O & File Systems, Fault Tolerance
• Networking Technologies (InfiniBand, 10/40GigE, Aries, Omni-Path), Multi-/Many-core Architectures, Accelerators (NVIDIA and MIC)
Designing (MPI+X) at Exascale
• Scalability for million to billion processors
 – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
 – Extremely small memory footprint
• Hybrid programming (MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, ...)
• Balancing intra-node and inter-node communication for next-generation multi-core nodes (128-1024 cores/node)
 – Multiple end-points per node
• Support for efficient multi-threading (see the sketch after this list)
• Scalable collective communication
 – Offload
 – Non-blocking
 – Topology-aware
 – Power-aware
• Support for the MPI-3 RMA model
• Support for GPGPUs and accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Virtualization support
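Hybrid MPI+X programming and efficient multi-threading both hinge on the threading level the MPI library grants. Below is a minimal, hypothetical MPI + OpenMP sketch (not taken from the slides) showing the standard MPI_Init_thread handshake that requests MPI_THREAD_MULTIPLE so that OpenMP threads may call MPI concurrently:

/* Minimal hybrid MPI + OpenMP sketch (illustration only). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Ask for full multi-threading support; the library reports what it grants. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "warning: MPI_THREAD_MULTIPLE not available\n");

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Node-local parallelism with OpenMP; inter-node communication stays in MPI. */
    #pragma omp parallel
    {
        printf("rank %d, thread %d\n", rank, omp_get_thread_num());
    }

    MPI_Finalize();
    return 0;
}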
The MVAPICH2 Software Family
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, RDMA over Converged Enhanced Ethernet (RoCE), and virtualized clusters
 – MVAPICH (MPI-1), available since 2002
 – MVAPICH2 (MPI-2.2, MPI-3.0 and MPI-3.1), available since 2004
 – MVAPICH2-X (Advanced MPI + PGAS), available since 2012
 – Support for GPGPUs (MVAPICH2-GDR), available since 2014
 – Support for MIC (MVAPICH2-MIC), available since 2014
 – Support for Virtualization (MVAPICH2-Virt), available since 2015
• http://mvapich.cse.ohio-state.edu
MVAPICH Project Timeline
(Timeline chart: start dates of MVAPICH (Oct-02, EOL Jul-15), OMB, MVAPICH2, MVAPICH2-X, MVAPICH2-GDR, MVAPICH2-MIC, and MVAPICH2-Virt, spanning Oct-02 through Apr-15)

MVAPICH/MVAPICH2 Release Timeline and Downloads
(Chart: cumulative number of downloads per month from Sep-04 through Mar-15, approaching 300,000, annotated with releases from MV 0.9.4 / MV2 0.9.0 through MV2 2.1, MV2 2.2a, MV2-X 2.2a, MV2-GDR 2.1rc2, MV2-MIC 2.0, and MV2-Virt 2.1rc2)

The MVAPICH2 Software Family (Cont.)
• Empowering many TOP500 clusters (July '15 ranking)
 – 8th-ranked 519,640-core cluster (Stampede) at TACC
 – 11th-ranked 185,344-core cluster (Pleiades) at NASA
 – 22nd-ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
 – and many others
• Used by more than 2,450 organizations in 76 countries
• More than 282,000 downloads from the OSU site directly
• Available with software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
• Empowering Top500 systems for over a decade
 – from System-X at Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops)
 – to Stampede at TACC (8th in Jun '15, 462,462 cores, 5.168 PFlops)
Usage Guidelines for the MVAPICH2 Software Family
Requirements -> MVAPICH2 library to use
• MPI with IB, iWARP and RoCE -> MVAPICH2
• Advanced MPI, PGAS and MPI+PGAS with IB and RoCE -> MVAPICH2-X
• MPI with IB & GPU -> MVAPICH2-GDR
• MPI with IB & MIC -> MVAPICH2-MIC
• HPC Cloud with MPI & IB -> MVAPICH2-Virt
Strong Procedure for Design, Development and Release
• Research is done to explore new designs
• Designs are first presented in conference/journal publications
• Best-performing designs are incorporated into the codebase
• Rigorous QA procedure before making a release
 – Exhaustive unit testing
 – Various test procedures on a diverse range of platforms and interconnects
 – Performance tuning
 – Application-based evaluation
 – Evaluation on large-scale systems
• Even alpha and beta versions go through the above testing
MVAPICH2 2.2a
• Released on 08/18/2015
• Major Features and Enhancements
 – Based on MPICH-3.1.4
 – Support for backing on-demand UD CM information with shared memory for minimizing memory footprint
 – Dynamic identification of maximum read/atomic operations supported by the HCA
 – Enabling support for intra-node communication in RoCE mode without shared memory
 – Updated to hwloc 1.11.0
 – Updated sm_20 kernel optimizations for MPI datatypes
 – Automatic detection and tuning for the 24-core Haswell architecture
 – Enhanced startup performance
   • Support for PMI-2 based startup with SLURM
 – Checkpoint-Restart support with DMTCP (Distributed MultiThreaded CheckPointing)
 – Enhanced communication performance for small/medium message sizes
 – Support for linking with Intel Trace Analyzer and Collector
One-way Latency: MPI over IB with MVAPICH2
(Charts: small- and large-message one-way latency for TrueScale-QDR, ConnectX-3-FDR, ConnectIB-DualFDR, and ConnectX-4-EDR; highlighted small-message latencies are 1.26, 1.19, 1.15, and 1.05 us)
Testbed: 2.8 GHz deca-core (IvyBridge) Intel nodes, PCIe Gen3; TrueScale-QDR, ConnectX-3-FDR, and ConnectIB-DualFDR measured through an IB switch, ConnectX-4-EDR back-to-back

Bandwidth: MPI over IB with MVAPICH2
(Charts: unidirectional and bidirectional bandwidth for the same adapters; unidirectional peaks of 3387, 6356, 11497, and 12465 MB/s and bidirectional peaks of 6308, 12161, 22909, and 24353 MB/s are highlighted)
Testbed: 2.8 GHz deca-core (IvyBridge) Intel nodes, PCIe Gen3; TrueScale-QDR, ConnectX-3-FDR, and ConnectIB-DualFDR measured through an IB switch, ConnectX-4-EDR back-to-back

MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-based Zero-copy Support (LiMIC and CMA))
(Charts, latest MVAPICH2 2.2a on Intel Ivy Bridge: small-message intra-node latency of 0.18 us intra-socket and 0.45 us inter-socket; peak intra-socket bandwidth of 14,250 MB/s and inter-socket bandwidth of 13,749 MB/s, comparing CMA, shared-memory, and LiMIC channels)
MVAPICH2-X for Advanced MPI, Hybrid MPI + PGAS Applications

MPI, OpenSHMEM, UPC, CAF or Hybrid (MPI + PGAS) Applications
OpenSHMEM Calls CAF Calls UPC Calls MPI Calls
Unified MVAPICH2-X Runtime
InfiniBand, RoCE, iWARP
• Current model: separate runtimes for OpenSHMEM/UPC/CAF and MPI
 – Possible deadlock if both runtimes are not progressed
 – Consumes more network resources
• Unified communication runtime for MPI, UPC, OpenSHMEM, and CAF is available with MVAPICH2-X 1.9 onwards!
 – http://mvapich.cse.ohio-state.edu
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics (see the sketch after the reference below)
• Benefits:
 – Best of the Distributed Computing Model (MPI)
 – Best of the Shared Memory Computing Model (PGAS)
• Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"
(Diagram: an HPC application in which Kernel 1 and Kernel 3 use MPI while Kernel 2 and Kernel N are re-written in PGAS)
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P., et al., International Journal of High Performance Computer Applications, Volume 25, Number 1, 2011, ISSN 1094-3420
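As a concrete illustration of the hybrid model, here is a minimal, hypothetical MPI + OpenSHMEM sketch (not from the slides). It assumes a unified runtime in the spirit of MVAPICH2-X where both models can coexist in one program; since the exact initialization/finalization sequence is runtime-specific, the code guards MPI_Init/MPI_Finalize defensively, and start_pes() follows the OpenSHMEM 1.0 convention:

/* Hypothetical hybrid MPI + OpenSHMEM sketch (illustration only). */
#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    start_pes(0);                               /* OpenSHMEM 1.0 initialization */

    int initialized = 0;
    MPI_Initialized(&initialized);              /* the unified runtime may already */
    if (!initialized) MPI_Init(&argc, &argv);   /* have brought MPI up             */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* "MPI kernel": two-sided/collective communication */
    int sum = rank;
    MPI_Allreduce(MPI_IN_PLACE, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* "PGAS kernel": one-sided put into a symmetric variable on the right neighbor */
    static int window = -1;                     /* static => symmetric data object */
    shmem_int_p(&window, sum, (rank + 1) % size);
    shmem_barrier_all();

    if (rank == 0)
        printf("allreduce sum = %d, value delivered via shmem = %d\n", sum, window);

    int finalized = 0;
    MPI_Finalized(&finalized);
    if (!finalized) MPI_Finalize();
    return 0;
}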
MVAPICH2-X 2.2a
• Released on 08/18/2015
• MVAPICH2-X 2.2a Feature Highlights
 – Based on MVAPICH2 2.2a including MPI-3 features
   • Compliant with UPC 2.20.2, OpenSHMEM v1.0h and CAF 3.0.39
 – MPI (Advanced) Features
   • Support for the Dynamically Connected (DC) transport protocol
     - Available for pt-to-pt, RMA and collectives
     - Support for hybrid mode with RC/DC/UD/XRC
   • Support for Core-Direct based non-blocking collectives
     - Available for Ibcast, Ibarrier, Iscatter, Igather, Ialltoall and Iallgather
 – OpenSHMEM Features
   • Support for RoCE
   • Support for the Dynamically Connected (DC) transport protocol
 – UPC Features
   • Based on Berkeley UPC 2.20.2 (contains changes/additions in preparation for the upcoming UPC 1.3 specification)
   • Support for RoCE
   • Support for the Dynamically Connected (DC) transport protocol
 – CAF Features
   • Support for RoCE
   • Support for the Dynamically Connected (DC) transport protocol
Minimizing Memory Footprint Further by DC Transport
• Constant connection cost (one QP for any peer)
• Full feature set (RDMA, atomics, etc.)
• Separate objects for send (DC Initiator) and receive (DC Target)
 – DC Target identified by "DCT Number"
 – Messages routed with (DCT Number, LID)
 – Requires the same "DC Key" to enable communication
• Available with MVAPICH2-X 2.2a
(Diagram: processes P0-P7 spread across Nodes 0-3 connected through the IB network. Charts: connection memory (KB) for Alltoall with 80-640 processes and normalized execution time of NAMD (Apoa1, large data set) with 160-620 processes, comparing RC, UD, XRC and DC-Pool; DC-Pool keeps memory near-constant as process counts grow.)
H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand: Early Experiences, IEEE International Supercomputing Conference (ISC '14).

Application-Level Performance with Graph500 and Sort

(Charts: Graph500 execution time at 4K-16K processes for MPI-Simple, MPI-CSC, MPI-CSR and Hybrid (MPI+OpenSHMEM); Sort execution time for MPI vs. Hybrid on 500 GB-4 TB inputs with 512-4K processes.)
• Performance of the Hybrid (MPI+OpenSHMEM) Graph500 design
 – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
 – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of the Hybrid (MPI+OpenSHMEM) Sort application
 – 4,096 processes, 4 TB input size
 – MPI: 2408 sec (0.16 TB/min); Hybrid: 1172 sec (0.36 TB/min), a 51% improvement over the MPI design
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.

GPU-Direct RDMA (GDR) with CUDA
• OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
• OSU has a design of MVAPICH2 using GPUDirect RDMA
 – Hybrid design using GPU-Direct RDMA
   • GPUDirect RDMA and host-based pipelining
   • Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
 – Support for communication using multi-rail
 – Support for Mellanox Connect-IB and ConnectX VPI adapters
 – Support for RoCE with Mellanox ConnectX VPI adapters
(Diagram: CPU, system memory, chipset, IB adapter and GPU/GPU memory. Chipset P2P limits: SNB E5-2670 - P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680V2 - P2P write 6.4 GB/s, P2P read 3.5 GB/s.)
MVAPICH2-GDR 2.1rc2
• Released on 06/24/2015
• Major Features and Enhancements
 – Based on MVAPICH2 2.1rc2
 – CUDA 7.0 compatibility
 – CUDA-aware support for MPI_Rsend and MPI_Irsend primitives
 – Parallel intra-node communication channels (shared memory for H-H and GDR for D-D)
 – Optimized H-H, H-D and D-H communication
 – Optimized intra-node D-D communication
 – Optimization and tuning for point-to-point and collective operations
 – Updated sm_20 kernel optimization for datatype processing
 – Optimized design for GPU-based small message transfers
 – Added R3 support for GPU-based packetized transfer
 – Enhanced performance for small message host-to-device transfers
 – Support for MPI_Scan and MPI_Exscan collective operations from GPU buffers
 – Optimization of collectives with new copy designs
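To make the CUDA-aware behavior above concrete, here is a minimal, hypothetical sketch (not from the slides) of what a CUDA-aware MPI library such as MVAPICH2-GDR permits: device pointers are passed directly to MPI calls, and the library chooses GDR, pipelining, or staging internally. It assumes two ranks, each with a visible GPU, and (per the tuning parameters shown on the HOOMD-blue slide below) the MV2_USE_CUDA=1 runtime setting:

/* Minimal CUDA-aware point-to-point sketch (illustration only); run with 2 ranks. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                       /* 1M floats */
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0) {
        cudaMemset(d_buf, 0, n * sizeof(float));
        /* Device pointer handed straight to MPI_Send */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}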
Performance of MVAPICH2-GDR with GPU-Direct-RDMA
(Charts, GPU-GPU internode with MV2-GDR 2.1RC2 vs. MV2-GDR 2.0b and MV2 without GDR: small-message latency down to 2.15 usec, with improvements of roughly 2X-3.5X over MV2-GDR 2.0b and 9.3X-11X over MV2 without GDR across latency and uni-/bi-directional bandwidth.)
Testbed: MVAPICH2-GDR 2.1RC2, Intel Ivy Bridge (E5-2680 v2) nodes with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA

Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
MPI-3 RMA Performance of MVAPICH2-GDR with GPU-Direct-RDMA
GPU-GPU Internode MPI Put Latency (RMA put operation, device to device)
MPI-3 RMA provides flexible synchronization and completion primitives
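As a reminder of what those primitives look like, here is a minimal, hypothetical MPI-3 RMA sketch (not from the slides) using passive-target synchronization with MPI_Win_flush. It uses host buffers, whereas the measurements below are for device-to-device puts:

/* Minimal MPI-3 RMA sketch: passive-target put with flush (illustration only). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *win_buf;
    MPI_Win win;
    /* Each process exposes one int through the window */
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = -1;
    MPI_Barrier(MPI_COMM_WORLD);                 /* windows initialized everywhere */

    int target = (rank + 1) % size;
    MPI_Win_lock_all(0, win);                    /* passive-target epoch */
    MPI_Put(&rank, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_flush(target, win);                  /* complete the put at the target */
    MPI_Win_unlock_all(win);

    MPI_Barrier(MPI_COMM_WORLD);
    /* Reading the local window assumes the unified memory model;
     * otherwise MPI_Win_sync inside an access epoch would be needed. */
    printf("rank %d received %d\n", rank, *win_buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}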
(Chart: GPU-GPU internode small-message MPI_Put latency, MV2-GDR 2.1RC2 vs. MV2-GDR 2.0b; 2.88 us with 2.1RC2, up to 7X improvement.)
Testbed: MVAPICH2-GDR 2.1RC2, Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA, CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA

MPI Applications on MIC Clusters

• MPI (+X) continues to be the predominant programming model in HPC
• Flexibility in launching MPI jobs on clusters with Xeon Phi
(Diagram: spectrum of launch modes across the Xeon host and the Xeon Phi, from multi-core centric to many-core centric)
• Host-only: MPI program runs only on the host
• Offload (/reverse offload): MPI program on the host with computation offloaded to the Xeon Phi
• Symmetric: MPI programs run on both the host and the Xeon Phi
• Coprocessor-only: MPI program runs only on the Xeon Phi
MVAPICH2-MIC Design for Clusters with IB and MIC
• Offload Mode
• Intranode Communication
 – Coprocessor-only Mode
 – Symmetric Mode
• Internode Communication
 – Coprocessor-only Mode
 – Symmetric Mode
• Multi-MIC Node Configurations
MVAPICH2-MIC 2.0
• Released on 12/02/2014
• Major Features and Enhancements
 – Based on MVAPICH2 2.0.1
 – Support for native, symmetric and offload modes of MIC usage
 – Optimized intra-MIC communication using SCIF and shared-memory channels
 – Optimized intra-node host-to-MIC communication using SCIF and IB channels
 – Enhanced mpirun_rsh to launch jobs in symmetric mode from the host
 – Support for proxy-based communication for inter-node transfers
   • Active-proxy, 1-hop and 2-hop designs (actively using the host CPU)
   • Passive-proxy (passively using the host CPU)
 – Support for MIC-aware MPI_Bcast()
 – Improved SCIF performance for pipelined communication
 – Optimized shared-memory communication performance for single-MIC jobs
 – Supports an explicit CPU-binding mechanism for MIC processes
 – Tuned pt-to-pt intra-MIC, intra-node, and inter-node transfers
 – Supports hwloc v1.9
• Running on three major systems
 – Stampede
 – Blueridge (Virginia Tech)
 – Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication
(Charts: intra-socket and inter-socket MIC-to-remote-MIC large-message latency (lower is better) and bandwidth, peaking at 5236 MB/s intra-socket and 5594 MB/s inter-socket.)
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
(Charts: 32-node Allgather with 16 host + 16 MIC and 8 host + 8 MIC processes per node - MV2-MIC-Opt reduces small- and large-message latency by up to 58%-76% over MV2-MIC; 32-node Alltoall (8H + 8M) - 55% improvement for large messages; P3DFFT on 32 nodes (8H + 8M, 2K x 2K x 1K problem) - MV2-MIC-Opt reduces communication time.)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS '14, May 2014.

MVAPICH2-Virt 2.1rc2
• Enables high-performance communication in virtualized environments
 – SR-IOV based communication for inter-node MPI communication
 – Inter-VM Shared Memory (IVSHMEM) based communication for intra-node inter-VM MPI communication
 – http://mvapich.cse.ohio-state.edu
MVAPICH2-Virt 2.1rc2
• Released on 06/26/2015
• Major Features and Enhancements
 – Based on MVAPICH2 2.1rc2
 – Support for efficient MPI communication over SR-IOV enabled InfiniBand networks
 – High-performance and locality-aware MPI communication with IVSHMEM
 – Support for IVSHMEM device auto-detection in virtual machines
 – Automatic communication channel selection among SR-IOV, IVSHMEM, and CMA/LiMIC2
 – Support for easy configuration through runtime parameters
 – Tested with Mellanox InfiniBand adapters (ConnectX-3, 56 Gbps)
Application-Level Performance (8 VM * 8 Core/VM)
(Charts: execution time of NAS Class C benchmarks (FT, EP, LU, BT on 64 cores) and Graph500 (problem sizes (20,10) through (22,20)) comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native.)
• Compared to native, 1-9% overhead for NAS
• Compared to native, 4-9% overhead for Graph500
OSU Micro-Benchmarks (OMB)
• Started in 2004 and continuing steadily
• Allows MPI developers and users to test and evaluate MPI libraries
• Has a wide range of benchmarks
 – Two-sided (MPI-1, MPI-2 and MPI-3)
 – One-sided (MPI-2 and MPI-3)
 – RMA (MPI-3)
 – Collectives (MPI-1, MPI-2 and MPI-3)
 – Extensions for GPU-aware communication (CUDA and OpenACC)
 – UPC (pt-to-pt)
 – OpenSHMEM (pt-to-pt and collectives)
 – Startup
• Widely used in the MPI community
OSU Micro-Benchmarks v5.0
• OSU Micro-Benchmarks 5.0 released on 08/18/15
• Non-blocking collective benchmarks
 – osu_iallgather, osu_ialltoall, osu_ibarrier, osu_ibcast, osu_igather, and osu_iscatter
 – Benchmarks can display the amount of achievable overlap
   • Overlap is defined as the amount of computation that can be performed while the communication progresses in the background
   • Additional option "-t" sets the number of MPI_Test() calls made during the dummy computation (100, 1000, or any number > 0); a sketch of this pattern follows the list
• Startup benchmarks
 – osu_init: measures the time taken for an MPI library to complete MPI_Init
 – osu_hello: measures the time taken for an MPI library to complete a simple hello world MPI program
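The following is a minimal, hypothetical sketch (not the actual benchmark code) of the pattern these non-blocking collective benchmarks measure: start a non-blocking collective, perform dummy computation interleaved with MPI_Test() calls so the library can make progress, then wait for completion. All sizes and loop counts here are illustrative.

/* Sketch of the overlap-measurement idea behind the osu_i* benchmarks. */
#include <mpi.h>
#include <stdlib.h>

/* Dummy computation interleaved with a fixed number of MPI_Test calls,
 * mirroring the "-t" option described above. */
static void dummy_compute(MPI_Request *req, int test_calls)
{
    volatile double x = 0.0;
    int done = 0;
    for (int t = 0; t < test_calls; t++) {
        for (int i = 0; i < 100000; i++)          /* a slice of fake work */
            x += i * 1e-9;
        MPI_Test(req, &done, MPI_STATUS_IGNORE);  /* let MPI progress the collective */
    }
    (void)x;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;
    int *sbuf = calloc(count, sizeof(int));
    int *rbuf = calloc((size_t)count * size, sizeof(int));

    MPI_Request req;
    MPI_Iallgather(sbuf, count, MPI_INT, rbuf, count, MPI_INT,
                   MPI_COMM_WORLD, &req);         /* start the non-blocking collective */
    dummy_compute(&req, 100);                     /* overlap computation with it */
    MPI_Wait(&req, MPI_STATUS_IGNORE);            /* then complete it */

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}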
Overview of OSU INAM

• OSU INAM monitors IB clusters in real time by querying various subnet-management entities in the network
• Major features of the OSU INAM tool include:
 – Analyze and profile network-level activities with many parameters (data and errors) at user-specified granularity
 – Capability to analyze and profile node-level, job-level and process-level activities for MPI communication (pt-to-pt, collectives and RMA)
 – Remotely monitor CPU utilization of MPI processes at user-specified granularity
 – Visualize the data transfer happening in a "live" fashion - Live View for
   • Entire Network - Live Network Level View
   • Particular Job - Live Job Level View
   • One or multiple Nodes - Live Node Level View
 – Capability to visualize data transfers that happened in the network during a time window in the past - Historical View for
   • Entire Network - Historical Network Level View
   • Particular Job - Historical Job Level View
   • One or multiple Nodes - Historical Node Level View
OSU INAM – Network Level View
(Screenshots: Full Network (152 nodes); Zoomed-in View of the Network)
• Show the network topology of large clusters
• Visualize traffic patterns on different links
• Quickly identify congested links/links in error state
• See the history unfold – play back historical state of the network
OSU INAM – Job and Node Level Views
(Screenshots: Visualizing a Job (5 Nodes); Finding Routes Between Nodes)
• Job level view
 – Show different network metrics (load, error, etc.) for any live job
 – Play back historical data for completed jobs to identify bottlenecks
• Node level view provides details per process or per node
 – CPU utilization for each rank/node
 – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
 – Network metrics (e.g., XmitDiscard, RcvError) per rank/node
MVAPICH2-EA & OEMT
• MVAPICH2-EA (Energy-Aware)
 – A white-box approach
 – New energy-efficient communication protocols for pt-pt and collective operations
 – Intelligently applies the appropriate energy-saving techniques
 – Application-oblivious energy saving
• OEMT
 – A library utility to measure energy consumption for MPI applications
 – Works with all MPI runtimes
 – PRELOAD option for precompiled applications
 – Does not require ROOT permission:
   • A safe kernel module to read only a subset of MSRs
MV2-EA: Application-Oblivious Energy-Aware MPI (EAM)
• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies an energy-reduction lever to each MPI call
A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

MVAPICH2 – Plans for Exascale
• Performance and memory scalability toward 500K-1M cores
 – Dynamically Connected Transport (DCT) service with Connect-IB
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features
 – User Mode Memory Registration (UMR)
 – On-demand paging
• Enhanced inter-node and intra-node communication schemes for upcoming Omni-Path enabled Knights Landing architectures
• Extended RMA support (as in MPI 3.0)
• Extended topology-aware collectives
• Power-aware point-to-point (one-sided and two-sided) and collectives
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended Checkpoint-Restart and migration support with SCR
Web Pointers
NOWLAB Web Page: http://nowlab.cse.ohio-state.edu
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu