How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

A Tutorial at the MVAPICH User Group (MUG) Meeting ’18 by

The MVAPICH Team
The Ohio State University
E-mail: [email protected]
http://mvapich.cse.ohio-state.edu/

Designing (MPI+X) for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable Collective communication
  – Offloaded
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next generation multi-/many-core (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming
  – MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + CAF, MPI + UPC++, ...
• Virtualization
• Energy-Awareness

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime
Diverse APIs and Mechanisms
• Point-to-point Primitives
• Collectives Algorithms
• Energy-Awareness
• Remote Memory Access
• I/O and File Systems
• Fault Tolerance
• Virtualization
• Active Messages
• Job Startup
• Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path)
• Transport Protocols: RC, XRC, UD, DC, UMR, ODP
• Modern Features: SR-IOV, Multi-Rail

Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPOWER, Xeon-Phi (MIC, KNL), NVIDIA GPGPU)
• Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM*
• Modern Features: NVLink*, CAPI*

* Upcoming

Collective Communication in MVAPICH2

Blocking and Non-Blocking Collective Algorithms in MV2

• Conventional (Flat) and Multi-/Many-Core Aware Designs
• Inter-Node Communication: Point to Point, Hardware Multicast, SHARP, RDMA
• Intra-Node Communication: Point to Point (SHMEM, LiMIC, CMA, XPMEM), Direct Shared Memory, Direct Kernel Assisted (CMA, XPMEM, LiMIC)

Designed for Performance & Overlap

Run-time flags:
• All shared-memory based collectives: MV2_USE_SHMEM_COLL (Default: ON)
• Hardware Mcast-based collectives: MV2_USE_MCAST (Default: OFF)
• CMA-based collectives: MV2_USE_CMA_COLL (Default: ON)

Advanced Allreduce Collective Designs Using SHArP

[Figure: osu_allreduce (OSU Micro Benchmark) latency with MVAPICH2 2.3b on 16 nodes at 4 PPN* and 28 PPN, message sizes 4-256 bytes. MVAPICH2-SHArP lowers latency by up to 1.5x-2.3x over MVAPICH2. Companion plots of MV2 vs. MV2-SHArP for 32 B and 128 B messages across (Number of Nodes, PPN) configurations show up to 1.4x-2.3x improvement.]

*PPN: Processes Per Node

Benefits of SHARP at Application Level

[Figure: Average DDOT Allreduce time of HPCG and mesh refinement time of MiniAMR at (4,28), (8,28), and (16,28) (Number of Nodes, PPN). MVAPICH2-SHArP reduces these times by 12-13% relative to MVAPICH2.]

SHARP support available since MVAPICH2 2.3a

Parameter            Description                       Default
MV2_ENABLE_SHARP=1   Enables SHARP-based collectives   Disabled
--enable-sharp       Configure flag to enable SHARP    Disabled

• Refer to the "Running Collectives with Hardware based SHArP support" section of the MVAPICH2 user guide for more information
• http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.html#x1-990006.26

Problems with Blocking Collective Operations

[Figure: application processes blocked in a collective call; the communication phase cannot be overlapped with computation.]

• Communication time cannot be used for computation
  – No overlap of computation and communication
  – Inefficient
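For illustration, the hedged sketch below (not from the tutorial; the use of MPI_Allreduce, the buffer size, and the dummy post-processing loop are arbitrary choices) shows how a blocking collective serializes communication and computation: even work that is independent of the result cannot start until the collective returns.

#include <mpi.h>
#include <stdio.h>
#define N 1024

int main(int argc, char **argv)
{
    double local[N], global[N], other_work = 0.0;
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++)
        local[i] = rank + i;

    /* Blocking collective: does not return until the reduction completes,
     * so the process cannot compute during the communication phase. */
    MPI_Allreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* This loop is independent of 'global', yet it had to wait
     * for the blocking collective to finish. */
    for (i = 0; i < N; i++)
        other_work += 2.0 * local[i];

    if (rank == 0)
        printf("global[0] = %f, other_work = %f\n", global[0], other_work);
    MPI_Finalize();
    return 0;
}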

Concept of Non-blocking Collectives

[Figure: each application process hands the collective off to a communication support entity, which schedules the operation; the process continues its computation and periodically checks whether the operation is complete.]

• Application processes schedule the collective operation
• Check periodically if the operation is complete
• Overlap of computation and communication => better performance
• Catch: who will progress the communication?

Non-blocking Collective (NBC) Operations

• Enables overlap of computation with communication
• Non-blocking calls do not match blocking collective calls
  – MPI may use different algorithms for blocking and non-blocking collectives
  – Blocking collectives: optimized for latency
  – Non-blocking collectives: optimized for overlap
• A process calling an NBC operation
  – Schedules the collective operation and immediately returns
  – Executes application computation code
  – Waits for the end of the collective
• The communication is progressed by
  – Application code through MPI_Test
  – Network adapter (HCA) with hardware support
  – Dedicated processes / threads in the MPI library
• There is a non-blocking equivalent for each blocking operation
  – Has an "I" in the name
  – MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce

How do I write applications with NBC?

#include <mpi.h>
#include <stdlib.h>
#define COUNT 1024

int main(int argc, char **argv)
{
    int size, flag = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int *sendbuf = malloc(COUNT * size * sizeof(int));  /* ... fill sendbuf ... */
    int *recvbuf = malloc(COUNT * size * sizeof(int));

    MPI_Ialltoall(sendbuf, COUNT, MPI_INT, recvbuf, COUNT, MPI_INT,
                  MPI_COMM_WORLD, &req);       /* schedule and return immediately */
    /* Computation that does not depend on result of Alltoall */
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* Check if complete (non-blocking) */
    /* Computation that does not depend on result of Alltoall */
    MPI_Wait(&req, MPI_STATUS_IGNORE);         /* Wait till complete (blocking) */

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}

P3DFFT Performance with Non-Blocking Alltoall using RDMA Primitives

[Figure: CPU time per loop (seconds) vs. number of processes. Small-scale runs (128-512 processes) compare Default, RDMA-Aware, and Default-Thread; large-scale runs (128 to 8K processes) compare Default and RDMA-Aware, with RDMA-Aware about 19% faster.]

• Weak scaling experiments; problem size increases with job size
• RDMA-Aware delivers 19% improvement over Default at 8,192 processes
• Default-Thread exhibits the worst performance
  – Possibly because threads steal CPU cycles from P3DFFT
  – Not considered for the large-scale experiments
• Will be available in future

Offloading with Scalable Hierarchical Aggregation Protocol (SHArP)

• Management and execution of MPI operations in the network by using SHArP
• Manipulation of data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
  – Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
  – AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
  – Uses RC for communication between ANs and between ANs and hosts in the Aggregation Tree*

[Figures: Physical Network Topology* and Logical SHArP Tree*]

* Bloch et al., "Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction"

Evaluation of SHArP-based Non-Blocking Allreduce

MPI_Iallreduce Benchmark

[Figure: MPI_Iallreduce benchmark at 1 PPN*, 8 nodes, message sizes 4-128 bytes: pure communication latency (us, lower is better) and communication-computation overlap (%, higher is better). MVAPICH2-SHArP improves on MVAPICH2 by up to 2.3x and achieves much higher overlap.]

• Complete offload of the Allreduce collective operation to the switch enables much higher overlap of communication and computation
• Available since MVAPICH2 2.3a

*PPN: Processes Per Node
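To make the overlap measurement concrete, here is a minimal sketch (not the OSU benchmark itself; the message size, the do_compute helper, and the timing scheme are illustrative assumptions) of comparing pure communication time against communication overlapped with independent computation for MPI_Iallreduce. With MV2_ENABLE_SHARP=1 (see the parameter table above), the reduction can be offloaded to the switch so that the overlapped run stays close to the computation time alone.

#include <mpi.h>
#include <stdio.h>

#define COUNT 16           /* small message, e.g. 128 bytes of doubles */

/* Placeholder for application work that is independent of the reduction. */
static void do_compute(double seconds)
{
    double start = MPI_Wtime();
    while (MPI_Wtime() - start < seconds)
        ;                   /* spin to emulate computation */
}

int main(int argc, char **argv)
{
    double in[COUNT] = {0}, out[COUNT];
    double t0, t_pure, t_overlap;
    MPI_Request req;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1. Pure communication time of the non-blocking Allreduce. */
    t0 = MPI_Wtime();
    MPI_Iallreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    t_pure = MPI_Wtime() - t0;

    /* 2. Same collective, but with independent computation injected
     *    between starting the collective and waiting on it.  The closer
     *    t_overlap stays to the computation time, the better the
     *    communication is progressed in the background (e.g., by the
     *    switch when SHARP offload is enabled via MV2_ENABLE_SHARP=1). */
    t0 = MPI_Wtime();
    MPI_Iallreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    do_compute(t_pure);     /* overlap-able work, sized to the comm time */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    t_overlap = MPI_Wtime() - t0;

    if (rank == 0)
        printf("pure comm: %g s, comm+compute: %g s\n", t_pure, t_overlap);

    MPI_Finalize();
    return 0;
}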

MVAPICH2 – Plans for Exascale

• Performance and Memory scalability toward 1-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
  – MPI + Task*
• Enhanced Optimization for GPU Support and Accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
  – Tag Matching*
  – Adapter Memory*
• Enhanced communication schemes for upcoming architectures
  – NVLINK*
  – CAPI*
• Extended topology-aware collectives
• Extended Energy-aware designs and Virtualization Support
• Extended Support for MPI Tools Interface (as in MPI 3.0)
• Extended FT support
• Support for * features will be available in future MVAPICH2 releases

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/
