How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

A Tutorial at the MVAPICH User Group (MUG) Meeting ’18 by

The MVAPICH Team
The Ohio State University
E-mail: [email protected]
http://mvapich.cse.ohio-state.edu/

Designing (MPI+X) for Exascale

• Scalability for million to billion processors
  – Support for highly-efficient inter-node and intra-node communication (both two-sided and one-sided)
• Scalable Collective communication
  – Offloaded
  – Non-blocking
  – Topology-aware
• Balancing intra-node and inter-node communication for next generation multi-/many-core (128-1024 cores/node)
  – Multiple end-points per node
• Support for efficient multi-threading
• Integrated Support for GPGPUs and Accelerators
• Fault-tolerance/resiliency
• QoS support for communication and I/O
• Support for Hybrid MPI+PGAS programming
  – MPI + OpenMP, MPI + UPC, MPI + OpenSHMEM, MPI + CAF, MPI + UPC++, ...
• Virtualization
• Energy-Awareness

Architecture of MVAPICH2 Software Family

High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime
Diverse APIs and Mechanisms
• Point-to-point Primitives
• Collectives Algorithms
• Energy-Awareness
• Remote Memory Access
• I/O and File Systems
• Fault Tolerance
• Virtualization
• Active Messages
• Job Startup
• Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path)
• Transport Protocols: RC, XRC, UD, DC, UMR, ODP
• Modern Features: SR-IOV, Multi-Rail

Support for Modern Multi-/Many-core Architectures (Intel-Xeon, OpenPOWER, Xeon-Phi (MIC, KNL), NVIDIA GPGPU)
• Transport Mechanisms: Shared Memory, CMA, IVSHMEM, XPMEM*
• Modern Features: NVLink*, CAPI*

* Upcoming

Collective Communication in MVAPICH2

Blocking and Non-Blocking Collective Algorithms in MV2

• Conventional (Flat) and Multi-/Many-Core Aware Designs
• Inter-Node Communication: Point to Point, Hardware Multicast, SHARP, RDMA
• Intra-Node Communication: Point to Point (SHMEM, LiMIC, CMA, XPMEM), Direct Shared Memory, Direct Kernel Assisted (CMA, XPMEM, LiMIC)

Designed for Performance & Overlap

Run-time flags:
• All shared-memory based collectives: MV2_USE_SHMEM_COLL (Default: ON)
• Hardware Mcast-based collectives: MV2_USE_MCAST (Default: OFF)
• CMA-based collectives: MV2_USE_CMA_COLL (Default: ON)

Advanced Allreduce Collective Designs Using SHArP

[Figure: osu_allreduce (OSU Micro Benchmark) latency with MVAPICH2 2.3b on 16 nodes at 4 PPN* and 28 PPN, message sizes 4-256 bytes. MVAPICH2-SHArP lowers latency by up to 1.5x-2.3x over MVAPICH2. Companion plots of MV2 vs. MV2-SHArP for 32 B and 128 B messages across (Number of Nodes, PPN) configurations show up to 1.4x-2.3x improvement.]

*PPN: Processes Per Node

Benefits of SHARP at Application Level

[Figure: Average DDOT Allreduce time of HPCG and mesh refinement time of MiniAMR at (4,28), (8,28), and (16,28) (Number of Nodes, PPN). MVAPICH2-SHArP reduces these times by 12-13% relative to MVAPICH2.]

SHARP support available since MVAPICH2 2.3a

Parameter            Description                       Default
MV2_ENABLE_SHARP=1   Enables SHARP-based collectives   Disabled
--enable-sharp       Configure flag to enable SHARP    Disabled

• Refer to the "Running Collectives with Hardware based SHArP support" section of the MVAPICH2 user guide for more information
• http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.3b-userguide.html#x1-990006.26

Problems with Blocking Collective Operations

[Figure: application processes blocked in a collective call; the communication phase cannot be overlapped with computation.]

• Communication time cannot be used for computation
  – No overlap of computation and communication
  – Inefficient
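For illustration, the hedged sketch below (not from the tutorial; the use of MPI_Allreduce, the buffer size, and the dummy post-processing loop are arbitrary choices) shows how a blocking collective serializes communication and computation: even work that is independent of the result cannot start until the collective returns.

#include <mpi.h>
#include <stdio.h>
#define N 1024

int main(int argc, char **argv)
{
    double local[N], global[N], other_work = 0.0;
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++)
        local[i] = rank + i;

    /* Blocking collective: does not return until the reduction completes,
     * so the process cannot compute during the communication phase. */
    MPI_Allreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* This loop is independent of 'global', yet it had to wait
     * for the blocking collective to finish. */
    for (i = 0; i < N; i++)
        other_work += 2.0 * local[i];

    if (rank == 0)
        printf("global[0] = %f, other_work = %f\n", global[0], other_work);
    MPI_Finalize();
    return 0;
}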

Concept of Non-blocking Collectives

[Figure: each application process hands the collective off to a communication support entity, which schedules the operation; the process continues its computation and periodically checks whether the operation is complete.]

• Application processes schedule the collective operation
• Check periodically if the operation is complete
• Overlap of computation and communication => better performance
• Catch: who will progress the communication?

Non-blocking Collective (NBC) Operations

• Enables overlap of computation with communication
• Non-blocking calls do not match blocking collective calls
  – MPI may use different algorithms for blocking and non-blocking collectives
  – Blocking collectives: optimized for latency
  – Non-blocking collectives: optimized for overlap
• A process calling an NBC operation
  – Schedules the collective operation and immediately returns
  – Executes application computation code
  – Waits for the end of the collective
• The communication is progressed by
  – Application code through MPI_Test
  – Network adapter (HCA) with hardware support
  – Dedicated processes / threads in the MPI library
• There is a non-blocking equivalent for each blocking operation
  – Has an "I" in the name
  – MPI_Bcast -> MPI_Ibcast; MPI_Reduce -> MPI_Ireduce

How do I write applications with NBC?

#include <mpi.h>
#include <stdlib.h>
#define COUNT 1024

int main(int argc, char **argv)
{
    int size, flag = 0;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int *sendbuf = malloc(COUNT * size * sizeof(int));  /* ... fill sendbuf ... */
    int *recvbuf = malloc(COUNT * size * sizeof(int));

    MPI_Ialltoall(sendbuf, COUNT, MPI_INT, recvbuf, COUNT, MPI_INT,
                  MPI_COMM_WORLD, &req);       /* schedule and return immediately */
    /* Computation that does not depend on result of Alltoall */
    MPI_Test(&req, &flag, MPI_STATUS_IGNORE);  /* Check if complete (non-blocking) */
    /* Computation that does not depend on result of Alltoall */
    MPI_Wait(&req, MPI_STATUS_IGNORE);         /* Wait till complete (blocking) */

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}

P3DFFT Performance with Non-Blocking Alltoall using RDMA Primitives

[Figure: CPU time per loop (seconds) vs. number of processes. Small-scale runs (128-512 processes) compare Default, RDMA-Aware, and Default-Thread; large-scale runs (128 to 8K processes) compare Default and RDMA-Aware, with RDMA-Aware about 19% faster.]

• Weak scaling experiments; problem size increases with job size
• RDMA-Aware delivers 19% improvement over Default at 8,192 processes
• Default-Thread exhibits the worst performance
  – Possibly because threads steal CPU cycles from P3DFFT
  – Not considered for the large-scale experiments
• Will be available in future

Offloading with Scalable Hierarchical Aggregation Protocol (SHArP)

• Management and execution of MPI operations in the network by using SHArP
• Manipulation of data while it is being transferred in the switch network
• SHArP provides an abstraction to realize the reduction operation
  – Defines Aggregation Nodes (AN), Aggregation Tree, and Aggregation Groups
  – AN logic is implemented as an InfiniBand Target Channel Adapter (TCA) integrated into the switch ASIC*
  – Uses RC for communication between ANs and between ANs and hosts in the Aggregation Tree*

[Figures: Physical Network Topology* and Logical SHArP Tree*]

* Bloch et al., "Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction"

Evaluation of SHArP-based Non-Blocking Allreduce

MPI_Iallreduce Benchmark

[Figure: MPI_Iallreduce benchmark at 1 PPN*, 8 nodes, message sizes 4-128 bytes: pure communication latency (us, lower is better) and communication-computation overlap (%, higher is better). MVAPICH2-SHArP improves on MVAPICH2 by up to 2.3x and achieves much higher overlap.]

• Complete offload of the Allreduce collective operation to the switch enables much higher overlap of communication and computation
• Available since MVAPICH2 2.3a

*PPN: Processes Per Node
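To make the overlap measurement concrete, here is a minimal sketch (not the OSU benchmark itself; the message size, the do_compute helper, and the timing scheme are illustrative assumptions) of comparing pure communication time against communication overlapped with independent computation for MPI_Iallreduce. With MV2_ENABLE_SHARP=1 (see the parameter table above), the reduction can be offloaded to the switch so that the overlapped run stays close to the computation time alone.

#include <mpi.h>
#include <stdio.h>

#define COUNT 16           /* small message, e.g. 128 bytes of doubles */

/* Placeholder for application work that is independent of the reduction. */
static void do_compute(double seconds)
{
    double start = MPI_Wtime();
    while (MPI_Wtime() - start < seconds)
        ;                   /* spin to emulate computation */
}

int main(int argc, char **argv)
{
    double in[COUNT] = {0}, out[COUNT];
    double t0, t_pure, t_overlap;
    MPI_Request req;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1. Pure communication time of the non-blocking Allreduce. */
    t0 = MPI_Wtime();
    MPI_Iallreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    t_pure = MPI_Wtime() - t0;

    /* 2. Same collective, but with independent computation injected
     *    between starting the collective and waiting on it.  The closer
     *    t_overlap stays to the computation time, the better the
     *    communication is progressed in the background (e.g., by the
     *    switch when SHARP offload is enabled via MV2_ENABLE_SHARP=1). */
    t0 = MPI_Wtime();
    MPI_Iallreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    do_compute(t_pure);     /* overlap-able work, sized to the comm time */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    t_overlap = MPI_Wtime() - t0;

    if (rank == 0)
        printf("pure comm: %g s, comm+compute: %g s\n", t_pure, t_overlap);

    MPI_Finalize();
    return 0;
}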

MVAPICH2 – Plans for Exascale

• Performance and Memory scalability toward 1-10M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
  – MPI + Task*
• Enhanced Optimization for GPU Support and Accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
  – Tag Matching*
  – Adapter Memory*
• Enhanced communication schemes for upcoming architectures
  – NVLINK*
  – CAPI*
• Extended topology-aware collectives
• Extended Energy-aware designs and Virtualization Support
• Extended Support for MPI Tools Interface (as in MPI 3.0)
• Extended FT support
• Support for * features will be available in future MVAPICH2 releases

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/
