Enhancing MPI Communication using Accelerated Verbs: The MVAPICH Approach

Talk at UCX BoF (SC ‘18) by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Introduction, Motivation, and Challenge

• HPC applications require high-performance, low-overhead data paths that provide
  – Low latency
  – High bandwidth
  – High message rate
• Hardware Offloaded Tag Matching
• Different families of accelerated verbs available (conceptual sketch below)
  – Burst family
    • Accumulates packets to be sent into bursts of single-SGE packets
  – Poll family
    • Optimizes send completion counts
    • Receive completions for which only the length is of interest
    • Completions that contain the payload in the CQE
• Can we integrate accelerated verbs into existing HPC middleware to extract peak performance and overlap?
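To make the burst/poll idea concrete, here is a minimal C sketch using only the standard libibverbs API (ibv_post_send, ibv_poll_cq). It is not the accelerated-verbs API itself; it only illustrates the two patterns those families optimize (chaining single-SGE sends into one post, and polling when only the completion count matters). The helper names post_send_burst and poll_count_only are hypothetical, and all QP/CQ/memory setup is assumed to exist elsewhere.

/*
 * Conceptual sketch only: standard verbs calls, not the accelerated
 * burst/poll families. Shows the patterns those families are built for.
 */
#include <infiniband/verbs.h>
#include <string.h>

/* "Burst"-style idea: accumulate several single-SGE sends and post them
 * as one chained work-request list (one doorbell), instead of calling
 * ibv_post_send() once per message. */
static int post_send_burst(struct ibv_qp *qp, struct ibv_sge *sges, int n)
{
    struct ibv_send_wr wrs[64], *bad_wr = NULL;
    if (n > 64) n = 64;

    for (int i = 0; i < n; i++) {
        memset(&wrs[i], 0, sizeof(wrs[i]));
        wrs[i].wr_id      = (uint64_t)i;
        wrs[i].sg_list    = &sges[i];       /* exactly one SGE per packet  */
        wrs[i].num_sge    = 1;
        wrs[i].opcode     = IBV_WR_SEND;
        wrs[i].send_flags = (i == n - 1) ? IBV_SEND_SIGNALED : 0;
        wrs[i].next       = (i == n - 1) ? NULL : &wrs[i + 1];
    }
    return ibv_post_send(qp, &wrs[0], &bad_wr);   /* one post for n sends */
}

/* "Poll"-style idea: when only the number of completions matters, filling
 * full ibv_wc entries is overhead; the accelerated poll family returns
 * just the count (or length). Here we emulate that by discarding
 * everything except the count. */
static int poll_count_only(struct ibv_cq *cq)
{
    struct ibv_wc wc[32];
    int n = ibv_poll_cq(cq, 32, wc);   /* accelerated poll_cnt avoids the wc[] fill */
    return n;
}

The accelerated families expose dedicated entry points for these paths, so the middleware can skip per-message work-request construction and full work-completion processing on the critical path.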

Verbs-level Performance: Message Rate

[Figure: three panels (Read, Write, Send) comparing regular vs. accelerated verbs. Y-axis: Million Messages / second; x-axis: Message Size (Bytes), 1 B to 4 KB. Labeled peak values include 10.47 (Read), 10.20 (Write), and 9.66 (Send) million messages per second.]

Testbed: ConnectX-5 EDR (100 Gbps), Intel Broadwell E5-2680 @ 2.4 GHz, MOFED 4.2-1, RHEL-7, kernel 3.10.0-693.17.1.el7.x86_64
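For orientation, a message-rate measurement such as the one plotted above is typically structured as follows; this is a generic sketch, not the actual benchmark used to produce these numbers, and it assumes a connected QP, its send CQ, and an ibv_sge describing a registered buffer of the desired message size already exist.

#define _POSIX_C_SOURCE 199309L
#include <infiniband/verbs.h>
#include <string.h>
#include <time.h>

/* Post `iters` small signaled sends, drain completions, and report
 * million messages per second. */
static double message_rate_mmps(struct ibv_qp *qp, struct ibv_cq *cq,
                                struct ibv_sge *sge, long iters)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_wc wc[64];
    long posted = 0, completed = 0;
    struct timespec t0, t1;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_SIGNALED;   /* every send produces a completion */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (posted < iters) {
        if (ibv_post_send(qp, &wr, &bad_wr) == 0)
            posted++;
        int n = ibv_poll_cq(cq, 64, wc); /* drain so the send queue never fills */
        if (n > 0)
            completed += n;
    }
    while (completed < posted) {         /* wait for the tail of completions */
        int n = ibv_poll_cq(cq, 64, wc);
        if (n > 0)
            completed += n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return posted / secs / 1e6;          /* million messages per second */
}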

Verbs-level Performance: Bandwidth

[Figure: three panels (Read, Write, Send) comparing regular vs. accelerated verbs. Y-axis: Million Bytes / second (log scale, 10 to 1000); x-axis: Message Size (Bytes), 2 B to 512 B. Labeled small-message values include 16.39 and 12.21 (Read), 18.35 and 14.72 (Write), and 16.10 and 14.13 (Send) million bytes per second.]

Testbed: ConnectX-5 EDR (100 Gbps), Intel Broadwell E5-2680 @ 2.4 GHz, MOFED 4.2-1, RHEL-7, kernel 3.10.0-693.17.1.el7.x86_64

The MVAPICH Approach

High Performance Parallel Programming Models

• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid --- MPI + X (MPI + PGAS + OpenMP/Cilk)

High Performance and Scalable Communication Runtime
Diverse APIs and Mechanisms

Point-to-point Primitives, Collectives Algorithms, Job Startup, Energy-Awareness, Remote Memory Access, I/O and File Systems, Fault Tolerance, Virtualization, Active Messages, Introspection & Analysis

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path)
• Transport Protocols: RC, XRC, UD, DC, UMR, ODP
• Modern Interconnect Features: SR-IOV, Multi-Rail, Tag Match
• Accelerated Verbs Family*: Burst, Poll
• Modern Switch Features: Multicast, SHARP

* Upcoming
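The runtime layer shown above is driven through standard MPI calls. As a point of reference, here is a minimal, generic MPI point-to-point example (not MVAPICH-specific code): the (source, tag, communicator) matching performed for the pre-posted receive is exactly what hardware-offloaded Tag Match accelerates, and the small eager send is where an accelerated-verbs fast path inside the runtime would apply.

/* Minimal generic MPI point-to-point sketch; build with mpicc, run with
 * at least two ranks (e.g., mpirun -np 2 ./a.out). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, msg = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            msg = 42;
            /* Small eager send: the runtime's small-message path. */
            MPI_Send(&msg, 1, MPI_INT, 1, /* tag */ 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Pre-posted receive: the runtime (or the NIC, with hardware
             * tag matching) matches incoming messages against this tag. */
            MPI_Irecv(&msg, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", msg);
        }
    }

    MPI_Finalize();
    return 0;
}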