Enhancing MPI Communication using Accelerated Verbs and Tag Matching: The MVAPICH Approach

Talk at UCX BoF (ISC ‘19) by

Dhabaleswar K. (DK) Panda The Ohio State University E-mail: [email protected] http://www.cse.ohio-state.edu/~panda Introduction, Motivation, and Challenge • HPC applications require high-performance, low overhead data paths that provide – Low latency – High bandwidth – High message rate • Hardware Offloaded Tag Matching • Different families of accelerated verbs available – Burst family • Accumulates packets to be sent into bursts of single SGE packets – Poll family • Optimizes send completion counts • Receive completions for which only the length is of interest • Completions that contain the payload in the CQE • Can we integrate accelerated verbs and tag matching support in UCX into existing HPC middleware to extract peak performance and overlap? Network Based Computing Laboratory ISC’19 The MVAPICH Approach

High Performance Parallel Programming Models

Message Passing Interface PGAS Hybrid --- MPI + X (MPI) (UPC, OpenSHMEM, CAF, UPC++) (MPI + PGAS + OpenMP/)

High Performance and Scalable Communication Runtime Diverse and Mechanisms

Point-to- Remote Collectives Energy- I/O and Fault Active Introspection point Job Startup Memory Virtualization Algorithms Awareness Tolerance Messages & Analysis Primitives Access File Systems

Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) Transport Protocols Modern Interconnect Features Accelerated Verbs Family* Modern Switch Features SR- Multi Tag RC XRC UD DC UMR ODP Burst Poll Multicast SHARP IOV Rail Match

* Upcoming Network Based Computing Laboratory ISC’19 Verbs-level Performance: Message Rate

12 12 12 Read 10.47 Write 10.20 Send 10 10 9.66 10 8.73 8.97 8 8 8 8.05 7.41 6 6 6

4 4 4 regular regular regular Million Messages second / Million Messages second / Million Messages second / 2 2 2 acclerated acclerated acclerated 0 0 0 1 8 64 512 4096 1 8 64 512 4096 1 8 64 512 4096 Message Size (Bytes) Message Size (Bytes) Message Size (Bytes)

ConnectX-5 EDR (100 Gbps), Intel Broadwell E5-2680 @ 2.4 GHz MOFED 4.2-1, RHEL-7 3.10.0-693.17.1.el7.x86_64 Network Based Computing Laboratory ISC’19 Verbs-level Performance: Bandwidth

regular regular regular acclerated acclerated acclerated 1000 1000 1000

100 100 100 Million Bytes / secondMillion / Bytes secondMillion / Bytes Million Bytes / secondMillion / Bytes 16.39 18.35 Read Write 16.10 Send 12.21 14.72 10 10 10 14.13 2 8 32 128 512 2 8 32 128 512 2 8 32 128 512 Message Size (Bytes) Message Size (Bytes) Message Size (Bytes)

ConnectX-5 EDR (100 Gbps), Intel Broadwell E5-2680 @ 2.4 GHz MOFED 4.2-1, RHEL-7 3.10.0-693.17.1.el7.x86_64 Network Based Computing Laboratory ISC’19 Hardware Tag Matching Support

• Offloads the processing of point-to-point MPI messages from the host processor to HCA • Enables zero copy of MPI message transfers – Messages are written directly to the user's buffer without extra buffering and copies • Provides rendezvous progress offload to HCA – Increases the overlap of communication and computation

Network Based Computing Laboratory ISC’19 6 Impact of Zero Copy MPI Message Passing using HW Tag Matching osu_latency osu_latency Eager Message Range 35% Rendezvous Message Range 8 400 6 300 4 200 2 100 Latency (us) Latency Latency (us) Latency 0 0 0 2 8 32 128 512 2K 8K 32K 64K 128K 256K 512K 1M 2M 4M Message Size (byte) Message Size (byte)

MVAPICH2 MVAPICH2+HW-TM MVAPICH2 MVAPICH2+HW-TM

Removal of intermediate buffering/copies can lead up to 35% performance improvement in latency of medium messages

Network Based Computing Laboratory ISC’19 7 Impact of Rendezvous Offload using HW Tag Matching

osu_iscatterv osu_iscatterv 640 Processes 1,280 Processes 1.7 X 1.8 X 50000 80000 40000 60000 30000 40000 20000 20000 Latency (us) Latency Latency (us) Latency 10000 0 0 16K 32K 64K 128K 256K 512K 16K 32K 64K 128K 256K 512K Message Size (byte) Message Size (byte)

MVAPICH2 MVAPICH2+HW-TM MVAPICH2 MVAPICH2+HW-TM The increased overlap can lead to 1.8X performance improvement in total latency of osu_iscatterv Network Based Computing Laboratory ISC’19 8 Future Plans

• Complete designs are being worked out • Will be available in the future MVAPICH2 releases

Network Based Computing Laboratory ISC’19 9