Optimizing Point-to-Point Communication between AMPI Endpoints in Shared Memory

Sam White & Laxmikant Kale, UIUC

Motivation

• Exascale trends:

• HW: increased node parallelism, decreased memory per core

• SW: applications themselves becoming more dynamic

• How should applications and runtimes respond?

• MPI+X (X=OpenMP, Kokkos, MPI, etc)?

• New language? Legion, Charm++, HPX, etc?

Motivation

• MPI+X performance:
• Either serialize around communication …
• Or incur synchronization costs inside MPI
• Semantic restrictions can help

• What if we hoist threading into MPI?

• Performant MPI+X often requires similar hoisting

• Threaded MPIs: MPC-MPI, FG-MPI, TMPI, AMPI

• Similar to the MPI Endpoints proposal

Motivation

• Questions:
• Why is AMPI’s existing implementation not as fast as we might expect (vs. process-based MPIs)?
• What can be done to improve it?
• More generally, what are the costs of process boundaries & kernel-assisted IPC?
• What can MPI Endpoints offer in terms of pt2pt latency & bandwidth in shared memory?

Overview

• Introduction to AMPI
• Execution model
• Existing shared memory performance

• Shared memory optimizations
• Locality: intra-core vs intra-process
• Size: small vs large messages

• Conclusions & future work

Adaptive MPI

• AMPI is an MPI implementation on top of Charm++

• AMPI offers Charm++’s application-independent features to MPI programmers:
• Overdecomposition
• Communication/computation overlap
• Dynamic load balancing
• Online fault tolerance
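For concreteness, here is a minimal sketch of what this looks like for an application: an ordinary MPI program compiled with AMPI and launched with more ranks than cores. The ampicxx wrapper, charmrun launcher, and +p/+vp flags follow the AMPI manual's usage; exact options may vary by version.

    // hello.cpp: an unmodified MPI program. Under AMPI, each "rank" below is a
    // migratable user-level thread, so more ranks than cores can be requested.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      std::printf("rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }

    // Build and run (flags as in the AMPI manual; versions may differ):
    //   ampicxx -o hello hello.cpp
    //   ./charmrun ./hello +p4 +vp16   // 16 AMPI ranks overdecomposed onto 4 cores

Requesting more virtual processors (+vp) than physical cores (+p) is what gives the runtime the slack to overlap communication with computation and to migrate ranks for load balancing.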

Execution Model

• AMPI ranks are User-Level Threads (ULTs)

• Can have multiple per core

• Fast to context switch

• Scheduled based on message delivery (see the example after this list)

• Migratable across cores and nodes at runtime

• For load balancing & online checkpoint/restart
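As an example of the message-driven scheduling mentioned above, two ranks sharing a core overlap communication and computation automatically: while one blocks in MPI_Recv, its ULT is suspended and the other runs. This is a fragment-level sketch using only standard MPI calls; rank, buf, and n come from the usual MPI setup, compute() is a hypothetical local routine, and the overlap is supplied by AMPI's ULT scheduler rather than by anything in the code.

    // Ranks 0 and 1 may share a core under AMPI. While rank 0 blocks in
    // MPI_Recv, its ULT is suspended and the scheduler runs rank 1's compute(),
    // overlapping communication and computation with no code changes.
    if (rank == 0) {
      MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      compute();                                        // hypothetical local work
      MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }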

Execution Model

[Diagram: Ranks 0-4 run as user-level threads above per-core schedulers on Core 0 and Core 1 of Node 0; one rank issues an MPI_Send()]

Shared Memory

• AMPI can be built in two different modes

• Non-SMP: 1 process per core/hyperthread

• SMP: 1 process per node/socket/NUMA domain
• N worker threads, 1 dedicated communication thread per process

AMPI Shared Memory

• Many AMPI ranks can share the same OS process

Performance Analysis

• OSU Microbenchmarks v5.3: osu_latency, osu_bibw

• Quartz (LLNL): Intel Xeon (Ivybridge)

• MVAPICH2 2.2, Intel MPI 2018, OpenMPI 2.0.0

• Cori (NERSC): Intel Xeon (Haswell)

• Cray MPI 6.7.0

• AMPI (6.8.0) vs AMPI-shm (6.9.0-beta)
• P1: two ranks co-located on the same core
• P2: two ranks on different cores, in the same OS process

Existing Performance

• Small message latency on Quartz

[Figure: OSU MPI Latency Benchmark on Quartz (LLNL); 1-way latency (us) vs. message size (1 B to 64 KB) for MVAPICH P2, Intel MPI P2, OpenMPI P2, AMPI P2, and AMPI P1]

Existing Performance

• Large message latency on Quartz

[Figures: OSU MPI Latency Benchmark on Quartz (LLNL); latency (us) vs. message size, shown in two panels (64 KB to 4 MB and 4 MB to 64 MB)]

Existing Performance

• Bidirectional Bandwidth on Quartz

Existing Performance

• Small message latency on Cori-Haswell

Existing Performance

• Bidirectional Bandwidth on Cori-Haswell

Performance Analysis

• Breakdown of P1 time (us) per message on Quartz

• Scheduling: Charm++ scheduler & ULT context switching

• Memory copy: message payload movement

• Other: AMPI message creation & matching

Scheduling Overhead

1. Even for P1, all AMPI messages traveled through Charm++’s scheduler
• Use Charm++ inline tasks

2. ULT context switching overhead

• Faster ULT context switching, e.g. QuickThreads ULTs

3. Avoid resuming threads without real progress

• MPI_Waitall: keep track of “blocked-on” requests (sketched below)

P1 0-B latency: 1.27 us -> 0.66 us
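A rough sketch of the third optimization, with illustrative names rather than AMPI's real internals; suspend() and resume() stand for the ULT context-switch primitives. The waiting ULT records how many of its requests are still pending, and the scheduler resumes it only when that count reaches zero, instead of waking it for every delivered message.

    // Illustrative sketch of "blocked-on" request tracking (not AMPI's real API).
    struct ULT;                            // a user-level thread (one AMPI rank)
    void suspend(ULT *t);                  // ULT context-switch primitives,
    void resume(ULT *t);                   // assumed to be provided by the runtime

    struct ULT { int blocked_on = 0; };
    struct Request { bool complete = false; ULT *waiter = nullptr; };

    void waitall(ULT *self, Request **reqs, int n) {
      for (int i = 0; i < n; i++)
        if (!reqs[i]->complete) { reqs[i]->waiter = self; self->blocked_on++; }
      if (self->blocked_on > 0)
        suspend(self);                     // sleep until the last request completes
    }

    void on_delivery(Request *req) {       // called when a matching message arrives
      req->complete = true;
      if (req->waiter && --req->waiter->blocked_on == 0)
        resume(req->waiter);               // only now has the rank made real progress
    }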

Memory Copy Overhead

• Q: Even with inline tasks, AMPI P1 performs poorly for large messages. Why?

• A: Charm++ messaging semantics do not match MPI’s

• In Charm++, messages are first class objects

• Users pass ownership of messages to the runtime when sending and assume it when receiving
• Only applications that can reuse message objects in their data structures can perform “zero copy” transfers
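A conceptual sketch of this ownership model in plain C++ (not the Charm++ message API; runtime_send() and deliver() are stand-ins): the sender hands the message object off and may not reuse it, the receiver takes ownership on delivery, and MPI's receive-into-a-user-buffer semantics therefore force an extra copy unless the protocol changes.

    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <vector>

    struct Message { std::vector<double> payload; };     // a first-class message

    void runtime_send(std::unique_ptr<Message> msg);      // stand-in for the runtime

    void sender() {
      auto msg = std::make_unique<Message>();
      msg->payload.assign(1024, 1.0);
      runtime_send(std::move(msg));   // ownership passes to the runtime;
    }                                 // the sender may not touch msg again

    // The receiver assumes ownership of the delivered object. Only an application
    // that can keep using msg directly gets a "zero copy" transfer; an MPI_Recv
    // into a caller-provided buffer needs this extra copy instead.
    void deliver(std::unique_ptr<Message> msg, double *recvbuf, std::size_t n) {
      std::copy(msg->payload.begin(), msg->payload.begin() + n, recvbuf);
    }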

Memory Copy Overhead

• To overcome Charm++ messaging semantics in shared memory, use a rendezvous protocol:

• Receiver performs a direct (userspace) memcpy from sendbuf to recvbuf (sketched below)

• Benefit: avoid intermediate copy

• Cost: sender must suspend & be resumed upon copy completion

P1 1-MB latency: 165 us -> 82 us
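A minimal sketch of that rendezvous path within one address space, using hypothetical names rather than AMPI's implementation; post_header() stands in for the runtime's message matching, and suspend()/resume() are the ULT context-switch primitives.

    // Rendezvous inside a shared address space (illustrative sketch only).
    #include <cstddef>
    #include <cstring>

    struct ULT;                         // a user-level thread (one AMPI rank)
    void suspend(ULT *t);               // ULT context-switch primitives from the runtime
    void resume(ULT *t);

    struct RtsHeader {                  // small control message: no payload copied yet
      const void *sendbuf;
      std::size_t bytes;
      ULT        *sender;
    };
    void post_header(const RtsHeader &h);   // hypothetical: hand header to matching

    // Sender: publish the buffer, then block until the receiver has copied it.
    void send_rendezvous(ULT *self, const void *sendbuf, std::size_t bytes) {
      post_header(RtsHeader{sendbuf, bytes, self});
      suspend(self);                    // resumed by the receiver after its memcpy
    }

    // Receiver: one direct userspace copy, then release the sender.
    void recv_rendezvous(const RtsHeader &hdr, void *recvbuf) {
      std::memcpy(recvbuf, hdr.sendbuf, hdr.bytes);   // sendbuf -> recvbuf, no staging
      resume(hdr.sender);               // sender's buffer is now safe to reuse
    }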

Other Overheads

• Sender-side:
• Create a Charm++ message object & a request

• Receiver-side:
• Create a request, create a matching queue entry, and enqueue it in unexpected_msgs or posted_requests

• Solution: use memory pools for fixed-size, frequently-used objects (see the sketch below)

P1 0-B latency: 0.66 us -> 0.54 us
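A minimal free-list pool for one fixed-size object type, as a generic C++ sketch rather than AMPI's actual allocator: freed objects go onto a singly linked list and are handed back out on the next allocation, keeping malloc/free off the per-message fast path.

    // Fixed-size free-list pool (illustrative sketch, not AMPI's allocator).
    #include <cstdlib>
    #include <new>

    template <typename T>
    class Pool {
      union Slot { Slot *next; alignas(T) unsigned char storage[sizeof(T)]; };
      Slot *free_list = nullptr;                          // singly linked free list
    public:
      template <typename... Args>
      T *alloc(Args&&... args) {
        Slot *s = free_list;
        if (s) free_list = s->next;                       // reuse a freed slot
        else   s = static_cast<Slot *>(std::malloc(sizeof(Slot)));
        return new (s->storage) T(static_cast<Args&&>(args)...);  // placement-new
      }
      void release(T *obj) {
        obj->~T();                                        // destroy, keep the memory
        Slot *s = reinterpret_cast<Slot *>(obj);
        s->next = free_list;                              // push back on the free list
        free_list = s;
      }
    };
    // Usage sketch (Req is a placeholder type):
    //   Pool<Req> pool;  Req *r = pool.alloc();  ...  pool.release(r);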

AMPI-shm Performance

• Small message latency on Quartz

• AMPI-shm P2 is faster than the other implementations for messages of 2 KB and larger

AMPI-shm Performance

• Large message latency on Quartz

• AMPI-shm P2 fastest for all large messages, up to 2.33x faster than process-based MPIs for 32+ MB

AMPI-shm Performance

• Bidirectional bandwidth on Quartz

• AMPI-shm can utilize full memory bandwidth

• 26% higher peak bandwidth, and 2x the bandwidth of the others for 32+ MB

[Figure: OSU MPI Bidirectional Bandwidth Benchmark; bidirectional bandwidth (MB/s) vs. message size (4 KB to 64 MB)]

AMPI-shm Performance

• Small message latency on Cori-Haswell

[Figure: OSU MPI Latency Benchmark on Cori-Haswell (NERSC); latency (us) vs. message size (1 B to 64 KB) for Cray MPI P2, AMPI-shm P2, and AMPI-shm P1]

AMPI-shm Performance

• Large message latency on Cori-Haswell
• AMPI-shm P2 is 47% faster than Cray MPI at 32+ MB

AMPI-shm Performance

• Bidirectional bandwidth on Cori-Haswell
• Cray MPI on XPMEM performs similarly to AMPI-shm up to 16 MB

[Figure: OSU MPI Bidirectional Bandwidth Benchmark on Cori-Haswell (NERSC); bidirectional bandwidth (MB/s) vs. message size (4 KB to 64 MB) for STREAM copy, Cray MPI, AMPI, and AMPI-shm]

Future Work

• User-space shared memory optimizations for: Collectives, Derived Datatypes, RMA, and SHM

• Testing with applications

• Interprocess “zero copy” communication
• Requires new Charm++ messaging semantics
• Sender & receiver both need completion callbacks
• New messaging API under development: implementations for OFI, Verbs, uGNI, PAMI, MPI

Summary

• User-space communication offers portable intranode messaging performance
• Lower latency: 1.5x-2.3x for large messages
• Higher bandwidth: 1.3x-2x for large messages
• Intermediate buffering unnecessary for medium/large messages

• Shared-memory-aware endpoints can provide lower latency & higher bandwidth for messaging within a node

This material is based in part upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0002374.

Questions? Thank you
