Optimizing Point-to-Point Communication between AMPI Endpoints in Shared Memory
Sam White & Laxmikant Kale, UIUC
ExaMPI 2017

Motivation
• Exascale trends:
• HW: increased node parallelism, decreased memory per thread
• SW: applications themselves becoming more dynamic
• How should applications and runtimes respond?
• MPI+X (X=OpenMP, Kokkos, MPI, etc)?
• New language? Legion, Charm++, HPX, etc?
Motivation
• MPI+X performance:
  • Either serialize around communication…
  • Or incur synchronization costs inside MPI
  • Semantic restrictions can help
• What if we hoist threading into MPI?
• Performant MPI+X often requires similar hoisting
• Threaded MPIs: MPC-MPI, FG-MPI, TMPI, AMPI
• Similar to the MPI Endpoints proposal
Motivation
• Questions:
  • Why is AMPI’s existing implementation not as fast as we might expect (vs. process-based MPIs)?
  • What can be done to improve it?
  • More generally, what are the costs of process boundaries & kernel-assisted IPC?
  • What can MPI Endpoints offer in terms of pt2pt latency & bandwidth in shared memory?
Overview
• Introduction to AMPI
  • Execution model
  • Existing shared memory performance
• Shared memory optimizations
  • Locality: intra-core vs. intra-process
  • Size: small vs. large messages
• Conclusions & future work
Adaptive MPI
• AMPI is an MPI implementation on top of Charm++
• AMPI offers Charm++’s application-independent features to MPI programmers:
  • Overdecomposition
  • Communication/computation overlap
  • Dynamic load balancing
  • Online fault tolerance
Execution Model
• AMPI ranks are User-Level Threads (ULTs)
• Can have multiple per core
• Fast to context switch
• Scheduled based on message delivery
• Migratable across cores and nodes at runtime
• For load balancing & online checkpoint/restart
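To make the ULT execution model concrete, here is a minimal sketch of userspace thread switching using POSIX ucontext. AMPI's actual ULT layer is its own (e.g. QuickThreads or Boost contexts), so this is only an illustration of why ULT context switches are cheap: the switch swaps registers and stacks entirely in userspace, with no kernel involvement.

```c
/* Minimal ULT sketch with POSIX ucontext (illustrative only;
   AMPI uses its own ULT implementations, not ucontext). */
#include <ucontext.h>

static ucontext_t sched_ctx, rank_ctx;
static int steps = 0;

static void rank_body(void) {
    /* A "rank" that runs in two slices, yielding to the scheduler. */
    steps++;
    swapcontext(&rank_ctx, &sched_ctx);   /* yield back to scheduler */
    steps++;
    swapcontext(&rank_ctx, &sched_ctx);   /* yield again */
}

int run_demo(void) {
    static char stack[64 * 1024];         /* the ULT's private stack */
    getcontext(&rank_ctx);
    rank_ctx.uc_stack.ss_sp = stack;
    rank_ctx.uc_stack.ss_size = sizeof stack;
    rank_ctx.uc_link = &sched_ctx;        /* where to go if the ULT returns */
    makecontext(&rank_ctx, rank_body, 0);

    swapcontext(&sched_ctx, &rank_ctx);   /* schedule the rank */
    swapcontext(&sched_ctx, &rank_ctx);   /* resume it after its yield */
    return steps;                         /* the rank ran in two slices */
}
```

Both `swapcontext` calls here are ordinary userspace function calls, which is what makes scheduling many ranks per core affordable.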
Execution Model
[Diagram: Ranks 0–2 on Core 0 and Ranks 3–4 on Core 1, each core with its own scheduler, within Node 0; Rank 1 is calling MPI_Send().]
Shared Memory
• AMPI can be built in two different modes
• Non-SMP: 1 process per core/hyperthread
• SMP: 1 process per node/socket/NUMA domain
  • N worker threads, 1 dedicated communication thread per process
AMPI Shared Memory
• Many AMPI ranks can share the same OS process
Performance Analysis
• OSU Microbenchmarks v5.3: osu_latency, osu_bibw
• Quartz (LLNL): Intel Xeon (Ivybridge)
• MVAPICH2 2.2, Intel MPI 2018, OpenMPI 2.0.0
• Cori (NERSC): Intel Xeon (Haswell)
• Cray MPI 6.7.0
• AMPI (6.8.0) vs. AMPI-shm (6.9.0-beta)
• P1: two ranks co-located on the same core
• P2: two ranks on different cores, in the same OS process
Existing Performance
• Small message latency on Quartz
[Figure: OSU MPI Latency Benchmark on Quartz (LLNL); 1-way latency (µs) vs. message size (1 B–64 KB) for MVAPICH P2, IMPI P2, OpenMPI P2, AMPI P2, and AMPI P1.]
Existing Performance
• Large message latency on Quartz
[Figures: OSU MPI Latency Benchmark on Quartz (LLNL); latency (µs) vs. message size, one panel for 64 KB–4 MB and one for 4–64 MB.]
Existing Performance
• Bidirectional Bandwidth on Quartz
Existing Performance
• Small message latency on Cori-Haswell
Existing Performance
• Bidirectional Bandwidth on Cori-Haswell
Performance Analysis
• Breakdown of P1 time (us) per message on Quartz
• Scheduling: Charm++ scheduler & ULT ctx
• Memory copy: message payload movement
• Other: AMPI message creation & matching
Scheduling Overhead
1. Even for P1, all AMPI messages traveled through Charm++’s scheduler
  • Use Charm++ inline tasks
2. ULT context switching overhead
  • Faster ULT context switches: Boost or QuickThreads ULTs
3. Avoid resuming threads without real progress
  • MPI_Waitall: keep track of the requests each thread is blocked on
P1 0-byte latency: 1.27 µs → 0.66 µs
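The "blocked on" bookkeeping in fix 3 can be sketched as follows: each incomplete request points at a wait counter for the blocked rank, and the rank is marked resumable only when the counter reaches zero, i.e. on real progress rather than on every message arrival. The names (`wait_set`, `ampi_request`, `waitall_begin`) are illustrative, not AMPI's actual internals.

```c
/* Hedged sketch of "blocked-on" request tracking for MPI_Waitall.
   All names are hypothetical; only the bookkeeping idea is from
   the slide above. */
#include <stdbool.h>
#include <stddef.h>

typedef struct { int pending; bool resumed; } wait_set;

typedef struct { bool complete; wait_set *blocked_on; } ampi_request;

/* Called when a rank enters Waitall: count its incomplete requests. */
static void waitall_begin(wait_set *ws, ampi_request **reqs, int n) {
    ws->pending = 0;
    ws->resumed = false;
    for (int i = 0; i < n; i++) {
        if (!reqs[i]->complete) {
            reqs[i]->blocked_on = ws;
            ws->pending++;
        }
    }
    if (ws->pending == 0)
        ws->resumed = true;      /* nothing to wait for */
}

/* Called by the runtime when a message completes a request. */
static void request_complete(ampi_request *req) {
    req->complete = true;
    wait_set *ws = req->blocked_on;
    if (ws && --ws->pending == 0)
        ws->resumed = true;      /* only now resume the blocked ULT */
}
```

Completing the first of two requests leaves the rank suspended; only the last completion triggers a resume, avoiding wasted context switches.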
Memory Copy Overhead
• Q: Even with inline tasks, AMPI P1 performs poorly for large messages. Why?
• A: Charm++ messaging semantics do not match MPI’s
• In Charm++, messages are first class objects
• Users pass ownership of messages to the runtime when sending and assume it when receiving
• Only applications that can reuse message objects in their data structures can perform “zero copy” transfers
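The semantic mismatch above implies an extra copy on the eager path: because an MPI sender keeps its buffer after MPI_Send returns, the runtime must first copy the payload into a message object it owns, and the receiver then copies it out again. The sketch below illustrates that two-copy pattern with hypothetical structures, not Charm++'s actual message API.

```c
/* Illustrative eager path with an intermediate, runtime-owned copy.
   rt_msg, eager_send, and eager_recv are hypothetical names. */
#include <stdlib.h>
#include <string.h>

typedef struct { size_t len; char payload[]; } rt_msg;

/* Sender side: pack user data into a runtime-owned message object. */
static rt_msg *eager_send(const void *buf, size_t len) {
    rt_msg *m = malloc(sizeof(rt_msg) + len);
    m->len = len;
    memcpy(m->payload, buf, len);  /* intermediate copy #1 */
    return m;                      /* ownership passes to the runtime */
}

/* Receiver side: copy out of the message into the posted recv buffer. */
static void eager_recv(rt_msg *m, void *recvbuf) {
    memcpy(recvbuf, m->payload, m->len);  /* copy #2 */
    free(m);                              /* runtime frees its message */
}
```

For large payloads, copy #1 is pure overhead relative to a direct sendbuf-to-recvbuf transfer, which motivates the rendezvous protocol.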
Memory Copy Overhead
• To overcome Charm++ messaging semantics in shared memory, use a rendezvous protocol:
• Receiver performs a direct (userspace) memcpy from sendbuf to recvbuf
• Benefit: avoid intermediate copy
• Cost: sender must suspend & be resumed upon copy completion
P1 1-MB latency: 165 µs → 82 µs
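The rendezvous protocol described above can be sketched in a few lines: the sender publishes a pointer to its buffer and suspends its ULT; the receiver performs the single direct userspace copy and then marks the transfer done so the sender can be resumed. Names (`rndv_hdr`, `rndv_post`, `rndv_recv`) are illustrative, and the ULT suspend/resume is elided.

```c
/* Hedged sketch of the in-process rendezvous protocol: one direct
   copy from sendbuf to recvbuf, no intermediate message buffer. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const void *sendbuf;
    size_t      len;
    bool        done;   /* set by the receiver; sender resumes on it */
} rndv_hdr;

/* Sender: post the buffer; in AMPI the sending ULT then suspends. */
static void rndv_post(rndv_hdr *h, const void *buf, size_t len) {
    h->sendbuf = buf;
    h->len = len;
    h->done = false;
}

/* Receiver: one direct userspace copy, then signal completion. */
static void rndv_recv(rndv_hdr *h, void *recvbuf) {
    memcpy(recvbuf, h->sendbuf, h->len);  /* the only payload copy */
    h->done = true;                        /* lets the sender resume */
}
```

This works because both ranks share one address space, so the receiver can dereference the sender's buffer directly; the cost is the suspend/resume round trip noted above.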
Other Overheads
• Sender-side:
  • Create a Charm++ message object & a request
• Receiver-side:
  • Create a request, create a matching queue entry, enqueue in unexpected_msgs or posted_requests
• Solution: use memory pools for fixed-size, frequently-used objects
P1 0-byte latency: 0.66 µs → 0.54 µs
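A memory pool of the kind suggested above can be as simple as a free list: freed objects are chained and handed back on the next allocation, so the per-message request and queue-entry objects avoid malloc/free on the fast path. The object size and the names here are illustrative, not AMPI's.

```c
/* Free-list memory pool sketch for fixed-size, frequently used
   objects (requests, matching-queue entries). Illustrative only. */
#include <stdlib.h>

typedef union pool_node {
    union pool_node *next;  /* link used while the node is free */
    char obj[64];           /* hypothetical fixed object size */
} pool_node;

typedef struct { pool_node *free_list; } pool;

static void *pool_alloc(pool *p) {
    if (p->free_list) {                  /* fast path: reuse a node */
        pool_node *n = p->free_list;
        p->free_list = n->next;
        return n;
    }
    return malloc(sizeof(pool_node));    /* slow path: grow the pool */
}

static void pool_free(pool *p, void *obj) {
    pool_node *n = obj;
    n->next = p->free_list;              /* push back for reuse */
    p->free_list = n;
}
```

After the first allocation/free cycle, steady-state messaging allocates from the list in a handful of instructions.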
AMPI-shm Performance
• Small message latency on Quartz
• AMPI-shm P2 is faster than the other implementations for messages of 2 KB and larger
AMPI-shm Performance
• Large message latency on Quartz
• AMPI-shm P2 fastest for all large messages, up to 2.33x faster than process-based MPIs for 32+ MB
AMPI-shm Performance
• Bidirectional bandwidth on Quartz
• AMPI-shm can utilize full memory bandwidth
• 26% higher peak, 2x bandwidth for 32+ MB than others
[Figure: OSU MPI Bidirectional Bandwidth Benchmark on Cori-Haswell (NERSC); bidirectional bandwidth (MB/s, up to ~40000) vs. message size (4 KB–64 MB) for STREAM copy, Cray MPI, AMPI, and AMPI-shm.]

AMPI-shm Performance
• Small message latency on Cori-Haswell
[Figure: OSU MPI Latency Benchmark on Cori-Haswell (NERSC); latency (µs, 0.25–8) vs. message size (1 B–64 KB) for Cray MPI P2, AMPI-shm P2, and AMPI-shm P1.]
AMPI-shm Performance
• Large message latency on Cori-Haswell
• AMPI-shm P2 is 47% faster than Cray MPI at 32+ MB
AMPI-shm Performance
• Bidirectional bandwidth on Cori-Haswell
• Cray MPI on XPMEM performs similarly to AMPI-shm up to 16 MB
[Figure: OSU MPI Bidirectional Bandwidth Benchmark on Cori-Haswell (NERSC); bidirectional bandwidth (MB/s, up to ~40000) vs. message size (4 KB–64 MB) for STREAM copy, Cray MPI, AMPI, and AMPI-shm.]

Future Work
• User-space shared memory optimizations for: Collectives, Derived Datatypes, RMA, and SHM
• Testing with applications
• Interprocess “zero copy” communication
  • Requires new Charm++ messaging semantics
  • Sender & receiver both need completion callbacks
  • New messaging API under development: implementations for OFI, Verbs, uGNI, PAMI, MPI
Summary
• User-space communication offers portable intranode messaging performance
  • Lower latency: 1.5x–2.3x for large messages
  • Higher bandwidth: 1.3x–2x for large messages
  • Intermediate buffering is unnecessary for medium/large messages
• Shared-memory aware endpoints can provide lower latency & higher bandwidth for messaging within node
This material is based in part upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0002374.
Questions? Thank you