Optimizing Point-to-Point Communication between AMPI Endpoints in Shared Memory

Sam White & Laxmikant Kale, UIUC

Motivation

• Exascale trends:

• HW: increased node parallelism, decreased memory per core

• SW: applications themselves becoming more dynamic

• How should applications and runtimes respond?

• MPI+X (X=OpenMP, Kokkos, MPI, etc)?

• New language? Legion, Charm++, HPX, etc?

Motivation

• MPI+X performance:
• Either serialize around communication …
• Or incur synchronization costs inside MPI
• Semantic restrictions can help

• What if we hoist threading into MPI?

• Performant MPI+X often requires similar hoisting

• Threaded MPIs: MPC-MPI, FG-MPI, TMPI, AMPI

• Similar to the MPI Endpoints proposal

Motivation

• Questions:
• Why is AMPI’s existing implementation not as fast as we might expect (vs. process-based MPIs)?
• What can be done to improve it?
• More generally, what are the costs of process boundaries & kernel-assisted IPC?
• What can MPI Endpoints offer in terms of pt2pt latency & bandwidth in shared memory?

Overview

• Introduction to AMPI
• Execution model
• Existing shared memory performance

• Shared memory optimizations
• Locality: intra-core vs intra-process
• Size: small vs large messages

• Conclusions & future work

Adaptive MPI

• AMPI is an MPI implementation on top of Charm++

• AMPI offers Charm++’s application-independent features to MPI programmers:
• Overdecomposition
• Communication/computation overlap
• Dynamic load balancing
• Online fault tolerance
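For concreteness, here is a minimal sketch of what this looks like for an application: an ordinary MPI program compiled with AMPI and launched with more ranks than cores. The ampicxx wrapper, charmrun launcher, and +p/+vp flags follow the AMPI manual's usage; exact options may vary by version.

    // hello.cpp: an unmodified MPI program. Under AMPI, each "rank" below is a
    // migratable user-level thread, so more ranks than cores can be requested.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      std::printf("rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }

    // Build and run (flags as in the AMPI manual; versions may differ):
    //   ampicxx -o hello hello.cpp
    //   ./charmrun ./hello +p4 +vp16   // 16 AMPI ranks overdecomposed onto 4 cores

Requesting more virtual processors (+vp) than physical cores (+p) is what gives the runtime the slack to overlap communication with computation and to migrate ranks for load balancing.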

Execution Model

• AMPI ranks are User-Level Threads (ULTs)

• Can have multiple per core

• Fast to context switch

• Scheduled based on message delivery (see the example after this list)

• Migratable across cores and nodes at runtime

• For load balancing & online checkpoint/restart
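As an example of the message-driven scheduling mentioned above, two ranks sharing a core overlap communication and computation automatically: while one blocks in MPI_Recv, its ULT is suspended and the other runs. This is a fragment-level sketch using only standard MPI calls; rank, buf, and n come from the usual MPI setup, compute() is a hypothetical local routine, and the overlap is supplied by AMPI's ULT scheduler rather than by anything in the code.

    // Ranks 0 and 1 may share a core under AMPI. While rank 0 blocks in
    // MPI_Recv, its ULT is suspended and the scheduler runs rank 1's compute(),
    // overlapping communication and computation with no code changes.
    if (rank == 0) {
      MPI_Recv(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
      compute();                                        // hypothetical local work
      MPI_Send(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }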

Execution Model

[Diagram: Ranks 0-4 run as user-level threads above per-core schedulers on Core 0 and Core 1 of Node 0; one rank issues an MPI_Send()]

Shared Memory

• AMPI can be built in two different modes

• Non-SMP: 1 process per core/hyperthread

• SMP: 1 process per node/socket/NUMA domain
• N worker threads, 1 dedicated communication thread per process

AMPI Shared Memory

• Many AMPI ranks can share the same OS process

Performance Analysis

• OSU Microbenchmarks v5.3: osu_latency, osu_bibw

• Quartz (LLNL): Intel Xeon (Ivybridge)

• MVAPICH2 2.2, Intel MPI 2018, OpenMPI 2.0.0

• Cori (NERSC): Intel Xeon (Haswell)

• Cray MPI 6.7.0

• AMPI (6.8.0) vs AMPI-shm (6.9.0-beta)
• P1: two ranks co-located on the same core
• P2: two ranks on different cores, in the same OS process

Existing Performance

• Small message latency on Quartz

[Figure: OSU MPI Latency Benchmark on Quartz (LLNL); 1-way latency (us) vs. message size (1 B to 64 KB) for MVAPICH P2, Intel MPI P2, OpenMPI P2, AMPI P2, and AMPI P1]

Existing Performance

• Large message latency on Quartz

[Figures: OSU MPI Latency Benchmark on Quartz (LLNL); latency (us) vs. message size, shown in two panels (64 KB to 4 MB and 4 MB to 64 MB)]

Existing Performance

• Bidirectional Bandwidth on Quartz

Existing Performance

• Small message latency on Cori-Haswell

Existing Performance

• Bidirectional Bandwidth on Cori-Haswell

Performance Analysis

• Breakdown of P1 time (us) per message on Quartz

• Scheduling: Charm++ scheduler & ULT context switching

• Memory copy: message payload movement

• Other: AMPI message creation & matching

Scheduling Overhead

1. Even for P1, all AMPI messages traveled through Charm++’s scheduler
• Use Charm++ inline tasks

2. ULT context switching overhead

• Faster ULT context switching, e.g. QuickThreads ULTs

3. Avoid resuming threads without real progress

• MPI_Waitall: keep track of “blocked-on” requests (sketched below)

P1 0-B latency: 1.27 us -> 0.66 us
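A rough sketch of the third optimization, with illustrative names rather than AMPI's real internals; suspend() and resume() stand for the ULT context-switch primitives. The waiting ULT records how many of its requests are still pending, and the scheduler resumes it only when that count reaches zero, instead of waking it for every delivered message.

    // Illustrative sketch of "blocked-on" request tracking (not AMPI's real API).
    struct ULT;                            // a user-level thread (one AMPI rank)
    void suspend(ULT *t);                  // ULT context-switch primitives,
    void resume(ULT *t);                   // assumed to be provided by the runtime

    struct ULT { int blocked_on = 0; };
    struct Request { bool complete = false; ULT *waiter = nullptr; };

    void waitall(ULT *self, Request **reqs, int n) {
      for (int i = 0; i < n; i++)
        if (!reqs[i]->complete) { reqs[i]->waiter = self; self->blocked_on++; }
      if (self->blocked_on > 0)
        suspend(self);                     // sleep until the last request completes
    }

    void on_delivery(Request *req) {       // called when a matching message arrives
      req->complete = true;
      if (req->waiter && --req->waiter->blocked_on == 0)
        resume(req->waiter);               // only now has the rank made real progress
    }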

Memory Copy Overhead

• Q: Even with inline tasks, AMPI P1 performs poorly for large messages. Why?

• A: Charm++ messaging semantics do not match MPI’s

• In Charm++, messages are first class objects

• Users pass ownership of messages to the runtime when sending and assume it when receiving
• Only applications that can reuse message objects in their data structures can perform “zero copy” transfers
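A conceptual sketch of this ownership model in plain C++ (not the Charm++ message API; runtime_send() and deliver() are stand-ins): the sender hands the message object off and may not reuse it, the receiver takes ownership on delivery, and MPI's receive-into-a-user-buffer semantics therefore force an extra copy unless the protocol changes.

    #include <algorithm>
    #include <cstddef>
    #include <memory>
    #include <vector>

    struct Message { std::vector<double> payload; };     // a first-class message

    void runtime_send(std::unique_ptr<Message> msg);      // stand-in for the runtime

    void sender() {
      auto msg = std::make_unique<Message>();
      msg->payload.assign(1024, 1.0);
      runtime_send(std::move(msg));   // ownership passes to the runtime;
    }                                 // the sender may not touch msg again

    // The receiver assumes ownership of the delivered object. Only an application
    // that can keep using msg directly gets a "zero copy" transfer; an MPI_Recv
    // into a caller-provided buffer needs this extra copy instead.
    void deliver(std::unique_ptr<Message> msg, double *recvbuf, std::size_t n) {
      std::copy(msg->payload.begin(), msg->payload.begin() + n, recvbuf);
    }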

Memory Copy Overhead

• To overcome Charm++ messaging semantics in shared memory, use a rendezvous protocol:

• Receiver performs a direct (userspace) memcpy from sendbuf to recvbuf (sketched below)

• Benefit: avoid intermediate copy

• Cost: sender must suspend & be resumed upon copy completion

P1 1-MB latency: 165 us -> 82 us
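A minimal sketch of that rendezvous path within one address space, using hypothetical names rather than AMPI's implementation; post_header() stands in for the runtime's message matching, and suspend()/resume() are the ULT context-switch primitives.

    // Rendezvous inside a shared address space (illustrative sketch only).
    #include <cstddef>
    #include <cstring>

    struct ULT;                         // a user-level thread (one AMPI rank)
    void suspend(ULT *t);               // ULT context-switch primitives from the runtime
    void resume(ULT *t);

    struct RtsHeader {                  // small control message: no payload copied yet
      const void *sendbuf;
      std::size_t bytes;
      ULT        *sender;
    };
    void post_header(const RtsHeader &h);   // hypothetical: hand header to matching

    // Sender: publish the buffer, then block until the receiver has copied it.
    void send_rendezvous(ULT *self, const void *sendbuf, std::size_t bytes) {
      post_header(RtsHeader{sendbuf, bytes, self});
      suspend(self);                    // resumed by the receiver after its memcpy
    }

    // Receiver: one direct userspace copy, then release the sender.
    void recv_rendezvous(const RtsHeader &hdr, void *recvbuf) {
      std::memcpy(recvbuf, hdr.sendbuf, hdr.bytes);   // sendbuf -> recvbuf, no staging
      resume(hdr.sender);               // sender's buffer is now safe to reuse
    }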

Other Overheads

• Sender-side:
• Create a Charm++ message object & a request

• Receiver-side:
• Create a request, create a matching queue entry, and enqueue it in unexpected_msgs or posted_requests

• Solution: use memory pools for fixed-size, frequently-used objects (see the sketch below)

P1 0-B latency: 0.66 us -> 0.54 us
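A minimal free-list pool for one fixed-size object type, as a generic C++ sketch rather than AMPI's actual allocator: freed objects go onto a singly linked list and are handed back out on the next allocation, keeping malloc/free off the per-message fast path.

    // Fixed-size free-list pool (illustrative sketch, not AMPI's allocator).
    #include <cstdlib>
    #include <new>

    template <typename T>
    class Pool {
      union Slot { Slot *next; alignas(T) unsigned char storage[sizeof(T)]; };
      Slot *free_list = nullptr;                          // singly linked free list
    public:
      template <typename... Args>
      T *alloc(Args&&... args) {
        Slot *s = free_list;
        if (s) free_list = s->next;                       // reuse a freed slot
        else   s = static_cast<Slot *>(std::malloc(sizeof(Slot)));
        return new (s->storage) T(static_cast<Args&&>(args)...);  // placement-new
      }
      void release(T *obj) {
        obj->~T();                                        // destroy, keep the memory
        Slot *s = reinterpret_cast<Slot *>(obj);
        s->next = free_list;                              // push back on the free list
        free_list = s;
      }
    };
    // Usage sketch (Req is a placeholder type):
    //   Pool<Req> pool;  Req *r = pool.alloc();  ...  pool.release(r);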

AMPI-shm Performance

• Small message latency on Quartz

• AMPI-shm P2 is faster than the other implementations for messages of 2 KB and larger

AMPI-shm Performance

• Large message latency on Quartz

• AMPI-shm P2 fastest for all large messages, up to 2.33x faster than process-based MPIs for 32+ MB

AMPI-shm Performance

• Bidirectional bandwidth on Quartz

• AMPI-shm can utilize full memory bandwidth

• 26% higher peak bandwidth, and 2x the bandwidth of the others for 32+ MB

[Figure: OSU MPI Bidirectional Bandwidth Benchmark; bidirectional bandwidth (MB/s) vs. message size (4 KB to 64 MB)]

AMPI-shm Performance

• Small message latency on Cori-Haswell

[Figure: OSU MPI Latency Benchmark on Cori-Haswell (NERSC); latency (us) vs. message size (1 B to 64 KB) for Cray MPI P2, AMPI-shm P2, and AMPI-shm P1]

AMPI-shm Performance

• Large message latency on Cori-Haswell
• AMPI-shm P2 is 47% faster than Cray MPI at 32+ MB

AMPI-shm Performance

• Bidirectional bandwidth on Cori-Haswell
• Cray MPI on XPMEM performs similarly to AMPI-shm up to 16 MB

[Figure: OSU MPI Bidirectional Bandwidth Benchmark on Cori-Haswell (NERSC); bidirectional bandwidth (MB/s) vs. message size (4 KB to 64 MB) for STREAM copy, Cray MPI, AMPI, and AMPI-shm]

Future Work

• User-space shared memory optimizations for: Collectives, Derived Datatypes, RMA, and SHM

• Testing with applications

• Interprocess “zero copy” communication
• Requires new Charm++ messaging semantics
• Sender & receiver both need completion callbacks
• New messaging API under development: implementations for OFI, Verbs, uGNI, PAMI, MPI

Summary

• User-space communication offers portable intranode messaging performance
• Lower latency: 1.5x-2.3x for large messages
• Higher bandwidth: 1.3x-2x for large messages
• Intermediate buffering unnecessary for medium/large messages

• Shared-memory-aware endpoints can provide lower latency & higher bandwidth for messaging within a node

This material is based in part upon work supported by the Department of Energy, National Nuclear Security Administration, under Award Number DE-NA0002374.

Questions? Thank you
