Implementation and Evaluation of OpenSHMEM Contexts Using OFI Libfabric
Max Grossman, Joseph Doyle, James Dinan, Howard Pritchard, Kayla Seager, Vivek Sarkar
Rice University · Carnegie Mellon University · Intel Corporation · Los Alamos National Laboratory
1 Explosion in Processor Parallelism/Diversity (LA-UR-17-27150)
2 Renaissance in Parallel Programming Models
To match, we have also seen an explosion in programming-model research and development for these new processor types: • Kokkos • OpenCL • Raja • OCR • Qthreads • UCX • Legion • HPX • PaRSEC • Apache Spark • Habanero • ATMI • HiHAT • ROCm • hStreams • HSA • CUDA
3 Between a Rock and a Hard Place
Communication middleware is then asked to flexibly support all of these new ways of programming HPC hardware, and to perform well for a variety of (often multi-tenant) workloads.
(Figure: communication middleware serving many client runtimes, e.g. OpenSHMEM, Legion, Apache Spark, Qthreads, OpenMP.)
4 Communication Thread Safety to Date
One approach to date: let the runtime do it (MPI-like thread safety).
Often the application or parallel runtime starts getting involved instead (e.g. AsyncSHMEM, Legion).
5 OpenSHMEM Contexts
Enables creation of separate logical endpoints into the network.
#pragma omp parallel
{
    shmem_ctx_t my_ctx;
    shmem_ctx_create(SHMEM_CTX_PRIVATE, &my_ctx);
    shmem_ctx_int_put(my_ctx, ...);
    shmem_ctx_destroy(my_ctx);
}
Several benefits: • Independent memory fences, quieting on different contexts • Relaxation of thread safety on individual contexts reduces overheads • Offers hints to runtime as to how to partition network resources • Significant performance benefits at small to medium packet sizes
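The benefits above can be sketched in code. A minimal hedged example, assuming an OpenSHMEM 1.4 implementation and OpenMP; the symmetric buffer `dest` and the peer-PE choice are illustrative, not from the talk:

```c
#include <shmem.h>
#include <omp.h>

int main(void) {
    static int dest[64];  /* symmetric destination buffer (illustrative) */
    shmem_init();
    int me = shmem_my_pe();
    int peer = (me + 1) % shmem_n_pes();

    #pragma omp parallel
    {
        shmem_ctx_t ctx;
        /* SHMEM_CTX_PRIVATE: only the creating thread uses this context,
         * so the runtime can relax thread-safety on its fast path. */
        if (shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx) == 0) {
            int val = omp_get_thread_num();
            shmem_ctx_int_put(ctx, &dest[val], &val, 1, peer);
            /* Quiet only this context's outstanding puts -- other
             * threads' contexts are unaffected (independent fencing). */
            shmem_ctx_quiet(ctx);
            shmem_ctx_destroy(ctx);
        }
    }
    shmem_finalize();
    return 0;
}
```

Note how `shmem_ctx_quiet` completes only the calling thread's operations, rather than a global `shmem_quiet` that would serialize all threads.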
6 Contributions
Implementation of OpenSHMEM Contexts over libfabric.
Implementation of several benchmarks over this new contexts implementation (SoS micro-benchmarks, Graph500, GUPS, Mandelbrot, Pipelined Reduction).
Evaluation of new benchmarks for performance and programmability using Edison and the Cray Aries interconnect.
7 libfabric
Platform independent, low-level library for high-performance interconnects.
An fi_domain represents a hardware NIC. • It encapsulates multiple fi_endpoint objects, which represent hardware resources for sending/receiving messages. • An fi_stx_ctx (shared transmit context) is bound to one or more fi_endpoint objects within an fi_domain and is used to inject packets.
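The object relationships above can be sketched against the libfabric API. An abbreviated, hedged sketch: error paths and attribute setup are omitted, and it assumes `domain` and `info` were already obtained via the usual fi_getinfo/fi_fabric/fi_domain sequence:

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_domain.h>

/* Sketch: create a shared transmit context (STX) and bind it to an
 * endpoint within a domain. The fi_domain (the NIC) owns both objects. */
static int bind_stx_to_ep(struct fid_domain *domain, struct fi_info *info)
{
    struct fid_stx *stx;
    struct fid_ep *ep;
    int ret;

    ret = fi_stx_context(domain, NULL, &stx, NULL);
    if (ret)
        return ret;

    ret = fi_endpoint(domain, info, &ep, NULL);
    if (ret)
        return ret;

    /* Packets injected on this endpoint draw on the STX's transmit
     * resources; an STX may be bound to one or more endpoints. */
    return fi_ep_bind(ep, &stx->fid, 0);
}
```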
8 Contexts on libfabric
For backwards compatibility, the default SHMEM context is backed with a single fi_stx_ctx bound to multiple fi_endpoints.
9 Contexts on libfabric
Explicitly allocated contexts are each associated with a single fi_stx_ctx/fi_endpoint pair.
Private resources reduce contention between contexts.
A single endpoint per context improves resource utilization.
10 Performance Evaluation
Implemented several benchmarks on top of libfabric-based contexts: • Message rate, latency micro-benchmarks • Graph500 • GUPS • Mandelbrot • Pipelined Reduction
Experiments were run on Edison: • 2 x 12-core Xeon + 64 GB DDR3 per node • Cray Aries interconnect • Flat tests run with 1 PE per core, hybrid tests run with 1 PE per socket and 12 threads per PE
11 Contexts-Based Micro-Benchmarks
(Figure: uni-directional message rates for PUT, GET, non-blocking PUT, and non-blocking GET; note log scale.)
12 Contexts-Based Micro-Benchmarks
(Figure: PUT and GET latency.) Multi-threaded contention increases network latency, though less so with contexts.
13 Contexts-Based G500
Contexts enabled a profitable, programmable combination of multi-threading and OpenSHMEM (not achievable in earlier efforts).
14 Contexts-Based GUPS
API Used         Hybrid Parallelism?  Contexts Used?  Time (s)  Speedup
Gets/Puts        No                   No                60.31   1.00x (Baseline)
Gets/Puts        Yes                  No              1042.66   0.06x
Gets/Puts        Yes                  Yes              330.08   0.18x
Bitwise Atomics  No                   No                21.58   2.80x
Bitwise Atomics  Yes                  No               306.23   0.20x
Bitwise Atomics  Yes                  Yes               16.97   3.55x
For small packet sizes, contexts continue to improve network utilization. Big improvements from recent bitwise atomics extensions as well.
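At its core, the GUPS kernel the table refers to is a stream of bitwise-XOR updates to random table locations, which is why the bitwise-atomics extension fits it so well. A minimal serial sketch; the 64-bit LCG constants are illustrative, not the official HPCC generator:

```c
#include <stdint.h>

/* Serial sketch of the GUPS update loop: each random value selects a
 * table slot (low bits) and is XORed into it. XOR updates commute,
 * which is what lets a contexts-based version issue them concurrently
 * from many threads without ordering constraints. */
void gups_update(uint64_t *table, uint64_t table_size,
                 uint64_t nupdates, uint64_t seed)
{
    uint64_t ran = seed;
    for (uint64_t i = 0; i < nupdates; i++) {
        /* Simple 64-bit LCG (illustrative constants). */
        ran = ran * 6364136223846793005ULL + 1442695040888963407ULL;
        table[ran & (table_size - 1)] ^= ran;  /* table_size: power of 2 */
    }
}
```

Because XOR is self-inverse, replaying the same update stream twice restores the table, a property HPCC-style verification also exploits.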
15 Contexts-Based Mandelbrot
Blocking APIs?  Contexts?           Time (s)  Speedup
Yes             No                  1,285.27  1.00x (Baseline)
Yes             For Multithreading    612.13  2.10x
Yes             Pipelined             592.56  2.17x
No              No                  1,487.10  0.86x
No              For Multithreading    635.34  2.02x
No              Pipelined             608.20  2.11x
Using one context per thread yields a ~2x performance improvement; additionally using contexts for pipelining improves performance by a further ~3%.
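For reference, the per-pixel work being distributed across threads and contexts in this benchmark is the standard Mandelbrot escape-time iteration. A minimal sketch; the iteration cap is illustrative:

```c
/* Escape-time iteration for the point c = cr + ci*i: returns the number
 * of iterations before |z| exceeds 2 (i.e. |z|^2 > 4), or max_iters if
 * the point never escapes. Each pixel is independent, which makes the
 * workload embarrassingly parallel on the compute side; contexts help
 * with the communication of finished tiles. */
int mandel_iters(double cr, double ci, int max_iters)
{
    double zr = 0.0, zi = 0.0;
    for (int i = 0; i < max_iters; i++) {
        if (zr * zr + zi * zi > 4.0)
            return i;
        double t = zr * zr - zi * zi + cr;  /* z = z^2 + c */
        zi = 2.0 * zr * zi + ci;
        zr = t;
    }
    return max_iters;
}
```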
16 Contexts-Based Pipelined Reduction
(Figure: pipelined reduction between PE 0 and PE 1.)
Without Contexts: 7.80 s. With Contexts: 6.94 s.
Single-threaded; the performance improvement comes purely from communication pipelining.
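The pipelining idea behind these numbers is to split the reduction buffer into chunks so that communication of chunk k can overlap with the local reduction of chunk k-1, each on its own context. A serial sketch of just the chunked traversal (the chunk size is illustrative; the overlap itself requires the SHMEM runtime and is only described in comments):

```c
#include <stddef.h>

/* Sum `n` elements of `src`, chunk by chunk. In the contexts-based
 * version, chunk k's incoming data would be fetched on one context
 * while chunk k-1 is being summed; here only the serial chunked
 * traversal is shown, which produces the same result. */
long chunked_sum(const long *src, size_t n, size_t chunk)
{
    long acc = 0;
    for (size_t start = 0; start < n; start += chunk) {
        size_t end = (start + chunk < n) ? start + chunk : n;
        for (size_t i = start; i < end; i++)
            acc += src[i];
    }
    return acc;
}
```

The chunk size trades off overlap depth against per-message overhead; the measured ~11% improvement above comes entirely from this overlap.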
17 Programmability Evaluation
1. Improved network utilization at smaller packet sizes enables more algorithmically faithful implementations (less boilerplate, message aggregation, etc.).
2. Facilitating hybrid programming has the nice side effect of increasing state per PE, which can help reduce redundant/remote communication in many irregular applications (e.g. Graph500, UTS).
3. Did not find that creation/passing of contexts added significant development burden.
18 Contributions
Implementation of OpenSHMEM Contexts over libfabric.
Implementation of several benchmarks over this new contexts implementation (SoS micro-benchmarks, Graph500, GUPS, Mandelbrot, Pipelined Reduction).
Evaluation of new benchmarks for performance and programmability using Edison and the Cray Aries interconnect.
19 Acknowledgements