Implementation and Evaluation of OpenSHMEM Contexts Using OFI Libfabric

Max Grossman, Joseph Doyle, James Dinan, Howard Pritchard, Kayla Seager, Vivek Sarkar
¹Rice University  ²Carnegie Mellon University  ³Intel Corporation  ⁴Los Alamos National Laboratory

LA-UR-17-27150

1 Explosion in Processor Parallelism/Diversity

2 Renaissance in Parallel Programming Models

To match, there has also been an explosion in programming-model research and development for these new processor types: • Kokkos • OpenCL • Raja • OCR • Qthreads • UCX • Legion • HPX • Parsec • Habanero • ATMI • HiHAT • ROCm • hStreams • HSA • CUDA

3 Between a Rock and a Hard Place

Communication middleware is then asked to flexibly support all of these new ways of programming HPC hardware.

It must also perform well for a variety of (often multi-tenant) workloads.

[Figure: communication middleware serving OpenSHMEM, Legion, Apache Spark, Qthreads, and OpenMP.]

4 Communication Safety to Date

Let the runtime do it (MPI-like thread safety).

Often the application or parallel runtime starts getting involved (e.g., AsyncSHMEM, Legion).

5 OpenSHMEM Contexts

Enables creation of separate logical endpoints into the network.

#pragma omp parallel
{
    shmem_ctx_t my_ctx;
    shmem_ctx_create(SHMEM_CTX_PRIVATE, &my_ctx);
    shmem_ctx_int_put(my_ctx, ...);
}

Several benefits: • Independent memory fences, quieting on different contexts • Relaxation of thread safety on individual contexts reduces overheads • Offers hints to runtime as to how to partition network resources • Significant performance benefits at small to medium packet sizes
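A slightly fuller sketch of the pattern and benefits above (illustrative only: `scatter_updates` and its arguments are hypothetical names, and running it requires an OpenSHMEM 1.4 implementation and launcher):

```c
#include <shmem.h>
#include <omp.h>

/* Hypothetical helper: each thread scatters its share of src to target_pe
 * over its own private context. */
void scatter_updates(int *dest, const int *src, int nelems, int target_pe) {
    #pragma omp parallel
    {
        shmem_ctx_t ctx;
        /* SHMEM_CTX_PRIVATE: only the creating thread uses ctx, so the
         * runtime may elide locking on its transmit resources. */
        int created = (shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx) == 0);
        if (!created)
            ctx = SHMEM_CTX_DEFAULT;  /* fall back if resources run out */

        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        for (int i = tid; i < nelems; i += nthreads)
            shmem_ctx_int_put(ctx, &dest[i], &src[i], 1, target_pe);

        /* Completion is per-context: quiet only this thread's traffic,
         * without fencing the other threads' contexts. */
        shmem_ctx_quiet(ctx);
        if (created)
            shmem_ctx_destroy(ctx);
    }
}
```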

6 Contributions

Implementation of OpenSHMEM Contexts over libfabric.

Implementation of several benchmarks over this new contexts implementation (SoS micro-benchmarks, Graph500, GUPS, Mandelbrot, Pipelined Reduction).

Evaluation of these benchmarks for performance and programmability on Edison and its Cray Aries interconnect.

7 libfabric

Platform independent, low-level library for high-performance interconnects.

An fi_domain represents a hardware NIC. • It encapsulates multiple fi_endpoint objects, which represent hardware resources for sending/receiving messages. • An fi_stx_ctx (shared transmit context) is bound to one or more fi_endpoint objects within an fi_domain and is used to inject packets.
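The object relationships above can be sketched with libfabric's setup calls (a sketch only, assuming libfabric ≥ 1.5, with error handling and attribute tuning omitted; this is not the paper's actual code):

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

void setup_shared_tx(void) {
    struct fi_info *info;
    struct fid_fabric *fabric;
    struct fid_domain *domain;  /* represents one NIC */
    struct fid_stx *stx;        /* shared transmit context */
    struct fid_ep *ep;          /* hardware send/receive resources */

    fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, NULL, &info);
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);

    /* One fi_stx_ctx may be bound to one or more endpoints in the domain. */
    fi_stx_context(domain, info->tx_attr, &stx, NULL);
    fi_endpoint(domain, info, &ep, NULL);
    fi_ep_bind(ep, &stx->fid, 0);
    fi_enable(ep);
}
```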

8 Contexts on libfabric

For backwards compatibility, the default SHMEM context is backed with a single fi_stx_ctx bound to multiple fi_endpoints.

9 Contexts on libfabric

Explicitly allocated contexts are associated with a single fi_stx_ctx/fi_endpoint pair.

Private resources reduce contention between contexts.

A single endpoint per context improves resource utilization.

10 Performance Evaluation

Implemented several benchmarks on top of libfabric-based contexts: • Message rate, latency micro-benchmarks • Graph500 • GUPS • Mandelbrot • Pipelined Reduction

Experiments were run on Edison: • 2 x 12-core Xeon + 64 GB DDR3 per node • Cray Aries interconnect • Flat tests run with 1 PE per core, hybrid tests run with 1 PE per socket and 12 threads per PE

11 Contexts-Based Micro-Benchmarks

[Figure: uni-directional message rates for PUT, GET, non-blocking PUT, and non-blocking GET; note the log scale.]

12 Contexts-Based Micro-Benchmarks

[Figure: PUT and GET latencies.] Multi-threaded contention increases network latency (less so for contexts).

13 Contexts-Based G500

Contexts enabled a profitable, programmable combination of multi-threading and OpenSHMEM (not achievable in earlier efforts).

14 Contexts-Based GUPS

API Used         Hybrid Parallelism?   Contexts Used?   Time (s)   Speedup
Gets/Puts        No                    No               60.31      1.00x (Baseline)
Gets/Puts        Yes                   No               1042.66    0.06x
Gets/Puts        Yes                   Yes              330.08     0.18x
Bitwise Atomics  No                    No               21.58      2.80x
Bitwise Atomics  Yes                   No               306.23     0.20x
Bitwise Atomics  Yes                   Yes              16.97      3.55x

For small packet sizes, contexts continue to improve network utilization. Big improvements from recent bitwise atomics extensions as well.

15 Contexts-Based Mandelbrot

Blocking   Contexts?            Time (s)    Speedup
Yes        No                   1,285.27    1.00x (Baseline)
Yes        For Multithreading   612.13      2.10x
Yes        Pipelined            592.56      2.17x
No         No                   1,487.10    0.86x
No         For Multithreading   635.34      2.02x
No         Pipelined            608.20      2.11x

Using one context per thread yields a ~2x performance improvement; using contexts for pipelining improves performance by a further ~3%.

16 Contexts-Based Pipelined Reduction

[Figure: pipelined reduction between PE 0 and PE 1.]

Without contexts: 7.80 s. With contexts: 6.94 s. Single-threaded; the performance improvement comes purely from communication pipelining.

17 Programmability Evaluation

1. Improved network utilization at smaller packet sizes enables more algorithmically faithful implementations (less boilerplate, aggregation, etc.).

2. Facilitating hybrid programming has the nice side effect of increasing state per PE, which can help reduce redundant/remote communication in many irregular applications (e.g. Graph500, UTS).

3. We did not find that creating or passing contexts added a significant development burden.

18 Contributions

Implementation of OpenSHMEM Contexts over libfabric.

Implementation of several benchmarks over this new contexts implementation (SoS micro-benchmarks, Graph500, GUPS, Mandelbrot, Pipelined Reduction).

Evaluation of new benchmarks for performance and programmability using Edison and the Cray Aries interconnect.

19 Acknowledgements
