Implementation and Evaluation of OpenSHMEM Contexts Using OFI Libfabric

Max Grossman, Joseph Doyle, James Dinan, Howard Pritchard, Kayla Seager, Vivek Sarkar
¹Rice University  ²Carnegie Mellon University  ³Intel Corporation  ⁴Los Alamos National Laboratory

LA-UR-17-27150

1 Explosion in Processor Parallelism/Diversity

2 Renaissance in Parallel Programming Models

To match, there has also been an explosion in programming-model research and development for these new processor types: • Kokkos • OpenCL • Raja • OCR • Qthreads • UCX • Legion • HPX • Parsec • Habanero • ATMI • HiHAT • ROCm • hStreams • HSA • CUDA

3 Between a Rock and a Hard Place

Communication middleware is then asked to flexibly support all of these new ways of programming HPC hardware.

It must also perform well for a variety of (often multi-tenant) workloads.

[Figure: communication middleware serving OpenSHMEM, Legion, Apache Spark, Qthreads, and OpenMP.]

4 Communication Safety to Date

Let the runtime do it (MPI-like thread safety).

Often the application or parallel runtime starts getting involved (e.g., AsyncSHMEM, Legion).

5 OpenSHMEM Contexts

Enables creation of separate logical endpoints into the network.

#pragma omp parallel
{
    shmem_ctx_t my_ctx;
    shmem_ctx_create(SHMEM_CTX_PRIVATE, &my_ctx);
    shmem_ctx_int_put(my_ctx, ...);
}

Several benefits: • Independent memory fences, quieting on different contexts • Relaxation of thread safety on individual contexts reduces overheads • Offers hints to runtime as to how to partition network resources • Significant performance benefits at small to medium packet sizes
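A slightly fuller sketch of the pattern and benefits above (illustrative only: `scatter_updates` and its arguments are hypothetical names, and running it requires an OpenSHMEM 1.4 implementation and launcher):

```c
#include <shmem.h>
#include <omp.h>

/* Hypothetical helper: each thread scatters its share of src to target_pe
 * over its own private context. */
void scatter_updates(int *dest, const int *src, int nelems, int target_pe) {
    #pragma omp parallel
    {
        shmem_ctx_t ctx;
        /* SHMEM_CTX_PRIVATE: only the creating thread uses ctx, so the
         * runtime may elide locking on its transmit resources. */
        int created = (shmem_ctx_create(SHMEM_CTX_PRIVATE, &ctx) == 0);
        if (!created)
            ctx = SHMEM_CTX_DEFAULT;  /* fall back if resources run out */

        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        for (int i = tid; i < nelems; i += nthreads)
            shmem_ctx_int_put(ctx, &dest[i], &src[i], 1, target_pe);

        /* Completion is per-context: quiet only this thread's traffic,
         * without fencing the other threads' contexts. */
        shmem_ctx_quiet(ctx);
        if (created)
            shmem_ctx_destroy(ctx);
    }
}
```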

6 Contributions

Implementation of OpenSHMEM Contexts over libfabric.

Implementation of several benchmarks over this new contexts implementation (SoS micro-benchmarks, Graph500, GUPS, Mandelbrot, Pipelined Reduction).

Evaluation of these benchmarks for performance and programmability on Edison and its Cray Aries interconnect.

7 libfabric

Platform independent, low-level library for high-performance interconnects.

An fi_domain represents a hardware NIC. • It encapsulates multiple fi_endpoint objects, which represent hardware resources for sending/receiving messages. • An fi_stx_ctx (shared transmit context) is bound to one or more fi_endpoint objects within an fi_domain and is used to inject packets.
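The object relationships above can be sketched with libfabric's setup calls (a sketch only, assuming libfabric ≥ 1.5, with error handling and attribute tuning omitted; this is not the paper's actual code):

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

void setup_shared_tx(void) {
    struct fi_info *info;
    struct fid_fabric *fabric;
    struct fid_domain *domain;  /* represents one NIC */
    struct fid_stx *stx;        /* shared transmit context */
    struct fid_ep *ep;          /* hardware send/receive resources */

    fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, NULL, &info);
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);

    /* One fi_stx_ctx may be bound to one or more endpoints in the domain. */
    fi_stx_context(domain, info->tx_attr, &stx, NULL);
    fi_endpoint(domain, info, &ep, NULL);
    fi_ep_bind(ep, &stx->fid, 0);
    fi_enable(ep);
}
```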

8 Contexts on libfabric

For backwards compatibility, the default SHMEM context is backed with a single fi_stx_ctx bound to multiple fi_endpoints.

9 Contexts on libfabric

Explicitly allocated contexts are associated with a single fi_stx_ctx/fi_endpoint pair.

Private resources reduce contention between contexts.

A single endpoint per context improves resource utilization.

10 Performance Evaluation

Implemented several benchmarks on top of libfabric-based contexts: • Message rate, latency micro-benchmarks • Graph500 • GUPS • Mandelbrot • Pipelined Reduction

Experiments were run on Edison: • 2 x 12-core Xeon + 64 GB DDR3 per node • Cray Aries interconnect • Flat tests run with 1 PE per core, hybrid tests run with 1 PE per socket and 12 threads per PE

11 Contexts-Based Micro-Benchmarks

[Figure: uni-directional message rates for PUT, GET, non-blocking PUT, and non-blocking GET; note the log scale.]

12 Contexts-Based Micro-Benchmarks

[Figure: PUT and GET latencies.] Multi-threaded contention increases network latency (less so for contexts).

13 Contexts-Based G500

Contexts enabled a profitable, programmable combination of multi-threading and OpenSHMEM (not achievable in earlier efforts).

14 Contexts-Based GUPS

API Used         Hybrid Parallelism?   Contexts Used?   Time (s)   Speedup
Gets/Puts        No                    No               60.31      1.00x (Baseline)
Gets/Puts        Yes                   No               1042.66    0.06x
Gets/Puts        Yes                   Yes              330.08     0.18x
Bitwise Atomics  No                    No               21.58      2.80x
Bitwise Atomics  Yes                   No               306.23     0.20x
Bitwise Atomics  Yes                   Yes              16.97      3.55x

For small packet sizes, contexts continue to improve network utilization. Big improvements from recent bitwise atomics extensions as well.

15 Contexts-Based Mandelbrot

Blocking   Contexts?            Time (s)    Speedup
Yes        No                   1,285.27    1.00x (Baseline)
Yes        For Multithreading   612.13      2.10x
Yes        Pipelined            592.56      2.17x
No         No                   1,487.10    0.86x
No         For Multithreading   635.34      2.02x
No         Pipelined            608.20      2.11x

Using one context per thread yields a ~2x performance improvement; using contexts for pipelining improves performance by a further ~3%.

16 Contexts-Based Pipelined Reduction

[Figure: pipelined reduction between PE 0 and PE 1.]

Without contexts: 7.80 s. With contexts: 6.94 s. Single-threaded; the performance improvement comes purely from communication pipelining.

17 Programmability Evaluation

1. Improved network utilization at smaller packet sizes enables more algorithmically faithful implementations (less boilerplate, aggregation, etc.).

2. Facilitating hybrid programming has the nice side effect of increasing state per PE, which can help reduce redundant/remote communication in many irregular applications (e.g. Graph500, UTS).

3. We did not find that creating or passing contexts added a significant development burden.

18 Contributions

Implementation of OpenSHMEM Contexts over libfabric.

Implementation of several benchmarks over this new contexts implementation (SoS micro-benchmarks, Graph500, GUPS, Mandelbrot, Pipelined Reduction).

Evaluation of new benchmarks for performance and programmability using Edison and the Cray Aries interconnect.

19 Acknowledgements
