
Lightweight Threaded Runtime Systems for OpenMP

Shintaro Iwasaki
Argonne National Laboratory / The University of Tokyo
Email: [email protected], [email protected]

Outline of This Talk

▪ BOLT: a lightweight OpenMP library based on LLVM OpenMP.
  – It uses lightweight user-level threads for OpenMP threads and tasks.
▪ BOLT won the Best Paper Award at PACT '19 [*].

▪ Features of BOLT (we will focus on these):
  1. Extremely lightweight OpenMP threads that can efficiently handle nested parallelism.
  2. Tackling an interoperability issue of MPI + OpenMP tasks.
▪ This presentation will cover how BOLT handles nested parallelism.
  – Please visit us: https://www.bolt-omp.org/ or google "BOLT OpenMP"

[*] S. Iwasaki et al., "BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads", PACT '19, 2019

Index

1. Introduction
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

OpenMP: the Most Popular Multithreading Model

▪ Multithreading is essential for exploiting modern CPUs.
▪ OpenMP is a popular parallel programming model.
  – In the HPC field, OpenMP is the most popular model for multithreading.
    • 57% of DOE exascale applications use OpenMP [*].

▪ Not only user programs but also runtimes and libraries are parallelized by OpenMP.
  – Runtimes with an OpenMP backend (Kokkos, RAJA), BLAS/LAPACK libraries (OpenBLAS, MKL, SLATE), DNN libraries (Intel MKL-DNN), FFT libraries (FFTW3), …

[*] D. E. Bernholdt et al., "A Survey of MPI Usage in the US Exascale Computing Project", Concurrency and Computation: Practice and Experience, 2018

Unintentional Nested OpenMP Parallel Regions

▪ OpenMP parallelizes multiple software stacks (user applications, scientific libraries, math libraries).
▪ Nested parallel regions create OpenMP threads exponentially.

Code example:

  // User application
  #pragma omp parallel for
  for (i = 0; i < n; i++)
    dgemv(matrix[n], ...);

  // BLAS library
  void dgemv(...) {
    #pragma omp parallel for    // nested!
    for (i = 0; i < n; i++)
      dgemv_seq(data[n], i);
  }

(Figure: a parallel region whose threads each open another parallel region; the number of OpenMP threads grows multiplicatively while the number of cores stays fixed.)

Can We Just Disable Nested Parallelism?

▪ How should we utilize nested parallel regions?
  – Enable nested parallelism: creates an exponential number of threads.
  – Disable nested parallelism: adversely decreases parallelism.
▪ Example: strong scaling on manycore machines.
  – Is the outer parallelism enough to feed work to all the cores?

  #pragma omp parallel for
  for (i = 0; i < n; i++)
    comp(cells[i], ...);

  void comp(...):
    [...];
    #pragma omp parallel for
    for (i = 0; i < n; i++);

(Figure: the same cell-based computation mapped to a multicore node, a manycore node, and many manycore nodes; with only outer-level parallelism, the number of cells per node may not be enough to keep all cores busy.)

Two Directions to Address Nested Parallelism

▪ Nested parallel regions have been known to be a problem since OpenMP 1.0 (1997).
  – By default, OpenMP disables nested parallelism [*].
▪ Two directions to address this issue:
  1. Use the several workarounds implied in the OpenMP specification.
     => Not practical if users do not know the parallelism in the other software stacks.
  2. Instead of OS-level threads, use lightweight threads as OpenMP threads: user-level threads (ULTs, explained later).
     => A naïve ULT-based OpenMP does not perform well if parallel regions are not nested (i.e., flat).
        • It does not perform well even when parallel regions are nested.

=> Need a solution to efficiently utilize nested parallelism.

[*] Since OpenMP 5.0, the default has become "implementation defined", while most OpenMP systems continue to disable nested parallelism by default.

BOLT: Lightweight OpenMP over ULT for Both Flat & Nested Parallel Regions

▪ We proposed BOLT, a ULT-based OpenMP runtime system, which performs best for both flat and nested parallel regions.

▪ Three key contributions:
  1. An in-depth performance analysis of the LLVM OpenMP runtime, finding several performance barriers.
  2. An implementation of a thread-to-CPU binding interface that supports user-level threads.
  3. A novel thread coordination algorithm to transparently support both flat and nested parallel regions.

Index

1. Introduction
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

Direction 1: Work around with OS-Level Threads (1/2)

  #pragma omp parallel for
  for (i = 0; i < n; i++)
    dgemv(matrix[n], ...);

  // BLAS library
  void dgemv(...) {
    #pragma omp parallel for
    for (i = 0; i < n; i++)
      dgemv_seq(data[n], i);
  }

▪ Several workarounds:
  1. Disable nested parallel regions (OMP_NESTED=false, OMP_MAX_ACTIVE_LEVELS=...)
     • Parallelism is lost.
  2. Finely tune the numbers of threads (OMP_NUM_THREADS=nth1,nth2,nth3,...)
     • Parallelism is lost; the parameters are difficult to tune.

(Figure: thread trees under 1. OMP_NESTED=false and 2. OMP_NUM_THREADS=3,3; either the inner regions run with a single thread, or both levels run with only a handful of threads each.)

Direction 1: Work around with OS-Level Threads (2/2)

▪ Workarounds (cont.):
  3. Limit the total number of threads (OMP_THREAD_LIMIT=nths)
     • Can adversely serialize parallel regions; doesn't work well in practice.
  4. Dynamically adjust the number of threads (OMP_DYNAMIC=true)
     • Can adversely serialize parallel regions; doesn't work well in practice.
  5. Use OpenMP tasks (#pragma omp task/taskloop)
     • Most codes use parallel regions, and semantically, threads != tasks.

(Figure: thread trees under 3. OMP_THREAD_LIMIT=8 and 4. OMP_DYNAMIC=true, where some inner regions get few or no extra threads, and under 5. task/taskloop, where the inner work becomes tasks instead of threads.)

The standard API calls behind workarounds 1, 2, and 4 are sketched below.

▪ How about using lightweight threads for OpenMP threads?
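A minimal C sketch of how workarounds 1, 2, and 4 can also be applied programmatically through the standard OpenMP runtime API (OMP_THREAD_LIMIT, workaround 3, has no API setter and must stay an environment variable); this is an illustration only, not BOLT-specific code:

  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      /* 1. Disable nesting: allow only one active parallel level
            (equivalent to OMP_MAX_ACTIVE_LEVELS=1; omp_set_nested(0) is the
            older knob). */
      omp_set_max_active_levels(1);

      /* 4. Let the runtime shrink team sizes dynamically
            (equivalent to OMP_DYNAMIC=true). */
      omp_set_dynamic(1);

      /* 2. Tune the thread counts per level by hand with num_threads
            (or OMP_NUM_THREADS=8,7). */
      #pragma omp parallel for num_threads(8)
      for (int i = 0; i < 8; i++) {
          #pragma omp parallel for num_threads(7)
          for (int j = 0; j < 7; j++)
              printf("i=%d j=%d active level=%d\n", i, j, omp_get_active_level());
      }
      return 0;
  }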

Direction 2: Use Lightweight Threads => User-Level Threads (ULTs)

▪ User-level threads (ULTs): threads implemented in user space.
  – The runtime manages threads without heavyweight kernel operations.
  – With naïve Pthreads, thread fork/join (creation and switching) involves heavy system calls; ULT fork/join has small overheads.
  – ULTs run on top of Pthreads: scheduling is done by user-level context switching in user space.

(Figure: fork-join performance on KNL; user-level threads are more than 350x faster than naïve Pthreads.)

[*] S. Seo et al., "Argobots: A Lightweight Low-Level Threading and Tasking Framework", TPDS '18, 2018

Using ULTs is Easy

(Figure: an OpenMP-parallelized program on LLVM OpenMP 7.0 over Pthreads vs. on LLVM OpenMP 7.0 over the Argobots ULT layer (= the BOLT baseline).)

▪ Replacing the Pthreads layer with a user-level threading library is a piece of cake.
  – Argobots [*], which we used in this work, has a Pthreads-like API (mutex, TLS, ...), making this easy.
    • Note: other ULT libraries (e.g., Qthreads, Nanos++, MassiveThreads, …) also have similar threading APIs.
  – The ULT-based OpenMP implementation is OpenMP 4.5-compliant (as far as we examined).
▪ Does this "baseline BOLT" perform well?
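To give a sense of the Pthreads-like Argobots API mentioned above, here is a minimal, self-contained ULT example; this is an illustrative sketch of the Argobots user API only, not of how BOLT uses it internally:

  #include <abt.h>
  #include <stdio.h>

  /* A tiny ULT body: Argobots passes the user argument as a void pointer. */
  static void hello(void *arg) {
      printf("hello from ULT %d\n", (int)(size_t)arg);
  }

  int main(int argc, char **argv) {
      ABT_init(argc, argv);                      /* initialize the Argobots runtime */

      ABT_xstream xstream;                       /* the primary execution stream (a Pthread) */
      ABT_xstream_self(&xstream);
      ABT_pool pool;
      ABT_xstream_get_main_pools(xstream, 1, &pool);

      ABT_thread ults[4];
      for (int i = 0; i < 4; i++)                /* ULT creation stays in user space */
          ABT_thread_create(pool, hello, (void *)(size_t)i,
                            ABT_THREAD_ATTR_NULL, &ults[i]);
      for (int i = 0; i < 4; i++) {
          ABT_thread_join(ults[i]);              /* wait for each ULT, then release it */
          ABT_thread_free(&ults[i]);
      }

      ABT_finalize();
      return 0;
  }

Linking against Argobots (e.g., with -labt) is assumed; the point is simply that creating and joining ULTs looks very much like Pthreads, which is what makes the runtime swap straightforward.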

Simple Replacement Performs Poorly

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(N)
  for (int i = 0; i < N; i++)
    #pragma omp parallel for num_threads(28)
    for (int j = 0; j < 28; j++)
      comp_20000_cycles(i, j);

  Nested Parallel Region (balanced)

▪ BOLT (baseline):
  – Faster than GNU OpenMP (GCC).
  – So-so among the ULT-based OpenMPs (MPC, OMPi, Mercurium).
  – Slower than the Intel/LLVM OpenMPs (Intel, LLVM).

(Figure: execution time [s] vs. the number of outer threads (N), lower is better. Lines: BOLT (baseline), GCC, MPC, OMPi, Mercurium, Intel, LLVM, Ideal.)

Configurations:
  GCC: GNU OpenMP with GCC 8.1; Intel: Intel OpenMP with ICC 17.2.174; LLVM: LLVM OpenMP with LLVM/Clang 7.0 (popular Pthreads-based OpenMPs)
  MPC: MPC 3.3.0; OMPi: OMPi 1.2.3 and psthreads 1.0.4; Mercurium: OmpSs (OpenMP 3.1 compat) 2.1.0 + Nanos++ 0.14.1 (state-of-the-art ULT-based OpenMPs)

Index

1. Introduction
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

Three Optimization Directions for Further Performance

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(N)
  for (int i = 0; i < N; i++)
    #pragma omp parallel for num_threads(28)
    for (int j = 0; j < 28; j++)
      comp_20000_cycles(i, j);

  Nested Parallel Region (balanced)

▪ The naïve replacement (BOLT (baseline)) does not perform well.
▪ We need advanced optimizations:
  1. Solving bottlenecks
  2. ULT-friendly affinity
  3. Efficient thread coordination

(Figure: execution time [s] vs. the number of outer threads (N), lower is better. Lines: BOLT (baseline), BOLT (opt), GCC, Intel, LLVM, MPC, OMPi, Mercurium, Ideal.)

1. Solve Scalability Bottlenecks (1/2)

(Figure: nested parallel regions drawing on shared global resources: the thread descriptor pool, the team descriptor pool, and the thread ID counter, with a team cache per parallel region.)

▪ Resource management optimizations:
  1. Divide the single large lock protecting all threading resources (see the sketch below).
     • This lock cost is negligible with Pthreads, but it becomes a bottleneck with lightweight threads.
  2. Enable multi-level caching of parallel regions.
     • Called "nested hot teams" in LLVM OpenMP.
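As a rough illustration of the first optimization, a sketch of splitting one global lock into per-resource locks (hypothetical names and data layout; BOLT's real resource pools are more elaborate):

  #include <pthread.h>
  #include <stdio.h>

  /* Before: one global lock (in the spirit of LLVM OpenMP's fork/join
     bootstrap lock) serializes every threading-resource operation.
     After: each resource gets its own lock, so concurrently forked
     parallel regions rarely contend. */
  static pthread_mutex_t thread_desc_pool_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t team_desc_pool_lock   = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t thread_id_lock        = PTHREAD_MUTEX_INITIALIZER;
  static int next_thread_id;

  /* Allocating a thread ID now touches only its own lock, not the locks
     guarding the thread/team descriptor pools. */
  static int alloc_thread_id(void) {
      pthread_mutex_lock(&thread_id_lock);
      int id = next_thread_id++;
      pthread_mutex_unlock(&thread_id_lock);
      return id;
  }

  int main(void) {
      (void)thread_desc_pool_lock;   /* the descriptor pools would be guarded analogously */
      (void)team_desc_pool_lock;
      printf("allocated thread id %d\n", alloc_thread_id());
      return 0;
  }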

1. Solve Scalability Bottlenecks (2/2)

▪ Thread creation optimizations:
  3. Binary creation of OpenMP threads.
     • Instead of the master thread creating every team member serially (the default in LLVM OpenMP), created threads help create the rest, so the critical path gets shorter (see the sketch below).

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(L)
  for (int i = 0; i < L; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp();

  Nested Parallel Regions (no computation): no computation, to measure the pure overheads.

(Figure: execution time [s] vs. the number of outer threads (L), lower is better; BOLT (baseline), + efficient resource management, ++ scalable thread startup.)
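A rough, Pthreads-based sketch of the tree-style start-up idea (the run_range() helper is only for illustration; BOLT's actual thread creation code in the LLVM OpenMP runtime differs):

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct { int lo, hi; } range_t;   /* the range of thread IDs to start */

  /* Each thread repeatedly hands the upper half of its remaining ID range to a
     new thread and keeps the lower half, so N threads start after O(log N)
     creations on the critical path instead of O(N). */
  static void *run_range(void *arg) {
      range_t r = *(range_t *)arg;
      free(arg);
      pthread_t kids[32];
      int nkids = 0;
      while (r.hi - r.lo > 1) {
          int mid = r.lo + (r.hi - r.lo) / 2;
          range_t *upper = malloc(sizeof *upper);
          upper->lo = mid;
          upper->hi = r.hi;
          pthread_create(&kids[nkids++], NULL, run_range, upper);
          r.hi = mid;                        /* keep splitting the lower half */
      }
      printf("thread %d starts its work\n", r.lo);
      for (int i = 0; i < nkids; i++)        /* the joins form a tree as well */
          pthread_join(kids[i], NULL);
      return NULL;
  }

  int main(void) {
      range_t *team = malloc(sizeof *team);
      team->lo = 0;                          /* thread 0 acts as the master */
      team->hi = 8;                          /* a "team" of 8 threads */
      run_range(team);
      return 0;
  }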

2. Affinity: How to Implement Affinity for ULTs

  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread
  #pragma omp parallel for num_threads(4)
  for (i = 0; i < 4; i++)
    comp(i);

With proc_bind, threads are bound to places.

(Figure: OpenMP threads 0-3 bound to places 0-3, each place covering two of the eight cores.)

▪ OpenMP 4.0 introduced place and proc_bind for affinity.
  – OS-level thread-based libraries (e.g., GNU OpenMP) implement them with CPU masks.
▪ BOLT (baseline) ignored affinity (still standard compliant).
▪ However, affinity should be useful to 1. improve locality and 2. reduce queue contention.
  – Note: ULT runtimes use shared queues + random work stealing.
▪ How can we implement places over ULTs?

Implementation: Place Queue

▪ Place queues can implement OpenMP affinity in BOLT.

  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread
  #pragma omp parallel for num_threads(4)
  for (i = 0; i < 4; i++)
    comp(i);

(Figure: one scheduler per core runs on a Pthread and owns a shared queue; in addition, the schedulers belonging to each place share a place queue, and OpenMP threads bound to a place are pushed there as ULTs.)

▪ Problem: the OpenMP affinity setting is too deterministic.

OpenMP Affinity is Too Deterministic

▪ Once affinity (bind-var) is set, all the OpenMP threads created in the descendant parallel regions are bound to places.
  – The OpenMP specification requires this behavior.

  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread
  #pragma omp parallel for num_threads(8)
  for (int i = 0; i < 8; i++)
    #pragma omp parallel for num_threads(8)
    for (int j = 0; j < 8; j++)
      comp(i, j);

(Figure: the outer threads i=0..7 are distributed over place queues 0-3, but all the inner threads of i=0 (i=0, j=1..7) stay in the same place queue, so load balancing is limited.)

▪ Promising direction: schedule the innermost threads with unbound random work stealing.
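To observe this inheritance with the standard OpenMP 4.5 affinity query functions, here is a small illustrative program, independent of BOLT:

  /* Run with places and binding enabled, e.g.:
   *   OMP_PLACES=cores OMP_PROC_BIND=true ./a.out */
  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      omp_set_max_active_levels(2);      /* allow two active levels of parallelism */
      #pragma omp parallel num_threads(4) proc_bind(spread)
      {
          int outer = omp_get_thread_num();
          #pragma omp parallel num_threads(2)
          {
              /* Inner threads inherit the outer thread's place partition,
                 so they only ever report places from that sub-partition. */
              printf("outer %d / inner %d: place %d (partition of %d places)\n",
                     outer, omp_get_thread_num(),
                     omp_get_place_num(), omp_get_partition_num_places());
          }
      }
      return 0;
  }

Every inner thread reports a place inside its outer thread's partition, which is exactly the determinism that the proposal on the next slide relaxes.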

Proposed New PROC_BIND: "unset"

▪ unset: resets the affinity setting of the specified parallel region.
  (In detail: this thread affinity policy resets the bind-var ICV and the place-partition-var ICV to their implementation-defined values and instructs the execution environment to follow these values.)

Without unset:
  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread
  #pragma omp parallel for num_threads(8)
  for (int i = 0; i < 8; i++)
    #pragma omp parallel for num_threads(8)
    for (int j = 0; j < 8; j++)
      comp(i, j);

With unset:
  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread,unset
  #pragma omp parallel for num_threads(8)
  for (int i = 0; i < 8; i++)
    #pragma omp parallel for num_threads(8)
    for (int j = 0; j < 8; j++)
      comp(i, j);

(Figure, left: with spread,unset, the innermost threads (e.g., i=0, j=1..7) go to the schedulers' shared queues and can be scheduled on any core by random work stealing.)
(Figure, right: execution time [s] vs. the number of outer threads (N), lower is better; BOLT (baseline), + efficient resource management, ++ scalable thread startup, +++ bind=spread, ++++ bind=spread,unset.)

▪ This scheduling flexibility gives higher performance.

3. Flat Parallelism: Poor Performance

▪ BOLT should perform as well as the original LLVM OpenMP.

Nested Parallel Regions (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp(i, j);

Flat Parallel Region (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    no_comp(i);

(Figure: execution time [us] of BOLT (passive), GCC, Intel, and LLVM, lower is better; left: nested, with OMP_WAIT_POLICY=PASSIVE for GCC/Intel/LLVM; right: flat, with OMP_WAIT_POLICY=ACTIVE.)

▪ Optimal OMP_WAIT_POLICY for GCC/Intel/LLVM improves performance of flat parallelism.

Active Waiting Policy for Flat Parallelism

  for (int iter = 0; iter < n; iter++) {
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 4; i++)
      comp(i);
  }

▪ The active waiting policy (OMP_WAIT_POLICY=active) improves the performance of flat parallelism by busy-wait-based synchronization.
  – If active, a Pthreads-based OpenMP thread busy-waits for the next parallel region; if passive, threads sleep on a condition variable after completing their work.
▪ BOLT, on the other hand, yields to a scheduler on fork and join (roughly equivalent to passive); the scheduler must find the next ULT before switching back to a thread.

(Figure: fork/join timelines of busy-waiting Pthreads-based OpenMP threads vs. BOLT threads switching to and from their schedulers.)

Busy wait is faster than even lightweight user-level context switching!

Implementation of Active Policy in BOLT

▪ If active, BOLT busy-waits for the next parallel region; if passive, BOLT relies on ULT context switching.

(Figure: fork/join timelines of the active and passive implementations in BOLT; active threads busy-wait between regions, while passive threads switch back to their schedulers, which then find the next ULT.)

▪ ULTs are not preemptive, so in the active case BOLT periodically yields to a scheduler to avoid starving other work (especially when the number of OpenMP threads exceeds the number of schedulers).

Performance of Flat and Nested

Nested Parallel Regions (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp(i, j);

Flat Parallel Region (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    no_comp(i);

(Figure: execution time [us], lower is better; left: nested with the passive policy, right: flat with the active policy.)
  – MPC serializes nested parallel regions, so it is the fastest in the nested case.
  – Like the baseline BOLT, MPC … OMPi do not implement the active policy.

Penalty of the Opposite Wait Policy

(Figure: execution time [us] of BOLT, GCC, Intel, and LLVM with the active vs. passive policy, lower is better; left: flat, right: nested. Choosing the wrong policy costs roughly 3x to 23x, and up to ~650,000 us in the worst case.)

Nested Parallel Regions (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp(i, j);

Flat Parallel Region (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    no_comp(i);

▪ How threads are coordinated significantly affects the overheads.
  – This large performance penalty discourages users from enabling nesting.

▪ Is there a good algorithm to transparently support both flat and nested parallelism?

Busy-Waits in Both Active/Passive Algorithms

(Figure: fork/join timelines of BOLT (active) and BOLT (passive); in the active case the OpenMP threads busy-wait, in the passive case the schedulers busy-wait while looking for the next ULT.)

  // BOLT (active): the OpenMP thread itself busy-waits on a flag
  void omp_thread() {
  RESTART_THREAD:
    comp();
    while (time_elapsed() < KMP_BLOCKTIME) {
      if (team->next_parallel_region_flag)
        goto RESTART_THREAD;
    }
  }

  // BOLT (passive): the scheduler busy-waits on its queue
  void user_scheduler() {
    while (1) {
      ULT_t *ult = get_ULT_from_queue();
      if (ult != NULL)
        execute(ult);
    }
  }

▪ In both the active and the passive case, a busy-wait runs after the threads complete their work: the OpenMP thread checks a flag, or the scheduler polls a queue.
  – Can we merge the two loops to perform both scheduling and flag checking?

Algorithm: Hybrid Wait Policy

(Figure: fork/join timelines of BOLT (active), BOLT (passive), and BOLT (hybrid); in the hybrid case a finished thread busy-waits while also checking its parent scheduler's queue.)

  void omp_thread() {
  RESTART_THREAD:
    comp();
    while (time_elapsed() < KMP_BLOCKTIME) {
      if (team->next_parallel_region_flag)
        goto RESTART_THREAD;
      ULT_t *ult = get_ULT_from_queue(parent_scheduler);
      if (ult != NULL)
        return_to_sched_and_run(ult);
    }
  }

  Note: this technique is not applicable to OS-level threads, since their scheduler is not exposed to the user.

▪ Hybrid: execute the flag check and the queue check alternately.
  – [flat]: a thread does not go back to a scheduler.
  – [nested]: another available ULT is promptly scheduled.

Performance of Hybrid: Flat and Nested

(Figure: execution time [us] of BOLT, GCC, Intel, and LLVM with the active, passive, and hybrid policies, lower is better; left: flat, right: nested. A second plot shows execution time [s] vs. the number of outer threads (N) for ++++ bind=spread,unset and +++++ hybrid policy on the nested, no-computation microbenchmark.)

Nested Parallel Regions (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp(i, j);

Flat Parallel Region (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    no_comp(i);

▪ BOLT with the hybrid wait policy is consistently the most efficient in both the flat and the nested case.
  – We suggest a new wait-policy keyword "auto" so that the runtime can choose the implementation.

Summary of the Design

▪ Just using ULTs is insufficient. => Three kinds of optimizations:
  1. Address scalability bottlenecks
  2. ULT-friendly affinity
  3. Hybrid wait policy for flat and nested parallelism

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(L)
  for (int i = 0; i < L; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp();

(Figure: execution time [s] vs. the number of outer threads (L) for the nested, no-computation microbenchmark, showing each optimization step: BOLT (baseline), + efficient resource management, ++ scalable thread startup, +++ bind=spread, ++++ bind=spread,unset, +++++ hybrid policy.)

Index

1. Introduction
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

Microbenchmarks

Left: balanced nested microbenchmark (varying the number of outer threads L).
Right: imbalanced nested microbenchmark; alpha makes the computation size random while keeping the total problem size. (Figure: illustration of the cell workloads for a large alpha.)

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(L)
  for (int i = 0; i < L; i++) {
    #pragma omp parallel for num_threads(28)
    for (int j = 0; j < 28; j++)
      comp_20000_cycles(i, j);
  }

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++) {
    int work_cycles = get_work(i, alpha);
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      comp_cycles(i, j, work_cycles);
  }

(Figure: execution time [s], lower is better; left: vs. the number of outer threads (L), right: vs. alpha (A). Lines: BOLT (baseline), BOLT (opt), GCC, Intel, LLVM, MPC, OMPi, Mercurium, Ideal. Ideal: theoretical lower bound under perfect scalability.)

Microbenchmarks: vs. taskloop

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < L; i++) {
    #pragma omp taskloop grainsize(1)
    for (int j = 0; j < 28; j++)
      comp_20000_cycles(i, j);
  }

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++) {
    int work_cycles = get_work(i, alpha);
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      comp_cycles(i, j, work_cycles);
  }

(Figure: execution time [s], lower is better; left: vs. the outer loop count (L), right: vs. alpha (A). Lines: BOLT (baseline), BOLT (opt), GCC (taskloop), Intel (taskloop), LLVM (taskloop), Ideal.)

▪ Parallel regions of BOLT are as fast as taskloop!

Evaluation: KIFMM

▪ KIFMM [*]: a highly optimized N-body solver.
  – The N-body solver is one of the heaviest kernels in astronomy simulations.
▪ Multiple layers are parallelized by OpenMP: KIFMM itself, BLAS, and FFT (FFTW3).
▪ We focus on the upward phase in KIFMM:

  for (int i = 0; i < max_levels; i++)
    #pragma omp parallel for
    for (int j = 0; j < nodecounts[i]; j++) {
      [...];
      dgemv(...); // dgemv() creates a parallel region.
    }

[*] A. Chandramowlishwaran et al., "Brief Announcement: Towards a Communication Optimal Fast Multipole Method and Its Implications at Exascale", SPAA '12, 2012

Performance: KIFMM

  void kifmm_upward():
    for (int i = 0; i < max_levels; i++)
      #pragma omp parallel for num_threads(56)
      for (int j = 0; j < nodecounts[i]; j++) {
        [...];
        dgemv(...); // creates a parallel region.
      }

  void dgemv(...): // in MKL
    #pragma omp parallel for num_threads(N)
    for (int i = 0; i < [...]; i++)
      dgemv_sequential(...);

(Figure: relative performance (BOLT with 1 thread = 1) vs. the number of inner threads (N), higher is better; NP = 12, # pts = 100,000. Lines: BOLT (opt), Intel (nobind), Intel (true), Intel (close), Intel (spread), Intel (dyn).)
  – Intel OpenMP configurations: nobind(=false)/true/close/spread set proc_bind; dyn sets MKL_DYNAMIC=true. Other parameters are hand tuned (see the paper).

▪ Experiments on 56-core Skylake.
  – Number of threads for the outer parallel region = 56.
  – Number of threads for the inner parallel region = N (varied).
▪ Two important results:
  – N=1 (flat): performance is almost the same.
  – N>1 (nested): BOLT further boosts performance.

Summary: ULT-based OpenMP Threads

▪ Nested OpenMP parallel regions are commonly seen in complicated software stacks.
  => There is demand for efficient OpenMP runtimes that exploit both flat and nested parallelism.
▪ BOLT: a lightweight OpenMP library over ULTs.
  – Simply using ULTs is insufficient:
    • Solve scalability bottlenecks in the LLVM OpenMP runtime
    • ULT-friendly affinity implementation
    • Hybrid thread coordination technique to transparently support both flat and nested parallel regions
▪ BOLT achieves unprecedented performance for nested parallel regions without hurting the performance of flat parallelism.

Index

1. BOLT is OpenMP over lightweight threads
   – What is a user-level thread (ULT)?
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

Argobots Ecosystem

(Figure: the Argobots ecosystem. The Argobots runtime (execution streams, ULTs, pools, and schedulers with random work stealing) underpins runtime systems and applications such as OpenMP (BOLT), Charm++, OmpSs, CilkBots, XcalableMP, PaRSEC, and Intel DAOS, as well as communication libraries such as Argobots-aware MPICH, Open MPI over Argobots, and Mercury RPC with its progress engine.)

BOLT in ECP SOLLVE

(Figure: BOLT's position in ECP SOLLVE. The runtime layer (OpenMP RTL, BOLT, Argobots) sits below the LLVM compiler with its Clang C/C++ and Flang (PGI Fortran) front ends, and interacts with ECP applications and libraries, operating systems, vendors, and the OpenMP ARB. The ECP OpenMP working groups bring together application experts, compiler and runtime experts, OpenMP language representatives, vendor representatives, and testers/integrators. SOLLVE's proposed contributions include application feedback, programming-model and OpenMP extensions (deep copy, tasking, heterogeneous devices and memory hierarchies, performance portability, MPI interoperability), compiler and runtime implementation, a validation suite, and continuous testing and integration on multiple exascale-node architectures.)

How to Build BOLT?

1. Use the latest version (https://github.com/pmodels/bolt)
   – Please follow the instructions:

  git clone https://github.com/pmodels/bolt.git $BOLT_DIR
  cd $BOLT_DIR
  ## to use the very latest version:
  # git checkout latest
  git submodule update --init
  mkdir build && cd build
  cmake ../ -DCMAKE_INSTALL_PREFIX=BOLT_INSTALL_DIR \
        -DCMAKE_BUILD_TYPE=Release -DLIBOMP_USE_ARGOBOTS=on
  make -j install

2. Use Spack (https://github.com/spack/spack)

spack install bolt

BOLT: High Compatibility with Existing Libraries

▪ BOLT keeps compatibility with the LLVM OpenMP interface. – Several compilers (GCC, Intel Compilers, and Clang/LLVM) can be used as a frontend. • i.e., BOLT does not require a special compiler.

(Figure: the GNU, Intel, and Clang compilers can all use BOLT as their OpenMP runtime over Argobots, just as GCC OpenMP and Intel OpenMP run over POSIX threads (Pthreads).)

– No need to recompile existing applications and libraries for BOLT!

  – Note: some existing proposals require special compilers.

How to Use BOLT?

▪ No recompilation needed. Just change the runtime library.
▪ [Recommended] Set LD_LIBRARY_PATH:

  $ LD_LIBRARY_PATH="$BOLT/install/lib:${LD_LIBRARY_PATH}" ./prog

  – Please check that BOLT is loaded, using ldd:

  $ LD_LIBRARY_PATH="$BOLT/install/lib:${LD_LIBRARY_PATH}" ldd ./prog
      linux-vdso.so.1 => (0x00007fff3bbbe000)
      libm.so.6 => /lib64/libm.so.6 (0x00007f6e9fc29000)
      libiomp5.so => /home/user/bolt/install/lib/libiomp5.so (0x00007f6e9f994000)

  – If you cannot find it, LD_LIBRARY_PATH does not work; this often happens if you use GCC as a frontend (use the LD_PRELOAD fallback below).

▪ [Fallback] Set LD_PRELOAD:

  # GCC
  $ LD_PRELOAD="$BOLT/install/lib/libgomp.so:${LD_PRELOAD}" ./prog
  # Intel C/C++ Compilers
  $ LD_PRELOAD="$BOLT/install/lib/libiomp5.so:${LD_PRELOAD}" ./prog
  # Clang/LLVM
  $ LD_PRELOAD="$BOLT/install/lib/libomp.so:${LD_PRELOAD}" ./prog
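After switching the library, a quick sanity check (a small illustrative program, not part of the BOLT distribution) confirms that nested parallel regions really get their own teams:

  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      omp_set_max_active_levels(2);          /* make sure nesting is enabled */
      #pragma omp parallel num_threads(4)
      {
          #pragma omp parallel num_threads(4)
          {
              #pragma omp single
              printf("outer thread %d: inner team of %d threads (level %d)\n",
                     omp_get_ancestor_thread_num(1),
                     omp_get_num_threads(), omp_get_level());
          }
      }
      return 0;
  }

Any OpenMP runtime with nesting enabled will report inner teams of 4; the difference with BOLT is that those inner threads are cheap ULTs.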

BOLT: Supported Features

▪ BOLT supports OpenMP 4.5 with a few exceptions.
  – The same coverage as the original LLVM OpenMP.
  – The very latest OpenMP 5.0, released last November, is not yet fully supported.
▪ BOLT support includes …
  – Basic parallel regions (parallel for/sections)
  – Tasks: task, taskloop, task depend
  – Target offloading (e.g., offloading computation to a GPU device)
    • BOLT does not affect the GPU performance
  – Synchronization: single/master/barrier/ordered, omp_lock
  – SIMD directives (supported by compilers)
▪ BOLT currently does not support …
  – OMPT & OMPD (though they are OpenMP 5.0 features)
  – Task cancellation

Thank You for Listening!

▪ BOLT [*]: a lightweight OpenMP library over Argobots.
  – BOLT achieves unprecedented performance for nested parallel regions without hurting the performance of flat parallelism.

▪ It is available at GitHub / Spack

1. BOLT repository (https://github.com/pmodels/bolt)
2. Spack (https://github.com/spack/spack)

Acknowledgment
This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative.

spack install bolt

[*] S. Iwasaki et al., "BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads", PACT '19, 2019