
Lightweight Threaded Runtime Systems for OpenMP

Shintaro Iwasaki
Argonne National Laboratory / The University of Tokyo
Email: [email protected], [email protected]

Outline of This Talk

▪ BOLT: a lightweight OpenMP library based on LLVM OpenMP.
  – It uses lightweight user-level threads for OpenMP threads and tasks.
▪ BOLT won the Best Paper Award at PACT '19 [*].

▪ Features of BOLT (we will focus on these):
  1. Extremely lightweight OpenMP threads that can efficiently handle nested parallelism.
  2. Tackling an interoperability issue of MPI + OpenMP tasks.
▪ This presentation will cover how BOLT handles nested parallelism.
  – Please visit us: https://www.bolt-omp.org/ or google "BOLT OpenMP"

[*] S. Iwasaki et al., "BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads", PACT '19, 2019

Index

1. Introduction
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

OpenMP: the Most Popular Multithreading Model

▪ Multithreading is essential for exploiting modern CPUs.
▪ OpenMP is a popular parallel programming model.
  – In the HPC field, OpenMP is the most popular model for multithreading.
    • 57% of DOE exascale applications use OpenMP [*].

▪ Not only user programs but also runtimes and libraries are parallelized by OpenMP.
  – Runtimes with an OpenMP backend (Kokkos, RAJA), BLAS/LAPACK libraries (OpenBLAS, MKL, SLATE), DNN libraries (Intel MKL-DNN), FFT libraries (FFTW3), …

[*] D. E. Bernholdt et al., "A Survey of MPI Usage in the US Exascale Computing Project", Concurrency and Computation: Practice and Experience, 2018

Unintentional Nested OpenMP Parallel Regions

▪ OpenMP parallelizes multiple software stacks (user applications, scientific libraries, math libraries).
▪ Nested parallel regions create OpenMP threads exponentially.

Code example:

  // User application
  #pragma omp parallel for
  for (i = 0; i < n; i++)
    dgemv(matrix[n], ...);

  // BLAS library
  void dgemv(...) {
    #pragma omp parallel for    // nested!
    for (i = 0; i < n; i++)
      dgemv_seq(data[n], i);
  }

(Figure: a parallel region whose threads each open another parallel region; the number of OpenMP threads grows multiplicatively while the number of cores stays fixed.)

Can We Just Disable Nested Parallelism?

▪ How should we utilize nested parallel regions?
  – Enable nested parallelism: creates an exponential number of threads.
  – Disable nested parallelism: adversely decreases parallelism.
▪ Example: strong scaling on manycore machines.
  – Is the outer parallelism enough to feed work to all the cores?

  #pragma omp parallel for
  for (i = 0; i < n; i++)
    comp(cells[i], ...);

  void comp(...):
    [...];
    #pragma omp parallel for
    for (i = 0; i < n; i++);

(Figure: the same cell-based computation mapped to a multicore node, a manycore node, and many manycore nodes; with only outer-level parallelism, the number of cells per node may not be enough to keep all cores busy.)

Two Directions to Address Nested Parallelism

▪ Nested parallel regions have been known to be a problem since OpenMP 1.0 (1997).
  – By default, OpenMP disables nested parallelism [*].
▪ Two directions to address this issue:
  1. Use the several workarounds implied in the OpenMP specification.
     => Not practical if users do not know the parallelism in the other software stacks.
  2. Instead of OS-level threads, use lightweight threads as OpenMP threads: user-level threads (ULTs, explained later).
     => A naïve ULT-based OpenMP does not perform well if parallel regions are not nested (i.e., flat).
        • It does not perform well even when parallel regions are nested.

=> Need a solution to efficiently utilize nested parallelism.

[*] Since OpenMP 5.0, the default has become "implementation defined", while most OpenMP systems continue to disable nested parallelism by default.

BOLT: Lightweight OpenMP over ULT for Both Flat & Nested Parallel Regions

▪ We proposed BOLT, a ULT-based OpenMP runtime system, which performs best for both flat and nested parallel regions.

▪ Three key contributions:
  1. An in-depth performance analysis of the LLVM OpenMP runtime, finding several performance barriers.
  2. An implementation of a thread-to-CPU binding interface that supports user-level threads.
  3. A novel thread coordination algorithm to transparently support both flat and nested parallel regions.

Index

1. Introduction
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

Direction 1: Work around with OS-Level Threads (1/2)

  #pragma omp parallel for
  for (i = 0; i < n; i++)
    dgemv(matrix[n], ...);

  // BLAS library
  void dgemv(...) {
    #pragma omp parallel for
    for (i = 0; i < n; i++)
      dgemv_seq(data[n], i);
  }

▪ Several workarounds:
  1. Disable nested parallel regions (OMP_NESTED=false, OMP_MAX_ACTIVE_LEVELS=...)
     • Parallelism is lost.
  2. Finely tune the numbers of threads (OMP_NUM_THREADS=nth1,nth2,nth3,...)
     • Parallelism is lost; the parameters are difficult to tune.

(Figure: thread trees under 1. OMP_NESTED=false and 2. OMP_NUM_THREADS=3,3; either the inner regions run with a single thread, or both levels run with only a handful of threads each.)

Direction 1: Work around with OS-Level Threads (2/2)

▪ Workarounds (cont.):
  3. Limit the total number of threads (OMP_THREAD_LIMIT=nths)
     • Can adversely serialize parallel regions; doesn't work well in practice.
  4. Dynamically adjust the number of threads (OMP_DYNAMIC=true)
     • Can adversely serialize parallel regions; doesn't work well in practice.
  5. Use OpenMP tasks (#pragma omp task/taskloop)
     • Most codes use parallel regions, and semantically, threads != tasks.

(Figure: thread trees under 3. OMP_THREAD_LIMIT=8 and 4. OMP_DYNAMIC=true, where some inner regions get few or no extra threads, and under 5. task/taskloop, where the inner work becomes tasks instead of threads.)

The standard API calls behind workarounds 1, 2, and 4 are sketched below.

▪ How about using lightweight threads for OpenMP threads?
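A minimal C sketch of how workarounds 1, 2, and 4 can also be applied programmatically through the standard OpenMP runtime API (OMP_THREAD_LIMIT, workaround 3, has no API setter and must stay an environment variable); this is an illustration only, not BOLT-specific code:

  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      /* 1. Disable nesting: allow only one active parallel level
            (equivalent to OMP_MAX_ACTIVE_LEVELS=1; omp_set_nested(0) is the
            older knob). */
      omp_set_max_active_levels(1);

      /* 4. Let the runtime shrink team sizes dynamically
            (equivalent to OMP_DYNAMIC=true). */
      omp_set_dynamic(1);

      /* 2. Tune the thread counts per level by hand with num_threads
            (or OMP_NUM_THREADS=8,7). */
      #pragma omp parallel for num_threads(8)
      for (int i = 0; i < 8; i++) {
          #pragma omp parallel for num_threads(7)
          for (int j = 0; j < 7; j++)
              printf("i=%d j=%d active level=%d\n", i, j, omp_get_active_level());
      }
      return 0;
  }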

Direction 2: Use Lightweight Threads => User-Level Threads (ULTs)

▪ User-level threads (ULTs): threads implemented in user space.
  – The runtime manages threads without heavyweight kernel operations.
  – With naïve Pthreads, thread fork/join (creation and switching) involves heavy system calls; ULT fork/join has small overheads.
  – ULTs run on top of Pthreads: scheduling is done by user-level context switching in user space.

(Figure: fork-join performance on KNL; user-level threads are more than 350x faster than naïve Pthreads.)

[*] S. Seo et al., "Argobots: A Lightweight Low-Level Threading and Tasking Framework", TPDS '18, 2018

Using ULTs is Easy

(Figure: an OpenMP-parallelized program on LLVM OpenMP 7.0 over Pthreads vs. on LLVM OpenMP 7.0 over the Argobots ULT layer (= the BOLT baseline).)

▪ Replacing the Pthreads layer with a user-level threading library is a piece of cake.
  – Argobots [*], which we used in this work, has a Pthreads-like API (mutex, TLS, ...), making this easy.
    • Note: other ULT libraries (e.g., Qthreads, Nanos++, MassiveThreads, …) also have similar threading APIs.
  – The ULT-based OpenMP implementation is OpenMP 4.5-compliant (as far as we examined).
▪ Does this "baseline BOLT" perform well?
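To give a sense of the Pthreads-like Argobots API mentioned above, here is a minimal, self-contained ULT example; this is an illustrative sketch of the Argobots user API only, not of how BOLT uses it internally:

  #include <abt.h>
  #include <stdio.h>

  /* A tiny ULT body: Argobots passes the user argument as a void pointer. */
  static void hello(void *arg) {
      printf("hello from ULT %d\n", (int)(size_t)arg);
  }

  int main(int argc, char **argv) {
      ABT_init(argc, argv);                      /* initialize the Argobots runtime */

      ABT_xstream xstream;                       /* the primary execution stream (a Pthread) */
      ABT_xstream_self(&xstream);
      ABT_pool pool;
      ABT_xstream_get_main_pools(xstream, 1, &pool);

      ABT_thread ults[4];
      for (int i = 0; i < 4; i++)                /* ULT creation stays in user space */
          ABT_thread_create(pool, hello, (void *)(size_t)i,
                            ABT_THREAD_ATTR_NULL, &ults[i]);
      for (int i = 0; i < 4; i++) {
          ABT_thread_join(ults[i]);              /* wait for each ULT, then release it */
          ABT_thread_free(&ults[i]);
      }

      ABT_finalize();
      return 0;
  }

Linking against Argobots (e.g., with -labt) is assumed; the point is simply that creating and joining ULTs looks very much like Pthreads, which is what makes the runtime swap straightforward.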

Simple Replacement Performs Poorly

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(N)
  for (int i = 0; i < N; i++)
    #pragma omp parallel for num_threads(28)
    for (int j = 0; j < 28; j++)
      comp_20000_cycles(i, j);

  Nested Parallel Region (balanced)

▪ BOLT (baseline):
  – Faster than GNU OpenMP (GCC).
  – So-so among the ULT-based OpenMPs (MPC, OMPi, Mercurium).
  – Slower than the Intel/LLVM OpenMPs (Intel, LLVM).

(Figure: execution time [s] vs. the number of outer threads (N), lower is better. Lines: BOLT (baseline), GCC, MPC, OMPi, Mercurium, Intel, LLVM, Ideal.)

Configurations:
  GCC: GNU OpenMP with GCC 8.1; Intel: Intel OpenMP with ICC 17.2.174; LLVM: LLVM OpenMP with LLVM/Clang 7.0 (popular Pthreads-based OpenMPs)
  MPC: MPC 3.3.0; OMPi: OMPi 1.2.3 and psthreads 1.0.4; Mercurium: OmpSs (OpenMP 3.1 compat) 2.1.0 + Nanos++ 0.14.1 (state-of-the-art ULT-based OpenMPs)

Index

1. Introduction
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

Three Optimization Directions for Further Performance

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(N)
  for (int i = 0; i < N; i++)
    #pragma omp parallel for num_threads(28)
    for (int j = 0; j < 28; j++)
      comp_20000_cycles(i, j);

  Nested Parallel Region (balanced)

▪ The naïve replacement (BOLT (baseline)) does not perform well.
▪ We need advanced optimizations:
  1. Solving bottlenecks
  2. ULT-friendly affinity
  3. Efficient thread coordination

(Figure: execution time [s] vs. the number of outer threads (N), lower is better. Lines: BOLT (baseline), BOLT (opt), GCC, Intel, LLVM, MPC, OMPi, Mercurium, Ideal.)

1. Solve Scalability Bottlenecks (1/2)

(Figure: nested parallel regions drawing on shared global resources: the thread descriptor pool, the team descriptor pool, and the thread ID counter, with a team cache per parallel region.)

▪ Resource management optimizations:
  1. Divide the single large lock protecting all threading resources (see the sketch below).
     • This lock cost is negligible with Pthreads, but it becomes a bottleneck with lightweight threads.
  2. Enable multi-level caching of parallel regions.
     • Called "nested hot teams" in LLVM OpenMP.
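As a rough illustration of the first optimization, a sketch of splitting one global lock into per-resource locks (hypothetical names and data layout; BOLT's real resource pools are more elaborate):

  #include <pthread.h>
  #include <stdio.h>

  /* Before: one global lock (in the spirit of LLVM OpenMP's fork/join
     bootstrap lock) serializes every threading-resource operation.
     After: each resource gets its own lock, so concurrently forked
     parallel regions rarely contend. */
  static pthread_mutex_t thread_desc_pool_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t team_desc_pool_lock   = PTHREAD_MUTEX_INITIALIZER;
  static pthread_mutex_t thread_id_lock        = PTHREAD_MUTEX_INITIALIZER;
  static int next_thread_id;

  /* Allocating a thread ID now touches only its own lock, not the locks
     guarding the thread/team descriptor pools. */
  static int alloc_thread_id(void) {
      pthread_mutex_lock(&thread_id_lock);
      int id = next_thread_id++;
      pthread_mutex_unlock(&thread_id_lock);
      return id;
  }

  int main(void) {
      (void)thread_desc_pool_lock;   /* the descriptor pools would be guarded analogously */
      (void)team_desc_pool_lock;
      printf("allocated thread id %d\n", alloc_thread_id());
      return 0;
  }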

1. Solve Scalability Bottlenecks (2/2)

▪ Thread creation optimizations:
  3. Binary creation of OpenMP threads.
     • Instead of the master thread creating every team member serially (the default in LLVM OpenMP), created threads help create the rest, so the critical path gets shorter (see the sketch below).

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(L)
  for (int i = 0; i < L; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp();

  Nested Parallel Regions (no computation): no computation, to measure the pure overheads.

(Figure: execution time [s] vs. the number of outer threads (L), lower is better; BOLT (baseline), + efficient resource management, ++ scalable thread startup.)
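A rough, Pthreads-based sketch of the tree-style start-up idea (the run_range() helper is only for illustration; BOLT's actual thread creation code in the LLVM OpenMP runtime differs):

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct { int lo, hi; } range_t;   /* the range of thread IDs to start */

  /* Each thread repeatedly hands the upper half of its remaining ID range to a
     new thread and keeps the lower half, so N threads start after O(log N)
     creations on the critical path instead of O(N). */
  static void *run_range(void *arg) {
      range_t r = *(range_t *)arg;
      free(arg);
      pthread_t kids[32];
      int nkids = 0;
      while (r.hi - r.lo > 1) {
          int mid = r.lo + (r.hi - r.lo) / 2;
          range_t *upper = malloc(sizeof *upper);
          upper->lo = mid;
          upper->hi = r.hi;
          pthread_create(&kids[nkids++], NULL, run_range, upper);
          r.hi = mid;                        /* keep splitting the lower half */
      }
      printf("thread %d starts its work\n", r.lo);
      for (int i = 0; i < nkids; i++)        /* the joins form a tree as well */
          pthread_join(kids[i], NULL);
      return NULL;
  }

  int main(void) {
      range_t *team = malloc(sizeof *team);
      team->lo = 0;                          /* thread 0 acts as the master */
      team->hi = 8;                          /* a "team" of 8 threads */
      run_range(team);
      return 0;
  }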

2. Affinity: How to Implement Affinity for ULTs

  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread
  #pragma omp parallel for num_threads(4)
  for (i = 0; i < 4; i++)
    comp(i);

With proc_bind, threads are bound to places.

(Figure: OpenMP threads 0-3 bound to places 0-3, each place covering two of the eight cores.)

▪ OpenMP 4.0 introduced place and proc_bind for affinity.
  – OS-level thread-based libraries (e.g., GNU OpenMP) implement them with CPU masks.
▪ BOLT (baseline) ignored affinity (still standard compliant).
▪ However, affinity should be useful to 1. improve locality and 2. reduce queue contention.
  – Note: ULT runtimes use shared queues + random work stealing.
▪ How can we implement places over ULTs?

Implementation: Place Queue

▪ Place queues can implement OpenMP affinity in BOLT.

  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread
  #pragma omp parallel for num_threads(4)
  for (i = 0; i < 4; i++)
    comp(i);

(Figure: one scheduler per core runs on a Pthread and owns a shared queue; in addition, the schedulers belonging to each place share a place queue, and OpenMP threads bound to a place are pushed there as ULTs.)

▪ Problem: the OpenMP affinity setting is too deterministic.

OpenMP Affinity is Too Deterministic

▪ Once affinity (bind-var) is set, all the OpenMP threads created in the descendant parallel regions are bound to places.
  – The OpenMP specification requires this behavior.

  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread
  #pragma omp parallel for num_threads(8)
  for (int i = 0; i < 8; i++)
    #pragma omp parallel for num_threads(8)
    for (int j = 0; j < 8; j++)
      comp(i, j);

(Figure: the outer threads i=0..7 are distributed over place queues 0-3, but all the inner threads of i=0 (i=0, j=1..7) stay in the same place queue, so load balancing is limited.)

▪ Promising direction: schedule the innermost threads with unbound random work stealing.
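To observe this inheritance with the standard OpenMP 4.5 affinity query functions, here is a small illustrative program, independent of BOLT:

  /* Run with places and binding enabled, e.g.:
   *   OMP_PLACES=cores OMP_PROC_BIND=true ./a.out */
  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      omp_set_max_active_levels(2);      /* allow two active levels of parallelism */
      #pragma omp parallel num_threads(4) proc_bind(spread)
      {
          int outer = omp_get_thread_num();
          #pragma omp parallel num_threads(2)
          {
              /* Inner threads inherit the outer thread's place partition,
                 so they only ever report places from that sub-partition. */
              printf("outer %d / inner %d: place %d (partition of %d places)\n",
                     outer, omp_get_thread_num(),
                     omp_get_place_num(), omp_get_partition_num_places());
          }
      }
      return 0;
  }

Every inner thread reports a place inside its outer thread's partition, which is exactly the determinism that the proposal on the next slide relaxes.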

Proposed New PROC_BIND: "unset"

▪ unset: resets the affinity setting of the specified parallel region.
  (In detail: this thread affinity policy resets the bind-var ICV and the place-partition-var ICV to their implementation-defined values and instructs the execution environment to follow these values.)

Without unset:
  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread
  #pragma omp parallel for num_threads(8)
  for (int i = 0; i < 8; i++)
    #pragma omp parallel for num_threads(8)
    for (int j = 0; j < 8; j++)
      comp(i, j);

With unset:
  // OMP_PLACES={0,1},{2,3},{4,5},{6,7}
  // OMP_PROC_BIND=spread,unset
  #pragma omp parallel for num_threads(8)
  for (int i = 0; i < 8; i++)
    #pragma omp parallel for num_threads(8)
    for (int j = 0; j < 8; j++)
      comp(i, j);

(Figure, left: with spread,unset, the innermost threads (e.g., i=0, j=1..7) go to the schedulers' shared queues and can be scheduled on any core by random work stealing.)
(Figure, right: execution time [s] vs. the number of outer threads (N), lower is better; BOLT (baseline), + efficient resource management, ++ scalable thread startup, +++ bind=spread, ++++ bind=spread,unset.)

▪ This scheduling flexibility gives higher performance.

3. Flat Parallelism: Poor Performance

▪ BOLT should perform as well as the original LLVM OpenMP.

Nested Parallel Regions (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp(i, j);

Flat Parallel Region (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    no_comp(i);

(Figure: execution time [us] of BOLT (passive), GCC, Intel, and LLVM, lower is better; left: nested, with OMP_WAIT_POLICY=PASSIVE for GCC/Intel/LLVM; right: flat, with OMP_WAIT_POLICY=ACTIVE.)

▪ Optimal OMP_WAIT_POLICY for GCC/Intel/LLVM improves performance of flat parallelism.

Active Waiting Policy for Flat Parallelism

  for (int iter = 0; iter < n; iter++) {
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < 4; i++)
      comp(i);
  }

▪ The active waiting policy (OMP_WAIT_POLICY=active) improves the performance of flat parallelism by busy-wait-based synchronization.
  – If active, a Pthreads-based OpenMP thread busy-waits for the next parallel region; if passive, threads sleep on a condition variable after completing their work.
▪ BOLT, on the other hand, yields to a scheduler on fork and join (roughly equivalent to passive); the scheduler must find the next ULT before switching back to a thread.

(Figure: fork/join timelines of busy-waiting Pthreads-based OpenMP threads vs. BOLT threads switching to and from their schedulers.)

Busy wait is faster than even lightweight user-level context switching!

Implementation of Active Policy in BOLT

▪ If active, BOLT busy-waits for the next parallel region; if passive, BOLT relies on ULT context switching.

(Figure: fork/join timelines of the active and passive implementations in BOLT; active threads busy-wait between regions, while passive threads switch back to their schedulers, which then find the next ULT.)

▪ ULTs are not preemptive, so in the active case BOLT periodically yields to a scheduler to avoid starving other work (especially when the number of OpenMP threads exceeds the number of schedulers).

Performance of Flat and Nested

Nested Parallel Regions (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp(i, j);

Flat Parallel Region (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    no_comp(i);

(Figure: execution time [us], lower is better; left: nested with the passive policy, right: flat with the active policy.)
  – MPC serializes nested parallel regions, so it is the fastest in the nested case.
  – Like the baseline BOLT, MPC … OMPi do not implement the active policy.

Penalty of the Opposite Wait Policy

(Figure: execution time [us] of BOLT, GCC, Intel, and LLVM with the active vs. passive policy, lower is better; left: flat, right: nested. Choosing the wrong policy costs roughly 3x to 23x, and up to ~650,000 us in the worst case.)

Nested Parallel Regions (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp(i, j);

Flat Parallel Region (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    no_comp(i);

▪ How threads are coordinated significantly affects the overheads.
  – This large performance penalty discourages users from enabling nesting.

▪ Is there a good algorithm to transparently support both flat and nested parallelism?

Busy-Waits in Both Active/Passive Algorithms

(Figure: fork/join timelines of BOLT (active) and BOLT (passive); in the active case the OpenMP threads busy-wait, in the passive case the schedulers busy-wait while looking for the next ULT.)

  // BOLT (active): the OpenMP thread itself busy-waits on a flag
  void omp_thread() {
  RESTART_THREAD:
    comp();
    while (time_elapsed() < KMP_BLOCKTIME) {
      if (team->next_parallel_region_flag)
        goto RESTART_THREAD;
    }
  }

  // BOLT (passive): the scheduler busy-waits on its queue
  void user_scheduler() {
    while (1) {
      ULT_t *ult = get_ULT_from_queue();
      if (ult != NULL)
        execute(ult);
    }
  }

▪ In both the active and the passive case, a busy-wait runs after the threads complete their work: the OpenMP thread checks a flag, or the scheduler polls a queue.
  – Can we merge the two loops to perform both scheduling and flag checking?

Algorithm: Hybrid Wait Policy

(Figure: fork/join timelines of BOLT (active), BOLT (passive), and BOLT (hybrid); in the hybrid case a finished thread busy-waits while also checking its parent scheduler's queue.)

  void omp_thread() {
  RESTART_THREAD:
    comp();
    while (time_elapsed() < KMP_BLOCKTIME) {
      if (team->next_parallel_region_flag)
        goto RESTART_THREAD;
      ULT_t *ult = get_ULT_from_queue(parent_scheduler);
      if (ult != NULL)
        return_to_sched_and_run(ult);
    }
  }

  Note: this technique is not applicable to OS-level threads, since their scheduler is not exposed to the user.

▪ Hybrid: execute the flag check and the queue check alternately.
  – [flat]: a thread does not go back to a scheduler.
  – [nested]: another available ULT is promptly scheduled.

Performance of Hybrid: Flat and Nested

(Figure: execution time [us] of BOLT, GCC, Intel, and LLVM with the active, passive, and hybrid policies, lower is better; left: flat, right: nested. A second plot shows execution time [s] vs. the number of outer threads (N) for ++++ bind=spread,unset and +++++ hybrid policy on the nested, no-computation microbenchmark.)

Nested Parallel Regions (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp(i, j);

Flat Parallel Region (no computation):
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++)
    no_comp(i);

▪ BOLT with the hybrid wait policy is consistently the most efficient in both the flat and the nested case.
  – We suggest a new wait-policy keyword "auto" so that the runtime can choose the implementation.

Summary of the Design

▪ Just using ULTs is insufficient. => Three kinds of optimizations:
  1. Address scalability bottlenecks
  2. ULT-friendly affinity
  3. Hybrid wait policy for flat and nested parallelism

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(L)
  for (int i = 0; i < L; i++)
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      no_comp();

(Figure: execution time [s] vs. the number of outer threads (L) for the nested, no-computation microbenchmark, showing each optimization step: BOLT (baseline), + efficient resource management, ++ scalable thread startup, +++ bind=spread, ++++ bind=spread,unset, +++++ hybrid policy.)

Index

1. Introduction
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

Microbenchmarks

Left: balanced nested microbenchmark (varying the number of outer threads L).
Right: imbalanced nested microbenchmark; alpha makes the computation size random while keeping the total problem size. (Figure: illustration of the cell workloads for a large alpha.)

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(L)
  for (int i = 0; i < L; i++) {
    #pragma omp parallel for num_threads(28)
    for (int j = 0; j < 28; j++)
      comp_20000_cycles(i, j);
  }

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++) {
    int work_cycles = get_work(i, alpha);
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      comp_cycles(i, j, work_cycles);
  }

(Figure: execution time [s], lower is better; left: vs. the number of outer threads (L), right: vs. alpha (A). Lines: BOLT (baseline), BOLT (opt), GCC, Intel, LLVM, MPC, OMPi, Mercurium, Ideal. Ideal: theoretical lower bound under perfect scalability.)

Microbenchmarks: vs. taskloop

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < L; i++) {
    #pragma omp taskloop grainsize(1)
    for (int j = 0; j < 28; j++)
      comp_20000_cycles(i, j);
  }

  // Run on a 56-core Skylake server
  #pragma omp parallel for num_threads(56)
  for (int i = 0; i < 56; i++) {
    int work_cycles = get_work(i, alpha);
    #pragma omp parallel for num_threads(56)
    for (int j = 0; j < 56; j++)
      comp_cycles(i, j, work_cycles);
  }

(Figure: execution time [s], lower is better; left: vs. the outer loop count (L), right: vs. alpha (A). Lines: BOLT (baseline), BOLT (opt), GCC (taskloop), Intel (taskloop), LLVM (taskloop), Ideal.)

▪ Parallel regions of BOLT are as fast as taskloop!

Evaluation: KIFMM

▪ KIFMM [*]: a highly optimized N-body solver.
  – The N-body solver is one of the heaviest kernels in astronomy simulations.
▪ Multiple layers are parallelized by OpenMP: KIFMM itself, BLAS, and FFT (FFTW3).
▪ We focus on the upward phase in KIFMM:

  for (int i = 0; i < max_levels; i++)
    #pragma omp parallel for
    for (int j = 0; j < nodecounts[i]; j++) {
      [...];
      dgemv(...); // dgemv() creates a parallel region.
    }

[*] A. Chandramowlishwaran et al., "Brief Announcement: Towards a Communication Optimal Fast Multipole Method and Its Implications at Exascale", SPAA '12, 2012

Performance: KIFMM

  void kifmm_upward():
    for (int i = 0; i < max_levels; i++)
      #pragma omp parallel for num_threads(56)
      for (int j = 0; j < nodecounts[i]; j++) {
        [...];
        dgemv(...); // creates a parallel region.
      }

  void dgemv(...): // in MKL
    #pragma omp parallel for num_threads(N)
    for (int i = 0; i < [...]; i++)
      dgemv_sequential(...);

(Figure: relative performance (BOLT with 1 thread = 1) vs. the number of inner threads (N), higher is better; NP = 12, # pts = 100,000. Lines: BOLT (opt), Intel (nobind), Intel (true), Intel (close), Intel (spread), Intel (dyn).)
  – Intel OpenMP configurations: nobind(=false)/true/close/spread set proc_bind; dyn sets MKL_DYNAMIC=true. Other parameters are hand tuned (see the paper).

▪ Experiments on 56-core Skylake.
  – Number of threads for the outer parallel region = 56.
  – Number of threads for the inner parallel region = N (varied).
▪ Two important results:
  – N=1 (flat): performance is almost the same.
  – N>1 (nested): BOLT further boosts performance.

Summary: ULT-based OpenMP Threads

▪ Nested OpenMP parallel regions are commonly seen in complicated software stacks.
  => There is demand for efficient OpenMP runtimes that exploit both flat and nested parallelism.
▪ BOLT: a lightweight OpenMP library over ULTs.
  – Simply using ULTs is insufficient:
    • Solve scalability bottlenecks in the LLVM OpenMP runtime
    • ULT-friendly affinity implementation
    • Hybrid thread coordination technique to transparently support both flat and nested parallel regions
▪ BOLT achieves unprecedented performance for nested parallel regions without hurting the performance of flat parallelism.

Index

1. BOLT is OpenMP over lightweight threads
   – What is a user-level thread (ULT)?
2. User-level threads for OpenMP threads
   – Nested parallel regions and issues
   – Efficient adoption of ULTs
   – Evaluation
3. User-level threads for OpenMP tasks
   – OpenMP task and MPI operations
   – Tasking over ULT-aware MPI
4. Conclusions and future work

Argobots Ecosystem

(Figure: the Argobots ecosystem. The Argobots runtime (execution streams, ULTs, pools, and schedulers with random work stealing) underpins runtime systems and applications such as OpenMP (BOLT), Charm++, OmpSs, CilkBots, XcalableMP, PaRSEC, and Intel DAOS, as well as communication libraries such as Argobots-aware MPICH, Open MPI over Argobots, and Mercury RPC with its progress engine.)

BOLT in ECP SOLLVE

(Figure: BOLT's position in ECP SOLLVE. The runtime layer (OpenMP RTL, BOLT, Argobots) sits below the LLVM compiler with its Clang C/C++ and Flang (PGI Fortran) front ends, and interacts with ECP applications and libraries, operating systems, vendors, and the OpenMP ARB. The ECP OpenMP working groups bring together application experts, compiler and runtime experts, OpenMP language representatives, vendor representatives, and testers/integrators. SOLLVE's proposed contributions include application feedback, programming-model and OpenMP extensions (deep copy, tasking, heterogeneous devices and memory hierarchies, performance portability, MPI interoperability), compiler and runtime implementation, a validation suite, and continuous testing and integration on multiple exascale-node architectures.)

How to Build BOLT?

1. Use the latest version (https://github.com/pmodels/bolt)
   – Please follow the instructions:

  git clone https://github.com/pmodels/bolt.git $BOLT_DIR
  cd $BOLT_DIR
  ## to use the very latest version:
  # git checkout latest
  git submodule update --init
  mkdir build && cd build
  cmake ../ -DCMAKE_INSTALL_PREFIX=BOLT_INSTALL_DIR \
        -DCMAKE_BUILD_TYPE=Release -DLIBOMP_USE_ARGOBOTS=on
  make -j install

2. Use Spack (https://github.com/spack/spack)

spack install bolt

BOLT: High Compatibility with Existing Libraries

▪ BOLT keeps compatibility with the LLVM OpenMP interface. – Several compilers (GCC, Intel Compilers, and Clang/LLVM) can be used as a frontend. • i.e., BOLT does not require a special compiler.

(Figure: the GNU, Intel, and Clang compilers can all use BOLT as their OpenMP runtime over Argobots, just as GCC OpenMP and Intel OpenMP run over POSIX threads (Pthreads).)

– No need to recompile existing applications and libraries for BOLT!

  – Note: some existing proposals require special compilers.

How to Use BOLT?

▪ No recompilation needed. Just change the runtime library.
▪ [Recommended] Set LD_LIBRARY_PATH:

  $ LD_LIBRARY_PATH="$BOLT/install/lib:${LD_LIBRARY_PATH}" ./prog

  – Please check that BOLT is loaded, using ldd:

  $ LD_LIBRARY_PATH="$BOLT/install/lib:${LD_LIBRARY_PATH}" ldd ./prog
      linux-vdso.so.1 => (0x00007fff3bbbe000)
      libm.so.6 => /lib64/libm.so.6 (0x00007f6e9fc29000)
      libiomp5.so => /home/user/bolt/install/lib/libiomp5.so (0x00007f6e9f994000)

  – If you cannot find it, LD_LIBRARY_PATH does not work; this often happens if you use GCC as a frontend (use the LD_PRELOAD fallback below).

▪ [Fallback] Set LD_PRELOAD:

  # GCC
  $ LD_PRELOAD="$BOLT/install/lib/libgomp.so:${LD_PRELOAD}" ./prog
  # Intel C/C++ Compilers
  $ LD_PRELOAD="$BOLT/install/lib/libiomp5.so:${LD_PRELOAD}" ./prog
  # Clang/LLVM
  $ LD_PRELOAD="$BOLT/install/lib/libomp.so:${LD_PRELOAD}" ./prog
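After switching the library, a quick sanity check (a small illustrative program, not part of the BOLT distribution) confirms that nested parallel regions really get their own teams:

  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      omp_set_max_active_levels(2);          /* make sure nesting is enabled */
      #pragma omp parallel num_threads(4)
      {
          #pragma omp parallel num_threads(4)
          {
              #pragma omp single
              printf("outer thread %d: inner team of %d threads (level %d)\n",
                     omp_get_ancestor_thread_num(1),
                     omp_get_num_threads(), omp_get_level());
          }
      }
      return 0;
  }

Any OpenMP runtime with nesting enabled will report inner teams of 4; the difference with BOLT is that those inner threads are cheap ULTs.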

BOLT: Supported Features

▪ BOLT supports OpenMP 4.5 with a few exceptions.
  – The same coverage as the original LLVM OpenMP.
  – The very latest OpenMP 5.0, released last November, is not yet fully supported.
▪ BOLT support includes …
  – Basic parallel regions (parallel for/sections)
  – Tasks: task, taskloop, task depend
  – Target offloading (e.g., offloading computation to a GPU device)
    • BOLT does not affect the GPU performance
  – Synchronization: single/master/barrier/ordered, omp_lock
  – SIMD directives (supported by compilers)
▪ BOLT currently does not support …
  – OMPT & OMPD (though they are OpenMP 5.0 features)
  – Task cancellation

Thank You for Listening!

▪ BOLT [*]: a lightweight OpenMP library over Argobots.
  – BOLT achieves unprecedented performance for nested parallel regions without hurting the performance of flat parallelism.

▪ It is available at GitHub / Spack

1. BOLT repository (https://github.com/pmodels/bolt)
2. Spack (https://github.com/spack/spack)

Acknowledgment
This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy's Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation's exascale computing imperative.

spack install bolt

[*] S. Iwasaki et al., "BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads", PACT '19, 2019