Speed Up Your OpenMP Code Without Doing Much

Ruud van der Pas, Distinguished Engineer

Oracle Linux and Virtualization Engineering, Oracle, Santa Clara, CA, USA

Agenda

• Motivation
• Low Hanging Fruit
• A Case Study

Motivation

OpenMP and Performance

You can get good performance with OpenMP

And your code will scale

If you do things in the right way

Easy ≠ Stupid

Ease of Use?

The ease of use of OpenMP is a mixed blessing (but I still prefer it over the alternative)

Ideas are easy and quick to implement

But some constructs are more expensive than others

Often, a small change can have a big impact

In this talk we show some of this low hanging fruit

Low Hanging Fruit

About Single Thread Performance and Scalability

You have to pay attention to single thread performance

Why? If your code performs badly on 1 core, what do you think will happen on 10 cores, 20 cores, … ?

Scalability can mask poor performance! (a slow code tends to scale better …)

The Basics For All Users

Do not parallelize what does not matter

Never tune your code without using a profiler

Do not share data unless you have to (in other words, use private data as much as you can; see the sketch further below)

One “parallel for” is fine. More is EVIL.

Think BIG (maximize the size of the parallel regions)
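
As an illustration of the data-sharing advice above, a minimal sketch (the arrays, sizes, and the "scale" variable are invented for the example): default(none) forces every variable to be scoped explicitly, and the temporaries each thread needs are kept private simply by declaring them inside the loop.

   #include <stdio.h>

   #define N 1000000

   double a[N], b[N];

   int main(void)
   {
      double scale = 2.0;

      /* "a", "b" and "scale" are used by all threads and are shared;
         the loop index "k" and the temporary "t" are declared inside
         the loop and are therefore private to each thread.           */
      #pragma omp parallel for default(none) shared(a, b, scale)
      for (int k = 0; k < N; k++) {
         double t = scale * (double) k;   /* private temporary */
         a[k] = t;
         b[k] = t + 1.0;
      }

      printf("a[10] = %f  b[10] = %f\n", a[10], b[10]);
      return 0;
   }

Compile with "gcc -fopenmp" (or your compiler's equivalent); without the OpenMP flag the pragma is ignored and the code still runs serially.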

The Wrong and Right Way Of Doing Things

The wrong way:

   #pragma omp parallel for
   { ... }

   #pragma omp parallel for
   { ... }

Parallel region overhead repeated "n" times
No potential for the "nowait" clause

The right way:

   #pragma omp parallel
   {
      #pragma omp for
      { ... }

      #pragma omp for nowait
      { ... }

   } // End of parallel region

Parallel region overhead only once
Potential for the "nowait" clause
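
To make "the right way" concrete, a hedged sketch (the arrays and sizes are invented): two independent initialization loops share one parallel region, and the nowait clause on the second loop removes its implied barrier, since the barrier at the end of the parallel region follows anyway.

   #include <stdio.h>

   #define N 1000000

   double a[N], b[N];

   int main(void)
   {
      #pragma omp parallel
      {
         #pragma omp for
         for (int k = 0; k < N; k++)
            a[k] = 1.0;

         #pragma omp for nowait
         for (int k = 0; k < N; k++)
            b[k] = 2.0;

      } // End of parallel region

      printf("a[0] = %f  b[0] = %f\n", a[0], b[0]);
      return 0;
   }

The parallel region is created once, instead of once per loop.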

More Basics

Every barrier matters (and please use them carefully)

The same is true for locks and critical regions (use atomic constructs where possible)

EVERYTHING matters (minor overheads get out of hand eventually)
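
A hedged illustration of the critical-versus-atomic point (the counter and the loop are invented for the example): a simple scalar update does not need a critical region; an atomic update is much cheaper, and a reduction is cheaper still.

   #include <stdio.h>

   #define N 1000000

   int main(void)
   {
      long hits = 0;

      #pragma omp parallel for
      for (int k = 0; k < N; k++) {
         if (k % 3 == 0) {
            /* A critical region here would serialize the threads and add
               locking overhead; atomic typically maps to a cheap
               hardware update.                                           */
            #pragma omp atomic
            hits++;
         }
      }

      /* Cheaper still: give every thread a private copy and combine the
         copies once at the end of the loop.                              */
      long hits_red = 0;
      #pragma omp parallel for reduction(+:hits_red)
      for (int k = 0; k < N; k++)
         if (k % 3 == 0) hits_red++;

      printf("hits = %ld  hits_red = %ld\n", hits, hits_red);
      return 0;
   }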

Another Example

   #pragma omp single
   {
      ...
   } // End of single region

   #pragma omp barrier

The second barrier is redundant because the single construct has an implied barrier already (this second barrier will still take time though)
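
A minimal sketch of the same point (the printf calls merely stand in for real work): the single construct already ends with an implied barrier, so no explicit barrier is needed after it, and when no thread depends on the result right away, a nowait removes even that implied barrier.

   #include <stdio.h>

   int main(void)
   {
      #pragma omp parallel
      {
         #pragma omp single
         {
            printf("done by one thread; the others wait here\n");
         } // Implied barrier -- an explicit "#pragma omp barrier" would be redundant

         #pragma omp single nowait
         {
            printf("done by one thread; the others continue immediately\n");
         } // "nowait" removes the implied barrier of this single region

      } // End of parallel region

      return 0;
   }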

One More Example

   a[npoint] = value;

   #pragma omp parallel …
   {
      #pragma omp for
      for (int64_t k = 0; k < npoint; k++)
         a[k] = -1;

      #pragma omp for
      for (int64_t k = npoint+1; k < n; k++)
         a[k] = -1;

   } // End of parallel region

What Is Wrong With This?

Looking again at the code above:

✓ There are 2 barriers (the implied barrier at the end of each "omp for")
✓ Two times serial and parallel overhead

The Sequence Of Operations

[Diagram: the first loop fills elements 0 … npoint-1 with -1, followed by a barrier; the second loop fills elements npoint+1 … n-1 with -1, followed by another barrier; element npoint holds "value".]

The Actual Operation Performed

[Diagram: the end result is that elements 0 … n-1 are all set to -1, except element npoint, which holds "value".]

The Modified Code

   #pragma omp parallel …
   {
      #pragma omp for
      for (int64_t k = 0; k < n; k++)
         a[k] = -1;

      #pragma omp single nowait
      {a[npoint] = value;}

   } // End of parallel region

✓ Only one barrier
✓ One time serial and parallel overhead
✓ A single loop over the full value of "n" only
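
Putting the modified code together as a compilable sketch (the array name, "npoint", and "value" follow the slides; the sizes are invented): one loop over the full range, plus a single construct for the one element that differs, executed after the loop's implied barrier.

   #include <stdio.h>
   #include <stdint.h>

   #define N 1000000

   int64_t a[N];

   int main(void)
   {
      int64_t n      = N;
      int64_t npoint = 12345;   /* position of the special element (invented) */
      int64_t value  = 42;      /* value stored there (invented)               */

      #pragma omp parallel default(none) shared(a, n, npoint, value)
      {
         /* One work-sharing loop initializes the full array. */
         #pragma omp for
         for (int64_t k = 0; k < n; k++)
            a[k] = -1;

         /* After the loop's implied barrier, one thread stores the special
            value; "nowait" removes the implied barrier of the single itself. */
         #pragma omp single nowait
         { a[npoint] = value; }

      } // End of parallel region

      printf("a[%ld] = %ld\n", (long) npoint, (long) a[npoint]);
      return 0;
   }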

A Case Study

The Application

The OpenMP reference version of the Graph 500 benchmark

Structure of the code:
• Construct an undirected graph of the specified size
• Repeat a number of times (64 search keys in these experiments):
   - Randomly select a key and conduct a BFS search
   - Verify the result is a tree

For the benchmark score, only the search time matters

Testing Circumstances

The code uses a parameter SCALE to set the size of the graph

The value used for SCALE is 24 (~9 GB of RAM used)

All experiments were conducted in the Oracle Cloud ("OCI")

Used a VM instance with 1 Skylake socket (8 cores, 16 threads)

OS: Oracle Linux; compiler: gcc 8.3.1

The Oracle Performance Analyzer was used to make the profiles

The Dynamic Behaviour

[Timeline: a graph generation phase, followed by search and verification for 64 keys; search and verification phases alternate.]

Note: The Oracle Performance Analyzer was used to generate this time line

The Performance For SCALE = 24 (9 GB)

[Chart: search time in seconds and parallel speed up versus the number of OpenMP threads (1-16).]

✓ Search time reduces as threads are added
✓ Benefit from all 16 (hyper) threads
✓ But the 3.2x parallel speed up is disappointing

Action Plan

Used a profiling tool* to identify the time consuming parts

Found several opportunities to improve the OpenMP part

These are actually shown earlier in this talk

Although simple changes, the improvement is substantial:

*) Use your preferred tool; we used the Oracle Performance Analyzer

Performance Of The Original and Modified Code

[Chart: search time and parallel speed up of the original and modified code versus the number of OpenMP threads (1-16).]

✓ A noticeable reduction in the search time at 4 threads and beyond
✓ The parallel speed up increases to 6.5x
✓ The search time is reduced by 2x

Are We Done Yet?

The 2x reduction in the search time is encouraging

The efforts to achieve this have been limited

The question is whether there is more to be gained

Let's look at the dynamic behaviour of the threads:

A Comparison Between 1 And 2 Threads

Both phases benefit from using 2 threads

[Timeline: the search and verification phases, each faster with 2 threads than with 1.]

Note: The Oracle Performance Analyzer was used to generate this time line

Zoom In On The Second Thread

[Timeline of the search and verification phases: the master thread is fine, but the second thread has gaps.]

Note: The Oracle Performance Analyzer was used to generate this time line

How About 4 Threads?

Adding threads makes things worse

[Timeline for 4 threads.]

Note: The Oracle Performance Analyzer was used to generate this time line

The Problem

   #pragma omp for                                    ← Regular, structured parallel loop
   for (int64_t k = k1; k < oldk2; ++k) {
      ...
      for (int64_t vo = XOFF(v); vo < veo; ++vo) {    ← Irregular loop
         ...
         if (bfs_tree[j] == -1) {                     ← Unbalanced flow
            ...
         } // End of if
      } // End of for-loop on vo
   } // End of parallel for-loop on k

Load Imbalance!

The Solution

   $ export OMP_SCHEDULE="dynamic,25"

   #pragma omp for schedule(runtime)
   for (int64_t k = k1; k < oldk2; ++k) {
      ...
      for (int64_t vo = XOFF(v); vo < veo; ++vo) {
         ...
         if (bfs_tree[j] == -1) {
            ...
         } // End of if
      } // End of for-loop on vo
   } // End of parallel for-loop on k
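
A self-contained sketch of the same idea, with an artificially imbalanced (triangular) inner loop standing in for the irregular Graph 500 loop (all names and sizes are invented): schedule(runtime) leaves the scheduling decision to OMP_SCHEDULE, so the policy and chunk size can be changed without recompiling.

   /* Build:  gcc -fopenmp -O2 sched_demo.c
      Run:    OMP_SCHEDULE="dynamic,25" OMP_NUM_THREADS=8 ./a.out   */
   #include <stdio.h>
   #include <stdint.h>

   #define N 20000

   int main(void)
   {
      double sum = 0.0;

      /* The work per iteration of "k" grows with k, so a static schedule
         hands the last threads far more work than the first ones.
         schedule(runtime) defers the choice to the OMP_SCHEDULE setting. */
      #pragma omp parallel for schedule(runtime) reduction(+:sum)
      for (int64_t k = 0; k < N; k++) {
         for (int64_t j = 0; j < k; j++)      /* triangular: imbalanced */
            sum += (double) (j % 7);
      }

      printf("sum = %f\n", sum);
      return 0;
   }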

The Performance With Dynamic Scheduling Added

[Chart: search time and parallel speed up of the original code, the modified code, and the modified code with dynamic scheduling versus the number of OpenMP threads (1-16), plus a linear-scaling reference.]

✓ A 1% slow down on a single thread
✓ Near linear scaling for up to 8 threads
✓ The parallel speed up increased to 9.5x
✓ The search time is reduced by 3x

The Improvement

[Timelines: the original static scheduling compared with dynamic scheduling.]

Note: The Oracle Performance Analyzer was used to generate this time line

Conclusions

A good profiling tool is indispensable when tuning applications

With some simple improvements in the use of OpenMP, a 2x reduction in the search time is realized

Compared to the original code, adding dynamic scheduling reduces the search time by 3x

Thank You And … Stay Tuned!

Ruud van der Pas, Distinguished Engineer

Oracle Linux and Virtualization Engineering, Oracle, Santa Clara, CA, USA
