Enhancing task-based irregular OpenMP applications with the libKOMP runtime system

François Broquedis, MOAIS Team, Grenoble, France

Working in the MOAIS team

• Many research directions
 ‣ Parallel algorithms / scheduling
 ‣ Interactive computing and visualization
 ‣ Runtime systems for HPC platforms

• I’m mainly involved in the third one
 ‣ Task-based parallelism (kaapi runtime system)
 ‣ OpenMP as a means to express task-based parallelism (libKOMP runtime system)

A few things about me...

OUTLINE

1. A brief introduction to OpenMP task parallelism
2. Implementing the OpenMP tasking model
3. Adaptive loop scheduling for OpenMP
4. Data affinity and task-based applications
5. Current status from the HPC-GA project perspective

A longer introduction to OpenMP (in general)

What is OpenMP?

! De-facto standard Application Programming Interface (API) to write shared memory parallel applications in C, C++, and Fortran
! Consists of Compiler Directives, Runtime routines and Environment variables
! Specification maintained by the OpenMP Architecture Review Board (http://www.openmp.org)
! Version 3.1 has been released July 2011
– The upcoming 4.0 specifications will be released soon

RvdP, «An Overview of OpenMP»

OpenMP is widely supported by the industry as well as the academic community


Advantages of OpenMP

! Good performance and scalability
– If you do it right ....
! De-facto and mature standard
! An OpenMP program is portable
– Supported by a large number of compilers
! Requires little programming effort
! Allows the program to be parallelized incrementally


The OpenMP Execution Model

Fork and Join Model
[Diagram: the master thread forks a team of worker threads at each parallel region; the team synchronizes at the end of the region before the master thread continues]

Nested Parallelism

[Diagram: the master thread opens an outer 3-way parallel region; a nested parallel region makes execution 9-way parallel, then drops back to the outer 3-way parallel region]
Note: nesting level can be arbitrarily deep

The OpenMP Memory Model

! All threads have access to the same, globally shared memory
! Data in private memory is only accessible by the thread owning this memory
! No other thread sees the change(s)
! Data transfer is through shared memory and is 100% transparent to the application

[Diagram: threads T, each with its own private memory, connected to a common shared memory]

Data-Sharing Attributes

! In an OpenMP program, data needs to be «labeled»; essentially there are two basic types:
– Shared: there is only one instance of the data
  " Threads can read and write the data simultaneously, unless protected through a specific construct
  " All changes made are visible to all threads
    – But not necessarily immediately, unless enforced ......
– Private: each thread has a copy of the data
  " No other thread can access this data
  " Changes only visible to the thread owning the data
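To make the two basic types concrete, a minimal sketch (not from the original slides; sum and tmp are illustrative names):

#include <stdio.h>

int main (void)
{
  int sum = 0;                /* shared: a single instance, visible to all threads */
  #pragma omp parallel shared(sum)
  {
    int tmp = 1;              /* private: declared inside the region, one copy per thread */
    #pragma omp atomic        /* simultaneous writes to shared data must be protected */
    sum += tmp;
  }
  printf ("sum = %d\n", sum); /* ends up equal to the number of threads */
  return 0;
}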

The Worksharing Constructs

#pragma omp for { ...... }
#pragma omp sections { ...... }
#pragma omp single { ...... }

!$OMP DO ...... !$OMP END DO
!$OMP SECTIONS ...... !$OMP END SECTIONS
!$OMP SINGLE ...... !$OMP END SINGLE

! The work is distributed over the threads
! Must be enclosed in a parallel region
! Must be encountered by all threads in the team, or none at all
! No implied barrier on entry; implied barrier on exit (unless nowait is specified)
! A work-sharing construct does not launch any new threads
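As an illustration (not from the original slides), the three constructs inside one parallel region; work_on, prepare_part_a and prepare_part_b are hypothetical functions:

void work_on (int i);            /* hypothetical work functions */
void prepare_part_a (void);
void prepare_part_b (void);

void worksharing_demo (int n)
{
  int i;
  #pragma omp parallel
  {
    #pragma omp for
    for (i = 0; i < n; i++)      /* iterations distributed over the team */
      work_on (i);

    #pragma omp sections
    {
      #pragma omp section
      prepare_part_a ();         /* each section runs on exactly one thread */
      #pragma omp section
      prepare_part_b ();
    }

    #pragma omp single
    prepare_part_a ();           /* executed by one thread only */
  }
}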


The Schedule Clause

schedule ( static | dynamic | guided | auto [, chunk] )
schedule ( runtime )

static [, chunk]
! Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
! In absence of "chunk", each thread executes approx. N/P chunks for a loop of length N and P threads
  " Details are implementation defined
! Under certain conditions, the assignment of iterations to threads is the same across multiple loops in the same parallel region


Example Of A Static Schedule

A loop of length 16 using 4 threads

Thread        0            1            2            3
no chunk*     1-4          5-8          9-12         13-16
chunk = 2*    1-2, 9-10    3-4, 11-12   5-6, 13-14   7-8, 15-16

*) The precise distribution is implementation defined
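As a sketch (hypothetical arrays), the chunked row of the table corresponds to:

void scale_loop (double *x, double *y, double a)
{
  int i;
  /* With 4 threads, schedule(static, 2) deals out blocks of 2 iterations
     round-robin: thread 0 gets 1-2 and 9-10, thread 1 gets 3-4 and 11-12, ... */
  #pragma omp parallel for schedule(static, 2)
  for (i = 0; i < 16; i++)
    y[i] = a * x[i];
}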


Loop Workload Scheduling Choices

dynamic [, chunk]
! Fixed portions of work; size is controlled by the value of chunk
! When a thread finishes, it starts on the next portion of work

guided [, chunk]
! Same dynamic behavior as "dynamic", but size of the portion of work decreases exponentially

auto
! The compiler (or runtime system) decides what is best to use; choice could be implementation dependent

runtime
! Iteration scheduling scheme is set at runtime through environment variable OMP_SCHEDULE
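A sketch of deferring the choice to run time (heavy_computation is a hypothetical kernel); the policy is then picked at launch, for instance by exporting OMP_SCHEDULE="guided,4" before running:

double heavy_computation (double v);   /* hypothetical, irregular cost per element */

void process (int n, double *data)
{
  int i;
  /* The actual schedule is read from the OMP_SCHEDULE environment
     variable when the loop is reached. */
  #pragma omp parallel for schedule(runtime)
  for (i = 0; i < n; i++)
    data[i] = heavy_computation (data[i]);
}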


Experiment: 500 iterations, 4 threads

[Plot: thread ID (0-3) versus iteration number (0-500) for the guided, dynamic and static schedules]

A first example

f = 1.0;

for (i = 0; i < n; i++)
  z[i] = x[i] + y[i];

for (i = 0; i < m; i++)
  a[i] = b[i] + c[i];

...

scale = sum (a, 0, m) + sum (z, 0, n) + f;
...

A first example - v1

#pragma omp parallel default (none) \
        shared (z, x, y, a, b, c, n, m) private (f, i, scale)
{
  f = 1.0;                        /* executed by all the threads of the parallel region */

  #pragma omp for                 /* parallel loop (work is distributed) */
  for (i = 0; i < n; i++)
    z[i] = x[i] + y[i];

  #pragma omp for                 /* parallel loop (work is distributed) */
  for (i = 0; i < m; i++)
    a[i] = b[i] + c[i];

  ...

  /* executed by all the threads of the parallel region */
  scale = sum (a, 0, m) + sum (z, 0, n) + f;
  ...
} /* End of OpenMP parallel region */


A first example - v2

#pragma omp parallel default (none) \
        shared (z, x, y, a, b, c, n, m) private (f, i, scale)
{
  f = 1.0;

  #pragma omp for nowait
  for (i = 0; i < n; i++)
    z[i] = x[i] + y[i];

  #pragma omp for nowait
  for (i = 0; i < m; i++)
    a[i] = b[i] + c[i];

  ...

  #pragma omp barrier
  scale = sum (a, 0, m) + sum (z, 0, n) + f;
  ...
} /* End of OpenMP parallel region */

A first example - v3

#pragma omp parallel default (none) \
        shared (z, x, y, a, b, c, n, m) private (f, i, scale) \
        if (n > some_threshold && m > some_threshold)
{
  f = 1.0;

  #pragma omp for nowait
  for (i = 0; i < n; i++)
    z[i] = x[i] + y[i];

  #pragma omp for nowait
  for (i = 0; i < m; i++)
    a[i] = b[i] + c[i];

  ...

  #pragma omp barrier
  scale = sum (a, 0, m) + sum (z, 0, n) + f;
  ...
} /* End of OpenMP parallel region */

More processing units means massive parallelism

• Smaller chip designs mean bigger HPC platforms...
 ‣ More and more cores per chip
 ‣ More and more chips per node

• ... that can turn out to be embarrassing!
 ‣ How can I occupy all these cores?

• The programmer has to express fine-grain massive parallelism
 ‣ OpenMP has evolved to fill this need with the 3.0 standard introducing OpenMP tasks (2008)

The tasking concept in OpenMP

OpenMP tasking: Who does what and when? (Don’t ask why...)

• The developer specifies where the tasks are
 ‣ #pragma omp task construct

• The assumption is that all tasks can be executed independently

• When any thread encounters a task construct, an explicit task is generated
 ‣ Tasks can be nested

• Execution of explicitly generated tasks is assigned to one of the threads of the current team
 ‣ This is subject to the thread’s availability and thus could be immediate or deferred until later

• Completion of the task can be guaranteed using the taskwait synchronization construct

The Fibonacci example

long long fib (int n)
{
  long long x, y;
  if (n < 2) return n;

  #pragma omp task shared(x) firstprivate(n)
  x = fib(n - 1);

  #pragma omp task shared(y) firstprivate(n)
  y = fib(n - 2);

  #pragma omp taskwait
  return x + y;
}

int main (int argc, char **argv)
{
  long long par_res = 0;

  #pragma omp parallel
  {
    #pragma omp single
    par_res = fib(n);   /* n: the problem size, defined by the application */
  }

  printf ("Fibonacci result for %i is %lli\n", n, par_res);
  return EXIT_SUCCESS;
}

The work-stealing execution model

• Work-stealing = an efficient way of executing task-based parallelism
 ‣ Dynamic load balancing triggered by idle processing units
  ๏ Per-core queues of tasks
  ๏ Idle cores «steal» ready tasks from occupied queues
 ‣ Scheduling decisions
  ๏ Next task to execute? Core to steal from?
  ๏ Workers generate tasks

• Programming environments based on work-stealing
 ‣ Cilk, TBB

The way libKOMP executes tasks

• One «worker thread» per core
 ‣ Able to execute libKOMP fine-grain tasks
 ‣ Holds a queue of tasks
  ๏ Related to the sequential C stack of activation frames
  ๏ Last-in first-out data structure

• Task creation is cheap!
 ‣ Reduces to pushing a C function pointer + its arguments into the worker thread queue
 ‣ Recursive tasks are welcome!

• Work-stealing based scheduling (a sketch of the queue follows below)
 ‣ Cilk's work-first principle
 ‣ Work-stealing algorithm = plug-in
  ๏ Default: steal a task from a randomly chosen queue
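A minimal sketch of such a per-worker queue (data structure only, not libKOMP's actual code; a real implementation needs atomic operations so the owner's pop and a thief's steal can race safely):

typedef struct {
  void (*fn) (void *);   /* the task: a C function pointer... */
  void *args;            /* ...plus its arguments */
} task_t;

typedef struct {
  task_t buf[1024];      /* fixed-size toy buffer, no wraparound */
  int bottom;            /* thieves steal from this end (oldest tasks) */
  int top;               /* the owner pushes and pops at this end (LIFO) */
} queue_t;

void push (queue_t *q, task_t t)    /* owner: task creation */
{
  q->buf[q->top++] = t;
}

int pop (queue_t *q, task_t *t)     /* owner: last-in first-out */
{
  if (q->top == q->bottom) return 0;
  *t = q->buf[--q->top];
  return 1;
}

int steal (queue_t *q, task_t *t)   /* thief: take from the opposite end */
{
  if (q->top == q->bottom) return 0;
  *t = q->buf[q->bottom++];
  return 1;
}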

libKOMP: an OpenMP runtime system based on kaapi

• OpenMP applications generate kaapi tasks
 ‣ OpenMP tasks == kaapi tasks
 ‣ Parallel regions create num_threads kaapi tasks
  ๏ Nested parallel regions create recursive tasks
 ‣ Worker threads created at library initialization

• The libKOMP library
 ‣ Port of the libGOMP ABI on top of the kaapi runtime system
 ‣ Binary compatible with existing gcc-compiled OpenMP applications

[Software stack: OpenMP application compiled with GCC, over the common ABI, over libKOMP, over kaapi, over POSIX threads]

Performance evaluation with BOTS

• Barcelona OpenMP Task Suite
 ‣ A set of representative benchmarks to evaluate OpenMP tasks implementations

Name Arguments used Domain Summary

Alignment prot100.aa Dynamic programming Aligns sequences of proteins

FFT n=33,554,432 Spectral method Computes a Fast Fourier Transformation

Floorplan input.20 Optimization Computes the optimal placement of cells in a floorplan

NQueens n=14 Search Finds solutions of the N Queens problem

MultiSort n=33,554,432 Integer sorting Uses a mixture of sorting algorithms to sort a vector

SparseLU n=128 m=64 Sparse linear algebra Computes the LU factorization of a sparse matrix

Strassen n=8192 Dense linear algebra Computes a matrix multiply with Strassen's method

UTS medium.input Search Computes the number of nodes in an Unbalanced Tree

• Evaluation platform
 ‣ AMD48: 4x12 AMD Opteron cores

• Software
 ‣ gcc 4.6.2 + libGOMP
 ‣ gcc 4.6.2 + libKOMP
 ‣ icc 12.1.2 + Intel OpenMP runtime (KMP)

Running BOTS on the AMD48 platform

kernel      libGOMP   libKOMP   Intel
Alignment   38.8      40.0      37.0
FFT         0.5       12.2      12.0
Floorplan   27.6      32.7      29.2
NQueens     43.7      43.4      39.0
MultiSort   0.6       13.2      11.3
SparseLU    44.1      44.4      35.0
Strassen    20.8      22.4      20.5
UTS         0.9       25.3      15.0

Speed-Up of BOTS kernels on the 48 cores of the AMD48 platform


The problem of granularity

• OpenMP tasks = a good candidate to efficiently exploit multicore chips
 ‣ More and more cores per chip
 ‣ Fine-grain parallelism
  ๏ Tasks are lighter than threads

• Problem: creating the «right» number of tasks for a specific platform
 ‣ Too many tasks => scheduling overheads
 ‣ Not enough tasks => load imbalance
 ‣ Old problem for the OpenMP community!
  ๏ num_threads and nested parallel regions

[Plots: speed-up of FFT and MultiSort on 4 to 48 cores for libKOMP, libGOMP and Intel]

Adaptive OpenMP loop scheduling with libKOMP

• Adaptive tasks in XKaapi
 ‣ Adaptive tasks can be split at run time to create new tasks
 ‣ Provide a “splitter” function called when an idle core decides to steal some of the remaining computation to be performed by a task under execution

• Example of use: the way libKOMP executes OpenMP parallel loops
 ‣ Adaptive tasks for independent loops
  ๏ A task == a range of iterations to compute
 ‣ Execution model (a sketch of a splitter follows below)
  ๏ Initially, one task is in charge of the whole range: T1 : [0 - 15]
  ๏ The owning core consumes iterations from the head of the range: T1 : [1 - 15], T1 : [2 - 15], ...
  ๏ Idle cores post steal requests and trigger the “split” operation to generate new tasks
  ๏ e.g. split (T1) while T1 : [4 - 15] leaves T1 : [4 - 9] and creates T2 : [10 - 15]
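A minimal sketch of what a splitter can look like (illustrative types and names, not the XKaapi API; a real splitter also has to synchronize with the victim, which keeps extracting iterations during the steal):

typedef struct {
  int first, last;                   /* remaining iteration range [first, last] */
} range_task_t;

/* Called on a steal request: hand the upper half of the victim's
   remaining iterations to the thief as a new task. */
int splitter (range_task_t *victim, range_task_t *thief)
{
  int remaining = victim->last - victim->first + 1;
  if (remaining < 2)
    return 0;                        /* too little work left to split */
  int mid = victim->first + remaining / 2;
  thief->first = mid;                /* new task: upper half of the range */
  thief->last  = victim->last;
  victim->last = mid - 1;            /* the victim keeps the lower half */
  return 1;
}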

Performance evaluation: adaptive loops to enhance VTK filters [Mathias Ettinger, PhD student]

• Parallel version of the VTK visualization toolkit
 ‣ A framework to develop parallel applications for scientific visualization
 ‣ A pipeline of VTK «filters»
  ๏ A filter == a computation performed on a 2D/3D scene

• Parallelized filters
 ‣ Iso-Surface
  ๏ for (all points p in my scene)
      if (p meets some requirements)
        compute_smthg (p);
  ๏ irregular workload

Performance of the VTK contour filter on a 48-core AMD machine
[Plot: speed-up on 4 to 48 cores for libKOMP (adaptive) and OpenMP (static; dynamic with chunk sizes 32, 512, 1024 and 5300)]

The NUMA problem

Non-Uniform Memory Accesses
! The NUMA factor
! Write access = more traffic

[Diagram: four NUMA nodes (#0 to #3), each pairing a CPU with its local RAM]

• Memory affinity
! Applications run faster if accessing local data
! Need a careful distribution of threads and data
  ! To avoid NUMA penalties
  ! To reduce memory contention

Access to...   Local node   Neighbor node    Opposite node
Read           83 ns        98 ns (x1.18)    117 ns (x1.41)
Write          142 ns       177 ns (x1.25)   208 ns (x1.46)
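A common way to obtain such a distribution (an illustration, not from the slides) is first-touch allocation: on Linux, a page is physically placed on the NUMA node of the first thread that touches it, so initializing data in parallel spreads the pages over the nodes:

#include <stdlib.h>

double *alloc_first_touch (size_t n)
{
  double *v = malloc (n * sizeof *v);
  size_t i;
  /* Touch the pages from the same threads (and the same static
     distribution) that will later compute on them. */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; i++)
    v[i] = 0.0;
  return v;
}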

Different machines, different NUMA topologies

Results of the STREAM benchmark (memory bandwidth in MB/s) for two different shared-memory machines

Parallel programming and memory-bound problems

• Usually a matter of...
 ‣ ... binding memory on specific NUMA nodes ...
 ‣ ... and binding threads according to the memory mapping
  ๏ on the same NUMA node the data they access were allocated on

• What about task-based irregular applications?
 ‣ Task mapping will change dynamically
  ๏ Tasks are meant to move to balance the application workload
 ‣ Most common solution: the interleave policy (a sketch follows below)
  ๏ Distribute the memory pages of the data set over the whole machine to minimize the probability of memory contention
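One way to apply the interleave policy in code, as a sketch using the Linux libnuma library (the slides do not name a specific tool):

#include <stddef.h>
#include <numa.h>   /* link with -lnuma */

double *alloc_interleaved (size_t n)
{
  if (numa_available () < 0)
    return NULL;    /* no NUMA support on this machine */
  /* Spread the pages of the allocation round-robin over all NUMA
     nodes; release later with numa_free (ptr, size). */
  return numa_alloc_interleaved (n * sizeof (double));
}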

Controlling task execution on hierarchical architectures

• Let the programmer...
 ‣ Bind memory on specific nodes
 ‣ Settle tasks on «locality domains»
  ๏ Tasks settled on a locality domain can only be stolen by cores from the same locality domain
  ๏ If the stealing fails, the idle core extends its search scope to a more general (upper-level) locality domain
  ๏ Default locality domain: the whole machine
  ๏ Locality domains are inherited by recursive children tasks

• Example: kaapi_attr_set_locality_domain (&attr, NUMA_NODE_0); settles a task on a user-specified locality domain (NUMA node 0); a usage sketch follows below

[Diagram: a task tree, where «T → T» means «creates» or «is the father of», confined to one locality domain; cores share L3 caches attached to NUMA banks]
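A sketch of how settling a task might look in code; only kaapi_attr_set_locality_domain appears on the slides, the other calls and names here are hypothetical:

void spawn_on_node0 (void *data)
{
  kaapi_attr_t attr;                                    /* hypothetical attribute type */
  kaapi_attr_init (&attr);                              /* hypothetical */
  kaapi_attr_set_locality_domain (&attr, NUMA_NODE_0);  /* shown on the slides */
  kaapi_spawn (&attr, process_task, data);              /* hypothetical spawn call */
}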

4 - Data affinity for task-based applications - 53 vendredi 15 mars 13 Applying this approach to PMA sorting [Marie Durand, PhD student]

• Efficient data structure for sorting particules ๏ Packed Memory Array (PMA): Add «gaps» to the particle array to minimize memory movement induced by reordering

T • On-going work: Parallel PMA [Marie Durand, PhD student] T T ‣ Partition the PMA data structure to apply reordering in parallel T T T T ๏ recursively create tasks that will scan and sort a subpart of the locality domain locality domain partition (NUMA node 0) (NUMA node 1)

Early results with task/data affinities

Summary - The libKOMP runtime system

• Efficient implementation of OpenMP independent tasks...
 ‣ Low runtime-related overheads
  ๏ Cheap task creation
 ‣ Provides support for recursive tasks
 ‣ Performance comparable to «mainstream» OpenMP runtime systems
 ‣ Easy to use
  ๏ Binary compatible with gcc-compiled OpenMP applications

• ... that comes with original features
 ‣ Adaptive loop scheduling
 ‣ Locality domains for hierarchical architectures

• We started collaborations with people from the HPC-GA project
 ‣ Parallelization of the Hou10ni application (Julien Diaz, Giannis Ashiotis)
 ‣ Early experiments with the Ondes3D application (Fabrice Dupros)
 ‣ No physicists were harmed during these experiments!

Newcomers are welcome!

• Remember libKOMP is a (young) research project
 ‣ Lots of bugs fixed thanks to the Hou10ni + Ondes3D experience
  ๏ A new software distribution including all these fixes is on its way
 ‣ The «task part» of OpenMP is usually easier for us to handle than the «thread part»
  ๏ The underlying task-based runtime system was not designed to mimic threads

• The approach we recommend
 ‣ Develop and tune your application with standard OpenMP
  ๏ We can give you some advice if you’re not familiar with it
 ‣ In a second step, experiment with libKOMP’s original features
  ๏ If it does not behave as expected, you at least get an efficient standard OpenMP version of your app! :-)

Thank you for your attention!

Applying this approach to PMA sorting [Marie Durand, PhD student]

• Studying particle interactions for physics simulations
 ‣ Short-range interactions between particles
 ‣ Only some of them will move
 ‣ Many interactive-computing applications iterate over all the particles of a 3D domain


• Cache-oblivious data structures for moving particles ‣ Group particles by cells ‣ Keep this set of cells sorted ๏ Z-curve ordering, maximize locality

• Efficient data structure for sorting particles
 ๏ Packed Memory Array (PMA): add «gaps» to the particle array to minimize memory movement induced by reordering
 ๏ [Diagram: inserting into a gapless array moves 5 elements; with gaps, only 2]

• On-going work: Parallel PMA [Marie Durand, PhD student]
 ‣ Partition the PMA data structure to apply reordering in parallel
  ๏ Recursively create tasks that will scan and sort a subpart of the partition (a toy sketch of the gap idea follows below)
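A toy sketch of the gap idea (illustration only, not the PMA implementation discussed here; a real PMA monitors gap density and rebalances):

#define EMPTY (-1)   /* gap marker; assumes keys are non-negative */

/* Insert key at position pos by shifting elements toward the nearest
   gap on its right: with enough gaps, only a few elements move. */
void pma_insert (int *a, int n, int pos, int key)
{
  int gap = pos;
  while (gap < n && a[gap] != EMPTY)
    gap++;                            /* locate the first gap at or after pos */
  if (gap == n)
    return;                           /* no gap left: a real PMA would rebalance */
  while (gap > pos) {                 /* shift the run [pos, gap) right by one */
    a[gap] = a[gap - 1];
    gap--;
  }
  a[pos] = key;
}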

OpenMP tasks and synchronizations

• OpenMP leaves the application programmer in charge of task synchronization
 ‣ #pragma omp taskwait construct
 ‣ Waits for the completion of the child tasks of the current task
  ๏ Here, wait for A and B before executing C and D

• How to express the fact that task C can start executing as soon as task A is complete?

void data_flow_example (void)
{
  type x, y;

  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task
    compute_smthg (&x);   /* Task A */
    #pragma omp task
    compute_smthg (&y);   /* Task B */

    #pragma omp taskwait

    #pragma omp task
    process_results (x);  /* Task C */
    #pragma omp task
    process_results (y);  /* Task D */
  }
}

Data-flow parallelism with libKOMP

• Data-flow execution model
 ‣ The application programmer expresses dependencies between tasks
  ๏ Declare variables / memory areas the task will use as input and/or produce as output
   - Access modes for variables (read, write)
 ‣ The runtime system computes a data-flow graph to decide whether a task is ready to be executed or not
  ๏ Synchronization is no longer the application programmer’s burden!

• libKOMP’s execution model for data-flow parallelism
 ‣ A task is ready for execution when all its input variables are ready
 ‣ A variable is ready as soon as it’s been written

libKOMP’s API for data-flow programming

• Keywords defining variable access modes
 ‣ read : variable is read as input
 ‣ write : variable is written as output
 ‣ readwrite : variable is both read as input and written as output
 ‣ specific access modes for reductions

• Compiled thanks to KaCC
 ‣ Based on the ROSE compiling framework
 ‣ Support for C/C++ applications
 ‣ #pragma kaapi annotations

OpenMP-like example of data-flow programming with libKOMP

void data_flow_example (void)
{
  type x, y;

  #pragma kaapi parallel
  {
    #pragma kaapi task write(x)
    compute_smthg (&x);   /* Task A */
    #pragma kaapi task write(y)
    compute_smthg (&y);   /* Task B */

    #pragma kaapi task read(x)
    process_results (x);  /* Task C */
    #pragma kaapi task read(y)
    process_results (y);  /* Task D */
  }
}

Valid execution orders:
- A, B, C, D
- A, B, D, C
- B, A, C, D
- B, A, D, C
- B, D, A, C (*)
- A, C, B, D (*)
(*) invalid on the taskwait version

Invalid execution orders:
- C before A
- D before B


Enhance the EUROPLEXUS application with data-flow tasks

• The EUROPLEXUS application
 ‣ Industrial code, originally developed by CEA
  ๏ About 740K lines of Fortran 77/90, parallelized with MPI for both distributed and shared memory
  ๏ A partial OpenMP support exists but is not used in practice, as its performance on NUMA machines is known to be too limited; the goal of this work was to propose solutions to improve that aspect of EUROPLEXUS
 ‣ Analyzes the propagation of flow variations in presence of obstacles
 ‣ Two data sets provided by CEA
  ๏ Tube crash: a tube composed of 2080 elements; a small mesh mainly used for testing purposes (a sequential run simulating 10^-3 seconds lasts 1 minute)
  ๏ Meppen: a missile composed of 30598 elements hitting a concrete slab; fits better typical real-life simulations (the same sequential simulation lasts 33 minutes)
 ‣ A profiling of a sequential run on the Meppen simulation shows that 80% of the execution time is spent on the internal forces and links identification loops
 ‣ Sparse Cholesky factorization on a skyline matrix

[Figures from the accompanying paper: the scheme of the integration loop of EUROPLEXUS; the Tube crash simulation; the Meppen simulation]

• Two different versions of the application
 ‣ Standard OpenMP 3.0 version
  ๏ Independent tasks with explicit synchronizations
 ‣ libKOMP-powered data-flow version
  ๏ Expressing variable access modes
  ๏ Let the runtime system decide whether a task is ready or not

Evaluation of independent tasks vs data-flow tasks on the Europlexus application

[Plot: speedup (Tp/Tseq) versus core count, from 0 to 45 cores, for the libKOMP data-flow version and the standard OpenMP version, compared to the ideal speedup]
