Enhancing task-based irregular OpenMP applications with the libKOMP runtime system

François Broquedis, MOAIS Team, Grenoble, France

Working in the MOAIS team

• Many research directions
 ‣ Parallel algorithms / scheduling
 ‣ Interactive computing and visualization
 ‣ Runtime systems for HPC platforms

• I’m mainly involved in the third one
 ‣ Task-based parallelism (kaapi runtime system)
 ‣ OpenMP as a means to express task-based parallelism (libKOMP runtime system)

A few things about me...

OUTLINE

1. A brief introduction to OpenMP task parallelism
2. Implementing the OpenMP tasking model
3. Adaptive loop scheduling for OpenMP
4. Data affinity and task-based applications
5. Current status from the HPC-GA project perspective

A longer introduction to OpenMP (in general)

What is OpenMP?

! De-facto standard Application Programming Interface (API) to write shared memory parallel applications in C, C++, and Fortran
! Consists of Compiler Directives, Runtime routines and Environment variables
! Specification maintained by the OpenMP Architecture Review Board (http://www.openmp.org)
! Version 3.1 has been released July 2011
– The upcoming 4.0 specifications will be released soon

RvdP, «An Overview of OpenMP»

OpenMP is widely supported by the industry as well as the academic community


Advantages of OpenMP

! Good performance and scalability
– If you do it right ....
! De-facto and mature standard
! An OpenMP program is portable
– Supported by a large number of compilers
! Requires little programming effort
! Allows the program to be parallelized incrementally


The OpenMP Execution Model

Fork and Join Model
[Diagram: the master thread forks a team of worker threads at each parallel region; the team synchronizes at the end of the region before the master thread continues]

Nested Parallelism

[Diagram: the master thread opens an outer 3-way parallel region; a nested parallel region makes execution 9-way parallel, then drops back to the outer 3-way parallel region]
Note: nesting level can be arbitrarily deep

The OpenMP Memory Model

! All threads have access to the same, globally shared memory
! Data in private memory is only accessible by the thread owning this memory
! No other thread sees the change(s)
! Data transfer is through shared memory and is 100% transparent to the application

[Diagram: threads T, each with its own private memory, connected to a common shared memory]

Data-Sharing Attributes

! In an OpenMP program, data needs to be «labeled»; essentially there are two basic types:
– Shared: there is only one instance of the data
  " Threads can read and write the data simultaneously, unless protected through a specific construct
  " All changes made are visible to all threads
    – But not necessarily immediately, unless enforced ......
– Private: each thread has a copy of the data
  " No other thread can access this data
  " Changes only visible to the thread owning the data
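To make the two basic types concrete, a minimal sketch (not from the original slides; sum and tmp are illustrative names):

#include <stdio.h>

int main (void)
{
  int sum = 0;                /* shared: a single instance, visible to all threads */
  #pragma omp parallel shared(sum)
  {
    int tmp = 1;              /* private: declared inside the region, one copy per thread */
    #pragma omp atomic        /* simultaneous writes to shared data must be protected */
    sum += tmp;
  }
  printf ("sum = %d\n", sum); /* ends up equal to the number of threads */
  return 0;
}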

The Worksharing Constructs

#pragma omp for { ...... }
#pragma omp sections { ...... }
#pragma omp single { ...... }

!$OMP DO ...... !$OMP END DO
!$OMP SECTIONS ...... !$OMP END SECTIONS
!$OMP SINGLE ...... !$OMP END SINGLE

! The work is distributed over the threads
! Must be enclosed in a parallel region
! Must be encountered by all threads in the team, or none at all
! No implied barrier on entry; implied barrier on exit (unless nowait is specified)
! A work-sharing construct does not launch any new threads
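As an illustration (not from the original slides), the three constructs inside one parallel region; work_on, prepare_part_a and prepare_part_b are hypothetical functions:

void work_on (int i);            /* hypothetical work functions */
void prepare_part_a (void);
void prepare_part_b (void);

void worksharing_demo (int n)
{
  int i;
  #pragma omp parallel
  {
    #pragma omp for
    for (i = 0; i < n; i++)      /* iterations distributed over the team */
      work_on (i);

    #pragma omp sections
    {
      #pragma omp section
      prepare_part_a ();         /* each section runs on exactly one thread */
      #pragma omp section
      prepare_part_b ();
    }

    #pragma omp single
    prepare_part_a ();           /* executed by one thread only */
  }
}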


The Schedule Clause

schedule ( static | dynamic | guided | auto [, chunk] )
schedule ( runtime )

static [, chunk]
! Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
! In absence of "chunk", each thread executes approx. N/P chunks for a loop of length N and P threads
  " Details are implementation defined
! Under certain conditions, the assignment of iterations to threads is the same across multiple loops in the same parallel region


Example Of A Static Schedule

A loop of length 16 using 4 threads

Thread        0            1            2            3
no chunk*     1-4          5-8          9-12         13-16
chunk = 2*    1-2, 9-10    3-4, 11-12   5-6, 13-14   7-8, 15-16

*) The precise distribution is implementation defined
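As a sketch (hypothetical arrays), the chunked row of the table corresponds to:

void scale_loop (double *x, double *y, double a)
{
  int i;
  /* With 4 threads, schedule(static, 2) deals out blocks of 2 iterations
     round-robin: thread 0 gets 1-2 and 9-10, thread 1 gets 3-4 and 11-12, ... */
  #pragma omp parallel for schedule(static, 2)
  for (i = 0; i < 16; i++)
    y[i] = a * x[i];
}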


Loop Workload Scheduling Choices

dynamic [, chunk]
! Fixed portions of work; size is controlled by the value of chunk
! When a thread finishes, it starts on the next portion of work

guided [, chunk]
! Same dynamic behavior as "dynamic", but size of the portion of work decreases exponentially

auto
! The compiler (or runtime system) decides what is best to use; choice could be implementation dependent

runtime
! Iteration scheduling scheme is set at runtime through environment variable OMP_SCHEDULE
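A sketch of deferring the choice to run time (heavy_computation is a hypothetical kernel); the policy is then picked at launch, for instance by exporting OMP_SCHEDULE="guided,4" before running:

double heavy_computation (double v);   /* hypothetical, irregular cost per element */

void process (int n, double *data)
{
  int i;
  /* The actual schedule is read from the OMP_SCHEDULE environment
     variable when the loop is reached. */
  #pragma omp parallel for schedule(runtime)
  for (i = 0; i < n; i++)
    data[i] = heavy_computation (data[i]);
}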


Experiment: 500 iterations, 4 threads

[Plot: thread ID (0-3) versus iteration number (0-500) for the guided, dynamic and static schedules]

A first example

f = 1.0;

for (i = 0; i < n; i++)
  z[i] = x[i] + y[i];

for (i = 0; i < m; i++)
  a[i] = b[i] + c[i];

...

scale = sum (a, 0, m) + sum (z, 0, n) + f;
...

A first example - v1

#pragma omp parallel default (none) \
        shared (z, x, y, a, b, c, n, m) private (f, i, scale)
{
  f = 1.0;                        /* executed by all the threads of the parallel region */

  #pragma omp for                 /* parallel loop (work is distributed) */
  for (i = 0; i < n; i++)
    z[i] = x[i] + y[i];

  #pragma omp for                 /* parallel loop (work is distributed) */
  for (i = 0; i < m; i++)
    a[i] = b[i] + c[i];

  ...

  /* executed by all the threads of the parallel region */
  scale = sum (a, 0, m) + sum (z, 0, n) + f;
  ...
} /* End of OpenMP parallel region */


A first example - v2

#pragma omp parallel default (none) \
        shared (z, x, y, a, b, c, n, m) private (f, i, scale)
{
  f = 1.0;

  #pragma omp for nowait
  for (i = 0; i < n; i++)
    z[i] = x[i] + y[i];

  #pragma omp for nowait
  for (i = 0; i < m; i++)
    a[i] = b[i] + c[i];

  ...

  #pragma omp barrier
  scale = sum (a, 0, m) + sum (z, 0, n) + f;
  ...
} /* End of OpenMP parallel region */

A first example - v3

#pragma omp parallel default (none) \
        shared (z, x, y, a, b, c, n, m) private (f, i, scale) \
        if (n > some_threshold && m > some_threshold)
{
  f = 1.0;

  #pragma omp for nowait
  for (i = 0; i < n; i++)
    z[i] = x[i] + y[i];

  #pragma omp for nowait
  for (i = 0; i < m; i++)
    a[i] = b[i] + c[i];

  ...

  #pragma omp barrier
  scale = sum (a, 0, m) + sum (z, 0, n) + f;
  ...
} /* End of OpenMP parallel region */

More processing units means massive parallelism

• Smaller chip designs mean bigger HPC platforms...
 ‣ More and more cores per chip
 ‣ More and more chips per node

• ... that can turn out to be embarrassing!
 ‣ How can I occupy all these cores?

• The programmer has to express fine-grain massive parallelism
 ‣ OpenMP has evolved to fill this need with the 3.0 standard introducing OpenMP tasks (2008)

The tasking concept in OpenMP

OpenMP tasking: Who does what and when? (Don’t ask why...)

• The developer specifies where the tasks are
 ‣ #pragma omp task construct

• The assumption is that all tasks can be executed independently

• When any thread encounters a task construct, an explicit task is generated
 ‣ Tasks can be nested

• Execution of explicitly generated tasks is assigned to one of the threads of the current team
 ‣ This is subject to the thread’s availability and thus could be immediate or deferred until later

• Completion of the task can be guaranteed using the taskwait synchronization construct

The Fibonacci example

long long fib (int n)
{
  long long x, y;
  if (n < 2) return n;

  #pragma omp task shared(x) firstprivate(n)
  x = fib(n - 1);

  #pragma omp task shared(y) firstprivate(n)
  y = fib(n - 2);

  #pragma omp taskwait
  return x + y;
}

int main (int argc, char **argv)
{
  long long par_res = 0;

  #pragma omp parallel
  {
    #pragma omp single
    par_res = fib(n);   /* n: the problem size, defined by the application */
  }

  printf ("Fibonacci result for %i is %lli\n", n, par_res);
  return EXIT_SUCCESS;
}

The work-stealing execution model

• Work-stealing = an efficient way of executing task-based parallelism
 ‣ Dynamic load balancing triggered by idle processing units
  ๏ Per-core queues of tasks
  ๏ Idle cores «steal» ready tasks from occupied queues
 ‣ Scheduling decisions
  ๏ Next task to execute? Core to steal from?
  ๏ Workers generate tasks

• Programming environments based on work-stealing
 ‣ Cilk, TBB

The way libKOMP executes tasks

• One «worker thread» per core
 ‣ Able to execute libKOMP fine-grain tasks
 ‣ Holds a queue of tasks
  ๏ Related to the sequential C stack of activation frames
  ๏ Last-in first-out data structure

• Task creation is cheap!
 ‣ Reduces to pushing a C function pointer + its arguments into the worker thread queue
 ‣ Recursive tasks are welcome!

• Work-stealing based scheduling (a sketch of the queue follows below)
 ‣ Cilk's work-first principle
 ‣ Work-stealing algorithm = plug-in
  ๏ Default: steal a task from a randomly chosen queue
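A minimal sketch of such a per-worker queue (data structure only, not libKOMP's actual code; a real implementation needs atomic operations so the owner's pop and a thief's steal can race safely):

typedef struct {
  void (*fn) (void *);   /* the task: a C function pointer... */
  void *args;            /* ...plus its arguments */
} task_t;

typedef struct {
  task_t buf[1024];      /* fixed-size toy buffer, no wraparound */
  int bottom;            /* thieves steal from this end (oldest tasks) */
  int top;               /* the owner pushes and pops at this end (LIFO) */
} queue_t;

void push (queue_t *q, task_t t)    /* owner: task creation */
{
  q->buf[q->top++] = t;
}

int pop (queue_t *q, task_t *t)     /* owner: last-in first-out */
{
  if (q->top == q->bottom) return 0;
  *t = q->buf[--q->top];
  return 1;
}

int steal (queue_t *q, task_t *t)   /* thief: take from the opposite end */
{
  if (q->top == q->bottom) return 0;
  *t = q->buf[q->bottom++];
  return 1;
}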

libKOMP: an OpenMP runtime system based on kaapi

• OpenMP applications generate kaapi tasks
 ‣ OpenMP tasks == kaapi tasks
 ‣ Parallel regions create num_threads kaapi tasks
  ๏ Nested parallel regions create recursive tasks
 ‣ Worker threads created at library initialization

• The libKOMP library
 ‣ Port of the libGOMP ABI on top of the kaapi runtime system
 ‣ Binary compatible with existing gcc-compiled OpenMP applications

[Software stack: OpenMP application compiled with GCC, over the common ABI, over libKOMP, over kaapi, over POSIX threads]

Performance evaluation with BOTS

• Barcelona OpenMP Task Suite
 ‣ A set of representative benchmarks to evaluate OpenMP tasks implementations

Name Arguments used Domain Summary

Alignment prot100.aa Dynamic programming Aligns sequences of proteins

FFT n=33,554,432 Spectral method Computes a Fast Fourier Transformation

Floorplan input.20 Optimization Computes the optimal placement of cells in a floorplan

NQueens n=14 Search Finds solutions of the N Queens problem

MultiSort n=33,554,432 Integer sorting Uses a mixture of sorting algorithms to sort a vector

SparseLU n=128 m=64 Sparse linear algebra Computes the LU factorization of a sparse matrix

Strassen n=8192 Dense linear algebra Computes a matrix multiply with Strassen's method

UTS medium.input Search Computes the number of nodes in an Unbalanced Tree

• Evaluation platform
 ‣ AMD48: 4x12 AMD Opteron cores

• Software
 ‣ gcc 4.6.2 + libGOMP
 ‣ gcc 4.6.2 + libKOMP
 ‣ icc 12.1.2 + Intel OpenMP runtime (KMP)

Running BOTS on the AMD48 platform

kernel      libGOMP   libKOMP   Intel
Alignment   38.8      40.0      37.0
FFT         0.5       12.2      12.0
Floorplan   27.6      32.7      29.2
NQueens     43.7      43.4      39.0
MultiSort   0.6       13.2      11.3
SparseLU    44.1      44.4      35.0
Strassen    20.8      22.4      20.5
UTS         0.9       25.3      15.0

Speed-Up of BOTS kernels on the 48 cores of the AMD48 platform


The problem of granularity

• OpenMP tasks = a good candidate to efficiently exploit multicore chips
 ‣ More and more cores per chip
 ‣ Fine-grain parallelism
  ๏ Tasks are lighter than threads

• Problem: creating the «right» number of tasks for a specific platform
 ‣ Too many tasks => scheduling overheads
 ‣ Not enough tasks => load imbalance
 ‣ Old problem for the OpenMP community!
  ๏ num_threads and nested parallel regions

[Plots: speed-up of FFT and MultiSort on 4 to 48 cores for libKOMP, libGOMP and Intel]

Adaptive OpenMP loop scheduling with libKOMP

• Adaptive tasks in XKaapi
 ‣ Adaptive tasks can be split at run time to create new tasks
 ‣ Provide a “splitter” function called when an idle core decides to steal some of the remaining computation to be performed by a task under execution

• Example of use: the way libKOMP executes OpenMP parallel loops
 ‣ Adaptive tasks for independent loops
  ๏ A task == a range of iterations to compute
 ‣ Execution model (a sketch of a splitter follows below)
  ๏ Initially, one task is in charge of the whole range: T1 : [0 - 15]
  ๏ The owning core consumes iterations from the head of the range: T1 : [1 - 15], T1 : [2 - 15], ...
  ๏ Idle cores post steal requests and trigger the “split” operation to generate new tasks
  ๏ e.g. split (T1) while T1 : [4 - 15] leaves T1 : [4 - 9] and creates T2 : [10 - 15]
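A minimal sketch of what a splitter can look like (illustrative types and names, not the XKaapi API; a real splitter also has to synchronize with the victim, which keeps extracting iterations during the steal):

typedef struct {
  int first, last;                   /* remaining iteration range [first, last] */
} range_task_t;

/* Called on a steal request: hand the upper half of the victim's
   remaining iterations to the thief as a new task. */
int splitter (range_task_t *victim, range_task_t *thief)
{
  int remaining = victim->last - victim->first + 1;
  if (remaining < 2)
    return 0;                        /* too little work left to split */
  int mid = victim->first + remaining / 2;
  thief->first = mid;                /* new task: upper half of the range */
  thief->last  = victim->last;
  victim->last = mid - 1;            /* the victim keeps the lower half */
  return 1;
}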

Performance evaluation: adaptive loops to enhance VTK filters [Mathias Ettinger, PhD student]

• Parallel version of the VTK visualization toolkit
 ‣ A framework to develop parallel applications for scientific visualization
 ‣ A pipeline of VTK «filters»
  ๏ A filter == a computation performed on a 2D/3D scene

• Parallelized filters
 ‣ Iso-Surface
  ๏ for (all points p in my scene)
      if (p meets some requirements)
        compute_smthg (p);
  ๏ irregular workload

Performance of the VTK contour filter on a 48-core AMD machine
[Plot: speed-up on 4 to 48 cores for libKOMP (adaptive) and OpenMP (static; dynamic with chunk sizes 32, 512, 1024 and 5300)]

The NUMA problem

Non-Uniform Memory Accesses
! The NUMA factor
! Write access = more traffic

[Diagram: four NUMA nodes (#0 to #3), each pairing a CPU with its local RAM]

• Memory affinity
! Applications run faster if accessing local data
! Need a careful distribution of threads and data
  ! To avoid NUMA penalties
  ! To reduce memory contention

Access to...   Local node   Neighbor node    Opposite node
Read           83 ns        98 ns (x1.18)    117 ns (x1.41)
Write          142 ns       177 ns (x1.25)   208 ns (x1.46)
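A common way to obtain such a distribution (an illustration, not from the slides) is first-touch allocation: on Linux, a page is physically placed on the NUMA node of the first thread that touches it, so initializing data in parallel spreads the pages over the nodes:

#include <stdlib.h>

double *alloc_first_touch (size_t n)
{
  double *v = malloc (n * sizeof *v);
  size_t i;
  /* Touch the pages from the same threads (and the same static
     distribution) that will later compute on them. */
  #pragma omp parallel for schedule(static)
  for (i = 0; i < n; i++)
    v[i] = 0.0;
  return v;
}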

Different machines, different NUMA topologies

Results of the STREAM benchmark (memory bandwidth in MB/s) for two different shared-memory machines

Parallel programming and memory-bound problems

• Usually a matter of...
 ‣ ... binding memory on specific NUMA nodes ...
 ‣ ... and binding threads according to the memory mapping
  ๏ on the same NUMA node the data they access were allocated on

• What about task-based irregular applications?
 ‣ Task mapping will change dynamically
  ๏ Tasks are meant to move to balance the application workload
 ‣ Most common solution: the interleave policy (a sketch follows below)
  ๏ Distribute the memory pages of the data set over the whole machine to minimize the probability of memory contention
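One way to apply the interleave policy in code, as a sketch using the Linux libnuma library (the slides do not name a specific tool):

#include <stddef.h>
#include <numa.h>   /* link with -lnuma */

double *alloc_interleaved (size_t n)
{
  if (numa_available () < 0)
    return NULL;    /* no NUMA support on this machine */
  /* Spread the pages of the allocation round-robin over all NUMA
     nodes; release later with numa_free (ptr, size). */
  return numa_alloc_interleaved (n * sizeof (double));
}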

Controlling task execution on hierarchical architectures

• Let the programmer...
 ‣ Bind memory on specific nodes
 ‣ Settle tasks on «locality domains»
  ๏ Tasks settled on a locality domain can only be stolen by cores from the same locality domain
  ๏ If the stealing fails, the idle core extends its search scope to a more general (upper-level) locality domain
  ๏ Default locality domain: the whole machine
  ๏ Locality domains are inherited by recursive children tasks

• Example: kaapi_attr_set_locality_domain (&attr, NUMA_NODE_0); settles a task on a user-specified locality domain (NUMA node 0); a usage sketch follows below

[Diagram: a task tree, where «T → T» means «creates» or «is the father of», confined to one locality domain; cores share L3 caches attached to NUMA banks]
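A sketch of how settling a task might look in code; only kaapi_attr_set_locality_domain appears on the slides, the other calls and names here are hypothetical:

void spawn_on_node0 (void *data)
{
  kaapi_attr_t attr;                                    /* hypothetical attribute type */
  kaapi_attr_init (&attr);                              /* hypothetical */
  kaapi_attr_set_locality_domain (&attr, NUMA_NODE_0);  /* shown on the slides */
  kaapi_spawn (&attr, process_task, data);              /* hypothetical spawn call */
}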

4 - Data affinity for task-based applications - 53 vendredi 15 mars 13 Applying this approach to PMA sorting [Marie Durand, PhD student]

• Efficient data structure for sorting particules ๏ Packed Memory Array (PMA): Add «gaps» to the particle array to minimize memory movement induced by reordering

T • On-going work: Parallel PMA [Marie Durand, PhD student] T T ‣ Partition the PMA data structure to apply reordering in parallel T T T T ๏ recursively create tasks that will scan and sort a subpart of the locality domain locality domain partition (NUMA node 0) (NUMA node 1)

Early results with task/data affinities

Summary - The libKOMP runtime system

• Efficient implementation of OpenMP independent tasks...
 ‣ Low runtime-related overheads
  ๏ Cheap task creation
 ‣ Provides support for recursive tasks
 ‣ Performance comparable to «mainstream» OpenMP runtime systems
 ‣ Easy to use
  ๏ Binary compatible with gcc-compiled OpenMP applications

• ... that comes with original features
 ‣ Adaptive loop scheduling
 ‣ Locality domains for hierarchical architectures

• We started collaborations with people from the HPC-GA project
 ‣ Parallelization of the Hou10ni application (Julien Diaz, Giannis Ashiotis)
 ‣ Early experiments with the Ondes3D application (Fabrice Dupros)
 ‣ No physicists were harmed during these experiments!

Newcomers are welcome!

• Remember libKOMP is a (young) research project
 ‣ Lots of bugs fixed thanks to the Hou10ni + Ondes3D experience
  ๏ A new software distribution including all these fixes is on its way
 ‣ The «task part» of OpenMP is usually easier for us to handle than the «thread part»
  ๏ The underlying task-based runtime system was not designed to mimic threads

• The approach we recommend
 ‣ Develop and tune your application with standard OpenMP
  ๏ We can give you some advice if you’re not familiar with it
 ‣ In a second step, experiment with libKOMP’s original features
  ๏ If it does not behave as expected, you at least get an efficient standard OpenMP version of your app! :-)

Thank you for your attention!

Applying this approach to PMA sorting [Marie Durand, PhD student]

• Studying particle interactions for physics simulations
 ‣ Short-range interactions between particles
 ‣ Only some of them will move
 ‣ Many interactive-computing applications iterate over all the particles of a 3D domain


• Cache-oblivious data structures for moving particles ‣ Group particles by cells ‣ Keep this set of cells sorted ๏ Z-curve ordering, maximize locality

• Efficient data structure for sorting particles
 ๏ Packed Memory Array (PMA): add «gaps» to the particle array to minimize memory movement induced by reordering
 ๏ [Diagram: inserting into a gapless array moves 5 elements; with gaps, only 2]

• On-going work: Parallel PMA [Marie Durand, PhD student]
 ‣ Partition the PMA data structure to apply reordering in parallel
  ๏ Recursively create tasks that will scan and sort a subpart of the partition (a toy sketch of the gap idea follows below)
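A toy sketch of the gap idea (illustration only, not the PMA implementation discussed here; a real PMA monitors gap density and rebalances):

#define EMPTY (-1)   /* gap marker; assumes keys are non-negative */

/* Insert key at position pos by shifting elements toward the nearest
   gap on its right: with enough gaps, only a few elements move. */
void pma_insert (int *a, int n, int pos, int key)
{
  int gap = pos;
  while (gap < n && a[gap] != EMPTY)
    gap++;                            /* locate the first gap at or after pos */
  if (gap == n)
    return;                           /* no gap left: a real PMA would rebalance */
  while (gap > pos) {                 /* shift the run [pos, gap) right by one */
    a[gap] = a[gap - 1];
    gap--;
  }
  a[pos] = key;
}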

OpenMP tasks and synchronizations

• OpenMP leaves the application programmer in charge of task synchronization
 ‣ #pragma omp taskwait construct
 ‣ Waits for the completion of the child tasks of the current task
  ๏ Here, wait for A and B before executing C and D

• How to express the fact that task C can start executing as soon as task A is complete?

void data_flow_example (void)
{
  type x, y;

  #pragma omp parallel
  #pragma omp single
  {
    #pragma omp task
    compute_smthg (&x);   /* Task A */
    #pragma omp task
    compute_smthg (&y);   /* Task B */

    #pragma omp taskwait

    #pragma omp task
    process_results (x);  /* Task C */
    #pragma omp task
    process_results (y);  /* Task D */
  }
}

Data-flow parallelism with libKOMP

• Data-flow execution model
 ‣ The application programmer expresses dependencies between tasks
  ๏ Declare variables / memory areas the task will use as input and/or produce as output
   - Access modes for variables (read, write)
 ‣ The runtime system computes a data-flow graph to decide whether a task is ready to be executed or not
  ๏ Synchronization is no longer the application programmer’s burden!

• libKOMP’s execution model for data-flow parallelism
 ‣ A task is ready for execution when all its input variables are ready
 ‣ A variable is ready as soon as it’s been written

libKOMP’s API for data-flow programming

• Keywords defining variable access modes
 ‣ read : variable is read as input
 ‣ write : variable is written as output
 ‣ readwrite : variable is both read as input and written as output
 ‣ specific access modes for reductions

• Compiled thanks to KaCC
 ‣ Based on the ROSE compiling framework
 ‣ Support for C/C++ applications
 ‣ #pragma kaapi annotations

OpenMP-like example of data-flow programming with libKOMP

void data_flow_example (void)
{
  type x, y;

  #pragma kaapi parallel
  {
    #pragma kaapi task write(x)
    compute_smthg (&x);   /* Task A */
    #pragma kaapi task write(y)
    compute_smthg (&y);   /* Task B */

    #pragma kaapi task read(x)
    process_results (x);  /* Task C */
    #pragma kaapi task read(y)
    process_results (y);  /* Task D */
  }
}

Valid execution orders:
- A, B, C, D
- A, B, D, C
- B, A, C, D
- B, A, D, C
- B, D, A, C (*)
- A, C, B, D (*)
(*) invalid on the taskwait version

Invalid execution orders:
- C before A
- D before B


Enhance the EUROPLEXUS application with data-flow tasks

• The EUROPLEXUS application
 ‣ Industrial code, originally developed by CEA
  ๏ About 740K lines of Fortran 77/90, parallelized with MPI for both distributed and shared memory
  ๏ A partial OpenMP support exists but is not used in practice, as its performance on NUMA machines is known to be too limited; the goal of this work was to propose solutions to improve that aspect of EUROPLEXUS
 ‣ Analyzes the propagation of flow variations in presence of obstacles
 ‣ Two data sets provided by CEA
  ๏ Tube crash: a tube composed of 2080 elements; a small mesh mainly used for testing purposes (a sequential run simulating 10^-3 seconds lasts 1 minute)
  ๏ Meppen: a missile composed of 30598 elements hitting a concrete slab; fits better typical real-life simulations (the same sequential simulation lasts 33 minutes)
 ‣ A profiling of a sequential run on the Meppen simulation shows that 80% of the execution time is spent on the internal forces and links identification loops
 ‣ Sparse Cholesky factorization on a skyline matrix

[Figures from the accompanying paper: the scheme of the integration loop of EUROPLEXUS; the Tube crash simulation; the Meppen simulation]

• Two different versions of the application
 ‣ Standard OpenMP 3.0 version
  ๏ Independent tasks with explicit synchronizations
 ‣ libKOMP-powered data-flow version
  ๏ Expressing variable access modes
  ๏ Let the runtime system decide whether a task is ready or not

Evaluation of independent tasks vs data-flow tasks on the Europlexus application

[Plot: speedup (Tp/Tseq) versus core count, from 0 to 45 cores, for the libKOMP data-flow version and the standard OpenMP version, compared to the ideal speedup]
