
Enhancing task-based irregular OpenMP applications with the libKOMP runtime system
François Broquedis, MOAIS Team, Grenoble, France

A few things about me...
Working in the MOAIS team
• Many research directions
  ‣ Parallel algorithms / scheduling
  ‣ Interactive computing and visualization
  ‣ Runtime systems for HPC platforms
• I'm mainly involved in the third one
  ‣ Task-based parallelism (Kaapi runtime system)
  ‣ OpenMP as a means to express task-based parallelism (libKOMP runtime system)

OUTLINE
(with a longer introduction to OpenMP in general)
1. A brief introduction to OpenMP task parallelism
2. Implementing the OpenMP tasking model
3. Adaptive loop scheduling for OpenMP
4. Data affinity and task-based applications
5. Current status from the HPC-GA project perspective

An Overview of OpenMP

What is OpenMP?
• De-facto standard Application Programming Interface (API) to write shared-memory parallel applications in C, C++ and Fortran
• Consists of compiler directives, runtime routines and environment variables
• Specification maintained by the OpenMP Architecture Review Board (http://www.openmp.org)
• Version 3.1 was released in July 2011
  ‣ The upcoming 4.0 specification will be released soon
• OpenMP is widely supported by the industry as well as the academic community

Advantages of OpenMP
• Good performance and scalability
  ‣ If you do it right...
• De-facto and mature standard
• An OpenMP program is portable
  ‣ Supported by a large number of compilers
• Requires little programming effort
• Allows the program to be parallelized incrementally

The OpenMP Execution Model
• Fork-and-join model: the master thread forks a team of worker threads at each parallel region; the threads synchronize (join) at the end of the region, and the pattern repeats for every subsequent parallel region.

Nested Parallelism
• A parallel region can be opened inside another one: an outer 3-way parallel region whose threads each open their own nested parallel region yields 9-way parallelism, before execution falls back to the outer 3-way region.
• Note: the nesting level can be arbitrarily deep.

The OpenMP Memory Model
• All threads have access to the same, globally shared memory
• Data in private memory is only accessible by the thread owning this memory
• No other thread sees the change(s) made to private data
• Data transfer is through shared memory and is 100% transparent to the application

Data-Sharing Attributes
• In an OpenMP program, data needs to be "labeled"
• Essentially there are two basic types:
  ‣ Shared: there is only one instance of the data
    - Threads can read and write the data simultaneously, unless protected through a specific construct
    - All changes made are visible to all threads, but not necessarily immediately, unless enforced...
  ‣ Private: each thread has a copy of the data
    - No other thread can access this data
    - Changes are only visible to the thread owning the data
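To make the shared/private distinction concrete, a minimal C sketch follows. It is an illustration added here rather than an example from the slides; the variable names (shared_counter, tid) and the use of an atomic update are assumptions of this sketch, not part of the original material.

    /* Illustrative sketch (not from the slides): shared vs. private data. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int shared_counter = 0;   /* shared: one instance, visible to all threads  */
        int tid;                  /* listed as private below: one copy per thread  */

        #pragma omp parallel private(tid) shared(shared_counter)
        {
            tid = omp_get_thread_num();   /* each thread writes its own private copy */

            #pragma omp atomic            /* protect the concurrent update of the    */
            shared_counter++;             /* single shared instance                  */

            printf("hello from thread %d\n", tid);
        }

        printf("shared_counter = %d (one increment per thread)\n", shared_counter);
        return 0;
    }

Compiled with an OpenMP-enabled compiler (e.g. gcc -fopenmp), every thread prints its own tid, while the final value of shared_counter equals the number of threads in the team.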
The Worksharing Constructs

    C/C++:    #pragma omp for        #pragma omp sections       #pragma omp single
              { .... }               { .... }                   { .... }

    Fortran:  !$OMP DO               !$OMP SECTIONS             !$OMP SINGLE
                 ....                    ....                       ....
              !$OMP END DO           !$OMP END SECTIONS         !$OMP END SINGLE

• The work is distributed over the threads
• Must be enclosed in a parallel region
• Must be encountered by all threads in the team, or by none at all
• No implied barrier on entry; implied barrier on exit (unless nowait is specified)
• A work-sharing construct does not launch any new threads

The Schedule Clause

    schedule ( static | dynamic | guided | auto [, chunk] )
    schedule ( runtime )

static [, chunk]
• Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
• In the absence of "chunk", each thread executes one chunk of approx. N/P iterations for a loop of length N and P threads
  ‣ Details are implementation defined
• Under certain conditions, the assignment of iterations to threads is the same across multiple loops in the same parallel region

Example of a Static Schedule
A loop of length 16 using 4 threads:

    Thread ID    0            1             2             3
    no chunk*    1-4          5-8           9-12          13-16
    chunk = 2    1-2, 9-10    3-4, 11-12    5-6, 13-14    7-8, 15-16

*) The precise distribution is implementation defined

Loop Workload Scheduling Choices
dynamic [, chunk]
• Fixed portions of work; the size is controlled by the value of chunk
• When a thread finishes, it starts on the next portion of work
guided [, chunk]
• Same dynamic behavior as "dynamic", but the size of the portion of work decreases exponentially
auto
• The compiler (or runtime system) decides what is best to use; the choice could be implementation dependent
runtime
• The iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE

Experiment: 500 Iterations on 4 Threads
[Figure: iteration number (0-500) versus the thread ID (0-3) that executes it, for three schedules: "static", "dynamic, 5" and "guided, 5".]
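To see a schedule clause in action, here is a small sketch of my own (not from the deck) that records which thread runs each iteration of a 16-iteration loop, mirroring the "chunk = 2" row of the static-schedule table above; switching the clause to dynamic or guided reproduces the behavior shown in the experiment.

    /* Illustrative sketch (not from the slides): observing the schedule clause. */
    #include <stdio.h>
    #include <omp.h>

    #define N 16

    int main(void)
    {
        int owner[N];   /* owner[i] = ID of the thread that executed iteration i */

        /* Blocks of 2 iterations handed out round-robin, as in the "chunk = 2" row. */
        #pragma omp parallel for schedule(static, 2)
        for (int i = 0; i < N; i++)
            owner[i] = omp_get_thread_num();

        for (int i = 0; i < N; i++)
            printf("iteration %2d executed by thread %d\n", i + 1, owner[i]);

        /* For irregular per-iteration work, schedule(dynamic, 2) or schedule(guided)
           usually balances the load better, at the price of some scheduling overhead. */
        return 0;
    }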
A gentle introduction to parallel programming and HPC platforms

A first example

    f = 1.0;
    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];
    for (i = 0; i < m; i++)
        a[i] = b[i] + c[i];
    ...
    scale = sum (a, 0, m) + sum (z, 0, n) + f;
    ...

A first example - v1
All statements inside the parallel region are executed by every thread of the team; the two #pragma omp for directives turn the loops into parallel loops whose work is distributed over the threads.

    #pragma omp parallel default (none) shared (z, x, y, a, b, c, n, m) private (f, i, scale)
    {
        f = 1.0;                                    /* executed by all the threads      */

        #pragma omp for                             /* parallel loop (work distributed) */
        for (i = 0; i < n; i++)
            z[i] = x[i] + y[i];

        #pragma omp for                             /* parallel loop (work distributed) */
        for (i = 0; i < m; i++)
            a[i] = b[i] + c[i];
        ...
        scale = sum (a, 0, m) + sum (z, 0, n) + f;  /* executed by all the threads      */
        ...
    } /* End of OpenMP parallel region */

A first example - v2
The nowait clause removes the implied barrier at the end of each parallel loop; an explicit barrier is then needed before computing scale.

    #pragma omp parallel default (none) shared (z, x, y, a, b, c, n, m) private (f, i, scale)
    {
        f = 1.0;

        #pragma omp for nowait
        for (i = 0; i < n; i++)
            z[i] = x[i] + y[i];

        #pragma omp for nowait
        for (i = 0; i < m; i++)
            a[i] = b[i] + c[i];
        ...
        #pragma omp barrier
        scale = sum (a, 0, m) + sum (z, 0, n) + f;
        ...
    } /* End of OpenMP parallel region */

A first example - v3
The if clause makes the runtime execute the region in parallel only when the problem is large enough; otherwise it runs on a single thread. The body is the same as in v2.

    #pragma omp parallel default (none) shared (z, x, y, a, b, c, n, m) private (f, i, scale) if (n > some_threshold && m > some_threshold)
    {
        /* same body as v2 */
    } /* End of OpenMP parallel region */

1 - Introduction to OpenMP task parallelism

More processing units mean massive parallelism
• Smaller chip designs mean bigger HPC platforms...
  ‣ More and more cores per chip
  ‣ More and more chips per node
• ... that can turn out to be embarrassing!
  ‣ How can I occupy all these cores?
• The programmer has to express fine-grain, massive parallelism
  ‣ OpenMP has evolved to fill this need, with the 3.0 standard introducing OpenMP tasks (2008)

The tasking concept in OpenMP

OpenMP tasking: who does what and when? (Don't ask why...)
• The developer specifies where the tasks are
  ‣ #pragma omp task construct
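As a preview of what the task construct looks like, here is the classic recursive Fibonacci sketch. It is a generic textbook illustration rather than an example taken from this talk, and a production version would add a cutoff on n so that very small calls do not become tasks.

    /* Generic illustration (not from this talk): OpenMP 3.0 tasks on recursive Fibonacci. */
    #include <stdio.h>

    long fib(int n)
    {
        long x, y;
        if (n < 2)
            return n;

        #pragma omp task shared(x)   /* child task, may run on any thread of the team */
        x = fib(n - 1);

        #pragma omp task shared(y)
        y = fib(n - 2);

        #pragma omp taskwait         /* wait for the two child tasks before summing   */
        return x + y;
    }

    int main(void)
    {
        long result;

        #pragma omp parallel   /* create the team of threads                 */
        #pragma omp single     /* a single thread spawns the root of the task tree */
        result = fib(30);

        printf("fib(30) = %ld\n", result);
        return 0;
    }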