A Quest for Unified, Global View Parallel Programming Models for Our Future
Kenjiro Taura
University of Tokyo
[figure: the task DAG of an example execution, tasks T0–T199, fanning out from T0 and joining back]
Acknowledgements
▶ Jun Nakashima (MassiveThreads)
▶ Shigeki Akiyama, Wataru Endo (MassiveThreads/DM)
▶ An Huynh (DAGViz)
▶ Shintaro Iwasaki (Vectorization)
What is task parallelism?
▶ like most CS terms, the definition is vague
▶ I don't consider the opposition "data parallelism vs. task parallelism" useful
▶ imagine lots of tasks, each working on a piece of data: is that data parallel or task parallel?
▶ let's instead ask:
▶ what's useful from the programmer's viewpoint?
▶ what distinctions are useful from the implementer's viewpoint?
What is task parallelism?
A system supports task parallelism when:
1. a logical unit of concurrency (that is, a task) can be created dynamically, at an arbitrary point of execution,
2. and cheaply;
3. and tasks are automatically mapped on hardware parallelism (cores, nodes, . . . );
4. and tasks are cheaply context-switched.
[figure: a task tree growing as tasks are created]
What are they good for?
▶ generality: "creating tasks at arbitrary points" unifies many superficially different patterns
▶ parallel nested loops, parallel recursion
▶ they compose trivially
▶ programmability: cheap task creation + automatic load balancing allow a straightforward, processor-oblivious decomposition of the work (divide and conquer until trivial)
▶ performance: dynamic scheduling is a basis for hiding latency and tolerating noise
Our goal
▶ programmers use tasks (+ higher-level syntax on top) as the unified means to express parallelism ▶ the system maps tasks to hardware parallelism ▶ cores within a node ▶ nodes ▶ SIMD lanes within a core!
Rest of the talk
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
Taxonomy
▶ library or frontend: implemented with ordinary C/C++ compilers, or does it heavily rely on a tailored frontend?
▶ tasks suspendable or atomic: can tasks suspend/resume in the middle, or do tasks always run to completion?
▶ synchronization patterns arbitrary or pre-defined: can tasks synchronize in an arbitrary topology, or only in pre-defined synchronization patterns (e.g., bag-of-tasks, fork/join)?
▶ tasks untied or tied: can tasks migrate after they started?
Instantiations
                library/  suspendable  untied  sync
                frontend  tasks        tasks   topology
OpenMP tasks    frontend  yes          yes     fork/join
TBB             library   yes          no      fork/join
Cilk            frontend  yes          yes     fork/join
Quark           library   no           no      arbitrary
Nanos++         library   yes          yes     arbitrary
Qthreads        library   yes          yes     arbitrary
Argobots        library   yes          yes?    arbitrary
MassiveThreads  library   yes          yes     arbitrary
MassiveThreads
▶ https://github.com/massivethreads/massivethreads
▶ design philosophy: user-level threads (ULTs) behind an ordinary thread API as you know it
▶ tid = myth_create(f, arg) creates a thread
▶ myth_join(tid) joins it
▶ myth_yield to switch among threads (useful for latency hiding)
▶ mutexes and condition variables to build arbitrary synchronization patterns
▶ efficient work stealing scheduler (locally LIFO and child-first; steal the oldest task first)
▶ an (experimental) customizable work stealing scheduler [Nakashima and Taura; ROSS 2013]
User-facing APIs on MassiveThreads
▶ TBB's task_group and parallel_for (but with the untied work stealing scheduler)
▶ Chapel tasks on top of MassiveThreads (currently broken orz)
▶ SML# (Ueno @ Tohoku University): ongoing
▶ Tapas (Fukuda @ RIKEN), a domain specific language for particle simulation

A TBB interface on MassiveThreads:

    quicksort(a, p, q) {
      if (q - p < th) {
        ...
      } else {
        mtbb::task_group tg;
        r = partition(a, p, q);
        tg.run([=]{ quicksort(a, p, r-1); });
        quicksort(a, r, q);
        tg.wait();
      }
    }
Important performance metrics
▶ low local creation/sync overhead
▶ low local context-switch overhead
▶ reasonably low load balancing (migration) overhead
▶ somewhat sequential scheduling order

    parent() {
      π0:
      spawn { γ: ... };
      π1:
    }

op              measures what  time (cycles)
local create    π0 → γ         ≈ 140
work steal      π0 → π1        ≈ 900
context switch  myth_yield     ≈ 80

(Haswell i7-4500U (1.80 GHz), GCC 4.9)
Comparison to other systems
[bar chart: clocks to run a spawned child locally ("child") and to resume the parent after a steal ("parent") for Cilk, CilkPlus, MassiveThreads, OpenMP, Qthreads, and TBB; local creation costs cluster around 72–167 clocks, while the worst case reaches ≈ 7000]
Summary: ▶ Cilk(Plus), known for its superb local creation performance, sacrifices work stealing performance ▶ TBB’s local creation overhead is equally good, but it is “parent-first” and tasks are tied to a worker once started
Further research agenda (1)
▶ task runtimes for ever larger scale systems are vital
▶ ⇒ "locality-/cache-/hierarchy-/topology-/whatever-aware" schedulers are obviously important
▶ ⇒ hence proposals for hierarchical/customizable schedulers
▶ ⇒ yet, IMO, there is no clear demonstration that they outperform simple greedy work stealing across many workloads
▶ the question, it seems, ultimately comes down to this: when no tasks exist near you but some may exist far from you, steal (or stay idle)?
Further research agenda (2)
▶ quantify the gap between hand-optimized decomposition and automatic decomposition (by work stealing); e.g.
▶ space-filling-curve decomposition vs. work stealing
▶ 2.5D matrix multiply vs. work stealing
▶ both experimentally and theoretically
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
Two facets of task parallelism in distributed memory settings
▶ a means to hide latency, for which we merely need a local user-level thread library supporting suspend/resume at arbitrary points ▶ a means to globally balance loads, for which we need a system specifically designed to migrate tasks across address spaces MassiveThreads/DM is a system supporting ▶ distributed load balancing and latency hiding ▶ + global address space supporting migration and replication
Tasks to hide latencies
The goal:
▶ individual tasks look like plain blocking code, e.g. scan(global_array ...), while the runtime overlaps many of them
Preliminary results
▶ context switch: we used MassiveThreads's myth_yield function to switch context upon blocking
▶ message/RDMA: we rolled our own thread-safe communication layer (on MPI, on IB verbs, and on Fujitsu Tofu RMA), partly because Fujitsu MPI lacks multithreading support
[plot: throughput of remote get(address) of a[i] (up to ≈ 1.4 × 10^6 gets/s) vs. the number of concurrent tasks]
Taxonomy
▶ library or frontend ▶ tasks suspendable or atomic ▶ synchronization patterns arbitrary or pre-defined ▶ tasks untied or tied
▶ the main issue: implementation complexity rises on distributed memory, especially for untied tasks
▶ that is, how do we move tasks across address spaces?
Instantiations
                   library/  suspendable  untied  sync       scale
                   frontend  tasks        tasks   topology
Distributed Cilk   frontend  yes          yes     fork/join  16     [Blumofe et al. 96]
Satin              frontend  yes          no      fork/join  256    [van Nieuwpoort et al. 01]
Tascell            frontend  yes          yes     fork/join  128    [Hiraishi et al. 09]
Scioto             library   no           no      BoT        8192   [Dinan et al. 09]
HotSLAW            library   yes          no      fork/join  256    [Min et al. 11]
X10/GLB            library   no           no      BoT        16384  [Zhang et al. 13]
Grappa             library   yes          no      fork/join  4096   [Nelson et al. 15]
MassiveThreads/DM  library   yes          yes     fork/join  4096   [Akiyama et al. 15]
MassiveThreads/DM
▶ global (inter-node) work stealing library ▶ usable with ordinary C/C++ compilers ▶ supports fork-join with untied tasks ▶ ⇒ moves native threads across nodes
Migrating native threads
▶ problem: the stack of a native thread contains pointers into the stack itself
▶ migrating the thread to an arbitrary address breaks these pointers
▶ ⇒ upon migration, copy the stack to the same virtual address on the destination node (iso-address [Antoniu et al. 1999])
[figure: a stack slot holds a pointer @a into the stack; copied to a different address, the slot still reads @a while the target moved to @a', so the pointer dangles; copied to the same address (iso-address), it stays valid]
Iso-address limits scalability
▶ for each thread, all nodes must reserve its stack's address range
▶ ⇒ a huge waste of virtual address space
[figure: the same stack regions reserved in every node's virtual address space]
Is consuming a huge amount of virtual memory really a problem?
▶ with high concurrency, it may indeed overflow virtual address space
stack size × task depth × cores/node × nodes = 2^14 × 2^13 × 2^8 × 2^13 = 2^48
▶ more importantly, such lavish use of virtual memory precludes using RDMA for work stealing (as RDMA memory must be pinned)
▶ ⇒ we proposed the UniAddress scheme [Akiyama et al. 2015]
Further research agenda
▶ demonstrate global distributed load balancing on practical workloads with lots of shared data
▶ "locality-/hierarchy-/. . . " awareness is even more important in this setting
▶ the latency-hiding opportunity adds an extra dimension
▶ steal or not, switch or not
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
Analyzing task parallel programs
[figure: the task DAG (T0–T199) from the opening slide, with the runtime system sitting between create/wait task operations and the physical resources]
▶ task parallel systems are more "opaque" to users
▶ task management, load balancing, and scheduling all happen inside the runtime system
▶ systems show performance differences, and researchers want to understand precisely where they come from
DAG Recorder and DAGViz
▶ DAG Recorder runs a task parallel program and extracts its DAG, augmented with timestamps, CPUs, etc.
▶ DAGViz is its visualizer

    A() {
      for (i = 0; i < 2; i++) {
        mk_task_group;
        create_task(B());
        create_task(C());
        D();
        wait_tasks();
      }
    }
    D() {
      mk_task_group;
      create_task(E());
      F();
      wait_tasks();
    }

[figure: the extracted DAG, with begin_section, create_task (B, C, E), wait_tasks, and end_task nodes]
Why record the DAG?
▶ the DAG is a logical representation of the program's execution, independent of the runtime system
▶ you can compare the DAGs produced by two systems side by side
▶ the DAG contains sufficient information to reconstruct many details
▶ work and critical path (excluding overhead)
▶ actual parallelism (running cores) over time
▶ available parallelism (ready tasks) over time
▶ how long each task was delayed by the scheduler
DAGViz Demo
Seeing is believing.
Challenge: reducing space requirement
▶ literally recording all subgraphs is prohibitive
▶ so we collapse "uninteresting" subgraphs into single nodes
▶ current criteria: we collapse a subgraph ⇐⇒
1. its nodes are executed by a single worker, and
2. its span is smaller than a (configurable) threshold
[figure: a DAG of begin_section, create_task (B, C, E), wait_tasks, and end_task nodes, shown progressively collapsed into single nodes]
Ongoing work
▶ hoping to use this tool to automate discovery of issues in runtime systems ▶ scheduler delays along a critical path ▶ work time inflation ▶ shed light on “steal or not” trade-offs
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
Motivation
▶ task parallelism is a friend of divide-and-conquer algorithms ▶ divide-and-conquer makes coding “trivial,” by dividing until the problem becomes trivial ▶ matrix multiply, matrix factorization, triangular solve, FFT, sorting, . . . ▶ in reality, the programmer has to optimize leaves manually ▶ why? because we lack good compilers
The power of divide-and-conquer
    /* C += AB */
    mm(A, B, C) {
      if (|A| = 1 && |B| = 1 && |C| = 1) {
        C00 += A00 · B00;
      } else {
        ...
      }
    }

    /* quicksort */
    quicksort(a, p, q) {
      if (q − p < 2) {
        return;
      } else {
        ...
      }
    }

    /* Cholesky factorization */
    chol(A) {
      if (n = 1) {
        return (√a11);
      } else {
        ...
      }
    }

    /* FFT */
    fft(n, x) {
      if (n = 1) {
        return x0;
      } else {
        ...
      }
    }

    /* triangular solve LX = B */
    trsm(L, B) {
      if (M = 1) {
        B /= l11;
      } else {
        ...
      }
    }

They all admit a "trivial" base case . . . if only its performance were acceptable.
Static optimizations and vectorization of tasks
▶ goal: run straightforward task-based programs as fast as manually optimized programs
▶ write once, parallelize everywhere (nodes, cores, and vectors)
[figure: a recursion tree whose leaf subtrees are serialized and vectorized]
What does our compiler do?
1. static cut-off statically eliminates task creations
2. code-bloat-free inlining inline-expands recursions
3. loopification transforms recursions into flat loops (and then vectorizes them if possible)
Static cut-off

    f(a, b, ···) {
      if (E) {
        L(a, b, ···)
      } else {
        ···
        spawn f(a1, b1, ···);
        ···
        spawn f(a2, b2, ···);
        ···
      }
    }

⇒

    fseq(a, b, ···) {
      if (E) {
        L(a, b, ···)
      } else {
        ···
        fseq(a1, b1, ···);
        ···
        fseq(a2, b2, ···);
        ···
      }
    }
key: determine a condition Hk under which the height of the recursion from the leaves is ≤ k
▶ H0 = E
▶ Hk+1 = E or ∀i: (ai, bi, ···) satisfy Hk
when this succeeds, generate code with all task creations statically eliminated
Code-bloat-free inlining
▶ under condition Hk, inline-expanding all recursions k times would eliminate all function calls
▶ but this would result in an exponential code bloat when the function has multiple recursive calls
▶ code-bloat-free inlining fuses multiple recursive calls into a single call site:

    ···
    f(a1, b1, ···);
    ···
    f(a2, b2, ···);
    ···

⇒

    for (i = 0; i < 2; i++) {
      switch (i) {
        case 0: ···
        case 1: ···
      }
      f(ai, bi, ···);
    }
Loopification
    fseq(a, b, ···) {
      if (E) {
        L(a, b, ···)
      } else {
        ···
        fseq(a1, b1, ···);
        ···
        fseq(a2, b2, ···);
        ···
      }
    }

⇒

    for i ∈ P {
      L(xi, yi, ···)
    }
▶ instead of code-bloat-free inlining, loopification attempts to generate a flat (or shallow) loop directly from recursive code ▶ it tries to synthesize hypotheses that the original code is an affine loop of leaf blocks ▶ the loopified code may then be vectorized
Results: effect of optimizations
[bar chart: relative performance of base, dynamic cut-off, static cut-off, code-bloat-free inlining, loopification, and the proposed combination on fib, nqueens, fft, sort, nbody, strassen, vecadd, heat2d, heat3d, gaussian, matmul, trimul, treeadd, treesum, and uts; most speedups fall within ≈ 2–16×, with clipped bars reaching ≈ 27×, 110×, and 220× on some benchmarks]
Results: remaining gap to hand-optimized code
[bar chart: relative performance (task version = 1) of OpenMP, hand-optimized OpenMP, and Polly on nbody, vecadd, heat2d, heat3d, gaussian, matmul, and trimul, plus the average and geometric mean; bars range roughly from 0.5 to 3]
Agenda
Intra-node Task Parallelism
Task Parallelism in Distributed Memory
Need Good Performance Analysis Tools
Compiler Optimizations and Vectorization
Concluding Remarks
Future outlook of task parallelism
▶ the goal: offer both programmability and performance
▶ there is a long way to go toward acceptable performance on distributed memory machines. why?
▶ dynamic load balancing → random traffic
▶ global address space → fine-grain communication
▶ these are OK in shared memory today. why not on distributed memory (at least for now)?
▶ checking errors and completion everywhere
▶ doing mutual exclusion everywhere
▶ no analog of hardware prefetching
▶ or lack of bandwidth to tolerate random traffic and aggressive prefetching
Thank you for listening