
A Quest for Unified, Global View Parallel Programming Models for Our Future

Kenjiro Taura

University of Tokyo


Acknowledgements

▶ Jun Nakashima (MassiveThreads)
▶ Shigeki Akiyama, Wataru Endo (MassiveThreads/DM)
▶ An Huynh (DAGViz)
▶ Shintaro Iwasaki (Vectorization)

What is task parallelism?

▶ like most CS terms, the definition is vague
▶ I don’t consider the contraposition “data parallelism vs. task parallelism” useful
▶ imagine lots of tasks, each working on a piece of data
  ▶ is it data parallel or task parallel?

▶ let’s instead ask:
  ▶ what’s useful from the programmer’s view point
  ▶ what are useful distinctions to make from the implementer’s view point

What is task parallelism?

A system supports task parallelism when:

1. a logical unit of concurrency (that is, a task) can be created dynamically, at an arbitrary point of execution,
2. and cheaply;
3. and tasks are automatically mapped on hardware parallelism (cores, nodes, . . . );
4. and cheaply context-switched.

What are they good for?

▶ generality: “creating tasks at arbitrary points” unifies many superficially different patterns
  ▶ parallel nested loops, parallel recursions
  ▶ they trivially compose
▶ programmability: cheap task creation + automatic load balancing allow a straightforward, processor-oblivious decomposition of the work (divide-and-conquer until trivial)
▶ performance: dynamic scheduling is a basis for hiding latencies and tolerating noise

Our goal

use tasks (+ higher-level syntax on top) as the unified means to express parallelism
▶ the system maps tasks to hardware parallelism
  ▶ cores within a node
  ▶ nodes
  ▶ SIMD lanes within a core!

Rest of the talk

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Taxonomy

▶ library or frontend: implemented with ordinary C/C++ compilers, or does it heavily rely on a tailored frontend?
▶ tasks suspendable or atomic: can tasks suspend/resume in the middle, or do tasks always run to completion?
▶ synchronization patterns arbitrary or pre-defined: can tasks synchronize in an arbitrary topology, or only in pre-defined synchronization patterns (e.g., bag-of-tasks, fork/join)?
▶ tasks untied or tied: can tasks migrate after they have started?

Instantiations

                   library/frontend   suspendable tasks   untied tasks   sync topology
OpenMP tasks       frontend           yes                 yes            fork/join
TBB                library            yes                 no             fork/join
Cilk (Plus)        frontend           yes                 yes            fork/join
Quark              library            no                  no             arbitrary
Nanos++            library            yes                 yes            arbitrary
Qthreads           library            yes                 yes            arbitrary
Argobots           library            yes                 yes?           arbitrary
MassiveThreads     library            yes                 yes            arbitrary

MassiveThreads

▶ https://github.com/massivethreads/massivethreads
▶ design philosophy: user-level threads (ULT) with an ordinary, pthread-like API as you know it:
  ▶ tid = myth_create(f, arg)
  ▶ myth_join(tid)
  ▶ myth_yield to switch among threads (useful for latency hiding)
  ▶ mutexes and condition variables to build arbitrary synchronization patterns
▶ efficient scheduler (locally LIFO and child-first; steals the oldest task first)
▶ an (experimental) customizable work stealing scheduler [Nakashima and Taura; ROSS 2013]
(a minimal usage sketch follows)
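Below is a minimal usage sketch of this API. It is my own illustration, not code from the slides; the header path and exact signatures are assumptions, following the pthread convention the slide alludes to.

    /* hedged sketch: assumes a pthread-like MassiveThreads API; the header
       name, link flags (e.g., -lmyth), and exact signatures may differ */
    #include <stdio.h>
    #include <myth/myth.h>                   /* assumed header location */

    void *child(void *arg) {
      long x = (long)arg;
      printf("task %ld running\n", x);
      return (void *)(x * x);                /* result returned to the joiner */
    }

    int main(void) {
      myth_thread_t tid = myth_create(child, (void *)7);  /* create a task */
      myth_yield();                          /* voluntarily switch to another task */
      void *ret;
      myth_join(tid, &ret);                  /* wait for the task, collect its result */
      printf("result = %ld\n", (long)ret);
      return 0;
    }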

User-facing systems on MassiveThreads

▶ TBB’s task_group and parallel_for (but with an untied work stealing scheduler)
▶ Chapel tasks on top of MassiveThreads (currently broken orz)
▶ SML# (Ueno @ Tohoku University): ongoing
▶ Tapas (Fukuda @ RIKEN), a domain specific language for particle simulation

A TBB interface on MassiveThreads:

    quicksort(a, p, q) {
      if (q - p < th) {
        ...
      } else {
        mtbb::task_group tg;
        r = partition(a, p, q);
        tg.run([=]{ quicksort(a, p, r-1); });
        quicksort(a, r, q);
        tg.wait();
      }
    }

Important performance metrics

▶ low local creation/sync overhead
▶ low local context switch overhead
▶ reasonably low load balancing (migration) overhead
▶ somewhat sequential scheduling order

    parent() {
      π0:
      spawn { γ: ... };
      π1:
    }

op              measures what   time (cycles)
local create    π0 → γ          ≈ 140
work steal      π0 → π1         ≈ 900
context switch  myth_yield      ≈ 80

(Haswell i7-4500U (1.80 GHz), GCC 4.9)

Comparison to other systems

[Bar chart: per-operation overhead in clocks (child vs. parent) for Cilk, CilkPlus, MassiveThreads, OpenMP, Qthreads, and TBB; y-axis 0–3000, with labeled values 167, 73, 72, 138 and one bar at ≈ 7000]

    parent() {
      π0:
      spawn { γ: ... };
      π1:
    }

Summary:
▶ Cilk (Plus), known for its superb local creation performance, sacrifices work stealing performance
▶ TBB’s local creation overhead is equally good, but it is “parent-first” and its tasks are tied to a worker once started

Further research agenda (1)

▶ task runtimes for ever larger scale systems are vital
▶ ⇒ “locality-/cache-/hierarchy-/topology-/whatever-aware” schedulers are obviously important
▶ ⇒ many hierarchical/customizable scheduler proposals
▶ ⇒ yet, IMO, there are no clear demonstrations that they clearly outperform simple greedy work stealing over many workloads
▶ the question, it seems, ultimately comes down to this: when no tasks exist near you but some may exist far from you, do you steal one or not (stay idle)?

Further research agenda (2)

▶ quantify the gap between hand-optimized decomposition and automatic decomposition (by work stealing); e.g.
  ▶ space-filling decomposition vs. work stealing
  ▶ 2.5D matrix multiply vs. work stealing
▶ both experimentally and theoretically

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Two facets of task parallelism in distributed memory settings

▶ a means to hide latency, for which we merely need a local user-level thread library supporting suspend/resume at arbitrary points
▶ a means to globally balance loads, for which we need a system specifically designed to migrate tasks across address spaces

MassiveThreads/DM is a system supporting
▶ distributed load balancing and latency hiding
▶ + a global address space supporting migration and replication

Tasks to hide latencies

The goal:
▶ individual tasks look like ordinary blocking accesses (programmer-friendly):

    scan(global_array a) {
      for (i = 0; i < n; i++) {
        .. = .. a[i] ..;
      }
    }

▶ hide latencies by creating lots of tasks:

    scan(global_array a) {
      pfor (i = 0; i < n; i++) {
        .. = .. a[i] ..;
      }
    }

Ingredients for implementation:
▶ a local tasking layer with good context switch performance
▶ a message/RDMA layer with good multithreaded performance
(a sketch of how such a pfor could be built on tasks follows)
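The following is my own hedged sketch, not MassiveThreads/DM code, of how a pfor could be realized by divide-and-conquer task creation so that many iterations are outstanding at once and their remote accesses overlap; it assumes the pthread-like myth_create/myth_join API sketched earlier.

    #include <myth/myth.h>                  /* assumed header location */

    typedef struct {
      long lo, hi, grain;
      void (*body)(long i, void *env);
      void *env;
    } pfor_arg_t;

    static void *pfor_rec(void *p) {
      pfor_arg_t *a = (pfor_arg_t *)p;
      if (a->hi - a->lo <= a->grain) {
        for (long i = a->lo; i < a->hi; i++) a->body(i, a->env);  /* leaf */
      } else {
        long mid = a->lo + (a->hi - a->lo) / 2;
        pfor_arg_t left  = { a->lo, mid, a->grain, a->body, a->env };
        pfor_arg_t right = { mid, a->hi, a->grain, a->body, a->env };
        myth_thread_t t = myth_create(pfor_rec, &left);   /* spawn first half */
        pfor_rec(&right);                                 /* run second half */
        myth_join(t, NULL);                               /* join before returning */
      }
      return NULL;
    }

    void pfor(long lo, long hi, long grain,
              void (*body)(long, void *), void *env) {
      pfor_arg_t a = { lo, hi, grain, body, env };
      pfor_rec(&a);
    }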

Preliminary results

▶ context switch: we used MassiveThreads’s myth_yield function to switch context upon blocking
▶ message/RDMA: we rolled our own thread-safe communication layer (on MPI, on IB verbs, and on Fujitsu Tofu RMA), partly because Fujitsu MPI lacks multithreading support

    /* a[i] */
    T get(address) {
      issue non-blocking get(address);
      while (!the result available) {
        myth_yield();
      }
      return result;
    }

[Plot: gets/node/sec vs. number of tasks (1–10), for workers = 1 to 5; throughput grows with the number of tasks, up to ≈ 1.4 × 10^6 gets/node/sec]

Taxonomy

▶ library or frontend
▶ tasks suspendable or atomic
▶ synchronization patterns arbitrary or pre-defined
▶ tasks untied or tied

▶ the main issue: implementation complexity rises on distributed memory, especially for untied tasks

▶ that is, how do we move tasks across address spaces?

Instantiations

                                       library/frontend  suspendable tasks  untied tasks  sync topology  scale
Distributed Cilk [Blumofe et al. 96]   frontend          yes                yes           fork/join      16
Satin [Nieuwpoort et al. 01]           frontend          yes                no            fork/join      256
Tascell [Hiraishi et al. 09]           frontend          yes                yes           fork/join      128
Scioto [Dinan et al. 09]               library           no                 no            BoT            8192
HotSLAW [Min et al. 11]                library           yes                no            fork/join      256
X10/GLB [Zhang et al. 13]              library           no                 no            BoT            16384
Grappa [Nelson et al. 15]              library           yes                no            fork/join      4096
MassiveThreads/DM [Akiyama et al. 15]  library           yes                yes           fork/join      4096

MassiveThreads/DM

▶ a global (inter-node) work stealing library
▶ usable with ordinary C/C++ compilers
▶ supports fork/join with untied tasks
▶ ⇒ it moves native threads across nodes

Migrating native threads

▶ problem: the stack of a native thread contains pointers that point into the stack itself
▶ migrating the thread to an arbitrary address breaks these pointers
▶ ⇒ upon migration, copy the stack to the same virtual address on the destination node (iso-address [Antoniu et al. 1999])

[Figure: a stack at address @a; copying it to a different address @a' breaks internal pointers, while copying it to the same address @a keeps them valid (iso-address)]
(a sketch of the idea follows)
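A hedged sketch of the iso-address idea, written by me for illustration; it is not the MassiveThreads/DM implementation, and the base address and sizes are hypothetical.

    #include <string.h>
    #include <sys/mman.h>

    #define STACK_BASE ((void *)0x7f0000000000ULL)  /* hypothetical agreed-upon base */
    #define STACK_SIZE (1 << 20)                    /* 1 MiB per thread stack */

    /* every node reserves the SAME virtual address range for this stack,
       so a migrated stack image can be installed at an identical address */
    void *reserve_iso_stack(void) {
      void *p = mmap(STACK_BASE, STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
      return (p == MAP_FAILED) ? NULL : p;
    }

    /* on the destination node: copy the received stack image to the same
       address; pointers into the stack remain meaningful */
    void install_migrated_stack(const void *image, size_t len) {
      memcpy(STACK_BASE, image, len);
    }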

Iso-address limits

▶ for each thread, all nodes must reserve its stack address range
▶ ⇒ a huge waste of virtual address space

Is consuming a huge amount of virtual memory really a problem?

▶ with high concurrency, it may indeed overflow the virtual address space:

    stack size × task depth × cores/node × nodes
    = 2^14 × 2^13 × 2^8 × 2^13 = 2^48 bytes (256 TiB)

▶ more importantly, such lavish use of virtual memory prohibits using RDMA for work stealing (as RDMA memory must be pinned)
▶ ⇒ we proposed the UniAddress scheme [Akiyama et al. 2015]

Further research agenda

▶ demonstrate global distributed load balancing with practical workloads that have lots of shared data
▶ “locality-/hierarchy-/. . . ”-awareness is even more important in this setting
▶ the latency-hiding opportunity adds an extra dimension
  ▶ steal or not, switch or not

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Analyzing task parallel programs

[Figure: the DAG of a task parallel execution with ~200 tasks (T0–T199), connected by create/wait task edges and mapped onto physical resources]

▶ task parallel systems are more “opaque” to users
  ▶ task management, load balancing, scheduling
▶ they show performance differences, and researchers want to precisely understand where they come from

DAG Recorder and DAGViz

▶ DAG Recorder runs a task parallel program and extracts its DAG, augmented with timestamps, CPUs, etc.
▶ DAGViz is its visualizer

    A() {
      for (i = 0; i < 2; i++) {
        mk_task_group;
        create_task(B());
        create_task(C());
        D();
        wait_tasks();
      }
    }
    D() {
      mk_task_group;
      create_task(E());
      F();
      wait_tasks();
    }

[Figure: the extracted DAG, with node kinds begin_section, create_task, wait_tasks, and endtask]

Why record the DAG?

▶ the DAG is a logical representation of the program execution, independent from the runtime system
  ▶ you can compare the DAGs produced by two systems side by side
▶ the DAG contains sufficient information to reconstruct many details
  ▶ work and critical path (excluding overhead)
  ▶ actual parallelism (running cores) along time
  ▶ available parallelism (ready tasks) along time
  ▶ how long each task was delayed by the scheduler
(a sketch of the work/critical-path computation follows)
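As an illustration of the first two items, here is a hedged sketch of how work and critical path (span) can be recomputed from the recorded timestamps; the node layout and field names are my assumptions, not DAG Recorder's actual format.

    /* work = total execution time over all nodes;
       span = the longest execution-time path through the DAG */
    typedef struct node {
      double start, end;        /* recorded timestamps of this node */
      int nsucc;
      struct node **succ;       /* successor nodes */
      double span_memo;         /* memoized longest path starting here */
      int visited;
    } node_t;

    double work(node_t *nodes, int n) {
      double w = 0;
      for (int i = 0; i < n; i++) w += nodes[i].end - nodes[i].start;
      return w;
    }

    double span_from(node_t *u) {
      if (u->visited) return u->span_memo;
      u->visited = 1;
      double best = 0;
      for (int i = 0; i < u->nsucc; i++) {
        double s = span_from(u->succ[i]);
        if (s > best) best = s;
      }
      return u->span_memo = (u->end - u->start) + best;
    }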

DAGViz Demo

Seeing is believing.

Challenge: reducing space requirement

▶ literally recording all subgraphs is prohibitive
▶ so we collapse “uninteresting” subgraphs into single nodes
▶ current criteria: we collapse a subgraph ⇐⇒
  1. its nodes are executed by a single worker, and
  2. its span is smaller than a (configurable) threshold

[Figure: the DAG of the previous example being progressively collapsed; node kinds begin_section, create_task, wait_tasks, endtask]
(a sketch of the collapsing rule follows)
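A hedged sketch of that collapsing rule as a recursive pass over the recorded DAG; the types and helpers are my own assumptions, not DAG Recorder's data structures.

    typedef struct dag_node {
      int worker;               /* worker that executed this node */
      double span;              /* span of the subgraph rooted here */
      int nchild, collapsed;
      struct dag_node **child;
    } dag_node_t;

    /* returns the worker id if the whole subgraph ran on one worker, else -1 */
    static int sole_worker(dag_node_t *n) {
      int w = n->worker;
      for (int i = 0; i < n->nchild; i++)
        if (sole_worker(n->child[i]) != w) return -1;
      return w;
    }

    /* collapse a subgraph iff (single worker) and (span < threshold) */
    void maybe_collapse(dag_node_t *n, double threshold) {
      if (sole_worker(n) != -1 && n->span < threshold) {
        n->collapsed = 1;       /* store this subgraph as a single node */
      } else {
        for (int i = 0; i < n->nchild; i++)
          maybe_collapse(n->child[i], threshold);
      }
    }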

Ongoing work

▶ hoping to use this tool to automate discovery of issues in runtime systems
  ▶ scheduler delays along a critical path
  ▶ work time inflation
▶ shed light on “steal or not” trade-offs

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Motivation

▶ task parallelism is a friend of divide-and-conquer algorithms
▶ divide-and-conquer makes coding “trivial,” by dividing until the problem becomes trivial
  ▶ matrix multiply, matrix factorization, triangular solve, FFT, sorting, . . .
▶ in reality, the programmer has to optimize the leaves manually
▶ why? because we lack good compilers

The power of divide-and-conquer

    /* C += AB */
    mm(A, B, C) {
      if (|A| = 1 && |B| = 1 && |C| = 1) {
        C00 += A00 · B00;
      } else {
        ...
      }
    }

    /* quick sort */
    quicksort(a, p, q) {
      if (q − p < 2) {
        return;
      } else {
        ...
      }
    }

    /* Cholesky factorization */
    chol(A) {
      if (n = 1) {
        return (√a11);
      } else {
        ...
      }
    }

    /* FFT */
    fft(n, x) {
      if (n = 1) {
        return x0;
      } else {
        ...
      }
    }

    /* triangular solve LX = B */
    trsm(L, B) {
      if (M = 1) {
        B /= l11;
      } else {
        ...
      }
    }

They all admit a “trivial” base case, if only its performance were acceptable . . .

Static optimizations and vectorization of tasks

▶ goal: run straightforward task-based programs as fast as manually optimized programs
▶ write once, parallelize everywhere (nodes, cores, and vectors)

[Figure: a task tree whose leaf subtrees are serialized and vectorized]

What does our compiler do?

1. static cut-off statically eliminates task creations
2. code-bloat-free inlining inline-expands recursions
3. loopification transforms recursions into flat loops (and then vectorizes them if possible)

Static cut-off

    f(a, b, ···) {
      if (E) {
        L(a, b, ···)
      } else {
        ···
        spawn f(a1, b1, ···);
        ···
        spawn f(a2, b2, ···);
        ···
      }
    }

    ⇒

    fseq(a, b, ···) {
      if (E) {
        L(a, b, ···)
      } else {
        ···
        fseq(a1, b1, ···);
        ···
        fseq(a2, b2, ···);
        ···
      }
    }

key: determine a condition Hk under which the height of the recursion above the leaves is ≤ k:
▶ H0 = E
▶ Hk+1 = E, or all child arguments (ai, bi, ···) satisfy Hk
when this succeeds, generate code that statically eliminates all task creations
(a worked example follows)
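As a concrete worked example of Hk (my own illustration, not taken from the paper): for fib, E is (n < 2), so H0 = (n < 2), H1 = (n < 3), and in general Hk = (n < k + 2). A sketch of the code the cut-off aims to produce:

    #define K 10                      /* cut-off height (chosen at compile time) */

    long fib_seq(long n) {            /* serialized code: no task creations */
      return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
    }

    long fib(long n) {
      if (n < K + 2)                  /* H_K holds: this whole subtree is serial */
        return fib_seq(n);
      /* otherwise, the task-parallel version: conceptually spawn fib(n-1)
         as a task, compute fib(n-2), then join */
      long a = fib(n - 1);
      long b = fib(n - 2);
      return a + b;
    }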

Code-bloat-free inlining

▶ under condition Hk, inline-expanding all recursions k times would eliminate all function calls
▶ but this would result in an exponential code bloat when the function has multiple recursive calls
▶ code-bloat-free inlining fuses multiple recursive calls into a single call site:

    ···
    f(a1, b1, ···);
    ···
    f(a2, b2, ···);
    ···

    ⇒

    for (i = 0; i < 2; i++) {
      switch (i) {
      case 0: ···
      case 1: ···
      }
      f(ai, bi, ···);
    }

Loopification

    fseq(a, b, ···) {
      if (E) {
        L(a, b, ···)
      } else {
        ···
        fseq(a1, b1, ···);
        ···
        fseq(a2, b2, ···);
        ···
      }
    }

    ⇒

    for i ∈ P {
      L(xi, yi, ···)
    }

▶ instead of code-bloat-free inlining, loopification attempts to generate a flat (or shallow) loop directly from the recursive code
▶ it tries to synthesize hypotheses that the original code is an affine loop of leaf blocks
▶ the loopified code may then be vectorized
(a hypothetical before/after example follows)
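For concreteness, here is a hypothetical example (mine, not from the paper) of what loopification aims to achieve for a recursive vector add; vecadd appears among the benchmarks below, but this particular code is an assumption.

    /* divide-and-conquer vecadd whose leaves are single-element additions;
       in the task-parallel original, the first recursive call would be spawned */
    void vecadd(float *x, float *y, long p, long q) {   /* y[p..q) += x[p..q) */
      if (q - p == 1) {
        y[p] += x[p];                                   /* leaf block L */
      } else {
        long r = (p + q) / 2;
        vecadd(x, y, p, r);
        vecadd(x, y, r, q);
      }
    }

    /* what loopification aims to produce: a flat, vectorizable loop of leaves */
    void vecadd_loopified(float *x, float *y, long p, long q) {
      for (long i = p; i < q; i++)
        y[i] += x[i];
    }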

Results: effect of optimizations

[Bar chart: relative performance of the base, dynamic, static, cef, loop, and proposed variants on fib, nqueens, fft, sort, nbody, strassen, vecadd, heat2d, heat3d, gaussian, matmul, trimul, treeadd, treesum, and uts; y-axis up to 16, with some bars clipped at 27.12, 17.56, 17.65, 220.14, and 109.72]

Results: remaining gap to hand-optimized code

[Bar chart: relative performance (task = 1) of task, omp, omp optimized, and polly on nbody, vecadd, heat2d, heat3d, gaussian, matmul, and trimul, plus the average and geomean]

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Future outlook of task parallelism

▶ the goal: offer both programmability and performance
▶ there is a long way to go toward acceptable performance on distributed memory machines. why?
  ▶ dynamic load balancing → random traffic
  ▶ global address space → fine-grain communication
▶ these are OK in shared memory today. why not on distributed memory (at least for now)?
  ▶ checking errors and completion everywhere
  ▶ doing mutual exclusion everywhere
  ▶ no analog of hardware prefetching
  ▶ or a lack of bandwidth to tolerate random traffic and aggressive prefetching

Thank you for listening
