
A Quest for Unified, Global View Parallel Programming Models for Our Future

Kenjiro Taura

University of Tokyo


Acknowledgements

▶ Jun Nakashima (MassiveThreads)
▶ Shigeki Akiyama, Wataru Endo (MassiveThreads/DM)
▶ An Huynh (DAGViz)
▶ Shintaro Iwasaki (Vectorization)

What is task parallelism?

▶ like most CS terms, the definition is vague
▶ I don’t consider the contraposition “data parallelism vs. task parallelism” useful
▶ imagine lots of tasks, each working on a piece of data
  ▶ is it data parallel or task parallel?

▶ let’s instead ask:
  ▶ what’s useful from the programmer’s view point
  ▶ what are useful distinctions to make from the implementer’s view point

What is task parallelism?

A system supports task parallelism when:

1. a logical unit of concurrency (that is, a task) can be created dynamically, at an arbitrary point of execution,
2. and cheaply;
3. and tasks are automatically mapped on hardware parallelism (cores, nodes, . . . );
4. and cheaply context-switched.

What are they good for?

▶ generality: “creating tasks at arbitrary points” unifies many superficially different patterns
  ▶ parallel nested loops, parallel recursions
  ▶ they trivially compose
▶ programmability: cheap task creation + automatic load balancing allow a straightforward, processor-oblivious decomposition of the work (divide-and-conquer until trivial)
▶ performance: dynamic scheduling is a basis for hiding latencies and tolerating noise

Our goal

use tasks (+ higher-level syntax on top) as the unified means to express parallelism
▶ the system maps tasks to hardware parallelism
  ▶ cores within a node
  ▶ nodes
  ▶ SIMD lanes within a core!

Rest of the talk

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Taxonomy

▶ library or frontend: implemented with ordinary C/C++ compilers, or does it heavily rely on a tailored frontend?
▶ tasks suspendable or atomic: can tasks suspend/resume in the middle, or do tasks always run to completion?
▶ synchronization patterns arbitrary or pre-defined: can tasks synchronize in an arbitrary topology, or only in pre-defined synchronization patterns (e.g., bag-of-tasks, fork/join)?
▶ tasks untied or tied: can tasks migrate after they have started?

Instantiations

                   library/frontend   suspendable tasks   untied tasks   sync topology
OpenMP tasks       frontend           yes                 yes            fork/join
TBB                library            yes                 no             fork/join
Cilk (Plus)        frontend           yes                 yes            fork/join
Quark              library            no                  no             arbitrary
Nanos++            library            yes                 yes            arbitrary
Qthreads           library            yes                 yes            arbitrary
Argobots           library            yes                 yes?           arbitrary
MassiveThreads     library            yes                 yes            arbitrary

MassiveThreads

▶ https://github.com/massivethreads/massivethreads
▶ design philosophy: user-level threads (ULT) with an ordinary, pthread-like API as you know it:
  ▶ tid = myth_create(f, arg)
  ▶ myth_join(tid)
  ▶ myth_yield to switch among threads (useful for latency hiding)
  ▶ mutexes and condition variables to build arbitrary synchronization patterns
▶ efficient scheduler (locally LIFO and child-first; steals the oldest task first)
▶ an (experimental) customizable work stealing scheduler [Nakashima and Taura; ROSS 2013]
(a minimal usage sketch follows)
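Below is a minimal usage sketch of this API. It is my own illustration, not code from the slides; the header path and exact signatures are assumptions, following the pthread convention the slide alludes to.

    /* hedged sketch: assumes a pthread-like MassiveThreads API; the header
       name, link flags (e.g., -lmyth), and exact signatures may differ */
    #include <stdio.h>
    #include <myth/myth.h>                   /* assumed header location */

    void *child(void *arg) {
      long x = (long)arg;
      printf("task %ld running\n", x);
      return (void *)(x * x);                /* result returned to the joiner */
    }

    int main(void) {
      myth_thread_t tid = myth_create(child, (void *)7);  /* create a task */
      myth_yield();                          /* voluntarily switch to another task */
      void *ret;
      myth_join(tid, &ret);                  /* wait for the task, collect its result */
      printf("result = %ld\n", (long)ret);
      return 0;
    }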

User-facing systems on MassiveThreads

▶ TBB’s task_group and parallel_for (but with an untied work stealing scheduler)
▶ Chapel tasks on top of MassiveThreads (currently broken orz)
▶ SML# (Ueno @ Tohoku University): ongoing
▶ Tapas (Fukuda @ RIKEN), a domain specific language for particle simulation

A TBB interface on MassiveThreads:

    quicksort(a, p, q) {
      if (q - p < th) {
        ...
      } else {
        mtbb::task_group tg;
        r = partition(a, p, q);
        tg.run([=]{ quicksort(a, p, r-1); });
        quicksort(a, r, q);
        tg.wait();
      }
    }

Important performance metrics

▶ low local creation/sync overhead
▶ low local context switch overhead
▶ reasonably low load balancing (migration) overhead
▶ somewhat sequential scheduling order

    parent() {
      π0:
      spawn { γ: ... };
      π1:
    }

op              measures what   time (cycles)
local create    π0 → γ          ≈ 140
work steal      π0 → π1         ≈ 900
context switch  myth_yield      ≈ 80

(Haswell i7-4500U (1.80 GHz), GCC 4.9)

Comparison to other systems

[Bar chart: per-operation overhead in clocks (child vs. parent) for Cilk, CilkPlus, MassiveThreads, OpenMP, Qthreads, and TBB; y-axis 0–3000, with labeled values 167, 73, 72, 138 and one bar at ≈ 7000]

    parent() {
      π0:
      spawn { γ: ... };
      π1:
    }

Summary:
▶ Cilk (Plus), known for its superb local creation performance, sacrifices work stealing performance
▶ TBB’s local creation overhead is equally good, but it is “parent-first” and its tasks are tied to a worker once started

Further research agenda (1)

▶ task runtimes for ever larger scale systems are vital
▶ ⇒ “locality-/cache-/hierarchy-/topology-/whatever-aware” schedulers are obviously important
▶ ⇒ many hierarchical/customizable scheduler proposals
▶ ⇒ yet, IMO, there are no clear demonstrations that they clearly outperform simple greedy work stealing over many workloads
▶ the question, it seems, ultimately comes down to this: when no tasks exist near you but some may exist far from you, do you steal one or not (stay idle)?

Further research agenda (2)

▶ quantify the gap between hand-optimized decomposition and automatic decomposition (by work stealing); e.g.
  ▶ space-filling decomposition vs. work stealing
  ▶ 2.5D matrix multiply vs. work stealing
▶ both experimentally and theoretically

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Two facets of task parallelism in distributed memory settings

▶ a means to hide latency, for which we merely need a local user-level thread library supporting suspend/resume at arbitrary points
▶ a means to globally balance loads, for which we need a system specifically designed to migrate tasks across address spaces

MassiveThreads/DM is a system supporting
▶ distributed load balancing and latency hiding
▶ + a global address space supporting migration and replication

Tasks to hide latencies

The goal:
▶ individual tasks look like ordinary blocking accesses (programmer-friendly):

    scan(global_array a) {
      for (i = 0; i < n; i++) {
        .. = .. a[i] ..;
      }
    }

▶ hide latencies by creating lots of tasks:

    scan(global_array a) {
      pfor (i = 0; i < n; i++) {
        .. = .. a[i] ..;
      }
    }

Ingredients for implementation:
▶ a local tasking layer with good context switch performance
▶ a message/RDMA layer with good multithreaded performance
(a sketch of how such a pfor could be built on tasks follows)
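The following is my own hedged sketch, not MassiveThreads/DM code, of how a pfor could be realized by divide-and-conquer task creation so that many iterations are outstanding at once and their remote accesses overlap; it assumes the pthread-like myth_create/myth_join API sketched earlier.

    #include <myth/myth.h>                  /* assumed header location */

    typedef struct {
      long lo, hi, grain;
      void (*body)(long i, void *env);
      void *env;
    } pfor_arg_t;

    static void *pfor_rec(void *p) {
      pfor_arg_t *a = (pfor_arg_t *)p;
      if (a->hi - a->lo <= a->grain) {
        for (long i = a->lo; i < a->hi; i++) a->body(i, a->env);  /* leaf */
      } else {
        long mid = a->lo + (a->hi - a->lo) / 2;
        pfor_arg_t left  = { a->lo, mid, a->grain, a->body, a->env };
        pfor_arg_t right = { mid, a->hi, a->grain, a->body, a->env };
        myth_thread_t t = myth_create(pfor_rec, &left);   /* spawn first half */
        pfor_rec(&right);                                 /* run second half */
        myth_join(t, NULL);                               /* join before returning */
      }
      return NULL;
    }

    void pfor(long lo, long hi, long grain,
              void (*body)(long, void *), void *env) {
      pfor_arg_t a = { lo, hi, grain, body, env };
      pfor_rec(&a);
    }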

Preliminary results

▶ context switch: we used MassiveThreads’s myth_yield function to switch context upon blocking
▶ message/RDMA: we rolled our own thread-safe communication layer (on MPI, on IB verbs, and on Fujitsu Tofu RMA), partly because Fujitsu MPI lacks multithreading support

    /* a[i] */
    T get(address) {
      issue non-blocking get(address);
      while (!the result available) {
        myth_yield();
      }
      return result;
    }

[Plot: gets/node/sec vs. number of tasks (1–10), for workers = 1 to 5; throughput grows with the number of tasks, up to ≈ 1.4 × 10^6 gets/node/sec]

Taxonomy

▶ library or frontend
▶ tasks suspendable or atomic
▶ synchronization patterns arbitrary or pre-defined
▶ tasks untied or tied

▶ the main issue: implementation complexity rises on distributed memory, especially for untied tasks

▶ that is, how do we move tasks across address spaces?

Instantiations

                                       library/frontend  suspendable tasks  untied tasks  sync topology  scale
Distributed Cilk [Blumofe et al. 96]   frontend          yes                yes           fork/join      16
Satin [Nieuwpoort et al. 01]           frontend          yes                no            fork/join      256
Tascell [Hiraishi et al. 09]           frontend          yes                yes           fork/join      128
Scioto [Dinan et al. 09]               library           no                 no            BoT            8192
HotSLAW [Min et al. 11]                library           yes                no            fork/join      256
X10/GLB [Zhang et al. 13]              library           no                 no            BoT            16384
Grappa [Nelson et al. 15]              library           yes                no            fork/join      4096
MassiveThreads/DM [Akiyama et al. 15]  library           yes                yes           fork/join      4096

MassiveThreads/DM

▶ a global (inter-node) work stealing library
▶ usable with ordinary C/C++ compilers
▶ supports fork/join with untied tasks
▶ ⇒ it moves native threads across nodes

Migrating native threads

▶ problem: the stack of a native thread contains pointers that point into the stack itself
▶ migrating the thread to an arbitrary address breaks these pointers
▶ ⇒ upon migration, copy the stack to the same virtual address on the destination node (iso-address [Antoniu et al. 1999])

[Figure: a stack at address @a; copying it to a different address @a' breaks internal pointers, while copying it to the same address @a keeps them valid (iso-address)]
(a sketch of the idea follows)
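A hedged sketch of the iso-address idea, written by me for illustration; it is not the MassiveThreads/DM implementation, and the base address and sizes are hypothetical.

    #include <string.h>
    #include <sys/mman.h>

    #define STACK_BASE ((void *)0x7f0000000000ULL)  /* hypothetical agreed-upon base */
    #define STACK_SIZE (1 << 20)                    /* 1 MiB per thread stack */

    /* every node reserves the SAME virtual address range for this stack,
       so a migrated stack image can be installed at an identical address */
    void *reserve_iso_stack(void) {
      void *p = mmap(STACK_BASE, STACK_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
      return (p == MAP_FAILED) ? NULL : p;
    }

    /* on the destination node: copy the received stack image to the same
       address; pointers into the stack remain meaningful */
    void install_migrated_stack(const void *image, size_t len) {
      memcpy(STACK_BASE, image, len);
    }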

Iso-address limits

▶ for each thread, all nodes must reserve its stack address range
▶ ⇒ a huge waste of virtual address space

Is consuming a huge amount of virtual memory really a problem?

▶ with high concurrency, it may indeed overflow the virtual address space:

    stack size × task depth × cores/node × nodes
    = 2^14 × 2^13 × 2^8 × 2^13 = 2^48 bytes (256 TiB)

▶ more importantly, such lavish use of virtual memory prohibits using RDMA for work stealing (as RDMA memory must be pinned)
▶ ⇒ we proposed the UniAddress scheme [Akiyama et al. 2015]

Further research agenda

▶ demonstrate global distributed load balancing with practical workloads that have lots of shared data
▶ “locality-/hierarchy-/. . . ”-awareness is even more important in this setting
▶ the latency-hiding opportunity adds an extra dimension
  ▶ steal or not, switch or not

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Analyzing task parallel programs

[Figure: the DAG of a task parallel execution with ~200 tasks (T0–T199), connected by create/wait task edges and mapped onto physical resources]

▶ task parallel systems are more “opaque” to users
  ▶ task management, load balancing, scheduling
▶ they show performance differences, and researchers want to precisely understand where they come from

DAG Recorder and DAGViz

▶ DAG Recorder runs a task parallel program and extracts its DAG, augmented with timestamps, CPUs, etc.
▶ DAGViz is its visualizer

    A() {
      for (i = 0; i < 2; i++) {
        mk_task_group;
        create_task(B());
        create_task(C());
        D();
        wait_tasks();
      }
    }
    D() {
      mk_task_group;
      create_task(E());
      F();
      wait_tasks();
    }

[Figure: the extracted DAG, with node kinds begin_section, create_task, wait_tasks, and endtask]

Why record the DAG?

▶ the DAG is a logical representation of the program execution, independent from the runtime system
  ▶ you can compare the DAGs produced by two systems side by side
▶ the DAG contains sufficient information to reconstruct many details
  ▶ work and critical path (excluding overhead)
  ▶ actual parallelism (running cores) along time
  ▶ available parallelism (ready tasks) along time
  ▶ how long each task was delayed by the scheduler
(a sketch of the work/critical-path computation follows)
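As an illustration of the first two items, here is a hedged sketch of how work and critical path (span) can be recomputed from the recorded timestamps; the node layout and field names are my assumptions, not DAG Recorder's actual format.

    /* work = total execution time over all nodes;
       span = the longest execution-time path through the DAG */
    typedef struct node {
      double start, end;        /* recorded timestamps of this node */
      int nsucc;
      struct node **succ;       /* successor nodes */
      double span_memo;         /* memoized longest path starting here */
      int visited;
    } node_t;

    double work(node_t *nodes, int n) {
      double w = 0;
      for (int i = 0; i < n; i++) w += nodes[i].end - nodes[i].start;
      return w;
    }

    double span_from(node_t *u) {
      if (u->visited) return u->span_memo;
      u->visited = 1;
      double best = 0;
      for (int i = 0; i < u->nsucc; i++) {
        double s = span_from(u->succ[i]);
        if (s > best) best = s;
      }
      return u->span_memo = (u->end - u->start) + best;
    }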

DAGViz Demo

Seeing is believing.

Challenge: reducing space requirement

▶ literally recording all subgraphs is prohibitive
▶ so we collapse “uninteresting” subgraphs into single nodes
▶ current criteria: we collapse a subgraph ⇐⇒
  1. its nodes are executed by a single worker, and
  2. its span is smaller than a (configurable) threshold

[Figure: the DAG of the previous example being progressively collapsed; node kinds begin_section, create_task, wait_tasks, endtask]
(a sketch of the collapsing rule follows)
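A hedged sketch of that collapsing rule as a recursive pass over the recorded DAG; the types and helpers are my own assumptions, not DAG Recorder's data structures.

    typedef struct dag_node {
      int worker;               /* worker that executed this node */
      double span;              /* span of the subgraph rooted here */
      int nchild, collapsed;
      struct dag_node **child;
    } dag_node_t;

    /* returns the worker id if the whole subgraph ran on one worker, else -1 */
    static int sole_worker(dag_node_t *n) {
      int w = n->worker;
      for (int i = 0; i < n->nchild; i++)
        if (sole_worker(n->child[i]) != w) return -1;
      return w;
    }

    /* collapse a subgraph iff (single worker) and (span < threshold) */
    void maybe_collapse(dag_node_t *n, double threshold) {
      if (sole_worker(n) != -1 && n->span < threshold) {
        n->collapsed = 1;       /* store this subgraph as a single node */
      } else {
        for (int i = 0; i < n->nchild; i++)
          maybe_collapse(n->child[i], threshold);
      }
    }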

Ongoing work

▶ hoping to use this tool to automate discovery of issues in runtime systems
  ▶ scheduler delays along a critical path
  ▶ work time inflation
▶ shed light on “steal or not” trade-offs

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Motivation

▶ task parallelism is a friend of divide-and-conquer algorithms
▶ divide-and-conquer makes coding “trivial,” by dividing until the problem becomes trivial
  ▶ matrix multiply, matrix factorization, triangular solve, FFT, sorting, . . .
▶ in reality, the programmer has to optimize the leaves manually
▶ why? because we lack good compilers

The power of divide-and-conquer

    /* C += AB */
    mm(A, B, C) {
      if (|A| = 1 && |B| = 1 && |C| = 1) {
        C00 += A00 · B00;
      } else {
        ...
      }
    }

    /* quick sort */
    quicksort(a, p, q) {
      if (q − p < 2) {
        return;
      } else {
        ...
      }
    }

    /* Cholesky factorization */
    chol(A) {
      if (n = 1) {
        return (√a11);
      } else {
        ...
      }
    }

    /* FFT */
    fft(n, x) {
      if (n = 1) {
        return x0;
      } else {
        ...
      }
    }

    /* triangular solve LX = B */
    trsm(L, B) {
      if (M = 1) {
        B /= l11;
      } else {
        ...
      }
    }

They all admit a “trivial” base case, if only its performance were acceptable . . .

Static optimizations and vectorization of tasks

▶ goal: run straightforward task-based programs as fast as manually optimized programs
▶ write once, parallelize everywhere (nodes, cores, and vectors)

[Figure: a task tree whose leaf subtrees are serialized and vectorized]

What does our compiler do?

1. static cut-off statically eliminates task creations
2. code-bloat-free inlining inline-expands recursions
3. loopification transforms recursions into flat loops (and then vectorizes them if possible)

Static cut-off

    f(a, b, ···) {
      if (E) {
        L(a, b, ···)
      } else {
        ···
        spawn f(a1, b1, ···);
        ···
        spawn f(a2, b2, ···);
        ···
      }
    }

    ⇒

    fseq(a, b, ···) {
      if (E) {
        L(a, b, ···)
      } else {
        ···
        fseq(a1, b1, ···);
        ···
        fseq(a2, b2, ···);
        ···
      }
    }

key: determine a condition Hk under which the height of the recursion above the leaves is ≤ k:
▶ H0 = E
▶ Hk+1 = E, or all child arguments (ai, bi, ···) satisfy Hk
when this succeeds, generate code that statically eliminates all task creations
(a worked example follows)
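As a concrete worked example of Hk (my own illustration, not taken from the paper): for fib, E is (n < 2), so H0 = (n < 2), H1 = (n < 3), and in general Hk = (n < k + 2). A sketch of the code the cut-off aims to produce:

    #define K 10                      /* cut-off height (chosen at compile time) */

    long fib_seq(long n) {            /* serialized code: no task creations */
      return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
    }

    long fib(long n) {
      if (n < K + 2)                  /* H_K holds: this whole subtree is serial */
        return fib_seq(n);
      /* otherwise, the task-parallel version: conceptually spawn fib(n-1)
         as a task, compute fib(n-2), then join */
      long a = fib(n - 1);
      long b = fib(n - 2);
      return a + b;
    }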

Code-bloat-free inlining

▶ under condition Hk, inline-expanding all recursions k times would eliminate all function calls
▶ but this would result in an exponential code bloat when the function has multiple recursive calls
▶ code-bloat-free inlining fuses multiple recursive calls into a single call site:

    ···
    f(a1, b1, ···);
    ···
    f(a2, b2, ···);
    ···

    ⇒

    for (i = 0; i < 2; i++) {
      switch (i) {
      case 0: ···
      case 1: ···
      }
      f(ai, bi, ···);
    }

Loopification

    fseq(a, b, ···) {
      if (E) {
        L(a, b, ···)
      } else {
        ···
        fseq(a1, b1, ···);
        ···
        fseq(a2, b2, ···);
        ···
      }
    }

    ⇒

    for i ∈ P {
      L(xi, yi, ···)
    }

▶ instead of code-bloat-free inlining, loopification attempts to generate a flat (or shallow) loop directly from the recursive code
▶ it tries to synthesize hypotheses that the original code is an affine loop of leaf blocks
▶ the loopified code may then be vectorized
(a hypothetical before/after example follows)
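For concreteness, here is a hypothetical example (mine, not from the paper) of what loopification aims to achieve for a recursive vector add; vecadd appears among the benchmarks below, but this particular code is an assumption.

    /* divide-and-conquer vecadd whose leaves are single-element additions;
       in the task-parallel original, the first recursive call would be spawned */
    void vecadd(float *x, float *y, long p, long q) {   /* y[p..q) += x[p..q) */
      if (q - p == 1) {
        y[p] += x[p];                                   /* leaf block L */
      } else {
        long r = (p + q) / 2;
        vecadd(x, y, p, r);
        vecadd(x, y, r, q);
      }
    }

    /* what loopification aims to produce: a flat, vectorizable loop of leaves */
    void vecadd_loopified(float *x, float *y, long p, long q) {
      for (long i = p; i < q; i++)
        y[i] += x[i];
    }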

Results: effect of optimizations

[Bar chart: relative performance of the base, dynamic, static, cef, loop, and proposed variants on fib, nqueens, fft, sort, nbody, strassen, vecadd, heat2d, heat3d, gaussian, matmul, trimul, treeadd, treesum, and uts; y-axis up to 16, with some bars clipped at 27.12, 17.56, 17.65, 220.14, and 109.72]

Results: remaining gap to hand-optimized code

[Bar chart: relative performance (task = 1) of task, omp, omp optimized, and polly on nbody, vecadd, heat2d, heat3d, gaussian, matmul, and trimul, plus the average and geomean]

Agenda

Intra-node Task Parallelism

Task Parallelism in Distributed Memory

Need Good Performance Analysis Tools

Compiler Optimizations and Vectorization

Concluding Remarks

Future outlook of task parallelism

▶ the goal: offer both programmability and performance
▶ there is a long way to go toward acceptable performance on distributed memory machines. why?
  ▶ dynamic load balancing → random traffic
  ▶ global address space → fine-grain communication
▶ these are OK in shared memory today. why not on distributed memory (at least for now)?
  ▶ checking errors and completion everywhere
  ▶ doing mutual exclusion everywhere
  ▶ no analog of hardware prefetching
  ▶ or a lack of bandwidth to tolerate random traffic and aggressive prefetching

Thank you for listening
