A Quest for Unified, Global View Parallel Programming Models for Our Future

Kenjiro Taura
University of Tokyo

Acknowledgements
- Jun Nakashima (MassiveThreads)
- Shigeki Akiyama, Wataru Endo (MassiveThreads/DM)
- An Huynh (DAGViz)
- Shintaro Iwasaki (Vectorization)

What is task parallelism?
- like most CS terms, the definition is vague
  - I don't consider the contraposition "data parallelism vs. task parallelism" useful
  - imagine lots of tasks each working on a piece of data
  - is it data parallel or task parallel?
- let's instead ask:
  - what's useful from the programmer's viewpoint
  - what are useful distinctions to make from the implementer's viewpoint

What is task parallelism?
A system supports task parallelism when:
1. a logical unit of concurrency (that is, a task) can be created dynamically, at an arbitrary point of execution,
2. and cheaply;
3. and they are automatically mapped on hardware parallelism (cores, nodes, ...)
4. and cheaply context-switched
[Figure: tasks creating further tasks ("create task")]

What are they good for?
- generality: "creating tasks at arbitrary points" unifies many superficially different patterns
  - parallel nested loop, parallel recursions
  - they trivially compose
- programmability: cheap task creation + automatic load balancing allow straightforward, processor-oblivious decomposition of the work (divide-and-conquer-until-trivial)
- performance: dynamic scheduling is a basis for hiding latencies and tolerating noise

Our goal
- programmers use tasks (+ higher-level syntax on top) as the unified means to express parallelism
- the system maps tasks to hardware parallelism
  - cores within a node
  - nodes
  - SIMD lanes within a core!

Rest of the talk
- Intra-node Task Parallelism
- Task Parallelism in Distributed Memory
- Need Good Performance Analysis Tools
- Compiler Optimizations and Vectorization
- Concluding Remarks

Taxonomy
- library or frontend: implemented with ordinary C/C++ compilers or does it heavily rely on a tailored frontend?
- tasks suspendable or atomic: can tasks suspend/resume in the middle or do tasks always run to completion?
- synchronization patterns arbitrary or pre-defined: can tasks synchronize in an arbitrary topology or only in pre-defined synchronization patterns (e.g., bag-of-tasks, fork/join)?
- tasks untied or tied: can tasks migrate after they started?
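The "divide-and-conquer-until-trivial" decomposition mentioned under "What are they good for?" can be sketched in standard C++ roughly as follows. This is only an illustration of the shape of the code, not the talk's implementation: `task_sum` and its `threshold` parameter are hypothetical names, and `std::async` is used as a stand-in for a task runtime. With `std::launch::async` each spawn is typically backed by an OS thread, so it does not have the cheap creation and context switching that the definition above asks for.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums a[lo, hi) by recursive splitting.
// Assumes threshold >= 1, otherwise the recursion would not terminate.
long long task_sum(const std::vector<int>& a, std::size_t lo, std::size_t hi,
                   std::size_t threshold) {
    assert(threshold >= 1);
    if (hi - lo <= threshold) {
        // Trivial case: small enough to run sequentially.
        auto first = a.begin() + static_cast<std::ptrdiff_t>(lo);
        auto last = a.begin() + static_cast<std::ptrdiff_t>(hi);
        return std::accumulate(first, last, 0LL);
    }
    std::size_t mid = lo + (hi - lo) / 2;
    // "create task": the left half may run concurrently with the right half.
    auto left = std::async(std::launch::async, [&a, lo, mid, threshold] {
        return task_sum(a, lo, mid, threshold);
    });
    // The creating task continues with the right half itself.
    long long right = task_sum(a, mid, hi, threshold);
    // Join: wait for the spawned half and combine the results.
    return left.get() + right;
}
```

The code never mentions the number of cores; the only tuning knob is the cutoff below which work is treated as trivial. A call such as `task_sum(v, 0, v.size(), 10000)` leaves the placement of the pieces to whatever runs the spawned work.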
Instantiations

  system           library/frontend   suspendable tasks   untied tasks   sync topology
  OpenMP tasks     frontend           yes                 yes            fork/join
  TBB              library            yes                 no             fork/join
  Cilk             frontend           yes                 yes            fork/join
  Quark            library            no                  no             arbitrary
  Nanos++          library            yes                 yes            arbitrary
  Qthreads         library            yes                 yes            arbitrary
  Argobots         library            yes                 yes?           arbitrary
  MassiveThreads   library            yes                 yes            arbitrary

MassiveThreads
- https://github.com/massivethreads/massivethreads
- design philosophy: user-level threads (ULT) in an ordinary thread API as you know it
  - tid = myth_create(f, arg)
  - tid = myth_join(arg)
  - myth_yield to switch among threads (useful for latency hiding)
  - mutex and condition variables to build arbitrary synchronization patterns
- efficient work stealing scheduler (locally LIFO and child-first; steal oldest task first)
- an (experimental) customizable work stealing [Nakashima and Taura; ROSS 2013]

User-facing APIs on MassiveThreads
- TBB's task_group and parallel_for (but with untied work stealing scheduler)
- Chapel tasks on top of MassiveThreads (currently broken orz)
- SML# (Ueno @ Tohoku University) ongoing
- Tapas (Fukuda @ RIKEN), a domain specific language for particle simulation

A TBB interface on MassiveThreads:

    quicksort(a, p, q) {
      if (q - p < th) {
        ...
      } else {
        mtbb::task_group tg;
        r = partition(a, p, q);
        tg.run([=]{ quicksort(a, p, r-1); });
        quicksort(a, r, q);
        tg.wait();
      }
    }

Important performance metrics
- low local creation/sync overhead
- low local context switches
- reasonably low load balancing (migration) overhead
- somewhat sequential scheduling order

    parent() {
      π0:
      spawn { γ: ... };
      π1:
    }

[Figure: π0 is followed by the spawned child γ and by the parent's continuation π1]

  op               measure what   time (cycles)
  local create     π0 → γ         ≈ 140
  work steal       π0 → π1        ≈ 900
  context switch   myth_yield     ≈ 80

(Haswell i7-4500U (1.80GHz), GCC 4.9)

Comparison to other systems
[Figure: bar chart of child and parent overheads in clocks for Cilk, CilkPlus, MassiveThreads, OpenMP, Qthreads and TBB on the same parent()/spawn code; y-axis 0 to 3000, with one bar off the scale at ≈ 7000; bar labels 167, 73, 72 and 138]

Summary:
- Cilk(Plus), known for its superb local creation performance, sacrifices work stealing performance
- TBB's local creation overhead is equally good, but it is "parent-first" and tasks are tied to a worker once started

Further research agenda (1)
- task runtimes for ever larger scale systems are vital
  - ⇒ "locality-/cache-/hierarchy-/topology-/whatever-aware" schedulers obviously important
  - ⇒ hierarchical/customizable schedulers proposals
  - ⇒ yet, IMO, there are no clear demonstrations that clearly outperform simple greedy work stealing over many workloads
- the question, it seems, ultimately comes to this: when no tasks exist near you but some may exist far from you, steal it or not (stay idle)?
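The queue discipline named on the MassiveThreads slide (locally LIFO, steal oldest task first) can be sketched as a double-ended queue per worker. The sketch below makes several assumptions: `WorkQueue`, `push`, `pop_local` and `steal` are hypothetical names, a mutex-protected `std::deque` is used for simplicity rather than whatever data structure MassiveThreads actually uses, and the child-first policy (running the child immediately and leaving the parent's continuation to be stolen) is not modelled.

```cpp
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical, simplified per-worker task queue (illustrative names only).
template <class Task>
class WorkQueue {
public:
    // Owner side: a newly created task goes to the back of the queue.
    void push(Task t) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push_back(std::move(t));
    }

    // Owner side: take the newest task (locally LIFO).
    bool pop_local(Task& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        out = std::move(q_.back());
        q_.pop_back();
        return true;
    }

    // Thief side: take the oldest task, from the opposite end.
    bool steal(Task& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop_front();
        return true;
    }

private:
    std::mutex m_;
    std::deque<Task> q_;
};
```

Taking from opposite ends keeps the owner working on the most recently created tasks, while in a divide-and-conquer program the oldest entry tends to be the largest remaining piece of work, so one steal can move a lot of work. The open question above, whether an idle worker should steal from a distant victim or stay idle, would sit in the victim-selection logic around `steal`, which this sketch leaves out.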
