Scheduling Multithreaded Computations by Work Stealing

Robert D. Blumofe, The University of Texas at Austin
Charles E. Leiserson, MIT Laboratory for Computer Science

(This research was supported in part by the Advanced Research Projects Agency under Contract N…. This research was done while Robert D. Blumofe was at the MIT Laboratory for Computer Science and was supported in part by an ARPA High-Performance Computing Graduate Fellowship.)

Abstract

This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies. Specifically, our analysis shows that the expected time to execute a fully strict computation on P processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

Introduction

For efficient execution of a dynamically growing multithreaded computation on a MIMD-style parallel computer, a scheduling algorithm must ensure that enough threads are active concurrently to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements are not unduly large. Moreover, the scheduler should also try to maintain related threads on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

Two scheduling paradigms have arisen to address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to "steal" threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since when all processors have work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler.
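To make the contrast concrete, the following is a minimal sketch of a work-stealing worker loop. It is an illustration under simplifying assumptions (one lock per deque, tasks as plain Python callables, busy-waiting thieves), not the scheduler analyzed in this paper; production schedulers use lock-free deques such as Chase–Lev.

```python
import random
import threading
from collections import deque

class Worker:
    """One processor's scheduler state: a double-ended queue of ready tasks."""
    def __init__(self, wid):
        self.id = wid
        self.deque = deque()          # right end = bottom, left end = top
        self.lock = threading.Lock()  # serializes owner and thief access

    def push(self, task):
        with self.lock:
            self.deque.append(task)   # owner pushes at the bottom

    def pop(self):
        with self.lock:
            return self.deque.pop() if self.deque else None       # owner pops the bottom

    def steal(self):
        with self.lock:
            return self.deque.popleft() if self.deque else None   # thief takes the top

def worker_loop(me, workers, done):
    while not done.is_set():
        task = me.pop()
        if task is None:                      # local deque empty: become a thief
            victim = random.choice(workers)   # pick a victim uniformly at random
            task = victim.steal() if victim is not me else None
        if task is not None:
            task(me)                          # a task may spawn work via me.push(...)

# Example: run 4 workers on a chain of spawned tasks.
workers = [Worker(i) for i in range(4)]
done = threading.Event()

def count_down(n):
    def task(w):
        if n > 0:
            w.push(count_down(n - 1))  # "spawn" a child task onto the local deque
        else:
            done.set()
    return task

workers[0].push(count_down(1000))
threads = [threading.Thread(target=worker_loop, args=(w, workers, done)) for w in workers]
for t in threads: t.start()
for t in threads: t.join()
```

Note where migration happens: a work-sharing scheduler would eagerly hand newly spawned tasks to other workers inside `push`, whereas here a task moves between workers only when a thief finds its own deque empty.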
The work-stealing idea dates back at least as far as Burton and Sleep's research on parallel execution of functional programs and Halstead's implementation of Multilisp. These authors point out the heuristic benefits of work stealing with regard to space and communication. Since then, many researchers have implemented variants on this strategy. Rudolph, Slivkin-Allalouf, and Upfal analyzed a randomized work-stealing strategy for load balancing independent jobs on a parallel computer, and Karp and Zhang analyzed a randomized work-stealing strategy for parallel backtrack search. Recently, Zhang and Ortynski have obtained good bounds on the communication requirements of this algorithm.

In this paper, we present and analyze a work-stealing algorithm for scheduling fully strict (well-structured) multithreaded computations. This class of computations encompasses both backtrack-search computations and divide-and-conquer computations, as well as dataflow computations in which threads may stall due to a data dependency. We analyze our algorithms in a stringent atomic-access model, similar to atomic message-passing models, in which concurrent accesses to the same data structure are serially queued by an adversary.

Our main contribution is a randomized work-stealing scheduling algorithm for fully strict multithreaded computations which is provably efficient in terms of time, space, and communication. We prove that the expected time to execute a fully strict computation on P processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. In addition, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. These bounds are better than previous bounds for work-sharing schedulers, and the work-stealing scheduler is much simpler and eminently practical. Part of this improvement is due to our focusing on fully strict computations, as compared to the general strict computations studied in earlier work. We also prove that the expected total communication of the execution is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This bound is existentially tight to within a constant factor, meeting the lower bound of Wu and Kung for communication in parallel divide-and-conquer. In contrast, work-sharing schedulers have nearly worst-case behavior for communication. Thus, our results bolster the folk wisdom that work stealing is superior to work sharing.
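One consequence of the time bound is worth spelling out: it gives linear speedup whenever the available parallelism $T_1/T_\infty$ is at least proportional to the number of processors. A short derivation (a standard reading of such bounds, not text quoted from the paper):

```latex
\[
  T_P \;=\; \frac{T_1}{P} + O(T_\infty).
\]
% If the parallelism dominates the processor count, i.e. T_1/T_\infty >= P,
% then T_\infty <= T_1/P, and the additive O(T_\infty) term is absorbed:
\[
  T_P \;=\; \frac{T_1}{P} + O\!\left(\frac{T_1}{P}\right)
      \;=\; O\!\left(\frac{T_1}{P}\right),
  \qquad\text{so the speedup } \frac{T_1}{T_P} \;=\; \Omega(P).
\]
```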
Others have studied, and continue to study, the problem of efficiently managing the space requirements of parallel computations. Culler and Arvind, and Ruggiero and Sargeant, give heuristics for limiting the space required by dataflow programs. Burton shows how to limit space in certain parallel computations without causing deadlock. More recently, Burton has developed and analyzed a scheduling algorithm with provably good time and space bounds. Blelloch, Gibbons, Matias, and Narlikar have also recently developed and analyzed scheduling algorithms with provably good time and space bounds. It is not yet clear whether any of these algorithms are as practical as work stealing.

The remainder of this paper is organized as follows. We first review the graph-theoretic model of multithreaded computations, introduced in earlier work, which provides a theoretical basis for analyzing schedulers. We then give a simple scheduling algorithm which uses a central queue; this "busy-leaves" algorithm forms the basis for our randomized work-stealing algorithm, which we present next. We go on to introduce the atomic-access model that we use to analyze execution time and communication costs for the work-stealing algorithm, and we present and analyze a combinatorial "balls and bins" game that we use to derive a bound on the contention that arises in random work stealing. We then use this bound, along with a delay-sequence argument, to analyze the execution time and communication cost of the work-stealing algorithm. To conclude, we briefly discuss how the theoretical ideas in this paper have been applied to the Cilk programming language and runtime system, and we make some concluding remarks.

A model of multithreaded computation

This section reprises the graph-theoretic model of multithreaded computation introduced in earlier work. We also define what it means for computations to be "fully strict." We conclude with a statement of the "greedy-scheduling theorem," which is an adaptation of theorems by Brent and Graham on dag scheduling.

A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-time instructions. The instructions are connected by dependency edges, which provide a partial ordering on which instructions must execute before which other instructions. In Figure 1, for example, each shaded block is a thread, with circles representing instructions and the horizontal edges, called continue edges, representing the sequential ordering. Thread Γ5 of this example contains three instructions: v10, v11, and v12. The instructions of a thread must execute in this sequential order, from the first (leftmost) instruction to the last (rightmost) instruction. In order to execute a thread, we allocate for it a chunk of memory, called an activation frame, that the instructions of the thread can use to store the values on which they compute.

A P-processor execution schedule for a multithreaded computation determines which processors of a P-processor parallel computer execute which instructions at each step. An execution schedule depends on the particular multithreaded computation and the number P of processors. In any given step of an execution schedule, each processor executes at most one instruction.

During the course of its execution, a thread may create, or spawn, other threads. Spawning a thread is like a subroutine call, except that the spawning thread can operate concurrently with the spawned thread. We consider spawned threads to be children of the thread that did the spawning, and a thread may spawn as many children as it desires. In this way, the threads of a computation are organized into a spawn tree.

[Figure 1: A multithreaded computation, showing threads Γ1–Γ6 and their instructions v1–v23.]
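With unit-time instructions, the quantities in the bounds above become purely graph-theoretic: $T_1$ is the number of instructions in the dag, $T_\infty$ is the number of instructions on a longest directed path, and the greedy-scheduling theorem bounds any greedy P-processor schedule by $T_1/P + T_\infty$. As a concrete illustration, here is a small sketch (an illustrative edge-list representation of a computation dag, not code from the paper) that computes both quantities:

```python
from collections import defaultdict

def work_and_span(num_instructions, edges):
    """Compute T_1 (work) and T_inf (span) of a computation dag.

    `edges` is a list of (u, v) pairs meaning instruction u must execute
    before instruction v; continue, spawn, and data-dependency edges are
    all treated alike. Since instructions are unit-time, T_1 is just the
    instruction count and T_inf is the length (in instructions) of a
    critical path, found by a topological-order sweep.
    """
    succs = defaultdict(list)
    indeg = [0] * num_instructions
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1

    # Kahn's algorithm; depth[v] = instructions on the longest path ending at v.
    ready = [v for v in range(num_instructions) if indeg[v] == 0]
    depth = [1] * num_instructions
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in succs[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    assert visited == num_instructions, "dependency edges contain a cycle"
    return num_instructions, max(depth)

# A toy computation: a parent thread 0 -> 1 -> 2 -> 5 whose instruction 1
# spawns a child 3 -> 4; the child's last instruction joins back before 5.
t1, tinf = work_and_span(6, [(0, 1), (1, 2), (2, 5), (1, 3), (3, 4), (4, 5)])
print(t1, tinf)  # 6 instructions of work, critical path of length 5
```

On this toy dag the parallelism $T_1/T_\infty = 6/5$ is tiny, so the greedy bound $T_1/P + T_\infty$ predicts almost no speedup; divide-and-conquer computations, by contrast, typically have parallelism that grows with the input size.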