Proceedings of the 50th Hawaii International Conference on System Sciences | 2017

A Comparison of Task Parallel Frameworks based on Implicit Dependencies in Multi-core Environments

Basilio B. Fraguela, Universidade da Coruña, A Coruña. Email: [email protected]

Abstract—The larger flexibility that task parallelism offers with respect to data parallelism comes at the cost of a higher complexity due to the variety of tasks and the arbitrary patterns of dependences that they can exhibit. These dependencies should be expressed not only correctly, but optimally, i.e. avoiding over-constraints, in order to obtain the maximum performance from the underlying hardware. There have been many proposals to facilitate this non-trivial task, particularly within the scope of nowadays ubiquitous multi-core architectures. A very interesting family of solutions, because of their large scope of application, ease of use and potential performance, are those in which the user declares the dependences of each task, and lets the parallel programming framework figure out which are the concrete dependences that appear at runtime and schedule the parallel tasks accordingly. Nevertheless, as far as we know, there are no comparative studies of them that help users identify their relative advantages. In this paper we describe and evaluate four tools of this class, discussing the strengths and weaknesses we have found in their use.

Keywords—programmability; task parallelism; dependencies; programming models

I. INTRODUCTION

Many applications require the exploitation of task parallelism to benefit from all the parallelism they can expose, and sometimes, just to be parallelized at all. With the need to express different parallel tasks comes the requirement to properly schedule and synchronize them according to the arbitrary patterns of dependences that they can exhibit. There is also, of course, the option to resort to speculation under Transactional Memory [20] or Thread Level Speculation [28]. However, speculation typically has non-negligible costs, and a wide range of applications can in fact be successfully parallelized by properly ordering the execution of their tasks so that their dependencies are fulfilled. While this has been done using low-level approaches since the early days of parallel computing, the growing need to parallelize every kind of application together with the large availability of parallel systems, particularly since the appearance of multi-core processors, led to the proposal of high-level approaches that facilitate this task.

Some of the most advanced programming tools in this category are those in which the users just declare, as implicitly as possible, the dependencies of their tasks, but without specifying how they must be met, letting instead the underlying compiler and/or runtime automatically manage them. The fact that these tools require minimum effort from the developers, while they allow building extremely complex task graphs and provide maximum parallelism thanks to the graph scheduling algorithms they implement, makes them particularly interesting.

Despite the relevance of this approach, we have not found comparative studies of the existing practical alternatives of this kind beyond some performance comparisons [36]. For this reason this paper tackles this issue, describing and comparing some of the high-level approaches available nowadays with a particular focus on their semantics and ease of use, our main aim being to help identify the best tool for each problem at hand. Also, while there are proposals that extend these ideas to clusters and hybrid systems, this study is restricted to the widely available multi-core systems that are ubiquitous nowadays, so that the problems related to heterogeneity and distributed memory are not considered. This way, we set some basic criteria to choose the programming environments for our comparison:

• Allow users to implicitly declare the task dependencies, automatically taking care of the concrete implied synchronizations and scheduling.
• Provide a high-level API (i.e. oriented to final programmers), and be both publicly available and well documented.
• Be usable in some programming language among the most widely used in the literature/research on parallel applications, mainly in the fields of scientific and engineering computing. Besides, for fairness the comparison should use the same language for all the tools. This implicitly discards proposals based on new languages, which, besides being more complex to compare fairly, tend to have lower rates of adoption than compiler directives and libraries, as these latter alternatives better facilitate the reuse of existing codes.
• Do not require the usage of concepts or tools related to distributed computing, which would distort the purely shared-memory based comparison.

Based on these criteria, four tools were chosen. A popular strategy for the declaration of task dependences is the annotation of the inputs and the outputs of each task together with a valid order of execution, implicitly given by the sequential execution of the tasks.

URI: http://hdl.handle.net/10125/41914 ISBN: 978-0-9981331-0-2 CC-BY-NC-ND

Since the inclusion of dependencies for tasks in OpenMP 4.0 [26], this standard is the most widespread option that supports this paradigm, and thus the first we study. A very active and related project, which in fact pushed for the adoption of dependent tasks in OpenMP [11], is OmpSs, which integrates features from the StarSs family of programming models [27]. While OpenMP and OmpSs cover well the space of compiler directives in our scope of interest, the area of libraries is much more sparse. Because our focus here is on the semantics and the programming style, we will discuss the two libraries that fit our criteria and that we find to be more original in their approach. Both of them are based on C++, which is not surprising given the excellent properties of this language to enable high performance coupled with ease of use. In fact, the first library, DepSpawn [16], heavily relies on the properties of this language to minimize the effort of the programmer. The second library is Intel® CnC (Concurrent Collections) [6], [8], whose main interest lies in its very original approach for the specification of the dependences.

The rest of this paper is organized as follows. Section II describes the main characteristics of the frameworks analyzed. They are then compared in terms of performance and programmability in Section III. This is followed by a discussion on related work in Section IV, and the last section is devoted to our conclusions.

II. FRAMEWORKS

The frameworks analyzed are now described in turn. In all the cases we only center on the creation of dependent tasks, skipping other possible functionalities, even if they are needed to exploit this paradigm. A clear example are explicit synchronizations, which are required even if only to make sure that all the parallel tasks have completed before leaving the program. Also, since the libraries tested are based on C++, making it the natural language for our comparison, the specific problems found in the use of the compiler directives in this language will be described.

A. OpenMP

The well-known OpenMP standard extended in [26] the task construct introduced in version 3.0 [2] with support for task dependences by means of the depend clause. The clause allows defining lists of data items that are only inputs, only outputs, or both inputs and outputs using the notation depend(dependence-type : list), where dependence-type is in, out or inout and the data items can be variables or array sections. The annotated task will be scheduled for execution only when the dependences expressed by those data items are satisfied with respect to preceding tasks in the same task region, i.e., with the same parent.

Two of the limitations of this approach have been discussed in [36]. For example, this mechanism alone does not allow using dependent tasks for reduction processes, forcing to use other OpenMP functionalities such as atomic operations or critical sections, or to resort to other high-level constructs such as section or for. The other issue is related to the specification of multi-dimensional array regions when instead of a static or variable-length array we have a pointer, although this can be solved by casting the pointer.

We find other limitations much more relevant. For example, the standard requires the dependent items to have either identical or disjoint storage. This means that if array sections appear in the lists of dependences, they must be either identical or disjoint. Similarly, the standard explicitly states that a variable that is part of another variable (such as a field of a structure) but is not an array element or an array section cannot appear in a depend clause, which again points to the unavailability of tests on partial overlaps of objects. In fact OpenMP does not allow data members in its clauses, which restricts the usefulness of this framework in programs with composite data types other than arrays.

A problem of OpenMP that is specific to C++ is its treatment of references. While references are implemented under the hood by means of pointers, this is never exposed to the programmers, who just see and use them as aliases to existing data items. OpenMP, however, considers references as what they actually are for the compiler (pointers) and in fact prohibits reference types for private, firstprivate and threadprivate variables, because the compiler would privatize the reference, not the object accessed through it, giving place to a wrong behavior. For the same reasons, references, which are heavily used in C++ applications both for performance and programmability reasons, are not suited to express dependences for OpenMP tasks.

B. OmpSs

Dependent tasks are the core of the OmpSs programming model [27], [3]. Task creation follows the OpenMP syntax with a slight difference in the specification of dependencies. Namely, they are provided using three different clauses in(list), out(list) and inout(list), which express input, output and both input and output dependences on the memory positions provided in the list, respectively. The dependences are enforced with respect to preceding tasks built in the same task or outside any task code, as OmpSs programs assume that their execution begins with a master thread that can spawn tasks to other threads at any point. A more interesting difference is that OmpSs allows annotating a function declaration or definition with a task pragma that specifies its dependencies on its formal parameters. This automatically turns every invocation of the function into a dependent task, the dependency on each actual argument being the one specified for its associated formal parameter. Contrary to OpenMP, OmpSs tasks support the reduction clause and their lists of dependencies allow specifying data members as well as expressions that provide data items (e.g. *ptr or b.mx[i][j]).

Another advantage is that OmpSs allows expressing dependencies on C++ references with the intuitive and natural semantics that the dependence is on the object aliased by the reference. A restriction in common with OpenMP is, though, that both tools only correctly support dependences on array sections, and in general data items, that completely overlap: although overlapped memory region detection has been implemented for clusters in [5], the currently available compiler (15.06) and public specification [3] do not provide it. This way, the support of data members in its clauses is dangerous, as tasks that operate on a member of an object can run in parallel with other tasks that modify the whole object, including that member.

Although this is not a fundamental but a technological issue, it is worth mentioning that relevant limitations of OmpSs with respect to the other approaches are that the current version only works in Unix and that it does not support the increasingly adopted C++11 and C++14 standards. This is unfortunate, as these new standards largely help in the development of every kind of application, and in particular scientific and parallel applications, thanks to a large number of critical improvements such as the definition of a multithreading memory model, rvalue and move semantics (which avoid costly copying processes), lambda functions, etc.

C. DepSpawn

C++ is one of the languages more heavily used for parallel computing given the high performance it can provide, its flexibility, the large number of existing libraries and the rich support of different programming paradigms it provides. The DepSpawn library [16] takes advantage of these properties, particularly of template metaprogramming, to simplify to the maximum the creation of dependent tasks. This proposal requires the parallel tasks to be function calls and it infers their dependencies from the types of the formal parameters of the functions, so that no user intervention is required. Namely, arguments passed by value are inherently only inputs, as the function cannot modify the original argument provided. Similarly, arguments passed by reference are both inputs and outputs, because the function can both read and modify the argument provided by the caller. An exception are arguments passed by constant reference (const&), which indicate that although the function can access the argument provided by the caller, it will not modify it, which makes the argument read-only, and thus only an input to the function. Users only need a keyword, spawn, to generate tasks. Namely, spawn(f, a, b, ...) generates a parallel task for the execution of f(a, b, ...) that respects the dependencies on the arguments provided according to the nature of their corresponding formal parameters in f.

DepSpawn supports dependencies on reference arguments following the same semantics as OmpSs, that is, establishing the dependence on the object aliased by the reference. Also, because it tracks the actual arguments provided by the caller, the task arguments can be arbitrarily complex C++ expressions. Contrary to the compiler directives described before, DepSpawn tasks respect dependencies with respect to any task spawned before them, no matter whether they come from the same parent or not. Because the order of execution with respect to subtasks from other parents is unknown, what this means in practice is that tasks satisfy the data dependencies with respect to preceding tasks of the same parent, as we would expect in any environment of this kind, and they are mutually exclusive with other parallel tasks when at least one of them writes in a common memory position. This allows using dependent tasks for reduction processes and critical sections. Another important difference is that DepSpawn detects partial overlaps on the memory regions of the task arguments. In the case of generic subarrays that do not occupy a consecutive memory region, or whose size cannot be inferred from their data type (e.g. a pointer to an array), the library requires the arrays to be handled by an Array class it provides that is derived from the Blitz++ library [35]. With this mechanism, it supports dependent tasks on arbitrary (sub)arrays of up to 10 dimensions. Finally, this library relies on the Intel TBB [29] for its internal threading system and requires a compiler that supports C++11, being thus portable to the vast majority of parallel computers and main operating systems.

D. Intel® Concurrent Collections (CnC) for C++

The CnC approach to building general task-based applications following a declarative model for the dependencies is very different from the previous ones [6], [8]. Instead of annotating or learning the inputs and outputs of each task, and their correct ordering in a sequential execution, CnC builds a graph of tasks, called steps, that communicate with each other by means of data items and control tags. These three kinds of elements are organized in sets, called collections, in which each individual element is identified by means of a unique tag, which is a tuple of tag components, typically integers. In the case of control tags, they are their own identifier. The steps produce and/or consume data items by pushing and/or retrieving them from their collections, respectively, always identifying them by their associated tags. The role of tag collections is to prescribe the execution of steps. This means that they specify whether a given step instance, which is an individual concrete execution of a step/task identified by a control tag, can proceed or not, but not when this will be done. Each tag collection can prescribe one or several step collections. The prescription is made by pushing into the tag collection the tag associated to that/those step(s). All the operations on collections take place during the execution of the steps under the control of the programmer, and the runtime executes a step instance, at some point, after it has been prescribed.

[Figure: dataflow graph of the CnC SAXPY example. Tag collections T1 and T2 hold tags {1, ..., n}; item collections X {1→x1, ..., n→xn}, Y {1→y1, ..., n→yn}, XA {1→xa1, ..., n→xan} and Z {1→z1, ..., n→zn}. Step S1 with tag t: x = X.get(t); xa = alpha * x; XA.put(t, xa); T2.put(t). Step S2 with tag t: xa = XA.get(t); y = Y.get(t); z = xa + y; Z.put(t, z).]

Figure 1. CnC SAXPY example

Figure 1 shows schematically a possible SAXPY (yi = yi + αxi) CnC implementation, where the shadowed collections are initialized by the environment while the other ones are filled by the algorithm, and the result is stored in a different collection Z to simplify the example. All the tags are integers and a → b means that tag a is associated to item b. Each tag t in T1 triggers the execution of a step S1 with that tag, which computes αxt and inserts it in collection XA. By inserting the tag t in T2, it enables the execution of a step S2 with that tag. Such a step adds αxt to yt, putting the result in the output collection Z.

While the use of the elements just described suffices to build any CnC program, the library offers a series of tuners (https://icnc.github.io/api/index.html) that allow programmers to specify additional information to the runtime to optimize the execution. For example, they can specify which are the exact data items required by a step instance, so that it is only launched when they are all available. Another example is to specify how many times the program will use a given data item so that it can be deallocated after the last usage, as otherwise the runtime cannot know when it is no longer needed.

A very interesting property of CnC is that, unlike the previous approaches, it supports distributed-memory environments (1), although this requires additional coding. The threading is based on Intel TBB [29], so similar comments about portability as in DepSpawn apply, except that C++11 support is not required.

III. EVALUATION

As in this study we are mostly interested in the extent to which these approaches simplify the development of applications based on dependent tasks, a programmability comparison will be made first, followed by a performance evaluation. By far, the most common codes found in evaluations of tools of this kind are matrix operations performed by blocks [8], [16], [36]. This is not surprising, as this class of algorithms can present very complex patterns of task dependencies that are both the largest challenge and the largest opportunity for performance improvement for these tools, particularly when compared to approaches that cannot support arbitrary patterns of dependencies. As a result, our analysis is also based on algorithms of this kind that can be parallelized with any of the four tools.

Another preoccupation for the evaluation is of course fairness. While the author is well versed in the use of the first three tools presented, which are besides quite straightforward to use, his degree of expertise with CnC, particularly with its tuners, was smaller. For this reason, in order to ensure the quality and fairness of the comparison, two optimized (i.e. with tuners) codes distributed with the Intel CnC were used. The original codes included elements related to distributed computing, which were carefully cleaned so that the baselines only include code to run in shared-memory environments. The two codes are highly optimized. In fact, one of them was used in [33] for a deep analysis of the performance of CnC using the same structure and tuners. The other three versions were developed from these cleaned codes in an analogous style, so that the differences among them can only be attributed to the tool used for the parallelization. This also implies that, since DepSpawn and CnC are based on C++, all the versions are written in this language, so that language differences do not affect the comparison either. The two codes are a right-looking Cholesky decomposition of a lower triangular matrix and an in-place matrix inversion based on Gauss-Jordan elimination. In both algorithms, written in terms of tiles in Fig. 2, the basic linear algebra functions (syrk, gemm) are provided by calls to an external BLAS library.

A. Programmability

Measuring and analyzing programmability is a very complex task with many aspects involved. The ideal strategy is probably to study it based on the development times, quality of the code written, opinions, etc. of teams of programmers [34], preferably with a degree of experience in the approaches tested that is as similar as possible among the developers involved, but unfortunately such teams are seldom available. For this reason, an approach followed by many authors is to rely on objective metrics extracted from the source code, as they are reproducible and avoid subjective impressions. We follow this strategy, measuring two metrics in the codes.

(1) As seen in [5], OmpSs supports clusters, but that version is not publicly available at the moment of writing this paper.

Table I
COMPLEXITY METRICS FOR THE BASELINES AND ABSOLUTE INCREASES FOR THE DIFFERENT PARALLELIZATION STRATEGIES. SL STANDS FOR SLOCS AND PE FOR PROGRAMMING EFFORT (IN THOUSANDS).

            Baseline   OpenMP    OmpSs-inv  OmpSs-decl  DepSpawn   CnC         CnC-untuned
Algorithm   SL   PE    ΔSL  ΔPE  ΔSL  ΔPE   ΔSL  ΔPE    ΔSL  ΔPE  ΔSL   ΔPE   ΔSL   ΔPE
Cholesky    148  1022  9    170  6    166   6    69     3    53   125   1136  107   996
Inversion   255  1673  11   153  8    124   8    83     8    78   147   2324  108   1381

The first metric is the number of source lines of code excluding comments and empty lines (SLOCs). The second one is the Halstead programming effort [18], which estimates the development cost of a code by means of a reasoned formula based on the number of unique operands, unique operators, total operands and total operators found in the code. For this, the formula regards the constants and identifiers as operands, while the symbols or combinations of symbols that affect the value or ordering of operands constitute the operators. In our opinion the programming effort approximates the complexity faced by programmers better than the SLOCs, as it is well known that lines of code can widely vary in complexity.

for(i = 0; i < dim; i++) {
  A[i][i] = potrf(A[i][i]);
  for(r = i+1; r < dim; r++) {
    A[r][i] = trsm(A[i][i], A[r][i]);
  }
  for(j = i+1; j < dim; j++) {
    A[j][j] = dsyrk(A[j][i], A[j][j]);
    for(r = j+1; r < dim; r++) {
      A[r][j] = A[r][j] + A[r][i] * A[j][i];
    }
  }
}

(a) Cholesky factorization

for(i = 0; i < dim; i++) {
  A[i][i] = inverse(A[i][i]);
  for (j = 0; j < dim; j++) {
    if (j != i)
      A[i][j] = A[i][i] * A[i][j];
  }
  for(r = 0; r < dim; r++) {
    if(r != i) {
      for (j = 0; j < dim; j++) {
        if (j != i)
          A[r][j] = A[r][j] - A[r][i] * A[i][j];
      }
      A[r][i] = - A[r][i] * A[i][i];
    }
  }
}

(b) Matrix inversion

Figure 2. Algorithms tested, where dim is the number of tiles per dimension

Table I shows the value of the complexity metrics for the baseline sequential versions, which include a generic class to represent tiles, and their absolute increases when they are parallelized with each one of the tools discussed. As explained in Sect. II-B, in the case of OmpSs, tasks associated to function invocations can be annotated either in the function invocation or in the function declaration, the latter automatically turning all the invocations of the function into parallel tasks. Both approaches, which we will call OmpSs-inv and OmpSs-decl, respectively, in the rest of the paper, have been measured for completeness in Table I. While both alternatives require the same number of SLOCs in our codes, OmpSs-decl reduces the programming effort because its dependency clauses are based on the formal parameters rather than on the actual arguments, which usually have a more complex form. The problem with this approach is that the fact that the function becomes a parallel task wherever it appears in the application can be dangerous and undesired. In the case of CnC, since the programming cost of the tuners is non-negligible, we also developed an untuned version by cleaning everything related to tuners from the codes.

CnC requires users to define many classes and objects that do not exist in other programs, such as the collections or the tags. It also enforces new protocols, such as the access to data by means of retrievals from collections or the execution of new steps by adding new tags to control collections. These activities and concepts require much more coding than the other alternatives, on which we focus now.

The usage of OpenMP, OmpSs and DepSpawn is illustrated in Fig. 3 by means of analogous excerpts of our implementation of the matrix inversion code. In the case of OmpSs, the OmpSs-inv version is shown, so that the differences appear in the same places and the codes are more directly comparable. In these codes the matrix is stored in the same way as in the CnC codes, namely in a unidimensional dynamically allocated array called tiles, where each element is an object of the tile class mentioned before. The matrix tiles are stored by rows in the array, so that if the matrix has dim×dim tiles, the tile (i,j) is located in the position dim×i+j. The examples show the same tendencies as the global programmability metrics in Table I, where OpenMP is the most expensive, particularly in terms of programming effort, followed by OmpSs. One reason for the larger complexity of OpenMP is its restrictions on what can be specified in the dependency lists, such as the non-support of C++ references or of expressions that provide data items other than scalars and array sections.

(a) OpenMP

 1 #pragma omp parallel
 2 #pragma omp single
 3 {
 4 for (int i = 0; i < dim; i++) {
 5   tile* const pivot = &tiles[dim*i+i];
 6
 7   #pragma omp task depend(inout:pivot[0])
 8   pivot[0] = inverse(pivot[0]);
 9
10   ...
11
12   for (int r = 0; r < dim; r++) {
13     if (r == i) continue;
14
15     tile* const tin = &tiles[dim*r+i];
16
17     for (int j = 0; j < dim; j++) {
18       if (j == i) continue;
19
20       #pragma omp task depend(inout:tiles[dim*r+j]) depend(in:tin[0], tiles[dim*i+j])
21       multiply_subtract_in_place(tiles[dim*r+j],
22                                  tin[0], tiles[dim*i+j]);
23     }
24
25     ...
26   }
27
28 }
29 } // end omp single, omp parallel

(b) OmpSs

 1 for (int i = 0; i < dim; i++) {
 2   tile& pivot = tiles[dim*i+i];
 3
 4   #pragma omp task inout(pivot)
 5   pivot = inverse(pivot);
 6
 7   ...
 8
 9   for (int r = 0; r < dim; r++) {
10     if (r == i) continue;
11
12     tile& tin = tiles[dim*r+i];
13
14     for (int j = 0; j < dim; j++) {
15       if (j == i) continue;
16
17       #pragma omp task inout(tiles[dim*r+j]) in(tin, tiles[dim*i+j])
18       multiply_subtract_in_place(tiles[dim*r+j], tin, tiles[dim*i+j]);
19     }
20
21     ...
22   }
23 }
24
25 #pragma omp taskwait

(c) DepSpawn

 1 for (int i = 0; i < dim; i++) {
 2   tile& pivot = tiles[dim*i+i];
 3
 4   spawn( [] (tile &t) { t = inverse(t); },
 5          pivot);
 6
 7   ...
 8
 9   for (int r = 0; r < dim; r++) {
10     if (r == i) continue;
11
12     tile& tin = tiles[dim*r+i];
13
14     for (int j = 0; j < dim; j++) {
15       if (j == i) continue;
16
17       spawn(multiply_subtract_in_place,
18             tiles[dim*r+j], tin, tiles[dim*i+j]);
19     }
20
21     ...
22   }
23 }
24
25 wait_for_all();

Figure 3. Excerpts from the matrix inversion code implementations in OpenMP, OmpSs and DepSpawn.

non-support of C++ references or expressions that provide Other important difference among these approaches is data to access other than scalars and array sections. For that OpenMP and OmpSs require to annotate with a sep- example, as shown in Fig. 2(b), in the matrix inversion arate pragma the dependencies of a task, which involves code the tile (i,i) is used in three different places, and more specifications and analysis than the use of spawn we have measured that the SLOC increase associated to in DepSpawn. But while they can mark as tasks individual defining a reference to it (one line) is much less, even in statements or code blocks, DepSpawn requires its tasks to relative terms, than the Halstead programming effort cost be function calls. The fact that this library supports any due to selecting three times this tile from the tiles array. kind of function, including the convenient C++11 lambda This also matches the intuition that naming an entity that functions, ameliorates this restriction, although the result is used several times makes the code more readable and is arguably less readable than the use of directives. Our maintainable. As a result the OmpSs and DepSpawn versions metrics reflect this, as depending on the concrete situation define a reference called pivot to this tile in line 2 of their the programming effort of these lambdas goes from similar codes in Fig. 3. In OpenMP the best we can do is to use to noticeably larger than that of directives, particularly those a pointer, defined in line 5, which must be indexed to play of OmpSs. Our example in Fig. 3 includes both a task whose the same role. While according to our measurements this most natural expression is not a function call and another is still an improvement with respect to the repeated access one that is a function invocation in order to help to illustrate to the underlying tile, the solution is clearly less elegant these differences. 
The first task is the replacement of the and introduces more complexity than the use of a reference. pivot tile by its inverse (including the computation of such OpenMP also needs to enclose the parallel algorithms in inverse). This is naturally expressed by an assignment to the a parallel region and a top-level sequential task (lines 1-3 pivot tile of the result of the function call that computes its and last line in Fig. 3(a)) that are not needed in OmpSs and inverse. In OpenMP and OmpSs we label this assignment as DepSpawn. In exchange, these two latter alternatives require a task for which the pivot is both the input and the output. an explicit synchronization in their last line to make sure that DepSpawn requires a function call, so we could isolate this all the tasks have finished, while for OpenMP this is implicit block as a separate traditional function or, as in the example, once we exit the parallel region. we could express it as a C++ lambda function. Notice

6207 how the non-constness of the function formal parameter 250 indicates that it can both read and modify its argument, which is the pivot. The second task is an invocation to 200 a function multiply_subtract_in_place(a,b,c), which computes a=a-b×c. Here OpenMP and OmpSs fol- 150 low the same pattern as before, while DepSpawn benefits from the existence of the function, which makes the spawn much simpler. Regarding the OmpSs-decl style, if this were GFlops 100 the only place where this function is used in the application, Omp OmpSs or the user added the compiler directives necessary to allow 50 DepSpawn all its other uses to run safely in parallel, the OmpSs CnC version could avoid the annotations in the invocations of CnC untuned 0 this function just by declaring it as 2000 4000 6000 8000 10000 12000 Matrix Size #pragma omp task inout(a) in(b, c) void multiply subtract in place(tile &a, const tile& b, const tile& c); Figure 4. Cholesky performance in the Sandy Brigde platform Comparing this directive to line 17 in Fig. 3(b) explains the noticeable decrease of the programming effort of OmpSs- 500 decl with respect to OmpSs-inv in Table I. This justifies the annotation of function declarations as a worthwhile 400 feature of OmpSs. Still, making a function automatically parallel wherever it appears has dangers, such as if a user 300 forgets its parallel nature in another context, and it implies synchronization requirements in all the uses of the function. GFlops Both issues should not be underestimated, particularly in 200 Omp large software projects and libraries. OmpSs Altogether, the tendency observed is that the best met- 100 DepSpawn rics are achieved by DepSpawn, followed by OmpSs-decl, CnC CnC untuned OmpSs-inv, and OpenMP in this order, the differences be- 0 tween them being medium to small. 
Focusing on the nature of the codes, when function calls predominate in them, which is the case in Cholesky, DepSpawn offers the best programmability for the two metrics measured. But when it is more natural to express a meaningful portion of the tasks as statements or blocks, which is the style in which the matrix inversion code is written, things can be different. For example, according to Table I, in this algorithm the programming effort advantage of DepSpawn with respect to the other alternatives was smaller, and it matched OmpSs in terms of SLOCs, the reason being the (lambda) function declarations not required by the compiler directives for code blocks.

Figure 5. Cholesky performance in the Haswell platform

B. Performance

We used two Linux systems in our tests. The first one has 2 Intel Xeon E5-2660 Sandy Bridge processors at 2.2GHz with 8 cores and 20MB of L3 cache each, and 64GB of DDR3 memory. The second one has 2 Intel Xeon E5-2680 Haswell processors at 2.5GHz, with 12 cores and 30MB of L3 cache each, and 64GB of DDR4 memory. The compilers used are g++ 4.9.2 and 4.9.1, respectively, with optimization flag O3. These compilers provide the OpenMP support tested. It must be noted that, being a standard, there are many implementations of OpenMP. The implementation tested, belonging to the ubiquitous gcc tool chain, is one of the best supported, most popular and most widely used ones, being thus very representative. For OmpSs we installed its compiler and runtime version 15.06, while the DepSpawn and Intel CnC versions are 1.0 and 1.0.100, respectively. The BLAS functions used were provided by the high-quality OpenBLAS library v0.2.14, and all the experiments were made using double-precision floating point data. We found no performance differences between OmpSs-inv and OmpSs-decl, so they are discussed jointly in this section.

Figures 4 and 5 show the performance that our five versions of the Cholesky factorization achieve in the two platforms tested as a function of the matrix size when using all the cores they provide, 16 in the Sandy Bridge system and 24 in the Haswell system. Each experiment was run using tiles of size 200 × 200, 250 × 250, 400 × 400 and 500 × 500, repeating each measurement three times. The figures report at each point the best performance achieved for any tile size. Figures 6 and 7 show the performance of the matrix inversion algorithm obtained following the same

methodology. Interestingly, while the tuners played no clear role for CnC in the first algorithm, they were crucial in the matrix inversion. But even with them, CnC tends to be in the low range of performance because of the larger complexity of its programs, which require additional operations.

As mentioned before, a deep analysis of the performance and overheads of CnC in the matrix inversion code can be found in [33]. Although it is not possible to directly compare their results with ours due to the differences in hardware, software versions and linear algebra kernels used, there are some similarities. For example, the overhead they measure for the tuned CnC with respect to the underlying TBB (12%-15%) is similar to the speed difference we measured for this code with respect to the also TBB-based DepSpawn, 7%-13% in the Sandy Bridge and 8%-21% in the Haswell. In our experiments, however, the tuners increased the performance of this CnC code by 11% in the Sandy Bridge and 23% in the Haswell, while the improvement in [33] was a more modest 6%. We attribute this latter fact mostly to the larger number of cores we use (16 and 24 vs 8) and the higher quality of the very optimized BLAS dgemm routine we use, which accounts for the vast majority of the runtime (all the computations except inverse in Fig. 2(b)), as both facts stress much more the runtime system of the parallelizing solution used.

Figure 6. Matrix inversion performance in the Sandy Bridge platform

There is not a clear performance relation among the other alternatives. The maximum performance difference among them for Cholesky happens for the smallest matrix size, and it is 14% in the Sandy Bridge and 11% in the Haswell. Also, while DepSpawn is always the slowest one in the Sandy Bridge for this algorithm, there is a virtual match among the three versions in the Haswell, with OpenMP achieving the lowest performance for half of the matrix sizes. The maximum performance difference in the matrix inversion between the slowest and the fastest alternative in this group grows to 18% and 23% in the Sandy Bridge and Haswell platforms, respectively, and they happen for the 4000 × 4000 matrix. Although there is not a clear pattern among OpenMP, OmpSs and DepSpawn for this algorithm in the Sandy Bridge, this is not the case in the Haswell. Here, OmpSs consistently offers the best performance, and DepSpawn usually has an intermediate position, or one almost identical to that of OpenMP.

Figure 7. Matrix inversion performance in the Haswell platform

IV. RELATED WORK

A line of research related to this work is that of parallel skeletons [9], [17], as they are higher-order functions that implement the flow of execution, parallelism, synchronization and data communications of typical strategies for the parallel resolution of problems. This way, the use of a given skeleton implicitly expresses all the dependences among the parallel tasks that compose it, and it is responsible for all the details of the underlying parallelization. Nevertheless, because skeletons implement fixed collections of preset patterns, they are not general enough to optimally express any task graph, which is an essential part of our motivation.

CnC was compared to other approaches in [8], but only in terms of performance, and it did not include other programming tools based on implicit dependences and synchronizations. More recently, the performance of different runtimes for OpenMP dependent tasks and multithreaded libraries was compared in [36]. As for programmability issues, the general ease of use of OpenMP has been discussed in [23], [15], and lists of typical mistakes and good practices have been compiled in [32]. A motivation, design description and performance evaluation of OpenMP 3.0 tasks is found in [2]. That manuscript precedes the inclusion of the depend clause in OpenMP 4.0, and thus it does not cover the implicit dependencies it enables, which are critical for our work. This latter observation also pertains to [25], which compares task parallelism under several parallel frameworks based on explicit synchronizations, including OpenMP 3.0 and TBB, which is the backend for the two libraries we have tested.

There are many other tools that support the paradigm studied in this paper. For example, there has been extensive research on this subject in the field of functional [21], [22] and dataflow [10] programming, but we have focused on imperative languages, as they are more commonly used in parallel applications. While some imperative languages were proposed based on this approach [30], [13], most of the research on data-flow task programming has developed libraries with semantics similar to those of the tools discussed in this paper. Examples are [7], exclusively focused on linear algebra, [37], used in the backend of the PLASMA library, or [1], particularly focused on heterogeneous computing. There are also projects that provide both a library to build tasks that express their dependencies by means of its API datatypes [14] and a runtime for OpenMP [4].

Out of the scope of this work is the large set of proposals to facilitate the creation and management of tasks that require users to explicitly handle the dependencies and synchronizations between them, at least when the task dependencies do not form a simple pattern. Examples of this kind of tools are [29], [12], [31], [24], to name a few. The mechanisms used to specify the synchronizations and dependencies between the tasks include language constructs, keywords, library functions, etc.

V. CONCLUSIONS

Declaratively specifying the abstract dependencies of the parallel tasks that compose our application and letting a runtime automatically detect and enforce the actual dependencies generated during the execution is a very attractive alternative. This strategy not only requires less programmer time and gives place to more maintainable codes than lower-level, more explicit alternatives, but it is widely applicable and it allows an optimal exploitation of the underlying hardware, depending on the quality of the scheduling algorithms of that runtime. While there are performance evaluations of tools of this kind, we have not found discussions on the relative advantages and limitations of proposals that support this paradigm. We have thus tackled this issue in this paper. Two families of compiler directives, OpenMP and OmpSs, and two C++ libraries, DepSpawn and CnC, were analyzed. As a result, this study mostly focused on this language, extensively used in parallel computing. Our findings can be summarized as follows.

CnC was the option that clearly required more programming effort and offered a more modest performance, even when the applications were optimized with tuners. It has, though, the unique advantage of being the only one that (publicly, at least) provides a path to execute its applications in distributed-memory systems, although this requires some additional coding. Therefore this is the tool advised for hybrid and distributed-memory environments, at least while the OmpSs support for clusters is not publicly available.

There was not a clear performance relation among the other tools, and the differences between them were small in most cases. In fact, the average performance difference between the slowest and the fastest implementation among them in our tests was 8.8%. Regarding programmability, DepSpawn was the best positioned, and OpenMP the worst one, although there were no large differences either. In view of this, OpenMP seems the default advised approach, given its large support and portability, although the impossibility of using references or expressions in the dependencies makes it less friendly than OmpSs and DepSpawn. Other meaningful shortcomings of OpenMP are the lack of support for data members in its clauses and for annotating function declarations. Finally, DepSpawn is the only option in this group when we need to respect dependencies on arbitrarily overlapping data items, including array regions, as it is the only one that supports this feature.

Among the most interesting lines of development for these tools are the support of distributed-memory systems for those that are missing it, and the possible inclusion of futures [19] as another mechanism of implicit synchronization.

ACKNOWLEDGEMENTS

This work was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds (80%) of the European Union (Project TIN2013-42148-P), as well as by the Xunta de Galicia under the Consolidation Program of Competitive Reference Groups (Ref. GRC2013/055). The author also wants to thank Dr. Thierry Gautier for his help and CESGA (Galicia Supercomputing Center, Santiago de Compostela, Spain) for providing access to the Intel Xeon E5-2680 Haswell platform.

REFERENCES

[1] C. Augonnet, S. Thibault, R. Namyst, and P.
Wacrenier, "StarPU: a unified platform for task scheduling on heterogeneous multicore architectures," Concurrency and Computation: Practice and Experience, vol. 23, no. 2, pp. 187–198, 2011.

[2] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang, "The design of OpenMP tasks," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 3, pp. 404–418, March 2009.

[3] OmpSs Specification, Barcelona Supercomputing Center, Dec 2015.

[4] F. Broquedis, T. Gautier, and V. Danjean, "libKOMP, an efficient OpenMP runtime system for both fork-join and data flow paradigms," in OpenMP in a Heterogeneous World: 8th Intl. Workshop on OpenMP (IWOMP 2012). Berlin, Heidelberg: Springer, 2012, pp. 102–115.

[5] J. Bueno, X. Martorell, R. M. Badia, E. Ayguadé, and J. Labarta, "Implementing OmpSs support for regions of data in architectures with multiple address spaces," in 27th Intl. Conf. on Supercomputing, ser. ICS '13, 2013, pp. 359–368.

[6] M. G. Burke, K. Knobe, R. Newton, and V. Sarkar, "The concurrent collections programming model," Dept. of Computer Science, Rice University, Tech. Rep. TR 10-12, 2010.

[7] E. Chan, F. G. Van Zee, P. Bientinesi, E. S. Quintana-Ortí, G. Quintana-Ortí, and R. van de Geijn, "SuperMatrix: a multithreaded runtime scheduling system for algorithms-by-blocks," in 13th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, ser. PPoPP'08, 2008, pp. 123–132.

[8] A. Chandramowlishwaran, K. Knobe, and R. Vuduc, "Performance evaluation of concurrent collections on high-performance multicore computing systems," in 2010 IEEE Intl. Symp. on Parallel Distributed Processing (IPDPS), 2010, pp. 1–12.

[9] M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, 1991.

[10] J. B. Dennis, "Data flow supercomputers," Computer, vol. 13, no. 11, pp. 48–56, Nov 1980.

[11] A. Duran, R. Ferrer, E. Ayguadé, R. M. Badia, and J. Labarta, "A proposal to extend the OpenMP tasking model with dependent tasks," Intl. J. Parallel Program., vol. 37, no. 3, pp. 292–305, 2009.

[12] M. Frigo, C. E. Leiserson, and K. H. Randall, "The implementation of the Cilk-5 multithreaded language," in ACM SIGPLAN 1998 Conf. on Programming Language Design and Implementation, ser. PLDI '98, 1998, pp. 212–223.

[13] F. Galilee, G. Cavalheiro, J.-L. Roch, and M. Doreille, "Athapascan-1: on-line building data flow graph in a parallel language," in 1998 Intl. Conf. on Parallel Architectures and Compilation Techniques (PACT'98), Oct 1998, pp. 88–95.

[14] T. Gautier, J. Lima, N. Maillard, and B. Raffin, "XKaapi: a runtime system for data-flow task programming on heterogeneous architectures," in IEEE 27th Intl. Symp. on Parallel and Distributed Processing (IPDPS), 2013, pp. 1299–1308.

[15] R. Gonçalves, M. Amaris, T. K. Okada, P. Bruel, and A. Goldman, "OpenMP is not as easy as it appears," in 49th Hawaii Intl. Conf. on System Sciences (HICSS 2016), Jan 2016, pp. 5742–5751.

[16] C. H. González and B. B. Fraguela, "A framework for argument-based task synchronization with automatic detection of dependencies," Parallel Computing, vol. 39, no. 9, pp. 475–489, 2013.

[17] S. Gorlatch and M. Cole, "Parallel skeletons," in Encyclopedia of Parallel Computing, D. Padua, Ed. Springer US, 2011, pp. 1417–1422.

[18] M. H. Halstead, Elements of Software Science. New York, NY, USA: Elsevier, 1977.

[19] R. H. Halstead, Jr., "MULTILISP: a language for concurrent symbolic computation," ACM Trans. Program. Lang. Syst., vol. 7, no. 4, pp. 501–538, Oct. 1985.

[20] M. Herlihy and J. E. B. Moss, "Transactional memory: architectural support for lock-free data structures," in 20th Annual Intl. Symp. on Computer Architecture, 1993, pp. 289–300.

[21] R. Loogen, Y. Ortega-Mallén, and R. Peña Marí, "Parallel functional programming in Eden," J. Funct. Program., vol. 15, no. 3, pp. 431–475, May 2005.

[22] S. Marlow, P. Maier, H.-W. Loidl, M. Aswad, and P. Trinder, "Seq no more: better strategies for parallel Haskell," SIGPLAN Not., vol. 45, no. 11, pp. 91–102, Sep 2010.

[23] T. G. Mattson, "How good is OpenMP," Sci. Program., vol. 11, no. 2, pp. 81–93, Apr. 2003.

[24] S.-J. Min, C. Iancu, and K. Yelick, "Hierarchical work stealing on manycore clusters," in Fifth Conf. on Partitioned Global Address Space Programming Models (PGAS 2011), Oct 2011.

[25] S. Olivier and J. Prins, "Comparison of OpenMP 3.0 and other task parallel frameworks on unbalanced task graphs," Intl. J. Parallel Program., vol. 38, no. 5-6, pp. 341–360, 2010.

[26] OpenMP Architecture Review Board, "OpenMP application program interface version 4.0," July 2013. [Online]. Available: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf

[27] J. Perez, R. Badia, and J. Labarta, "A dependency-aware task-based programming environment for multi-core architectures," in 2008 IEEE Intl. Conf. on Cluster Computing, Oct 2008, pp. 142–151.

[28] L. Rauchwerger and D. Padua, "The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization," IEEE Trans. on Parallel and Distributed Systems, vol. 10, no. 2, pp. 160–180, 1999.

[29] J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-core Parallelism, 1st ed. O'Reilly, 2007.

[30] M. Rinard, D. Scales, and M. Lam, "Jade: a high-level, machine-independent language for parallel programming," Computer, vol. 26, no. 6, pp. 28–38, 1993.

[31] V. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, and D. Grove, "X10 language specification," IBM, Tech. Rep., Dec 2015. [Online]. Available: http://x10.sourceforge.net/documentation/languagespec/x10-latest.pdf

[32] M. Süß and C. Leopold, "Common mistakes in OpenMP and how to avoid them: a collection of best practices," in 2005 and 2006 Intl. Conf. on OpenMP Shared Memory Parallel Programming, ser. IWOMP'05/IWOMP'06, 2008, pp. 312–323.

[33] P. Tang, "Measuring the overhead of Intel C++ Concurrent Collections over Threading Building Blocks for Gauss-Jordan elimination," Concurrency and Computation: Practice and Experience, vol. 24, no. 18, pp. 2282–2301, 2012.

[34] C. Teijeiro, G. L. Taboada, J. Touriño, B. B. Fraguela, R. Doallo, D. A. Mallón, A. Gómez, J. C. Mouriño, and B. Wibecan, "Evaluation of UPC programmability using classroom studies," in 3rd Conf. on Partitioned Global Address Space Programing Models, ser. PGAS '09, 2009, pp. 10:1–10:7.

[35] T. L. Veldhuizen, "Arrays in Blitz++," in 2nd Intl. Scientific Computing in Object-Oriented Parallel Environments (ISCOPE98). Springer-Verlag, 1998, pp. 223–230.

[36] P. Virouleau, P. Brunet, F. Broquedis, N. Furmento, S. Thibault, O. Aumage, and T. Gautier, "Evaluation of OpenMP dependent tasks with the KASTORS benchmark suite," in 10th Intl. Workshop on OpenMP (IWOMP 2014). Springer International Publishing, 2014, pp. 16–29.

[37] A. YarKhan, J. Kurzak, and J. Dongarra, "QUARK users guide: queueing and runtime for kernels," Dept. of Computer Science, University of Tennessee, Tech. Rep. TR ICL-UT-11-02, 2011.
