Extensible Parallel Programming in ableC

Aaron Councilman
Department of Computer Science and Engineering
University of Minnesota, Twin Cities
May 23, 2019

1 Introduction

There are many different manners of parallelizing code, and many different languages that provide such features. Different types of computations are best suited by different types of parallelism. Simply whether a computation is compute bound or I/O bound determines whether it will benefit from being run with more threads than the machine has cores, and other properties of a computation similarly affect how it performs when run in parallel. Thus, to allow parallel programmers to deliver the best performance for their programs, the ability to choose the parallel programming abstractions they use is important. The ability to combine these abstractions however they need is also important, since different parts of a program will have different performance properties and therefore may perform best using different abstractions. Unfortunately, parallel programming languages are often designed monolithically, built as entire languages with specific sets of features. Because of this, programmers' choice of parallel programming abstractions is generally limited to the choice of which language to use. Beyond limiting the available abstractions, this also means that the choice of abstractions must be made ahead of time, since any attempt to change the parallel programming language later is likely to be prohibitive, as it may require rewriting large portions of the codebase, if not the entire codebase.

Extensible programming languages can offer a solution to these problems. With an extensible compiler, the programmer chooses a base programming language and can then select the set of "extensions" for that language that best fit their needs. With this design, it is also possible to add new extensions at a later time if needed, allowing a developer to choose features as they find them necessary. With an extensible compiler, a programmer may therefore select multiple parallel programming abstractions. However, while extensible programming languages can resolve the problem of having little choice over the features of the language, they make no guarantees that the parallel extensions can actually work together.

In this work, we explain the motivation behind using extensible languages for parallel programming and discuss how existing parallel programming abstractions can be combined. We then develop a system that allows arbitrary parallel programming abstractions to work together and explore its implementation in the ableC framework. Finally, we discuss further work to be done on the ableC framework to improve support for parallel programming, and then discuss future work that can use the tools developed herein to support new features in parallel programming.

2 Background

We first provide a discussion of the systems that are essential to understanding this work. We begin with the ableC framework, followed by a discussion of the Cilk and OpenMP parallelization systems.

2.1 ableC

ableC is an extensible compiler framework that provides an extensible compiler for the C11 standard of the C programming language [2]. It is built using the Silver language, an attribute grammar system designed specifically for the construction of programming language specifications [5]. The Silver framework includes analyses which guarantee the composability of independently developed language extensions [3, 4].


(a) A Fibonacci function implemented in the MIT Cilk language:

cilk int fib(int n) {
    if (n < 2)
        return n;
    else {
        int x, y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return x + y;
    }
}

(b) A Fibonacci function implemented in the ableC Cilk extension:

cilk int fib(int n) {
    if (n < 2)
        cilk return n;
    else {
        int x, y;
        spawn x = fib(n-1);
        spawn y = fib(n-2);
        sync;
        cilk return x + y;
    }
}

Figure 1: Examples of Cilk code

These analyses mean that programmers may choose any set of language extensions they want and still be guaranteed that a valid compiler can be automatically created. The analyses do, however, place some limitations on the exact construction of extensions, specifically on the concrete syntax that extensions may introduce. Because of this, some extensions based on existing programming languages must use different syntax than the languages they are based on.

2.2 Cilk

Cilk is a work-stealing parallel programming language and runtime system that provides provably good parallel performance [1]. The Cilk language was designed as a monolithic language, using the cilk2c program to compile Cilk programs to standard C code. The features of Cilk that this work focuses on are the ability to spawn and sync work. The spawn construct states that a certain function call can be performed asynchronously. Spawned function calls are assigned to a variable, whose value is updated once the function call completes. The sync construct guarantees that all spawned function calls have completed before execution continues. Because Cilk performs substantial transformations to the code, functions must be declared as Cilk functions by adding the cilk keyword to the function's declaration. This declaration causes modifications to the function's signature and body in the generated C code, as well as the generation of multiple copies of the function, used for different purposes within the system. A Cilk language extension for ableC that generates code similar to that of cilk2c has been previously described [2]; the work herein builds further upon this extension. Figure 1a shows how a simple Fibonacci function is written in Cilk, and Figure 1b shows how it is written in ableC, illustrating some of the differences required by the analyses mentioned above.

2.3 OpenMP

OpenMP is a parallel programming specification for C, C++, and Fortran that uses a fork-join model. Many compilers for these languages include support for OpenMP, including the GNU C compiler used in this work and by ableC. OpenMP provides a wide variety of parallelization features, but the feature this work focuses on is the OpenMP parallel for loop, which provides simple syntax for parallelizing the execution of a for loop whose iterations are independent. OpenMP provides a series of #pragma directives that are used to control parallel execution; an OpenMP parallel for loop is shown in Figure 2. OpenMP's fork-join model means that when a parallel for loop construct is reached, a number of threads are created to perform the work, and each thread exits when it finishes. The threads are then joined to synchronize all iterations of the loop before execution continues.


void map(int* arr, int len, int (*f)(int)) {
    #pragma omp parallel for
    for (int i = 0; i < len; i++) {
        arr[i] = f(arr[i]);
    }
}

Figure 2: Performing a map operation over an array using OpenMP parallelization

3 Motivation

In this section we explore the benefits that extensible programming languages can provide for parallel programming. We then explore problems with current parallel programming extensions that prevent this vision from being realized.

3.1 Combining Parallelization Methods

There are a variety of facets of a computation that affect how it performs when parallelized. As discussed above, being compute bound versus I/O bound is an easy example of this; a compute-bound computation will not see any performance benefit from utilizing more threads than the machine has cores, and doing so likely causes the process to slow down due to overhead and thread switching. On the other hand, programs that do large amounts of I/O, and therefore spend a lot of time blocking while waiting for I/O requests to be fulfilled, are I/O bound. In these situations, having more threads than cores can still lead to performance gains because thread switching can occur when a thread blocks: if another thread is ready to perform work when one thread blocks, the amount of time the machine spends idle waiting for threads to unblock is reduced. In fact, this idea applies to any computation that spends substantial amounts of time blocking, whether from I/O or from the use of mutexes to ensure mutual exclusion, though in the latter case the addition of extra threads can increase contention on the locks, which can adversely affect performance.

Even within the class of compute-bound problems, however, there are different types of problems that may perform better with certain parallelization methods. For example, we can compare the work-stealing method in Cilk and the fork-join method in OpenMP. In Cilk's work-stealing there exists a pool of worker threads, each with its own queue of function calls to execute. When a thread runs out of work, it steals a piece of work off of another thread's queue. This is a very useful thread model for problems that create many sub-problems, and where those sub-problems may be of differing sizes. On the other hand, in OpenMP's fork-join model threads are created dynamically as they are needed, and once a thread finishes its computation, it exits. This model is better for problems that do not create as many pieces of work and where each piece is expected to take roughly the same amount of time. While this model may seem less flexible than the work-stealing model, it also has less overhead: the only real overhead in the fork-join model is creating the threads, plus perhaps some setup code in each thread, whereas in the work-stealing model the current state of a function must be saved frequently in case the function is later stolen. Depending on a program's characteristics, the choice of method varies, and there are certainly other methods that could be used as well, each with its own performance characteristics.

Beyond the difficulty of knowing exactly which parallelization method will give the best performance for a certain program, there is the additional challenge that in many programs different pieces of the program have different performance characteristics. Many programs contain pieces of computation that are compute bound and others that are not. For example, in a multi-player computer game, the render engine is likely compute bound, while other parts of the game wait for keyboard input or network messages and are thus I/O bound. If the render engine uses a work-stealing thread model, there is added overhead for managing this that is not needed by the keyboard or network handlers, and so we would prefer that they not be part of the work-stealing.
However, it is possible that a message received from the network requires a complicated computation that we would like to run in the work-stealing model. Unfortunately, this ability for different parallelization systems to communicate and work together is often non-existent.


(a) Example using a non-Cilk function as the pthread function:

cilk int fib(int n);

void* func(void* arg) {
    int* p = arg;
    int n = *p;
    int res;

    res = spawn fib(n);
    sync;

    p = malloc(sizeof(int));
    *p = res;

    return p;
}

int main() {
    pthread_t thd;
    int n = 20;
    pthread_create(&thd, NULL, func, &n);
    int* res;
    pthread_join(thd, &res);
}

(b) Example using a Cilk function as the pthread function:

cilk int fib(int n);

cilk void* func(void* arg) {
    int* p = arg;
    int n = *p;
    int res;

    res = spawn fib(n);
    sync;

    p = malloc(sizeof(int));
    *p = res;

    return p;
}

int main() {
    pthread_t thd;
    int n = 20;
    pthread_create(&thd, NULL, func, &n);
    int* res;
    pthread_join(thd, &res);
}

Figure 3: Examples of attempting to use Cilk and pthreads together

For example, the Cilk system does not allow us to spawn work from functions not declared as Cilk functions; however, declaring a function as a Cilk function breaks the ability to use other parallelization methods with it. For instance, the POSIX standard defines the pthread interface for creating threads and running functions on them. Figure 3 shows two alternative attempts at using Cilk and pthreads together; neither of them is valid. Figure 3a fails to compile using cilk2c because spawn cannot be used from non-Cilk functions. Figure 3b fails to compile because the signature of func is changed, and so it no longer matches the signature required by the pthread interface. There are ways we could attempt to work around these problems, for example by passing the pointers used internally by Cilk to a function we run using pthreads, but if multiple threads share this same pointer we will encounter run-time errors, including segmentation faults. Therefore, even using an extensible compiler that allows us to choose the parallelization extensions we wish to use, we find that the extensions interact in ways, possibly only at run-time, that cause programs to fail. The guarantees provided by Silver that extensions will be composable do not extend to whether they will work together in practice.

3.2 Current Combinations

As discussed above, there are problems when combining different parallelization methods. Above we discussed how Cilk cannot be run from code that is running on a newly created pthread. While this example shows a limitation, the opposite is possible: creating pthreads from within Cilk code works, since Cilk can run normal functions and can therefore run all the functions necessary to create a pthread. An asymmetry like this is common with Cilk and other parallelization features, as we will explore further. A note of caution with using Cilk to create pthreads: this combination may be more prone to deadlock, especially if the threads are joined within the Cilk code (that is, the Cilk function waits for the thread to exit) and these threads utilize synchronization methods such as barriers. Further discussion of this problem is omitted as it is not generally solvable; solutions to avoid it may exist, but they are beyond the scope of the current work.

We encounter similar problems when attempting to combine OpenMP and Cilk.


First, the cilk2c compiler does not support the syntax of OpenMP, but it is possible to run the cilk2c compiler to generate standard C code and then add the OpenMP syntax to this code before running a standard compiler. However, several problems arise when trying to do this. If we place an OpenMP parallel loop in a function that is not a Cilk function and attempt to spawn Cilk functions from within the loop, we receive errors because Cilk does not support spawns from non-Cilk functions. We can, however, place an OpenMP loop within a Cilk function. Unfortunately, we cannot place spawns within that loop: by its design, Cilk uses labels and goto statements to jump to points in the code near spawns and syncs, and a goto to a label within an OpenMP loop is prohibited, for good reason. Therefore, as with pthreads, Cilk and OpenMP work together only so long as one places OpenMP code within Cilk code, not Cilk code within OpenMP.

While we do have some ability to combine parallel programming models, it is also possible to find combinations that do not work together at all. Clearly, the construction of Cilk makes it impossible for Cilk code to be executed from non-Cilk functions. Another parallel programming system we wish to use could have a similar limitation, in which case it would be impossible for it and Cilk to truly work together. None of this discussion should be construed to state that these features cannot be used side-by-side: nothing prevents the creation of pthreads that run separately from Cilk. The problem we identify here is allowing these systems to work together. Of course, if properly constructed this could still be managed; a program with Cilk and pthreads simply must create data structures that allow the two systems to communicate with each other and monitor the communication from the other system. However, this requires that the programmer write code to handle this communication, meaning they must write such code for every parallelization system they wish to use in their program. What we propose, therefore, is a back-end that helps to handle these communications, shifting some of the burden onto the extension developer, but removing it from the programmer.
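Before turning to the design, the following sketch makes concrete the OpenMP and Cilk nesting described above that the current tools reject. It is not an example from any existing codebase; the function names are illustrative. The spawn inside the loop is the problem: cilk2c compiles spawns and syncs using labels and gotos, and a goto may not jump to a label inside an OpenMP structured block.

/* Illustrative only: an OpenMP parallel loop inside a Cilk function whose
 * body spawns Cilk work.  As described above, this combination cannot be
 * compiled by cilk2c plus a standard OpenMP-capable C compiler. */
cilk int work(int x);

cilk void process_all(int* data, int len) {
    #pragma omp parallel for
    for (int i = 0; i < len; i++) {
        data[i] = spawn work(data[i]);   /* rejected: spawn inside the loop */
    }
    sync;
}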

4 Design

As discussed above, the problem we encounter when combining parallel systems is getting them to work together. Since the systems can be run simultaneously as long as they operate separately, the solution to allow them to work together is simply to provide them with a means of communicating. Since this system mediates all interactions between parallel extensions, it should be as light-weight as possible, so that the improved performance resulting from the flexibility to choose the ideal set of parallel extensions is not offset by the extra time this communication requires.

The bulk of this communication system is implemented with four methods: init_thread_pool, send_message, send_response, and receive_response. The init_thread_pool method is called at the beginning of the program's execution; it creates a pool of a specified number of threads and associates a specified id with that pool, which can later be used to look up that specific thread pool. Associated with the thread pool are a message-handler function and a pointer that is used by the message handler; both are used when sending a message to the thread pool.

The send_message function can be called with a pointer to send a message to a specified thread pool. To do this, the message-handler function associated with the thread pool is called with three arguments: the message that was provided to send_message, the pointer associated with the thread pool, and an object that can be used to send a response to the message. The message handler is expected to extract whatever data is necessary from the message and use the pointer to manipulate its internal data structures appropriately. It may then use the send_response method to respond to the message.

When send_message is called, it returns an object that allows the caller to receive a response to the original message. This response can be used to send information back to the caller in a number of ways: the response itself takes the form of a pointer, which could point to a data structure containing information in response to the message, or a response might not be sent until the request specified in the message has been satisfied, in which case the response acts as a form of synchronization. The receive_response method takes the object returned from send_message, waits until a response has been sent, and then returns the response. Other methods in the system are used to set up and destroy internal data structures and to perform cleanup at exit, ensuring that threads are not left running, which could be reported as memory leaks.
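To make the shape of this interface concrete, the following sketch shows one way the four methods might be declared in C. The type names (response_handle_t, responder_t, message_handler_t), the argument orders, and the use of integer pool ids are assumptions made for illustration; they are not the actual signatures used by the implementation.

/* Hypothetical sketch of the communication system's interface; all names and
 * signatures here are illustrative assumptions, not the actual ableC API. */

/* Handle returned by send_message, used to wait for a response. */
typedef struct response_handle response_handle_t;

/* Object passed to a message handler so that it can reply. */
typedef struct responder responder_t;

/* A message handler receives the message, the pointer registered with the
 * thread pool, and a responder it may use to send a response. */
typedef void (*message_handler_t)(void* message, void* pool_data,
                                  responder_t* responder);

/* Create a pool of num_threads threads identified by pool_id, with the given
 * message handler and handler data. */
void init_thread_pool(int pool_id, int num_threads,
                      message_handler_t handler, void* pool_data);

/* Send a message to the pool identified by pool_id; the returned handle can
 * later be passed to receive_response. */
response_handle_t* send_message(int pool_id, void* message);

/* Called from inside a message handler to reply to the sender. */
void send_response(responder_t* responder, void* response);

/* Block until a response arrives for the given handle, then return it. */
void* receive_response(response_handle_t* handle);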


Having described the system, we will now present a brief overview of how two ableC extensions make use of this system. The first enables the existing Cilk extension to handle spawns from non-Cilk functions. The second provides parallel for loops using OpenMP style syntax.

4.1 Cilk

The preexisting Cilk extension operates very similarly to the cilk2c compiler designed for Cilk-5. In this work we added code to the extension specification to handle spawn and sync statements that are not in Cilk functions. When a spawn statement is encountered outside a Cilk function, it is translated into a message that is sent to the Cilk thread pool using the new communication system. This message contains several pieces of information: the address of the variable the function's return value is to be written to (assuming the value is assigned to a variable), a function pointer to the function to be executed, the values of the arguments to that function, and a pointer to an integer (the join counter), a lock, and a condition variable. The argument values are placed into a struct that is declared with the appropriate fields for the function being spawned, and the pointer to this struct is passed in the message. The integer, lock, and condition variable are used to implement the synchronization method of this system, and are actually passed in the same struct as the arguments. Each non-Cilk function that makes use of Cilk spawn and sync statements has these three variables lifted to its outermost scope. The join counter counts the number of spawns that have been made, and is decremented when a spawned function finishes and has written back its value. The lock and condition variable are simply used to maintain mutual exclusion and avoid busy waiting.

The remaining component of the message to discuss is the function pointer. When a Cilk function is declared, the Cilk extension creates two different versions of it: a fast clone and a slow clone. To allow for communication from outside of Cilk functions, we now also create one additional version of the function, a wrapper, whose purpose is to extract the arguments from the argument struct, call the desired function, and then sync. Once the function has finished executing, this wrapper writes the value back appropriately and then updates the join counter. We use this wrapper function because it must handle the write-back and join-counter update correctly, which cannot be done by the traditional fast or slow clones generated by Cilk. To actually inject the desired function call into the Cilk workers' deques, the message handler for the Cilk thread pool takes this message and transforms it into a Closure that is then placed onto the deque of a randomly selected Cilk thread.

To implement the sync from non-Cilk functions, we simply wait for the join counter to equal zero, using the lock and condition variable to ensure that the calling thread blocks while waiting instead of busy waiting. Ideally, we would find some way for the calling thread to help perform any of the remaining Cilk work that is being waited on. Unfortunately, doing so is very complicated, as threads cannot safely share Cilk deques; while it might be possible to implement this functionality, we have not done so due to the technical challenges.

Figure 4 shows how Cilk features can be used from non-Cilk functions. It also shows a few added function calls at the beginning of main. These are used to set up the parallelization system and then the Cilk thread pool specifically. The latter function uses Cilk's internal methods to perform the necessary setup before using the init_thread_pool function described above to create the desired threads.
These threads simply run the Cilk child main function defined in the Cilk-5 runtime system, which is the function that is run when Cilk’s runtime itself is used to launch threads.
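As a rough illustration of the translation described above, the following sketch shows the kind of code that might be generated for "res = spawn fib(n); ... sync;" inside a non-Cilk function. The struct layout, the wrapper declaration, the CILK_POOL_ID constant, and the simplified send_message signature are assumptions made for illustration; the actual generated code differs in its details.

/* Hypothetical sketch of generated code for a spawn and sync in a non-Cilk
 * function; names, layout, and the pool id are illustrative assumptions. */
#include <pthread.h>

void* send_message(int pool_id, void* msg);  /* response handle unused here */
#define CILK_POOL_ID 0                       /* assumed id of the Cilk pool */

struct fib_spawn_msg {
    int* result_addr;            /* where the spawned call writes its result */
    void (*wrapper)(void*);      /* wrapper clone generated for fib */
    int n;                       /* argument value for fib */
    int* join_counter;           /* outstanding spawns in the calling function */
    pthread_mutex_t* lock;
    pthread_cond_t* cond;
};

/* Generated wrapper clone: runs on a Cilk worker, calls fib, writes the
 * result back, and decrements the join counter (body elided here). */
void fib_wrapper(void* msg);

void caller(int n) {
    int res;
    /* synchronization state lifted to the function's outermost scope */
    int join_counter = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

    /* translation of: res = spawn fib(n); */
    struct fib_spawn_msg msg =
        { &res, fib_wrapper, n, &join_counter, &lock, &cond };
    pthread_mutex_lock(&lock);
    join_counter++;
    pthread_mutex_unlock(&lock);
    send_message(CILK_POOL_ID, &msg);

    /* translation of: sync; -- wait until all spawns have written back */
    pthread_mutex_lock(&lock);
    while (join_counter != 0)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}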

4.2 OpenMP

As part of this work we have also created an OpenMP-inspired extension for parallel for loops. This extension, however, does not use the fork-join model described above that OpenMP itself uses; instead, it creates a thread pool through the parallelization system, and when a loop is encountered, messages are sent to the threads through work queues. With this method we have seen performance nearly as good as that of OpenMP as implemented in GCC.

To implement the parallel loops, we must lift the loop body, along with some setup code, into a new function that can be called from the other threads. An added complication is introduced by the fact that OpenMP is a shared-memory parallelization model. Because of this, variables that are declared outside of the loop body are shared between loop iterations, and therefore all updates to them must be visible in all iterations (except for those variables specifically listed as private). To support this, the lifted function is provided a list of pointers, where each pointer points to the appropriate variable in the procedure that used the parallel loop.


cilk int fib(int n);

int main() {
    setup_thread_system();
    init_cilk_ableC(2);

    int* arr = malloc(sizeof(int) * 30);

    for (int i = 0; i < 30; i++) {
        spawn arr[i] = fib(i);
    }

    sync;
}

Figure 4: Making use of Cilk constructs from a non-Cilk method

Then, within the loop body, each variable access is replaced with an access to the appropriate entry in the list of pointers, which is cast to the correct pointer type and dereferenced. To achieve this transformation, the loop body is placed into an environment in which each shared variable is given a shared qualifier, causing the code placed in the lifted function to be generated using the variable-access method just described. In this function, before the injected loop body, code is added to calculate the range of index values that a given thread should execute. Then, in the procedure that contained the parallel loop, code is added that sends a message using the parallelization system. This message includes a function pointer to the lifted function as well as the list of pointers described previously. The list also includes pointers to variables holding the start and stop conditions for the loop, as well as the value by which the loop variable is incremented each iteration. Sending this message places work requests into the work queues of each thread in the pool.

Since OpenMP automatically synchronizes at the end of loops, immediately after the message has been sent the code receives a response. This response contains a work request for one portion of the loop and a pointer to an integer that counts how many threads remain to be synchronized. The work request is then processed, and after that the thread waits until the counter is decremented to zero. This ensures that the thread that originally started the parallel loop performs one part of the loop itself. Since each section of the loop should involve roughly the same amount of computation, each is expected to take about the same amount of time, minimizing the time that thread spends waiting for the rest of the loop to finish. To allow nested parallel loops (though they should always be used cautiously), there is one additional complication to the loop's synchronization: after completing its assigned portion of the loop, the thread checks the counter as described, but if the counter is not yet zero it also checks its own work queue and, if any items are present, executes that work request.
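As a rough sketch of the lifting described above, consider the loop from Figure 2. The code below shows what the lifted loop body might look like; the function name, the layout of the shared-pointer list, and the chunking of the iteration space are illustrative assumptions rather than the extension's actual output.

/* Hypothetical sketch of a lifted parallel-for body; names, the shared[]
 * layout, and the chunking logic are illustrative assumptions. */

/* At the call site (sketch), the extension would pack pointers to the shared
 * variables, e.g. shared[0] = &arr, shared[1] = &len, shared[2] = &f, and
 * send a message containing this function pointer and the shared list. */
static void loop_body_0(void** shared, int thread_num, int num_threads) {
    /* recover the shared variables through the pointer list */
    int*  arr = *(int**)shared[0];
    int   len = *(int*)shared[1];
    int (*f)(int) = *(int (**)(int))shared[2];

    /* compute the range of iterations this thread should execute */
    int chunk = (len + num_threads - 1) / num_threads;
    int start = thread_num * chunk;
    int stop  = (start + chunk < len) ? start + chunk : len;

    /* injected loop body, with shared accesses rewritten as above */
    for (int i = start; i < stop; i++)
        arr[i] = f(arr[i]);
}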

4.3 Extension Combinations

To demonstrate that the tools developed in this work provide meaningful new features, we now explore how these two extensions can be combined in ways not possible without them. As discussed above, it is not possible for code within an OpenMP loop to use Cilk spawns, and similarly it is not possible to spawn Cilk work from within a pthread. Because the Cilk extension now allows non-Cilk functions to spawn Cilk work, the problem with pthreads is resolved. The problem with OpenMP is more complicated: an OpenMP-style loop within a non-Cilk function can now spawn Cilk work, but an OpenMP-style loop within a Cilk function currently cannot spawn Cilk work in a valid way. The issue is caused by how Cilk determines whether or not it is in a Cilk function. The Cilk extension tracks whether it is in a Cilk function by using a #define in the global scope; when the lifted loop-body function is injected into the global scope, it keeps the same global environment, and so the loop body believes itself to be in a Cilk function. To behave properly, we need the loop body to be treated as if it were not in a Cilk function, since it will be run by multiple threads.


cilk void process(char* request);

void* input_handler(void* socket) {
    int fd = *((int*) socket);
    int len;
    char buffer[BUFF_SIZE];

    while (len = read(fd, buffer, BUFF_SIZE)) {
        char* request = malloc(sizeof(char) * (len + 1));
        memcpy(request, buffer, sizeof(char) * len);
        request[len] = 0;
        spawn process(request);
    }

    sync;

    return NULL;
}

int main() {
    setup_thread_system();
    init_cilk_ableC();

    // ... details for setting up network elided ...

    pthread_t thread;
    pthread_create(&thread, NULL, input_handler, &socket);

    // ... further work done on the main thread ...

    pthread_join(thread, NULL);

    return 0;
}

Figure 5: Making use of Cilk and pthreads together

Modifying how the Cilk extension tracks whether it is within a Cilk function could fix this situation, but would require further work. Even without this, however, we have shown new capabilities that were not possible previously. In addition, if the parallel loop is placed within a non-Cilk function (and that function is called from within Cilk), we can simulate this behavior with only slight modifications to the code. A demonstration outlining a possible use of pthreads and Cilk together is shown in Figure 5.

5 Future Work

Finally, we discuss future work to be done on this parallelization system, particularly integrating it better into ableC so that it automatically handles several details currently left to the programmer, as well as work that we propose is possible using this system but that would not be possible without it.

5.1 Integration in ableC

Currently there are several features that require the programmer to write code manually, or require hard-coding of certain aspects within extensions, that we would rather have handled automatically by the extensions and ableC.


This can be seen in Figures 4 and 5, where calls to setup_thread_system and init_cilk_ableC are made to set up Cilk's internal data structures and to create the Cilk thread pool. This is code that simply needs to be injected at the beginning of main, and can therefore be injected relatively easily.

There are two more elements that we wish to integrate into ableC itself. The first is the assignment of thread-pool numbers. Each extension that uses the parallelization system has an integer id within the system that is used to send messages to its thread pool. Currently, this number is hard-coded into the extensions, but it should be generated automatically, since the number must be different for each extension.

Finally, as mentioned above, we would also like to add a method of resource allocation to the system. Currently, the number of threads each extension receives is hard-coded as an argument to that extension's setup function. We would like these decisions to be made automatically by generated code. Furthermore, we would ideally like this feature to allocate threads at runtime, allowing an executable to run on machines with varying numbers of cores while still taking full advantage of all available cores, without requiring the user to intervene.

To accomplish this, we propose adding a Parallelization nonterminal to ableC. This nonterminal would have a single production onto which a new extension could aspect a number of attributes. These attributes would give ableC a complete list of the extensions that use the parallelization system, and would therefore provide a way of generating a mapping from extensions to extension numbers. Other attributes would provide the setup code necessary for each extension and could provide information about the extension's use of threads, such as the minimum or maximum number of threads it can use. These properties can be important for extensions that use thread barriers: if, for instance, an extension generates code that requires 3 threads to synchronize at the same time, providing the extension with fewer threads can cause deadlock. Further attributes in this collection could be used by extensions that do not use the parallelization system but do create new threads to report that fact, which can be useful when deciding how to allocate threads. Finally, we are considering adding syntax that would allow the programmer to specify how threads should be distributed, ideally as the percentage of threads to be provided to each extension. At runtime this information, along with the number of cores on the machine and the other information provided by the extensions, would be used to allocate resources as close to the programmer's request as possible.

This last aspect, the allocation of threads, is the most complicated of these features, as we must decide what information about extensions' use of threads we are interested in and how to weigh all of these factors when allocating threads. For example, if we are running on a machine with 2 cores and use three extensions that rely on the parallelization system, but two of the extensions each require at minimum 2 threads, how do we allocate threads? Given these minimums, how do we honor the programmer's requested thread allocation? In the case described, what if the programmer states that the third extension should get 50% of the threads, with the other two each receiving 25%?
Should the third extension receive 4 threads? Or 1 thread, since that is 50% of the cores on the machine? We might also be interested in tracking whether an extension's threads are compute bound or block often: it might be reasonable to allocate 8 threads on a 2-core machine if most threads will spend their time blocked, but that would be a poor decision if all threads are compute bound, blocking only occasionally or never.

5.2 Further Parallel Extensions

In addition to the work to be done on ableC itself, there is of course work that could be done to add more parallel extensions to the ableC framework that utilize the parallelization system created herein. Beyond that, we also believe that the flexibility this parallelization system provides will allow us to create a flexible parallel programming language extension for ableC. Taking inspiration from the Halide programming language, and especially the Halide language extension to ableC, we propose a parallel programming extension that separates the writing of parallel programs from the manner in which the programs are parallelized. For instance, we could provide constructs to guarantee mutual exclusion for certain sections of code, and then allow the manner in which mutual exclusion is enforced (whether through mutexes, spin-locks, or even low-level locks combined with compare-and-exchange operations) to be modified through simple tags on those sections or elsewhere in the code. Such an extension would provide, at a minimum, a manner of spawning work and synchronizing it; a manner of specifying that certain variables and sections of code are accessed atomically; and easy ways for different threads to communicate.


There are a variety of complications to this system, one of the biggest being the handling of synchronization across different types of spawns. For example, a single function could spawn some work into the Cilk environment and other work using pthreads; we must then ensure that the generated synchronization code properly handles this situation. In addition, the extension must itself be built to be extensible, allowing new parallelization methods to be developed and easily integrated. As many features as possible should be designed to be extensible; the exact details of this depend on the final implementation of the parallel extension.

Other features such an extension might include are a data type based on lattice variables. The current lVars extension relies on standard mutexes to provide mutual exclusion; however, depending on the exact properties of the least-upper-bound function and the inputs to the variable, different manners of guaranteeing mutual exclusion may exist. For example, if the least-upper-bound function is very cheap and contention is reasonably low, using low-level atomic compare-and-exchange operations may provide better performance (a sketch of this approach appears below). With more expensive least-upper-bound functions, other manners of maintaining atomicity may be preferred, possibly including techniques such as a daemon thread that receives messages and performs all least-upper-bound computations for every thread that uses the variable. Allowing tuning like this to be performed through small code changes, instead of major rewrites, provides a great benefit to parallel programmers. Because of the flexibility provided by the parallelization system presented herein, as well as the features of extensible programming languages, we can design such a parallel programming language with a flexibility that cannot be achieved using monolithic programming languages, and with a relative ease that cannot be achieved by monolithically designing a new flexible parallel programming extension.
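As an example of the kind of tuning described above, the following sketch shows a lattice-variable update using a C11 compare-and-exchange loop rather than a mutex, for a deliberately cheap least-upper-bound function (max over integers). This is an illustration of the technique only; it is not code from the existing lVars extension, and the names lub and lvar_put are hypothetical.

/* Illustrative sketch: updating an integer-max lattice variable with a C11
 * compare-and-exchange loop instead of a mutex.  Useful only when the
 * least-upper-bound function is cheap and contention is low. */
#include <stdatomic.h>

static inline int lub(int a, int b) { return a > b ? a : b; }

void lvar_put(_Atomic int* lvar, int value) {
    int old = atomic_load(lvar);
    int next;
    do {
        next = lub(old, value);
        /* on failure, old is reloaded with the current value and we retry */
    } while (!atomic_compare_exchange_weak(lvar, &old, next));
}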

References

[1] Frigo, M., Leiserson, C. E., and Randall, K. H. The implementation of the Cilk-5 multithreaded language. In Proceedings of Programming Language Design and Implementation (PLDI) (New York, NY, USA, 1998), ACM, pp. 212–223.

[2] Kaminski, T., Kramer, L., Carlson, T., and Van Wyk, E. Reliable and automatic composition of language extensions to C: The ableC extensible language framework. Proceedings of the ACM on Programming Languages 1, OOPSLA (Oct. 2017), 98:1–98:29.

[3] Kaminski, T., and Van Wyk, E. Modular well-definedness analysis for attribute grammars. In Proceedings of the International Conference on Software Language Engineering (SLE) (Berlin, Germany, September 2012), vol. 7745 of LNCS, Springer, pp. 352–371.

[4] Schwerdfeger, A., and Van Wyk, E. Verifiable composition of deterministic grammars. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI) (New York, NY, USA, June 2009), ACM, pp. 199–210.

[5] Van Wyk, E., Bodin, D., Gao, J., and Krishnan, L. Silver: an extensible attribute grammar system. Science of Computer Programming 75, 1–2 (January 2010), 39–54.
