Extensible Parallel Programming in ableC

Aaron Councilman
Department of Computer Science and Engineering
University of Minnesota, Twin Cities
May 23, 2019

1 Introduction

There are many different manners of parallelizing code, and many different languages that provide such features. Different types of computations are best suited by different types of parallelism. Simply whether a computation is compute bound or I/O bound determines whether it will benefit from being run with more threads than the machine has cores, and other properties of a computation similarly affect how it performs when run in parallel. Thus, to allow parallel programmers to deliver the best performance for their programs, the ability to choose the parallel programming abstractions they use is important. The ability to combine these abstractions however they need is also important, since different parts of a program will have different performance properties and therefore may perform best using different abstractions. Unfortunately, parallel programming languages are often designed monolithically, built as entire languages with specific sets of features. Because of this, programmers' choice of parallel programming abstractions is generally limited to the choice of which language to use. Beyond limiting the available abstractions, this also means that the choice of abstractions must be made ahead of time, since any attempt to change the parallel programming language later is likely to be prohibitive, as it may require rewriting large portions of the codebase, if not the entire codebase.

Extensible programming languages can offer a solution to these problems. With an extensible compiler, the programmer chooses a base programming language and can then select the set of "extensions" for that language that best fit their needs. With this design, it is also possible to add new extensions at a later time if needed, allowing a developer to choose features as they find them necessary. With an extensible compiler, a programmer may therefore select multiple parallel programming abstractions. However, while extensible programming languages can resolve the problem of having little choice over the features of the language, they make no guarantees that the parallel extensions can actually work together.

In this work, we explain the motivation behind using extensible languages for parallel programming and discuss how existing parallel programming abstractions can be combined. We then develop a system that allows arbitrary parallel programming abstractions to work together and explore its implementation in the ableC framework. Finally, we discuss further work to be done on the ableC framework to improve support for parallel programming, and then discuss future work that can use the tools developed herein to support new features in parallel programming.

2 Background

We first provide a discussion of the systems that are essential to understanding this work. We begin with the ableC framework, followed by a discussion of the Cilk and OpenMP parallelization systems.

2.1 ableC

ableC is an extensible compiler framework that provides an extensible compiler for the C11 standard of the C programming language [2]. It is built using the Silver language, an attribute grammar system designed specifically for the construction of programming language specifications [5]. The Silver framework includes analyses which guarantee the composability of independently developed language extensions [3, 4].


(a) A Fibonacci function implemented in the MIT Cilk language:

cilk int fib(int n) {
    if (n < 2)
        return n;
    else {
        int x, y;
        x = spawn fib(n-1);
        y = spawn fib(n-2);
        sync;
        return x + y;
    }
}

(b) A Fibonacci function implemented in the ableC Cilk extension:

cilk int fib(int n) {
    if (n < 2)
        cilk return n;
    else {
        int x, y;
        spawn x = fib(n-1);
        spawn y = fib(n-2);
        sync;
        cilk return x + y;
    }
}

Figure 1: Examples of Cilk code

These analyses mean that programmers may choose any set of language extensions they want and still be guaranteed that a valid compiler can be automatically created. The analyses do, however, place some limitations on the exact construction of extensions, specifically on the concrete syntax that extensions may introduce. Because of this, some extensions based on existing programming languages must use different syntax than the languages they are based on.

2.2 Cilk

Cilk is a work-stealing parallel programming language and runtime system that provides provably good parallel performance [1]. The Cilk language was designed as a monolithic language, using the cilk2c program to compile Cilk programs to standard C code. The features of Cilk that this work focuses on are the ability to spawn and sync work. The spawn construct states that a certain function call can be performed asynchronously. Spawned function calls are assigned to a variable, whose value is updated once the function call completes. The sync construct guarantees that all spawned function calls have completed before execution continues. Because Cilk performs substantial transformations to the code, functions must be declared as Cilk functions by adding the cilk keyword to the function's declaration. This declaration causes modifications to the function's signature and body in the generated C code, as well as the generation of multiple copies of the function, used for different purposes within the system. A Cilk language extension for ableC that generates code similar to that of cilk2c has been previously described [2]; the work herein builds further upon this extension. Figure 1a shows how a simple Fibonacci function is written in Cilk, and Figure 1b shows how it is written in ableC, illustrating some of the differences required by the analyses mentioned above.

2.3 OpenMP

OpenMP is a parallel programming specification for C, C++, and Fortran that uses a fork-join model. Many compilers for these languages include support for OpenMP, including the GNU C compiler used in this work and by ableC. OpenMP provides a wide variety of parallelization features, but the feature this work focuses on is the OpenMP parallel for loop, which provides simple syntax for parallelizing the execution of a for loop whose iterations are independent. OpenMP provides a series of #pragma directives that are used to control parallel execution; an OpenMP parallel for loop is shown in Figure 2. OpenMP's fork-join model means that when a parallel for loop construct is reached, a number of threads are created to perform the work, and each thread exits when it finishes. The threads are then joined to synchronize all iterations of the loop before execution continues.


void map(int* arr, int len, int (*f)(int)) {
    #pragma omp parallel for
    for (int i = 0; i < len; i++) {
        arr[i] = f(arr[i]);
    }
}

Figure 2: Performing a map operation over an array using OpenMP parallelization

3 Motivation

In this section we explore the benefits that extensible programming languages can provide for parallel programming. We then explore problems with current parallel programming extensions that prevent this vision from being realized.

3.1 Combining Parallelization Methods

There are a variety of facets of a computation that affect how it performs when parallelized. As discussed above, being compute bound versus I/O bound is an easy example of this; a compute-bound computation will not see any performance benefit from utilizing more threads than the machine has cores, and doing so likely causes the process to slow down due to overhead and thread switching. On the other hand, programs that do large amounts of I/O, and therefore spend a lot of time blocking while waiting for I/O requests to be fulfilled, are I/O bound. In these situations, having more threads than cores can still lead to performance gains because thread switching can occur when a thread blocks: if another thread is ready to perform work when one thread blocks, the amount of time the machine spends idle waiting for threads to unblock is reduced. In fact, this idea applies to any computation that spends substantial amounts of time blocking, whether from I/O or from the use of mutexes to ensure mutual exclusion, though in the latter case the addition of extra threads can increase contention on the locks, which can adversely affect performance.

Even within the class of compute-bound problems, however, there are different types of problems that may perform better with certain parallelization methods. For example, we can compare the work-stealing method in Cilk and the fork-join method in OpenMP. In Cilk's work-stealing there exists a pool of worker threads, each with its own queue of function calls to execute. When a thread runs out of work, it steals a piece of work off of another thread's queue. This is a very useful thread model for problems that create many sub-problems, and where those sub-problems may be of differing sizes. On the other hand, in OpenMP's fork-join model threads are created dynamically as they are needed, and once a thread finishes its computation, it exits. This model is better for problems that do not create as many pieces of work and where each piece is expected to take roughly the same amount of time. While this model may seem less flexible than the work-stealing model, it also has less overhead: the only real overhead in the fork-join model is creating the threads, plus perhaps some setup code in each thread, whereas in the work-stealing model the current state of a function must be saved frequently in case the function is later stolen. Depending on a program's characteristics, the choice of method varies, and there are certainly other methods that could be used as well, each with its own performance characteristics.

Beyond the difficulty of knowing exactly which parallelization method will give the best performance for a certain program, there is the additional challenge that in many programs different pieces of the program have different performance characteristics. Many programs contain pieces of computation that are compute bound and others that are not. For example, in a multi-player computer game, the render engine is likely compute bound, while other parts of the game wait for keyboard input or network messages and are thus I/O bound. If the render engine uses a work-stealing thread model, there is added overhead for managing this that is not needed by the keyboard or network handlers, and so we would prefer that they not be part of the work-stealing.
However, it is possible that a message received from the network requires a complicated computation that we would like to run in the work-stealing model. Unfortunately, this ability for different parallelization systems to communicate and work together is often non-existent.


(a) Example using a non-Cilk function as the pthread function:

cilk int fib(int n);

void* func(void* arg) {
    int* p = arg;
    int n = *p;
    int res;

    res = spawn fib(n);
    sync;

    p = malloc(sizeof(int));
    *p = res;

    return p;
}

int main() {
    pthread_t thd;
    int n = 20;
    pthread_create(&thd, NULL, func, &n);
    int* res;
    pthread_join(thd, &res);
}

(b) Example using a Cilk function as the pthread function:

cilk int fib(int n);

cilk void* func(void* arg) {
    int* p = arg;
    int n = *p;
    int res;

    res = spawn fib(n);
    sync;

    p = malloc(sizeof(int));
    *p = res;

    return p;
}

int main() {
    pthread_t thd;
    int n = 20;
    pthread_create(&thd, NULL, func, &n);
    int* res;
    pthread_join(thd, &res);
}

Figure 3: Examples of attempting to use Cilk and pthreads together

For example, the Cilk system does not allow us to spawn work from functions not declared as Cilk functions; however, declaring a function as a Cilk function breaks the ability to use other parallelization methods with it. For instance, the POSIX standard defines the pthread interface for creating threads and running functions on them. Figure 3 shows two alternative attempts at using Cilk and pthreads together; neither of them is valid. Figure 3a fails to compile using cilk2c because spawn cannot be used from non-Cilk functions. Figure 3b fails to compile because the signature of func is changed, and so it no longer matches the signature required by the pthread interface. There are ways we could attempt to work around these problems, for example by passing the pointers used internally by Cilk to a function we run using pthreads, but if multiple threads share this same pointer we will encounter run-time errors, including segmentation faults. Therefore, even using an extensible compiler that allows us to choose the parallelization extensions we wish to use, we find that the extensions interact in ways, possibly only at run-time, that cause programs to fail. The guarantees provided by Silver that extensions will be composable do not extend to whether they will work together in practice.

3.2 Current Combinations

As discussed above, there are problems when combining different parallelization methods. Above we discussed how Cilk cannot be run from code that is running on a newly created pthread. While this example shows a limitation, the opposite is possible: creating pthreads from within Cilk code works, since Cilk can run normal functions and can therefore run all the functions necessary to create a pthread. An asymmetry like this is common with Cilk and other parallelization features, as we will explore further. A note of caution with using Cilk to create pthreads: this combination may be more prone to deadlock, especially if the threads are joined within the Cilk code (that is, the Cilk function waits for the thread to exit) and these threads utilize synchronization methods such as barriers. Further discussion of this problem is omitted as it is not generally solvable; solutions to avoid it may exist, but they are beyond the scope of the current work.

We encounter similar problems when attempting to combine OpenMP and Cilk.


First, the cilk2c compiler does not support the syntax of OpenMP, but it is possible to run the cilk2c compiler to generate standard C code and then add the OpenMP syntax to this code before running a standard compiler. However, several problems arise when trying to do this. If we place an OpenMP parallel loop in a function that is not a Cilk function and attempt to spawn Cilk functions from within the loop, we receive errors because Cilk does not support spawns from non-Cilk functions. We can, however, place an OpenMP loop within a Cilk function. Unfortunately, we cannot place spawns within that loop: by its design, Cilk uses labels and goto statements to jump to points in the code near spawns and syncs, and a goto to a label within an OpenMP loop is prohibited, for good reason. Therefore, as with pthreads, Cilk and OpenMP work together only so long as one places OpenMP code within Cilk code, not Cilk code within OpenMP.

While we do have some ability to combine parallel programming models, it is also possible to find combinations that do not work together at all. Clearly, the construction of Cilk makes it impossible for Cilk code to be executed from non-Cilk functions. Another parallel programming system we wish to use could have a similar limitation, in which case it would be impossible for it and Cilk to truly work together. None of this discussion should be construed to state that these features cannot be used side-by-side: nothing prevents the creation of pthreads that run separately from Cilk. The problem we identify here is allowing these systems to work together. Of course, if properly constructed this could still be managed; a program with Cilk and pthreads simply must create data structures that allow the two systems to communicate with each other and monitor the communication from the other system. However, this requires that the programmer write code to handle this communication, meaning they must write such code for every parallelization system they wish to use in their program. What we propose, therefore, is a back-end that helps to handle these communications, shifting some of the burden onto the extension developer, but removing it from the programmer.
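Before turning to the design, the following sketch makes concrete the OpenMP and Cilk nesting described above that the current tools reject. It is not an example from any existing codebase; the function names are illustrative. The spawn inside the loop is the problem: cilk2c compiles spawns and syncs using labels and gotos, and a goto may not jump to a label inside an OpenMP structured block.

/* Illustrative only: an OpenMP parallel loop inside a Cilk function whose
 * body spawns Cilk work.  As described above, this combination cannot be
 * compiled by cilk2c plus a standard OpenMP-capable C compiler. */
cilk int work(int x);

cilk void process_all(int* data, int len) {
    #pragma omp parallel for
    for (int i = 0; i < len; i++) {
        data[i] = spawn work(data[i]);   /* rejected: spawn inside the loop */
    }
    sync;
}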

4 Design

As discussed above, the problem we encounter when combining parallel systems is getting them to work together. Since the systems can be run simultaneously as long as they operate separately, the solution to allow them to work together is simply to provide them with a means of communicating. Since this system mediates all interactions between parallel extensions, it should be as light-weight as possible, so that the improved performance resulting from the flexibility to choose the ideal set of parallel extensions is not offset by the extra time this communication requires.

The bulk of this communication system is implemented with four methods: init_thread_pool, send_message, send_response, and receive_response. The init_thread_pool method is called at the beginning of the program's execution; it creates a pool of a specified number of threads and associates a specified id with that pool, which can later be used to look up that specific thread pool. Associated with the thread pool are a message-handler function and a pointer that is used by the message handler; both are used when sending a message to the thread pool.

The send_message function can be called with a pointer to send a message to a specified thread pool. To do this, the message-handler function associated with the thread pool is called with three arguments: the message that was provided to send_message, the pointer associated with the thread pool, and an object that can be used to send a response to the message. The message handler is expected to extract whatever data is necessary from the message and use the pointer to manipulate its internal data structures appropriately. It may then use the send_response method to respond to the message.

When send_message is called, it returns an object that allows the caller to receive a response to the original message. This response can be used to send information back to the caller in a number of ways: the response itself takes the form of a pointer, which could point to a data structure containing information in response to the message, or a response might not be sent until the request specified in the message has been satisfied, in which case the response acts as a form of synchronization. The receive_response method takes the object returned from send_message, waits until a response has been sent, and then returns the response. Other methods in the system are used to set up and destroy internal data structures and to perform cleanup at exit, ensuring that threads are not left running, which could be reported as memory leaks.
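To make the shape of this interface concrete, the following sketch shows one way the four methods might be declared in C. The type names (response_handle_t, responder_t, message_handler_t), the argument orders, and the use of integer pool ids are assumptions made for illustration; they are not the actual signatures used by the implementation.

/* Hypothetical sketch of the communication system's interface; all names and
 * signatures here are illustrative assumptions, not the actual ableC API. */

/* Handle returned by send_message, used to wait for a response. */
typedef struct response_handle response_handle_t;

/* Object passed to a message handler so that it can reply. */
typedef struct responder responder_t;

/* A message handler receives the message, the pointer registered with the
 * thread pool, and a responder it may use to send a response. */
typedef void (*message_handler_t)(void* message, void* pool_data,
                                  responder_t* responder);

/* Create a pool of num_threads threads identified by pool_id, with the given
 * message handler and handler data. */
void init_thread_pool(int pool_id, int num_threads,
                      message_handler_t handler, void* pool_data);

/* Send a message to the pool identified by pool_id; the returned handle can
 * later be passed to receive_response. */
response_handle_t* send_message(int pool_id, void* message);

/* Called from inside a message handler to reply to the sender. */
void send_response(responder_t* responder, void* response);

/* Block until a response arrives for the given handle, then return it. */
void* receive_response(response_handle_t* handle);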


Having described the system, we will now present a brief overview of how two ableC extensions make use of this system. The first enables the existing Cilk extension to handle spawns from non-Cilk functions. The second provides parallel for loops using OpenMP style syntax.

4.1 Cilk

The preexisting Cilk extension operates very similarly to the cilk2c compiler designed for Cilk-5. In this work we added code to the extension specification to handle spawn and sync statements that are not in Cilk functions. When a spawn statement is encountered outside a Cilk function, it is translated into a message that is sent to the Cilk thread pool using the new communication system. This message contains several pieces of information: the address of the variable the function's return value is to be written to (assuming the value is assigned to a variable), a function pointer to the function to be executed, the values of the arguments to that function, and a pointer to an integer (the join counter), a lock, and a condition variable. The argument values are placed into a struct that is declared with the appropriate fields for the function being spawned, and the pointer to this struct is passed in the message. The integer, lock, and condition variable are used to implement the synchronization method of this system, and are actually passed in the same struct as the arguments. Each non-Cilk function that makes use of Cilk spawn and sync statements has these three variables lifted to its outermost scope. The join counter counts the number of spawns that have been made, and is decremented when a spawned function finishes and has written back its value. The lock and condition variable are simply used to maintain mutual exclusion and avoid busy waiting.

The remaining component of the message to discuss is the function pointer. When a Cilk function is declared, the Cilk extension creates two different versions of it: a fast clone and a slow clone. To allow for communication from outside of Cilk functions, we now also create one additional version of the function, a wrapper, whose purpose is to extract the arguments from the argument struct, call the desired function, and then sync. Once the function has finished executing, this wrapper writes the value back appropriately and then updates the join counter. We use this wrapper function because it must handle the write-back and join-counter update correctly, which cannot be done by the traditional fast or slow clones generated by Cilk. To actually inject the desired function call into the Cilk workers' deques, the message handler for the Cilk thread pool takes this message and transforms it into a Closure that is then placed onto the deque of a randomly selected Cilk thread.

To implement the sync from non-Cilk functions, we simply wait for the join counter to equal zero, using the lock and condition variable to ensure that the calling thread blocks while waiting instead of busy waiting. Ideally, we would find some way for the calling thread to help perform any of the remaining Cilk work that is being waited on. Unfortunately, doing so is very complicated, as threads cannot safely share Cilk deques; while it might be possible to implement this functionality, we have not done so due to the technical challenges.

Figure 4 shows how Cilk features can be used from non-Cilk functions. It also shows a few added function calls at the beginning of main. These are used to set up the parallelization system and then the Cilk thread pool specifically. The latter function uses Cilk's internal methods to perform the necessary setup before using the init_thread_pool function described above to create the desired threads.
These threads simply run the Cilk child main function defined in the Cilk-5 runtime system, which is the function that is run when Cilk’s runtime itself is used to launch threads.
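As a rough illustration of the translation described above, the following sketch shows the kind of code that might be generated for "res = spawn fib(n); ... sync;" inside a non-Cilk function. The struct layout, the wrapper declaration, the CILK_POOL_ID constant, and the simplified send_message signature are assumptions made for illustration; the actual generated code differs in its details.

/* Hypothetical sketch of generated code for a spawn and sync in a non-Cilk
 * function; names, layout, and the pool id are illustrative assumptions. */
#include <pthread.h>

void* send_message(int pool_id, void* msg);  /* response handle unused here */
#define CILK_POOL_ID 0                       /* assumed id of the Cilk pool */

struct fib_spawn_msg {
    int* result_addr;            /* where the spawned call writes its result */
    void (*wrapper)(void*);      /* wrapper clone generated for fib */
    int n;                       /* argument value for fib */
    int* join_counter;           /* outstanding spawns in the calling function */
    pthread_mutex_t* lock;
    pthread_cond_t* cond;
};

/* Generated wrapper clone: runs on a Cilk worker, calls fib, writes the
 * result back, and decrements the join counter (body elided here). */
void fib_wrapper(void* msg);

void caller(int n) {
    int res;
    /* synchronization state lifted to the function's outermost scope */
    int join_counter = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

    /* translation of: res = spawn fib(n); */
    struct fib_spawn_msg msg =
        { &res, fib_wrapper, n, &join_counter, &lock, &cond };
    pthread_mutex_lock(&lock);
    join_counter++;
    pthread_mutex_unlock(&lock);
    send_message(CILK_POOL_ID, &msg);

    /* translation of: sync; -- wait until all spawns have written back */
    pthread_mutex_lock(&lock);
    while (join_counter != 0)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}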

4.2 OpenMP

As part of this work we have also created an OpenMP-inspired extension for parallel for loops. This extension, however, does not use the fork-join model described above that OpenMP itself uses; instead, it creates a thread pool through the parallelization system, and when a loop is encountered, messages are sent to the threads through work queues. With this method we have seen performance nearly as good as that of OpenMP as implemented in GCC.

To implement the parallel loops, we must lift the loop body, along with some setup code, into a new function that can be called from the other threads. An added complication is introduced by the fact that OpenMP is a shared-memory parallelization model. Because of this, variables that are declared outside of the loop body are shared between loop iterations, and therefore all updates to them must be visible in all iterations (except for those variables specifically listed as private). To support this, the lifted function is provided a list of pointers, where each pointer points to the appropriate variable in the procedure that used the parallel loop.


cilk int fib(int n);

int main() {
    setup_thread_system();
    init_cilk_ableC(2);

    int* arr = malloc(sizeof(int) * 30);

    for (int i = 0; i < 30; i++) {
        spawn arr[i] = fib(i);
    }

    sync;
}

Figure 4: Making use of Cilk constructs from a non-Cilk method

Then, within the loop body, each variable access is replaced with an access to the appropriate entry in the list of pointers, which is cast to the correct pointer type and dereferenced. To achieve this transformation, the loop body is placed into an environment in which each shared variable is given a shared qualifier, causing the code placed in the lifted function to be generated using the variable-access method just described. In this function, before the injected loop body, code is added to calculate the range of index values that a given thread should execute. Then, in the procedure that contained the parallel loop, code is added that sends a message using the parallelization system. This message includes a function pointer to the lifted function as well as the list of pointers described previously. The list also includes pointers to variables holding the start and stop conditions for the loop, as well as the value by which the loop variable is incremented each iteration. Sending this message places work requests into the work queues of each thread in the pool.

Since OpenMP automatically synchronizes at the end of loops, immediately after the message has been sent the code receives a response. This response contains a work request for one portion of the loop and a pointer to an integer that counts how many threads remain to be synchronized. The work request is then processed, and after that the thread waits until the counter is decremented to zero. This ensures that the thread that originally started the parallel loop performs one part of the loop itself. Since each section of the loop should involve roughly the same amount of computation, each is expected to take about the same amount of time, minimizing the time that thread spends waiting for the rest of the loop to finish. To allow nested parallel loops (though they should always be used cautiously), there is one additional complication to the loop's synchronization: after completing its assigned portion of the loop, the thread checks the counter as described, but if the counter is not yet zero it also checks its own work queue and, if any items are present, executes that work request.
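As a rough sketch of the lifting described above, consider the loop from Figure 2. The code below shows what the lifted loop body might look like; the function name, the layout of the shared-pointer list, and the chunking of the iteration space are illustrative assumptions rather than the extension's actual output.

/* Hypothetical sketch of a lifted parallel-for body; names, the shared[]
 * layout, and the chunking logic are illustrative assumptions. */

/* At the call site (sketch), the extension would pack pointers to the shared
 * variables, e.g. shared[0] = &arr, shared[1] = &len, shared[2] = &f, and
 * send a message containing this function pointer and the shared list. */
static void loop_body_0(void** shared, int thread_num, int num_threads) {
    /* recover the shared variables through the pointer list */
    int*  arr = *(int**)shared[0];
    int   len = *(int*)shared[1];
    int (*f)(int) = *(int (**)(int))shared[2];

    /* compute the range of iterations this thread should execute */
    int chunk = (len + num_threads - 1) / num_threads;
    int start = thread_num * chunk;
    int stop  = (start + chunk < len) ? start + chunk : len;

    /* injected loop body, with shared accesses rewritten as above */
    for (int i = start; i < stop; i++)
        arr[i] = f(arr[i]);
}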

4.3 Extension Combinations

To demonstrate that the tools developed in this work provide meaningful new features, we now explore how these two extensions can be combined in ways not possible without them. As discussed above, it is not possible for code within an OpenMP loop to use Cilk spawns, and similarly it is not possible to spawn Cilk work from within a pthread. Because the Cilk extension now allows non-Cilk functions to spawn Cilk work, the problem with pthreads is resolved. The problem with OpenMP is more complicated: an OpenMP-style loop within a non-Cilk function can now spawn Cilk work, but an OpenMP-style loop within a Cilk function currently cannot spawn Cilk work in a valid way. The issue is caused by how Cilk determines whether or not it is in a Cilk function. The Cilk extension tracks whether it is in a Cilk function by using a #define in the global scope; when the lifted loop-body function is injected into the global scope, it keeps the same global environment, and so the loop body believes itself to be in a Cilk function. To behave properly, we need the loop body to be treated as if it were not in a Cilk function, since it will be run by multiple threads.


cilk void process(char* request);

void* input_handler(void* socket) {
    int fd = *((int*) socket);
    int len;
    char buffer[BUFF_SIZE];

    while (len = read(fd, buffer, BUFF_SIZE)) {
        char* request = malloc(sizeof(char) * (len + 1));
        memcpy(request, buffer, sizeof(char) * len);
        request[len] = 0;
        spawn process(request);
    }

    sync;

    return NULL;
}

int main() {
    setup_thread_system();
    init_cilk_ableC();

    // ... details for setting up network elided ...

    pthread_t thread;
    pthread_create(&thread, NULL, input_handler, &socket);

    // ... further work done on the main thread ...

    pthread_join(thread, NULL);

    return 0;
}

Figure 5: Making use of Cilk and pthreads together

Modifying how the Cilk extension tracks whether it is within a Cilk function could fix this situation, but would require further work. Even without this, however, we have shown new capabilities that were not possible previously. In addition, if the parallel loop is placed within a non-Cilk function (and that function is called from within Cilk), we can simulate this behavior with only slight modifications to the code. A demonstration outlining a possible use of pthreads and Cilk together is shown in Figure 5.

5 Future Work

Finally, we discuss future work to be done on this parallelization system, particularly integrating it better into ableC so that it automatically handles several details currently left to the programmer, as well as work that we propose is possible using this system but that would not be possible without it.

5.1 Integration in ableC

Currently there are several features that require the programmer to write code manually, or require hard-coding of certain aspects within extensions, that we would rather have handled automatically by the extensions and ableC.


This can be seen in Figures 4 and 5, where calls to setup_thread_system and init_cilk_ableC are made to set up Cilk's internal data structures and to create the Cilk thread pool. This is code that simply needs to be injected at the beginning of main, and can therefore be injected relatively easily.

There are two more elements that we wish to integrate into ableC itself. The first is the assignment of thread-pool numbers. Each extension that uses the parallelization system has an integer id within the system that is used to send messages to its thread pool. Currently, this number is hard-coded into the extensions, but it should be generated automatically, since the number must be different for each extension.

Finally, as mentioned above, we would also like to add a method of resource allocation to the system. Currently, the number of threads each extension receives is hard-coded as an argument to that extension's setup function. We would like these decisions to be made automatically by generated code. Furthermore, we would ideally like this feature to allocate threads at runtime, allowing an executable to run on machines with varying numbers of cores while still taking full advantage of all available cores, without requiring the user to intervene.

To accomplish this, we propose adding a Parallelization nonterminal to ableC. This nonterminal would have a single production onto which a new extension could aspect a number of attributes. These attributes would give ableC a complete list of the extensions that use the parallelization system, and would therefore provide a way of generating a mapping from extensions to extension numbers. Other attributes would provide the setup code necessary for each extension and could provide information about the extension's use of threads, such as the minimum or maximum number of threads it can use. These properties can be important for extensions that use thread barriers: if, for instance, an extension generates code that requires 3 threads to synchronize at the same time, providing the extension with fewer threads can cause deadlock. Further attributes in this collection could be used by extensions that do not use the parallelization system but do create new threads to report that fact, which can be useful when deciding how to allocate threads. Finally, we are considering adding syntax that would allow the programmer to specify how threads should be distributed, ideally as the percentage of threads to be provided to each extension. At runtime this information, along with the number of cores on the machine and the other information provided by the extensions, would be used to allocate resources as close to the programmer's request as possible.

This last aspect, the allocation of threads, is the most complicated of these features, as we must decide what information about extensions' use of threads we are interested in and how to weigh all of these factors when allocating threads. For example, if we are running on a machine with 2 cores and use three extensions that rely on the parallelization system, but two of the extensions each require at minimum 2 threads, how do we allocate threads? Given these minimums, how do we honor the programmer's requested thread allocation? In the case described, what if the programmer states that the third extension should get 50% of the threads, with the other two each receiving 25%?
Should the third extension receive 4 threads? Or 1 thread, since that is 50% of the cores on the machine? We might also be interested in tracking whether an extension's threads are compute bound or block often: it might be reasonable to allocate 8 threads on a 2-core machine if most threads will spend their time blocked, but that would be a poor decision if all threads are compute bound, blocking only occasionally or never.

5.2 Further Parallel Extensions

In addition to the work to be done on ableC itself, there is of course work that could be done to add more parallel extensions to the ableC framework that utilize the parallelization system created herein. Beyond that, we also believe that the flexibility this parallelization system provides will allow us to create a flexible parallel programming language extension for ableC. Taking inspiration from the Halide programming language, and especially the Halide language extension to ableC, we propose a parallel programming extension that separates the writing of parallel programs from the manner in which the programs are parallelized. For instance, we could provide constructs to guarantee mutual exclusion for certain sections of code, and then allow the manner in which mutual exclusion is enforced (whether through mutexes, spin-locks, or even low-level locks combined with compare-and-exchange operations) to be modified through simple tags on those sections or elsewhere in the code. Such an extension would provide, at a minimum, a manner of spawning work and synchronizing it; a manner of specifying that certain variables and sections of code are accessed atomically; and easy ways for different threads to communicate.


There are a variety of complications to this system, one of the biggest being the handling of synchronization across different types of spawns. For example, a single function could spawn some work into the Cilk environment and other work using pthreads; we must then ensure that the generated synchronization code properly handles this situation. In addition, the extension must itself be built to be extensible, allowing new parallelization methods to be developed and easily integrated. As many features as possible should be designed to be extensible; the exact details of this depend on the final implementation of the parallel extension.

Other features such an extension might include are a data type based on lattice variables. The current lVars extension relies on standard mutexes to provide mutual exclusion; however, depending on the exact properties of the least-upper-bound function and the inputs to the variable, different manners of guaranteeing mutual exclusion may exist. For example, if the least-upper-bound function is very cheap and contention is reasonably low, using low-level atomic compare-and-exchange operations may provide better performance (a sketch of this approach appears below). With more expensive least-upper-bound functions, other manners of maintaining atomicity may be preferred, possibly including techniques such as a daemon thread that receives messages and performs all least-upper-bound computations for every thread that uses the variable. Allowing tuning like this to be performed through small code changes, instead of major rewrites, provides a great benefit to parallel programmers. Because of the flexibility provided by the parallelization system presented herein, as well as the features of extensible programming languages, we can design such a parallel programming language with a flexibility that cannot be achieved using monolithic programming languages, and with a relative ease that cannot be achieved by monolithically designing a new flexible parallel programming extension.
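As an example of the kind of tuning described above, the following sketch shows a lattice-variable update using a C11 compare-and-exchange loop rather than a mutex, for a deliberately cheap least-upper-bound function (max over integers). This is an illustration of the technique only; it is not code from the existing lVars extension, and the names lub and lvar_put are hypothetical.

/* Illustrative sketch: updating an integer-max lattice variable with a C11
 * compare-and-exchange loop instead of a mutex.  Useful only when the
 * least-upper-bound function is cheap and contention is low. */
#include <stdatomic.h>

static inline int lub(int a, int b) { return a > b ? a : b; }

void lvar_put(_Atomic int* lvar, int value) {
    int old = atomic_load(lvar);
    int next;
    do {
        next = lub(old, value);
        /* on failure, old is reloaded with the current value and we retry */
    } while (!atomic_compare_exchange_weak(lvar, &old, next));
}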

References

[1] Frigo, M., Leiserson, C. E., and Randall, K. H. The implementation of the Cilk-5 multithreaded language. In Proceedings of Programming Language Design and Implementation (PLDI) (New York, NY, USA, 1998), ACM, pp. 212–223.

[2] Kaminski, T., Kramer, L., Carlson, T., and Van Wyk, E. Reliable and automatic composition of language extensions to C: The ableC extensible language framework. Proceedings of the ACM on Programming Languages 1, OOPSLA (Oct. 2017), 98:1–98:29.

[3] Kaminski, T., and Van Wyk, E. Modular well-definedness analysis for attribute grammars. In Proceedings of the International Conference on Software Language Engineering (SLE) (Berlin, Germany, September 2012), vol. 7745 of LNCS, Springer, pp. 352–371.

[4] Schwerdfeger, A., and Van Wyk, E. Verifiable composition of deterministic grammars. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI) (New York, NY, USA, June 2009), ACM, pp. 199–210.

[5] Van Wyk, E., Bodin, D., Gao, J., and Krishnan, L. Silver: an extensible attribute grammar system. Science of Computer Programming 75, 1–2 (January 2010), 39–54.
