
OpenCLunk: a Cilk-like framework for the GPGPU

Lena Olson, Markus Peloquin
{lena, markus}@cs
December 20, 2010

Abstract

The potential for GPUs to attain high speedups over traditional multicore systems has made them a popular target for parallel programmers. However, they are notoriously difficult to program and debug. Meanwhile, there are several programming models in the multicore domain which are designed to be simple to design and program for, such as Cilk. OpenCLunk attempts to combine Cilk's ease of programming with the good performance of the GPU. We implemented a Cilk-like task-parallel programming framework for the GPU, and evaluated its performance. The results indicate that while it is possible to implement such a system, it is not possible to attain good performance, and therefore a Cilk-like system is currently not feasible as a programming model for the GPU.

1 Introduction

Graphical Processing Units (GPUs) are becoming more and more common, and have recently started being used for purposes outside of graphics acceleration. They can enable some programs to attain large speedups; speedups of greater than 100 are frequently reported, and parallel programmers now commonly make use of them. However, they are also much harder to program for than traditional multiprocessors, and debugging is difficult. One of the main reasons for this is that they are a much more recent innovation than CPUs, and as such, there has not been time for as many helpful tools to be developed. Another reason is that they are more limited than CPUs in what they support, so programmers commonly have to put work into changing their algorithms to work on them (for example, rewriting recursion-based code to use stacks instead).

In contrast, there are a number of programming models, tools, and languages that exist to exploit parallelism on CPUs. One example is the task-based parallel language Cilk. It uses a simple but powerful extension of C to allow programmers to easily write parallel code.

We decided to try to combine the elegance of Cilk with the acceleration of the GPU. From the start, we knew that this would not be simple. Cilk is designed to be a task-parallel language, and GPUs excel at data parallelism. However, we thought that it might be advantageous to have a model that was simple to program, even if the speedup was more modest than the exceptional cases often cited. Therefore, we had several goals. The first was to simply implement a Cilk-like model, where Cilk programs could be easily ported and run: something simple enough for a compiler to target. Our second goal was to use some benchmarks to evaluate the performance of this model, and to prove that it is possible to run programs in a task-parallel manner upon a GPU. Finally, we wanted to characterize the performance of our benchmarks and determine the feasibility of using our implementation as a tool to attain speedup on a GPU.

There has been other work on adapting existing CPU parallel programming platforms onto GPGPUs. An OpenMP to CUDA compiler extension called O2G was developed with generally positive results [4]. Since both platforms are already data-parallel, the authors focused mostly on modifying code to fit the GPGPU's memory model. Because it is already fairly well understood how to write data-parallel code for the GPU, and because there are already tools to aid in development and debugging, we wanted to focus on another model, and a task-parallel model has not been widely studied for the GPU. Hence, we chose to implement a Cilk-like model.

Section 2 gives a background of the technologies involved. Section 3 describes the details of our implementation's design and interface. Evaluation is covered in Section 4. We discuss the OpenCLunk model in Section 5 and conclude in Section 6.

2 Background

2.1 GPGPUs

As the difficulty of improving uniprocessor performance has increased, the landscape has changed to include other architectures besides CPUs. GPUs have been used to accelerate graphics calculations for some time, but recently they have been repurposed for general-purpose computation, under the name GPGPUs. Calculations are offloaded from the CPU onto the GPU and executed there.

GPGPUs differ from CPUs in a number of ways. They are intended for data-level parallelism: many threads all executing the same instructions at the same time. They have far more cores than typical multicore systems; they also can run hundreds of threads simultaneously. Transcendental functions are implemented in hardware, allowing them to be very fast, especially for single-precision floating-point operations.

However, there are some disadvantages of GPUs as opposed to CPUs. Threads are executed in thread groups, typically of 32 threads. Within these groups, all threads execute the same instructions. If there is an if condition that only evaluates true for some threads, a form of predicated execution will disable the threads for which the if condition was false, then perform the instructions in the if block. Afterwards, those threads may be disabled and the others enabled while the instructions in an else block are executed. This is called divergence, and it limits performance. Another limitation is in what types of code can be run on the GPU. One limitation that is particularly salient to our task was that recursive calls are not possible, since their use of memory is unbounded.

2.2 OpenCL

OpenCL (Open Computing Language) is a C-based language for writing kernels, which can then be executed on GPUs and CPUs [2]. It is portable between different manufacturers: it can be used for both ATI's and NVIDIA's devices. That was a main motivation in choosing OpenCL over CUDA. OpenCL claims to support task-level parallelism in the form of command queues and the option CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE [3]. However, this is not supported on most architectures. Unfortunately, the option is not so much an instruction to OpenCL as it is a suggestion: no errors or warnings are given when it is used on hardware that does not support it. Since the hardware we had access to does not support out-of-order execution, we implemented a framework using other features, which we will describe later.
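To make this concrete, the following host-code fragment (ours, not from OpenCLunk, using the standard OpenCL 1.0 API) requests an out-of-order queue. As described above, on the hardware we used the request is accepted without any error or warning even though the queue is executed in order.

    #include <CL/cl.h>

    /* Sketch: ask for out-of-order execution on a command queue. */
    cl_command_queue make_ooo_queue(cl_context ctx, cl_device_id dev)
    {
        cl_int err;
        cl_command_queue queue = clCreateCommandQueue(ctx, dev,
            CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
        /* In our experience, err is CL_SUCCESS here even when the
         * device serializes the queue; the flag is only a hint. */
        return queue;
    }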
2.3 Cilk

Cilk is an extension of the C programming language that enables easy parallelization of programs [1]. It uses a task-based parallelism model. If the Cilk keywords are removed from a Cilk program (the C elision is taken), the program executes correctly in serial, simplifying program design and debugging. The main Cilk keywords are spawn and sync. The spawn keyword is used to indicate that a function call can run in parallel with the main method and other spawned functions. The sync keyword pauses the execution of a function until all of the previous spawns in the same function call have completed. It is common for Cilk programs to be recursive, with each function making multiple recursive calls to itself.

These keywords are implemented in Cilk using work-stealing queues. When a thread spawns a new function, it adds it to the head of its work queue. Other threads can steal from the tail of other queues when they exhaust the work in their own queue. When a thread steals, it converts the function to a slower version that executes synchronization statements; in the common case, synchronization is unnecessary because by the time the thread reaches a sync statement, it would have executed all previous instructions anyway. This allows for very cheap spawns, which is part of why Cilk has very good performance.

We chose to use Cilk as the model for our system because it is an easy-to-use programming model, yet powerful enough to implement a wide variety of programs. Because our platform was different, however, we did not use the same design for our implementation; in particular, we made no use of work-stealing queues.
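As an illustration of these keywords, the recursive Fibonacci function in Cilk-5 syntax is shown below (our example, following the usual textbook form); deleting the cilk, spawn, and sync keywords yields exactly the C elision described above.

    cilk int fib(int n)
    {
        if (n < 2) return n;
        int x = spawn fib(n - 1);  /* may run in parallel */
        int y = spawn fib(n - 2);
        sync;                      /* wait for both spawns to finish */
        return x + y;
    }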

3 OpenCLunk

OpenCLunk is our implementation of a task-parallel model for GPGPUs. Its name is a combination of OpenCL, Cilk, and 'clunk', and serves to describe both its roots and its performance.

3.1 Task execution

The way tasks execute in OpenCLunk is similar to what happens in Cilk, but necessarily differs in some key ways. The first and most important problem to solve was how to spawn tasks. As the GPGPU environment is quite restrictive, our solution was for a task to return whatever arguments would have been used in the spawn invocations back to the CPU. These spawn arguments then serve as input arguments for subsequent spawned tasks. The only other thing a task might respond with is a return value, corresponding with the base case of a recursive function. As a simplification, we force the number of spawns to be constant across all tasks.

Tasks are not actually spawned until the parent tasks return to the CPU. The sync step from Cilk can therefore only be implicit. We also needed to support code that would run after the spawned tasks complete. This might include summing the results of the spawned tasks or finding the maximum. We decided for simplicity to have the post-sync code, the reduction, execute on the CPU. This is acceptable since reductions rarely do intensive work, and the additional migration costs for this simple action would be prohibitively large.

3.2 Task groups

Tasks are executed in a series of task groups. The first task group is merely the root task. It will be sent the initial arguments that the user specified, but it is otherwise indistinguishable from the other task groups. Each subsequent task group is composed exclusively of tasks that were spawned off of the previous group. Since tasks all take the same amount of time to execute, each level of the execution tree can be invoked at once, reducing the communication traffic with the GPGPU.

It would have been convenient to use the spawn arguments of one task group as the arguments for the subsequent task group. One benefit of this approach would have been less memory bandwidth: once a task group completes, it need only transfer back some relatively small status information. But this has the problem of leaving holes in the arguments, since not all tasks will spawn. As subsequent levels of spawns execute, those holes take up more and more memory. Our approach was instead to copy the spawn arguments back to the CPU. Task groups would then be rearranged so that all spawning tasks would be at the front, moving the remainder to the back (see the sketch below).

One of our previous approaches was to have each group consist solely of the spawned tasks of the previous group. This offered a more fluid execution, but it had higher memory requirements for the task groups and additional communication with the GPGPU. It also depended on the OpenCL device supporting out-of-order execution, which our hardware did not support. It did, however, simplify the reduction and did not require rearranging the task groups, so it had some computational advantage.
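The rearranging step amounts to a stable partition of the completed group. The sketch below is ours and uses hypothetical record and field names (in OpenCLunk the fields live in separate OpenCL buffers indexed by task id), but the operation is the same.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Hypothetical per-task record (illustration only).
    struct Task_record {
        uint32_t spawned;  // did this task request child tasks?
        uint32_t result;   // base-case result, if it did not spawn
    };

    // Move all spawning tasks to the front of the group. A stable
    // partition preserves their relative order, so their spawn
    // arguments remain contiguous for the next task group.
    static void rearrange(std::vector<Task_record> &group)
    {
        std::stable_partition(group.begin(), group.end(),
            [](const Task_record &t) { return t.spawned != 0; });
    }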

3.3 Dependencies

All tasks have a result value, but it is often dependent on the results of the tasks the parent has spawned. Whenever a task group completes execution, OpenCLunk scans through the completed tasks in this group. For each completed task, the value is copied into a buffer belonging to its parent. If this was the last task to add its result to its parent, the reduction operation is performed, and the parent takes on that result. This process repeats when the parent in turn copies its new result to its parent.

3.4 Execution process

When a user runs an OpenCLunk function, a loop begins that terminates when the root gets a value. Before the loop, a root task is created and invoked on the GPGPU. On completion, OpenCL will asynchronously add completed task groups to a queue in OpenCLunk using event callback functions. Each completed task group is handled within the loop. First, a task group is removed from the queue. Its tasks are rearranged before making a new child task group, which is subsequently sent to OpenCL for execution. Once they are added to OpenCL's command queue, dependencies are handled and reductions performed. The last task group will be able to resolve all remaining dependencies, and OpenCLunk returns the final value.
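A minimal CPU-side sketch of the dependency handling from Section 3.3 follows; the names are ours, not OpenCLunk's. Each parent counts down its outstanding children, and the last child to report triggers the reduction and pushes the new value one level up.

    #include <cstddef>
    #include <vector>

    struct Node {
        Node *parent = nullptr;
        std::vector<unsigned> child_results;  // one slot per child
        std::size_t pending = 0;              // children not yet reported
        std::size_t slot = 0;                 // index in parent's buffer
        unsigned result = 0;
    };

    // fib-style reduction: sum the child results.
    static unsigned reduce(const std::vector<unsigned> &r)
    {
        unsigned sum = 0;
        for (unsigned v : r) sum += v;
        return sum;
    }

    static void complete(Node *n, unsigned value)
    {
        n->result = value;
        Node *p = n->parent;
        if (p == nullptr) return;      // the root now has the final value
        p->child_results[n->slot] = value;
        if (--p->pending == 0)         // last child: reduce, then recurse up
            complete(p, reduce(p->child_results));
    }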

3.5 Restrictions

Tasks may only choose to spawn or not to spawn, so we require that a task always spawn a constant number of times. This is sufficient for many benchmarks, and simplifies the design. It would be possible to allow tasks to spawn up to N times by returning a bitmask of N bits (⌈N/32⌉ 32-bit words). This would have caused the code in the user-level kernel functions, the code performing the task-group rearranging, and the code performing reductions to become more complicated.

As it currently stands, OpenCLunk can only handle problem sizes up to some threshold. Task groups require a particularly large buffer for their spawn arguments; for recursive Fibonacci, it doubles in size nearly every time. Resources can be exhausted, and so global work sizes have per-algorithm limits. The way to get around this would be to divide the tasks into two task groups. The two task groups would need to be executed in serial, and task group sizes would never decrease. The consequence of this situation is that throughput would become constant. For the purpose of benchmarking when resources are exhausted, it improves little. The average speedup could increase, but only up to some limit. For simplicity then, we do not implement this capability.

There are also some limitations in what features OpenCLunk supports. OpenCLunk is similar to Cilk, but does not implement all of its features. For example, Cilk includes an explicit abort keyword to allow all tasks to be stopped. This is useful in cases like searches where only the first result needs to be returned, and execution can stop at that point. It is currently possible to emulate abort functionality in OpenCLunk by means of a read-write buffer parameter that is checked at the beginning of each kernel; if it is found to be set to a value indicating an abort, the task returns immediately without doing any work or spawning any children (see the sketch below). However, it would be possible to implement a better abort that stopped the CPU portion of the code from starting more task groups.

Another limitation is that tasks can only make recursive spawns. In Cilk, any function can be spawned from within any other function. This feature could be implemented by passing back the name of the function to be called (or an enumerated representation of it), but it is not currently available in OpenCLunk.

Finally, although many Cilk functions can be rewritten so that all syncs take place at the end of the kernel and results are combined in a reduction phase, this is non-trivial or even impossible in some cases. For example, in some Cilk programs, there is a phase with several spawns followed by a sync, followed by additional spawns. Alternately, the end result of a Cilk function may have a more complicated operation than addition performed on it, such as a multiplication by some factor at each step. These patterns of execution are not allowed in OpenCLunk.
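The sketch below shows the abort emulation as an OpenCL kernel fragment; the kernel name and flag encoding are hypothetical, but the parameter layout follows the kernel interface introduced in Section 3.6.1 (Listing 1).

    __kernel void search(
        unsigned max_tid,
        __global unsigned *abort_flag,  /* single-element r/w buffer */
        __constant unsigned *arg_n,
        __global unsigned *result,
        __global unsigned *spawn)
    {
        unsigned tid = get_global_id(0);
        if (tid > max_tid) return;
        if (*abort_flag != 0) {         /* some task already succeeded */
            spawn[tid] = false;         /* finish: no work, no spawns */
            result[tid] = 0;
            return;
        }
        /* ... normal work; a task that finds the answer performs
         *     *abort_flag = 1;
         * any task may win this race without affecting correctness. */
    }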

3.6 Writing for OpenCLunk

Ideally, the user would be able to use a few simple keywords, and the compiler would generate the code needed to run on the GPU. However, since this project was primarily a proof-of-concept, the initial implementation of OpenCLunk requires the user to make use of a somewhat clunky interface. The intention was to determine whether OpenCLunk would be successful before worrying about the value of having a compiler and an attractive interface.

There are three primary components to an OpenCLunk algorithm implementation. Most of the recursive function survives as the kernel function. The user must also specify the arguments in the CPU code. Reductions are separate from the kernel and are a part of the CPU code as well. This subsection gives example code for the recursive Fibonacci algorithm, but a more general and complete example can be found in Appendix A.

3.6.1 Kernel function

Most important is the kernel function, which is the majority of what in Cilk would be the recursive function. It has some base case that returns a value, as well as the necessary code to spawn child tasks. The kernel of the recursive Fibonacci algorithm is shown in Listing 1. The first argument, max_tid, is the global work size. arg_n is the list of arguments sent to the tasks, indexed by tid, and spawn_n is the list of output arguments, indexed both by tid and the spawn offset.

Though not strictly necessary, the list of arguments should be declared with constant to take advantage of caching; the rest are writable and must be global. Lines 11–15 are the base case; the task indicates it should not spawn and its result is n. The remainder sets up the spawn arguments: the task indicates it would like to spawn and specifies what to use for the arguments to the spawned tasks.

        __kernel void fib(
            unsigned max_tid,
            __constant unsigned *arg_n,
            __global unsigned *result,
     5      __global unsigned *spawn,
            __global unsigned *spawn_n)
        {
            unsigned tid = get_global_id(0);
            if (tid > max_tid) return;
    10      unsigned n = arg_n[tid];
            if (n < 2) {
                spawn[tid] = false;
                result[tid] = n;
                return;
    15      }
            spawn[tid] = true;
            spawn_n += tid * 2;
            spawn_n[0] = n - 1; /* spawn */
            spawn_n[1] = n - 2; /* spawn */
    20  } /* sync */

Listing 1 – Recursive Fibonacci kernel

3.6.2 Argument specification

Arguments need to be specified in the CPU code so OpenCLunk knows how to invoke them. This is done in the constructor to simplify the resulting code. There are two types of arguments: parameters and plain arguments. Parameters get passed to all tasks of all task groups in the same way. These will be either constants or buffers. Readable buffers are just large parameters. Writable buffers are useful in cases where the tasks each write to disjoint regions, as is the case with matrix multiplication. Read-writable buffers are less often useful because of the synchronization model; two applications are in-place sorting and algorithms that allow opportunistic pruning (e.g. increasing a lower bound for a solution). Arguments are created similarly, where the value represents only the initial value sent to the root task. The return type is a template parameter, and it is required only to be a non-void type.

The Fibonacci argument specification is shown in Listing 2. The class inherits from the Alg_spec class, indicating a return value type of unsigned. This is the class that contains the various add_arg() and add_param() functions. Fib's constructor defines a single unsigned argument with initial value n.

        class Fib : public Alg_spec<cl_uint> {
        public:
            Fib(unsigned n)
            {
     5          add_arg(static_cast<cl_uint>(n));
            }
            /* ... */
        };

Listing 2 – Recursive Fibonacci argument specification

3.6.3 Reduction

The reduction operation is performed on the CPU, and so it is also a part of the Alg_spec subclass. OpenCLunk will first copy the results to a buffer, since they may have been rearranged. The Fibonacci reduction is shown in Listing 3, overloading a pure virtual function in the super class. The reduction is only the function reduce(), and its argument types match the template argument. Also shown on lines 11–18 are two other necessary functions (pure virtual in the super class) that describe the name of the kernel function to be executed and the exact number of subtasks the kernel will spawn (if any).

        class Fib : public Alg_spec<cl_uint> {
            /* ... */
        protected:
            void reduce(
     5          const cl_uint *child_results,
                size_t len, cl_uint *result)
            {
                *result = child_results[0] +
                    child_results[1];
    10      }
            std::string name() const
            {
                return "fib";
            }
    15      unsigned sub_tasks() const
            {
                return 2;
            }
        };

Listing 3 – Recursive Fibonacci reduction, kernel name, and subtask count

The user code must then invoke the function, which requires only minimal work. Listing 4 shows the process of invoking OpenCLunk with the above Fibonacci example. Lines 1–2 initialize OpenCLunk and compile the list of source files. Line 4 creates the Fibonacci instance and line 5 begins the execution.

        const char *paths[] = { "fib.cl" };
        Openclunk openclunk(paths, paths + 1);

        Fib fib(17);
     5  unsigned result = fib.run(openclunk);

Listing 4 – Recursive Fibonacci invocation
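Taken together, Listings 2–4 imply a base class roughly like the following. This is our hypothetical reconstruction for exposition, not OpenCLunk's actual header; details such as the buffer-access enumeration and exact signatures are guesses.

    #include <CL/cl.h>
    #include <cstddef>
    #include <string>

    class Openclunk;  // the runtime object from Listing 4

    struct Argument_buffer { enum Access { read, write, read_write }; };

    template <typename Result>  // the non-void return value type
    class Alg_spec {
    public:
        virtual ~Alg_spec() {}
        Result run(Openclunk &runtime);  // drives the task-group loop
    protected:
        // Plain arguments vary per task; parameters are shared by all.
        template <typename T>
        void add_arg(T initial);         // initial value for the root task
        void add_param(cl_uint value);
        void add_param(const void *buf, std::size_t size,
            Argument_buffer::Access access);
        // Per-algorithm hooks, overridden as in Listing 3.
        virtual void reduce(const Result *child_results,
            std::size_t len, Result *result) = 0;
        virtual std::string name() const = 0;    // kernel function name
        virtual unsigned sub_tasks() const = 0;  // spawns per task
    };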

4 Results

4.1 Evaluation Methodology

Ultimately, the best metric of OpenCLunk's performance is its speedup compared to sequential code. To this end, we implemented a number of benchmarks on the OpenCLunk framework and compared their performance with an equivalent version in C (similar to the C elisions for Cilk code). We expected to see low performance, and so we also chose to evaluate on benchmarks which expose the inefficiencies of the platform.

Our test platform consisted of a comparable CPU and GPU. Both were purchased at the same time, in July 2009. The CPU was the AMD Phenom II 955. It was manufactured with a 45 nm process, had four 3.2 GHz cores, and caches of L1: 4×(64 kB + 64 kB), L2: 4×512 kB, L3: 6 MB. The GPUs were two ATI HD 4870 cards in CrossFire (equivalent to NVIDIA's SLI). The GPUs were manufactured with a 55 nm process, and each had 800 stream processors, 512 MB of GDDR5 memory, and a PCI Express 2.0 ×16 interface.

4.2 Benchmarks

We ported or implemented a number of benchmarks to evaluate OpenCLunk. It was possible to port fib, heat, and knapsack directly from the examples distributed with Cilk-5.4.6. The fib benchmark calculates the Fibonacci sequence using the recursive algorithm. It is a useful benchmark because it does so little work per invocation that it is ideal for measuring the overheads associated with the system. heat models heat diffusion. knapsack solves the 0-1 knapsack problem, where a thief is trying to fill her knapsack with the most valuable combination of items that does not exceed a specified weight. These three examples could be easily ported, with the main calculation going into the kernel and the reduction operation following simply from the return value of the function.

One interesting thing about knapsack is that it employs pruning: it keeps track of the best combination of items found thus far, and does not spawn new tasks with a given item combination if there is no way for it to be better than the previously found solution. In the original Cilk algorithm, this was done by way of a global variable where the write is a benign race. We employed the same method, but rather than using a global variable, we used a single-element read-write buffer. As it is only a hint to the algorithm to stop checking some combinations early, it does not matter in terms of correctness which thread succeeds in writing to the global value if more than one tries in an iteration.

Several of our other benchmarks have Cilk analogs, but we re-implemented them rather than porting them. This was mainly due to some limitations in OpenCLunk as compared with Cilk, as previously discussed. The n-queens benchmark solves the problem where n queens must be placed on an n by n board such that none of them are in the same row, column, or diagonal. It is similar to the one distributed with Cilk, but instead of returning the first solution found, it finds all solutions. The quicksort benchmark performs an in-place quicksort, and is similar to cilksort. The matmul benchmark multiplies two matrices together, similar to Cilk's matmul, but with a slightly different algorithm.

Finally, we implemented one benchmark of our own. The bucketfill benchmark takes a matrix of integers, an old value, and a new value. If a starting point has the old value, the algorithm recursively changes that element and all neighboring elements with the same value to the new value. This is similar to the bucket-fill tool in a paint program.

4.3 Performance

We compared speedup to the sequential C elision running on the CPU. The times and speedups are shown in Table 1. Except for heat, matmul, and quicksort, all benchmarks are shown with the largest input size before they would need to start breaking apart thread groups; matmul fails sometime between 256 and 512, while quicksort fails sometime between 9000 and 10000. The speedups were all quite low, but there are a few interesting features to notice.
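To be explicit about the metric (our formulation): speedup is the sequential CPU time divided by the OpenCLunk GPU time, so for heat(medium) in Table 1,

    speedup = t_CPU / t_GPU = 1.24 s / 5.62 s ≈ 2.21 × 10^−1.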

Benchmark           CPU Time        GPU Time        Speedup
bucketfill(16, 18)  3.85 × 10^−6    4.08            9.42 × 10^−7
fib(10)             4.80 × 10^−7    1.21 × 10^−6    2.30 × 10^−6
fib(19)             1.97 × 10^−5    9.62 × 10^−1    2.05 × 10^−5
heat(medium)        1.24            5.62            2.21 × 10^−1
knapsack(11)        3.53 × 10^−7    1.63            3.53 × 10^−5
matmul(64)          1.70 × 10^−6    2.39 × 10^−1    7.11 × 10^−6
matmul(256)         1.70 × 10^−6    7.49 × 10^−1    2.27 × 10^−6
n-queens(4)         2.23 × 10^−5    4.83 × 10^−1    1.82 × 10^−5
n-queens(6)         2.23 × 10^−5    4.83 × 10^−1    4.62 × 10^−5
quicksort(1000)     5.48 × 10^−5    1.10            5.00 × 10^−5
quicksort(9000)     5.60 × 10^−4    1.46            4.09 × 10^−4

Table 1 – Performance of OpenCLunk on the GPU as compared to a sequential version on the CPU. Times are measured in seconds.

The benchmark that performed the best was heat, which we attribute to two qualities that it alone has. The only other floating-point benchmark was matmul, but the difference is that heat has better locality. The way they are written, each thread in the final thread group of matmul multiplies a subsequence of columns of one matrix by the entire other matrix. Each thread of heat in the final thread group gets its own column of the matrix, and does reads only on the two neighboring columns.

Three of the algorithms performed better as the problem size increased. fib, n-queens, and quicksort all had higher speedups with larger input sizes. This was expected, since larger task groups have higher throughput. On the other hand, matmul went slower with a larger input size. The difference between the two sets of algorithms is that matmul does work only in the last task group, while the rest do work in all task groups. fib is the exception which does the absolute minimum amount of work in either case, but that also means it is balanced.

4.4 Overheads

To determine where the overheads are coming from, OpenCLunk was instrumented to give timing information. We looked at how the various phases of execution changed over time from multiple perspectives. There are three primary components to the execution of some algorithm within OpenCLunk's main loop: buffer reordering (CPU), reductions (CPU), and the kernel (GPU, including the CPU time to add the job to the OpenCL command queue). We looked closely at only three algorithms. We chose fib since it is a minimal task-parallel function, matmul because it does not do any work until the final iteration, and n-queens because it has many distinguishing features (work is done in the spawn and not the base case, it has many spawns per task, and tasks often exit early). These overheads are shown in Figure 1.

fib has a hump at the 13th and 14th iterations. All threads are still active there, but afterward, many threads begin completing without spawning. The dip before the hump in Figure 1b was consistent; it occurs before the first tasks begin to complete without spawning (iteration 9). We attribute it to an artifact of OpenCL.

The behavior of matmul is expected: all tasks spawn except in the last iteration, where none do. No reorders ever occur, reductions only happen at the end, and the GPU does not perform any work except at the last iteration.

The n-queens behavior is more varied. Its volatility comes from the tasks that die because they are infeasible. Reordering is a bigger issue with n-queens, which is 4 µs at minimum (excluding the first and last iterations) while fib is as low as 1 µs. This comes up because of the infeasible solutions that stop early. The increasing reduction cost is attributed to the increase in infeasible solutions as the number of tasks increases.

[Figure 1: six panels plotting overheads per iteration, (a) fib–CPU, (b) fib–GPU, (c) matmul–CPU, (d) matmul–GPU, (e) n-queens–CPU, (f) n-queens–GPU. The CPU panels show buffer-reordering and reduction times in µs; the GPU panels show kernel execution time in ms; the x-axis of each panel is the iteration number.]
Figure 1 – Buffer reordering and reduction overheads (CPU) and kernel execution time (GPU) throughout execution.

Finally, the kernel execution time remains somewhat consistent. The variation at iteration 5 we attribute to subtleties of the n-queens problem.

The most useful view of the execution is in the timelines in Figure 2. Clearly, the CPU overheads are inconsequential, and the reductions could stand to be far less efficient without affecting anything. In the fib timeline the execution time of the kernels increases over time. Given that the body of the kernel is the minimum for OpenCLunk (it takes one argument and usually spawns twice), the increased kernel execution time must be because of the data transfers. Further, the execution time can be seen in the first iterations, consistently near an average of 23.1 ms. Inexplicably, there are two intervals with periods 3.84 and 8.27 ms, corresponding with the dip in Figure 1b.

Even though it performed poorly, matmul had the expected behavior. There were a few short iterations where the work was merely divided, and then a long execution without having to cross back and forth between the CPU and GPU. Unfortunately, the execution time was just far too long, perhaps because there were only 64 threads executing that last iteration. Furthermore, about 40% of the overhead is attributed to overheads in crossing between the GPU and CPU, so there were many factors preventing speedup.

Execution time did not vary for n-queens despite the large number of spawns at each stage. It was expected that the kernel execution times would grow much faster than fib's, so the difference may be due to abandoning infeasible solutions. Aside from the large spawn count, the overheads in the algorithm come from the array argument. It is required to be copied in the kernel for each spawn, and it consumes a lot of memory. This is why so few iterations were possible, but it also explains the large minimum task execution time.

[Figure 2: execution timelines for (a) fib (0–1000 ms), (b) matmul (0–800 ms), and (c) n-queens (0–500 ms), each separated into CPU and GPU rows, with time in ms on the x-axis.]

Figure 2 – Timelines of the execution separated by CPU vs. GPU. In the CPU timeline, reordering and reduction steps occur close together, with reordering usually the thicker band.

5 Discussion

As seen in Section 4, OpenCLunk failed to perform well. We will discuss several reasons for this, followed by a brief discussion of possible improvements that could lead to better performance in OpenCLunk.

5.1 Performance of OpenCLunk

There are several reasons why OpenCLunk failed to attain good speedup. The main problem is that there is a fundamental mismatch between the strengths of Cilk and the strengths of the GPU. Cilk performs well on benchmarks with a large number of spawns, because in the common case, they are not much more expensive than ordinary function calls. Therefore, programs written for Cilk tend to spawn many more times than there are threads, because it enables load balancing. Each task tends to do only a small amount of work. Although OpenCLunk also needs many spawns in order to take advantage of the large number of threads available on the GPU, they are more expensive than in Cilk, as shown by both the amount of overhead and the mandatory synchronization inherent in generations of OpenCLunk spawns. Hence, to get good performance, there must be more computation in each spawned function to amortize the higher cost. The benchmarks that we tested on have little computation per kernel, and are therefore not well suited for OpenCLunk. In general, tasks will not have much computation; if there is a lot of computation per kernel, the problem could easily be adapted to use a data-parallel model and OpenCLunk would not be necessary.

Another reason that we did not attain high speedups was that the benchmarks we tested on did not, in general, exploit the strengths of the GPU. In addition to the small amount of computation per kernel, most of the benchmarks use integer rather than single-precision floating-point arithmetic, and we made no use of the fast transcendental functions built into the GPU. For both of these cases, it could be argued that we should simply test OpenCLunk on benchmarks that would be expected to perform well, such as ones with large amounts of computation in each step and which use transcendental functions. However, those types of problems can already be implemented efficiently and easily on the GPU. It is much more valuable to examine the performance of benchmarks that could not previously be easily run on the GPU.
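To make the amortization argument concrete, consider a simple cost model (ours, for illustration only): suppose a program runs for g generations of task groups, each generation pays a fixed CPU–GPU crossing overhead c, and the GPU executes the total work W a factor s faster than one CPU core. Then

    speedup = W / (g·c + W/s),

which approaches s only when W ≫ g·c·s. With little work per task, the g·c term dominates, which is exactly what our measurements show.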

5.2 Potential improvements

There is room for OpenCLunk to improve, but we expect that the improvements would not be enough to make OpenCLunk perform well. Eliminating the cost of switching between the CPU and GPU would be key. One improvement would be to keep the spawn arguments on the GPU; it would make debugging harder, but there is definite room for improvement for algorithms with large arguments.

Reducing the number of times tasks actually execute on the GPU would reduce the overall cost of CPU–GPU switches. One way of doing this would be to 'unroll' the recursion one or two steps; instead of spawning just two tasks, fib might spawn four or eight (see the sketch at the end of this subsection).

Another method would be to more accurately find the resource limit. Currently, the resource limit is reached because the spawn argument buffers get too big. Some algorithms might be rewritten to switch to a different kernel on the last iteration, one that can fit within the resource limit. This final kernel would compute the result without further spawning (perhaps using a different algorithm than it otherwise would), so large spawn argument buffers would be unnecessary. Since no further spawns would occur, there would be no further CPU–GPU crossings, allowing the costs of OpenCLunk to be amortized. It could also achieve a better thread-to-memory ratio because memory would no longer be the limiting factor, and so it would better utilize the capabilities of the GPGPU.
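As a sketch of the unrolling idea above (our illustration, reusing the interface of Listing 1): since fib(n) = fib(n−2) + 2·fib(n−3) + fib(n−4) for n ≥ 4, one unrolled step spawns four tasks; sub_tasks() would return 4, the reduction would sum four children, and the base cases must cover n < 4.

    __kernel void fib4(
        unsigned max_tid,
        __constant unsigned *arg_n,
        __global unsigned *result,
        __global unsigned *spawn,
        __global unsigned *spawn_n)
    {
        unsigned tid = get_global_id(0);
        if (tid > max_tid) return;
        unsigned n = arg_n[tid];
        if (n < 4) {  /* fib(0..3) = 0, 1, 1, 2 */
            spawn[tid] = false;
            result[tid] = (n == 0) ? 0 : (n == 3) ? 2 : 1;
            return;
        }
        spawn[tid] = true;
        spawn_n += tid * 4;
        spawn_n[0] = n - 2;  /* fib(n-1) = fib(n-2) + fib(n-3) */
        spawn_n[1] = n - 3;
        spawn_n[2] = n - 3;  /* fib(n-2) = fib(n-3) + fib(n-4) */
        spawn_n[3] = n - 4;
    }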
6 Conclusion

We showed that it is possible to write a Cilk-like task-based parallel programming framework for the GPGPU. We were able to port Cilk benchmarks and run them on the GPU, obtaining correct results. In the case of heat, the performance was actually nearing the baseline, and with some of the above improvements, it could feasibly see speedup. However, in all other cases, we found that the performance was unusably bad, indicating that it would not be worthwhile to continue with the project or to write a compiler for OpenCLunk. The reasons that the framework failed can mostly be attributed to the fundamental incompatibility between the advantages of the GPU and the advantages of Cilk. Furthermore, the high transfer time between the GPU and CPU created an insurmountable amount of overhead.

It would take large architectural changes to graphics processors to make Cilk-like parallelism on GPUs a reality. Higher transfer speeds would reduce the time to execute task groups by a constant factor. Lower delays would also improve performance by reducing the minimum execution time of task groups. If GPGPUs had the ability to create tasks themselves, that would fix the issue as well, though that would be nothing more than a shift towards the CPU model. Without these changes, the Cilk-like model will remain infeasible on GPGPUs.

References

[1] Frigo, M., Leiserson, C. E., and Randall, K. H. The implementation of the Cilk-5 multithreaded language. SIGPLAN Not. 33 (May 1998), 212–223.

[2] Khronos Group. OpenCL – The open standard for parallel programming of heterogeneous systems. http://khronos.org/opencl/.

[3] Khronos Group. The OpenCL specification 1.0. http://khronos.org/opencl/.

[4] Lee, S., Min, S.-J., and Eigenmann, R. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New York, NY, USA, 2009), PPoPP '09, ACM, pp. 101–110.

A General OpenCLunk example

What follows is a general example that uses more of the OpenCLunk interface. The interface was developed while implementing the various benchmarks in our framework, so the following illustrates most of the supported ways for arguments to be used.

Parameters are values or buffers that are the same across all tasks. They can be values, readable buffers, and writable buffers. Listing 5 shows the arguments for matrix multiplication, and it is a mostly comprehensive example. Lines 7–10 specify read-only buffers, the operands to the multiplication. add_param() can be used similarly for a buffer argument; these require more careful indexing in the kernel function. Lines 11–12 show a writable buffer, the result of the multiplication. Lines 13–15 are the dimensions, and lines 16–17 specify two column indices, used in the work division.

In the kernel (in a separate file), arguments appear in a precise order, partially determined by the Matmul constructor. First are the six parameters, in the same order they were defined in the constructor; in general there are zero or more of these. Then comes an unsigned max_tid, the global work size, which is always present. The two arguments come next, in the same order they were defined in the constructor; in the general case there are zero or more. result comes next (it will not be used, but it must have non-zero size), followed by spawn; these are always present. Last are the spawn arguments, corresponding with the arguments on lines 30–31.

Lines 41–48 show how Matmul would be called. In many cases, including this one, the code that invokes OpenCLunk can be almost completely transparent.

        class Matmul : public Alg_spec<cl_uchar> {
        public:
            Matmul(const cl_float *A,
                const cl_float *B, cl_float *C,
     5          cl_uint m, cl_uint n, cl_uint o)
            {
                add_param(A, m*n*sizeof(cl_float),
                    Argument_buffer::read);
                add_param(B, n*o*sizeof(cl_float),
    10              Argument_buffer::read);
                add_param(C, m*o*sizeof(cl_float),
                    Argument_buffer::write);
                add_param(m);
                add_param(n);
    15          add_param(o);
                add_arg(static_cast<cl_uint>(0));
                add_arg(o);
            }
            /* ... */
    20  };

        void __kernel matmul(
            __constant float *A,  /* read-only */
            __constant float *B,  /* read-only */
    25      __global float *C,    /* write-only */
            unsigned m,
            unsigned n,
            unsigned o,
            unsigned max_tid,
    30      __constant unsigned *arg_c0,
            __constant unsigned *arg_c1,
            __global unsigned char *result,
            __global unsigned *spawn,
            __global unsigned *spawn_c0,
    35      __global unsigned *spawn_c1
        )
        {
            /* ... */
        }
    40
        void matmul(Openclunk &openclunk,
            const cl_float *A, const cl_float *B,
            cl_float *C, unsigned m, unsigned n,
            unsigned o)
    45  {
            Matmul matmul(A, B, C, m, n, o);
            matmul.run(openclunk);
        }

Listing 5 – Matrix multiplication arguments