Bachelor Informatica

From CUDA to OpenACC in Graph Processing Applications

Wouter de Bruijn

June 17, 2019 Informatica — Universiteit van Amsterdam

Supervisor(s): Dr. Ana Lucia Varbanescu

Abstract

Most GPU-accelerated programs are written using the NVIDIA proprietary API CUDA. CUDA has an extensive collection of users and libraries, but functions only on NVIDIA GPUs and is completely proprietary. This thesis proposes standard ways to convert CUDA to the higher-level programming model OpenACC, examines the difficulty of this process, and analyzes the performance of the resulting converted programs. We have applied our porting methodology to two different graph processing algorithms. We chose Breadth First Search for its relative simplicity and large number of memory operations, and PageRank for its combination of memory operations and computational sections. The results show that OpenACC was significantly faster than CUDA for PageRank, and was more or less tied with CUDA for Breadth First Search. In the end, the performance of OpenACC was close enough to CUDA for most cases, and actually faster than CUDA in one case. OpenACC did lack performance and consistency on multi-core CPUs when compared to OpenMP. Our systematic process of porting CUDA to OpenACC was therefore successful for the two graph processing algorithms. The OpenACC ecosystem does still suffer from a lack of user support and documentation, which makes writing larger and more complicated OpenACC programs more difficult than it should be for (beginner) programmers.

Contents

1 Introduction
  1.1 Context
  1.2 Research Questions
  1.3 Thesis structure

2 Background & Related Work
  2.1 Background
    2.1.1 CUDA, OpenACC and OpenMP
  2.2 Definitions
  2.3 Related Work

3 Methodology
  3.1 Measuring performance
  3.2 Measuring ease-of-use
  3.3 Performance Analysis and Debugging

4 Breadth First Search
  4.1 The Algorithm
  4.2 Porting
    4.2.1 Porting memory transfers
    4.2.2 Porting compute operations
  4.3 Performance
  4.4 Optimising
    4.4.1 Diagnosing and patching
    4.4.2 Optimised results

5 PageRank
  5.1 The Algorithm
  5.2 Porting
    5.2.1 Porting memory transfers
    5.2.2 Porting compute operations
  5.3 Performance
  5.4 Optimising the OpenACC implementation
    5.4.1 Diagnosing and patching
    5.4.2 Optimised results

6 Multithreading OpenACC
  6.1 OpenMP vs. OpenACC
  6.2 Results
    6.2.1 Unoptimised OpenACC
    6.2.2 Optimised OpenACC
  6.3 Examining Results

7 Conclusion and Future Work
  7.1 Discussion
  7.2 Conclusion
    7.2.1 Systematic way of porting
    7.2.2 Difficulty of OpenACC
    7.2.3 Performance benefits or drawbacks
    7.2.4 OpenMP vs. OpenACC
  7.3 Future Work

8 Appendices
  8.1 Appendix A: CUDA max-reduce
  8.2 Appendix B: Graph sizes
  8.3 Appendix C: Raw unoptimised GPU performance numbers
    8.3.1 graph500
    8.3.2 KONECT
    8.3.3 SNAP
  8.4 Appendix D: Raw optimised GPU performance numbers
    8.4.1 graph500
    8.4.2 KONECT
    8.4.3 SNAP
  8.5 Appendix E: Raw unoptimised CPU performance numbers
    8.5.1 graph500
    8.5.2 KONECT
    8.5.3 SNAP
  8.6 Appendix F: Raw optimised CPU performance numbers
    8.6.1 graph500
    8.6.2 KONECT
    8.6.3 SNAP

CHAPTER 1 Introduction

1.1 Context

As we continue to increase the amount of data we collect and process, we need more efficient ways of storing and processing that data. When data consists of entities which can have some sort of relationship between them, a graph can be used to model the data. A graph is a mathematical structure consisting of vertices and edges, where each edge connects a pair of vertices. A large variety of data collected today can be represented using graphs, including data from fields like social networks, linguistics, physics, chemistry and other computational sciences [21][23].

Currently, a lot of graph processing is done using GPU acceleration, as this can bring major performance benefits. Most GPU-accelerated programs are written using the NVIDIA proprietary API CUDA. CUDA is a widely used and supported programming model for programming NVIDIA GPUs. The downside of CUDA, however, is that it functions only on NVIDIA GPUs and is completely proprietary to NVIDIA. Moreover, it requires a different way of thinking compared to normal sequential CPU code. For this reason, it would be beneficial to have a higher-level and portable API that can be used instead of CUDA without any loss of performance.

In this work, we choose OpenACC as the high-level API. OpenACC is a portable acceleration API which aims to let programmers write accelerated code once, and then have the compiler compile it for any type of accelerator. This includes GPUs from potentially any brand, multi-core CPUs and large compute clusters.

1.2 Research Questions

This thesis aims to determine whether high-performance OpenACC code can be derived from CUDA code in a systematic manner. We then investigate the portability of this new OpenACC code by comparing it to multithreaded OpenMP code running on the CPU. To investigate this properly, we aim to answer the following questions:

1. Can we create a systematic way of porting CUDA to OpenACC, and how easy is this porting process?

2. What is the difficulty of implementing our algorithms in OpenACC compared to the difficulty of implementing our algorithms in CUDA?

3. What are the performance benefits or drawbacks to using the high-level portable programming model OpenACC instead of the proprietary CUDA API?

4. Is OpenACC portable and performance portable across different types of platforms (i.e., CPUs and GPUs)?

We answer these questions by porting two graph processing algorithms written in C and CUDA to C and OpenACC, and reviewing this process. We identify any quirks or difficulties encountered during this process in order to answer research question 2. In order to answer research question 1, we search for patterns of CUDA calls that are (or should be) always replaced by the same OpenACC code; these become systematic one-to-one translations. We further compare the runtime of the original code and the ported code to answer question 3. Finally, in order to answer research question 4, we compare the performance of both of our OpenACC benchmarks against similar OpenMP code.

1.3 Thesis structure

This thesis is structured in a per-algorithm way. We start with a background chapter (see chapter 2) containing short explanations of the CUDA, OpenACC, and OpenMP APIs, as well as work related to this thesis. Then, in chapter 3, we describe our exact testing and investigation methodologies. Further, for each of the two algorithms, we describe the porting process and any difficulties encountered during this process (see chapters 4 and 5). For each algorithm, we also include the performance comparison of the two versions. We explain any performance differences, and improve the ported code based on this investigation. After discussing the main algorithms, we select PageRank as a case study for an in-depth comparison against the OpenMP version of the algorithm, running on a CPU (see chapter 6). Finally, we conclude this thesis with a summary of our findings and provide suggestions for potential future research (see chapter 7).

CHAPTER 2 Background & Related Work

2.1 Background

Graph processing is becoming increasingly relevant for many scientific and daily-life applications. The massive size of some of the graphs around us (social networks or infrastructure networks) requires high-performance graph processing. Many frameworks have been devised to help users write algorithms from scratch [22][7][25][26], but there are also a lot of high-performance C and CUDA implementations of graph processing applications. These applications are, however, difficult to read, modify, and/or maintain by regular users who do not have experience with those specific frameworks. Thus, obtaining higher-level versions of these codes is desirable for many domains and users. Obtaining these versions is the main goal of our work.

2.1.1 CUDA, OpenACC and OpenMP

In this section we explain (in short) the differences between the three APIs we are using in this work.

CUDA

CUDA is an NVIDIA proprietary API for the C, C++ and Fortran programming languages which can be used to program NVIDIA GPUs. It has an extensive collection of libraries, a large amount of support behind it, and a sizeable community. These things make CUDA one of the most attractive choices when programmers need GPU acceleration in their applications.

On a basic level, CUDA works by having the programmer manually write CUDA kernels, which are essentially small functions that run on the GPU. These kernels are then called from the "normal" code running on the CPU. CUDA is low-level in the sense that the programmer always has to manually manage memory transfers between the GPU and the rest of the machine, and continually has to think in a GPU-centric way when programming these kernels. Although the code in the CUDA kernels is just C++, the programmer has to consider the inner workings of the GPU while writing these kernels to avoid the performance penalties of, for example, warp divergence and branch divergence [6]. Thus, the low level of abstraction and the tight coupling with the hardware make it harder for a programmer used to writing standard CPU-bound sequential (or even parallel) code to create efficient CUDA code.
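To illustrate the kind of boilerplate this involves, the following minimal vector-addition sketch shows the explicit kernel, launch configuration and memory management that CUDA requires. It is an illustrative example, not code from this thesis, and all names in it are hypothetical.

#include <cuda_runtime.h>

#define VSIZE 100000

// Each thread handles one index; the grid is sized so every element is covered.
__global__ void vec_add(const int *a, const int *b, int *c, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

void add_on_gpu(const int *a, const int *b, int *c) {
    int *a_d, *b_d, *c_d;
    size_t bytes = VSIZE * sizeof(int);

    // The programmer allocates device memory and moves data explicitly.
    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);

    // One thread per element, in blocks of 256 threads.
    vec_add<<<(VSIZE + 255) / 256, 256>>>(a_d, b_d, c_d, VSIZE);

    cudaMemcpy(c, c_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
}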

OpenACC

OpenACC is an accelerator API based on compiler directives, which are commonly called pragmas. It aims to be a general high-level API that can target any accelerator device; at the moment these devices are mostly GPUs and CPUs. As such, OpenACC aims to be the "simple" way of GPU-accelerating a program.

Being a compiler-directive-based API, OpenACC works by instructing the compiler to offload certain sections of code to the accelerator device. In practice this means that the programmer can take sequentially written code, and have the compiler transform that code to the proper API for the target device. This approach has the major advantage of being able to accelerate most code by at least a small amount, as most sequentially written code has at least a couple of sections that can be parallelised. It also means that it is much easier to add OpenACC acceleration to existing code than it is to accelerate existing code using CUDA.

There are a couple of drawbacks to this compiler-centric approach, however. First, the final performance of the OpenACC-accelerated code is very heavily dependent on the compiler being used [16]. This makes the choice of compiler much more important than it usually is, as a compiler switch can bring major performance gains or losses. Second, when the compiler does not understand the structure of the code to be accelerated, it might not parallelise the code in the way that one might expect. This can result in entire sections not being transformed or no longer working, and a long struggle to adapt the code to the point where the compiler understands how certain sections can be parallelised. Much of the original code of the program may have to be rewritten in order to take full advantage of OpenACC acceleration. On top of that, the compiler also has to manage memory movement to and from the GPU if the programmer does not explicitly handle this.

The OpenACC API does contain a number of options and directives to control the offloading of data and code to and from the GPU in a more precise way. This makes it easier for the compiler to offload the code, but requires more programmer skill and code modification. To summarise, OpenACC aims to be both simpler to reason about than CUDA, and simpler to add to existing codebases than CUDA.

OpenMP

Like OpenACC, OpenMP is a directive-based API. It shares most of the benefits and drawbacks that OpenACC has, but has the additional benefit of being much more mature and much more widely implemented. As such, it enjoys better compiler support and optimisations. The major difference is that instead of being designed for writing code for (potentially) any acceleration device, it was mainly focused on multithreaded CPUs, with OpenMP gaining GPU support only very recently, in version 4.5 of the OpenMP standard [18].

Due to its syntactic similarity to OpenACC and its focus on multithreaded CPU acceleration, OpenMP is a prime candidate for comparison with OpenACC's multithreaded CPU support.
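As a small illustration of this syntactic similarity (our own example, not code from the thesis), a parallel sum reduction looks as follows in OpenMP. Only the pragma distinguishes it from plain sequential C, much like the OpenACC directives explained in section 2.2.

#define VSIZE (100000)

// Sum vec_a[i] + vec_b[i] over all elements using OpenMP's parallel-for
// reduction, the CPU analogue of OpenACC's "parallel loop reduction(+:final)".
int sum_vec(const int *vec_a, const int *vec_b) {
    int final = 0;

    #pragma omp parallel for reduction(+:final)
    for (int i = 0; i < VSIZE; i++) {
        final += vec_a[i] + vec_b[i];
    }

    return final;
}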

2.2 Definitions

The OpenACC API contains a number of new constructs and compiler directives. In this section, we explain most of the ones relevant to this thesis. In figure 2.1 we provide a simple example application which contains most of the important OpenACC compiler directives that we use in this thesis. In order to explain them, we start at the main function. In this function we see three compiler directives. The directive

#pragma acc enter data copyin(vec_a[0:VSIZE], vec_b[0:VSIZE])

is called a data directive, and is an example of explicit data movement: we explicitly state that elements 0 through VSIZE of both vectors should be copied to the accelerator device. The directive

#pragma acc kernels

is the simplest and broadest directive that OpenACC provides. It tells the compiler that all code in the following region should be scanned for possible parallelism, and then be parallelised by the compiler. This results in the compiler doing all of the data movement and all of the compute porting. The final directive of this function is

#pragma acc exit data delete(vec_a, vec_b)

which is also an explicit data movement directive. In this case, it tells the compiler to free both vectors on the accelerator device. Moving to the next function, we see that it has four compiler directives. The first directive is

#pragma acc data present(vec_a[0:VSIZE], vec_b[0:VSIZE])

This directive is made up of two parts. The first part is acc data. This declares that we are now entering a data region. This data region informs the compiler that all accelerator data is shared between regions inside this data region, meaning that there is no need to copy data back and forth between the accelerator and the host, and it can simply copy once. The second part of the directive (present(vec_a[0:VSIZE], vec_b[0:VSIZE])) is a declaration that the two vectors, from element 0 through VSIZE, are already present on the GPU (due to the explicit data movement directives in the main function) and do not have to be copied. The next three directives all share the same core:

#pragma acc parallel loop

This is the basic OpenACC directive. It declares the following loop to be a parallel one that can be converted into a kernel for the accelerator. Each directive applies to a single loop. It is up to the programmer to ensure that the loop is fully parallel and has no data dependences (where the result from one loop iteration influences another iteration). The final directive ends with a common suffix: reduction(+:final). This informs the compiler that this loop is not fully parallel, but is in fact a reduction on the variable final with the operator +. This enables the compiler to apply specific, optimised algorithms to perform this reduction as fast as possible.

2.3 Related Work

The usability of OpenACC has already been examined earlier [24]; the authors reached the conclusion that OpenACC has a "promising ratio of development effort to performance". Additionally, earlier performance comparison studies between CUDA and OpenACC, like Hoshino et al. [10], have shown OpenACC to be slower than CUDA. The common belief was that this behaviour is due to the fact that OpenACC lacks the low-level functionality (like shared memory) that enables the programming tricks that CUDA can do [10]. Supporting this lack of performance is Christgau et al. [5], where "the platform independent approach does not reach the speed of the native CUDA code". In this paper, the authors state that "a deeper analysis shows that memory access patterns have a critical impact on the compute kernels' performance, although this seems to be caused by the compiler in use". This supports the points we made in section 2.1.1, where we stated that the performance of OpenACC is heavily affected by the choice (and thus the optimisation level) of compiler. This performance deficit is not a commonly shared conclusion, however, as Herdman et al. [9] conclude that "OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA". To add to this, Ledur, Zeve, and Anjos [13] conclude that "OpenACC presented an excellent execution time compared with the other languages" (the other languages being OpenMP and CUDA). In their conclusion, the authors also note that "CUDA presented good execution times too, but the complexity to construct code is bigger than OpenACC and OpenMP." Examining ease-of-use and difficulty of programming (see research question 2), we have Memeti et al. [15], which, based on the number of lines of code used for the APIs, concludes that "on average OpenACC requires about 6.7x less programming effort compared to OpenCL", and that "Programming with OpenCL on average requires two times more effort than programming with CUDA for the Rodinia benchmark suite". Using some rough maths, this would suggest that OpenACC should be around 3.5x easier to program than CUDA.

#include <stdlib.h>

#define VSIZE (100000)

int do_vec(int *vec_a, int *vec_b) {
    int* vec_c = malloc(sizeof(int) * VSIZE);
    int final = 0;

    #pragma acc data present(vec_a[0:VSIZE], vec_b[0:VSIZE])
    {
        #pragma acc parallel loop
        for(int i = 0; i < VSIZE; i++) {
            vec_c[i] = 0;
        }

        #pragma acc parallel loop
        for(int i = 0; i < VSIZE; i++) {
            vec_c[i] = vec_a[i] + vec_b[i];
        }

        #pragma acc parallel loop reduction(+:final)
        for(int i = 0; i < VSIZE; i++) {
            final += vec_c[i];
        }
    }

    free(vec_c);

    return final;
}

int main(int argc, char *argv[]) {
    int* vec_a = malloc(sizeof(int) * VSIZE);
    int* vec_b = malloc(sizeof(int) * VSIZE);
    #pragma acc enter data copyin(vec_a[0:VSIZE], vec_b[0:VSIZE])

    #pragma acc kernels
    {
        for(int i = 0; i < VSIZE; i++) {
            vec_a[i] = 5+5;
            vec_b[i] = 10+10;
        }
    }

    int result = do_vec(vec_a, vec_b);

    #pragma acc exit data delete(vec_a, vec_b)
    free(vec_a);
    free(vec_b);

    return result;
}

Figure 2.1: Example application containing most of the basic OpenACC directives

CHAPTER 3 Methodology

We answer our research questions using empirical analysis of two different graph processing algorithms: Breadth First Search and PageRank. We have selected these workloads because they are both iterative, like most graph processing algorithms, but cover different types of graph processing, and could therefore require different CUDA and OpenACC constructs. Specifically, Breadth First Search is an algorithm that contains a lot of memory operations, but does not have computationally expensive steps. It also traverses different nodes in every iteration. PageRank contains a mix of both memory intensive steps and computationally expensive steps, and is therefore more balanced. Also, contrary to Breadth First Search, it traverses every node in each iteration.

3.1 Measuring performance

In order to fully explore the performance capabilities and the usability of OpenACC compared to CUDA, we use an iterative approach to porting the algorithms. We create a base port of each algorithm, compare it to the original non-ported version, and look for any differences in performance and behaviour. Based on the results of this comparison, we adapt the ported code and compare this adapted version to the non-ported version again. This process is repeated until no more realistic performance gains can be found.

We measure the runtime of the programs by using operating-system-specific timing libraries. For the Linux and macOS operating systems these are the functions contained in the standard C timing headers; for Windows we use the QueryPerformanceCounter() and QueryPerformanceFrequency() functions from the Windows API. For GPU-accelerated programs, the offloading execution model (see chapter 2) requires the additional step of data movement between host and device. Thus, in our performance analysis, we time these transfers explicitly. Collecting such fine-grained performance data allows for a better understanding of the sources of performance discrepancies between CUDA and OpenACC.

All performance measurements are performed on the DAS-5 distributed supercomputer [1], the full specifications of which can be found in table 3.1. We repeat each experiment 200 times for PageRank and 6400 times for BFS before taking an average of the measured runtimes. It is well known that the performance of graph processing algorithms depends on the structure of the input graph. Therefore, we have selected a set of 36 diverse input graphs. This set includes synthetic graphs taken from the graph500 set [17], and snapshots of real-life graphs taken from the KONECT [12] and SNAP [14] repositories. The list of graphs we use for testing can be found in section 8.2, appendix B, along with their sizes.
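As an illustration, a minimal timing helper along these lines might look as follows. This is a sketch under the assumption that the POSIX clock_gettime() interface is used on Linux and macOS (the exact header names did not survive in this text); it is not the thesis's own timing code.

#include <time.h>

/* Sketch of a monotonic wall-clock timer for Linux/macOS. On Windows the
 * equivalent would combine QueryPerformanceCounter() and
 * QueryPerformanceFrequency() to obtain seconds. */
static double timer_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Usage: sample before and after the region of interest, e.g. a kernel
 * launch or a host-device transfer, and subtract the two values. */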

3.2 Measuring ease-of-use

To answer research questions 1 and 2, we cannot use any exact metric, because no such metric exists. Instead, we describe the porting process in detail and draw conclusions from that description. This porting process includes things like setting up the toolchain and configuring the compiler. We also investigate in detail any unexpected errors, subtle performance killers and pitfalls that the programmer might encounter while writing OpenACC code. Finally, we reflect on how complex and how systematic the process of identifying and fixing OpenACC performance bugs is. Our goal is, again, to extract potential patterns that can eventually lead to guidelines and, when possible, rules for writing well-behaved OpenACC code.

CPU Type:    Dual Intel Xeon E5-2630 v3
CPU Cores:   8-core, 16-thread (32 total threads)
CPU Speed:   2.4 GHz
RAM:         64 GB
GPU Type:    NVIDIA Titan X
GPU Cores:   3584
GPU Speed:   1417 MHz
GPU Memory:  12 GB GDDR5X @ 480 GB/sec

Table 3.1: DAS-5 specifications

C with OpenACC/OpenMP compiler:  PGI C compiler (pgcc), version 19.4-0 (LLVM)
C/C++ CUDA compiler:             NVIDIA CUDA compiler (nvcc), version 10.0.130
OpenACC profiler:                PGI profiler (pgprof), version 19.4
CUDA profiler:                   NVIDIA profiler (nvprof), version 10.1.168

Table 3.2: Software used

3.3 Performance Analysis and Debugging

In order to optimise and analyse our compiled binaries, we use debugging and profiling tools, in addition to the information provided by the compiler. The full list of software used, along with version numbers, can be found in table 3.2. To explain any unexpected performance differences, we use a combination of the NVIDIA and PGI GPU profiling tools and the pgcc option "-Minfo" [20]. "-Minfo" provides detailed information about the compiler's interpretation of our OpenACC directives, and can indicate whether any implicit data regions have been placed, which implicit optimisations have happened, and how our OpenACC regions have been parallelised. The NVIDIA CUDA profiler, nvprof (version 10.1.168), provides detailed profiling information for our CUDA executables. It can provide, for example, detailed information about the number of times a function was called, the average runtime, and the total time spent in a function. It can also provide this information at the level of every CUDA API call. The PGI OpenACC profiler, pgprof (version 19.4), does the same for our OpenACC code, but instead of just measuring functions and API calls, it also profiles OpenACC compute regions.

CHAPTER 4 Breadth First Search

In this chapter we present the different OpenACC BFS implementations we have designed in the context of this thesis. We further provide a detailed analysis of their performance from the perspective of how competitive they are against their CUDA counterparts.

4.1 The Algorithm

Breadth First Search is a basic graph exploration algorithm, in which the goal is to visit all nodes in a graph starting from a specific root node. It is "breadth first" because we explore the nodes "in layers": we first visit the nodes connected to our root node, and then we explore the connections of those nodes, and so on. We use an expanded version of this algorithm, in which we determine the distance (also called depth) of each node relative to our chosen root node.

We implement this algorithm in an edge-centric manner: we loop over every edge in the graph, and check if the origin of that edge has been explored. If it has, we mark the destination node as explored and give it the depth of the origin node plus one. The exploration terminates once we go through a loop iteration in which no new nodes have been discovered.

In standard sequential implementations of BFS, the algorithm works by placing newly discovered nodes in a queue, and then processing new nodes as they appear at the front of the queue. In our parallel CUDA implementation, we simply process each newly found node in parallel instead of placing it in a queue to be handled later. In each step of the algorithm, we travel along the edges of each node that was discovered in the previous step. This makes the algorithm inherently parallel and removes the need for a queue. It also makes keeping track of the current depth easier, because nodes with equal depths will always be discovered in the same step. The downside of this variant is that in each step we check each and every edge, and simply do nothing for the edges that do not need to be explored in this step. This results in more total work as a trade-off for the solved load imbalance.
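For reference, a sequential version of this edge-centric variant could look roughly as follows. This is an illustrative sketch, not the thesis's code; the type and state names mirror those used in the CUDA listings later in this chapter, but the exact definitions here are our own.

#include <stdint.h>

typedef enum { node_unvisited, node_reachable, node_toprocess, node_visited } node_state_t;
typedef struct { uint32_t from, to; } edge_t;
typedef struct { node_state_t state; uint32_t depth; } result_t;

void bfs_sequential(const edge_t *edges, uint64_t edge_count,
                    result_t *results, uint32_t node_count) {
    int was_updated;
    do {
        was_updated = 0;

        // Relax every edge whose origin is marked for processing.
        for (uint64_t i = 0; i < edge_count; i++) {
            uint32_t from = edges[i].from, to = edges[i].to;
            if (results[from].state == node_toprocess &&
                results[to].state == node_unvisited) {
                results[to].state = node_reachable;
                results[to].depth = results[from].depth + 1;
            }
        }

        // Advance all node states by one layer: newly reached nodes will be
        // processed in the next round, processed nodes are done.
        for (uint32_t i = 0; i < node_count; i++) {
            if (results[i].state == node_reachable) {
                results[i].state = node_toprocess;
                was_updated = 1;
            } else if (results[i].state == node_toprocess) {
                results[i].state = node_visited;
                was_updated = 1;
            }
        }
    } while (was_updated);
}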

4.2 Porting

Our (rough) CUDA implementation can be seen in figure 4.1. The structure of the code consists of roughly the following steps:

1. Allocate the GPU memory.

2. Copy the graph edges and the results data structure to the GPU.

3. Set the value of the was_updated variable on the GPU to zero.

4. Check the state of the origin of each edge. If the origin has state toprocess and the destination has state unvisited, mark the destination of the edge as reachable.

5. Mark all toprocess nodes as visited and mark all reachable nodes as toprocess.

result_t* do_bfs_cuda(graph_edge_t* graph) {
    result_t* results = malloc(graph->node_count * sizeof(result_t));
    results[0].state = node_toprocess; // Start at node 0

    timer_start(timer_memtransfers);
    copy_data_to_gpu(graph, results);

    // Do BFS
    int was_updated;
    timer_start(timer_nomemtransfers);
    do {
        cudaMemset(was_updated_device, 0, sizeof(int)); // set "was_updated" on device to 0
        bfs_search<<<(edge_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>(edges, results, edge_count);
        bfs_update_state<<<(node_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>(results, node_count, was_updated);
        cudaMemcpy(&was_updated, was_updated_device, sizeof(int),
                   cudaMemcpyDeviceToHost); // Get the "was_updated" value back from the device
    } while(was_updated == 1);
    timer_stop(timer_nomemtransfers);

    cudaMemcpy(results, results_device, graph->node_count * sizeof(result_t),
               cudaMemcpyDeviceToHost); // Copy results back

    delete_data_from_gpu(graph, results); // Delete data from GPU
    timer_stop(timer_memtransfers);

    return results;
}

Figure 4.1: The core of the Breadth First Search algorithm (pseudo-code, shortened)

6. Get the was_updated value back from the GPU. This was changed to 1 if the state of any node changed in the previous step.

7. Go back to step three if was_updated is set to 1.

8. Copy the results structure back from the GPU.

9. Deallocate the GPU memory.

One might notice that there can be race conditions in step 4: multiple edges might mark the same destination node as reachable, and one thread might be changing the state from unvisited to reachable while another thread still sees it as unvisited. However, this is not a problem for this implementation, because it does not matter which thread marks the target node as reachable: the depth will always be the same within a step, due to the parallel nature of this algorithm. As a result, the sequential and the parallel versions of the algorithm always end with the same result, even with the race conditions. We discuss the porting of this code by treating the memory operations and the compute operations separately.

4.2.1 Porting memory transfers

The memory operations consist of allocating the GPU memory, copying the initial data to the GPU, the intermediate memory operations, copying the results back to the host, and finally deallocating the GPU memory.

result_t* results = malloc(node_count * sizeof(result_t));

cudaMalloc(&edges_device, graph->edge_count * sizeof(edge_t));
cudaMalloc(&results_device, graph->node_count * sizeof(result_t));
cudaMalloc(&was_updated_device, sizeof(int));

cudaMemcpy(edges_device, graph->edges, edge_count * sizeof(edge_t), cudaMemcpyHostToDevice);
cudaMemcpy(results_device, results, node_count * sizeof(result_t), cudaMemcpyHostToDevice);

Figure 4.2: BFS CUDA initial data copying

result_t* results = malloc(node_count * sizeof(result_t));
...
#pragma acc data
{
    ...
    do {
        ...
        do_bfs_search();

        update_bfs_states();
        ...
    } while(was_updated == 1);
    ...
    copy_results_back();
}

Figure 4.3: BFS OpenACC initial data copying

In OpenACC, the allocation and copying of the initial data happen in a single step, except when using specialised memory directives, which are not relevant in this thesis. The CUDA code for the initial data copy consists of five lines, and can be seen in figure 4.2. There are three memory allocations, and only two memory copies, because the was_updated_device variable is set in the BFS algorithm loop itself.

Because OpenACC works with compiler directives, we can make the compiler figure out the memory transfers by itself in certain situations. We find that "certain situations" can be quite arbitrary, as the compiler needs to understand in what way memory will be used and allocated. This process of recognition is extremely unpredictable and prone to errors. To help the compiler, we can declare a region in which GPU data will be shared with

#pragma acc data

In figure 4.3 we see how this looks in our code. We have wrapped our main algorithm loop in one of these data regions, to declare that all GPU data in this region should be shared with all other code requiring this data. In practice, this means that no updating back and forth between the GPU and host is required in this region when the data is only accessed in regions that are offloaded to the GPU. It also means no extra allocations and deallocations other than at the beginning and end of this zone. If the compiler can now recognise the sizes and shapes of all the data that is needed, it can automatically perform the memory allocations and copies for the programmer. In our case, however, the compiler could not recognise the shape of the results array, but had no trouble with

result_t* results = malloc(node_count * sizeof(result_t));
...
#pragma acc data copy(results[0:node_count])
{
    ...
    do {
        ...
        do_bfs_search();

        update_bfs_states();
        ...
    } while(was_updated == 1);
    ...
    copy_results_back();
}

Figure 4.4: BFS OpenACC initial data copying with size hint

the edges array. Why the compiler could recognise the shape and size of an array allocated in a different function, but not that of an array allocated several lines above, is unclear. To solve this problem, we can explicitly state what we want to do with the results data. We want the data copied to the GPU from the host before the algorithm starts, and we want it copied back to the host when the algorithm has finished. OpenACC structured data pragmas (like our data region) support the copy(x) argument, which states that we want to copy the given data to the GPU at the start of our structured region, and vice versa when the region ends. Our updated code can be seen in figure 4.4.

At this point, the compiler generated functioning code which gave correct results. All other memory operations (including the updating of the was_updated variable) were implicitly generated by the compiler without us giving any hints.

4.2.2 Porting compute operations

As seen in our initial CUDA code in figure 4.1, there are two CUDA kernels which have to be ported to OpenACC: bfs_search and bfs_update_state.

bfs_search

The original CUDA implementation of this kernel can be seen in figure 4.5. Here we can see that each instance of the bfs_search kernel handles one edge. This is supported by the call to the kernel, in which we declare that we need a number of blocks equal to the number of edges in the graph divided by the size of a CUDA block (with one extra block to handle the edges left over when this division is not exact). For our purposes this means that we can transform the code into a standard for loop iterating over all edges in the graph. We can then annotate this loop with

#pragma acc parallel loop

to tell the compiler that this loop is completely parallel and should be offloaded to the GPU. Our ported code can be seen in figure 4.6. In essence, all we have done is inline the CUDA kernel, wrap it in a for loop, and remove any CUDA-specific code. This shows us that simple algorithms and CUDA kernels can be ported to OpenACC easily, sometimes resulting in shorter code. This is one of the main benefits of OpenACC: you do not need to know how GPUs work in order to write code that can run on GPUs.

__global__ void bfs_search(edge_t* edges, result_t* results, edge_count_t edge_count) {
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i < edge_count) {
        uint32_t origin_index = edges[i].from;
        uint32_t destination_index = edges[i].to;

        if(results[origin_index].state == node_toprocess) {
            if(results[destination_index].state == node_unvisited) {
                results[destination_index].state = node_reachable;
                results[destination_index].depth = results[origin_index].depth + 1;
            }
        }
    }
}

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        bfs_search<<<(edge_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>
            (edges_device, results_device, edge_count);
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.5: BFS Search CUDA function

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        #pragma acc parallel loop
        for(edge_count_t i = 0; i < edge_count; i++) {
            uint32_t origin_index = edges[i].from;
            uint32_t destination_index = edges[i].to;

            if(results[origin_index].state == node_toprocess) {
                if(results[destination_index].state == node_unvisited) {
                    results[destination_index].state = node_reachable;
                    results[destination_index].depth = results[origin_index].depth + 1;
                }
            }
        }
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.6: BFS Search OpenACC code

Our bfs_search code already requires some knowledge about GPUs (block sizes and block count), while the OpenACC variant is basically annotated sequential CPU code.

bfs_update_state

The CUDA code to be ported can be seen in figure 4.7. This code consists of a kernel in which each instance processes a single element in the results array. As we can see in the allocation of this results array (figures 4.1 and 4.2), it has a length equal to node_count. This is supported by the call of this kernel, in which the number of blocks to be used is equal to the number of nodes in the graph divided by the number of threads per block (with one extra block to handle node counts which are not a multiple of this threads-per-block value). This means that, just like with the bfs_search kernel, we can wrap it in a for loop with node_count iterations. Now we can annotate this for loop with

#pragma acc parallel loop

to tell the compiler that this is a fully parallel loop that should be offloaded. Our ported code can be found in figure 4.8. Again, all we have done is inline the CUDA kernel, wrap it in a for loop, and remove the CUDA-specific code.

4.3 Performance

In figure 4.9 we can see the normalised performance comparison between the CUDA and OpenACC implementations of BFS. In this figure we can see that our OpenACC implementation is faster for the smaller graphs (10 through 13), but loses to the original CUDA implementation for the other graphs. Figure 4.10 shows the same comparison for the KONECT graphs. Again, we can see that CUDA is faster by a large margin, except for the opsahl-ucsocial graph. Looking at the graph sizes, we see that the opsahl-ucsocial graph is the smallest of all the KONECT graphs.

__global__ void bfs_update_state(result_t* results, uint32_t node_count, int* was_updated) {
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i < node_count) {
        switch(results[i].state) {
            case node_unvisited:
            case node_visited:
                break;
            case node_reachable:
                results[i].state = node_toprocess;
                *(was_updated) = 1;
                break;
            case node_toprocess:
                results[i].state = node_visited;
                *(was_updated) = 1;
                break;
        }
    }
}

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        bfs_update_state<<<(node_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>
            (results_device, node_count, was_updated_device);
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.7: BFS Update state CUDA function

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        #pragma acc parallel loop
        for(uint32_t i = 0; i < node_count; i++) {
            switch(results[i].state) {
                case node_unvisited:
                case node_visited:
                    break;
                case node_reachable:
                    results[i].state = node_toprocess;
                    was_updated = 1;
                    break;
                case node_toprocess:
                    results[i].state = node_visited;
                    was_updated = 1;
                    break;
            }
        }
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.8: BFS Update state OpenACC code

Figure 4.9: Unoptimised BFS performance comparison for Graph500 graphs.

Figure 4.10: Unoptimised BFS performance comparison for KONECT graphs.

Finally, figure 4.11 contains the comparison for the SNAP graphs. Again, OpenACC is beaten by CUDA on every graph.

Figure 4.11: Unoptimised BFS performance comparison for SNAP graphs.

When we combine the results from these figures, we can conclude that the unoptimised version of our OpenACC BFS algorithm is faster than CUDA for very small graphs only. Once the graph size crosses the multi-megabyte line, CUDA starts being faster. In the next section, we take a look at why this is the case, and how we might optimise our OpenACC code to more closely match the original CUDA implementation.

4.4 Optimising

4.4.1 Diagnosing and patching

In order to optimise our BFS OpenACC implementation, we should first diagnose any potential performance problems and consider how we might solve them. To start diagnosing, we look at the actual raw performance numbers, which can be found in appendix C, section 8.3. In these tables, we can immediately notice two strange results: the difference in total processing time gets larger as the graphs increase in size, and the OpenACC memory transfer time is significantly shorter than the memory transfer time for CUDA. This is especially obvious for the graph graph500-24, where the memory transfer time of CUDA is 5x longer, but the time spent in the main BFS loop is more than an order of magnitude shorter.

This could mean that OpenACC is simply much faster with memory transfers, but that seems unlikely, as it would mean that OpenACC is able to use GPU memory bandwidth more efficiently than CUDA. A significant hint can be found in the output our compiler gives us while compiling the OpenACC BFS code:

29, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)
35, Generating copy(results[:node_count])
39, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)
40, Loop not vectorized/parallelized: contains call
    FMA (fused multiply-add) instruction(s) generated
44, Generating Tesla code
    45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
44, Generating implicit copyin(edges[:edge_count])
45, Loop not fused: no successor loop
    Loop not vectorized: data dependency
58, Generating Tesla code
    59, #pragma acc loop gang /* blockIdx.x */
59, Scalar last value needed after loop for was_updated at line 74
    Loop not fused: no successor loop
75, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)
79, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)

In this snippet of compiler output, we can see that the main copying takes place at line 35 of our program. However, we can also spot that the compiler has implicitly placed a copyin directive at line 44, right in the middle of our main loop. This results in all the edges of the graph being copied to the GPU again at every step of our BFS loop. As this main loop is not measured as part of the memory transfer, the memory transfer time appears artificially (and incorrectly) short. It also introduces a lot of latency and overhead in our main loop, as the edges are copied unnecessarily over and over again. For large graphs (like graph500-24) this introduces a drastic performance hit. Luckily, this problem is easily solved. Looking at our final memory transfer code in figure 4.4, we simply change the line

#pragma acc data copy(results[0:node_count])

to

#pragma acc data copy(results[0:node_count]) copyin(edges[0:edge_count])

This extra clause instructs the compiler to copy the edges to the GPU at the start of our data region. This means the compiler no longer needs to do the implicit transfer, and thus no longer places it incorrectly. In theory, this should reduce the processing time of the OpenACC BFS algorithm significantly, while bringing the memory transfer time up to be more in line with the original CUDA implementation. The second major optimisation was also obtained while looking at the compiler output. Observe the following output:

44, Generating Tesla code
    45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
58, Generating Tesla code
    59, #pragma acc loop gang /* blockIdx.x */

Here we can see that the compiler decided to vectorise the first for-loop, but has not done the same for the second for-loop. As the loops are completely independent and vectorised behaviour is desired, we make this vectorisation explicit by turning

#pragma acc parallel loop

into

#pragma acc parallel loop gang vector

This tells the compiler that the loop can exploit both gang-level and vector-level parallelism, which roughly map to a CUDA block and thread respectively. In this stage some other small optimisations were made as well. For instance, all OpenACC parallel loop constructs were made async, meaning that the host does not wait for the GPU to finish its task before moving on with the other code. In our case this came down to the host being able to queue the parallel for-loops instead of starting them sequentially, as sketched below.
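The following self-contained sketch shows what this async pattern looks like on two independent loops; it illustrates the directive usage only and is not the actual BFS code.

#define N (100000)

// Both loops are submitted to async queue 1, so the host enqueues the second
// loop without waiting for the first to finish, and only blocks at the wait.
void scale_and_offset(float *a, float *b) {
    #pragma acc data copy(a[0:N], b[0:N])
    {
        #pragma acc parallel loop gang vector async(1)
        for (int i = 0; i < N; i++) {
            a[i] = a[i] * 2.0f;
        }

        #pragma acc parallel loop gang vector async(1)
        for (int i = 0; i < N; i++) {
            b[i] = b[i] + 1.0f;
        }

        #pragma acc wait(1)  // host blocks here until both loops have completed
    }
}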

4.4.2 Optimised results

Here we can see that our optimisations have had a major effect on the runtime of the OpenACC port. In figure 4.12, we see that CUDA's advantage has almost completely disappeared, except for graph 23, where it is about 20% faster than OpenACC. For the other graphs in this figure, OpenACC is generally a lot faster. Figure 4.13 paints a similar picture. OpenACC is, again, generally faster than CUDA except for two outliers, dbpedia-all and orkut-links. The actual speed increase of OpenACC differs per graph, but seems to hover around 25% over CUDA. Finally, in figure 4.14 we see the biggest improvement. Whereas in the unoptimised comparison in figure 4.11 we saw that CUDA was a lot faster (sometimes up to 95% faster), the two implementations are now very similar. Although OpenACC is not beating CUDA as confidently as it does for the other graphs, there is still a major performance improvement compared to the unoptimised version. The raw performance numbers for the optimised algorithm can be found in appendix D, section 8.4.

Figure 4.12: Optimised BFS performance comparison for Graph500 graphs.

Figure 4.13: Optimised BFS performance comparison for KONECT graphs.

Figure 4.14: Optimised BFS performance comparison for SNAP graphs.

CHAPTER 5 PageRank

In this chapter we present the different OpenACC PageRank implementations we have designed in the context of this thesis. We further provide a detailed analysis of their performance from the perspective of how competitive they are against their CUDA counterparts.

5.1 The Algorithm

PageRank is an algorithm built to measure the "importance" of a node in a graph relative to the other nodes. It is the original algorithm behind the Google search service [19][3], but is definitely not limited to use in web pages. The equation defining a node's PageRank score is as follows:

PR(p_i) = 1 - d + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Where:

• d is the damping factor, a way to dampen the final results;

• M(x) is the set of nodes with an edge to node x;

• L(x) is the total number of edges originating from node x;

• PR(x) is the PageRank score of node x.

There has been some debate about the correctness of this equation, however, as the paper describing the actual Google implementation [3] describes the sum of all PageRank scores in a graph as being equal to 1, which would result in the following equation:

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Where N is the total number of nodes in the graph. In our implementation, we use the second equation, as the first is prone to floating-point overflow errors.

PageRank requires that all nodes that have incoming edges also have at least one outgoing edge [4]. There are multiple ways of solving this problem; we have solved it by adding the reverse of all incoming edges as outgoing edges for nodes without outgoing edges. For example, if we have the edge (a, b) and b has no outgoing edges, we add the edge (b, a). A sketch of this preprocessing step is shown below.

As PageRank is an iterative algorithm, we keep repeating the equation for each node in the graph, until each node's score changes only by a small amount (0.1% or less).
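The sketch below illustrates this preprocessing step on a simple edge list. It is a hypothetical helper written for this explanation, not the thesis's preprocessing code; a real implementation would also have to update the out-degree counts used by the PageRank equation afterwards.

#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t from, to; } edge_t;

// For every edge (a, b) whose destination b has no outgoing edges, append the
// reverse edge (b, a). out_count holds the original out-degree of each node.
edge_t* fix_dangling_nodes(const edge_t *edges, uint64_t edge_count,
                           const uint32_t *out_count, uint64_t *new_edge_count) {
    // Worst case: every edge gets a reverse edge appended.
    edge_t *fixed = malloc(2 * edge_count * sizeof(edge_t));
    uint64_t n = 0;

    for (uint64_t i = 0; i < edge_count; i++) {
        fixed[n++] = edges[i];
        if (out_count[edges[i].to] == 0) {
            fixed[n].from = edges[i].to;
            fixed[n].to   = edges[i].from;
            n++;
        }
    }

    *new_edge_count = n;
    return fixed;
}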

pagerank_t* pagerank(graph_cuda_t* graph) {
    pagerank_t* ranks = init_pagerank(graph);

    timer_start(timer_memtransfers);
    copy_graph_to_gpu(graph);
    timer_start(timer_nomemtransfers);

    do {
        // Do PageRank
        pagerank_do<<<block_count, CUDA_BLOCKSIZE>>>(graph, ranks, next_ranks);
        // Find max change
        pagerank_max_reduce<<<block_count, CUDA_BLOCKSIZE>>>
            (ranks, ranks_next, max_array, nodecount);
        // ranks = ranks_next
        pagerank_shift<<<block_count, CUDA_BLOCKSIZE>>>(ranks, ranks_next, nodecount);
    } while(max_change > PAGERANK_THRESHOLD);

    timer_stop(timer_nomemtransfers);

    cudaMemcpy(ranks, ranks_device, nodecount * sizeof(pagerank_t), cudaMemcpyDeviceToHost);

    delete_graph_from_gpu(graph);

    timer_stop(timer_memtransfers);

    return ranks;
}

Figure 5.1: Shortened PageRank CUDA structure

We implement this algorithm by looping over all nodes in the graph, and then looping over all incoming edges of each node in a nested loop. When we have calculated the next PageRank iteration, we compute the percentage of change for each node, and then find the maximum change percentage over all nodes. If this percentage is lower than our threshold, we consider the PageRank scores to be calculated. In order to verify the correctness and equivalence of the implementations, we make sure that the final PageRank scores and the number of iterations of both versions of the algorithm are equal.
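A sequential sketch of the loop structure just described is shown below. It is an illustration only: the field names follow the OpenACC code in figures 5.3 and 5.6, but the type definitions and the damping factor value (0.85) are our own assumptions.

#include <math.h>
#include <stdint.h>

#define PAGERANK_D         0.85f   /* assumed damping factor */
#define PAGERANK_THRESHOLD 0.1f    /* stop once no node changes by more than 0.1% */

typedef struct {
    uint32_t  in_count;   // number of incoming edges
    uint32_t *in;         // indices of the nodes linking to this node, M(x)
    uint32_t  out_count;  // number of outgoing edges, L(x)
} node_t;

void pagerank_sequential(const node_t *nodes, uint32_t node_count,
                         float *ranks, float *ranks_next) {
    float max_change;
    do {
        max_change = 0.0f;

        for (uint32_t i = 0; i < node_count; i++) {
            float rank = 0.0f;

            // Sum the contributions of all nodes linking to node i.
            for (uint32_t j = 0; j < nodes[i].in_count; j++) {
                uint32_t in_index = nodes[i].in[j];
                rank += ranks[in_index] / nodes[in_index].out_count;
            }
            ranks_next[i] = ((1.0f - PAGERANK_D) / node_count) + (PAGERANK_D * rank);

            // Track the largest relative change (in percent) over all nodes.
            float change = fabsf(ranks[i] - ranks_next[i]) / ranks[i] * 100.0f;
            if (change > max_change)
                max_change = change;
        }

        // The new scores become the current scores for the next iteration.
        for (uint32_t i = 0; i < node_count; i++)
            ranks[i] = ranks_next[i];

    } while (max_change > PAGERANK_THRESHOLD);
}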

5.2 Porting

Our original CUDA implementation is roughly described by the code in figure 5.1. Here we can see that there are roughly seven steps to port:

1. Allocate the GPU memory.

2. Copy the relevant graph fields to the GPU.

3. Do a PageRank iteration.

4. Find the maximum change.

5. Shift the newly computed PageRank values to our array with "current" values.

6. Copy the final PageRank scores back to the host.

7. Deallocate the graph from the GPU.

static void copy_graph_to_gpu(graph_cuda_t* graph) {
    for(uint32_t i = 0; i < graph->node_count; i++) {
        // Copy each node's in array
        cudaMalloc(&graph->nodes.host[i].in.device,
                   graph->nodes.host[i].in_count * sizeof(uint32_t));
        cudaMemcpyAsync(graph->nodes.host[i].in.device, graph->nodes.host[i].in.host,
                        graph->nodes.host[i].in_count * sizeof(uint32_t), cudaMemcpyHostToDevice);
        // The outgoing edges are not useful in our case, so we don't copy them
    }
    cudaMalloc(&graph->nodes.device, graph->node_count * sizeof(node_cuda_t));
    cudaMemcpyAsync(graph->nodes.device, graph->nodes.host,
                    graph->node_count * sizeof(node_cuda_t), cudaMemcpyHostToDevice);
}

Figure 5.2: CUDA function to copy the graph to the GPU

These can be reduced to two main parts: the memory transfer operations and the computing operations. We describe these processes separately.

5.2.1 Porting memory transfers

By far the easiest memory operation to port is the copying of the final PageRank scores back to the host. As OpenACC has a simple pragma to update a host variable from the matching device variable, we simply changed the line

cudaMemcpy(ranks, ranks_device, graph->node_count * sizeof(pagerank_t), cudaMemcpyDeviceToHost);

to

#pragma acc update host(ranks[0:graph->node_count])

The more complicated functions are the ones copying the graph to the GPU, and deallocating it after the algorithm is complete. In figure 5.2 we can see our original CUDA implementation. This consists of a series of memory allocations and asynchronous memory copies. As the CUDA malloc function returns a pointer to a device memory location, we have to keep track of two sets of pointers: one to the host memory, and one to the device memory. This forces us to modify our graph data structure, as we now have to declare "dual pointer" types containing a host and a device pointer. This makes our code more verbose and error-prone.

Figure 5.3 presents our OpenACC implementation of this function. The first noticeable thing is that the function itself is much shorter and a lot less verbose. To replace our CUDA malloc/memcpy combination, we can use the OpenACC enter data copyin(host[start:count]) construct, which allocates sizeof(host_type) * count bytes on the device and then copies elements host[start] through host[start+count-1] from host memory to the device. We also declare these memory operations to be async, which places them in a queue. This allows us to schedule all copy operations and have them run concurrently with the rest of the code. The first time we need the memory, we simply call #pragma acc wait to force these async operations to finish before we move on. Notice that we can use standard C syntax to declare what needs to be copied to the GPU, and OpenACC takes care of setting up the pointers. Although under the hood there are still multiple copies of the variables and the same amount of memory is used, this is likely to prevent programmer errors, as it is all handled for the programmer by the compiler and runtime.

static void copy_graph_to_gpu(graph_acc_t* graph) {
    #pragma acc enter data copyin(graph[0:1]) async
    #pragma acc enter data copyin(graph->nodes[0:graph->node_count]) async

    for(uint32_t i = 0; i < graph->node_count; i++) {
        #pragma acc enter data \
            copyin(graph->nodes[i].in[0:graph->nodes[i].in_count]) async
    }
}

Figure 5.3: OpenACC function to copy the graph to the GPU

For the deallocation of the graph from the GPU in CUDA, we kept the same structure and for-loop, but changed the cudaMalloc/cudaMemcpy calls into cudaFree calls on the same variables, and reversed the order. With OpenACC we also reversed the order, and replaced the

#pragma acc enter data copyin(var) async

pragmas with

#pragma acc exit data delete(graph[0:1]) async

pragmas. This states that we want to deallocate the memory on the GPU without copying it back to the host, as we do the copying manually when needed (a sketch of the resulting routine is shown at the end of this subsection).

At this point, we have ported our memory operations to OpenACC, and we also encounter our first catch: our compiler cannot determine whether a variable is already in GPU memory if we declare the copy operations ourselves. This results in the compiler implicitly adding copyin and copyout operations around the places where we do GPU computations on the data. As we have two main OpenACC parallel compute sections (which we explain in detail in section 5.2.2), this results in four extra (and quite large) memory operations. More importantly, however, the compiler fails to correctly determine the sizes of the arrays to copy and also incorrectly guesses that each pointer variable holds an array. This results in invalid memory accesses and a lot of runtime crashes.

To counteract this, we have to declare which variables are present. This causes the compiler to skip the allocation and copying of these variables. In figure 5.4 we can see an example of how that looks in our ported code. In general, these present declarations are always needed in OpenACC code in which the programmer does manual memory management (as opposed to having the compiler figure it out itself), and they can quickly bloat the code to the point where the lines saved by the shorter copyin/copyout operations are lost again to present pragmas.
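For reference, the OpenACC deallocation routine described above might look roughly as follows; this is a sketch mirroring copy_graph_to_gpu from figure 5.3, not the exact thesis code.

static void delete_graph_from_gpu(graph_acc_t* graph) {
    // Free the per-node edge arrays first, then the node array, then the graph
    // struct itself: the reverse of the allocation order in figure 5.3.
    for(uint32_t i = 0; i < graph->node_count; i++) {
        #pragma acc exit data delete(graph->nodes[i].in[0:graph->nodes[i].in_count]) async
    }

    #pragma acc exit data delete(graph->nodes[0:graph->node_count]) async
    #pragma acc exit data delete(graph[0:1]) async
    #pragma acc wait
}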

5.2.2 Porting compute operations

Assuming that all the GPU memory has now been handled, we continue with porting the actual compute kernels to OpenACC. Our CUDA code contains three kernels. The first kernel computes a new PageRank score for each node in our graph, based on their previous scores stored in prev_rank, and stores it in the next_rank array. The second kernel determines the change percentage for each node, and then reduces all these change percentages to their collective maximum (a classic max-reduce problem). The final kernel shifts each element in next_rank to the same index in prev_rank.

PageRank Kernel

In figure 5.5 we can see our basic PageRank kernel. In this kernel, we create a number of threads equal to the number of nodes in the graph. We then make each thread compute the PageRank score of its respective node.

copy_graph_to_gpu(graph);

#pragma acc data present(ranks[0:graph->node_count], ranks_next[0:graph->node_count])
#pragma acc data present(graph->nodes->in, graph->nodes, graph)
{
    do {
        ...
    } while(max_change > PAGERANK_THRESHOLD);

    #pragma acc update host(ranks[0:graph->node_count])
}

delete_graph_from_gpu(graph);

Figure 5.4: Addition of present pragmas. Notice the extra set of brackets to signal in which code block the variables are present.

__global__ void pagerank_kernel(graph_cuda_t* graph,
                                pagerank_t* prev_rank, pagerank_t* next_rank) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i < graph->node_count) {
        pagerank_t rank = 0;

        for(uint32_t j = 0; j < graph->nodes.device[i].in_count; j++) {
            uint32_t in_index = graph->nodes.device[i].in.device[j];
            rank += (prev_rank[in_index] / graph->nodes.device[in_index].out_count);
        }

        next_rank[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);
    }
}

pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    unsigned int block_count = (graph->node_count / CUDA_BLOCKSIZE) + 1;
    pagerank_do<<<block_count, CUDA_BLOCKSIZE>>>(graph_device, ranks_device, ranks_device_next);
    ...
}

Figure 5.5: CUDA PageRank kernel

pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    do {
        // Do PageRank
        #pragma acc parallel loop
        for(uint32_t i = 0; i < graph->node_count; i++) {
            pagerank_t rank = 0;

            #pragma acc loop reduction (+:rank)
            for(uint32_t j = 0; j < graph->nodes[i].in_count; j++) {
                uint32_t in_index = graph->nodes[i].in[j];
                rank += (ranks[in_index] / graph->nodes[in_index].out_count);
            }

            ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);

            // Determine the threshold
            thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
        }
        ...
    } while(max_change > PAGERANK_THRESHOLD);
    ...
}

Figure 5.6: OpenACC: PageRank kernel, ported.

The first order of business is rewriting the kernel to a sequential form and moving it out of the kernel function. This allows us to annotate it with OpenACC pragmas. Luckily for us, this is fairly easy for the PageRank function. Once we have copied the body and rewritten it to a basic for-loop, we can annotate the for-loop with

#pragma acc parallel loop

indicating that this for-loop is completely parallel and should be offloaded to the device. As a small optimisation step, we put the computation of the change percentage in this loop. In CUDA, this change percentage is computed in the maximum change percentage kernel. We have also annotated the inner loop with

#pragma acc loop reduction(+:rank)

This tells the compiler that this loop contains a reduction in the variable rank (the sum part of the PageRank equation), enabling the compiler to insert specific optimisations for the reduction. At this point, we have fully ported a basic version of the PageRank kernel to OpenACC. Our final result can be seen in figure 5.6.

Maximum Threshold Kernel

During the porting of this kernel, the high-level nature of OpenACC really starts to make a difference. CUDA does not implement any reduction algorithms natively, which means that we have to implement them ourselves, or use third-party libraries like Thrust [2]. This does open up some huge optimisation opportunities for us. Using the basic reduction optimisation idea taken from NVIDIA itself [8], we can implement a reduction kernel that takes full advantage of the architecture of NVIDIA GPUs. As we can compare at most two elements at the same time, we can in theory halve the number of values to be reduced at each step of the process. We can also take advantage of latency hiding, and try to remove any warp divergence. This leaves us

pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    do {
        // Do PageRank
        ...
        // Max change
        max_change = 0.0f;
        #pragma acc parallel loop reduction(max:max_change)
        for(uint32_t i = 0; i < graph->node_count; i++){
            max_change = fmaxf(max_change, thresholds[i]);
        }
        ...
    } while(max_change > PAGERANK_THRESHOLD);
    ...
}

Figure 5.7: OpenACC maximum change reduction

with a very long and verbose function (more than 60 lines in total). The final code for this max-reduce function can be found in the appendix in section 8.1; it is mostly the same as the code from the NVIDIA lecture [8], with the additions replaced by calls to the function fmaxf(a,b). OpenACC makes this process a lot easier and shorter. We simply implement a standard sequential max-reduce for-loop, and annotate it with:

#pragma acc parallel loop reduction(max:max_change)

This approach results in OpenACC being able to implement the optimal reduction algorithm for the platform we are compiling for. As we can see in figure 5.7, this code is only five lines in total, and thus much easier to read and comprehend.

5.3 Performance

In figures 5.8, 5.9 and 5.10 you can see our benchmarking results for the PageRank algorithm. Surprisingly, our OpenACC port is faster for every graph by a significant margin. The standard deviation is also generally lower, which means that the OpenACC port is generally more consistent as well. The full data can be found in appendix C, section 8.3. When investigating the specific graph results, we can see that the dbpedia-starring graph in figure 5.9 is the only graph where the CUDA performance is within 20% of OpenACC. In the next section, we investigate why our OpenACC implementation is faster, and how we can potentially increase this performance gap.

5.4 Optimising the OpenACC implementation

5.4.1 Diagnosing and patching

In order to (potentially) exploit the behaviour that makes our OpenACC port faster than the original CUDA implementation, we must first figure out what this behaviour specifically is. Using the NVIDIA profiler, nvprof, does not help us any further, as it only shows us that OpenACC is faster, not why. As such, we turn to the compiler information output:

66, Generating Tesla code
    67, #pragma acc loop gang /* blockIdx.x */
    71, #pragma acc loop vector(128) /* threadIdx.x */
        Generating reduction(+:rank)
67, Loop not fused: no successor loop
71, Loop is parallelizable
84, Generating Tesla code
    85, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        Generating reduction(max:max_threshold)

Figure 5.8: Unoptimised PageRank performance comparison for Graph500 graphs.

Figure 5.9: Unoptimised PageRank performance comparison for KONECT graphs.

Figure 5.10: Unoptimised PageRank performance comparison for SNAP graphs.

At this point we can start to understand why OpenACC is faster: it treats our main PageRank loops (seen in figure 5.11) differently than we do in CUDA. In our original CUDA implementation, each thread handles a single node. A thread calculates the new PageRank score based on all the incoming connections of that node, which can be seen in the inner loop in figure 5.11. The OpenACC compiler has taken a different approach: it treats the innermost loop as a vector calculation, and the outer loop as a gang (a block in CUDA terms). This results in a lot more threads: each gang now handles a single node, and the threads within that gang work together, each adding its contribution to the node's new PageRank score. Thus, in OpenACC, the work per thread is smaller, as we have multiple threads working on the update of a single node. This results in a much better load balance than CUDA achieves. In the end, this amounts to a completely different algorithmic approach than the one taken in the original CUDA code, and one that is clearly faster. Although it is possible to reproduce this behaviour in CUDA by changing the algorithm, this algorithm is (significantly) more difficult to implement, because the nodes all have a different number of incoming connections: we would have to create a second algorithm to start enough CUDA threads and to then map each thread to both a node and an incoming edge. Even then, we would have to write even more code to combine these threads into one to do the final new rank calculation.
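To illustrate the difference, the sketch below shows roughly what this gang/vector mapping corresponds to in CUDA terms: one thread block per node, with the threads of the block striding over that node's incoming edges and combining their partial sums in a block-level reduction. This is not part of our implementation; the kernel name, the fixed block size of 128 (matching the vector length the compiler chose) and the shared-memory reduction are assumptions made purely for illustration.

    /* Sketch: one block (OpenACC gang) per node, threads (OpenACC vector lanes)
     * cooperate on that node's incoming edges. Assumes blockDim.x == 128. */
    __global__ void pagerank_block_per_node(graph_cuda_t* graph,
                                            pagerank_t* prev_rank, pagerank_t* next_rank) {
        __shared__ pagerank_t partial[128];
        uint32_t node = blockIdx.x;
        pagerank_t rank = 0;

        /* Each thread handles a strided subset of the incoming edges of this node. */
        for (uint32_t j = threadIdx.x; j < graph->nodes.device[node].in_count; j += blockDim.x) {
            uint32_t in_index = graph->nodes.device[node].in.device[j];
            rank += (prev_rank[in_index] / graph->nodes.device[in_index].out_count);
        }
        partial[threadIdx.x] = rank;
        __syncthreads();

        /* Tree reduction of the partial sums within the block. */
        for (uint32_t stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride) {
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            next_rank[node] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * partial[0]);
        }
    }

Such a kernel would be launched with one 128-thread block per node. The extra mapping and reduction bookkeeping it contains is exactly the code that the OpenACC compiler generates for us from the loop nest in figure 5.11.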

// Do PageRank itself
#pragma acc parallel loop
for(uint32_t i = 0; i < graph->node_count; i++){
    pagerank_t rank = 0;

    #pragma acc loop reduction(+:rank)
    for(uint32_t j = 0; j < graph->nodes[i].in_count; j++){
        uint32_t in_index = graph->nodes[i].in[j] - 1;
        rank += (ranks[in_index] / graph->nodes[in_index].out_count);
    }
    ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);
    // Determine the threshold
    thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
}

Figure 5.11: Main PageRank OpenACC loop.

Our next task is exploiting this new algorithmic behaviour to further the lead that our OpenACC implementation has. We do this by making the inner loop's vector behaviour explicit, and by tuning the vector length. In order to make the vectorised behaviour explicit, we change

#pragma acc parallel loop

above our outer loop to

#pragma acc parallel loop gang

signalling that this outer loop exploits gang-level parallelism. This means that each iteration of the loop is handled by a different gang (a CUDA block). For the inner loop, we change

#pragma acc loop reduction(+:rank)

to

#pragma acc loop vector reduction(+:rank)

to make our new vectorised behaviour explicit. Returning to the compiler output, we observe the following line:

71, #pragma acc loop vector(128) /* threadIdx.x */
    Generating reduction(+:rank)

Our compiler is notifying us that the inner loop is vectorised with a vector length of 128. This means that the inner loop processes exactly 128 elements (incoming edges) at a time. When there are more than 128 elements, the loop is scheduled multiple times until all elements have been processed. When there are fewer than 128 elements, the threads that have no work are still started and scheduled, but simply do nothing. This introduces a tradeoff: a larger vector length can increase total parallelism as more elements are processed at a time, while a smaller vector length results in fewer threads doing nothing. For our implementation, each element is one incoming edge of a node. As most nodes do not have more than 32 edges (and GPUs perform best when the vector length is a multiple of 32), we choose 32 as our vector length. In theory, this should result in fewer idle threads (provided we do not regularly have vertices with more than 32 edges). To achieve this, we change our outer-loop directive from

#pragma acc parallel loop gang

to

#pragma acc parallel loop gang vector_length(32)

Figure 5.12: Optimised PageRank performance comparison for Graph500 graphs.

This sets all the vectors in the gang to size 32. In addition to the previous optimisations, we have also turned all OpenACC regions into async regions. This enables us to queue the next region before the previous one has been completely processed. When we need to use the data of one of the regions on our host, we can simply call

#pragma acc wait

to wait until all the regions in the queue have been processed.
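Putting the gang, vector_length, vector and async changes together, the main loop nest looks roughly as follows. This is a condensed sketch: apart from the added clauses it mirrors figure 5.11, and the handling of the asynchronous queues is simplified.

    do {
        /* One gang of 32 vector lanes per node; queued asynchronously. */
        #pragma acc parallel loop gang vector_length(32) async
        for (uint32_t i = 0; i < graph->node_count; i++) {
            pagerank_t rank = 0;

            #pragma acc loop vector reduction(+:rank)
            for (uint32_t j = 0; j < graph->nodes[i].in_count; j++) {
                uint32_t in_index = graph->nodes[i].in[j] - 1;
                rank += (ranks[in_index] / graph->nodes[in_index].out_count);
            }
            ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);
            thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
        }

        /* The max-change reduction goes on the same default async queue, so it
         * still runs after the PageRank loop above. */
        max_change = 0.0f;
        #pragma acc parallel loop reduction(max:max_change) async
        for (uint32_t i = 0; i < graph->node_count; i++) {
            max_change = fmaxf(max_change, thresholds[i]);
        }

        /* Wait for the queued regions to finish before the host reads max_change. */
        #pragma acc wait
    } while (max_change > PAGERANK_THRESHOLD);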

5.4.2 Optimised results

After running our benchmarks again, we observe that not much has changed. In figure 5.12 we can see that, on average, the optimised version is slightly faster. This is not the case for all graphs in this figure, however, as specifically the larger graphs (22, 23 and 24) have lost performance relative to their unoptimised counterparts. We think this behaviour is due to our decision to set the vector size to 32 edges. For larger graphs, where nodes have a larger number of edges on average, this could result in lower performance. In figure 5.13 we can observe the same kind of results as in figure 5.12. For some graphs, our new version has gained performance, while for other graphs it has lost performance. Again, we hypothesize that this behaviour is due to our decision to set the vector length to 32 instead of a higher number. The SNAP graphs in figure 5.14 show better results. Although no graph shows major gains over our previous, unoptimised version, none of the graphs perform worse for OpenACC than they did in the previous version. These results tell us that it might be possible to optimise OpenACC even further by choosing the vector length based on the average number of incoming edges for each node. The raw performance numbers for the optimised algorithm can be found in appendix D, section 8.4.

Figure 5.13: Optimised PageRank performance comparison for KONECT graphs.

Figure 5.14: Optimised PageRank performance comparison for SNAP graphs.

CHAPTER 6 Multithreading OpenACC

One of the main advantages of OpenACC is its cross-platform portability. In fact, one of the main goals of OpenACC is that one should be able to write a single code for all platforms, and have the compiler do the porting. In our previous tests, we have compiled OpenACC for the NVIDIA Tesla architecture in order to compare GPU performance to a lower-level language (CUDA in our case). In these tests, we have shown that OpenACC compilers currently have the ability to generate code that performs on par with or better than CUDA. Thus, in this chapter, we will examine whether the same code behaves well on the CPU. Specifically, we do this by comparing against an equivalent OpenMP implementation. For this comparison, we use our PageRank implementation, as it contains a combination of both memory-intensive sections and computationally intensive sections. This is in contrast to BFS, where there is almost no computation, and the entire algorithm consists mainly of memory operations. We make sure the OpenMP and OpenACC implementations are equivalent by comparing their final PageRank scores and the total number of iterations run.
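The equivalence check itself is simple; a minimal sketch of the idea (not our actual benchmark harness) is shown below. The function name is hypothetical, pagerank_t is assumed to be a single-precision float as elsewhere in our code, fabsf comes from <math.h>, and the tolerance of 1e-6 is an arbitrary choice for illustration.

    /* Returns 1 when both implementations ran the same number of iterations and
     * their final PageRank scores agree within a small tolerance, 0 otherwise. */
    int results_equivalent(const pagerank_t* acc_ranks, const pagerank_t* omp_ranks,
                           uint32_t node_count,
                           uint32_t acc_iterations, uint32_t omp_iterations) {
        if (acc_iterations != omp_iterations) {
            return 0;
        }
        for (uint32_t i = 0; i < node_count; i++) {
            if (fabsf(acc_ranks[i] - omp_ranks[i]) > 1e-6f) {
                return 0;
            }
        }
        return 1;
    }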

6.1 OpenMP vs. OpenACC

Like OpenACC, OpenMP is a compiler-directive-based API. This means that one should write sequential code first. Once this code is fully functional, the programmer adds the OpenMP compiler directives to convert it to multithreaded code. The main difference with OpenACC is that OpenMP was created for CPU multithreading, while OpenACC was designed to be more general. As OpenACC and OpenMP are so similar, porting the code is not that interesting. In order to create the OpenMP benchmark, we simply removed all OpenACC directives and replaced them with the equivalent OpenMP ones. This meant replacing

#pragma acc parallel loop

with

#pragma omp parallel for

to start with. We also removed all OpenACC data directives, as there is no need to move data to any accelerator device. Finally, we made some OpenMP-specific optimisations, like annotating any for-loops that perform reductions with their equivalent reduction(operator:variable) clauses, and figuring out the optimal scheduling algorithm by benchmarking. Specifically, most nodes in our graphs have a different number of incoming edges, which means that they have a different amount of work to do. Static scheduling gives each thread an equal number of iterations (i.e., nodes in our case) to perform [11]. This means that some threads can have a lot more calculations to perform than others, increasing the total computation time. Dynamic scheduling assigns loop iterations to each thread dynamically. This has the advantage of distributing the workload much more evenly

across threads, which should (in theory) reduce the total processing time [11]. This is not always the case, however, as algorithms with very high data locality might have that locality broken when the memory accesses become more random due to the dynamic scheduling across threads. Additionally, the extra overhead of a dynamic scheduling algorithm might increase the runtime by a relatively large amount for small graphs. In these cases, the static scheduling algorithm might still have the upper hand. For our OpenMP implementation, we implemented a way to dynamically choose the fastest scheduling algorithm on a per-graph basis by doing a couple of test runs of both algorithms, and then picking the one with the lowest average runtime. This should result in the fastest scheduling algorithm being chosen for each graph.

Figure 6.1: Unoptimised PageRank CPU performance comparison for Graph500 graphs.
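As an illustration, the two variants that our per-graph selection switches between differ only in the schedule clause on the parallel loop. This is a sketch, not the exact benchmark code; the loop body is the same PageRank update as in the OpenACC version and the chunk size is left to the OpenMP runtime.

    /* Static scheduling: the nodes are split into equal, fixed chunks per thread. */
    #pragma omp parallel for schedule(static)
    for (uint32_t i = 0; i < graph->node_count; i++) {
        /* ... compute ranks_next[i] and thresholds[i], as in the OpenACC version ... */
    }

    /* Dynamic scheduling: threads grab the next chunk of nodes as they finish, which
     * balances the uneven in-degrees at the cost of some scheduling overhead. */
    #pragma omp parallel for schedule(dynamic)
    for (uint32_t i = 0; i < graph->node_count; i++) {
        /* ... same loop body ... */
    }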

6.2 Results

In this section we present the performance comparison of OpenACC vs. OpenMP. We include both a comparison with our initial unoptimised version of OpenACC and one with the optimised version, to see whether our GPU-targeted optimisations have any effect on the CPU performance.

6.2.1 Unoptimised OpenACC

Examining the relative performance of both APIs, we see that, generally, OpenMP is faster than OpenACC. For the graphs where OpenACC is faster than OpenMP, the difference is marginal. Figure 6.1, for example, shows that up to and including graph 16, OpenMP is faster by a large amount. For graphs 17 through 21, the difference between the two APIs is very small. In graphs 22 through 24, OpenMP takes a large lead again. Figure 6.2 again shows that OpenMP is usually faster than OpenACC. Although for most graphs the runtime of OpenACC is within 15% of the runtime of OpenMP, the difference is sometimes as large as 30%.

Figure 6.2: Unoptimised PageRank CPU performance comparison for KONECT graphs.

Figure 6.3: Unoptimised PageRank CPU performance comparison for SNAP graphs.

Figure 6.4: Optimised PageRank CPU performance comparison for Graph500 graphs.

Benchmarking the SNAP graphs in figure 6.3 shows a slightly different story. Here, the difference between OpenMP and OpenACC is minimal, with OpenACC actually being faster by 15% in the email-EuAll graph. This is offset by the roadNet-CA and roadNet-PA graphs, however, where OpenMP performs around 22.5% and 15% better respectively. Note that, contrary to the graphs comparing OpenACC and CUDA in chapters 4 and 5, these graphs contain no error bars. This is due to the fact that for smaller graphs, the standard deviation of OpenACC was so large that the graphs became impossible to display, with the standard deviation being larger than the average runtime in some cases. In appendix E (section 8.5), we include the raw results for this standard deviation.

6.2.2 Optimised OpenACC

Examining the results for our optimised OpenACC implementation against OpenMP, we can see that the optimisations have not influenced the CPU performance in a positive way. Only for the SNAP graphs (figure 6.6) can we observe a positive effect, on average. For the other graphs, in figures 6.4 and 6.5, the optimisations have only made the results more inconsistent. There seems to be no pattern in the performance differences between OpenMP and OpenACC, and the optimisations seem to have made some (random) graphs faster, and some slower.

6.3 Examining Results

Based on these results, we conclude that our GPU-centered optimisations can have unexpected, and even adverse, effects on multithreaded OpenACC code. Furthermore, we can conclude that even though it is possible to use OpenACC to write multithreaded CPU code, it is not to be recommended, as its performance relative to other multithreading APIs (OpenMP in this case) can vary wildly between inputs. Finally, OpenACC can be very inconsistent even with the same input. As can be seen in appendix E (section 8.5), the standard deviation between results can be larger than the average

Figure 6.5: Optimised PageRank CPU performance comparison for KONECT graphs.

Figure 6.6: Optimised PageRank CPU performance comparison for SNAP graphs.

runtime of a single iteration, meaning that some runs can take four times longer than other runs. Optimising OpenACC for CPUs is beyond the scope of this work. However, we believe it is important for future studies to determine how much of the performance variability is due to the compilers, the workloads, and an actual GPU-specific approach. Understanding these problems could eventually lead to a systematic process of writing high-performance code for CPUs. Ultimately, this could enable OpenACC to be cross-platform performance portable. So far, our results show this is not yet the case.

CHAPTER 7 Conclusion and Future Work

7.1 Discussion

In this thesis we have compared both the performance and the ease-of-use of OpenACC relative to CUDA. Although we have achieved great results for OpenACC, it is important to note the limitations and boundaries of this research.

Firstly, we have only compared two algorithms. Although these algorithms form a nice general basis for other algorithms (BFS being more memory intensive than computationally expensive, and PageRank being a mix of both), they certainly do not represent all algorithms, and as such OpenACC might not be as fast everywhere as it is in this thesis. Furthermore, while our CUDA implementations were optimised to the best of our abilities, there is no guarantee that a far more experienced CUDA programmer could not optimise the CUDA implementations in such a way that they become faster than our current OpenACC ports. These two algorithms are also examples of very parallel algorithms, which can be nicely represented using a series of for-loops. This sort of structured code is ideal for OpenACC and makes it very easy for the compiler to optimally parallelise the code. For algorithms that do not enjoy such a clear structure, OpenACC might not be as fast or as easy to program as it is here.

Finally, our CPU OpenMP vs. OpenACC comparison was not heavily optimised. Although OpenMP is not as easily optimised as OpenACC and CUDA, more attention could have been spent on optimising our implementation; we did not go any further than switching between scheduling algorithms on a per-graph basis. Although we are quite confident that the OpenMP implementation is mostly optimised, we should consider that it could potentially be slightly faster than it is now.

7.2 Conclusion

Graph processing is important, but difficult. We do not know which platforms are best for which graphs and algorithms, and portable programming models therefore allow us to experiment with different platforms. If these portable programming models lose too much performance compared to their native counterparts, however, they become unviable for production use. For this reason, we investigate the challenges of porting high-performance CUDA code to well-behaved and simpler OpenACC code. Our goal is to provide a systematic and consistent porting process and to evaluate the performance differences between CUDA and OpenACC. Our research has been driven by four research questions. In the following sections, we discuss our answer to each question, thus highlighting our main findings.

1. Can we create a systematic way of porting CUDA to OpenACC, and how easy is this porting process?

2. What is the difficulty of implementing our algorithms in OpenACC compared to the difficulty of implementing our algorithms in CUDA?

CUDA:    cudaMalloc(a_dev, 10 * sizeof(...));
         cudaMemcpy(a_dev, a_host, 10 * sizeof(...), hostToDevice);
OpenACC: #pragma acc enter data copyin(a_host[0:10])

CUDA:    cudaMemcpy(a_host, a_dev, 10 * sizeof(...), deviceToHost);
OpenACC: #pragma acc update host(a_host[0:10])

CUDA:    cudaFree(a_dev);
OpenACC: #pragma acc exit data delete(a_host[0:10])

CUDA:    cudaDeviceSynchronize();
OpenACC: #pragma acc wait

CUDA:    CUDA streams
OpenACC: add "async(N)" to the end of any directive, with N being the stream/queue number

CUDA:    cudaStreamSynchronize(stream); with N being the stream/queue number
OpenACC: #pragma acc wait(N)

CUDA:    cuda_kernel<<<...>>>(...);
OpenACC: #pragma acc parallel loop \
             gang vector vector_length(blk_size)
         for(unsigned int i = 0; i ...

Table 7.1: Standard CUDA constructs with their OpenACC counterparts

3. What are the performance benefits or drawbacks to using the high-level portable programming model OpenACC instead of the proprietary CUDA API?

4. Is OpenACC portable and performance portable across different types of platforms (i.e., CPUs and GPUs)?

7.2.1 Systematic way of porting

Can we create a systematic way of porting CUDA to OpenACC, and how easy is this porting process? Examining our code from chapters 4 and 5, we can see that there are a number of recurring constructs that can be directly ported from CUDA to OpenACC. In table 7.1, we show these CUDA constructs together with their OpenACC counterparts. Note that in table 7.1, we have mapped every memory transfer to an OpenACC unstructured memory directive. We opted for this approach as we have experienced that the OpenACC structured data regions are very unreliable, and either simply refuse to work, or place the memory transfers in such a way that performance drastically decreases (see chapter 4). Instead, we recommend performing all array and struct memory transfers manually, and placing a block ("{" ... "}") around the code that requires the objects. The programmer should then annotate this block with #pragma acc data present(...) constructs to tell the compiler that the arrays and structs are already present. In our experience, it is possible to leave the memory transfers of scalar variables, like a single integer or boolean, to the compiler.

Additionally, rewriting the CUDA kernels to for-loops allows the programmer to identify constructs that were not as visible before. For example, a for-loop within a CUDA kernel will result in a nested for-loop in OpenACC. The inner for-loop can then usually be annotated with #pragma acc loop vector, which can open up new ways for the compiler to parallelise the construct.

Finally, any custom code to do variable reductions (like adding up all elements of an array into a single variable) can be completely removed. The OpenACC built-in reduction directives perform just as well as advanced custom CUDA ones, but require a lot less code and are much easier to understand. The built-in OpenACC reduction is used as follows:

int result = 0;
int array[array_length];

#pragma acc parallel loop reduction(+:result)
for(int i = 0; i < array_length; i++){
    result += array[i];
}

This example is for the plus operator, but the reduction directive supports all basic operators plus the max/min functions.
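Combining the entries of table 7.1 with the present construct recommended above, a ported region typically ends up with the following shape. This is a schematic sketch using a hypothetical array a_host of length n, not code taken from our implementations; async clauses can be added to the data directives as listed in table 7.1.

    /* Manual, unstructured transfer of the array to the device. */
    #pragma acc enter data copyin(a_host[0:n])

    /* Block in which the array is guaranteed to already be resident on the device. */
    #pragma acc data present(a_host[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++) {
            a_host[i] = 2 * a_host[i]; /* placeholder device computation */
        }

        /* Copy the results back only at the point where the host needs them. */
        #pragma acc update host(a_host[0:n])
    }

    /* Remove the array from the device without copying it back again. */
    #pragma acc exit data delete(a_host[0:n])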

7.2.2 Difficulty of OpenACC

What is the difficulty of implementing our algorithms in OpenACC compared to the difficulty of implementing our algorithms in CUDA? Research question 2 requires us to answer a fairly subjective question. Combining our experiences in porting Breadth First Search and PageRank, we have gained a basic understanding of the OpenACC programming model and ecosystem. In our experience, OpenACC code is both faster and easier to write than CUDA code. While CUDA requires the programmer to think about the GPU architecture continuously, OpenACC allows the programmer to write code as they normally would and add OpenACC directives afterwards. OpenACC also allows the programmer to describe code in a more abstract way. The programmer writes their code as structured as possible, and then annotates it by declaring loops to be constructs like vector operations or reductions. The compiler then takes care of converting those structures to the target architecture.

Although the process of writing OpenACC is extremely easy, it is held back by the development environment around it. As OpenACC is not used as broadly as CUDA, there is a significant lack of both basic, easy-to-understand documentation and in-depth documentation. Multiple times we have had to solve programming problems and compiler errors by experimenting with the syntax and API calls instead of being able to find proper documentation about the given errors. Additionally, OpenACC is not yet fully supported by the Clang and GCC families of compilers, and GCC 8.1 refuses to parse our OpenACC syntax. This means we have had to use the NVIDIA PGI C compiler to compile our OpenACC code. This can be a severely limiting factor for the adoption of OpenACC, as installing yet another compiler can be cumbersome and a turn-off for many programmers.

Finally, the bugs that can occur when using OpenACC are all fairly unclear. When the programmer does not use tools like profilers and verbose compiler output, it can be very difficult to track down the source of performance-hindering bugs, like the wrongly placed memory transfer that drastically reduced the performance of our OpenACC program in chapter 4.

7.2.3 Performance benefits or drawbacks

What are the performance benefits or drawbacks to using the high-level portable programming model OpenACC instead of the proprietary CUDA API? To answer this research question, we examine the performance results of the previous chapters. In chapter 4 we saw that the naive BFS OpenACC implementation was dramatically slower than CUDA. With some basic optimisations, this problem was completely solved and OpenACC proceeded to be faster than CUDA for the synthetic graphs and as fast as CUDA for the non-synthetic graphs. The two implementations are not equal, however, suggesting that the OpenACC method of parallelising differs from the one we used for CUDA.

Chapter 5 confirms our suspicion that OpenACC enables the compiler to find different (and possibly smarter) ways of parallelising code. In this chapter, our OpenACC port performed better out of the box, with optimisations pulling it even further ahead of our CUDA implementation. We think that this chapter shows the potential strength of a high-level language like OpenACC, as implementing this same optimised algorithm in CUDA would have been massively more complicated than it is for OpenACC.

In short, our analysis indicates that the performance of OpenACC is good enough to serve as a replacement for CUDA for our graph processing workloads. In the worst cases we encountered, the performance was not so bad as to make the OpenACC version unusable. In the best cases, OpenACC was faster than even our optimised CUDA build.

7.2.4 OpenMP vs. OpenACC

Is OpenACC portable and performance portable across different types of platforms (i.e., CPUs and GPUs)? To this end, we examined the performance difference between OpenMP and OpenACC when targeting CPUs. In order to be "portable", our well-performing OpenACC GPU code should be usable, preferably without losing a lot of performance, on the CPU. This would enable graph processing algorithm developers to experiment before choosing the best platform for their current workload, which would be a massive advantage in the diverse landscape of graph processing.

In our experience, the OpenACC API lacks the advanced CPU multithreading support that OpenMP offers. For example, OpenACC offers no way to choose the scheduling algorithm used for scheduling loop iterations on threads, while OpenMP offers the schedule(...) clause on its worksharing directives to assign scheduling algorithms to individual parallel regions, allowing for better load balancing and fewer idle threads. This feature alone resulted in performance improvements of up to 30% in our OpenMP PageRank implementation.

As concluded from chapter 6, OpenACC compilers are still too inconsistent and immature when targeting multithreaded CPU configurations to be used properly, as OpenACC is slower than OpenMP for some graphs without any obvious reason. Furthermore, the tools we have used for debugging and profiling OpenACC for GPUs simply do not work here. The -Minfo compiler flag, for example, produces considerably less output for the CPU multithreaded regions than it does for the GPU-accelerated regions.

While this means that, in essence, OpenACC is portable to multiple platforms, the programmer is still better off choosing OpenMP for parallel CPU graph processing, as the OpenACC compilers struggle with performance portability.

7.3 Future Work

While this thesis has examined the basics of CUDA-to-OpenACC conversion for two graph processing algorithms, there are still many related subjects that can be researched. We give a list of our suggestions:

• An in-depth comparison between OpenMP and OpenACC multithreaded CPU performance. Although we have touched on the subject in this thesis, we have not gone into specific OpenACC CPU optimisation strategies. Thus, we think that research focusing on this subject might bring interesting results.

• Continuing from the previous item, we would like to see standard mappings from sequential CPU code to OpenACC-accelerated multithreaded CPU code, similar to the ones we have produced in section 7.2.1. The lack of information during the compilation process, as well as the lack of unified analysis and debugging tools, have made this task impossible for us in this short time.

• As OpenMP has recently added support for targeting accelerator devices (see the OpenMP 4.5 standard [18]), we would like to see a GPU performance comparison between the two APIs. We think this could be interesting, as OpenMP is the more mature API in general but is fairly new to GPU acceleration. OpenACC is a much younger API, but was made for GPU acceleration from the beginning.

• Expanding on our previous point, we suggest a more detailed look into the differences between the OpenMP and OpenACC standards, and perhaps reasons as to why a merger

between the two APIs is or is not possible. If the two standards could merge, the single resulting standard could gain the support of the audiences and users of both APIs.

• In this thesis we have spent most of our time comparing CUDA and OpenACC. We have not made a comparison between OpenCL and OpenACC, although this could be interesting. Of particular interest is the performance difference between OpenCL and OpenACC when targeting AMD GPUs (once OpenACC compilers gain support for AMD graphics cards).

Bibliography

[1] Vrije Universiteit Amsterdam. DAS-5 Overview. url: https://www.cs.vu.nl/das5/home.shtml (visited on 05/01/2019).

[2] Nathan Bell and Jared Hoberock. “Thrust: A productivity-oriented library for CUDA”. In: GPU computing gems Jade edition. Elsevier, 2012, pp. 359–371.

[3] Sergey Brin and Lawrence Page. “The anatomy of a large-scale hypertextual web search engine”. In: Computer networks and ISDN systems 30.1-7 (1998), pp. 107–117.

[4] Michael Brinkmeier. “PageRank revisited”. In: ACM Transactions on Internet Technology (TOIT) 6.3 (2006), pp. 282–301.

[5] S. Christgau et al. “A comparison of CUDA and OpenACC: Accelerating the Tsunami Simulation EasyWave”. In: ARCS 2014; 2014 Workshop Proceedings on Architecture of Computing Systems. VDE. Feb. 2014, pp. 1–5.

[6] NVIDIA Corporation. Branch Statistics. 2015. url: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/branchstatistics.htm (visited on 04/11/2019).

[7] Joseph E Gonzalez et al. “Graphx: Graph processing in a distributed dataflow framework”. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014, pp. 599–613.

[8] Mark Harris et al. “Optimizing parallel reduction in CUDA”. In: (2007). url: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf (visited on 04/24/2019).

[9] J. A. Herdman et al. “Accelerating Hydrocodes with OpenACC, OpenCL and CUDA”. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. Nov. 2012, pp. 465–471. doi: 10.1109/SC.Companion.2012.66.

[10] Tetsuya Hoshino et al. “CUDA vs OpenACC: Performance case studies with kernel benchmarks and a memory-bound CFD application”. In: 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. IEEE. 2013, pp. 136–143.

[11] Intel. OpenMP Loop Scheduling. Aug. 2014. url: https://software.intel.com/en-us/articles/openmp-loop-scheduling (visited on 05/29/2019).

[12] Jérôme Kunegis. “Konect: the koblenz network collection”. In: Proceedings of the 22nd International Conference on World Wide Web. ACM. 2013, pp. 1343–1350.

[13] Cleverson Lopes Ledur, CM Zeve, and JC dos Anjos. “Comparative analysis of OpenACC, OpenMP and CUDA using sequential and parallel algorithms”. In: 11th Workshop on parallel and distributed processing (WSPPD). 2013.

[14] Jure Leskovec and Andrej Krevl. “SNAP Datasets: Stanford Large Network Dataset Collection”. In: (2015).

[15] Suejb Memeti et al. “Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: Programming Productivity, Performance, and Energy Consumption”. In: Proceedings of the 2017 Workshop on Adaptive Resource Management and Scheduling for Cloud Computing. ARMS-CC ’17. Washington, DC, USA: ACM, 2017, pp. 1–6. isbn: 978-1-4503-5116-4. doi: 10.1145/3110355.3110356. url: http://doi.acm.org/10.1145/3110355.3110356.

[16] Matthias S Müller. “An OpenMP compiler benchmark”. In: Scientific Programming 11.2 (2003), pp. 125–131.

[17] Richard C Murphy et al. “Introducing the graph 500”. In: Cray Users Group (CUG) 19 (2010), pp. 45–74.

[18] OpenMP Architecture Review Board. OpenMP Application Programming Interface. Nov. 2015. url: https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf.

[19] Lawrence Page et al. The PageRank citation ranking: Bringing order to the web. Tech. rep. Stanford InfoLab, 1999.

[20] PGI Compiler User’s Guide. url: https://www.pgroup.com/resources/docs/18.4/x86/pgi-user-guide/index.htm (visited on 05/20/2019).

[21] Xuanhua Shi et al. “Graph Processing on GPUs: A Survey”. In: ACM Comput. Surv. 50.6 (Jan. 2018), 81:1–81:35. issn: 0360-0300. doi: 10.1145/3128571. url: http://doi.acm.org/10.1145/3128571.

[22] Julian Shun and Guy E Blelloch. “Ligra: a lightweight graph processing framework for shared memory”. In: ACM Sigplan Notices. Vol. 48. 8. ACM. 2013, pp. 135–146.

[23] Merijn Verstraaten, Ana Lucia Varbanescu, and Cees de Laat. “Using Graph Properties to Speed-up GPU-based Graph Traversal: A Model-driven Approach”. In: CoRR abs/1708.01159 (2017). arXiv: 1708.01159. url: http://arxiv.org/abs/1708.01159.

[24] Sandra Wienke et al. “OpenACC — First Experiences with Real-World Applications”. In: Euro-Par 2012 Parallel Processing. Ed. by Christos Kaklamanis, Theodore Papatheodorou, and Paul G. Spirakis. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 859–870. isbn: 978-3-642-32820-6.

[25] Yanfeng Zhang et al. “Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation”. In: IEEE Transactions on Parallel and Distributed Systems 25.8 (2014), pp. 2091–2100.

[26] Jianlong Zhong and Bingsheng He. “Medusa: Simplified graph processing on GPUs”. In: IEEE Transactions on Parallel and Distributed Systems 25.6 (2014), pp. 1543–1552.

CHAPTER 8 Appendices

8.1 Appendix A: CUDA max-reduce

__global__ void pagerank_max_reduce(pagerank_t* prev_rank, pagerank_t* next_rank,
                                    float* block_max, uint32_t maxcount) {
    int i = threadIdx.x + (blockDim.x * 2) * blockIdx.x;
    extern __shared__ float maxdata[];
    maxdata[threadIdx.x] = 0.0f;

    if(i < maxcount) {
        // Latency-hiding by using a single thread to already calculate a max on init
        float change_percent1 = (fabsf(prev_rank[i] - next_rank[i]) / prev_rank[i]) * 100.0f;
        float change_percent2 = (fabsf(prev_rank[i + blockDim.x] - next_rank[i + blockDim.x])
                                 / prev_rank[i + blockDim.x]) * 100.0f;
        maxdata[threadIdx.x] = fmaxf(change_percent1, change_percent2);
        __syncthreads();

        if(CUDA_BLOCKSIZE >= 512){
            if(threadIdx.x < 256){
                maxdata[threadIdx.x] = fmaxf(maxdata[threadIdx.x], maxdata[threadIdx.x + 256]);
            }
            __syncthreads();
        }

        if(CUDA_BLOCKSIZE >= 256){
            if(threadIdx.x < 128){
                maxdata[threadIdx.x] = fmaxf(maxdata[threadIdx.x], maxdata[threadIdx.x + 128]);
            }
            __syncthreads();
        }

        if(CUDA_BLOCKSIZE >= 128){
            if(threadIdx.x < 64){
                maxdata[threadIdx.x] = fmaxf(maxdata[threadIdx.x], maxdata[threadIdx.x + 64]);
            }
            __syncthreads();
        }

        if(threadIdx.x < 32){
            pagerank_warp_max_reduce(maxdata, threadIdx.x);
        }

        if(threadIdx.x == 0){
            block_max[blockIdx.x] = maxdata[0];
        }

        __syncthreads();
    }
}

__device__ void pagerank_warp_max_reduce(volatile float* maxdata, int threadid) {
    if(CUDA_BLOCKSIZE >= 64) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 32]);
    if(CUDA_BLOCKSIZE >= 32) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 16]);
    if(CUDA_BLOCKSIZE >= 16) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 8]);
    if(CUDA_BLOCKSIZE >= 8) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 4]);
    if(CUDA_BLOCKSIZE >= 4) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 2]);
    if(CUDA_BLOCKSIZE >= 2) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 1]);
}

pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    unsigned int block_count = (graph->node_count / CUDA_BLOCKSIZE) + 1;
    pagerank_do<<<block_count, CUDA_BLOCKSIZE>>>(graph_device, ranks_device, ranks_device_next);
    pagerank_max_reduce<<<...>>>
        (ranks_device, ranks_device_next, max_threshold_device, graph->node_count);
    ...
}

8.2 Appendix B: Graph sizes

• graph500-10: 80.3 KiB
• graph500-11: 198.0 KiB
• graph500-12: 446.5 KiB
• graph500-13: 969.5 KiB
• graph500-14: 2.2 MiB
• graph500-15: 4.8 MiB
• graph500-16: 10.2 MiB
• graph500-17: 21.9 MiB
• graph500-18: 47.9 MiB
• graph500-19: 100.4 MiB
• graph500-20: 208.1 MiB
• graph500-21: 452.5 MiB
• graph500-22: 945.5 MiB
• graph500-23: 1.9 GiB
• graph500-24: 4.1 GiB
• KONECT-actor-collaboration: 431.1 MiB
• KONECT-ca-cit-HepPh: 99.5 MiB
• KONECT-cfinder-google: 1.8 MiB
• KONECT-dbpedia-all: 195.8 MiB
• KONECT-dbpedia-starring: 3.4 MiB
• KONECT-discogs_affiliation: 179.1 MiB
• KONECT-opsahl-ucsocial: 1.3 MiB
• KONECT-orkut-links: 1.7 GiB
• KONECT-prosper-loans: 79.0 MiB
• KONECT-web-NotreDame: 20.6 MiB
• KONECT-wiki_talk_en: 584.7 MiB
• KONECT-wiki_talk_fr: 102.7 MiB
• KONECT-zhishi-hudong-internallink: 187.5 MiB
• SNAP-as-skitter: 142.2 MiB
• SNAP-email-EuAll: 4.8 MiB
• SNAP-roadNet-CA: 83.8 MiB
• SNAP-roadNet-PA: 44.0 MiB
• SNAP-roadNet-TX: 56.5 MiB
• SNAP-web-BerkStan: 105.1 MiB
• SNAP-web-Google: 71.9 MiB
• SNAP-wiki-Talk: 58.7 MiB

8.3 Appendix C: Raw unoptimised GPU performance numbers

All data provided is measured in seconds

8.3.1 graph500 PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.01975 0.00869 0.01106 0.01222 0.00322 0.00901 10  0.00156  0.00044  0.00117  0.00121  0.00019  0.00102 0.03543 0.01431 0.02112 0.0212 0.0041 0.0171 11  0.00014  1e-05  0.00014  0.00014  3e-05  0.00013 0.06548 0.021 0.04448 0.04048 0.00532 0.03516 12  0.00303  5e-05  0.00298  0.00243  0.0001  0.00233 0.11839 0.03314 0.08525 0.07669 0.00756 0.06913 13  0.00243  4e-05  0.00239  0.00248  6e-05  0.00243 0.23959 0.05603 0.18356 0.15301 0.01342 0.1396 14  0.00922  5e-05  0.00924  0.00097  2e-05  0.00097 0.46944 0.08727 0.38217 0.3093 0.02269 0.28661 15  0.00238  0.00047  0.00236  0.00172  0.00011  0.0017 0.94962 0.16116 0.78846 0.6452 0.04847 0.59673 16  0.00506  0.00013  0.00505  0.00378  0.00022  0.00372 1.87512 0.25379 1.62134 1.33589 0.092 1.24389 17  0.00922  0.00023  0.00918  0.09505  0.00015  0.09491 3.77566 0.44165 3.33401 2.61933 0.18416 2.43517 18  0.11814  0.00769  0.11748  0.09879  0.00011  0.09871 9.84588 0.95554 8.89034 6.99086 0.45318 6.53768 19  0.32699  0.00733  0.32632  0.32893  0.00307  0.32844 15.40576 1.61452 13.79124 10.72767 0.86727 9.86039 20  0.40241  0.0097  0.40158  0.36234  0.00312  0.3616 31.77228 3.32728 28.44499 23.40749 1.92724 21.48025 21  0.57696  0.00736  0.57578  0.30255  0.00448  0.30285 67.2441 6.21266 61.03145 45.04948 4.39526 40.65422 22  2.34478  0.00628  2.3417  0.81555  0.00284  0.81488 133.04553 12.25653 120.78899 96.75463 9.84969 86.90494 23  1.85127  0.01426  1.84853  2.61313  1.79735  1.88331 277.59493 30.75864 246.83629 195.95101 23.70138 172.24963 24  5.16817  0.02153  5.15784  3.18432  0.01718  3.18724

Table 8.1:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.0009 0.00012 0.00078 0.00037 0.00034 3e-05 10  9e-05  1e-05  8e-05  2e-05  2e-05  0.0 0.00088 7e-05 0.00081 0.0003 0.00026 4e-05 11  9e-05  1e-05  8e-05  3e-05  2e-05  0.0 0.00101 0.00022 0.00079 0.00094 0.00091 4e-05 12  5e-05  1e-05  4e-05  4e-05  4e-05  0.0 0.00103 0.00011 0.00092 0.00074 0.0007 4e-05 13  0.00011  1e-05  0.0001  2e-05  2e-05  0.0 0.0013 0.0003 0.00101 0.004 0.00393 7e-05 14  9e-05  1e-05  8e-05  2e-05  2e-05  1e-05 0.00173 0.0004 0.00133 0.00998 0.00981 0.00017 15  0.00011  1e-05  0.0001  0.00015  0.00015  2e-05 0.0043 0.00054 0.00376 0.01185 0.01164 0.00021 16  0.00067  1e-05  0.00066  0.00019  0.00019  1e-05 0.00388 0.00058 0.0033 0.01535 0.01496 0.00039 17  0.00014  1e-05  0.00014  0.00089  0.00086  3e-05 0.01574 0.00201 0.01373 0.11672 0.11583 0.0009 18  0.00525  1e-05  0.00524  0.00175  0.00174  5e-05 0.01384 0.00068 0.01316 0.05136 0.04847 0.00289 19  0.00053  0.0  0.00053  0.00119  0.00114  9e-05 0.03488 0.00929 0.0256 0.23737 0.23451 0.00286 20  0.00087  2e-05  0.00087  0.003  0.00284  0.00017 0.06385 0.01548 0.04838 0.36419 0.3583 0.0059 21  0.00172  2e-05  0.00172  0.00506  0.0048  0.00028 0.13154 0.04495 0.08659 0.98521 0.97371 0.0115 22  0.02225  5e-05  0.02224  0.01131  0.01105  0.00063 0.23846 0.07118 0.16728 2.01916 1.96901 0.05015 23  0.00688  6e-05  0.00688  0.89407  0.45795  0.77158 0.56823 0.13929 0.42894 5.48065 5.41903 0.06162 24  0.13425  0.0001  0.13423  0.77878  0.77209  0.0085

Table 8.2:

8.3.2 KONECT PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 9.44825 1.53239 7.91585 6.86265 0.47649 6.38616 actor-collaboration  0.04706  0.00334  0.04692  0.17833  0.00325  0.17841 0.72727 0.2058 0.52147 0.40835 0.03679 0.37155 ca-cit-HepPh  0.01215  0.00014  0.01215  0.01454  0.0001  0.01449 0.70179 0.47828 0.22352 0.18821 0.01626 0.17194 cfinder-google  0.00286  7e-05  0.00286  0.01426  0.00013  0.01415 107.68727 29.36516 78.32211 67.81156 3.82389 63.98768 dbpedia-all  1.73395  0.02316  1.73171  2.32989  0.00625  2.32805 1.34662 0.03011 1.31651 1.15629 0.08073 1.07555 dbpedia-starring  0.03244  0.0001  0.03241  0.0075  3e-05  0.0075 14.20867 8.22151 5.98715 5.31233 0.6064 4.70594 discogs_affiliation  0.21254  0.01315  0.21145  0.07299  0.00021  0.07292 0.04443 0.02007 0.02436 0.02364 0.00401 0.01962 opsahl-ucsocial  0.00181  8e-05  0.00173  0.00203  0.00018  0.00186 96.80053 9.35125 87.44928 69.08414 6.24549 62.83865 orkut-links  2.83336  0.00778  2.83123  0.80682  0.00143  0.80693 2.30121 0.52201 1.7792 1.40265 0.10503 1.29762 prosper-loans  0.01748  0.00057  0.01746  0.01476  0.00037  0.01493 6.26574 0.54871 5.71703 4.72613 0.32647 4.39966 web-NotreDame  0.0486  0.00942  0.04664  0.0298  0.00016  0.0298 76.73772 9.16847 67.56924 46.43966 3.08494 43.35471 wiki_talk_en  1.48405  0.00934  1.48189  0.33946  0.00168  0.33953 39.85619 4.97707 34.87912 29.46719 1.97015 27.49704 zhishi-hudong-internallink  0.65957  0.00743  0.65977  1.81989  0.00593  1.81568

Table 8.3:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.06416 0.01582 0.04834 0.36261 0.36156 0.00104 actor-collaboration  0.00191  3e-05  0.00191  0.00496  0.00491  7e-05 0.01708 0.00158 0.0155 0.04534 0.0452 0.00015 ca-cit-HepPh  0.00477  2e-05  0.00478  0.00045  0.00045  2e-05 0.00136 0.00041 0.00095 0.00655 0.00646 9e-05 cfinder-google  2e-05  1e-05  2e-05  0.00016  0.00016  1e-05 0.17153 0.1344 0.03714 4.92013 4.88015 0.03998 dbpedia-all  0.01442  0.00036  0.01465  0.90124  0.54791  0.70943 0.00216 0.00044 0.00172 0.01106 0.01072 0.00034 dbpedia-starring  0.0003  2e-05  0.00029  0.00026  0.00025  2e-05 0.04718 0.00317 0.04401 0.03951 0.03467 0.00483 discogs_affiliation  0.01701  1e-05  0.01701  0.00141  0.00133  0.00012 0.00094 0.00016 0.00078 0.00075 0.00072 3e-05 opsahl-ucsocial  4e-05  1e-05  4e-05  3e-05  3e-05  0.0 0.30547 0.05891 0.24655 1.63119 1.61466 0.01653 orkut-links  0.01729  5e-05  0.01729  0.24079  0.03706  0.23858 0.00662 0.00094 0.00568 0.01813 0.01786 0.00027 prosper-loans  0.00019  0.0  0.00019  0.00025  0.00025  1e-05 0.01135 0.00453 0.00683 0.10768 0.10672 0.00096 web-NotreDame  0.00144  4e-05  0.00147  0.00041  0.0004  4e-05 0.05193 0.01204 0.03989 0.25596 0.24771 0.00825 wiki_talk_en  0.00175  5e-05  0.00174  0.00489  0.00449  0.00045 0.04426 0.01771 0.02655 0.41963 0.41402 0.00561 zhishi-hudong-internallink  0.00099  3e-05  0.00099  0.00407  0.00369  0.0004

Table 8.4:

8.3.3 SNAP PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 38.10411 1.04915 37.05496 31.31826 1.77088 29.54738 as-skitter  0.68105  0.00332  0.68068  0.64634  0.00966  0.64492 2.06531 0.42065 1.64466 1.48797 0.16146 1.32651 email-EuAll  0.03429  0.00098  0.03419  0.01032  0.00019  0.01031 38.76107 0.28583 38.47524 32.62578 1.64402 30.98176 roadNet-CA  0.17714  0.00076  0.17718  0.21969  0.00074  0.21984 20.48151 0.15998 20.32152 17.00389 0.90449 16.0994 roadNet-PA  0.07809  0.00039  0.07806  0.13264  0.00261  0.13119 27.00639 0.19625 26.81013 23.37717 1.15581 22.22136 roadNet-TX  0.2293  0.00087  0.22928  0.21748  0.00775  0.21299 17.04884 5.49072 11.55812 9.47098 0.64348 8.82749 web-BerkStan  0.29605  0.01293  0.29795  0.09353  0.00025  0.09349 16.40928 0.88845 15.52083 13.13606 0.88561 12.25045 web-Google  0.19319  0.00513  0.19237  0.2466  0.00346  0.24496 51.40608 5.32436 46.08172 37.46263 2.46769 34.99494 wiki-Talk  0.82453  0.00663  0.82357  0.51365  0.00191  0.51258

Table 8.5:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.03468 0.01424 0.02045 0.36508 0.36038 0.00469 as-skitter  0.00065  2e-05  0.00065  0.0021  0.0019  0.00022 0.0035 0.00048 0.00302 0.00725 0.00647 0.00077 email-EuAll  0.00053  1e-05  0.00053  9e-05  7e-05  2e-05 0.18949 0.17008 0.0194 5.15522 5.11893 0.03629 roadNet-CA  0.01084  0.001  0.0099  1.46929  1.27633  0.7582 0.1155 0.0987 0.0168 2.44789 2.44474 0.00315 roadNet-PA  0.00308  0.00017  0.00296  0.3083  0.30792  0.00052 0.17361 0.16033 0.01328 4.40895 4.40448 0.00447 roadNet-TX  0.00709  0.0007  0.00643  0.96122  0.9601  0.00124 0.20177 0.18835 0.01342 8.42668 8.40076 0.02592 web-BerkStan  0.00084  0.0001  0.00082  0.7269  0.34413  0.64353 0.02223 0.01188 0.01035 0.23274 0.23026 0.00248 web-Google  0.00066  2e-05  0.00066  0.00096  0.00087  0.00012 0.01424 0.00086 0.01338 0.02883 0.0222 0.00663 wiki-Talk  0.0004  1e-05  0.0004  0.00045  0.00033  0.00015

Table 8.6:

8.4 Appendix D: Raw optimised GPU performance numbers

All data provided is measured in seconds

8.4.1 graph500 PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.02014 0.00903 0.01111 0.01172 0.00262 0.0091 10  0.00124  0.00024  0.00119  0.0012  0.00014  0.00108 0.03546 0.01432 0.02114 0.02086 0.00362 0.01724 11  0.0001  2e-05  0.0001  0.00013  4e-05  0.00013 0.06645 0.02155 0.04489 0.0387 0.00406 0.03464 12  0.0052  0.00017  0.00506  0.00029  3e-05  0.00029 0.12197 0.03354 0.08843 0.07474 0.00568 0.06906 13  0.0067  0.00011  0.00669  0.00036  2e-05  0.00035 0.23617 0.05665 0.17953 0.15052 0.01081 0.13971 14  0.00184  2e-05  0.00183  0.00096  2e-05  0.00096 0.47199 0.08951 0.38248 0.30391 0.01626 0.28765 15  0.00223  0.0004  0.00215  0.01564  3e-05  0.01563 0.95663 0.1652 0.79143 0.63037 0.03571 0.59465 16  0.01262  0.00012  0.0126  0.00442  4e-05  0.00442 1.88536 0.26003 1.62533 1.23564 0.05723 1.17842 17  0.01458  0.00023  0.01456  0.05141  0.00011  0.0514 3.6996 0.44769 3.25191 2.43 0.11423 2.31577 18  0.02319  0.00615  0.02236  0.03157  5e-05  0.03156 9.73138 0.96354 8.76784 6.42136 0.36843 6.05293 19  0.20757  0.00685  0.20717  0.04481  0.00015  0.04481 15.50459 1.63018 13.87441 10.44802 0.75314 9.69488 20  0.42656  0.00837  0.42561  0.09739  0.0009  0.09731 31.05337 3.35556 27.69781 22.81562 1.92947 20.88615 21  0.89648  0.00704  0.895  1.15684  0.01914  1.15709 64.07219 6.243 57.82918 47.66694 4.48002 43.18692 22  1.53445  0.00461  1.53329  0.41689  0.02819  0.40601 129.10757 12.44402 116.66355 95.59972 9.82616 85.77356 23  2.55504  0.01114  2.55364  2.28015  0.04015  2.27318 272.22317 31.38951 240.83366 197.06086 22.0804 174.98045 24  3.78578  0.02143  3.77709  5.72892  0.01052  5.72974

Table 8.7:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.00088 0.00012 0.00076 0.00015 0.0001 5e-05 10  4e-05  0.0  4e-05  1e-05  1e-05  1e-05 0.00084 8e-05 0.00076 0.00013 6e-05 7e-05 11  3e-05  0.0  2e-05  0.0  0.0  0.0 0.00114 0.00023 0.0009 0.0003 0.00021 9e-05 12  8e-05  1e-05  7e-05  1e-05  1e-05  0.0 0.00098 0.00011 0.00088 0.00026 9e-05 0.00017 13  3e-05  0.0  3e-05  0.0  0.0  0.0 0.00131 0.00029 0.00101 0.00061 0.0003 0.00032 14  4e-05  1e-05  4e-05  1e-05  0.0  1e-05 0.00173 0.0004 0.00132 0.00102 0.00039 0.00063 15  7e-05  1e-05  6e-05  3e-05  1e-05  3e-05 0.0025 0.00053 0.00197 0.0018 0.00054 0.00126 16  4e-05  1e-05  4e-05  3e-05  1e-05  3e-05 0.00704 0.00058 0.00645 0.0032 0.00056 0.00264 17  0.00159  1e-05  0.00159  9e-05  1e-05  9e-05 0.00908 0.002 0.00709 0.00726 0.002 0.00526 18  0.0002  1e-05  0.0002  0.0002  1e-05  0.0002 0.01415 0.00068 0.01347 0.01006 0.00067 0.00939 19  0.00039  1e-05  0.00039  0.00041  1e-05  0.00041 0.03566 0.0093 0.02635 0.0269 0.00919 0.01772 20  0.0009  2e-05  0.0009  0.00095  3e-05  0.00094 0.11702 0.01551 0.10151 0.0504 0.0146 0.0358 21  0.03035  6e-05  0.03035  0.00179  6e-05  0.00179 0.13126 0.04499 0.08627 0.1121 0.04154 0.07056 22  0.00341  6e-05  0.00341  0.00363  4e-05  0.00363 0.24341 0.07111 0.17229 0.29266 0.0616 0.23106 23  0.0069  7e-05  0.0069  0.05047  4e-05  0.05046 0.87085 0.13943 0.73141 0.47179 0.12476 0.34703 24  0.23772  9e-05  0.2377  0.10985  5e-05  0.10986

Table 8.8:

8.4.2 KONECT PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 9.49062 1.55518 7.93544 7.14373 0.46247 6.68126 actor-collaboration  0.13833  0.00274  0.13792  0.34284  0.0013  0.34266 0.72681 0.20665 0.52016 0.43399 0.03332 0.40067 ca-cit-HepPh  0.00339  0.00011  0.00339  0.01987  9e-05  0.01984 0.70462 0.48045 0.22416 0.21036 0.03044 0.17993 cfinder-google  0.00198  0.00017  0.00194  0.00122  0.00014  0.00121 108.77594 29.50627 79.26967 63.8449 2.20562 61.63928 dbpedia-all  1.20322  0.02251  1.20205  3.2546  0.00743  3.24973 1.33927 0.03013 1.30914 1.05348 0.03775 1.01573 dbpedia-starring  0.00823  9e-05  0.00822  0.01323  0.00015  0.01311 14.26455 8.28852 5.97603 4.98919 0.69503 4.29416 discogs_affiliation  0.20147  0.01307  0.20003  0.03  0.00143  0.03015 0.0448 0.02012 0.02468 0.02287 0.00369 0.01918 opsahl-ucsocial  0.00178  6e-05  0.00171  0.00019  0.00017  0.0001 102.81717 9.35446 93.46271 71.22021 6.55862 64.66159 orkut-links  4.26757  0.00965  4.26542  1.41914  0.00811  1.41939 2.31193 0.51743 1.79451 1.35885 0.07898 1.27987 prosper-loans  0.0434  0.00051  0.04324  0.01407  0.00021  0.01411 6.17866 0.53474 5.64391 4.53826 0.15984 4.37842 web-NotreDame  0.04369  0.01085  0.0422  0.02827  6e-05  0.02826 71.45119 9.01639 62.4348 47.05271 1.50657 45.54614 wiki_talk_en  0.76716  0.00858  0.7672  1.12108  0.00285  1.119 38.96486 4.89596 34.0689 28.79853 1.40616 27.39237 zhishi-hudong-internallink  1.06056  0.00848  1.05969  0.71204  0.0023  0.71166

Table 8.9:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.0635 0.01533 0.04816 0.04662 0.0146 0.03202 actor-collaboration  0.00176  3e-05  0.00176  0.0017  3e-05  0.0017 0.00904 0.00158 0.00747 0.00707 0.00155 0.00552 ca-cit-HepPh  0.00022  1e-05  0.00022  0.00023  1e-05  0.00023 0.00175 0.00044 0.00131 0.00068 0.00041 0.00027 cfinder-google  9e-05  1e-05  8e-05  1e-05  0.0  1e-05 0.16283 0.13355 0.02928 0.17202 0.13474 0.03727 dbpedia-all  0.0014  7e-05  0.00138  0.00798  0.00025  0.00786 0.00169 0.00042 0.00127 0.00096 0.00041 0.00055 dbpedia-starring  8e-05  1e-05  7e-05  3e-05  1e-05  3e-05 0.04245 0.00317 0.03928 0.02157 0.00353 0.01804 discogs_affiliation  0.01541  1e-05  0.01541  0.00053  2e-05  0.00052 0.00094 0.00016 0.00078 0.00026 0.00016 0.0001 opsahl-ucsocial  3e-05  0.0  3e-05  0.0  0.0  0.0 0.36123 0.0589 0.30233 0.45817 0.06736 0.39081 orkut-links  0.09303  7e-05  0.09302  0.00903  0.00017  0.00903 0.00657 0.00095 0.00563 0.00538 0.00094 0.00444 prosper-loans  0.00016  2e-05  0.00015  0.00015  0.0  0.00015 0.01186 0.00451 0.00735 0.00697 0.00432 0.00265 web-NotreDame  0.00066  3e-05  0.00068  9e-05  2e-05  8e-05 0.05199 0.01206 0.03993 0.04323 0.01138 0.03185 wiki_talk_en  0.0017  5e-05  0.0017  0.00166  4e-05  0.00166 0.0442 0.01772 0.02648 0.03859 0.01772 0.02087 zhishi-hudong-internallink  0.00088  2e-05  0.00088  0.00104  3e-05  0.00104

Table 8.10:

8.4.3 SNAP PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 37.16668 1.04891 36.11777 29.55712 0.84687 28.71024 as-skitter  0.40745  0.003  0.40746  1.40332  0.00747  1.40097 2.11502 0.42139 1.69363 1.4835 0.09513 1.38838 email-EuAll  0.01753  0.00069  0.01753  0.01088  0.0001  0.01087 39.36705 0.29209 39.07496 32.1602 0.7392 31.421 roadNet-CA  0.93293  0.00135  0.93212  0.26426  0.00045  0.26452 20.2629 0.16381 20.09909 16.62677 0.40585 16.22092 roadNet-PA  0.56709  0.00099  0.5668  0.12058  0.00103  0.12027 26.67326 0.20055 26.47271 21.56628 0.51858 21.0477 roadNet-TX  0.57463  0.001  0.5742  0.20651  0.00578  0.20639 17.51286 5.51381 11.99905 9.33731 0.52192 8.81538 web-BerkStan  0.20733  0.0119  0.21003  0.05796  0.00038  0.05779 16.39577 0.88967 15.50609 12.78405 0.56605 12.218 web-Google  0.04806  0.00519  0.048  0.24579  0.00122  0.24533 53.32724 5.33885 47.98839 36.0772 1.2241 34.8531 wiki-Talk  0.88709  0.00619  0.88682  0.23879  0.00052  0.23896

Table 8.11:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.03455 0.01422 0.02033 0.05925 0.0147 0.04455 as-skitter  0.00068  3e-05  0.00068  0.00105  2e-05  0.00105 0.0023 0.00047 0.00183 0.00209 0.00046 0.00163 email-EuAll  5e-05  1e-05  5e-05  0.00055  1e-05  0.00055 0.18217 0.16897 0.0132 0.18809 0.17115 0.01694 roadNet-CA  0.00065  0.00017  0.0006  0.00094  0.00017  0.00091 0.10538 0.09769 0.00769 0.10775 0.09882 0.00893 roadNet-PA  0.00031  0.0001  0.00028  0.00438  0.00066  0.00381 0.18128 0.16062 0.02066 0.16908 0.1613 0.00778 roadNet-TX  0.00364  0.00015  0.00354  0.00061  8e-05  0.0006 0.20167 0.18839 0.01328 0.19455 0.18464 0.00991 web-BerkStan  0.00086  0.00015  0.00082  0.00082  8e-05  0.00082 0.0315 0.01186 0.01964 0.03325 0.0119 0.02135 web-Google  0.0056  7e-05  0.00554  0.00054  2e-05  0.00054 0.02914 0.00085 0.02828 0.01286 0.00084 0.01201 wiki-Talk  0.00796  1e-05  0.00796  0.00034  0.0  0.00034

Table 8.12:

8.5 Appendix E: Raw unoptimised CPU performance numbers

All data provided is measured in seconds

8.5.1 graph500

Graph OpenMP OpenACC Total Main loop Memory Total Main loop Memory 0.00299 0.00299 0.0 0.00823 0.00823 0.0 10  0.00016  0.00016  0.0  0.0144  0.0144  0.0 0.00338 0.00338 0.0 0.00437 0.00436 1e-05 11  3e-05  3e-05  0.0  5e-05  5e-05  0.0 0.00338 0.00338 0.0 0.00606 0.00605 1e-05 12  3e-05  3e-05  0.0  0.00261  0.00261  0.0 0.00272 0.00272 0.0 0.004 0.00399 1e-05 13  2e-05  2e-05  0.0  2e-05  2e-05  0.0 0.00515 0.00515 0.0 0.00946 0.00944 2e-05 14  0.00025  0.00025  0.0  0.00435  0.00435  0.0 0.01085 0.01085 0.0 0.01249 0.01245 4e-05 15  0.00322  0.00322  0.0  0.00247  0.00246  2e-05 0.02427 0.02427 0.0 0.02791 0.02784 7e-05 16  0.00178  0.00178  0.0  0.00664  0.00664  2e-05 0.0726 0.0726 0.0 0.07292 0.07278 0.00014 17  0.00665  0.00665  0.0  0.00181  0.0018  5e-05 0.23689 0.23689 0.0 0.22913 0.22741 0.00172 18  0.02129  0.02129  0.0  0.00475  0.00472  0.00012 0.68347 0.68347 0.0 0.68142 0.67797 0.00345 19  0.02484  0.02484  0.0  0.02486  0.02482  0.00011 0.99537 0.99537 0.0 1.10974 1.10492 0.00482 20  0.02506  0.02506  0.0  0.03699  0.03693  0.00035 2.63756 2.63756 0.0 2.6145 2.60734 0.00716 21  0.08084  0.08084  0.0  0.12013  0.12019  0.0005 7.77718 7.77718 0.0 9.27451 9.25289 0.02162 22  0.52394  0.52394  0.0  0.48875  0.48475  0.00434 23.61333 23.61333 0.0 26.57403 26.50304 0.07099 23  1.13877  1.13877  0.0  1.09766  1.09701  0.003 60.85996 60.85996 0.0 66.2279 66.07145 0.15645 24  2.82849  2.82849  0.0  2.76819  2.76748  0.00736

Table 8.13: Raw unoptimised CPU performance numbers for the graph500 graphs (OpenMP vs. OpenACC).

8.5.2 KONECT

Graph                      | OpenMP Total       | OpenMP Main loop   | OpenMP Memory | OpenACC Total      | OpenACC Main loop  | OpenACC Memory
actor-collaboration        | 1.86556 ± 0.02403  | 1.86556 ± 0.02403  | 0.0 ± 0.0     | 1.93891 ± 0.0451   | 1.93649 ± 0.04507  | 0.00242 ± 7e-05
ca-cit-HepPh               | 0.08886 ± 0.00133  | 0.08886 ± 0.00133  | 0.0 ± 0.0     | 0.12501 ± 0.00216  | 0.12498 ± 0.00216  | 3e-05 ± 0.0
cfinder-google             | 0.02351 ± 0.01253  | 0.02351 ± 0.01253  | 0.0 ± 0.0     | 0.02323 ± 0.00293  | 0.02321 ± 0.00293  | 2e-05 ± 0.0
dbpedia-all                | 11.88381 ± 1.19304 | 11.88381 ± 1.19304 | 0.0 ± 0.0     | 15.33625 ± 0.29169 | 15.30947 ± 0.29192 | 0.02678 ± 0.00202
dbpedia-starring           | 0.03194 ± 0.00088  | 0.03194 ± 0.00088  | 0.0 ± 0.0     | 0.03911 ± 0.00215  | 0.03902 ± 0.00214  | 9e-05 ± 4e-05
discogs_affiliation        | 6.37403 ± 0.23694  | 6.37403 ± 0.23694  | 0.0 ± 0.0     | 6.29673 ± 0.38917  | 6.2899 ± 0.38914   | 0.00683 ± 0.00013
opsahl-ucsocial            | 0.0076 ± 0.00564   | 0.0076 ± 0.00564   | 0.0 ± 0.0     | 0.00859 ± 0.01429  | 0.00859 ± 0.01429  | 1e-05 ± 0.0
orkut-links                | 14.53032 ± 0.59213 | 14.53032 ± 0.59213 | 0.0 ± 0.0     | 13.7856 ± 0.75864  | 13.77481 ± 0.75756 | 0.01079 ± 0.00239
prosper-loans              | 0.31671 ± 0.00311  | 0.31671 ± 0.00311  | 0.0 ± 0.0     | 0.3017 ± 0.005     | 0.3016 ± 0.00497   | 0.0001 ± 4e-05
web-NotreDame              | 0.13464 ± 0.00723  | 0.13464 ± 0.00723  | 0.0 ± 0.0     | 0.14365 ± 0.01752  | 0.14153 ± 0.01751  | 0.00212 ± 5e-05
wiki_talk_en               | 13.00294 ± 0.27624 | 13.00294 ± 0.27624 | 0.0 ± 0.0     | 14.17812 ± 0.11319 | 14.15757 ± 0.11311 | 0.02055 ± 0.0014
zhishi-hudong-internallink | 10.7045 ± 0.218    | 10.7045 ± 0.218    | 0.0 ± 0.0     | 11.90495 ± 1.15872 | 11.8975 ± 1.1576   | 0.00746 ± 0.0012

Table 8.14: Raw unoptimised CPU performance numbers for the KONECT graphs (OpenMP vs. OpenACC).

8.5.3 SNAP

Graph        | OpenMP Total      | OpenMP Main loop  | OpenMP Memory | OpenACC Total     | OpenACC Main loop | OpenACC Memory
as-skitter   | 2.10977 ± 0.04519 | 2.10977 ± 0.04519 | 0.0 ± 0.0     | 2.03105 ± 0.09079 | 2.02425 ± 0.09078 | 0.0068 ± 7e-05
email-EuAll  | 0.23063 ± 0.00338 | 0.23063 ± 0.00338 | 0.0 ± 0.0     | 0.19241 ± 0.03159 | 0.19075 ± 0.03161 | 0.00166 ± 5e-05
roadNet-CA   | 0.6763 ± 0.08671  | 0.6763 ± 0.08671  | 0.0 ± 0.0     | 0.86682 ± 0.06719 | 0.86065 ± 0.06703 | 0.00617 ± 0.00017
roadNet-PA   | 0.37021 ± 0.02637 | 0.37021 ± 0.02637 | 0.0 ± 0.0     | 0.43199 ± 0.01914 | 0.42942 ± 0.01914 | 0.00257 ± 8e-05
roadNet-TX   | 0.52158 ± 0.05852 | 0.52158 ± 0.05852 | 0.0 ± 0.0     | 0.51532 ± 0.02268 | 0.51072 ± 0.02264 | 0.0046 ± 0.00015
web-BerkStan | 0.43861 ± 0.04803 | 0.43861 ± 0.04803 | 0.0 ± 0.0     | 0.43916 ± 0.04535 | 0.43479 ± 0.04529 | 0.00437 ± 0.0001
web-Google   | 1.26527 ± 0.12442 | 1.26527 ± 0.12442 | 0.0 ± 0.0     | 1.2703 ± 0.12364  | 1.26535 ± 0.1236  | 0.00495 ± 7e-05
wiki-Talk    | 6.12999 ± 0.88808 | 6.12999 ± 0.88808 | 0.0 ± 0.0     | 5.58989 ± 0.03323 | 5.58222 ± 0.03319 | 0.00767 ± 0.00012

Table 8.15: Raw unoptimised CPU performance numbers for the SNAP graphs (OpenMP vs. OpenACC).

8.6 Appendix F: Raw optimised CPU performance numbers

All data provided is measured in seconds

8.6.1 graph500

Graph | OpenMP Total       | OpenMP Main loop   | OpenMP Memory | OpenACC Total      | OpenACC Main loop  | OpenACC Memory
10    | 0.00299 ± 0.00016  | 0.00299 ± 0.00016  | 0.0 ± 0.0     | 0.00472 ± 0.00333  | 0.00472 ± 0.00332  | 0.0 ± 0.0
11    | 0.00338 ± 3e-05    | 0.00338 ± 3e-05    | 0.0 ± 0.0     | 0.00452 ± 0.00066  | 0.00451 ± 0.00066  | 0.0 ± 0.0
12    | 0.00338 ± 3e-05    | 0.00338 ± 3e-05    | 0.0 ± 0.0     | 0.00575 ± 0.00204  | 0.00574 ± 0.00204  | 1e-05 ± 0.0
13    | 0.00272 ± 2e-05    | 0.00272 ± 2e-05    | 0.0 ± 0.0     | 0.00605 ± 0.00321  | 0.00604 ± 0.00321  | 1e-05 ± 0.0
14    | 0.00515 ± 0.00025  | 0.00515 ± 0.00025  | 0.0 ± 0.0     | 0.00772 ± 0.00306  | 0.0077 ± 0.00305   | 2e-05 ± 0.0
15    | 0.01085 ± 0.00322  | 0.01085 ± 0.00322  | 0.0 ± 0.0     | 0.01409 ± 0.0059   | 0.01405 ± 0.00589  | 4e-05 ± 3e-05
16    | 0.02427 ± 0.00178  | 0.02427 ± 0.00178  | 0.0 ± 0.0     | 0.02643 ± 0.00362  | 0.02636 ± 0.00361  | 7e-05 ± 3e-05
17    | 0.0726 ± 0.00665   | 0.0726 ± 0.00665   | 0.0 ± 0.0     | 0.07441 ± 0.00446  | 0.07427 ± 0.00446  | 0.00014 ± 5e-05
18    | 0.23689 ± 0.02129  | 0.23689 ± 0.02129  | 0.0 ± 0.0     | 0.24066 ± 0.02061  | 0.23892 ± 0.0206   | 0.00174 ± 5e-05
19    | 0.68347 ± 0.02484  | 0.68347 ± 0.02484  | 0.0 ± 0.0     | 0.80204 ± 0.02211  | 0.79855 ± 0.0221   | 0.00349 ± 8e-05
20    | 0.99537 ± 0.02506  | 0.99537 ± 0.02506  | 0.0 ± 0.0     | 1.05916 ± 0.08522  | 1.0547 ± 0.0852    | 0.00446 ± 0.0001
21    | 2.63756 ± 0.08084  | 2.63756 ± 0.08084  | 0.0 ± 0.0     | 2.68049 ± 0.07337  | 2.67359 ± 0.07332  | 0.00691 ± 0.00022
22    | 7.77718 ± 0.52394  | 7.77718 ± 0.52394  | 0.0 ± 0.0     | 7.64209 ± 0.24266  | 7.62841 ± 0.24215  | 0.01368 ± 0.00195
23    | 23.61333 ± 1.13877 | 23.61333 ± 1.13877 | 0.0 ± 0.0     | 23.51885 ± 1.22018 | 23.48925 ± 1.21915 | 0.0296 ± 0.00251
24    | 60.85996 ± 2.82849 | 60.85996 ± 2.82849 | 0.0 ± 0.0     | 59.50593 ± 3.83317 | 59.44652 ± 3.83226 | 0.05942 ± 0.00599

Table 8.16: Raw optimised CPU performance numbers for the graph500 graphs (OpenMP vs. OpenACC).

8.6.2 KONECT

Graph                      | OpenMP Total       | OpenMP Main loop   | OpenMP Memory | OpenACC Total      | OpenACC Main loop  | OpenACC Memory
actor-collaboration        | 1.86556 ± 0.02403  | 1.86556 ± 0.02403  | 0.0 ± 0.0     | 2.00571 ± 0.05005  | 2.00327 ± 0.05004  | 0.00244 ± 0.0001
ca-cit-HepPh               | 0.08886 ± 0.00133  | 0.08886 ± 0.00133  | 0.0 ± 0.0     | 0.12786 ± 0.00231  | 0.12783 ± 0.0023   | 3e-05 ± 0.0
cfinder-google             | 0.02351 ± 0.01253  | 0.02351 ± 0.01253  | 0.0 ± 0.0     | 0.02632 ± 0.01661  | 0.0263 ± 0.01661   | 2e-05 ± 0.0
dbpedia-all                | 11.88381 ± 1.19304 | 11.88381 ± 1.19304 | 0.0 ± 0.0     | 12.63899 ± 1.16733 | 12.62601 ± 1.16668 | 0.01298 ± 0.00169
dbpedia-starring           | 0.03194 ± 0.00088  | 0.03194 ± 0.00088  | 0.0 ± 0.0     | 0.03626 ± 0.01502  | 0.03617 ± 0.01502  | 9e-05 ± 4e-05
discogs_affiliation        | 6.37403 ± 0.23694  | 6.37403 ± 0.23694  | 0.0 ± 0.0     | 6.43642 ± 0.39587  | 6.42954 ± 0.39582  | 0.00687 ± 0.00016
opsahl-ucsocial            | 0.0076 ± 0.00564   | 0.0076 ± 0.00564   | 0.0 ± 0.0     | 0.00654 ± 0.00174  | 0.00654 ± 0.00174  | 1e-05 ± 0.0
orkut-links                | 14.53032 ± 0.59213 | 14.53032 ± 0.59213 | 0.0 ± 0.0     | 13.85167 ± 1.14481 | 13.84115 ± 1.14318 | 0.01052 ± 0.00218
prosper-loans              | 0.31671 ± 0.00311  | 0.31671 ± 0.00311  | 0.0 ± 0.0     | 0.32586 ± 0.01944  | 0.32576 ± 0.01944  | 0.0001 ± 3e-05
web-NotreDame              | 0.13464 ± 0.00723  | 0.13464 ± 0.00723  | 0.0 ± 0.0     | 0.15065 ± 0.01148  | 0.14854 ± 0.01147  | 0.00211 ± 5e-05
wiki_talk_en               | 13.00294 ± 0.27624 | 13.00294 ± 0.27624 | 0.0 ± 0.0     | 12.2825 ± 0.83657  | 12.27367 ± 0.836   | 0.00884 ± 0.00107
zhishi-hudong-internallink | 10.7045 ± 0.218    | 10.7045 ± 0.218    | 0.0 ± 0.0     | 11.95647 ± 0.8667  | 11.94893 ± 0.8663  | 0.00753 ± 0.00126

Table 8.17: Raw optimised CPU performance numbers for the KONECT graphs (OpenMP vs. OpenACC).

8.6.3 SNAP

Graph        | OpenMP Total      | OpenMP Main loop  | OpenMP Memory | OpenACC Total     | OpenACC Main loop | OpenACC Memory
as-skitter   | 2.10977 ± 0.04519 | 2.10977 ± 0.04519 | 0.0 ± 0.0     | 2.19377 ± 0.0658  | 2.18712 ± 0.06568 | 0.00664 ± 0.00062
email-EuAll  | 0.23063 ± 0.00338 | 0.23063 ± 0.00338 | 0.0 ± 0.0     | 0.23654 ± 0.00326 | 0.23493 ± 0.00326 | 0.00161 ± 3e-05
roadNet-CA   | 0.6763 ± 0.08671  | 0.6763 ± 0.08671  | 0.0 ± 0.0     | 0.71743 ± 0.0758  | 0.71119 ± 0.07574 | 0.00624 ± 0.00017
roadNet-PA   | 0.37021 ± 0.02637 | 0.37021 ± 0.02637 | 0.0 ± 0.0     | 0.36245 ± 0.02438 | 0.3599 ± 0.02436  | 0.00255 ± 0.00015
roadNet-TX   | 0.52158 ± 0.05852 | 0.52158 ± 0.05852 | 0.0 ± 0.0     | 0.42156 ± 0.0261  | 0.41702 ± 0.02609 | 0.00454 ± 0.0002
web-BerkStan | 0.43861 ± 0.04803 | 0.43861 ± 0.04803 | 0.0 ± 0.0     | 0.43482 ± 0.0408  | 0.43045 ± 0.04077 | 0.00437 ± 7e-05
web-Google   | 1.26527 ± 0.12442 | 1.26527 ± 0.12442 | 0.0 ± 0.0     | 1.17827 ± 0.08245 | 1.17333 ± 0.0824  | 0.00494 ± 6e-05
wiki-Talk    | 6.12999 ± 0.88808 | 6.12999 ± 0.88808 | 0.0 ± 0.0     | 5.62305 ± 0.49547 | 5.61537 ± 0.49538 | 0.00768 ± 0.00019

Table 8.18: Raw optimised CPU performance numbers for the SNAP graphs (OpenMP vs. OpenACC).
