Bachelor Informatica

From CUDA to OpenACC in Graph Processing Applications

Wouter de Bruijn

June 17, 2019 Informatica — Universiteit van Amsterdam

Supervisor(s): Dr. Ana Lucia Varbanescu

Abstract

Most GPU-accelerated programs are written using the NVIDIA proprietary API CUDA. CUDA has an extensive collection of users and libraries, but functions only on NVIDIA GPUs and is completely proprietary. This thesis proposes standard ways to convert CUDA to the higher-level programming model OpenACC, examines the difficulty of this process, and analyzes the performance of the resulting converted programs. We have applied our porting methodology to two different graph processing algorithms. We chose Breadth First Search for its relative simplicity and large number of memory operations, and PageRank for its combination of memory operations and computational sections. The results show that OpenACC was significantly faster than CUDA for PageRank, and was more or less tied with CUDA for Breadth First Search. In the end, the performance of OpenACC was close enough to CUDA for most cases, and actually faster than CUDA in one case. OpenACC did lack performance and consistency on multi-core CPUs when compared to OpenMP. Our systematic process of porting CUDA to OpenACC was therefore successful for the two graph processing algorithms. The OpenACC ecosystem does still suffer from a lack of user support and documentation, which makes writing larger and more complicated OpenACC programs more difficult than it should be for (beginner) programmers.

Contents

1 Introduction
  1.1 Context
  1.2 Research Questions
  1.3 Thesis structure

2 Background & Related Work
  2.1 Background
    2.1.1 CUDA, OpenACC and OpenMP
  2.2 Definitions
  2.3 Related Work

3 Methodology
  3.1 Measuring performance
  3.2 Measuring ease-of-use
  3.3 Performance Analysis and Debugging

4 Breadth First Search
  4.1 The Algorithm
  4.2 Porting
    4.2.1 Porting memory transfers
    4.2.2 Porting compute operations
  4.3 Performance
  4.4 Optimising
    4.4.1 Diagnosing and patching
    4.4.2 Optimised results

5 PageRank
  5.1 The Algorithm
  5.2 Porting
    5.2.1 Porting memory transfers
    5.2.2 Porting compute operations
  5.3 Performance
  5.4 Optimising the OpenACC implementation
    5.4.1 Diagnosing and patching
    5.4.2 Optimised results

6 Multithreading OpenACC
  6.1 OpenMP vs. OpenACC
  6.2 Results
    6.2.1 Unoptimised OpenACC
    6.2.2 Optimised OpenACC
  6.3 Examining Results

7 Conclusion and Future Work
  7.1 Discussion
  7.2 Conclusion
    7.2.1 Systematic way of porting
    7.2.2 Difficulty of OpenACC
    7.2.3 Performance benefits or drawbacks
    7.2.4 OpenMP vs. OpenACC
  7.3 Future Work

8 Appendices
  8.1 Appendix A: CUDA max-reduce
  8.2 Appendix B: Graph sizes
  8.3 Appendix C: Raw unoptimised GPU performance numbers
    8.3.1 graph500
    8.3.2 KONECT
    8.3.3 SNAP
  8.4 Appendix D: Raw optimised GPU performance numbers
    8.4.1 graph500
    8.4.2 KONECT
    8.4.3 SNAP
  8.5 Appendix E: Raw unoptimised CPU performance numbers
    8.5.1 graph500
    8.5.2 KONECT
    8.5.3 SNAP
  8.6 Appendix F: Raw optimised CPU performance numbers
    8.6.1 graph500
    8.6.2 KONECT
    8.6.3 SNAP

CHAPTER 1 Introduction

1.1 Context

As we continue to increase the amount of data we collect and process, we need more efficient ways of storing and processing that data. When data consists of entities which can have some sort of relationship between them, a graph can be used to model the data. A graph is a mathematical structure consisting of vertices and edges, where each edge connects a pair of vertices. A large variety of data collected today can be represented using graphs, including data from fields like social networks, linguistics, physics, chemistry and other computational sciences [21][23].

Currently, a lot of graph processing is done using GPU acceleration, as this can bring major performance benefits. Most GPU-accelerated programs are written using the NVIDIA proprietary API CUDA. CUDA is a widely used and supported programming model for programming NVIDIA GPUs. The downside of CUDA, however, is that it functions only on NVIDIA GPUs and is completely proprietary to NVIDIA. Moreover, it requires a different way of thinking compared to normal sequential CPU code. For this reason, it would be beneficial to have a higher-level and portable API that can be used instead of CUDA without any loss of performance.

In this work, we choose OpenACC as the high-level API. OpenACC is a portable acceleration API which aims to let programmers write accelerated code once, and then have the compiler compile it for any type of accelerator. This includes GPUs from potentially any brand, multi-core CPUs and large compute clusters.

1.2 Research Questions

This thesis aims to determine whether high-performance OpenACC code can be derived from CUDA code in a systematic manner. We then investigate the portability of this new OpenACC code by comparing it to multithreaded OpenMP code running on the CPU. To investigate this properly, we aim to answer the following questions:

1. Can we create a systematic way of porting CUDA to OpenACC, and how easy is this porting process?

2. What is the difficulty of implementing our algorithms in OpenACC compared to the difficulty of implementing our algorithms in CUDA?

3. What are the performance benefits or drawbacks to using the high-level portable programming model OpenACC instead of the proprietary CUDA API?

4. Is OpenACC portable and performance portable across different types of platforms (i.e., CPUs and GPUs)?

We answer these questions by porting two graph processing algorithms written in C and CUDA to C and OpenACC, and reviewing this process. We identify any quirks or difficulties encountered during this process in order to answer research question 2. In order to answer research question 1, we search for patterns of CUDA calls that are (or should be) always replaced by the same OpenACC code; these become systematic one-to-one translations. We further compare the runtime of the original code and the ported code to answer question 3. Finally, in order to answer research question 4, we compare the performance of both of our OpenACC benchmarks against similar OpenMP code.

1.3 Thesis structure

This thesis is structured in a per-algorithm way. We start with a background chapter (see chapter 2) containing short explanations of the CUDA, OpenACC, and OpenMP APIs, as well as work related to this thesis. Then, in chapter 3, we describe our exact testing and investigation methodologies. Further, for each of the two algorithms, we describe the porting process and any difficulties encountered during this process (see chapters 4 and 5). For each algorithm, we also include the performance comparison of the two versions. We explain any performance differences, and improve the ported code based on this investigation. After discussing the main algorithms, we select PageRank as a case study for an in-depth comparison against the OpenMP version of the algorithm, running on a CPU (see chapter 6). Finally, we conclude this thesis with a summary of our findings and provide suggestions for potential future research (see chapter 7).

CHAPTER 2 Background & Related Work

2.1 Background

Graph processing is becoming increasingly relevant for many scientific and daily-life applications. The massive size of some of the graphs around us (social networks or infrastructure networks) requires high-performance graph processing. Many frameworks have been devised to help users write algorithms from scratch [22][7][25][26], but there are also a lot of high-performance C and CUDA implementations of graph processing applications. These applications are, however, difficult to read, modify, and/or maintain by regular users who do not have experience with those specific frameworks. Thus, obtaining higher-level versions of these codes is desirable for many domains and users. Obtaining these versions is the main goal of our work.

2.1.1 CUDA, OpenACC and OpenMP

In this section we explain (in short) the differences between the three APIs we are using in this work.

CUDA

CUDA is an NVIDIA proprietary API for the C, C++ and Fortran programming languages which can be used to program NVIDIA GPUs. It has an extensive collection of libraries, a large amount of support behind it, and a sizeable community. These things make CUDA one of the most attractive choices when programmers need GPU acceleration in their applications.

On a basic level, CUDA works by having the programmer manually write CUDA kernels, which are essentially small functions that run on the GPU. These kernels are then called from the "normal" code running on the CPU. CUDA is low-level in the sense that the programmer always has to manually manage memory transfers between the GPU and the rest of the machine, and continually has to think in a GPU-centric way when programming these kernels. Although the code in the CUDA kernels is just C++, the programmer has to consider the inner workings of the GPU while writing these kernels to avoid the performance penalties of, for example, warp divergence and branch divergence [6]. Thus, the low level of abstraction and the tight coupling with the hardware make it harder for a programmer used to writing standard CPU-bound sequential (or even parallel) code to create efficient CUDA code.
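To illustrate the kind of boilerplate this involves, the following minimal vector-addition sketch shows the explicit kernel, launch configuration and memory management that CUDA requires. It is an illustrative example, not code from this thesis, and all names in it are hypothetical.

#include <cuda_runtime.h>

#define VSIZE 100000

// Each thread handles one index; the grid is sized so every element is covered.
__global__ void vec_add(const int *a, const int *b, int *c, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

void add_on_gpu(const int *a, const int *b, int *c) {
    int *a_d, *b_d, *c_d;
    size_t bytes = VSIZE * sizeof(int);

    // The programmer allocates device memory and moves data explicitly.
    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);

    // One thread per element, in blocks of 256 threads.
    vec_add<<<(VSIZE + 255) / 256, 256>>>(a_d, b_d, c_d, VSIZE);

    cudaMemcpy(c, c_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
}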

OpenACC

OpenACC is an accelerator API based on compiler directives, which are commonly called pragmas. It aims to be a general high-level API that can target any accelerator device; at the moment these devices are mostly GPUs and CPUs. As such, OpenACC aims to be the "simple" way of GPU-accelerating a program.

Being a compiler-directive-based API, OpenACC works by instructing the compiler to offload certain sections of code to the accelerator device. In practice this means that the programmer can take sequentially written code, and have the compiler transform that code to the proper API for the target device. This approach has the major advantage of being able to accelerate most code by at least a small amount, as most sequentially written code has at least a couple of sections that can be parallelised. It also means that it is much easier to add OpenACC acceleration to existing code than it is to accelerate existing code using CUDA.

There are a couple of drawbacks to this compiler-centric approach, however. First, the final performance of the OpenACC-accelerated code is very heavily dependent on the compiler being used [16]. This makes the choice of compiler much more important than it usually is, as a compiler switch can bring major performance gains or losses. Second, when the compiler does not understand the structure of the code to be accelerated, it might not parallelise the code in the way that one might expect. This can result in entire sections not being transformed or no longer working, and a long struggle to adapt the code to the point where the compiler understands how certain sections can be parallelised. Much of the original code of the program may have to be rewritten in order to take full advantage of OpenACC acceleration. On top of that, the compiler also has to manage memory movement to and from the GPU if the programmer does not explicitly handle this.

The OpenACC API does contain a number of options and directives to control the offloading of data and code to and from the GPU in a more precise way. This makes it easier for the compiler to offload the code, but requires more programmer skill and code modification. To summarise, OpenACC aims to be both simpler to reason about than CUDA, and simpler to add to existing codebases than CUDA.

OpenMP

Like OpenACC, OpenMP is a directive-based API. It shares most of the benefits and drawbacks that OpenACC has, but has the additional benefit of being much more mature and much more widely implemented. As such, it enjoys better compiler support and optimisations. The major difference is that instead of being designed for writing code for (potentially) any acceleration device, it was mainly focused on multithreaded CPUs, with OpenMP gaining GPU support only very recently, in version 4.5 of the OpenMP standard [18].

Due to its syntactic similarity to OpenACC and its focus on multithreaded CPU acceleration, OpenMP is a prime candidate for comparison with OpenACC's multithreaded CPU support.
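As a small illustration of this syntactic similarity (our own example, not code from the thesis), a parallel sum reduction looks as follows in OpenMP. Only the pragma distinguishes it from plain sequential C, much like the OpenACC directives explained in section 2.2.

#define VSIZE (100000)

// Sum vec_a[i] + vec_b[i] over all elements using OpenMP's parallel-for
// reduction, the CPU analogue of OpenACC's "parallel loop reduction(+:final)".
int sum_vec(const int *vec_a, const int *vec_b) {
    int final = 0;

    #pragma omp parallel for reduction(+:final)
    for (int i = 0; i < VSIZE; i++) {
        final += vec_a[i] + vec_b[i];
    }

    return final;
}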

2.2 Definitions

The OpenACC API contains a number of new constructs and compiler directives. In this section, we explain most of the ones relevant to this thesis. In figure 2.1 we provide a simple example application which contains most of the important OpenACC compiler directives that we use in this thesis. In order to explain them, we start at the main function. In this function we see three compiler directives. The directive

#pragma acc enter data copyin(vec_a[0:VSIZE], vec_b[0:VSIZE])

is called a data directive, and is an example of explicit data movement: we explicitly state that elements 0 through VSIZE of both vectors should be copied to the accelerator device. The directive

#pragma acc kernels

is the simplest and broadest directive that OpenACC provides. It tells the compiler that all code in the following region should be scanned for possible parallelism, and then be parallelised by the compiler. This results in the compiler doing all of the data movement and all of the compute porting. The final directive of this function is

#pragma acc exit data delete(vec_a, vec_b)

which is also an explicit data movement directive. In this case, it tells the compiler to free both vectors on the accelerator device. Moving to the next function, we see that it has four compiler directives. The first directive is

#pragma acc data present(vec_a[0:VSIZE], vec_b[0:VSIZE])

This directive is made up of two parts. The first part is acc data. This declares that we are now entering a data region. This data region informs the compiler that all accelerator data is shared between regions inside this data region, meaning that there is no need to copy data back and forth between the accelerator and the host, and it can simply copy once. The second part of the directive (present(vec_a[0:VSIZE], vec_b[0:VSIZE])) is a declaration that the two vectors, from element 0 through VSIZE, are already present on the GPU (due to the explicit data movement directives in the main function) and do not have to be copied. The next three directives all share the same core:

#pragma acc parallel loop

This is the basic OpenACC directive. It declares the following loop to be a parallel one that can be converted into a kernel for the accelerator. Each directive applies to a single loop. It is up to the programmer to ensure that the loop is fully parallel and has no data dependences (where the result from one loop iteration influences another iteration). The final directive ends with a common suffix: reduction(+:final). This informs the compiler that this loop is not fully parallel, but is in fact a reduction on the variable final with the operator +. This enables the compiler to apply specific, optimised algorithms to perform this reduction as fast as possible.

2.3 Related Work

The usability of OpenACC has already been examined earlier [24]; the authors reached the conclusion that OpenACC has a "promising ratio of development effort to performance". Additionally, earlier performance comparison studies between CUDA and OpenACC, like Hoshino et al. [10], have shown OpenACC to be slower than CUDA. The common belief was that this behaviour is due to the fact that OpenACC lacks the low-level functionality (like shared memory) that enables the programming tricks that CUDA can do [10]. Supporting this lack of performance is Christgau et al. [5], where "the platform independent approach does not reach the speed of the native CUDA code". In this paper, the authors state that "a deeper analysis shows that memory access patterns have a critical impact on the compute kernels' performance, although this seems to be caused by the compiler in use". This supports the points we made in section 2.1.1, where we stated that the performance of OpenACC is heavily affected by the choice (and thus the optimisation level) of compiler. This performance deficit is not a commonly shared conclusion, however, as Herdman et al. [9] conclude that "OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA". To add to this, Ledur, Zeve, and Anjos [13] conclude that "OpenACC presented an excellent execution time compared with the other languages" (the other languages being OpenMP and CUDA). In their conclusion, the authors also note that "CUDA presented good execution times too, but the complexity to construct code is bigger than OpenACC and OpenMP." Examining ease-of-use and difficulty of programming (see research question 2), we have Memeti et al. [15], which, based on the number of lines of code used for the APIs, concludes that "on average OpenACC requires about 6.7x less programming effort compared to OpenCL", and that "Programming with OpenCL on average requires two times more effort than programming with CUDA for the Rodinia benchmark suite". Using some rough maths, this would suggest that OpenACC should be around 3.5x easier to program than CUDA.

#include <stdlib.h>

#define VSIZE (100000)

int do_vec(int *vec_a, int *vec_b) {
    int* vec_c = malloc(sizeof(int) * VSIZE);
    int final = 0;

    #pragma acc data present(vec_a[0:VSIZE], vec_b[0:VSIZE])
    {
        #pragma acc parallel loop
        for(int i = 0; i < VSIZE; i++) {
            vec_c[i] = 0;
        }

        #pragma acc parallel loop
        for(int i = 0; i < VSIZE; i++) {
            vec_c[i] = vec_a[i] + vec_b[i];
        }

        #pragma acc parallel loop reduction(+:final)
        for(int i = 0; i < VSIZE; i++) {
            final += vec_c[i];
        }
    }

    free(vec_c);

    return final;
}

int main(int argc, char *argv[]) {
    int* vec_a = malloc(sizeof(int) * VSIZE);
    int* vec_b = malloc(sizeof(int) * VSIZE);
    #pragma acc enter data copyin(vec_a[0:VSIZE], vec_b[0:VSIZE])

    #pragma acc kernels
    {
        for(int i = 0; i < VSIZE; i++) {
            vec_a[i] = 5+5;
            vec_b[i] = 10+10;
        }
    }

    int result = do_vec(vec_a, vec_b);

    #pragma acc exit data delete(vec_a, vec_b)
    free(vec_a);
    free(vec_b);

    return result;
}

Figure 2.1: Example application containing most of the basic OpenACC directives

CHAPTER 3 Methodology

We answer our research questions using empirical analysis of two different graph processing algorithms: Breadth First Search and PageRank. We have selected these workloads because they are both iterative, like most graph processing algorithms, but cover different types of graph processing, and could therefore require different CUDA and OpenACC constructs. Specifically, Breadth First Search is an algorithm that contains a lot of memory operations, but does not have computationally expensive steps. It also traverses different nodes in every iteration. PageRank contains a mix of both memory intensive steps and computationally expensive steps, and is therefore more balanced. Also, contrary to Breadth First Search, it traverses every node in each iteration.

3.1 Measuring performance

In order to fully explore the performance capabilities and the usability of OpenACC compared to CUDA, we use an iterative approach to porting the algorithms. We create a base port of each algorithm, compare it to the original non-ported version, and look for any differences in performance and behaviour. Based on the results of this comparison, we adapt the ported code and compare this adapted version to the non-ported version again. This process is repeated until no more realistic performance gains can be found.

We measure the runtime of the programs by using operating-system-specific timing libraries. For the Linux and macOS operating systems these are the functions contained in the standard C timing headers; for Windows we use the QueryPerformanceCounter() and QueryPerformanceFrequency() functions from the Windows API. For GPU-accelerated programs, the offloading execution model (see chapter 2) requires the additional step of data movement between host and device. Thus, in our performance analysis, we time these transfers explicitly. Collecting such fine-grained performance data allows for a better understanding of the sources of performance discrepancies between CUDA and OpenACC.

All performance measurements are performed on the DAS-5 distributed supercomputer [1], the full specifications of which can be found in table 3.1. We repeat each experiment 200 times for PageRank and 6400 times for BFS before taking an average of the measured runtimes. It is well known that the performance of graph processing algorithms depends on the structure of the input graph. Therefore, we have selected a set of 36 diverse input graphs. This set includes synthetic graphs taken from the graph500 set [17], and snapshots of real-life graphs taken from the KONECT [12] and SNAP [14] repositories. The list of graphs we use for testing can be found in section 8.2, appendix B, along with their sizes.
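As an illustration, a minimal timing helper along these lines might look as follows. This is a sketch under the assumption that the POSIX clock_gettime() interface is used on Linux and macOS (the exact header names did not survive in this text); it is not the thesis's own timing code.

#include <time.h>

/* Sketch of a monotonic wall-clock timer for Linux/macOS. On Windows the
 * equivalent would combine QueryPerformanceCounter() and
 * QueryPerformanceFrequency() to obtain seconds. */
static double timer_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Usage: sample before and after the region of interest, e.g. a kernel
 * launch or a host-device transfer, and subtract the two values. */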

3.2 Measuring ease-of-use

To answer research questions 1 and 2, we cannot use any exact metric, because no such metric exists. Instead, we describe the porting process in detail and draw conclusions from that description. This porting process includes things like setting up the toolchain and configuring the compiler. We also investigate in detail any unexpected errors, subtle performance killers and pitfalls that the programmer might encounter while writing OpenACC code. Finally, we reflect on how complex and how systematic the process of identifying and fixing OpenACC performance bugs is. Our goal is, again, to extract potential patterns that can eventually lead to guidelines and, when possible, rules for writing well-behaved OpenACC code.

CPU Type:    Dual Intel Xeon E5-2630 v3
CPU Cores:   8-core, 16-thread (32 total threads)
CPU Speed:   2.4 GHz
RAM:         64 GB
GPU Type:    NVIDIA Titan X
GPU Cores:   3584
GPU Speed:   1417 MHz
GPU Memory:  12 GB GDDR5X @ 480 GB/sec

Table 3.1: DAS-5 specifications

C with OpenACC/OpenMP compiler:  PGI C compiler (pgcc), version 19.4-0 (LLVM)
C/C++ CUDA compiler:             NVIDIA CUDA compiler (nvcc), version 10.0.130
OpenACC profiler:                PGI profiler (pgprof), version 19.4
CUDA profiler:                   NVIDIA profiler (nvprof), version 10.1.168

Table 3.2: Software used

3.3 Performance Analysis and Debugging

In order to optimise and analyse our compiled binaries, we use debugging and profiling tools, in addition to the information provided by the compiler. The full list of software used, along with version numbers, can be found in table 3.2. To explain any unexpected performance differences, we use a combination of the NVIDIA and PGI GPU profiling tools and the pgcc option "-Minfo" [20]. "-Minfo" provides detailed information about the compiler's interpretation of our OpenACC directives, and can indicate whether any implicit data regions have been placed, which implicit optimisations have happened, and how our OpenACC regions have been parallelised. The NVIDIA CUDA profiler, nvprof (version 10.1.168), provides detailed profiling information for our CUDA executables. It can provide, for example, detailed information about the number of times a function was called, the average runtime, and the total time spent in a function. It can also provide this information at the level of every CUDA API call. The PGI OpenACC profiler, pgprof (version 19.4), does the same for our OpenACC code, but instead of just measuring functions and API calls, it also profiles OpenACC compute regions.

CHAPTER 4 Breadth First Search

In this chapter we present the different OpenACC BFS implementations we have designed in the context of this thesis. We further provide a detailed analysis of their performance from the perspective of how competitive they are against their CUDA counterparts.

4.1 The Algorithm

Breadth First Search is a basic graph exploration algorithm, in which the goal is to visit all nodes in a graph starting from a specific root node. It is "breadth first" because we explore the nodes "in layers": we first visit the nodes connected to our root node, and then we explore the connections of those nodes, and so on. We use an expanded version of this algorithm, in which we determine the distance (also called depth) of each node relative to our chosen root node.

We implement this algorithm in an edge-centric manner: we loop over every edge in the graph, and check if the origin of that edge has been explored. If it has, we mark the destination node as explored and give it the depth of the origin node plus one. The exploration terminates once we go through a loop iteration in which no new nodes have been discovered.

In standard sequential implementations of BFS, the algorithm works by placing newly discovered nodes in a queue, and then processing new nodes as they appear at the front of the queue. In our parallel CUDA implementation, we simply process each newly found node in parallel instead of placing it in a queue to be handled later. In each step of the algorithm, we travel along the edges of each node that was discovered in the previous step. This makes the algorithm inherently parallel and removes the need for a queue. It also makes keeping track of the current depth easier, because nodes with equal depths will always be discovered in the same step. The downside of this variant is that in each step we check each and every edge, and simply do nothing for the edges that do not need to be explored in this step. This results in more total work as a trade-off for the solved load imbalance.
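For reference, a sequential version of this edge-centric variant could look roughly as follows. This is an illustrative sketch, not the thesis's code; the type and state names mirror those used in the CUDA listings later in this chapter, but the exact definitions here are our own.

#include <stdint.h>

typedef enum { node_unvisited, node_reachable, node_toprocess, node_visited } node_state_t;
typedef struct { uint32_t from, to; } edge_t;
typedef struct { node_state_t state; uint32_t depth; } result_t;

void bfs_sequential(const edge_t *edges, uint64_t edge_count,
                    result_t *results, uint32_t node_count) {
    int was_updated;
    do {
        was_updated = 0;

        // Relax every edge whose origin is marked for processing.
        for (uint64_t i = 0; i < edge_count; i++) {
            uint32_t from = edges[i].from, to = edges[i].to;
            if (results[from].state == node_toprocess &&
                results[to].state == node_unvisited) {
                results[to].state = node_reachable;
                results[to].depth = results[from].depth + 1;
            }
        }

        // Advance all node states by one layer: newly reached nodes will be
        // processed in the next round, processed nodes are done.
        for (uint32_t i = 0; i < node_count; i++) {
            if (results[i].state == node_reachable) {
                results[i].state = node_toprocess;
                was_updated = 1;
            } else if (results[i].state == node_toprocess) {
                results[i].state = node_visited;
                was_updated = 1;
            }
        }
    } while (was_updated);
}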

4.2 Porting

Our (rough) CUDA implementation can be seen in figure 4.1. The structure of the code consists of roughly the following steps:

1. Allocate the GPU memory.

2. Copy the graph edges and the results data structure to the GPU.

3. Set the value of the was_updated variable on the GPU to zero.

4. Check the state of the origin of each edge. If the origin has state toprocess and the destination has state unvisited, mark the destination of the edge as reachable.

5. Mark all toprocess nodes as visited and mark all reachable nodes as toprocess.

result_t* do_bfs_cuda(graph_edge_t* graph) {
    result_t* results = malloc(graph->node_count * sizeof(result_t));
    results[0].state = node_toprocess; // Start at node 0

    timer_start(timer_memtransfers);
    copy_data_to_gpu(graph, results);

    // Do BFS
    int was_updated;
    timer_start(timer_nomemtransfers);
    do {
        cudaMemset(was_updated_device, 0, sizeof(int)); // set "was_updated" on device to 0
        bfs_search<<<(edge_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>(edges, results, edge_count);
        bfs_update_state<<<(node_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>(results, node_count, was_updated);
        cudaMemcpy(&was_updated, was_updated_device, sizeof(int),
                   cudaMemcpyDeviceToHost); // Get the "was_updated" value back from the device
    } while(was_updated == 1);
    timer_stop(timer_nomemtransfers);

    cudaMemcpy(results, results_device, graph->node_count * sizeof(result_t),
               cudaMemcpyDeviceToHost); // Copy results back

    delete_data_from_gpu(graph, results); // Delete data from GPU
    timer_stop(timer_memtransfers);

    return results;
}

Figure 4.1: The core of the Breadth First Search algorithm (pseudo-code, shortened)

6. Get the was_updated value back from the GPU. This was changed to 1 if the state of any node changed in the previous step.

7. Go back to step three if was_updated is set to 1.

8. Copy the results structure back from the GPU.

9. Deallocate the GPU memory.

One might notice that there can be race conditions in step 4: multiple edges might mark the same destination node as reachable, and one thread might be changing the state from unvisited to reachable while another thread still sees it as unvisited. However, this is not a problem for this implementation, because it does not matter which thread marks the target node as reachable: the depth will always be the same within a step, due to the parallel nature of this algorithm. As a result, the sequential and the parallel versions of the algorithm always end with the same result, even with the race conditions. We discuss the porting of this code by treating the memory operations and the compute operations separately.

4.2.1 Porting memory transfers

The memory operations consist of allocating the GPU memory, copying the initial data to the GPU, the intermediate memory operations, copying the results back to the host, and finally deallocating the GPU memory.

result_t* results = malloc(node_count * sizeof(result_t));

cudaMalloc(&edges_device, graph->edge_count * sizeof(edge_t));
cudaMalloc(&results_device, graph->node_count * sizeof(result_t));
cudaMalloc(&was_updated_device, sizeof(int));

cudaMemcpy(edges_device, graph->edges, edge_count * sizeof(edge_t), cudaMemcpyHostToDevice);
cudaMemcpy(results_device, results, node_count * sizeof(result_t), cudaMemcpyHostToDevice);

Figure 4.2: BFS CUDA initial data copying

result_t* results = malloc(node_count * sizeof(result_t));
...
#pragma acc data
{
    ...
    do {
        ...
        do_bfs_search();

        update_bfs_states();
        ...
    } while(was_updated == 1);
    ...
    copy_results_back();
}

Figure 4.3: BFS OpenACC initial data copying

In OpenACC, the allocation and copying of the initial data happen in a single step, except when using specialised memory directives, which are not relevant in this thesis. The CUDA code for the initial data copy consists of five lines, and can be seen in figure 4.2. There are three memory allocations, and only two memory copies, because the was_updated_device variable is set in the BFS algorithm loop itself.

Because OpenACC works with compiler directives, we can make the compiler figure out the memory transfers by itself in certain situations. We find that "certain situations" can be quite arbitrary, as the compiler needs to understand in what way memory will be used and allocated. This process of recognition is extremely unpredictable and prone to errors. To help the compiler, we can declare a region in which GPU data will be shared with

#pragma acc data

In figure 4.3 we see how this looks in our code. We have wrapped our main algorithm loop in one of these data regions, to declare that all GPU data in this region should be shared with all other code requiring this data. In practice, this means that no updating back and forth between the GPU and host is required in this region when the data is only accessed in regions that are offloaded to the GPU. It also means no extra allocations and deallocations other than at the beginning and end of this zone. If the compiler can now recognise the sizes and shapes of all the data that is needed, it can automatically perform the memory allocations and copies for the programmer. In our case, however, the compiler could not recognise the shape of the results array, but had no trouble with

result_t* results = malloc(node_count * sizeof(result_t));
...
#pragma acc data copy(results[0:node_count])
{
    ...
    do {
        ...
        do_bfs_search();

        update_bfs_states();
        ...
    } while(was_updated == 1);
    ...
    copy_results_back();
}

Figure 4.4: BFS OpenACC initial data copying with size hint

the edges array. Why the compiler could recognise the shape and size of an array allocated in a different function, but not that of an array allocated several lines above, is unclear. To solve this problem, we can explicitly state what we want to do with the results data. We want the data copied to the GPU from the host before the algorithm starts, and we want it copied back to the host when the algorithm has finished. OpenACC structured data pragmas (like our data region) support the copy(x) argument, which states that we want to copy the given data to the GPU at the start of our structured region, and vice versa when the region ends. Our updated code can be seen in figure 4.4.

At this point, the compiler generated functioning code which gave correct results. All other memory operations (including the updating of the was_updated variable) were implicitly generated by the compiler without us giving any hints.

4.2.2 Porting compute operations

As seen in our initial CUDA code in figure 4.1, there are two CUDA kernels which have to be ported to OpenACC: bfs_search and bfs_update_state.

bfs_search

The original CUDA implementation of this kernel can be seen in figure 4.5. Here we can see that each instance of the bfs_search kernel handles one edge. This is supported by the call to the kernel, in which we declare that we need a number of blocks equal to the number of edges in the graph divided by the size of a CUDA block (with one extra block to handle the edges left over when this division is not exact). For our purposes this means that we can transform the code into a standard for loop iterating over all edges in the graph. We can then annotate this loop with

#pragma acc parallel loop

to tell the compiler that this loop is completely parallel and should be offloaded to the GPU. Our ported code can be seen in figure 4.6. In essence, all we have done is inline the CUDA kernel, wrap it in a for loop, and remove any CUDA-specific code. This shows us that simple algorithms and CUDA kernels can be ported to OpenACC easily, sometimes resulting in shorter code. This is one of the main benefits of OpenACC: you do not need to know how GPUs work in order to write code that can run on GPUs.

__global__ void bfs_search(edge_t* edges, result_t* results, edge_count_t edge_count) {
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i < edge_count) {
        uint32_t origin_index = edges[i].from;
        uint32_t destination_index = edges[i].to;

        if(results[origin_index].state == node_toprocess) {
            if(results[destination_index].state == node_unvisited) {
                results[destination_index].state = node_reachable;
                results[destination_index].depth = results[origin_index].depth + 1;
            }
        }
    }
}

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        bfs_search<<<(edge_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>
            (edges_device, results_device, edge_count);
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.5: BFS Search CUDA function

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        #pragma acc parallel loop
        for(edge_count_t i = 0; i < edge_count; i++) {
            uint32_t origin_index = edges[i].from;
            uint32_t destination_index = edges[i].to;

            if(results[origin_index].state == node_toprocess) {
                if(results[destination_index].state == node_unvisited) {
                    results[destination_index].state = node_reachable;
                    results[destination_index].depth = results[origin_index].depth + 1;
                }
            }
        }
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.6: BFS Search OpenACC code

Our bfs_search code already requires some knowledge about GPUs (block sizes and block count), while the OpenACC variant is basically annotated sequential CPU code.

bfs_update_state

The CUDA code to be ported can be seen in figure 4.7. This code consists of a kernel in which each instance processes a single element in the results array. As we can see in the allocation of this results array (figures 4.1 and 4.2), it has a length equal to node_count. This is supported by the call of this kernel, in which the number of blocks to be used is equal to the number of nodes in the graph divided by the number of threads per block (with one extra block to handle node counts which are not a multiple of this threads-per-block value). This means that, just like with the bfs_search kernel, we can wrap it in a for loop with node_count iterations. Now we can annotate this for loop with

#pragma acc parallel loop

to tell the compiler that this is a fully parallel loop that should be offloaded. Our ported code can be found in figure 4.8. Again, all we have done is inline the CUDA kernel, wrap it in a for loop, and remove the CUDA-specific code.

4.3 Performance

In figure 4.9 we can see the normalised performance comparison between the CUDA and OpenACC implementations of BFS. In this figure we can see that our OpenACC implementation is faster for the smaller graphs (10 through 13), but loses to the original CUDA implementation for the other graphs. Figure 4.10 shows the same comparison for the KONECT graphs. Again, we can see that CUDA is faster by a large margin, except for the opsahl-ucsocial graph. Looking at the graph sizes, we see that the opsahl-ucsocial graph is the smallest of all the KONECT graphs.

__global__ void bfs_update_state(result_t* results, uint32_t node_count, int* was_updated) {
    unsigned int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i < node_count) {
        switch(results[i].state) {
            case node_unvisited:
            case node_visited:
                break;
            case node_reachable:
                results[i].state = node_toprocess;
                *(was_updated) = 1;
                break;
            case node_toprocess:
                results[i].state = node_visited;
                *(was_updated) = 1;
                break;
        }
    }
}

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        bfs_update_state<<<(node_count / CUDA_BLOCKSIZE) + 1, CUDA_BLOCKSIZE>>>
            (results_device, node_count, was_updated_device);
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.7: BFS Update state CUDA function

result_t* do_bfs_cuda(graph_edge_t* graph) {
    ...
    do {
        ...
        #pragma acc parallel loop
        for(uint32_t i = 0; i < node_count; i++) {
            switch(results[i].state) {
                case node_unvisited:
                case node_visited:
                    break;
                case node_reachable:
                    results[i].state = node_toprocess;
                    was_updated = 1;
                    break;
                case node_toprocess:
                    results[i].state = node_visited;
                    was_updated = 1;
                    break;
            }
        }
        ...
    } while(was_updated == 1);
    ...
}

Figure 4.8: BFS Update state OpenACC code

Figure 4.9: Unoptimised BFS performance comparison for Graph500 graphs.

Figure 4.10: Unoptimised BFS performance comparison for KONECT graphs.

Finally, figure 4.11 contains the comparison for the SNAP graphs. Again, OpenACC is beaten by CUDA on every graph.

Figure 4.11: Unoptimised BFS performance comparison for SNAP graphs.

When we combine the results from these figures, we can conclude that the unoptimised version of our OpenACC BFS algorithm is faster than CUDA for very small graphs only. Once the graph size crosses the multi-megabyte line, CUDA starts being faster. In the next section, we take a look at why this is the case, and how we might optimise our OpenACC code to more closely match the original CUDA implementation.

4.4 Optimising

4.4.1 Diagnosing and patching

In order to optimise our BFS OpenACC implementation, we should first diagnose any potential performance problems and consider how we might solve them. To start diagnosing, we look at the actual raw performance numbers, which can be found in appendix C, section 8.3. In these tables, we can immediately notice two strange results: the difference in total processing time gets larger as the graphs increase in size, and the OpenACC memory transfer time is significantly shorter than the memory transfer time for CUDA. This is especially obvious for the graph graph500-24, where the memory transfer time of CUDA is 5x longer, but the time spent in the main BFS loop is more than an order of magnitude shorter.

This could mean that OpenACC is simply much faster with memory transfers, but that seems unlikely, as it would mean that OpenACC is able to use GPU memory bandwidth more efficiently than CUDA. A significant hint can be found in the output our compiler gives us while compiling the OpenACC BFS code:

29, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)
35, Generating copy(results[:node_count])
39, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)
40, Loop not vectorized/parallelized: contains call
    FMA (fused multiply-add) instruction(s) generated
44, Generating Tesla code
    45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
44, Generating implicit copyin(edges[:edge_count])
45, Loop not fused: no successor loop
    Loop not vectorized: data dependency
58, Generating Tesla code
    59, #pragma acc loop gang /* blockIdx.x */
59, Scalar last value needed after loop for was_updated at line 74
    Loop not fused: no successor loop
75, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)
79, get_highrestime inlined, size=4 (inline) file bfs/bfs_acc.c (39)

In this snippet of compiler output, we can see that the main copying takes place at line 35 of our program. However, we can also spot that the compiler has implicitly placed a copyin directive at line 44, right in the middle of our main loop. This results in all the edges of the graph being copied to the GPU again at every step of our BFS loop. As this main loop is not measured as part of the memory transfer, the memory transfer time appears artificially (and incorrectly) short. It also introduces a lot of latency and overhead in our main loop, as the edges are copied unnecessarily over and over again. For large graphs (like graph500-24) this introduces a drastic performance hit. Luckily, this problem is easily solved. Looking at our final memory transfer code in figure 4.4, we simply change the line

#pragma acc data copy(results[0:node_count])

to

#pragma acc data copy(results[0:node_count]) copyin(edges[0:edge_count])

This extra clause instructs the compiler to copy the edges to the GPU at the start of our data region. This means the compiler no longer needs to do the implicit transfer, and thus no longer places it incorrectly. In theory, this should reduce the processing time of the OpenACC BFS algorithm significantly, while bringing the memory transfer time up to be more in line with the original CUDA implementation. The second major optimisation was also obtained while looking at the compiler output. Observe the following output:

44, Generating Tesla code
    45, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
58, Generating Tesla code
    59, #pragma acc loop gang /* blockIdx.x */

Here we can see that the compiler decided to vectorise the first for-loop, but has not done the same for the second for-loop. As the loops are completely independent and vectorised behaviour is desired, we make this vectorisation explicit by turning

#pragma acc parallel loop

into

#pragma acc parallel loop gang vector

This tells the compiler that the loop can exploit both gang-level and vector-level parallelism, which roughly map to a CUDA block and thread respectively. In this stage some other small optimisations were made as well. For instance, all OpenACC parallel loop constructs were made async, meaning that the host does not wait for the GPU to finish its task before moving on with the other code. In our case this came down to the host being able to queue the parallel for-loops instead of starting them sequentially, as sketched below.
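The following self-contained sketch shows what this async pattern looks like on two independent loops; it illustrates the directive usage only and is not the actual BFS code.

#define N (100000)

// Both loops are submitted to async queue 1, so the host enqueues the second
// loop without waiting for the first to finish, and only blocks at the wait.
void scale_and_offset(float *a, float *b) {
    #pragma acc data copy(a[0:N], b[0:N])
    {
        #pragma acc parallel loop gang vector async(1)
        for (int i = 0; i < N; i++) {
            a[i] = a[i] * 2.0f;
        }

        #pragma acc parallel loop gang vector async(1)
        for (int i = 0; i < N; i++) {
            b[i] = b[i] + 1.0f;
        }

        #pragma acc wait(1)  // host blocks here until both loops have completed
    }
}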

4.4.2 Optimised results

Here we can see that our optimisations have had a major effect on the runtime of the OpenACC port. In figure 4.12, we see that CUDA's advantage has almost completely disappeared, except for graph 23, where it is about 20% faster than OpenACC. For the other graphs in this figure, OpenACC is generally a lot faster. Figure 4.13 paints a similar picture. OpenACC is, again, generally faster than CUDA except for two outliers, dbpedia-all and orkut-links. The actual speed increase of OpenACC differs per graph, but seems to hover around 25% over CUDA. Finally, in figure 4.14 we see the biggest improvement. Whereas in the unoptimised comparison in figure 4.11 we saw that CUDA was a lot faster (sometimes up to 95% faster), the two implementations are now very similar. Although OpenACC is not beating CUDA as confidently as it does for the other graphs, there is still a major performance improvement compared to the unoptimised version. The raw performance numbers for the optimised algorithm can be found in appendix D, section 8.4.

Figure 4.12: Optimised BFS performance comparison for Graph500 graphs.

Figure 4.13: Optimised BFS performance comparison for KONECT graphs.

Figure 4.14: Optimised BFS performance comparison for SNAP graphs.

CHAPTER 5 PageRank

In this chapter we present the different OpenACC PageRank implementations we have designed in the context of this thesis. We further provide a detailed analysis of their performance from the perspective of how competitive they are against their CUDA counterparts.

5.1 The Algorithm

PageRank is an algorithm built to measure the "importance" of a node in a graph relative to the other nodes. It is the original algorithm behind the Google search service [19][3], but is definitely not limited to use in web pages. The equation defining a node's PageRank score is as follows:

PR(p_i) = 1 - d + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Where:

• d is the damping factor, a way to dampen the final results;

• M(x) is the set of nodes with an edge to node x;

• L(x) is the total number of edges originating from node x;

• PR(x) is the PageRank score of node x.

There has been some debate about the correctness of this equation, however, as the paper describing the actual Google implementation [3] describes the sum of all PageRank scores in a graph as being equal to 1, which would result in the following equation:

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

Where N is the total number of nodes in the graph. In our implementation, we use the second equation, as the first is prone to floating-point overflow errors.

PageRank requires that all nodes that have incoming edges also have at least one outgoing edge [4]. There are multiple ways of solving this problem; we have solved it by adding the reverse of all incoming edges as outgoing edges for nodes without outgoing edges. For example, if we have the edge (a, b) and b has no outgoing edges, we add the edge (b, a). A sketch of this preprocessing step is shown below.

As PageRank is an iterative algorithm, we keep repeating the equation for each node in the graph, until each node's score changes only by a small amount (0.1% or less).
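The sketch below illustrates this preprocessing step on a simple edge list. It is a hypothetical helper written for this explanation, not the thesis's preprocessing code; a real implementation would also have to update the out-degree counts used by the PageRank equation afterwards.

#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t from, to; } edge_t;

// For every edge (a, b) whose destination b has no outgoing edges, append the
// reverse edge (b, a). out_count holds the original out-degree of each node.
edge_t* fix_dangling_nodes(const edge_t *edges, uint64_t edge_count,
                           const uint32_t *out_count, uint64_t *new_edge_count) {
    // Worst case: every edge gets a reverse edge appended.
    edge_t *fixed = malloc(2 * edge_count * sizeof(edge_t));
    uint64_t n = 0;

    for (uint64_t i = 0; i < edge_count; i++) {
        fixed[n++] = edges[i];
        if (out_count[edges[i].to] == 0) {
            fixed[n].from = edges[i].to;
            fixed[n].to   = edges[i].from;
            n++;
        }
    }

    *new_edge_count = n;
    return fixed;
}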

pagerank_t* pagerank(graph_cuda_t* graph) {
    pagerank_t* ranks = init_pagerank(graph);

    timer_start(timer_memtransfers);
    copy_graph_to_gpu(graph);
    timer_start(timer_nomemtransfers);

    do {
        // Do PageRank
        pagerank_do<<<block_count, CUDA_BLOCKSIZE>>>(graph, ranks, next_ranks);
        // Find max change
        pagerank_max_reduce<<<block_count, CUDA_BLOCKSIZE>>>
            (ranks, ranks_next, max_array, nodecount);
        // ranks = ranks_next
        pagerank_shift<<<block_count, CUDA_BLOCKSIZE>>>(ranks, ranks_next, nodecount);
    } while(max_change > PAGERANK_THRESHOLD);

    timer_stop(timer_nomemtransfers);

    cudaMemcpy(ranks, ranks_device, nodecount * sizeof(pagerank_t), cudaMemcpyDeviceToHost);

    delete_graph_from_gpu(graph);

    timer_stop(timer_memtransfers);

    return ranks;
}

Figure 5.1: Shortened PageRank CUDA structure

We implement this algorithm by looping over all nodes in the graph, and then looping over all incoming edges of each node in a nested loop. When we have calculated the next PageRank iteration, we compute the percentage of change for each node, and then find the maximum change percentage over all nodes. If this percentage is lower than our threshold, we consider the PageRank scores to be calculated. In order to verify the correctness and equivalence of the implementations, we make sure that the final PageRank scores and the number of iterations of both versions of the algorithm are equal.
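A sequential sketch of the loop structure just described is shown below. It is an illustration only: the field names follow the OpenACC code in figures 5.3 and 5.6, but the type definitions and the damping factor value (0.85) are our own assumptions.

#include <math.h>
#include <stdint.h>

#define PAGERANK_D         0.85f   /* assumed damping factor */
#define PAGERANK_THRESHOLD 0.1f    /* stop once no node changes by more than 0.1% */

typedef struct {
    uint32_t  in_count;   // number of incoming edges
    uint32_t *in;         // indices of the nodes linking to this node, M(x)
    uint32_t  out_count;  // number of outgoing edges, L(x)
} node_t;

void pagerank_sequential(const node_t *nodes, uint32_t node_count,
                         float *ranks, float *ranks_next) {
    float max_change;
    do {
        max_change = 0.0f;

        for (uint32_t i = 0; i < node_count; i++) {
            float rank = 0.0f;

            // Sum the contributions of all nodes linking to node i.
            for (uint32_t j = 0; j < nodes[i].in_count; j++) {
                uint32_t in_index = nodes[i].in[j];
                rank += ranks[in_index] / nodes[in_index].out_count;
            }
            ranks_next[i] = ((1.0f - PAGERANK_D) / node_count) + (PAGERANK_D * rank);

            // Track the largest relative change (in percent) over all nodes.
            float change = fabsf(ranks[i] - ranks_next[i]) / ranks[i] * 100.0f;
            if (change > max_change)
                max_change = change;
        }

        // The new scores become the current scores for the next iteration.
        for (uint32_t i = 0; i < node_count; i++)
            ranks[i] = ranks_next[i];

    } while (max_change > PAGERANK_THRESHOLD);
}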

5.2 Porting

Our original CUDA implementation is roughly described by the code in figure 5.1. Here we can see that there are roughly seven steps to port:

1. Allocate the GPU memory.

2. Copy the relevant graph fields to the GPU.

3. Do a PageRank iteration.

4. Find the maximum change.

5. Shift the newly computed PageRank values to our array with "current" values.

6. Copy the final PageRank scores back to the host.

7. Deallocate the graph from the GPU.

static void copy_graph_to_gpu(graph_cuda_t* graph) {
    for(uint32_t i = 0; i < graph->node_count; i++) {
        // Copy each node's in array
        cudaMalloc(&graph->nodes.host[i].in.device,
                   graph->nodes.host[i].in_count * sizeof(uint32_t));
        cudaMemcpyAsync(graph->nodes.host[i].in.device, graph->nodes.host[i].in.host,
                        graph->nodes.host[i].in_count * sizeof(uint32_t), cudaMemcpyHostToDevice);
        // The outgoing edges are not useful in our case, so we don't copy them
    }
    cudaMalloc(&graph->nodes.device, graph->node_count * sizeof(node_cuda_t));
    cudaMemcpyAsync(graph->nodes.device, graph->nodes.host,
                    graph->node_count * sizeof(node_cuda_t), cudaMemcpyHostToDevice);
}

Figure 5.2: CUDA function to copy the graph to the GPU

These can be reduced to two main parts: the memory transfer operations and the computing operations. We describe these processes separately.

5.2.1 Porting memory transfers

By far the easiest memory operation to port is the copying of the final PageRank scores back to the host. As OpenACC has a simple pragma to update a host variable from the matching device variable, we simply changed the line

cudaMemcpy(ranks, ranks_device, graph->node_count * sizeof(pagerank_t), cudaMemcpyDeviceToHost);

to

#pragma acc update host(ranks[0:graph->node_count])

The more complicated functions are the ones copying the graph to the GPU, and deallocating it after the algorithm is complete. In figure 5.2 we can see our original CUDA implementation. This consists of a series of memory allocations and asynchronous memory copies. As the CUDA malloc function returns a pointer to a device memory location, we have to keep track of two sets of pointers: one to the host memory, and one to the device memory. This forces us to modify our graph data structure, as we now have to declare "dual pointer" types containing a host and a device pointer. This makes our code more verbose and error-prone.

Figure 5.3 presents our OpenACC implementation of this function. The first noticeable thing is that the function itself is much shorter and a lot less verbose. To replace our CUDA malloc/memcpy combination, we can use the OpenACC enter data copyin(host[start:count]) construct, which allocates sizeof(host_type) * count bytes on the device and then copies elements host[start] through host[start+count-1] from host memory to the device. We also declare these memory operations to be async, which places them in a queue. This allows us to schedule all copy operations and have them run concurrently with the rest of the code. The first time we need the memory, we simply call #pragma acc wait to force these async operations to finish before we move on. Notice that we can use standard C syntax to declare what needs to be copied to the GPU, and OpenACC takes care of setting up the pointers. Although under the hood there are still multiple copies of the variables and the same amount of memory is used, this is likely to prevent programmer errors, as it is all handled for the programmer by the compiler and runtime.

static void copy_graph_to_gpu(graph_acc_t* graph) {
    #pragma acc enter data copyin(graph[0:1]) async
    #pragma acc enter data copyin(graph->nodes[0:graph->node_count]) async

    for(uint32_t i = 0; i < graph->node_count; i++) {
        #pragma acc enter data \
            copyin(graph->nodes[i].in[0:graph->nodes[i].in_count]) async
    }
}

Figure 5.3: OpenACC function to copy the graph to the GPU

For the deallocation of the graph from the GPU in CUDA, we kept the same structure and for-loop, but changed the cudaMalloc/cudaMemcpy calls into cudaFree calls on the same variables, and reversed the order. With OpenACC we also reversed the order, and replaced the

#pragma acc enter data copyin(var) async

pragmas with

#pragma acc exit data delete(graph[0:1]) async

pragmas. This states that we want to deallocate the memory on the GPU without copying it back to the host, as we do the copying manually when needed (a sketch of the resulting routine is shown at the end of this subsection).

At this point, we have ported our memory operations to OpenACC, and we also encounter our first catch: our compiler cannot determine whether a variable is already in GPU memory if we declare the copy operations ourselves. This results in the compiler implicitly adding copyin and copyout operations around the places where we do GPU computations on the data. As we have two main OpenACC parallel compute sections (which we explain in detail in section 5.2.2), this results in four extra (and quite large) memory operations. More importantly, however, the compiler fails to correctly determine the sizes of the arrays to copy and also incorrectly guesses that each pointer variable holds an array. This results in invalid memory accesses and a lot of runtime crashes.

To counteract this, we have to declare which variables are present. This causes the compiler to skip the allocation and copying of these variables. In figure 5.4 we can see an example of how that looks in our ported code. In general, these present declarations are always needed in OpenACC code in which the programmer does manual memory management (as opposed to having the compiler figure it out itself), and they can quickly bloat the code to the point where the lines saved by the shorter copyin/copyout operations are lost again to present pragmas.
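For reference, the OpenACC deallocation routine described above might look roughly as follows; this is a sketch mirroring copy_graph_to_gpu from figure 5.3, not the exact thesis code.

static void delete_graph_from_gpu(graph_acc_t* graph) {
    // Free the per-node edge arrays first, then the node array, then the graph
    // struct itself: the reverse of the allocation order in figure 5.3.
    for(uint32_t i = 0; i < graph->node_count; i++) {
        #pragma acc exit data delete(graph->nodes[i].in[0:graph->nodes[i].in_count]) async
    }

    #pragma acc exit data delete(graph->nodes[0:graph->node_count]) async
    #pragma acc exit data delete(graph[0:1]) async
    #pragma acc wait
}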

5.2.2 Porting compute operations

Assuming that all the GPU memory has now been handled, we continue with porting the actual compute kernels to OpenACC. Our CUDA code contains three kernels. The first kernel computes a new PageRank score for each node in our graph, based on their previous scores stored in prev_rank, and stores it in the next_rank array. The second kernel determines the change percentage for each node, and then reduces all these change percentages to their collective maximum (a classic max-reduce problem). The final kernel shifts each element in next_rank to the same index in prev_rank.

PageRank Kernel

In figure 5.5 we can see our basic PageRank kernel. In this kernel, we create a number of threads equal to the number of nodes in the graph. We then make each thread compute the PageRank score of its respective node.

copy_graph_to_gpu(graph);

#pragma acc data present(ranks[0:graph->node_count], ranks_next[0:graph->node_count])
#pragma acc data present(graph->nodes->in, graph->nodes, graph)
{
    do {
        ...
    } while(max_change > PAGERANK_THRESHOLD);

    #pragma acc update host(ranks[0:graph->node_count])
}

delete_graph_from_gpu(graph);

Figure 5.4: Addition of present pragmas. Notice the extra set of brackets to signal in which code block the variables are present.

__global__ void pagerank_kernel(graph_cuda_t* graph,
                                pagerank_t* prev_rank, pagerank_t* next_rank) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    if(i < graph->node_count) {
        pagerank_t rank = 0;

        for(uint32_t j = 0; j < graph->nodes.device[i].in_count; j++) {
            uint32_t in_index = graph->nodes.device[i].in.device[j];
            rank += (prev_rank[in_index] / graph->nodes.device[in_index].out_count);
        }

        next_rank[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);
    }
}

pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    unsigned int block_count = (graph->node_count / CUDA_BLOCKSIZE) + 1;
    pagerank_do<<<block_count, CUDA_BLOCKSIZE>>>(graph_device, ranks_device, ranks_device_next);
    ...
}

Figure 5.5: CUDA PageRank kernel

pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    do {
        // Do PageRank
        #pragma acc parallel loop
        for(uint32_t i = 0; i < graph->node_count; i++) {
            pagerank_t rank = 0;

            #pragma acc loop reduction (+:rank)
            for(uint32_t j = 0; j < graph->nodes[i].in_count; j++) {
                uint32_t in_index = graph->nodes[i].in[j];
                rank += (ranks[in_index] / graph->nodes[in_index].out_count);
            }

            ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);

            // Determine the threshold
            thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
        }
        ...
    } while(max_change > PAGERANK_THRESHOLD);
    ...
}

Figure 5.6: OpenACC: PageRank kernel, ported.

The first order of business is rewriting the kernel to a sequential form and moving it out of the kernel function. This allows us to annotate it with OpenACC pragmas. Luckily for us, this is fairly easy for the PageRank function. Once we have copied the body and rewritten it to a basic for-loop, we can annotate the for-loop with

#pragma acc parallel loop

indicating that this for-loop is completely parallel and should be offloaded to the device. As a small optimisation step, we put the computation of the change percentage in this loop. In CUDA, this change percentage is computed in the maximum change percentage kernel. We have also annotated the inner loop with

#pragma acc loop reduction(+:rank)

This tells the compiler that this loop contains a reduction in the variable rank (the sum part of the PageRank equation), enabling the compiler to insert specific optimisations for the reduction. At this point, we have fully ported a basic version of the PageRank kernel to OpenACC. Our final result can be seen in figure 5.6.

Maximum Threshold Kernel

During the porting of this kernel, the high-level nature of OpenACC really starts to make a difference. CUDA does not implement any reduction algorithms natively, which means that we have to implement them ourselves, or use third-party libraries like Thrust [2]. This does open up some huge optimisation opportunities for us. Using the basic reduction optimisation idea taken from NVIDIA itself [8], we can implement a reduction kernel that takes full advantage of the architecture of NVIDIA GPUs. As we can compare at most two elements at the same time, we can in theory halve the number of values to be reduced at each step of the process. We can also take advantage of latency hiding, and try to remove any warp divergence. This leaves us

pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    do {
        // Do PageRank
        ...
        // Max change
        max_change = 0.0f;
        #pragma acc parallel loop reduction(max:max_change)
        for(uint32_t i = 0; i < graph->node_count; i++){
            max_change = fmaxf(max_change, thresholds[i]);
        }
        ...
    } while(max_change > PAGERANK_THRESHOLD);
    ...
}

Figure 5.7: OpenACC maximum change reduction

with a very long and verbose function (more than 60 lines in total). The final code for this max-reduce function can be found in the appendix in section 8.1; it is mostly the same as the code from the NVIDIA lecture [8], with the additions replaced by calls to the function fmaxf(a,b). OpenACC makes this process a lot easier and shorter. We simply implement a standard sequential max-reduce for-loop, and annotate it with:

#pragma acc parallel loop reduction(max:max_change)

This approach results in OpenACC being able to implement the optimal reduction algorithm for the platform we are compiling for. As we can see in figure 5.7, this code is only five lines in total, and thus much easier to read and comprehend.

5.3 Performance

In figures 5.8, 5.9 and 5.10 you can see our benchmarking results for the PageRank algorithm. Surprisingly, our OpenACC port is faster for every graph by a significant margin. The standard deviation is also generally lower, which means that the OpenACC port is generally more consistent as well. The full data can be found in appendix C, section 8.3. When investigating the specific graph results, we can see that the dbpedia-starring graph in figure 5.9 is the only graph where the CUDA performance is within 20% of OpenACC. In the next section, we investigate why our OpenACC implementation is faster, and how we can potentially increase this performance gap.

5.4 Optimising the OpenACC implementation

5.4.1 Diagnosing and patching

In order to (potentially) exploit the behaviour that makes our OpenACC port faster than the original CUDA implementation, we must first figure out what this behaviour specifically is. Using the NVIDIA profiler, nvprof, does not help us any further, as it only shows us that OpenACC is faster, not why. As such, we turn to the compiler information output:

66, Generating Tesla code
    67, #pragma acc loop gang /* blockIdx.x */
    71, #pragma acc loop vector(128) /* threadIdx.x */
        Generating reduction(+:rank)
67, Loop not fused: no successor loop
71, Loop is parallelizable
84, Generating Tesla code
    85, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
        Generating reduction(max:max_threshold)

Figure 5.8: Unoptimised PageRank performance comparison for Graph500 graphs.

Figure 5.9: Unoptimised PageRank performance comparison for KONECT graphs.

Figure 5.10: Unoptimised PageRank performance comparison for SNAP graphs.

At this point we can start to understand why OpenACC is faster: it treats our main PageRank loops (seen in figure 5.11) differently than we do in CUDA. In our original CUDA implementation, each thread handles a single node. A thread calculates the new PageRank score based on all the incoming connections of that node, which can be seen in the inner loop in figure 5.11. The OpenACC compiler has taken a different approach: it treats the innermost loop as a vector calculation, and the outer loop as a gang (a block in CUDA terms). This results in a lot more threads: each gang now handles a single node, and the threads within that gang work together, each adding its contribution to the node's new PageRank score. Thus, in OpenACC, the work per thread is smaller, as we have multiple threads working on the update of a single node. This results in a much better load balance than CUDA achieves. In the end, this amounts to a completely different algorithmic approach than the one taken in the original CUDA code, and one that is clearly faster. Although it is possible to reproduce this behaviour in CUDA by changing the algorithm, this algorithm is (significantly) more difficult to implement, because the nodes all have a different number of incoming connections: we would have to create a second algorithm to start enough CUDA threads and to then map each thread to both a node and an incoming edge. Even then, we would have to write even more code to combine these threads into one to do the final new rank calculation.
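To illustrate the difference, the sketch below shows roughly what this gang/vector mapping corresponds to in CUDA terms: one thread block per node, with the threads of the block striding over that node's incoming edges and combining their partial sums in a block-level reduction. This is not part of our implementation; the kernel name, the fixed block size of 128 (matching the vector length the compiler chose) and the shared-memory reduction are assumptions made purely for illustration.

    /* Sketch: one block (OpenACC gang) per node, threads (OpenACC vector lanes)
     * cooperate on that node's incoming edges. Assumes blockDim.x == 128. */
    __global__ void pagerank_block_per_node(graph_cuda_t* graph,
                                            pagerank_t* prev_rank, pagerank_t* next_rank) {
        __shared__ pagerank_t partial[128];
        uint32_t node = blockIdx.x;
        pagerank_t rank = 0;

        /* Each thread handles a strided subset of the incoming edges of this node. */
        for (uint32_t j = threadIdx.x; j < graph->nodes.device[node].in_count; j += blockDim.x) {
            uint32_t in_index = graph->nodes.device[node].in.device[j];
            rank += (prev_rank[in_index] / graph->nodes.device[in_index].out_count);
        }
        partial[threadIdx.x] = rank;
        __syncthreads();

        /* Tree reduction of the partial sums within the block. */
        for (uint32_t stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride) {
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            }
            __syncthreads();
        }

        if (threadIdx.x == 0) {
            next_rank[node] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * partial[0]);
        }
    }

Such a kernel would be launched with one 128-thread block per node. The extra mapping and reduction bookkeeping it contains is exactly the code that the OpenACC compiler generates for us from the loop nest in figure 5.11.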

// Do PageRank itself
#pragma acc parallel loop
for(uint32_t i = 0; i < graph->node_count; i++){
    pagerank_t rank = 0;

    #pragma acc loop reduction(+:rank)
    for(uint32_t j = 0; j < graph->nodes[i].in_count; j++){
        uint32_t in_index = graph->nodes[i].in[j] - 1;
        rank += (ranks[in_index] / graph->nodes[in_index].out_count);
    }
    ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);
    // Determine the threshold
    thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
}

Figure 5.11: Main PageRank OpenACC loop.

Our next task is exploiting this new algorithmic behaviour to further the lead that our OpenACC implementation has. We do this by making the inner loop's vector behaviour explicit, and by tuning the vector length. In order to make the vectorised behaviour explicit, we change

#pragma acc parallel loop

above our outer loop to

#pragma acc parallel loop gang

signalling that this outer loop exploits gang-level parallelism. This means that each iteration of the loop is handled by a different gang (a CUDA block). For the inner loop, we change

#pragma acc loop reduction(+:rank)

to

#pragma acc loop vector reduction(+:rank)

to make our new vectorised behaviour explicit. Returning to the compiler output, we observe the following line:

71, #pragma acc loop vector(128) /* threadIdx.x */
    Generating reduction(+:rank)

Our compiler is notifying us that the inner loop is vectorised with a vector length of 128. This means that the inner loop processes exactly 128 elements (incoming edges) at a time. When there are more than 128 elements, the loop is scheduled multiple times until all elements have been processed. When there are fewer than 128 elements, the threads that have no work are still started and scheduled, but simply do nothing. This introduces a tradeoff: a larger vector length can increase total parallelism as more elements are processed at a time, while a smaller vector length results in fewer threads doing nothing. For our implementation, each element is one incoming edge of a node. As most nodes do not have more than 32 edges (and GPUs perform best when the vector length is a multiple of 32), we choose 32 as our vector length. In theory, this should result in fewer idle threads (provided we do not regularly have vertices with more than 32 edges). To achieve this, we change our outer-loop directive from

#pragma acc parallel loop gang

to

#pragma acc parallel loop gang vector_length(32)

Figure 5.12: Optimised PageRank performance comparison for Graph500 graphs.

This sets all the vectors in the gang to size 32. In addition to the previous optimisations, we have also turned all OpenACC regions into async regions. This enables us to queue the next region before the previous one has been completely processed. When we need to use the data of one of the regions on our host, we can simply call

#pragma acc wait

to wait until all the regions in the queue have been processed.
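Putting the gang, vector_length, vector and async changes together, the main loop nest looks roughly as follows. This is a condensed sketch: apart from the added clauses it mirrors figure 5.11, and the handling of the asynchronous queues is simplified.

    do {
        /* One gang of 32 vector lanes per node; queued asynchronously. */
        #pragma acc parallel loop gang vector_length(32) async
        for (uint32_t i = 0; i < graph->node_count; i++) {
            pagerank_t rank = 0;

            #pragma acc loop vector reduction(+:rank)
            for (uint32_t j = 0; j < graph->nodes[i].in_count; j++) {
                uint32_t in_index = graph->nodes[i].in[j] - 1;
                rank += (ranks[in_index] / graph->nodes[in_index].out_count);
            }
            ranks_next[i] = ((1.0f - PAGERANK_D) / graph->node_count) + (PAGERANK_D * rank);
            thresholds[i] = (fabsf(ranks[i] - ranks_next[i]) / ranks[i]) * 100;
        }

        /* The max-change reduction goes on the same default async queue, so it
         * still runs after the PageRank loop above. */
        max_change = 0.0f;
        #pragma acc parallel loop reduction(max:max_change) async
        for (uint32_t i = 0; i < graph->node_count; i++) {
            max_change = fmaxf(max_change, thresholds[i]);
        }

        /* Wait for the queued regions to finish before the host reads max_change. */
        #pragma acc wait
    } while (max_change > PAGERANK_THRESHOLD);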

5.4.2 Optimised results

After running our benchmarks again, we observe that not much has changed. In figure 5.12 we can see that, on average, the optimised version is slightly faster. This is not the case for all graphs in this figure, however, as specifically the larger graphs (22, 23 and 24) have lost performance relative to their unoptimised counterparts. We think this behaviour is due to our decision to set the vector size to 32 edges. For larger graphs, where nodes have a larger number of edges on average, this could result in lower performance. In figure 5.13 we can observe the same kind of results as in figure 5.12. For some graphs, our new version has gained performance, while for other graphs it has lost performance. Again, we hypothesize that this behaviour is due to our decision to set the vector length to 32 instead of a higher number. The SNAP graphs in figure 5.14 show better results. Although no graph shows major gains over our previous, unoptimised version, none of the graphs perform worse for OpenACC than they did in the previous version. These results tell us that it might be possible to optimise OpenACC even further by choosing the vector length based on the average number of incoming edges for each node. The raw performance numbers for the optimised algorithm can be found in appendix D, section 8.4.

Figure 5.13: Optimised PageRank performance comparison for KONECT graphs.

Figure 5.14: Optimised PageRank performance comparison for SNAP graphs.

CHAPTER 6 Multithreading OpenACC

One of the main advantages of OpenACC is its cross-platform portability. In fact, one of the main goals of OpenACC is that one should be able to write a single code for all platforms, and have the compiler do the porting. In our previous tests, we have compiled OpenACC for the NVIDIA Tesla architecture in order to compare GPU performance to a lower-level language (CUDA in our case). In these tests, we have shown that OpenACC compilers currently have the ability to generate code that performs on par with or better than CUDA. Thus, in this chapter, we will examine whether the same code behaves well on the CPU. Specifically, we do this by comparing against an equivalent OpenMP implementation. For this comparison, we use our PageRank implementation, as it contains a combination of both memory-intensive sections and computationally intensive sections. This is in contrast to BFS, where there is almost no computation, and the entire algorithm consists mainly of memory operations. We make sure the OpenMP and OpenACC implementations are equivalent by comparing their final PageRank scores and the total number of iterations run.
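The equivalence check itself is simple; a minimal sketch of the idea (not our actual benchmark harness) is shown below. The function name is hypothetical, pagerank_t is assumed to be a single-precision float as elsewhere in our code, fabsf comes from <math.h>, and the tolerance of 1e-6 is an arbitrary choice for illustration.

    /* Returns 1 when both implementations ran the same number of iterations and
     * their final PageRank scores agree within a small tolerance, 0 otherwise. */
    int results_equivalent(const pagerank_t* acc_ranks, const pagerank_t* omp_ranks,
                           uint32_t node_count,
                           uint32_t acc_iterations, uint32_t omp_iterations) {
        if (acc_iterations != omp_iterations) {
            return 0;
        }
        for (uint32_t i = 0; i < node_count; i++) {
            if (fabsf(acc_ranks[i] - omp_ranks[i]) > 1e-6f) {
                return 0;
            }
        }
        return 1;
    }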

6.1 OpenMP vs. OpenACC

Like OpenACC, OpenMP is a compiler-directive-based API. This means that one should write sequential code first. Once this code is fully functional, the programmer adds the OpenMP compiler directives to convert it to multithreaded code. The main difference with OpenACC is that OpenMP was created for CPU multithreading, while OpenACC was designed to be more general. As OpenACC and OpenMP are so similar, porting the code is not that interesting. In order to create the OpenMP benchmark, we simply removed all OpenACC directives and replaced them with the equivalent OpenMP ones. This meant replacing

#pragma acc parallel loop

with

#pragma omp parallel for

to start with. We also removed all OpenACC data directives, as there is no need to move data to any accelerator device. Finally, we made some OpenMP-specific optimisations, like annotating any for-loops that perform reductions with their equivalent reduction(operator:variable) clauses, and figuring out the optimal scheduling algorithm by benchmarking. Specifically, most nodes in our graphs have a different number of incoming edges, which means that they have a different amount of work to do. Static scheduling gives each thread an equal number of iterations (i.e., nodes in our case) to perform [11]. This means that some threads can have a lot more calculations to perform than others, increasing the total computation time. Dynamic scheduling assigns loop iterations to each thread dynamically. This has the advantage of distributing the workload much more evenly

across threads, which should (in theory) reduce the total processing time [11]. This is not always the case, however, as algorithms with very high data locality might have that locality broken when the memory accesses become more random due to the dynamic scheduling across threads. Additionally, the extra overhead of a dynamic scheduling algorithm might increase the runtime by a relatively large amount for small graphs. In these cases, the static scheduling algorithm might still have the upper hand. For our OpenMP implementation, we implemented a way to dynamically choose the fastest scheduling algorithm on a per-graph basis by doing a couple of test runs of both algorithms, and then picking the one with the lowest average runtime. This should result in the fastest scheduling algorithm being chosen for each graph.

Figure 6.1: Unoptimised PageRank CPU performance comparison for Graph500 graphs.
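As an illustration, the two variants that our per-graph selection switches between differ only in the schedule clause on the parallel loop. This is a sketch, not the exact benchmark code; the loop body is the same PageRank update as in the OpenACC version and the chunk size is left to the OpenMP runtime.

    /* Static scheduling: the nodes are split into equal, fixed chunks per thread. */
    #pragma omp parallel for schedule(static)
    for (uint32_t i = 0; i < graph->node_count; i++) {
        /* ... compute ranks_next[i] and thresholds[i], as in the OpenACC version ... */
    }

    /* Dynamic scheduling: threads grab the next chunk of nodes as they finish, which
     * balances the uneven in-degrees at the cost of some scheduling overhead. */
    #pragma omp parallel for schedule(dynamic)
    for (uint32_t i = 0; i < graph->node_count; i++) {
        /* ... same loop body ... */
    }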

6.2 Results

In this section we present the performance comparison of OpenACC vs. OpenMP. We include both a comparison with our initial unoptimised version of OpenACC and one with the optimised version, to see whether our GPU-targeted optimisations have any effect on the CPU performance.

6.2.1 Unoptimised OpenACC

Examining the relative performance of both APIs, we see that, generally, OpenMP is faster than OpenACC. For the graphs where OpenACC is faster than OpenMP, the difference is marginal. Figure 6.1, for example, shows that up to and including graph 16, OpenMP is faster by a large amount. For graphs 17 through 21, the difference between the two APIs is very small. In graphs 22 through 24, OpenMP takes a large lead again. Figure 6.2 again shows that OpenMP is usually faster than OpenACC. Although for most graphs the runtime of OpenACC is within 15% of the runtime of OpenMP, the difference is sometimes as large as 30%.

Figure 6.2: Unoptimised PageRank CPU performance comparison for KONECT graphs.

Figure 6.3: Unoptimised PageRank CPU performance comparison for SNAP graphs.

Figure 6.4: Optimised PageRank CPU performance comparison for Graph500 graphs.

Benchmarking the SNAP graphs in figure 6.3 shows a slightly different story. Here, the difference between OpenMP and OpenACC is minimal, with OpenACC actually being faster by 15% in the email-EuAll graph. This is offset by the roadNet-CA and roadNet-PA graphs, however, where OpenMP performs around 22.5% and 15% better respectively. Note that, contrary to the graphs comparing OpenACC and CUDA in chapters 4 and 5, these graphs contain no error bars. This is due to the fact that for smaller graphs, the standard deviation of OpenACC was so large that the graphs became impossible to display, with the standard deviation being larger than the average runtime in some cases. In appendix E (section 8.5), we include the raw results for this standard deviation.

6.2.2 Optimised OpenACC

Examining the results for our optimised OpenACC implementation against OpenMP, we can see that the optimisations have not influenced the CPU performance in a positive way. Only for the SNAP graphs (figure 6.6) can we observe a positive effect, on average. For the other graphs, in figures 6.4 and 6.5, the optimisations have only made the results more inconsistent. There seems to be no pattern in the performance differences between OpenMP and OpenACC, and the optimisations seem to have made some (random) graphs faster, and some slower.

6.3 Examining Results

Based on these results, we conclude that our GPU-centered optimisations can have unexpected, and even adverse, effects on multithreaded OpenACC code. Furthermore, we can conclude that even though it is possible to use OpenACC to write multithreaded CPU code, it is not to be recommended, as its performance relative to other multithreading APIs (OpenMP in this case) can vary wildly between inputs. Finally, OpenACC can be very inconsistent even with the same input. As can be seen in appendix E (section 8.5), the standard deviation between results can be larger than the average

Figure 6.5: Optimised PageRank CPU performance comparison for KONECT graphs.

Figure 6.6: Optimised PageRank CPU performance comparison for SNAP graphs.

runtime of a single iteration, meaning that some runs can take four times longer than other runs. Optimising OpenACC for CPUs is beyond the scope of this work. However, we believe it is important for future studies to determine how much of the performance variability is due to the compilers, the workloads, and an actual GPU-specific approach. Understanding these problems could eventually lead to a systematic process of writing high-performance code for CPUs. Ultimately, this could enable OpenACC to be cross-platform performance portable. So far, our results show this is not yet the case.

CHAPTER 7 Conclusion and Future Work

7.1 Discussion

In this thesis we have compared both the performance and the ease-of-use of OpenACC relative to CUDA. Although we have achieved great results for OpenACC, it is important to note the limitations and boundaries of this research.

Firstly, we have only compared two algorithms. Although these algorithms form a nice general basis for other algorithms (BFS being more memory intensive than computationally expensive, and PageRank being a mix of both), they certainly do not represent all algorithms, and as such OpenACC might not be as fast everywhere as it is in this thesis. Furthermore, while our CUDA implementations were optimised to the best of our abilities, there is no guarantee that a far more experienced CUDA programmer could not optimise the CUDA implementations in such a way that they become faster than our current OpenACC ports. These two algorithms are also examples of very parallel algorithms, which can be nicely represented using a series of for-loops. This sort of structured code is ideal for OpenACC and makes it very easy for the compiler to optimally parallelise the code. For algorithms that do not enjoy such a clear structure, OpenACC might not be as fast or as easy to program as it is here.

Finally, our CPU OpenMP vs. OpenACC comparison was not heavily optimised. Although OpenMP is not as easily optimised as OpenACC and CUDA, more attention could have been spent on optimising our implementation; we did not go any further than switching between scheduling algorithms on a per-graph basis. Although we are quite confident that the OpenMP implementation is mostly optimised, we should consider that it could potentially be slightly faster than it is now.

7.2 Conclusion

Graph processing is important, but difficult. We do not know which platforms are best for which graphs and algorithms, and portable programming models therefore allow us to experiment with different platforms. If these portable programming models lose too much performance compared to their native counterparts, however, they become unviable for production use. For this reason, we investigate the challenges of porting high-performance CUDA code to well-behaved and simpler OpenACC code. Our goal is to provide a systematic and consistent porting process and to evaluate the performance differences between CUDA and OpenACC. Our research has been driven by four research questions. In the following sections, we discuss our answer to each question, thus highlighting our main findings.

1. Can we create a systematic way of porting CUDA to OpenACC, and how easy is this porting process?

2. What is the difficulty of implementing our algorithms in OpenACC compared to the difficulty of implementing our algorithms in CUDA?

CUDA:    cudaMalloc(a_dev, 10 * sizeof(...));
         cudaMemcpy(a_dev, a_host, 10 * sizeof(...), hostToDevice);
OpenACC: #pragma acc enter data copyin(a_host[0:10])

CUDA:    cudaMemcpy(a_host, a_dev, 10 * sizeof(...), deviceToHost);
OpenACC: #pragma acc update host(a_host[0:10])

CUDA:    cudaFree(a_dev);
OpenACC: #pragma acc exit data delete(a_host[0:10])

CUDA:    cudaDeviceSynchronize();
OpenACC: #pragma acc wait

CUDA:    CUDA streams
OpenACC: add "async(N)" to the end of any directive, with N being the stream/queue number

CUDA:    cudaStreamSynchronize(stream); with N being the stream/queue number
OpenACC: #pragma acc wait(N)

CUDA:    cuda_kernel<<<...>>>(...);
OpenACC: #pragma acc parallel loop \
             gang vector vector_length(blk_size)
         for(unsigned int i = 0; i ...

Table 7.1: Standard CUDA constructs with their OpenACC counterparts

3. What are the performance benefits or drawbacks to using the high-level portable programming model OpenACC instead of the proprietary CUDA API?

4. Is OpenACC portable and performance portable across different types of platforms (i.e., CPUs and GPUs)?

7.2.1 Systematic way of porting

Can we create a systematic way of porting CUDA to OpenACC, and how easy is this porting process? Examining our code from chapters 4 and 5, we can see that there are a number of recurring constructs that can be directly ported from CUDA to OpenACC. In table 7.1, we show these CUDA constructs together with their OpenACC counterparts. Note that in table 7.1, we have mapped every memory transfer to an OpenACC unstructured memory directive. We opted for this approach as we have experienced that the OpenACC structured data regions are very unreliable, and either simply refuse to work, or place the memory transfers in such a way that performance drastically decreases (see chapter 4). Instead, we recommend performing all array and struct memory transfers manually, and placing a block ("{" ... "}") around the code that requires the objects. The programmer should then annotate this block with #pragma acc data present(...) constructs to tell the compiler that the arrays and structs are already present. In our experience, it is possible to leave the memory transfers of scalar variables, like a single integer or boolean, to the compiler.

Additionally, rewriting the CUDA kernels to for-loops allows the programmer to identify constructs that were not as visible before. For example, a for-loop within a CUDA kernel will result in a nested for-loop in OpenACC. The inner for-loop can then usually be annotated with #pragma acc loop vector, which can open up new ways for the compiler to parallelise the construct.

Finally, any custom code to do variable reductions (like adding up all elements of an array into a single variable) can be completely removed. The OpenACC built-in reduction directives perform just as well as advanced custom CUDA ones, but require a lot less code and are much easier to understand. The built-in OpenACC reduction is used as follows:

int result = 0;
int array[array_length];

#pragma acc parallel loop reduction(+:result)
for(int i = 0; i < array_length; i++){
    result += array[i];
}

This example is for the plus operator, but the reduction directive supports all basic operators plus the max/min functions.
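Combining the entries of table 7.1 with the present construct recommended above, a ported region typically ends up with the following shape. This is a schematic sketch using a hypothetical array a_host of length n, not code taken from our implementations; async clauses can be added to the data directives as listed in table 7.1.

    /* Manual, unstructured transfer of the array to the device. */
    #pragma acc enter data copyin(a_host[0:n])

    /* Block in which the array is guaranteed to already be resident on the device. */
    #pragma acc data present(a_host[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++) {
            a_host[i] = 2 * a_host[i]; /* placeholder device computation */
        }

        /* Copy the results back only at the point where the host needs them. */
        #pragma acc update host(a_host[0:n])
    }

    /* Remove the array from the device without copying it back again. */
    #pragma acc exit data delete(a_host[0:n])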

7.2.2 Difficulty of OpenACC

What is the difficulty of implementing our algorithms in OpenACC compared to the difficulty of implementing our algorithms in CUDA? Research question 2 requires us to answer a fairly subjective question. Combining our experiences in porting Breadth First Search and PageRank, we have gained a basic understanding of the OpenACC programming model and ecosystem. In our experience, OpenACC code is both faster and easier to write than CUDA code. While CUDA requires the programmer to think about the GPU architecture continuously, OpenACC allows the programmer to write code as they normally would and add OpenACC directives afterwards. OpenACC also allows the programmer to describe code in a more abstract way. The programmer writes their code as structured as possible, and then annotates it by declaring loops to be constructs like vector operations or reductions. The compiler then takes care of converting those structures to the target architecture.

Although the process of writing OpenACC is extremely easy, it is held back by the development environment around it. As OpenACC is not used as broadly as CUDA, there is a significant lack of both basic, easy-to-understand documentation and in-depth documentation. Multiple times we have had to solve programming problems and compiler errors by experimenting with the syntax and API calls instead of being able to find proper documentation about the given errors. Additionally, OpenACC is not yet fully supported by the Clang and GCC families of compilers, and GCC 8.1 refuses to parse our OpenACC syntax. This means we have had to use the NVIDIA PGI C compiler to compile our OpenACC code. This can be a severely limiting factor for the adoption of OpenACC, as installing yet another compiler can be cumbersome and a turn-off for many programmers.

Finally, the bugs that can occur when using OpenACC are all fairly unclear. When the programmer does not use tools like profilers and verbose compiler output, it can be very difficult to track down the source of performance-hindering bugs, like the wrongly placed memory transfer that drastically reduced the performance of our OpenACC program in chapter 4.

7.2.3 Performance benefits or drawbacks

What are the performance benefits or drawbacks to using the high-level portable programming model OpenACC instead of the proprietary CUDA API? To answer this research question, we examine the performance results of the previous chapters. In chapter 4 we saw that the naive BFS OpenACC implementation was dramatically slower than CUDA. With some basic optimisations, this problem was completely solved and OpenACC proceeded to be faster than CUDA for the synthetic graphs and as fast as CUDA for the non-synthetic graphs. The two implementations are not equal, however, suggesting that the OpenACC method of parallelising differs from the one we used for CUDA.

Chapter 5 confirms our suspicion that OpenACC enables the compiler to find different (and possibly smarter) ways of parallelising code. In this chapter, our OpenACC port performed better out of the box, with optimisations pulling it even further ahead of our CUDA implementation. We think that this chapter shows the potential strength of a high-level language like OpenACC, as implementing this same optimised algorithm in CUDA would have been massively more complicated than it is for OpenACC.

In short, our analysis indicates that the performance of OpenACC is good enough to serve as a replacement for CUDA for our graph processing workloads. In the worst cases we encountered, the performance was not so bad as to make the OpenACC version unusable. In the best cases, OpenACC was faster than even our optimised CUDA build.

7.2.4 OpenMP vs. OpenACC

Is OpenACC portable and performance portable across different types of platforms (i.e., CPUs and GPUs)? To this end, we examined the performance difference between OpenMP and OpenACC when targeting CPUs. In order to be "portable", our well-performing OpenACC GPU code should be usable, preferably without losing a lot of performance, on the CPU. This would enable graph processing algorithm developers to experiment before choosing the best platform for their current workload, which would be a massive advantage in the diverse landscape of graph processing.

In our experience, the OpenACC API lacks the advanced CPU multithreading support that OpenMP offers. For example, OpenACC offers no way to choose the scheduling algorithm used for scheduling loop iterations on threads, while OpenMP offers the schedule(...) clause on its worksharing directives to assign scheduling algorithms to individual parallel regions, allowing for better load balancing and fewer idle threads. This feature alone resulted in performance improvements of up to 30% in our OpenMP PageRank implementation.

As concluded from chapter 6, OpenACC compilers are still too inconsistent and immature when targeting multithreaded CPU configurations to be used properly, as OpenACC is slower than OpenMP for some graphs without any obvious reason. Furthermore, the tools we have used for debugging and profiling OpenACC for GPUs simply do not work here. The -Minfo compiler flag, for example, produces considerably less output for the CPU multithreaded regions than it does for the GPU-accelerated regions.

While this means that, in essence, OpenACC is portable to multiple platforms, the programmer is still better off choosing OpenMP for parallel CPU graph processing, as the OpenACC compilers struggle with performance portability.

7.3 Future Work

While this thesis has examined the basics of CUDA-to-OpenACC conversion for two graph processing algorithms, there are still many related subjects that can be researched. We give a list of our suggestions:

• An in-depth comparison between OpenMP and OpenACC multithreaded CPU performance. Although we have touched on the subject in this thesis, we have not gone into specific OpenACC CPU optimisation strategies. Thus, we think that research focusing on this subject might bring interesting results.

• Continuing from the previous item, we would like to see standard mappings from sequential CPU code to OpenACC-accelerated multithreaded CPU code, similar to the ones we have produced in section 7.2.1. The lack of information during the compilation process, as well as the lack of unified analysis and debugging tools, have made this task impossible for us in this short time.

• As OpenMP has recently added support for targeting accelerator devices (see the OpenMP 4.5 standard [18]), we would like to see a GPU performance comparison between the two APIs. We think this could be interesting, as OpenMP is the more mature API in general but is fairly new to GPU acceleration. OpenACC is a much younger API, but was made for GPU acceleration from the beginning.

• Expanding on our previous point, we suggest a more detailed look into the differences between the OpenMP and OpenACC standards, and perhaps reasons as to why a merger

between the two APIs is or is not possible. If the two standards could merge, the single resulting standard could gain the support of the audiences and users of both APIs.

• In this thesis we have spent most of our time comparing CUDA and OpenACC. We have not made a comparison between OpenCL and OpenACC, although this could be interesting. Of particular interest is the performance difference between OpenCL and OpenACC when targeting AMD GPUs (once OpenACC compilers gain support for AMD graphics cards).

Bibliography

[1] Vrije Universiteit Amsterdam. DAS-5 Overview. url: https://www.cs.vu.nl/das5/home.shtml (visited on 05/01/2019).

[2] Nathan Bell and Jared Hoberock. “Thrust: A productivity-oriented library for CUDA”. In: GPU computing gems Jade edition. Elsevier, 2012, pp. 359–371.

[3] Sergey Brin and Lawrence Page. “The anatomy of a large-scale hypertextual web search engine”. In: Computer networks and ISDN systems 30.1-7 (1998), pp. 107–117.

[4] Michael Brinkmeier. “PageRank revisited”. In: ACM Transactions on Internet Technology (TOIT) 6.3 (2006), pp. 282–301.

[5] S. Christgau et al. “A comparison of CUDA and OpenACC: Accelerating the Tsunami Simulation EasyWave”. In: ARCS 2014; 2014 Workshop Proceedings on Architecture of Computing Systems. VDE. Feb. 2014, pp. 1–5.

[6] NVIDIA Corporation. Branch Statistics. 2015. url: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/branchstatistics.htm (visited on 04/11/2019).

[7] Joseph E Gonzalez et al. “Graphx: Graph processing in a distributed dataflow framework”. In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 2014, pp. 599–613.

[8] Mark Harris et al. “Optimizing parallel reduction in CUDA”. In: (2007). url: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf (visited on 04/24/2019).

[9] J. A. Herdman et al. “Accelerating Hydrocodes with OpenACC, OpenCL and CUDA”. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. Nov. 2012, pp. 465–471. doi: 10.1109/SC.Companion.2012.66.

[10] Tetsuya Hoshino et al. “CUDA vs OpenACC: Performance case studies with kernel benchmarks and a memory-bound CFD application”. In: 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. IEEE. 2013, pp. 136–143.

[11] Intel. OpenMP Loop Scheduling. Aug. 2014. url: https://software.intel.com/en-us/articles/openmp-loop-scheduling (visited on 05/29/2019).

[12] Jérôme Kunegis. “Konect: the koblenz network collection”. In: Proceedings of the 22nd International Conference on World Wide Web. ACM. 2013, pp. 1343–1350.

[13] Cleverson Lopes Ledur, CM Zeve, and JC dos Anjos. “Comparative analysis of OpenACC, OpenMP and CUDA using sequential and parallel algorithms”. In: 11th Workshop on parallel and distributed processing (WSPPD). 2013.

[14] Jure Leskovec and Andrej Krevl. “SNAP Datasets: Stanford Large Network Dataset Collection”. In: (2015).

[15] Suejb Memeti et al. “Benchmarking OpenCL, OpenACC, OpenMP, and CUDA: Programming Productivity, Performance, and Energy Consumption”. In: Proceedings of the 2017 Workshop on Adaptive Resource Management and Scheduling for Cloud Computing. ARMS-CC ’17. Washington, DC, USA: ACM, 2017, pp. 1–6. isbn: 978-1-4503-5116-4. doi: 10.1145/3110355.3110356. url: http://doi.acm.org/10.1145/3110355.3110356.

[16] Matthias S Müller. “An OpenMP compiler benchmark”. In: Scientific Programming 11.2 (2003), pp. 125–131.

[17] Richard C Murphy et al. “Introducing the graph 500”. In: Cray Users Group (CUG) 19 (2010), pp. 45–74.

[18] OpenMP Architecture Review Board. OpenMP Application Programming Interface. Nov. 2015. url: https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf.

[19] Lawrence Page et al. The PageRank citation ranking: Bringing order to the web. Tech. rep. Stanford InfoLab, 1999.

[20] PGI Compiler User’s Guide. url: https://www.pgroup.com/resources/docs/18.4/x86/pgi-user-guide/index.htm (visited on 05/20/2019).

[21] Xuanhua Shi et al. “Graph Processing on GPUs: A Survey”. In: ACM Comput. Surv. 50.6 (Jan. 2018), 81:1–81:35. issn: 0360-0300. doi: 10.1145/3128571. url: http://doi.acm.org/10.1145/3128571.

[22] Julian Shun and Guy E Blelloch. “Ligra: a lightweight graph processing framework for shared memory”. In: ACM Sigplan Notices. Vol. 48. 8. ACM. 2013, pp. 135–146.

[23] Merijn Verstraaten, Ana Lucia Varbanescu, and Cees de Laat. “Using Graph Properties to Speed-up GPU-based Graph Traversal: A Model-driven Approach”. In: CoRR abs/1708.01159 (2017). arXiv: 1708.01159. url: http://arxiv.org/abs/1708.01159.

[24] Sandra Wienke et al. “OpenACC — First Experiences with Real-World Applications”. In: Euro-Par 2012 Parallel Processing. Ed. by Christos Kaklamanis, Theodore Papatheodorou, and Paul G. Spirakis. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 859–870. isbn: 978-3-642-32820-6.

[25] Yanfeng Zhang et al. “Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation”. In: IEEE Transactions on Parallel and Distributed Systems 25.8 (2014), pp. 2091–2100.

[26] Jianlong Zhong and Bingsheng He. “Medusa: Simplified graph processing on GPUs”. In: IEEE Transactions on Parallel and Distributed Systems 25.6 (2014), pp. 1543–1552.

CHAPTER 8 Appendices

8.1 Appendix A: CUDA max-reduce

__global__ void pagerank_max_reduce(pagerank_t* prev_rank, pagerank_t* next_rank,
                                    float* block_max, uint32_t maxcount) {
    int i = threadIdx.x + (blockDim.x * 2) * blockIdx.x;
    extern __shared__ float maxdata[];
    maxdata[threadIdx.x] = 0.0f;

    if(i < maxcount) {
        // Latency-hiding by using a single thread to already calculate a max on init
        float change_percent1 = (fabsf(prev_rank[i] - next_rank[i]) / prev_rank[i]) * 100.0f;
        float change_percent2 = (fabsf(prev_rank[i + blockDim.x] - next_rank[i + blockDim.x])
                                 / prev_rank[i + blockDim.x]) * 100.0f;
        maxdata[threadIdx.x] = fmaxf(change_percent1, change_percent2);
        __syncthreads();

        if(CUDA_BLOCKSIZE >= 512){
            if(threadIdx.x < 256){
                maxdata[threadIdx.x] = fmaxf(maxdata[threadIdx.x], maxdata[threadIdx.x + 256]);
            }
            __syncthreads();
        }

        if(CUDA_BLOCKSIZE >= 256){
            if(threadIdx.x < 128){
                maxdata[threadIdx.x] = fmaxf(maxdata[threadIdx.x], maxdata[threadIdx.x + 128]);
            }
            __syncthreads();
        }

        if(CUDA_BLOCKSIZE >= 128){
            if(threadIdx.x < 64){
                maxdata[threadIdx.x] = fmaxf(maxdata[threadIdx.x], maxdata[threadIdx.x + 64]);
            }
            __syncthreads();
        }

        if(threadIdx.x < 32){
            pagerank_warp_max_reduce(maxdata, threadIdx.x);
        }

        if(threadIdx.x == 0){
            block_max[blockIdx.x] = maxdata[0];
        }

        __syncthreads();
    }
}

__device__ void pagerank_warp_max_reduce(volatile float* maxdata, int threadid) {
    if(CUDA_BLOCKSIZE >= 64) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 32]);
    if(CUDA_BLOCKSIZE >= 32) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 16]);
    if(CUDA_BLOCKSIZE >= 16) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 8]);
    if(CUDA_BLOCKSIZE >= 8) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 4]);
    if(CUDA_BLOCKSIZE >= 4) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 2]);
    if(CUDA_BLOCKSIZE >= 2) maxdata[threadid] = fmaxf(maxdata[threadid], maxdata[threadid + 1]);
}

pagerank_t* pagerank(graph_cuda_t* graph) {
    ...
    unsigned int block_count = (graph->node_count / CUDA_BLOCKSIZE) + 1;
    pagerank_do<<<block_count, CUDA_BLOCKSIZE>>>(graph_device, ranks_device, ranks_device_next);
    pagerank_max_reduce<<<...>>>
        (ranks_device, ranks_device_next, max_threshold_device, graph->node_count);
    ...
}

8.2 Appendix B: Graph sizes

• graph500-10: 80.3 KiB
• graph500-11: 198.0 KiB
• graph500-12: 446.5 KiB
• graph500-13: 969.5 KiB
• graph500-14: 2.2 MiB
• graph500-15: 4.8 MiB
• graph500-16: 10.2 MiB
• graph500-17: 21.9 MiB
• graph500-18: 47.9 MiB
• graph500-19: 100.4 MiB
• graph500-20: 208.1 MiB
• graph500-21: 452.5 MiB
• graph500-22: 945.5 MiB
• graph500-23: 1.9 GiB
• graph500-24: 4.1 GiB
• KONECT-actor-collaboration: 431.1 MiB
• KONECT-ca-cit-HepPh: 99.5 MiB
• KONECT-cfinder-google: 1.8 MiB
• KONECT-dbpedia-all: 195.8 MiB
• KONECT-dbpedia-starring: 3.4 MiB
• KONECT-discogs_affiliation: 179.1 MiB
• KONECT-opsahl-ucsocial: 1.3 MiB
• KONECT-orkut-links: 1.7 GiB
• KONECT-prosper-loans: 79.0 MiB
• KONECT-web-NotreDame: 20.6 MiB
• KONECT-wiki_talk_en: 584.7 MiB
• KONECT-wiki_talk_fr: 102.7 MiB
• KONECT-zhishi-hudong-internallink: 187.5 MiB
• SNAP-as-skitter: 142.2 MiB
• SNAP-email-EuAll: 4.8 MiB
• SNAP-roadNet-CA: 83.8 MiB
• SNAP-roadNet-PA: 44.0 MiB
• SNAP-roadNet-TX: 56.5 MiB
• SNAP-web-BerkStan: 105.1 MiB
• SNAP-web-Google: 71.9 MiB
• SNAP-wiki-Talk: 58.7 MiB

8.3 Appendix C: Raw unoptimised GPU performance numbers

All data provided is measured in seconds

8.3.1 graph500 PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.01975 0.00869 0.01106 0.01222 0.00322 0.00901 10  0.00156  0.00044  0.00117  0.00121  0.00019  0.00102 0.03543 0.01431 0.02112 0.0212 0.0041 0.0171 11  0.00014  1e-05  0.00014  0.00014  3e-05  0.00013 0.06548 0.021 0.04448 0.04048 0.00532 0.03516 12  0.00303  5e-05  0.00298  0.00243  0.0001  0.00233 0.11839 0.03314 0.08525 0.07669 0.00756 0.06913 13  0.00243  4e-05  0.00239  0.00248  6e-05  0.00243 0.23959 0.05603 0.18356 0.15301 0.01342 0.1396 14  0.00922  5e-05  0.00924  0.00097  2e-05  0.00097 0.46944 0.08727 0.38217 0.3093 0.02269 0.28661 15  0.00238  0.00047  0.00236  0.00172  0.00011  0.0017 0.94962 0.16116 0.78846 0.6452 0.04847 0.59673 16  0.00506  0.00013  0.00505  0.00378  0.00022  0.00372 1.87512 0.25379 1.62134 1.33589 0.092 1.24389 17  0.00922  0.00023  0.00918  0.09505  0.00015  0.09491 3.77566 0.44165 3.33401 2.61933 0.18416 2.43517 18  0.11814  0.00769  0.11748  0.09879  0.00011  0.09871 9.84588 0.95554 8.89034 6.99086 0.45318 6.53768 19  0.32699  0.00733  0.32632  0.32893  0.00307  0.32844 15.40576 1.61452 13.79124 10.72767 0.86727 9.86039 20  0.40241  0.0097  0.40158  0.36234  0.00312  0.3616 31.77228 3.32728 28.44499 23.40749 1.92724 21.48025 21  0.57696  0.00736  0.57578  0.30255  0.00448  0.30285 67.2441 6.21266 61.03145 45.04948 4.39526 40.65422 22  2.34478  0.00628  2.3417  0.81555  0.00284  0.81488 133.04553 12.25653 120.78899 96.75463 9.84969 86.90494 23  1.85127  0.01426  1.84853  2.61313  1.79735  1.88331 277.59493 30.75864 246.83629 195.95101 23.70138 172.24963 24  5.16817  0.02153  5.15784  3.18432  0.01718  3.18724

Table 8.1:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.0009 0.00012 0.00078 0.00037 0.00034 3e-05 10  9e-05  1e-05  8e-05  2e-05  2e-05  0.0 0.00088 7e-05 0.00081 0.0003 0.00026 4e-05 11  9e-05  1e-05  8e-05  3e-05  2e-05  0.0 0.00101 0.00022 0.00079 0.00094 0.00091 4e-05 12  5e-05  1e-05  4e-05  4e-05  4e-05  0.0 0.00103 0.00011 0.00092 0.00074 0.0007 4e-05 13  0.00011  1e-05  0.0001  2e-05  2e-05  0.0 0.0013 0.0003 0.00101 0.004 0.00393 7e-05 14  9e-05  1e-05  8e-05  2e-05  2e-05  1e-05 0.00173 0.0004 0.00133 0.00998 0.00981 0.00017 15  0.00011  1e-05  0.0001  0.00015  0.00015  2e-05 0.0043 0.00054 0.00376 0.01185 0.01164 0.00021 16  0.00067  1e-05  0.00066  0.00019  0.00019  1e-05 0.00388 0.00058 0.0033 0.01535 0.01496 0.00039 17  0.00014  1e-05  0.00014  0.00089  0.00086  3e-05 0.01574 0.00201 0.01373 0.11672 0.11583 0.0009 18  0.00525  1e-05  0.00524  0.00175  0.00174  5e-05 0.01384 0.00068 0.01316 0.05136 0.04847 0.00289 19  0.00053  0.0  0.00053  0.00119  0.00114  9e-05 0.03488 0.00929 0.0256 0.23737 0.23451 0.00286 20  0.00087  2e-05  0.00087  0.003  0.00284  0.00017 0.06385 0.01548 0.04838 0.36419 0.3583 0.0059 21  0.00172  2e-05  0.00172  0.00506  0.0048  0.00028 0.13154 0.04495 0.08659 0.98521 0.97371 0.0115 22  0.02225  5e-05  0.02224  0.01131  0.01105  0.00063 0.23846 0.07118 0.16728 2.01916 1.96901 0.05015 23  0.00688  6e-05  0.00688  0.89407  0.45795  0.77158 0.56823 0.13929 0.42894 5.48065 5.41903 0.06162 24  0.13425  0.0001  0.13423  0.77878  0.77209  0.0085

Table 8.2:

8.3.2 KONECT PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 9.44825 1.53239 7.91585 6.86265 0.47649 6.38616 actor-collaboration  0.04706  0.00334  0.04692  0.17833  0.00325  0.17841 0.72727 0.2058 0.52147 0.40835 0.03679 0.37155 ca-cit-HepPh  0.01215  0.00014  0.01215  0.01454  0.0001  0.01449 0.70179 0.47828 0.22352 0.18821 0.01626 0.17194 cfinder-google  0.00286  7e-05  0.00286  0.01426  0.00013  0.01415 107.68727 29.36516 78.32211 67.81156 3.82389 63.98768 dbpedia-all  1.73395  0.02316  1.73171  2.32989  0.00625  2.32805 1.34662 0.03011 1.31651 1.15629 0.08073 1.07555 dbpedia-starring  0.03244  0.0001  0.03241  0.0075  3e-05  0.0075 14.20867 8.22151 5.98715 5.31233 0.6064 4.70594 discogs_affiliation  0.21254  0.01315  0.21145  0.07299  0.00021  0.07292 0.04443 0.02007 0.02436 0.02364 0.00401 0.01962 opsahl-ucsocial  0.00181  8e-05  0.00173  0.00203  0.00018  0.00186 96.80053 9.35125 87.44928 69.08414 6.24549 62.83865 orkut-links  2.83336  0.00778  2.83123  0.80682  0.00143  0.80693 2.30121 0.52201 1.7792 1.40265 0.10503 1.29762 prosper-loans  0.01748  0.00057  0.01746  0.01476  0.00037  0.01493 6.26574 0.54871 5.71703 4.72613 0.32647 4.39966 web-NotreDame  0.0486  0.00942  0.04664  0.0298  0.00016  0.0298 76.73772 9.16847 67.56924 46.43966 3.08494 43.35471 wiki_talk_en  1.48405  0.00934  1.48189  0.33946  0.00168  0.33953 39.85619 4.97707 34.87912 29.46719 1.97015 27.49704 zhishi-hudong-internallink  0.65957  0.00743  0.65977  1.81989  0.00593  1.81568

Table 8.3:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.06416 0.01582 0.04834 0.36261 0.36156 0.00104 actor-collaboration  0.00191  3e-05  0.00191  0.00496  0.00491  7e-05 0.01708 0.00158 0.0155 0.04534 0.0452 0.00015 ca-cit-HepPh  0.00477  2e-05  0.00478  0.00045  0.00045  2e-05 0.00136 0.00041 0.00095 0.00655 0.00646 9e-05 cfinder-google  2e-05  1e-05  2e-05  0.00016  0.00016  1e-05 0.17153 0.1344 0.03714 4.92013 4.88015 0.03998 dbpedia-all  0.01442  0.00036  0.01465  0.90124  0.54791  0.70943 0.00216 0.00044 0.00172 0.01106 0.01072 0.00034 dbpedia-starring  0.0003  2e-05  0.00029  0.00026  0.00025  2e-05 0.04718 0.00317 0.04401 0.03951 0.03467 0.00483 discogs_affiliation  0.01701  1e-05  0.01701  0.00141  0.00133  0.00012 0.00094 0.00016 0.00078 0.00075 0.00072 3e-05 opsahl-ucsocial  4e-05  1e-05  4e-05  3e-05  3e-05  0.0 0.30547 0.05891 0.24655 1.63119 1.61466 0.01653 orkut-links  0.01729  5e-05  0.01729  0.24079  0.03706  0.23858 0.00662 0.00094 0.00568 0.01813 0.01786 0.00027 prosper-loans  0.00019  0.0  0.00019  0.00025  0.00025  1e-05 0.01135 0.00453 0.00683 0.10768 0.10672 0.00096 web-NotreDame  0.00144  4e-05  0.00147  0.00041  0.0004  4e-05 0.05193 0.01204 0.03989 0.25596 0.24771 0.00825 wiki_talk_en  0.00175  5e-05  0.00174  0.00489  0.00449  0.00045 0.04426 0.01771 0.02655 0.41963 0.41402 0.00561 zhishi-hudong-internallink  0.00099  3e-05  0.00099  0.00407  0.00369  0.0004

Table 8.4:

8.3.3 SNAP PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 38.10411 1.04915 37.05496 31.31826 1.77088 29.54738 as-skitter  0.68105  0.00332  0.68068  0.64634  0.00966  0.64492 2.06531 0.42065 1.64466 1.48797 0.16146 1.32651 email-EuAll  0.03429  0.00098  0.03419  0.01032  0.00019  0.01031 38.76107 0.28583 38.47524 32.62578 1.64402 30.98176 roadNet-CA  0.17714  0.00076  0.17718  0.21969  0.00074  0.21984 20.48151 0.15998 20.32152 17.00389 0.90449 16.0994 roadNet-PA  0.07809  0.00039  0.07806  0.13264  0.00261  0.13119 27.00639 0.19625 26.81013 23.37717 1.15581 22.22136 roadNet-TX  0.2293  0.00087  0.22928  0.21748  0.00775  0.21299 17.04884 5.49072 11.55812 9.47098 0.64348 8.82749 web-BerkStan  0.29605  0.01293  0.29795  0.09353  0.00025  0.09349 16.40928 0.88845 15.52083 13.13606 0.88561 12.25045 web-Google  0.19319  0.00513  0.19237  0.2466  0.00346  0.24496 51.40608 5.32436 46.08172 37.46263 2.46769 34.99494 wiki-Talk  0.82453  0.00663  0.82357  0.51365  0.00191  0.51258

Table 8.5:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.03468 0.01424 0.02045 0.36508 0.36038 0.00469 as-skitter  0.00065  2e-05  0.00065  0.0021  0.0019  0.00022 0.0035 0.00048 0.00302 0.00725 0.00647 0.00077 email-EuAll  0.00053  1e-05  0.00053  9e-05  7e-05  2e-05 0.18949 0.17008 0.0194 5.15522 5.11893 0.03629 roadNet-CA  0.01084  0.001  0.0099  1.46929  1.27633  0.7582 0.1155 0.0987 0.0168 2.44789 2.44474 0.00315 roadNet-PA  0.00308  0.00017  0.00296  0.3083  0.30792  0.00052 0.17361 0.16033 0.01328 4.40895 4.40448 0.00447 roadNet-TX  0.00709  0.0007  0.00643  0.96122  0.9601  0.00124 0.20177 0.18835 0.01342 8.42668 8.40076 0.02592 web-BerkStan  0.00084  0.0001  0.00082  0.7269  0.34413  0.64353 0.02223 0.01188 0.01035 0.23274 0.23026 0.00248 web-Google  0.00066  2e-05  0.00066  0.00096  0.00087  0.00012 0.01424 0.00086 0.01338 0.02883 0.0222 0.00663 wiki-Talk  0.0004  1e-05  0.0004  0.00045  0.00033  0.00015

Table 8.6:

8.4 Appendix D: Raw optimised GPU performance numbers

All data provided is measured in seconds

8.4.1 graph500 PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.02014 0.00903 0.01111 0.01172 0.00262 0.0091 10  0.00124  0.00024  0.00119  0.0012  0.00014  0.00108 0.03546 0.01432 0.02114 0.02086 0.00362 0.01724 11  0.0001  2e-05  0.0001  0.00013  4e-05  0.00013 0.06645 0.02155 0.04489 0.0387 0.00406 0.03464 12  0.0052  0.00017  0.00506  0.00029  3e-05  0.00029 0.12197 0.03354 0.08843 0.07474 0.00568 0.06906 13  0.0067  0.00011  0.00669  0.00036  2e-05  0.00035 0.23617 0.05665 0.17953 0.15052 0.01081 0.13971 14  0.00184  2e-05  0.00183  0.00096  2e-05  0.00096 0.47199 0.08951 0.38248 0.30391 0.01626 0.28765 15  0.00223  0.0004  0.00215  0.01564  3e-05  0.01563 0.95663 0.1652 0.79143 0.63037 0.03571 0.59465 16  0.01262  0.00012  0.0126  0.00442  4e-05  0.00442 1.88536 0.26003 1.62533 1.23564 0.05723 1.17842 17  0.01458  0.00023  0.01456  0.05141  0.00011  0.0514 3.6996 0.44769 3.25191 2.43 0.11423 2.31577 18  0.02319  0.00615  0.02236  0.03157  5e-05  0.03156 9.73138 0.96354 8.76784 6.42136 0.36843 6.05293 19  0.20757  0.00685  0.20717  0.04481  0.00015  0.04481 15.50459 1.63018 13.87441 10.44802 0.75314 9.69488 20  0.42656  0.00837  0.42561  0.09739  0.0009  0.09731 31.05337 3.35556 27.69781 22.81562 1.92947 20.88615 21  0.89648  0.00704  0.895  1.15684  0.01914  1.15709 64.07219 6.243 57.82918 47.66694 4.48002 43.18692 22  1.53445  0.00461  1.53329  0.41689  0.02819  0.40601 129.10757 12.44402 116.66355 95.59972 9.82616 85.77356 23  2.55504  0.01114  2.55364  2.28015  0.04015  2.27318 272.22317 31.38951 240.83366 197.06086 22.0804 174.98045 24  3.78578  0.02143  3.77709  5.72892  0.01052  5.72974

Table 8.7:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.00088 0.00012 0.00076 0.00015 0.0001 5e-05 10  4e-05  0.0  4e-05  1e-05  1e-05  1e-05 0.00084 8e-05 0.00076 0.00013 6e-05 7e-05 11  3e-05  0.0  2e-05  0.0  0.0  0.0 0.00114 0.00023 0.0009 0.0003 0.00021 9e-05 12  8e-05  1e-05  7e-05  1e-05  1e-05  0.0 0.00098 0.00011 0.00088 0.00026 9e-05 0.00017 13  3e-05  0.0  3e-05  0.0  0.0  0.0 0.00131 0.00029 0.00101 0.00061 0.0003 0.00032 14  4e-05  1e-05  4e-05  1e-05  0.0  1e-05 0.00173 0.0004 0.00132 0.00102 0.00039 0.00063 15  7e-05  1e-05  6e-05  3e-05  1e-05  3e-05 0.0025 0.00053 0.00197 0.0018 0.00054 0.00126 16  4e-05  1e-05  4e-05  3e-05  1e-05  3e-05 0.00704 0.00058 0.00645 0.0032 0.00056 0.00264 17  0.00159  1e-05  0.00159  9e-05  1e-05  9e-05 0.00908 0.002 0.00709 0.00726 0.002 0.00526 18  0.0002  1e-05  0.0002  0.0002  1e-05  0.0002 0.01415 0.00068 0.01347 0.01006 0.00067 0.00939 19  0.00039  1e-05  0.00039  0.00041  1e-05  0.00041 0.03566 0.0093 0.02635 0.0269 0.00919 0.01772 20  0.0009  2e-05  0.0009  0.00095  3e-05  0.00094 0.11702 0.01551 0.10151 0.0504 0.0146 0.0358 21  0.03035  6e-05  0.03035  0.00179  6e-05  0.00179 0.13126 0.04499 0.08627 0.1121 0.04154 0.07056 22  0.00341  6e-05  0.00341  0.00363  4e-05  0.00363 0.24341 0.07111 0.17229 0.29266 0.0616 0.23106 23  0.0069  7e-05  0.0069  0.05047  4e-05  0.05046 0.87085 0.13943 0.73141 0.47179 0.12476 0.34703 24  0.23772  9e-05  0.2377  0.10985  5e-05  0.10986

Table 8.8:

8.4.2 KONECT PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 9.49062 1.55518 7.93544 7.14373 0.46247 6.68126 actor-collaboration  0.13833  0.00274  0.13792  0.34284  0.0013  0.34266 0.72681 0.20665 0.52016 0.43399 0.03332 0.40067 ca-cit-HepPh  0.00339  0.00011  0.00339  0.01987  9e-05  0.01984 0.70462 0.48045 0.22416 0.21036 0.03044 0.17993 cfinder-google  0.00198  0.00017  0.00194  0.00122  0.00014  0.00121 108.77594 29.50627 79.26967 63.8449 2.20562 61.63928 dbpedia-all  1.20322  0.02251  1.20205  3.2546  0.00743  3.24973 1.33927 0.03013 1.30914 1.05348 0.03775 1.01573 dbpedia-starring  0.00823  9e-05  0.00822  0.01323  0.00015  0.01311 14.26455 8.28852 5.97603 4.98919 0.69503 4.29416 discogs_affiliation  0.20147  0.01307  0.20003  0.03  0.00143  0.03015 0.0448 0.02012 0.02468 0.02287 0.00369 0.01918 opsahl-ucsocial  0.00178  6e-05  0.00171  0.00019  0.00017  0.0001 102.81717 9.35446 93.46271 71.22021 6.55862 64.66159 orkut-links  4.26757  0.00965  4.26542  1.41914  0.00811  1.41939 2.31193 0.51743 1.79451 1.35885 0.07898 1.27987 prosper-loans  0.0434  0.00051  0.04324  0.01407  0.00021  0.01411 6.17866 0.53474 5.64391 4.53826 0.15984 4.37842 web-NotreDame  0.04369  0.01085  0.0422  0.02827  6e-05  0.02826 71.45119 9.01639 62.4348 47.05271 1.50657 45.54614 wiki_talk_en  0.76716  0.00858  0.7672  1.12108  0.00285  1.119 38.96486 4.89596 34.0689 28.79853 1.40616 27.39237 zhishi-hudong-internallink  1.06056  0.00848  1.05969  0.71204  0.0023  0.71166

Table 8.9:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.0635 0.01533 0.04816 0.04662 0.0146 0.03202 actor-collaboration  0.00176  3e-05  0.00176  0.0017  3e-05  0.0017 0.00904 0.00158 0.00747 0.00707 0.00155 0.00552 ca-cit-HepPh  0.00022  1e-05  0.00022  0.00023  1e-05  0.00023 0.00175 0.00044 0.00131 0.00068 0.00041 0.00027 cfinder-google  9e-05  1e-05  8e-05  1e-05  0.0  1e-05 0.16283 0.13355 0.02928 0.17202 0.13474 0.03727 dbpedia-all  0.0014  7e-05  0.00138  0.00798  0.00025  0.00786 0.00169 0.00042 0.00127 0.00096 0.00041 0.00055 dbpedia-starring  8e-05  1e-05  7e-05  3e-05  1e-05  3e-05 0.04245 0.00317 0.03928 0.02157 0.00353 0.01804 discogs_affiliation  0.01541  1e-05  0.01541  0.00053  2e-05  0.00052 0.00094 0.00016 0.00078 0.00026 0.00016 0.0001 opsahl-ucsocial  3e-05  0.0  3e-05  0.0  0.0  0.0 0.36123 0.0589 0.30233 0.45817 0.06736 0.39081 orkut-links  0.09303  7e-05  0.09302  0.00903  0.00017  0.00903 0.00657 0.00095 0.00563 0.00538 0.00094 0.00444 prosper-loans  0.00016  2e-05  0.00015  0.00015  0.0  0.00015 0.01186 0.00451 0.00735 0.00697 0.00432 0.00265 web-NotreDame  0.00066  3e-05  0.00068  9e-05  2e-05  8e-05 0.05199 0.01206 0.03993 0.04323 0.01138 0.03185 wiki_talk_en  0.0017  5e-05  0.0017  0.00166  4e-05  0.00166 0.0442 0.01772 0.02648 0.03859 0.01772 0.02087 zhishi-hudong-internallink  0.00088  2e-05  0.00088  0.00104  3e-05  0.00104

Table 8.10:

8.4.3 SNAP PAGERANK

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 37.16668 1.04891 36.11777 29.55712 0.84687 28.71024 as-skitter  0.40745  0.003  0.40746  1.40332  0.00747  1.40097 2.11502 0.42139 1.69363 1.4835 0.09513 1.38838 email-EuAll  0.01753  0.00069  0.01753  0.01088  0.0001  0.01087 39.36705 0.29209 39.07496 32.1602 0.7392 31.421 roadNet-CA  0.93293  0.00135  0.93212  0.26426  0.00045  0.26452 20.2629 0.16381 20.09909 16.62677 0.40585 16.22092 roadNet-PA  0.56709  0.00099  0.5668  0.12058  0.00103  0.12027 26.67326 0.20055 26.47271 21.56628 0.51858 21.0477 roadNet-TX  0.57463  0.001  0.5742  0.20651  0.00578  0.20639 17.51286 5.51381 11.99905 9.33731 0.52192 8.81538 web-BerkStan  0.20733  0.0119  0.21003  0.05796  0.00038  0.05779 16.39577 0.88967 15.50609 12.78405 0.56605 12.218 web-Google  0.04806  0.00519  0.048  0.24579  0.00122  0.24533 53.32724 5.33885 47.98839 36.0772 1.2241 34.8531 wiki-Talk  0.88709  0.00619  0.88682  0.23879  0.00052  0.23896

Table 8.11:

BFS

Graph CUDA OpenACC Total Main loop Memory Total Main loop Memory 0.03455 0.01422 0.02033 0.05925 0.0147 0.04455 as-skitter  0.00068  3e-05  0.00068  0.00105  2e-05  0.00105 0.0023 0.00047 0.00183 0.00209 0.00046 0.00163 email-EuAll  5e-05  1e-05  5e-05  0.00055  1e-05  0.00055 0.18217 0.16897 0.0132 0.18809 0.17115 0.01694 roadNet-CA  0.00065  0.00017  0.0006  0.00094  0.00017  0.00091 0.10538 0.09769 0.00769 0.10775 0.09882 0.00893 roadNet-PA  0.00031  0.0001  0.00028  0.00438  0.00066  0.00381 0.18128 0.16062 0.02066 0.16908 0.1613 0.00778 roadNet-TX  0.00364  0.00015  0.00354  0.00061  8e-05  0.0006 0.20167 0.18839 0.01328 0.19455 0.18464 0.00991 web-BerkStan  0.00086  0.00015  0.00082  0.00082  8e-05  0.00082 0.0315 0.01186 0.01964 0.03325 0.0119 0.02135 web-Google  0.0056  7e-05  0.00554  0.00054  2e-05  0.00054 0.02914 0.00085 0.02828 0.01286 0.00084 0.01201 wiki-Talk  0.00796  1e-05  0.00796  0.00034  0.0  0.00034

Table 8.12:

8.5 Appendix E: Raw unoptimised CPU performance numbers

All data provided is measured in seconds

8.5.1 graph500

Graph OpenMP OpenACC Total Main loop Memory Total Main loop Memory 0.00299 0.00299 0.0 0.00823 0.00823 0.0 10  0.00016  0.00016  0.0  0.0144  0.0144  0.0 0.00338 0.00338 0.0 0.00437 0.00436 1e-05 11  3e-05  3e-05  0.0  5e-05  5e-05  0.0 0.00338 0.00338 0.0 0.00606 0.00605 1e-05 12  3e-05  3e-05  0.0  0.00261  0.00261  0.0 0.00272 0.00272 0.0 0.004 0.00399 1e-05 13  2e-05  2e-05  0.0  2e-05  2e-05  0.0 0.00515 0.00515 0.0 0.00946 0.00944 2e-05 14  0.00025  0.00025  0.0  0.00435  0.00435  0.0 0.01085 0.01085 0.0 0.01249 0.01245 4e-05 15  0.00322  0.00322  0.0  0.00247  0.00246  2e-05 0.02427 0.02427 0.0 0.02791 0.02784 7e-05 16  0.00178  0.00178  0.0  0.00664  0.00664  2e-05 0.0726 0.0726 0.0 0.07292 0.07278 0.00014 17  0.00665  0.00665  0.0  0.00181  0.0018  5e-05 0.23689 0.23689 0.0 0.22913 0.22741 0.00172 18  0.02129  0.02129  0.0  0.00475  0.00472  0.00012 0.68347 0.68347 0.0 0.68142 0.67797 0.00345 19  0.02484  0.02484  0.0  0.02486  0.02482  0.00011 0.99537 0.99537 0.0 1.10974 1.10492 0.00482 20  0.02506  0.02506  0.0  0.03699  0.03693  0.00035 2.63756 2.63756 0.0 2.6145 2.60734 0.00716 21  0.08084  0.08084  0.0  0.12013  0.12019  0.0005 7.77718 7.77718 0.0 9.27451 9.25289 0.02162 22  0.52394  0.52394  0.0  0.48875  0.48475  0.00434 23.61333 23.61333 0.0 26.57403 26.50304 0.07099 23  1.13877  1.13877  0.0  1.09766  1.09701  0.003 60.85996 60.85996 0.0 66.2279 66.07145 0.15645 24  2.82849  2.82849  0.0  2.76819  2.76748  0.00736

Table 8.13: Raw unoptimised CPU performance numbers for the graph500 graphs (OpenMP vs. OpenACC).

8.5.2 KONECT

Graph                      | OpenMP Total       | OpenMP Main loop   | OpenMP Memory | OpenACC Total      | OpenACC Main loop  | OpenACC Memory
actor-collaboration        | 1.86556 ± 0.02403  | 1.86556 ± 0.02403  | 0.0 ± 0.0     | 1.93891 ± 0.0451   | 1.93649 ± 0.04507  | 0.00242 ± 7e-05
ca-cit-HepPh               | 0.08886 ± 0.00133  | 0.08886 ± 0.00133  | 0.0 ± 0.0     | 0.12501 ± 0.00216  | 0.12498 ± 0.00216  | 3e-05 ± 0.0
cfinder-google             | 0.02351 ± 0.01253  | 0.02351 ± 0.01253  | 0.0 ± 0.0     | 0.02323 ± 0.00293  | 0.02321 ± 0.00293  | 2e-05 ± 0.0
dbpedia-all                | 11.88381 ± 1.19304 | 11.88381 ± 1.19304 | 0.0 ± 0.0     | 15.33625 ± 0.29169 | 15.30947 ± 0.29192 | 0.02678 ± 0.00202
dbpedia-starring           | 0.03194 ± 0.00088  | 0.03194 ± 0.00088  | 0.0 ± 0.0     | 0.03911 ± 0.00215  | 0.03902 ± 0.00214  | 9e-05 ± 4e-05
discogs_affiliation        | 6.37403 ± 0.23694  | 6.37403 ± 0.23694  | 0.0 ± 0.0     | 6.29673 ± 0.38917  | 6.2899 ± 0.38914   | 0.00683 ± 0.00013
opsahl-ucsocial            | 0.0076 ± 0.00564   | 0.0076 ± 0.00564   | 0.0 ± 0.0     | 0.00859 ± 0.01429  | 0.00859 ± 0.01429  | 1e-05 ± 0.0
orkut-links                | 14.53032 ± 0.59213 | 14.53032 ± 0.59213 | 0.0 ± 0.0     | 13.7856 ± 0.75864  | 13.77481 ± 0.75756 | 0.01079 ± 0.00239
prosper-loans              | 0.31671 ± 0.00311  | 0.31671 ± 0.00311  | 0.0 ± 0.0     | 0.3017 ± 0.005     | 0.3016 ± 0.00497   | 0.0001 ± 4e-05
web-NotreDame              | 0.13464 ± 0.00723  | 0.13464 ± 0.00723  | 0.0 ± 0.0     | 0.14365 ± 0.01752  | 0.14153 ± 0.01751  | 0.00212 ± 5e-05
wiki_talk_en               | 13.00294 ± 0.27624 | 13.00294 ± 0.27624 | 0.0 ± 0.0     | 14.17812 ± 0.11319 | 14.15757 ± 0.11311 | 0.02055 ± 0.0014
zhishi-hudong-internallink | 10.7045 ± 0.218    | 10.7045 ± 0.218    | 0.0 ± 0.0     | 11.90495 ± 1.15872 | 11.8975 ± 1.1576   | 0.00746 ± 0.0012

Table 8.14: Raw unoptimised CPU performance numbers for the KONECT graphs (OpenMP vs. OpenACC).

8.5.3 SNAP

Graph        | OpenMP Total      | OpenMP Main loop  | OpenMP Memory | OpenACC Total     | OpenACC Main loop | OpenACC Memory
as-skitter   | 2.10977 ± 0.04519 | 2.10977 ± 0.04519 | 0.0 ± 0.0     | 2.03105 ± 0.09079 | 2.02425 ± 0.09078 | 0.0068 ± 7e-05
email-EuAll  | 0.23063 ± 0.00338 | 0.23063 ± 0.00338 | 0.0 ± 0.0     | 0.19241 ± 0.03159 | 0.19075 ± 0.03161 | 0.00166 ± 5e-05
roadNet-CA   | 0.6763 ± 0.08671  | 0.6763 ± 0.08671  | 0.0 ± 0.0     | 0.86682 ± 0.06719 | 0.86065 ± 0.06703 | 0.00617 ± 0.00017
roadNet-PA   | 0.37021 ± 0.02637 | 0.37021 ± 0.02637 | 0.0 ± 0.0     | 0.43199 ± 0.01914 | 0.42942 ± 0.01914 | 0.00257 ± 8e-05
roadNet-TX   | 0.52158 ± 0.05852 | 0.52158 ± 0.05852 | 0.0 ± 0.0     | 0.51532 ± 0.02268 | 0.51072 ± 0.02264 | 0.0046 ± 0.00015
web-BerkStan | 0.43861 ± 0.04803 | 0.43861 ± 0.04803 | 0.0 ± 0.0     | 0.43916 ± 0.04535 | 0.43479 ± 0.04529 | 0.00437 ± 0.0001
web-Google   | 1.26527 ± 0.12442 | 1.26527 ± 0.12442 | 0.0 ± 0.0     | 1.2703 ± 0.12364  | 1.26535 ± 0.1236  | 0.00495 ± 7e-05
wiki-Talk    | 6.12999 ± 0.88808 | 6.12999 ± 0.88808 | 0.0 ± 0.0     | 5.58989 ± 0.03323 | 5.58222 ± 0.03319 | 0.00767 ± 0.00012

Table 8.15: Raw unoptimised CPU performance numbers for the SNAP graphs (OpenMP vs. OpenACC).

8.6 Appendix F: Raw optimised CPU performance numbers

All data provided is measured in seconds

8.6.1 graph500

Graph | OpenMP Total       | OpenMP Main loop   | OpenMP Memory | OpenACC Total      | OpenACC Main loop  | OpenACC Memory
10    | 0.00299 ± 0.00016  | 0.00299 ± 0.00016  | 0.0 ± 0.0     | 0.00472 ± 0.00333  | 0.00472 ± 0.00332  | 0.0 ± 0.0
11    | 0.00338 ± 3e-05    | 0.00338 ± 3e-05    | 0.0 ± 0.0     | 0.00452 ± 0.00066  | 0.00451 ± 0.00066  | 0.0 ± 0.0
12    | 0.00338 ± 3e-05    | 0.00338 ± 3e-05    | 0.0 ± 0.0     | 0.00575 ± 0.00204  | 0.00574 ± 0.00204  | 1e-05 ± 0.0
13    | 0.00272 ± 2e-05    | 0.00272 ± 2e-05    | 0.0 ± 0.0     | 0.00605 ± 0.00321  | 0.00604 ± 0.00321  | 1e-05 ± 0.0
14    | 0.00515 ± 0.00025  | 0.00515 ± 0.00025  | 0.0 ± 0.0     | 0.00772 ± 0.00306  | 0.0077 ± 0.00305   | 2e-05 ± 0.0
15    | 0.01085 ± 0.00322  | 0.01085 ± 0.00322  | 0.0 ± 0.0     | 0.01409 ± 0.0059   | 0.01405 ± 0.00589  | 4e-05 ± 3e-05
16    | 0.02427 ± 0.00178  | 0.02427 ± 0.00178  | 0.0 ± 0.0     | 0.02643 ± 0.00362  | 0.02636 ± 0.00361  | 7e-05 ± 3e-05
17    | 0.0726 ± 0.00665   | 0.0726 ± 0.00665   | 0.0 ± 0.0     | 0.07441 ± 0.00446  | 0.07427 ± 0.00446  | 0.00014 ± 5e-05
18    | 0.23689 ± 0.02129  | 0.23689 ± 0.02129  | 0.0 ± 0.0     | 0.24066 ± 0.02061  | 0.23892 ± 0.0206   | 0.00174 ± 5e-05
19    | 0.68347 ± 0.02484  | 0.68347 ± 0.02484  | 0.0 ± 0.0     | 0.80204 ± 0.02211  | 0.79855 ± 0.0221   | 0.00349 ± 8e-05
20    | 0.99537 ± 0.02506  | 0.99537 ± 0.02506  | 0.0 ± 0.0     | 1.05916 ± 0.08522  | 1.0547 ± 0.0852    | 0.00446 ± 0.0001
21    | 2.63756 ± 0.08084  | 2.63756 ± 0.08084  | 0.0 ± 0.0     | 2.68049 ± 0.07337  | 2.67359 ± 0.07332  | 0.00691 ± 0.00022
22    | 7.77718 ± 0.52394  | 7.77718 ± 0.52394  | 0.0 ± 0.0     | 7.64209 ± 0.24266  | 7.62841 ± 0.24215  | 0.01368 ± 0.00195
23    | 23.61333 ± 1.13877 | 23.61333 ± 1.13877 | 0.0 ± 0.0     | 23.51885 ± 1.22018 | 23.48925 ± 1.21915 | 0.0296 ± 0.00251
24    | 60.85996 ± 2.82849 | 60.85996 ± 2.82849 | 0.0 ± 0.0     | 59.50593 ± 3.83317 | 59.44652 ± 3.83226 | 0.05942 ± 0.00599

Table 8.16: Raw optimised CPU performance numbers for the graph500 graphs (OpenMP vs. OpenACC).

8.6.2 KONECT

Graph                      | OpenMP Total       | OpenMP Main loop   | OpenMP Memory | OpenACC Total      | OpenACC Main loop  | OpenACC Memory
actor-collaboration        | 1.86556 ± 0.02403  | 1.86556 ± 0.02403  | 0.0 ± 0.0     | 2.00571 ± 0.05005  | 2.00327 ± 0.05004  | 0.00244 ± 0.0001
ca-cit-HepPh               | 0.08886 ± 0.00133  | 0.08886 ± 0.00133  | 0.0 ± 0.0     | 0.12786 ± 0.00231  | 0.12783 ± 0.0023   | 3e-05 ± 0.0
cfinder-google             | 0.02351 ± 0.01253  | 0.02351 ± 0.01253  | 0.0 ± 0.0     | 0.02632 ± 0.01661  | 0.0263 ± 0.01661   | 2e-05 ± 0.0
dbpedia-all                | 11.88381 ± 1.19304 | 11.88381 ± 1.19304 | 0.0 ± 0.0     | 12.63899 ± 1.16733 | 12.62601 ± 1.16668 | 0.01298 ± 0.00169
dbpedia-starring           | 0.03194 ± 0.00088  | 0.03194 ± 0.00088  | 0.0 ± 0.0     | 0.03626 ± 0.01502  | 0.03617 ± 0.01502  | 9e-05 ± 4e-05
discogs_affiliation        | 6.37403 ± 0.23694  | 6.37403 ± 0.23694  | 0.0 ± 0.0     | 6.43642 ± 0.39587  | 6.42954 ± 0.39582  | 0.00687 ± 0.00016
opsahl-ucsocial            | 0.0076 ± 0.00564   | 0.0076 ± 0.00564   | 0.0 ± 0.0     | 0.00654 ± 0.00174  | 0.00654 ± 0.00174  | 1e-05 ± 0.0
orkut-links                | 14.53032 ± 0.59213 | 14.53032 ± 0.59213 | 0.0 ± 0.0     | 13.85167 ± 1.14481 | 13.84115 ± 1.14318 | 0.01052 ± 0.00218
prosper-loans              | 0.31671 ± 0.00311  | 0.31671 ± 0.00311  | 0.0 ± 0.0     | 0.32586 ± 0.01944  | 0.32576 ± 0.01944  | 0.0001 ± 3e-05
web-NotreDame              | 0.13464 ± 0.00723  | 0.13464 ± 0.00723  | 0.0 ± 0.0     | 0.15065 ± 0.01148  | 0.14854 ± 0.01147  | 0.00211 ± 5e-05
wiki_talk_en               | 13.00294 ± 0.27624 | 13.00294 ± 0.27624 | 0.0 ± 0.0     | 12.2825 ± 0.83657  | 12.27367 ± 0.836   | 0.00884 ± 0.00107
zhishi-hudong-internallink | 10.7045 ± 0.218    | 10.7045 ± 0.218    | 0.0 ± 0.0     | 11.95647 ± 0.8667  | 11.94893 ± 0.8663  | 0.00753 ± 0.00126

Table 8.17: Raw optimised CPU performance numbers for the KONECT graphs (OpenMP vs. OpenACC).

8.6.3 SNAP

Graph        | OpenMP Total      | OpenMP Main loop  | OpenMP Memory | OpenACC Total     | OpenACC Main loop | OpenACC Memory
as-skitter   | 2.10977 ± 0.04519 | 2.10977 ± 0.04519 | 0.0 ± 0.0     | 2.19377 ± 0.0658  | 2.18712 ± 0.06568 | 0.00664 ± 0.00062
email-EuAll  | 0.23063 ± 0.00338 | 0.23063 ± 0.00338 | 0.0 ± 0.0     | 0.23654 ± 0.00326 | 0.23493 ± 0.00326 | 0.00161 ± 3e-05
roadNet-CA   | 0.6763 ± 0.08671  | 0.6763 ± 0.08671  | 0.0 ± 0.0     | 0.71743 ± 0.0758  | 0.71119 ± 0.07574 | 0.00624 ± 0.00017
roadNet-PA   | 0.37021 ± 0.02637 | 0.37021 ± 0.02637 | 0.0 ± 0.0     | 0.36245 ± 0.02438 | 0.3599 ± 0.02436  | 0.00255 ± 0.00015
roadNet-TX   | 0.52158 ± 0.05852 | 0.52158 ± 0.05852 | 0.0 ± 0.0     | 0.42156 ± 0.0261  | 0.41702 ± 0.02609 | 0.00454 ± 0.0002
web-BerkStan | 0.43861 ± 0.04803 | 0.43861 ± 0.04803 | 0.0 ± 0.0     | 0.43482 ± 0.0408  | 0.43045 ± 0.04077 | 0.00437 ± 7e-05
web-Google   | 1.26527 ± 0.12442 | 1.26527 ± 0.12442 | 0.0 ± 0.0     | 1.17827 ± 0.08245 | 1.17333 ± 0.0824  | 0.00494 ± 6e-05
wiki-Talk    | 6.12999 ± 0.88808 | 6.12999 ± 0.88808 | 0.0 ± 0.0     | 5.62305 ± 0.49547 | 5.61537 ± 0.49538 | 0.00768 ± 0.00019

Table 8.18: Raw optimised CPU performance numbers for the SNAP graphs (OpenMP vs. OpenACC).
