Master of Science Thesis in Information Coding Department of Electrical Engineering, Linköping University, 2016

Comparison of Technologies for General-Purpose Computing on Graphics Processing Units

Torbjörn Sörman

LiTH-ISY-EX–16/4923–SE

Supervisor: Robert Forchheimer, ISY, Linköpings universitet
            Åsa Detterfelt, MindRoad AB
Examiner: Ingemar Ragnemalm, ISY, Linköpings universitet

Department of Electrical Engineering
Linköping University
SE-581 83 Linköping, Sweden

Copyright © 2016 Torbjörn Sörman

Abstract

The computational capacity of graphics cards for general-purpose computing has progressed fast over the last decade. A major reason is computationally heavy computer games, where the standards for performance and high-quality graphics constantly rise. Another reason is the availability of better-suited technologies for programming the graphics cards. Combined, the result is devices with high raw performance and the means to access that performance. This thesis investigates some of the current technologies for general-purpose computing on graphics processing units. Technologies are compared primarily by benchmarking performance and secondarily by factors concerning programming and implementation. The choice of technology can have a large impact on performance. The benchmark application found the difference in execution time between the fastest technology, CUDA, and the slowest, OpenCL, to be a factor of two. The benchmark application also showed that the older technologies, OpenGL and DirectX, are competitive with CUDA and OpenCL in terms of resulting raw performance.


Acknowledgments

I would like to thank Åsa Detterfelt for the opportunity to make this thesis work at MindRoad AB. I would also like to thank Ingemar Ragnemalm at ISY.

Linköping, February 2016
Torbjörn Sörman


Contents

List of Tables

List of Figures

Acronyms

1 Introduction
   1.1 Background
   1.2 Problem statement
   1.3 Purpose and goal of the thesis work
   1.4 Delimitations

2 Benchmark algorithm
   2.1 Discrete Fourier Transform
      2.1.1 Fast Fourier Transform
   2.2 Image processing
   2.3 Image compression
   2.4 Linear algebra
   2.5 Sorting
      2.5.1 Efficient sorting
   2.6 Criteria for Algorithm Selection

3 Theory
   3.1 Graphics Processing Unit
      3.1.1 GPGPU
      3.1.2 GPU vs CPU
   3.2 Fast Fourier Transform
      3.2.1 Cooley-Tukey
      3.2.2 Constant geometry
      3.2.3 Parallelism in FFT
      3.2.4 GPU algorithm
   3.3 Related research


4 Technologies
   4.1 CUDA
   4.2 OpenCL
   4.3 DirectCompute
   4.4 OpenGL
   4.5 OpenMP
   4.6 External libraries

5 Implementation
   5.1 Benchmark application GPU
      5.1.1 FFT
      5.1.2 FFT 2D
      5.1.3 Differences
   5.2 Benchmark application CPU
      5.2.1 FFT with OpenMP
      5.2.2 FFT 2D with OpenMP
      5.2.3 Differences with GPU
   5.3 Benchmark configurations
      5.3.1 Limitations
      5.3.2 Testing
      5.3.3 Reference libraries

6 Evaluation
   6.1 Results
      6.1.1 Forward FFT
      6.1.2 FFT 2D
   6.2 Discussion
      6.2.1 Qualitative assessment
      6.2.2 Method
   6.3 Conclusions
      6.3.1 Benchmark application
      6.3.2 Benchmark performance
      6.3.3 Implementation
   6.4 Future work
      6.4.1 Application
      6.4.2 Hardware

Bibliography

List of Tables

3.1 Global kernel parameter list with argument depending on size of input N and stage.
3.2 Local kernel parameter list with argument depending on size of input N and number of stages left to complete.

4.1 Table of function types in CUDA.

5.1 Shared memory size in bytes, threads and block configuration per device.
5.2 Integer intrinsic bit-reverse function for different technologies.
5.3 Table illustrating how to set parameters and launch a kernel.
5.4 Synchronization in GPU technologies.
5.5 Twiddle factors for a 16-point sequence where α equals (2 · π)/16. Each row i corresponds to the ith butterfly operation.
5.6 Libraries included to compare with the implementation.

6.1 Technologies included in the experimental setup.
6.2 Table comparing CUDA to CPU implementations.

List of Figures

3.1 The GPU uses more transistors for data processing
3.2 Radix-2 butterfly operations
3.3 8-point radix-2 FFT using the Cooley-Tukey algorithm
3.4 Flow graph of a radix-2 FFT using the constant geometry algorithm.

4.1 CUDA global kernel
4.2 OpenCL global kernel
4.3 DirectCompute global kernel
4.4 OpenGL global kernel
4.5 OpenMP procedure completing one stage

5.1 Overview of the events in the algorithm.
5.2 Flow graph of a 16-point FFT using (stages 1 and 2) the Cooley-Tukey algorithm and (stages 3 and 4) the constant geometry algorithm. The solid box is the bit-reverse order output. Dotted boxes are separate kernel launches; dashed boxes are data transferred to local memory before computing the remaining stages.
5.3 CUDA example code showing index calculation for each stage in the global kernel, where N is the total number of points. io_low is the index of the first input in the butterfly operation and io_high the index of the second.
5.4 CUDA code for index calculation of points in shared memory.
5.5 Code returning a bit-reversed unsigned integer where x is the input. Only 32-bit integer input and output.
5.6 Original image in 5.6a transformed and represented with a quadrant-shifted magnitude visualization (scale skewed for improved illustration) in 5.6b.
5.7 CUDA device code for the transpose kernel.
5.8 Illustration of how shared memory is used in transposing an image. Input data is tiled and each tile is written to shared memory and transposed before written to the output memory.
5.9 An overview of the setup phase for the GPU technologies.
5.10 OpenMP implementation overview transforming a sequence of size N.


5.11 C/C++ code performing the bit-reverse ordering of an N-point sequence.

6.1 Overview of the results of a single forward transform. The clFFT was timed by host synchronization resulting in an overhead in the range of 60 µs. Lower is faster.
6.2 Performance relative to the CUDA implementation in 6.2a and OpenCL in 6.2b. Lower is better.
6.3 Performance relative to the CUDA implementation on the GeForce GTX 670 and an Intel Core i7 3770K 3.5 GHz CPU.
6.4 Comparison between the Radeon R7 260X and the GeForce GTX 670.
6.5 Performance relative to OpenCL accumulated from both cards.
6.6 Overview of the results of measuring the time of a single 2D forward transform.
6.7 Time of the 2D transform relative to CUDA in 6.7a and OpenCL in 6.7b.
6.8 Performance relative to the CUDA implementation on the GeForce GTX 670 and an Intel Core i7 3770K 3.5 GHz CPU.
6.9 Comparison between the Radeon R7 260X and the GeForce GTX 670 running the 2D transform.
6.10 Performance relative to OpenCL accumulated from both cards.

Acronyms

1D One-dimensional.
2D Two-dimensional.
3D Three-dimensional.

ACL AMD Compute Libraries.
API Application Programming Interface.

BLAS Basic Linear Algebra Subprograms.

CPU . CUDA Compute Unified Device Architecture.

DCT Discrete Cosine Transform.
DFT Discrete Fourier Transform.
DIF Decimation-In-Frequency.
DWT Discrete Wavelet Transform.

EBCOT Embedded Block Coding with Optimized Truncation.

FFT Fast Fourier Transform.
FFTW Fastest Fourier Transform in the West.
FLOPS Floating-point Operations Per Second.
FPGA Field-Programmable Gate Array.

GCN Graphics Core Next.


GLSL OpenGL Shading Language.
GPGPU General-Purpose computing on Graphics Processing Units.
GPU Graphics Processing Unit.

HBM High Bandwidth Memory.
HBM2 High Bandwidth Memory, second generation.
HLSL High-Level Shading Language.
HPC High-Performance Computing.

ILP Instruction-Level Parallelism.

OpenCL Open Computing Language.
OpenGL Open Graphics Library.
OpenMP Open Multi-Processing.
OS Operating System.

PTX Parallel Thread Execution.

RDP Remote Desktop Protocol.

SM Streaming Multiprocessor.

VLIW Very Long Instruction Word-processor.

1 Introduction

This chapter gives an introduction to the thesis: the background, the problem statement, the purpose and goal of the thesis work, and its delimitations.

1.1 Background

Technical improvements of hardware have for a long period of time been the best way to solve computationally demanding problems faster. However, during the last decades the limit of what can be achieved by hardware improvements appears to have been reached: the operating frequency of the Central Processing Unit (CPU) no longer improves significantly. Problems relying on single-thread performance are limited by three primary technical factors:

1. The Instruction-Level Parallelism (ILP) wall
2. The memory wall
3. The power wall

The first wall states that it is hard to further exploit simultaneous CPU instructions: techniques such as instruction pipelining, superscalar execution and Very Long Instruction Word-processors (VLIW) exist, but the complexity and latency of the hardware reduce the benefits. The second wall is the gap between CPU speed and memory access time: an access to primary memory may cost several hundred CPU cycles. The third wall is the power and heating problem: the power consumed increases steeply with each further increase of operating frequency.


Improvements can instead be found in exploiting parallelism: either the problem itself is inherently parallelizable, or the problem has to be restructured to expose parallelism. This trend manifests in the development towards the use and construction of multi-core microprocessors. The Graphics Processing Unit (GPU) is one such device; it originally exploited the inherent parallelism of visual rendering but is now available as a tool for massively parallelizable problems.

1.2 Problem statement

Programmers might experience a threshold and a slow learning curve when moving from a sequential paradigm to the thread-parallel programming paradigm that GPU programming is. Obstacles involve learning about the hardware architecture and restructuring the application. Knowing the limitations and benefits might even provide reason not to utilize the GPU, and instead choose to work with a multi-core CPU. Depending on one's preferences, needs, and future goals, selecting one technology over the other can be crucial for productivity. Factors concerning productivity can be portability, hardware requirements, programmability, how well the technology integrates with other frameworks and Application Programming Interfaces (APIs), or how well it is supported by the provider and developer community. Within the scope of this thesis, the covered technologies are Compute Unified Device Architecture (CUDA), Open Computing Language (OpenCL), DirectCompute (an API within DirectX), and Open Graphics Library (OpenGL) Compute Shaders.

1.3 Purpose and goal of the thesis work

The purpose of this thesis is to evaluate, select, and implement an application suitable for General-Purpose computing on Graphics Processing Units (GPGPU); to implement the same application in the GPGPU technologies CUDA, OpenCL, DirectCompute, and OpenGL; to compare the GPU results with results from a sequential C/C++ implementation and a multi-core OpenMP implementation; and to compare the different technologies by means of benchmarking. The goal is also to make qualitative assessments of how it is to use the technologies.

1.4 Delimitations

Only one benchmark application algorithm will be selected; the scope and time required only allow for one algorithm to be tested. Each technology has different debugging and profiling tools, and those are not included in the comparison of the technologies. However important such tools can be, they are of a subjective nature and more difficult to put a measure on.

2 Benchmark algorithm

This chapter covers possible applications for a GPGPU study. The basic theory and the motivation for why they are suitable for benchmarking GPGPU technologies are presented.

2.1 Discrete Fourier Transform

The Fourier transform is of use when analysing the spectrum of a continuous analogue signal. When the transform is applied to a signal, the signal is decomposed into the frequencies that make it up. In digital signal analysis the Discrete Fourier Transform (DFT) is the counterpart of the Fourier transform for analogue signals. The DFT converts a sequence of finite length into a list of coefficients of a finite combination of complex sinusoids. Given that the sequence is a sampled function from the time or spatial domain, it is a conversion to the frequency domain. It is defined as

X_k = \sum_{n=0}^{N-1} x(n) W_N^{kn},   k ∈ [0, N − 1]    (2.1)

where W_N = e^{−i2π/N}, commonly named the twiddle factor [15].

The DFT is used in many practical applications to perform Fourier analysis. It is a powerful mathematical tool that enables a perspective from another domain where difficult and complex problems become easier to analyse. It is practically used in digital signal processing, for example on discrete samples of sound waves, radio signals or any continuous signal over a finite time interval. If used in image processing, the sampled sequence is the pixels along a row or column. The DFT takes complex numbers as input, and gives complex coefficients as output. In practical applications the input is usually real numbers.
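To make the definition concrete, the sum in equation (2.1) can be evaluated directly with two nested loops. The sketch below is for illustration only and is not taken from the benchmark application; the complex type cpx (real part x, imaginary part y) mirrors the struct used in the code listings later in the report.

#include <math.h>

typedef struct { float x, y; } cpx;   /* x = real part, y = imaginary part */

/* Direct evaluation of equation (2.1): X[k] = sum_n x(n) * W_N^(kn),
 * with W_N = e^(-i*2*pi/N). Requires on the order of N^2 operations. */
void dft(const cpx *x, cpx *X, int N)
{
    const float two_pi = 6.28318530717958f;
    for (int k = 0; k < N; ++k) {
        float re = 0.0f, im = 0.0f;
        for (int n = 0; n < N; ++n) {
            float a = -two_pi * (float)k * (float)n / (float)N;
            re += x[n].x * cosf(a) - x[n].y * sinf(a);
            im += x[n].x * sinf(a) + x[n].y * cosf(a);
        }
        X[k].x = re;
        X[k].y = im;
    }
}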

2.1.1 Fast Fourier Transform

The problem with the DFT is that the direct computation requires O(n²) complex multiplications and complex additions, which makes it computationally heavy and impractical in high-throughput applications. The Fast Fourier Transform (FFT) is one of the most common algorithms used to compute the DFT of a sequence. An FFT computes the transformation by factorizing the transformation matrix of the DFT into a product of mostly zero factors. This reduces the order of computations to O(n log n) complex multiplications and additions.

The FFT was made popular in 1965 [7] by J.W. Cooley and John Tukey. It found its way into practical use at the same time, and meant a serious breakthrough in digital signal processing [8, 5]. However, the complete algorithm was not invented at the time: the history of the Cooley-Tukey FFT algorithm can be traced back to around 1805 and the work of the famous mathematician Carl Friedrich Gauss [18]. The algorithm is a divide-and-conquer algorithm that relies on recursively dividing the input into sub-blocks. Eventually the problem is small enough to be solved directly, and the sub-blocks are combined into the final result.
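The divide-and-conquer idea can be illustrated with a textbook recursive radix-2 formulation, sketched below under the assumption that N is a power of two. This is not the variant implemented later in the thesis (which is iterative and decimation-in-frequency); it only shows how the O(n log n) complexity arises from splitting the input into even- and odd-indexed halves.

#include <math.h>

typedef struct { float x, y; } cpx;   /* complex value: x real, y imaginary */

/* Recursive radix-2 FFT (out-of-place). Even-indexed inputs are transformed
 * into the first half of 'out', odd-indexed into the second half, and the
 * halves are combined with twiddle factors. */
static void fft_recursive(const cpx *in, cpx *out, int N, int stride)
{
    if (N == 1) { out[0] = in[0]; return; }

    fft_recursive(in,          out,         N / 2, 2 * stride);
    fft_recursive(in + stride, out + N / 2, N / 2, 2 * stride);

    for (int k = 0; k < N / 2; ++k) {
        float a = -6.28318530717958f * (float)k / (float)N;  /* angle of W_N^k */
        cpx w  = { cosf(a), sinf(a) };
        cpx e  = out[k];
        cpx o  = out[k + N / 2];
        cpx wo = { w.x * o.x - w.y * o.y, w.x * o.y + w.y * o.x };
        out[k].x         = e.x + wo.x;  out[k].y         = e.y + wo.y;
        out[k + N / 2].x = e.x - wo.x;  out[k + N / 2].y = e.y - wo.y;
    }
}

/* Top-level call: fft_recursive(input, output, N, 1); */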

2.2 Image processing

Image processing consists of a wide range of domains. Earlier academic work on performance evaluation on the GPU [25] tested four major domains and compared them with the CPU. The domains were Three-dimensional (3D) shape reconstruction, feature extraction, image compression, and computational photography. Image processing is typically favourable on a GPU since images inherently have a parallel structure. Most image processing algorithms apply the same computation to a large number of pixels, and that is typically a data-parallel operation. Some algorithms can then be expected to show a huge speed-up compared to an efficient CPU implementation. A representative task is applying a simple image filter that gathers neighbouring pixel values and computes a new value for a pixel. If done with respect to the underlying structure of the system, one can expect a speed-up near linear in the number of computational cores used. That is, a CPU with four cores can theoretically expect a near four-times speed-up compared to a single core. This extends to a GPU, so that a GPU with n cores can expect a speed-up in the order of n in ideal cases. An example of this is a Gaussian blur (or smoothing) filter.
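As a minimal illustration of such a filter, the sketch below applies a 3×3 box blur to a greyscale image stored row-major; every output pixel depends only on its own neighbourhood, so all pixels can be computed independently (one thread or core per pixel). The function name and image layout are illustrative assumptions, not part of the benchmark application.

/* 3x3 box blur on a greyscale image (width w, height h, row-major). */
void box_blur_3x3(const float *src, float *dst, int w, int h)
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            int count = 0;
            /* Gather the neighbouring pixel values, clamped at the borders. */
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int xx = x + dx, yy = y + dy;
                    if (xx >= 0 && xx < w && yy >= 0 && yy < h) {
                        sum += src[yy * w + xx];
                        ++count;
                    }
                }
            }
            dst[y * w + x] = sum / (float)count;
        }
    }
}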

2.3 Image compression

The image compression standard JPEG2000 offers algorithms with parallelism but is very computationally and memory intensive. The standard aims to improve performance over JPEG, but also to add new features. The following steps are part of the JPEG2000 algorithm [6]:

1. Color component transformation
2. Tiling
3. Wavelet transform
4. Quantization
5. Coding

The computation-heavy parts can be identified as the Discrete Wavelet Transform (DWT) and the encoding engine, which uses Embedded Block Coding with Optimized Truncation (EBCOT) Tier-1. One difference between the older format JPEG and the newer JPEG2000 is the use of the DWT instead of the Discrete Cosine Transform (DCT). In comparison to the DFT, the DCT operates solely on real values. DWTs, on the other hand, use another representation that allows for a time complexity of O(N).

2.4 Linear algebra

Linear algebra is central to both pure and applied mathematics. In scientific computing it is a highly relevant problem to solve dense linear systems efficiently. In the initial uses of GPUs in scientific computing, the graphics pipeline was successfully used for linear algebra through programmable vertex and pixel shaders [20]. Methods and systems used later on for utilizing GPUs have been shown to be efficient also in hybrid systems (multi-core CPUs + GPUs) [27]. Linear algebra is highly suitable for GPUs, and with careful calibration it is possible to reach 80%-90% of the theoretical peak speed for large matrices [28]. Common operations are vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. Matrix multiplication has a high time complexity, O(N³), which makes it a bottleneck in many algorithms. Matrix decompositions like LU, QR, and Cholesky decomposition are used very often and are subjects of benchmark applications targeting GPUs [28].
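The O(N³) cost of matrix multiplication is visible directly in the textbook triple loop below (a sketch only; a tuned GPU or BLAS routine would instead use blocking and shared-memory tiling).

/* C = A * B for N x N row-major matrices: three nested loops over N,
 * hence O(N^3) multiply-add operations. */
void matmul(const float *A, const float *B, float *C, int N)
{
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}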

2.5 Sorting

The sort operation is an important part of computer science and a classic problem to work on. There exist several sorting techniques, and depending on the problem and its requirements a suitable algorithm is found by examining its attributes. Sorting algorithms can be organized into two categories: data-driven and data-independent. The quicksort algorithm is a well-known example of a data-driven sorting algorithm. It performs with time complexity O(n log n) on average, but has a time complexity of O(n²) in the worst case. Another data-driven algorithm that does not have this problem is the heap sort algorithm, but it suffers from difficult data access patterns instead. Data-driven algorithms are not the easiest to parallelize since their behaviour is not known in advance and may cause bad load balancing among computational units.

The data-independent algorithms are the algorithms that always perform the same process no matter what the data is. This behaviour makes them suitable for implementation on multiple processors and as fixed sequences of instructions, where the moments in which data must be synchronized and communication must occur are known in advance.

2.5.1 Efficient sorting

Bitonic sort has been used early on in the utilization of GPUs for sorting. Even though it has a time complexity of O(n log² n), it has been an easy way of doing a reasonably efficient sort on GPUs. Other high-performance sorts on GPUs are often combinations of algorithms. Some examples of combined sort methods on GPUs are the bitonic merge sort, and a bucket sort that splits the sequence into smaller sequences before each is sorted with a merge sort. A popular algorithm for GPUs has been variants of radix sort, which is a non-comparative integer sorting algorithm. Radix sorts can be described as being easy to implement and still as efficient as more sophisticated algorithms. Radix sort works by grouping integer keys by the individual digit values at the same significant position. A sketch of one such pass is given below.
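The sketch is a plain CPU formulation of one least-significant-digit counting pass over 8-bit digits; the loop structure is the same regardless of the key values, which is the data-independent property discussed above. A GPU version would typically replace the sequential prefix sum with a parallel scan. The code is illustrative and not taken from the benchmark application.

/* One counting-sort pass of LSD radix sort over an 8-bit digit.
 * 'shift' selects the digit: 0, 8, 16 or 24 for 32-bit keys. */
static void radix_pass(const unsigned int *in, unsigned int *out, int n, int shift)
{
    int count[256] = { 0 };

    for (int i = 0; i < n; ++i)                  /* histogram of digit values */
        ++count[(in[i] >> shift) & 0xFF];

    int sum = 0;                                 /* exclusive prefix sum */
    for (int d = 0; d < 256; ++d) {
        int c = count[d];
        count[d] = sum;
        sum += c;
    }

    for (int i = 0; i < n; ++i)                  /* stable scatter to new positions */
        out[count[(in[i] >> shift) & 0xFF]++] = in[i];
}

/* Full 32-bit sort: four passes, ping-ponging between two buffers. */
void radix_sort_u32(unsigned int *keys, unsigned int *tmp, int n)
{
    radix_pass(keys, tmp, n, 0);
    radix_pass(tmp, keys, n, 8);
    radix_pass(keys, tmp, n, 16);
    radix_pass(tmp, keys, n, 24);
}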

2.6 Criteria for Algorithm Selection

A benchmarking application is sought that has the necessary complexity and relevance for both practical use and the scientific community. The algorithm with enough complexity and challenges is the FFT. Compared to the other presented algorithms, the FFT is more complex than the matrix operations and the regular sorting algorithms. The FFT does not demand as much domain knowledge as the image compression algorithms, but it is still a very important algorithm for many applications.

The difficulties of working with multi-core systems also apply to GPUs. What GPUs are missing compared to multi-core CPUs is strong sequential performance. Instead, GPUs are excellent at fast context switching and hiding memory latencies. Most of the effort of working with GPUs goes into supplying tasks with enough parallelism, avoiding branching, and optimizing memory access patterns. One important issue is also the host-to-device memory transfer time: even if the algorithm is much faster on the GPU, a CPU could still be faster overall if the transfer to the device and back is a large part of the total time.

By selecting an algorithm with much scientific interest and history, relevant comparisons can be made. It is sufficient to say that one can demand reasonable performance by utilizing information sources showing implementations on GPUs.

3 Theory

This chapter will give an introduction to the FFT algorithm and a brief introduction to the GPU.

3.1 Graphics Processing Unit

A GPU is traditionally specialized hardware for efficient manipulation of computer graphics and image processing [24]. The inherent parallel structure of images and graphics makes GPUs very efficient at some general problems where parallelism can be exploited. The concept of GPGPU is solving a problem on the GPU platform instead of on a multi-core CPU system.

3.1.1 GPGPU

In the early days of GPGPU one had to know a lot about computer graphics to compute general data. The available APIs were created for graphics processing; the dominant APIs were OpenGL and DirectX. High-Level Shading Language (HLSL) and OpenGL Shading Language (GLSL) made the step easier, but the code was still generated for the graphics APIs.

A big change came when NVIDIA released CUDA, which together with new hardware made it possible to use standard C code to program the GPU (with a few extensions). Parallel computing hardware was a small market at the time, and the simplified use of the GPU for parallel tasks opened up to many new customers. However, the main business is still graphics, and the manufacturers cannot make cards too expensive, especially not at the cost of graphics performance (as an increase in double-precision floating-point capacity would be). This can be exemplified with a comparison between NVIDIA's Maxwell microarchitecture and its predecessor Kepler: both are similar, but in Maxwell some of the double-precision floating-point capacity was removed in favour of single-precision floating-point capacity (preferred in graphics).


[Diagram: side-by-side CPU and GPU layouts built from ALUs, control logic, cache, and DRAM.]

Figure 3.1: The GPU uses more transistors for data processing

When using GPUs in the context of data centers and High-Performance Computing (HPC), studies show that GPU acceleration can reduce power consumption [19], and it is relevant to know the behaviour of GPUs in the context of power and HPC [16] for the best utilization.

3.1.2 GPU vs CPU

The GPU is built on the principle of more execution units instead of a higher clock frequency to improve performance. Comparing the CPU with the GPU, the GPU delivers a much higher theoretical number of Floating-point Operations Per Second (FLOPS) at a better cost and energy efficiency [23]. The GPU relies on high memory bandwidth and fast context switching (executing the next warp of threads) to compensate for a lower frequency and to hide memory latencies. The CPU is excellent at sequential tasks, with features like branch prediction.

The GPU thread is lightweight and its creation has little overhead, whereas on the CPU the thread can be seen as an abstraction of the processor, and switching a thread is considered expensive since the context has to be loaded each time. On the other hand, a GPU is very inefficient if not enough threads are ready to perform work. Memory latencies are supposed to be hidden by switching in a new set of working threads as fast as possible.

A CPU thread has its own registers, whereas GPU threads work in groups where threads share registers and memory. One cannot give individual instructions to each thread; all of them will execute the same instruction. Figure 3.1 demonstrates this by showing that, by sharing control structure and cache, the GPU puts more resources into processing than the CPU, where more resources go into control structures and memory cache.

Figure 3.2: Radix-2 butterfly operations

Figure 3.3: 8-point radix-2 FFT using the Cooley-Tukey algorithm

3.2 Fast Fourier Transform

This section extends the information from section 2.1.1 in the Benchmark algorithm chapter.

3.2.1 Cooley-Tukey

The Fast Fourier Transform is by far mostly associated with the Cooley-Tukey algorithm [7]. The Cooley-Tukey algorithm is a divide-and-conquer algorithm that recursively breaks down a DFT of any composite size N = N1 · N2. The algorithm decomposes the DFT into s = log_r(N) stages. The N-point DFT is composed of small r-point DFTs in s stages. In this context the r-point DFT is called a radix-r butterfly.

Butterfly and radix-2

The implementation of an N-point radix-2 FFT algorithm has log2(N) stages with N/2 butterfly operations per stage. A butterfly operation is an addition and a subtraction, followed by a multiplication by a twiddle factor, see figure 3.2.
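Expressed in code, the butterfly in figure 3.2 is one complex addition, one complex subtraction and one complex multiplication by the twiddle factor. The sketch below uses a cpx struct (real part x, imaginary part y) like the one in the kernel listings of chapter 4; it is an illustration, not an excerpt from the implementation.

typedef struct { float x, y; } cpx;   /* complex value: x real, y imaginary */

/* Radix-2 DIF butterfly: (a, b) -> (a + b, (a - b) * w). */
static void butterfly(cpx *a, cpx *b, cpx w)
{
    float x = a->x - b->x;
    float y = a->y - b->y;
    a->x += b->x;
    a->y += b->y;
    b->x = w.x * x - w.y * y;   /* complex multiplication by the twiddle factor */
    b->y = w.y * x + w.x * y;
}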

Figure 3.3 shows an 8-point radix-2 Decimation-In-Frequency (DIF) FFT. The input data are in natural order whereas the output data are in bit-reversed order.

Figure 3.4: Flow graph of a radix-2 FFT using the constant geometry algorithm.

3.2.2 Constant geometry

Constant geometry is similar to Cooley-Tukey, but with another data access pattern that uses the same indexing in all stages. Constant geometry removes the overhead of calculating the data input index at each stage, as seen in figure 3.3 where the top butterfly in the first stage requires input x[0], x[4] and in the second stage x[0], x[2], whilst the constant geometry algorithm in figure 3.4 uses x[0], x[4] as input in all stages.

3.2.3 Parallelism in FFT

By examining the FFT algorithm, parallelism can be exploited in several ways. Naturally, when decomposing the DFT into radix-2 operations, parallelism can be achieved by mapping one thread per data input. That would, however, lead to an unbalanced load, as every second input is multiplied by the complex twiddle factor whereas the other half has no such step. The solution is to assign one thread per radix-2 butterfly operation; each thread will then have the same workload.

3.2.4 GPU algorithm

The complete FFT application can be implemented with two different kernels: one kernel executing over a single stage, and another kernel executing the last stages that fit within one block. The single-stage kernel, called the global kernel, executes each stage of the algorithm in sequential order. Each execution requires in total as many threads as there are butterfly operations. The host supplies the kernel with arguments depending on the stage number and problem size (see table 3.1 for the full parameter list). The global kernel algorithm is shown in algorithm 1. The global kernel is only called for the stages that do not fit in a single block (this depends on the number of selected threads per block). The global kernel implements the Cooley-Tukey algorithm.

Parameter   Argument
data        Input/Output data buffer
stage       [0, log2(N) − log2(Nblock)]
bitmask     LeftShift(0xFFFFFFFF, 32 − stage)
angle       (2 · π)/N
dist        RightShift(N, steps)

Table 3.1: Global kernel parameter list with argument depending on size of input N and stage.

Algorithm 1 Pseudo-code for the global kernel with input from the host.
1: procedure GlobalKernel(data, stage, bitmask, angle, dist)
2:    tid ← GlobalThreadId
3:    low ← tid + (tid & bitmask)
4:    high ← low + dist
5:    twMask ← ShiftLeft(dist − 1, stage)
6:    twStage ← PowerOfTwo(stage) · tid
7:    a ← angle · (twStage & twMask)
8:    Imag(twiddleFactor) ← Sin(a)
9:    Real(twiddleFactor) ← Cos(a)
10:   temp ← ComplexSub(data[low], data[high])
11:   data[low] ← ComplexAdd(data[low], data[high])
12:   data[high] ← ComplexMul(temp, twiddleFactor)
13: end procedure

Parameter     Argument
in            Input data buffer
out           Output data buffer
angle         (2 · π)/N
stages        [log2(N) − log2(Nblock), log2(N)]
leadingBits   32 − log2(N)
c             Forward: 1, Inverse: 1/N

Table 3.2: Local kernel parameter list with argument depending on size of input N and number of stages left to complete.

Shared/Local memory

The local kernel is always called, and encapsulates all remaining stages and the bit-reverse order output procedure. It is devised so as to utilize shared memory completely for all stages. This reduces the primary memory accesses to a single read and write per data point. The kernel implements the constant geometry algorithm to increase performance in the inner loop: the input and output indexes are only calculated once. See algorithm 2.

Register width

The targeted GPUs work on 32-bit registers and all fast integer arithmetic is based on that. Procedures using bitwise operations are constructed with this architecture-specific information, as seen in the bitmask parameter in table 3.1 and the leadingBits parameter in table 3.2. The bitmask parameter is used to get the offset for each stage using the Cooley-Tukey algorithm. The leadingBits parameter is used in the bit-reverse operation to remove the leading zeroes that come as a result of using a 32-bit register.

Bit-reverse example: If the total size is 1024 elements, the last log2(1024) = 10 bits are used. When encountering 1008 = 1111110000₂ for bit-reversal in this context (with a problem size of 1024 points) the result is 63. However, using a 32-bit register:

1008 = 00000000000000000000001111110000₂    (3.1)

bits reversed:

264241152 = 00001111110000000000000000000000₂    (3.2)

The leading zeroes become trailing zeroes that need to be removed. A logical right shift by leadingBits = 32 − log2(1024) = 22 solves this.
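On the device, the two steps can be combined into a single expression. The sketch below assumes CUDA and the __brev intrinsic listed in table 5.2; the parameter name leadingBits follows table 3.2.

// Bit-reverse a data index of an N-point sequence (N a power of two):
// reverse all 32 bits, then shift out what used to be leading zeroes.
__device__ unsigned int reverse_index(unsigned int i, int leadingBits)
{
    return __brev(i) >> leadingBits;   // leadingBits = 32 - log2(N)
}

// With N = 1024 (leadingBits = 22), reverse_index(1008, 22) evaluates to 63.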

3.3 Related research

Scientific interest has mainly targeted CUDA and OpenCL for comparisons. Benchmarking between the two has established that there is a difference in favour of CUDA; however, this can be due to unfair comparisons [10], and with the correct tuning OpenCL can be just as fast. The same paper stated that the biggest difference came from running the forward FFT algorithm. Examination showed that large differences could be found in the Parallel Thread Execution (PTX) instructions (intermediary GPU code). Porting from CUDA to OpenCL without losing performance has been explored in [9], where the goal was to achieve a performance-portable solution. Some of the main differences between the technologies are described in that paper.

Algorithm 2 Pseudo-code for the local kernel with input from the host.
1: procedure LocalKernel(in, out, angle, stages, leadingBits, c)
2:    let shared be a shared/local memory buffer
3:    low ← ThreadId
4:    high ← low + BlockDim
5:    offset ← BlockId · BlockDim · 2
6:    shared[low] ← in[low + offset]
7:    shared[high] ← in[high + offset]
8:    ConstantGeometry(shared, low, high, angle, stages)
9:    revLow ← BitReverse(low + offset, leadingBits)
10:   revHigh ← BitReverse(high + offset, leadingBits)
11:   out[revLow] ← ComplexMul(c, shared[low])
12:   out[revHigh] ← ComplexMul(c, shared[high])
13: end procedure

14: procedure ConstantGeometry(shared, low, high, angle, stages)
15:    outI ← low · 2
16:    outII ← outI + 1
17:    for stage ← 0, stages − 1 do
18:       bitmask ← ShiftLeft(0xFFFFFFFF, stage)
19:       a ← angle · (low & bitmask)
20:       Imag(twiddleFactor) ← Sin(a)
21:       Real(twiddleFactor) ← Cos(a)
22:       temp ← ComplexSub(shared[low], shared[high])
23:       shared[outI] ← ComplexAdd(shared[low], shared[high])
24:       shared[outII] ← ComplexMul(twiddleFactor, temp)
25:    end for
26: end procedure

4 Technologies

Five different multi-core technologies are used in this study. One is CUDA, a proprietary parallel computing platform and API. Compute Shaders in OpenGL and DirectCompute are parts of graphics programming APIs but have a fairly general structure and allow for general computing. OpenCL has the stated goal of targeting any heterogeneous multi-core system, but is used in this study solely on the GPU. To compare with the CPU, OpenMP is included as an effective way to parallelize sequential C/C++ code.

4.1 CUDA

CUDA is developed by NVIDIA and was released in 2006. CUDA is an extension of the C/C++ language and has its own compiler. CUDA supports the functionality to execute kernels and to modify the graphics card RAM, and provides several optimized function libraries such as cuBLAS (the CUDA implementation of Basic Linear Algebra Subprograms (BLAS)) and cuFFT (the CUDA implementation of the FFT).

A program launched on the GPU is called a kernel. The GPU is referred to as the device and the CPU is called the host. To run a CUDA kernel, all that is needed is to declare the program with the function type specifier __global__ and call it from the host with launch arguments; for other specifiers see table 4.1. The kernel execution call includes specifying the thread organization. Threads are organized in blocks, which in turn are specified within a grid. Both the block and the grid can be One-dimensional (1D), Two-dimensional (2D) or Three-dimensional (3D) to help the addressing in a program. These can be accessed within a kernel through the structures blockDim and gridDim.


Function type   Executed on   Callable from
__device__      Device        Device
__global__      Device        Host
__host__        Host          Host

Table 4.1: Table of function types in CUDA.

__global__ void cu_global(cpx *in, unsigned int mask, float angle, int steps, int dist)
{
    cpx w;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Input offset
    in += tid + (tid & mask);
    cpx *h = in + dist;

    // Twiddle factor
    angle *= ((tid << steps) & ((dist - 1) << steps));
    __sincosf(angle, &w.y, &w.x);

    // Butterfly
    float x = in->x - h->x;
    float y = in->y - h->y;
    *in = { in->x + h->x, in->y + h->y };
    *h  = { (w.x * x) - (w.y * y), (w.y * x) + (w.x * y) };
}

Figure 4.1: CUDA global kernel

Thread and block identification is done with threadIdx and blockIdx. All limitations can be polled from the device, and all devices have a minimum feature support called compute capability. The compute capability aimed at in this thesis is 3.0, which includes the NVIDIA GPU models starting with GK (Kepler) and later models such as GM (Maxwell). CUDA exposes intrinsic integer functions on the device and a variety of fast math functions, optimized single-precision operations denoted with the suffix -f. In the CUDA example in figure 4.1 the trigonometric function __sincosf is used to calculate both sin α and cos α in a single call.
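A minimal host-side sketch of such a launch, using the cu_global kernel from figure 4.1 and one thread per butterfly. The block size of 256 and the helper name are illustrative assumptions; the benchmark selects the configuration per device as described in chapter 5.

// Launch one FFT stage with N/2 threads (one per butterfly operation).
void launch_stage(cpx *dev_data, unsigned int mask, float angle,
                  int steps, int dist, int n)
{
    dim3 block(256);                                // threads per block (illustrative)
    dim3 grid((n / 2 + block.x - 1) / block.x);     // enough blocks for n/2 threads
    cu_global<<<grid, block>>>(dev_data, mask, angle, steps, dist);
    // Kernels issued to the same stream run in order, so an explicit
    // cudaDeviceSynchronize() is only needed before reading back results.
}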

4.2 OpenCL

OpenCL is a framework and an open standard for writing programs that execute on multi-core platforms such as the CPU, GPU and Field-Programmable Gate Array (FPGA), among other processors and hardware accelerators. OpenCL uses a similar structure to CUDA: the language is based on C99 when programming a device. The standard is supplied by the Khronos Group and the implementation is supplied by the manufacturing company or device vendor, such as AMD, Intel, or NVIDIA.

__kernel void ocl_global(__global cpx *in, float angle, unsigned int mask, int steps, int dist)
{
    cpx w;
    int tid = get_global_id(0);

    // Input offset
    in += tid + (tid & mask);
    __global cpx *high = in + dist;

    // Twiddle factor
    angle *= ((tid << steps) & ((dist - 1) << steps));
    w.y = sincos(angle, &w.x);

    // Butterfly
    float x = in->x - high->x;
    float y = in->y - high->y;
    in->x = in->x + high->x;
    in->y = in->y + high->y;
    high->x = (w.x * x) - (w.y * y);
    high->y = (w.y * x) + (w.x * y);
}

Figure 4.2: OpenCL global kernel

OpenCL views the system from a perspective where computing resources (CPUs or other accelerators) are a number of compute devices attached to a host (a CPU). A program executed on a compute device is called a kernel. Programs in the OpenCL language are intended to be compiled at run-time to preserve portability between implementations from various host devices.

The OpenCL kernels are compiled by the host and then enqueued on a compute device. The kernel function that the host can enqueue is specified with __kernel. Data residing in global memory is specified in the parameter list by __global, and local memory has the specifier __local. What CUDA calls threads are in OpenCL terminology called work-items, and they are organized in work-groups.

Similarly to CUDA, the host application can poll the device for its capabilities and use some fast math functions. The equivalent of the CUDA kernel in figure 4.1 is implemented in OpenCL in figure 4.2 and displays only small differences. The OpenCL math function sincos is the equivalent of __sincosf.

[numthreads(GROUP_SIZE_X, 1, 1)]
void dx_global(uint3 threadIDInGroup : SV_GroupThreadID,
               uint3 groupID : SV_GroupID,
               uint groupIndex : SV_GroupIndex,
               uint3 dispatchThreadID : SV_DispatchThreadID)
{
    cpx w;
    int tid = groupID.x * GROUP_SIZE_X + threadIDInGroup.x;

    // Input offset
    int in_low = tid + (tid & mask);
    int in_high = in_low + dist;

    // Twiddle factor
    float a = angle * ((tid << steps) & ((dist - 1) << steps));
    sincos(a, w.y, w.x);

    // Butterfly
    float x = input[in_low].x - input[in_high].x;
    float y = input[in_low].y - input[in_high].y;
    rw_buf[in_low].x = input[in_low].x + input[in_high].x;
    rw_buf[in_low].y = input[in_low].y + input[in_high].y;
    rw_buf[in_high].x = (w.x * x) - (w.y * y);
    rw_buf[in_high].y = (w.y * x) + (w.x * y);
}

Figure 4.3: DirectCompute global kernel

4.3 DirectCompute

Microsoft DirectCompute is an API that supports GPGPU on Microsoft's Windows Operating System (OS) (Vista, 7, 8, 10). DirectCompute is part of the DirectX collection of APIs. DirectX was created to support computer games development for the Windows 95 OS. DirectCompute was initially released with the DirectX 11 API and has similarities with both CUDA and OpenCL. DirectCompute is designed and implemented with HLSL. The program (and kernel equivalent) is called a compute shader. The compute shader is not like the other types of shaders that are used in the graphics processing pipeline (like vertex or pixel shaders).

A difference from CUDA and OpenCL when implementing a compute shader compared to a kernel is the lack of C-like parameters: a constant buffer is used instead, where each value is stored in a read-only data structure. The setup shares similarities with OpenCL, and the program is compiled at run-time. The thread dimensions are built in as a constant value in the compute shader, and the block dimensions are specified at shader dispatch/execution.

As the code example in figure 4.3 demonstrates, the shader body is similar to that of CUDA and OpenCL.

void main()
{
    cpx w;
    uint tid = gl_GlobalInvocationID.x;

    // Input offset
    uint in_low = tid + (tid & mask);
    uint in_high = in_low + dist;

    // Twiddle factor
    float a = angle * ((tid << steps) & ((dist - 1) << steps));
    w.y = sin(a);
    w.x = cos(a);

    // Butterfly
    cpx low = data[in_low];
    cpx high = data[in_high];
    float x = low.x - high.x;
    float y = low.y - high.y;
    data[in_low] = cpx(low.x + high.x, low.y + high.y);
    data[in_high] = cpx((w.x * x) - (w.y * y), (w.y * x) + (w.x * y));
}

Figure 4.4: OpenGL global kernel

4.4 OpenGL

OpenGL shares much of the same graphics inheritance as DirectCompute but also provides a compute shader that breaks out of the graphics pipeline. OpenGL is managed by the Khronos Group and was released in 1992. Analogous to HLSL, OpenGL compute shaders are implemented in GLSL. The differences between the two are subtle, but include how arguments are passed and the use of specifiers. Figure 4.4 shows the OpenGL version of the global kernel.

4.5 OpenMP

Open Multi-Processing (OpenMP) is an API for multi-platform shared-memory multiprocessing programming. It uses a set of compiler directives and library routines to implement multithreading. OpenMP uses a master thread that forks slave threads, and the work is divided among them. The threads run concurrently and are allocated to different processors by the runtime environment. The parallel section of the code is marked with preprocessor directives (#pragma), and when the threads are running the code they can access their respective id with the omp_get_thread_num() call. When the section has been processed, the threads join back into the master thread (with id 0).

Figure 4.5 shows how the for-loop section is parallelized by scheduling the workload evenly with the static keyword. An important difference from the GPU implementations is that the twiddle factors are computed in advance and stored in memory. Another difference is the number of threads, which is a fixed number where each thread works on a consecutive span of the iterated butterfly operations.

void add_sub_mul(cpx *l, cpx *u, cpx *out, cpx *w)
{
    float x = l->x - u->x;
    float y = l->y - u->y;
    *out = { l->x + u->x, l->y + u->y };
    *(++out) = { (w->x * x) - (w->y * y), (w->y * x) + (w->x * y) };
}

void fft_stage(cpx *i, cpx *o, cpx *w, uint m, int r)
{
    #pragma omp parallel for schedule(static)
    for (int l = 0; l < r; ++l)   // one butterfly per iteration (constant geometry indexing)
        add_sub_mul(i + l, i + r + l, o + (l << 1), w + (l & m));
}

Figure 4.5: OpenMP procedure completing one stage

4.6 External libraries

External libraries were selected for reference values. FFTW and cuFFT were selected because they are frequently used in other papers. clFFT was selected on the assumption that it is the AMD equivalent of cuFFT for AMD graphics cards.

FFTW

Fastest Fourier Transform in the West (FFTW) is a C subroutine library for computing the DFT. FFTW is free software [11] that has been available since 1997, and several papers have been published about FFTW [12, 13, 14]. FFTW supports a variety of algorithms, and by estimating performance it builds a plan to execute the transform. The estimation can be done either by performance-testing an extensive set of algorithms, or by using a few known fast algorithms.

cuFFT

The library cuFFT (the NVIDIA CUDA Fast Fourier Transform product) [21] is designed to provide high performance on NVIDIA GPUs. cuFFT uses algorithms based on the Cooley-Tukey and the Bluestein algorithms [4].

clFFT

The library clFFT, found in [3], is part of the open source AMD Compute Libraries (ACL) [2]. According to an AMD blog post [1], the library performs at a similar level to cuFFT¹.
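For reference, the plan-then-execute pattern of FFTW looks roughly as follows (single-precision interface). FFTW_MEASURE triggers the extensive performance testing mentioned above, while FFTW_ESTIMATE picks a plan from heuristics; the surrounding function is only a usage sketch, not part of the benchmark code.

#include <fftw3.h>

/* Plan once (possibly expensive), then execute the transform. */
void fftw_forward_example(int n)
{
    fftwf_complex *in  = fftwf_malloc(sizeof(fftwf_complex) * n);
    fftwf_complex *out = fftwf_malloc(sizeof(fftwf_complex) * n);

    fftwf_plan plan = fftwf_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    /* ... fill 'in' with data ... */
    fftwf_execute(plan);          /* the transform is written to 'out' */

    fftwf_destroy_plan(plan);
    fftwf_free(in);
    fftwf_free(out);
}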

¹The performance tests were done using a K40 for cuFFT and an AMD FirePro W9100 for clFFT.

5 Implementation

The FFT application has been implemented in C/C++, CUDA, OpenCL, DirectCompute, and OpenGL. The application was tested on a GeForce GTX 670 and a Radeon R7 260X graphics card, and on an Intel Core i7 3770K 3.5 GHz CPU.

5.1 Benchmark application GPU

5.1.1 FFT

Setup

The implementation of the FFT algorithm on a GPU can be broken down into steps; see figure 5.1 for a simplified overview. The application setup differs among the tested technologies, but some steps can be generalized: get platform and device information, allocate device buffers, and upload data to the device.

The next step is to calculate the specific FFT arguments for an N-point sequence for each kernel. The most important differences between devices and platforms are the local memory capacity and the thread and block configuration. The number of threads per block was selected for the best performance. See table 5.1 for details.

[CPU-side events: Setup, Calc. FFT Arguments, Launch Kernel, Launch Kernel, Fetch Data, Free Resources — with the corresponding GPU-side events Upload Data, Global Kernel, Local Kernel.]

Figure 5.1: Overview of the events in the algorithm.


Device            Technology   Threads / Block   Max Threads   Shared memory
GeForce GTX 670   CUDA         1024              1024          49152
                  OpenCL       512               1024          49152
                  OpenGL       1024              1024          32768
                  DirectX      1024              1024          32768
Radeon R7 260X    OpenCL       256               256           32768
                  OpenGL       256               256           32768
                  DirectX      256               256           32768

Table 5.1: Shared memory size in bytes, threads and block configuration per device.

Thread and block scheme

The threading scheme was one butterfly per thread, so that a sequence of sixteen points requires eight threads. Each platform was configured to a number of threads per block (see table 5.1): any sequence requiring more butterfly operations than the threads-per-block configuration needed the computations to be split over several blocks. In the case of a sequence exceeding one block, the sequence is mapped over the blockIdx.y dimension with size gridDim.y. The block dimensions are limited to 2^31, 2^16, and 2^16 for x, y, and z respectively. Example: if the threads-per-block limit is two, then four blocks would be needed for a sixteen-point sequence.

Synchronization

Thread synchronization is only available for threads within a block. When the sequence or partial sequence fitted within a block, that part was transferred to local memory before computing the last stages. If the sequence was larger and required more than one block, the synchronization was handled by launching several kernels in the same stream to be executed in sequence. The kernel launched for block-wide synchronization is called the global kernel, and the kernel for thread synchronization within a block is called the local kernel. The global kernel is an implementation of the Cooley-Tukey FFT algorithm, and the local kernel uses constant geometry (the same indexing for every stage). The last stage outputs data from the shared memory in bit-reversed order to the global memory. See figure 5.2, where the sequence length is 16 and the threads per block is set to two.

Calculation

The indexing for the global kernel is calculated from the thread id and block id (threadIdx.x and blockIdx.x in CUDA) as seen in figure 5.3. Input and output are located by the same index. Index calculation for the local kernel is done once for all stages, see figure 5.4. These indexes are separate from the indexing in the global memory. The global memory offset depends on threads per block (blockDim.x in CUDA) and block id.


Figure 5.2: Flow graph of a 16-point FFT using (stages 1 and 2) the Cooley-Tukey algorithm and (stages 3 and 4) the constant geometry algorithm. The solid box is the bit-reverse order output. Dotted boxes are separate kernel launches; dashed boxes are data transferred to local memory before computing the remaining stages.

int tid = blockIdx.x * blockDim.x + threadIdx.x,
    io_low = tid + (tid & (0xFFFFFFFF << stages_left)),
    io_high = io_low + (N >> 1);

Figure 5.3: CUDA example code showing index calculation for each stage in the global kernel, where N is the total number of points. io_low is the index of the first input in the butterfly operation and io_high the index of the second.

int n_per_block = N / gridDim.x,
    in_low = threadIdx.x,
    in_high = threadIdx.x + (n_per_block >> 1),
    out_low = threadIdx.x << 1,
    out_high = out_low + 1;

Figure 5.4: CUDA code for index calculation of points in shared memory.

Technology      Language      Signature
CUDA            C extension   __brev(unsigned int);
OpenGL          GLSL          bitfieldReverse(unsigned int);
DirectCompute   HLSL          reversebits(unsigned int);

Table 5.2: Integer intrinsic bit-reverse function for different technologies.

The last operation after the last stage is to perform the bit-reverse indexing operation; this is done when writing from shared to global memory. An implementation of bit-reverse is available as an intrinsic integer instruction (see table 5.2). If the bit-reverse instruction is not available, figure 5.5 shows the code used instead. The bit-reversed value has to be right-shifted by the number of zeroes leading the number in a 32-bit integer value. Figure 5.2 shows the complete bit-reverse operations of a 16-point sequence in the output step after the last stage.

5.1.2 FFT 2D

The FFT of 2D data, such as an image, is computed by first transforming each row (each row as a separate sequence) and then applying an equal transform to each column. The application performs a row-wise transformation followed by a transpose of the image, to reuse the row-wise transform procedure for the columns. This method gives better memory locality when transforming the columns. A transformed image is shown in figure 5.6.

x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));
x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));
x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));
x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));
return ((x >> 16) | (x << 16));

Figure 5.5: Code returning a bit-reversed unsigned integer where x is the input. Only 32-bit integer input and output.

(a) Original image (b) Magnitude representation

Figure 5.6: Original image in 5.6a transformed and represented with a quad- rant shifted magnitude visualization (scale skewed for improved illustra- tion) in 5.6b.

The difference between the FFT kernel for 1D and 2D is the indexing scheme. 2D rows are indexed with blockIdx.x, and columns with threadIdx.x added to an offset of blockIdx.y · blockDim.x.

Transpose

The transpose kernel uses a different index mapping of the 2D data and a different threads-per-block configuration than the FFT kernel. The data is tiled in a grid pattern where each tile represents one block, indexed by blockIdx.x and blockIdx.y. The tile size is a multiple of 32 in both dimensions and limited by the size of the shared memory buffer, see table 5.1 for the specific size per technology. To avoid shared memory banking issues, the last dimension is increased by one (the extra column is not used for data). However, resolving the banking issue has little effect on the total running time, so when shared memory is limited to 32768 bytes the extra column is not used. The tile rows and columns are divided over the threadIdx.x and threadIdx.y indexes respectively. See figure 5.7 for a code example of the transpose kernel.

Shared memory example: The CUDA shared memory can hold 49152 bytes and a single data point requires sizeof(float) · 2 = 8 bytes. That leaves room for a tile size of 64 · (64 + 1) · 8 = 33280 bytes, where 64 is the highest power of two that fits.

The transpose kernel uses the shared memory and tiling of the image to avoid large strides through global memory. Each block represents a tile in the image. The first step is to write the complete tile to shared memory and synchronize the threads before writing to the output buffer. Both reading from the input memory and writing to the output memory are performed in close stride. Figure 5.8 shows how the transpose is performed in memory.

__global__ void transpose(cpx *in, cpx *out, int n)
{
    uint tx = threadIdx.x;
    uint ty = threadIdx.y;
    __shared__ cpx tile[CU_TILE_DIM][CU_TILE_DIM + 1];
    // Write to shared (tile) from global memory (in)
    int x = blockIdx.x * CU_TILE_DIM + tx;
    int y = blockIdx.y * CU_TILE_DIM + ty;
    for (int j = 0; j < CU_TILE_DIM; j += CU_BLOCK_DIM)
        for (int i = 0; i < CU_TILE_DIM; i += CU_BLOCK_DIM)
            tile[ty + j][tx + i] = in[(y + j) * n + (x + i)];
    __syncthreads();
    // Write to global (out) from shared memory (tile)
    x = blockIdx.y * CU_TILE_DIM + tx;
    y = blockIdx.x * CU_TILE_DIM + ty;
    for (int j = 0; j < CU_TILE_DIM; j += CU_BLOCK_DIM)
        for (int i = 0; i < CU_TILE_DIM; i += CU_BLOCK_DIM)
            out[(y + j) * n + (x + i)] = tile[tx + i][ty + j];
}

Figure 5.7: CUDA device code for the transpose kernel.

5.1.3 Differences

Setup

The majority of differences in the implementations are related to the setup phase. The CUDA implementation is the most straightforward, calling the procedure cudaMalloc() to allocate a buffer and cudaMemcpy() to populate it. With CUDA you can write the device code in the same file as the host code and share functions. OpenCL and OpenGL require a char * buffer as the kernel source, which is most practical if written in a separate source file and read as a file stream into a char buffer. DirectCompute shaders are most easily compiled from file. Figure 5.9 gives a simplified overview of how to set up the kernels in the different technologies.

Kernel execution

A kernel launch in CUDA is like a procedure call in C/C++, with the difference that the kernel launch syntax requires <<< and >>> to set the block and thread dimensions. The special syntax is used in between the name and the function operators (), for example: kernel_name<<<blocks, threads>>>(args).


Figure 5.8: Illustration of how shared memory is used in transposing an image. Input data is tiled and each tile is written to shared memory and transposed before written to the output memory.

(a) CUDA setup:
    // Allocate memory buffer on device
    cudaMalloc
    // Upload data
    cudaMemcpy

(b) OpenCL setup:
    // Select platform from available
    clGetPlatformIDs
    clGetDeviceIDs
    clCreateContext
    clCreateCommandQueue
    clCreateBuffer
    // Upload data
    clEnqueueWriteBuffer
    // Create kernel
    clCreateProgramWithSource
    clBuildProgram
    clCreateKernel

(c) DirectCompute setup:
    // Select adapter from enumeration (not shown)
    D3D11CreateDevice
    device->CreateBuffer
    // Data upload
    context->UpdateSubresource
    // Creation of views for access
    device->CreateShaderResourceView
    device->CreateUnorderedAccessView
    D3DCompileFromFile
    device->CreateComputeShader
    context->CSSetConstantBuffers

(d) OpenGL setup:
    // OpenGL needs a valid rendering context, supplied by the OS
    glutInit(&argc, argv);
    glutCreateWindow("GLContext");
    glewInit();
    // Create compute shader
    glCreateProgram();
    glCreateShader(GL_COMPUTE_SHADER);
    glShaderSource
    glCompileShader
    glAttachShader
    glLinkProgram
    // Create buffer and upload
    glGenBuffers
    glBindBufferBase
    glBufferData

Figure 5.9: An overview of the setup phase for the GPU technologies.

The other technologies require some more setup. OpenCL and OpenGL need one function call per parameter: OpenCL maps a parameter by index, whereas OpenGL maps it by name (a string). In DirectCompute a constant buffer is suitable for the parameters; the constant buffer is read-only and accessible globally in the kernel. DirectCompute and OpenGL share a similar launch style where the compute shader is set as the current program on the device context, and a dispatch call is made with the group (block) configuration. See table 5.3 for a list of how the kernels are launched.

CUDA:
    cuda_kernel<<<blocks, threads>>>(in, out, ...);

OpenCL:
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out);
    /* Set rest of the arguments. */
    clEnqueueNDRangeKernel(cmd_queue, kernel, dim, 0, work_sz, ...);

DirectCompute:
    context->CSSetUnorderedAccessViews(0, 1, output_uav, NULL);
    context->CSSetShaderResources(0, 1, &input_srv);
    context->CSSetShader(compute_shader, nullptr, 0);
    arguments = { /* Struct holding all arguments */ ... };
    dx_map_args(context, constant_buffer, &arguments);
    context->Dispatch(groups.x, groups.y, groups.z);

OpenGL:
    glUseProgram(program);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, buffer_in);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, buffer_out);
    glUniform1f(glGetUniformLocation(program, "angle"), angle);
    /* Set rest of the arguments. */
    glDispatchCompute(groups.x, groups.y, groups.z);

Table 5.3: Table illustrating how to set parameters and launch a kernel.

Kernel code

The kernel code for each technology has a few differences. CUDA has the strongest support for a C/C++-like language and only adds a function type specifier: the kernel program is made accessible from the host via the __global__ specifier. OpenCL shares much of this but is restricted to a C99 style in the current version (2.0). A difference is how global and local buffers are referenced; these must be declared with the specifier __global or __local. DirectCompute and OpenGL compute shaders are coded in HLSL and GLSL respectively. These languages are similar and share the same level of restrictions compared to CUDA C/C++ and OpenCL C code: device functions cannot use pointers or recursion. However, these restrictions are of little importance for performance, since all code is inlined when building and compiling the kernel program.

Synchronization

Synchronization at thread and block level differs. In CUDA, threads within a block can be synchronized directly on the GPU. For block-level synchronization, the host is required, by building the stream or queue of commands; the equivalent exists in all technologies. Device and host synchronization is similar to block-level synchronization; however, in DirectCompute this is not done trivially with a blocking function call from the host. CUDA, OpenCL and DirectCompute use the same kernel stream to execute kernels sequentially, while sequential execution in OpenGL is accomplished by the use of glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT) in between launches. See table 5.4 for the synchronization functions used.

Figure 5.10: Overview of the OpenMP implementation transforming a sequence of size N: Setup, Twiddle Factors, Calculate Stage (repeated log2(N) times), and Bit Reverse Order.

Technology       Threads in block                      Host and device
CUDA             __syncthreads();                      cudaDeviceSynchronize();
OpenCL           barrier(CLK_LOCAL_MEM_FENCE);         clFinish(cmd_queue);
OpenGL           barrier();                            glFinish();
DirectCompute    GroupMemoryBarrierWithGroupSync();    -

Table 5.4: Synchronization in GPU technologies.
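As an illustration of the kernel-side constructs discussed above, the following minimal CUDA sketch (not the thesis kernel; the butterfly arithmetic is omitted) shows the __global__ entry point, dynamically sized shared memory, and the block-level barrier __syncthreads():

    // Minimal sketch: load one element per thread into shared memory,
    // synchronize the block, and write the result back.
    __global__ void stage_kernel(const float2 *in, float2 *out, int n)
    {
        extern __shared__ float2 buf[];        // size given at launch
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;
        if (gid < n)
            buf[tid] = in[gid];
        __syncthreads();                       // all threads in the block are done loading
        /* butterfly arithmetic on buf[] would go here */
        if (gid < n)
            out[gid] = buf[tid];
    }

    // Launched e.g. as:
    // stage_kernel<<<blocks, threads, threads * sizeof(float2)>>>(d_in, d_out, n);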

5.2 Benchmark application CPU

5.2.1 FFT with OpenMP

The OpenMP implementation benefits in performance from calculating the twiddle factors in advance. The calculated values are stored in a buffer accessible from all threads. The next step is to calculate each stage of the FFT algorithm. Last is the output index calculation, where the elements are reordered. See figure 5.10 for an overview.

Twiddle factors

The twiddle factors are stored for each butterfly operation. To save time, only the real parts are calculated and the imaginary parts are retrieved from the real parts, due to the fact that sin(x) = −cos(π/2 + x) and sin(π/2 + x) = cos(x). See table 5.5 for an example. The calculations are split among the threads by static scheduling in two steps: first calculate the real values, then copy from real to imaginary.

Butterfly

The butterfly operation uses the constant geometry index scheme. The indexes are not stored from one stage to the next, but the scheme makes the output come in continuous order. The butterfly operations are split among the threads by static scheduling.

Bit-Reversed Order

See figure 5.11 for code showing the bit-reverse ordering operation in C/C++.

Twiddle factor table W:

 i    Re(W[i])      Im(W[i])
 0    cos(α · 0)     Re(W[4])
 1    cos(α · 1)     Re(W[5])
 2    cos(α · 2)     Re(W[6])
 3    cos(α · 3)     Re(W[7])
 4    cos(α · 4)    −Re(W[0])
 5    cos(α · 5)    −Re(W[1])
 6    cos(α · 6)    −Re(W[2])
 7    cos(α · 7)    −Re(W[3])

Table 5.5: Twiddle factors for a 16-point sequence where α equals (2 · π)/16. Each row i corresponds to the ith butterfly operation.
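A minimal sketch of the two-step precomputation (assuming a cpx struct with members r and i, as used in figure 5.11, and the convention from table 5.5 where α = 2π/N and N/2 factors are stored):

    #include <math.h>

    typedef struct { float r, i; } cpx;   // assumed layout of the complex type

    // Step 1: compute the real parts. Step 2: copy (and negate) real parts into
    // the imaginary parts according to the symmetry shown in table 5.5.
    void twiddle_factors(cpx *w, int N)
    {
        const float alpha = 2.0f * 3.14159265358979f / (float)N;
        const int n = N / 2;              // one factor per butterfly operation
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i)
            w[i].r = cosf(alpha * i);
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n / 2; ++i) {
            w[i].i         =  w[i + n / 2].r;   // equals -sin(alpha * i)
            w[i + n / 2].i = -w[i].r;           // equals -sin(alpha * (i + n/2))
        }
    }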

void omp_bit_reverse(cpx *x, int leading_bits, int N)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
        int p = bit_reverse(i, leading_bits);
        if (i < p)
            swap(&(x[i]), &(x[p]));
    }
}

Figure 5.11: C/C++ code performing the bit-reverse ordering of an N-point sequence.
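The bit_reverse helper called in figure 5.11 is not shown; one common way to implement it, assuming leading_bits is the number of positions to shift out after reversing the full 32-bit word (i.e. 32 − log2(N)), is the classic mask-and-swap reversal:

    // Hypothetical helper matching the call in figure 5.11 (semantics assumed).
    static inline unsigned int bit_reverse(unsigned int x, int leading_bits)
    {
        x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);   // swap adjacent bits
        x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);   // swap bit pairs
        x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);   // swap nibbles
        x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);   // swap bytes
        x = (x << 16) | (x >> 16);                                 // swap half-words
        return x >> leading_bits;                                  // keep the significant bits
    }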

5.2.2 FFT 2D with OpenMP

The implementation of the 2D FFT with OpenMP runs the transforms row-wise, transposes the image, and repeats. The twiddle factors are calculated once and stay the same.
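A minimal sketch of the transpose step under these assumptions (square n × n data, out-of-place, cpx as in figure 5.11); the full 2D transform then amounts to a 1D FFT over every row, a transpose, a second row-wise pass, and a transpose back:

    // Out-of-place transpose parallelized over rows with OpenMP.
    void transpose(const cpx *in, cpx *out, int n)
    {
        #pragma omp parallel for schedule(static)
        for (int y = 0; y < n; ++y)
            for (int x = 0; x < n; ++x)
                out[x * n + y] = in[y * n + x];
    }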

5.2.3 Differences with GPU

The OpenMP implementation is different from the GPU implementations in two ways: the twiddle factors are pre-computed, and all stages use the constant geometry algorithm. The number of parallel threads on the Intel Core i7 3770K 3.5GHz CPU is four, whereas on the GPU it is up to 192 (seven warps, one per Streaming Multiprocessor (SM), with 32 threads each).

Platform      Model                   Library name    Version
NVIDIA GPU    GeForce GTX 670         cuFFT           7.5
AMD GPU       Radeon R7 260X          clFFT           2.8.0
Intel CPU     Core i7 3770K 3.5GHz    FFTW            3.3.4

Table 5.6: Libraries included to compare with the implementation.

5.3 Benchmark configurations

5.3.1 Limitations

All implementations are limited to handling sequences of length 2^n, or 2^m × 2^m in the 2D case, where n and m are integers with a maximal value of n = m + m = 26. The selected GPUs have a maximum of 2 GB global memory available. The limitation is required since the implementation uses a total of 2^26 · sizeof(float2) · 2 = 1073741824 bytes. However, on the Radeon R7 R260X card, problems with DirectCompute and OpenGL set the limit lower: DirectCompute handled sizes of n <= 2^24, and OpenGL n <= 2^24 and m <= 2^9.

5.3.2 Testing

All tests executed on the GPU utilize some implementation of event time stamps. The time stamp event retrieves the actual start of the kernel if the current stream is busy. The CPU implementations used the Windows QueryPerformanceCounter function, which provides a high-resolution (< 1 µs) time stamp.

5.3.3 Reference libraries

One reference library per platform was included to compare how well the FFT implementation performed. The FFTW library for the CPU runs a planning scheme to create an execution plan for each environment and data input size. A similar strategy is used in the cuFFT and clFFT libraries, used for the GeForce GTX 670 and the Radeon R7 R260X respectively. Table 5.6 sums up information about the external libraries.
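As a concrete illustration of the event-based timing described in section 5.3.2, a minimal CUDA sketch could look as follows (the wrapper function and launch callback are illustrative, not the benchmark code; the other technologies use their own timestamp queries):

    #include <cuda_runtime.h>

    // Measure the GPU time spent between two events recorded around the launches.
    float time_kernel_ms(void (*launch)(void))
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);        // timestamped when the stream reaches this point
        launch();                      // the kernel launch(es) to be measured
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);    // block the host until the stop event completes
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }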

6 Evaluation

6.1 Results

The results will be shown for the two graphics cards, GeForce GTX 670 and Radeon R7 R260X, where the technologies were applicable. The tested technologies are shown in table 6.1. The basics of each technology or library are explained in chapter 4.

The performance measure is the total execution time for a single forward transform using two buffers: one input and one output buffer. The input size range of the implementation is limited by the hardware (graphics card primary memory). However, there are some unsolved issues near the upper limit for some technologies on the Radeon R7 R260X.

CUDA is the primary technology and the GeForce GTX 670 graphics card is the primary platform. All other implementations are ported from the CUDA implementation. To compare the implementation, external libraries are included; they can be found in italics in table 6.1. Note that the clFFT library could not be measured in the same manner as the other GPU implementations: the times are measured at the host, and short sequences suffer from large overhead.

The experiments are tuned with two parameters: the number of threads per block, and the tile dimensions in the transpose kernel; see chapter 5 and table 5.1.

1 Free software, available at [11].
2 Available through the CUDA Toolkit at [22].
3 OpenCL FFT library, available at [3].


Platform                          Tested technology
Intel Core i7 3770K 3.5GHz CPU    C/C++, OpenMP, FFTW¹
GeForce GTX 670                   CUDA, OpenCL, DirectCompute, OpenGL, cuFFT²
Radeon R7 R260X                   OpenCL, DirectCompute, OpenGL, clFFT³

Table 6.1: Technologies included in the experimental setup.

6.1.1 Forward FFT

The results for a single transform over a 2^n-point sequence are shown in figure 6.1 for the GeForce GTX 670, the Radeon R7 R260X, and the Intel Core i7 3770K 3.5GHz CPU.

The CUDA implementation was the fastest on the GeForce GTX 670 over most sequence lengths. The OpenCL implementation was the only technology that could run the whole test range on the Radeon R7 R260X; DirectCompute was limited to 2^24 points and OpenGL to 2^23 points. A normalized comparison using CUDA and OpenCL is shown in figure 6.2.

The experimental setup for the CPU involved low overhead, and the short sequences could not be measured accurately. This is shown as 0 µs in the figures. Results from comparing the sequential C/C++ and multi-core OpenMP implementations with CUDA are shown in figure 6.3. FFTW was included and demonstrates how an optimized (per n-point length) sequential CPU implementation performs.

Results from comparing the implementations on the different graphics cards are shown in figure 6.4; the results are normalized to the results of the tests on the GeForce GTX 670. DirectCompute, OpenGL, and OpenCL were supported on both graphics cards; the result of normalizing the times with the time of the OpenCL implementation is shown in figure 6.5.

Figure 6.1: Overview of the results of a single forward transform (time in µs versus n-point sequence length) for (a) the GeForce GTX 670 (CUDA, DirectCompute, OpenGL, OpenCL, cuFFT, FFTW, OpenMP) and (b) the Radeon R7 R260X (OpenCL, DirectCompute, OpenGL, clFFT (host time), FFTW, OpenMP). The clFFT was timed by host synchronization, resulting in an overhead in the range of 60 µs. Lower is faster.

Figure 6.2: Performance relative to the CUDA implementation in 6.2a (GeForce GTX 670) and to OpenCL in 6.2b (Radeon R7 R260X). Lower is better.

Figure 6.3: Performance relative to the CUDA implementation on the GeForce GTX 670 and the Intel Core i7 3770K 3.5GHz CPU (CUDA, FFTW, OpenMP, C/C++).

Figure 6.4: Comparison between the Radeon R7 R260X and the GeForce GTX 670: relative performance per technology and graphics card (OpenCL, DirectCompute, and OpenGL on the Radeon R7 R260X relative to the GeForce GTX 670).

Figure 6.5: Performance relative to OpenCL accumulated from both cards (OpenCL, DirectCompute, OpenGL).

6.1.2 FFT 2D

The equivalent test was done for 2D data represented by an image of size m × m. The image contained three channels (red, green, and blue) and the transform was performed over one channel. Figure 6.6 shows an overview of the results for image sizes ranging from 2^6 × 2^6 to 2^13 × 2^13.

All implementations compared to CUDA on the GeForce GTX 670 and to OpenCL on the Radeon R7 R260X are shown in figure 6.7. The OpenGL implementation failed at images larger than 2^11 × 2^11 points.

The results of comparing the GPU and CPU handling of a 2D forward transform are shown in figure 6.8. A comparison of the two cards is shown in figure 6.9. DirectCompute, OpenGL, and OpenCL were supported on both graphics cards; the result of normalizing the times with the time of the OpenCL implementation is shown in figure 6.10.

Figure 6.6: Overview of the results of measuring the time of a single 2D forward transform (time in µs versus n × n-point image size) for (a) the GeForce GTX 670 and (b) the Radeon R7 R260X.

Figure 6.7: Time of the 2D transform relative to CUDA in 6.7a (GeForce GTX 670) and relative to OpenCL in 6.7b (Radeon R7 R260X).

Figure 6.8: 2D transform performance relative to the CUDA implementation on the GeForce GTX 670 and the Intel Core i7 3770K 3.5GHz CPU (CUDA, FFTW, OpenMP, C/C++).

Figure 6.9: Comparison between the Radeon R7 R260X and the GeForce GTX 670 running the 2D transform: relative performance per technology and graphics card.

Figure 6.10: Performance relative to OpenCL accumulated from both cards (OpenCL, DirectCompute, OpenGL), for the 2D transform.

6.2 Discussion

The most widely known technologies for GPGPU, judging by the interest in other research, are CUDA and OpenCL. The comparisons in earlier work have focused primarily on these two [10, 25, 26]. Bringing DirectCompute (or Direct3D Compute Shader) and the OpenGL Compute Shader to the table makes for an interesting mix, since the result of the experiment is that both are strong alternatives in terms of raw performance.

The most accurate and fair comparison on a GPU is when the amount of data is scaled up; the number of elements should be at least in the order of 2^12. If the GPU's streaming multiprocessors are not fully saturated, there is less to gain from moving work away from the CPU. One idea is to make sure that even short sequences are calculated in batches. The results from running the benchmark application on small sets of data are more or less discarded in the evaluation.

The CPU vs GPU

The implementation aimed at two-dimensional data was, however, successful at proving the strength of the GPU versus the CPU. When running a 2D FFT over large data, the CPU is a factor of 40 slower than the GPU. Compared to the multi-core OpenMP solution, the difference is still a factor of 15, and even the optimized FFTW solution is a factor of 10 slower. As a side note, cuFFT is 36 times faster than FFTW on large enough sequences, even though they use the same strategy (build an execution plan based on the current hardware and data size) and likely use different hard-coded unrolled FFTs for smaller sizes.

The GPU

The unsurprising result from the experiments is that CUDA is the fastest technology on the GeForce GTX 670, but only by a small margin. What might come as a surprise is the strength of the DirectCompute implementation: it goes head-to-head with CUDA (only slightly slower) on the GeForce GTX 670, and performs equally to (or slightly faster than) OpenCL.

OpenGL performs on par with DirectCompute on the Radeon R7 R260X. The exception is the long sequences that fail with the OpenGL solution on the Radeon R7 R260X, sequences that otherwise work on the GeForce GTX 670. The performance of the OpenGL tests is equal to or better than OpenCL in 1D, and outperforms OpenCL in 2D.

The biggest surprise is actually the OpenCL implementation, falling behind by a relatively big margin on both graphics cards. This large margin was not anticipated based on comparisons in other papers. Effort has been made to ensure that the code does in fact run fairly compared to the other technologies. The ratio for OpenCL versus CUDA on long sequences is about 1.6 and 1.8 times slower for 1D and 2D respectively on the GeForce GTX 670. Figures 6.5 and 6.10 show that DirectCompute is faster, at a factor of about 0.8 of the execution time of OpenCL. The same comparison of OpenCL and OpenGL shows similar results.

The one thing that goes in favor of OpenCL is that the implementation scaled without problems: all sequences were computed as expected. Figures 6.2b and 6.7b show that something happened with the other implementations; even clFFT had problems with the last sequence, and OpenGL and DirectCompute could not execute all sequences.

External libraries

Both FFTW on the CPU and cuFFT on the GeForce GTX 670 proved to be very mature and optimized solutions, far faster than any of my implementations on the respective platform. Not included in the benchmark application is a C/C++ implementation that partially used the concept of FFTW (a decomposition with hard-coded unrolled FFTs for short sequences) and was fairly fast at short sequences compared to FFTW. Its scalability proved to be poor, however, and it provided very little gain in time compared to much simpler implementations such as the constant geometry algorithm.

cuFFT proved stable and much faster than any other implementation on the GPU. The GPU proved stronger than the CPU at data sizes of 2^12 points or larger; this does not include memory transfer times. Comparing cuFFT with clFFT was possible on the GeForce GTX 670, but that proved only that clFFT was not at all written for that architecture and was much slower at all data sizes. A big problem when including the clFFT library was that measuring by events on the device failed, and measuring at the host included an overhead. Short to medium-long sequences suffered much from this overhead; a quick inspection suggests close to 60 µs (compared to a total runtime of around 12 µs for the OpenCL implementation on short sequences). Not until sequences reached 2^16 elements or greater could the library beat the implementations in the application. The results are not that good either; a possible explanation is that clFFT is not at all efficient at executing transforms in small batches. The blog post at [1] suggests a completely different result when running in batch and on GPUs designed for computations. The conclusions in this work are based on cards targeting the gaming consumer market and variable-length sequences.

6.2.1 Qualitative assessment

When working with programming, raw performance is seldom the only requirement. This subsection provides qualitative assessments of the technologies used.

Scalability of problems

The different technologies are restricted in different ways. CUDA and OpenCL are device-limited and suggest polling the device for capabilities and limitations. DirectCompute and OpenGL are standardized with each supported version. An example of this is the shared memory size limit: CUDA allowed for full access, whereas DirectCompute was limited to an API-specific size not bound to the specific device. The advantage of this is the ease of programming with DirectCompute and OpenGL, knowing that a minimum of support can be expected at a certain feature support version. Both DirectCompute and OpenGL had trouble when data sizes grew; there were no such indications when using CUDA and OpenCL.

Portability

OpenCL has the key feature of being portable and open to the many architectures that enable computations. However, as stated in [10, 9], performance is not portable over platforms, but this can be addressed with auto-tuning on the targeted platform. There were no problems running the code on different graphics cards with either OpenCL or DirectCompute. OpenGL proved to be more problematic with two cards connected to the same host: the platform-specific solution, using either OS tweaking or device-specific OpenGL extensions, made OpenGL less convenient as a GPGPU platform. CUDA is a proprietary technology and only usable with NVIDIA's own hardware. Moving away from the GPU, the only remaining technology is OpenCL, and here is where it excels among the others.
This was not in the scope of the thesis; however, it is worth noting that it would be applicable with minor changes to the application.

Programmability

The experience of this work was that CUDA was by far the least complicated to implement. It required the fewest lines of code to get started, and its C/C++-like kernel code had fewer limitations than the other technologies.

The CUDA community and online documentation are full of useful information, and finding solutions to problems was relatively easy. The documentation⁴ provided guidance for most applications.

The OpenCL implementation was not as straightforward as CUDA. The biggest difference is the setup. Some of the differences in the setup are:

• Device selection does not need to be done actively in CUDA.
• A command queue or stream is created by default in CUDA.
• CUDA creates and builds the kernel at compile-time instead of run-time.

Both DirectCompute and OpenGL follow this pattern, although they inherently suffer from graphics-specific abstractions. The experience was that creating and handling memory buffers was more prone to mistakes: extra steps were introduced to create and use the memory in a compute shader compared to a CUDA or OpenCL kernel.

The biggest issue with OpenGL is the way the device is selected; it is handled by the OS. Firstly, in the case of running Windows 10, the card had to be connected to a screen. Secondly, that screen needed to be selected as the primary screen. This is also a problem when using services based on the Remote Desktop Protocol (RDP), which enables the user to log in to a computer remotely. This works for the other technologies but not for OpenGL. Not all techniques for remote access have this issue, but it is convenient if the native tool in the OS supports GPGPU features such as selecting the device, especially when running a benchmarking application.

6.2.2 Method

The first issue that has to be highlighted is the fact that the 1D FFTs were measured by single-sequence execution instead of executing sequences in batch. The 2D FFT implementation did inherently run several 1D sequences in batch and provided relevant results. One solution would have been to modify the 2D transform to accept a batch of sequences organized as 2D data; the second part, performing the column-wise transforms, would then be skipped.

The two graphics cards used are not each other's counterparts; their releases differ by 17 months (the GeForce GTX 670 was released May 10, 2012, the Radeon R7 R260X on October 8, 2013). The recommended release price hints at the targeted audience and capacity of the cards: the GeForce GTX 670 was priced at $400 compared to the Radeon R7 R260X at $139. Looking at the OpenCL performance on both cards, as seen in figure 6.9, reveals that short to medium sequences perform about the same, but longer sequences go in favour of the GeForce GTX 670.

Algorithm

The FFT implementation was rather straightforward and without heavy optimization. The implementation lacked in performance compared to the NVIDIA-developed cuFFT library.

Figure 6.7a shows cuFFT performing three to four times faster than the benchmark application, so obviously there is a lot to improve. This is not a big issue when benchmarking, but it removes some of the credibility of the algorithm as something practically useful. By examining the code with NVIDIA Nsight⁵, the bottleneck was found to be the memory access pattern when outputting data after bit-reversing the index: there were no coalesced accesses and bad data locality. There are algorithms that solve this by combining the index transpose operations with the FFT computation, as in [17].

Some optimizations were attempted during the implementation phase and later abandoned. The reason the attempts were abandoned was lack of time, or no measurable improvement (or even worse performance). The use of shared memory provides fewer global memory accesses, and the shared memory can be used in optimized ways, such as avoiding bank conflicts⁶. This was successfully tested with the CUDA technology, with no bank conflicts at all, but gave no measurable gain in performance: the relative time gained compared to global memory access was likely too small, or was negated by the fact that an overhead was introduced and more shared memory had to be allocated per block.

The use of shared memory in the global kernel, combined with removing most of the host synchronization, is a potentially good optimization. The time distribution during this thesis did not allow further optimizations, and the attempts to implement this in a short time were never completed successfully. Intuitively, the use of shared memory in the global kernel would decrease the global memory accesses and reduce the total number of kernel launches to ⌈log2(N / Nblock)⌉, compared to log2(N) − log2(Nblock) + 1; for example, with N = 2^26 and Nblock = 2^10 this would be 16 launches instead of 17.

Wider context

Since GPU acceleration can be put to great use in large computing environments, the fastest execution time and power usage are important. If the same hardware performs faster with another technology, the selection or migration is motivated by reduced power costs. Data centers and HPC are becoming major energy consumers globally, and future development must consider all energy-saving options.

6.3 Conclusions

6.3.1 Benchmark application

The FFT algorithm was successfully implemented as a benchmark application in all technologies. The application provided parallelism and enough computational complexity to take advantage of the GPU. This is supported by the increase in speed compared to the CPU: the benchmark algorithm executed up to a factor of 40 faster on the GPU, as seen in table 6.2, where the CUDA implementation is compared to the CPU implementations.

4 https://docs.nvidia.com/cuda/cuda-c-programming-guide/
5 A tool for debugging and profiling CUDA applications.
6 A good access pattern allows all threads in a warp to read in parallel, one per memory bank, with a total of 32 banks on the GeForce GTX 670; a bank conflict occurs when two threads in a warp attempt to read from the same bank, and the reads become serialized.

Implementation    1D     2D
CUDA              1×     1×
OpenMP            13×    15×
C/C++             18×    40×

Table 6.2: Table comparing CUDA to the CPU implementations.

6.3.2 Benchmark performance

Benchmarking on the GeForce GTX 670 graphics card performing a 2D forward transform resulted in the following ranking:

1. CUDA
2. DirectCompute
3. OpenGL
4. OpenCL

Benchmarking on the Radeon R7 R260X graphics card performing a 2D forward transform resulted in the following ranking:

1. DirectCompute
2. OpenGL⁷
3. OpenCL⁸

The ranking reflects the results of the combined performance relative to OpenCL, where DirectCompute averages a factor of about 0.8 of the execution time of OpenCL.

6.3.3 Implementation

The CUDA implementation had a relative advantage in terms of code size and complexity. The necessary steps to set up a kernel before executing it were but a fraction of those of the other technologies. OpenCL needs runtime compilation of the kernel and thus requires many more steps and produces a lot more setup code. Portability is in favour of OpenCL, but was not examined further (as in running the application on the CPU). The DirectCompute setup was similar to OpenCL, with some additional specifications required. OpenGL followed this pattern but lacked support for selecting the device if several were available.

The kernel code was fairly straightforward and relatively easy to port from CUDA to any other technology. Most issues could be traced back to the memory buffer handling, and related back to the setup phase or how buffers needed to be declared in code. CUDA offered the most in terms of support for C/C++-like coding, with fewer limitations than the rest.

7 Failed to compute sequences longer than 2^23 elements.
8 Performance very close to or equal to DirectCompute when sequences reached 2^24 elements or more.

6.4 Future work

This thesis work leaves room for expanding with more test applications and improving the already implemented algorithm.

6.4.1 Application

The FFT algorithm is used in many practical applications, and the performance tests might give different results with other algorithms. The FFT is easily parallelized but puts great demands on the memory by making large strides. It would be of interest to expand with other algorithms that put more strain on arithmetic operations.

FFT algorithms

The benchmark application is much slower than the external libraries for the GPU, so the room for improvement ought to be rather large. One cannot alone expect to beat a mature and optimized library such as cuFFT, but one could at least expect a smaller difference in performance in some cases. Improved or further use of shared memory, and exploring a precomputed twiddle factor table, would be topics to expand upon. Most important would probably be to examine how to improve the access pattern towards the global memory.

For the basic algorithm, there are several options for removing some of the overhead of including the bit-reversal as a separate step, by selecting an algorithm with a different geometry. Based on cuFFT, which uses the Cooley-Tukey and Bluestein's algorithms, a suggested extension would be to support sizes other than 2^k and compare sequences of any length.

6.4.2 Hardware

More technologies

The graphics cards used in this thesis are at least one generation old compared to the latest graphics cards as of late 2015. It would be interesting to see if the cards show the same differences in later generations, and how much has been improved over the generations. It is likely that the software drivers are optimized differently towards the newer graphics cards.

The DirectX 12 API was released in the fourth quarter of 2015, but this thesis only utilized the DirectX 11 API drivers. The release of Vulkan comes with a premise much like DirectX 12: high performance and more low-level interaction. In a similar way, AMD's Mantle is an alternative to Direct3D with the aim of reducing overhead. Most likely, new hardware will support the newer APIs in a more optimized way during the coming years.

Graphics cards

The GeForce GTX 670 has the Kepler micro-architecture. The model has been succeeded by both the 700 and 900 GeForce series, and the micro-architecture has been followed by Maxwell (2014). Both Kepler and Maxwell use a 28 nm design. The next micro-architecture is Pascal, due in 2016. Pascal will include 3D memory (High Bandwidth Memory, HBM2) that moves onto the same package as the GPU and greatly improves memory bandwidth and size. Pascal will use a 16 nm transistor design that will grant higher speed and energy efficiency.

The Radeon R7 R260X has the Graphics Core Next (GCN) 1.1 micro-architecture and has been succeeded by GCN 1.2. The latest graphics cards in the Rx 300 series include cards with High Bandwidth Memory (HBM) and will likely be succeeded by HBM2. The Radeon R7 R260X is not targeted towards the high-end consumer, so it would be interesting to see the performance with a high-end AMD GPU.

Intel Core i7 3770K 3.5GHz CPU

The Intel Core i7 3770K 3.5GHz CPU used has four physical cores but can utilize up to eight threads in hardware. Currently the trend is to utilize more cores per die when designing new CPUs. The Intel Core i7-6950X and i7-6900K, targeting the high-end consumer market, will have 10 and 8 cores respectively; the i7-6950X is expected some time in the second quarter of 2016.

Powerful multi-core CPUs will definitely challenge GPUs in terms of potential raw computing capability. It would make for an interesting comparison to use high-end consumer products of the newest multi-core CPUs and GPUs. This work was made with processing units from the same generation (released in 2012-2013), and the development in parallel programming has progressed and matured since.

Bibliography

[1] AMD. Accelerating Performance: The ACL clFFT Library. http://developer.amd.com/community/blog/2015/08/24/accelerating-performance-acl-clfft-library/, 2015. Cited on pages 20 and 44.

[2] AMD. ACL - AMD Compute Libraries. http://developer.amd.com/tools-and-sdks/opencl-zone/acl-amd-compute-libraries/, 2015. Cited on page 20.

[3] AMD. clFFT: OpenCL Fast Fourier Transforms (FFTs). http://clmathlibraries.github.io/clFFT/, 2015. Cited on pages 20 and 33.

[4] Leo I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. Audio and Electroacoustics, IEEE Transactions on, 18(4):451–455, 1970. Cited on page 20.

[5] E. Oran Brigham and R. E. Morrow. The fast Fourier transform. Spectrum, IEEE, 4(12):63–70, 1967. Cited on page 4.

[6] Charilaos Christopoulos, Athanassios Skodras, and Touradj Ebrahimi. The JPEG2000 still image coding system: an overview. Consumer Electronics, IEEE Transactions on, 46(4):1103–1127, 2000. Cited on page 5.

[7] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965. Cited on pages 4 and 9.

[8] James W. Cooley, Peter A. W. Lewis, and Peter D. Welch. The fast Fourier transform and its applications. Education, IEEE Transactions on, 12(1):27–34, 1969. Cited on page 4.


[9] Peng Du, Rick Weber, Piotr Luszczek, Stanimire Tomov, Gregory Peterson, and Jack Dongarra. From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming. Parallel Computing, 38(8):391–407, 2012. Cited on pages 14 and 44.

[10] Jianbin Fang, Ana Lucia Varbanescu, and Henk Sips. A comprehensive performance comparison of CUDA and OpenCL. In Parallel Processing (ICPP), 2011 International Conference on, pages 216–225. IEEE, 2011. Cited on pages 14, 42, and 44.

[11] Fftw.org. FFTW Home Page. http://fftw.org/, 2015. Cited on pages 20 and 33.

[12] Matteo Frigo. A fast Fourier transform compiler. In ACM SIGPLAN Notices, volume 34, pages 169–180. ACM, 1999. Cited on page 20.

[13] Matteo Frigo and Steven G. Johnson. FFTW: An adaptive software architecture for the FFT. In Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, volume 3, pages 1381–1384. IEEE, 1998. Cited on page 20.

[14] Matteo Frigo and Steven G. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93:216–231, 2005. Cited on page 20.

[15] W. Morven Gentleman and Gordon Sande. Fast Fourier transforms: for fun and profit. In Proceedings of the November 7-10, 1966, Fall Joint Computer Conference, pages 563–578. ACM, 1966. Cited on page 3.

[16] Sayan Ghosh, Sunita Chandrasekaran, and Barbara Chapman. Energy analysis of parallel scientific kernels on multiple GPUs. In Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on, pages 54–63. IEEE, 2012. Cited on page 8.

[17] Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. High performance discrete Fourier transforms on graphics processors. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, page 2. IEEE Press, 2008. Cited on page 46.

[18] Michael T. Heideman, Don H. Johnson, and C. Sidney Burrus. Gauss and the history of the fast Fourier transform. ASSP Magazine, IEEE, 1(4):14–21, 1984. Cited on page 4.

[19] Song Huang, Shucai Xiao, and Wu-chun Feng. On the energy efficiency of graphics processing units for scientific computing. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–8. IEEE, 2009. Cited on page 8.

[20] Jens Krüger and Rüdiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In ACM Transactions on Graphics (TOG), volume 22, pages 908–916. ACM, 2003. Cited on page 5.

[21] NVIDIA. cuFFT Library User's Guide, 2013. Cited on page 20.

[22] NVIDIA. CUDA Toolkit Documentation: cuFFT. http://docs.nvidia.com/cuda/cufft/, 2015. Cited on page 33.

[23] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A survey of general-purpose computation on graphics hardware. In Computer Graphics Forum, volume 26, pages 80–113. Wiley Online Library, 2007. Cited on page 8.

[24] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008. Cited on page 7.

[25] In Kyu Park, Nitin Singhal, Man Hee Lee, Sungdae Cho, and Chris W. Kim. Design and performance evaluation of image processing algorithms on GPUs. Parallel and Distributed Systems, IEEE Transactions on, 22(1):91–104, 2011. Cited on pages 4 and 42.

[26] Ching-Lung Su, Po-Yu Chen, Chun-Chieh Lan, Long-Sheng Huang, and Kuo-Hsuan Wu. Overview and comparison of OpenCL and CUDA technology for GPGPU. In Circuits and Systems (APCCAS), 2012 IEEE Asia Pacific Conference on, pages 448–451. IEEE, 2012. Cited on page 42.

[27] Stanimire Tomov, Rajib Nath, Hatem Ltaief, and Jack Dongarra. Dense linear algebra solvers for multicore with GPU accelerators. In Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1–8. IEEE, 2010. Cited on page 5.

[28] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1–11. IEEE, 2008. Cited on page 5.