Investigating the Potential of a GPU-based Math Library

by

Daniel Riley Fay

B.S., University of Illinois, 2004

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Master of Science

Department of Electrical and Computer Engineering

2007

This thesis entitled:
Investigating the Potential of a GPU-based Math Library
written by Daniel Riley Fay
has been approved for the Department of Electrical and Computer Engineering

Professor Daniel A. Connors

Professor Manish Vachharajani

Professor Vince Heuring

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Riley Fay, Daniel (M.S., Computer Engineering)

Investigating the Potential of a GPU-based Math Library

Thesis directed by Professor Daniel A. Connors

In the last few years, the Graphics Processing Unit (GPU) has evolved from being a graphics-specific integrated circuit into a high performance programmable vector/stream processor. Contemporary GPUs provide a compelling platform for running compute-intensive applications. In addition to tens of gigabytes per second of memory bandwidth, they also possess vast computation resources capable of achieving hundreds of giga-FLOPs of single precision floating-point computation power. Moreover, the consumer-oriented focus of contemporary GPUs means that even the highest end graphics cards cost well under a thousand dollars. Developments on the software side have also made GPU systems far more accessible for general-purpose use: new programming languages reduce the need for GPU programmers to understand esoteric graphics concepts, and high speed interconnect technologies improve CPU-GPU communication.

Developing a high performance math library will help programmers make full use of increasingly powerful GPUs, as well as enabling the study of GPUs for general purpose applications. Math functions are a critical part of many high performance applications, and their use consumes a large percentage of many programs' CPU time.

In order for a GPU-based math library to be useful, it must provide accurate results. Similarly, it must show a performance and/or power consumption advantage over a CPU-based math library.

This thesis investigates the potential of porting Apple, Inc.'s vForce math library to four different GPUs found in current Apple computers. Using this hardware, the thesis examines whether current GPU technology can be gainfully employed to run a high performance math library on the GPU. The thesis evaluates the potential of a GPU-based math library using three metrics: accuracy, performance, and power/energy consumption. These three metrics are used to study the GPU-ported math library as it runs on the four GPUs. Comparisons are also made among the four different GPUs tested and against the CPU version of vForce.

Contents

Chapter

1 Introduction

2 General Purpose GPU Computing (GPGPU)
   2.1 Characteristics of GPGPU Programs
   2.2 Where GPGPU Computing is Today
   2.3 GPGPU Numeric Considerations

3 Background on GPUs
   3.1 From Framebuffer to Fast Vector Machine: A Brief History of GPUs
       3.1.1 A History of the Hardware
       3.1.2 A History of the Software
   3.2 Assessment of Current GPU Technology
   3.3 The Future of GPUs

4 Experimental Setup
   4.1 Writing a GPGPU Math Library
       4.1.1 Algorithms Used
       4.1.2 The OpenGL Shading Language (GLSL)
       4.1.3 Porting the vForce Functions to Shader Programs
   4.2 Test Setup
   4.3 Testing a GPGPU Math Library

5 Experimental Results
   5.1 Accuracy Results
   5.2 Performance Results
   5.3 Power/Energy Results
   5.4 Bandwidth Results

6 Summary and Conclusion
   6.1 Summary of Results
   6.2 Current Suitability of a GPGPU Math Library
   6.3 Improving GPU Accuracy
   6.4 Future Suitability of a GPGPU-based Math Library

Bibliography

Appendix

A Cephes Math Library Code for sinf

B Detailed Accuracy Results

Tables

Table

1.1 Different GPU interconnections.
4.1 System configurations tested.
4.2 Capabilities of the CPUs and GPUs.
B.1 Accuracy of built-in GLSL functions, NVIDIA 7300GT.
B.2 Accuracy of built-in GLSL functions, ATi x1600.
B.3 Accuracy of built-in GLSL functions, NVIDIA Quadro FX 4500.
B.4 Accuracy of ported vForce functions, ATi x1600.
B.5 Accuracy of ported vForce functions, NVIDIA Quadro FX 4500.
B.6 Accuracy of ported vForce functions, NVIDIA 7300GT.
B.7 Accuracy of built-in GLSL functions, ATi x1900XTX.
B.8 Accuracy of ported vForce functions, ATi x1900XTX.
B.9 Accuracy of basic operators, ATi x1900XTX.
B.10 Accuracy of basic operators, NVIDIA 7300GT.
B.11 Accuracy of basic operators, NVIDIA Quadro FX 4500.
B.12 Accuracy of basic operators, ATi x1600.

Figures

Figure

1.1 Comparison of peak compute capacity and peak memory bandwidth of the NVIDIA and ATi GPUs along with Intel CPUs and the CPUs used by Apple.
2.1 Data-flow for a GPGPU program.
2.2 The IEEE 754 floating point format.
2.3 The floating point number line.
2.4 Differences relative to libm's sinf caused by changing the rounding mode of Cephes' argument reduction.
2.5 Argument reduction code of the Cephes sinf.
2.6 Comparison of the Control Flow Graphs (CFGs) of the Apple Libm implementation of sinf along with the Cephes Math Library implementation of sinf.
3.1 An example of a GPU-containing computer system.
3.2 The GPU pipeline.
4.1 The intrinsic-to-GLSL translation system.
4.2 Diagram of the NVIDIA G7x pixel shader core.
4.3 Diagram of the fragment shader of the NVIDIA G7x.
4.4 Diagram of dual-issue and co-issue.
4.5 Diagram of the ATi R5xx pixel shader core.
4.6 Diagram of the fragment shader of the ATi R5xx.
4.7 The GPU test harness.
4.8 The Apple OpenGL software stack.
5.1 The accuracy differences between the tested GPUs for the basic operations, the built-in GLSL functions, and the ported vForce functions.
5.2 Comparison of the accuracy between the different GPUs for the basic operations, the built-in GLSL functions, and the ported vForce functions.
5.3 Accuracy improvement of the ported vForce functions vs. the built-in GLSL functions.
5.4 A performance comparison of the built-in GLSL functions versus the ported vForce functions.
5.5 A performance comparison of different texture sizes for the NVIDIA GPUs.
5.6 A performance comparison of different texture sizes for the ATi GPUs.
5.7 A performance comparison of the ported vForce functions.
5.8 A performance comparison of vForce running on different processors.
5.9 Speedups gained from using the vec4 data type as opposed to the float data type on all of the ported vForce functions.
5.10 Speedups gained from using the vec4 data type as opposed to the float data type on the restricted subset of the functions.
5.11 A comparison of the different GPUs using the float data type.
5.12 A comparison of the different GPUs using the vec4 data type.
5.13 A comparison of the performance of the ported vForce functions using an all-normals dataset versus a mixed dataset.
5.14 A raw bandwidth comparison.
5.15 A comparison of the system-level power consumption of different GPU-based math functions vs. vForce.
5.16 A comparison of the performance of different built-in GLSL functions vs. vForce.
5.17 A comparison of the energy consumption of different GPU-based math functions vs. vForce.
5.18 The CPU-GPU transfer bandwidth for differing PCI-E widths.

Chapter 1

Introduction

Over the last several years, GPUs have increased in both computation capacity and available memory bandwidth at a rate far faster than any mainstream CPU. The two graphs shown in Figure 1.1 show the progression in peak computation capacity, in billions of floating point operations per second (GFLOP/s), along with the progression in peak memory bandwidth, in gigabytes per second (GB/s), from January 2000 through April 2007. For the GPUs, a zero entry for memory bandwidth and compute capacity represents the time period when the GPU vendors made only non-programmable GPUs. For the CPUs, Figure 1.1 shows the CPUs' peak floating point computation capacities per socket.

In the last few years, GPUs have been better able than CPUs to turn increased transistor budgets into higher performance. Such performance scaling, particularly the rapid increase in computation throughput, is due to GPUs' relative silicon efficiency: GPUs do not have the complicated branch predictors or sophisticated logic for extracting instruction level parallelism (ILP) out of programs that modern CPUs do. Similarly, GPU memory bandwidth has also greatly increased due to GPUs' special-purpose nature. On a video board, the GPU's memory can clock much higher because it is not part of an expandable, multi-drop stub bus that must accommodate multiple gigabytes of memory.

Another factor making the GPU more useful as a high performance adjunct to the main CPU is the rapid increase in the bandwidth of the CPU-GPU interconnection.

[Figure 1.1 contains two graphs: "GPUs vs. CPUs in Compute Capacity", plotting peak performance (GFLOP/s) from 2000 through 2007, and "GPUs vs. CPUs in Memory Bandwidth", plotting peak memory bandwidth (GB/s) over the same years, each with series for ATi, NVIDIA, Apple, and Intel.]

Figure 1.1: Comparison of peak compute capacity and peak memory bandwidth of the NVIDIA and ATi GPUs along with Intel CPUs and the CPUs used by Apple.

Table 1.1 shows the maximum bandwidth available to link the CPU and GPU together.

While the GPU interconnection started out as a bus shared with other peripherals in the system, the advent of the Accelerated Graphics Port (AGP) brought increasingly high bandwidth dedicated connections to GPUs. Current GPUs enjoy up to a total of 8.0GB/s of bandwidth courtesy of a 16-lane PCI Express (PCI-E) channel running at 2.5 Giga-Transactions per second (GT/s). The upcoming PCI-E 2.0 standard will double this rate to 5.0GT/s for a total of 16.0 GB/s of bandwidth.

While GPUs show great promise as high performance, highly parallel compute engines, they suffer from two serious pitfalls. First, while recent GPUs support IEEE 754 format single precision floating point arithmetic, the accuracy of their arithmetic is not fully IEEE 754 compliant. Second, while GPUs have vast quantities of computational capacity and memory bandwidth, their stream-based programming model makes it difficult to realize the GPUs' full computational potential.

    Bus Standard   Bandwidth
    VLB            132 MB/s
    PCI32          132 MB/s (264 MB/s†)
    PCI64          264 MB/s (528 MB/s†)
    AGP 1x         264 MB/s
    AGP 2x         528 MB/s
    AGP 4x         1056 MB/s
    AGP 8x         2112 MB/s
    PCI-E          4.0 GB/s (2.0 GB/s††)
    PCI-E          8.0 GB/s (4.0 GB/s††)
    PCI-E 2.0      16.0 GB/s (8.0 GB/s††)

    † 66 MHz PCI bus. †† 8-lane configuration.

Table 1.1: Different GPU interconnections.

To study the potential of GPUs as a general purpose, high performance compute device, this thesis examines the accuracy, performance, and power/energy consumption of Apple’s high performance math library, vForce, when it is ported over to the GPU.

This thesis studies the potential of a math library on the GPU for several reasons.

First, math functions are an important part of many applications. They are also fairly complicated, so their performance often strongly influences the overall performance of an application. Math functions are also very sensitive to incorrect arithmetic results, making them an excellent vehicle for studying the effects of non-IEEE 754-compliant arithmetic on producing correct results. Porting the vForce high performance math library to the GPU allows one to compare the GPU against an already highly optimized math library that is tuned for high performance on both PowerPC's Altivec vector extensions as well as Intel's SSE extensions. Finally, studying a math library allows for an understanding of how one might use a faster library function to speed up an existing application without having to port the entire program over to the GPU.

To study the potential of a GPU-based math library, this thesis did the following work:

(1) Developed a semi-automatic system for converting the source code of Apple's vForce math library to code for the GPU.

(2) Wrote a test harness and implemented a testing methodology for examining the accuracy, performance, and power/energy consumption of the GPU math functions.

(3) Studied the GPU-ported vForce functions on a variety of different hardware platforms containing different GPUs from both ATi and NVIDIA.

(4) Evaluated the potential of a GPU-based math library.

The remainder of this thesis is organized as follows: Chapter 2 discusses the different aspects of General Purpose GPU computing (GPGPU), including the programming model and the numeric issues pertinent to GPGPU programming. Chapter 3 provides a brief history of GPUs, as well as background information on the entire GPU system, including both the hardware and the software. Chapter 4 discusses how the GPU-based math library was developed, the test platforms, and the test harness used to evaluate the GPU-based math functions. Chapter 5 presents and discusses the test results. Chapter 6 summarizes the results, discusses the overall suitability of a GPU-based math library now and in the near future, and discusses some possible techniques for improving GPU-based math libraries. Finally, Appendix A provides the Cephes Math Library code for the sinf function studied in Chapter 2, and Appendix B provides detailed accuracy results to supplement the average accuracy results presented in Chapter 5.

Chapter 2

General Purpose GPU Computing (GPGPU)

The work done for this thesis is part of a field of study known as General Purpose GPU computing (GPGPU). The increased programmability of GPUs has opened up the potential to execute non-graphics computation on the GPU. Note that the "General Purpose" part of GPGPU is somewhat misleading: GPGPU's goal is not to replace conventional microprocessors with GPUs, but to supplement the CPU with another powerful compute engine.

This chapter discusses the various aspects of GPGPU: the structure of GPGPU programs, the programming of GPGPU applications, the kinds of applications most appropriate for running on the GPU, current GPGPU research, and the serious numerical accuracy issues endemic to GPGPU programming.

2.1 Characteristics of GPGPU Programs

Most GPGPU applications are programmed using a technique called stream programming. The stream programming paradigm's goal is to allow the programmer to explicitly describe the data-level parallelism of a program. Stream programming consists of two core components: streams, which are large arrays of data with no dependencies between any of the elements; and kernels, which are a collection of operations specified for each data element within the stream. Stream programs consist of one or more input streams and an output stream, with everything else chained together using kernels.

Academic examples of stream computing research include the Stanford Merrimac supercomputer [21], whose work led to BrookGPU [18] (a GPGPU-oriented stream language), the Imagine stream processor [36], and StreamIT [62].

By allowing the programmer to explicitly express the data-level parallelism inherent in many algorithms, stream languages allow for large amounts of parallelism to be extracted at compile time. Moreover, since streams are architecture independent, they can target any kind of system, ranging from single/multi-core CPU systems to GPUs.

In stream programming, kernels perform four different classes of operations on streams (a C sketch of the first two classes follows the list):

(1) Map. Every element of the input stream(s) has the same operation performed on it.

(2) Reduction. The output stream has only a fraction as many elements as the input stream.

(3) Filtering. A form of reduction, filtering is where certain items are removed from the stream according to certain criteria.

(4) Scatter. Results are placed in different spots throughout the output stream.
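To make the kernel classes concrete, the following minimal C sketch expresses the first two (map and reduction) under an assumed, simplified interface; the function names and signatures are illustrative only and come from neither vForce nor any stream language:

    #include <stddef.h>

    /* A map kernel: transforms one element, independent of all others. */
    float square(float x) { return x * x; }

    /* Apply a map kernel over an input stream. */
    void stream_map(const float *in, float *out, size_t n,
                    float (*kernel)(float)) {
        for (size_t i = 0; i < n; i++)
            out[i] = kernel(in[i]);   /* no inter-element dependencies */
    }

    /* A reduction: the output has fewer elements than the input
       (here, a single running sum). */
    float stream_reduce_sum(const float *in, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += in[i];
        return acc;
    }

On a GPU, the body of stream_map corresponds to one fragment shader invocation per element, while a reduction such as stream_reduce_sum is typically realized as multiple rendering passes.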

Figure 2.1 shows the general data-flow for a GPGPU program programmed using OpenGL. The CPU-based Main Program coordinates all of the work. It runs the CPU-based work of the program, and manages the shader program(s). It controls the GPU through OpenGL. When the Main Program needs the GPU to perform calculations, it loads the shader program into OpenGL. It then instructs OpenGL to compile and link the GLSL program. By issuing commands to the OpenGL State Machine, the GPGPU program is able to instruct OpenGL to send commands to the GPU via the GPU's driver. The OpenGL layer also deals with data transfer: it transfers data from the program memory to the GPU's local memory as needed via DMA commands issued by the GPU driver.

[Figure 2.1 shows three layers: the Main Program with its program memory (inputs and other data); the OpenGL/GPU driver layer with its OpenGL interface, state machine, and JIT shader compiler; and the GPU with its memory (textures, results, and the framebuffer), connected by OpenGL commands, GPU commands, and DMA transfers.]

Figure 2.1: Data-flow for a GPGPU program.

Note that not all applications are suitable for running on the GPU. In general, successful GPGPU applications possess three key characteristics:

(1) Are highly data-parallel. Each data element in an input stream must not depend on the results of computations on other elements in that input data stream.

(2) Have simple control flow. Branches, particularly data-dependent ones, greatly reduce the efficiency of the GPU.

(3) Possess a high arithmetic density. Each data element should have many computations performed on it. Applications that do not enjoy a high arithmetic density will become limited by either the GPU's memory bandwidth or by the bandwidth of the GPU-CPU interconnection.

Examples of successful GPGPU applications include ray tracing [51], fluid dynamics simulations [56] [67], particle physics simulations [39], databases [28], sorting [29], protein folding simulations [41], Digital Signal Processing (DSP) [47], image processing [25], oil exploration [32], and stock options/derivatives pricing calculations [33].

2.2 Where GPGPU Computing is Today

Currently, GPGPU is in a state of transition from using graphics-programming-oriented shader languages to using general purpose streaming languages. This change to a more general purpose programming paradigm enables lower-level access to the GPU's features.

Currently, there exist three very similar shader languages: Cg ("C for Graphics") [24], GLSL ("OpenGL Shading Language") [54], and HLSL ("High Level Shading Language") [5]. Cg, developed by NVIDIA, was the first of the three. It can target either Direct3D or OpenGL. GLSL and HLSL derive heavily from Cg; however, they are both API specific: GLSL only works with OpenGL, while HLSL only works with Direct3D. Cg, HLSL, and GLSL can be used to code either vertex shader or pixel shader programs. Their syntax strongly resembles C, and they are designed to operate on graphics pixels as opposed to generic stream data types. Shader programs written in these three languages represent data in three ways: as constants, as variables passed between the vertex and the fragment shaders, and as textures. These shader languages allow the shader program to read any position within the texture (in GPGPU, textures are the equivalent of arrays), which enables gather operations. Current shader languages, however, do not support scatter operations, which allow a function to write data to one or more arbitrary locations; the two access patterns are sketched below.
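The gather/scatter distinction is easy to state in C; this sketch uses hypothetical helper functions (not part of any shader API) to show the two access patterns:

    #include <stddef.h>

    /* Gather: read from arbitrary locations, as texture look ups allow. */
    void gather(const float *in, const size_t *idx, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[idx[i]];      /* arbitrary read  */
    }

    /* Scatter: write to arbitrary locations; unsupported in these shader
       languages because a fragment's output position is fixed. */
    void scatter(const float *in, const size_t *idx, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[idx[i]] = in[i];      /* arbitrary write */
    }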

Besides the vendor-supported shader languages, there also currently exist several other third party environments for programming the GPU. A popular academic one is BrookGPU [18], which is based on the Brook stream programming language used by the Stanford Merrimac [21] supercomputer. BrookGPU functions as a run-time on top of OpenGL and Direct3D. A C++ target also exists for debugging purposes.

Sh [45], developed at the University of Waterloo, provides a GPGPU library where the programmer can program the GPU from within C++ programs. Sh is designed to be used for both general purpose GPU programming and GPU-based graphics programming. It is independent of the lower-level GPU APIs, and functions as a C++ library. Sh generates lower-level GPU code in a Just-In-Time (JIT) manner. Another academic language, known as PyGPU [40], uses the Python language to provide a higher level meta-programming language abstraction on top of the basic graphics programming/shader framework.

Microsoft has also developed its own GPGPU library, called Accelerator [61]. Accelerator is an imperative language that contains only data-parallel constructs. Other commercial streaming languages include RapidMind [52], which is based on Sh; PeakStream [9], which targets GPUs, multi-core architectures, and the CELL processor; and Reservoir Labs' R-Stream compiler [19], which takes stream programs as input and outputs a parallel C format suitable for the aforementioned architectures' compilers.

Recently, ATi and NVIDIA have been opening up access to their GPUs' low-level features by providing a new programming model that more resembles a data-parallel machine than a graphics processor. ATi provides this lower-level access through their Close To the Metal (CTM) interface [50]. ATi's CTM interface provides a virtual machine abstraction of the fragment shader hardware. CTM allows the GPGPU programmer to program the GPU in assembly language, and requires the programmer to deal with all GPU and system memory management.

NVIDIA’s Compute Unified Device Architecture (CUDA) [12], on the other hand, operates at a higher level than does CTM. Instead of exposing the programmer to the 10 low-level assembly instructions for the shaders, CUDA operates at the C language level.

CUDA programmers must be mindful of GPU memory, texture, and constant caches, and must also make good use of a small, tightly coupled memory that is shared by groups of shader units. It is also an inherently multi-threaded programming model.

Unlike CTM, CUDA is a compiled language that promises that programs written for it will work on future NVIDIA GPUs.

GPU vendors are not the only companies developing highly parallel compute architectures. Sony, Toshiba, and IBM, for example, developed the multi-core CELL processor [58] [59], which the companies use not only in supercomputers but also in the Sony PlayStation 3. CELL contains one conventional microprocessor along with seven stream processors. Intel is researching a GPU-like device code-named Larrabee [22]. Larrabee consists of 80 simple microprocessor cores that can achieve more than a teraflop of peak floating point performance. The ClearSpeed CSX architecture [11] is another general purpose device intended for compute-intensive applications. The Ageia PhysX processor [15], while designed primarily to accelerate physics computation in games, provides ample parallel computation resources using many simple microprocessors. Sun Microsystems' upcoming Rock [53] microprocessor will also provide high floating point computational power by implementing many simple microprocessor cores within a single chip, in a manner similar to the company's UltraSparc T1 microprocessor [14].

2.3 GPGPU Numeric Considerations

The IEEE 754 floating point standard [34] is the most widely used standard for floating point arithmetic on computers. The IEEE 754 standard enabled floating point application portability by guaranteeing consistent arithmetic results on any compliant machine. The IEEE 754 standard calls for several precision formats: single precision, single extended precision, double precision, and double extended precision. The most widely used of these precision formats are 32-bit single precision and 64-bit double precision.

[Figure 2.2 shows the three fields of the format, from most to least significant: sign, exponent, and mantissa.]

Figure 2.2: The IEEE 754 floating point format.

[Figure 2.3 shows the floating point number line: -Inf, the negative normal numbers, the denormal numbers clustered around -0 and +0, the positive normal numbers, and +Inf.]

Figure 2.3: The floating point number line.

Figure 2.2 shows the IEEE 754 single precision floating point format. The single precision format consists of three components:

(1) Sign bit. If the sign bit is one, the number is negative.

(2) Exponent. Eight bits in length; IEEE 754 encodes the exponent in excess-127 so that negative exponents are represented as positive integers. Keeping the exponent positive helps to simplify the hardware used for doing comparisons.

(3) Mantissa. Sometimes called the fractional part, the mantissa is encoded by IEEE 754 in 23 bits but actually contains 24 bits of data: the most significant bit is an implicit 1 except for denormals.
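As an illustration, the three fields can be separated with a few bit operations in C (a minimal sketch; the helper function is ours, not part of any library):

    #include <stdint.h>
    #include <string.h>

    /* Split a single precision float into its three fields. */
    void decompose(float x, unsigned *sign, unsigned *exponent,
                   unsigned *mantissa) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);    /* bit vector representation */
        *sign     = bits >> 31;            /* 1 bit                     */
        *exponent = (bits >> 23) & 0xFF;   /* 8 bits, excess-127 biased */
        *mantissa = bits & 0x7FFFFF;       /* 23 stored mantissa bits   */
    }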

Figure 2.3 shows the number line of the IEEE 754 floating point number system. In the IEEE 754 number system, the "density", or the smallest incremental difference between the closest representable numbers, is highest around zero. In addition to representing the real numbers, the IEEE 754 number system provides closure by adding several special values:

(1) +/-Infinity. Represents numbers whose magnitudes are too large to be represented as normals.

(2) Not a Number (NaN). There are two categories of NaNs: quiet (QNaN) and signaling (SNaN). An SNaN, when used as an input to an operation, triggers an exception; a quiet NaN does not. Moreover, NaNs are sticky: any arithmetic operation on a NaN input produces a NaN result. Comparisons involving a NaN always fail; this property can be used to detect NaNs by performing a reflexive equality comparison (x == x) on a variable. If the comparison evaluates to false, then the variable holds a NaN. The lower bits of a NaN are undefined and contain the NaN's "payload": a payload can be used as a debugging tool to trace where the NaN originated.

(3) Denormals. A denormal number is a number whose exponent is lower than the lowest normal exponent. Denormals exist to provide gradual precision loss as the number approaches zero.

(4) +/-Zero. IEEE 754 provides two representations for zero that compare equally to each other: a positive and a negative zero. The negative zero exists to represent results that are less than zero but too small in magnitude to be represented by even the smallest-magnitude negative denormal.
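The reflexive comparison described in item (2) can be written directly in C; this sketch is equivalent to the standard isnan() test (and, as discussed in Chapter 4, a compiler unaware of the idiom may optimize the self-comparison away):

    /* x != x is true exactly when x is a NaN, because comparisons
       involving a NaN always fail. */
    int is_nan_f(float x) {
        return x != x;
    }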

A finite binary representation can never exactly represent every real number. Inexact results require the last bits to be rounded to some value. The rounding mode used determines the actual value to which the number is rounded. The IEEE 754 floating point standard supports four different rounding modes:

(1) Round to Nearest. Round the number to the nearest value. If the number falls midway between the two numbers, round it to the nearest even value.

(2) Round toward 0. Round the number towards zero.

(3) Round toward +Infinity. Round the result towards positive infinity.

(4) Round toward -Infinity. Round the result towards negative infinity.

It is also desirable to notify an application if the floating point arithmetic result may adversely affect the overall result of the program. The IEEE 754 standard provides such functionality through exceptions. Exceptions, when thrown by the hardware, cause a special bit in the hardware’s exception register to be set, and can interrupt the program and execute an exception handler. IEEE 754-compliant hardware must support five different types of exceptions:

(1) Invalid Operation. Thrown if an operand is invalid for the operation performed.

(2) Division by Zero. Thrown if the divisor is zero and the dividend is a finite nonzero number.

(3) Overflow. Thrown if the result exceeds the largest-magnitude representable normal number.

(4) Underflow. Thrown if the result is less than the smallest-magnitude normal number.

(5) Inexact. Thrown if the rounded result of an operation is not exact, or if there is an overflow without an overflow trap.

Numeric error can be quantified using two methods. The first method is absolute difference, calculated by subtracting the correct floating point result from the computed floating point result. The other way uses a measurement known as unit in last place, or ulp for short. To calculate an ulp error, one first converts the floating point numbers to their bit vector representations and then subtracts the correct and computed results as unsigned integers. Measuring ulp errors is a good way to study small, rounding-error-related problems, since a one ulp rounding error can lead to different absolute errors depending on where on the number line the two numbers lie.

Figure 2.4: Differences relative to libm's sinf caused by changing the rounding mode of Cephes' argument reduction.
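A minimal C sketch of the ulp measurement just described (assuming both values are finite and share the same sign):

    #include <stdint.h>
    #include <string.h>

    /* Reinterpret both floats as unsigned integers and subtract. */
    uint32_t ulp_error(float computed, float correct) {
        uint32_t a, b;
        memcpy(&a, &computed, sizeof a);
        memcpy(&b, &correct, sizeof b);
        return a > b ? a - b : b - a;
    }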

Many high performance math functions compute their results using multiple polynomials. The math functions use different polynomial functions along different parts of the floating point number line. As a result, it is essential that the functions that decide which polynomial function should be used for a given number work correctly.

Figure 2.4 shows the absolute error between the Cephes Math Library's [48] sinf and Linux's built-in libm sinf. One of the most important parts of the algorithm is the argument reduction, where the operand is mapped to different parts of the region [0, 2π].

     1  j = FOPI * x; /* integer part of x/(PI/4) */
     2  y = j;
     3  /* map zeros to origin */
     4  if( j & 1 )
     5  {
     6      j += 1;
     7      y += 1.0;
     8  }
     9  j &= 7; /* octant modulo 360 degrees */
    10  /* reflect in x axis */
    11  if( j > 3)
    12  {
    13      sign = -sign;
    14      j -= 4;
    15  }

Figure 2.5: Argument reduction code of the Cephes sinf.

One of the most error-prone parts of the argument reduction is Line 1, where the nearest multiple of π/4 is determined for the input operand. Each of the graphs in Figure 2.4 shows the discrepancies that occur when the convert-to-integer rounding mode within the argument reduction (the first line shown in Figure 2.5) is changed to:

(1) Unmodified. No changes to the rounding mode were made.

(2) Roundf. The rounding mode was changed to round-to-nearest.

(3) Truncf. The rounding mode was changed to truncate (round-to-zero).

(4) Ceilf. The rounding mode was changed to round-to-positive-infinity.

(5) Floorf. The rounding mode was changed to round-to-negative-infinity.
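In C, the five variants might be produced as follows; this is a hypothetical harness sketch, since the unmodified Cephes code simply relies on C's default truncating float-to-int conversion:

    #include <math.h>

    /* Compute j = integer part of x/(pi/4) under each rounding variant. */
    int reduce_octant(float x, int variant) {
        const float FOPI = 1.27323954f;     /* 4/pi, as in Cephes */
        float t = FOPI * x;
        switch (variant) {
        case 0:  return (int)t;             /* unmodified: truncation   */
        case 1:  return (int)roundf(t);     /* round-to-nearest         */
        case 2:  return (int)truncf(t);     /* truncate (round-to-zero) */
        case 3:  return (int)ceilf(t);      /* toward positive infinity */
        default: return (int)floorf(t);     /* toward negative infinity */
        }
    }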

Note that doing nothing more than changing the rounding mode of a single operation in the sinf function leads to vastly different results. The high sensitivity to accuracy issues such as the choice of rounding modes in the Cephes sinf will have profound implications for the accuracy of GPU-ported math library functions.

[Figure 2.6 shows the two CFGs side by side: (a) sinf from Apple's Libm; (b) sinf from Cephes.]

Figure 2.6: Comparison of the Control Flow Graphs (CFGs) of the Apple Libm implementation of sinf along with the Cephes Math Library implementation of sinf.

Another important aspect of algorithm design that must be taken into account when developing a high performance math library is control flow. Irregular control flow is detrimental to performance, as valuable compute cycles are wasted incorrectly predicting branches. Moreover, irregular control flow hurts the efficiency of many SIMD architectures, as these architectures must typically execute both control paths and select the correct results using predication. Figure 2.6 shows the control flow graph (CFG) of the sinf function used in the PowerPC version of OSX's libm [16] as well as the CFG for the Cephes Math Library version of sinf. Note that the Cephes version of sinf has much simpler control flow: its CFG is narrower and has considerably fewer back edges. A CFG with a lower breadth is good for vectorization because it means that there are fewer control paths that need to be traversed when the algorithm is predicated.

Chapter 3

Background on GPUs

Hardware support for GPGPU required GPUs to transition from being fixed-function ASICs to programmable devices that can run sophisticated programs. Not only did the GPU hardware itself have to become sufficiently flexible and programmable to support GPGPU computing, but the rest of the graphics subsystem, particularly the graphics programming APIs and the GPU programming languages, needed to become general purpose enough to support GPGPU computing as well. While GPU systems have become vastly more flexible and programmable, extracting optimal performance out of GPGPU programs still requires a thorough knowledge of the inner workings of the GPU.

This chapter provides a brief history of the GPU hardware and its associated software stack. Additionally, this chapter discusses some of the leading GPU hardware architectures and the latest GPU programming APIs and programming languages. The chapter concludes with a discussion of where GPU technology is heading in the context of GPGPU.

3.1 From Framebuffer to Fast Vector Machine: A Brief History of GPUs

Starting out as just an expansion bus interface and a memory interface, the Graphics Processing Units (GPUs) in personal computers have advanced over the years, slowly acquiring functionality originally done by the computer's main processor. Over time, increasing silicon budgets and demand for more advanced graphics pushed GPU designers to add a variety of new features to accelerate graphics processing.

3.1.1 A History of the Hardware

GPUs initially provided just simple two-dimensional graphics support; designers ultimately augmented them with the ability to accelerate three dimensional graphics rendering and to accelerate the processing and display of full-motion video. The earliest GPUs were little more than a simple memory interface that bridged the framebuffer (a piece of memory that held the image to be displayed) and the computer's peripheral expansion bus. These early GPUs did no graphics processing of their own: changing the displayed image required the CPU to micromanage the framebuffer by directly modifying every pixel. Over time, GPUs began to accelerate common two-dimensional graphics operations. Such acceleration first occurred in high-end graphics workstations: these machines had GPUs that employed dedicated CPUs or DSPs as graphics co-processors. Eventually, lower-cost mainstream GPUs gained fixed-functionality acceleration for a few commonly used graphics operations. These operations included bitblitting (special operations for combining multiple bitmaps), color fill operations, and draw operations on graphics primitives such as rectangles, circles, arcs, and lines.

Beginning in the early-to-mid 1990s, mainstream GPUs gained the ability to accelerate 3D graphics. The first GPUs supporting 3D acceleration supported simple texture mapping and filtering. As time wore on, progressively more of the 3D rendering process could be offloaded to the GPU for processing. Soon, GPUs could process multiple textures in a single pass, and could process geometry operations like transformation, clipping, and lighting, as well as gaining more elaborate shading capabilities. GPUs also added the ability to process motion video. The growing size of the GPUs' on-board memory allowed programs to buffer video frames ahead of time, facilitating smoother video playback. GPUs could accelerate compressed video playback by providing hardware support for video compression primitives like motion compensation and the inverse Discrete Cosine Transform (iDCT).

Newer GPUs gained a critical feature needed for GPGPU: programmability. Certain parts of recent GPUs can execute programs known as shader programs. The earliest programmable GPUs could only run very simple, highly limited shader programs consisting of only a few (less than about 30) instructions. Control flow support was limited to using the z-cull unit to conditionally nullify operations. Successive generations of GPUs allowed for more sophisticated shader programs. These more advanced GPUs supported longer shader programs, conditional branching, predication, and floating point operations. The floating point support started out with 16-bit floating point values (known as "half precision"); soon, GPUs could process 24-bit floating point numbers and then 32-bit single precision IEEE 754-format operands.

3.1.2 A History of the Software

The first 3D graphics APIs were designed for professional workstation use. Originally, there were two major interactive graphics APIs: an open API called PHIGS (Programmer's Hierarchical Interactive Graphics System) and Silicon Graphics, Inc.'s (SGI) proprietary IRIS GL API. In addition to interactive graphics, there was also Pixar's RenderMan [65] standard for offline rendering, which remains in use to this day.

Ultimately, PHIGS fell out of use when SGI created OpenGL [49] as an open-standard replacement for its proprietary IRIS GL API. Unlike IRIS GL, OpenGL did not require all of its features to be supported by the underlying graphics hardware, instead allowing unsupported features to be emulated in software. The OpenGL API provided the programmer with a fixed function pipeline known as the OpenGL state machine. Most of OpenGL's commands either configure parts of the pipeline or move graphics primitives between different parts of the pipeline. OpenGL 2.0 moved away from the fixed functionality pipeline somewhat by allowing the vertex and fragment parts of the pipeline to be programmed using the OpenGL Shading Language (GLSL) [54].

In the mid-1990s, low cost consumer 3D graphics cards emerged to accelerate 3D games. While at this point OpenGL had existed for several years, its feature set was too complex for the consumer-level GPUs of the time to fully support. As a result, these early GPUs used various proprietary APIs such as 3dfx's Glide and Rendition's Speedy3D and RRedline. Later on, some games, such as GLQuake and Quake II, employed MiniGL drivers, which implemented a subset of OpenGL.

Around this time, Microsoft released its own 3D API, Direct3D, as part of the DirectX collection of game programming APIs. Direct3D provided a GPU vendor-neutral (but Windows-only) API for programming 3D cards. Since Microsoft designed Direct3D for games, they limited its feature set to the 3D functionality that then-current GPUs supported in hardware. Over time, the standard advanced, with version 5.0 gaining wide acceptance by game developers, version 7.0 supporting hardware geometry processing, and version 8.0 supporting the first programmable shaders. Currently, Direct3D is at version 9 in Windows XP and version 10 in Windows Vista.

Today’s programmable GPUs have their vertex and fragment shader units pro- grammed using a language called a shader language. At first, shader languages were nothing more than glorified assembly languages that specified basic operations such as add, subtract, and multiply-add. As the shaders became more sophisticated, they even- tually moved to the current C-like shader languages like “C for Graphics” (Cg) [24], the OpenGL Shading Language (GLSL) [54], and the High Level Shading Language

(HLSL) [5], all of which are Just-In-Time (JIT) compiled by the GPU’s driver.

3.2 Assessment of Current GPU Technology

Current GPUs meet or exceed the specifications of Microsoft’s Shader Model 3.0.

Shader Model 3.0 has significantly increased the utility of the GPU's programmable shaders for GPGPU in several ways. First, it increases both the maximum static and dynamic instruction count of shader programs, allowing for longer and more complicated programs. Additionally, Shader Model 3.0 adds predication, looping, and dynamic branching. Predication and dynamic branching are essential to many programs: iterative looping, for example, requires dynamic branching support. Also added in Shader Model 3.0 is limited procedure call support. Other enhancements include additional temporary registers to help support longer shader programs, as well as support for arbitrary swizzling (swizzling is a limited form of vector permute where any of the pixel's four elements can be used as an input into an operation).

floating point numbers. They do not support double precision numbers, which are necessary in many important HPC and scientific computing applications. None of the

GPUs support handling denormal numbers. Current GPUs do not fully support any of

IEEE 754’s rounding modes. The floating point specials in the IEEE 754 standard are mostly, but not completely, supported.

Current GPUs also possess a 256-bit wide connection to their local memory.

They use memory technologies such as GDDR-3 and GDDR-4, which are somewhat similar to the memory used in personal computers but with modifications to allow for extremely high speed operation. GDDR-3 and GDDR-4 technologies allow for higher speed operation by using a point-to-point connection that minimizes capacitive bus loading, providing large internal prefetch buffers, and increasing memory bandwidth at the cost of higher latency.

Figure 3.1 shows the system-level layout of a GPU-containing computer system.

GPUs usually connect to the rest of the system using a 16-lane PCI Express connection

(PCI-E x16) that provides 4GB/s in either direction. Some systems support more than one PCI-E x16 connection to allow two or more GPU boards to be used. Note that, in many cases, the added bandwidth of PCI Express goes to waste, as the CPU’s front-side 23 DRAM

64+ GB/s GPU . Bs4.0 GB/s 4.0 GB/s 4.0 GB/s

12.8 GB/s Northbridge 4.0 GB/s DRAM DRAM 64+ GB/s GPU

10.6 GB/s 10.6 GB/s

CPU CPU

Figure 3.1: An example of a GPU-containing computer system..

bus does not have enough bandwidth to keep up.

Today, virtually all GPUs support either OpenGL, Direct3D, or both. Figure 3.2 shows the GPU pipeline, and shaded in gray are the parts of the GPU relevant for

GPGPU computing: the vertex shaders, the fragment shaders, and the z-cull stage. In a conventional graphics pipeline, a list of vertexes for the 3D figures enters the graphics pipeline through the vertex shaders. There, the vertex shaders perform the transform and lighting functions on the 3D primitives. In GPGPU computing, the vertex shaders function as multiple instruction multiple data (MIMD) processors that support scatter, but frequently not gather, as many GPUs do not allow the vertex shaders to read from textures.

The next stage of the GPU is the primitive assembly. Here the GPU assembles 24

Vertex Primitive Clip/Cull/Setup Shader(s) Assembly

Hidden Surface GPU Memory Rasterization Removal (Z−Cull)

Framebuffer Fragment Optimizations Shader(s)

Figure 3.2: The GPU pipeline.

the vertexes into graphics primitives such as lines and triangles. After assembly, the primitives go through the clip/cull/assembly stage, were the GPU clips the primitives if they intersect with the viewport, culls them if they are facing the wrong way relative to the viewport, and assembles them into a 3D model. Next in the pipeline is the rasterizer.

The rasterizer converts the three-dimensional primitives into a two-dimensional image for the fragment shaders.

Most contemporary GPUs have an early z-cull unit, which removes primitives that cannot be seen (this process is sometimes known as hidden surface removal) to reduce the computational and memory bandwidth requirements of the rasterizer. This z-cull unit can be occasionally used for GPGPU as a way to efficiently squash dynamically undesired computation before it reaches the fragment shader stage. 25

At the fragment shader stage, the GPU’s programmable fragment shaders shade the rasterized image. Of the three parts of the GPU pipeline, the fragment shader stage is the most useful for GPGPU: GPUs typically have significantly more fragment shaders than vertex shaders, and the fragment shaders are usually more useful for GPGPU computing because they always support gather operations through the ability to read textures. Finally, before being outputted to the framebuffer, the shaded fragments go through pixel optimizations such as the scissor test, alpha blending, and the depth test.

3.3 The Future of GPUs

Upcoming GPUs, such as the NVIDIA G80 [7] and the ATi R600 [10], will con- tinue to facilitate GPGPU programming. These new GPUs add scatter support to the fragment shaders, opening up new possibilities such as Folding@Home [41]. The shader units will continue to become more flexible: GPUs compliant with the Shader Model 4.0 specification [17] must support a unified shader model where the programming model for vertex and fragment shaders is identical, making it easier for programmers to fully utilize the GPU’s compute resources.

Future GPUs will also enjoy better numeric accuracy. All GPUs compliant with

Shader Model 4.0 will have stricter limits on operations’ inaccuracy, with add, subtract, and multiply being limited to 1 ulp of error and divide and square root being limited to 2 ulps of error. Additionally, at least some of the future GPUs will sport hardware support for double precision arithmetic, as NVIDIA promises double precision support in its GPUs ”by the end of 2007” [13]. Shader Model 4.0-compliant GPUs will also have to support true integer and bitwise arithmetic: currently, shader languages’ integer operations compile to floating point operations, and do not support bitwise arithmetic.

In addition to the increased programmability and improved numeric accuracy, future GPU systems will better integrate the GPU into the rest of the system. Microsoft

Windows Vista along with the current versions of OSX support virtualizing the GPU’s 26 memory. Virtualized GPU memory helps make it possible for multiple applications to share the GPU, something which both OSes support.

Besides the increased compute resources provided by Moore’s Law, GPUs will also have faster connections to memory and to the rest of the system. Upcoming GPUs will enjoy faster and wider memory buses (384- and 512-bit). Moreover, PCI Express 2.0 doubles the transfer rate of PCI Express 1.1 from 2.5 GT/s to 5.0GT/s, enabling a total of 16GB/s in both directions simultaneously (8.0 GB/s in either direction). Ultimately,

PCI Express should scale to up to 10 GT/s, providing a total of 32GB/s of bandwidth.

New I/O standards will couple the GPU more tightly with the CPU. Intel/IBM’s

Geneseo [46] technology and AMD’s [63] technology both improve the latency and bandwidth of data transfers between the CPU and GPU. The future will also see

GPU-integrated CPUs: both Intel and AMD have announced future microprocessors with integrated GPUs (codenamed Nehalem [60] and Fusion [31] respectively). Chapter 4

Experimental Setup

4.1 Writing a GPGPU Math Library

To study the potential of running a math library on the GPU, Apple’s vForce math library was ported into a series of GLSL-coded fragment shader programs. Once ported, a test harness was written to compare the accuracy, performance, and power consumption of the math libraries to the original vForce functions. The test harness was also used to compare the performance and accuracy of different CPUs and GPUs.

To help pinpoint performance bottlenecks, the performance effect of changing the width of the PCI-E connection was studied as well.

4.1.1 Algorithms Used

vForce is Apple’s high performance vector math library that is a part of Apple’s

Accelerate.framework [4]. Accelerate.framework contains various high performance image, signal processing, and mathematics libraries. Accelerate.framework provides the developer with performance optimized libraries that automatically choose the correct

(and highest performance) code path. Accelerate.framework takes advantage of any vector instruction set extensions provided by the machine’s microprocessor, such as

Freescale/IBM’s Altivec [26] instructions or Intel’s SSE [42] instructions.

Many of the underlying algorithms for vForce are based on modified versions of

Steven Moshier’s Cephes Math Library code. The Cephes Math Library contains a 28 variety of algorithms optimized for high performance and are well-suited for running as

SIMD code for the reasons discussed in Section 2.3. Besides the simpler control flow, the Cephes functions are also well-suited for SIMD operations because they are not table-driven; that is, they do not need table look ups to compute their results. In SIMD machines, efficiently implementing table look ups requires gather support, which is not supported on either x86 or PowerPC. While gather support does exist on GPUs, extra memory access operations still consume valuable memory bandwidth.

4.1.2 The OpenGL Shading Language (GLSL)

The OpenGL Shading Language (GLSL) resembles a form of C modified for graph- ics programming. It can be used to write either vertex shader or fragment shader programs. It is a JIT-compiled language: when an OpenGL program is run, the pro- gram loads the GLSL source code and sends it to the GPU vendor’s driver, which then compiles, links, and loads the GLSL program onto the GPU.

The GLSL language provides several different data types useful to GPGPU: an integer data type, a Boolean data type, a floating point data type, three matrix types (a

2x2 floating point matrix, a 3x3 floating point matrix, and a 4x4 floating point matrix), and a sampler type. The GLSL language does not make any assumptions about the underlying hardware representation of any of the data types; in many cases, integers and booleans are implemented as floating point numbers, and on the older ATi R3xx [64] and R4xx [55] series, all of the floating point numbers are 24 bits long as opposed to being 32-bit IEEE-754 format single precision numbers.

GLSL shader programs support a multiple input, single output data flow. The input is one or more textures (the maximum number of input textures is limited by the number of texture units supported by the GPU) and the output is a single, fixed-position fragment. As a result of the single, fixed-position output, it is impossible to implement scatter. Addressing the texture memory involves using either a one dimensional, two 29 dimensional, or three dimensional texture address (used to access 1D, 2D, and 3D textures respectively). Since texture addresses use a floating point number, it is possible to have unaddressable elements in a texture, since 32-bit floating point numbers can only count up to 224, or roughly sixteen million values. Texture look ups use either

Texture1D, Texture2D, or Texture3D (for 1D, 2D, and 3D textures respectively).

GLSL also provides a number of different built-in math functions. Since the

GLSL functions are designed for graphics calculations, GLSL provides no guarantees about the accuracy of any of the math functions. An important way to measure the utility of porting vForce over to the GPU is to compare how well the ported functions perform (in terms of accuracy, performance, and power) against the built-in GLSL versions. This thesis examines the accuracy and performance of the following built-in

GLSL functions:

(1) acos - Inverse cosine.

(2) asin - Inverse sine.

(3) atan - Inverse tangent.

(4) atan2 - Inverse tangent, two inputs.

(5) cosine - Cosine.

(6) exp2 - Power-of-two exponent.

(7) exp - Natural exponent.

(8) logarithm - Natural logarithm.

(9) log2 - Base-2 logarithm.

(10) sine - Sine.

(11) sqrt - Square root. 30

(12) tangent - Tangent.

Note that exp2 and log2 are important not just as basic math functions, but also as the building blocks of many of the ported vForce functions, with exp2 used to adjust exponents and log2 used to extract exponents and/or extract mantissas.

In addition to the built-in math functions, this thesis investigates the performance and accuracy of the basic operators add, subtract, multiply, and divide. While GLSL treats all of these as basic operators, vForce provides divide as an operation composed of other basic operations. The accuracy tests for the basic operations used test inputs obtained from Jerome Coonen’s Ph.D. thesis [20], whose work was a major basis for the

IEEE 754 floating point standard.

The comparison operators (less than, greater than, less than or equal, greater than or equal, equal, and not equal) were also tested. An essential use of these operators is for classifying an input value, determining whether it is one of the

floating point specials, or testing whether it is within a certain real-valued range. When classifying different specials, NaNs get special treatment as one does not compare a

NaN to another NaN; rather, one compares the NaN to itself, and if the self comparison fails, the value is a NaN. A self comparison is hazardous, however: a compiler that is not aware that self comparisons are used to detect NaNs will optimize away such oper- ations. To detect such problems, the comparison operators test also has a ”self” test, which compares an input variable to itself.

Test vectors for the comparison operators were generated by taking every possible positive and negative combination of the below listed values. These values were chosen to be a representative sample of the different values on the floating point number line:

(1) 0 - Note that both positive and negative zero, which compare equal to each

other, are tested.

(2) 1 - Represents the lower-magnitude real numbers. 31

(3) 1.5 - Represents the lower-magnitude, non-integral real numbers.

(4) 0.5 - Also represents the lower-magnitude, non-integral real numbers.

(5) 3.402823e38 - Highest possible magnitude normal number.

(6) 1.175494e-38 - Largest-magnitude denormal number.

(7) 1.4012985-45 - Smallest-magnitude denormal number.

(8) Infinity

(9) Not A Number (NaN)

The math function tests employed test vectors used internally by Apple to test the correctness of Apple’s libm math library. The test vectors are specifically designed to test various vulnerable areas of the algorithms. For the trigonometric functions sinf, cosf, and tanf, the Apple test vectors were supplemented with inputs designed to test

kπ immediately around 4 , a major vulnerability in the argument reduction techniques used by these functions.

4.1.3 Porting the vForce Functions to Shader Programs

The original Apple vForce code uses compiler intrinsics [1]. Compiler intrinsics are special functions (or, in some cases, preprocessor macros) used to encode operations that are not well expressed by either C or C++. In many cases, compiler intrinsics are actually function wrappers around assembly instructions. Compiler intrinsics are popularly used for SIMD programming, as they allow the programmer to precisely specify SIMD instructions without having to program them in assembly. Being able to write SIMD code in C/C++, even if the intrinsics only specify individual instruc- tions, greatly improves productivity by allowing the compiler to deal with such tedious, cumbersome tasks as register allocation, memory addressing modes, and function pro- logues/epilogues. 32

The macro-based system used for converting the portable intrinsic code into GLSL was significantly more complex than the existing system used by Apple to generate compiler intrinsics for PowerPC and x86. The lack of integer and bitwise operations in

GLSL greatly complicated porting. Bitwise operations are essential for operations such as extracting and manipulating the exponent. Without bitwise support, the exponent extraction had to be done using the log2 function and mantissa extraction had to be done by extracting the exponent with log2, negating it, and then multiplying it by two to the negated exponent. Workarounds such as the ones previously described certainly had an impact on performance, as they required many additional operations. It is also likely that they adversely affected accuracy, as the added operations inserted additional rounding errors (along with any other accuracy errors the GPUs’ implementations of exp2 and log2 had) into the calculations.

A similar problem occurred when porting division and square root to the GPU.

Both square root and division use Peter Markstein’s algorithms [44], which start with a reciprocal estimate for the divide and square root and then use Newton-Raphson iterative refinement to converge on the correct value. The GPU ports of these two functions emulate the reciprocal estimate using the divide operator and emulate the reciprocal square root using GLSL’s built in rsqrt function.

While the translation system works for many basic operations, it is impossible to make fully orthogonal conversions of many operations, particularly those using bitwise operators. As a result, most of the vForce functions cannot be ported completely automatically; some unsupported operations have to be translated by hand. Overall, however, the translation system greatly simplifies the porting process, and using it reduces the risk of human error when porting.

Constants are another serious issue. The GLSL language has no way to specify the floating point specials as constants. As a result, all of the needed constants have to be generated by the test harness, put into a texture, and loaded onto the GPU. Math library functions that use these constants have to load them from the constants texture. Such a workaround certainly hurts performance, as texture lookups consume valuable GPU memory bandwidth.
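A minimal host-side sketch of how such a constants texture might be built (texture name and layout are hypothetical; GL_RGBA32F_ARB assumes the ARB_texture_float extension):

    #include <math.h>       /* INFINITY, NAN (C99) */
    #include <GL/glew.h>    /* GL entry points and GL_RGBA32F_ARB */

    /* Write a handful of specials into a tiny float texture; shaders
     * then fetch them from this texture instead of spelling literals. */
    void loadConstantsTexture(GLuint constTex) {
        const float consts[4] = { INFINITY, -INFINITY, NAN, -0.0f };
        glBindTexture(GL_TEXTURE_2D, constTex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, 1, 1, 0,
                     GL_RGBA, GL_FLOAT, consts);  /* one RGBA texel */
    }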

Finally, dealing properly with control flow is an extremely serious issue. When programming for SIMD, it is not practical to implement many forms of control flow with branches. Consider the following code fragment: if(x == 1) y = x+1; else y = x*2;. If x and y are both four-element SIMD registers, then it is possible that some elements of x will evaluate to true and some will evaluate to false. To deal with this problem, SIMD programming uses a limited form of predication known as the conditional select operation. The conditional select takes three SIMD register inputs: two data inputs (x and y) and a predicate register. For each element in the SIMD registers, if its corresponding Boolean element is true, the y element is selected; otherwise, the x element is selected for the output. While the GPUs internally support predication, GLSL provides no way to expose it to the programmer. As a result, to emulate a conditional select instruction, it is necessary to translate the select instructions into several scalar select statements. Doing so likely hurts performance, unless the shader compiler is able to fold the scalar instructions back into a single vector instruction.
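A hedged GLSL sketch of the scalar emulation just described (names hypothetical; pred holds 0.0 or 1.0 per element, as produced by the comparison operators):

    vec4 cond_select(vec4 pred, vec4 x, vec4 y) {
        vec4 r;
        r.x = (pred.x != 0.0) ? y.x : x.x;  // one scalar select per element
        r.y = (pred.y != 0.0) ? y.y : x.y;
        r.z = (pred.z != 0.0) ? y.z : x.z;
        r.w = (pred.w != 0.0) ? y.w : x.w;
        return r;
    }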

After the compiler intrinsics are successfully translated into GLSL, the GLSL code is assembled into a valid shader program. Figure 4.1 shows the overall translation system. The translation system takes as its input the program written in the compiler intrinsic language. It then passes the intrinsic code through CPP, using a special header file to translate the intrinsics into GLSL. Once macro expansion is complete, the translated code is inserted into a special GLSL function, and the results are written to a GLSL code file. The translator creates three versions of each file: one that operates on scalar float types, one that operates on vec2-based data types, and one that operates on vec4 data types.


Figure 4.1: The intrinsic-to-GLSL translation system.

The GLSL code generated with vec2 data types never produced correct results, so its accuracy, performance, and power figures are not presented.

One difficulty in GPGPU is finding an efficient way to access the result data.

Traditional graphics programming did not have to deal with downloading results from the GPU, as they were immediately drawn to the screen. In GPGPU, however, it is almost always necessary to access the outputted results, either to download them off the GPU for further processing by the CPU or to use them as the input for another rendering pass. Moreover, many GPGPU calculations require multiple outputs. Until recently, the only way to access rendered results was through pbuffers, a mechanism that is both esoteric (pbuffers are difficult to set up) and inefficient (they require a context switch to access). Recent versions of OpenGL, however, add support for Framebuffer Objects (FBOs) [23]. FBOs are special OpenGL objects designed to function as render targets. An important feature of FBOs is that they make render-to-texture possible: textures can be attached to FBOs and thus used as output arrays. Once rendered to, these textures can be used as inputs to subsequent rendering passes. Additionally, multiple textures can be attached to a single FBO, allowing for multiple outputs.
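A minimal C sketch of FBO-based render-to-texture using the EXT_framebuffer_object entry points (error checking omitted; the GLEW header is assumed, as the test harness uses GLEW):

    #include <GL/glew.h>   /* EXT_framebuffer_object entry points */

    /* Create a float texture and attach it to an FBO as a render target;
     * drawing a full-screen quad then writes shader output into the texture. */
    GLuint makeRenderTarget(int w, int h, GLuint *texOut) {
        GLuint fbo, tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, w, h, 0,
                     GL_RGBA, GL_FLOAT, NULL);    /* empty output array */
        glGenFramebuffersEXT(1, &fbo);
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT,
                                  GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_2D, tex, 0);
        *texOut = tex;
        return fbo;
    }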

4.2 Test Setup

The GPU study done for this thesis tested several machines. The most thoroughly tested were two iMac machines: a shipping Core Duo 17" iMac and a prototype Core 2 Duo 20" iMac. All of the tests except the power consumption and PCI Express bandwidth tests were performed on these two machines. The iMac is Apple's mid-range consumer desktop system, and both machines have mainstream GPUs: the 17" Core Duo iMac has an ATi Radeon x1600, while the 20" Core 2 Duo iMac has an NVIDIA GeForce 7300GT. Both iMacs have a single 64-bit memory channel running at 667MHz (a total of 5.4GB/s of main memory bandwidth), and both carry 1GB of memory. The power studies use a 2.0GHz Core Duo MacBook Pro (the MacBook Pro is Apple's high-end notebook line), which, like the 17" Core Duo iMac, has an ATi Radeon x1600.

The other studies involve various Mac Pro desktop machines. The Mac Pro is Apple's high-end, workstation-class machine. While the different Mac Pro machines vary considerably in their specifications, all of them have two dual-core Xeon processors (these Xeons are Intel "Woodcrest" devices, microarchitecturally identical to the Core 2 Duo found in the 20" iMac). As seen in Table 4.1, the clock speeds and memory configurations of the different Mac Pros vary greatly; however, they all have a 64-bit, 1333MHz front side bus (for a total of 10.8GB/s of memory bandwidth available to the microprocessor) along with a quad-channel, 667MHz FB-DIMM-based memory system.

Machine          GPU              VRAM    Memory   CPU             FSB
ATi iMac         x1600            128MB   1GB†     1.83GHz†††      667MHz
NVIDIA iMac      7300GT           128MB   1GB†     2.17GHz††††     667MHz
ATi Mac Pro      x1900XTX         512MB   4GB††    2x3.0GHz††††    1333MHz
NVIDIA Mac Pro   Quadro FX 4500   512MB   16GB††   2x3.0GHz††††    1333MHz
MacBook Pro      x1600            128MB   1GB†     2.0GHz†††       667MHz
PCI-E Mac Pro    7300GT           256MB   4GB††    2x2.66GHz††††   1333MHz

† Single channel DDR2-667. †† DDR2-667 FB-DIMMs, quad channel. ††† Core Duo. †††† Core 2 Duo.

Table 4.1: System configurations tested.

Device                         Peak Computational     Peak Memory
                               Capacity (GFLOP/s)     Bandwidth (GB/s)
Intel Core Duo (1.83GHz)†      7.32                   5.4
Intel Core Duo (2.00GHz)†      8.00                   5.4
Intel Core 2 Duo (2.17GHz)†    17.36                  5.4
Intel Core 2 Duo (2.66GHz)†    21.28                  10.8
Intel Core 2 Duo (3.00GHz)†    24.00                  10.8
ATi x1600                      24.00                  12.5
NVIDIA 7300GT                  26.4                   10.7
ATi x1900XTX                   124.8                  49.6
NVIDIA Quadro FX 4500          129.6                  33.6

† Single core only.

Table 4.2: Capabilities of the CPUs and GPUs.

The GPUs vary widely as well: the PCI Express bandwidth test system contains an NVIDIA GeForce 7300GT, while the other two Mac Pros sport an ATi Radeon x1900XTX and an NVIDIA Quadro FX 4500.

Between the test systems, four GPUs were studied. The Core Duo iMac and the MacBook Pro both contain an ATi Radeon x1600, one of ATi's mid-range GPUs; both systems' GPUs come with 128MB of video RAM. Another ATi GPU, found in one of the Mac Pros, is the ATi Radeon x1900XTX, which was, at the time of the study, ATi's highest-end GPU. Two NVIDIA GPUs were studied as well. The first, the NVIDIA GeForce 7300GT, is NVIDIA's mainstream GPU; in the Core 2 Duo iMac it is equipped with 128MB of memory, while in the Mac Pro used to study PCI-E bandwidth it comes with 256MB. The high-end NVIDIA GPU tested was the NVIDIA Quadro FX 4500, a professional, workstation-grade GPU. Table 4.2 provides more details, including raw FLOPs and memory bandwidth, for all of the GPUs and CPUs studied.


Figure 4.2: Diagram of the NVIDIA G7x pixel shader core.

Figure 4.2 shows a high-level diagram of the NVIDIA G7x shader core [66]. Both NVIDIA GPUs studied belong to the G7x family. Rasterized fragment data enters the shader core through a fragment crossbar, which distributes the shading work among the fragment shader units. The fragment shader units access textures through four separate memory controllers.

Figure 4.3 shows the inner workings of the NVIDIA G7x series' fragment shader units. Each fragment shader unit can be split into two areas: the texture unit and the computation unit. The texture unit is the equivalent of a microprocessor's memory hierarchy, with a private L1 texture cache and access to a larger L2 texture cache. The compute unit contains two main ALUs, two mini-ALUs, a branch unit, and a fog ALU.


Figure 4.3: Diagram of the fragment shader of the NVIDIA G7x.

For GPGPU, only the main ALUs and the branch unit are relevant: the mini-ALUs only support 16-bit floating point calculations, and the fog ALU is too special-purpose to be useful. For GPGPU work, the NVIDIA G7x shader units are capable of up to three FLOPs per cycle: one multiply-add in the first ALU and a multiply in the second ALU.

The current NVIDIA GPUs also support a capability known as co-issue, shown in Figure 4.4. Co-issue improves the utilization of the shader units when the data types are smaller than vec4. When a shader program uses vec2 data types, for example, co-issue schedules two pixels to run on one shader unit. Co-issue may therefore reduce the penalty for using scalar float and vec2 data types in shader programs.

Figure 4.5 shows a high-level diagram of the ATi R5xx shader core. Both ATi GPUs studied belong to the R5xx family. Rasterized fragment data enters the shader core through a multi-threaded dispatch processor.


Figure 4.4: Diagram of dual-issue and co-issue.

The dispatch processor distributes work among the individual fragment shaders, all of which have access to a common general purpose register file. Additionally, the dispatch processor is in charge of texture lookups. The ATi R5xx's fragment shaders, shown in Figure 4.6, have a different architecture from the NVIDIA G7x: instead of a 4-wide ALU that can co-issue two smaller operations, the R5xx uses two separate ALUs, a vec3 ALU and a scalar ALU. Similar to the G7x, however, the two ALUs combined can perform one multiply-add and one multiply every cycle. Each fragment shader unit also contains a branch processing unit.

One thing to note about both the ATi and NVIDIA architectures is that both designs take a significant performance hit on dynamic branches when adjacent pixels take different branch paths.


Figure 4.5: Diagram of the ATi R5xx pixel shader core.

When adjacent pixels take different control flow paths, the shader core has to execute both paths of the branch for all of the pixels. This dynamic branching problem is not an issue for the ported math library, as all of its branches are if-converted to predication.

To provide a baseline performance measure, two families of CPUs were studied: the Intel Core Duo [27] and the Intel Core 2 Duo [6]. For the compute-bound vForce library, the most important distinguishing factors between the two architectures are raw floating point performance and memory bandwidth. The Core Duo can provide up to four single precision FLOPs per cycle: two FADDs and two FMULs. The Core 2 Duo, with its wider internal data paths, can do twice as much: up to four FADDs and four FMULs per cycle.


Figure 4.6: Diagram of the fragment shader of the ATi R5xx.

As far as memory bandwidth is concerned, all of the systems except the Mac Pros have a 64-bit (8-byte) wide front side bus running at 667MHz, which provides them with 5.4GB/s of memory bandwidth. The Mac Pro machines, on the other hand, enjoy a 64-bit front side bus running at twice that speed, 1333MHz, giving them a total of 10.8GB/s of memory bandwidth.

One of the 2.66GHz Core 2 Duo machines was used to conduct the PCI-E bandwidth tests. The Mac Pro is useful as a PCI-E bandwidth test platform because it provides a utility for allocating PCI-E lanes to the various PCI-E slots in the system. As a result, it is possible to allocate only eight lanes to an x16 PCI-E slot (in PCI Express, a slot can have fewer data lanes allocated than the maximum the slot allows), thus halving the bandwidth available to an attached GPU card. This study helps to pinpoint bandwidth bottlenecks in the system.

The power consumption measurements were obtained using a MacBook Pro with a 2.0GHz Core Duo and an ATi x1600 GPU. System-level power consumption was measured at the wall using a Kill-a-Watt device [37]. To minimize the influence of non-CPU, non-GPU components, all other applications were closed and the screen brightness was turned all the way down. The test harness was also modified so that the GPU tests ran long enough to produce stable readings. Since the Kill-a-Watt device has no way to automatically store power consumption figures, the author recorded them manually.

The power consumption study recorded several different values. First, it obtained a baseline figure by recording the machine's idle power consumption while no applications were running. Then the machine's power consumption was measured for a representative subset of the math functions in several different scenarios:

(1) Run the function without any data transfer on to or off of the GPU.

(2) Run the function and upload data to the GPU.

(3) Run the function and download results off of the GPU.

(4) Run the function and upload and download data to/from the GPU.

4.3 Testing a GPGPU Math Library

Figure 4.7 shows a high-level view of the test harness. The test harness first reads the test vectors from the input file and the GLSL code from the source file. It then compiles and links the GLSL code and loads the fragment shader program onto the GPU. Next, it sets up all of the texture data: the constant values, any look-up table data, and the test inputs.


Figure 4.7: The GPU test harness.

The test harness then feeds the test inputs to both the GPU and vForce. Once the results are calculated, it downloads the GPU results and compares them to the vForce results. Finally, it writes the comparisons to a log file for later analysis.

The test harness uses Mac OS X 10.4 "Tiger" as its development and execution environment. The test program uses the OpenGL API along with the OpenGL Utility Toolkit (GLUT) [43] to communicate with the GPU. All of the shader programs use the OpenGL Shading Language (GLSL), and the OpenGL Extension Wrangler (GLEW) [35] manages the OpenGL extensions.

The test harness measures performance in four ways:

(1) No Transfer. No transfer examines how GPGPU programs perform when running exclusively on the GPU. In no-transfer mode, the test harness loads the input data once and repeatedly runs the shader program.

(2) Upload Only. In upload only, the test harness repeatedly uploads the input data and runs the shader program. The purpose of the upload-only mode is to measure the impact of upload transfers on a GPGPU program's performance. Uploads are performed using the OpenGL command glTexImage2D.

(3) Download Only. In download only, the test harness repeatedly runs the shader program and downloads the results. The download-only test measures the impact of download transfers on performance. Downloads are performed using the OpenGL command glReadPixels.

(4) Upload and Download. In upload and download mode, the test harness repeatedly uploads input data to the GPU, runs the function, and downloads the results from the GPU. This test characterizes the performance of the math functions as if they were part of a stand-alone GPU math library. (A minimal sketch of the upload and download calls follows this list.)
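A minimal C sketch of the two transfer calls named above (texture name and sizes hypothetical; the intervening render pass is elided):

    #include <GL/glew.h>

    void transferOnce(GLuint inputTex, int w, int h,
                      const float *input, float *output) {
        glBindTexture(GL_TEXTURE_2D, inputTex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, w, h, 0,
                     GL_RGBA, GL_FLOAT, input);            /* upload */
        /* ... draw a full-screen quad to run the shader ... */
        glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, output); /* download */
    }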

Getting optimal performance out of GPU texture transfers requires special consideration from the OpenGL programmer. Most graphics systems do not have a default "direct path" between an OpenGL application's texture memory and the GPU: the different layers of the graphics system each keep their own copy of the data. Keeping multiple copies of the same texture data makes graphics programming easier: upon creating a texture with glTexImage2D, the OpenGL application does not have to explicitly allocate and deallocate texture memory, as the texture data resides within the OpenGL system. Since graphics applications typically create a texture once, the extra overhead is acceptable. With GPGPU, on the other hand, the multiple memory copies between layers of the graphics stack add significant overhead. Figure 4.8 shows Apple's graphics system stack.


Figure 4.8: The Apple OpenGL software stack.

Normally in OSX, transferring a texture from an OpenGL application to the GPU's local memory entails three memory copies: one between the client application and the OpenGL framework, one between the OpenGL framework and the OpenGL driver, and one between the OpenGL driver and the GPU.

On non-OSX systems, the recommended way to eliminate the extra memory copies is to use Pixel Buffer Objects (PBOs) [3]. PBOs allow an OpenGL application to directly manage texture memory. OSX, however, does not benefit much from PBOs.

On OSX, PBOs can eliminate at most one of the copies. To eliminate two of the memory copies, Apple instead provides two OpenGL extensions: APPLE_client_storage [57] and APPLE_texture_range [2], which eliminate the copies between the client application and the OpenGL framework, and between the OpenGL framework and the OpenGL driver, respectively [8].
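A hedged C sketch of how these two extensions are typically enabled (not necessarily the test harness's exact code; the caller must keep the input buffer valid for the texture's lifetime):

    #include <OpenGL/gl.h>     /* OSX OpenGL headers */
    #include <OpenGL/glext.h>

    /* Hint GL to reference the caller's buffer instead of copying it. */
    void loadTextureZeroCopy(GLuint tex, int w, int h, const float *input) {
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_STORAGE_HINT_APPLE,
                        GL_STORAGE_SHARED_APPLE);   /* APPLE_texture_range */
        glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, w, h, 0,
                     GL_RGBA, GL_FLOAT, input);     /* no host-side copy */
    }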

Another issue pertinent to obtaining optimal performance is texture size: while large blocks of memory are usually desirable for performance, each GPU has its own optimal texture size, which is discussed in Chapter 5.

As discussed previously, measuring the accuracy of a math function involves using either ulps or an absolute difference. When testing the floating point specials, however, quantities such as ulps and absolute differences become meaningless; for example, if one value is infinity, its difference from a real number is still infinity. Similarly, a NaN always compares false, even to another NaN. As a result, a different method of quantifying errors was used: instead of quantifying errors by their absolute difference, each test case was binned into one of six categories:

(1) Correct. Both results are identical.

(2) Normal error. Both results, while different, are normal numbers of the same sign.

(3) Denormal error. One of the results is a zero while the other is a normal number. Such a result can occur when a denormal input or intermediate gets squashed to zero by the hardware.

(4) Infinity error. One of the results is an infinity value.

(5) NaN error. One of the results is a NaN value.

(6) +/-0 error. A +/-0 error is reported whenever a zero or negative zero appears in the result.

The following single precision functions were ported from vForce over to the GPU:

(1) vvdivf - Division.

(2) vvsqrtf - Square root.

(3) vvsinf - Sine.

(4) vvcosf - Cosine.

(5) vvtanf - Tangent.

(6) vvasinf - Inverse sine.

(7) vvacosf - Inverse cosine.

(8) vvatanf - Inverse tangent.

(9) vvlogf - Natural logarithm.

(10) vvlog1pf - Natural logarithm of 1 + x. Provides extra accuracy when 1 + x is immediately around 1 (i.e., for small x).

(11) vvexpf - Exponential function. Calculates e^x.

(12) vvexpm1f - Exponential function. Calculates e^x - 1; unlike computing vvexpf and subtracting 1, it provides extra accuracy for values of x immediately around zero.

(13) vvlogbf - Base-2 logarithm of x.

The performance tests done on the Mac Pros as well as the power tests employed a restricted subset of the vForce functions, due to limited availability of the machines.

The restricted function subset provides a representative sample of the different classes of math functions, including functions with special characteristics such as the table lookups in log1p. The subset includes sinf, logarithm, log1p, multiplication, exp, sqrt, addition, asin, and division.

Chapter 5

Experimental Results

This chapter presents and discusses the results of the GPU accuracy, performance, power, and bandwidth tests. Section 5.1 presents the accuracy results, Section 5.2 the performance results, Section 5.3 the power/energy results, and Section 5.4 the results of the PCI-E bandwidth study.

5.1 Accuracy Results

Figure 5.1 shows the average accuracy figures for the basic operators, the built-in GLSL functions, and the ported vForce functions. Figure 5.2 compares the accuracy percentages between the four GPUs. On the basic operations, the NVIDIA GPUs are more accurate than the ATi GPUs; there is little accuracy difference, however, between GPU models from the same vendor.

The built-in GLSL functions suffer serious accuracy issues with normal numbers as well as with NaNs. Both problems likely stem from the functions' graphics roots: modest accuracy problems are not an issue for graphics, nor is properly handling NaNs. Overall, the NVIDIA GPUs are more accurate on the built-in GLSL functions than the ATi GPUs. There are also significant accuracy differences within the GPU families: the mainstream GPUs (the ATi x1600 and the NVIDIA 7300GT) enjoy better accuracy than the ATi x1900XTX and the NVIDIA Quadro FX 4500, respectively.

Figure 5.1: The accuracy differences between the tested GPUs for the basic operations, the built-in GLSL functions, and the ported vForce functions.

Figure 5.2: Comparison of the accuracy between the different GPUs for the basic operations, the built-in GLSL functions, and the ported vForce functions.


Figure 5.3: Accuracy improvement of the ported vForce functions vs. the built-in GLSL functions.

Of the four GPUs, the NVIDIA 7300GT has the fewest NaN errors.

Once again, the NVIDIA GPUs provide a greater level of overall accuracy on the ported vForce functions. In this case, however, the NVIDIA Quadro FX 4500 achieves higher overall accuracy than the NVIDIA 7300GT; for ATi, the mainstream x1600 still enjoys higher accuracy than the high-end x1900XTX.

A possible explanation for the nearly identical intra-family accuracy results on the basic operations, but significantly different results on the built-in GLSL functions and the ported vForce functions, is differences between the GLSL compilers. As discussed previously, compiler optimizations can affect the accuracy of floating-point results. An example of such an optimization is identity removal in the ATi shader compiler: when a variable is compared for equality (or inequality) with itself, the condition normally always evaluates to true (or false), except when the variable is a NaN. It was found that the ATi compiler optimizes out these comparisons, causing NaN self-comparisons to evaluate incorrectly. Overall, as Figure 5.3 shows, the ported vForce functions are significantly more accurate than the built-in GLSL functions.

Figure 5.4: A performance comparison of the built-in GLSL functions versus the ported vForce functions.

5.2 Performance Results

The trade-off, however, is performance: the ported vForce functions are significantly slower than the built-in GLSL functions, an average of 68.8% slower for the all-GPU version. The ported vForce div and sqrt are virtually guaranteed to be slower than their built-in GLSL equivalents, since both use the built-in GLSL functions to provide the initial estimates that are subsequently refined.

A possible future improvement worth investigating is whether the GPU's powerful memory subsystem could be leveraged to implement the square root and division estimates as a large lookup table.

The other functions showing the largest performance difference between the built-in GLSL and ported vForce versions are exp, log, and sine. exp's large performance decrease is probably due to the lack of bitwise operators on the GPU, which exp would otherwise use to manipulate exponents; log most likely suffers from the same issue. Finally, the ported vForce version of sine is 86.6% slower than the built-in GLSL version, most likely due to the complicated GPU-based argument reduction: the round-to-nearest conversion to integer requires many operations on the GPU. Additionally, there is a good chance that the GPU supports the sine function directly in hardware.

Overall, the performance advantage of the built-in GLSL functions over the ported vForce functions diminishes significantly when the calculations become bandwidth-limited, such as when they involve off-GPU transfers. A surprising result is that the ATi GPUs show little performance difference between the built-in GLSL and the ported vForce functions. The reason for this difference is not known; perhaps the division operation is highly limited by the GPU's memory bandwidth.

Figures 5.5 and 5.6 compare the GPUs' performance when different texture sizes are used. The graphs show the percent performance change against a 128x128 texture-size baseline. Three texture sizes were compared: 128x128, 256x256, and 512x512. The results show that each GPU has its own optimal texture size, and the optimal size even differs within a vendor's GPU family.


Figure 5.5: A performance comparison of different texture sizes for the NVIDIA GPUs.


Figure 5.6: A performance comparison of different texture sizes for the ATi GPUs.

Overall, the NVIDIA GPUs tend to be less particular about the texture size used, with larger textures generally providing higher performance. The NVIDIA 7300GT's optimal texture size is 256x256, while the Quadro FX 4500 prefers 512x512 textures.

The two ATi GPUs, particularly the x1600, tend to have more irregular texture-size results: the ATi x1600 achieves its best overall performance with 128x128 textures, while the x1900XTX prefers 256x256 textures. Moreover, the ATi GPUs suffer significant performance penalties when moving to textures larger than their optimal size.

Figure 5.7 shows the overall performance of all of the vForce functions ported to the ATi x1600 and the NVIDIA 7300GT. The GPU results are plotted against the Core Duo vForce results from the x1600 system and the Core 2 Duo results from the 7300GT system. All of the GPU results use the vec4 data type, as it provides the highest performance on all of the benchmarks. Whether the GPU-ported math functions enjoy any performance benefit depends heavily on the CPU used as the performance baseline, as well as on whether any off-GPU transfers are involved.

Figure 5.8 shows the raw performance of vForce running on the different systems tested. Compared are a 1.83GHz Intel Core Duo (Core Duo iMac), a 2.17GHz Core 2 Duo (Core 2 Duo iMac), a 2.66GHz Core 2 Duo (NVIDIA 7300GT Mac Pro), and a 3.0GHz Core 2 Duo (NVIDIA Quadro FX 4500 Mac Pro). On all of the functions except division, the Core 2 Duo iMac enjoys an average 1.95x speedup over the Core Duo iMac.

Such a gain is entirely possible given the 2x advantage in peak single precision floating point throughput provided by the wider SSE ALUs of the Core 2 Duo microprocessor.


Figure 5.7: A performance comparison of the ported vForce functions.


Figure 5.8: A performance comparison of vForce running on different processors.

The 1.95x gains on the benchmarks also suggest that the vForce tests are not limited by memory bandwidth: the Core 2 Duo iMac enjoys a large performance increase while having exactly the same 64-bit, 667MHz memory subsystem as the Core Duo iMac. Likewise, the Mac Pros, with their high-bandwidth memory subsystem, do not gain much of an advantage above and beyond their higher clock speed.

Figures 5.9 and 5.10 show the performance improvement from moving from the scalar float data type to the vector vec4 data type. While the 4x wider data type should in theory provide 4x the throughput, it does not: not even the complicated log1p and expm1 achieve more than a 192% speedup.

There are several possible reasons for the relatively small gains. First, it is highly likely that all of the functions are memory bandwidth limited, even those that avoid slow off-GPU transfers. Another possibility is that the vec4 data type incurs overhead not seen with the float data type: some operations, such as the reciprocal and reciprocal square root estimates, may operate internally as scalar operations, and, due to limitations of the GLSL language, the conditional select has to be implemented as four scalar if-then-else operations. Finally, both the ATi and NVIDIA GPUs can issue two scalar float operations or one vector vec4 operation to each fragment shader, doubling the potential scalar throughput.

The four graphs in Figure 5.11 quantify the performance gains from moving from the scalar float data type to the vector vec4 data type on the four GPUs. The ATi GPUs gain the most from going to vec4, possibly because the co-issue support on the NVIDIA GPUs already improves their scalar throughput, leaving less room for improvement. As expected, the largest gains from going to vec4 occur on the tests that involve no off-GPU transfers; significant gains remain, however, when data is only uploaded to the GPU.

Note that the built-in GLSL functions gain little performance from going from float to vec4, with the exception of asin.


Figure 5.9: Speedups gained from using the vec4 data type as opposed to the float data type on all of the ported vForce functions.


Figure 5.10: Speedups gained from using the vec4 data type as opposed to the float data type on the restricted subset of the functions.


Figure 5.11: A comparison of the different GPUs using the float data type.

While the real reason for this behavior is unknown, there are two likely possibilities. One is that the built-in GLSL functions always operate on float data types: when the functions are given vec4 inputs, the data is broken down internally into four float values. The other is that, when float data types are used, the compiler does the reverse as an optimization, combining four float operations into a single vec4 operation.

Figure 5.12 compares the performance of the four GPUs. All of the comparisons use the highest-performing vec4 data type. The first graph compares the NVIDIA Quadro FX 4500 to its mainstream sibling, the 7300GT. While the higher end GPU performs 239% better than the 7300GT when no off-GPU transfers are involved, the performance difference narrows significantly (to a mere 3.13%) when slow off-GPU data transfers begin to dominate.

It is also worth investigating whether the GPU hardware's performance is affected by the input data sets. Figure 5.13 shows the performance difference for the ported vForce functions using two different input sets: one containing only normal numbers, and a mixed set containing all of the other types of numbers in the IEEE 754 number system: denormals, infinities, NaNs, and negative zero. Overall, the performance figures show little difference between the two data sets.

Since bandwidth to the GPU's local memory, as well as to and from the GPU, plays such an important role in the performance of GPU-run applications, it is essential to study the maximum bandwidth of the GPUs. Figure 5.14 shows the raw bandwidth of the different GPUs. The GPU bandwidth is measured by taking the performance of the addition test, in millions of results per second (MResults/s), and multiplying it by four (the number of bytes in an element) and then by three (the number of input textures, two, plus the number of output textures, one).
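As an illustrative example (hypothetical figures, not a measured result): an addition rate of 400 MResults/s would correspond to 400 x 10^6 results/s x 4 bytes x 3 textures, or roughly 4.8 GB/s of realized texture bandwidth.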

The results show that while the higher end GPUs enjoy significantly greater bandwidth to their local memories (221% higher for the NVIDIA GPUs and 564% higher for the ATi GPUs), there is a much smaller difference in off-GPU bandwidths (the Quadro FX 4500 has 88.6% of the 7300GT's bandwidth, while the ATi x1900XTX has 158% of the x1600's).


Figure 5.12: A comparison of the different GPUs using the vec4 data type.


Figure 5.13: A comparison of the performance of the ported vForce functions using an all-normals dataset versus a mixed dataset.


Figure 5.14: A raw bandwidth comparison.

The high-end NVIDIA GPU system actually has somewhat lower off-GPU bandwidth than the mainstream 7300GT system does. This bandwidth discrepancy is likely due to the relative immaturity of the Mac Pro platform used to test the high-end GPUs, compared to the shipping iMac platform.

5.3 Power/Energy Results

Figure 5.15 shows the overall system-level power consumption for a restricted subset of the functions running on the MacBook Pro. Note that there is little difference in system-level power consumption between vForce and the different GPU implementations: the difference is no more than about five watts. As a result, if using the GPU is going to show any energy savings, it will have to do so by allowing a given piece of work to finish faster.

Since the overall power consumption of the MacBook Pro is mostly unchanged regardless of whether the function runs on the CPU or the GPU, the GPU can only improve energy consumption if it performs the work faster than the CPU. The performance results are similar to those seen in the previous section: the GPU-based functions generally do better than the CPU versions when no off-GPU transfers are necessary, and the more complicated math functions gain more from porting to the GPU.

Figure 5.17 shows which functions, under which conditions, actually show an energy consumption improvement from running on the GPU instead of the CPU. The numbers in Figure 5.17 are calculated by taking the power consumption, multiplying it by one second, and dividing by the number of results generated per second. A case for saving energy by running the math functions on the GPU can be made when no off-GPU transfers are involved and/or the math functions are fairly complicated.
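As an illustrative example (hypothetical figures, not a measured result): a system drawing 50 W while producing 200 MResults/s would cost 50 W x 1 s / 200 MResults, or 0.25 J/MResult.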


Figure 5.15: A comparison of the system-level power consumption of different GPU-based math functions vs. vForce.


Figure 5.16: A comparison of the performance of different built-in GLSL functions vs. vForce.

Figure 5.17: A comparison of the energy consumption of different GPU-based math functions vs. vForce.


Figure 5.18: The CPU-GPU transfer bandwidth for differing PCI-E widths.

5.4 Bandwidth Results

Section 5.2 showed that the realized off-GPU transfer rates are a mere fraction of the maximum 4.0GB/s possible with x16 PCI-E. Figure 5.18 shows that the poor bandwidth is not a pure, physical-layer hardware problem, as halving the number of available PCI-E lanes has only a small influence (13MB/s) on the realized bandwidth.

These results suggest that the low realized bandwidth is a problem residing within the GPU hardware itself, in the drivers, or within the operating system.

Comparing the OSX results to the results obtained by NVIDIA [3], it is likely that the poor transfer rates are the fault of either the CPU chipset (the NVIDIA performance figures were obtained using their nForce chipset on an AMD system) or the software. Potential culprits include the operating system, the driver layer, and the OpenGL library/framework. Since GPGPU has not been frequently attempted on OSX, it is likely that the OS is not well tuned for GPGPU work.

Chapter 6

Summary and Conclusion

This thesis presented a study of the potential of creating and using a high-performance GPU-based math library. It evaluated the feasibility of such a library across several metrics: accuracy, performance, and power consumption, measured on four GPUs found in current Apple machines. It compared the performance, accuracy, and power consumption of the four GPUs tested (the NVIDIA GeForce 7300GT, the ATi x1600, the ATi x1900XTX, and the NVIDIA Quadro FX 4500) against each other as well as against the CPU-based version of the math library.

Testing the accuracy involved feeding special test vectors into the original Apple vForce functions, the math functions built into GLSL, and the GPU-ported vForce functions, and subsequently comparing the results. To help understand the sources of accuracy error in the ported GLSL functions, the basic arithmetic and comparison operators were also studied. Accuracy comparisons involved categorizing each test vector by whether both functions produced the same result. If the results differed, the study categorized the error as one of: normal errors, denormal errors, infinity errors, Not a Number (NaN) errors, and +/-0 errors.

The performance tests thoroughly investigated the parameters relevant to GPU performance: texture size, data type, and different combinations of data transfers on to and off of the GPU. The tests examined these parameters using a representative subset of the built-in GLSL math functions as well as the ported math functions, comparing the throughput (in results per second) of the four GPUs against each other and against the CPU-based vForce baseline. Since off-GPU transfers are an important component of GPGPU performance, the performance study also investigated the effect of halving the effective PCI-E bandwidth to the GPU.

Finally, this thesis investigated how using the GPU-based math functions affects power and energy consumption. The power consumption study measured system-level power at the wall on a MacBook Pro with an ATi x1600 GPU while running a restricted subset of both the built-in GLSL math functions and the GPU-ported vForce functions, comparing the power draw against that of the CPU-based vForce functions.

6.1 Summary of Results

Accuracy-wise, the GPU-implemented math library is still significantly worse than the CPU-based vForce code from which it derives. Moreover, the accuracy levels are inconsistent between GPU vendors and even within GPU families. A promising finding, however, is that the ported vForce functions are significantly more accurate than the built-in GLSL math functions. Of the two GPU vendors, NVIDIA's GPUs enjoy higher overall accuracy than ATi's on the basic operations, the built-in GLSL functions, and the ported vForce functions. None of the GPUs can correctly handle denormal numbers. A surprising result with the NVIDIA GPUs, however, is their ability to correctly compare denormal numbers. This capability is potentially useful: a math library could use it to detect and flag denormal inputs, which could then be computed correctly on the CPU instead.

The performance of the math library on the GPU is generally superior to CPU-based vForce so long as the computation stays on the GPU. As soon as off-GPU transfers become involved, performance drops significantly, to the point where only the most computation-intensive math functions retain a performance advantage from running on the GPU. The large performance hit from moving data to and from the GPU is compounded by the fact that the realized off-GPU bandwidth is a mere fraction of the maximum bandwidth possible. Uploads to the GPU are somewhat faster than downloads from the GPU. The low realized bandwidth is due to poor utilization of the PCI Express bus by the GPU system, as halving the effective bandwidth of the PCI Express connection does not significantly affect off-GPU bandwidth.

As expected, the higher-end ATi x1900XTX and NVIDIA Quadro FX 4500 are significantly faster than the mainstream ATi x1600 and NVIDIA 7300GT when all of the computation stays on the GPU. Once slow off-GPU transfers enter the performance picture, however, the gap narrows significantly, and even disappears in some instances.

Each GPU has its own performance-optimum texture size, with the NVIDIA GPUs generally preferring larger textures (256x256 for the 7300GT and 512x512 for the Quadro FX 4500) than the ATi GPUs (128x128 for the x1600 and 256x256 for the x1900XTX). Using the vec4 data type instead of the float data type provides a significant performance improvement of up to 200%, as sketched below.
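For illustration, the hypothetical helper below uploads n input floats as n/4 RGBA texels (assuming side*side == n/4 and the ARB_texture_float extension, which the GPUs studied here expose), so each fragment shader invocation computes a full vec4 of results at once.

#include <GL/glew.h>

/* A minimal sketch of the vec4 data layout: four consecutive inputs
 * occupy the R, G, B, and A channels of one floating-point texel rather
 * than four single-channel texels. */
static GLuint upload_packed(const float *data, int side)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    /* Four floats per texel: one vec4 per fragment shader invocation. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, side, side, 0,
                 GL_RGBA, GL_FLOAT, data);
    return tex;
}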

Running the math libraries on the GPU slightly increases overall system power consumption. Power consumption was not noticeably affected by uploading or downloading data to or from the GPU. As a result, the GPU-implemented math libraries only showed an energy consumption advantage when the math library ran faster on the GPU than on the CPU.

6.2 Current Suitability of a GPGPU Math Library

A GPU-based math library is most suitable from a performance perspective whenever the GPU math function can be integrated into other programs running on the GPU. Doing so eliminates the need for slow transfers of data to and from the CPU. The math functions best suited for GPU execution are the ones that involve the most computation.

When a program that uses a math function requires extremely high levels of accuracy or double precision arithmetic, a GPU-based math library is unsuitable. Similarly, a GPU-based math library is not useful when it is unacceptable that different GPUs produce different mathematical results.

It also remains difficult to program the GPU for general purpose use, as the OpenGL API used to program the GPU is not designed to facilitate GPGPU programming. Programmers must still deal with graphics programming constructs, and the programming environment imposes considerable overhead on the GPGPU programmer. First, the OpenGL Utility Toolkit (GLUT) requires that a window be opened before the GPU can be used. Second, there is no way to pre-compile a GLSL shader program: every time the program starts, the GLSL source must be loaded as a text file, compiled, and linked before it can run on the GPU; the sketch below illustrates this per-launch overhead.
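A minimal sketch of that per-launch work, written against the OpenGL 2.x API; the helper name is illustrative, a GL context (for example, a GLUT window) must already exist, and the GLSL source must already have been read from a text file by the caller.

#include <GL/glew.h>

/* With no precompiled shader format available, the GLSL source must be
 * compiled and linked at every program start before any GPU computation
 * can begin.  Error checking via glGetShaderiv/glGetProgramiv is omitted
 * for brevity. */
GLuint build_shader_program(const GLchar *glsl_source)
{
    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 1, &glsl_source, NULL); /* attach source text  */
    glCompileShader(shader);                       /* compiled every run  */

    GLuint program = glCreateProgram();
    glAttachShader(program, shader);
    glLinkProgram(program);                        /* linked every run    */
    return program;
}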

6.3 Improving GPU Accuracy

There are several possible ways to improve the accuracy of the GPU's results. Langou et al. [38] describe a technique for computing a double precision result using mostly single precision arithmetic augmented with a final double precision Newton-Raphson refinement pass. While the authors used the technique to achieve accurate double precision results, it can also be used to improve the accuracy of single precision math functions. A GPU-based implementation of this technique would execute the single precision computations on the GPU and then transfer the intermediate results to the CPU for the double precision refinement step. Da Graca and Defour [30] present another technique for improving accuracy that uses two single precision values to emulate a higher precision number.
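The division of labor can be illustrated with reciprocal square root as a stand-in example; the sketch below is illustrative C, not the code of [38]. The initial approximation is computed in single precision, which is the GPU's role in the proposed scheme, and one double precision Newton-Raphson step refines it, which is the CPU's role.

#include <math.h>

/* Single precision approximation refined by one double precision
 * Newton-Raphson step: y1 = y0 * (1.5 - 0.5 * x * y0 * y0) converges
 * toward 1/sqrt(x). */
double refined_rsqrt(float x)
{
    float y0 = 1.0f / sqrtf(x);   /* single precision (GPU-side) */

    double y  = (double)y0;       /* refinement (CPU-side)       */
    double xd = (double)x;
    return y * (1.5 - 0.5 * xd * y * y);
}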

Implementing interval arithmetic is another possible way to mitigate the accuracy problems. Rather than producing a single answer, interval arithmetic defines every operation on a range of values that is guaranteed to contain the correct result. Providing interval arithmetic within the GPU math libraries would help users of the math functions by quantifying how inaccurate the actual result is: if the result is unacceptably inaccurate (e.g. the interval is too large), the calculation can be retried using a more accurate CPU-based math library. The extra computation involved in interval arithmetic will certainly hurt performance; however, packing the two interval bounds into a single pixel value may minimize the performance impact.
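A minimal C sketch of the idea follows, using nextafterf() to widen the bounds as a portable stand-in for directed rounding; the type and helpers are illustrative, not part of the thesis' library.

#include <math.h>

/* Each result is a [lo, hi] range that contains the true answer.  On
 * the GPU, the two bounds could be packed into one pixel, e.g. the red
 * and green channels. */
typedef struct { float lo, hi; } interval;

static interval interval_add(interval a, interval b)
{
    interval r;
    r.lo = nextafterf(a.lo + b.lo, -INFINITY);  /* round lower bound down */
    r.hi = nextafterf(a.hi + b.hi,  INFINITY);  /* round upper bound up   */
    return r;
}

/* The width of the interval quantifies the accumulated error: if it
 * grows too large, the caller can retry the computation on the CPU. */
static float interval_width(interval x) { return x.hi - x.lo; }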

6.4 Future Suitability of a GPGPU-based Math Library

Newer GPUs will most likely improve the accuracy of GPU-based math libraries. The floating-point accuracy of basic operations on the GPU hardware will improve, which will lead to a corresponding improvement in the libraries that use these operations. It will also be possible to employ the double precision support emerging in some future GPUs to reduce the amount of rounding error. Another helpful accuracy-related feature in upcoming GPUs is the availability of close-to-the-metal programming paradigms. Low-level access to the GPU improves accuracy by keeping graphics-optimized compilers out of the mix: once a given math library has been written and tested, its results will not change with new GPU driver releases, nor does the programmer have to worry about the compiler making unsafe math optimizations.

Future GPUs will also improve performance in several ways. First, the continued advancement of Moore's Law will allow further performance increases through larger transistor budgets; since GPUs are massively parallel devices, they can spend nearly all of their transistors on computation units. Upcoming GPUs with unified shaders will also improve performance by making it easier to utilize all of the computational resources on the GPU. Faster CPU-GPU interconnects will further improve overall GPGPU performance. Finally, directly exposing the GPU's memory hierarchy, along with other lower-level aspects of the GPU, to the GPGPU programmer will improve performance by allowing the programmer to optimize memory accesses directly instead of second-guessing a graphics-oriented API.

An important feature in upcoming Direct3D 10-compliant GPUs [17] is support for integer arithmetic and bitwise operations. These operations will improve the implementation of GPU-based math libraries in several ways. First, functions that need to directly manipulate the exponent and/or mantissa will be able to do so with fewer operations, improving both their accuracy and their performance. Support for these operations will also improve the accuracy and performance of the argument reduction steps used in sinf and cosf. Bitwise arithmetic also opens the door to handling denormal numbers on the GPU: while it is unlikely that future GPUs will handle denormal numbers in hardware, it will be possible to normalize denormal inputs, perform the desired operation, and subsequently re-denormalize the results. Similarly, bitwise arithmetic support allows GPU programmers to work around other incorrect and/or incomplete floating-point support, such as -0 handling and NaN payload support. The sketch below shows the kind of bit-level manipulation this enables.
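As an illustration, the hypothetical C helper below splits an IEEE-754 single precision value into its sign, exponent, and mantissa fields with a handful of integer operations; the same decomposition underlies both fast argument reduction and software denormal handling.

#include <stdint.h>
#include <string.h>

/* Extract the bit fields of a single precision float.  With integer and
 * bitwise support on the GPU, the equivalent shader code becomes a few
 * cheap instructions instead of a chain of floating-point workarounds. */
static void split_float(float f, uint32_t *sign, int32_t *exponent,
                        uint32_t *mantissa)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                    /* reinterpret as bits */
    *sign     = bits >> 31;
    *exponent = (int32_t)((bits >> 23) & 0xFF) - 127;  /* remove IEEE bias    */
    *mantissa = bits & 0x7FFFFF;
    /* exponent == -127 with a nonzero mantissa marks a denormal that
     * could be normalized, operated on, and re-denormalized in software. */
}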

Systems that have the GPU tightly integrated with the CPU will make a GPGPU-based math library more useful to a variety of applications by reducing the latency of CPU-GPU communication. Tight integration may even allow a GPU-based math library to serve as a drop-in replacement for a standard high performance math library, as the performance overhead of going to the GPU will likely be greatly reduced. A CPU-integrated GPU, however, risks lower performance than a discrete GPU due to having to share die area and memory bandwidth with the CPU. Additionally, integrating the CPU and GPU will likely improve the power consumption characteristics of a GPU-based math library. Power savings will come from the shorter, simpler interconnect: instead of having to traverse the CPU's front-side bus, the chipset's PCI-E controller, and the PCI-E links, CPU-GPU communications will entail a single hop or less.

Bibliography

[1] AltiVec Technology Programming Interface Manual, June 1999. Order number ALTIVECPIM/D 6/1999 Rev. 0.

[2] Apple texture range, February 2002. Web site. URL: http://download.nvidia.com/developer/presentations/2006/gdc/2006-GDC-OpenGL-tutorial-GPGPU-2.pdf.

[3] Fast texture downloads and readbacks using pixel buffer objects in OpenGL, August 2005. See URL: http://download.nvidia.com/developer/Papers/2005/Fast_Texture_Transfers/Fast_Texture_Transfers.pdf.

[4] Hardware - vector libraries, 2005. Web site. URL: http://developer.apple.com/hardwaredrivers/ve/vector_libraries.html.

[5] HLSL shader reference, 2005. Web site. URL: http://developer.apple.com/hardwaredrivers/ve/vector_libraries.html.

[6] Intel Core 2 Duo desktop processor, 2006. Web site: http://www.intel.com/products/processor/core2duo/prod_brief.pdf.

[7] NVIDIA 8800 GPU architecture overview, November 2006. Available at URL: http://www.nvidia.com/object/IO_37100.html.

[8] OpenGL programming guide for Mac OS X: using extensions to optimize, December 2006. Web site. URL: http://developer.apple.com/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_texturedata/chapter_10_section_2.html.

[9] The PeakStream platform: high productivity software development for multi-core processors, 2006. White paper. URL: http://www.peakstreaminc.com/reference/peakstream_platform_technote.pdf.

[10] ATi HD 2000 graphics family, 2007. Web site. URL: http://ati.amd.com/products/hdseries.html.

[11] ClearSpeed whitepaper: CSX processor architecture, February 2007. Web site. URL: http://www.clearspeed.com/docs/resources/ClearSpeed_Architecture_Whitepaper_Feb07v2.pdf.

[12] NVIDIA CUDA Compute Unified Device Architecture programming guide, February 2007. Web site. URL: http://developer.download.nvidia.com/compute/cuda/0_8/NVIDIA_CUDA_Programming_Guide_0.8.pdf.

[13] NVIDIA CUDA SDK release notes, April 2007. Web site. URL: http://developer.download.nvidia.com/compute/cuda/0_8/NVIDIA_CUDA_SDK_releasenotes_readme_win32_linux.zip.

[14] Overview, 2007. Web site: http://www.sun.com/processors/UltraSPARC-T1/features.xml.

[15] AGEIA. Web site: http://www.ageia.com/physx/.

[16] Apple. libm, 2007. Web site. URL: http://www.opensource.apple.com/projects/darwin/6.0/source/other/Libm-40.2.tar.gz.

[17] David Blythe. The Direct3D 10 system. In SIGGRAPH '06: ACM SIGGRAPH 2006 Papers, pages 724–734, New York, NY, USA, 2006. ACM Press.

[18] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pages 777–786, New York, NY, USA, 2004. ACM Press.

[19] R-Stream Streaming Compiler. Web site: http://www.reservoir.com/r-stream.php.

[20] Jerome Toby Coonen. Contributions to a proposed standard for binary floating-point arithmetic. Thesis (Ph.D.), University of California, Berkeley, Berkeley, CA, USA, 1986.

[21] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, Jung-Ho Ahn, Jayanth Gummaraju, Mattan Erez, Nuwan Jayasena, Ian Buck, Timothy J. Knight, and Ujval J. Kapasi. Merrimac: Supercomputing with streams. In SC ’03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35, Washington, DC, USA, 2003. IEEE Computer Society.

[22] Charlie Demerjian. Meet Larrabee, Intel's answer to a GPU. The Inquirer, February 2007.

[23] Kurt Akeley et al. EXT framebuffer object, April 2006. Web site. URL: http://oss.sgi.com/projects/ogl-sample/registry/EXT/framebuffer_object.txt.

[24] Randima Fernando and Mark J. Kilgard. The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Addison-Wesley Professional, 75 Arlington Street, Suite 300, Boston, MA 02116, USA, February 2003.

[25] Ondřej Fialka and Martin Čadík. FFT and convolution performance in image filtering on GPU. In IV '06: Proceedings of the conference on Information Visualization, pages 609–614, Washington, DC, USA, 2006. IEEE Computer Society.

[26] S. Fuller. Motorola's AltiVec technology. Technical Report ALTIVECWP/D, Motorola, http://www.mot.com/SPS/PowerPC/AltiVec/facts.html, 1998.

[27] Simcha Gochman, Avi Mendelson, Alon Naveh, and Efraim Rotem. Introduction to Intel Core Duo processor architecture. Intel Technology Journal, 10(2):89–97, May 2006.

[28] Naga Govindaraju, Jim Gray, Ritesh Kumar, and Dinesh Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 325–336, New York, NY, USA, 2006. ACM Press.

[29] Naga K. Govindaraju, Nikunj Raghuvanshi, and Dinesh Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 611–622, New York, NY, USA, 2005. ACM Press.

[30] G. Da Graca and D. Defour. Implementation of float-float operators on graphics hardware. In RNC7, pages 23–32, July 2006.

[31] Wolfgang Gruener. AMD's 'Fusion' to merge CPU and GPU, October 2006.

[32] Michael M. Heck. 3D visualization for oil and gas evolves. HPCWire, April 2007.

[33] Michael M. Heck. High performance modelling of derivative prices using the PeakStream platform, September 2007. Web site. URL: http://www.peakstreaminc.com/reference/peakstream_finance_technote.pdf.

[34] IEEE Task P754. ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic. IEEE, New York, August 12 1985. A preliminary draft was published in the January 1980 issue of IEEE Computer, together with several companion articles. Available from the IEEE Service Center, Piscataway, NJ, USA.

[35] M. Ikits and M. Magallon. The OpenGL Extension Wrangler Library. Web site: http://glew.sourceforge.net/.

[36] Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany. The Imagine stream processor. In Proceedings of the IEEE International Conference on Computer Design, pages 282–288, September 2002.

[37] Kill A Watt. Web site: http://www.p3international.com/products/special/P4400/P4400-HG.html.

[38] Jakub Kurzak and Jack Dongarra. Implementation of the mixed-precision high performance LINPACK benchmark on the CELL Processor. LAPACK Working Note 177, September 2006. Also available as UT-CS-06-580.

[39] L. Latta. Building a million particle system, 2004. Web site. http://citeseer.ist.psu.edu/latta04building.html.

[40] Calle Lejdfors and Lennart Ohlsson. Implementing an embedded GPU language by combining translation and generation. In SAC '06: Proceedings of the 2006 ACM symposium on Applied computing, pages 1610–1614, New York, NY, USA, 2006. ACM Press.

[41] David Luebke and Greg Humphreys. How GPUs work. Computer, 40(2):96–100, 2007.

[42] Intel Architecture Software Developer's Manual. Web site: http://citeseer.ist.psu.edu/646273.html.

[43] Mark J. Kilgard. The OpenGL Utility Toolkit (GLUT) programming interface: API version 3, 1996.

[44] Peter Markstein. Software division and square root using Goldschmidt's algorithms. In Proceedings of the 6th Conference on Real Numbers and Computers, November 2004.

[45] M. McCool, Z. Qin, and T. Popa. Shader metaprogramming, 2002.

[46] Rick Merritt. PCI Express extensions take aim at AMD. EETimes, September 2006.

[47] Kenneth Moreland and Edward Angel. The FFT on a GPU. In HWWS '03: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 112–119, Aire-la-Ville, Switzerland, 2003. Eurographics Association.

[48] S. L. Moshier. Cephes math library, 2000. Web site. URL: http://www.moshier.net.

[49] Jackie Neider and Tom Davis. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Release 1. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1993.

[50] Mark Peercy, Mark Segal, and Derek Gerstmann. A performance-oriented data parallel virtual machine for GPUs. In SIGGRAPH '06: ACM SIGGRAPH 2006 Sketches, page 184, New York, NY, USA, 2006. ACM Press.

[51] Stefan Popov, Johannes Günther, Hans-Peter Seidel, and Philipp Slusallek. Stackless kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum, 26(3), September 2007.

[52] RapidMind. Technology overview, 2007. Web site. URL: http://rapidmind.net/technology.php.

[53] Al Riske. The multicore advantage. Web site: http://research.sun.com/minds/2005-0902/, 2005.

[54] Randi J. Rost. The OpenGL Shading Language. Addison-Wesley Professional, Boston, MA, USA, 2006.

[55] Sander Sassen. ATi Radeon X800 XT, the new king of the hill? Hardware Analysis, May 2004.

[56] Carlos Eduardo Scheidegger, Joao Luiz Dihl Comba, and Rudnei Dias da Cunha. Navier-Stokes on programmable graphics hardware using SMAC. In SIBGRAPI '04: Proceedings of the XVII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI'04), pages 300–307, Washington, DC, USA, 2004. IEEE Computer Society.

[57] Geoff Stahl. Apple client storage, August 2002. Web site. URL: http://oss.sgi.com/projects/ogl-sample/registry/APPLE/client_storage.txt.

[58] Jon Stokes. Introducing the IBM/Sony/Toshiba Cell processor – part II: the Cell architecture. Ars Technica, February 2005.

[59] Jon Stokes. Introducing the IBM/Sony/Toshiba Cell processor, part I: the processing units. Ars Technica, February 2005.

[60] Jon Stokes. Intel drops a Nehalem bomb on AMD's Fusion: integrated graphics, on-die memory controller, SMT. Ars Technica, March 2007.

[61] David Tarditi, Sidd Puri, and Jose Oglesby. Accelerator: using data parallelism to program GPUs for general-purpose uses. SIGPLAN Not., 41(11):325–335, 2006.

[62] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: a language for streaming applications. In Proceedings of the 12th International Conference on Compiler Construction, 2002.

[63] Torrenza. Web site: http://enterprise.amd.com/us-en/AMD-Business/Technology-Home/Torrenza.aspx.

[64] Tim Tscheblockov, Alexey Stepin, and Anton Shilov. NVIDIA GeForce 6800 Ultra and GeForce 6800: NV40 enters the scene. X-bit labs, April 2004.

[65] Steve Upstill. The RenderMan Companion: A Programmer’s Guide to Realistic Computer Graphics. Addison-Wesley, Reading, MA, 1990.

[66] Scott Wasson. NVIDIA's GeForce 7800 GTX graphics processor. The Tech Report, June 2005.

[67] Ye Zhao, Zhe Fan, Wei Li, Arie Kaufman, and Suzanne Yoakum-Stover. Lattice-based flow simulation on GPU. In Proceedings of ACM Workshop on General-Purpose Computing on Graphics Processors, 2004.

Appendix A

Cephes Math Library Code for sinf

Provided below is the source code for the Cephes Math Library version of sinf.

The original headers are preserved; note, however, that the original source file also contained code and header information for cosf, which was removed before the code was placed in this appendix.

/* sinf.c
 *
 * Circular sine
 *
 *
 *
 * SYNOPSIS:
 *
 * float x, y, sinf();
 *
 * y = sinf( x );
 *
 *
 *
 * DESCRIPTION:
 *
 * Range reduction is into intervals of pi/4. The reduction
 * error is nearly eliminated by contriving an extended precision
 * modular arithmetic.
 *
 * Two polynomial approximating functions are employed.
 * Between 0 and pi/4 the sine is approximated by
 *     x + x**3 P(x**2).
 * Between pi/4 and pi/2 the cosine is represented as
 *     1 - x**2 Q(x**2).
 *
 *
 * ACCURACY:
 *
 *                    Relative error:
 * arithmetic   domain        # trials   peak     rms
 *    IEEE   -4096,+4096      100,000    1.2e-7   3.0e-8
 *    IEEE   -8192,+8192      100,000    3.0e-7   3.0e-8
 *
 * ERROR MESSAGES:
 *
 *   message          condition      value returned
 * sin total loss     x > 2^24            0.0
 *
 * Partial loss of accuracy begins to occur at x = 2^13
 * = 8192. Results may be meaningless for x >= 2^24.
 * The routine as implemented flags a TLOSS error
 * for x >= 2^24 and returns 0.0.
 */

/*
Cephes Math Library Release 2.2: June, 1992
Copyright 1985, 1987, 1988, 1992 by Stephen L. Moshier
Direct inquiries to 30 Frost Street, Cambridge, MA 02140
*/

/* Single precision circular sine
 * test interval: [-pi/4, +pi/4]
 * trials: 10000
 * peak relative error: 6.8e-8
 * rms relative error: 2.6e-8
 */

#include "mconf.h"

static float FOPI = 1.27323954473516;

extern float PIO4F;

/* Note, these constants are for a 32-bit significand: */
/*
static float DP1 = 0.7853851318359375;
static float DP2 = 1.30315311253070831298828125e-5;
static float DP3 = 3.03855025325309630e-11;
static float lossth = 65536.;
*/

/* These are for a 24-bit significand: */
static float DP1 = 0.78515625;
static float DP2 = 2.4187564849853515625e-4;
static float DP3 = 3.77489497744594108e-8;
static float lossth = 8192.;
static float T24M1 = 16777215.;

static float sincof[] = {
    -1.9515295891E-4,
     8.3321608736E-3,
    -1.6666654611E-1
};

static float coscof[] = {
     2.443315711809948E-005,
    -1.388731625493765E-003,
     4.166664568298827E-002
};

#ifdef ANSIC
float sinf( float xx )
#else
float sinf(xx)
double xx;
#endif
{
float *p;
float x, y, z;
register unsigned long j;
register int sign;

sign = 1;
x = xx;
if( xx < 0 )
    {
    sign = -1;
    x = -xx;
    }
if( x > T24M1 )
    {
    mtherr( "sinf", TLOSS );
    return(0.0);
    }
j = FOPI * x; /* integer part of x/(PI/4) */
y = j;
/* map zeros to origin */
if( j & 1 )
    {
    j += 1;
    y += 1.0;
    }
j &= 7; /* octant modulo 360 degrees */
/* reflect in x axis */
if( j > 3)
    {
    sign = -sign;
    j -= 4;
    }

if( x > lossth )
    {
    mtherr( "sinf", PLOSS );
    x = x - y * PIO4F;
    }
else
    {
    /* Extended precision modular arithmetic */
    x = ((x - y * DP1) - y * DP2) - y * DP3;
    }
/*einits();*/
z = x * x;
if( (j==1) || (j==2) )
    {
    /* measured relative error in +/- pi/4 is 7.8e-8 */
    /*
    y = (( 2.443315711809948E-005 * z
      - 1.388731625493765E-003) * z
      + 4.166664568298827E-002) * z * z;
    */
    p = coscof;
    y = *p++;
    y = y * z + *p++;
    y = y * z + *p++;
    y *= z * z;
    y -= 0.5 * z;
    y += 1.0;
    }
else
    {
    /* Theoretical relative error = 3.8e-9 in [-pi/4, +pi/4] */
    /*
    y = ((-1.9515295891E-4 * z
         + 8.3321608736E-3) * z
         - 1.6666654611E-1) * z * x;
    y += x;
    */
    p = sincof;
    y = *p++;
    y = y * z + *p++;
    y = y * z + *p++;
    y *= z * x;
    y += x;
    }
/*einitd();*/
if(sign < 0)
    y = -y;
return( y);
}

Appendix B

Detailed Accuracy Results

Provided in this appendix are the accuracy results for the different accuracy tests performed. Note that all comparisons are made against the results of vForce.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
sqrt    63        17.5     0.00        0.00    50.8    1.59     30.2
sin     821       33.7     4.87        1.95    0.00    0.244    59.2
cos     791       35.5     0.00        0.00    0.00    0.00     64.5
tan     1152      3.47     11.3        0.00    0.00    0.694    84.5
asin    96        12.5     10.4        0.00    49.0    2.08     26.0
acos    96        25.0     0.00        0.00    49.0    0.00     26.0
atan    96        12.5     10.4        0.00    0.00    2.08     75.0
log     37        59.5     0.00        0.00    18.9    0.00     21.6
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        20.5     0.00        18.2    40.1    0.00     20.5
exp2    28        100      0.00        0.00    0.00    0.00     0.00

Table B.1: Accuracy of built-in GLSL functions, NVIDIA 7300GT.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
sqrt    63        22.2     1.59        0.00    50.8    1.59     23.8
sin     821       26.9     5.60        1.95    0.00    0.244    65.3
cos     791       28.4     1.14        0.00    0.00    0.00     70.4
tan     1152      2.69     1.91        1.30    0.00    0.694    93.4
asin    96        0.00     10.4        0.00    59.4    0.00     30.2
acos    96        0.00     0.00        0.00    59.4    0.00     40.6
atan    96        2.08     10.4        0.00    10.4    2.08     75.0
log     37        13.5     0.00        5.40    56.8    0.00     24.3
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        18.2     0.00        18.2    40.1    0.00     22.7
exp2    28        100      0.00        0.00    0.00    0.00     0.00

Table B.2: Accuracy of built-in GLSL functions, ATi x1600.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
sqrt    63        17.5     0.00        0.00    50.8    1.59     30.2
sin     821       33.7     4.87        1.95    0.00    0.243    59.2
cos     791       35.5     0.00        0.00    0.00    0.00     64.5
tan     1152      3.47     11.28       0.00    0.00    0.694    84.5
asin    96        12.5     10.4        0.00    49.0    2.08     26.0
acos    96        25.0     0.00        0.00    49.0    0.00     26.0
atan    96        12.5     10.4        0.00    0.00    2.08     75.0
log     37        59.5     0.00        0.00    18.9    0.00     21.6
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        20.4     0.00        18.2    40.9    0.00     20.5
exp2    28        35.7     7.14        0.00    57.1    0.00     0.00

Table B.3: Accuracy of built-in GLSL functions, NVIDIA Quadro FX 4500.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
div     345       53.0     8.70        8.70    14.2    9.86     5.51
sqrt    63        76.2     0.00        0.00    22.2    1.59     0.00
sin     821       30.5     3.65        1.95    0.00    0.243    63.7
cos     791       34.9     0.00        0.00    0.00    0.00     65.1
tan     1152      2.69     1.91        1.30    0.00    0.694    93.4
asin    96        76.0     0.00        0.00    0.00    2.08     24.0
acos    96        88.5     0.00        0.00    0.00    0.00     11.5
atan    96        55.2     10.4        0.00    10.4    2.08     21.9
sinh    48        16.7     43.8        14.6    0.00    0.00     50.0
cosh    30        96.7     0.00        0.00    3.33    0.00     0.00
tanh    36        36.1     58.3        0.00    0.00    2.78     2.78
asinh   34        2.94     61.8        0.00    8.82    2.94     23.5
acosh   20        5.00     5.00        0.00    90.0    0.00     0.00
atanh   48        52.1     43.8        2.08    0.00    2.08     0.00
log     37        67.6     0.00        0.00    21.6    0.00     10.8
exp     28        92.8     0.00        0.00    7.14    0.00     0.00
log2    44        68.1     0.00        0.00    31.8    0.00     0.00
expm1   52        57.7     40.4        0.00    0.00    1.92     0.00
log1p   86        51.2     45.3        0.00    1.16    1.16     1.16

Table B.4: Accuracy of ported vForce functions, ATi x1600.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
div     345       55.1     9.27        5.22    15.9    9.85     4.64
sqrt    63        90.5     0.00        0.00    3.17    1.59     4.76
sin     821       37.8     5.60        0.00    0.00    0.00     56.6
cos     791       40.2     0.00        0.00    0.00    0.00     59.8
tan     1152      35.2     0.868       0.00    0.00    0.00     64.0
asin    96        86.5     0.00        0.00    0.00    0.00     13.5
acos    96        88.5     0.00        0.00    0.00    0.00     11.5
atan    96        53.1     10.4        0.00    0.00    2.08     34.4
sinh    48        16.7     43.8        14.6    0.00    0.00     25.0
cosh    30        100      0.00        0.00    0.00    0.00     0.00
log     37        97.3     0.00        0.00    0.00    0.00     2.70
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        81.8     0.00        18.2    0.00    0.00     0.00
expm1   52        100      0.00        0.00    0.00    0.00     0.00
log1p   86        98.8     0.00        0.00    0.00    0.00     1.16

Table B.5: Accuracy of ported vForce functions, NVIDIA Quadro FX 4500.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
div     345       55.1     9.27        5.22    15.9    9.86     4.64
sqrt    63        93.7     0.00        0.00    0.00    1.59     4.76
sin     821       37.8     5.60        0.00    0.00    0.00     56.6
cos     791       40.2     0.00        0.00    0.00    0.00     59.8
tan     1152      3.47     11.3        0.00    0.00    0.694    84.5
asin    96        86.5     0.00        0.00    0.00    0.00     13.5
acos    96        88.5     0.00        0.00    0.00    0.00     11.5
atan    96        53.1     10.4        0.00    0.00    2.08     34.4
sinh    48        16.7     43.8        14.6    0.00    0.00     25.0
cosh    30        100      0.00        0.00    0.00    0.00     0.00
tanh    36        36.1     58.3        0.00    0.00    2.78     2.78
asinh   34        14.7     61.8        0.00    0.00    0.00     23.5
acosh   20        5.00     5.00        0.00    90.0    0.00     0.00
atanh   48        97.9     0.00        2.08    0.00    0.00     0.00
log     37        97.3     0.00        0.00    0.00    0.00     2.70
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        81.8     0.00        18.2    0.00    0.00     0.00
expm1   52        100      0.00        0.00    0.00    0.00     0.00
log1p   86        98.8     0.00        0.00    0.00    0.00     1.16

Table B.6: Accuracy of ported vForce functions, NVIDIA 7300GT.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
sqrt    63        22.2     1.59        0.00    50.79   1.59     23.8
sin     821       26.9     5.60        1.95    0.00    0.244    65.3
cos     791       28.4     1.14        0.00    0.00    0.00     70.4
tan     1152      2.69     1.91        1.30    0.00    0.69     93.4
asin    96        0.00     10.4        0.00    59.4    0.00     30.2
acos    96        0.00     0.00        0.00    59.4    0.00     40.6
atan    96        2.08     10.4        0.00    10.4    2.08     75.0
log     37        13.5     0.00        5.40    56.8    0.00     24.3
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        18.2     0.00        18.2    40.9    0.00     22.7
exp2    28        35.7     7.14        0.00    57.1    0.00     0.00

Table B.7: Accuracy of built-in GLSL functions, ATi x1900XTX.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
div     345       53.0     8.70        8.70    14.2    9.86     5.51
sqrt    63        73.0     0.00        0.00    25.4    1.59     0.00
sin     821       30.5     3.65        1.95    0.00    0.243    63.7
cos     791       33.6     0.00        0.00    0.00    0.00     66.4
tan     1152      20.4     0.868       0.00    0.00    0.694    78.0
asin    96        74.0     10.4        0.00    0.00    2.08     13.5
acos    96        88.5     0.00        0.00    0.00    0.00     11.5
atan    96        55.2     10.4        0.00    10.4    2.08     21.9
sinh    48        16.7     43.8        14.6    0.00    0.00     25.0
cosh    30        96.7     0.00        0.00    3.33    0.00     0.00
log     37        67.6     0.00        0.00    21.6    0.00     10.8
exp     28        92.8     0.00        0.00    7.14    0.00     0.00
log2    44        68.2     0.00        0.00    31.8    0.00     0.00
expm1   52        57.69    40.38       0.00    0.00    1.92     0.00
log1p   86        51.2     45.3        0.00    1.16    1.16     1.16

Table B.8: Accuracy of ported vForce functions, ATi x1900XTX.

Op      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
+       267       79.8     16.5        0.749   0.00    0.374    2.62
−       267       79.8     16.5        0.749   0.00    0.374    2.62
∗       311       60.8     22.2        0.00    5.14    8.36     3.54
/       345       56.2     11.0        5.51    5.22    11.0     11.0
<       400       77.0     N/A         N/A     N/A     N/A      N/A
≤       400       96.0     N/A         N/A     N/A     N/A      N/A
>       400       77.0     N/A         N/A     N/A     N/A      N/A
≥       400       96.0     N/A         N/A     N/A     N/A      N/A
==      400       92.5     N/A         N/A     N/A     N/A      N/A
!=      400       92.5     N/A         N/A     N/A     N/A      N/A

Table B.9: Accuracy of basic operators, ATi x1900XTX.

Op      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
+       267       79.8     16.5        0.749   0.00    0.375    2.62
−       267       79.8     16.5        0.749   0.00    0.375    2.62
∗       311       66.2     19.6        0.00    3.86    8.36     1.93
/       345       62.3     7.25        5.22    6.67    9.86     8.70
<       400       100      N/A         N/A     N/A     N/A      N/A
≤       400       100      N/A         N/A     N/A     N/A      N/A
>       400       100      N/A         N/A     N/A     N/A      N/A
≥       400       100      N/A         N/A     N/A     N/A      N/A
==      400       100      N/A         N/A     N/A     N/A      N/A
!=      400       100      N/A         N/A     N/A     N/A      N/A

Table B.10: Accuracy of basic operators, NVIDIA 7300GT.

Op      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
+       267       79.8     16.5        0.749   0.00    0.374    2.62
−       267       79.8     16.5        0.749   0.00    0.375    2.62
∗       311       66.2     19.6        0.00    3.86    8.36     1.93
/       345       62.3     7.25        5.22    6.67    9.86     8.70
<       400       100      N/A         N/A     N/A     N/A      N/A
≤       400       100      N/A         N/A     N/A     N/A      N/A
>       400       100      N/A         N/A     N/A     N/A      N/A
≥       400       100      N/A         N/A     N/A     N/A      N/A
==      400       100      N/A         N/A     N/A     N/A      N/A
!=      400       100      N/A         N/A     N/A     N/A      N/A

Table B.11: Accuracy of basic operators, NVIDIA Quadro FX 4500.

Op      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
+       267       79.8     16.5        0.749   0.00    0.375    2.62
−       267       79.8     16.5        0.749   0.00    0.375    2.62
∗       311       60.8     22.2        0.00    3.86    8.36     3.54
/       345       56.2     11.0        5.22    6.67    9.86     11.0
<       400       77.0     N/A         N/A     N/A     N/A      N/A
≤       400       96.0     N/A         N/A     N/A     N/A      N/A
>       400       77.0     N/A         N/A     N/A     N/A      N/A
≥       400       96.0     N/A         N/A     N/A     N/A      N/A
==      400       92.5     N/A         N/A     N/A     N/A      N/A
!=      400       92.5     N/A         N/A     N/A     N/A      N/A

Table B.12: Accuracy of basic operators, ATi x1600.