Investigating the Potential of a GPU-based Math Library

by

Daniel Riley Fay

B.S., University of Illinois, 2004

A thesis submitted to the

Faculty of the Graduate School of the

University of Colorado in partial fulfillment

of the requirements for the degree of

Master of Science

Department of Electrical and Computer Engineering

2007

This thesis entitled:
Investigating the Potential of a GPU-based Math Library
written by Daniel Riley Fay
has been approved for the Department of Electrical and Computer Engineering

Professor Daniel A. Connors

Professor Manish Vachharajani

Professor Vince Heuring

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Riley Fay, Daniel (M.S., Computer Engineering)

Investigating the Potential of a GPU-based Math Library

Thesis directed by Professor Daniel A. Connors

In the last few years, the Graphics Processing Unit (GPU) has evolved from being a graphics-specific integrated circuit into a high performance programmable vector/stream processor. Contemporary GPUs provide a compelling platform for running compute-intensive applications. In addition to tens of gigabytes per second of memory bandwidth, they also possess vast computation resources capable of achieving hundreds of giga-FLOPs of single precision floating-point computation power. Moreover, the consumer-oriented focus of contemporary GPUs means that even the highest end graphics cards cost well under a thousand dollars. Developments on the software side have also made GPU systems far more accessible for general-purpose use: new programming languages reduce the need for GPU programmers to understand esoteric graphics concepts, and high speed interconnect technologies improve CPU-GPU communication.

Developing a high performance math library will help programmers make full use of increasingly powerful GPUs, as well as enabling the study of GPUs for general purpose applications. Math functions are a critical part of many high performance applications, and their use consumes a large percentage of many programs' CPU time.

In order for a GPU-based math library to be useful, it must provide accurate results. Similarly, it must show a performance and/or power consumption advantage over a CPU-based math library.

This thesis investigates the potential of porting Apple, Inc.'s vForce math library to four different GPUs found in current Apple computers. Using this hardware, the thesis examines whether current GPU technology can be gainfully employed to run a high performance math library on the GPU. The thesis evaluates the potential of a GPU-based math library using three metrics: accuracy, performance, and power/energy consumption. These three metrics are used to study the GPU-ported math library as it runs on the four GPUs. Comparisons are also made among the four different GPUs tested and against the CPU version of vForce.

Contents

Chapter

1 Introduction

2 General Purpose GPU Computing (GPGPU)
   2.1 Characteristics of GPGPU Programs
   2.2 Where GPGPU Computing is Today
   2.3 GPGPU Numeric Considerations

3 Background on GPUs
   3.1 From Framebuffer to Fast Vector Machine: A Brief History of GPUs
       3.1.1 A History of the Hardware
       3.1.2 A History of the Software
   3.2 Assessment of Current GPU Technology
   3.3 The Future of GPUs

4 Experimental Setup
   4.1 Writing a GPGPU Math Library
       4.1.1 Algorithms Used
       4.1.2 The OpenGL Shading Language (GLSL)
       4.1.3 Porting the vForce Functions to Shader Programs
   4.2 Test Setup
   4.3 Testing a GPGPU Math Library

5 Experimental Results
   5.1 Accuracy Results
   5.2 Performance Results
   5.3 Power/Energy Results
   5.4 Bandwidth Results

6 Summary and Conclusion
   6.1 Summary of Results
   6.2 Current Suitability of a GPGPU Math Library
   6.3 Improving GPU Accuracy
   6.4 Future Suitability of a GPGPU-based Math Library

Bibliography

Appendix

A Cephes Math Library Code for sinf

B Detailed Accuracy Results

Tables

Table

1.1 Different GPU interconnections.
4.1 System configurations tested.
4.2 Capabilities of the CPUs and GPUs.
B.1 Accuracy of built-in GLSL functions, NVIDIA 7300GT.
B.2 Accuracy of built-in GLSL functions, ATi x1600.
B.3 Accuracy of built-in GLSL functions, NVIDIA Quadro FX 4500.
B.4 Accuracy of ported vForce functions, ATi x1600.
B.5 Accuracy of ported vForce functions, NVIDIA Quadro FX 4500.
B.6 Accuracy of ported vForce functions, NVIDIA 7300GT.
B.7 Accuracy of built-in GLSL functions, ATi x1900XTX.
B.8 Accuracy of ported vForce functions, ATi x1900XTX.
B.9 Accuracy of basic operators, ATi x1900XTX.
B.10 Accuracy of basic operators, NVIDIA 7300GT.
B.11 Accuracy of basic operators, NVIDIA Quadro FX 4500.
B.12 Accuracy of basic operators, ATi x1600.

Figures

Figure

1.1 Comparison of peak compute capacity and peak memory bandwidth of the NVIDIA and ATi GPUs along with Intel CPUs and the CPUs used by Apple.
2.1 Data-flow for a GPGPU program.
2.2 The IEEE 754 floating point format.
2.3 The floating point number line.
2.4 Differences relative to libm's sinf caused by changing the rounding mode of Cephes' argument reduction.
2.5 Argument reduction code of the Cephes sinf.
2.6 Comparison of the Control Flow Graphs (CFGs) of the Apple Libm implementation of sinf along with the Cephes Math Library implementation of sinf.
3.1 An example of a GPU-containing computer system.
3.2 The GPU pipeline.
4.1 The intrinsic-to-GLSL translation system.
4.2 Diagram of the NVIDIA G7x pixel shader core.
4.3 Diagram of the fragment shader of the NVIDIA G7x.
4.4 Diagram of dual-issue and co-issue.
4.5 Diagram of the ATi R5xx pixel shader core.
4.6 Diagram of the fragment shader of the ATi R5xx.
4.7 The GPU test harness.
4.8 The Apple OpenGL software stack.
5.1 The accuracy differences between the tested GPUs for the basic operations, the built-in GLSL functions, and the ported vForce functions.
5.2 Comparison of the accuracy between the different GPUs for the basic operations, the built-in GLSL functions, and the ported vForce functions.
5.3 Accuracy improvement of the ported vForce functions vs. the built-in GLSL functions.
5.4 A performance comparison of the built-in GLSL functions versus the ported vForce functions.
5.5 A performance comparison of different texture sizes for the NVIDIA GPUs.
5.6 A performance comparison of different texture sizes for the ATi GPUs.
5.7 A performance comparison of the ported vForce functions.
5.8 A performance comparison of vForce running on different processors.
5.9 Speedups gained from using the vec4 data type as opposed to the float data type on all of the ported vForce functions.
5.10 Speedups gained from using the vec4 data type as opposed to the float data type on the restricted subset of the functions.
5.11 A comparison of the different GPUs using the float data type.
5.12 A comparison of the different GPUs using the vec4 data type.
5.13 A comparison of the performance of the ported vForce functions using an all-normals dataset versus a mixed dataset.
5.14 A raw bandwidth comparison.
5.15 A comparison of the system-level power consumption of different GPU-based math functions vs. vForce.
5.16 A comparison of the performance of different built-in GLSL functions vs. vForce.
5.17 A comparison of the energy consumption of different GPU-based math functions vs. vForce.
5.18 The CPU-GPU transfer bandwidth for differing PCI-E widths.

Chapter 1

Introduction

Over the last several years, GPUs have increased in both computation capacity and available memory bandwidth at a rate far faster than any mainstream CPU. The two graphs shown in Figure 1.1 show the progression in peak computation capacity, in billions of floating point operations per second (GFLOP/s), along with the progression in peak memory bandwidth, in gigabytes per second (GB/s), from January 2000 through April 2007. For the GPUs, a zero entry for memory bandwidth and compute capacity represents the time period when the GPU vendors made only non-programmable GPUs. For the CPUs, Figure 1.1 shows the CPUs' peak floating point computation capacities per socket.

In the last few years, GPUs have been better able than CPUs to turn increased transistor budgets into higher performance. Such performance scaling, particularly the rapid increase in computation throughput, is due to GPUs' relative silicon efficiency: GPUs do not have the complicated branch predictors or sophisticated logic for extracting instruction level parallelism (ILP) out of programs that modern CPUs do. Similarly, GPU memory bandwidth has also greatly increased due to GPUs' special-purpose nature. On a video board, the GPU's memory can clock much higher because it is not part of an expandable, multi-drop stub bus that must accommodate multiple gigabytes of memory.

Another factor making the GPU more useful as a high performance adjunct to the main CPU is the rapid increase in the bandwidth of the CPU-GPU interconnection.

[Figure 1.1 contains two graphs: "GPUs vs. CPUs in Compute Capacity", plotting peak performance (GFLOP/s) from 2000 through 2007, and "GPUs vs. CPUs in Memory Bandwidth", plotting peak memory bandwidth (GB/s) over the same years, each with series for ATi, NVIDIA, Apple, and Intel.]

Figure 1.1: Comparison of peak compute capacity and peak memory bandwidth of the NVIDIA and ATi GPUs along with Intel CPUs and the CPUs used by Apple.

Table 1.1 shows the maximum bandwidth available to link the CPU and GPU together.

While the GPU interconnection started out as a bus shared with other peripherals in the system, the advent of the Accelerated Graphics Port (AGP) brought increasingly high bandwidth dedicated connections to GPUs. Current GPUs enjoy up to a total of 8.0GB/s of bandwidth courtesy of a 16-lane PCI Express (PCI-E) channel running at 2.5 Giga-Transactions per second (GT/s). The upcoming PCI-E 2.0 standard will double this rate to 5.0GT/s for a total of 16.0 GB/s of bandwidth.

While GPUs show great promise as high performance, highly parallel compute engines, they suffer from two serious pitfalls. First, while recent GPUs support IEEE 754 format single precision floating point arithmetic, the accuracy of their arithmetic is not fully IEEE 754 compliant. Second, while GPUs have vast quantities of computational capacity and memory bandwidth, their stream-based programming model makes it difficult to realize the GPUs' full computational potential.

    Bus Standard   Bandwidth
    VLB            132 MB/s
    PCI32          132 MB/s (264 MB/s†)
    PCI64          264 MB/s (528 MB/s†)
    AGP 1x         264 MB/s
    AGP 2x         528 MB/s
    AGP 4x         1056 MB/s
    AGP 8x         2112 MB/s
    PCI-E          4.0 GB/s (2.0 GB/s††)
    PCI-E          8.0 GB/s (4.0 GB/s††)
    PCI-E 2.0      16.0 GB/s (8.0 GB/s††)

    † 66 MHz PCI bus. †† 8-lane configuration.

Table 1.1: Different GPU interconnections.

To study the potential of GPUs as a general purpose, high performance compute device, this thesis examines the accuracy, performance, and power/energy consumption of Apple’s high performance math library, vForce, when it is ported over to the GPU.

This thesis studies the potential of a math library on the GPU for several reasons.

First, math functions are an important part of many applications. They are also fairly complicated, so their performance often strongly influences the overall performance of an application. Math functions are also very sensitive to incorrect arithmetic results, making them an excellent vehicle for studying the effects of non-IEEE 754-compliant arithmetic on producing correct results. Porting the vForce high performance math library to the GPU allows one to compare the GPU against an already highly optimized math library that is tuned for high performance on both PowerPC's Altivec vector extensions as well as Intel's SSE extensions. Finally, studying a math library allows for an understanding of how one might use a faster library function to speed up an existing application without having to port the entire program over to the GPU.

To study the potential of a GPU-based math library, this thesis did the following work:

(1) Developed a semi-automatic system for converting the source code of Apple's vForce math library to code for the GPU.

(2) Wrote a test harness and implemented a testing methodology for examining the accuracy, performance, and power/energy consumption of the GPU math functions.

(3) Studied the GPU-ported vForce functions on a variety of different hardware platforms containing different GPUs from both ATi and NVIDIA.

(4) Evaluated the potential of a GPU-based math library.

The remainder of this thesis is organized as follows: Chapter 2 discusses the different aspects of General Purpose GPU computing (GPGPU), including the programming model and the numeric issues pertinent to GPGPU programming. Chapter 3 provides a brief history of GPUs, as well as background information on the entire GPU system, including both the hardware and the software. Chapter 4 discusses how the GPU-based math library was developed, the test platforms, and the test harness used to evaluate the GPU-based math functions. Chapter 5 presents and discusses the test results. Chapter 6 summarizes the results, discusses the overall suitability of a GPU-based math library now and in the near future, and discusses some possible techniques for improving GPU-based math libraries. Finally, Appendix A provides the Cephes Math Library code for the sinf function studied in Chapter 2, and Appendix B provides detailed accuracy results to supplement the average accuracy results presented in Chapter 5.

Chapter 2

General Purpose GPU Computing (GPGPU)

The work done for this thesis is part of a field of study known as General Purpose GPU computing (GPGPU). The increased programmability of GPUs has opened up the potential to execute non-graphics computation on the GPU. Note that the "General Purpose" part of GPGPU is somewhat misleading: GPGPU's goal is not to replace conventional microprocessors with GPUs, but to supplement the CPU with another powerful compute engine.

This chapter discusses the various aspects of GPGPU: the structure of GPGPU programs, the programming of GPGPU applications, the kinds of applications most appropriate for running on the GPU, current GPGPU research, and the serious numerical accuracy issues endemic to GPGPU programming.

2.1 Characteristics of GPGPU Programs

Most GPGPU applications are programmed using a technique called stream programming. The stream programming paradigm's goal is to allow the programmer to explicitly describe the data-level parallelism of a program. Stream programming consists of two core components: streams, which are large arrays of data with no dependencies between any of the elements; and kernels, which are a collection of operations specified for each data element within the stream. Stream programs consist of one or more input streams and an output stream, with everything else chained together using kernels.

Academic examples of stream computing research include the Stanford Merrimac supercomputer [21], whose work led to BrookGPU [18] (a GPGPU-oriented stream language), the Imagine stream processor [36], and StreamIT [62].

By allowing the programmer to explicitly express the data-level parallelism inherent in many algorithms, stream languages allow for large amounts of parallelism to be extracted at compile time. Moreover, since streams are architecture independent, they can target any kind of system, ranging from single/multi-core CPU systems to GPUs.

In stream programming, kernels perform four different classes of operations on streams (a C sketch of the first two classes follows the list):

(1) Map. Every element of the input stream(s) has the same operation performed on it.

(2) Reduction. The output stream has only a fraction as many elements as the input stream.

(3) Filtering. A form of reduction, filtering is where certain items are removed from the stream according to certain criteria.

(4) Scatter. Results are placed in different spots throughout the output stream.
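To make the kernel classes concrete, the following minimal C sketch expresses the first two (map and reduction) under an assumed, simplified interface; the function names and signatures are illustrative only and come from neither vForce nor any stream language:

    #include <stddef.h>

    /* A map kernel: transforms one element, independent of all others. */
    float square(float x) { return x * x; }

    /* Apply a map kernel over an input stream. */
    void stream_map(const float *in, float *out, size_t n,
                    float (*kernel)(float)) {
        for (size_t i = 0; i < n; i++)
            out[i] = kernel(in[i]);   /* no inter-element dependencies */
    }

    /* A reduction: the output has fewer elements than the input
       (here, a single running sum). */
    float stream_reduce_sum(const float *in, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += in[i];
        return acc;
    }

On a GPU, the body of stream_map corresponds to one fragment shader invocation per element, while a reduction such as stream_reduce_sum is typically realized as multiple rendering passes.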

Figure 2.1 shows the general data-flow for a GPGPU program programmed using OpenGL. The CPU-based Main Program coordinates all of the work. It runs the CPU-based work of the program, and manages the shader program(s). It controls the GPU through OpenGL. When the Main Program needs the GPU to perform calculations, it loads the shader program into OpenGL. It then instructs OpenGL to compile and link the GLSL program. By issuing commands to the OpenGL State Machine, the GPGPU program is able to instruct OpenGL to send commands to the GPU via the GPU's driver. The OpenGL layer also deals with data transfer: it transfers data from the program memory to the GPU's local memory as needed via DMA commands issued by the GPU driver.

[Figure 2.1 shows three layers: the Main Program with its program memory (inputs and other data); the OpenGL/GPU driver layer with its OpenGL interface, state machine, and JIT shader compiler; and the GPU with its memory (textures, results, and the framebuffer), connected by OpenGL commands, GPU commands, and DMA transfers.]

Figure 2.1: Data-flow for a GPGPU program.

Note that not all applications are suitable for running on the GPU. In general, successful GPGPU applications possess three key characteristics:

(1) Are highly data-parallel. Each data element in an input stream must not depend on the results of computations on other elements in that input data stream.

(2) Have simple control flow. Branches, particularly data-dependent ones, greatly reduce the efficiency of the GPU.

(3) Possess a high arithmetic density. Each data element should have many computations performed on it. Applications that do not enjoy a high arithmetic density will become limited by either the GPU's memory bandwidth or by the bandwidth of the GPU-CPU interconnection.

Examples of successful GPGPU applications include ray tracing [51], fluid dynamics simulations [56] [67], particle physics simulations [39], databases [28], sorting [29], protein folding simulations [41], Digital Signal Processing (DSP) [47], image processing [25], oil exploration [32], and stock options/derivatives pricing calculations [33].

2.2 Where GPGPU Computing is Today

Currently, GPGPU is in a state of transition from using graphics-programming-oriented shader languages to using general purpose streaming languages. This change to a more general purpose programming paradigm enables lower-level access to the GPU's features.

Currently, there exist three very similar shader languages: Cg ("C for Graphics") [24], GLSL ("OpenGL Shading Language") [54], and HLSL ("High Level Shading Language") [5]. Cg, developed by NVIDIA, was the first of the three. It can target either Direct3D or OpenGL. GLSL and HLSL derive heavily from Cg; however, they are both API specific: GLSL only works with OpenGL, while HLSL only works with Direct3D. Cg, HLSL, and GLSL can be used to code either vertex shader or pixel shader programs. Their syntax strongly resembles C, and they are designed to operate on graphics pixels as opposed to generic stream data types. Shader programs written in these three languages represent data in three ways: as constants, as variables passed between the vertex and the fragment shaders, and as textures. These shader languages allow the shader program to read any position within the texture (in GPGPU, textures are the equivalent of arrays), which enables gather operations. Current shader languages, however, do not support scatter operations, which allow a function to write data to one or more arbitrary locations; the two access patterns are sketched below.
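The gather/scatter distinction is easy to state in C; this sketch uses hypothetical helper functions (not part of any shader API) to show the two access patterns:

    #include <stddef.h>

    /* Gather: read from arbitrary locations, as texture look ups allow. */
    void gather(const float *in, const size_t *idx, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = in[idx[i]];      /* arbitrary read  */
    }

    /* Scatter: write to arbitrary locations; unsupported in these shader
       languages because a fragment's output position is fixed. */
    void scatter(const float *in, const size_t *idx, float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[idx[i]] = in[i];      /* arbitrary write */
    }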

Besides the vendor-supported shader languages, there also currently exist several other third party environments for programming the GPU. A popular academic one is BrookGPU [18], which is based on the Brook stream programming language used by the Stanford Merrimac [21] supercomputer. BrookGPU functions as a run-time on top of OpenGL and Direct3D. A C++ target also exists for debugging purposes.

Sh [45], developed at the University of Waterloo, provides a GPGPU library where the programmer can program the GPU from within C++ programs. Sh is designed to be used for both general purpose GPU programming and GPU-based graphics programming. It is independent of the lower-level GPU APIs, and functions as a C++ library. Sh generates lower-level GPU code in a Just-In-Time (JIT) manner. Another academic language, known as PyGPU [40], uses the Python language to provide a higher level meta-programming language abstraction on top of the basic graphics programming/shader framework.

Microsoft has also developed its own GPGPU library, called Accelerator [61]. Accelerator is an imperative language that contains only data-parallel constructs. Other commercial streaming languages include RapidMind [52], which is based on Sh; PeakStream [9], which targets GPUs, multi-core architectures, and the CELL processor; and Reservoir Labs' R-Stream compiler [19], which takes stream programs as input and outputs a parallel C format suitable for the aforementioned architectures' compilers.

Recently, ATi and NVIDIA have been opening up access to their GPUs' low-level features by providing a new programming model that more resembles a data-parallel machine than a graphics processor. ATi provides this lower-level access through their Close To the Metal (CTM) interface [50]. ATi's CTM interface provides a virtual machine abstraction of the fragment shader hardware. CTM allows the GPGPU programmer to program the GPU in assembly language, and requires the programmer to deal with all GPU and system memory management.

NVIDIA’s Compute Unified Device Architecture (CUDA) [12], on the other hand, operates at a higher level than does CTM. Instead of exposing the programmer to the 10 low-level assembly instructions for the shaders, CUDA operates at the C language level.

CUDA programmers must be mindful of GPU memory, texture, and constant caches, and must also make good use of a small, tightly coupled memory that is shared by groups of shader units. It is also an inherently multi-threaded programming model.

Unlike CTM, CUDA is a compiled language that promises that programs written for it will work on future NVIDIA GPUs.

GPU vendors are not the only companies developing highly parallel compute architectures. Sony, Toshiba, and IBM, for example, developed the multi-core CELL processor [58] [59], which the companies use not only in supercomputers but also in the Sony PlayStation 3. CELL contains one conventional microprocessor along with seven stream processors. Intel is researching a GPU-like device code-named Larrabee [22]. Larrabee consists of 80 simple microprocessor cores that can achieve more than a teraflop of peak floating point performance. The ClearSpeed CSX architecture [11] is another general purpose device intended for compute-intensive applications. The Ageia PhysX processor [15], while designed primarily to accelerate physics computation in games, provides ample parallel computation resources using many simple microprocessors. Sun Microsystems' upcoming Rock [53] microprocessor will also provide high floating point computational power by implementing many simple microprocessor cores within a single chip, in a manner similar to the company's UltraSparc T1 microprocessor [14].

2.3 GPGPU Numeric Considerations

The IEEE 754 floating point standard [34] is the most widely used standard for floating point arithmetic on computers. The IEEE 754 standard enabled floating point application portability by guaranteeing consistent arithmetic results on any compliant machine. The IEEE 754 standard calls for several precision formats: single precision, single extended precision, double precision, and double extended precision. The most widely used of these precision formats are 32-bit single precision and 64-bit double precision.

[Figure 2.2 shows the three fields of the format, from most to least significant: sign, exponent, and mantissa.]

Figure 2.2: The IEEE 754 floating point format.

[Figure 2.3 shows the floating point number line: -Inf, the negative normal numbers, the denormal numbers clustered around -0 and +0, the positive normal numbers, and +Inf.]

Figure 2.3: The floating point number line.

Figure 2.2 shows the IEEE 754 single precision floating point format. The single precision format consists of three components:

(1) Sign bit. If the sign bit is one, the number is negative.

(2) Exponent. Eight bits in length; IEEE 754 encodes the exponent in excess-127 so that negative exponents are represented as positive integers. Keeping the exponent positive helps to simplify the hardware used for doing comparisons.

(3) Mantissa. Sometimes called the fractional part, the mantissa is encoded by IEEE 754 in 23 bits but actually contains 24 bits of data: the most significant bit is an implicit 1 except for denormals.
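As an illustration, the three fields can be separated with a few bit operations in C (a minimal sketch; the helper function is ours, not part of any library):

    #include <stdint.h>
    #include <string.h>

    /* Split a single precision float into its three fields. */
    void decompose(float x, unsigned *sign, unsigned *exponent,
                   unsigned *mantissa) {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);    /* bit vector representation */
        *sign     = bits >> 31;            /* 1 bit                     */
        *exponent = (bits >> 23) & 0xFF;   /* 8 bits, excess-127 biased */
        *mantissa = bits & 0x7FFFFF;       /* 23 stored mantissa bits   */
    }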

Figure 2.3 shows the number line of the IEEE 754 floating point number system. In the IEEE 754 number system, the "density", or the smallest incremental difference between the closest representable numbers, is highest around zero. In addition to representing the real numbers, the IEEE 754 number system provides closure by adding several special values:

(1) +/-Infinity. Represents numbers whose magnitudes are too large to be represented as normals.

(2) Not a Number (NaN). There are two categories of NaNs: quiet (QNaN) and signaling (SNaN). An SNaN, when used as an input to an operation, triggers an exception; a quiet NaN does not. Moreover, NaNs are sticky: any arithmetic operation on a NaN input produces a NaN result. Comparisons involving a NaN always fail; this property can be used to detect NaNs by performing a reflexive equality comparison (x == x) on a variable. If the comparison evaluates to false, then the variable holds a NaN. The lower bits of a NaN are undefined and contain the NaN's "payload": a payload can be used as a debugging tool to trace where the NaN originated.

(3) Denormals. A denormal number is a number whose exponent is lower than the lowest normal exponent. Denormals exist to provide gradual precision loss as the number approaches zero.

(4) +/-Zero. IEEE 754 provides two representations for zero that compare equally to each other: a positive and a negative zero. The negative zero exists to represent results that are less than zero but too small in magnitude to be represented by even the smallest-magnitude negative denormal.
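The reflexive comparison described in item (2) can be written directly in C; this sketch is equivalent to the standard isnan() test (and, as discussed in Chapter 4, a compiler unaware of the idiom may optimize the self-comparison away):

    /* x != x is true exactly when x is a NaN, because comparisons
       involving a NaN always fail. */
    int is_nan_f(float x) {
        return x != x;
    }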

A finite binary representation can never exactly represent every real number. Inexact results require the last bits to be rounded to some value. The rounding mode used determines the actual value to which the number is rounded. The IEEE 754 floating point standard supports four different rounding modes:

(1) Round to Nearest. Round the number to the nearest value. If the number falls midway between the two numbers, round it to the nearest even value.

(2) Round toward 0. Round the number towards zero.

(3) Round toward +Infinity. Round the result towards positive infinity.

(4) Round toward -Infinity. Round the result towards negative infinity.

It is also desirable to notify an application if the floating point arithmetic result may adversely affect the overall result of the program. The IEEE 754 standard provides such functionality through exceptions. Exceptions, when thrown by the hardware, cause a special bit in the hardware’s exception register to be set, and can interrupt the program and execute an exception handler. IEEE 754-compliant hardware must support five different types of exceptions:

(1) Invalid Operation. Thrown if an operand is invalid for the operation performed.

(2) Division by Zero. Thrown if the divisor is zero and the dividend is a finite nonzero number.

(3) Overflow. Thrown if the result exceeds the largest-magnitude representable normal number.

(4) Underflow. Thrown if the result is less than the smallest-magnitude normal number.

(5) Inexact. Thrown if the rounded result of an operation is not exact, or if there is an overflow without an overflow trap.

Numeric error can be quantified using two methods. The first method is absolute difference, calculated by subtracting the correct floating point result from the computed floating point result. The other way uses a measurement known as unit in last place, or ulp for short. To calculate an ulp error, one first converts the floating point numbers to their bit vector representations and then subtracts the correct and computed results as unsigned integers. Measuring ulp errors is a good way to study small, rounding-error-related problems, since a one ulp rounding error can lead to different absolute errors depending on where on the number line the two numbers lie.

Figure 2.4: Differences relative to libm's sinf caused by changing the rounding mode of Cephes' argument reduction.
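A minimal C sketch of the ulp measurement just described (assuming both values are finite and share the same sign):

    #include <stdint.h>
    #include <string.h>

    /* Reinterpret both floats as unsigned integers and subtract. */
    uint32_t ulp_error(float computed, float correct) {
        uint32_t a, b;
        memcpy(&a, &computed, sizeof a);
        memcpy(&b, &correct, sizeof b);
        return a > b ? a - b : b - a;
    }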

Many high performance math functions compute their results using multiple polynomials. The math functions use different polynomial functions along different parts of the floating point number line. As a result, it is essential that the functions that decide which polynomial function should be used for a given number work correctly.

Figure 2.4 shows the absolute error between the Cephes Math Library's [48] sinf and Linux's built-in libm sinf. One of the most important parts of the algorithm is the argument reduction, where the operand is mapped to different parts of the region [0, 2π].

     1  j = FOPI * x; /* integer part of x/(PI/4) */
     2  y = j;
     3  /* map zeros to origin */
     4  if( j & 1 )
     5  {
     6      j += 1;
     7      y += 1.0;
     8  }
     9  j &= 7; /* octant modulo 360 degrees */
    10  /* reflect in x axis */
    11  if( j > 3)
    12  {
    13      sign = -sign;
    14      j -= 4;
    15  }

Figure 2.5: Argument reduction code of the Cephes sinf.

One of the most error-prone parts of the argument reduction is Line 1, where the nearest multiple of π/4 is determined for the input operand. Each of the graphs in Figure 2.4 shows the discrepancies that occur when the convert-to-integer rounding mode within the argument reduction (the first line shown in Figure 2.5) is changed to:

(1) Unmodified. No changes to the rounding mode were made.

(2) Roundf. The rounding mode was changed to round-to-nearest.

(3) Truncf. The rounding mode was changed to truncate (round-to-zero).

(4) Ceilf. The rounding mode was changed to round-to-positive-infinity.

(5) Floorf. The rounding mode was changed to round-to-negative-infinity.
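In C, the five variants might be produced as follows; this is a hypothetical harness sketch, since the unmodified Cephes code simply relies on C's default truncating float-to-int conversion:

    #include <math.h>

    /* Compute j = integer part of x/(pi/4) under each rounding variant. */
    int reduce_octant(float x, int variant) {
        const float FOPI = 1.27323954f;     /* 4/pi, as in Cephes */
        float t = FOPI * x;
        switch (variant) {
        case 0:  return (int)t;             /* unmodified: truncation   */
        case 1:  return (int)roundf(t);     /* round-to-nearest         */
        case 2:  return (int)truncf(t);     /* truncate (round-to-zero) */
        case 3:  return (int)ceilf(t);      /* toward positive infinity */
        default: return (int)floorf(t);     /* toward negative infinity */
        }
    }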

Note that doing nothing more than changing the rounding mode of a single operation in the sinf function leads to vastly different results. The high sensitivity to accuracy issues such as the choice of rounding modes in the Cephes sinf will have profound implications for the accuracy of GPU-ported math library functions.

[Figure 2.6 shows the two CFGs side by side: (a) sinf from Apple's Libm; (b) sinf from Cephes.]

Figure 2.6: Comparison of the Control Flow Graphs (CFGs) of the Apple Libm implementation of sinf along with the Cephes Math Library implementation of sinf.

Another important aspect of algorithm design that must be taken into account when developing a high performance math library is control flow. Irregular control flow is detrimental to performance, as valuable compute cycles are wasted incorrectly predicting branches. Moreover, irregular control flow hurts the efficiency of many SIMD architectures, as these architectures must typically execute both control paths and select the correct results using predication. Figure 2.6 shows the control flow graph (CFG) of the sinf function used in the PowerPC version of OSX's libm [16] as well as the CFG for the Cephes Math Library version of sinf. Note that the Cephes version of sinf has much simpler control flow: its CFG is narrower and has considerably fewer back edges. A CFG with a lower breadth is good for vectorization because it means that there are fewer control paths that need to be traversed when the algorithm is predicated.

Chapter 3

Background on GPUs

Hardware support for GPGPU required GPUs to transition from being fixed-function ASICs to programmable devices that can run sophisticated programs. Not only did the GPU hardware itself have to become sufficiently flexible and programmable to support GPGPU computing, but the rest of the graphics subsystem, particularly the graphics programming APIs and the GPU programming languages, needed to become general purpose enough to support GPGPU computing as well. While GPU systems have become vastly more flexible and programmable, extracting optimal performance out of GPGPU programs still requires a thorough knowledge of the inner workings of the GPU.

This chapter provides a brief history of the GPU hardware and its associated software stack. Additionally, this chapter discusses some of the leading GPU hardware architectures and the latest GPU programming APIs and programming languages. The chapter concludes with a discussion of where GPU technology is heading in the context of GPGPU.

3.1 From Framebuffer to Fast Vector Machine: A Brief History of GPUs

Starting out as just an expansion bus interface and a memory interface, the Graphics Processing Units (GPUs) in personal computers have advanced over the years, slowly acquiring functionality originally done by the computer's main processor. Over time, increasing silicon budgets and demand for more advanced graphics pushed GPU designers to add a variety of new features to accelerate graphics processing.

3.1.1 A History of the Hardware

GPUs initially provided just simple two-dimensional graphics support; designers ultimately augmented them with the ability to accelerate three dimensional graphics rendering and to accelerate the processing and display of full-motion video. The earliest GPUs were little more than a simple memory interface that bridged the framebuffer (a piece of memory that held the image to be displayed) and the computer's peripheral expansion bus. These early GPUs did no graphics processing of their own: changing the displayed image required the CPU to micromanage the framebuffer by directly modifying every pixel. Over time, GPUs began to accelerate common two-dimensional graphics operations. Such acceleration first occurred in high-end graphics workstations: these machines had GPUs that employed dedicated CPUs or DSPs as graphics co-processors. Eventually, lower-cost mainstream GPUs gained fixed-functionality acceleration for a few commonly used graphics operations. These operations included bitblitting (special operations for combining multiple bitmaps), color fill operations, and draw operations on graphics primitives such as rectangles, circles, arcs, and lines.

Beginning in the early-to-mid 1990s, mainstream GPUs gained the ability to accelerate 3D graphics. The first GPUs supporting 3D acceleration supported simple texture mapping and filtering. As time wore on, progressively more of the 3D rendering process could be offloaded to the GPU for processing. Soon, GPUs could process multiple textures in a single pass, and could process geometry operations like transformation, clipping, and lighting, as well as gaining more elaborate shading capabilities. GPUs also added the ability to process motion video. The growing size of the GPUs' on-board memory allowed programs to buffer video frames ahead of time, facilitating smoother video playback. GPUs could accelerate compressed video playback by providing hardware support for video compression primitives like motion compensation and the inverse Discrete Cosine Transform (iDCT).

Newer GPUs gained a critical feature needed for GPGPU: programmability. Certain parts of recent GPUs can execute programs known as shader programs. The earliest programmable GPUs could only run very simple, highly limited shader programs consisting of only a few (less than about 30) instructions. Control flow support was limited to using the z-cull unit to conditionally nullify operations. Successive generations of GPUs allowed for more sophisticated shader programs. These more advanced GPUs supported longer shader programs, conditional branching, predication, and floating point operations. The floating point support started out with 16-bit floating point values (known as "half precision"); soon, GPUs could process 24-bit floating point numbers and then 32-bit single precision IEEE 754-format operands.

3.1.2 A History of the Software

The first 3D graphics APIs were designed for professional workstation use. Originally, there were two major interactive graphics APIs: an open API called PHIGS (Programmer's Hierarchical Interactive Graphics System) and Silicon Graphics, Inc.'s (SGI) proprietary IRIS GL API. In addition to interactive graphics, there was also Pixar's RenderMan [65] standard for offline rendering, which remains in use to this day.

Ultimately, PHIGS fell out of use when SGI created OpenGL [49] as an open-standard replacement for its proprietary IRIS GL API. Unlike IRIS GL, OpenGL did not require all of its features to be supported by the underlying graphics hardware, instead allowing unsupported features to be emulated in software. The OpenGL API provided the programmer with a fixed function pipeline known as the OpenGL state machine. Most of OpenGL's commands either configure parts of the pipeline or move graphics primitives between different parts of the pipeline. OpenGL 2.0 moved away from the fixed functionality pipeline somewhat by allowing the vertex and fragment parts of the pipeline to be programmed using the OpenGL Shading Language (GLSL) [54].

In the mid-1990s, low cost consumer 3D graphics cards emerged to accelerate 3D games. While at this point OpenGL had existed for several years, its feature set was too complex for the consumer-level GPUs of the time to fully support. As a result, these early GPUs used various proprietary APIs such as 3dfx's Glide and Rendition's Speedy3D and RRedline. Later on, some games, such as GLQuake and Quake II, employed MiniGL drivers, which implemented a subset of OpenGL.

Around this time, Microsoft released its own 3D API, Direct3D, as part of the DirectX collection of game programming APIs. Direct3D provided a GPU vendor-neutral (but Windows-only) API for programming 3D cards. Since Microsoft designed Direct3D for games, they limited its feature set to the 3D functionality that then-current GPUs supported in hardware. Over time, the standard advanced, with version 5.0 gaining wide acceptance by game developers, version 7.0 supporting hardware geometry processing, and version 8.0 supporting the first programmable shaders. Currently, Direct3D is at version 9 in Windows XP and version 10 in Windows Vista.

Today’s programmable GPUs have their vertex and fragment shader units pro- grammed using a language called a shader language. At first, shader languages were nothing more than glorified assembly languages that specified basic operations such as add, subtract, and multiply-add. As the shaders became more sophisticated, they even- tually moved to the current C-like shader languages like “C for Graphics” (Cg) [24], the OpenGL Shading Language (GLSL) [54], and the High Level Shading Language

(HLSL) [5], all of which are Just-In-Time (JIT) compiled by the GPU’s driver.

3.2 Assessment of Current GPU Technology

Current GPUs meet or exceed the specifications of Microsoft’s Shader Model 3.0.

Shader Model 3.0 has significantly increased the utility of the GPU's programmable shaders for GPGPU in several ways. First, it increases both the maximum static and dynamic instruction count of shader programs, allowing for longer and more complicated programs. Additionally, Shader Model 3.0 adds predication, looping, and dynamic branching. Predication and dynamic branching are essential to many programs: iterative looping, for example, requires dynamic branching support. Also added in Shader Model 3.0 is limited procedure call support. Other enhancements include additional temporary registers to help support longer shader programs, as well as support for arbitrary swizzling (swizzling is a limited form of vector permute where any of the pixel's four elements can be used as an input into an operation).

floating point numbers. They do not support double precision numbers, which are necessary in many important HPC and scientific computing applications. None of the

GPUs support handling denormal numbers. Current GPUs do not fully support any of

IEEE 754’s rounding modes. The floating point specials in the IEEE 754 standard are mostly, but not completely, supported.

Current GPUs also possess a 256-bit wide connection to their local memory.

They use memory technologies such as GDDR-3 and GDDR-4, which are somewhat similar to the memory used in personal computers but with modifications to allow for extremely high speed operation. GDDR-3 and GDDR-4 technologies allow for higher speed operation by using a point-to-point connection that minimizes capacitive bus loading, providing large internal prefetch buffers, and increasing memory bandwidth at the cost of higher latency.

Figure 3.1 shows the system-level layout of a GPU-containing computer system.

GPUs usually connect to the rest of the system using a 16-lane PCI Express connection

(PCI-E x16) that provides 4GB/s in either direction. Some systems support more than one PCI-E x16 connection to allow two or more GPU boards to be used. Note that, in many cases, the added bandwidth of PCI Express goes to waste, as the CPU’s front-side 23 DRAM

64+ GB/s GPU . Bs4.0 GB/s 4.0 GB/s 4.0 GB/s

12.8 GB/s Northbridge 4.0 GB/s DRAM DRAM 64+ GB/s GPU

10.6 GB/s 10.6 GB/s

CPU CPU

Figure 3.1: An example of a GPU-containing computer system..

bus does not have enough bandwidth to keep up.

Today, virtually all GPUs support either OpenGL, Direct3D, or both. Figure 3.2 shows the GPU pipeline, and shaded in gray are the parts of the GPU relevant for

GPGPU computing: the vertex shaders, the fragment shaders, and the z-cull stage. In a conventional graphics pipeline, a list of vertexes for the 3D figures enters the graphics pipeline through the vertex shaders. There, the vertex shaders perform the transform and lighting functions on the 3D primitives. In GPGPU computing, the vertex shaders function as multiple instruction multiple data (MIMD) processors that support scatter, but frequently not gather, as many GPUs do not allow the vertex shaders to read from textures.

The next stage of the GPU is the primitive assembly. Here the GPU assembles 24

Vertex Primitive Clip/Cull/Setup Shader(s) Assembly

Hidden Surface GPU Memory Rasterization Removal (Z−Cull)

Framebuffer Fragment Optimizations Shader(s)

Figure 3.2: The GPU pipeline.

the vertexes into graphics primitives such as lines and triangles. After assembly, the primitives go through the clip/cull/assembly stage, were the GPU clips the primitives if they intersect with the viewport, culls them if they are facing the wrong way relative to the viewport, and assembles them into a 3D model. Next in the pipeline is the rasterizer.

The rasterizer converts the three-dimensional primitives into a two-dimensional image for the fragment shaders.

Most contemporary GPUs have an early z-cull unit, which removes primitives that cannot be seen (this process is sometimes known as hidden surface removal) to reduce the computational and memory bandwidth requirements of the rasterizer. This z-cull unit can be occasionally used for GPGPU as a way to efficiently squash dynamically undesired computation before it reaches the fragment shader stage. 25

At the fragment shader stage, the GPU’s programmable fragment shaders shade the rasterized image. Of the three parts of the GPU pipeline, the fragment shader stage is the most useful for GPGPU: GPUs typically have significantly more fragment shaders than vertex shaders, and the fragment shaders are usually more useful for GPGPU computing because they always support gather operations through the ability to read textures. Finally, before being outputted to the framebuffer, the shaded fragments go through pixel optimizations such as the scissor test, alpha blending, and the depth test.

3.3 The Future of GPUs

Upcoming GPUs, such as the NVIDIA G80 [7] and the ATi R600 [10], will con- tinue to facilitate GPGPU programming. These new GPUs add scatter support to the fragment shaders, opening up new possibilities such as Folding@Home [41]. The shader units will continue to become more flexible: GPUs compliant with the Shader Model 4.0 specification [17] must support a unified shader model where the programming model for vertex and fragment shaders is identical, making it easier for programmers to fully utilize the GPU’s compute resources.

Future GPUs will also enjoy better numeric accuracy. All GPUs compliant with

Shader Model 4.0 will have stricter limits on operations’ inaccuracy, with add, subtract, and multiply being limited to 1 ulp of error and divide and square root being limited to 2 ulps of error. Additionally, at least some of the future GPUs will sport hardware support for double precision arithmetic, as NVIDIA promises double precision support in its GPUs ”by the end of 2007” [13]. Shader Model 4.0-compliant GPUs will also have to support true integer and bitwise arithmetic: currently, shader languages’ integer operations compile to floating point operations, and do not support bitwise arithmetic.

In addition to the increased programmability and improved numeric accuracy, future GPU systems will better integrate the GPU into the rest of the system. Microsoft

Windows Vista along with the current versions of OSX support virtualizing the GPU’s 26 memory. Virtualized GPU memory helps make it possible for multiple applications to share the GPU, something which both OSes support.

Besides the increased compute resources provided by Moore’s Law, GPUs will also have faster connections to memory and to the rest of the system. Upcoming GPUs will enjoy faster and wider memory buses (384- and 512-bit). Moreover, PCI Express 2.0 doubles the transfer rate of PCI Express 1.1 from 2.5 GT/s to 5.0GT/s, enabling a total of 16GB/s in both directions simultaneously (8.0 GB/s in either direction). Ultimately,

PCI Express should scale to up to 10 GT/s, providing a total of 32GB/s of bandwidth.

New I/O standards will couple the GPU more tightly with the CPU. Intel/IBM’s

Geneseo [46] technology and AMD’s [63] technology both improve the latency and bandwidth of data transfers between the CPU and GPU. The future will also see

GPU-integrated CPUs: both Intel and AMD have announced future microprocessors with integrated GPUs (codenamed Nehalem [60] and Fusion [31] respectively). Chapter 4

Experimental Setup

4.1 Writing a GPGPU Math Library

To study the potential of running a math library on the GPU, Apple’s vForce math library was ported into a series of GLSL-coded fragment shader programs. Once ported, a test harness was written to compare the accuracy, performance, and power consumption of the math libraries to the original vForce functions. The test harness was also used to compare the performance and accuracy of different CPUs and GPUs.

To help pinpoint performance bottlenecks, the performance effect of changing the width of the PCI-E connection was studied as well.

4.1.1 Algorithms Used

vForce is Apple’s high performance vector math library that is a part of Apple’s

Accelerate.framework [4]. Accelerate.framework contains various high performance image, signal processing, and mathematics libraries. Accelerate.framework provides the developer with performance optimized libraries that automatically choose the correct

(and highest performance) code path. Accelerate.framework takes advantage of any vector instruction set extensions provided by the machine’s microprocessor, such as

Freescale/IBM’s Altivec [26] instructions or Intel’s SSE [42] instructions.

Many of the underlying algorithms for vForce are based on modified versions of

Steven Moshier’s Cephes Math Library code. The Cephes Math Library contains a 28 variety of algorithms optimized for high performance and are well-suited for running as

SIMD code for the reasons discussed in Section 2.3. Besides the simpler control flow, the Cephes functions are also well-suited for SIMD operations because they are not table-driven; that is, they do not need table look ups to compute their results. In SIMD machines, efficiently implementing table look ups requires gather support, which is not supported on either x86 or PowerPC. While gather support does exist on GPUs, extra memory access operations still consume valuable memory bandwidth.

4.1.2 The OpenGL Shading Language (GLSL)

The OpenGL Shading Language (GLSL) resembles a form of C modified for graph- ics programming. It can be used to write either vertex shader or fragment shader programs. It is a JIT-compiled language: when an OpenGL program is run, the pro- gram loads the GLSL source code and sends it to the GPU vendor’s driver, which then compiles, links, and loads the GLSL program onto the GPU.

The GLSL language provides several different data types useful to GPGPU: an integer data type, a Boolean data type, a floating point data type, three matrix types (a

2x2 floating point matrix, a 3x3 floating point matrix, and a 4x4 floating point matrix), and a sampler type. The GLSL language does not make any assumptions about the underlying hardware representation of any of the data types; in many cases, integers and booleans are implemented as floating point numbers, and on the older ATi R3xx [64] and R4xx [55] series, all of the floating point numbers are 24 bits long as opposed to being 32-bit IEEE-754 format single precision numbers.

GLSL shader programs support a multiple input, single output data flow. The input is one or more textures (the maximum number of input textures is limited by the number of texture units supported by the GPU) and the output is a single, fixed-position fragment. As a result of the single, fixed-position output, it is impossible to implement scatter. Addressing the texture memory involves using either a one dimensional, two 29 dimensional, or three dimensional texture address (used to access 1D, 2D, and 3D textures respectively). Since texture addresses use a floating point number, it is possible to have unaddressable elements in a texture, since 32-bit floating point numbers can only count up to 224, or roughly sixteen million values. Texture look ups use either

Texture1D, Texture2D, or Texture3D (for 1D, 2D, and 3D textures respectively).

GLSL also provides a number of different built-in math functions. Since the

GLSL functions are designed for graphics calculations, GLSL provides no guarantees about the accuracy of any of the math functions. An important way to measure the utility of porting vForce over to the GPU is to compare how well the ported functions perform (in terms of accuracy, performance, and power) against the built-in GLSL versions. This thesis examines the accuracy and performance of the following built-in

GLSL functions:

(1) acos - Inverse cosine.

(2) asin - Inverse sine.

(3) atan - Inverse tangent.

(4) atan2 - Inverse tangent, two inputs.

(5) cosine - Cosine.

(6) exp2 - Power-of-two exponent.

(7) exp - Natural exponent.

(8) logarithm - Natural logarithm.

(9) log2 - Base-2 logarithm.

(10) sine - Sine.

(11) sqrt - Square root. 30

(12) tangent - Tangent.

Note that exp2 and log2 are important not just as basic math functions, but also as the building blocks of many of the ported vForce functions, with exp2 used to adjust exponents and log2 used to extract exponents and/or extract mantissas.

In addition to the built-in math functions, this thesis investigates the performance and accuracy of the basic operators add, subtract, multiply, and divide. While GLSL treats all of these as basic operators, vForce provides divide as an operation composed of other basic operations. The accuracy tests for the basic operations used test inputs obtained from Jerome Coonen’s Ph.D. thesis [20], whose work was a major basis for the

IEEE 754 floating point standard.

The comparison operators (less than, greater than, less than or equal, greater than or equal, equal, and not equal) were also tested. An essential use of these operators is for classifying an input value, determining whether it is one of the

floating point specials, or testing whether it is within a certain real-valued range. When classifying different specials, NaNs get special treatment as one does not compare a

NaN to another NaN; rather, one compares the NaN to itself, and if the self comparison fails, the value is a NaN. A self comparison is hazardous, however: a compiler that is not aware that self comparisons are used to detect NaNs will optimize away such oper- ations. To detect such problems, the comparison operators test also has a ”self” test, which compares an input variable to itself.

Test vectors for the comparison operators were generated by taking every possible positive and negative combination of the below listed values. These values were chosen to be a representative sample of the different values on the floating point number line:

(1) 0 - Note that both positive and negative zero, which compare equal to each

other, are tested.

(2) 1 - Represents the lower-magnitude real numbers. 31

(3) 1.5 - Represents the lower-magnitude, non-integral real numbers.

(4) 0.5 - Also represents the lower-magnitude, non-integral real numbers.

(5) 3.402823e38 - Highest possible magnitude normal number.

(6) 1.175494e-38 - Largest-magnitude denormal number.

(7) 1.4012985-45 - Smallest-magnitude denormal number.

(8) Infinity

(9) Not A Number (NaN)

The math function tests employed test vectors used internally by Apple to test the correctness of Apple’s libm math library. The test vectors are specifically designed to test various vulnerable areas of the algorithms. For the trigonometric functions sinf, cosf, and tanf, the Apple test vectors were supplemented with inputs designed to test

kπ immediately around 4 , a major vulnerability in the argument reduction techniques used by these functions.

4.1.3 Porting the vForce Functions to Shader Programs

The original Apple vForce code uses compiler intrinsics [1]. Compiler intrinsics are special functions (or, in some cases, preprocessor macros) used to encode operations that are not well expressed by either C or C++. In many cases, compiler intrinsics are actually function wrappers around assembly instructions. Compiler intrinsics are popularly used for SIMD programming, as they allow the programmer to precisely specify SIMD instructions without having to program them in assembly. Being able to write SIMD code in C/C++, even if the intrinsics only specify individual instruc- tions, greatly improves productivity by allowing the compiler to deal with such tedious, cumbersome tasks as register allocation, memory addressing modes, and function pro- logues/epilogues. 32

The macro-based system used for converting the portable intrinsic code into GLSL was significantly more complex than the existing system used by Apple to generate compiler intrinsics for PowerPC and x86. The lack of integer and bitwise operations in

GLSL greatly complicated porting. Bitwise operations are essential for operations such as extracting and manipulating the exponent. Without bitwise support, the exponent extraction had to be done using the log2 function and mantissa extraction had to be done by extracting the exponent with log2, negating it, and then multiplying it by two to the negated exponent. Workarounds such as the ones previously described certainly had an impact on performance, as they required many additional operations. It is also likely that they adversely affected accuracy, as the added operations inserted additional rounding errors (along with any other accuracy errors the GPUs’ implementations of exp2 and log2 had) into the calculations.

A similar problem occurred when porting division and square root to the GPU.

Both square root and division use Peter Markstein’s algorithms [44], which start with a reciprocal estimate for the divide and square root and then use Newton-Raphson iterative refinement to converge on the correct value. The GPU ports of these two functions emulate the reciprocal estimate using the divide operator and emulate the reciprocal square root using GLSL’s built in rsqrt function.

While the translation system works for many basic operations, it is impossible to make fully orthogonal conversions of many operations, particularly those using bitwise operators. As a result, most of the vForce functions cannot be ported completely automatically; some unsupported operations have to be translated by hand. Overall, however, the translation system greatly simplifies the porting process, and using it reduces the risk of human error when porting.

Constants are another serious issue. The GLSL language has no way to specify the floating point specials as constants. As a result, all of the needed constants have to be generated by the test harness, put into a texture, and loaded onto the GPU. Math library functions that use these constants have to load them from the constants texture. Such a workaround certainly hurts performance, as texture lookups consume valuable GPU memory bandwidth.
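A minimal host-side sketch of how such a constants texture might be built (texture name and layout are hypothetical; GL_RGBA32F_ARB assumes the ARB_texture_float extension):

    #include <math.h>       /* INFINITY, NAN (C99) */
    #include <GL/glew.h>    /* GL entry points and GL_RGBA32F_ARB */

    /* Write a handful of specials into a tiny float texture; shaders
     * then fetch them from this texture instead of spelling literals. */
    void loadConstantsTexture(GLuint constTex) {
        const float consts[4] = { INFINITY, -INFINITY, NAN, -0.0f };
        glBindTexture(GL_TEXTURE_2D, constTex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, 1, 1, 0,
                     GL_RGBA, GL_FLOAT, consts);  /* one RGBA texel */
    }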

Finally, dealing properly with control flow is an extremely serious issue. When programming for SIMD, it is not practical to implement many forms of control flow with branches. Consider the following code fragment: if(x == 1) y = x+1; else y = x*2;. If x and y are both four-element SIMD registers, then it is possible that some elements of x will evaluate to true and some will evaluate to false. To deal with this problem, SIMD programming uses a limited form of predication known as the conditional select operation. The conditional select takes three SIMD register inputs: two data inputs (x and y) and a predicate register. For each element in the SIMD registers, if its corresponding Boolean element is true, the y element is selected; otherwise, the x element is selected for the output. While the GPUs internally support predication, GLSL provides no way to expose it to the programmer. As a result, to emulate a conditional select instruction, it is necessary to translate the select instructions into several scalar select statements. Doing so likely hurts performance, unless the shader compiler is able to fold the scalar instructions back into a single vector instruction.
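A hedged GLSL sketch of the scalar emulation just described (names hypothetical; pred holds 0.0 or 1.0 per element, as produced by the comparison operators):

    vec4 cond_select(vec4 pred, vec4 x, vec4 y) {
        vec4 r;
        r.x = (pred.x != 0.0) ? y.x : x.x;  // one scalar select per element
        r.y = (pred.y != 0.0) ? y.y : x.y;
        r.z = (pred.z != 0.0) ? y.z : x.z;
        r.w = (pred.w != 0.0) ? y.w : x.w;
        return r;
    }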

After the compiler intrinsics are successfully translated into GLSL, the GLSL code is assembled into a valid shader program. Figure 4.1 shows the overall translation system. The translation system takes as its input the program written in the compiler intrinsic language. It then passes the intrinsic code through CPP, using a special header file to translate the intrinsics into GLSL. Once macro expansion is complete, the translated code is inserted into a special GLSL function, and the results are written to a GLSL code file. The translator creates three versions of each file: one that operates on scalar float types, one that operates on vec2-based data types, and one that operates on vec4 data types.


Figure 4.1: The intrinsic-to-GLSL translation system.

The GLSL code generated with vec2 data types never produced correct results, so its accuracy, performance, and power figures are not presented.

One difficulty in GPGPU is finding an efficient way to access the result data.

Traditional graphics programming did not have to deal with downloading results from the GPU, as they were immediately drawn to the screen. In GPGPU, however, it is almost always necessary to access the outputted results, either to download them off the GPU for further processing by the CPU or to use them as the input for another rendering pass. Moreover, many GPGPU calculations require multiple outputs. Until recently, the only way to access rendered results was through pbuffers, a mechanism that is both esoteric (pbuffers are difficult to set up) and inefficient (they require a context switch to access). Recent versions of OpenGL, however, add support for Framebuffer Objects (FBOs) [23]. FBOs are special OpenGL objects designed to function as render targets. An important feature of FBOs is that they make render-to-texture possible: textures can be attached to FBOs and thus used as output arrays. Once rendered to, these textures can be used as inputs to subsequent rendering passes. Additionally, multiple textures can be attached to a single FBO, allowing for multiple outputs.
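A minimal C sketch of FBO-based render-to-texture using the EXT_framebuffer_object entry points (error checking omitted; the GLEW header is assumed, as the test harness uses GLEW):

    #include <GL/glew.h>   /* EXT_framebuffer_object entry points */

    /* Create a float texture and attach it to an FBO as a render target;
     * drawing a full-screen quad then writes shader output into the texture. */
    GLuint makeRenderTarget(int w, int h, GLuint *texOut) {
        GLuint fbo, tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, w, h, 0,
                     GL_RGBA, GL_FLOAT, NULL);    /* empty output array */
        glGenFramebuffersEXT(1, &fbo);
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
        glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT,
                                  GL_COLOR_ATTACHMENT0_EXT,
                                  GL_TEXTURE_2D, tex, 0);
        *texOut = tex;
        return fbo;
    }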

4.2 Test Setup

The GPU study done for this thesis tested several machines. The most thoroughly tested were two iMac machines: a shipping Core Duo 17" iMac and a prototype Core 2 Duo 20" iMac. All of the tests except the power consumption and PCI Express bandwidth tests were performed on these two machines. The iMac is Apple's mid-range consumer desktop system, and both machines have mainstream GPUs: the 17" Core Duo iMac has an ATi Radeon x1600, while the 20" Core 2 Duo iMac has an NVIDIA GeForce 7300GT. Both iMacs have a single 64-bit memory channel running at 667MHz (a total of 5.4GB/s of main memory bandwidth), and both carry 1GB of memory. The power studies use a 2.0GHz Core Duo MacBook Pro (the MacBook Pro is Apple's high-end notebook line), which, like the 17" Core Duo iMac, has an ATi Radeon x1600.

The other studies involve various Mac Pro desktop machines. The Mac Pro is Apple's high-end, workstation-class machine. While the different Mac Pro machines vary considerably in their specifications, all of them have two dual-core Xeon processors (these Xeons are Intel "Woodcrest" devices, microarchitecturally identical to the Core 2 Duo found in the 20" iMac). As seen in Table 4.1, the clock speeds and memory configurations of the different Mac Pros vary greatly; however, they all have a 64-bit, 1333MHz front side bus (for a total of 10.8GB/s of memory bandwidth available to the microprocessor) along with a quad-channel, 667MHz FB-DIMM-based memory system.

Machine          GPU              VRAM    Memory   CPU             FSB
ATi iMac         x1600            128MB   1GB†     1.83GHz†††      667MHz
NVIDIA iMac      7300GT           128MB   1GB†     2.17GHz††††     667MHz
ATi Mac Pro      x1900XTX         512MB   4GB††    2x3.0GHz††††    1333MHz
NVIDIA Mac Pro   Quadro FX 4500   512MB   16GB††   2x3.0GHz††††    1333MHz
MacBook Pro      x1600            128MB   1GB†     2.0GHz†††       667MHz
PCI-E Mac Pro    7300GT           256MB   4GB††    2x2.66GHz††††   1333MHz

† Single channel DDR2-667. †† DDR2-667 FB-DIMMs, quad channel. ††† Core Duo. †††† Core 2 Duo.

Table 4.1: System configurations tested.

Device                         Peak Computational     Peak Memory
                               Capacity (GFLOP/s)     Bandwidth (GB/s)
Intel Core Duo (1.83GHz)†      7.32                   5.4
Intel Core Duo (2.00GHz)†      8.00                   5.4
Intel Core 2 Duo (2.17GHz)†    17.36                  5.4
Intel Core 2 Duo (2.66GHz)†    21.28                  10.8
Intel Core 2 Duo (3.00GHz)†    24.00                  10.8
ATi x1600                      24.00                  12.5
NVIDIA 7300GT                  26.4                   10.7
ATi x1900XTX                   124.8                  49.6
NVIDIA Quadro FX 4500          129.6                  33.6

† Single core only.

Table 4.2: Capabilities of the CPUs and GPUs.

The GPUs vary widely as well: the PCI Express bandwidth test system contains an NVIDIA GeForce 7300GT, while the other two Mac Pros sport an ATi Radeon x1900XTX and an NVIDIA Quadro FX 4500.

Between the test systems, four GPUs were studied. The Core Duo iMac and the MacBook Pro both contain an ATi Radeon x1600, one of ATi's mid-range GPUs; both systems' GPUs come with 128MB of video RAM. Another ATi GPU, found in one of the Mac Pros, is the ATi Radeon x1900XTX, which was, at the time of the study, ATi's highest-end GPU. Two NVIDIA GPUs were studied as well. The first, the NVIDIA GeForce 7300GT, is NVIDIA's mainstream GPU; in the Core 2 Duo iMac it is equipped with 128MB of memory, while in the Mac Pro used to study PCI-E bandwidth it comes with 256MB. The high-end NVIDIA GPU tested was the NVIDIA Quadro FX 4500, a professional, workstation-grade GPU. Table 4.2 provides more details, including raw FLOPs and memory bandwidth, for all of the GPUs and CPUs studied.


Figure 4.2: Diagram of the NVIDIA G7x pixel shader core.

Figure 4.2 shows a high-level diagram of the NVIDIA G7x shader core [66]. Both NVIDIA GPUs studied belong to the G7x family. Rasterized fragment data enters the shader core through a fragment crossbar, which distributes the shading work among the fragment shader units. The fragment shader units access textures through four separate memory controllers.

Figure 4.3 shows the inner workings of the NVIDIA G7x series' fragment shader units. Each fragment shader unit can be split into two areas: the texture unit and the computation unit. The texture unit is the equivalent of a microprocessor's memory hierarchy, with a private L1 texture cache and access to a larger L2 texture cache. The compute unit contains two main ALUs, two mini-ALUs, a branch unit, and a fog ALU.


Figure 4.3: Diagram of the fragment shader of the NVIDIA G7x.

For GPGPU, only the main ALUs and the branch unit are relevant: the mini-ALUs only support 16-bit floating point calculations, and the fog ALU is too special-purpose to be useful. For GPGPU work, the NVIDIA G7x shader units are capable of up to three FLOPs per cycle: one multiply-add in the first ALU and a multiply in the second ALU.

The current NVIDIA GPUs also support a capability known as co-issue, shown in Figure 4.4. Co-issue improves the utilization of the shader units when the data types are smaller than vec4. When a shader program uses vec2 data types, for example, co-issue schedules two pixels to run on one shader unit. Co-issue may therefore reduce the penalty for using scalar float and vec2 data types in shader programs.

Figure 4.5 shows a high-level diagram of the ATi R5xx shader core. Both ATi GPUs studied belong to the R5xx family. Rasterized fragment data enters the shader core through a multi-threaded dispatch processor.


Figure 4.4: Diagram of dual-issue and co-issue.

The dispatch processor distributes work among the individual fragment shaders, all of which have access to a common general purpose register file. Additionally, the dispatch processor is in charge of texture lookups. The ATi R5xx's fragment shaders, shown in Figure 4.6, have a different architecture from the NVIDIA G7x: instead of a 4-wide ALU that can co-issue two smaller operations, the R5xx uses two separate ALUs, a vec3 ALU and a scalar ALU. Similar to the G7x, however, the two ALUs combined can perform one multiply-add and one multiply every cycle. Each fragment shader unit also contains a branch processing unit.

One thing to note about both the ATi and NVIDIA architectures is that both designs take a significant performance hit on dynamic branches when adjacent pixels take different branch paths.


Figure 4.5: Diagram of the ATi R5xx pixel shader core.

When adjacent pixels take different control flow paths, the shader core has to execute both paths of the branch for all of the pixels. This dynamic branching problem is not an issue for the ported math library, as all of its branches are if-converted to predication.

To provide a baseline performance measure, two families of CPUs were studied: the Intel Core Duo [27] and the Intel Core 2 Duo [6]. For the compute-bound vForce library, the most important distinguishing factors between the two architectures are raw floating point performance and memory bandwidth. The Core Duo can provide up to four single precision FLOPs per cycle: two FADDs and two FMULs. The Core 2 Duo, with its wider internal data paths, can do twice as much: up to four FADDs and four FMULs per cycle.


Figure 4.6: Diagram of the fragment shader of the ATi R5xx.

As far as memory bandwidth is concerned, all of the systems except the Mac Pros have a 64-bit (8-byte) wide front side bus running at 667MHz, which provides them with 5.4GB/s of memory bandwidth. The Mac Pro machines, on the other hand, enjoy a 64-bit front side bus running at twice that speed, 1333MHz, giving them a total of 10.8GB/s of memory bandwidth.

One of the 2.66GHz Core 2 Duo machines was used to conduct the PCI-E bandwidth tests. The Mac Pro is useful as a PCI-E bandwidth test platform because it provides a utility for allocating PCI-E lanes to the various PCI-E slots in the system. As a result, it is possible to allocate only eight lanes to an x16 PCI-E slot (in PCI Express, a slot can have fewer data lanes allocated than the maximum the slot allows), thus halving the bandwidth available to an attached GPU card. This study helps to pinpoint bandwidth bottlenecks in the system.

The power consumption measurements were obtained using a MacBook Pro with a 2.0GHz Core Duo and an ATi x1600 GPU. System-level power consumption was measured at the wall using a Kill-a-Watt device [37]. To minimize the influence of non-CPU, non-GPU components, all other applications were closed and the screen brightness was turned all the way down. The test harness was also modified so that the GPU tests ran long enough to produce stable readings. Since the Kill-a-Watt device has no way to automatically store power consumption figures, the author recorded them manually.

The power consumption study recorded several different values. First, it obtained a baseline figure by recording the machine's idle power consumption while no applications were running. Then the machine's power consumption was measured for a representative subset of the math functions in several different scenarios:

(1) Run the function without any data transfer on to or off of the GPU.

(2) Run the function and upload data to the GPU.

(3) Run the function and download results off of the GPU.

(4) Run the function and upload and download data to/from the GPU.

4.3 Testing a GPGPU Math Library

Figure 4.7 shows a high-level view of the test harness. The test harness first reads the test vectors from the input file and the GLSL code from the source file. It then compiles and links the GLSL code and loads the fragment shader program onto the GPU. Next, it sets up all of the texture data: the constant values, any look-up table data, and the test inputs.


Figure 4.7: The GPU test harness.

The test harness then feeds the test inputs to both the GPU and vForce. Once the results are calculated, it downloads the GPU results and compares them to the vForce results. Finally, it writes the comparisons to a log file for later analysis.

The test harness uses Mac OS X 10.4 "Tiger" as its development and execution environment. The test program uses the OpenGL API along with the OpenGL Utility Toolkit (GLUT) [43] to communicate with the GPU. All of the shader programs use the OpenGL Shading Language (GLSL), and the OpenGL Extension Wrangler (GLEW) [35] manages the OpenGL extensions.

The test harness measures performance in four ways:

(1) No Transfer. No transfer examines how GPGPU programs perform when running exclusively on the GPU. In no-transfer mode, the test harness loads the input data once and repeatedly runs the shader program.

(2) Upload Only. In upload only, the test harness repeatedly uploads the input data and runs the shader program. The purpose of the upload-only mode is to measure the impact of upload transfers on a GPGPU program's performance. Uploads are performed using the OpenGL command glTexImage2D.

(3) Download Only. In download only, the test harness repeatedly runs the shader program and downloads the results. The download-only test measures the impact of download transfers on performance. Downloads are performed using the OpenGL command glReadPixels.

(4) Upload and Download. In upload and download mode, the test harness repeatedly uploads input data to the GPU, runs the function, and downloads the results from the GPU. This test characterizes the performance of the math functions as if they were part of a stand-alone GPU math library. (A minimal sketch of the upload and download calls follows this list.)
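A minimal C sketch of the two transfer calls named above (texture name and sizes hypothetical; the intervening render pass is elided):

    #include <GL/glew.h>

    void transferOnce(GLuint inputTex, int w, int h,
                      const float *input, float *output) {
        glBindTexture(GL_TEXTURE_2D, inputTex);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, w, h, 0,
                     GL_RGBA, GL_FLOAT, input);            /* upload */
        /* ... draw a full-screen quad to run the shader ... */
        glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, output); /* download */
    }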

Getting optimal performance out of GPU texture transfers requires special consideration from the OpenGL programmer. Most graphics systems do not have a default "direct path" between an OpenGL application's texture memory and the GPU: the different layers of the graphics system each keep their own copy of the data. Keeping multiple copies of the same texture data makes graphics programming easier: upon creating a texture with glTexImage2D, the OpenGL application does not have to explicitly allocate and deallocate texture memory, as the texture data resides within the OpenGL system. Since graphics applications typically create a texture once, the extra overhead is acceptable. With GPGPU, on the other hand, the multiple memory copies between layers of the graphics stack add significant overhead. Figure 4.8 shows Apple's graphics system stack.


Figure 4.8: The Apple OpenGL software stack.

Normally in OSX, transferring a texture from an OpenGL application to the GPU's local memory entails three memory copies: one between the client application and the OpenGL framework, one between the OpenGL framework and the OpenGL driver, and one between the OpenGL driver and the GPU.

On non-OSX systems, the recommended way to eliminate the extra memory copies is to use Pixel Buffer Objects (PBOs) [3]. PBOs allow an OpenGL application to directly manage texture memory. OSX, however, does not benefit much from PBOs.

On OSX, PBOs can eliminate at most one of the copies. To eliminate two of the memory copies, Apple instead provides two OpenGL extensions: APPLE_client_storage [57] and APPLE_texture_range [2], which eliminate the copies between the client application and the OpenGL framework, and between the OpenGL framework and the OpenGL driver, respectively [8].
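A hedged C sketch of how these two extensions are typically enabled (not necessarily the test harness's exact code; the caller must keep the input buffer valid for the texture's lifetime):

    #include <OpenGL/gl.h>     /* OSX OpenGL headers */
    #include <OpenGL/glext.h>

    /* Hint GL to reference the caller's buffer instead of copying it. */
    void loadTextureZeroCopy(GLuint tex, int w, int h, const float *input) {
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_STORAGE_HINT_APPLE,
                        GL_STORAGE_SHARED_APPLE);   /* APPLE_texture_range */
        glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, w, h, 0,
                     GL_RGBA, GL_FLOAT, input);     /* no host-side copy */
    }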

Another issue pertinent to obtaining optimal performance is texture size: while large blocks of memory are usually desirable for performance, each GPU has its own optimal texture size, which is discussed in Chapter 5.

As discussed previously, measuring the accuracy of a math function involves using either ulps or an absolute difference. When testing the floating point specials, however, quantities such as ulps and absolute differences become meaningless; for example, if one value is infinity, its difference from a real number is still infinity. Similarly, a NaN always compares false, even to another NaN. As a result, a different method of quantifying errors was used: instead of quantifying errors by their absolute difference, each test case was binned into one of six categories:

(1) Correct. Both results are identical.

(2) Normal error. Both results, while different, are normal numbers of the same sign.

(3) Denormal error. One of the results is a zero while the other is a normal number. Such a result can occur when a denormal input or intermediate gets squashed to zero by the hardware.

(4) Infinity error. One of the results is an infinity value.

(5) NaN error. One of the results is a NaN value.

(6) +/-0 error. A +/-0 error is reported whenever a zero or negative zero appears in the result.

The following single precision functions were ported from vForce over to the GPU:

(1) vvdivf - Division.

(2) vvsqrtf - Square root.

(3) vvsinf - Sine.

(4) vvcosf - Cosine.

(5) vvtanf - Tangent.

(6) vvasinf - Inverse sine.

(7) vvacosf - Inverse cosine.

(8) vvatanf - Inverse tangent.

(9) vvlogf - Natural logarithm.

(10) vvlog1pf - Natural logarithm of 1 + x. Provides extra accuracy when 1 + x is immediately around 1 (i.e., for small x).

(11) vvexpf - Exponential function. Calculates e^x.

(12) vvexpm1f - Exponential function. Calculates e^x - 1; unlike computing vvexpf and subtracting 1, it provides extra accuracy for values of x immediately around zero.

(13) vvlogbf - Base-2 logarithm of x.

The performance tests done on the Mac Pros as well as the power tests employed a restricted subset of the vForce functions, due to limited availability of the machines.

The restricted function subset provides a representative sample of the different classes of math functions, including functions with special characteristics such as the table lookups in log1p. The subset includes sinf, logarithm, log1p, multiplication, exp, sqrt, addition, asin, and division.

Chapter 5

Experimental Results

This chapter presents and discusses the results of the GPU accuracy, performance, power, and bandwidth tests. Section 5.1 presents the accuracy results, Section 5.2 the performance results, Section 5.3 the power/energy results, and Section 5.4 the results of the PCI-E bandwidth study.

5.1 Accuracy Results

Figure 5.1 shows the average accuracy figures for the basic operators, the built-in GLSL functions, and the ported vForce functions. Figure 5.2 compares the accuracy percentages between the four GPUs. On the basic operations, the NVIDIA GPUs are more accurate than the ATi GPUs; there is little accuracy difference, however, between GPU models from the same vendor.

The built-in GLSL functions suffer serious accuracy issues with normal numbers as well as with NaNs. Both problems likely stem from the functions' graphics roots: modest accuracy problems are not an issue for graphics, nor is properly handling NaNs. Overall, the NVIDIA GPUs are more accurate on the built-in GLSL functions than the ATi GPUs. There are also significant accuracy differences within the GPU families: the mainstream GPUs (the ATi x1600 and the NVIDIA 7300GT) enjoy better accuracy than the ATi x1900XTX and the NVIDIA Quadro FX 4500, respectively.

Figure 5.1: The accuracy differences between the tested GPUs for the basic operations, the built-in GLSL functions, and the ported vForce functions.

Figure 5.2: Comparison of the accuracy between the different GPUs for the basic operations, the built-in GLSL functions, and the ported vForce functions.


Figure 5.3: Accuracy improvement of the ported vForce functions vs. the built-in GLSL functions.

Of the four GPUs, the NVIDIA 7300GT has the fewest NaN errors.

Once again, the NVIDIA GPUs provide a greater level of overall accuracy on the ported vForce functions. In this case, however, the NVIDIA Quadro FX 4500 achieves higher overall accuracy than the NVIDIA 7300GT; for ATi, the mainstream x1600 still enjoys higher accuracy than the high-end x1900XTX.

A possible explanation for the nearly identical intra-family accuracy results on the basic operations, but significantly different results on the built-in GLSL functions and the ported vForce functions, is differences between the GLSL compilers. As discussed previously, compiler optimizations can affect the accuracy of floating-point results. An example of such an optimization is identity removal in the ATi shader compiler: when a variable is compared for equality (or inequality) with itself, the condition normally always evaluates to true (or false), except when the variable is a NaN. It was found that the ATi compiler optimizes out these comparisons, causing NaN self-comparisons to evaluate incorrectly. Overall, as Figure 5.3 shows, the ported vForce functions are significantly more accurate than the built-in GLSL functions.

Figure 5.4: A performance comparison of the built-in GLSL functions versus the ported vForce functions.

5.2 Performance Results

The trade-off, however, is performance: the ported vForce functions are significantly slower than the built-in GLSL functions, an average of 68.8% slower for the all-GPU version. The ported vForce div and sqrt are virtually guaranteed to be slower than their built-in GLSL equivalents, since both use the built-in GLSL functions to provide the initial estimates that are subsequently refined.

A possible future improvement worth investigating is whether the GPU's powerful memory subsystem could be leveraged to implement the square root and division estimates as a large lookup table.

The other functions showing the largest performance difference between the built-in GLSL and ported vForce versions are exp, log, and sine. exp's large performance decrease is probably due to the lack of bitwise operators on the GPU, which exp would otherwise use to manipulate exponents; log most likely suffers from the same issue. Finally, the ported vForce version of sine is 86.6% slower than the built-in GLSL version, most likely due to the complicated GPU-based argument reduction: the round-to-nearest conversion to integer requires many operations on the GPU. Additionally, there is a good chance that the GPU supports the sine function directly in hardware.

Overall, the performance advantage of the built-in GLSL functions over the ported vForce functions diminishes significantly when the calculations become bandwidth-limited, such as when they involve off-GPU transfers. A surprising result is that the ATi GPUs show little performance difference between the built-in GLSL and the ported vForce functions. The reason for this difference is not known; perhaps the division operation is highly limited by the GPU's memory bandwidth.

Figures 5.5 and 5.6 compare the GPUs' performance when different texture sizes are used. The graphs show the percent performance change against a 128x128 texture-size baseline. Three texture sizes were compared: 128x128, 256x256, and 512x512. The results show that each GPU has its own optimal texture size, and the optimal size even differs within a vendor's GPU family.


Figure 5.5: A performance comparison of different texture sizes for the NVIDIA GPUs.


Figure 5.6: A performance comparison of different texture sizes for the ATi GPUs.

Overall, the NVIDIA GPUs tend to be less particular about the texture size used, with larger textures generally providing higher performance. The NVIDIA 7300GT's optimal texture size is 256x256, while the Quadro FX 4500 prefers 512x512 textures.

The two ATi GPUs, particularly the x1600, tend to have more irregular texture-size results: the ATi x1600 achieves its best overall performance with 128x128 textures, while the x1900XTX prefers 256x256 textures. Moreover, the ATi GPUs suffer significant performance penalties when moving to textures larger than their optimal size.

Figure 5.7 shows the overall performance of all of the vForce functions ported to the ATi x1600 and the NVIDIA 7300GT. The GPU results are plotted against the Core Duo vForce results from the x1600 system and the Core 2 Duo results from the 7300GT system. All of the GPU results use the vec4 data type, as it provides the highest performance on all of the benchmarks. Whether the GPU-ported math functions enjoy any performance benefit depends heavily on the CPU used as the performance baseline, as well as on whether any off-GPU transfers are involved.

Figure 5.8 shows the raw performance of vForce running on the different systems tested. Compared are a 1.83GHz Intel Core Duo (Core Duo iMac), a 2.17GHz Core 2 Duo (Core 2 Duo iMac), a 2.66GHz Core 2 Duo (NVIDIA 7300GT Mac Pro), and a 3.0GHz Core 2 Duo (NVIDIA Quadro FX 4500 Mac Pro). On all of the functions except division, the Core 2 Duo iMac enjoys an average 1.95x speedup over the Core Duo iMac.

Such a gain is entirely possible given the 2x advantage in peak single precision floating point throughput provided by the wider SSE ALUs of the Core 2 Duo microprocessor.


Figure 5.7: A performance comparison of the ported vForce functions.


Figure 5.8: A performance comparison of vForce running on different processors.

The 1.95x gains on the benchmarks also suggest that the vForce tests are not limited by memory bandwidth: the Core 2 Duo iMac enjoys a large performance increase while having exactly the same 64-bit, 667MHz memory subsystem as the Core Duo iMac. Likewise, the Mac Pros, with their high-bandwidth memory subsystem, do not gain much of an advantage above and beyond their higher clock speed.

Figures 5.9 and 5.10 show the performance improvement from moving from the scalar float data type to the vector vec4 data type. While the 4x wider data type should in theory provide 4x the throughput, it does not: not even the complicated log1p and expm1 achieve more than a 192% speedup.

There are several possible reasons for the relatively small gains. First, it is highly likely that all of the functions are memory bandwidth limited, even those that avoid slow off-GPU transfers. Another possibility is that the vec4 data type incurs overhead not seen with the float data type: some operations, such as the reciprocal and reciprocal square root estimates, may operate internally as scalar operations, and, due to limitations of the GLSL language, the conditional select has to be implemented as four scalar if-then-else operations. Finally, both the ATi and NVIDIA GPUs can issue two scalar float operations or one vector vec4 operation to each fragment shader, doubling the potential scalar throughput.

The four graphs in Figure 5.11 quantify the performance gains from moving from the scalar float data type to the vector vec4 data type on the four GPUs. The ATi GPUs gain the most from going to vec4, possibly because the co-issue support on the NVIDIA GPUs already improves their scalar throughput, leaving less room for improvement. As expected, the largest gains from going to vec4 occur on the tests that involve no off-GPU transfers; significant gains remain, however, when data is only uploaded to the GPU.

Note that the built-in GLSL functions gain little performance from going from float to vec4, with the exception of asin.


Figure 5.9: Speedups gained from using the vec4 data type as opposed to the float data type on all of the ported vForce functions.


Figure 5.10: Speedups gained from using the vec4 data type as opposed to the float data type on the restricted subset of the functions.


Figure 5.11: A comparison of the different GPUs using the float data type.

While the real reason for this behavior is unknown, there are two likely possibilities. One is that the built-in GLSL functions always operate on float data types: when the functions are given vec4 inputs, the data is broken down internally into four float values. The other is that, when float data types are used, the compiler does the reverse as an optimization, combining four float operations into a single vec4 operation.

Figure 5.12 compares the performance of the four GPUs. All of the comparisons use the highest-performing vec4 data type. The first graph compares the NVIDIA Quadro FX 4500 to its mainstream sibling, the 7300GT. While the higher end GPU performs 239% better than the 7300GT when no off-GPU transfers are involved, the performance difference narrows significantly (to a mere 3.13%) when slow off-GPU data transfers begin to dominate.

It is also worth investigating whether the GPU hardware's performance is affected by the input data sets. Figure 5.13 shows the performance difference for the ported vForce functions using two different input sets: one containing only normal numbers, and a mixed set containing all of the other types of numbers in the IEEE 754 number system: denormals, infinities, NaNs, and negative zero. Overall, the performance figures show little difference between the two data sets.

Since bandwidth to the GPU's local memory, as well as to and from the GPU, plays such an important role in the performance of GPU-run applications, it is essential to study the maximum bandwidth of the GPUs. Figure 5.14 shows the raw bandwidth of the different GPUs. The GPU bandwidth is measured by taking the performance of the addition test, in millions of results per second (MResults/s), and multiplying it by four (the number of bytes in an element) and then by three (the number of input textures, two, plus the number of output textures, one).
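As an illustrative example (hypothetical figures, not a measured result): an addition rate of 400 MResults/s would correspond to 400 x 10^6 results/s x 4 bytes x 3 textures, or roughly 4.8 GB/s of realized texture bandwidth.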

The results show that while the higher end GPUs enjoy significantly greater bandwidth to their local memories (221% higher for the NVIDIA GPUs and 564% higher for the ATi GPUs), there is a much smaller difference in off-GPU bandwidths (the Quadro FX 4500 has 88.6% of the 7300GT's bandwidth, while the ATi x1900XTX has 158% of the x1600's).


Figure 5.12: A comparison of the different GPUs using the vec4 data type.


Figure 5.13: A comparison of the performance of the ported vForce functions using an all-normals dataset versus a mixed dataset.


Figure 5.14: A raw bandwidth comparison.

The high-end NVIDIA GPU system actually has somewhat lower off-GPU bandwidth than the mainstream 7300GT system does. This bandwidth discrepancy is likely due to the relative immaturity of the Mac Pro platform used to test the high-end GPUs, compared to the shipping iMac platform.

5.3 Power/Energy Results

Figure 5.15 shows the overall system-level power consumption for a restricted subset of the functions running on the MacBook Pro. Note that there is little difference in system-level power consumption between vForce and the different GPU implementations: the difference is no more than about five watts. As a result, if using the GPU is going to show any energy savings, it will have to do so by allowing a given piece of work to finish faster.

Since the overall power consumption of the MacBook Pro is mostly unchanged regardless of whether the function runs on the CPU or the GPU, the GPU can only improve energy consumption if it performs the work faster than the CPU. The performance results are similar to those seen in the previous section: the GPU-based functions generally do better than the CPU versions when no off-GPU transfers are necessary, and the more complicated math functions gain more from porting to the GPU.

Figure 5.17 shows which functions, under which conditions, actually show an energy consumption improvement from running on the GPU instead of the CPU. The numbers in Figure 5.17 are calculated by taking the power consumption, multiplying it by one second, and dividing by the number of results generated per second. A case for saving energy by running the math functions on the GPU can be made when no off-GPU transfers are involved and/or the math functions are fairly complicated.
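As an illustrative example (hypothetical figures, not a measured result): a system drawing 50 W while producing 200 MResults/s would cost 50 W x 1 s / 200 MResults, or 0.25 J/MResult.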


Figure 5.15: A comparison of the system-level power consumption of different GPU-based math functions vs. vForce.


Figure 5.16: A comparison of the performance of different built-in GLSL functions vs. vForce.

Figure 5.17: A comparison of the energy consumption of different GPU-based math functions vs. vForce.


Figure 5.18: The CPU-GPU transfer bandwidth for differing PCI-E widths.

5.4 Bandwidth Results

Section 5.2 showed that the realized off-GPU transfer rates are a mere fraction of the maximum 4.0GB/s possible with x16 PCI-E. Figure 5.18 shows that the poor bandwidth is not a pure, physical-layer hardware problem, as halving the number of available PCI-E lanes has only a small influence (13MB/s) on the realized bandwidth.

These results suggest that the low realized bandwidth is a problem residing within the GPU hardware itself, in the drivers, or within the operating system.

Comparing the OSX results to the results obtained by NVIDIA [3], it is likely that the poor transfer rates are the fault of either the CPU chipset (the NVIDIA performance figures were obtained using their nForce chipset on an AMD system) or the software. Potential culprits include the operating system, the driver layer, and the OpenGL library/framework. Since GPGPU has not been frequently attempted on OSX, it is likely that the OS is not well tuned for GPGPU work.

Chapter 6

Summary and Conclusion

This thesis presented a study of the potential of creating and using a high-performance GPU-based math library. It evaluated the feasibility of such a library across several metrics: accuracy, performance, and power consumption, measured on four GPUs found in current Apple machines. It compared the performance, accuracy, and power consumption of the four GPUs tested (the NVIDIA GeForce 7300GT, the ATi x1600, the ATi x1900XTX, and the NVIDIA Quadro FX 4500) against each other as well as against the CPU-based version of the math library.

Testing the accuracy involved feeding special test vectors into the original Apple vForce functions, the math functions built into GLSL, and the GPU-ported vForce functions, and subsequently comparing the results. To help understand the sources of accuracy error in the ported GLSL functions, the basic arithmetic and comparison operators were also studied. Accuracy comparisons involved categorizing each test vector by whether both functions produced the same result. If the results differed, the study categorized the error as one of: normal errors, denormal errors, infinity errors, Not a Number (NaN) errors, and +/-0 errors.

The performance tests thoroughly investigated the parameters relevant to GPU performance: texture size, data type, and different combinations of data transfers on to and off of the GPU. The tests examined these parameters using a representative subset of the built-in GLSL math functions as well as the ported math functions, comparing the throughput (in results per second) of the four GPUs against each other and against the CPU-based vForce baseline. Since off-GPU transfers are an important component of GPGPU performance, the performance study also investigated the effect of halving the effective PCI-E bandwidth to the GPU.

Finally, this thesis investigated how using the GPU-based math functions affects power and energy consumption. The power consumption study measured system-level power at the wall on a MacBook Pro with an ATi x1600 GPU while running a restricted subset of both the built-in GLSL math functions and the GPU-ported vForce functions, comparing the power draw against that of the CPU-based vForce functions.

6.1 Summary of Results

Accuracy-wise, the GPU-implemented math library is still significantly worse than the CPU-based vForce code from which it derives. Moreover, the accuracy levels are inconsistent between GPU vendors and even within GPU families. A promising finding, however, is that the ported vForce functions are significantly more accurate than the built-in GLSL math functions. Of the two GPU vendors, NVIDIA's GPUs enjoy higher overall accuracy than ATi's on the basic operations, the built-in GLSL functions, and the ported vForce functions. None of the GPUs can correctly handle denormal numbers. A surprising result with the NVIDIA GPUs, however, is their ability to correctly compare denormal numbers. This capability is potentially useful: a math library could use it to detect and flag denormal inputs, which could then be computed correctly on the CPU instead.

The performance of the math library on the GPU is generally superior to CPU-based vForce so long as the computation stays on the GPU. As soon as off-GPU transfers become involved, performance drops significantly, to the point where only the most computation-intensive math functions retain a performance advantage from running on the GPU. The large performance hit from moving data to and from the GPU is compounded by the fact that the realized off-GPU bandwidth is a mere fraction of the maximum bandwidth possible. Uploads to the GPU are somewhat faster than downloads from the GPU. The low realized bandwidth is due to poor utilization of the PCI Express bus by the GPU system, as halving the effective bandwidth of the PCI Express connection does not significantly affect off-GPU bandwidth.

As expected, the higher-end ATi x1900XTX and NVIDIA Quadro FX 4500 are significantly faster than the mainstream ATi x1600 and NVIDIA 7300GT when all of the computation stays on the GPU. Once slow off-GPU transfers enter the performance picture, however, the gap narrows significantly, and even disappears in some instances.

Each GPU has its own performance-optimum texture size, with the NVIDIA GPUs generally preferring larger textures (256x256 for the 7300GT and 512x512 for the Quadro FX 4500) than the ATi GPUs (128x128 for the x1600 and 256x256 for the x1900XTX). Using the vec4 data type instead of the float data type provides a significant performance improvement of up to 200%, as sketched below.
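For illustration, the hypothetical helper below uploads n input floats as n/4 RGBA texels (assuming side*side == n/4 and the ARB_texture_float extension, which the GPUs studied here expose), so each fragment shader invocation computes a full vec4 of results at once.

#include <GL/glew.h>

/* A minimal sketch of the vec4 data layout: four consecutive inputs
 * occupy the R, G, B, and A channels of one floating-point texel rather
 * than four single-channel texels. */
static GLuint upload_packed(const float *data, int side)
{
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    /* Four floats per texel: one vec4 per fragment shader invocation. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, side, side, 0,
                 GL_RGBA, GL_FLOAT, data);
    return tex;
}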

Running the math libraries on the GPU slightly increases overall system power consumption. Power consumption was not noticeably affected by uploading or downloading data to or from the GPU. As a result, the GPU-implemented math libraries only showed an energy consumption advantage when the math library ran faster on the GPU than on the CPU.

6.2 Current Suitability of a GPGPU Math Library

A GPU-based math library is most suitable from a performance perspective whenever the GPU math function can be integrated into other programs running on the GPU. Doing so eliminates the need for slow transfers of data to and from the CPU. The math functions best suited for GPU execution are the ones that involve the most computation.

When a program that uses a math function requires extremely high levels of accuracy or double precision arithmetic, a GPU-based math library is unsuitable. Similarly, a GPU-based math library is not useful when it is unacceptable that different GPUs produce different mathematical results.

It also remains difficult to program the GPU for general purpose use, as the OpenGL API used to program the GPU is not designed to facilitate GPGPU programming. Programmers must still deal with graphics programming constructs, and the programming environment imposes considerable overhead on the GPGPU programmer. First, the OpenGL Utility Toolkit (GLUT) requires that a window be opened before the GPU can be used. Second, there is no way to pre-compile a GLSL shader program: every time the program starts, the GLSL source must be loaded as a text file, compiled, and linked before it can run on the GPU; the sketch below illustrates this per-launch overhead.
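A minimal sketch of that per-launch work, written against the OpenGL 2.x API; the helper name is illustrative, a GL context (for example, a GLUT window) must already exist, and the GLSL source must already have been read from a text file by the caller.

#include <GL/glew.h>

/* With no precompiled shader format available, the GLSL source must be
 * compiled and linked at every program start before any GPU computation
 * can begin.  Error checking via glGetShaderiv/glGetProgramiv is omitted
 * for brevity. */
GLuint build_shader_program(const GLchar *glsl_source)
{
    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 1, &glsl_source, NULL); /* attach source text  */
    glCompileShader(shader);                       /* compiled every run  */

    GLuint program = glCreateProgram();
    glAttachShader(program, shader);
    glLinkProgram(program);                        /* linked every run    */
    return program;
}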

6.3 Improving GPU Accuracy

There are several possible ways to improve the accuracy of the GPU's results. Langou et al. [38] describe a technique for computing a double precision result using mostly single precision arithmetic augmented with a final double precision Newton-Raphson refinement pass. While the authors used the technique to achieve accurate double precision results, it can also be used to improve the accuracy of single precision math functions. A GPU-based implementation of this technique would execute the single precision computations on the GPU and then transfer the intermediate results to the CPU for the double precision refinement step. Da Graca and Defour [30] present another technique for improving accuracy that uses two single precision values to emulate a higher precision number.
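The division of labor can be illustrated with reciprocal square root as a stand-in example; the sketch below is illustrative C, not the code of [38]. The initial approximation is computed in single precision, which is the GPU's role in the proposed scheme, and one double precision Newton-Raphson step refines it, which is the CPU's role.

#include <math.h>

/* Single precision approximation refined by one double precision
 * Newton-Raphson step: y1 = y0 * (1.5 - 0.5 * x * y0 * y0) converges
 * toward 1/sqrt(x). */
double refined_rsqrt(float x)
{
    float y0 = 1.0f / sqrtf(x);   /* single precision (GPU-side) */

    double y  = (double)y0;       /* refinement (CPU-side)       */
    double xd = (double)x;
    return y * (1.5 - 0.5 * xd * y * y);
}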

Implementing interval arithmetic is another possible way to mitigate the accuracy problems. Rather than producing a single answer, interval arithmetic defines every operation on a range of values that is guaranteed to contain the correct result. Providing interval arithmetic within the GPU math libraries would help users of the math functions by quantifying how inaccurate the actual result is: if the result is unacceptably inaccurate (e.g. the interval is too large), the calculation can be retried using a more accurate CPU-based math library. The extra computation involved in interval arithmetic will certainly hurt performance; however, packing the two interval bounds into a single pixel value may minimize the performance impact.
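A minimal C sketch of the idea follows, using nextafterf() to widen the bounds as a portable stand-in for directed rounding; the type and helpers are illustrative, not part of the thesis' library.

#include <math.h>

/* Each result is a [lo, hi] range that contains the true answer.  On
 * the GPU, the two bounds could be packed into one pixel, e.g. the red
 * and green channels. */
typedef struct { float lo, hi; } interval;

static interval interval_add(interval a, interval b)
{
    interval r;
    r.lo = nextafterf(a.lo + b.lo, -INFINITY);  /* round lower bound down */
    r.hi = nextafterf(a.hi + b.hi,  INFINITY);  /* round upper bound up   */
    return r;
}

/* The width of the interval quantifies the accumulated error: if it
 * grows too large, the caller can retry the computation on the CPU. */
static float interval_width(interval x) { return x.hi - x.lo; }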

6.4 Future Suitability of a GPGPU-based Math Library

Newer GPUs will most likely improve the accuracy of GPU-based math libraries. The floating-point accuracy of basic operations on the GPU hardware will improve, which will lead to a corresponding improvement in the libraries that use these operations. It will also be possible to employ the double precision support emerging in some future GPUs to reduce the amount of rounding error. Another helpful accuracy-related feature in upcoming GPUs is the availability of close-to-the-metal programming paradigms. Low-level access to the GPU improves accuracy by keeping graphics-optimized compilers out of the mix: once a given math library has been written and tested, its results will not change with new GPU driver releases, nor does the programmer have to worry about the compiler making unsafe math optimizations.

Future GPUs will also improve performance in several ways. First, the continued advancement of Moore's Law will allow further performance increases through larger transistor budgets; since GPUs are massively parallel devices, they can spend nearly all of their transistors on computation units. Upcoming GPUs with unified shaders will also improve performance by making it easier to utilize all of the computational resources on the GPU. Faster CPU-GPU interconnects will further improve overall GPGPU performance. Finally, directly exposing the GPU's memory hierarchy, along with other lower-level aspects of the GPU, to the GPGPU programmer will improve performance by allowing the programmer to optimize memory accesses directly instead of second-guessing a graphics-oriented API.

An important feature in upcoming Direct3D 10-compliant GPUs [17] is support for integer arithmetic and bitwise operations. These operations will improve the implementation of GPU-based math libraries in several ways. First, functions that need to directly manipulate the exponent and/or mantissa will be able to do so with fewer operations, improving both their accuracy and their performance. Support for these operations will also improve the accuracy and performance of the argument reduction steps used in sinf and cosf. Bitwise arithmetic also opens the door to handling denormal numbers on the GPU: while it is unlikely that future GPUs will handle denormal numbers in hardware, it will be possible to normalize denormal inputs, perform the desired operation, and subsequently re-denormalize the results. Similarly, bitwise arithmetic support allows GPU programmers to work around other incorrect and/or incomplete floating-point support, such as -0 handling and NaN payload support. The sketch below shows the kind of bit-level manipulation this enables.
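As an illustration, the hypothetical C helper below splits an IEEE-754 single precision value into its sign, exponent, and mantissa fields with a handful of integer operations; the same decomposition underlies both fast argument reduction and software denormal handling.

#include <stdint.h>
#include <string.h>

/* Extract the bit fields of a single precision float.  With integer and
 * bitwise support on the GPU, the equivalent shader code becomes a few
 * cheap instructions instead of a chain of floating-point workarounds. */
static void split_float(float f, uint32_t *sign, int32_t *exponent,
                        uint32_t *mantissa)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                    /* reinterpret as bits */
    *sign     = bits >> 31;
    *exponent = (int32_t)((bits >> 23) & 0xFF) - 127;  /* remove IEEE bias    */
    *mantissa = bits & 0x7FFFFF;
    /* exponent == -127 with a nonzero mantissa marks a denormal that
     * could be normalized, operated on, and re-denormalized in software. */
}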

Systems that have the GPU tightly integrated with the CPU will make a GPGPU-based math library more useful to a variety of applications by reducing the latency of CPU-GPU communication. Tight integration may even allow a GPU-based math library to serve as a drop-in replacement for a standard high performance math library, as the performance overhead of going to the GPU will likely be greatly reduced. A CPU-integrated GPU, however, risks lower performance than a discrete GPU due to having to share die area and memory bandwidth with the CPU. Additionally, integrating the CPU and GPU will likely improve the power consumption characteristics of a GPU-based math library. Power savings will come from the shorter, simpler interconnect: instead of having to traverse the CPU's front-side bus, the chipset's PCI-E controller, and the PCI-E links, CPU-GPU communications will entail a single hop or less.

Bibliography

[1] AltiVec Technology Programming Interface Manual, June 1999. Order number ALTIVECPIM/D 6/1999 Rev. 0.

[2] Apple texture range, February 2002. Web site. URL: http://download.nvidia.com/developer/presentations/2006/gdc/2006-GDC-OpenGL-tutorial-GPGPU-2.pdf.

[3] Fast texture downloads and readbacks using pixel buffer objects in OpenGL, August 2005. See URL: http://download.nvidia.com/developer/Papers/2005/Fast_Texture_Transfers/Fast_Texture_Transfers.pdf.

[4] Hardware - vector libraries, 2005. Web site. URL: http://developer.apple.com/hardwaredrivers/ve/vector_libraries.html.

[5] HLSL shader reference, 2005. Web site. URL: http://developer.apple.com/hardwaredrivers/ve/vector_libraries.html.

[6] Intel Core 2 Duo desktop processor, 2006. Web site: http://www.intel.com/products/processor/core2duo/prod_brief.pdf.

[7] NVIDIA 8800 GPU architecture overview, November 2006. Available at URL: http://www.nvidia.com/object/IO_37100.html.

[8] OpenGL programming guide for Mac OS X: using extensions to optimize, December 2006. Web site. URL: http://developer.apple.com/documentation/GraphicsImaging/Conceptual/OpenGL-MacProgGuide/opengl_texturedata/chapter_10_section_2.html.

[9] The PeakStream platform: high productivity software development for multi-core processors, 2006. White paper. URL: http://www.peakstreaminc.com/reference/peakstream_platform_technote.pdf.

[10] ATi HD 2000 graphics family, 2007. Web site. URL: http://ati.amd.com/products/hdseries.html.

[11] ClearSpeed whitepaper: CSX processor architecture, February 2007. Web site. URL: http://www.clearspeed.com/docs/resources/ClearSpeed_Architecture_Whitepaper_Feb07v2.pdf.

[12] NVIDIA CUDA Compute Unified Device Architecture programming guide, February 2007. Web site. URL: http://developer.download.nvidia.com/compute/cuda/0_8/NVIDIA_CUDA_Programming_Guide_0.8.pdf.

[13] NVIDIA CUDA SDK release notes, April 2007. Web site. URL: http://developer.download.nvidia.com/compute/cuda/0_8/NVIDIA_CUDA_SDK_releasenotes_readme_win32_linux.zip.

[14] Overview, 2007. Web site: http://www.sun.com/processors/UltraSPARC-T1/features.xml.

[15] AGEIA. Web site: http://www.ageia.com/physx/.

[16] Apple. libm, 2007. Web site. URL: http://www.opensource.apple.com/projects/darwin/6.0/source/other/Libm-40.2.tar.gz.

[17] David Blythe. The Direct3D 10 system. In SIGGRAPH '06: ACM SIGGRAPH 2006 Papers, pages 724–734, New York, NY, USA, 2006. ACM Press.

[18] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: stream computing on graphics hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, pages 777–786, New York, NY, USA, 2004. ACM Press.

[19] R-Stream Streaming Compiler. Web site: http://www.reservoir.com/r-stream.php.

[20] Jerome Toby Coonen. Contributions to a proposed standard for binary floating-point arithmetic. Thesis (Ph.D.), University of California, Berkeley, Berkeley, CA, USA, 1986.

[21] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, Jung-Ho Ahn, Jayanth Gummaraju, Mattan Erez, Nuwan Jayasena, Ian Buck, Timothy J. Knight, and Ujval J. Kapasi. Merrimac: Supercomputing with streams. In SC ’03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35, Washington, DC, USA, 2003. IEEE Computer Society.

[22] Charlie Demerjian. Meet Larrabee, Intel's answer to a GPU. The Inquirer, February 2007.

[23] Kurt Akeley et al. EXT framebuffer object, April 2006. Web site. URL: http://oss.sgi.com/projects/ogl-sample/registry/EXT/framebuffer_object.txt.

[24] Randima Fernando and Mark J. Kilgard. The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics. Addison-Wesley Professional, 75 Arlington Street, Suite 300, Boston, MA 02116, USA, February 2003.

[25] Ondřej Fialka and Martin Čadík. FFT and convolution performance in image filtering on GPU. In IV '06: Proceedings of the conference on Information Visualization, pages 609–614, Washington, DC, USA, 2006. IEEE Computer Society.

[26] S. Fuller. Motorola's AltiVec technology. Technical Report ALTIVECWP/D, Motorola, http://www.mot.com/SPS/PowerPC/AltiVec/facts.html, 1998.

[27] Simcha Gochman, Avi Mendelson, Alon Naveh, and Efraim Rotem. Introduction to Intel Core Duo processor architecture. Intel Technology Journal, 10(2):89–97, May 2006.

[28] Naga Govindaraju, Jim Gray, Ritesh Kumar, and Dinesh Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 325–336, New York, NY, USA, 2006. ACM Press.

[29] Naga K. Govindaraju, Nikunj Raghuvanshi, and Dinesh Manocha. Fast and approximate stream mining of quantiles and frequencies using graphics processors. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 611–622, New York, NY, USA, 2005. ACM Press.

[30] G. Da Graca and D. Defour. Implementation of float-float operators on graphics hardware. In RNC7, pages 23–32, July 2006.

[31] Wolfgang Gruener. AMD's 'Fusion' to merge CPU and GPU, October 2006.

[32] Michael M. Heck. 3D visualization for oil and gas evolves. HPCWire, April 2007.

[33] Michael M. Heck. High performance modelling of derivative prices using the PeakStream platform, September 2007. Web site. URL: http://www.peakstreaminc.com/reference/peakstream_finance_technote.pdf.

[34] IEEE Task P754. ANSI/IEEE 754-1985, Standard for Binary Floating-Point Arithmetic. IEEE, New York, August 12 1985. A preliminary draft was published in the January 1980 issue of IEEE Computer, together with several companion articles. Available from the IEEE Service Center, Piscataway, NJ, USA.

[35] M. Ikits and M. Magallon. The OpenGL Extension Wrangler Library. Web site: http://glew.sourceforge.net/.

[36] Ujval J. Kapasi, William J. Dally, Scott Rixner, John D. Owens, and Brucek Khailany. The Imagine stream processor. In Proceedings of the IEEE International Conference on Computer Design, pages 282–288, September 2002.

[37] Kill A Watt. Web site: http://www.p3international.com/products/special/P4400/P4400-HG.html.

[38] Jakub Kurzak and Jack Dongarra. Implementation of the mixed-precision high performance LINPACK benchmark on the CELL Processor. LAPACK Working Note 177, September 2006. Also available as UT-CS-06-580.

[39] L. Latta. Building a million particle system, 2004. Web site. http://citeseer.ist.psu.edu/latta04building.html.

[40] Calle Lejdfors and Lennart Ohlsson. Implementing an embedded GPU language by combining translation and generation. In SAC '06: Proceedings of the 2006 ACM symposium on Applied computing, pages 1610–1614, New York, NY, USA, 2006. ACM Press.

[41] David Luebke and Greg Humphreys. How GPUs work. Computer, 40(2):96–100, 2007.

[42] Intel Architecture Software Developer's Manual. Web site: http://citeseer.ist.psu.edu/646273.html.

[43] Mark J. Kilgard. The OpenGL Utility Toolkit (GLUT) programming interface: API version 3, 1996.

[44] Peter Markstein. Software division and square root using Goldschmidt's algorithms. In Proceedings of the 6th Conference on Real Numbers and Computers, November 2004.

[45] M. McCool, Z. Qin, and T. Popa. Shader metaprogramming, 2002.

[46] Rick Merritt. PCI Express extensions take aim at AMD. EETimes, September 2006.

[47] Kenneth Moreland and Edward Angel. The FFT on a GPU. In HWWS '03: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware, pages 112–119, Aire-la-Ville, Switzerland, 2003. Eurographics Association.

[48] S. L. Moshier. Cephes math library, 2000. Web site. URL: http://www.moshier.net.

[49] Jackie Neider and Tom Davis. OpenGL Programming Guide: The Official Guide to Learning OpenGL, Release 1. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1993.

[50] Mark Peercy, Mark Segal, and Derek Gerstmann. A performance-oriented data parallel virtual machine for GPUs. In SIGGRAPH '06: ACM SIGGRAPH 2006 Sketches, page 184, New York, NY, USA, 2006. ACM Press.

[51] Stefan Popov, Johannes Günther, Hans-Peter Seidel, and Philipp Slusallek. Stackless kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum, 26(3), September 2007.

[52] RapidMind. Technology overview, 2007. Web site. URL: http://rapidmind.net/technology.php.

[53] Al Riske. The multicore advantage. Web site: http://research.sun.com/minds/2005-0902/, 2005.

[54] Randi J. Rost. The OpenGL Shading Language. Addison-Wesley Professional, Boston, MA, USA, 2006.

[55] Sander Sassen. ATi Radeon X800 XT, the new king of the hill? Hardware Analysis, May 2004.

[56] Carlos Eduardo Scheidegger, Joao Luiz Dihl Comba, and Rudnei Dias da Cunha. Navier-Stokes on programmable graphics hardware using SMAC. In SIBGRAPI '04: Proceedings of the XVII Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI'04), pages 300–307, Washington, DC, USA, 2004. IEEE Computer Society.

[57] Geoff Stahl. Apple client storage, August 2002. Web site. URL: http://oss.sgi.com/projects/ogl-sample/registry/APPLE/client_storage.txt.

[58] Jon Stokes. Introducing the IBM/Sony/Toshiba Cell processor – part II: the Cell architecture. Ars Technica, February 2005.

[59] Jon Stokes. Introducing the IBM/Sony/Toshiba Cell processor, part I: the processing units. Ars Technica, February 2005.

[60] Jon Stokes. Intel drops a Nehalem bomb on AMD's Fusion: integrated graphics, on-die memory controller, SMT. Ars Technica, March 2007.

[61] David Tarditi, Sidd Puri, and Jose Oglesby. Accelerator: using data parallelism to program GPUs for general-purpose uses. SIGPLAN Not., 41(11):325–335, 2006.

[62] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: a language for streaming applications. In Proceedings of the 12th International Conference on Compiler Construction, 2002.

[63] Torrenza. Web site: http://enterprise.amd.com/us-en/AMD-Business/Technology-Home/Torrenza.aspx.

[64] Tim Tscheblockov, Alexey Stepin, and Anton Shilov. NVIDIA GeForce 6800 Ultra and GeForce 6800: NV40 enters the scene. X-bit labs, April 2004.

[65] Steve Upstill. The RenderMan Companion: A Programmer’s Guide to Realistic Computer Graphics. Addison-Wesley, Reading, MA, 1990.

[66] Scott Wasson. NVIDIA's GeForce 7800 GTX graphics processor. The Tech Report, June 2005.

[67] Ye Zhao, Zhe Fan, Wei Li, Arie Kaufman, and Suzanne Yoakum-Stover. Lattice-based flow simulation on GPU. In Proceedings of ACM Workshop on General-Purpose Computing on Graphics Processors, 2004.

Appendix A

Cephes Math Library Code for sinf

Provided below is the source code for the Cephes Math Library version of sinf.

The original headers are preserved; note, however, that the original source file also contained code and header information for cosf, which was removed before the code was placed in this appendix.

/* sinf.c
 *
 * Circular sine
 *
 *
 *
 * SYNOPSIS:
 *
 * float x, y, sinf();
 *
 * y = sinf( x );
 *
 *
 *
 * DESCRIPTION:
 *
 * Range reduction is into intervals of pi/4. The reduction
 * error is nearly eliminated by contriving an extended precision
 * modular arithmetic.
 *
 * Two polynomial approximating functions are employed.
 * Between 0 and pi/4 the sine is approximated by
 *     x + x**3 P(x**2).
 * Between pi/4 and pi/2 the cosine is represented as
 *     1 - x**2 Q(x**2).
 *
 *
 * ACCURACY:
 *
 *                    Relative error:
 * arithmetic   domain        # trials   peak     rms
 *    IEEE   -4096,+4096      100,000    1.2e-7   3.0e-8
 *    IEEE   -8192,+8192      100,000    3.0e-7   3.0e-8
 *
 * ERROR MESSAGES:
 *
 *   message          condition      value returned
 * sin total loss     x > 2^24            0.0
 *
 * Partial loss of accuracy begins to occur at x = 2^13
 * = 8192. Results may be meaningless for x >= 2^24.
 * The routine as implemented flags a TLOSS error
 * for x >= 2^24 and returns 0.0.
 */

/*
Cephes Math Library Release 2.2: June, 1992
Copyright 1985, 1987, 1988, 1992 by Stephen L. Moshier
Direct inquiries to 30 Frost Street, Cambridge, MA 02140
*/

/* Single precision circular sine
 * test interval: [-pi/4, +pi/4]
 * trials: 10000
 * peak relative error: 6.8e-8
 * rms relative error: 2.6e-8
 */

#include "mconf.h"

static float FOPI = 1.27323954473516;

extern float PIO4F;

/* Note, these constants are for a 32-bit significand: */
/*
static float DP1 = 0.7853851318359375;
static float DP2 = 1.30315311253070831298828125e-5;
static float DP3 = 3.03855025325309630e-11;
static float lossth = 65536.;
*/

/* These are for a 24-bit significand: */
static float DP1 = 0.78515625;
static float DP2 = 2.4187564849853515625e-4;
static float DP3 = 3.77489497744594108e-8;
static float lossth = 8192.;
static float T24M1 = 16777215.;

static float sincof[] = {
    -1.9515295891E-4,
     8.3321608736E-3,
    -1.6666654611E-1
};

static float coscof[] = {
     2.443315711809948E-005,
    -1.388731625493765E-003,
     4.166664568298827E-002
};

#ifdef ANSIC
float sinf( float xx )
#else
float sinf(xx)
double xx;
#endif
{
float *p;
float x, y, z;
register unsigned long j;
register int sign;

sign = 1;
x = xx;
if( xx < 0 )
    {
    sign = -1;
    x = -xx;
    }
if( x > T24M1 )
    {
    mtherr( "sinf", TLOSS );
    return(0.0);
    }
j = FOPI * x; /* integer part of x/(PI/4) */
y = j;
/* map zeros to origin */
if( j & 1 )
    {
    j += 1;
    y += 1.0;
    }
j &= 7; /* octant modulo 360 degrees */
/* reflect in x axis */
if( j > 3)
    {
    sign = -sign;
    j -= 4;
    }

if( x > lossth )
    {
    mtherr( "sinf", PLOSS );
    x = x - y * PIO4F;
    }
else
    {
    /* Extended precision modular arithmetic */
    x = ((x - y * DP1) - y * DP2) - y * DP3;
    }
/*einits();*/
z = x * x;
if( (j==1) || (j==2) )
    {
    /* measured relative error in +/- pi/4 is 7.8e-8 */
    /*
    y = (( 2.443315711809948E-005 * z
      - 1.388731625493765E-003) * z
      + 4.166664568298827E-002) * z * z;
    */
    p = coscof;
    y = *p++;
    y = y * z + *p++;
    y = y * z + *p++;
    y *= z * z;
    y -= 0.5 * z;
    y += 1.0;
    }
else
    {
    /* Theoretical relative error = 3.8e-9 in [-pi/4, +pi/4] */
    /*
    y = ((-1.9515295891E-4 * z
         + 8.3321608736E-3) * z
         - 1.6666654611E-1) * z * x;
    y += x;
    */
    p = sincof;
    y = *p++;
    y = y * z + *p++;
    y = y * z + *p++;
    y *= z * x;
    y += x;
    }
/*einitd();*/
if(sign < 0)
    y = -y;
return( y);
}

Appendix B

Detailed Accuracy Results

Provided in this appendix are the accuracy results for the different accuracy tests performed. Note that all comparisons are made against the results of vForce.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
sqrt    63        17.5     0.00        0.00    50.8    1.59     30.2
sin     821       33.7     4.87        1.95    0.00    0.244    59.2
cos     791       35.5     0.00        0.00    0.00    0.00     64.5
tan     1152      3.47     11.3        0.00    0.00    0.694    84.5
asin    96        12.5     10.4        0.00    49.0    2.08     26.0
acos    96        25.0     0.00        0.00    49.0    0.00     26.0
atan    96        12.5     10.4        0.00    0.00    2.08     75.0
log     37        59.5     0.00        0.00    18.9    0.00     21.6
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        20.5     0.00        18.2    40.1    0.00     20.5
exp2    28        100      0.00        0.00    0.00    0.00     0.00

Table B.1: Accuracy of built-in GLSL functions, NVIDIA 7300GT.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
sqrt    63        22.2     1.59        0.00    50.8    1.59     23.8
sin     821       26.9     5.60        1.95    0.00    0.244    65.3
cos     791       28.4     1.14        0.00    0.00    0.00     70.4
tan     1152      2.69     1.91        1.30    0.00    0.694    93.4
asin    96        0.00     10.4        0.00    59.4    0.00     30.2
acos    96        0.00     0.00        0.00    59.4    0.00     40.6
atan    96        2.08     10.4        0.00    10.4    2.08     75.0
log     37        13.5     0.00        5.40    56.8    0.00     24.3
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        18.2     0.00        18.2    40.1    0.00     22.7
exp2    28        100      0.00        0.00    0.00    0.00     0.00

Table B.2: Accuracy of built-in GLSL functions, ATi x1600.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
sqrt    63        17.5     0.00        0.00    50.8    1.59     30.2
sin     821       33.7     4.87        1.95    0.00    0.243    59.2
cos     791       35.5     0.00        0.00    0.00    0.00     64.5
tan     1152      3.47     11.28       0.00    0.00    0.694    84.5
asin    96        12.5     10.4        0.00    49.0    2.08     26.0
acos    96        25.0     0.00        0.00    49.0    0.00     26.0
atan    96        12.5     10.4        0.00    0.00    2.08     75.0
log     37        59.5     0.00        0.00    18.9    0.00     21.6
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        20.4     0.00        18.2    40.9    0.00     20.5
exp2    28        35.7     7.14        0.00    57.1    0.00     0.00

Table B.3: Accuracy of built-in GLSL functions, NVIDIA Quadro FX 4500.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
div     345       53.0     8.70        8.70    14.2    9.86     5.51
sqrt    63        76.2     0.00        0.00    22.2    1.59     0.00
sin     821       30.5     3.65        1.95    0.00    0.243    63.7
cos     791       34.9     0.00        0.00    0.00    0.00     65.1
tan     1152      2.69     1.91        1.30    0.00    0.694    93.4
asin    96        76.0     0.00        0.00    0.00    2.08     24.0
acos    96        88.5     0.00        0.00    0.00    0.00     11.5
atan    96        55.2     10.4        0.00    10.4    2.08     21.9
sinh    48        16.7     43.8        14.6    0.00    0.00     50.0
cosh    30        96.7     0.00        0.00    3.33    0.00     0.00
tanh    36        36.1     58.3        0.00    0.00    2.78     2.78
asinh   34        2.94     61.8        0.00    8.82    2.94     23.5
acosh   20        5.00     5.00        0.00    90.0    0.00     0.00
atanh   48        52.1     43.8        2.08    0.00    2.08     0.00
log     37        67.6     0.00        0.00    21.6    0.00     10.8
exp     28        92.8     0.00        0.00    7.14    0.00     0.00
log2    44        68.1     0.00        0.00    31.8    0.00     0.00
expm1   52        57.7     40.4        0.00    0.00    1.92     0.00
log1p   86        51.2     45.3        0.00    1.16    1.16     1.16

Table B.4: Accuracy of ported vForce functions, ATi x1600.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
div     345       55.1     9.27        5.22    15.9    9.85     4.64
sqrt    63        90.5     0.00        0.00    3.17    1.59     4.76
sin     821       37.8     5.60        0.00    0.00    0.00     56.6
cos     791       40.2     0.00        0.00    0.00    0.00     59.8
tan     1152      35.2     0.868       0.00    0.00    0.00     64.0
asin    96        86.5     0.00        0.00    0.00    0.00     13.5
acos    96        88.5     0.00        0.00    0.00    0.00     11.5
atan    96        53.1     10.4        0.00    0.00    2.08     34.4
sinh    48        16.7     43.8        14.6    0.00    0.00     25.0
cosh    30        100      0.00        0.00    0.00    0.00     0.00
log     37        97.3     0.00        0.00    0.00    0.00     2.70
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        81.8     0.00        18.2    0.00    0.00     0.00
expm1   52        100      0.00        0.00    0.00    0.00     0.00
log1p   86        98.8     0.00        0.00    0.00    0.00     1.16

Table B.5: Accuracy of ported vForce functions, NVIDIA Quadro FX 4500.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
div     345       55.1     9.27        5.22    15.9    9.86     4.64
sqrt    63        93.7     0.00        0.00    0.00    1.59     4.76
sin     821       37.8     5.60        0.00    0.00    0.00     56.6
cos     791       40.2     0.00        0.00    0.00    0.00     59.8
tan     1152      3.47     11.3        0.00    0.00    0.694    84.5
asin    96        86.5     0.00        0.00    0.00    0.00     13.5
acos    96        88.5     0.00        0.00    0.00    0.00     11.5
atan    96        53.1     10.4        0.00    0.00    2.08     34.4
sinh    48        16.7     43.8        14.6    0.00    0.00     25.0
cosh    30        100      0.00        0.00    0.00    0.00     0.00
tanh    36        36.1     58.3        0.00    0.00    2.78     2.78
asinh   34        14.7     61.8        0.00    0.00    0.00     23.5
acosh   20        5.00     5.00        0.00    90.0    0.00     0.00
atanh   48        97.9     0.00        2.08    0.00    0.00     0.00
log     37        97.3     0.00        0.00    0.00    0.00     2.70
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        81.8     0.00        18.2    0.00    0.00     0.00
expm1   52        100      0.00        0.00    0.00    0.00     0.00
log1p   86        98.8     0.00        0.00    0.00    0.00     1.16

Table B.6: Accuracy of ported vForce functions, NVIDIA 7300GT.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
sqrt    63        22.2     1.59        0.00    50.79   1.59     23.8
sin     821       26.9     5.60        1.95    0.00    0.244    65.3
cos     791       28.4     1.14        0.00    0.00    0.00     70.4
tan     1152      2.69     1.91        1.30    0.00    0.69     93.4
asin    96        0.00     10.4        0.00    59.4    0.00     30.2
acos    96        0.00     0.00        0.00    59.4    0.00     40.6
atan    96        2.08     10.4        0.00    10.4    2.08     75.0
log     37        13.5     0.00        5.40    56.8    0.00     24.3
exp     28        100      0.00        0.00    0.00    0.00     0.00
log2    44        18.2     0.00        18.2    40.9    0.00     22.7
exp2    28        35.7     7.14        0.00    57.1    0.00     0.00

Table B.7: Accuracy of built-in GLSL functions, ATi x1900XTX.

Fn      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
div     345       53.0     8.70        8.70    14.2    9.86     5.51
sqrt    63        73.0     0.00        0.00    25.4    1.59     0.00
sin     821       30.5     3.65        1.95    0.00    0.243    63.7
cos     791       33.6     0.00        0.00    0.00    0.00     66.4
tan     1152      20.4     0.868       0.00    0.00    0.694    78.0
asin    96        74.0     10.4        0.00    0.00    2.08     13.5
acos    96        88.5     0.00        0.00    0.00    0.00     11.5
atan    96        55.2     10.4        0.00    10.4    2.08     21.9
sinh    48        16.7     43.8        14.6    0.00    0.00     25.0
cosh    30        96.7     0.00        0.00    3.33    0.00     0.00
log     37        67.6     0.00        0.00    21.6    0.00     10.8
exp     28        92.8     0.00        0.00    7.14    0.00     0.00
log2    44        68.2     0.00        0.00    31.8    0.00     0.00
expm1   52        57.69    40.38       0.00    0.00    1.92     0.00
log1p   86        51.2     45.3        0.00    1.16    1.16     1.16

Table B.8: Accuracy of ported vForce functions, ATi x1900XTX.

Op      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
+       267       79.8     16.5        0.749   0.00    0.374    2.62
−       267       79.8     16.5        0.749   0.00    0.374    2.62
∗       311       60.8     22.2        0.00    5.14    8.36     3.54
/       345       56.2     11.0        5.51    5.22    11.0     11.0
<       400       77.0     N/A         N/A     N/A     N/A      N/A
≤       400       96.0     N/A         N/A     N/A     N/A      N/A
>       400       77.0     N/A         N/A     N/A     N/A      N/A
≥       400       96.0     N/A         N/A     N/A     N/A      N/A
==      400       92.5     N/A         N/A     N/A     N/A      N/A
!=      400       92.5     N/A         N/A     N/A     N/A      N/A

Table B.9: Accuracy of basic operators, ATi x1900XTX.

Op      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
+       267       79.8     16.5        0.749   0.00    0.375    2.62
−       267       79.8     16.5        0.749   0.00    0.375    2.62
∗       311       66.2     19.6        0.00    3.86    8.36     1.93
/       345       62.3     7.25        5.22    6.67    9.86     8.70
<       400       100      N/A         N/A     N/A     N/A      N/A
≤       400       100      N/A         N/A     N/A     N/A      N/A
>       400       100      N/A         N/A     N/A     N/A      N/A
≥       400       100      N/A         N/A     N/A     N/A      N/A
==      400       100      N/A         N/A     N/A     N/A      N/A
!=      400       100      N/A         N/A     N/A     N/A      N/A

Table B.10: Accuracy of basic operators, NVIDIA 7300GT.

Op      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
+       267       79.8     16.5        0.749   0.00    0.374    2.62
−       267       79.8     16.5        0.749   0.00    0.375    2.62
∗       311       66.2     19.6        0.00    3.86    8.36     1.93
/       345       62.3     7.25        5.22    6.67    9.86     8.70
<       400       100      N/A         N/A     N/A     N/A      N/A
≤       400       100      N/A         N/A     N/A     N/A      N/A
>       400       100      N/A         N/A     N/A     N/A      N/A
≥       400       100      N/A         N/A     N/A     N/A      N/A
==      400       100      N/A         N/A     N/A     N/A      N/A
!=      400       100      N/A         N/A     N/A     N/A      N/A

Table B.11: Accuracy of basic operators, NVIDIA Quadro FX 4500.

Op      # Tests   Pass %   Subnorm %   Inf %   NaN %   +/-0 %   Norm %
+       267       79.8     16.5        0.749   0.00    0.375    2.62
−       267       79.8     16.5        0.749   0.00    0.375    2.62
∗       311       60.8     22.2        0.00    3.86    8.36     3.54
/       345       56.2     11.0        5.22    6.67    9.86     11.0
<       400       77.0     N/A         N/A     N/A     N/A      N/A
≤       400       96.0     N/A         N/A     N/A     N/A      N/A
>       400       77.0     N/A         N/A     N/A     N/A      N/A
≥       400       96.0     N/A         N/A     N/A     N/A      N/A
==      400       92.5     N/A         N/A     N/A     N/A      N/A
!=      400       92.5     N/A         N/A     N/A     N/A      N/A

Table B.12: Accuracy of basic operators, ATi x1600.