
Extended-Precision Floating-Point Numbers for GPU Computation

Andrew Thall, Alma College

Abstract— Double-float (df64) and quad-float (qf128) numeric types can be implemented on current GPU hardware and used efficiently and effectively for extended-precision computational arithmetic. Using unevaluated sums of paired or quadrupled f32 single-precision values, these numeric types provide approximately 48 and 96 bits of mantissa respectively at single-precision exponent ranges for computer graphics, numerical, and general-purpose GPU programming. This paper surveys current art, presents algorithms and Cg implementations for arithmetic, exponential and trigonometric functions, and presents data on numerical accuracy on several different GPUs. It concludes with an in-depth discussion of the application of extended-precision primitives to performing fast Fourier transforms on the GPU for real and complex data.

[Addendum (July 2009): the presence of IEEE-compliant double-precision hardware in modern GPUs from NVidia and other manufacturers has reduced the need for these techniques. The double-precision capabilities can be accessed using CUDA or other GPGPU software, but are not (as of this writing) exposed in the graphics pipeline for use in Cg-based shader code. Shader writers or those still using a graphics API for their numerical computing may still find the methods described herein to be of interest.]

Index Terms— floating-point computation, extended-precision, graphics processing units, GPGPU, Fast Fourier Transforms, parallel computation

Manuscript date of submission: March 15, 2007. Tech. Rep. CIM-007-01, March 2007, A. Thall.

I. INTRODUCTION

MODERN GPUs have wide data-buses allowing extremely high throughput, effecting a stream-computing model and allowing SIMD/MIMD computation at the fragment (pixel) level. Machine precision on current hardware is limited, however, to 32-bit, nearly IEEE 754 compliant floating-point. This limited precision of fragment-program computation presents a drawback for many general-purpose (GPGPU) applications. Extended-precision techniques, developed previously for CPU computation, adapt well to GPU hardware; as programmable graphics pipelines improve in their IEEE compliance, extending precision becomes increasingly straightforward, and as future generations of GPUs move to hardware support for higher precision, these techniques will remain useful, leveraging the parallelism and sophisticated instruction sets of the hardware (e.g., vector-parallelism, fused multiply-adds, etc.) to provide greater efficiencies for extended precision than has been seen in most CPU implementations.

Higher precision computations are increasingly necessary for numerical and scientific computing (see Bailey [1], Dinechin et al [2]). Techniques for extended and mixed precision computation have been available for general purpose programming through myriad software packages and systems, but there have been only limited attempts to apply these methods for general-purpose graphics-processing-unit (GPGPU) computation, which have been hampered by having at best f32 single-precision floating-point capability. Some success has been achieved in augmenting GPU-based computation with double-precision CPU correction terms: Göddeke et al [3] mix GPU computation with CPU-based double-precision defect correction in Finite-Element simulation, achieving a 2× speedup over tuned CPU-only code while maintaining the same accuracy.

Myer and Sutherland [4] coined the term wheel of reincarnation to describe the evolution of display processor technology, as a never-ending series of hardware innovations are created as add-ons to special-purpose systems. Such esoteric hardware is always in a race-against-time against generic processors; advanced capabilities developed for special-purpose systems are invariably folded back into commodity CPU technology as price-performance breakpoints allow. We are currently at a unique time vis-à-vis the CPU/GPU dichotomy, where the programming model inherent in GPU hardware has allowed the computational power of GPUs to rocket past the relatively static performance of the single-processor CPU over the near past and as projected for the near future. GPU-based stream processing avoids a major pitfall of prior parallel processing systems in being nearly ubiquitous; because of the relative cheapness of GPUs and their use in popular, mass-marketed games and game-platforms, inexpensive systems are present on most commodity computers currently sold. Because of this omnipresence, these processors show programmability—quality and choices of APIs, multiple hardware platforms running the same APIs, stability of drivers—far beyond that of the special-purpose hardware of previous generations.

These trends are expected to continue: when innovative breakthroughs are made in CPU technology, these can be expected to be applied by the GPU manufacturers as well. CPU trends to multi-core systems will provide hardware parallelism of the classic MIMD variety; these can be expected to have the same weaknesses as traditional parallel computers: synchronization, deadlock, platform-specificity, and instability of applications and supporting drivers under changes in hardware, firmware, and OS capabilities. While systems such as OpenMP and OpenPVM provide platform and OS independence, it remains the case that unless it is worth the developer's time, advanced capabilities and algorithms will remain experimental, brittle, platform-specific curiosities. The advantage of the stream-based computational model is its simplicity.

This paper will survey prior and current art in extended-precision computation and in its application to GPUs. It will then describe an implementation of df64 and qf128 arithmetic for current generation GPUs, show data on numerical accuracy for basic operations and elementary functions, and discuss limitations of these techniques for different hardware platforms.

II. BACKGROUND: ARBITRARY PRECISION ARITHMETIC

Techniques for performing extended-precision arithmetic in software using pairs of machine-precision numbers have a long history: Dekker [5] is most often cited on the use of unevaluated sums in extended-precision research prior to the IEEE 754 standard, with Linnainmaa [6] generalizing the work of Dekker to computer-independent implementations dependent on faithful rounding. Such methods are distinct from alternative approaches exemplified by Brent [7], Smith [8], and Bailey [9] that assign special storage to exponent and sign values and store mantissa bits in arbitrary-length integer arrays. A general survey of arbitrary precision methods is Zimmermann [10].

Priest [11] in 1992 did a full study of extended-precision requirements for different radix and rounding constraints. Shewchuk [12] drew basic algorithms from Priest but restricted his techniques and analyses to the IEEE 754 floating-point standard [13], radix-2, and exact rounding; these allowed relaxation of normalization requirements and led to speed improvements. These two provided the theoretical underpinnings for the single-double library of Bailey [14], the doubledouble C++ library of Briggs [15], and the C++ quad-doubles of Hida et al [16], [17]. The use and hazards of double- and quad-precision numerical types are discussed by Li et al [18] in the context of extended- and mixed-precision BLAS libraries.

For methods based on Shewchuk's and Priest's techniques, faithful rounding is crucial for correctness. Requirements in particular of IEEE-754 compliance create problematic difficulties for older graphics display hardware; latest generation GPUs are much more compliant with the IEEE standard.

A. Extended-Precision Computation

The presentation here will follow that of Shewchuk [12]. Given IEEE 754 single-precision values, with exact rounding and round-to-even on ties, for binary operations ∗ ∈ {+, −, ×, /}, the symbols ⊛ ∈ {⊕, ⊖, ⊗, ⊘} represent the floating-point versions with exact rounding, i.e.,

    a ⊛ b ≡ fl(a ∗ b) = a ∗ b + err(a ⊛ b),    (1)

where err(a ⊛ b) is the difference between the correct arithmetic and the floating-point result; exact rounding guarantees that

    |err(a ⊛ b)| ≤ (1/2) ulp(a ⊛ b).    (2)

Extended-precision arithmetic using unevaluated sums of single-precision numbers rests on a number of algorithms (based on theorems by Dekker [5] and Knuth [19]) used for defining arithmetic operations precisely using pairs of non-overlapping results.

Theorem 1 (Dekker [5]): Let a and b be p-bit floating-point numbers such that |a| ≥ |b|. Then the following algorithm will produce a nonoverlapping expansion x + y such that a + b = x + y, where x is an approximation to a + b and y represents the roundoff error in the calculation of x.

Algorithm 1 Fast-Two-Sum
Require: |a| ≥ |b|, p-bit floating point numbers
 1: procedure FAST-TWO-SUM(a, b)
 2:   x ← a ⊕ b
 3:   b_virtual ← x ⊖ a
 4:     ▷ b_virtual is the actual value added to x in line 2
 5:   y ← b ⊖ b_virtual
 6:   return (x, y)
 7:     ▷ x + y = a + b, where y is the roundoff error on x
 8: end procedure

Two floating-point numbers x and y are nonoverlapping if the least-significant non-zero bit of x is more significant than the most-significant non-zero bit of y, or vice-versa. An expansion is non-overlapping if all of its components are mutually non-overlapping. We will also define two numbers x and y as adjacent if they overlap, if x overlaps 2y, or if y overlaps 2x. An expansion is nonadjacent if no two of its components are adjacent.

Theorem 2 (Knuth [19]): Let a and b be p-bit floating-point numbers where p ≥ 3. Then the following algorithm will produce a nonoverlapping expansion x + y such that a + b = x + y, where x is an approximation to a + b and y represents the roundoff error in the calculation of x.

Algorithm 2 Two-Sum
Require: a, b, p-bit floating point numbers, where p ≥ 3
 1: procedure TWO-SUM(a, b)
 2:   x ← a ⊕ b
 3:   b_virtual ← x ⊖ a
 4:   a_virtual ← x ⊖ b_virtual
 5:   b_roundoff ← b ⊖ b_virtual
 6:   a_roundoff ← a ⊖ a_virtual
 7:   y ← a_roundoff ⊕ b_roundoff
 8:   return (x, y)
 9:     ▷ x + y = a + b, where y is the roundoff error on x
10: end procedure

The following corollary forms the heart of multiple-term expansions as numerical representations:

Corollary 3: Let x and y be the values returned by FAST-TWO-SUM or TWO-SUM. On a machine whose arithmetic uses round-to-even tiebreaking, x and y are nonadjacent.

Given these building blocks, Shewchuk describes expansion-sum algorithms for adding arbitrary-length numbers, each represented by sequences of nonadjacent terms, to get another such nonadjacent sequence as the exact sum. For the purpose of double- and quad-precision numbers, the ones of importance regard pairs and quads of nonadjacent terms that form the df64 and qf128 representations.

Multiplication of multiple-term expansions also rests on a few basic theorems, one by Dekker [5] and another that he attributes to G. W. Veltkamp.

Theorem 4 (Dekker [5]): Let a be a p-bit floating-point number where p ≥ 3. Choose a splitting point s such that p/2 ≤ s ≤ p − 1. Then the following algorithm will produce a (p − s)-bit value a_hi and a nonoverlapping (s − 1)-bit value a_lo, such that |a_hi| ≥ |a_lo| and a = a_hi + a_lo.

TABLE I
GPU FLOATING-POINT PARANOIA RESULTS ON NVIDIA ARCHITECTURES. BOUNDS ARE IN ulps FOR s23e8 REPRESENTATION.

Operation        Exact rounding (IEEE-754)   Chopped        NV35 fp30          NV40 & G70 fp40       G80 gp4fp
Addition         [-0.5, 0.5]                 (-1.0, 0.0]    [-1.000, 0.000]    [-1.000, 0.000]       [-0.5, 0.5]
Subtraction      [-0.5, 0.5]                 (-1.0, 0.0)    [-1.000, 1.000]    [-0.75, 0.75]         [-0.5, 0.5]
Multiplication   [-0.5, 0.5]                 (-1.0, 0.0]    [-0.989, 0.125]    [-0.78125, 0.625]     [-0.5, 0.5]
Division         [-0.5, 0.5]                 (-1.0, 0.0]    [-2.869, 0.094]    [-1.19902, 1.37442]   [-1.58222, 1.80325]

Algorithm 3 Split
Require: a, a p-bit floating point number, where p ≥ 3
Require: s, p/2 ≤ s ≤ p − 1
 1: procedure SPLIT(a, s)
 2:   c ← (2^s + 1) ⊗ a
 3:   a_big ← c ⊖ a
 4:   a_hi ← c ⊖ a_big
 5:   a_lo ← a ⊖ a_hi
 6:   return (a_hi, a_lo)
 7: end procedure

Shewchuk emphasizes the strangeness of representing a p-bit number with two numbers that together have only (p − 1) significant bits; this is possible because a_lo is not necessarily of the same sign as a_hi—the result of Alg. 3 is that the sign-bit of a_lo stores the extra bit needed for a p-bit significand. (This is especially important for splitting IEEE double-precision numbers, for which p = 53, and which, though odd, can be split into two ⌊p/2⌋-bit values without loss of precision.)

Using this splitting algorithm, a precise multiplication algorithm can now be created.

Theorem 5 (Veltkamp; see Dekker [5]): Let a and b be p-bit floating-point numbers where p ≥ 6. Then the following algorithm will produce a nonoverlapping expansion x + y such that ab = x + y, where x is an approximation to ab and y represents the roundoff error in the calculation of x. Furthermore, if round-to-even tiebreaking is used, x and y are nonadjacent.

Algorithm 4 Two-Product
Require: a, b, p-bit floating point numbers, where p ≥ 6
 1: procedure TWO-PRODUCT(a, b)
 2:   x ← a ⊗ b
 3:   (a_hi, a_lo) = SPLIT(a, ⌈p/2⌉)
 4:   (b_hi, b_lo) = SPLIT(b, ⌈p/2⌉)
 5:   err1 ← x ⊖ (a_hi ⊗ b_hi)
 6:   err2 ← err1 ⊖ (a_lo ⊗ b_hi)
 7:   err3 ← err2 ⊖ (a_hi ⊗ b_lo)
 8:   y ← (a_lo ⊗ b_lo) ⊖ err3
 9:   return (x, y)
10: end procedure

B. Extended Precision on GPUs

The hardware support for vector operations in fragment programs makes them well-suited for double- and quad-precision computation. Preliminary exploration of the feasibility and error-characteristics of GPU double-floats has been done using the Cg language by Meredith et al [20] and Thall [21], and, using the Brook language, by Da Graça and Defour [22]. The Göddeke article [3] presents a survey that includes these and related experiments in the GPGPU community.

C. Numerical Capabilities and Limitations of GPUs: GPU Paranoia

GPUs offered only marginal IEEE 754 compliance until recently; compliance has become quite good on the G80/NV8800. Rounding modes and other characteristics of computation remain not fully compliant.

Testing accuracy and precision of hardware systems is problematic; for a survey see Cuyt et al [23]. The Paranoia system [24], developed by Kahan in the early 1980s, was the inspiration for GPU Paranoia by Hillesland and Lastra [25], a software system for characterizing the floating-point behavior of GPU hardware. GPU Paranoia provides a test-suite estimating floating-point arithmetic error in the basic functions, as well as characterizing the apparent number of guard digits and the correctness of rounding modes. Recent GPU hardware has shown considerable improvement over earlier generations. For the purposes of extended-precision computation, 32-bit floats are required. For NVidia GPUs, the 6800 series and above were the first to have sufficient precision and IEEE compliance for extended-precision numerical work; the 7800/7900/7950 series has better IEEE compliance, allowing all but the transcendental functions to be computed; for computations involving Taylor series, rather than self-correcting Newton-Raphson iterations, the superior round-off behavior and guard digits of the 8800 series are necessary. Test runs with Hillesland's GPU Paranoia generate the results shown in Table I.

D. FMAD or Just Angry?

Extended-precision arithmetic can be made much more efficient on architectures that implement a fused multiply-add (FMAD or FMA) in hardware. An FMAD instruction allows the computation of a×b+c to machine precision with only a single round-off error. This allows a much simpler twoProd() function to be used in place of Alg. 4:

Theorem 6 (Hida [17], pg. 6): Let a and b be p-bit floating-point numbers where p ≥ 6. Then, on a machine implementing an FMAD instruction, the following algorithm will produce a nonoverlapping expansion x + y such that ab = x + y, where x is an approximation to ab and y represents the roundoff error in the calculation of x. Furthermore, if round-to-even tiebreaking is used, x and y are nonadjacent.

Algorithm 5 Two-Product-FMA
Require: a, b, p-bit floating point numbers, where p ≥ 6
 1: procedure TWO-PRODUCT-FMA(a, b)
 2:   x ← a ⊗ b
 3:   y ← FMA(a × b − x)
 4:   return (x, y)
 5: end procedure

The attractiveness of this algorithm is obvious; it depends, however, both on the correct error bounding of the hardware operation and on the ability of compilers to use the FMAD if and only if the programmer requires it. Even on hardware that correctly implements the FMAD instruction, an overly helpful compiler is all too likely to return a value y = 0. This is especially problematic for a language such as Cg that is designed to optimize heavily even at the expense of accuracy, and where compilation is only to an intermediate assembly language that hides critical architectural details. The current df64 and qf128 implementations therefore use the longer Alg. 4.

III. IMPLEMENTING GPU df64 AND qf128 FLOATING POINT

This discussion begins with df64 methods and then takes up the challenges of qf128 implementation on current GPUs. While the techniques for this generalize from the CPU methods of Shewchuk [12] and Hida et al [16], the difficulty of implementing them in the limited environment of the GPU programming domain is not to be dismissed.

A. GPU Implementation of df64 Numbers

A df64 number A is an unevaluated sum of two f32 numbers representing the value of the floating-point result and its error term:

    A_df64 = [a_hi, a_lo]    (3)
    A = a_hi + a_lo.         (4)

Arithmetic operations on these numbers are done exactly using a_lo as an error term for the a_hi term. This allows 48 bits of precision available in the pair of values (23 bits + 1 hidden per mantissa). (It is misleading, however, to call this a 48-bit floating point number, since numbers such as 2^60 + 2^−60 are exactly representable.) Discussion of GPU-based arithmetic using these primitives will be divided into basic operations:

1) conversion between d64 and df64 representations,
2) twoSum and df64 add operations,
3) splitting, twoProd, and df64 mult operations,
4) division and square-root functions,
5) comparison operations,
6) exponential, logarithmic, and trigonometric functions, and,
7) modular-division and remainder operations.

The algorithms described and implemented here have been tested on NVidia graphics hardware: Geforce 6800, 7900, and 8800 consumer-grade cards, and depend as noted on the single-precision floating-point characteristics of these and related platforms. Unless otherwise noted, routines will run on the least powerful of these platforms and under Cg 1.5 using fp40.

1) Conversion between Representations: Single-precision f32 converts trivially to and from df64 precision by respectively setting the low-order bits of the df64 number to zero, or assigning the high-order bits to the f32 number.

To convert between d64 and df64 numbers on Intel-based CPUs, the extended-precision internal format (doubles with 64-bit mantissa, rather than 53 bits) must be disabled in order to get correct splitting of a d64 value into the nearest f32 value and the corrector that form the df64 pair (see Hida et al. [16]).

    const double SPLITTER = (1 << 29) + 1;

    void df64::split(double a, float *a_hi, float *a_lo) {
        double t    = a * SPLITTER;
        double t_hi = t - (t - a);
        double t_lo = a - t_hi;
        *a_hi = (float) t_hi;
        *a_lo = (float) t_lo;
    }

Conversion from df64 to d64 requires only the summation of the high and low order terms.

2) TwoSum and df64 add operations: Addition in df64 depends on helper functions to sum f32 components and return df64 results. The twoSum() function computes the sum and error terms for general cases; the quickTwoSum() function assumes |a| ≥ |b|.

    float2 quickTwoSum(float a, float b) {
        float s = a + b;
        float e = b - (s - a);
        return float2(s, e);
    }

    float2 twoSum(float a, float b) {
        float s = a + b;
        float v = s - a;
        float e = (a - (s - v)) + (b - v);
        return float2(s, e);
    }

    float2 df64_add(float2 a, float2 b) {
        float2 s, t;
        s = twoSum(a.x, b.x);
        t = twoSum(a.y, b.y);
        s.y += t.x;
        s = quickTwoSum(s.x, s.y);
        s.y += t.y;
        s = quickTwoSum(s.x, s.y);
        return s;
    }

This addition satisfies the IEEE-style error bound

    a ⊕ b = (1 + δ)(a + b)    (5)

due to K. Briggs and W. Kahan.

On vector-based GPU hardware (in this study, pre-NV8800 GPUs), there is a 2× speedup in performing operations simultaneously on two-vector and four-vector components. The above df64_add() can in this way be done with only a single call to twoSum(), which can be used as well to add real and imaginary components of single-precision complex numbers (f32 real and imaginary pair) simultaneously.

    float4 twoSumComp(float2 a_ri, float2 b_ri) {
        float2 s = a_ri + b_ri;
        float2 v = s - a_ri;
        float2 e = (a_ri - (s - v)) + (b_ri - v);
        return float4(s.x, e.x, s.y, e.y);
    }

    float2 df64_add(float2 a, float2 b) {
        float4 st;
        st = twoSumComp(a, b);
        st.y += st.z;
        st.xy = quickTwoSum(st.x, st.y);
        st.y += st.w;
        st.xy = quickTwoSum(st.x, st.y);
        return st.xy;
    }
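For concreteness, the trivial f32 conversions described in Sec. III-A.1 can be sketched as below; the helper names are illustrative only and do not appear in the paper's listings.

    // Illustrative sketch of the f32 <-> df64 conversions of Sec. III-A.1
    // (hypothetical helper names, not part of the library code).
    float2 f32_to_df64(float a) {
        return float2(a, 0.0f);   // high-order term carries the value; error term is zero
    }

    float df64_to_f32(float2 a) {
        return a.x;               // the high-order term is already the nearest f32
    }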

3) Splitting, twoProd, and df64 mult operations: At the heart of the extended-precision multiply is the splitting of f32 values into high and low order terms whose products will fit into f32 values without overflow. The split() function takes an f32 float as input and returns the high and low order terms as a pair of floats. As for twoSum() above, vector-operation-based GPUs can achieve a 2× speedup by performing two splits simultaneously.

    float2 split(float a) {
        const float split = 4097;      // (1 << 12) + 1
        float t = a * split;
        float a_hi = t - (t - a);
        float a_lo = a - a_hi;
        return float2(a_hi, a_lo);
    }

    float4 splitComp(float2 c) {
        const float split = 4097;      // (1 << 12) + 1
        float2 t = c * split;
        float2 c_hi = t - (t - c);
        float2 c_lo = c - c_hi;
        return float4(c_hi.x, c_lo.x, c_hi.y, c_lo.y);
    }

The twoProd() function multiplies two f32 values to produce a df64 result, using f32 multiplication to produce the high-order result p and computing the low-order term as the error using split-components of the inputs a and b. Thus,

    a  = a_hi + a_lo                                            (6)
    b  = b_hi + b_lo                                            (7)
    ab = (a_hi + a_lo)(b_hi + b_lo)                             (8)
       = a_hi·b_hi + a_hi·b_lo + a_lo·b_hi + a_lo·b_lo          (9)

and the error term of the single-precision p = a ⊗ b can be computed by subtracting p from the split-product, beginning with the most-significant term.

    float2 twoProd(float a, float b) {
        float p = a * b;
        float2 aS = split(a);
        float2 bS = split(b);
        float err = ((aS.x * bS.x - p) + aS.x * bS.y + aS.y * bS.x)
                    + aS.y * bS.y;
        return float2(p, err);
    }

As before, the pair of splits can be coalesced into a single double-wide split on vector-based GPUs.

Tests with the Cg 1.5 compiler on G70 hardware show that the FMA_twoProd()

    float2 FMA_twoProd(float a, float b) {
        float x = a * b;
        float y = a * b - x;   // intended to compile to a fused multiply-add
        return float2(x, y);
    }

does not correctly yield the df64 product and error term, due as expected to the aggressive optimization of the Cg compiler.

Thus armed, the df64_mult() can now be computed:

    float2 df64_mult(float2 a, float2 b) {
        float2 p;
        p = twoProd(a.x, b.x);
        p.y += a.x * b.y;
        p.y += a.y * b.x;
        p = quickTwoSum(p.x, p.y);
        return p;
    }

Special cases: squaring of df64 values can be done more efficiently with special twoSqr() and df64_sqr() methods; multiplication—and division—by powers of two is trivial, since the high and low-order terms can be multiplied or divided independently.
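As a small illustration of the power-of-two special case just noted, the scaling can be applied to each component independently and remains exact. The helper below is a sketch and is not part of the paper's listings.

    // Exact scaling of a df64 value by a power of two (illustrative sketch;
    // the factor must be an exact power of two for the result to stay exact).
    float2 df64_mul_pwr2(float2 a, float pow2) {
        return a * pow2;
    }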

4) Division and Square-Root Functions: Division and square-root functions can be implemented using Newton-Raphson iteration. Since these methods are quadratically convergent in the neighborhood of their roots, the single-precision quotient or square-root serves as an excellent initial estimator; since these methods are self-correcting, the precision of the initial estimate is also not critical, and only one or two iterations thereafter are needed to produce df64 or qf128 precision.

Further and quite substantial speedups can be achieved using Karp's method (Karp and Markstein [26], used by Briggs [15] and Hida [16]), which reduces the number of extended-precision operations needed in the Newton iteration. The algorithm for Karp's division is given in Alg. 6. This allows a double-precision division with only a low-precision division, a single double-double multiply, and four single- or mixed-precision operations. A Cg implementation of this is straightforward. Karp gives a similar mixed-precision algorithm for square-roots (see Alg. 7). The Cg implementation for this shows similarly substantial speedups over naive implementations. (The reciprocal square-root in Step 2 is a single operation on common graphics hardware and in the Cg language.) Figure 1 shows Cg implementations of both algorithms.

Algorithm 6 Karp's Method for High-Precision Division
Require: A ≠ 0, B, double-precision floating point values
 1: procedure KARP'S DIVISION(B, A)
 2:   x_n ← 1/A_hi               ▷ single ← single/single
 3:   y_n ← B_hi · x_n           ▷ single ← single·single
 4:   Compute A·y_n              ▷ double ← double·single
 5:   Compute (B − A·y_n)        ▷ single ← double−double
 6:   Compute x_n(B − A·y_n)     ▷ double ← single·single
 7:   q ← y_n + x_n(B − A·y_n)   ▷ double ← single+double
 8:   return q                   ▷ the double-precision B/A
 9: end procedure

Algorithm 7 Karp's Method for High-Precision Square-Root
Require: A > 0, double-precision floating point value
 1: procedure KARP'S SQUARE-ROOT(A)
 2:   x_n ← 1/√(A_hi)            ▷ single ← single/single
 3:   y_n ← A_hi · x_n           ▷ single ← single·single
 4:   Compute y_n²               ▷ double ← single·single
 5:   Compute (A − y_n²)_hi      ▷ single ← double−double
 6:   Compute x_n(A − y_n²)_hi/2 ▷ double ← single·single/2
 7:   q ← y_n + x_n(A − y_n²)/2  ▷ double ← single+double
 8:   return q                   ▷ the double-precision √A
 9: end procedure

    float2 df64_div(float2 B, float2 A) {
        float xn = 1.0f / A.x;
        float yn = B.x * xn;
        float diffTerm = (df64_diff(B, df64_mult(A, yn))).x;
        float2 prodTerm = twoProd(xn, diffTerm);
        return df64_add(yn, prodTerm);
    }

    float2 df64_sqrt(float2 A) {
        float xn = rsqrt(A.x);
        float yn = A.x * xn;
        float2 ynsqr = df64_sqr(yn);
        float diff = (df64_diff(A, ynsqr)).x;
        float2 prodTerm = twoProd(xn, diff) / 2;
        return df64_add(yn, prodTerm);
    }

Fig. 1. Cg implementations of Karp's mixed-precision division and square-root algorithms.

5) Comparison Operations: The common Boolean comparisons {=, ≠, <, ≤, >, ≥} can be implemented in double-float precision by "short-circuit" conditional tests, reducing the likely number of single-float conditional tests to no more than a single test by comparing the high-order terms of the two numbers first. Current Cg profiles do not allow short-circuit comparisons, however, since conditional branching is much more costly than conditional testing, which can be performed in parallel on vector GPUs. The special case of comparison to zero always involves only a single comparison. Examples of Boolean comparisons are given in Fig. 2.

    bool df64_eq(float2 a, float2 b) {
        return all(a == b);
    }

    bool df64_neq(float2 a, float2 b) {
        return any(a != b);
    }

    bool df64_lt(float2 a, float2 b) {
        return (a.x < b.x || (a.x == b.x && a.y < b.y));
    }

Fig. 2. Cg implementations of basic df64 conditional tests. Note that comparisons between vector-based data are done component-wise.

6) Exponential, Logarithmic, and Trigonometric Functions: Evaluation of extended-precision transcendental functions on the GPU shares the range of challenges common to all numerical techniques for software evaluation of these functions. Typically, range-reduction is used to bring the value to be evaluated into an acceptable range; then an algorithmic method based on Newton iteration, Taylor series, or table-based polynomial or Padé (rational) approximation is used to compute the answer to the necessary precision, followed by modification based on the range-reduction to produce the correct value. See [27] for a discussion of many different methods. A good discussion of Padé approximation is found in [28]. Correct rounding of transcendentals is often problematic even in conventional CPU libraries; for a discussion of issues and methods for full-precision results, see Gal [29] and Brisebarre et al [30].

Particular challenges to the GPU occur when inaccuracies in rounding of basic operations make range-reduction and evaluation of series solutions problematic. Inaccurate range-reduction followed by nearly correct approximation followed by inaccurate restoration to the final value can quickly chew up a substantial fraction of the extended precision one hoped for. In general, methods based on Newton iteration fare better than those based on Taylor series or table-based approximation.

In terms of hardware capability, pre-8800 NVidia hardware can be used for exponential functions, but computation of accurate logarithms, sines, and cosines has been reliably achieved only on G80 (8800) cards, with their markedly better IEEE 754 compliance.

    float2 df64_expTAYLOR(float2 a) {
        const float thresh = 1.0e-20 * exp(a.x);
        float2 t;   /* Term being added. */
        float2 p;   /* Current power of a. */
        float2 f;   /* Denominator. */
        float2 s;   /* Current partial sum. */
        float m;

        s = df64_add(1.0f, a);   // first two terms
        p = df64_sqr(a);
        m = 2.0f;
        f = float2(2.0f, 0.0f);
        t = p / 2.0f;
        while (abs(t.x) > thresh) {
            s = df64_add(s, t);
            p = df64_mult(p, a);
            m += 1.0f;
            f = df64_mult(f, m);
            t = df64_div(p, f);
        }
        return df64_add(s, t);
    }

Fig. 3. Cg implementation of exp(x) using a Taylor series, for |x| < ln 2/128. Range-reduction is handled in an enclosing routine.
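Figure 3 leaves the range-reduction to an enclosing routine that is not listed in the paper. One plausible form of that routine, following the standard reduction exp(x) = 2^k · (exp(r))^128 with r = (x − k ln 2)/128 used in the CPU double-double libraries [16], is sketched below; the split ln 2 constant is assumed to be supplied from the CPU, and the structure is an assumption rather than the paper's actual code.

    // Plausible enclosing range-reduction for df64_expTAYLOR() (Fig. 3); a sketch,
    // not the paper's own routine.  LOG2 holds ln 2 split into high/low f32
    // components, precomputed on the CPU and passed down as a uniform.
    uniform float2 LOG2;

    float2 df64_exp(float2 a) {
        float k = floor(a.x / LOG2.x + 0.5f);               // nearest integer multiple of ln 2
        float2 r = df64_add(a, df64_mult(LOG2, float2(-k, 0.0f)));
        r = r / 128.0f;                                     // exact: both components scaled by 2^-7
        float2 s = df64_expTAYLOR(r);                       // now |r| < ln 2 / 128
        for (int i = 0; i < 7; i++)                         // undo the /128: raise exp(r) to the 128th power
            s = df64_mult(s, s);
        return s * exp2(k);                                 // exact power-of-two scaling by 2^k
    }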

    float2 df64_log(float2 a) {
        float2 xi = float2(0.0f, 0.0f);
        if (!df64_eq(a, 1.0f)) {
            if (a.x <= 0.0)
                xi = log(a.x).xx;   // return NaN
            else {
                xi.x = log(a.x);
                xi = df64_add(df64_add(xi, df64_mult(df64_exp(-xi), a)), -1.0);
            }
        }
        return xi;
    }

Fig. 4. Cg implementation of ln(x) using Newton-Raphson iteration. The initial approximation is provided by a single-precision result.

7) Modular Division and Remainder Operations: These remain a major hassle; Cg 2.0 and 8800-level hardware are essential.

    float4 df64_sincosTAYLOR(float2 a) {
        const float thresh = 1.0e-20 * abs(a.x) * ONE;
        float2 t;   /* Term being added. */
        float2 p;   /* Current power of a. */
        float2 f;   /* Denominator. */
        float2 s;   /* Current partial sum. */
        float2 x;   /* = -sqr(a) */
        float m;
        float2 sin_a, cos_a;
        if (a.x == 0.0f) {
            sin_a = float2(ZERO, ZERO);
            cos_a = float2(ONE, ZERO);
        }
        else {
            x = -df64_sqr(a);
            int count = 0;
            s = a;
            p = a;
            m = ONE;
            f = float2(ONE, ZERO);
            while (true) {
                p = df64_mult(p, x);
                m += 2.0f;
                f = df64_mult(f, m * (m - 1));
                t = df64_div(p, f);
                s = df64_add(s, t);
                if (abs(t.x) < thresh)
                    break;
            }
            sin_a = s;
            cos_a = df64_sqrt(df64_add(ONE, -df64_sqr(s)));
        }
        return float4(sin_a, cos_a);
    }

Fig. 5. Cg implementation of sin(x) and cos(x) using Taylor series expansion.

B. GPU Implementation of qf128 Numbers

The implementation and discussion here once again follows that of Hida [17] on efficient quad-double implementation. A qf128 number A is an unevaluated sum of four f32 numbers allowing 96 bits of precision in single-precision exponent ranges. In this representation,

    A_qf128 = [a1, a2, a3, a4]
    A = a1 + a2 + a3 + a4

for single-precision a_n, where a1, a2, a3, a4 store successively lower-order bits in the number. Arithmetic operations on qf128 numbers depend on the same splitting, two-sum and two-prod operations as df64 computation; an additional renormalization step is also required to ensure that the four components remain nonoverlapping and decreasing in magnitude. A five-input version of this is shown in Fig. 6.

    float4 qf_renormalize(float a0, float a1, float a2, float a3, float a4) {
        float s;
        float t0, t1, t2, t3, t4;
        float s0, s1, s2, s3;
        s0 = s1 = s2 = s3 = 0.0f;

        s  = quickTwoSum(a3, a4, t4);
        s  = quickTwoSum(a2, s,  t3);
        s  = quickTwoSum(a1, s,  t2);
        t0 = quickTwoSum(a0, s,  t1);

        float4 b = float4(0.0f, 0.0f, 0.0f, 0.0f);

        s0 = quickTwoSum(t0, t1, s1);
        if (s1 != 0.0) {
            s1 = quickTwoSum(s1, t2, s2);
            if (s2 != 0.0) {
                s2 = quickTwoSum(s2, t3, s3);
                if (s3 != 0.0)
                    s3 += t4;
                else
                    s2 += t4;
            }
            else {
                s1 = quickTwoSum(s1, t3, s2);
                if (s2 != 0.0)
                    s2 = quickTwoSum(s2, t4, s3);
                else
                    s1 = quickTwoSum(s1, t4, s2);
            }
        }
        else {
            s0 = quickTwoSum(s0, t2, s1);
            if (s1 != 0.0) {
                s1 = quickTwoSum(s1, t3, s2);
                if (s2 != 0.0)
                    s2 = quickTwoSum(s2, t4, s3);
                else
                    s1 = quickTwoSum(s1, t4, s2);
            }
            else {
                s0 = quickTwoSum(s0, t3, s1);
                if (s1 != 0.0)
                    s1 = quickTwoSum(s1, t4, s2);
                else
                    s0 = quickTwoSum(s0, t4, s1);
            }
        }
        return float4(s0, s1, s2, s3);
    }

Fig. 6. Cg implementation of a qf128 renormalization function.

Given this renormalization function, it is also necessary to have n-Sum functions taking different numbers of inputs and producing different numbers of outputs. With these in hand, a basic qf_add() function can be implemented as in Fig. 7.

    float4 qf_add(float4 a, float4 b) {
        float4 s, err;
        // We evaluate 4 twoSums in parallel
        s = qf_twoSum(a, b, err);
        float c0, c1, c2, c3, c4, err1, err2, err3, err4;
        c0 = s.x;
        c1 = twoSum(s.y, err.x, err1);
        c2 = threeSum(s.z, err.y, err1, err2, err3);
        c3 = threeSum(s.w, err.z, err2, err4);
        c4 = threeSum(err.w, err3, err4);
        return qf_renormalize(c0, c1, c2, c3, c4);
    }

Fig. 7. Cg implementation of a qf128 add-function. In addition to the twoSum, there are multiple threeSum functions taking different numbers of inputs and producing different numbers of outputs.
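The three-term sum helpers used in Fig. 7 are not listed in the paper. A plausible version built from the twoSum() of Sec. III-A.2, adapted from the three_sum of Hida et al. [17], is sketched below; the actual signatures used above, which return errors through out-parameters, differ from this sketch.

    // One plausible three-term sum built from twoSum() (a sketch following
    // Hida et al. [17]; not the paper's own threeSum, whose signature differs).
    float3 threeSum(float a, float b, float c) {
        float2 t = twoSum(a, b);       // t.x = a+b rounded, t.y = its roundoff error
        float2 u = twoSum(c, t.x);     // u.x = leading sum
        float2 v = twoSum(t.y, u.y);   // combine the two error terms
        return float3(u.x, v.x, v.y);  // sum and two roundoff terms
    }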

This method of addition does not satisfy an IEEE-style error bound but rather

    a ⊕ b = (1 + δ1)a + (1 + δ2)b, with |δ1|, |δ2| ≤ ε_qf128.    (12)

For proofs of these bounds for quad-doubles, see Hida [17]. IEEE bounds on qf128 addition can be achieved using an algorithm by Shewchuk [12] and Boldo [31]; this requires costly sorting of terms and a float-accumulate operation (with a factor of 2–3.5× slowdown, according to Hida), but gives

    a ⊕ b = (1 + δ)(a + b), with |δ| ≤ 2ε.    (13)

Multiplication, shown in Fig. 8, requires additional n-Sum functions. The 4 × 4 product terms can be simplified by using single-precision multiplication when error terms will be too small for the representation.

Division can be computed using an iterative long-division method followed by a renormalization of the quotient terms (Hida [17]); the current Cg implementation instead uses Newton-Raphson iteration and exploits Karp's methods for reducing the necessary precision at each step.

C. Compilation Issues vis-à-vis the Cg Language

Cg 1.5 vs. Cg 2.0 (the latter not available on the Cg website, only with the NVidia OpenGL SDK): geometry programs and advanced fragment-program capabilities are only available on Geforce 8800 and equivalent Quadro cards, and require Cg 2.0 to access the advanced features. (Cg 2.0 is not currently posted on the NVidia Cg developer's page, but can be downloaded bundled with their most recent OpenGL SDK.) Cg 1.5 allows limited extended-precision computation for df64 primitives including floating-point division and square-root (Mandelbrot example), but precision is not adequate for computation of series summations for trigonometric, exponential, and logarithmic functions. These can be computed using Taylor series, Newton iteration, or Padé approximation on the 8800 series cards.

Computation using qf128 representations requires 8800-level hardware for correct renormalization. Cg 2.0 is required to make use of the advanced machine instructions available in the advanced fragment program specifications, allowing robust looping, branching, and arbitrary-length fragment programs.

D. Case Study: a GPU Implementation of the Fast Fourier Transform in df64

A discrete Fourier transform is a linear transformation of a complex vector:

    F(x) = W x,

where, for complex x of length n, W_kj = ω_n^kj, with

    ω_n = cos(2π/n) − i sin(2π/n) = e^(−2πi/n).

"Fast" versions of this transformation allow the matrix product to be computed in O(n log n) operations rather than O(n²), and play pivotal roles in signal processing and data analysis.

A general discussion of FFTs on GPU hardware can be found in Moreland et al. [32]. A discussion of issues involved in parallel computation of FFTs is Bailey [33]. The GAMMA group at UNC-Chapel Hill has released the GPUFFTW library for single-precision FFTs on real and complex input.

This implementation uses a transposed Stockham autosort framework for the FFT; a discussion of such methods is found in Van Loan [34]. The Stockham autosort methods avoid bit-reversal rearrangement of the data at the cost of a temporary storage array (necessary for a GPU-based algorithm in any case) and a varying index scheme for each of the log2 n iterations. The GAMMA group at UNC employ such a framework in their single-precision GPUFFTW library.

A basic algorithm for a transposed Stockham transform is shown in Alg. 8. This presentation uses a precomputed W_long vector containing the n − 1 unique Fourier-basis coefficients for the transform. This was done in the GPU implementation to avoid costly extended-precision trigonometric computations on the fragment processors; for single-precision GPU transforms, the difference between the texture-fetch of precomputed values and the use of hardware-based sine and cosine functions for direct call by fragment programs is negligible.

Algorithm 8 Sequential Transposed Stockham Autosort FFT
Require: X is a complex array of power-of-two length n
Require: W_long = [ω_2^0, ω_4^0, ω_4^1, ω_8^0, ω_8^1, ω_8^2, ω_8^3, . . . , ω_n^(n/2−1)]
 1: procedure TS-FFT(X, n)
 2:   t ← log2 n
 3:   for q ← 1, t do
 4:     L ← 2^q
 5:     L* ← L/2
 6:     r ← n/L
 7:     Y ← X
 8:     for k ← 0, r − 1 do
 9:       for j ← 0, L* − 1 do
10:         τ ← W_long[L* + j − 1] · Y[(k + r)L* + j]
11:         X[kL + j] ← Y[kL* + j] + τ
12:         X[kL + L* + j] ← Y[kL* + j] − τ
13:       end for
14:     end for
15:   end for
16:   return X    ▷ the DFT of the input
17: end procedure

This algorithm can be implemented using a Cg fragment program to perform the inner two loops, storing the initial X vector as a dataWidth × dataHeight RGBA32 floating-point texture in a framebuffer object (FBO), writing into a buffer in the FBO, and ping-ponging between input and output buffers after each of the log2 n iterations of the outer loop. Algorithm 9 shows the CPU support structure for a data-streaming GPU implementation; Algorithm 10 shows the necessary work on the GPU. These assume df64 format, with each pixel storing a single complex value as four 32-bit floats: [real_hi, real_lo, imag_hi, imag_lo]. Modifications for qf128 make it simpler to use pairs of parallel images containing real and imaginary components; a fragment program can read from both input images and write to two output buffers, or separate fragment programs can be run to compute the real and imaginary results separately. (GPU systems can take a performance hit for dual-buffer output from a single shader; this must be weighed against the cost of running the fragment program twice.) An implementation can ping-pong between pairs of read and write buffers just as for the single-input, single-output buffer case. This implementation achieves highest efficiency when it minimizes the data load and readback to and from the GPU; once loaded, data can be kept in framebuffer objects (FBOs) on the GPU and made available for spatial- and frequency-domain operations with minimal bus-traffic to and from the CPU. Appendix I contains C++ and Cg code implementing these algorithms: Figure 9 shows the CPU-side scaffolding code; Figure 10 shows the Cg fragment program.

If a precomputed W_long vector is provided in an input texture, the FFT can be computed on a 6800/7800/7900 card using an fp40 profile under Cg 1.5. If direct computation of the Fourier basis-coefficients is needed, an 8800 card and Cg 2.0 are required.
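The fragment program of Fig. 10 calls cdf64_mult() and cdf64_add() helpers for complex df64 values packed as [real_hi, real_lo, imag_hi, imag_lo]; these are not listed in the paper. A minimal sketch in terms of the df64 primitives above might look as follows.

    // Sketch of the complex df64 helpers assumed by Fig. 10 (packing:
    // x = re.hi, y = re.lo, z = im.hi, w = im.lo).  Illustrations only,
    // not the paper's own cdf64 implementations.
    float4 cdf64_add(float4 a, float4 b) {
        return float4(df64_add(a.xy, b.xy), df64_add(a.zw, b.zw));
    }

    float4 cdf64_mult(float4 a, float4 b) {
        float2 re = df64_add(df64_mult(a.xy, b.xy), -df64_mult(a.zw, b.zw));
        float2 im = df64_add(df64_mult(a.xy, b.zw),  df64_mult(a.zw, b.xy));
        return float4(re, im);
    }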

    float4 qf_mult(float4 a, float4 b) {

        float p00, p01, p02, p03, p10, p11, p12, p13, p20, p21, p22, p30, p31;
        float q00, q01, q02, q03, q10, q11, q12, q20, q21, q30;

        // as below, can replace 10 function calls with 3, using correct swizzling
        p00 = twoProd(a.x, b.x, q00);
        p01 = twoProd(a.x, b.y, q01);
        p02 = twoProd(a.x, b.z, q02);
        p03 = twoProd(a.x, b.w, q03);
        p10 = twoProd(a.y, b.x, q10);
        p11 = twoProd(a.y, b.y, q11);
        p12 = twoProd(a.y, b.z, q12);
        p20 = twoProd(a.z, b.x, q20);
        p21 = twoProd(a.z, b.y, q21);
        p30 = twoProd(a.w, b.x, q30);

        p13 = a.y * b.w;
        p22 = a.z * b.z;
        p31 = a.w * b.y;

        float c0, c1, c2, c3, c4, errEoutHi, errEoutLo, errE2outHi, errE2outLo, errE3out;

        c0 = p00;
        c1 = threeSum(p01, p10, q00, errEoutHi, errEoutLo);
        c2 = sixThreeSum(p02, p11, p20, q01, q10, errEoutHi, errE2outHi, errE2outLo);
        c3 = nineTwoSum(p03, p12, p21, p30, q02, q11, q20, errEoutLo, errE2outHi, errE3out);
        c4 = nineOneSum(p13, p22, p31, q03, q12, q21, q30, errE2outLo, errE3out);

        return qf_renormalize(c0, c1, c2, c3, c4);
    }

Fig. 8. Cg implementation of a qf128 multiplier.

TABLE II
GPU PRECISION FOR df64 FLOATING-POINT COMPUTATION. BOUNDS ARE IN ulps FOR 48 CONTIGUOUS SIGNIFICANT BITS. COMPARISONS ARE TO d64 CPU COMPUTATIONS PERFORMED ON df64-PRECISION VALUES.

Operation        Operand range      G80 max |error|   G80 RMS error
Addition         [-1, 1]            1.1               0.12
Subtraction      [-1, 1]            1.1               0.12
Multiplication   [-1, 1]            2.5               0.33
Division         [-1, 1]            4.1               0.48
Reciprocal       [-1, 1]            3.1               0.40
Recip. sqrt      (0, 1)             4.4               0.55
Square root      [0, 1]             4.5               0.46
exp(x)           [-1, 1]            10.6              1.7
ln(1 + x)        [1, 2]             11.0              1.7
sin(x)           [-pi/2, pi/2]      7.8               0.94
cos(x)           [-pi/2, pi/2]      241.3             6.0

E. Timings and Accuracy of GPU-Based FFT Routines

[Addendum (July 2009): Results from this section have been deleted.]

The timing and accuracy tests were done on input vectors of length 2^m, for even values of m. This allowed the use of square 2^(m/2) × 2^(m/2) image-textures for data transfer and storage on the GPU. (Since modern APIs allow non-square textures, it requires only trivial modifications of the current code to allow odd power-of-two vector lengths.) Timings were done on forward transformations of complex-to-complex data. Precision experiments were made, as per Bailey [33], by comparison of root-mean-square error (RMSE) of pseudo-random complex vector component magnitudes and phases with inverse transformations of their forward transforms:

    RMSE = sqrt( (1/2^m) Σ_{i=0}^{2^m − 1} [M(x[i]) − M(x'[i])]² ),    (14)
    where x' = F⁻¹(F(x))                                              (15)
    and M(a) = mag(a) or phase(a).                                    (16)

Peak signal-to-noise ratios were also computed:

    Peak s/n = 10 log_10 ( I_max² / MSE ),    (17)

where MSE is the mean-square error and I_max is the maximum signal strength.

Computations of GFLOPs were made by dividing the 5m · 2^m approximation of the floating-point operation count required by the transposed Stockham algorithm (see van Loan [34]) by the time for a single forward FFT.

Algorithm 9 Sequential (CPU) Portion of Parallel Transposed Stockham FFT
Require: I: a complex array of power-of-two length N; each element is a sequence of four 32-bit floats: [real_hi, real_lo, imag_hi, imag_lo]
Require: RB, WB: n × m = N element GPU pixel-buffers
Require: W_long: an n × m texture storing the (N − 1) ω coefficients
 1: procedure TS-FFT-CPU(I, N, n, m)
 2:   WB ← I                       ▷ load data to GPU
 3:   t ← log2 N
 4:   for q ← 1, t do
 5:     L ← 2^q
 6:     RB ↔ WB                    ▷ swap buffers on the GPU
 7:     Framebuffer ← WB
 8:     data ← RB
 9:     Framebuffer ← TS-FFT-GPU(data, W_long, L, N, m)
10:   end for
11:   return readback from WB (last output Framebuffer)
12: end procedure

Algorithm 10 Streaming (GPU) Portion of Parallel Transposed Stockham FFT
Require: input as described above in Alg. 9
Require: integer texture coordinates (cx, cy) of the fragment
 1: procedure TS-FFT-GPU(data, W_long, L, N, m)
 2:   R ← N/L
 3:   L* ← L/2
 4:   i ← cy·m + cx                ▷ array index in I of fragment
 5:   k ← i/L
 6:   j ← i − Lk
 7:   if j ≥ L* then               ▷ distinguish j < L* from j ≥ L*
 8:     j ← j − L*
 9:     s_τ = −1.0                 ▷ sign multiplier for τ
10:   else
11:     s_τ = 1.0
12:   end if
13:   d1 ← L* + j − 1              ▷ index of ω in W_long
14:   ω ← lookup(W_long, d1/m, d1 mod m)
15:   d2 ← kL* + R + j
16:   Y ← lookup(data, d2/m, d2 mod m)
17:   τ = s_τ · ω · Y
18:   d3 ← kL* + j
19:   X ← lookup(data, d3/m, d3 mod m)
20:   return X + τ
21: end procedure

1) Results using Precomputed W_long Values: Tables [DELETED] show timing and accuracy comparisons for df64-based FFTs as described. Timings were made on two platforms: an NVidia Geforce 8800 GT running on a 3.2 GHz single-processor Pentium IV front-end, and an NVidia 7900go running on a Centrino Duo front-end. Both platforms used the Windows XP operating system with the Intel C++ 9.0 compiler; the NV7900 vector hardware code was compiled under Cg 1.5; the NV8800 scalar hardware code was compiled under Cg 2.0. Timing and accuracy comparisons were against d64 transposed Stockham FFTs implemented on the CPU on the respective front-end machines. While the sequential FFT code was not heavily optimized, neither was the GPU code: a radix-4 implementation as described in Bailey [33] could be expected to give a 2x to 4x improvement in runtime and potential improvements in accuracy as well.

2) Results using Direct Computation of W_L^jk Values: Table [DELETED] shows similar timings and accuracy to the above, using direct calls to the extended-precision df64 sine-cosine function on the NV8800 GPU. While these results produce marked slowdowns over the use of pre-computed values, they illustrate the accuracy of the library functions and their usefulness in situations that might require few calls.

APPENDIX I
C++ AND CG IMPLEMENTATION OF THE TRANSPOSED STOCKHAM FFT

Fig. 9 and Fig. 10 present examples of C++ and Cg implementations of the texture-based streaming version of the transposed Stockham FFT.

ACKNOWLEDGMENT

The author would like to thank Dr. David Bremer (LLNL) and Dr. Jeremy Meredith (ORNL) for discussions of their experiments, Dr. Mark Harris for his information on GPU floating-point performance and other GPGPU issues, and Dr. Dinesh Manocha and his colleagues in UNC's GAMMA group.

REFERENCES

[1] D. H. Bailey, "High-precision floating-point arithmetic in scientific computation," Computing in Science and Eng., vol. 7, no. 3, pp. 54–61, 2005.
[2] F. de Dinechin, "High precision numerical accuracy in physics research," workshop presentation, Advanced Computing and Analysis Techniques in Physics Research (ACAT 2005), Zeuthen, Germany, May 2005.
[3] D. Göddeke, R. Strzodka, and S. Turek, "Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations," International Journal of Parallel, Emergent and Distributed Systems, vol. 22, no. 4, pp. 221–256, Jan. 2007.
[4] T. H. Myer and I. E. Sutherland, "On the design of display processors," Commun. ACM, vol. 11, no. 6, pp. 410–414, 1968.
[5] T. J. Dekker, "A floating point technique for extending the available precision," Numerische Mathematik, vol. 18, pp. 224–242, 1971.
[6] S. Linnainmaa, "Software for doubled-precision floating-point computation," ACM Trans. on Mathematical Software, vol. 7, no. 3, pp. 272–283, September 1981.
[7] R. P. Brent, "A FORTRAN multiple-precision arithmetic package," ACM Trans. on Math. Software, vol. 4, no. 1, pp. 57–70, March 1978.
[8] D. Smith, "A Fortran package for floating-point multiple-precision arithmetic," ACM Transactions on Mathematical Software, vol. 17, no. 2, pp. 273–283, June 1991.
[9] D. H. Bailey, "A Fortran 90-based multiprecision system," ACM Trans. Math. Softw., vol. 21, no. 4, pp. 379–387, 1995.
[10] P. Zimmermann, "Arithmétique en précision arbitraire," LORIA/INRIA Lorraine, Fr., rapport de recherche INRIA 4272, Septembre 2001.
[11] D. M. Priest, "On properties of floating point arithmetics: Numerical stability and the cost of accurate computations," Ph.D. dissertation, University of California, Berkeley, 1992.
[12] J. R. Shewchuk, "Adaptive precision floating-point arithmetic and fast robust geometric predicates," Discrete & Computational Geometry, vol. 18, no. 3, pp. 305–363, October 1997.
[13] IEEE, "An American national standard: IEEE standard for binary floating-point arithmetic," ACM SIGPLAN Notices, vol. 22, no. 2, pp. 7–18, 1997.
[14] D. H. Bailey, "A Fortran-90 double-double precision library," Tech. Rep.; libraries available at http://crd.lbl.gov/~dhbailey/mpdist/index.html, 2001.
[15] K. Briggs, "Doubledouble floating point arithmetic," http://keithbriggs.info/doubledouble.html, Tech. Rep., 1998.

    void TransStockhamFFT::update()
    {
        myFBO.Bind();
        glViewport(0, 0, dataWidth, dataHeight);
        readID  = 0;
        writeID = 1;
        glBindTexture(GL_TEXTURE_RECTANGLE_ARB, texID[writeID]);
        glTexSubImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, 0, 0,
                        dataWidth, dataHeight,
                        GL_RGBA, GL_FLOAT, inputIm);

        int N = dataWidth * dataHeight;
        int T = intLog2(N);
        float L = 1;
        float R = N;
        for (int q = 1; q <= T; q++) {
            int temp = readID;
            readID   = writeID;
            writeID  = temp;
            L *= 2;
            R /= 2;

            glDrawBuffer(frameBuffers[writeID]);
            cgGLBindProgram(transStockhamFP);
            cgGLEnableProfile(g_cgProfile);
            cgGLSetTextureParameter(texParam, texID[readID]);
            cgGLEnableTextureParameter(texParam);
            cgGLSetTextureParameter(wLongParam, texID[3]);
            cgGLEnableTextureParameter(wLongParam);
            cgGLSetParameter1f(fwdFFTParam, 1.0f);
            cgGLSetParameter4f(LRNIParam, L, R, (float)N, (float)dataWidth);
            glBegin(GL_QUADS);
            {
                glTexCoord2f(0, 0);                   glVertex2f(0, 0);
                glTexCoord2f(dataWidth, 0);           glVertex2f(1, 0);
                glTexCoord2f(dataWidth, dataHeight);  glVertex2f(1, 1);
                glTexCoord2f(0, dataHeight);          glVertex2f(0, 1);
            }
            glEnd();
            cgGLDisableTextureParameter(texParam);
        }
        cgGLDisableProfile(g_cgProfile);
        FramebufferObject::Disable();
    }

Fig. 9. CPU-side code for the GPU-based FFT.

    #include "extPrecision.cg"

    float4 mainTransStockham(float2 coords : TEX0,
                             uniform samplerRECT texture,
                             uniform samplerRECT Wlong,
                             uniform float FWDFFT,
                             uniform float4 LRNI) : COLOR
    {
        int locX = (int) coords.x;
        int locY = (int) coords.y;
        int L = (int) floor(LRNI.x);
        int R = (int) floor(LRNI.y);
        int N = (int) floor(LRNI.z);
        int IMAGESIZE = (int) floor(LRNI.w);

        int LStar = L / 2;
        int i = locY * IMAGESIZE + locX;
        int k = i / L;
        int j = i - L * k;

        float tauMult = 1.0f;
        if (j >= LStar) {
            j = j - LStar;
            tauMult = -1.0f;
        }

        int dex  = LStar + j - 1;
        int dex1 = dex / IMAGESIZE;
        int dex2 = dex - IMAGESIZE * dex1;
        float4 W = texRECT(Wlong, float2(dex2, dex1));

        dex  = k * LStar + j;
        dex1 = dex / IMAGESIZE;
        dex2 = dex - IMAGESIZE * dex1;
        float4 src = texRECT(texture, float2(dex2, dex1));

        dex += R * LStar;
        dex1 = dex / IMAGESIZE;
        dex2 = dex - IMAGESIZE * dex1;
        float4 Y = texRECT(texture, float2(dex2, dex1));

        float4 tauPlain = cdf64_mult(Y, W);
        float4 tau = tauPlain * tauMult;

        float4 dst = cdf64_add(src, tau);
        if (FWDFFT == -1.0 && L == N)
            dst /= N;
        return dst;
    }

Fig. 10. Cg code for the GPU fragment program. Slight speedups are achievable by replacing conditional branches with calls to separate fragment programs. Counterintuitively, elimination of the floor() functions produces slight slowdowns. (This is with the Cg 1.5 compiler. See the discussion of Cg 1.5 vs. 2.0 differences in Sec. III-C.)

[16] Y. Hida, X. S. Li, and D. H. Bailey, "Algorithms for quad-double precision floating point arithmetic," in Proceedings of the 15th Symposium on Computer Arithmetic (ARITH '01), N. Burgess and L. Ciminiera, Eds. Washington, DC: IEEE Computer Society, 2001, pp. 155–162. [Online]. Available: http://citeseer.ist.psu.edu/hida01algorithms.html
[17] ——, "Quad-double arithmetic: algorithms, implementation, and application," Lawrence Berkeley National Laboratory, Berkeley, CA, Tech. Rep. LBNL-46996, October 30, 2000. [Online]. Available: citeseer.ist.psu.edu/article/hida00quaddouble.html
[18] X. S. Li, J. W. Demmel, D. H. Bailey, G. Henry, Y. Hida, J. Iskandar, W. Kahan, S. Y. Kang, A. Kapur, M. C. Martin, B. J. Thompson, T. Tung, and D. J. Yoo, "Design, implementation and testing of extended and mixed precision BLAS," ACM Trans. Math. Softw., vol. 28, no. 2, pp. 152–205, 2002.
[19] D. E. Knuth, The Art of Computer Programming: Seminumerical Algorithms, 2nd ed. Reading, Massachusetts: Addison Wesley, 1981, vol. 2.
[20] J. Meredith, D. Bremer, L. Flath, J. Johnson, H. Jones, S. Vaidya, and R. Frank, "The GAIA Project: evaluation of GPU-based programming environments for knowledge discovery," presentation, HPEC '04, Boston, MA, 2004.
[21] A. L. Thall, "Extended-precision floating-point numbers for GPU computation," poster session, ACM SIGGRAPH '06 Annual Conference, Boston, MA, August 2006.
[22] G. Da Graça and D. Defour, "Implementation of float-float operators on graphics hardware," preprint, May 2006.
[23] B. Verdonk, A. Cuyt, and D. Verschaeren, "A precision and range independent tool for testing floating-point arithmetic I: basic operations, square root and remainder," ACM Transactions on Mathematical Software, vol. 27, no. 1, pp. 92–118, 2001.
[24] R. Karpinski, "Paranoia: a floating-point benchmark," Byte Magazine, vol. 10, no. 2, pp. 223–235, February 1985.
[25] K. Hillesland and A. Lastra, "GPU floating-point Paranoia," in Proceedings of the ACM Workshop on General Purpose Computing on Graphics Processors. Los Angeles: ACM, August 2004, pp. C–8. Code available at http://www.cs.unc.edu/~ibr/projects/paranoia/.
[26] A. H. Karp and P. Markstein, "High-precision division and square root," ACM Trans. Math. Softw., vol. 23, no. 4, pp. 561–589, 1997.
[27] J.-M. Muller, Elementary Functions: Algorithms and Implementation, 2nd ed. Birkhäuser, 2005.
[28] G. A. Baker, Essentials of Padé Approximation. New York: Academic Press, 1975.
[29] S. Gal, "An accurate elementary mathematical library for the IEEE floating point standard," ACM Trans. Math. Softw., vol. 17, no. 1, pp. 26–45, 1991.
[30] N. Brisebarre, D. Defour, P. Kornerup, J.-M. Muller, and N. Revol, "A new range-reduction algorithm," IEEE Trans. Comput., vol. 54, no. 3, pp. 331–339, 2005.
[31] S. Boldo and J.-M. Muller, "Some functions computable with a fused-mac," in Proceedings of the 17th IEEE Symposium on Computer Arithmetic (ARITH 2005), pp. 52–58, 2005.
[32] K. Moreland and E. Angel, "The FFT on a GPU," in Graphics Hardware 2003, M. Doggett, W. Heidrich, W. Mark, and A. Schilling, Eds. Eurographics, 2003.
[33] D. H. Bailey, "A high-performance FFT algorithm for vector supercomputers," Int. Journal of Supercomputer Applications, vol. 2, no. 1, pp. 82–87, 1988.
[34] C. van Loan, Computational Frameworks for the Fast Fourier Transform, ser. Frontiers in Applied Mathematics. SIAM, 1992. (Another wonderful book from Charlie van Loan. Let's send this guy lots of money.)

Andrew Thall is Assistant Professor of Mathematics and Computer Science at Alma College. He received his Ph.D. (2004) in Computer Science from the University of North Carolina at Chapel Hill, where his main areas of research were in medical-image analysis and medial-based geometric surface modeling. His current research interests include numerical computation, GPGPU, and computer science education.