A New Multiplication Algorithm for Extended Precision Using ﬂoating-Point Expansions

A new multiplication algorithm for extended precision using floating-point expansions Valentina Popescu,Jean-MichelMuller,PingTakPeterTang ARITH 23 July 2016 CudAA MultipleM PrecisionP ARARithmetic librarY Chaotic dynamical systems: bifurcation analysis; compute periodic orbits (e.g., finding sinks in the Hénon map, iterating the Lorenz attractor); celestial mechanics (e.g., long term stability of the solar system). Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization etc. Target applications 1 Need massive parallel computations high performance computing using graphics processors –GPUs; ! 2 Need more precision than standard available (up to few hundred bits) extend precision using floating-point expansions. ! 1/1 Target applications 1 Need massive parallel computations high performance computing using graphics processors –GPUs; ! 2 Need more precision than standard available (up to few hundred bits) extend precision using floating-point expansions. ! Chaotic dynamical systems: 0.4 bifurcation analysis; 0.2 0 x2 compute periodic orbits (e.g., finding sinks in the -0.2 Hénon map, iterating the Lorenz attractor); -0.4 -1.5 -1 -0.5 0 0.5 1 1.5 x1 celestial mechanics (e.g., long term stability of the solar system). Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization etc. 1/1 What we need: support for arbitrary precision; runs both on CPU and GPU; easy to use. CAMPARY –CudaAMultiplePrecisionARithmeticlibrarY– Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs. 2/1 Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs. What we need: support for arbitrary precision; runs both on CPU and GPU; easy to use. CAMPARY –CudaAMultiplePrecisionARithmeticlibrarY– 2/1 Pros: –usedirectlyavailableandhighlyoptimizednative FP infrastructure; – straightforwardly portable to highly parallel architectures, such as GPUs; –sufficientlysimple and regular algorithms for addition. Cons: –morethanonerepresentation; –existingmultiplication algorithms do not generalize well for an arbitrary number of terms; –difficultrigorouserroranalysis lack of thorough error bounds. ! CAMPARY (CudaA Multiple Precision ARithmetic librarY) Our approach: multiple-term representation –floating-pointexpansions– 3/1 Cons: –morethanonerepresentation; –existingmultiplication algorithms do not generalize well for an arbitrary number of terms; –difficultrigorouserroranalysis lack of thorough error bounds. ! CAMPARY (CudaA Multiple Precision ARithmetic librarY) Our approach: multiple-term representation –floating-pointexpansions– Pros: –usedirectlyavailableandhighlyoptimizednative FP infrastructure; – straightforwardly portable to highly parallel architectures, such as GPUs; –sufficientlysimple and regular algorithms for addition. 3/1 CAMPARY (CudaA Multiple Precision ARithmetic librarY) Our approach: multiple-term representation –floating-pointexpansions– Pros: –usedirectlyavailableandhighlyoptimizednative FP infrastructure; – straightforwardly portable to highly parallel architectures, such as GPUs; –sufficientlysimple and regular algorithms for addition. Cons: –morethanonerepresentation; –existingmultiplication algorithms do not generalize well for an arbitrary number of terms; –difficultrigorouserroranalysis lack of thorough error bounds. ! 3/1 Solution: the FP expansions are required to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u0,u1,...,un 1 if for all 0 <i<n,wehave ui ulp (ui 1). − | | − Example: p =5(in radix 2) x0 =1.1010e 2; x =1.1101e − 7; 1 − 8 x2 =1.0000e 11; <> x =1.1000e − 17. 3 − :> Restriction: n 12 for single-precision and n 39 for double-precision. Non-overlapping expansions R =1.11010011e 1 can be represented, using a p =5(in radix 2)system,as: − Less compact R = x0 + x1 + x2: R = y0 + y1 + y2 + y3 + y4 + y5: x0 =1.1000e 1; − y0 =1.0000e 1; x1 =1.0010e 3; − 8 − y1 =1.0000e 2; < x2 =1.0110e 6. 8 − − > y2 =1.0000e 3; Most compact R = z0 + z1: > − : > y3 =1.0000e 5; z0 =1.1101e 1; <> − − y4 =1.0000e 8; z1 =1.1000e 8. − ⇢ − > y5 =1.0000e 9. > − > :> 4/1 Restriction: n 12 for single-precision and n 39 for double-precision. Non-overlapping expansions R =1.11010011e 1 can be represented, using a p =5(in radix 2)system,as: − Less compact R = x0 + x1 + x2: R = y0 + y1 + y2 + y3 + y4 + y5: x0 =1.1000e 1; − y0 =1.0000e 1; x1 =1.0010e 3; − 8 − y1 =1.0000e 2; < x2 =1.0110e 6. 8 − − > y2 =1.0000e 3; Most compact R = z0 + z1: > − : > y3 =1.0000e 5; z0 =1.1101e 1; <> − − y4 =1.0000e 8; z1 =1.1000e 8. − ⇢ − y5 =1.0000e 9. > − > > Solution: the FP expansions are required: to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u0,u1,...,un 1 if for all 0 <i<n,wehave ui ulp (ui 1). − | | − Example: p =5(in radix 2) x0 =1.1010e 2; x =1.1101e − 7; 1 − 8 x2 =1.0000e 11; <> x =1.1000e − 17. 3 − :> 4/1 Non-overlapping expansions R =1.11010011e 1 can be represented, using a p =5(in radix 2)system,as: − Less compact R = x0 + x1 + x2: R = y0 + y1 + y2 + y3 + y4 + y5: x0 =1.1000e 1; − y0 =1.0000e 1; x1 =1.0010e 3; − 8 − y1 =1.0000e 2; < x2 =1.0110e 6. 8 − − > y2 =1.0000e 3; Most compact R = z0 + z1: > − : > y3 =1.0000e 5; z0 =1.1101e 1; <> − − y4 =1.0000e 8; z1 =1.1000e 8. − ⇢ − y5 =1.0000e 9. > − > > Solution: the FP expansions are required: to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u0,u1,...,un 1 if for all 0 <i<n,wehave ui ulp (ui 1). − | | − Example: p =5(in radix 2) x0 =1.1010e 2; x =1.1101e − 7; 1 − 8 x2 =1.0000e 11; <> x =1.1000e − 17. 3 − :> Restriction: n 12 for single-precision and n 39 for double-precision. 4/1 Error-Free Transforms: Fast2Sum & 2MultFMA Algorithm 1 (Fast2Sum (a, b)) Requirement: s RN (a + b) ea eb; z RN (s a) ≥ − e RN (b z) Uses 3 FP operations. − return (s, e) ! Requirement: Algorithm 2 (2MultFMA (a, b)) p RN (a b) ea + eb emin + p 1; · ≥ − e fma(a, b, p) − Uses 2 FP operations. return (p, e) ! 5/1 2 quad-double multiplication in QD library [Hida et.al. 07]: – does not straightforwardly generalize; –canleadto (n3) complexity; O –worstcaseerrorboundispessimistic; –nocorrectnessproofispublished. Existing multiplication algorithms 1 Priest’s multiplication [Pri91]: –verycomplexandcostly; –basedonscalarproducts; – uses re-normalization after each step; –computestheentireresultand“truncates”a-posteriori; – comes with an error bound and correctness proof. 6/1 Existing multiplication algorithms 1 Priest’s multiplication [Pri91]: –verycomplexandcostly; –basedonscalarproducts; – uses re-normalization after each step; –computestheentireresultand“truncates”a-posteriori; – comes with an error bound and correctness proof. 2 quad-double multiplication in QD library [Hida et.al. 07]: – does not straightforwardly generalize; –canleadto (n3) complexity; O –worstcaseerrorboundispessimistic; –nocorrectnessproofispublished. 6/1 New multiplication algorithms requires: ulp -nonoverlapping FP expansion x =(x0,x1,...,xn 1) and − y =(y0,y1,...,ym 1); − ensures: ulp -nonoverlapping FP expansion ⇡ =(⇡0,⇡1,...,⇡r 1). − Let me explain it with an example ... 7/1 paper-and-pencil intuition; on-the-fly “truncation”; term-times-expansion products, xi y; · error correction term, ⇡r. Example: n =4,m=3and r =4 8/1 term-times-expansion products, xi y; · error correction term, ⇡r. Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly “truncation”; 8/1 Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly “truncation”; term-times-expansion products, xi y; · error correction term, ⇡r. 8/1 the number of leading bits, `; accumulation done using a Fast2Sum and addition [Rump09]. Example: n =4,m=3and r =4 r p [ b· ]+2containers of size b (s.t. 3b>2p); b + c = p 1,s.t.wecanadd2c numbers without − error (binary64 b =45, binary32 b =18); ! ! starting exponent e = ex0 + ey0 ; each bin’s LSB has a fixed weight; e (i+1)b+p 1 bins initialized with 1.5 2 − − ; · 9/1 Example: n =4,m=3and r =4 r p [ b· ]+2containers of size b (s.t. 3b>2p); b + c = p 1,s.t.wecanadd2c numbers without − error (binary64 b =45, binary32 b =18); ! ! starting exponent e = ex0 + ey0 ; each bin’s LSB has a fixed weight; e (i+1)b+p 1 bins initialized with 1.5 2 − − ; · the number of leading bits, `; accumulation done using a Fast2Sum and addition [Rump09]. 9/1 subtract initial value; apply renormalization step to B: – Fast2Sum and branches; –rendertheresult ulp -nonoverlapping. Example: n =4,m=3and r =4 10 / 1 apply renormalization step to B: – Fast2Sum and branches; –rendertheresult ulp -nonoverlapping. Example: n =4,m=3and r =4 subtract initial value; 10 / 1 Example: n =4,m=3and r =4 subtract initial value; apply renormalization step to B: – Fast2Sum and branches; –rendertheresult ulp -nonoverlapping. 10 / 1 discarded partial products: r+1 –weaddonlythefirst k; k=1 m+n 2 − P (p 1)(r+1) 2 (p 1) m+n r 2 – x y x y 2− − − − − + − − ; i j 0 0 (1 2 (p 1))2 1 2 (p 1) k=r+1 i+j=k | | − − − − − − h i discardedP roundingP errors: –wecomputethelastr +1partial products using simple FP multiplication; (p 1)r p –weneglectr +1error terms bounded by x0y0 2− − 2− ; | | · error caused by the renormalization step: (p 1)r –errorlessorequalto x0y0 2− − . | | Tight error bound: (p 1)r p x y 2− − [1 + (r +1)2− + | 0 0| (p 1) (p 1) 2− − m + n r 2 +2− − − + − − ]. 0 (1 2 (p 1))2 1 2 (p 1) 1 − − − − − − @ A Error bound Exact result computed as: n 1;m 1 m+n 2 − − − xiyj = xiyj . i=0;Xj=0 kX=0 i+Xj=k On-the-fly “truncation” three error sources: ! 11 / 1 discarded rounding errors: –wecomputethelastr +1partial products using simple FP multiplication; (p 1)r p –weneglectr +1error terms bounded by x0y0 2− − 2− ; | | · error caused by the renormalization step: (p 1)r –errorlessorequalto x0y0 2− − .

A New Multiplication Algorithm for Extended Precision Using ﬂoating-Point Expansions

Arithmetic Algorithms for Extended Precision Using Floating-Point Expansions Mioara Joldes, Olivier Marty, Jean-Michel Muller, Valentina Popescu

Floating Points

Hacking in C 2020 the C Programming Language Thom Wiggers

Floating Point Formats

Extended Precision Floating Point Arithmetic

A Variable Precision Hardware Acceleration for Scientific Computing Andrea Bocco

Effectiveness of Floating-Point Precision on the Numerical Approximation by Spectral Methods

MIPS Floating Point

FPGA Based Quadruple Precision Floating Point Arithmetic for Scientific Computations

Object Pascal Language Guide

Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates

A New Multiplication Algorithm for Extended Precision Using ﬂoating-Point Expansions