A New Multiplication Algorithm for Extended Precision Using ﬂoating-Point Expansions

A new multiplication algorithm for extended precision using floating-point expansions Valentina Popescu,Jean-MichelMuller,PingTakPeterTang ARITH 23 July 2016 CudAA MultipleM PrecisionP ARARithmetic librarY Chaotic dynamical systems: bifurcation analysis; compute periodic orbits (e.g., finding sinks in the Hénon map, iterating the Lorenz attractor); celestial mechanics (e.g., long term stability of the solar system). Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization etc. Target applications 1 Need massive parallel computations high performance computing using graphics processors –GPUs; ! 2 Need more precision than standard available (up to few hundred bits) extend precision using floating-point expansions. ! 1/1 Target applications 1 Need massive parallel computations high performance computing using graphics processors –GPUs; ! 2 Need more precision than standard available (up to few hundred bits) extend precision using floating-point expansions. ! Chaotic dynamical systems: 0.4 bifurcation analysis; 0.2 0 x2 compute periodic orbits (e.g., finding sinks in the -0.2 Hénon map, iterating the Lorenz attractor); -0.4 -1.5 -1 -0.5 0 0.5 1 1.5 x1 celestial mechanics (e.g., long term stability of the solar system). Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization etc. 1/1 What we need: support for arbitrary precision; runs both on CPU and GPU; easy to use. CAMPARY –CudaAMultiplePrecisionARithmeticlibrarY– Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs. 2/1 Extended precision Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs. What we need: support for arbitrary precision; runs both on CPU and GPU; easy to use. CAMPARY –CudaAMultiplePrecisionARithmeticlibrarY– 2/1 Pros: –usedirectlyavailableandhighlyoptimizednative FP infrastructure; – straightforwardly portable to highly parallel architectures, such as GPUs; –sufficientlysimple and regular algorithms for addition. Cons: –morethanonerepresentation; –existingmultiplication algorithms do not generalize well for an arbitrary number of terms; –difficultrigorouserroranalysis lack of thorough error bounds. ! CAMPARY (CudaA Multiple Precision ARithmetic librarY) Our approach: multiple-term representation –floating-pointexpansions– 3/1 Cons: –morethanonerepresentation; –existingmultiplication algorithms do not generalize well for an arbitrary number of terms; –difficultrigorouserroranalysis lack of thorough error bounds. ! CAMPARY (CudaA Multiple Precision ARithmetic librarY) Our approach: multiple-term representation –floating-pointexpansions– Pros: –usedirectlyavailableandhighlyoptimizednative FP infrastructure; – straightforwardly portable to highly parallel architectures, such as GPUs; –sufficientlysimple and regular algorithms for addition. 3/1 CAMPARY (CudaA Multiple Precision ARithmetic librarY) Our approach: multiple-term representation –floating-pointexpansions– Pros: –usedirectlyavailableandhighlyoptimizednative FP infrastructure; – straightforwardly portable to highly parallel architectures, such as GPUs; –sufficientlysimple and regular algorithms for addition. Cons: –morethanonerepresentation; –existingmultiplication algorithms do not generalize well for an arbitrary number of terms; –difficultrigorouserroranalysis lack of thorough error bounds. ! 3/1 Solution: the FP expansions are required to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u0,u1,...,un 1 if for all 0 <i<n,wehave ui ulp (ui 1). − | | − Example: p =5(in radix 2) x0 =1.1010e 2; x =1.1101e − 7; 1 − 8 x2 =1.0000e 11; <> x =1.1000e − 17. 3 − :> Restriction: n 12 for single-precision and n 39 for double-precision. Non-overlapping expansions R =1.11010011e 1 can be represented, using a p =5(in radix 2)system,as: − Less compact R = x0 + x1 + x2: R = y0 + y1 + y2 + y3 + y4 + y5: x0 =1.1000e 1; − y0 =1.0000e 1; x1 =1.0010e 3; − 8 − y1 =1.0000e 2; < x2 =1.0110e 6. 8 − − > y2 =1.0000e 3; Most compact R = z0 + z1: > − : > y3 =1.0000e 5; z0 =1.1101e 1; <> − − y4 =1.0000e 8; z1 =1.1000e 8. − ⇢ − > y5 =1.0000e 9. > − > :> 4/1 Restriction: n 12 for single-precision and n 39 for double-precision. Non-overlapping expansions R =1.11010011e 1 can be represented, using a p =5(in radix 2)system,as: − Less compact R = x0 + x1 + x2: R = y0 + y1 + y2 + y3 + y4 + y5: x0 =1.1000e 1; − y0 =1.0000e 1; x1 =1.0010e 3; − 8 − y1 =1.0000e 2; < x2 =1.0110e 6. 8 − − > y2 =1.0000e 3; Most compact R = z0 + z1: > − : > y3 =1.0000e 5; z0 =1.1101e 1; <> − − y4 =1.0000e 8; z1 =1.1000e 8. − ⇢ − y5 =1.0000e 9. > − > > Solution: the FP expansions are required: to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u0,u1,...,un 1 if for all 0 <i<n,wehave ui ulp (ui 1). − | | − Example: p =5(in radix 2) x0 =1.1010e 2; x =1.1101e − 7; 1 − 8 x2 =1.0000e 11; <> x =1.1000e − 17. 3 − :> 4/1 Non-overlapping expansions R =1.11010011e 1 can be represented, using a p =5(in radix 2)system,as: − Less compact R = x0 + x1 + x2: R = y0 + y1 + y2 + y3 + y4 + y5: x0 =1.1000e 1; − y0 =1.0000e 1; x1 =1.0010e 3; − 8 − y1 =1.0000e 2; < x2 =1.0110e 6. 8 − − > y2 =1.0000e 3; Most compact R = z0 + z1: > − : > y3 =1.0000e 5; z0 =1.1101e 1; <> − − y4 =1.0000e 8; z1 =1.1000e 8. − ⇢ − y5 =1.0000e 9. > − > > Solution: the FP expansions are required: to be non-overlapping. Definition: ulp -nonoverlapping. For an expansion u0,u1,...,un 1 if for all 0 <i<n,wehave ui ulp (ui 1). − | | − Example: p =5(in radix 2) x0 =1.1010e 2; x =1.1101e − 7; 1 − 8 x2 =1.0000e 11; <> x =1.1000e − 17. 3 − :> Restriction: n 12 for single-precision and n 39 for double-precision. 4/1 Error-Free Transforms: Fast2Sum & 2MultFMA Algorithm 1 (Fast2Sum (a, b)) Requirement: s RN (a + b) ea eb; z RN (s a) ≥ − e RN (b z) Uses 3 FP operations. − return (s, e) ! Requirement: Algorithm 2 (2MultFMA (a, b)) p RN (a b) ea + eb emin + p 1; · ≥ − e fma(a, b, p) − Uses 2 FP operations. return (p, e) ! 5/1 2 quad-double multiplication in QD library [Hida et.al. 07]: – does not straightforwardly generalize; –canleadto (n3) complexity; O –worstcaseerrorboundispessimistic; –nocorrectnessproofispublished. Existing multiplication algorithms 1 Priest’s multiplication [Pri91]: –verycomplexandcostly; –basedonscalarproducts; – uses re-normalization after each step; –computestheentireresultand“truncates”a-posteriori; – comes with an error bound and correctness proof. 6/1 Existing multiplication algorithms 1 Priest’s multiplication [Pri91]: –verycomplexandcostly; –basedonscalarproducts; – uses re-normalization after each step; –computestheentireresultand“truncates”a-posteriori; – comes with an error bound and correctness proof. 2 quad-double multiplication in QD library [Hida et.al. 07]: – does not straightforwardly generalize; –canleadto (n3) complexity; O –worstcaseerrorboundispessimistic; –nocorrectnessproofispublished. 6/1 New multiplication algorithms requires: ulp -nonoverlapping FP expansion x =(x0,x1,...,xn 1) and − y =(y0,y1,...,ym 1); − ensures: ulp -nonoverlapping FP expansion ⇡ =(⇡0,⇡1,...,⇡r 1). − Let me explain it with an example ... 7/1 paper-and-pencil intuition; on-the-fly “truncation”; term-times-expansion products, xi y; · error correction term, ⇡r. Example: n =4,m=3and r =4 8/1 term-times-expansion products, xi y; · error correction term, ⇡r. Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly “truncation”; 8/1 Example: n =4,m=3and r =4 paper-and-pencil intuition; on-the-fly “truncation”; term-times-expansion products, xi y; · error correction term, ⇡r. 8/1 the number of leading bits, `; accumulation done using a Fast2Sum and addition [Rump09]. Example: n =4,m=3and r =4 r p [ b· ]+2containers of size b (s.t. 3b>2p); b + c = p 1,s.t.wecanadd2c numbers without − error (binary64 b =45, binary32 b =18); ! ! starting exponent e = ex0 + ey0 ; each bin’s LSB has a fixed weight; e (i+1)b+p 1 bins initialized with 1.5 2 − − ; · 9/1 Example: n =4,m=3and r =4 r p [ b· ]+2containers of size b (s.t. 3b>2p); b + c = p 1,s.t.wecanadd2c numbers without − error (binary64 b =45, binary32 b =18); ! ! starting exponent e = ex0 + ey0 ; each bin’s LSB has a fixed weight; e (i+1)b+p 1 bins initialized with 1.5 2 − − ; · the number of leading bits, `; accumulation done using a Fast2Sum and addition [Rump09]. 9/1 subtract initial value; apply renormalization step to B: – Fast2Sum and branches; –rendertheresult ulp -nonoverlapping. Example: n =4,m=3and r =4 10 / 1 apply renormalization step to B: – Fast2Sum and branches; –rendertheresult ulp -nonoverlapping. Example: n =4,m=3and r =4 subtract initial value; 10 / 1 Example: n =4,m=3and r =4 subtract initial value; apply renormalization step to B: – Fast2Sum and branches; –rendertheresult ulp -nonoverlapping. 10 / 1 discarded partial products: r+1 –weaddonlythefirst k; k=1 m+n 2 − P (p 1)(r+1) 2 (p 1) m+n r 2 – x y x y 2− − − − − + − − ; i j 0 0 (1 2 (p 1))2 1 2 (p 1) k=r+1 i+j=k | | − − − − − − h i discardedP roundingP errors: –wecomputethelastr +1partial products using simple FP multiplication; (p 1)r p –weneglectr +1error terms bounded by x0y0 2− − 2− ; | | · error caused by the renormalization step: (p 1)r –errorlessorequalto x0y0 2− − . | | Tight error bound: (p 1)r p x y 2− − [1 + (r +1)2− + | 0 0| (p 1) (p 1) 2− − m + n r 2 +2− − − + − − ]. 0 (1 2 (p 1))2 1 2 (p 1) 1 − − − − − − @ A Error bound Exact result computed as: n 1;m 1 m+n 2 − − − xiyj = xiyj . i=0;Xj=0 kX=0 i+Xj=k On-the-fly “truncation” three error sources: ! 11 / 1 discarded rounding errors: –wecomputethelastr +1partial products using simple FP multiplication; (p 1)r p –weneglectr +1error terms bounded by x0y0 2− − 2− ; | | · error caused by the renormalization step: (p 1)r –errorlessorequalto x0y0 2− − .

A New Multiplication Algorithm for Extended Precision Using ﬂoating-Point Expansions

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support