<<

A new multiplication algorithm for extended precision using floating-point expansions

Valentina Popescu,Jean-MichelMuller,PingTakPeterTang

ARITH 23 July 2016

CudAA MultipleM PrecisionP ARARithmetic Chaotic dynamical systems: bifurcation analysis; compute periodic orbits (e.g., finding sinks in the H´enon map, iterating the Lorenz attractor); celestial mechanics (e.g., long term stability of the solar system). Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization etc.

Target applications

1 Need massive parallel computations high performance computing using graphics processors –GPUs; ! 2 Need more precision than standard available (up to few hundred ) extend precision using floating-point expansions. !

1/1 Target applications

1 Need massive parallel computations high performance computing using graphics processors –GPUs; ! 2 Need more precision than standard available (up to few hundred bits) extend precision using floating-point expansions. !

Chaotic dynamical systems: 0.4 bifurcation analysis; 0.2 0 x2

compute periodic orbits (e.g., finding sinks in the -0.2 H´enon map, iterating the Lorenz attractor); -0.4 -1.5 -1 -0.5 0 0.5 1 1.5 x1 celestial mechanics (e.g., long term stability of the solar system). Experimental mathematics: ill-posed SDP problems in computational geometry (e.g., computation of kissing numbers); quantum chemistry/information; polynomial optimization etc.

1/1 What we need: support for arbitrary precision; runs both on CPU and GPU; easy to use.

CAMPARY –CudaAMultiplePrecisionARithmeticlibrarY–

Extended precision

Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs.

2/1 Extended precision

Existing libraries: GNU MPFR - not ported on GPU; GARPREC & CUMP - tuned for big array operations: data generated on host, operations on device; QD & GQD - limited to double-double and quad-double; no correctness proofs.

What we need: support for arbitrary precision; runs both on CPU and GPU; easy to use.

CAMPARY –CudaAMultiplePrecisionARithmeticlibrarY–

2/1 Pros: –usedirectlyavailableandhighlyoptimizednative FP infrastructure; – straightforwardly portable to highly parallel architectures, such as GPUs; –sucientlysimple and regular algorithms for addition. : –morethanonerepresentation; –existingmultiplication algorithms do not generalize well for an arbitrary number of terms; –dicultrigorouserroranalysis lack of thorough error bounds. !

CAMPARY (CudaA Multiple Precision ARithmetic librarY)

Our approach: multiple-term representation –floating-pointexpansions–

3/1 Cons: –morethanonerepresentation; –existingmultiplication algorithms do not generalize well for an arbitrary number of terms; –dicultrigorouserroranalysis lack of thorough error bounds. !

CAMPARY (CudaA Multiple Precision ARithmetic librarY)

Our approach: multiple-term representation –floating-pointexpansions–

Pros: –usedirectlyavailableandhighlyoptimizednative FP infrastructure; – straightforwardly portable to highly parallel architectures, such as GPUs; –sucientlysimple and regular algorithms for addition.

3/1 CAMPARY (CudaA Multiple Precision ARithmetic librarY)

Our approach: multiple-term representation –floating-pointexpansions–

Pros: –usedirectlyavailableandhighlyoptimizednative FP infrastructure; – straightforwardly portable to highly parallel architectures, such as GPUs; –sucientlysimple and regular algorithms for addition. Cons: –morethanonerepresentation; –existingmultiplication algorithms do not generalize well for an arbitrary number of terms; –dicultrigorouserroranalysis lack of thorough error bounds. !

3/1 Solution: the FP expansions are required to be non-overlapping.

Definition: ulp -nonoverlapping.

For an expansion u0,u1,...,un 1 if for all 0 x =1.1000e 17. 3 :> Restriction: n 12 for single-precision and n 39 for double-precision.  

Non-overlapping expansions

R =1.11010011e 1 can be represented, using a p =5(in radix 2)system,as: Less compact R = x0 + x1 + x2: R = y0 + y1 + y2 + y3 + y4 + y5: x0 =1.1000e 1; y0 =1.0000e 1; x1 =1.0010e 3; 8 y1 =1.0000e 2; < x2 =1.0110e 6. 8 > y2 =1.0000e 3; Most compact R = z0 + z1: > : > y3 =1.0000e 5; z0 =1.1101e 1; <> y4 =1.0000e 8; z1 =1.1000e 8. ⇢ > y5 =1.0000e 9. > > :>

4/1 Restriction: n 12 for single-precision and n 39 for double-precision.  

Non-overlapping expansions

R =1.11010011e 1 can be represented, using a p =5(in radix 2)system,as: Less compact R = x0 + x1 + x2: R = y0 + y1 + y2 + y3 + y4 + y5: x0 =1.1000e 1; y0 =1.0000e 1; x1 =1.0010e 3; 8 y1 =1.0000e 2; < x2 =1.0110e 6. 8 > y2 =1.0000e 3; Most compact R = z0 + z1: > : > y3 =1.0000e 5; z0 =1.1101e 1; <> y4 =1.0000e 8; z1 =1.1000e 8. ⇢ y5 =1.0000e 9. > > > Solution: the FP expansions are required: to be non-overlapping.

Definition: ulp -nonoverlapping.

For an expansion u0,u1,...,un 1 if for all 0 x =1.1000e 17. 3 :>

4/1 Non-overlapping expansions

R =1.11010011e 1 can be represented, using a p =5(in radix 2)system,as: Less compact R = x0 + x1 + x2: R = y0 + y1 + y2 + y3 + y4 + y5: x0 =1.1000e 1; y0 =1.0000e 1; x1 =1.0010e 3; 8 y1 =1.0000e 2; < x2 =1.0110e 6. 8 > y2 =1.0000e 3; Most compact R = z0 + z1: > : > y3 =1.0000e 5; z0 =1.1101e 1; <> y4 =1.0000e 8; z1 =1.1000e 8. ⇢ y5 =1.0000e 9. > > > Solution: the FP expansions are required: to be non-overlapping.

Definition: ulp -nonoverlapping.

For an expansion u0,u1,...,un 1 if for all 0 x =1.1000e 17. 3 :> Restriction: n 12 for single-precision and n 39 for double-precision.  

4/1 Error-Free Transforms: Fast2Sum & 2MultFMA

Algorithm 1 (Fast2Sum (a, b)) Requirement: s RN (a + b) ea eb; z RN (s a) e RN (b z) Uses 3 FP operations. return (s, e) !

Requirement: Algorithm 2 (2MultFMA (a, b))

p RN (a b) ea + eb emin + p 1; · e fma(a, b, p) Uses 2 FP operations. return (p, e) !

5/1 2 quad-double multiplication in QD library [Hida et.al. 07]: – does not straightforwardly generalize; –canleadto (n3) complexity; O –worstcaseerrorboundispessimistic; –nocorrectnessproofispublished.

Existing multiplication algorithms

1 Priest’s multiplication [Pri91]: –verycomplexandcostly; –basedonscalarproducts; – uses re-normalization after each step; –computestheentireresultand“truncates”a-posteriori; – comes with an error bound and correctness proof.

6/1 Existing multiplication algorithms

1 Priest’s multiplication [Pri91]: –verycomplexandcostly; –basedonscalarproducts; – uses re-normalization after each step; –computestheentireresultand“truncates”a-posteriori; – comes with an error bound and correctness proof.

2 quad-double multiplication in QD library [Hida et.al. 07]: – does not straightforwardly generalize; –canleadto (n3) complexity; O –worstcaseerrorboundispessimistic; –nocorrectnessproofispublished.

6/1 New multiplication algorithms

requires: ulp -nonoverlapping FP expansion x =(x0,x1,...,xn 1) and y =(y0,y1,...,ym 1); ensures: ulp -nonoverlapping FP expansion ⇡ =(⇡0,⇡1,...,⇡r 1).

Let me explain it with an example ...

7/1 paper-and-pencil intuition; on-the-fly “truncation”;

term-times-expansion products, xi y; · error correction term, ⇡r.

Example: n =4,m=3and r =4

8/1 term-times-expansion products, xi y; · error correction term, ⇡r.

Example: n =4,m=3and r =4

paper-and-pencil intuition; on-the-fly “truncation”;

8/1 Example: n =4,m=3and r =4

paper-and-pencil intuition; on-the-fly “truncation”;

term-times-expansion products, xi y; · error correction term, ⇡r.

8/1 the number of leading bits, `; accumulation done using a Fast2Sum and addition [Rump09].

Example: n =4,m=3and r =4

r p [ b· ]+2containers of size b (s.t. 3b>2p); b + = p 1,s.t.wecanadd2c numbers without error (binary64 b =45, binary32 b =18); ! ! starting exponent e = ex0 + ey0 ; each bin’s LSB has a fixed weight; e (i+1)b+p 1 bins initialized with 1.5 2 ; ·

9/1 Example: n =4,m=3and r =4

r p [ b· ]+2containers of size b (s.t. 3b>2p); b + c = p 1,s.t.wecanadd2c numbers without error (binary64 b =45, binary32 b =18); ! ! starting exponent e = ex0 + ey0 ; each bin’s LSB has a fixed weight; e (i+1)b+p 1 bins initialized with 1.5 2 ; · the number of leading bits, `; accumulation done using a Fast2Sum and addition [Rump09].

9/1 subtract initial value; apply renormalization step to B: – Fast2Sum and branches; –rendertheresult ulp -nonoverlapping.

Example: n =4,m=3and r =4

10 / 1 apply renormalization step to B: – Fast2Sum and branches; –rendertheresult ulp -nonoverlapping.

Example: n =4,m=3and r =4

subtract initial value;

10 / 1 Example: n =4,m=3and r =4

subtract initial value; apply renormalization step to B: – Fast2Sum and branches; –rendertheresult ulp -nonoverlapping.

10 / 1 discarded partial products: r+1 –weaddonlythefirst k; k=1 m+n 2 P (p 1)(r+1) 2 (p 1) m+n r 2 – x y x y 2 + ; i j 0 0 (1 2 (p 1))2 1 2 (p 1) k=r+1 i+j=k | | h i discardedP roundingP errors: –wecomputethelastr +1partial products using simple FP multiplication; (p 1)r p –weneglectr +1error terms bounded by x0y0 2 2 ; | | · error caused by the renormalization step: (p 1)r –errorlessorequalto x0y0 2 . | | Tight error bound: (p 1)r p x y 2 [1 + (r +1)2 + | 0 0| (p 1) (p 1) 2 m + n r 2 +2 + ]. 0 (1 2 (p 1))2 1 2 (p 1) 1 @ A

Error bound Exact result computed as: n 1;m 1 m+n 2 xiyj = xiyj . i=0;Xj=0 kX=0 i+Xj=k

On-the-fly “truncation” three error sources: !

11 / 1 discarded rounding errors: –wecomputethelastr +1partial products using simple FP multiplication; (p 1)r p –weneglectr +1error terms bounded by x0y0 2 2 ; | | · error caused by the renormalization step: (p 1)r –errorlessorequalto x0y0 2 . | | Tight error bound: (p 1)r p x y 2 [1 + (r +1)2 + | 0 0| (p 1) (p 1) 2 m + n r 2 +2 + ]. 0 (1 2 (p 1))2 1 2 (p 1) 1 @ A

Error bound Exact result computed as: n 1;m 1 m+n 2 xiyj = xiyj . i=0;Xj=0 kX=0 i+Xj=k

On-the-fly “truncation” three error sources: ! discarded partial products: r+1 –weaddonlythefirst k; k=1 m+n 2 P (p 1)(r+1) 2 (p 1) m+n r 2 – x y x y 2 + ; i j 0 0 (1 2 (p 1))2 1 2 (p 1) k=r+1 i+j=k | | P P h i

11 / 1 error caused by the renormalization step: (p 1)r –errorlessorequalto x0y0 2 . | | Tight error bound: (p 1)r p x y 2 [1 + (r +1)2 + | 0 0| (p 1) (p 1) 2 m + n r 2 +2 + ]. 0 (1 2 (p 1))2 1 2 (p 1) 1 @ A

Error bound Exact result computed as: n 1;m 1 m+n 2 xiyj = xiyj . i=0;Xj=0 kX=0 i+Xj=k

On-the-fly “truncation” three error sources: ! discarded partial products: r+1 –weaddonlythefirst k; k=1 m+n 2 P (p 1)(r+1) 2 (p 1) m+n r 2 – x y x y 2 + ; i j 0 0 (1 2 (p 1))2 1 2 (p 1) k=r+1 i+j=k | | h i discardedP roundingP errors: –wecomputethelastr +1partial products using simple FP multiplication; (p 1)r p –weneglectr +1error terms bounded by x0y0 2 2 ; | | ·

11 / 1 Tight error bound: (p 1)r p x y 2 [1 + (r +1)2 + | 0 0| (p 1) (p 1) 2 m + n r 2 +2 + ]. 0 (1 2 (p 1))2 1 2 (p 1) 1 @ A

Error bound Exact result computed as: n 1;m 1 m+n 2 xiyj = xiyj . i=0;Xj=0 kX=0 i+Xj=k

On-the-fly “truncation” three error sources: ! discarded partial products: r+1 –weaddonlythefirst k; k=1 m+n 2 P (p 1)(r+1) 2 (p 1) m+n r 2 – x y x y 2 + ; i j 0 0 (1 2 (p 1))2 1 2 (p 1) k=r+1 i+j=k | | h i discardedP roundingP errors: –wecomputethelastr +1partial products using simple FP multiplication; (p 1)r p –weneglectr +1error terms bounded by x0y0 2 2 ; | | · error caused by the renormalization step: (p 1)r –errorlessorequalto x0y0 2 . | |

11 / 1 Error bound Exact result computed as: n 1;m 1 m+n 2 xiyj = xiyj . i=0;Xj=0 kX=0 i+Xj=k

On-the-fly “truncation” three error sources: ! discarded partial products: r+1 –weaddonlythefirst k; k=1 m+n 2 P (p 1)(r+1) 2 (p 1) m+n r 2 – x y x y 2 + ; i j 0 0 (1 2 (p 1))2 1 2 (p 1) k=r+1 i+j=k | | h i discardedP roundingP errors: –wecomputethelastr +1partial products using simple FP multiplication; (p 1)r p –weneglectr +1error terms bounded by x0y0 2 2 ; | | · error caused by the renormalization step: (p 1)r –errorlessorequalto x0y0 2 . | | Tight error bound: (p 1)r p x y 2 [1 + (r +1)2 + | 0 0| (p 1) (p 1) 2 m + n r 2 +2 + ]. 0 (1 2 (p 1))2 1 2 (p 1) 1 @ A 11 / 1 Table : Performance in Mops/s for multiplying two FP expansions on a Tesla K40c GPU, using CUDA 7.5 software architecture, running on a single thread of execution. precision not supported. ⇤

dx,dy,dr New algorithm QD 2, 2, 2 0.027 0.1043 1, 2, 2 0.365 0.1071 3, 3, 3 0.0149 ⇤ 2, 3, 3 0.0186 ⇤ 4, 4, 4 0.0103 0.0174 1, 4, 4 0.0215 0.0281 2, 4, 4 0.0142 ⇤ 8, 8, 8 0.0034 ⇤ 4, 8, 8 0.0048 ⇤ 16, 16, 16 0.001 ⇤

Comparison

Table : Worst case FP operation count when the input and output expansions are of size r. r 2 4 8 16 New algorithm 138 261 669 2103 Priest’s mul.[Pri91] 3174 16212 87432 519312

12 / 1 Comparison

Table : Worst case FP operation count when the input and output expansions are of size r. r 2 4 8 16 New algorithm 138 261 669 2103 Priest’s mul.[Pri91] 3174 16212 87432 519312

Table : Performance in Mops/s for multiplying two FP expansions on a Tesla K40c GPU, using CUDA 7.5 software architecture, running on a single thread of execution. precision not supported. ⇤

dx,dy,dr New algorithm QD 2, 2, 2 0.027 0.1043 1, 2, 2 0.365 0.1071 3, 3, 3 0.0149 ⇤ 2, 3, 3 0.0186 ⇤ 4, 4, 4 0.0103 0.0174 1, 4, 4 0.0215 0.0281 2, 4, 4 0.0142 ⇤ 8, 8, 8 0.0034 ⇤ 4, 8, 8 0.0048 ⇤ 16, 16, 16 0.001 ⇤ 12 / 1 Conclusions

CudAA MultipleM PrecisionP ARARithmetic librarY

Available online at: http://homepages.laas.fr/mmjoldes/campary/. algorithm with strong regularity; based on partial products accumulation; uses a fixed-point structure that is floating-point friendly; thorough error analysis and tight error bound; natural fit for GPUs; proved to be too complex for small precisions; performance gains with increased precision.

13 / 1 Comparison

Table : Performance in Mops/s for multiplying two FP expansions on the CPU; dx and dy represent the number of terms in the input expansions and dr is the size of the computed result. precision not supported ⇤

dx,dy,dr New algorithm QD MPFR 2, 2, 2 11.69 99.16 18.64 1, 2, 2 14.96 104.17 19.85 3, 3, 3 6.97 12.1 ⇤ 2, 3, 3 8.62 13.69 ⇤ 4, 4, 4 4.5 5.87 10.64 1, 4, 4 8.88 15.11 14.1 2, 4, 4 6.38 9.49 13.44 8, 8, 8 1.5 6.8 ⇤ 4, 8, 8 2.04 9.15 ⇤ 16, 16, 16 0.42 2.55 ⇤