
Optimized Polar Decomposition for Modern Computer Architectures

Pierre Blanchard† (pierre.blanchard@manchester.ac.uk), joint work with the NLA Group.
NLA Group Meeting, Manchester, UK. October 2, 2018

†School of Mathematics, The University of Manchester

Introduction

Context

NLAFET (H2020 project), ending April 2019:

• Task-based implementation of algorithms in the PLASMA library
• Porting from QUARK to the OpenMP runtime system
• Novel SVD algorithms:
  • 2-stage SVD: reduction to band form, then eigensolver or SVD (D&C or QR)
  • Polar-decomposition-based SVD: QDWH iterations
• Benefit of QDWH only at large scale on many-core architectures

Polar Decomposition

The Polar Decomposition

For any full-rank A ∈ C^{m×n} (m ≥ n), there exists a unique decomposition

    A = UH

where

• U ∈ C^{m×n} has orthonormal columns,
• H ∈ C^{n×n} is Hermitian positive semi-definite.

Reverse or left PD:

    A = H_ℓ U, with H_ℓ = U H U^H ∈ C^{m×m}

Relation with the SVD

• The two decompositions are equivalent.
• Proofs of the PD rely on the SVD.

Polar Decomposition

Applications

• Matrix nearness problems [Higham, 1986]
  • Nearest orthogonal matrix (Procrustes problem)
    • Factor analysis, multidimensional scaling, . . .
    • Optimization (gradient descent)
  • Nearby (H) or nearest (½(H + A)) positive semi-definite matrix
• Matrix functions: square root, p-th root, . . .
  • Example: square root of an SPD matrix A ∈ R^{n×n}:
    if A = LL^T (Cholesky) and L^T = UH (PD), then H = A^{1/2} (see the sketch below).
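This square-root trick is easy to verify numerically. A minimal sketch, assuming NumPy/SciPy (scipy.linalg.polar computes the polar factors directly, rather than via QDWH):

    # Sketch: if A = L L^T (Cholesky) and L^T = U H (PD), then H = A^{1/2}.
    import numpy as np
    from scipy.linalg import cholesky, polar

    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5))
    A = M @ M.T + 5 * np.eye(5)        # a well-conditioned SPD test matrix

    L = cholesky(A, lower=True)        # A = L L^T
    U, H = polar(L.T)                  # L^T = U H (right polar decomposition)
    assert np.allclose(H @ H, A)       # H is the principal square root of A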

Polar Decomposition

Application to Matrix Decompositions

• SVD of A ∈ C^{m×n}:
  if A = UH (PD) and H = VΣV^T (EVD), then A = WΣV^T (SVD) with W = UV
  (a concrete sketch follows the list below).

• QDWH-Eig (not detailed here)
  • Computes only a portion of the spectrum at a time
  • Spectral D&C algorithm (QDWH + subspace iterations)
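To make the PD-to-SVD relation concrete, a minimal NumPy/SciPy sketch (again with scipy.linalg.polar standing in for QDWH):

    # Sketch: A = U H (PD), H = V Σ V^T (EVD)  =>  A = W Σ V^T with W = U V.
    import numpy as np
    from scipy.linalg import polar

    rng = np.random.default_rng(1)
    A = rng.standard_normal((8, 5))

    U, H = polar(A)                    # polar decomposition A = U H
    lam, V = np.linalg.eigh(H)         # EVD of H, eigenvalues ascending
    Sigma, V = lam[::-1], V[:, ::-1]   # reorder to descending singular values
    W = U @ V                          # left singular vectors
    assert np.allclose((W * Sigma) @ V.T, A)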

Polar Decomposition

Algorithms

Root-finding iterations on the singular values, driving them to 1 (the singular values of U).

• Scaled Newton iterations
  • 9 iterations (2nd order)
  • Backward stable
• QR-based (dynamically weighted) Halley iterations
  • 6 iterations (3rd order, Padé family)
  • Backward stable [Nakatsukasa et al., 2010]
  • Inverse-free and communication-friendly
  • Many cheap (Level-3 BLAS) flops
• Zolotarev functions
  • 2 iterations (order 17!)
  • More flops but more parallelism
  • Well suited to high-granularity computing resources!
  • Best rational approximant of the matrix sign function; see [Nakatsukasa and Freund, 2014] or [Higham, 2008, Ch. 5].

State-of-the-art implementations

H. Ltaief, D. Sukkari & D. Keyes (KAUST)

• PD, PD-SVD and PD-Eig (PD = QDWH or Zolo)
• Distributed memory: ScaLAPACK + Chameleon + StarPU
• Massively parallel PD [Ltaief et al., 2018]
  • Zolo up to 2.3× faster than QDWH
  • Cray XC40 system with 3,200 Intel 16-core Haswell nodes

Other implementations

• QDWH in the Elemental library
• QDWH-(S/E)VD with ScaLAPACK + ELPA [Li et al., 2018]

QDWH-Based Matrix Decomposition

Algorithm, convergence, flops count.

Polar Factor: U = lim_{k→∞} X_k

QDWH iterations: [U] = qdwh(A, α, β, ε)

    X_0 = A/α, ℓ_0 = β/α
    k = 0
    while |1 − ℓ_k| > ε
        a_k = h(ℓ_k), b_k = g(a_k), c_k = a_k + b_k − 1
        [Q_1; Q_2] R = qr([√c_k X_k ; I_n])
        X_{k+1} = (b_k/c_k) X_k + (1/√c_k)(a_k − b_k/c_k) Q_1 Q_2^H
        ℓ_{k+1} = ℓ_k (a_k + b_k ℓ_k^2)/(1 + c_k ℓ_k^2)
        k = k + 1
    end
    U = X_k
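A bare NumPy translation of the iteration above, for illustration only (not the blocked, task-based PLASMA code); the weighting functions h and g are written out using the formulas from [Nakatsukasa et al., 2010]:

    # QR-based QDWH sketch, assuming alpha >= ||A||_2 and beta <= sigma_min(A).
    import numpy as np

    def qdwh(A, alpha, beta, eps=np.finfo(float).eps):
        m, n = A.shape
        X = A / alpha
        l = beta / alpha                        # lower bound on sigma_min(X)
        while abs(1 - l) > eps:
            # dynamically weighted Halley parameters a = h(l), b = g(a)
            d = (4 * (1 - l**2) / l**4) ** (1 / 3)
            a = np.sqrt(1 + d) + 0.5 * np.sqrt(
                8 - 4 * d + 8 * (2 - l**2) / (l**2 * np.sqrt(1 + d)))
            b = (a - 1)**2 / 4
            c = a + b - 1
            # [Q1; Q2] R = qr([sqrt(c) X; I_n])
            Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
            Q1, Q2 = Q[:m], Q[m:]
            X = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.conj().T)
            l = l * (a + b * l**2) / (1 + c * l**2)
        return X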

Convergence of QDWH iterations

The number of iterations depends on the conditioning κ_2(A) = 1/ℓ_0.

• Goal: map all singular values of X_0, lying in [ℓ_0, 1], to 1
• Criterion: closeness of σ(X_k) to 1, i.e. the distance |1 − ℓ_k|
• Estimate the number of iterations a priori using

    σ_i(X_{k+1}) = σ_i(X_k) (a_k + b_k σ_i(X_k)^2)/(1 + c_k σ_i(X_k)^2)

• Parameters (a_k, b_k, c_k) → (3, 1, 3), optimized to ensure cubic convergence
• In practice, fewer than 6 iterations in double precision (see the scalar check below)
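The iteration-count claim can be checked on the scalar recurrence for ℓ_k alone, using the same parameter formulas as in the sketch above:

    # Scalar check: even for kappa_2 = 1e16 (l_0 = 1e-16), l_k reaches 1
    # to double precision in about 6 iterations.
    import numpy as np

    l, its = 1e-16, 0
    while abs(1 - l) > np.finfo(float).eps:
        d = (4 * (1 - l**2) / l**4) ** (1 / 3)
        a = np.sqrt(1 + d) + 0.5 * np.sqrt(
            8 - 4 * d + 8 * (2 - l**2) / (l**2 * np.sqrt(1 + d)))
        b = (a - 1)**2 / 4
        c = a + b - 1
        l = l * (a + b * l**2) / (1 + c * l**2)
        its += 1
    print(its)    # expected: about 6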

Convergence of QDWH iterations

Estimating the parameters α = ‖A‖_2 and ℓ_0 = β/α with β = 1/‖A^{-1}‖_2

• Upper bound α ≥ σ_max(A):
  • can use α̂ = ‖A‖_F, or
  • a 2-norm estimate based on power iterations (normest)
• Lower bound ℓ_0 ≤ σ_min(X_0):
  • ℓ̂_0 = ‖X_0‖_1/(√n κ_1(X_0))
  • estimate κ_1(X_0)
    • using condest [Higham and Tisseur, 2000] (8/3 n^3 flops)
    • or simply ‖X_0^{-1}‖_1 using QR + triangular solves (5/3 n^3 flops)
  • a poor (over-)estimate of ℓ_0 can increase the number of iterations (see the sketch below)
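A rough sketch of these estimates, assuming NumPy and a square A; for brevity X_0^{-1} is formed explicitly, whereas a real implementation would use a condest-style estimator or QR plus triangular solves as listed above:

    # alpha ~ ||A||_2 via the Frobenius norm; l0 via the 1-norm bound.
    import numpy as np

    def qdwh_parameters(A):
        n = A.shape[1]
        alpha = np.linalg.norm(A, 'fro')     # cheap upper bound on sigma_max(A)
        X0 = A / alpha
        # kappa_1(X0) = ||X0||_1 ||X0^{-1}||_1 (explicit inverse: illustration only)
        kappa1 = np.linalg.norm(X0, 1) * np.linalg.norm(np.linalg.inv(X0), 1)
        l0 = np.linalg.norm(X0, 1) / (np.sqrt(n) * kappa1)
        return alpha, l0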

Optimized QR iterations [Nakatsukasa and Higham, 2013]

• Re-use Q in the first QR iteration
• Exploit the identity structure to decrease the QR iteration cost from (6 + 2/3) n^3 to 5 n^3

Fast Cholesky iterations

Optimized QDWH iterations: [U] = qdwh(A, α, β, ε)

    [...]
    if c_k < 100                        // PO-based iteration: 3mn^2 + n^3/3 flops
        Z = I_n + c_k X_k^H X_k
        W = chol(Z)
        X_{k+1} = (b_k/c_k) X_k + (a_k − b_k/c_k) X_k W^{-1} W^{-H}
    else                                // QR-based iteration: 5mn^2 flops
        [...]
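One Cholesky-based step can be sketched with two triangular solves in place of the explicit inverses (NumPy/SciPy; the parameters a, b, c as in the earlier QDWH sketch):

    # PO-based step: X_{k+1} = (b/c) X + (a - b/c) X Z^{-1}, Z = I + c X^H X.
    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def po_iteration(X, a, b, c):
        n = X.shape[1]
        Z = np.eye(n) + c * (X.conj().T @ X)
        W = cholesky(Z)                  # upper triangular, Z = W^H W
        # solve Z S = X^H with two triangular solves; then X Z^{-1} = S^H
        S = solve_triangular(W, solve_triangular(W, X.conj().T, trans='C'))
        return (b / c) * X + (a - b / c) * S.conj().T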

Cholesky-based iterations are not stable [Nakatsukasa and Higham, 2012]

• The forward error in X_{k+1} is bounded by c_k ε

• c_k → 3, and c_k ≤ 100 for k ≥ 2 for all practical ℓ_0

• Hence, switch to PO-based iterations when c_k < 100


QDWH-PD: [U, H] = qdwh-pd(A)
    1. U = qdwh(A)
    2. H = U^H A                    (+2mn^2)
    3. H = (H + H^H)/2

QDWH-SVD: [W, Σ, V] = qdwh-svd(A)
    1. [U, H] = qdwh-pd(A)
    2. [V, Σ] = syev(H)             (+4n^3)
    3. W = UV                       (+2mn^2)

Additional cost

• PD needs 1 extra matrix multiplication (gemm)
• SVD needs 2 extra gemms + 1 syev
• Both can be implemented with a similar memory footprint (a Python sketch of the composition follows)
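Composed in Python, reusing the qdwh() sketch from earlier, with numpy.linalg.eigh standing in for syev:

    # QDWH-PD and QDWH-SVD per the listings above.
    import numpy as np

    def qdwh_pd(A, alpha, beta):
        U = qdwh(A, alpha, beta)           # polar factor (QDWH iterations)
        H = U.conj().T @ A                 # one extra gemm
        return U, (H + H.conj().T) / 2     # explicit symmetrization

    def qdwh_svd(A, alpha, beta):
        U, H = qdwh_pd(A, alpha, beta)
        lam, V = np.linalg.eigh(H)         # syev; eigenvalues ascending
        Sigma, V = lam[::-1], V[:, ::-1]   # descending singular values
        return U @ V, Sigma, V             # W = U V, one extra gemm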

Flops count: QDWH-PD

Overall count¹ of QDWH-PD with m = n:

    (8 + 2/3) n^3 #it_QR + (4 + 1/3) n^3 #it_PO + 2 n^3

Nature of the iterations with respect to the condition number:

    κ_2         | 1        | 10^1–10^2 | 10^3–10^5 | 10^6–10^13 | 10^14–10^16
    #it_QR      | 1        | 1         | 2         | 2          | 3
    #it_PO      | 3        | 4         | 3         | 4          | 3
    flops       | 23 + 2/3 | 28        | 32 + 1/3  | 36 + 2/3   | 41
    opt. it_QR  | 20 + 1/3 | 24 + 2/3  | 29        | 33 + 1/3   | 37 + 2/3

Table 1: Number of QR and PO iterations and flops count (/n^3) for QDWH-PD.

¹ Without the 5/3 n^3 for estimating ℓ_0 or exploiting the trailing identity matrix structure.

Overall ranges, from κ_2 = 1 to κ_2 = 10^16:

    #it_QR      | 0 ≤ . . . ≤ 2
    #it_PO      | 2 ≤ . . . ≤ 4
    flops       | 10 + 2/3 ≤ . . . ≤ 36 + 2/3
    opt. it_QR  | 12 + 1/3 ≤ . . . ≤ 33 + 1/3
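The table entries can be reproduced directly from the count above; a tiny sanity check:

    # Flops (in units of n^3) for QDWH-PD from the iteration counts.
    def qdwh_pd_flops(n_qr, n_po):
        return (8 + 2/3) * n_qr + (4 + 1/3) * n_po + 2

    assert abs(qdwh_pd_flops(1, 3) - (23 + 2/3)) < 1e-9   # kappa_2 = 1 column
    assert abs(qdwh_pd_flops(0, 2) - (10 + 2/3)) < 1e-9   # best case
    assert abs(qdwh_pd_flops(2, 4) - (36 + 2/3)) < 1e-9   # worst case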

Flops count: QDWH-SVD

• QDWH-PD: (10 + 2/3) ≤ #flops/n^3 ≤ (36 + 2/3)
• Symmetric eigensolver + multiplication:
  • 2-stage-eig: 4/3 (Σ only), 4 with vectors
  • QDWH-eig: (16 + 1/9) ≤ . . . ≤ (50 + 7/9), and (17 + 4/9) ≤ . . . ≤ (52 + 4/9) with vectors

                              Σ                              U, Σ, V
    dges(vd/dd)               2 + 2/3                        22
    2-stage-svd               10/3 = 3 + 1/3                 10/3 + 4 = 7 + 1/3
    QDWH-svd (+2-stage-eig)   12 ≤ . . . ≤ 38                14 + 2/3 ≤ . . . ≤ 40 + 2/3
    QDWH-svd (+QDWH-eig)      26 + 5/9 ≤ . . . ≤ 52 + 5/9    27 + 8/9 ≤ . . . ≤ 53 + 8/9

Table 2: Floating point operation counts (/n^3) for the SVD.

Flops Count

[Figure: flops count (GFlops) vs. matrix size n (/1,000, up to 20), comparing 2-stage-svd, qdwh-svd (qdwh + syev) and qdwh-svd (qdwh + qdwh-eig). Left panel: singular values only; right panel: singular values and vectors.]

Memory footprint

Real double precision - QDWH-PD

Stored matrices:

• A ∈ R^{m×n}
• U ∈ R^{m×n} and H ∈ R^{n×n}
• B = [√c_k X_k ; I_n] ∈ R^{(m+n)×n}
• Q = [Q_1; Q_2] ∈ R^{(m+n)×n}

⇒ 4mn + 3n^2 stored entries (see the sketch below)

[Figure: memory footprint (GiB) vs. number of rows m (/1,000, up to 20), for m = n, m = 3n and m = 10n.]

Memory available on Intel nodes or NVIDIA accelerators:

• Intel KNL: up to 16 GiB of MCDRAM (depending on mode)
• Haswell/Sandy Bridge: around 64 GiB
• Skylake: 64/128 GiB
• Tesla V100 GPU: 16/32 GiB
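The footprint formula translates directly into a quick check of where the hardware limits bite:

    # 4mn + 3n^2 double-precision entries, in GiB.
    def qdwh_pd_footprint_gib(m, n):
        return (4 * m * n + 3 * n * n) * 8 / 2**30    # 8 bytes per double

    # A square n = 20,000 problem needs ~20.9 GiB: beyond KNL's 16 GiB of
    # MCDRAM, but comfortable on a 64 GiB Haswell node.
    print(round(qdwh_pd_footprint_gib(20_000, 20_000), 1))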

Experiments

Runtime system, QR optimization, architecture.

Numerical Experiments

Real square matrices in double precision

• dlatms: prescribed condition number and spectrum (see the sketch below)
• dlarnv: entries sampled uniformly at random in [0, 1]
• m = n = 2,000, . . . , 16,000
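dlatms is a LAPACK testing routine with no common Python wrapper; an equivalent construction with a prescribed condition number and a geometric spectrum might look as follows (illustrative only, not the exact routine used in the experiments):

    # Random n x n matrix with condition number `cond` and geometric spectrum.
    import numpy as np
    from scipy.stats import ortho_group

    def test_matrix(n, cond, seed=0):
        rng = np.random.default_rng(seed)
        U = ortho_group.rvs(n, random_state=rng)
        V = ortho_group.rvs(n, random_state=rng)
        sigma = np.logspace(0, -np.log10(cond), n)   # from 1 down to 1/cond
        return (U * sigma) @ V.T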

Computer architectures (Intel):
• Haswell: 20 cores
• Sandy Bridge: 16 cores
• KNL: 68 cores

Runtime systems:
• QUARK (PLASMA 2.8)
• NEW: OpenMP (PLASMA 17)

QDWH on runtime systems

[Figure: Intel Haswell, 20 cores. Time (s) vs. matrix size n (/1,000, up to 12) for QUARK qdwh, QUARK qdwh opt., OpenMP qdwh and OpenMP qdwh opt.]

QDWH-SVD: Sandy Bridge

[Figure: Sandy Bridge. Time (s) vs. matrix size n (/1,000, up to 20) for 2-stage-sdd and qdwh-svd. Left panel: singular values only; right panel: singular values and vectors.]

    κ_2     | #flops(QDWH-SVD)/#flops(SDD) | Sandy Bridge | Haswell | KNL
    1       | 7.8                          | 1.5          | 1.4     | 1.6
    10^16   | 13                           | 2.5          | 2.3     | 1.1

Table 3: Flops ratio vs. speedup for QDWH-SVD (with vectors) and n = 14,000.

QDWH-SVD: Haswell vs KNL

[Figure: Haswell (top) and KNL (bottom). Time (s) vs. matrix size n (/1,000, up to 16) for 2-stage-svd, 2-stage-sdd and qdwh-svd. Left panels: singular values only; right panels: singular values and vectors.]

QDWH-SVD: PLASMA vs MKL

[Figure: Haswell (top) and KNL (bottom). Time (s) vs. matrix size n (/1,000, up to 16) for dgesvd/dgesdd from PLASMA, LAPACK and MKL, and qdwh-svd (PLASMA). Left panels: singular values only; right panels: singular values and vectors.]

Ongoing work & Perspectives


StarPU implementation:

• Heterogeneous architectures
• Distributed-memory version
• Single/half precision
• Accelerators (NVIDIA GPUs)

Applications:

• Multidimensional scaling
• QDWH-PD and QDWH-Eig
• Matrix p-th roots

Algorithms:

• Mixed precision
  • multiplications, Cholesky and QR
  • on high-end GPUs (with Tensor Core features)
• Zolotarev PD
  • 2 iterations
  • extra flops are embarrassingly parallel
  • larger memory footprint

Questions

Thank you for your attention.

References

N. J. Higham. Computing the polar decomposition with applications. SIAM Journal on Scientific and Statistical Computing, 7(4):1160–1174, 1986. doi: 10.1137/0907079.

N. J. Higham. Functions of Matrices: Theory and Computation. SIAM, Philadelphia, PA, 2008. ISBN 978-0-898716-46-7. doi: 10.1137/1.9780898717778.

N. J. Higham and F. Tisseur. A block algorithm for matrix 1-norm estimation, with an application to 1-norm pseudospectra. SIAM Journal on Matrix Analysis and Applications, 21(4):1185–1201, 2000. doi: 10.1137/S0895479899356080.

S. Li, J. Liu, and Y. Du. A new high performance and scalable SVD algorithm on distributed memory systems. CoRR, abs/1806.06204, 2018.

H. Ltaief, D. E. Sukkari, A. Esposito, Y. Nakatsukasa, and D. E. Keyes. Massively parallel polar decomposition on distributed-memory systems. 2018.

Y. Nakatsukasa and R. W. Freund. Using Zolotarev's rational approximation for computing the polar, symmetric eigenvalue, and singular value decompositions. SIAM Rev., to appear, 2014.

Y. Nakatsukasa and N. J. Higham. Backward stability of iterations for computing the polar decomposition. SIAM Journal on Matrix Analysis and Applications, 33(2):460–479, 2012. doi: 10.1137/110857544.

Y. Nakatsukasa and N. J. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM Journal on Scientific Computing, 35(3):A1325–A1349, 2013. doi: 10.1137/120876605.

Y. Nakatsukasa, Z. Bai, and F. Gygi. Optimizing Halley's iteration for computing the matrix polar decomposition. SIAM Journal on Matrix Analysis and Applications, 31(5):2700–2720, 2010. doi: 10.1137/090774999.
