Optimized Polar Decomposition for Modern Computer Architectures

Speaker: [email protected]. Joint work with the NLA Group. NLA Group Meeting, Manchester, UK, October 2, 2018. School of Mathematics, The University of Manchester.

Introduction

Context: NLAFET (H2020 project, ending April 2019)
• Task-based implementation of algorithms in the Plasma library
• Porting from the QUARK to the OpenMP runtime system
• Novel SVD algorithms:
  • 2-stage SVD: reduction to band form, then an eigensolver or SVD (D&C or QR)
  • Polar-decomposition-based SVD: QDWH iterations
• The benefit of QDWH appears only at large scale on many-core architectures

The Polar Decomposition

For every non-singular A ∈ C^{m×n} there exists a unique decomposition A = UH, where
• U ∈ C^{m×n} has n orthonormal columns
• H ∈ C^{n×n} is Hermitian positive semi-definite

Reverse (left) polar decomposition: A = H_ℓ U, with H_ℓ = UHU* ∈ C^{m×m}.

Relation with the SVD:
• The two decompositions are equivalent
• Proofs of existence of the polar decomposition rely on the SVD

Applications of the Polar Decomposition
• Matrix nearness problems [Higham, 1986]:
  • Nearest orthogonal matrix (orthogonal Procrustes problem)
  • Factor analysis, multidimensional scaling, ...
  • Optimization (gradient descent)
  • Nearby (H) or nearest (0.5(H + A)) positive definite matrix
• Matrix functions: square root, p-th root, ...
  • Example: square root of an SPD matrix A ∈ R^{n×n}: if A = LL^T (Cholesky) and L^T = UH (PD), then H = A^{1/2}

Application to Matrix Decompositions
• SVD of A ∈ C^{m×n}: if A = UH (PD) and H = V Σ V^T (EVD), then A = W Σ V^T (SVD) with W = UV
• QDWH-Eig (not detailed here):
  • Can target only a portion of the spectrum
  • Spectral divide-and-conquer algorithm (QDWH + subspace iterations)

Polar Decomposition Algorithms

Root-finding algorithms on the singular values of U.
• Scaled Newton iterations:
  • 9 iterations (2nd order)
  • Backward stable
• QR-based dynamically weighted Halley (QDWH) iterations:
  • 6 iterations (3rd order, Padé family)
  • Backward stable [Nakatsukasa et al., 2010]
  • Inverse-free and communication-friendly
  • Many cheap (Level-3 BLAS) flops
• Zolotarev iterations:
  • 2 iterations (order 17!)
  • More flops, but more parallelism
  • Well suited to high-granularity computing resources
  • Based on the best rational approximant of the matrix sign function; see [Nakatsukasa and Freund, 2014] or [Higham, 2008, Ch. 5]

State-of-the-Art Implementations

H. Ltaief, D. Sukkari and D. Keyes (KAUST):
• PD, PD-SVD and PD-Eig (PD = QDWH or Zolotarev)
• Distributed memory: ScaLAPACK + Chameleon + StarPU
• Massively parallel PD [Ltaief et al., 2018]:
  • Zolotarev up to 2.3× faster than QDWH
  • Cray XC40 system with 3,200 Intel 16-core Haswell nodes

Other implementations:
• QDWH in the Elemental library
• QDWH-(S,E)VD with ScaLAPACK + ELPA [Li et al., 2018]

QDWH-Based Matrix Decomposition

The QDWH iterations compute the polar factor U = lim_{k→∞} X_k:

[U] = qdwh(A, α, β, ε)
1   X_0 = A/α, ℓ_0 = β/α
2   k = 0
3   while |1 − ℓ_k| > ε
4       a_k = h(ℓ_k), b_k = g(a_k), c_k = a_k + b_k − 1
5       [Q_1; Q_2] R = qr([√(c_k) X_k; I_n])
6       X_{k+1} = (b_k/c_k) X_k + (1/√(c_k)) (a_k − b_k/c_k) Q_1 Q_2^H
7       ℓ_{k+1} = ℓ_k (a_k + b_k ℓ_k²)/(1 + c_k ℓ_k²)
8       k = k + 1
9   end
10  U = X_{k+1}

Convergence of the QDWH Iterations

The number of iterations depends on the conditioning κ_2(A) = 1/ℓ_0.
• Goal: map all singular values of X_0, which lie in [ℓ_0, 1], to 1
• Criterion: closeness of σ(X_k) to 1, i.e.
the distance |1 − ℓ_k|
• The number of iterations can be estimated a priori using
      σ_i(X_{k+1}) = σ_i(X_k) (a_k + b_k σ_i(X_k)²) / (1 + c_k σ_i(X_k)²)
• The parameters (a_k, b_k, c_k) → (3, 1, 3) and are chosen at each step to ensure cubic convergence
• In practice, fewer than 6 iterations suffice in double precision

Estimating the Parameters

α = ‖A‖_2 and ℓ_0 = β/α with β = 1/‖A^{−1}‖_2.
• Upper bound for α = σ_max(A):
  • use α̂ = ‖A‖_F, or
  • a 2-norm estimate based on power iterations (normest)
• Lower bound for ℓ_0 = σ_min(X_0):
  • ℓ̂_0 = ‖X_0‖_1/(√n κ_1(X_0))
  • estimate κ_1(X_0)
    • using condest [Higham and Tisseur, 2000] ((8/3)n³ flops),
    • or simply ‖X_0^{−1}‖_1 using a QR factorization + triangular solve ((5/3)n³ flops)
• A poor (over)estimate of ℓ_0 can increase the number of iterations

Optimized QR-based iteration [Nakatsukasa and Higham, 2013]:
• Re-use Q in the first QR-based iteration
• Exploit the identity structure of the bottom block to decrease the cost of a QR-based iteration from (6 + 2/3)n³ to 5n³

Fast Cholesky Iterations

Optimized QDWH iterations:

[U] = qdwh(A, α, β, ε)
1   [...]
2   if c_k < 100                  // PO-based iteration: 3mn² + n³/3
3       Z = I_n + c_k X_k^* X_k
4       W = chol(Z)
5       X_{k+1} = (b_k/c_k) X_k + (a_k − b_k/c_k) (X_k W^{−1}) W^{−*}
6   else                          // QR-based iteration: 5mn²
7       [...]

Cholesky-based iterations are not unconditionally stable [Nakatsukasa and Higham, 2012]:
• the forward error in X_{k+1} is bounded by c_k ε
• c_k →
3, and c_k ≤ 100 holds for k ≥ 2 for all practical ℓ_0
• Hence, switch to PO-based iterations as soon as c_k < 100

Matrix Decompositions from QDWH

QDWH-PD:
[U, H] = qdwh_pd(A)
1   U = qdwh(A)
2   H = U^H A                    // +2mn² flops
3   H = (H + H^H)/2

QDWH-SVD:
[W, Σ, V^H] = qdwh_svd(A)
1   [U, H] = qdwh_pd(A)
2   [V, Σ] = syev(H)             // +4n³ flops
3   W = UV                       // +2mn² flops

Additional cost:
• The PD needs 1 extra matrix multiplication (gemm)
• The SVD needs 2 extra gemms + 1 syev
• Both can be implemented with a similar memory footprint

Flops Count: QDWH-PD

Overall count¹ for QDWH-PD with m = n:

    (8 + 2/3) n³ #it_QR + (4 + 1/3) n³ #it_PO + 2n³

Nature of the iterations with respect to the condition number:

    κ_2          1          10–10²     10³–10⁵    10⁶–10¹³   10¹⁴–10¹⁶
    #it_QR       1          1          2          2          3
    #it_PO       3          4          3          4          3
    flops        23 + 2/3   28         32 + 1/3   36 + 2/3   41
    opt + it_QR  20 + 1/3   24 + 2/3   29         33 + 1/3   37 + 2/3

Table 1: Number of QR-based and PO-based iterations and flops counts (/n³) for QDWH-PD. Across practical condition numbers, QDWH-PD takes between 0 and 2 QR-based and between 2 and 4 PO-based iterations, i.e. between (10 + 2/3)n³ and (36 + 2/3)n³ flops, or (12 + 1/3)n³ to (33 + 1/3)n³ in the optimized variant.

¹ Without the (5/3)n³ for estimating ℓ_0, and without exploiting the trailing identity-matrix structure.

Flops Count: QDWH-SVD

• QDWH-PD: (10 + 2/3) ≤ #flops/n³ ≤ (36 + 2/3)
• Symmetric eigensolver + multiplication:
  • 2-stage-eig: 4/3 for eigenvalues only, 4 with vectors
  • QDWH-eig: (16 + 1/9) ≤ ... ≤ (50 + 7/9) for eigenvalues only, (17 + 4/9) ≤ ... ≤ (52 + 4/9) with vectors

                              Σ only                      U, Σ, V
    dges(vd/dd)               2 + 2/3                     22
    2-stage-svd               10/3 = 3 + 1/3              10/3 + 4 = 7 + 1/3
    qdwh-svd (+2-stage-eig)   12 ≤ ... ≤ 38               14 + 2/3 ≤ ... ≤ 40 + 2/3
    qdwh-svd (+qdwh-eig)      26 + 5/9 ≤ ... ≤ 52 + 5/9   27 + 8/9 ≤ ... ≤ 53 + 8/9

Table 2: Floating-point operation counts (/n³) for the SVD.
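As a concrete illustration of the iterations above, here is a minimal NumPy sketch of QDWH with the QR-based/Cholesky-based (PO) switch at c_k < 100. This is not the Plasma implementation: the function name is mine, the code is restricted to real matrices, and α and ℓ_0 are computed exactly from an SVD for clarity rather than estimated as in the optimized algorithm.

```python
import numpy as np

def qdwh_polar(A, eps=1e-15, maxit=20):
    """Polar factor of a real m x n (m >= n) matrix via QDWH iterations."""
    m, n = A.shape
    s = np.linalg.svd(A, compute_uv=False)
    alpha = s[0]                      # sigma_max(A); the slides estimate this
    X = A / alpha
    l = s[-1] / alpha                 # l_0 = 1/cond_2(A); also estimated in practice
    k = 0
    while abs(1 - l) > eps and k < maxit:
        # dynamically weighted Halley parameters; (a, b, c) -> (3, 1, 3)
        d = (4 * (1 - l**2) / l**4) ** (1.0 / 3.0)
        a = np.sqrt(1 + d) + 0.5 * np.sqrt(
            8 - 4 * d + 8 * (2 - l**2) / (l**2 * np.sqrt(1 + d)))
        b = (a - 1) ** 2 / 4
        c = a + b - 1
        if c < 100:
            # Cholesky-based (PO) iteration: Z = I + c X^T X, Z = L L^T
            Z = np.eye(n) + c * (X.T @ X)
            L = np.linalg.cholesky(Z)
            # X @ Z^{-1} via two triangular systems (generic solve used here;
            # an optimized code would call a triangular solve, trsm)
            Y = np.linalg.solve(L.T, np.linalg.solve(L, X.T)).T
            X = (b / c) * X + (a - b / c) * Y
        else:
            # inverse-free QR-based iteration on the stacked matrix
            Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
            X = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q[:m] @ Q[m:].T)
        l = l * (a + b * l**2) / (1 + c * l**2)
        k += 1
    return X
```

For an ill-conditioned input, the first iterations take the QR-based branch and the later ones the cheaper Cholesky-based branch, matching the switch discussed above.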
[Figure: flops counts (Gflops) for 2-stage-svd, qdwh-svd (qdwh + syev) and qdwh-svd (qdwh + qdwh-eig), for singular values only and for singular values and vectors, as a function of the matrix size n.]

Memory Footprint

QDWH-PD in real double precision stores:
• A ∈ R^{m×n}
• U ∈ R^{m×n} and H ∈ R^{n×n}
• B = [√(c_k) X_k; I_n] ∈ R^{(m+n)×n}
• Q = [Q_1; Q_2] ∈ R^{(m+n)×n}
i.e. 4mn + 3n² entries in total.

[Figure: memory footprint (GiB) as a function of the number of rows m, for m = n, m = 3n and m = 10n.]

Memory available on Intel nodes and NVIDIA accelerators:
• Intel KNL: up to 16 GiB of MCDRAM (depending on the mode)
• Haswell/Sandy Bridge: around 64 GiB
• Skylake: 64/128 GiB
• Tesla V100 GPU: 16/32 GiB

Experiments

Numerical experiments: real square matrices in double precision
• dlatms: prescribed condition number and spectrum
• dlarnv: entries sampled uniformly at random in [0, 1]
• m = n = 2,000, ..., 16,000

Computer architectures (Intel):
• Haswell: 20 cores
• Sandy Bridge: 16 cores
• KNL: 68 cores

Runtime systems:
• QUARK (Plasma-2.8)
• NEW: OpenMP (Plasma-17)

QDWH on Runtime Systems

[Figure: time (s) as a function of the matrix size n on the Intel Haswell (20 cores) for QUARK qdwh, QUARK qdwh opt., OpenMP qdwh and OpenMP qdwh opt.]

QDWH-SVD on Sandy Bridge

[Figure: time (s) as a function of the matrix size n on Sandy Bridge for 2-stage-sdd and qdwh-svd, for singular values only and for singular values and vectors.]

    κ_2     #flops(QDWH-SVD)/#flops(SDD)   Sandy Bridge   Haswell   KNL
    1       7.8                            1.5            1.4       1.6
    10¹⁶    13                             2.5            2.3       1.1

Table 3: Flop ratio vs. speedup for QDWH-SVD (with vectors) and n = 14,000.
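The 4mn + 3n² entry count from the memory-footprint slide above translates directly into the GiB figures the hardware list is compared against. A small helper (the function name is mine) to evaluate it:

```python
def qdwh_pd_footprint_gib(m, n, bytes_per_entry=8):
    """Memory footprint of QDWH-PD in real double precision:
    A and U (m*n entries each), H (n*n), plus the two stacked
    (m+n)*n workspaces B = [sqrt(c_k) X_k; I_n] and Q = [Q1; Q2],
    i.e. 4*m*n + 3*n*n entries in total."""
    return (4 * m * n + 3 * n * n) * bytes_per_entry / 2.0**30
```

For m = n = 20,000 this gives about 20.9 GiB, already above the 16 GiB of KNL MCDRAM or of a 16 GiB Tesla V100.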

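To close the loop on the QDWH-SVD recipe benchmarked above (polar decomposition, then a symmetric eigensolve of H, then one gemm), here is a self-contained NumPy sketch. The names are mine, and the polar factor is computed with a scaled Newton iteration, the first algorithm family mentioned earlier, standing in for QDWH.

```python
import numpy as np

def polar_newton(A, tol=1e-12, maxit=50):
    """Polar factor of a square nonsingular real A via scaled Newton
    iteration with (1, inf)-norm scaling."""
    X = A.astype(float)
    for _ in range(maxit):
        Xinv = np.linalg.inv(X)
        mu = (np.linalg.norm(Xinv, 1) * np.linalg.norm(Xinv, np.inf) /
              (np.linalg.norm(X, 1) * np.linalg.norm(X, np.inf))) ** 0.25
        Xnew = 0.5 * (mu * X + Xinv.T / mu)
        done = np.linalg.norm(Xnew - X, 1) <= tol * np.linalg.norm(Xnew, 1)
        X = Xnew
        if done:
            break
    return X

def pd_svd(A):
    """SVD from the polar decomposition, following the slides' recipe:
    A = U H, H = V diag(s) V^T  =>  A = (U V) diag(s) V^T."""
    U = polar_newton(A)
    H = U.T @ A                  # +2mn^2 gemm
    H = 0.5 * (H + H.T)          # enforce symmetry
    s, V = np.linalg.eigh(H)     # syev; eigenvalues in ascending order
    W = U @ V                    # +2mn^2 gemm
    return W, s, V
```

Note that the singular values come out in ascending order (the ordering of eigh), the reverse of the usual SVD convention.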