Optimized Polar Decomposition for Modern Computer Architectures
Pierre.Blanchard@manchester.ac.uk†
NLA Group Meeting, Manchester, UK. October 2, 2018
†School of Mathematics, The University of Manchester

Introduction

Context
NLAFET (H2020 project), ending April 2019
• Task-based implementation of algorithms in the Plasma library
• Porting from the QUARK to the OpenMP runtime system
• Novel SVD algorithms:
  • 2-stage SVD: reduction to band form, then eigensolver or SVD (D&C or QR)
  • Polar decomposition-based SVD: QDWH iterations
• QDWH pays off only at large scale on many-core architectures
Polar Decomposition

The Polar Decomposition
For any full-rank A ∈ C^{m×n} (m ≥ n), there exists a unique decomposition

  A = UH,

where
• U ∈ C^{m×n} has orthonormal columns,
• H ∈ C^{n×n} is Hermitian positive semi-definite.

Reverse or left polar decomposition:

  A = H_ℓ U, with H_ℓ = UHU^* ∈ C^{m×m}.

Relation with the SVD
• The two decompositions are equivalent
• Proofs of existence of the PD rely on the SVD
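The equivalence with the SVD gives a direct (if expensive) way to obtain both polar factors. A minimal NumPy sketch (the function name `polar_via_svd` is ours, not part of the Plasma implementation discussed in this talk):

```python
import numpy as np

def polar_via_svd(A):
    """Polar decomposition A = U H via the SVD A = W diag(s) V^H.

    U = W V^H has orthonormal columns; H = V diag(s) V^H is
    Hermitian positive semi-definite.
    """
    W, s, Vh = np.linalg.svd(A, full_matrices=False)
    U = W @ Vh
    H = Vh.conj().T @ np.diag(s) @ Vh
    return U, H
```

This is the textbook construction the existence proofs rely on; the QDWH iterations below compute U without forming an SVD.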
Polar Decomposition

Applications
• Matrix nearness problems [Higham, 1986]
  • Nearest orthogonal matrix (Procrustes problem)
  • Factor analysis, multidimensional scaling, …
  • Optimization (gradient descent)
  • Nearby (H) or nearest ((H + A)/2) positive definite matrix
• Matrix functions: square root, p-th root, …
  • Example: square root of an SPD A ∈ R^{n×n}: if A = LL^T (Cholesky) and L^T = UH (PD), then H = A^{1/2}.
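The square-root example above can be checked in a few lines; a sketch with NumPy (the helper name `sqrtm_spd` is ours, and the polar factor of L^T is obtained here via its SVD rather than an iterative method):

```python
import numpy as np

def sqrtm_spd(A):
    """Square root of an SPD matrix via Cholesky + polar decomposition:
    A = L L^T and L^T = U H  =>  H = ((L^T)^* L^T)^{1/2} = A^{1/2}."""
    L = np.linalg.cholesky(A)          # A = L L^T
    _, s, Vh = np.linalg.svd(L.T)      # polar factor of L^T via its SVD
    return Vh.T @ np.diag(s) @ Vh      # Hermitian polar factor H = A^{1/2}
```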
Polar Decomposition

Application to Matrix Decompositions
• SVD of A ∈ C^{m×n}: if A = UH (PD) and H = VΣV^H (EVD), then A = WΣV^H (SVD) with W = UV.
• QDWH-eig (not detailed here)
  • Can target only a portion of the spectrum
  • Spectral divide-and-conquer algorithm (QDWH + subspace iterations)
Polar Decomposition

Algorithms
Root-finding algorithms on the singular values of U.
• Scaled Newton iterations
  • 9 iterations (2nd order)
  • Backward stable
• QR-based dynamically weighted Halley (QDWH) iterations
  • 6 iterations (3rd order, Padé family)
  • Backward stable [Nakatsukasa et al., 2010]
  • Inverse-free and communication-friendly
  • Many cheap (Level-3 BLAS) flops
• Zolotarev functions
  • 2 iterations (order 17!)
  • More flops, but more parallelism
  • Well-suited for high-granularity computing resources
  • Best rational approximant of the matrix sign function; see [Nakatsukasa and Freund, 2014] or [Higham, 2008, Ch. 5]
State-of-the-art implementations

H. Ltaief, D. Sukkari & D. Keyes (KAUST)
• PD, PD-SVD and PD-eig (PD = QDWH or Zolo)
• Distributed memory: ScaLAPACK + Chameleon + StarPU
• Massively parallel PD [Ltaief et al., 2018]
  • Zolo up to 2.3× faster than QDWH
  • Cray XC40 system with 3200 Intel 16-core Haswell nodes

Other implementations
• QDWH in the Elemental library
• QDWH-(S,E)VD with ScaLAPACK + ELPA [Li et al., 2018]
QDWH-based matrix decomposition

Algorithm, Convergence, Flops count
Polar Factor: U = lim_{k→∞} X_k

QDWH iterations: [U] = qdwh(A, α, β, ε)

1  X_0 = A/α, ℓ_0 = β/α
2  k = 0
3  while |1 − ℓ_k| > ε
4      a_k = h(ℓ_k), b_k = g(a_k), c_k = a_k + b_k − 1
5      [Q_1; Q_2] R = qr([√c_k X_k; I_n])
6      X_{k+1} = (b_k/c_k) X_k + (1/√c_k)(a_k − b_k/c_k) Q_1 Q_2^H
7      ℓ_{k+1} = ℓ_k (a_k + b_k ℓ_k²)/(1 + c_k ℓ_k²)
8      k = k + 1
9  end
10 U = X_{k+1}
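The iterations translate almost line-for-line into NumPy. A minimal sketch of the QR-based variant, using the standard dynamically weighted parameters a_k = h(ℓ_k), b_k = (a_k − 1)²/4, c_k = a_k + b_k − 1 from [Nakatsukasa et al., 2010] (the explicit formula for h is from that paper, not from this slide):

```python
import numpy as np

def qdwh(A, alpha, beta, eps=1e-15, maxit=20):
    """QR-based QDWH iterations for the polar factor U of A.

    alpha, beta: bounds on the extreme singular values, so that the
    singular values of X0 = A/alpha lie in [l0, 1] with l0 = beta/alpha.
    """
    m, n = A.shape
    X = A / alpha
    l = beta / alpha
    for _ in range(maxit):
        if abs(1.0 - l) <= eps:
            break
        # dynamically weighted Halley parameters
        d = (4.0 * (1.0 - l**2) / l**4) ** (1.0 / 3.0)
        a = np.sqrt(1.0 + d) + 0.5 * np.sqrt(
            8.0 - 4.0 * d + 8.0 * (2.0 - l**2) / (l**2 * np.sqrt(1.0 + d)))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        # QR of the stacked matrix [sqrt(c) X; I_n]
        Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
        Q1, Q2 = Q[:m], Q[m:]
        X = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.conj().T)
        l = l * (a + b * l**2) / (1.0 + c * l**2)
    return X
```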
Convergence of QDWH iterations

The number of iterations depends on the conditioning κ2(A) = 1/ℓ_0.
• Goal: map all singular values of X_0, lying in [ℓ_0, 1], to 1
• Criterion: closeness of σ(X_k) to 1, i.e. the distance |1 − ℓ_k|
• Estimate the number of iterations a priori using

  σ_i(X_{k+1}) = σ_i(X_k) (a_k + b_k σ_i(X_k)²) / (1 + c_k σ_i(X_k)²)

• Parameters: (a_k, b_k, c_k) → (3, 1, 3), optimized to ensure cubic convergence
• In practice, fewer than 6 iterations in double precision
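A scalar model of the singular-value map makes the cubic convergence visible. With the limiting parameters (a, b, c) = (3, 1, 3), i.e. Halley's iteration, each step maps x to x(3 + x²)/(1 + 3x²):

```python
def halley_map(x, a=3.0, b=1.0, c=3.0):
    """Scalar model of the QDWH singular-value map:
    sigma -> sigma (a + b sigma^2) / (1 + c sigma^2)."""
    return x * (a + b * x * x) / (1.0 + c * x * x)

x = 0.5
for k in range(4):
    x = halley_map(x)   # the error |1 - x| shrinks cubically
```

Starting from x = 0.5, four applications of the map drive the error below 10⁻¹²; the dynamic weights (a_k, b_k, c_k) accelerate the early steps when ℓ_k is still far from 1.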
Convergence of QDWH iterations

Estimating the parameters α = ‖A‖_2 and ℓ_0 = β/α, with β = 1/‖A^{-1}‖_2
• Upper bound for α = σ_max(A):
  • use α̂ = ‖A‖_F, or
  • a 2-norm estimate based on power iterations (normest)
• Lower bound for ℓ_0 = σ_min(X_0):
  • ℓ̂_0 = ‖X_0‖_1 / (√n κ_1(X_0))
  • estimate κ_1(X_0)
    • using condest [Higham and Tisseur, 2000] ((8/3)n³ flops)
    • or simply ‖X_0^{-1}‖_1 via QR + triangular solves ((5/3)n³ flops)
  • a poor (over-)estimate of ℓ_0 can increase the number of iterations
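A sketch of these bounds (the function name is ours, and the exact inverse stands in for the condest/normest estimators a large-scale code would use). Note ℓ̂_0 = ‖X_0‖_1/(√n κ_1(X_0)) simplifies to 1/(√n ‖X_0^{-1}‖_1), which underestimates σ_min(X_0) since ‖B‖_2 ≤ √n ‖B‖_1:

```python
import numpy as np

def qdwh_bounds(A):
    """Cheap bounds alpha >= sigma_max(A), l0 <= sigma_min(X0), square A.

    alpha_hat = ||A||_F bounds the 2-norm from above;
    l0_hat = ||X0||_1 / (sqrt(n) kappa_1(X0)) = 1 / (sqrt(n) ||X0^{-1}||_1)
    bounds sigma_min(X0) from below.
    """
    n = A.shape[1]
    alpha = np.linalg.norm(A, 'fro')
    X0 = A / alpha
    # exact 1-norm of the inverse; at scale one would use condest instead
    inv_norm1 = np.linalg.norm(np.linalg.inv(X0), 1)
    l0 = 1.0 / (np.sqrt(n) * inv_norm1)
    return alpha, l0
```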
Optimized QR iterations [Nakatsukasa and Higham, 2013]
• Re-use Q in the first QR iteration
• Exploit the trailing identity structure to decrease the QR-iteration cost from (6 + 2/3)n³ to 5n³
Fast Cholesky iterations

Optimized QDWH iterations: [U] = qdwh(A, α, β, ε)

1 [...]
2 if c_k < 100                      // PO-based iteration: 3mn² + n³/3 flops
3     Z = I_n + c_k X_k^H X_k
4     W = chol(Z)
5     X_{k+1} = (b_k/c_k) X_k + (a_k − b_k/c_k)(X_k W^{-1}) W^{-H}
6 else                              // QR-based iteration: 5mn² flops
7     [...]
Cholesky-based iterations can be unstable [Nakatsukasa and Higham, 2012]
• The forward error in X_{k+1} is bounded by c_k ε
• c_k → 3, and c_k ≤ 100 for k ≥ 2 for all practical ℓ_0
• Hence, switch to PO-based iterations once c_k < 100
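One PO-based step computes X_{k+1} = (b_k/c_k) X_k + (a_k − b_k/c_k) X_k Z^{-1} with Z = I + c_k X_k^H X_k factored by Cholesky. A NumPy sketch for the real case (the function name is ours, and `np.linalg.solve` stands in for the triangular solves a tuned implementation would use):

```python
import numpy as np

def qdwh_po_step(X, a, b, c):
    """One Cholesky-based (PO) QDWH step (real case), used once c_k < 100.

    Z = I + c X^T X = L L^T (Cholesky),
    X_{k+1} = (b/c) X + (a - b/c) X Z^{-1},
    computed via two triangular solves with L.
    """
    n = X.shape[1]
    Z = np.eye(n) + c * X.T @ X            # always SPD
    L = np.linalg.cholesky(Z)              # Z = L L^T, L lower triangular
    Y = np.linalg.solve(L, X.T)            # Y = L^{-1} X^T
    C = np.linalg.solve(L.T, Y).T          # C = X Z^{-1} = X L^{-T} L^{-1}
    return (b / c) * X + (a - b / c) * C
```

Algebraically this equals the Halley update X (aI + b X^T X)(I + c X^T X)^{-1}, but it replaces the inverse by one Cholesky factorization and two triangular solves.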
QDWH-PD and QDWH-SVD

[U, H] = qdwh-pd(A)
1 U = qdwh(A)
2 H = U^H A                // +2mn²
3 H = (H + H^H)/2

[W, Σ, V^H] = qdwh-svd(A)
1 [U, H] = qdwh-pd(A)
2 [V, Σ] = syev(H)         // +4n³
3 W = UV                   // +2mn²

Additional cost
• PD needs 1 extra matrix multiplication (gemm)
• SVD needs 2 extra gemms + 1 syev
• Both can be implemented with a similar memory footprint
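The QDWH-SVD recipe above can be sketched in NumPy with any polar-decomposition routine passed in as `polar` (a placeholder argument of ours; one could plug in a QDWH-based or SVD-based PD):

```python
import numpy as np

def qdwh_svd(A, polar):
    """SVD from a polar decomposition A = U H:
    H = V Sigma V^H (symmetric EVD) gives A = (U V) Sigma V^H."""
    U, H = polar(A)
    w, V = np.linalg.eigh(H)        # eigenvalues in ascending order
    Sigma, V = w[::-1], V[:, ::-1]  # descending, as in the SVD convention
    W = U @ V
    return W, Sigma, V
```

The eigenvalues of the Hermitian positive semi-definite factor H are exactly the singular values of A, which is why one symmetric eigensolve plus one gemm completes the SVD.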
Flops count: QDWH-PD

Overall count¹ of QDWH-PD with m = n:

  (8 + 2/3) n³ · #it_QR + (4 + 1/3) n³ · #it_PO + 2n³

Nature of the iterations with respect to the condition number:

| κ2     | 1      | 10¹–10² | 10³–10⁵ | 10⁶–10¹³ | 10¹⁴–10¹⁶ |
| #it_QR | 1      | 1       | 2       | 2        | 3         |
| #it_PO | 3      | 4       | 3       | 4        | 3         |
| flops  | 23+2/3 | 28      | 32+1/3  | 36+2/3   | 41        |
| opt.   | 20+1/3 | 24+2/3  | 29      | 33+1/3   | 37+2/3    |

Table 1: # of QR and PO iterations and flops count (/n³) for QDWH-PD.

¹ Without the (5/3)n³ flops for estimating ℓ_0 or exploiting the trailing identity matrix structure.

Flops count: QDWH-PD
| κ2     | 1      | 10¹–10² | 10³–10⁵ | 10⁶–10¹³ | 10¹⁴–10¹⁶ |
| #it_QR | 0      | …       | …       | …        | 2         |
| #it_PO | 2      | …       | …       | …        | 4         |
| flops  | 10+2/3 | …       | …       | …        | 36+2/3    |
| opt.   | 12+1/3 | …       | …       | …        | 33+1/3    |

Table 1 (condensed): # of QR and PO iterations and flops count (/n³) for QDWH-PD.
¹ Without the (5/3)n³ flops for estimating ℓ_0 or exploiting the trailing identity matrix structure.

Flops count: QDWH-SVD
• QDWH-PD: (10 + 2/3) ≤ #flops/n³ ≤ (36 + 2/3)
• Symmetric eigensolver + multiplication:
  • 2-stage-eig: 4/3, or 4 with vectors
  • QDWH-eig: (16 + 4/9) ≤ … ≤ (50 + 7/9), or (17 + 7/9) ≤ … ≤ (52 + 1/9) with vectors
|                         | Σ                       | U, Σ, V                  |
| dges(vd/dd)             | 2 + 2/3                 | 22                       |
| 2-stage-svd             | 10/3 = 3 + 1/3          | 3 + 1/3 + 4 = 7 + 1/3    |
| qdwh-svd (+2-stage-eig) | 12 ≤ … ≤ 38             | 14 + 2/3 ≤ … ≤ 40 + 2/3  |
| qdwh-svd (+QDWH-eig)    | 26 + 5/9 ≤ … ≤ 52 + 5/9 | 27 + 8/9 ≤ … ≤ 53 + 8/9  |

Table 2: Floating point operation counts (/n³) for the SVD.
Flops Count

[Figure: flops count (GFlops) vs matrix size n (/1,000), for singular values only (left) and singular values and vectors (right); curves: 2-stage-svd, qdwh-svd = qdwh + syev, qdwh-svd = qdwh + qdwh-eig.]
Memory footprint

Real double precision, QDWH-PD

Stored matrices:
• A ∈ R^{m×n}
• U ∈ R^{m×n} and H ∈ R^{n×n}
• B = [√c_k X_k; I_n] ∈ R^{(m+n)×n}
• Q = [Q_1; Q_2] ∈ R^{(m+n)×n}
⇒ 4mn + 3n² entries

[Figure: memory footprint (GiB) vs number of rows m (/1,000), for m = n, m = 3n and m = 10n.]

Memory available on Intel nodes or NVIDIA accelerators:
• Intel KNL: up to 16 GiB of MCDRAM (depending on mode)
• Haswell/Sandy Bridge: around 64 GiB
• Skylake: 64/128 GiB
• Tesla V100 GPU: 16/32 GiB
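The 4mn + 3n² entry count above gives a quick footprint estimate against these memory budgets (the helper name is ours):

```python
def qdwh_pd_footprint_gib(m, n, bytes_per_entry=8):
    """QDWH-PD working set: 4mn + 3n^2 entries
    (A, U: mn each; H: n^2; B and Q: (m+n)n each), in GiB."""
    return (4 * m * n + 3 * n * n) * bytes_per_entry / 2.0**30
```

For m = n = 20,000 in double precision this gives about 20.9 GiB, exceeding a single 16 GiB KNL MCDRAM or V100.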
Experiments

Runtime system, QR optimization, Architecture
Numerical Experiments

Real square matrices in double precision:
• dlatms: prescribed condition number and spectrum
• dlarnv: entries sampled uniformly at random in [0, 1]
• m = n = 2,000, …, 16,000

Computer architectures (Intel):
• Haswell: 20 cores
• Sandy Bridge: 16 cores
• KNL: 68 cores

Runtime systems:
• QUARK (Plasma 2.8)
• New OpenMP runtime (Plasma 17)
QDWH on runtime systems

Intel Haswell, 20 cores

[Figure: time (s) vs matrix size n (/1,000) for QUARK qdwh, QUARK qdwh opt., OpenMP qdwh and OpenMP qdwh opt.]
QDWH-SVD: Sandy Bridge

[Figure: time (s) vs matrix size n (/1,000) on Sandy Bridge, singular values only (left) and singular values and vectors (right); curves: 2-stage-sdd, qdwh-svd.]

| κ2    | #flops(QDWH-SVD)/#flops(SDD) | Sandy Bridge | Haswell | KNL |
| 1     | 7.8                          | 1.5          | 1.4     | 1.6 |
| 10¹⁶  | 13                           | 2.5          | 2.3     | 1.1 |

Table 3: Flop ratio vs speedup for QDWH-SVD (with vectors) and n = 14,000.

QDWH-SVD: Haswell vs KNL
[Figure: time (s) vs matrix size n (/1,000) on Haswell (top) and KNL (bottom), singular values only (left) and singular values and vectors (right); curves: 2-stage-svd, 2-stage-sdd, qdwh-svd.]

QDWH-SVD: Plasma vs MKL
[Figure: time (s) vs matrix size n (/1,000) on Haswell (top) and KNL (bottom), singular values only (left) and singular values and vectors (right); curves: dgesvd/dgesdd from Plasma, Lapack and MKL, and qdwh-svd (Plasma).]

Ongoing work & Perspectives
StarPU implementation
• Heterogeneous architectures
• Distributed-memory version
• Single/half precision
• Accelerators (NVIDIA GPUs)

Applications
• Multidimensional scaling: QDWH-PD and -eig
• Matrix p-th roots

Algorithms
• Mixed precision
  • Multiplications, Cholesky and QR
  • on high-end GPUs (with Tensor Core features)
• Zolotarev PD
  • 2 iterations
  • Extra flops are embarrassingly parallel
  • Larger memory footprint
Questions

Thank you for your attention.

References
N. J. Higham. Computing the polar decomposition—with applications. SIAM Journal on Scientific and Statistical Computing, 1986. doi: 10.1137/0907079.

N. J. Higham. Functions of Matrices: Theory and Computation. SIAM, 2008. doi: 10.1137/1.9780898717778.

N. J. Higham and F. Tisseur. A block algorithm for matrix 1-norm estimation, with an application to 1-norm pseudospectra. SIAM Journal on Matrix Analysis and Applications, 21(4):1185–1201, 2000. doi: 10.1137/S0895479899356080.

S. Li, J. Liu, and Y. Du. A new high performance and scalable SVD algorithm on distributed memory systems. CoRR, abs/1806.06204, 2018.

H. Ltaief, D. E. Sukkari, A. Esposito, Y. Nakatsukasa, and D. E. Keyes. Massively parallel polar decomposition on distributed-memory systems. 2018.

Y. Nakatsukasa and R. W. Freund. Using Zolotarev's rational approximation for computing the polar, symmetric eigenvalue, and singular value decompositions. SIAM Review, to appear, 2014.

Y. Nakatsukasa and N. J. Higham. Backward stability of iterations for computing the polar decomposition. SIAM Journal on Matrix Analysis and Applications, 33(2):460–479, 2012. doi: 10.1137/110857544.

Y. Nakatsukasa and N. J. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM Journal on Scientific Computing, 35(3):A1325–A1349, 2013. doi: 10.1137/120876605.

Y. Nakatsukasa, Z. Bai, and F. Gygi. Optimizing Halley's iteration for computing the matrix polar decomposition. SIAM Journal on Matrix Analysis and Applications, 31(5):2700–2720, 2010. doi: 10.1137/090774999.