
Optimized Polar Decomposition for Modern Computer Architectures

Pierre Blanchard† (pierre.blanchard@manchester.ac.uk), joint work with the NLA Group.
NLA Group Meeting, Manchester, UK. October 2, 2018

†School of Mathematics, The University of Manchester

Introduction

Context

NLAFET (H2020 project), ending April 2019:

• Task-based implementation of algorithms in the PLASMA library
• Porting from QUARK to the OpenMP runtime system
• Novel SVD algorithms:
  • 2-stage SVD: reduction to band form, then eigensolver or SVD (D&C or QR)
  • Polar-decomposition-based SVD: QDWH iterations
• Benefit of QDWH only at large scale on many-core architectures

Polar Decomposition

The Polar Decomposition

For any full-rank A ∈ C^{m×n} (m ≥ n), there exists a unique decomposition

    A = UH

where

• U ∈ C^{m×n} has orthonormal columns,
• H ∈ C^{n×n} is Hermitian positive semi-definite.

Reverse or left PD:

    A = H_ℓ U, with H_ℓ = U H U^H ∈ C^{m×m}

Relation with the SVD

• The two decompositions are equivalent.
• Proofs of the PD rely on the SVD.

Polar Decomposition

Applications

• Matrix nearness problems [Higham, 1986]
  • Nearest orthogonal matrix (Procrustes problem)
    • Factor analysis, multidimensional scaling, . . .
    • Optimization (gradient descent)
  • Nearby (H) or nearest (½(H + A)) positive semi-definite matrix
• Matrix functions: square root, p-th root, . . .
  • Example: square root of an SPD matrix A ∈ R^{n×n}:
    if A = LL^T (Cholesky) and L^T = UH (PD), then H = A^{1/2} (see the sketch below).
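This square-root trick is easy to verify numerically. A minimal sketch, assuming NumPy/SciPy (scipy.linalg.polar computes the polar factors directly, rather than via QDWH):

    # Sketch: if A = L L^T (Cholesky) and L^T = U H (PD), then H = A^{1/2}.
    import numpy as np
    from scipy.linalg import cholesky, polar

    rng = np.random.default_rng(0)
    M = rng.standard_normal((5, 5))
    A = M @ M.T + 5 * np.eye(5)        # a well-conditioned SPD test matrix

    L = cholesky(A, lower=True)        # A = L L^T
    U, H = polar(L.T)                  # L^T = U H (right polar decomposition)
    assert np.allclose(H @ H, A)       # H is the principal square root of A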

Polar Decomposition

Application to Matrix Decompositions

• SVD of A ∈ C^{m×n}:
  if A = UH (PD) and H = VΣV^T (EVD), then A = WΣV^T (SVD) with W = UV
  (a concrete sketch follows the list below).

• QDWH-Eig (not detailed here)
  • Computes only a portion of the spectrum at a time
  • Spectral D&C algorithm (QDWH + subspace iterations)
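To make the PD-to-SVD relation concrete, a minimal NumPy/SciPy sketch (again with scipy.linalg.polar standing in for QDWH):

    # Sketch: A = U H (PD), H = V Σ V^T (EVD)  =>  A = W Σ V^T with W = U V.
    import numpy as np
    from scipy.linalg import polar

    rng = np.random.default_rng(1)
    A = rng.standard_normal((8, 5))

    U, H = polar(A)                    # polar decomposition A = U H
    lam, V = np.linalg.eigh(H)         # EVD of H, eigenvalues ascending
    Sigma, V = lam[::-1], V[:, ::-1]   # reorder to descending singular values
    W = U @ V                          # left singular vectors
    assert np.allclose((W * Sigma) @ V.T, A)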

Polar Decomposition

Algorithms

Root-finding iterations on the singular values, driving them to 1 (the singular values of U).

• Scaled Newton iterations
  • 9 iterations (2nd order)
  • Backward stable
• QR-based (dynamically weighted) Halley iterations
  • 6 iterations (3rd order, Padé family)
  • Backward stable [Nakatsukasa et al., 2010]
  • Inverse-free and communication-friendly
  • Many cheap (Level-3 BLAS) flops
• Zolotarev functions
  • 2 iterations (order 17!)
  • More flops but more parallelism
  • Well suited to high-granularity computing resources!
  • Best rational approximant of the matrix sign function; see [Nakatsukasa and Freund, 2014] or [Higham, 2008, Ch. 5].

State-of-the-art implementations

H. Ltaief, D. Sukkari & D. Keyes (KAUST)

• PD, PD-SVD and PD-Eig (PD = QDWH or Zolo)
• Distributed memory: ScaLAPACK + Chameleon + StarPU
• Massively parallel PD [Ltaief et al., 2018]
  • Zolo up to 2.3× faster than QDWH
  • Cray XC40 system with 3,200 Intel 16-core Haswell nodes

Other implementations

• QDWH in the Elemental library
• QDWH-(S/E)VD with ScaLAPACK + ELPA [Li et al., 2018]

QDWH-Based Matrix Decomposition

Algorithm, convergence, flops count.

Polar Factor: U = lim_{k→∞} X_k

QDWH iterations: [U] = qdwh(A, α, β, ε)

    X_0 = A/α, ℓ_0 = β/α
    k = 0
    while |1 − ℓ_k| > ε
        a_k = h(ℓ_k), b_k = g(a_k), c_k = a_k + b_k − 1
        [Q_1; Q_2] R = qr([√c_k X_k ; I_n])
        X_{k+1} = (b_k/c_k) X_k + (1/√c_k)(a_k − b_k/c_k) Q_1 Q_2^H
        ℓ_{k+1} = ℓ_k (a_k + b_k ℓ_k^2)/(1 + c_k ℓ_k^2)
        k = k + 1
    end
    U = X_k
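A bare NumPy translation of the iteration above, for illustration only (not the blocked, task-based PLASMA code); the weighting functions h and g are written out using the formulas from [Nakatsukasa et al., 2010]:

    # QR-based QDWH sketch, assuming alpha >= ||A||_2 and beta <= sigma_min(A).
    import numpy as np

    def qdwh(A, alpha, beta, eps=np.finfo(float).eps):
        m, n = A.shape
        X = A / alpha
        l = beta / alpha                        # lower bound on sigma_min(X)
        while abs(1 - l) > eps:
            # dynamically weighted Halley parameters a = h(l), b = g(a)
            d = (4 * (1 - l**2) / l**4) ** (1 / 3)
            a = np.sqrt(1 + d) + 0.5 * np.sqrt(
                8 - 4 * d + 8 * (2 - l**2) / (l**2 * np.sqrt(1 + d)))
            b = (a - 1)**2 / 4
            c = a + b - 1
            # [Q1; Q2] R = qr([sqrt(c) X; I_n])
            Q, _ = np.linalg.qr(np.vstack([np.sqrt(c) * X, np.eye(n)]))
            Q1, Q2 = Q[:m], Q[m:]
            X = (b / c) * X + (a - b / c) / np.sqrt(c) * (Q1 @ Q2.conj().T)
            l = l * (a + b * l**2) / (1 + c * l**2)
        return X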

Convergence of QDWH iterations

The number of iterations depends on the conditioning κ_2(A) = 1/ℓ_0.

• Goal: map all singular values of X_0, lying in [ℓ_0, 1], to 1
• Criterion: closeness of σ(X_k) to 1, i.e. the distance |1 − ℓ_k|
• Estimate the number of iterations a priori using

    σ_i(X_{k+1}) = σ_i(X_k) (a_k + b_k σ_i(X_k)^2)/(1 + c_k σ_i(X_k)^2)

• Parameters (a_k, b_k, c_k) → (3, 1, 3), optimized to ensure cubic convergence
• In practice, fewer than 6 iterations in double precision (see the scalar check below)
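The iteration-count claim can be checked on the scalar recurrence for ℓ_k alone, using the same parameter formulas as in the sketch above:

    # Scalar check: even for kappa_2 = 1e16 (l_0 = 1e-16), l_k reaches 1
    # to double precision in about 6 iterations.
    import numpy as np

    l, its = 1e-16, 0
    while abs(1 - l) > np.finfo(float).eps:
        d = (4 * (1 - l**2) / l**4) ** (1 / 3)
        a = np.sqrt(1 + d) + 0.5 * np.sqrt(
            8 - 4 * d + 8 * (2 - l**2) / (l**2 * np.sqrt(1 + d)))
        b = (a - 1)**2 / 4
        c = a + b - 1
        l = l * (a + b * l**2) / (1 + c * l**2)
        its += 1
    print(its)    # expected: about 6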

Convergence of QDWH iterations

Estimating the parameters α = ‖A‖_2 and ℓ_0 = β/α with β = 1/‖A^{-1}‖_2

• Upper bound α ≥ σ_max(A):
  • can use α̂ = ‖A‖_F, or
  • a 2-norm estimate based on power iterations (normest)
• Lower bound ℓ_0 ≤ σ_min(X_0):
  • ℓ̂_0 = ‖X_0‖_1/(√n κ_1(X_0))
  • estimate κ_1(X_0)
    • using condest [Higham and Tisseur, 2000] (8/3 n^3 flops)
    • or simply ‖X_0^{-1}‖_1 using QR + triangular solves (5/3 n^3 flops)
  • a poor (over-)estimate of ℓ_0 can increase the number of iterations (see the sketch below)
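A rough sketch of these estimates, assuming NumPy and a square A; for brevity X_0^{-1} is formed explicitly, whereas a real implementation would use a condest-style estimator or QR plus triangular solves as listed above:

    # alpha ~ ||A||_2 via the Frobenius norm; l0 via the 1-norm bound.
    import numpy as np

    def qdwh_parameters(A):
        n = A.shape[1]
        alpha = np.linalg.norm(A, 'fro')     # cheap upper bound on sigma_max(A)
        X0 = A / alpha
        # kappa_1(X0) = ||X0||_1 ||X0^{-1}||_1 (explicit inverse: illustration only)
        kappa1 = np.linalg.norm(X0, 1) * np.linalg.norm(np.linalg.inv(X0), 1)
        l0 = np.linalg.norm(X0, 1) / (np.sqrt(n) * kappa1)
        return alpha, l0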

Optimized QR iterations [Nakatsukasa and Higham, 2013]

• Re-use Q in the first QR iteration
• Exploit the identity structure to decrease the QR iteration cost from (6 + 2/3) n^3 to 5 n^3

Fast Cholesky iterations

Optimized QDWH iterations: [U] = qdwh(A, α, β, ε)

    [...]
    if c_k < 100                        // PO-based iteration: 3mn^2 + n^3/3 flops
        Z = I_n + c_k X_k^H X_k
        W = chol(Z)
        X_{k+1} = (b_k/c_k) X_k + (a_k − b_k/c_k) X_k W^{-1} W^{-H}
    else                                // QR-based iteration: 5mn^2 flops
        [...]
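One Cholesky-based step can be sketched with two triangular solves in place of the explicit inverses (NumPy/SciPy; the parameters a, b, c as in the earlier QDWH sketch):

    # PO-based step: X_{k+1} = (b/c) X + (a - b/c) X Z^{-1}, Z = I + c X^H X.
    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    def po_iteration(X, a, b, c):
        n = X.shape[1]
        Z = np.eye(n) + c * (X.conj().T @ X)
        W = cholesky(Z)                  # upper triangular, Z = W^H W
        # solve Z S = X^H with two triangular solves; then X Z^{-1} = S^H
        S = solve_triangular(W, solve_triangular(W, X.conj().T, trans='C'))
        return (b / c) * X + (a - b / c) * S.conj().T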

Cholesky-based iterations are not stable [Nakatsukasa and Higham, 2012]

• The forward error in X_{k+1} is bounded by c_k ε

• c_k → 3, and c_k ≤ 100 for k ≥ 2 for all practical ℓ_0

• Hence, switch to PO-based iterations when c_k < 100


QDWH-PD: [U, H] = qdwh-pd(A)
    1. U = qdwh(A)
    2. H = U^H A                    (+2mn^2)
    3. H = (H + H^H)/2

QDWH-SVD: [W, Σ, V] = qdwh-svd(A)
    1. [U, H] = qdwh-pd(A)
    2. [V, Σ] = syev(H)             (+4n^3)
    3. W = UV                       (+2mn^2)

Additional cost

• PD needs 1 extra matrix multiplication (gemm)
• SVD needs 2 extra gemms + 1 syev
• Both can be implemented with a similar memory footprint (a Python sketch of the composition follows)
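Composed in Python, reusing the qdwh() sketch from earlier, with numpy.linalg.eigh standing in for syev:

    # QDWH-PD and QDWH-SVD per the listings above.
    import numpy as np

    def qdwh_pd(A, alpha, beta):
        U = qdwh(A, alpha, beta)           # polar factor (QDWH iterations)
        H = U.conj().T @ A                 # one extra gemm
        return U, (H + H.conj().T) / 2     # explicit symmetrization

    def qdwh_svd(A, alpha, beta):
        U, H = qdwh_pd(A, alpha, beta)
        lam, V = np.linalg.eigh(H)         # syev; eigenvalues ascending
        Sigma, V = lam[::-1], V[:, ::-1]   # descending singular values
        return U @ V, Sigma, V             # W = U V, one extra gemm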

Flops count: QDWH-PD

Overall count¹ of QDWH-PD with m = n:

    (8 + 2/3) n^3 #it_QR + (4 + 1/3) n^3 #it_PO + 2 n^3

Nature of the iterations with respect to the condition number:

    κ_2         | 1        | 10^1–10^2 | 10^3–10^5 | 10^6–10^13 | 10^14–10^16
    #it_QR      | 1        | 1         | 2         | 2          | 3
    #it_PO      | 3        | 4         | 3         | 4          | 3
    flops       | 23 + 2/3 | 28        | 32 + 1/3  | 36 + 2/3   | 41
    opt. it_QR  | 20 + 1/3 | 24 + 2/3  | 29        | 33 + 1/3   | 37 + 2/3

Table 1: Number of QR and PO iterations and flops count (/n^3) for QDWH-PD.

¹ Without the 5/3 n^3 for estimating ℓ_0 or exploiting the trailing identity matrix structure.

Overall ranges, from κ_2 = 1 to κ_2 = 10^16:

    #it_QR      | 0 ≤ . . . ≤ 2
    #it_PO      | 2 ≤ . . . ≤ 4
    flops       | 10 + 2/3 ≤ . . . ≤ 36 + 2/3
    opt. it_QR  | 12 + 1/3 ≤ . . . ≤ 33 + 1/3
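The table entries can be reproduced directly from the count above; a tiny sanity check:

    # Flops (in units of n^3) for QDWH-PD from the iteration counts.
    def qdwh_pd_flops(n_qr, n_po):
        return (8 + 2/3) * n_qr + (4 + 1/3) * n_po + 2

    assert abs(qdwh_pd_flops(1, 3) - (23 + 2/3)) < 1e-9   # kappa_2 = 1 column
    assert abs(qdwh_pd_flops(0, 2) - (10 + 2/3)) < 1e-9   # best case
    assert abs(qdwh_pd_flops(2, 4) - (36 + 2/3)) < 1e-9   # worst case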

Flops count: QDWH-SVD

• QDWH-PD: (10 + 2/3) ≤ #flops/n^3 ≤ (36 + 2/3)
• Symmetric eigensolver + multiplication:
  • 2-stage-eig: 4/3 (Σ only), 4 with vectors
  • QDWH-eig: (16 + 1/9) ≤ . . . ≤ (50 + 7/9), and (17 + 4/9) ≤ . . . ≤ (52 + 4/9) with vectors

                              Σ                              U, Σ, V
    dges(vd/dd)               2 + 2/3                        22
    2-stage-svd               10/3 = 3 + 1/3                 10/3 + 4 = 7 + 1/3
    QDWH-svd (+2-stage-eig)   12 ≤ . . . ≤ 38                14 + 2/3 ≤ . . . ≤ 40 + 2/3
    QDWH-svd (+QDWH-eig)      26 + 5/9 ≤ . . . ≤ 52 + 5/9    27 + 8/9 ≤ . . . ≤ 53 + 8/9

Table 2: Floating point operation counts (/n^3) for the SVD.

Flops Count

[Figure: flops count (GFlops) vs. matrix size n (/1,000, up to 20), comparing 2-stage-svd, qdwh-svd (qdwh + syev) and qdwh-svd (qdwh + qdwh-eig). Left panel: singular values only; right panel: singular values and vectors.]

Memory footprint

Real double precision - QDWH-PD

Stored matrices:

• A ∈ R^{m×n}
• U ∈ R^{m×n} and H ∈ R^{n×n}
• B = [√c_k X_k ; I_n] ∈ R^{(m+n)×n}
• Q = [Q_1; Q_2] ∈ R^{(m+n)×n}

⇒ 4mn + 3n^2 stored entries (see the sketch below)

[Figure: memory footprint (GiB) vs. number of rows m (/1,000, up to 20), for m = n, m = 3n and m = 10n.]

Memory available on Intel nodes or NVIDIA accelerators:

• Intel KNL: up to 16 GiB of MCDRAM (depending on mode)
• Haswell/Sandy Bridge: around 64 GiB
• Skylake: 64/128 GiB
• Tesla V100 GPU: 16/32 GiB
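The footprint formula translates directly into a quick check of where the hardware limits bite:

    # 4mn + 3n^2 double-precision entries, in GiB.
    def qdwh_pd_footprint_gib(m, n):
        return (4 * m * n + 3 * n * n) * 8 / 2**30    # 8 bytes per double

    # A square n = 20,000 problem needs ~20.9 GiB: beyond KNL's 16 GiB of
    # MCDRAM, but comfortable on a 64 GiB Haswell node.
    print(round(qdwh_pd_footprint_gib(20_000, 20_000), 1))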

Experiments

Runtime system, QR optimization, architecture.

Numerical Experiments

Real square matrices in double precision

• dlatms: prescribed condition number and spectrum (see the sketch below)
• dlarnv: entries sampled uniformly at random in [0, 1]
• m = n = 2,000, . . . , 16,000
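dlatms is a LAPACK testing routine with no common Python wrapper; an equivalent construction with a prescribed condition number and a geometric spectrum might look as follows (illustrative only, not the exact routine used in the experiments):

    # Random n x n matrix with condition number `cond` and geometric spectrum.
    import numpy as np
    from scipy.stats import ortho_group

    def test_matrix(n, cond, seed=0):
        rng = np.random.default_rng(seed)
        U = ortho_group.rvs(n, random_state=rng)
        V = ortho_group.rvs(n, random_state=rng)
        sigma = np.logspace(0, -np.log10(cond), n)   # from 1 down to 1/cond
        return (U * sigma) @ V.T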

Computer architectures (Intel):
• Haswell: 20 cores
• Sandy Bridge: 16 cores
• KNL: 68 cores

Runtime systems:
• QUARK (PLASMA 2.8)
• NEW: OpenMP (PLASMA 17)

QDWH on runtime systems

[Figure: Intel Haswell, 20 cores. Time (s) vs. matrix size n (/1,000, up to 12) for QUARK qdwh, QUARK qdwh opt., OpenMP qdwh and OpenMP qdwh opt.]

QDWH-SVD: Sandy Bridge

[Figure: Sandy Bridge. Time (s) vs. matrix size n (/1,000, up to 20) for 2-stage-sdd and qdwh-svd. Left panel: singular values only; right panel: singular values and vectors.]

    κ_2     | #flops(QDWH-SVD)/#flops(SDD) | Sandy Bridge | Haswell | KNL
    1       | 7.8                          | 1.5          | 1.4     | 1.6
    10^16   | 13                           | 2.5          | 2.3     | 1.1

Table 3: Flops ratio vs. speedup for QDWH-SVD (with vectors) and n = 14,000.

QDWH-SVD: Haswell vs KNL

[Figure: Haswell (top) and KNL (bottom). Time (s) vs. matrix size n (/1,000, up to 16) for 2-stage-svd, 2-stage-sdd and qdwh-svd. Left panels: singular values only; right panels: singular values and vectors.]

QDWH-SVD: PLASMA vs MKL

[Figure: Haswell (top) and KNL (bottom). Time (s) vs. matrix size n (/1,000, up to 16) for dgesvd/dgesdd from PLASMA, LAPACK and MKL, and qdwh-svd (PLASMA). Left panels: singular values only; right panels: singular values and vectors.]

Ongoing work & Perspectives


StarPU implementation:

• Heterogeneous architectures
• Distributed-memory version
• Single/half precision
• Accelerators (NVIDIA GPUs)

Applications:

• Multidimensional scaling
• QDWH-PD and QDWH-Eig
• Matrix p-th roots

Algorithms:

• Mixed precision
  • multiplications, Cholesky and QR
  • on high-end GPUs (with Tensor Core features)
• Zolotarev PD
  • 2 iterations
  • extra flops are embarrassingly parallel
  • larger memory footprint

Questions

Thank you for your attention.

References

N. J. Higham. Computing the polar decomposition with applications. SIAM Journal on Scientific and Statistical Computing, 7(4):1160–1174, 1986. doi: 10.1137/0907079.

N. J. Higham. Functions of Matrices: Theory and Computation. SIAM, Philadelphia, PA, 2008. ISBN 978-0-898716-46-7. doi: 10.1137/1.9780898717778.

N. J. Higham and F. Tisseur. A block algorithm for matrix 1-norm estimation, with an application to 1-norm pseudospectra. SIAM Journal on Matrix Analysis and Applications, 21(4):1185–1201, 2000. doi: 10.1137/S0895479899356080.

S. Li, J. Liu, and Y. Du. A new high performance and scalable SVD algorithm on distributed memory systems. CoRR, abs/1806.06204, 2018.

H. Ltaief, D. E. Sukkari, A. Esposito, Y. Nakatsukasa, and D. E. Keyes. Massively parallel polar decomposition on distributed-memory systems. 2018.

Y. Nakatsukasa and R. W. Freund. Using Zolotarev's rational approximation for computing the polar, symmetric eigenvalue, and singular value decompositions. SIAM Rev., to appear, 2014.

Y. Nakatsukasa and N. J. Higham. Backward stability of iterations for computing the polar decomposition. SIAM Journal on Matrix Analysis and Applications, 33(2):460–479, 2012. doi: 10.1137/110857544.

Y. Nakatsukasa and N. J. Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM Journal on Scientific Computing, 35(3):A1325–A1349, 2013. doi: 10.1137/120876605.

Y. Nakatsukasa, Z. Bai, and F. Gygi. Optimizing Halley's iteration for computing the matrix polar decomposition. SIAM Journal on Matrix Analysis and Applications, 31(5):2700–2720, 2010. doi: 10.1137/090774999.
