High-Performance SVD for Big Data


Andreas Stathopoulos, Eloy Romero Alcalde
Computer Science Department, College of William & Mary, USA
Bigdata'18

The SVD problem

Given a matrix A with dimensions m × n, find k singular triplets (σ_i, u_i, v_i) such that A v_i ≈ σ_i u_i and {u_i}, {v_i} are orthonormal bases.

Applications:
- applied math (norm computation, reduction of dimensionality or variance)
- combinatorial scientific computing, social network analysis
- computational statistics and machine learning (PCA, SVM, LSI)
- low-dimensionality embeddings (ISOMAP, MDS)
- SVD updating of streaming data

Notation for SVD

- σ_i: i-th singular value, Σ = diag([σ_1 … σ_k]), σ_1 ≥ … ≥ σ_k
- u_i: i-th left singular vector, U = [u_1 … u_k]
- v_i: i-th right singular vector, V = [v_1 … v_k]

Full decomposition, k = min{m, n}:    A = U Σ V^T = Σ_{i=1}^{min{m,n}} u_i σ_i v_i^T
Partial decomposition, k < min{m, n}: A ≈ U Σ V^T = Σ_{i=1}^{k} u_i σ_i v_i^T

Given v_i, then σ_i = ||A v_i||_2 and u_i = A v_i / σ_i.
Given u_i, then σ_i = ||A^T u_i||_2 and v_i = A^T u_i / σ_i.
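As a quick sanity check of these recovery relations, here is a minimal NumPy sketch; the matrix shape and random seed are arbitrary illustrations, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))

# Reference full SVD (singular values returned in decreasing order).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Given a right singular vector v_i, recover sigma_i and u_i.
v0 = Vt[0]                       # first right singular vector
sigma0 = np.linalg.norm(A @ v0)  # sigma_i = ||A v_i||_2
u0 = (A @ v0) / sigma0           # u_i = A v_i / sigma_i

assert np.isclose(sigma0, s[0])
# Singular vectors are determined up to sign, hence the abs().
assert np.allclose(np.abs(u0), np.abs(U[:, 0]))
```

The abs() in the last check accounts for the sign ambiguity of singular vectors.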
Relevant Aspects

The performance of the various methods depends on the following aspects:
- Matrix dimensions, m rows and n columns
- Sparsity
- Number of singular triplets sought
- If few singular triplets are sought, which ones
- Accuracy required
- Spectral distribution/gaps/decay

Accuracy of singular values

Accuracy of σ_i: distance between the exact singular value σ_i and an approximation σ̃_i. It can be bounded as

  |σ_i − σ̃_i| ≤ sqrt( ||r_i^l||_2^2 + ||r_i^r||_2^2 )
  |σ_i − σ̃_i| ≤ ||A^T A ṽ_i − σ̃_i^2 ṽ_i||_2 / (σ_i + σ̃_i)
  |σ_i − σ̃_i| ≤ ||A A^T ũ_i − σ̃_i^2 ũ_i||_2 / (σ_i + σ̃_i),

where r_i^l = A ṽ_i − σ̃_i ũ_i and r_i^r = A^T ũ_i − σ̃_i ṽ_i.

We say a method obtains accuracy ||A||_2 f(ε) if |σ_i − σ̃_i| ≲ ||A||_2 f(ε), where ε is the maximum of the rounding errors of the arithmetic used in the method and of the matrix.

Accuracy of singular vectors

Individual vectors: the angle between the exact singular vector v_i and an approximation ṽ_i.

  Bound: sin ∠(ṽ_i, v_i) ≲ sqrt( ||r_i^l||^2 + ||r_i^r||^2 ) / min{ |σ_i − σ_{i−1}|, |σ_i − σ_{i+1}| }

Not useful when singular values are close.

As low-rank approximations, A ≈ Ũ Ũ^T A:
- ||A − Ũ Ũ^T A||_2: the largest singular value not included in Ũ
- ||A − Ũ Ũ^T A||_F: the sum of all singular values not included in Ũ

These norms do not converge to zero unless Ũ captures the entire rank of A. Accuracy means closeness to ||A − U U^T A|| or smaller than a certain fraction of ||A||.
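The first residual bound on |σ_i − σ̃_i| can be exercised numerically. In this sketch the matrix shape and the perturbation size (1e-4) are arbitrary choices for illustration; σ̃_i and ũ_i are recovered from the perturbed ṽ_i as on the previous slide:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 15))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Perturb the first right singular vector to get an "approximate" triplet.
v = Vt[0] + 1e-4 * rng.standard_normal(15)
v /= np.linalg.norm(v)
sigma = np.linalg.norm(A @ v)      # approximate singular value
u = (A @ v) / sigma                # approximate left vector

rl = A @ v - sigma * u             # left residual r^l (zero by construction here)
rr = A.T @ u - sigma * v           # right residual r^r
bound = np.sqrt(np.linalg.norm(rl)**2 + np.linalg.norm(rr)**2)

# |sigma_1 - sigma~_1| is bounded by the combined residual norm.
assert abs(sigma - s[0]) <= bound + 1e-12
```

Because ũ_i is defined as A ṽ_i / σ̃_i, the left residual vanishes and the bound reduces to ||r^r||_2 in this particular construction.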
Note that ||A||_2 = max_i σ_i and ||A||_F = sqrt( Σ_i σ_i^2 ).

Methods

Direct Methods: Bidiagonal reduction

Aka svd() in Matlab/Octave and R, and numpy.linalg.svd() in NumPy. Based on reducing the input matrix to a bidiagonal matrix by two-sided orthogonal transformations.

Cost: spatial O(mn), temporal O(mn^2).

Limitations:
- Challenging parallelization in shared and distributed memory
- Densification: the nonzero structure of the input matrix is not exploited
- The cost does not decrease if few singular triplets are needed or if the matrix is low rank

Best used when:
- The matrix dimensions are small (m = n = 4000 takes ~20 sec on a 2018 MacBook Pro)
- More than 10% of the spectrum is needed

Direct Methods: Squaring the matrix

1. Form M = A^T A
2. Decompose M = Ṽ Λ̃ Ṽ^T
3. Solution: V = Ṽ, Σ = Λ̃^{1/2}, U = A Ṽ Λ̃^{-1/2}

Cost: spatial O(n^2 + mk + nk), temporal O(mn^2), for k wanted singular triplets.

Improvements:
- Easy parallelization
- The nonzero structure can be exploited
- The spatial cost decreases with k, and further if left and/or right singular vectors are not required

Limitations:
- Accuracy is |σ_i − σ̃_i| ≲ (||A||_2^2 / σ_i) ε, where ε is the rounding error for A

Best used when:
- One of the dimensions is small, n ≲ 4000
- Approximate solutions are enough

Direct Methods: QR

1. Decompose A = QR
2. Decompose R = Ũ Σ̃ Ṽ^T
3. Solution: V = Ṽ, Σ = Σ̃, U = Q Ũ

Improvements:
- Easy parallelization using Tall Skinny QR (TSQR)
- Better accuracy than squaring the matrix, using the proper variant of TSQR

Limitations:
- More expensive than the previous method for sparse matrices

Best used when:
- The input matrix is dense
- One of the dimensions is small, n ≲ 4000
- Highly accurate solutions are required (the TSQR variant may affect accuracy)

Iterative Methods

Warning: risk of missing triplets; see perturbation analysis. In practice this is usually not an issue, probably because of the random nature of the methods.

Iterative methods introduce options that have an impact on the time and the quality of the computed solution:
- Oversampling/restarting parameters
- Stopping criterion

Critical kernels:
- Matrix multi-vector products, AX and A^T X
- Orthogonalization

Subspace Iteration (aka Randomized SVD)

Basic method: V_0 = orth(rand(n, k')) for k' ≥ k, V_i = orth(A^T A V_{i−1}). V_i converges towards the k largest singular vectors of A^T A.

Detailed method that achieves full precision:
1. Set i = 0 and V_0 = orth(rand(n, k'))
2. Set U_i = orth(A V_i)
3. Stop if V_i and U_i are accurate enough
4. Set V_{i+1} = orth(A^T U_i)
5. Increment i and go to step 2

Stopping criteria:
- After some number of iterations
- Monitor the convergence of the singular values or of ||A − Ũ Ũ^T A||
- Monitor the residual min_R ||A^T U_i − V_i R||_2, where R is a k × k matrix

Comments:
- Requires a high-performance orthogonalization kernel, i.e., TSQR or SVQB
- Can take advantage of a high-performance matrix multi-vector product
- Can be accelerated by replacing A with p(A), where the polynomial p boosts the wanted part of the spectrum, e.g., p(x) = x^3
- Tuning of the oversampling, k' ≥ k, may be required

RSVD Oversampling: Example

[Figure: convergence of the 10 largest singular values with RSVD; the error ||(UU^T − ŨŨ^T)A||_F over the iterations for oversampling k' = 10, 20, 30, 40, shown next to the matrix's spectrum.]

Note: ||(UU^T − ŨŨ^T) A||_F = sqrt( Σ_{i=1}^{10} (σ_i^2 − σ̃_i^2) )

Lanczos Bidiagonalization

Basic method: v_0 = orth(rand(n, 1)), V_i = orth([v_0, A^T A v_0, (A^T A)^2 v_0, …, (A^T A)^i v_0]). V_i converges towards the largest and smallest singular vectors of A.

Comments:
- Robust implementations are challenging: there are several choices for orthogonalization and restarting
- Popular variant: IRLBA
- Suitable for finding the largest and the smallest singular values
- Usually converges in fewer matrix-vector products than RSVD, but is not necessarily faster in time

RSVD vs LBD

Goal: satisfy ||A − Ũ Ũ^T A||_{2/F} = ||A − U U^T A||_{2/F} (1 + ε).

Gap convergence analysis:
- RSVD converges in O( log(n/ε) · (σ_k/σ_{k+1} − 1)^{-1} ) iterations [Musco & Musco '15]
- Block Lanczos converges in O( log(n/ε) · (σ_k/σ_{k+1} − 1)^{-1/2} ) iterations [Saad]

Gap-free convergence analysis (with high probability):
- RSVD converges in O( log(n)/ε ) iterations [Rokhlin et al. '09; Halko et al. '11]
- Block Lanczos converges in O( log(n)/sqrt(ε) ) iterations [Musco & Musco '15]

The interplay between the rank of the random block and gaps

If p is the oversampling and q the number of RSVD iterations, then [Ming Gu '16]

  ( ||A − Ũ Ũ^T A||_{2/F}^2 − ||A − U U^T A||_{2/F}^2 ) / F^2 ≤ C k (σ_{k+p+1} / F)^2 (σ_{k+p+1} / σ_k)^{4q}

Normalization choices explain the behavior of the different norms:
- F = ||A||_2 gives σ_{k+p+1}/σ_1;  F = ||A − U U^T A||_2 gives σ_{k+p+1}/σ_{k+1}
- F = ||A||_F gives σ_{k+p+1}/(Σ_i σ_i);  F = ||A − U U^T A||_F gives σ_{k+p+1}/(Σ_{i>k} σ_i)

More importantly, the factor σ_{k+p+1}/F can be much smaller than σ_{k+p+1}/σ_k. For low accuracy, it is sometimes faster to just increase p!
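The detailed subspace iteration above (steps 1-5) can be sketched directly in NumPy. The oversampling default, the tolerance, and the final Rayleigh-Ritz extraction of the k leading triplets are choices of this sketch, not prescriptions from the talk:

```python
import numpy as np

def rsvd_subspace(A, k, oversample=10, maxiter=200, tol=1e-8, seed=0):
    """Subspace iteration for the k largest singular triplets of A."""
    m, n = A.shape
    kp = k + oversample                                 # block size k' >= k
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((n, kp)))   # V_0 = orth(rand(n, k'))
    for _ in range(maxiter):
        U, _ = np.linalg.qr(A @ V)                      # U_i = orth(A V_i)
        W = A.T @ U
        # Stopping criterion: residual min_R ||A^T U_i - V_i R||_2,
        # whose minimizer is R = V_i^T A^T U_i.
        resid = np.linalg.norm(W - V @ (V.T @ W), 2)
        V, _ = np.linalg.qr(W)                          # V_{i+1} = orth(A^T U_i)
        if resid <= tol * np.linalg.norm(W, 2):
            break
    # Rayleigh-Ritz: extract the k leading triplets from the converged block.
    B = U.T @ A @ V
    Ub, s, Vbt = np.linalg.svd(B)
    return U @ Ub[:, :k], s[:k], V @ Vbt[:k].T

A = np.random.default_rng(3).standard_normal((100, 60))
Uk, sk, Vk = rsvd_subspace(A, 5)
exact = np.linalg.svd(A, compute_uv=False)[:5]
```

The leading Ritz values converge much faster than the trailing part of the oversampled block, which is exactly the effect the oversampling plot above illustrates.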
RSVD vs LBD: Example of Matrix-Vector Cost

[Figure: convergence of the 10 largest singular values; the error ||(UU^T − ŨŨ^T)A||_F against the number of matrix-vector products, comparing RSVD with k' = 10, 20, 30, 40, unrestarted LBD, and restarted LBD(10,20).]

RSVD vs LBD: Example of Orthogonalization Cost (flops)

[Figure: the same error against the number of inner products spent in orthogonalization (proportional to flops), for the same methods.]

RSVD vs LBD: Example of Orthogonalization Cost (loads)
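To reproduce the flavor of these cost comparisons, one can count matrix-vector products explicitly. Below is a rough, unrestarted Golub-Kahan Lanczos bidiagonalization with full reorthogonalization; the test matrix, the step count, and the reorthogonalization scheme are assumptions of this sketch, not the experiment behind the plots. Each step costs one product with A and one with A^T:

```python
import numpy as np

def lbd(A, steps, seed=0):
    """Unrestarted Golub-Kahan bidiagonalization with full
    reorthogonalization; returns Ritz estimates of the singular values.
    Roughly two matrix-vector products (A x and A^T y) per step."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    V = np.zeros((n, steps)); U = np.zeros((m, steps))
    alpha = np.zeros(steps); beta = np.zeros(steps - 1)
    V[:, 0] = v
    for j in range(steps):
        u = A @ V[:, j]
        if j > 0:
            u -= beta[j - 1] * U[:, j - 1]
        u -= U[:, :j] @ (U[:, :j].T @ u)          # full reorthogonalization
        alpha[j] = np.linalg.norm(u)
        U[:, j] = u / alpha[j]
        if j < steps - 1:
            w = A.T @ U[:, j] - alpha[j] * V[:, j]
            w -= V[:, :j + 1] @ (V[:, :j + 1].T @ w)
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    B = np.diag(alpha) + np.diag(beta, 1)         # bidiagonal projection U^T A V
    return np.linalg.svd(B, compute_uv=False)

rng = np.random.default_rng(7)
A = rng.standard_normal((200, 100))
s_exact = np.linalg.svd(A, compute_uv=False)

# 40 LBD steps ~= 80 matrix-vector products; Ritz values of the 10 largest.
s_lbd = lbd(A, 40)[:10]
err = np.max(np.abs(s_lbd - s_exact[:10]))       # Ritz error of the 10 largest
```

For the same matrix-vector budget, an RSVD block of width 10 would complete only four iterations, which is the trade-off the plot above makes visible.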