Low Rank Approximation Lecture 1
Daniel Kressner
Chair for Numerical Algorithms and HPC
Institute of Mathematics, EPFL
[email protected]

Organizational aspects
- Lectures: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
- Exercises: Tuesday 8-10, MA A110. First: September 25, Last: December 18.
- Exam: Miniproject + oral exam.
- Webpage: https://anchp.epfl.ch/lowrank
- Contact: [email protected], [email protected]

From http://www.niemanlab.org
"... his [Aleksandr Kogan's] message went on to confirm that his approach was indeed similar to SVD or other matrix factorization methods, like in the Netflix Prize competition, and the Kosinki-Stillwell-Graepel Facebook model. Dimensionality reduction of Facebook data was the core of his model."

Rank and basic properties
For a field $F$, let $A \in F^{m\times n}$. Then
$$\operatorname{rank}(A) := \dim(\operatorname{range}(A)).$$
For simplicity, $F = \mathbb{R}$ throughout the lecture and often $m \ge n$.

Lemma. Let $A \in \mathbb{R}^{m\times n}$. Then
1. $\operatorname{rank}(A^T) = \operatorname{rank}(A)$;
2. $\operatorname{rank}(PAQ) = \operatorname{rank}(A)$ for invertible matrices $P \in \mathbb{R}^{m\times m}$, $Q \in \mathbb{R}^{n\times n}$;
3. $\operatorname{rank}(AB) \le \min\{\operatorname{rank}(A), \operatorname{rank}(B)\}$ for any matrix $B \in \mathbb{R}^{n\times p}$;
4. $\operatorname{rank}\begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix} \ge \operatorname{rank}(A_{11}) + \operatorname{rank}(A_{22})$ for $A_{11} \in \mathbb{R}^{m_1\times n_1}$, $A_{12} \in \mathbb{R}^{m_1\times n_2}$, $A_{22} \in \mathbb{R}^{m_2\times n_2}$.

Proof: See Linear Algebra 1 / Exercises.

Rank and matrix factorizations
Let $B = \{b_1,\ldots,b_r\} \subset \mathbb{R}^m$ with $r = \operatorname{rank}(A)$ be a basis of $\operatorname{range}(A)$. Then each of the columns of $A = \begin{bmatrix} a_1, a_2, \ldots, a_n \end{bmatrix}$ can be expressed as a linear combination of $B$:
$$a_i = b_1 c_{i1} + b_2 c_{i2} + \cdots + b_r c_{ir} = \begin{bmatrix} b_1, \ldots, b_r \end{bmatrix} \begin{bmatrix} c_{i1} \\ \vdots \\ c_{ir} \end{bmatrix}$$
for some coefficients $c_{ij} \in \mathbb{R}$ with $i = 1,\ldots,n$, $j = 1,\ldots,r$. Stacking these relations column by column:
$$\begin{bmatrix} a_1, \ldots, a_n \end{bmatrix} = \begin{bmatrix} b_1, \ldots, b_r \end{bmatrix} \begin{bmatrix} c_{11} & \cdots & c_{n1} \\ \vdots & & \vdots \\ c_{1r} & \cdots & c_{nr} \end{bmatrix}.$$

Rank and matrix factorizations
Lemma. A matrix $A \in \mathbb{R}^{m\times n}$ of rank $r$ admits a factorization of the form
$$A = BC^T, \qquad B \in \mathbb{R}^{m\times r}, \quad C \in \mathbb{R}^{n\times r}.$$
We say that $A$ has low rank if $\operatorname{rank}(A) \ll m, n$.
Illustration of low-rank factorization: storing $A$ requires $mn$ entries, while storing the factors $B$ and $C$ requires only $mr + nr$ entries.
- Generically (and in most applications), $A$ has full rank, that is, $\operatorname{rank}(A) = \min\{m, n\}$.
- Aim instead at approximating $A$ by a low-rank matrix.

Questions addressed in the lecture series
What? Theoretical foundations of low-rank approximation.
When? A priori and a posteriori estimates for low-rank approximation. Situations that allow for low-rank approximation techniques.
Why? Applications in engineering, scientific computing, data analysis, ... where low-rank approximation plays a central role.
How? State-of-the-art algorithms for performing and working with low-rank approximations.
Will cover both matrices and tensors.

Literature for Lecture 1
Golub/Van Loan'2013: Golub, Gene H.; Van Loan, Charles F. Matrix Computations. Fourth edition. Johns Hopkins University Press, Baltimore, MD, 2013.
Horn/Johnson'2013: Horn, Roger A.; Johnson, Charles R. Matrix Analysis. Second edition. Cambridge University Press, 2013.
+ References on slides.

1. Fundamental tools
- SVD
- Relation to eigenvalues
- Norms
- Best low-rank approximation

The singular value decomposition
Theorem (SVD). Let $A \in \mathbb{R}^{m\times n}$ with $m \ge n$. Then there are orthogonal matrices $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ such that
$$A = U\Sigma V^T, \qquad \text{with } \Sigma = \begin{bmatrix} \operatorname{diag}(\sigma_1, \ldots, \sigma_n) \\ 0 \end{bmatrix} \in \mathbb{R}^{m\times n}$$
and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$.
- $\sigma_1, \ldots, \sigma_n$ are called singular values.
- $u_1, \ldots, u_n$ are called left singular vectors.
- $v_1, \ldots, v_n$ are called right singular vectors.
- $Av_i = \sigma_i u_i$ and $A^T u_i = \sigma_i v_i$ for $i = 1, \ldots, n$.
- Singular values are always uniquely defined by $A$.
- Singular vectors are never unique. If $\sigma_1 > \sigma_2 > \cdots > \sigma_n > 0$, they are unique up to $u_i \mapsto \pm u_i$, $v_i \mapsto \pm v_i$.
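The following is a small MATLAB sketch, not part of the original slides, illustrating the rank factorization lemma and the relations $Av_i = \sigma_i u_i$, $A^T u_i = \sigma_i v_i$ on a random low-rank matrix; the dimensions m, n, r are arbitrary illustrative choices.

% Illustrative sketch: random rank-r matrix and its (full) SVD
m = 200; n = 100; r = 10;
B = randn(m, r); C = randn(n, r);
A = B*C';                                 % rank(A) = r with probability 1
disp(rank(A))                             % prints 10 (generically)

[U, S, V] = svd(A);                       % full SVD: U is m-by-m, V is n-by-n
sigma = diag(S);                          % singular values, in descending order
disp(nnz(sigma > 1e-10*sigma(1)))         % number of (numerically) nonzero singular values: 10

i = 3;                                    % check A*v_i = sigma_i*u_i and A'*u_i = sigma_i*v_i
disp(norm(A*V(:,i) - sigma(i)*U(:,i)))    % ~ machine precision
disp(norm(A'*U(:,i) - sigma(i)*V(:,i)))   % ~ machine precision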
SVD: Sketch of proof
Induction over $n$. The case $n = 1$ is trivial. For general $n$, let $v_1$ solve
$$\max\{\|Av\|_2 : \|v\|_2 = 1\} =: \|A\|_2.$$
Set $\sigma_1 := \|A\|_2$ and $u_1 := Av_1/\sigma_1$.[1] By definition, $Av_1 = \sigma_1 u_1$. After completion to orthogonal matrices $U_1 = \begin{bmatrix} u_1, U_\perp \end{bmatrix} \in \mathbb{R}^{m\times m}$ and $V_1 = \begin{bmatrix} v_1, V_\perp \end{bmatrix} \in \mathbb{R}^{n\times n}$:
$$U_1^T A V_1 = \begin{bmatrix} u_1^T A v_1 & u_1^T A V_\perp \\ U_\perp^T A v_1 & U_\perp^T A V_\perp \end{bmatrix} = \begin{bmatrix} \sigma_1 & w^T \\ 0 & A_1 \end{bmatrix}$$
with $w := V_\perp^T A^T u_1$ and $A_1 = U_\perp^T A V_\perp$. Because $\|\cdot\|_2$ is invariant under orthogonal transformations,
$$\sigma_1 = \|A\|_2 = \|U_1^T A V_1\|_2 = \left\| \begin{bmatrix} \sigma_1 & w^T \\ 0 & A_1 \end{bmatrix} \right\|_2 \ge \sqrt{\sigma_1^2 + \|w\|_2^2}.$$
Hence, $w = 0$. The proof is completed by applying the induction hypothesis to $A_1$.
[1] If $\sigma_1 = 0$, choose an arbitrary $u_1$.

Very basic properties of the SVD
- $r = \operatorname{rank}(A)$ is the number of nonzero singular values of $A$.
- $\operatorname{kernel}(A) = \operatorname{span}\{v_{r+1}, \ldots, v_n\}$
- $\operatorname{range}(A) = \operatorname{span}\{u_1, \ldots, u_r\}$

SVD: Computation (for small dense matrices)
Computation of the SVD proceeds in two steps:
1. Reduction to bidiagonal form: By applying $n$ Householder reflectors from the left and $n-1$ Householder reflectors from the right, compute orthogonal matrices $U_1$, $V_1$ such that
$$U_1^T A V_1 = B = \begin{bmatrix} B_1 \\ 0 \end{bmatrix},$$
that is, $B_1 \in \mathbb{R}^{n\times n}$ is an upper bidiagonal matrix.
2. Reduction to diagonal form: Use Divide&Conquer to compute orthogonal matrices $U_2$, $V_2$ such that $\Sigma = U_2^T B_1 V_2$ is diagonal. Set $U = U_1 U_2$ and $V = V_1 V_2$.
Step 1 is usually the most expensive. Remarks on Step 1:
- If $m$ is significantly larger than $n$, say $m \ge 3n/2$, first computing a QR decomposition of $A$ reduces the cost.
- Most modern implementations reduce $A$ successively via banded form to bidiagonal form.[2]
[2] Bischof, C. H.; Lang, B.; Sun, X. A framework for symmetric band reduction. ACM Trans. Math. Software 26 (2000), no. 4, 581-601.

SVD: Computation (for small dense matrices)
In most applications, the vectors $u_{n+1}, \ldots, u_m$ are not of interest. By omitting these vectors one obtains the following variant of the SVD.

Theorem (Economy size SVD). Let $A \in \mathbb{R}^{m\times n}$ with $m \ge n$. Then there are a matrix $U \in \mathbb{R}^{m\times n}$ with orthonormal columns and an orthogonal matrix $V \in \mathbb{R}^{n\times n}$ such that
$$A = U\Sigma V^T, \qquad \text{with } \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_n) \in \mathbb{R}^{n\times n}$$
and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$.

Computed by MATLAB's [U,S,V] = svd(A,'econ'). Complexity:

                       | memory        | operations
 singular values only  | O(mn)         | O(mn^2)
 economy size SVD      | O(mn)         | O(mn^2)
 (full) SVD            | O(m^2 + mn)   | O(m^2 n + mn^2)

SVD: Computation (for small dense matrices)
Beware of roundoff error when interpreting singular value plots.
Example: semilogy(svd(hilb(100)))
[Figure: semilogy plot of the computed singular values of the 100x100 Hilbert matrix; they decay rapidly before stagnating at the roundoff level, producing a kink.]
- The kink is caused by roundoff error and does not reflect the true behavior of the singular values.
- The exact singular values are known to decay exponentially.[3]
- Sometimes more accuracy is possible.[4]
[3] Beckermann, B. The condition number of real Vandermonde, Krylov and positive definite Hankel matrices. Numer. Math. 85 (2000), no. 4, 553-577.
[4] Drmač, Z.; Veselić, K. New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29 (2007), no. 4, 1322-1342.

Singular/eigenvalue relations: symmetric matrices
A symmetric matrix $A = A^T \in \mathbb{R}^{n\times n}$ admits a spectral decomposition
$$A = U \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) U^T$$
with an orthogonal matrix $U$. After reordering we may assume $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|$. The spectral decomposition can be turned into an SVD $A = U\Sigma V^T$ by defining
$$\Sigma = \operatorname{diag}(|\lambda_1|, \ldots, |\lambda_n|), \qquad V = U \operatorname{diag}(\operatorname{sign}(\lambda_1), \ldots, \operatorname{sign}(\lambda_n)).$$
Remark: This extends to the more general case of normal matrices (e.g., orthogonal or skew-symmetric) via complex spectral or real Schur decompositions.
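As a quick numerical check (a sketch added here, not part of the original slides; the dimension n is an illustrative choice), the construction above can be carried out in MATLAB for a random symmetric matrix:

% Illustrative sketch: spectral decomposition of a symmetric matrix -> SVD
n = 50;
A = randn(n); A = (A + A')/2;            % random symmetric matrix
[U, D] = eig(A); L = diag(D);            % A = U*diag(L)*U', U orthogonal
[~, p] = sort(abs(L), 'descend');        % reorder so that |lambda_1| >= ... >= |lambda_n|
L = L(p); U = U(:, p);
Sigma = diag(abs(L));
V = U * diag(sign(L));                   % flip sign of columns belonging to negative eigenvalues
                                         % (assumes no eigenvalue is exactly zero, which holds here with probability 1)
disp(norm(A - U*Sigma*V'))               % ~ machine precision: a valid SVD of A
disp(norm(svd(A) - abs(L)))              % singular values of A = |eigenvalues of A|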
Singular/eigenvalue relations: general matrices
Consider the SVD $A = U\Sigma V^T$ of $A \in \mathbb{R}^{m\times n}$ with $m \ge n$. We then have:
1. Spectral decomposition of the Gramian $A^T A$:
$$A^T A = V \Sigma^T \Sigma V^T = V \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2) V^T$$
Hence, $A^T A$ has eigenvalues $\sigma_1^2, \ldots, \sigma_n^2$, and the right singular vectors of $A$ are eigenvectors of $A^T A$.
2. Spectral decomposition of the Gramian $AA^T$:
$$AA^T = U \Sigma \Sigma^T U^T = U \operatorname{diag}(\sigma_1^2, \ldots, \sigma_n^2, 0, \ldots, 0) U^T$$
Hence, $AA^T$ has eigenvalues $\sigma_1^2, \ldots, \sigma_n^2$ and, additionally, $m-n$ zero eigenvalues; the first $n$ left singular vectors of $A$ are eigenvectors of $AA^T$.
3. Decomposition of the Golub-Kahan matrix:
$$\mathcal{A} = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix} = \begin{bmatrix} U & 0 \\ 0 & V \end{bmatrix} \begin{bmatrix} 0 & \Sigma \\ \Sigma^T & 0 \end{bmatrix} \begin{bmatrix} U & 0 \\ 0 & V \end{bmatrix}^T$$
The eigenvalues of $\mathcal{A}$ are $\pm\sigma_j$, $j = 1, \ldots, n$, and zero ($m-n$ times).

Norms: Spectral and Frobenius norm
Given an SVD $A = U\Sigma V^T$, one defines:
- Spectral norm: $\|A\|_2 = \sigma_1$.
- Frobenius norm: $\|A\|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_n^2}$.
Basic properties:
- $\|A\|_2 = \max\{\|Av\|_2 : \|v\|_2 = 1\}$ (see proof of SVD).
- $\|\cdot\|_2$ and $\|\cdot\|_F$ are both (submultiplicative) matrix norms.
- $\|\cdot\|_2$ and $\|\cdot\|_F$ are both unitarily invariant, that is, $\|QAZ\|_2 = \|A\|_2$ and $\|QAZ\|_F = \|A\|_F$ for any orthogonal matrices $Q$, $Z$.
- $\|A\|_2 \le \|A\|_F \le \sqrt{r}\,\|A\|_2$ with $r = \operatorname{rank}(A)$.
- $\|AB\|_F \le \min\{\|A\|_2 \|B\|_F,\ \|A\|_F \|B\|_2\}$.

Euclidean geometry on matrices
Let $B \in \mathbb{R}^{n\times n}$ have eigenvalues $\lambda_1, \ldots, \lambda_n \in \mathbb{C}$. Then
$$\operatorname{trace}(B) := b_{11} + \cdots + b_{nn} = \lambda_1 + \cdots + \lambda_n.$$
In turn,
$$\|A\|_F^2 = \operatorname{trace}(A^T A) = \operatorname{trace}(AA^T) = \sum_{i,j} a_{ij}^2.$$
Two simple consequences:
- $\|\cdot\|_F$ is the norm induced by the matrix inner product $\langle A, B\rangle := \operatorname{trace}(AB^T)$, $A, B \in \mathbb{R}^{m\times n}$.
- Partition $A = \begin{bmatrix} a_1, a_2, \ldots, a_n \end{bmatrix}$ and define the vectorization
$$\operatorname{vec}(A) = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} \in \mathbb{R}^{mn}.$$
Then $\langle A, B\rangle = \langle \operatorname{vec}(A), \operatorname{vec}(B)\rangle$ and $\|A\|_F = \|\operatorname{vec}(A)\|_2$.
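A small MATLAB sketch, not part of the original slides, that checks these norm and inner product identities on random matrices (the dimensions m, n are illustrative):

% Illustrative sketch: spectral/Frobenius norm identities and the trace inner product
m = 80; n = 60;
A = randn(m, n); B = randn(m, n);
r = rank(A);                                      % = min(m,n) generically

s = svd(A);                                       % singular values of A
disp(norm(A, 'fro') - norm(s))                    % ||A||_F = sqrt(sigma_1^2 + ... + sigma_n^2)
disp(norm(A, 'fro')^2 - trace(A'*A))              % ||A||_F^2 = trace(A'*A)

% Norm inequalities: ||A||_2 <= ||A||_F <= sqrt(r)*||A||_2
disp([norm(A, 2), norm(A, 'fro'), sqrt(r)*norm(A, 2)])

% Matrix inner product <A,B> = trace(A*B') agrees with the vectorized inner product
disp(trace(A*B') - A(:)'*B(:))                    % A(:) stacks the columns of A, i.e. vec(A)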