
Low Rank Approximation Lecture 1


Daniel Kressner
Chair for Numerical Algorithms and HPC
Institute of Mathematics, EPFL
[email protected]

1 Organizational aspects

I Lectures: Tuesday 8-10, MA A110. First: September 25, Last: December 18.

I Exercises: Tuesday 8-10, MA A110. First: September 25, Last: December 18.

I Exam: Miniproject + oral exam.

I Webpage: https://anchp.epfl.ch/lowrank.

I [email protected], [email protected]

2 From http://www.niemanlab.org

... his [Aleksandr Kogan's] message went on to confirm that his approach was indeed similar to SVD or other factorization methods, like in the Netflix Prize competition, and the Kosinski-Stillwell-Graepel Facebook model. Dimensionality reduction of Facebook data was the core of his model.

3 Rank and basic properties

For a field F, let A ∈ F^{m×n}. Then

rank(A) := dim(range(A)).

For simplicity, F = R throughout the lecture and often m ≥ n.

Lemma. Let A ∈ R^{m×n}. Then
1. rank(A^T) = rank(A);
2. rank(PAQ) = rank(A) for invertible matrices P ∈ R^{m×m}, Q ∈ R^{n×n};
3. rank(AB) ≤ min{rank(A), rank(B)} for any matrix B ∈ R^{n×p};
4. rank([A11, A12; 0, A22]) ≥ rank(A11) + rank(A22) for A11 ∈ R^{m1×n1}, A12 ∈ R^{m1×n2}, A22 ∈ R^{m2×n2}.

Proof: See the exercises.

4 Rank and matrix factorizations

Let B = {b1, ..., br} ⊂ R^m with r = rank(A) be a basis of range(A). Then each of the columns of A = [a1, a2, ..., an] can be expressed as a linear combination of B:

  ai = b1 c_{i1} + b2 c_{i2} + ··· + br c_{ir} = [b1, ..., br] [c_{i1}; ...; c_{ir}],

for some coefficients c_{ij} ∈ R with i = 1, ..., n, j = 1, ..., r. Stacking these relations column by column gives

  [a1, ..., an] = [b1, ..., br] [c_{11} ··· c_{n1}; ... ; c_{1r} ··· c_{nr}].

5 Rank and matrix factorizations

Lemma. A matrix A ∈ R^{m×n} of rank r admits a factorization of the form

  A = BC^T,  B ∈ R^{m×r},  C ∈ R^{n×r}.

We say that A has low rank if rank(A) ≪ m, n.

Illustration of low-rank factorization: A (mn entries) is represented by the factors B and C^T (mr + nr entries); see the MATLAB sketch below.

I Generically (and in most applications), A has full rank, that is, rank(A) = min{m, n}.

I Aim instead at approximating A by a low-rank matrix.
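
A quick numerical illustration (the sizes m, n, r below are arbitrary choices): a minimal MATLAB sketch that builds a rank-r matrix from its factors and compares the storage requirements.

    % Build A = B*C' of rank r and compare storage of A with storage of its factors.
    m = 1000; n = 800; r = 20;            % arbitrary example sizes
    B = randn(m, r); C = randn(n, r);     % random factors
    A = B*C';                             % has rank r (with probability 1)
    fprintf('rank(A) = %d\n', rank(A));
    fprintf('#entries of A:       %d\n', m*n);
    fprintf('#entries of B and C: %d\n', m*r + n*r);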

6 Questions addressed in lecture series

What? Theoretical foundations of low-rank approximation.
When? A priori and a posteriori estimates for low-rank approximation. Situations that allow for low-rank approximation techniques.
Why? Applications in engineering, scientific computing, data analysis, ... where low-rank approximation plays a central role.
How? State-of-the-art algorithms for performing and working with low-rank approximations.

Will cover both matrices and tensors.

7 Literature for Lecture 1

Golub/Van Loan'2013: Golub, Gene H.; Van Loan, Charles F. Matrix Computations. Fourth edition. Johns Hopkins University Press, Baltimore, MD, 2013.
Horn/Johnson'2013: Horn, Roger A.; Johnson, Charles R. Matrix Analysis. Second edition. Cambridge University Press, 2013.
+ References on slides.

8 1. Fundamental tools

I SVD

I Relation to eigenvalues

I Norms

I Best low-rank approximation

9 The singular value decomposition

Theorem (SVD). Let A ∈ R^{m×n} with m ≥ n. Then there are orthogonal matrices U ∈ R^{m×m} and V ∈ R^{n×n} such that

  A = UΣV^T,  with  Σ = [diag(σ1, ..., σn); 0] ∈ R^{m×n}

and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.

I σ1, . . . , σn are called singular values

I u1,..., un are called left singular vectors

I v1, ..., vn are called right singular vectors
I Avi = σi ui, A^T ui = σi vi for i = 1, ..., n.

I Singular values are always uniquely defined by A.

I Singular vectors are never unique. If σ1 > σ2 > · · · > σn > 0, then they are unique up to ui ← ±ui, vi ← ±vi.

10 SVD: Sketch of proof

Induction over n. The case n = 1 is trivial.

For general n, let v1 solve max{‖Av‖2 : ‖v‖2 = 1} =: ‖A‖2. Set σ1 := ‖A‖2 and u1 := Av1/σ1.¹ By definition,

  Av1 = σ1 u1.

After completion to orthogonal matrices U1 = [u1, U⊥] ∈ R^{m×m} and V1 = [v1, V⊥] ∈ R^{n×n}:

  U1^T A V1 = [u1^T A v1, u1^T A V⊥; U⊥^T A v1, U⊥^T A V⊥] = [σ1, w^T; 0, A1],

with w := V⊥^T A^T u1 and A1 := U⊥^T A V⊥. Since ‖·‖2 is invariant under orthogonal transformations,

  σ1 = ‖A‖2 = ‖U1^T A V1‖2 = ‖[σ1, w^T; 0, A1]‖2 ≥ √(σ1^2 + ‖w‖2^2).

Hence, w = 0. The proof is completed by applying induction to A1.

1 If σ1 = 0, choose an arbitrary u1.

11 Very basic properties of the SVD

I r = rank(A) is the number of nonzero singular values of A.

I kernel(A) = span{v_{r+1}, ..., vn}

I range(A) = span{u1,..., ur }

12 SVD: Computation (for small dense matrices)

Computation of the SVD proceeds in two steps:
1. Reduction to bidiagonal form: By applying n Householder reflectors from the left and n − 2 Householder reflectors from the right, compute orthogonal matrices U1, V1 such that

  U1^T A V1 = [B1; 0],

that is, B1 ∈ R^{n×n} is an upper bidiagonal matrix.
2. Reduction to diagonal form: Use divide & conquer to compute orthogonal matrices U2, V2 such that Σ = U2^T B1 V2 is diagonal.
Set U = U1 U2 and V = V1 V2. Step 1 is usually the most expensive. Remarks on Step 1:

I If m is significantly larger than n, say, m ≥ 3n/2, first computing a QR decomposition of A reduces the cost.
I Most modern implementations reduce A successively via banded form to bidiagonal form.²

2 Bischof, C. H.; Lang, B.; Sun, X. A framework for symmetric band reduction. ACM Trans. Math. Software 26 (2000), no. 4, 581–601.

13 SVD: Computation (for small dense matrices)

In most applications, the vectors u_{n+1}, ..., um are not of interest. By omitting these vectors one obtains the following variant of the SVD.

Theorem (Economy size SVD). Let A ∈ R^{m×n} with m ≥ n. Then there is a matrix U ∈ R^{m×n} with orthonormal columns and an orthogonal matrix V ∈ R^{n×n} such that

  A = UΣV^T,  with  Σ = diag(σ1, ..., σn) ∈ R^{n×n}

and σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0.

Computed by MATLAB's [U,S,V] = svd(A,'econ').

Complexity:
                          memory         operations
    singular values only  O(mn)          O(mn^2)
    economy size SVD      O(mn)          O(mn^2)
    (full) SVD            O(m^2 + mn)    O(m^2 n + mn^2)
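
For illustration, a small MATLAB sketch (arbitrary sizes) contrasting the three variants listed above:

    % Full SVD, economy-size SVD, and singular values only.
    A = randn(2000, 100);
    [U, S, V]    = svd(A);           % full SVD:    U is 2000x2000, S is 2000x100
    [Ue, Se, Ve] = svd(A, 'econ');   % economy SVD: Ue is 2000x100, Se is 100x100
    s = svd(A);                      % vector of singular values only
    norm(A - Ue*Se*Ve', 'fro')       % small (order of roundoff): A is reproduced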

14 SVD: Computation (for small dense matrices)

Beware of roundoff error when interpreting singular value plots. Example: semilogy(svd(hilb(100)))

[Figure: singular values of hilb(100) on a semilogarithmic scale, decaying from about 10^0 down to roughly 10^{-20} and then flattening out.]

I The kink is caused by roundoff error and does not reflect the true behavior of the singular values.
I The exact singular values are known to decay exponentially.³
I Sometimes more accuracy is possible.⁴
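
A hedged MATLAB sketch of this effect: computed singular values that fall below roughly eps·σ1 are dominated by roundoff (the plotted threshold is only a rule of thumb, not a sharp bound).

    % Singular values of hilb(100) vs. the roundoff level eps*sigma_1.
    s = svd(hilb(100));
    semilogy(s, 'o-'); hold on
    semilogy([1 100], eps*s(1)*[1 1], 'r--')   % values below this line are unreliable
    hold off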

3 Beckermann, B. The condition number of real Vandermonde, Krylov and positive definite Hankel matrices. Numer. Math. 85 (2000), no. 4, 553–577.
4 Drmač, Z.; Veselić, K. New fast and accurate Jacobi SVD algorithm. I. SIAM J. Matrix Anal. Appl. 29 (2007), no. 4, 1322–1342.

15 Singular/eigenvalue relations: symmetric matrices

A symmetric matrix A = A^T ∈ R^{n×n} admits a spectral decomposition

  A = U diag(λ1, λ2, ..., λn) U^T

with orthogonal U.

After reordering we may assume |λ1| ≥ |λ2| ≥ · · · ≥ |λn|. The spectral decomposition can be turned into an SVD A = UΣV^T by defining

Σ = diag(|λ1|,..., |λn|), V = U diag(sign(λ1),..., sign(λn)).

Remark: This extends to the more general case of normal matrices (e.g., orthogonal or skew-symmetric) via complex spectral or real Schur decompositions.
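
A short MATLAB sketch of this construction for a random symmetric matrix (illustrative only; it assumes all eigenvalues are nonzero):

    % Turn a spectral decomposition of a symmetric matrix into an SVD.
    n = 6; A = randn(n); A = (A + A')/2;        % random symmetric matrix
    [U, L] = eig(A); lambda = diag(L);
    [~, idx] = sort(abs(lambda), 'descend');    % reorder: |lambda_1| >= ... >= |lambda_n|
    U = U(:, idx); lambda = lambda(idx);
    Sigma = diag(abs(lambda));
    V = U*diag(sign(lambda));                   % sign flip makes Sigma nonnegative
    norm(A - U*Sigma*V')                        % ~ 0 up to roundoff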

16 Singular/eigenvalue relations: general matrices

Consider the SVD A = UΣV^T of A ∈ R^{m×n} with m ≥ n. We then have:

1. Spectral decomposition of the Gramian A^T A = V Σ^T Σ V^T = V diag(σ1^2, ..., σn^2) V^T:
   A^T A has eigenvalues σ1^2, ..., σn^2, and the right singular vectors of A are eigenvectors of A^T A.

2. Spectral decomposition of the Gramian A A^T = U Σ Σ^T U^T = U diag(σ1^2, ..., σn^2, 0, ..., 0) U^T:
   A A^T has eigenvalues σ1^2, ..., σn^2 and, additionally, m − n zero eigenvalues; the first n left singular vectors of A are eigenvectors of A A^T.

3. Decomposition of the Golub-Kahan matrix

     [0, A; A^T, 0] = [U, 0; 0, V] [0, Σ; Σ^T, 0] [U, 0; 0, V]^T;

   the eigenvalues of the Golub-Kahan matrix are ±σj, j = 1, ..., n, and zero (m − n times).
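
These relations are easy to verify numerically; a minimal MATLAB sketch (arbitrary sizes with m ≥ n):

    % Check the singular value / eigenvalue relations for a random A.
    m = 8; n = 5; A = randn(m, n);
    s = svd(A);
    ev_gram = sort(eig(A'*A), 'descend');
    norm(ev_gram - s.^2)                          % ~ 0: eigenvalues of A'*A are sigma_i^2
    GK = [zeros(m) A; A' zeros(n)];               % Golub-Kahan matrix
    ev_gk = sort(eig(GK), 'descend');
    norm(ev_gk - [s; zeros(m-n,1); -flipud(s)])   % ~ 0: +-sigma_i and m-n zeros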

17 Norms: Spectral and Frobenius

Given an SVD A = UΣV^T, one defines:

I Spectral norm: ‖A‖2 = σ1.
I Frobenius norm: ‖A‖F = √(σ1^2 + ··· + σn^2).

Basic properties:

I ‖A‖2 = max{‖Av‖2 : ‖v‖2 = 1} (see proof of SVD).

I ‖·‖2 and ‖·‖F are both (submultiplicative) matrix norms.

I ‖·‖2 and ‖·‖F are both unitarily invariant, that is,

  ‖QAZ‖2 = ‖A‖2,  ‖QAZ‖F = ‖A‖F

for any orthogonal matrices Q, Z.
I ‖A‖2 ≤ ‖A‖F ≤ √r · ‖A‖2, where r = rank(A).

I ‖AB‖F ≤ min{‖A‖2 ‖B‖F, ‖A‖F ‖B‖2}
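
A quick MATLAB check of these properties (random matrices, arbitrary sizes):

    % Spectral and Frobenius norms via singular values; unitary invariance.
    A = randn(7, 4); s = svd(A);
    abs(norm(A) - s(1))                      % ~ 0
    abs(norm(A, 'fro') - norm(s))            % ~ 0
    [Q, ~] = qr(randn(7)); [Z, ~] = qr(randn(4));
    abs(norm(Q*A*Z) - norm(A))               % ~ 0: unitary invariance of the spectral norm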

18 Euclidean geometry on matrices

Let B ∈ R^{n×n} have eigenvalues λ1, ..., λn ∈ C. Then

  trace(B) := b11 + ··· + bnn = λ1 + ··· + λn.

In turn,

  ‖A‖F^2 = trace(A^T A) = trace(A A^T) = Σ_{i,j} a_{ij}^2.

Two simple consequences:

I ‖·‖F is the norm induced by the matrix inner product

  ⟨A, B⟩ := trace(AB^T),  A, B ∈ R^{m×n}.

I Partition A = [a1, a2, ..., an] and define the vectorization

  vec(A) = [a1; a2; ...; an] ∈ R^{mn}.

Then ⟨A, B⟩ = ⟨vec(A), vec(B)⟩ and ‖A‖F = ‖vec(A)‖2.
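
A minimal MATLAB check of these identities (the built-in A(:) plays the role of vec(A)):

    % Frobenius inner product and its vectorized form.
    A = randn(5, 3); B = randn(5, 3);
    abs(trace(A*B') - A(:)'*B(:))            % ~ 0: <A,B> = <vec(A),vec(B)>
    abs(norm(A, 'fro') - norm(A(:)))         % ~ 0: ||A||_F = ||vec(A)||_2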

19 Von Neumann's trace inequality

Theorem. For m ≥ n, let A, B ∈ R^{m×n} have singular values σ1(A) ≥ · · · ≥ σn(A) and σ1(B) ≥ · · · ≥ σn(B), respectively. Then

  |⟨A, B⟩| ≤ σ1(A)σ1(B) + ··· + σn(A)σn(B).

Proof of Von Neumann's trace inequality in the lecture notes.⁵ Consequence:

  ‖A − B‖F^2 = ⟨A − B, A − B⟩ = ‖A‖F^2 − 2⟨A, B⟩ + ‖B‖F^2
             ≥ ‖A‖F^2 − 2 Σ_{i=1}^n σi(A)σi(B) + ‖B‖F^2
             = Σ_{i=1}^n (σi(A) − σi(B))^2.
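
A quick numerical illustration of the inequality (random matrices, arbitrary sizes):

    % von Neumann's trace inequality: |<A,B>| <= sum_i sigma_i(A)*sigma_i(B).
    A = randn(6, 4); B = randn(6, 4);
    lhs = abs(trace(A*B'));
    rhs = sum(svd(A).*svd(B));
    lhs <= rhs                               % returns 1 (true)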

5 This proof follows [Grigorieff, R. D. Note on von Neumann's trace inequality. Math. Nachr. 151 (1991), 327–328]. For Mirsky's ingenious proof based on doubly stochastic matrices, see Theorem 8.7.6 in [Horn/Johnson'2013].

20 Schatten norms

There are other unitarily invariant matrix norms.⁶

Let s(A) = (σ1, ..., σn). The Schatten p-norm, defined by

  ‖A‖(p) := ‖s(A)‖p

is a matrix norm for any 1 ≤ p ≤ ∞. p = ∞: spectral norm, p = 2: Frobenius norm, p = 1: nuclear norm.

Definition. The dual of a matrix norm ‖·‖ on R^{m×n} is defined by ‖A‖^D = max{⟨A, B⟩ : ‖B‖ = 1}.

Lemma. Let p, q ∈ [1, ∞] such that p^{-1} + q^{-1} = 1. Then

  ‖A‖(p)^D = ‖A‖(q).
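
In MATLAB, Schatten norms can be evaluated directly from the vector of singular values; a small sketch:

    % Schatten-p norms from the singular values.
    A = randn(6, 4); s = svd(A);
    nuclear   = sum(s);       % p = 1 (nuclear norm)
    frobenius = norm(s);      % p = 2, equals norm(A,'fro')
    spectral  = max(s);       % p = inf, equals norm(A)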

6 Complete characterization via symmetric gauge functions in [Horn/Johnson'2013].

21 Best low-rank approximation

Consider k < n and let Uk := [u1, ..., uk], Σk := diag(σ1, ..., σk), Vk := [v1, ..., vk].

Then Tk(A) := Uk Σk Vk^T has rank at most k. For any unitarily invariant norm ‖·‖:

  ‖Tk(A) − A‖ = ‖diag(0, ..., 0, σ_{k+1}, ..., σn)‖.

In particular, for the spectral norm and the Frobenius norm:

  ‖A − Tk(A)‖2 = σ_{k+1},  ‖A − Tk(A)‖F = √(σ_{k+1}^2 + ··· + σn^2).

The two errors are nearly equal if and only if the singular values decay sufficiently quickly.
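
A minimal MATLAB sketch of the truncated SVD Tk(A) and its errors (sizes and k chosen arbitrarily):

    % Truncated SVD T_k(A) and its error in the spectral and Frobenius norms.
    A = randn(100, 60); k = 10;
    [U, S, V] = svd(A, 'econ'); s = diag(S);
    Tk = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';
    abs(norm(A - Tk)        - s(k+1))               % ~ 0
    abs(norm(A - Tk, 'fro') - norm(s(k+1:end)))     % ~ 0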

22 Best low-rank approximation

Theorem (Schmidt-Mirsky). Let A ∈ R^{m×n}. Then

  ‖A − Tk(A)‖ = min{‖A − B‖ : B ∈ R^{m×n} has rank at most k}

holds for any unitarily invariant norm ‖·‖.⁷

Proof for ‖·‖F: Follows directly from the consequence of Von Neumann's trace inequality.

Proof for ‖·‖2: For any B ∈ R^{m×n} of rank ≤ k, kernel(B) has dimension ≥ n − k. Hence, because range(V_{k+1}) has dimension k + 1, there exists w ∈ kernel(B) ∩ range(V_{k+1}) with ‖w‖2 = 1. Then

  ‖A − B‖2^2 ≥ ‖(A − B)w‖2^2 = ‖Aw‖2^2 = ‖A V_{k+1} V_{k+1}^T w‖2^2 = ‖U_{k+1} Σ_{k+1} V_{k+1}^T w‖2^2
             = Σ_{j=1}^{k+1} σj^2 |vj^T w|^2 ≥ σ_{k+1}^2 Σ_{j=1}^{k+1} |vj^T w|^2 = σ_{k+1}^2.

7 See Section 7.4.9 in [Horn/Johnson'2013] for the general case.

23 Best low-rank approximation

Uniqueness:

I If σk > σ_{k+1}, the best rank-k approximation with respect to the Frobenius norm is unique.

I If σk = σ_{k+1}, the best rank-k approximation is never unique. For example, I3 has several best rank-two approximations:

1 0 0 1 0 0 0 0 0 0 1 0 , 0 0 0 , 0 1 0 . 0 0 0 0 0 1 0 0 1

I With respect to the spectral norm, the best rank-k approximation is only unique if σ_{k+1} = 0. For example, diag(2, 1, ε) with 0 < ε < 1 has infinitely many best rank-two approximations:

2 0 0 2 − /2 0 0 2 − /3 0 0 0 1 0 ,  0 1 − /2 0 ,  0 1 − /3 0 ,.... 0 0 0 0 0 0 0 0 1

24 Approximating the range of a matrix

Aim at finding a matrix Q ∈ R^{m×k} with orthonormal columns such that range(Q) ≈ range(A).

I − QQ^T is the orthogonal projector onto range(Q)^⊥. Aim at minimizing

  ‖(I − QQ^T)A‖ = ‖A − QQ^T A‖

for any unitarily invariant norm ‖·‖. Because rank(QQ^T A) ≤ k,

  ‖A − QQ^T A‖ ≥ ‖A − Tk(A)‖.

Setting Q = Uk one obtains

  Uk Uk^T A = Uk Uk^T UΣV^T = Uk Σk Vk^T = Tk(A).

Q = Uk is optimal.
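
A short MATLAB comparison (illustrative only) of the optimal choice Q = Uk with a random orthonormal basis:

    % Compare Q = U_k with a random Q having orthonormal columns.
    A = randn(200, 100); k = 10;
    [U, S, V] = svd(A, 'econ'); s = diag(S);
    Qopt = U(:, 1:k);
    [Qrand, ~] = qr(randn(200, k), 0);       % random orthonormal columns
    norm(A - Qopt*(Qopt'*A)) - s(k+1)        % ~ 0: attains the lower bound sigma_{k+1}
    norm(A - Qrand*(Qrand'*A))               % typically much larger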

25 Approximating the range of a matrix

Variation:

  max{‖Q^T A‖F : Q^T Q = Ik}.

Equivalent to

  max{|⟨AA^T, QQ^T⟩| : Q^T Q = Ik},

since ‖Q^T A‖F^2 = trace(Q^T A A^T Q) = ⟨AA^T, QQ^T⟩. By Von Neumann's trace inequality and the equivalence between the eigenvectors of AA^T and the left singular vectors of A, the optimal Q is given by Uk.
