<<

San José State University Math 253: Mathematical Methods for Data Visualization

Matrix norm and low-rank approximation

Dr. Guangliang Chen

Outline

• Matrix norm

• Low-rank matrix approximation

• Applications of low-rank approximation

Introduction

Recall that a vector space is a collection of objects, called "vectors", which are endowed with two kinds of operations, vector addition and scalar multiplication, subject to requirements such as associativity, commutativity, and distributivity.

Below are some examples of vector spaces:

• Euclidean spaces (Rn)

• The collection of all matrices of a fixed size (Rm×n)

• The collection of all functions from R to R

• The collection of all polynomials

• The collection of all infinite sequences

Vector norm

A norm on a vector space V is a function $\|\cdot\| : V \to \mathbb{R}$ that satisfies the following three conditions:

• $\|v\| \ge 0$ for all $v \in V$, and $\|v\| = 0$ if and only if $v = 0$

• $\|kv\| = |k|\,\|v\|$ for any scalar $k \in \mathbb{R}$ and vector $v \in V$

• $\|v + w\| \le \|v\| + \|w\|$ for any two vectors $v, w \in V$

The norm of a vector can be thought of as the length or magnitude of the vector.

Example 0.1. Below are three different norms on $\mathbb{R}^d$:

• 2-norm (or Euclidean norm): $\|x\|_2 = \sqrt{\sum x_i^2} = \sqrt{x^T x}$

• 1-norm (Taxicab norm or Manhattan norm): $\|x\|_1 = \sum |x_i|$

• ∞-norm (maximum norm): $\|x\|_\infty = \max |x_i|$

When unspecified, "norm" is understood to mean the Euclidean 2-norm.
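As a quick check, these norms can be computed with MATLAB's built-in norm function (a minimal sketch; the vector x below is a made-up example):

x = [3; -4; 12];      % example vector in R^3
norm(x, 2)            % 2-norm: sqrt(3^2 + 4^2 + 12^2) = 13
norm(x, 1)            % 1-norm: |3| + |-4| + |12| = 19
norm(x, Inf)          % infinity-norm: max(|x_i|) = 12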

Remark. More generally, for any fixed $p \ge 1$, the $\ell_p$ norm on $\mathbb{R}^d$ is defined as

$\|x\|_p = \left(\sum |x_i|^p\right)^{1/p}, \quad \text{for all } x \in \mathbb{R}^d$

Remark. Any norm on $\mathbb{R}^d$ can be used as a metric to measure the distance between two vectors:

$\mathrm{dist}(x, y) = \|x - y\|, \quad \text{for all } x, y \in \mathbb{R}^d$

For example, the Euclidean distance between $x, y \in \mathbb{R}^d$ (corresponding to the Euclidean norm) is

$\mathrm{dist}_E(x, y) = \|x - y\|_2 = \sqrt{\sum (x_i - y_i)^2}$

Matrix norm

A matrix norm is a norm on $\mathbb{R}^{m\times n}$ regarded as a vector space (consisting of all matrices of a fixed size).

More specifically, a matrix norm is a function $\|\cdot\| : \mathbb{R}^{m\times n} \to \mathbb{R}$ that satisfies the following three conditions:

• $\|A\| \ge 0$ for all $A \in \mathbb{R}^{m\times n}$, and $\|A\| = 0$ if and only if $A = O$

• $\|kA\| = |k| \cdot \|A\|$ for any scalar $k \in \mathbb{R}$ and matrix $A \in \mathbb{R}^{m\times n}$

• $\|A + B\| \le \|A\| + \|B\|$ for any two matrices $A, B \in \mathbb{R}^{m\times n}$

The Frobenius norm

Def 0.1. The Frobenius norm of a matrix $A \in \mathbb{R}^{n\times d}$ is defined as

$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$

Remark. The Frobenius norm on $\mathbb{R}^{n\times d}$ is equivalent to the Euclidean 2-norm on the space of vectorized matrices (i.e., $\mathbb{R}^{nd}$):

$\|A\|_F = \|A(:)\|_2$

Example 0.2. Let

$X = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}.$

By direct calculation, $\|X\|_F = 2$.
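A minimal MATLAB check of this example, and of the vectorization remark above:

X = [1 -1; 0 1; 1 0];
norm(X, 'fro')        % Frobenius norm: sqrt(1+1+0+1+1+0) = 2
norm(X(:), 2)         % same value: 2-norm of the vectorized matrix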

Proposition 0.1. For any matrix $A \in \mathbb{R}^{n\times d}$,

$\|A\|_F^2 = \mathrm{trace}(AA^T) = \mathrm{trace}(A^T A)$

Proof. We demonstrate the first identity for $2 \times 2$ matrices $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$:

$AA^T = \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} a & c \\ b & d \end{pmatrix} = \begin{pmatrix} a^2 + b^2 & * \\ * & c^2 + d^2 \end{pmatrix}$

so that $\mathrm{trace}(AA^T) = a^2 + b^2 + c^2 + d^2 = \|A\|_F^2$. (The full proof is a direct generalization of the above case.)

Theorem 0.2. For any matrix $A \in \mathbb{R}^{n\times d}$ with singular values $\{\sigma_i\}$,

$\|A\|_F = \sqrt{\sum \sigma_i^2}$

Proof. Let the full SVD of A be $A = U\Sigma V^T$. Then

$AA^T = U\Sigma V^T \cdot V\Sigma^T U^T = U(\Sigma\Sigma^T)U^T.$

Applying the formula $\|A\|_F^2 = \mathrm{trace}(AA^T)$ gives

$\|A\|_F^2 = \mathrm{trace}(U\Sigma\Sigma^T U^T) = \mathrm{trace}(\Sigma\Sigma^T U^T U) = \mathrm{trace}(\Sigma\Sigma^T) = \sum \sigma_i^2.$
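The identity in Theorem 0.2 is easy to verify numerically in MATLAB (a sketch using a random test matrix):

A = randn(5, 3);                   % arbitrary test matrix
s = svd(A);                        % vector of singular values
norm(A, 'fro') - sqrt(sum(s.^2))   % zero up to roundoff error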

The operator norm

A second type of matrix norm is the operator norm, which is induced by a vector norm on Euclidean spaces.

Theorem 0.3. For any norm $\|\cdot\|$ on Euclidean spaces, the following is a norm on $\mathbb{R}^{m\times n}$:

$\|A\| \stackrel{\text{def}}{=} \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \max_{u \in \mathbb{R}^n:\ \|u\|=1} \|Au\|$

Proof. We need to verify the three conditions of a norm.

First, it is obvious that $\|A\| \ge 0$ for any $A \in \mathbb{R}^{m\times n}$. Suppose $\|A\| = 0$. Then for any $x \neq 0$, $\|Ax\| = 0$, or equivalently, $Ax = 0$. This implies that $A = O$. (The other direction is trivial.)

Second, for any $k \in \mathbb{R}$,

$\|kA\| = \max_{\|u\|=1} \|(kA)u\| = |k| \cdot \max_{\|u\|=1} \|Au\| = |k| \cdot \|A\|.$

Lastly, for any two matrices $A, B \in \mathbb{R}^{m\times n}$,

$$
\begin{aligned}
\|A + B\| &= \max_{\|u\|=1} \|(A + B)u\| = \max_{\|u\|=1} \|Au + Bu\| \\
&\le \max_{\|u\|=1} \left(\|Au\| + \|Bu\|\right) \\
&\le \max_{\|u\|=1} \|Au\| + \max_{\|u\|=1} \|Bu\| \\
&= \|A\| + \|B\|.
\end{aligned}
$$

Theorem 0.4. For any norm on Euclidean spaces and its induced matrix operator norm, we have

$\|Ax\| \le \|A\| \cdot \|x\| \quad \text{for all } A \in \mathbb{R}^{m\times n},\ x \in \mathbb{R}^n$

Proof. For all $x \neq 0$ in $\mathbb{R}^n$, by definition,

$\frac{\|Ax\|}{\|x\|} \le \max_{y \neq 0} \frac{\|Ay\|}{\|y\|} = \|A\|$

This implies that $\|Ax\| \le \|A\| \cdot \|x\|$ (the inequality holds trivially when $x = 0$).

Remark. More generally, the matrix operator norm satisfies

$\|AB\| \le \|A\| \cdot \|B\|, \quad \text{for all } A \in \mathbb{R}^{m\times n},\ B \in \mathbb{R}^{n\times p}$

Matrix norms with such a property are called sub-multiplicative matrix norms.

The proof of this result is left to you in the homework (you will need to apply the theorem on the preceding slide).

When the Euclidean norm (i.e., 2-norm) is used, the induced matrix operator norm is called the spectral norm.

Def 0.2. The spectral norm of a matrix $A \in \mathbb{R}^{m\times n}$ is defined as

$\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \max_{u \in \mathbb{R}^n:\ \|u\|_2=1} \|Au\|_2$

Theorem 0.5. The spectral norm of any matrix coincides with its largest singular value:

$\|A\|_2 = \sigma_1(A).$

Proof.

$\|A\|_2^2 = \max_{x \neq 0} \frac{\|Ax\|_2^2}{\|x\|_2^2} = \max_{x \neq 0} \frac{x^T A^T A x}{x^T x} = \lambda_1(A^T A) = \sigma_1(A)^2.$

The maximizer is the top right singular vector of A, i.e. $v_1$.

Example 0.3. For the matrix

$X = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix},$

we have $\|X\|_2 = \sqrt{3}$.

We note that the Frobenius and spectral norms of a matrix correspond to the 2- and ∞-norms of its vector of singular values $\sigma = (\sigma_1, \ldots, \sigma_r)$:

$\|A\|_F = \|\sigma\|_2, \qquad \|A\|_2 = \|\sigma\|_\infty$

The 1-norm of the singular value vector is called the nuclear norm of A, which is very useful in convex programming.

Def 0.3. The nuclear norm of a matrix $A \in \mathbb{R}^{n\times d}$ is defined as

$\|A\|_* = \|\sigma\|_1 = \sum \sigma_i.$

Example 0.4. In the last example, $\|X\|_* = \sqrt{3} + 1$.
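For the example matrix X above, all three norms can be read off from its singular values in MATLAB (a minimal sketch):

X = [1 -1; 0 1; 1 0];
s = svd(X);         % singular values: sqrt(3) and 1
norm(X, 2)          % spectral norm  = max(s)     = sqrt(3)
norm(X, 'fro')      % Frobenius norm = norm(s, 2) = 2
sum(s)              % nuclear norm   = norm(s, 1) = sqrt(3) + 1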

MATLAB function for matrix/vector norm

norm – Matrix or vector norm.

norm(X,2) returns the 2-norm of X.

norm(X) is the same as norm(X,2).

norm(X,’fro’) returns the Frobenius norm of X.

In addition, for vectors...

norm(V,P) returns the p-norm of V defined as SUM(ABS(V).^P)^(1/P).

norm(V,Inf) returns the largest element of ABS(V).

Condition number of a matrix

Briefly speaking, the condition number of a square, invertible matrix is a measure of its near-singularity.

Def 0.4. Let A be any square, invertible matrix. For a given matrix operator norm $\|\cdot\|$, the condition number of A is defined as

$\kappa(A) = \|A\|\,\|A^{-1}\|$

Remark. $\kappa(A) = \|A\|\,\|A^{-1}\| \ge \|AA^{-1}\| = \|I\| = \max_{x \neq 0} \frac{\|Ix\|}{\|x\|} = 1$.

Remark. Under the matrix spectral norm,

$\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max}(A) \cdot \frac{1}{\sigma_{\min}(A)} = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$
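In MATLAB, cond(A) returns the condition number (under the spectral norm by default). A small sketch with a made-up, nearly singular matrix:

A = [1 1; 1 1.0001];   % nearly singular 2 x 2 example
cond(A)                % about 4e4, so A is ill-conditioned
s = svd(A);
max(s) / min(s)        % same value, computed from the singular values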

1st perspective (for understanding the matrix condition number): Let

$M = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \|A\|$

$m = \min_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \min_{y \neq 0} \frac{\|y\|}{\|A^{-1}y\|} = \frac{1}{\max_{y \neq 0} \frac{\|A^{-1}y\|}{\|y\|}} = \frac{1}{\|A^{-1}\|}$

which respectively represent how much the matrix A can stretch or shrink vectors. Then

$\kappa(A) = \frac{M}{m}$

If A is singular, then there exists a nonzero x such that Ax = 0. Thus, m = 0 and the condition number is infinity.

In general, a finite, large condition number means that the matrix is close to being singular. In this case, we say that the matrix A is ill-conditioned (for inversion).

2nd perspective: Consider solving a system of linear equations subject to measurement error:

$Ax = b + \underbrace{e}_{\text{error}} \qquad \text{(A is assumed to be exact)}$

We would like to see how the error term e (along with b) affects the solution x (via the matrix A).

Since A is invertible, the linear system has the following solution:

$x = A^{-1}(b + e) = \underbrace{A^{-1}b}_{\text{true solution}} + \underbrace{A^{-1}e}_{\text{error in solution}}$

The ratio of the relative error in the solution to the relative error in b is

$\frac{\|A^{-1}e\| \,/\, \|A^{-1}b\|}{\|e\| \,/\, \|b\|} = \frac{\|A^{-1}e\|}{\|e\|} \cdot \frac{\|b\|}{\|A^{-1}b\|}$

where $\|\cdot\|$ represents a given vector norm.

The maximum possible value of the above ratio (over nonzero b, e) is then

$$
\begin{aligned}
\max_{b \neq 0,\ e \neq 0} \frac{\|A^{-1}e\|}{\|e\|} \cdot \frac{\|b\|}{\|A^{-1}b\|} &= \left(\max_{e \neq 0} \frac{\|A^{-1}e\|}{\|e\|}\right)\left(\max_{b \neq 0} \frac{\|b\|}{\|A^{-1}b\|}\right) \\
&= \left(\max_{e \neq 0} \frac{\|A^{-1}e\|}{\|e\|}\right)\left(\max_{y \neq 0} \frac{\|Ay\|}{\|y\|}\right) \\
&= \|A^{-1}\| \cdot \|A\| = \kappa(A).
\end{aligned}
$$

Remark.

• If the condition number of A is large (i.e., ill-conditioned), then the relative error in the solution x is much larger than the relative error in b, and thus even a small error in b may cause a large error in x.

• On the other hand, if this number is small, then the relative error in x will not be much bigger than the relative error in b.

• In the special case when the condition number is exactly one, the solution has the same relative error as the data b.

See a MATLAB demonstration by C. Moler: https://blogs.mathworks.com/cleve/2017/07/17/what-is-the-condition-number-of-a-matrix/
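Below is a small numeric illustration of this error amplification (my own sketch, not the referenced demo; the matrix, right-hand side, and perturbation are made-up examples):

A = [1 1; 1 1.0001];        % ill-conditioned, cond(A) is about 4e4
b = [2; 2.0001];            % exact data; true solution is x = [1; 1]
e = 1e-5 * [1; -1];         % small measurement error
x_true = A \ b;
x_pert = A \ (b + e);
norm(e) / norm(b)                       % relative error in b: about 5e-6
norm(x_pert - x_true) / norm(x_true)    % relative error in x: about 0.2, i.e. roughly cond(A) times larger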

Low-rank approximation of matrices

Problem. For any matrix $A \in \mathbb{R}^{n\times d}$ and $k \ge 1$, find the rank-k matrix B that is closest to A (under a given norm such as the Frobenius or spectral norm):

$\min_{B \in \mathbb{R}^{n\times d}:\ \mathrm{rank}(B) = k} \|A - B\|$

Remark. This problem arises in a number of tasks, e.g.,

• Orthogonal least-squares fitting

• Data compression (and noise reduction)

• Recommender systems

Theorem 0.6 (Eckart–Young–Mirsky). Given $A \in \mathbb{R}^{n\times d}$ and $1 \le k \le \mathrm{rank}(A)$, let $A_k$ be the truncated SVD of A with the largest k terms: $A_k = \sum_{i=1}^k \sigma_i u_i v_i^T$. Then $A_k$ is the best rank-k approximation to A in terms of both the Frobenius and spectral norms:

$\min_{B:\ \mathrm{rank}(B) = k} \|A - B\|_F = \|A - A_k\|_F = \sqrt{\sum_{i > k} \sigma_i^2}$

$\min_{B:\ \mathrm{rank}(B) = k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}.$

Remark. The theorem still holds true if the equality constraint rank(B) = k is relaxed to rank(B) ≤ k (which will also include all the lower-rank matrices).

(A proof of the theorem is available at https://en.wikipedia.org/wiki/Low-rank_approximation.)

Example 0.5. For the matrix

$X = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix},$

the best rank-1 approximation is

$X_1 = \sigma_1 u_1 v_1^T = \sqrt{3} \begin{pmatrix} \frac{2}{\sqrt{6}} \\ -\frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{6}} \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ -\frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & -\frac{1}{2} \end{pmatrix}.$

In this problem, the approximation error under either norm (spectral or Frobenius) is the same: $\|X - X_1\| = \sigma_2 = 1$.

Applications of low-rank approximation

• Orthogonal least-squares fitting

• Image compression

Orthogonal Best-Fit Subspace

Problem: Given data $x_1, \ldots, x_n \in \mathbb{R}^d$ and an integer $0 < k < d$, find the k-dimensional orthogonal "best-fit" plane S by solving

$\min_S \sum_{i=1}^n \|x_i - P_S(x_i)\|_2^2$

where $P_S(x_i)$ denotes the orthogonal projection of $x_i$ onto S.

Remark. This problem is different from ordinary least-squares regression:

• No predictor-response distinction

• Orthogonal (not vertical) fitting errors

Theorem 0.7. An orthogonal best-fit k-dimensional plane to the data $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n\times d}$ is given by

$x = \bar{x} + V_k\,\alpha, \quad \alpha \in \mathbb{R}^k,$

where $\bar{x}$ is the center of the data set,

$\bar{x} = \frac{1}{n} \sum x_i,$

and $V_k = [v_1, \ldots, v_k]$ is a $d \times k$ matrix whose columns are the top k right singular vectors of the centered data matrix

$\widetilde{X} = [x_1 - \bar{x}, \ldots, x_n - \bar{x}]^T = X - \mathbf{1}\bar{x}^T.$

Proof. Suppose an arbitrary k-dimensional plane S is used to fit the data, with a fixed point $m \in \mathbb{R}^d$ and an orthonormal basis $B = [b_1, \ldots, b_k] \in \mathbb{R}^{d\times k}$. That is,

$B^T B = I_k, \qquad BB^T:\ \text{orthogonal projection onto } S.$

The projection of each data point $x_i$ onto the candidate plane is

$P_S(x_i) = m + BB^T(x_i - m).$

Accordingly, we may rewrite the original problem as

$\min_{m \in \mathbb{R}^d,\ B \in \mathbb{R}^{d\times k}:\ B^T B = I_k}\ \sum_{i=1}^n \|x_i - m - BB^T(x_i - m)\|^2$

Using multivariable calculus, we can show that for any fixed B an optimal m is

$m^* = \frac{1}{n} \sum x_i \stackrel{\text{def}}{=} \bar{x}.$

Plugging in $\bar{x}$ for m and letting $\tilde{x}_i = x_i - \bar{x}$ gives

$\min_B \sum \|\tilde{x}_i - BB^T \tilde{x}_i\|^2.$

In matrix notation, this becomes

$\min_B \|\widetilde{X} - \widetilde{X}BB^T\|_F^2, \quad \text{where } \widetilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]^T \in \mathbb{R}^{n\times d}.$

Let the full SVD of the centered data matrix $\widetilde{X}$ be

$\widetilde{X} = U\Sigma V^T$

Denote by $\widetilde{X}_k$ the best rank-k approximation of $\widetilde{X}$:

$\widetilde{X}_k = U_k \Sigma_k V_k^T.$

Then the minimum is attained when

$\widetilde{X}BB^T = \widetilde{X}_k,$

and a minimizer is the matrix consisting of the top k right singular vectors of $\widetilde{X}$, i.e.,

$B = V_k \equiv V(:, 1{:}k).$

Verify: If $B = V_k$, then

$$
\begin{aligned}
\widetilde{X}BB^T &= \widetilde{X}V_k V_k^T \\
&= \widetilde{X}[v_1, \ldots, v_k]\,V_k^T \\
&= [\sigma_1 u_1, \ldots, \sigma_k u_k]\,V_k^T \\
&= [u_1, \ldots, u_k]\,\mathrm{diag}(\sigma_1, \ldots, \sigma_k)\,V_k^T \\
&= U_k \Sigma_k V_k^T \\
&= \widetilde{X}_k.
\end{aligned}
$$

Proof of $m^* = \bar{x}$:

First, rewrite the above objective function as

$g(m) = \sum_{i=1}^n \|x_i - m - BB^T(x_i - m)\|^2 = \sum_{i=1}^n \|(I - BB^T)(x_i - m)\|^2$

and apply the formula

$\frac{\partial}{\partial x} \|Ax\|^2 = 2A^T Ax$

to find its gradient:

$\nabla g(m) = -\sum 2(I - BB^T)^T (I - BB^T)(x_i - m)$

Note that $I - BB^T$ is also an orthogonal projection (onto the orthogonal complement of S). Thus, $(I - BB^T)^T(I - BB^T) = (I - BB^T)^2 = I - BB^T$. It follows that

$\nabla g(m) = -\sum 2(I - BB^T)(x_i - m) = -2(I - BB^T)\left(\sum x_i - nm\right)$

Any minimizer m must satisfy

$2(I - BB^T)\left(\sum x_i - nm\right) = 0$

This equation has infinitely many solutions, but the simplest one is

$\sum x_i - nm = 0 \;\longrightarrow\; m = \frac{1}{n}\sum x_i.$

Example 0.6. Find the orthogonal best-fit line for a data set of three points (1, 1), (2, 3), (3, 2).

Solution. First, the center of the data is $\bar{x} = \frac{1}{3}(1 + 2 + 3,\ 1 + 3 + 2) = (2, 2)$. Thus, the centered data matrix is

$\widetilde{X} = \begin{pmatrix} -1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} \;\longrightarrow\; v_1 = \begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}.$

Therefore, the orthogonal best-fit line is $x(t) = (2, 2) + t\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)$, and the projections of the original data onto the best-fit line are

$\mathbf{1}\bar{x}^T + \widetilde{X}v_1 v_1^T = \begin{pmatrix} 2 & 2 \\ 2 & 2 \\ 2 & 2 \end{pmatrix} + \begin{pmatrix} -1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ \frac{5}{2} & \frac{5}{2} \\ \frac{5}{2} & \frac{5}{2} \end{pmatrix}$
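A MATLAB sketch that reproduces this example numerically (variable names are mine):

X = [1 1; 2 3; 3 2];                  % data points as rows
xbar = mean(X, 1);                    % center: (2, 2)
Xc = X - ones(3,1) * xbar;            % centered data matrix
[~, ~, V] = svd(Xc);
v1 = V(:, 1);                         % top right singular vector, +/- (1,1)/sqrt(2)
ones(3,1) * xbar + Xc * (v1 * v1')    % projections: (1,1), (2.5,2.5), (2.5,2.5)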

Demonstration on another data set

Application to image compression

Digital images are stored as matrices, so we can apply SVD to obtain their low-rank approximations (and display them as images):

$A_{m\times n} \approx U_k \Sigma_k V_k^T = \sum_{i=1}^k \sigma_i u_i v_i^T.$

By storing $U_k$, $\Sigma_k$, $V_k$ instead of A, we can reduce the storage requirement from $mn$ to

$\underbrace{mk}_{\text{cost of } U_k} + \underbrace{k}_{\text{cost of } \Sigma_k} + \underbrace{nk}_{\text{cost of } V_k} = k(m + n + 1).$

This is an order of magnitude smaller when $k \ll \min(m, n)$.
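A hedged MATLAB sketch of this compression scheme (the file name 'photo.png' and the choice k = 50 are placeholders, and the image is assumed to already be grayscale):

A = double(imread('photo.png'));          % m x n grayscale image as a matrix
[U, S, V] = svd(A, 'econ');
k = 50;                                   % number of singular values/vectors to keep
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';   % rank-k approximation
imagesc(Ak); colormap gray; axis image    % display the compressed image
fprintf('storage: %d vs %d\n', numel(A), k*(size(A,1) + size(A,2) + 1));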

Some practice problems

1. Show that for any A ∈ Rm×n, B ∈ Rn×p,

$\|AB\|_F \le \|A\|_F \|B\|_F$

Solution. Using the Cauchy-Schwarz inequality

$(a^T b)^2 = \left(\sum a_i b_i\right)^2 \le \left(\sum a_i^2\right)\left(\sum b_i^2\right) = \|a\|^2 \|b\|^2,$

we have

$$
\begin{aligned}
\|AB\|_F^2 &= \sum_i \sum_j \big(A(i,:)B(:,j)\big)^2 \\
&\le \sum_i \sum_j \|A(i,:)\|^2 \|B(:,j)\|^2 \\
&= \left(\sum_i \|A(i,:)\|^2\right)\left(\sum_j \|B(:,j)\|^2\right) \\
&= \|A\|_F^2 \|B\|_F^2.
\end{aligned}
$$
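A quick numeric sanity check of this inequality in MATLAB (random, made-up test matrices; a sketch only):

A = randn(4, 6);  B = randn(6, 3);
norm(A*B, 'fro') <= norm(A, 'fro') * norm(B, 'fro')   % returns 1 (true)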

2. Suppose $A = xy^T$ where x, y are two vectors (in column form). Show that

$\|A\|_2 = \|x\|_2 \|y\|_2$

Proof. First, the identity holds trivially if $y = 0$ (in which case $A = O$). Suppose $y \neq 0$. Treating $x$, $y^T$ as matrices, we have

$\|A\|_2 = \|xy^T\|_2 \le \|x\|_2 \|y^T\|_2 = \|x\|_2 \|y\|_2$

On the other hand, for the unit vector $v = \frac{y}{\|y\|_2}$:

$\|A\|_2 \ge \|Av\|_2 = \left\|xy^T \frac{y}{\|y\|_2}\right\|_2 = \frac{1}{\|y\|_2}\|x(y^T y)\|_2 = \frac{\|y\|_2^2}{\|y\|_2}\|x\|_2 = \|x\|_2 \|y\|_2$

Consequently, we must have $\|A\|_2 = \|x\|_2 \|y\|_2$.

3. Let A be a square, symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Show that

$\|A\|_2 = \max(|\lambda_1|, |\lambda_n|) \quad \text{and} \quad \mathrm{Cond}(A) = \frac{\max_i |\lambda_i|}{\min_i |\lambda_i|}$

Proof. It suffices to show that the singular values of A are given by $|\lambda_1|, \ldots, |\lambda_n|$. To see this, consider the orthogonal diagonalization $A = Q\Lambda Q^T$. It follows that

$AA^T = A^2 = Q\Lambda^2 Q^T$

Therefore, the squared singular values of A (which are the eigenvalues of $AA^T$) coincide with the eigenvalues of $A^2$, i.e., $\sigma_1^2 = \lambda_1^2, \ldots, \sigma_n^2 = \lambda_n^2$ (not necessarily sorted). This gives $\sigma_1 = |\lambda_1|, \ldots, \sigma_n = |\lambda_n|$ (still not sorted, but since the eigenvalues are in decreasing order, the largest singular value must be the larger of $|\lambda_1|$ and $|\lambda_n|$).

4. Let $u, v \in \mathbb{R}^n$ be two unit vectors with angle θ between them. Show that

$\|uu^T - vv^T\|_F^2 = 2\sin^2\theta$

Solution. Using $\|A\|_F^2 = \mathrm{trace}(AA^T)$ and observing that $uu^T$, $vv^T$ are both projection matrices (thus symmetric), we get

$$
\begin{aligned}
\|uu^T - vv^T\|_F^2 &= \mathrm{trace}\big((uu^T - vv^T)(uu^T - vv^T)\big) \\
&= \mathrm{trace}\big(uu^T uu^T - uu^T vv^T - vv^T uu^T + vv^T vv^T\big) \\
&= \mathrm{trace}(uu^T) - \mathrm{trace}(uu^T vv^T) - \mathrm{trace}(vv^T uu^T) + \mathrm{trace}(vv^T) \\
&= \mathrm{trace}(u^T u) - \mathrm{trace}(u^T vv^T u) - \mathrm{trace}(v^T uu^T v) + \mathrm{trace}(v^T v) \\
&= 1 - (u^T v)^2 - (v^T u)^2 + 1 \\
&= 2 - 2(u^T v)^2 = 2 - 2\cos^2\theta = 2\sin^2\theta.
\end{aligned}
$$

5. Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ (and corresponding eigenvectors $v_1, v_2, \ldots, v_n$). Solve the following problem:

$\max_{x \in \mathbb{R}^n:\ x \neq 0,\ v_1^T x = 0} \frac{x^T A x}{x^T x}$

Solution. Write

$A = \sum_{i=1}^n \lambda_i v_i v_i^T$

and let

$A_2 = A - \lambda_1 v_1 v_1^T = \sum_{i=2}^n \lambda_i v_i v_i^T$

For any $x \neq 0$ in $\mathbb{R}^n$ satisfying $v_1^T x = 0$, we have

$x^T A x = x^T (A - \lambda_1 v_1 v_1^T) x = x^T A_2 x$

Accordingly, we can rewrite the original problem as

$\max_{x \in \mathrm{span}(v_1)^\perp:\ x \neq 0} \frac{x^T A_2 x}{x^T x}$

This is a regular Rayleigh quotient problem with a reduced domain (the orthogonal complement of $\mathrm{span}(v_1)$ in $\mathbb{R}^n$), and the maximum is the largest eigenvalue of $A_2$ over that domain, which is $\lambda_2$, achieved at $x = v_2$.
