San José State University Math 253: Mathematical Methods for Data Visualization
Matrix norm and low-rank approximation
Dr. Guangliang Chen

Outline
• Matrix norm
• Low-rank matrix approximation
• Applications
Introduction
Recall that a vector space is a collection of objects, called “vectors”, which are endowed with two kinds of operations, vector addition and scalar multiplication, subject to requirements such as associativity, commutativity, and distributivity.
Below are some examples of vector spaces:
• Euclidean spaces ($\mathbb{R}^n$)
• The collection of all matrices of a fixed size ($\mathbb{R}^{m\times n}$)
• The collection of all functions from $\mathbb{R}$ to $\mathbb{R}$
• The collection of all polynomials
• The collection of all infinite sequences
Vector norm
A norm on a vector space $V$ is a function $\|\cdot\| : V \to \mathbb{R}$ that satisfies the following three conditions:
• $\|v\| \ge 0$ for all $v \in V$, and $\|v\| = 0$ if and only if $v = 0$
• $\|kv\| = |k|\,\|v\|$ for any scalar $k \in \mathbb{R}$ and vector $v \in V$
• $\|v + w\| \le \|v\| + \|w\|$ for any two vectors $v, w \in V$
The norm of a vector can be thought of as the length or magnitude of the vector.
Example 0.1. Below are three different norms on the Euclidean space $\mathbb{R}^d$:
• 2-norm (or Euclidean norm):
$\|x\|_2 = \sqrt{\sum x_i^2} = \sqrt{x^T x}$
• 1-norm (taxicab norm or Manhattan norm):
$\|x\|_1 = \sum |x_i|$
• ∞-norm (maximum norm):
$\|x\|_\infty = \max_i |x_i|$
When the norm is left unspecified, it is understood to be the Euclidean 2-norm.
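These three norms can be checked numerically; below is a minimal MATLAB sketch (the vector x is an arbitrary illustration):

    x = [3; -4; 0];
    norm(x, 2)      % 2-norm: sqrt(9 + 16) = 5
    norm(x, 1)      % 1-norm: |3| + |-4| + |0| = 7
    norm(x, Inf)    % inf-norm: max(|3|, |-4|, |0|) = 4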
Remark. More generally, for any fixed $p \ge 1$, the $\ell_p$ norm on $\mathbb{R}^d$ is defined as
$\|x\|_p = \left(\sum |x_i|^p\right)^{1/p}, \quad \text{for all } x \in \mathbb{R}^d$
Remark. Any norm on $\mathbb{R}^d$ can be used as a metric to measure the distance between two vectors:
$\operatorname{dist}(x, y) = \|x - y\|, \quad \text{for all } x, y \in \mathbb{R}^d$
For example, the Euclidean distance between $x, y \in \mathbb{R}^d$ (corresponding to the Euclidean norm) is
$\operatorname{dist}_E(x, y) = \|x - y\|_2 = \sqrt{\sum (x_i - y_i)^2}$
Matrix norm
A matrix norm is a norm on $\mathbb{R}^{m\times n}$ as a vector space (consisting of all matrices of the fixed size).
More specifically, a matrix norm is a function $\|\cdot\| : \mathbb{R}^{m\times n} \to \mathbb{R}$ that satisfies the following three conditions:
• $\|A\| \ge 0$ for all $A \in \mathbb{R}^{m\times n}$, and $\|A\| = 0$ if and only if $A = O$
• $\|kA\| = |k| \cdot \|A\|$ for any scalar $k \in \mathbb{R}$ and matrix $A \in \mathbb{R}^{m\times n}$
• $\|A + B\| \le \|A\| + \|B\|$ for any two matrices $A, B \in \mathbb{R}^{m\times n}$
The Frobenius norm
Def 0.1. The Frobenius norm of a matrix $A \in \mathbb{R}^{n\times d}$ is defined as
$\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$
Remark. The Frobenius norm on $\mathbb{R}^{n\times d}$ is equivalent to the Euclidean 2-norm on the space of vectorized matrices (i.e., $\mathbb{R}^{nd}$):
$\|A\|_F = \|A(:)\|_2$
Example 0.2. Let
$X = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}.$
By direct calculation, $\|X\|_F = \sqrt{1 + 1 + 0 + 1 + 1 + 0} = 2$.
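A quick MATLAB check of this example, along with the vectorization remark above (a minimal sketch):

    X = [1 -1; 0 1; 1 0];
    norm(X, 'fro')     % Frobenius norm: 2
    norm(X(:), 2)      % same value: 2-norm of the vectorized matrix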
Proposition 0.1. For any matrix $A \in \mathbb{R}^{n\times d}$,
$\|A\|_F^2 = \operatorname{trace}(AA^T) = \operatorname{trace}(A^TA)$
Proof. We demonstrate the first identity for $2\times 2$ matrices $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$:
$AA^T = \begin{pmatrix} a & b \\ c & d \end{pmatrix}\begin{pmatrix} a & c \\ b & d \end{pmatrix} = \begin{pmatrix} a^2 + b^2 & * \\ * & c^2 + d^2 \end{pmatrix}$
so that $\operatorname{trace}(AA^T) = a^2 + b^2 + c^2 + d^2 = \|A\|_F^2$. (The full proof is a direct generalization of this case.)
Theorem 0.2. For any matrix $A \in \mathbb{R}^{n\times d}$ (with singular values $\{\sigma_i\}$),
$\|A\|_F = \sqrt{\sum \sigma_i^2}$
Proof. Let the full SVD of $A$ be $A = U\Sigma V^T$. Then
$AA^T = U\Sigma V^T \cdot V\Sigma^T U^T = U(\Sigma\Sigma^T)U^T.$
Applying the formula $\|A\|_F^2 = \operatorname{trace}(AA^T)$ gives that
$\|A\|_F^2 = \operatorname{trace}(U\Sigma\Sigma^T U^T) = \operatorname{trace}(\Sigma\Sigma^T U^T U) = \operatorname{trace}(\Sigma\Sigma^T) = \sum \sigma_i^2.$
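This identity, together with Proposition 0.1, is easy to verify numerically (a sketch; A is an arbitrary test matrix):

    A = randn(5, 3);
    s = svd(A);            % vector of singular values
    norm(A, 'fro')         % Frobenius norm
    sqrt(sum(s.^2))        % same value, via the singular values (Theorem 0.2)
    sqrt(trace(A'*A))      % same value, via the trace identity (Proposition 0.1)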
The operator norm
A second matrix norm is the operator norm, which is induced by a vector norm on Euclidean spaces.
Theorem 0.3. For any norm $\|\cdot\|$ on Euclidean spaces, the following is a norm on $\mathbb{R}^{m\times n}$:
$\|A\| \stackrel{\text{def}}{=} \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \max_{u \in \mathbb{R}^n :\ \|u\| = 1} \|Au\|$
Proof. We need to verify the three conditions of a norm.
First, it is obvious that $\|A\| \ge 0$ for any $A \in \mathbb{R}^{m\times n}$. Suppose $\|A\| = 0$. Then for any $x \neq 0$, $\|Ax\| = 0$, or equivalently, $Ax = 0$. This implies that $A = O$. (The other direction is trivial.)
Second, for any $k \in \mathbb{R}$,
$\|kA\| = \max_{u \in \mathbb{R}^n :\ \|u\|=1} \|(kA)u\| = |k| \cdot \max_{u \in \mathbb{R}^n :\ \|u\|=1} \|Au\| = |k| \cdot \|A\|.$
Lastly, for any two matrices $A, B \in \mathbb{R}^{m\times n}$,
$\|A + B\| = \max_{\|u\|=1} \|(A + B)u\| = \max_{\|u\|=1} \|Au + Bu\| \le \max_{\|u\|=1} \left(\|Au\| + \|Bu\|\right) \le \max_{\|u\|=1} \|Au\| + \max_{\|u\|=1} \|Bu\| = \|A\| + \|B\|.$
Theorem 0.4. For any norm on Euclidean spaces and its induced matrix operator norm, we have
$\|Ax\| \le \|A\| \cdot \|x\| \quad \text{for all } A \in \mathbb{R}^{m\times n},\ x \in \mathbb{R}^n$
Proof. For all $x \neq 0$ in $\mathbb{R}^n$, by definition,
$\frac{\|Ax\|}{\|x\|} \le \|A\|$
This implies that $\|Ax\| \le \|A\| \cdot \|x\|$ (which also holds trivially when $x = 0$).
Remark. More generally, the matrix operator norm satisfies
$\|AB\| \le \|A\| \cdot \|B\|, \quad \text{for all } A \in \mathbb{R}^{m\times n},\ B \in \mathbb{R}^{n\times p}$
Matrix norms with this property are called sub-multiplicative matrix norms.
The proof of this result is left to you in the homework (you will need to apply the theorem on the preceding slide).
When the Euclidean norm (i.e., 2-norm) is used, the induced matrix operator norm is called the spectral norm.
Def 0.2. The spectral norm of a matrix $A \in \mathbb{R}^{m\times n}$ is defined as
$\|A\|_2 = \max_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} = \max_{u \in \mathbb{R}^n :\ \|u\|_2 = 1} \|Au\|_2$
Theorem 0.5. The spectral norm of any matrix coincides with its largest singular value:
$\|A\|_2 = \sigma_1(A).$
Proof.
$\|A\|_2^2 = \max_{x \neq 0} \frac{\|Ax\|_2^2}{\|x\|_2^2} = \max_{x \neq 0} \frac{x^T A^T A x}{x^T x} = \lambda_1(A^TA) = \sigma_1(A)^2.$
The maximizer is the largest right singular vector of $A$, i.e., $v_1$.
Example 0.3. For the matrix
$X = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix},$
we have $\|X\|_2 = \sqrt{3}$.
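Again this is easy to confirm in MATLAB (a sketch):

    X = [1 -1; 0 1; 1 0];
    norm(X, 2)       % spectral norm: sqrt(3), approximately 1.7321
    max(svd(X))      % same value: the largest singular value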
We note that the Frobenius and spectral norms of a matrix correspond to the 2- and ∞-norms of the vector of singular values $\sigma = (\sigma_1, \ldots, \sigma_r)$:
$\|A\|_F = \|\sigma\|_2, \qquad \|A\|_2 = \|\sigma\|_\infty$
The 1-norm of the singular value vector is called the nuclear norm of $A$, which is very useful in convex programming.
Def 0.3. The nuclear norm of a matrix $A \in \mathbb{R}^{n\times d}$ is defined as
$\|A\|_* = \|\sigma\|_1 = \sum \sigma_i.$
Example 0.4. In the last example, $\|X\|_* = \sqrt{3} + 1$.
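MATLAB's norm function has no built-in option for the nuclear norm, but it is one line via the SVD (a sketch, for the same X):

    X = [1 -1; 0 1; 1 0];
    sum(svd(X))      % nuclear norm: sqrt(3) + 1, approximately 2.7321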
MATLAB function for matrix/vector norm
norm – Matrix or vector norm.
norm(X,2) returns the 2-norm of X.
norm(X) is the same as norm(X,2).
norm(X,'fro') returns the Frobenius norm of X.
In addition, for vectors...
norm(V,P) returns the p-norm of V defined as SUM(ABS(V).^P)^(1/P).
norm(V,Inf) returns the largest element of ABS(V).
Condition number of a matrix
Briefly speaking, the condition number of a square, invertible matrix is a measure of its near-singularity.
Def 0.4. Let $A$ be any square, invertible matrix. For a given matrix operator norm $\|\cdot\|$, the condition number of $A$ is defined as
$\kappa(A) = \|A\|\,\|A^{-1}\|$
Remark. $\kappa(A) = \|A\|\,\|A^{-1}\| \ge \|AA^{-1}\| = \|I\| = \max_{x \neq 0} \frac{\|Ix\|}{\|x\|} = 1.$
Remark. Under the matrix spectral norm,
$\kappa(A) = \|A\|_2\,\|A^{-1}\|_2 = \sigma_{\max}(A) \cdot \frac{1}{\sigma_{\min}(A)} = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$
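MATLAB's cond function computes exactly this ratio by default (a minimal sketch; the diagonal matrix below is an arbitrary illustration):

    A = [2 0; 0 0.01];
    cond(A)                   % spectral condition number: 2 / 0.01 = 200
    s = svd(A);
    s(1) / s(end)             % same value: sigma_max / sigma_min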
1st perspective (for understanding the matrix condition number): Let
$M = \max_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \|A\|$
$m = \min_{x \neq 0} \frac{\|Ax\|}{\|x\|} = \min_{y \neq 0} \frac{\|y\|}{\|A^{-1}y\|} = \frac{1}{\max_{y \neq 0} \frac{\|A^{-1}y\|}{\|y\|}} = \frac{1}{\|A^{-1}\|}$
which respectively represent how much the matrix $A$ can stretch or shrink vectors. Then
$\kappa(A) = \frac{M}{m}$
If $A$ is singular, then there exists a nonzero $x$ such that $Ax = 0$. Thus, $m = 0$ and the condition number is infinite.
In general, a finite but large condition number means that the matrix is close to being singular. In this case, we say that the matrix $A$ is ill-conditioned (for inversion).
2nd perspective: Consider solving a system of linear equations subject to measurement error:
$Ax = \underbrace{b + e}_{\text{error}}$ (where $A$ is assumed to be exact and $e$ is the error term)
We would like to see how the error term $e$ (along with $b$) affects the solution $x$ (via the matrix $A$).
Since $A$ is invertible, the linear system has the following solution:
$x = A^{-1}(b + e) = \underbrace{A^{-1}b}_{\text{true solution}} + \underbrace{A^{-1}e}_{\text{error in solution}}$
The ratio of the relative error in the solution to the relative error in $b$ is
$\frac{\|A^{-1}e\| \,/\, \|A^{-1}b\|}{\|e\| \,/\, \|b\|} = \frac{\|A^{-1}e\|}{\|e\|} \cdot \frac{\|b\|}{\|A^{-1}b\|}$
where $\|\cdot\|$ represents a given vector norm.
The maximum possible value of the above ratio (for nonzero $b, e$) is then
$\max_{b \neq 0,\ e \neq 0} \frac{\|A^{-1}e\|}{\|e\|} \cdot \frac{\|b\|}{\|A^{-1}b\|} = \max_{e \neq 0} \frac{\|A^{-1}e\|}{\|e\|} \cdot \max_{b \neq 0} \frac{\|b\|}{\|A^{-1}b\|} = \|A^{-1}\| \cdot \max_{y \neq 0} \frac{\|Ay\|}{\|y\|} = \|A^{-1}\| \cdot \|A\| = \kappa(A).$
Remark.
• If the condition number of $A$ is large (i.e., $A$ is ill-conditioned), then the relative error in the solution $x$ can be much larger than the relative error in $b$, and thus even a small error in $b$ may cause a large error in $x$.
• On the other hand, if this number is small, then the relative error in $x$ will not be much larger than the relative error in $b$.
• In the special case when the condition number is exactly one, the solution has the same relative error as the data $b$.
See a MATLAB demonstration by C. Moler.1
1https://blogs.mathworks.com/cleve/2017/07/17/ what-is-the-condition-number-of-a-matrix/
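A small experiment in the spirit of that demonstration, using the notoriously ill-conditioned Hilbert matrix (a sketch; the perturbation size is an arbitrary choice):

    A = hilb(8);                   % 8 x 8 Hilbert matrix; cond(A) is about 1.5e10
    x_true = ones(8, 1);
    b = A * x_true;
    e = 1e-10 * randn(8, 1);       % a tiny error in b
    x = A \ (b + e);
    norm(e) / norm(b)              % relative error in b: around 1e-10
    norm(x - x_true) / norm(x_true)  % relative error in x: typically many orders larger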
Low-rank approximation of matrices
Problem. For any matrix $A \in \mathbb{R}^{n\times d}$ and integer $k \ge 1$, find the rank-$k$ matrix $B$ that is closest to $A$ (under a given norm such as the Frobenius or spectral norm):
$\min_{B \in \mathbb{R}^{n\times d} :\ \operatorname{rank}(B) = k} \|A - B\|$
Remark. This problem arises in a number of tasks, e.g.,
• Orthogonal least squares fitting
• Data compression (and noise reduction)
• Recommender systems
Theorem 0.6 (Eckart–Young–Mirsky). Given $A \in \mathbb{R}^{n\times d}$ and $1 \le k \le \operatorname{rank}(A)$, let $A_k$ be the truncated SVD of $A$ with the largest $k$ terms: $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$. Then $A_k$ is the best rank-$k$ approximation to $A$ in terms of both the Frobenius and spectral norms:²
$\min_{B :\ \operatorname{rank}(B) = k} \|A - B\|_F = \|A - A_k\|_F = \sqrt{\sum_{i > k} \sigma_i^2}$
$\min_{B :\ \operatorname{rank}(B) = k} \|A - B\|_2 = \|A - A_k\|_2 = \sigma_{k+1}.$
Remark. The theorem still holds true if the equality constraint $\operatorname{rank}(B) = k$ is relaxed to $\operatorname{rank}(B) \le k$ (which will also include all the lower-rank matrices).
2Proof available at https://en.wikipedia.org/wiki/Low-rank_approximation
Example 0.5. For the matrix
$X = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix},$
the best rank-1 approximation is
$X_1 = \sigma_1 u_1 v_1^T = \sqrt{3} \begin{pmatrix} \frac{2}{\sqrt{6}} \\ -\frac{1}{\sqrt{6}} \\ \frac{1}{\sqrt{6}} \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{2}} & -\frac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 1 & -1 \\ -\frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & -\frac{1}{2} \end{pmatrix}.$
In this problem, the approximation error under either norm (spectral or Frobenius) is the same: $\|X - X_1\| = \sigma_2 = 1$.
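The truncated SVD is straightforward to form in MATLAB (a sketch, reproducing this example):

    X = [1 -1; 0 1; 1 0];
    [U, S, V] = svd(X);
    k = 1;
    Xk = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';  % best rank-1 approximation X_1
    norm(X - Xk, 2)        % spectral error: sigma_2 = 1
    norm(X - Xk, 'fro')    % Frobenius error: also 1, since only sigma_2 remains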
Applications of low-rank approximation
• Orthogonal least-squares fitting
• Image compression
Orthogonal Best-Fit Subspace
Problem: Given data $x_1, \ldots, x_n \in \mathbb{R}^d$ and an integer $0 < k < d$, find the $k$-dimensional orthogonal "best-fit" plane by solving
$\min_S \sum_{i=1}^n \|x_i - P_S(x_i)\|_2^2$
(Figure: data points $x_i$ and their orthogonal projections $P_S(x_i)$ onto a plane $S$.)
Remark. This problem is different from ordinary linear regression:
• No predictor-response distinction
• Orthogonal (not vertical) fitting errors
Theorem 0.7. An orthogonal best-fit $k$-dimensional plane to the data $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n\times d}$ is given by
$x = \bar{x} + V_k \cdot \alpha$
where $\bar{x}$ is the center of the data set,
$\bar{x} = \frac{1}{n} \sum x_i$
and $V_k = [v_1 \cdots v_k]$ is a $d \times k$ matrix whose columns are the top $k$ right singular vectors of the centered data matrix
$\widetilde{X} = [x_1 - \bar{x}, \ldots, x_n - \bar{x}]^T = X - \mathbf{1}\bar{x}^T.$
Proof. Suppose an arbitrary $k$-dimensional plane $S$ is used to fit the data, with a fixed point $m \in \mathbb{R}^d$ and an orthonormal basis $B = [b_1, \ldots, b_k] \in \mathbb{R}^{d\times k}$. That is,
$B^TB = I_k, \qquad BB^T :\ \text{orthogonal projection onto } S$
The projection of each data point $x_i$ onto the candidate plane is
$P_S(x_i) = m + BB^T(x_i - m).$
Accordingly, we may rewrite the original problem as
$\min_{m \in \mathbb{R}^d,\ B \in \mathbb{R}^{d\times k} :\ B^TB = I_k} \sum_{i=1}^n \|x_i - m - BB^T(x_i - m)\|^2$
Using multivariable calculus, we can show that for any fixed $B$ an optimal $m$ is
$m^* = \frac{1}{n} \sum x_i \stackrel{\text{def}}{=} \bar{x}.$
Plugging in $\bar{x}$ for $m$ and letting $\tilde{x}_i = x_i - \bar{x}$ gives that
$\min_B \sum \|\tilde{x}_i - BB^T\tilde{x}_i\|^2.$
In matrix notation, this becomes
$\min_B \|\widetilde{X} - \widetilde{X}BB^T\|_F^2, \quad \text{where } \widetilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]^T \in \mathbb{R}^{n\times d}.$
Let the full SVD of the centered data matrix $\widetilde{X}$ be
$\widetilde{X} = U\Sigma V^T$
Denote by $\widetilde{X}_k$ the best rank-$k$ approximation of $\widetilde{X}$:
$\widetilde{X}_k = U_k\Sigma_kV_k^T.$
Then the minimum is attained when
$\widetilde{X}BB^T = \widetilde{X}_k,$
and a minimizer is the matrix consisting of the top $k$ right singular vectors of $\widetilde{X}$, i.e.,
$B = V_k \equiv V(:, 1{:}k).$
Verify: If $B = V_k$, then
$\widetilde{X}BB^T = \widetilde{X}V_kV_k^T = \widetilde{X}[v_1, \ldots, v_k]V_k^T = [\sigma_1 u_1, \ldots, \sigma_k u_k]V_k^T = [u_1, \ldots, u_k]\operatorname{diag}(\sigma_1, \ldots, \sigma_k)V_k^T = U_k\Sigma_kV_k^T = \widetilde{X}_k.$
Proof of $m^* = \bar{x}$:
First, rewrite the above objective function as
$g(m) = \sum_{i=1}^n \|x_i - m - BB^T(x_i - m)\|^2 = \sum_{i=1}^n \|(I - BB^T)(x_i - m)\|^2$
and apply the formula
$\frac{\partial}{\partial x}\|Ax\|^2 = 2A^TAx$
to find its gradient:
$\nabla g(m) = -\sum 2(I - BB^T)^T(I - BB^T)(x_i - m)$
Note that $I - BB^T$ is also an orthogonal projection matrix (onto the orthogonal complement). Thus,
$(I - BB^T)^T(I - BB^T) = (I - BB^T)^2 = I - BB^T.$
It follows that
$\nabla g(m) = -\sum 2(I - BB^T)(x_i - m) = -2(I - BB^T)\left(\sum x_i - nm\right)$
Any minimizer $m$ must satisfy
$2(I - BB^T)\left(\sum x_i - nm\right) = 0$
This equation has infinitely many solutions, but the simplest one is
$\sum x_i - nm = 0 \ \longrightarrow\ m = \frac{1}{n}\sum x_i.$
Example 0.6. Find the orthogonal best-fit line for a data set of three points $(1, 1)$, $(2, 3)$, $(3, 2)$.
Solution. First, the center of the data is $\bar{x} = \frac{1}{3}(1 + 2 + 3,\ 1 + 3 + 2) = (2, 2)$. Thus, the centered data matrix is
$\widetilde{X} = \begin{pmatrix} -1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix} \ \longrightarrow\ v_1 = \begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}.$
Therefore, the orthogonal best-fit line is $x(t) = (2, 2) + t\left(\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}\right)$, and the projections of the original data onto the best-fit line are
$\mathbf{1}\bar{x}^T + \widetilde{X}v_1v_1^T = \begin{pmatrix} 2 & 2 \\ 2 & 2 \\ 2 & 2 \end{pmatrix} + \begin{pmatrix} -1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}\begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ \frac{5}{2} & \frac{5}{2} \\ \frac{5}{2} & \frac{5}{2} \end{pmatrix}$
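The same computation in MATLAB (a sketch for the three points above; the broadcasting with a row vector requires implicit expansion, i.e., R2016b or later):

    X = [1 1; 2 3; 3 2];               % rows are the data points
    xbar = mean(X, 1);                 % center: (2, 2)
    Xt = X - xbar;                     % centered data matrix
    [~, ~, V] = svd(Xt);
    v1 = V(:, 1);                      % direction of the best-fit line (up to sign)
    P = xbar + (Xt * v1) * v1'         % projections: (1,1), (5/2,5/2), (5/2,5/2)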
Demonstration on another data set
Application to image compression
Digital images are stored as matrices, so we can apply SVD to obtain their low-rank approximations (and display them as images):
$A_{m\times n} \approx U_k\Sigma_kV_k^T = \sum_{i=1}^{k} \sigma_i u_i v_i^T.$
By storing $U_k, \Sigma_k, V_k$ instead of $A$, we can reduce the storage requirement from $mn$ to
$\underbrace{mk}_{\text{cost of } U_k} + \underbrace{k}_{\text{cost of } \Sigma_k} + \underbrace{nk}_{\text{cost of } V_k} = k(m + n + 1).$
This is an order of magnitude smaller when $k \ll \min(m, n)$.
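A compression sketch along these lines (the file name is a placeholder; 'cameraman.tif' ships with the Image Processing Toolbox, but any grayscale image file will do):

    A = double(imread('cameraman.tif'));        % grayscale image as a matrix
    [U, S, V] = svd(A);
    k = 20;
    Ak = U(:, 1:k) * S(1:k, 1:k) * V(:, 1:k)';  % best rank-20 approximation
    imagesc(Ak); colormap gray; axis image      % display the compressed image
    numel(A)                                    % original storage: m*n
    k * (size(A,1) + size(A,2) + 1)             % compressed storage: k(m+n+1)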
Some practice problems
1. Show that for any $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{n\times p}$,
$\|AB\|_F \le \|A\|_F\|B\|_F$
Solution. Using the Cauchy–Schwarz inequality
$(a^Tb)^2 = \left(\sum a_ib_i\right)^2 \le \left(\sum a_i^2\right)\left(\sum b_i^2\right) = \|a\|^2\|b\|^2$
we have
$\|AB\|_F^2 = \sum_i \sum_j \left(A(i,:)B(:,j)\right)^2 \le \sum_i \sum_j \|A(i,:)\|^2\|B(:,j)\|^2 = \left(\sum_i \|A(i,:)\|^2\right)\left(\sum_j \|B(:,j)\|^2\right) = \|A\|_F^2\|B\|_F^2.$
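A quick numerical sanity check of this inequality (a sketch with random matrices):

    A = randn(4, 6); B = randn(6, 3);
    norm(A*B, 'fro')                     % left-hand side
    norm(A, 'fro') * norm(B, 'fro')      % right-hand side: always at least as large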
2. Suppose $A = xy^T$ where $x, y$ are two vectors (in column form). Show that
$\|A\|_2 = \|x\|_2\|y\|_2$
Proof. First, the identity holds trivially if $y = 0$ (in which case $A = O$). Suppose $y \neq 0$. Treating $x, y^T$ as matrices, we have
$\|A\|_2 = \|xy^T\|_2 \le \|x\|_2\|y^T\|_2 = \|x\|_2\|y\|_2$
On the other hand, for the unit vector $v = \frac{y}{\|y\|_2}$:
$\|A\|_2 \ge \|Av\|_2 = \left\|xy^T\frac{y}{\|y\|_2}\right\|_2 = \frac{1}{\|y\|_2}\|x(y^Ty)\|_2 = \frac{\|y\|_2^2}{\|y\|_2}\|x\|_2 = \|x\|_2\|y\|_2$
Consequently, we must have $\|A\|_2 = \|x\|_2\|y\|_2$.
3. Let $A$ be a square, symmetric matrix, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Show that
$\|A\|_2 = \max(|\lambda_1|, |\lambda_n|) \quad \text{and} \quad \operatorname{Cond}(A) = \frac{\max_i |\lambda_i|}{\min_i |\lambda_i|}$
Proof. It suffices to show that the singular values of $A$ are given by $|\lambda_1|, \ldots, |\lambda_n|$. To see this, consider the orthogonal diagonalization $A = Q\Lambda Q^T$. It follows that
$AA^T = A^2 = Q\Lambda^2Q^T$
Therefore, the squared singular values of $A$ (which are eigenvalues of $AA^T$) coincide with the eigenvalues of $A^2$, i.e., $\sigma_1^2 = \lambda_1^2, \ldots, \sigma_n^2 = \lambda_n^2$ (not necessarily sorted). This gives that $\sigma_1 = |\lambda_1|, \ldots, \sigma_n = |\lambda_n|$ (still not sorted, but the largest singular value must be the larger of $|\lambda_1|$ and $|\lambda_n|$).
4. Let $u, v \in \mathbb{R}^n$ be two unit vectors with angle $\theta$. Show that
$\|uu^T - vv^T\|_F^2 = 2\sin^2\theta$
Solution. Using $\|A\|_F^2 = \operatorname{trace}(AA^T)$ and observing that $uu^T, vv^T$ are both projection matrices (thus symmetric), we get
$\|uu^T - vv^T\|_F^2 = \operatorname{trace}\left((uu^T - vv^T)(uu^T - vv^T)\right)$
$= \operatorname{trace}\left(uu^Tuu^T - uu^Tvv^T - vv^Tuu^T + vv^Tvv^T\right)$
$= \operatorname{trace}(uu^T) - \operatorname{trace}(uu^Tvv^T) - \operatorname{trace}(vv^Tuu^T) + \operatorname{trace}(vv^T)$
$= \operatorname{trace}(u^Tu) - \operatorname{trace}(u^Tvv^Tu) - \operatorname{trace}(v^Tuu^Tv) + \operatorname{trace}(v^Tv)$
$= 1 - (u^Tv)^2 - (v^Tu)^2 + 1 = 2 - 2(u^Tv)^2 = 2 - 2\cos^2\theta = 2\sin^2\theta.$
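Numerically (a sketch; the angle below is arbitrary):

    theta = 0.7;
    u = [1; 0];
    v = [cos(theta); sin(theta)];     % unit vectors with angle theta between them
    norm(u*u' - v*v', 'fro')^2        % left-hand side
    2 * sin(theta)^2                  % right-hand side: same value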
5. Let $A \in \mathbb{R}^{n\times n}$ be a symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ (and corresponding eigenvectors $v_1, v_2, \ldots, v_n$). Solve the following problem:
$\max_{x \in \mathbb{R}^n :\ x \neq 0,\ v_1^Tx = 0} \frac{x^TAx}{x^Tx}$
Solution. Write
$A = \sum_{i=1}^n \lambda_i v_i v_i^T$
and let
$A_2 = A - \lambda_1 v_1 v_1^T = \sum_{i=2}^n \lambda_i v_i v_i^T$
For any $x \neq 0$ in $\mathbb{R}^n$ satisfying $v_1^Tx = 0$, we have
$x^TAx = x^T(A - \lambda_1 v_1 v_1^T)x = x^TA_2x$
Accordingly, we can rewrite the original problem as
$\max_{x \in \operatorname{span}(v_1)^\perp :\ x \neq 0} \frac{x^TA_2x}{x^Tx}$
This is a regular Rayleigh quotient problem with a reduced domain (the orthogonal complement of $\operatorname{span}(v_1)$ in $\mathbb{R}^n$), and the maximum is the largest eigenvalue of $A_2$ over that domain, which is $\lambda_2$, attained at $x = v_2$.
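A numerical check of the attained maximum (a sketch; A is a random symmetric matrix, and the eigenvalues are sorted to match the convention above):

    M = randn(5); A = (M + M') / 2;         % random symmetric matrix
    [V, D] = eig(A);
    [lam, idx] = sort(diag(D), 'descend');  % lambda_1 >= ... >= lambda_n
    V = V(:, idx);
    v2 = V(:, 2);                           % satisfies v1' * v2 = 0
    (v2' * A * v2) / (v2' * v2)             % equals lam(2), the constrained maximum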