CSC 576: Mathematical Foundations I

Ji Liu
Department of Computer Science, University of Rochester

September 20, 2016

1 Notations and Assumptions

In most cases (unless defined locally), we use

• Greek letters such as α, β, and γ to denote real numbers;

• lowercase letters such as x, y, and z to denote vectors;

• capital letters such as A, B, and C to denote matrices.

Other notations:

• R is the one-dimensional Euclidean space;

• R^n is the n-dimensional vector Euclidean space;

• R^{m×n} is the m × n dimensional matrix Euclidean space;

• R_+ denotes the range [0, +∞);

• 1_n ∈ R^n denotes the vector with 1 in all entries;

• For any vector x ∈ R^n, we use |x| to denote the componentwise absolute value, that is, |x|_i = |x_i| for all i = 1, ···, n;

• ⊙ denotes the componentwise product, that is, for any vectors x and y, (x ⊙ y)_i = x_i y_i.

Some assumptions:

• Unless defined otherwise locally, we always assume that all vectors are column vectors.

2 Vector norms, Inner product

A function f : R^n → R_+ is called a "norm" if the following three conditions are satisfied:

• (Zero element) f(x) ≥ 0, and f(x) = 0 if and only if x = 0;

• (Homogeneity) For any α ∈ R and x ∈ R^n, f(αx) = |α| f(x);

• (Triangle inequality) Any x, y ∈ R^n satisfy f(x) + f(y) ≥ f(x + y).

The ℓ_2 norm "‖·‖_2" (a special "f(·)") in R^n is defined as

‖x‖_2 = (|x_1|^2 + |x_2|^2 + ··· + |x_n|^2)^{1/2}.

Because the ℓ_2 norm is the most commonly used norm (also known as the Euclidean norm), we sometimes denote it by ‖·‖ for short. (Think about it: is f([x_1, x_2]) = 2x_1^2 + x_2^2 a norm?) A general ℓ_p norm (p ≥ 1) is defined as

‖x‖_p = (|x_1|^p + |x_2|^p + ··· + |x_n|^p)^{1/p}.

Note that for p < 1, it is not a "norm" since the triangle inequality is violated. The ℓ_∞ norm is defined as

‖x‖_∞ = max{|x_1|, |x_2|, ···, |x_n|}.

One may notice that the ℓ_∞ norm is the limit of the ℓ_p norm, that is, for any x ∈ R^n, ‖x‖_∞ = lim_{p→+∞} ‖x‖_p. In addition, people use ‖x‖_0 to denote the ℓ_0 "norm", which counts the number of nonzero entries of x.
The inner product ⟨·, ·⟩ in R^n is defined as

⟨x, y⟩ = ∑_i x_i y_i.

One can show that ⟨x, x⟩ = ‖x‖^2. Two vectors x and y are orthogonal if ⟨x, y⟩ = 0. That is one reason why the ℓ_2 norm is so special.
If p ≥ q, then for any x ∈ R^n we have ‖x‖_p ≤ ‖x‖_q. In particular, we have

‖x‖_1 ≥ ‖x‖_2 ≥ ‖x‖_∞.

To bound from the other side, we have

‖x‖_1 ≤ √n ‖x‖_2,    ‖x‖_2 ≤ √n ‖x‖_∞.

Proof. To see the first one, we have

‖x‖_1 = ⟨1_n, |x|⟩ ≤ ‖1_n‖_2 ‖|x|‖_2 = √n ‖x‖_2,

where the inequality uses the Cauchy–Schwarz inequality. The proof of the second inequality is left as a homework exercise.
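As a quick numerical sanity check of these inequalities, here is a minimal NumPy sketch (the vector entries are arbitrary example values):

    import numpy as np

    x = np.array([3.0, -1.0, 2.0, 0.5])   # arbitrary example vector
    n = x.size

    l1 = np.linalg.norm(x, 1)
    l2 = np.linalg.norm(x, 2)
    linf = np.linalg.norm(x, np.inf)

    # Ordering of the p-norms: ||x||_1 >= ||x||_2 >= ||x||_inf
    assert l1 >= l2 >= linf

    # Reverse bounds: ||x||_1 <= sqrt(n) ||x||_2 and ||x||_2 <= sqrt(n) ||x||_inf
    assert l1 <= np.sqrt(n) * l2 and l2 <= np.sqrt(n) * linf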

Given a norm "‖·‖_A", its dual norm is defined as

‖x‖_{A*} = max_{‖y‖_A ≤ 1} ⟨x, y⟩ = max_{‖y‖_A = 1} ⟨x, y⟩ = max_z ⟨x, z⟩ / ‖z‖_A.

Several important properties of the dual norm are:

• The dual norm's dual norm is the original norm, that is, ‖x‖_{(A*)*} = ‖x‖_A;

• The ℓ_2 norm is self-dual, that is, the dual norm of the ℓ_2 norm is still the ℓ_2 norm;

• The dual norm of the ℓ_p norm (p ≥ 1) is the ℓ_q norm, where p and q satisfy 1/p + 1/q = 1. In particular, the ℓ_1 norm and the ℓ_∞ norm are dual to each other;

• (Hölder's inequality) ⟨x, y⟩ ≤ ‖x‖_A ‖y‖_{A*}.
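The Hölder inequality for specific dual pairs is easy to check numerically; below is a minimal NumPy sketch, where the random vectors and the pair p = 3, q = 3/2 are arbitrary example choices:

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(5), rng.standard_normal(5)

    # Hölder's inequality |<x, y>| <= ||x||_p ||y||_q for the dual pairs (1, inf) and (3, 3/2)
    assert abs(x @ y) <= np.linalg.norm(x, 1) * np.linalg.norm(y, np.inf) + 1e-12
    assert abs(x @ y) <= np.linalg.norm(x, 3) * np.linalg.norm(y, 1.5) + 1e-12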

3 Linear space, subspace, linear transformation

A set S is a linear space if

• 0 ∈ S;

• given any two points x, y ∈ S and any two scalars α ∈ R and β ∈ R, we have

αx + βy ∈ S.

Note that ∅ is not a linear space. Examples: R^n, and the matrix space R^{m×n}. How about the following sets:

• 0; (no)

• {0}; (yes)

• {x | Ax = b} where A is a matrix and b is a vector. (yes if b = 0; otherwise, no)

Let S be a linear space. A set S′ is a subspace if S′ is a linear space and also a subset of S. Actually, "subspace" is equivalent to "linear space", because any subspace is a linear space and any linear space is a subspace (of itself). They are indeed talking about the same thing.
Let S be a linear space. A function L(·) is a linear transformation if given any two points x, y ∈ S and two scalars α ∈ R and β ∈ R, one has

L(αx + βy) = αL(x) + βL(y).

For transformations between finite-dimensional vector spaces such as R^n and R^m, there exists a one-to-one correspondence between linear transformations and matrices. Therefore, we can simply say "a matrix is a linear transformation".

• Prove that {L(x) | x ∈ S} is a linear space if S is a linear space and L is a linear transformation.

• Prove that {x | L(x) ∈ S} is a linear space, assuming S is a linear space and L is a linear transformation.

How to express a subspace? The most intuitive way is to use a bunch of vectors. A subspace can be expressed by

span{x_1, x_2, ···, x_n} = { ∑_{i=1}^n α_i x_i | α_i ∈ R } = {Xα | α ∈ R^n},

which is called the range space of the matrix X = [x_1, x_2, ···, x_n]. A subspace can also be represented by the null space of a matrix X: {α | Xα = 0}.
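As an illustration, one can compute orthonormal bases of the range space and the null space numerically. The following is a minimal NumPy sketch (the matrix X is an arbitrary rank-2 example), using the SVD that is introduced in the next section:

    import numpy as np

    # Columns of X span a 2-dimensional subspace of R^3.
    X = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 0.0]])

    U, s, Vt = np.linalg.svd(X)
    r = int(np.sum(s > 1e-10))      # numerical rank
    range_basis = U[:, :r]          # orthonormal basis of the range space of X
    null_basis = Vt[r:].T           # orthonormal basis of the null space {a | Xa = 0}

    assert np.allclose(X @ null_basis, 0)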

4 Eigenvalues / eigenvectors, rank, SVD, inverse

The transpose of a matrix A ∈ R^{m×n} is defined as A^T ∈ R^{n×m} with (A^T)_{ij} = A_{ji}.

One can verify that (AB)^T = B^T A^T.
A matrix B ∈ R^{n×n} is the inverse of an invertible matrix A ∈ R^{n×n} if AB = I and BA = I.

B is denoted by A^{-1}. A is invertible if and only if A has full rank (the definition of "rank" will be given shortly). Note that the inverse of a matrix is unique. One can also verify that if both A and B are invertible, then

(AB)^{-1} = B^{-1}A^{-1}.

The “transpose” and the “inverse” are exchangeable:

(A^T)^{-1} = (A^{-1})^T.
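Both rules are easy to confirm numerically; here is a minimal NumPy sketch with arbitrary random matrices (shifted by a multiple of I so they are safely invertible):

    import numpy as np

    rng = np.random.default_rng(9)
    A = rng.standard_normal((4, 4)) + 4 * np.eye(4)
    B = rng.standard_normal((4, 4)) + 4 * np.eye(4)

    # (AB)^{-1} = B^{-1} A^{-1} and (A^T)^{-1} = (A^{-1})^T
    assert np.allclose(np.linalg.inv(A @ B), np.linalg.inv(B) @ np.linalg.inv(A))
    assert np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T)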

When we write A^{-1}, we have to make sure that A is invertible.
Given a square matrix A ∈ R^{n×n}, x ∈ R^n (x ≠ 0) is called an eigenvector and λ ∈ R is called the corresponding eigenvalue if the following relationship is satisfied

Ax = λx. (The effect of applying the linear transformation A on x is nothing but scaling it.)

Note that:

• If (λ, x) is an eigenvalue–eigenvector pair, then so is (λ, αx) for any α ≠ 0.

• One eigenvalue may correspond to multiple different eigenvectors. "Different" means the eigenvectors are still different after normalization.

If the matrix A is symmetric, then any two eigenvectors corresponding to different eigenvalues are orthogonal, that is, if A^T = A, Ax_1 = λ_1 x_1, Ax_2 = λ_2 x_2, and λ_1 ≠ λ_2, then

x_1^T x_2 = 0.

Proof. Consider x_1^T A x_2. On one hand,

x_1^T A x_2 = x_1^T (A x_2) = x_1^T (λ_2 x_2) = λ_2 x_1^T x_2,

and on the other hand, using A = A^T,

x_1^T A x_2 = (x_1^T A) x_2 = (A^T x_1)^T x_2 = (A x_1)^T x_2 = λ_1 x_1^T x_2.

Therefore, we have

λ_2 x_1^T x_2 = λ_1 x_1^T x_2.

Since λ_1 ≠ λ_2, we obtain x_1^T x_2 = 0.
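A quick numerical illustration of this fact (a minimal NumPy sketch with an arbitrary random symmetric matrix):

    import numpy as np

    rng = np.random.default_rng(1)
    M = rng.standard_normal((4, 4))
    A = M + M.T                      # symmetric matrix

    w, V = np.linalg.eigh(A)         # eigenvalues w, eigenvectors as columns of V

    # Eigenvectors belonging to distinct eigenvalues are orthogonal;
    # eigh in fact returns an orthonormal set, so V^T V = I.
    assert np.allclose(V.T @ V, np.eye(4), atol=1e-10)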

A matrix A ∈ R^{m×n} is a "rank-1" matrix if A can be expressed as

A = xy^T

where x ∈ R^m and y ∈ R^n, with x ≠ 0 and y ≠ 0. The rank of a matrix A ∈ R^{m×n} is defined as

rank(A) = min{ r | A = ∑_{i=1}^r x_i y_i^T, x_i ∈ R^m, y_i ∈ R^n }
        = min{ r | A = ∑_{i=1}^r B_i, where each B_i is a "rank-1" matrix }.

Examples: [1, 1; 1, 1], [1, 1; 2, 2], and many natural images have the low-rank property. "Low rank" implies that the matrix contains only a small amount of information.
We say "U ∈ R^{m×n} has orthogonal columns" if U^T U = I, that is, any two columns U_{·i} and U_{·j} of U satisfy

U_{·i}^T U_{·j} = 0 if i ≠ j; otherwise U_{·i}^T U_{·i} = 1.

If we swap any two columns of U to obtain U′, then U′ still satisfies U′^T U′ = I.

• ‖Ux‖ = ‖x‖ for all x.

• ‖U^T y‖ ≤ ‖y‖ for all y.

If U is a square matrix and has orthogonal columns, then we call it an "orthogonal matrix". It has some nice properties:

• U^{-1} = U^T (which means that UU^T = U^T U = I);

• U^T is also an orthogonal matrix;

• The effect of applying the transformation U to a vector x is to rotate (or reflect) x, that is, ‖Ux‖ = ‖x‖ = ‖U^T x‖.

"SVD" is short for "singular value decomposition", which is the most important concept in linear algebra and matrix analysis. The SVD reveals almost all of the structure of a matrix. Any matrix A ∈ R^{m×n} can be decomposed as

A = UΣV^T = ∑_{i=1}^r σ_i U_{·i} V_{·i}^T,

where U ∈ R^{m×r} and V ∈ R^{n×r} have orthogonal columns, and Σ = diag{σ_1, σ_2, ···, σ_r} is a diagonal matrix with positive diagonal elements. The σ_i's are called singular values; they are positive and arranged in decreasing order.

• rank(A) = r;

• ‖Ax‖ ≤ σ_1 ‖x‖. Why?
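The following minimal NumPy sketch illustrates both properties on an arbitrary random low-rank example:

    import numpy as np

    rng = np.random.default_rng(2)
    # A 6 x 4 matrix of rank 2, built as the sum of two rank-1 matrices.
    A = np.outer(rng.standard_normal(6), rng.standard_normal(4)) \
        + np.outer(rng.standard_normal(6), rng.standard_normal(4))

    U, s, Vt = np.linalg.svd(A)
    r = int(np.sum(s > 1e-10))
    assert r == np.linalg.matrix_rank(A) == 2     # rank(A) = number of positive singular values

    x = rng.standard_normal(4)
    assert np.linalg.norm(A @ x) <= s[0] * np.linalg.norm(x) + 1e-12   # ||Ax|| <= sigma_1 ||x||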

A matrix B ∈ R^{n×n} is positive semi-definite (PSD) if the following conditions are satisfied:

• B is symmetric;

• For all x ∈ R^n, we have x^T Bx ≥ 0.

A positive definite matrix is defined by adding one more condition:

• x^T Bx = 0 ⇔ x = 0.

We can also use the following equivalent definition for PSD matrices: a matrix B ∈ R^{n×n} is positive semi-definite (PSD) if the SVD of B can be written as

B = UΣU^T,

where U^T U = I and Σ is a diagonal matrix with nonnegative diagonal elements. Examples of PSD matrices: I and A^T A.
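As a quick check that A^T A is PSD, here is a minimal NumPy sketch with an arbitrary random A:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((5, 3))
    B = A.T @ A                                       # A^T A is PSD

    assert np.allclose(B, B.T)                        # symmetric
    assert np.all(np.linalg.eigvalsh(B) >= -1e-12)    # nonnegative eigenvalues
    x = rng.standard_normal(3)
    assert x @ B @ x >= -1e-12                        # x^T B x >= 0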

Assume matrices A and B are invertible. We have the following identity:

B^{-1} = A^{-1} − B^{-1}(B − A)A^{-1}.

The Sherman–Morrison–Woodbury formula is very useful for calculating matrix inverses:

(A + UV^T)^{-1} = A^{-1} − A^{-1}U(I + V^T A^{-1}U)^{-1}V^T A^{-1}.

This result is especially important from a computational perspective. A special case is when U and V are two vectors u and v. Then it takes the form

(A + uv^T)^{-1} = A^{-1} − (1 + v^T A^{-1}u)^{-1} A^{-1}uv^T A^{-1},

which can be calculated with complexity O(n^2) if A^{-1} is known.
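A minimal NumPy sketch of this rank-1 update follows (the matrix A and the vectors u, v are arbitrary examples); once A^{-1} is available, only matrix–vector products and one outer product are needed, i.e., O(n^2) work instead of a fresh O(n^3) inversion:

    import numpy as np

    rng = np.random.default_rng(4)
    n = 200
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # a well-conditioned example matrix
    u, v = rng.standard_normal(n), rng.standard_normal(n)

    Ainv = np.linalg.inv(A)                           # assume A^{-1} is already known

    Au = Ainv @ u
    vA = v @ Ainv
    updated_inv = Ainv - np.outer(Au, vA) / (1.0 + v @ Au)   # Sherman-Morrison update

    assert np.allclose(updated_inv, np.linalg.inv(A + np.outer(u, v)))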

Sylvester's determinant theorem states that

det(I_m + AB) = det(I_n + BA),

where A ∈ R^{m×n} and B ∈ R^{n×m}.
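A quick numerical check of this identity (a minimal NumPy sketch with arbitrary random A and B):

    import numpy as np

    rng = np.random.default_rng(5)
    m, n = 5, 3
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, m))

    lhs = np.linalg.det(np.eye(m) + A @ B)   # determinant of an m x m matrix
    rhs = np.linalg.det(np.eye(n) + B @ A)   # determinant of a (smaller) n x n matrix
    assert np.isclose(lhs, rhs)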

5 Matrix norms (spectral norm, nuclear norm, Frobenius norm)

The Frobenius norm (F-norm) of a matrix A ∈ R^{m×n} is defined as

‖A‖_F = ( ∑_{1≤i≤m, 1≤j≤n} |A_{ij}|^2 )^{1/2} = ( ∑_i σ_i^2 )^{1/2},

where the σ_i's are the singular values of A.

If A is a vector, one can verify that ‖A‖_F = ‖A‖_2. The inner product ⟨·, ·⟩ in R^{m×n} is defined as

⟨X, Y⟩ = ∑_{i,j} X_{ij} Y_{ij} = trace(X^T Y) = trace(Y X^T) = trace(XY^T) = trace(Y^T X).

An important property for trace(AB):

trace(AB) = trace(BA) = trace(A^T B^T) = trace(B^T A^T).

One may notice that ⟨X, X⟩ = ‖X‖_F^2.
The spectral norm of a matrix A ∈ R^{m×n} is defined as

‖A‖_spec = max_{‖x‖=1} ‖Ax‖ = max_{‖x‖=1, ‖y‖=1} y^T Ax = σ_1(A).

The nuclear norm of a matrix A ∈ R^{m×n} is defined as

‖A‖_tr = ∑_i σ_i(A) = trace(Σ),

where Σ is the diagonal matrix in the SVD A = UΣV^T. An important relationship is

‖A‖_spec ≤ ‖A‖_F ≤ ‖A‖_tr   and   rank(A) ‖A‖_spec ≥ √(rank(A)) ‖A‖_F ≥ ‖A‖_tr.
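These relationships can be checked numerically; the following is a minimal NumPy sketch on an arbitrary random matrix:

    import numpy as np

    rng = np.random.default_rng(6)
    A = rng.standard_normal((5, 7))
    s = np.linalg.svd(A, compute_uv=False)      # singular values of A

    spec, fro, nuc = s[0], np.linalg.norm(A, 'fro'), s.sum()
    r = np.linalg.matrix_rank(A)

    assert spec <= fro <= nuc + 1e-12
    assert r * spec >= np.sqrt(r) * fro >= nuc - 1e-12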

The dual norm of a matrix norm ‖·‖_A is defined as

‖Y‖_{A*} := max_X ⟨X, Y⟩ / ‖X‖_A = max_{‖X‖_A ≤ 1} ⟨X, Y⟩.   (1)

We have the following properties (think about why they are true):

‖X‖_{spec*} = ‖X‖_tr,   ‖X‖_{F*} = ‖X‖_F.
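For example, these dual pairings give the matrix Hölder inequalities ⟨X, Y⟩ ≤ ‖X‖_spec ‖Y‖_tr and ⟨X, Y⟩ ≤ ‖X‖_F ‖Y‖_F, which the following minimal NumPy sketch checks on arbitrary random matrices:

    import numpy as np

    rng = np.random.default_rng(7)
    X, Y = rng.standard_normal((4, 6)), rng.standard_normal((4, 6))
    inner = np.trace(X.T @ Y)                                   # <X, Y>

    sX = np.linalg.svd(X, compute_uv=False)
    sY = np.linalg.svd(Y, compute_uv=False)

    assert inner <= sX[0] * sY.sum() + 1e-12                    # <X,Y> <= ||X||_spec ||Y||_tr
    assert inner <= np.linalg.norm(X, 'fro') * np.linalg.norm(Y, 'fro') + 1e-12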

6 Matrix and Vector Differential

Let f(X) : R^{m×n} → R be a function of a matrix X ∈ R^{m×n}. Its differential (or gradient) is defined as

∂f(X)/∂X = [ ∂f(X)/∂X_{11}  ···  ∂f(X)/∂X_{1j}  ···  ∂f(X)/∂X_{1n} ]
           [      ···              ···               ···           ]
           [ ∂f(X)/∂X_{i1}  ···  ∂f(X)/∂X_{ij}  ···  ∂f(X)/∂X_{in} ]
           [      ···              ···               ···           ]
           [ ∂f(X)/∂X_{m1}  ···  ∂f(X)/∂X_{mj}  ···  ∂f(X)/∂X_{mn} ].

We provide a few examples in the following:

f(X) = trace(A^T X) = ⟨A, X⟩          ∂f(X)/∂X = A
f(X) = trace(X^T A X)                  ∂f(X)/∂X = (A + A^T)X
f(X) = (1/2) ‖AX − B‖_F^2             ∂f(X)/∂X = A^T(AX − B)
f(X) = (1/2) trace(B^T X^T X B)        ∂f(X)/∂X = XBB^T
f(X) = (1/2) trace(B^T X^T A X B)      ∂f(X)/∂X = (1/2)(A + A^T)XBB^T
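One convenient way to sanity-check such gradient formulas is a finite-difference comparison. Here is a minimal NumPy sketch for the third example above, with arbitrary random A, X, and B:

    import numpy as np

    rng = np.random.default_rng(8)
    A = rng.standard_normal((5, 4))
    X = rng.standard_normal((4, 3))
    B = rng.standard_normal((5, 3))

    # f(X) = 0.5 * ||AX - B||_F^2 and the stated gradient A^T (AX - B)
    f = lambda Z: 0.5 * np.linalg.norm(A @ Z - B, 'fro') ** 2
    grad = A.T @ (A @ X - B)

    # Central finite-difference approximation of each partial derivative df/dX_ij
    eps = 1e-6
    num_grad = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            num_grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

    assert np.allclose(num_grad, grad, atol=1e-5)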
