
Linear Algebra Review

Kaiyu Zheng

October 2017

Linear algebra is fundamental to many areas of computer science. This document aims at providing a reference (mostly for myself) for when I need to recall some concepts or examples. Rather than being a collection of facts like the Cookbook, this document is gentler, more like a tutorial. Most of the content comes from my notes taken during the undergraduate linear algebra course (Math 308) at the University of Washington. Content on more advanced topics is collected from reading different sources on the Internet.

Contents

1 Linear System of Equations
2 Vectors
   2.1 Linear independence
   2.2 Linear dependence
   2.3 Linear transformation
3 Matrix Algebra
   3.1 Addition
   3.2 Scalar Multiplication
   3.3 Matrix Multiplication
   3.4 Transpose
       3.4.1 Conjugate Transpose
   3.5 Inverse
   3.6 Trace
   3.7 Power
   3.8 Exponential and Logarithm
   3.9 Conversion Between Matrix Notation and Summation
4 Vector Spaces
   4.1 Determinant
   4.2 Kernel
   4.3 Basis
   4.4 Change of Basis
   4.5 Dimension, Row & Column Space, and Rank
5 Eigen
   5.1 Multiplicity of Eigenvalues
   5.2 Eigendecomposition
6 The Big Theorem
7 Special Matrices
   7.1 Block Matrix
   7.2 Orthogonal
   7.3 Diagonal
   7.4 Diagonalizable
   7.5 Symmetric
   7.6 Positive-Definite
   7.7 Singular Value Decomposition
   7.8 Similar
   7.9 Jordan Normal Form
   7.10 Hermitian
   7.11 Discrete Fourier Transform
8 Matrix Calculus
   8.1 Differentiation
   8.2 Jacobian
   8.3 The Chain Rule
9 Algorithms
   9.1 Gauss-Seidel Method

Notation

We denote vectors using bold lower case letters such as x, matrices using bold upper case letters such as X, and entries of matrices using normal upper case letters such as X_{ij} or X_{i,j} (the comma is used if the indices are expressed by equations). The vector e_i by default means the i-th column vector of an identity matrix, with dimension depending on the context.

1 Linear System of Equations

Definition 1.1 (Echelon Form). Each variable can be the leading variable for at most one equation.

For example,

\begin{aligned}
x_1 + x_2 + x_3 - x_4 &= 0 \\
-x_2 + 7x_4 - x_5 &= -1 \\
x_4 + x_5 &= 2
\end{aligned}    (1)

Definition 1.2. Linear systems are equivalent if they are related by a sequence of elementary operations:

(1) Interchange position of rows

(2) Multiply an equation by a nonzero constant

(3) Add a multiple of one equation to another

Definition 1.3 (Augmented Matrix). The linear system

\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1m}x_m &= b_1 \\
&\;\;\vdots \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nm}x_m &= b_n
\end{aligned}    (2)

can be written as an augmented matrix as follows:

\begin{bmatrix} a_{11} & \cdots & a_{1m} & b_1 \\ \vdots & \ddots & \vdots & \vdots \\ a_{n1} & \cdots & a_{nm} & b_n \end{bmatrix}    (3)

Definition 1.4 (Row Echelon Form). A matrix is in row echelon form if

a) Every leading term is in a column to the left of the leading term of the row below it.

b) Any zero rows are at the bottom of the matrix.

For example, the left matrix below is not in echelon form, because the row "0 = 7" has no leading variable; it is an inconsistent matrix. The right matrix is in echelon form.

\begin{bmatrix} 1 & 2 & 3 & 0 & 0 \\ 0 & 0 & 1 & 2 & 3 \\ 0 & 0 & 0 & 0 & 7 \end{bmatrix} \qquad \begin{bmatrix} 1 & -2 & 5 & 2 & -1 \\ 0 & 3 & 4 & 5 & 6 \\ 0 & 0 & 22 & 14 & 4 \end{bmatrix}

The leading variable positions in the matrix are called pivot positions. A column in the matrix that contains a pivot position is a pivot column. The process of converting a linear system into echelon form is called Gaussian elimination.

Definition 1.5 (Reduced Row Echelon Form). A matrix is said to be in reduced row echelon form if:

a) every pivot position contains a 1

b) the only nonzero term in each pivot column is the pivot

c) it is in row echelon form.

Try finding the reduced row echelon form of the following matrix:

0 3 4 5 6  1 −2 5 2 −1 (4) 3 0 1 2 5

Definition 1.6 (Homogeneity). A homogeneous linear equation is

a_1x_1 + a_2x_2 + \cdots + a_nx_n = 0    (5)

The equation is said to be in homogeneous form. A linear system where all equations are in homogeneous form is a homogeneous system.

Every homogeneous system is consistent, i.e. solvable.

2 Vectors

Definition 2.1 (Norm). The norm, or magnitude, of a vector a ∈ R^n is defined as the L2-norm of the vector:

|a| = \sqrt{\sum_{i=1}^{n} a_i^2}    (6)

Definition 2.2 (Dot Product). (Algebraic definition) Let a and b be two vectors in R^n. Then the dot product (or inner product) between a and b is defined as:

a \cdot b = a^T b = \sum_{i=1}^{n} a_i b_i    (7)

(Geometric definition) The dot product of two Euclidean vectors a and b is defined by

a \cdot b = |a|\,|b|\cos(\theta_{a,b})    (8)

Also, the dot product equation w \cdot x = b describes a hyperplane, where w is normal to it.

Definition 2.3 (Projection). Let a and b be two vectors in R^n. The projection of b onto a is defined as

\mathrm{proj}_a b = \frac{a \cdot b}{|a|}\,\frac{a}{|a|} = \frac{a \cdot b}{|a|^2}\,a    (9)

Definition 2.4 (Outer Product). Let a and b be two vectors in R^n. Then the outer product (or tensor product) between a and b is defined such that (ab^T)_{ij} = a_i b_j:

ab^T = \begin{bmatrix} a_1b_1 & a_1b_2 & \cdots & a_1b_n \\ a_2b_1 & a_2b_2 & \cdots & a_2b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_nb_1 & a_nb_2 & \cdots & a_nb_n \end{bmatrix}    (10)
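As a small illustration of Definitions 2.2-2.4, the sketch below uses numpy with two made-up vectors to compute the dot product (7), the projection (9), and the outer product (10):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, -2.0])

dot = a @ b                              # a . b = sum_i a_i b_i, Equation (7)
proj_b_onto_a = (dot / (a @ a)) * a      # (a . b / |a|^2) a, Equation (9)
outer = np.outer(a, b)                   # (a b^T)_ij = a_i b_j, Equation (10)

print(dot, proj_b_onto_a, outer.shape)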

Definition 2.5 (Linear Combination). If u_1, u_2, \cdots, u_m are vectors and c_1, c_2, \cdots, c_m are scalars, then c_1u_1 + c_2u_2 + \cdots + c_mu_m is a linear combination of the vectors.

Definition 2.6 (Span). Let \{u_1, \cdots, u_m\} be a set of m vectors in R^n. The span of the set is the set of all linear combinations of u_1, \cdots, u_m.

For example, suppose u_1 = [1, 2, 3]^T and u_2 = [3, 2, 1]^T; what is the span of \{u_1, u_2\}? A vector v = [a, b, c]^T \in \mathrm{span}\{u_1, u_2\} if and only if there exist s, t such that su_1 + tu_2 = v. Such s, t exist if the augmented matrix

\begin{bmatrix} 1 & 3 & a \\ 2 & 2 & b \\ 3 & 1 & c \end{bmatrix}

has a solution. This matrix reduces to

\begin{bmatrix} 1 & 3 & a \\ 0 & 4 & 2a - b \\ 0 & 0 & a - 2b + c \end{bmatrix},

therefore it has a solution when a - 2b + c = 0 holds. So the span of \{u_1, u_2\} is the plane x - 2y + z = 0.

Definition 2.7 (Relation of Span and Augmented Matrix). If a vector v is in the span of the vectors \{u_1, \cdots, u_m\}, then the linear system with augmented matrix [u_1 \cdots u_m \; v] has at least one solution.

Theorem 2.1 (Relation of Span and Linear Independence). If u ∈ span\{u_1, \cdots, u_m\}, then span\{u_1, \cdots, u_m\} = span\{u, u_1, \cdots, u_m\}.

2.1 Linear independence

Definition 2.8 (Linear Independence). Let \{u_1, \cdots, u_m\} be a set of vectors in R^n. If the only solution to the equation x_1u_1 + \cdots + x_mu_m = 0 is the trivial solution (i.e. all zeros), then u_1, \cdots, u_m are linearly independent. Fact: if a set of vectors contains 0, the set is not linearly independent.

Definition 2.9 (Orthonormal Vectors). Vectors in a set U = \{u_1, \cdots, u_m\} are orthonormal if every vector in U is a unit vector and every pair u_i, u_j ∈ U of distinct vectors is orthogonal, i.e. u_i^T u_j = 0.

Theorem 2.2. Every set of orthonormal vectors is linearly independent (i.e. the vectors in the set are linearly independent).

2.2 Linear dependence

Theorem 2.3 (Linear Dependence). Let \{u_1, \cdots, u_m\} be a set of vectors in R^n. If n < m, the set is linearly dependent.

Corollary 2.3.1 (Relation of Span and Linear Independence). If there is a set of m linearly independent vectors in R^n that spans all of R^n, then m = n.

Theorem 2.4 (Relation of Linear Combination and Linear Dependence). Let \{u_1, \cdots, u_m\} be a set of vectors in R^n. The vectors in this set are linearly dependent if one vector is a linear combination of the others.

2.3 Linear transformation

Definition 2.10 (Linear Transformation). T : R^m → R^n is a linear transformation if for all v, u ∈ R^m and for all r ∈ R, T(v + u) = T(v) + T(u) and T(rv) = rT(v). R^m is the domain, and R^n is the co-domain. For u ∈ R^m, T(u) is the image of u under T.

Definition 2.11 (Subspace). A subset S of R^n is a subspace if S satisfies:

a) S contains 0.

b) if u and v are in S then u + v is also in S.(closure under addition)

c) If r is a scalar and u ∈ S, then ru ∈ S. (closure under scalar multiplication)

Definition 2.12 (One-to-one and Onto). Let T : R^m → R^n, T(v) = Av, so that T is a linear transformation. T is one-to-one (injective) if and only if T(x) = 0 has only the trivial solution (i.e. x = 0), or equivalently, T(a) = T(b) implies a = b. This means the columns of A are linearly independent. T is onto (surjective) if and only if the columns of A span R^n.

Note that A is an n × m matrix. If m > n, T is not one-to-one. If m < n, T is not onto.

In more general terms, a function is one-to-one (injective) if every element of the co-domain is mapped to by at most one element of the domain. A function is onto (surjective) if every element of the co-domain is mapped to by at least one element of the domain. A function is one-to-one and onto (bijective) if every element of the co-domain is mapped to by exactly one element of the domain.

3 Matrix Algebra

3.1 Addition

If A, B ∈ Mn×m(R) and r ∈ R,

(A + B)ij = (A)ij + (B)ij (11)

3.2 Scalar Multiplication

(rA)ij = r(A)ij (12)

3.3 Matrix Multiplication

If T : R^m → R^n is represented by A ∈ M_{n×m}(R) and W : R^n → R^l is represented by B ∈ M_{l×n}(R), then BA represents the composition W ∘ T : R^m → R^l. So BA ∈ M_{l×m}(R). Matrix multiplication can be thought of as applying a series of linear transformations to vectors in an initial domain. For example, BA is illustrated as

R^m --T--> R^n --W--> R^l

Notice that although the final transformation is R^m → R^l, which reads "a transformation going from R^m (domain of T) to R^l (codomain of W)", the formal notation is "reversed", which is W ∘ T.

Alternative definition: Let A ∈ M_{n×p}(R) and B ∈ M_{p×m}(R); then AB ∈ M_{n×m}(R). We will look at several equivalent algebraic definitions of AB from different perspectives. But first of all, let us look at two interpretations of matrix-vector multiplication Ax where x ∈ R^p.

1) We consider Ax from the perspective of the row vectors of A, that is, we view A as

A = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}    (13)

where each component a_i is a row vector. Then, Ax can be computed by performing the dot product a_i^T x for i ∈ \{1, \cdots, n\}, therefore (Ax)_i = a_i^T x. Specifically,

Ax = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_n^T x \end{bmatrix}    (14)

2) We can also compute Ax by considering the column vectors of A, such that

A = \begin{bmatrix} a|_1 & a|_2 & \cdots & a|_p \end{bmatrix}    (15)

where each component a|_i is a column vector. Then, the matrix-vector multiplication Ax can be viewed as a linear combination of the columns of A with coefficients determined by the entries x_i for i ∈ \{1, \cdots, p\}:

Ax = x_1 a|_1 + x_2 a|_2 + \cdots + x_p a|_p = \sum_{i=1}^{p} a|_i x_i    (16)

Now, let us look at matrix-matrix multiplication also from two perspectives.

1) When we consider row vectors of A and column vectors of B, the multiplication AB can be viewed as

AB = \begin{bmatrix} Ab|_1 & Ab|_2 & \cdots & Ab|_m \end{bmatrix}    (17)

where B = \begin{bmatrix} b|_1 & b|_2 & \cdots & b|_m \end{bmatrix}. From Equation 14, we know (Ab|_k)_i = a_i^T b|_k. Therefore, (AB)_{ij} = a_i^T b|_j.

2) When we consider column vectors of A and row vectors of B, the multiplication AB can be viewed as

AB = \sum_{i=1}^{p} a|_i b_i^T    (18)

where a|_i b_i^T is an outer product with output dimension n × m.
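To make the second view concrete, the following numpy sketch (small random matrices, chosen only for illustration) checks that the sum of outer products in Equation 18 reproduces the ordinary product AB:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # n x p
B = rng.standard_normal((3, 5))   # p x m

# Sum over i of (column i of A)(row i of B), as in Equation 18
AB_outer = sum(np.outer(A[:, i], B[i, :]) for i in range(A.shape[1]))

print(np.allclose(AB_outer, A @ B))   # True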

Properties of Matrix Multiplication:

1) A(BC) = (AB)C

2) A(B + C) = AB + AC

3) (A + B)C = AC + BC

4) s(AB) = (sA)B = A(sB)

5) IA = AI = A

Caveats:

1) AB ≠ BA (usually)

2) AC = AB does not imply C = B

3.4 Transpose

If A ∈ M_{n×m}(R), then A^T ∈ M_{m×n}(R).

Properties of Transpose:

1) (A + B)^T = A^T + B^T

2) (sA)^T = s(A^T)

3) (AC)^T = C^T A^T

Theorem 3.1. A matrix A has the property that for all v, w ∈ R^n, v · w = Av · Aw, if and only if A is orthogonal, that is, AA^T = I_n, or equivalently, A^T = A^{-1}.

3.4.1 Conjugate Transpose

Definition 3.1 (Conjugate Transpose). Given an n×n matrix A with complex entries (i.e. entries are complex numbers), the conjugate transpose (or Hermitian transpose, Hermitian conjugate) of A is given by

A^H = (\bar{A})^T    (19)

where \bar{A} has the complex conjugate entries of A.

Properties of Conjugate Transpose:

1) (A + B)^H = A^H + B^H

2) (sA)^H = \bar{s}(A^H)

3) (AC)^H = C^H A^H

3.5 Inverse

Definition 3.2 (Invertibility). A linear transformation T : R^m → R^n is invertible if it is one-to-one and onto. Two implications follow if T : R^m → R^n is invertible:

1) m = n (required)

2) T^{-1} is also linear.

Theorem 3.2 (Inverse of a Matrix). An n × n matrix A is invertible if there exists a matrix B so that BA = I_n. If A is invertible, B is unique, and we define A^{-1} = B. It follows that

BA = AB = I_n

To compute A^{-1}, form the n × 2n matrix [A  I_n]. Then convert it to reduced row echelon form, which results in [I_n  A^{-1}].

Theorem 3.3 (Invertibility Implies Non-zero Determinant). An n × n matrix is invertible if and only if its determinant is not zero.
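The [A  I_n] recipe above can be carried out symbolically. Below is a minimal sketch, assuming sympy and a made-up invertible 3 × 3 matrix:

import sympy

A = sympy.Matrix([[2, 1, 1],
                  [1, 3, 2],
                  [1, 0, 0]])

# Form the n x 2n matrix [A | I_n] and convert it to reduced row echelon form.
augmented = A.row_join(sympy.eye(3))
rref_aug, _ = augmented.rref()

A_inv = rref_aug[:, 3:]        # the right half is A^{-1}
print(A_inv == A.inv())        # True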

Theorem 3.4 (Invertibility and Positive-Definite). Any positive-definite matrix is invertible.

Properties of Matrix Inverse: If A, B are invertible n × n matrices, and C, D are n × m matrices, then:

a) A−1 is invertible. (A−1)−1 = A

b) AA−1 = A−1A = I

c) AB is invertible. (AB)−1 = B−1A−1

d) If AC = AD, then C = D

e) If AC = 0, then C = 0

f) (A^T)^{-1} = (A^{-1})^T

Proof. We will prove c) and d).

c) Show that AB(B^{-1}A^{-1}) = I_n:

AB(B^{-1}A^{-1}) = A(BB^{-1})A^{-1} = AI_nA^{-1} = AA^{-1} = I_n    (20)

d) Show that AC = AD ⟹ C = D:

AC = AD    (21)
⟹ A^{-1}AC = A^{-1}AD    (22)
⟹ I_nC = I_nD    (23)
⟹ C = D    (24)

From the above proof, we see that A being invertible is important, because otherwise A−1 does not exist.

3.6 Trace

Definition 3.3 (Trace). Let A ∈ M_{n×n}(R). The trace of A is defined as the sum of the entries along the main diagonal:

tr(A) = \sum_{i=1}^{n} a_{ii}    (25)

Properties of Trace:

a) tr(A + B) = tr(A) + tr(B)

b) tr(cA) = c · tr(A)

c) tr(AB) = tr(BA)

d) tr(A) = tr(A^T)

e) tr(X^T Y) = tr(XY^T) = tr(Y^T X) = \sum_{ij} X_{ij}Y_{ij}

f) Similarity-invariant:

tr(P^{-1}AP) = tr(P^{-1}(AP)) = tr((AP)P^{-1}) = tr(A(PP^{-1})) = tr(A)

g) d tr(X) = tr(dX)

Trace and Eigenvalues: In Section 5, we discuss eigenvectors and eigenvalues in more detail. For the sake of proximity, we describe the relation of trace and eigenvalues here.

Theorem 3.5. If A is an n × n matrix with real or complex entries and if λ_1, \cdots, λ_n are the eigenvalues of A, then

tr(A) = \sum_i λ_i    (26)

tr(A^k) = \sum_i λ_i^k    (27)
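A quick numerical check of Theorem 3.5 on an arbitrary random matrix (numpy; the eigenvalues may be complex, but their sum is real for a real matrix):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
eigvals = np.linalg.eigvals(A)

print(np.isclose(np.trace(A), eigvals.sum().real))                 # tr(A) = sum of eigenvalues
print(np.isclose(np.trace(A @ A @ A), (eigvals ** 3).sum().real))  # tr(A^3) = sum of cubed eigenvalues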

3.7 Power

Definition 3.4 (Integral Power). A^n is raising matrix A ∈ M_{n×n}(R) to the power of n. It is defined as the product of n copies of the same matrix A:

A^n = AA \cdots A    (28)

The matrix to the 0th power is defined to be the identity matrix, i.e. A^0 = I. The exponentiation of a non-square matrix is not well-defined; one reason is that the 0th power is undefined. Note that A^{-1} ≠ 1/A, as it denotes the matrix inverse.

Definition 3.5 (Square Root). Matrix B = A^{1/2} if and only if BB = A.

To compute the square root of an arbitrary square matrix, a method that involves the Jordan Normal Form (Section 7.9) can be used. We discuss the case when the matrix A is diagonalizable (Section 7.4), meaning there exist matrices V and D such that A = VDV^{-1}. The square root of A is R such that:

R = VSV^{-1}    (29)

where S is any square root of D. To verify,

RR = VS(V^{-1}V)SV^{-1} = VSSV^{-1} = VDV^{-1} = A    (30)

The square root of D is simply obtained by taking the square root of each entry along the diagonal. To raise a matrix A to an arbitrary real power p, we can follow

A^p = \exp(p \ln(A))    (31)

where \ln(A) is defined in Section 3.8 below.
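A numpy sketch of the diagonalizable case; the matrix below is symmetric positive definite, an assumption made so that a real square root exists:

import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # symmetric positive definite, hence diagonalizable

eigvals, V = np.linalg.eigh(A)    # A = V D V^{-1}, with D = diag(eigvals)
S = np.diag(np.sqrt(eigvals))     # a square root of D
R = V @ S @ np.linalg.inv(V)      # R = V S V^{-1}, Equation (29)

print(np.allclose(R @ R, A))      # RR = A, Equation (30)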

3.8 Exponential and Logarithm

Definition 3.6 (Exponential of Matrix). The exponential of matrix A is defined as

e^A = \sum_{n=0}^{\infty} \frac{A^n}{n!}    (32)

This is a generalization of the ordinary exponential function e^x, which is

e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}    (33)

Definition 3.7 (Logarithm of Matrix). Matrix B is the logarithm of matrix A if

\ln(A) = B    (34)

which is equivalent to e^B = A. The logarithm of A does not always exist; at the very least, A needs to be invertible, but this is not enough. For more, please refer to Wikipedia.

3.9 Conversion Between Matrix Notation and Summation

Outer products: Suppose x_i ∈ R^d, and X = [x_1, x_2, \cdots, x_n]^T. Then,

\sum_{i=1}^{n} x_i x_i^T = X^T X    (35)

To understand this intuitively, note that the vectors x_i appear as the rows of X. Then, recall from Equation 18 that the matrix multiplication AB can be viewed as the sum of outer products between column vectors of A and row vectors of B. Therefore, we need to transpose X and multiply it by itself, yielding X^T X.

Similarly, if y_i ∈ R^s and Y = [y_1, \cdots, y_n]^T, we have:

\sum_{i=1}^{n} x_i y_i^T = X^T Y    (36)

Examples:

• Conversion from the primal objective to the dual objective for kernel ridge regression. In ridge regression, with X ∈ R^{N×d}, y ∈ R^N, and data points x_i ∈ R^d (d features each), we can formulate the objective as:

\min_w \frac{1}{N}\sum_{i=1}^{N}\left(y_i - w^T x_i\right)^2 + \lambda w^T w    (37)

According to the Representer Theorem, w^* = \sum_{i=1}^{N} \alpha_i x_i gives the optimal weights. Thus, with \alpha ∈ R^N, the above can be transformed into the following (kernel ridge regression objective), where k(x_i, x_j) = \phi(x_i)^T\phi(x_j) is the kernel function:

\min_\alpha \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{N}\alpha_j x_j^T x_i\right)^2 + \lambda\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j x_i^T x_j    (38)

\Leftrightarrow \min_\alpha \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \sum_{j=1}^{N}\alpha_j k(x_j, x_i)\right)^2 + \lambda\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j k(x_i, x_j)    (39)

To transform Equation 39 into matrix notation, first let K ∈ R^{N×N} be the kernel matrix where K_{ij} = k(x_i, x_j). Then, we have:

\Leftrightarrow \min_\alpha \frac{1}{N}(y - K\alpha)^T(y - K\alpha) + \lambda\alpha^T K\alpha    (40)

\Leftrightarrow \min_\alpha \frac{1}{N}(\alpha^T K^T K\alpha - \alpha^T K^T y - y^T K\alpha + y^T y) + \lambda\alpha^T K\alpha    (41)

Because \alpha^T K^T y and y^T K\alpha are just (equal) scalars, we can write:

\Leftrightarrow \min_\alpha \frac{1}{N}(\alpha^T K^T K\alpha - 2\alpha^T K^T y + y^T y) + \lambda\alpha^T K\alpha    (42)

The matrix notation conversion of the ridge regularization term is important.
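A small numpy check of the conversions in Equations 35 and 36, with made-up data:

import numpy as np

rng = np.random.default_rng(2)
n, d, s = 6, 3, 2
X = rng.standard_normal((n, d))   # row i is x_i^T
Y = rng.standard_normal((n, s))   # row i is y_i^T

sum_xx = sum(np.outer(X[i], X[i]) for i in range(n))
sum_xy = sum(np.outer(X[i], Y[i]) for i in range(n))

print(np.allclose(sum_xx, X.T @ X))   # Equation (35)
print(np.allclose(sum_xy, X.T @ Y))   # Equation (36)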

Singular value decomposition: Given a matrix A ∈ R^{n×p}, how do you write its singular value decomposition A = USV^T using summation notation?

4 Vector Spaces

Definition 4.1 (Vector Space). A vector space V over a field, such as the real numbers R, is a set V with two functions:

addition + : V × V → V (e.g. v + w)    (43)
scalar multiplication · : R × V → V (e.g. av, a ∈ R)    (44)

that satisfy these axioms, for all v, w, u ∈ V and s, t ∈ R:

1) u + (v + w) = (u + v) + w (Associativity of addition)

2) u + v = v + u (Commutativity of addition)

3) There exists an element 0 ∈ V, called the zero vector, such that v + 0 = v for all v ∈ V. (Identity element of addition)

4) ··· For more, refer to Wikipedia's article on vector spaces.

Definition 4.2 (Subspace). A subspace is a subset of R^n that is a vector space with the multiplication and addition induced from R^n. For example, S ⊆ R^n is a vector subspace if for all v, w ∈ S, v + w ∈ S, and for all r ∈ R, v ∈ S, rv ∈ S. The latter implies 0 ∈ S. The set \{[a, b, 1]^T ∈ R^3 : a, b ∈ R\} is not a subspace.

Definition 4.3 (Null space). If A is an n × n matrix, the set of solutions to the system Ax = 0 is a subspace of R^n, called the null space of A, or null(A).

Proof. Suppose v, w are vectors in R^n that satisfy Av = Aw = 0. Then A(v + w) = Av + Aw = 0, and A(rv) = rAv = 0. Therefore, the set of solutions to Ax = 0 is closed both under addition and under scalar multiplication.

4.1 Determinant

Before we formally define the determinant, let us use det(A) to refer to the determinant of matrix A, which is a real value.

Definition 4.4 (Minor). If A ∈ M_{n×n}(R), define M_{ij} as the (n−1)×(n−1) matrix formed by deleting the i-th row and j-th column of A. det(M_{ij}) is called the minor of entry a_{ij} in A.

Definition 4.5 (Cofactor). If A ∈ M_{n×n}(R), the cofactor of a_{ij} is C_{ij} = (−1)^{i+j} det(M_{ij}).

Definition 4.6 (Singularity). A square matrix A that is invertible is called nonsingular. Otherwise, it is called singular or degenerate.

Theorem 4.1 (Singularity and Determinant). A square matrix is singular if and only if its determinant is 0.

Now, we formally introduce the determinant of a matrix.

Definition 4.7 (Determinant). Let A be the n × n matrix

\begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix}

The determinant of A is recursively defined as:

det(A) = |A| = a_{11}C_{11} + a_{12}C_{12} + \cdots + a_{1n}C_{1n}    (45)

And when n = 1, det(a_{11}) = a_{11} (base case). The above definition is recursive because the definition of the cofactor involves the determinant.

Geometric Meaning of Determinants: First, we focus on 2D. Suppose

A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \quad x_1 = \begin{bmatrix} a \\ c \end{bmatrix}, \quad x_2 = \begin{bmatrix} b \\ d \end{bmatrix}

We have det(A) = ad − bc. This is the signed area of the parallelogram formed by the vectors x_1 and x_2. In the 3D case, the determinant represents the signed volume of the parallelepiped formed by the three column vectors of the matrix.

Theorem 4.2 (Invertibility and Determinant). For A ∈ M_{n×n}(R), A is invertible if and only if det(A) ≠ 0.

In other words, the determinant of an n by n matrix A is 0 if and only if the rows are linearly dependent (and not zero if and only if they are linearly independent).

Properties of Determinants:

a) The determinant equals the product of the eigenvalues λ_i:

det(A) = \prod_i λ_i

b) det(cA) = c^n · det(A)

c) det(AB) = det(A)det(B)

d) det(A^{-1}) = \frac{1}{det(A)}

e) det(A^T) = det(A)

f) det(A^n) = det(A)^n

Cool Facts about Determinants1:

1) Interchanging any two rows of an n by n matrix A reverses the sign of its determinant.

2) If two rows of a matrix are equal, its determinant is 0. (Because det(A) = −det(A) implies det(A) = 0.)

3) If A is an n by n matrix, adding a multiple of one row to a different row does not affect its determinant.

4) An n by n matrix with a row of zeros has determinant zero.

4.2 Kernel

Definition 4.8 (Kernel). Suppose T : R^m → R^n is a linear transformation. The kernel of T is the set of vectors x such that T(x) = 0, denoted by ker(T). In other words,

ker(T) = \{x ∈ R^m \mid T(x) = 0\}    (46)

Theorem 4.3 (Kernel and Injectivity). Suppose T : R^m → R^n is a linear transformation. Then T is one-to-one if and only if ker(T) = \{0\}.

This is rather intuitive. T being one-to-one means T (x) = 0 has only the trivial solution which is x = 0. By definition of kernel, ker(T ) = {0}.

4.3 Basis

Definition 4.9 (Basis). A set B = {u1, ··· , um} is a basis for a subspace S if a) B spans S.

b) B is linearly independent.

1 Source: http://www.math.lsa.umich.edu/~hochster/419/det.html

For example, S = \{[1, 0]^T, [0, 1]^T\} is the standard basis of R^2.

To find a basis for S = span\{u_1, \cdots, u_m\}:

1. Use u_1, \cdots, u_m to form the rows of a matrix A.

2. Transform A into row echelon form B.

3. The nonzero rows of B give a basis for S.

4.4 Change of Basis

Definition 4.10 (Change of Basis). Suppose subspaces S_1, S_2 ⊂ R^n have bases B_1 = \{u_1, \cdots, u_m\} and B_2 = \{v_1, \cdots, v_m\}, respectively. Let A = [[a_1]_{B_1} \cdots [a_n]_{B_1}] be a matrix whose column vectors are relative to the basis B_1.² Then, to represent the column vectors of A with respect to B_2, we apply a change-of-basis matrix P_{B_1→B_2} from B_1 to B_2, such that

[a_i]_{B_2} = P_{B_1→B_2}[a_i]_{B_1}    (47)

To find the change-of-basis matrix from B_1 to B_2, notice first that the definitions of the (ordered) bases B_1 = \{u_1, \cdots, u_m\} and B_2 = \{v_1, \cdots, v_m\} involve vectors relative to the standard basis. For example, if B_1 = \{[3, 1]^T, [-1, 2]^T\}, the coordinates of the basis vectors are relative to the R^2 space, even though they are the "unit basis vectors" relative to the subspace spanned by B_1. That is, [u_1]_{B_1} = [1, 0]^T. Therefore, we can obtain the change-of-basis matrix from B_1 to S effortlessly, given by

P_{B_1→S} = [u_1 \cdots u_n]    (48)

because u_i = P_{B_1→S}[e_i]_{B_1}. The same goes for B_2. Therefore, we can easily obtain P_{B_1→S} and P_{B_2→S}. Thus, to change the basis from B_1 to B_2, we can first change to the standard basis, then change to B_2, summarized by:

P_{B_1→B_2} = P_{S→B_2}P_{B_1→S}    (49)
            = P_{B_2→S}^{-1}P_{B_1→S}    (50)

For an entire matrix A representing a transformation T : R^n → R^n, we can construct a matrix representing the same linear transformation within a different subspace B ⊂ R^n, say W : B → B, by leveraging the change-of-basis matrix P_{B→S}:

[A]_B = P_{B→S}^{-1}AP_{B→S}    (51)

² Usually we omit the subscript when denoting vectors, since by default the basis is the standard basis S.
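The following numpy sketch uses the basis B1 from the example above together with a made-up basis B2 to compute the change-of-basis matrix of Equation 50 and confirm that both coordinate vectors describe the same point:

import numpy as np

P_B1_to_S = np.array([[3.0, -1.0],
                      [1.0,  2.0]])   # columns are the B1 basis vectors
P_B2_to_S = np.array([[1.0,  1.0],
                      [0.0,  1.0]])   # columns of a made-up basis B2

# Equation (50): P_{B1->B2} = P_{B2->S}^{-1} P_{B1->S}
P_B1_to_B2 = np.linalg.inv(P_B2_to_S) @ P_B1_to_S

v_B1 = np.array([1.0, 0.0])           # coordinates of u1 relative to B1
v_B2 = P_B1_to_B2 @ v_B1              # coordinates of the same vector relative to B2

# Both coordinate vectors map to the same vector in the standard basis.
print(np.allclose(P_B2_to_S @ v_B2, P_B1_to_S @ v_B1))   # True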

4.5 Dimension, Row & Column Space, and Rank

Definition 4.11 (Dimension). Let S be a subspace of R^n. Then the dimension of S, denoted dim(S), is the number of vectors in any basis of S.

Definition 4.12 (Row Space, Column Space). Suppose A ∈ Mn×m(R). Then: • row(A) = span of rows of A (row space)

• col(A) = span of columns of A (column space)

row(A) ⊆ R^m, col(A) ⊆ R^n.

Theorem 4.4 (Bases for Row and Column Spaces). Let A be a matrix, and B be a row echelon form of that matrix. Then

a) The nonzero rows of B form a basis for row(A).

b) The columns of A corresponding to the pivot columns of B form a basis for col(A).

Theorem 4.5 (Dimensions of Row and Column Spaces Are Equal). The following is always true for a matrix A:

dim(col(A)) = dim(row(A))    (52)

Definition 4.13 (Rank). The rank of a matrix A is defined by:

rank(A) = dim(col(A)) = dim(row(A)) (53)

Definition 4.14 (Nullity). The nullity of A is dim(null(A)). Theorem 4.6 (Rank-Nullity Theorem). Let A be an n × m matrix. Then

rank(A) + nullity(A) = m (54)
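A quick numerical illustration of the Rank-Nullity Theorem with a made-up rank-deficient matrix (numpy):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])    # n = 2, m = 3; the second row is twice the first

rank = np.linalg.matrix_rank(A)    # dim(col(A)) = dim(row(A))
nullity = A.shape[1] - rank        # Equation (54): nullity(A) = m - rank(A)
print(rank, nullity)               # 1 2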

5 Eigen

Definition 5.1 (Eigenvector and Eigenvalue). Let A ∈ M_{n×n}(R). A nonzero vector u is an eigenvector of A if there exists a scalar λ such that Au = λu. The scalar λ is called the eigenvalue. The zero vector 0 is never an eigenvector.

Theorem 5.1 (Scaled Eigenvectors). Suppose A ∈ M_{n×n}(R), and u is an eigenvector with eigenvalue λ. Then for any r ≠ 0, r ∈ R, ru is another eigenvector with eigenvalue λ.

It is important to note that the theorem above does not imply that all eigenvectors with eigenvalue λ must be scalar multiples of one another. With this in mind, it is more intuitive to accept the following theorem.

Theorem 5.2 (Eigenspace). If A ∈ M_{n×n}(R), then the set of eigenvectors with eigenvalue λ, together with 0, is a subspace of R^n, called the eigenspace.

Theorem 5.3 (Condition for an Eigenvalue). Let A ∈ M_{n×n}(R). Then λ is an eigenvalue of A if and only if

det(A − λI_n) = 0.    (55)

We refer to det(A − λI_n) = 0 as the characteristic equation.

Definition 5.2 (Characteristic Polynomial). The characteristic polynomial of an n × n matrix A, char_A(λ), is the degree-n polynomial det(A − λI_n).

Caveat: Some linear maps have no real eigenvalues or eigenvectors, such as the rotation below:

\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}

The intuition of eigenvectors is to think of them as the axes of the corresponding linear transformation. The eigenvalue λ helps to know whether x is stretched or shrunk when multiplied by the matrix A (i.e. Ax).

5.1 Multiplicity of Eigenvalues

Definition 5.3 (Algebraic Multiplicity). The algebraic multiplicity of an eigenvalue α of A is the k in char_A(λ) = (α − λ)^k Q(λ), where Q(λ) is a polynomial with Q(α) ≠ 0.

For example, for char_A(λ) = −λ(λ−2)^2 = −(λ−0)(λ−2)^2, the eigenvalue λ = 0 has algebraic multiplicity 1, and λ = 2 has algebraic multiplicity 2.

Definition 5.4 (Geometric Multiplicity). The geometric multiplicity of an eigenvalue λ is the dimension of the eigenspace associated with λ, i.e. the number of linearly independent eigenvectors of that eigenvalue.

• 0 is an eigenvalue if A ∈ M_{n×n}(R) is singular (see Definition 4.6).

• Geometric multiplicity ≤ algebraic multiplicity (of an eigenvalue).

5.2 Eigendecomposition

Definition 5.5 (Eigendecomposition of a Matrix). Let A be an n × n matrix with n linearly independent eigenvectors u_i for i ∈ \{1, \cdots, n\}. Then we can perform an eigendecomposition of A as follows:

A = UΛU^{-1}    (56)

where U is an n × n matrix whose i-th column is the eigenvector u_i of A, and Λ is the diagonal matrix whose diagonal entries are the corresponding eigenvalues (i.e. Λ_{ii} = λ_i). This definition implies that A must be diagonalizable (Section 7.4). It is usually convenient to have U be an orthonormal matrix.
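A numpy sketch of Equation 56 on a small made-up matrix (its eigenvalues are distinct, so it has n linearly independent eigenvectors):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals, U = np.linalg.eig(A)     # columns of U are the eigenvectors u_i
Lam = np.diag(eigvals)            # Lambda with Lambda_ii = lambda_i

print(np.allclose(U @ Lam @ np.linalg.inv(U), A))   # A = U Lambda U^{-1}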

6 The Big Theorem

Theorem 6.1 (The Big Theorem). Let A = \{a_1, \cdots, a_n\} be a set of vectors in R^n, let A = [a_1 \cdots a_n] be an n × n matrix, and let T : R^n → R^n be given by T(x) = Ax. Then the following statements are equivalent:

a) A spans R^n

b) A is linearly independent (i.e. Ax = 0 has only the trivial solution)

c) A is a basis for R^n

d) Ax = b has a unique solution for all b ∈ R^n

e) T is onto (surjective)

f) T is one-to-one (injective)

g) A is an invertible matrix

h) ker(T) = \{0\}

i) col(A) = R^n

j) row(A) = R^n

k) rank(A) = n

l) det(A) ≠ 0

m) λ = 0 is not an eigenvalue of A

7 Special Matrices

7.1 Block Matrix

Definition 7.1 (Block Matrix). A block matrix M is defined as

M = \begin{bmatrix} A & B \\ C & D \end{bmatrix}

where A, B, C, D are matrices (or block matrices) themselves. Block matrices share many useful properties with ordinary matrices, by treating the block entries as entries. For example:

M^2 = \begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} A & B \\ C & D \end{bmatrix}    (57)
    = \begin{bmatrix} A^2 + BC & AB + BD \\ CA + DC & CB + D^2 \end{bmatrix}    (58)

7.2 Orthogonal

Definition 7.2 (Orthogonal Matrix). An orthogonal matrix Q is a square matrix with real entries whose columns and rows are orthogonal unit vectors (i.e., orthonormal vectors), i.e.

Q^T Q = QQ^T = I    (59)

Therefore, we have Q^T = Q^{-1}. To fully understand why Equation 59 holds, we need to know that for two orthogonal vectors u_1 and u_2, u_1^T u_2 = 0, and u_1^T u_1 = |u_1|^2 = 1. Therefore, in the resulting matrix, all entries are 0 except for the ones along the diagonal.

7.3 Diagonal

Definition 7.3 (Diagonal Matrix). A square matrix D is a diagonal matrix if all entries except those along the main diagonal are 0.

Simple fact: for two diagonal matrices D_1 and D_2, their product D_1D_2 = D_3 is also a diagonal matrix, with each entry D_3[i]³ along the main diagonal equal to D_1[i]D_2[i]. A diagonal matrix is invertible if and only if all of its diagonal entries are nonzero; in that case the inverse D^{-1} of a diagonal matrix D has entries D^{-1}[i] = 1/D[i].

Another fact: The determinant of a diagonal matrix is the product of its diagonal entries.

Yet another fact: The standard basis vectors are eigenvectors of a diagonal matrix D, and each diagonal entry is the eigenvalue for the eigenvector at the corresponding column, that is

D = \begin{bmatrix} λ_1 & & & \\ & λ_2 & & \\ & & \ddots & \\ & & & λ_n \end{bmatrix}    (60)

This can be verified simply by solving the characteristic polynomial det(D − λI) = 0.

7.4 Diagonalizable

Definition 7.4 (Diagonalizable Matrix). An n × n matrix A is diagonalizable if there exists an n × n invertible matrix P such that

D = P^{-1}AP    (61)

where D is a diagonal matrix.

Note that D = P^{-1}AP ⟹ A = PDP^{-1}.

³ The [i] just means the i-th entry along the main diagonal.

Theorem 7.1 (The Diagonalization Theorem).

a) An n × n matrix A is diagonalizable if and only if A has n linearly independent eigenvectors.

b) A = PDP^{-1}, where D is a diagonal matrix, if and only if the n columns of P are linearly independent eigenvectors of A and the diagonal entries of D are their corresponding eigenvalues.

If we can find n linearly independent eigenvectors for an n × n matrix A, then we know the matrix is diagonalizable. Furthermore, we can use those eigenvectors and their corresponding eigenvalues to find the invertible matrix P and diagonal matrix D necessary to show that A is diagonalizable.

Theorem 7.2 (Power of Diagonalizable Matrix). If A = PDP^{-1}, then A^k = PD^kP^{-1}.

7.5 Symmetric

Definition 7.5 (Symmetric Matrix). A square matrix A is symmetric if and only if

A = A^T    (62)

For any n × m matrix B, the matrix B^T B ∈ R^{m×m} is symmetric. Also, every square diagonal matrix is symmetric.

Facts about symmetric matrices:

• Any symmetric matrix:
  – has only real eigenvalues;
  – is always diagonalizable;
  – has orthogonal eigenvectors.

• A symmetric matrix A is:
  – positive definite if all its eigenvalues are positive;
  – positive semidefinite if all its eigenvalues are nonnegative.

7.6 Positive-Definite

We omit the discussion of complex matrices for now.

Definition 7.6 (Positive-Definite). A symmetric n × n real matrix A is positive definite if for all x ∈ R^n \setminus \{0\},

x^T Ax > 0    (63)

The negative definite, positive semi-definite, and negative semi-definite matrices are defined analogously. For the "semi-" versions, zero is allowed (e.g. A is positive semi-definite implies x^T Ax ≥ 0).

Theorem 7.3. Every covariance matrix is positive semi-definite.

Given a random vector X ∈ R^d, its covariance matrix Σ is computed by the following:

Σ = E[(X − E[X])(X − E[X])^T]    (64)

For nonzero y ∈ R^d,

y^T Σ y = y^T E[(X − E[X])(X − E[X])^T] y    (65)
        = E[y^T (X − E[X])(X − E[X])^T y]    (66)
        = E[Q^T Q]    (67)

for Q = (X − E[X])^T y. Therefore, y^T Σ y ≥ 0, which means Σ is positive semi-definite.

7.7 Singular Value Decomposition

Theorem 7.4. For any given real matrix A ∈ R^{n×p}, there exist matrices U, S, V such that

A = USV^T    (68)

where U ∈ R^{n×n}, S ∈ R^{n×p}, and V ∈ R^{p×p}, with U^T U = I and V^T V = I. This is called the singular value decomposition of A.

U and V are orthonormal matrices. S is a diagonal matrix.⁴ The elements of S are called the singular values of A. The eigenvectors of A^T A are the columns of V, and the eigenvectors of AA^T are the columns of U. The entries of S are nonnegative and sorted in decreasing order (S_{11} ≥ S_{22} ≥ ···).
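A numpy sketch of Theorem 7.4, with a random matrix chosen only for illustration:

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))             # n x p

U, s, Vt = np.linalg.svd(A)                 # s holds the singular values in decreasing order
S = np.zeros(A.shape)
S[:len(s), :len(s)] = np.diag(s)            # rectangular diagonal S

print(np.allclose(U @ S @ Vt, A))           # A = U S V^T
print(np.allclose(Vt @ Vt.T, np.eye(3)))    # V^T V = I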

7.8 Similar

Definition 7.7 (Similar). An n × n matrix A is similar to an n × n matrix B (denoted as A ∼ B) if there exists an invertible matrix P such that P^{-1}AP = B.

The intuition behind matrix similarity is that the two matrices represent the same linear operator with respect to (possibly) different bases. The matrix P can be viewed as a change-of-basis matrix.

Theorem 7.5. All similar matrices have the same eigenvalues.

⁴ More precisely, it is a rectangular diagonal matrix because n may not equal p. Still, S_{ij} = 0 if i ≠ j.

Proof. Suppose A ∼ B with B = P^{-1}AP, and Bx = λx. Then P^{-1}APx = λx, so APx = λPx, which means A also has eigenvalue λ (but with eigenvector Px).

Note that some matrices have the same eigenvalues, but they are not similar.

Theorem 7.6 (Determine Matrix Similarity5). Two square matrices are similar if and only if they have the same Jordan normal form.

There are multiple corollaries from this theorem in the reference.

7.9 Jordan Normal Form

Definition 7.8 (Jordan Normal Form⁶). The Jordan normal form of a linear transformation T : V → V is a special type of block matrix in which each block is a Jordan block with possibly differing constants λ_i. In particular, it is a block matrix of the form:

J = \begin{bmatrix} J_1 & 0 & \cdots & 0 \\ 0 & J_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & J_p \end{bmatrix}    (69)

where every J_k is a block matrix (Jordan block), defined as:

J_k = \begin{bmatrix} λ_k & 1 & 0 & \cdots & 0 \\ 0 & λ_k & 1 & \ddots & 0 \\ 0 & 0 & λ_k & \ddots & 0 \\ \vdots & \vdots & \vdots & \ddots & 1 \\ 0 & 0 & 0 & \cdots & λ_k \end{bmatrix}    (70)

7.10 Hermitian

Definition 7.9. A Hermitian matrix (or self-adjoint matrix) A is a complex square matrix that is equal to its own conjugate transpose, namely,

A = A^H    (71)

For properties, refer to Wikipedia.

⁵ Reference: http://kom.aau.dk/~jakob/selPubl/papers1995/ijmest_1995.pdf
⁶ Reference: http://mathworld.wolfram.com/JordanCanonicalForm.html. Jordan normal form is named after French mathematician Camille Jordan.

7.11 Discrete Fourier Transform

The N × N DFT matrix is given by

F_N = [e_{nk}]

where n = 0, 1, \cdots, N − 1, k = 0, 1, \cdots, N − 1, and e_{nk} = e^{-inx_k} = e^{-i2\pi nk/N}. Define

w = e^{-i2\pi/N}    (72)

so that

F_N = [w^{nk}]    (73)

(for example, when N = 8, w = e^{-i\pi/4} = \frac{1-i}{\sqrt{2}}). The inverse N × N DFT matrix is

F_N^{-1}    (74)

where

F_N^{-1} = \frac{1}{N}[\bar{w}^{nk}]    (75)

More details will be in my review of Fourier Series and Fourier Transform.
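A numpy sketch that builds F_N explicitly (N = 8 here, matching the example value of w above, which is an assumption) and compares it with numpy's FFT:

import numpy as np

N = 8
n, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
F = np.exp(-2j * np.pi * n * k / N)                 # F_N = [e^{-i 2 pi n k / N}]

x = np.random.default_rng(4).standard_normal(N)
print(np.allclose(F @ x, np.fft.fft(x)))            # the matrix form matches np.fft.fft
print(np.allclose(np.linalg.inv(F), F.conj() / N))  # F_N^{-1} = (1/N) * conjugate of F_N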

8 Matrix Calculus

This section references [link 1] and [link 2]. Suppose x ∈ R^n and y ∈ R^m, and that y and x are related through a function ψ,

y = ψ(x) = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}    (76)

where each y_i ∈ R, a scalar, is produced by a function of x (i.e. a function of a vector).

8.1 Differentiation

Vector to Vector: The derivative of vector y with respect to vector x is an n × m matrix:

\frac{\partial y}{\partial x} \equiv \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}    (77)

Scalar to Vector: The derivative of scalar y with respect to vector x is a column vector:

\frac{\partial y}{\partial x} \equiv \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix}    (78)

Vector to Scalar: The derivative of vector y with respect to scalar x is a row vector:

\frac{\partial y}{\partial x} \equiv \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \cdots & \frac{\partial y_m}{\partial x} \end{bmatrix}    (79)

8.2 Jacobian

Definition 8.1. For x, y ∈ R^n, the determinant of the square matrix \frac{\partial x}{\partial y},

J = \left|\frac{\partial x}{\partial y}\right|    (80)

is called the Jacobian of the transformation determined by y = ψ(x). Example D.2 of [link 1] is a good example of how to calculate the Jacobian.

8.3 The Chain Rule

Definition 8.2 (Chain Rule). Suppose we have

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_r \end{bmatrix} \quad z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_m \end{bmatrix}    (81)

where z is a function of y, which is a function of x. Then, we have

\frac{\partial z}{\partial x} = \frac{\partial y}{\partial x}\frac{\partial z}{\partial y}    (82)

9 Algorithms

9.1 Gauss-Seidel Method

Below is my Python implementation of the Gauss-Seidel method, also known as the Liebmann method or the method of successive displacement, which is an iterative method used to solve a linear system of equations Ax = b.

import numpy as np

def gauss_seidel(A, b, x_0, err, N):
    """Approximates the solution of Ax = b, starting from x_0,
    with relative tolerance err and at most N iterations."""
    def sigma(aj, x, start, end):
        return sum(aj[k] * x[k] for k in range(start, end))

    n = A.shape[0]
    x_m = x_0
    for m in range(N):
        x_mp1 = np.zeros(n)
        for j in range(n):
            # Use already-updated entries (x_mp1) for indices < j,
            # and the previous iterate (x_m) for indices > j.
            x_mp1[j] = 1 / A[j, j] * (b[j] - sigma(A[j], x_mp1, 0, j)
                                      - sigma(A[j], x_m, j + 1, n))
        j = np.argmax(np.abs(x_mp1 - x_m))
        if np.max(np.abs(x_mp1 - x_m)) < err * x_mp1[j]:
            return x_mp1
        x_m = x_mp1
        print(x_m)
    print("No solution satisfying tolerance condition after %d iterations." % N)
    return None
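A small usage sketch (the system below is made up; A is strictly diagonally dominant, which guarantees that the Gauss-Seidel iteration converges):

import numpy as np

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 5.0, 2.0],
              [0.0, 1.0, 3.0]])
b = np.array([6.0, 8.0, 4.0])

x = gauss_seidel(A, b, x_0=np.zeros(3), err=1e-8, N=100)
print(np.allclose(A @ x, b))    # True; the exact solution is x = (1, 1, 1)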
