<<

Linear algebra

Kevin P. Murphy Last updated January 25, 2008

1 Introduction Linear algebra is the study of matrices and vectors. Matrices can be used to represent linear functions1 as well as systems of linear equations. For example,

4x 5x = 13 1 2 (2) 2x −+ 3x =− 9 − 1 2 can be represented more compactly by Ax = b (3) where 4 5 13 A = , b = (4) 2− 3 9  −   −  Linear algebra also forms the the basis of many machine learning methods, such as , PCA, Kalman filtering, etc. Note: Much of this chapter is based on notes written by Zico Kolter, and is used with his permission. 2 Basic notation We use the following notation:

m n By A R × we denote a with m rows and n columns, where the entries of A are real numbers. • ∈ By x Rn, we denote a vector with n entries. Usually a vector x will denote a column vector — i.e., a matrix • with n∈rows and 1 column. If we want to explicitly represent a row vector —a matrixwith 1 rowand n columns — we typically write xT (here xT denotes the transpose of x, which we will define shortly). The ith element of a vector x is denoted x : • i x1 x2 x =  .  . .    x   n    We use the notation a (or A , A , etc) to denote the entry of A in the ith row and jth column: • ij ij i,j a a a 11 12 · · · 1n a21 a22 a2n A =  . . ·. · · .  ......    a a a   m1 m2 · · · mn    1A function f : Rm Rn is called linear if → f(c1x + c2y) = c1f(x)+ c2f(y) (1) for all scalars c1,c2 and vectors x, y. Hence we can predict the output of a linear function in terms of its response to simple inputs.

1 We denote the jth column of A by a or A : • j :,j

A = a|a |a | .  1 2 · · · n  | | |   We denote the ith row of A by aT or A : • i i,: T — a1 — T — a2 — A =  .  . .    — aT —   m   

A is a matrixwhere all non-diagonalelementsare 0. Thisis typically denoted D = diag(d1, d2,...,dn), • with d1 d2 D =  .  (5) ..    d   n   n n The , denoted I R × , is a square matrix with ones on the diagonal and zeros everywhere • ∈ m n else, I = diag(1, 1,..., 1). It has the property that for all A R × , ∈ AI = A = IA (6)

where the size of I is determined by the dimensions of A so that is possible. (We define matrix multiplication below.) A block diagonal matrix is one which contains matrices on its main diagonal, and is 0 everywhere else, e.g., • A 0 (7) 0 B   The unit vector e is a vector of all 0’s, except entry i, which has value 1: • i T ei = (0,..., 0, 1, 0,..., 0) (8)

The vector of all ones is denoted 1. The vector of all zeros is denoted 0.

3 Matrix Multiplication m n n p The product of two matrices A R × and B R × is the matrix ∈ ∈ m p C = AB R × , (9) ∈ where n Cij = AikBkj . (10) Xk=1 Note that in order for the matrix product to exist, the number of columns in A must equal the number of rows in B. There are many ways of looking at matrix multiplication, and we’ll start by examining a few special cases.

2 3.1 Vector-Vector Products Given two vectors x, y Rn, the quantity xT y, called the inner product, dot product or scalar product of the vectors, is a real number∈ given by n xT y = x, y R = x y . (11) h i ∈ i i i=1 X Note that it is always the case that xT y = yT x. Given vectors x Rm, y Rn (they no longer have to be the same size), xyT is called the outer product of the ∈ ∈ T vectors. It is a matrix whose entries are given by (xy )ij = xiyj, i.e.,

x y x y x y 1 1 1 2 · · · 1 n x2y1 x2y2 x2yn T m n xy R × =  . . ·. · · .  . (12) ∈ . . .. .    x y x y x y   m 1 m 2 · · · m n    3.2 Matrix-Vector Products m n n m Given a matrix A R × and a vector x R , their product is a vector y = Ax R . There are a couple ways of looking at matrix-vector∈ multiplication, and∈ we will look at them both. ∈ If we write A by rows, then we can express Ax as,

T T — a1 — a1 x T T — a2 — a2 x y =  .  x =  .  . (13) . .      — aT —   aT x   m   m      T In other words, the ith entry of y is equal to the inner product of the ith row of A and x, yi = ai x. In Matlab notation, we have y = [A(1,:)*x1; ...; A(m,:)*xn] Alternatively, let’s write A in column form. In this case we see that,

x1 | | | x2 y = a1 a2 an  .  = a1 x1 + a2 x2 + . . . + an xn . (14)  · · ·  .         | | |  x     n          In other words, y is a of the columns of A, where the coefficients of the linear combination are given by the entries of x. In Matlab notation, we have y = A(:,1)*x1 + ...+ A(:,n)*xn So far we have been multiplying on the right by a column vector, but it is also possible to multiply on the left by a T T m n m n T row vector. This is written, y = x A for A R × , x R , and y R . As before, we can express y in two obvious ways, depending on whether we express∈ A in terms∈ on its rows or∈ columns. In the first case we express A in terms of its columns, which gives

yT = xT a|a |a | = xT a xT a xT a (15)  1 2 · · · n  1 2 · · · n | | |     which demonstrates that the ith entry of yT is equal to the inner product of x and the ith column of A.

3 Finally, expressing A in terms of rows we get the final representation of the vector-matrix product,

T — a1 — — aT — T 2 y = x1 x2 xn  .  (16) · · · .      — aT —   m    (17) T T T = x1 — a1 — + x2 — a2 — + ... + xn — an — (18) so we see that yT is a linear combination of therows of A, where the coefficients for the linear combination are given by the entries of x. 3.3 Matrix-Matrix Products Armed with this knowledge, we can now look at four different (but, of course, equivalent) ways of viewing the matrix-matrix multiplication C = AB as defined at the beginning of this section. First we can view matrix-matrix multiplication as a set of vector-vector products. The most obvious viewpoint, which follows immediately from the definition, is that the i, j entry of C is equal to the innerproductofthe ithrow of A and the jth rowof B. Symbolically, this looks like the following,

T T T T — a1 — a1 b1 a1 b2 a1 bp — aT — aT b aT b · · · aT b 2 | | | 2 1 2 2 · · · 2 p C = AB =  .  b1 b2 bp =  . . . .  . (19) .  · · ·  . . .. .      — aT —  | | |  aT b aT b aT b   m     m 1 m 2 · · · m p      m n n p n n Remember that since A R × and B R × , ai R and bj R , so these inner products all make sense. This is the most “natural” representation∈ when∈ we represent∈ A by rows and∈ B by columns. A special case of this result arises in statistical applications where A = X and B = XT , where X is the n d design matrix, whose rows are the cases. In this case, XXT is an n n matrix of inner products called× the : × xT x xT x 1 1 · · · 1 n T .. XX =  .  (20) xT x xT x  n 1 · · · n n Alternatively, we can represent A by columns, and B by rows, which leads to the interpretation of AB as a sum of outer products. Symbolically,

T — b1 — — bT — n | | |  2  T C = AB = a1 a2 an . = aibi . (21)  · · ·  .   i=1 | | |  — bT —  X    n    Put another way, AB is equal to the sum, over all i, of the outer product of the ith column of A and the ith row of B. Rm Rp T Since, in this case, ai and bi , the dimension of the outer product aibi is m p, which coincides with the dimension of C. ∈ ∈ × If A = XT and B = X, where X is the n d design matrix,wegeta d d matrix which is proportional to the empirical of the data (assuming× it has been centered): ×

x2 x x x x n i,1 i,1 i,2 · · · i,1 i,d T T .. X X = xixi =  .  (22) i=1 i x x x x x2 X X  i,d i,1 i,d i,2 · · · i,d   

4 Second, we can also view matrix-matrix multiplication as a set of matrix-vector products. Specifically, if we represent B by columns, we can view the columns of C as matrix-vector products between A and the columns of B. Symbolically, | | | | | | C = AB = A b b b = Ab Ab Ab . (23)  1 2 · · · p   1 2 · · · p  | | | | | |     Here the ith column of C is given by the matrix-vector product with the vector on the right, ci = Abi. These matrix- vector products can in turn be interpreted using both viewpoints given in the previous subsection. Finally, we have the analogous viewpoint, where we represent A by rows, and view the rows of C as the matrix-vector product between the rows of A and C. Symbolically,

T T — a1 — — a1 B — T T — a2 — — a2 B — C = AB =  .  B =  .  . (24) . .      — aT —   — aT B —   m   m      T T Here the ith row of C is given by the matrix-vector product with the vector on the left, ci = ai B. It may seem like overkill to dissect matrix multiplication to such a large degree, especially when all these view- points follow immediately from the initial definition we gave (in about a line of math) at the beginning of this section. However, virtually all of linear algebra deals with matrix multiplications of some kind, and it is worthwhile to spend some time trying to develop an intuitive understanding of the viewpoints presented here. In addition to this, it is useful to know a few basic properties of matrix multiplication at a higher level:

Matrix multiplication is associative: (AB)C = A(BC). • Matrix multiplication is distributive: A(B + C)= AB + AC. • Matrix multiplication is, in general, not commutative; that is, it can be the case that AB = BA. • 6 3.4 Multiplication by diagonal matrices Sometimes either A or B are diagonal matrices. If we post multiply A by a diagonal matrix B = diag(s), then we just scale each column by the corresponding scale factor in s. In Matlab notation

A * diag(s) = [s1*A(:,1), ..., sn*A(:,n)] A special case of this is if A has just one row, call it aT . Then we are just scaling each element of a by the corre- sponding entry in s. Hence in Matlab notation

a' * diag(s) = [s1*a1, ..., sn*an] = a .* s Similarly, if we pre multiply B by a diagonal matrix A = diag(s), then we just scale each row by the corresponding scale factor in s. In Matlab notation

diag(s) * B = [s1*B(1,:); ...; sm*B(m,:)] A special case of this is if B has just one column, call it b. Then we are just scaling each element of v by the corresponding entry in s. Hence in Matlab notation

diag(s) * v = [s1*v1; ...; sm * vm] = s .* v

5 4 Operations on matrices 4.1 The transpose m n The transpose of a matrix results from “flipping” the rows and columns. Given a matrix A R × , is transpose, written AT , is defined as ∈ T n m T A R × , (A ) = A . (25) ∈ ij ji We have in fact already been using the transpose when describing row vectors, since the transpose of a column vector is naturally a row vector. The following properties of transposes are easily verified:

(AT )T = A • (AB)T = BT AT • (A + B)T = AT + BT • 4.2 The trace of a square matrix n n The trace of a square matrix A R × , denoted tr(A) (or just trA if the parentheses are obviously implied), is the sum of diagonal elements in the∈ matrix: n trA = Aii. (26) i=1 X The trace has the following properties:

n n T For A R × , trA = trA . • ∈ n n For A, B R × , tr(A + B)=trA + trB. • ∈ n n For A R × , t R, tr(tA)= t trA. • ∈ ∈ For A, B such that AB is square, trAB = trBA. • For A, B, C such that ABC is square, • trABC = trBCA = trCAB (27)

and so on for the productof more matrices. This is called the cyclic permutation property of the trace operator.

from the cyclic permutation property, we can derive the trace trick, which rewrites the scalar inner product xT Ax as follows xT Ax = tr(xT Ax)= tr(xxT A) (28) We shall use this result later in the book. The trace of two matrices, tr(AB), is the sum of a vector which contains the inner product of the rows of A and the columns of B. In Matlab notation we have

trace(A B) = sum([a(1,:)*b(:,1), ..., a(n,:)*b(:,n)]) Thus tr(AB) is a way to define the distance between two matrices.

6 4.3 Adjugate (classical adjoint) of a square matrix Below we introduce some rather abstract material that will be necessary for defining matrix determinants and inverse. However, it can be skipped without loss of continuity. This subsection is based on the wikipedia entry.2 Let A be an n n matrix. Let M˜ be the matrix obtained by deletring row i and column j from A. Define × ij M = det M˜ to be the (i, j)’th of A. Define C = 1i+jM to be the (i, j)’th cofactor of A. Finally, ij ij ij − ij define the adjugate or classical adjoint of A to be adj(A) = CT . Thus the adjoint stores the cofactors, but in transposed order. Let us consider a 2 2 example. × a b d b C C adj = = 11 21 (29) c d c− a C C   −   12 22 Now let A be the following matrix: A11 A12 A13 A = A A A (30)  21 22 23 A31 A32 A33 Then its adjugate is   A A A A A A + 22 23 12 13 + 12 13 A32 A33 − A32 A33 A22 A23  

 A21 A23 A11 A13 A11 A13  adj(A)=  +  (31) − A31 A33 A31 A33 − A21 A23         A21 A22 A11 A12 A11 A12  + +   A31 A32 − A31 A32 A21 A22      For example, entry (2,1) is obtained by “crossing out” row 1 and column 2 from A, and computing -1 times the determinant of the resulting submatrix:

A21 A23 adj(A)2,1 = det (32) − A31 A33   4.4 Determinant of a square matrix A square n n matrix A can be viewed as a linear transformation that scales and rotates a vector. The determinant of a square× matrix det(A) = A is a measure of how much it changes a unit volume. The determinant operator | | n n satisfies these properties, where A, B R × ∈ A = AT (33) | | | | AB = A B (34) | | | || | A =0 A issingular (35) | | 1 ⇐⇒ A − = 1/ A (36) | | | | We can define the determinantin terms of an expansionof its cofactors (see Section 4.3) along any row i or column j as follows: n n det A = aij Cij = aij Cij (37) i=1 j=1 X X This is called Laplace’s expansion of the determinant. We see that the determinant is given by the inner product of any row i of A with the corresponding row i of C. Hence

A adj(A) = det(A) I (38)

2http://en.wikipedia.org/wiki/Adjugate_matrix.

7 We will return to this important formula below. We now give some examples. Consider a 2 2 matrix. From Equation 29, and summing along the rows for column 1, we have, × a b det = aC + cC = ad cb (39) c d 11 21 −   Alternatively, summing along the columns for row 2 we have

a b det = cC + dC = cb + da (40) c d 21 22 −   Now consider a 3 3 matrix. From Equation 31, and summing along the rows for column 1, we have ×

A22 A23 A21 A23 A21 A22 det(A) = a11 a12 + a13 (41) A32 A33 − A31 A33 A31 A32

a a a + a a a + a a a = 11 22 33 12 23 31 13 21 32 (42) a a a a a a a a a − 11 23 32 − 12 21 33 − 13 22 31 Note that the definition of the cofactors involves a determinant. We can make this explicit by giving the general (recursive) formula:

n A = ( 1)i+j a M˜ (for any j 1,...,n) | | − ij | i,j | ∈ i=1 Xn = ( 1)i+j a M˜ (for any i 1,...,n) − ij | i,j | ∈ j=1 X

where M˜ i,j is the minor matrix obtained by deleting row i and column j from A. The base case is that A = a11 1 1 n n | | for A R × . If we were to expand this formula completely for A R × , there would be a total of n! (n factorial) different∈ terms. For this reason, we hardly even explicitly write the complete∈ equation of the determinant for matrices bigger than 3 3. Instead, we will see below that there are various numerical methods for computing A . × | | 4.5 The inverse of a square matrix n n 1 The inverse of a square matrix A R × is denoted A− , and is the unique matrix such that ∈ 1 1 A− A = I = AA− . (43)

1 Note that A− only exists if det(A) > 0. If det(A)=0, it is called a singular matrix. n n The following are properties of the inverse; all assume that A, B R × are non-singular: ∈ 1 1 (A− )− = A • 1 1 If Ax = b, we can multiply by A− on both sides to obtain x = A− b. This demonstrates the inverse with • respect to the original system of linear equalities we began this review with.

1 1 1 (AB)− = B− A− • 1 T T 1 T (A− ) = (A )− . For this reason this matrix is often denoted A− . • From Equation 38, we have that if A is non-singular, then

1 1 A− = adj(A) (44) A | | While this is a nice “explicit” formula for the inverse of matrix, we should note that, numerically, there are in fact much more efficient ways of computing the inverse. In Matlab, we just type inv(A).

8 1 For the case of a 2 2 matrix, the expression for A− is simple enough to give explicitly. We have × a b 1 1 d b A = , A− = − (45) c d A c a   | | −  For a block diagonal matrix, the inverse is obtained by simply inverting each block separately, e.g.,

1 1 A 0 − A− 0 = 1 (46) 0 B 0 B−     4.6 Schur complements and the matrix inversion lemma (This section is based on [Jor07, ch13]. We will use these results in Section ??, when deriving equations for inferring marginals and conditionals of multivariate Gaussians. We will also use special cases of these results (such as the matrix inversion lemma) in many other places throughout the book. Consider a general partitioned matrix E F M = (47) G H   1 where we assume E and H are invertible. The goal is to derive an expression for M− . If we could block diagonalize M, it would be easier to invert. To zero out the top right block of M we can pre-multiply as follows I FH 1 E F E FH 1G 0 − = − (48) 0− I G H − G H       Similarly, to zero out the bottom right we can post-multiply as follows 1 1 E FH− G 0 I 0 E FH− G 0 − 1 = − (49) G H H− G I 0 H   −    Putting it all together we get 1 1 I FH− E F I 0 E FH− G 0 − 1 = − (50) 0 I G H H− G I 0 H     −    X M Z W Taking determinants we| get {z } | {z } | {z } | {z } 1 X M Z = M = W = E FH− G H (51) | || || | | | | | | − || | Let us define Schur complement of M wrt H as 1 M/H = E FH− G (52) − Then Equation 51 becomes M = M/H H (53) | | | || | So we can see that M/H acts somewhat like a division operator. We can derive the inverse of M as follows 1 1 1 1 Z− M− X− = W− (54) 1 1 M− = ZW− X (55) hence 1 1 1 E F − I 0 (M/H)− 0 I FH− = 1 1 − (56) G H H− G I 0 H− 0 I   −      1 1 (M/H)− 0 I FH− = 1 1 1 − (57) H− G(M/H)− H− 0 I −    1 1 1 (M/H)− (M/H)− FH− = 1 1 1 − 1 1 1 (58) H− G(M/H)− H− + H− G(M/H)− FH− − 

9 1 Alternatively, we could have decomposed the matrix M in terms of E and M/E = (H GE− F), yielding − 1 1 1 1 1 1 1 E F − E− + E− F(M/E)− GE− E− F(M/E)− = 1 1 1 (59) G H (M/E)− GE− (M/E)−    −  Equating the top left block of these two expression yields the following formula:

1 1 1 1 1 (M/H)− = E− + E− F(M/E)− GE− (60) (61)

Simplifiying this leads to the widely used matrix inversion lemma or the Sherman-Morrison-Woodbury formula:

1 1 1 1 1 1 1 (E FH− G)− = E− + E− F(H GE− F)− GE− (62) − − In the special case that H = 1 a scalar, F = u a column vector, G = vT a row vector, we get the following formula for the rank one update of an− inverse matrix

T 1 1 1 T 1 1 T 1 (E + uv )− = E− + E− u( 1 v E− u)− v E− (63) − − 1 T 1 1 E− uv E− = E− 1 (64) − 1+ vT E− u We can derive another useful formula by equating the top right blocks of Equations 58 and 59 to get

1 1 1 1 (M/H)− FH− = E− F(M/E)− (65) − 1 1 1 1 1 1 (E FH− G)− FH− = E− F(H GE− F)− (66) − − 5 Special kinds of matrices 5.1 Symmetric Matrices

A square matrix is called Hermitian if A = A∗, where A∗ is the transpose of the complex conjugate of A. A matrix is symmetric if it is real and Hermitian, and hence A = AT . In the rest of this section, we shall assume A is real. T n n T A matrix is anti-symmetric if A = A . It is easy to show that for any matrix A R × , the matrix A + A T − ∈ n n is symmetric and the matrix A A is anti-symmetric. From this it follows that any square matrix A R × can be represented as a sum of a symmetric− matrix and an anti-, since ∈ 1 1 A = (A + AT )+ (A AT ) (67) 2 2 − and the first matrix on the right is symmetric, while the second is anti-symmetric. It turns out that symmetric matrices occur a great deal in practice, and they have many nice properties which we will look at shortly. It is common to denote the set of all symmetric matrices of size n as Sn, so that A Sn means that A is a symmetric n n matrix. ∈ × 5.2 Orthogonal Matrices n T n Two vectors x, y R are orthogonal if x y = 0. A vector x R is normalized if x 2 = 1. A square matrix n n ∈ ∈ k k U R × is orthogonal (note the different meaning of the term orthogonal when talking about vectors versus matrices)∈ if all its columns are orthogonal to each other and are normalized (the columns are then referred to as being orthonormal). It follows immediately from the definition of orthogonality and normality that

UT U = I = UUT . (68)

In other words, the inverse of an is its transpose. Note that if U is not square — i.e., U m n T T ∈ R × , n

10 Another nice property of orthogonal matrices is that operating on a vector with an orthogonal matrix will not change its Euclidean norm, i.e., Ux = x (69) k k2 k k2 n n n for any x R , U R × orthogonal. Similarly, one can show that the angle between two vectors is preserved after they are transformed∈ ∈ by an orthogonalmatrix. In summary, transformations by orthogonal matrices are generalizations of rotations and reflections, since they preserve lengths and angles. Computationally, orthogonal matrices are very desirable because they do not magnify roundoff or other kinds of errors. An example of an orthogonal matrix is a . A rotation in 3d by angle α about the z axis is given by

cos(α) sin(α) 0 − R(α)= sin(α) cos(α) 0 (70)  0 0 1   If α = 45◦, this becomes 1 1 0 √2 − √2 R(45) = 1 1 0 (71)  √2 √2  0 0 1   where 1 =0.7071. We see that R( α)= RT , so this is an orthogonal matrix. √2 − 5.3 Positive definite Matrices n n T Given a matrix square A R × and a vector x R, the scalar value x Ax is called a quadratic form. Written explicitly, we see that ∈ ∈ n n T x Ax = Aij xixj . (72) i=1 j=1 X X Note that, 1 1 xT Ax = (xT Ax)T = xT AT x = xT ( A + AT )x (73) 2 2 i.e., only the symmetric part of A contributes to the quadratic form. For this reason, we often implicitly assume that the matrices appearing in a quadratic form are symmetric. We give the following definitions:

A symmetric matrix A Sn is positive definite (pd) iff for all non-zero vectors x Rn, xT Ax > 0. This is • usually denoted A 0 (or∈ just A > 0), and often times the set of all positive definite∈ matrices is denoted Sn . ++ A symmetric matrix A Sn is positive semidefinite (PSD) iff for all vectors xT Ax 0. This is written A 0 • (or just A 0), and the∈ set of all positive semidefinite matrices is often denoted Sn .≥  ≥ + Likewise, a symmetric matrix A Sn is negative definite (ND), denoted A 0 (or just A < 0) iff for all • non-zero x Rn, xT Ax < 0. ∈ ≺ ∈ Similarly, a symmetric matrix A Sn is negative semidefinite (NSD), denoted A 0 (or just A 0) iff for • all x Rn, xT Ax 0. ∈  ≤ ∈ ≤ Finally, a symmetric matrix A Sn is indefinite, if it is neither positive semidefinite nor negative semidefinite • — i.e., if there exists x , x R∈n such that xT Ax > 0 and xT Ax < 0. 1 2 ∈ 1 1 2 2 It should be obvious that if A is positive definite, then A is negative definite and vice versa. Likewise, if A is positive semidefinite then A is negative semidefinite and vice− versa. If A is indefinite, then so is A. It can also be shown that positive definite− and negative definite matrices are always invertible. − Finally, there is one type of positive definite matrix that comes up frequently,and so deserves some special mention. m n T Given any matrix A R × (not necessarily symmetric or even square), the Gram matrix G = A A is always ∈

11 positive semidefinite. Further, if m n (and we assume for convenience that A is full rank), then G = AT A is positive definite. ≥ Note that if all elements of A are positive, it does not mean A is necessarily pd. For example,

4 3 A = (74) 3 2   is not pd. Conversely, a pd matrix can have negative entries e.g.,

2 1 A = (75) 1− 2 −  A sufficient condition for a (real, symmetric) matrix to be pd is that it is diagonally dominant, i.e., if in every row of the matrix, the magnitude of the diagonal entry in that row is larger than the sum of the magnitudes of all the other (non-diagonal) entries in that row. More precisely,

a > a for all i (76) | ii| | ij | j=i X6 6 Properties of matrices 6.1 and rank

A set of vectors x1, x2,... xn is said to be (linearly) independent if no vector can be represented as a linear combination of the{ remaining vectors.} Conversely, a vector which can be represented as a linear combination of the remaining vectors is said to be (linearly) dependent. For example, if

n 1 − xn = αixi (77) i=1 X

for some α1,...,αn 1 then xn is dependent on x1,..., xn 1 ; otherwise, it is independent of x1,..., xn 1 . The column{ rank−of} a matrix A is the largest number{ of columns− } of A that constitute linearly independent{ set.− } In the same way, the row rank is the largest number of rows of A that constitute a linearly independent set. It is a basic fact of linear algebra that for any matrix A, columnrank(A) = rowrank(A), and so this quantity is simply refereed to as the rank of A, denoted as rank(A). The following are some basic properties of the rank:

m n For A R × , rank(A) min(m,n). If rank(A) = min(m,n), then A is said to be full rank, otherwise it • is called∈rank deficient. ≤

m n T For A R × , rank(A) = rank(A ). • ∈ m n n p For A R × , B R × , rank(AB) min(rank(A), rank(B)). • ∈ ∈ ≤ m n For A, B R × , rank(A + B) rank(A) + rank(B). • ∈ ≤ 6.2 Range and nullspace of a matrix The span of a set of vectors x , x ,... x is the set of all vectors that can be expressed as a linear combination of { 1 2 n} x1,..., xn . That is, { } n span( x ,... x )= v : v = α x , α R . (78) { 1 n} i i i ∈ ( i=1 ) X n Itcan beshownthatif x1,..., xn isasetof n linearly independent vectors, where each xi R , then span( x1,... xn )= Rn. In other words, any{ vector v } Rn can be written as a linear combination of x through∈x . { } ∈ 1 n The dimension of some subspace S, dim(S), is definedto be d if there exists some spanning set b1,..., bd that is linearly independent. { }

12 Figure 1: Visualization of the nullspace and range of an m × n matrix A. Here y1 = Ax1, so y1 is in the range (is reachable); similarly for y2. Also Ax3 = 0, so x3 is in the nullspace (gets mapped to 0); similarly for x4.

m n The range (sometimes also called the column space) of a matrix A R × is the the span of the columns of A. In other words, ∈ range(A)= v Rm : v = Ax, x Rn . (79) { ∈ ∈ } This can be thought of as the set of vectors that can be “reached” or “generated” by A. The nullspace of a matrix m n A R × is the set of all vectors that get mapped to the null vector when multiplied by A, i.e., ∈ nullspace(A)= x Rn : Ax = 0 . (80) { ∈ } See Figure 1 for an illustration of the range and nullspace of a matrix. We shall discuss how to compute the range and nullspace of a matrix numerically in Section 12.1 below. 6.3 Norms of vectors A norm of a vector x is informally measure of the “length” of the vector. For example, we have the commonly-used k k Euclidean or `2 norm, n x = x2. (81) k k2 v i ui=1 uX 2 T t Note that x 2 = x x. More formally,k k a norm is any function f : Rn R that satisfies 4 properties: → 1. For all x Rn, f(x) 0 (non-negativity). ∈ ≥ 2. f(x)=0 if and only if x = 0 (definiteness). 3. For all x Rn, t R, f(tx)= t f(x) (homogeneity). ∈ ∈ | | 4. For all x, y Rn, f(x + y) f(x)+ f(y) (triangle inequality). ∈ ≤ Other examples of norms are the `1 norm, n x = x (82) k k1 | i| i=1 X and the ` norm, ∞ x = maxi xi . (83) k k∞ | | In fact, all three norms presented so far are examples of the family of `p norms, which are parameterized by a real number p 1, and defined as ≥ n 1/p x = x p . (84) k kp | i| i=1 ! X

13 6.4 Matrix norms and condition numbers 1 (The following section is based on [Van06, p93].) Assume A is a non-singular matrix, and Ax = b, so x = A− b is a unique solution. If we change b to b + ∆b, then A(x + ∆x)= b + ∆b (85) so the new solution is x + ∆x where 1 ∆x = A− ∆b (86) We say that A is well-conditioned if a small ∆b results in a small ∆x; otherwise we say that A is ill-conditioned. For example, suppose 10 10 1 1 1 1 1 10 10 A = 2 10 10 , A− = − 10 10 (87) 1+10− 1 10− 1+10 10  −   −  The solution for b = (1, 1) is x = (1, 1). If we change b by ∆b, the solution changes to 10 1 ∆b1 10 (∆b1 ∆b2) ∆x = A− ∆b = (88) ∆b +− 1010(∆b − ∆b )  1 1 − 2  So a small change in b can lead to an extremely large change in x. In general, we can think of a matrix A as a defining a linear function f(x) = Ax. The expression Ax / x gives the amplication factor or gain of the system in the direction x. We define the norm of A as the maximum|| || gain|| || (over all directions): Ax A = max || || = max Ax (89) || || x=0 x x =1 || || 6 || || || || n This is sometimes called the 2-norm of a matrix. For example, if A = diag(a1,...,an), then A = maxi=1 ai . In general, we have to use numerical methods to compute A ; in Matlab, we can just type norm(A)|| || . | | Other matrix norms exist. For example, the Frobenius|| || norm of a matrix is defined as

m n 2 T A F = a = tr(A A) (90) || || v ij ui=1 j=1 uX X q t Note that this is the same as the 2-normof A(:), the vector gotten by stacking all the columns of A together. In Matlab, just type norm(A,'fro'). We now show how the norm of a matrix can be used to quantify how well-conditioned a linear system is. Since 1 ∆x = A− ∆b, one can show the following upper bound on the absolute error 1 ∆x A− ∆b (91) || || ≤ || || || || 1 So large A− means ∆x can be large even when ∆b is small. Also, since b A x , and hence b || || || || || || || || ≤ || || || || x ||A|| , we have ≥ || || 1 ∆x A− 1 ∆b || || || || ∆b A A− || || (92) x ≤ x || || ≤ || |||| || b || || || || || || thus establishing an upper bound on the relative error. The term 1 κ(A)= A A− (93) || |||| || is called the condition number of A. Large κ(A) means ∆x / x can be large even when ∆b / b is small. For the 2-norm case, the condition number is equal to the|| ratio|| of|| the|| largest to smallest singular|| values|| || ||: κ(A) = σmax/σmin. In Matlab, just type cond(A). We say A is well conditioned if κ(A) is small (close to 1), and ill-conditioned if κ(A) is large. A large condition number means A is nearly singular. This is a better measure of nearness to singularity than the size of the determinant. 100 For example, suppose A = 0.1I100 100. Then det(A) = 10− , but κ(A)=1, reflecting the fact that Ax simply scales the entries of x by 0.1, and thus× is well-behaved.

14 7 Eigenvalues and eigenvectors 7.1 Square matrices n n n Given a square matrix A R × , we say that λ C is an eigenvalue of A and x C is the corresponding eigenvector3 if ∈ ∈ ∈ Ax = λx, x =0 . (94) 6 Intuitively, this definition means that multiplying A by the vector x results in a new vector that points in the same direction as x, but scaled by a factor λ. Also note that for any eigenvector x Cn, and scalar t C, A(cx)= cAx = cλx = λ(cx), so cx is also an eigenvector. For this reason when we talk about∈ “the” eigenvector∈ associated with λ, we usually assume that the eigenvector is normalized to have length 1 (this still creates some ambiguity, since x and x will both be eigenvectors, but we will have to live with this). − We can rewrite the equation above to state that (λ, x) is an eigenvalue-eigenvector pair of A if, (λI A)x = 0, x = 0 . (95) − 6 But (λI A)x = 0 has a non-zero solution to x if and only if (λI A) has a non-empty nullspace, which is only the case if (λ−I A) is singular, i.e., − − (λI A) =0 . (96) | − | We can now use the previous definition of the determinant to expand this expression into a (very large) polynomial in λ, where λ will have maximum degree n; this is called the characteristic polynomial. We then find the n (possibly complex) roots of this polynomial to find the n eigenvalues λ1,...,λn. To find the eigenvector corresponding to the eigenvalue λi, we simply solve the linear equation (λiI A)x = 0. It should be noted that this is not the method which is actually used in practice to numerically com−pute the eigenvalues and eigenvectors (remember that the complete expansion of the determinant has n! terms); it is rather a mathematical argument. n n The following are properties of eigenvalues and eigenvectors (in all cases assume A R × has eigenvalues ∈ λi,...,λn and associated eigenvectors x1,... xn): The trace of a A is equal to the sum of its eigenvalues, • n trA = λi . (97) i=1 X The determinant of A is equal to the product of its eigenvalues, • n A = λ . (98) | | i i=1 Y The rank of A is equal to the number of non-zero eigenvalues of A. • 1 1 If A is non-singular then 1/λ is an eigenvalue of A− with associated eigenvector x , i.e., A− x = (1/λ )x . • i i i i i The eigenvalues of a diagonal matrix D = diag(d ,...d ) are just the diagonal entries d ,...d . • 1 n 1 n We can write all the eigenvector equations simultaneously as

AX = XΛ (99)

n n where the columns of X R × are the eigenvectors of A and Λ is a diagonal matrix whose entries are the eigenval- ues of A, i.e., ∈

n n | | | X R × = x1 x2 xn , Λ = diag(λ ,...,λ ) . (100) ∈  · · ·  1 n | | |   3Note that λ and the entries of x are actually in C, the set of complex numbers, not just the reals; we will see shortly why this is necessary. Don’t worry about this technicality for now, you can think of complex vectors in the same way as real vectors.

15 If the eigenvectors of A are linearly independent, then the matrix X will be invertible, so 1 A = XΛX− . (101) A matrix that can be written in this form is called diagonalizable. 7.2 Symmetric matrices Two remarkable properties come about when we look at the eigenvalues and eigenvectors of a symmetric matrix A Sn. First, it can be shown that all the eigenvalues of A are real. Secondly, the eigenvectors of A are orthonormal, i.e.,∈ T T T T ui uj =0 if i = j, and ui ui =1, where ui are the eigenvectors. In matrix form, this becomes U U = UU = I and U =1. We6 can therefore represent A as | | n T T A = UΛU = λiuiui (102) i=1 X remembering from above that the inverse of an orthogonal matrix is just its transpose. We can write this symbolically as T λ1 u1 λ − uT − | | | 2 − 2 − A = = u1 u2 un  .   .  (103)  · · ·  .. .     | | |  λ   uT     n − n −     = λ u| uT + + λ u| uT (104) 1  1 − 1 − · · · n  n − n − |  |  As an example of diagonalization,  suppose we are given the following  real, symmetric matrix 1.5 0.5 0 A = 0.5− 1.5 0 (105) −0 03   We can diagonalize this in Matlab as follows.

Listing 1: Part of rotationDemo [U,D]=eig(A) U = -0.7071 -0.7071 0 -0.7071 0.7071 0 0 0 1.0000 D = 1 0 0 0 2 0 0 0 3 We recognize from Equation 71 that U corresponds to rotation by 45 degrees. and D corresponds to scaling by diag(1, 2, 3). So the whole matrix is equivalent to a rotation by 45◦, followed by scaling, followed by rotation by 45◦. Note that there is a sign ambiguity for each column of U, but the resulting answer is always correct: − Listing 2: Output of rotationDemo >> U*D*U' ans = 1.5000 -0.5000 0 -0.5000 1.5000 0 0 0 3.0000 We can also use the diagonalization property to show that the definiteness of a matrix depends entirely on the sign of its eigenvalues. Suppose A Sn = UΛUT . Then ∈ n T T T T 2 x Ax = x UΛU x = y Λy = λiyi (106) i=1 X 16 where y = UT x (and since U is full rank, any vector y Rn can be represented in this form). Because y2 is always ∈ i positive, the sign of this expression depends entirely on the λi’s. If all λi > 0, then the matrix is positive definite; if all λi 0, it is positive semidefinite. Likewise, if all λi < 0 or λi 0, then A is negative definite or negative semidefinite≥ respectively. Finally, if A has both positive and negative eigenvalues,≤ it is indefinite. 7.3 Covariance matrices An important application of eigenvectors is to visualizing the probability density function of a multivariate Gaus- sian, also called a multivariate Normal (MVN). We study this in more detail in Section ??, but we introduce it here, since it serves as a good illustration of many ideas in linear algebra. The density is defined as follows:

def 1 1 T 1 (x µ, Σ) = exp[ (x µ) Σ− (x µ)] (107) N | (2π)d/2 Σ 1/2 − 2 − − | | where d is the dimensionality of x, µ = E[X] is the d 1 mean vector, and Σ is a d d symmetric and positive definite covariance matrix, defined by × ×

Cov(X) = Σ def= E [(X E X)(X E X)T ] (108) − − Var (X ) Cov(X ,X ) Cov(X ,X ) 1 1 2 · · · 1 d . . .. . =  . . . .  (109) Cov(X ,X ) Cov(X ,X ) Var (X )  d 1 d 2 · · · d    If d =1, this reduces to the familiar univariate Gaussian density 1 1 (x µ, σ2) def= exp[ (x µ)2] (110) N | (2π)1/2σ −2σ2 −

The expression inside the exponent of the MVN is a quadratic form called the Mahalanobis distance, and is a scalar value equal to T 1 T 1 T 1 T 1 (x µ) Σ− (x µ)= x Σ− x + µ Σ− µ 2µ Σ− x (111) − − − Intuitively, this just measures the distance of point x from µ, where each dimension gets weighted differently. In the case of a diagonal covariance, it becomes a weighted Euclidean distance:

d T 1 2 1 ∆ = (x µ) Σ− (x µ)= (x µ ) Σ− (112) − − i − i ii i=1 X

So if Σii is very large, meaning dimension i has high variance, then it will be downweighted (since we use 1/Σii) when comparing x to µ. This is a formof feature selection. Figure 2 plots some MVN densities in 2d for three differentkinds of covariance matrices.4 A full covariance matrix has d(d + 1)/2 parameters (we divide by 2 since it is symmetric). A diagonal covariance matrix has d parameters. A 2 spherical or isotropic covariance, Σ= σ Id, has one free parameter. The equation ∆= const defines an ellipsoid, which are the level sets of constant probability density. We see that a general covariance matrix has elliptical contours; for a diagonal covariance, the ellipse is axis aligned; and for a spherical covariance, the ellipse is a circle (same “spread” in all directions). We now explain why the contours have this form, using eigenanalysis. Letting Σ = UΛUT we find

p 1 T 1 1 1 T 1 T Σ− = U− Λ− U− = UΛ− U = u u (113) λ i i i=1 i X 4These plots were made by creating a dense grid of points using meshgrid, evaluating the pdf on each grid cell using mvnpdf, and then using the surfc and contour commands. See gaussPlot2dDemo for details.

17 Full, S=[2.0 1.5 ; 1.5 4.0], ρ=0.53 Diagonal, S=[1.2 −0.0 ; −0.0 4.8], ρ=−0.00 Spherical, S=[1.0 0.0 ; 0.0 1.0], ρ=0.00

0.018 0.018 0.04

0.016 0.016 0.035 0.014 0.014 0.03 0.012 0.012 0.025 0.01 0.01 0.02 0.008 0.008 p(x,y) p(x,y) p(x,y) 0.015 0.006 0.006 0.01 0.004 0.004

0.002 0.002 0.005

0 0 0 5 5 5

5 5 5

0 0 0 0 0 0

−5 −5 −5 y −5 y −5 y −5 x x x

Full, S=[2.0 1.5 ; 1.5 4.0] Diagonal, S=[1.2 −0.0 ; −0.0 4.8] Spherical, S=[1.0 0.0 ; 0.0 1.0] 5 5 5

4 4 4

3 3 3

2 2 2

1 1 1

y 0 y 0 y 0

−1 −1 −1

−2 −2 −2

−3 −3 −3

−4 −4 −4

−5 −5 −5 −5 −4 −3 −2 −1 0 1 2 3 4 5 −5 −4 −3 −2 −1 0 1 2 3 4 5 −5 −4 −3 −2 −1 0 1 2 3 4 5 x x x

Figure 2: Visualization of 2 dimensional Gaussian densities. Left: a full covariance matrix Σ. Middle: Σ has been decorrelated (see Section ??). Right: Σ has been whitened (see Section 7.3.1). This figure was produced by gaussPlot2dDemo.

Hence p T 1 T 1 T (x µ) Σ− (x µ) = (x µ) u u (x µ) (114) − − − λ i i − i=1 i ! p X p 1 y2 = (x µ)T u uT (x µ)= i (115) λ − i i − λ i=1 i i=1 i X X def T where yi = ui (x µ). The y variables define a new coordinate system that is shifted (by µ) and rotated (by U) with respect to the original− x coordinates: y = U(x µ). Recall that the equation for an ellipse in 2D is −

y2 y2 1 + 2 =1 (116) λ1 λ2 Hence we see that the contours of equal probability density of a Gaussian lie along ellipses: see Figure 3. This gives us a fast way to plot a 2d Gaussian, without having to evaluate the density on a grid of points and use 1 the contour command. If X is a matrix of points on the circle, then Y = UΛ 2 X is a matrix of points on the ellipse represented by Σ = UΛUT . To see this, we use the result

Cov[Ax]= A Cov[x]AT (117)

to conclude 1 1 1 1 Cov[y]= UΛ 2 Cov[x]Λ 2 UT = UΛ 2 Λ 2 UT = Σ (118) 1 5 where Λ 2 = diag(√Λii). We can implement this in Matlab as follows.

5We can explain the k = √6 term as follows. We use the fact that the Mahalanobis distance is a sum of squares of d Gaussian random variables, to conclude that ∆ χ2, which is the χ-squared distribution with d degrees of freedom (see Section ??). Hence we can find the value of ∆ ∼ d that corresponds to a 95% confidence interval by using chi2inv(0.95, 2), which is approximately 6. So by setting the radius to √6, we will enclose 95% of the probability mass.

18 x2 u2

u1

y2 y1 µ

1/2 λ2 1/2 λ1

x1

Figure 3: Visualization of a 2 dimensional Gaussian density. The major and minor axes of the ellipse are defined by the first two eigenvectors of the covariance matrix, namely u1 and u2. Source: [Bis06] Figure 2.7.

Listing 3: gaussPlot2d function h=gaussPlot2d(mu, Sigma, color) [U, D] = eig(Sigma); n = 100; t = linspace(0, 2*pi, n); xy = [cos(t); sin(t)]; k = sqrt(6); %k = sqrt(chi2inv(0.95, 2)); w = (k * U * sqrt(D)) * xy; z = repmat(mu, [1 n]) + w; h = plot(z(1, :), z(2, :), color); axis('equal');

7.3.1 Whitening data The above plotting routine took a spherical covariance matrix and converted it into an ellipse. We can also go in the opposite direction. Let x (µ, Σ), where Σ = UΛUT . Decorrelation is a linear operation which removes any off-diagonal entries from Σ∼.N Define y = UT x. Then

Cov[y]= UT ΣU = UT UΛUT U = Λ (119)

which is a diagonal matrix. We can go further and convert this to a spherical matrix, by making the diagonal entries be 1 (unit variance). This is called whitening or sphereing the data. Let

1 T y = Λ− 2 U x (120)

Then

1 1 1 1 1 1 T T T Cov[y] = Λ− 2 U ΣUΛ− 2 =Λ− 2 U (UΛU U)Λ− 2 =Λ− 2 ΛΛ− 2 = I (121)

See Figure 2 for an example. In Matlab, we can implement this as follows

Listing 4: Part of gaussPlot2dDemo mu = [1 0]'; S = [2 1.5; 1.5 4]; [U,D] = eig(S); A = sqrt(inv(D))*U'; % whitening transform (without centering) mu2 = A*mu; S2 =A*S*A'; assert(approxeq(S2, eye(2)))

19 Finally, we can whiten the data and center it, by subtracting off the mean.

1 T y = Λ− 2 U (x µ) (122) 1 − T E[y] = Λ− 2 U (µ µ)= 0 (123) − 7.3.2 Standardizing data Standardizing data is a simple preprocessing step that is a weaker form of whitening, which makes each dimension be zero mean and have unit variance, but which does not remove any correlation (since it operates on each dimension separately). Standardized data is often represented by Z. Symbolically we have

xij x¯.j zij = − (124) σj We can do this in MATLAB using the MLAPA function standardize, or the function zscore.

Listing 5: standardize function [Z, mu, sigma] = standardize(X, mu, sigma) X = double(X); if nargin < 2 mu = mean(X); sigma = std(X); ndx = find(sigma < eps); sigma(ndx) = 1; end [n d] = size(X); Z = X - repmat(mu, [n 1]); Z = Z ./ repmat(sigma, [n 1]); Of course, in a supervised learning setting, we must apply the same transformation to the test data. Thus it is very common to see the following idiom (we will see examples in Chapter ??):

Listing 6: : [Xtrain, mu, sigma] = [ones(size(Xtrain,1), 1) standardize(Xtrain)]; w = fitModel(Xtrain); % plug in appropriate learning procedure Xtest = [ones(size(Xtest,1),1) standardize(Xtest, mu, sigma)]; ypred = testModel(Xtest, w); % plug in appropriate test procedure Alternatively, we can rescale the data, e.g., to 0:1 or -1:1, using the rescaleData function. We must rescale the test data in the same way, as in the example below.

Listing 7: : xTrain = -10:1:10; [zTrain, minx, rangex] = rescaleData(xTrain, -1, 1);% zTrain= -1:1 xTest = [-11, 8]; [zTest] = rescaleData(xTest, -1, 1, minx, rangex) % [-1.1, 0.8]

8 Matrix calculus While the topics in the previous sections are typically covered in a standard course on linear algebra, one topic that does not seem to be covered very often (and which we will use extensively) is the extension of calculus to the vector setting. Despite the fact that all the actual calculus we use is relatively trivial, the notation can often make things look much more difficult than they are. In this section we present some basic definitions of matrix calculus and provide a few examples. We concentrate on functions that maps vectors / matrices to scalars, i.e., their output is scalar valued. 8.1 The gradient of a function wrt a matrix m n Suppose that f : R × R is a function that takes as input a matrix A of size m n and returns a real value. Then → m n × the gradient of f (with respect to A R × ) is the matrix of partial derivatives, defined as: ∈

20 ∂f(A) ∂f(A) ∂f(A) ∂A11 ∂A12 ∂A1n ∂f(A) ∂f(A) · · · ∂f(A)   Rm n ∂A21 ∂A22 ∂A2n Af(A) × = . . ·. · · . (125) ∇ ∈  . . .. .   . . .   ∂f(A) ∂f(A) ∂f(A)     ∂Am1 ∂Am2 · · · ∂Amn  i.e., an m n matrix with   × ∂f(A) ( Af(A))ij = . (126) ∇ ∂Aij We give some examples below (based on [Jor07, ch13]). We will use these examples when computing the maximum likelihood estimate of the covariance matrix for a multivariate Gaussian in Section ??. 8.1.1 Derivative of trace and quadratic forms Recall that (AB)km = aklblm (127) X` and hence that tr(AB)= (AB)kk = aklblm (128) Xk Xk X` Hence ∂ ∂ tr(AB)= = aklblm = bji (129) ∂aij ∂aij Xk X` so tr(AB)= BT (130) ∇A We can use this, plus the trace trick, to compute the derivative of a quadratic form wrt a matrix:

xT Ax = tr(xT Ax)= tr(xxT A) = (xxT )T = xxT (131) ∇A ∇A ∇A 8.1.2 Derivative of a (log) determinant Recall from Equation 37 that A = a C for any row i, where C is the cofactor. Hence | | j ij ij ij P ∂ A = Cij (132) ∂aij | |

since Cij does not contain any terms involving aij (since we deleted row i and column j. Hence

T T A = adj(A) = A A− (133) ∇A| | | | Now we can use the chain rule to see that ∂ log A ∂ log A ∂ A 1 ∂ A | | = | | | | = | | . (134) ∂A ∂ A ∂A A ∂A ij | | ij | | ij Hence 1 T log A = A = A− (135) ∇A | | A ∇A| | | | Note the similarity to the scalar case, where ∂/(∂x) log x =1/x.

21 8.2 The gradient of a function wrt a vector Rn Note that the size of Af(A) is always the same as the size of A. So if, in particular, A is just a vector x , we have ∇ ∈ ∂f(x) ∂x1 ∂f(x)  ∂x2  xf(x)= . . (136) ∇ .  .   ∂f(x)     ∂xn    It is very important to remember that the gradient of a function is only defined if the function is real-valued, that is, if n n it returns a scalar value. We can not, for example, take the gradient of Ax, A R × with respect to x, since this quantity is vector-valued. ∈ It follows directly from the equivalent properties of partial derivatives that:

x(f(x)+ g(x)) = xf(x)+ xg(x). • ∇ ∇ ∇ For t R, x(tf(x)) = t xf(x). • ∈ ∇ ∇ We give some examples below, which we will use when computing the least squares estimate (see Section ??) as well as when computing the maximum likelihood estimate of the mean of a multivariate Gaussian in Section ??. 8.2.1 Derivative of a scalar product For x Rn, let f(x)= aT x for some known vector a Rn. Then ∈ ∈ n ∂f(x) ∂ = a x = a . (137) ∂x ∂x i i k k k i=1 X From this we can easily see that xaT x = a (138) ∇ This should be compared to the analogous situation in single variable calculus, where ∂/(∂x) ax = a. 8.2.2 Derivative of a quadratic form Now consider the quadratic form f(x)= xT Ax for A Sn. Remember that ∈ n n f(x)= Aij xixj (139) i=1 j=1 X X so n n n n ∂f(x) ∂ = A x x = A x + A x (140) ∂x ∂x ij i j ik i kj j k k i=1 j=1 i=1 j=1 X X X X Hence xxT Ax = (A + AT )x (141) ∇ If A is symmetric, this becomes xxT Ax = 2Ax. Again, this should remind you of the analogous fact in single- variable calculus, that ∂/(∂x) ax2∇=2ax. 8.3 The Jacobian Let f be a function that maps Rn to Rm, and let y = f(x). Define the Jacobian matrix as

∂y1 ∂y1 ∂x1 ∂x ∂(y ,...,y ) · · · n def 1 m def . .. . Jx y = =  . . .  (142) → ∂(x1,...,xn) ∂ym ∂ym  ∂x1 ∂xn   · · · 

22 T If we denote the i’th component of the output, then the i’th row of J is ( xfi(x)) . If f is a linear function, so y = Ax, where A is m n, we can∇ compute the gradient of each component of T × T the output separately. We have yi = Ai,:x, so from Equation ??, we have that the the i’th row of J is Ai,:, so Jacobian(Ax)= A. 8.4 The Hessian of a function wrt a vector Suppose that f : Rn R is a function that takes a vector in Rn and returns a real number. Then the → with respect to x, written x2 f(x) or simply as H is the n n matrix of partial derivatives, ∇ × 2 2 2 ∂ f(x) ∂ f(x) ∂ f(x) 2 ∂x1 ∂x1∂x2 ∂x1∂xn 2 2 · · · 2 ∂ f(x) ∂ f(x) ∂ f(x)  2  2 Rn n ∂x2∂x1 ∂x2 · · · ∂x2∂xn xf(x) × = . . . . . (143) ∇ ∈  . . .. .   2 2 2   ∂ f(x) ∂ f(x) ∂ f(x)   2   ∂xn∂x1 ∂xn∂x2 · · · ∂xn  Obviously this is a symmetric matrix, since   ∂2f(x) ∂2f(x) = . (144) ∂xi∂xj ∂xj ∂xi We see that the Hessian is just the Jacobian of the gradient. Note that the Hessian matrix is often written as T H = x( xf(x)) (145) ∇ ∇ where the term inside the bracketreturns a row vector, which we then operate on elementwise, to create an outerproduct- like matrix. We cannot write the Hessian as H = x( xf(x)), since that would require computing the gradient of a column vector. ∇ ∇ 8.4.1 Hessian of a quadratic form Consider the quadratic form f(x)= xT Ax for A Sn. From above, we have ∈ xxT Ax = (A + AT )x (146) ∇ Hence n n ∂2f(x) ∂2 = A x x = A + A =2A . (147) ∂x ∂x ∂x ∂x ij i j k` `k k` k ` k ` i=1 j=1 so X X 2 T T xx Ax = A + A (148) ∇ If A is symmetric, this is 2A, and is again analogous to the single-variable fact that ∂2/(∂x2) ax2 =2a. 8.5 Eigenvalues as Optimization Finally, we use matrix calculus to solve an optimization problem in a way that leads directly to eigenvalue/eigenvector analysis. Consider the following, equality constrained optimization problem: T 2 maxx Rn x Ax subject to x 2 =1 (149) ∈ k k for a symmetric matrix A Sn. A standard way of solving optimization problems with equality constraints is by forming the Lagrangian, an∈ objective function that includes the equality constraints. The Lagrangian in this case can be given by (x, λ)= xT Ax + λ(1 xT x) (150) L − where λ is called the Lagrange multiplier associated with the equality constraint. It can be established that for x∗ to be a optimal point to the problem, the gradient of the Lagrangian has to be zero at x∗ (this is not the only condition, but it is required). That is, x (x, λ)=2AT x 2λx = 0. (151) ∇ L − Notice that this is just the linear equation Ax = λx. This shows that the only points which can possibly maximize (or minimize) xT Ax assuming xT x =1 are the eigenvectors of A.

23 d d n-d d R d X Q 0 n = n-d

Figure 4: QR decomposition of a non-square matrix A = QR, where QT Q = I and R is upper triangular. The shaded parts of R, and all the below-diagonal terms, are zero. The shaded entries in Q are not computed in the economy-sized version, since they are not needed.

9 Matrix factorizations There are various ways to decompose a matrix into a product of other matrices. In Section 7, we discussed diago- 1 nalizing a square matrix A = XΛX− . In Section ?? below, we will discuss a way to generalize this to non-square matrices using a technique called SVD. In this section, we present some other useful matrix factorizations. 9.1 Cholesky factorization of pd matrices Any symmetric positive definite matrix can be factorized as A = RT R, where R is upper triangular with real, positive diagonal elements. This is called a Cholesky factorization: In Matlab, type R = chol(A). In fact, we can use this function to determine if a matrix is positive definite, as follows:

Listing 8: isposdef function b = isposdef(a) % Written by Tom Minka [R,p] = chol(a); b = (p == 0); (This method is faster than checking that all the eigenvalues are positive.) We can create a random pd matrix using the following function.

Listing 9: randpd function M = randpd(d) A = rand(d); M =A*A'; The Cholesky decomposition of a covariance matrix can be used to sample from a multivariate Gaussian. Let y (µ, Σ) and Σ = LLT , where L = RT is lower triangular. We first sample x (0, I), which is easy because∼ N it just requires sampling from d separate 1d Gaussians. We then set y = LT x + µ∼. This N is valid since

Cov[y]= LT Cov[x]L = LT IL = Σ (152)

In Matlab you type

Listing 10: : X = chol(Sigma)'*randn(d,n) + repmat(mu,n,1); Of course, if you have the Matlab statistics toolbox, you can just call mvnrnd, but it is useful to know this other method, too. 9.2 QR decomposition The QR decomposition of an n d matrix X is defined by × X = QR (153)

24 n n m-n n n n A U Σ VT m =

Figure 5: SVD decomposition of a non-square matrix A = UΣVT . The shaded parts of Σ, and all the off-diagonal terms, are zero. The shaded entries in U and Σ are not computed in the economy-sized version, since they are not needed.

where Q is an n n orthogonal matrix (hence QT Q = I and QQT = I), and R is an n d matrix, which is only non-zero in its upper-right× triangle (see Figure 4). In Matlab, type [Q,R] = qr(X). It is× also possible to compute an economy-sized QR decomposition, using [Q,R] = qr(X,0). In this case, Q is just n d and R is d d, where the lower triangle is zero (see Figure 4). We will use this when we study least squares in Section× ??. × 10 Singular value decomposition (SVD) Any (real) m n matrix A can be decomposed as ×

T | T | T A = UΣV = λ u1 v + + λ ur v (154) 1   − 1 − · · · r   − r − |  |      where U is an m m whose columns are orthornormal (so UT U = I), V is n n matrix whose rows and columns are orthonormal (so× VT V = VVT = I), and Σ is a m n matrix containing× the r = min(m,n) singular values × σi 0 on the main diagonal, with 0s filling the rest of the matrix. The columns of U are the left singular vectors, and the≥ columns of V are the right singular vectors. See Figure 5 for an example. In Matlab, you can compute the SVD using [U,Sigma,V] = svd(A). As is apparent from Figure 5, if mn, we take the first n columns of U×, make Σ be n n, and× leave V as is (this is also called a thin SVD). × A square diagonal matrix is nonsingular iff its diagonal elements are nonzero. The SVD implies that any square matrix is nonsingular if, and only if, its singular values are nonzero. The most numerically reliable way to determine whether matrices are singular is to test their singular values. This is far better than trying to compute determinants, which are numerically unstable. 10.1 Connection between SVD and eigenvalue decomposition If A is real, symmetric and positive definite, then the singular values are equal to the eigenvalues, and the left and right singular vectors are equal to the eigenvectors (up to a sign change). However, Matlab always returns the singular values in decreasing order, whereas the eigenvalues need not necessarily be sorted. This is illustrated below.

Listing 11: : >> A=randpd(3) A = 0.9302 0.4036 0.7065 0.4036 0.8049 0.4521 0.7065 0.4521 0.5941

>> [U,S,V]=svd(A) U = -0.6597 0.5148 -0.5476 -0.5030 -0.8437 -0.1872 -0.5584 0.1520 0.8155 S = 1.8361 0 0

25 0 0.4772 0 0 0 0.0159 V = -0.6597 0.5148 -0.5476 -0.5030 -0.8437 -0.1872 -0.5584 0.1520 0.8155

>> [U,Lam]=eig(A) U = 0.5476 0.5148 0.6597 0.1872 -0.8437 0.5030 -0.8155 0.1520 0.5584 Lam = 0.0159 0 0 0 0.4772 0 0 0 1.8361 In general, for an arbitrary real matrix A, if A = UΣVT , we have

AT A = VΣT UT UΣVT = V(ΣT Σ)VT (155) (AT A)V = V(ΣT Σ)= VD (156) so the eigenvectors of AT A are equal to V, the right singular vectors of A. Also, the eigenvalues of AT A are equal to the elements of Σ2, the squared singular values. Similarly

AAT = UΣVT VΣT UT = U(ΣΣT )UT (157) (AAT )U = U(ΣΣT )= UD (158) so the eigenvectors of AAT are equal to U, the left singular vectors of A. Also, the eigenvalues of AAT are equal to the squared singular values. We will use these results when we study PCA (see Section ??), which is very closely related to SVD. 10.2 Pseudo inverse

The pseudo-inverse of an m n matrix A, denoted A†, is defined as the unique matrix that satisfies the following 4 properties: ×

AA†A = A (159)

A†AA† = A† (160) T (AA†) = AA† (161) T (A†A) = A†A (162)

1 If A is square and non-singular, then A† = A− . If m>n and the columns of A are linearly independent, then

T 1 T A† = (A A)− A (163) which is the same expression as arises in the normal equations (see Section ??). In this case, A† is a left inverse because T 1 T A†A = (A A)− A A = I (164) but is not a right inverse because T 1 T AA† = A(A A)− A (165) only has rank n, and so cannot be the m m identity matrix. In general, we can compute the pseudo× inverse using the SVD decomposition A = UΣVT . If A is square and non-singular, we can compute A† easily:

1 T 1 T A† = A− = (UΣV )− = V[diag(1/σj )]U (166)

26 n k k n Σ VT k A U m ¼

Figure 6: Truncated SVD.

rank 1 rank 2 rank 5

rank 10 rank 20 rank 200

Figure 7: Low rank approximations to a 200 × 320 image. Ranks 1, 2, 5, 10, 20 and 200 (original is bottom right). Produced by svdImageDemo.

T T since V V = I and U U = I. When computing the pseudo inverse, if any σj =0, we replace 1/σj with 0. Suppose the first r = rank(A) singular values are nonzero, and the rest are zero. Then

1 1 Σ† = diag(σ1− ,...,σr− , 0,..., 0) (167) Hence T A† = VΣ†U (168) 1 amounts to only using the first r rows of V, U and Σ− , since the remaining elements of Σ† will be 0. In mat- lab, we can just type pinv(A). The (abbreviated) source code for this built-in function is shown below. (Use type(which('pinv')) to see the full source.)

Listing 12: pinv
function B = pinv(A)
[U,S,V] = svd(A,0);  % if m>n, only compute first n cols of U
s = diag(S);
tol = max(size(A)) * eps(max(s));  % default tolerance (omitted from the original abbreviated listing)
r = sum(s > tol);    % rank
w = diag(ones(r,1) ./ s(1:r));
B = V(:,1:r) * w * U(:,1:r)';

We will see how to use the pseudo-inverse for solving under-determined linear systems in Section 11.3.
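To see Equations 167-168 in action, here is a small sketch (following the same recipe on a deliberately rank-deficient matrix; the matrix and tolerance are illustrative assumptions, not from the original notes):

% Pseudo-inverse of a rank-deficient matrix, built directly from the SVD.
A = randn(5,2) * randn(2,3);          % 5x3 matrix of rank 2 by construction
[U,S,V] = svd(A,0);
s = diag(S);
tol = max(size(A)) * eps(max(s));
r = sum(s > tol);                     % numerical rank, here r = 2
Apinv = V(:,1:r) * diag(1./s(1:r)) * U(:,1:r)';   % V * Sigma-dagger * U'
assert(norm(Apinv - pinv(A)) < 1e-10)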

10.3 Low rank approximation of matrices
(The following section is based on [Mol04, sec10.11].) The economy-sized SVD
A = UΣV^T    (169)

Figure 8: Singular values (left) and log singular values (right) for the clown image.

can be rewritten as
A = E_1 + E_2 + · · · + E_r    (170)
where r = min(m, n). The component matrices E_i are rank one outer products:

E_i = σ_i u_i v_i^T    (171)
The norm of each matrix is the corresponding singular value, ||E_i|| = σ_i. If we truncate this sum to the first k terms,
A_k = E_1 + · · · + E_k = U_{:,1:k} Σ_{1:k,1:k} V_{:,1:k}^T    (172)
we get a rank k approximation to the matrix: see Figure 6. This is called a truncated SVD. The total number of parameters needed to represent an m × n matrix using a rank k approximation is
mk + k + kn = k(m + n + 1)    (173)

As an example, consider the 200 × 320 pixel image in Figure 7 (top left). This has 64,000 numbers in it. We see that a rank 20 approximation, with only (200 + 320 + 1) × 20 = 10,420 numbers, is a very good approximation. The error in this approximation is given by

||A − A_k|| = σ_{k+1}    (174)
Since the singular values often die off very quickly (see Figure 8), a low-rank approximation is often quite accurate. In Section ??, we will see that the truncated SVD is the same thing as principal components analysis (PCA). We can construct a low rank approximation to a matrix (here, an image) as follows.

Listing 13: Part of svdImageDemo
load clown; % built-in image
[U,S,V] = svd(X,0);
k = 20;
Xhat = (U(:,1:k)*S(1:k,1:k)*V(:,1:k)');
image(Xhat);
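As a check on Equation 174, the following sketch (using a random matrix rather than the clown image) confirms that the 2-norm error of the rank-k truncation is the (k+1)st singular value:

% The spectral norm error of the rank-k truncated SVD equals sigma_{k+1}.
A = randn(30,20);
[U,S,V] = svd(A,0);
k = 5;
Ak = U(:,1:k) * S(1:k,1:k) * V(:,1:k)';
sigma = diag(S);
assert(abs(norm(A - Ak) - sigma(k+1)) < 1e-10)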

11 Solving systems of linear equations
Consider solving a generic linear system Ax = b, where A is m × n. There are three cases, depending on whether A is square (m = n), tall and skinny (m > n), or short and fat (m < n).
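For the square non-singular case, the backslash operator gives the unique solution; a minimal sketch (not from the original notes), which also illustrates the standard advice to prefer A\b over inv(A)*b:

% Square, non-singular case: backslash returns the unique solution.
n = 4;
A = randn(n); b = randn(n,1);
x1 = A \ b;         % preferred: solves the system without forming inv(A)
x2 = inv(A) * b;    % also works, but is slower and generally less accurate
assert(norm(A*x1 - b) < 1e-10)
assert(norm(x1 - x2)  < 1e-8)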

11.2 m > n: Over-determined case
If m > n, the system is over-determined: there are more equations than unknowns. In general, there is no exact solution. Typically one seeks the least squares solution:

x̂ = arg min_x ||Ax − b||^2    (175)
This solution minimizes the sum of the squared residuals:

||Ax − b||^2 = \sum_{i=1}^m r_i^2    (176)
r_i = a_{i1} x_1 + · · · + a_{in} x_n − b_i,   i = 1 : m    (177)
In Matlab, the backslash operator x=A\b gives the least squares solution; internally Matlab uses QR decomposition: see Section 9.2.
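A small sketch of the over-determined case (assuming A has full column rank; not an example from the original notes), comparing backslash with the normal-equations formula of Equation 163 and with the pseudo-inverse:

% Over-determined least squares: three equivalent ways to compute x-hat.
A = randn(10,3); b = randn(10,1);
x1 = A \ b;               % QR-based least squares solution
x2 = (A'*A) \ (A'*b);     % normal equations (simple, but less numerically stable)
x3 = pinv(A) * b;         % pseudo-inverse solution
assert(norm(x1 - x2) < 1e-8)
assert(norm(x1 - x3) < 1e-8)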

11.3 m < n: Under-determined case

A = UΣV^T = U \begin{pmatrix} σ_1 & & \\ & \ddots & \\ & & σ_n \end{pmatrix} \begin{pmatrix} — v_1^T — \\ \vdots \\ — v_n^T — \end{pmatrix}    (178)
Hence
Ax = U \begin{pmatrix} σ_1 v_1^T x \\ \vdots \\ σ_n v_n^T x \end{pmatrix} = U \begin{pmatrix} σ_1 < v_1, x > \\ \vdots \\ σ_n < v_n, x > \end{pmatrix}    (179)
= \begin{pmatrix} | & & | \\ u_1 & \cdots & u_n \\ | & & | \end{pmatrix} \begin{pmatrix} σ_1 < v_1, x > \\ \vdots \\ σ_n < v_n, x > \end{pmatrix}    (180)
= \begin{pmatrix} u_{11} σ_1 < v_1, x > + \cdots + u_{1n} σ_n < v_n, x > \\ \vdots \\ u_{n1} σ_1 < v_1, x > + \cdots + u_{nn} σ_n < v_n, x > \end{pmatrix} = \sum_{j=1}^n σ_j < v_j, x > u_j    (181)

Suppose that the first r singular values of A are nonzero and that the rest are zero.^6 Hence

Ax = \sum_{j:σ_j>0} σ_j < v_j, x > u_j = \sum_{j=1}^r σ_j < v_j, x > u_j    (182)

Thus any Ax can be written as a linear combination of the left singular vectors u_1, ..., u_r, so the range of A is given by
range(A) = span{u_j : σ_j > 0}    (183)
with dimension r. To find a basis for the null space, let us now define a second vector y ∈ R^n that is a linear combination solely of the right singular vectors for the zero singular values,

y = \sum_{j:σ_j=0} c_j v_j = \sum_{j=r+1}^n c_j v_j    (184)

Since the v_j's are orthonormal, we have

Ay = U \begin{pmatrix} σ_1 < v_1, y > \\ \vdots \\ σ_r < v_r, y > \\ σ_{r+1} < v_{r+1}, y > \\ \vdots \\ σ_n < v_n, y > \end{pmatrix} = U \begin{pmatrix} σ_1 \cdot 0 \\ \vdots \\ σ_r \cdot 0 \\ 0 \cdot < v_{r+1}, y > \\ \vdots \\ 0 \cdot < v_n, y > \end{pmatrix} = U 0 = 0    (185)
Hence the right singular vectors corresponding to the zero singular values form an orthonormal basis for the null space:

nullspace(A) = span{v_j : σ_j = 0}    (186)
with dimension n − r. We see that
dim(range(A)) + dim(nullspace(A)) = r + (n − r) = n    (187)
In words, this is often written as
rank + nullity = n    (188)
This is called the rank-nullity theorem.
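The rank-nullity theorem is easy to verify numerically. In the sketch below (a deliberately rank-deficient matrix, not an example from the notes), orth and null return orthonormal bases for the range and null space, computed internally from the SVD:

% Verify rank + nullity = n for a rank-deficient matrix.
A = randn(4,2) * randn(2,5);    % 4x5 matrix of rank 2, so n = 5
n = size(A,2);
R = orth(A);                    % orthonormal basis for range(A), 4x2
N = null(A);                    % orthonormal basis for nullspace(A), 5x3
assert(size(R,2) + size(N,2) == n)   % rank + nullity = 2 + 3 = 5
assert(norm(A*N) < 1e-10)            % A maps the null space to zero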

12.2 Existence and uniqueness of solutions
We can finally answer our question about the existence and uniqueness of solutions to Ax = b. If A is full rank (r = n), then Ax can generate every vector b ∈ R^n. Hence for any b, there is an x such that Ax = b, namely x = A^{-1} b, which can be rewritten as
x = A^{-1} b = (UΣV^T)^{-1} b = VΣ^{-1}U^T b = \sum_{j=1}^n < u_j, b > σ_j^{-1} v_j    (189)
Furthermore, this solution must be unique. To see why, suppose y is some other solution, so Ay = b. Define v = y − x. Then
Ay = A(x + v) = Ax + Av = b + Av = b    (190)
Hence v ∈ nullspace(A). So if nullspace(A) = {0} (since nullity = 0), then we must have v = 0, so the solution x is unique.

6This is the opposite convention of [Bee07].
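Equation 189 can be checked directly; a minimal sketch for a random square non-singular A (not from the original notes):

% Solve Ax = b via the SVD expansion of Equation 189, and compare with backslash.
n = 4;
A = randn(n); b = randn(n,1);
[U,S,V] = svd(A);
sigma = diag(S);
x = zeros(n,1);
for j = 1:n
    x = x + (U(:,j)'*b) / sigma(j) * V(:,j);   % <u_j, b> sigma_j^{-1} v_j
end
assert(norm(A*x - b) < 1e-10)
assert(norm(x - A\b) < 1e-10)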

Now suppose A is not full rank (r < n), so the null space is non-trivial. Let z be any particular solution, so Az = b, and let y ∈ nullspace(A). Then A(z + y) = Az + Ay = b + 0 = b.

Hence x = z + y is also a solution for any y in the null space. To find a particular solution z, we can use the pseudo-inverse:

z = A†b = VΣ†U^T b = \sum_{j=1}^r < u_j, b > σ_j^{-1} v_j    (193)

Az = AA†b = U diag(σ_1, ..., σ_r, 0, ..., 0) V^T V diag(1/σ_1, ..., 1/σ_r, 0, ..., 0) U^T b    (194)
= U_{:,1:r} U_{:,1:r}^T b = \sum_{j=1}^r u_j < u_j, b > = b    (195)

The last step follows since b ∈ range(A) (by assumption), and hence b must lie in span{u_j : σ_j > 0}. It is natural to ask: what happens if we apply the pseudo-inverse to a b that is not in the range of A? This happens when A is over-determined (more rows than columns in A). In this case, z = A†b is not a solution, Az ≠ b. However, it is the “closest thing to a solution”, in the sense that it minimizes the residual norm:

||Az − b|| ≤ ||Ay − b||   ∀ y ∈ R^n    (196)
This is called a least squares solution: see Section ??.

12.3 Worked example
Consider the following example from [Van06, p53]. We have the following linear system of equations, where A is singular and lower triangular:
\begin{pmatrix} 6 & 0 & 0 \\ 2 & 0 & 0 \\ -4 & 1 & -2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}    (197)
From the first equation, x_1 = b_1/6. From the second equation,
2x_1 = b_2  ⇒  2 b_1/6 = b_2  ⇒  b_1 = 3b_2    (198)

So the system is only solvable if b_1 = 3b_2, and in that case, we can choose x_2 arbitrarily. Finally, the third equation can be satisfied by taking
x_3 = -(1/2)(b_3 + 4x_1 - x_2) = -(1/3) b_1 - (1/2) b_3 + (1/2) x_2    (199)

So if b_1 ≠ 3b_2, the equation has no solution. If b_1 = 3b_2, there are infinitely many solutions: x = (b_1/6, α, −b_1/3 − b_3/2 + α/2) is a solution for all α. Let us now see how we can derive these results using the SVD. The following code fragment verifies that b = (3, 1, 10) (which satisfies b_1 = 3b_2) is in the range of A, by showing that there are coefficients c that can multiply the left singular vectors to generate b.

Listing 14: Part of svdSpanDemo
A = [6 0 0; 2 0 0; -4 1 -2];
[U,S,V] = svd(A,'econ');
S = diag(S);
ndx0 = find(S<=eps);
ndxPos = find(S>eps);
R = U(:,ndxPos); % range
N = V(:,ndx0);   % null space
b = [3 1 10]';   % this satisfies b1=3b2
% verify that b is in the range
c = R \ b;
assert(approxeq(b, sum(R*diag(c),2)))

Listing 15: Part of svdSpanDemo
z = pinv(A)*b;
assert(approxeq(norm(A*z - b), 0))

We can also create other solutions by adding random multiples of vectors in the null space.

Listing 16: Part of svdSpanDemo
c = rand(length(ndx0),1);
x = z + sum(N*diag(c),2);
assert(approxeq(norm(A*x - b), 0))

Similarly, it is easy to verify that for b = (1, 1, 10) there is no solution, since b is not in the range of A (it cannot be generated by Az for any z):

Listing 17: Part of svdSpanDemo
b = [1 1 10]'; % this does not satisfy b1=3b2
c = R \ b;
assert(not(approxeq(b, sum(R*diag(c),2))))
z = pinv(A)*b;
assert(not(approxeq(norm(A*z - b), 0)))
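Finally, we can check the hand-derived family of solutions x = (b_1/6, α, −b_1/3 − b_3/2 + α/2) from Equation 199 against the same A and b (a small sketch, not part of svdSpanDemo):

% Check the closed-form family of solutions derived by hand above.
A = [6 0 0; 2 0 0; -4 1 -2];
b = [3 1 10]';                 % satisfies b1 = 3*b2, so solutions exist
for alpha = [-2 0 1.5]
    x = [b(1)/6; alpha; -b(1)/3 - b(3)/2 + alpha/2];
    assert(norm(A*x - b) < 1e-10)
end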

References
[Bee07] K. Beers. Numerical Methods for Chemical Engineering: Applications in MATLAB. Cambridge University Press, 2007.
[Bis06] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[Jor07] M. I. Jordan. An Introduction to Probabilistic Graphical Models. 2007. In preparation.
[Mol04] C. Moler. Numerical Computing with MATLAB. SIAM, 2004.
[Van06] L. Vandenberghe. Applied Numerical Computing: Lecture Notes, 2006.
