Linear Algebra Review

Machine Learning Department, Carnegie Mellon University Linear Algebra Review Jing Xiang March 18, 2014 1 Properties of Matrices Below are a few basic properties of matrices: • Matrix Multiplication is associative: (AB)C = A(BC) • Matrix Multiplication is distributive: A(B + C) = AB + AC • Matrix Multiplication is NOT commutative in general, that is AB 6= BA. For m×n n×q example, if A 2 R and B 2 R , the matrix product BA does not exist. 2 Transpose m×n > n×m The transpose of a matrix A 2 R , is written as A 2 R where the entries of the matrix are given by: > (A )ij = Aji (2.1) Properties: • Transpose of a scalar is a scalar a> = a • (A>)> = A • (AB)> = B>A> • (A + B)> = A> + B> 1 3 Trace n×n The trace of a square matrix A 2 R is written as Tr(A) and is just the sum of the diagonal elements: n X Tr(A) = Aii (3.1) i=1 The trace of a product can be written as the sum of entry-wise products of elements. Tr(A>B) = Tr(AB>) = Tr(B>A) = Tr(BA>) (3.2) n X = Ai;jBi;j (3.3) i;j (3.4) Properties: • Trace of a scalar is a scalar Tr(a) = a n×n > • A 2 R ; Tr(A) = Tr(A ) n×n • A; B 2 R ; Tr(A + B) = Tr(A) + Tr(B) n×n • A 2 R ; c 2 R; Tr(cA) = c Tr(A) • A; B such that AB is square, Tr(AB) = Tr(BA) • A; B; C such that ABC is square, Tr(ABC) = Tr(BCA) = Tr(CAB), this is called trace rotation. 4 Vector Norms A norm of a vector kxk is a measure of it’s "length" or "magnitude". The most common is the Euclidean or `2 norm. v u n uX 2 1. `2 norm : kxk2 = t xi i=1 2 2 For example, this is used in ridge regression: ky − Xβk + λkβk2 n X 2. `1 norm : kxk1 = jxij i=1 2 For example, this is used in `1 penalized regression: ky − Xβk + λkβk1 3. `1 norm : kxk1 = maxi jxij 1 n ! p X p 4. The above are all examples of the family of `p norms : kxkp = jxij i=1 2 5 Rank m A set of vectors x1; x2;::: xn ⊂ R is said to be linearly independent if no vector can be represented as a linear combination of the remaining vectors. The rank of a matrix is size of the largest subset of columns of A that constitute a linearly independent set. This is often referred to as the number of linearly independent columns of A. Note the amazing fact that rank(A) = rank(A>). This means that column rank = row rank. m×n For A 2 R rank(A) ≤ min(m; n). If rank(A) = min(m; n), then A is full rank. 6 Inverse n×n −1 The inverse of a symmetric matrix A 2 R is written as A and is defined such that: AA−1 = A−1A = I If A−1 exists, the matrix is said to be nonsingular, otherwise it is singular. For a square matrix to be invertible, it must be full rank. Non-square matrices are not invertible. Properties: • (A−1)−1 = A • (AB)−1 = B−1A−1 • (A−1)> = (A>)−1 Sherman-Morrison-Woodbury Matrix Inversion Lemma (A + XBX>)−1 = A−1 − A−1X(B−1 + X>A−1X)−1X>A−1 This comes up and can often make a hard inverse into an easy inverse. A and B are square and invertible but they don’t need to be the same dimension. 7 Orthogonal Matrices • Two vectors are orthogonal if u>v = 0. A vector is normalized if kxk = 1. • A square matrix is orthogonal if all its columns are orthogonal to each other and are normalized (columns are orthonormal). • If U is an orthogonal matrix U > = U −1, then U >U = I = UU >. • Note if U is not square, but the columns are orthonormal, then U >U = I but UU > 6= I. Orthogonal usually refers to the first case. 3 8 Matrix Calculus Gradient m×n Given f : R ! R is a function that takes as input a matrix and returns a real values. Then the gradient of f with respect to A is the matrix of partial derivatives, that is the m × n matrix defined below. @f(A) (5Af(A))ij = @Aij Note that the size of 5Af(A))ij is the same as the size of A. n The gradient of a vector x 2 R is the following: @f(x) (5xf(x))i = @xi The gradient of a function is only defined if that function is real-valued, that is it returns a real scalar value. Hessian n Given f : R ! R is a function that takes a vector and returns a real number. Then the Hessian of f with respect to x is a n × n matrix of partial derivatives as defined below. 2 2 @ f(x) (5xf(x))ij = @xi@xj Just like the gradient, the Hessian is only defined when the function is real-valued. For the purposes of this class, we will only be taking the Hessian of a vector. Common forms of Derivatives @(a>x) @(x>a) = = a @x @x @(x>Ax) = (A + A>)x @x @(a>Xb) = ab> @X @(a>X>b) = ba> @X @(a>Xa) @(a>X>a) = = aa> @X @X 4 @(x>A) = A @x @(x>) = I @x @(Ax) @x = A @z @z @(XY ) @Y @X = X + Y @z @z @z @(X−1) @X = −X−1 X−1 @z @z @ ln jXj > > = (X−1) = (X )−1 @X 9 Linear Regression To begin, the likelihood can be derived from a multivariate normal distribution. The likelihood for linear regression is given by: n 2 2 Y 2 P (Djβ; σ ) = P (yjX; β; σ ) = N (yijxi; β; σ ) i=1 2 − n 1 > = (2πσ ) 2 exp − (y − Xβ) (y − Xβ) 2σ2 By taking the log and throwing away constants, we get the negative log-likelihood below. n 1 − log P (Djβ; σ2) = log(σ2) + (y − Xβ)>(y − Xβ) 2 2σ2 We can now define the residual sum of squares or least squares. ky − Xβk = (y − Xβ)>(y − Xβ) Maximizing the likelihood is equivalent to minimizing the negative log likelihood and also equivalent to minimizing the residual sum of squares. You will also hear this being called finding the least squares solution. We can rewrite the expression as follows. ky − Xβk = (y − Xβ)>(y − Xβ) = y>y − 2(X>y)>β + β>X>Xβ 5 To find the minimum, we first have to take the derivative. Note, we need two matrix @x>Ax > @a>x > derivative identities @x = (A + A )x and @x = a. Also, note that X X is symmetric. @(y>y − 2(X>y)>β + β>X>Xβ) @β = −2(X>y) + (X>X + (X>X)>)β = −2(X>y) + 2X>Xβ After setting the derivation equal to zero and solving for β, we get the following. 0 = −2(X>y) + 2X>Xβ X>Xβ = X>y β = (X>X)−1X>y These are called the normal equations. To solve this in Octave/Matlab, you can im- plement the equations explicitly using the inverse. However, doing beta = X n y; is a more stable way of solving the normal equations. It does a QR decomposition. You can check that this solution is the global minimum and not just a stationary point. To do this, you need to evaluate the Hessian, or the second derivative. You should find that the result is a positive definite matrix. And since the Hessian is positive definite, the function is convex and thus the only stationary point is also the global minimum. 10 Ridge Regression Now, we’re going to derive ridge regression in a similar way. Recall that for linear regression, we found the MLE from forming the likelihood P (yjβ). Here, we can derive the MAP estimate from the posterior which is constructed from the likelihood and the 1 prior. Let β ∼ N (0; λ Ip) be a prior on the parameter vector β where Ip is an identity matrix of size p. The form of the posterior is given below. P (βjy) / P (yjβ)P (β) 1 /N (yjX; β; σ2)N (0; I ) λ p 6 Given that σ2 = 1, we first want to derive the posterior for β. P (βjy) / P (yjβ)P (β) 1 /N (yjX; β; σ2)N (0; I ) λ p 1 − 2 2 − n 1 > − p 1 1 > 1 −1 / (2πσ ) 2 exp − (y − Xβ) (y − Xβ) · (2π) 2 Ip exp − β ( Ip) β 2σ2 λ 2 λ 2 − n 1 > − p 1 − p λ > / (2πσ ) 2 exp − (y − Xβ) (y − Xβ) · (2π) 2 ( ) 2 exp − β I β 2σ2 λ 2 p 2 − n 1 > −1 − p λ > / (2πσ ) 2 exp − (y − Xβ) (y − Xβ) · (2πλ ) 2 exp − β β 2σ2 2 2 − n −1 − p 1 > λ > / (2πσ ) 2 (2πλ ) 2 exp − (y − Xβ) (y − Xβ) − β β 2σ2 2 Taking the negative log and dropping constants, we get: 1 λ / (y − Xβ)>(y − Xβ) + β>β 2σ2 2 / (y − Xβ)>(y − Xβ) + λβ>β Setting σ2 to 1 and dropping more constants. Now, since we wanted to maximize the posterior, we now need to minimize the negative log of the posterior. Note that minimizing the above expression is exactly the same as finding the ridge solution by minimizing the sum of squares plus the l2 penalty (Eq. 10.1).

Load more