
Sourya Dey

1 Notation

• Scalars are written as lower case letters.

• Vectors are written as lower case bold letters, such as x, and can be either row (dimensions 1 × n) or column (dimensions n × 1). Column vectors are the default choice, unless otherwise mentioned. Individual elements are indexed by subscripts, such as $x_i$ ($i \in \{1, \cdots, n\}$).

• Matrices are written as upper case bold letters, such as X, and have dimensions m × n corresponding to m rows and n columns. Individual elements are indexed by double subscripts for row and column, such as $X_{ij}$ ($i \in \{1, \cdots, m\}$, $j \in \{1, \cdots, n\}$).

• Occasionally higher order tensors occur, such as a 3rd order tensor with dimensions m × n × p, etc.

Note that a matrix is a 2nd order tensor. A row vector is a matrix with 1 row, and a column vector is a matrix with 1 column. A scalar is a matrix with 1 row and 1 column. Essentially, scalars and vectors are special cases of matrices.

The derivative of f with respect to x is $\frac{\partial f}{\partial x}$. Both x and f can be a scalar, vector, or matrix, leading to 9 types of derivatives. The gradient of f w.r.t. x is $\nabla_x f = \left(\frac{\partial f}{\partial x}\right)^T$, i.e. the gradient is the transpose of the derivative. The gradient at any point $x_0$ in the domain has a physical interpretation: its direction is the direction of maximum increase of the function f at the point $x_0$, and its magnitude is the rate of increase in that direction. We do not generally deal with the gradient when x is a scalar.

2 Basic Rules

This document follows the numerator layout convention. There is an alternative denominator layout convention, where several results are transposed. Do not mix different layout conventions.

We’ll first state the most general matrix-matrix derivative type. All other types are simplifications, since scalars and vectors are special cases of matrices. Consider a function F(·) which maps m × n matrices to p × q matrices, i.e. domain ⊂ $\mathbb{R}^{m \times n}$ and range ⊂ $\mathbb{R}^{p \times q}$. So, F(·): X → F(X), where X has dimensions m × n and F(X) has dimensions p × q. Its derivative $\frac{\partial F}{\partial X}$ is a 4th order tensor of dimensions p × q × n × m. This is an outer matrix of dimensions n × m (transposed dimensions of the denominator X), with each element being a p × q inner matrix (same dimensions as the numerator F). It is given as:

$$\frac{\partial F}{\partial X} = \begin{bmatrix} \frac{\partial F}{\partial X_{1,1}} & \cdots & \frac{\partial F}{\partial X_{m,1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial F}{\partial X_{1,n}} & \cdots & \frac{\partial F}{\partial X_{m,n}} \end{bmatrix} \tag{1a}$$

which has n rows and m columns, and the (i, j)th element is given as:

$$\frac{\partial F}{\partial X_{i,j}} = \begin{bmatrix} \frac{\partial F_{1,1}}{\partial X_{i,j}} & \cdots & \frac{\partial F_{1,q}}{\partial X_{i,j}} \\ \vdots & \ddots & \vdots \\ \frac{\partial F_{p,1}}{\partial X_{i,j}} & \cdots & \frac{\partial F_{p,q}}{\partial X_{i,j}} \end{bmatrix} \tag{1b}$$

which has p rows and q columns.

Whew! Now that that’s out of the way, let’s get to some general rules (for the following, x and y can represent scalar, vector or matrix):

• The derivative $\frac{\partial y}{\partial x}$ always has outer matrix dimensions = transposed dimensions of denominator x, and each individual element (inner matrix) has dimensions = same dimensions as numerator y. If you do a calculation and the dimension doesn't come out right, the answer is not correct.

• Derivatives usually obey the chain rule, i.e. $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \frac{\partial g(x)}{\partial x}$.

• Derivatives usually obey the product rule, i.e. $\frac{\partial f(x)g(x)}{\partial x} = f(x)\frac{\partial g(x)}{\partial x} + g(x)\frac{\partial f(x)}{\partial x}$ (a numerical sanity check is sketched below).
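A minimal numerical check of the dimension rule and the product rule, assuming NumPy; the test functions $f(x) = a^T x$ and $g(x) = b^T x$ are arbitrary choices for the demo:

```python
import numpy as np

def num_derivative(f, x, eps=1e-6):
    """Numerator-layout derivative of scalar f w.r.t. column vector x: a 1 x m row vector."""
    m = x.size
    d = np.zeros((1, m))
    for i in range(m):
        e = np.zeros(m); e[i] = eps
        d[0, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return d

rng = np.random.default_rng(0)
m = 4
a, b, x = rng.normal(size=m), rng.normal(size=m), rng.normal(size=m)

f = lambda v: a @ v          # scalar-valued
g = lambda v: b @ v          # scalar-valued
h = lambda v: f(v) * g(v)    # their product

lhs = num_derivative(h, x)                    # numerical d(fg)/dx
rhs = f(x) * b[None, :] + g(x) * a[None, :]   # f * dg/dx + g * df/dx
print(lhs.shape)                              # (1, 4): transposed dims of the denominator
print(np.allclose(lhs, rhs, atol=1e-8))       # True
```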

3 Types of derivatives

3.1 Scalar by scalar

Nothing special here. The derivative is a scalar, and can also be written as $f'(x)$. For example, if $f(x) = \sin x$, then $f'(x) = \cos x$.

3.2 Scalar by vector

f(·): x → f(x), where x has dimensions m × 1 and f(x) is a scalar. For this, the derivative is a 1 × m row vector:

$$\frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_m} \end{bmatrix} \tag{2}$$

The gradient $\nabla_x f$ is its transposed column vector.
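As a quick illustration of the "direction of maximum increase" interpretation from Section 1, here is a minimal sketch assuming NumPy, with the arbitrary choice $f(x) = x^T x$ (whose gradient is 2x):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: float(x @ x)    # f(x) = x^T x; derivative 2x^T (row), gradient 2x (column)
x0 = rng.normal(size=3)
grad = 2 * x0
eps = 1e-3

# Increase of f along the normalized gradient direction...
gain_grad = f(x0 + eps * grad / np.linalg.norm(grad)) - f(x0)

# ...is at least the increase along any random unit direction.
gains = []
for _ in range(1000):
    d = rng.normal(size=3)
    gains.append(f(x0 + eps * d / np.linalg.norm(d)) - f(x0))
print(gain_grad >= max(gains))  # True
```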

3.3 Vector by scalar

f(·): x → f(x), where x is a scalar and f(x) has dimensions n × 1. For this, the derivative is a n × 1 column vector:

$$\frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x} \\ \frac{\partial f_2}{\partial x} \\ \vdots \\ \frac{\partial f_n}{\partial x} \end{bmatrix} \tag{3}$$

3.4 Vector by vector

f(·): x → f(x), where x has dimensions m × 1 and f(x) has dimensions n × 1. The derivative, also known as the Jacobian, is a matrix of dimensions n × m. Its (i, j)th element is the scalar derivative of the ith output component w.r.t the jth input component, i.e.:

$$\frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial x_1} & \cdots & \frac{\partial f_n}{\partial x_m} \end{bmatrix} \tag{4}$$
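Equation (4) translates directly into a finite-difference routine. A minimal sketch, assuming NumPy and an arbitrary test function:

```python
import numpy as np

def num_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f: R^m -> R^n at x, shape n x m as in (4)."""
    n, m = f(x).size, x.size
    J = np.zeros((n, m))
    for j in range(m):
        e = np.zeros(m); e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# Arbitrary test function mapping R^3 -> R^2
f = lambda x: np.array([x[0] * x[1], np.sin(x[2])])
x = np.array([1.0, 2.0, 0.5])
J_analytic = np.array([[x[1], x[0], 0.0],
                       [0.0, 0.0, np.cos(x[2])]])
print(num_jacobian(f, x).shape)                                # (2, 3): n x m
print(np.allclose(num_jacobian(f, x), J_analytic, atol=1e-6))  # True
```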

3.4.1 Special case – Vectorized scalar function

This is a scalar-scalar function applied element-wise to a vector, and is denoted by f(·): x → f(x), where both x and f(x) have dimensions m × 1. For example:

$$f\left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \right) = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_m) \end{bmatrix} \tag{5}$$

In this case, both the derivative and gradient are the same m × m diagonal matrix, given as:

$$\nabla_x f = \frac{\partial f}{\partial x} = \begin{bmatrix} f'(x_1) & & & 0 \\ & f'(x_2) & & \\ & & \ddots & \\ 0 & & & f'(x_m) \end{bmatrix} \tag{6}$$

where $f'(x_i) = \frac{\partial f(x_i)}{\partial x_i}$.

Note: Some texts take the derivative of a vectorized scalar function by taking element-wise derivatives to get a m × 1 vector. To avoid confusion with (6), we will refer to this as $f'(x)$.

$$f'(x) = \begin{bmatrix} f'(x_1) \\ f'(x_2) \\ \vdots \\ f'(x_m) \end{bmatrix} \tag{7}$$

To realize the effect of this, let's say we want to multiply the gradient from (6) with some m-dimensional vector a. This would result in:

$$\nabla_x f \, a = \begin{bmatrix} f'(x_1)\, a_1 \\ f'(x_2)\, a_2 \\ \vdots \\ f'(x_m)\, a_m \end{bmatrix} \tag{8}$$

Achieving the same result with $f'(x)$ from (7) would require the Hadamard product ∘, defined as element-wise multiplication of 2 vectors:

$$f'(x) \circ a = \begin{bmatrix} f'(x_1)\, a_1 \\ f'(x_2)\, a_2 \\ \vdots \\ f'(x_m)\, a_m \end{bmatrix} \tag{9}$$
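The equivalence of (8) and (9) is easy to verify in code. A minimal sketch assuming NumPy, with the sigmoid as an arbitrary choice of vectorized scalar function:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # element-wise f'(x_i)

x = np.array([0.5, -1.0, 2.0])
a = np.array([1.0, 2.0, 3.0])

via_diag = np.diag(dsigmoid(x)) @ a   # gradient from (6) times a, as in (8)
via_hadamard = dsigmoid(x) * a        # f'(x) ∘ a, as in (9)
print(np.allclose(via_diag, via_hadamard))  # True
```

In practice the Hadamard form is usually preferred, since it avoids materializing the m × m diagonal matrix.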

3.4.2 Special Case – Hessian

Consider the type of function in Sec. 3.2, i.e. f(·): x → f(x), where x has dimensions m × 1 and f(x) is a scalar. Its gradient is a vector-to-vector function given as $\nabla_x f(\cdot)$: x → $\nabla_x f(x)$, where both input and output have dimensions m × 1. The transpose of its derivative is the Hessian:

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_m \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_m^2} \end{bmatrix} \tag{10}$$

i.e. $H = \left(\frac{\partial \nabla_x f}{\partial x}\right)^T$. If the second derivatives are continuous, then $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$, so the Hessian is symmetric.
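A numerical sketch of the Hessian as the transposed derivative of the gradient map, using nested central differences (NumPy assumed; the test function is an arbitrary choice):

```python
import numpy as np

def num_gradient(f, x, eps=1e-5):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def num_hessian(f, x, eps=1e-4):
    """Transpose of the derivative of the gradient map, as in (10)."""
    m = x.size
    H = np.zeros((m, m))
    for j in range(m):
        e = np.zeros(m); e[j] = eps
        H[:, j] = (num_gradient(f, x + e) - num_gradient(f, x - e)) / (2 * eps)
    return H.T

f = lambda x: x[0] ** 2 * x[1] + np.sin(x[1])
x = np.array([1.0, 0.5])
H_analytic = np.array([[2 * x[1], 2 * x[0]],
                       [2 * x[0], -np.sin(x[1])]])
H = num_hessian(f, x)
print(np.allclose(H, H_analytic, atol=1e-4))  # True
print(np.allclose(H, H.T, atol=1e-4))         # True: symmetric
```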

3.5 Scalar by matrix

f(·): X → f(X), where X has dimensions m × n and f(X) is a scalar. In this case, the derivative is a n × m matrix:

$$\frac{\partial f}{\partial X} = \begin{bmatrix} \frac{\partial f}{\partial X_{1,1}} & \cdots & \frac{\partial f}{\partial X_{m,1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{1,n}} & \cdots & \frac{\partial f}{\partial X_{m,n}} \end{bmatrix} \tag{11}$$

The gradient has the same dimensions as the input matrix, i.e. m × n.
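For instance, a finite-difference sketch of (11), assuming NumPy, with the arbitrary choice $f(X) = \sum_{i,j} X_{i,j}^2$ (whose derivative is $2X^T$ and gradient is 2X):

```python
import numpy as np

def num_matrix_derivative(f, X, eps=1e-6):
    """Derivative of scalar f w.r.t. m x n matrix X: an n x m matrix, as in (11)."""
    m, n = X.shape
    D = np.zeros((n, m))
    for i in range(m):
        for j in range(n):
            E = np.zeros((m, n)); E[i, j] = eps
            D[j, i] = (f(X + E) - f(X - E)) / (2 * eps)
    return D

f = lambda X: np.sum(X ** 2)
X = np.arange(6.0).reshape(2, 3)     # m = 2, n = 3
D = num_matrix_derivative(f, X)
print(D.shape)                       # (3, 2): transposed dims of X
print(np.allclose(D, 2 * X.T))       # True: derivative 2X^T, gradient 2X
```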

3.6 Matrix by scalar

F(·): x → F(x), where x is a scalar and F(x) has dimensions p × q. In this case, the derivative is a p × q matrix:

$$\frac{\partial F}{\partial x} = \begin{bmatrix} \frac{\partial F_{1,1}}{\partial x} & \cdots & \frac{\partial F_{1,q}}{\partial x} \\ \vdots & \ddots & \vdots \\ \frac{\partial F_{p,1}}{\partial x} & \cdots & \frac{\partial F_{p,q}}{\partial x} \end{bmatrix} \tag{12}$$

3.7 Vector by matrix

f(·): X → f(X), where X has dimensions m × n and f(X) has dimensions p × 1. In this case, the derivative is a 3rd-order tensor with dimensions p × n × m. This is the same n × m matrix as in (11), but with f replaced by the p-dimensional vector f, i.e.:

$$\frac{\partial f}{\partial X} = \begin{bmatrix} \frac{\partial f}{\partial X_{1,1}} & \cdots & \frac{\partial f}{\partial X_{m,1}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{1,n}} & \cdots & \frac{\partial f}{\partial X_{m,n}} \end{bmatrix} \tag{13}$$

3.8 Matrix by vector

F(·): x → F(x), where x has dimensions m × 1 and F(x) has dimensions p × q. In this case, the derivative is a 3rd-order tensor with dimensions p × q × m. This is the same 1 × m row vector as in (2), but with f replaced by the p × q matrix F, i.e.:

$$\frac{\partial F}{\partial x} = \begin{bmatrix} \frac{\partial F}{\partial x_1} & \frac{\partial F}{\partial x_2} & \cdots & \frac{\partial F}{\partial x_m} \end{bmatrix} \tag{14}$$

4 Operations and Examples

4.1 Commutation

If things normally don’t commute (such as for matrices, AB ≠ BA), then order should be maintained when taking derivatives. If things normally commute (such as for vector inner products, a · b = b · a), their order can be switched when taking derivatives. Output dimensions must always come out right.

For example, let $f(x) = (a^T x)\, b$, where $a^T$ is 1 × m, x is m × 1, and b is n × 1. The derivative $\frac{\partial f}{\partial x}$ should be a n × m matrix. Keeping order fixed, we get $\frac{\partial f}{\partial x} = a^T \frac{\partial x}{\partial x} b = a^T I b = a^T b$. This is a scalar, which is wrong! The solution? Note that $a^T x$ is a scalar, which can sit either to the right or the left of vector b, i.e. ordering doesn't really matter. Rewriting $f(x) = b\, a^T x$, we get $\frac{\partial f}{\partial x} = b a^T \frac{\partial x}{\partial x} = b a^T I = b a^T$, which is the correct n × m matrix.

If this seems confusing, it might be useful to take a simple example with low values for m and n, and write out the full derivative in matrix form as shown in (4). The resulting matrix will be baT .
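That sanity check can also be run numerically, comparing the finite-difference Jacobian of $f(x) = (a^T x)\, b$ against $b a^T$ (a sketch assuming NumPy, with arbitrary m = 3, n = 2):

```python
import numpy as np

def num_jacobian(f, x, eps=1e-6):
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
m, n = 3, 2
a = rng.normal(size=m)          # a is m x 1
b = rng.normal(size=n)          # b is n x 1
f = lambda x: (a @ x) * b       # f(x) = (a^T x) b, an n-vector

x = rng.normal(size=m)
print(np.allclose(num_jacobian(f, x), np.outer(b, a)))  # True: df/dx = b a^T
```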

4.2 Derivative of a transposed vector

The derivative of a transposed vector w.r.t. itself is the identity matrix, but the transpose gets applied to everything after. For example, let $f(w) = (y - w^T x)^2 = y^2 - (w^T x)\, y - y\, (w^T x) + (w^T x)(w^T x)$, where y and x are not a function of w. Taking the derivative of the terms individually:

• $\frac{\partial y^2}{\partial w} = 0^T$, i.e. a row vector of all 0s.

• $\frac{\partial (w^T x)\, y}{\partial w} = \frac{\partial w^T}{\partial w}\, xy = (xy)^T = y^T x^T$. Since y is a scalar, this is simply $y x^T$.

• $\frac{\partial y\, (w^T x)}{\partial w} = y\, \frac{\partial w^T}{\partial w}\, x = y x^T$

• $\frac{\partial (w^T x)(w^T x)}{\partial w} = \frac{\partial w^T}{\partial w}\, x\, (w^T x) + (w^T x)\, \frac{\partial w^T}{\partial w}\, x = (x^T w)\, x^T + (w^T x)\, x^T$. Since vector inner products commute, this is $2\, (w^T x)\, x^T$.

So $\frac{\partial f}{\partial w} = -2y x^T + 2\,(w^T x)\, x^T$.
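A quick numerical check of this result (a sketch assuming NumPy; the values of x, w and y are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=4)
w = rng.normal(size=4)
y = 1.5
f = lambda w: (y - w @ x) ** 2

eps = 1e-6
num = np.array([(f(w + eps * np.eye(4)[i]) - f(w - eps * np.eye(4)[i])) / (2 * eps)
                for i in range(4)])             # numerical df/dw, stored as a flat array
analytic = -2 * y * x + 2 * (w @ x) * x         # -2y x^T + 2 (w^T x) x^T
print(np.allclose(num, analytic, atol=1e-6))    # True
```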

4.3 Dealing with tensors

A tensor of dimensions p × q × n × m (such as given in (1)) can be pre- and post-multiplied by vectors just like an ordinary matrix. These vectors must be compatible with the inner matrices of dimensions p × q, i.e. for each inner matrix, pre-multiply with a 1 × p row vector and post-multiply with a q × 1 column vector to get a scalar. This gives a final matrix of dimensions n × m.

Example: $f(W) = a^T W b$, where $a^T$ is 1 × m, W is m × n, and b is n × 1. This is a scalar, so $\frac{\partial f}{\partial W}$ should be a matrix which has transposed dimensions as W, i.e. n × m. Now, $\frac{\partial f}{\partial W} = a^T \frac{\partial W}{\partial W} b$, where $\frac{\partial W}{\partial W}$ has dimensions m × n × n × m. For example if m = 3, n = 2, then:

1 0 0 0 0 0 0 0 1 0 0 0   ∂W  0 0 0 0 1 0    =   (15) ∂W 0 1 0 0 0 0   0 0 0 1 0 0 0 0 0 0 0 1

Note that the (i, j)th inner matrix has a 1 in its (j, i)th position. Pre- and post-multiplying the (i, j)th inner matrix with $a^T$ and b gives $a_j b_i$, where i ∈ {1, 2} and j ∈ {1, 2, 3}. So:

$$a^T \frac{\partial W}{\partial W} b = \begin{bmatrix} a_1 b_1 & a_2 b_1 & a_3 b_1 \\ a_1 b_2 & a_2 b_2 & a_3 b_2 \end{bmatrix} \tag{16}$$

Thus, $\frac{\partial f}{\partial W} = b a^T$.
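The same result can be confirmed numerically by perturbing one element of W at a time (a sketch assuming NumPy, with m = 3, n = 2 as above):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 2
a, b = rng.normal(size=m), rng.normal(size=n)
W = rng.normal(size=(m, n))
f = lambda W: a @ W @ b                  # scalar

eps = 1e-6
D = np.zeros((n, m))                     # derivative: transposed dims of W
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n)); E[i, j] = eps
        D[j, i] = (f(W + E) - f(W - E)) / (2 * eps)
print(np.allclose(D, np.outer(b, a)))    # True: df/dW = b a^T
```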

4.4 Gradient Example: L2 Norm

Problem: Given $f(x) = \|x - a\|_2$, find $\nabla_x f$.

Note that $\|x - a\|_2 = \sqrt{(x-a)^T (x-a)}$, which is a scalar. So the derivative will be a row vector and the gradient will be a column vector of the same dimension as x. Let's use the chain rule:

$$\frac{\partial f}{\partial x} = \frac{\partial \sqrt{(x-a)^T (x-a)}}{\partial\, (x-a)^T (x-a)} \times \frac{\partial\, (x-a)^T (x-a)}{\partial x} \tag{17}$$

The first term is a scalar-scalar derivative equal to $\frac{1}{2\sqrt{(x-a)^T (x-a)}}$. The second term is:

$$\begin{aligned} \frac{\partial\, (x-a)^T (x-a)}{\partial x} &= \frac{\partial \left( x^T x - a^T x - x^T a + a^T a \right)}{\partial x} \\ &= \left( x^T + x^T \right) - a^T - a^T + 0^T \\ &= 2\left( x^T - a^T \right) \end{aligned} \tag{18}$$

So $\frac{\partial f}{\partial x} = \frac{x^T - a^T}{\sqrt{(x-a)^T (x-a)}}$.

So $\nabla_x f = \frac{x - a}{\|x - a\|_2}$, which is basically the unit displacement vector from a to x. This means that to get maximum increase in f(x), one should move away from a along the straight line joining a and x. Alternatively, to get maximum decrease in f(x), one should move from x directly towards a, which makes sense geometrically.
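As a closing sketch (NumPy assumed; a and x are arbitrary), the finite-difference gradient indeed matches the unit vector from a to x:

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(size=3)
x = rng.normal(size=3)
f = lambda x: np.linalg.norm(x - a)      # f(x) = ||x - a||_2

eps = 1e-6
grad = np.array([(f(x + eps * np.eye(3)[i]) - f(x - eps * np.eye(3)[i])) / (2 * eps)
                 for i in range(3)])
unit = (x - a) / np.linalg.norm(x - a)
print(np.allclose(grad, unit, atol=1e-6))                # True: gradient is the unit vector
print(np.isclose(np.linalg.norm(grad), 1.0, atol=1e-6))  # with unit magnitude
```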

5 Notes and Further Reading

The chain rule and product rule do not always hold when dealing with matrices. However, some modified forms can hold when using the $\mathrm{Trace}(\cdot)$ function. For a full list of derivatives, the reader should consult a textbook or websites such as Wikipedia's page on matrix calculus. Keep in mind that some texts may use the denominator layout convention, where results will look different.
