
Matrix Calculus
Sourya Dey

1 Notation

• Scalars are written as lowercase letters.

• Vectors are written as lowercase bold letters, such as x, and can be either row vectors (dimensions 1 × n) or column vectors (dimensions n × 1). Column vectors are the default choice unless otherwise mentioned. Individual elements are indexed by subscripts, such as x_i (i ∈ {1, …, n}).

• Matrices are written as uppercase bold letters, such as X, and have dimensions m × n, corresponding to m rows and n columns. Individual elements are indexed by double subscripts for row and column, such as X_ij (i ∈ {1, …, m}, j ∈ {1, …, n}).

• Occasionally higher-order tensors occur, such as a 3rd-order tensor with dimensions m × n × p, etc. Note that a matrix is a 2nd-order tensor, a row vector is a matrix with 1 row, and a column vector is a matrix with 1 column. A scalar is a matrix with 1 row and 1 column. Essentially, scalars and vectors are special cases of matrices.

The derivative of f with respect to x is ∂f/∂x. Both x and f can each be a scalar, vector, or matrix, leading to 9 types of derivatives. The gradient of f w.r.t. x is ∇_x f = (∂f/∂x)ᵀ, i.e. the gradient is the transpose of the derivative. The gradient at any point x₀ in the domain has a physical interpretation: its direction is the direction of maximum increase of the function f at x₀, and its magnitude is the rate of increase in that direction. We do not generally deal with the gradient when x is a scalar.

2 Basic Rules

This document follows the numerator layout convention. There is an alternative denominator layout convention, in which several results are transposed. Do not mix different layout conventions.

We'll first state the most general matrix-by-matrix derivative type. All other types are simplifications, since scalars and vectors are special cases of matrices. Consider a function F(·) which maps m × n matrices to p × q matrices, i.e. domain ⊂ ℝ^(m×n) and range ⊂ ℝ^(p×q). So F(·): X → F(X), where X is m × n and F(X) is p × q.
Its derivative ∂F/∂X is a 4th-order tensor of dimensions p × q × n × m. This is an outer matrix of dimensions n × m (transposed dimensions of the denominator X), with each element being a p × q inner matrix (same dimensions as the numerator F). It is given as:

$$
\frac{\partial \mathbf{F}}{\partial \mathbf{X}} =
\begin{bmatrix}
\dfrac{\partial \mathbf{F}}{\partial X_{1,1}} & \cdots & \dfrac{\partial \mathbf{F}}{\partial X_{m,1}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial \mathbf{F}}{\partial X_{1,n}} & \cdots & \dfrac{\partial \mathbf{F}}{\partial X_{m,n}}
\end{bmatrix}
\tag{1a}
$$

which has n rows and m columns, and each element ∂F/∂X_{i,j} is the inner matrix given as:

$$
\frac{\partial \mathbf{F}}{\partial X_{i,j}} =
\begin{bmatrix}
\dfrac{\partial F_{1,1}}{\partial X_{i,j}} & \cdots & \dfrac{\partial F_{1,q}}{\partial X_{i,j}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial F_{p,1}}{\partial X_{i,j}} & \cdots & \dfrac{\partial F_{p,q}}{\partial X_{i,j}}
\end{bmatrix}
\tag{1b}
$$

which has p rows and q columns. Whew! Now that that's out of the way, let's get to some general rules (in the following, x and y can each represent a scalar, vector, or matrix):

• The derivative ∂y/∂x always has outer matrix dimensions equal to the transposed dimensions of the denominator x, and each individual element (inner matrix) has dimensions equal to those of the numerator y. If you do a calculation and the dimensions don't come out right, the answer is not correct.

• Derivatives usually obey the chain rule, i.e. ∂f(g(x))/∂x = [∂f(g(x))/∂g(x)] [∂g(x)/∂x].

• Derivatives usually obey the product rule, i.e. ∂[f(x)g(x)]/∂x = f(x) [∂g(x)/∂x] + [∂f(x)/∂x] g(x).

3 Types of derivatives

3.1 Scalar by scalar

Nothing special here. The derivative is a scalar, and can also be written as f′(x). For example, if f(x) = sin x, then f′(x) = cos x.

3.2 Scalar by vector

f(·): x → f(x), where x is m × 1 and f(x) is a scalar. The derivative is a 1 × m row vector:

$$
\frac{\partial f}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_m}
\end{bmatrix}
\tag{2}
$$

The gradient ∇_x f is its transpose, an m × 1 column vector.

3.3 Vector by scalar

f(·): x → f(x), where x is a scalar and f(x) is n × 1. The derivative is an n × 1 column vector:

$$
\frac{\partial \mathbf{f}}{\partial x} =
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x} \\
\dfrac{\partial f_2}{\partial x} \\
\vdots \\
\dfrac{\partial f_n}{\partial x}
\end{bmatrix}
\tag{3}
$$

3.4 Vector by vector

f(·): x → f(x), where x is m × 1 and f(x) is n × 1. The derivative, also known as the Jacobian, is a matrix of dimensions n × m.
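The n × m shape of the Jacobian can be checked numerically. Below is a minimal sketch using NumPy and forward finite differences; the example function `f` and the helper `jacobian_fd` are illustrative choices, not from the text.

```python
import numpy as np

def f(x):
    # Example vector-to-vector function (an assumption for illustration):
    # maps a 3-vector to a 2-vector, so m = 3 inputs and n = 2 outputs.
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian_fd(f, x, eps=1e-6):
    """Finite-difference Jacobian in numerator layout: row i is output
    component i, column j is input component j, so the result is n x m."""
    fx = f(x)
    n, m = fx.size, x.size
    J = np.zeros((n, m))
    for j in range(m):
        dx = np.zeros(m)
        dx[j] = eps
        J[:, j] = (f(x + dx) - fx) / eps
    return J

x0 = np.array([1.0, 2.0, 0.5])
J = jacobian_fd(f, x0)
print(J.shape)  # (2, 3): n rows (outputs) by m columns (inputs)
```

Here J[0, 0] ≈ x₂ = 2 and J[1, 2] ≈ cos(0.5), matching the analytic entries ∂f_i/∂x_j.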
Its (i, j)th element is the scalar derivative of the ith output component w.r.t. the jth input component, i.e.:

$$
\frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_m} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_n}{\partial x_1} & \cdots & \dfrac{\partial f_n}{\partial x_m}
\end{bmatrix}
\tag{4}
$$

3.4.1 Special case – Vectorized scalar function

This is a scalar-to-scalar function applied element-wise to a vector, denoted by f(·): x → f(x), where both x and f(x) are m × 1. For example:

$$
f\left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} \right) =
\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_m) \end{bmatrix}
\tag{5}
$$

In this case, both the derivative and the gradient are the same m × m diagonal matrix, given as:

$$
\nabla_{\mathbf{x}} f = \frac{\partial f}{\partial \mathbf{x}} =
\begin{bmatrix}
f'(x_1) & & & 0 \\
& f'(x_2) & & \\
& & \ddots & \\
0 & & & f'(x_m)
\end{bmatrix}
\tag{6}
$$

where f′(x_i) = ∂f(x_i)/∂x_i.

Note: Some texts take the derivative of a vectorized scalar function by taking element-wise derivatives to get an m × 1 vector. To avoid confusion with (6), we will refer to this as f′(x):

$$
f'(\mathbf{x}) =
\begin{bmatrix} f'(x_1) \\ f'(x_2) \\ \vdots \\ f'(x_m) \end{bmatrix}
\tag{7}
$$

To see the effect of this, let's say we want to multiply the gradient from (6) with some m-dimensional vector a. This results in:

$$
\nabla_{\mathbf{x}} f \, \mathbf{a} =
\begin{bmatrix} f'(x_1)\,a_1 \\ f'(x_2)\,a_2 \\ \vdots \\ f'(x_m)\,a_m \end{bmatrix}
\tag{8}
$$

Achieving the same result with f′(x) from (7) requires the Hadamard product ∘, defined as element-wise multiplication of two vectors:

$$
f'(\mathbf{x}) \circ \mathbf{a} =
\begin{bmatrix} f'(x_1)\,a_1 \\ f'(x_2)\,a_2 \\ \vdots \\ f'(x_m)\,a_m \end{bmatrix}
\tag{9}
$$

3.4.2 Special case – Hessian

Consider the type of function in Sec. 3.2, i.e. f(·): x → f(x), with x of dimensions m × 1 and scalar output. Its gradient is a vector-to-vector function ∇_x f(·): x → ∇_x f(x), mapping m × 1 to m × 1. The transpose of its derivative is the Hessian:

$$
\mathbf{H} =
\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_m} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial^2 f}{\partial x_m \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_m^2}
\end{bmatrix}
\tag{10}
$$

i.e. H = (∂∇_x f / ∂x)ᵀ. If the derivatives are continuous, then ∂²f/∂x_i∂x_j = ∂²f/∂x_j∂x_i, so the Hessian is symmetric.

3.5 Scalar by matrix

f(·): X → f(X), where X is m × n and f(X) is a scalar. In this case, the derivative is an n × m matrix:

$$
\frac{\partial f}{\partial \mathbf{X}} =
\begin{bmatrix}
\dfrac{\partial f}{\partial X_{1,1}} & \cdots & \dfrac{\partial f}{\partial X_{m,1}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f}{\partial X_{1,n}} & \cdots & \dfrac{\partial f}{\partial X_{m,n}}
\end{bmatrix}
\tag{11}
$$
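The transposed-dimensions rule for the scalar-by-matrix case can also be verified numerically. A minimal sketch, assuming the example function f(X) = Σ X²_ij (not from the text), whose derivative in numerator layout is 2 Xᵀ:

```python
import numpy as np

def f(X):
    # Example scalar-by-matrix function (an assumption for illustration):
    # sum of squares of all entries; analytic derivative is 2 * X.T.
    return np.sum(X ** 2)

def derivative_fd(f, X, eps=1e-6):
    """Finite-difference derivative in numerator layout: the entry in
    row j, column i is df/dX[i, j], giving an n x m result for m x n input."""
    m, n = X.shape
    D = np.zeros((n, m))
    for i in range(m):
        for j in range(n):
            dX = np.zeros((m, n))
            dX[i, j] = eps
            D[j, i] = (f(X + dX) - f(X)) / eps
    return D

X = np.arange(6, dtype=float).reshape(2, 3)  # m = 2, n = 3
D = derivative_fd(f, X)
print(D.shape)                               # (3, 2): transposed dimensions of X
print(np.allclose(D, 2 * X.T, atol=1e-4))    # matches the analytic result
```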
The gradient has the same dimensions as the input matrix, i.e. m × n.

3.6 Matrix by scalar

F(·): x → F(x), where x is a scalar and F(x) is p × q. In this case, the derivative is a p × q matrix:

$$
\frac{\partial \mathbf{F}}{\partial x} =
\begin{bmatrix}
\dfrac{\partial F_{1,1}}{\partial x} & \cdots & \dfrac{\partial F_{1,q}}{\partial x} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial F_{p,1}}{\partial x} & \cdots & \dfrac{\partial F_{p,q}}{\partial x}
\end{bmatrix}
\tag{12}
$$

3.7 Vector by matrix

f(·): X → f(X), where X is m × n and f(X) is p × 1. In this case, the derivative is a 3rd-order tensor of dimensions p × n × m. This is the same n × m outer matrix as in (11), but with the scalar f replaced by the p-dimensional vector f, i.e.:

$$
\frac{\partial \mathbf{f}}{\partial \mathbf{X}} =
\begin{bmatrix}
\dfrac{\partial \mathbf{f}}{\partial X_{1,1}} & \cdots & \dfrac{\partial \mathbf{f}}{\partial X_{m,1}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial \mathbf{f}}{\partial X_{1,n}} & \cdots & \dfrac{\partial \mathbf{f}}{\partial X_{m,n}}
\end{bmatrix}
\tag{13}
$$

3.8 Matrix by vector

F(·): x → F(x), where x is m × 1 and F(x) is p × q. In this case, the derivative is a 3rd-order tensor of dimensions p × q × m. This is the same 1 × m outer row vector as in (2), but with the scalar f replaced by the p × q matrix F, i.e.:

$$
\frac{\partial \mathbf{F}}{\partial \mathbf{x}} =
\begin{bmatrix}
\dfrac{\partial \mathbf{F}}{\partial x_1} & \dfrac{\partial \mathbf{F}}{\partial x_2} & \cdots & \dfrac{\partial \mathbf{F}}{\partial x_m}
\end{bmatrix}
\tag{14}
$$

4 Operations and Examples

4.1 Commutation

If things normally don't commute (such as matrices, where in general AB ≠ BA), then their order must be maintained when taking derivatives. If things normally commute (such as the vector inner product, a·b = b·a), their order can be switched when taking derivatives. Output dimensions must always come out right.

For example, let f(x) = (aᵀx) b, where x and a are m × 1 and b is n × 1. The derivative ∂f/∂x should be an n × m matrix. Since aᵀx is a scalar, it commutes with b, so we can rewrite f(x) = b (aᵀx) = (b aᵀ) x. Keeping order fixed from here, we get ∂f/∂x = b aᵀ (∂x/∂x) = b aᵀ I = b aᵀ, which has the required dimensions n × m.
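The commutation example above can be confirmed numerically. A minimal sketch (the random vectors and the `jacobian_fd` helper are illustrative choices, not from the text), checking that the Jacobian of f(x) = (aᵀx) b equals the outer product b aᵀ:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
a = rng.standard_normal(m)  # m x 1
b = rng.standard_normal(n)  # n x 1

def f(x):
    # f(x) = (a^T x) b: the scalar inner product a^T x scales the n-vector b.
    return (a @ x) * b

def jacobian_fd(f, x, eps=1e-6):
    # Finite-difference Jacobian: column j is the change in f per unit x_j.
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        dx = np.zeros(x.size)
        dx[j] = eps
        J[:, j] = (f(x + dx) - fx) / eps
    return J

x0 = rng.standard_normal(m)
J = jacobian_fd(f, x0)
print(J.shape)                                    # (3, 4), i.e. n x m
print(np.allclose(J, np.outer(b, a), atol=1e-4))  # matches b a^T
```

Because f is linear in x, the finite-difference Jacobian agrees with b aᵀ to within floating-point error at any point x₀.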