
Vector/Matrix Calculus

In neural networks, we often encounter problems with functions of several variables. Vector/matrix calculus extends the calculus of one variable to that of a vector or a matrix of variables.

Vector gradient: Let g(w) be a differentiable function of m variables, where
\[
\mathbf{w} = [w_1, \ldots, w_m]^T.
\]
Then the vector gradient of g(w) w.r.t. w is the m-dimensional vector of partial derivatives of g:
\[
\frac{\partial g}{\partial \mathbf{w}} = \nabla g = \nabla_{\mathbf{w}} g =
\begin{bmatrix}
\frac{\partial g}{\partial w_1} \\
\vdots \\
\frac{\partial g}{\partial w_m}
\end{bmatrix}.
\]
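To make the definition concrete, here is a minimal numerical sketch, assuming NumPy; the helper num_gradient, the function g, the point w, and the step size h are all illustrative choices, not part of the notes.

import numpy as np

def num_gradient(g, w, h=1e-6):
    # Central-difference approximation: one partial derivative per coordinate,
    # stacked into an m-dimensional vector, as in the definition above.
    return np.array([(g(w + h * e) - g(w - h * e)) / (2 * h)
                     for e in np.eye(len(w))])

g = lambda w: w[0]**2 + 3 * w[0] * w[1]   # scalar function of m = 2 variables
w = np.array([1.0, 2.0])
print(num_gradient(g, w))                 # approx. [2*w1 + 3*w2, 3*w1] = [8, 3]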

Similarly, we can define a second-order gradient. The Hessian matrix is defined as
\[
\frac{\partial^2 g}{\partial \mathbf{w}^2} =
\begin{bmatrix}
\frac{\partial^2 g}{\partial w_1^2} & \cdots & \frac{\partial^2 g}{\partial w_1 \partial w_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 g}{\partial w_m \partial w_1} & \cdots & \frac{\partial^2 g}{\partial w_m^2}
\end{bmatrix}.
\]

Jacobian matrix: Generalization to vector-valued functions
\[
\mathbf{g}(\mathbf{w}) = [g_1(\mathbf{w}), \ldots, g_n(\mathbf{w})]^T
\]
leads to the definition of the Jacobian matrix of g w.r.t. w:
\[
\frac{\partial \mathbf{g}}{\partial \mathbf{w}} =
\begin{bmatrix}
\frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_1} \\
\vdots & \ddots & \vdots \\
\frac{\partial g_1}{\partial w_m} & \cdots & \frac{\partial g_n}{\partial w_m}
\end{bmatrix}.
\]
In this convention the columns of the Jacobian matrix are the gradients of the corresponding component functions g_i(w) w.r.t. the vector w.
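The column convention can be checked numerically as well; this sketch reuses the hypothetical num_gradient helper from the previous snippet, with illustrative component functions g1 and g2.

g1 = lambda w: w[0] * w[1]        # first component function
g2 = lambda w: w[0] + w[1]**2     # second component function
w = np.array([1.0, 2.0])

# Column j of the Jacobian is the gradient of the j-th component function.
J = np.column_stack([num_gradient(g1, w), num_gradient(g2, w)])
print(J)                          # approx. [[w2, 1], [w1, 2*w2]] = [[2, 1], [1, 4]]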

Differentiation Rules:

The differentiation rules are analogous to those for ordinary functions of one variable; a numerical check of the first rule follows the list:

• Product rule:
\[
\frac{\partial f(\mathbf{w}) g(\mathbf{w})}{\partial \mathbf{w}} =
\frac{\partial f(\mathbf{w})}{\partial \mathbf{w}}\, g(\mathbf{w}) +
f(\mathbf{w})\, \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}
\]

• Quotient rule:
\[
\frac{\partial f(\mathbf{w})/g(\mathbf{w})}{\partial \mathbf{w}} =
\frac{\frac{\partial f(\mathbf{w})}{\partial \mathbf{w}}\, g(\mathbf{w}) -
f(\mathbf{w})\, \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}}{g^2(\mathbf{w})}
\]

• Chain rule:
\[
\frac{\partial f(g(\mathbf{w}))}{\partial \mathbf{w}} =
f'(g(\mathbf{w}))\, \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}
\]
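A sketch of the promised check, reusing the hypothetical num_gradient helper from the first snippet; f and g are arbitrary illustrative choices.

f = lambda w: w[0]**2 + w[1]
g = lambda w: 3 * w[0] + w[1]**2
w = np.array([1.0, 2.0])

# Left: gradient of the product. Right: the product rule stated above.
lhs = num_gradient(lambda v: f(v) * g(v), w)
rhs = num_gradient(f, w) * g(w) + f(w) * num_gradient(g, w)
print(np.allclose(lhs, rhs, atol=1e-4))   # True up to finite-difference error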

Example: Consider
\[
g(\mathbf{w}) = \sum_{i=1}^{m} a_i w_i = \mathbf{a}^T \mathbf{w},
\]
where a is a constant vector. Thus
\[
\frac{\partial g}{\partial \mathbf{w}} =
\begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix},
\]
or in vector notation
\[
\frac{\partial \mathbf{a}^T \mathbf{w}}{\partial \mathbf{w}} = \mathbf{a}.
\]
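The identity can be confirmed numerically; a self-contained sketch assuming NumPy, with an illustrative a and w.

import numpy as np

a = np.array([2.0, -1.0, 4.0])      # constant vector
g = lambda w: a @ w                 # g(w) = a^T w
w = np.array([0.5, 1.5, -2.0])
h = 1e-6

grad = np.array([(g(w + h * e) - g(w - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(grad, a))         # True: the gradient of a^T w is a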

Example: Let \(\mathbf{w} = (w_1, w_2, w_3)^T \in \mathbb{R}^3\) and

\[
g(\mathbf{w}) = 2w_1 + 5w_2 + 12w_3 =
\begin{bmatrix} 2 & 5 & 12 \end{bmatrix} \mathbf{w}.
\]
Because
\[
\frac{\partial g}{\partial w_1} = \frac{\partial}{\partial w_1}(2w_1 + 5w_2 + 12w_3) = 2 + 0 + 0 = 2,
\]
and similarly for the other components, we have
\[
\frac{\partial g}{\partial \mathbf{w}} =
\begin{bmatrix} 2 \\ 5 \\ 12 \end{bmatrix}.
\]

Example: Consider
\[
g(\mathbf{w}) = \sum_{i=1}^{m}\sum_{j=1}^{m} a_{ij} w_i w_j = \mathbf{w}^T A \mathbf{w},
\]
where A is a constant square matrix. Thus
\[
\frac{\partial g}{\partial \mathbf{w}} =
\begin{bmatrix}
\sum_{j=1}^{m} w_j a_{1j} + \sum_{i=1}^{m} w_i a_{i1} \\
\vdots \\
\sum_{j=1}^{m} w_j a_{mj} + \sum_{i=1}^{m} w_i a_{im}
\end{bmatrix},
\]
and so in vector notation
\[
\frac{\partial \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}} = A\mathbf{w} + A^T\mathbf{w}.
\]

The Hessian of g(w) is
\[
\frac{\partial^2 \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}^2} =
\begin{bmatrix}
2a_{11} & \cdots & a_{1m} + a_{m1} \\
\vdots & \ddots & \vdots \\
a_{m1} + a_{1m} & \cdots & 2a_{mm}
\end{bmatrix},
\]
which equals \(A + A^T\).
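Both the gradient A w + Aᵀw and the Hessian A + Aᵀ can be checked numerically; a sketch assuming NumPy, with a random illustrative A and w.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))     # constant square matrix
w = rng.standard_normal(4)
h = 1e-5

g = lambda w: w @ A @ w             # g(w) = w^T A w
grad = np.array([(g(w + h * e) - g(w - h * e)) / (2 * h) for e in np.eye(4)])
print(np.allclose(grad, A @ w + A.T @ w, atol=1e-6))    # gradient: A w + A^T w

# The analytic gradient is linear in w, so differencing it recovers the Hessian.
grad_fn = lambda w: A @ w + A.T @ w
H = np.array([(grad_fn(w + h * e) - grad_fn(w - h * e)) / (2 * h)
              for e in np.eye(4)])
print(np.allclose(H, A + A.T, atol=1e-6))               # Hessian: A + A^T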

Example: Let \(\mathbf{w} = (w_1, w_2)^T \in \mathbb{R}^2\) and

\[
g(\mathbf{w}) = 3w_1 w_1 + 2w_1 w_2 + 6w_2 w_1 + 5w_2 w_2
= \begin{bmatrix} w_1 & w_2 \end{bmatrix}
\begin{bmatrix} 3 & 2 \\ 6 & 5 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.
\]
Then
\[
\frac{\partial g}{\partial w_1} = \frac{\partial}{\partial w_1}(3w_1 w_1 + 2w_1 w_2 + 6w_2 w_1 + 5w_2 w_2)
= 3 \cdot 2w_1 + 2w_2 + 6w_2 + 0 = 6w_1 + 8w_2,
\]
\[
\frac{\partial g}{\partial w_2} = 0 + 2w_1 + 6w_1 + 5 \cdot 2w_2 = 8w_1 + 10w_2,
\]
and so in vector notation
\[
\frac{\partial \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}}
= \begin{bmatrix} 3 & 2 \\ 6 & 5 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}
+ \begin{bmatrix} 3 & 6 \\ 2 & 5 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}
= \begin{bmatrix} 6 & 8 \\ 8 & 10 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.
\]

The Hessian of g(w) is
\[
\frac{\partial^2 \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}^2}
= \frac{\partial}{\partial \mathbf{w}}\left(\frac{\partial \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}}\right)
= \begin{bmatrix} 2 \cdot 3 & 2 + 6 \\ 6 + 2 & 2 \cdot 5 \end{bmatrix}
= \begin{bmatrix} 6 & 8 \\ 8 & 10 \end{bmatrix}.
\]
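For the concrete A above, two lines of NumPy reproduce the hand computation.

import numpy as np

A = np.array([[3.0, 2.0], [6.0, 5.0]])
w = np.array([1.0, 2.0])

print(A @ w + A.T @ w)   # [22., 28.] = [6*w1 + 8*w2, 8*w1 + 10*w2] at w = (1, 2)
print(A + A.T)           # [[6., 8.], [8., 10.]], the Hessian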

Matrix Gradient: Consider a scalar-valued function g(W) of the m × n matrix \(\mathbf{W} = \{w_{ij}\}\) (e.g. the trace or the determinant of a matrix). The matrix gradient w.r.t. W is a matrix of the same dimension as W, consisting of the partial derivatives of g(W) w.r.t. the components of W:
\[
\frac{\partial g}{\partial \mathbf{W}} =
\begin{bmatrix}
\frac{\partial g}{\partial w_{11}} & \cdots & \frac{\partial g}{\partial w_{1n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial g}{\partial w_{m1}} & \cdots & \frac{\partial g}{\partial w_{mn}}
\end{bmatrix}.
\]

Example: If \(g(\mathbf{W}) = \mathrm{tr}(\mathbf{W})\), then \(\frac{\partial g}{\partial \mathbf{W}} = \mathbf{I}\).

Example: Consider the matrix function
\[
g(\mathbf{W}) = \sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij} a_i a_j = \mathbf{a}^T \mathbf{W} \mathbf{a},
\]
i.e. assume that a is a constant vector, whereas W is a matrix of variables. Taking the gradient w.r.t. W yields \(\frac{\partial \mathbf{a}^T \mathbf{W} \mathbf{a}}{\partial w_{ij}} = a_i a_j\). Thus, in matrix form,
\[
\frac{\partial \mathbf{a}^T \mathbf{W} \mathbf{a}}{\partial \mathbf{W}} = \mathbf{a}\mathbf{a}^T.
\]
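The entrywise formula can be verified by perturbing one entry of W at a time; a sketch assuming NumPy, with an illustrative a and test point W.

import numpy as np

a = np.array([1.0, -2.0, 0.5])            # constant vector
W = np.arange(1.0, 10.0).reshape(3, 3)    # matrix of variables (a test point)
g = lambda W: a @ W @ a                   # g(W) = a^T W a
h = 1e-6

grad = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = h                       # perturb only entry w_ij
        grad[i, j] = (g(W + E) - g(W - E)) / (2 * h)

print(np.allclose(grad, np.outer(a, a)))  # True: the gradient is a a^T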

Example:

\[
g(\mathbf{W}) = 9w_{11} + 6w_{21} + 6w_{12} + 4w_{22}
= \begin{bmatrix} 3 & 2 \end{bmatrix}
\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}
\begin{bmatrix} 3 \\ 2 \end{bmatrix}.
\]
Thus we have
\[
\frac{\partial g}{\partial w_{11}} = \frac{\partial}{\partial w_{11}}(9w_{11} + 6w_{21} + 6w_{12} + 4w_{22}) = 9,
\]
and likewise
\[
\frac{\partial g}{\partial w_{12}} = 6, \qquad
\frac{\partial g}{\partial w_{21}} = 6, \qquad
\frac{\partial g}{\partial w_{22}} = 4,
\]
hence
\[
\frac{\partial g}{\partial \mathbf{W}} =
\begin{bmatrix} 9 & 6 \\ 6 & 4 \end{bmatrix}
= \begin{bmatrix} 3 \\ 2 \end{bmatrix}
\begin{bmatrix} 3 & 2 \end{bmatrix}.
\]
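The same result in one line of NumPy: the outer product a aᵀ reproduces the matrix of partial derivatives computed above.

import numpy as np

a = np.array([3.0, 2.0])
print(np.outer(a, a))   # [[9., 6.], [6., 4.]] = dg/dW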

Example: Let W be an invertible square matrix of dimension m with determinant det(W). Then
\[
\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \det \mathbf{W}.
\]
Recall that
\[
\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}}\,\mathrm{adj}(\mathbf{W}).
\]
In the above, adj(W) is the adjoint (adjugate) of W:

\[
\mathrm{adj}(\mathbf{W}) =
\begin{bmatrix}
W_{11} & \cdots & W_{m1} \\
\vdots & \ddots & \vdots \\
W_{1m} & \cdots & W_{mm}
\end{bmatrix},
\]
where \(W_{ij}\) is the cofactor obtained by multiplying the term \((-1)^{i+j}\) by the determinant of the matrix obtained from W by removing the ith row and the jth column. Recall that the determinant of W can also be obtained using cofactors:
\[
\det \mathbf{W} = \sum_{k=1}^{m} w_{ik} W_{ik}.
\]

In the above formula, i denotes an arbitrary row. Now taking the derivative of det(W) w.r.t. \(w_{ij}\) gives
\[
\frac{\partial \det(\mathbf{W})}{\partial w_{ij}} = W_{ij},
\]
and from the definition of the matrix gradient it follows that
\[
\frac{\partial \det(\mathbf{W})}{\partial \mathbf{W}} = \mathrm{adj}(\mathbf{W})^T.
\]
Using the formula for the inverse of W,
\[
\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}}\,\mathrm{adj}(\mathbf{W}),
\]
we have
\[
\frac{\partial \det(\mathbf{W})}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \det \mathbf{W}.
\]
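The determinant-gradient formula can also be sanity-checked numerically; a sketch assuming NumPy, with an illustrative well-conditioned W.

import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # invertible test matrix
h = 1e-6

grad = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = h
        grad[i, j] = (np.linalg.det(W + E) - np.linalg.det(W - E)) / (2 * h)

print(np.allclose(grad, np.linalg.inv(W.T) * np.linalg.det(W), atol=1e-6))   # True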

Homework: Prove that
\[
\frac{\partial \log \lvert \det(\mathbf{W}) \rvert}{\partial \mathbf{W}}
= \frac{1}{\lvert \det \mathbf{W} \rvert}
\frac{\partial \lvert \det \mathbf{W} \rvert}{\partial \mathbf{W}}
= (\mathbf{W}^T)^{-1}.
\]
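Not a proof, but the identity can be checked numerically before attempting the homework; an illustrative sketch assuming NumPy.

import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # invertible test matrix
f = lambda W: np.log(abs(np.linalg.det(W)))
h = 1e-6

grad = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = h
        grad[i, j] = (f(W + E) - f(W - E)) / (2 * h)

print(np.allclose(grad, np.linalg.inv(W.T), atol=1e-6))   # True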
