In neural networks, we often encounter problems with analysis of several variables. Vector/matrix calculus extends calculus of one variable to that of a vector or a matrix of variables.
Vector Gradient: Let $g(\mathbf{w})$ be a differentiable scalar function of $m$ variables, where $\mathbf{w} = [w_1, \ldots, w_m]^T$. Then the vector gradient of $g(\mathbf{w})$ w.r.t. $\mathbf{w}$ is the $m$-dimensional vector of partial derivatives of $g$:
$$
\nabla g = \nabla_{\mathbf{w}} g = \frac{\partial g}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial g}{\partial w_1} \\ \vdots \\ \frac{\partial g}{\partial w_m} \end{bmatrix}.
$$
Similarly we can define the second-order gradient, or Hessian matrix:
$$
\frac{\partial^2 g}{\partial \mathbf{w}^2} = \begin{bmatrix} \frac{\partial^2 g}{\partial w_1^2} & \cdots & \frac{\partial^2 g}{\partial w_1 \partial w_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 g}{\partial w_m \partial w_1} & \cdots & \frac{\partial^2 g}{\partial w_m^2} \end{bmatrix}.
$$
Jacobian matrix: Generalization to vector-valued functions
$$
\mathbf{g}(\mathbf{w}) = [g_1(\mathbf{w}), \ldots, g_n(\mathbf{w})]^T
$$
leads to the definition of the Jacobian matrix of $\mathbf{g}$ w.r.t. $\mathbf{w}$:
$$
\frac{\partial \mathbf{g}}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_1} \\ \vdots & & \vdots \\ \frac{\partial g_1}{\partial w_m} & \cdots & \frac{\partial g_n}{\partial w_m} \end{bmatrix}.
$$
In this convention the columns of the Jacobian matrix are the gradients of the corresponding component functions $g_i(\mathbf{w})$ w.r.t. the vector $\mathbf{w}$.
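These definitions can be checked numerically. The following Python/NumPy sketch (not part of the original notes; the test functions are illustrative choices) approximates the gradient and the Jacobian, in the column convention above, with central differences:

```python
import numpy as np

def num_grad(g, w, h=1e-6):
    """Central-difference approximation of the gradient of a scalar function g."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = h
        grad[i] = (g(w + e) - g(w - e)) / (2 * h)
    return grad

def num_jacobian(g, w, h=1e-6):
    """m x n Jacobian in the convention above: column j is the gradient of g_j."""
    n = np.asarray(g(w)).size
    J = np.zeros((w.size, n))
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = h
        J[i, :] = (np.asarray(g(w + e)) - np.asarray(g(w - e))) / (2 * h)
    return J

# Illustrative scalar function: g(w) = w1^2 + 3*w1*w2, gradient [2*w1 + 3*w2, 3*w1]
w = np.array([1.0, 2.0])
print(num_grad(lambda v: v[0]**2 + 3 * v[0] * v[1], w))       # close to [8., 3.]

# Vector-valued g(w) = [w1*w2, w1 + w2]: columns are the component gradients
print(num_jacobian(lambda v: np.array([v[0] * v[1], v[0] + v[1]]), w))
# close to [[2., 1.], [1., 1.]]
```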
Differentiation Rules:
The differentiation rules are analogous to the case of ordinary functions:
• Product rule:
$$
\frac{\partial f(\mathbf{w}) g(\mathbf{w})}{\partial \mathbf{w}} = \frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} g(\mathbf{w}) + f(\mathbf{w}) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}
$$

• Quotient rule:
$$
\frac{\partial f(\mathbf{w})/g(\mathbf{w})}{\partial \mathbf{w}} = \frac{\frac{\partial f(\mathbf{w})}{\partial \mathbf{w}} g(\mathbf{w}) - f(\mathbf{w}) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}}{g^2(\mathbf{w})}
$$

• Chain rule:
$$
\frac{\partial f(g(\mathbf{w}))}{\partial \mathbf{w}} = f'(g(\mathbf{w})) \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}
$$
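As a sanity check, the product rule can be verified numerically; the two scalar functions below are illustrative choices, not from the notes:

```python
import numpy as np

h = 1e-6
w = np.array([0.5, -1.0])

f = lambda v: v[0]**2 + v[1]              # illustrative scalar functions of w
g = lambda v: np.sin(v[0]) * v[1]

def grad(fun, w):
    """Central-difference gradient of a scalar function."""
    out = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = h
        out[i] = (fun(w + e) - fun(w - e)) / (2 * h)
    return out

# Product rule: grad(f*g) = grad(f)*g + f*grad(g)
lhs = grad(lambda v: f(v) * g(v), w)
rhs = grad(f, w) * g(w) + f(w) * grad(g, w)
print(np.allclose(lhs, rhs, atol=1e-4))   # True
```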
Example: Consider
$$
g(\mathbf{w}) = \sum_{i=1}^{m} a_i w_i = \mathbf{a}^T \mathbf{w}
$$
where $\mathbf{a}$ is a constant vector. Thus,
$$
\frac{\partial g}{\partial \mathbf{w}} = \begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix},
$$
or in vector notation
$$
\frac{\partial \mathbf{a}^T \mathbf{w}}{\partial \mathbf{w}} = \mathbf{a}.
$$
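The identity $\partial \mathbf{a}^T \mathbf{w} / \partial \mathbf{w} = \mathbf{a}$ can be confirmed with finite differences; the vectors below are arbitrary illustrative values (Python/NumPy, not part of the original notes):

```python
import numpy as np

a = np.array([2.0, -1.0, 4.0])             # an arbitrary constant vector
w = np.array([0.3, 1.7, -2.2])             # an arbitrary evaluation point
g = lambda v: a @ v                        # g(w) = a^T w

h = 1e-6
grad = np.zeros(3)
for i in range(3):
    e = np.zeros(3); e[i] = h
    grad[i] = (g(w + e) - g(w - e)) / (2 * h)

print(np.allclose(grad, a))                # True: d(a^T w)/dw = a
```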
Example: Let $\mathbf{w} = (w_1, w_2, w_3)^T \in \mathbb{R}^3$ and
$$
g(\mathbf{w}) = 2w_1 + 5w_2 + 12w_3 = \begin{bmatrix} 2 & 5 & 12 \end{bmatrix} \mathbf{w}.
$$
Because
$$
\frac{\partial g}{\partial w_1} = \frac{\partial}{\partial w_1}(2w_1 + 5w_2 + 12w_3) = 2 + 0 + 0 = 2,
$$
and similarly for $w_2$ and $w_3$, we have
$$
\frac{\partial g}{\partial \mathbf{w}} = \begin{bmatrix} 2 \\ 5 \\ 12 \end{bmatrix}.
$$

Example: Consider
$$
g(\mathbf{w}) = \sum_{i=1}^{m} \sum_{j=1}^{m} a_{ij} w_i w_j = \mathbf{w}^T \mathbf{A} \mathbf{w}
$$
where $\mathbf{A}$ is a constant square matrix. Thus,
$$
\frac{\partial g}{\partial \mathbf{w}} = \begin{bmatrix} \sum_{j=1}^{m} w_j a_{1j} + \sum_{i=1}^{m} w_i a_{i1} \\ \vdots \\ \sum_{j=1}^{m} w_j a_{mj} + \sum_{i=1}^{m} w_i a_{im} \end{bmatrix}
$$
and so in vector notation
$$
\frac{\partial \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = \mathbf{A}\mathbf{w} + \mathbf{A}^T \mathbf{w}.
$$
The Hessian of $g(\mathbf{w})$ is
$$
\frac{\partial^2 \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}^2} = \begin{bmatrix} 2a_{11} & \cdots & a_{1m} + a_{m1} \\ \vdots & \ddots & \vdots \\ a_{m1} + a_{1m} & \cdots & 2a_{mm} \end{bmatrix}
$$
which equals $\mathbf{A} + \mathbf{A}^T$.
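Both the gradient $\mathbf{A}\mathbf{w} + \mathbf{A}^T\mathbf{w}$ and the Hessian $\mathbf{A} + \mathbf{A}^T$ of the quadratic form can be checked numerically; the matrix and point below are arbitrary illustrative values (Python/NumPy sketch, not part of the original notes):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))                # an arbitrary constant square matrix
w = rng.normal(size=4)                     # an arbitrary evaluation point
g = lambda v: v @ A @ v                    # g(w) = w^T A w

h = 1e-5
I = np.eye(4)
# Central differences for the gradient
grad = np.array([(g(w + h*I[i]) - g(w - h*I[i])) / (2*h) for i in range(4)])
print(np.allclose(grad, A @ w + A.T @ w, atol=1e-4))   # gradient = (A + A^T) w

# Second differences recover the Hessian A + A^T
H = np.array([[(g(w + h*(I[i] + I[j])) - g(w + h*(I[i] - I[j]))
              - g(w - h*(I[i] - I[j])) + g(w - h*(I[i] + I[j]))) / (4*h*h)
              for j in range(4)] for i in range(4)])
print(np.allclose(H, A + A.T, atol=1e-3))
```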
Example: Let $\mathbf{w} = (w_1, w_2)^T \in \mathbb{R}^2$ and
$$
g(\mathbf{w}) = 3w_1 w_1 + 2w_1 w_2 + 6w_2 w_1 + 5w_2 w_2 = \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} 3 & 2 \\ 6 & 5 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.
$$
Then
$$
\frac{\partial g}{\partial w_1} = \frac{\partial}{\partial w_1}(3w_1 w_1 + 2w_1 w_2 + 6w_2 w_1 + 5w_2 w_2) = 3 \cdot 2w_1 + 2w_2 + 6w_2 + 0 = 6w_1 + 8w_2,
$$
$$
\frac{\partial g}{\partial w_2} = 0 + 2w_1 + 6w_1 + 5 \cdot 2w_2 = 8w_1 + 10w_2,
$$
and so in vector notation
$$
\frac{\partial \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} = \begin{bmatrix} 3 & 2 \\ 6 & 5 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} + \begin{bmatrix} 3 & 6 \\ 2 & 5 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 8 & 10 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.
$$
The Hessian of $g(\mathbf{w})$ is
$$
\frac{\partial^2 \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}^2} = \frac{\partial}{\partial \mathbf{w}} \left( \frac{\partial \mathbf{w}^T \mathbf{A} \mathbf{w}}{\partial \mathbf{w}} \right) = \begin{bmatrix} 2 \cdot 3 & 2 + 6 \\ 6 + 2 & 2 \cdot 5 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 8 & 10 \end{bmatrix}.
$$
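The numbers in this worked example can be reproduced numerically; the test point below is an arbitrary illustrative choice (Python/NumPy sketch, not part of the original notes):

```python
import numpy as np

A = np.array([[3.0, 2.0],
              [6.0, 5.0]])
w = np.array([1.0, -2.0])                  # any test point works
g = lambda v: v @ A @ v                    # g(w) = w^T A w

h = 1e-6
I = np.eye(2)
grad = np.array([(g(w + h*I[i]) - g(w - h*I[i])) / (2*h) for i in range(2)])

# Matches the hand computation: gradient = [[6, 8], [8, 10]] w
print(np.allclose(grad, np.array([[6.0, 8.0], [8.0, 10.0]]) @ w, atol=1e-4))
print(A + A.T)                             # the Hessian [[6, 8], [8, 10]]
```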
Matrix Gradient: Consider a scalar-valued function $g(\mathbf{W})$ of the $m \times n$ matrix $\mathbf{W} = \{w_{ij}\}$ (e.g. the determinant of a matrix). The matrix gradient w.r.t. $\mathbf{W}$ is a matrix of the same dimension as $\mathbf{W}$, consisting of the partial derivatives of $g(\mathbf{W})$ w.r.t. the components of $\mathbf{W}$:
$$
\frac{\partial g}{\partial \mathbf{W}} = \begin{bmatrix} \frac{\partial g}{\partial w_{11}} & \cdots & \frac{\partial g}{\partial w_{1n}} \\ \vdots & & \vdots \\ \frac{\partial g}{\partial w_{m1}} & \cdots & \frac{\partial g}{\partial w_{mn}} \end{bmatrix}.
$$
Example: If $g(\mathbf{W}) = \operatorname{tr}(\mathbf{W})$, then $\frac{\partial g}{\partial \mathbf{W}} = \mathbf{I}$.

Example: Consider the matrix function
$$
g(\mathbf{W}) = \sum_{i=1}^{m} \sum_{j=1}^{m} w_{ij} a_i a_j = \mathbf{a}^T \mathbf{W} \mathbf{a},
$$
i.e. assume that $\mathbf{a}$ is a constant vector, whereas $\mathbf{W}$ is a square matrix of variables. Taking the gradient w.r.t. $\mathbf{W}$ yields $\frac{\partial \mathbf{a}^T \mathbf{W} \mathbf{a}}{\partial w_{ij}} = a_i a_j$. Thus, in matrix form,
$$
\frac{\partial \mathbf{a}^T \mathbf{W} \mathbf{a}}{\partial \mathbf{W}} = \mathbf{a}\mathbf{a}^T.
$$
Example: Let $\mathbf{a} = (3, 2)^T$ and
$$
g(\mathbf{W}) = 9w_{11} + 6w_{21} + 6w_{12} + 4w_{22} = \begin{bmatrix} 3 & 2 \end{bmatrix} \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \begin{bmatrix} 3 \\ 2 \end{bmatrix}.
$$
Thus we have
$$
\frac{\partial g}{\partial w_{11}} = \frac{\partial}{\partial w_{11}}(9w_{11} + 6w_{21} + 6w_{12} + 4w_{22}) = 9,
$$
and likewise
$$
\frac{\partial g}{\partial w_{12}} = 6, \quad \frac{\partial g}{\partial w_{21}} = 6, \quad \frac{\partial g}{\partial w_{22}} = 4,
$$
hence
$$
\frac{\partial g}{\partial \mathbf{W}} = \begin{bmatrix} 9 & 6 \\ 6 & 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \end{bmatrix} \begin{bmatrix} 3 & 2 \end{bmatrix}.
$$
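The gradient $\mathbf{a}\mathbf{a}^T$ is constant in $\mathbf{W}$, so any test matrix verifies it; the matrix below is an arbitrary illustrative choice (Python/NumPy sketch, not part of the original notes):

```python
import numpy as np

a = np.array([3.0, 2.0])
W = np.array([[1.0, 2.0],                  # any values work: the gradient is constant
              [3.0, 4.0]])
g = lambda M: a @ M @ a                    # g(W) = a^T W a

h = 1e-6
G = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2)); E[i, j] = h
        G[i, j] = (g(W + E) - g(W - E)) / (2 * h)

print(np.allclose(G, np.outer(a, a), atol=1e-4))   # True: gradient is a a^T
print(np.outer(a, a))                      # [[9, 6], [6, 4]], as computed by hand
```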
Example: Let $\mathbf{W}$ be an invertible square matrix of dimension $m$ with determinant $\det \mathbf{W}$. Then
$$
\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \det \mathbf{W}.
$$
Recall that
$$
\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}} \operatorname{adj}(\mathbf{W}).
$$
In the above, $\operatorname{adj}(\mathbf{W})$ is the adjoint of $\mathbf{W}$:
$$
\operatorname{adj}(\mathbf{W}) = \begin{bmatrix} W_{11} & \cdots & W_{m1} \\ \vdots & \ddots & \vdots \\ W_{1m} & \cdots & W_{mm} \end{bmatrix},
$$
where $W_{ij}$ is the cofactor obtained by multiplying the term $(-1)^{i+j}$ by the determinant of the matrix obtained from $\mathbf{W}$ by removing the $i$th row and the $j$th column. Recall that the determinant of $\mathbf{W}$ can also be obtained using cofactors:
$$
\det \mathbf{W} = \sum_{k=1}^{m} w_{ik} W_{ik},
$$
where $i$ denotes an arbitrary row. Now taking the derivative of $\det \mathbf{W}$ w.r.t. $w_{ij}$ gives
$$
\frac{\partial \det \mathbf{W}}{\partial w_{ij}} = W_{ij},
$$
but from the definition of the matrix gradient it follows that
$$
\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = \operatorname{adj}(\mathbf{W})^T.
$$
Using the formula for the inverse of $\mathbf{W}$, $\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}} \operatorname{adj}(\mathbf{W})$, we have
$$
\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \det \mathbf{W}.
$$
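The determinant-gradient identity can be checked numerically against NumPy's `det` and `inv`; the matrix below is an arbitrary invertible illustrative choice (sketch, not part of the original notes):

```python
import numpy as np

W = np.array([[2.0, 1.0, 0.0],             # an arbitrary invertible matrix
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])

h = 1e-6
G = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = h
        G[i, j] = (np.linalg.det(W + E) - np.linalg.det(W - E)) / (2 * h)

analytic = np.linalg.inv(W.T) * np.linalg.det(W)   # (W^T)^{-1} det(W)
print(np.allclose(G, analytic, atol=1e-5))         # True
```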
Homework: Prove that
$$
\frac{\partial \log |\det \mathbf{W}|}{\partial \mathbf{W}} = \frac{1}{|\det \mathbf{W}|} \frac{\partial |\det \mathbf{W}|}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1}.
$$
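The claimed identity can at least be confirmed numerically before proving it; the matrix below is an arbitrary invertible illustrative choice (Python/NumPy sketch, not part of the original notes):

```python
import numpy as np

W = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])            # an arbitrary invertible matrix

h = 1e-6
G = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        E = np.zeros((3, 3)); E[i, j] = h
        G[i, j] = (np.log(abs(np.linalg.det(W + E))) -
                   np.log(abs(np.linalg.det(W - E)))) / (2 * h)

print(np.allclose(G, np.linalg.inv(W.T), atol=1e-5))   # True: the claimed identity
```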