
Vector/Matrix Calculus

In neural networks, we often encounter problems with functions of several variables. Vector/matrix calculus extends the calculus of one variable to that of a vector or a matrix of variables.

Vector gradient: Let g(w) be a differentiable function of m variables, where
\[
\mathbf{w} = [w_1, \ldots, w_m]^T.
\]
Then the vector gradient of g(w) w.r.t. w is the m-dimensional vector of partial derivatives of g:
\[
\frac{\partial g}{\partial \mathbf{w}} = \nabla g = \nabla_{\mathbf{w}} g =
\begin{bmatrix}
\frac{\partial g}{\partial w_1} \\
\vdots \\
\frac{\partial g}{\partial w_m}
\end{bmatrix}.
\]
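To make the definition concrete, here is a minimal numerical sketch, assuming NumPy; the helper num_gradient, the function g, the point w, and the step size h are all illustrative choices, not part of the notes.

import numpy as np

def num_gradient(g, w, h=1e-6):
    # Central-difference approximation: one partial derivative per coordinate,
    # stacked into an m-dimensional vector, as in the definition above.
    return np.array([(g(w + h * e) - g(w - h * e)) / (2 * h)
                     for e in np.eye(len(w))])

g = lambda w: w[0]**2 + 3 * w[0] * w[1]   # scalar function of m = 2 variables
w = np.array([1.0, 2.0])
print(num_gradient(g, w))                 # approx. [2*w1 + 3*w2, 3*w1] = [8, 3]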

Similarly, we can define a second-order gradient. The Hessian matrix is defined as
\[
\frac{\partial^2 g}{\partial \mathbf{w}^2} =
\begin{bmatrix}
\frac{\partial^2 g}{\partial w_1^2} & \cdots & \frac{\partial^2 g}{\partial w_1 \partial w_m} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 g}{\partial w_m \partial w_1} & \cdots & \frac{\partial^2 g}{\partial w_m^2}
\end{bmatrix}.
\]

Jacobian matrix: Generalization to vector-valued functions
\[
\mathbf{g}(\mathbf{w}) = [g_1(\mathbf{w}), \ldots, g_n(\mathbf{w})]^T
\]
leads to the definition of the Jacobian matrix of g w.r.t. w:
\[
\frac{\partial \mathbf{g}}{\partial \mathbf{w}} =
\begin{bmatrix}
\frac{\partial g_1}{\partial w_1} & \cdots & \frac{\partial g_n}{\partial w_1} \\
\vdots & \ddots & \vdots \\
\frac{\partial g_1}{\partial w_m} & \cdots & \frac{\partial g_n}{\partial w_m}
\end{bmatrix}.
\]
In this convention the columns of the Jacobian matrix are the gradients of the corresponding component functions g_i(w) w.r.t. the vector w.
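The column convention can be checked numerically as well; this sketch reuses the hypothetical num_gradient helper from the previous snippet, with illustrative component functions g1 and g2.

g1 = lambda w: w[0] * w[1]        # first component function
g2 = lambda w: w[0] + w[1]**2     # second component function
w = np.array([1.0, 2.0])

# Column j of the Jacobian is the gradient of the j-th component function.
J = np.column_stack([num_gradient(g1, w), num_gradient(g2, w)])
print(J)                          # approx. [[w2, 1], [w1, 2*w2]] = [[2, 1], [1, 4]]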

Differentiation Rules:

The differentiation rules are analogous to those for ordinary functions of one variable; a numerical check of the first rule follows the list:

• Product rule:
\[
\frac{\partial f(\mathbf{w}) g(\mathbf{w})}{\partial \mathbf{w}} =
\frac{\partial f(\mathbf{w})}{\partial \mathbf{w}}\, g(\mathbf{w}) +
f(\mathbf{w})\, \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}
\]

• Quotient rule:
\[
\frac{\partial f(\mathbf{w})/g(\mathbf{w})}{\partial \mathbf{w}} =
\frac{\frac{\partial f(\mathbf{w})}{\partial \mathbf{w}}\, g(\mathbf{w}) -
f(\mathbf{w})\, \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}}{g^2(\mathbf{w})}
\]

• Chain rule:
\[
\frac{\partial f(g(\mathbf{w}))}{\partial \mathbf{w}} =
f'(g(\mathbf{w}))\, \frac{\partial g(\mathbf{w})}{\partial \mathbf{w}}
\]
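A sketch of the promised check, reusing the hypothetical num_gradient helper from the first snippet; f and g are arbitrary illustrative choices.

f = lambda w: w[0]**2 + w[1]
g = lambda w: 3 * w[0] + w[1]**2
w = np.array([1.0, 2.0])

# Left: gradient of the product. Right: the product rule stated above.
lhs = num_gradient(lambda v: f(v) * g(v), w)
rhs = num_gradient(f, w) * g(w) + f(w) * num_gradient(g, w)
print(np.allclose(lhs, rhs, atol=1e-4))   # True up to finite-difference error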

Example: Consider
\[
g(\mathbf{w}) = \sum_{i=1}^{m} a_i w_i = \mathbf{a}^T \mathbf{w},
\]
where a is a constant vector. Thus
\[
\frac{\partial g}{\partial \mathbf{w}} =
\begin{bmatrix} a_1 \\ \vdots \\ a_m \end{bmatrix},
\]
or in vector notation
\[
\frac{\partial \mathbf{a}^T \mathbf{w}}{\partial \mathbf{w}} = \mathbf{a}.
\]
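The identity can be confirmed numerically; a self-contained sketch assuming NumPy, with an illustrative a and w.

import numpy as np

a = np.array([2.0, -1.0, 4.0])      # constant vector
g = lambda w: a @ w                 # g(w) = a^T w
w = np.array([0.5, 1.5, -2.0])
h = 1e-6

grad = np.array([(g(w + h * e) - g(w - h * e)) / (2 * h) for e in np.eye(3)])
print(np.allclose(grad, a))         # True: the gradient of a^T w is a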

Example: Let \(\mathbf{w} = (w_1, w_2, w_3)^T \in \mathbb{R}^3\) and

\[
g(\mathbf{w}) = 2w_1 + 5w_2 + 12w_3 =
\begin{bmatrix} 2 & 5 & 12 \end{bmatrix} \mathbf{w}.
\]
Because
\[
\frac{\partial g}{\partial w_1} = \frac{\partial}{\partial w_1}(2w_1 + 5w_2 + 12w_3) = 2 + 0 + 0 = 2,
\]
and similarly for the other components, we have
\[
\frac{\partial g}{\partial \mathbf{w}} =
\begin{bmatrix} 2 \\ 5 \\ 12 \end{bmatrix}.
\]

Example: Consider
\[
g(\mathbf{w}) = \sum_{i=1}^{m}\sum_{j=1}^{m} a_{ij} w_i w_j = \mathbf{w}^T A \mathbf{w},
\]
where A is a constant square matrix. Thus
\[
\frac{\partial g}{\partial \mathbf{w}} =
\begin{bmatrix}
\sum_{j=1}^{m} w_j a_{1j} + \sum_{i=1}^{m} w_i a_{i1} \\
\vdots \\
\sum_{j=1}^{m} w_j a_{mj} + \sum_{i=1}^{m} w_i a_{im}
\end{bmatrix},
\]
and so in vector notation
\[
\frac{\partial \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}} = A\mathbf{w} + A^T\mathbf{w}.
\]

The Hessian of g(w) is
\[
\frac{\partial^2 \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}^2} =
\begin{bmatrix}
2a_{11} & \cdots & a_{1m} + a_{m1} \\
\vdots & \ddots & \vdots \\
a_{m1} + a_{1m} & \cdots & 2a_{mm}
\end{bmatrix},
\]
which equals \(A + A^T\).
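Both the gradient A w + Aᵀw and the Hessian A + Aᵀ can be checked numerically; a sketch assuming NumPy, with a random illustrative A and w.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))     # constant square matrix
w = rng.standard_normal(4)
h = 1e-5

g = lambda w: w @ A @ w             # g(w) = w^T A w
grad = np.array([(g(w + h * e) - g(w - h * e)) / (2 * h) for e in np.eye(4)])
print(np.allclose(grad, A @ w + A.T @ w, atol=1e-6))    # gradient: A w + A^T w

# The analytic gradient is linear in w, so differencing it recovers the Hessian.
grad_fn = lambda w: A @ w + A.T @ w
H = np.array([(grad_fn(w + h * e) - grad_fn(w - h * e)) / (2 * h)
              for e in np.eye(4)])
print(np.allclose(H, A + A.T, atol=1e-6))               # Hessian: A + A^T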

Example: Let \(\mathbf{w} = (w_1, w_2)^T \in \mathbb{R}^2\) and

\[
g(\mathbf{w}) = 3w_1 w_1 + 2w_1 w_2 + 6w_2 w_1 + 5w_2 w_2
= \begin{bmatrix} w_1 & w_2 \end{bmatrix}
\begin{bmatrix} 3 & 2 \\ 6 & 5 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.
\]
Then
\[
\frac{\partial g}{\partial w_1} = \frac{\partial}{\partial w_1}(3w_1 w_1 + 2w_1 w_2 + 6w_2 w_1 + 5w_2 w_2)
= 3 \cdot 2w_1 + 2w_2 + 6w_2 + 0 = 6w_1 + 8w_2,
\]
\[
\frac{\partial g}{\partial w_2} = 0 + 2w_1 + 6w_1 + 5 \cdot 2w_2 = 8w_1 + 10w_2,
\]
and so in vector notation
\[
\frac{\partial \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}}
= \begin{bmatrix} 3 & 2 \\ 6 & 5 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}
+ \begin{bmatrix} 3 & 6 \\ 2 & 5 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}
= \begin{bmatrix} 6 & 8 \\ 8 & 10 \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \end{bmatrix}.
\]

The Hessian of g(w) is
\[
\frac{\partial^2 \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}^2}
= \frac{\partial}{\partial \mathbf{w}}\left(\frac{\partial \mathbf{w}^T A \mathbf{w}}{\partial \mathbf{w}}\right)
= \begin{bmatrix} 2 \cdot 3 & 2 + 6 \\ 6 + 2 & 2 \cdot 5 \end{bmatrix}
= \begin{bmatrix} 6 & 8 \\ 8 & 10 \end{bmatrix}.
\]
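For the concrete A above, two lines of NumPy reproduce the hand computation.

import numpy as np

A = np.array([[3.0, 2.0], [6.0, 5.0]])
w = np.array([1.0, 2.0])

print(A @ w + A.T @ w)   # [22., 28.] = [6*w1 + 8*w2, 8*w1 + 10*w2] at w = (1, 2)
print(A + A.T)           # [[6., 8.], [8., 10.]], the Hessian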

Matrix Gradient: Consider a scalar-valued function g(W) of the m × n matrix \(\mathbf{W} = \{w_{ij}\}\) (e.g. the trace or the determinant of a matrix). The matrix gradient w.r.t. W is a matrix of the same dimension as W, consisting of the partial derivatives of g(W) w.r.t. the components of W:
\[
\frac{\partial g}{\partial \mathbf{W}} =
\begin{bmatrix}
\frac{\partial g}{\partial w_{11}} & \cdots & \frac{\partial g}{\partial w_{1n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial g}{\partial w_{m1}} & \cdots & \frac{\partial g}{\partial w_{mn}}
\end{bmatrix}.
\]

Example: If \(g(\mathbf{W}) = \mathrm{tr}(\mathbf{W})\), then \(\frac{\partial g}{\partial \mathbf{W}} = \mathbf{I}\).

Example: Consider the matrix function
\[
g(\mathbf{W}) = \sum_{i=1}^{m}\sum_{j=1}^{m} w_{ij} a_i a_j = \mathbf{a}^T \mathbf{W} \mathbf{a},
\]
i.e. assume that a is a constant vector, whereas W is a matrix of variables. Taking the gradient w.r.t. W yields \(\frac{\partial \mathbf{a}^T \mathbf{W} \mathbf{a}}{\partial w_{ij}} = a_i a_j\). Thus, in matrix form,
\[
\frac{\partial \mathbf{a}^T \mathbf{W} \mathbf{a}}{\partial \mathbf{W}} = \mathbf{a}\mathbf{a}^T.
\]
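The entrywise formula can be verified by perturbing one entry of W at a time; a sketch assuming NumPy, with an illustrative a and test point W.

import numpy as np

a = np.array([1.0, -2.0, 0.5])            # constant vector
W = np.arange(1.0, 10.0).reshape(3, 3)    # matrix of variables (a test point)
g = lambda W: a @ W @ a                   # g(W) = a^T W a
h = 1e-6

grad = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = h                       # perturb only entry w_ij
        grad[i, j] = (g(W + E) - g(W - E)) / (2 * h)

print(np.allclose(grad, np.outer(a, a)))  # True: the gradient is a a^T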

Example:

\[
g(\mathbf{W}) = 9w_{11} + 6w_{21} + 6w_{12} + 4w_{22}
= \begin{bmatrix} 3 & 2 \end{bmatrix}
\begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}
\begin{bmatrix} 3 \\ 2 \end{bmatrix}.
\]
Thus we have
\[
\frac{\partial g}{\partial w_{11}} = \frac{\partial}{\partial w_{11}}(9w_{11} + 6w_{21} + 6w_{12} + 4w_{22}) = 9,
\]
and likewise
\[
\frac{\partial g}{\partial w_{12}} = 6, \qquad
\frac{\partial g}{\partial w_{21}} = 6, \qquad
\frac{\partial g}{\partial w_{22}} = 4,
\]
hence
\[
\frac{\partial g}{\partial \mathbf{W}} =
\begin{bmatrix} 9 & 6 \\ 6 & 4 \end{bmatrix}
= \begin{bmatrix} 3 \\ 2 \end{bmatrix}
\begin{bmatrix} 3 & 2 \end{bmatrix}.
\]
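The same result in one line of NumPy: the outer product a aᵀ reproduces the matrix of partial derivatives computed above.

import numpy as np

a = np.array([3.0, 2.0])
print(np.outer(a, a))   # [[9., 6.], [6., 4.]] = dg/dW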

Example: Let W be an invertible square matrix of dimension m with determinant det(W). Then
\[
\frac{\partial \det \mathbf{W}}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \det \mathbf{W}.
\]
Recall that
\[
\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}}\,\mathrm{adj}(\mathbf{W}).
\]
In the above, adj(W) is the adjoint (adjugate) of W:

\[
\mathrm{adj}(\mathbf{W}) =
\begin{bmatrix}
W_{11} & \cdots & W_{m1} \\
\vdots & \ddots & \vdots \\
W_{1m} & \cdots & W_{mm}
\end{bmatrix},
\]
where \(W_{ij}\) is the cofactor obtained by multiplying the term \((-1)^{i+j}\) by the determinant of the matrix obtained from W by removing the ith row and the jth column. Recall that the determinant of W can also be obtained using cofactors:
\[
\det \mathbf{W} = \sum_{k=1}^{m} w_{ik} W_{ik}.
\]

In the above formula, i denotes an arbitrary row. Now taking the derivative of det(W) w.r.t. \(w_{ij}\) gives
\[
\frac{\partial \det(\mathbf{W})}{\partial w_{ij}} = W_{ij},
\]
and from the definition of the matrix gradient it follows that
\[
\frac{\partial \det(\mathbf{W})}{\partial \mathbf{W}} = \mathrm{adj}(\mathbf{W})^T.
\]
Using the formula for the inverse of W,
\[
\mathbf{W}^{-1} = \frac{1}{\det \mathbf{W}}\,\mathrm{adj}(\mathbf{W}),
\]
we have
\[
\frac{\partial \det(\mathbf{W})}{\partial \mathbf{W}} = (\mathbf{W}^T)^{-1} \det \mathbf{W}.
\]
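The determinant-gradient formula can also be sanity-checked numerically; a sketch assuming NumPy, with an illustrative well-conditioned W.

import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # invertible test matrix
h = 1e-6

grad = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = h
        grad[i, j] = (np.linalg.det(W + E) - np.linalg.det(W - E)) / (2 * h)

print(np.allclose(grad, np.linalg.inv(W.T) * np.linalg.det(W), atol=1e-6))   # True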

Homework: Prove that
\[
\frac{\partial \log \lvert \det(\mathbf{W}) \rvert}{\partial \mathbf{W}}
= \frac{1}{\lvert \det \mathbf{W} \rvert}
\frac{\partial \lvert \det \mathbf{W} \rvert}{\partial \mathbf{W}}
= (\mathbf{W}^T)^{-1}.
\]
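Not a proof, but the identity can be checked numerically before attempting the homework; an illustrative sketch assuming NumPy.

import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # invertible test matrix
f = lambda W: np.log(abs(np.linalg.det(W)))
h = 1e-6

grad = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(W)
        E[i, j] = h
        grad[i, j] = (f(W + E) - f(W - E)) / (2 * h)

print(np.allclose(grad, np.linalg.inv(W.T), atol=1e-6))   # True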
