
Deriving the Gradient and Hessian of Linear and Quadratic Functions in Matrix Notation

Mark Schmidt

February 6, 2019

1 Gradient of Linear Function

Consider a linear function of the form $f(w) = a^T w$, where $a$ and $w$ are length-$d$ vectors. We can derive the gradient in matrix notation as follows:

1. Convert to summation notation:
\[
f(w) = \sum_{j=1}^d a_j w_j,
\]

where $a_j$ is element $j$ of $a$ and $w_j$ is element $j$ of $w$.

2. Take the partial derivative with respect to a generic element $k$:

\[
\frac{\partial}{\partial w_k}\left[\sum_{j=1}^d a_j w_j\right] = a_k.
\]

3. Assemble the partial derivatives into a vector:
\[
\nabla f(w) =
\begin{bmatrix}
\frac{\partial f}{\partial w_1} \\
\frac{\partial f}{\partial w_2} \\
\vdots \\
\frac{\partial f}{\partial w_d}
\end{bmatrix}
=
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_d
\end{bmatrix}.
\]

4. Convert to matrix notation:
\[
\nabla f(w) =
\begin{bmatrix}
a_1 \\ a_2 \\ \vdots \\ a_d
\end{bmatrix}
= a.
\]

So our final result is that $\nabla f(w) = a$. This generalizes the scalar case, where $\frac{d}{dw}[\alpha w] = \alpha$. We can also consider general linear functions of the form $f(w) = a^T w + \beta$ for a scalar $\beta$, but in this case we still have $\nabla f(w) = a$ since the y-intercept $\beta$ does not depend on $w$.
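As a quick numerical sanity check (not part of the derivation itself), we can compare the formula $\nabla f(w) = a$ against a centered finite-difference approximation of the partial derivatives; the variable names and dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
a = rng.standard_normal(d)
w = rng.standard_normal(d)

def f(w):
    # f(w) = a^T w
    return a @ w

# Centered finite-difference approximation of each partial derivative.
eps = 1e-6
I = np.eye(d)
grad_fd = np.array([(f(w + eps * I[k]) - f(w - eps * I[k])) / (2 * eps)
                    for k in range(d)])

print(np.allclose(grad_fd, a, atol=1e-6))
```

For a linear function the centered difference is exact up to rounding error, so the check passes at a tight tolerance.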

2 Gradient of Quadratic Function

Consider a quadratic function of the form

\[
f(w) = w^T A w,
\]
where $w$ is a length-$d$ vector and $A$ is a $d \times d$ matrix. We can derive the gradient in matrix notation as follows:

1. Convert to summation notation:
\[
f(w) = w^T \underbrace{\begin{bmatrix}
\sum_{j=1}^d a_{1j} w_j \\
\sum_{j=1}^d a_{2j} w_j \\
\vdots \\
\sum_{j=1}^d a_{dj} w_j
\end{bmatrix}}_{Aw}
= \sum_{i=1}^d \sum_{j=1}^d w_i a_{ij} w_j,
\]

where $a_{ij}$ is the element in row $i$ and column $j$ of $A$. To make the partial derivatives easier to compute, it helps to re-write this in the form

\[
f(w) = \sum_{i=1}^d \sum_{j=1}^d w_i a_{ij} w_j
= \sum_{i=1}^d \left( a_{ii} w_i^2 + \sum_{j \neq i} w_i a_{ij} w_j \right).
\]

2. Take the partial derivative with respect to a generic element $k$:

\[
\frac{\partial}{\partial w_k}\left[\sum_{i=1}^d \left( a_{ii} w_i^2 + \sum_{j \neq i} w_i a_{ij} w_j \right)\right]
= 2 a_{kk} w_k + \sum_{j \neq k} w_j a_{jk} + \sum_{j \neq k} a_{kj} w_j.
\]

The first term comes from the $a_{kk}$ term that is quadratic in $w_k$, while the two sums come from the terms that are linear in $w_k$. We can move one $a_{kk} w_k$ into each of the sums to simplify this to

\[
\frac{\partial}{\partial w_k}\left[\sum_{i=1}^d \left( a_{ii} w_i^2 + \sum_{j \neq i} w_i a_{ij} w_j \right)\right]
= \sum_{j=1}^d w_j a_{jk} + \sum_{j=1}^d a_{kj} w_j.
\]

3. Assemble the partial derivatives into a vector:

\[
\nabla f(w) =
\begin{bmatrix}
\frac{\partial f}{\partial w_1} \\
\frac{\partial f}{\partial w_2} \\
\vdots \\
\frac{\partial f}{\partial w_d}
\end{bmatrix}
=
\begin{bmatrix}
\sum_{j=1}^d w_j a_{j1} + \sum_{j=1}^d a_{1j} w_j \\
\sum_{j=1}^d w_j a_{j2} + \sum_{j=1}^d a_{2j} w_j \\
\vdots \\
\sum_{j=1}^d w_j a_{jd} + \sum_{j=1}^d a_{dj} w_j
\end{bmatrix}
=
\begin{bmatrix}
\sum_{j=1}^d w_j a_{j1} \\
\sum_{j=1}^d w_j a_{j2} \\
\vdots \\
\sum_{j=1}^d w_j a_{jd}
\end{bmatrix}
+
\begin{bmatrix}
\sum_{j=1}^d a_{1j} w_j \\
\sum_{j=1}^d a_{2j} w_j \\
\vdots \\
\sum_{j=1}^d a_{dj} w_j
\end{bmatrix}.
\]

4. Convert to matrix notation:
\[
\nabla f(w) =
\begin{bmatrix}
\sum_{j=1}^d w_j a_{j1} \\
\sum_{j=1}^d w_j a_{j2} \\
\vdots \\
\sum_{j=1}^d w_j a_{jd}
\end{bmatrix}
+
\begin{bmatrix}
\sum_{j=1}^d a_{1j} w_j \\
\sum_{j=1}^d a_{2j} w_j \\
\vdots \\
\sum_{j=1}^d a_{dj} w_j
\end{bmatrix}
= A^T w + A w = (A^T + A) w.
\]

So our final result is that $\nabla f(w) = (A^T + A) w$. Note that if $A$ is symmetric ($A^T = A$) then we have $(A^T + A) = (A + A) = 2A$, so we have

∇f(w) = 2Aw.

This generalizes the scalar case, where $\frac{d}{dw}[\alpha w^2] = 2\alpha w$. We can also consider general quadratic functions of the form
\[
f(w) = \frac{1}{2} w^T A w + b^T w + \gamma.
\]
Using the above results we have
\[
\nabla f(w) = \frac{1}{2}(A^T + A) w + b,
\]
and if $A$ is symmetric then $\nabla f(w) = Aw + b$.
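As a hedged sanity check of the general quadratic result, we can compare $\nabla f(w) = \frac{1}{2}(A^T + A)w + b$ against finite differences for a deliberately non-symmetric $A$; all names and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))  # deliberately non-symmetric
b = rng.standard_normal(d)
w = rng.standard_normal(d)

def f(w):
    # f(w) = (1/2) w^T A w + b^T w
    return 0.5 * w @ A @ w + b @ w

# Gradient from the derivation above.
grad_formula = 0.5 * (A.T + A) @ w + b

# Centered finite-difference approximation of the gradient.
eps = 1e-6
I = np.eye(d)
grad_fd = np.array([(f(w + eps * I[k]) - f(w - eps * I[k])) / (2 * eps)
                    for k in range(d)])

print(np.allclose(grad_fd, grad_formula, atol=1e-5))
```

Using a non-symmetric $A$ here matters: the naive guess $\nabla f(w) = Aw + b$ would fail this check, while $\frac{1}{2}(A^T + A)w + b$ passes.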

3 Hessian of Linear Function

For a linear function of the form $f(w) = a^T w$, we showed above that the partial derivatives are given by
\[
\frac{\partial f}{\partial w_k} = a_k.
\]

Since these first partial derivatives do not depend on any $w_{k'}$, the second partial derivatives are given by
\[
\frac{\partial^2 f}{\partial w_k \partial w_{k'}} = 0,
\]
which means that the Hessian matrix is the zero matrix,

\[
\nabla^2 f(w) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial w_1 \partial w_1} & \frac{\partial^2 f}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_d} \\
\frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_2 \partial w_d} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial w_d \partial w_1} & \frac{\partial^2 f}{\partial w_d \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_d \partial w_d}
\end{bmatrix}
=
\begin{bmatrix}
0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix},
\]
and using $0$ to denote the zero matrix we have

∇2f(w) = 0.
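A small numerical check (illustrative, not part of the derivation): for a linear $f$, every mixed second-order finite difference should vanish, matching $\nabla^2 f(w) = 0$:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
a = rng.standard_normal(d)
w = rng.standard_normal(d)

def f(w):
    return a @ w  # linear, so all second derivatives are zero

# Mixed second-order finite differences approximate the Hessian entries.
eps = 1e-4
I = np.eye(d)
H_fd = np.array([[(f(w + eps*I[i] + eps*I[j]) - f(w + eps*I[i] - eps*I[j])
                   - f(w - eps*I[i] + eps*I[j]) + f(w - eps*I[i] - eps*I[j]))
                  / (4 * eps**2) for j in range(d)] for i in range(d)])

print(np.allclose(H_fd, np.zeros((d, d)), atol=1e-6))
```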

4 Hessian of Quadratic Function

For a quadratic function of the form $f(w) = w^T A w$, we showed above that the partial derivatives are given by linear functions,

\[
\frac{\partial f}{\partial w_k} = \sum_{j=1}^d w_j a_{jk} + \sum_{j=1}^d a_{kj} w_j.
\]

The second partial derivatives are thus constant functions of the form

\[
\frac{\partial^2 f}{\partial w_k \partial w_{k'}} = a_{k'k} + a_{kk'},
\]
which means that the Hessian matrix has a simple form,

\[
\nabla^2 f(w) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial w_1 \partial w_1} & \frac{\partial^2 f}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_d} \\
\frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_2 \partial w_d} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial w_d \partial w_1} & \frac{\partial^2 f}{\partial w_d \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_d \partial w_d}
\end{bmatrix}
=
\begin{bmatrix}
a_{11} + a_{11} & a_{21} + a_{12} & \cdots & a_{d1} + a_{1d} \\
a_{12} + a_{21} & a_{22} + a_{22} & \cdots & a_{d2} + a_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
a_{1d} + a_{d1} & a_{2d} + a_{d2} & \cdots & a_{dd} + a_{dd}
\end{bmatrix}.
\]
This gives a result of
\[
\nabla^2 f(w) = A + A^T,
\]
and if $A$ is symmetric this simplifies to $\nabla^2 f(w) = 2A$.
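As before, we can sanity-check $\nabla^2 f(w) = A + A^T$ numerically with mixed finite differences; the names below are illustrative, and for a quadratic these differences are exact up to rounding:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
A = rng.standard_normal((d, d))
w = rng.standard_normal(d)

def f(w):
    return w @ A @ w  # f(w) = w^T A w

# Mixed second-order finite differences approximate the Hessian entries.
eps = 1e-4
I = np.eye(d)
H_fd = np.array([[(f(w + eps*I[i] + eps*I[j]) - f(w + eps*I[i] - eps*I[j])
                   - f(w - eps*I[i] + eps*I[j]) + f(w - eps*I[i] - eps*I[j]))
                  / (4 * eps**2) for j in range(d)] for i in range(d)])

print(np.allclose(H_fd, A + A.T, atol=1e-4))
```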

5 Example of Least Squares

The least squares objective function has the form
\[
f(w) = \frac{1}{2}\|Xw - y\|^2,
\]
which can be written as
\[
f(w) = \frac{1}{2} w^T \underbrace{X^T X}_{A} w - w^T \underbrace{X^T y}_{b} + \underbrace{\frac{1}{2} y^T y}_{\gamma}.
\]
By using that $X^T X$ is symmetric and using the results above, we have that

\[
\nabla f(w) = X^T X w - X^T y,
\]
and that
\[
\nabla^2 f(w) = X^T X.
\]
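To tie the pieces together, a short script (with illustrative names and sizes) can verify the least squares gradient numerically and then use it to find the stationary point: setting the gradient to zero gives the normal equations $X^T X w = X^T y$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 10, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = rng.standard_normal(d)

def f(w):
    # f(w) = (1/2) ||Xw - y||^2
    return 0.5 * np.sum((X @ w - y) ** 2)

grad = X.T @ X @ w - X.T @ y  # gradient formula derived above
H = X.T @ X                   # Hessian formula derived above

# Finite-difference check of the gradient formula.
eps = 1e-6
I = np.eye(d)
grad_fd = np.array([(f(w + eps * I[k]) - f(w - eps * I[k])) / (2 * eps)
                    for k in range(d)])
print(np.allclose(grad_fd, grad, atol=1e-4))

# Setting the gradient to zero gives the normal equations X^T X w = X^T y.
w_star = np.linalg.solve(H, X.T @ y)
print(np.allclose(X.T @ (X @ w_star - y), 0, atol=1e-8))
```

Since the Hessian $X^T X$ is positive semi-definite, the stationary point found this way is a minimizer.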
