Deriving the Gradient and Hessian of Linear and Quadratic Functions in Matrix Notation

Mark Schmidt

February 6, 2019

1 Gradient of Linear Function

Consider a linear function of the form $f(w) = a^T w$, where $a$ and $w$ are length-$d$ vectors. We can derive the gradient in matrix notation as follows:

1. Convert to summation notation:

$$f(w) = \sum_{j=1}^{d} a_j w_j,$$

where $a_j$ is element $j$ of $a$ and $w_j$ is element $j$ of $w$.

2. Take the partial derivative with respect to a generic element $k$:

$$\frac{\partial}{\partial w_k}\left[\sum_{j=1}^{d} a_j w_j\right] = a_k.$$

3. Assemble the partial derivatives into a vector:

$$\nabla f(w) = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix}.$$

4. Convert to matrix notation:

$$\nabla f(w) = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix} = a.$$

So our final result is that $\nabla f(w) = a$. This generalizes the scalar case where $\frac{d}{dw}[\alpha w] = \alpha$. We can also consider general linear functions of the form $f(w) = a^T w + \beta$, for a scalar $\beta$. In this case we still have $\nabla f(w) = a$, since the y-intercept $\beta$ does not depend on $w$.
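As a sanity check on this result, the sketch below compares $\nabla f(w) = a$ against a central-difference approximation of each partial derivative. The values of `a`, `w`, and `beta`, the step size `h`, and the helper names are illustrative assumptions, not part of the derivation.

```python
def linear(a, w, beta=0.0):
    """f(w) = a^T w + beta for equal-length lists a and w."""
    return sum(aj * wj for aj, wj in zip(a, w)) + beta


def numerical_gradient(f, w, h=1e-6):
    """Approximate each partial derivative of f at w by a central difference."""
    grad = []
    for k in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[k] += h
        w_minus[k] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad


# Arbitrary illustrative values (assumptions, not from the text).
a = [2.0, -1.0, 0.5]
w = [0.3, 1.7, -2.0]
beta = 4.0

approx = numerical_gradient(lambda v: linear(a, v, beta), w)

# The derivation says the gradient is a, regardless of w and beta.
max_error = max(abs(g - aj) for g, aj in zip(approx, a))
print("largest deviation from a:", max_error)
```

Changing `w` or `beta` should leave the approximate gradient essentially unchanged, which is what the derivation predicts.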

2 Gradient of Quadratic Function

Consider a quadratic function of the form

$$f(w) = w^T A w,$$

where $w$ is a length-$d$ vector and $A$ is a $d$ by $d$ matrix. We can derive the gradient in matrix notation as follows:

1. Convert to summation notation:

$$f(w) = w^T \underbrace{\begin{bmatrix} \sum_{j=1}^{d} a_{1j} w_j \\ \sum_{j=1}^{d} a_{2j} w_j \\ \vdots \\ \sum_{j=1}^{d} a_{dj} w_j \end{bmatrix}}_{Aw} = \sum_{i=1}^{d} \sum_{j=1}^{d} w_i a_{ij} w_j,$$

where $a_{ij}$ is the element in row $i$ and column $j$ of $A$. To help with computing the partial derivatives, it helps to re-write it in the form

$$f(w) = \sum_{i=1}^{d} \sum_{j=1}^{d} w_i a_{ij} w_j = \sum_{i=1}^{d} \left( a_{ii} w_i^2 + \sum_{j \neq i} w_i a_{ij} w_j \right).$$

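2. Take the partial derivative with respect to a generic element $k$:

$$\frac{\partial}{\partial w_k}\left[ \sum_{i=1}^{d} \left( a_{ii} w_i^2 + \sum_{j \neq i} w_i a_{ij} w_j \right) \right] = 2 a_{kk} w_k + \sum_{j \neq k} w_j a_{jk} + \sum_{j \neq k} a_{kj} w_j.$$

The first term comes from the $a_{kk}$ term that is quadratic in $w_k$, while the two sums come from the terms that are linear in $w_k$. We can move one $a_{kk} w_k$ into each of the sums to simplify this to

$$\frac{\partial}{\partial w_k}\left[ \sum_{i=1}^{d} \left( a_{ii} w_i^2 + \sum_{j \neq i} w_i a_{ij} w_j \right) \right] = \sum_{j=1}^{d} w_j a_{jk} + \sum_{j=1}^{d} a_{kj} w_j.$$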

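3. Assemble the partial derivatives into a vector:

$$\nabla f(w) = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^{d} w_j a_{j1} + \sum_{j=1}^{d} a_{1j} w_j \\ \sum_{j=1}^{d} w_j a_{j2} + \sum_{j=1}^{d} a_{2j} w_j \\ \vdots \\ \sum_{j=1}^{d} w_j a_{jd} + \sum_{j=1}^{d} a_{dj} w_j \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^{d} w_j a_{j1} \\ \sum_{j=1}^{d} w_j a_{j2} \\ \vdots \\ \sum_{j=1}^{d} w_j a_{jd} \end{bmatrix} + \begin{bmatrix} \sum_{j=1}^{d} a_{1j} w_j \\ \sum_{j=1}^{d} a_{2j} w_j \\ \vdots \\ \sum_{j=1}^{d} a_{dj} w_j \end{bmatrix}.$$

4. Convert to matrix notation:

$$\nabla f(w) = \begin{bmatrix} \sum_{j=1}^{d} w_j a_{j1} \\ \sum_{j=1}^{d} w_j a_{j2} \\ \vdots \\ \sum_{j=1}^{d} w_j a_{jd} \end{bmatrix} + \begin{bmatrix} \sum_{j=1}^{d} a_{1j} w_j \\ \sum_{j=1}^{d} a_{2j} w_j \\ \vdots \\ \sum_{j=1}^{d} a_{dj} w_j \end{bmatrix} = A^T w + A w = (A^T + A) w.$$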

So our final result is that $\nabla f(w) = (A^T + A)w$. Note that if $A$ is symmetric ($A^T = A$) then we have $(A^T + A) = (A + A) = 2A$, so we have

$$\nabla f(w) = 2Aw.$$

This generalizes the scalar case where $\frac{d}{dw}[\alpha w^2] = 2\alpha w$. We can also consider general quadratic functions of the form

$$f(w) = \frac{1}{2} w^T A w + b^T w + \gamma.$$

Using the above results we have

$$\nabla f(w) = \frac{1}{2}(A^T + A)w + b,$$

and if $A$ is symmetric then

$$\nabla f(w) = Aw + b.$$
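The general quadratic result can be checked numerically in the same spirit. The sketch below uses a deliberately non-symmetric `A` to show that $\frac{1}{2}(A^T + A)w + b$ matches a central-difference gradient, while the shortcut $Aw + b$ does not when $A \neq A^T$. All numeric values and helper names are illustrative assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def quadratic(A, b, gamma, w):
    """f(w) = 0.5 * w^T A w + b^T w + gamma."""
    return 0.5 * dot(w, matvec(A, w)) + dot(b, w) + gamma


def quadratic_gradient(A, b, w):
    """Gradient from the derivation: 0.5 * (A^T + A) w + b."""
    Aw = matvec(A, w)
    ATw = matvec(transpose(A), w)
    return [0.5 * (s + t) + bk for s, t, bk in zip(ATw, Aw, b)]


def numerical_gradient(f, w, h=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    grad = []
    for k in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[k] += h
        w_minus[k] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad


# Illustrative values (assumptions); A is intentionally not symmetric.
A = [[2.0, 1.0, 0.0],
     [3.0, 4.0, -1.0],
     [0.5, 2.0, 1.0]]
b = [1.0, -1.0, 2.0]
gamma = 3.0
w = [1.0, -2.0, 0.5]

analytic = quadratic_gradient(A, b, w)
numeric = numerical_gradient(lambda v: quadratic(A, b, gamma, v), w)
max_error = max(abs(x - y) for x, y in zip(analytic, numeric))

# The symmetric-case shortcut A w + b is only valid when A^T = A.
shortcut = [s + bk for s, bk in zip(matvec(A, w), b)]
shortcut_gap = max(abs(x - y) for x, y in zip(analytic, shortcut))

print("error of 0.5 (A^T + A) w + b:", max_error)
print("gap of the A w + b shortcut:", shortcut_gap)
```

Replacing `A` with a symmetric matrix should make the shortcut gap vanish, consistent with the symmetric special case.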

3 Hessian of Linear Function

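For a linear function of the form $f(w) = a^T w$, we showed above that the partial derivatives are given by

$$\frac{\partial f}{\partial w_k} = a_k.$$

Since these first partial derivatives do not depend on any element of $w$, the second partial derivatives are given by

$$\frac{\partial^2 f}{\partial w_k \partial w_{k'}} = 0,$$

which means that the Hessian matrix is the zero matrix,

$$\nabla^2 f(w) = \begin{bmatrix} \frac{\partial^2}{\partial w_1 \partial w_1} f(w) & \frac{\partial^2}{\partial w_1 \partial w_2} f(w) & \cdots & \frac{\partial^2}{\partial w_1 \partial w_d} f(w) \\ \frac{\partial^2}{\partial w_2 \partial w_1} f(w) & \frac{\partial^2}{\partial w_2 \partial w_2} f(w) & \cdots & \frac{\partial^2}{\partial w_2 \partial w_d} f(w) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial w_d \partial w_1} f(w) & \frac{\partial^2}{\partial w_d \partial w_2} f(w) & \cdots & \frac{\partial^2}{\partial w_d \partial w_d} f(w) \end{bmatrix} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix},$$

and using $0$ to denote the zero matrix we have

$$\nabla^2 f(w) = 0.$$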

4 Hessian of Quadratic Function

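For a quadratic function of the form $f(w) = w^T A w$, we showed above that the partial derivatives are given by linear functions,

$$\frac{\partial f}{\partial w_k} = \sum_{j=1}^{d} w_j a_{jk} + \sum_{j=1}^{d} a_{kj} w_j.$$

The second partial derivatives are thus constant functions of the form

$$\frac{\partial^2 f}{\partial w_k \partial w_{k'}} = a_{k'k} + a_{kk'},$$

which means that the Hessian matrix has a simple form

$$\nabla^2 f(w) = \begin{bmatrix} \frac{\partial^2}{\partial w_1 \partial w_1} f(w) & \frac{\partial^2}{\partial w_1 \partial w_2} f(w) & \cdots & \frac{\partial^2}{\partial w_1 \partial w_d} f(w) \\ \frac{\partial^2}{\partial w_2 \partial w_1} f(w) & \frac{\partial^2}{\partial w_2 \partial w_2} f(w) & \cdots & \frac{\partial^2}{\partial w_2 \partial w_d} f(w) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2}{\partial w_d \partial w_1} f(w) & \frac{\partial^2}{\partial w_d \partial w_2} f(w) & \cdots & \frac{\partial^2}{\partial w_d \partial w_d} f(w) \end{bmatrix} = \begin{bmatrix} a_{11} + a_{11} & a_{21} + a_{12} & \cdots & a_{d1} + a_{1d} \\ a_{12} + a_{21} & a_{22} + a_{22} & \cdots & a_{d2} + a_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1d} + a_{d1} & a_{2d} + a_{d2} & \cdots & a_{dd} + a_{dd} \end{bmatrix}.$$

This gives a result of $\nabla^2 f(w) = A + A^T$, and if $A$ is symmetric this simplifies to $\nabla^2 f(w) = 2A$.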
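Both Hessian results (the zero matrix for $a^T w$ and $A + A^T$ for $w^T A w$) can be spot-checked with second-order finite differences. The sketch below is one way to do that; the mixed-difference formula, the step size `h`, the helper names, and all numeric values are assumptions chosen for illustration.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def numerical_hessian(f, w, h=1e-3):
    """Approximate each second partial derivative with a central mixed difference."""
    d = len(w)

    def shifted(k, sk, l, sl):
        # Copy of w moved by sk*h along coordinate k and sl*h along coordinate l.
        v = list(w)
        v[k] += sk * h
        v[l] += sl * h
        return v

    H = [[0.0] * d for _ in range(d)]
    for k in range(d):
        for l in range(d):
            H[k][l] = (f(shifted(k, 1, l, 1)) - f(shifted(k, 1, l, -1))
                       - f(shifted(k, -1, l, 1)) + f(shifted(k, -1, l, -1))) / (4 * h * h)
    return H


def max_abs_diff(M, N):
    return max(abs(m - n) for rm, rn in zip(M, N) for m, n in zip(rm, rn))


# Illustrative values (assumptions); A is intentionally not symmetric.
a = [2.0, -1.0, 0.5]
A = [[2.0, 1.0, 0.0],
     [3.0, 4.0, -1.0],
     [0.5, 2.0, 1.0]]
w = [1.0, -2.0, 0.5]
d = len(w)

# Linear function: the Hessian should be the zero matrix.
H_linear = numerical_hessian(lambda v: dot(a, v), w)
zero = [[0.0] * d for _ in range(d)]
linear_error = max_abs_diff(H_linear, zero)

# Quadratic function: the Hessian should be A + A^T.
H_quadratic = numerical_hessian(lambda v: dot(v, matvec(A, v)), w)
expected = [[A[i][j] + A[j][i] for j in range(d)] for i in range(d)]
quadratic_error = max_abs_diff(H_quadratic, expected)

print("linear Hessian deviation from 0:", linear_error)
print("quadratic Hessian deviation from A + A^T:", quadratic_error)
```

Note that the approximate Hessian comes out symmetric even though `A` is not, matching the fact that $A + A^T$ is always symmetric.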

5 Example of Least Squares

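The least squares objective function has the form

$$f(w) = \frac{1}{2} \|Xw - y\|^2,$$

which can be written as

$$f(w) = \frac{1}{2} w^T \underbrace{X^T X}_{A} w \underbrace{- \, w^T X^T y}_{b^T w} + \underbrace{\frac{1}{2} y^T y}_{\gamma}.$$

This matches the general quadratic form $\frac{1}{2} w^T A w + b^T w + \gamma$ with $A = X^T X$, $b = -X^T y$, and $\gamma = \frac{1}{2} y^T y$. By using that $X^T X$ is symmetric and using the results above, we have that

$$\nabla f(w) = X^T X w - X^T y,$$

and, since the linear and constant terms contribute nothing to the Hessian, that

$$\nabla^2 f(w) = X^T X.$$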
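To make the least squares example concrete, the sketch below checks two things on a small made-up data set: that the expanded quadratic form agrees with $\frac{1}{2}\|Xw - y\|^2$, and that $X^T X w - X^T y$ agrees with a central-difference gradient. The data `X`, `y`, the point `w`, and the helper names are all illustrative assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(M, N):
    """Matrix product for matrices stored as lists of rows."""
    N_cols = transpose(N)
    return [[dot(row, col) for col in N_cols] for row in M]


def numerical_gradient(f, w, h=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    grad = []
    for k in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[k] += h
        w_minus[k] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad


# Made-up data (assumptions): 4 examples, 2 features.
X = [[1.0, 2.0],
     [0.5, -1.0],
     [3.0, 0.0],
     [-2.0, 1.5]]
y = [1.0, 0.0, 2.0, -1.0]
w = [0.4, -0.7]


def least_squares(w):
    """f(w) = 0.5 * ||Xw - y||^2."""
    residual = [p - yi for p, yi in zip(matvec(X, w), y)]
    return 0.5 * dot(residual, residual)


Xt = transpose(X)
A = matmul(Xt, X)        # A = X^T X, which is also the Hessian
Xty = matvec(Xt, y)      # X^T y, so b = -X^T y
gamma = 0.5 * dot(y, y)

# Expanded form: 0.5 w^T A w - w^T X^T y + 0.5 y^T y.
expanded = 0.5 * dot(w, matvec(A, w)) - dot(w, Xty) + gamma
expansion_gap = abs(least_squares(w) - expanded)

# Gradient from the text: X^T X w - X^T y.
analytic = [s - t for s, t in zip(matvec(A, w), Xty)]
numeric = numerical_gradient(least_squares, w)
gradient_error = max(abs(p - q) for p, q in zip(analytic, numeric))

print("gap between the two forms of f:", expansion_gap)
print("gradient error:", gradient_error)
```

Keeping `A` as an explicit matrix here mirrors the derivation; it also makes it easy to confirm by inspection that $X^T X$ is symmetric.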