Deriving the Gradient of Linear and Quadratic Functions in Matrix Notation

Mark Schmidt

October 21, 2016

1 Gradient of Linear Function

Consider a linear function of the form f(w) = aT w, where a and w are length-d vectors. We can derive the gradeint in matrix notation as follows: 1. Convert to summation notation: d X f(w) = ajwj, j=1

where aj is element j of a and wj is element j of w. 2. Take the partial derivative with respect to a generic element k:

 d  ∂ X a w = a . ∂w  j j k k j=1

3. Assemble the partial derivatives into a vector:  ∂    ∂w1 a1 ∂  ∂w  a2 ∇f(w) =  2  =    .   .   .   .  ∂ ad ∂wd

4. Convert to matrix notation:   a1 a2 ∇f(w) =   = a.  .   .  ad

So our ﬁnal results is that ∇f(w) = a. d This generalizes the scalar case where dw [αw] = α. We can also consider general linear functions of the form f(w) = aT w + β, for a scalar β. But in this case we still have ∇f(w) = a since the y-intercept β does not depend on w.

1 2 Gradient of Quadratic Function

Consider a quadratic function of the form

f(w) = wT Aw, where w is a length-d vector and A is a d by d matrix. We can derive the gradeint in matrix notation as follows 1. Convert to summation notation: Pn  j=1 a1jwj Pn d d  j=1 a2jwj X X f(w) = wT   = w a w .  .  i ij j  .  i=1 j=1 Pn j=1 adjwj | {z } Aw

where aij is the element in row i and column j of A. To help with computing the partial derivatives, it helps to re-write it in the form

d d d X X X 2 X f(w) = wiaijwj = (aiiwi + wiaijwj). i=1 j=1 i=1 j6=i

2. Take the partial derivative with respect to a generic element k:

 d  ∂ X 2 X X X  (aiiwi + wiaijwj). = 2akkwk + wjajk + akjwj. ∂wk i=1 j6=i j6=k j6=k

The ﬁrst term comes from the akk term that is quadratic in wk, while the two sums come from the terms that are linear in wk. We can move one akkwk into each of the sums to simplify this to

 d  d d ∂ X 2 X X X  (aiiwi + wiaijwj). = wjajk + akjwj. ∂wk i=1 j6=i j=1 j=1

3. Assemble the partial derivatives into a vector:

∂ Pd Pd  Pd  Pd    wjaj1 + a1jwj wjaj1 a1jwj ∂w1 j=1 j=1 j=1 j=1 ∂ Pd Pd Pd Pd  wjaj2 + a2jwj  wjaj2  a2jwj  ∂w2   j=1 j=1   j=1   j=1  ∇f(w) =  .  =  .  =  .  +  .   .   .   .   .   .   .   .   .  ∂ Pd Pd Pd Pd ∂wd j=1 wjajd + j=1 adjwj j=1 wjajd j=1 adjwj

4. Convert to matrix notation: Pd  Pd  j=1 wjaj1 j=1 a1jwj Pd w a  Pd a w   j=1 j j2  j=1 2j j T T ∇f(w) =  .  +  .  = A w + Aw = (A + A)w.  .   .   .   .  Pd Pd j=1 wjajd j=1 adjwj

2 So our ﬁnal result is that ∇f(w) = (AT + A)w. Note that if A is symmetric (AT = A) then we have (AT + A) = (A + A) = 2A so we have

∇f(w) = 2Aw.

d 2 This generalizes the scalar case where dw [αw ] = 2αw. We can also consider general quadratic functions of the form 1 f(w) = wT Aw + bT w + γ. 2 Using the above results we have 1 ∇f(w) = (AT + A)w + b, 2 and if A is symmetric then ∇f(w) = Aw + b.