Deriving the Gradient of Linear and Quadratic Functions in Matrix Notation
Mark Schmidt
October 21, 2016
1 Gradient of Linear Function
Consider a linear function of the form f(w) = aT w, where a and w are length-d vectors. We can derive the gradeint in matrix notation as follows: 1. Convert to summation notation: d X f(w) = ajwj, j=1
where aj is element j of a and wj is element j of w. 2. Take the partial derivative with respect to a generic element k:
d ∂ X a w = a . ∂w j j k k j=1
3. Assemble the partial derivatives into a vector: ∂ ∂w1 a1 ∂ ∂w a2 ∇f(w) = 2 = . . . . ∂ ad ∂wd
4. Convert to matrix notation: a1 a2 ∇f(w) = = a. . . ad
So our final results is that ∇f(w) = a. d This generalizes the scalar case where dw [αw] = α. We can also consider general linear functions of the form f(w) = aT w + β, for a scalar β. But in this case we still have ∇f(w) = a since the y-intercept β does not depend on w.
1 2 Gradient of Quadratic Function
Consider a quadratic function of the form
f(w) = wT Aw, where w is a length-d vector and A is a d by d matrix. We can derive the gradeint in matrix notation as follows 1. Convert to summation notation: Pn j=1 a1jwj Pn d d j=1 a2jwj X X f(w) = wT = w a w . . i ij j . i=1 j=1 Pn j=1 adjwj | {z } Aw
where aij is the element in row i and column j of A. To help with computing the partial derivatives, it helps to re-write it in the form
d d d X X X 2 X f(w) = wiaijwj = (aiiwi + wiaijwj). i=1 j=1 i=1 j6=i
2. Take the partial derivative with respect to a generic element k:
d ∂ X 2 X X X (aiiwi + wiaijwj). = 2akkwk + wjajk + akjwj. ∂wk i=1 j6=i j6=k j6=k
The first term comes from the akk term that is quadratic in wk, while the two sums come from the terms that are linear in wk. We can move one akkwk into each of the sums to simplify this to
d d d ∂ X 2 X X X (aiiwi + wiaijwj). = wjajk + akjwj. ∂wk i=1 j6=i j=1 j=1
3. Assemble the partial derivatives into a vector:
∂ Pd Pd Pd Pd wjaj1 + a1jwj wjaj1 a1jwj ∂w1 j=1 j=1 j=1 j=1 ∂ Pd Pd Pd Pd wjaj2 + a2jwj wjaj2 a2jwj ∂w2 j=1 j=1 j=1 j=1 ∇f(w) = . = . = . + . . . . . . . . . ∂ Pd Pd Pd Pd ∂wd j=1 wjajd + j=1 adjwj j=1 wjajd j=1 adjwj
4. Convert to matrix notation: Pd Pd j=1 wjaj1 j=1 a1jwj Pd w a Pd a w j=1 j j2 j=1 2j j T T ∇f(w) = . + . = A w + Aw = (A + A)w. . . . . Pd Pd j=1 wjajd j=1 adjwj
2 So our final result is that ∇f(w) = (AT + A)w. Note that if A is symmetric (AT = A) then we have (AT + A) = (A + A) = 2A so we have
∇f(w) = 2Aw.
d 2 This generalizes the scalar case where dw [αw ] = 2αw. We can also consider general quadratic functions of the form 1 f(w) = wT Aw + bT w + γ. 2 Using the above results we have 1 ∇f(w) = (AT + A)w + b, 2 and if A is symmetric then ∇f(w) = Aw + b.
3