Deriving the Gradient and Hessian of Linear and Quadratic Functions in Matrix Notation
Mark Schmidt
February 6, 2019
1 Gradient of Linear Function
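Consider a linear function of the form
\[ f(w) = a^T w, \]
where $a$ and $w$ are length-$d$ vectors. We can derive the gradient in matrix notation as follows:

1. Convert to summation notation:
\[ f(w) = \sum_{j=1}^d a_j w_j, \]
where $a_j$ is element $j$ of $a$ and $w_j$ is element $j$ of $w$.

2. Take the partial derivative with respect to a generic element $k$:
\[ \frac{\partial}{\partial w_k} \left[ \sum_{j=1}^d a_j w_j \right] = a_k. \]

3. Assemble the partial derivatives into a vector:
\[ \nabla f(w) = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix}. \]

4. Convert to matrix notation:
\[ \nabla f(w) = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix} = a. \]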
So our final result is that $\nabla f(w) = a$. This generalizes the scalar case where $\frac{d}{dw}[\alpha w] = \alpha$. We can also consider general linear functions of the form $f(w) = a^T w + \beta$, for a scalar $\beta$. In this case we still have $\nabla f(w) = a$, since the y-intercept $\beta$ does not depend on $w$.
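The result can be sanity-checked numerically. The following is a minimal sketch in plain Python (no libraries) that compares a central-difference approximation of the gradient of $f(w) = a^T w + \beta$ against $a$. The function names `linear` and `numerical_gradient`, the step size `h`, and the values of `a`, `beta`, and `w` are illustrative assumptions, not part of the notes.

```python
def linear(a, beta, w):
    """Evaluate f(w) = a^T w + beta for plain Python lists a and w."""
    return sum(a_j * w_j for a_j, w_j in zip(a, w)) + beta


def numerical_gradient(f, w, h=1e-6):
    """Approximate each partial derivative of f at w by a central difference."""
    grad = []
    for k in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[k] += h   # perturb only element k
        w_minus[k] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad


# Arbitrary illustrative values (assumptions, not from the notes).
a = [2.0, -1.0, 0.5]
beta = 3.0
w = [0.3, 1.2, -0.7]

approx = numerical_gradient(lambda v: linear(a, beta, v), w)

# Largest disagreement between the numerical gradient and a.
max_error = max(abs(g - a_k) for g, a_k in zip(approx, a))
print(max_error)
```

The printed error should be tiny (limited by floating-point rounding), and changing `beta` or `w` should not change the gradient, in line with the derivation above.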
2 Gradient of Quadratic Function
Consider a quadratic function of the form
\[ f(w) = w^T A w, \]
where $w$ is a length-$d$ vector and $A$ is a $d$ by $d$ matrix. We can derive the gradient in matrix notation as follows:

1. Convert to summation notation:
\[ f(w) = w^T \underbrace{\begin{bmatrix} \sum_{j=1}^d a_{1j} w_j \\ \sum_{j=1}^d a_{2j} w_j \\ \vdots \\ \sum_{j=1}^d a_{dj} w_j \end{bmatrix}}_{Aw} = \sum_{i=1}^d \sum_{j=1}^d w_i a_{ij} w_j, \]
where $a_{ij}$ is the element in row $i$ and column $j$ of $A$. To make the partial derivatives easier to compute, we re-write this in the form
\[ f(w) = \sum_{i=1}^d \sum_{j=1}^d w_i a_{ij} w_j = \sum_{i=1}^d \Big( a_{ii} w_i^2 + \sum_{j \neq i} w_i a_{ij} w_j \Big). \]
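2. Take the partial derivative with respect to a generic element $k$:
\[ \frac{\partial}{\partial w_k} \left[ \sum_{i=1}^d \Big( a_{ii} w_i^2 + \sum_{j \neq i} w_i a_{ij} w_j \Big) \right] = 2 a_{kk} w_k + \sum_{j \neq k} w_j a_{jk} + \sum_{j \neq k} a_{kj} w_j. \]
The first term comes from the $a_{kk}$ term that is quadratic in $w_k$, while the two sums come from the terms that are linear in $w_k$. We can move one $a_{kk} w_k$ into each of the sums to simplify this to
\[ \frac{\partial}{\partial w_k} \left[ \sum_{i=1}^d \Big( a_{ii} w_i^2 + \sum_{j \neq i} w_i a_{ij} w_j \Big) \right] = \sum_{j=1}^d w_j a_{jk} + \sum_{j=1}^d a_{kj} w_j. \]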
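3. Assemble the partial derivatives into a vector:
\[ \nabla f(w) = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\ \frac{\partial f}{\partial w_2} \\ \vdots \\ \frac{\partial f}{\partial w_d} \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^d w_j a_{j1} + \sum_{j=1}^d a_{1j} w_j \\ \sum_{j=1}^d w_j a_{j2} + \sum_{j=1}^d a_{2j} w_j \\ \vdots \\ \sum_{j=1}^d w_j a_{jd} + \sum_{j=1}^d a_{dj} w_j \end{bmatrix} = \begin{bmatrix} \sum_{j=1}^d w_j a_{j1} \\ \sum_{j=1}^d w_j a_{j2} \\ \vdots \\ \sum_{j=1}^d w_j a_{jd} \end{bmatrix} + \begin{bmatrix} \sum_{j=1}^d a_{1j} w_j \\ \sum_{j=1}^d a_{2j} w_j \\ \vdots \\ \sum_{j=1}^d a_{dj} w_j \end{bmatrix}. \]

4. Convert to matrix notation:
\[ \nabla f(w) = \begin{bmatrix} \sum_{j=1}^d w_j a_{j1} \\ \sum_{j=1}^d w_j a_{j2} \\ \vdots \\ \sum_{j=1}^d w_j a_{jd} \end{bmatrix} + \begin{bmatrix} \sum_{j=1}^d a_{1j} w_j \\ \sum_{j=1}^d a_{2j} w_j \\ \vdots \\ \sum_{j=1}^d a_{dj} w_j \end{bmatrix} = A^T w + A w = (A^T + A) w. \]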
So our final result is that $\nabla f(w) = (A^T + A)w$. Note that if $A$ is symmetric ($A^T = A$) then we have $(A^T + A) = (A + A) = 2A$, so we have
\[ \nabla f(w) = 2Aw. \]
This generalizes the scalar case where $\frac{d}{dw}[\alpha w^2] = 2\alpha w$. We can also consider general quadratic functions of the form
\[ f(w) = \frac{1}{2} w^T A w + b^T w + \gamma. \]
Using the above results we have
\[ \nabla f(w) = \frac{1}{2}(A^T + A)w + b, \]
and if $A$ is symmetric then
\[ \nabla f(w) = Aw + b. \]
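As a small worked check of $\nabla f(w) = (A^T + A)w$ for $f(w) = w^T A w$, consider $d = 2$ with a matrix chosen purely for illustration (it is an assumption, not from the notes) that is deliberately not symmetric:
\[ A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \qquad f(w) = w^T A w = w_1^2 + 2 w_1 w_2 + 3 w_2 w_1 + 4 w_2^2 = w_1^2 + 5 w_1 w_2 + 4 w_2^2. \]
Differentiating element by element gives
\[ \frac{\partial f}{\partial w_1} = 2 w_1 + 5 w_2, \qquad \frac{\partial f}{\partial w_2} = 5 w_1 + 8 w_2, \]
which agrees with the matrix form
\[ (A^T + A) w = \begin{bmatrix} 2 & 5 \\ 5 & 8 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} 2 w_1 + 5 w_2 \\ 5 w_1 + 8 w_2 \end{bmatrix}. \]
Using the symmetric shortcut $2Aw$ here would instead give $(2 w_1 + 4 w_2,\; 6 w_1 + 8 w_2)$, which is incorrect because this $A$ is not symmetric.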
3 Hessian of Linear Function
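For a linear function of the form $f(w) = a^T w$, we showed above that the partial derivatives are given by
\[ \frac{\partial f}{\partial w_k} = a_k. \]
Since these first partial derivatives don't depend on any $w_{k'}$, the second partial derivatives are given by
\[ \frac{\partial^2 f}{\partial w_k \partial w_{k'}} = 0, \]
which means that the Hessian matrix is the zero matrix,
\[ \nabla^2 f(w) = \begin{bmatrix} \frac{\partial^2 f}{\partial w_1 \partial w_1} & \frac{\partial^2 f}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_d} \\ \frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_2 \partial w_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial w_d \partial w_1} & \frac{\partial^2 f}{\partial w_d \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_d \partial w_d} \end{bmatrix} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}, \]
and using $0$ to denote the zero matrix we have
\[ \nabla^2 f(w) = 0. \]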
4 Hessian of Quadratic Function
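For a quadratic function of the form $f(w) = w^T A w$, we showed above that the partial derivatives are given by linear functions,
\[ \frac{\partial f}{\partial w_k} = \sum_{j=1}^d w_j a_{jk} + \sum_{j=1}^d a_{kj} w_j. \]
The second partial derivatives are thus constant functions of the form
\[ \frac{\partial^2 f}{\partial w_k \partial w_{k'}} = a_{k'k} + a_{kk'}, \]
which means that the Hessian matrix has a simple form
\[ \nabla^2 f(w) = \begin{bmatrix} \frac{\partial^2 f}{\partial w_1 \partial w_1} & \frac{\partial^2 f}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_1 \partial w_d} \\ \frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2 \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_2 \partial w_d} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial w_d \partial w_1} & \frac{\partial^2 f}{\partial w_d \partial w_2} & \cdots & \frac{\partial^2 f}{\partial w_d \partial w_d} \end{bmatrix} = \begin{bmatrix} a_{11} + a_{11} & a_{21} + a_{12} & \cdots & a_{d1} + a_{1d} \\ a_{12} + a_{21} & a_{22} + a_{22} & \cdots & a_{d2} + a_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1d} + a_{d1} & a_{2d} + a_{d2} & \cdots & a_{dd} + a_{dd} \end{bmatrix}. \]
This gives a result of $\nabla^2 f(w) = A + A^T$, and if $A$ is symmetric this simplifies to $\nabla^2 f(w) = 2A$.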
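Both the gradient $(A^T + A)w$ and the Hessian $A + A^T$ of $f(w) = w^T A w$ can be checked numerically. The sketch below uses plain Python lists and central differences. It first compares the closed-form gradient with finite differences of $f$, then differentiates the closed-form gradient numerically to approximate the Hessian. All function names, the step size `h`, and the non-symmetric test matrix are illustrative assumptions, not part of the notes.

```python
def quadratic(A, w):
    """Evaluate f(w) = w^T A w, with A a list of rows and w a list."""
    d = len(w)
    return sum(w[i] * A[i][j] * w[j] for i in range(d) for j in range(d))


def closed_form_gradient(A, w):
    """Gradient (A^T + A) w: entry k is sum_j (a_jk + a_kj) w_j."""
    d = len(w)
    return [sum((A[j][k] + A[k][j]) * w[j] for j in range(d)) for k in range(d)]


def perturbed(w, k, step):
    """Return a copy of w with step added to element k."""
    v = list(w)
    v[k] += step
    return v


def numerical_gradient(f, w, h=1e-5):
    """Central-difference approximation of the gradient of f at w."""
    return [(f(perturbed(w, k, h)) - f(perturbed(w, k, -h))) / (2 * h)
            for k in range(len(w))]


def numerical_hessian(grad, w, h=1e-5):
    """Central differences of a gradient function.

    Entry [k][m] approximates the second partial derivative
    of f with respect to w_k and w_m.
    """
    d = len(w)
    columns = []
    for m in range(d):
        g_plus = grad(perturbed(w, m, h))
        g_minus = grad(perturbed(w, m, -h))
        columns.append([(g_plus[k] - g_minus[k]) / (2 * h) for k in range(d)])
    # columns[m][k] holds entry (k, m), so transpose into row-major form.
    return [[columns[m][k] for m in range(d)] for k in range(d)]


# Deliberately non-symmetric A, so 2A would be the wrong Hessian.
A = [[1.0, 2.0], [3.0, 4.0]]
w = [0.5, -1.5]
d = len(w)

approx_gradient = numerical_gradient(lambda v: quadratic(A, v), w)
grad_error = max(abs(n - c)
                 for n, c in zip(approx_gradient, closed_form_gradient(A, w)))

expected_hessian = [[A[k][m] + A[m][k] for m in range(d)] for k in range(d)]
approx_hessian = numerical_hessian(lambda v: closed_form_gradient(A, v), w)
hess_error = max(abs(approx_hessian[k][m] - expected_hessian[k][m])
                 for k in range(d) for m in range(d))

print(grad_error, hess_error)
```

Both printed errors should be close to zero. The Hessian approximation does not depend on the choice of `w`, which reflects that the second partial derivatives of a quadratic are constant.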
5 Example of Least Squares
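The least squares objective function has the form
\[ f(w) = \frac{1}{2} \|Xw - y\|^2, \]
which can be written as
\[ f(w) = \frac{1}{2} w^T \underbrace{X^T X}_{A} w - w^T \underbrace{X^T y}_{b} + \underbrace{\frac{1}{2} y^T y}_{\gamma}. \]
Here the linear term enters with a minus sign, so $-b$ plays the role of $b$ in the general quadratic form from Section 2. Using that $X^T X$ is symmetric and using the results above, we have that
\[ \nabla f(w) = X^T X w - X^T y, \]
and that
\[ \nabla^2 f(w) = X^T X. \]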
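The least squares gradient can be checked the same way. This is a minimal sketch in plain Python that forms $X^T X$ and $X^T y$ explicitly, evaluates $X^T X w - X^T y$, and compares it against central differences of the objective. The function names and the small data set are illustrative assumptions, not part of the notes.

```python
def objective(X, y, w):
    """f(w) = 0.5 * ||Xw - y||^2, with X a list of rows."""
    residuals = [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) - y_i
                 for row, y_i in zip(X, y)]
    return 0.5 * sum(r * r for r in residuals)


def closed_form_gradient(X, y, w):
    """Gradient X^T X w - X^T y, built from explicit X^T X and X^T y."""
    n, d = len(X), len(w)
    XtX = [[sum(X[i][k] * X[i][j] for i in range(n)) for j in range(d)]
           for k in range(d)]
    Xty = [sum(X[i][k] * y[i] for i in range(n)) for k in range(d)]
    return [sum(XtX[k][j] * w[j] for j in range(d)) - Xty[k] for k in range(d)]


def numerical_gradient(f, w, h=1e-5):
    """Central-difference approximation of the gradient of f at w."""
    grad = []
    for k in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[k] += h
        w_minus[k] -= h
        grad.append((f(w_plus) - f(w_minus)) / (2 * h))
    return grad


# Small made-up data set: 3 examples, 2 features (assumed values).
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
y = [1.0, 0.0, -1.0]
w = [0.1, -0.2]

exact = closed_form_gradient(X, y, w)
approx = numerical_gradient(lambda v: objective(X, y, v), w)
grad_error = max(abs(e - a) for e, a in zip(exact, approx))
print(grad_error)
```

The printed error should be close to zero. In practice one would typically compute the gradient as $X^T(Xw - y)$ to avoid forming $X^T X$, but the explicit form is used here to mirror the formula above term by term.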