
COS 302 Precept 6

Spring 2020

Princeton University


2 Outline

Overview of Matrix Gradients

Review of Differentiation Basics

Examples of Various Gradients

3 Different Flavors of Gradients

[Figure: a table of the different derivative flavors (scalar-by-scalar, scalar-by-vector, vector-by-scalar, vector-by-vector, and so on) and their resulting dimensions.]

1 Source: https://www.comp.nus.edu.sg/~cs5240/lecture/matrix-differentiation.pdf

4 Challenges of Vector/Matrix Calculus

• Good news: most of the rules you know and love from single-variable calculus generalize well (not in all cases, but in many).
• Bad news: confusing notation (many more variables lead to very tedious book-keeping) and more identities to memorize.

5 Numerator vs Denominator Layout

• Two main conventions are used in vector/matrix calculus: the numerator and denominator layouts.
• Numerator layout makes the dimension of the derivative be the numerator dimension by the denominator dimension.
• For example, if $y$ is a scalar and $x \in \mathbb{R}^N$, then

$$\frac{\partial y}{\partial x} \in \mathbb{R}^{1 \times N}.$$
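As a quick sanity check on the layout convention, here is a minimal sketch (the function and sizes are illustrative choices, not from the slides) that builds the numerator-layout derivative by finite differences and confirms that a scalar-valued $f$ of $x \in \mathbb{R}^N$ yields a $1 \times N$ row vector:

```python
import numpy as np

# Build the numerator-layout derivative of f by forward differences.
# For scalar-valued f and x in R^N this is a 1 x N row vector.
def numerical_jacobian(f, x, h=1e-6):
    y = np.atleast_1d(f(x))                 # output as a vector, shape (M,)
    J = np.zeros((y.size, x.size))          # numerator layout: M x N
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (np.atleast_1d(f(x + e)) - y) / h
    return J

f = lambda v: v @ v                         # scalar-valued: sum of squares
x = np.arange(3.0)                          # N = 3
J = numerical_jacobian(f, x)
print(J.shape)                              # (1, 3): a row vector
print(np.allclose(J, 2 * x[None, :], atol=1e-4))  # analytic dy/dx = 2 x^T
```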

6 How to Become a Differentiation Master

1. Identify the flavor of your derivative.
2. What is the dimension of the derivative?
3. Identify any differentiation rules you will need for this case.
4. Can any part of the derivative be reduced to a particular identity?
5. Identify the partial derivatives.
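As a sanity check on this recipe, here is a hedged sketch applying the five steps to $f(x) = x^T A x$, a standard identity not derived on these slides; the analytic answer $x^T(A + A^T)$ is verified entry-wise with difference quotients:

```python
import numpy as np

# Applying the recipe to f(x) = x^T A x (an illustrative identity):
#   1. flavor: scalar-by-vector       2. dimension: 1 x N (numerator layout)
#   3./4. known identity: d(x^T A x)/dx = x^T (A + A^T)
#   5. verify the partial derivatives with difference quotients.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
x = rng.normal(size=4)
f = lambda v: v @ A @ v

h = 1e-6
numeric = np.array([(f(x + h * e) - f(x)) / h for e in np.eye(4)])
analytic = x @ (A + A.T)
print(np.allclose(numeric, analytic, atol=1e-4))  # True
```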

7 Outline

Overview of Matrix Gradients

Review of Differentiation Basics

Examples of Various Gradients

8 Definitions

Names & Notation:

• Difference Quotient: $\dfrac{\delta y}{\delta x} = \dfrac{f(x + \delta x) - f(x)}{\delta x}$
• Derivative: $\dfrac{df}{dx} = \lim_{h \to 0} \dfrac{f(x + h) - f(x)}{h}$
• Partial Derivative: $\dfrac{\partial f}{\partial x_i} = \lim_{h \to 0} \dfrac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x)}{h}$
• Gradient: $\nabla_x f = \dfrac{df}{dx} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \cdots & \dfrac{\partial f}{\partial x_n} \end{bmatrix}$
• Jacobian: $\nabla_x f = \dfrac{df(x)}{dx} = \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} & \cdots & \dfrac{\partial f(x)}{\partial x_n} \end{bmatrix}$
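To build intuition for these definitions, a minimal sketch (with $f = \sin$ as an illustrative choice, not from the slides) showing the difference quotient approaching the derivative as $h \to 0$:

```python
import numpy as np

# The difference quotient (f(x+h) - f(x)) / h approaches the derivative
# cos(x) as h shrinks.
f, x = np.sin, 1.0
for h in [1e-1, 1e-3, 1e-5]:
    quotient = (f(x + h) - f(x)) / h
    print(h, quotient, abs(quotient - np.cos(x)))  # error shrinks with h
```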

9 Differentiation Rules (Scalar-Scalar)

• Sum Rule: $(f(x) + g(x))' = f'(x) + g'(x)$
• Product Rule: $(f(x)\,g(x))' = f'(x)\,g(x) + f(x)\,g'(x)$
• Quotient Rule: $\left( \dfrac{f(x)}{g(x)} \right)' = \dfrac{f'(x)\,g(x) - f(x)\,g'(x)}{(g(x))^2}$
• Chain Rule: $(g(f(x)))' = (g \circ f)'(x) = g'(f(x))\,f'(x)$
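A quick numerical spot-check of the product and chain rules at a single point ($f$, $g$, and the test point are illustrative choices):

```python
import numpy as np

# Spot-check the product and chain rules with difference quotients.
f, fp = np.sin, np.cos          # f and its derivative f'
g, gp = np.exp, np.exp          # g and its derivative g'
x, h = 0.7, 1e-6

prod = lambda t: f(t) * g(t)
comp = lambda t: g(f(t))
print(np.isclose((prod(x + h) - prod(x)) / h,
                 fp(x) * g(x) + f(x) * gp(x), atol=1e-4))  # product rule
print(np.isclose((comp(x + h) - comp(x)) / h,
                 gp(f(x)) * fp(x), atol=1e-4))             # chain rule
```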

10 Differentiation Rules (Scalar-Vector)

• Sum Rule: $\dfrac{\partial}{\partial x}\,(f(x) + g(x)) = \dfrac{\partial f}{\partial x} + \dfrac{\partial g}{\partial x}$
• Product Rule: $\dfrac{\partial}{\partial x}\,(f(x)\,g(x)) = \dfrac{\partial f}{\partial x}\,g(x) + f(x)\,\dfrac{\partial g}{\partial x}$
• Chain Rule: $\dfrac{\partial}{\partial x}\,(g(f(x))) = \dfrac{\partial}{\partial x}\,(g \circ f)(x) = \dfrac{\partial g}{\partial f}\,\dfrac{\partial f}{\partial x}$
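The same kind of spot-check works for the scalar-vector product rule; in this sketch $f(x) = a^T x$ and $g(x) = x^T x$ are illustrative choices whose gradients are the row vectors $a^T$ and $2 x^T$:

```python
import numpy as np

# Check the scalar-vector product rule entry-wise at one point.
rng = np.random.default_rng(1)
a = rng.normal(size=3)
x = rng.normal(size=3)
f = lambda v: a @ v
g = lambda v: v @ v

h = 1e-6
prod = lambda v: f(v) * g(v)
numeric = np.array([(prod(x + h * e) - prod(x)) / h for e in np.eye(3)])
analytic = a * g(x) + f(x) * (2 * x)    # (df/dx) g(x) + f(x) (dg/dx)
print(np.allclose(numeric, analytic, atol=1e-4))  # True
```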

11 Outline

Overview of Matrix Gradients

Review of Differentiation Basics

Examples of Various Gradients

12 Gradient of Matrix Multiplication

Consider the matrix $A \in \mathbb{R}^{M \times N}$ and the vector $x \in \mathbb{R}^N$. Define the vector-valued function $f(x) = Ax$ where $f : \mathbb{R}^N \to \mathbb{R}^M$. What is $\dfrac{df}{dx}$?

13 Gradient of Matrix Multiplication cont.

1. Dimension of gradient: $\dfrac{df}{dx} \in \mathbb{R}^{M \times N}$.
2. One of these $M \times N$ partial derivatives will look like:

$$f_i(x) = \sum_{j=1}^{N} A_{ij} x_j \implies \frac{\partial f_i}{\partial x_j} = A_{ij}$$

3. Collecting all of these partial derivatives:

$$\frac{df}{dx} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_N} \\ \vdots & & \vdots \\ \frac{\partial f_M}{\partial x_1} & \cdots & \frac{\partial f_M}{\partial x_N} \end{bmatrix} = \begin{bmatrix} A_{11} & \cdots & A_{1N} \\ \vdots & & \vdots \\ A_{M1} & \cdots & A_{MN} \end{bmatrix} = A \in \mathbb{R}^{M \times N}$$
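A short sketch confirming this derivation numerically: the finite-difference Jacobian of $f(x) = Ax$ recovers $A$ itself (the sizes $M = 3$, $N = 4$ are arbitrary):

```python
import numpy as np

# The finite-difference Jacobian of f(x) = A x recovers A itself,
# matching the entry-wise derivation above.
rng = np.random.default_rng(2)
M, N = 3, 4
A = rng.normal(size=(M, N))
x = rng.normal(size=N)

h = 1e-6
J = np.column_stack([(A @ (x + h * e) - A @ x) / h for e in np.eye(N)])
print(J.shape)                       # (3, 4): an M x N gradient
print(np.allclose(J, A, atol=1e-4))  # True: df/dx = A
```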

14 Chain Rule Example

Consider the scalar function $h : \mathbb{R} \to \mathbb{R}$ where $h(t) = f(g(t))$ with $f : \mathbb{R}^2 \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}^2$ such that

$$f(x) = \exp(x_1 x_2^2),$$

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = g(t) = \begin{bmatrix} t \cos t \\ t \sin t \end{bmatrix}.$$

What is $\dfrac{dh}{dt}$?

15 Chain Rule Example cont.

Even though the gradient is a scalar, we need to compute two vector gradients because of the chain rule:

$$\frac{dh}{dt} = \frac{\partial f}{\partial x}\,\frac{\partial x}{\partial t} = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} \end{bmatrix} \begin{bmatrix} \dfrac{\partial x_1}{\partial t} \\ \dfrac{\partial x_2}{\partial t} \end{bmatrix}$$

$$= \begin{bmatrix} \exp(x_1 x_2^2)\,x_2^2 & 2 \exp(x_1 x_2^2)\,x_1 x_2 \end{bmatrix} \begin{bmatrix} \cos t - t \sin t \\ \sin t + t \cos t \end{bmatrix}$$

$$= \exp(x_1 x_2^2)\,\bigl(x_2^2 (\cos t - x_2) + 2 x_1 x_2 (\sin t + x_1)\bigr),$$

where $x_1 = t \cos t$ and $x_2 = t \sin t$.
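A quick numerical check of this result (the test point $t = 0.9$ is an arbitrary choice): compare the hand-derived $\frac{dh}{dt}$ with a difference quotient:

```python
import numpy as np

# Compare the hand-derived dh/dt with a difference quotient at t = 0.9.
t, h = 0.9, 1e-6
x1, x2 = t * np.cos(t), t * np.sin(t)

H = lambda s: np.exp((s * np.cos(s)) * (s * np.sin(s)) ** 2)
analytic = np.exp(x1 * x2 ** 2) * (x2 ** 2 * (np.cos(t) - x2)
                                   + 2 * x1 * x2 * (np.sin(t) + x1))
numeric = (H(t + h) - H(t)) / h
print(np.isclose(numeric, analytic, atol=1e-4))  # True
```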

16 Least Squares by Chain Rule

Consider the matrix $X \in \mathbb{R}^{N \times D}$ where $N > D$, and the two vectors $y \in \mathbb{R}^N$ and $\beta \in \mathbb{R}^D$. We saw earlier in class that the over-determined system of linear equations $y = X\beta$ does not always have a solution. Instead of solving the problem directly, we can try to find an approximate solution $\hat{\beta}$.

17 Least Squares by Chain Rule cont.

If we picked a random $\beta$ and multiplied it by $X$, we probably wouldn't get a vector that was very close to $y$. Specifically, the error vector

$$e(\beta) = y - X\beta$$

would probably not be close to the zero vector. A good choice of $\beta$ is one that minimizes the Euclidean distance between $y$ and $X\beta$; specifically, one that minimizes the function $L(e) = \|e\|^2$.

18 Least Squares by Chain Rule cont.

To find the best $\beta$, let's take the gradient of $L$ with respect to $\beta$ and set it equal to zero.

1. $\dfrac{\partial L}{\partial \beta} \in \mathbb{R}^{1 \times D}$
2. $\dfrac{\partial L}{\partial \beta} = \dfrac{\partial L}{\partial e}\,\dfrac{\partial e}{\partial \beta}$
3. $\dfrac{\partial L}{\partial e_i} = \dfrac{\partial}{\partial e_i} \sum_{i=1}^{N} e_i^2 = 2 e_i$. Since $\dfrac{\partial L}{\partial e} \in \mathbb{R}^{1 \times N}$, we have $\dfrac{\partial L}{\partial e} = 2 e^T$.
4. $\dfrac{\partial e}{\partial \beta} = -X \in \mathbb{R}^{N \times D}$
5. $\dfrac{\partial L}{\partial \beta} = -2 e^T X = -2 (y^T - \beta^T X^T) X = 0 \implies \beta = (X^T X)^{-1} X^T y$
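A short sketch verifying this closed form on random data (the shapes $N = 50$, $D = 3$ are arbitrary choices): the gradient vanishes at $\hat{\beta}$, and the result matches NumPy's least-squares solver:

```python
import numpy as np

# Verify the normal-equations solution: the gradient -2 e^T X vanishes at
# beta_hat, and beta_hat matches np.linalg.lstsq.
rng = np.random.default_rng(3)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
e = y - X @ beta_hat
print(np.allclose(-2 * e @ X, 0, atol=1e-10))  # dL/dbeta = -2 e^T X = 0
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```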
