Lecture 4: Backpropagation
CMSC 35246: Deep Learning

Shubhendu Trivedi & Risi Kondor

University of Chicago

April 5, 2017

Things we will look at today

• More Backpropagation
• Still more backpropagation
• Quiz at 4:05 PM

To understand, let us just calculate!

One Neuron Again

[Figure: a single linear neuron with inputs $x_1, x_2, \dots, x_d$ and weights $w_1, w_2, \dots, w_d$]

$\hat{y} = x^T w = x_1 w_1 + x_2 w_2 + \dots + x_d w_d$

Consider an example $x$; the output for $x$ is $\hat{y}$; the correct answer is $y$. Loss $L = (y - \hat{y})^2$

Want to update $w_i$ (forget the closed form solution for a bit!)

Update rule: $w_i := w_i - \eta \frac{\partial L}{\partial w_i}$

Now $\frac{\partial L}{\partial w_i} = \frac{\partial (\hat{y} - y)^2}{\partial w_i} = 2(\hat{y} - y)\frac{\partial (x_1 w_1 + x_2 w_2 + \dots + x_d w_d)}{\partial w_i} = 2(\hat{y} - y)\,x_i$

Update rule (absorbing the factor of 2 into $\eta$):

$w_i := w_i - \eta(\hat{y} - y)x_i = w_i - \eta\delta x_i \quad \text{where } \delta = (\hat{y} - y)$

In vector form: $w := w - \eta\delta x$

Simple enough! Now let's graduate...
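Before graduating, a minimal numerical sketch of this update (the data values below are hypothetical, not from the slides): one gradient step for the single linear neuron, with the factor of 2 absorbed into the learning rate as above.

```python
import numpy as np

# Hypothetical example with d = 3 inputs and one training pair (x, y).
x = np.array([0.5, -1.0, 2.0])
y = 1.5
w = np.zeros(3)           # weights w_1 ... w_d
eta = 0.1                 # learning rate

y_hat = x @ w             # forward pass: y_hat = x^T w
delta = y_hat - y         # local error delta = (y_hat - y)
w = w - eta * delta * x   # update: w := w - eta * delta * x (vector form)
```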

Simple Feedforward Network

[Figure: a two-layer network with inputs $x_1, x_2, x_3$, hidden units $z_1, z_2$ (first-layer weights $w^{(1)}$), and output $\hat{y}$ (second-layer weights $w^{(2)}_1, w^{(2)}_2$)]

$z_1 = \tanh(a_1)$ where $a_1 = w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3$

$z_2 = \tanh(a_2)$ where $a_2 = w^{(1)}_{12} x_1 + w^{(1)}_{22} x_2 + w^{(1)}_{32} x_3$

Output $\hat{y} = w^{(2)}_1 z_1 + w^{(2)}_2 z_2$; Loss $L = (\hat{y} - y)^2$

Want to assign credit for the loss $L$ to each weight
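As a concrete forward-pass sketch of this toy network (all weight and input values below are made up for illustration):

```python
import numpy as np

x1, x2, x3 = 1.0, -0.5, 2.0            # hypothetical inputs
y = 0.3                                 # hypothetical target
# first-layer weights w^(1)_{ij} (from input i to hidden unit j), chosen arbitrarily
w11, w21, w31 = 0.1, -0.2, 0.05
w12, w22, w32 = 0.3, 0.1, -0.1
v1, v2 = 0.4, -0.3                      # second-layer weights w^(2)_1, w^(2)_2

a1 = w11 * x1 + w21 * x2 + w31 * x3     # pre-activation of z1
a2 = w12 * x1 + w22 * x2 + w32 * x3     # pre-activation of z2
z1, z2 = np.tanh(a1), np.tanh(a2)       # hidden activations
y_hat = v1 * z1 + v2 * z2               # network output
L = (y_hat - y) ** 2                    # squared-error loss
```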

Top Layer

[Figure: the same network, with the output $\hat{y}$ and the second-layer weights $w^{(2)}_1, w^{(2)}_2$ highlighted]

Want to find: $\frac{\partial L}{\partial w^{(2)}_1}$ and $\frac{\partial L}{\partial w^{(2)}_2}$

Consider $w^{(2)}_1$ first

$\frac{\partial L}{\partial w^{(2)}_1} = \frac{\partial (\hat{y} - y)^2}{\partial w^{(2)}_1} = 2(\hat{y} - y)\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(2)}_1} = 2(\hat{y} - y)\,z_1$

Familiar from earlier! The update for $w^{(2)}_1$ would be

$w^{(2)}_1 := w^{(2)}_1 - \eta \frac{\partial L}{\partial w^{(2)}_1} = w^{(2)}_1 - \eta\delta z_1 \quad \text{with } \delta = (\hat{y} - y)$

Likewise, for $w^{(2)}_2$ the update would be $w^{(2)}_2 := w^{(2)}_2 - \eta\delta z_2$

Next Layer

[Figure: the same network, with the first-layer weight $w^{(1)}_{11}$ highlighted]

There are six weights to assign credit for the loss incurred. Consider $w^{(1)}_{11}$ for an illustration; the rest are similar.

$\frac{\partial L}{\partial w^{(1)}_{11}} = \frac{\partial (\hat{y} - y)^2}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(1)}_{11}}$

Now: $\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(1)}_{11}} = w^{(2)}_1 \frac{\partial \tanh(w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3)}{\partial w^{(1)}_{11}} + 0$

Which is: $w^{(2)}_1 (1 - \tanh^2(a_1))\,x_1$ (recall $a_1 = ?$)

So we have: $\frac{\partial L}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\,w^{(2)}_1 (1 - \tanh^2(a_1))\,x_1$

Lecture 4 Backpropagation CMSC 35246 (1) Likewise, if we were considering w22 , we’d have: ∂L (2) 2 (1) = 2(ˆy − y)w2 (1 − tanh (a2))x2 ∂w22 (1) (1) ∂L Weight update: w22 := w22 − η (1) ∂w22

Next Layer yˆ

(2) (2) w1 w2 z1 z2 (1) ∂L w (1) w(1) = 12 w22 32 ∂w(1) 11 (1) (1) (1) w11 w21 w31 (2) 2 x x x 2(ˆy − y)w1 (1 − tanh (a1))x1 1 2 3 Weight update: (1) (1) ∂L w11 := w11 − η (1) ∂w11

Lecture 4 Backpropagation CMSC 35246 ∂L (2) 2 (1) = 2(ˆy − y)w2 (1 − tanh (a2))x2 ∂w22 (1) (1) ∂L Weight update: w22 := w22 − η (1) ∂w22

Next Layer yˆ

(2) (2) w1 w2 z1 z2 (1) ∂L w (1) w(1) = 12 w22 32 ∂w(1) 11 (1) (1) (1) w11 w21 w31 (2) 2 x x x 2(ˆy − y)w1 (1 − tanh (a1))x1 1 2 3 Weight update: (1) (1) ∂L w11 := w11 − η (1) ∂w11

(1) Likewise, if we were considering w22 , we’d have:

Lecture 4 Backpropagation CMSC 35246 (1) (1) ∂L Weight update: w22 := w22 − η (1) ∂w22

Next Layer yˆ

(2) (2) w1 w2 z1 z2 (1) ∂L w (1) w(1) = 12 w22 32 ∂w(1) 11 (1) (1) (1) w11 w21 w31 (2) 2 x x x 2(ˆy − y)w1 (1 − tanh (a1))x1 1 2 3 Weight update: (1) (1) ∂L w11 := w11 − η (1) ∂w11

(1) Likewise, if we were considering w22 , we’d have: ∂L (2) 2 (1) = 2(ˆy − y)w2 (1 − tanh (a2))x2 ∂w22

Lecture 4 Backpropagation CMSC 35246 Next Layer yˆ

(2) (2) w1 w2 z1 z2 (1) ∂L w (1) w(1) = 12 w22 32 ∂w(1) 11 (1) (1) (1) w11 w21 w31 (2) 2 x x x 2(ˆy − y)w1 (1 − tanh (a1))x1 1 2 3 Weight update: (1) (1) ∂L w11 := w11 − η (1) ∂w11

(1) Likewise, if we were considering w22 , we’d have: ∂L (2) 2 (1) = 2(ˆy − y)w2 (1 − tanh (a2))x2 ∂w22 (1) (1) ∂L Weight update: w22 := w22 − η (1) ∂w22
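A quick way to sanity-check the formula above is to compare it with a finite-difference estimate. The sketch below does this for $\partial L / \partial w^{(1)}_{11}$ with made-up weights and inputs; the names `W1`, `v`, and `loss` are ours, not the lecture's.

```python
import numpy as np

# Hypothetical values for the toy 3-2-1 network.
x = np.array([1.0, -0.5, 2.0]); y = 0.3
W1 = np.array([[0.1,  0.3],
               [-0.2, 0.1],
               [0.05, -0.1]])            # W1[i-1, j-1] plays the role of w^(1)_{ij}
v = np.array([0.4, -0.3])                # (w^(2)_1, w^(2)_2)

def loss(W1):
    a = x @ W1                           # (a1, a2)
    return (np.tanh(a) @ v - y) ** 2

# Analytic gradient: 2(y_hat - y) * w^(2)_1 * (1 - tanh^2(a1)) * x1
a = x @ W1
y_hat = np.tanh(a) @ v
analytic = 2 * (y_hat - y) * v[0] * (1 - np.tanh(a[0]) ** 2) * x[0]

# Central finite-difference estimate on the same entry.
eps = 1e-6
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (loss(W1p) - loss(W1m)) / (2 * eps)
print(analytic, numeric)                 # the two numbers should agree closely
```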

Lecture 4 Backpropagation CMSC 35246 ∂L One can think of this as: = δ zi ∂w(2) i |{z} |{z} local error local input ∂L (2) 2 For next layer we had: (1) = (ˆy − y)wj (1 − tanh (aj))xi ∂wij (2) 2 (2) 2 Let δj = (ˆy − y)wj (1 − tanh (aj)) = δwj (1 − tanh (aj)) (Notice that δj contains the δ term (which is the error!)) ∂L Then: = δj xi ∂w(1) ij |{z} |{z} local error local input Neat!

Let’s clean this up...

∂L Recall, for top layer: (2) = (ˆy − y)zi = δzi (ignoring 2) ∂wi

Lecture 4 Backpropagation CMSC 35246 ∂L (2) 2 For next layer we had: (1) = (ˆy − y)wj (1 − tanh (aj))xi ∂wij (2) 2 (2) 2 Let δj = (ˆy − y)wj (1 − tanh (aj)) = δwj (1 − tanh (aj)) (Notice that δj contains the δ term (which is the error!)) ∂L Then: = δj xi ∂w(1) ij |{z} |{z} local error local input Neat!

Let’s clean this up...

∂L Recall, for top layer: (2) = (ˆy − y)zi = δzi (ignoring 2) ∂wi ∂L One can think of this as: = δ zi ∂w(2) i |{z} |{z} local error local input

Lecture 4 Backpropagation CMSC 35246 (2) 2 (2) 2 Let δj = (ˆy − y)wj (1 − tanh (aj)) = δwj (1 − tanh (aj)) (Notice that δj contains the δ term (which is the error!)) ∂L Then: = δj xi ∂w(1) ij |{z} |{z} local error local input Neat!

Let’s clean this up...

∂L Recall, for top layer: (2) = (ˆy − y)zi = δzi (ignoring 2) ∂wi ∂L One can think of this as: = δ zi ∂w(2) i |{z} |{z} local error local input ∂L (2) 2 For next layer we had: (1) = (ˆy − y)wj (1 − tanh (aj))xi ∂wij

Lecture 4 Backpropagation CMSC 35246 (Notice that δj contains the δ term (which is the error!)) ∂L Then: = δj xi ∂w(1) ij |{z} |{z} local error local input Neat!

Let’s clean this up...

∂L Recall, for top layer: (2) = (ˆy − y)zi = δzi (ignoring 2) ∂wi ∂L One can think of this as: = δ zi ∂w(2) i |{z} |{z} local error local input ∂L (2) 2 For next layer we had: (1) = (ˆy − y)wj (1 − tanh (aj))xi ∂wij (2) 2 (2) 2 Let δj = (ˆy − y)wj (1 − tanh (aj)) = δwj (1 − tanh (aj))

Lecture 4 Backpropagation CMSC 35246 ∂L Then: = δj xi ∂w(1) ij |{z} |{z} local error local input Neat!

Let’s clean this up...

∂L Recall, for top layer: (2) = (ˆy − y)zi = δzi (ignoring 2) ∂wi ∂L One can think of this as: = δ zi ∂w(2) i |{z} |{z} local error local input ∂L (2) 2 For next layer we had: (1) = (ˆy − y)wj (1 − tanh (aj))xi ∂wij (2) 2 (2) 2 Let δj = (ˆy − y)wj (1 − tanh (aj)) = δwj (1 − tanh (aj)) (Notice that δj contains the δ term (which is the error!))

Lecture 4 Backpropagation CMSC 35246 Neat!

Let’s clean this up...

∂L Recall, for top layer: (2) = (ˆy − y)zi = δzi (ignoring 2) ∂wi ∂L One can think of this as: = δ zi ∂w(2) i |{z} |{z} local error local input ∂L (2) 2 For next layer we had: (1) = (ˆy − y)wj (1 − tanh (aj))xi ∂wij (2) 2 (2) 2 Let δj = (ˆy − y)wj (1 − tanh (aj)) = δwj (1 − tanh (aj)) (Notice that δj contains the δ term (which is the error!)) ∂L Then: = δj xi ∂w(1) ij |{z} |{z} local error local input

Lecture 4 Backpropagation CMSC 35246 Let’s clean this up...

∂L Recall, for top layer: (2) = (ˆy − y)zi = δzi (ignoring 2) ∂wi ∂L One can think of this as: = δ zi ∂w(2) i |{z} |{z} local error local input ∂L (2) 2 For next layer we had: (1) = (ˆy − y)wj (1 − tanh (aj))xi ∂wij (2) 2 (2) 2 Let δj = (ˆy − y)wj (1 − tanh (aj)) = δwj (1 − tanh (aj)) (Notice that δj contains the δ term (which is the error!)) ∂L Then: = δj xi ∂w(1) ij |{z} |{z} local error local input Neat!

Let's clean this up...

Let's get a cleaner notation to summarize this

Let $w_{i\,j}$ be the weight for the connection FROM node $i$ TO node $j$

Then $\frac{\partial L}{\partial w_{i\,j}} = \delta_j z_i$

$\delta_j$ is the local error (going from $j$ backwards) and $z_i$ is the local input coming from $i$

Lecture 4 Backpropagation CMSC 35246 Credit Assignment: A Graphical Revision

0

w1 0 w2 0 z1 1 z2 2 w 5 2w4 2w3 1 w w 5 1 4 1 w3 1

x1 5 x2 4 3 x3

Let’s redraw our toy network with new notation and label nodes

Credit Assignment: Top Layer

[Figure: the labelled network, with node 0 carrying the local error $\delta$]

Local error from 0: $\delta = (\hat{y} - y)$, local input from 1: $z_1$

$\therefore \frac{\partial L}{\partial w_{1\,0}} = \delta z_1$; and update $w_{1\,0} := w_{1\,0} - \eta\delta z_1$

Local error from 0: $\delta = (\hat{y} - y)$, local input from 2: $z_2$

$\therefore \frac{\partial L}{\partial w_{2\,0}} = \delta z_2$ and update $w_{2\,0} := w_{2\,0} - \eta\delta z_2$

Credit Assignment: Next Layer

[Figure: the labelled network, with node 1 carrying the local error $\delta_1$ and node 2 carrying $\delta_2$]

Local error from 1: $\delta_1 = \delta\, w_{1\,0}\,(1 - \tanh^2(a_1))$, local input from 3: $x_3$

$\therefore \frac{\partial L}{\partial w_{3\,1}} = \delta_1 x_3$ and update $w_{3\,1} := w_{3\,1} - \eta\delta_1 x_3$

Local error from 2: $\delta_2 = \delta\, w_{2\,0}\,(1 - \tanh^2(a_2))$, local input from 4: $x_2$

$\therefore \frac{\partial L}{\partial w_{4\,2}} = \delta_2 x_2$ and update $w_{4\,2} := w_{4\,2} - \eta\delta_2 x_2$

Let's Vectorize

Let $W^{(2)} = \begin{pmatrix} w_{1\,0} \\ w_{2\,0} \end{pmatrix}$ (ignore that $W^{(2)}$ is a vector and hence it would be more appropriate to use $w^{(2)}$)

Let $W^{(1)} = \begin{pmatrix} w_{5\,1} & w_{5\,2} \\ w_{4\,1} & w_{4\,2} \\ w_{3\,1} & w_{3\,2} \end{pmatrix}$

Let $Z^{(1)} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$ and $Z^{(2)} = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}$

Feedforward Computation

1. Compute $A^{(1)} = Z^{(1)T} W^{(1)}$
2. Apply the element-wise non-linearity: $Z^{(2)} = \tanh(A^{(1)})$
3. Compute the output $\hat{y} = Z^{(2)T} W^{(2)}$
4. Compute the loss on the example: $(\hat{y} - y)^2$
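A minimal NumPy sketch of these four steps, using the shapes from the vectorization slide ($W^{(1)} \in \mathbb{R}^{3 \times 2}$, $W^{(2)} \in \mathbb{R}^{2}$); all numerical values are made up:

```python
import numpy as np

Z1 = np.array([1.0, -0.5, 2.0])         # Z^(1) = (x1, x2, x3)
y = 0.3                                  # target for this example
W1 = np.array([[0.1,  0.3],
               [-0.2, 0.1],
               [0.05, -0.1]])            # W^(1), shape 3 x 2
W2 = np.array([0.4, -0.3])               # W^(2), shape 2

A1 = Z1 @ W1                             # step 1: A^(1) = Z^(1)^T W^(1)
Z2 = np.tanh(A1)                         # step 2: element-wise non-linearity
y_hat = Z2 @ W2                          # step 3: output
L = (y_hat - y) ** 2                     # step 4: loss on the example
```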

Flowing Backward

1. Top: Compute $\delta$
2. Gradient w.r.t. $W^{(2)}$: $\delta Z^{(2)}$
3. Compute $\delta_1 = (W^{(2)T}\delta) \odot (1 - \tanh(A^{(1)})^2)$
   Notes: (a) $\odot$ is the Hadamard product. (b) We have written $W^{(2)T}\delta$ since $\delta$ can be a vector when there are multiple outputs.
4. Gradient w.r.t. $W^{(1)}$: $\delta_1 Z^{(1)}$
5. Update $W^{(2)} := W^{(2)} - \eta\delta Z^{(2)}$
6. Update $W^{(1)} := W^{(1)} - \eta\delta_1 Z^{(1)}$
7. All the dimensionalities nicely check out!
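The same steps in NumPy, continuing the forward-pass sketch above with the same made-up numbers. Here there is a single output, so $\delta$ is a scalar, and we realize the $W^{(1)}$ gradient as the outer product of local input and local error (our reading of the slide's $\delta_1 Z^{(1)}$); the factor of 2 is again absorbed into $\eta$.

```python
import numpy as np

# Forward pass recap (made-up numbers, as in the previous sketch).
Z1 = np.array([1.0, -0.5, 2.0]); y = 0.3
W1 = np.array([[0.1, 0.3], [-0.2, 0.1], [0.05, -0.1]])
W2 = np.array([0.4, -0.3])
A1 = Z1 @ W1; Z2 = np.tanh(A1); y_hat = Z2 @ W2
eta = 0.1

delta = y_hat - y                               # step 1: top-level error
grad_W2 = delta * Z2                            # step 2: gradient w.r.t. W^(2), shape (2,)
delta1 = (W2 * delta) * (1 - np.tanh(A1) ** 2)  # step 3: back-propagated error (Hadamard product)
grad_W1 = np.outer(Z1, delta1)                  # step 4: gradient w.r.t. W^(1), shape (3, 2)
W2 = W2 - eta * grad_W2                         # step 5: update W^(2)
W1 = W1 - eta * grad_W1                         # step 6: update W^(1)
# step 7: the gradient shapes match W^(2) and W^(1), so the updates are well-defined
```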

So Far

Backpropagation in the context of neural networks is all about assigning credit (or blame!) for the error incurred to the weights
• We follow the path from the output (where we have an error signal) to the edge we want to consider
• We find the $\delta$s from the top down to the edge concerned by using the chain rule
• Once we have the partial derivative, we can write the update rule for that weight

What did we miss?

Exercise: What if there are multiple outputs? (look at the slide from last class)
Another exercise: Add bias neurons. What changes?
As we go down the network, notice that we need the previous $\delta$s
If we recompute them each time, the cost can blow up!
We need to book-keep derivatives as we go down the network and reuse them

A General View of Backpropagation

There is some redundancy in the upcoming slides, but redundancy can be good!

An Aside

Backpropagation only refers to the method for computing the gradient
It is used together with another algorithm, such as SGD, to learn using that gradient
Next: Computing the gradient $\nabla_x f(x, y)$ for an arbitrary $f$
$x$ is the set of variables whose derivatives are desired
Often we require the gradient of the cost $J(\theta)$ with respect to the parameters $\theta$, i.e. $\nabla_\theta J(\theta)$
Note: We restrict to the case where $f$ has a single output
First: Move to a more precise computational graph language!

Computational Graphs

Formalize computation as graphs
Nodes indicate variables (a scalar, vector, tensor, or another variable)
Operations are simple functions of one or more variables
Our graph language comes with a set of allowable operations
Examples:

$z = xy$

[Figure: computational graph in which nodes $x$ and $y$ feed a $\times$ operation producing $z$]

The graph uses the $\times$ operation for the computation

[Figure: computational graph in which $x$ and $w$ feed a dot operation producing $u_1$, $u_1$ and $b$ feed a $+$ operation producing $u_2$, and $\sigma$ produces $\hat{y}$]

Computes $\hat{y} = \sigma(x^T w + b)$

$H = \max\{0, XW + b\}$

[Figure: computational graph in which $X$ and $W$ feed MM producing $U_1$, $U_1$ and $b$ feed $+$ producing $U_2$, and Rc produces $H$]

MM is matrix multiplication and Rc is the ReLU activation

Back to backprop: Chain Rule

Backpropagation computes the chain rule, in a manner that is highly efficient

Let $f, g : \mathbb{R} \to \mathbb{R}$

Suppose $y = g(x)$ and $z = f(y) = f(g(x))$

Chain rule: $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
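For instance, with $y = g(x) = 3x + 1$ and $z = f(y) = y^2$:

$$\frac{dz}{dx} = \frac{dz}{dy}\,\frac{dy}{dx} = 2y \cdot 3 = 6(3x + 1)$$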

Lecture 4 Backpropagation CMSC 35246 z

dy dx y

dz dy x

dz dz dy Chain rule: = dx dy dx

Lecture 4 Backpropagation CMSC 35246 z

dz dz dy1 dy2 y1 y2

dy1 dy2 dx dx x

dz dz dy dz dy Multiple Paths: = 1 + 2 dx dy1 dx dy2 dx

Lecture 4 Backpropagation CMSC 35246 z

dz dz ... dz dy1 dy2 dyn y1 y2 yn

dy1 dy2 dyn dx dx dx x

dz X dz dyj Multiple Paths: = dx dy dx j j

Lecture 4 Backpropagation CMSC 35246 m n n Let g : R → R and f : R → R Suppose y = g(x) and z = f(y), then

∂z X ∂z ∂yj = ∂x ∂y ∂x i j j i

In vector notation:

 ∂z   P ∂z ∂yj  T ∂x1 j ∂yj ∂x1 ! .  .  ∂y  .  =  .  = ∇xz = ∇yz     ∂x ∂z P ∂z ∂yj ∂xm j ∂yj ∂xm

Chain Rule

m n Consider x ∈ R , y ∈ R

Lecture 4 Backpropagation CMSC 35246 Suppose y = g(x) and z = f(y), then

∂z X ∂z ∂yj = ∂x ∂y ∂x i j j i

In vector notation:

 ∂z   P ∂z ∂yj  T ∂x1 j ∂yj ∂x1 ! .  .  ∂y  .  =  .  = ∇xz = ∇yz     ∂x ∂z P ∂z ∂yj ∂xm j ∂yj ∂xm

Chain Rule

m n Consider x ∈ R , y ∈ R m n n Let g : R → R and f : R → R

Lecture 4 Backpropagation CMSC 35246 ∂z X ∂z ∂yj = ∂x ∂y ∂x i j j i

In vector notation:

 ∂z   P ∂z ∂yj  T ∂x1 j ∂yj ∂x1 ! .  .  ∂y  .  =  .  = ∇xz = ∇yz     ∂x ∂z P ∂z ∂yj ∂xm j ∂yj ∂xm

Chain Rule

m n Consider x ∈ R , y ∈ R m n n Let g : R → R and f : R → R Suppose y = g(x) and z = f(y), then

Lecture 4 Backpropagation CMSC 35246 In vector notation:

 ∂z   P ∂z ∂yj  T ∂x1 j ∂yj ∂x1 ! .  .  ∂y  .  =  .  = ∇xz = ∇yz     ∂x ∂z P ∂z ∂yj ∂xm j ∂yj ∂xm

Chain Rule

m n Consider x ∈ R , y ∈ R m n n Let g : R → R and f : R → R Suppose y = g(x) and z = f(y), then

∂z X ∂z ∂yj = ∂x ∂y ∂x i j j i

Lecture 4 Backpropagation CMSC 35246 Chain Rule

m n Consider x ∈ R , y ∈ R m n n Let g : R → R and f : R → R Suppose y = g(x) and z = f(y), then

∂z X ∂z ∂yj = ∂x ∂y ∂x i j j i

In vector notation:

 ∂z   P ∂z ∂yj  T ∂x1 j ∂yj ∂x1 ! .  .  ∂y  .  =  .  = ∇xz = ∇yz     ∂x ∂z P ∂z ∂yj ∂xm j ∂yj ∂xm

Lecture 4 Backpropagation CMSC 35246  ∂y  Gradient of x is a multiplication of a Jacobian matrix ∂x with a vector i.e. the gradient ∇yz Backpropagation consists of applying such Jacobian-gradient products to each operation in the computational graph In general this need not only apply to vectors, but can apply to tensors w.l.o.g

Chain Rule

!T ∂y ∇ z = ∇ z x ∂x y

 ∂y  ∂x is the n × m Jacobian matrix of g

Lecture 4 Backpropagation CMSC 35246 Backpropagation consists of applying such Jacobian-gradient products to each operation in the computational graph In general this need not only apply to vectors, but can apply to tensors w.l.o.g

Chain Rule

!T ∂y ∇ z = ∇ z x ∂x y

 ∂y  ∂x is the n × m Jacobian matrix of g  ∂y  Gradient of x is a multiplication of a Jacobian matrix ∂x with a vector i.e. the gradient ∇yz

Lecture 4 Backpropagation CMSC 35246 In general this need not only apply to vectors, but can apply to tensors w.l.o.g

Chain Rule

!T ∂y ∇ z = ∇ z x ∂x y

 ∂y  ∂x is the n × m Jacobian matrix of g  ∂y  Gradient of x is a multiplication of a Jacobian matrix ∂x with a vector i.e. the gradient ∇yz Backpropagation consists of applying such Jacobian-gradient products to each operation in the computational graph

Lecture 4 Backpropagation CMSC 35246 Chain Rule

!T ∂y ∇ z = ∇ z x ∂x y

 ∂y  ∂x is the n × m Jacobian matrix of g  ∂y  Gradient of x is a multiplication of a Jacobian matrix ∂x with a vector i.e. the gradient ∇yz Backpropagation consists of applying such Jacobian-gradient products to each operation in the computational graph In general this need not only apply to vectors, but can apply to tensors w.l.o.g
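A small NumPy sketch of one Jacobian-gradient (vector-Jacobian) product, for a made-up $g(x) = (x_1 x_2,\; x_1 + x_2,\; \sin x_1)$ and an arbitrary upstream gradient $\nabla_y z$; the Jacobian here is written out by hand:

```python
import numpy as np

x = np.array([0.7, -1.2])                  # x in R^m with m = 2

# Jacobian dy/dx of g at x (n x m, here 3 x 2), entered by hand for this toy g.
J = np.array([[x[1],         x[0]],        # row for y1 = x1 * x2
              [1.0,          1.0],         # row for y2 = x1 + x2
              [np.cos(x[0]), 0.0]])        # row for y3 = sin(x1)

grad_y = np.array([0.5, -2.0, 1.0])        # some upstream gradient nabla_y z
grad_x = J.T @ grad_y                      # nabla_x z = (dy/dx)^T nabla_y z, shape (2,)
```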

Lecture 4 Backpropagation CMSC 35246 Let the gradient of z with respect to a tensor X be ∇Xz If Y = g(X) and z = f(Y), then:

X ∂z ∇ z = (∇ Y ) X X j ∂Y j j

Chain Rule

We can ofcourse also write this in terms of tensors

Lecture 4 Backpropagation CMSC 35246 If Y = g(X) and z = f(Y), then:

X ∂z ∇ z = (∇ Y ) X X j ∂Y j j

Chain Rule

We can ofcourse also write this in terms of tensors

Let the gradient of z with respect to a tensor X be ∇Xz

Lecture 4 Backpropagation CMSC 35246 Chain Rule

We can ofcourse also write this in terms of tensors

Let the gradient of z with respect to a tensor X be ∇Xz If Y = g(X) and z = f(Y), then:

X ∂z ∇ z = (∇ Y ) X X j ∂Y j j

Lecture 4 Backpropagation CMSC 35246 Let for some node x the successors be: {y1, y2, . . . yn} Node: Computation result Edge: Computation dependency n dz X dz dyi = dx dy dx i=1 i

Recursive Application in a Computational Graph

Writing an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar is straightforward using the chain-rule

Lecture 4 Backpropagation CMSC 35246 Node: Computation result Edge: Computation dependency n dz X dz dyi = dx dy dx i=1 i

Recursive Application in a Computational Graph

Writing an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar is straightforward using the chain-rule

Let for some node x the successors be: {y1, y2, . . . yn}

Lecture 4 Backpropagation CMSC 35246 Edge: Computation dependency n dz X dz dyi = dx dy dx i=1 i

Recursive Application in a Computational Graph

Writing an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar is straightforward using the chain-rule

Let for some node x the successors be: {y1, y2, . . . yn} Node: Computation result

Lecture 4 Backpropagation CMSC 35246 Recursive Application in a Computational Graph

Writing an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar is straightforward using the chain-rule

Let for some node x the successors be: {y1, y2, . . . yn} Node: Computation result Edge: Computation dependency n dz X dz dyi = dx dy dx i=1 i

Lecture 4 Backpropagation CMSC 35246 Flow Graph (for previous slide)

z ...

y1 y2 ... yn

x

...

Lecture 4 Backpropagation CMSC 35246 Compute the value of each node given its ancestors Bpropagation: Output gradient = 1 Now visit nods in reverse order Compute gradient with respect to each node using gradient with respect to successors

Successors of x in previous slide {y1, y2, . . . yn}: n dz X dz dyi = dx dy dx i=1 i

Recursive Application in a Computational Graph

Fpropagation: Visit nodes in the order after a topological sort

Lecture 4 Backpropagation CMSC 35246 Bpropagation: Output gradient = 1 Now visit nods in reverse order Compute gradient with respect to each node using gradient with respect to successors

Successors of x in previous slide {y1, y2, . . . yn}: n dz X dz dyi = dx dy dx i=1 i

Recursive Application in a Computational Graph

Fpropagation: Visit nodes in the order after a topological sort Compute the value of each node given its ancestors

Lecture 4 Backpropagation CMSC 35246 Successors of x in previous slide {y1, y2, . . . yn}: n dz X dz dyi = dx dy dx i=1 i

Recursive Application in a Computational Graph

Fpropagation: Visit nodes in the order after a topological sort Compute the value of each node given its ancestors Bpropagation: Output gradient = 1 Now visit nods in reverse order Compute gradient with respect to each node using gradient with respect to successors

Lecture 4 Backpropagation CMSC 35246 n dz X dz dyi = dx dy dx i=1 i

Recursive Application in a Computational Graph

Fpropagation: Visit nodes in the order after a topological sort Compute the value of each node given its ancestors Bpropagation: Output gradient = 1 Now visit nods in reverse order Compute gradient with respect to each node using gradient with respect to successors

Successors of x in previous slide {y1, y2, . . . yn}:

Lecture 4 Backpropagation CMSC 35246 Recursive Application in a Computational Graph

Fpropagation: Visit nodes in the order after a topological sort Compute the value of each node given its ancestors Bpropagation: Output gradient = 1 Now visit nods in reverse order Compute gradient with respect to each node using gradient with respect to successors

Successors of x in previous slide {y1, y2, . . . yn}: n dz X dz dyi = dx dy dx i=1 i

Lecture 4 Backpropagation CMSC 35246 Every node type needs to know: • How to compute its output • How to compute its with respect to its inputs given the gradient w.r.t its outputs Makes for rapid prototyping

Automatic Differentiation

Computation of the gradient can be automatically inferred from the symbolic expression of fprop

Lecture 4 Backpropagation CMSC 35246 • How to compute its output • How to compute its gradients with respect to its inputs given the gradient w.r.t its outputs Makes for rapid prototyping

Automatic Differentiation

Computation of the gradient can be automatically inferred from the symbolic expression of fprop Every node type needs to know:

Lecture 4 Backpropagation CMSC 35246 • How to compute its gradients with respect to its inputs given the gradient w.r.t its outputs Makes for rapid prototyping

Automatic Differentiation

Computation of the gradient can be automatically inferred from the symbolic expression of fprop Every node type needs to know: • How to compute its output

Lecture 4 Backpropagation CMSC 35246 Makes for rapid prototyping

Automatic Differentiation

Computation of the gradient can be automatically inferred from the symbolic expression of fprop Every node type needs to know: • How to compute its output • How to compute its gradients with respect to its inputs given the gradient w.r.t its outputs

Lecture 4 Backpropagation CMSC 35246 Automatic Differentiation

Computation of the gradient can be automatically inferred from the symbolic expression of fprop Every node type needs to know: • How to compute its output • How to compute its gradients with respect to its inputs given the gradient w.r.t its outputs Makes for rapid prototyping
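As an illustration of the two requirements above, here is a made-up mini-API (not any particular framework's) in which each node type carries a forward rule and a rule mapping the gradient w.r.t. its output to gradients w.r.t. its inputs:

```python
import numpy as np

class Multiply:
    """Node type computing z = a * b (element-wise)."""
    def forward(self, a, b):
        self.a, self.b = a, b                 # cache inputs for the backward pass
        return a * b
    def backward(self, grad_out):
        # gradients w.r.t. the inputs, given the gradient w.r.t. the output
        return grad_out * self.b, grad_out * self.a

class Tanh:
    """Node type computing z = tanh(a)."""
    def forward(self, a):
        self.out = np.tanh(a)
        return self.out
    def backward(self, grad_out):
        return (grad_out * (1 - self.out ** 2),)
```

A graph of such nodes can then be differentiated generically by calling `backward` in reverse topological order, exactly as in the previous sketch.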

Lecture 4 Backpropagation CMSC 35246 Two paths lead backwards from J to weights: Through cross entropy and through regularization cost

Computational Graph for a MLP

Figure: Goodfellow et al.

To train we want to compute ∇W (1) J and ∇W (2) J

Lecture 4 Backpropagation CMSC 35246 Computational Graph for a MLP

Figure: Goodfellow et al.

To train we want to compute ∇W (1) J and ∇W (2) J Two paths lead backwards from J to weights: Through cross entropy and through regularization cost

Lecture 4 Backpropagation CMSC 35246 Computational Graph for a MLP

Figure: Goodfellow et al.

To train we want to compute ∇W (1) J and ∇W (2) J Two paths lead backwards from J to weights: Through cross entropy and through regularization cost

Lecture 4 Backpropagation CMSC 35246 Computational Graph for a MLP

Figure: Goodfellow et al.

Weight decay cost is relatively simple: Will always contribute 2λW (i) to gradient on W (i) Two paths lead backwards from J to weights: Through cross entropy and through regularization cost

Lecture 4 Backpropagation CMSC 35246 Computational Graph for a MLP

Figure: Goodfellow et al.

Weight decay cost is relatively simple: Will always contribute 2λW (i) to gradient on W (i) Two paths lead backwards from J to weights: Through cross entropy and through regularization cost
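For example, assuming the regularizer has the standard weight-decay form $\Omega = \lambda\big(\lVert W^{(1)} \rVert_F^2 + \lVert W^{(2)} \rVert_F^2\big)$ (the exact cost is in the figure), its contribution to each gradient is:

$$\nabla_{W^{(i)}}\, \lambda \lVert W^{(i)} \rVert_F^2 = 2\lambda W^{(i)}$$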

Lecture 4 Backpropagation CMSC 35246 Instead it just adds nodes to the graph that describe how to compute derivatives A graph evaluation engine will then do the actual computation Approach taken by and TensorFlow

Symbol to Symbol

Figure: Goodfellow et al.

In this approach backpropagation never accesses any numerical values

Lecture 4 Backpropagation CMSC 35246 A graph evaluation engine will then do the actual computation Approach taken by Theano and TensorFlow

Symbol to Symbol

Figure: Goodfellow et al.

In this approach backpropagation never accesses any numerical values Instead it just adds nodes to the graph that describe how to compute derivatives

Lecture 4 Backpropagation CMSC 35246 Approach taken by Theano and TensorFlow

Symbol to Symbol

Figure: Goodfellow et al.

In this approach backpropagation never accesses any numerical values Instead it just adds nodes to the graph that describe how to compute derivatives A graph evaluation engine will then do the actual computation

Lecture 4 Backpropagation CMSC 35246 Symbol to Symbol

Figure: Goodfellow et al.

In this approach backpropagation never accesses any numerical values Instead it just adds nodes to the graph that describe how to compute derivatives A graph evaluation engine will then do the actual computation Approach taken by Theano and TensorFlow
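A hedged sketch of what this looks like in the TensorFlow 1.x style the slide refers to: `tf.gradients` adds derivative nodes to the symbolic graph, and nothing is evaluated until a session runs them.

```python
import tensorflow as tf   # TensorFlow 1.x-style API assumed

x = tf.placeholder(tf.float32)         # symbolic input
y = tf.tanh(x) * x                     # symbolic expression for y(x)
grad_y_x, = tf.gradients(y, [x])       # symbol-to-symbol: new nodes computing dy/dx

with tf.Session() as sess:
    print(sess.run(grad_y_x, feed_dict={x: 0.5}))   # evaluate the derivative node
```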

Lecture 4 Backpropagation CMSC 35246 Next time

Regularization Methods for Deep Neural Networks
