Lecture 4: Backpropagation
CMSC 35246: Deep Learning
Shubhendu Trivedi & Risi Kondor
University of Chicago
April 5, 2017
Things we will look at today
• More Backpropagation
• Still more backpropagation
• Quiz at 4:05 PM
To understand, let us just calculate!
One Neuron Again
[Figure: a single neuron with inputs $x_1, x_2, \dots, x_d$, weights $w_1, w_2, \dots, w_d$, and output $\hat{y}$]
Consider example $x$; the output for $x$ is $\hat{y}$; the correct answer is $y$. Loss $L = (y - \hat{y})^2$
$\hat{y} = x^\top w = x_1 w_1 + x_2 w_2 + \dots + x_d w_d$
One Neuron Again
[Figure: the same single neuron]
Want to update $w_i$ (forget the closed-form solution for a bit!)
Update rule: $w_i := w_i - \eta \frac{\partial L}{\partial w_i}$
Now $\frac{\partial L}{\partial w_i} = \frac{\partial (\hat{y} - y)^2}{\partial w_i} = 2(\hat{y} - y)\,\frac{\partial (x_1 w_1 + x_2 w_2 + \dots + x_d w_d)}{\partial w_i}$
One Neuron Again
[Figure: the same single neuron]
We have: $\frac{\partial L}{\partial w_i} = 2(\hat{y} - y)\,x_i$
Update Rule: $w_i := w_i - \eta(\hat{y} - y)x_i = w_i - \eta\delta x_i$ where $\delta = (\hat{y} - y)$
In vector form: $w := w - \eta\delta x$
Simple enough! Now let's graduate ...
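A minimal NumPy sketch of this update for one training example (the data, target, learning rate, and initial weights below are made-up illustrative values; as on the slide, the constant factor of 2 is absorbed into the learning rate $\eta$):

```python
import numpy as np

# Made-up example: d = 3 inputs, one training example (x, y)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.2, -0.3])    # current weights
y = 1.0                            # correct answer
eta = 0.01                         # learning rate

y_hat = x @ w                      # forward pass: y_hat = x^T w
delta = y_hat - y                  # local error (factor of 2 absorbed into eta)
w = w - eta * delta * x            # vector-form update: w := w - eta * delta * x
```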
Simple Feedforward Network
[Figure: inputs $x_1, x_2, x_3$ fully connected to hidden units $z_1, z_2$ via weights $w^{(1)}_{ij}$; the hidden units connect to the output $\hat{y}$ via weights $w^{(2)}_1, w^{(2)}_2$]
$z_1 = \tanh(a_1)$ where $a_1 = w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3$
$z_2 = \tanh(a_2)$ where $a_2 = w^{(1)}_{12} x_1 + w^{(1)}_{22} x_2 + w^{(1)}_{32} x_3$
Output $\hat{y} = w^{(2)}_1 z_1 + w^{(2)}_2 z_2$; Loss $L = (\hat{y} - y)^2$
Want to assign credit for the loss $L$ to each weight
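To make the indices concrete, here is a small NumPy sketch of the forward pass and loss for this toy network (all numbers are made-up illustrative values; the array entry W1[i-1, j-1] plays the role of $w^{(1)}_{ij}$):

```python
import numpy as np

x = np.array([0.2, -0.4, 0.7])             # inputs x1, x2, x3 (made up)
W1 = np.array([[ 0.1, -0.2],               # W1[i-1, j-1] ~ w^(1)_{ij}
               [ 0.3,  0.5],
               [-0.6,  0.4]])
w2 = np.array([0.7, -0.8])                 # w^(2)_1, w^(2)_2
y = 1.0                                    # target

a = x @ W1                                 # a_j = sum_i w^(1)_{ij} x_i
z = np.tanh(a)                             # z_j = tanh(a_j)
y_hat = z @ w2                             # y_hat = w^(2)_1 z_1 + w^(2)_2 z_2
L = (y_hat - y) ** 2                       # squared-error loss
```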
Top Layer
[Figure: the same network, highlighting the output weights $w^{(2)}_1, w^{(2)}_2$]
Want to find: $\frac{\partial L}{\partial w^{(2)}_1}$ and $\frac{\partial L}{\partial w^{(2)}_2}$
Consider $w^{(2)}_1$ first
$\frac{\partial L}{\partial w^{(2)}_1} = \frac{\partial (\hat{y} - y)^2}{\partial w^{(2)}_1} = 2(\hat{y} - y)\,\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(2)}_1} = 2(\hat{y} - y)\,z_1$
Familiar from earlier! The update for $w^{(2)}_1$ would be $w^{(2)}_1 := w^{(2)}_1 - \eta \frac{\partial L}{\partial w^{(2)}_1} = w^{(2)}_1 - \eta\delta z_1$ with $\delta = (\hat{y} - y)$
Likewise, the update for $w^{(2)}_2$ would be $w^{(2)}_2 := w^{(2)}_2 - \eta\delta z_2$
Next Layer
[Figure: the same network, highlighting the first-layer weights $w^{(1)}_{ij}$]
There are six weights to assign credit for the loss incurred
Consider $w^{(1)}_{11}$ for an illustration; the rest are similar
$\frac{\partial L}{\partial w^{(1)}_{11}} = \frac{\partial (\hat{y} - y)^2}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\,\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(1)}_{11}}$
Now: $\frac{\partial (w^{(2)}_1 z_1 + w^{(2)}_2 z_2)}{\partial w^{(1)}_{11}} = w^{(2)}_1 \frac{\partial \tanh(w^{(1)}_{11} x_1 + w^{(1)}_{21} x_2 + w^{(1)}_{31} x_3)}{\partial w^{(1)}_{11}} + 0$
Which is: $w^{(2)}_1 (1 - \tanh^2(a_1))\,x_1$ (recall $a_1 = $ ?)
So we have: $\frac{\partial L}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\,w^{(2)}_1 (1 - \tanh^2(a_1))\,x_1$
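A quick sanity check of this formula, reusing the toy values from the sketch above: the analytic gradient for $w^{(1)}_{11}$ is compared against a finite-difference estimate (the helper loss() below is only for this check and is not part of the lecture's notation):

```python
import numpy as np

def loss(x, W1, w2, y):
    z = np.tanh(x @ W1)
    return (z @ w2 - y) ** 2

x  = np.array([0.2, -0.4, 0.7])
W1 = np.array([[0.1, -0.2], [0.3, 0.5], [-0.6, 0.4]])
w2 = np.array([0.7, -0.8])
y  = 1.0

a = x @ W1
z = np.tanh(a)
y_hat = z @ w2

# dL/dw^(1)_11 = 2 (y_hat - y) * w^(2)_1 * (1 - tanh^2(a_1)) * x_1
grad_w11 = 2 * (y_hat - y) * w2[0] * (1 - np.tanh(a[0]) ** 2) * x[0]

# Finite-difference check of the same derivative
eps = 1e-6
W1_plus = W1.copy(); W1_plus[0, 0] += eps
numeric = (loss(x, W1_plus, w2, y) - loss(x, W1, w2, y)) / eps
print(grad_w11, numeric)   # the two numbers should agree closely
```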
Next Layer
[Figure: the same network]
$\frac{\partial L}{\partial w^{(1)}_{11}} = 2(\hat{y} - y)\,w^{(2)}_1 (1 - \tanh^2(a_1))\,x_1$
Weight update: $w^{(1)}_{11} := w^{(1)}_{11} - \eta \frac{\partial L}{\partial w^{(1)}_{11}}$
Likewise, if we were considering $w^{(1)}_{22}$, we'd have: $\frac{\partial L}{\partial w^{(1)}_{22}} = 2(\hat{y} - y)\,w^{(2)}_2 (1 - \tanh^2(a_2))\,x_2$
Weight update: $w^{(1)}_{22} := w^{(1)}_{22} - \eta \frac{\partial L}{\partial w^{(1)}_{22}}$
Let's clean this up...
Recall, for the top layer: $\frac{\partial L}{\partial w^{(2)}_i} = (\hat{y} - y)\,z_i = \delta z_i$ (ignoring the 2)
One can think of this as: $\frac{\partial L}{\partial w^{(2)}_i} = \underbrace{\delta}_{\text{local error}} \cdot \underbrace{z_i}_{\text{local input}}$
For the next layer we had: $\frac{\partial L}{\partial w^{(1)}_{ij}} = (\hat{y} - y)\,w^{(2)}_j (1 - \tanh^2(a_j))\,x_i$
Let $\delta_j = (\hat{y} - y)\,w^{(2)}_j (1 - \tanh^2(a_j)) = \delta\,w^{(2)}_j (1 - \tanh^2(a_j))$ (notice that $\delta_j$ contains the $\delta$ term, which is the error!)
Then: $\frac{\partial L}{\partial w^{(1)}_{ij}} = \underbrace{\delta_j}_{\text{local error}} \cdot \underbrace{x_i}_{\text{local input}}$
Neat!
Let's clean this up...
Let's get a cleaner notation to summarize this
Let $w_{i\,j}$ be the weight for the connection FROM node $i$ TO node $j$
Then $\frac{\partial L}{\partial w_{i\,j}} = \delta_j z_i$
$\delta_j$ is the local error (going from $j$ backwards) and $z_i$ is the local input coming from $i$
Credit Assignment: A Graphical Revision
[Figure: the toy network redrawn with labeled nodes: output node 0 (producing $\hat{y}$), hidden nodes 1 and 2 (producing $z_1, z_2$), and input nodes 5, 4, 3 (carrying $x_1, x_2, x_3$); edge weights $w_{1\,0}, w_{2\,0}$ into the output node, and $w_{i\,j}$ from each input node $i \in \{3, 4, 5\}$ into each hidden node $j \in \{1, 2\}$]
Let's redraw our toy network with the new notation and label the nodes
Credit Assignment: Top Layer
[Figure: the labeled network, with the local error $\delta$ marked at node 0]
Local error from 0: $\delta = (\hat{y} - y)$; local input from 1: $z_1$
$\therefore\ \frac{\partial L}{\partial w_{1\,0}} = \delta z_1$; and update $w_{1\,0} := w_{1\,0} - \eta\delta z_1$
Credit Assignment: Top Layer
[Figure: the labeled network, with $\delta$ marked at node 0]
Local error from 0: $\delta = (\hat{y} - y)$; local input from 2: $z_2$
$\therefore\ \frac{\partial L}{\partial w_{2\,0}} = \delta z_2$, and update $w_{2\,0} := w_{2\,0} - \eta\delta z_2$
Credit Assignment: Next Layer
[Figure: the labeled network, with $\delta$ at node 0 and $\delta_1$ at node 1]
Local error from 1: $\delta_1 = \delta\,w_{1\,0}\,(1 - \tanh^2(a_1))$; local input from 3: $x_3$
$\therefore\ \frac{\partial L}{\partial w_{3\,1}} = \delta_1 x_3$, and update $w_{3\,1} := w_{3\,1} - \eta\delta_1 x_3$
Credit Assignment: Next Layer
[Figure: the labeled network, with $\delta$ at node 0 and $\delta_2$ at node 2]
Local error from 2: $\delta_2 = \delta\,w_{2\,0}\,(1 - \tanh^2(a_2))$; local input from 4: $x_2$
$\therefore\ \frac{\partial L}{\partial w_{4\,2}} = \delta_2 x_2$, and update $w_{4\,2} := w_{4\,2} - \eta\delta_2 x_2$
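In code, the rule $\frac{\partial L}{\partial w_{i\,j}} = \delta_j z_i$ makes every one of these edge updates a one-liner. A hedged sketch over the labeled toy graph (the numeric values of the local inputs and local errors below are made up purely for illustration):

```python
# Node 0 is the output, nodes 1, 2 are hidden, nodes 3, 4, 5 are inputs
z     = {1: 0.3, 2: -0.5, 3: 0.7, 4: -0.4, 5: 0.2}   # local inputs (made up)
delta = {0: 0.8, 1: 0.1, 2: -0.2}                     # local errors (made up)
w     = {(1, 0): 0.7, (2, 0): -0.8,                   # weight FROM node i TO node j
         (5, 1): 0.1, (4, 1): 0.3, (3, 1): -0.6,
         (5, 2): -0.2, (4, 2): 0.5, (3, 2): 0.4}
eta = 0.01

for (i, j) in w:
    w[(i, j)] -= eta * delta[j] * z[i]    # dL/dw_{i j} = delta_j * z_i
```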
Let's Vectorize
Let $W^{(2)} = \begin{pmatrix} w_{1\,0} \\ w_{2\,0} \end{pmatrix}$ (ignore that $W^{(2)}$ is a vector, and hence it would be more appropriate to write $w^{(2)}$)
Let $W^{(1)} = \begin{pmatrix} w_{5\,1} & w_{5\,2} \\ w_{4\,1} & w_{4\,2} \\ w_{3\,1} & w_{3\,2} \end{pmatrix}$
Let $Z^{(1)} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$ and $Z^{(2)} = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix}$
Feedforward Computation
1. Compute $A^{(1)} = Z^{(1)\top} W^{(1)}$
2. Apply the element-wise non-linearity: $Z^{(2)} = \tanh(A^{(1)})$
3. Compute the output: $\hat{y} = Z^{(2)\top} W^{(2)}$
4. Compute the loss on the example: $(\hat{y} - y)^2$
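A NumPy sketch of these four steps (the same made-up toy values as before; with a single output, $W^{(2)}$ is stored as a length-2 vector):

```python
import numpy as np

Z1 = np.array([0.2, -0.4, 0.7])                         # Z^(1): the inputs
W1 = np.array([[0.1, -0.2], [0.3, 0.5], [-0.6, 0.4]])   # W^(1), shape (3, 2)
W2 = np.array([0.7, -0.8])                              # W^(2), shape (2,)
y  = 1.0

A1 = Z1 @ W1            # step 1: A^(1) = Z^(1)^T W^(1)
Z2 = np.tanh(A1)        # step 2: element-wise non-linearity
y_hat = Z2 @ W2         # step 3: output
L = (y_hat - y) ** 2    # step 4: loss on the example
```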
Flowing Backward
1. Top: compute $\delta$
2. Gradient w.r.t. $W^{(2)}$: $\delta Z^{(2)}$
3. Compute $\delta_1 = (W^{(2)\top} \delta) \odot (1 - \tanh(A^{(1)})^2)$
   Notes: (a) $\odot$ is the Hadamard (element-wise) product; (b) we have written $W^{(2)\top} \delta$ since $\delta$ can be a vector when there are multiple outputs
4. Gradient w.r.t. $W^{(1)}$: $\delta_1 Z^{(1)}$
5. Update $W^{(2)} := W^{(2)} - \eta\delta Z^{(2)}$
6. Update $W^{(1)} := W^{(1)} - \eta\delta_1 Z^{(1)}$
7. All the dimensionalities nicely check out!
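A NumPy sketch of this backward sweep, continuing the forward pass above ($\delta$ is a scalar here because there is a single output; the factor of 2 is again absorbed into $\eta$, and an outer product realizes the $\delta_1 Z^{(1)}$ gradient as a matrix of the same shape as $W^{(1)}$):

```python
import numpy as np

Z1 = np.array([0.2, -0.4, 0.7])
W1 = np.array([[0.1, -0.2], [0.3, 0.5], [-0.6, 0.4]])
W2 = np.array([0.7, -0.8])
y, eta = 1.0, 0.01

A1 = Z1 @ W1
Z2 = np.tanh(A1)
y_hat = Z2 @ W2

delta   = y_hat - y                              # step 1: top-level error
grad_W2 = delta * Z2                             # step 2: gradient w.r.t. W^(2)
delta1  = (W2 * delta) * (1 - np.tanh(A1) ** 2)  # step 3: Hadamard product
grad_W1 = np.outer(Z1, delta1)                   # step 4: gradient w.r.t. W^(1), shape (3, 2)
W2 = W2 - eta * grad_W2                          # step 5
W1 = W1 - eta * grad_W1                          # step 6: dimensionalities check out
```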
So Far
Backpropagation in the context of neural networks is all about assigning credit (or blame!) for the error incurred to the weights
• We follow the path from the output (where we have an error signal) to the edge we want to consider
• We find the $\delta$s from the top down to the edge concerned by using the chain rule
• Once we have the partial derivative, we can write the update rule for that weight
What did we miss?
Exercise: What if there are multiple outputs? (Look at the slide from last class.)
Another exercise: Add bias neurons. What changes?
As we go down the network, notice that we need the previous $\delta$s
If we recompute them each time, the computation can blow up!
We need to book-keep derivatives as we go down the network and reuse them
A General View of Backpropagation
Some redundancy in the upcoming slides, but redundancy can be good!
An Aside
Backpropagation only refers to the method for computing the gradient
It is used together with another algorithm, such as SGD, to do the actual learning using that gradient
Next: computing the gradient $\nabla_x f(x, y)$ for an arbitrary $f$
$x$ is the set of variables whose derivatives are desired
Often we require the gradient of the cost $J(\theta)$ with respect to the parameters $\theta$, i.e., $\nabla_\theta J(\theta)$
Note: we restrict to the case where $f$ has a single output
First: move to a more precise computational graph language!
Computational Graphs
Formalize computation as graphs
Nodes indicate variables (a scalar, vector, tensor, or another type of variable)
Operations are simple functions of one or more variables
Our graph language comes with a set of allowable operations
Examples:
z = xy
[Figure: computational graph with variable nodes $x$ and $y$ feeding a $\times$ operation that produces $z$]
The graph uses the $\times$ operation for the computation
Logistic Regression
[Figure: computational graph: $x$ and $w$ feed a dot operation producing $u_1$; $u_1$ and $b$ feed a $+$ operation producing $u_2$; applying $\sigma$ to $u_2$ gives $\hat{y}$]
Computes $\hat{y} = \sigma(x^\top w + b)$
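A tiny NumPy sketch that evaluates this graph node by node (x, w, and b are made-up values; u1 and u2 mirror the intermediate nodes in the figure):

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5])      # input (made up)
w = np.array([0.3, 0.1, -0.4])      # weights (made up)
b = 0.2                             # bias

u1 = x @ w                          # dot node
u2 = u1 + b                         # + node
y_hat = 1.0 / (1.0 + np.exp(-u2))   # sigma node: y_hat = sigma(x^T w + b)
```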
H = max{0, XW + b}
[Figure: computational graph: $X$ and $W$ feed an MM operation producing $U_1$; $U_1$ and $b$ feed a $+$ operation producing $U_2$; applying Rc to $U_2$ gives $H$]
MM is matrix multiplication and Rc is the ReLU activation
Back to backprop: Chain Rule
Backpropagation computes the chain rule, in a manner that is highly efficient
Let $f, g : \mathbb{R} \to \mathbb{R}$
Suppose $y = g(x)$ and $z = f(y) = f(g(x))$
Chain rule: $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
[Figure: chain $x \to y \to z$, with $\frac{dy}{dx}$ labeling the edge $x \to y$ and $\frac{dz}{dy}$ labeling the edge $y \to z$]
Chain rule: $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$
[Figure: two paths from $x$ to $z$, one through $y_1$ and one through $y_2$]
Multiple paths: $\frac{dz}{dx} = \frac{dz}{dy_1}\frac{dy_1}{dx} + \frac{dz}{dy_2}\frac{dy_2}{dx}$
[Figure: $n$ paths from $x$ to $z$, through $y_1, y_2, \dots, y_n$]
Multiple paths: $\frac{dz}{dx} = \sum_j \frac{dz}{dy_j}\frac{dy_j}{dx}$
Chain Rule
Consider $x \in \mathbb{R}^m$, $y \in \mathbb{R}^n$
Let $g : \mathbb{R}^m \to \mathbb{R}^n$ and $f : \mathbb{R}^n \to \mathbb{R}$
Suppose $y = g(x)$ and $z = f(y)$; then
$$\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i}$$
In vector notation:
$$\begin{pmatrix} \frac{\partial z}{\partial x_1} \\ \vdots \\ \frac{\partial z}{\partial x_m} \end{pmatrix} = \begin{pmatrix} \sum_j \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_1} \\ \vdots \\ \sum_j \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_m} \end{pmatrix} \qquad \text{i.e.} \qquad \nabla_x z = \left(\frac{\partial y}{\partial x}\right)^{\!\top} \nabla_y z$$
Chain Rule
$$\nabla_x z = \left(\frac{\partial y}{\partial x}\right)^{\!\top} \nabla_y z$$
$\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of $g$
The gradient of $z$ with respect to $x$ is a multiplication of (the transpose of) the Jacobian matrix $\frac{\partial y}{\partial x}$ with a vector, namely the gradient $\nabla_y z$
Backpropagation consists of applying such Jacobian-gradient products to each operation in the computational graph
In general this need not only apply to vectors, but can apply to tensors w.l.o.g.
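A small NumPy check of this identity for one concrete choice of $g$ and $f$ (these particular functions, $g(x) = \tanh(Ax)$ and $f(y) = \sum_j y_j^2$, and the matrix A are assumptions made only for the check): the Jacobian-gradient product is compared against a finite-difference estimate of $\nabla_x z$.

```python
import numpy as np

A = np.array([[0.5, -1.0, 0.3],
              [0.2,  0.4, -0.7]])           # made-up 2x3 matrix, so g: R^3 -> R^2
x = np.array([0.1, -0.2, 0.3])

def g(x): return np.tanh(A @ x)              # y = g(x)
def f(y): return np.sum(y ** 2)              # z = f(y), a scalar

y = g(x)
grad_y = 2 * y                               # nabla_y z for f(y) = sum y_j^2
J = (1 - np.tanh(A @ x) ** 2)[:, None] * A   # Jacobian dy/dx, shape (2, 3)
grad_x = J.T @ grad_y                        # nabla_x z = (dy/dx)^T nabla_y z

# Finite-difference check of nabla_x z
eps, num = 1e-6, np.zeros(3)
for i in range(3):
    xp = x.copy(); xp[i] += eps
    num[i] = (f(g(xp)) - f(g(x))) / eps
print(grad_x, num)                           # should agree closely
```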
Chain Rule
We can of course also write this in terms of tensors
Let the gradient of $z$ with respect to a tensor $\mathbf{X}$ be $\nabla_{\mathbf{X}} z$
If $\mathbf{Y} = g(\mathbf{X})$ and $z = f(\mathbf{Y})$, then:
$$\nabla_{\mathbf{X}} z = \sum_j (\nabla_{\mathbf{X}} Y_j)\,\frac{\partial z}{\partial Y_j}$$
Recursive Application in a Computational Graph
Writing an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar is straightforward using the chain rule
Let the successors of some node $x$ be $\{y_1, y_2, \dots, y_n\}$
Node: computation result; Edge: computation dependency
$$\frac{dz}{dx} = \sum_{i=1}^{n} \frac{dz}{dy_i}\frac{dy_i}{dx}$$
Flow Graph (for previous slide)
[Figure: a flow graph with node $x$ at the bottom, its successors $y_1, y_2, \dots, y_n$ above it, all eventually feeding into the output $z$ at the top]
Recursive Application in a Computational Graph
Fprop: visit nodes in the order given by a topological sort; compute the value of each node given its ancestors
Bprop: set the output gradient to 1; now visit the nodes in reverse order, computing the gradient with respect to each node using the gradient with respect to its successors
Successors of $x$ in the previous slide: $\{y_1, y_2, \dots, y_n\}$:
$$\frac{dz}{dx} = \sum_{i=1}^{n} \frac{dz}{dy_i}\frac{dy_i}{dx}$$
Automatic Differentiation
The computation of the gradient can be automatically inferred from the symbolic expression of the fprop
Every node type needs to know:
• How to compute its output
• How to compute its gradients with respect to its inputs, given the gradient w.r.t. its outputs
Makes for rapid prototyping
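A minimal from-scratch sketch of this idea (not any particular library's API): each operation records its output, its inputs, and its local derivatives, and a reverse sweep over a topological ordering applies the chain rule.

```python
class Node:
    """A scalar node in a computational graph (minimal sketch, not a real library)."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # output of this node (forward pass)
        self.parents = parents          # the nodes this node was computed from
        self.local_grads = local_grads  # d(output)/d(each parent), as numbers
        self.grad = 0.0                 # accumulated dL/d(this node)

# Each op knows how to compute its output and its local derivatives w.r.t. its inputs
def add(a, b): return Node(a.value + b.value, (a, b), (1.0, 1.0))
def mul(a, b): return Node(a.value * b.value, (a, b), (b.value, a.value))

def topo_order(output):
    """Return nodes in topological order (inputs before outputs)."""
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p in n.parents:
                visit(p)
            order.append(n)
    visit(output)
    return order

def backward(output):
    """Reverse sweep: each node pushes its gradient back to its inputs."""
    output.grad = 1.0
    for node in reversed(topo_order(output)):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local   # chain rule

# z = (x * y) + x with x = 2, y = 3  =>  dz/dx = y + 1 = 4, dz/dy = x = 2
x, y = Node(2.0), Node(3.0)
z = add(mul(x, y), x)
backward(z)
print(x.grad, y.grad)   # 4.0 2.0
```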
Computational Graph for a MLP
[Figure: computational graph of a one-hidden-layer MLP with a cross-entropy cost and a weight-decay (regularization) cost; from Goodfellow et al.]
To train we want to compute $\nabla_{W^{(1)}} J$ and $\nabla_{W^{(2)}} J$
Two paths lead backwards from $J$ to the weights: through the cross-entropy cost and through the regularization cost
The weight decay cost is relatively simple: it will always contribute $2\lambda W^{(i)}$ to the gradient on $W^{(i)}$
Symbol to Symbol
[Figure: symbol-to-symbol differentiation, in which the derivative computation is added as additional nodes to the graph; from Goodfellow et al.]
In this approach backpropagation never accesses any numerical values
Instead, it just adds nodes to the graph that describe how to compute the derivatives
A graph evaluation engine will then do the actual computation
This is the approach taken by Theano and TensorFlow
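The flavor of symbol-to-symbol differentiation can be seen in miniature with SymPy (used here only as an analogy for what Theano/TensorFlow-style frameworks do; the expression and the numeric values are made up): differentiating an expression yields another symbolic expression, which is evaluated numerically only later.

```python
import sympy as sp

x, w, b = sp.symbols('x w b')
y_hat = sp.tanh(w * x + b)           # symbolic forward expression
L = (y_hat - 1) ** 2                 # loss against a target of 1 (illustrative)

dL_dw = sp.diff(L, w)                # a NEW symbolic expression for dL/dw
print(dL_dw)                         # e.g. 2*x*(1 - tanh(b + w*x)**2)*(tanh(b + w*x) - 1)
print(dL_dw.subs({x: 0.5, w: 0.2, b: 0.1}).evalf())   # numeric evaluation afterwards
```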
Next time
Regularization Methods for Deep Neural Networks