CS7015 (Deep Learning): Lecture 4 Feedforward Neural Networks, Backpropagation
Mitesh M. Khapra
Department of Computer Science and Engineering Indian Institute of Technology Madras
1/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 References/Acknowledgments See the excellent videos by Hugo Larochelle on Backpropagation
2/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons)
3/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 The input to the network is an n-dimensional hL =y ˆ = f(x) vector The network contains L − 1 hidden layers (2, in
a3 this case) having n neurons each W3 b Finally, there is one output layer containing k h 3 2 neurons (say, corresponding to k classes) Each neuron in the hidden layer and output layer a2 can be split into two parts : pre-activation and W 2 b2 activation (ai and hi are vectors) h1 The input layer can be called the 0-th layer and the output layer can be called the (L)-th layer a1 W ∈ n×n and b ∈ n are the weight and bias W i R i R 1 b1 between layers i − 1 and i (0 < i < L) W ∈ n×k and b ∈ k are the weight and bias x1 x2 xn L R L R between the last hidden layer and the output layer
(L = 3 in this case) 4/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 hL =y ˆ = f(x) The pre-activation at layer i is given by
ai(x) = bi + Wihi−1(x) a3 W3 b3 The activation at layer i is given by h2
hi(x) = g(ai(x)) a2 W where g is called the activation function (for 2 b2 h1 example, logistic, tanh, linear, etc.) The activation at the output layer is given by a1 f(x) = h (x) = O(a (x)) W L L 1 b1 where O is the output activation function (for x1 x2 xn example, softmax, linear, etc.)
To simplify notation we will refer to ai(x) as ai
and hi(x) as hi 5/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 hL =y ˆ = f(x) The pre-activation at layer i is given by
ai = bi + Wihi−1 a3 W3 b3 The activation at layer i is given by h2
hi = g(ai) a2 W where g is called the activation function (for 2 b2 h1 example, logistic, tanh, linear, etc.) The activation at the output layer is given by a1 f(x) = h = O(a ) W L L 1 b1 where O is the output activation function (for x1 x2 xn example, softmax, linear, etc.)
6/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 hL =y ˆ = f(x) N Data: {xi, yi}i=1 Model:
a3 yˆi = f(xi) = O(W3g(W2g(W1x + b1) + b2) + b3) W3 b3 h2 Parameters: a2 θ = W1, .., WL, b1, b2, ..., bL(L = 3) W2 b Algorithm: Gradient Descent with Back- h 2 1 propagation (we will see soon) Objective/Loss/Error function: Say, a1 N k 1 X X W1 b min (ˆy − y )2 1 N ij ij i=1 j=1 x x x 1 2 n In general, min L (θ)
where (θ) is some function of the parameters L 7/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)
8/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 The story so far... We have introduced feedforward neural networks We are now interested in finding an algorithm for learning the parameters of this model
9/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 hL =y ˆ = f(x) Recall our gradient descent algorithm Algorithm: gradient descent() a3 t ← 0; W3 b3 max iterations ← 1000; h2 Initialize w0, b0; while t++ < max iterations do a2 wt+1 ← wt − η∇wt; W2 b bt+1 ← bt − η∇bt; h 2 1 end a1 W 1 b1
x1 x2 xn
10/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 hL =y ˆ = f(x) Recall our gradient descent algorithm We can write it more concisely as
a3 Algorithm: gradient descent() W3 b t ← 0; h 3 2 max iterations ← 1000; Initialize θ0 = [w0, b0]; a2 while t++ < max iterations do W 2 b2 θt+1 ← θt − η∇θt; h1 end
∂L (θ) ∂L (θ) T where ∇θt = , a1 ∂wt ∂bt W Now, in this feedforward neural network, 1 b1 instead of θ = [w, b] we have θ = x1 x2 xn [W1,W2, .., WL, b1, b2, .., bL] We can still use the same algorithm for learning the parameters of our model 11/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 hL =y ˆ = f(x) Recall our gradient descent algorithm We can write it more concisely as
a3 Algorithm: gradient descent() W3 b t ← 0; h 3 2 max iterations ← 1000; 0 0 0 0 Initializeθ 0 = [W1 , ..., WL, b1, ..., bL]; a2 while t++ < max iterations do W 2 b2 θt+1 ← θt − η∇θt; h1 end
∂L (θ) ∂L (θ) ∂L (θ) ∂L (θ) T where ∇θt = , ., , , ., a1 ∂W1,t ∂WL,t ∂b1,t ∂bL,t W Now, in this feedforward neural network, 1 b1 instead of θ = [w, b] we have θ = x1 x2 xn [W1,W2, .., WL, b1, b2, .., bL] We can still use the same algorithm for learning the parameters of our model 12/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Except that now our ∇θ looks much more nasty
∂L (θ) ... ∂L (θ) ∂L (θ) ... ∂L (θ) ... ∂L (θ) ... ∂L (θ) ∂L (θ) ∂L (θ) ... ∂L (θ) ∂W111 ∂W11n ∂W211 ∂W21n ∂WL,11 ∂WL,1k ∂WL,1k ∂b11 ∂bL1 ∂L (θ) ∂L (θ) ∂L (θ) ∂L (θ) ∂L (θ) ∂L (θ) ∂L (θ) ∂L (θ) ∂L (θ) ...... ∂W121 ∂W12n ∂W221 ∂W22n ∂WL,21 ∂WL,2k ∂WL,2k ∂b12 ∂bL2 ...... ...... ∂L (θ) ... ∂L (θ) ∂L (θ) ... ∂L (θ) ... ∂L (θ) ... ∂L (θ) ∂L (θ) ∂L (θ) ... ∂L (θ) ∂W1n1 ∂W1nn ∂W2n1 ∂W2nn ∂WL,n1 ∂WL,nk ∂WL,nk ∂b1n ∂bLk
∇θ is thus composed of n×n n×k ∇W1, ∇W2, ...∇WL−1 ∈ R , ∇WL ∈ R , n k ∇b1, ∇b2, ..., ∇bL−1 ∈ R and ∇bL ∈ R
13/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 We need to answer two questions How to choose the loss function L (θ)? How to compute ∇θ which is composed of n×n n×k ∇W1, ∇W2, ..., ∇WL−1 ∈ R , ∇WL ∈ R n k ∇b1, ∇b2, ..., ∇bL−1 ∈ R and ∇bL ∈ R ?
14/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Module 4.3: Output Functions and Loss Functions
15/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 We need to answer two questions How to choose the loss function L (θ)? How to compute ∇θ which is composed of: n×n n×k ∇W1, ∇W2, ..., ∇WL−1 ∈ R , ∇WL ∈ R n k ∇b1, ∇b2, ..., ∇bL−1 ∈ R and ∇bL ∈ R ?
16/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 The choice of loss function depends yi = {7.5 8.2 7.7} on the problem at hand imdb Critics RT We will illustrate this with the help Rating Rating Rating of two examples Consider our movie example again but this time we are interested in predicting ratings Neural network with 3 Here yi ∈ R L − 1 hidden layers The loss function should capture how muchy ˆi deviates from yi n If yi ∈ R then the squared error loss can capture this deviation isActor isDirector N 3 1 X X 2 ...... L (θ) = (ˆyij − yij) Damon Nolan N i=1 j=1 xi
17/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 hL =y ˆ = f(x) A related question: What should the output function ‘O’ be if yi ∈ R? More specifically, can it be the logistic a3 function? W3 b3 h2 No, because it restrictsy ˆi to a value between 0 & 1 but we wanty ˆi ∈ R a2 So, in such cases it makes sense to W have ‘O’ as linear function 2 b2 h1 f(x) = hL = O(aL) = W a + b a1 O L O W 1 b1 yˆi = f(xi) is no longer bounded between 0 and 1 x1 x2 xn
18/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Intentionally left blank 19/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Now let us consider another problem y = [1 0 0 0] for which a different loss function Apple Mango Orange Banana would be appropriate Suppose we want to classify an image into 1 of k classes Here again we could use the squared Neural network with error loss to capture the deviation L − 1 hidden layers But can you think of a better function?
20/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Notice that y is a probability y = [1 0 0 0] distribution Apple Mango Orange Banana Therefore we should also ensure that yˆ is a probability distribution What choice of the output activation ‘O’ will ensure this ? Neural network with aL = WLhL−1 + bL L − 1 hidden layers eaL,j yˆj = O(aL)j = Pk aL,i i=1 e
th O(aL)j is the j element ofy ˆ and aL,j th is the j element of the vector aL. This function is called the softmax function
21/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 hL =y ˆ = f(x)
a3 W3 b3 h2
a2 W 2 b2 h1
a1 W 1 b1
x1 x2 xn Now that we have ensured that both y = [1 0 0 0] y &y ˆ are probability distributions Apple Mango Orange Banana can you think of a function which captures the difference between them? Cross-entropy
Neural network with k L − 1 hidden layers X L (θ) = − yc logy ˆc c=1 Notice that
yc = 1 if c = ` (the true class label) = 0 otherwise
∴ L (θ) = − logy ˆ`
22/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 So, for classification problem (where you have hL =y ˆ = f(x) to choose 1 of K classes), we use the following objective function
a3 minimize L (θ) = − logy ˆ` θ W3 b3 h2 or maximize − L (θ) = logy ˆ` θ But wait! a 2 Isy ˆ a function of θ = [W ,W , ., W , b , b , ., b ]? W ` 1 2 L 1 2 L 2 b2 h1 Yes, it is indeed a function of θ
yˆ` = [O(W3g(W2g(W1x + b1) + b2) + b3)]` a1 What doesy ˆ` encode? W th 1 b1 It is the probability that x belongs to the ` class (bring it as close to 1). x1 x2 xn logy ˆ` is called the log-likelihood of the data. 23/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Outputs
Real Values Probabilities
Output Activation Linear Softmax
Loss Function Squared Error Cross Entropy
Of course, there could be other loss functions depending on the problem at hand but the two loss functions that we just saw are encountered very often For the rest of this lecture we will focus on the case where the output activation is a softmax function and the loss function is cross entropy
24/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Module 4.4: Backpropagation (Intuition)
25/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 We need to answer two questions How to choose the loss function L (θ)? How to compute ∇θ which is composed of: n×n n×k ∇W1, ∇W2, ..., ∇WL−1 ∈ R , ∇WL ∈ R n k ∇b1, ∇b2, ..., ∇bL−1 ∈ R and ∇bL ∈ R ?
26/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 yˆ = f(x) Let us focus on this one Algorithm: gradient descent() weight (W112). a 31 t ← 0; To learn this weight W311 b h 3 using SGD we need a 21 max iterations ← 1000; formula for ∂L (θ) . Initialize θ0; ∂W112 a21 W211 b while We will see how to h 2 11 t++ < max iterations do calculate this. θt+1 ← θt − η∇θt; a11 W W 111 112 b1 end
x1 x2 xd
27/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 (θ) First let us take the simple case when L yˆ = f(x) we have a deep but thin network. In this case it is easy to find the derivative by chain rule. aL1 WL11 h21 ∂L (θ) ∂L (θ) ∂yˆ ∂aL11 ∂h21 ∂a21 ∂h11 ∂a11 = ∂W111 ∂yˆ ∂aL11 ∂h21 ∂a21 ∂h11 ∂a11 ∂W111 a21 ∂L (θ) ∂L (θ) ∂h11 W211 = (just compressing the chain rule) h ∂W111 ∂h11 ∂W111 11 ∂L (θ) ∂L (θ) ∂h21 = ∂W211 ∂h21 ∂W211 a11 W111 ∂L (θ) ∂L (θ) ∂aL1 = ∂WL11 ∂aL1 ∂WL11 x1
28/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Let us see an intuitive explanation of backpropagation before we get into the mathematical details
29/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 We get a certain loss at the output and we try to − logy ˆ` figure out who is responsible for this loss So, we talk to the output layer and say “Hey! You are not producing the desired output, better take responsibility”. a3 The output layer says “Well, I take responsibility W3 b3 for my part but please understand that I am only h2 as the good as the hidden layer and weights below me”. After all ... a2 W2 b2 f(x) =y ˆ = O(WLhL−1 + bL) h1
a1 W1 b1
x1 x2 xn
30/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 So, we talk to WL, bL and hL and ask them “What is − logy ˆ` wrong with you?”
WL and bL take full responsibility but hL says “Well, please understand that I am only as good as the pre- activation layer”
The pre-activation layer in turn says that I am only as a3 good as the hidden layer and weights below me. W3 b3 We continue in this manner and realize that the h2 responsibility lies with all the weights and biases (i.e. all the parameters of the model) a2 But instead of talking to them directly, it is easier to W2 b2 talk to them through the hidden layers and output h1 layers (and this is exactly what the chain rule allows us to do) a1 W1 b1 ∂L (θ) ∂L (θ) ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 = ∂W111 ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 ∂W111 | {z } | {z } | {z } | {z } | {z } x1 x2 xn Talk to the Talk to the Talk to the Talk to the and now weight directly output layer previous hidden previous talk to layer hidden layer the weights 31/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Quantities of interest (roadmap for the remaining part): Gradient w.r.t. output units Gradient w.r.t. hidden units Gradient w.r.t. weights and biases
∂L (θ) ∂L (θ) ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 = ∂W111 ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 ∂W111 | {z } | {z } | {z } | {z } | {z } Talk to the Talk to the Talk to the Talk to the and now weight directly output layer previous hidden previous talk to layer hidden layer the weights
Our focus is on Cross entropy loss and Softmax output.
32/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Module 4.5: Backpropagation: Computing Gradients w.r.t. the Output Units
33/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Quantities of interest (roadmap for the remaining part): Gradient w.r.t. output units Gradient w.r.t. hidden units Gradient w.r.t. weights
∂L (θ) ∂L (θ) ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 = ∂W111 ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 ∂W111 | {z } | {z } | {z } | {z } | {z } Talk to the Talk to the Talk to the Talk to the and now weight directly output layer previous hidden previous talk to layer hidden layer the weights
Our focus is on Cross entropy loss and Softmax output.
34/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Let us first consider the partial derivative − logy ˆ` w.r.t. i-th output
L (θ) = − logy ˆ` (` = true class label) ∂ ∂ (L (θ)) = (− logy ˆ`) a3 ∂yˆi ∂yˆi W3 b 1 h 3 = − if i = ` 2 yˆ` = 0 otherwise a2 W2 b2 More compactly, h1 ∂ 1(i=`) (L (θ)) = − ∂yˆi yˆ` a1 W1 b1
x1 x2 xn
35/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 − logy ˆ`
∂ 1(`=i) (L (θ)) = − ∂yˆi yˆ` We can now talk about the gradient a3 w.r.t. the vectory ˆ W3 b3 h2 ∂L (θ) 1`=1 a2 ∂yˆ1 1 1 `=2 W b . 2 2 ∇ˆyL (θ) = . = − . h1 yˆ` . ∂L (θ) . ∂yˆk 1`=k a1 1 W1 b1 = − e` yˆ` x1 x2 xn where e(`) is a k-dimensional vector whose `-th element is 1 and all other elements are 0. 36/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 What we are actually interested in is − logy ˆ` ∂L (θ) ∂(− logy ˆ ) = ` ∂aLi ∂aLi ∂(− logy ˆ ) ∂yˆ = ` ` a3 ∂yˆ` ∂aLi W3 b3 h2 Doesy ˆ` depend on aLi ? Indeed, it does.
exp(aL`) a2 yˆ` = P W b exp(aLi) 2 2 i h1 Having established this, we will now derive the full expression on the next a1 W slide 1 b1
x1 x2 xn
37/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 ∂ −1 ∂ − logy ˆ` = yˆ` ∂aLi yˆ` ∂aLi g(x) ∂ ∂g(x) 1 g(x) ∂h(x) −1 ∂ h(x) = − = softmax(aL)` 2 yˆ` ∂aLi ∂x ∂x h(x) h(x) ∂x −1 ∂ exp(aL)` = P yˆ` ∂aLi i0 exp(aL)` ∂ ∂ P ! exp(aL)` exp(aL)` ∂a i0 exp(aL)i0 −1 ∂aLi Li = P − P 2 yˆ` i0 exp(aL)i0 ( i0 (exp(aL)i0 ) 1 ! −1 (`=i) exp(aL)` exp(aL)` exp(aL)i = P − P P yˆ` i0 exp(aL)i0 i0 exp(aL)i0 i0 exp(aL)i0 −1 = 1(`=i)softmax(aL)` − softmax(aL)`softmax(aL)i yˆ` −1 = 1(`=i)yˆ` − yˆ`yˆi yˆ` = − 1(`=i) − yˆi
38/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 So far we have derived the partial derivative w.r.t. − logy ˆ` the i-th element of aL
∂L (θ) = −(1`=i − yˆi) ∂aL,i a3 We can now write the gradient w.r.t. the vector a W3 L b3 h2 ∂L (θ) − (1`=1 − yˆ1) ∂a a2 L1 − (1`=2 − yˆ2) . W2 b2 ∇aL L (θ) = . = . . h1 ∂L (θ) . ∂aLk − (1`=k − yˆk) a = −(e(`) − yˆ) 1 W1 b1
x1 x2 xn
39/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Module 4.6: Backpropagation: Computing Gradients w.r.t. Hidden Units
40/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Quantities of interest (roadmap for the remaining part): Gradient w.r.t. output units Gradient w.r.t. hidden units Gradient w.r.t. weights and biases
∂L (θ) ∂L (θ) ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 = ∂W111 ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 ∂W111 | {z } | {z } | {z } | {z } | {z } Talk to the Talk to the Talk to the Talk to the and now weight directly output layer previous hidden previous talk to layer hidden layer the weights
Our focus is on Cross entropy loss and Softmax output.
41/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Chain rule along multiple paths: If a − logy ˆ` function p(z) can be written as a function of intermediate results qi(z) then we have :
∂p(z) X ∂p(z) ∂qm(z) a = 3 W3 ∂z ∂qm(z) ∂z b3 m h2
In our case: a2 p(z) is the loss function L (θ) W2 b2 h1 z = hij
qm(z) = aLm a1 W1 b1
x1 x2 xn
42/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Intentionally left blank 43/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 − logy ˆ k ` ∂ (θ) X ∂ (θ) ∂a L = L i+1,m ∂hij ∂ai+1,m ∂hij m=1 k X ∂ (θ) = L W ∂ai+1,m i+1,m,j a3 m=1 W3 b3 Now consider these two vectors, h2
∂L (θ) a2 ∂ai+1,1 Wi+1,1,j W2 b2 . . h ∇ai+1 L (θ) = . ; Wi+1, · ,j = . 1 ∂L (θ) Wi+1,k,j ∂ai+1,k a1 W1 b1 Wi+1, · ,j is the j-th column of Wi+1; see that, x x x k 1 2 n T X ∂L (θ) (Wi+1, · ,j) ∇ai+1 L (θ) = Wi+1,m,j ∂ai+1,m m=1 ai+1 = Wi+1hij + bi+1 44/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 − logy ˆ`
∂L (θ) T We have, = (Wi+1,.,j) ∇ai+1 L (θ) ∂hij We can now write the gradient w.r.t. h i a3 W3 ∂L (θ) b3 T h2 ∂hi1 (Wi+1, · ,1) ∇ai+1 L (θ) ∂L (θ) T (Wi+1, · ,2) ∇a L (θ) ∂hi2 i+1 ∇hi L (θ) = . = . a2 . . . W2 b2 ∂L (θ) T h (Wi+1, · ,n) ∇ai+1 L (θ) 1 ∂hin T = (Wi+1) (∇a L (θ)) i+1 a1 W1 b1 We are almost done except that we do not x1 x2 xn know how to calculate ∇ai+1 L (θ) for i < L−1 We will see how to compute that
45/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 − logy ˆ` ∂L (θ) ∂ai1 . ∇ai L (θ) = . ∂L (θ) ∂ain a3 W3 ∂L (θ) ∂L (θ) ∂hij b3 = h2 ∂aij ∂hij ∂aij ∂L (θ) 0 a = g (aij)[∵ hij = g(aij)] 2 ∂hij W2 b2 0 h1 ∂L (θ) g (a ) ∂hi1 i1 . ∇ai L (θ) = . a1 0 W1 b ∂L (θ) g (a ) 1 ∂hin in 0 x1 x2 xn = ∇hi L (θ) [. . . , g (aik),... ]
46/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters
47/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Quantities of interest (roadmap for the remaining part): Gradient w.r.t. output units Gradient w.r.t. hidden units Gradient w.r.t. weights and biases
∂L (θ) ∂L (θ) ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 = ∂W111 ∂yˆ ∂a3 ∂h2 ∂a2 ∂h1 ∂a1 ∂W111 | {z } | {z } | {z } | {z } | {z } Talk to the Talk to the Talk to the Talk to the and now weight directly output layer previous hidden previous talk to layer hidden layer the weights
Our focus is on Cross entropy loss and Softmax output.
48/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Recall that, − logy ˆ`
ak = bk + Wkhk−1 ∂aki = hk−1,j ∂Wkij a3 ∂L (θ) ∂L (θ) ∂aki W3 = b3 ∂Wkij ∂aki ∂Wkij h2 ∂L (θ) = hk−1,j a2 ∂aki W2 b2 ∂ (θ) ∂ (θ) ∂ (θ) L L ...... L h1 ∂Wk11 ∂Wk12 ∂Wk1n ...... ∇W L (θ) = . . . . . a1 k . . . . . W1 b1 ...... ∂L (θ) ∂Wknn x1 x2 xn
49/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Intentionally left blank 50/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 3×3 Lets take a simple example of a Wk ∈ R and see what each entry looks like
∂L (θ) ∂L (θ) ∂L (θ) ∂Wk11 ∂Wk12 ∂Wk13 ∂L (θ) ∂L (θ) ∂L (θ) ∂L (θ) ∂L (θ) ∂aki ∇W L (θ) = = k ∂Wk21 ∂Wk22 ∂Wk23 ∂Wkij ∂aki ∂Wkij ∂L (θ) ∂L (θ) ∂L (θ) ∂Wk31 ∂Wk32 ∂Wk33
∂L (θ) ∂L (θ) ∂L (θ) hk−1,1 hk−1,2 hk−1,3 ∂ak1 ∂ak1 ∂ak1 ∂L (θ) ∂L (θ) ∂L (θ) T ∇W L (θ) = hk−1,1 hk−1,2 hk−1,3 = ∇a L (θ) · hk−1 k ∂ak2 ∂ak2 ∂ak2 k ∂L (θ) ∂L (θ) ∂L (θ) hk−1,1 hk−1,2 hk−1,3 ∂ak3 ∂ak3 ∂ak3
51/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Finally, coming to the biases − logy ˆ` X aki = bki + Wkijhk−1,j j
∂L (θ) ∂L (θ) ∂aki = a3 ∂bki ∂aki ∂bki W3 b ∂L (θ) h 3 = 2 ∂aki a2 We can now write the gradient w.r.t. the vector W2 b2 bk h1
∂L (θ) a ak1 1 ∂L (θ) W1 b1 ak2 ∇bk L (θ) = . = ∇ak L (θ) . x x x . 1 2 n ∂L (θ) akn
52/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Module 4.8: Backpropagation: Pseudo code
53/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Finally, we have all the pieces of the puzzle
∇aL L (θ) (gradient w.r.t. output layer)
∇hk L (θ), ∇ak L (θ) (gradient w.r.t. hidden layers, 1 ≤ k < L)
∇Wk L (θ), ∇bk L (θ) (gradient w.r.t. weights and biases, 1 ≤ k ≤ L)
We can now write the full learning algorithm
54/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Algorithm: gradient descent() t ← 0; max iterations ← 1000; 0 0 0 0 Initializeθ 0 = [W1 , ..., WL, b1, ..., bL]; while t++ < max iterations do h1, h2, ..., hL−1, a1, a2, ..., aL, yˆ = forward propagation(θt); ∇θt = backward propagation(h1, h2, ..., hL−1, a1, a2, ..., aL, y, yˆ); θt+1 ← θt − η∇θt; end
55/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Algorithm: forward propagation(θ) for k = 1 to L − 1 do ak = bk + Wkhk−1; hk = g(ak); end aL = bL + WLhL−1; yˆ = O(aL);
56/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Just do a forward propagation and compute all hi’s, ai’s, andy ˆ
Algorithm: back propagation(h1, h2, ..., hL−1, a1, a2, ..., aL, y, yˆ) //Compute output gradient ;
∇aL L (θ) = −(e(y) − yˆ); for k = L to 1 do // Compute gradients w.r.t. parameters ; T ∇Wk L (θ) = ∇ak L (θ)hk−1 ;
∇bk L (θ) = ∇ak L (θ); // Compute gradients w.r.t. layer below ; T ∇hk−1 L (θ) = Wk (∇ak L (θ)) ; // Compute gradients w.r.t. layer below (pre-activation); 0 ∇ak−1 L (θ) = ∇hk−1 L (θ) [. . . , g (ak−1,j),... ]; end
57/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Module 4.9: Derivative of the activation function
58/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4 Now, the only thing we need to figure out is how to compute g0 Logistic function tanh
g(z) = σ(z) g(z) = tanh (z) 1 ez − e−z = = 1 + e−z ez + e−z ! 0 1 d −z (ez + e−z) d (ez − e−z) g (z) = (−1) −z 2 (1 + e ) dz (1 + e ) dz − (ez − e−z) d (ez + e−z) 0 dz 1 −z g (z) = = (−1) (−e ) z −z 2 (1 + e−z)2 (e + e ) z −z 2 z −z 2 −z (e + e ) − (e − e ) 1 1 + e − 1 = = z −z 2 1 + e−z 1 + e−z (e + e ) (ez − e−z)2 = g(z)(1 − g(z)) =1 − (ez + e−z)2 =1 − (g(z))2
59/9 Mitesh M. Khapra CS7015 (Deep Learning): Lecture 4