
CS7015 (Deep Learning): Lecture 4: Feedforward Neural Networks

Mitesh M. Khapra

Department of Computer Science and Engineering, Indian Institute of Technology Madras

References/Acknowledgments: see the excellent videos by Hugo Larochelle on Backpropagation.

Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons)

The input to the network is an n-dimensional vector. The network contains L − 1 hidden layers (2, in this case) having n neurons each. Finally, there is one output layer containing k neurons (say, corresponding to k classes). [Figure: a feedforward network with inputs x_1, x_2, ..., x_n, pre-activations a_1, a_2, a_3, hidden activations h_1, h_2, weights W_1, W_2, W_3, biases b_1, b_2, b_3, and output h_L = ŷ = f(x).]

Each neuron in the hidden layers and output layer can be split into two parts: pre-activation and activation (a_i and h_i are vectors). The input layer can be called the 0-th layer and the output layer can be called the L-th layer. W_i ∈ R^(n×n) and b_i ∈ R^n are the weight and bias between layers i − 1 and i (0 < i < L). W_L ∈ R^(n×k) and b_L ∈ R^k are the weight and bias between the last hidden layer and the output layer (L = 3 in this case).

The pre-activation at layer i is given by

    a_i(x) = b_i + W_i h_{i−1}(x)

The activation at layer i is given by

    h_i(x) = g(a_i(x))

where g is called the activation function (for example, logistic, tanh, linear, etc.). The activation at the output layer is given by

    f(x) = h_L(x) = O(a_L(x))

where O is the output activation function (for example, softmax, linear, etc.). To simplify notation we will refer to a_i(x) as a_i and h_i(x) as h_i, so that

    a_i = b_i + W_i h_{i−1},    h_i = g(a_i),    f(x) = h_L = O(a_L)
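To make this concrete, here is a minimal numpy sketch of the forward pass (not from the lecture; the logistic g, the softmax O, and the convention of storing each W with shape (fan-out, fan-in) so that W @ h works are illustrative assumptions):

    import numpy as np

    def g(a):
        # activation function: elementwise logistic
        return 1.0 / (1.0 + np.exp(-a))

    def O(a_L):
        # output activation function: softmax
        e = np.exp(a_L - a_L.max())
        return e / e.sum()

    def f(x, Ws, bs):
        # h_0 = x; a_i = b_i + W_i h_{i-1}; h_i = g(a_i); y_hat = O(a_L)
        h = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            h = g(b + W @ h)
        return O(bs[-1] + Ws[-1] @ h)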

Data: {(x_i, y_i)}_{i=1}^N

Model:

    ŷ_i = f(x_i) = O(W_3 g(W_2 g(W_1 x_i + b_1) + b_2) + b_3)

Parameters:

    θ = W_1, ..., W_L, b_1, b_2, ..., b_L    (L = 3)

Algorithm: gradient descent with backpropagation (we will see this soon)

Objective/Loss/Error function: say,

    min (1/N) Σ_{i=1}^N Σ_{j=1}^k (ŷ_ij − y_ij)²

In general,

    min L(θ)

where L(θ) is some function of the parameters.

Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)

The story so far: we have introduced feedforward neural networks, and we are now interested in finding an algorithm for learning the parameters of this model.

Recall our gradient descent algorithm:

    Algorithm: gradient descent()
        t ← 0;
        max_iterations ← 1000;
        Initialize w_0, b_0;
        while t++ < max_iterations do
            w_{t+1} ← w_t − η∇w_t;
            b_{t+1} ← b_t − η∇b_t;
        end

We can write it more concisely as:

    Algorithm: gradient descent()
        t ← 0;
        max_iterations ← 1000;
        Initialize θ_0 = [w_0, b_0];
        while t++ < max_iterations do
            θ_{t+1} ← θ_t − η∇θ_t;
        end

where ∇θ_t = [∂L(θ)/∂w_t, ∂L(θ)/∂b_t]^T.

Now, in this feedforward neural network, instead of θ = [w, b] we have θ = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L]. We can still use the same algorithm for learning the parameters of our model: initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0] and update with

    ∇θ_t = [∂L(θ)/∂W_{1,t}, ..., ∂L(θ)/∂W_{L,t}, ∂L(θ)/∂b_{1,t}, ..., ∂L(θ)/∂b_{L,t}]^T

Except that now our ∇θ looks much more nasty:
∇θ stacks the partial derivative of L(θ) with respect to every entry of every weight matrix and every bias vector:

    ∇θ = [ ∂L(θ)/∂W_{1,11} ... ∂L(θ)/∂W_{1,1n}   ∂L(θ)/∂W_{2,11} ... ∂L(θ)/∂W_{2,1n}   ...   ∂L(θ)/∂W_{L,11} ... ∂L(θ)/∂W_{L,1k}   ∂L(θ)/∂b_{11} ... ∂L(θ)/∂b_{L1}
              ...                  ...                ...                 ...                     ...                  ...                ...               ...
           ∂L(θ)/∂W_{1,n1} ... ∂L(θ)/∂W_{1,nn}   ∂L(θ)/∂W_{2,n1} ... ∂L(θ)/∂W_{2,nn}   ...   ∂L(θ)/∂W_{L,n1} ... ∂L(θ)/∂W_{L,nk}   ∂L(θ)/∂b_{1n} ... ∂L(θ)/∂b_{Lk} ]

∇θ is thus composed of ∇W_1, ∇W_2, ..., ∇W_{L−1} ∈ R^(n×n), ∇W_L ∈ R^(n×k), ∇b_1, ∇b_2, ..., ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k.
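The update rule itself does not care how many parameter arrays there are; a minimal sketch (the `grad` callback, which should return all the partial gradients, is a hypothetical placeholder for the backpropagation procedure we derive below):

    import numpy as np

    def gradient_descent(theta, grad, eta=0.1, max_iterations=1000):
        # theta: list of parameter arrays [W_1, ..., W_L, b_1, ..., b_L]
        # grad(theta): list of gradient arrays of the same shapes
        for t in range(max_iterations):
            nabla = grad(theta)
            # theta_{t+1} <- theta_t - eta * (nabla theta_t), array by array
            theta = [p - eta * gp for p, gp in zip(theta, nabla)]
        return theta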

We need to answer two questions:

1. How to choose the loss function L(θ)?
2. How to compute ∇θ, which is composed of ∇W_1, ∇W_2, ..., ∇W_{L−1} ∈ R^(n×n), ∇W_L ∈ R^(n×k), ∇b_1, ∇b_2, ..., ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k?

Module 4.3: Output Functions and Loss Functions


The choice of loss function depends on the problem at hand. We will illustrate this with the help of two examples.

Consider our movie example again, but this time we are interested in predicting ratings: the IMDb rating, the critics rating, and the RT rating, e.g. y_i = {7.5, 8.2, 7.7}. [Figure: a neural network with L − 1 hidden layers whose inputs x_i are features such as isActor Damon, ..., isDirector Nolan, and whose three outputs are the ratings.] Here y_i ∈ R³. The loss function should capture how much ŷ_i deviates from y_i. If y_i ∈ R^n then the squared error loss can capture this deviation:

    L(θ) = (1/N) Σ_{i=1}^N Σ_{j=1}^3 (ŷ_ij − y_ij)²

A related question: what should the output function 'O' be if y_i ∈ R? More specifically, can it be the logistic function? No, because the logistic restricts ŷ_i to a value between 0 and 1, but we want ŷ_i ∈ R. So, in such cases it makes sense to have 'O' be a linear function:

    f(x) = h_L = O(a_L) = W_O a_L + b_O

ŷ_i = f(x_i) is then no longer bounded between 0 and 1.

Now let us consider another problem for which a different loss function would be appropriate. Suppose we want to classify an image into 1 of k classes, e.g. y = [1 0 0 0] over the classes Apple, Mango, Orange, Banana (again predicted by a neural network with L − 1 hidden layers). Here again we could use the squared error loss to capture the deviation. But can you think of a better function?

Notice that y is a probability distribution. Therefore we should also ensure that ŷ is a probability distribution. What choice of the output activation 'O' will ensure this?

    a_L = W_L h_{L−1} + b_L

    ŷ_j = O(a_L)_j = e^(a_{L,j}) / Σ_{i=1}^k e^(a_{L,i})

O(a_L)_j is the j-th element of ŷ and a_{L,j} is the j-th element of the vector a_L. This function is called the softmax function.
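A direct implementation of this formula can overflow for large pre-activations, so a standard trick is to subtract the maximum first; a small sketch (the shift is a numerical-stability detail, not something the lecture discusses):

    import numpy as np

    def softmax(a_L):
        # y_hat_j = exp(a_Lj) / sum_i exp(a_Li)
        # Subtracting max(a_L) leaves the result unchanged but avoids overflow.
        e = np.exp(a_L - np.max(a_L))
        return e / np.sum(e)

    print(softmax(np.array([2.0, 1.0, 0.1])))  # entries sum to 1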

Now that we have ensured that both y and ŷ are probability distributions, can you think of a function which captures the difference between them? Cross-entropy:

    L(θ) = −Σ_{c=1}^k y_c log ŷ_c

Notice that

    y_c = 1 if c = ℓ (the true class label)
        = 0 otherwise

    ∴ L(θ) = −log ŷ_ℓ
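In code, for a one-hot y the sum collapses to a single term; a minimal sketch (the small epsilon guarding against log(0) is an implementation assumption):

    import numpy as np

    def cross_entropy(y_hat, label, eps=1e-12):
        # L(theta) = -sum_c y_c log(y_hat_c) = -log(y_hat_label) for one-hot y
        return -np.log(y_hat[label] + eps)

    y_hat = np.array([0.7, 0.1, 0.1, 0.1])
    print(cross_entropy(y_hat, 0))  # -log(0.7) ~ 0.357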

So, for a classification problem (where you have to choose 1 of k classes), we use the following objective function:

    minimize_θ L(θ) = −log ŷ_ℓ    or    maximize_θ −L(θ) = log ŷ_ℓ

But wait! Is ŷ_ℓ a function of θ = [W_1, W_2, ..., W_L, b_1, b_2, ..., b_L]? Yes, it is indeed a function of θ:

    ŷ_ℓ = [O(W_3 g(W_2 g(W_1 x + b_1) + b_2) + b_3)]_ℓ

What does ŷ_ℓ encode? It is the probability that x belongs to the ℓ-th class (we want to bring it as close to 1 as possible). log ŷ_ℓ is called the log-likelihood of the data.

    Outputs:             Real Values        Probabilities
    Output Activation:   Linear             Softmax
    Loss Function:       Squared Error      Cross Entropy

Of course, there could be other loss functions depending on the problem at hand, but the two loss functions that we just saw are encountered very often. For the rest of this lecture we will focus on the case where the output activation is the softmax function and the loss function is cross entropy.

Module 4.4: Backpropagation (Intuition)

We have now answered the first of our two questions (how to choose the loss function L(θ)). The remaining question: how to compute ∇θ, which is composed of ∇W_1, ∇W_2, ..., ∇W_{L−1} ∈ R^(n×n), ∇W_L ∈ R^(n×k), ∇b_1, ∇b_2, ..., ∇b_{L−1} ∈ R^n and ∇b_L ∈ R^k?

Let us focus on this one weight, W_{112}. To learn this weight using SGD (the gradient descent algorithm above) we need a formula for ∂L(θ)/∂W_{112}. We will see how to calculate this. [Figure: a thin network with inputs x_1, x_2, ..., x_d, weights W_{111} and W_{112} feeding a_{11}, then W_{211} and W_{311} feeding a_{21} and a_{31}, up to the output ŷ = f(x).]

First let us take the simple case when we have a deep but thin network (one neuron in each layer: x_1 → h_{11} → h_{21} → ŷ = f(x), with weights W_{111}, W_{211}, W_{L11}). In this case it is easy to find the derivative by the chain rule:

    ∂L(θ)/∂W_{111} = (∂L(θ)/∂ŷ)(∂ŷ/∂a_{L1})(∂a_{L1}/∂h_{21})(∂h_{21}/∂a_{21})(∂a_{21}/∂h_{11})(∂h_{11}/∂a_{11})(∂a_{11}/∂W_{111})

    ∂L(θ)/∂W_{111} = (∂L(θ)/∂h_{11})(∂h_{11}/∂W_{111})    (just compressing the chain rule)

    ∂L(θ)/∂W_{211} = (∂L(θ)/∂h_{21})(∂h_{21}/∂W_{211})

    ∂L(θ)/∂W_{L11} = (∂L(θ)/∂a_{L1})(∂a_{L1}/∂W_{L11})
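To see the chain rule in action, here is a tiny sketch for such a thin network (the logistic activations, squared-error loss, absence of biases, and all numbers are illustrative assumptions chosen only to make the example concrete):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # A thin 3-layer network: x -> h11 -> h21 -> y_hat, one neuron per layer.
    x, y = 0.5, 1.0
    W111, W211, WL11 = 0.1, 0.2, 0.3

    h11 = sigmoid(W111 * x)
    h21 = sigmoid(W211 * h11)
    y_hat = sigmoid(WL11 * h21)
    L = 0.5 * (y_hat - y) ** 2

    # Chain rule: dL/dW111 = dL/dy_hat * dy_hat/da_L1 * ... * da_11/dW111
    dL_dW111 = ((y_hat - y)
                * y_hat * (1 - y_hat) * WL11   # through the output layer
                * h21 * (1 - h21) * W211       # through h21
                * h11 * (1 - h11) * x)         # through h11, down to W111

    # Numerical check by finite differences
    eps = 1e-6
    h11e = sigmoid((W111 + eps) * x)
    Le = 0.5 * (sigmoid(WL11 * sigmoid(W211 * h11e)) - y) ** 2
    print(dL_dW111, (Le - L) / eps)  # the two values agree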

Let us see an intuitive explanation of backpropagation before we get into the mathematical details.
We get a certain loss at the output and we try to figure out who is responsible for this loss. So, we talk to the output layer and say, "Hey! You are not producing the desired output, better take responsibility." The output layer says, "Well, I take responsibility for my part, but please understand that I am only as good as the hidden layer and weights below me." After all,

    f(x) = ŷ = O(W_L h_{L−1} + b_L)

So, we talk to W_L, b_L and h_L and ask them, "What is wrong with you?" W_L and b_L take full responsibility, but h_L says, "Well, please understand that I am only as good as the pre-activation layer." The pre-activation layer in turn says that it is only as good as the hidden layer and weights below it. We continue in this manner and realize that the responsibility lies with all the weights and biases (i.e., all the parameters of the model). But instead of talking to them directly, it is easier to talk to them through the hidden layers and output layers (and this is exactly what the chain rule allows us to do):

    ∂L(θ)/∂W_{111} = (∂L(θ)/∂ŷ · ∂ŷ/∂a_3) · (∂a_3/∂h_2 · ∂h_2/∂a_2) · (∂a_2/∂h_1 · ∂h_1/∂a_1) · (∂a_1/∂W_{111})

The left-hand side talks to the weight directly; on the right we first talk to the output layer, then to the previous hidden layer, then to the hidden layer before it, and only then to the weights themselves.

Quantities of interest (roadmap for the remaining part):

- Gradient w.r.t. output units
- Gradient w.r.t. hidden units
- Gradient w.r.t. weights and biases

Our focus is on the cross-entropy loss and the softmax output.
Module 4.5: Backpropagation: Computing Gradients w.r.t. the Output Units

We start with the first quantity of interest on the roadmap: the gradient w.r.t. the output units.

Let us first consider the partial derivative w.r.t. the i-th output:

    L(θ) = −log ŷ_ℓ    (ℓ = the true class label)

    ∂L(θ)/∂ŷ_i = ∂(−log ŷ_ℓ)/∂ŷ_i = −1/ŷ_ℓ if i = ℓ
               = 0 otherwise

More compactly,

    ∂L(θ)/∂ŷ_i = −𝟙(i=ℓ)/ŷ_ℓ

We can now talk about the gradient w.r.t. the vector ŷ:

    ∇_ŷ L(θ) = [∂L(θ)/∂ŷ_1, ..., ∂L(θ)/∂ŷ_k]^T = −(1/ŷ_ℓ) [𝟙(ℓ=1), 𝟙(ℓ=2), ..., 𝟙(ℓ=k)]^T = −(1/ŷ_ℓ) e(ℓ)

where e(ℓ) is a k-dimensional vector whose ℓ-th element is 1 and all other elements are 0.

What we are actually interested in is

    ∂L(θ)/∂a_{Li} = ∂(−log ŷ_ℓ)/∂a_{Li} = (∂(−log ŷ_ℓ)/∂ŷ_ℓ)(∂ŷ_ℓ/∂a_{Li})

Does ŷ_ℓ depend on a_{Li}? Indeed, it does:

    ŷ_ℓ = exp(a_{Lℓ}) / Σ_i exp(a_{Li})

Having established this, we can now derive the full expression.
    ∂(−log ŷ_ℓ)/∂a_{Li}
        = −(1/ŷ_ℓ) ∂ŷ_ℓ/∂a_{Li}                                        [ŷ_ℓ = softmax(a_L)_ℓ]
        = −(1/ŷ_ℓ) ∂/∂a_{Li} [exp(a_{Lℓ}) / Σ_{i′} exp(a_{Li′})]

Using the quotient rule, ∂/∂x (g(x)/h(x)) = (∂g(x)/∂x)(1/h(x)) − (g(x)/h(x)²)(∂h(x)/∂x):

        = −(1/ŷ_ℓ) [ (∂exp(a_{Lℓ})/∂a_{Li}) / Σ_{i′} exp(a_{Li′}) − exp(a_{Lℓ}) (∂(Σ_{i′} exp(a_{Li′}))/∂a_{Li}) / (Σ_{i′} exp(a_{Li′}))² ]
        = −(1/ŷ_ℓ) [ 𝟙(ℓ=i) exp(a_{Lℓ}) / Σ_{i′} exp(a_{Li′}) − (exp(a_{Lℓ}) / Σ_{i′} exp(a_{Li′})) (exp(a_{Li}) / Σ_{i′} exp(a_{Li′})) ]
        = −(1/ŷ_ℓ) [ 𝟙(ℓ=i) softmax(a_L)_ℓ − softmax(a_L)_ℓ softmax(a_L)_i ]
        = −(1/ŷ_ℓ) [ 𝟙(ℓ=i) ŷ_ℓ − ŷ_ℓ ŷ_i ]
        = −(𝟙(ℓ=i) − ŷ_i)

So far we have derived the partial derivative w.r.t. the i-th element of a_L:

    ∂L(θ)/∂a_{L,i} = −(𝟙(ℓ=i) − ŷ_i)

We can now write the gradient w.r.t. the vector a_L:

    ∇_{a_L} L(θ) = [−(𝟙(ℓ=1) − ŷ_1), −(𝟙(ℓ=2) − ŷ_2), ..., −(𝟙(ℓ=k) − ŷ_k)]^T = −(e(ℓ) − ŷ)
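This clean form, ∇_{a_L} L(θ) = ŷ − e(ℓ), is easy to sanity-check numerically; a small sketch (the finite-difference comparison is an illustrative check, not part of the lecture):

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    a_L = np.array([1.0, 2.0, 0.5, -1.0])
    ell = 1                                  # true class label
    y_hat = softmax(a_L)

    analytic = y_hat - np.eye(4)[ell]        # -(e(ell) - y_hat)

    eps = 1e-6
    numeric = np.array([
        (-np.log(softmax(a_L + eps * np.eye(4)[i])[ell]) + np.log(y_hat[ell])) / eps
        for i in range(4)
    ])
    print(np.allclose(analytic, numeric, atol=1e-4))   # True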
Module 4.6: Backpropagation: Computing Gradients w.r.t. Hidden Units

Next on the roadmap: the gradient w.r.t. the hidden units.
Chain rule along multiple paths: if a function p(z) can be written as a function of intermediate results q_m(z), then we have

    ∂p(z)/∂z = Σ_m (∂p(z)/∂q_m(z)) (∂q_m(z)/∂z)

In our case:

- p(z) is the loss function L(θ)
- z = h_{ij}
- q_m(z) = a_{L,m}
    ∂L(θ)/∂h_{ij} = Σ_{m=1}^k (∂L(θ)/∂a_{i+1,m}) (∂a_{i+1,m}/∂h_{ij}) = Σ_{m=1}^k (∂L(θ)/∂a_{i+1,m}) W_{i+1,m,j}

Now consider these two vectors:

    ∇_{a_{i+1}} L(θ) = [∂L(θ)/∂a_{i+1,1}, ..., ∂L(θ)/∂a_{i+1,k}]^T ;    W_{i+1,·,j} = [W_{i+1,1,j}, ..., W_{i+1,k,j}]^T

W_{i+1,·,j} is the j-th column of W_{i+1}; see that

    (W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ) = Σ_{m=1}^k (∂L(θ)/∂a_{i+1,m}) W_{i+1,m,j}

(using a_{i+1} = W_{i+1} h_i + b_{i+1}). We thus have

    ∂L(θ)/∂h_{ij} = (W_{i+1,·,j})^T ∇_{a_{i+1}} L(θ)

We can now write the gradient w.r.t. h_i:

    ∇_{h_i} L(θ) = [(W_{i+1,·,1})^T ∇_{a_{i+1}} L(θ), ..., (W_{i+1,·,n})^T ∇_{a_{i+1}} L(θ)]^T = (W_{i+1})^T (∇_{a_{i+1}} L(θ))

We are almost done, except that we do not know how to calculate ∇_{a_{i+1}} L(θ) for i < L − 1. We will now see how to compute that. Consider

    ∇_{a_i} L(θ) = [∂L(θ)/∂a_{i1}, ..., ∂L(θ)/∂a_{in}]^T

    ∂L(θ)/∂a_{ij} = (∂L(θ)/∂h_{ij}) (∂h_{ij}/∂a_{ij}) = (∂L(θ)/∂h_{ij}) g′(a_{ij})    [∵ h_{ij} = g(a_{ij})]

so

    ∇_{a_i} L(θ) = [(∂L(θ)/∂h_{i1}) g′(a_{i1}), ..., (∂L(θ)/∂h_{in}) g′(a_{in})]^T = ∇_{h_i} L(θ) ⊙ [..., g′(a_{ik}), ...]

where ⊙ denotes the elementwise (Hadamard) product.
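Both identities translate directly into one step of the backward pass; a minimal numpy sketch (the logistic g, whose derivative is g(a)(1 − g(a)), is an illustrative assumption):

    import numpy as np

    def backprop_through_layer(grad_a_next, W_next, a_i):
        # grad w.r.t. hidden layer: nabla_{h_i} L = (W_{i+1})^T nabla_{a_{i+1}} L
        grad_h = W_next.T @ grad_a_next
        # grad w.r.t. pre-activation: nabla_{a_i} L = nabla_{h_i} L (*) g'(a_i)
        h = 1.0 / (1.0 + np.exp(-a_i))       # h_i = g(a_i), logistic
        return grad_h, grad_h * h * (1.0 - h)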
Module 4.7: Backpropagation: Computing Gradients w.r.t. Parameters

Last on the roadmap: the gradient w.r.t. the weights and biases.
Recall that

    a_k = b_k + W_k h_{k−1}

so

    ∂a_{ki}/∂W_{kij} = h_{k−1,j}

and

    ∂L(θ)/∂W_{kij} = (∂L(θ)/∂a_{ki}) (∂a_{ki}/∂W_{kij}) = (∂L(θ)/∂a_{ki}) h_{k−1,j}

∇_{W_k} L(θ) is the matrix collecting all these entries ∂L(θ)/∂W_{kij}.
Let us take a simple example of W_k ∈ R^(3×3) and see what each entry looks like. The (i, j)-th entry of ∇_{W_k} L(θ) is (∂L(θ)/∂a_{ki}) (∂a_{ki}/∂W_{kij}), so

    ∇_{W_k} L(θ) = [ (∂L(θ)/∂a_{k1}) h_{k−1,1}   (∂L(θ)/∂a_{k1}) h_{k−1,2}   (∂L(θ)/∂a_{k1}) h_{k−1,3}
                     (∂L(θ)/∂a_{k2}) h_{k−1,1}   (∂L(θ)/∂a_{k2}) h_{k−1,2}   (∂L(θ)/∂a_{k2}) h_{k−1,3}
                     (∂L(θ)/∂a_{k3}) h_{k−1,1}   (∂L(θ)/∂a_{k3}) h_{k−1,2}   (∂L(θ)/∂a_{k3}) h_{k−1,3} ]

                 = ∇_{a_k} L(θ) · h_{k−1}^T
Finally, coming to the biases:

    a_{ki} = b_{ki} + Σ_j W_{kij} h_{k−1,j}

    ∂L(θ)/∂b_{ki} = (∂L(θ)/∂a_{ki}) (∂a_{ki}/∂b_{ki}) = ∂L(θ)/∂a_{ki}

We can now write the gradient w.r.t. the vector b_k:

    ∇_{b_k} L(θ) = [∂L(θ)/∂a_{k1}, ∂L(θ)/∂a_{k2}, ..., ∂L(θ)/∂a_{kn}]^T = ∇_{a_k} L(θ)
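In numpy, both parameter gradients are one-liners; a small sketch (the vectors and sizes here are made up for the example):

    import numpy as np

    grad_a_k = np.array([0.1, -0.3, 0.2])   # nabla_{a_k} L, shape (3,)
    h_prev = np.array([0.5, 0.9, 0.4])      # h_{k-1}, shape (3,)

    grad_W_k = np.outer(grad_a_k, h_prev)   # nabla_{W_k} L = nabla_{a_k} L . h_{k-1}^T
    grad_b_k = grad_a_k                     # nabla_{b_k} L = nabla_{a_k} L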
Module 4.8: Backpropagation: Pseudo code

Finally, we have all the pieces of the puzzle:

    ∇_{a_L} L(θ)                  (gradient w.r.t. the output layer)

    ∇_{h_k} L(θ), ∇_{a_k} L(θ)    (gradient w.r.t. the hidden layers, 1 ≤ k < L)

    ∇_{W_k} L(θ), ∇_{b_k} L(θ)    (gradient w.r.t. the weights and biases, 1 ≤ k ≤ L)

We can now write the full learning algorithm.
    Algorithm: gradient descent()
        t ← 0;
        max_iterations ← 1000;
        Initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0];
        while t++ < max_iterations do
            h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, ŷ = forward propagation(θ_t);
            ∇θ_t = backward propagation(h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, y, ŷ);
            θ_{t+1} ← θ_t − η∇θ_t;
        end
    Algorithm: forward propagation(θ)
        for k = 1 to L − 1 do
            a_k = b_k + W_k h_{k−1};
            h_k = g(a_k);
        end
        a_L = b_L + W_L h_{L−1};
        ŷ = O(a_L);

Just do a forward propagation and compute all the h_i's, a_i's, and ŷ (with h_0 = x).
    Algorithm: back propagation(h_1, h_2, ..., h_{L−1}, a_1, a_2, ..., a_L, y, ŷ)
        // Compute output gradient
        ∇_{a_L} L(θ) = −(e(y) − ŷ);
        for k = L to 1 do
            // Compute gradients w.r.t. parameters
            ∇_{W_k} L(θ) = ∇_{a_k} L(θ) h_{k−1}^T;
            ∇_{b_k} L(θ) = ∇_{a_k} L(θ);
            // Compute gradient w.r.t. layer below
            ∇_{h_{k−1}} L(θ) = W_k^T (∇_{a_k} L(θ));
            // Compute gradient w.r.t. layer below (pre-activation)
            ∇_{a_{k−1}} L(θ) = ∇_{h_{k−1}} L(θ) ⊙ [..., g′(a_{k−1,j}), ...];
        end
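Putting the pseudo code together, here is a compact numpy sketch of one forward/backward pass for a hypothetical network with two hidden layers of 3 units, 4 output classes, logistic activations, softmax output, and cross-entropy loss (all sizes and the finite-difference check are illustrative assumptions, not code from the lecture):

    import numpy as np

    def g(a):                        # logistic activation
        return 1.0 / (1.0 + np.exp(-a))

    def O(a_L):                      # softmax output activation
        e = np.exp(a_L - a_L.max())
        return e / e.sum()

    def forward_propagation(x, Ws, bs):
        h, a = [x], []               # h[0] = x
        for W, b in zip(Ws[:-1], bs[:-1]):
            a.append(b + W @ h[-1])  # a_k = b_k + W_k h_{k-1}
            h.append(g(a[-1]))       # h_k = g(a_k)
        a.append(bs[-1] + Ws[-1] @ h[-1])
        return a, h, O(a[-1])        # y_hat = O(a_L)

    def back_propagation(h, y_hat, ell, Ws):
        grad_a = y_hat.copy()
        grad_a[ell] -= 1.0           # nabla_{a_L} L = -(e(ell) - y_hat)
        grads_W = [None] * len(Ws)
        grads_b = [None] * len(Ws)
        for k in reversed(range(len(Ws))):          # k = L-1, ..., 0 (0-based)
            grads_W[k] = np.outer(grad_a, h[k])     # nabla_{W_k} L = nabla_{a_k} L h_{k-1}^T
            grads_b[k] = grad_a                     # nabla_{b_k} L = nabla_{a_k} L
            if k > 0:
                grad_h = Ws[k].T @ grad_a           # nabla_{h_{k-1}} L
                grad_a = grad_h * h[k] * (1 - h[k]) # (*) g'(a_{k-1}) for logistic g
        return grads_W, grads_b

    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=s) for s in [(3, 3), (3, 3), (4, 3)]]
    bs = [rng.normal(size=n) for n in [3, 3, 4]]
    x, ell = rng.normal(size=3), 2   # input and true class label

    a, h, y_hat = forward_propagation(x, Ws, bs)
    grads_W, grads_b = back_propagation(h, y_hat, ell, Ws)

    # Finite-difference check on one weight entry
    eps = 1e-6
    Ws[0][0, 1] += eps
    _, _, y2 = forward_propagation(x, Ws, bs)
    print(grads_W[0][0, 1], (np.log(y_hat[ell]) - np.log(y2[ell])) / eps)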
Module 4.9: Derivative of the activation function
Now, the only thing we need to figure out is how to compute g′.

Logistic:

    g(z) = σ(z) = 1 / (1 + e^(−z))

    g′(z) = (−1) (1 / (1 + e^(−z))²) d/dz (1 + e^(−z))
          = (−1) (1 / (1 + e^(−z))²) (−e^(−z))
          = (1 / (1 + e^(−z))) ((1 + e^(−z) − 1) / (1 + e^(−z)))
          = g(z)(1 − g(z))

tanh:

    g(z) = tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z))

    g′(z) = [(e^z + e^(−z)) d/dz (e^z − e^(−z)) − (e^z − e^(−z)) d/dz (e^z + e^(−z))] / (e^z + e^(−z))²
          = [(e^z + e^(−z))² − (e^z − e^(−z))²] / (e^z + e^(−z))²
          = 1 − (e^z − e^(−z))² / (e^z + e^(−z))²
          = 1 − (g(z))²
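As a quick check of both identities, a small sketch (the finite-difference verification and the chosen point z are illustrative, not part of the lecture):

    import numpy as np

    def sigma(z):
        return 1.0 / (1.0 + np.exp(-z))

    z, eps = 0.7, 1e-6

    # g'(z) = g(z)(1 - g(z)) for the logistic
    print(sigma(z) * (1 - sigma(z)), (sigma(z + eps) - sigma(z)) / eps)

    # g'(z) = 1 - g(z)^2 for tanh
    print(1 - np.tanh(z) ** 2, (np.tanh(z + eps) - np.tanh(z)) / eps)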