Lecture 4: Feedforward Neural Networks, Backpropagation
CS7015 (Deep Learning): Lecture 4 — Feedforward Neural Networks, Backpropagation
Mitesh M. Khapra, Department of Computer Science and Engineering, Indian Institute of Technology Madras

References/Acknowledgments: See the excellent videos by Hugo Larochelle on Backpropagation.

Module 4.1: Feedforward Neural Networks (a.k.a. multilayered network of neurons)

[Figure: a feedforward network with inputs $x_1, x_2, \ldots, x_n$, pre-activations $a_1, a_2, a_3$, activations $h_1, h_2$, weights $W_1, W_2, W_3$, biases $b_1, b_2, b_3$, and output $h_L = \hat{y} = f(x)$. The same figure accompanies the slides that follow.]

The input to the network is an $n$-dimensional vector. The network contains $L-1$ hidden layers (2, in this case) having $n$ neurons each. Finally, there is one output layer containing $k$ neurons (say, corresponding to $k$ classes). Each neuron in the hidden layers and the output layer can be split into two parts: pre-activation and activation ($a_i$ and $h_i$ are vectors). The input layer can be called the 0-th layer and the output layer can be called the $L$-th layer. $W_i \in \mathbb{R}^{n \times n}$ and $b_i \in \mathbb{R}^{n}$ are the weight and bias between layers $i-1$ and $i$ ($0 < i < L$). $W_L \in \mathbb{R}^{n \times k}$ and $b_L \in \mathbb{R}^{k}$ are the weight and bias between the last hidden layer and the output layer ($L = 3$ in this case).

The pre-activation at layer $i$ is given by
$$a_i(x) = b_i + W_i h_{i-1}(x)$$
The activation at layer $i$ is given by
$$h_i(x) = g(a_i(x))$$
where $g$ is called the activation function (for example, logistic, tanh, linear, etc.). The activation at the output layer is given by
$$f(x) = h_L(x) = O(a_L(x))$$
where $O$ is the output activation function (for example, softmax, linear, etc.). To simplify notation we will refer to $a_i(x)$ as $a_i$ and $h_i(x)$ as $h_i$, so that
$$a_i = b_i + W_i h_{i-1}, \qquad h_i = g(a_i), \qquad f(x) = h_L = O(a_L)$$

Data: $\{x_i, y_i\}_{i=1}^{N}$
Model:
$$\hat{y}_i = f(x_i) = O(W_3\, g(W_2\, g(W_1 x_i + b_1) + b_2) + b_3)$$
Parameters: $\theta = W_1, \ldots, W_L, b_1, \ldots, b_L$ ($L = 3$ in this case)
Algorithm: Gradient Descent with Backpropagation (we will see soon)
Objective/Loss/Error function: Say,
$$\min \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{k} (\hat{y}_{ij} - y_{ij})^2$$
In general, $\min \mathcal{L}(\theta)$, where $\mathcal{L}(\theta)$ is some function of the parameters.

Module 4.2: Learning Parameters of Feedforward Neural Networks (Intuition)

The story so far: we have introduced feedforward neural networks, and we are now interested in finding an algorithm for learning the parameters of this model.

Recall our gradient descent algorithm:

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize w_0, b_0;
  while t++ < max_iterations do
    w_{t+1} ← w_t − η ∇w_t;
    b_{t+1} ← b_t − η ∇b_t;
  end
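To make the update rule concrete, here is a minimal sketch of this gradient descent loop in Python/NumPy. The toy data, the one-dimensional linear model with squared error, the learning rate eta and the helper names (grad_w, grad_b) are illustrative choices of mine, not part of the lecture.

```python
import numpy as np

# Toy data for a 1-D model y ≈ w*x + b with squared error loss (illustrative only)
X = np.array([0.5, 2.5])
Y = np.array([0.2, 0.9])

def grad_w(w, b):
    # d/dw of (1/N) * sum_i (w*x_i + b - y_i)^2
    return np.mean(2 * (w * X + b - Y) * X)

def grad_b(w, b):
    # d/db of (1/N) * sum_i (w*x_i + b - y_i)^2
    return np.mean(2 * (w * X + b - Y))

def gradient_descent(eta=0.1, max_iterations=1000):
    w, b = 0.0, 0.0                   # initialize w_0, b_0
    for t in range(max_iterations):
        dw, db = grad_w(w, b), grad_b(w, b)
        w = w - eta * dw              # w_{t+1} <- w_t - eta * grad w_t
        b = b - eta * db              # b_{t+1} <- b_t - eta * grad b_t
    return w, b

w, b = gradient_descent()
print(w, b)
```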
We can write it more concisely as:

Algorithm: gradient_descent()
  t ← 0;
  max_iterations ← 1000;
  Initialize θ_0 = [w_0; b_0];
  while t++ < max_iterations do
    θ_{t+1} ← θ_t − η ∇θ_t;
  end

where $\nabla \theta_t = \left[\frac{\partial \mathcal{L}(\theta)}{\partial w_t}, \frac{\partial \mathcal{L}(\theta)}{\partial b_t}\right]^T$.

Now, in this feedforward neural network, instead of $\theta = [w, b]$ we have $\theta = [W_1, W_2, \ldots, W_L, b_1, b_2, \ldots, b_L]$. We can still use the same algorithm for learning the parameters of our model: initialize $\theta_0 = [W_1^0, \ldots, W_L^0, b_1^0, \ldots, b_L^0]$ and update $\theta_{t+1} \leftarrow \theta_t - \eta \nabla \theta_t$, where
$$\nabla \theta_t = \left[\frac{\partial \mathcal{L}(\theta)}{\partial W_{1,t}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,t}}, \frac{\partial \mathcal{L}(\theta)}{\partial b_{1,t}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial b_{L,t}}\right]^T$$

Except that now our $\nabla \theta$ looks much more nasty: it collects the partial derivative of $\mathcal{L}(\theta)$ with respect to every entry of every weight matrix and every bias vector,
$$\nabla \theta = \left[\frac{\partial \mathcal{L}(\theta)}{\partial W_{1,11}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial W_{1,nn}}, \frac{\partial \mathcal{L}(\theta)}{\partial W_{2,11}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial W_{2,nn}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,11}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial W_{L,nk}}, \frac{\partial \mathcal{L}(\theta)}{\partial b_{11}}, \ldots, \frac{\partial \mathcal{L}(\theta)}{\partial b_{Lk}}\right]$$
$\nabla \theta$ is thus composed of $\nabla W_1, \nabla W_2, \ldots, \nabla W_{L-1} \in \mathbb{R}^{n \times n}$, $\nabla W_L \in \mathbb{R}^{n \times k}$, $\nabla b_1, \nabla b_2, \ldots, \nabla b_{L-1} \in \mathbb{R}^{n}$ and $\nabla b_L \in \mathbb{R}^{k}$.

We need to answer two questions:
How to choose the loss function $\mathcal{L}(\theta)$?
How to compute $\nabla \theta$, which is composed of $\nabla W_1, \nabla W_2, \ldots, \nabla W_{L-1} \in \mathbb{R}^{n \times n}$, $\nabla W_L \in \mathbb{R}^{n \times k}$, $\nabla b_1, \nabla b_2, \ldots, \nabla b_{L-1} \in \mathbb{R}^{n}$ and $\nabla b_L \in \mathbb{R}^{k}$?

Module 4.3: Output Functions and Loss Functions

The choice of loss function depends on the problem at hand. We will illustrate this with the help of two examples.

Consider our movie example again, but this time we are interested in predicting ratings. Here $y_i \in \mathbb{R}^3$, for example $y_i = \{7.5, 8.2, 7.7\}$ (the IMDb, critics and RT ratings), the inputs $x_i$ are features such as isActor Damon, isDirector Nolan, and so on, and the model is a neural network with $L-1$ hidden layers. The loss function should capture how much $\hat{y}_i$ deviates from $y_i$. If $y_i \in \mathbb{R}^n$, then the squared error loss can capture this deviation:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{3} (\hat{y}_{ij} - y_{ij})^2$$

A related question: what should the output function $O$ be if $y_i \in \mathbb{R}$? More specifically, can it be the logistic function? No, because it restricts $\hat{y}_i$ to a value between 0 and 1, but we want $\hat{y}_i \in \mathbb{R}$. So, in such cases it makes sense to have $O$ be a linear function:
$$f(x) = h_L = O(a_L) = W_O a_L + b_O$$
$\hat{y}_i = f(x_i)$ is then no longer bounded between 0 and 1.
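As a concrete illustration of this regression setup, here is a minimal NumPy sketch of the forward pass $a_i = b_i + W_i h_{i-1}$, $h_i = g(a_i)$ with a logistic $g$, a linear output $O$, and the squared error loss. The layer sizes, the random initialization and the helper names (forward, squared_error_loss) are my own illustrative choices; I also store each $W$ with shape (fan-out, fan-in) so that `b + W @ h` matches the slide's formula, which differs from the $n \times k$ shape convention written on the slide for $W_L$.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(a):
    # g(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, weights, biases):
    # Hidden layers: a_i = b_i + W_i h_{i-1}, h_i = g(a_i).
    # Output layer: linear O, so f(x) = h_L = a_L (regression case).
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = logistic(b + W @ h)
    return biases[-1] + weights[-1] @ h

def squared_error_loss(y_hat, y):
    # (1/N) * sum_i sum_j (y_hat_ij - y_ij)^2
    return np.mean(np.sum((y_hat - y) ** 2, axis=1))

# Example: n = 4 input features, two hidden layers of n neurons, k = 3 outputs (L = 3)
n, k, L = 4, 3, 3
sizes = [n, n, n, k]
weights = [rng.normal(scale=0.1, size=(sizes[i + 1], sizes[i])) for i in range(L)]
biases = [np.zeros(sizes[i + 1]) for i in range(L)]

x = rng.normal(size=n)             # one input x_i
y = np.array([7.5, 8.2, 7.7])      # target ratings from the slide
y_hat = forward(x, weights, biases)
print(squared_error_loss(y_hat[None, :], y[None, :]))
```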
Now let us consider another problem for which a different loss function would be appropriate. Suppose we want to classify an image into 1 of $k$ classes, with the true label encoded as $y = [1\ 0\ 0\ 0]$ over the classes Apple, Mango, Orange, Banana, and the model again a neural network with $L-1$ hidden layers. Here again we could use the squared error loss to capture the deviation. But can you think of a better function?

Notice that $y$ is a probability distribution. Therefore we should also ensure that $\hat{y}$ is a probability distribution. What choice of the output activation $O$ will ensure this? With
$$a_L = W_L h_{L-1} + b_L$$
take
$$\hat{y}_j = O(a_L)_j = \frac{e^{a_{L,j}}}{\sum_{i=1}^{k} e^{a_{L,i}}}$$
$O(a_L)_j$ is the $j$-th element of $\hat{y}$ and $a_{L,j}$ is the $j$-th element of the vector $a_L$. This function is called the softmax function.
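A small NumPy sketch of this softmax output is below; the max-subtraction is a standard numerical-stability trick I am adding, not something stated on the slide.

```python
import numpy as np

def softmax(a_L):
    # y_hat_j = exp(a_L_j) / sum_i exp(a_L_i)
    # Subtracting the max avoids overflow and does not change the result,
    # since the common factor exp(-max) cancels in the ratio.
    z = np.exp(a_L - np.max(a_L))
    return z / np.sum(z)

a_L = np.array([2.0, 1.0, 0.5, -1.0])   # pre-activation at the output layer (k = 4)
y_hat = softmax(a_L)
print(y_hat, y_hat.sum())                # entries are positive and sum to 1
```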