
ESS2222

Lecture 7 – Deep Learning and Convolutional Networks

Hosein Shahnas

University of Toronto, Department of Earth Sciences

1 Outline

Deep Learning
Convolutional Neural Networks
Recurrent Neural Networks

2 Review of Lecture 6 – Dropout

Figure: a standard neural network vs. a neural network with dropout.

Figure: a small network with hidden units h1, h2 and output y implementing Or, Nand, and And.

Back Propagation:

$\frac{\partial E}{\partial w_{jk}^{(l)}} = a_k^{(l-1)}\,\delta_j^{(l)}, \qquad \delta_j^{(l)} = a_j^{(l)}\big(1 - a_j^{(l)}\big)\sum_k w_{kj}^{(l+1)}\,\delta_k^{(l+1)}$
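A minimal NumPy sketch of these two relations for a network with one hidden sigmoid layer; the network size, data and squared-error loss are illustrative assumptions, not the lecture's example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input a^(0)
y = np.array([1.0, 0.0])        # target
W1 = rng.normal(size=(4, 3))    # hidden-layer weights
W2 = rng.normal(size=(2, 4))    # output-layer weights

# Forward pass
a1 = sigmoid(W1 @ x)            # hidden activations a^(1)
a2 = sigmoid(W2 @ a1)           # output activations a^(2)

# Output-layer delta for a squared-error loss: delta^(2) = (a - y) * a * (1 - a)
delta2 = (a2 - y) * a2 * (1 - a2)
# Hidden-layer delta: delta^(1)_j = a^(1)_j (1 - a^(1)_j) * sum_k w^(2)_kj delta^(2)_k
delta1 = a1 * (1 - a1) * (W2.T @ delta2)

# Gradients: dE/dw^(l)_jk = delta^(l)_j * a^(l-1)_k
grad_W2 = np.outer(delta2, a1)
grad_W1 = np.outer(delta1, x)
```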

3 Deep Learning

Figure: from a single neuron and shallow networks (binary and multi-class classification) to a deep neural network.

4 Deep Learning – Why more layers?

5 Deep Learning – Large number of parameters

6 Convolutional Networks (CNN)

I: Image, K: 'Filter' or 'Kernel' or 'Feature Detector', I*K: Convolved Feature (activation map)

The size of the convolved field (I*K): $W_{out} = (W - F)/S + 1$, where W: image width, F: filter width, S: stride.

Ex.: $(7 - 3)/1 + 1 = 5$
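A quick NumPy check of this output-size rule on a 7×7 image with a 3×3 filter and stride 1 (an illustrative sketch, not lecture code):

```python
import numpy as np

def conv2d_valid(image, kernel, stride=1):
    """Naive 2-D 'valid' convolution (cross-correlation, as used in CNNs);
    output side = (W - F)//stride + 1."""
    W, F = image.shape[0], kernel.shape[0]
    out = (W - F) // stride + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            patch = image[i*stride:i*stride + F, j*stride:j*stride + F]
            result[i, j] = np.sum(patch * kernel)
    return result

image = np.arange(49, dtype=float).reshape(7, 7)   # W = 7
kernel = np.ones((3, 3)) / 9.0                     # F = 3 (box blur)
print(conv2d_valid(image, kernel).shape)           # (5, 5), since (7-3)/1 + 1 = 5
```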

Volume image: for a color image the filter extends through the full depth of the input.

7 Convolutional Networks (CNN) – Filters

https://medium.com/@ageitgey/machine-learning-is-fun-part-3-deep-learning-and-convolutional-neural-networks-f40359318721

8 Convolutional Networks (CNN) – Filters

Figure: example 3×3 filters and their effect on an image: box blur, edge detection, Gaussian blur.
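A sketch of applying these filters with SciPy; the 3×3 kernels below are the standard box-blur, edge-detection and Gaussian-blur kernels and are assumed to match the ones shown on the slide:

```python
import numpy as np
from scipy.signal import convolve2d

box_blur = np.ones((3, 3)) / 9.0
edge_detect = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]], dtype=float)
gaussian_blur = np.array([[1, 2, 1],
                          [2, 4, 2],
                          [1, 2, 1]], dtype=float) / 16.0

image = np.random.rand(64, 64)   # placeholder grayscale image

for name, k in [("box blur", box_blur),
                ("edge detection", edge_detect),
                ("Gaussian blur", gaussian_blur)]:
    filtered = convolve2d(image, k, mode="same", boundary="symm")
    print(name, filtered.shape)
```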

9 Stride and Padding

In convolving an image: 1) the output shrinks, and 2) the information at the corners of the image is lost.

This can be prevented by padding.

Figure: a 32×32×3 input zero-padded to 36×36 before convolution, so that the output stays 32×32.

10 Stride and Padding – Color Image

Output size with padding: $W_{out} = (W - F + 2P)/S + 1$, $H_{out} = (H - F + 2P)/S + 1$

http://machinelearninguru.com/computer_vision/basics/convolution/convolution_layer.html

11 Stride and Padding

Figure: S: stride, W: width, H: height, F: filter size, P: padding; convolution with stride = 1 vs. stride = 2.
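A minimal Keras sketch (assuming TensorFlow 2.x) showing how padding and stride change the output size, consistent with $(W - F + 2P)/S + 1$:

```python
import tensorflow as tf

x = tf.random.normal((1, 32, 32, 3))   # one 32x32 RGB image

# 'valid' = no padding: (32 - 5)/1 + 1 = 28
conv_valid = tf.keras.layers.Conv2D(6, 5, strides=1, padding="valid")
print(conv_valid(x).shape)             # (1, 28, 28, 6)

# 'same' = zero-padding so the output keeps the input size (stride 1)
conv_same = tf.keras.layers.Conv2D(6, 5, strides=1, padding="same")
print(conv_same(x).shape)              # (1, 32, 32, 6)

# Stride 2 halves the spatial size (with 'same' padding): ceil(32/2) = 16
conv_stride2 = tf.keras.layers.Conv2D(6, 5, strides=2, padding="same")
print(conv_stride2(x).shape)           # (1, 16, 16, 6)
```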

12 Convolutional Networks – Max Pooling/Downsampling with CNNs

$z = \mathbf{x} \cdot \mathbf{w} = \sum_i w_i x_i = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$

Rectified Linear Unit (ReLU): $f(z) = \max(0, z)$

Analytic approximation (softplus): $S_{ReLU}(z) = \log(1 + \exp(z))$
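A short NumPy sketch of the weighted sum, ReLU, its softplus approximation, and 2×2 max pooling; the numbers are illustrative only:

```python
import numpy as np

# Weighted sum z = x . w
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1,  0.4, 0.3])
z = x @ w

relu = max(0.0, z)                    # ReLU: f(z) = max(0, z)
softplus = np.log1p(np.exp(z))        # S_ReLU = log(1 + exp(z)), smooth approximation

# 2x2 max pooling with stride 2 on a 4x4 activation map
a = np.arange(16, dtype=float).reshape(4, 4)
pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(z, relu, softplus)
print(pooled)                         # each entry is the max of a 2x2 block
```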

13 Convolutional Networks (CNN)

Figure: convolving a 32×32 RGB image with six independent filters produces six 28×28 activation maps (with 5×5 filters and stride 1: $(32 - 5)/1 + 1 = 28$).

https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-internet-fbb8b1ad5df8

14 Convolutional Networks (CNN)

Figure: a sequence of CNN layers with feature visualizations at successive stages; a typical CNN architecture.

15 Convolutional Networks

Rectified Linear Unit (ReLU): $f(z) = \max(0, z)$

Softmax: $\sigma(z)_j = \dfrac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$, $j = 1, \dots, K$, where $K$ is the number of classes.

Analytic approximation (softplus): $S_{ReLU}(z) = \log(1 + \exp(z))$

https://medium.com/data-science-group-iitr/building-a-convolutional-neural-network-in-python-with--d251c3ca8117
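A minimal NumPy softmax matching the formula above; the logits are illustrative:

```python
import numpy as np

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k), stabilized by subtracting max(z)."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # K = 3 classes
print(softmax(logits))               # probabilities summing to 1
```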

16 A Simple CNN Model

Example 1: MNIST Database - Handwritten digits

A simple CNN deep learning model for handwritten digit recognition using Keras.

Model layout (figure): convolution (3×3) and max-pooling (2×2) layers, flatten, fully connected layers (128 units, dropout 0.2), output layer with 10 classes.
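The model code from the following slides is not reproduced in this text version; a minimal Keras sketch consistent with the figure (3×3 convolution, 2×2 pooling, a 128-unit dense layer with dropout 0.2, and a 10-class output) might look as follows. The number of convolution filters and the training settings are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Load and normalize MNIST
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # 32 filters assumed
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=128,
          validation_data=(x_test, y_test))
```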

17 A Simple CNN Model

18 A Simple CNN Model

19 A Simple CNN Model

20 A Simple CNN Model

Example 2: Fashion-MNIST Database – CNN deep learning model using Keras.

Architecture: Input 28×28×1 → Conv. 32 (3×3) → Max Pool 2 → Conv. 64 (3×3) → Max Pool 2 → Conv. 128 (3×3) → Max Pool 2 → Flatten → Dense layer 128 → Output layer 10
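A hedged Keras sketch consistent with this layer sequence; the activations, padding and training settings are assumptions, not the exact code from the slides:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10, batch_size=64,
          validation_data=(x_test, y_test))
```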

21 A Simple CNN Model

22 A Simple CNN Model

23 A Simple CNN Model

24 A Simple CNN Model

25 A Simple CNN Model

Example 3: Fashion-MNIST Database – CNN deep learning model with dropout, using Keras.

Architecture: Input 28×28×1 → Conv. 32 (3×3) → Max Pool 2 → Dropout 0.25 → Conv. 64 (3×3) → Max Pool 2 → Dropout 0.25 → Conv. 128 (3×3) → Max Pool 2 → Dropout 0.4 → Flatten → Dense layer 128 → Dropout 0.3 → Output layer 10
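A hedged Keras sketch of this architecture, compiled and trained exactly as in Example 2; the ReLU activations and 'same' padding are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Same Fashion-MNIST preprocessing, compile and fit steps as in Example 2;
# only the model definition changes (dropout layers added).
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.4),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),
])
model.summary()
```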

26 A Simple CNN Model

27 Overfitting – Overcome with Dropout

Figure: Fashion-MNIST training curves without dropout vs. with dropout.

https://www.datacamp.com/community/tutorials/convolutional-neural-networks-python

28 Recurrent Neural Networks (RNN)

RNN: Unlike regular neural networks, in which the samples are assumed to be time independent and are fed to the network as a whole, recurrent neural networks take their inputs from temporally distributed samples. They can be thought of as multiple copies of the same network, each passing a message to a successor. These networks have loops in them, allowing time-dependent information to persist.

Figure: an RNN cell unrolled in time: the same network N is applied at each time step, mapping inputs $x_{t_0}, x_{t_1}, \dots, x_{t_n}$ to hidden states $h_{t_0}, h_{t_1}, \dots, h_{t_n}$.

29 Recurrent Neural Networks (RNN)

Example 1: Classification of events happening at every frame in a movie. An RNN can use its reasoning about previous events in the film to inform later ones.

Example 2: Earthquakes and their relevance to the previous earthquakes.

Extension: The application of recurrent neural networks is not limited to time-dependent samples. It can be extended to other spaces (other than time), as the following example shows.

Example 3: MNIST digit recognition using an RNN. In this example each 28×28-pixel image of a handwritten digit, instead of being flattened to a 784-dimensional array as input for a regular neural network, is treated as 28 one-dimensional images: each row of pixels is a single 1D image, and the next row is another image following the previous one. Each row is then fed to one cell of the RNN in sequence. The following code trains a model using an RNN. Note that each row carries information relevant to the next row, which is not preserved in a regular neural network.
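The code from the following slides is not reproduced in this text version; a minimal Keras sketch of this row-by-row setup might look as follows. The choice of an LSTM cell with 128 units, and the training settings, are assumptions rather than the lecture's exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Each 28x28 image is treated as a sequence of 28 rows (time steps),
# each row being a 28-dimensional input vector.
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

model = models.Sequential([
    layers.Input(shape=(28, 28)),      # (time steps, features per step)
    layers.LSTM(128),                  # recurrent layer reads the rows in sequence
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=128,
          validation_data=(x_test, y_test))
```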

30 MNIST- Digit Recognition - RNN

31 MNIST- Digit Recognition - RNN

32 MNIST- Digit Recognition - RNN

33 MNIST- Digit Recognition - RNN

34 MNIST- Digit Recognition - RNN

35 Optimization

Optimizers are a crucial part of training neural networks.

Batch Gradient Descent (BGD): $w_j := w_j - \eta \sum_{i=1}^{N} \frac{\partial J_i}{\partial w_j}$ (gradient summed over all $N$ training samples)

Mini-Batch Gradient Descent (MBGD): $w_j := w_j - \eta \sum_{i=1}^{n} \frac{\partial J_i}{\partial w_j}$, with $n \ll N$ (gradient summed over a mini-batch)

Stochastic Gradient Descent (SGD): $w_j := w_j - \eta \frac{\partial J_i}{\partial w_j}$ (a single sample $i$ per update)

i: sample, j: feature
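A NumPy sketch contrasting the three update rules on a toy least-squares objective; the data, learning rate and batch size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # N = 100 samples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

def grad(w, idx):
    """Gradient of the squared error summed over the samples in idx."""
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi)

eta = 0.01
w = np.zeros(3)

# Batch gradient descent: all N samples per update
w = w - eta * grad(w, np.arange(100))

# Mini-batch gradient descent: n << N samples per update
w = w - eta * grad(w, rng.choice(100, size=10, replace=False))

# Stochastic gradient descent: a single sample per update
w = w - eta * grad(w, [rng.integers(100)])
```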

36 Optimization

AdaGrad: AdaGrad (adaptive gradient algorithm) is a modified stochastic gradient descent with a per-parameter learning rate (adaptively tuned for each parameter).

$G_{t,j} = \sum_{\tau=1}^{t} g_{\tau,j}^2, \qquad w_{t+1,j} = w_{t,j} - \frac{\eta}{\sqrt{G_{t,j} + \epsilon}}\, g_{t,j}$

RMSProp optimization algorithm: RMSProp (Root Mean Square Propagation) adapts the learning rate for each parameter by dividing it by a running (exponentially decaying) average of recent squared gradients.

$v_{t,j} = \beta\, v_{t-1,j} + (1-\beta)\, g_{t,j}^2, \qquad w_{t+1,j} = w_{t,j} - \frac{\eta}{\sqrt{v_{t,j} + \epsilon}}\, g_{t,j}$, with $g_{t,j} = \partial J_i / \partial w_j$ (sample objective)

t: time step, i: sample, j: feature
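A NumPy sketch of the AdaGrad and RMSProp update rules above; the decay rate, learning rate and the stand-in gradients are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, eps, beta = 0.01, 1e-8, 0.9
w_adagrad = np.zeros(3)
w_rmsprop = np.zeros(3)
G = np.zeros(3)          # AdaGrad: sum of squared gradients
v = np.zeros(3)          # RMSProp: decaying average of squared gradients

for t in range(100):
    g = rng.normal(size=3)              # stand-in for the gradient dJ/dw at step t

    # AdaGrad: G_t = G_{t-1} + g_t^2 ;  w <- w - eta * g_t / sqrt(G_t + eps)
    G += g**2
    w_adagrad -= eta * g / np.sqrt(G + eps)

    # RMSProp: v_t = beta*v_{t-1} + (1-beta)*g_t^2 ;  w <- w - eta * g_t / sqrt(v_t + eps)
    v = beta * v + (1 - beta) * g**2
    w_rmsprop -= eta * g / np.sqrt(v + eps)
```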

37 Optimization

Adam optimization algorithm: Adam (Adaptive Moment Estimation) (Kingma & Ba, 2015) is an update to the RMSProp optimizer.

1) Computationally efficient
2) Low memory requirements
3) Well suited for large data

$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$

$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad w_{t+1} = w_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$, with $g_t = \partial J / \partial w$.
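A NumPy sketch of the Adam update loop with the usual default hyperparameters; the stand-in gradients are an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
w = np.zeros(3)
m = np.zeros(3)          # first-moment (mean) estimate
v = np.zeros(3)          # second-moment (uncentered variance) estimate

for t in range(1, 101):
    g = rng.normal(size=3)                  # stand-in for dJ/dw at step t

    m = beta1 * m + (1 - beta1) * g         # m_t = beta1*m_{t-1} + (1-beta1)*g_t
    v = beta2 * v + (1 - beta2) * g**2      # v_t = beta2*v_{t-1} + (1-beta2)*g_t^2
    m_hat = m / (1 - beta1**t)              # bias correction
    v_hat = v / (1 - beta2**t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
```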

38 Optimization

Momentum (Modified SGD): Stochastic gradient descent with momentum remembers the update Δw at each iteration and determines the next update as a linear combination of the gradient and the previous update.

$\Delta w_t = \alpha\, \Delta w_{t-1} - \eta\, \frac{\partial J_i}{\partial w}, \qquad w := w + \Delta w_t$

Figure: SGD vs. SGD with momentum.

Implicit Stochastic Gradient Descent (ISGD): SGD is generally sensitive to the learning rate η. Fast convergence requires large learning rates, but these may induce numerical instability. The problem can be largely solved by implicit updates, whereby the stochastic gradient is evaluated at the next iterate rather than the current one.
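For reference, a brief sketch of selecting these optimizers in Keras (assuming TensorFlow 2.x); the hyperparameter values are library defaults or illustrative choices, and ISGD is not among the built-in Keras optimizers:

```python
import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=adam,                    # or sgd_momentum, adagrad, rmsprop
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```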

39 Optimization

https://stackoverflow.com/questions/36162180/gradient-descent-vs-adagrad-vs-momentum-in-tensorflow

40 Problem

$\frac{\partial J}{\partial w} = 0$

https://stackoverflow.com/questions/36162180/gradient-descent-vs-adagrad-vs-momentum-in-tensorflow
