Deep Learning
Kairit Sirts
Lecture at TUT, 19.12.2016

Outline
• What can be done with deep learning?
• Deep learning demystified
• How can you get started with deep learning?

Why deep learning?
(figure: model performance as a function of the amount of data – deep learning keeps improving where gradient boosting, random forests and linear models level off)
http://www.infoworld.com/article/3003315/big-data/deep-learning-a-brief-guide-for-practical-problem-solvers.html

What can be done with deep learning?

Handwritten digit recognition
• MNIST benchmark dataset
• The best reported error rate is 0.21%

Street view number recognition
• Obtained from house numbers in Google Street View images
• The best error rate is 1.69%

Image classification
• 10 objects, 6000 labeled instances for each object (the CIFAR-10 benchmark)
• Best accuracy so far: 96.53%

Image classification
• 20 superclasses, 100 fine-grained classes, 600 labeled images per class (the CIFAR-100 benchmark)
• Best classification accuracy: 75.72%

Detecting doodles
• https://quickdraw.withgoogle.com
• There are other simple and fun AI experiments launched by Google: https://aiexperiments.withgoogle.com

Image captioning
(figures: example captions, including some not so great results)

Automatic colorization of images
(figures: successful colorizations, and failed ones)
http://richzhang.github.io/colorization/resources/images/teaser3.jpg

DeepDream
• https://deepdreamgenerator.com

Word embeddings
(figure: 2D projection of 50-dimensional word embeddings – months, weekdays and numbers form their own clusters)
http://metaoptimize.s3.amazonaws.com/cw-embeddings-ACL2010/embeddings-mostcommon.EMBEDDING_SIZE=50.png
• v(man) − v(woman) ≈ v(king) − v(queen)
• v(walking) − v(walked) ≈ v(swimming) − v(swam)

Automatic text generation – pseudo-Shakespeare
http://karpathy.github.io/2015/05/21/rnn-effectiveness

Machine translation
• Google Translate app

Learning to play Atari arcade games
https://www.youtube.com/watch?v=cjpEIotvwFY

AlphaGo
https://www.youtube.com/watch?v=PQCrX1sQSzY

Other tasks tackled with deep neural networks
• Speech recognition
• Various tasks in robotics
• Log analysis/risk detection
• Recommendation systems
• Motion detection from videos
• Business and economics analytics
• Etc.

Deep learning demystified

How does deep learning work?
• Biological neuron vs. artificial neuron
• Biological neural network vs. artificial neural network
http://www.theprojectspot.com/tutorial-post/introduction-to-artificial-neural-networks-part-1/7
https://www.eeweb.com/blog/rob_riemen/deep-machine-learning-and-the-google-brain

What happens inside a neuron?
• Each input x_i is multiplied by its weight w_i and the products are summed:
  net = w_1 x_1 + w_2 x_2 + \dots + w_n x_n = \sum_{i=1}^{n} w_i x_i
• The output is an activation function \phi applied to the net input: h = \phi(net)

Activation functions
• Threshold (step): \phi(x) = 1 if x ≥ th, 0 if x < th
• Sigmoid: \phi(x) = 1/(1 + e^{-x})
• Hyperbolic tangent: \phi(x) = (e^x − e^{-x})/(e^x + e^{-x})
• Rectified linear unit (ReLU): \phi(x) = max(0, x)
https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/neural_networks.html

Single neuron logic gates
• With the threshold activation function, a single neuron can implement the logic gates OR, AND and NAND
https://blog.abhranil.net/2015/03/03/training-neural-networks-with-genetic-algorithms/

XOR gate
• Cannot be done with a single neuron – a hidden layer is necessary
• Construction: XOR(x_1, x_2) = AND(OR(x_1, x_2), NAND(x_1, x_2))

x_1 x_2 | OR: \phi(x_1·1 + x_2·1 > 0.5) | NAND: \phi(x_1·(−1) + x_2·(−1) > −1.5) | AND: \phi(OR·1 + NAND·1 > 1.5) = XOR
0   0   | \phi(0 + 0 > 0.5) = 0         | \phi(0 + 0 > −1.5) = 1                 | \phi(0 + 1 > 1.5) = 0
0   1   | \phi(0 + 1 > 0.5) = 1         | \phi(0 − 1 > −1.5) = 1                 | \phi(1 + 1 > 1.5) = 1
1   0   | \phi(1 + 0 > 0.5) = 1         | \phi(−1 + 0 > −1.5) = 1                | \phi(1 + 1 > 1.5) = 1
1   1   | \phi(1 + 1 > 0.5) = 1         | \phi(−1 − 1 > −1.5) = 0                | \phi(1 + 0 > 1.5) = 0
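To make the neuron computation and the gate construction above concrete, here is a minimal Python/NumPy sketch. The weights and thresholds are the ones from the table above; the function and variable names are just illustrative choices.

```python
import numpy as np

# The activation functions listed above
def step(x, th=0.5): return 1 if x >= th else 0      # threshold/step
def sigmoid(x):      return 1 / (1 + np.exp(-x))     # 1/(1+e^-x)
def tanh(x):         return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
def relu(x):         return np.maximum(0, x)         # max(0, x)

def neuron(x, w, th):
    """A single artificial neuron with a threshold activation:
    net = sum_i w_i * x_i; fires 1 if net > th, else 0."""
    net = np.dot(w, x)
    return 1 if net > th else 0

# Single-neuron logic gates (weights and thresholds from the XOR table)
OR   = lambda x1, x2: neuron([x1, x2], [ 1,  1],  0.5)
NAND = lambda x1, x2: neuron([x1, x2], [-1, -1], -1.5)
AND  = lambda x1, x2: neuron([x1, x2], [ 1,  1],  1.5)

# XOR needs a hidden layer: feed OR and NAND into an AND neuron
XOR  = lambda x1, x2: AND(OR(x1, x2), NAND(x1, x2))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"XOR({x1},{x2}) = {XOR(x1, x2)}")   # prints 0, 1, 1, 0
```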
How to assign weights?
• Even a small network has many weights: a fully connected network with layers of 8, 9, 9, 9 and 4 neurons has 8×9 + 9×9 + 9×9 + 9×4 = 270 weights
http://neuralnetworksanddeeplearning.com/

Backpropagation
• The standard and efficient method for training neural networks
• The general idea:
  • Compute the error with a forward pass
  • Propagate the error back to change the weights such that the error becomes smaller: ERROR → ERROR′, with ERROR′ < ERROR

Diversion to calculus – derivatives
• The derivative f′(x) is the slope of the tangent line at x
• It is the rate of change of the function at that point
• When f′(x) = 0, x is a local or global maximum or minimum, or a saddle point
• When f′(x) > 0, the function is increasing
• When f′(x) < 0, the function is decreasing

Gradients
• The gradient generalizes the derivative to multivariate functions
• It is a vector pointing in the direction of steepest ascent: \nabla f(x, y) = (\partial f/\partial x, \partial f/\partial y)
• \partial f/\partial x and \partial f/\partial y are partial derivatives – the derivative with respect to one variable while treating all others as constants

Gradients and backpropagation
• Backpropagation is used to compute the gradients with respect to all parameters in a neural network.
• The gradients are then used in gradient descent, a general method for minimizing functions.
• We want to minimize the cost function that measures the error made by the neural network.
• To do that, we move in the direction of steepest descent, i.e. against the gradient.

Gradient descent
• An iterative algorithm
• Start with initial parameter values \theta_0
• Update the parameters iteratively until convergence: \theta_{t+1} := \theta_t − \eta \nabla J(\theta_t)
• \eta is the learning rate; it controls the step size
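A minimal sketch of the update rule above on a toy cost function. The function J, the starting point and the learning rate here are illustrative choices, not from the slides.

```python
import numpy as np

def J(theta):
    """Toy cost function: J(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2,
    minimized at theta = (3, -1)."""
    return (theta[0] - 3) ** 2 + (theta[1] + 1) ** 2

def grad_J(theta):
    # Gradient of J: the vector of partial derivatives (dJ/dtheta_1, dJ/dtheta_2)
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

eta = 0.1                       # learning rate, controls the step size
theta = np.array([0.0, 0.0])    # initial parameter values theta_0

for t in range(100):            # iterate until (approximate) convergence
    theta = theta - eta * grad_J(theta)   # theta_{t+1} := theta_t - eta * grad J(theta_t)

print(theta)   # close to the minimum (3, -1)
```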
Deep learning demystified

How does backpropagation work?

Backpropagation explained
• Example from: https://mattmazur.com/2015/03/17/
• 2 inputs
• 1 hidden layer with 2 neurons
• Bias terms in both the hidden and the output layer
• 2 outputs

Initial configuration
• Training values
• Initial weights: w_1, …, w_8
• Initial biases: b_1, b_2

Forward pass
(worked through in the slide figures, step by step: first hidden unit, second hidden unit, first output unit, second output unit, error of the first output, total output error)

Backwards pass
• Consider w_5: how much does a change in w_5 affect the total error?
• Apply the chain rule:
  \partial E_total / \partial w_5 = (\partial E_total / \partial out_o1) · (\partial out_o1 / \partial net_o1) · (\partial net_o1 / \partial w_5)

Chain rule
• The formula for computing the derivative of a composition of two or more functions
• F(x) ≡ f(g(x)) ≡ (f ∘ g)(x) – the composition of functions f and g
• F′(x) = f′(g(x)) · g′(x)
• Example: F(x) = e^{3x}, i.e. g(x) = 3x and f(g(x)) = e^{g(x)} = e^{3x}
• F′(x) = f′(g(x)) · g′(x) = e^{g(x)} · (3x)′ = e^{3x} · 3 = 3e^{3x}

The three factors (each computed in the slide figures):
• How much does the error change with respect to the output? (\partial E_total / \partial out_o1)
• How much does the output change with respect to its net input? (\partial out_o1 / \partial net_o1)
• How much does the net input change with respect to w_5? (\partial net_o1 / \partial w_5)

Derivative of the sigmoid function
• \sigma(x) = 1/(1 + e^{-x})
• \sigma′(x) = \sigma(x)(1 − \sigma(x))

Putting it all together – the delta rule
• The delta rule is the gradient descent rule for updating the weights of the inputs to neurons in a single-layer neural network
• Apply the delta rule to the output layer weights

Update the weights with gradient descent
• Set the learning rate \eta = 0.5
• w_new := w − \eta · \partial E_total / \partial w

Backpropagation to the hidden layer
• Continue the backwards pass to calculate new values for w_1, w_2, w_3 and w_4
• out_h1 affects both o_1 and o_2, so both need to be taken into account:
  \partial E_total / \partial out_h1 = \partial E_o1 / \partial out_h1 + \partial E_o2 / \partial out_h1
• Consider one of the two terms: \partial E_o1 / \partial out_h1 = (\partial E_o1 / \partial net_o1) · (\partial net_o1 / \partial out_h1); the first factor can be calculated from values computed before, and the second factor is just w_5
• Plug the values in, compute the same value for o_2, and compute the total
• Next we need \partial out_h1 / \partial net_h1 and \partial net_h1 / \partial w for each weight w
• Compute the partial derivative with respect to each weight
• Putting it together, we can now update w_1
• Compute the partial derivatives in the same way for w_2, w_3 and w_4, and update them too

After first update with backpropagation
(figure: the network with the updated weights)

Did the error decrease?
• The old error was 0.298371109
• Improvement after the first update: 0.007343335
• After 10000 updates the error will be ca 0.000035085
• The generated outputs will then be 0.015912196 for the 0.01 target and 0.984065734 for the 0.99 target
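The numeric walkthrough lives in the slide figures, so here is the whole example as a compact NumPy sketch. The inputs, targets and initial weights are taken from the linked Matt Mazur tutorial (an assumption, since only the error values survive in the text above); as in the tutorial, the biases are left fixed. It reproduces the numbers quoted above.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Values assumed from the linked tutorial: inputs, targets,
# initial weights w1..w8 and biases b1, b2
x  = np.array([0.05, 0.10])          # i1, i2
t  = np.array([0.01, 0.99])          # target outputs
Wh = np.array([[0.15, 0.20],         # w1, w2 (into h1)
               [0.25, 0.30]])        # w3, w4 (into h2)
Wo = np.array([[0.40, 0.45],         # w5, w6 (into o1)
               [0.50, 0.55]])        # w7, w8 (into o2)
b1, b2 = 0.35, 0.60
eta = 0.5                            # learning rate from the slides

for step in range(10000):
    # Forward pass
    out_h = sigmoid(Wh @ x + b1)                 # hidden layer outputs
    out_o = sigmoid(Wo @ out_h + b2)             # network outputs
    E = 0.5 * np.sum((t - out_o) ** 2)           # total squared error
    if step == 0:
        print(f"initial error: {E:.9f}")         # 0.298371109

    # Backwards pass, using sigma'(x) = sigma(x) * (1 - sigma(x))
    delta_o = (out_o - t) * out_o * (1 - out_o)          # output layer
    delta_h = (Wo.T @ delta_o) * out_h * (1 - out_h)     # hidden layer

    # Gradient descent updates: w := w - eta * dE_total/dw
    Wo -= eta * np.outer(delta_o, out_h)
    Wh -= eta * np.outer(delta_h, x)

out_h = sigmoid(Wh @ x + b1)
out_o = sigmoid(Wo @ out_h + b2)
print("final error:", 0.5 * np.sum((t - out_o) ** 2))  # ~0.000035
print("outputs:", out_o)   # ~0.0159 and ~0.9841 vs targets 0.01 and 0.99
```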
In conclusion
• Neural networks consist of artificial neurons organized into layers and connected to each other with learnable weights.
• Backpropagation with gradient descent is the standard method for training neural networks.
• Backpropagation can be used to compute the gradients of a neural network, regardless of the depth of the network.
• Of course, there are other important tricks and tips, but this is the basis for understanding neural networks and deep learning.

Common neural network architectures

Feed-forward network
• The simplest type of neural network
• Connections between units do not form cycles
• Information always moves in one direction; it never goes backwards
https://upload.wikimedia.org/wikipedia/en/5/54/Feed_forward_neural_net.gif

Recurrent neural network
• Connections between units form cycles
• They possess internal memory – they “remember” past inputs
• Suitable for modeling sequential/temporal data, such as text and other language data

Convolutional neural networks
• Convolutional layers have neurons arranged in 3 dimensions
• Especially suitable for processing image data
http://parse.ele.tue.nl/education/cluster2

Autoencoders
• The output layer attempts to reconstruct the input
• Used for unsupervised feature learning
• The hidden layer typically has fewer neurons than the input, thus performing data compression

Getting started with neural networks

Courses and tutorials
• https://www.coursera.org/learn/machine-learning – an introductory course on machine learning; provides the necessary background
• https://www.coursera.org/learn/neural-networks – a course on neural networks; assumes prior knowledge of machine learning
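If you prefer to start by experimenting, here is a minimal hands-on sketch that trains a small feed-forward network on the XOR data from earlier. The choice of scikit-learn and all parameter values are illustrative, not from the slides.

```python
from sklearn.neural_network import MLPClassifier

# XOR data from the earlier slides: not linearly separable,
# so a hidden layer is required
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

# A small feed-forward network: one hidden layer with 4 sigmoid neurons,
# trained with gradient-based optimization (backpropagation computes
# the gradients; 'lbfgs' is a quasi-Newton variant that suits tiny data)
clf = MLPClassifier(hidden_layer_sizes=(4,), activation='logistic',
                    solver='lbfgs', random_state=0, max_iter=2000)
clf.fit(X, y)
print(clf.predict(X))   # ideally [0, 1, 1, 0]; may vary with the random seed
```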