Deep Learning I

Slides adapted from Vlad Morariu. CMSC 422 Introduction to Machine Learning, Lecture 20: Deep Learning I. Furong Huang / [email protected]

History of Neural Networks

Standard computer vision pipeline
• Pipeline: image → feature extraction → features → classification → predicted labels (e.g., "cat" or "background")
• Features: HOG, SIFT, LBP, …
• Classifiers: SVM, RF, KNN, … (trained with supervision)
• The features are hand-crafted, not trained, so the pipeline is eventually limited by feature quality
Cat image credit: https://raw.githubusercontent.com/BVLC/caffe/master/examples/images/cat.jpg

Deep learning
• Multiple-layer neural networks learn both the features and the classifier directly from supervised training ("end-to-end" training)
Image credit: LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 1998.

Speech Recognition
• SWITCHBOARD: telephone speech corpus for research and development
Slide credit: Bohyung Han

Image Classification Performance
• [Figure: image classification top-5 errors (%)]
Figure from: K. He, X. Zhang, S. Ren, J. Sun. "Deep Residual Learning for Image Recognition." arXiv 2015.
Slide credit: Bohyung Han

Biological inspiration
Image source: http://cs231n.github.io/neural-networks-1/

Artificial neuron
• The activation function is usually non-linear: step, tanh, sigmoid
• The actual biological system is much more complicated
Image source: http://cs231n.github.io/neural-networks-1/

McCulloch-Pitts Model
Read: https://en.wikipedia.org/wiki/Artificial_neuron
• Threshold Logic Unit (TLU)
➢ Warren McCulloch and Walter Pitts, 1943
➢ Binary inputs/outputs and a threshold activation function
➢ Can represent AND/OR/NOT functions, which can be composed into more complex functions

Rosenblatt's Perceptron
Read: https://en.wikipedia.org/wiki/Perceptron
• Perceptron
➢ Proposed by Frank Rosenblatt in 1957
➢ Based on the McCulloch-Pitts model
➢ Real-valued inputs/outputs, threshold activation function

Perceptron Learning
• Given a training dataset of input features and labels
• Initialize the weights randomly
• For each example in the training set
➢ Classify the example using the current weights
➢ Update the weights
• If the data are linearly separable, convergence is guaranteed in a bounded number of iterations (a minimal sketch of this procedure follows below)
https://en.wikipedia.org/wiki/Perceptron
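To make the learning procedure above concrete, here is a minimal NumPy sketch of the perceptron update rule. The function and variable names (`perceptron_train`, `lr`, etc.) are my own, not from the slides, and it assumes labels in {-1, +1} with a threshold (sign) activation.

```python
import numpy as np

def perceptron_train(X, y, epochs=100, lr=1.0):
    """Minimal perceptron learning sketch.

    X: (n_samples, n_features) real-valued inputs
    y: (n_samples,) labels in {-1, +1}
    Returns weights w and bias b; if the data are linearly separable,
    training stops once every example is classified correctly.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            # Classify the example with the current weights (threshold activation)
            pred = 1 if xi @ w + b > 0 else -1
            # Update the weights only when the example is misclassified
            if pred != yi:
                w += lr * yi * xi
                b += lr * yi
                errors += 1
        if errors == 0:  # converged: no mistakes in a full pass
            break
    return w, b

# Example: learn the (linearly separable) AND function with {-1, +1} labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = perceptron_train(X, y)
print(np.where(X @ w + b > 0, 1, -1))  # [-1 -1 -1  1]
```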
Multiple output variables
• Weights to predict multiple outputs can be learned independently
• [Figure: fully connected layer mapping inputs x_1, …, x_4 to outputs y_1, y_2, y_3 through weights w_{ij} and biases b_1, b_2, b_3]

Perceptron success
• Implemented as custom-built hardware, the "Mark I Perceptron"
• Input: photocells
• Weights: potentiometers
• Weight updates: electric motors
• Demonstrated the ability to classify 20x20 images
• Generated lots of AI excitement
• In 1958, the New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
https://en.wikipedia.org/wiki/Perceptron

Linear Classifier
Slide credit: Bohyung Han

Nonlinear Classifier
Slide credit: Bohyung Han

Nonlinear Classifier – XOR problem

Perceptron limitations
• If there are multiple separating hyperplanes, learning will converge to one of them (not necessarily the optimal one)
• If the training set is not linearly separable, training will fail completely
• Marvin Minsky and Seymour Papert, "Perceptrons", 1969
• Proved that it is impossible to learn an XOR function with a single-layer perceptron network
• Led to the "AI Winter" of the 1970s
https://en.wikipedia.org/wiki/Perceptron

Multi-Layer Perceptron (MLP)
• The activation function need not be a threshold
• Multiple layers can represent the XOR function
• But the perceptron algorithm cannot be used to update the weights
• Why? Hidden layers are not observed!
Image source: http://cs231n.github.io/neural-networks-1/
https://en.wikipedia.org/wiki/Multilayer_perceptron

Multi-Layer: Backpropagation
• Setup: input $x_k$ feeds sigmoid neuron $i$ through weight $w_{ki}$ (pre-activation $z_i$, output $\hat{y}_i$); neuron $i$ feeds sigmoid neuron $j$ through weight $w_{ij}$ (pre-activation $z_j$, output $\hat{y}_j$); the network output is scored by a loss $E$
• Output neuron: $\frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial \hat{y}_j}\,\frac{d\hat{y}_j}{dz_j}$
• Hidden activation: $\frac{\partial E}{\partial \hat{y}_i} = \sum_j \frac{\partial E}{\partial z_j}\,\frac{dz_j}{d\hat{y}_i} = \sum_j w_{ij}\,\frac{\partial E}{\partial z_j} = \sum_j w_{ij}\,\frac{\partial E}{\partial \hat{y}_j}\,\frac{d\hat{y}_j}{dz_j}$
• First-layer weights (summing over training examples $n$): $\frac{\partial E}{\partial w_{ki}} = \sum_n \frac{\partial E}{\partial \hat{y}_i^n}\,\frac{d\hat{y}_i^n}{dz_i^n}\,\frac{\partial z_i^n}{\partial w_{ki}} = \sum_n \frac{d\hat{y}_i^n}{dz_i^n}\,\frac{\partial z_i^n}{\partial w_{ki}} \sum_j w_{ij}\,\frac{\partial E}{\partial \hat{y}_j^n}\,\frac{d\hat{y}_j^n}{dz_j^n}$
• (A worked sketch of backpropagation on the XOR problem follows at the end of these notes)
Slide credit: Bohyung Han

Stochastic Gradient Descent (SGD)
• Update the weights for each sample: $E^n = \frac{1}{2}(y^n - \hat{y}^n)^2$, $\;\mathbf{w}_i(t+1) = \mathbf{w}_i(t) - \epsilon\,\frac{\partial E^n}{\partial \mathbf{w}_i}$
  + Fast, online
  − Sensitive to noise
• Minibatch SGD: update the weights for a small set of samples $B$: $E = \frac{1}{2}\sum_{n \in B} (y^n - \hat{y}^n)^2$, $\;\mathbf{w}_i(t+1) = \mathbf{w}_i(t) - \epsilon\,\frac{\partial E}{\partial \mathbf{w}_i}$
  + Fast, online
  + Robust to noise
Slide credit: Bohyung Han

Momentum
• Remember the previous direction
  + Converges faster
  + Avoids oscillation
• $v_i(t) = \alpha\, v_i(t-1) - \epsilon\,\frac{\partial E}{\partial w_i}(t)$, $\;\mathbf{w}(t+1) = \mathbf{w}(t) + \mathbf{v}(t)$
Slide credit: Bohyung Han

Weight Decay
• Penalize the size of the weights: $C = E + \frac{\lambda}{2}\sum_i w_i^2$
• $w_i(t+1) = w_i(t) - \epsilon\,\frac{\partial C}{\partial w_i} = w_i(t) - \epsilon\,\frac{\partial E}{\partial w_i} - \epsilon\lambda\, w_i$
  + Improves generalization a lot!
• (A combined SGD/momentum/weight-decay sketch also follows at the end of these notes)
Slide credit: Bohyung Han

Furong Huang
3251 A.V. Williams, College Park, MD 20740
301.405.8010 / [email protected]
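The backpropagation sketch referenced above: a minimal NumPy implementation of the two-sigmoid-layer chain rule from the backpropagation slide, trained on XOR with the squared-error loss from the SGD slide. The architecture (2 inputs, a small hidden layer, 1 output), the hyperparameters, and all variable names are my own illustrative choices, not from the slides; it demonstrates that a hidden layer lets gradient descent fit XOR, which a single-layer perceptron cannot.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR data: not linearly separable, so a single-layer perceptron cannot fit it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(scale=1.0, size=(2, 8))   # input -> hidden weights w_ki
b1 = np.zeros((1, 8))
W2 = rng.normal(scale=1.0, size=(8, 1))   # hidden -> output weights w_ij
b2 = np.zeros((1, 1))
eps = 0.5                                  # learning rate

for step in range(10000):
    # Forward pass
    z1 = X @ W1 + b1
    h = sigmoid(z1)        # hidden activations (the slide's yhat_i)
    z2 = h @ W2 + b2
    yhat = sigmoid(z2)     # output (the slide's yhat_j)

    # Backward pass, following the slide's chain rule with E = 0.5 * (y - yhat)^2
    dE_dyhat = yhat - y                     # dE / d yhat_j
    dE_dz2 = dE_dyhat * yhat * (1 - yhat)   # dE/dz_j = dE/dyhat_j * sigmoid'(z_j)
    dE_dh = dE_dz2 @ W2.T                   # dE/dyhat_i = sum_j w_ij * dE/dz_j
    dE_dz1 = dE_dh * h * (1 - h)            # dE/dz_i

    # Gradients w.r.t. the weights (summed over the n training examples)
    dW2, db2 = h.T @ dE_dz2, dE_dz2.sum(0, keepdims=True)
    dW1, db1 = X.T @ dE_dz1, dE_dz1.sum(0, keepdims=True)

    # Gradient descent update
    W2 -= eps * dW2; b2 -= eps * db2
    W1 -= eps * dW1; b1 -= eps * db1

print(np.round(yhat, 2).ravel())   # typically approaches [0, 1, 1, 0]
```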
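The optimizer sketch referenced above: the SGD, momentum, and weight-decay rules from the slides combine naturally into one update routine. This is a hedged sketch under my own naming and hyperparameter choices (`sgd_step`, `eps`, `alpha`, `lam`), not the course's reference implementation; the toy usage example stands in for gradients computed on a minibatch of a real model.

```python
import numpy as np

def sgd_step(w, grad, v, eps=0.01, alpha=0.9, lam=1e-4):
    """One minibatch update with momentum and weight decay.

    w    : current weights
    grad : dE/dw accumulated over the minibatch B
    v    : momentum buffer v(t-1) from the previous step
    """
    # Weight decay: gradient of (lam/2) * sum_i w_i^2 is lam * w
    grad = grad + lam * w
    # Momentum: v(t) = alpha * v(t-1) - eps * dC/dw
    v = alpha * v - eps * grad
    # Update: w(t+1) = w(t) + v(t)
    w = w + v
    return w, v

# Tiny usage example: minimize E(w) = 0.5 * ||w - target||^2 from noisy gradients
rng = np.random.default_rng(1)
target = np.array([3.0, -2.0])
w = np.zeros(2)
v = np.zeros(2)
for step in range(500):
    batch_noise = rng.normal(scale=0.1, size=2)   # stands in for minibatch sampling noise
    grad = (w - target) + batch_noise             # dE/dw for this "minibatch"
    w, v = sgd_step(w, grad, v, eps=0.05)
print(np.round(w, 2))   # close to [3., -2.], slightly shrunk toward 0 by weight decay
```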
