Deep Learning I
CMSC 422 Introduction to Machine Learning
Lecture 20: Deep Learning I
Furong Huang / [email protected]
Slides adapted from Vlad Morariu

History of Neural Networks

Standard computer vision pipeline
[Figure: pipeline from an input image ("cat") through feature extraction to a classifier that predicts "cat" or "background"; supervision trains only the classifier.]
• Features: HOG, SIFT, LBP, …
• Classifiers: SVM, RF, KNN, …
• Features are hand-crafted, not trained, so the pipeline is eventually limited by feature quality
Cat image credit: https://raw.githubusercontent.com/BVLC/caffe/master/examples/images/cat.jpg

Deep learning
[Figure: a multi-layer network in which supervision trains the features and the classifier together.]
Image credit: LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 1998.
• Multiple-layer neural networks learn features and classifiers directly (“end-to-end” training)

Speech Recognition
• SWITCHBOARD: telephone speech corpus for research and development
Slide credit: Bohyung Han

Image Classification Performance
[Figure: image classification top-5 errors (%).]
Figure from: K. He, X. Zhang, S. Ren, J. Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015. (slides)
Slide credit: Bohyung Han

Biological inspiration
Image source: http://cs231n.github.io/neural-networks-1/

Artificial neuron
Image source: http://cs231n.github.io/neural-networks-1/
• Activation function is usually non-linear: step, tanh, sigmoid
• The actual biological system is much more complicated

McCulloch-Pitts Model
Read: https://en.wikipedia.org/wiki/Artificial_neuron
• Threshold Logic Unit (TLU)
➢ Warren McCulloch and Walter Pitts, 1943
➢ Binary inputs/outputs and threshold activation function
➢ Can represent AND/OR/NOT functions, which can be composed for complex functions

Rosenblatt’s Perceptron
Read: https://en.wikipedia.org/wiki/Perceptron
• Perceptron
➢ Proposed by Frank Rosenblatt in 1957
➢ Based on the McCulloch-Pitts model
➢ Real inputs/outputs, threshold activation function

Perceptron Learning
• Given a training dataset of input features and labels
• Initialize weights randomly
• For each example in the training set
➢ Classify the example using the current weights
➢ Update the weights
• If the data is linearly separable, convergence is guaranteed in a bounded number of iterations (see the code sketch after the “Perceptron success” slide below)
https://en.wikipedia.org/wiki/Perceptron

Multiple output variables
Weights to predict multiple outputs can be learned independently.
[Figure: inputs x1–x4 fully connected to outputs y1–y3 through weights w11–w34 and biases b1–b3.]

Perceptron success
• Implemented as custom-built hardware, the “Mark I Perceptron”
➢ Input: photocells
➢ Weights: potentiometers
➢ Weight updates: electric motors
• Demonstrated the ability to classify 20x20 images
• Generated lots of AI excitement
• In 1958, the New York Times reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
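To make the perceptron learning procedure concrete, here is a minimal NumPy sketch of the classic update rule (classify an example with the current weights, then nudge the weights only when the prediction is wrong). The toy dataset, learning rate, and function names are illustrative assumptions, not part of the original slides.

```python
# Minimal perceptron learning sketch (illustrative; not from the slides).
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0, seed=0):
    """X: (n_samples, n_features), y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])  # initialize weights randomly
    b = 0.0
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            # Classify the example using the current weights (threshold activation).
            pred = 1 if np.dot(w, x_n) + b > 0 else -1
            # Update the weights only when the example is misclassified.
            if pred != y_n:
                w += lr * y_n * x_n
                b += lr * y_n
    return w, b

# Linearly separable toy data (logical OR), so convergence is guaranteed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
w, b = train_perceptron(X, y)
print([1 if np.dot(w, x) + b > 0 else -1 for x in X])  # expected: [-1, 1, 1, 1]
```

For the “Multiple output variables” slide, the same rule applies row by row to a weight matrix of shape (n_outputs, n_features), since each output's weights are learned independently.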
https://en.wikipedia.org/wiki/Perceptron

Linear Classifier
Slide credit: Bohyung Han

Nonlinear Classifier
Slide credit: Bohyung Han

Nonlinear Classifier – XOR problem

Perceptron limitations
• If there are multiple separating hyperplanes, learning will converge to one of them (not necessarily the optimal one)
• If the training set is not linearly separable, training will fail completely
• Marvin Minsky and Seymour Papert, “Perceptrons”, 1969
➢ Proved that it is impossible to learn an XOR function with a single-layer perceptron network
➢ Led to the “AI winter” of the 1970s
https://en.wikipedia.org/wiki/Perceptron

Multi-Layer Perceptron (MLP)
Image source: http://cs231n.github.io/neural-networks-1/
• Activation function need not be a threshold
• Multiple layers can represent the XOR function
• But the perceptron algorithm cannot be used to update the weights
• Why? Hidden layers are not observed!
https://en.wikipedia.org/wiki/Multilayer_perceptron

Multi-Layer: Backpropagation
[Diagram: input $x_k$ feeds sigmoid neuron $i$ through weight $w_{ki}$, producing pre-activation $z_i$ and output $\hat{y}_i$; neuron $i$ feeds sigmoid neuron $j$ through weight $w_{ij}$, producing $z_j$ and $\hat{y}_j$, which enters the loss $E$.]

$$\frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial \hat{y}_j}\,\frac{d\hat{y}_j}{dz_j}$$

$$\frac{\partial E}{\partial \hat{y}_i} = \sum_j \frac{\partial E}{\partial z_j}\,\frac{dz_j}{d\hat{y}_i} = \sum_j \frac{\partial E}{\partial z_j}\, w_{ij} = \sum_j \frac{\partial E}{\partial \hat{y}_j}\,\frac{d\hat{y}_j}{dz_j}\, w_{ij}$$

$$\frac{\partial E}{\partial w_{ki}} = \sum_n \frac{\partial E}{\partial \hat{y}_i^n}\,\frac{d\hat{y}_i^n}{dz_i^n}\,\frac{\partial z_i^n}{\partial w_{ki}} = \sum_n \frac{d\hat{y}_i^n}{dz_i^n}\,\frac{\partial z_i^n}{\partial w_{ki}} \sum_j w_{ij}\,\frac{\partial E}{\partial \hat{y}_j^n}\,\frac{d\hat{y}_j^n}{dz_j^n}$$

Slide credit: Bohyung Han

Stochastic Gradient Descent (SGD)
Update the weights for each sample:

$$E^n = \frac{1}{2}\left(y^n - \hat{y}^n\right)^2, \qquad \boldsymbol{w}_i(t+1) = \boldsymbol{w}_i(t) - \epsilon\,\frac{\partial E^n}{\partial \boldsymbol{w}_i}$$

+ Fast, online
− Sensitive to noise

Minibatch SGD: update the weights for a small set of samples $B$:

$$E^B = \frac{1}{2}\sum_{n \in B}\left(y^n - \hat{y}^n\right)^2, \qquad \boldsymbol{w}_i(t+1) = \boldsymbol{w}_i(t) - \epsilon\,\frac{\partial E^B}{\partial \boldsymbol{w}_i}$$

+ Fast, online
+ Robust to noise

Slide credit: Bohyung Han

Momentum
Remember the previous update direction:
+ Converges faster
+ Avoids oscillation

$$v_i(t) = \alpha\, v_i(t-1) - \epsilon\,\frac{\partial E}{\partial w_i}(t), \qquad \boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \boldsymbol{v}(t)$$

Slide credit: Bohyung Han

Weight Decay
Penalize the size of the weights:

$$C = E + \frac{\lambda}{2}\sum_i w_i^2$$

$$w_i(t+1) = w_i(t) - \epsilon\,\frac{\partial C}{\partial w_i} = w_i(t) - \epsilon\,\frac{\partial E}{\partial w_i} - \epsilon\lambda w_i$$

+ Improves generalization a lot!

Slide credit: Bohyung Han

(A code sketch combining backpropagation, minibatch SGD, momentum, and weight decay appears at the end of this section.)

Furong Huang
3251 A.V. Williams, College Park, MD 20740
301.405.8010 / [email protected]
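To tie the last few slides together, here is a small self-contained sketch (an illustration under assumed hyperparameters, not code from the lecture) of a one-hidden-layer sigmoid network trained on XOR. It applies the backpropagation equations above together with minibatch SGD, momentum, and weight decay; the hidden size, learning rate, momentum coefficient, and decay strength are arbitrary choices for the example.

```python
# Tiny MLP on XOR: backpropagation + minibatch SGD + momentum + weight decay.
# Illustrative sketch only; hyperparameters are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR: not linearly separable, so a single-layer perceptron fails,
# but one hidden layer is enough.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=1.0, size=(2, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1))   # hidden -> output weights
b2 = np.zeros(1)

params = [W1, b1, W2, b2]
velocity = [np.zeros_like(p) for p in params]   # momentum buffers v
lr, alpha, lam, batch_size = 0.5, 0.9, 1e-4, 2  # epsilon, momentum, weight decay

for step in range(5000):
    # Sample a minibatch B.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    x, y = X[idx], Y[idx]

    # Forward pass: z = pre-activation, yhat = sigmoid(z).
    z1 = x @ W1 + b1
    h = sigmoid(z1)
    z2 = h @ W2 + b2
    yhat = sigmoid(z2)

    # Backward pass for E = 1/2 * sum (y - yhat)^2.
    dz2 = (yhat - y) * yhat * (1.0 - yhat)        # dE/dz at the output
    gW2 = h.T @ dz2
    gb2 = dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * h * (1.0 - h)            # propagate through w_ij and sigmoid
    gW1 = x.T @ dz1
    gb1 = dz1.sum(axis=0)

    grads = [gW1, gb1, gW2, gb2]
    for p, v, g in zip(params, velocity, grads):
        g = g + lam * p    # weight decay: C = E + (lam/2) * sum(w^2), applied to all params for simplicity
        v *= alpha         # momentum: v <- alpha * v - lr * dC/dw
        v -= lr * g
        p += v             # w <- w + v  (in-place, so W1/b1/W2/b2 are updated)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2).ravel())
# Should approach [0, 1, 1, 0]; exact values depend on the random seed and step count.
```

The gradient code is the vectorized form of the slide's chain-rule expressions: `dz2` is the per-sample ∂E/∂z_j, `dz2 @ W2.T` is the sum over j of w_ij ∂E/∂z_j, and the matrix products with the inputs accumulate the sum over samples n.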