Slides adapted from Vlad Morariu

CMSC 422 Introduction to Machine Learning
Lecture 20: History of Neural Networks

Furong Huang / [email protected]

Standard pipeline

[Figure: standard recognition pipeline: image → feature extraction → features → classification → predicted label (“cat” or “background”); supervision is used during training]
• Features: HOG, SIFT, LBP, …
• Classifiers: SVM, RF, KNN, …
• Features are hand-crafted, not trained, so the pipeline is eventually limited by feature quality
Cat image credit: https://raw.githubusercontent.com/BVLC/caffe/master/examples/images/cat.jpg

Deep learning training

[Figure: deep network trained with supervision; features and classifier are learned jointly. Image credit: LeCun, Y., Bottou, L., Bengio, Y., Haffner, P. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE, 1998.]
• Deep learning: multiple-layer neural networks learn features and classifiers directly (“end-to-end” training)

Speech Recognition
• SWITCHBOARD: telephone speech corpus for research and development

Slide credit: Bohyung Han

Image Classification Performance

[Figure: image classification top-5 errors (%). Figure from: K. He, X. Zhang, S. Ren, J. Sun. “Deep Residual Learning for Image Recognition.” arXiv 2015.]
Slide credit: Bohyung Han

Biological inspiration

Image source: http://cs231n.github.io/neural-networks-1/

Artificial neuron

Image source: http://cs231n.github.io/neural-networks-1/
• Activation function is usually non-linear: step, tanh, sigmoid
• The actual biological system is much more complicated
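To make the diagram concrete, here is a minimal sketch of a single artificial neuron: a weighted sum of the inputs plus a bias, passed through one of the activations listed above. The function name, weights, and inputs are illustrative, not from the slides.

```python
import numpy as np

def neuron(x, w, b, activation="sigmoid"):
    """One artificial neuron: activation(w . x + b)."""
    z = np.dot(w, x) + b                 # weighted sum of inputs plus bias
    if activation == "step":
        return float(z > 0)              # threshold (step) activation
    if activation == "tanh":
        return np.tanh(z)
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid

x = np.array([0.5, -1.0, 2.0])           # example inputs
w = np.array([0.4, 0.6, -0.2])           # example weights
print(neuron(x, w, b=0.1, activation="tanh"))
```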

McCulloch-Pitts Model

Read: https://en.wikipedia.org/wiki/Artificial_neuron
• Threshold Logic Unit (TLU)
➢ Warren McCulloch and Walter Pitts, 1943
➢ Binary inputs/outputs and threshold activation function
➢ Can represent AND/OR/NOT functions, which can be composed into more complex functions (a sketch follows below)
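As a rough sketch of the last point, with weights and thresholds picked by hand (not taken from the slides), a threshold unit over binary inputs implements the basic gates, and composing the gates gives more complex functions such as XOR:

```python
import numpy as np

def tlu(x, w, threshold):
    """Threshold Logic Unit: output 1 if the weighted sum reaches the threshold."""
    return int(np.dot(w, x) >= threshold)

# Hand-picked weights/thresholds realizing AND, OR, NOT on binary inputs.
AND = lambda a, b: tlu([a, b], w=[1, 1], threshold=2)
OR  = lambda a, b: tlu([a, b], w=[1, 1], threshold=1)
NOT = lambda a:    tlu([a],    w=[-1],   threshold=0)

# XOR composed from the gates above (needs more than a single unit).
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))
print([XOR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```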

Rosenblatt’s Perceptron

Read: https://en.wikipedia.org/wiki/Perceptron
• Perceptron
➢ Proposed by Frank Rosenblatt in 1957
➢ Based on the McCulloch-Pitts model
➢ Real-valued inputs/outputs, threshold activation function

Perceptron Learning

• Given a training dataset of input features and labels
• Initialize weights randomly
• For each example in the training set:
➢ Classify the example using the current weights
➢ Update the weights
• If the data is linearly separable, convergence is guaranteed in a bounded number of iterations (a minimal sketch of the update rule follows below)
https://en.wikipedia.org/wiki/Perceptron
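A minimal sketch of one common form of the perceptron update, assuming labels in {-1, +1} and the bias folded into the weight vector; names and the toy data are illustrative, not from the slides.

```python
import numpy as np

def train_perceptron(X, y, epochs=100):
    """Perceptron learning: classify each example and correct on mistakes."""
    X = np.hstack([X, np.ones((len(X), 1))])   # append 1 to fold the bias into w
    w = np.random.randn(X.shape[1]) * 0.01     # initialize weights randomly
    for _ in range(epochs):
        errors = 0
        for x_n, y_n in zip(X, y):
            y_hat = 1 if np.dot(w, x_n) >= 0 else -1   # classify with current weights
            if y_hat != y_n:                           # update only on mistakes
                w += y_n * x_n
                errors += 1
        if errors == 0:    # linearly separable data: guaranteed to reach this
            break
    return w

# Toy linearly separable data: positive when x1 + x2 > 1.
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 1.0], [0.9, 0.8]])
y = np.array([-1, -1, 1, 1])
print(train_perceptron(X, y))
```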

Multiple output variables

Weights to predict multiple outputs can be learned independently (see the sketch below)

[Figure: fully connected single layer mapping inputs x1–x4 to outputs y1–y3 through weights w11–w34 and biases b1–b3]
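A small sketch of that idea, assuming a threshold (step) activation as in the perceptron: stacking the per-output weight vectors into a matrix computes all outputs at once, and each row can still be trained on its own. Shapes and values are illustrative.

```python
import numpy as np

x = np.array([0.2, -0.5, 1.0, 0.3])   # inputs x1..x4
W = np.random.randn(3, 4)             # one weight row per output y1..y3
b = np.random.randn(3)                # one bias per output

y = (W @ x + b > 0).astype(int)       # all three outputs in one shot
print(y)

# Each output is an independent perceptron over the same inputs,
# so each row (W[i], b[i]) can be learned with the perceptron rule on its own.
```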

Perceptron success
• Implemented as custom-built hardware, the “Mark I Perceptron”
➢ Input: photocells
➢ Weights: potentiometers
➢ Weight updates: electric motors
• Demonstrated ability to classify 20x20 images
• Generated lots of AI excitement

• In 1958, the New York Times reported the perceptron to be “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”

https://en.wikipedia.org/wiki/Perceptron

Linear Classifier

Slide credit: Bohyung Han

Nonlinear Classifier

Slide credit: Bohyung Han

Nonlinear Classifier – XOR problem

Perceptron limitations

• If there are multiple separating hyperplanes, learning will converge to one of them (not necessarily the optimal one)
• If the training set is not linearly separable, training will fail completely

• Marvin Minsky and Seymour Papert, “Perceptrons”, 1969
• Proved that it is impossible to learn an XOR function with a single-layer perceptron network

• Led to the “AI Winter” of the 1970s
https://en.wikipedia.org/wiki/Perceptron

Multi-Layer Perceptron (MLP)

Image source: http://cs231n.github.io/neural-networks-1/

• Activation function need not be a threshold
• Multiple layers can represent the XOR function (see the sketch below)
• But the perceptron algorithm cannot be used to update the weights
• Why? Hidden layers are not observed!
https://en.wikipedia.org/wiki/Multilayer_perceptron
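As a concrete sketch of the second bullet, a two-layer network with threshold activations computes XOR. The weights below are hand-picked for illustration, not taken from the slides.

```python
import numpy as np

step = lambda z: (z > 0).astype(int)   # threshold activation

# Hidden layer: h1 = OR(a, b), h2 = NAND(a, b); output: AND(h1, h2) = XOR(a, b).
W1 = np.array([[ 1.0,  1.0],     # OR unit
               [-1.0, -1.0]])    # NAND unit
b1 = np.array([-0.5, 1.5])
W2 = np.array([[1.0, 1.0]])      # AND of the two hidden units
b2 = np.array([-1.5])

def mlp_xor(a, b):
    h = step(W1 @ np.array([a, b]) + b1)   # hidden layer (not directly observed)
    return int(step(W2 @ h + b2)[0])

print([mlp_xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```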

Multi-Layer: Backpropagation

[Figure: chain of two sigmoid neurons: input $x_k$, weight $w_{ki}$, sum $z_i$, output $\hat{y}_i$ (neuron $i$), then weight $w_{ij}$, sum $z_j$, output $\hat{y}_j$ (neuron $j$), followed by the loss giving error $E$]

$$\frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial \hat{y}_j}\,\frac{d\hat{y}_j}{dz_j}$$

$$\frac{\partial E}{\partial \hat{y}_i} = \sum_j \frac{\partial E}{\partial z_j}\,\frac{dz_j}{d\hat{y}_i} = \sum_j w_{ij}\,\frac{\partial E}{\partial z_j} = \sum_j w_{ij}\,\frac{\partial E}{\partial \hat{y}_j}\,\frac{d\hat{y}_j}{dz_j}$$

$$\frac{\partial E}{\partial w_{ki}} = \sum_n \frac{\partial E^n}{\partial \hat{y}_i^n}\,\frac{d\hat{y}_i^n}{dz_i^n}\,\frac{\partial z_i^n}{\partial w_{ki}} = \sum_n \frac{d\hat{y}_i^n}{dz_i^n}\,\frac{\partial z_i^n}{\partial w_{ki}} \sum_j w_{ij}\,\frac{\partial E^n}{\partial \hat{y}_j^n}\,\frac{d\hat{y}_j^n}{dz_j^n}$$

Slide credit: Bohyung Han
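A compact numerical sketch of these equations for a tiny two-layer sigmoid network with squared error on a single example; the shapes, seed, and data are illustrative assumptions, not from the slides.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # inputs x_k
y = np.array([1.0])                    # target
W_ki = rng.normal(size=(2, 3))         # input -> hidden weights w_ki
W_ij = rng.normal(size=(1, 2))         # hidden -> output weights w_ij

# Forward pass.
z_i = W_ki @ x;   y_i = sigmoid(z_i)   # hidden neuron i
z_j = W_ij @ y_i; y_j = sigmoid(z_j)   # output neuron j
E = 0.5 * np.sum((y - y_j) ** 2)

# Backward pass, following the slide's chain-rule expressions.
dE_dyj = y_j - y                        # dE/dy_j for squared error
dE_dzj = dE_dyj * y_j * (1 - y_j)       # dE/dz_j = dE/dy_j * dy_j/dz_j (sigmoid)
dE_dyi = W_ij.T @ dE_dzj                # dE/dy_i = sum_j w_ij dE/dz_j
dE_dzi = dE_dyi * y_i * (1 - y_i)       # dE/dz_i
dE_dWij = np.outer(dE_dzj, y_i)         # gradient for w_ij
dE_dWki = np.outer(dE_dzi, x)           # gradient for w_ki (dz_i/dw_ki = x_k)
print(E, dE_dWki)
```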

Stochastic Gradient Descent (SGD)

Update weights for each sample

$$E^n = \tfrac{1}{2}\,(y^n - \hat{y}^n)^2, \qquad \boldsymbol{w}_i(t+1) = \boldsymbol{w}_i(t) - \epsilon\,\frac{\partial E^n}{\partial \boldsymbol{w}_i}$$
+ Fast, online
− Sensitive to noise

Minibatch SGD: Update weights for a small set of samples

$$E = \tfrac{1}{2}\sum_{n \in B} (y^n - \hat{y}^n)^2, \qquad \boldsymbol{w}_i(t+1) = \boldsymbol{w}_i(t) - \epsilon\,\frac{\partial E}{\partial \boldsymbol{w}_i}$$
+ Fast, online
+ Robust to noise

Slide credit: Bohyung Han
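A minimal sketch of the minibatch SGD update for a linear model with squared error; the data, model, learning rate, and batch size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # toy dataset
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

w = np.zeros(5)
eps, batch_size = 0.01, 8                      # learning rate and minibatch size |B|

for t in range(200):
    B = rng.choice(len(X), size=batch_size, replace=False)   # sample a minibatch
    y_hat = X[B] @ w
    grad = X[B].T @ (y_hat - y[B])             # dE/dw for E = 1/2 sum_{n in B} (y^n - y_hat^n)^2
    w = w - eps * grad                         # w(t+1) = w(t) - eps * dE/dw
print(w)
```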

Momentum

Remember the previous direction
+ Converges faster
+ Avoids oscillation

$$v_i(t) = \alpha\, v_i(t-1) - \epsilon\,\frac{\partial E}{\partial w_i}(t)$$

$$\boldsymbol{w}(t+1) = \boldsymbol{w}(t) + \boldsymbol{v}(t)$$

Slide credit: Bohyung Han
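A sketch of the momentum update applied to a gradient such as the minibatch gradient above; the values of alpha and epsilon are illustrative.

```python
import numpy as np

def momentum_step(w, v, grad, eps=0.01, alpha=0.9):
    """One momentum update: blend the previous direction with the new gradient."""
    v = alpha * v - eps * grad      # v(t) = alpha * v(t-1) - eps * dE/dw(t)
    w = w + v                       # w(t+1) = w(t) + v(t)
    return w, v

w, v = np.zeros(5), np.zeros(5)
grad = np.ones(5)                   # placeholder gradient from, e.g., a minibatch
w, v = momentum_step(w, v, grad)
print(w, v)
```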

Weight Decay

Penalize the size of the weights

$$C = E + \frac{\lambda}{2}\sum_i w_i^2$$

$$w_i(t+1) = w_i(t) - \epsilon\,\frac{\partial C}{\partial w_i} = w_i(t) - \epsilon\,\frac{\partial E}{\partial w_i} - \epsilon\lambda\, w_i$$

+ Improves generalization a lot!
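A sketch of the weight-decay update; lambda is an assumed small regularization coefficient, and the gradient is a placeholder for the gradient of E.

```python
import numpy as np

def weight_decay_step(w, grad, eps=0.01, lam=1e-4):
    """Gradient step on C = E + (lambda/2) * sum(w^2): shrink the weights a little each step."""
    return w - eps * grad - eps * lam * w   # w(t+1) = w(t) - eps*dE/dw - eps*lambda*w

w = np.random.randn(5)
grad = np.ones(5)                           # placeholder gradient of E
print(weight_decay_step(w, grad))
```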

Slide credit: Bohyung Han

Furong Huang
3251 A.V. Williams, College Park, MD 20740
301.405.8010 / [email protected]