Deep Learning Basics
Lecture 1: Feedforward
Princeton University COS 495
Instructor: Yingyu Liang

Motivation I: representation learning

Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class 퓗 and loss function 푙
• Optimization: minimize the empirical loss

Features
[Figure: an image 푥 is mapped by a feature extractor, e.g., a color histogram over red/green/blue, to features 휙(푥), on which the hypothesis 푦 = 푤ᵀ휙(푥) is built]
• Features: part of the model

Nonlinear model
• Build hypothesis 푦 = 푤ᵀ휙(푥): a linear model
• Example: Polynomial kernel SVM
  • 푦 = sign(푤ᵀ휙(푥) + 푏), applied to inputs 푥₁, 푥₂
  • The feature map 휙(푥) is fixed
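As a concrete sketch of a fixed feature map, the snippet below builds a hand-chosen degree-2 polynomial map and a linear classifier on top of it. The specific map `poly2_features` and the weights are illustrative assumptions, not the exact map induced by a polynomial kernel (which differs only in term scaling):

```python
import numpy as np

def poly2_features(x):
    """A fixed degree-2 feature map phi(x) for 2-d input (illustrative choice).

    phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2].
    """
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

def predict(w, b, x):
    """Linear classifier on top of the fixed features: sign(w^T phi(x) + b)."""
    return np.sign(w @ poly2_features(x) + b)

# Only w and b are learned; phi itself is fixed by hand.
w = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # scores x1^2 + x2^2
b = -1.0                                       # decision boundary: the unit circle
print(predict(w, b, np.array([2.0, 0.0])))     # outside the circle -> 1.0
print(predict(w, b, np.array([0.1, 0.1])))     # inside the circle  -> -1.0
```

The nonlinearity lives entirely in 휙; the learned part stays linear.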

Motivation: representation learning
• Why don’t we also learn 휙(푥)?
• Learn both the feature map 휙(푥) and the weights 푤 in 푦 = 푤ᵀ휙(푥)

Feedforward networks
• View each dimension of 휙(푥) as something to be learned, with output 푦 = 푤ᵀ휙(푥)

Feedforward networks
• Linear functions 휙ᵢ(푥) = 휃ᵢᵀ푥 don’t work: the composition 푦 = 푤ᵀ휙(푥) would still be linear in 푥, so some nonlinearity is needed

Feedforward networks
• Typically, set 휙ᵢ(푥) = 푟(휃ᵢᵀ푥), where 푟(⋅) is some nonlinear function, and keep 푦 = 푤ᵀ휙(푥)
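This one-hidden-layer construction can be sketched in a few lines of NumPy. The sizes, random weights, and the choice 푟 = tanh below are illustrative assumptions:

```python
import numpy as np

def forward(x, Theta, w, r=np.tanh):
    """One-hidden-layer feedforward network.

    Row i of Theta holds theta_i, so phi(x) = r(Theta @ x) stacks the learned
    features phi_i(x) = r(theta_i^T x); the output is y = w^T phi(x).
    """
    phi = r(Theta @ x)  # learned nonlinear features
    return w @ phi      # linear readout on top of them

rng = np.random.default_rng(0)
Theta = rng.normal(size=(4, 3))  # 4 hidden units, 3-d input (arbitrary sizes)
w = rng.normal(size=4)
x = rng.normal(size=3)
print(forward(x, Theta, w))      # a single scalar output
```

With the identity in place of 푟, this collapses back to the linear model above, which is exactly why the nonlinearity is needed.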

Feedforward deep networks
• What if we go deeper?
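Going deeper just repeats the same step: each hidden layer applies a linear map followed by the nonlinearity. A minimal sketch, with arbitrary layer sizes and biases omitted for brevity:

```python
import numpy as np

def deep_forward(x, weights, w, r=np.tanh):
    """Deep feedforward pass: h_1 = r(W_1 x), h_(l+1) = r(W_(l+1) h_l), y = w^T h_L."""
    h = x
    for W in weights:
        h = r(W @ h)   # one hidden layer: linear map, then nonlinearity
    return w @ h       # linear output layer

rng = np.random.default_rng(1)
weights = [rng.normal(size=(5, 3)),   # x   -> h_1
           rng.normal(size=(5, 5)),   # h_1 -> h_2
           rng.normal(size=(4, 5))]   # h_2 -> h_3
w = rng.normal(size=4)
x = rng.normal(size=3)
print(deep_forward(x, weights, w))
```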
[Figure: deep network 푥 → ℎ₁ → ℎ₂ → ⋯ → ℎ_퐿 → 푦. Figure from Deep Learning, by Goodfellow, Bengio, Courville; dark boxes are things to be learned.]

Motivation II: neurons

Motivation: neurons
[Figure of a biological neuron, from Wikipedia]

Motivation: abstract neuron model
• A neuron is activated when the correlation between the input 푥 and a pattern 휃 exceeds some threshold 푏
• 푦 = threshold(휃ᵀ푥 − 푏), or 푦 = 푟(휃ᵀ푥 − 푏)
• 푟(⋅) is called the activation function
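Both variants of the abstract neuron can be sketched directly; the pattern 휃 and threshold 푏 below are made-up values:

```python
import numpy as np

def neuron(x, theta, b, r=None):
    """Abstract neuron: fires when the correlation theta^T x exceeds threshold b.

    r=None gives the hard unit y = 1[theta^T x - b >= 0]; passing an
    activation function r gives the smooth variant y = r(theta^T x - b).
    """
    z = theta @ x - b
    if r is None:
        return float(z >= 0)
    return r(z)

theta = np.array([1.0, 1.0])                       # the pattern to match
print(neuron(np.array([0.8, 0.8]), theta, b=1.0))  # 1.6 - 1.0 >= 0 -> 1.0
print(neuron(np.array([0.2, 0.2]), theta, b=1.0))  # 0.4 - 1.0 < 0  -> 0.0
```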
[Figure: inputs 푥₁, 푥₂, …, 푥_푑 feeding a single unit that outputs 푦]

Motivation: artificial neural networks
• Put into layers: feedforward deep networks
[Figure: input 푥, hidden layers ℎ₁, ℎ₂, …, ℎ_퐿, output 푦]

Components in Feedforward networks

Components
• Representations:
  • Input
  • Hidden variables
• Layers/weights:
  • Hidden layers
  • Output layer

Components
[Figure: input 푥 enters through the first layer; hidden variables ℎ₁, ℎ₂, …, ℎ_퐿 follow; the output layer produces 푦]

Input
• Represented as a vector
• Sometimes requires some preprocessing, e.g.:
  • Subtract the mean
  • Normalize to [−1, 1]
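Both preprocessing steps are one-liners. This sketch normalizes each feature by its maximum absolute value after centering, which is one of several reasonable conventions for mapping into [−1, 1]:

```python
import numpy as np

def preprocess(X):
    """Center each column (subtract mean), then scale it into [-1, 1]."""
    X = X - X.mean(axis=0)         # subtract the per-feature mean
    scale = np.abs(X).max(axis=0)
    scale[scale == 0] = 1.0        # leave constant features untouched
    return X / scale               # now every entry lies in [-1, 1]

X = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 60.0]])
Xp = preprocess(X)
print(Xp.min(), Xp.max())  # -1.0 1.0
```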

Output layers
• Regression: 푦 = 푤ᵀℎ + 푏
• Linear units: no nonlinearity

Output layers
• Multi-dimensional regression: 푦 = 푊ᵀℎ + 푏
• Linear units: no nonlinearity

Output layers
• Binary classification: 푦 = 휎(푤ᵀℎ + 푏)
• Corresponds to using logistic regression on ℎ

Output layers
• Multi-class classification: 푦 = softmax(푧) where 푧 = 푊ᵀℎ + 푏
• Corresponds to using multi-class logistic regression on ℎ
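The four output layers differ only in the function applied to the linear score. A sketch with made-up hidden values and weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

h = np.array([0.5, -1.0, 2.0])           # hidden representation (illustrative)
w, b = np.array([1.0, 0.0, 0.5]), 0.1    # weights for the scalar outputs
W, c = np.eye(3), np.zeros(3)            # weights for 3 output dimensions/classes

y_reg   = w @ h + b                      # regression: linear unit
y_multi = W.T @ h + c                    # multi-dimensional regression
y_bin   = sigmoid(w @ h + b)             # binary classification: logistic regression on h
y_soft  = softmax(W.T @ h + c)           # multi-class: softmax over z = W^T h + b

print(y_reg)         # 1.6
print(y_soft.sum())  # the softmax outputs form a probability distribution
```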

Hidden layers
• Each neuron takes a weighted linear combination of the values in the previous layer ℎᵢ
• So it can be thought of as outputting one value for the next layer ℎᵢ₊₁

Hidden layers
• 푦 = 푟(푤ᵀ푥 + 푏)
• Typical activation functions 푟(⋅):
  • Threshold: 푡(푧) = 핀[푧 ≥ 0]
  • Sigmoid: 휎(푧) = 1/(1 + exp(−푧))
  • Tanh: tanh(푧) = 2휎(2푧) − 1
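The three activations above, as code; the tanh identity from the slide is checked against NumPy's own tanh:

```python
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)        # t(z) = 1[z >= 0]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z) = 1 / (1 + exp(-z))

def tanh_via_sigmoid(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0  # tanh(z) = 2*sigma(2z) - 1

z = np.linspace(-2.0, 2.0, 5)            # [-2, -1, 0, 1, 2]
print(threshold(z))                                  # [0. 0. 1. 1. 1.]
print(np.allclose(tanh_via_sigmoid(z), np.tanh(z)))  # True
```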

Hidden layers
• Problem: saturation. For inputs of large magnitude, the sigmoid and tanh curves flatten out, so the gradient becomes too small to drive learning.
[Figure borrowed from Pattern Recognition and Machine Learning, Bishop]
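Saturation is easy to see numerically: the sigmoid's derivative 휎′(푧) = 휎(푧)(1 − 휎(푧)) peaks at 0.25 and collapses for large |푧|. A quick check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma'(z) = sigma(z) * (1 - sigma(z))

print(sigmoid_grad(0.0))   # 0.25, the maximum
print(sigmoid_grad(10.0))  # ~4.5e-05: too small a gradient to drive learning
```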

Hidden layers
• Activation function ReLU (rectified linear unit): ReLU(푧) = max{푧, 0}
• Gradient 1 for 푧 > 0, gradient 0 for 푧 < 0
[Figure from Deep Learning, by Goodfellow, Bengio, Courville]

Hidden layers
• Generalizations of ReLU: gReLU(푧) = max{푧, 0} + 훼 min{푧, 0}
  • Leaky-ReLU(푧) = max{푧, 0} + 0.01 min{푧, 0} (fixed 훼 = 0.01)
  • Parametric-ReLU(푧): 훼 learnable
[Figure: plot of gReLU(푧) versus 푧]
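All three variants are instances of one formula; a minimal sketch:

```python
import numpy as np

def g_relu(z, alpha):
    """Generalized ReLU: gReLU(z) = max{z, 0} + alpha * min{z, 0}."""
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

def relu(z):
    return g_relu(z, alpha=0.0)    # alpha = 0 recovers the plain ReLU

def leaky_relu(z):
    return g_relu(z, alpha=0.01)   # fixed small negative slope

# Parametric-ReLU is the same formula with alpha treated as a learnable parameter.
z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # negatives clipped to 0, positives passed through
print(leaky_relu(z))  # negatives scaled by 0.01 instead of clipped
```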