Deep Learning Basics
Lecture 1: Feedforward
Princeton University COS 495
Instructor: Yingyu Liang

Motivation I: representation learning

Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class 퓗 and loss function 푙
• Optimization: minimize the empirical loss

Features
[Figure: an image 푥 is mapped by a feature extractor, e.g., a color histogram over red/green/blue, to features 휙(푥), on which the hypothesis 푦 = 푤ᵀ휙(푥) is built]
• Features: part of the model

Nonlinear model
• Build hypothesis 푦 = 푤ᵀ휙(푥): a linear model
• Example: Polynomial kernel SVM
  • 푦 = sign(푤ᵀ휙(푥) + 푏), applied to inputs 푥₁, 푥₂
  • The feature map 휙(푥) is fixed
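As a concrete sketch of a fixed feature map, the snippet below builds a hand-chosen degree-2 polynomial map and a linear classifier on top of it. The specific map `poly2_features` and the weights are illustrative assumptions, not the exact map induced by a polynomial kernel (which differs only in term scaling):

```python
import numpy as np

def poly2_features(x):
    """A fixed degree-2 feature map phi(x) for 2-d input (illustrative choice).

    phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2].
    """
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

def predict(w, b, x):
    """Linear classifier on top of the fixed features: sign(w^T phi(x) + b)."""
    return np.sign(w @ poly2_features(x) + b)

# Only w and b are learned; phi itself is fixed by hand.
w = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # scores x1^2 + x2^2
b = -1.0                                       # decision boundary: the unit circle
print(predict(w, b, np.array([2.0, 0.0])))     # outside the circle -> 1.0
print(predict(w, b, np.array([0.1, 0.1])))     # inside the circle  -> -1.0
```

The nonlinearity lives entirely in 휙; the learned part stays linear.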

Motivation: representation learning
• Why don’t we also learn 휙(푥)?
• Learn both the feature map 휙(푥) and the weights 푤 in 푦 = 푤ᵀ휙(푥)

Feedforward networks
• View each dimension of 휙(푥) as something to be learned, with output 푦 = 푤ᵀ휙(푥)

Feedforward networks
• Linear functions 휙ᵢ(푥) = 휃ᵢᵀ푥 don’t work: the composition 푦 = 푤ᵀ휙(푥) would still be linear in 푥, so some nonlinearity is needed

Feedforward networks
• Typically, set 휙ᵢ(푥) = 푟(휃ᵢᵀ푥), where 푟(⋅) is some nonlinear function, and keep 푦 = 푤ᵀ휙(푥)
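This one-hidden-layer construction can be sketched in a few lines of NumPy. The sizes, random weights, and the choice 푟 = tanh below are illustrative assumptions:

```python
import numpy as np

def forward(x, Theta, w, r=np.tanh):
    """One-hidden-layer feedforward network.

    Row i of Theta holds theta_i, so phi(x) = r(Theta @ x) stacks the learned
    features phi_i(x) = r(theta_i^T x); the output is y = w^T phi(x).
    """
    phi = r(Theta @ x)  # learned nonlinear features
    return w @ phi      # linear readout on top of them

rng = np.random.default_rng(0)
Theta = rng.normal(size=(4, 3))  # 4 hidden units, 3-d input (arbitrary sizes)
w = rng.normal(size=4)
x = rng.normal(size=3)
print(forward(x, Theta, w))      # a single scalar output
```

With the identity in place of 푟, this collapses back to the linear model above, which is exactly why the nonlinearity is needed.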

Feedforward deep networks
• What if we go deeper?
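Going deeper just repeats the same step: each hidden layer applies a linear map followed by the nonlinearity. A minimal sketch, with arbitrary layer sizes and biases omitted for brevity:

```python
import numpy as np

def deep_forward(x, weights, w, r=np.tanh):
    """Deep feedforward pass: h_1 = r(W_1 x), h_(l+1) = r(W_(l+1) h_l), y = w^T h_L."""
    h = x
    for W in weights:
        h = r(W @ h)   # one hidden layer: linear map, then nonlinearity
    return w @ h       # linear output layer

rng = np.random.default_rng(1)
weights = [rng.normal(size=(5, 3)),   # x   -> h_1
           rng.normal(size=(5, 5)),   # h_1 -> h_2
           rng.normal(size=(4, 5))]   # h_2 -> h_3
w = rng.normal(size=4)
x = rng.normal(size=3)
print(deep_forward(x, weights, w))
```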
[Figure: deep network 푥 → ℎ₁ → ℎ₂ → ⋯ → ℎ_퐿 → 푦. Figure from Deep Learning, by Goodfellow, Bengio, Courville; dark boxes are things to be learned.]

Motivation II: neurons

Motivation: neurons
[Figure of a biological neuron, from Wikipedia]

Motivation: abstract neuron model
• A neuron is activated when the correlation between the input 푥 and a pattern 휃 exceeds some threshold 푏
• 푦 = threshold(휃ᵀ푥 − 푏), or 푦 = 푟(휃ᵀ푥 − 푏)
• 푟(⋅) is called the activation function
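Both variants of the abstract neuron can be sketched directly; the pattern 휃 and threshold 푏 below are made-up values:

```python
import numpy as np

def neuron(x, theta, b, r=None):
    """Abstract neuron: fires when the correlation theta^T x exceeds threshold b.

    r=None gives the hard unit y = 1[theta^T x - b >= 0]; passing an
    activation function r gives the smooth variant y = r(theta^T x - b).
    """
    z = theta @ x - b
    if r is None:
        return float(z >= 0)
    return r(z)

theta = np.array([1.0, 1.0])                       # the pattern to match
print(neuron(np.array([0.8, 0.8]), theta, b=1.0))  # 1.6 - 1.0 >= 0 -> 1.0
print(neuron(np.array([0.2, 0.2]), theta, b=1.0))  # 0.4 - 1.0 < 0  -> 0.0
```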
[Figure: inputs 푥₁, 푥₂, …, 푥_푑 feeding a single unit that outputs 푦]

Motivation: artificial neural networks
• Put into layers: feedforward deep networks
[Figure: input 푥, hidden layers ℎ₁, ℎ₂, …, ℎ_퐿, output 푦]

Components in Feedforward networks

Components
• Representations:
  • Input
  • Hidden variables
• Layers/weights:
  • Hidden layers
  • Output layer

Components
[Figure: input 푥 enters through the first layer; hidden variables ℎ₁, ℎ₂, …, ℎ_퐿 follow; the output layer produces 푦]

Input
• Represented as a vector
• Sometimes requires some preprocessing, e.g.:
  • Subtract the mean
  • Normalize to [−1, 1]
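Both preprocessing steps are one-liners. This sketch normalizes each feature by its maximum absolute value after centering, which is one of several reasonable conventions for mapping into [−1, 1]:

```python
import numpy as np

def preprocess(X):
    """Center each column (subtract mean), then scale it into [-1, 1]."""
    X = X - X.mean(axis=0)         # subtract the per-feature mean
    scale = np.abs(X).max(axis=0)
    scale[scale == 0] = 1.0        # leave constant features untouched
    return X / scale               # now every entry lies in [-1, 1]

X = np.array([[0.0, 10.0],
              [5.0, 20.0],
              [10.0, 60.0]])
Xp = preprocess(X)
print(Xp.min(), Xp.max())  # -1.0 1.0
```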

Output layers
• Regression: 푦 = 푤ᵀℎ + 푏
• Linear units: no nonlinearity

Output layers
• Multi-dimensional regression: 푦 = 푊ᵀℎ + 푏
• Linear units: no nonlinearity

Output layers
• Binary classification: 푦 = 휎(푤ᵀℎ + 푏)
• Corresponds to using logistic regression on ℎ

Output layers
• Multi-class classification: 푦 = softmax(푧) where 푧 = 푊ᵀℎ + 푏
• Corresponds to using multi-class logistic regression on ℎ
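The four output layers differ only in the function applied to the linear score. A sketch with made-up hidden values and weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

h = np.array([0.5, -1.0, 2.0])           # hidden representation (illustrative)
w, b = np.array([1.0, 0.0, 0.5]), 0.1    # weights for the scalar outputs
W, c = np.eye(3), np.zeros(3)            # weights for 3 output dimensions/classes

y_reg   = w @ h + b                      # regression: linear unit
y_multi = W.T @ h + c                    # multi-dimensional regression
y_bin   = sigmoid(w @ h + b)             # binary classification: logistic regression on h
y_soft  = softmax(W.T @ h + c)           # multi-class: softmax over z = W^T h + b

print(y_reg)         # 1.6
print(y_soft.sum())  # the softmax outputs form a probability distribution
```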

Hidden layers
• Each neuron takes a weighted linear combination of the values in the previous layer ℎᵢ
• So it can be thought of as outputting one value for the next layer ℎᵢ₊₁

Hidden layers
• 푦 = 푟(푤ᵀ푥 + 푏)
• Typical activation functions 푟(⋅):
  • Threshold: 푡(푧) = 핀[푧 ≥ 0]
  • Sigmoid: 휎(푧) = 1/(1 + exp(−푧))
  • Tanh: tanh(푧) = 2휎(2푧) − 1
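The three activations above, as code; the tanh identity from the slide is checked against NumPy's own tanh:

```python
import numpy as np

def threshold(z):
    return (z >= 0).astype(float)        # t(z) = 1[z >= 0]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigma(z) = 1 / (1 + exp(-z))

def tanh_via_sigmoid(z):
    return 2.0 * sigmoid(2.0 * z) - 1.0  # tanh(z) = 2*sigma(2z) - 1

z = np.linspace(-2.0, 2.0, 5)            # [-2, -1, 0, 1, 2]
print(threshold(z))                                  # [0. 0. 1. 1. 1.]
print(np.allclose(tanh_via_sigmoid(z), np.tanh(z)))  # True
```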

Hidden layers
• Problem: saturation. For inputs of large magnitude, the sigmoid and tanh curves flatten out, so the gradient becomes too small to drive learning.
[Figure borrowed from Pattern Recognition and Machine Learning, Bishop]
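Saturation is easy to see numerically: the sigmoid's derivative 휎′(푧) = 휎(푧)(1 − 휎(푧)) peaks at 0.25 and collapses for large |푧|. A quick check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma'(z) = sigma(z) * (1 - sigma(z))

print(sigmoid_grad(0.0))   # 0.25, the maximum
print(sigmoid_grad(10.0))  # ~4.5e-05: too small a gradient to drive learning
```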

Hidden layers
• Activation function ReLU (rectified linear unit): ReLU(푧) = max{푧, 0}
• Gradient 1 for 푧 > 0, gradient 0 for 푧 < 0
[Figure from Deep Learning, by Goodfellow, Bengio, Courville]

Hidden layers
• Generalizations of ReLU: gReLU(푧) = max{푧, 0} + 훼 min{푧, 0}
  • Leaky-ReLU(푧) = max{푧, 0} + 0.01 min{푧, 0} (fixed 훼 = 0.01)
  • Parametric-ReLU(푧): 훼 learnable
[Figure: plot of gReLU(푧) versus 푧]
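All three variants are instances of one formula; a minimal sketch:

```python
import numpy as np

def g_relu(z, alpha):
    """Generalized ReLU: gReLU(z) = max{z, 0} + alpha * min{z, 0}."""
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

def relu(z):
    return g_relu(z, alpha=0.0)    # alpha = 0 recovers the plain ReLU

def leaky_relu(z):
    return g_relu(z, alpha=0.01)   # fixed small negative slope

# Parametric-ReLU is the same formula with alpha treated as a learnable parameter.
z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # negatives clipped to 0, positives passed through
print(leaky_relu(z))  # negatives scaled by 0.01 instead of clipped
```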