
Deep Learning Basics Lecture 1: Feedforward
Princeton University COS 495
Instructor: Yingyu Liang

Motivation I: representation learning

Machine learning 1-2-3

• Collect data and extract features
• Build model: choose hypothesis class ℋ and loss function l
• Optimization: minimize the empirical loss (sketched in code below)
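A minimal sketch of this 1-2-3 recipe in Python, assuming a made-up toy dataset, a hypothetical hand-crafted feature map, and a least-squares linear model (none of these specific choices come from the lecture):

    import numpy as np

    # 1. Collect data and extract features (toy, made-up data).
    rng = np.random.RandomState(0)
    X_raw = rng.rand(100, 3)                  # raw inputs
    y = rng.rand(100)                         # regression targets

    def phi(x):
        # hypothetical feature map: the raw values plus their squares
        return np.concatenate([x, x ** 2])

    Phi = np.stack([phi(x) for x in X_raw])   # feature matrix

    # 2. Build model: linear hypothesis y = w^T phi(x) with squared loss.
    # 3. Optimization: minimize the empirical loss (closed-form least squares here).
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    print("empirical squared loss:", np.mean((Phi @ w - y) ** 2))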

Features

[Figure: input x → extract features (color histogram over red, green, blue) → build hypothesis y = w^T φ(x)]

• Features: part of the model

Nonlinear model

[Figure: input x → extract nonlinear features φ(x) → build hypothesis y = w^T φ(x)]

• y = w^T φ(x) is still a linear model, now over the nonlinear features φ(x)

Example: Polynomial kernel SVM

• y = sign(w^T φ(x) + b), with fixed φ(x)

[Figure: inputs x1, x2 feeding through the fixed feature map φ into the classifier]
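As a concrete sketch, scikit-learn's SVC with a polynomial kernel fits this kind of classifier; the toy data below is made up, and the polynomial kernel implicitly plays the role of the fixed feature map φ:

    import numpy as np
    from sklearn.svm import SVC

    # Toy binary classification data (hypothetical): labels depend nonlinearly on x.
    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

    # Polynomial kernel SVM: the kernel corresponds to a fixed feature map phi(x),
    # and the learned classifier is sign(w^T phi(x) + b) in that feature space.
    clf = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0)
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))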

Motivation: representation learning

• Why don’t we also learn φ(x)?

[Figure: x → learn φ(x) → learn w → y = w^T φ(x); both the features φ and the weights w are learned]

Feedforward networks

• View each dimension of φ(x) as something to be learned

[Figure: network computing φ(x) from x, with output y = w^T φ(x)]

Feedforward networks

• Linear functions φ_i(x) = θ_i^T x don’t work: need some nonlinearity (a linear function of a linear function is still linear, so nothing is gained over a plain linear model)

[Figure: network computing φ(x) from x, with output y = w^T φ(x)]

Feedforward networks

• Typically, set φ_i(x) = r(θ_i^T x) where r(⋅) is some nonlinear function

[Figure: one-hidden-layer network: x → φ(x) with φ_i(x) = r(θ_i^T x) → y = w^T φ(x)]
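A minimal numpy sketch of such a one-hidden-layer network, assuming ReLU as the nonlinearity r and random weights standing in for the learned parameters θ_i and w:

    import numpy as np

    def relu(z):
        # assumed choice of nonlinearity r(.); any nonlinear r would do here
        return np.maximum(z, 0.0)

    rng = np.random.RandomState(0)
    d, k = 5, 10                    # input dimension, number of hidden units
    Theta = rng.randn(k, d) * 0.1   # row i holds theta_i
    w = rng.randn(k) * 0.1

    def forward(x):
        phi = relu(Theta @ x)       # learned representation: phi_i(x) = r(theta_i^T x)
        return w @ phi              # linear model on top: y = w^T phi(x)

    print("y =", forward(rng.randn(d)))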

Feedforward deep networks

• What if we go deeper?

[Figure: deep network x → h^1 → h^2 → … → h^L; dark boxes are things to be learned. Figure from Deep Learning, by Goodfellow, Bengio, Courville.]
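A hedged sketch of the deeper version, assuming each layer computes h^l = r(W_l^T h^{l-1} + b_l); the layer sizes, random weights, and choice of r below are illustrative assumptions only:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)

    def deep_forward(x, weights, biases):
        # Forward pass of a deep feedforward net: h^l = r(W_l^T h^{l-1} + b_l).
        h = x
        for W, b in zip(weights, biases):
            h = relu(W.T @ h + b)
        return h

    # Hypothetical layer sizes: input 5 -> hidden layers of 8 and 8 -> top layer 4.
    rng = np.random.RandomState(1)
    sizes = [5, 8, 8, 4]
    weights = [rng.randn(m, n) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]

    print("top representation h^L:", deep_forward(rng.randn(5), weights, biases))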

Motivation II: neurons

Motivation: neurons

[Figure: a biological neuron, from Wikipedia]

Motivation: abstract neuron model

• Neuron activated when the correlation between the input and a pattern θ exceeds some threshold b
• y = threshold(θ^T x − b), or y = r(θ^T x − b)
• r(⋅) is called the activation function

[Figure: inputs x1, x2, …, xd feeding into a single neuron with output y]
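A tiny sketch of this abstract neuron, assuming the sigmoid as r (the hard threshold is noted in a comment); the example inputs and pattern are made up:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, theta, b, r=sigmoid):
        # Abstract neuron: output r(theta^T x - b).
        # Using r = lambda z: np.heaviside(z, 1.0) would give the hard threshold unit.
        return r(theta @ x - b)

    x = np.array([0.5, -1.0, 2.0])
    theta = np.array([1.0, 0.0, 0.5])
    print(neuron(x, theta, b=1.0))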

Motivation: artificial neural networks

• Put into layers: feedforward deep networks

[Figure: layered network x → h^1 → h^2 → … → h^L]

Components in Feedforward networks

Components

• Representations:
  • Input
  • Hidden variables
• Layers/weights:
  • Hidden layers
  • Output

[Figure: input x → first layer → hidden variables h^1, h^2, …, h^L → output layer]

Input

• Represented as a vector
• Sometimes requires some preprocessing (see the sketch below), e.g.,
  • Subtract mean
  • Normalize to [-1, 1]
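A small sketch of this kind of preprocessing, assuming per-feature mean subtraction followed by scaling into [-1, 1]; the toy matrix is made up:

    import numpy as np

    def preprocess(X):
        # Subtract the per-feature mean, then scale each feature into [-1, 1].
        X = X - X.mean(axis=0)
        max_abs = np.abs(X).max(axis=0) + 1e-12   # avoid division by zero
        return X / max_abs

    X = np.array([[0.0, 10.0],
                  [2.0, 30.0],
                  [4.0, 50.0]])
    print(preprocess(X))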

Output layers

• Regression: y = w^T h + b
• Linear units: no nonlinearity

[Figure: hidden representation h feeding into the output layer]

Output layers

• Multi-dimensional regression: y = W^T h + b
• Linear units: no nonlinearity

Output layers

• Binary classification: y = σ(w^T h + b)
• Corresponds to using logistic regression on h

Output layers

• Multi-class classification:
  • y = softmax(z) where z = W^T h + b
  • Corresponds to using multi-class logistic regression on h
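The output layers above can be sketched directly in numpy; the shapes and the random hidden vector h below are illustrative assumptions:

    import numpy as np

    def linear_head(h, W, b):
        # Regression output layer: linear units, no nonlinearity.
        return W.T @ h + b

    def sigmoid_head(h, w, b):
        # Binary classification: sigma(w^T h + b), i.e. logistic regression on h.
        return 1.0 / (1.0 + np.exp(-(w @ h + b)))

    def softmax_head(h, W, b):
        # Multi-class classification: softmax(z) with z = W^T h + b.
        z = W.T @ h + b
        e = np.exp(z - z.max())      # shift for numerical stability
        return e / e.sum()

    rng = np.random.RandomState(0)
    h = rng.randn(4)                 # hidden representation from the last hidden layer
    W, b = rng.randn(4, 3), np.zeros(3)
    p = softmax_head(h, W, b)
    print("class probabilities:", p, "sum:", p.sum())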

Hidden layers

• Each neuron takes a weighted linear combination of the previous layer
• So it can be thought of as outputting one value for the next layer

[Figure: layer h^i feeding into layer h^{i+1}]

Hidden layers

• y = r(w^T x + b)
• Typical activation functions r(⋅):
  • Threshold: t(z) = 𝕀[z ≥ 0]
  • Sigmoid: σ(z) = 1/(1 + exp(−z))
  • Tanh: tanh(z) = 2σ(2z) − 1
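These activation functions are one-liners in numpy; the snippet also checks the identity tanh(z) = 2σ(2z) − 1 on a few made-up points:

    import numpy as np

    def threshold(z):
        # t(z) = 1[z >= 0]
        return (z >= 0).astype(float)

    def sigmoid(z):
        # sigma(z) = 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    z = np.linspace(-3.0, 3.0, 7)
    print("threshold:", threshold(z))
    print("tanh(z) == 2*sigmoid(2z) - 1 ?", np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))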

Hidden layers

• Problem: saturation
• In the flat regions of r(⋅), the gradient is too small

[Figure: a saturating activation function; borrowed from Pattern Recognition and Machine Learning, Bishop]
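A quick numerical illustration of saturation using the sigmoid: its derivative σ(z)(1 − σ(z)) becomes vanishingly small once |z| is moderately large:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        # derivative of the sigmoid: sigma(z) * (1 - sigma(z))
        s = sigmoid(z)
        return s * (1.0 - s)

    for z in [0.0, 2.0, 5.0, 10.0]:
        print(f"z = {z:5.1f}   sigmoid'(z) = {sigmoid_grad(z):.2e}")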

Hidden layers

• Activation function ReLU (rectified linear unit)
• ReLU(z) = max{z, 0}

[Figure: the ReLU activation; from Deep Learning, by Goodfellow, Bengio, Courville]

Hidden layers

• Activation function ReLU (rectified linear unit)
• ReLU(z) = max{z, 0}
• Gradient 1 for z > 0, gradient 0 for z < 0

Hidden layers

• Generalizations of ReLU: gReLU(z) = max{z, 0} + α min{z, 0}
  • Leaky-ReLU(z) = max{z, 0} + 0.01 min{z, 0}
  • Parametric-ReLU(z): α learnable

[Figure: gReLU(z)]
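A short sketch of the ReLU family via the gReLU formula: plain ReLU is the special case α = 0, Leaky-ReLU fixes α = 0.01, and Parametric-ReLU uses the same formula with α treated as a learnable parameter:

    import numpy as np

    def grelu(z, alpha):
        # Generalized ReLU: max{z, 0} + alpha * min{z, 0}
        return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

    def relu(z):
        return grelu(z, alpha=0.0)      # alpha = 0 recovers plain ReLU

    def leaky_relu(z):
        return grelu(z, alpha=0.01)     # small fixed slope on the negative side

    # Parametric-ReLU: same formula, but alpha is learned along with the weights.
    z = np.array([-2.0, -0.5, 0.0, 1.5])
    print("ReLU:      ", relu(z))
    print("Leaky ReLU:", leaky_relu(z))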