
CSE 152: Vision
Manmohan Chandraker

Lecture 15: Optimization in CNNs

Recap: Engineered versus learned features

Convolutional filters are trained in a supervised manner by back-propagating the classification error.

[Figure: hand-engineered pipeline (Image → Feature extraction → Pooling → Classifier → Label) versus learned CNN pipeline (Image → stacked Convolution + pool layers → Dense layers → Label). Jia-Bin Huang and Derek Hoiem, UIUC]

Two-layer network
Slide credit: Pieter Abbeel and Dan Klein

Neural networks

Non-linearity

Activation functions

Multi-layer neural network

From fully connected to convolutional networks

Convolutional layer
[Figure: connections from the image to the next layer]

Slide: Lazebnik

Spatial filtering is convolution

Convolutional Neural Networks

[Slides credit: Efstratios Gavves]

2D spatial filters

Filters over the whole image

Weight sharing

Insight: Images have similar features at various spatial locations!

Key operations in a CNN

[Figure: Input Image → Convolution (Learned) → Non-linearity → Spatial pooling → Feature maps. Source: R. Fergus, Y. LeCun; Slide: Lazebnik]

Convolution as a feature extractor
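To make "convolution as a feature extractor" concrete, here is a minimal NumPy sketch of sliding one filter over an image; the function name, image size, and the Sobel-like example filter are illustrative assumptions, not the lecture's code. The same 3x3 weights are reused at every spatial location, which is exactly the weight sharing noted above.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a single 2D filter over the image (cross-correlation, as in CNNs)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))          # "valid" output size
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same shared weights are applied at every spatial location
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.rand(6, 6)
edge_filter = np.array([[1., 0., -1.],
                        [2., 0., -2.],
                        [1., 0., -1.]])               # Sobel-like vertical-edge filter
feature_map = conv2d(image, edge_filter)
print(feature_map.shape)                              # (4, 4) output feature map
```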

Key operations in a CNN

Non-linearity: Rectified Linear Unit (ReLU)

[Figure: Input Image → Convolution (Learned) → Non-linearity (ReLU) → Spatial pooling → Feature maps. Source: R. Fergus, Y. LeCun; Slide: Lazebnik]
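A one-line sketch of the ReLU non-linearity applied elementwise to a feature map (the values are made up for illustration):

```python
import numpy as np

feature_map = np.array([[-1.5,  2.0],
                        [ 0.3, -0.7]])
print(np.maximum(0.0, feature_map))   # ReLU: negative responses are clipped to zero
```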

Key operations in a CNN

Spatial pooling: Max

[Figure: Input Image → Convolution (Learned) → Non-linearity → Spatial pooling (Max) → Feature maps. Source: R. Fergus, Y. LeCun; Slide: Lazebnik]

Pooling operations
• Aggregate multiple values into a single value
• Invariance to small transformations
• Keep only the most important information for the next layer
• Reduces the size of the next layer
• Fewer parameters, faster computation
• Observe a larger receptive field in the next layer
• Hierarchically extract more abstract features
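A minimal NumPy sketch of 2x2 max pooling with stride 2, illustrating the aggregation and size reduction listed above (the function name and input values are illustrative assumptions):

```python
import numpy as np

def max_pool_2x2(fmap):
    """Aggregate each non-overlapping 2x2 block into its maximum value."""
    H, W = fmap.shape
    trimmed = fmap[:H // 2 * 2, :W // 2 * 2]                     # drop odd rows/cols if any
    return trimmed.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))   # 2x2 output: only the strongest response per block survives
```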

Key operations in a CNN

[Figure: Input Image → Convolution (Learned) → Non-linearity → Spatial pooling → Feature maps. Source: R. Fergus, Y. LeCun; Slide: Lazebnik]

Convolution as a feature extractor: 1 x 1

1 x 1 convolution layers are also possible, equivalent to a dot product across feature channels at each spatial location.
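A minimal NumPy sketch of this equivalence: a 1 x 1 convolution is a dot product across channels at every pixel, i.e. a matrix multiply over the channel dimension (shapes and names are illustrative assumptions):

```python
import numpy as np

C_in, C_out, H, W = 64, 16, 8, 8
fmap = np.random.randn(C_in, H, W)        # input feature map (channels, height, width)
weights = np.random.randn(C_out, C_in)    # one C_in-dimensional filter per output channel

out = np.einsum('oc,chw->ohw', weights, fmap)                     # 1x1 convolution
out2 = (weights @ fmap.reshape(C_in, -1)).reshape(C_out, H, W)    # same thing as a matmul
assert np.allclose(out, out2)
print(out.shape)                                                  # (16, 8, 8)
```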

Types of Neural Networks

Convolutional Neural Networks

Optimization in CNNs

Learning w
§ Training examples
§ Objective: a misclassification loss
§ Procedure: gradient descent or stochastic gradient descent

Slide credit: Pieter Abbeel and Dan Klein

A 3-layer network for digit recognition

MNIST dataset

Cost

• The network tries to approximate the function y(x); its output is denoted a.
• We use a quadratic cost function, also known as MSE or "L2 loss".
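One standard way to write this quadratic cost, assuming n training inputs x with desired output y(x) and network output a (the exact normalization on the slide may differ):

$$C(w, b) = \frac{1}{2n} \sum_{x} \lVert y(x) - a \rVert^{2}$$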

Stochastic gradient descent

Update rules for each parameter:
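The update rules referred to here, in the standard gradient-descent form with learning rate η (assuming the cost C and parameters w, b from the slide above):

$$w \leftarrow w - \eta \,\frac{\partial C}{\partial w}, \qquad b \leftarrow b - \eta \,\frac{\partial C}{\partial b}$$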

Cost function is a sum over all the training samples:
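In the notation above, with per-sample cost C_x, this reads (standard form):

$$C = \frac{1}{n} \sum_{x} C_x, \qquad \nabla C = \frac{1}{n} \sum_{x} \nabla C_x$$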

Gradient from entire training set:

Usually, n is very large.

Stochastic gradient descent

Gradient from entire training set:

• For large training data, gradient computation takes a long time
• Leads to "slow learning"

• Instead, consider a mini-batch with m samples
• If the sample size m is large enough, its properties approximate those of the full dataset
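A minimal Python sketch of one epoch of mini-batch SGD as described above; the names (grad_fn, lr, batch_size) and the shuffling strategy are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def sgd_epoch(w, X, Y, grad_fn, lr=0.1, batch_size=32):
    """grad_fn(w, X_batch, Y_batch) returns the cost gradient averaged over the batch."""
    idx = np.random.permutation(len(X))            # visit training samples in random order
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]      # mini-batch of m samples
        g = grad_fn(w, X[batch], Y[batch])         # gradient estimate from the mini-batch
        w = w - lr * g                             # gradient descent step
    return w
```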

Build up velocity as a running mean of gradients (momentum).
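A minimal sketch of that velocity update in the standard momentum form (the coefficient values are assumptions, not the lecture's numbers):

```python
def sgd_momentum_step(w, v, g, lr=0.1, mu=0.9):
    """One SGD-with-momentum update; v is a running mean of past gradients g."""
    v = mu * v - lr * g   # build up velocity from the gradient history
    w = w + v             # step along the velocity instead of the raw gradient
    return w, v
```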

Layer to layer relationship

Cost and gradient computation

Chain rule of differentiation

This is all you need to know to get the gradients in a neural network!


Backpropagation: application of the chain rule in a particular order, taking advantage of values computed during forward propagation to efficiently compute gradients.

Backpropagation example

[Slides credit: Fei-Fei Li]
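A worked example in the spirit of these slides, using the common circuit f(x, y, z) = (x + y) * z; the specific function and input values are assumptions, not necessarily the slides' exact example:

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass (values are stored so the backward pass can reuse them)
q = x + y          # add gate:      q = 3
f = q * z          # multiply gate: f = -12

# Backward pass: apply the chain rule from the output back to the inputs
df_dq = z          # mul gate: each input's gradient is the *other* input ("switcher")
df_dz = q
dq_dx = 1.0        # add gate: passes the upstream gradient through unchanged ("distributor")
dq_dy = 1.0
df_dx = df_dq * dq_dx   # = -4.0
df_dy = df_dq * dq_dy   # = -4.0

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```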

Add gate: gradient distributor
Mul gate: gradient switcher

Patterns in backpropagation

Convolutional layer is differentiable

Max Pooling

Average Pooling in CNNs
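A minimal sketch of how an upstream gradient flows back through a single pooling window, contrasting max and average pooling (values are illustrative; this shows the standard rule, not code from the lecture):

```python
import numpy as np

window = np.array([[1., 3.],
                   [2., 0.]])
upstream_grad = 1.0

# Max pooling backward: the gradient is routed entirely to the argmax element
grad_max = (window == window.max()) * upstream_grad
# Average pooling backward: the gradient is shared equally by all elements
grad_avg = np.full_like(window, upstream_grad / window.size)

print(grad_max)   # [[0. 1.] [0. 0.]]
print(grad_avg)   # [[0.25 0.25] [0.25 0.25]]
```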

Transfer Learning

• Improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned.

• Weight initialization for CNN

Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks [Oquab et al., CVPR 2014]
Slide: Jia-Bin Huang

Transfer Learning

CNNs are good at transfer learning

Fine-tune hT using hS as initialization

Initializing hT with hS

Strategy for fine-tuning vs. amount of data needed

Use hS as a feature extractor for hT

Transfer learning is a common choice
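A minimal PyTorch-style sketch of the two strategies above: freeze hS and use it as a feature extractor, or use hS as the initialization and fine-tune. The backbone choice (resnet18), the weights argument, and the number of target classes are illustrative assumptions, not the lecture's code.

```python
import torch.nn as nn
import torchvision.models as models

# Source model hS: a CNN pretrained on a large dataset (ImageNet here).
model = models.resnet18(weights="IMAGENET1K_V1")  # argument form varies by torchvision version

# Strategy 1: use hS as a fixed feature extractor -- freeze all pretrained weights...
for param in model.parameters():
    param.requires_grad = False

# ...and train only a new classification head for the target task hT.
num_target_classes = 10  # hypothetical target task
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

# Strategy 2 (more target data available): initialize with hS and fine-tune,
# i.e. leave requires_grad=True and train all (or the top few) layers with a small learning rate.
```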

CNN Architectures

AlexNet architecture

Architectural details of AlexNet
• Similar framework to LeCun 1998, but:
  • Bigger model (7 hidden layers, 650k units, 60M parameters)
  • More data (10^6 images instead of 10^3 images)
  • GPU implementation (50 times speedup over CPU)

Removing layer 7

Removing layers 6 and 7

Removing layers 3 and 4

Removing layers 3, 4, 6 and 7

VGGNet architecture

• Much more accurate
  • AlexNet: 18.2% top-5 error
  • VGGNet: 6.8% top-5 error

• More than twice as many layers

• Filters are much smaller

• Harder and slower to train

ResNet: going real deep

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016

Bigger is not better: architectural innovations typically reduce the number of parameters, even as networks get deeper.

Simply stacking layers

Deep residual learning

Plain Net

• Simple design
• Use only 3x3 conv (like VGG)
• No hidden FC layers

Deep residual learning

Key ideas for CNN architectures
• Convolutional layers
  – Same local functions evaluated everywhere
  – Far fewer parameters
• Pooling
  – Larger receptive field
• ReLU
  – Maintains a gradient over a large portion of its domain
• Limit parameters
  – Sequence of 3x3 filters instead of large filters
  – 1x1 convolutions to reduce feature dimensions
• Skip connections
  – Easier optimization with greater depth
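To make the skip-connection idea concrete, here is a minimal PyTorch-style sketch of a basic residual block computing F(x) + x; the layer sizes and use of batch norm are assumptions, and this is not the exact block from the ResNet paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x via an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions, as in the VGG-style "plain" design
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)   # skip connection: add the input back

x = torch.randn(1, 64, 32, 32)
block = ResidualBlock(64)
print(block(x).shape)            # torch.Size([1, 64, 32, 32])
```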