CSE 152: Computer Vision Manmohan Chandraker
Lecture 15: Optimization in CNNs

Recap: Engineered versus learned features
Convolutional filters are trained in a supervised manner by back-propagating classification error.
[Figure: a hand-engineered pipeline (Image → Feature extraction → Pooling → Classifier → Label) versus a learned pipeline (Image → stacked Convolution + pool layers → Dense layers → Label)]
Jia-Bin Huang and Derek Hoiem, UIUC

Two-layer perceptron network
Slide credit: Pieter Abbeel and Dan Klein

Neural networks
Non-linearity
Activation functions
Multi-layer neural network

From fully connected to convolutional networks
Convolutional layer
[Figure: a convolutional layer connects each unit in the next layer to a local patch of the image]
Slide: Lazebnik

Spatial filtering is convolution

Convolutional Neural Networks
[Slides credit: Efstratios Gavves]
2D spatial filters
Filters over the whole image
Weight sharing
Insight: Images have similar features at various spatial locations!

Key operations in a CNN
[Figure: CNN pipeline: Input Image → Convolution (learned) → Non-linearity → Spatial pooling → Feature maps. Source: R. Fergus, Y. LeCun; Slide: Lazebnik]

Convolution as a feature extractor
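The convolution stage above can be sketched in a few lines of NumPy. This is a minimal illustration (the image, the kernel values, and the "valid" sliding-window loop are made up for the example), not how a real framework implements it:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation (what deep-learning libraries call 'convolution')."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Same learned weights applied at every location: weight sharing.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])
# A tiny vertical-edge-like filter, shared across all spatial locations.
kernel = np.array([[1., -1.],
                   [1., -1.]])
response = conv2d(image, kernel)
```

One small kernel produces an entire response map, which is why convolutional layers need far fewer parameters than fully connected ones.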
Non-linearity: Rectified Linear Unit (ReLU)
[Figure: the same CNN pipeline with the non-linearity stage highlighted]
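The ReLU non-linearity is simple enough to state exactly; the feature-map values below are made up for illustration:

```python
import numpy as np

# ReLU: the standard CNN non-linearity, applied elementwise to a feature map.
def relu(x):
    return np.maximum(0.0, x)

fmap = np.array([[-1.0, 2.0],
                 [0.5, -3.0]])
activated = relu(fmap)   # negatives are clamped to zero, positives pass through
```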
Source: R. Fergus, Y. LeCun; Slide: Lazebnik
Spatial pooling: max pooling
[Figure: the same CNN pipeline with the spatial pooling stage highlighted]
Source: R. Fergus, Y. LeCun; Slide: Lazebnik

Pooling operations
• Aggregate multiple values into a single value
• Invariance to small transformations
• Keep only the most important information for the next layer
• Reduces the size of the next layer
  • Fewer parameters, faster computation
• Observe a larger receptive field in the next layer
  • Hierarchically extract more abstract features
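A minimal sketch of 2x2 max pooling with stride 2 (the input values are made up); each output keeps only the strongest response in its window, shrinking the map and discarding the rest:

```python
import numpy as np

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: aggregate each size x size window into one value."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 5.],
                 [0., 1., 3., 2.],
                 [2., 6., 1., 1.]])
pooled = max_pool(fmap)   # 4x4 map reduced to 2x2
```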
1 x 1 convolutions
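As a quick sketch (shapes and values made up for illustration): a 1 x 1 convolution is a per-pixel dot product across the input channels, which is the same thing as a matrix multiply over the channel axis:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4, 4))   # feature map: 3 input channels, 4x4 spatial grid
w = rng.normal(size=(2, 3))      # 1x1 kernels: 2 output channels, 3 input channels

# A 1x1 convolution: at every pixel, a dot product across channels...
y = np.einsum('oc,chw->ohw', w, x)

# ...which is exactly a matrix multiply on the flattened spatial grid.
y_flat = (w @ x.reshape(3, -1)).reshape(2, 4, 4)
```

Here the channel count drops from 3 to 2 while the spatial grid is untouched, which is why 1x1 layers are used to reduce feature dimensions cheaply.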
1 x 1 convolution layers are also possible: at every spatial location, a 1 x 1 filter computes a dot product across the input channels, so the layer mixes (and can reduce the number of) channels without using any spatial context.

Types of Neural Networks
Convolutional Neural Networks

Optimization in CNNs

Learning w
§ Training examples
§ Objective: a misclassification loss
§ Procedure: gradient descent or hill climbing
Slide credit: Pieter Abbeel and Dan Klein

A 3-layer network for digit recognition
MNIST dataset

Cost function
• The network tries to approximate the function y(x); its output is a.
• We use a quadratic cost function, also called MSE or "L2 loss":
  C = (1/2n) Σ_x || y(x) − a ||²

Gradient descent

Stochastic gradient descent
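The quadratic cost can be written out directly. This helper and its toy inputs are just an illustration of the formula, assuming y and a are arrays of targets and network outputs:

```python
import numpy as np

# Quadratic ("L2") cost over n samples: C = (1/2n) * sum_x || y(x) - a(x) ||^2
def quadratic_cost(y, a):
    n = len(y)
    return np.sum((y - a) ** 2) / (2 * n)

targets = np.array([1.0, 0.0])
outputs = np.array([0.0, 0.0])        # a deliberately wrong prediction
cost = quadratic_cost(targets, outputs)
```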
Update rules for each parameter:
  w_k → w_k′ = w_k − η ∂C/∂w_k
  b_l → b_l′ = b_l − η ∂C/∂b_l

Cost function is a sum over all the training samples:
  C = (1/n) Σ_x C_x

Gradient from the entire training set:
  ∇C = (1/n) Σ_x ∇C_x
Usually, n is very large.

Stochastic gradient descent
Gradient from the entire training set:
  ∇C = (1/n) Σ_x ∇C_x
• For large training data, gradient computation takes a long time
• Leads to "slow learning"
• Instead, consider a mini-batch of m samples drawn from the training set:
  ∇C ≈ (1/m) Σ_j ∇C_{x_j}
• If the mini-batch is large enough, its statistics approximate those of the full dataset

Stochastic gradient descent
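A minimal mini-batch SGD sketch on a one-parameter least-squares problem; the data, learning rate, and batch size are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=1000)
Y = 3.0 * X + 0.01 * rng.normal(size=1000)   # targets: y = 3x + small noise

w = 0.0          # parameter to learn
lr = 0.5         # learning rate (eta)
m = 32           # mini-batch size

for epoch in range(20):
    order = rng.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), m):
        batch = order[start:start + m]
        # Gradient of the quadratic cost over the mini-batch:
        # d/dw (1/2m) sum (w*x - y)^2  =  (1/m) sum (w*x - y) * x
        grad = np.mean((w * X[batch] - Y[batch]) * X[batch])
        w -= lr * grad
```

Each update uses only about 3% of the data, yet the estimate still converges near the true slope of 3.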
Build up velocity as a running mean of gradients:
  v → v′ = μv − η∇C,   w → w′ = w + v′

Layer-to-layer relationship

Cost and gradient computation

Chain rule of differentiation
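The momentum update mentioned above (build up velocity as a running mean of gradients) can be sketched on a toy objective; the quadratic f(w) = 0.5 w² and all hyperparameters here are made up for illustration:

```python
# Momentum SGD on f(w) = 0.5 * w**2, whose gradient is simply w.
w, v = 5.0, 0.0
lr, mu = 0.1, 0.9              # learning rate and momentum coefficient
for _ in range(300):
    grad = w                   # gradient of 0.5 * w**2
    v = mu * v - lr * grad     # velocity: decayed running mean of past gradients
    w = w + v                  # parameter moves along the accumulated velocity
```

The velocity smooths successive gradients, which damps oscillation and speeds progress along consistent descent directions.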
This is all you need to know to get the gradients in a neural network!

Backpropagation
Backpropagation: an application of the chain rule in a certain order, taking advantage of forward propagation to efficiently compute gradients.

Backpropagation example
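The kind of worked example these slides step through can be reproduced numerically with a standard small circuit: an add gate feeding a mul gate, f(x, y, z) = (x + y) · z:

```python
# Worked backprop example: f(x, y, z) = (x + y) * z at x=-2, y=5, z=-4.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute and cache the intermediate values.
q = x + y                # add gate output: 3
f = q * z                # mul gate output: -12

# Backward pass: apply the chain rule in reverse, reusing cached values.
df_dq = z                # mul gate: each input's gradient is the OTHER input's value
df_dz = q
df_dx = df_dq * 1.0      # add gate: passes the incoming gradient through unchanged
df_dy = df_dq * 1.0
```

Note the patterns the slides name: the add gate distributes its upstream gradient equally, while the mul gate swaps in the other input's value.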
[Slides credit: Fei-Fei Li]
Add gate: gradient distributor
Mul gate: gradient switcher

Patterns in backpropagation

The convolutional layer is differentiable

Max pooling
Average pooling

Transfer Learning in CNNs

Transfer Learning
• Improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned.
• Weight initialization for CNN
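A minimal sketch of the feature-extractor flavor of transfer learning. Everything here is made up for illustration: the frozen random projection stands in for a backbone pre-trained on a source task, and the toy labels are constructed to be learnable from its features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "pre-trained" backbone: a frozen ReLU layer whose weights never change.
W_frozen = rng.normal(size=(8, 4))

def backbone(x):
    return np.maximum(0.0, x @ W_frozen.T)

# Target task: 200 samples; labels defined via a hidden head on the features.
X = rng.normal(size=(200, 4))
F = backbone(X)                      # features computed once: the backbone is frozen
w_target = rng.normal(size=8)        # hidden "true" head defining toy labels
y = (F @ w_target > 0).astype(float)

# Train ONLY a fresh logistic-regression head on top of the frozen features.
w_head = np.zeros(8)
lr = 0.05

def loss(w):
    p = 1.0 / (1.0 + np.exp(-(F @ w)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

loss_before = loss(w_head)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(F @ w_head)))
    w_head -= lr * F.T @ (p - y) / len(y)   # gradient step on the head only
loss_after = loss(w_head)
```

Only the small head is updated, which is exactly why transfer learning works with little target data: the expensive representation is inherited, not retrained.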
Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks [Oquab et al., CVPR 2014]. Slide: Jia-Bin Huang

Transfer Learning
CNNs are good at transfer learning:
• Initializing h_T with h_S
• Fine-tune h_T using h_S as initialization
• Use h_S as a feature extractor for h_T
• The strategy for fine-tuning depends on the amount of data available
Transfer learning is a common choice.

CNN Architectures

AlexNet architecture

Architectural details of AlexNet
• Similar framework to LeCun 1998, but:
  • Bigger model (7 hidden layers, 650K units, 60M parameters)
  • More data (10^6 images instead of 10^3)
  • GPU implementation (50× speedup over CPU)

Ablations: removing layer 7; removing layers 6 and 7; removing layers 3 and 4; removing layers 3, 4, 6 and 7

VGGNet architecture
• Much more accurate:
  • AlexNet: 18.2% top-5 error
  • VGGNet: 6.8% top-5 error
• More than twice as many layers
• Filters are much smaller: stacks of 3x3 convolutions, so two layers cover a 5x5 receptive field with 2·(3·3·C²) weights instead of 5·5·C²
• Harder and slower to train

ResNet: going real deep
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016

Bigger is not always better: architectural innovations typically reduce parameters even as networks grow deeper.

Simply stacking layers

Deep residual learning
Plain Net
• Simple design
• Use only 3x3 conv (like VGG)
• No hidden FC layers

Deep residual learning

Key ideas for CNN architectures
• Convolutional layers
  – Same local functions evaluated everywhere
  – Far fewer parameters
• Pooling
  – Larger receptive field
• ReLU
  – Maintains a gradient signal over a large portion of its domain
• Limit parameters
  – Sequences of 3x3 filters instead of large filters
  – 1x1 convolutions to reduce feature dimensions
• Skip connections
  – Easier optimization with greater depth
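A minimal sketch of the residual idea, out = relu(F(x) + x); small linear layers stand in for the 3x3 convolutions, and all shapes and weights here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Residual branch F: two small linear layers standing in for 3x3 convs.
W1 = rng.normal(scale=0.1, size=(8, 8))
W2 = rng.normal(scale=0.1, size=(8, 8))

def residual_block(x):
    Fx = W2 @ relu(W1 @ x)   # residual branch F(x)
    return relu(Fx + x)      # identity skip connection

x = relu(rng.normal(size=8))     # a non-negative activation from a previous layer
y = residual_block(x)

# Key property: if the residual branch outputs zero, the block reduces to the
# identity on non-negative inputs, so extra depth cannot hurt in principle.
W1[:, :] = 0.0
W2[:, :] = 0.0
identity_out = residual_block(x)
```

Because the block only has to learn the residual F(x), gradients also flow directly through the skip path, which is what makes very deep networks trainable.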