CSE 152: Computer Vision Manmohan Chandraker

Lecture 15: Optimization in CNNs

Recap

Engineered versus learned features
• Convolutional filters are trained in a supervised manner by back-propagating classification error
[Figure: an engineered pipeline (Image, Feature extraction, Pooling, Classifier, Label) beside a learned pipeline (Image, five Convolution + pool stages, three Dense layers, Label)]
Slide credit: Jia-Bin Huang and Derek Hoiem, UIUC

Two-layer perceptron network
Slide credit: Pieter Abbeel and Dan Klein

Neural networks

Non-linearity

Activation functions

Multi-layer neural network

From fully connected to convolutional networks
[Figure: image, convolutional layer, next layer]
Slide: Lazebnik

Spatial filtering is convolution

Convolutional Neural Networks
[Slides credit: Efstratios Gavves]

2D spatial filters

Filters over the whole image

Weight sharing
Insight: Images have similar features at various spatial locations!

Key operations in a CNN
[Figure: Input Image, Convolution (Learned), Non-linearity, Spatial pooling, Feature maps]
• Convolution (learned): maps an input to a feature map
• Non-linearity: Rectified Linear Unit (ReLU)
• Spatial pooling: max
Source: R. Fergus, Y. LeCun. Slide: Lazebnik

Convolution as a feature extractor

Pooling operations
• Aggregate multiple values into a single value
• Invariance to small transformations
• Keep only most important information for next layer
• Reduces the size of the next layer
• Fewer parameters, faster computations
• Observe larger receptive field in next layer
• Hierarchically extract more abstract features

1 x 1 convolutions
1 x 1 convolution layers are also possible; at each spatial location they are equivalent to a dot product across channels.

Types of Neural Networks
• Convolutional Neural Networks

Optimization in CNNs

Learning w
• Training examples
• Objective: a misclassification loss
• Procedure: gradient descent or hill climbing
Slide credit: Pieter Abbeel and Dan Klein

A 3-layer network for digit recognition
MNIST dataset

Cost function
• The network tries to approximate the function y(x) and its output is a
• We use a quadratic cost function, or MSE, or “L2-loss”.

Gradient descent

Stochastic gradient descent
• Update rules for each parameter: [equation]
• Cost function is a sum over all the training samples: [equation]
• Gradient from entire training set: [equation]
• Usually, n is very large.
• For large training data, gradient computation takes a long time
• Leads to “slow learning”
• Instead, consider a mini-batch with m samples
• If sample size is large enough, properties approximate the dataset
• Build up velocity as a running mean of gradients.

Layer to layer relationship

Cost and gradient computation

Chain rule of differentiation
This is all you need to know to get the gradients in a neural network!

Backpropagation
Backpropagation: application of the chain rule in a certain order, taking advantage of forward propagation to efficiently compute gradients.
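The pieces above (quadratic cost, chain rule, mini-batch gradients, velocity) can be tied together in a small sketch. The slide equations did not survive in this text, so everything here is an assumption chosen for illustration: a toy network with one scalar weight per layer, a per-sample cost of 0.5 * (a - y)^2, and one common form of the momentum update (v = mu * v - lr * grad, then w = w + v). All function and parameter names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass; returns the hidden and output activations."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return a1, a2

def cost(params, x, y):
    """Quadratic cost for one sample (assumed form: 0.5 * (a - y)^2)."""
    _, a2 = forward(params, x)
    return 0.5 * (a2 - y) ** 2

def backprop(params, x, y):
    """Gradient of the cost w.r.t. (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    a1, a2 = forward(params, x)             # reuse forward-pass values
    delta2 = (a2 - y) * a2 * (1.0 - a2)     # dC/dz2 at the output layer
    delta1 = delta2 * w2 * a1 * (1.0 - a1)  # dC/dz1, propagated backwards
    return [delta1 * x, delta1, delta2 * a1, delta2]

def minibatch_gradient(params, batch):
    """Average gradient over a mini-batch of (x, y) pairs."""
    grads = [0.0] * 4
    for x, y in batch:
        g = backprop(params, x, y)
        grads = [acc + gi / len(batch) for acc, gi in zip(grads, g)]
    return grads

def mean_cost(params, data):
    return sum(cost(params, x, y) for x, y in data) / len(data)

def sgd_momentum(params, data, lr=0.1, mu=0.9, m=8, epochs=300, seed=0):
    """Mini-batch SGD with momentum; m is the mini-batch size."""
    rng = random.Random(seed)
    params = list(params)
    velocity = [0.0] * len(params)
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), m):
            batch = shuffled[start:start + m]
            grads = minibatch_gradient(params, batch)
            # Velocity accumulates a decaying sum of past gradients.
            velocity = [mu * v - lr * g for v, g in zip(velocity, grads)]
            params = [p + v for p, v in zip(params, velocity)]
    return params

# Toy data (hypothetical): the target is 1 when x > 0 and 0 otherwise.
data = [(x / 10.0, 1.0 if x > 0 else 0.0) for x in range(-20, 21) if x != 0]
initial = [0.5, 0.0, 0.5, 0.0]
trained = sgd_momentum(initial, data)
print(mean_cost(initial, data), mean_cost(trained, data))
```

Each mini-batch gradient is only an estimate of the full-dataset gradient, which is why the velocity term is useful: it smooths the noisy per-batch estimates. A quick sanity check for any hand-written backward pass is to compare it against finite differences of the cost.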
Backpropagation example
[Slides credit: Fei-Fei Li]

Patterns in backpropagation
• Add gate: gradient distributor
• Mul gate: gradient switcher

Convolutional layer is differentiable

Max Pooling

Average Pooling

Transfer Learning in CNNs

Transfer Learning
• Improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned.
• Weight initialization for CNN
Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks [Oquab et al. CVPR 2014]
Slide: Jia-Bin Huang

CNNs are good at transfer learning

Fine-tune hT using hS as initialization

Initializing hT with hS

Strategy for fine-tuning
[Figure: axis label "Amount of data needed"]

Use hS as a feature extractor for hT

Transfer learning is a common choice

CNN Architectures

AlexNet architecture

Architectural details of AlexNet
• Similar framework to LeCun 1998 but:
• Bigger model (7 hidden layers, 650k units, 60M parameters)
• More data (10^6 images instead of 10^3 images)
• GPU implementation (50 times speedup over CPU)

Removing layer 7
Removing layers 6 and 7
Removing layers 3 and 4
Removing layers 3, 4, 6 and 7

VGGNet architecture
• Much more accurate
• AlexNet: 18.2% top-5 error
• VGGNet: 6.8% top-5 error
• More than twice as many layers
• Filters are much smaller
• Harder and slower to train

ResNet: going real deep
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep Residual Learning for Image Recognition, CVPR 2016
Bigger not better: innovations typically reduce parameters, despite deeper nets
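One way to see how smaller filters and deeper nets can go together with fewer parameters is to count weights. The sketch below compares a single 7x7 convolution with three stacked 3x3 convolutions, which cover the same 7x7 receptive field at stride 1. The channel width of 64 and the function names are assumptions chosen for illustration, and biases are left out to keep the comparison simple.

```python
def conv_params(kernel_size, in_channels, out_channels, bias=True):
    """Learnable parameters in one 2D convolution with a square kernel."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

channels = 64  # hypothetical layer width, same for input and output

single_7x7 = conv_params(7, channels, channels, bias=False)
three_3x3 = 3 * conv_params(3, channels, channels, bias=False)

print(stacked_receptive_field(3, 3))  # 7
print(single_7x7)                     # 200704
print(three_3x3)                      # 110592
```

With equal channel widths C, the single large filter costs 49 * C^2 weights while the stack costs 27 * C^2, and the stack also applies a non-linearity between layers.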
Simply stacking layers

Deep residual learning
Plain Net
• Simple design
• Use only 3x3 conv (like VGG)
• No hidden FC

Key ideas for CNN architectures
• Convolutional layers
– Same local functions evaluated everywhere
– Far fewer parameters
• Pooling
– Larger receptive field
• ReLU
– Maintain a gradient signal over large portion of domain
• Limit parameters
– Sequence of 3x3 filters instead of large filters
– 1x1 convolutions to reduce feature dimensions
• Skip network
– Easier optimization with greater depth.
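The last point, that skip connections ease optimization at greater depth, can be illustrated with a toy scalar model. This is not the architecture from the ResNet paper: it is a hedged sketch that assumes each layer computes tanh(w * x) with a hypothetical small weight w, and compares the input gradient of a plain stack with that of a stack where each layer adds its input back (x + tanh(w * x)).

```python
import math

def stack_output_and_gradient(x, weights, residual):
    """Run a scalar stack; return (output, d output / d input) via the chain rule."""
    grad = 1.0
    for w in weights:
        h = math.tanh(w * x)
        local = w * (1.0 - h * h)  # derivative of tanh(w * x) w.r.t. x
        if residual:
            x = x + h              # skip connection: identity plus residual branch
            grad *= 1.0 + local    # the identity path contributes the 1
        else:
            x = h
            grad *= local          # plain stack: product of small local slopes
    return x, grad

weights = [0.1] * 20  # hypothetical: 20 layers with small weights

_, plain_grad = stack_output_and_gradient(0.5, weights, residual=False)
_, residual_grad = stack_output_and_gradient(0.5, weights, residual=True)

print(plain_grad)     # vanishingly small
print(residual_grad)  # stays at or above 1
```

In the plain stack the gradient is a product of 20 factors that are each at most 0.1, so almost no signal reaches the early layers. With the skip connection every factor is 1 plus a small term, so the signal survives regardless of how small the branch weights are.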
