DIY Deep Learning for Vision: a guest lecture with Caffe
caffe.berkeleyvision.org
github.com/BVLC/caffe
Evan Shelhamer, UC Berkeley
Based on a tutorial by E. Shelhamer, J. Donahue, Y. Jia, and R. Girshick
Machine Learning in one slide
Supervised f(x) = y, p(y|x)
Unsupervised p(x), f(x) = ?
figure credit K. Cho
Loss and Risk
Loss: a loss (error, or cost) function specifies the goal of learning by mapping parameter settings to a scalar value that measures the “badness” of those settings.
Risk and Empirical Risk
Risk: risk measures the expected loss over the inputs. Since we can’t measure over all inputs, we do our best and measure the empirical risk over the data we have.
Optimization
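To make the loss/risk distinction concrete before turning to optimization, here is a minimal NumPy sketch. The squared loss, the linear model, and the toy data are all illustrative assumptions, not from the lecture:

```python
import numpy as np

def squared_loss(y_pred, y_true):
    # loss: scalar "badness" of a single prediction
    return (y_pred - y_true) ** 2

def empirical_risk(w, X, y):
    # empirical risk: mean loss over the data we actually have,
    # here for a linear model f(x) = w . x
    return np.mean(squared_loss(X @ w, y))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w_good = np.array([1.0, 2.0])  # fits the toy data exactly: risk 0
w_bad = np.array([0.0, 0.0])   # predicts 0 everywhere: positive risk
```

Learning then amounts to choosing parameters `w` that drive `empirical_risk` down.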
How to minimize loss? Descend the gradient.
Stochastic Gradient Descent (SGD)
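A minimal SGD loop for least-squares linear regression, as a sketch: the synthetic data, learning rate, and epoch count are assumptions chosen for illustration, not lecture code.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic linear-regression data: y = X @ w_true + noise
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true + 0.01 * rng.normal(size=1000)

def sgd(X, y, lr=0.05, epochs=10):
    # minimize the empirical risk sum_i 0.5 * (x_i . w - y_i)^2
    # by stepping on ONE randomly ordered example at a time,
    # instead of summing the gradient over all the data
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            grad_i = (X[i] @ w - y[i]) * X[i]  # gradient on example i alone
            w -= lr * grad_i
    return w

w_hat = sgd(X, y)
```

Each update costs one example instead of the full sum, which is exactly why SGD scales to large datasets.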
Gradient Descent: step along the gradient of the loss summed over all the data.
Stochastic Gradient Descent: step along the gradient of the loss on a single randomly sampled example.
Why stochastic?
- Too much data makes the sum expensive
- Unreasonable effectiveness of randomization
Why Deep Learning? The Unreasonable Effectiveness of Deep Features
Classes separate in the deep representations and transfer to many tasks. [DeCAF] [Zeiler-Fergus]
Why Deep Learning? The Unreasonable Effectiveness of Deep Features
Maximal activations of pool5 units [R-CNN]
conv5 DeConv visualization: rich visual structure of features deep in the hierarchy. [Zeiler-Fergus]
Why Deep Learning? The Unreasonable Effectiveness of Deep Features
1st layer filters
image patches that strongly activate 1st layer filters [Zeiler-Fergus]
Each 3x3 cell shows the top 9 image patches that activate a given feature in this layer.
Note the increase in visual complexity of feature activations as we go from “shallow” to “deep” layers.
[Zeiler-Fergus]
Why Deep Learning? The Unreasonable Effectiveness of Deep Features
Applications!
- vision
- speech
- text
- ...
Visual Recognition
AlexNet
graph credit Matt Zeiler, Clarifai
Visual Recognition
ILSVRC12: a worldwide challenge won by AlexNet; the U. Toronto team was hired by Google.
[AlexNet]
AlexNet
figure credit Alex Krizhevsky, NIPS ‘12
Object Detection
R-CNN: Regions with Convolutional Neural Networks
Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR14.
Visual Style Recognition
Karayev et al. Recognizing Image Style. BMVC14. Caffe fine-tuning example. Demo online at http://demo.vislab.berkeleyvision.org/ (see Results Explorer).
Other Styles: Vintage, Long Exposure, Noir, Pastel, Macro, … and so on.
[Image-Style]
Biological Imaging / Connectomics
Conv. Net wins ISBI 2012 EM segmentation challenge
[Ciresan12]
Speech Recognition
graph credit Matt Zeiler, Clarifai
What is Deep Learning?
Compositional Models Learned End-to-End
Hierarchy of Representations - vision: pixel, motif, part, object - text: character, word, clause, sentence - speech: audio, band, phone, word
Representations range from concrete to abstract; the hierarchy is learned.
What is Deep Learning?
Compositional Models Learned End-to-End
figure credit Yann LeCun, ICML ‘13 tutorial
Non-linear because a sequence of linear steps “collapses” to a single linear step.
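The collapse of stacked linear steps is easy to verify numerically. The matrices below are arbitrary illustrative values, not lecture code:

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))  # first "layer"
W2 = rng.normal(size=(2, 4))  # second "layer"
x = rng.normal(size=3)

# applying two linear layers in sequence ...
two_steps = W2 @ (W1 @ x)
# ... is exactly the same as one linear layer with weights W2 @ W1
one_step = (W2 @ W1) @ x
```

Inserting a non-linearity between the layers is what prevents this collapse and lets depth add representational power.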
slide credit Yann LeCun, ICML ‘13 tutorial
What is Deep Learning?
Compositional Models Learned End-to-End
Back-propagation: take the gradient of the model layer-by-layer by the chain rule to yield the gradient of all the parameters.
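The layer-by-layer chain rule can be sketched for a tiny two-layer net. The architecture (linear, ReLU, linear) and the squared loss are illustrative assumptions:

```python
import numpy as np

def forward_backward(W1, w2, x, t):
    # forward: x -> a = W1 x -> h = relu(a) -> s = w2 . h -> loss
    a = W1 @ x
    h = np.maximum(a, 0.0)
    s = w2 @ h
    loss = 0.5 * (s - t) ** 2
    # backward: apply the chain rule layer by layer, top to bottom
    ds = s - t                 # d loss / d s
    dw2 = ds * h               # d loss / d w2
    dh = ds * w2               # d loss / d h
    da = dh * (a > 0)          # ReLU passes gradient only where a > 0
    dW1 = np.outer(da, x)      # d loss / d W1
    return loss, dW1, dw2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
w2 = rng.normal(size=3)
x = np.array([1.0, -2.0])
loss, dW1, dw2 = forward_backward(W1, w2, x, t=0.5)
```

A finite-difference check on any single parameter confirms the analytic gradient matches the numerical one.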
figure credit Yann LeCun, ICML ‘13 tutorial
What is Deep Learning?
Vast space of models!
slide credit Marc’aurelio Ranzato, CVPR ‘14 tutorial
What is not Deep Learning?
- logistic regression: a linear classifier on the input plus a non-linearity. It’s learned end-to-end (there’s only a single step), but there is no composition.
- decision tree: a compositional model of a sequence of decisions, but all these decisions are made on the input. There is no intermediate representation or transformation.
- “traditional” machine learning pipeline: take the input, design features, learn clusters, then learn a classifier on top. There are stages and transformations, but each stage is learned separately and not end-to-end.
History
2000s: Sparse, Probabilistic, and Layer-wise models (Hinton, Bengio, Ng)
Is deep learning 2, 20, or 50 years old? What’s changed?
Rosenblatt’s Perceptron
“Neural” Networks
These models are not how the brain works. We don’t know how the brain works!
- This isn’t a problem (except for neuroscientists).
- Planes don’t flap and boats don’t swim. Be wary of “Neural Realism,” or “it just works because it’s like the brain.”
- network, not neural network
- unit, not neuron
Perceptron and MLP
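The perceptron (sum, then step threshold) and a stacked multi-layer perceptron can be sketched with hand-set weights. The XOR wiring below is a standard illustration of why stacking matters, not something from the slides:

```python
import numpy as np

def step(z):
    # "step" activation: 1 if the summed input is positive, else 0
    return 1.0 if z > 0 else 0.0

def perceptron(x, w, b):
    # sum the weighted inputs, then threshold
    return step(w @ x + b)

# a single unit can compute OR ...
def unit_or(x):
    return perceptron(x, np.array([1.0, 1.0]), -0.5)

# ... or AND, but no single unit can compute XOR
def unit_and(x):
    return perceptron(x, np.array([1.0, 1.0]), -1.5)

# a multi-layer perceptron just stacks units: XOR = OR and not AND
def mlp_xor(x):
    h = np.array([unit_or(x), unit_and(x)])  # hidden layer
    return perceptron(h, np.array([1.0, -1.0]), -0.5)
```

The hidden layer is an intermediate representation: exactly the ingredient the single perceptron lacks.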
Perceptron: sum the inputs and threshold with a “step” activation function.
Multi-Layer Perceptron (MLP): just stack these.
Convolutional Nets
Model structure adapted for vision:
- feature maps keep spatial structure
- pooling / subsampling increases the “field of view”
- parameters are shared across space for translation invariance
Convolution / filtering: learned filters slide across the input. A non-linearity (sigmoid and tanh are the historical choices) is needed to “deepen” the representation by composition; otherwise a sequence of linear steps “collapses” to a single linear transformation.
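A bare-bones sketch of the two core operations, written with explicit loops for clarity. The toy image and filter are illustrative assumptions:

```python
import numpy as np

def conv2d(image, kernel):
    # slide one shared filter over the image ("valid" positions only);
    # sharing the same weights across space is what gives translation invariance
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    # subsample by keeping the max of each size x size window,
    # increasing the "field of view" of later layers
    out = np.zeros((fmap.shape[0] // size, fmap.shape[1] // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
fmap = conv2d(image, np.array([[1.0, -1.0]]))  # toy horizontal-difference filter
pooled = max_pool(fmap)
```

The output keeps spatial structure (a feature map), and pooling shrinks it while widening what each remaining unit “sees.”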
Convolutional Nets: 1989
LeNet: a layered model composed of convolution and pooling operations followed by a holistic representation and ultimately a classifier for handwritten digits. [LeNet]
Convolutional Nets: 2012
AlexNet: a layered model composed of convolution, pooling, and further operations followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12. [AlexNet]
+ data
+ gpu
+ non-saturating non-linearity
+ regularization
Convolutional Nets: 2012
AlexNet: a layered model composed of convolution, pooling, and further operations followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12. [AlexNet]
The fully-connected “FULL” layers are linear classifiers / matrix multiplications. ReLUs are rectified-linear non-linearities applied to layer outputs.
ILSVRC14 Winners: ~6.6% Top-5 error
- GoogLeNet: composition of multi-scale dimension-reduced modules
- VGG: 16 layers of 3x3 convolution interleaved with max pooling, plus 3 fully-connected layers
+ depth
+ data
+ dimensionality reduction
Caffe Tutorial
Continue with the practical tutorial on Caffe, the Berkeley open-source deep learning framework.
DIY Deep Learning for Vision: a Hands-On Tutorial with Caffe
caffe.berkeleyvision.org github.com/BVLC/caffe