DIY Deep Learning for Vision: a guest lecture with Caffe

.com/BVLC/caffe

Evan Shelhamer, UC Berkeley

Based on a tutorial by E. Shelhamer, J. Donahue, Y. Jia, and R. Girshick

in one slide

Supervised: f(x) = y, p(y|x)

Unsupervised: p(x), f(x) = ?

figure credit K. Cho

Loss and Risk

Loss: a loss (or error, or cost) function specifies the goal of learning by mapping parameter settings to a scalar value that measures the “badness” of those settings.

Risk Empirical Risk

Risk: risk measures the expected loss over all possible inputs. Since we can’t measure over all inputs, we do our best and measure the empirical risk over the data we have.

Optimization
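The quantity being optimized, the empirical risk, can be sketched in a few lines. This is only an illustration: the squared loss, the one-parameter linear model f(x) = w * x, and the toy data are assumptions made for the example, not part of the lecture.

```python
def squared_loss(prediction, target):
    """Loss for one example: a larger value means a worse prediction."""
    return (prediction - target) ** 2


def empirical_risk(w, data):
    """Average loss of the model f(x) = w * x over the data we have."""
    return sum(squared_loss(w * x, y) for x, y in data) / len(data)


# Hypothetical toy data generated by y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

risk_good = empirical_risk(2.0, data)  # parameter setting that fits the data
risk_bad = empirical_risk(0.0, data)   # parameter setting that ignores the input
print(risk_good, risk_bad)
```

The loss maps each parameter setting to a single number, so different settings can be compared directly: the lower the empirical risk, the better the fit to the data at hand.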

How to minimize loss? Descend the gradient.

Stochastic Gradient Descent (SGD)
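A minimal sketch contrasting full gradient descent with SGD, assuming a one-parameter model f(x) = w * x with squared loss and noise-free toy data. The learning rate, step counts, and function names are illustrative choices, not Caffe's solver.

```python
import random


def grad_example(w, x, y):
    """Gradient of the squared loss (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def gradient_descent(data, lr=0.01, steps=200):
    """Every step averages the gradient over all of the data."""
    w = 0.0
    for _ in range(steps):
        g = sum(grad_example(w, x, y) for x, y in data) / len(data)
        w -= lr * g
    return w


def sgd(data, lr=0.01, steps=2000, seed=0):
    """Every step uses the gradient of a single randomly drawn example."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)
        w -= lr * grad_example(w, x, y)
    return w


# Hypothetical toy data generated by y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 6)]
w_gd = gradient_descent(data)
w_sgd = sgd(data)
print(w_gd, w_sgd)
```

Each SGD step touches one example instead of the whole dataset, so its cost does not grow with the amount of data. On this toy problem both procedures recover w close to 3.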

Gradient Descent

Stochastic Gradient Descent

Why?
- Too much data makes the sum expensive
- Unreasonable effectiveness of randomization

Why Deep Learning? The Unreasonable Effectiveness of Deep Features

Classes separate in the deep representations and transfer to many tasks. [DeCAF] [Zeiler-Fergus]

Maximal activations of pool5 units [R-CNN]

conv5 DeConv visualization

Rich visual structure of features deep in hierarchy. [Zeiler-Fergus]

1st layer filters

image patches that strongly activate 1st layer filters [Zeiler-Fergus]

Each 3x3 cell shows the top 9 image patches that activate a given feature in this layer.

Note the increase in visual complexity of feature activations as we go from “shallow” to “deep” layers.

[Zeiler-Fergus]

Why Deep Learning? The Unreasonable Effectiveness of Deep Features

- vision
- speech
- text
- ...

Visual Recognition


graph credit Matt Zeiler, Clarifai

ILSVRC12: worldwide challenge won by AlexNet. The U. Toronto team was hired by Google.



figure credit Alex Krizhevsky, NIPS ’12

Object Detection

R-CNN: Regions with Convolutional Neural Networks

Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR14.

Visual Style Recognition

Karayev et al. Recognizing Image Style. BMVC14. Caffe fine-tuning example. Demo online at (see Results Explorer).

Other Styles:

Vintage, Long Exposure, Noir, Pastel, Macro … and so on.

[Image-Style]

Biological Imaging / Connectomics

Conv. Net wins ISBI 2012 EM segmentation challenge

[Ciresan12]

Speech Recognition

graph credit Matt Zeiler, Clarifai

What is Deep Learning?

Compositional Models Learned End-to-End

Hierarchy of Representations
- vision: pixel, motif, part, object
- text: character, word, clause, sentence
- speech: audio, band, phone, word

concrete → abstract (learning)

figure credit Yann LeCun, ICML ’13 tutorial

Non-linearity is needed because a sequence of linear steps “collapses” to a single linear step.

slide credit Yann LeCun, ICML ’13 tutorial

Back-propagation: take the gradient of the model layer-by-layer by the chain rule to yield the gradient of all the parameters.

figure credit Yann LeCun, ICML ’13 tutorial
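Back-propagation by the chain rule can be sketched on a tiny two-layer model. The scalar weights, the ReLU hidden unit, and the squared loss are assumptions chosen to keep the example short. The hand-derived gradients are compared against finite differences as a sanity check.

```python
def forward(w1, w2, x):
    """Two-layer model: a = w1 * x, h = relu(a), y = w2 * h."""
    a = w1 * x
    h = max(0.0, a)
    y = w2 * h
    return a, h, y


def loss(w1, w2, x, target):
    """Squared error of the model output."""
    return (forward(w1, w2, x)[2] - target) ** 2


def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back to each parameter, layer by layer."""
    a, h, y = forward(w1, w2, x)
    dloss_dy = 2.0 * (y - target)                  # derivative of the squared loss
    dloss_dw2 = dloss_dy * h                       # gradient for the top layer parameter
    dloss_dh = dloss_dy * w2                       # pass the gradient down to the hidden unit
    dloss_da = dloss_dh * (1.0 if a > 0 else 0.0)  # through the ReLU
    dloss_dw1 = dloss_da * x                       # gradient for the bottom layer parameter
    return dloss_dw1, dloss_dw2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
g1, g2 = backward(w1, w2, x, target)

# Numerical check with central finite differences.
eps = 1e-6
n1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
n2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
print(g1, g2, n1, n2)
```

Each layer only needs the gradient arriving from the layer above and its own local derivative, which is what lets the same procedure scale to deep stacks of layers.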

Vast space of models!

slide credit Marc’Aurelio Ranzato, CVPR ’14 tutorial.

What is not Deep Learning?

- : linear classifier on the input + a non-linearity. It’s learned end-to-end (there’s only a single step), but there is no composition.
- decision tree: a compositional model of a sequence of decisions, but all these decisions are made on the input. There is no intermediate representation or transformation.
- “traditional” machine learning pipeline: take input, design features, learn clusters, then learn a classifier on top. There are stages and transformations, but each stage is learned separately and not end-to-end.

History

2000s Sparse, Probabilistic, and Layer-wise models (Hinton, Bengio, Ng)

Is deep learning 2, 20, or 50 years old? What’s changed?

Rosenblatt’s “Neural” Networks

These models are not how the brain works. We don’t know how the brain works!
- This isn’t a problem (except for neuroscientists).
- Planes don’t flap and boats don’t swim.

Be wary of “Neural Realism,” or “it just works because it’s like the brain.”
- network, not neural network
- unit, and not neuron

Perceptron and MLP

Multi-Layer Perceptron: Just stack these.
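As a rough illustration of stacking, the sketch below builds a perceptron (weighted sum followed by a step threshold) and stacks three of them to compute XOR, which a single perceptron cannot represent. The weights are set by hand for the example rather than learned, and the function names are hypothetical.

```python
def perceptron(weights, bias, inputs):
    """Weighted sum of the inputs followed by a step threshold."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def xor_mlp(x1, x2):
    """Two layers of perceptrons with hand-set weights."""
    h_or = perceptron([1.0, 1.0], -0.5, [x1, x2])        # fires if either input is on
    h_and = perceptron([1.0, 1.0], -1.5, [x1, x2])       # fires only if both inputs are on
    return perceptron([1.0, -1.0], -0.5, [h_or, h_and])  # OR but not AND


outputs = [xor_mlp(a, b) for a in (0, 1) for b in (0, 1)]
print(outputs)
```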

Perceptron: sum and threshold by “step” activation function.

Convolutional Nets

Model structure adapted for vision:
- feature maps keep spatial structure
- pooling / subsampling increases “field of view”
- parameters are shared across space / translation invariance

Convolutional / Filtering

Sigmoid, TanH: historical, chosen

Non-linearity is needed to “deepen” the representation by composition. Otherwise a sequence of linear steps “collapses” to a single linear transformation.
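The “collapse” of stacked linear steps can be checked directly. In this sketch the matrices W1 and W2 and the input x are arbitrary values picked for illustration. Two matrix multiplications in sequence give the same result as one multiplication by their product, while placing a ReLU between them does not.

```python
def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(v):
    """Rectified-linear non-linearity applied elementwise."""
    return [max(0.0, x) for x in v]


W1 = [[1.0, -1.0], [-1.0, 1.0]]
W2 = [[1.0, 1.0]]
x = [2.0, 0.5]

two_linear_steps = matvec(W2, matvec(W1, x))   # linear, then linear
one_linear_step = matvec(matmul(W2, W1), x)    # the single collapsed matrix
with_relu = matvec(W2, relu(matvec(W1, x)))    # linear, ReLU, linear
print(two_linear_steps, one_linear_step, with_relu)
```

With the ReLU in between, this particular network computes |x1 - x2|, a function that no single linear transformation can represent.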

Convolutional Nets: 1989

LeNet: a layered model composed of convolution and pooling operations followed by a holistic representation and ultimately a classifier for handwritten digits. [LeNet]

Convolutional Nets: 2012

AlexNet: a layered model composed of convolution, pooling, and further operations followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12. [AlexNet]
+ data
+ gpu
+ non-saturating nonlinearity
+ regularization
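The convolution and pooling operations that these models compose can be sketched in one dimension. This is a simplification for brevity: real convolutional nets filter 2-D images with many learned kernels, and the hand-set edge kernel and toy signals below are assumptions made for the example.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel across the signal (valid cross-correlation)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(features, size):
    """Keep the maximum of each non-overlapping window."""
    return [max(features[i:i + size])
            for i in range(0, len(features) - size + 1, size)]


edge = [-1.0, 1.0]                     # responds to a rising edge
signal = [0.0, 1.0, 1.0, 0.0, 0.0]
shifted = [0.0, 0.0, 1.0, 1.0, 0.0]    # the same pattern moved one step right

features = conv1d(signal, edge)
features_shifted = conv1d(shifted, edge)
pooled = max_pool1d(features, 2)
pooled_shifted = max_pool1d(features_shifted, 2)
print(features, features_shifted, pooled, pooled_shifted)
```

The same two kernel weights are reused at every position, so the feature map keeps the layout of the input and shifts along with it. After pooling, the small shift no longer changes the output in this example.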

The fully-connected “FULL” layers are linear classifiers / matrix multiplications. ReLUs are rectified-linear non-linearities on the output of layers.

Convolutional Nets: 2014

ILSVRC14 Winners: ~6.6% Top-5 error
- GoogLeNet: composition of multi-scale dimension-reduced modules
- VGG: 16 layers of 3x3 convolution interleaved with max pooling + 3 fully-connected layers

+ depth
+ data

Caffe Tutorial

Continue with the practical tutorial on Caffe, the Berkeley open-source deep learning framework.

DIY Deep Learning for Vision: a Hands-On Tutorial with Caffe