DIY Deep Learning for Vision: a Guest Lecture with Caffe
caffe.berkeleyvision.org
github.com/BVLC/caffe

Evan Shelhamer, UC Berkeley
Based on a tutorial by E. Shelhamer, J. Donahue, Y. Jia, and R. Girshick.

Machine Learning in One Slide

- Supervised: learn f(x) = y or p(y|x) from labeled pairs (x, y).
- Unsupervised: learn p(x), or f(x) = ?, from unlabeled data alone.

figure credit K. Cho

Loss and Risk

Loss: a loss (error, or cost) function specifies the goal of learning by mapping parameter settings to a scalar value measuring the "badness" of those settings: the loss of a model f with parameters θ on an example (x, y) is L(f(x; θ), y).

Risk: risk measures loss over the input distribution, R(θ) = E[L(f(x; θ), y)]. Since we can't measure over all inputs, we do our best and measure the empirical risk over the data we have: R̂(θ) = (1/N) Σᵢ L(f(xᵢ; θ), yᵢ).

Optimization

How to minimize loss? Descend the gradient.

- Gradient descent: θ ← θ − η ∇θ R̂(θ), where the gradient sums over all N examples.
- Stochastic gradient descent (SGD): θ ← θ − η ∇θ L(f(xᵢ; θ), yᵢ) for a randomly drawn example (or minibatch) i.

Why SGD?
- Too much data makes the full sum expensive.
- The unreasonable effectiveness of randomization.
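To make that trade-off concrete, here is a minimal sketch in NumPy. It is not from the lecture; the data, minibatch size, and learning rate are illustrative choices.

    # Minimal sketch: SGD on least-squares regression. Full-batch gradient
    # descent would touch all N examples per step; SGD updates from a small
    # random minibatch instead.
    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 10000, 5
    X = rng.normal(size=(N, D))
    w_true = rng.normal(size=D)
    y = X @ w_true + 0.1 * rng.normal(size=N)

    def grad(w, Xb, yb):
        # Gradient of the empirical risk 0.5 * mean((Xb @ w - yb)^2).
        return Xb.T @ (Xb @ w - yb) / len(yb)

    w = np.zeros(D)
    lr = 0.1
    for step in range(1000):
        i = rng.integers(0, N, size=32)   # minibatch: the stochastic part
        w -= lr * grad(w, X[i], y[i])     # full-batch GD: grad(w, X, y)

    print(np.linalg.norm(w - w_true))     # should be small: w approaches w_true

Each SGD step here costs 32 gradient evaluations instead of 10,000; the minibatch gradient is a noisy but unbiased estimate of the full one, which is exactly the randomization the slide appeals to.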
Why Deep Learning? The Unreasonable Effectiveness of Deep Features

Classes separate in the deep representations and transfer to many tasks. [DeCAF] [Zeiler-Fergus]

Maximal activations of pool5 units [R-CNN] and conv5 DeConv visualizations [Zeiler-Fergus] show the rich visual structure of features deep in the hierarchy.

1st-layer filters are shown alongside the image patches that strongly activate them [Zeiler-Fergus]. For deeper layers, each 3x3 cell shows the top 9 image patches that activate a given feature in that layer; note the increase in visual complexity of feature activations as we go from "shallow" to "deep" layers. [Zeiler-Fergus]

Applications!
- vision
- speech
- text
- ...

Visual Recognition

ILSVRC12: a worldwide challenge won by AlexNet; the U. Toronto team was hired by Google. [AlexNet] figure credit Alex Krizhevsky, NIPS '12; graph credit Matt Zeiler, Clarifai

Object Detection

R-CNN: Regions with Convolutional Neural Networks. Ross Girshick et al. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR14.

Visual Style Recognition

Karayev et al. Recognizing Image Style. BMVC14. A Caffe fine-tuning example. Demo online at http://demo.vislab.berkeleyvision.org/ (see Results Explorer). Other styles: Vintage, Long Exposure, Noir, Pastel, Macro, and so on. [Image-Style]

Biological Imaging / Connectomics

A conv. net wins the ISBI 2012 EM segmentation challenge. [Ciresan12]

Speech Recognition

graph credit Matt Zeiler, Clarifai

What is Deep Learning? Compositional Models Learned End-to-End

A hierarchy of representations, running from concrete to abstract as learning deepens:
- vision: pixel, motif, part, object
- text: character, word, clause, sentence
- speech: audio, band, phone, word

figure credit Yann LeCun, ICML '13 tutorial

The composition must be non-linear, because a sequence of linear steps "collapses" to a single linear step: W2(W1 x) = (W2 W1) x. slide credit Yann LeCun, ICML '13 tutorial

Back-propagation: take the gradient of the model layer-by-layer by the chain rule to yield the gradient of all the parameters (see the MLP sketch at the end of this lecture). figure credit Yann LeCun, ICML '13 tutorial

There is a vast space of such models! slide credit Marc'Aurelio Ranzato, CVPR '14 tutorial

What is not Deep Learning?

- Logistic regression: a linear classifier on the input plus a non-linearity. It's learned end-to-end (there's only a single step), but there is no composition.
- Decision tree: a compositional model of a sequence of decisions, but all these decisions are made on the input. There is no intermediate representation or transformation.
- "Traditional" machine learning pipeline: take the input, design features, learn clusters, then learn a classifier on top. There are stages and transformations, but each stage is learned separately, not end-to-end.

History

Is deep learning 2, 20, or 50 years old? What's changed?
- Rosenblatt's perceptron
- 2000s: sparse, probabilistic, and layer-wise models (Hinton, Bengio, Ng)

"Neural" Networks

These models are not how the brain works. We don't know how the brain works! This isn't a problem (except for neuroscientists): planes don't flap and boats don't swim. Be wary of "neural realism," or "it just works because it's like the brain." Say network, not neural network; unit, not neuron.

Perceptron and MLP

Perceptron: sum and threshold by a "step" activation function, f(x) = step(w · x + b). Multi-layer perceptron (MLP): just stack these (see the sketch at the end of this lecture).

Convolutional Nets

Model structure adapted for vision:
- feature maps keep spatial structure
- pooling / subsampling increases the "field of view"
- parameters are shared across space, giving translation invariance

Convolution / filtering is linear, so a non-linearity is needed to "deepen" the representation by composition; otherwise a sequence of linear steps "collapses" to a single linear transformation. Sigmoid and TanH are the historical choices.

Convolutional Nets: 1989

LeNet: a layered model composed of convolution and pooling operations followed by a holistic representation, and ultimately a classifier for handwritten digits. [LeNet]

Convolutional Nets: 2012

AlexNet: a layered model composed of convolution, pooling, and further operations followed by a holistic representation, and all-in-all a landmark classifier on ILSVRC12. [AlexNet] What changed: + data, + gpu, + non-saturating non-linearity, + regularization.

The fully-connected "FULL" layers are linear classifiers / matrix multiplications. ReLUs are rectified-linear non-linearities on the outputs of layers.

Convolutional Nets: 2014

ILSVRC14 winners, at ~6.6% top-5 error:
- GoogLeNet: a composition of multi-scale, dimension-reduced modules
- VGG: 16 layers of 3x3 convolution interleaved with max pooling, plus 3 fully-connected layers

What changed: + depth, + data, + dimensionality reduction.

Caffe Tutorial

Continue with the practical tutorial on Caffe, the Berkeley open-source deep learning framework (a first taste follows below).

DIY Deep Learning for Vision: a Hands-On Tutorial with Caffe
caffe.berkeleyvision.org
github.com/BVLC/caffe
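First, as promised above, a minimal sketch of the perceptron, the stacked MLP, and layer-by-layer back-propagation, written from scratch in NumPy. The task (XOR), the architecture, and all names are illustrative assumptions, not the lecture's.

    # Perceptron: sum the weighted inputs, then threshold by the "step"
    # activation function.
    import numpy as np

    rng = np.random.default_rng(0)

    def perceptron(x, w, b):
        return (x @ w + b > 0).astype(float)

    # A single perceptron can compute AND:
    print(perceptron(np.array([1.0, 1.0]), np.array([1.0, 1.0]), -1.5))  # 1.0

    # MLP: "just stack these," but with a differentiable non-linearity so the
    # chain rule applies. Trained on XOR, which one perceptron cannot represent.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
    W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)
    sigmoid = lambda z: 1 / (1 + np.exp(-z))

    lr = 1.0
    for step in range(5000):
        # Forward: a composition of layers.
        h = sigmoid(X @ W1 + b1)
        p = sigmoid(h @ W2 + b2)
        # Backward: the gradient, layer by layer, by the chain rule.
        dp = p - y                      # d(cross-entropy)/d(output pre-activation)
        dW2 = h.T @ dp;  db2 = dp.sum(0)
        dh = dp @ W2.T * h * (1 - h)    # chain rule through the hidden layer
        dW1 = X.T @ dh;  db1 = dh.sum(0)
        # Gradient descent step (full-batch: there are only four examples).
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1

    p = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
    print(p.round(3).ravel())  # should approach [0, 1, 1, 0], given a lucky init

Note how the backward pass reuses the forward activations h and p; that is exactly the layer-by-layer bookkeeping a framework like Caffe automates.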
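And a first taste of Caffe itself: a minimal sketch that writes a one-convolution-plus-ReLU net in Caffe's prototxt format and runs a forward pass through the Python interface. The prototxt fields and caffe.Net calls reflect Caffe releases of this era as best I can tell, but the file names and dimensions are my own illustrative choices; see caffe.berkeleyvision.org for the authoritative walkthrough.

    # Minimal sketch, assuming Caffe's Python bindings are installed and on
    # PYTHONPATH. File names and shapes are hypothetical.
    import caffe

    net_spec = """
    name: "toy"
    input: "data"
    input_dim: 1   # batch size
    input_dim: 3   # channels
    input_dim: 32  # height
    input_dim: 32  # width
    layer {
      name: "conv1"
      type: "Convolution"
      bottom: "data"
      top: "conv1"
      convolution_param { num_output: 16 kernel_size: 3 }
    }
    layer {
      name: "relu1"
      type: "ReLU"
      bottom: "conv1"
      top: "conv1"   # in-place non-linearity on the convolution output
    }
    """
    with open("toy.prototxt", "w") as f:
        f.write(net_spec)

    caffe.set_mode_cpu()
    net = caffe.Net("toy.prototxt", caffe.TEST)  # random weights: no .caffemodel
    net.blobs["data"].data[...] = 1.0            # dummy input
    net.forward()
    print(net.blobs["conv1"].data.shape)         # expect (1, 16, 30, 30): 32 - 3 + 1

The point of the prototxt design is that the model is a declarative composition of layers: the same definition drives training, testing, and deployment, with learning handled by the framework.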