Deep Learning and Weak Supervision for Image Classification
Matthieu Cord
Joint work with Thibaut Durand and Nicolas Thome
Sorbonne Universités - Université Pierre et Marie Curie (UPMC)
Laboratoire d'Informatique de Paris 6 (LIP6) - MLIA Team, UMR CNRS
June 09, 2016

Outline
Context: visual classification
1. MANTRA: a latent variable model to boost classification performance
2. WELDON: extension to deep CNNs

Motivations
• Working on datasets with complex scenes (large, cluttered backgrounds), non-centered objects, and variable object sizes: VOC07/12, MIT67, 15 Scene, COCO, VOC12 Action
• Selecting relevant regions → better predictions
• ImageNet contains centered objects; efficient transfer to other datasets needs bounding boxes [Oquab, CVPR14]
• Full annotations are expensive ⇒ training with weak supervision

Motivations
How to learn without bounding boxes?
• Multiple-Instance Learning (MIL) / latent variables for missing information [Felzenszwalb, PAMI10]
• Latent SVM and extensions ⇒ MANTRA

How to learn deep networks without bounding boxes?
• Learning invariance to input image transformations: Spatial Transformer Networks [Jaderberg, NIPS15]
• Attention models to select relevant regions: Stacked Attention Networks for Image Question Answering [Yang, CVPR16]
• Part models: automatic discovery and optimization of parts for image classification [Parizi, ICLR15]
• Deep MIL: Is object localization for free? [Oquab, CVPR15]
• Deep extension of MANTRA: WELDON

Notations

Variable | Notation | Space | Train      | Test       | Example
Input    | x        | X     | observed   | observed   | image
Output   | y        | Y     | observed   | unobserved | label
Latent   | h        | H     | unobserved | unobserved | region

• Model missing information with latent variables h
• Most popular approach in computer vision: Latent SVM [Felzenszwalb, PAMI10] [Yu, ICML09]

Latent Structural SVM [Yu, ICML09]
• Prediction function:
  (ŷ, ĥ) = argmax_{(y,h) ∈ Y×H} ⟨w, Ψ(x, y, h)⟩   (1)
  - Ψ(x, y, h): joint feature map
  - Joint inference in the Y×H space
• Training: a set of N labeled pairs (x_i, y_i)
  - Objective function, an upper bound of Δ(y_i, ŷ_i):
    (1/2)‖w‖² + (C/N) Σ_{i=1}^{N} [ max_{(y,h) ∈ Y×H} [Δ(y_i, y) + ⟨w, Ψ(x_i, y, h)⟩] − max_{h ∈ H} ⟨w, Ψ(x_i, y_i, h)⟩ ]
  - Difference of convex functions, solved with CCCP
  - Loss-augmented inference (LAI): max_{(y,h) ∈ Y×H} [Δ(y_i, y) + ⟨w, Ψ(x_i, y, h)⟩]
  - Challenge exacerbated in the latent case by the size of the Y×H space

MANTRA: Minimum Maximum Latent Structural SVM
Classifying only with the max-scoring latent value is not always relevant.
MANTRA model:
• Pair of latent variables (h⁺_{i,y}, h⁻_{i,y})
  - max-scoring latent value: h⁺_{i,y} = argmax_{h ∈ H} ⟨w, Ψ(x_i, y, h)⟩
  - min-scoring latent value: h⁻_{i,y} = argmin_{h ∈ H} ⟨w, Ψ(x_i, y, h)⟩
• New scoring function:
  D_w(x_i, y) = ⟨w, Ψ(x_i, y, h⁺_{i,y})⟩ + ⟨w, Ψ(x_i, y, h⁻_{i,y})⟩   (2)
• Prediction function ⇒ find the output with maximum score:
  ŷ = argmax_{y ∈ Y} D_w(x_i, y)   (3)
• MANTRA: max + min, vs. max only for LSSVM ⇒ negative evidence
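As a toy illustration (not the authors' code), the max + min scoring of Equation (2) and the prediction rule of Equation (3) can be sketched with a linear model over region features; the feature matrix, weight rows, and class indices below are hypothetical stand-ins:

```python
import numpy as np

def mantra_score(region_feats, w_y):
    """MANTRA score D_w(x, y) for one class: max + min region score.

    region_feats: (R, d) array, one feature vector per latent region h of image x,
    so that <w, Psi(x, y, h)> = region_feats @ w_y for class weight vector w_y.
    """
    scores = region_feats @ w_y      # score of every latent region h for class y
    h_plus = scores.argmax()         # h+ : max-scoring region (positive evidence)
    h_minus = scores.argmin()        # h- : min-scoring region (negative evidence)
    return scores[h_plus] + scores[h_minus]

def mantra_predict(region_feats, W):
    """Prediction y_hat = argmax_y D_w(x, y), classes indexed by rows of W."""
    return int(np.argmax([mantra_score(region_feats, w_y) for w_y in W]))

# Toy usage: 5 candidate regions, 3-dim features, 2 classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 3))
W = rng.normal(size=(2, 3))
y_hat = mantra_predict(feats, W)
```

With |H| candidate regions per image, inference here is a simple exhaustive max/min over region scores, matching the exhaustive inference used for binary and multi-class classification.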
MANTRA: Model & Training Rationale
Intuition behind the max + min prediction function:
• x: image, h: image region, y: image class
• ⟨w, Ψ(x, y, h)⟩: score of region h for class y
• D_w(x, y) = ⟨w, Ψ(x, y, h⁺_y)⟩ + ⟨w, Ψ(x, y, h⁻_y)⟩
  - h⁺_y: presence of class y ⇒ large for y_i
  - h⁻_y: localized evidence of the absence of class y
  - Not too low for y_i ⇒ latent space regularization
  - Low for y ≠ y_i ⇒ tracking negative evidence [Parizi, ICLR15]
Example for a street image x: D_w(x, street) = 2, D_w(x, highway) = 0.7, D_w(x, coast) = −1.5

MANTRA: Model Training
Learning formulation
• Loss function: ℓ_w(x_i, y_i) = max_{y ∈ Y} [Δ(y_i, y) + D_w(x_i, y)] − D_w(x_i, y_i)
  - (Margin rescaling) upper bound of Δ(y_i, ŷ), with constraints: for all y ≠ y_i,
    D_w(x_i, y_i) ≥ Δ(y_i, y) + D_w(x_i, y)
    (score of the ground-truth output ≥ margin + score of any other output)
• Non-convex optimization problem:
  min_w (1/2)‖w‖² + (C/N) Σ_{i=1}^{N} ℓ_w(x_i, y_i)   (4)
• Solver: non-convex one-slack cutting plane [Do, JMLR12]
  - Fast convergence
  - Direct optimization, unlike CCCP for LSSVM
  - Still needs to solve the LAI problem: max_y [Δ(y_i, y) + D_w(x_i, y)]

MANTRA: Optimization
• MANTRA instantiation: define (x, y, h), Ψ(x, y, h), Δ(y_i, y)
• Instantiations: binary and multi-class classification, AP ranking

            | Binary               | Multi-class                          | AP Ranking
x           | bag (set of regions) | bag (set of regions)                 | set of bags (of regions)
y           | ±1                   | {1, ..., K}                          | ranking matrix
h           | instance (region)    | region                               | regions (joint latent ranking)
Ψ(x, y, h)  | y · Φ(x, h)          | [1(y=1)Φ(x, h), ..., 1(y=K)Φ(x, h)]  | joint latent ranking feature map
Δ(y_i, y)   | 0/1 loss             | 0/1 loss                             | AP loss
LAI         | exhaustive           | exhaustive                           | exact and efficient

• Solve inference max_y D_w(x_i, y) and LAI max_y [Δ(y_i, y) + D_w(x_i, y)]
  - Exhaustive search for binary/multi-class classification
  - Exact and efficient solutions for ranking

WELDON
Weakly supErvised Learning of Deep cOnvolutional Nets
• MANTRA extension for training deep CNNs
• Learning Ψ(x, y, h): end-to-end learning of deep CNNs with structured prediction and latent variables:
  - Incorporating multiple positive and negative evidence
  - Training deep CNNs with a structured loss

Standard deep CNN architecture: VGG16
Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR 2015

MANTRA adaptation for deep CNNs
Problem
• Fixed-size image as input
Adapting the architecture to weakly supervised learning:
1. Fully connected layers → convolution layers (sliding-window approach)
2. Spatial aggregation (performs object localization prediction)

WELDON: deep architecture
• C: number of classes

Aggregation function [Oquab, CVPR15]
• Region aggregation = max: select the highest-scoring window (original image → class feature map, e.g. motorbike → max → prediction)
Oquab, Bottou, Laptev, Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. CVPR 2015

WELDON: region aggregation
Aggregation strategy:
• max + min pooling (MANTRA prediction function)
• k instances: from a single region to multiple high-scoring regions:
  max → (1/k) Σ_{i=1}^{k} (i-th max)      min → (1/k) Σ_{i=1}^{k} (i-th min)
  - More robust region selection [Vasconcelos, CVPR15]
  - Variants: max, max + min, 3 max + 3 min

WELDON: learning
• Objective function for the multi-class task with k = 1:
  min_w R(w) + (1/N) Σ_{i=1}^{N} ℓ(f_w(x_i), y_i^gt)
  f_w(x_i) = argmax_y [ max_h L_conv^w(x_i, y, h) + min_{h'} L_conv^w(x_i, y, h') ]
How to learn this deep architecture?
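As an illustration (hypothetical code, not the authors' Torch7 implementation), the k-max + k-min spatial aggregation described above can be sketched on per-class score maps; the map shapes and names are assumptions:

```python
import numpy as np

def weldon_aggregate(score_map, k=3):
    """WELDON-style spatial aggregation of one class score map.

    score_map: 2-D array of sliding-window scores L_conv(x, y, h) for one class y.
    Returns the mean of the k highest plus the mean of the k lowest window
    scores, generalizing MANTRA's max + min pooling (recovered with k = 1).
    """
    flat = np.sort(score_map.ravel())
    top_k = flat[-k:].mean()    # average of the k max-scoring windows
    low_k = flat[:k].mean()     # average of the k min-scoring windows
    return top_k + low_k

def weldon_predict(score_maps, k=3):
    """score_maps: (C, H, W) array of per-class maps; returns argmax_y."""
    return int(np.argmax([weldon_aggregate(m, k) for m in score_maps]))

# Toy usage: C = 3 classes, each with a 6x6 map of window scores.
rng = np.random.default_rng(0)
maps = rng.normal(size=(3, 6, 6))
y_hat = weldon_predict(maps, k=3)
```

Because the aggregation only selects and averages windows, gradients during training flow back only through the k max and k min windows of each class map, which is what makes the selected regions move toward (or away from) the class evidence.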
• Stochastic gradient descent training
• Back-propagation of the error through the selected windows

WELDON: learning
• Class is present: increase the scores of the selected windows. (Figure: car map)
• Class is absent: decrease the scores of the selected windows. (Figure: boat map)

Experiments
• VGG16 pre-trained on ImageNet
• Torch7 implementation
Datasets
• Object recognition: Pascal VOC 2007, Pascal VOC 2012
• Scene recognition: MIT67, 15 Scene
• Visual recognition where context plays an important role: COCO, Pascal VOC 2012 Action

Dataset      | Train   | Test    | Classes | Classification
VOC07        | ~5,000  | ~5,000  | 20      | multi-label
VOC12        | ~5,700  | ~5,800  | 20      | multi-label
15 Scene     | 1,500   | 2,985   | 15      | multi-class
MIT67        | 5,360   | 1,340   | 67      | multi-class
VOC12 Action | ~2,000  | ~2,000  | 10      | multi-label
COCO         | ~80,000 | ~40,000 | 80      | multi-label

• Multi-scale: 8 scales (combined with the Object Bank strategy)

Object recognition
                        | VOC 2007 | VOC 2012
VGG16 (online code) [1] | 84.5     | 82.8
SPP net [2]             | 82.4     |
Deep WSL MIL [3]        |          | 81.8
WELDON                  | 90.2     | 88.5
Table: mAP results on object recognition datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] He et al. Spatial pyramid pooling in deep convolutional networks. ECCV 2014
[3] Oquab et al. Is object localization for free? CVPR 2015

Scene recognition
                        | 15 Scene | MIT67
VGG16 (online code) [1] | 91.2     | 69.9
MOP CNN [2]             |          | 68.9
Negative parts [3]      |          | 77.1
WELDON                  | 94.3     | 78.0
Table: multi-class accuracy results on scene categorization datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] Gong et al. Multi-scale orderless pooling of deep convolutional activation features. ECCV 2014
[3] Parizi et al. Automatic discovery and optimization of parts. ICLR 2015

Context datasets
                        | VOC 2012 Action | COCO
VGG16 (online code) [1] | 67.1            | 59.7
Deep WSL MIL [2]        |                 | 62.8
Our WSL deep CNN        | 75.0            | 68.8
Table: mAP results on context datasets.