Deep Learning and Weak Supervision for Image Classification

Deep learning and weak supervision for image classification
Matthieu Cord, joint work with Thibaut Durand and Nicolas Thome
Sorbonne Universités - Université Pierre et Marie Curie (UPMC)
Laboratoire d'Informatique de Paris 6 (LIP6) - MLIA Team, UMR CNRS
June 09, 2016

Outline
Context: visual classification
1. MANTRA: a latent variable model to boost classification performance
2. WELDON: extension to deep CNNs

Motivations
• Working on datasets with complex scenes (large, cluttered backgrounds), non-centered objects, variable object sizes, ...: VOC07/12, MIT67, 15 Scene, COCO, VOC12 Action
• Selecting relevant regions → better prediction
• ImageNet: centered objects
  - Efficient transfer needs bounding boxes [Oquab, CVPR14]
• Full annotations are expensive ⇒ training with weak supervision

Motivations
How to learn without bounding boxes?
• Multiple-Instance Learning / latent variables for missing information [Felzenszwalb, PAMI10]
• Latent SVM and extensions ⇒ MANTRA
How to learn deep models without bounding boxes?
• Learning invariance with input image transformations
  - Spatial Transformer Networks [Jaderberg, NIPS15]
• Attention models, to select relevant regions
  - Stacked Attention Networks for Image Question Answering [Yang, CVPR16]
• Part models
  - Automatic discovery and optimization of parts for image classification [Parizi, ICLR15]
• Deep MIL
  - Is object localization for free?
[Oquab, CVPR15]
  - Deep extension of MANTRA: WELDON

Notations
Variable   Notation   Space   Train        Test         Example
Input      x          X       observed     observed     image
Output     y          Y       observed     unobserved   label
Latent     h          H       unobserved   unobserved   region
• Missing information is modeled with latent variables h
• Most popular approach in computer vision: Latent SVM [Felzenszwalb, PAMI10] [Yu, ICML09]

Latent Structural SVM [Yu, ICML09]
• Prediction function:
  (ŷ, ĥ) = argmax_{(y,h) ∈ Y×H} ⟨w, Ψ(x, y, h)⟩   (1)
  - Ψ(x, y, h): joint feature map
  - Joint inference in the Y × H space
• Training: a set of N labeled pairs (x_i, y_i)
  - Objective function, an upper bound of Δ(y_i, ŷ_i):
    min_w (1/2)‖w‖² + (C/N) Σ_{i=1}^N [ max_{(y,h) ∈ Y×H} [Δ(y_i, y) + ⟨w, Ψ(x_i, y, h)⟩] − max_{h ∈ H} ⟨w, Ψ(x_i, y_i, h)⟩ ]
  - Difference of convex functions, solved with CCCP
  - Loss-augmented inference (LAI): max_{(y,h) ∈ Y×H} [Δ(y_i, y) + ⟨w, Ψ(x_i, y, h)⟩]
  - Challenge exacerbated in the latent case: the search space is Y × H

MANTRA: Minimum Maximum Latent Structural SVM
Classifying only with the max-scoring latent value is not always relevant.
MANTRA model:
• Pair of latent variables (h⁺_{i,y}, h⁻_{i,y}):
  - max-scoring latent value: h⁺_{i,y} = argmax_{h ∈ H} ⟨w, Ψ(x_i, y, h)⟩
  - min-scoring latent value: h⁻_{i,y} = argmin_{h ∈ H} ⟨w, Ψ(x_i, y, h)⟩
• New scoring function:
  D_w(x_i, y) = ⟨w, Ψ(x_i, y, h⁺_{i,y})⟩ + ⟨w, Ψ(x_i, y, h⁻_{i,y})⟩   (2)
• Prediction function ⇒ pick the output with maximum score:
  ŷ = argmax_{y ∈ Y} D_w(x_i, y)   (3)
• MANTRA: max + min, vs. max for LSSVM ⇒ negative evidence
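The scoring function (2) and prediction rule (3) can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: regions are assumed to be precomputed feature vectors Φ(x, h), and with the multi-class joint feature map Ψ(x, y, h) (which places Φ(x, h) in the block of class y), the score ⟨w, Ψ(x, y, h)⟩ reduces to a dot product with the class-y weight vector.

```python
import numpy as np

def region_scores(w, regions, y):
    # <w, Psi(x, y, h)> for every region h: with the multi-class joint
    # feature map, this is a dot product with the weights of class y.
    # regions: (n_regions, d) array of features Phi(x, h); w: (K, d).
    return regions @ w[y]

def mantra_score(w, regions, y):
    # D_w(x, y) = <w, Psi(x, y, h+)> + <w, Psi(x, y, h-)>   (eq. 2)
    s = region_scores(w, regions, y)
    return s.max() + s.min()

def mantra_predict(w, regions):
    # y_hat = argmax_y D_w(x, y)   (eq. 3)
    n_classes = w.shape[0]
    return max(range(n_classes), key=lambda y: mantra_score(w, regions, y))
```

Note that with a single region per image, max and min coincide and the prediction is the same as for a standard latent SVM; the max + min scoring only changes behavior when there are several candidate regions.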
MANTRA: Model & Training Rationale
Intuition for the max + min prediction function:
• x: image; h: image region; y: image class
• ⟨w, Ψ(x, y, h)⟩: score of region h for class y
• D_w(x, y) = ⟨w, Ψ(x, y, h⁺_y)⟩ + ⟨w, Ψ(x, y, h⁻_y)⟩
  - h⁺_y: presence of class y ⇒ large for y_i
  - h⁻_y: localized evidence of the absence of class y
  - Not too low for y_i ⇒ latent space regularization
  - Low for y ≠ y_i ⇒ tracking negative evidence [Parizi, ICLR15]
• Example, for a street image x: D_w(x, street) = 2, D_w(x, highway) = 0.7, D_w(x, coast) = −1.5

MANTRA: Model Training
Learning formulation
• Loss function: ℓ_w(x_i, y_i) = max_{y ∈ Y} [Δ(y_i, y) + D_w(x_i, y)] − D_w(x_i, y_i)
  - (Margin rescaling) upper bound of Δ(y_i, ŷ); constraints: ∀y ≠ y_i, D_w(x_i, y_i) ≥ Δ(y_i, y) + D_w(x_i, y)
    (score for the ground-truth output ≥ margin + score for any other output)
• Non-convex optimization problem:
  min_w (1/2)‖w‖² + (C/N) Σ_{i=1}^N ℓ_w(x_i, y_i)   (4)
• Solver: non-convex one-slack cutting plane [Do, JMLR12]
  - Fast convergence
  - Direct optimization, unlike CCCP for LSSVM
  - Still needs to solve the LAI problem: max_y [Δ(y_i, y) + D_w(x_i, y)]

MANTRA: Optimization
• Instantiating MANTRA: define (x, y, h), Ψ(x, y, h), Δ(y_i, y)
• Instantiations: binary & multi-class classification, AP ranking

               Binary                 Multi-class                            AP Ranking
x              bag (set of regions)   bag (set of regions)                   set of bags (of regions)
y              ±1                     {1, ..., K}                            ranking matrix
h              instance (region)      region                                 regions
Ψ(x, y, h)     y · Φ(x, h)            [1(y=1)Φ(x, h), ..., 1(y=K)Φ(x, h)]    joint latent ranking feature map
Δ(y_i, y)      0/1 loss               0/1 loss                               AP loss
LAI            exhaustive             exhaustive                             exact and efficient

• Solving the inference max_y D_w(x_i, y) and the LAI max_y [Δ(y_i, y) + D_w(x_i, y)]:
  - Exhaustive for binary/multi-class classification
  - Exact and efficient solutions for ranking

WELDON: Weakly supErvised Learning of Deep cOnvolutional Nets
• MANTRA extension for training deep CNNs
• Learning Ψ(x, y, h): end-to-end learning of deep CNNs with structured prediction and latent variables
  - Incorporating multiple positive & negative evidence
  - Training deep CNNs with a structured loss
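The learning formulation above can be sketched for the multi-class instantiation, where Δ is the 0/1 loss and the LAI is solved exhaustively over the classes. This is a minimal sketch under the same assumptions as before (regions as precomputed feature vectors, one weight vector per class), not the cutting-plane solver of the talk.

```python
import numpy as np

def mantra_score(w, regions, y):
    # D_w(x, y) = max_h <w, Psi(x,y,h)> + min_h <w, Psi(x,y,h)>   (eq. 2)
    s = regions @ w[y]
    return s.max() + s.min()

def mantra_loss(w, regions, y_true):
    # l_w(x_i, y_i) = max_y [Delta(y_i, y) + D_w(x_i, y)] - D_w(x_i, y_i),
    # with Delta the 0/1 loss (multi-class instantiation).
    # The inner max is the loss-augmented inference (LAI), here solved
    # exhaustively over the K classes.
    n_classes = w.shape[0]
    lai = max((0.0 if y == y_true else 1.0) + mantra_score(w, regions, y)
              for y in range(n_classes))
    return lai - mantra_score(w, regions, y_true)
```

The loss is zero exactly when the ground-truth score beats every other class score by the margin Δ, which is the margin-rescaling constraint written as a hinge.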
Standard deep CNN architecture: VGG16
Simonyan et al. Very deep convolutional networks for large-scale image recognition. ICLR 2015

MANTRA adaptation for deep CNNs
Problem
• Fixed-size image as input
Adapting the architecture to weakly supervised learning:
1. Fully connected layers → convolution layers
  - sliding-window approach
2. Spatial aggregation
  - performs object localization prediction

WELDON: deep architecture
• C: number of classes

Aggregation function [Oquab, 2015]
• Region aggregation = max
• Select the highest-scoring window (original image → class feature map → max → prediction)
Oquab, Bottou, Laptev, Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. CVPR 2015

WELDON: region aggregation
Aggregation strategy:
• max + min pooling (the MANTRA prediction function)
• k instances
  - From a single region to multiple high-scoring regions:
    max → (1/k) Σ_{i=1}^k (i-th max)    min → (1/k) Σ_{i=1}^k (i-th min)
  - More robust region selection [Vasconcelos CVPR15]
  - Variants: max, max + min, 3 max + 3 min

WELDON: learning
• Objective function for the multi-class task and k = 1:
  min_w R(w) + (1/N) Σ_{i=1}^N ℓ(f_w(x_i), y_i^gt)
  f_w(x_i) = argmax_y [ max_h L^w_conv(x_i, y, h) + min_{h'} L^w_conv(x_i, y, h') ]
How to learn the deep architecture?
• Stochastic gradient descent training.
• Back-propagation of the error through the selected windows.
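The k-instance aggregation can be sketched as a pooling over one class score map; this is an illustrative NumPy sketch, not the Torch7 implementation from the talk.

```python
import numpy as np

def weldon_pool(score_map, k=3):
    # Aggregate a class score map over regions: mean of the top-k scores
    # (positive evidence) plus mean of the bottom-k scores (negative
    # evidence). With k = 1 this is exactly max + min pooling.
    s = np.sort(score_map.ravel())       # ascending order
    return s[-k:].mean() + s[:k].mean()
```

Because the output depends only on the 2k selected windows, only those windows receive a gradient during back-propagation, which is one way to read the "back-propagation through the selected windows" point above: training raises or lowers the scores of the selected regions only.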
WELDON: learning
Class is present ⇒ increase the scores of the selected windows.
Figure: Car map
Class is absent ⇒ decrease the scores of the selected windows.
Figure: Boat map

Experiments
• VGG16 pre-trained on ImageNet
• Torch7 implementation
Datasets
• Object recognition: Pascal VOC 2007, Pascal VOC 2012
• Scene recognition: MIT67, 15 Scene
• Visual recognition where context plays an important role: COCO, Pascal VOC 2012 Action

Dataset        Train     Test      Classes   Classification
VOC07          ~5,000    ~5,000    20        multi-label
VOC12          ~5,700    ~5,800    20        multi-label
15 Scene       1,500     2,985     15        multi-class
MIT67          5,360     1,340     67        multi-class
VOC12 Action   ~2,000    ~2,000    10        multi-label
COCO           ~80,000   ~40,000   80        multi-label

• Multi-scale: 8 scales (combination with an Object Bank strategy)

Object recognition
                          VOC 2007   VOC 2012
VGG16 (online code) [1]   84.5       82.8
SPP net [2]               82.4       -
Deep WSL MIL [3]          -          81.8
WELDON                    90.2       88.5
Table: mAP results on object recognition datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] He et al. Spatial pyramid pooling in deep convolutional networks. ECCV 2014
[3] Oquab et al. Is object localization for free? CVPR 2015

Scene recognition
                          15 Scene   MIT67
VGG16 (online code) [1]   91.2       69.9
MOP CNN [2]               -          68.9
Negative parts [3]        -          77.1
WELDON                    94.3       78.0
Table: Multi-class accuracy results on scene categorization datasets.
[1] Simonyan et al. Very deep convolutional networks. ICLR 2015
[2] Gong et al. Multi-scale Orderless Pooling of Deep Convolutional Activation Features. ECCV 2014
[3] Parizi et al. Automatic discovery and optimization of parts. ICLR 2015

Context datasets
                          VOC 2012 Action   COCO
VGG16 (online code) [1]   67.1              59.7
Deep WSL MIL [2]          -                 62.8
Our WSL deep CNN          75.0              68.8
Table: mAP results on context datasets.
