Teaching computers about visual categories

Kristen Grauman, Department of Computer Science, University of Texas at Austin

Visual category recognition

Goal: recognize and detect categories of visually and semantically related…

Objects

Scenes

Activities

The need for visual recognition

Robotics, augmented reality, indexing by content, surveillance, scientific data analysis

Difficulty of category recognition

Illumination, object pose, clutter, occlusions, viewpoint, intra-class appearance variation

~30,000 possible categories to distinguish! [Biederman 1987]

Progress charted by datasets

1963: Roberts
1996: COIL
2000: MIT-CMU Faces, INRIA Pedestrians, UIUC Cars
2005: MSRC 21 Objects, Caltech-101, Caltech-256
2007: PASCAL VOC detection challenge
2008-2013: ImageNet, 80M Tiny Images, PASCAL VOC, Birds-200, Faces in the Wild

Learning-based methods

Last ~10 years: impressive strides by learning appearance models (usually discriminative).

[Pipeline: the annotator labels training images as car / non-car; the learned model classifies a novel image.]

[Papageorgiou & Poggio 1998, Schneiderman & Kanade 2000, Viola & Jones 2001, Dalal & Triggs 2005, Grauman & Darrell 2005, Lazebnik et al. 2006, Felzenszwalb et al. 2008, …]

Exuberance for image data (and their category labels)

ImageNet: 14M images, 1K+ labeled object categories [Deng et al. 2009-2012]

80M Tiny Images: 80M images, 53K noisily labeled object categories [Torralba et al. 2008]

SUN Database: 131K images, 902 labeled scene categories, 4K labeled object categories [Xiao et al. 2010]

Problem

[Two plots on a log scale, 1998 to 2013: the difficulty and scale of the data rise sharply, while the complexity of supervision stays flat.]

While the complexity and scale of the recognition task have escalated dramatically, our means of “teaching” visual categories remains shallow.

Envisioning a broader channel

“This image has a cow in it.”

Human annotator

More labeled images ↔ more accurate models?

We need richer means to teach the system about the visual world.

Envisioning a broader channel

[Diagram: today, a narrow human-system channel built on vision and learning; over the next 10 years, a broader channel drawing on vision, learning, knowledge representation, multi-agent systems, human computation, robotics, and language.]

Our goal

Teaching computers about visual categories must be an ongoing, interactive process, with communication that goes beyond labels.

This talk:
1. Active visual learning
2. Learning from visual comparisons

Active learning for visual recognition

[Loop: the current classifiers examine the unlabeled data, issue an active request, the annotator answers, and the labeled pool grows.]

[Mackay 1992, Cohn et al. 1996, Freund et al. 1997, Lindenbaum et al. 1999, Tong & Koller 2000, Schohn & Cohn 2000, Campbell et al. 2000, Roy & McCallum 2001, Kapoor et al. 2007, …]
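To make the loop concrete, here is a minimal pool-based active learning sketch in Python; the data, the oracle callback, and the scikit-learn linear SVM are illustrative assumptions, not the specific models from the cited papers. At each round the current classifier queries the unlabeled example it is least certain about, the annotator supplies the label, and the classifier is retrained.

import numpy as np
from sklearn.svm import LinearSVC

def pool_based_active_learning(X_labeled, y_labeled, X_pool, oracle, n_rounds=20):
    """Minimal uncertainty-sampling loop: query the pool example closest to the
    current decision boundary, get its label from the annotator, and retrain.
    X_pool is assumed to be a numpy array of feature vectors."""
    X_lab, y_lab = list(X_labeled), list(y_labeled)   # seed set (assumed to cover both classes)
    pool = list(range(len(X_pool)))                   # indices of still-unlabeled examples
    clf = LinearSVC().fit(X_lab, y_lab)
    for _ in range(n_rounds):
        if not pool:
            break
        # Uncertainty = distance to the hyperplane; the smallest margin is most uncertain.
        margins = np.abs(clf.decision_function(X_pool[pool]))
        pick = pool[int(np.argmin(margins))]
        # Active request: the annotator (oracle) answers, and the labeled pool grows.
        X_lab.append(X_pool[pick])
        y_lab.append(oracle(pick))
        pool.remove(pick)
        clf = LinearSVC().fit(X_lab, y_lab)
    return clf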

[Learning curves: accuracy vs. number of labels added; the active curve rises faster than the passive curve.]

Intent: better models, faster/cheaper.

Problem: Active selection and recognition

Annotations range from less expensive (e.g., an image-level tag) to more expensive (e.g., a full segmentation):
• Multiple levels of annotation are possible
• Variable cost depending on level and example

Our idea: Cost-sensitive multi-question active learning

• Compute a decision-theoretic active selection criterion that weighs both:
  – which example to annotate, and
  – what kind of annotation to request for it,
  as compared to
  – the predicted effort the request would require.

[Vijayanarasimhan & Grauman, NIPS 2008, CVPR 2009]

Decision-theoretic multi-question criterion

Value of asking a given question about a given object = (current misclassification risk given the data) - (estimated risk if the candidate request were answered) - (cost of getting the answer)

(A code sketch of this criterion follows the list of request types below.)

Three “levels” of requests to choose from:
1. Label a region in the image
2. Tag an object
3. Segment the image, naming all objects
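A hedged sketch of the criterion above; the risk and cost estimators are hypothetical callbacks, not the estimators from the NIPS 2008 / CVPR 2009 papers. Score every (example, request-level) pair by its value of information and ask the best question.

# Request "levels" from the slide above: region label, object tag, full segmentation.
LEVELS = ["label_region", "tag_object", "segment_image"]

def select_question(candidates, current_risk, est_risk_if_answered, est_cost):
    """Pick the (example, level) pair with the highest value of information:
    VOI = current risk - estimated risk if the request were answered - answer cost.
    The three callbacks are placeholders for the real estimators."""
    best, best_voi = None, float("-inf")
    risk_now = current_risk()
    for x in candidates:
        for level in LEVELS:
            voi = risk_now - est_risk_if_answered(x, level) - est_cost(x, level)
            if voi > best_voi:
                best, best_voi = (x, level), voi
    return best, best_voi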

Predicting effort

• What manual effort cost would we expect to pay for an unlabeled image?

Which image would you rather annotate?

We estimate labeling difficulty from visual content.
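As an illustration only, labeling difficulty could be regressed from simple image statistics; the features below (edge density, intensity variation) and the ridge regressor are assumptions for this sketch, not the papers' actual effort predictor.

import numpy as np
from sklearn.linear_model import Ridge

def effort_features(image_gray):
    """Hypothetical difficulty cues: edge density and intensity variation."""
    gy, gx = np.gradient(image_gray.astype(float))
    edge_density = float(np.mean(np.hypot(gx, gy) > 10.0))
    return np.array([edge_density, float(image_gray.std())])

def train_effort_predictor(images, annotation_times):
    """Fit a regressor mapping visual content to expected labeling effort (e.g., seconds),
    trained on images whose annotation times were actually measured."""
    X = np.stack([effort_features(im) for im in images])
    return Ridge(alpha=1.0).fit(X, annotation_times)

# Usage: predicted_cost = model.predict(effort_features(new_image)[None, :])[0]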

Other forms of effort cost: expertise required, resolution of the data, how far the robot must move, length of the video clip, …

Multi-question active learning

[Loop: the current classifiers pose requests such as “Completely segment image #32” or “Does image #7 contain a cow?”; the annotator answers, and the labeled pool grows.]

[Vijayanarasimhan & Grauman, NIPS 2008, CVPR 2009]

Multi-question active learning curves

[Learning curves: accuracy vs. annotation effort.]

Multi-question active learning with objects and attributes [Kovashka et al., ICCV 2011]

[Loop: the current model chooses between questions such as “What is this object?” and “Does this object have spots?”; the annotator answers.]

Weigh the relative impact of an object label or an attribute label at each iteration.

Budgeted batch active learning

[Vijayanarasimhan et al., CVPR 2010]

[Loop: unlabeled examples carry varying annotation costs ($); a batch is selected and sent to the annotator, and the labeled pool grows.]

Select a batch of examples that together improve the classifier objective and meet the annotation budget.
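A much-simplified greedy stand-in for budgeted batch selection; the CVPR 2010 paper poses this as a joint optimization, whereas here `info` and `cost` are hypothetical callbacks. Keep adding the example with the best informativeness per unit cost until the budget is exhausted.

def select_batch_under_budget(pool, info, cost, budget):
    """Greedy knapsack-style sketch: grow the batch by the example with the best
    informativeness per unit cost, keeping total annotation cost within the budget."""
    remaining = set(pool)
    batch, spent = [], 0.0
    while remaining:
        affordable = [x for x in remaining if spent + cost(x) <= budget]
        if not affordable:
            break
        pick = max(affordable, key=lambda x: info(x) / max(cost(x), 1e-9))
        batch.append(pick)
        spent += cost(pick)
        remaining.remove(pick)
    return batch, spent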

Problem: “Sandbox” active learning

Thus far, tested only in artificial settings:

• Unlabeled data already fixed, small scale, biased (~10³ prepared images)
• Computational cost ignored

[Plot: accuracy vs. actual time, comparing active and passive selection.]

Our idea: Live active learning

Large-scale active learning of object detectors with crawled data and crowdsourced labels.

How to scale active learning to massive unlabeled pools of data?

Pool-based active learning

E.g., select the point nearest to the hyperplane decision boundary for labeling.

[Tong & Koller 2000; Schohn & Cohn 2000; Campbell et al. 2000]

Sub-linear time active selection

We propose a novel hashing approach to identify the most uncertain examples in sub-linear time.

[Diagram: the current classifier’s hyperplane is hashed; unlabeled data indexed in a hash table of binary codes yields the actively selected examples.]

[Jain, Vijayanarasimhan, Grauman, NIPS 2010]

Hashing a hyperplane query

h(w) {x1,xk } x(t) 2 x(t) 1 x(t1) 1 (t1) (t) (t1) x w x 3 (t1) 2 w

At each iteration of the learning loop, our hash functions map the current hyperplane directly to its nearest unlabeled points. Kristen Grauman, UT Austin Hashing a hyperplane query

h(w) {x1,xk } x(t) 2 x(t) Guarantee1 high probability of collision for x(t1) points near decision boundary:1 (t1) (t) (t1) x w x 3 (t1) 2 w

At each iteration of the learning loop, our hash functions map the current hyperplane directly to its nearest unlabeled points. Kristen Grauman, UT Austin Sub-linear time active selection
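For intuition, here is a toy two-bit hyperplane hash in Python: each bit pair is the sign of two random projections, and the query flips its second bit, so a database point collides with the hyperplane query most often when it lies near the decision boundary. This is a sketch in the spirit of the NIPS 2010 hash family; the parameter choices and the exhaustive fallback in the last comment are illustrative assumptions.

import numpy as np
from collections import defaultdict

class HyperplaneHash:
    """Toy hyperplane hash. A database point x gets bits [sign(u·x), sign(v·x)] for
    random directions (u, v); a hyperplane query with normal w uses [sign(u·w), -sign(v·w)].
    Matching keys then require sign(v·x) != sign(v·w), which happens most often for
    points nearly perpendicular to w, i.e., near the decision boundary w·x = 0."""

    def __init__(self, dim, n_bits=8, n_tables=4, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.standard_normal((n_tables, n_bits, dim))
        self.V = rng.standard_normal((n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _key(self, t, x, flip_v=False):
        bits_u = (self.U[t] @ x) > 0
        bits_v = (self.V[t] @ x) > 0
        if flip_v:                      # applied only to the hyperplane query
            bits_v = ~bits_v
        return bits_u.tobytes(), bits_v.tobytes()

    def index(self, X):
        """Insert all unlabeled points (rows of X) into every hash table."""
        for i, x in enumerate(X):
            for t in range(len(self.tables)):
                self.tables[t][self._key(t, x)].append(i)

    def query_hyperplane(self, w):
        """Return candidate indices of points likely to lie near the hyperplane w·x = 0."""
        candidates = set()
        for t in range(len(self.tables)):
            candidates.update(self.tables[t].get(self._key(t, w, flip_v=True), []))
        return candidates

# Exhaustive O(n) alternative for comparison: np.argmin(np.abs(X @ w))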

[Results on 1M Tiny Images: (left) accuracy improvements as more data is labeled; (right) improvement in AUROC vs. selection + labeling time (hrs), and time spent searching for the selection, comparing H-Hash Active, Exhaustive Active, and Passive.]

By minimizing both selection and labeling time, H-Hash obtains the best accuracy per unit time.

PASCAL Visual Object Categorization

• Closely studied object detection benchmark
• Original image data from Flickr

http://pascallin.ecs.soton.ac.uk/challenges/VOC/

Live active learning

[Pipeline: keyword-crawled unlabeled images are turned into candidate “jumping” windows; the windows are indexed in a hash table via h(Oi); the current “bicycle” hyperplane hw retrieves actively selected examples; crowd annotations are merged by consensus (mean shift) into annotated data.]

For 4.5 million unlabeled instances: 10 minutes of machine time per iteration, vs. 60 hours for a linear scan.

[Vijayanarasimhan & Grauman, CVPR 2011]

Live active learning results

[Results on PASCAL VOC objects, Flickr test set: outperforms the status quo data collection approach.]

Live active learning results

What does the live learning system ask first?

Live active learning (ours)

Keyword+image baseline

First selections made when learning “boat”


Ongoing challenges in active visual learning

• Exploration vs. exploitation
• Utility tied to a specific classifier or model
• Joint batch selection (“non-myopic”) is expensive, remains challenging
• Crowdsourcing: reliability, expertise, economics
• Active annotations for objects/activity in video

Our goal

Teaching computers about visual categories must be an ongoing, interactive process, with communication that goes beyond labels.

This talk:
1. Active visual learning
2. Learning from visual comparisons

Visual attributes

• High-level semantic properties shared by objects
• Human-understandable and machine-detectable

Examples: high heel, outdoors, flat, metallic, brown, has-ornaments, red, four-legged, indoors

[Oliva et al. 2001, Ferrari & Zisserman 2007, Kumar et al. 2008, Farhadi et al. 2009, Lampert et al. 2009, Endres et al. 2010, Wang & Mori 2010, Berg et al. 2010, Branson et al. 2010, Parikh & Grauman 2011, …]

[Images: donkey, horse, mule]

Attributes

A mule… Is furry

Has four legs

Has a tail

Binary attributes

A mule… Is furry

Has four legs

Has a tail

[Ferrari & Zisserman 2007, Kumar et al. 2008, Farhadi et al. 2009, Lampert et al. 2009, Endres et al. 2010, Wang & Mori 2010, Berg et al. 2010, Branson et al. 2010, …]
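A minimal sketch of the binary-attribute view just described; the data and the scikit-learn linear SVMs are assumptions standing in for whatever attribute classifiers are used. Train one independent classifier per attribute, then describe a new image by the attributes it is predicted to have.

from sklearn.svm import LinearSVC

def train_binary_attributes(features, attribute_labels):
    """features: array of image descriptors; attribute_labels: dict mapping an
    attribute name (e.g., 'furry', 'has_four_legs') to 0/1 labels per image.
    Returns one independent binary classifier per attribute."""
    return {name: LinearSVC().fit(features, labels)
            for name, labels in attribute_labels.items()}

def describe(image_feature, models):
    """List the attributes predicted to be present in a new image."""
    return [name for name, clf in models.items()
            if clf.predict([image_feature])[0] == 1]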

Relative attributes

A mule… is furry, has four legs, has a tail; its legs are shorter than horses’ and its tail is longer than donkeys’.

Relative attributes

Idea: represent visual comparisons between classes, images, and their properties.

[Diagram: two concepts related through their properties, e.g., one is “brighter than” the other.]

[Parikh & Grauman, ICCV 2011]

How to teach relative visual concepts?

“How much is the person smiling?”  1  2  3  4  [asked repeatedly for different example faces]

Less ↔ More

Learning relative attributes

For each attribute m, use ordered image pairs to train a ranking function on image features x: r_m(x) = w_m^T x.

[Parikh & Grauman, ICCV 2011; Joachims 2002]
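A hedged sketch of learning one such ranking function with the pairwise trick from the cited Joachims 2002 work: train a linear classifier on feature differences of the ordered pairs, then use w·x as the attribute strength. The data and the scikit-learn solver are assumptions; the paper's rank-SVM training differs in its details.

import numpy as np
from sklearn.svm import LinearSVC

def train_relative_attribute(features, ordered_pairs):
    """features: array of image descriptors; ordered_pairs: list of (i, j) meaning
    image i shows MORE of the attribute than image j. Train a linear model on
    difference vectors so that w·(x_i - x_j) > 0, and return r(x) = w·x."""
    diffs, signs = [], []
    for i, j in ordered_pairs:
        diffs.append(features[i] - features[j]); signs.append(+1)
        diffs.append(features[j] - features[i]); signs.append(-1)   # symmetric copy
    clf = LinearSVC(fit_intercept=False).fit(np.array(diffs), np.array(signs))
    w = clf.coef_.ravel()
    return lambda x: float(w @ x)

# Usage: r_smiling = train_relative_attribute(X, pairs); r_smiling(X[k]) gives the strength.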

Relating images

Rather than simply label images with their properties (“not bright”, “smiling”, “not natural”), we can now compare images by each attribute’s “strength” along the bright, smiling, and natural dimensions.

Learning with visual comparisons

Enable new modes of human-system communication

• Training category models through descriptions

• Rationales to explain image labels

• Semantic relative feedback for image search

• Analogical constraints on feature learning

Relative zero-shot learning

Training: images from S seen categories, and descriptions of U unseen categories.

[Example descriptions: orderings of celebrities (Hugh, Clive, Scarlett, Jared, Miley) along the Age and Smiling attributes.]

Need not use all attributes, nor all seen categories.

Testing: categorize an image into one of the S + U classes.

Predict new classes based on their relationships to existing classes – even without training images.

[Diagram: classes placed in the 2D space of Smiling vs. Age ranker outputs; an unseen class is located relative to the seen classes.]
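A much-simplified sketch of the zero-shot step; the midpoint placement rule and the nearest-mean classifier are illustrative assumptions, not the paper's exact model. Place each unseen class between the seen classes named in its description, measured in attribute-ranker outputs, then assign test images to the nearest class.

import numpy as np

def seen_class_means(rankers, images_by_class):
    """Mean attribute strengths (one ranker output per attribute) for each seen class."""
    return {c: np.array([[r(x) for r in rankers] for x in imgs]).mean(axis=0)
            for c, imgs in images_by_class.items()}

def place_unseen(seen_means, descriptions):
    """descriptions: unseen class -> list of (attribute_index, lower_class, upper_class),
    read as 'between lower_class and upper_class on that attribute'. Attributes not
    mentioned default to the average over the seen classes (an assumption)."""
    default = np.mean(list(seen_means.values()), axis=0)
    placed = {}
    for u, relations in descriptions.items():
        mu = default.copy()
        for a, lo, hi in relations:
            mu[a] = 0.5 * (seen_means[lo][a] + seen_means[hi][a])   # midpoint assumption
        placed[u] = mu
    return placed

def classify(x, rankers, class_means):
    """Nearest class mean (seen or unseen) in attribute-strength space."""
    strengths = np.array([r(x) for r in rankers])
    return min(class_means, key=lambda c: np.linalg.norm(strengths - class_means[c]))

# Usage: means = seen_class_means(rankers, seen); class_means = {**means, **place_unseen(means, descs)}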

[Results: classification accuracy (%) on Outdoor Scenes and Public Figures, comparing binary attributes to relative attributes (ranker).]

Comparative descriptions are more discriminative than categorical descriptions.

Learning with visual comparisons

Enable new modes of human-system communication

• Training category models through descriptions

• Rationales to explain image labels

• Semantic relative feedback for image search

• Analogical constraints on feature learning

Soliciting visual rationales

Is the team winning? How can you tell?  Is it a safe route? How can you tell?  Is her form good? How can you tell?

Main idea: ask the annotator not just what, but also why.

[Donahue & Grauman, ICCV 2011; Zaidan et al., NAACL HLT 2007]

Hot or Not? How can you tell?

Two forms of rationale: spatial rationales and attribute rationales.
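In the spirit of the cited Zaidan et al. rationale constraints, one hedged way to use a spatial or attribute rationale: build a "contrast" copy of the example with the rationale features suppressed and require the original to score higher toward its label. The masking scheme and difference-vector encoding below are simplifications, not the exact ICCV 2011 formulation.

import numpy as np
from sklearn.svm import LinearSVC

def add_rationale_constraints(X, y, rationale_masks, weight=1.0):
    """For each labeled example with a rationale, create a 'contrast' copy with the
    rationale dimensions zeroed out. The scaled difference (original - contrast),
    given the example's own label, encodes 'these features are why' as an extra
    training point (a simplification of Zaidan et al.-style contrast constraints)."""
    X_aug, y_aug = list(X), list(y)
    for x, label, mask in zip(X, y, rationale_masks):
        if mask is None:              # no rationale provided for this example
            continue
        contrast = x.copy()
        contrast[mask] = 0.0          # suppress the features the annotator pointed to
        X_aug.append(weight * (x - contrast))
        y_aug.append(label)
    return np.array(X_aug), np.array(y_aug)

# Usage: clf = LinearSVC().fit(*add_rationale_constraints(X, y, masks))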

[Results: accuracy on three tasks (Scene categories, Hot or Not, Attractiveness), comparing original labels only vs. labels plus rationales.]

[Donahue & Grauman, ICCV 2011]

Learning with visual comparisons

Enable new modes of human-system communication

• Training category models through descriptions

• Rationales to explain image labels

• Semantic relative feedback for image search

• Analogical constraints on feature learning

Interactive visual search

[Example: query “white high heels”; the user can mark results only as relevant or irrelevant.]

Traditional binary relevance feedback offers only coarse communication between user and system.

[Rui et al. 1998, Zhou et al. 2003, …]

WhittleSearch: Relative attribute feedback [Kovashka, Parikh, and Grauman, CVPR 2012]

Query: “white high-heeled shoes”

Initial top search results …

Feedback: “less formal than these”, “shinier than these”

Refined top search results …

Whittle away irrelevant images via precise semantic feedback
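A minimal sketch of the whittling step; the attribute rankers and the candidate set are assumed given, and the full CVPR 2012 system also ranks rather than just filters. Each feedback statement "less/more ATTRIBUTE than this reference" becomes an inequality on ranker outputs, and only images satisfying every accumulated statement are kept.

def whittle(candidates, rankers, feedback):
    """candidates: {image_id: feature_vector}; rankers: {attribute_name: strength fn};
    feedback: list of (attribute_name, 'less' or 'more', reference_feature_vector).
    Keep only the images consistent with every relative-attribute statement."""
    kept = {}
    for img_id, x in candidates.items():
        consistent = True
        for attr, direction, ref in feedback:
            s_img, s_ref = rankers[attr](x), rankers[attr](ref)
            if (direction == "more" and not s_img > s_ref) or \
               (direction == "less" and not s_img < s_ref):
                consistent = False
                break
        if consistent:
            kept[img_id] = x
    return kept

# e.g., whittle(results, rankers, [("formal", "less", ref_a), ("shiny", "more", ref_b)])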

Visual analogies

Beyond pairwise comparisons …

[Diagram: pairs of concepts related through their properties, extended from pairwise comparisons to analogies.]

[Hwang, Grauman, & Sha, ICML 2013]

Learning with visual analogies

Regularize object models with analogies

planet : sun  =  electron : ?  (nucleus)

[Hwang, Grauman, & Sha, ICML 2013]

For an analogy p : q = r : s among training images, the semantic embedding φ learned from the input space is regularized so that the analogy holds in the embedding: φ(p) − φ(q) ≈ φ(r) − φ(s).
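As a hedged illustration of the analogy regularizer (the embedding function is assumed given; the actual ICML 2013 objective couples this term with the rest of the training loss): penalize how far each analogy quadruple is from forming a parallelogram in the embedding.

import numpy as np

def analogy_penalty(embed, quadruples):
    """embed: maps an input feature vector to its semantic embedding phi(x).
    quadruples: list of (p, q, r, s) features with the intended relation p : q = r : s.
    Penalize || (phi(p) - phi(q)) - (phi(r) - phi(s)) ||^2 for each analogy."""
    total = 0.0
    for p, q, r, s in quadruples:
        diff = (embed(p) - embed(q)) - (embed(r) - embed(s))
        total += float(np.dot(diff, diff))
    return total

# This term would be added, with a weight, to the embedding's main training objective.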

[Hwang, Grauman, & Sha, ICML 2013]

Visual analogies

GRE-like visual analogy tests

[Results: analogy completion accuracy (%), comparing the analogy-preserving embedding (ours), a semantic embedding [Weinberger, 2009], and chance.]

[Hwang, Grauman, & Sha, ICML 2013]

Teaching visual recognition systems

[Diagram, repeated from earlier: today, a narrow human-system channel built on vision and learning; over the next 10 years, a broader channel drawing on vision, learning, knowledge representation, multi-agent systems, human computation, robotics, and language.]

Important next directions

• Recognition in action: embodied, egocentric

• Activity understanding: objects & actions

• Scale: many classes, fine-grained recognition


Acknowledgements

Sudheendra Vijayanarasimhan, Yong Jae Lee, Jaechul Kim, Adriana Kovashka, Sung Ju Hwang, Chao-Yeh Chen, Suyog Jain, Dinesh Jayaraman, Aron Yu, Jeff Donahue

Devi Parikh (Virginia Tech), Fei Sha (USC), Prateek Jain (MSR), (UC Berkeley), J. K. Aggarwal, Ray Mooney, Peter Stone, Bruce Porter, and all my UT colleagues

NSF, ONR, DARPA, Luce Foundation, Google, Microsoft

Summary

• Humans are not simply “label machines”
• More data need not mean better learning
• Widen access to visual knowledge through:
  – Large-scale interactive/active learning systems
  – Representing relative visual comparisons
• Visual recognition offers new AI challenges, and progress demands that more AI ideas convene

References

• WhittleSearch: Image Search with Relative Attribute Feedback. A. Kovashka, D. Parikh, and K. Grauman. CVPR 2012.
• Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. P. Jain, S. Vijayanarasimhan, and K. Grauman. NIPS 2010.
• Annotator Rationales for Visual Recognition. J. Donahue and K. Grauman. ICCV 2011.
• Actively Selecting Annotations Among Objects and Attributes. A. Kovashka, S. Vijayanarasimhan, and K. Grauman. ICCV 2011.
• Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds. S. Vijayanarasimhan and K. Grauman. CVPR 2011.
• Cost-Sensitive Active Visual Category Learning. S. Vijayanarasimhan and K. Grauman. International Journal of Computer Vision (IJCV), Vol. 91, Issue 1 (2011), p. 24.
• What’s It Going to Cost You?: Predicting Effort vs. Informativeness for Multi-Label Image Annotations. S. Vijayanarasimhan and K. Grauman. CVPR 2009.
• Multi-Level Active Prediction of Useful Image Annotations for Recognition. S. Vijayanarasimhan and K. Grauman. NIPS 2008.
• Far-Sighted Active Learning on a Budget for Image and Video Recognition. S. Vijayanarasimhan, P. Jain, and K. Grauman. CVPR 2010.
• Relative Attributes. D. Parikh and K. Grauman. ICCV 2011.
• Analogy-Preserving Semantic Embedding for Visual Object Categorization. S. J. Hwang, K. Grauman, and F. Sha. ICML 2013.