Visual Object Category Recognition

Robotics Research Group
Department of Engineering Science
University of Oxford

Supervisors:
Professor Andrew Zisserman
Professor Pietro Perona

Robert Fergus
New College
December 2, 2005

Abstract

We investigate two generative probabilistic models for category-level object recognition. Both schemes are designed to learn categories with a minimum of supervision, requiring only a set of images known to contain the target category from a similar viewpoint. In both methods, learning is translation- and scale-invariant, does not require alignment or correspondence between the training images, and is robust to clutter and occlusion. The schemes are also robust to heavy contamination of the training set with unrelated images, enabling them to learn directly from the output of Internet image search engines.

In the first approach, category models are probabilistic constellations of parts, and their parameters are estimated by maximizing the likelihood of the training data. The appearance of the parts, as well as their mutual position, relative scale and probability of detection, are explicitly represented. Recognition takes place in two stages. First, a feature-finder identifies promising locations for the model's parts. Second, the category model is used to compare the likelihood that the observed features are generated by the category model with the likelihood that they are generated by background clutter.

The second approach is a visual adaptation of "bag of words" models used to extract topics from a text corpus. We extend the approach to incorporate spatial information in a scale- and translation-invariant manner. The model represents each image as a joint histogram of visual word occurrences and their locations, relative to a latent reference frame of the object(s). The model is entirely discrete, making no assumptions of uni-modality or the like. The parameters of the multi-component model are estimated in a maximum likelihood fashion over the training data. In recognition, the relative weighting of the different model components is computed along with the model reference frame with the highest likelihood, enabling the localization of object instances.

The flexible nature of both models is shown by experiments on 28 datasets containing 12 diverse object categories, including geometrically constrained categories (e.g. faces, cars) and flexible objects (such as animals). The different datasets give a thorough evaluation of both methods in classification, categorization, localization and learning from contaminated data.

This thesis is submitted to the Department of Engineering Science, University of Oxford, in fulfilment of the requirements for the degree of Doctor of Philosophy. This thesis is entirely my own work and, except where otherwise stated, describes my own research.

Robert Fergus, New College

Copyright © 2005 Robert Fergus
All Rights Reserved

To my parents

Acknowledgements

I would like to thank my two advisers, Professor Andrew Zisserman at Oxford and Professor Pietro Perona at Caltech, for their guidance, patience and advice. This thesis has been a thrilling and rewarding experience thanks to them. I also thank Fei-Fei Li for being my main collaborator over the last 5 years, both in classes and research. Many other people have offered advice and guidance with my work, for which I am very grateful (in alphabetical order): Andrew Blake, Mark Everingham, David Forsyth, Alex Holub, Dan Huttenlocher, Michael Isard, Jitendra Malik, Silvio Savarese, Josef Sivic, Frederik Schaffalitzky. I would also like to thank all the people in the Vision Labs at both Caltech and Oxford for making them such interesting places to be. I thank the various sources of funding I have had over the years: the Caltech CNSE, the UK EPSRC, EC project CogViSys and the PASCAL project. Agnes deserves special thanks for being so supportive of my efforts and so understanding of the endless deadlines.
A final thanks must go to Pietro, Markus Weber and Max Welling, who supervised me as a summer student while I was an undergraduate, sparking my interest in computer vision and object recognition.

Contents

1 Introduction
  1.1 Objective
  1.2 Motivation
  1.3 Challenges
  1.4 Definition of vocabulary
  1.5 Contribution
    1.5.1 The Constellation Model
    1.5.2 Translation and Scale-Invariant pLSA (TSI-pLSA)
  1.6 Outline of the thesis

2 Literature review
  2.1 Specific instance recognition
    2.1.1 Geometric methods
    2.1.2 Global appearance methods
    2.1.3 Textured region methods
  2.2 Category level recognition
    2.2.1 Digits
    2.2.2 Faces, Cars and Humans
    2.2.3 Recent work
    2.2.4 Summary of literature
  2.3 Features and Representation Schemes
    2.3.1 Kadir & Brady
    2.3.2 Curves
    2.3.3 Difference of Gaussians
    2.3.4 Multiscale Harris
    2.3.5 Sampled Edge operator
    2.3.6 Comparison of feature detectors
    2.3.7 SIFT descriptor

3 Datasets
  3.1 Caltech datasets
  3.2 UIUC dataset
  3.3 Fawlty Towers
    3.3.1 Training data for Fawlty Towers
  3.4 PASCAL challenge
  3.5 Image search engine data
  3.6 Summary of datasets

4 The Constellation model
  4.1 Introduction
  4.2 Model inputs
  4.3 Overview of model
  4.4 Appearance
    4.4.1 Appearance representation
    4.4.2 Curve representation
  4.5 Shape
    4.5.1 Full model
    4.5.2 Star model
  4.6 Relative scale
  4.7 Occlusion and Statistics of the feature finder
  4.8 Multiple aspects via a mixture of constellation models
  4.9 Model discussion
    4.9.1 Appearance term
    4.9.2 Alternative forms of shape model
    4.9.3 Improvements over Weber et al.
    4.9.4 Model assumptions
  4.10 Model structure summary

5 Learning and Recognition with the Constellation model
  5.1 Learning
    5.1.1 Initialization
    5.1.2 EM update equations
    5.1.3 Computational considerations
    5.1.4 Efficient search methods for the full model
    5.1.5 Convergence
    5.1.6 Background model
    5.1.7 Final model
  5.2 Recognition
  5.3 Considerations for the star model
    5.3.1 Efficient methods for the star model

6 Weakly supervised experiments with the constellation model
  6.1 Full model experiments
    6.1.1 Baseline experiments
  6.2 Analysis of performance
    6.2.1 Changing scale of features
    6.2.2 Feature Representation
    6.2.3 Number of parts in model
    6.2.4 Contribution of the different model terms
    6.2.5 Over-fitting
    6.2.6 Contamination of the training set
    6.2.7 Sampling from the model
  6.3 Comparison with other methods
  6.4 Star model experiments
    6.4.1 Comparison to full model
    6.4.2 Heterogeneous part experiments
    6.4.3 Number of parts and detections

7 Translation and Scale Invariant Probabilistic Latent Semantic Analysis
  7.1 Probabilistic Latent Semantic Analysis (pLSA)
    7.1.1 Latent Dirichlet Allocation (LDA)
  7.2 Applying pLSA to visual data
    7.2.1 Visual words
    7.2.2 An example
  7.3 Adding location into the pLSA model
    7.3.1 Absolute Position pLSA
    7.3.2 Scale and Translation Invariant pLSA
  7.4 Region detectors
  7.5 Implementational details
  7.6 Caltech experiments
  7.7 Model investigations
