Detection and Evaluation Methods for Local Image and Video Features
Julian Stöttinger
Technical Report CVL-TR-4

Detection and Evaluation Methods for Local Image and Video Features

Julian Stöttinger
Computer Vision Lab
Institute of Computer Aided Automation
Vienna University of Technology

March 2, 2011

Abstract

In computer vision, local image descriptors computed in areas around salient interest points are the state of the art in visual matching. This technical report aims at finding more stable and more informative interest points in the domain of images and videos. The research interest is the development of relevant evaluation methods for visual matching approaches. The contribution of this work lies, on the one hand, in the introduction of new features to the computer vision community; on the other hand, there is a strong demand for valid evaluation methods and approaches that yield new insights for general recognition tasks.

This work presents research in the detection of local features both in the spatial (“2D”, image) domain and in the spatio-temporal (“3D”, video) domain. For state-of-the-art classification, the extraction of discriminative interest points has an impact on the final classification performance. It is crucial to find which interest points are of use in a specific task. One question is, for example, whether it is possible to reduce the number of interest points extracted while still obtaining state-of-the-art image retrieval or object recognition results. This would yield a significant reduction in processing time and would possibly allow for new applications, e.g. in the domain of mobile computing. Therefore, this work investigates different corner detection approaches and evaluates their repeatability under varying alterations. The proposed sparse color interest point detector gives a stable number of features and thus a better comparable image representation.
By taking the saliency of color information and color invariances into account, improved retrieval of color images is obtained, more stable to lighting and shadowing effects than approaches using illumination-correlated color information. In an international benchmark, the approach outperforms all other participants in 4 out of 20 classes while using a fraction of the features required by other approaches.

The Gradient Vector Flow (GVF) has previously been used with one manually adjusted set of parameters to locate centers of local symmetry at a certain scale. This work extends this approach and proposes a GVF-based scale space pyramid and a scale decision criterion to provide general purpose interest points. This multi-scale, orientation invariant interest point detector aims to provide stable and densely distributed locations. Due to the iterative gradient smoothing during the computation of the GVF, it takes more surrounding image information into account than other detectors.

In the last decade, great interest in the evaluation of local visual features in the image domain has been observed. Most state-of-the-art features have been extended to the temporal domain to allow for video retrieval and categorization using techniques similar to those used for images. However, there is no comprehensive evaluation of these. This thesis provides the first comparative evaluation based on isolated and well defined alterations of video data. The aim is to provide researchers with guidance when selecting the best approaches for new applications and data-sets. A dedicated, publicly available data-set of 1710 videos is set up, with which researchers are able to test their features' robustness against well defined challenges. For the evaluation of the detectors, a repeatability measure treating the videos as 3D volumes is developed.
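The GVF computation mentioned above rests on iteratively diffusing the image gradient field into homogeneous regions. As a rough illustration of that iteration (a minimal explicit-Euler sketch of the standard GVF equations of Xu and Prince, not the report's multi-scale implementation; the function name is hypothetical):

```python
import numpy as np
from scipy.ndimage import laplace

def gradient_vector_flow(edge_map, mu=0.2, n_iter=80):
    """Diffuse the gradient field of an edge map into homogeneous regions.

    Explicit-Euler iteration of the GVF PDE:
        u_t = mu * lap(u) - (fx^2 + fy^2) * (u - fx)   (analogously for v)
    mu <= 0.25 keeps the explicit diffusion step numerically stable.
    """
    fy, fx = np.gradient(edge_map.astype(float))  # np.gradient returns axis-0 first
    mag2 = fx ** 2 + fy ** 2                      # gradient magnitude squared
    u, v = fx.copy(), fy.copy()                   # initialize with the raw gradient
    for _ in range(n_iter):
        # diffusion term spreads the field; the data term keeps it
        # anchored to the original gradient where edges are strong
        u += mu * laplace(u) - mag2 * (u - fx)
        v += mu * laplace(v) - mag2 * (v - fy)
    return u, v

# Toy example: a bright square. After the iteration, the vector field
# extends well beyond the narrow support of the original edge gradient.
img = np.zeros((60, 60))
img[20:40, 20:40] = 1.0
u, v = gradient_vector_flow(img)
```

Because the diffusion carries gradient information many pixels away from the edges themselves, a detector built on the GVF "sees" more surrounding image structure than one built on the raw gradient, which is the property the report exploits.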
To evaluate the robustness of spatio-temporal descriptors, a principled classification pipeline is introduced in which the increasingly altered videos build a set of queries. This allows for an in-depth analysis of local detectors and descriptors and their combinations.

Contents

1 Introduction 1
  1.1 Image Features 4
  1.2 Video Features 7
  1.3 Contributions 9
    1.3.1 Summary of the Contributions 9
    1.3.2 Contributions in Detail 10
  1.4 Applications 17
    1.4.1 Image Features and Matching 17
    1.4.2 Video Features and Matching 22
  1.5 Structure of the Thesis 25
2 Principles 27
  2.1 Color Spaces and Perception 27
  2.2 Properties of the Gaussian Kernel 30
  2.3 The Structure Tensor 32
  2.4 Local Description 36
    2.4.1 Local Description in Images 36
    2.4.2 Local Description in Videos 41
  2.5 Invariance of Features 43
  2.6 Summary 46
3 Interest Point Detectors for Images 48
  3.1 Luminance Based Detectors 48
    3.1.1 Corner Detection 48
    3.1.2 Blob Detection 52
    3.1.3 Symmetry Based Interest Points 56
    3.1.4 Affine Invariant Estimation of Scale 59
  3.2 Color Based Detectors 60
    3.2.1 Corner Detection 62
    3.2.2 Scale Invariant Color Points 63
    3.2.3 Blob Detectors 67
  3.3 Summary 69
4 Evaluation of Interest Points for Images 70
  4.1 Data-sets 72
    4.1.1 Robustness Data-set 72
    4.1.2 The Amsterdam Library of Object Images 78
    4.1.3 PASCAL VOC 2007 Data-set 80
  4.2 Robustness 81
    4.2.1 GVFpoints 82
    4.2.2 Color Points 89
  4.3 Image Matching 91
  4.4 Object Categorization 96
  4.5 Feature Benchmark 99
  4.6 Summary 104
5 Interest Point Detectors for Video 112
  5.1 Luminance Based Spatio-Temporal Features 113
    5.1.1 Corner Detection 113
    5.1.2 Blob Detection 115
    5.1.3 Gabor Filtering 118
  5.2 Color Based Spatio-Temporal Features 119
    5.2.1 Corner Detection 119
    5.2.2 Blob Detection 121
  5.3 Summary 122
6 Evaluation of Interest Point Detectors for Videos 123
  6.1 Data-Sets 123
    6.1.1 Popular Video Data-sets 124
    6.1.2 FeEval 127
  6.2 Feature Behavior 131
  6.3 Robustness 137
  6.4 Video Matching 141
  6.5 Summary 146
7 Conclusion 147
Bibliography 150

Chapter 1

Introduction

Interest points are the first stage of robust visual matching applications and build the state of the art of visual feature localization. Interest point detectors allow for the reduction of computational complexity in scene matching and object recognition applications by selecting only a subset of image locations corresponding to specific and/or informative structures [110]. The extraction of stable locations is a successful way to match visual input in images or videos of the same scene acquired under different conditions [107]. As evaluated in [105], successful approaches extracting stable locations rely on corner detection [50, 103], blobs like Maximally Stable Extremal Regions (MSER) [100], Difference of Gaussians (DoG) [87], or detecting local symmetry [90].

The majority of interest point extraction algorithms are purely intensity based [50, 61, 105]. This ignores saliency information contained in the color channels. It is known that the distinctiveness of color based interest points is larger, and therefore color is of importance when matching images [136]. Furthermore, color plays an important role in the pre-attentive stage in which features are detected [52, 135], as it is one of the elementary stimulus features [164].

Fergus et al. [34] point out that a categorization framework is heavily dependent on the detector to gather useful features. In general, the current trend is toward increasing the number of points [183], applying several detectors or combining them [102, 138], or making the interest point distribution as dense as possible [159]. While such dense sampling has been shown to be effective in object recognition, these approaches basically shift the task of discarding the non-discriminative points to the classifier [17].
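The corner detectors cited above are typically built on the local structure tensor (the second-moment matrix of the image gradients, covered in Section 2.3). As a rough illustration of this detector family (a minimal single-scale Harris-style sketch, not any of the cited implementations; function name and parameter defaults are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(img, sigma=2.0, k=0.04):
    """Corner response from the structure tensor, Harris & Stephens style.

    R = det(M) - k * trace(M)^2, where M is the Gaussian-weighted
    second-moment matrix of the image gradients. R is large and positive
    at corners, negative on edges, and near zero in flat regions.
    """
    img = img.astype(float)
    ix = sobel(img, axis=1)                 # horizontal gradient
    iy = sobel(img, axis=0)                 # vertical gradient
    sxx = gaussian_filter(ix * ix, sigma)   # structure tensor entries,
    syy = gaussian_filter(iy * iy, sigma)   # averaged over a Gaussian window
    sxy = gaussian_filter(ix * iy, sigma)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

# Toy example: a bright square. The response peaks near the square's
# corners and is negative along its straight edges.
img = np.zeros((80, 80))
img[20:60, 20:60] = 1.0
R = harris_response(img)
```

Thresholding R and taking local maxima would then yield the sparse set of locations on which subsequent description and matching stages operate.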
Further, dense sampling implies that a huge amount of data must be extracted from each image and processed. This is feasible when executed on a cluster of computers in a research environment. Nevertheless, there are environments in which the luxury of extensive computing power is not available, as illustrated by the strong trend towards mobile computing on netbooks, mobile phones and PDAs. With growing datasets, clustering and off-line training of features become infeasible [133].

Figure 1.1: The main stages of image matching.

Therefore, there is a strong interest in exploiting state-of-the-art classification while focusing on the extraction of discriminative interest points. Following the main idea of interest points, namely to reduce the data to be processed, one important question is whether it is possible to reduce the number of interest points extracted while still obtaining state-of-the-art image retrieval or object recognition results. Recent work aims at finding discriminative features, e.g. by performing an evaluation of all features within the dataset or per image class and choosing the most frequent ones [157]. All these methods require an additional calculation step with an inherent demand for memory and processing time dependent on the number of features.

In this thesis, color salient and invariant information is used for image point selection with the aim of achieving state-of-the-art performance while using significantly fewer interest points. The difference to prior work is that feature selection takes place at the very first step of feature extraction and is carried out independently per feature. In contrast to feature selection that is handled by the classification step only, this method of regarding color saliency or color invariance provides a diminished amount of data for subsequent operations.

Techniques such as the bag-of-words approach were originally inspired by text retrieval.
These have been extended to “2D” techniques on images and build the state of the art in image matching. These approaches are now successfully carried out in both the spatial and the temporal domains for action recognition, video understanding and video matching (e.g.