Context-Aware Features and Robust Image Representations

P. Martins (a,*), P. Carvalho (a), C. Gatta (b)

(a) Center for Informatics and Systems, University of Coimbra, Coimbra, Portugal
(b) Computer Vision Center, Autonomous University of Barcelona, Barcelona, Spain
(*) Corresponding author.

Abstract

Local image features are often used to efficiently represent image content. The limited number of types of features that a local feature extractor responds to might be insufficient to provide a robust image representation. To overcome this limitation, we propose a context-aware feature extraction formulated under an information theoretic framework. The algorithm does not respond to a specific type of features; the idea is to retrieve complementary features which are relevant within the image context. We empirically validate the method by investigating the repeatability, the completeness, and the complementarity of context-aware features on standard benchmarks. In a comparison with strictly local features, we show that our context-aware features produce more robust image representations. Furthermore, we study the complementarity between strictly local features and context-aware ones to produce an even more robust representation.

Keywords: Local features, Keypoint extraction, Image content descriptors, Image representation, Visual saliency, Information theory.

1. Introduction

Local feature detection (or extraction, if we want to use a more semantically correct term [1]) is a central and extremely active research topic in the fields of computer vision and image analysis. Reliable solutions to prominent problems such as wide-baseline stereo matching, content-based image retrieval, object (class) recognition, and symmetry detection often make use of local image features (e.g., [2, 3, 4, 5, 6, 7]).

While it is widely accepted that a good local feature extractor should retrieve distinctive, accurate, and repeatable features against a wide variety of photometric and geometric transformations, it is equally valid to claim that these requirements are not always the most important. In fact, not all tasks require the same properties from a local feature extractor. We can distinguish three broad categories of applications according to the required properties [1]. The first category includes applications in which the semantic meaning of a particular type of features is exploited. For instance, edge or even ridge detection can be used to identify blood vessels in medical images and watercourses or roads in aerial images. Another example in this category is the use of blob extraction to identify blob-like organisms in microscopy images. A second category includes tasks such as matching, tracking, and registration, which mainly require distinctive, repeatable, and accurate features. Finally, a third category comprises applications such as object (class) recognition, image retrieval, scene classification, and image compression. For this category, it is crucial that features preserve the most informative image content (robust image representation), while repeatability and accuracy are requirements of lesser importance.

We propose a local feature extractor aimed at providing a robust image representation. Our algorithm, named Context-Aware Keypoint Extractor (CAKE), represents a new paradigm in local feature extraction: no a priori assumption is made on the type of structure to be extracted. It retrieves locations (keypoints) which are representative of salient regions within the image context. Two major advantages can be foreseen in the use of such features: the most informative image content at a global level will be preserved by context-aware features, and an even more complete coverage of the content can be achieved through the combination of context-aware features and strictly local ones without inducing a noticeable level of redundancy.

This paper extends our previously published work in [8]. The extended version contains a more detailed description of the method as well as a more comprehensive evaluation. We have added the salient region detector [9] to the comparative study, and the complementarity evaluation has been performed on a large data set. Furthermore, we have included a qualitative evaluation of our context-aware features.
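To give a concrete flavour of what context-dependent, information-theoretic saliency means, the minimal sketch below scores each pixel by the self-information of its intensity under the global image histogram and keeps the strongest local maxima as keypoints. This is only an illustration of the general idea, under the simplifying assumption that the image context is summarized by global intensity statistics; it is not the CAKE formulation, and the function name and parameter choices are illustrative (NumPy and SciPy are assumed).

    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter

    def self_information_keypoints(gray, sigma=2.0, n_points=200):
        # Toy notion of "context": the global intensity distribution of the image.
        gray = np.asarray(gray, dtype=np.uint8)
        p = np.bincount(gray.ravel(), minlength=256).astype(float)
        p /= p.sum()
        # Per-pixel self-information: intensities that are rare in the whole image score high.
        info = -np.log2(p[gray] + 1e-12)
        # Smooth so that coherent salient regions dominate over isolated pixels.
        info = gaussian_filter(info, sigma)
        # Keypoints: strongest local maxima of this globally informed saliency map.
        peaks = info == maximum_filter(info, size=9)
        ys, xs = np.nonzero(peaks)
        order = np.argsort(info[ys, xs])[::-1][:n_points]
        return np.stack([xs[order], ys[order]], axis=1), info

A strictly local detector, in contrast, scores a pixel from its neighbourhood alone, regardless of how common that pattern is elsewhere in the image, which is precisely the limitation that context-aware extraction targets.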
2. Related work

The information provided by the first and second order derivatives has been the basis of diverse algorithms. Local signal changes can be summarized by structures such as the structure tensor matrix or the Hessian matrix. Algorithms based on the former were initially suggested in [10] and [11]. The trace and the determinant of the structure tensor matrix are usually taken to define a saliency measure [12, 13, 14, 15].
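As a concrete reference for how the determinant and trace of the structure tensor yield a saliency measure, the sketch below computes the two classical responses built from them: the Harris-Stephens cornerness and the Shi-Tomasi minimum eigenvalue. It is a generic illustration rather than a reimplementation of any of the cited detectors; the differentiation and integration scales (sigma_d, sigma_i) and the constant k = 0.04 are common default choices, and NumPy/SciPy are assumed.

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def structure_tensor_saliency(gray, sigma_d=1.0, sigma_i=2.0, k=0.04):
        img = np.asarray(gray, dtype=float)
        # Gaussian derivatives at the differentiation scale sigma_d.
        Ix = gaussian_filter(img, sigma_d, order=(0, 1))
        Iy = gaussian_filter(img, sigma_d, order=(1, 0))
        # Structure tensor components, averaged at the integration scale sigma_i:
        # M = G_{sigma_i} * [Ix^2, Ix*Iy; Ix*Iy, Iy^2]
        Ixx = gaussian_filter(Ix * Ix, sigma_i)
        Iyy = gaussian_filter(Iy * Iy, sigma_i)
        Ixy = gaussian_filter(Ix * Iy, sigma_i)
        det = Ixx * Iyy - Ixy ** 2
        tr = Ixx + Iyy
        # Harris-Stephens response: det(M) - k * trace(M)^2.
        harris = det - k * tr ** 2
        # Shi-Tomasi response: the smaller eigenvalue of M.
        shi_tomasi = 0.5 * (tr - np.sqrt(np.maximum(tr ** 2 - 4.0 * det, 0.0)))
        return harris, shi_tomasi

Keypoints are then taken at local maxima of either response; the scale and affine covariant extractors discussed next repeat this kind of measurement over a scale-space representation and add an automatic scale or shape selection step.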
The seminal studies on linear scale-space representation [16, 17, 18] as well as the derived affine scale-space representation theory [19, 20] have been a motivation to define scale and affine covariant feature detectors under differential measures, such as the Difference of Gaussian (DoG) extractor [21] or the Harris-Laplace [22], which is a scale (and rotation) covariant extractor that results from the combination of the Harris-Stephens scheme [11] with a Gaussian scale-space representation. Concisely, the method performs a multi-scale Harris-Stephens keypoint extraction followed by an automatic scale selection [23] defined by a normalized Laplacian operator. The authors also propose the Hessian-Laplace extractor, which is similar to the former, with the exception of using the determinant of the Hessian matrix to extract keypoints at multiple scales. The Harris-Affine scheme [24], an extension of the Harris-Laplace, relies on the combination of the Harris-Stephens operator with an affine shape adaptation stage. Similarly, the Hessian-Affine algorithm [24] follows the affine shape adaptation; however, the initial estimate is taken from the determinant of the Hessian matrix. Another differential-based method is the Scale Invariant Feature Operator (SFOP) [25], which was designed to respond to corners, junctions, and circular features. Its explicitly interpretable and complementary extraction results from a unified framework that extends the gradient-based extraction previously discussed in [26] and [27] to a scale-space representation.

The extraction of KAZE features [28] is a multiscale-based approach that makes use of nonlinear scale-spaces. The idea is to make the inherent blurring of scale-space representations locally adaptive in order to reduce noise and preserve details. The scale-space is built using Additive Operator Splitting techniques and variable conductance diffusion.

The algorithms proposed by Gilles [29] and by Kadir and Brady [9] are two well-known methods relying on information theory. Gilles defines keypoints as image locations at which the entropy of local intensity values attains a maximum. Motivated by the work of Gilles, Kadir and Brady introduced a scale covariant salient region extractor. This scheme estimates the entropy of the distribution of intensity values inside a region over a certain range of scales; salient regions are taken from the scales at which the entropy is peaked. There is also an affine covariant version of this method [30].

Maximally Stable Extremal Regions (MSER) [2] are a type of affine covariant features that correspond to connected components defined under certain intensity thresholds. These components have either higher or lower values than the pixels on their outer boundaries. An extremal region is said to be maximally stable if the relative area change, as a result of modifying the threshold, is a local minimum. The MSER algorithm has been extended to volumetric [31] and color [32] images, has been subject to efficiency enhancements [33, 34, 35], and has a multiresolution version [36].

3. Analysis and Motivation

Local feature extractors tend to rely on strong assumptions about the image content. For instance, Harris-Stephens and Laplacian-based detectors assume, respectively, the presence of corners and blobs. The MSER algorithm assumes the existence of image regions characterized by stable isophotes with respect to intensity perturbations. All of the above-mentioned structures are expected to be related to semantically meaningful parts of an image, such as the boundaries or the vertices of objects, or even the objects themselves. However, we cannot ensure that the detection of a particular feature will cover the most informative parts of the image. Figure 1 depicts two simple yet illustrative examples of how standard methods such as the Shi-Tomasi algorithm [13] can fail to provide a robust image representation. In the first example (Fig. 1 (a)–(d)), the closed contour, which is a relevant object within the image context, is neglected by the strictly local extractor. On the other hand, the context-aware extraction retrieves a keypoint inside the closed contour as one of the most