Context-Aware Features and Robust Image Representations

P. Martins (a,*), P. Carvalho (a), C. Gatta (b)
(a) Center for Informatics and Systems, University of Coimbra, Coimbra, Portugal
(b) Computer Vision Center, Autonomous University of Barcelona, Barcelona, Spain

* Corresponding author. Email addresses: [email protected] (P. Martins), [email protected] (P. Carvalho), [email protected] (C. Gatta)

Preprint submitted to Elsevier, March 6, 2013

Abstract

Local image features are often used to efficiently represent image content. The limited number of types of features that a local feature extractor responds to might be insufficient to provide a robust image representation. To overcome this limitation, we propose a context-aware feature extraction formulated under an information theoretic framework. The algorithm does not respond to a specific type of features; the idea is to retrieve complementary features which are relevant within the image context. We empirically validate the method by investigating the repeatability, the completeness, and the complementarity of context-aware features on standard benchmarks. In a comparison with strictly local features, we show that our context-aware features produce more robust image representations. Furthermore, we study the complementarity between strictly local features and context-aware ones to produce an even more robust representation.

Keywords: Local features, Keypoint extraction, Image content descriptors, Image representation, Visual saliency, Information theory.

1. Introduction

Local feature detection (or extraction, if we want to use a more semantically correct term [1]) is a central and extremely active research topic in the fields of computer vision and image analysis. Reliable solutions to prominent problems such as wide-baseline stereo matching, content-based image retrieval, object (class) recognition, and symmetry detection often make use of local image features (e.g., [2, 3, 4, 5, 6, 7]).

While it is widely accepted that a good local feature extractor should retrieve distinctive, accurate, and repeatable features under a wide variety of photometric and geometric transformations, it is equally valid to claim that these requirements are not always the most important ones. In fact, not all tasks require the same properties from a local feature extractor. We can distinguish three broad categories of applications according to the required properties [1]. The first category includes applications in which the semantic meaning of a particular type of features is exploited: for instance, edge or even ridge detection can be used to identify blood vessels in medical images and watercourses or roads in aerial images; another example in this category is the use of blob extraction to identify blob-like organisms in microscopies. A second category includes tasks such as matching, tracking, and registration, which mainly require distinctive, repeatable, and accurate features. Finally, a third category comprises applications such as object (class) recognition, image retrieval, scene classification, and image compression. For this category, it is crucial that features preserve the most informative image content (a robust image representation), while repeatability and accuracy are requirements of lesser importance.

We propose a local feature extractor aimed at providing a robust image representation. Our algorithm, named Context-Aware Keypoint Extractor (CAKE), represents a new paradigm in local feature extraction: no a priori assumption is made on the type of structure to be extracted. It retrieves locations (keypoints) which are representative of salient regions within the image context. Two major advantages can be foreseen in the use of such features: the most informative image content at a global level will be preserved by context-aware features, and an even more complete coverage of the content can be achieved through the combination of context-aware features and strictly local ones, without inducing a noticeable level of redundancy.

This paper extends our previously published work in [8]. The extended version contains a more detailed description of the method as well as a more comprehensive evaluation. We have added the salient region detector [9] to the comparative study, and the complementarity evaluation has been performed on a larger data set. Furthermore, we have included a qualitative evaluation of our context-aware features.
2. Related work

The information provided by the first and second order derivatives has been the basis of diverse algorithms. Local signal changes can be summarized by structures such as the structure tensor matrix or the Hessian matrix. Algorithms based on the former were initially suggested in [10] and [11]. The trace and the determinant of the structure tensor matrix are usually taken to define a saliency measure [12, 13, 14, 15].

The seminal studies on linear scale-space representation [16, 17, 18], as well as the derived affine scale-space representation theory [19, 20], have been a motivation to define scale and affine covariant feature detectors under differential measures, such as the Difference of Gaussian (DoG) extractor [21] or the Harris-Laplace [22], a scale (and rotation) covariant extractor that results from the combination of the Harris-Stephens scheme [11] with a Gaussian scale-space representation. Concisely, the method performs a multi-scale Harris-Stephens keypoint extraction followed by an automatic scale selection [23] defined by a normalized Laplacian operator. The authors also propose the Hessian-Laplace extractor, which is similar to the former, with the exception of using the determinant of the Hessian matrix to extract keypoints at multiple scales. The Harris-Affine scheme [24], an extension of the Harris-Laplace, relies on the combination of the Harris-Stephens operator with an affine shape adaptation stage. Similarly, the Hessian-Affine algorithm [24] follows the affine shape adaptation; however, the initial estimate is taken from the determinant of the Hessian matrix. Another differential-based method is the Scale Invariant Feature Operator (SFOP) [25], which was designed to respond to corners, junctions, and circular features. The explicitly interpretable and complementary extraction results from a unified framework that extends the gradient-based extraction previously discussed in [26] and [27] to a scale-space representation.

The extraction of KAZE features [28] is a multiscale approach that makes use of nonlinear scale-spaces. The idea is to make the inherent blurring of scale-space representations locally adaptive in order to reduce noise and preserve details. The scale-space is built using Additive Operator Splitting techniques and variable conductance diffusion.

The algorithms proposed by Gilles [29] and by Kadir and Brady [9] are two well-known methods relying on information theory. Gilles defines keypoints as image locations at which the entropy of local intensity values attains a maximum. Motivated by the work of Gilles, Kadir and Brady introduced a scale covariant salient region extractor. This scheme estimates the entropy of the intensity value distribution inside a region over a certain range of scales; salient regions in the scale-space are taken from scales at which the entropy is peaked. There is also an affine covariant version of this method [30].

Maximally Stable Extremal Regions (MSER) [2] are a type of affine covariant features that correspond to connected components defined under certain thresholds. These components are said to be extremal because the pixels in the connected components have either higher or lower values than the pixels on their outer boundaries. An extremal region is said to be maximally stable if the relative area change, as a result of modifying the threshold, is a local minimum. The MSER algorithm has been
extended to volumetric [31] and color images [32], and it has been the subject of efficiency enhancements [33, 34, 35] and of a multiresolution version [36].

3. Analysis and Motivation

Local feature extractors tend to rely on strong assumptions about the image content. For instance, Harris-Stephens and Laplacian-based detectors assume, respectively, the presence of corners and blobs. The MSER algorithm assumes the existence of image regions characterized by stable isophotes with respect to intensity perturbations. All of the above-mentioned structures are expected to be related to semantically meaningful parts of an image, such as the boundaries or the vertices of objects, or even the objects themselves. However, we cannot ensure that the detection of a particular feature will cover the most informative parts of the image. Figure 1 depicts two simple yet illustrative examples of how standard methods such as the Shi-Tomasi algorithm [13] can fail in the attempt to provide a robust image representation. In the first example (Fig. 1 (a)-(d)), the closed contour, which is a relevant object within the image context, is neglected by the strictly local extractor. On the other hand, the context-aware extraction retrieves a keypoint inside the closed contour as one of the most salient locations. The second example (Fig. 1 (e) and (f)) depicts the "Needle in a Haystack" image and the overlaid maps (in red) representing the Shi-Tomasi saliency measure and our context-aware saliency measure. It is readily seen that our method provides a better coverage of the most relevant object.

[Figure 1: Context-aware keypoint extraction vs. strictly local keypoint extraction. 1. Keypoints on a psychological pattern: (a) pattern (input image); (b) 60 most salient Shi-Tomasi keypoints; (c) 5 most salient context-aware keypoints; (d) 60 most salient context-aware keypoints. 2. Saliency measures as overlaid maps on the "Needle in a Haystack" image: (e) Shi-Tomasi; (f) context-aware. Best viewed in color.]

Context-aware features can show a high degree of complementarity among themselves. This is particularly noticeable in images composed of different patterns and structures. The image in the top row of Fig. 2 depicts our context-aware keypoint extraction on a well-structured scene by retrieving the 100 most salient locations. This relatively small number of features is sufficient to provide a reasonable coverage of the image content, which includes diverse structures.
However, in the case of scenes characterized by repetitive patterns, context-aware extractors will not provide the desired coverage. Nevertheless, the extracted set of features can be complemented with a counterpart that retrieves the repetitive elements in the image. The image in the bottom row of Fig. 2 depicts a combined feature extraction on a textured image in which context-aware features are complemented with SFOP features [25] to achieve a better coverage. In the latter example, one should note the high complementarity between the two types of features as well as the good coverage that the combined set provides.

[Figure 2: Proposed context-aware extraction. Top row: context-aware keypoints on a well-structured scene (100 most salient locations); bottom row: a combination of context-aware keypoints (green squares) with SFOP keypoints [25] (red squares) on a highly textured image. Best viewed in color.]

4. Context-Aware Keypoints

Our context-aware feature extraction adopts an information theoretic framework in which the key idea is to use information content to quantify (and express) feature saliency. In our case, a context-aware keypoint will correspond to a particular point within a structure with a low probability of occurrence.

Shannon's measure of information [37] forms the basis for our saliency measure. If we consider a symbol s, its information is given by

    I(s) = -log(P(s)),    (1)

where P(·) denotes the probability of a symbol. For our purposes, using solely the content of a pixel x as a symbol is not applicable, whereas the content of a region around x is more appropriate. Therefore, we will consider any local description w(x) ∈ R^n that represents the neighborhood of x as a viable codeword. This codeword will be our symbol s, which allows us to rewrite Eq. (1):

    I(x) = -log(P(w(x))).    (2)

However, in Shannon's perspective, a symbol should be a case of a discrete set of possibilities, whereas we have w(x) ∈ R^n. As a result, to estimate the probability of a certain symbol, a frequentist approach might be used. In this case, one should be able to quantize codewords into symbols. It is clear that the frequentist approach becomes inappropriate and the quantization becomes a dangerous process when applied to a codeword, since the quantization errors can induce strong artifacts in the I(x) map, generating spurious local maxima.

We abandon the frequentist approach in favor of Parzen density estimation [38], also known as Kernel Density Estimation (KDE). The Parzen estimation is suitable for our method as it is non-parametric, which allows us to estimate any probability density function (PDF), as long as there is a reasonable number of samples. Using the KDE, we estimate the probability of a codeword w(y) as follows:

    P^(w(y)) = (1/(Nh)) Σ_{x∈Φ} K( d(w(y), w(x)) / h ),    (3)

where K denotes a kernel, d is a distance measure, h is a smoothing parameter called bandwidth, and N = |Φ| is the cardinality of the image domain Φ.
The key idea behind the KDE method is to smooth out the contribution of each sample x by spreading it over a certain area in R^n, with a certain shape defined by the kernel K. There is a number of choices for the kernel. Nonetheless, the most commonly used, and the most suitable here, is a multi-dimensional Gaussian function with zero mean and standard deviation σ_k. Using a Gaussian kernel, (3) can be rewritten as

    P~(w(y)) = (1/(NΓ)) Σ_{x∈Φ} exp( -d²(w(y), w(x)) / (2σ_k²) ),    (4)

where h has been replaced by the standard deviation σ_k, and Γ is a proper constant such that the estimated probabilities are taken from an actual PDF.

Summarizing, our saliency measure will be given by

    m(y) = -log( (1/(NΓ)) Σ_{x∈Φ} exp( -d²(w(y), w(x)) / (2σ_k²) ) ),    (5)

and context-aware keypoints will correspond to local maxima of m that are above a given threshold T.
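For concreteness, a minimal sketch of Eqs. (4) and (5) in Python/NumPy follows. It is our illustration rather than the reference implementation: it retains the full O(N²) cost (and an O(N²·n) memory footprint), so it is only practical for small images; `codewords` is a hypothetical (N, n) array holding one codeword per pixel; and the normalization constant Γ is dropped, which only shifts m by a constant and leaves its maxima unchanged.

```python
import numpy as np

def cake_saliency(codewords, sigma_k):
    """Naive evaluation of the saliency measure m of Eq. (5).

    codewords: (N, n) array, one codeword w(x) per pixel x.
    sigma_k:   bandwidth of the Gaussian kernel.
    Returns an (N,) array with the information value of each pixel.
    """
    # All pairwise squared distances between codewords. Whitening the
    # codewords beforehand makes this Euclidean distance equivalent to
    # the Mahalanobis distance adopted in Section 4.1 below.
    diff = codewords[:, None, :] - codewords[None, :, :]
    sq_dists = np.einsum('ijk,ijk->ij', diff, diff)
    # Gaussian KDE estimate of P(w(y)), Eq. (4), up to the constant Gamma.
    p = np.exp(-sq_dists / (2.0 * sigma_k**2)).mean(axis=1)
    # Saliency is the Shannon information of the codeword, Eq. (5).
    return -np.log(p)
```

Context-aware keypoints are then the local maxima of the returned map, reshaped to the image grid, that exceed the threshold T.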
For a complete description of the proposed method, we have to define a distance measure d and set a proper value for σ_k. Due to the relevance of these two parameters in the process of estimating the PDF, they are discussed in two separate subsections (4.1 and 4.2). Nonetheless, the KDE has an inherent and significant drawback: its computational cost. To estimate the probability of a pixel, we have to compute (4), which means computing N distances between codewords, giving a computational cost of O(N²) for the whole image. This complexity is prohibitive for images, where N is often of the order of millions. Different methods have been proposed to reduce the computation of a KDE-based PDF. Many rely on the hypothesis that the sample distribution forms separated clusters, so that it is feasible to approximate the probability at a certain location of the multivariate space using a reduced set of samples. Other methods have been devised for the purpose of a Parzen classifier, so that the cardinality of the training sample is reduced without significantly changing the performance of the reduced Parzen classifier. In our case, neither of the two aforementioned strategies can be straightforwardly used, since (i) we cannot assume that the multivariate distribution forms different clusters, and (ii) we do not have ground truth labels to use the same strategy as the one defined for Parzen classifiers. We propose an efficient method that reduces the number of samples by approximating the full O(N²) PDF in (4) with an O(N log N) algorithm. A detailed explanation of the speed-up strategy can be found in Appendix B.

4.1. The distance d

To completely define a KDE-based approach, we have to define (i) the distance d, (ii) the kernel K, and (iii) the bandwidth h. These three parameters are interrelated, since they form the final "shape" of the kernel. As for the distance function d, we consider the Mahalanobis distance:

    d(w(x), w(y)) = sqrt( (w(x) - w(y))^T Σ_W^{-1} (w(x) - w(y)) ),    (6)

where W = {w(x)}_{x∈Φ} and Σ_W is the covariance matrix of W. Using this distance, any affine covariant codeword provides an affine invariant behavior to the extractor. In other words, any affine transformation will preserve the order of P. This result is summarized in the following theorem:

Theorem 1. Let w^(1) and w^(2) be codewords such that w^(2)(x) = T(w^(1)(x)), where T is an affine transformation. Let P^(1) and P^(2) be the probability maps of w^(1) and w^(2), i.e., P^(i)(·) = P(w^(i)(·)), i = 1, 2. In this case,

    P^(2)(x_l) ≤ P^(2)(x_m)  ⟺  P^(1)(x_l) ≤ P^(1)(x_m),  ∀ x_l, x_m ∈ Φ.

Proof: (See Appendix A).

4.2. The smoothing parameter σ_k

A Parzen estimation can be seen as an interpolation method that provides an estimate of the continuous implicit PDF. It has been shown that, for N → ∞, the KDE converges to the actual PDF [38]. However, when N is finite, the bandwidth h plays an important role in the approximation. In the case of a Gaussian kernel, σ_k is the parameter that accounts for the smoothing strength.

The free parameter σ_k can potentially cancel the ability of the proposed method to adapt to the image context. When σ_k is too large, an over-smoothing of the estimated PDF occurs, washing out the structure of the PDF that is due to the image content. If σ_k is too small, the interpolated values between different samples can be so low that there is effectively no interpolation anymore. We propose a method, for the case of a univariate distribution, to determine an optimal sigma σ_k*, aiming at sufficient blurring while keeping the estimated PDF between samples as sharp as possible. We use univariate distributions since we approximate the KDE computation of a D-dimensional multivariate PDF by estimating D separate univariate PDFs (see Appendix B). From N samples w, we define the optimal σ_k for the given distribution as

    σ_k* = argmax_{σ>0} | (d/dw) [ (1/(sqrt(2π) σ)) ( exp(-(w - w_i)²/(2σ²)) + exp(-(w - w_{i+1})²/(2σ²)) ) ] |_{w=w_i},    (7)

where w_i and w_{i+1} are the farthest pair of consecutive samples in the distribution. It can be shown that, by solving (7), we obtain σ_k* = |w_i - w_{i+1}|. It can also be demonstrated that, for σ < |w_i - w_{i+1}|/2, the estimated PDF between the two samples exhibits a local minimum, which indicates insufficient smoothing. Using σ_k* as defined above, we assure sufficient blurring between the two farthest samples while, at the same time, keeping the estimated PDF as sharp as possible.

5. CAKE Instances

Different CAKE instances are constructed by considering different codewords. As observed by Gilles [29] and by Kadir and Brady [9], the notion of saliency is related to rarity: what is salient is rare. However, the reciprocal is not necessarily valid. A highly discriminating codeword contributes to turning every location into a rare structure, so that nothing is seen as salient; on the other hand, with a less discriminating codeword, rarity is harder to find. We present a differential-based instance, which is provided by a sufficiently discriminating codeword. The strong link between image derivatives and the geometry of local structures is the main motivation to present an instance based on local differential information.

We propose the use of the Hessian matrix as a codeword to describe the local shape characteristics. We consider components computed at multiple scales, which allows us to provide an instance with a quasi-covariant response to scale changes. The codeword for the multiscale Hessian-based instance is

    w(x) = ( t_1² L_xx(x; t_1), t_1² L_xy(x; t_1), t_1² L_yy(x; t_1),
             t_2² L_xx(x; t_2), t_2² L_xy(x; t_2), t_2² L_yy(x; t_2),
             ...,
             t_M² L_xx(x; t_M), t_M² L_xy(x; t_M), t_M² L_yy(x; t_M) )^T,    (8)

where L_xx, L_xy, and L_yy are the second order partial derivatives of L, a Gaussian smoothed version of the image, and t_i, with i = 1, ..., M, represents the scale.

6. Experimental Results and Discussion

Our experimental validation relies on a comparative study that includes both context-aware features and strictly local ones. We recall that our context-aware features were designed to provide a robust image representation, with or without the contribution of strictly local features.

We compare the performance of our Hessian-based instance, here coined as [HES]-CAKE, with some of the most prominent scale covariant algorithms: Hessian-Laplace (HESLAP) [22], Harris-Laplace (HARLAP) [22], SFOP [25], and the scale covariant version of the Salient Region detector (Salient) [9]. The MSER algorithm [2], which has an affine covariant response, is also included in the evaluation. All the implementations correspond to the ones provided and maintained by the authors.

We follow the evaluation protocol proposed by Dickscheid et al. [39] to measure the completeness and the complementarity of features. Completeness can be quantified as the amount of image information preserved by a set of features. Complementarity appears as a particular case of completeness: it reflects the amount of image information coded by sets of potentially complementary features. Measuring such properties is crucial, as the main purpose of context-aware features is to provide a robust image representation, either in an isolated manner or in combination with strictly local features.

The metric for completeness is based on local statistics, which excludes any bias in favor of our context-aware features, since our algorithm is based on the analysis of the codeword distribution over the whole image. In fact, this evaluation gives a hint on the quality of the trade-off between the context-awareness and the locality of context-aware features. However, it does not indicate how well features cover informative content within the image context. If we take the "Needle in a Haystack" image depicted in Fig. 1 as an example, we can claim that strictly local features can show high completeness scores without properly covering the most interesting object in the scene. Note that such an image representation, despite its considerable robustness, might be ineffectual if the goal is to recognize the salient object. Therefore, for a better understanding of the performance of our method, we complement the completeness analysis with a qualitative evaluation of the context-awareness.

Repeatability is also considered in our validation. We measure it through the standard evaluation protocol proposed by Mikolajczyk et al. [40]. Although the presence of repeatable features may not always be a fundamental requirement for tasks demanding a robust image representation, their existence is advantageous: a robust representation with repeatable features provides a more predictable coverage when image deformations are present. In addition, repeatable features allow a method to be used in a wider range of tasks.

Both evaluation protocols deal with regions as local features instead of single locations. To obtain regions from context-aware keypoints, a normalized Laplacian operator, L_n = t² ∇²L = t² (L_xx + L_yy), is used. The characteristic scale for each keypoint corresponds to the one at which the operator attains an extremum. This scale defines the radius of a circular region centered about the keypoint. Note that the CAKE instance does not solely respond to blob-like keypoints: it captures other structures for which scale selection can be less reliable (e.g., edges). Nevertheless, the combination of [HES]-CAKE with the normalized Laplacian operator provides robustness to scale changes, despite the resulting method not being entirely scale covariant. Figure 3 depicts the extraction of [HES]-CAKE regions using the Laplacian operator for scale selection.

[Figure 3: [HES]-CAKE regions under viewpoint changes.]

6.1. Completeness and complementarity evaluation

To measure the completeness of features, Dickscheid et al. [39] compute an entropy density p_H(x) based on local image statistics, and a feature coding density p_c(x) derived from a given set of features. The measure of (in)completeness corresponds to the Hellinger distance between the two densities:

    d_H(p_H, p_c) = sqrt( (1/2) Σ_{x∈Φ} ( sqrt(p_H(x)) - sqrt(p_c(x)) )² ),    (9)

where Φ is the image domain. When p_H and p_c are very close, the distance d_H will be small, which means the set of features with coding density p_c effectively covers the image content (the set of features has a high completeness). Such a metric penalizes the use of large scales (a straightforward solution to achieve a full coverage) as well as the presence of features in purely homogeneous regions. On the other hand, it rewards the "fine capturing" of local structures and superimposed features appearing at different scales (the entropy density takes several scales into consideration).
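Given the two density maps, the score of Eq. (9) is a one-liner; the sketch below is our illustration and assumes `p_h` and `p_c` are nonnegative arrays defined over the same image grid:

```python
import numpy as np

def hellinger_distance(p_h, p_c):
    """(In)completeness measure d_H of Eq. (9) between the entropy
    density p_H and the feature coding density p_c. Lower values mean
    that the feature set covers the image content more completely."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p_h) - np.sqrt(p_c))**2)))
```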

The dataset contains six of the seven categories used in the original evaluation (Fig. 4). It comprises four categories of natural scenes [41, 42], the Brodatz texture collection [43], as well as a set of aerial images. The seventh category, which is comprised of different cartoon images, was not made publicly available.

[Figure 4: Example images from the categories in the completeness and complementarity dataset: Brodatz (30 images), Aerial (28 images), Forest (328 images), Mountain (374 images), Tallbuilding (356 images), and Kitchen (210 images).]

The cardinality of the sets influences the completeness scores, as sparser sets tend to be less complete. While it is interesting to analyze the completeness of sets with comparable sparseness, one cannot expect similar cardinalities when dealing with different feature types. We take such facts into consideration and, as a result, we perform two different tests. The first one corresponds to the main completeness test, which does not restrict the number of features. The second one allows us to make a direct comparison between our method and the salient regions algorithm by using the same number of features. Let F_[HES]-CAKE(g) and F_Salient(g) be the respective sets of [HES]-CAKE regions and Salient regions extracted from an image g. From each set, we extract the n highest ranked features (both methods provide a well-defined hierarchy among features), where n = min{ |F_[HES]-CAKE(g)|, |F_Salient(g)| }.

The parameter settings for our algorithm are outlined in Table 1. For the remaining algorithms, default parameter settings are used.

Table 1: Parameter settings for [HES]-CAKE.
    Number of scales                                    3
    t_{i+1}/t_i (ratio between successive scale levels) 1.19
    t_0 (initial scale)                                 1.4
    Non-maximal suppression window                      3×3
    Threshold                                           None
    σ_k                                                 optimal
    N_R (number of samples)                             200

Figure 5 is a summary of the main completeness evaluation. Results are shown for each image category in terms of the distance d_H(p_H, p_c). The plot includes the line y = 1/sqrt(2), which corresponds to an angle of 90 degrees between sqrt(p_H) and sqrt(p_c). For a better interpretation, the average number of features per category is also shown. Regardless of the image collection, our context-aware instance retrieves more features than the other algorithms, which contributes to achieving the best completeness scores. The exception is the Brodatz category, which essentially contains highly textured images. For this category, salient regions achieve a better completeness score despite the lower number of regions.

[Figure 5: Completeness results. Top row: average dissimilarity measure d_H(p_H, p_c) for the different sets of features extracted over the categories of the dataset. Bottom row: average number of extracted features per image category.]

The additional test computes the completeness scores of context-aware regions and salient regions for the first 20 images in each category using the same number of features. The results are summarized in Fig. 6. Here, salient regions achieve better results; however, the difference between scores is not significant. Aerial and Kitchen are the categories where context-aware features exhibit the lowest scores. This is explained by the strong presence of homogeneous regions, which might be part of salient objects within the image context, such as roads and rooftops (Aerial category) or home appliances and furniture (Kitchen category).

[Figure 6: Average dissimilarity measure d_H(p_H, p_c) for the different sets of features extracted over the categories of the dataset (20 images per category).]

Complementarity was also evaluated on the first 20 images of each category by considering combinations of two different feature types. The results are summarized in Table 2.

As expected, any combination that includes [HES]-CAKE regions achieves the best completeness scores. We give particular emphasis to the complementarity between HESLAP and [HES]-CAKE: both methods are Hessian-based, and yet they produce complementary regions. The combination of [HES]-CAKE and Salient regions is also advantageous: the latter provides a good coverage of "busy" parts composed of repetitive patterns.

[Table 2: Average dissimilarity measure d_H(p_H, p_c) for different sets of complementary features (20 images per category); ∆ represents the difference d_H(p_H, p_c) - min{d_1, d_2}, where d_1 and d_2 denote the average dissimilarity measures of the two individual sets. The table lists all fifteen pairs of the six detectors, with d_H ranging from 0.1028 to 0.2895; the five lowest (best) scores, from 0.1028 to 0.1253, correspond to the five combinations that include [HES]-CAKE. The pairing of the remaining values with specific detector pairs is not recoverable from the source.]

6.2. Context-awareness evaluation

For a qualitative evaluation of the context-awareness of [HES]-CAKE regions, we use three images typically employed in the validation of algorithms for visual saliency detection [44]. Each one of the test images shows a salient object over a background containing partially salient elements. Figure 7 depicts the test images, the corresponding information maps given by the CAKE instance, as well as the coverage provided by the context-aware regions when 100 and 250 points are used. In all cases, our algorithm succeeds in covering distinctive elements of the salient objects. With 250 [HES]-CAKE regions, the coverage becomes a relatively robust image representation in all cases.

[Figure 7: [HES]-CAKE information maps and extraction results in terms of coverage. Columns: input image, information map, coverage (100 points), coverage (250 points).]

6.3. Repeatability Evaluation

The repeatability score between regions extracted from an image pair is computed using 2D homographies as ground truth. Two features (regions) are deemed corresponding and, therefore, repeated, within an overlap error of ε_R × 100% if

    1 - |R_{μ1} ∩ R_{(H^T μ2 H)}| / |R_{μ1} ∪ R_{(H^T μ2 H)}| < ε_R,    (10)

where R_μ denotes the set of image points in the elliptical region verifying x^T μ x ≤ 1, and H is the homography that relates the two input images. For a given pair of images and a given overlap error, the repeatability score corresponds to the ratio between the number of correspondences between regions and the smaller of the numbers of regions in the pair of images. Only regions that are located in parts of the scene common to the two images are considered.
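Equation (10) can also be approximated numerically by rasterizing both regions. The sketch below is our brute-force illustration, not the benchmark's reference code: both ellipses are placed at the origin for simplicity, whereas the full protocol maps the second region through the homography H and compares the ellipses at their image positions.

```python
import numpy as np

def overlap_error(mu1, mu2_mapped, half_width=100):
    """Rasterized approximation of the overlap error of Eq. (10).

    mu1:        2x2 matrix of the reference ellipse, x^T mu x <= 1.
    mu2_mapped: 2x2 matrix of the second ellipse, already mapped into
                the reference image (the H^T mu2 H term of Eq. (10)).
    half_width: half-size of the sampling grid; it must exceed the
                extent of both regions.
    """
    s = np.arange(-half_width, half_width + 1, dtype=float)
    xx, yy = np.meshgrid(s, s)
    pts = np.stack([xx.ravel(), yy.ravel()])            # shape (2, M)
    inside1 = np.einsum('im,ij,jm->m', pts, mu1, pts) <= 1.0
    inside2 = np.einsum('im,ij,jm->m', pts, mu2_mapped, pts) <= 1.0
    intersection = np.logical_and(inside1, inside2).sum()
    union = np.logical_or(inside1, inside2).sum()
    return 1.0 - intersection / union

# Regions are deemed repeated when overlap_error(...) < eps_R; the
# evaluation in this section uses eps_R = 0.4 (a 40% overlap error).
```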

The benchmark is supported by the Oxford dataset, which comprises 8 sequences of images, each one with 6 images, under different photometric and geometric transformations (Table 3). The repeatability of regions is computed within an overlap error of 40%, using the first image of each sequence as the reference.

Table 3: Image sequences in the Oxford dataset.
    Sequence   Scene type                   Transformation
    Graffiti   well-structured              viewpoint change
    Wall       textured                     viewpoint change
    Boat       well-structured + textured   zoom + rotation
    Bark       textured                     zoom + rotation
    Bikes      well-structured              blur
    Trees      textured                     blur
    Leuven     well-structured              lighting conditions
    UBC        well-structured + textured   JPEG compression

Table 4 outlines the parameter settings for [HES]-CAKE. Our method retrieves more features than its counterparts; thus, for a fair evaluation of repeatability, we defined a threshold that avoids a discrepancy between the number of features retrieved by [HES]-CAKE and the remaining algorithms.

Table 4: Parameter settings for [HES]-CAKE.
    Scales                           12
    t_{i+1}/t_i                      1.19
    t_0 (initial scale)              1.19
    Non-maximal suppression window   3×3
    Threshold                        12 (or 3000 points)
    σ_k                              optimal

Figure 8 reports the results in terms of average repeatability scores (top row) and number of correspondences (bottom row) for each sequence. Among scale covariant features, HESLAP regions exhibit a slightly better overall repeatability score, namely in well-structured scenes (e.g., Bikes), where blob-like features are more present and well-defined. HARLAP has a similar performance, yielding the most repeatable results in textured scenes. The repeatability scores of SFOP and [HES]-CAKE are very similar, yet the latter responds with a higher number of features. Aside from viewpoint changes, the repeatability of MSER tends to be lower than that of its counterparts. In a direct comparison of information theoretic methods, we observe that [HES]-CAKE features are more repeatable than salient regions. The only two exceptions to this observation are the results for the Trees and Wall sequences. Such results are explained by the fact that both sequences depict highly textured scenes, providing denser sets of salient regions. As for scale changes (Boat and Bark sequences), [HES]-CAKE regions show a sufficiently robust behavior; in the case of the Bark sequence, only HESLAP features are more repeatable than the proposed regions.

[Figure 8: Repeatability score and number of correspondences with an overlap error of 40% on the Oxford dataset. Top row: average repeatability; error bars indicate the standard deviation. Bottom row: average number of correspondences; error bars indicate the standard deviation.]

7. Conclusions

We have presented a context-aware keypoint extractor, which represents a new paradigm in local feature extraction. The idea is to retrieve salient locations within the image context, which means that no assumption is made on the type of structure to be extracted. Such a scheme was designed to provide a robust image representation, with or without the contribution of other local features.

The algorithm follows an information theoretic approach to extract salient locations. The possible shortcomings of such an approach were analyzed, namely the difficulties in defining sufficiently discriminating descriptors and in estimating the information of the inherent distributions in an efficient way.

The experimental evaluation has shown that relying on image statistics to extract keypoints is a winning strategy. A robust image representation can be easily achieved with context-aware features. Furthermore, the complementarity between context-aware features and strictly local ones can be exploited to produce an even more robust representation.

As for the applicability of the method, we believe that most of the tasks requiring a robust image representation will benefit from the use of context-aware features. In this category, we include tasks such as scene classification, image retrieval, object (class) recognition, and image compression.

Appendix A. Proof of Theorem 1

Proof. Let us suppose that P^(2)(x_l) ≤ P^(2)(x_m) (the reasoning is analogous if we consider the other inequality). From the definition of the probability, we have

    Σ_{j=1}^{N} exp( -(w^(2)(x_l) - w^(2)(x_j))^T Σ_{W^(2)}^{-1} (w^(2)(x_l) - w^(2)(x_j)) / (2σ_k²) )
        ≤ Σ_{j=1}^{N} exp( -(w^(2)(x_m) - w^(2)(x_j))^T Σ_{W^(2)}^{-1} (w^(2)(x_m) - w^(2)(x_j)) / (2σ_k²) ).

Let A be the matrix that represents the transformation T (we assume no translation). Since Σ_{W^(2)} = A Σ_{W^(1)} A^T, the numerators of the exponents in the first and second members of the inequality can be rewritten as

    ( A(w^(1)(x_l) - w^(1)(x_j)) )^T ( A Σ_{W^(1)} A^T )^{-1} ( A(w^(1)(x_l) - w^(1)(x_j)) )

and

    ( A(w^(1)(x_m) - w^(1)(x_j)) )^T ( A Σ_{W^(1)} A^T )^{-1} ( A(w^(1)(x_m) - w^(1)(x_j)) ),

respectively. By simplifying the previous expressions, we have

    ( w^(1)(x_l) - w^(1)(x_j) )^T Σ_{W^(1)}^{-1} ( w^(1)(x_l) - w^(1)(x_j) )

and

    ( w^(1)(x_m) - w^(1)(x_j) )^T Σ_{W^(1)}^{-1} ( w^(1)(x_m) - w^(1)(x_j) ).

Thus,

    P^(2)(x) = (1/|det A|) P^(1)(x),  ∀ x ∈ Φ.

From the hypothesis, we have P^(1)(x_l) ≤ P^(1)(x_m). □

Appendix B. Reduced KDE

As shown by Theorem 1, applying an affine transformation to the codewords does not change the result of the extractor. We take advantage of this and perform a principal component analysis (PCA) to obtain a new codeword distribution W_P, whose elements are denoted by w_P(x). In this case, the inverse of the covariance matrix, Σ_{W_P}^{-1}, is a diagonal matrix whose diagonal elements contain the inverse of the variance of every variable of W_P. Consequently, we can rewrite the Gaussian KDE in (4), using the Mahalanobis distance d(·,·), as another Gaussian KDE with Euclidean distance:

    p~(w_P(y)) = (1/(NΓ)) Σ_{x∈Φ} exp( -Σ_{i=1}^{D} a_i (w_{P,i}(y) - w_{P,i}(x))² / (2σ_k²) ),    (Appendix B.1)

where a_i = sqrt( Σ_{W_P}^{-1}(i,i) ), i.e., the square root of the i-th diagonal element of the inverse of the covariance matrix. Equation (Appendix B.1) can be rewritten as

    p~(w_P(y)) = (1/(NΓ)) Σ_{x∈Φ} Π_{i=1}^{D} exp( -a_i (w_{P,i}(y) - w_{P,i}(x))² / (2σ_k²) ).    (Appendix B.2)

By assuming that each dimension i provides a PDF that is independent of the other dimensions, Equation (Appendix B.2) can be approximated as follows:

    p~(w_P(y)) ≃ (1/(NΓ)) Π_{i=1}^{D} Σ_{x∈Φ} exp( -a_i (w_{P,i}(y) - w_{P,i}(x))² / (2σ_k²) )
              ≃ (1/(NΓ)) Π_{i=1}^{D} p~_i(w_{P,i}(y)).    (Appendix B.3)

Note that this approximation is only valid if PCA is able to separate the multivariate distribution into independent univariate distributions, which is not always verified. However, the proposed approximation works sufficiently well for convex multivariate distributions, which is the case in all the experiments we have conducted in this paper. Therefore, we have to compute D one-dimensional KDEs p~_i(w_{P,i}(y)), using the Euclidean distance, which reduces a multivariate KDE to D univariate problems. This step simplifies the computation of distances between codewords, but it still does not reduce the number of basic product-sum computations. Nevertheless, we can approximate the D one-dimensional KDEs to speed up the process; the fact that we have univariate distributions will be profitably used. For the sake of compactness and clarity, in the remainder of this section we will refer to p~_i(w_{P,i}(y)) as p(w(y)). We will also omit the constant 1/(NΓ) and the constants a_i.

We can extend the concept of KDE by giving a weight v(x) > 0 to each sample, so that the univariate KDE can be rewritten as a reduced KDE:

    p_R(w(y)) = Σ_{x∈Φ_R} v(x) exp( -(w(y) - w(x))² / (2σ_k²) ),    (Appendix B.4)

where Φ_R ⊂ Φ. This formulation can be seen as a hybrid between a Gaussian KDE and a Gaussian Mixture Model: the former has a large number of samples, all of them with unitary weight and a fixed σ_k, while the latter has a small number of Gaussian functions, each one with a specific weight and standard deviation.

The goal of our speed-up method is to obtain a set Φ_R with |Φ_R| = N_R ≪ N samples that approximates the O(N²) KDE. The idea is to fuse samples that are close to each other into a new sample that "summarizes" them. Given a desired number of samples N_R, the algorithm progressively fuses the pair of samples at minimum distance:

    1:  Φ_R ← Φ
    2:  v(x) ← 1, ∀ x ∈ Φ
    3:  while |Φ_R| > N_R do
    4:      {x~_0, x~_1} ← argmin_{x_0, x_1 ∈ Φ_R, x_0 ≠ x_1} |w(x_0) - w(x_1)|
    5:      v(x_01) ← v(x~_0) + v(x~_1)
    6:      w(x_01) ← ( v(x~_0) w(x~_0) + v(x~_1) w(x~_1) ) / ( v(x~_0) + v(x~_1) )
    7:      Φ_R ← (Φ_R \ {x~_0, x~_1}) ∪ {x_01}
    8:  end while

The algorithm uses as input the N samples of the distribution. The data can be represented using a self-balancing tree [45], allowing us to perform the deletions and insertions of line 7 in log N time. Since the samples are ordered both in terms of w(x) and d_m(x), updating the distances after deletions and insertions can be done in O(1). Summarizing, we need to perform 2(N - N_R) deletions and N - N_R insertions, so that the total cost of the reduction algorithm is proportional to N log N + 3(N - N_R) log N, thus being O(N log N). The total cost to compute p_R(w(y)) depends linearly on the desired N_R and on the number of dimensions D.

To further speed up the approximation, we can use a reduced number of dimensions D
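A minimal sketch of the reduction and of the weighted estimate of Eq. (Appendix B.4) follows. It is our illustration: it locates the closest pair with a linear scan over the sorted samples instead of the self-balancing tree of [45], so it implements the same fusion rule but does not reach the O(N log N) bound.

```python
import numpy as np

def reduce_samples(samples, n_r):
    """Greedily fuse the closest pair of univariate samples until only
    n_r weighted samples remain (the listing above). Returns (w, v).
    In sorted order the closest pair is always consecutive, and the
    fused sample lies between its parents, so the order is preserved."""
    w = list(np.sort(np.asarray(samples, dtype=float)))
    v = [1.0] * len(w)
    while len(w) > n_r:
        i = int(np.argmin(np.diff(w)))          # closest consecutive pair
        merged_v = v[i] + v[i + 1]
        merged_w = (v[i] * w[i] + v[i + 1] * w[i + 1]) / merged_v
        w[i:i + 2] = [merged_w]                 # fuse the pair (lines 5-7)
        v[i:i + 2] = [merged_v]
    return np.array(w), np.array(v)

def reduced_kde(y, w_r, v_r, sigma_k):
    """Weighted (reduced) KDE of Eq. (Appendix B.4), up to the constants
    1/(N Gamma) and a_i omitted in the text."""
    return float(np.sum(v_r * np.exp(-(y - w_r)**2 / (2.0 * sigma_k**2))))
```

Applied once per PCA dimension, this reduction shrinks each univariate sample set from N pixels to N_R = 200 weighted samples (Table 1), after which every probability evaluation costs O(N_R) instead of O(N).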