Int J Comput Vis (2011) 94:154–174 DOI 10.1007/s11263-010-0340-z

Coding Images with Local Features

Timo Dickscheid · Falko Schindler · Wolfgang Förstner

Received: 21 September 2009 / Accepted: 8 April 2010 / Published online: 27 April 2010 © Springer Science+Business Media, LLC 2010

Abstract We develop a qualitative measure for the completeness and complementarity of sets of local features in terms of covering relevant image information. The idea is to interpret feature detection and description as image coding, and to relate it to classical coding schemes like JPEG. Given an image, we derive a feature density from a set of local features, and measure its distance to an entropy density computed from the power spectrum of local image patches over scale. Our measure is meant to be complementary to existing ones: After the task usefulness of a set of detectors has been determined regarding robustness and sparseness of the features, the scheme can be used for comparing their completeness and for assessing the effects of combining multiple detectors. The approach has several advantages over a simple comparison of image coverage: It favors response on structured image parts, penalizes features in purely homogeneous areas, and accounts for features appearing at the same location on different scales. Combinations of complementary features tend to converge towards the entropy, while an increased amount of random features does not. We analyse the complementarity of popular feature detectors over different image categories and investigate the completeness of combinations. The derived entropy distribution leads to a new scale and rotation invariant window detector, which uses a fractal image model to take pixel correlations into account. The results of our empirical investigations reflect the theoretical concepts of the detectors.

Keywords Local features · Complementarity · Information theory · Coding · Keypoint detectors · Local entropy

T. Dickscheid (✉) · F. Schindler · W. Förstner
Department of Photogrammetry, Institute of Geodesy and Geoinformation, University of Bonn, Bonn, Germany
e-mail: [email protected]

F. Schindler
e-mail: [email protected]

W. Förstner
e-mail: [email protected]

1 Introduction

Local image features play a crucial role in many computer vision applications. The basic idea is to represent the image content by small, possibly overlapping, independent parts. By identifying such parts in different images of the same scene or object, reliable statements about image geometry and scene content are possible. Local feature extraction comprises two steps: (1) detection of stable local image patches at salient positions in the image, and (2) description of these patches. The descriptions usually contain a lower amount of data compared to the original image intensities. The SIFT descriptor (Lowe 2004), for example, represents each local patch by 128 values at 8 bit resolution, independent of its size.

This paper is concerned with feature detectors. Important properties of a good detector are:

1. Robustness. The features should be robust against typical distortions such as image noise, different lighting conditions, and camera movement.
2. Sparseness. The amount of data given by the features should be significantly smaller compared to the image itself, in order to increase the efficiency of subsequent processing.
3. Speed. A feature detector should be fast.
4. Completeness. Given that the above requirements are met, the information contained in an image should be preserved by the features as much as possible. In other words, the amount of information coded by a set of features should be maximized, given a desired degree of robustness, sparseness, and speed.

Popular detectors have been investigated in depth regarding Item 1, and there exists some common sense about the most robust detectors for typical application scenarios. Sparseness often depends on the parameter settings of a detector, especially on the significance level used to separate the noise from the signal. The speed of a detector should be characterized referring to a specific implementation.

Item 4 has not received much attention yet. To our knowledge, no pragmatic tool for quantifying the completeness of a specific set of features w.r.t. image content is available. This may be due to the differences in the concepts of the detectors, making such a statement difficult. As shown in Fig. 1, for example, blob-like features often cover much of the areas representing visible objects, while the characteristic contours are better coded by junction features and straight edge segments.

Complementarity of features plays a key role when using multiple detectors in an application. It is strongly related to completeness, as the information coded by sets of complementary features is higher than that coded by redundant feature sets. The detectors shown in Fig. 1 complement each other very well. Using all three, most relevant parts of the images are taken into account. Such complementarity is often taken for granted, but cannot always be expected. A tool for quantifying it would be highly desirable.

Our goal is to develop a qualitative measure for evaluating how far a specified set of detectors covers relevant image content completely, and whether the detectors are complementary in this sense. We would also like to know if completeness over different image categories reflects the theoretical concepts behind well-known detectors.

This requires us to find a suitable representation of what we consider as "relevant image content". We want to follow an information theoretical approach, where image content is characterized by the number of bits using a certain coding scheme. Then it is desirable to have strong feature response at locations where most bits are needed for a faithful coding, while features on purely homogeneous areas are less important.

Therefore we will derive a measure d for the incompleteness of local feature sets, which takes small values if a feature set covers image content in a similar manner as a good compression algorithm would. We will model d as the difference between a feature coding density pc derived from a set of local features, and an entropy density pH, both related to a particular image. The entropy density is closely related to the image coding scheme used in JPEG. Although it is not a precise model of the image, we will show that it is a reasonable reference.

Fig. 1 Sets of extracted features on three example images. The feature detectors here are substantially different: LOWE (Lowe 2004) fires at dark and bright blobs, EDGE (Förstner 1994) yields straight edge segments, and SFOP0 (Förstner et al. 2009, using α = 0 in (7)) extracts junctions

We do not intend to give a general benchmark for detectors. More specifically, we do not claim sets of features with highest completeness to be most suitable for a particular application. Task usefulness can only be determined by investigating all of the four properties listed above. Therefore we assume that a set of feature detectors with comparable sparseness, robustness and speed is preselected before using our evaluation scheme.

The entropy density pH also leads to a new scale and rotation invariant window detector, which explicitly searches for locations over scale having maximum entropy. The detector uses a fractal image model to take pixel correlations into account. It will be described and included in the investigations.

The paper is organized as follows. In Sect. 2 we give an overview on popular feature detectors and related work in image statistics. Section 3 covers the derivation of the entropy and feature coding densities together with an appropriate distance metric d, and introduces the new entropy-based window detector. An evaluation scheme based on d is described in Sect. 4.1, followed by experimental results for various detectors over different image categories, including a broad range of mixed feature sets (Sect. 4.2). We finally conclude with a summary and outlook in Sect. 5.

2 Related Work

2.1 Local Feature Detectors

Tuytelaars and Mikolajczyk (2008) emphasized that most local feature detectors are in fact extractors, and that the features themselves are usually covariant w.r.t. distortions of the image. However, we use the term "detector" for all procedures which extract some relevant features, irrespective of their invariance properties.

A broad range of local feature detectors with different properties is available today. They are often divided into corner detectors, blob detectors and region detectors, the latter two representing two-dimensional features as opposed to corners, which have dimension zero. While blobs may also be seen as regions referring to their dimension, the distinction is reasonable: With "blobs" we usually denote features attached to a particular pixel position, representing dark or bright areas around the pixel, while regions are explicitly determined by their boundaries. We also need to take detectors of one-dimensional features into account, namely edge detectors, as they explicitly focus on boundaries of regions and possibly build corner and junction points.

We give a short overview on the most popular detectors, and refer to Tuytelaars and Mikolajczyk (2008) for a detailed description. Before we start, however, we want to mention the interesting work of Corso and Hager (2009), who search for different scalar features arising from kernel-based projections that summarize content. This approach addresses our idea of preferring complementary feature sets, but is very different from classical feature detectors.

2.1.1 Blob and Region Detectors

The scale invariant blob detector proposed by Lowe (2004), here denoted as LOWE, is by far the most prominent one. It is based on finding local extrema of the Laplacian of Gaussians (LoG) ∇²τ g = ∇²τ ∗ g, where g is the image function and τ is the scale parameter, identified with the width of the smoothing kernel for building the scale space. The LoG has the well-known Mexican hat form, therefore the detector conceptually aims at extracting dark and bright blobs on characteristic scales of an image. The underlying principle of scale localization has already been described by Lindeberg (1998b). To gain speed, the LOWE detector approximates the LoG by differences of Gaussians (DoG).

The Hessian affine detector (HESAF) introduced by Mikolajczyk and Schmid (2004) is theoretically related to LOWE, as it is also based on the theory of Lindeberg (1998b) and relies on the second derivatives of the image function over scale space. It evaluates both the determinant and the trace of the Hessian of the image function, the latter one being identical to the Laplacian introduced above. In Mikolajczyk et al. (2003), the response of an edge detector is evaluated on different scales for locating feature points. Then, in a similar fashion as for HESAF, a maximum in the Laplacian scale space is searched at each of these locations to obtain scale-invariant points. Therefore this detector, referred to as EDGELAP, computes a high amount of blobs located near edges with some non-homogeneous signal on at least one side.

Regions, as opposed to blobs, determine feature windows based on their boundary, and thus have a strong relation to image segmentation methods. A very prominent affine region detector is the Maximally Stable Extremal Region detector (MSER) proposed by Matas et al. (2004). The idea is to compute a watershed-like segmentation with varying thresholds, and to select such regions that remain stable over a range of thresholds. The MSER detector is known to have very good repeatability especially on objects with planar structures, and is widely used especially for object recognition. Another example is the intensity-based region detector IBR proposed by Tuytelaars and Van Gool (2004). Here, local maxima of the intensity function are first detected over multiple scales. Subsequently a region boundary is determined by seeking peaks of the intensity function along rays radially emanating around these points. It has been observed that the feature sets extracted by MSER and IBR are fairly similar (Tuytelaars and Mikolajczyk 2008). The direct output of both algorithms can be any closed boundary of a segmented region. Within this work, however, we will model feature patches by elliptical shapes, following most existing evaluations.

A different approach is taken by Kadir and Brady (2001), who estimate the entropy of local image attributes over a range of scales and extract points exhibiting entropy peaks in this space with significant magnitude change of the local probability density. Their "salient regions" detector (SALIENT) has later been extended to affine invariance (Kadir et al. 2004). It typically extracts many features, but is known to have significantly higher computational complexity and lower repeatability than most other detectors (Mikolajczyk et al. 2005).

2.1.2 Corner and Junction Detectors

Corners have been used extensively since the early works of Förstner and Gülch (1987) and Harris and Stephens (1988). Both of these detectors in principle provide rotation invariance and good localization accuracy. The detectors are based on the second moment matrix, or structure tensor,

Mτ,σ = Gσ ∗ (∇τ g ∇τ gᵀ) = Gσ ∗ | gx,τ²       gx,τ gy,τ |
                                | gx,τ gy,τ   gy,τ²     | ,   (1)

computed from the dyadic products of the image gradients. Here, τ is the smoothing parameter used for computing the derivatives, and σ is used for specifying the size of the integration window. Usually they are tied, i.e. by τ(σ) = σ/3. Strictly speaking, these detectors are window detectors: They compute optimal local patches for describing a feature instead of optimal feature locations, and usually fire close to junctions. Window detectors also cover circular symmetric (Förstner and Gülch 1987) and spiral type image features (Bigün 1990). More specifically, spirals are a generalization of junctions and circular symmetric features.

In recent applications rotation invariance is usually not sufficient. Therefore a number of attempts for exploiting the structure tensor over scale space have been proposed. The Harris affine (HARAF) detector (Mikolajczyk and Schmid 2004) computes the structure tensor on multiple scales to detect 2D extrema within each scale. The final points are determined by locating characteristic scales at these positions using the Laplacian, which effectively fires at blobs. This way the corners seen in the image usually get lost, as they do not exhibit extrema in the Laplacian. We therefore believe that HARAF should be considered a blob detector, in contrast to Tuytelaars and Mikolajczyk (2008), who refer to it as a corner detector.

Lindeberg (1998b) uses the junction model of Förstner and Gülch (1987), determines the differentiation scale by analyzing the local sharpness of the image, and chooses the integration scale by optimizing the precision of junction localization. The recent work of Förstner et al. (2009) proposes a scale space formulation that directly exploits the structure tensor and the general spiral feature model to detect scale-invariant features (SFOP). It includes junctions as a special case and generalizes the point detector in Förstner (1994). The authors have shown that the scale and rotation invariant features have repeatability comparable to that of LOWE.

Extensions from scale to affine invariant detectors usually also rely on the structure tensor, by undistorting a local image patch such that the structure tensor becomes isotropic. This implicitly yields elliptical image windows.

2.1.3 Edge and Line Features

Edge or line features are usually obtained by first computing the response of a pixel-wise detector, e.g. using the structure tensor, followed by a grouping stage to obtain connected straight or curved segments. In object recognition, such one-dimensional features were historically considered less often due to the lack of a robust wide-baseline matching technique. However, some promising approaches have been proposed. Bay et al. (2005) choose weak intensity-based matching using two quantized color-histograms on each side of an oriented straight line segment, followed by a sophisticated topological filter and boosting stage for both eliminating outliers and iteratively collecting previously discarded inliers. Meltzer and Soatto (2008) proposed a technique for computing robust descriptors for scale-invariant edge segments (Lindeberg 1998a) with possibly curved shape. They achieve impressive results for shape and object recognition.

We refer to other literature for an overview of edge and line detection methods, e.g. Heath et al. (1996). The focus of this initial study on the completeness of detectors is on zero- and two-dimensional features. However, we will include a straight edge segment detector (EDGE) based on the framework by Förstner (1994) in our investigations.

2.1.4 Performance Evaluation for Feature Detectors

As a generally accepted criterion, the repeatability of extracted patches under changes of viewpoint and illumination has been investigated in mostly planar scenes (Mikolajczyk et al. 2005) and on 3D structured objects (Moreels and Perona 2007). Recent work has also focussed on localization accuracy (Haja et al. 2008; Zeisl et al. 2009) and the general impact on automatic bundle adjustment (Dickscheid and Förstner 2009). Benchmarks usually evaluate each method separately, not addressing the positive effect of using a combination of multiple detectors, which may be very useful in many applications (Dickscheid and Förstner 2009). For example, Bay et al. (2005) propose to use edge segments together with regions for calibrating images of man-made environments with poor texture. Their approach benefits from stable topological relations between features of complementary detectors.

Completeness of feature detection has received less attention up to now. Perdoch et al. (2007) use the notion of "coverage" for stating that a good distribution of features over the image domain is favorable over a cluster of features in a narrow region. This concept is strongly related to completeness. In the interesting work of Lillholm et al. (2003), the ability of edge- and blob-like features to carry image information is investigated based on a model of retinal cells and considering the task of image reconstruction. The authors show that features based on the second derivatives of the image, namely blobs, carry most image information, while fine details are coded in the first order derivatives, namely edges. They also report that increasing the number of edge features effectively improves the image details, while increasing the number of blobs is less rewarding, and emphasize the complementarity of these two feature types.

Neither Perdoch et al. (2007) nor Lillholm et al. (2003) formalize their notions of coverage or complementarity. To our best knowledge a tool for measuring the complementarity of different detectors is still lacking.

2.2 Information Contained in an Image

Describing the information content of an image can be approached from at least two sides: by analyzing the visual system, following a biologically inspired approach, or by investigating the image statistics, following an information theoretic approach (Shannon 1948). An excellent review on these paradigms, together with a mathematical formalization, is given by Mumford (2005).

The biologically inspired approach of Marr (1982) identified relevant image information with the so-called primal sketch, which mainly refers to the blobs, edges, ends and zero-crossings that can be found in an image. Such a representation in principle is achieved using a combined set of different feature detectors. Marr's approach has been supported by the recent work of Olshausen and Field (1996). Extracting image features for subsequent processing in image analysis can actually be seen as mimicking the visual system. Higher level structures, resulting from grouping processes, are essential for interpretation and give rise to models like unsupervised clustering for segmentation (Puzicha et al. 1999) or grammars (Liang et al. 2010).

Following Mumford (2005), at least two properties characterize the local image statistics: (1) The histograms of intensities or gradients are heavy tailed, (2) the power spectrum Pg(u) of an image g(x) falls off with a power of the frequency u. For large frequencies one observes

Pg(u) ∝ 1/u^b   (2)

with exponents b in the range 1 to 3, smaller exponents representing rougher images.

The entropy of a signal depends on the probability distribution of the events, which is usually not known and hence has to be estimated from data. In spite of the empirical findings, approximations are frequently used to advantage. Restricting to second order moments, thus variances and covariances, one implicitly assumes the signal to be a Gaussian process, neglecting the long tails of real distributions. As an example, Bercher and Vignat (2000) proposed a general method for estimating the entropy of a signal by modeling the unknown distribution as an autoregression process, focusing on finding a good approximation with tractable numerical integration properties. The widely accepted JPEG image coding scheme is also based on entropy encoding. It uses the Discrete Cosine Transformation (DCT) and implicitly assumes that second order statistics are sufficient. Other coding schemes, e.g. based on wavelets, follow a similar argument (Davis and Nosratinia 1998). The "salient region" detector by Kadir and Brady (2001) derives the entropy from the histogram of an image patch, neglecting the interrelations between intensities within the patch.

In the context of this paper we are only concerned with local image statistics, not taking higher level context into account.

3 Completeness of Coding with Image Features

3.1 Completeness vs. Coverage

A simple approach for evaluating the completeness of a feature set would measure the amount of image area covered by the local patches, as applied in Perdoch et al. (2007) or Dickscheid and Förstner (2009). Such approaches have a number of shortcomings, as illustrated in Fig. 2.

1. Full completeness can be achieved using a few features with large scales as in Fig. 2(a), or a regular grid of adjacent features. However, such feature sets do not code the image structures well. We would like a good measure not to overrate such feature sets.
2. Superimposed features on different scales, as in Fig. 2(b), would not contribute to the completeness, although often carrying important additional information. We would like a proper measure to take such effects into account.
3. Fine capturing of local structures, as in Fig. 2(c), is not rewarded. As a consequence, adding complementary features may not converge towards full completeness, while adding random features often would. A proper measure should show the opposite behavior.
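The coverage notion criticized above is easy to state operationally: rasterize the union of feature patches and take the covered area fraction. The short sketch below (NumPy; the circular feature model, the toy feature sets and all function names are our own illustration, not part of the paper) shows why it is too crude: a single large-scale feature and a grid of small well-placed features reach nearly the same coverage, although they code the image very differently.

```python
import numpy as np

def coverage(shape, circles):
    """Fraction of pixels covered by the union of circular feature patches."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    covered = np.zeros(shape, dtype=bool)
    for x0, y0, r in circles:
        covered |= (xx - x0) ** 2 + (yy - y0) ** 2 <= r ** 2
    return covered.mean()

# one large-scale feature vs. a grid of small ones (cf. Fig. 2(a) and (c))
big = [(16, 16, 16)]
small = [(x, y, 4) for x in (4, 12, 20, 28) for y in (4, 12, 20, 28)]
c_big, c_small = coverage((32, 32), big), coverage((32, 32), small)
print(c_big, c_small)   # nearly equal, although the sets differ strongly
```

Coverage alone therefore cannot distinguish feature sets of very different quality.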

Obviously we need a better measure. Consider the image signal in a diagonal stripe across the checkerboard image depicted in Fig. 3(1). The intuitive amount of relevant image information contained in small local patches along the stripe is depicted in Fig. 3(2): It is zero on the image border, has peaks at the checkerboard corners, especially in the center of the checkerboard, and is also high within the checkerboard patches. Figure 3(3–5) shows coverage profiles along the image diagonal, obtained by taking the overlap of the features in Fig. 2(a–c) into account. The similarity between these coverage profiles and the profile in the second row would obviously give a better measure: The additional feature in Fig. 3(4) is rewarded w.r.t. Fig. 3(3), and the reasonable coverage of complementary features in Fig. 3(5) is most similar to the reference profile. The measure that we derive in the following is modelled after this principle.

Fig. 2 Illustration of three simple feature sets (circles) covering an image of a checkerboard: (a) single feature on a large scale, (b) two superimposed features with different scales, (c) mixed set of junction and blob features. Obviously (c) captures the image structure most completely. The dashed line indicates the stripe used in Fig. 3

Fig. 3 From top: (1) Image stripe across the main diagonal line of the checkerboard in Fig. 2, (2) profile of the intuitive amount of relevant image information along the stripe, (3–5) coverage profiles along the stripe for the feature sets in Fig. 2(a–c). The profiles take overlap into account

3.2 Basic Principle

As motivated in the introduction, we want to interpret feature detection and description as coding of image content, so we propose to measure the completeness of local feature extraction by a comparison with classical coding schemes. We will apply three quantities for deriving the measure (Fig. 5):

1. Representation of local image content. We need a reasonable representation for the parts considered to reflect "relevant information" within an image, corresponding to Fig. 3(2).
2. Representation of local feature content. We need a meaningful representation of the content that is coded by a particular set of local features, roughly related to Fig. 3(3–5).
3. Distance measure. Having two such representations, we need a proper distance measure. If the distance vanishes, the features are supposed to capture the image content completely, so it should reflect the incompleteness of the local image information preserved by the features.

For representing local image content, we use the entropy density pH(x). It is based on the local image statistics and gives us the distribution of the number of bits needed to represent the complete image over the image domain. Thus, if we need H(I) [bits] for coding the complete image, we can use pH for computing the fraction of H(I) needed for coding an image region R by HR = H(I) ∫x∈R pH(x) dx. The number of bits will be high in busy areas, and low in homogeneous areas of the image, as depicted in the right column of Fig. 6. For taking into account features of different scales at the same image position, we compute the local entropies over scales, as will be explained in Sect. 3.5.

For the local feature content, a feature coding density pc(x) is directly computed from each particular set of local features. Each region covered by a feature is represented with an anisotropic Gaussian distribution spreading over the image domain. We assume that a certain number of bits is needed to represent each feature, i.e. the area covered by each Gaussian. Within this paper, the number of bits remains constant for all features, but one may also derive it explicitly for each feature. The coding density pc(x) is finally obtained as the normalized sum of these Gaussians. We have illustrated such coding densities for different feature sets in the center columns of Fig. 6.

Figure 4 shows the profiles of some feature coding and entropy densities along the diagonal of a noisy image of a 4 × 4 checkerboard with grey background. These plots can be compared to the introductory illustration in Fig. 3, visually indicating that the proposed representations have the desired properties.

The incompleteness d is determined as the distance between pH(x) and pc(x), using the Hellinger metric. When pc is close to pH, and thus d is small, the image is efficiently covered with features, and the completeness is high. We hereby require busy parts of images to be densely covered with features, and smooth parts not to be covered with features, as motivated in Sect. 3.1.

We now describe the individual quantities used in our approach in more detail.
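The construction of the coding density and the Hellinger distance can be sketched in a few lines. Here pc is the normalized sum of anisotropic Gaussians as described above; since the entropy density pH of the paper is only derived later from local image statistics, a second coding density merely stands in as the reference in this toy example, and all function names and feature parameters are our own.

```python
import numpy as np

def coding_density(shape, features):
    """Feature coding density pc: normalized sum of anisotropic Gaussians,
    one per feature window (center x0, y0 and 2x2 covariance Sigma)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    pts = np.stack([xx, yy], axis=-1).astype(float)        # (h, w, 2)
    pc = np.zeros(shape)
    for x0, y0, Sigma in features:
        d = pts - np.array([x0, y0])
        # squared Mahalanobis distance of every pixel to the feature center
        md2 = np.einsum('hwi,ij,hwj->hw', d, np.linalg.inv(Sigma), d)
        pc += np.exp(-0.5 * md2)                           # unnormalized Gaussian
    return pc / pc.sum()

def hellinger(p, q):
    """Hellinger distance between two discrete densities on the same grid."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

feats = [(8.0, 8.0, 4.0 * np.eye(2)), (24.0, 24.0, 4.0 * np.eye(2))]
pc = coding_density((32, 32), feats)        # both feature clusters
pH = coding_density((32, 32), feats[:1])    # stand-in reference: first cluster only
print(hellinger(pc, pc), hellinger(pH, pc)) # 0 for identical densities, > 0 otherwise
```

The Hellinger distance is bounded by [0, 1], so vanishing distance corresponds to full completeness in the sense of the text.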

Fig. 4 Profiles of the entropy density p_H and different feature coding densities p_c along the diagonal stripe of a noisy image of a checkerboard, now with 4 × 4 patches and a grey border (top row, cf. Fig. 10). The profiles correspond to the diagonal slices of the distributions. The combination of LOWE, SFOP0 and EDGE is most similar to the entropy p_H(x). The regular grid of features is not very similar, but may have higher completeness than a very sparse detector. That is why we recommend to use the evaluation scheme on feature sets of comparable sparseness

3.3 Representing Image Content by Local Entropy

The total number of bits required for coding an image with a certain fidelity can be derived using rate distortion theory, which has already been discussed in the classical paper of Shannon (1948) and worked out by Berger (1971). The idea is to determine a lower bound for the number of bits that are necessary for coding a signal, given a certain error tolerance. We will use this theory for determining how the required bits are distributed over the image, and derive a pixel-related measure from the local image statistics of a square patch.

We want to derive the minimum number of bits needed to code the local area at each pixel, based on the statistics of the surrounding image patch, with the requirement that the loss of information is below a certain error tolerance. The tolerance is expressed by the mean square discrepancy D0 = E[(f − f̂)²], where the given signal f is approximated by a reconstruction f̂. We first derive the number of bits required for lossy coding of a single Gaussian variable and then generalize to Gaussian vectors with correlations. In order to obtain a reasonable estimate without knowledge of the correlations, we finally assume the image patch to be the representative part of a doubly periodic infinite signal, and will derive the number of bits from the power spectrum.

Using a Gaussian model for the image patch is an approximation that has been used by other authors before. For example, Lillholm et al. (2003) use it to motivate different norms for regularization during image reconstruction, assuming either uncorrelated pixels or white noise for the gradients. At first the choice is pragmatic, as the model allows us to exploit the power spectrum, which considers only first and second moments. Due to the lack of an alternative model, we cannot verify the influence of this approximation on our results theoretically. However, in our experiments we will investigate in how far the model leads to meaningful results, and claim that it is reasonable for the given problem.

3.3.1 Model for the Signal of an Image Patch

We assume the signal g(x) in a J × J = K image patch to be the noisy version of a true signal f(x). The true signal and the noise are assumed to be mutually independent stochastic variables. While f(x), vectorized to f, has arbitrary mean μ_f and covariance matrix Σ_ff, we assume the noise n to have zero mean and diagonal covariance Σ_nn = N0 I_K. The model hence reads

g = f + n (3)

with

f ~ N(0, Σ_ff), n ~ N(0, N0 I_K), Σ_nf = 0. (4)

The vanishing covariance Σ_nf between signal and noise reflects the assumed mutual independence.

3.3.2 Entropy of a Lossy Coded Single Gaussian

Assume that we want to code a signal y which is a sample of a Gaussian with a given mean and variance V_y. Its differential entropy is (Bishop 2006, eq. 1.110)

H(y) = (1/2) log2(2πe V_y) = (1/2) log2(2πe) + (1/2) log2 V_y. (5)

We consider y to be the signal that has originally been emitted, and want the reconstructed signal ŷ to lie within an error tolerance E[(y − ŷ)²] = D0 after transmission. Under the further assumption that the transmission process is ergodic, i.e. that a very long sample of the transmission will be typical for the process with probability near 1, rate distortion theory gives the following result (Davisson 1972):

– We only need to transmit the signal if V_y ≥ D0. If V_y is smaller, D0 dominates, hence we have

  D = min(D0, V_y). (6)

– The minimum number of bits R_y—the rate—required for the transmission is given by the mutual entropy of y and ŷ, sometimes also denoted as mutual information.
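As a numerical illustration of these two cases (our own sketch, not code from the paper), the rate for a single Gaussian variable can be evaluated directly; the clipping at zero corresponds to the case V_y < D0 in (6):

```python
import numpy as np

def gaussian_rate(Vy, D0):
    """Minimum bits for one Gaussian sample with variance Vy at
    mean-square distortion D0: the difference of the differential
    entropies (5) of signal and distortion, clipped at zero."""
    return max(0.0, 0.5 * np.log2(Vy / D0))

print(gaussian_rate(4.0, 1.0))  # 1.0: variance four times the tolerance
print(gaussian_rate(0.5, 1.0))  # 0.0: below tolerance, nothing to transmit
```

Halving the distortion tolerance adds exactly half a bit per variable; one full bit is gained per factor of four in variance.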

Fig. 5 Workflow of the evaluation scheme described in Sect. 4.1.1 for a single image, using only two different sets of features. L2 is a combined set of junctions, edges and blobs, while L1 is its subset of junctions. Having |L1| ≪ |L2|, we expect d1 > d2. Indeed the density p_c2 looks much more similar to the entropy distribution p_H than p_c1. If d1 > d2 did not hold, we would conclude that the blobs and edges {L2 \ L1} do not complement the junctions L1 well

This mutual information is

R_y = H(y) − H(y|ŷ) (7)
    = (1/2) log2(2πe V_y) − (1/2) log2(2πe D0) (8)
    = (1/2) log2(V_y / D0). (9)

Here we use the assumption that the noise and the true signal are mutually independent, hence that their entropies add. Therefore the conditional entropy—the second part of (8)—is simply the entropy of the noise, i.e. the number of bits needed for coding the distortion.

Combining (9) and (6), we have the required number of bits for a lossy coded single Gaussian (Davisson 1972, eq. (11))

R_y = max(0, (1/2) log2(V_y / D0)). (10)

We use this approximation for coding, assuming that the power spectrum of the local patch, from which we determine R_y, is representative for its statistical properties.

3.3.3 Entropy of a Lossy Coded Gaussian Vector

We now want to generalize (10) to the case of a stochastic Gaussian K-vector y ~ N(μ_y, Σ_yy) with correlations between its components. Using the factorization of the covariance matrix Σ_yy into eigenvectors and eigenvalues

Σ_yy = Σ_{k=1}^{K} λ_k² r_k r_kᵀ, (11)

we may use the alternative representation

y = μ_y + Σ_{k=1}^{K} r_k z_k with z_k ~ N(0, λ_k²). (12)

This representation allows us to effectively replace the original components of y by K stochastically independent Gaussian variables z_k. We therefore can determine the total entropy for lossy coding of the stochastic vector y as the sum of the entropies for the individual mutually independent z_k (Davisson 1972, eq. (23))

R_y = Σ_{k=1}^{K} max(0, (1/2) log2(λ_k² / D0)). (13)

This expression depends essentially on the eigenvalues λ_k² of the covariance matrix Σ_yy of the signal to be coded.

Fig. 6 Estimated entropy densities p_H(x) and feature coding densities p_c(x) on three different images. We illustrate feature coding densities for LOWE separately, and for a combination of LOWE, EDGE and SFOP0. Combining complementary features (third column) seems to converge towards the entropy (right column). Observe that the white area in the image of the middle row is below the noise level, thus not considered by the entropy, while the sand in the bottom row carries a small amount of information

3.3.4 Entropy of a Local Image Patch

The covariance matrix of a local image patch is not known, hence we do not have the eigenvalues for applying (13) directly onto an image patch. Therefore we assume the image patch to be the central part of an infinite periodic signal, and use the power spectrum for deriving the eigenvalues of the covariance matrix of the intensities in the image patch.

Let F = [F_jk] = [exp(−2πi jk/K)] be the K × K Fourier matrix. Then the discrete Fourier transform of the cyclical signal y(k) with K-vector y is Y = F y, and the power spectrum of y(k) is

P_y(u) = (1/K) |Y(u)|². (14)

One can show that the eigenvalues of the covariance matrix Σ_yy of y are identical to the elements of the power spectrum P(u) of the periodic signal y(k): We use the unitary matrix

F̄ = (1/√K) F, F̄ F̄* = F̄* F̄ = I_K, (15)

where F̄* is the transposed complex conjugate of F̄. It is equivalent to a rotation matrix. The empirical covariance matrix of y(k) then is Σ_yy = (1/K) F* Diag(P_y(u)) F. Finally this can be written as

Σ_yy = F̄* Diag(P_y(u)) F̄, (16)

showing the entries of the power spectrum to be the eigenvalues of the covariance matrix.

If the patch is a doubly periodic two-dimensional signal f(x) of size K = J × J, the power spectrum P_f(u) = P_f(u1, u2) depends on the row and column frequencies u1 and u2. Thus the squared eigenvalues λ_k² of Σ_ff are λ_k² = P_f(u), the index k capturing the indices u = [u_k]. Given the power spectrum, we obtain the entropy of a K = J × J patch f(x) by (Davisson 1972, discussion after eq. (36))

R_f = Σ_{u ≠ (0,0)} max(0, (1/2) log2(P_f(u) / D0)). (17)

Note that we omit the DC term P_f(0), as it represents the mean of the signal and we are only concerned with the variance.

Finally, as the true signal is not available, but only the observed signal as in (3), we need to estimate the power spectrum P_f. From the structure of the covariance matrix Σ_gg = Σ_ff + Σ_nn we know that P_g(u) = P_f(u) + P_n(u). Therefore we choose to estimate the power spectrum of the true signal by

P̂_f(u) = max(0, P_g(u) − N0). (18)
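A compact sketch of how (14), (17) and (18) interact (our own illustration; function and variable names are ours): the patch is Fourier transformed, the noise floor N0 is subtracted from the power spectrum, the DC term is dropped, and the remaining entries are coded like independent Gaussians as in (13).

```python
import numpy as np

def patch_rate(patch, N0, D0):
    """Bits for lossy coding one patch: power spectrum as in (14),
    noise subtraction as in (18), summed Gaussian rates as in (17)."""
    K = patch.size
    Pg = np.abs(np.fft.fft2(patch))**2 / K   # power spectrum, eq. (14)
    Pf = np.maximum(0.0, Pg - N0)            # estimated true spectrum, eq. (18)
    Pf.flat[0] = 0.0                         # omit the DC term (the patch mean)
    nz = Pf[Pf > D0]                         # entries at or below D0 cost 0 bits
    return float(np.sum(0.5 * np.log2(nz / D0)))

flat = np.full((16, 16), 0.5)                # homogeneous patch: only DC energy
edges = np.kron(np.eye(2), np.ones((8, 8)))  # block structure: strong AC energy
print(patch_rate(flat, N0=1e-4, D0=1e-4))    # 0.0
print(patch_rate(edges, N0=1e-4, D0=1e-4) > 0)  # True
```

A homogeneous patch carries no bits once the DC term is removed, while structured content does, which is exactly the behavior the entropy density rewards.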

Fig. 7 Detail of the example image depicted in Fig. 1 left. Note that the regions are plotted with radius equal to the scale of the feature. The support region used for computing descriptors is usually larger but proportional

We then obtain the final entropy per patch as the rate

R_f = Σ_{u ≠ (0,0)} max(0, (1/2) log2( max(0, P_g(u) − N0) / D0 )). (19)

We estimate the standard deviation of the noise N0 for each input image separately using the approach of Förstner (1998). As a minimum, we use the rounding error, identified by the standard round-off error variance D0 = N0 = Δ²/12 (Smith 2007). The quantization unit Δ depends on the actual image representation, e.g. Δ = 1/256 for 8-bit intensities.

As the estimated number of bits spreads over the local patch, however, we assign to the pixel only the corresponding fraction according to the patch size, i.e.

R_f^(p)(x, M) = (1/(2K)) Σ_{u ≠ (0,0)} max(0, log2( P̂_f(u) / D0 )). (20)

In the end we use the Discrete Cosine Transform Type-II instead of the Discrete Fourier Transform. It suppresses effects of signal jumps at the borders of the window onto the spectrum, while preserving the frequencies except for a factor of 2 (Rosenfeld and Kak 1982, p. 159). The equivalence of the power spectrum entries with the eigenvalues of the covariance matrix therefore still holds.

We cannot proceed with a fixed patch size J, as can be seen from the image detail shown in Fig. 7: The large feature in the middle covers content with rich information, although a smaller patch placed in its center for computing the entropy might have minimal entropy. Like in wavelet transform coding, we assume that the coding is hierarchical for the entropy density, and choose to integrate information from different scales in order to take such scale effects into account. For doing so, we compute the sum

H(x) = Σ_{s=1}^{S} R_f^(p)(x, 1 + 2^s). (21)

In our experiments we use S = 7, so the patch size is limited to 3 ≤ M ≤ 129. Observe that the omission of the DC term in a lower scale is compensated by the coding in higher scales. Taking the sum in (21) is a pragmatic choice. We did not yet investigate to which extent this is an approximation regarding possible correlations between scale levels. As will be seen in the next section however, we handle superimposed features at different scales in a very similar manner when computing the feature coding densities.

Finally we obtain the entropy density by normalizing (21):

p_H(x) = H(x) / Σ_z H(z). (22)

The expected number of bits in a certain region R therefore is H^(I) Σ_{x∈R} p_H(x), where H^(I) is the total number of bits for the complete image. However, for comparing the entropy distribution over the image with a particular set of local features, we cannot use absolute values (i.e. bits per pixel), but only relative values. This is why we take p_H(x) as reference. In the right column of Fig. 1, this entropy density is depicted for two example images.

3.4 Representing Local Feature Content

The local feature content is a density comparable to p_H(x), representing a set of L local features L = {l_1, ..., l_L}. Each local feature represents a certain image region with known shape and is represented as a tuple l_l = {x_l, Σ_l, d_l, c_l}, where

– x_l is the spatial location of the feature in the image,
– Σ_l is a scale matrix representing the shape of the feature window, using the model of a 2D ellipse,
– d_l is a vector used as a description of the feature window, and
– c_l is the number of bits required for d_l, coding the whole region R_l. The number of bits may be different between feature types, but we usually assume a constant number of bits for all features.

Positioning Σ_l at x_l gives us the actual region R_l covered by the feature. For straight edge segments, identified by their start- and endpoints (x_S, x_E)_l, we use x_l = (x_{S,l} + x_{E,l})/2 and use the direction of the edge as the major semi-axis of Σ_l with length |x_{E,l} − x_{S,l}|/2, while setting the length of the minor semi-axis to 1 [pel].

If we spread the bits c_l uniformly over each region R_l, we obtain a feature coding map

c(x) = Σ_{l=1}^{L} c_l · 1_{R_l}(x) / |Σ_l|, (23)

where 1_{R_l}(x) is an indicator function, being 1 within the region R_l, and 0 outside. In order to emphasize the information next to a feature's center, it is common practice to apply Gaussian weighting within the local patches before computing feature descriptors (Lowe 2004, 6.1).
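The uniform coding map (23) can be sketched as follows (our own code; as a discretization choice we divide by the number of pixels inside the ellipse rather than by |Σ_l|, so that the c_l bits of each feature are distributed exactly):

```python
import numpy as np

def uniform_coding_map(shape, features):
    """Feature coding map c(x) in the spirit of eq. (23): each feature
    spreads its c_l bits uniformly over the elliptical region R_l
    obtained by positioning its scale matrix Sigma_l at (x_l, y_l)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    c = np.zeros(shape)
    for x0, y0, Sigma, bits in features:     # tuples (x_l, y_l, Sigma_l, c_l)
        Si = np.linalg.inv(Sigma)
        dx, dy = xx - x0, yy - y0
        # quadratic form <= 1 marks the interior of the 2D ellipse
        inside = Si[0, 0]*dx*dx + 2*Si[0, 1]*dx*dy + Si[1, 1]*dy*dy <= 1.0
        c[inside] += bits / inside.sum()     # uniform share of the c_l bits
    return c

# One isotropic feature of radius 3 carrying a 128-bit description:
c = uniform_coding_map((20, 20), [(10.0, 10.0, 9.0 * np.eye(2), 128.0)])
print(abs(c.sum() - 128.0) < 1e-9)           # True: all bits accounted for
print(c[10, 10] > 0 and c[0, 0] == 0.0)      # True: support is local
```

Replacing the indicator by a Gaussian weight yields the smoother variant used in the paper's experiments.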

It is therefore reasonable to replace the uniform distribution by a Gaussian and derive the feature coding map by

c(x) = Σ_{l=1}^{L} c_l G(x; x_l, Σ_l). (24)

Then the actual density to be compared with the entropy density p_H(x) is

p_c(x) = c(x) / Σ_z c(z). (25)

We illustrate feature coding densities of some example images in Fig. 6.

3.5 Evaluating Completeness of Feature Detection

In case the empirical feature coding density p_c(x) were identical to the entropy density p_H(x), coding the image with features would be equivalent to using image compression. Of course image features may use less or more bits for coding the complete image, depending on the coding of the individual feature, so we do not compare the absolute number of bits per pixel, but their densities. We use Hellinger's metric for measuring the difference between p_H(x) and p_c(x):

d(p_H(x), p_c(x)) = sqrt( (1/2) Σ_x ( √p_H(x) − √p_c(x) )² ). (26)

Hellinger's metric is computed from the pixel-wise differences of the square roots of the densities, and may be explained as illustrated in Fig. 8: Each of the square-rooted densities can be thought of as a unit vector in the space of all possible densities over the image, the dimension of this space being equal to the number of pixels. Then Hellinger's distance between the two densities is proportional to the Euclidean distance between the piercing points on the unit sphere, so it basically relates to the angle between the vectors representing each density.

Fig. 8 Hellinger's metric d illustrated for two sample distributions p_H and p_c of an image with two pixels x1, x2

Note that although one often uses the Kullback-Leibler divergence for comparing density functions, it is not useful here: It is no metric, and a larger divergence does not necessarily indicate a larger deviation from the reference density.

3.6 Euclidean Embedding of Detectors

In Sect. 3.5 we derived as an incompleteness measure the distance d between the entropy p_H distributed over an image and the information p_c coded with a set of local features. In addition to comparing image codings with entropy, we can also measure distances between pairwise feature codings p_ci and p_cj. This allows us to build up a distance network in high dimensional space representing the similarity or complementarity of feature detectors.

From such a network of N detectors we can determine their relative positions in an (N − 1)-dimensional space using a technique called "Euclidean embedding" (Cayton and Dasgupta 2006). First the matrix of squared distances D: [D_ij] = d²(p_ci, p_cj) is centered using the centering matrix H = I − (1/N) 1 1ᵀ, with 1 being a vector filled with ones:

B = −(1/2) H D H. (27)

Computing the spectral decomposition B = U Λ Uᵀ, and ensuring positive eigenvalues [Λ₊]_ij = max(Λ_ij, 0), we obtain relative positions X = [x_n] for each feature detector:

X = Λ₊^{1/2} Uᵀ. (28)

This way we end up with a mapping for feature detectors, taking only one image into account. By evaluating the mean of the distances d between each two codings p_ci and p_cj, we can average the positions X over multiple images, since they are supposed to be rather invariant w.r.t. image content. In order to propagate the variance of the distances d into the detector space, we perform the embedding for each image individually. Afterwards we transform the detector positions X into one common reference frame, taking changes in orientation and directions of the eigenvectors into account.

The (N − 1)-dimensional detector space can be visualized by neglecting small principal components. Figure 9 illustrates the mapping of some detectors into 3D space. The ellipsoids represent an approximation of their distribution. Some well-known relationships between the detectors become visually apparent in the plot, as for example between IBR and MSER.

3.7 A Feature Detector Based on Local Entropy

So far we discussed two strategies for selecting the characteristic scale of a feature: the Laplacian, or Difference of Gaussian, scale space; and the structure tensor. Both methods have been summarized in Sect. 2. The entropy density p_H that we developed in the previous section gives rise to a third paradigm, namely using the local entropy for building the scale space.

To our knowledge, Kadir and Brady (2001) were the first to present a local feature detector based on information theory (SALIENT). They estimate the entropy of local image attributes over a range of scales and extract points at peaks with significant magnitude change of the local probability density. A set of such features is shown on the right side of Fig. 10. The number of features extracted by SALIENT is high compared to other detectors (see Table 1), resulting in good image coverage. However, by estimating entropy from local histograms, the approach ignores the unknown pixel correlations, which usually affect local image attributes. It has rather low repeatability compared to other detectors (Mikolajczyk et al. 2005).

In this section, we want to derive a scale-invariant feature detector which models the local entropy in the same way as when deriving the entropy density p_H proposed in Sect. 3.3. It extracts sparse sets of features near peaks of the entropy distribution p_H. One possible way to model the detector would be to build a stack of entropy distributions for varying patch sizes using (20), and then search for local extrema within 3 × 3 × 3 cells of the resulting cuboid. However, computing the power spectrum over all scales would result in high computational complexity. Therefore we take another approach: We express the entropy density using an approximate functional model for the power spectrum, depending only on the variances of the intensities, the image gradients and the image noise. These can easily be determined. The detector is included in our evaluation (Sect. 4.2), denoted by EFOP.

3.7.1 The Local Entropy for Rough Signals

We represent the power spectrum, which is a decaying function, by the model

P(u) = P0 (1 + u/u0)^(−1), (29)

taking the extreme exponent b = 1 in (2) and regularizing for small frequencies. This can be used to replace the power spectrum in (17), and allows us to approximate the local entropy as a function of the signal-to-noise ratio SNR and the effective bandwidth b_g of the signal in the local image content:

R_f = k · b_g · log2(1 + SNR²) (30)

with k ≈ 5.43. Here we use the relationships for the signal-to-noise ratio

SNR² = P0/N0 = V_g/V_n (31)

and the effective bandwidth (McGillem and Svedlow 1976, eq. (16))

b_g² = V_g′/V_g = 4π² (∫_{u=0}^{u_n} u² P(u) du) / (∫_{u=0}^{u_n} P(u) du), (32)

to rewrite (30) as

R_f ∝ (V_g′/V_g) log2(1 + V_g/V_n), (33)

which only depends on the variance of the image gradients V_g′, the variance of the gray-scale values V_g and the noise variance V_n. The derivation is outlined in the Appendix. Note that this model is only proportional to the approximate entropy (40), as we intend to perform a maximum search over the function.

Fig. 9 Visualization of the Euclidean embedding (Cayton and Dasgupta 2006) of detectors: Several feature detectors are mapped into a 3D subspace spanned by the three largest principal components of the centred distance matrix B. The ellipsoids approximate the distribution of the detectors over all considered images. Notice how the close theoretical relationship between the SFOP variants, MSER/IBR and HESAF/HARAF becomes visible in terms of proximity

3.7.2 Finding Maxima of Local Entropy over Scale

Following Shi and Tomasi (1994), we identify the variance of the gradients V_g′ with the smaller eigenvalue λ_2(M_{τ,σ}) of the structure tensor instead of using its trace (see Sect. 2.1.2). This avoids detecting points located on lines with poor localization accuracy. The integration scale σ then spans the space for selecting features at different scales, yielding the following scale space of the entropy rate:

R_f(x, τ, σ) = (λ_2(M(x, τ, σ)) / V_g(x, τ, σ)) log2(1 + V_g(x, τ, σ) / V_n(x, τ)). (34)
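The entropy-rate map (34) for one fixed scale pair can be sketched as follows (our own simplified code: Gaussian smoothing is done in the Fourier domain, the noise variance V_n is assumed known, and the maximum search over τ and σ that the detector performs is omitted):

```python
import numpy as np

def gauss_blur(a, sigma):
    """Periodic Gaussian smoothing via the FFT (sufficient for a sketch)."""
    h, w = a.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    G = np.exp(-2.0 * (np.pi * sigma)**2 * (fy**2 + fx**2))
    return np.real(np.fft.ifft2(np.fft.fft2(a) * G))

def entropy_rate(img, sigma, Vn):
    """Entropy-rate map in the spirit of eq. (34): smaller structure-tensor
    eigenvalue over the local intensity variance, times log2(1 + SNR^2)."""
    gy, gx = np.gradient(img)                  # pixel-scale derivatives
    A = gauss_blur(gx * gx, sigma)             # structure tensor entries,
    B = gauss_blur(gx * gy, sigma)             # integrated at scale sigma
    C = gauss_blur(gy * gy, sigma)
    lam2 = 0.5 * (A + C - np.sqrt((A - C)**2 + 4.0 * B**2))  # smaller eigenvalue
    mean = gauss_blur(img, sigma)
    Vg = np.maximum(gauss_blur(img**2, sigma) - mean**2, 1e-12)
    return (lam2 / Vg) * np.log2(1.0 + Vg / Vn)

rng = np.random.default_rng(3)
img = 0.01 * rng.standard_normal((32, 32))
img[12:20, 12:20] += 1.0                       # a bright square with corners
R = entropy_rate(img, sigma=2.0, Vn=1e-4)
print(R.shape == (32, 32))                     # True
```

Features would then be taken at local maxima of this map over position and over the integration scale σ.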

The close relationship of the structure tensor with the local information in an image has already been observed by McClure (1980, eq. (7)). The variance of the gray-scale values can be computed via

Fig. 10 A noisy image of a checkerboard, overlaid with features detected by the maximum-entropy detector EFOP described in Sect. 3.7 (left) and the one proposed by Kadir et al. (2004) (right). Note that we use σ as a radius for the features, while one would usually compute descriptors on a 2σ patch

4 Experiments

In the following we will apply the above-mentioned evaluation to a data set involving various image categories, and analyze the complementarity and completeness of both separate detectors and detector combinations. After describing the experimental setup in Sect. 4.1, we will present comprehensive results, and show in Sect. 4.2 how they coincide with the concepts of the detectors.

4.1 Experimental Setup

4.1.1 Evaluation Scheme

We use the distance measure d (26) derived in Sect. 3 for evaluating sets of local features. The procedure is illustrated for two sets L1, L2 in Fig. 5. We start by computing feature coding densities p_ci for each set L_i as well as the entropy distribution p_H of the image as a reference. Next we compute the distances d_i = d[p_H(x), p_ci(x)], and compare them. We typically draw conclusions of these types:

1. If d

Fig. 11 Example images from the datasets used for the experiments

provided by the author. Here we choose to build the scale space starting with the original instead of the double image resolution, for comparability with the other detectors.

Detectors based on the structure tensor For point-like features based on the structure tensor, we use the recently proposed scale invariant detector by Förstner et al. (2009), but distinguish the results into three classes:

– the subset of pure junction points (SFOP0) obtained when restricting to α = 0 (Förstner et al. 2009, eq. (7)),
– the set of pure circular features (SFOP90) with α = 90, and
– the complementary set of features with optimal α, i.e. the full power of the detector (SFOP).

Additionally we compare that to the non-scale-invariant detectors described in Förstner (1994) for junctions and circular symmetric features (FOP0, FOP90), as well as the classical detector HARRIS by Harris and Stephens (1988) with an implementation taken from Kovesi (2009). The maximum-entropy detector (EFOP) described in Sect. 3.7 is not specifically a point detector. It represents the class of window or texture detectors exploiting the structure tensor.

Line and edge detectors Here we restrict to a straight edge detector (EDGE) based on the theory in Förstner (1994), using a minimum edge length of ten pixels. We intentionally do not include a whole range of line segment detectors in order to keep the number of detector combinations tractable within this paper. We also do not expect significantly different results using other edge detectors.

Parameter settings Completeness is in general lower for sparse feature sets. Decreasing a detector's significance level will yield more features but at the same time reduce repeatability and performance. We have chosen to use default parameter settings provided by the authors whenever available in order to make our results a direct complement to existing evaluations. Unfortunately this involves different amounts of features being extracted, which we report in Table 1 for clarity.

4.1.3 Image Data

We compare the completeness and mutual complementarity of the above mentioned detectors on a variety of images. Our experiments are built on the fifteen natural scene category dataset (Lazebnik et al. 2006; Li and Perona 2005), complemented by the well-known Brodatz texture dataset, a collection of cartoon images from the web and a set of image sections of an aerial image. We present results for a subset of seven categories, as depicted in Fig. 11. For each of the categories, we report average results over all images.

4.1.4 Investigated Sets of Features

We start by computing the incompleteness d(p_c, p_H) for each of the detectors separately over all images of a category. This will give us a first impression on the completeness of the different methods. The results are discussed in Sect. 4.2.1. Our special interest is the effect of combining sets of features, i.e. for finding evidence about their complementarity. However, we obviously cannot discuss results for arbitrary tuples of all of these detectors. We therefore choose to constrain the set of all possible combinations in two ways:

1. In Sect. 4.2.2 we start by identifying detectors with very similar properties and close theoretical relationships, and use such sets as groups. Detectors within a group are not combined mutually to decrease the overall number of combinations.
2. We concentrate on combinations of three detectors. We hereby assume that using more than three detectors is not feasible in most applications. Results for triplets are given in Sect. 4.2.3.

4.2 Experimental Results

4.2.1 Results for Separate Detectors

Figure 12 shows the average of the distances d(p_c, p_H) on seven image categories for each detector separately. The corresponding average amounts of features are listed in Table 1.

The most noticeable result is that the EDGELAP detector shows significantly highest completeness on all image categories with rather good confidence. This is certainly due to the very high number of features compared to other detectors (Table 1).
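The scoring applied throughout these experiments can be condensed into a few lines (our own sketch with toy one-dimensional "images"): combining detectors means summing their raw coding maps before normalizing, so a complementary pair can strictly decrease the incompleteness d.

```python
import numpy as np

def incompleteness(pH, coding_maps):
    """d(p_c, p_H) for the union of several detectors: sum the raw
    coding maps c(x) as in eq. (24), normalize once to a density
    as in eq. (25), then apply Hellinger's metric, eq. (26)."""
    c = np.sum(coding_maps, axis=0)
    pc = c / c.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(pH) - np.sqrt(pc))**2))

# Two toy detectors covering complementary halves of an 8-pixel "image":
pH = np.full(8, 1.0 / 8)                     # uniform entropy density
left = np.array([1., 1., 1., 1., 0., 0., 0., 0.])
right = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
d_left = incompleteness(pH, [left])
d_both = incompleteness(pH, [left, right])
print(d_both < d_left)                       # True: the pair is more complete
print(d_both == 0.0)                         # True: together they match pH
```

This also shows why redundant combinations are penalized: adding a detector that codes the same pixels rescales c(x) but leaves the density p_c, and hence d, unchanged.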

Fig. 12 Average dissimilarity d(p_c, p_H) for separate detectors over all images of seven image categories. The categories for BRODATZ, AERIAL and CARTOON contain between 15 and 30 images, all others contain around one hundred images. Small distances denote high completeness of a detector w.r.t. the entropy distribution p_H. The additional black bars denote the 1-σ confidence region over all images in a category

Table 1 Average number of features per image category extracted by the different detectors. We see that SALIENT and especially EDGELAP extract more features than most other detectors, while IBR and EFOP are very sparse

Detector  BRODATZ  AERIAL  CARTOON  FOREST  MOUNTAIN  TALL BUILDING  KITCHEN
EDGELAP     28011    4658     5518    4678      2548           3252     3210
SALIENT      5832     613      846     607       296            385      200
MSER         2202     212      187     186        82            122       96
IBR          1078     182      122      40        29             31       31
SFOP         2052     494      238     290       157            152      145
SFOP0        1355     344      175     197       110            102       95
SFOP90       1177     295      120     183        91             95       87
EFOP          473     153       72      66        57             54       42
LOWE         2093     324      307     152       105            111      115
HESAF        3499     338      837     156       134            139      195
HARAF        4847     359      596     239       138            156      166

The MSER and SALIENT detectors show overall best results aside from EDGELAP. While in case of SALIENT we have to put the higher number of features into perspective again, the result for MSER is really remarkable: It has significantly higher completeness compared to most other detectors, but at the same time similar sparseness. SFOP and IBR both have slightly lower completeness than the above mentioned. The good score of IBR is especially noticeable, as the feature sets are very sparse. For EFOP, having similar sparseness, the completeness is rather poor instead. The LOWE, HARAF and HESAF detectors, all in principle exploiting the Laplacian scale space, achieved similar results among each other. They rank slightly behind IBR and SFOP on average. As expected, SFOP performs better than its special cases SFOP0 and SFOP90. Among these two, the junctions are more complete than the circular points, which is also intuitive.

Especially on AERIAL and FOREST, the completeness of detectors based on the structure tensor seems to be better than that of Laplacian-based methods. This observation is interesting, as the structure tensor has been shown to be related to high local information (McClure 1980). The behavior of the straight edge features (EDGE) depends on the amount of man-made structures. While it is acceptable on KITCHEN, TALL BUILDING, CARTOON and AERIAL, it shows overall worst results for the natural structures in FOREST, MOUNTAIN and BRODATZ. Conversely, the completeness for the blob and corner detectors rather benefits from natural structures, the behavior hence being almost opposite over categories compared to EDGE.

4.2.2 Relationships Between Detectors

The goal of this section is to reduce the set of detectors used for the final evaluation of triple combinations, both by selectively excluding some detectors and by grouping mostly redundant methods. To do so, we will use the Euclidean embedding of different sets of detectors as explained in Sect. 3.6 and identify clusters. We will finally end up with the six groups of detectors depicted in Table 2.

Role of the EDGELAP detector The EDGELAP detector produces significantly more points than all other detectors, as it does not explicitly suppress responses on edges. This paradigm is superior regarding the proposed completeness measure, but targets at special applications.

Table 2 Groups of detectors used for evaluating possible triplets, referring to the conclusions made in Sect. 4.2.2

Group               Members
Edges               EDGE
Segmentation-based  IBR, MSER
Laplacian           LOWE
Mixed Laplacian     HESAF, HARAF
Structure tensor    EFOP
Spiral type         SFOP, SFOP0, SFOP90

Fig. 15 Projection of the Euclidean embedding (Sect. 3.6) of SFOP, SFOP0, SFOP90 together with their classical non-scale-invariant counterparts FOP0, FOP90 and HARRIS onto its first two principal components. The nodes represent average values over all images of all categories. The scale-invariant feature sets are by far closer to the reference p_H, thus significantly more complete

When combined with other detectors, the increase in complementarity becomes hardly visible, as shown in Fig. 13. Therefore we choose not to include EDGELAP in the final evaluation of detector combinations.

Fig. 13 Average incompleteness d(p_c, p_H) for pairwise combinations of other detectors complementing EDGELAP over all images in our evaluation. Black bars denote the 1-σ confidence region. We see that EDGELAP is not significantly complemented by any of the detectors, as it has strong completeness on its own already

Role of the SALIENT detector SALIENT detects less features than EDGELAP but still significantly more than the other detectors. Despite its high completeness scores, it is still complemented by some other detectors (Fig. 14), especially by HESAF. In our experiments we found that triple combinations including SALIENT achieve best scores among all combinations, apart from those including EDGELAP. As mentioned before however, the computational complexity of the detector is orders of magnitude higher than that of other detectors, and its repeatability ranges significantly below average (Mikolajczyk et al. 2005). It is therefore rather unlikely that SALIENT will be used in conjunction with other detectors, so we choose not to include it in our evaluation of combinations.

Fig. 14 Average incompleteness d(p_c, p_H) for pairwise combinations of other detectors complementing SALIENT over all images in our evaluation. Black bars denote the 1-σ confidence region. We see that SALIENT is most efficiently complemented by HESAF, followed by HARAF and SFOP

Detectors for corners and circular features We analyzed the classical detectors for corners and circular features FOP0, FOP90 and HARRIS together with the scale-invariant extensions SFOP, SFOP0, SFOP90 over different image categories. Throughout the experiments we found that the scale-invariant methods are significantly more complete than the classical ones, which is expected due to the variable patch sizes. The projection of a mapping over all image categories is shown in Fig. 15. We therefore choose to only include the scale-invariant features in the final evaluation.

Affine invariance of blob detectors We compared HARLAP and HESLAP with their affine covariant versions HARAF and HESAF, which differ by a final affine refinement of the window shapes. The differences between the affine and non-affine versions are negligible referring to our completeness measure, as illustrated in Fig. 16. Therefore we only include the affine-extended versions in the subsequent evaluation.

Groups of similar detectors We want to find out in how far different detectors are mutually complementary referring to the proposed measure. It is therefore not necessary to consider combinations of highly redundant detectors, where complementarity cannot be expected. We illustrate such redundant groups by a projection of a high-dimensional mapping over all image categories in Fig. 17. The results of the IBR and MSER detectors are strongly related throughout the image categories. Therefore we put them into a group denoted "Segmentation-based". This is in agreement with the observation already made by Tuytelaars and Mikolajczyk (2008, p. 211) that IBR and MSER yield very similar regions.

The SFOP, SFOP0 and SFOP90 detectors are also closely related, which is intuitive as they all result from the same theoretical framework (Bigün 1990). Therefore we put these three into a group called "Spiral type". The LOWE detector, though theoretically similar to HESAF and also HARAF, shows significant distance towards these two in all our experiments. Therefore it builds a separate group "Laplacian". The EFOP detector developed in Sect. 3.7 is also distinguished from the others in most image categories. As it basically fires at texturedness, classified by the structure tensor, it represents the group "Structure tensor". The HESAF and HARAF detectors show mostly similar results and have small mutual distance over all image categories. This observation contradicts the statement of Tuytelaars and Mikolajczyk (2008, 4.3) that HESAF and HESLAP are "complementary to their Harris-based counterparts, in the sense that they respond to a different type of feature in the image." Although making use of the Laplacian scale space, the detectors differ from LOWE in that the scale search is induced by local extrema of multi-scale representations differing from the pure Laplacian. Therefore they are grouped as "Mixed Laplacian", above the Laplacian, exploiting the structure tensor and the Hessian respectively.

Fig. 16 Projection of the Euclidean embedding (Sect. 3.6) of HARAF, HESAF, HARLAP and HESLAP onto its first two principal components. The affine invariant feature sets are not significantly different from the non-affine invariant ones w.r.t. the completeness measure.

Fig. 17 Projection of the Euclidean embedding (Sect. 3.6) of the most important detectors onto its first two principal components. Regarding the proposed completeness measure, the detectors roughly build three groups: MSER and IBR form a cluster, as well as the spiral model based feature sets SFOP, SFOP0 and SFOP90, and the HESAF and HARAF detectors. The other detectors are rather distinguished and not put into groups. Note that the EDGE detector has significantly larger distance and is not included in this figure.

4.2.3 Results for Selected Triplets of Detectors

We computed the incompleteness d(p_H, p_c) of all possible triplets over the groups given in Table 2, without combining detectors within the same group. This yields 82 combinations overall. The average incompleteness scores for these combinations over all image categories are listed in Table 3.

The most complete triplets contain a combination of the groups "segmentation based" and "spiral type". These groups seem to be complementary over a wide range of image categories. As a third component, EDGE, LOWE or EFOP are most promising. Besides this, two "holes" in the table attract attention: The Laplacian-based blobs do not appear among the first ten to fifteen entries, and the last rows of the table are empty for the spiral type features. This indicates that spiral-type features play an important role for promising complementary sets. Regarding the Laplacian-based detectors, it suggests some redundancy with one of the other groups, which can be verified when observing the last rows of the table: Here we find mainly combinations of the segmentation based with the Laplacian detectors. It therefore seems that combining MSER or IBR with one of the Laplacian based detectors is rather redundant, but that IBR/MSER have higher complementarity to the remaining groups.

There are many other combinations that also work well: Combining HARAF or HESAF as a blob detector with junctions, edges or segmentation based regions usually achieves good results. We motivated our approach by the fact that highly redundant feature sets are punished. Considering combinations including LOWE and HESAF, which are theoretically most closely related, strengthens this statement. They appear prevalently in the last rows of the table, but not even once within the first half.

It is important to note that the differences between combinations are substantially smoothed due to the large number of images used. For separate categories, one obtains more significant differences.

4.2.4 Other Combinations of Detectors

Combinations containing detectors within the same group (Sect. 4.2.2) show worse results, confirming our selection principle. As an example, a combination of three blob detectors, i.e. LOWE, HESAF and HARAF, usually achieves significantly lower completeness scores. A similar effect occurs when using three spiral-type detectors.

In addition to triple combinations, we also computed the results for combinations of four and more detectors, respecting the grouping shown in Table 2. We found that the best group of four or five detectors is usually only slightly better than the best group of three.

A complete list of our results is available at http://www.ipb.uni-bonn.de/completeness.
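The incompleteness values reported in this section are Hellinger distances between two discrete densities over the image domain, the entropy density p_H and the feature coding density p_c. A minimal sketch of this distance follows; the helper name and the toy densities are ours for illustration, not the authors' implementation:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete densities.

    Both inputs are normalized to sum to one; the result
    sqrt(0.5 * sum((sqrt(p) - sqrt(q))**2)) lies in [0, 1]:
    0 for identical densities, 1 for densities with disjoint support.
    """
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

# A uniform density vs. itself and vs. a shifted single peak
# (toy stand-ins for an entropy density p_H and a coding density p_c):
uniform = np.full(16, 1.0 / 16)
peaked = np.zeros(16)
peaked[0] = 1.0
print(hellinger(uniform, uniform))            # -> 0.0
print(hellinger(peaked, np.roll(peaked, 1)))  # -> 1.0
```

An incompleteness near 0 thus means the features code essentially all image regions carrying entropy, while redundant or badly placed feature sets push the distance towards 1.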

Table 3 Average incompleteness d(p_H, p_{c_i}) for feature sets L_i, arising from all considered triplets of detectors, over all image categories. The additional black bars denote the 1-σ confidence region over all images in a category. Black column borders for detectors indicate groups (Table 2). The triplets are sorted in ascending order w.r.t. their completeness regarding the entropy density p_H.

4.3 Applicability of the Results in Real Applications

In the end we are interested in the impact of these results on some real application scenarios. The two most important applications are certainly camera calibration and object recognition.

The impact of a variety of detector combinations on camera calibration has recently been studied by Dickscheid and Förstner (2009). There, a fixed strategy for automatic image orientation has been used with different detector combinations as input. Then results of a final bundle adjustment were compared. The most meaningful measure in such a setting is the accuracy of the estimated camera projection centers compared to ground truth (Dickscheid and Förstner 2009, Fig. 7), which we want to relate to our findings about completeness.

For separate detectors, SFOP and MSER achieved best results in the bundle adjustment. This is in full agreement with the incompleteness in Fig. 12, considering that EDGELAP, SALIENT and IBR have not been included in the bundle adjustment test. The LOWE detector performed only slightly worse, which can also be verified from our completeness scores. Combinations of detectors usually had positive effects on the bundle adjustment. The best combination of three was LOWE with SFOP and MSER. This combination also achieves one of the best completeness scores, when ignoring combinations including EDGE. Especially interesting is the fact that combining LOWE and HESAF, which are closely related and highly redundant, had negative effects on the bundle adjustment. In Table 3, we also see that combinations including these two appear only towards the end, in the second part of the table.

The completeness therefore seems to be a good indicator for promising feature sets w.r.t. camera calibration, besides robustness and localization accuracy. A comparison with results in an object recognition framework still has to be done.

5 Conclusion and Outlook

We have proposed a scheme for measuring completeness of local features in the sense of image coding. To achieve this, we derived suitable estimates for the distribution of relevant information and the coverage by a set of local features over the image domain, which we compared by the Hellinger distance. This enables us to quantify the completeness of different sets of local features over images, and especially in how far detectors are complementary in this sense. The approach has important advantages over a simple comparison of image coverage. For example, it favors response on structured image parts while penalizing features in purely homogeneous areas, and it accounts for features appearing at the same location on different scales.

The proposed scheme does not give a general benchmark for detectors. To make a good choice for a particular application, detectors have to be preselected according to task usefulness, considering sparseness, robustness, and speed. Then the proposed evaluation helps in finding the most complete ones, and especially the most promising complementary combinations.

We made a number of interesting observations. Scale invariance is clearly beneficial for covering image content, while at the same time we could not observe improvements when using affine covariant windows. The EDGELAP detector (Mikolajczyk et al. 2003), basically using edge information, showed clearly superior completeness as a separate detector, but targets specific applications. Also the SALIENT detector (Kadir and Brady 2001) achieved very good results, extracting a significantly higher number of features than other detectors. The MSER detector (Matas et al. 2004) seems to be most complete among the detectors with average sparseness. For combining features, it is most profitable to use a detector implementing a junction model (like SFOP, Förstner et al. (2009)) together with a good blob or region detector, ideally MSER (Matas et al. 2004). Best triplet combinations are achieved when additionally using a texture-related detector, such as an edge or entropy-based detector.

The proposed entropy density p_H(x) gives rise to a new scale invariant keypoint detector which locally maximizes the entropy over position and scale. We have proposed such a detector (EFOP), which, compared to the detector of Kadir and Brady (2001), yields significantly sparser feature sets and takes pixel correlations into account. The completeness of the new detector, however, is below average.

Evaluating feature detectors in general suffers from the somewhat arbitrary and implementation dependent parameter settings. We believe that in principle the parameters should be clearly determined from the image, e.g. via the estimated noise variance (Förstner 1998; Liu et al. 2008). It would be highly desirable for future benchmarks to have a common, reduced set of input parameters, which in our opinion should be no more than the possibly signal dependent noise variance and the maximum number of features. Each detector could then select the best features according to its own formulation of significance. This way the performance of detectors could be characterized as a function of the number of features. By evaluating the minimum number of features necessary to reach a prespecified performance in a given application, a characterization in terms of efficiency would also be possible.

Of course the results have to be transferred to practical applications. By relating them to a recent evaluation in the context of camera calibration, we found that the accuracy of bundle adjustment results was strongly correlated with completeness. A comparison with results in an object recognition framework still has to be done.

We did not consider edge detectors in detail due to their rather special role. It would be interesting to include a number of edge detectors, especially scale invariant ones, into the setup. An investigation into individual coding schemes for different feature types (points, lines, and blobs) would be desirable. Finally, a more realistic image model than the Gaussian could lead to more detailed insight into the relationship between different detectors.

Acknowledgements We wish to thank the anonymous reviewers for valuable comments leading to an improvement of the manuscript.

Appendix: Derivation of the Effective Bandwidth in (30)

This section outlines the derivation of the approximation (30) based on the model (29). Based on (17), a continuous formulation of the entropy of a local patch using the fractal model is

R_f = \frac{1}{2} \int_{u_0}^{u_n} \log_2 \frac{P_0 \, (1 + u/u_0)^{-1}}{D_0} \, du. \qquad (36)
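To make the role of the model parameters tangible, the integral in (36) can be evaluated numerically, cutting the integration off at the frequency u_n where the model power reaches the distortion level D_0 = N_0, as derived next in (37). The numeric values for P_0, N_0 and u_0 below are assumed purely for illustration, not taken from the paper's experiments:

```python
import numpy as np

# Illustrative (assumed) model values: signal power scale P0,
# noise power N0, and knee frequency u0 of the fractal spectrum model.
P0 = 100.0
N0 = 1.0
u0 = 0.05

D0 = N0  # code only down to the distortion given by the noise power
# Upper limit from (37): frequency where P0 * (1 + u/u0)^(-1) drops to D0.
un = u0 * (P0 / N0 - 1.0)

# Entropy of a local patch, (36):
#   Rf = 1/2 * integral from u0 to un of log2(P0 * (1 + u/u0)^(-1) / D0) du,
# evaluated by the trapezoidal rule on a fine grid.
u = np.linspace(u0, un, 100001)
integrand = 0.5 * np.log2(P0 * (1.0 + u / u0) ** (-1.0) / D0)
Rf = float(np.sum(0.5 * (integrand[:-1] + integrand[1:]) * np.diff(u)))
print(f"u_n = {un:.2f}, R_f = {Rf:.2f} bits")
```

For these assumed values the cutoff lies at u_n = 4.95 and the integral evaluates to roughly 3.25 bits; increasing P_0/N_0 raises both the usable bandwidth and the entropy.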

To ensure that we do not use any bits where the signal is below the distortion, we have to restrict the integration limits and require u ≤ u_n satisfying

\frac{P_0}{1 + u_n/u_0} = N_0 \iff u_n = u_0 \left( \frac{P_0}{N_0} - 1 \right). \qquad (37)

In order to find a suitable value for the model parameter u_0, we need further knowledge about the signal. We can solve the integrals on the right hand side of (32) to relate the effective bandwidth b_g^2 to u_0 and the signal-to-noise ratio by

b_g^2 = \frac{2 \log_2 \mathrm{SNR}^2 + 3 - 4\,\mathrm{SNR}^2 + \mathrm{SNR}^4}{4 \log_2 \mathrm{SNR}} \, u_0^2. \qquad (38)

Here we again assume additive white noise with known variance, as in the model introduced in (3) and (4). Assuming the bandwidth and signal-to-noise ratio of the signal to be known, we can use (38) for replacing u_0 in (36), with the integration limit as specified above, and obtain the entropy

R_f = 2\,(\mathrm{SNR}^2 - 1) \left( \log_2 2\pi e + 1 - 2 \log_2 \mathrm{SNR} \right) \sqrt{\frac{\log_2 \mathrm{SNR}}{2 \log_2 \mathrm{SNR}^2 + 3 - 4\,\mathrm{SNR}^2 + \mathrm{SNR}^4}} \; b_g. \qquad (39)

This function can be approximated by

H_a(g) = k \, b_g \log_2 (1 + \mathrm{SNR}^2) \qquad (40)

with

k = \sqrt{2}\,(1 + \ln 2\pi e) \approx 5.43. \qquad (41)

Using the correct scale factor, one can show by analytic comparison that the approximation error between (30) and (17) is below 1%.

References

Bay, H., Ferrari, V., & Gool, L. V. (2005). Wide-baseline stereo matching with line segments. In Proceedings of the IEEE conference on computer vision and pattern recognition, Washington, DC, USA (Vol. 1, pp. 329–336).
Bercher, J. F., & Vignat, C. (2000). Estimating the entropy of a signal with applications. IEEE Transactions on Signal Processing, 48(6), 1687–1694.
Berger, T. (1971). Rate distortion theory: a mathematical basis for data compression. Englewood Cliffs: Prentice-Hall.
Bigün, J. (1990). A structure feature for some image processing applications based on spiral functions. Computer Vision, Graphics and Image Processing, 51(2), 166–194.
Bishop, C. M. (2006). Pattern recognition and machine learning. Information science and statistics. Berlin: Springer.
Cayton, L., & Dasgupta, S. (2006). Robust Euclidean embedding. In Proceedings of the international conference on machine learning (pp. 169–176). New York: ACM.
Corso, J. J., & Hager, G. D. (2009). Image description with features that summarize. Computer Vision and Image Understanding, 113(4), 446–458.
Davis, G. M., & Nosratinia, A. (1998). Wavelet-based image coding: an overview. Applied and Computational Control, Signals, and Circuits, 1, 205–269.
Davisson, L. D. (1972). Rate-distortion theory and application. Proceedings of the IEEE, 60, 800–808.
Dickscheid, T., & Förstner, W. (2009). Evaluating the suitability of feature detectors for automatic image orientation systems. In Proceedings of the international conference on computer vision systems, Liège, Belgium (pp. 305–314).
Förstner, W. (1994). A framework for low level feature extraction. In Proceedings of the European conference on computer vision, Stockholm, Sweden (Vol. III, pp. 383–394).
Förstner, W. (1998). Image preprocessing for feature extraction in digital intensity, color and range images. In Lecture notes in earth sciences: Vol. 95/2000. Geomatic methods for the analysis of data in earth sciences (pp. 165–189). Berlin: Springer.
Förstner, W., & Gülch, E. (1987). A fast operator for detection and precise location of distinct points, corners and centres of circular features. In ISPRS conference on fast processing of photogrammetric data, Interlaken (pp. 281–305).
Förstner, W., Dickscheid, T., & Schindler, F. (2009). Detecting interpretable and accurate scale-invariant keypoints. In Proceedings of the IEEE international conference on computer vision, Kyoto, Japan.
Haja, A., Jähne, B., & Abraham, S. (2008). Localization accuracy of region detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Harris, C., & Stephens, M. J. (1988). A combined corner and edge detector. In Proceedings of the Alvey vision conference (pp. 147–152).
Heath, M., Sarkar, S., Sanocki, T., & Bowyer, K. (1996). Comparison of edge detectors: a methodology and initial study. In Proceedings of the IEEE conference on computer vision and pattern recognition (p. 143).
Kadir, T., & Brady, M. (2001). Saliency, scale and image description. International Journal of Computer Vision, 45(2), 83–105.
Kadir, T., Zisserman, A., & Brady, M. (2004). An affine invariant salient region detector. In Proceedings of the European conference on computer vision (pp. 228–241). Berlin: Springer.
Kovesi, P. D. (2009). MATLAB and Octave functions for computer vision and image processing. School of Computer Science & Software Engineering, The University of Western Australia, http://www.csse.uwa.edu.au/~pk/research/matlabfns/.
Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE conference on computer vision and pattern recognition, Washington, DC, USA (pp. 2169–2178).
Li, F. F., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 524–531). Los Alamitos: IEEE Comput. Soc.
Liang, P., Jordan, M. I., & Klein, D. (2010). Probabilistic grammars and hierarchical Dirichlet processes. In The Oxford handbook of applied Bayesian analysis. London: Oxford University Press.
Lillholm, M., Nielsen, M., & Griffin, L. D. (2003). Feature-based image analysis. International Journal of Computer Vision, 52(2–3), 73–95.
Lindeberg, T. (1998a). Edge detection and ridge detection with automatic scale selection. International Journal of Computer Vision, 30(2), 117–156.
Lindeberg, T. (1998b). Feature detection with automatic scale selection. International Journal of Computer Vision, 30(2), 79–116.

Liu, C., Szeliski, R., Kang, S. B., Zitnick, C. L., & Freeman, W. T. (2008). Automatic estimation and removal of noise from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 299–314.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Marr, D. (1982). Vision: a computational investigation into the human representation and processing of visual information. New York: Freeman.
Matas, J., Chum, O., Urban, M., & Pajdla, T. (2004). Robust wide baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22, 761–767.
McClure, D. E. (1980). Image models in pattern theory. Computer Graphics and Image Processing, 12(4), 309–325.
McGillem, C., & Svedlow, M. (1976). Image registration error variance as a measure of overlay quality. IEEE Transactions on Geoscience Electronics, 14(1), 44–49.
Meltzer, J., & Soatto, S. (2008). Edge descriptors for robust wide-baseline correspondence. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Mikolajczyk, K., & Schmid, C. (2004). Scale and affine invariant interest point detectors. International Journal of Computer Vision, 60(1), 63–86.
Mikolajczyk, K., Zisserman, A., & Schmid, C. (2003). Shape recognition with edge-based features. In Proceedings of the British machine vision conference, Norwich, UK (pp. 779–788).
Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., & Gool, L. V. (2005). A comparison of affine region detectors. International Journal of Computer Vision, 65(1/2), 43–72.
Moreels, P., & Perona, P. (2007). Evaluation of features detectors and descriptors based on 3D objects. International Journal of Computer Vision, 73(3), 263–284.
Mumford, D. (2005). Empirical statistics and stochastic models for visual signals. In S. Haykin, J. Principe, T. Sejnowski, & J. McWhirter (Eds.), New directions in statistical signal processing: from systems to brain. Cambridge: MIT Press.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Perďoch, M., Matas, J., & Obdržálek, Š. (2007). Stable affine frames on isophotes. In D. Metaxas, B. Vemuri, A. Shashua, & H. Shum (Eds.), Proceedings of the IEEE international conference on computer vision (p. 8). Los Alamitos: IEEE Comput. Soc.
Puzicha, J., Hofmann, T., & Buhmann, J. M. (1999). Histogram clustering for unsupervised image segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, Ft. Collins, USA (pp. 602–608).
Rosenfeld, A., & Kak, A. C. (1982). Digital picture processing. San Diego: Academic Press.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
Shi, J., & Tomasi, C. (1994). Good features to track. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 593–600).
Smith, J. O. (2007). Mathematics of the discrete Fourier transform (DFT). W3K Publishing, Chap. 16.3.
Tuytelaars, T., & Mikolajczyk, K. (2008). Local invariant feature detectors: a survey. Hanover: Now Publishers.
Tuytelaars, T., & Van Gool, L. (2004). Matching widely separated views based on affine invariant regions. International Journal of Computer Vision, 59(1), 61–85.
Zeisl, B., Georgel, P., Schweiger, F., Steinbach, E., & Navab, N. (2009). Estimation of location uncertainty for scale invariant feature points. In Proceedings of the British machine vision conference, London, UK.