
An analysis of the Scale Saliency algorithm

Timor Kadir¹, Djamal Boukerroui, Michael Brady
Robotics Research Laboratory, Department of Engineering Science, University of Oxford, Parks Road, Oxford OX1 3PJ, U.K.

August 8, 2003

Abstract

In this paper, we present an analysis of the theoretical underpinnings of the Scale Saliency algorithm recently introduced in (Kadir and Brady, 2001). Scale Saliency considers image regions salient if they are simultaneously unpredictable in some feature-space and over scale. The algorithm possesses a number of attractive properties: invariance to planar rotation, scaling, intensity shifts and translation; robustness to noise, changes in viewpoint, and intensity scalings. Moreover, the approach offers a more general model of feature saliency compared with conventional ones, such as those based on kernel convolution, for example wavelet analysis. Typically, such techniques define saliency and scale with respect to a particular set of basis morphologies. The aim of this paper is to make explicit the nature of this generality. Specifically, we aim to answer the following questions: What exactly constitutes a ‘salient feature’? How does this differ from other feature selection methods? The main result of our analysis is that Scale Saliency defines saliency and scale independently of any particular feature morphology. Instead, a more general notion is used, namely spatial unpredictability. This is measured within the geometric constraint imposed by the sampling window and its parameterisation. In (Kadir and Brady, 2001) this window was a circle parameterised by a single parameter controlling the radius. Under such a scheme, features are considered salient if their feature-space properties vary rapidly with an incremental change in radius. In other words, the method favours isotropically unpredictable features. We also present a number of variations of the original algorithm: a modification for colour images (or more generally any vector-valued image) and a generalisation of the isotropic scale constraint to the anisotropic case.

Keywords

Visual Saliency, Scale Selection, Salient Features, Entropy, Scale-space.

1 Introduction

Computer vision algorithms are, in general, information reduction processes. Brute-force approaches to image or image sequence analysis can quickly overwhelm most computing resources at our disposal. Fortunately, images are a redundant data source: the same set of inferences may be drawn from a variety of image characteristics. This becomes self-evident considering the array of different methodologies available for solving any particular vision task. Hence, the selection of a sufficient set of image regions and properties, or salient features, forms the first step in many algorithms.

Two key issues face the vision algorithm designer: the subset of image properties selected for subsequent analysis and the model used to represent those properties. For example, many image matching algorithms begin with a set of ‘landmark’ points which serve as a basis for estimating the image transformation that defines the match. In this case, well-localised and unique image regions are desirable to minimise the likelihood of false matches. For many tasks, geometric and photometric invariance properties are also beneficial. Finally, there is often an implicit, but difficult to quantify, requirement that the salient regions be relevant to the task of interest — in other words, that the regions, or the descriptions subsequently extracted from them, are somehow characteristic of the scene contents they are intended to signify.

Many definitions of saliency have been proposed. Perhaps the most popular have arisen out of the application of local surface differential geometry techniques to imaging. Such methods consider the image to be a discrete approximation to a surface and categorise it by application of differential operators. Closely related to these are basis projection and filtering methods. Common to both is the development of one or two dimensional features; one dimensional features include edges, lines and ridges (Bergholm, 1986, Canny, 1986); two dimensional features are often referred to as interest points or ‘Corners’ (Deriche and Giraudon, 1993, Harris and Stephens, 1988, Mokhtarian and Suomela, 1998). Much effort within the Scale-Space and Wavelet communities has been devoted to providing a mathematically sound basis for the application of such techniques to what are essentially discrete sets (Kœnderink, 1984, Lindeberg and ter Haar Romeny, 1994, Mallat, 1998, Witkin, 1983). In general, these methods share one assumption: that saliency is a direct property of the geometry or morphology of the image surface. While it is certainly the case that there are many useful image features that can be defined in such a manner, efforts to generalise such methods to capture a broader range of salient image regions have had limited success. We contend that one of the major factors for this is that such methods typically define both saliency and scale with respect to a small set of basis functions or geometric properties. Perhaps then, it is for this reason, and the lack of a satisfactory

definition of what constitutes a salient feature in the broader sense, that the term ‘feature selection’ has acquired this restricted interpretation.

There are a number of exceptions to this. Phase Congruency and the related Local Energy approach (Kovesi, 1999) define features in terms of the phase coherence of Fourier components. For example, at a step-edge all Fourier components are maximally in phase at an angle of 0◦ or 180◦ for positive or negative transitions respectively. One of the benefits of such an approach is that several feature types may be detected simultaneously. Yet despite the novelty of the model, Kovesi was primarily interested in the simple one or two dimensional features typical of geometric methods. There was no effort to broaden the definition of saliency. An alternative strategy is to define saliency in terms of the probabilistic or statistical properties of the image. This approach has been most popular for region segmentation tasks (Besag, 1986, Leclerc, 1989, Li, 1995, Paragios and Deriche, 2002, Zhu and Yuille, 1996). There have also been several attempts at feature detection using statistical measures — it is well known that local variance can be employed as a basic edge detector. Other methods have attempted to estimate saliency by measuring the rarity of feature properties.

In (Kadir and Brady, 2001), we proposed a novel model of feature saliency. In our approach, termed Scale Saliency, regions are deemed salient if they exhibit unpredictable behaviour (in a probabilistic sense) simultaneously in feature-space and over scale. Scale Saliency possesses a number of attractive properties. First, it offers a more general model of feature saliency compared to conventional methods. Second, it incorporates an intrinsic notion of scale and a method for its local selection. Third, it makes explicit the link between the definition of saliency and the method of description. In short, it offers a coherent methodology incorporating three intimately related concepts: scale, saliency and image description. The implementation presented in (Kadir and Brady, 2001) possesses a number of other beneficial qualities: invariance to planar rotation, scaling, intensity shifts and translation; and robustness to noise, changes in viewpoint, and intensity scalings.

In this paper, we present an in-depth analysis of the theoretical underpinnings of the Scale Saliency model. The aim is to make explicit the definition of saliency in this model. Specifically, we aim to answer the following questions: What exactly constitutes a salient feature? How is this different from other feature selection methods?

This paper is organised as follows. In Section 2 we provide a brief overview of the Scale Saliency algorithm, which is a product of two terms measuring the unpredictability of the local PDF in feature-space and over scale respectively. Detailed analyses of these two terms are presented in Sections 3 and 4, where we derive expressions for the conditions under which Scale Saliency is maximised and discuss the underlying model. In Section 5, we present generalisations of

the method to colour images and anisotropic scale. In Section 6, we discuss the relationship between the Scale Saliency algorithm and transform based methods for feature detection. Finally, in Section 7 we conclude our analysis and outline a number of remaining open issues.

2 Scale Saliency

In this section, we briefly describe the Scale Saliency algorithm. A more detailed discussion of the technique may be found in (Kadir and Brady, 2001).

2.1 Saliency as local unpredictability

Gilles (1998) investigated the use of salient local image patches, or ‘icons’, for matching and registering two images. He defined saliency in terms of local signal complexity or unpredictability; more specifically, he estimated saliency using the Shannon entropy of local attributes. Figure 1 shows local intensity histograms from a number of image segments. Areas corresponding to high signal complexity tend to have flatter distributions, hence higher entropy. More generally, high complexity of any suitable descriptor can be used as a measure² of local saliency. Local attributes, such as colour or edge strength, direction or phase, may be used.

Given a point x, a local neighbourhood $R_X$, and a descriptor d that takes values from $D = \{d_1, \ldots, d_r\}$ (e.g. in an 8-bit grey level image D would range from 0 to 255), local entropy (in the discrete form) is defined as:

$H_{D,R_X} = -\sum_{d \in D} p_{d,R_X} \log_2 p_{d,R_X}$  (1)

where $p_{d,R_X}$ is the probability of the descriptor taking the value d in the local region $R_X$.

Gilles’ method has a number of limitations. It requires the specification of a window size, or scale, over which an estimate of the local PDF may be obtained. Underlying this definition of saliency is the assumption that complexity is rare in real images. This is generally true, except in the case of pure noise or self-similar images (e.g. fractals), where complexity is independent of scale and position, and textured regions, where, in general, complexity is more prevalent. Moreover, the assumption only holds at specific scales. In the extreme, the Gilles method would deem salient a region containing only noise. The underlying problem is that the method cannot measure saliency over the scale dimension. This is pivotal to the scale selection problem; in order to perform adequate scale selection one must first arrive at a satisfactory definition of ‘good scales’.

2.2 Inter-scale saliency

In (Kadir and Brady, 2001), we addressed the shortcomings of Gilles’ method and introduced a novel algorithm called Scale Saliency, which defines saliency as a product of two terms: the first is entropy, which measures unpredictability in feature-space; the second is a measure of the statistical dissimilarity across scale. Peaks in the entropy over scale function are used for scale selection. The key observation here is that (at least in the ideal case) noise has no scale localisation — it is self-similar over all scales — whereas image features exist over specific ranges of scale (Witkin, 1983). Those that persist over large ranges of scale tend to exhibit self-similarity in their statistics. However, we prefer to detect features with surprising inter-scale characteristics; hence, we weight the entropy with a measure of the statistical dissimilarity over scale.

In the continuous case the saliency measure YD as a function of scale s and position x is defined as follows:

$Y_D(s_p, x) \triangleq H_D(s_p, x) \times W_D(s_p, x)$  (2)

where the entropy HD is defined by:

$H_D(s, x) \triangleq -\int_{d \in D} p(d, s, x) \log_2 p(d, s, x) \, \mathrm{d}d$  (3)

and where p(d, s, x) is the probability density as a function of scale s, position x and descriptor value d, which takes on values in D, the set of all descriptor values. The inter-scale saliency measure, WD(s, x), is defined by:

$W_D(s, x) \triangleq s \int_{d \in D} \left| \frac{\partial}{\partial s} p(d, s, x) \right| \mathrm{d}d$  (4)

The vector of scales sp, at which entropy peaks, is defined by:

$s_p \triangleq \left\{ s : \frac{\partial H_D(s, x)}{\partial s} = 0, \; \frac{\partial^2 H_D(s, x)}{\partial s^2} < 0 \right\}$  (5)

In the discrete case, HD, WD and sp become:

$H_D(s, x) \triangleq -\sum_{d \in D} p_{d,s,x} \log_2 p_{d,s,x}$  (6)

$W_D(s, x) \triangleq \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,s,x} - p_{d,s-1,x}|$  (7)

$s_p \triangleq \{ s : H_D(s - 1, x) < H_D(s, x) > H_D(s + 1, x) \}$  (8)

In relation to the discussion in Section 1, one may think of Scale Saliency as employing a hybrid approach to saliency. Feature-space saliency is measured in a statistical sense, since HD, the entropy, operates on the local PDF, where the local order is lost, but WD applies a geometric constraint by means of the sampling window and its parameterisation. As we shall see in Section 4, this is a key attribute of the method and enables a definition of saliency and scale independent of any particular basis morphology.

2.3 The Algorithm

The algorithm can be implemented in the following form: for each pixel location, calculate the entropy of the PDF estimate of the local descriptor, using a circular window to sample the image. This process is repeated for increasing scales; that is, circular windows of increasing radius are used. Scales are selected at which the entropy attains a peak. The statistical self-dissimilarity is measured using the sum of absolute differences between the histograms around each entropy peak. Below is a pseudo-code description of the Scale Saliency algorithm:

01 For each pixel location, (x,y), in the image I:
02 {
03   For each scale, s, between Smin and Smax:
04   {
05     IS = Sample local descriptor values at I(x,y) in a window of size s
06     P(d,s) = Estimate the local PDF from IS (e.g. using histograms)
07     HD(s) = Calculate the entropy of P(d,s)
08     WD(s) = Calculate inter-scale saliency between P(d,s) & P(d,s-1)
09   }
10   Run smoothing filter over WD(s) (e.g. 3-tap average)
11   For each scale for which the entropy attains a peak, SP, do:
12     YD(SP,x,y) = HD(SP) x WD(SP)
13 }

Step 10 simply smooths the WD values, as these tend to be sensitive to noise. This is equivalent to a robust version of the discrete derivative operator (the kernel [−1, 1]). A more efficient implementation could be to apply the smoothing filter directly to the operator. The algorithm generates a space in R³ (two spatial dimensions and scale) sparsely populated with scalar saliency values.

In essence, the method searches for scale-localised features with high entropy, subject to the constraint (in the above implementation) that scale is isotropic. The method therefore favours blob-like features. Such features are useful for matching because they are locally constrained in two directions; features such as edges or lines only locally constrain matches to one direction (depending, of course, on their length). We may relax the isotropy requirement and use anisotropic regions. We present such a generalisation in Section 5.2, where scale is defined with respect to an ellipse parameterised by a scale, a rotation and an axis ratio. Some features, such as those with local linear structure, are better characterised under such a scheme. It should be noted, however, that the isotropic implementation does detect non blob-like features, but labels them with a lower saliency than their isotropic equivalents. In the case of linear structure, this is because there is a degree of self-similarity in the tangential direction; the selected scale is determined predominantly by the spatial extent of the feature in the perpendicular direction.

However, for some vision tasks, such as correspondence, this ranking might be beneficial as it reflects the information gained by matching each type of feature.
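To make the procedure concrete, the following is a minimal NumPy sketch of the isotropic, grey-level algorithm for a single pixel. The window radii, bin count and border handling are illustrative choices on our part, and the step-10 smoothing of WD is omitted for brevity; this is a sketch of Equations 6 to 8, not the reference implementation.

# Minimal sketch of isotropic grey-level Scale Saliency for one pixel.
# Radii, bin count and border handling are illustrative assumptions.
import numpy as np

def local_pdf(img, x, y, s, bins=16):
    """Histogram estimate of the local PDF over a circular window of radius s."""
    ys, xs = np.ogrid[-s:s + 1, -s:s + 1]
    mask = xs**2 + ys**2 <= s**2                  # circular sampling window
    patch = img[y - s:y + s + 1, x - s:x + s + 1][mask]
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist / hist.sum()

def scale_saliency(img, x, y, s_min=3, s_max=20, bins=16):
    """Return (scale, saliency) pairs at entropy peaks for pixel (x, y).

    Assumes (x, y) lies at least s_max pixels from the image border.
    """
    eps = 1e-12
    H, W = {}, {}
    prev = None
    for s in range(s_min, s_max + 1):
        p = local_pdf(img, x, y, s, bins)
        H[s] = -np.sum(p * np.log2(p + eps))      # feature-space entropy, Eq. 6
        if prev is not None:                      # inter-scale measure, Eq. 7
            W[s] = s**2 / (2*s - 1) * np.sum(np.abs(p - prev))
        prev = p
    peaks = [s for s in range(s_min + 1, s_max)
             if H[s - 1] < H[s] > H[s + 1]]       # entropy peaks, Eq. 8
    return [(s, H[s] * W[s]) for s in peaks]      # saliency values, Eq. 2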

3 Feature-space Saliency — HD

In this section, we analyse the underlying model of saliency implied by its definition as complexity or unpredictability and as characterised by entropy. Even at the conceptual level, complexity is not an absolute term; a measure of complexity implies a comparison with what is not complex. A similar remark was made by Leclerc (1989) regarding the concept of simplicity:

Thus, in this formalism, the notion of simplicity is a relative one that depends strongly on the choice of descriptive language. The necessity of providing an a-priori descriptive language is a very important and fundamental point. It means that, for a finite number of observations, there is no such thing as an absolute measure of the simplicity of description; simplicity is inescapably a function of one’s prior assumptions.

Similarly, estimating complexity with an information theoretic measure such as entropy necessarily implies a prior model of what is not complex or unpredictable.

3.1 Maximising entropy

Before we proceed, we state without proof the following theorem for the conditions under which entropy is maximised:

Theorem 1

$H(p_1, p_2, \ldots, p_M) \le \log M$  (9)

with equality iff $p_i = \frac{1}{M}$ for all $i \in \{1, \ldots, M\}$

In words, entropy is maximised when all descriptor values are equally likely and minimised when all pixels take on one particular descriptor value. The theorem is well-known and its proof may be found in any standard text on information theory such as (Ash, 1965).
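As a quick numerical illustration of Theorem 1 (ours, not from the theorem's source): the entropy of a uniform PDF over M = 8 values attains the bound log₂ M, while a peaked PDF falls strictly below it.

# Numerical check of Theorem 1: entropy is maximised by the uniform PDF.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                         # use the convention 0 log 0 = 0
    return -np.sum(p * np.log2(p))

M = 8
uniform = np.full(M, 1.0 / M)
peaked = np.array([0.9] + [0.1 / (M - 1)] * (M - 1))

print(entropy(uniform), np.log2(M))      # 3.0 3.0 -- the bound is attained
print(entropy(peaked))                   # about 0.75 bits, strictly below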

3.2 What is the prior model?

Consider, as an example, measuring the predictability of a local image region’s intensity values. First, we compute the local histogram of grey-values; this approximates the local PDF of grey-values. If the histogram is highly peaked around a specific intensity range then this (by definition) means that, within this region, it is highly likely that pixel values within that particular intensity range can be found. If the PDF is spread out, it means that many intensity values are equally likely. Entropy simply summarises the degree to which a PDF is peaked or spread out. Those regions with a flat PDF, and hence high entropy, are deemed salient.

There is an apparent paradox here: a sharply peaked PDF is generally considered highly informative, in the Shannon sense, yet in our method it is labelled with low saliency. Restating the sampling procedure in a probabilistic framework provides a useful insight. Let p(d, s) represent our prior expectations of the local PDF of pixel values and p₀(d, s) represent the PDF estimated as a result of the measurement process. Both are taken with respect to the local region at scale s. Initially, we assume nothing about the prior distribution of intensity values within the local region, and so p(d, s) is uniform. If, after measurement, the posterior PDF p₀(d, s) is very sharply peaked, then the measurement must have been very informative. In this case, the region can be described well using a descriptor with few parameters (often just an average grey-value will do): the representation of the local region by a few intensity values is a good way of representing that region. A flat posterior PDF means that nothing new has been learnt from the measurement process (except that the region does not have very predictable intensity values). In this case the local region is not well captured by an intensity representation, and hence a more powerful descriptor is needed to describe the region accurately.

In other words, the local PDF of a descriptor measures how well that descriptor captures the local signal behaviour. If it captures it well, then the region is classed as non-salient; if not, then it is salient. It follows that the technique can be considered to model the non-salient parts of an image rather than the salient parts. Saliency is assigned to those regions that cannot be represented well by the prior model. This is opposite to the way in which most feature extraction and visual saliency techniques work, but is actually rather intuitive: those parts of an image that are most informative are those that are not modelled well (or predicted) by our prior assumptions. Put another way, if we can model them, then they are probably not all that interesting. We assume that the prior model captures the common or widespread behaviour of the signal. For example, descriptors based on local grey-values imply a non-saliency model of a piecewise constant local image patch; the assumption here is that piecewise constant regions are very common and not very interesting.

The feature-space definition of saliency can be stated as follows: given a descriptor with one (or more) free parameters, describe the local region’s signal; its saliency is inversely related to the compactness of this description. It should be noted, however, that the converse is not necessarily true; a case in

point is noise. This is one of the major shortcomings of the Gilles method we discussed earlier in

Section 2.1 — that entropy alone is insufficient. This is precisely the role played by WD, which we shall discuss in detail next.

4 Inter-scale saliency — WD

In this section, we analyse the inter-scale saliency term, WD. In particular, we make explicit the effects of the sampling window and its parameterisation on the resultant hierarchy of salient features.

4.1 Pixel Permutations

The PDF estimation and entropy calculation steps in the feature-space part of Scale Saliency destroy all structural and spatial information within the sampling window. A rather obvious but important corollary of this is that any permutation of the pixels within this region will result in the same entropy value, and hence the same saliency.

Figure 3 shows a patch taken from a face image, and three permutations of it: a linear gradient intensity sort, a randomisation and a radial gradient intensity sort. All have identical entropy values, but clearly some are of greater interest than others; that is, some should be labelled with a higher saliency. For example, those permutations that result in image patches with a high degree of structure are usually considered perceptually more salient than those with random pixel arrangements. Restated from a matching perspective, some permutations will contain features that are more discriminant or spatially constraining than others (in the same way that Corner features can be more useful than edge features, because they constrain searches to two dimensions).

It is interesting to note that, despite this property, entropy can be sufficient for detecting salient regions within some images, given an appropriate choice of scale. The reason for this is that, in general, images do not solely comprise permutations of a few image patches. Put another way, in most real-world images there exists at least one scale at which the degree of spatial self-similarity (in a statistical sense) is negligible. However, in many images there are many scales at which this is not true, in particular those with textured regions.

Scale Saliency can differentiate between these permutations by means of a geometric constraint defined by the sampling window and its parameterisation, and applied through WD. The variation of the local PDF and its entropy (HD) as a function of scale provides the necessary inter-scale saliency characteristics. In the examples shown in Figure 3, the entropy of the image patches is identical, but the form of HD is quite different in each case: in some there is a peak, whilst in others the function is rather flat, indicating a degree of self-similarity. The Scale Saliency algorithm assigns values of saliency according to the degree of statistical self-dissimilarity at peaks in HD.

Note that the definition of Scale Saliency in Equations 2 to 8 does not specify how p(d, s, x), the estimate of the local PDF, is generated from the image. The implementation used in (Kadir and Brady, 2001) restricts scale to be isotropic and uses a circular sampling window. The resulting measure is thus invariant to rotation in the image plane. The important part of Equation 4 is the partial derivative of p(d, s, x) with respect to the s parameter, which measures the magnitude of the change of the PDF as a function of s. This parameter is what we have so far termed scale. Alternatively, we may consider it simply as the parameter which determines the degree of local spatial support when sampling the image for the PDF estimate.

Therefore, ‘large’ values of WD must occur as a result of ‘large’ changes in p(d, s, x) as the sampling window size (the circle radius) is varied. Clearly, the spatial arrangements which cause the largest WD are those that occur at the boundaries of the sampling function. In other words, in our particular implementation the method is biased towards isotropic features.

Figure 4 demonstrates the effect of different spatial arrangements on HD and WD. Once again, the images are permutations of each other, so the entropy is the same in all of the cases. In each case, a single peak in entropy has been found, and a saliency value calculated. The position x is fixed at the centre of the image, as marked by the red cross.

Consider the first three cases: the black circle (a), the horizontal ellipse (b) and the vertical ellipse (c). All have a similarly valued peak in HD, but the first exhibits markedly different behaviour in WD. The largest change in the histogram occurs in the first case as the s parameter passes through the value corresponding to the radius of the black circle. This results in an almost step change in WD around this value of s, whereas in cases (b) and (c) the corresponding graph is rather flat. Cases (d) to (f) show the effect of other permutations: a translation of the black circle, two smaller circles, and a random permutation.

These simple examples serve to demonstrate a number of important points about the saliency method. The first is that it is the definition of scale-space that determines which of all the possible permutations, and hence which ‘types’ of feature, are labelled as highly salient. This scale-space is determined by the sampling function and its parameterisation.

The second is that the spatial part of the saliency measure, WD, measures change in the local statistics as a function of the parameterisation of the sampling function. Large changes in these statistics are labelled as highly salient (in WD).

Finally, the invariants of the sampling function determine the invariants of WD. For example, a circular sampling function is invariant to rotation (compare cases (b) and (c)), but not to translation (compare cases (a) and (d)) or anisotropic scaling of its geometry (compare cases (a) and (b)). Note, however, that although the sampling function is not invariant to isotropic scaling or translation, the method as a whole is, because scale and position are included as variables.

These observations lead us to pose two questions: Can we determine explicitly the relationship between the sampling function and the set of image permutations, or in other words the set of feature morphologies, which the method labels as highly salient? Can we design a sampling function so as to constrain the method to ‘prefer’ a given set of morphologies?

4.2 The Continuous Case

Direct approaches for determining the conditions under which the continuous form of WD is maximised are problematic. For example, we could define the PDF in terms of the image function and then seek to maximise WD by taking the derivative with respect to the image function. However, this is difficult for a number of reasons, not least because the mapping between an image function and its PDF is many-to-one: many different images can give rise to a single PDF.

As an illustrative example, consider a 1D function F : R → R to be a function of a random variable X; this may be taken to be any single line from an image. We assume X has a uniform PDF, since all values of position are equally likely. Y = F(X) is then another random variable with, in general, a different PDF from X. The standard transformation formula may be used to evaluate the PDF of Y, $p_Y(y)$:

$p_Y(y) = p_X(F^{-1}(y)) \left| \frac{\mathrm{d}}{\mathrm{d}y} F^{-1}(y) \right|$  (10)

The evaluation of $F^{-1}$ demands that F is bijective, i.e. strictly increasing or decreasing; in other words, that $F^{-1}$ exists. Clearly, for most images this is not true. To overcome this restriction, we may consider an image to be piecewise bijective and define Equation 10 over each of these intervals; the final PDF is then the sum over all these intervals. Next, the sampling window parameterisation must be included, and finally the expression substituted into WD and maximised with respect to F. We can see that the expressions quickly become unwieldy, especially once the problem is extended to the 2D case.

One aspect of the problem is that shape is not well defined in this context. We require an appropriate mathematical language to describe precisely the various ‘forms’ that might satisfy our constraints. For example, intuitively we can state that use of a circular sampling function biases the method towards isotropic features. How do we describe such features? Moreover, complex sampling functions may not be as intuitive. Therefore, we seek a mathematical analogue of the relatively loose qualitative descriptions we have used thus far.

One approach could be to apply constraints to the problem. For example, the image could be restricted to the binary case (e.g. black and white), reducing the PDF to two terms. Alternatively,

we could restrict the form of the image function to a specific analytic form for which the PDF can be entirely determined. For example, we could analyse images that are constrained to Gaussian or sinusoidal shapes. Whilst providing useful insight, neither of these approaches provides a sufficiently complete analysis; the constraints required to make the mathematics simpler significantly limit the scope of the solution to a very specific aspect of the question we are trying to address. For these reasons, in the remaining discussion we concentrate solely on the discrete form of Scale Saliency. Generalisations of the following proofs to the continuous case are left for future work. Fortunately, since images are in fact discrete in nature, this does not pose a significant restriction on the scope of the analysis. Moreover, the discrete form allows an abstraction of the problem into a more generic form.

4.3 The Discrete Case

We reformulate the image sampling and PDF estimation using set theory. Since the ordering of the pixels inside each region is not important, we can simply represent the region as a set of pixels; this can be considered as an un-spiralling of the sampling function into a set. The abstraction is valid for any one-parameter sampling function, though it can be applied to multi-parameter sampling kernels as long as each parameter can be considered independently. Such an approach lends itself more directly to the discrete form of the saliency algorithm.

First, we recast the saliency algorithm in set-theoretic terms. At every point in the image, a PDF estimate is generated (using a histogram) for a range of consecutive scales, s = {s₁, ..., s_max}. These scales are, in fact, parameters of the sampling function. The entropy, HD, is calculated at each of these scales. A subset of these scales is selected at maxima of HD and, for each scale within this subset, a value of WD is calculated. Finally, saliency is simply the product of WD and HD for each scale in the maxima subset.

At any given scale in the maxima subset, the discrete form of WD requires an estimate of the local PDF at two consecutive scales, s − 1 and s; the former can be represented as a subset of the latter. Figure 5 illustrates this approach. Using this methodology, the difficulties associated with explicit representation of the sampling function and its linking to the image function and the PDF estimate are avoided. We also avoid the problem of requiring a rigorous definition of ‘shape’ or ‘form’ for the image. Instead, the ‘shape’ of features may be derived from the resulting probabilities of set membership and knowledge of the sampling window.

Define a finite lattice, L, of size MN representing the pixel positions. Each of these positions

can take on values, d, from a finite descriptor set $D = \{d_1, d_2, \ldots, d_r\}$. We can define an image to be a function I : L → D. Considering the image as a set of descriptor values $I(j) = \{i_1, i_2, \ldots, i_n\}$ where $\forall i_j \in D$, the sampling process results in X, an ordered subset of I, representing pixel values in the image and taking on values in D:

$X = \{x_1, x_2, \ldots, x_n\} \quad \forall x_i \in D, \; i \in \mathbb{N}^+$  (11)

By partitioning X into two subsets, $X_t$ and $X_c$:

$X_t = \{x_1, x_2, \ldots, x_t\}, \quad X_c = \{x_{t+1}, x_{t+2}, \ldots, x_n\}, \quad t, c \in \mathbb{N}^+, \; t + c = n$  (12)

we can rewrite WD as:

$W_D = \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,X} - p_{d,X_t}|$  (13)

where $p_{d,X}$ and $p_{d,X_t}$ are the probabilities that descriptor value d occurs in subsets X and $X_t$ respectively. The relationship between s and s − 1, and X and $X_t$, depends on the sampling function; it is related to the area contained within two consecutive values of the scale parameter. In the case of a circle, these are simply the pixels that fall in the areas defined by πs² and π(s − 1)². Using the above reformulation of the algorithm, we now restrict our analysis to the following questions:

1. Given any t and c, which set(s) X out of all possible sets result in a maximal value of WD?

2. Given any t and c and an arbitrary but fixed set, which ordering of X results in a maximal value of WD?

Next, we rearrange WD into a more convenient form, as follows:

$W_D = \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,X} - p_{d,X_t}| = \sum_{d \in D} |p_{d,X_t} - p_{d,X_c}|$  (14)

where $p_{d,X_t}$ and $p_{d,X_c}$ are the probabilities of a pixel taking on value d in subsets $X_t$ and $X_c$ respectively. The full derivation is given in Appendix A.

This result restates WD in terms of the PDFs of the non-overlapping regions $X_t$ and $X_c$ (see Figure 5); the original measure was calculated in terms of two PDFs, one generated from a subset of the pixels of the other. Now, we must evaluate which particular ordering of X causes WD to be maximised (or ‘large’). From such an expression, we may interpret what the corresponding salient features might look like.
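As an illustrative aside, the equivalence in Equation 14 can be checked numerically; the following sketch (ours) assumes the ideal areas of Appendix A with ρ = 1, so that t = (s − 1)² and t + c = s².

# Numerical check of Equation 14: the normalised difference between the
# nested PDFs (X and X_t) equals the SAD between the PDFs of the inner
# region (X_t) and the surrounding shell (X_c).
import numpy as np

rng = np.random.default_rng(1)
s, bins = 10, 8
t, n = (s - 1)**2, s**2                 # ideal areas with rho = 1 (Appendix A)
X = rng.integers(0, bins, size=n)       # descriptor values of the sampled set
Xt, Xc = X[:t], X[t:]

def pdf(values):
    h = np.bincount(values, minlength=bins).astype(float)
    return h / h.sum()

lhs = s**2 / (2*s - 1) * np.abs(pdf(X) - pdf(Xt)).sum()
rhs = np.abs(pdf(Xt) - pdf(Xc)).sum()
print(np.isclose(lhs, rhs))             # True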

Intuitively, we might guess that Equation 14 is maximised by making the histograms $p_{d,X_t}$ and $p_{d,X_c}$ as different as possible. For example, if $X_t$ is all black then $X_c$ should be all white. This is indeed the case, but in fact there are many conditions which attain this maximum. The following theorem makes these explicit:

Theorem 2 (Conditions that maximise WD) For any set of pixels X, where $X_t$ and $X_c$ form a partition of X and take on values $d \in D = \{d_1, d_2, \ldots, d_r\}$:

$W_D = \sum_{d \in D} |p_{d,X_t} - p_{d,X_c}| \le 2$  (15)

with equality if and only if:

$\forall d : p_{d,X_t} > 0 \implies p_{d,X_c} = 0 \quad \text{and} \quad \forall d : p_{d,X_c} > 0 \implies p_{d,X_t} = 0$  (16)

Proof. We start with the measure WD:

$W_D = \sum_{d \in D} |p_{d,X_t} - p_{d,X_c}| = |p_{d_1,X_t} - p_{d_1,X_c}| + |p_{d_2,X_t} - p_{d_2,X_c}| + \ldots + |p_{d_r,X_t} - p_{d_r,X_c}|$  (17)

Let $K_t$ and $K_c$ be the sets of descriptor values for which $p_{d,X_t}$ and $p_{d,X_c}$ are greater than zero respectively:

$K_t = \{d \in D : p_{d,X_t} > 0\}, \quad K_c = \{d \in D : p_{d,X_c} > 0\}$  (18)

Hence, we have $K_t \cup K_c \subseteq D$, and since:

$\sum_{d \in D - (K_t \cup K_c)} |p_{d,X_t} - p_{d,X_c}| = 0$  (19)

because $p_{d,X_t} = p_{d,X_c} = 0$ in this case, it follows that:

$\sum_{d \in D} = \sum_{d \in K_t \cup K_c}$  (20)

Using this notation, the summation in Equation 17 can be rewritten as a summation over three mutually exclusive sets. The first, $K_t - K_t \cap K_c$, contains descriptor values whose probability is greater than zero in $X_t$ but zero in $X_c$. The second, $K_t \cap K_c$, contains descriptor values whose probability is greater than zero in both $X_t$ and $X_c$. Finally, the third, $K_c - K_t \cap K_c$, contains descriptor values whose probability is greater than zero in $X_c$ but zero in $X_t$:

$W_D = \sum_{d \in K_t - K_t \cap K_c} |p_{d,X_t} - p_{d,X_c}| + \sum_{d \in K_t \cap K_c} |p_{d,X_t} - p_{d,X_c}| + \sum_{d \in K_c - K_t \cap K_c} |p_{d,X_t} - p_{d,X_c}|$

$\phantom{W_D} = \sum_{d \in K_t - K_t \cap K_c} p_{d,X_t} + \sum_{d \in K_t \cap K_c} |p_{d,X_t} - p_{d,X_c}| + \sum_{d \in K_c - K_t \cap K_c} p_{d,X_c}$  (21)

Under the conditions in Equation 16, $K_t \cap K_c = \emptyset$, and therefore WD reduces to:

$W_D = \sum_{d \in D} p_{d,X_t} + \sum_{d \in D} p_{d,X_c} = 2$  (22)

Where the conditions in Equation 16 do not hold, $K_t \cap K_c \ne \emptyset$ and contains at least one member. We may place the following upper bounds³:

$\sum_{d \in K_t - K_t \cap K_c} |p_{d,X_t} - p_{d,X_c}| \le 1 - \sum_{d \in K_t \cap K_c} p_{d,X_t}$  (23)

$\sum_{d \in K_t \cap K_c} |p_{d,X_t} - p_{d,X_c}| \le \sum_{d \in K_t \cap K_c} p_{d,X_t} + \sum_{d \in K_t \cap K_c} p_{d,X_c}$  (24)

$\sum_{d \in K_c - K_t \cap K_c} |p_{d,X_t} - p_{d,X_c}| \le 1 - \sum_{d \in K_t \cap K_c} p_{d,X_c}$  (25)

Equation 24 becomes an equality if and only if, $\forall d \in K_t \cap K_c$, $p_{d,X_t} = 0$ or $p_{d,X_c} = 0$ (or both). Since these conditions are mutually exclusive with $K_t \cap K_c \ne \emptyset$, the inequality becomes strict:

$\sum_{d \in K_t \cap K_c} |p_{d,X_t} - p_{d,X_c}| < \sum_{d \in K_t \cap K_c} p_{d,X_t} + \sum_{d \in K_t \cap K_c} p_{d,X_c}$  (26)

Therefore:

$W_D < 2$  (27)

□

The set of pixels that maximises the value of WD, for any given values of t and c, is one that satisfies the conditions specified in Equation 16. This means that the pixels in areas t and c in Figure 5 should take on complementary descriptor values. The particular descriptor values themselves do not matter, nor does their spatial arrangement inside each region. For example, if inside region t the pixels take on grey levels 1 to 100, the pixels inside region c should take on values 101 to 255.

For a fixed set of pixels, the spatial arrangement that maximises WD is the one that comes closest to creating this complementary partition. This means that the pixels should be arranged such that the minimum number of pixels across the two regions share the same descriptor value.
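Theorem 2 can also be observed numerically. The following sketch (ours, with hypothetical pixel sets and an 8-value descriptor set) evaluates the form of Equation 14: WD attains its maximum of 2 exactly when the two regions use disjoint sets of descriptor values.

# Numerical illustration of Theorem 2 using the form of Equation 14.
import numpy as np

def wd(xt, xc, bins=8):
    """Sum of absolute differences between the PDFs of two pixel sets."""
    pt = np.bincount(xt, minlength=bins) / len(xt)
    pc = np.bincount(xc, minlength=bins) / len(xc)
    return np.abs(pt - pc).sum()

rng = np.random.default_rng(0)
inner = rng.integers(0, 4, size=200)        # region t: values 0..3 only
outer = rng.integers(4, 8, size=300)        # region c: values 4..7 only
print(wd(inner, outer))                     # 2.0 -- complementary values
print(wd(inner, rng.integers(0, 8, 300)))   # < 2 -- overlapping values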

4.4 Normalisation

The scale part of the Scale Saliency measure, WD, contains a scale normalisation term: s in the continuous case and $\frac{s^2}{2s-1}$ in the discrete case. It is necessary because the derivative part of the expression is not independent of the scale at which the expression is evaluated. In this section, we show how the normalisation for the discrete case converges to the continuous case as Δs → 0. We may rewrite WD in a generic form as follows:

$W_D = \frac{s^2}{2s\Delta s - \Delta s^2} \sum_{d \in D} |p_{d,s,x} - p_{d,s-\Delta s,x}| = \frac{s^2}{\Delta s (2s - \Delta s)} \sum_{d \in D} |p_{d,s,x} - p_{d,s-\Delta s,x}| = \frac{s^2}{2s - \Delta s} \sum_{d \in D} \frac{|p_{d,s,x} - p_{d,s-\Delta s,x}|}{\Delta s} \quad \text{since } \Delta s \ge 0$  (28)

Setting Δs = 1 in Equation 28, we get the discrete form of WD:

$W_D = \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,s,x} - p_{d,s-1,x}|$  (29)

whilst in the limit, as Δs → 0 and the set D becomes continuous, Equation 28 becomes:

$W_D = \frac{s}{2} \int_{d \in D} \left| \frac{\partial}{\partial s} p(d, s, x) \right| \mathrm{d}d$  (30)

which is equivalent to Equation 4 (ignoring the multiplicative factor $\frac{1}{2}$). In the implementation described in (Kadir and Brady, 2001), the continuous normalisation, s, was used as an empirical approximation for the discrete case. This approximation converges to the exact form in Equation 7 for large values of s. Figure 6 illustrates the effect of approximate and exact normalisation on a simple black square feature. In Figure 7, the variation of the saliency value with respect to the feature scale (a black square of side 2s) is plotted for both cases. As the feature size increases, the approximation converges to the exact value as expected. However, a small variation in the exact curve can also be observed; this is due to the discretisation of the sampling window and may be reduced through the use of a sub-pixel implementation.

Finally, we note that scale normalisation may be handled implicitly by implementing WD in the form given by Equation 14, which compares the PDFs obtained from the non-overlapping regions t and c directly. However, there are computational advantages to implementing the method in the original way, for example the reuse of the PDFs at consecutive scales.

5 Generalising Scale Saliency

In this section, we present generalisations of the Scale Saliency method to colour images and anisotropic scale, and derive an alternative measure of WD. Whilst these variations have some attractive properties, they also serve to illustrate the underlying saliency model developed in Sections 3 and 4.

5.1 Colour saliency model

Two colour models are presented: a colour difference representation (CbCr) and a full colour (RGB) model. Colour difference signals are a type of colour space commonly used in television signal processing. The colour image is represented as one luminance (Y) and two chrominance (CbCr) channels, which can be derived from the standard RGB model using the following:

$Y = 0.257R + 0.504G + 0.098B + 16$

$Cr = 0.439R - 0.368G - 0.071B + 128$

$Cb = -0.148R - 0.291G + 0.439B + 128$  (31)

For the colour difference experiments, the CbCr values are rescaled to the 0-255 range and a two-dimensional histogram of 16² bins is used for the PDF estimation. For the RGB experiments, a three-dimensional histogram of 16³ bins is used⁴. The model of non-saliency, that is, the benchmark against which a region’s unpredictability is measured, is a piecewise constant region of a single colour.

The results comparing all three models, greyscale, colour difference and full colour, are shown in Figure 8. Though the overall number of salient regions and their positions are similar, the colour information has caused a shift in the ranking of the regions. For example, the bottom row shows the top 0.01% of salient regions in each image: in the greyscale result, this corresponds to the hole in the label tag, whereas in both colour versions it corresponds to the drawing on the label. Between the two colour models, the difference is again subtle, but the RGB model produces fewer but more distinct regions than the CbCr model. This is as expected, since the RGB model also includes the intensity information. In practice, better colour models should be used, which might include effects due to surface reflectance and lighting. One thing to note is that a larger number of dimensions in the saliency model causes a significant increase in the computational load and, furthermore, exacerbates the PDF estimation problems.
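As an illustration of the colour difference descriptor, the following sketch (ours; the helper names are hypothetical, the bin count follows the text) converts an RGB patch to CbCr using Equation 31 and estimates the local PDF with a 16 × 16 joint histogram. The entropy is then computed over this two-dimensional histogram exactly as in the scalar case.

# Sketch of the CbCr descriptor: chrominance channels from Eq. 31 and a
# two-dimensional 16x16 histogram as the local PDF estimate.
import numpy as np

def rgb_to_cbcr(rgb):
    """rgb: (..., 3) array in [0, 255]. Returns the (Cb, Cr) channels."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = -0.148 * r - 0.291 * g + 0.439 * b + 128
    cr = 0.439 * r - 0.368 * g - 0.071 * b + 128
    return cb, cr

def cbcr_pdf(rgb_patch, bins=16):
    """Joint CbCr PDF estimate over one sampling window."""
    cb, cr = rgb_to_cbcr(rgb_patch.astype(float))
    hist, _, _ = np.histogram2d(cb.ravel(), cr.ravel(),
                                bins=bins, range=[[0, 256], [0, 256]])
    return hist / hist.sum()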

5.2 Multi-parameter Scale

In this section, we generalise the isotropic, single-parameter sampling function to the anisotropic case using an ellipse parameterised by a scale parameter, a rotation and a ratio of the major and minor axes. This modification enables the method to become invariant to anisotropic scaling and shear, that is, the full affine set of transformations. Another benefit is that orientation information can be captured. Similar approaches have been used to generalise Corner based interest point detectors to the affine invariant case (Baumberg, 2000, Mikolajczyk and Schmid, 2001).

The modification is quite straightforward and requires replacing the single-parameter sampling function with a three-parameter version. The entropy is calculated at each step of each of the three parameters. There is no need to modify WD, the inter-scale saliency, because, as shown in Section 4.3, the shape that causes the largest WD is the one that matches the feature shape. Furthermore, including the rotation angle in WD would cause a bias against isotropic features. Similarly, the peak detection is not modified: as in the isotropic case, scales are selected at peaks of entropy over scale.

Equations 3 to 5 can be modified for the anisotropic case by replacing the scalar s parameter with a vector, s = (s, r, θ), corresponding to the scale, ratio and orientation. The vector of scales at which the entropy peaks, $s_p$, becomes a matrix, $S_p$, with three rows, one for each of the scale variables, and as many columns as there are peaks at that position. The modified equations are as follows:

$Y_D(S_p, x) \triangleq H_D(S_p, x) \times W_D(S_p, x)$  (32)

$H_D(\mathbf{s}, x) \triangleq -\int_{d \in D} p(d, \mathbf{s}, x) \log_2 p(d, \mathbf{s}, x) \, \mathrm{d}d$  (33)

$W_D(\mathbf{s}, x) \triangleq \mathbf{s} \int_{d \in D} \left| \frac{\partial}{\partial \mathbf{s}} p(d, \mathbf{s}, x) \right| \mathrm{d}d$  (34)

$S_p \triangleq \left\{ \mathbf{s} : \frac{\partial^2 H_D(\mathbf{s}, x)}{\partial \mathbf{s}^2} < 0 \right\}$  (35)

In the discrete case, HD, WD and $S_p$ become:

$H_D(\mathbf{s}, x) \triangleq -\sum_{d \in D} p_{d,\mathbf{s},x} \log_2 p_{d,\mathbf{s},x}$  (36)

$W_D(\mathbf{s}, x) \triangleq \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,\mathbf{s},x} - p_{d,\mathbf{s}-1,x}|$  (37)

$S_p \triangleq \{ \mathbf{s} : H_D(\mathbf{s} - 1, x) < H_D(\mathbf{s}, x) > H_D(\mathbf{s} + 1, x) \}$  (38)

The original isotropic and modified anisotropic Scale Saliency algorithms are compared in Figure 9, where they have been applied to a synthetic image. The anisotropic version correctly identifies the scales of the ellipses and the circle, whereas the isotropic version correctly detects only the circle and finds numerous features along the ellipses. In Figure 10, the anisotropic Scale Saliency is applied to an original and a stretched version of an image of a Cheetah. The image sizes for the original and stretched versions were 262x340 and 340x340 respectively. The features have been thresholded and clustered using the algorithm described in (Kadir and Brady, 2001), modified to work in this new space. The parameters of the clustering were set such that the images shown are fairly clear. It can be observed that in both images many of the spots (in this case the most salient features) have been identified at a scale, ellipse ratio and orientation appropriate to the local feature.

The extra information brought by the use of the three-parameter scale-space provides a more accurate representation of the image and a richer descriptor set. However, it should be noted that we have found the modified algorithm to be quite sensitive to noise in the image, and further development is necessary before this approach can be applied routinely. For example, alternative parameterisations of the ellipse might prove more stable, such as using the scales of the two axes and a rotation. Such investigations are left for future work.
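A sketch of the three-parameter sampling window is given below (ours; the mask construction is an assumption, only the (s, r, θ) parameterisation comes from the text). It builds a boolean mask selecting the pixels inside an ellipse of scale s, axis ratio r and orientation θ; the local PDF and entropy are then computed from the selected pixels for each parameter triple.

# Elliptical sampling window for the anisotropic scale-space: semi-axes
# s and s*r, rotated by theta.
import numpy as np

def ellipse_mask(s, r, theta):
    """Boolean mask of the pixels inside the rotated ellipse."""
    half = int(np.ceil(s * max(1.0, r)))            # bounding half-width
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    c, si = np.cos(theta), np.sin(theta)
    u = c * xs + si * ys                            # ellipse-frame coordinates
    v = -si * xs + c * ys
    return (u / s)**2 + (v / (s * r))**2 <= 1.0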

5.3 Alternative Scale measures

In (Kadir and Brady, 2001), we briefly discussed different possibilities for PDF similarity measures and, for simplicity, chose the SAD (sum of absolute differences) measure for our implementation. We note, however, that the formulation in Equation 14 makes explicit that the similarity measure is based upon the PDFs of the two non-overlapping regions t and c. Therefore, as an alternative, we can use a probabilistic measure by considering the conditional probability of the pixel values in the c region given the distribution in the t region. If we use log probabilities, we maintain the information-theoretic approach. Starting with the expectation of the conditional probabilities of pixels in $X_c$ given the PDF of the pixels in $X_t$, $p_{d,X_t}$ where d ∈ D, we can derive an expression that compares the PDFs of the two regions as follows:

$M_{X_c,X_t} \triangleq E[\log(p(X_c \mid p_{d,X_t}))] = \frac{1}{c} \sum_{x \in X_c} \log(p(x \mid p_{d,X_t})) = \frac{1}{c} \left( \log(p(x_1 \mid p_{d,X_t})) + \ldots + \log(p(x_c \mid p_{d,X_t})) \right)$  (39)

We define the counting function $C_X(d)$ as:

$C_X(d) = \sum_{x \in X} \delta(x - d)$  (40)

Equation 40 simply counts the number of times the value d occurs in region X. Substituting into Equation 39:

$M_{X_c,X_t} = \frac{1}{c} \left[ C_{X_c}(d_1) \log(p_{d_1,X_t}) + \ldots + C_{X_c}(d_r) \log(p_{d_r,X_t}) \right] = \frac{1}{c} \left[ c\, p_{d_1,X_c} \log(p_{d_1,X_t}) + \ldots + c\, p_{d_r,X_c} \log(p_{d_r,X_t}) \right] = \sum_{d \in D} p_{d,X_c} \log(p_{d,X_t})$  (41)

Equation 41 is, up to sign, the cross-entropy between the PDFs of regions c and t; the cross-entropy can be interpreted as a measure of the information cost of coding region c given a model of region t. It is smallest, equal to the entropy of region c, when the two PDFs are the same, and grows as they diverge. However, in order to use $M_{X_c,X_t}$ as a replacement for WD, similar PDFs should be given a low value and different ones a high value; we want to rank higher those scales which bring ‘surprising’ information. One way this can be done is to subtract $M_{X_c,X_t}$ from its value when $X_t = X_c$. In fact, by doing this we obtain the Kullback-Leibler divergence:

$W_D = M_{X_c,X_c} - M_{X_c,X_t} = \sum_{d \in D} p_{d,X_c} \log(p_{d,X_c}) - \sum_{d \in D} p_{d,X_c} \log(p_{d,X_t}) = \sum_{d \in D} p_{d,X_c} \log \frac{p_{d,X_c}}{p_{d,X_t}}$  (42)
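For completeness, a minimal sketch of this alternative measure operating directly on the histograms of the two non-overlapping regions; the small eps floor is our assumption, introduced to keep the divergence finite when the supports differ.

# Kullback-Leibler alternative to the SAD inter-scale measure (Eq. 42):
# zero for identical PDFs, large when region c is surprising given region t.
import numpy as np

def wd_kl(hist_c, hist_t, eps=1e-12):
    """KL(p_c || p_t) between two histograms over the same binning."""
    p_c = hist_c / hist_c.sum()
    p_t = hist_t / hist_t.sum()
    return np.sum(p_c * np.log2((p_c + eps) / (p_t + eps)))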

6 Discussion: Scale Saliency and Basis Projection

In this section, we complete our analysis with a discussion of the relationship between the Scale Saliency algorithm and transform based feature detection methods, for instance those based on the wavelet transform. Recent work in the area of transform based signal processing, principally by Mallat (1998), has unified many previously disparate image processing operators. As a result of this work, many feature detection algorithms may equivalently be considered as filtering operations, coordinate transforms or basis projections. We utilise this equivalence in the discussion below.

Many transform-based feature selection algorithms, for example those using a wavelet transform, operate by projecting the image onto a series of scaled, translated and rotated basis functions. Salient features are considered to be those scalings, translations and rotations which result in large-magnitude coefficients; hence such methods ‘prefer’ features that resemble the morphology, that is to say the shape, of the available basis functions of the transform. In other words, the transform determines which features are labelled as salient. Figure 11 shows examples of different wavelet transforms applied to the same image patch; each produces a different ranking of the features in the image.

One of the key properties of such transforms is that they are a high-to-low entropy mapping (for features of interest). This is precisely the property that image compression algorithms rely upon. Furthermore, this property facilitates feature detection by application of a simple magnitude

threshold on the transformed image; if the transform resulted in high entropy, all coefficients would be equally likely and a magnitude threshold could not be used. An alternative strategy, such as Phase Congruency, must then be used (Kovesi, 1999).

The Gilles algorithm and the feature-space part of the Scale Saliency algorithm, HD, also consider high-entropy regions in image space to be salient, but measure the entropy directly. However, as discussed in Section 4.1, all permutations of a given image patch result in the same entropy value; hence, a random permutation of the patch is considered as salient as one resulting in structure. The scale measure, WD (Equation 4), places constraints on the preferred spatial structuring and thus produces a ranking of the permutations. Those permutations which exhibit unpredictable structure within the constraints imposed by the sampling window and its parameterisation, i.e. the scale-space, are ranked higher (see Figure 12).

The difference between the transform and the Scale Saliency approaches is that the rankings are produced in different ways. While both sets of rankings are generated by imposing constraints on the spatial arrangements of pixels, in the former case this is achieved by convolution with a basis function, whereas in the latter a statistical measure in combination with a sampling function is used. This leads to quite different properties. In the transform case, the results are biased towards features which match the morphology of the basis functions. In the Scale Saliency case, the features can be of any morphology as long as the constraints of the scale-space are satisfied; for example, with a circular sampling window the scale-space is isotropic, and hence the features must have isotropically surprising information at some scale. Clearly, the set of constraints imposed by the Scale Saliency method will be satisfied by many more morphologies than those of a given transform method. It is in this sense that the Scale Saliency algorithm may be considered a more general method for salient feature detection.

7 Conclusion

In this paper, we have presented a theoretical analysis of the Scale Saliency method, the purpose of which was to develop an understanding of the relationship between the algorithm and the resultant hierarchy of salient features. The discussion was split into two main parts, which dealt with the feature-space and scale-space aspects of the algorithm respectively.

The conclusion from the first part was that the feature-space aspect of the algorithm measures unpredictability compared to a benchmark determined by an underlying prior model of non-saliency. This prior model is determined by the descriptor used for the saliency calculation. For example, in the entropy of intensity case, the prior model is that, at some scale, the image can be represented by a single intensity value. Therefore, the descriptors used for the saliency calculation should be chosen such that they model the non-salient parts of the image. Furthermore, since the entropy of a local descriptor is a measure of the compactness of the description of the local signal behaviour, subsequent descriptions of salient regions should be derived using more powerful descriptors than the one used for measuring saliency.

From the second part of the analysis, the main result was that the Scale Saliency method is a more general approach to feature saliency than conventional transform methods (or, equivalently, filtering or basis projection methods), because it measures saliency and scale without reference to a particular basis morphology. Instead, spatial constraints are applied via a statistical measure derived from a sampling window, the parameterisation of which determines the properties of the scale-space and hence the resultant hierarchy of salient features. A generalised version of Scale Saliency was implemented using a three-parameter sampling function corresponding to the two axes of an ellipse and a rotation. Such a modification gives rise to an anisotropic scale-space and correctly detects the scale of anisotropic features.

There remain a number of open issues. The first is the generalisation of the analysis of WD to the continuous case. The discrete nature of images has meant that this has not been a significant limitation to date; nevertheless, it is clearly desirable to complete the theoretical analysis of the proposed saliency model. We discussed some of the difficulties with the continuous analysis in Section 4. There are, in fact, two parameters that have been assumed to be discrete in our analysis, the spatial plane and the descriptor values; one possible direction could be to examine these separately. The second is the generalisation of the analysis to sampling functions with multiple parameters. Finally, and perhaps most fruitful, is the formalisation of the link between Scale Saliency and basis projection methods, as outlined in the discussion in Section 6.

Acknowledgments

This research was sponsored by a Motorola University Partners in Research grant. We would like to thank Dr. Paola Hobson of the Motorola UK Research Laboratory for the many very useful discussions we have had during the course of this work. We would also like to thank Vicky Mortimer of the Robotics Research Group, Oxford, for help with the derivation in Section 4.4.

A Re-arranging WD in the discrete case

$W_D = \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,X} - p_{d,X_t}| = \sum_{d \in D} |p_{d,X_c} - p_{d,X_t}|$  (43)

where $p_{d,X}$, $p_{d,X_t}$ and $p_{d,X_c}$ are the probabilities of a pixel taking on value d ∈ D in the region $X = \{x_1, x_2, \ldots, x_n\}$ and in the partition of X into two subsets of size t and c, $X_t = \{x_1, x_2, \ldots, x_t\}$ and $X_c = \{x_{t+1}, x_{t+2}, \ldots, x_n\}$, respectively.

$\sum_{d \in D} |p_{d,X} - p_{d,X_t}| = |p_{d_1,X} - p_{d_1,X_t}| + \ldots + |p_{d_r,X} - p_{d_r,X_t}|$

$= \left| \frac{N_{d_1,X}}{n} - \frac{N_{d_1,X_t}}{t} \right| + \ldots + \left| \frac{N_{d_r,X}}{n} - \frac{N_{d_r,X_t}}{t} \right|$

$= \left| \frac{N_{d_1,X_t} + N_{d_1,X_c}}{t + c} - \frac{N_{d_1,X_t}}{t} \right| + \ldots + \left| \frac{N_{d_r,X_t} + N_{d_r,X_c}}{t + c} - \frac{N_{d_r,X_t}}{t} \right|$

$= \left| \frac{t N_{d_1,X_c} - c N_{d_1,X_t}}{t(t + c)} \right| + \ldots + \left| \frac{t N_{d_r,X_c} - c N_{d_r,X_t}}{t(t + c)} \right|$  (44)

where $N_{d_r,X}$ is the number of pixels in region X with value $d_r$. Defining $p_{d_r,X_t} = \frac{N_{d_r,X_t}}{t}$ and $p_{d_r,X_c} = \frac{N_{d_r,X_c}}{c}$:

$\sum_{d \in D} |p_{d,X} - p_{d,X_t}| = \left| \frac{ct\,p_{d_1,X_c} - ct\,p_{d_1,X_t}}{t(t + c)} \right| + \ldots + \left| \frac{ct\,p_{d_r,X_c} - ct\,p_{d_r,X_t}}{t(t + c)} \right|$

$= \frac{ct\,|p_{d_1,X_c} - p_{d_1,X_t}|}{t(t + c)} + \ldots + \frac{ct\,|p_{d_r,X_c} - p_{d_r,X_t}|}{t(t + c)}$

$= \frac{c\,|p_{d_1,X_c} - p_{d_1,X_t}|}{t + c} + \ldots + \frac{c\,|p_{d_r,X_c} - p_{d_r,X_t}|}{t + c}$

$= \frac{c}{t + c} \sum_{d \in D} |p_{d,X_c} - p_{d,X_t}|$  (45)

Therefore:

$W_D = \frac{s^2}{2s - 1} \cdot \frac{c}{t + c} \sum_{d \in D} |p_{d,X_c} - p_{d,X_t}|$  (46)

where t and c are the numbers of pixels in regions t and c (see Figure 5), i.e. the sizes of the sets $X_t$ and $X_c$ respectively. These approximate the areas defined by the (continuous) sampling function at two incremental values of its parameterisation. We restrict our analysis to one-parameter sampling functions. We can represent the area of many 2D shapes in the following form:

$A(s) = \rho s^2$  (47)

where ρ is a weighting term for specific shapes. For example, ρ = π for a circle of radius s, ρ = 4 for a square with sides 2s, and $\rho = \frac{1}{2} n \sin \frac{2\pi}{n}$ for a polygon with n sides inscribed inside a circle of radius s. Ignoring the effects of the discretisation of the sampling, we may write:

$t + c = A(s) = \rho s^2$  (48)

$c = A(s) - A(s - \Delta s) = \rho s^2 - \rho (s - \Delta s)^2 = \rho \left( s^2 - s^2 + 2s\Delta s - \Delta s^2 \right) = \rho \left( 2s\Delta s - \Delta s^2 \right)$  (49)

Therefore, we can rewrite WD as:

$W_D = \frac{c}{t + c} \cdot \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,X_c} - p_{d,X_t}| = \frac{\rho (2s\Delta s - \Delta s^2)}{\rho s^2} \cdot \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,X_c} - p_{d,X_t}| = \frac{2s\Delta s - \Delta s^2}{s^2} \cdot \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,X_c} - p_{d,X_t}|$  (50)

In the discrete case, Δs = 1, therefore:

$W_D = \frac{2s - 1}{s^2} \cdot \frac{s^2}{2s - 1} \sum_{d \in D} |p_{d,X_c} - p_{d,X_t}| = \sum_{d \in D} |p_{d,X_c} - p_{d,X_t}|$  (52)

Note that the normalisation is independent of the sampling function used, as long as its area can be represented in the form of Equation 47.
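As a small numerical aside (ours), the inscribed-polygon weight in Equation 47 tends to the circle's weight π as the number of sides grows, as the following fragment illustrates:

# The shape weight rho = 0.5 * n * sin(2*pi/n) approaches pi as n grows.
import math

for n in (4, 8, 64, 1024):
    print(n, 0.5 * n * math.sin(2 * math.pi / n))
# 4 -> 2.0, 8 -> 2.828, 64 -> 3.1365, 1024 -> 3.14158: converging to pi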

References

Robert Ash. Information Theory. Wiley-Interscience, New York, 1965.

A. Baumberg. Reliable feature matching across widely separated views. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 774–781, 2000.

F. Bergholm. Edge focusing. In Proc. International Conference on Pattern Recognition, pages 597–600, 1986.

J. Besag. The statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48(3): 259–302, 1986.

J. Canny. A computational approach to edge detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.

R. Deriche and G. Giraudon. A computational approach for corner and vertex detection. International Journal of Computer Vision, 10(2):101–124, 1993.

S. Gilles. Robust Description and Matching of Images. PhD thesis, University of Oxford, 1998.

C. Harris and M. Stephens. A combined corner and edge detector. In Proc. 4th Alvey Vision Conference, pages 189–192, 1988. Manchester.

Timor Kadir and Michael Brady. Scale, saliency and image description. International Journal of Computer Vision, 45(2):83–105, 2001.

J.J. Kœnderink. The structure of images. Biological Cybernetics, 50(5):363–370, 1984.

P. Kovesi. Image features from phase congruency. Videre, 1(3), 1999.

Y.G. Leclerc. Constructing simple stable descriptions for image partitioning. International Journal of Computer Vision, 3(1):73–102, 1989.

S.Z. Li. Markov Random Field Modeling in Computer Vision. Springer-Verlag, Tokyo, 1995.

T. Lindeberg and B.M. ter Haar Romeny. Linear scale-space: I. basic theory, II. early visual operations. In B.M. ter Haar Romeny, editor, Geometry-Driven Diffusion. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1994.

S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, San Diego, 1998.

K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In Proc. International Conference on Computer Vision, 2001.

F. Mokhtarian and R. Suomela. Robust image corner detection through curvature scale space. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(12), 1998.

N. Paragios and R. Deriche. Geodesic active regions and level set methods for supervised texture segmentation. International Journal of Computer Vision, 46(3):223, 2002.

A. Witkin. Scale-space filtering. In Proc. International Joint Conference on Artificial Intelligence, Karlsruhe, Germany, 1983.

S.C. Zhu and A. Yuille. Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(9):884–900, 1996.

Footnotes

0. Affiliation of authors: Robotics Research Laboratory, Department of Engineering Science, University of Oxford, Parks Road, Oxford OX1 3PJ, U.K.

1. Timor Kadir is the recipient of a Motorola University Partners in Research grant.

2. A note on the nomenclature used in this paper: we use the term ‘measure’ in the general sense, that is, not in relation to Measure Theory.

3. These bounds result from the fact that the sets are mutually exclusive. We wish to show that by moving a descriptor value to $K_t \cap K_c$ we necessarily reduce the total sum of WD.

4. Alternatively, we may quantise the image to the 0-15 range directly.

Figure 1: High saliency regions, such as the eye, exhibit unpredictable local intensity hence high entropy. Image from NIST Special Database 18, Mugshot Identification Database.

Figure 2: Salient scale region selection and salient icons (10% most salient shown) are robust to rotation (45◦ clockwise) and scaling (60% of original size).

Figure 3: Permutations. The lower three images are permutations of the top one. The graphs show the entropy with increasing scale of a local circular window grown around the centre of the image. Image from NIST Special Database 18, Mugshot Identification Database.

Figure 4: Curves for HD(s) and WD(s) for different permutations of image (a) at fixed x. The entropy at the largest scale is identical for all, but the form of the curves up to this scale is quite different.

Figure 5: Reformulating the sampling function as a set theory problem. Any one-parameter sampling function may be represented in this way; shown are the circular (A) and free-form (B) cases.

Figure 6: Comparing the entropy, WD and saliency of the same feature at two scales: two squares with sides 60 and 180 pixels. The saliency values are close under both the approximate, s, and exact, $\frac{t+c}{c}$, normalisation. There is some ‘drift’ in both due to error from discretisation.

Figure 7: Saliency as a function of feature size. The approximation converges to the exact normalisation for large scale features.


Figure 8: Comparing greyscale (G), colour difference (CbCr) and full colour (RGB) Scale Saliency results on Paddington image. Top and bottom rows are the clustered 4% most salient and global thresholded top 0.01% respectively.

Figure 9: Comparing the original isotropic (left) and modified anisotropic (right) Scale Saliency algorithms.

Figure 10: The anisotropic Scale Saliency applied to a Cheetah image and its stretched version. The 0.5% (by number) clustered most salient features are shown. In each case, many of the Cheetah’s spots are correctly identified at their appropriate local (anisotropic) scale.

Figure 11: Different wavelet transforms produce different rankings of features. Shown are the Daubechies 1, Daubechies 2, Haar and Spline wavelet transforms of the Eye image patch at 4 scales. Each quadrant is normalised for display.

Figure 12: Entropy for all permutations of the pixels of a region is the same. The scale measure WD ranks these into an order of preference or saliency.
