Scale Invariance Without Scale Selection

Scale Invariance without Scale Selection Iasonas Kokkinos Alan Yuille Department of Statistics, UCLA Department of Statistics, UCLA [email protected] [email protected] ¤ Abstract In this work we construct scale invariant descriptors (SIDs) without requiring the estimation of image scale; we thereby avoid scale selection which is often unreliable. Our starting point is a combination of Log-Polar sampling and spatially-varying smoothing that converts image scalings and rotations into translations. Scale invariance Local Structure Context can then be guaranteed by estimating the Fourier Trans- Figure 1. Our goal is to extract scale-invariant information around generic image structures where scale selection can be unreliable, form Modulus (FTM) of the formed signal as the FTM is e.g. edges. The local image structure that is used by most scale translation invariant. selection mechanisms is often not informative about the scale of We build our descriptors using phase, orientation and the structure, which becomes apparent from the image context. amplitude features that compactly capture the local image structure. Our results show that the constructed SIDs out- shown in Fig. 1. Context is most informative and can only perform state-of-the-art descriptors on standard datasets. be incorporated by considering multiple scales. A main advantage of SIDs is that they are applicable to a We propose a method to compute scale invariant descrip- broader range of image structures, such as edges, for which tors (SIDs) that does not require scale selection. For this scale selection is unreliable. We demonstrate this by com- we use a combination of log-polar sampling with spatially bining SIDs with contour segments and show that the per- varying filtering that converts image scalings and rotations formance of a boundary-based model is systematically im- into translations. Scale invariance is achieved by taking the proved on an object detection task. Fourier Transform Modulus (FTM) of the transformed sig- nals as the FTM is translation invariant. Our experiments show that SIDs outperform current descriptors when tested on standard datasets. 1. Introduction By freeing us from the need for scale selection, SIDs can Local image descriptors evaluated at interest points have be used in a broader setting, in conjunction with features been very successful for many visual tasks related to ob- such as edges and ridges [18]. Such features can be used to ject detection [21]. An important issue is how to deal with construct intuitive object representations as they are related changes in image scale. Typically this is done in a two- to semantically meaningful structures, namely boundaries stage process which first extracts a local estimate of the and symmetry axes. However, they have been limited since scale and then computes the descriptor based on an appro- they do not come with scale estimates, so it has been hard to priately sized image patch. This strategy is limited in two use them for scale-invariant detection. We address this by respects. First, for most places in the image it is hard to ob- augmenting contour segments with SIDs, and use them in a tain reliable scale estimates, with the exception of symmet- flexible object model for scale-invariant object detection. ric structures, such as blobs or ridges. However, we would not like to limit ourselves to the few structures for which 2. Previous Work scale estimation is reliable, as other structures, e.g. edges can be useful for object detection. Second, even if scale es- Image descriptors summarize image information around timation is reliable, it does not necessary indicate the scale points of interest using low-dimensional feature vectors that where the most useful appearance information resides, as are designed to be both distinctive and repeatable. Two seminal contributions have been SIFT descriptors [21] and ¤This work was supported by NSF Grant 0413214 Shape Contexts [26]; these have been followed by several Scale Space Band-pass image Amplitude, A Phase, sin(Á) Orientation, θ Space-Variant Processing Figure 2. Front-End Analysis. Left: We combine log polar sampling with a spatially increasing filter scale to guarantee scale invariance (taken from [20]). Right: Features extracted from a band-pass filtered image using the Monogenic Signal. extensions and refinements, such as Geometric Blur [3, 4], is proportional to the distance from the center of the log- PCA-SIFT [12] and GLOH [23]. Please see [23] for an up- polar sampling grid. As we show in App. A, this guarantees to-date review, comparisons and more extensive references; that scaling the image only scales the features and does not we will compare in detail our descriptor to related ones in distort them in any other way. This allows us to then use a the following sections. sparse log-polar sampling of the image features and convert Typically, a two stage approach is used to deal with scalings/rotations into translations. changes in image scale. First, a front end system is used Comparing to other image descriptors that use a log- to estimate local image scale, e.g. by using a scale-adapted polar sampling strategy, in the GLOH descriptor of [23], differential operator [19, 18]. Then, the estimated scale is the histogram of the image gradient at several orientations used to adapt the descriptor. As argued in the introduction, is computed by averaging the gradient within each compart- this two-stage approach is often problematic and the only ment of a log polar grid. Such descriptors can be redundant, work known to us that addresses this is [6]. There the au- since typically a single orientation is locally dominant. Fur- thors obtain stable measures of scale along edges by extract- ther, a front-end detector is used to determine the descrip- ing descriptors at multiple scales and choosing the most sta- tor’s scale, which as mentioned can be problematic. The ble one; however this requires an iterative, two-stage front- work of the authors in [3] is also closely related, as they end procedure for each interest point. increase the smoothing with the distance from the center. However their approach leads to distortions due to scale 3. Front-End Analysis changes, while in our work we guarantee that apart from being translated, the signal does not get distorted. 3.1. Image Sampling with the Log-Polar Transform In order to design scale-invariant descriptors we exploit 3.2. Feature Extraction the fact that the log polar transform Having described our sampling strategy, we now de- r r scribe the features that are being sampled. Specifically, I^(r; u) = I(x0 ¡ σ cos(u); y0 ¡ σ sin(u)) (1) 0 0 we compute the image orientation, phase and amplitude, as converts rotations and scalings of the image I around shown in Fig. 2, which largely capture local image struc- (x0; y0) into translations of the transformed image I^; r and ture. The phase of an image provides symmetry-related in- u are log-polar coordinates and σ0 is a scaling constant. formation, indicating whether the image looks locally like This transform is extensively used in image registration (see an edge (Á = 0) or a peak/valley (Á = §¼=2). The am- e.g. [31] and references therein), while in [27] it is argued plitude A is a measure of feature strength (contrast), and that this logarithmic sampling of the image is similar to the its orientation θ indicates the dominant direction of image sampling pattern of the human visual front-end. We apply variation. To compute these we use the Monogenic signal this sampling strategy to our problem, by setting x0; y0 in of [7], as described below. (1) equal to the location of an interest-point, and consider the construction of a scale-invariant descriptor around it. 3.2.1 The Monogenic Signal A practical concern is that directly sampling the image around each point is impractical, because we would need In order to estimate the amplitude and phase of an 1-D sig- too many samples to avoid aliasing. Therefore we remove nal a well established method is based on the Analytic Sig- high frequency components by band-pass filtering the im- nal, obtained via the Hilbert transform [11]. However the age before extracting features from it. extension of the Analytic Signal to 2D had only been par- Further, as shown in Fig. 2, we use spatially varying tial, until the introduction of the Monogenic Signal in [7]. filtering and sample the image (or its features) at a scale that Following [7], we obtain the local amplitude a, orientation θ and phase Á measurements of a 2D signal h by: Input Image q h h A = h2 + h2 + h2 ; θ = tan¡1 y ;Á = tan¡1 p x y 2 2 hx hy + hx ¡1 p !fx;yg hfx;yg = F ( ¡1 p H); 2 2 !x + !y where H = F(h) is the 2D Fourier transform of h and + Multi-Scale Analysis !x;!y are horizontal/vertical frequencies. A simple imple- Filter Amplitude: A Phase: sin(Á) Orientation: θ mentation can be found at [16]. Apart from being theoretically sound, the Monogenic Signal is also efficient; prior to the generalization of the Hilbert transform to 2D, earlier approaches would first pre- process the image with a set of orientation-selective filters Small Scale [25, 10, 22, 24, 13] and then essentially treat the output of each filter as an 1-D signal. Instead, the Monogenic Signal only requires filtering with a single band-pass filter with no orientational preference. Large Scale + Log-Polar Sampling A cos(Á ) A sin(Á ) A cos(2θ¤) A sin(2θ¤) 4. Scale Invariant Descriptor Construction d d d d d d d d Having laid out the ideas underlying our front-end processing, we now describe our method for computing SIDs, depicted as a block diagram in Fig.

Load more