SIFT, GLOH, SURF descriptors

Dipartimento di Sistemi e Informatica

Invariant local descriptors are useful for:

• Object Recognition and Tracking.
• Robot Localization and Mapping.
• Image Registration and Stitching.
• Image Retrieval.
• Augmented Reality (http://blogs.oregonstate.edu/hess/sift-library-places-2nd-in-acm-mm-10-ossc/)

[Figure: template and video stream]

Scale invariant detectors

• In most object recognition applications the scale of the object in the image is unknown. Instead of extracting features at many different scales and then matching all of them, it is more efficient to design a function on the region which is the same for corresponding regions, even if they are at different scales.

• The problem can also be stated as follows: given two images of the same scene with a large scale difference between them, find the same interest points independently in each image.

• For scale invariant feature extraction it is necessary to detect structures that can be reliably extracted under scale changes.

• This is done by evaluating a signature function (a kernel) in the point neighbourhood and plotting the result as a function of the neighbourhood scale. Since it measures properties of the local neighbourhood at a certain scale, it should take a similar qualitative shape if two keypoints are centered on corresponding image structures.

• The function shape should be squashed or expanded as a result of the scaling factor. Corresponding neighbourhood sizes should be detected by searching for extrema of the signature function in both images.

For a given point, we can consider f as a function of region size (circle radius). A common approach is to take a local maximum of this function. The solution is to search for maxima of suitable functions in scale and in space over the images.

[Figure: f plotted against region size for Image 1 and for Image 2 (scale = 1/2)]

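The region size (scale) at which the maximum is achieved should follow the image scale: if the image is rescaled by some factor, the selected region size changes by the same factor, so the region covers the same image content in both images.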

[Figure: f versus region size/scale for Image 1 (maximum at s1) and for Image 2 at scale = 1/2 (maximum at s2)]

• A “good” function for scale detection has one stable sharp peak.

[Figure: three example plots of f versus region size, labelled good, bad, bad]

• For usual images a good function would be one that responds to contrast (sharp local intensity change). It is easier to look for zero-crossings of the 2nd derivative than for maxima.

• There are few approaches which are truly invariant to significant scale changes. Typically, such techniques assume that the scale change is the same in every direction, although they exhibit some robustness to weak affine deformations.

• The appropriate kernel for this is the scale-normalized Gaussian kernel G(x, σ) and its derivatives.

• The classical approach is to generate a Gaussian scale-space representation of an image, i.e. a set of images obtained by convolving the image with isotropic (circular) Gaussian kernels of various sizes. A larger scale results in a smoother image.

• Existing methods search for local extrema in the 3D Gaussian scale-space representation of an image (x, y and scale). Local extrema over scale of normalized derivatives indicate the presence of characteristic local structures.

• The motivation for generating a scale-space representation of a given image originates from the basic observation that real-world objects are composed of different structures at different scales. This implies that real-world objects may appear in different ways depending on the scale of observation.

• The Gaussian scale-space guarantees that new structures are not created when going from a fine scale to any coarser scale. Its properties include shift invariance, non-enhancement of local extrema, and rotational invariance.
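The idea of selecting a characteristic scale from the extremum of a scale-normalized signature function can be illustrated with a small sketch. It assumes an idealized image structure, a 2D Gaussian blob of standard deviation `blob_sigma`, for which the scale-normalized Laplacian response at the blob centre has a closed form. The function names and the ladder of scales are illustrative choices, not part of the original material.

```python
import math

def normalized_log_response(sigma, blob_sigma):
    """|sigma^2 * Laplacian| at the centre of a 2D Gaussian blob (std blob_sigma)
    after smoothing with a Gaussian of std sigma, in closed form."""
    t2 = sigma ** 2 + blob_sigma ** 2      # variances add under Gaussian smoothing
    return sigma ** 2 / (math.pi * t2 ** 2)

def characteristic_scale(blob_sigma, sigmas):
    """Scale in `sigmas` at which the normalized response is largest."""
    return max(sigmas, key=lambda s: normalized_log_response(s, blob_sigma))

# Geometric ladder of candidate scales (assumed: 8 steps per doubling).
sigmas = [0.5 * 2 ** (n / 8) for n in range(49)]

s1 = characteristic_scale(4.0, sigmas)   # blob as seen in image 1
s2 = characteristic_scale(2.0, sigmas)   # same blob in image 2 (scale = 1/2)
print(s1, s2, s1 / s2)
```

In this idealized case the selected scale equals the blob size in each image, so the ratio of the two selected scales recovers the scale factor between the images, even though each scale is found independently.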

Functions for determining scale

f = Kernel ∗ Image

Kernels:

Laplacian of Gaussians: $L = \sigma^2 \left( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) \right)$

Difference of Gaussians (an approximation of the Laplacian): $DoG = G(x, y, k\sigma) - G(x, y, \sigma)$

where the Gaussian is $G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$

Both kernels are invariant to scale and rotation.

The method:
- build scale-space pyramids;
- examine all scales to identify scale-invariant features: compute the Difference of Gaussians (DoG) pyramid or the Laplacian of Gaussians (LoG);
- detect maxima and minima in scale space.

• Harris-Laplacian¹: find the local maximum of:
  – the Harris corner detector in space (image coordinates)
  – the Laplacian in scale

• SIFT (Lowe)²: find the local maximum of:
  – the Difference of Gaussians in space and scale

[Figure: scale-space volumes with axes x, y and scale; Harris in space and Laplacian in scale for Harris-Laplacian, DoG in space and scale for SIFT]

¹ K. Mikolajczyk, C. Schmid. “Indexing Based on Scale Invariant Interest Points”. ICCV 2001
² D. Lowe. “Distinctive Image Features from Scale-Invariant Keypoints”. Accepted to IJCV 2004

Harris-Laplacian scale-invariant detector

• The Harris-Laplacian method first applies the Harris function at multiple scales, then selects the points for which the Laplacian attains a maximum over scales.

• Harris corner points are interest points that have good rotational and illumination invariance, but they are not scale invariant. To obtain scale invariance, the second-moment matrix is modified to use a Gaussian scale-space representation, with a Laplacian of Gaussian kernel for scale selection.

• Since the computation of derivatives usually involves a stage of scale-space smoothing, an operational definition of the Harris operator requires two scale parameters:
  – (i) a local derivation scale for smoothing before the computation of derivatives
  – (ii) an integration scale for accumulating the operations on derivatives

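• The scale-adapted second-moment matrix is

$M(x, y, \sigma_I, \sigma_D) = \sigma_D^2 \, g(\sigma_I) * \begin{bmatrix} L_x^2(x, y, \sigma_D) & L_x L_y(x, y, \sigma_D) \\ L_x L_y(x, y, \sigma_D) & L_y^2(x, y, \sigma_D) \end{bmatrix}$

where g(σI) is the Gaussian kernel of scale σI (integration scale), L(x,y) is the Gaussian-smoothed image, and Lx and Ly are its derivatives in the x and y directions, calculated using a Gaussian kernel of scale σD (differentiation scale). The multiplication by σD² is needed because derivatives of order m must be normalized across scales according to $D_m(x, \sigma) = \sigma^m L_m(x, \sigma)$.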

• The algorithm searches across multiple scales $\sigma_n = k^n \sigma_0$, that is $\sigma_0, k\sigma_0, k^2\sigma_0, k^3\sigma_0, \ldots, k^n\sigma_0$ (k = 1.4), setting σI = σn and σD = s σI (s = 0.7).

• At each scale, corners are found as in the Harris method, applied to the matrix M in an 8-point neighbourhood. An iterative algorithm localizes corner points spatially and chooses the characteristic scale:
  – The Laplacian of Gaussians is used to judge whether each of the candidate points found on the different levels forms a maximum in the scale direction (checked against levels n-1 and n+1). The scale at which such a maximum is found is referred to as the characteristic scale, and it is used in subsequent iterations. Points are spatially localized at the characteristic scale.

• Mikolajczyk and Schmid (2001) demonstrated that the LoG measure attains the highest percentage of correctly detected corner points in comparison to other scale-selection measures:

$|LoG(x, y, \sigma)| = \sigma^2 \left| L_{xx}(x, y, \sigma) + L_{yy}(x, y, \sigma) \right|$

• At each iteration the corner point $x_{k+1}$ is selected that maximizes the LoG within the scale neighbourhood. The process terminates when $x_{k+1} = x_k$.

Multi-scale Harris points

Selection of points at the characteristic scale with Laplacian

Invariant points + associated regions [Mikolajczyk & Schmid’01]

SIFT: Scale Invariant Feature Transform

• The SIFT method was introduced by D. Lowe in 2004 to represent visual entities according to their local properties. The method employs local features taken at salient points (referred to as keypoints or SIFT points). Keypoints (through their SIFT descriptors) are used to characterize shapes with invariant properties.

• Image points selected as keypoints and their SIFT descriptors are robust under:
  - Luminance change (due to difference-based metrics)
  - Scale change (due to scale-space)
  - Rotation (due to local orientations relative to the keypoint canonical orientation)

Lowe’s original algorithm:

Given a grey-scale image:
- Build a Gaussian-blurred image pyramid
- Subtract adjacent levels to obtain a Difference of Gaussians (DoG) pyramid (thus approximating the Laplacian of Gaussians)
- Take local extrema of the DoG filters at different scales as keypoints
- Compute the keypoint dominant orientation

For each keypoint:
- Evaluate local gradients in a neighbourhood of the keypoint, with orientations relative to the keypoint orientation, and normalize
- Build a descriptor as a feature vector with the salient keypoint information

• The motivation for using the DoG is that, while the scale-normalized Laplacian of Gaussian $\sigma^2 \nabla^2 G(x, y, \sigma)$ provides strong responses to dark blobs of size $\sqrt{2}\sigma$ and is good at capturing scale invariance, calculating the Laplacian is costly. So an approximation can be used. The Gaussian satisfies the heat diffusion equation (up to a multiplicative constant ½; Koenderink ’92 for luminance scale space):

$\frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G$

Approximating the derivative on the left by a finite difference between the nearby scales kσ and σ gives

$G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\, \sigma^2 \nabla^2 G(x, y, \sigma)$

where $\sigma^2 \nabla^2 G(x, y, \sigma)$ is the scale-normalized Laplacian.
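The relation between the DoG and the scale-normalized Laplacian, DoG ≈ (k − 1) σ² ∇²G, can be checked numerically. The sketch below is only an illustration: the function names and the values of σ and k are assumptions, and the comparison is made at the kernel centre, where both quantities are far from their zero-crossing.

```python
import math

def gaussian(x, y, sigma):
    """2D isotropic Gaussian G(x, y, sigma)."""
    return math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)

def normalized_log(x, y, sigma):
    """Scale-normalized Laplacian sigma^2 * (G_xx + G_yy), in closed form."""
    r2 = x * x + y * y
    return (r2 - 2 * sigma ** 2) / sigma ** 2 * gaussian(x, y, sigma)

def dog(x, y, sigma, k):
    """Difference of Gaussians at scales k*sigma and sigma."""
    return gaussian(x, y, k * sigma) - gaussian(x, y, sigma)

def relative_error(sigma, k):
    """Relative gap between DoG and (k - 1) * normalized LoG at the centre."""
    approx = (k - 1) * normalized_log(0.0, 0.0, sigma)
    return abs(dog(0.0, 0.0, sigma, k) - approx) / abs(approx)

for k in (1.1, 1.01, 1.001):
    print(k, relative_error(2.0, k))
```

The gap shrinks as k approaches 1, which is what a finite-difference approximation of ∂G/∂σ would suggest; for larger k the two kernels differ more in magnitude.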

• SIFT descriptors are obtained in the following three steps:
  1. Keypoint detection using local extrema of DoG filters
  2. Computation of keypoint orientation
  3. SIFT descriptor derivation

Build Gaussian pyramids

Keypoints are detected as local scale-space extrema of the Differences of Gaussians. They correspond to local min/max points in image I(x,y) that remain stable at different scales σ.

[Figure: pyramid construction, with Blur and Resample steps]

Pyramid construction process:
- Blur: σ is doubled from the bottom to the top of each pyramid
- Resample: pyramid images are sub-sampled from scale to scale
- Subtract: adjacent levels of pyramid images are subtracted

Building pyramids in detail

• A first pyramid is obtained by convolving the image with Gaussians at different σ, such that $\sigma_n = k^n \sigma_0$

• The images L(x,y,σ) are grouped into a first octave

• The DoG at a scale σ is obtained as the difference of two nearby scales separated by a constant factor k

• After the first octave is completed, the image with σ = 2σ0 is subsampled by a factor of 2 and the next pyramid is obtained in the same way

• The procedure is iterated for the next levels

$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$

$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$

$\sigma_n = k^n \sigma_0$

$\sigma_0 = k^0 \sigma, \quad \sigma_1 = k^1 \sigma, \quad \sigma_2 = k^2 \sigma, \quad \sigma_3 = k^3 \sigma, \quad \sigma_4 = k^4 \sigma$

• Octave: the original image is convolved with a set of Gaussians, so as to obtain a set of images that differ by a factor k in scale space; each of these sets is usually called an octave.

[Figure: one octave, with levels k⁰, k, k², k³, k⁴ from bottom to top]

• Each octave is divided into a number of intervals s, such that $k = 2^{1/s}$.

• For each octave, s + 3 images must be calculated. For example, if s = 2 then $k = 2^{1/2}$ and we will have 5 images at different scales. In this case an octave corresponds to doubling the value of σ:

$\sigma_0 = (2^{1/2})^0 \sigma = \sigma$
$\sigma_1 = (2^{1/2})^1 \sigma = k\sigma$
$\sigma_2 = (2^{1/2})^2 \sigma = 2\sigma$ (σ2 is doubled with respect to σ0)
$\sigma_3 = (2^{1/2})^3 \sigma = 2k\sigma$
$\sigma_4 = (2^{1/2})^4 \sigma = 4\sigma$

The choice of s = 2 is based on empirical verification of keypoint stability.
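The ladder of scales inside one octave follows directly from k = 2^(1/s). As a minimal sketch (the function name and the base scale `sigma0=1.6` are assumptions made for illustration, not values given in these notes):

```python
def octave_sigmas(sigma0, s):
    """Scales of the s + 3 blurred images in one octave, with k = 2 ** (1 / s)."""
    k = 2.0 ** (1.0 / s)
    return [sigma0 * k ** n for n in range(s + 3)]

sigmas = octave_sigmas(sigma0=1.6, s=2)
for n, sigma in enumerate(sigmas):
    print(n, round(sigma, 4))
```

With s = 2 the list has five entries, and the entry at index s has twice the base scale; that image is the one subsampled to start the next octave.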

• Gaussian kernel size: the number of samples N along each side of the kernel increases as σ increases. For an N×N kernel, the operations needed per pixel are N² products and (N² − 1) sums, so the cost grows as σ grows. A good compromise is to sample the kernel over the interval [−3σ, 3σ].

• Computational savings can be obtained by considering that the 2D Gaussian kernel is separable into two one-dimensional convolutions, requiring 2N products and (2N − 2) sums per pixel. This reduces the computational complexity per pixel from O(N²) to O(N).

• Moreover, the convolution of two Gaussians with variances σ1² and σ2² is a Gaussian with variance $\sigma_3^2 = \sigma_1^2 + \sigma_2^2$. This property can be exploited to build the scale space, so as to reuse convolutions already calculated.
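The variance-addition property can be checked on sampled 1D kernels (the 2D case follows by separability). The sketch below uses hypothetical helper names and kernels truncated at 3σ, so the measured variances are slightly below the nominal σ², but they still add under convolution.

```python
import math

def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian sampled on [-3*sigma, 3*sigma]."""
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma ** 2)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_1d(a, b):
    """Full discrete convolution of two sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def variance(kernel):
    """Variance of a normalized, symmetric kernel about its centre."""
    centre = (len(kernel) - 1) / 2
    return sum(w * (i - centre) ** 2 for i, w in enumerate(kernel))

def incremental_sigma(sigma_from, sigma_to):
    """Extra blur that takes an image already smoothed at sigma_from to sigma_to."""
    return math.sqrt(sigma_to ** 2 - sigma_from ** 2)

g1 = gaussian_kernel_1d(3.0)
g2 = gaussian_kernel_1d(4.0)
g3 = convolve_1d(g1, g2)
print(variance(g1) + variance(g2), variance(g3))  # equal up to rounding
print(incremental_sigma(3.0, 5.0))
```

In a pyramid this means each level can be obtained from the previous one with a small incremental blur rather than by blurring the original image with a large kernel.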

Example

Detect maxima and minima of DoGs in scale space

• Local extrema of D(x,y,σ) are the local interest points. To detect the interest points, at each scale level of the DoG pyramid every pixel p is compared to its 8 neighbours:
  – if p is a local extremum (local minimum or maximum) it is selected as a candidate keypoint
  – each candidate keypoint is compared to the 9 neighbours in the scale above and the 9 neighbours in the scale below

Only pixels that are local extrema across 3 adjacent levels are promoted to keypoints.

Keypoint stability
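As a minimal sketch of the candidate selection step, a pixel is compared with its 26 neighbours across three adjacent DoG levels (8 in its own level, 9 above, 9 below). The DoG stack is assumed to be a nested list indexed as `dog[level][row][col]`; the function names and the toy data are illustrative only.

```python
def is_scale_space_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly larger or strictly smaller than all
    26 neighbours in the 3x3x3 block around it (levels s-1, s, s+1)."""
    centre = dog[s][y][x]
    neighbours = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return centre > max(neighbours) or centre < min(neighbours)

def find_candidates(dog):
    """Scan interior pixels of interior levels; return (level, row, col) triples."""
    found = []
    for s in range(1, len(dog) - 1):
        for y in range(1, len(dog[s]) - 1):
            for x in range(1, len(dog[s][y]) - 1):
                if is_scale_space_extremum(dog, s, y, x):
                    found.append((s, y, x))
    return found

# Toy DoG stack: 3 levels of 5x5 zeros with one peak and one dip in the middle level.
dog = [[[0.0] * 5 for _ in range(5)] for _ in range(3)]
dog[1][1][1] = 1.0
dog[1][3][3] = -1.0
print(find_candidates(dog))
```

The candidates found this way sit on the sampling grid, which is why the refinement and filtering steps discussed next are needed.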

• The many points extracted from the maxima and minima of the DoGs have pixel accuracy at best and may correspond to low-contrast, and therefore unreliable, points.

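• To improve keypoint stability, a function is fitted to the local sample points in order to determine the interpolated position. Since points are defined in 3D (x, y, σ), it is a 3D curve-fitting problem. The interpolation is done using the quadratic Taylor expansion of the Difference-of-Gaussian scale-space function, with the candidate keypoint as the origin:

$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^T \mathbf{x} + \frac{1}{2} \mathbf{x}^T \frac{\partial^2 D}{\partial \mathbf{x}^2} \mathbf{x}$

where D and its derivatives are evaluated at the candidate keypoint k = (x, y, σ) and $\mathbf{x} = (x, y, \sigma)^T$ is the offset from this point.

• The location of the extremum $\hat{\mathbf{x}}$ is determined by taking the derivative of this function with respect to $\mathbf{x}$ and setting it to zero:

$\frac{\partial D}{\partial \mathbf{x}} + \frac{\partial^2 D}{\partial \mathbf{x}^2} \hat{\mathbf{x}} = 0$

that is

$\hat{\mathbf{x}} = - \left( \frac{\partial^2 D}{\partial \mathbf{x}^2} \right)^{-1} \frac{\partial D}{\partial \mathbf{x}}$

where $\partial D / \partial \mathbf{x}$ is the gradient and $\partial^2 D / \partial \mathbf{x}^2$ is the 3×3 Hessian of D at the candidate keypoint.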

– If the offset is larger than 0.5 in any dimension, this indicates that the extremum lies closer to another candidate keypoint. In this case, the candidate keypoint is changed and the interpolation is performed about that point instead.

– Otherwise the offset is added to the candidate keypoint to get the interpolated estimate of the location of the extremum.
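The refinement step can be sketched as follows: the gradient and Hessian of D are estimated by central differences on a DoG stack indexed as `D[level][row][col]`, and the offset of the fitted quadratic's extremum is the solution of the 3×3 system H·offset = −gradient. The helper names and the synthetic quadratic used as test data are assumptions made for illustration.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(A, b):
    """Solve A x = b for a 3x3 system with Cramer's rule."""
    d = det3(A)
    return [det3([[b[r] if c == col else A[r][c] for c in range(3)]
                  for r in range(3)]) / d for col in range(3)]

def refine_offset(D, s, y, x):
    """Return ([dx, dy, ds], interpolated value) for the sample at (s, y, x)."""
    c = D[s][y][x]
    # Gradient by central differences.
    g = [(D[s][y][x + 1] - D[s][y][x - 1]) / 2,
         (D[s][y + 1][x] - D[s][y - 1][x]) / 2,
         (D[s + 1][y][x] - D[s - 1][y][x]) / 2]
    # Hessian by central differences.
    hxx = D[s][y][x + 1] - 2 * c + D[s][y][x - 1]
    hyy = D[s][y + 1][x] - 2 * c + D[s][y - 1][x]
    hss = D[s + 1][y][x] - 2 * c + D[s - 1][y][x]
    hxy = (D[s][y + 1][x + 1] - D[s][y + 1][x - 1]
           - D[s][y - 1][x + 1] + D[s][y - 1][x - 1]) / 4
    hxs = (D[s + 1][y][x + 1] - D[s + 1][y][x - 1]
           - D[s - 1][y][x + 1] + D[s - 1][y][x - 1]) / 4
    hys = (D[s + 1][y + 1][x] - D[s + 1][y - 1][x]
           - D[s - 1][y + 1][x] + D[s - 1][y - 1][x]) / 4
    H = [[hxx, hxy, hxs], [hxy, hyy, hys], [hxs, hys, hss]]
    offset = solve3(H, [-v for v in g])
    # Value of the fitted quadratic at its extremum: D + 1/2 * gradient . offset
    value = c + 0.5 * sum(gi * oi for gi, oi in zip(g, offset))
    return offset, value

# Synthetic quadratic with its maximum at x = 1.2, y = 0.9, s = 1.1.
def f(x, y, s):
    return 5.0 - (x - 1.2) ** 2 - 2.0 * (y - 0.9) ** 2 - 0.5 * (s - 1.1) ** 2

D = [[[f(x, y, s) for x in range(3)] for y in range(3)] for s in range(3)]
offset, value = refine_offset(D, 1, 1, 1)
keep = all(abs(o) <= 0.5 for o in offset)   # otherwise re-fit about a neighbouring sample
print(offset, value, keep)
```

The interpolated value returned here is the quantity that a contrast test would threshold, since it estimates D at the refined location rather than at the grid sample.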

• Low-contrast keypoints are generally less reliable than high-contrast ones, and keypoints that respond to edges are unstable. Filtering can be performed respectively by:
  • thresholding on simple contrast
  • thresholding based on principal curvature

• The local contrast can be directly obtained from D(x,y,σ) calculated at the location of the keypoint as updated in the previous step. Unstable extrema with low contrast can be discarded according to Lowe’s rule: $|D(\hat{\mathbf{x}})| < 0.03$.

• The DoG function has strong responses along edges. To eliminate the keypoints that have poorly determined locations but high edge responses, note that for poorly defined peaks in the DoG function the principal curvature across the edge is much larger than the principal curvature along it.

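• Finding these principal curvatures amounts to solving for the eigenvalues of the second-order Hessian matrix H of D(x,y,σ). The eigenvalues of H are proportional to the principal curvatures of D(x,y,σ):

$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$

with the derivatives calculated from differences of adjacent DoG pixels.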

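• The ratio of the two eigenvalues is sufficient for this purpose. If r is the ratio between the highest eigenvalue α and the lowest eigenvalue β (α = rβ), then:

$R = \frac{Tr(H)^2}{Det(H)} = \frac{(\alpha + \beta)^2}{\alpha \beta} = \frac{(r\beta + \beta)^2}{r \beta^2} = \frac{(r + 1)^2}{r}$

R = (r+1)² / r depends only on the ratio of the two eigenvalues; it is minimum when the two eigenvalues have the same value and increases as r increases.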
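A minimal sketch of the resulting edge test on a single DoG level indexed as `D[row][col]` is given below. The trace and determinant of the 2×2 Hessian are used so that the eigenvalues never need to be computed. The function name is hypothetical, and the default `r_threshold=10.0` is the value used in Lowe's paper rather than something stated in these notes.

```python
def passes_edge_test(D, y, x, r_threshold=10.0):
    """Keep a keypoint only if Tr(H)^2 / Det(H) < (r + 1)^2 / r for r = r_threshold."""
    dxx = D[y][x + 1] - 2 * D[y][x] + D[y][x - 1]
    dyy = D[y + 1][x] - 2 * D[y][x] + D[y - 1][x]
    dxy = (D[y + 1][x + 1] - D[y + 1][x - 1]
           - D[y - 1][x + 1] + D[y - 1][x - 1]) / 4
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign (or a flat direction): not a well-defined peak.
        return False
    return trace * trace / det < (r_threshold + 1) ** 2 / r_threshold

# Blob-like peak: equal curvature in both directions.
blob = [[-((x - 1) ** 2 + (y - 1) ** 2) for x in range(3)] for y in range(3)]
# Edge-like ridge: strong curvature across x, almost none along y.
edge = [[-((x - 1) ** 2) - 0.01 * (y - 1) ** 2 for x in range(3)] for y in range(3)]

print(passes_edge_test(blob, 1, 1), passes_edge_test(edge, 1, 1))
```

For the blob the ratio is 4, the minimum possible value, while for the ridge it is about 100, well above the bound of 12.1 that corresponds to r = 10.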

In order to have the ratio between the two principal curvatures below a threshold $r_{th}$, it must be that

$R < \frac{(r_{th} + 1)^2}{r_{th}}$

If R is higher than this bound, the keypoint is poorly localized and hence rejected.

[Figure: maxima in D; the same keypoints after removing low-contrast and edge responses]

• Experimental evaluation of detectors w.r.t. scale change

Repeatability rate = # correspondences / # possible correspondences (points present)

• The common drawback of both the LoG and DoG representations is that local maxima can also be detected in the neighborhood of contours or straight edges, where the signal changes in only one direction.

• These maxima are less stable because their localization is more sensitive to noise or small changes in neighboring texture.

Orientation assignment

• For a keypoint, let L be the Gaussian-smoothed image with the closest scale. For a region around the keypoint, compute gradient magnitude and orientation using finite differences:

$GradientVector = \begin{bmatrix} L(x+1, y) - L(x-1, y) \\ L(x, y+1) - L(x, y-1) \end{bmatrix}$

For such a region:
- create a histogram with 36 bins for orientation
- weight each point with a Gaussian window of 1.5σ

• The peak orientation is the keypoint canonical orientation.
• Any local peak within 80% of the highest peak is used to create a keypoint with that orientation, so one location can produce multiple keypoints with different orientations. About 15% of keypoints have multiple orientations.
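The orientation assignment described above can be sketched in a few lines. The image `L` is assumed to be a nested list indexed as `L[row][col]`, each sample votes with its gradient magnitude weighted by a Gaussian window of 1.5σ, and peaks are reported as bin-centre angles in degrees. The function names, the `radius` parameter and the ramp image are illustrative assumptions; peak interpolation is left out.

```python
import math

def orientation_histogram(L, cy, cx, sigma, radius, bins=36):
    """Gradient-orientation histogram around (cy, cx), weighted by magnitude
    and by a Gaussian window of standard deviation 1.5 * sigma."""
    hist = [0.0] * bins
    window = 1.5 * sigma
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            if not (1 <= y < len(L) - 1 and 1 <= x < len(L[0]) - 1):
                continue  # finite differences need both neighbours
            dx = L[y][x + 1] - L[y][x - 1]
            dy = L[y + 1][x] - L[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 360.0
            weight = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * window ** 2))
            hist[int(angle / (360.0 / bins)) % bins] += weight * magnitude
    return hist

def dominant_orientations(hist, ratio=0.8):
    """Bin-centre angles of local peaks within `ratio` of the highest peak."""
    bins = len(hist)
    top = max(hist)
    peaks = []
    for i, h in enumerate(hist):
        left, right = hist[(i - 1) % bins], hist[(i + 1) % bins]
        if h > 0 and h >= ratio * top and h > left and h > right:
            peaks.append((i + 0.5) * 360.0 / bins)
    return peaks

# Horizontal intensity ramp: every gradient points along +x (angle 0).
ramp = [[float(x) for x in range(9)] for _ in range(9)]
hist = orientation_histogram(ramp, cy=4, cx=4, sigma=1.0, radius=3)
print(dominant_orientations(hist))
```

Each returned angle would give rise to its own keypoint at the same location and scale, which is how a single location can end up with multiple orientations.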

Once the local orientation and scale of a keypoint have been estimated, a scaled and oriented patch around the detected point can be extracted and used to form a feature descriptor.