SIFT, GLOH, SURF descriptors
Dipartimento di Sistemi e Informatica
Invariant local descriptor: Useful for…
• Object Recognition and Tracking.
• Robot Localization and Mapping.
• Image Registration and Stitching.
• Image Retrieval.
• Augmented Reality (http://blogs.oregonstate.edu/hess/sift-library-places-2nd-in-acm-mm-10-ossc/)
[Figure: template matched against a video stream]
Scale invariant detectors
• In most object recognition applications the scale of the object in the image is unknown. Instead of extracting features at many different scales and then matching all of them, it is more efficient to design a function on the region that is the same for corresponding regions, even if they are at different scales.
• The problem can also be stated as follows: given two images of the same scene with a large scale difference between them, find the same interest points independently in each image.
• For scale invariant feature extraction it is necessary to detect structures that can be reliably extracted under scale changes.
• This is done by evaluating a signature function (a kernel) in the point neighbourhood and plotting the result as a function of the neighbourhood scale. Since it measures properties of the local neighbourhood at a certain scale, it should take a similar qualitative shape if two keypoints are centered on corresponding image structures.
• The function shape should be squashed or expanded as a result of the scaling factor. Corresponding neighbourhood sizes should be detected by searching for extrema of the signature function in both images.
For a given point, the signature function f can be considered as a function of region size (circle radius). A common approach is to take a local maximum of this function. The solution is to search for maxima of suitable functions in scale and in space over the images.
[Figure: f plotted against region size for Image 1 and for Image 2 (scale = 1/2)]
The region size (scale) for which the maximum is achieved should be invariant to image scale, in the sense that the regions selected in the two images cover the same structure.
[Figure: the maxima of f occur at region size/scale s1 in Image 1 and s2 in Image 2 (scale = 1/2)]
• A “good” function for scale detection has one stable sharp peak
[Figure: three plots of f against region size, labelled good, bad, bad]
• For usual images a good function is one that responds to contrast (sharp local intensity change). It is easier to look for zero-crossings of the 2nd derivative than for maxima.
• There are a few approaches which are truly invariant to significant scale changes. Typically, such techniques assume that the scale change is the same in every direction, although they exhibit some robustness to weak affine deformations.
• The appropriate kernel for this is the scale-normalized Gaussian kernel G(x, σ) and its derivatives.
• The classical approach is to generate a Gaussian scale-space representation of an image, i.e. a set of images obtained by convolution with an isotropic (circular) Gaussian kernel of various sizes. A larger scale results in a smoother image.
• Existing methods search for local extrema in the 3D Gaussian scale-space representation of an image (x, y and scale). Local extrema over scale of normalized derivatives indicate the presence of characteristic local structures.
• The motivation for generating a scale-space representation of a given image originates from the basic observation that real-world objects are composed of different structures at different scales. This implies that real-world objects may appear in different ways depending on the scale of observation.
• The Gaussian scale-space guarantees that no new structures are created when going from a fine scale to any coarser scale. Its properties include linearity, shift invariance, non-enhancement of local extrema, scale invariance and rotational invariance.
Functions for determining scale
$f = \text{Kernel} * \text{Image}$
Kernels:
$L = \sigma^2 \left( G_{xx}(x, y, \sigma) + G_{yy}(x, y, \sigma) \right)$ (Laplacian of Gaussians)
$DoG = G(x, y, k\sigma) - G(x, y, \sigma)$ (Difference of Gaussians, an approximation of the Laplacian)
where the Gaussian is
$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$
Both kernels are invariant to scale and rotation.
The method:
- build scale-space pyramids;
- examine all scales to identify scale-invariant features: compute the Difference of Gaussian (DoG) pyramid or the Laplacian of Gaussians (LoG);
- detect maxima and minima in scale space.
• Harris-Laplacian [1]. Find local maximum of:
– Harris corner detector in space (image coordinates)
– Laplacian in scale
[Figure: scale-space volume with axes x, y and scale; Harris response along x and y, Laplacian along scale]
• SIFT (Lowe) [2]. Find local maximum of:
– Difference of Gaussians in space and scale
[Figure: scale-space volume with axes x, y and scale; DoG along x, y and scale]
[1] K. Mikolajczyk, C. Schmid. “Indexing Based on Scale Invariant Interest Points”. ICCV 2001
[2] D. Lowe. “Distinctive Image Features from Scale-Invariant Keypoints”. Accepted to IJCV 2004
Harris-Laplacian scale-invariant detector
• The Harris-Laplacian method first applies the Harris function at multiple scales, then selects the points for which the Laplacian attains a maximum over scales.
• Harris corner points are interest points that have good rotational and illumination invariance, but they are not scale invariant. To achieve scale invariance, the second-moment matrix is adapted to a Gaussian scale-space representation, and a Laplacian-of-Gaussian kernel is used to select the scale.
• Since the computation of derivatives usually involves a stage of scale-space smoothing, an operational definition of the Harris operator requires two scale parameters:
– (i) a local derivation scale for smoothing before the computation of derivatives
– (ii) an integration scale for accumulating the operations on derivatives
$M = \sigma_D^2 \, g(\sigma_I) * \begin{bmatrix} L_x^2(x, y, \sigma_D) & L_x L_y(x, y, \sigma_D) \\ L_x L_y(x, y, \sigma_D) & L_y^2(x, y, \sigma_D) \end{bmatrix}$
• where $g(\sigma_I)$ is the Gaussian kernel of scale $\sigma_I$ (integration scale), L(x,y) is the Gaussian-smoothed image, and $L_x$ and $L_y$ are its derivatives in the x and y directions, calculated using a Gaussian kernel of scale $\sigma_D$ (differentiation scale). The multiplication by $\sigma_D^2$ is needed because derivatives must be normalized across scales according to $D_m(\mathbf{x}, \sigma) = \sigma^m L_m(\mathbf{x}, \sigma)$.
• The algorithm searches across multiple scales $\sigma_n = k^n \sigma_0$, i.e. $\sigma_0, k\sigma_0, k^2\sigma_0, \dots, k^n\sigma_0$ (k = 1.4), setting $\sigma_I = \sigma_n$ and $\sigma_D = s\,\sigma_I$ (s = 0.7).
• At each scale, corners are found as with the Harris method applied to the matrix M, as local maxima in an 8-point neighbourhood. An iterative algorithm localizes corner points spatially and chooses the characteristic scale:
– The Laplacian of Gaussians is used to judge whether each of the candidate points found on different levels forms a maximum in the scale direction (checked against levels n-1 and n+1). The scale at which such a maximum is found is referred to as the characteristic scale. It is used in subsequent iterations. Points are spatially localized at the characteristic scale.
• Mikolajczyk and Schmid (2001) demonstrated that the LoG measure attains the highest percentage of correctly detected corner points in comparison to other scale-selection measures.
• At each iteration the corner point $x_{k+1}$ is selected that maximizes the LoG within the scale neighbourhood. The process terminates when $x_{k+1} = x_k$.
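The role of the Laplacian in scale selection can be illustrated with a small sketch. It is not the Harris-Laplacian implementation of Mikolajczyk and Schmid: it assumes a synthetic bright disk as the image structure, evaluates the scale-normalized LoG only at the disk centre, and keeps the σ with the strongest response. The function names, the σ grid and the test image are illustrative assumptions.

```python
import numpy as np

def normalized_log_kernel(size, sigma):
    """Scale-normalized Laplacian of Gaussian, sigma^2 * (G_xx + G_yy), on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    rho2 = x**2 + y**2
    gauss = np.exp(-rho2 / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    laplacian = (rho2 - 2.0 * sigma**2) / sigma**4 * gauss
    return sigma**2 * laplacian

def characteristic_scale(image, sigmas):
    """Return the sigma whose normalized LoG response at the image centre is strongest.

    Assumes a square image with odd side length, so that the kernel is centred on it.
    """
    size = image.shape[0]
    responses = [abs(np.sum(normalized_log_kernel(size, s) * image)) for s in sigmas]
    return sigmas[int(np.argmax(responses))], responses

# Synthetic structure (assumption): a bright disk of radius 8 centred in a 65 x 65 image.
radius = 8.0
half = 32
yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
disk = (xx**2 + yy**2 <= radius**2).astype(float)

sigmas = np.arange(2.0, 10.0, 0.1)
best_sigma, responses = characteristic_scale(disk, sigmas)
print(f"characteristic scale: {best_sigma:.1f}")
```

For an ideal disk of radius r the normalized LoG response at the centre is strongest near σ = r/√2, so the printed value should land close to 5.7. Rescaling the disk rescales the selected σ by the same factor, which is the property the detector relies on.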
[Figure: multi-scale Harris points]
[Figure: selection of points at the characteristic scale with the Laplacian]
[Figure: invariant points + associated regions (Mikolajczyk & Schmid ’01)]
SIFT: Scale Invariant Feature Transform
• The SIFT method was introduced by D. Lowe in 2004 to represent visual entities according to their local properties. The method employs local features taken at salient points (referred to as keypoints or SIFT points). Keypoints (through their SIFT descriptors) are used to characterize shapes with invariant properties.
• Image points selected as keypoints and their SIFT descriptors are robust under:
- Luminance change (due to difference-based metrics)
- Scale change (due to scale-space)
- Rotation (due to local orientations expressed with respect to the keypoint canonical orientation)
Lowe’s original algorithm:
Given a grey-scale image:
- Build a Gaussian-blurred image pyramid
- Subtract adjacent levels to obtain a Difference of Gaussians (DoG) pyramid (so approximating the Laplacian of Gaussians)
- Take local extrema of the DoG filters at different scales as keypoints
- Compute the keypoint dominant orientation
For each keypoint:
- Evaluate local gradients in a neighbourhood of the keypoint, with orientations relative to the keypoint orientation, and normalize
- Build a descriptor as a feature vector with the salient keypoint information
• The motivation for using the DoG is that, while the scale-normalized Laplacian of Gaussian $\sigma^2 \nabla^2 G(x, y, \sigma)$ provides strong responses to dark blobs of size $\sqrt{2}\sigma$ and is good at capturing scale invariance, computing the Laplacian is costly. So an approximation can be used, which follows from the heat diffusion equation, up to a ½ multiplicative constant (Koenderink ’92 for luminance scale space).
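The approximation can be written out as a short derivation. This is a sketch following the argument in Lowe's IJCV paper, with the heat diffusion equation parameterized by σ; the finite-difference step between the scales σ and kσ is the only approximation made.

```latex
\frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G
\quad\Longrightarrow\quad
\sigma \nabla^2 G \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma}
\quad\Longrightarrow\quad
G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\, \sigma^2 \nabla^2 G
```

Since k is the same at every scale, the factor (k − 1) is constant and does not change where the extrema are located.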
• SIFT descriptors are obtained in the following three steps:
1. Keypoint detection using local extrema of DoG filters
2. Computation of keypoint orientation
3. SIFT descriptor derivation
Build Gaussian pyramids
Keypoints are detected as local scale-space extrema of the Differences of Gaussians. They correspond to local min/max points in image I(x,y) that remain stable at different scales σ.
[Figure: pyramid construction process, with Blur, Resample and Subtract stages]
Pyramid construction process:
- Blur: σ is doubled from the bottom to the top of each pyramid
- Resample: pyramid images are sub-sampled from scale to scale
- Subtract: adjacent levels of pyramid images are subtracted
Building pyramids in detail
• A first pyramid is obtained by convolution at different σ such that $\sigma_n = k^n \sigma_0$
• The images L(x,y,σ) are grouped into a first octave
• The DoG at a scale σ is obtained by the difference of two nearby scales separated by a constant factor k
• After the first octave is completed, the image with $\sigma = 2\sigma_0$ is subsampled by a factor of 2 and the next pyramid is obtained in the same way
• The procedure is iterated for the next levels
$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$
$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$
$\sigma_n = k^n \sigma_0$
• Octave: the original image is convolved with a set of Gaussians, so as to obtain a set of images that differ by a factor k in scale space; each of these sets is usually called an octave:
$\sigma_0 = k^0 \sigma,\quad \sigma_1 = k^1 \sigma,\quad \sigma_2 = k^2 \sigma,\quad \sigma_3 = k^3 \sigma,\quad \sigma_4 = k^4 \sigma$
• Each octave is divided into a number of intervals s such that $k = 2^{1/s}$.
• For each octave s + 3 images must be calculated. For example, if s = 2 then $k = 2^{1/2}$ and we have 5 images at different scales. An octave corresponds to doubling the value of σ:
$\sigma_0 = (2^{1/2})^0 \sigma = \sigma$
$\sigma_1 = (2^{1/2})^1 \sigma = k\sigma$
$\sigma_2 = (2^{1/2})^2 \sigma = 2\sigma$ ($\sigma_2$ is doubled with respect to $\sigma_0$)
$\sigma_3 = (2^{1/2})^3 \sigma = 2k\sigma$
$\sigma_4 = (2^{1/2})^4 \sigma = 4\sigma$
The choice of s = 2 is based on empirical verification of keypoint stability.
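The scale schedule of one octave can be checked numerically. The sketch below only evaluates $\sigma_n = k^n \sigma_0$ with $k = 2^{1/s}$ for the s + 3 images of an octave; the function name and the base scale `sigma0 = 1.0` are illustrative assumptions, not values taken from the slides.

```python
def octave_sigmas(sigma0, s):
    """Scales of the s + 3 Gaussian images in one octave, with k = 2 ** (1 / s)."""
    k = 2.0 ** (1.0 / s)
    return [sigma0 * k**n for n in range(s + 3)]

# Assumed base scale sigma0 = 1.0, chosen only to make the factors easy to read.
sigmas = octave_sigmas(sigma0=1.0, s=2)
for n, sigma in enumerate(sigmas):
    print(f"sigma_{n} = {sigma:.4f}")
```

With s = 2 the list has five entries, and the entry at index s equals 2σ0 up to rounding: that is the image that is subsampled to start the next octave.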
• Gaussian kernel size: the number of samples N along each side of the kernel increases as σ increases. A direct 2D convolution needs $N^2 - 1$ sums and $N^2$ products per pixel, so the cost grows as σ grows. A good compromise is to sample the kernel on the interval [-3σ, 3σ].
• Computational savings can be obtained by considering that the 2D Gaussian kernel is separable: the 2D convolution can be replaced by two one-dimensional convolutions, requiring 2N products and 2N – 2 sums per pixel. This reduces the computational complexity per pixel from $O(N^2)$ to O(N).
• Moreover, the convolution of two Gaussians with variances $\sigma_1^2$ and $\sigma_2^2$ is a Gaussian with variance $\sigma_3^2 = \sigma_1^2 + \sigma_2^2$. This property can be exploited to build the scale space incrementally, reusing convolutions already calculated.
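Both properties can be tried on a small example. The sketch below is a minimal NumPy illustration, not an optimized implementation: it blurs with two 1D passes and compares blurring in two steps ($\sigma_1$ then $\sigma_2$) with a single blur at $\sigma_3 = \sqrt{\sigma_1^2 + \sigma_2^2}$. The helper names, the random test image, the zero-padded borders and the chosen σ values are assumptions made for the illustration.

```python
import numpy as np

def gaussian_1d(sigma):
    """Normalized 1D Gaussian sampled on [-3*sigma, 3*sigma]."""
    radius = int(np.ceil(3.0 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-x**2 / (2.0 * sigma**2))
    return kernel / kernel.sum()

def blur_separable(image, sigma):
    """Blur with two 1D passes (rows, then columns) instead of one 2D convolution."""
    k = gaussian_1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

rng = np.random.default_rng(0)
image = rng.random((64, 64))

sigma1, sigma2 = 1.6, 1.2
sigma3 = np.sqrt(sigma1**2 + sigma2**2)   # 2.0 for these values

incremental = blur_separable(blur_separable(image, sigma1), sigma2)
direct = blur_separable(image, sigma3)

# Compare away from the borders, where the zero padding of np.convolve matters.
max_diff = np.abs(incremental[12:-12, 12:-12] - direct[12:-12, 12:-12]).max()
print(f"largest interior difference: {max_diff:.2e}")
```

The two results agree only approximately, because the kernels are sampled and truncated at 3σ. In a pyramid this means each level can be obtained from the previous one with a small additional blur instead of a large blur of the original image.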
Example
Detect maxima and minima of DoGs in scale space
• Local extrema of D(x,y,σ) are the local interest points. To detect the interest points, at each level of the DoG pyramid every pixel p is compared to its 8 neighbours:
– if p is a local extremum (local minimum or maximum) it is selected as a candidate keypoint
– each candidate keypoint is then compared to the 9 neighbours in the scale above and the 9 neighbours in the scale below
Only pixels that are local extrema across the 3 adjacent levels are promoted to keypoints.
Keypoint stability
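As a starting point for the stability checks, the candidate selection (a pixel compared with its 8 in-level neighbours and the 9 neighbours in each adjacent level, 26 in total) can be sketched as follows. The array layout (level, row, column), the function names and the use of strict inequalities are assumptions of this sketch; the toy DoG stack is synthetic.

```python
import numpy as np

def is_scale_space_extremum(dog, level, row, col):
    """True if dog[level, row, col] is strictly above or strictly below all 26 neighbours.

    dog is a 3D array indexed as (scale level, row, column); the point must not lie
    on the border of the array in any of the three dimensions.
    """
    cube = dog[level - 1:level + 2, row - 1:row + 2, col - 1:col + 2]
    centre = cube[1, 1, 1]
    neighbours = np.delete(cube.ravel(), 13)   # index 13 is the centre of the 3x3x3 cube
    return bool(np.all(centre > neighbours) or np.all(centre < neighbours))

def find_candidates(dog):
    """Scan all interior points of a DoG stack and return (level, row, col) of the extrema."""
    levels, rows, cols = dog.shape
    return [(l, r, c)
            for l in range(1, levels - 1)
            for r in range(1, rows - 1)
            for c in range(1, cols - 1)
            if is_scale_space_extremum(dog, l, r, c)]

# Toy DoG stack: zeros with one positive and one negative spike in the middle level.
dog = np.zeros((3, 7, 7))
dog[1, 2, 2] = 1.0
dog[1, 4, 5] = -1.0
candidates = find_candidates(dog)
print(candidates)
```

Both spikes are reported, one as a maximum and one as a minimum. The triple loop is written for readability; a practical implementation would vectorize the comparison.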
• The many points extracted as maxima and minima of the DoGs have pixel accuracy at best, and some correspond to low-contrast and therefore unreliable points.
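• To improve keypoint stability, a function is fitted to the local sample points in order to determine the interpolated position. Since points are defined in 3D (x, y, σ), it is a 3D curve fitting problem. The interpolation is done using the quadratic Taylor expansion of the Difference-of-Gaussian scale-space function, with the candidate keypoint as the origin:
$D(\mathbf{x}) = D + \frac{\partial D}{\partial \mathbf{x}}^T \mathbf{x} + \frac{1}{2} \mathbf{x}^T \frac{\partial^2 D}{\partial \mathbf{x}^2} \mathbf{x}$
where D and its derivatives are evaluated at the candidate keypoint and $\mathbf{x} = (x, y, \sigma)^T$ is the offset from this point.
• The location of the extremum $\hat{\mathbf{x}}$ is determined by taking the derivative of this function with respect to $\mathbf{x}$ and setting it to zero:
$\frac{\partial D}{\partial \mathbf{x}} + \frac{\partial^2 D}{\partial \mathbf{x}^2} \hat{\mathbf{x}} = 0$
that is
$\hat{\mathbf{x}} = -\left( \frac{\partial^2 D}{\partial \mathbf{x}^2} \right)^{-1} \frac{\partial D}{\partial \mathbf{x}}$
where $\partial D / \partial \mathbf{x}$ is the gradient of D and $\partial^2 D / \partial \mathbf{x}^2$ is its 3×3 Hessian.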
– If the offset is larger than 0.5 in any dimension, this indicates that the extremum lies closer to another candidate keypoint. In this case the candidate keypoint is changed and the interpolation is performed about that point instead.
– Otherwise the offset is added to the candidate keypoint to get the interpolated estimate of the location of the extremum.
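The refinement step can be sketched with finite differences on the 3×3×3 neighbourhood of a candidate. This is a minimal illustration under stated assumptions, not Lowe's code: the DoG stack is indexed as (level, row, column), the offset is returned in (x, y, scale) order, the function name is hypothetical, and the test data is a synthetic quadratic with a known peak.

```python
import numpy as np

def refine_extremum(dog, level, row, col):
    """Fit the quadratic Taylor model around a candidate; return (offset, value at extremum).

    The offset is ordered (x, y, scale), i.e. (column, row, level), in units of samples.
    """
    d = dog.astype(float)
    l, r, c = level, row, col
    # First derivatives by central differences.
    grad = 0.5 * np.array([d[l, r, c + 1] - d[l, r, c - 1],
                           d[l, r + 1, c] - d[l, r - 1, c],
                           d[l + 1, r, c] - d[l - 1, r, c]])
    # Second derivatives by finite differences.
    dxx = d[l, r, c + 1] - 2 * d[l, r, c] + d[l, r, c - 1]
    dyy = d[l, r + 1, c] - 2 * d[l, r, c] + d[l, r - 1, c]
    dss = d[l + 1, r, c] - 2 * d[l, r, c] + d[l - 1, r, c]
    dxy = 0.25 * (d[l, r + 1, c + 1] - d[l, r + 1, c - 1]
                  - d[l, r - 1, c + 1] + d[l, r - 1, c - 1])
    dxs = 0.25 * (d[l + 1, r, c + 1] - d[l + 1, r, c - 1]
                  - d[l - 1, r, c + 1] + d[l - 1, r, c - 1])
    dys = 0.25 * (d[l + 1, r + 1, c] - d[l + 1, r - 1, c]
                  - d[l - 1, r + 1, c] + d[l - 1, r - 1, c])
    hessian = np.array([[dxx, dxy, dxs],
                        [dxy, dyy, dys],
                        [dxs, dys, dss]])
    offset = -np.linalg.solve(hessian, grad)      # x_hat = -H^{-1} * gradient
    value = d[l, r, c] + 0.5 * grad @ offset      # D(x_hat) = D + 1/2 * gradient^T x_hat
    return offset, value

# Synthetic quadratic with its peak at x = 3.2, y = 2.9, scale = 1.3 (peak value 1.0).
s, y, x = np.mgrid[0:3, 0:6, 0:6].astype(float)
quad = 1.0 - (x - 3.2)**2 - 2.0 * (y - 2.9)**2 - 0.5 * (s - 1.3)**2
offset, value = refine_extremum(quad, level=1, row=3, col=3)
print(offset, value)
```

On this quadratic the finite differences are exact, so the offset from the sample (x = 3, y = 3, scale = 1) is recovered as (0.2, -0.1, 0.3), within the 0.5 limit. The returned value is the interpolated D at the extremum, which is the quantity a contrast threshold would be applied to.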
• Low-contrast keypoints are generally less reliable than high-contrast ones, and keypoints that respond to edges are unstable. Filtering can be performed respectively by:
• thresholding on simple contrast
• thresholding based on principal curvature
• The local contrast can be obtained directly from D(x,y,σ) evaluated at the location of the keypoint as updated in the previous step. Unstable extrema with low contrast can be discarded according to Lowe’s rule: $|D(\hat{\mathbf{x}})| < 0.03$, where $\hat{\mathbf{x}}$ is the interpolated keypoint location and image values are assumed to lie in the range [0, 1].
• The DoG function has strong responses along edges. To eliminate keypoints that have poorly determined locations but high edge responses, note that for poorly defined peaks in the DoG function the principal curvature across the edge is much larger than the principal curvature along it.
• Finding these principal curvatures amounts to solving for the eigenvalues of the 2×2 Hessian matrix H of D(x,y,σ), whose entries are calculated from differences of adjacent DoG pixels. The eigenvalues of H are proportional to the principal curvatures of D(x,y,σ):
$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix}$
• The ratio of the two eigenvalues is sufficient for this purpose. If r is the ratio between the largest and the smallest eigenvalue of H, then:
$R = \frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} = \frac{(r+1)^2}{r}$
R depends only on the ratio of the two eigenvalues; it is minimum when the two eigenvalues are equal and increases as r increases.
To keep the ratio between the two principal curvatures below a threshold $r_{th}$, it must hold that
$\frac{\mathrm{Tr}(H)^2}{\mathrm{Det}(H)} < \frac{(r_{th}+1)^2}{r_{th}}$
If R is higher than this bound, the keypoint is poorly localized and hence rejected.
[Figure: maxima in D, before and after removing low-contrast points and edges]
• Experimental evaluation of detectors w.r.t. scale change
Repeatability rate = # correspondences / # possible correspondences (points present)
• A common drawback of both the LoG and DoG representations is that local maxima can also be detected in the neighborhood of contours or straight edges, where the signal changes in only one direction.
• These maxima are less stable because their localization is more sensitive to noise or small changes in neighboring texture.
Orientation assignment
• For a keypoint, let L be the Gaussian-smoothed image with the closest scale. For a region around the keypoint, compute the gradient magnitude and orientation using finite differences:
$\text{GradientVector} = \begin{bmatrix} L(x+1, y) - L(x-1, y) \\ L(x, y+1) - L(x, y-1) \end{bmatrix}$
For such a region:
- create a histogram with 36 bins for orientation
- weight each point with a Gaussian window of 1.5σ
• The peak orientation is the keypoint canonical orientation
• Any local peak within 80% of the highest peak is used to create a keypoint with that orientation, so one location can yield multiple orientations. About 15% of keypoints have multiple orientations
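The orientation assignment can be sketched as below. Two details are assumptions that go beyond the text above: each sample is also weighted by its gradient magnitude (as in Lowe's paper), and the window radius is set to three times the 1.5σ Gaussian width. The function names and the synthetic ramp image are illustrative.

```python
import numpy as np

def orientation_histogram(L, row, col, sigma, radius=None, bins=36):
    """36-bin gradient orientation histogram around (row, col) in the smoothed image L."""
    if radius is None:
        radius = int(round(3 * 1.5 * sigma))   # assumed window size
    hist = np.zeros(bins)
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = row + dr, col + dc
            if r < 1 or c < 1 or r >= L.shape[0] - 1 or c >= L.shape[1] - 1:
                continue   # central differences need a one-pixel margin
            gx = L[r, c + 1] - L[r, c - 1]
            gy = L[r + 1, c] - L[r - 1, c]
            magnitude = np.hypot(gx, gy)
            angle = np.degrees(np.arctan2(gy, gx)) % 360.0
            # Gaussian window of 1.5 * sigma centred on the keypoint.
            weight = np.exp(-(dr**2 + dc**2) / (2.0 * (1.5 * sigma)**2))
            hist[int(angle // (360.0 / bins)) % bins] += weight * magnitude
    return hist

def dominant_orientations(hist, ratio=0.8):
    """Bin indices of local peaks whose height is at least `ratio` of the highest peak."""
    bins = len(hist)
    peak = hist.max()
    return [i for i in range(bins)
            if hist[i] >= ratio * peak
            and hist[i] > hist[(i - 1) % bins]
            and hist[i] > hist[(i + 1) % bins]]

# Horizontal intensity ramp: every gradient points along +x, i.e. orientation 0 degrees.
L = np.tile(np.arange(32, dtype=float), (32, 1))
hist = orientation_histogram(L, row=16, col=16, sigma=2.0)
peaks = dominant_orientations(hist)
print(peaks)
```

On the ramp all the weight falls into the first bin, so a single orientation is returned. Each returned bin index would give rise to a separate keypoint with that orientation.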
Once the local orientation and scale of a keypoint have been estimated, a scaled and oriented patch around the detected point can be extracted and used to form a feature descriptor.