Interest Points: Outline

• Correspondence
• Interest point detection
• Orientation selection

Core visual understanding task: finding correspondences between images

(a) incline_L.jpg (img1); (b) incline_R.jpg (img2); (c) img2 warped to img1's frame

Figure 5: Example output for Q6.1: original images img1 and img2 (left and center) and img2 warped to fit img1 (right). Notice that the warped image clips outside the image boundary. We will fix this in Q6.2.

Figure 6: Final panorama view, with the homography estimated by RANSAC.

What to submit:
• a folder matlab containing all the .m and .mat files you were asked to write and generate
• a pdf named writeup.pdf containing the results, explanations and images asked for in the assignment, along with the answers to the questions on homographies.

Submit all the code needed to make your panorama generator run. Make sure all the .m files that need to run are accessible from the matlab folder without any editing of the path variable. If you downloaded and used a feature detector for the extra credit, include the code with your submission and mention it in your writeup. You may leave the data folder in your submission, but it is not needed. Please zip your homework as usual and submit it using blackboard.

H2to1 = computeH(p1, p2)

• Inputs: p1 and p2 should be 2×N matrices of corresponding (x, y)ᵀ coordinates between two images.
• Outputs: H2to1 should be a 3×3 matrix encoding the homography that best matches the linear equation derived above for Equation 8 (in the least squares sense). Hint: Remember that a homography is only determined up to scale. The Matlab functions eig() or svd() will be useful. Note that this function can be written without an explicit for-loop over the data points.

6 Stitching it together: Panoramas (30 pts)

We can also use homographies to create a panorama image from multiple views of the same scene. This is possible, for example, when there is no camera translation between the views (e.g., only rotation about the camera center), as we saw in Q4.2. First, you will generate panoramas using matched point correspondences between images using the BRIEF matching you implemented in Q2.4. We will assume that there is no error in your matched point correspondences between images (although there might be some errors). In the next section you will extend the technique to use (potentially noisy) keypoint matches.
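The least-squares recipe described for computeH (stack two linear constraints per correspondence, then take the right singular vector with the smallest singular value, since the homography is only determined up to scale) can be sketched in NumPy. The assignment itself is in MATLAB; this standalone `computeH` is an illustrative translation, not the course stub:

```python
import numpy as np

def computeH(p1, p2):
    """Estimate H such that p1 ~ H @ p2 (in homogeneous coordinates),
    given 2xN arrays of corresponding points.

    Solves the stacked system A h = 0 in the least squares sense via SVD:
    the best unit-norm h is the right singular vector with the smallest
    singular value. No explicit for-loop over the data points is needed.
    """
    x1, y1 = p1                      # coordinates in image 1
    x2, y2 = p2                      # corresponding coordinates in image 2
    n = x1.size
    zero, one = np.zeros(n), np.ones(n)
    # Two constraint rows per correspondence, built with vectorized stacking.
    rows_x = np.stack([x2, y2, one, zero, zero, zero,
                       -x1 * x2, -x1 * y2, -x1], axis=1)
    rows_y = np.stack([zero, zero, zero, x2, y2, one,
                       -y1 * x2, -y1 * y2, -y1], axis=1)
    A = np.vstack([rows_x, rows_y])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)      # smallest right singular vector -> H

# Sanity check with a known homography and exact correspondences.
rng = np.random.default_rng(0)
H_true = np.array([[1.2, 0.1, 5.0],
                   [-0.05, 0.9, -3.0],
                   [1e-4, 2e-4, 1.0]])
p2 = rng.uniform(0, 10, size=(2, 8))
q = H_true @ np.vstack([p2, np.ones(8)])
p1 = q[:2] / q[2]
H_est = computeH(p1, p2)
H_est = H_est / H_est[2, 2]          # fix the arbitrary scale (and sign)
```

With noise-free matches the recovered matrix agrees with the ground truth up to numerical precision; with noisy matches this same routine becomes the model-fitting step inside RANSAC.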
You will need to use the provided function warp_im = warpH(im, H, out_size), which warps image im using the homography transform H. The pixels in warp_im are sampled at coordinates in the rectangle (1, 1) to (out_size(2), out_size(1)). The coordinates of the pixels in the source image are taken to be (1, 1) to (size(im,2), size(im,1)) and transformed according to H.

Q6.1 (15pts) In this problem you will implement and use the function (stub provided in matlab/imageStitching.m):

[panoImg] = imageStitching(img1, img2, H2to1)

on two images from the Duquesne incline. This function accepts two images and the output from the homography estimation function. This function will:

Appendix: Image Blending

Note: This section is not for credit and is for informational purposes only.

For overlapping pixels, it is common to blend the values of both images. You can simply average out the values, but that will leave a seam at the edges of the overlapping images. Alternatively, you can obtain a blending value for each image that fades one image into the other. To do this, first create a mask like this for each image you wish to blend:

mask = zeros(size(im,1), size(im,2));
mask(1,:) = 1; mask(end,:) = 1; mask(:,1) = 1; mask(:,end) = 1;
mask = bwdist(mask, 'city');
mask = mask/max(mask(:));

The function bwdist computes the distance transform of the binarized input image, so this mask will be zero at the borders and 1 at the center of the image. You can warp this mask just as you warped your images. How would you use the mask weights to compute a linear combination of the pixels in the overlap region? Your function should behave well where one or both of the blending constants are zero.
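One answer to the linear-combination question, sketched in NumPy rather than MATLAB: weight each image by its mask and divide by the sum of weights, guarding against the case where both weights are zero. The cityblock distance of a pixel to the image border is simply the minimum of its four distances to the edges, which lets us reproduce the mask without bwdist; `border_distance_mask` and `blend` are names of my own, not assignment functions:

```python
import numpy as np

def border_distance_mask(h, w):
    """Analogue of the MATLAB mask recipe: 0 at the image border, rising
    to 1 at the center. The cityblock distance to the border equals
    min(row, h-1-row, col, w-1-col), so no distance transform is needed."""
    rows = np.minimum(np.arange(h), np.arange(h)[::-1])[:, None]
    cols = np.minimum(np.arange(w), np.arange(w)[::-1])[None, :]
    mask = np.minimum(rows, cols).astype(float)
    return mask / mask.max()

def blend(im1, w1, im2, w2):
    """Per-pixel linear combination (w1*im1 + w2*im2) / (w1 + w2).

    Behaves well when one or both weights are zero: where only one image
    covers a pixel its value passes through unchanged; where neither
    covers it, the output is 0."""
    total = w1 + w2
    safe = np.where(total > 0, total, 1.0)       # avoid divide-by-zero
    return np.where(total > 0, (w1 * im1 + w2 * im2) / safe, 0.0)

mask = border_distance_mask(5, 7)
```

In the panorama, both warped masks and both warped images go through `blend`, so seams fade smoothly where the two frames overlap.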

Example: license plate recognition

Example: product recognition

Google Glass example: image matching of landmarks

Reference

Distinctive Image Features from Scale-Invariant Keypoints

David G. Lowe Computer Science Department University of British Columbia Vancouver, B.C., Canada [email protected]

January 5, 2004

Abstract

This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.

Accepted for publication in the International Journal of Computer Vision, 2004.

1 Motivation

Which of these patches are easier to match? Why? How can we mathematically operationalize this?

Corner Detector: Basic Idea

• "flat" region: no change in any direction
• "edge": no change along the edge direction
• "corner": significant change in all directions

Defn: points are "matchable" if small shifts always produce a large SSD error.

The math


$$\text{corner}(x_0, y_0) = \min_{u^2 + v^2 = 1} E_{x_0, y_0}(u, v)$$

where

$$E_{x_0, y_0}(u, v) = \sum_{(x, y) \in W(x_0, y_0)} [I(x + u, y + v) - I(x, y)]^2$$

Background: Taylor series expansion

$$f(x + u) = f(x) + \frac{\partial f(x)}{\partial x} u + \frac{1}{2} \frac{\partial^2 f(x)}{\partial x^2} u^2 + \text{higher order terms}$$
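For instance, the second-order expansion of $f(x) = e^x$ about $x = 0$ is $1 + u + u^2/2$. A quick numeric check (the shift range is an arbitrary illustrative choice) shows the truncation error is tiny compared to the shift itself, and smaller than the first-order error:

```python
import numpy as np

u = np.linspace(-0.5, 0.5, 101)      # small shifts around x = 0
exact = np.exp(u)                    # f(x + u) for f = exp, x = 0

# Second-order Taylor expansion: f(0) + f'(0) u + (1/2) f''(0) u^2.
taylor2 = 1 + u + u**2 / 2
# First-order (linearized) expansion, for comparison.
taylor1 = 1 + u

err2 = np.max(np.abs(exact - taylor2))   # O(u^3) remainder
err1 = np.max(np.abs(exact - taylor1))   # O(u^2) remainder
```

The second-order error stays below 0.05 over the whole interval, while the first-order error is several times larger, which is why low-order expansions of smooth signals are so effective.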

Approximation of $f(x) = e^x$ at $x = 0$. Why are low-order expansions reasonable? Underlying smoothness of real-world signals.

Multivariate Taylor series

$$I(x + u, y + v) = I(x, y) + \underbrace{\begin{bmatrix} \frac{\partial I(x,y)}{\partial x} & \frac{\partial I(x,y)}{\partial y} \end{bmatrix}}_{\text{gradient}} \begin{bmatrix} u \\ v \end{bmatrix} + \frac{1}{2} \begin{bmatrix} u & v \end{bmatrix} \underbrace{\begin{bmatrix} \frac{\partial^2 I(x,y)}{\partial x^2} & \frac{\partial^2 I(x,y)}{\partial x \partial y} \\ \frac{\partial^2 I(x,y)}{\partial x \partial y} & \frac{\partial^2 I(x,y)}{\partial y^2} \end{bmatrix}}_{\text{Hessian}} \begin{bmatrix} u \\ v \end{bmatrix} + \text{higher order terms}$$

$$I(x + u, y + v) \approx I + I_x u + I_y v, \quad \text{where } I_x = \frac{\partial I(x, y)}{\partial x}$$

Feature detection: the math

Consider shifting the window $W$ by $(u, v)$:
• how do the pixels in $W$ change?
• compare each pixel before and after by summing up the squared differences
• this defines an "error" $E(u, v)$:

$$\begin{aligned}
E(u, v) &= \sum_{(x, y) \in W} [I(x + u, y + v) - I(x, y)]^2 \\
&\approx \sum_{(x, y) \in W} [I + I_x u + I_y v - I]^2 \\
&= \sum_{(x, y) \in W} [I_x^2 u^2 + I_y^2 v^2 + 2 I_x I_y u v] \\
&= \begin{bmatrix} u & v \end{bmatrix} A \begin{bmatrix} u \\ v \end{bmatrix}, \quad
A = \sum_{(x, y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \\ I_y I_x & I_y^2 \end{bmatrix}
\end{aligned}$$

The math (cont'd)

Defn: points are "matchable" if small shifts always produce a large SSD error.


$$\text{Corner}(x_0, y_0) = \min_{u^2 + v^2 = 1} E(u, v)$$

where ($A$ involves only simple products of $I_x$, $I_y$)

$$E(u, v) = \begin{bmatrix} u & v \end{bmatrix} A \begin{bmatrix} u \\ v \end{bmatrix}, \quad
A = \sum_{(x, y) \in W(x_0, y_0)} \begin{bmatrix} I_x^2 & I_x I_y \\ I_y I_x & I_y^2 \end{bmatrix}$$

Claim 1: $A$ is symmetric ($A^T = A$) and PSD.
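Both this claim and the eigenvalue characterization that follows are easy to check numerically. Here the "window" is just a bag of sampled gradient values (random numbers, purely illustrative): $A$ comes out symmetric with nonnegative eigenvalues, and the minimum of $u^T A u$ over the unit circle matches the smallest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in gradients over a window W; any real patch would do.
Ix = rng.standard_normal(100)
Iy = rng.standard_normal(100)

# Second-moment matrix: sum of outer products of the gradients.
A = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
              [np.sum(Iy * Ix), np.sum(Iy * Iy)]])

evals = np.linalg.eigvalsh(A)          # ascending eigenvalues

# Densely sample unit vectors (cos t, sin t) and evaluate u^T A u.
theta = np.linspace(0, 2 * np.pi, 100000)
U = np.stack([np.cos(theta), np.sin(theta)])
quad = np.sum(U * (A @ U), axis=0)     # u^T A u for each sample
```

PSD-ness follows because $u^T A u = \sum (I_x u + I_y v)^2 \ge 0$ for every direction; the brute-force minimum of `quad` agrees with `evals[0]`.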

Claim 2: Corner-ness is given by the minimum eigenvalue of $A$.

SVD

Deva Ramanan February 1, 2016

Let us represent a linear transformation as follows:

$$y = Ax, \quad A \in \mathbb{R}^{m \times n} \tag{1}$$

where $A$ is a matrix with $n$ columns and $m$ rows. This document uses the singular value decomposition (SVD) to decompose $A$ into a series of geometric transformations, focusing on intuition rather than a precise formulation. For simplicity, let $n = 2$ and $m = 3$, such that $A$ transforms points in 2D to 3D.

Figure 1: Visualizing a matrix $A \in \mathbb{R}^{3 \times 2}$ as a transformation of points from $\mathbb{R}^2$ to $\mathbb{R}^3$.

Revisiting spectral decomposition: $A = V \Lambda V^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3, \ldots)$ and $V^T V = I$.

Orthonormal basis: First, let us recall that the projection of a vector $x \in \mathbb{R}^n$ along a unit vector $v$ (e.g., $v^T v = 1$) can be written as $v^T x$. Let us construct a set of $n$ unit vectors and write them as a matrix $V = [v_1\ v_2 \ldots v_n]$. We can then compute the projection, or coordinates, of vector $x$ along the unit vectors with a matrix multiplication $p = V^T x$. If all the unit vectors are orthogonal to each other ($v_i^T v_j = 0$ for $i \neq j$), then $V^T V = I$. This implies that $V$ can be thought of as a rotation matrix (whose inverse is $V^T$), making it easy to undo the projection. The set of vectors in $V$ form an orthogonal basis for $\mathbb{R}^n$. Let us similarly construct an orthonormal basis for the output space:

$$U = [u_1\ u_2\ \ldots]$$
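A concrete instance of the $n = 2$, $m = 3$ setup: NumPy's thin SVD returns exactly such orthonormal bases for the input and output spaces (the matrix entries below are arbitrary illustrative values):

```python
import numpy as np

# A maps R^2 -> R^3: m = 3 rows, n = 2 columns.
A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, -1.0]])

# Thin SVD: A = U diag(s) V^T with orthonormal columns in U (3x2)
# and an orthonormal (rotation-like) V (2x2).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
```

`Vt @ Vt.T` recovers the identity, so projecting with $V^T$ is invertible (a rotation/reflection of the input space); the columns of $U$ play the same role in the output space, and rescaling by the singular values in between reconstructs $A$ exactly.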

It turns out the above holds for any symmetric matrix.

Figure 2: Visualizing the spectral eigendecomposition of a symmetric PSD matrix.

This makes intuitive sense geometrically; taking the $k$ largest singular values and vectors produces a transformation $A'$ that uses as much of the output space as possible. The sketch of the proof relies on the fact that $U$ and $V$ act as rotations and so do not affect the rank of $A$. The best rank-$k$ approximation of $A$ is then given by the best rank-$k$ approximation of the (diagonal) matrix $\Sigma$.

Corollary 2: The solution of a homogeneous least squares problem is given by the smallest right singular vector:

$$\arg\min_{h : h^T h = 1} \|Ah\|^2 = V(:, \mathrm{end})$$

The proof sketch follows from the fact that any input $v$ must project onto one of the right singular vectors (because they form a basis). A closely related result is that for any PSD matrix $B = A^T A$, $\arg\min_{h : h^T h = 1} h^T B h = V(:, \mathrm{end})$, where $V(:, \mathrm{end})$ is the eigenvector with the smallest eigenvalue.

Corollary 3: The pseudoinverse of $A$ is given by

$$A^+ = \arg\min_{A^+} \|A^+ A - I\|_F = V \begin{bmatrix} \frac{1}{\sigma_1} & 0 & 0 & \ldots \\ 0 & \frac{1}{\sigma_2} & 0 & \ldots \\ 0 & 0 & \frac{1}{\sigma_3} & \ldots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} U^T$$

which could also be obtained by minimizing $\|A A^+ - I\|_F$ (without proof).
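Both corollaries are easy to sanity-check in NumPy (the matrix size and random seed below are arbitrary): the smallest right singular vector beats a large brute-force sample of unit vectors, and assembling $V \,\mathrm{diag}(1/\sigma_i)\, U^T$ reproduces the library pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 4))             # tall, full column rank (w.h.p.)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Corollary 2: unit-norm minimizer of ||A h||^2 is the right singular
# vector with the smallest singular value (V(:, end) in MATLAB notation).
h = Vt[-1]

# Compare against many random unit vectors; none should do better.
H = rng.standard_normal((4, 100000))
H /= np.linalg.norm(H, axis=0)
brute_best = np.min(np.sum((A @ H) ** 2, axis=0))

# Corollary 3: A+ = V diag(1/sigma_i) U^T.
A_pinv = (Vt.T / s) @ U.T                    # divide V's columns by sigma_i
```

This is exactly the machinery behind `computeH`: the homography estimate is the $V(:, \mathrm{end})$ of the stacked constraint matrix.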

3 Spectral decompositions vs SVDs

$$A = V \Lambda V^T$$

Defn: a symmetric matrix $A$ is PSD if $x^T A x \geq 0$ for all $x$.

$A$ is PSD $\iff$ its eigenvalues are all nonnegative.

$A$ is PSD $\iff$ $A = X X^T$, where $X = [\sqrt{\lambda_1}\, v_1\ \ \sqrt{\lambda_2}\, v_2\ \ldots] = V \Lambda^{\frac{1}{2}}$.

$$X = U \Sigma V'^T \implies A = X X^T = U \Sigma^2 U^T$$

Eigenvectors of $A$ = left singular vectors of $X$. Eigenvalues of $A$ = squared singular values of $X$.
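A small NumPy check of these equivalences (the random $X$ is an arbitrary choice): building $A = X X^T$ guarantees PSD-ness, and the eigenvalues of $A$ line up with the squared singular values of $X$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 6))
A = X @ X.T                                  # PSD by construction

evals, V = np.linalg.eigh(A)                 # ascending eigenvalues
sing = np.linalg.svd(X, compute_uv=False)    # descending singular values
```

Sorting one of the two lists makes the correspondence exact; this is also why the structure-tensor matrix of the corner detector (a sum of gradient outer products) is automatically symmetric PSD.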


Alternative visualization of PSD matrices

$$A = V \Lambda V^T$$

Consider the set of $(x_1, x_2)$ points for which

$$\begin{bmatrix} x_1 & x_2 \end{bmatrix} A \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 1$$

• $A = I$: $x_1^2 + x_2^2 = 1$, the unit circle through $(1, 0)$ and $(0, 1)$.
• $A = \Lambda = \mathrm{diag}(4, 1)$: $4 x_1^2 + x_2^2 = 1$, an axis-aligned ellipse through $(0.5, 0)$ and $(0, 1)$.
• $A = V \Lambda V^T$: a rotated ellipse.

(Figure: the ellipse $x^T A x = 1$, with principal axes along the eigenvectors $v_1$, $v_2$ and radii set by the eigenvalues $\lambda_1$, $\lambda_2$.)

Harris Detector: Mathematics

Classification of image points using eigenvalues of $A$:
• "Corner": $\lambda_1$ and $\lambda_2$ are large, $\lambda_1 \sim \lambda_2$; $E$ increases in all directions.
• "Edge": $\lambda_2 \gg \lambda_1$ (or $\lambda_1 \gg \lambda_2$).
• "Flat" region: $\lambda_1$ and $\lambda_2$ are small; $E$ is almost constant in all directions.

The distribution of the x and y derivatives is very different for all three types of patches

Efficient computation

Computing eigenvalues (and eigenvectors) is expensive. It turns out that it is easy to compute their sum (trace) and product (determinant):

– Det(A) = λminλmax

– Trace(A) = λmin+λmax (The trace is the sum of the diagonal of A)

$$R = \frac{4\,\mathrm{Det}(A)}{\mathrm{Trace}(A)^2}$$

(depends only on the ratio of eigenvalues and is 1 if they are equal)
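A quick numeric check of the shortcut, on an arbitrary symmetric 2×2 matrix: trace and determinant, which need no eigendecomposition, really do give the sum and product of the eigenvalues, and the ratio-based response stays at most 1:

```python
import numpy as np

# An arbitrary symmetric 2x2 matrix, like the second-moment matrix A.
A = np.array([[5.0, 2.0],
              [2.0, 3.0]])

lmin, lmax = np.linalg.eigvalsh(A)           # ascending eigenvalues

# Cheap quantities: no eigensolver required.
trace = A[0, 0] + A[1, 1]
det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]

R_ratio = 4 * det / trace ** 2               # 1 iff lmin == lmax
```

For this matrix, $4\lambda_{\min}\lambda_{\max}/(\lambda_{\min}+\lambda_{\max})^2 < 1$, since the eigenvalues differ.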

$$R = \mathrm{Det}(A) - \alpha\,\mathrm{Trace}(A)^2$$

(also favors large eigenvalues)

Harris detector example: corner value (red high, blue low). Question: can we compute these heat maps with convolutions? Pipeline: threshold ($f$ > value), then find local maxima of $f$, giving the Harris features (in red).
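The answer to the slide's question is yes: derivatives are filters, and the windowed sums defining $A$ are box-filter convolutions. A pure-NumPy sketch on a synthetic image (the `box_sum` helper, the 40×40 test square, and $\alpha = 0.05$ are my own illustrative choices):

```python
import numpy as np

def box_sum(M, r):
    """Sum of M over a (2r+1)x(2r+1) window at every pixel.

    Equivalent to convolving with a box filter; implemented with an
    integral image (double cumulative sum) instead of an explicit loop."""
    k = 2 * r + 1
    h, w = M.shape
    P = np.pad(M, r)                              # zero-pad so windows fit
    S = np.pad(P.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))
    return S[k:k+h, k:k+w] - S[:h, k:k+w] - S[k:k+h, :w] + S[:h, :w]

# Synthetic test image: a bright square, so exactly four corners.
img = np.zeros((40, 40))
img[10:30, 10:30] = 1.0

Iy, Ix = np.gradient(img)            # derivative filters (rows = y, cols = x)
A11 = box_sum(Ix * Ix, 2)            # windowed entries of the second-moment
A12 = box_sum(Ix * Iy, 2)            # matrix A, all computed by filtering
A22 = box_sum(Iy * Iy, 2)

alpha = 0.05
R = (A11 * A22 - A12 ** 2) - alpha * (A11 + A22) ** 2   # Det - alpha*Trace^2
peak = np.unravel_index(np.argmax(R), R.shape)          # strongest response
```

The response is near zero in flat regions, negative along the edges (only one eigenvalue is large), and peaks at the square's corners, which is exactly the heat map the slides show.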

The tops of the horns are detected in both images.

Scale and rotation invariance

Harris Detector: Some Properties. Rotation invariance:

Ellipse rotates but its shape (i.e. eigenvalues) remains the same

Corner response $R$ is invariant to image rotation.

Scale Invariant Detection

The problem: how do we choose corresponding circles independently in each image?

Choose the scale for which the point is most "corner-like".

Scale invariant detection: suppose you're looking for corners.

Key idea: find scale that gives local maximum of f • f is a local maximum in both position and scale • Think of f as a blob detector (looks for dark blobs in white backgrounds or vice-versa) • Common definition of f: Difference of Gaussian or Laplacian (or difference between two Gaussian filtered images with different sigmas)

Alternatively, a band-pass filter.

Pope and Lowe (2000) used features based on the hierarchical grouping of image contours, which are particularly useful for objects lacking detailed texture.

The history of research on visual recognition contains work on a diverse set of other image properties that can be used as feature measurements. Carneiro and Jepson (2002) describe phase-based local features that represent the phase rather than the magnitude of local spatial frequencies, which is likely to provide improved invariance to illumination. Schiele and Crowley (2000) have proposed the use of multidimensional histograms summarizing the distribution of measurements within image regions. This type of feature may be particularly useful for recognition of textured objects with deformable shapes. Basri and Jacobs (1997) have demonstrated the value of extracting local region boundaries for recognition. Other useful properties to incorporate include color, motion, figure-ground discrimination, region shape descriptors, and stereo depth cues. The local feature approach can easily incorporate novel feature types because extra features contribute to robustness when they provide correct matches, but otherwise do little harm other than their cost of computation. Therefore, future systems are likely to combine many feature types.

3 Detection of scale-space extrema

Background: scale-space theory

As described in the introduction, we will detect keypoints using a cascade filtering approach that uses efficient algorithms to identify candidate locations that are then examined in further detail. The first stage of keypoint detection is to identify locations and scales that can be repeatably assigned under differing views of the same object. Detecting locations that are invariant to scale change of the image can be accomplished by searching for stable features across all possible scales, using a continuous function of scale known as scale space (Witkin, 1983).

It has been shown by Koenderink (1984) and Lindeberg (1994) that under a variety of reasonable assumptions the only possible scale-space kernel is the Gaussian function. Therefore, the scale space of an image is defined as a function, $L(x, y, \sigma)$, that is produced from the convolution of a variable-scale Gaussian, $G(x, y, \sigma)$, with an input image, $I(x, y)$:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y),$$

where $*$ is the convolution operation in $x$ and $y$, and

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}.$$

Claim: the Gaussian pyramid is the only "reasonable" way to characterize signals at multiple resolutions (such that no new extrema are introduced).

To efficiently detect stable keypoint locations in scale space, we have proposed (Lowe, 1999) using scale-space extrema in the difference-of-Gaussian function convolved with the image, $D(x, y, \sigma)$, which can be computed from the difference of two nearby scales separated by a constant multiplicative factor $k$:

$$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y) = L(x, y, k\sigma) - L(x, y, \sigma). \tag{1}$$

There are a number of reasons for choosing this function. First, it is a particularly efficient function to compute, as the smoothed images, $L$, need to be computed in any case for scale space feature description, and $D$ can therefore be computed by simple image subtraction.
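Equation (1) in code: build $L$ at two nearby scales and subtract. The separable pure-NumPy blur below is a hedged stand-in for a real implementation (which would also downsample per octave); `gaussian_blur` and the 64×64 random image are my own illustrative choices:

```python
import numpy as np

def gaussian_blur(im, sigma):
    """L(., ., sigma) = G(., ., sigma) * I, via separable convolution."""
    r = int(np.ceil(3 * sigma))
    t = np.arange(-r, r + 1)
    g = np.exp(-t ** 2 / (2 * sigma ** 2))
    g /= g.sum()                                   # normalized 1-D kernel
    # Convolve along rows, then along columns (convolution in x and y).
    out = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 0, im)
    out = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), 1, out)
    return out

rng = np.random.default_rng(0)
img = rng.random((64, 64))

sigma, k = 1.6, 2 ** 0.5
L1 = gaussian_blur(img, sigma)
L2 = gaussian_blur(img, k * sigma)
D = L2 - L1          # Eq. (1): DoG by simple image subtraction
```

Since the two blurred copies are needed anyway for the Gaussian pyramid, each DoG level really does cost only one subtraction.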

Proposal: estimate the scale of a blob at $(x, y)$ using extrema of $D(x, y, \sigma)$ along $\sigma$.
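A minimal 1D sketch of this proposal, assuming a Gaussian-shaped blob and a brute-force sweep over $\sigma$ (the grid spacing, blob width `s_blob`, and $\sigma$ range are arbitrary illustrative choices). The scale-normalized response $\sigma^2 \left|\frac{\partial^2}{\partial x^2}(G_\sigma * \text{blob})\right|$ at the blob center attains an interior maximum at a scale tied to the blob width:

```python
import numpy as np

dx = 0.05
x = np.arange(-15, 15 + dx / 2, dx)
s_blob = 2.0
blob = np.exp(-x ** 2 / (2 * s_blob ** 2))     # a Gaussian-shaped blob
center = len(x) // 2                           # index of x = 0

sigmas = np.arange(0.5, 3.5, 0.1)
responses = []
for sigma in sigmas:
    t = np.arange(-4 * sigma, 4 * sigma + dx / 2, dx)
    g = np.exp(-t ** 2 / (2 * sigma ** 2))
    g /= g.sum()                               # normalized Gaussian kernel
    smooth = np.convolve(blob, g, mode="same") # L(., sigma)
    lap = np.gradient(np.gradient(smooth, dx), dx)
    responses.append(sigma ** 2 * abs(lap[center]))  # normalized response

best_sigma = sigmas[int(np.argmax(responses))]
```

The sweep peaks in the interior of the scale range (near $\sqrt{2}\, s_{\text{blob}}$ for this 1D setup), so taking the extremum over $\sigma$ "selects" a characteristic scale for the blob, which is exactly what the DoG extremum does in 3D $(x, y, \sigma)$ space.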



Figure 1: For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale space images shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian (DOG) images on the right. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process repeated.

In addition, the difference-of-Gaussian function provides a close approximation to the scale-normalized Laplacian of Gaussian, $\sigma^2 \nabla^2 G$, as studied by Lindeberg (1994). Lindeberg showed that the normalization of the Laplacian with the factor $\sigma^2$ is required for true scale invariance. In detailed experimental comparisons, Mikolajczyk (2002) found that the maxima and minima of $\sigma^2 \nabla^2 G$ produce the most stable image features compared to a range of other possible image functions, such as the gradient, Hessian, or Harris corner function.

Aside: Difference of Gaussians vs Laplacian of Gaussians

Consider a Gaussian that is growing in variance ("diffusing") over time. The relationship between $D$ and $\sigma^2 \nabla^2 G$ can be understood from the heat diffusion equation (parameterized in terms of $\sigma$ rather than the more usual $t = \sigma^2$):

$$\frac{\partial G}{\partial \sigma} = \sigma \nabla^2 G.$$

From this, we see that $\nabla^2 G$ can be computed from the finite difference approximation to $\partial G / \partial \sigma$, using the difference of nearby scales at $k\sigma$ and $\sigma$:

$$\sigma \nabla^2 G = \frac{\partial G}{\partial \sigma} \approx \frac{G(x, y, k\sigma) - G(x, y, \sigma)}{k\sigma - \sigma}$$

and therefore,

$$G(x, y, k\sigma) - G(x, y, \sigma) \approx (k - 1)\sigma^2 \nabla^2 G.$$

This shows that when the difference-of-Gaussian function has scales differing by a constant factor it already incorporates the $\sigma^2$ scale normalization required for the scale-invariant Laplacian.
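The finite-difference claim is easy to verify numerically in 1D, where the Laplacian of a Gaussian has the closed form $G''(x) = \frac{x^2 - \sigma^2}{\sigma^4} G(x)$; with $k = 1.05$ the two sides of the approximation agree to within a few percent (the grid and the choice of $k$ are illustrative):

```python
import numpy as np

def G(x, sigma):
    """Unit-area 1D Gaussian."""
    return np.exp(-x ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-6, 6, 1001)
sigma, k = 1.0, 1.05

# Left side: difference of Gaussians at two nearby scales.
dog = G(x, k * sigma) - G(x, sigma)

# Right side: (k - 1) sigma^2 times the analytic Laplacian of G.
lap = (x ** 2 - sigma ** 2) / sigma ** 4 * G(x, sigma)
approx = (k - 1) * sigma ** 2 * lap
```

The agreement tightens as $k \to 1$ (the forward-difference error is $O(k - 1)$), which is why a modest per-level ratio like $k = 2^{1/s}$ makes the DoG pyramid a good stand-in for the scale-normalized Laplacian.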

Aside: Difference of Gaussians vs Laplacian of Gaussians

[Figure: profiles of the Gaussian, the Laplacian of Gaussian (LoG), and the normalized LoG (NLoG)]

Difference of Gaussians ~ normalized Laplacian of Gaussian

Pope and Lowe (2000) used features based on the hierarchical grouping of image contours, which are particularly useful for objects lacking detailed texture. The history of research on visual recognition contains work on a diverse set of other image properties that can be used as feature measurements. Carneiro and Jepson (2002) describe phase-based local features that represent the phase rather than the magnitude of local spatial frequencies, which is likely to provide improved invariance to illumination. Schiele and Crowley (2000) have proposed the use of multidimensional histograms summarizing the distribution of measurements within image regions. This type of feature may be particularly useful for recognition of textured objects with deformable shapes. Basri and Jacobs (1997) have demonstrated the value of extracting local region boundaries for recognition. Other useful properties to incorporate include color, motion, figure-ground discrimination, region shape descriptors, and stereo depth cues. The local feature approach can easily incorporate novel feature types because extra features contribute to robustness when they provide correct matches, but otherwise do little harm other than their cost of computation. Therefore, future systems are likely to combine many feature types.

3 Detection of scale-space extrema

As described in the introduction, we will detect keypoints using a cascade filtering approach that uses efficient algorithms to identify candidate locations that are then examined in further detail. The first stage of keypoint detection is to identify locations and scales that can be repeatably assigned under differing views of the same object. Detecting locations that are invariant to scale change of the image can be accomplished by searching for stable features across all possible scales, using a continuous function of scale known as scale space (Witkin, 1983). It has been shown by Koenderink (1984) and Lindeberg (1994) that under a variety of

reasonable assumptions the only possible scale-space kernel is the Gaussian function. Therefore, the scale space of an image is defined as a function, L(x, y, σ), that is produced from the convolution of a variable-scale Gaussian, G(x, y, σ), with an input image, I(x, y):

L(x, y, σ) = G(x, y, σ) ∗ I(x, y),

where ∗ is the convolution operation in x and y, and

G(x, y, σ) = (1 / 2πσ²) e^(−(x²+y²)/2σ²).

To efficiently detect stable keypoint locations in scale space, we have proposed (Lowe, 1999) using scale-space extrema in the difference-of-Gaussian function convolved with the image, D(x, y, σ), which can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ).   (1)

Figure 1: For each octave of scale space, the initial image is repeatedly convolved with Gaussians to produce the set of scale space images shown on the left. Adjacent Gaussian images are subtracted to produce the difference-of-Gaussian images on the right. After each octave, the Gaussian image is down-sampled by a factor of 2, and the process repeated. (Down-sampling gives access to finer scale changes beyond 2x; s = number of levels in an "octave".)

There are a number of reasons for choosing this function. First, it is a particularly efficient function to compute, as the smoothed images, L, need to be computed in any case for scale space feature description, and D can therefore be computed by simple image subtraction.

In addition, the difference-of-Gaussian function provides a close approximation to the scale-normalized Laplacian of Gaussian, σ²∇²G, as studied by Lindeberg (1994). Lindeberg showed that the normalization of the Laplacian with the factor σ² is required for true scale invariance. In detailed experimental comparisons, Mikolajczyk (2002) found that the maxima and minima of σ²∇²G produce the most stable image features compared to a range of other possible image functions, such as the gradient, Hessian, or Harris corner function.

The relationship between D and σ²∇²G can be understood from the heat diffusion equation (parameterized in terms of σ rather than the more usual t = σ²):

∂G/∂σ = σ∇²G.

From this, we see that ∇²G can be computed from the finite difference approximation to ∂G/∂σ, using the difference of nearby scales at kσ and σ:

σ∇²G = ∂G/∂σ ≈ (G(x, y, kσ) − G(x, y, σ)) / (kσ − σ)

and therefore,

G(x, y, kσ) − G(x, y, σ) ≈ (k − 1)σ²∇²G.

This shows that when the difference-of-Gaussian function has scales differing by a constant factor, it already incorporates the σ² scale normalization required for the scale-invariant Laplacian.



Non-maximal suppression


The factor (k − 1) in the equation is a constant over all scales and therefore does not influence extrema location. The approximation error will go to zero as k goes to 1, but in practice we have found that the approximation has almost no impact on the stability of extrema detection or localization for even significant differences in scale, such as k = √2.

An efficient approach to construction of D(x, y, σ) is shown in Figure 1. The initial image is incrementally convolved with Gaussians to produce images separated by a constant factor k in scale space, shown stacked in the left column. We choose to divide each octave of scale space (i.e., doubling of σ) into an integer number, s, of intervals, so k = 2^(1/s). We must produce s + 3 images in the stack of blurred images for each octave, so that final extrema detection covers a complete octave. Adjacent image scales are subtracted to produce the difference-of-Gaussian images shown on the right. Once a complete octave has been processed, we resample the Gaussian image that has twice the initial value of σ (it will be 2 images from the top of the stack) by taking every second pixel in each row and column. The accuracy of sampling relative to σ is no different than for the start of the previous octave, while computation is greatly reduced.

Figure 2: Maxima and minima of the difference-of-Gaussian images are detected by comparing a pixel (marked with X) to its 26 neighbors in 3x3 regions at the current and adjacent scales (marked with circles). Find the (x, y, σ) that is maximal among its 26 neighbors.
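The per-octave construction described above can be sketched in NumPy. This is an illustrative simplification, not the paper's implementation: each level is blurred directly from the base image rather than incrementally, and the `blur` helper and its 3-sigma truncation are my own choices.

```python
import numpy as np

def blur(img, sigma):
    """Separable Gaussian blur in plain NumPy (kernel truncated at 3 sigma)."""
    r = max(1, int(3 * sigma))
    g = np.exp(-np.arange(-r, r + 1) ** 2 / (2.0 * sigma ** 2))
    g /= g.sum()
    pad = np.pad(img, r, mode='edge')
    rows = np.apply_along_axis(np.convolve, 1, pad, g, 'valid')
    return np.apply_along_axis(np.convolve, 0, rows, g, 'valid')

def build_dog_octave(image, sigma=1.6, s=3):
    """One octave of the DoG stack: s + 3 Gaussian images separated by
    k = 2**(1/s) in scale, adjacent pairs subtracted (Equation 1), and the
    image with twice the initial sigma (2 from the top of the stack)
    downsampled by 2 to seed the next octave."""
    k = 2.0 ** (1.0 / s)
    gaussians = [blur(image, sigma * k ** i) for i in range(s + 3)]
    dogs = np.stack([b - a for a, b in zip(gaussians, gaussians[1:])])
    next_base = gaussians[s][::2, ::2]  # this level has sigma * 2
    return dogs, next_base
```

With s = 3 intervals the octave yields s + 2 = 5 DoG images, enough to detect extrema at s scales per octave.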

3.1 Local extrema detection

In order to detect the local maxima and minima of D(x, y, σ), each sample point is compared to its eight neighbors in the current image and nine neighbors in the scale above and below (see Figure 2). It is selected only if it is larger than all of these neighbors or smaller than all of them. The cost of this check is reasonably low due to the fact that most sample points will be eliminated following the first few checks. An important issue is to determine the frequency of sampling in the image and scale domains that is needed to reliably detect the extrema. Unfortunately, it turns out that there is no minimum spacing of samples that will detect all extrema, as the extrema can be arbitrarily close together. This can be seen by considering a white circle on a black background, which will have a single scale space maximum where the circular positive central region of the difference-of-Gaussian function matches the size and location of the circle. For a very elongated ellipse, there will be two maxima near each end of the ellipse. As the locations of maxima are a continuous function of the image, for some ellipse with intermediate elongation there will be a transition from a single maximum to two, with the maxima arbitrarily close to each other.
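The 26-neighbor test can be sketched as follows. Note the paper's implementation gains its speed by abandoning the comparison as soon as one neighbor fails; this vectorized sketch does not, and it assumes interior indices (a border of at least one sample in every dimension).

```python
import numpy as np

def is_local_extremum(dog, i, x, y):
    """True if dog[i, x, y] is strictly larger or strictly smaller than all
    26 neighbors: 8 in its own DoG image plus 9 in each adjacent scale."""
    cube = dog[i - 1:i + 2, x - 1:x + 2, y - 1:y + 2].ravel()
    neighbors = np.delete(cube, 13)  # index 13 is the center of the 3x3x3 cube
    c = dog[i, x, y]
    return bool(c > neighbors.max() or c < neighbors.min())
```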

Scale selection in 1-D

[Figure: a 1-D signal f(x) with three blobs (A, B, C) of different widths, the Gaussian G(x), the difference of Gaussians DoG(x), and the response D(x, σ) = f(x) ∗ DoGσ(x)]

Scale selection in 2D (Lindeberg et al., 1996)

Slide from Tinne Tuytelaars

Scale selection in 2D

Combining both: Harris-Laplacian detector


Gaussian-weighted window:

A(x, y, σI, σD) = G(σI) ∗ [ Ix²(σD)    IxIy(σD)
                            IyIx(σD)   Iy²(σD) ]

Problem: the Harris corner measure is not scale-invariant (it depends on the size of the integration window).

How should we relate the window scale σI to the scale-space scale σD (used to compute derivatives)?

Heuristic: σD = 0.7 σI

cornerness(x, y, σ) = det(A(x, y, σ)) − α trace²(A(x, y, σ))

Combining both: Harris-Laplacian detector
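The second-moment matrix and cornerness above can be sketched in NumPy. This is illustrative, not a reference implementation: the `blur` helper, the value α = 0.05, and the omission of the σD² derivative normalization used in the full Harris-Laplace detector are my simplifications; the σD = 0.7 σI heuristic is from the slide.

```python
import numpy as np

def blur(img, sigma):
    """Separable Gaussian blur in plain NumPy (kernel truncated at 3 sigma)."""
    r = max(1, int(3 * sigma))
    g = np.exp(-np.arange(-r, r + 1) ** 2 / (2.0 * sigma ** 2))
    g /= g.sum()
    pad = np.pad(img, r, mode='edge')
    rows = np.apply_along_axis(np.convolve, 1, pad, g, 'valid')
    return np.apply_along_axis(np.convolve, 0, rows, g, 'valid')

def harris_cornerness(img, sigma_i, alpha=0.05):
    """cornerness = det(A) - alpha * trace(A)**2, where A is the second-moment
    matrix: gradient products at derivative scale sigma_d, accumulated by a
    Gaussian window at integration scale sigma_i (heuristic: 0.7 * sigma_i)."""
    sigma_d = 0.7 * sigma_i
    iy, ix = np.gradient(blur(img, sigma_d))
    a11, a12, a22 = (blur(p, sigma_i) for p in (ix * ix, ix * iy, iy * iy))
    return a11 * a22 - a12 * a12 - alpha * (a11 + a22) ** 2
```

Corners (both gradient directions present in the window) score positive; edges (one dominant direction) score negative; flat regions score near zero.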

1. Optimize cornerness(x,y,sigma) over discrete set of locations and scales

2. Fine-tune “sub-pixel” accuracy by iterating the following:

i. Given (x, y), find the maximal sigma with a finer search (for speed, optimize sigma over scale space with the DoG or normalized Laplacian)

ii. Given sigma, find maximal (x,y) of cornerness

Repeat (i, ii) over local neighborhoods of (sigma, x, y) until convergence.

Example run

Affine-invariant detection (overview)
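The alternating refinement in steps (i) and (ii) above can be sketched as coordinate ascent on a precomputed cornerness volume C[scale, x, y]; the neighborhood radius, stopping rule, and names here are illustrative assumptions.

```python
import numpy as np

def refine(C, s, x, y, radius=1, max_iter=20):
    """Alternately re-optimize the scale index given (x, y), then the
    location given the scale, over a local neighborhood of the cornerness
    volume C[s, x, y], until the triple stops changing."""
    for _ in range(max_iter):
        # (i) given (x, y), find the best scale in a local window
        s0, s1 = max(0, s - radius), min(C.shape[0], s + radius + 1)
        s_new = s0 + int(np.argmax(C[s0:s1, x, y]))
        # (ii) given the scale, find the best (x, y) locally
        x0, x1 = max(0, x - radius), min(C.shape[1], x + radius + 1)
        y0, y1 = max(0, y - radius), min(C.shape[2], y + radius + 1)
        win = C[s_new, x0:x1, y0:y1]
        dx, dy = np.unravel_index(np.argmax(win), win.shape)
        x_new, y_new = x0 + dx, y0 + dy
        if (s_new, x_new, y_new) == (s, x, y):
            break
        s, x, y = s_new, x_new, y_new
    return int(s), int(x), int(y)
```

On a smooth unimodal cornerness volume this walks from a coarse discrete detection to the nearest local optimum in a few iterations.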

Need a richer description of "neighborhood" or scale: replace the scalar σ with a matrix Σ (e.g., scale differently along x and y, or even along a diagonal axis)

1. Optimize cornerness(x,y,sigma) over discrete set of locations and scales

2. Fine-tune “sub-pixel” accuracy by iterating the following:

i. Given (x, y), find the maximal Σ with local search
ii. Given Σ, find the maximal (x, y) of cornerness

Affine Invariance

State of the Art: Affine Invariance

Application: Finding Correspondences

Initial detections. Scale: 4.9, Rotation: 19°. Example from Mikolajczyk and Schmid 2004.

Final matches: 32 correct correspondences. Scale: 4.9, Rotation: 19°. Example from Mikolajczyk and Schmid 2004.

Application: Finding Correspondences (example from Mikolajczyk and Schmid 2004)

Scale change: 1.7, Viewpoint change: 50°. Example from Mikolajczyk and Schmid 2004.

Alternative approach for rotation invariance

(Lowe, SIFT) Compute gradients for all pixels in the patch; histogram (bin) the gradients by orientation.

[orientation histogram over 0 to 2π]
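This orientation histogram, with the multiple-peak idea from the note below, can be sketched as follows. The bin count, the relative-peak threshold (in the spirit of Lowe's 80%-of-max rule), and the function name are illustrative assumptions.

```python
import numpy as np

def dominant_orientations(patch, n_bins=36, rel_thresh=0.8):
    """Magnitude-weighted histogram of gradient orientations over [0, 2pi);
    every bin within rel_thresh of the highest peak is returned, so a single
    patch can yield several candidate orientations."""
    gy, gx = np.gradient(patch.astype(float))
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    mag = np.hypot(gx, gy)
    hist, edges = np.histogram(angle, bins=n_bins,
                               range=(0.0, 2 * np.pi), weights=mag)
    peaks = np.flatnonzero(hist >= rel_thresh * hist.max())
    return (edges[peaks] + edges[peaks + 1]) / 2  # bin-center angles
```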

(I prefer this because you can look for multiple peaks.)

Comparison

References

• K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. Pattern Analysis and Machine Intelligence (PAMI), 27(10):31–47, 2005.
• David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 2004.
• K. Mikolajczyk and C. Schmid. Scale & affine invariant interest point detectors. IJCV, Vol. 60, No. 1, 2004. Software can be downloaded from Schmid's and Lowe's pages.
• T. Lindeberg. Feature detection with automatic scale selection. International Journal of Computer Vision, Vol. 30, No. 2, pp. 77–116, 1998.
• T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, Dordrecht, Netherlands, 1994.

Coordinate frames

Represent each patch in a canonical scale and orientation (or general affine coordinate frame)

Scale Invariant Feature Transform

Basic idea:
• Take 16x16 square window around detected feature
• Compute edge orientation (angle of the gradient − 90°) for each pixel
• Throw out weak edges (threshold gradient magnitude)
• Create histogram of surviving edge orientations

[angle histogram over 0 to 2π]

Adapted from slide by David Lowe

SIFT descriptor

Full version:
• Divide the 16x16 window into a 4x4 grid of cells (2x2 case shown below)
• Compute an orientation histogram for each cell
• 16 cells × 8 orientations = 128 dimensional descriptor
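The 4x4-cell, 8-orientation aggregation above can be sketched as below. This is a bare-bones sketch: the full SIFT descriptor additionally applies Gaussian weighting, trilinear interpolation between bins and cells, and clipping of large values before renormalizing; the function name is illustrative.

```python
import numpy as np

def sift_like_descriptor(window):
    """128-d descriptor from a 16x16 window: 4x4 grid of 4x4-pixel cells,
    each summarized by an 8-bin, magnitude-weighted orientation histogram."""
    assert window.shape == (16, 16)
    gy, gx = np.gradient(window.astype(float))
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    mag = np.hypot(gx, gy)
    desc = []
    for i in range(0, 16, 4):
        for j in range(0, 16, 4):
            hist, _ = np.histogram(angle[i:i + 4, j:j + 4],
                                   bins=8, range=(0.0, 2 * np.pi),
                                   weights=mag[i:i + 4, j:j + 4])
            desc.append(hist)
    desc = np.concatenate(desc)  # 16 cells * 8 bins = 128 values
    norm = np.linalg.norm(desc)
    # Normalize to unit length for (affine) illumination invariance.
    return desc / norm if norm > 0 else desc
```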

Adapted from slide by David Lowe

Properties of SIFT

Extraordinarily robust matching technique
• Can handle changes in viewpoint – up to about 60-degree out-of-plane rotation
• Can handle significant changes in illumination – sometimes even day vs. night (below)
• Fast and efficient – can run in real time
• Lots of code available:
  http://people.csail.mit.edu/albert/ladypack/wiki/index.php/Known_implementations_of_SIFT
  http://www.vlfeat.org/overview/sift.html

We’ll discuss many more on Thursday!