Object Recognition from Local Scale-Invariant Features
David G. Lowe
Computer Science Department
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
[email protected]
Abstract

An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least-squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered partially-occluded images with a computation time of under 2 seconds.

1. Introduction

Object recognition in cluttered real-world scenes requires local image features that are unaffected by nearby clutter or partial occlusion. The features must be at least partially invariant to illumination, 3D projective transforms, and common object variations. On the other hand, the features must also be sufficiently distinctive to identify specific objects among many alternatives. The difficulty of the object recognition problem is due in large part to the lack of success in finding such image features. However, recent research on the use of dense local features (e.g., Schmid & Mohr [19]) has shown that efficient recognition can often be achieved by using local image descriptors sampled at a large number of repeatable locations.

This paper presents a new method for image feature generation called the Scale Invariant Feature Transform (SIFT). This approach transforms an image into a large collection of local feature vectors, each of which is invariant to image translation, scaling, and rotation, and partially invariant to illumination changes and affine or 3D projection. Previous approaches to local feature generation lacked invariance to scale and were more sensitive to projective distortion and illumination change. The SIFT features share a number of properties in common with the responses of neurons in inferior temporal (IT) cortex in primate vision. This paper also describes improved approaches to indexing and model verification.

The scale-invariant features are efficiently identified by using a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations. This approach is based on a model of the behavior of complex cells in the cerebral cortex of mammalian vision. The resulting feature vectors are called SIFT keys. In the current implementation, each image generates on the order of 1000 SIFT keys, a process that requires less than 1 second of computation time.

The SIFT keys derived from an image are used in a nearest-neighbour approach to indexing to identify candidate object models. Collections of keys that agree on a potential model pose are first identified through a Hough transform hash table, and then through a least-squares fit to a final estimate of model parameters. When at least 3 keys agree on the model parameters with low residual, there is strong evidence for the presence of the object. Since there may be dozens of SIFT keys in the image of a typical object, it is possible to have substantial levels of occlusion in the image and yet retain high levels of reliability.

The current object models are represented as 2D locations of SIFT keys that can undergo affine projection. Sufficient variation in feature location is allowed to recognize perspective projection of planar shapes at up to a 60 degree rotation away from the camera or to allow up to a 20 degree rotation of a 3D object.
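The final verification step described above, in which at least 3 matched keys must agree on a pose with low residual, amounts to a linear least-squares solve for the six parameters of a 2D affine transform. The following is a minimal illustrative sketch of that solve, not the paper's implementation; the function name and interface are our own, and matched model/image key locations are assumed to be given.

```python
import numpy as np

def fit_affine(model_pts, image_pts):
    """Least-squares fit of the 2D affine transform mapping model_pts to
    image_pts. Each correspondence (x, y) -> (u, v) yields two linear
    equations in the six unknowns m11, m12, m21, m22, tx, ty, so at
    least 3 correspondences are needed. Returns the 2x2 matrix M, the
    translation t, and the RMS residual of the fit."""
    model_pts = np.asarray(model_pts, dtype=float)
    image_pts = np.asarray(image_pts, dtype=float)
    n = len(model_pts)
    A = np.zeros((2 * n, 6))
    b = image_pts.reshape(-1)          # [u0, v0, u1, v1, ...]
    A[0::2, 0:2] = model_pts           # u = m11*x + m12*y + tx
    A[0::2, 4] = 1.0
    A[1::2, 2:4] = model_pts           # v = m21*x + m22*y + ty
    A[1::2, 5] = 1.0
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    M = params[:4].reshape(2, 2)
    t = params[4:]
    residual = np.sqrt(np.mean((A @ params - b) ** 2))
    return M, t, residual
```

A match hypothesis would then be accepted only when the residual is small relative to the expected feature-location noise, which is what makes false matches from clutter unlikely even with few agreeing keys.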
Proc. of the International Conference on Computer Vision, Corfu (Sept. 1999)

2. Related research

Object recognition is widely used in the machine vision industry for the purposes of inspection, registration, and manipulation. However, current commercial systems for object recognition depend almost exclusively on correlation-based template matching. While very effective for certain engineered environments, where object pose and illumination are tightly controlled, template matching becomes computationally infeasible when object rotation, scale, illumination, and 3D pose are allowed to vary, and even more so when dealing with partial visibility and large model databases.

An alternative to searching all image locations for matches is to extract features from the image that are at least partially invariant to the image formation process and matching only to those features. Many candidate feature types have been proposed and explored, including line segments [6], groupings of edges [11, 14], and regions [2], among many other proposals. While these features have worked well for certain object classes, they are often not detected frequently enough or with sufficient stability to form a basis for reliable recognition.

There has been recent work on developing much denser collections of image features. One approach has been to use a corner detector (more accurately, a detector of peaks in local image variation) to identify repeatable image locations, around which local image properties can be measured. Zhang et al. [23] used the Harris corner detector to identify feature locations for epipolar alignment of images taken from differing viewpoints. Rather than attempting to correlate regions from one image against all possible regions in a second image, large savings in computation time were achieved by only matching regions centered at corner points in each image.

For the object recognition problem, Schmid & Mohr [19] also used the Harris corner detector to identify interest points, and then created a local image descriptor at each interest point from an orientation-invariant vector of derivative-of-Gaussian image measurements. These image descriptors were used for robust object recognition by looking for multiple matching descriptors that satisfied object-based orientation and location constraints. This work was impressive both for the speed of recognition in a large database and the ability to handle cluttered images.

The corner detectors used in these previous approaches have a major failing, which is that they examine an image at only a single scale. As the change in scale becomes significant, these detectors respond to different image points. Also, since the detector does not provide an indication of the object scale, it is necessary to create image descriptors and attempt matching at a large number of scales. This paper describes an efficient method to identify stable key locations in scale space. This means that different scalings of an image will have no effect on the set of key locations selected. Furthermore, an explicit scale is determined for each point, which allows the image description vector for that point to be sampled at an equivalent scale in each image. A canonical orientation is determined at each location, so that matching can be performed relative to a consistent local 2D coordinate frame. This allows for the use of more distinctive image descriptors than the rotation-invariant ones used by Schmid and Mohr, and the descriptor is further modified to improve its stability to changes in affine projection and illumination.

Other approaches to appearance-based recognition include eigenspace matching [13], color histograms [20], and receptive field histograms [18]. These approaches have all been demonstrated successfully on isolated objects or pre-segmented images, but due to their more global features it has been difficult to extend them to cluttered and partially occluded images. Ohba & Ikeuchi [15] successfully apply the eigenspace approach to cluttered images by using many small local eigen-windows, but this then requires expensive search for each window in a new image, as with template matching.

3. Key localization

We wish to identify locations in image scale space that are invariant with respect to image translation, scaling, and rotation, and are minimally affected by noise and small distortions. Lindeberg [8] has shown that under some rather general assumptions on scale invariance, the Gaussian kernel and its derivatives are the only possible smoothing kernels for scale space analysis.

To achieve rotation invariance and a high level of efficiency, we have chosen to select key locations at maxima and minima of a difference of Gaussian function applied in scale space. This can be computed very efficiently by building an image pyramid with resampling between each level. Furthermore, it locates key points at regions and scales of high variation, making these locations particularly stable for characterizing the image. Crowley & Parker [4] and Lindeberg [9] have previously used the difference-of-Gaussian in scale space for other purposes. In the following, we describe a particularly efficient and stable method to detect and characterize the maxima and minima of this function.

As the 2D Gaussian function is separable, its convolution with the input image can be efficiently computed by applying two passes of the 1D Gaussian function in the horizontal and vertical directions:

g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-x^2 / 2\sigma^2}
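The separability property can be sketched as follows. This is an illustrative NumPy version, not the paper's implementation: it samples a 1D Gaussian and shows that a horizontal pass followed by a vertical pass reproduces full 2D Gaussian smoothing at a fraction of the cost.

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    """Sampled 1D Gaussian g(x) = exp(-x^2 / 2 sigma^2) / (sqrt(2 pi) sigma),
    truncated at +/- 3 sigma by default and renormalized to sum to 1."""
    if radius is None:
        radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return g / g.sum()

def gaussian_blur_separable(image, sigma):
    """2D Gaussian smoothing as two 1D passes: horizontal, then vertical.
    A (2r+1)-tap separable filter costs 2(2r+1) multiplies per pixel
    instead of (2r+1)^2 for direct 2D convolution."""
    g = gaussian_kernel_1d(sigma)
    # Horizontal pass: convolve each row with g (zero padding at borders).
    tmp = np.apply_along_axis(
        lambda row: np.convolve(row, g, mode='same'), 1, image)
    # Vertical pass: convolve each column of the intermediate result.
    return np.apply_along_axis(
        lambda col: np.convolve(col, g, mode='same'), 0, tmp)
```

The two passes are equivalent to convolving with the full 2D kernel because exp(-(x^2 + y^2) / 2 sigma^2) factors into a product of two 1D Gaussians, i.e. the 2D kernel is the outer product of the 1D kernel with itself.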