A Comprehensive Review of Current Local Features for Computer Vision
Total Page:16
File Type:pdf, Size:1020Kb
ARTICLE IN PRESS Neurocomputing 71 (2008) 1771– 1787 Contents lists available at ScienceDirect Neurocomputing journal homepage: www.elsevier.com/locate/neucom A comprehensive review of current local features for computer vision Jing Li Ã, Nigel M. Allinson Vision and Information Engineering Research Group, Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield, S1 3JD,UK article info abstract Available online 29 February 2008 Local features are widely utilized in a large number of applications, e.g., object categorization, image Keywords: retrieval, robust matching, and robot localization. In this review, we focus on detectors and local Local features descriptors. Both earlier corner detectors, e.g., Harris corner detector, and later region detectors, e.g., Feature detectors Harris affine region detector, are described in brief. Most kinds of descriptors are described and Corner detectors summarized in a comprehensive way. Five types of descriptors are included, which are filter-based Region detectors descriptors, distribution-based descriptors, textons, derivative-based descriptors and others. Finally, the Feature descriptors matching methods and different applications with respect to the local features are also mentioned. The Filter-based descriptors objective of this review is to provide a brief introduction for new researchers to the local feature Distribution-based descriptors research field, so that they can follow an appropriate methodology according to their specific Textons requirements. Derivative-based descriptors & 2008 Elsevier B.V. All rights reserved. 1. Introduction response of this detector is anisotropic, noisy, and sensitive to edges. To reduce these shortcomings, the Harris corner detector Nowadays, local features (local descriptors), which are distinc- [28] was developed. However, it fails to deal with scale changes, tive and yet invariant to many kinds of geometric and photometric which always occur in images. Therefore, the construction of transformations, have been gaining more and more attention detectors that can cope with this scaling problem is important. because of their promising performance. They are being applied Lowe [42] pioneered a scale invariant local feature, namely the widely in computer vision research, for such tasks as image re- scale invariant feature transform (SIFT). It consists of a detector trieval [41,63,73,74,78], image registration [5], object recognition and a descriptor. The SIFT’s detector finds the local maximums of a [9,10,30–32,42,50,72], object categorization [20,21,39,46,60,71],tex- series of difference of Gaussian (DoG) images. Mikolajczyk and ture classification [29,37,40,81,82], robot localization [65],wide Schmid [47] developed the Harris–Laplace detector by combining: baseline matching [22,45,56,61,75,78], and video shot retrieval (i) the Harris corner detector and (ii) the Laplace function for [67,68]. characteristic scale selection, for scale invariant feature detection. There are two different ways of utilizing local features in To deal with the viewpoint changes, Mikolajczyk and Schmid [47] applications: (i) traditional utilization, which involves the following put forward the Harris (Hessian) affine detector, which incorpo- three steps: feature detection, feature description, and feature rates: the Harris corner detector (the Hessian point detector), matching; (ii) bag-of-features [55] and hyperfeatures [1], scale selection, and second moment matrix based elliptical shape which include the following four steps: feature detection, feature estimation. Tuytelaars and Van Gool developed an edge-based description, feature clustering, and frequency histogram construc- region detector [78], which considers both curved and straight tion for image representation. The focus of this review is on local edges to construct parallelograms associated with the Harris features. A local feature consists of a feature detector and a feature corner points. They also proposed an intensity-based detector descriptor. [78], which starts from the local extrema of intensity and Feature detectors can be traced back to the Moravec’s corner constructs ellipse-like regions with a number of rays emitted detector [52], which looks for the local maximum of minimum from these extrema. Both the edge- and intensity-based methods intensity changes. As pointed by Harris and Stephens [31], the preserve the affine invariance. Matas et al. [45] developed the maximally stable extremal region (MSER) detector, which is a watershed-like method. The last but not the least affine invariant detector is the salient region detector [33], which locates regions à Corresponding author. E-mail addresses: elq06jl@sheffield.ac.uk (J. Li), n.allinson@sheffield.ac.uk based on an entropy function. The detailed discussions of these (N.M. Allinson). methods are given in Section 2. 0925-2312/$ - see front matter & 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2007.11.032 ARTICLE IN PRESS 1772 J. Li, N.M. Allinson / Neurocomputing 71 (2008) 1771–1787 To represent points and regions, which are detected by the candidate points in other image for each point in the reference above methods, a large number of different local descriptors have image, in case that the distance between the descriptors of the been developed. The earliest local descriptor could be the local candidate point and the reference point is below a specified derivatives [36]. Florack et al. [23] incorporated a number of local threshold. Nearest neighbor matching algorithms find the derivatives and constructed the differential invariants, which are point with the closest descriptor to a reference point. Nearest rotational invariant, for local feature representation. Schmid and neighbor distance ratio matching utilizes the ratio between the Mohr [63] extended local derivatives as the local grayvalue distance of the nearest and the second-nearest neighbors for a invariants for image retrieval. Freeman and Adelson [25] proposed reference point. Using which form of matching method depends steerable filters, which are linear combinations of a number of on a specific application. If a simple and fast strategy is required, basis filters, for orientation and scale selection to handle tasks in the threshold-based matching is often the best choice; if an image processing and computer vision research. Marcelja [44] and accurate and effective algorithm is a prerequisition, the nearest Daugman [17,18] modeled the responses of the mammalian visual neighbor distance ratio matching has distinct advantages. For cortex through a series of Gabor functions [17,18], because these detector [49,53,64] and descriptor [48,53] performance evalua- functions can suitably represent the receptive field profiles in tions, almost all existing works are based on one of these cortical simple cells. Therefore, Gabor filters can be applied for matching methods. local feature description. Wavelets, which are effective and Existing detector performance evaluations [49,53] have efficient for multiresolution analysis, can also represent local demonstrated that: (i) under viewpoint changes, MSER out- features. Textons, e.g., 2D textons, 3D textons [40], and the perform others for both structured and textured images; Varma–Zisserman model [81], have been demonstrated to have (ii) under scale changes, Hessian affine detector achieves the good performance for texture classification. A texton dictionary is best results for both structured and textured images; (iii) under constructed from a number of textures and a clustering algorithm image blur, MSER performs poorly for structured flat images is applied to select a small number of models to represent each but generally reliable for textured images; others work similarly texture. Texton representation is also a good choice for local for both; (iv) under JPEG compression for structured images, feature modeling. Van Gool et al. [80] computed them for Hessian and Harris affine perform best; (v) under illumination moments up to second order and second degree based on the changes, all detectors perform well and MSER works best; and (vi) derivatives of x and y directions of an image patch. Past research Hessian affine and DoG consistently outperform others for 3D has shown the effectiveness of the SIFT descriptor, which is a 3D objects. histogram for gradient magnitudes and orientations representa- Previous descriptor evaluations [48,53] revealed that: (i) under tion. This feature is invariant to changes in partial illumination, rotation changes, all descriptors have a similar performance for background clutter, occlusion, and transformations in terms of structured images. GLOH, SIFT and shape context obtain the best rotation and scaling. Schaffalitzky and Zisserman [61] introduced results for textured images although all descriptors do not complex filters for generating kernel functions for efficient multi- perform well for this kind of images; (ii) under viewpoint changes, view matching. Shape context, developed by Belongie et al. [9], the results are better for textured images than for structured describes the distribution of the rest points of a shape with image: the GLOH descriptor perform best for structured images respect to a reference point on the shape for matching. Based and SIFT obtains the best results for textured images; (iii) under on the phase and amplitude of steerable bandpass filters, scale changes, SIFT and GLOH obtain the best results for both Carneiro and Jepson [12] proposed phase-based local features, textured