Journal of Image and Graphics, Vol. 2, No. 1, June 2014 (doi: 10.12720/joig.2.1.70-76)

SIFT and SURF Performance Evaluation for Mobile Robot-Monocular Visual Odometry

Houssem Eddine Benseddik, Oualid Djekoune, and Mahmoud Belhocine
Computer Integrated Manufacturing and Robotics Division, Center for Development of Advanced Technologies, Algiers, Algeria
Email: {hbenseddik, odjekoune, mbelhocine}@cdta.dz

Abstract—Visual odometry is the process of estimating the motion of a mobile robot through a camera attached to it, by matching point features between pairs of consecutive image frames. For mobile robots, a reliable method for comparing images can constitute a key component for localization and motion estimation tasks. In this paper, we study and compare the SIFT and SURF detectors/descriptors in terms of accurate motion determination and runtime efficiency in the context of mobile robot monocular visual odometry. We evaluate the performance of these detectors/descriptors in terms of repeatability, recall, precision and computational cost. To estimate the relative pose of the camera from outlier-contaminated feature correspondences, the essential matrix and the inlier set are estimated using RANSAC. Experimental results demonstrate that SURF outperforms SIFT in both accuracy and speed.

Index Terms—SIFT, SURF, essential matrix, RANSAC, visual odometry

Manuscript received January 4, 2013; revised May 5, 2014.

I. INTRODUCTION

In the last decade, visual odometry has emerged as a novel and promising solution to the problem of robot localization in uncharted environments. The key idea of visual odometry is to estimate the robot motion by tracking visually distinctive features in subsequent images acquired by an on-board camera [1]. Ego-motion estimation is an important prerequisite in robotics applications. Many higher-level tasks like obstacle detection, collision avoidance or autonomous navigation rely on an accurate localization of the robot. All of these applications make use of the relative pose of the current camera with respect to the previous camera frame or a static world reference frame. Usually, this localization task is performed using GPS or wheel speed sensors.

In recent years, camera systems have become cheaper and the performance of computing hardware has increased dramatically. In addition, video sensors are relatively inexpensive and easy to integrate in mobile platforms. Hence, high-resolution images can be processed in real time on standard hardware. It has been proven that the information given by a camera system is sufficient to estimate the motion of a moving camera in a static environment, a process called visual odometry [1]. These properties make visual sensors especially useful for navigation on rough terrain [2], such as in reconnaissance, planetary exploration, safety and rescue applications, as well as in urban environments [3], [4]. Since its early appearance, visual odometry has been based on three main stages: feature detection, feature tracking, and motion estimation. However, several particular implementations have been proposed in the literature [5], [6], which mainly differ in the type of video sensor used, i.e. monocular, stereo or omnidirectional cameras, in the feature tracking method, and in the transformation adopted for estimating the camera motion.

A variety of feature detection algorithms have been proposed in the literature to compute reliable descriptors for image matching [7]-[11]. The SIFT [10] and SURF [11] detectors and descriptors are among the most promising due to their good performance and have now been used in many applications. For visual odometry as a real-time video system, accuracy of feature localization and computational cost are crucial. Unlike image matching applications with large viewpoint changes, such as panorama stitching, object recognition and image retrieval, visual odometry matches successive frames of a video sequence. This matching produces a number of false matches that significantly affect localization accuracy, mainly because many features from an image may have no match in the preceding image.

Essential matrix estimation is one of the stages of visual odometry: this is where a robot's motion is computed by calculating the trajectory of an attached camera. This matrix encodes the relative orientation and translation direction between two views, and it is used to estimate the relative pose from features matched between two images ('feature correspondences'). Normally some features will be incorrectly matched, so an estimation robust to these outliers must be used. RANSAC (RANdom SAmple Consensus) [12] is a commonly used approach to achieve accurate estimates even in the presence of large fractions of outliers. The use of RANSAC allows for outlier rejection in 2D images corresponding to real traffic scenes, providing a method for carrying out visual odometry on board a robot.


One application is simultaneously finding the fundamental matrix (or the essential matrix in the case of calibrated cameras) relating correspondences between two images, and identifying and removing bad correspondences [13].

In this paper, we offer a substantive evaluation of SIFT and SURF to find the most appropriate detector and descriptor for accurate motion estimation in visual odometry. We have selected these two popular detectors and descriptors because they have previously shown good performance in visual odometry.

The remaining sections are organized as follows. In Section II, we briefly discuss the working mechanism of SIFT and SURF, followed by a discussion of the matching algorithm; we then present the essential matrix, how to estimate the relative pose of two cameras from this matrix, and robust motion estimation by RANSAC. In Section III, we thoroughly compare the matching performance of the two detectors and descriptors and present our evaluation criteria. Finally, we conclude in Section IV.

II. BACKGROUND

A. Features Description and Matching

One of the most important aspects of visual odometry is feature detection and matching. Two well-known approaches to detect salient regions in images have been published: the Scale Invariant Feature Transform (SIFT), by Lowe [10], and Speeded Up Robust Features (SURF), by Bay et al. [11]. Both approaches do not only detect interest points, or so-called features, but also propose a method of creating an invariant descriptor. This descriptor can be used to (more or less) uniquely identify the found interest points and match them even under a variety of disturbing conditions like scale changes, rotation, changes in illumination or viewpoint, or image noise. Exactly this invariance is most important to applications in mobile robotics, where stable and repeatable visual features serve as landmarks for visual odometry and SLAM.

The Scale Invariant Feature Transform (SIFT): The SIFT detector/descriptor proposed by Lowe [10] is a local feature extraction method invariant to image translation, scaling and rotation, and partially invariant to illumination changes and affine 3D projection. The extraction of SIFT features relies on the following stages:

• Creation of scale-space: The scale-space is created by repeatedly smoothing the original image with a Gaussian kernel.
• Detection of scale-space extrema (interest point detection): This is done to find peaks in the scale space of image (pixel) positions p = [x, y] and scales σ, by searching the (x, y, σ) space for extrema, which are filtered using stability criteria (step 4).
• Accurate interest point localization: In the previous step, the interest points were detected in a discrete space. This step determines the location of interest points with sub-pixel and sub-scale accuracy.
• Rejection of weak interest points: All interest points that have low contrast or lie on an edge are removed.
• Orientation assignment: To obtain rotational invariance, each interest point is assigned an orientation determined from the image gradients of the surrounding patch. The size of the patch is determined by the selected scale.

The SIFT descriptor is a 3D histogram of gradient location and orientation. Given the position, scale and orientation of each interest point, a patch is selected in which the magnitude and orientation of the gradient are used to create a representation that allows, to some extent, for affine and illumination changes. The gradient magnitudes are weighted by a Gaussian window with sigma equal to one half the width of the descriptor window. These samples are then accumulated into orientation histograms (with eight bins) summarizing the contents over 4x4 sub-regions. The feature vector contains the values of all orientation histogram entries. With a descriptor window size of 16x16 samples, leading to 16 sub-regions, the resulting feature vector has 16x8 = 128 elements.

Speeded Up Robust Features (SURF): The SURF detector/descriptor was proposed by Bay et al. [11]. Like SIFT, the SURF approach comprises a keypoint detector and a descriptor. The SURF algorithm is structured in the following steps:

• Computation of the integral image of the input images;
• Computation of the box Hessian operator at several scales and sample rates using box filters;
• Selection of maxima responses of the determinant of the box Hessian matrix in scale space;
• Refinement of the corresponding interest point location by quadratic interpolation;
• Storage of the interest point with its contrast sign.

The SURF descriptor encodes the distribution of pixel intensities in the neighborhood of the detected feature at the corresponding scale. To extract the SURF descriptor, the first step is to construct a square window of size 20σ (σ is the scale) around the interest point, oriented along the dominant direction. The window is divided into 4x4 regular sub-regions. Then for each sub-region the values ∑dx, ∑dy, ∑|dx| and ∑|dy| are computed, where dx and dy refer to the Haar wavelet responses in the horizontal and vertical directions relative to the dominant orientation. This leads to an overall vector of length 4x4x4 = 64, which corresponds to the scaled and oriented neighborhood of the interest point.
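As a concrete illustration of the two extraction pipelines described above, the following minimal sketch obtains keypoints and descriptors with OpenCV's SIFT and SURF implementations. This is our illustration rather than the implementations used in the experiments below; the file name is a placeholder, and SURF requires an opencv-contrib build.

```python
import cv2

def extract_features(image_path, method="SIFT"):
    """Detect keypoints and compute descriptors on a grayscale image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if method == "SIFT":
        detector = cv2.SIFT_create()  # 128-dimensional descriptors
    else:
        # SURF lives in the contrib module; 64-dimensional descriptors by default.
        detector = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    keypoints, descriptors = detector.detectAndCompute(img, None)
    return keypoints, descriptors

# Placeholder file name, for illustration only.
kp, desc = extract_features("frame_000.png", "SIFT")
print(len(kp), None if desc is None else desc.shape)
```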


Features Matching: After detecting the features (keypoints), we must match them, i.e., determine which features come from corresponding locations in different images. The descriptors described above constitute the features used for matching images. Consider two images, one for frame a and one for frame b. For both images, local features are extracted (using one of the methods described above), which results in two sets of features. Each feature F comprises the pixel position [x, y] and a histogram H containing the SIFT or SURF descriptor. The similarity measure is based on the number of features that match between the two sets.

The feature matching algorithm calculates the Euclidean distance between each feature in the first image and all the features in the second image. A potential match is found if the smallest distance is smaller than 60% of the second smallest distance. The matching strategy is thus to find the descriptor from the initial image that has the smallest Euclidean distance to a given descriptor in the secondary image; this guarantees that matched interest points fit substantially better than the other feature pairs. In addition, no feature is allowed to be matched against more than one other feature: if a feature has more than one candidate match, the match with the lowest Euclidean distance among the candidate matches is selected. Note that the number of matched features will depend on the order in which the features are matched, that is, if each feature of the second image is instead matched against all features of the first image, the number of matches may differ. This can be avoided if the matching is done in both directions, where a match is only considered valid if it occurs both ways. The feature matching step results in a set of matched feature pairs.

B. Monocular Visual Odometry

Given that a set of features has been tracked successfully from the previous set of frames, it is now possible to estimate the new location of the camera rig. From five corresponding points, it is possible to recover the relative positions of the points and cameras, up to a scale. This is the minimum number of points needed for estimating the relative camera motion from a calibrated camera, and the corresponding method is called the five-point algorithm. Using the five-point algorithm with five correspondences, one can obtain the essential matrices [1]. For each essential matrix, four combinations of possible relative rotation R and translation T of the camera can be easily extracted. In order to determine which combination corresponds to the true relative movement, the constraint that the scene points should be in front of the camera in both views is imposed. In this work, we assume that the camera used in the visual odometry is fully calibrated, i.e., the intrinsic matrix K is given.

Motion estimation by the essential matrix: The essential matrix E is a 3×3 matrix encoding the rotation and translation direction between two views. If the rotation is expressed as a matrix R and the translation as a vector t, then E is defined by:

E = [t]x R    (1)

where [t]x is the matrix representation of the vector cross product, with the property that [t]x v ≡ t × v. As [t]x has rank 2 in general, E also has rank 2. From two images alone, the length of t cannot be determined, therefore E is only determined up to scale. A matrix can be decomposed into a rotation and a translation in this way when its Singular Value Decomposition (SVD; [17]) has the form:

E = U diag(s, s, 0) V^T    (2)

where U and V are orthonormal matrices. Due to the sign and scale ambiguity in E, U and V can always be chosen to be rotation matrices, and s can be chosen to be 1.

If a 3D point is viewed in two images at locations X and X' (where X, X' are calibrated homogeneous image coordinates), then E has the property that:

X'^T E X = 0    (3)

Expanding this equation gives a single linear constraint on the nine elements of E for every correspondence. From N correspondences, these equations can be stacked to form an N×9 matrix, with the essential matrix lying in the null space of this matrix. To estimate E, the five-point algorithm is used: the essential matrix has five degrees of freedom and the minimal set is five point matches. A number of practical algorithms have been proposed [14], [15], the most prominent of which (due to its efficiency) is the five-point algorithm proposed by Nister [16].

The least-squares fit to (3) is an essential matrix only if it can be decomposed into a rotation and a translation as per (1); given the SVD of this fit, the closest essential matrix is:

E = U diag(1, 1, 0) V^T    (4)

E can be decomposed by SVD to give its corresponding rotation and translation direction; however, two rotations and two (opposite) translation directions satisfy (1) for any given E. The correct R, t pair is identified by reconstructing a 3D point for each possible R, t; the reconstructed point will fall in front of both cameras only for the correct R, t [17].

Robust motion estimation using RANSAC: The RANSAC (RANdom SAmple Consensus [12]) robust estimation framework enables the essential matrix to be estimated from a set of point correspondences contaminated with outliers. RANSAC works by repeatedly choosing small random subsets of five correspondences ('hypothesis sets'), fitting an essential matrix to each hypothesis set, and then counting the total number of correspondences for which Sampson's error is below a threshold. Eventually an essential matrix compatible with many correspondences will be found, usually because the hypothesis set contained only inliers. RANSAC effectively finds essential matrices which are approximately correct, and inlier sets consisting mostly of inliers (typically about 90%), but it can be very slow to find more accurate solutions [18], because of the large number of iterations needed to find a hypothesis set containing only inliers, and because five-point solvers are sensitive to point localization errors [19]. As a result, inlier sets tend to contain nearby outliers and to miss some inliers. Raising the inlier/outlier threshold generally increases the numbers of both inliers and outliers, while lowering it reduces both.
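To make the matching and pose-recovery steps above concrete, the sketch below applies the 60% ratio test to descriptor matches and then estimates the essential matrix and relative pose with OpenCV's RANSAC-based five-point solver and cheirality check. It is a minimal sketch under assumed inputs (keypoints kp_a/kp_b, descriptors desc_a/desc_b and intrinsic matrix K), not the code used for the experiments in this paper.

```python
import cv2
import numpy as np

def match_and_estimate_pose(kp_a, desc_a, kp_b, desc_b, K):
    """Ratio-test matching followed by RANSAC essential-matrix estimation."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # For each descriptor in frame a, take the two nearest neighbours in frame b.
    knn = matcher.knnMatch(desc_a, desc_b, k=2)
    # Keep a match only if the best distance is below 60% of the second best.
    good = [p[0] for p in knn if len(p) == 2 and p[0].distance < 0.6 * p[1].distance]

    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])

    # Five-point algorithm inside a RANSAC loop (calibrated camera, matrix K).
    E, inliers = cv2.findEssentialMat(pts_a, pts_b, K,
                                      method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    # Resolve the four possible R, t combinations with the in-front (cheirality) test.
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=inliers)
    return R, t, inliers
```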


The monocular visual odometry scheme operates as follows (a sketch of this loop is given after the list):

1) Extract the features from the images using the SIFT or SURF feature detection and description scheme.
2) Match interest points over two frames using the Euclidean distance.
3) Randomly choose a number of samples, each composed of 5 matches between the first and the second frame, and use the five-point algorithm to generate a number of hypotheses for the essential matrix.
4) Search for the best hypothesis using RANSAC and store the corresponding inliers. The error function is the distance between the epipolar line given by (3) associated with X and the corresponding point X'.
5) Extract from the resulting essential matrix E the relative motion (rotation R and translation T) between the two frames.
6) Repeat from step 1.
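A minimal sketch of this loop, assuming the illustrative helpers extract_features and match_and_estimate_pose from the earlier sketches and a unit translation scale (the monocular scale is unobservable), could look as follows:

```python
import numpy as np

def monocular_vo(frame_paths, K, method="SURF"):
    """Accumulate frame-to-frame motion estimates into a camera trajectory."""
    pose_R, pose_t = np.eye(3), np.zeros((3, 1))
    trajectory = [pose_t.ravel().copy()]

    kp_prev, desc_prev = extract_features(frame_paths[0], method)
    for path in frame_paths[1:]:
        kp, desc = extract_features(path, method)
        R, t, _ = match_and_estimate_pose(kp_prev, desc_prev, kp, desc, K)
        # recoverPose returns R, t mapping points from the previous camera frame
        # into the current one; invert to get the current camera pose in the
        # previous frame, then chain into world coordinates (unit scale assumed).
        pose_t = pose_t + pose_R @ (-R.T @ t)
        pose_R = pose_R @ R.T
        trajectory.append(pose_t.ravel().copy())
        kp_prev, desc_prev = kp, desc
    return np.array(trajectory)
```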

III. IMPLEMENTATION AND EVALUATIONS

The visual odometry algorithm relies on accumulating relative motions estimated from corresponding features in the images acquired while the robot is moving. Thus, to achieve a reliable estimate of the camera pose, it is very important to have a set of salient features that are tracked successfully from the previous images.

Out of the many available detectors/descriptors, we wanted to test and compare the most frequently used and best performing ones for visual odometry. Among the popular detectors/descriptors which have proven to be effective and which tackle issues such as scale, rotation, viewpoint or illumination variation are SIFT [10] and SURF [11]. We thoroughly compare the performance of these two detectors/descriptors. To ensure our work is compatible with existing analyses, we have chosen the images of the boat dataset for evaluating the performance of the descriptors, facing the challenges of zoom and rotation changes; the tests are based on matches found between two images. Having chosen the two types of detectors, SIFT and SURF, it is now of interest to compare them and evaluate their robustness under different conditions. For SURF we have used the authors' implementation [11], and for the SIFT feature detector/descriptor, the extract_features.ln package available at [20].

Fig. 1 and Fig. 2 display an example of image matching with SIFT and SURF. This example is based on pictures from the aforementioned boat database.

Figure 1. Matching correspondences obtained with the SIFT detector/descriptor. Green lines correspond to inliers and red lines correspond to outliers.

Figure 2. Matching correspondences obtained with the SURF detector/descriptor. Green lines correspond to inliers and red lines correspond to outliers.

A. Detectors Evaluation

In the first experiment we extracted interest points in each image of the sequences using the methods described in Section II. Next, we computed the number of correspondences for each sequence. The result is shown in Fig. 3: the SURF detector gives a larger number of correspondences than SIFT.

The most important measure used for comparing the detectors is the repeatability rate. The number of correspondences found is a raw measure and gives additional information about the results of the repeatability comparison; the repeatability measurement tends to give better results if the number of correspondences is higher.

Figure 3. The number of correspondences with viewpoint angle changes.


We computed the repeatability rate using (5): the numbers of correspondences in both images are compared, and the smaller of the two numbers of detected features is used as the minimum when calculating the repeatability. In this case, only the features that are present in the scene in both images after the transformation are considered.

repeatability = #correspondences / min(#features in image 1, #features in image 2)    (5)

Figure 4. The repeatability rate on the boat sequence with viewpoint changes.

In most of the cases, the SURF detector shows the best results. For example, in Fig. 4, it achieves a repeatability rate above 0.65 when the image is rotated by 50°, against 0.45 for SIFT. The results show that the SURF detector demonstrates high stability under changes in scale and viewpoint angle in most of the experiments.

B. Descriptor Evaluation

Various evaluation metrics have been proposed in the literature for analyzing matches. The metric most widely used for performing such analysis is based on measuring recall and precision. For the matching of descriptors, these parameters are defined as [18]:

recall = #correct matches / #correspondences    (6)

precision = #correct matches / #matches retrieved    (7)

In expressions (6) and (7), recall expresses the ability to find all the correct matches, whereas precision represents the capability to obtain correct matches when the number of matches retrieved varies.
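As a small worked illustration, the measures (5)-(7) can be computed directly from match counts; the function and variable names, and the example numbers, are ours and not values from the experiments:

```python
def repeatability(num_correspondences, num_features_img1, num_features_img2):
    """Eq. (5): correspondences relative to the smaller feature set."""
    return num_correspondences / min(num_features_img1, num_features_img2)

def recall(num_correct_matches, num_correspondences):
    """Eq. (6): fraction of ground-truth correspondences that were matched."""
    return num_correct_matches / num_correspondences

def precision(num_correct_matches, num_matches_retrieved):
    """Eq. (7): fraction of retrieved matches that are correct."""
    return num_correct_matches / num_matches_retrieved

# Illustrative numbers only.
print(repeatability(650, 1000, 1200))          # 0.65
print(recall(420, 650), precision(420, 500))   # 0.646..., 0.84
```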

The boat scene is used for evaluating scale and viewpoint angle changes; the ranked list of images produces different sets of retrieved matches, and therefore different values of recall and precision. The number of correct matches retrieved is measured by comparing the number of corresponding points obtained with the ground truth against the number of correctly matched points. The ground truth is a homography that projects points to the reference frame.

In a precision versus recall curve, a high precision value with a low recall value means that the matches we have obtained are correct, but many others have been missed. On the other hand, a high recall value with a low precision value means that we have obtained most of the correct matches, but also many incorrect ones. For this reason, the ideal situation would be to find a descriptor that obtains high values of both parameters simultaneously, i.e. values located at the upper-right corner of the precision versus recall curve. Our comparison of these descriptors has focused on the assessment of their precision and recall measures under changes of viewpoint angle.

Figure 5. The recall versus precision curve with changes in viewpoint angle.

Fig. 5 shows the results obtained on the viewpoint- and scale-changing images of the boat scene. The figure represents the recall and precision curves for each descriptor and leads us to the conclusion that SURF is a better descriptor than SIFT.

C. Executing Time Evaluation

As visual odometry is a real-time application, we also compare the execution time of the different detectors and descriptors using the same hardware and software platform. We record the average execution time of each detector/descriptor on one frame.

TABLE I. AVERAGE RUN-TIME OF THE TWO FEATURE DETECTORS/DESCRIPTORS IN ONE FRAME.

            SIFT     SURF
Time (ms)   2.212    1.019

It can be seen from Table I that the computational cost of SURF is on average 2.17 times lower than that of SIFT. Consequently, in order to obtain a comparable level of accuracy, a robot utilizing the SIFT algorithm would be required to travel much more slowly than a robot utilizing SURF.

The performance evaluation of the two popular detectors/descriptors presented in this section aims to find the one best suited to visual odometry. The experimental results suggest that the SURF detector/descriptor may be the proper solution for monocular visual odometry when considering robustness, accuracy and execution time overall.
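Per-frame timings of the kind reported in Table I can be collected with a simple loop such as the following sketch (again using the OpenCV extractors assumed earlier; the file names and frame list are placeholders):

```python
import time
import cv2

def average_extraction_time(frames, detector):
    """Average detect-and-describe time per frame, in milliseconds."""
    total = 0.0
    for img in frames:
        start = time.perf_counter()
        detector.detectAndCompute(img, None)
        total += time.perf_counter() - start
    return 1000.0 * total / len(frames)

frames = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in ["f0.png", "f1.png"]]
print("SIFT:", average_extraction_time(frames, cv2.SIFT_create()))
# SURF requires an opencv-contrib build:
# print("SURF:", average_extraction_time(frames, cv2.xfeatures2d.SURF_create(400)))
```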


D. Testing Visual Odometry

The experimental results have been focused on the accuracy of the proposed algorithm. The wheel odometry is compared to the visual odometry using SIFT or SURF features; we consider the wheel odometry as the ground truth of the robot motion. Fig. 6 shows the trajectories estimated by the proposed algorithm (red and blue lines for SIFT and SURF features, respectively) for this trial. The wheel odometry (green line) is also drawn in the figure. We wrote a MATLAB program to compute the visual odometry of the first 1001 images of the "Flr3-2" dataset, producing 1000 camera rotations, or differential heading changes (dθ). With the rotation information, and assuming a robot linear velocity of 0.022 m/image (dx = 0.022), we derive and plot the robot trajectory by integrating the visual odometry over time (a sketch of this integration is given below). On the same figure, we plot the robot trajectory from the wheel odometry in the data set as a reference.

As drawn in Fig. 6, the visual odometry obtains a reliable estimate of the robot displacement, close to the trajectory given by the wheel odometry, and improving on the internal odometry at the end of the experiment. It can be clearly seen that SURF is more efficient than SIFT and produces the best trajectory estimation: there are only small differences between the real trajectory of the robot and the estimated trajectory obtained using SURF-based visual odometry. The quality and total number of the detected features and their descriptors thus influence the trajectory estimation by visual odometry.

Figure 6. Trajectories estimated by visual odometry (SIFT, SURF) and wheel odometry (red, blue and green lines, respectively).
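The dead-reckoning integration described above can be written as the following short sketch; this is our Python transcription of the procedure rather than the original MATLAB program, with dthetas the per-image heading changes recovered by the visual odometry and dx the assumed 0.022 m travelled per image:

```python
import math

def integrate_trajectory(dthetas, dx=0.022):
    """Integrate per-image heading changes into a 2D robot trajectory."""
    x, y, theta = 0.0, 0.0, 0.0
    path = [(x, y)]
    for dtheta in dthetas:          # one heading change per image pair
        theta += dtheta             # accumulate heading
        x += dx * math.cos(theta)   # constant forward motion per image
        y += dx * math.sin(theta)
        path.append((x, y))
    return path
```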
IV. CONCLUSION

In this paper we have described a framework for the presentation and comparison of visual odometry with a monocular camera. We focused on evaluating two feature detectors/descriptors commonly used in many computer vision algorithms, the Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features (SURF), to find the most appropriate solution for monocular visual odometry.

The experimental results showed that the SURF detector/descriptor outperformed SIFT in all the situations analyzed in this paper: it increased the accuracy percentage, which means more reliability in image feature detection and description, while having a considerably lower computation time. It should therefore be clear that SURF is better suited for the task of visual odometry, but only when one considers the matching technique used in this paper. In fact, while we believe that SURF might be useful for 'coarse' motion estimation, there are cases where SURF does not return a sufficiently high number of correspondences to allow precise pose estimation. In these cases, SIFT, which in general returns a higher number of correspondences, might be a better choice.

In our future work, we will test more complex image scenarios with harder conditions such as motion blur. We also suggest taking advantage of the benefits provided by omnidirectional images, which capture the whole surrounding structure in a single image, and then evaluating which detector/descriptor may be the most suitable solution for visual odometry with an omnidirectional camera.

REFERENCES

[1] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2004, pp. 652-659.
[2] M. Maimone, Y. Cheng, and L. Matthies, "Two years of visual odometry on the Mars Exploration Rovers," Journal of Field Robotics, vol. 24, no. 3, pp. 169-186, 2007.
[3] J. Courbon, Y. Mezouar, L. Eck, and P. Martinet, "A generic framework for topological navigation of urban vehicle," in Proc. ICRA09 Workshop on Safe Navigation in Open and Dynamic Environments: Application to Autonomous Vehicles, Kobe, Japan, May 2009.
[4] A. I. Comport, E. Malis, and P. Rives, "Accurate quadri-focal tracking for robust 3D visual odometry," in Proc. IEEE Int. Conf. on Robotics and Automation, Rome, Italy, April 2007.
[5] D. Scaramuzza and R. Siegwart, "Appearance-guided monocular omnidirectional visual odometry for outdoor ground vehicles," IEEE Transactions on Robotics, vol. 24, no. 5, pp. 1015-1026, October 2008.
[6] A. Milella, G. Reina, and R. Siegwart, "Computer vision methods for improved mobile robot state estimation in challenging terrains," Journal of Multimedia, vol. 1, no. 7, pp. 49-61, November/December 2006.
[7] T. Lindeberg, "Feature detection with automatic scale selection," International Journal of Computer Vision, vol. 30, pp. 79-116, 1998.
[8] K. Mikolajczyk and C. Schmid, "An affine invariant interest point detector," in Proc. European Conference on Computer Vision, 2002, pp. 128-142.
[9] T. Tuytelaars and L. Van Gool, "Wide baseline stereo based on local, affinely invariant regions," in Proc. British Machine Vision Conference, 2000, pp. 412-425.
[10] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[11] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proc. Ninth European Conference on Computer Vision, Graz, Austria, May 2006.
[12] M. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.
[13] R. Raguram, J. M. Frahm, and M. Pollefeys, "A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus," Lecture Notes in Computer Science, vol. 5303, pp. 500-513, Springer, Berlin/Heidelberg, 2008.


[14] J. Philip, "A non-iterative algorithm for determining all essential matrices corresponding to five point pairs," Photogrammetric Record, vol. 15, no. 88, pp. 589-599, 1996.
[15] B. Triggs, "Routines for relative pose of two calibrated cameras from 5 points," Technical report, INRIA, July 2000.
[16] D. Nister, "An efficient solution to the five point relative pose problem," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 756-770, June 2004.
[17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge, UK: Cambridge University Press, 2003.
[18] A. J. Lacey, N. Pinitkarn, and N. A. Thacker, "An evaluation of the performance of RANSAC algorithms for stereo camera calibration," in Proc. British Machine Vision Conference, 2000.
[19] T. Botterill, S. Mills, and R. Green, "Fast RANSAC hypothesis generation for essential matrix estimation," in Proc. Digital Image Computing: Techniques and Applications, 2011.
[20] Affine Covariant Region Detectors. [Online]. Available: http://www.robots.ox.ac.uk/~vgg/research/affine/detectors.html

Houssem Eddine Benseddik was born in Tlemcen, Algeria, in 1986. After completing his engineering degree in Electronic Technology (2008) and Magister degree in Advanced Techniques of Signal Processing (2010) at the Polytechnic Military School, he is preparing a PhD thesis in Robotics at the Department of Electrical Engineering, University of Tlemcen. He joined the Vision Team of the Robotics Division of the CDTA (Centre de Développement des Technologies Avancées) in Algiers in 2012 as a researcher in computer vision and image analysis.

A. Oualid Djekoune was born in Algiers, Algeria, in October 1969. He received the electronic engineering degree from the electronic institute of USTHB, Algiers, in 1993, and the Magister degree in real-time signals and systems from USTHB in 1997. Today, he works as a researcher in the Robotics Laboratory of the Development Center of Advanced Technologies.

Mahmoud Belhocine gained his Doctorat in robotics in 2006 from USTHB, Algiers, Algeria. He is now head of the Vision Team of the Robotics Division of the CDTA (Centre de Développement des Technologies Avancées) in Algiers. His fields of interest are computer vision, pattern recognition, augmented reality and visual control.
