Visual Odometry Part I: the First 30 Years and Fundamentals

© DIGITAL VISION Visual Odometry Part I: The First 30 Years and Fundamentals By Davide Scaramuzza and Friedrich Fraundorfer isual odometry (VO) is the process of estimating The advantage of VO with respect to wheel odometry is the egomotion of an agent (e.g., vehicle, human, that VO is not affected by wheel slip in uneven terrain or and robot) using only the input of a single or other adverse conditions. It has been demonstrated that multiple cameras attached to it. Application domains compared to wheel odometry, VO provides more accurate Vinclude robotics, wearable computing, augmented reality, trajectory estimates, with relative position error ranging and automotive. The term VO was coined in 2004 by Nis- from 0.1 to 2%. This capability makes VO an interesting ter in his landmark paper [1]. The term was chosen for its supplement to wheel odometry and, additionally, to other similarity to wheel odometry, which incrementally esti- navigation systems such as global positioning system mates the motion of a vehicle by integrating the number (GPS), inertial measurement units (IMUs), and laser of turns of its wheels over time. Likewise, VO operates by odometry (similar to VO, laser odometry estimates the incrementally estimating the pose of the vehicle through egomotion of a vehicle by scan-matching of consecutive examination of the changes that motion induces on the laser scans). In GPS-denied environments, such as under- images of its onboard cameras. For VO to work effec- water and aerial, VO has utmost importance. tively, there should be sufficient illumination in the envi- This two-part tutorial and survey provides a broad ronment and a static scene with enough texture to allow introduction to VO and the research that has been under- apparent motion to be extracted. Furthermore, consecu- taken from 1980 to 2011. Although the first two decades tive frames should be captured by ensuring that they have witnessed many offline implementations, only in the third sufficient scene overlap. decade did real-time working systems flourish, which has led VO to be used on another planet by two Mars-exploration rovers for the first time. Part I (this tutorial) presents a Digital Object Identifier 10.1109/MRA.2011.943233 historical review of the first 30 years of research in this field Date of publication: 8 December 2011 and its fundamentals. After a brief discussion on camera 80 • IEEE ROBOTICS & AUTOMATION MAGAZINE • DECEMBER 2011 1070-9932/11/$26.00ª2011 IEEE • modeling and calibration, it describes the main motion- digitizing and analyzing images at every location. At each estimation pipelines for both monocular and binocular stop, the camera slid horizontally taking nine pictures at scheme, outlining pros and cons of each implementation. equidistant intervals. Corners were detected in an image Part II will deal with feature matching, robustness, and using his operator and matched along the epipolar lines of applications. It will review the main point-feature detectors the other eight frames using normalized cross correlation. used in VO and the different outlier-rejection schemes. Par- Potential matches at the next robot ticular emphasis will be given to the random sample consen- locations were found again by correla- sus (RANSAC), and the distinct tricks devised to speed it up tion using a coarse-to-fine strategy to • The advantage of VO will be discussed. Other topics covered will be error model- account for large-scale changes. Out- ing, location recognition (or loop-closure detection), and liers were subsequently removed by with respect to wheel bundle adjustment. checking for depth inconsistencies in odometry is that VO is This tutorial provides both the experienced and non- the eight stereo pairs. Finally, motion expert user with guidelines and references to algorithms was computed as the rigid body not affected by wheel to build a complete VO system. Since an ideal and unique transformation to align the triangu- slip in uneven terrain VO solution for every possible working environment lated 3-D points seen at two consecu- does not exist, the optimal solution should be chosen tive robot positions. The system of or other adverse carefully according to the specific navigation environ- equation was solved via a weighted conditions. ment and the given computational resources. least square, where the weights were inversely proportional to the dis- • History of Visual Odometry tance from the 3-D point. The problem of recovering relative camera poses and Although Moravec used a single sliding camera, his three-dimensional (3-D) structure from a set of camera work belongs to the class of stereo VO algorithms. This images (calibrated or noncalibrated) is known in the terminology accounts for the fact that the relative 3-D computer vision community as structure from motion position of the features is directly measured by triangula- (SFM). Its origins can be dated back to works such as [2] tion at every robot location and used to derive the relative and [3]. VO is a particular case of SFM. SFM is more gen- motion. Trinocular methods belong to the same class of eral and tackles the problem of 3-D reconstruction of algorithms. The alternative to stereo vision is to use a both the structure and camera poses from sequentially single camera. In this case, only bearing information is ordered or unordered image sets. The final structure and available. The disadvantage is that motion can only be camera poses are typically refined with an offline optimi- recovered up to a scale factor. The absolute scale can then zation (i.e., bundle adjustment), whose computation time be determined from direct measurements (e.g., measuring grows with the number of images [4]. Conversely, VO the size of an element in the scene), motion constraints, or focuses on estimating the 3-D motion of the camera from the integration with other sensors, such as IMU, air- sequentially—as a new frame arrives—and in real time. pressure, and range sensors. The interest in monocular Bundle adjustment can be used to refine the local estimate methods is due to the observation that stereo VO can of the trajectory. degenerate to the monocular case when the distance to the The problem of estimating a vehicle’s egomotion from scene is much larger than the stereo baseline (i.e., the dis- visual input alone started in the early 1980s and was tance between the two cameras). In this case, stereo vision described by Moravec [5]. It is interesting to observe that becomes ineffective and monocular methods must be used. most of the early research in VO [5]–[9] was done for Over the years, monocular and stereo VOs have almost planetary rovers and was motivated by the NASA Mars progressed as two independent lines of research. In the exploration program in the endeavor to provide all-terrain remainder of this section, we have surveyed the related rovers with the capability to measure their 6-degree-of- work in these fields. freedom (DoF) motion in the presence of wheel slippage in uneven and rough terrains. Stereo VO The work of Moravec stands out not only for present- Most of the research done in VO has been produced using ing the first motion-estimation pipeline—whose main stereo cameras. Building upon Moravec’s work, Matthies functioning blocks are still used today—but also for and Shafer [6], [7] used a binocular system and Moravec’s describing one of the earliest corner detectors (after the procedure for detecting and tracking corners. Instead of first one proposed in 1974 by Hannah [10]) which is using a scalar representation of the uncertainty as Moravec known today as the Moravec corner detector [11], a prede- did, they took advantage of the error covariance matrix of cessor of the one proposed by Forstner [12] and Harris the triangulated features and incorporated it into the and Stephens [3], [82]. motion estimation step. Compared to Moravec, they dem- Moravec tested his work on a planetary rover equipped onstrated superior results in trajectory recovery for a with what he termed a slider stereo: a single camera sliding planetary rover, with 2% relative error on a 5.5-m path. on a rail. The robot moved in a stop-and-go fashion, Olson et al. [9], [13] later extended that work by DECEMBER 2011 • IEEE ROBOTICS & AUTOMATION MAGAZINE • 81 • introducing an absolute orientation sensor (e.g., compass but detected features (Harris corners) independently in all or omnidirectional camera) and using the Forstner corner frames and only allowed matches between features. This detector, which is significantly faster to compute than has the benefit of avoiding feature drift during cross-corre- Moravec’s operator. They showed that lation-based tracking. Second, they did not compute the • the use of camera egomotion estimates relative motion as a 3-D-to-3-D point registration problem Keyframe selection is alone results in accumulation errors but as a 3-D-to-two-dimensional (2-D) camera-pose estima- with superlinear growth in the distance tion problem (these methods are described in the “Motion averyimportantstep traveled, leading to increased orienta- Estimation” section). Finally, they incorporated RANSAC in VO and should tion errors. Conversely, when an abso- outlier rejection into the motion estimation step. lute orientation sensor is incorporated, A different motion estimation scheme was introduced always be done before the error growth can be reduced to a by Comport et al. [21]. Instead of using 3-D-to-3-D point updating the motion. linear function of the distance traveled. registration or 3-D-to-2-D camera-pose estimation tech- This led them to a relative position niques, they relied on the quadrifocal tensor, which • error of 1:2% on a 20-m path.

Load more