Visual-Lidar Odometry and Mapping: Low-Drift, Robust, and Fast

2015 IEEE International Conference on Robotics and Automation (ICRA) Washington State Convention Center Seattle, Washington, May 26-30, 2015 Visual-lidar Odometry and Mapping: Low-drift, Robust, and Fast Ji Zhang and Sanjiv Singh Abstract— Here, we present a general framework for combining visual odometry and lidar odometry in a fundamental and first principle method. The method shows improvements in performance over the state of the art, particularly in robustness to aggressive motion and temporary lack of visual features. The proposed on-line method starts with visual odometry to estimate the ego-motion and to register point clouds from a scanning Fig. 1. The method aims at motion estimation and mapping using a lidar at a high frequency but low fidelity. Then, scan matching monocular camera combined with a 3D lidar. A visual odometry method based lidar odometry refines the motion estimation and point estimates motion at a high frequency but low fidelity to register point clouds. cloud registration simultaneously. We show results with datasets Then, a lidar odometry method matches the point clouds at a low frequency collected in our own experiments as well as using the KITTI to refine motion estimates and incrementally build maps. The lidar odometry odometry benchmark. Our proposed method is ranked #1 on also removes distortion in the point clouds caused by drift of the visual the benchmark in terms of average translation and rotation odometry. Combination of the two sensors allows the method to accurately errors, with a 0.75% of relative position drift. In addition map even with rapid motion and in undesirable lighting conditions. to comparison of the motion estimation accuracy, we evaluate robustness of the method when the sensor suite moves at a high at a high frequency as the image frame rate (60Hz) to speed and is subject to significant ambient lighting changes. estimate motion. The second uses lidar odometry at a low frequency (1 Hz) to refine motion estimates and remove I. INTRODUCTION distortion in the point clouds caused by drift of the visual Recent separate results in visual odometry and lidar odom- odometry. The distortion-free point clouds are matched and etry are promising in that they can provide solutions to 6- registered to incrementally build maps. The result is that DOF state estimation, mapping, and even obstacle detection. the visual odometry handles rapid motion, and the lidar However, drawbacks are present using each sensor alone. odometry warrants low-drift and robustness in undesirable Visual odometry methods require moderate lighting condi- lighting conditions. Our finding is that the maps are often tions and fail if distinct visual features are insufficiently accurate without the need for post-processing. Although available. On the other hand, motion estimation via moving loop closure can further improve the maps, we intentionally lidars involves motion distortion in point clouds as range choose not to do so since the emphasis of this work is to measurements are received at different times during contin- push the limit of accurate odometry estimation. uous lidar motion. Hence, the motion often has to be solved The basic algorithm of V-LOAM is general enough that it with a large number of variables. Scan matching also fails in can be adapted to use range sensors of different kinds, e.g. degenerate scenes such as those dominated by planar areas. a time-of-fly camera. The method can also be configured to Here, we propose a fundamental and first principle method provide localization only, if a prior map is available. for ego-motion estimation combining a monocular camera In addition to evaluation on the KITTI odometry bench- and a 3D lidar. We would like to accurately estimate the mark [1], we further experiment with a wide-angle camera 6-DOF motion as well as a spatial, metric representation of and a fisheye camera. Our conclusion is that the fisheye the environment, in real-time and onboard a robot navigating camera brings in more robustness but less accuracy because in an unknown environment. While cameras and lidars have of its larger field of view and higher image distortion. complementary strengths and weaknesses, it is not straight- However, after the scan matching refinement, the final motion forward to combine them in a traditional filter. Our method estimation reaches the same level of accuracy. Our experi- tightly couples the two modes such that it can handle both ment results can be seen in a publicly available video.1. aggressive motion including translation and rotation, and II. RELATED WORK lack of optical texture as in complete whiteout or blackout imagery. In non-pathological cases, high accuracy in motion Vision and lidar based methods are common for state estimation and environment reconstruction is possible. estimation [2]. With stereo cameras [3], [4], the baseline Our proposed method, namely V-LOAM, explores advan- provides a reference to help determine scale of the motion. tages of each sensor and compensates for drawbacks from However, if a monocular camera is used [5]–[7], scale of the the other, hence shows further improvements in performance motion is generally unsolvable without aiding from other over the state of the art. The method has two sequentially sensors or assumptions about motion. The introduction of staggered processes. The first uses visual odometry running RGB-D cameras provides an efficient way to associate visual images with depth. Motion estimation with RGB-D cameras J. Zhang and S. Singh are with the Robotics Institute at Carnegie Mellon University. Emails: [email protected] and [email protected]. 1www.youtube.com/watch?v=-6cwhPMAap8 978-1-4799-6922-7/15/$31.00 ©2015 IEEE 2174 [8], [9] can be conducted easily with scale. A number of III. COORDINATE SYSTEMS AND TASK RGB-D visual odometry methods are also proposed showing The problem addressed in this paper is to estimate the promising results [10]–[12]. However, these methods only motion of a camera and lidar system and build a map of utilize imaged areas where depth is available, possibly wast- the traversed environment with the estimated motion. We ing large areas in visual images without depth coverage. The assume that the camera is modeled by a general central visual odometry method used in our system is similar to [8]– camera model [21]. With such a camera model, our sys- [12] in the sense that all use visual images with additionally tem is able to use both regular and fisheye cameras (see provided depth. However, our method is designed to utilize experiment section). We assume that the camera intrinsic sparse depth information from a lidar. It involves features parameters are known. The extrinsic parameters between the both with and without depth in solving for motion. camera and lidar are also calibrated. This allows us to use a For 3D mapping, a typical sensor is a (2-axis) 3D lidar single coordinate system for both sensors, namely the sensor [13]. However, usage of these lidars is difficult as motion coordinate system. For simplicity of calculation, we choose distortion is present in point clouds as the lidar continu- the sensor coordinate system to coincide with the camera ally ranges and moves. One way to remove the distortion coordinate system – all laser points are projected into the is incorporating other sensors to recover the motion. For camera coordinate system upon receiving. As a convention example, Scherer et al.’s navigation system [14] uses stereo of this paper, we use left uppercase superscription to indicate visual odometry integrated with an IMU to estimate the coordinate systems. In the following, let us define motion of a micro-aerial vehicle. Lidar clouds are registered • Sensor coordinate system fSg is originated at the by the estimated motion. Droeschel et al.’s method [15] camera optical center. The x-axis points to the left, employs multi-camera visual odometry followed by a scan the y-axis points upward, and the z-axis points forward matching method based on a multi-resolution point cloud coinciding with the camera principal axis. representation. In comparison to [14], [15], our method • World coordinate system fW g is the coordinate system differs in that it tightly couples a camera and a lidar such that coinciding with fSg at the starting position. only one camera is needed for motion recovery. Our method With assumptions and coordinate systems defined, our also takes into account point cloud distortion caused by drift odometry and mapping problem is stated as of the visual odometry, i.e. we model the drift as linear motion within a short time (1s) and correct the distortion Problem: Given visual images and lidar clouds perceived with a linear motion model during scan matching. in fSg, determine poses of fSg with respect to fW g and build a map of the traversed environment in fW g. It has also shown that state estimation can be made with 3D lidars only. For example, Tong et al. match visual features IV. SYSTEM OVERVIEW in intensity images created by stacking laser scans from a 2- Fig. 2 shows a diagram of the software system. The overall axis lidar to solve for the motion [16]. The motion is modeled system is divided into two sections. The visual odometry with constant velocity and Gaussian processes. However, section estimates frame to frame motion of the sensor at since this method extracts visual features from laser images, the image frame rate, using visual images with assistance dense point clouds are required. Another method is from from lidar clouds. In this section, the feature tracking block Bosse and Zlot [17], [18]. The method matches geometric extracts and matches visual features between consecutive structures of local point clusters. They use a hand-held images.

Visual-Lidar Odometry and Mapping: Low-Drift, Robust, and Fast

Semantically Informed Visual Odometry and Mapping

Visual Odometry by Multi-Frame Feature Integration

Spherical-Model-Based SLAM on Full-View Images for Indoor Environments

Omnidirectional Visual-Inertial Odometry Using Multi-State Constraint Kalman Filter

Evolution of Visual Odometry Techniques

Semi-Dense Visual Odometry for AR on a Smartphone

Optical Flow and Deep Learning Based Approach to Visual Odometry

Unsupervised Deep Learning-Based RGB-D Visual Odometry

Improved Ground-Based Monocular Visual Odometry Estimation Using Inertially-Aided Convolutional Neural Networks

Real-Time Stereo Visual Odometry for Autonomous Ground Vehicles

Monocular Visual Odometry

Deep Direct Visual Odometry Chaoqiang Zhao, Yang Tang, Senior Member, IEEE, Qiyu Sun, and Athanasios V