Direct Visual Odometry in Low Light using Binary Descriptors

Hatem Alismail1, Michael Kaess1, Brett Browning2, and Simon Lucey1

Abstract— Feature descriptors are powerful tools for photometrically and geometrically invariant image matching. To date, however, their use has been tied to sparse interest point detection, which is susceptible to noise under adverse imaging conditions. In this work, we propose to use binary feature descriptors in a direct tracking framework without relying on sparse interest points. This novel combination of feature descriptors and direct tracking is shown to achieve robust and efficient visual odometry with applications to poorly lit subterranean environments.

I. INTRODUCTION

Visual Odometry (VO) is the problem of estimating the relative pose between two cameras sharing a common field-of-view. Due to its importance, VO has received much attention in the literature [1], as evident by the number of high-quality systems available to the community [2], [3], [4]. Current systems using conventional cameras, however, are not equipped to tackle challenging illumination conditions, such as poorly lit environments. In this work, we devise a novel combination of direct tracking and binary feature descriptors to allow robust and efficient vision-only pose estimation in challenging environments.

Current state-of-the-art algorithms rely on a feature-based pipeline [5], where keypoint correspondences are used to obtain an estimate of the camera motion (e.g. [6], [3], [7], [8], [9], [10], [11], [12]). Unfortunately, the performance of feature extraction and matching using conventional hardware struggles under challenging imaging conditions, such as motion blur, low light, and repetitive texture [13], [14], thereby reducing the robustness of the system. Examples of such environments include operating at night [13], mapping subterranean mines as shown in Fig. 1, and even sudden illumination changes due to automatic camera controls as shown in Fig. 2. If the feature-based pipeline fails, a vision-only system has little hope of recovery.

An alternative to the feature-based pipeline is to use pixel intensities directly, in what is commonly referred to as direct methods [15], [16], which have recently been popularized for RGB-D VO [17], [18], [19], [20], [21] and monocular SLAM from high frame-rate cameras [2], [4]. When the apparent image motion is small, direct methods deliver robust and precise estimates, as many measurements are used to estimate a few degrees of freedom [22], [23], [4], [24]. Nonetheless, as pointed out by other researchers [3], the main limitation of direct methods is their reliance on a consistent appearance between the matched pixels, otherwise known as the brightness constancy assumption [25], [26], which requires constant irradiance despite varying illumination and is seldom satisfied in robotic applications.

Due to the complexity of real-world illumination conditions, an efficient solution to the problem of appearance change for direct VO is challenging. The most common scheme for mitigating the effects of illumination change is to assume a parametric illumination model estimated alongside the camera pose, such as the gain+bias model [17], [27]. This approach is limited by definition and does not address the range of non-global and nonlinear intensity deformations commonly encountered in robotic applications. More sophisticated techniques have been proposed [24], [28], [29], but they either impose stringent scene constraints (such as planarity) or rely heavily on dense depth estimates, which are not always available.

In this work, we relax the brightness constancy assumption required by most direct VO algorithms, thus allowing them to operate in environments where the appearance between images varies considerably. We achieve this by combining the illumination invariance afforded by binary feature descriptors with a direct alignment framework. This is a challenging problem for two reasons. Firstly, binary illumination-invariant feature descriptors have not been shown to be well suited to the iterative gradient-based optimization at the heart of direct methods. Secondly, binary descriptors must be matched under a binary norm such as the Hamming distance, which is unsuitable for gradient-based optimization due to its non-differentiability.

To address these challenges, we propose a novel adaptation of binary descriptors that is experimentally shown to be amenable to gradient-based optimization. More importantly, the proposed adaptation preserves the Hamming distance under conventional least-squares, as we will show in Section III. This novel combination of binary feature descriptors in a direct alignment framework is shown to work well in underground mines characterized by non-uniform and poor lighting. The approach is also efficient, achieving real-time performance. An open-source implementation of the algorithm is freely available online at https://github.com/halismai/bpvo.

1 Alismail, Kaess and Lucey are with the Robotics Institute, Carnegie Mellon University, Pittsburgh PA, USA {halismai,kaess,slucey}@cs.cmu.edu. 2 Browning is with Uber Advanced Technologies Center, [email protected].

Fig. 1: Top row shows an example of commonly encountered low signal-to-noise ratio imagery from an underground mine captured with a conventional camera. The bottom row shows a histogram-equalized version emphasizing the poor quality and the significant motion blur.

Fig. 2: An example of the nonlinear intensity deformation caused by automatic camera settings, a common problem in outdoor applications of robot vision.

II. BACKGROUND

Direct Visual Odometry: Let the intensity and depth of a pixel coordinate p = (x, y)ᵀ in the reference frame be given by I(p) ∈ R and D(p) ∈ R⁺, respectively. Upon a rigid-body motion of the camera, a new image I′(p′) is obtained.

The goal of conventional direct VO is to estimate an increment of the camera motion parameters ∆θ ∈ R⁶ such that the photometric error is minimized:

∆θ* = argmin_{∆θ} Σ_{p∈Ω} ‖I′(w(p; θ + ∆θ)) − I(p)‖²,   (1)

where Ω is a subset of pixel coordinates of interest in the reference frame, w(·) is a warping function that depends on the parameter vector we seek to estimate, and θ is an initial estimate. After every iteration, the current estimate of the parameters is updated additively. This is the well-known Lucas and Kanade algorithm [15].

By conceptually interchanging the roles of the template and input images, Baker and Matthews devise a more efficient alignment technique known as the Inverse Compositional (IC) algorithm [32]. Under the IC formulation we seek an update ∆θ that satisfies

∆θ* = argmin_{∆θ} Σ_{p∈Ω} ‖I(w(p; ∆θ)) − I′(w(p; θ))‖².   (2)

The optimization problem in Eq. (2) is nonlinear irrespective of the form of the warping function, as in general there is no linear relationship between pixel coordinates and their intensities. By equating to zero the derivative of the first-order Taylor expansion of Eq. (2), we arrive at the closed-form solution given by the normal equations

∆θ = (JᵀJ)⁻¹ Jᵀ e,   (3)

where J = [g(p₁)ᵀ, …, g(pₘ)ᵀ] ∈ R^{m×p} is the matrix of first-order partial derivatives of the objective function, or the Jacobian, m is the number of pixels, and p = |θ| is the number of parameters. Each g ∈ R^{1×p} is given by the chain rule as

g(p)ᵀ = ∇I(p) ∂w/∂θ,   (4)

where ∇I = (I_x, I_y) ∈ R^{1×2} is the image gradient along the x- and y-directions, respectively. Finally,

e(p) = I′(w(p; θ)) − I(p)   (5)

is the vector of residuals, or the error image. Parameters of the motion model are updated via the IC rule

w(p, θ) ← w(p, θ) ∘ w(p, ∆θ)⁻¹.   (6)

We refer the reader to the comprehensive work by Baker and Matthews [32] for a detailed treatment.
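To make the normal-equations step concrete, the following minimal NumPy sketch (ours, for illustration only, not taken from the released bpvo code) solves Eq. (3) for the update and applies the inverse-compositional composition of Eq. (6) for a warp represented as a 4 × 4 rigid-body transform; the function names and matrix representation are assumptions for the example.

```python
import numpy as np

def gauss_newton_update(J, e):
    """Solve the normal equations of Eq. (3): dtheta = (J^T J)^{-1} J^T e.

    J : (m, p) matrix stacking the per-pixel rows g(p_i) of Eq. (4)
    e : (m,)   residual vector of Eq. (5)
    """
    H = J.T @ J                    # p x p Gauss-Newton approximation of the Hessian
    b = J.T @ e
    return np.linalg.solve(H, b)   # more stable than forming the explicit inverse

def ic_compose(T_current, T_increment):
    """Inverse-compositional rule of Eq. (6) for a 4x4 rigid-body warp:
    compose the current warp with the inverse of the increment warp."""
    return T_current @ np.linalg.inv(T_increment)
```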

Image Warping: Given a rigid-body motion T(θ) ∈ SE(3) and a depth value D(p) in the coordinate frame of the template image, warping to the coordinates of the input image is performed according to

p′ = π( T(θ) π⁻¹(p; D(p)) ),   (7)

where π(·) : R³ → R² denotes the projection onto a camera with known intrinsic calibration, and π⁻¹(·, ·) : R² × R → R³ denotes the inverse of this projection given the intrinsic camera parameters and the pixel's depth. Finally, the intensity values corresponding to the warped input image I(p′) are obtained using bilinear interpolation.
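As a concrete reading of Eq. (7), the sketch below back-projects a reference pixel using its depth, applies the rigid-body motion, re-projects with a pinhole model, and samples the input image bilinearly. It assumes a 3 × 3 intrinsic matrix K and ignores image-boundary handling; it is an illustration, not the paper's implementation.

```python
import numpy as np

def warp_pixel(p, depth, T, K):
    """Warp a reference pixel p = (x, y) with depth D(p) into the input image (Eq. (7))."""
    X = depth * (np.linalg.inv(K) @ np.array([p[0], p[1], 1.0]))  # pi^{-1}(p; D(p))
    Xw = T[:3, :3] @ X + T[:3, 3]                                 # rigid-body motion T(theta)
    x = K @ Xw                                                    # projection pi(.)
    return x[:2] / x[2]

def bilinear_sample(I, p):
    """Bilinear interpolation of image I at the non-integer location p = (x, y)."""
    x0, y0 = int(np.floor(p[0])), int(np.floor(p[1]))
    a, b = p[0] - x0, p[1] - y0
    return ((1 - a) * (1 - b) * I[y0, x0] + a * (1 - b) * I[y0, x0 + 1]
            + (1 - a) * b * I[y0 + 1, x0] + a * b * I[y0 + 1, x0 + 1])
```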
III. BINARY DESCRIPTOR CONSTANCY

A limitation of direct methods is their reliance on the brightness constancy assumption (Eq. (1)), which we address by using a descriptor constancy assumption instead. Namely, the parameter update is estimated to satisfy

∆θ* = argmin_{∆θ} Σ_{p∈Ω} ‖φ(I′(w(p; θ + ∆θ))) − φ(I(p))‖²,   (8)

where φ(·) is a robust feature descriptor. The idea of using descriptors in lieu of intensity has recently been explored in optical flow estimation [33], image-based tracking of a known 3D model [34], Active Appearance Models [35], and inter-object category alignment [36], where results consistently outperform the minimization of the photometric error. To date, however, the idea has not been explored in the context of VO with relatively sparse depth and using binary features.

Prior work [35], [36] relied on sophisticated descriptors such as HOG [37] and SIFT [38]. However, using these descriptors densely in an iterative alignment framework is computationally infeasible for real-time VO with a limited computational budget. Simpler descriptors, such as photometrically normalized image patches [22] or the gradient constraint [39], are efficient to compute, but do not possess sufficient invariance to radiometric changes in the wild. Furthermore, since reliable depth estimates from stereo are sparse, warping feature descriptors is challenging, as it is harder to reason about visibility and occlusions from sparse 3D points.

In this work, we propose a novel adaptation of binary descriptors that satisfies the requirements for efficient VO under challenging illumination. Namely, our descriptor has the following properties: (i) invariance to monotonic changes in intensity, which is important as many robotic applications rely on automatic camera gain and exposure control; (ii) computational efficiency, even on embedded devices, which is required for real-time VO; and (iii) suitability for least-squares minimization (e.g., Eq. (8)). The last point is important for two reasons. First, solutions to least-squares problems are among the most computationally efficient, with a plethora of ready-to-use software packages. Second, due to the small-residual nature of least-squares, only first-order derivatives are required to obtain a good approximation of the Hessian. Hence, a least-squares formulation increases efficiency and avoids the numerical errors associated with estimating second-order derivatives that arise when using alignment algorithms based on intrinsically robust objectives [40], [41], [42], [43], [44]. The proposed descriptor is called Bit-Planes and is detailed next.

Fig. 3: Local intensity comparisons in a 3 × 3 neighborhood. In Fig. 3a the center pixel is highlighted and compared to its neighbors as shown in Fig. 3b. The descriptor is obtained by combining the result of each comparison in Fig. 3c into a single scalar [30], [31].

The Bit-Planes Descriptor: The rationale behind binary descriptors is that relative changes of intensities are more robust to work with than the raw values. As with all binary descriptors, we perform local comparisons between the pixel and its immediate neighbors, as shown in Fig. 3. We found that a 3 × 3 neighborhood is sufficient when working with video data and is the most efficient to compute. This step is identical to the Census Transform [31], also known as LBP [30]. The choice of the comparison operator is arbitrary and we denote it with ./ ∈ {>, ≥, <, ≤}. Since the binary representation of the descriptor requires only eight comparisons, it is commonly stored compactly as a byte according to

φ_BYTE(x) = Σ_{i=1}^{8} 2^{i−1} ( I(x) ./ I(x + ∆x_i) ),

where {∆x_i}_{i=1}^{8} is the set of the eight displacements possible within a 3 × 3 neighborhood around the center pixel location x.

In order for the descriptor to maintain its morphological invariance to intensity changes, it must be matched under a binary norm, such as the Hamming distance, which counts the number of mismatched bits. The reason for this is easy to illustrate with an example. Consider two bit-strings differing at a single bit, which happens to be at the most significant position: a = {1, 0, 1, 0, 1, 1, 1, 0} and b = {0, 0, 1, 0, 1, 1, 1, 0}. The two bit-strings are clearly similar and their distance under the Hamming norm is one. However, if the decimal representation is used and matched under the squared Euclidean norm, their distance becomes 128² = 16384, which does not capture their closeness in descriptor space. However, it is not possible to use the Hamming distance in least-squares because of its non-differentiability. Approximations are possible using the centralized sum of absolute differences [45], but at the cost of reduced photometric invariance.
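The arithmetic in this example can be checked directly; the snippet below only verifies the numbers quoted above, with the bit-strings packed into bytes (most significant bit first).

```python
a = 0b10101110          # {1,0,1,0,1,1,1,0} packed with the MSB first (decimal 174)
b = 0b00101110          # {0,0,1,0,1,1,1,0} packed with the MSB first (decimal 46)

hamming_distance = bin(a ^ b).count("1")   # 1: the strings differ in a single bit
squared_euclidean = (a - b) ** 2           # 16384 = 128^2 on the decimal encodings
print(hamming_distance, squared_euclidean)
```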
In our proposed descriptor, we avoid approximating the Hamming distance and instead store each bit/coordinate of the descriptor as its own image. Namely, the proposed descriptor takes the form

φ_BP(x) = [ I(x) ./ I(x + ∆x₁), …, I(x) ./ I(x + ∆x₈) ]ᵀ ∈ R⁸.   (9)

Since each coordinate of the 8-vector descriptor is binary, we call the descriptor "Bit-Planes." Using this representation it is now possible to minimize an equivalent form of the Hamming distance using ordinary least-squares.

Bit-Planes implementation details: In order to reduce the sensitivity of the descriptor to noise, the image is smoothed with a Gaussian filter in a 3 × 3 neighborhood (σ = 0.5). The effect of this smoothing is investigated in Section V. Since the operations involved in extracting the descriptor are simple and data parallel, they can be implemented efficiently with SIMD (Single Instruction, Multiple Data) instructions.

Pre-computing descriptors for efficiency: Descriptor constancy as stated in Eq. (8) requires re-computing the descriptors after every iteration of image warping. In addition to the extra computational cost of repeated applications of the descriptor, it is difficult to warp individual pixel locations with sparse depth. An approximation to the descriptor constancy objective in Eq. (8) is to pre-compute the descriptors and minimize the following expression instead:

min_{∆θ} Σ_{p∈Ω} Σ_{i=1}^{8} ‖Φ′_i(w(p; θ + ∆θ)) − Φ_i(p)‖²,   (10)

where Φ_i indicates the i-th coordinate of the pre-computed descriptor. We found the loss of accuracy when using Eq. (10) instead of Eq. (8) to be insignificant in comparison to the computational savings.
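A minimal NumPy/SciPy sketch of the descriptor extraction is given below. It mirrors Eq. (9) and the pre-smoothing described above (Gaussian, σ = 0.5), but is written for clarity rather than speed (the paper notes that these operations are data parallel and map well to SIMD); the border wrap-around introduced by np.roll is ignored for brevity.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# the eight displacements Delta x_i within a 3x3 neighborhood, as (row, col)
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def bit_planes(image, sigma=0.5):
    """Compute the eight Bit-Planes channels of Eq. (9) for a grayscale image.

    Each channel stores one comparison between the (pre-smoothed) center pixel
    and a neighbor, as a float image so it can enter a least-squares objective.
    """
    I = gaussian_filter(image.astype(np.float32), sigma)     # noise reduction
    channels = []
    for dy, dx in OFFSETS:
        neighbor = np.roll(np.roll(I, -dy, axis=0), -dx, axis=1)
        channels.append((I > neighbor).astype(np.float32))   # './' chosen as '>'
    return np.stack(channels)                                # shape (8, H, W)
```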

IV. DIRECT VO USING BINARY DESCRIPTORS

We will use Eq. (10) as our objective function, which we minimize using the IC formulation [32], allowing us to pre-compute the Jacobian of the cost function. The Gauss-Newton system matrix is given by

Σ_{p∈Ω} Σ_{i=1}^{8} g_i(p; θ)ᵀ g_i(p; θ),   where   (11)

g_i(q; θ) = (∂Φ_i/∂p)|_{p=q} · (∂w/∂θ)|_{p=q, θ=0}.   (12)

Similar to other direct VO algorithms [46], [17], pose parameters are represented using the exponential map, i.e. θ = [ω, ν]ᵀ ∈ R⁶, such that

T(θ) := exp( [ [ω]_×  ν ; 0ᵀ  0 ] ) ∈ SE(3),   (13)

where [ω]_× indicates the 3 × 3 skew-symmetric matrix of ω.
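For illustration, the twist-to-SE(3) mapping of Eq. (13) can be evaluated with a generic matrix exponential; a closed-form Rodrigues-style expression would normally be used in practice. This is a sketch under that assumption, not the paper's implementation.

```python
import numpy as np
from scipy.linalg import expm

def skew(w):
    """3x3 skew-symmetric matrix [w]_x such that [w]_x v = w x v."""
    return np.array([[ 0.0, -w[2],  w[1]],
                     [ w[2],  0.0, -w[0]],
                     [-w[1],  w[0],  0.0]])

def se3_exp(theta):
    """Map a twist theta = [omega, nu] in R^6 to T(theta) in SE(3), Eq. (13)."""
    omega, nu = theta[:3], theta[3:]
    xi = np.zeros((4, 4))
    xi[:3, :3] = skew(omega)
    xi[:3, 3] = nu
    return expm(xi)          # matrix exponential of the 4x4 twist matrix
```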

To improve the computational running time of the algorithm, we subsample the pixel locations used in direct VO. A pixel is selected for the optimization if its absolute gradient magnitude is non-zero and is a strict local maximum in a 3 × 3 neighborhood. The intuition for this procedure is that pixels with a small gradient magnitude contribute little, if anything, to the objective function, as the term in Eq. (12) vanishes. We compute the pixel saliency map over all eight Bit-Planes coordinates as

G(p) = Σ_{i=1}^{8} ( |∇_x Φ_i(p)| + |∇_y Φ_i(p)| ).   (14)

Pixel selection is performed if the image resolution is at least 320 × 240. For lower resolution images (coarser pyramid levels) we use all pixels with non-zero saliency. The effect of pixel selection on the accuracy of pose estimation depends on the dataset, as shown in [47, ch. 4] and [48].
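The pixel selection rule above (non-zero saliency of Eq. (14) and a strict local maximum in a 3 × 3 neighborhood) can be sketched as follows; the loop-based local-maximum test is written for clarity, not speed, and assumes the Bit-Planes channels are stacked as an (8, H, W) array.

```python
import numpy as np

def select_pixels(bp):
    """Select pixels for the optimization from Bit-Planes channels bp of shape (8, H, W)."""
    # saliency map of Eq. (14): sum of absolute channel gradients
    G = np.zeros(bp.shape[1:], dtype=np.float32)
    for channel in bp:
        gy, gx = np.gradient(channel)
        G += np.abs(gx) + np.abs(gy)

    mask = np.zeros_like(G, dtype=bool)
    H, W = G.shape
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            patch = G[y - 1:y + 2, x - 1:x + 2].copy()
            patch[1, 1] = -np.inf                      # compare only against the neighbors
            if G[y, x] > 0 and G[y, x] > patch.max():  # non-zero and strict local maximum
                mask[y, x] = True
    return mask
```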

Minimizing the objective function is performed using an iteratively re-weighted Gauss-Newton algorithm with the Tukey bi-weight function [49]. The approach is implemented in a coarse-to-fine manner. The number of pyramid octaves is selected such that the smallest image dimension at the coarsest level is at least 40 pixels. Termination conditions are fixed to either a maximum number of iterations (100), or when the relative change in the estimated parameters, or the relative reduction of the objective, falls below 1 × 10⁻⁶.

Finally, we implement a simple keyframing strategy to reduce drift accumulation over time. A keyframe is created if the magnitude of motion exceeds a threshold (data dependent), or if the percentage of "good points" falls below 60%. A point is deemed good if its weight from the M-estimator is in the top 80th percentile. Points that project outside the image (i.e., no longer visible) are assigned zero weight.

V. EXPERIMENTS & RESULTS

Effect of smoothing: Fig. 4 shows the effect of smoothing the image prior to computing the descriptors. The experiment is performed on synthetic data with translational shifts. Larger kernels wash out image details required to estimate small motions, while no smoothing at all is noise-sensitive. Hence, we use a 3 × 3 kernel with σ = 0.5.

Fig. 4: Error as a function of pre-smoothing the image with a Gaussian kernel of standard deviation σ₀ as well as smoothing the Bit-Planes with σ_BP. The lowest error is associated with smaller kernels.

Effect of binary descriptor window size: Our proposed binary descriptor is implemented using a window size of 3 × 3, which yields eight channels per pixel. It is possible, however, to use larger window sizes at the expense of increased runtime. In Fig. 5, we evaluate the effect of the window size on the accuracy of the estimated trajectory as well as the number of iterations required for convergence using the Tsukuba dataset. Results indicate that larger window sizes reduce the estimation accuracy and increase the mean number of iterations required for convergence.

Fig. 5: Effect of the window (neighborhood) size used to compute the binary descriptors on the final trajectory estimation accuracy and the mean number of iterations required for convergence.

Comparison with central image gradients: Our proposed binary descriptor can be thought of as thresholded directional image gradients, as discussed by Hafner et al. [50], who study the robustness of the Census Transform using a continuous representation. In this section, we compare the performance of Bit-Planes to raw directional gradients. Each channel of the directional/central gradient per pixel in a 3 × 3 neighborhood is given by:

Φ_CG(x) = [ I(x) − I(x + ∆x₁), …, I(x) − I(x + ∆x₈) ]ᵀ ∈ R⁸.   (15)

In contrast to Bit-Planes (Eq. (9)), no thresholding step is performed and the output is a real-valued 8-vector.
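For completeness, the unthresholded counterpart of Eq. (15) differs from the Bit-Planes sketch given earlier only in skipping the binary comparison; the snippet below is, again, only an illustration.

```python
import numpy as np

# the same eight displacements used for Bit-Planes, as (row, col)
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def central_gradient_channels(image):
    """Eq. (15): each channel is the raw difference between the center pixel and a neighbor."""
    I = image.astype(np.float32)
    channels = []
    for dy, dx in OFFSETS:
        neighbor = np.roll(np.roll(I, -dy, axis=0), -dx, axis=1)
        channels.append(I - neighbor)        # no thresholding: real-valued output
    return np.stack(channels)                # shape (8, H, W)
```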

Fig. 6: Comparison between Bit-Planes (BP) and Central Gradients (CG): RMSE and number of iterations under the daylight, flashlight, fluorescent, and lamps illuminations.

Fig. 7: Number of iterations and runtime on the first 500 frames of the New Tsukuba dataset (raw intensity: mean 7.8 / median 6.0 iterations and mean 7.3 / median 5.0 ms per frame; Bit-Planes: mean 9.5 / median 7.0 iterations and mean 64.2 / median 53.0 ms per frame). On average, the algorithm runs at more than 100 Hz using intensity and 15 Hz using Bit-Planes.

TABLE I: Execution time for each major step in the algorithm, reported in milliseconds (ms) using image size 640 × 480; runtime on the KITTI benchmark with image size 1024 × 376 is shown in brackets. Construction of the pyramid is common to both raw intensity and Bit-Planes. Descriptor computation for raw intensity amounts to converting the image to floating point. Jacobian pre-computation is required only when creating a new keyframe. The most commonly performed operation is warping, which is not significantly more expensive than warping a single channel of raw intensity.

    Step                             Raw Intensity     Bit-Planes
    Pyramid construction (common)          0.31 [0.44]
    Descriptor computation           0.18 [0.28]       4.33 [5.55]
    Jacobian pre-computation         3.34 [5.00]       10.47 [13.93]
    Descriptor warping               0.35 [0.30]       1.65 [1.74]

We use the four different illuminations provided by the Tsukuba dataset to evaluate the effect of thresholding the central gradients. Referring to Fig. 6, the binary Bit-Planes version is more accurate than raw un-thresholded central gradients except for the "flashlight" dataset. In addition to improved accuracy, the main advantage is a faster run-time and convergence.

Runtime: There are two steps to the algorithm. The first step is pre-computing the Jacobian of the cost function as well as the Bit-Planes descriptor for the reference image. This is required only when a new keyframe is created; we call this step Jacobians. The second step is repeated at every iteration and consists of: (i) image warping using the current estimate of pose, (ii) computing the Bit-Planes error, (iii) computing the residuals and estimating their standard deviations, and (iv) building the weighted linear system and solving it. We call this step Linearization. The running time for each step is summarized in Table I as a function of image resolution and in comparison to direct VO using raw intensities. A typical number of iterations for a run using stereo computed with block matching is shown in Fig. 7. The bottleneck in the linearization step is computing the median absolute deviation of the residuals, which could be mitigated using histograms [17]. Results are shown in comparison to our implementation of direct VO using raw intensities for a better assessment of the additional computational cost required for direct VO with a descriptor constancy. Finally, we note that due to the compactness of the proposed descriptor, it is possible to obtain additional speed-ups using fixed-point arithmetic.
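One way to read steps (ii)-(iv) of the Linearization stage is sketched below: the residual scale is estimated with the median absolute deviation (MAD), converted into Tukey bi-weights [49], and used to build and solve the weighted normal equations. The constants (1.4826 and 4.685) are the usual MAD consistency factor and Tukey tuning parameter; they are our assumptions for the illustration rather than values quoted in the paper.

```python
import numpy as np

def linearization_step(J, r, tukey_c=4.685):
    """One robust Gauss-Newton iteration with a pre-computed Jacobian.

    J : (m, 6) stacked Jacobian rows of Eq. (12), r : (m,) Bit-Planes residuals.
    Returns the parameter update and the per-residual weights.
    """
    sigma = 1.4826 * np.median(np.abs(r - np.median(r)))     # MAD estimate of residual scale
    u = r / (tukey_c * max(sigma, 1e-8))
    w = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)  # Tukey bi-weight
    H = J.T @ (w[:, None] * J)                               # weighted normal equations
    g = J.T @ (w * r)
    return np.linalg.solve(H, g), w
```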

Experiments with synthetic data: We use the "New Tsukuba" dataset [51] to compare the performance of our algorithm against two representative frame-to-frame VO algorithms from the state of the art. The first is FOVIS [11], which we use as a representative of feature-based methods; FOVIS makes use of FAST corners [52] matched with Zero-mean Normalized Cross Correlation (ZNCC). The second is DVO [46], a representative of direct methods using the brightness constancy assumption and dense tracking. We use the most challenging illumination for our evaluation (shown in Fig. 9).

Our goal in this experiment is to assess the utility of our proposed descriptor in handling arbitrary changes in illumination. Hence, we initialize all algorithms with the ground-truth disparity map. In this manner, any pose estimation errors are caused by failure to extract and/or match features, or by failure to minimize the photometric error under brightness constancy. As shown in Fig. 8, the robustness of our approach exceeds the conventional state-of-the-art methods. Also, as expected, the feature-based method (FOVIS) slightly outperforms the direct method (DVO) due to the challenging illumination conditions.

Fig. 8: Evaluation on the Tsukuba sequence [51] using the illumination provided by "lamps" in comparison to other VO algorithms, shown in a bird's eye view. The highlighted area is shown in more detail on the right. Example images are shown in Fig. 9.

Fig. 9: Example images from the "lamps" sequence.

Evaluation on the KITTI benchmark: The KITTI benchmark [53] presents a challenging dataset for our algorithm, and for all direct methods in general, as the motion between consecutive frames is large. Our algorithm is initialized with disparity estimates obtained using block-matching stereo as implemented in the OpenCV library. However, only the left image is used for tracking. Performance of the algorithm is compared against direct tracking using raw intensities and the feature-based algorithm VISO2 [12], which uses both stereo images for VO. Referring to Fig. 10, the algorithm's performance is slightly less accurate than raw intensities. The main limitation, however, is the narrower basin of convergence due to the relatively large camera motion.

Fig. 10: Performance on the training data of the KITTI benchmark in comparison to VISO2 [12]. The large baseline between consecutive frames presents a challenge to direct methods, as can be seen by observing the error as a function of speed.

We note that the reduced performance with larger camera motions is a limitation of direct methods, as they rely on linearization, which assumes small inter-frame displacements. An evaluation of the effect of camera motion on the estimation accuracy is provided in the next section.

Basin of convergence: Direct methods are known to require small inter-frame displacements for the linearization assumption to be valid. In this section, we evaluate the basin of convergence as a function of camera displacement using the variable frame-rate dataset of Handa et al. [54]. The dataset features the same scene imaged under different camera framerates with an accurate noise model. Referring to Fig. 11, rotational and translational errors are generally lower at higher framerates. The sudden increase in rotational errors at framerates in excess of 120 Hz is due to the well-known ambiguity of separating the effect of rotation from that of translation on the apparent flow [10], [55], which can be addressed by using a wider field-of-view lens [56].

Fig. 11: Relative Pose Error (RPE) (defined in [57]) as a function of the camera frame-rate.

Real data from underground mines: We demonstrate the robustness of our algorithm using data collected in underground mines. Our robot is equipped with a 7 cm baseline stereo camera that outputs 1024 × 544 grayscale images and computes an estimate of disparity using a hardware implementation of SGM [58]. Due to the lack of lighting in underground mines, the robot carries its own source of LED lights. However, the LEDs are insufficient to uniformly illuminate the scene due to power constraints. Similar to the previous experiments, disparities are used to initialize the scale, and tracking is performed using only a single grayscale image. Examples of the 3D maps generated by our algorithm are shown in Figs. 12 to 14.

In Fig. 13, we show another result from a different underground environment where the stereo 3D points are colorized by height. The large empty areas in the generated map are due to the lack of disparity estimates in large portions of the input images. Due to the lack of ground truth, we are unable to assess the accuracy of the system quantitatively. However, visual inspection of the created 3D maps indicates minimal drift, which is expected when operating in an open-loop fashion. More importantly, the algorithm maintains robustness with limited instances of failure cases, as we show next. The performance of the algorithm is also illustrated in the supplementary video in comparison to other VO methods.

Failure cases: Most failure cases are due to a complete image washout. An example is shown in Fig. 15. These cases occur when the robot is navigating tight turns, where most of the LED power is concentrated very close to the camera. Addressing such cases using vision only is a good avenue for future work.

Reconstruction density: The density of the reconstructed point cloud on the tunnel data is shown in Figs. 13 and 14, and on a section of the KITTI data in Fig. 16. The reconstruction is obtained by transforming the selected 3D points into a consistent coordinate system using the VO estimates. Denser output is possible by eliminating the pixel selection step and using all pixels with valid disparity estimates. However, denser reconstruction comes at the expense of increased runtime.

Additional evaluation of performance aspects pertaining to 2D parametric image registration problems is available in our prior work [59].

Fig. 16: Reconstruction density on a section of the KITTI dataset.

Fig. 13: VO map colorized by height showing the robot transitioning between different levels in the second mine dataset.

Fig. 14: VO map constructed while the robot is navigating a sharp corner. Points are colorized with the intensity values.

Fig. 15: Illustration of failure cases caused by over-saturation.

VI. CONCLUSIONS & FUTURE WORK

In this work, we presented a VO system capable of operating in challenging environments characterized by poor and non-uniform illumination. The approach is based on direct alignment of binary feature descriptors, for which we presented a novel adaptation of the Census Transform, called Bit-Planes, suitable for least-squares optimization. The enhanced robustness resulting from the binary descriptor constancy proposed in this work, while significant in comparison to the traditional brightness constancy, comes at the cost of requiring smaller inter-frame displacements than other direct methods. We plan on addressing this limitation in a future extension.

All in all, by using a binary descriptor constancy, we allow vision-only pose estimation to operate robustly in environments that lack distinctive keypoints and lack the photometric consistency required by direct methods. The approach is simple to implement, and can be readily integrated into existing direct VSLAM algorithms with a small additional computational overhead.

ACKNOWLEDGEMENTS

The authors would like to thank the anonymous Robotics and Automation Letters reviewers for their valuable feedback.

REFERENCES

[1] D. Scaramuzza and F. Fraundorfer, "Visual Odometry [Tutorial]," Robotics & Automation Magazine, IEEE, vol. 18, Dec 2011.
[2] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-Scale Direct Monocular SLAM," in ECCV, 2014.
[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, Oct 2015.
[4] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast Semi-Direct Monocular Visual Odometry," in Proc. IEEE Intl. Conf. on Robotics and Automation (ICRA), 2014.
[5] P. Torr and A. Zisserman, "Feature Based Methods for Structure and Motion Estimation," in Vision Algorithms: Theory and Practice. Springer Berlin Heidelberg, 2000, pp. 278–294.
[6] D. Nister, O. Naroditsky, and J. Bergen, "Visual odometry," in Computer Vision and Pattern Recognition (CVPR), June 2004.
[7] H. Badino, A. Yamamoto, and T. Kanade, "Visual odometry by multi-frame feature integration," in ICCV Workshops (ICCVW), 2013.
[8] M. Maimone, Y. Cheng, and L. Matthies, "Two years of visual odometry on the Mars Exploration Rovers," Journal of Field Robotics, Special Issue on Space Robotics, vol. 24, 2007.
[9] A. Howard, "Real-time stereo visual odometry for autonomous ground vehicles," in IROS, 2008.
[10] M. Kaess, K. Ni, and F. Dellaert, "Flow separation for fast and robust stereo odometry," in IEEE Conf. on Robotics and Automation, May 2009, pp. 3539–3544.
[11] A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy, "Visual odometry and mapping for autonomous flight using an RGB-D camera," in International Symposium on Robotics Research (ISRR), 2011.
[12] A. Geiger, J. Ziegler, and C. Stiller, "StereoScan: Dense 3D Reconstruction in Real-time," in Intelligent Vehicles Symposium (IV), 2011.
[13] P. Nelson, W. Churchill, I. Posner, and P. Newman, "From dusk till dawn: localisation at night using artificial light sources," in ICRA. IEEE, 2015.
[14] M. Milford, E. Vig, W. Scheirer, and D. Cox, "Vision-based Simultaneous Localization and Mapping in Changing Outdoor Environments," Journal of Field Robotics, vol. 31, no. 5, 2014.
[15] B. D. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," in DARPA Image Understanding Workshop, 1981.
[16] M. Irani and P. Anandan, "About Direct Methods," in Vision Algorithms: Theory and Practice, 2000, pp. 267–277.
[17] S. Klose, P. Heise, and A. Knoll, "Efficient compositional approaches for real-time robust direct visual odometry from RGB-D data," in IROS, 2013.
[18] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in IROS, 2013.
[19] F. Steinbrucker, J. Sturm, and D. Cremers, "Real-time visual odometry from dense RGB-D images," in ICCV Workshops, 2011.
[20] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, "RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments," IJRR, vol. 31, no. 5, 2012.
[21] T. Whelan, H. Johannsson, M. Kaess, J. Leonard, and J. McDonald, "Robust real-time visual odometry for dense RGB-D mapping," in ICRA, 2013.
[22] R. Newcombe, S. Lovegrove, and A. Davison, "DTAM: Dense tracking and mapping in real-time," in Computer Vision (ICCV), IEEE International Conference on, Nov 2011, pp. 2320–2327.
[23] A. I. Comport, E. Malis, and P. Rives, "Real-time quadrifocal visual odometry," The International Journal of Robotics Research, vol. 29, no. 2-3, pp. 245–266, 2010.
[24] G. Silveira and E. Malis, "Real-time Visual Tracking under Arbitrary Illumination Changes," in Computer Vision and Pattern Recognition, June 2007, pp. 1–6.
[25] B. K. Horn and B. G. Schunck, "Determining optical flow," Artificial Intelligence, vol. 17, no. 1, pp. 185–203, 1981.
[26] M. A. Gennert and S. Negahdaripour, "Relaxing the brightness constancy assumption in computing optical flow," 1987.
[27] J. Engel, J. Stueckler, and D. Cremers, "Large-Scale Direct SLAM with Stereo Cameras," in IROS, 2015.
[28] Y. Lu and D. Song, "Robustness to lighting variations: An RGB-D indoor visual odometry using line segments," in IROS, 2015.
[29] R. Martins, E. Fernandez-Moral, and P. Rives, "Dense accurate urban mapping from spherical RGB-D images," in IROS, Sept 2015, pp. 6259–6264.
[30] T. Ojala, M. Pietikäinen, and D. Harwood, "A comparative study of texture measures with classification based on featured distributions," Pattern Recognition, vol. 29, pp. 51–59, 1996.
[31] R. Zabih and J. Woodfill, "Non-parametric local transforms for computing visual correspondence," in Computer Vision - ECCV'94. Springer, 1994, pp. 151–158.
[32] S. Baker and I. Matthews, "Lucas-Kanade 20 years on: A unifying framework," International Journal of Computer Vision, vol. 56, no. 3, pp. 221–255, 2004.
[33] L. Sevilla-Lara, D. Sun, E. G. Learned-Miller, and M. J. Black, "Optical Flow Estimation with Channel Constancy," 2014.
[34] A. Crivellaro and V. Lepetit, "Robust 3D Tracking with Descriptor Fields," in CVPR, 2014.
[35] E. Antonakos, J. Alabort-i-Medina, G. Tzimiropoulos, and S. Zafeiriou, "Feature-Based Lucas-Kanade and Active Appearance Models," Image Processing, IEEE Transactions on, vol. 24, no. 9, 2015.
[36] H. Bristow and S. Lucey, "In Defense of Gradient-Based Alignment on Densely Sampled Sparse Features," in Dense Correspondences in Computer Vision. Springer, 2014.
[37] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005.
[38] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, 2004.
[39] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, "High Accuracy Optical Flow Estimation Based on a Theory for Warping," in ECCV, 2004, vol. 3024.
[40] N. Dowson and R. Bowden, "Mutual Information for Lucas-Kanade Tracking (MILK): An Inverse Compositional Formulation," PAMI, vol. 30, no. 1, pp. 180–185, Jan 2008.
[41] M. Irani and P. Anandan, "Robust multi-sensor image alignment," in ICCV, 1998, pp. 959–966.
[42] G. Pascoe, W. Maddern, and P. Newman, "Robust direct visual localisation using normalised information distance," 2015.
[43] G. Caron, A. Dame, and E. Marchand, "Direct model based visual tracking and pose estimation using mutual information," Image and Vision Computing, 2014.
[44] G. G. Scandaroli, M. Meilland, and R. Richa, "Improving NCC-Based Direct Visual Tracking," in ECCV, ser. Lecture Notes in Computer Science, 2012, vol. 7577, pp. 442–455.
[45] C. Vogel, S. Roth, and K. Schindler, "An Evaluation of Data Costs for Optical Flow," in Pattern Recognition, 2013.
[46] C. Kerl, J. Sturm, and D. Cremers, "Robust Odometry Estimation for RGB-D Cameras," in ICRA, 2013.
[47] H. Alismail, "Direct pose estimation and refinement," Ph.D. dissertation, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, August 2016.
[48] H. Alismail and B. Browning, "Direct Disparity Space: An Algorithm for Robust and Real-time Visual Odometry," Robotics Institute, Tech. Rep. CMU-RI-TR-14-20, Oct 2014.
[49] Z. Zhang, "Parameter estimation techniques: A tutorial with application to conic fitting," Image and Vision Computing, 1997.
[50] D. Hafner, O. Demetz, and J. Weickert, "Why Is the Census Transform Good for Robust Optic Flow Computation?" in Scale Space and Variational Methods in Computer Vision, 2013, vol. 7893.
[51] M. Peris, A. Maki, S. Martull, Y. Ohkawa, and K. Fukui, "Towards a simulation driven stereo vision system," in Pattern Recognition, IEEE International Conference on, Nov 2012.
[52] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in European Conference on Computer Vision. Springer, 2006, pp. 430–443.
[53] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite," in CVPR, 2012.
[54] A. Handa, R. A. Newcombe, A. Angeli, and A. J. Davison, "Real-Time Camera Tracking: When is High Frame-Rate Best?" in European Conf. on Computer Vision (ECCV), 2012, vol. 7578, pp. 222–235.
[55] K. Daniilidis and H.-H. Nagel, "The coupling of rotation and translation in motion estimation of planar surfaces," in Computer Vision and Pattern Recognition (CVPR'93), IEEE, 1993, pp. 188–193.
[56] Z. Zhang, H. Rebecq, C. Forster, and D. Scaramuzza, "Benefit of large field-of-view cameras for visual odometry," in ICRA, 2016, pp. 801–808.
[57] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. of the International Conference on Intelligent Robot Systems (IROS), Oct 2012.
[58] H. Hirschmuller, "Accurate and efficient stereo processing by semi-global matching and mutual information," in CVPR, 2005.
[59] H. Alismail, B. Browning, and S. Lucey, "Robust tracking in low light and sudden illumination changes," in 3D Vision (3DV), 2016 International Conference on. IEEE, 2016.