A Comparative Analysis of Tightly-coupled Monocular, Binocular, and Stereo VINS

Mrinal K. Paul, Kejian Wu, Joel A. Hesch, Esha D. Nerurkar, and Stergios I. Roumeliotis†

Abstract— In this paper, a sliding-window two-camera vision-aided inertial navigation system (VINS) is presented in the square-root inverse domain. The performance of the system is assessed for the cases where feature matches across the two-camera images are processed with or without any stereo constraints (i.e., stereo vs. binocular). To support the comparison results, a theoretical analysis of the information gain when transitioning from binocular to stereo is also presented. Additionally, the advantage of using a two-camera (both stereo and binocular) system over a monocular VINS is assessed. Furthermore, the impact on the achieved accuracy of different image-processing frontends and estimator design choices is quantified. Finally, a thorough evaluation of the algorithm's processing requirements, which runs in real-time on a mobile processor, as well as its achieved accuracy as compared to alternative approaches, is provided for various scenes and motion profiles.

I. INTRODUCTION AND RELATED WORK

Combining measurements from an inertial measurement unit (IMU) with visual observations from a camera, known as a VINS, is a popular means for navigating within GPS-denied areas (e.g., underground, in space, or indoors). With the growing availability of such sensors in mobile devices (e.g., [1]), the research focus in VINS is gradually turning towards finding efficient, real-time solutions on resource-constrained devices. Moreover, with the recent improvements in mobile CPUs and GPUs (e.g., NVIDIA's TK1 [2]), the interest in more robust multi-camera VINS is also increasing.

Most existing VINS approaches can be classified into loosely-coupled and tightly-coupled systems. In loosely-coupled systems, either the integrated IMU data are incorporated as independent measurements into the (stereo) vision optimization (e.g., [3]), or vision-only pose estimates are used to update an extended Kalman filter (EKF) performing IMU propagation (e.g., [4]). In contrast, tightly-coupled approaches jointly optimize over all sensor measurements (e.g., [5], [6], [7]), which results in higher accuracy.

The first step towards a multi-camera VINS is to employ two cameras. Incorporating, however, a second camera into a localization system is usually computationally costly. Thus, although there exist many implementations of loosely-coupled stereo systems (e.g., [3], [8]), or approaches that use stereo only for feature initialization (e.g., [9]), only few works address the more computationally demanding tightly-coupled stereo VINS. To the best of our knowledge, Leutenegger et al. [7] and Manderson et al. [10] present the only tightly-coupled stereo VINS, which operate in real-time but only on desktop CPUs. Manderson et al. [10] employ an extension of PTAM [11] where the tracking and mapping pipelines are decoupled, and hence is inconsistent.¹ On the other hand, Leutenegger et al. [7] propose a consistent keyframe-based stereo simultaneous localization and mapping (SLAM) algorithm that performs nonlinear optimization over both visual and inertial cost terms. In order to maintain the sparsity of the system, their approach employs the following approximation: instead of marginalizing the landmarks associated with the oldest pose in the temporal window, these are dropped from the system (fully dropped for non-keyframes and partially dropped for keyframes), rendering their approach sub-optimal.

In contrast, our method builds on and extends the work of [6], where a monocular VINS is presented in the inverse square-root form (termed SR-ISWF). In this paper, we present both stereo and duo (binocular, i.e., two independent cameras without any stereo constraints between them) VINS, adopting the approach of [6]. We experimentally show the benefit of a stereo system over mono and duo systems, especially in challenging environments and motion profiles. We also provide a theoretical analysis of the performance gain in stereo, as compared to duo, in terms of the Cholesky factor update.
Additionally, we present the impact of different image-processing frontends on VINS and show that our stereo system operates in real-time on mobile processors and achieves high accuracy, even with a low-cost commercial-grade IMU, as opposed to [7] that employs an industrial-grade IMU. Our main contributions are:
• We present the first tightly-coupled stereo VINS that operates in real-time on mobile processors.
• We present a detailed comparison between mono, duo, and stereo VINS under different scenes and motion profiles, and provide a theoretical explanation of the information gain when transitioning from duo to stereo.
• We assess the impact of different image-processing frontends on the estimation accuracy of VINS, and perform a sensitivity analysis of different frontends with respect to changes in feature track length. Moreover, we provide a detailed analysis of how different design choices (e.g., optimization window-size and extrinsics representation) affect the VINS performance.
• We compare our stereo-VINS algorithm against two state-of-the-art systems: i) OKVIS [13] (an open-source implementation of [7]) and ii) ORB-SLAM2 [14] (a vision-only stereo SLAM system with loop-closures); and demonstrate its superior performance.

The rest of this paper is structured as follows: In Sec. II, we briefly overview the key components of the proposed VINS, while Sec. III describes the image-processing frontend. Sec. IV presents an overview of the estimation algorithm, highlighting the key differences between the duo and stereo systems. A theoretical explanation of the information gain in stereo, as compared to duo, is presented in Sec. V. Finally, experimental results over several datasets are shown in Sec. VI, while Sec. VII concludes the paper.

†This work was supported by Google, Project Tango.
M. K. Paul and S. I. Roumeliotis are with the Department of Computer Science and Engineering, Univ. of Minnesota, Minneapolis, MN, USA. {paulx152,stergios}@umn.edu
K. Wu is with the Department of Electrical and Computer Engineering, Univ. of Minnesota, Minneapolis, MN, USA. [email protected]
J. A. Hesch and E. D. Nerurkar are with Project Tango, Google, Mountain View, CA, USA. {joelhesch, eshanerurkar}@google.com
¹As defined in [12], a state estimator is consistent if the estimation errors are zero-mean and have covariance matrix smaller or equal to the one calculated by the filter. Since PTAM considers parts of the state vector to be perfectly known during its tracking or mapping phases, the resulting Hessian does not reflect the information, and hence uncertainty, of the system.

Fig. 1. Stereo camera-IMU setup, where {I}, {C_L}, {C_R}, and {G} are the IMU, left camera, right camera, and global frames, respectively, ({}^I q_{C_L}, {}^I p_{C_L}) and ({}^I q_{C_R}, {}^I p_{C_R}) are the corresponding left and right IMU-camera extrinsic parameters, ({}^{C_L} q_{C_R}, {}^{C_L} p_{C_R}) are the extrinsic parameters between the left and right cameras, and f_j^S, f_j^L, and f_j^R are the stereo and mono (left and right) features.

II. VISION-AIDED INERTIAL NAVIGATION SYSTEM

The key components of the proposed VINS (see Fig. 1) are briefly described hereafter.

A. System State

At each time step k, the system maintains the following state vector:

  x_k' = [x_S^T  x_k^T]^T   (1)
  with  x_k = [x_{C_{k-M+1}}^T  ...  x_{C_k}^T  x_P^T  x_{E_k}^T]^T   (2)

where x_S is the state vector of the current SLAM features being estimated and x_k is the state vector comprising all other current states. Here x_S = [{}^{C_m^i} p_{f_1}^T  ...  {}^{C_m^i} p_{f_n}^T]^T, with {}^{C_m^i} p_{f_j}, for j = 1, ..., n, denoting the position of the point feature f_j in its first observing camera frame {C_m^i}, where, for the j-th feature, m is the time step of its first observing camera frame and i = 0, 1 is its first observing camera index (0 = left, 1 = right). For stereo features, if both cameras start observing the feature from the same time step, the left camera is assigned to be the first observing camera. Next, x_{C_p}, for p = k − M + 1, ..., k, represents the state vector of the cloned IMU poses at time step p, where M is the size of the sliding-window. We refer to the same stochastic cloning as in the MSCKF [5] for maintaining past IMU poses in a sliding-window estimator. Each cloned pose state is defined as

  x_{C_p} = [{}^{I_p} q_G^T  {}^G p_{I_p}^T  t_{d_p}]^T   (3)

where {}^{I_p} q_G is the quaternion representing the orientation of the global frame {G} in the IMU's frame of reference {I_p}, {}^G p_{I_p} is the position of {I_p} in {G}, and t_{d_p} is the IMU-camera time offset (similar to the definition in [15]), at time step p. Next, the parameter state vector is defined as

  x_P = [{}^I q_{C_L}^T  {}^I p_{C_L}^T  {}^I q_{C_R}^T  {}^I p_{C_R}^T]^T   (4)

where {}^I q_{C_L} and {}^I q_{C_R} are the quaternion representations of the orientations, and {}^I p_{C_L} and {}^I p_{C_R} are the positions, of the left and right camera frames, {C_L} and {C_R}, in the IMU's frame of reference {I}. An alternative representation of x_P is also considered, which consists of the left camera-IMU extrinsics and the left-right camera-camera extrinsics ({}^{C_L} q_{C_R}, {}^{C_L} p_{C_R}):

  x_P = [{}^I q_{C_L}^T  {}^I p_{C_L}^T  {}^{C_L} q_{C_R}^T  {}^{C_L} p_{C_R}^T]^T   (5)

In Sec. VI-G, we present a detailed comparison of these two representations, which supports selecting the latter. Finally, x_{E_k} stores the current IMU biases and speed:

  x_{E_k} = [b_{g_k}^T  {}^G v_{I_k}^T  b_{a_k}^T]^T   (6)

where b_{g_k} and b_{a_k} correspond to the gyroscope and accelerometer biases, respectively, and {}^G v_{I_k} is the velocity of {I_k} in {G}, at time step k.

The error state x̃ is defined as the difference between the true state x and the state estimate x̂ employed for linearization (i.e., x̃ = x − x̂), while for the quaternion q a multiplicative error model q̃ = q ⊗ q̂^{-1} ≈ [½ δθ^T  1]^T is used, where δθ is a minimal representation of the attitude error.

B. Inertial Measurement Equations and Cost Terms

Given inertial measurements u_{k,k+1} = [ω_{m_k}^T  a_{m_k}^T]^T, where ω_{m_k} and a_{m_k} are gyroscope and accelerometer measurements, respectively, analytical integration of the continuous-time system equations (see [6]) within the time interval [t_k, t_{k+1}] is used to determine the discrete-time system equations

  x_{I_{k+1}} = f(x_{I_k}, u_{k,k+1} − w_{k,k+1})   (7)

where x_{I_k} ≜ [x_{C_k}^T  x_{E_k}^T]^T, and w_{k,k+1} is the discrete-time zero-mean white Gaussian noise affecting the IMU measurements with covariance Q_k. Linearizing (7) around the state estimates x̂_{I_k} and x̂_{I_{k+1}} results in the cost term

  C_u(x̃_{I_k}, x̃_{I_{k+1}}) = || [Φ_{k+1,k}  −I] [x̃_{I_k}^T  x̃_{I_{k+1}}^T]^T − (x̂_{I_{k+1}} − f(x̂_{I_k}, u_{k,k+1})) ||^2_{Q_k'}   (8)

where Q_k' = G_{k+1,k} Q_k G_{k+1,k}^T, with Φ_{k+1,k} and G_{k+1,k} being the corresponding IMU state and noise Jacobians.
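As an illustrative sketch only (not the paper's implementation; the function and variable names are ours, the state dimension is left generic, and the error state is treated as a plain vector), the linearized IMU cost term (8) can be whitened so that the weighted norm ||·||²_{Q_k'} becomes an ordinary Euclidean norm, which is the form a square-root inverse filter operates on:

```python
import numpy as np

def whitened_imu_cost(Phi, x_hat_next, f_pred, Qp):
    """Form the whitened Jacobian and residual of the linearized IMU cost (8).

    Phi        : (d, d) IMU error-state transition Jacobian Phi_{k+1,k}
    x_hat_next : (d,)   current estimate of x_{I_{k+1}}
    f_pred     : (d,)   prediction f(x_hat_{I_k}, u_{k,k+1})
    Qp         : (d, d) propagated noise covariance Q'_k = G_{k+1,k} Q_k G_{k+1,k}^T
    """
    d = Phi.shape[0]
    J = np.hstack([Phi, -np.eye(d)])     # acts on [x~_{I_k}; x~_{I_{k+1}}]
    r = x_hat_next - f_pred              # residual of (8)
    L = np.linalg.cholesky(Qp)           # Q'_k = L L^T
    J_w = np.linalg.solve(L, J)          # L^{-1} [Phi  -I]
    r_w = np.linalg.solve(L, r)          # L^{-1} r
    return J_w, r_w                      # cost = || J_w [x~_k; x~_{k+1}] - r_w ||^2
```

Stacking such whitened blocks together with the prior factor is what allows the estimator to operate exclusively on square-root factors.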
C. Visual Measurement Equations and Cost Terms

Point features extracted from consecutive images are used as visual measurements to be processed by the estimator. The measurement model for the j-th feature in the i-th camera is

  z_k^{i,j} = π({}^{C_{k+t}^i} p_{f_j}) + n_k^{i,j}   (9)

where π(·) is the camera projection model (including distortion), {}^{C_{k+t}^i} p_{f_j} is the feature position expressed in the i-th (i = 0: left, 1: right) camera's frame of reference at the exact image-acquisition time instant k + t, and n_k^{i,j} is zero-mean, white Gaussian noise with covariance σ² I_2. Linearizing (9) around the current state estimates yields:

  z̃_k^{i,j} = H_{x,k}^{i,j} x̃_k + H_{f,k}^{i,j} {}^G p̃_{f_j} + n_k^{i,j}   (10)

where H_{x,k}^{i,j} and H_{f,k}^{i,j} are the corresponding Jacobians evaluated at the state estimate x̂_k. For a monocular feature, stacking together all N_j = N_{i,j} observations to it yields:

  z̃^j = H_x^j x̃_k + H_f^j {}^G p̃_{f_j} + n^j   (11)

Then, the measurements z̃^j contribute a linearized cost term:

  C_{z^j}(x̃_k, {}^G p̃_{f_j}) = || H_x^j x̃_k + H_f^j {}^G p̃_{f_j} − z̃^j ||^2_{σ² I_{2N_j}}   (12)

For a stereo feature, however, two sets of observations, N_{0,j} and N_{1,j}, come from the two cameras. Stacking together all such N_j = N_{0,j} + N_{1,j} observations to this feature yields:

  z̃^j = [z̃^{0,j}; z̃^{1,j}] = [H_x^{0,j}; H_x^{1,j}] x̃_k + [H_f^{0,j}; H_f^{1,j}] {}^G p̃_{f_j} + [n^{0,j}; n^{1,j}]   (13)

The corresponding linearized cost term becomes:

  C_{z^j}(x̃_k, {}^G p̃_{f_j}) = || H_x^j x̃_k + H_f^j {}^G p̃_{f_j} − z̃^j ||^2_{σ² I_{2N_j}}   (14)
   = || [H_x^{0,j}; H_x^{1,j}] x̃_k + [H_f^{0,j}; H_f^{1,j}] {}^G p̃_{f_j} − [z̃^{0,j}; z̃^{1,j}] ||^2_{σ² I_{2N_j}}   (15)
   = Σ_{i=0}^{1} || H_x^{i,j} x̃_k + H_f^{i,j} {}^G p̃_{f_j} − z̃^{i,j} ||^2_{σ² I_{2N_{i,j}}}   (16)

D. Visual-Information Management

Three types of VINS systems are supported in our setup: i) mono, ii) duo, and iii) stereo, where the mono system uses observations only from the left camera, the duo system processes measurements from both cameras independently, and the stereo system fuses observations from both cameras with a constraint between the commonly observed features. In all systems, we maintain two types of visual measurements: i) SLAM features and ii) MSCKF features, as in [5], [6], so as to provide high estimation accuracy while remaining computationally efficient. SLAM features are those that are added in the state vector (2) and updated across time. On the other hand, MSCKF features are those that are processed as in the MSC-KF approach [5], where the feature states are marginalized from the measurement equation (11) to generate constraints between poses. Note that a similar approach to [6] is followed for classifying and maintaining the observed features into these two categories.

III. IMAGE-PROCESSING FRONTEND

The proposed system extracts point features on consecutive images and tracks them across time. For each camera, feature extraction, tracking, and visual database management are performed independently. Two main categories of image-processing frontends are explored: i) FAST [16] corner extraction and tracking using Kanade-Lucas-Tomasi (KLT) [17]-based optical flow, and ii) difference-of-Gaussians (DOG) feature extraction with FREAK [18] descriptors and tracking using frame-to-frame matching. Henceforth, we will refer to them as the KLT and FREAK frontends. For both frontends, the feature tracking is aided with gyro-prediction (the frame-to-frame rotation estimate from the integration of the gyroscope measurements) and outliers are rejected using the 2-pt RANSAC [19]. After obtaining both left and right monocular feature tracks, stereo matching is performed. Lastly, another outlier rejection step is applied to the stereo matches and the remaining visual tracks are triangulated and prepared for the filter to process.

A. Stereo Matching

In our system, stereo matching is performed directly on the fisheye images, where the search region is restricted to the curve corresponding to the epipolar line. To avoid an exhaustive search along the whole epipolar line, it is assumed that the inverse depth of the features remains within an interval [0, ρ_max]. Also, to account for the inaccuracy of the stereo extrinsic calibration, the search region is expanded by 2 pixels on both sides, perpendicular to the epipolar line. For the two different feature-extraction pipelines under consideration, different stereo matching strategies are employed. For the FREAK pipeline, the features are already associated with descriptors and descriptor matching is used to find stereo correspondences. For the KLT pipeline, however, sum-of-squared-differences (SSD)-based patch matching is employed. A line patch consisting of five equidistant points on the epipolar line is used for matching, as in [20].

B. Outlier Rejection

Our system employs a cascade of outlier rejection steps for the stereo features. Besides the epipolar constraint check in stereo matching (Sec. III-A), a re-projection error test is performed on the stereo matches. Then, the depths of the stereo features are estimated using 2-view stereo triangulation and matches are rejected if the estimated depth falls outside an interval [d_min, d_max]. Additionally, a fast 1-pt RANSAC is employed to remove the remaining outlier matches. Specifically, among the stereo features observed at both time steps k − 1 and k, one feature is randomly picked and (17) is used to estimate the translation, {}^{I_k} p_{I_{k-1}}, between the corresponding IMU frames:

  {}^{I_k} p_{I_{k-1}} = {}^I p_{C_i} + {}^I C_{C_i} {}^{C_k^i} p_f − {}^{I_k} C_{I_{k-1}} ({}^I C_{C_i} {}^{C_{k-1}^i} p_f + {}^I p_{C_i})   (17)

where {}^{C_k^i} p_f and {}^{C_{k-1}^i} p_f are the feature positions with respect to the i-th camera at time steps k and k − 1, respectively, {}^I C_{C_i} and {}^I p_{C_i} are the IMU-camera extrinsics for the i-th camera, and {}^{I_k} C_{I_{k-1}} is the rotation between the IMU frames at time steps k and k − 1, which is reliably known from the integration of the gyroscope measurements. If both cameras [i.e., (17) for i = 0 and i = 1] generate similar estimates of {}^{I_k} p_{I_{k-1}}, the mean estimate is used to check how many of the stereo features satisfy (17). This process is repeated until enough inliers are found or the maximum number of iterations is reached. Next, the features are subjected to two more layers of outlier rejection in triangulation and filtering (Sec. III-C and IV-B).
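For illustration, a minimal NumPy sketch of this hypothesize-and-test loop is given below. It is our own simplification: rotation matrices replace quaternions, only a single camera is used (so the two-camera consistency check on the hypothesis is omitted), and the threshold values are illustrative rather than the ones used in our implementation.

```python
import numpy as np

def translation_hypothesis(pf_k, pf_km1, R_IC, p_IC, R_kkm1):
    """One-feature hypothesis of {I_k}p_{I_{k-1}}, following (17).

    pf_k, pf_km1 : (3,) feature position in camera i at time steps k and k-1
    R_IC, p_IC   : IMU-camera extrinsics {}^I C_{C_i}, {}^I p_{C_i}
    R_kkm1       : gyro-integrated rotation {}^{I_k} C_{I_{k-1}}
    """
    return p_IC + R_IC @ pf_k - R_kkm1 @ (R_IC @ pf_km1 + p_IC)

def one_pt_ransac(pf_k, pf_km1, R_IC, p_IC, R_kkm1,
                  thresh=0.05, max_iters=50, min_inliers=10):
    """Sketch of the fast 1-pt RANSAC of Sec. III-B (illustrative thresholds).

    pf_k, pf_km1 : (N, 3) triangulated stereo-feature positions in camera i
                   at time steps k and k-1 (same row = same feature).
    Returns a boolean inlier mask.
    """
    n = len(pf_k)
    # Translation implied by every feature individually: vectorized form of (17)
    t_all = p_IC + pf_k @ R_IC.T - (pf_km1 @ R_IC.T + p_IC) @ R_kkm1.T
    best = np.zeros(n, dtype=bool)
    rng = np.random.default_rng(0)
    for _ in range(max_iters):
        j = rng.integers(n)                      # one randomly picked feature
        t_hyp = translation_hypothesis(pf_k[j], pf_km1[j], R_IC, p_IC, R_kkm1)
        inliers = np.linalg.norm(t_all - t_hyp, axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
        if best.sum() >= min_inliers:
            break
    return best
```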
C. Triangulation and Preparation of Visual Tracks

In the stereo system, the visual feature tracks are classified into two categories: monocular and stereo. If, within the current optimization window, any two left-right feature tracks are linked by at least 2 stereo matches, all measurements from the two tracks are combined into a single feature track, classified as a stereo track. If any left track matches multiple right tracks, the right track with the maximum number of matches is chosen. The rest of the tracks from both cameras are then classified as monocular tracks. After classification, the features are triangulated using all observations to them from both cameras. During triangulation, the individual and mean re-projection errors of all observations in a track are checked. If the errors exceed a threshold, the track is rejected as an outlier. Additionally, for the stereo tracks, if the combined mean re-projection error is larger than the corresponding left-right track errors, the stereo track association is considered erroneous and the associated left-right tracks are re-classified and processed as monocular.
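The left-right track linking described above can be summarized by the following sketch; the data structures and the tie-breaking rule for contested right tracks are our own simplifications, not the exact bookkeeping of our implementation:

```python
from collections import Counter

def classify_tracks(left_ids, right_ids, stereo_matches, min_links=2):
    """Link left/right feature tracks into stereo tracks (cf. Sec. III-C).

    left_ids, right_ids : iterables of track identifiers from each camera
    stereo_matches      : list of (left_id, right_id) pairs, one per per-frame
                          stereo match inside the current optimization window
    Returns (stereo_pairs, mono_left, mono_right).
    """
    link_count = Counter(stereo_matches)       # (left_id, right_id) -> #matches

    # For each left track, keep the right track with the most supporting
    # matches, provided there are at least `min_links` of them.
    best = {}                                  # left_id -> (right_id, #matches)
    for (l, r), n in link_count.items():
        if n >= min_links and (l not in best or n > best[l][1]):
            best[l] = (r, n)

    stereo_pairs = {l: r for l, (r, _) in best.items()}
    used_right = set(stereo_pairs.values())
    mono_left = [l for l in left_ids if l not in stereo_pairs]
    mono_right = [r for r in right_ids if r not in used_right]
    return stereo_pairs, mono_left, mono_right
```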

IV. ESTIMATION ALGORITHM

In this section, the main steps of the SR-ISWF algorithm are briefly described. Though the proposed system supports both SLAM and MSCKF features, for brevity and clarity of presentation we focus only on the MSCKF features.

At each time step k, the objective is to minimize the cost term C_k^⊕ that contains all the information available so far:

  C_k^⊕ = C_{k-1} + C_u + C_{Z_M}   (18)

where C_u [see (8)] represents the cost term arising from the IMU measurement u_{k-1,k}, and C_{Z_M} = Σ_{j=1}^{N_M} C_{z^j} [see (12)] from the visual measurements to the N_M MSCKF features. In the mono system, C_{Z_M} consists of the feature observations from the left camera only, and hence equals C_{Z_{M_0}}. In the duo system, however, it incorporates observations from both cameras, i.e., C_{Z_M} = C_{Z_{M_0}} + C_{Z_{M_1}}. Lastly, for the stereo system, C_{Z_M} comprises monocular observations from each camera, as well as stereo observations of features viewed by both cameras, i.e., C_{Z_M} = C_{Z_{M_0}} + C_{Z_{M_1}} + C_{Z_{M_s}}, where C_{Z_{M_s}} is the cost term due to the N_s stereo observations, that is

  C_{Z_{M_s}} = Σ_{i=0}^{1} Σ_{j=1}^{N_s} C_{z^{i,j}}(f_{i,j})   s.t.   f_{0,j} ≡ {}^0 T_1 ⊕ f_{1,j}   (19)

where {}^0 T_1 is the transformation between the two camera frames and f_{i,j} is the j-th feature observed by the i-th camera.

All the prior information obtained from the previous time step is contained in

  C_{k-1}(x̃_{k-1}) = || R_{k-1} x̃_{k-1} − r_{k-1} ||^2   (20)

where R_{k-1} and r_{k-1} are the prior information factor matrix and residual vector, respectively, and x̃_{k-1} ≜ x_{k-1} − x̂_{k-1} is the error state from time step k − 1 [see (2)].

A. Propagation

In the sliding-window, a new pose state x_{I_k} [see (7)] is appended to the current state vector at each time step k:

  x_k = [x_{k-1}^T  x_{I_k}^T]^T   (21)

using the IMU measurement u_{k-1,k}. Hence, as in [6], the cost term, which initially comprised only C_{k-1}, becomes

  C_k(x̃_k) = C_{k-1}(x̃_{k-1}) + C_u(x̃_{I_{k-1}}, x̃_{I_k}) = || R_k x̃_k − r_k ||^2   (22)

B. Marginalization and Covariance Factor Recovery

To maintain constant computational complexity, at time step k, the SR-ISWF marginalizes the oldest clone x̃_{C_{k-M}} and the extra IMU states x̃_{E_{k-1}} from the previous time step. For marginalization, the same strategy as in [6] is used, with the resultant cost term after marginalization being

  C_k^M(x̃_k) = min_{x̃_k^M} C_k(x̃_k^M, x̃_k) = || R_k^R x̃_k − r_k^R ||^2   (23)

where x̃_k^M are the marginalized states, x̃_k are the remaining states after marginalization [see (2)], and R_k^R is upper-triangular. Next, as in [6], the covariance factor is recovered and used in Mahalanobis-distance-test-based outlier rejection.
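As a minimal sketch (assuming the states to be marginalized have already been permuted to the leading columns of the stacked factor; this is not the paper's optimized routine), marginalization in the square-root inverse form amounts to re-triangularizing the factor and dropping the leading block, yielding R_k^R and r_k^R of (23):

```python
import numpy as np

def marginalize_sqrt_factor(R_stacked, r_stacked, n_marg):
    """Marginalize leading states from a square-root information cost ||R x - r||^2.

    R_stacked : (m, n) stacked (not necessarily triangular) information factor,
                with the states to be marginalized in the first n_marg columns
    r_stacked : (m,)   corresponding residual vector
    Returns the upper-triangular remaining factor and residual of (23).
    """
    Q, R = np.linalg.qr(R_stacked)       # re-triangularize: R_stacked = Q R
    r = Q.T @ r_stacked                  # rotate the residual accordingly
    # Dropping the first n_marg rows/columns minimizes the cost over the
    # marginalized states (their upper-triangular block can always be zeroed),
    # leaving only the information pertaining to the remaining states.
    return R[n_marg:, n_marg:], r[n_marg:]
```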
C. Update

During the update step, the pose-constraint information from the MSCKF feature measurements is incorporated.

1) Mono System Update: Following [6], first an orthonormal factorization is performed on the cost term in (12), to split it into two terms: i) C_{z_1^j}, containing information about the feature's position, and ii) C_{z_2^j}, containing information about the poses which observed the feature, i.e.,

  C_{z^j}(x̃_k, {}^G p̃_{f_j}) = || F_1^j x̃_k + R_f^j {}^G p̃_{f_j} − z̃_1^j ||^2_{σ² I_3}   (24)
   + || F_2^j x̃_k − z̃_2^j ||^2_{σ² I_{2N_j − 3}}   (25)
   = C_{z_1^j}(x̃_k, {}^G p̃_{f_j}) + C_{z_2^j}(x̃_k)   (26)

where F_l^j, R_f^j, and z̃_l^j (l = 1, 2) are the corresponding Jacobians and residuals. Next, the feature states are marginalized by dropping C_{z_1^j}, while the pose constraints from the C_{z_2^j} terms are incorporated in the cost term C_k^M. To do so, the Jacobians are stacked together and, as in [6], a thin-QR factorization [21] is performed, resulting in the cost term

  C_k^⊕(x̃_k) = C_k^M(x̃_k) + C_{Z_M}(x̃_k) = || R_k^⊕ x̃_k − r_k^⊕ ||^2   (27)

where C_{Z_M}(x̃_k) = Σ_{j=1}^{N_M} C_{z_2^j}(x̃_k), with N_M being the number of MSCKF features. Finally, (27) is minimized with respect to the error state vector and the solution x̃_k is used to update the state:

  min_{x̃_k} C_k^⊕(x̃_k) = min_{x̃_k} || R_k^⊕ x̃_k − r_k^⊕ ||^2   (28)
  x̂_k^⊕ = x̂_k + x̃_k   (29)

2) Duo System Update: As compared to the mono system, the only difference here is that the cost term C_{Z_M} comprises two sets of measurements, one from each camera:

  C_{Z_M}(x̃_k) = Σ_{i=0}^{1} Σ_{j=1}^{N_{M_i}} C_{z_2^{i,j}}(x̃_k)   (30)

where N_{M_i} is the number of features for the i-th camera.

3) Stereo System Update: For the stereo system, there are two types of visual measurements: monocular and stereo. The monocular measurements can come from either camera and are processed independently, in the same way as in the duo. For the stereo measurements in (13), however, the cost terms (14) come from both cameras. The combined Jacobians H_f^j are then factorized similarly as in the mono case (26). In our optimized implementation of the factorization, we stack together the measurement Jacobians in (15) and maintain the block-diagonal shape of H_x^j by interleaving the Jacobians from each camera in the clone order that they were observed.

V. INFORMATION GAIN FROM DUO TO STEREO

For features that are observed from both cameras, the transition from the duo to the stereo system is due to the addition of the constraint in (19), i.e., that the two sets of observations correspond to the same feature. Such an additional constraint results in information gain. To show that, we start with the cost term in (14) and apply a two-step factorization. Similar to the mono case, the two H_f^{i,j} terms in (16) are first factorized:

  C_{z^j}(x̃_k, {}^G p̃_{f_j}) = Σ_{i=0}^{1} || F_1^{i,j} x̃_k + R_f^{i,j} {}^G p̃_{f_j} − z̃_1^{i,j} ||^2_{σ² I_3}
   + Σ_{i=0}^{1} || F_2^{i,j} x̃_k − z̃_2^{i,j} ||^2_{σ² I_{2N_{i,j} − 3}}
   = C_{z_1^j}(x̃_k, {}^G p̃_{f_j}) + C_{z_2^j}(x̃_k)   (31)

where F_l^{i,j}, R_f^{i,j}, and z̃_l^{i,j} (l = 1, 2) are the corresponding Jacobians and residuals. In what follows, the two Jacobians R_f^{i,j} are combined into J_f^j [see (32)] and another orthonormal factorization is performed on it to absorb the remaining information regarding the poses from the stereo measurements. To do so, a square orthonormal matrix Q_j, partitioned as Q_j = [P_j  V_j], is considered, where the 3 columns of P_j span the column space of J_f^j, while the 3 columns of V_j span its left nullspace:

  C_{z_1^j}(x̃_k, {}^G p̃_{f_j}) = || [F_1^{0,j}; F_1^{1,j}] x̃_k + [R_f^{0,j}; R_f^{1,j}] {}^G p̃_{f_j} − [z̃_1^{0,j}; z̃_1^{1,j}] ||^2_{σ² I_6}
   = || J_x^j x̃_k + J_f^j {}^G p̃_{f_j} − z̃_1^j ||^2_{σ² I_6}   (32)
   = || Q_j^T J_x^j x̃_k + Q_j^T J_f^j {}^G p̃_{f_j} − Q_j^T z̃_1^j ||^2_{σ² I_6}
   = || P_j^T J_x^j x̃_k + P_j^T J_f^j {}^G p̃_{f_j} − P_j^T z̃_1^j ||^2_{σ² I_3} + || V_j^T J_x^j x̃_k − V_j^T z̃_1^j ||^2_{σ² I_3}
   = || F_3^j x̃_k + R_f^j {}^G p̃_{f_j} − z̃_3^j ||^2_{σ² I_3} + || F_4^j x̃_k − z̃_4^j ||^2_{σ² I_3}   (33)
   = C_{z_3^j}(x̃_k, {}^G p̃_{f_j}) + C_{z_4^j}(x̃_k)   (34)

with

  F_3^j ≜ P_j^T J_x^j,   F_4^j ≜ V_j^T J_x^j,   R_f^j ≜ P_j^T J_f^j,   z̃_3^j ≜ P_j^T z̃_1^j,   z̃_4^j ≜ V_j^T z̃_1^j   (35)

Hence, (31) becomes:

  C_{z^j}(x̃_k, {}^G p̃_{f_j}) = C_{z_3^j}(x̃_k, {}^G p̃_{f_j}) + C_{z_4^j}(x̃_k) + C_{z_2^j}(x̃_k)

Here, C_{z_2^j} and C_{z_3^j} are similar to the cost terms in (26), and C_{z_4^j} is the cost term that conveys the extra information factor F_4^j obtained from the stereo features, as compared to duo.

It is well known that any monocular system has seven unobservable dof (3 for global translation, 3 for global orientation, and 1 for scale), while the same is true for the duo system. The stereo system, however, directly provides metric information and hence makes the scale observable. As evident, the additional factor F_4^j is the one providing the scale information.²
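To make the second factorization step concrete, the following sketch (our own illustration; the 1/σ noise scaling is omitted) extracts both the feature-dependent term of (33) and the extra pose-only factor F_4^j of (35) from the stacked stereo Jacobians:

```python
import numpy as np

def stereo_second_factorization(Jx, Jf, z1):
    """Second factorization step of Sec. V [(32)-(35)], noise scaling omitted.

    Jx : (6, n) stacked pose Jacobian J_x^j of the two per-camera terms
    Jf : (6, 3) stacked feature Jacobians [R_f^{0,j}; R_f^{1,j}]
    z1 : (6,)   stacked residual z~_1^j
    """
    # Full QR of Jf: the first 3 columns of Q span its column space (P_j),
    # the last 3 columns span its left nullspace (V_j).
    Q, _ = np.linalg.qr(Jf, mode='complete')
    P, V = Q[:, :3], Q[:, 3:]
    F3, R3f, z3 = P.T @ Jx, P.T @ Jf, P.T @ z1   # feature-dependent term of (33)
    F4, z4 = V.T @ Jx, V.T @ z1                  # extra pose-only factor of (35)
    return (F3, R3f, z3), (F4, z4)
```

In the duo system the two per-camera terms are never stacked, so the pose-only component F_4^j, and with it the scale information, is unavailable.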
VI. EXPERIMENTAL RESULTS

In this section, a brief description of the experimental setup is provided, followed by a performance assessment of the mono, duo, and stereo VINS, using different image-processing frontends and design choices. The capability of the proposed stereo system for real-time operation on a commercial-grade mobile processor is also assessed. Lastly, comparisons to alternative approaches are provided.

A. Experimental Setup

For our experiments, the wide stereo setup of Fig. 2 was used. The stereo rig is equipped with two Chameleon-2 camera sensors, with PT-02118BMP lenses. The cameras are global shutter and the lenses are fixed-focal fisheye lenses with 165° field of view (FOV). The baseline between the cameras is 26 cm and they capture 640×480 images at 15 Hz. A commercial-grade InvenSense MPU-9250 IMU is used, which reports inertial measurements at 100 Hz. The cameras are triggered simultaneously by the IMU. The full pipeline runs in real-time on the NVIDIA Jetson TK1 [2] board, which is equipped with a Tegra TK1 mobile processor, featuring a Kepler GPU and a quad-core ARM Cortex-A15 CPU.

B. Datasets and Ground-truth Estimation

In what follows, we first present our evaluation results on 7 of our indoor datasets.³ To assess the effect of various conditions on certain pipelines, these were recorded under different lighting conditions, motion profiles, and scenes. The datasets are classified into three categories:
• Regular (Datasets 1-3): The purpose of these datasets is to assess how much drift the filter accumulates in a regular scene over a long period of time. These are long (0.8 km, 1.1 km, and 1.1 km), multi-floor, mostly texture-rich, and well-lit, with a few relatively darker and feature-deprived stairways.
• Textureless (Datasets 4-5): The purpose of these datasets is to assess the robustness of different image-processing frontends in challenging scenes. These are also long (1.2 km and 1.1 km), single-floor, and composed of dark, long, textureless corridors.
• Fast-motion (Datasets 6-7): The purpose of these is to assess the robustness of the image-processing frontends to fast motions. These are short (105.4 m and 88.4 m long), but contain very fast and arbitrary motions.

Furthermore, we present results for the EuRoC MAV datasets [22], which are classified into the i) Machine Hall (MH) and ii) VICON room datasets.

As a measure of positioning accuracy, the percentage of root mean square error (RMSE) over the distance travelled is used. For ground-truth, the batch least-squares (BLS) solution of [23] with all visual (from both cameras) and inertial measurements is used. The stereo BLS implementation performs loop-closures and hence accumulates negligible drift over long trajectories (RMSE is around 0.05%).
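For reference, the positioning-accuracy metric can be computed as in the following sketch (our own helper; the exact trajectory-alignment protocol used for the reported numbers may differ):

```python
import numpy as np

def rmse_over_distance(est_xyz, gt_xyz):
    """Position RMSE as a percentage of the distance travelled (Sec. VI-B).

    est_xyz, gt_xyz : (N, 3) time-aligned estimated and ground-truth positions.
    """
    rmse = np.sqrt(np.mean(np.sum((est_xyz - gt_xyz) ** 2, axis=1)))
    dist = np.sum(np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1))
    return 100.0 * rmse / dist
```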

²A formal proof of the directions along which F_4^j provides information is omitted due to lack of space.
³The datasets are available at: http://mars.cs.umn.edu/research/stereo_vins.php

Fig. 2. Experimental setup.

Fig. 4. Sensitivity of different pipelines (M-F = mono FREAK, S-F = stereo FREAK, M-K = mono KLT, S-K = stereo KLT) to changes in the maximum track length.

Fig. 3. Comparison between mono, duo, and stereo VINS using both FAST-KLT (F-K) and FREAK-to-FREAK (F-F) pipelines, for both our datasets and the MAV datasets (M = mono, D = duo, S = stereo).

C. Comparison Between Mono, Duo, and Stereo VINS

The goal of this experiment is to assess the accuracy of the proposed stereo system, as well as to compare it against its monocular and binocular counterparts. For a fair comparison, the same feature budget (the number of features processed in each update) was used in both mono and stereo systems (20 SLAM features and 50 MSCKF features). For the duo system, the same features as in the stereo were used, while dropping the stereo constraints. The sliding-window size, M, was set to 5. The comparison results for both image-processing frontends are presented in Fig. 3 as a box-and-whisker plot. Fig. 3 indicates that from mono to duo the estimation accuracy usually increases, since additional measurements from a second camera are included. The duo setup, however, does not always guarantee an improvement in performance, as is the case for the FREAK pipeline in the low-texture datasets. On the other hand, from duo to stereo a significant performance gain is always observed, confirming the findings of Sec. V.

D. Impact of Different Image-processing Frontends

In the proposed system, two categories of image-processing frontends, KLT-based and FREAK-matching-based, are considered. The key benefit of the KLT pipeline is that it extracts better geometric information by providing longer feature tracks.
Such tracks, however, could drift over time, causing significant performance degradation, especially in the case of stereo. The FREAK pipeline, on the other hand, usually generates drift-free tracks, since FREAK DOGs are relatively well localized and have distinguishable descriptors associated with them. Such features, however, suffer from shorter track lengths, as the DOG extrema have low repeatability in consecutive image frames. In this section, we assess which one of these factors (having longer vs. drift-free tracks) affects the estimation accuracy more.

We start by noting that, although in the regular datasets KLT achieves very long feature tracks (8.1 vs. 3.7 for FREAK), the track length drops close to that of the FREAK tracks for the textureless and fast-motion datasets (3.8 vs. 3.3). On the other hand, the FREAK-matching pipeline uses descriptors that are more invariant to scene changes, hence maintaining an almost constant average track length over all types of datasets. Also, being a matching-based approach, large displacements of features due to fast motion do not hurt its tracking performance as much, whereas KLT struggles to capture such large displacements. Consequently, as shown in Fig. 3, both the mono and stereo systems benefit significantly from using the FREAK pipeline. For the duo pipeline, the FREAK frontend either performs better (for fast motion) than, or comparably to, KLT. These results suggest that, although track length is an important factor for acquiring better geometric information, having drift-free (though shorter) tracks is even more important.

E. Sensitivity with respect to Maximum Track Length

To improve our understanding of how track length affects the estimation performance, a sensitivity analysis is performed on both the monocular and stereo VINS with respect to changes in the maximum track length. We compared both the KLT and FREAK-based feature trackers, while varying the maximum allowed track length from 4 to 10 (at 5 Hz cloning). The track-length sensitivity is defined as the maximum variation in RMSE due to the variation in the maximum track length. The results shown in Fig. 4 indicate that, compared to the FREAK frontend, KLT has a higher sensitivity to changes in track length, justifying the worse performance of KLT in challenging datasets (see Fig. 3). Also, as compared to stereo, the mono system shows more sensitivity towards changes in track length, since for monocular features the triangulation baseline depends heavily on the track length, whereas stereo features typically have a sufficient baseline irrespective of the track length.

F. Impact of Sliding-Window Size

In this experiment, the influence of the sliding-window size, M, on the estimation accuracy is assessed. The error distribution, while varying M (5, 7, and 10), is presented in Fig. 5 for different pipelines and datasets. In terms of processing time, M = 5 is around 1.5 (compared to M = 7) to 2.1 (compared to M = 10) times faster. As expected, in the regular datasets with the KLT frontend, the error drops as the sliding-window size increases, for both the mono and stereo systems. Interestingly, the opposite happens for the FREAK pipeline, and also for the KLT pipeline in the textureless and fast-motion datasets. In such cases, the average track length is close to or less than 4 clones; hence, even if the window size is increased, on average only few features have long enough tracks to benefit from the longer optimization window. Also, by doubling the optimization window size (from 5 to 10), the condition number in such cases usually increases by a factor of 1.4, which might contribute to this loss of performance.

Fig. 6. Impact of the two representations of the right camera extrinsics, with respect to i) the IMU (IMU-Camera) and ii) the left camera (Camera-to-Camera), for (a) stereo and (b) duo systems.

Fig. 5. Performance comparison of different pipelines by varying the number of clones in the sliding-window, M.

TABLE I
COMPARISON: TIMING RESULTS PER FILTER UPDATE (ms)

  Pipeline           Filter Update    Total Pipeline
  Mono-KLT              12.6362          36.1116
  Mono-FREAK            12.3767          52.6017
  Mono-GPUFREAK          9.41638         39.9897
  Duo-KLT               14.3675          74.5584
  Duo-FREAK             17.5211          94.4892
  Duo-GPUFREAK          13.543           65.2674
  Stereo-KLT            14.6614          93.2688
  Stereo-FREAK          17.7651          98.9529
  Stereo-GPUFREAK       13.5359          70.7481

G. Impact of Extrinsic Representations

As stated in Sec. II-A, the sensors' extrinsics can be represented in two ways [see (4) and (5)]. The most commonly used option is (4), for its ease of implementation (especially in multi-camera systems). By employing this representation, however, the two IMU-camera extrinsics are optimized independently, without posing any constraint on the camera-to-camera extrinsics. On the other hand, in (5) the camera-to-camera extrinsics are explicitly represented, allowing them to have a strong initial prior (for stereo, the camera-to-camera extrinsics can be estimated offline very accurately) so that they do not vary rapidly during optimization.⁴ Fig. 6 compares the performance of the stereo and duo systems using the FREAK frontend with both representations. As evident, (5) always outperforms (4), especially in the stereo system and the regular-motion datasets.
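The two representations (4) and (5) are related by a fixed change of variables; the sketch below (using rotation matrices C(q) and a stated frame convention, which is our assumption rather than the paper's exact notation) computes the camera-to-camera extrinsics from the two IMU-camera extrinsics:

```python
import numpy as np

def camera_to_camera_extrinsics(R_I_CL, p_I_CL, R_I_CR, p_I_CR):
    """Map representation (4) to representation (5).

    Convention assumed here (ours): {}^I x = {}^I C_C {}^C x + {}^I p_C, i.e.,
    R_I_C rotates camera-frame vectors into the IMU frame and p_I_C is the
    camera position expressed in the IMU frame.
    Returns ({}^{C_L} C_{C_R}, {}^{C_L} p_{C_R}).
    """
    R_CL_CR = R_I_CL.T @ R_I_CR                # {}^{C_L} C_{C_R}
    p_CL_CR = R_I_CL.T @ (p_I_CR - p_I_CL)     # {}^{C_L} p_{C_R}
    return R_CL_CR, p_CL_CR
```

Since the mapping is invertible, the choice between (4) and (5) affects which quantities receive a prior and how they are constrained during optimization, not what the state can represent.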
H. Computational Performance

Table I compares the processing times in milliseconds (ms) of the mono, duo, and stereo systems for both the KLT (NEON-optimized) and FREAK (both NEON- and CUDA-optimized) frontends, running on the NVIDIA Jetson TK1. For this test, M was set to 5 and the cloning rate was fixed to 5 Hz. Since the same filter-update budgets were maintained for all systems, the filter execution times were comparable (only a 16-40% increase from mono to stereo). The duo and stereo systems, however, both have the overhead of an additional feature extraction and tracking step, followed by stereo matching (for the stereo system). Thus, their image-processing frontends consume most of the computational resources, making their overall time almost double that of the mono system. The image-processing frontend, however, can be offloaded to special-purpose hardware (e.g., the Movidius VPU [24] or the Qualcomm Hexagon DSP [25]), in which case the gain in performance from mono to stereo will be realized with only minimal processing overhead.

I. Comparison with OKVIS and ORB-SLAM2

Fig. 7 compares the performance of the proposed stereo system (with the FREAK-matching frontend) against two state-of-the-art stereo SLAM systems: i) OKVIS [13] and ii) ORB-SLAM2 [14]. For the default configuration of [13] (maximum 400 keypoints, 5 keyframes, 3 most current frames), the median RMSE (% over distance travelled) on the MAV datasets was 0.21%, whereas the proposed algorithm (maximum 400 keypoints, window size 5) resulted in a median RMSE of 0.12%. For the datasets used in this paper, however, the difference was significantly larger: 1.62% median RMSE for OKVIS vs. 0.31% for ours. Since our datasets are mostly of exploration type, OKVIS needs to select more frequent keyframes, causing shorter baselines and hence lower accuracy. In terms of computation time, the difference is more pronounced: our algorithm (for fairness, we compare the non-CUDA frontend) runs 19.05 times (452.81 ms vs. 23.77 ms per frame, on the MAV datasets) to 5.49 times (180.94 ms vs. 32.98 ms per frame, on our datasets) faster than OKVIS on the Jetson TK1. These timing results preclude the use of OKVIS on mobile processors, while our proposed stereo system is capable of running in real-time at 10 Hz.

ORB-SLAM2, on the other hand, fails (loses track and fails to re-localize, or re-localizes with a wrong loop-closure) in most of our long datasets (it works only in 2 out of 5). It also performs worse in the fast-motion and MAV-VICON datasets. Only in the medium-scale MAV-MH datasets does ORB-SLAM2 take advantage of the loop-closures and perform better than our algorithm. Processing these loop-closures and running a parallel bundle-adjustment thread, however, costs ORB-SLAM2 in processing time. On the Jetson TK1 it takes around 174 ms per frame, which is 5-7 times slower than our algorithm.

Fig. 7. Accuracy comparison with OKVIS and ORB-SLAM2.

⁴Among the 6 dof of the IMU-camera extrinsic calibration, the 3 dof of position are typically the ones less accurately estimated, since acquiring information about them requires very fast rotations that cause image blur.

VII. CONCLUSIONS

This paper presents a two-camera extension and analysis of the SR-ISWF [6] for high-precision, real-time (up to 10 Hz cloning rate on mobile ARM processors) VINS. In particular, we provided a detailed comparison between the mono, duo, and stereo systems, along with a theoretical explanation of the superior performance of stereo over duo. Additionally, we assessed the robustness of different image-processing frontends and the importance of feature drift over track length. Moreover, we showed that descriptor-matching-based frontends are more robust than KLT-based ones, especially in challenging scenes and motion profiles. Also, we showed that although for regular scenes (with longer track lengths) a larger optimization window-size increases the estimation accuracy, for shorter tracks (due to challenging scenes or motions, or from the FREAK-matching frontend) both the mono and stereo systems perform better with a smaller window size. Furthermore, for stereo systems, we also presented a novel outlier rejection strategy and an alternative extrinsic representation. Lastly, we demonstrated that the proposed stereo system outperforms OKVIS [13] and ORB-SLAM2 [14] both in terms of accuracy and processing efficiency.

REFERENCES

[1] Project Tango, https://get.google.com/tango/.
[2] NVIDIA Jetson TK1, http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html.
[3] K. Konolige, M. Agrawal, and J. Sola, "Large-scale visual odometry for rough terrain," in Robotics Research. Springer, 2010, pp. 201–212.
[4] S. Weiss, M. W. Achtelik, S. Lynen, M. Chli, and R. Siegwart, "Real-time onboard visual-inertial state estimation and self-calibration of MAVs in unknown environments," in Proc. of the IEEE International Conference on Robotics and Automation, Saint Paul, MN, May 14–18 2012, pp. 957–964.
[5] A. I. Mourikis and S. I. Roumeliotis, "A multi-state constraint Kalman filter for vision-aided inertial navigation," in Proc. of the IEEE International Conference on Robotics and Automation, Rome, Italy, Apr. 10–14 2007, pp. 3482–3489.
[6] K. Wu, A. Ahmed, G. Georgiou, and S. I. Roumeliotis, "A square root inverse filter for efficient vision-aided inertial navigation on mobile devices," in Proc. of Robotics: Science and Systems, Rome, Italy, July 13–17 2015.
[7] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, "Keyframe-based visual-inertial odometry using nonlinear optimization," International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, Feb. 2015.
[8] J.-P. Tardif, M. George, M. Laverne, A. Kelly, and A. Stentz, "A new approach to vision-aided inertial navigation," in Proc. of the IEEE International Conference on Intelligent Robots and Systems, Taipei, Oct. 18–22 2010, pp. 4161–4168.
[9] N. de Palézieux, T. Nägeli, and O. Hilliges, "Duo-VIO: Fast, light-weight, stereo inertial odometry," in Proc. of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct. 2016, pp. 2237–2242.
[10] T. Manderson, F. Shkurti, and G. Dudek, "Texture-aware SLAM using stereo imagery and inertial information," in Proc. of the 2016 Conference on Computer and Robot Vision, May 2016.
[11] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in 6th IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007, pp. 225–234.
[12] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software. John Wiley & Sons, 2004.
[13] OKVIS: Open Keyframe-based Visual-Inertial SLAM, http://ethz-asl.github.io/okvis.
[14] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras," arXiv preprint arXiv:1610.06475, 2016.
[15] C. Guo, D. G. Kottas, R. DuToit, A. Ahmed, R. Li, and S. I. Roumeliotis, "Efficient visual-inertial navigation using a rolling-shutter camera with inaccurate timestamps," in Proc. of Robotics: Science and Systems, Berkeley, CA, July 12–16 2014.
[16] M. Trajković and M. Hedley, "Fast corner detection," Image and Vision Computing, vol. 16, no. 2, pp. 75–87, Feb. 1998.
[17] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. of the International Joint Conference on Artificial Intelligence, Vancouver, British Columbia, Aug. 24–28 1981, pp. 674–679.
[18] A. Alahi, R. Ortiz, and P. Vandergheynst, "FREAK: Fast retina keypoint," in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, College Park, MD, June 16–21 2012, pp. 510–517.
[19] L. Kneip, M. Chli, and R. Siegwart, "Robust real-time visual odometry with a single camera and an IMU," in Proc. of the British Machine Vision Conference, Dundee, Scotland, Aug. 29 – Sept. 2 2011, pp. 16.1–16.11.
[20] J. Engel, J. Sturm, and D. Cremers, "Semi-dense visual odometry for a monocular camera," in Proc. of the IEEE International Conference on Computer Vision, Sydney, Australia, Dec. 1–8 2013, pp. 1449–1456.
[21] G. Golub and C. Van Loan, Matrix Computations, 4th ed. Johns Hopkins University Press, 2013.
[22] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, "The EuRoC micro aerial vehicle datasets," The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, Sept. 2016.
[23] C. X. Guo, K. Sartipi, R. C. DuToit, G. A. Georgiou, R. Li, J. O'Leary, E. D. Nerurkar, J. A. Hesch, and S. I. Roumeliotis, "Large-scale cooperative 3D visual-inertial mapping in a Manhattan world," in Proc. of the IEEE International Conference on Robotics and Automation, Stockholm, Sweden, May 16–21 2016, pp. 1071–1078.
[24] Movidius Vision Processing Unit (VPU), https://www.movidius.com/solutions/vision-processing-unit.
[25] Qualcomm Hexagon DSP Processor, https://developer.qualcomm.com/software/hexagon-dsp-sdk.