Article

Unsupervised Deep Learning-Based RGB-D Visual Odometry

Qiang Liu 1,2, Haidong Zhang 2, Yiming Xu 2,* and Li Wang 2
1 Department of Psychiatry, University of Oxford, Warneford Hospital, Oxford OX3 7JX, UK; [email protected]
2 School of Electrical Engineering, Nantong University, Nantong 226000, China; [email protected] (H.Z.); [email protected] (L.W.)
* Correspondence: [email protected]

 Received: 6 July 2020; Accepted: 3 August 2020; Published: 6 August 2020 

Abstract: Recently, deep learning frameworks have been deployed in visual odometry systems and have achieved results comparable to traditional feature-matching-based systems. However, most deep learning-based frameworks inevitably need labeled data as ground truth for training. On the other hand, monocular odometry systems are incapable of restoring absolute scale, so external or prior information has to be introduced for scale recovery. To solve these problems, we present a novel deep learning-based RGB-D visual odometry system. Our two main contributions are: (i) a dual-stream deep neural network is proposed, in which depth images are fed into the network alongside the RGB images during both training and pose estimation; (ii) the system adopts an unsupervised end-to-end training method, so the labor-intensive data labeling task is not required. We have tested our system on the KITTI dataset, and the results show that the proposed RGB-D Visual Odometry (VO) system has obvious advantages over other state-of-the-art systems in terms of both translation and rotation errors.

Keywords: RGB-D sensor; visual odometry; unsupervised deep learning; Recurrent Convolutional Neural Networks

1. Introduction

Visual Odometry (VO) is the process of estimating the position and orientation of a robot or an agent by analyzing consecutive camera images [1]. With VO, positioning can be performed in an environment without prior information. Up to now, VO has been widely applied to various mobile robotic platforms and wearable devices [2]. The performance of a VO system is determined by several factors, e.g., accuracy, time complexity, and robustness. Traditional VO systems deploy a classic feature-based pipeline, i.e., image correction, feature extraction, feature matching, transformation matrix estimation, and pose graph optimization. Recent VO systems [3–6] have started to deploy deep learning frameworks, some of which have already shown promising results and aroused great interest in the computer science and robotics domains. Most deep learning-based VO systems use supervised training models, which means ground-truth camera poses are needed during training. Obtaining ground-truth data requires considerable manpower, material resources, and time. Therefore, some researchers [7–10] have proposed VO systems based on unsupervised learning, which do not rely on ground-truth data. In this way, training data can be easily obtained. In addition, the pose estimation accuracy can be further increased by training on unlabeled datasets, which are significantly larger and widely available.

All of the unsupervised methods mentioned above are monocular VO systems, which are easy and cheap to set up. However, systems using only monocular images are incapable of recovering absolute scale, so prior or external information has to be introduced for scale recovery. Prior information directly provides a Euclidean distance to VO systems, e.g., mounting a camera on a car at a fixed height [11]. However, such a system inevitably suffers from noise due to car bumps. Some systems integrate monocular images with other data sources, e.g., an Inertial Measurement Unit (IMU) [12], an omnidirectional camera [13], or a stereo camera [14]. In the last decade, range-sensing devices have been widely used in various applications, such as self-driving cars, somatosensory games, and even smartphones, so RGB-D data is easily acquirable. The combination of RGB images and depth data has been adopted by many researchers [15–17] and normally yields a performance boost. In our previous work [10], the VO system was monocular and adopted a single-channel neural network structure. In this paper, an RGB-D camera is used to provide the system with RGB images and depth images at the same time. The depth images inject absolute scale information, and the two image modalities are treated equally within the system. During neural network training and pose estimation, a dual-stream neural network processes the RGB images and depth images. Figure 1 shows the proposed system. Our experimental results indicate that adding the additional depth stream affects both the accuracy and the time complexity of the VO system. On one hand, the translation and rotation errors are greatly reduced, and the accuracy becomes comparable to that of a stereo VO system. On the other hand, the operating speed of the system decreases slightly.

Figure 1. Overview of the proposed visual odometry system.

The system benefits from an unsupervised training framework, under which ground-truth poses in any form are not required for training. During training, transformation matrices generated by the network are used for loss calculation. During testing, consecutive RGB-D images are fed into the network, which directly yields poses on an absolute scale. Then, the translation error and rotation error of the system are obtained by comparing the poses of our system with the ground-truth poses. Experiments have been carried out on the KITTI odometry dataset [18], and the results show that our VO system outperforms other state-of-the-art VO systems in terms of both translation and rotation accuracy. The additional depth information can reduce the pose estimation errors by more than half. The remainder of this paper is organized as follows. Section 2 reviews recent literature on visual odometry. Section 3 details the proposed VO system and its implementation. Experimental results, including comparisons with other VO systems, are presented in Section 4. Finally, we conclude our work and propose ideas for future research.

2. Related Work

Visual odometry solves the problem of estimating the trajectory of a robot by using associated camera images as the main input source. The term was introduced by Nistér et al. [1] in 2004 and has attracted enormous attention from the computer science and robotics domains. The sensors of a VO system are mainly cameras; currently, the mainstream options are monocular cameras [19], stereo cameras [20], and omnidirectional cameras [21]. A monocular VO system has a simple structure, low cost, and easy deployment, but it cannot provide absolute scale information [22]. A stereo camera can obtain absolute scale information by calculating the parallax between its two views, but this consumes considerable computational resources. On this basis, some scholars have proposed integrating external sensors, such as IMUs [23] and LiDAR [24,25], into VO systems and improving their performance through multi-sensor fusion techniques. On the other hand, an RGB-D camera can obtain both image and depth information based on structured light or time-of-flight (TOF) techniques [26]. With the advent of low-budget RGB-D sensors such as the Kinect, ASUS Xtion, and Intel SR300 [27], the combination of monocular images and depth information has become increasingly popular. Traditional RGB-D odometry systems [28] deploy a feature-based pipeline. Image features such as Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Oriented FAST and Rotated BRIEF (ORB) are first extracted. Descriptor matching, Random Sample Consensus (RANSAC), and bundle adjustment are commonly used for motion estimation. Associated depth information is then used to inject the absolute scale. Huang et al. [29] applied such a system to 3D flight in cluttered environments. Steinbrücker et al. [30] proposed an energy-based approach as an alternative, which infers odometry from dense RGB-D images. The same idea was also adopted by Kerl et al. [15], who minimized both photometric and depth errors over all pixels. Generally, the structures and parameters of these methods need to be carefully designed and fine-tuned to achieve optimal performance. Recently, attention has been drawn to deep neural networks, which have shown remarkable results in most computer science related tasks [31]. These VO systems deploy an end-to-end training approach, i.e., they require little or no pre-processing of input images and directly output transformation matrices. Networks that were originally designed for other recognition or classification tasks are often leveraged, since the features extracted by the lower layers of a particular network are transferable. The landmark network PoseNet [3], proposed by Kendall et al., first applied Convolutional Neural Networks (CNN) to visual odometry. PoseNet takes monocular images as input and directly outputs a 7-dimensional pose vector representing position and orientation in real time, relying on the GoogLeNet [32] architecture. GoogLeNet, pre-trained on the ImageNet and Places datasets, is modified for this task: the softmax classifiers are removed and a fully connected layer is inserted. Li et al. [33] improved PoseNet by introducing depth information as a second stream of the network. Wang et al. [34] further extended the CNN-based PoseNet by incorporating a Recurrent Neural Network (RNN) into their system. The CNN is used to extract features across space, while the RNN is used to match features across time.
However, these systems require labeled ground-truth camera poses to train the neural networks, which is labor-intensive and time-consuming. Zhou et al. [7] initiated research on unsupervised frameworks and proposed a network for monocular depth estimation that can also be used to output camera poses, which sparked a wave of research on unsupervised VO systems. However, since absolute scale information cannot be obtained using only a monocular camera, subsequent processing must be performed on the obtained camera poses. Mahjourian et al. [9] also presented an unsupervised framework using 3D geometric constraints, which is similar to the Iterative Closest Point (ICP) technique. Liu et al. [10] proposed an unsupervised monocular visual odometry system, which still needs depth information for training. Therefore, we incorporate the additional depth information into our system for both network training and pose estimation to inject absolute scale and, meanwhile, boost the pose estimation accuracy. In contrast to recent research [8,10,33], this paper focuses on the additional depth information and further discusses the advantages and disadvantages of adding it.

3. The Proposed Approach

In this section, we first introduce the structure of our proposed VO system and explain the role of each unit. Then the loss function for neural network training is proposed.

3.1. System Architecture

In this paper, we further adopt the method presented in [34], which combines the advantages of a CNN and an RNN into a Recurrent Convolutional Neural Network (RCNN). Since VO is the process of estimating a camera's pose from changes between consecutive frames, the input of the CNN is formed by stacking sequences of adjacent image frames. In addition, in order to inject absolute scale information, we also feed the depth images into the network. The RGB images and depth images share the same CNN network structure. Figure 2 shows the network architecture of the front-end dual-stream CNN, which extracts image features from the RGB stream and the depth stream, respectively. The additional depth channel is highlighted in green. Each time, two consecutive frames of RGB-D image pairs are used for training. Each RGB image is resized to 416 × 128 × 3 along its color channels, and consecutive frames are stacked together into a 416 × 128 × 6 tensor. Each depth image is first converted from a 3D point cloud to a single-channel 2D depth image, in which each pixel represents the distance from the viewpoint to the scene object. Its single channel is then duplicated into a 3-channel depth image and stacked with that of the adjacent frame, again forming a 416 × 128 × 6 tensor. The RGB stream and the depth stream form a dual-stream CNN structure. The gray cuboids and the green cuboids in Figure 2 represent the RGB stream and the depth stream, respectively.
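As an illustration, the following is a minimal sketch of this input preparation, assuming NumPy arrays, a camera intrinsics matrix K, and RGB frames already resized to 128 × 416; the helper names are ours and the authors' actual preprocessing code may differ.

```python
import numpy as np

def depth_to_image(points, K, height=128, width=416):
    """Project a 3D point cloud (N x 3, camera frame) to a single-channel
    2D depth image whose pixels store the distance to the scene."""
    depth = np.zeros((height, width), dtype=np.float32)
    z = points[:, 2]
    valid = z > 0
    uvw = (K @ points[valid].T).T                  # pinhole projection
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth[v[inside], u[inside]] = z[valid][inside]
    return depth

def make_network_inputs(rgb_t, rgb_t1, depth_t, depth_t1):
    """Stack two consecutive frames into the 416 x 128 x 6 tensors fed to
    the RGB stream and the depth stream, respectively."""
    rgb_pair = np.concatenate([rgb_t, rgb_t1], axis=-1)            # H x W x 6
    d_t = np.repeat(depth_t[..., None], 3, axis=-1)                # duplicate channel
    d_t1 = np.repeat(depth_t1[..., None], 3, axis=-1)
    depth_pair = np.concatenate([d_t, d_t1], axis=-1)              # H x W x 6
    return rgb_pair, depth_pair
```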


Figure 2. Architecture of the Convolutional Neural Network with input images.

The design of the CNN is based on the network developed by Oxford's Visual Geometry Group (VGG) [35]. We modify the VGG model to fit the VO system proposed in this paper. Both the RGB stream and the depth stream of the CNN begin with a 7 × 7 convolutional layer, followed by a 5 × 5 convolutional layer, both of which are used to capture basic components in the RGB-D images. The kernel size of the remaining five convolutional layers is reduced to 3 × 3 to capture fine local details. Since the input images are relatively small, there is no need to use a large stride to rapidly reduce the image dimensions; in the proposed network, the stride is 2. Zero-padding is also deployed to prevent the image size from shrinking too quickly, and the padding size decreases along with the kernel size. A Rectified Linear Unit (ReLU) activation function follows each convolutional layer to introduce non-linearity into the network. Conventional batch normalization is not applied in our system, since it resulted in slow and unstable loss convergence in our experiments. Table 1 lists the filter size, stride, padding, and channel number of each convolutional layer in detail. After the dual-stream CNN outputs 4 × 1 × 512 RGB features and 4 × 1 × 512 depth features, they are reshaped into two 2048-dimensional feature vectors. In this way, the absolute scale information is captured by the neural network through the additional depth channel.

Table 1. Architectures of convolutional layers.

Layer           Conv1   Conv2   Conv3   Conv4   Conv5   Conv6   Conv7
Filter Size     7 × 7   5 × 5   3 × 3   3 × 3   3 × 3   3 × 3   3 × 3
Stride          2       2       2       2       2       2       2
Padding         3       2       1       1       1       1       1
Channel Number  16      32      64      128     256     256     512
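For illustration, the following is a minimal tf.keras sketch of one CNN stream following Table 1 (the RGB and depth streams share this structure); "same" padding is used here as a close stand-in for the explicit zero-padding listed above, and the layer names are our own.

```python
import tensorflow as tf

def build_cnn_stream(name):
    """One stream of the dual-stream CNN (cf. Table 1): seven convolutional
    layers, stride 2, ReLU activations, and no batch normalization."""
    filters = [16, 32, 64, 128, 256, 256, 512]
    kernels = [7, 5, 3, 3, 3, 3, 3]
    inputs = tf.keras.Input(shape=(128, 416, 6), name=f"{name}_input")
    x = inputs
    for i, (f, k) in enumerate(zip(filters, kernels)):
        x = tf.keras.layers.Conv2D(f, k, strides=2, padding="same",
                                   activation="relu",
                                   name=f"{name}_conv{i + 1}")(x)
    # Final feature map is 1 x 4 x 512, flattened to a 2048-dimensional vector.
    x = tf.keras.layers.Flatten(name=f"{name}_flatten")(x)
    return tf.keras.Model(inputs, x, name=f"{name}_stream")

rgb_stream = build_cnn_stream("rgb")      # gray cuboids in Figure 2
depth_stream = build_cnn_stream("depth")  # green cuboids in Figure 2
```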

The rear-end RNN, as shown in Figure 3, can be viewed as a pose estimator across the whole image sequence. First, a concatenation layer joins the RGB and depth feature vectors together, and a fully connected layer follows to balance the weights of these two features. The feature vectors are then ready for the RNN. Since the current inference is highly dependent on the previously estimated position and orientation, we apply an RNN to learn patterns across the entire input image sequence. A standard RNN, however, suffers from long-term dependency problems, so we use Long Short-Term Memory (LSTM) units instead in our system.

Figure 3. Architecture of the Recurrent Neural Network with output poses (Concat: concatenation layer; LSTM: Long Short-Term Memory; FC: fully connected layer; t: translation; R: rotation).

An LSTM unit is composed of a cell and three regulators, namely the input, output, and forget gates. The cell remembers values over certain time intervals, and the three gates regulate the flow of information into and out of the cell, which is the key to an LSTM unit and is shown by the green arrows in Figure 3. The number of hidden units in the LSTM is 256, and no projection layer is applied. The bias of the forget gate is set to 1, as recommended in [36]. The vectors generated by the RNN contain rich transformation information within the sequence of images. Most deep learning-based VO systems [7,34] output translation and rotation as a single vector. In contrast, based on the observation that rotation generally has higher nonlinearity than translation, we separate the rotation information and translation information into two independent vectors [37]. We then feed the two independent vectors into fully connected layers of dimension 256, followed by an exponential linear unit (ELU) activation function. The final output is a 3D translation vector in Euclidean space and a rotation vector represented by Euler angles.
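Under the same assumptions as above, a tf.keras sketch of this rear-end pose estimator: the RGB and depth feature vectors are concatenated, balanced by a fully connected layer, passed through stacked LSTM layers with 256 hidden units, and split into separate translation and rotation heads with ELU activations; the exact layer sizes and wiring of the published model may differ.

```python
import tensorflow as tf

def build_pose_rnn(seq_len):
    """Rear-end recurrent pose estimator: concatenate the two 2048-dimensional
    feature vectors, model the sequence with LSTMs, and regress translation
    and rotation through separate heads."""
    rgb_feat = tf.keras.Input(shape=(seq_len, 2048), name="rgb_features")
    depth_feat = tf.keras.Input(shape=(seq_len, 2048), name="depth_features")

    x = tf.keras.layers.Concatenate(axis=-1)([rgb_feat, depth_feat])  # 4096-dim
    x = tf.keras.layers.Dense(256)(x)              # balance the two feature streams
    x = tf.keras.layers.LSTM(256, return_sequences=True,
                             unit_forget_bias=True)(x)  # forget-gate bias of 1
    x = tf.keras.layers.LSTM(256, return_sequences=True,
                             unit_forget_bias=True)(x)

    # Rotation is more nonlinear than translation, so the two are regressed
    # by independent fully connected heads with ELU activations.
    t = tf.keras.layers.Dense(256, activation="elu")(x)
    t = tf.keras.layers.Dense(3, name="translation")(t)
    r = tf.keras.layers.Dense(256, activation="elu")(x)
    r = tf.keras.layers.Dense(3, name="rotation")(r)
    return tf.keras.Model([rgb_feat, depth_feat], [t, r], name="pose_rnn")
```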

3.2. Loss Function

The RCNN generates estimated rotation and translation vectors from pairs of RGB-D images [38]. In order to compare these outputs with the true values and drive the predicted results closer to them, we need to design a loss function. The loss function here consists of two parts, a 2D space loss and a 3D space loss, which are calculated from the RGB-D sequences and the predicted transformation matrices. Algorithm 1 shows the overall procedure in detail. Let I_1, I_2, ..., I_t, I_{t+1}, ..., I_N be a batch of RGB images ordered by time, D_1, D_2, ..., D_t, D_{t+1}, ..., D_N be the corresponding depth images, and Q_1, Q_2, ..., Q_t, Q_{t+1}, ..., Q_N be the point clouds back-projected from the depth images. Figure 4 shows the loss calculation process. The RGB-D images at times t and t + 1 generate a predicted transformation matrix \hat{T}_{t,t+1} through the RCNN. All pixels in image I_t are warped according to D_t and \hat{T}_{t,t+1} to obtain the image \hat{I}_{t+1}. The 2D space loss describes the difference between I_{t+1} and \hat{I}_{t+1}. The 3D space loss is calculated in a similar way: all points in the point cloud Q_t are projected to \hat{Q}_{t+1} through the transformation matrix \hat{T}_{t,t+1}, and the difference between Q_{t+1} and \hat{Q}_{t+1} is the 3D space loss.

Figure 4. Loss calculation for training.

Algorithm 1: Implementation of the RCNN and loss functions.

Input:  Consecutive RGB images {I_1, I_2, ..., I_N}
        Associated depth images {D_1, D_2, ..., D_N}
Output: Transformation vectors

function prepare_Training_Data
    for i in (1 : N + 1) do
        if i > (N_seq - 1)/2 and i < N - (N_seq - 1)/2 then
            split I_i along the color channel and resize it to 416 × 128 × 3;
            project the 3D point cloud to the 2D depth image D_i;
            copy D_i and resize it to 416 × 128 × 3;
            stack I_i and D_i;
            save the camera intrinsics matrix to file;
        end
    end
    split the data into two parts for training and testing;
end

function build_Training_Graph
    prepare the training data and camera intrinsics matrix path;
    design data augmentation based on luminance γ, scale s_x, s_y and rotation r_d;
    build the network;
    input functions for the total loss L = α_2 L_2 + α_3 L_3;
end

function train_RCNN_Model
    load hyperparameters;
    set thres_Epoch = 40 based on experimental experience;
    for epoch in (1 : thres_Epoch) do
        train the RCNN for one epoch and update the network weights;
        reduce the learning rate after half of the epochs;
    end
end

3.2.1. 2D Spatial Loss

Let p_t be a pixel in the RGB image I_t, and P_t be the point in 3D space corresponding to p_t. According to the pinhole camera model, the relationship between them can be described as

P_t = d_t K^{-1} p_t,    (1)

where d_t is the depth value of the pixel p_t in the depth image D_t, and K is the camera intrinsic matrix. Then, based on the transformation relationship of the 3D coordinate system, the transformation matrix \hat{T}_{t,t+1} obtained by the RCNN is used to predict the point \hat{P}_{t+1} at time t + 1, which can be calculated by

\hat{P}_{t+1} = \hat{R}_{t,t+1} d_t K^{-1} p_t + \hat{t}_{t,t+1},    (2)

where \hat{R}_{t,t+1} and \hat{t}_{t,t+1} are the rotation matrix and translation vector, respectively, which can be derived from the transformation matrix \hat{T}_{t,t+1}. Next, according to the pinhole camera model, the predicted point \hat{P}_{t+1} is projected back into a predicted RGB image \hat{I}_{t+1} by

\hat{p}_{t+1} = \frac{1}{\hat{d}_{t+1}} K (\hat{R}_{t,t+1} d_t K^{-1} p_t + \hat{t}_{t,t+1}),    (3)

where \hat{d}_{t+1} is the z-axis coordinate of \hat{P}_{t+1}, which is also the depth value of the pixel \hat{p}_{t+1}. We perform the above calculation on each pixel in the RGB image I_t to obtain the predicted RGB image \hat{I}_{t+1}. For pixels with null values, bilinear interpolation is used to estimate them. Using the same method, we can obtain \hat{I}_t by

\hat{p}_t = \frac{1}{\hat{d}_t} K (\hat{R}_{t+1,t} d_{t+1} K^{-1} p_{t+1} + \hat{t}_{t+1,t}),    (4)

where \hat{p}_t is a pixel in \hat{I}_t, and \hat{R}_{t+1,t}, \hat{t}_{t+1,t} are the rotation matrix and translation vector obtained from the transformation matrix \hat{T}_{t+1,t}. The relationship between \hat{T}_{t+1,t} and \hat{T}_{t,t+1} can be described as

\hat{T}_{t+1,t} = \hat{T}_{t,t+1}^{-1}    (5)

Finally, the 2D spatial loss is derived by

L_2 = \sum_{t=1}^{N-1} \left( |I_t - \hat{I}_t| + |I_{t+1} - \hat{I}_{t+1}| \right)    (6)
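As an illustration only, here is a simplified NumPy sketch of Equations (1)–(3) and (6), assuming dense depth, using nearest-neighbour rounding instead of the bilinear interpolation described above, and using helper names of our own.

```python
import numpy as np

def warp_image(I_t, D_t, K, R_hat, t_hat):
    """Synthesize I_hat_{t+1}: back-project every pixel of I_t with its depth
    (Eq. 1), transform it with the predicted pose (Eq. 2), and re-project it
    through the pinhole model (Eq. 3)."""
    h, w = D_t.shape
    K_inv = np.linalg.inv(K)
    v, u = np.mgrid[0:h, 0:w]
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(np.float64)
    P_t = D_t.reshape(1, -1) * (K_inv @ pix)        # Eq. (1): P_t = d_t K^-1 p_t
    P_t1 = R_hat @ P_t + t_hat.reshape(3, 1)        # Eq. (2)
    p_t1 = K @ P_t1                                 # Eq. (3), up to the 1/d factor
    z = np.maximum(p_t1[2], 1e-6)                   # predicted depth d_hat_{t+1}
    u1 = np.round(np.clip(p_t1[0] / z, -1, w)).astype(np.int64)
    v1 = np.round(np.clip(p_t1[1] / z, -1, h)).astype(np.int64)

    I_hat = np.zeros_like(I_t)
    ok = (p_t1[2] > 0) & (u1 >= 0) & (u1 < w) & (v1 >= 0) & (v1 < h)
    I_hat[v1[ok], u1[ok]] = I_t.reshape(-1, I_t.shape[-1])[ok]
    return I_hat

def loss_2d(I_t, I_t1, I_t_hat, I_t1_hat):
    """Eq. (6): photometric difference accumulated in both directions."""
    return np.abs(I_t - I_t_hat).sum() + np.abs(I_t1 - I_t1_hat).sum()
```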

3.2.2. 3D Spatial Loss

The difference between the predicted value and the true value is also calculated in 3D space. Let Q_t, Q_{t+1} be the point clouds back-projected into 3D space from the depth images D_t, D_{t+1}, and let q_t, q_{t+1} be the corresponding points in Q_t, Q_{t+1}. Based on the transformation relationship of the 3D coordinate system, we have

\hat{q}_{t+1} = \hat{R}_{t,t+1} q_t + \hat{t}_{t,t+1},    (7)

where \hat{R}_{t,t+1} and \hat{t}_{t,t+1} are the rotation matrix and translation vector derived from the transformation matrix \hat{T}_{t,t+1}. By applying this to each point in the point cloud, a predicted point cloud \hat{Q}_{t+1} at time t + 1 is constructed. Similarly, the point cloud \hat{Q}_t at time t can be constructed. Then, the Iterative Closest Point (ICP) [38] method is used to match the points in the point clouds Q_t and \hat{Q}_t, and Q_{t+1} and \hat{Q}_{t+1}. The 3D space loss can be described as

L_3 = \sum_{t=1}^{N-1} \left( |Q_t - \hat{Q}_t| + |Q_{t+1} - \hat{Q}_{t+1}| \right)    (8)

Finally, let α_2 and α_3 be the weights of the 2D space loss and the 3D space loss; the total loss function can then be written as

L = α_2 L_2 + α_3 L_3    (9)
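A corresponding sketch of Equations (7)–(9) under the same assumptions; for brevity, the ICP matching step is replaced by a simple index-wise pairing of points, and the loss weights shown are placeholders rather than the values used in the paper.

```python
import numpy as np

def transform_cloud(Q, R_hat, t_hat):
    """Eq. (7): apply the predicted rigid transform to every point of Q (N x 3)."""
    return (R_hat @ Q.T).T + t_hat.reshape(1, 3)

def loss_3d(Q_t, Q_t1, R_hat, t_hat):
    """Eq. (8), with index-wise correspondences standing in for ICP matching."""
    Q_t1_hat = transform_cloud(Q_t, R_hat, t_hat)                # predicted cloud at t+1
    Q_t_hat = transform_cloud(Q_t1, R_hat.T, -R_hat.T @ t_hat)   # inverse transform, cf. Eq. (5)
    return np.abs(Q_t1 - Q_t1_hat).sum() + np.abs(Q_t - Q_t_hat).sum()

def total_loss(l2, l3, alpha2=1.0, alpha3=0.1):
    """Eq. (9): weighted sum of the 2D and 3D space losses."""
    return alpha2 * l2 + alpha3 * l3
```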

4. Experiments

We present our experiments in this section. We first introduce our training strategy and network hyperparameters. We then conduct a quantitative and qualitative evaluation of the translation and rotation errors of the proposed VO system compared to other state-of-the-art VO systems, namely SfMLearner [7], UnMono [10], and VISO2-Stereo [39]. Finally, the advantages and disadvantages of our proposed VO system are discussed in terms of accuracy and time complexity.

4.1. Training

We implemented our code using Google's open-source machine learning library, TensorFlow, and completed training on a DELL desktop with an Intel Core i7-4790K @4.0 GHz CPU and an Nvidia GeForce GTX Titan X GPU with 12 GB of memory (Xiamen, China). We used the KITTI [18] Visual Odometry dataset for training and testing. KITTI Odometry consists of 22 video sequences captured by driving around a midsize city, in rural areas, and on highways. Sequences 00–10 are provided with ground truth for quantitative analysis, whereas sequences 11–21 are used for qualitative analysis. In order to compare with SfMLearner and UnMono, all images are resized to 416 × 128 × 3 before being fed into the neural network. Because the training dataset is relatively small, we deployed online data augmentation. Through the following steps (a code sketch is given at the end of this subsection), we artificially created more training images.

1. We randomly adjusted the RGB image brightness by gamma adjustment, with the encoding/decoding gamma value in the range γ ∈ [0.7, 1.3]. Depth images were not processed.
2. We randomly adjusted the size of image pairs, with scale factors for the x and y axes of s_x, s_y ∈ [1.0, 1.2].
3. The image pairs were randomly rotated about the image center, with the rotation angle r ∈ [−5, 5] degrees.
4. Based on the image center, the excess part of each image was removed and the missing part was interpolated. Finally, the image size was adjusted to 416 × 128.

After data preparation, the RGB-D images were fed into the network for training. Thanks to the unsupervised deep learning framework, the ground truth provided with sequences 00–10 was not used for training, but only for performance evaluation. The batch size was 32, and the number of RGB-D images in each batch (sequence) was 5. In order to better complete the training, we deployed the Adam adaptive learning rate optimizer [40] and set the exponential decay rates for the first and second moment estimates to β_1 = 0.9 and β_2 = 0.999, respectively. The RCNN iterated for 40 epochs, and the total loss usually stabilized after 20 epochs. The subsequent epochs were used to fine-tune the network; thus, we decreased the learning rate from the initial value of 0.0002 to 0.0001 after half of the training epochs.
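The sketch below illustrates these augmentation steps with NumPy and OpenCV; the random ranges follow the text above, while the function name and the exact crop/interpolation details are our own assumptions.

```python
import cv2
import numpy as np

def augment_pair(rgb, depth, rng=np.random):
    """Online augmentation for one RGB-D pair: gamma jitter (RGB only),
    random scaling, random rotation about the image centre, centre crop."""
    h, w = 128, 416

    # 1. Random gamma adjustment of the RGB image, gamma in [0.7, 1.3].
    gamma = rng.uniform(0.7, 1.3)
    rgb = np.clip(((rgb / 255.0) ** gamma) * 255.0, 0, 255).astype(np.uint8)

    # 2. Random scaling with sx, sy in [1.0, 1.2], applied to both images.
    sx, sy = rng.uniform(1.0, 1.2, size=2)
    rgb = cv2.resize(rgb, (int(w * sx), int(h * sy)))
    depth = cv2.resize(depth, (int(w * sx), int(h * sy)),
                       interpolation=cv2.INTER_NEAREST)

    # 3. Random rotation about the image centre, angle in [-5, 5] degrees.
    angle = rng.uniform(-5.0, 5.0)
    centre = (rgb.shape[1] // 2, rgb.shape[0] // 2)
    M = cv2.getRotationMatrix2D(centre, angle, 1.0)
    rgb = cv2.warpAffine(rgb, M, (rgb.shape[1], rgb.shape[0]))
    depth = cv2.warpAffine(depth, M, (depth.shape[1], depth.shape[0]),
                           flags=cv2.INTER_NEAREST)

    # 4. Centre crop back to 416 x 128.
    y0, x0 = (rgb.shape[0] - h) // 2, (rgb.shape[1] - w) // 2
    return rgb[y0:y0 + h, x0:x0 + w], depth[y0:y0 + h, x0:x0 + w]
```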

4.2. Performance Evaluation

The performance evaluation of the VO system was carried out on the KITTI Odometry dataset. Only sequences 00–10 have ground truth and were therefore used for quantitative analysis; sequences 11–20 were used for additional qualitative analysis. We compared our system with SfMLearner, UnMono, and VISO2-Stereo. Our system, SfMLearner, and UnMono are unsupervised deep learning-based systems, whereas VISO2-Stereo is feature-based. SfMLearner poses were post-processed with ground-truth depth information for scale recovery. SfMLearner and UnMono only used RGB images for inference, whereas we used RGB-D images. Since we were only evaluating odometry performance, loop closure was not deployed. The same parameters were applied to all sequences for each VO system.

Figure 5. Sequence 00–10 trajectories: (a) Sequence 00; (b) Sequence 02; (c) Sequence 03; (d) Sequence 04; (e) Sequence 05; (f) Sequence 06; (g) Sequence 07; (h) Sequence 09; (i) Sequence 10.

The KITTI dataset was collected from a moving car. Since the vertical motion of the car is relatively small during driving, we ignore it; in other words, to better show the results in the figures, we omit the z-axis. The trajectories of KITTI Odometry sequences 00 to 10 are presented in Figure 5, where the black curves represent the ground-truth trajectories. The closer a trajectory is to the black curve, the better the system performance. It is obvious that our system and UnMono perform better than SfMLearner in all sequences. The deviation mainly occurs when the car makes a turn; in other words, the accuracy of the predicted rotation matrix plays a vital role. Since no loop closure was applied, the noise accumulates and the deviation gradually becomes obvious. From Figure 5d,e,g, we can see that the trajectories predicted by our system using additional depth information stay closer to the ground truth than those of UnMono, which uses only monocular images. The translation and rotation errors are reduced by more than half. Another point to notice from Figure 5h,i is that although sequences 09 and 10 were not used for training, the system still performs well on them, which indicates that the model is transferable to other similar scenes.

Figure 6. Sequence 00–10 cumulative histograms of translation error: (a) Sequence 00; (b) Sequence 02; (c) Sequence 03; (d) Sequence 04; (e) Sequence 05; (f) Sequence 06; (g) Sequence 07; (h) Sequence 09; (i) Sequence 10.

We further carried out a quantitative analysis of the transformation matrix estimation errors between neighboring frames. All errors were calculated based on two consecutive frames. Figure 6 depicts the cumulative histograms of the translation errors of our VO system, SfMLearner, and UnMono for the aforementioned sequences. These figures show the distribution of translation errors in different ranges. When the translation errors are concentrated at smaller values, the curves in the cumulative histogram approach the upper left corner, which indicates better translation estimation. In general, our VO system and UnMono perform better than SfMLearner. Furthermore, the maximum translation error of our system is relatively lower. In addition, our VO system is slightly better than UnMono, which is most evident in Figure 6c–e. In Figure 6c, the translation errors of our VO system are all less than 0.2 m, whereas translation errors below 0.2 m account for 91% in UnMono and only 35% in SfMLearner.

Figure 7. Sequence 00–10 cumulative histograms of rotation error: (a) Sequence 00; (b) Sequence 02; (c) Sequence 03; (d) Sequence 04; (e) Sequence 05; (f) Sequence 06; (g) Sequence 07; (h) Sequence 09; (i) Sequence 10.

Figure 7 depicts the cumulative histograms of the rotation errors of our VO system, SfMLearner, and UnMono. Similar to the evaluation of translation errors, curves closer to the upper left corner indicate higher accuracy in rotation estimation. In all sequences, our VO system performs better than UnMono and SfMLearner, which is most evident in Figure 7a,e,h. If we revisit the trajectories of these three sequences in Figure 5, we can see that they contain more turns. In other words, our VO system outperforms UnMono and SfMLearner significantly in scenes with more corners. A detailed quantitative analysis is shown in Table 2, which lists the Root Mean Square Error (RMSE) of translation and rotation. The values are averaged over all available errors on lengths of 100, 200, ..., 800 m. The smallest trmse and rrmse × 100 in each sequence are achieved by either VISO2-Stereo or our system. Among the unsupervised deep learning systems, our system produces the lowest translation and rotation errors.

Table 2. Translational and rotational errors of our system, SfMLearner, UnMono, and VISO2-Stereo. Our system, SfMLearner, and UnMono are unsupervised deep learning-based; VISO2-Stereo is feature-based. SfMLearner poses were post-processed with ground-truth depth information for scale recovery. Sequences 09 and 10 were not used for training.

        SfMLearner          UnMono              VISO2-Stereo        Ours
        (Monocular,         (Monocular,         (Stereo,            (RGB-D,
        Unsupervised)       Unsupervised)       Feature-Based)      Unsupervised)

Seq.    trmse  rrmse×100    trmse  rrmse×100    trmse  rrmse×100    trmse  rrmse×100
00      45.89  6.23         5.14   2.13         1.86   0.58         2.97   1.53
02      57.59  4.09         4.88   2.26         2.01   0.40         1.83   1.71
03      13.08  3.79         6.03   1.83         3.21   0.73         3.21   1.15
04      10.86  5.13         2.15   0.89         2.12   0.24         0.94   0.57
05      16.76  4.06         3.84   1.29         1.53   0.53         2.31   1.03
06      23.53  4.80         4.64   1.21         1.48   0.30         1.25   0.80
07      17.52  5.38         3.80   1.71         1.85   0.78         1.47   1.31
08      24.02  3.06         2.95   1.58         1.92   0.55         1.91   1.20
09      22.27  3.62         5.59   2.57         1.99   0.53         1.86   0.50
10      14.36  3.98         4.76   2.95         1.17   0.43         1.15   1.17
mean    24.59  4.41         4.38   1.84         1.91   0.51         1.89   1.10

trmse (%): average translational RMSE drift (%) on lengths of 100–800 m; rrmse (°/m): average rotational RMSE drift (°/m) on lengths of 100–800 m.
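For reference, a simplified sketch of how such drift values can be computed from ground-truth and estimated poses; the official KITTI evaluation averages over many fixed-length subsequences (100–800 m) and measures segment length along the driven path, whereas this illustration evaluates a single subsequence and approximates its length by the straight-line ground-truth displacement.

```python
import numpy as np

def segment_drift(T_gt_a, T_gt_b, T_est_a, T_est_b):
    """Translational drift (%) and rotational drift (deg/m) over one
    subsequence delimited by poses a and b (4 x 4 homogeneous matrices)."""
    rel_gt = np.linalg.inv(T_gt_a) @ T_gt_b        # ground-truth relative motion
    rel_est = np.linalg.inv(T_est_a) @ T_est_b     # estimated relative motion
    err = np.linalg.inv(rel_est) @ rel_gt          # residual pose error

    seg_len = np.linalg.norm(rel_gt[:3, 3])        # approximate segment length (m)
    t_err = np.linalg.norm(err[:3, 3])             # translational error (m)
    cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.degrees(np.arccos(cos_angle))       # rotational error (deg)

    return 100.0 * t_err / seg_len, r_err / seg_len
```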

Trajectories of sequences 11–20 are shown in Figure 8. Since the ground truth of these sequences is not published, we only conducted a qualitative analysis and took VISO2-Stereo as a reference. The trajectories of VISO2-Stereo, our system, and UnMono are represented by black, red, and blue curves, respectively. From Figure 8d,h,i, we can see that our trajectories stay closer to VISO2-Stereo than those of UnMono, which again demonstrates the importance of the additional depth information.

Figure 8. Sequence 11–20 trajectories: (a) Sequence 11; (b) Sequence 12; (c) Sequence 13; (d) Sequence 14; (e) Sequence 15; (f) Sequence 16; (g) Sequence 18; (h) Sequence 19; (i) Sequence 20.

4.3. Discussion

Compared to our previous work, the main contribution of this paper is the incorporation of an additional depth channel into the unsupervised VO system. The depth information is used for both network training and pose estimation. Experiments have shown that the proposed dual-stream network significantly reduces the translation and rotation errors during pose estimation. According to Table 2, the average RMSE of translation estimation (trmse) over sequences 00–10 decreases from 4.38% to 1.89%, a drop of 56%, and the average RMSE of rotation (rrmse) × 100 decreases from 1.84°/m to 1.10°/m, a drop of 40%.

However, the system incurs additional computational cost at the same time. In order to evaluate time complexity, we tested our VO system, UnMono, SfMLearner, and VISO2-Stereo on a DELL computer with an Nvidia GeForce GTX 980 GPU (Xiamen, China) with 4 GB of memory. Figure 9 shows the time required for each camera pose estimation. It should be noted that methods based on deep learning generally require more time than traditional feature-based methods. Our system, SfMLearner, and UnMono require a GPU for evaluation, whereas VISO2-Stereo relies on a single CPU and achieves a faster speed. After incorporating the depth channel, the time required for each camera pose estimation increased from 87.3 ms to 102.7 ms, an increase of 18%. In some cases, it is acceptable to sacrifice a small amount of computational power in order to effectively improve accuracy. However, computational efficiency is also an essential factor that should not be ignored, and in the future, we will adjust our algorithm to further improve it.

Figure 9. Running time.

On the other hand, compared to the feature-based stereo VO system VISO2-Stereo, the translation error of our system is slightly smaller, but the rotation error is about twice as high. This is because rotation is more highly nonlinear than translation and is therefore more difficult to learn. One possible solution is to increase the size of the training dataset and increase the proportion of rotational scenes.

5. Conclusions

This paper proposed a deep learning-based VO system, which can be viewed as a complement to traditional feature-based systems. RGB-D data is used for training and pose inference, and the system adopts an unsupervised training framework. Experiments have shown that, compared to monocular VO systems, the additional depth information enhances the pose estimation accuracy and reduces errors roughly by half. Compared to other deep learning-based methods, our system directly yields predictions on an absolute scale without any scale post-processing. However, due to the additional depth information used for pose estimation, it is less computationally efficient than other VO systems. In the future, we intend to take full advantage of the depth information and continue improving the system's robustness under challenging circumstances, especially when blurry images and insufficient lighting conditions exist.

Author Contributions: Conceptualization, Q.L. and Y.X.; Data curation, Q.L. and H.Z.; Formal analysis, L.W.; Investigation, H.Z.; Methodology, Y.X.; Resources, Q.L. and H.Z.; Software, H.Z.; Supervision, L.W.; Validation, Q.L. and H.Z.; Visualization, H.Z. and L.W.; Writing—original draft, Q.L.; Writing—review & editing, H.Z. and L.W. All authors have read and agreed to the published version of the manuscript. Funding: This research was funded by the National Natural Science Foundation of China grant number 61973178. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Nister, D.; Naroditsky, O.; Bergen, J. Visual odometry. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 27 June–2 July 2004.
2. Mukhopadhyay, S.C. Wearable Sensors for Human Activity Monitoring: A Review. IEEE Sens. J. 2015, 15, 1321–1330. [CrossRef]
3. Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 2938–2946.
4. Wang, S.; Clark, R.; Wen, H.; Trigoni, N. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. Int. J. Robot. Res. 2017, 37, 513–542. [CrossRef]
5. Kendall, A.; Cipolla, R. Modelling uncertainty in deep learning for camera relocalization. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 4762–4769.
6. Kendall, A.; Cipolla, R. Geometric Loss Functions for Camera Pose Regression with Deep Learning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6555–6564.
7. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6612–6619.
8. Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular Visual Odometry Through Unsupervised Deep Learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 7286–7291.
9. Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5667–5675.
10. Liu, Q.; Li, R.; Hu, H.; Gu, D. Using Unsupervised Deep Learning Technique for Monocular Visual Odometry. IEEE Access 2019, 7, 18076–18088. [CrossRef]
11. Kitt, B.; Geiger, A.; Lategahn, H. Visual odometry based on stereo image sequences with RANSAC-based outlier rejection scheme. In Proceedings of the 2010 IEEE Intelligent Vehicles Symposium, San Diego, CA, USA, 21–24 June 2010; pp. 486–492.
12. Oskiper, T.; Zhu, Z.; Samarasekera, S.; Kumar, R. Visual Odometry System Using Multiple Stereo Cameras and Inertial Measurement Unit. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8.
13. Gamallo, C.; Mucientes, M.; Regueiro, C.V. Omnidirectional visual SLAM under severe occlusions. Robot. Auton. Syst. 2015, 65, 76–87. [CrossRef]
14. Wang, R.; Schwörer, M.; Cremers, D. Stereo DSO: Large-Scale Direct Sparse Visual Odometry with Stereo Cameras. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3923–3931.
15. Kerl, C.; Sturm, J.; Cremers, D. Dense visual SLAM for RGB-D cameras. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 2100–2106.

16. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [CrossRef]
17. Whelan, T.; Kaess, M.; Johannsson, H.; Fallon, M.; Leonard, J.J.; McDonald, J. Real-time large-scale dense RGB-D SLAM with volumetric fusion. Int. J. Robot. Res. 2015, 34, 598–626. [CrossRef]
18. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [CrossRef]
19. Engel, J.; Sturm, J.; Cremers, D. Semi-dense Visual Odometry for a Monocular Camera. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 1449–1456.
20. Howard, A. Real-time stereo visual odometry for autonomous ground vehicles. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, Nice, France, 22–26 September 2008; pp. 3946–3952.
21. Tardif, J.; Pavlidis, Y.; Daniilidis, K. Monocular visual odometry in urban environments using an omnidirectional camera. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, Nice, France, 22–26 September 2008; pp. 2531–2538.
22. Khan, F.; Salahuddin, S.; Javidnia, H. Deep Learning-Based Monocular Depth Estimation Methods—A State-of-the-Art Review. Sensors 2020, 20, 2272. [CrossRef] [PubMed]
23. Kneip, L.; Chli, M.; Siegwart, R.Y. Robust real-time visual odometry with a single camera and an IMU. In Proceedings of the British Machine Vision Conference, Dundee, UK, 29 August–2 September 2011.
24. Zhang, J.; Singh, S. Visual-lidar odometry and mapping: Low-drift, robust, and fast. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 2174–2181.
25. Debeunne, C.; Vivet, D. A Review of Visual-LiDAR Fusion based Simultaneous Localization and Mapping. Sensors 2020, 20, 2068. [CrossRef] [PubMed]
26. Ottonelli, S.; Spagnolo, P.; Mazzeo, P.L.; Leo, M. Improved video segmentation with color and depth using a stereo camera. In Proceedings of the 2013 IEEE International Conference on Industrial Technology (ICIT), Cape Town, South Africa, 25–28 February 2013; pp. 1134–1139. [CrossRef]
27. Carfagni, M.; Furferi, R.; Governi, L.; Servi, M.; Uccheddu, F.; Volpe, Y. On the Performance of the Intel SR300 Depth Camera: Metrological and Critical Characterization. IEEE Sens. J. 2017, 17, 4508–4519. [CrossRef]
28. Henry, P.; Krainin, M.; Herbst, E.; Ren, X.; Fox, D. RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments. Int. J. Robot. Res. 2012, 31, 647–663. [CrossRef]
29. Huang, A.S.; Bachrach, A.; Henry, P.; Krainin, M.; Maturana, D.; Fox, D.; Roy, N. Visual odometry and mapping for autonomous flight using an RGB-D camera. In Robotics Research; Springer: Berlin/Heidelberg, Germany, 2017; pp. 235–252.
30. Steinbrücker, F.; Sturm, J.; Cremers, D. Real-time visual odometry from dense RGB-D images. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 719–722.
31. Sünderhauf, N.; Shirazi, S.; Dayoub, F.; Upcroft, B.; Milford, M. On the performance of ConvNet features for place recognition. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–3 October 2015; pp. 4297–4304.
32. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
33. Li, R.; Liu, Q.; Gui, J.; Gu, D.; Hu, H. Indoor Relocalization in Challenging Environments With Dual-Stream Convolutional Neural Networks. IEEE Trans. Autom. Sci. Eng. 2018, 15, 651–662. [CrossRef]
34. Wang, S.; Clark, R.; Wen, H.; Trigoni, N. DeepVO: Towards end-to-end visual odometry with deep Recurrent Convolutional Neural Networks. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Marina Bay Sands, Singapore, 29 May–3 June 2017; pp. 2043–2050.
35. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.

36. Kawakami, K. Supervised Sequence Labelling with Recurrent Neural Networks. Ph.D. Thesis, Technical University of Munich, München, Germany, 2008.
37. Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. In Advances in Neural Information Processing Systems; London, UK, 2015; pp. 2017–2025.
38. Chen, Y.; Medioni, G. Object modelling by registration of multiple range images. Image Vis. Comput. 1992, 10, 145–155. [CrossRef]
39. Geiger, A.; Ziegler, J.; Stiller, C. StereoScan: Dense 3d reconstruction in real-time. In Proceedings of the 2011 IEEE Intelligent Vehicles Symposium (IV), Baden-Baden, Germany, 5–9 June 2011; pp. 963–968.
40. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).