LDD: Learned Detector and Descriptor of Points for Visual Odometry


Degree Project in Computer Science and Engineering, Second Cycle, 30 credits
Stockholm, Sweden, 2018

JEVGENIJA AKSJONOVA
Master of Science in Machine Learning
Supervisor: Josephine Sullivan
Examiner: Patric Jensfelt
E-mail: [email protected]
KTH Royal Institute of Technology, School of Electrical Engineering and Computer Science
Date: August 24, 2018

Abstract

Simultaneous localization and mapping is an important problem in robotics that can be solved using visual odometry – the process of estimating ego-motion from subsequent camera images. In turn, visual odometry systems rely on point matching between different frames. This work presents a novel method for matching key-points by applying neural networks to point detection and description. Traditionally, point detectors are used to select good key-points (such as corners), and these key-points are then matched using features extracted with descriptors. In this work, however, a descriptor is first trained to match points densely, and a detector is then trained to predict which points are more likely to be matched with the descriptor. This information is further used to select good key-points. The results of this project show that this approach can lead to more accurate results than model-based methods.

Sammanfattning (Swedish abstract, translated)

Simultaneous localization and mapping is an important problem in robotics that can be solved using visual odometry – the process of estimating ego-motion from successive camera images. Visual odometry systems in turn rely on point matches between different frames. This work presents a new method for matching key-points by applying neural networks to point detection and description. Traditionally, point detectors are used to select good key-points (such as corners), and these key-points are then used to match features. In this work a descriptor is instead trained to match the points, and a detector is then trained to predict which points are most likely to be matched correctly with the descriptor. This information is then used to select good key-points. The results of the project show that this can lead to more precise results compared with other model-based methods.

Contents

1 Introduction
  1.1 Related work
    1.1.1 LIFT
    1.1.2 SuperPoint
    1.1.3 GCN
    1.1.4 Dense matching
  1.2 Ethical considerations and societal impact
2 Background
  2.1 Deep Convolutional Neural Networks
    2.1.1 Training Neural Networks
    2.1.2 Convolutional layer
    2.1.3 Batch Normalization
    2.1.4 ResNets
  2.2 Baseline Methods
    2.2.1 SIFT
    2.2.2 DAISY
3 Methods
  3.1 Data
    3.1.1 KITTI
    3.1.2 FlyingThings
    3.1.3 Augmentation
    3.1.4 Evaluation
  3.2 Learning the Descriptor
    3.2.1 Training
  3.3 Learning the Detector
    3.3.1 Construction of the ground truth data
  3.4 Point selection
  3.5 Implementation details
4 Results
  4.1 Descriptor
  4.2 Detector
  4.3 Combining the Descriptor and Detector
5 Discussion and Conclusions
  5.1 Discussion
    5.1.1 Drawbacks and future work
  5.2 Conclusions
Bibliography

Chapter 1: Introduction

Simultaneous Localization and Mapping (SLAM) is an important problem in robotics. The objective is to simultaneously estimate ego-motion and construct a map of the environment. In visual SLAM, information obtained from a camera is used to solve this problem. This involves estimating ego-motion from subsequent camera images – a process called visual odometry.

There are different ways to perform visual odometry. For instance, Bundle Adjustment (BA) can be used to jointly estimate the positions of the camera and the positions of points in space, given that these points were captured by the camera from different view-points [16], [24]. This is illustrated in figure 1.1. Methods that use BA are usually called sparse feature-based methods; one of the state-of-the-art feature-based systems is ORB-SLAM [27]. An alternative approach to BA is to estimate the displacement of the camera by aligning images so that the photometric error is minimized. This approach can be used to obtain a dense or semi-dense map of the environment, as in, for instance, LSD-SLAM [11]. However, sparse feature-based methods are the ones mostly used in practice.

Figure 1.1: BA setup: points are projected into several image planes, [24].

As figure 1.1 shows, using BA requires finding points in the images that correspond to the same points in space. In other words, BA relies on point matching between different frames. There are different feature extraction methods that can be used to detect and reliably match key-points. For instance, ORB-SLAM [27] makes use of the ORB features [31] in all subtasks, such as tracking, mapping, relocalization and loop closing. Some methods propose algorithms for both detection and description of key-points, for instance SIFT [25], SURF [3] and ORB [31], while others, like the Harris Corner Detector [15] and FAST [30], focus on detecting points that are stable and repeatable across different frames. An example of key-points detected and matched with SIFT is shown in figure 1.2.

Figure 1.2: Example of key-points detected and matched with SIFT.

Artificial neural networks are a class of models used in Machine Learning that has gained increasing popularity over the last decade. Deep Convolutional Neural Networks have been used to achieve state-of-the-art results in computer vision tasks, for instance image classification [22], object detection [29] and semantic segmentation [23]. Despite the great potential of deep learning features to define and match key-points, model-based local descriptors still dominate the field.

Recently, there have been many attempts to create learned local descriptors using neural networks, for instance DeepCompare [43], DeepDesc [34], MatchNet [14] and TFeat [2]. These methods are trained to compare small image patches and, as indicated in a comparative survey [33], they are not bound to a particular choice of detector. Therefore, they are often used in combination with hand-crafted point detectors.

More recent works have proposed to extract features densely for all points. For instance, [6] proposed the Universal Correspondence Network (UCN) for learning the feature space using dense correspondences, such as disparities in stereo images. The resulting model produces a dense feature map for each image, and points can then be matched between different images with the Nearest Neighbor algorithm.
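The dense-matching setup that UCN-style methods (and the descriptor trained in this thesis) rely on can be illustrated with a small sketch: given two dense, L2-normalized feature maps, keep only pairs that are mutual nearest neighbours in descriptor space. The shapes and function name are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

def mutual_nn_matches(feat_a, feat_b):
    """Match two dense L2-normalized feature maps of shape (H, W, C)
    by mutual nearest neighbours in descriptor space."""
    h, w, c = feat_a.shape
    a = feat_a.reshape(-1, c)          # (H*W, C) descriptors of image A
    b = feat_b.reshape(-1, c)
    sim = a @ b.T                      # cosine similarity (unit-norm rows)
    nn_ab = sim.argmax(axis=1)         # best match in B for each point of A
    nn_ba = sim.argmax(axis=0)         # best match in A for each point of B
    idx_a = np.arange(a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a     # keep only mutually consistent pairs
    pts_a = np.stack(np.unravel_index(idx_a[mutual], (h, w)), axis=1)
    pts_b = np.stack(np.unravel_index(nn_ab[mutual], (h, w)), axis=1)
    return pts_a, pts_b                # matched (row, col) coordinates
```

Note that the full H*W by H*W similarity matrix is far too large for real images; in practice the search is restricted (for example along epipolar lines in a stereo pair) or done in chunks.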
Furthermore, [8] and [37] combined dense feature extraction with score prediction, so that a score value can be used to decide whether a point should be selected as a key-point or not. [8] aimed to predict the "cornerness" of points by pre-training the detector on a synthetic dataset of geometric figures, whereas [37] used Harris corners to provide the ground truth data for the detector. These works are thus still partly grounded in model-based methods, even though they push the learned detector to generalize beyond the model-based ground truth during training.

This Master's thesis poses a question: in the case of dense feature extraction, is it possible to create a detector by simply predicting which points are more likely to be matched? This work tries to answer this question by constructing a new type of ground truth data for the detector: points are divided into good and bad depending on whether or not they are matched between images in an all-to-all matching scenario. To do this, we first train a descriptor, which we call LDesc, that creates a dense feature map for each image. Then we construct the ground truth data and train a detector, which we call LDet, to discriminate between two classes: "matched" and "not matched". Deep Neural Networks are used for both of these tasks.

In the process we discovered that the scores predicted by the detector are not repeatable enough for selecting points with a simple non-maximum suppression and performing sparse matching on the selected points. Therefore, a semi-dense matching procedure is proposed for the evaluation. We call the combined method LDD (Learned Detector and Descriptor) and evaluate it on augmented stereo images from the synthetic FlyingThings dataset [28] and on image pairs from KITTI Stereo and KITTI Optical Flow [12], [26]. The results show that LDD leads to more accurate point matching than SIFT in our experiments.

This thesis report consists of five chapters. The first chapter introduces the research problem and describes related work. The second briefly describes the basics of deep learning and the methods used as a baseline. The third chapter introduces the LDD method and, finally, the fourth and fifth chapters present and discuss the results.

1.1 Related work

The idea of predicting which points are likely to be matched is not entirely new. For instance, [17] tried to predict which key-points survive the matching process of a traditional Structure-from-Motion pipeline. The authors of [32] pointed out that many of the proposed methods for learning detectors rely on model-based features for training, which may be a significant limitation. They proposed an unsupervised learning approach: first, a model called Quad-Networks was trained to produce a consistent ranking of points; key-points were then defined as the top/bottom quantiles of the ranking.
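To make the ground-truth construction proposed above concrete, here is a minimal sketch of the labeling rule for an LDet-style detector: a pixel counts as "matched" if its nearest descriptor neighbour in the other image, found by all-to-all matching, lands within a small radius of the true correspondence. The array layout, helper name and pixel tolerance are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def label_points(desc_a, desc_b, gt_corr, tol=2.0):
    """Label every pixel of image A as matched (1) or not matched (0).

    desc_a, desc_b: dense descriptor maps of shape (H, W, C).
    gt_corr: ground-truth correspondence map (H, W, 2) giving, for each
             pixel of A, its true (row, col) position in image B.
    """
    h, w, c = desc_a.shape
    a = desc_a.reshape(-1, c)
    b = desc_b.reshape(-1, c)
    nn = (a @ b.T).argmax(axis=1)                 # all-to-all NN matching
    pred = np.stack(np.unravel_index(nn, (h, w)), axis=1).astype(np.float32)
    true = gt_corr.reshape(-1, 2)
    err = np.linalg.norm(pred - true, axis=1)     # distance to true match
    return (err < tol).reshape(h, w).astype(np.uint8)
```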
Recommended publications
  • Semantically Informed Visual Odometry and Mapping
SIVO: Semantically Informed Visual Odometry and Mapping, by Pranav Ganti. A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Applied Science in Mechanical and Mechatronics Engineering, Waterloo, Ontario, Canada, 2018. © Pranav Ganti 2018.

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public.

Abstract: Accurate localization is a requirement for any autonomous mobile robot. In recent years, cameras have proven to be a reliable, cheap, and effective sensor to achieve this goal. Visual simultaneous localization and mapping (SLAM) algorithms determine camera motion by tracking the motion of reference points from the scene. However, these references must be static, as well as viewpoint, scale, and rotation invariant, in order to ensure accurate localization. This is especially paramount for long-term robot operation, where we require our references to be stable over long durations, and also require careful point selection to maintain the runtime and storage complexity of the algorithm while the robot navigates through its environment. In this thesis, we present SIVO (Semantically Informed Visual Odometry and Mapping), a novel feature selection method for visual SLAM which incorporates machine learning and neural network uncertainty into an information-theoretic approach to feature selection. The emergence of deep learning techniques has resulted in remarkable advances in scene understanding, and our method supplements traditional visual SLAM with this contextual knowledge.
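SIVO's actual criterion is information-theoretic; as a loose, hypothetical illustration of two of its ingredients (network confidence and class stability), one could filter detected keypoints by the entropy of a segmentation network's per-pixel class distribution, keeping only points that fall confidently on static classes. The class ids, threshold and names below are invented for the sketch.

```python
import numpy as np

STATIC_CLASSES = {0, 1, 2}   # e.g. road, building, pole; illustrative ids

def select_stable_keypoints(keypoints, class_probs, max_entropy=0.5):
    """keypoints: (N, 2) integer (row, col) detections.
    class_probs: (H, W, K) per-pixel softmax output of a segmentation net."""
    kept = []
    for r, c in keypoints:
        p = class_probs[r, c]
        entropy = -np.sum(p * np.log(p + 1e-12))   # network uncertainty
        if entropy < max_entropy and p.argmax() in STATIC_CLASSES:
            kept.append((r, c))                    # confident and static
    return np.array(kept)
```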
  • Visual Odometry by Multi-Frame Feature Integration
Visual Odometry by Multi-frame Feature Integration. Hernán Badino, Akihiro Yamamoto, Takeo Kanade ([email protected], [email protected], [email protected]), Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

Abstract: This paper presents a novel stereo-based visual odometry approach that provides state-of-the-art results in real time, both indoors and outdoors. Our proposed method follows the procedure of computing optical flow and stereo disparity to minimize the re-projection error of tracked feature points. However, instead of following the traditional approach of performing this task using only consecutive frames, we propose a novel and computationally inexpensive technique that uses the whole history of the tracked feature points to compute the motion of the camera. In our technique, which we call multi-frame feature integration, the features measured and tracked over all past frames are integrated into a single, improved estimate.

Instead of performing this task using only consecutive frames (e.g., [3, 21]), we propose a novel and computationally simple technique that uses the whole history of the tracked features to compute the motion of the camera. In our technique, the features measured and tracked over all past frames are integrated into a single improved estimate of the real 3D feature projection. In this context, the term multi-frame feature integration refers to this estimation and noise reduction technique. This paper presents three main contributions:
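The integration step can be pictured as a running refinement of each tracked feature's 3D estimate, so that noise in any single frame is averaged down over the feature's history. The sketch below uses a plain running mean, which is only an approximation of the paper's technique, and it omits the step of transforming measurements into a common frame using the current motion estimate.

```python
import numpy as np

class IntegratedFeature:
    """Fuse repeated 3D measurements of one tracked feature over time."""

    def __init__(self):
        self.estimate = None   # current fused 3D position
        self.count = 0         # number of measurements folded in

    def update(self, measurement):
        m = np.asarray(measurement, dtype=np.float64)
        if self.estimate is None:
            self.estimate = m
        else:
            # running mean: every past observation keeps equal weight,
            # so per-frame measurement noise is progressively averaged out
            self.estimate += (m - self.estimate) / (self.count + 1)
        self.count += 1
        return self.estimate
```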
  • Spherical-Model-Based SLAM on Full-View Images for Indoor Environments
Applied Sciences, Article: Spherical-Model-Based SLAM on Full-View Images for Indoor Environments. Jianfeng Li 1, Xiaowei Wang 2 and Shigang Li 1,2,*. 1 School of Electronic and Information Engineering, also with the Key Laboratory of Non-linear Circuit and Intelligent Information Processing, Southwest University, Chongqing 400715, China; [email protected]. 2 Graduate School of Information Sciences, Hiroshima City University, Hiroshima 7313194, Japan; [email protected]. * Correspondence: [email protected]. Received: 14 October 2018; Accepted: 14 November 2018; Published: 16 November 2018.

Featured Application: for an indoor mobile robot, finding its location and building environment maps.

Abstract: As we know, SLAM (Simultaneous Localization and Mapping) relies on surroundings. A full-view image provides more benefits to SLAM than a limited-view image. In this paper, we present a spherical-model-based SLAM on full-view images for indoor environments. Unlike traditional limited-view images, the full-view image has its own specific imaging principle (which is nonlinear) and is accompanied by distortions. Thus, specific techniques are needed to process a full-view image. In the proposed method, we first use a spherical model to express the full-view image. Then, the algorithms are implemented based on the spherical model, including feature point extraction, feature point matching, 2D-3D connection, and projection and back-projection of scene points. Thanks to the full field of view, the experiments show that the proposed method effectively handles sparse-feature or partially non-feature environments, and also achieves high accuracy in localization and mapping. An experiment is conducted to prove that the accuracy is affected by the view field.
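The core of a spherical camera model is mapping each full-view pixel to a bearing vector on the unit sphere, and back. For an equirectangular image this is a direct longitude/latitude conversion; that parametrization is a common choice and an assumption here, not necessarily the paper's exact formulation.

```python
import numpy as np

def pixel_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit bearing vector."""
    lon = (u / width) * 2.0 * np.pi - np.pi      # longitude in [-pi, pi]
    lat = np.pi / 2.0 - (v / height) * np.pi     # latitude in [-pi/2, pi/2]
    return np.array([
        np.cos(lat) * np.sin(lon),   # x (right)
        np.sin(lat),                 # y (up)
        np.cos(lat) * np.cos(lon),   # z (forward)
    ])

def sphere_to_pixel(d, width, height):
    """Back-project a unit bearing vector to pixel coordinates."""
    lon = np.arctan2(d[0], d[2])
    lat = np.arcsin(np.clip(d[1], -1.0, 1.0))
    u = (lon + np.pi) / (2.0 * np.pi) * width
    v = (np.pi / 2.0 - lat) / np.pi * height
    return u, v
```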
  • Omnidirectional Visual-Inertial Odometry Using Multi-State Constraint Kalman Filter
Omnidirectional Visual-Inertial Odometry Using Multi-State Constraint Kalman Filter. Milad Ramezani 1, Kourosh Khoshelham 1 and Laurent Kneip 2.

Abstract: We present an Omnidirectional Visual-Inertial Odometry (OVIO) approach based on Multi-State Constraint Kalman Filtering (MSCKF) to estimate the ego-motion of a moving platform. Instead of considering visual measurements on the image plane, we use individual planes for each point that are tangent to the unit sphere and normal to the corresponding measurement ray. This way, we combine spherical images captured by an omnidirectional camera with inertial measurements within the filtering method MSCKF. The key hypothesis of OVIO is that a wider field of view allows incorporating more visual features from the surrounding environment, thereby improving the accuracy and robustness of the motion estimation. Moreover, by using an omnidirectional camera, it is less likely to end up in a situation where there is not enough texture. We provide an evaluation of OVIO using synthetic and real video sequences captured by a fish-eye camera, and compare the performance with MSCKF using a perspective camera.

VIO can be substantially improved by increasing the Field of View (FoV) of the camera. A larger FoV allows capturing a larger extent of the environment, thereby increasing the likelihood of usable regions in images for feature extraction and matching. Furthermore, a larger FoV provides a better geometry for the pose estimation, due to the fact that it is more likely to track features from different angles surrounding the moving camera. More importantly, a wider FoV allows features to be tracked over a longer sequence of images. In theory, this increases the number of visual measurements, and consequently improves the precision and robustness of pose estimation. It is worth noting that a larger FoV can solve the rotation-translation ambiguity, which is a
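The tangent-plane measurement idea can be sketched as follows: the residual for a point is expressed in a 2D basis of the plane tangent to the unit sphere at the measured bearing (the plane normal to the measurement ray), rather than on an image plane. This is a generic reconstruction of that geometry under stated assumptions, not the authors' code.

```python
import numpy as np

def tangent_basis(ray):
    """Two orthonormal vectors spanning the plane tangent to the unit
    sphere at `ray`, i.e. the plane normal to the measurement ray."""
    ray = ray / np.linalg.norm(ray)
    helper = np.array([1.0, 0.0, 0.0])
    if abs(ray[0]) > 0.9:                 # avoid a near-parallel helper axis
        helper = np.array([0.0, 1.0, 0.0])
    b1 = np.cross(ray, helper)
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(ray, b1)
    return b1, b2

def tangent_plane_residual(measured_ray, predicted_ray):
    """2D reprojection residual expressed in the tangent plane; it is
    zero exactly when the predicted bearing equals the measured one."""
    b1, b2 = tangent_basis(measured_ray)
    p = predicted_ray / np.linalg.norm(predicted_ray)
    return np.array([b1 @ p, b2 @ p])
```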
  • Evolution of Visual Odometry Techniques
Evolution of Visual Odometry Techniques. Shashi Poddar, Rahul Kottath, Vinod Karar.

Abstract: With rapid advancements in the area of mobile robotics and industrial automation, a growing need has arisen towards accurate navigation and localization of moving objects. Camera based motion estimation is one such technique which is gaining huge popularity owing to its simplicity and use of limited resources in generating motion path. In this paper, an attempt is made to introduce this topic for beginners covering different aspects of vision based motion estimation task. The evolution of VO schemes over last few decades is discussed under two broad categories, that is, geometric and non-geometric approaches. The geometric approaches are further detailed under three different classes, that is, feature-based, appearance-based, and a hybrid of feature and appearance based schemes. The non-geometric approach is one of the recent paradigm shifts from conventional pose estimation techniques and is thus discussed in a separate section. Towards the end, a list of different datasets for visual odometry and allied research areas are provided for a ready reference.

... visual odometry (VO) pipeline [4]. Simultaneous localization and mapping (SLAM), a superset of VO, localizes and builds a map of its environment along with the trajectory of a moving object [5]. However, our discussion in this paper is limited to visual odometry, which incrementally estimates the camera pose and refines it using optimization techniques. A visual odometry system consists of a specific camera arrangement, the software architecture and the hardware platform to yield camera pose at every time instant. The camera pose estimation can be either appearance or feature based. The appearance-based techniques operate on intensity values directly and match templates of sub-images over two frames or the optical
  • Semi-Dense Visual Odometry for AR on a Smartphone
Semi-Dense Visual Odometry for AR on a Smartphone. Thomas Schöps, Jakob Engel, Daniel Cremers, Technische Universität München.

Figure 1: From left to right: AR demo application with simulated car; corresponding estimated semi-dense depth map; estimated dense collision mesh, fixed and shown from a different perspective; photo of the running system. The attached video shows the system in action.

Abstract: We present a direct monocular visual odometry system which runs in real-time on a smartphone. Being a direct method, it tracks and maps on the images themselves instead of extracted features such as keypoints. New images are tracked using direct image alignment, while geometry is represented in the form of a semi-dense depth map. Depth is estimated by filtering over many small-baseline, pixel-wise stereo comparisons. This leads to significantly fewer outliers and allows mapping and using all image regions with sufficient gradient, including edges. We show how a simple world model for AR applications can be derived from semi-dense depth maps, and demonstrate the practical applicability in the context of an AR application in which simulated objects can collide with real geometry.

State-of-the-art monocular SLAM methods generally operate on features. While this allows estimating the camera movement in real-time on mobile platforms [12, 15], the resulting feature-based maps hardly provide sufficient information about the 3D geometry of the scene for physical interaction. At the same time, recent advances in computer vision have shown the high potential of direct methods for monocular SLAM [16, 5, 7, 17]: instead of operating on features, these methods perform both tracking and mapping directly on the image intensity values. Fundamentally different from feature-based methods, direct methods not only allow for fast, sub-pixel accurate camera tracking, but also provide substantially more information about the 3D structure of the environment, are less susceptible to outliers, and
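Direct alignment means the pose is scored on intensities rather than feature coordinates: for a semi-dense set of high-gradient pixels with depth estimates, the photometric error of a candidate pose is accumulated as below. The pinhole projection, nearest-pixel lookup and names are simplifying assumptions; real systems use bilinear interpolation and robust (e.g. Huber) weighting.

```python
import numpy as np

def photometric_error(I_ref, I_new, pts, inv_depths, K, R, t):
    """Sum of squared intensity differences for selected pixels.

    pts: (N, 2) pixel coords (u, v) in the reference image.
    inv_depths: (N,) inverse depth estimates for those pixels."""
    K_inv = np.linalg.inv(K)
    err = 0.0
    for (u, v), inv_d in zip(pts, inv_depths):
        p3d = K_inv @ np.array([u, v, 1.0]) / inv_d   # back-project
        q = K @ (R @ p3d + t)                         # warp into new frame
        if q[2] <= 0:
            continue                                  # behind the camera
        u2, v2 = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
        if 0 <= v2 < I_new.shape[0] and 0 <= u2 < I_new.shape[1]:
            r = float(I_new[v2, u2]) - float(I_ref[int(v), int(u)])
            err += r * r                              # photometric residual
    return err
```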
  • Optical Flow and Deep Learning Based Approach to Visual Odometry
Rochester Institute of Technology, RIT Scholar Works, Theses, 11-2016. Optical Flow and Deep Learning Based Approach to Visual Odometry. Peter M. Muller ([email protected]). Follow this and additional works at: https://scholarworks.rit.edu/theses

Recommended citation: Muller, Peter M., "Optical Flow and Deep Learning Based Approach to Visual Odometry" (2016). Thesis. Rochester Institute of Technology. This thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in Theses by an authorized administrator of RIT Scholar Works. For more information, please contact [email protected].

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering, Department of Computer Engineering, November 2016. Committee: Dr. Andreas Savakis (Advisor, Professor), Dr. Raymond Ptucha (Assistant Professor), Dr. Roy Melton (Principal Lecturer).

Acknowledgments: I would like to thank my advisor, Dr. Andreas Savakis, for his exceptional patience and direction in helping me complete this thesis. I would like to thank my committee, Dr. Raymond Ptucha and Dr. Roy Melton, for being outstanding mentors during my time at RIT. I would also like to thank my fellow lab member, Dan Chianucci, for his help and input during the course of this thesis. Finally, I would like to thank my friends who have supported me from distances near and far: anywhere from just beyond the lab doors to just beyond the other side of the world.
  • Unsupervised Deep Learning-Based RGB-D Visual Odometry
Article: Unsupervised Deep Learning-Based RGB-D Visual Odometry. Qiang Liu 1,2, Haidong Zhang 2, Yiming Xu 2,* and Li Wang 2. 1 Department of Psychiatry, University of Oxford, Warneford Hospital, Oxford OX3 7JX, UK; [email protected]. 2 School of Electrical Engineering, Nantong University, Nantong 226000, China; [email protected] (H.Z.); [email protected] (L.W.). * Correspondence: [email protected]. Received: 6 July 2020; Accepted: 3 August 2020; Published: 6 August 2020.

Abstract: Recently, deep learning frameworks have been deployed in visual odometry systems and have achieved comparable results to traditional feature matching based systems. However, most deep learning-based frameworks inevitably need labeled data as ground truth for training. On the other hand, monocular odometry systems are incapable of restoring absolute scale; external or prior information has to be introduced for scale recovery. To solve these problems, we present a novel deep learning-based RGB-D visual odometry system. Our two main contributions are: (i) during network training and pose estimation, the depth images are fed into the network to form a dual-stream structure with the RGB images, and a dual-stream deep neural network is proposed; (ii) the system adopts an unsupervised end-to-end training method, so the labor-intensive data labeling task is not required. We have tested our system on the KITTI dataset, and the results show that the proposed RGB-D Visual Odometry (VO) system has obvious advantages over other state-of-the-art systems in terms of both translation and rotation errors. Keywords: RGB-D sensor; visual odometry; unsupervised deep learning; Recurrent Convolutional Neural Networks.
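A minimal sketch of what a dual-stream pose network can look like: two small convolutional encoders, one per modality, whose pooled features are concatenated before a shared 6-DoF pose head. The layer sizes are invented for illustration, and the recurrent part implied by the paper's keywords is omitted; only the two-branch structure is the point here.

```python
import torch
import torch.nn as nn

class DualStreamPoseNet(nn.Module):
    """RGB stream + depth stream -> fused features -> 6-DoF pose."""

    def __init__(self):
        super().__init__()
        def encoder(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 16, 7, stride=2, padding=3), nn.ReLU(),
                nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),       # global feature per branch
            )
        self.rgb_stream = encoder(6)     # two stacked RGB frames (2 x 3 ch)
        self.depth_stream = encoder(2)   # two stacked depth frames
        self.pose = nn.Linear(64, 6)     # 3 translation + 3 rotation params

    def forward(self, rgb_pair, depth_pair):
        f = torch.cat([
            self.rgb_stream(rgb_pair).flatten(1),
            self.depth_stream(depth_pair).flatten(1),
        ], dim=1)
        return self.pose(f)
```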
  • Improved Ground-Based Monocular Visual Odometry Estimation Using Inertially-Aided Convolutional Neural Networks
Air Force Institute of Technology, AFIT Scholar, Theses and Dissertations, Student Graduate Works, 3-2020. Improved Ground-Based Monocular Visual Odometry Estimation using Inertially-Aided Convolutional Neural Networks. Josiah D. Watson. AFIT-ENG-MS-20-M-072. Follow this and additional works at: https://scholar.afit.edu/etd. Part of the Electrical and Electronics Commons, and the Navigation, Guidance, Control and Dynamics Commons.

Recommended citation: Watson, Josiah D., "Improved Ground-Based Monocular Visual Odometry Estimation using Inertially-Aided Convolutional Neural Networks" (2020). Theses and Dissertations. 3622. https://scholar.afit.edu/etd/3622. This thesis is brought to you for free and open access by the Student Graduate Works at AFIT Scholar. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of AFIT Scholar. For more information, please contact [email protected].

Thesis presented to the Faculty, Department of Electrical and Computer Engineering, Graduate School of Engineering and Management, Air Force Institute of Technology, Air University, Air Education and Training Command, in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering. Josiah D. Watson, B.S.Cp.E. Wright-Patterson Air Force Base, Ohio. DISTRIBUTION STATEMENT A: APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED. The views expressed in this document are those of the author and do not reflect the official policy or position of the United States Air Force, the United States Department of Defense or the United States Government. This material is declared a work of the U.S. Government and is not subject to copyright protection in the United States.
  • Real-Time Stereo Visual Odometry for Autonomous Ground Vehicles
Real-Time Stereo Visual Odometry for Autonomous Ground Vehicles. Andrew Howard.

Abstract: This paper describes a visual odometry algorithm for estimating frame-to-frame camera motion from successive stereo image pairs. The algorithm differs from most visual odometry algorithms in two key respects: (1) it makes no prior assumptions about camera motion, and (2) it operates on dense disparity images computed by a separate stereo algorithm. This algorithm has been tested on many platforms, including wheeled and legged vehicles, and has proven to be fast, accurate and robust. For example, after 4000 frames and 400 m of travel, position errors are typically less than 1 m (0.25% of distance traveled). Processing time is approximately 20 ms on a 512x384 image. This paper includes a detailed description of the algorithm and experimental evaluation on a variety of platforms and terrain types.

... in the inlier set. The inlier detection step (3) is the key distinguishing feature of the algorithm. The feature matching stage inevitably produces some incorrect correspondences which, if left intact, will unfavorably bias the frame-to-frame motion estimate. A common solution to this problem is to use a robust estimator that can tolerate some number of false matches (e.g., RANSAC [4]). In our algorithm, however, we adopt an approach described by Hirschmüller et al. [7], and exploit stereo range data at the inlier detection stage. The core intuition is that the 3D locations of features must obey a rigidity constraint, and that this constraint can be used to identify sets of features that are mutually consistent (i.e., a clique) prior to computing the frame-to-frame motion.
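The rigidity check can be sketched directly from that description: two matched features are mutually consistent if the distance between their 3D positions is (approximately) the same in both frames, and a greedy clique over the resulting consistency graph yields the inlier set. The threshold and the greedy growth strategy below are illustrative, not the paper's exact procedure.

```python
import numpy as np

def rigid_inliers(pts_prev, pts_curr, tol=0.05):
    """pts_prev, pts_curr: (N, 3) stereo-triangulated positions of the
    same matched features in two consecutive frames; returns inlier ids."""
    d_prev = np.linalg.norm(pts_prev[:, None] - pts_prev[None, :], axis=2)
    d_curr = np.linalg.norm(pts_curr[:, None] - pts_curr[None, :], axis=2)
    consistent = np.abs(d_prev - d_curr) < tol        # pairwise rigidity test
    clique = [int(consistent.sum(axis=1).argmax())]   # seed: best-connected node
    for i in np.argsort(-consistent.sum(axis=1)):     # grow greedily
        if i not in clique and all(consistent[i, j] for j in clique):
            clique.append(int(i))
    return clique
```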
  • Monocular Visual Odometry
IIT-Kanpur Undergraduate Project 2: Monocular Visual Odometry. Author: Avi Singh. Supervisor: Dr. K. S. Venkatesh. April 24, 2015.

Contents: 1 Introduction (1.1 Why Monocular?, 1.2 Application, 1.3 Related Work); 2 The Algorithm (2.1 Problem Formulation, 2.2 Algorithm Outline, 2.3 Undistortion, 2.4 Feature Detection, 2.5 Feature Tracking, 2.6 Essential Matrix Estimation, 2.7 Computing R, t from the Essential Matrix, 2.8 Constructing Trajectory, 2.9 Heuristics); 3 Results and Conclusions; 4 Future Work; 5 References.

Abstract: This report explains a monocular visual odometry algorithm, which has been implemented in OpenCV/C++. We report results on the KITTI dataset (using only one image of the stereo dataset). Since it is a monocular implementation, we cannot do absolute scale estimation, and thus that quantity is taken from the ground truths that we have.

Introduction: Visual odometry is the estimation of the 6-DOF trajectory followed by a moving agent, based on input from a camera rigidly attached to the body of the agent. The camera might be monocular, or a couple of cameras might be used to form a stereo rig. In the monocular approach, it is not possible to obtain the absolute scale of the trajectory, while it is certainly possible to do so in the stereo approach, assuming we know the baseline (the distance between the two cameras of the stereo rig). Visual odometry has attracted a lot of research in recent years, with new state-of-the-art approaches coming almost every year [14, 11].
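The pipeline outlined in that contents list maps almost one-to-one onto OpenCV calls. A condensed two-frame sketch (using the Python bindings rather than the report's C++, with KITTI-like intrinsics as placeholder values) could look like this:

```python
import cv2
import numpy as np

def relative_pose(img1, img2, focal=718.856, pp=(607.19, 185.22)):
    """Estimate rotation R and (up-to-scale) translation t between two
    undistorted monocular frames; intrinsics here are placeholders."""
    fast = cv2.FastFeatureDetector_create(threshold=25)
    kp = fast.detect(img1, None)                       # feature detection
    p1 = np.float32([k.pt for k in kp]).reshape(-1, 1, 2)
    p2, status, _ = cv2.calcOpticalFlowPyrLK(img1, img2, p1, None)  # KLT tracking
    good = status.ravel() == 1
    p1, p2 = p1[good], p2[good]
    E, _ = cv2.findEssentialMat(p2, p1, focal=focal, pp=pp,
                                method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, p2, p1, focal=focal, pp=pp)
    return R, t   # t has unit norm: monocular scale is unobservable
```

The trajectory is then built by chaining these relative poses, with the scale taken from ground truth exactly as the report describes.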
  • Deep Direct Visual Odometry Chaoqiang Zhao, Yang Tang, Senior Member, IEEE, Qiyu Sun, and Athanasios V
Deep Direct Visual Odometry. Chaoqiang Zhao, Yang Tang, Senior Member, IEEE, Qiyu Sun, and Athanasios V. Vasilakos.

Abstract: Traditional monocular direct visual odometry (DVO) is one of the most famous methods for estimating the ego-motion of robots and mapping environments from images simultaneously. However, DVO heavily relies on high-quality images and accurate initial pose estimation during tracking. With the outstanding performance of deep learning, previous works have shown that deep neural networks can effectively learn 6-DoF (Degree of Freedom) poses between frames from monocular image sequences in an unsupervised manner. However, these unsupervised deep learning-based frameworks cannot accurately generate the full trajectory of a long monocular video because of the scale-inconsistency between each pose. To address this problem, we use several geometric constraints to improve the scale-consistency of the pose network, including improving the previous loss function and proposing a novel scale-to-trajectory constraint for unsupervised training. We call the pose network trained by the proposed novel constraint TrajNet. In addition, a new DVO architecture, called deep direct sparse odometry (DDSO), is proposed to overcome the drawbacks of the previous direct sparse odometry (DSO) framework by embedding TrajNet. Extensive experiments on the KITTI dataset show that the proposed constraints can effectively improve the scale-consistency of TrajNet when compared with previous unsupervised monocular methods, and integration with TrajNet makes the initialization

Fig. 1: Trajectories of (a) DDSO and (b) DSO [7] running on KITTI odometry sequence 10. The proposed DDSO is more robust than DSO [7] in initialization.