Deep Learning Assisted Visual Odometry

JIEXIONG TANG

Doctoral Thesis
Stockholm, Sweden 2020
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
TRITA-EECS-AVL-2020:30
ISBN 978-91-7873-550-1
SE-100 44 Stockholm, Sweden

Akademisk avhandling som med tillstånd av Kungliga Tekniska högskolan framlägges till offentlig granskning för avläggande av doktorsexamen i datalogi fredagen den 12 juni 2020 klockan 10.00 i sal U1, Kungliga Tekniska högskolan, Brinellvägen 28A, Stockholm.

© Jiexiong Tang, June 2020, except where otherwise stated.

Tryck: Universitetsservice US AB

Abstract

The capability to autonomously explore and interact with the environment has always been greatly demanded for robots. Various sensor based SLAM methods have been investigated and served this purpose in the past decades. Vision intuitively provides a 3D understanding of the surroundings, but it carries a vast amount of information that requires high level intelligence to interpret. Sensors like LiDAR return range measurements directly; motion estimation and scene reconstruction using a camera is a harder problem. In this thesis, we are in particular interested in the tracking front-end of vision based SLAM, i.e. Visual Odometry (VO), with a focus on deep learning approaches. Recently, learning based methods have dominated most vision applications and are gradually appearing in our daily life and in real-world applications. Different from classical methods, deep learning based methods can potentially tackle some of the intrinsic problems in multi-view geometry and straightforwardly improve the performance of crucial procedures of VO, for example correspondence estimation, dense reconstruction and semantic representation.

In this work, we propose novel learning schemes for assisting both direct and in-direct visual odometry methods. For the direct approaches, we investigate mainly the monocular setup. The lack of the baseline that provides scale, as in stereo, is one of the well-known intrinsic problems in this case. We propose a coupled single view depth and normal estimation method to reduce the scale drift and address the issue of lacking observations of the absolute scale. This is achieved by providing priors for the depth optimization. Moreover, we utilize higher-order geometrical information to guide the dense reconstruction in a sparse-to-dense manner. For the in-direct methods, we propose novel feature learning based methods which noticeably improve the feature matching performance in comparison with common classical feature detectors and descriptors. Finally, we discuss potential ways to make the training self-supervised. This is accomplished by incorporating differential motion estimation into the training while performing multi-view adaptation to maximize repeatability and matching performance. We also investigate using a different type of supervisory signal for the training: we add a higher-level proxy task and show that it is possible to train a feature extraction network even without an explicit loss for it.

In summary, this thesis presents successful examples of incorporating deep learning techniques to assist a classical visual odometry system. The results are promising and have been extensively evaluated on challenging benchmarks, a real robot and handheld cameras. The problem we investigate is still in an early stage, but it is attracting more and more interest from researchers in related fields.

Sammanfattning

Förmågan att självständigt utforska och interagera med en miljö har alltid varit önskvärd hos robotar. Olika sensorbaserade SLAM-metoder har utvecklats och använts för detta ändamål under de senaste decennierna. Datorseende kan intuitivt användas för 3D-förståelse men bygger på en enorm mängd information som kräver en hög nivå av intelligens för att tolka. Sensorer som LIDAR returnerar avståndet för varje mätpunkt direkt vilket gör rörelseuppskattning och scenrekonstruktion mer rättframt än med en kamera. I den här avhandlingen är vi särskilt intresserade av kamerabaserad SLAM och mer specifikt den första delen av ett sådant system, dvs det som normalt kallas visuell odometri (VO). Vi fokuserar på strategier baserade på djupinlärning. Nyligen har inlärningsbaserade metoder kommit att dominera de flesta av kameratillämpningarna och dyker gradvis upp i vårt dagliga liv. Till skillnad från klassiska metoder kan djupinlärningsbaserade metoder potentiellt ta itu med några av de inneboende problemen i kamerabaserade system och förbättra prestandan för viktiga delar i VO. Till exempel uppskattningar av korrespondenser, tät rekonstruktion och semantisk representation. I detta arbete föreslår vi nya inlärningssystem för att stödja både direkta och indirekta visuella odometrimetoder. För de direkta metoderna undersöker vi huvudsakligen fallet med endast en kamera. Bristen på baslinje, som i stereo, som ger skalan i en scen har varit ett av de välkända problemen i detta fall. Vi föreslår en metod som kopplar skattningen av djup och normaler, baserad på endast en bild. För att adressera problemen med att skatta den absoluta skalan och drift i dessa skattningar, används det predikterade djupet som startgissningar för avståndsoptimeringen. Dessutom använder vi geometrisk information för att vägleda den täta rekonstruktionen på ett glest-till-tätt sätt. För de indirekta metoderna föreslår vi nya nyckelpunktsbaserade metoder som märkbart förbättrar matchningsprestanda jämfört med klassiska metoder. Slutligen diskuterar vi potentiella sätt att göra inlärningen självkontrollerad. Detta åstadkoms genom att integrera skattningen av den inkrementella rörelsen i träningen. Vi undersöker också hur man kan använda en så kallad proxy-uppgift för att generera en implicit kontrollsignal och visar att vi kan träna ett nyckelpunktgenererande nätverk på detta sätt. Sammanfattningsvis presenterar denna avhandling flera fungerande exempel på hur djupinlärningstekniker kan hjälpa ett klassiskt visuellt odometrisystem. Resultaten är lovande och har utvärderats i omfattande och utmanande scenarier, från dataset, på riktiga robotar såväl som handhållna kameror. Problemet vi undersöker befinner sig fortfarande i ett tidigt skede forskningsmässigt, men intresserar nu också forskare från närliggande områden.

Dedicated to my family. You paint the colors of my life.

Acknowledgements

I am writing to thank all the friends who have helped me through the four years of my PhD studies; I could not possibly have finished this thesis without you. I would like to express my sincere gratitude to my supervisor Patric, who has always been kind and thoughtful to me. You were always there and raised me up when I was at my lowest, and I appreciate everything you have done for me. Sorry for sometimes being childish, and thank you for your tolerance. I hope everything will get better and better for you in the future.

I would like to thank my colleagues and friends at RPL. John, thank you for being my co-supervisor; I appreciate that you were always available for help and talk. Rares, thank you for hosting me in the U.S.; those four months have been my favorite time of the past four years. I have learnt invaluable things from you, both in work and in life. Xi, thanks for hearing me out when I really needed it. I hope your work will go smoothly in the future. Ludvig and Daniel, I will not forget the morning meetings we had. I hope you will stay motivated, and good luck with your research. I would also like to thank all the members of RPL for any help I have received from you; I wish you all the best in the future.

Most importantly, I would like to say thanks to my family; you are the ones I cherish the most. I have been and will always be grateful for the love I have received from you. Last, thank you Sining, the story started with you. Thank you Yuan, I don't know how to end the story without you.

Jiexiong Tang
Stockholm, Sweden, May 2020

Contents

I Introduction

1 Introduction
  1.1 Visual Odometry and Vision-based SLAM
  1.2 Deep Learning for Visual Odometry
  1.3 Thesis outline

2 Proposed Methods
  2.1 Deep Learning for Direct Methods
  2.2 Deep Learning for In-direct Methods
  2.3 Self-Supervision for Visual Odometry

3 Conclusions and Future Work
  3.1 Conclusions
  3.2 Future Work

4 Summary of Included Papers
  A Sparse2Dense: From Direct Sparse Odometry to Dense 3-D Reconstruction
  B Geometric correspondence network for camera motion estimation
  C GCNv2: Efficient Correspondence Prediction for Real-Time SLAM
  D Self-Supervised 3D Keypoint Learning for Ego-motion Estimation
  E Neural Outlier Rejection for Self-Supervised Keypoint Learning

Bibliography

II Included Publications

Part I

Introduction


Chapter 1

Introduction

1.1 Visual Odometry and Vision-based SLAM

Robots see the world through different sensors and use this information to interact with the surrounding environment. Since the environment is typically unknown and, thus, unpredictable, one of the fundamental problems in robotics is how to reduce the uncertainty of the estimates derived from perception and action. To tackle this problem, the robot needs to be able to perceive the world in an intelligent and autonomous way. This goal requires a robust system that can provide reliable positional information and a comprehensive understanding of the 3D world, so that the robot can support itself to navigate and move around.

Among all sensors, the camera is probably the most easily attainable sensor in our daily life. In addition, compared to, for example, LiDAR sensors, cameras are relatively low weight, low power and low cost. Used in the right way, cameras can provide accurate position information and 3D reconstruction on par with LiDAR sensors. However, while vision based approaches provide a large amount of visual information, it is more challenging to process, as some of the information that comes for free from a LiDAR has to be estimated with a camera, such as the depth. The advantages of being able to estimate motion and perform 3D reconstruction solely using a camera are therefore significant. Fig. 1.1 shows an example of a densely reconstructed room; the data was collected by a hexacopter equipped with an RGB-D camera.

In this work, we investigate deep learning assisted visual odometry (VO) approaches. Example applications that drive our work are virtual reality (VR) and augmented reality (AR). For these use cases, and many others where a human has to interpret the representation, it is natural to use a vision sensor. In addition, since we are in particular interested in deep learning techniques, it is worth noting that the great success of artificial intelligence in the past decade started in, and has since been dominating, the vision area. There are various handy tools and state-of-the-art methods that can be inspiring for the research direction we investigate.


Figure 1.1: Dense room reconstruction using an RGB-D camera.

Finding the correspondences

The history of estimating the pose and reconstructing 3D scenes from camera images can be traced back more than 100 years, to when Kruppa investigated an approach to recover the relative pose between two images using 5 labeled correspondences [1]. The correspondence problem is the root problem when estimating the pose. A variety of methods have been proposed to address this problem: sparse and dense, feature based and featureless. The goal is to correctly associate points between two frames so as to formulate a well-posed and over-determined optimization problem that can be solved for the pose with the least mean residual over all the observations (see Fig. 1.2).

In-direct methods, i.e. feature based approaches, find the correspondences by using an intermediate representation: extracted keypoints and descriptors. The keypoints are typically corners or edges since they are more distinct than pixels from low gradient areas. Classical methods usually describe these keypoints with handcrafted descriptors that are designed to be invariant, to a certain extent, to general transforms such as translation, rotation and scale. The associations between keypoints are established in an exhaustive way by measuring the distance between the feature vectors. If the computational cost can be ignored, estimating the dense optical flow using descriptors of every pixel, or small blocks as approximations, is yet another option. That is, sparsity is not a necessary consequence of in-direct methods.

Direct methods, i.e. featureless approaches, associate points using the image intensity of the pixels or patches around them.

Figure 1.2: To be able to estimate the motion of the camera and do reconstruction, a fundamental problem is to solve the correspondence problem, i.e., finding what part of image A (left) matches what part of image B (right). In in-direct methods keypoints are first extracted (colored circles) and then matched (lines connecting left and right images). These matches are then used to find the motion. In direct methods the matching is instead done using the pixel intensities.

This association procedure is usually performed via least squares optimization and can, in some approaches, be carried out together with the pose estimation. Traditional direct tracking has focused on estimating dense correspondences, which have been heavily used for optical flow estimation and stereo matching. Since the feature extraction step is excluded, direct methods can typically run at a higher framerate, which can be crucial in certain cases.

Both types of methods have their pros and cons and are worthy of investigation. For in-direct methods, features are powerful tools and can cope with large illumination and viewpoint changes. Moreover, discovering the 3D geometry can be simplified with well established tools from projective geometry using keypoint representations. However, the feature extraction requires extra computational power and the matching performance traditionally relies on the handcrafted design. For direct methods, the geometry is directly estimated from the raw images. Direct methods can naturally work with denser points, which is valuable for low texture regions, and avoid limiting the estimation to using only local information. However, direct methods require a high framerate and a well calibrated, good quality camera to work well.
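To make the classical in-direct pipeline above concrete, the sketch below uses OpenCV's ORB detector, brute-force descriptor matching and essential-matrix decomposition. It illustrates the generic pipeline only, not any method proposed in this thesis, and the parameter values are arbitrary.

```python
# A minimal sketch of the classical in-direct pipeline: detect keypoints,
# describe them, match descriptors by distance, and recover the relative
# camera pose from the matches. ORB stands in for any detector/descriptor.
import cv2
import numpy as np

def relative_pose_from_two_frames(img_a, img_b, K):
    """Estimate the relative rotation and (up-to-scale) translation between
    two grayscale frames using keypoint correspondences."""
    orb = cv2.ORB_create(nfeatures=2000)
    kps_a, des_a = orb.detectAndCompute(img_a, None)
    kps_b, des_b = orb.detectAndCompute(img_b, None)

    # Exhaustive (brute-force) matching by descriptor distance.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)

    pts_a = np.float32([kps_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kps_b[m.trainIdx].pt for m in matches])

    # Epipolar geometry with RANSAC rejects outlier correspondences and
    # yields the essential matrix; the pose is decomposed from it.
    E, inliers = cv2.findEssentialMat(pts_a, pts_b, K,
                                      method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K, mask=inliers)
    return R, t  # note: the norm of t is unobservable for a monocular pair
```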

1.2 Deep Learning for Visual Odometry

In this work, we investigate how to use deep learning based approaches to improve the tracking performance of visual odometry. More specifically, we propose novel training schemes for finding correspondences between frames, as this is the root problem for pose estimation, and thus for visual odometry. This is achieved by combining multi-view consistency and projective geometry into the objective function and the data generation.


Figure 1.3: Deep learning for direct methods. Top: A densely reconstructed scene using the proposed single view depth and normal estimation network. Bottom: The proposed method allows reconstructing the scene densely from an optimized but very sparse point cloud.

We investigate both supervised and self-supervised approaches to improve the scalability and performance of the resulting networks. In the rest of this section, we discuss the problems in classical visual odometry that deep learning can help with.

The intrinsic problem of estimating scale

MonoSLAM [4], LSD-SLAM [5] and more recently DSO [6] define the state-of-the-art in monocular visual odometry. However, the intrinsic problem of monocular vision still remains: the scale cannot be determined. In stereo based vision this is overcome by knowing the distance between the cameras, i.e. the baseline. RGB-D cameras also alleviate this problem.

Figure 1.4: Deep learning for in-direct methods. Top: Our drone explores a kitchen using the proposed learnable keypoints as features for an in-direct tracking system. Bottom: Comparison between ORB and the learnable keypoints when both are integrated into the ORB-SLAM2 [2] system.

In other works, an IMU is added to provide an estimate of the scale [7, 8, 9, 10]. Deep learning approaches offer an alternative solution by directly estimating the scale from the image. This also addresses one of the long standing problems of visual odometry and SLAM, namely how to initialize the depth of points in the map [11]. Recent works show that the absolute position error can be greatly reduced by using learned depth priors [12, 13]. This is the starting point for our work in this direction. Fig. 1.3 shows the dense reconstruction produced by the proposed method for a novel direct VO system.

Establishing better correspondences with a learning based method

Decades of work have gone into engineering features for recognition and classification of objects, place recognition, loop closure detection, stereo image matching, etc.

Figure 1.5: Self-Supervised Depth-Aware Keypoints for Monocular Visual Odometry. Left: Proposed self-supervised 3D keypoints trained from monocular videos. Dynamic objects can be effectively rejected by robust ego-motion estimation. Right: The resulting trajectory using the proposed keypoints on the KITTI [3] odometry benchmark.

Over the last decade deep learning based methods have superseded classical methods in many application areas, such as 2D/3D object detection [14, 15], semantic segmentation [16, 17] and human pose estimation [18]. In this thesis we are interested in finding correspondences for the specific case of frame to frame keypoint matching in visual odometry. Compared to the amount of work that has gone into finding general purpose features [19], this problem is relatively little investigated. Fig. 1.4 shows that the proposed methods can be effectively used for accurate correspondence matching in a novel in-direct VO system.

Self-learning for ego-motion estimation

Annotating data is a tedious but necessary part of the training of most learning based methods. While this is a relatively well defined task in, for example, object classification, labelling point correspondences is almost impossible since the optimal set is hard to define. The most desirable points are application dependent and can vary a lot under different views. Much is therefore to be gained if this part of the process can be done in a self-supervised way. There are several recent examples of this from different application domains, for example depth regression [20], tracking [21] and representation learning [22, 23]. In this thesis we target self-supervision for motion estimation. We propose a novel self-supervised learning scheme using multi-view adaptation from projective geometry, and show that an implicit supervisory signal can be generated from a high-level proxy task, in our case outlier rejection, which is naturally connected to correspondence matching. Fig. 1.5 shows that the proposed methods can be effectively used for scale-aware monocular visual odometry.

1.3 Thesis outline

Chapter 2 gives an overview of our proposed methods, which are described in detail in the included papers:

(A) Jiexiong Tang, John Folkesson, Patric Jensfelt. Sparse2Dense: From Direct Sparse Odometry to Dense 3-D Reconstruction. In IEEE Robotics and Automation Letters, 4(2), 2019.

(B) Jiexiong Tang, John Folkesson, Patric Jensfelt. Geometric correspondence network for camera motion estimation. In IEEE Robotics and Automation Letters, 3(2), 2018.

(C) Jiexiong Tang, Ludvig Ericson, John Folkesson, Patric Jensfelt. GCNv2: Efficient Correspondence Prediction for Real-Time SLAM. In IEEE Robotics and Automation Letters, 4(4), 2019.

(D) Jiexiong Tang, Rares Ambrus, Vitor Guizilini, Sudeep Pillai, Hanme Kim, Adrien Gaidon. Self-Supervised 3D Keypoint Learning for Ego-motion Estimation. In arXiv:1912.03426, 2019.

(E) Jiexiong Tang, Hanme Kim, Vitor Guizilini, Sudeep Pillai, Rares Ambrus. Neural Outlier Rejection for Self-Supervised Keypoint Learning. In International Conference on Learning Representations (ICLR), 2020.

Chapter 3 provides conclusions and discusses potential future work. Finally, Chapter 4 lists the papers and accounts for the contributions of the author for each paper.

Chapter 2

Proposed Methods

In this chapter, we briefly outline the deep learning assisted visual odometry approaches proposed in the included papers A to E. Please refer to the papers for more information.

2.1 Deep Learning for Direct Methods

In this section, we introduce a deep learning assisted direct visual odometry method for a monocular setup. In this case, the camera pose and the depth of image points are optimized together during the tracking of the correspondences. In this procedure, the metric scale of the resultant pose is not observable since the baseline for recovering the scale is missing. In addition, not all image points are trackable, due to low gradients, low information contribution to the optimization, or occlusions. It is thus also not feasible to reconstruct the scene in a relatively dense way. By using learning techniques, we can: (1) provide an initial guess for the depth optimization, which can tackle the aforementioned issue; (2) utilize geometrical information, in our case normals, to guide the dense reconstruction. In the proposed method, we achieve the first goal by training and deploying a single view depth estimation network, and the second goal by jointly regressing the normals in a tightly coupled way. Our proposed method is described in depth in the paper:

(A) Jiexiong Tang, John Folkesson, Patric Jensfelt. Sparse2Dense: From Direct Sparse Odometry to Dense 3-D Reconstruction. In IEEE Robotics and Automation Letters, 4(2), 2019.

Single-view Coupled Depth and Normal Estimation

Existing single view depth or normal estimation methods typically only regress the value of the depth or the normal. In the proposed method, the single view depth and normal estimation networks are trained in a tightly coupled manner.


Figure 2.1: Proposed tightly coupled depth and normal prediction network (right) in comparison with the typical single-view depth prediction network from [25] (left). The gray boxes, DepthRe and NormalRe, are computed during post processing.

We deploy depth-to-normal and normal-to-depth calculation procedures during the training. This two-way approach is inspired by [24], which shows that the coupled training achieves better performance in both depth and normal estimation. Fig. 2.1 illustrates our coupled depth and normal prediction (right) and a typical depth estimation network (left).

The depth-to-normal derivation is straightforward. If the depth value (Depth#3 in Fig. 2.1) is known in the image plane, the normal can be directly computed via the cross product operation, following the common approach from [26]. The calculated normal (NormalRe) and the directly predicted normal (Normal#3) are penalized together in the overall objective function. In this way, the depth is normal aware.

The normal-to-depth step is performed via 2D filtering. The directly predicted depth (Depth#3) for each pixel is used together with the predicted normals (Normal#3) of neighbouring pixels to calculate a new depth map (DepthRe). This procedure automatically verifies whether the neighbouring pixels are on the same small surface (named splats in [26]). By updating the depth (DepthRe) in this way, the normal is depth aware. Together with the depth-to-normal procedure, the two-way reconstruction tightly couples the single view depth and normal estimations.
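As an illustration of the depth-to-normal step described above, the sketch below back-projects a predicted depth map with a pinhole intrinsics matrix K and takes cross products of the local tangent vectors. This is a minimal PyTorch sketch; the exact neighbourhood definition and border handling used in the thesis (following [26]) may differ.

```python
import torch

def normals_from_depth(depth, K):
    """Compute a normal map from a depth map by back-projecting pixels to 3D
    and taking the cross product of the local tangent vectors.

    depth: (B, 1, H, W) predicted depth, K: (3, 3) camera intrinsics tensor."""
    B, _, H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    v, u = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij")
    x = (u[None, None] - cx) / fx * depth
    y = (v[None, None] - cy) / fy * depth
    points = torch.cat([x, y, depth], dim=1)          # (B, 3, H, W)

    # Tangent vectors along the image axes (forward differences),
    # cropped to a common size.
    dpdx = (points[..., :, 1:] - points[..., :, :-1])[..., :-1, :]
    dpdy = (points[..., 1:, :] - points[..., :-1, :])[..., :, :-1]

    normals = torch.cross(dpdx, dpdy, dim=1)
    normals = normals / (normals.norm(dim=1, keepdim=True) + 1e-8)
    return normals  # can be penalized against the directly predicted normals
```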

Figure 2.2: An overview of the pipeline for our dense reconstruction system. The single view predicted depth is deployed into the DSO system to provide initial guesses for the depth optimization. The resultant sparse point cloud is then densely reconstructed using the predicted normals. Every densely reconstructed image is further filtered based on visibility and further densified using the Bayesian filtering method from [27]. Examples of the information at various points in the pipeline (a, b, c, d) are shown in Fig. 2.3a-d.


Figure 2.3: Intermediate outputs from the pipeline in Fig. 2.2.

From Sparse to Dense Reconstruction The previous section described how we predict depth and normal in a coupled way. This paragraph describes how we deploy this to extend the direct tracking method DSO (direct sparse odometry) [6] for dense reconstruction. An overview of the proposed system is shown in Fig. 2.2. The video stream is fed into DSO that performs the visual tracking. In the initialization phase for the tracked points in DSO, we assign a depth prior based on the output from our network. This means that points can be used for the pose estimation immediately, rather than being treated as immature until their depth can be estimated via tracing after motion with enough stimulus can be established. The output from DSO contains the pose and the tracked points that form a sparse point cloud (exemplified in Fig 2.3.a). During the tracking, the original dense prior has been specified since only tracked points remain after the optimization. To reconstruct the sparse point cloud back to dense, we use the predicted normal (see Fig. 2.3.b) to refill the missing depth using co-planar neighbouring pixels. The resulting dense depth images (see Fig. 2.3.c) are further refined using adjacent keyframes in time using Bayesian estimation method from [27]. At the end, the dense depth images (see Fig. 2.3.d) are fed into ElasticFusion [26] to generate a 14 CHAPTER 2. PROPOSED METHODS consistent global 3D model and close the loop. Experimental results show that the depth initilization using depth predicted by our method effective address the scale drifting issue and our final densely reconc- tructed depth (see Fig. 2.3.d) noticeably improve performance in depth prediction over related work such as [12].

2.2 Deep Learning for In-direct Methods

In this section, we introduce a deep learning assisted in-direct visual odometry method. Compared to the direct methods, the correspondences are here established by extracting local features, i.e., keypoints. In our work, we focus on how to train a network to generate keypoints specifically tailored for the visual odometry setting. We optimize for the repeatability and the matching accuracy of the learnable keypoints under different views. The rationale is that if the matching performance is good, the correspondences between keypoints can be established using simple brute force distance matching. Our proposed method is described in depth in the papers:

(B) Jiexiong Tang, John Folkesson, Patric Jensfelt. Geometric correspondence network for camera motion estimation. In IEEE Robotics and Automation Letters, 3(2), 2018.

(C) Jiexiong Tang, Ludvig Ericson, John Folkesson, Patric Jensfelt. GCNv2: Efficient Correspondence Prediction for Real-Time SLAM. In IEEE Robotics and Automation Letters, 4(4), 2019.

Keypoint Detection from Recurrent Prediction We formulate the keypoint prediction as a binary classification problem. Uniquely, we predict the keypoint locations in two consecutive frames simultaneously. By forcing the keypoints in these two views be consistent, the network has been opti- mized to exploit the temporal information rather than only spatial information in comparison with existing methods. For the network to achieve this goal, we build a shallow bi-directional recurrent convolutional network (shown in Fig. 2.4). As a binary classification problem, we provide the ground-truth as a mask with labels 0 or 1 for every pixel (shown in Fig. 2.5). These masks are generated by (1) assigning keypoints in first image as 1 using Shi-Tomasi method; (2) warping these keypoints into the second image while removing the out of border points (these points are neglected, i.e., ignored without penalization). The masks are then used as typical groundtruth for pixel-level classification. We deployed a weighted cross entropy as objective function to balance the ratio of keypoints and normal pixels. The number of keypoints is typically much less than the rest of the pixels in the image. The Shi-Tomasi corners have been used as initial superset for the training. The network is optimized to distill the points that do not appeared both of the images. 2.2. DEEP LEARNING FOR IN-DIRECT METHODS 15

Figure 2.4: The network structure for bi-directional extraction of keypoints. The upper part produces dense features for the sampling of pixel-correlated descriptors, and the bottom part predicts the probability of a keypoint appearing repeatably in the two input images.

Keypoint Description from Dense Feature Extraction

Once the keypoint locations are predicted, their corresponding descriptors are obtained by bi-linearly sampling the dense feature map (shown in Fig. 2.4 as Feature#1 and Feature#2). The dense feature maps are optimized using metric learning, which allows the network to generate distinguishable features suitable for nearest neighbor matching. Through the metric learning procedure, the input samples are projected into a feature space where keypoints with similar appearance (which should be considered matches), the similar pairs, are mapped closer together, and outliers, the dissimilar pairs, are mapped farther apart, using a pre-defined distance metric, typically the ℓ2 distance for implementation convenience. When performing the metric learning, hardest-sample mining is deployed to improve and accelerate the convergence of the optimization. Instead of picking random dissimilar pairs, the mining picks the negative point closest to the input sample in feature space. Together with a loss that maximizes the margin between positive and negative samples, the triplet loss, the network gradually learns to group similar pairs into clusters in feature space while maintaining a predefined distance to dissimilar pairs.

Figure 2.5: Predicted masks, i.e., binary encoded keypoint locations, in two consecutive views. Out-of-boundary points are neglected during the data preparation.

The resultant feature vectors can be effectively matched by simple nearest neighbor matching using the ℓ2 distance.
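The following is a minimal sketch of the triplet loss with hardest-negative mining described above, assuming PyTorch and that row i of the two descriptor matrices form a matched (similar) pair. The actual sampling strategy and margin used in the papers may differ.

```python
import torch
import torch.nn.functional as F

def triplet_loss_hardest_negative(anchor, positive, margin=1.0):
    """Metric-learning loss sketch for keypoint descriptors.
    anchor, positive: (N, D) descriptors of matched keypoints (row i of
    `anchor` matches row i of `positive`). For each anchor the hardest
    (closest) non-matching descriptor is mined as the negative, and a
    triplet margin loss pulls matches together while pushing the mined
    negatives at least `margin` away."""
    # Pairwise L2 distances between all anchors and all candidates.
    dist = torch.cdist(anchor, positive)              # (N, N)
    pos_dist = dist.diag()                            # distances of true matches

    # Exclude the true match on the diagonal, then mine the hardest negative.
    dist = dist + torch.eye(len(anchor), device=dist.device) * 1e6
    neg_dist = dist.min(dim=1).values

    return F.relu(pos_dist - neg_dist + margin).mean()
```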

Real-time Acceleration from Differentiable Binarization

In the previous sections, we gave an overview of the training of keypoint detection and description. In practice, the running speed is crucial for real-world applications. To accelerate the matching procedure, we propose a differentiable binarization procedure. Typically, binarization is computed via a hard sign function. However, this function is not differentiable, and naively back-propagating through it would greatly affect the matching performance. To back-propagate the loss properly through the hard sign function, an approximator from [28] is deployed. The approximator, originally denoted the straight-through estimator, cancels the gradients for features with absolute responses greater than 1 while keeping the hard sign function for the binarization in the forward pass. This combines naturally with metric learning, since the approximator avoids over-penalizing features whose distance to the decision boundary is larger than 1. In addition, we propose two simplified network structures (Proposed and Proposed-tiny) which further improve the running speed. The detailed parameters of both networks can be found in the included Paper C. The overall comparison with existing methods is shown in Fig. 2.6. It can be seen that the inference time is significantly lower than with a commonly used backbone, ResNet50.

Figure 2.6: A comparison of the inference and matching cost in ms for different methods. The inference time is shown for both full resolution (640x480) and half resolution images. The run-time cost for the matching (inner figure) is evaluated by matching 1000 keypoints with nearest neighbour matching.
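A minimal sketch of the straight-through estimator for the hard sign binarization described above could look as follows in PyTorch; the integration with the descriptor loss is omitted.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Hard-sign binarization with a straight-through estimator [28]:
    the forward pass keeps the hard sign function, while the backward
    pass passes gradients through unchanged except where the
    pre-binarization response has absolute value greater than 1,
    where the gradient is cancelled."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * (x.abs() <= 1.0).to(grad_output.dtype)

# Usage sketch: binary_desc = BinarizeSTE.apply(dense_features_at_keypoints)
```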

In addition, we train our features in the same binary format as ORB, which greatly simplifies the integration of our method into a feature based VO system.

2.3 Self-Supervision for Visual Odometry

In this section, we discuss the potential of using a fully self-supervised learning scheme for visual odometry. In the previous sections, we leveraged highly accurate ground truth data for the motion of the camera (Sec. 2.2) or dense depth from RGB-D cameras (Sec. 2.1). In an outdoor setting this is no longer readily available. Being able to learn from unlabeled video will greatly improve the capability of the training system and allow the learning scheme to adapt to large scale applications. As mentioned in the introduction, labeling keypoints is impossible since the optimal set varies with the application and the environment. In the following we describe two approaches to self-supervision. In the first we generate an explicit supervisory signal based on projective geometry, and in the second we produce an implicit supervisory signal from a higher level proxy task. Our approaches are described in depth in the papers:

(D) Jiexiong Tang, Rares Ambrus, Vitor Guizilini, Sudeep Pillai, Hanme Kim, Adrien Gaidon. Self-Supervised 3D Keypoint Learning for Ego-motion Estimation. In arXiv:1912.03426, 2019.


Figure 2.7: Self-Supervised monocular keypoint based Structure-from-Motion learning.


(E) Jiexiong Tang, Hanme Kim, Vitor Guizilini, Sudeep Pillai, Rares Ambrus. Neural Outlier Rejection for Self-Supervised Keypoint Learning. In International Conference on Learning Representations (ICLR), 2020.

Self-Supervised Keypoint based Mono Structure-from-Motion

In Sec. 2.2, we leveraged accurate camera poses from a motion capture system and used the depth from an RGB-D camera to label the keypoints in two consecutive frames of a video. As mentioned above, ground truth data of this type is typically not available outdoors. Therefore, in the first self-supervised approach, we estimate the ego-motion on the fly during the training through a differentiable pose estimation model (see Fig. 2.7). The keypoint locations are no longer labeled by warping classical keypoints, Shi-Tomasi corners, into different frames and distilling them using ground truth associations as in Sec. 2.2. Instead, the detection is learnt by minimizing a geometric loss, the reprojection error of the keypoints. That is, the network learns to detect points that are consistent across views automatically and from scratch.

In Sec. 2.1 and 2.2, the dense depth was obtained from an RGB-D sensor, which is not available outdoors. Here the depth estimation is instead performed jointly during the training, since lifting the keypoints to 3D is required. Learning to predict the depth from monocular videos is typically known as Structure-from-Motion (SfM) learning [29]. The depth is estimated by synthesizing a virtual camera view and optimizing the photometric loss (shown in Fig. 2.7) between the reconstructed and real images.

An overview of the proposed pipeline is illustrated in Fig. 2.7. Different from existing methods, which are either not depth-aware or only perform homography adaptation, our method allows the joint optimization of the two task networks considered, keypoint estimation (KeyPointNet in Fig. 2.7) and depth estimation (DepthNet in Fig. 2.7), through the differentiable pose estimation module. The keypoints of the two images (It and Ic in Fig. 2.7) are extracted simultaneously. Keypoints from one view (pt in Fig. 2.7) are lifted to 3D using the learnt depth and warped into the other view using the pose estimated on the fly (Xt→c in Fig. 2.7). The localization error between the warped keypoints (pt→c in Fig. 2.7) and the directly detected keypoints (pc in Fig. 2.7) is optimized. The resulting pose is also used for view synthesis in the SfM learning. We build on the concept from [20] and deploy a similarity loss and a smoothness loss, and handle occlusions using auto-masking (more details can be found in Paper D).

Our experimental results show that the keypoint and depth networks benefit strongly from joint optimization in an end-to-end fashion. We also show that our monocular approach can reach performance on par with stereo based state-of-the-art methods such as DVSO [13], by integrating our self-supervised and depth-aware keypoint estimator into existing visual SLAM frameworks such as Direct Sparse Odometry (DSO) [6] and ORB-SLAM [2].
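As a sketch of the multi-view keypoint consistency described above, the function below lifts keypoints with their predicted depths, warps them with the estimated relative pose and penalizes the distance to their matched detections. The full method additionally uses descriptor and score losses, the photometric loss and occlusion handling, so this is only the localization term under simplified assumptions (e.g., an already associated set of keypoints).

```python
import torch

def keypoint_localization_loss(p_t, depth_t, X_t_to_c, K, p_c_matched):
    """Localization term between warped and detected keypoints.

    p_t:          (N, 2) keypoint pixel locations in frame t
    depth_t:      (N,)   predicted depths at those locations
    X_t_to_c:     (4, 4) estimated relative pose (frame t -> frame c)
    K:            (3, 3) camera intrinsics
    p_c_matched:  (N, 2) matched keypoint locations detected in frame c"""
    ones = torch.ones(p_t.shape[0], 1, dtype=p_t.dtype, device=p_t.device)
    pix_h = torch.cat([p_t, ones], dim=1)                  # homogeneous pixels
    rays = (torch.inverse(K) @ pix_h.T).T                  # back-projected rays
    pts_t = rays * depth_t[:, None]                        # 3D points in frame t

    pts_t_h = torch.cat([pts_t, ones], dim=1)              # (N, 4)
    pts_c = (X_t_to_c @ pts_t_h.T).T[:, :3]                # 3D points in frame c

    proj = (K @ pts_c.T).T
    p_t_to_c = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)  # projected pixels

    return (p_t_to_c - p_c_matched).norm(dim=1).mean()
```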

Learning Feature Description From Outlier Rejection

In the previous section we introduced a monocular keypoint SfM learning scheme, where the keypoint detection and description were trained using explicit losses. That is, each of the individual outputs was penalized by a loss directly modeling it. In this section we propose a novel self-supervised learning scheme deploying a proxy task, i.e. a connected higher-level task that generates an implicit supervisory signal.

The diagram in Fig. 2.8 shows an overview of our approach. We start from two images: one input image and a virtual image synthesized by a random homography transform. A keypoint network (KeyPointNet in Fig. 2.8) is used to extract keypoints from both images. The keypoints consist of locations, corresponding descriptors and scores ranked based on repeatability. Existing learning based methods [30, 31] use explicit losses for each component as a direct supervisory signal for the training. In our work, we found that an extra supervisory signal can be generated from a proxy task. We use outlier rejection as the proxy task, and it is conducted by an inlier-outlier rejection network (IO-Net in Fig. 2.8). The gradients can be back-propagated through the IO-Net to the KeyPointNet, thus allowing self-supervised keypoint description and improving the matching performance.

Figure 2.8: Self-supervised keypoint learning using a proxy task. KeyPointNet is the same network as the one used in Fig. 2.7. It conducts the keypoint extraction from the input images and outputs Location, Descriptor and Score. These can be trained directly using the losses as in the previous section. IO-Net provides an indirect (implicit) supervisory signal for KeyPointNet's targets by acting as a discriminator that distinguishes inliers from outliers among the input matched point pairs. Extra gradients can be generated from this proxy task and back-propagated to KeyPointNet.

The outlier rejection task is inspired by [32], a neural network based RANSAC approach. In our scenario, we randomly sample the homography transform from a pre-defined range, so the ground truth transform between the original and the synthesized image is known. Thus, the random sampling and consensus steps are no longer needed, which simplifies the task to a binary classification problem. The labels are generated automatically by measuring the distance between each matched point location and its warped location. Specifically, we use the IO-Net to generate an extra supervisory signal and show that it improves the matching accuracy and provides an alternative way to supervise the feature description.
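A minimal sketch of how inlier/outlier labels for the proxy task can be generated from the known homography is shown below, assuming PyTorch. The pixel threshold is an illustrative assumption, and the actual label definition used for IO-Net may differ.

```python
import torch

def inlier_outlier_labels(p_src, p_matched, H, threshold=4.0):
    """Generate binary inlier/outlier labels for matched point pairs. Since
    the homography H that synthesized the second view is known, a match is
    labelled an inlier if the matched location lies close to the warped
    source location.

    p_src:     (N, 2) keypoint locations in the source image
    p_matched: (N, 2) locations of their matched keypoints in the warped image
    H:         (3, 3) known homography used to synthesize the warped image
    threshold: pixel distance below which a pair counts as an inlier."""
    ones = torch.ones(p_src.shape[0], 1, dtype=p_src.dtype, device=p_src.device)
    src_h = torch.cat([p_src, ones], dim=1)            # homogeneous coordinates
    warped = (H @ src_h.T).T
    warped = warped[:, :2] / warped[:, 2:3].clamp(min=1e-6)

    dist = (warped - p_matched).norm(dim=1)
    labels = (dist < threshold).float()                # 1 = inlier, 0 = outlier
    return labels  # used as targets for the IO-Net binary classifier
```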

Chapter 3

Conclusions and Future Work

This thesis explored deep learning based perception techniques to improve the tracking and reconstruction performance of visual odometry (VO) and SLAM. We investigated both direct and in-direct methods, with a focus on modeling the temporal consistency of observations. By combining novel deep learning approaches with the proposed training schemes using structured visual information, state-of-the-art performance has been achieved in extensive experiments.

3.1 Conclusions

For direct methods, we have illustrated that the intrinsic scale problem can be effectively alleviated by our tightly coupled single view depth and normal estimation network. Different from existing methods, we showed that using the normal to guide the depth reconstruction can improve the depth estimation accuracy remarkably. The joint network achieves competitive performance in both pose estimation and depth reconstruction compared with other state-of-the-art methods. For in-direct methods, we have demonstrated that learnable keypoints can effectively improve the matching accuracy. The proposed novel training scheme can generate distinguishable keypoints for pose estimation while maintaining real-time running speed. Furthermore, we have verified the effectiveness of our methods both on open benchmarks and on a real hexacopter. For the self-supervision of visual odometry, we have shown that the network can be trained even without explicit supervision. This is achieved by exploiting the view consistency constraints from 3D projective geometry and deploying a cascaded higher level task that is easier to define than the original task. Our method achieves the best pose estimation performance among methods trained using only monocular videos.

In summary, a clear advantage of using deep learning techniques to assist visual odometry has been demonstrated. The estimator, a deep neural network, can be tailored for optimal performance in a given task when enough annotated data is available. If the labeled data is not adequate for the training, a self-supervised learning scheme can be effectively deployed to allow the network to be trained using solely unlabeled videos.

However, various challenges remain. The most concerning one is probably the generalization ability: how the network behaves and performs in a totally different scene or application is still not well understood in the current state of this field. We would however argue that it should be possible to adjust the network, ideally in an automated way, to different tasks or unseen environments, since its flexibility allows it. Together with techniques such as uncertainty modeling, generalization may remain an issue, but it should be progressively fixable, or at least measurable. We therefore conclude that deep learning assisted visual odometry is a challenging but promising direction.

3.2 Future Work

There are many areas worthy of further investigation. Life-long SLAM is one direction. Robots should be able to reliably work in dynamic environments that can change daily. Online learning can be used in this case to keep the network updated. In addition, deep learning is known to be good at detecting semantics in challenging scenes, which can greatly help the detection of dynamic objects.

Long-term localization is also an interesting area. Similar to life-long SLAM, there are various dynamic changes of the environment in this application, but the time gap can be much longer, which results in different categories of variations, from day-night to weather and seasonal changes. Recently, deep learning based methods have achieved significantly better performance on various challenging long-term localization benchmarks [33].

Another direction is meta-learning. We primarily investigated the correspondence estimation problem in this thesis. There is related work, such as [34], investigating how to train an optimizer for the nonlinear least squares problem of pose estimation.

Chapter 4

Summary of Included Papers

A Sparse2Dense: From Direct Sparse Odometry to Dense 3-D Reconstruction.

In IEEE Robotics and Automation Letters, 4(2), 2019

Authors Jiexiong Tang, John Folkesson and Patric Jensfelt.

Abstract In this paper, we focused on direct methods instead. We proposed a new deep learning based dense monocular SLAM method. Compared to existing methods, the proposed framework constructs a dense 3D model via a sparse to dense mapping using learned surface normals. With single view learned depth estimation as prior for monocular visual odometry, we obtain both accurate positioning and high qual- ity depth reconstruction. The depth and normal are predicted by a single network trained in a tightly coupled manner. Experimental results show that our method significantly improves the performance of visual tracking and depth prediction in comparison to the state-of-the-art in deep monocular dense SLAM.

Contribution by the author Proposed the problem; implemented the training for network and system integra- tion; evaluated the experimental results; wrote the majority of the paper.


B Geometric correspondence network for camera motion estimation.

In IEEE Robotics and Automation Letters, 3(2), 2018.

Authors Jiexiong Tang, John Folkesson and Patric Jensfelt.

Abstract In this paper, we proposed a new learning scheme for generating geometric corre- spondences to be used for visual odometry with in-direct method. A convolutional neural network (CNN) combined with a recurrent neural network (RNN) are trained together to detect the location of keypoints as well as to generate corresponding descriptors in one unified structure. The network is optimized by warping points from source frame to reference frame, with a rigid body transform. Essentially, learning from warping. The overall training is focused on movements of the cam- era rather than movements within the image, which leads to better consistency in the matching and ultimately better motion estimation. Experimental results show that the proposed method achieves better results than both related deep learning and hand crafted methods. Furthermore, as a demonstration of the promise of our method we use a naive SLAM implementation based on these keypoints and get a performance on par with ORB-SLAM. This work aimed to provide the classical robust pose estimator, e.g., solve pnp with ransac, more reliable and inliers matches.

Contribution by the author Proposed the problem; implemented the training for network and system integration; evaluated the experimental results; wrote the majority of the paper.

C GCNv2: Efficient Correspondence Prediction for Real-Time SLAM.

In IEEE Robotics and Automation Letters, 4(4), 2019.

Authors Jiexiong Tang, Ludvig Ericson, John Folkesson and Patric Jensfelt.

Abstract In this paper, we proposed a real-time deep learning-based network, GCNv2, for generation of keypoints and descriptors built on Paper 1. GCNv2 is designed with a binary descriptor vector as the ORB feature so that it can easily replace ORB in systems such as ORB-SLAM2. GCNv2 significantly improves the computational efficiency over GCN that was only able to run on desktop hardware. We show how a modified version of ORB-SLAM2 using GCNv2 features runs on a Jetson TX2, an embedded low-power platform. We verified the effectiveness of GCNv2 on a real flying quadrotor UAV in both indoor and outdoor environments.

Contribution by the author Proposed the problem; implemented the training for network and system integra- tion; evaluated the experimental results; wrote the majority of the paper. 26 CHAPTER 4. SUMMARY OF INCLUDED PAPERS

D Self-Supervised 3D Keypoint Learning for Ego-motion Estimation.

In arXiv:1912.03426.

Authors Jiexiong Tang, Rares Ambrus, Vitor Guizilini, Sudeep Pillai, Hanme Kim and Adrien Gaidon.

Abstract In our previous work, we investigated self-supervised learning scheme for keypoint network by employing homography adaptation to create 2D synthetic views. While such approaches trivially solve data association between views, they cannot effec- tively learn from non-planar 3D scenes with real illumination. In this work, we proposed a fully self-supervised approach towards learning depth-aware keypoints purely from unlabeled videos by incorporating a differentiable pose estimation mod- ule that jointly optimizes the keypoints and their depths in a Structure-from-Motion setting. We introduced 3D Multi-View Adaptation, a technique that exploits the temporal context in videos to self-supervise keypoint detection and matching in an end-to-end differentiable manner. Finally, we show how a fully self-supervised key- point detection and description network can be trivially incorporated as a front-end into a state-of-the-art visual odometry framework that is robust and accurate.

Contribution by the author Proposed the problem; implemented the training for network and system integra- tion; evaluated the experimental results; wrote the majority of the paper. E. NEURAL OUTLIER REJECTION FOR SELF-SUPERVISED KEYPOINT LEARNING. 27 E Neural Outlier Rejection for Self-Supervised Keypoint Learning.

In International Conference on Learning Representations (ICLR), 2020.

Authors Jiexiong Tang, Hanme Kim, Vitor Guizilini, Sudeep Pillai and Rares Ambrus.

Abstract Different with previous methods using annotated data, in this paper, we investi- gated how to train the keypoint network in a fully self-supervised manner. We introduced InlierOutlierNet, a novel proxy task for the self-supervision of keypoint detection, description and matching. By making the sampling of inlier-outlier sets from point-pair correspondences fully differentiable within the keypoint learning framework, we show that are able to simultaneously self-supervise keypoint de- scription and improve keypoint matching. Second, we introduced a new keypoint- network architecture that is especially amenable to robust keypoint detection and description. We design the network to allow local keypoint aggregation to avoid artifacts due to spatial discretizations commonly used for this task, and we improve fine-grained keypoint descriptor performance by taking advantage of efficient sub- pixel convolutions to upsample the descriptor feature-maps to a higher operating resolution. Through extensive experiments and ablative analysis, we showed that the proposed self-supervised keypoint learning method greatly improves the quality of feature matching and homography estimation on challenging benchmarks over the state-of-the-art.

Contribution by the author Proposed the problem; implemented the training for network and system integra- tion; evaluated the experimental results; wrote the majority of the paper.

Bibliography

[1] E. Kruppa, Zur Ermittlung eines Objektes aus zwei Perspektiven mit innerer Orientierung. Hölder, 1913.

[2] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: an open-source SLAM system for monocular, stereo and RGB-D cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.

[3] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” Intl. Journal of Robotics Research (IJRR), 2013.

[4] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, “Monoslam: Real-time single camera slam,” IEEE TPAMI, vol. 29, no. 6, pp. 1052–1067, 2007.

[5] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in European Conf. on Computer Vision (ECCV), September 2014.

[6] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Trans. Pattern Anal. Mach. Intell, March 2018.

[7] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation.” Georgia Institute of Technology, 2015.

[8] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, “Robust visual inertial odometry using a direct EKF-based approach,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 298–304.

[9] T. Qin, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.

[10] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” The International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.


[11] J. Civera, A. J. Davison, and J. M. Montiel, “Inverse depth parametrization for monocular SLAM,” IEEE Transactions on Robotics, vol. 24, no. 5, pp. 932–945, 2008.

[12] K. Tateno, F. Tombari, I. Laina, and N. Navab, “CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2017.

[13] N. Yang, R. Wang, J. Stückler, and D. Cremers, “Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry,” European Conf. on Computer Vision (ECCV), 2018.

[14] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12697–12705.

[15] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” arXiv preprint arXiv:1904.01355, 2019.

[16] J. Li, A. Raventos, A. Bhargava, T. Tagawa, and A. Gaidon, “Learning to fuse things and stuff,” arXiv preprint arXiv:1812.01192, 2018.

[17] A. Kirillov, R. Girshick, K. He, and P. Dollár, “Panoptic feature pyramid networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.

[18] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” arXiv preprint arXiv:1902.09212, 2019.

[19] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker, “Universal correspondence network,” in Advances in Neural Information Processing Systems, 2016, pp. 2414–2422.

[20] V. Guizilini, R. Ambrus, S. Pillai, and A. Gaidon, “PackNet-SfM: 3D packing for self-supervised monocular depth estimation,” arXiv preprint arXiv:1905.02693, 2019.

[21] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy, “Tracking emerges by colorizing videos,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 391–408.

[22] X. Wang, A. Jabri, and A. A. Efros, “Learning correspondence from the cycle-consistency of time,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2566–2576.

[23] A. Kolesnikov, X. Zhai, and L. Beyer, “Revisiting self-supervised visual representation learning,” arXiv preprint arXiv:1901.09005, 2019.

[24] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “GeoNet: Geometric neural network for joint depth and surface normal estimation,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.

[25] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Intl. Conf. on 3D Vision (3DV). IEEE, 2016.

[26] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger, “ElasticFusion: Real-time dense SLAM and light source estimation,” Intl. Journal of Robotics Research, vol. 35, 2016.

[27] M. Pizzoli, C. Forster, and D. Scaramuzza, “REMODE: Probabilistic, monocular dense reconstruction in real time,” in IEEE Intl. Conf. on Robotics and Automation (ICRA). IEEE, 2014.

[28] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.

[29] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2017.

[30] D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperPoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 224–236.

[31] P. H. Christiansen, M. F. Kragh, Y. Brodskiy, and H. Karstoft, “UnsuperPoint: End-to-end unsupervised interest point detector and descriptor,” arXiv preprint arXiv:1907.04011, 2019.

[32] E. Brachmann and C. Rother, “Neural-guided RANSAC: Learning where to sample model hypotheses,” arXiv preprint arXiv:1905.04132, 2019.

[33] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla, “Benchmarking 6DOF outdoor visual localization in changing conditions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[34] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison, “LS-Net: Learning to solve nonlinear least squares for monocular stereo,” Proceedings of the European Conference on Computer Vision (ECCV), 2018.

Part II

Included Publications
