<<

Pose Estimation using Overhead Imagery and Semantics

by

Yousif Farid Dawood Rouben

S.B., Mathematics, MIT (2016)
S.B., Electrical Engineering and Computer Science, MIT (2016)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfilment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2018

© 2018 Yousif Farid Dawood Rouben. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, September 7, 2018

Certified by: Nicholas Roy, Professor of Aeronautics and Astronautics, Thesis Supervisor

Accepted by: Katrina LaCurts, Chair, Master of Engineering Thesis Committee

Pose Estimation using Overhead Imagery and Semantics

by Yousif Farid Dawood Rouben

Submitted to the Department of Electrical Engineering and Computer Science on September 7, 2018, in partial fulfilment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

As hardware improves, so does the ability to run more complex algorithms. Improvements in the portability and performance of computing units have allowed for more compact robotic systems such as Micro Aerial Vehicles (MAVs), while developments in camera technologies have allowed for the generation of denser robot perception, such as depth and semantics, derived from environmental cues such as colour, shape and texture. Additionally, with the advent of commercial civilian satellites and initiatives like the Google Maps project, most of the earth's outdoor urban environments have been captured through satellite imagery. These overhead images contain useful information that can aid with outdoor robot navigation and localisation, by, for example, being used as prior maps in a localisation pipeline. This thesis presents a method for generating semantically consistent robot localisation updates on-board a MAV. The key contribution is a process that fuses local ground and global satellite image information through geometric alignment at the semantic level. This method uses a semantically segmented collapsed point cloud from a camera system on-board a MAV, without any notion of mapping, referencing only a potentially stale global overhead image. A particle filter is used to enable multi-hypothesis data association and allow for efficient recovery from motion uncertainty.

Thesis Supervisor: Nicholas Roy Title: Professor of Aeronautics and Astronautics

Acknowledgements

To my family: Thank you does not cut it; nothing ever will. I love you all. Mama, you are my light. This will never change. To my neighbours who have taught me so much in life and have been the foundations to my inquisitive nature: I am forever grateful. Thank you to Nick Roy for allowing me to join his group, providing me with support and direction in both research and life and for being nothing but direct with me. I would also like to thank John (Jake) Ware and John Carter for their invaluable insights, mentorship and for making research fun. Thank you Nicholas Greene for unknowingly serving as my general encyclopaedia, debugger and teacher. Thanks to Nicholas Villanueva for being in the same boat and reminding me that boats float more often than they sink. Thanks to Fernando Yordan for listening to my ideas and generally being a potato. Finally, I would like to thank the Robust Robotics Group and FLA team for their support and feedback - you have taught me a lot this past year.

Contents

1 Introduction
   1.1 Motivation
   1.2 Problem Statement
   1.3 Contributions
   1.4 Thesis Outline

2 Related Work
   2.1 Robot Localisation
      2.1.1 Basic Notation
      2.1.2 Dead Reckoning and GPS
      2.1.3 Recursive State Estimation
   2.2 Vision Based Overhead Localisation
      2.2.1 Feature Based Approaches
      2.2.2 Learned Embedding Approaches
   2.3 Semantic Segmentation
      2.3.1 Datasets

3 Approach
   3.1 Algorithm Outline
   3.2 Problem Formulation
      3.2.1 Notation
      3.2.2 Problem Statement
   3.3 Semantic Segmentation: Training
   3.4 Semantic Topview Generation
   3.5 Monte Carlo Localisation
      3.5.1 Initialisation
      3.5.2 Prediction: Motion Model
      3.5.3 Correction: Measurement Model for Particle Weighting
      3.5.4 Pose Estimate Calculations
   3.6 Summary

4 Results
   4.1 Semantic Segmentation
   4.2 Simulation
      4.2.1 Setup
      4.2.2 Results
   4.3 Real Flight Data

5 Conclusion

List of Figures

1-1 Examples of Commercial Autonomous Robots
1-2 Examples of Overhead Imagery
1-3 Examples of Perception Sensors for Autonomous Vehicles

2-1 Graphical Model of Markovian State Estimation Problem
2-2 Semantic Segmentation Annotated Training Image Pair Example

3-1 Semantic Overhead Localisation Pipeline
3-2 Semantic Topview Generation Pipeline
3-3 Odometry Model Diagram

4-1 Overhead Image of the Simulated Environment
4-2 Approximate Flight Path for Simulation Tests
4-3 "Figure-Eight" Simulation Flight Path
4-4 Example of "Easy" Simulation Trajectory and Errors
4-5 Example of "Medium" Simulation Trajectory and Errors
4-6 Example of "Hard" Simulation Trajectory and Errors
4-7 Simulation Position and Orientation Mean Square Errors
4-8 CDFs of Absolute Position and Orientation Errors in Simulation
4-9 Examples of Pose Estimate Trajectories along a "Figure-Eight" Path
4-10 Semantically Labelled Overhead Image of Real World Flight Region
4-11 SAMWISE vs Semantic Overhead Localisation Estimate Trajectories
4-12 Semantic Path Reconstruction using SAMWISE
4-13 Semantic Path Reconstruction using Semantic Overhead Localisation

List of Tables

3.1 Semantic Segmentation Training Set: Class Mappings

4.1 Semantic Segmentation Prediction Examples Across Test Datasets
4.2 Semantic Segmentation Results: Pixel Accuracies
4.3 Semantic Segmentation Results: Intersection over Union
4.4 Semantic Segmentation Training Set: Cross-Entropy Loss Reweightings
4.5 Parameters Controlling Testing Difficulty in Simulation
4.6 Mean and Standard Deviation of Simulation Position Mean Square Errors

Chapter 1

Introduction

The field of robotics has progressed greatly over the past few decades. However, one of the most compelling and yet most difficult challenges in robotics remains autonomy: allowing a robot to operate without human intervention or guidance. A key aspect of autonomy, particularly for robots that are tasked with kinematic actions, such as ground and aerial vehicles that navigate in various environments, is localisation. In order for a robot to complete its designated tasks, it needs to know, with accuracy, the where (localisation) and the when (synchronisation) so that it can determine the how (planning) and the what (actions).

This thesis presents a robot localisation approach based on Monte Carlo Localisation (MCL) methods that uses satellite imagery as the prior map and a camera system with depth estimation capabilities as the environmental measurement sensor. Semantic segmentation is used, enabling semantically meaningful alignments between the global overhead image and the local onboard camera imagery. The presented approach is designed for real-time use onboard autonomous vehicles and is tested using a micro aerial vehicle (MAV) framework.

This chapter outlines the motivations for the research presented (Section 1.1), a brief introduction to the robot localisation problem that this thesis addresses (Section 1.2), the contribution of this thesis towards exploring the field of semantic localisation using an overhead image (Section 1.3), and finally, an overview of the subsequent chapters (Section 1.4).

(a) Skydio aerial photography drone [1] (b) Waymo's self-driving car [2]

(c) Chinese surveillance drone [3]

Figure 1-1: Examples of autonomous robots for transport, photography and surveil- lance.

1.1 Motivation

The development of autonomous robots has been largely driven by advancements in both hardware and software. Improvements in hardware have enabled the execution of more complex algorithms and systems, while the development of new algorithms has allowed for more robust robot perception, motion, control and planning. Indeed, the two go hand in hand in moving towards truly autonomous, real-time systems that could integrate themselves into our everyday lives, ranging in purpose from aerial photography [1] to driving [2], surveillance [3] to search and rescue [4].

Typically, the robotic platform of focus for localisation techniques has been autonomous ground vehicles. However, a more generalised and arguably more challenging class of robots is aerial vehicles, particularly micro aerial vehicles (MAVs). Although their dynamic models are well understood, they are, in some ways, a more challenging platform than traditional ground vehicles. Their movements are not restricted to planar surfaces (such as roads), while additional degrees of freedom impose requirements for systems that can account for the increased complexity. For example, cameras mounted on MAVs are no longer restricted to a fixed "dashcam" view,

as would be the case with ground vehicles, making problems such as semantic segmentation and object detection more difficult to solve. Additionally, while constraints on size, weight and power allow for more versatile applications such as package delivery [5, 6], this versatility comes with limitations, particularly the speed with which a MAV can manoeuvre, especially when denied prior knowledge of its environment, the Global Positioning System (GPS) for global localisation, or communications to external systems, thus necessitating that all sensing, computing and inference be conducted onboard the vehicle's limited computational unit.

Figure 1-2: Examples of publicly available satellite overhead images overlooking the main MIT campus (right) and the Guardian Centers of Georgia (left). Images courtesy of Google Maps [7].

Sensors such as inertial measurement units (IMUs) have become readily available and affordable over the past decade and allow for cheaper, more accessible - albeit noisy - odometry, while more expensive commercial systems like GPS [8] provide global positioning to within a few meters of error [9]. Additionally, localisation techniques such as MCL [10] exist, where local sensory information, usually generated from range-based sensors such as LiDAR or sonar, is used to cross-reference against a prior of the environment, usually in the form of a 2D metric map. As a result of the range of initiatives for mapping the world through satellite imagery, including several made publicly available [11] (such as Google Earth [12]), there currently exists a plethora of overhead imagery covering most of the outdoor urban world. Such information could serve as valuable priors to the robot localisation problem in outdoor urban environments, which could facilitate more versatile, robust developments

in urban autonomous driving and flight.

In cases where priors (such as a map of the environment) are not known, other techniques can be used, such as Simultaneous Localisation and Mapping (SLAM) frameworks [10, 13, 14], where robots simultaneously localise and build a map of their environment, using one to do the other and vice versa. As the robot moves within its environment, it perceives the environment through its various sensors and fuses those noisy measurements in order to localise itself with respect to the environment, which it must also map in parallel. This is challenging: the robot needs a map in order to estimate its pose, but it also needs an estimate of its pose, together with the outputs of its sensors, in order to estimate that map; a paradoxical problem in many senses.

In addition to direct improvements in localisation techniques, there have also been great advancements in other areas that may implicitly improve robot localisation. One field that has seen significant progress in the last five years is semantic segmentation, which is the problem of segmenting images such that each pixel is associated with a semantic class representing what that pixel belongs to. By extracting such information from both maps and local sensory inputs, one can further constrain the matching problem between these two data sets such that they incorporate an additional notion of correctness: are the data sets that are being matched between the map and the sensory input semantically consistent? In other words, are they the same "thing"?

Being able to localise within an environment using only local sensors and computation onboard MAVs is the underlying motivation for this thesis. Although high-speed, high-fidelity GPS-denied localisation and navigation is a task that challenges the entire autonomy stack, this thesis focuses on improving the state-of-the-art in robot localisation using potentially stale overhead images and an onboard camera and computation system. The central approach of this thesis, semantic overhead localisation, will be introduced and shown to be capable of providing corrections to noisy pose estimates.

(a) Hokuyo 2D Scanning LiDAR [40] (b) Velodyne 3D Scanning LiDAR [41]

(c) Intel Realsense D435 Stereo Camera [42] (d) PointGrey Flea3 Monocular Camera [43]

Figure 1-3: The most popular perception sensors used within autonomous vehicle stacks are typically laser based scanning LiDARs and cameras.

1.2 Problem Statement

Deep learning and neural networks have shown that we can now fairly reliably segment a single image of a scene into different semantic components, such as roads, buildings and trees [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]. Several robot systems have used this information for choosing local navigation strategies [26, 27, 28, 29, 30]. Similar strategies have been used to segment overhead imagery [31, 32, 33, 34], also for navigation but at a global scale [35, 36, 37, 38, 39]. One area that remains relatively unexamined, however, is that of fusing segments from overhead imagery with local segments. For example, a vehicle may know from an overhead image that it is supposed to be following a road, but is that road the same road that it observes in the local camera image? The key question this thesis considers is how to fuse segmentations from local and global images and use them in a navigation strategy. We would especially like to do this without resorting to a globally consistent, detailed metric map, but rather by reasoning at the symbolic level.

In typical MCL approaches that use ranging sensors (such as sonar or LiDAR) for generating local measurements of the environment and a pre-built 2D map of the

environment, localisation is usually achieved through geometric alignments or feature matching, since the two sources of data exist within the same planar representation. However, these approaches fail to take into consideration the semantic inconsistencies of such alignments. With a vision-based technique, by contrast, semantics come naturally due to the sheer amount of information captured in each frame. Additionally, unlike 2D range-based approaches, vision approaches allow for denser signals since 3D information can be generated, which would otherwise require significantly more expensive 3D range sensors. Recent advancements in deep-learning-based semantic segmentation techniques enable real-time on-board segmentation of image streams, thus allowing for more insightful matchings to be made between local and global inputs.

This thesis presents a new localisation framework that is based on the well-established MCL method to localise a robot within a map via local measurements from on-board sensors. This framework is intended to serve as an additional layer to a global localisation framework, providing real-time pose corrections at regular intervals. In this new framework, the map of the environment is an overhead satellite image that need not be a fully detailed metric map. The local measurements used as inputs to the localiser are images from a camera on board the MAV and depth information obtained through the use of stereo vision, monocular vision techniques or RGBD sensors. Given a prior over the robot's current pose, the goal is to generate accurate robot pose estimates through robust alignments of environment geometry, obtained by fusing global and local information at the semantic level.

1.3 Contributions

This thesis makes three key contributions. Firstly, it demonstrates that training a semantic segmentation network with a sufficiently large and varied dataset composed of urban images allows for urban scene segmentation that generalises for use on board a MAV in environments not used during the training phase. Secondly, it

establishes that localisation from stale overhead images is viable, given local vision sensors that provide greater contextual detail than more costly alternatives such as laser scanners. Finally, it illustrates, first hand, the usefulness of semantic information in global localisation. In particular, it demonstrates the ability to use a "binary semantic photometric error" as a more robust reweighting factor in a particle filter's measurement model than traditional colour-based photometric error.

The effectiveness of the system presented in this thesis is evaluated in both a simulation environment and using flight data captured from onboard a real-world MAV platform. A battery of flights conducted in a simulated version of a real-world outdoor urban environment demonstrates that a robot can effectively correct its pose estimates and recover from drift errors, even as inverse depth noise is increased in the depth estimates. In addition, sample trials on data collected from onboard a real quad-rotor have shown reductions in error between the estimated and actual landing pose.

1.4 Thesis Outline

This thesis begins by providing an overview of relevant background information in the fields of robot localisation, overhead localisation and semantic segmentation in Chapter 2. The remainder of this thesis presents a novel approach to autonomous robot localisation using semantics from local and global imagery. Chapter 3 details the foundations and schematics of this new approach, while Chapter 4 presents the results derived from testing this framework, first in simulation, followed by real-world physical flight data. Finally, Chapter 5 provides concluding remarks and briefly notes future work that could further this line of research.

Chapter 2

Related Work

This chapter provides a brief overview of the vision-based robot localisation problem and some of the existing methods for solving it, with a concentrated focus on overhead localisation. Section 2.1 discusses the robot localisation problem, introduces some basic notation and compares some of the most popular problem-solving techniques. Section 2.2 presents past works in camera-based overhead localisation. Finally, Section 2.3 discusses semantic segmentation with a focus on deep-learning pipelines and several publicly available datasets for training segmentation networks in urban environments.

2.1 Robot Localisation

In this section, we begin by first introducing some basic notation (Section 2.1.1) in the context of robot localisation that will be used in the latter sections of this chapter and in subsequent chapters to discuss the robot localisation problem, formulations, approaches and results. We continue by briefly outlining some of the most basic methods for robot localisation (Section 2.1.2) as well as more standard approaches that form the foundations of what is generally used in practice (Section 2.1.3).

21 2.1.1 Basic Notation

The typical robot localisation problem involves a robot moving through an environment and trying to estimate, at discrete time steps $t \in \mathbb{N}_0$, its position and orientation, also known as its pose (interchangeably referred to as "state"), usually within a 2D or 3D coordinate frame. We denote the robot pose at time $t$ by $\mathbf{x}_t \in \chi$, where, for example, $\chi = SE(2)$ in the case of 2D localisation or $\chi = SE(3)$ in the case of 3D localisation.

Robot movements are executed through control inputs, with the input at discrete time $t$ denoted by $\mathbf{u}_t$; it represents the command used to move the robot in the interval $(t-1, t]$ from pose $\mathbf{x}_{t-1}$ to pose $\mathbf{x}_t$. Common sources for these control inputs include inertial measurement units (IMUs) and visual odometry (VO) systems.

The environment is often represented by either a metric map, which we denote by $Map$, or a set of $K$ landmarks $(\mathbf{l}_1, \ldots, \mathbf{l}_K)$. The space in which these representations exist depends on the setup of the localisation problem; however, for this thesis, we typically assume that $\mathbf{l}_k \in \mathbb{R}^3$ and $Map \subset \mathbb{R}^{m \times n}$ for $m, n \in \mathbb{N}$.

Measurements of the surrounding environment through local sensors can be used to help determine robot pose by, for example, using data association techniques that try to match noisy measurements from sensors to landmarks in the environment. Although not always the case, measurements are typically represented by a single variable $\mathbf{z}_t \in \mathcal{Z}$ for each discrete timestep $t$. Since this thesis focuses on vision-based approaches, $\mathcal{Z}$ is typically a subset of $\mathbb{R}^{m \times n}$ in the case of the image domain or $\mathbb{R}^{3 \times 1 \times k}$ in the case of 3D point clouds, for some $m, n, k \in \mathbb{N}$.

At times, it may be necessary to refer to a history of values for some variable $\mathbf{v}$. This history may either be a temporal history between two discrete timesteps or simply a set of values such as landmarks. We represent this history by $\mathbf{v}_{t:t'} = (\mathbf{v}_t, \mathbf{v}_{t+1}, \ldots, \mathbf{v}_{t'})$ for $t' > t$ and $t, t' \in \mathbb{N}_0$.

With this notation, we define robot localisation as the process of estimating, at discrete time $t$, the pose of the robot $\mathbf{x}_t$ within some predefined coordinate frame, given a history of command inputs $\mathbf{u}_{1:t}$ and measurements $\mathbf{z}_{1:t}$.

22 2.1.2 Dead Reckoning and GPS

The most basic form of robot localisation is often attributed to dead reckoning, whereby motion measurements are used to estimate a distribution over robot pose, without any notion of a map. An initial pose distribution $p(\mathbf{x}_0)$ is required, however, with respect to which the robot is initially localised by propagating the pose using input measurements. In the case of wheeled robots, measurements can be derived using odometry from wheel encoders, while more generally, sensors such as Inertial Measurement Units (IMUs) can be used, which combine accelerometers and gyroscopes to measure the force, angular rate and magnetic field surrounding the robot to which they are attached [44]. This method is subject to pose accuracy degradation as noise is accumulated over time in an integral fashion. Alternative sensor-based methods exist, with the most popular being GPS [8], which generates real-time timestamped global position and velocity estimates through triangulation with four or more satellites from a constellation that orbits the earth. While unencrypted civilian GPS is considered accurate to within 4.9 meters [9, 45, 46], it is susceptible to exploits, namely spoofing attacks [47, 48], and its reliability for mission-critical robot localisation is hampered by issues with signal blockage and reflection when operating in urban canyons (such as around buildings, bridges and trees), indoors or underground [49, 50, 45]. Although methods do exist that fuse dead reckoned odometry to deal with GPS signal drop-out and degradation [51], in general, the existing accuracy of GPS networks is unacceptable for localising both large-scale robots (such as autonomous cars) and more compact vehicles (such as MAVs), where errors of a few meters may be the difference between safety and catastrophe when adopting these measurements in the autonomy stack.
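To make the accumulation of dead-reckoning error concrete, the following is a minimal Python sketch (not taken from this thesis) that integrates body-frame odometry increments into an SE(2) pose; the increment format and noise magnitudes are illustrative assumptions.

```python
# A minimal sketch of 2D dead reckoning, assuming odometry arrives as
# per-step increments (dx, dy, dtheta) expressed in the robot's body frame.
import numpy as np

def dead_reckon(x0, odometry):
    """Propagate an SE(2) pose through a sequence of body-frame increments."""
    poses = [np.asarray(x0, dtype=float)]
    for dx, dy, dtheta in odometry:
        x, y, theta = poses[-1]
        # Rotate the body-frame translation into the world frame, then integrate.
        x += np.cos(theta) * dx - np.sin(theta) * dy
        y += np.sin(theta) * dx + np.cos(theta) * dy
        theta = (theta + dtheta + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
        poses.append(np.array([x, y, theta]))
    return poses

# Example: noisy increments along a straight line drift away from ground truth.
rng = np.random.default_rng(0)
increments = [(1.0 + rng.normal(0, 0.02), rng.normal(0, 0.02), rng.normal(0, 0.01))
              for _ in range(100)]
print(dead_reckon((0.0, 0.0, 0.0), increments)[-1])
```

Because each noisy increment is simply integrated, the error in the final pose grows without bound over time, which is exactly the degradation described above.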

Indeed, robot localisation need not only involve knowing the robot's pose relative to a point of reference. As robots become increasingly complex systems that interact with their environments, the need for more local reasoning has become paramount in informing several parts of the autonomy stack, such as planners that estimate optimal strategies for robot actions. Typically, this is done using methods that reason

Figure 2-1: Graphical model of the Markovian state estimation problem. White nodes represent the hidden state variables to be estimated, dark nodes express control and measurement observations, while arrows illustrate conditional dependence of destination on source.

about measurements of the environment generated by local sensors and a map that represents a belief of the environment's state. In the case where a prior map of the environment is present, global pattern matching methods or recursive state estimation techniques such as Kalman filters (KFs) and MCL can be used, whereby data association between the map and local sensor measurements of the environment is conducted. However, in the case where no map exists or the existing map is deficient, the problem is often modelled as a Simultaneous Localisation and Mapping (SLAM) problem, which requires mapping the environment and simultaneously localising from the generated map. In the following section, we briefly discuss recursive state estimation, which forms the basis of the robot localisation approach we present in Chapter 3.

2.1.3 Recursive State Estimation

Since the robot pose $\mathbf{x}_t$ cannot be measured directly, it must instead be inferred from noisy measurements $\mathbf{z}_{1:t}$ and control inputs $\mathbf{u}_{1:t}$. As such, the robot localisation problem can be thought of as one that requires estimating the probability distribution over the robot's pose given a set of noisy inputs. More precisely, we denote this distribution as:

$$bel(\mathbf{x}_t) = p(\mathbf{x}_t | \mathbf{z}_{1:t}, \mathbf{u}_{1:t}) \tag{2.1}$$

We note, however, that the computational and memory complexity grows exponentially with the number of measurements and control inputs. To make maintaining the conditional distribution more manageable, it is often assumed that the system is Markovian (Figure 2-1), whereby the state $\mathbf{x}_t$ is independent of all past states given $\mathbf{x}_{t-1}$ and $\mathbf{u}_t$, and measurements are independent given the state.

While the belief is defined as a probability distribution over $\mathbf{x}_t$ that incorporates the most recent measurement and control input, the prediction, denoted by $\overline{bel}(\mathbf{x}_t)$, is the probability distribution before incorporating the most recent measurement $\mathbf{z}_t$ but just after executing the most recent control input $\mathbf{u}_t$:

$$\overline{bel}(\mathbf{x}_t) = p(\mathbf{x}_t | \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t}). \tag{2.2}$$

Calculating the belief distribution from the prediction distribution is possible and forms the basis of recursive state estimation. The Bayes filter is the most general algorithm for calculating belief distributions and serves as the foundation for probabilistically estimating the robot state given noisy measurement and control data. It is a recursive procedure that uses the Markov assumption presented in Figure 2-1 to compute the current belief $bel(\mathbf{x}_t)$ from the previous belief $bel(\mathbf{x}_{t-1})$. The procedure is outlined in Algorithm 1; we will derive the belief step (line 7) followed by the prediction step (line 5).

By Bayes’ rule, we can rewrite the belief distribution as:

Algorithm 1 Recursive update step of the Bayes filter.

1: function BayesFilter($bel(\mathbf{x}_{t-1})$, $\mathbf{u}_t$, $\mathbf{z}_t$)
2:     // For all pose estimates
3:     for all $\mathbf{x}_t$ do
4:         // Compute prediction
5:         $\overline{bel}(\mathbf{x}_t) = \int p(\mathbf{x}_t | \mathbf{u}_t, \mathbf{x}_{t-1})\, bel(\mathbf{x}_{t-1})\, d\mathbf{x}_{t-1}$
6:         // Measurement update that corrects from prediction to belief
7:         $bel(\mathbf{x}_t) = \eta\, p(\mathbf{z}_t | \mathbf{x}_t)\, \overline{bel}(\mathbf{x}_t)$
8:     return $bel(\mathbf{x}_t)$

$$\begin{aligned}
bel(\mathbf{x}_t) &= p(\mathbf{x}_t | \mathbf{z}_{1:t}, \mathbf{u}_{1:t}) &(2.3)\\
&= \frac{p(\mathbf{z}_t | \mathbf{x}_t, \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t})\, p(\mathbf{x}_t | \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t})}{p(\mathbf{z}_t | \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t})} &(2.4)\\
&= \eta\, p(\mathbf{z}_t | \mathbf{x}_t, \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t})\, p(\mathbf{x}_t | \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t}) &(2.5)\\
&= \eta\, p(\mathbf{z}_t | \mathbf{x}_t, \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t})\, \overline{bel}(\mathbf{x}_t) &(2.6)
\end{aligned}$$

where $\eta$ is a normalisation constant that we can parametrise out.

Due to conditional independence:

$$p(\mathbf{z}_t | \mathbf{x}_t, \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t}) = p(\mathbf{z}_t | \mathbf{x}_t). \tag{2.7}$$

From this, we derive line 7 of the Bayes Filter Algorithm:

$$bel(\mathbf{x}_t) = \eta\, p(\mathbf{z}_t | \mathbf{x}_t)\, \overline{bel}(\mathbf{x}_t). \tag{2.8}$$

By the theorem of total probability, we can rewrite the prediction step defined by Equation 2.2 as:

$$\overline{bel}(\mathbf{x}_t) = \int p(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t})\, p(\mathbf{x}_{t-1} | \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t})\, d\mathbf{x}_{t-1}. \tag{2.9}$$

We now derive line 5 of the Bayes filter algorithm by applying the Markov assumption and noting that we can omit $\mathbf{u}_t$ from the set of conditioning variables of $p(\mathbf{x}_{t-1} | \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t})$ for randomly chosen controls:

$$\overline{bel}(\mathbf{x}_t) = \int p(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{u}_t)\, p(\mathbf{x}_{t-1} | \mathbf{z}_{1:t-1}, \mathbf{u}_{1:t-1})\, d\mathbf{x}_{t-1}. \tag{2.10}$$
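As a concrete illustration of Algorithm 1, the following is a minimal histogram sketch over a discrete one-dimensional state space, assuming the transition and measurement models are supplied as arrays (the prediction integral then becomes a sum). The example transition matrix and likelihood are illustrative assumptions, not models used in this thesis.

```python
# A minimal histogram sketch of the Bayes filter update of Algorithm 1 over a
# discrete 1D state space; the integral in the prediction step becomes a sum.
import numpy as np

def bayes_filter_step(bel_prev, transition, likelihood):
    """One recursive update of the Bayes filter.

    bel_prev:    belief over states at time t-1, shape (N,)
    transition:  p(x_t | u_t, x_{t-1}) as an (N, N) matrix, rows sum to 1
    likelihood:  p(z_t | x_t) evaluated for the current measurement, shape (N,)
    """
    bel_bar = transition.T @ bel_prev          # prediction (line 5)
    bel = likelihood * bel_bar                 # measurement update (line 7)
    return bel / bel.sum()                     # eta normalises the posterior

# Example: 5 cells, the robot tends to move one cell right per control input.
N = 5
T = np.zeros((N, N))
for i in range(N):
    T[i, min(i + 1, N - 1)] += 0.8             # intended motion
    T[i, i] += 0.2                             # chance of staying put
bel = np.full(N, 1.0 / N)                      # uniform initial belief
z_lik = np.array([0.1, 0.1, 0.6, 0.1, 0.1])    # sensor says "probably cell 2"
print(bayes_filter_step(bel, T, z_lik))
```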

The Bayes filter, however, is intractable for most classes of problems — with the exception of linear Gaussian systems — due to the appearance of an integral, and can therefore only be implemented over a finite representation by means of approximate methods that try to compute $bel(\mathbf{x}_t)$ by estimating three key distributions: the initial belief $p(\mathbf{x}_0)$, the measurement probability $p(\mathbf{z}_t | \mathbf{x}_t)$ and the state transition probability $p(\mathbf{x}_t | \mathbf{x}_{t-1}, \mathbf{u}_t)$. The state transition probability distribution is typically estimated using a motion model such as dead reckoning, while the measurement probability distribution is estimated using a measurement model that tries to associate local sensory inputs with landmarks or a map in order to generate a measure of confidence for the observation. Many types of measurement models exist and often vary depending on the sensors used. For example, measurement models associated with visual sensors often involve feature matching, whereby features (such as corners or edges) from input images are matched with known features of the environment (represented by landmarks or a map), which then allows a probability to be estimated based on the closeness of the matchings. In Section 2.2.1, we discuss pipelines that use feature-matching measurement models as part of their localisation procedures between images from local camera sensors and a map of the environment.

Two of the most popular classes of recursive state estimation techniques are Gaussian and nonparametric filters, which differ in their representation of the probability density function (PDF) over the state $\mathbf{x}_t$. As their name implies, Gaussian filters represent belief through known models, namely multivariate normal distributions expressed by two sets of unknown parameters: a mean vector $\mu$ and a positive definite covariance matrix $\Gamma$, with PDF:

$$f(x) = |2\pi\Gamma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Gamma^{-1} (x - \mu)\right). \tag{2.11}$$
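As a small numerical check of Equation 2.11, the following sketch evaluates the density by hand for an assumed two-dimensional state and compares it against scipy.stats.multivariate_normal; the mean, covariance and query point are arbitrary illustrative values.

```python
# A small numerical check of Equation 2.11 for an assumed 2D state.
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, gamma):
    """Evaluate f(x) = |2*pi*Gamma|^(-1/2) exp(-1/2 (x-mu)^T Gamma^{-1} (x-mu))."""
    d = x - mu
    norm = np.linalg.det(2 * np.pi * gamma) ** -0.5
    return norm * np.exp(-0.5 * d @ np.linalg.solve(gamma, d))

mu = np.array([1.0, -2.0])
gamma = np.array([[2.0, 0.3], [0.3, 1.0]])     # positive definite covariance
x = np.array([0.5, -1.5])
print(gaussian_pdf(x, mu, gamma), multivariate_normal(mu, gamma).pdf(x))
```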

The most popular and widely used Gaussian filter is known as the Kalman filter

[52], which computes the PDF over the state $\mathbf{x}_t$ by its first and second moments, but is restricted to linear Gaussian systems in order to exactly estimate the posterior. The Extended Kalman filter (EKF) is an extension of the Kalman filter that can be used to approximate nonlinear systems by using a first-order Taylor expansion about the current estimate to linearise the system.

Nonparametric filters, such as particle filters, on the other hand, are not restricted by the functional form of the system, as they approximate the belief through a finite set of $N$ state samples drawn from the belief distribution; as $N$ is taken to infinity, the approximation error — defined over the statistics computed from the samples relative to the statistics of the original distribution — converges uniformly to zero. These state samples, called particles in the context of particle filters, are represented by $X_t = \{\mathbf{x}_t^{[i]}; i \in [1, N]\}$. The more densely concentrated the particles are in a particular state, the higher the posterior probability around that state. It is because of this that particle filters can be used for state estimation of non-linear, non-Gaussian systems. Particle filters are, however, computationally more expensive than Kalman filters, with computation time suspected to increase exponentially, instead of quadratically, with state space dimension. Adaptive versions do exist which can vary the particle set size dynamically, making particle filtering methods attractive, as they allow for adaptability to a variety of localisation problems and computational constraints. We revisit the MCL algorithm, which is based on particle filtering, later in Chapter 3 as it comprises the backbone of our localisation approach.
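The following is a minimal sketch of a single nonparametric (particle filter) update for an assumed one-dimensional state with Gaussian motion and measurement noise; it is intended only to illustrate the propagate-weight-resample cycle described above, not the MCL implementation presented later in this thesis.

```python
# A minimal sketch of one particle filter update for an assumed 1D state.
import numpy as np

def particle_filter_step(particles, u, z, rng, motion_std=0.1, meas_std=0.5):
    """Propagate N particles through the motion model, weight them by the
    measurement likelihood and resample in proportion to those weights."""
    # Prediction: sample from p(x_t | x_{t-1}, u_t).
    particles = particles + u + rng.normal(0.0, motion_std, size=particles.shape)
    # Correction: weight each particle by p(z_t | x_t).
    weights = np.exp(-0.5 * ((z - particles) / meas_std) ** 2)
    weights /= weights.sum()
    # Resampling: draw a new, equally weighted particle set.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

rng = np.random.default_rng(1)
particles = rng.uniform(-5.0, 5.0, size=500)   # broad initial belief
for u, z in [(1.0, 1.2), (1.0, 2.1), (1.0, 2.9)]:
    particles = particle_filter_step(particles, u, z, rng)
print(particles.mean(), particles.std())
```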

2.2 Vision Based Overhead Localisation

A variety of approaches exist that use satellite imagery within autonomous vehicle pipelines for purposes that range from planning to localisation. However, due to the computational complexity of several of these algorithms, not all are viable for online use. For example, Kaminsky et al. [53] proposed a system whereby they performed ground-to-overhead optimal global alignment using an objective function that matched 3D points — generated using structure from motion [54] techniques — to image edges and imposed free space constraints based on the visibility of points in each local image. This was done by aligning the 2D ground plane projection of the point cloud with the overhead image and using GPS data to further improve results. The approach was promising in its ability to accurately localise, but was constrained to offline use, in part due to its reliance on structure from motion techniques. Alternatively, Sofman et al. [55], in earlier research, demonstrated the ability to use overhead imagery for predicting traversal costs over large areas by learning to use scope-limited features. Although this does not perform direct localisation, it illustrates the breadth with which satellite imagery can be used within robotic navigation systems. This section highlights localisation pipelines that use satellite imagery as priors of the environment and attempt to use images from camera sensors on board a robot for localisation in urban environments. Section 2.2.1 discusses the most common category of approaches, which focus on matching features belonging to either building outlines or roads due to their similar observability in both local and satellite imagery. Finally, Section 2.2.2 discusses more recent deep-learning based works that use learned embeddings to represent local and global images in a common feature space, allowing for unified similarity measurements.

2.2.1 Feature Based Approaches

Earlier works in overhead localisation using an onboard camera sensor primarily focused on feature-based matching methods to either perform direct global localisation or act as measurement models that estimated the measurement distribution within

29 a recursive estimation framework. Features of interest were usually those considered distinct and observable in both local and global imagery.

Building Features

Some of the most distinct observable features in an urban environment come from orthogonal structures such as trees, poles, traffic signs and buildings. Many of these objects tend to be static, particularly buildings, and are thus ideal for use in localisation pipelines that employ satellite images, which may not always be up-to-date. McHenry et al. [56] presented a very early example of work that leveraged such observed features by using edge-detection-based heuristics for extracting roof-lines. The observed roof-line features were given range through wide baseline stereo and fed into a particle filter along with the overhead image for localisation. Although this research illustrated the possibility of local-global image localisation, the breadth of features, their manner of extraction and the use of the overhead image were very limited, having not made full use of the richness of available information. In a similar light, Leung et al. [57] also focused on highlighting building boundaries (walls) in aerial images. However, instead of employing stereo vision, they used vanishing point analysis to infer 3D information from 2D images to identify relevant wall features from local images. They then compared the orientation of building boundaries to determine the importance factor of particles in a particle filter. Although the accuracy of this system was shown to be similar to GPS, the authors noted the requirement for a considerable duration of motion — up to 150 s — in order for the particle cloud to converge towards the robot pose, effectively rendering the system impractical for many robot-critical tasks.

More recent work by Lansiedel et al. [58, 59] used sparse semantic data by extracting building outlines and road networks from OpenStreetMaps [60] to localise a robot in unknown urban environments. Local 3D point cloud data generated from a laser scan was first reduced to a noise-absorbing occupancy grid by using the rectangular cuboid approximation framework (RMAP) [61]. Then, a 2D topview consisting of line segments representing extracted orthogonal structures (buildings) was extracted by using the probabilistic Hough transform [62]. The Hough transform works by mapping each edge pixel to various points in a parameter space that represents all possible lines that could pass through the point. Sets of collinear points will intersect in the parameter space and the point of maximum intersection provides the parameters that describe the edge in the input image. An augmented version of the classical Chamfer Matching technique [63, 64] that minimises the sum over a distance transform was used for template matching between the locally generated topview and the semantic overhead image to generate matching scores of potentially viable poses. They restricted the number of poses considered as valid candidates for localisation by utilising information from the semantic overhead map to discard estimates located within buildings or more than 12 m from any road. Although a 3D laser scanner was used to generate local measurements of the environment, the approach could be generalised for use with a camera system that includes a depth estimation pipeline. Doing so would lead to sparser point clouds, but enable the use of semantic segmentation techniques to extract additional scene information allowing for more semantically consistent alignments with the overhead image. A benefit of the texture independence derived from laser scan use, however, is increased robustness to temporal changes such as season and lighting conditions. Since this method is a global localisation technique that relies on geometry alone and does not use odometry nor keep track of past robot state, it is susceptible to perceptual aliasing in cases where the environment contains multiple regions that are geometrically indistinguishable. While the authors state that their non-optimised single-threaded implementation is not suitable for real-time use onboard a robot, it could still be used to provide initial and periodic pose estimate distributions for an MCL pipeline, allowing for more informed re-insertions of particles.
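As an illustration of the line extraction step described above, the following sketch applies OpenCV's probabilistic Hough transform to a rasterised topview; the input file name and the edge and Hough thresholds are illustrative assumptions rather than the parameters used by Lansiedel et al.

```python
# A minimal sketch of extracting line segments (e.g. building walls) from a
# rasterised topview with the probabilistic Hough transform. The file name and
# thresholds below are illustrative assumptions.
import cv2
import numpy as np

grid = cv2.imread("topview.png", cv2.IMREAD_GRAYSCALE)   # occupied cells are bright
edges = cv2.Canny(grid, 50, 150)                          # edge pixels feed the transform
segments = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=30,
                           minLineLength=20, maxLineGap=5)
if segments is not None:
    for x1, y1, x2, y2 in segments[:, 0]:
        print(f"wall segment: ({x1},{y1}) -> ({x2},{y2})")
```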

Although building roof outlines have been the feature of choice in the approaches discussed thus far, Wang et al. [65] have shown that vertical edges can also be used. They introduced feature-based localisation between air and ground (FLAG), a framework that computes stable descriptorless features associated with vertical structure in the environment from an onboard stereo camera system and matches them to corresponding hand-annotated features in an aerial image. BRIEF descriptors [66] were used to match vertical edge pixels in corresponding stereo image pairs, and a sparse, filtered 3D point cloud consisting only of the matched pairs was generated using least squares triangulation techniques [67]. A second filtering procedure for outlier removal was executed on the point cloud by using density-based spatial clustering (DBSCAN) [68], which groups closely packed points together and rejects points in low density regions. They then collapsed the point cloud down to a 2D grid by eliminating the $Z$ "altitude" component of each point and detected descriptorless ground plane features by thresholding and flagging grid cells that contained greater than 50 collapsed points. These descriptorless features were then fed into a particle filter measurement model where data association with the overhead image was performed as a nearest neighbour search. FLAG was tested indoors on a real robot platform and on an outdoor MAV dataset, with results showing improved performance over dead reckoning and similar performance to state-of-the-art LiDAR-based SLAM when operating in regions with orthogonal structures. However, when no vertical features were detected, the system relied on odometry and performance degraded to dead reckoning, limiting its usefulness to city-like environments. Indeed, prolonged dead reckoned motion and the effects of false matches in data association due to the descriptorless nature of the features were the motivating factors for their use of multi-hypothesis data association instead of alternative maximum likelihood or single hypothesis methods like Kalman filters. One way of combating false matches and their effects in the data association step would be to constrain the problem further by incorporating semantic labellings during feature extraction, but this would be contingent on achieving reliable pixel-wise segmentations prior to the stereo matching step.
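The collapse-and-threshold step can be sketched in a few lines, assuming a filtered point cloud stored as an (N, 3) array; the cell size is an illustrative assumption, while the 50-point threshold follows the description above. This is only a schematic of the idea, not the FLAG implementation.

```python
# A minimal sketch of collapsing a point cloud to a 2D grid and flagging dense
# cells as descriptorless features. Cell size is an assumption.
import numpy as np

def ground_plane_features(points, cell_size=0.25, min_points=50):
    """Drop the Z component, bin points into a 2D grid and flag dense cells."""
    xy = points[:, :2]
    cells = np.floor(xy / cell_size).astype(int)              # grid index per point
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    return uniq[counts > min_points] * cell_size               # flagged cell origins

cloud = np.random.default_rng(2).normal(size=(10000, 3))       # stand-in point cloud
print(ground_plane_features(cloud).shape)
```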

Ground Plane Features

Since ground vehicles have historically been the most popular robotic platforms of choice, road features such as curb edges and lane markings have also been used for overhead localisation, allowing for measurement models that can operate in a larger subset of urban environments devoid of man-made structures. Pink [69] demonstrated a method that matched lane-marking features from local

32 and global images by posing the problem as a general point pattern matching problem, which they solved using iterative closest point (ICP) [70] and iterative reweighted least squares (IRLS) [71]. Since their platform of choice was a ground vehicle, they focused on detecting lane markings through the use of a Canny edge detector [72] on the lower half of the image plane; a method that is crude, prone to false positives and sensitive to lighting conditions and occlusions. GPS was used for coarse initial position and heading estimates at each iteration of the matching algorithm, which strongly influenced the correctness of the algorithm’s estimates, because ICP solutions to the matching problem are locally and not necessarily globally optimal.

Instead of posing the problem as a point matching problem, Noda et al. [73] used projective transformations (homography) to warp the local images to a ground plane topview representation that allowed feature matchings to be conducted directly with on-road features extracted from a pre-built feature map derived from the overhead image. To mitigate against variations in viewpoint, illumination conditions and other temporal differences between aerial and local imagery, Speeded Up Robust Features (SURF) [74] image descriptors were used for feature matching. To combat occlusions in local sensing, sequential matching was conducted between projected frames and then used for matching with the overhead feature map; however, occlusions in the overhead image were not addressed. Since this method was essentially posed as a global matching problem, disregarding odometry and pose history, its performance relied on the existence of clear, non-occluded, unique visual textures, which is not often the case in urban environments, where roads are often subject to dynamic changes and where occlusions and perceptual aliasing are likely to occur. Viswanathan et al. [75] presented a multi-hypothesis variant, whereby a database of descriptors was first computed offline from the satellite image over a uniformly sampled rectangular grid of locations. Onboard the vehicle, local images were warped to form topviews of the ground, from which a single image descriptor was computed for each topview. Comparisons between the topview descriptor and those in the database were made and the $L_2$ norm was used as a means of ranking matches, with matchings considered positive if they fell within 5 meters of the vehicle. A variety of descriptors were tested,

33 and they found that Scale-invariant Feature Transform (SIFT) [76] worked best. The best performing descriptor from the aerial-ground matching was then incorporated into a particle filter and demonstrated improvements in accuracy over GPS alone.
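The descriptor matching underlying these aerial-ground comparisons can be sketched with OpenCV (version 4.4 or later for the built-in SIFT); the image file names and the ratio-test threshold are illustrative assumptions, and this is not the exact pipeline of Viswanathan et al.

```python
# A minimal sketch of SIFT descriptor matching between a warped ground topview
# and a satellite tile. File names and the ratio threshold are assumptions.
import cv2

topview = cv2.imread("topview.png", cv2.IMREAD_GRAYSCALE)
aerial = cv2.imread("aerial_tile.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(topview, None)
kp2, des2 = sift.detectAndCompute(aerial, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe ratio test
print(f"{len(good)} putative correspondences")
```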

While homography can generate dense topviews from local images, it operates under the implicit assumption that the entire road region lies within a single plane, which is often not the case. Indeed, even the slightest appearance of orthogonal structure (such as a bush) can lead to distortions in the resulting topview. Senlet and Elgammal [38, 36, 39] attempted to overcome these problems by generating topviews from 3D point clouds, generated using stereo vision techniques. Generating usable topviews, however, requires sufficiently dense point clouds, which they accomplished by estimating disparity at four levels of resolution over the same image pair, using estimates from lower resolution images to fill holes in estimates from higher resolution images. The final topview used for localisation was then generated by taking a weighted average over topviews generated from successive frames. Greater weighting was placed on information from newer frames to mitigate the distortions of far objects in the earlier frames during the depth estimation process. Chamfer matching between the topview and a binary road map generated from satellite imagery was then used for particle reweighting within a particle filter scheme, which demonstrated real-time performance comparable to GPS and the ability to recover from odometry drift. Although the results were promising, their approach was limited to use on board ground vehicles under the assumption of operating on well defined, unoccluded road networks. We believe that both this and the previous two homography-based approaches reflect situations where semantic segmentation could benefit performance by allowing for the detection and extraction of objects not considered part of the road network.
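A chamfer-style matching score of the kind described above can be sketched using a distance transform, assuming binary masks for the local topview and the road map; the candidate offset stands in for a particle pose, and in practice the distance transform would be precomputed once per map.

```python
# A minimal sketch of a chamfer-style score between a binary local topview and
# a binary road map; lower scores indicate better alignment.
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(local_mask, map_mask, offset):
    """Mean distance from each occupied local pixel to the nearest map pixel."""
    dist_to_map = distance_transform_edt(~map_mask)   # distance to nearest road pixel
    ys, xs = np.nonzero(local_mask)
    ys, xs = ys + offset[0], xs + offset[1]
    inside = (ys >= 0) & (ys < map_mask.shape[0]) & (xs >= 0) & (xs < map_mask.shape[1])
    if not inside.any():
        return np.inf
    return dist_to_map[ys[inside], xs[inside]].mean()

map_mask = np.zeros((100, 100), dtype=bool); map_mask[50, :] = True     # a straight road
local_mask = np.zeros((100, 100), dtype=bool); local_mask[52, 40:60] = True
print(chamfer_score(local_mask, map_mask, (0, 0)),
      chamfer_score(local_mask, map_mask, (-2, 0)))
```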

In general, descriptor-based feature matching requires significant computation and has historically been slow; however, with improvements in feature matching algorithms [77, 78] and hardware, its application within localisation pipelines, such as those described thus far, has become more feasible for online robot use.

34 2.2.2 Learned Embedding Approaches

Deep learning techniques have also been employed to find local-to-global correspondences for localisation by exploiting the availability of large datasets containing georeferenced pairings between local ground-view and satellite images. Chu et al. [79] compiled a database of such pairings to learn a dictionary consisting of colour, edge and neural features, which they used for retrieval during testing. They performed localisation by estimating the co-occurrence probabilities between ground and satellite images using this feature dictionary, demonstrating observable improvements over typical feature matching methods. They also presented a ranking-loss based algorithm that learned location-discriminative feature projection matrices to better capture relationships between ground and satellite images, which resulted in further improvements in accuracy. However, the requirement for large amounts of data in compiling the database of pairings makes this approach infeasible for rapid online deployment and affects its ability to generalise to new environments.

Kim et al. [80] used a Siamese [81] multi-view neural model that learned location-discriminative embeddings, in which local images were matched with their corresponding georeferenced global image by learning a feature representation that maps local and global images to a feature space where matched local-global image pairs are close and unmatched pairs are far apart. The Siamese network was trained using matched and unmatched image pairs, each consisting of a ground and an aerial image. They used this learned function as the measurement model of a particle filter to overcome significant viewpoint and appearance variations between the images and maintain a distribution over the vehicle's pose. Similar to the approach by Chu et al. [79], they relied on geotagged ground-level images for training and matching, but were able to demonstrate additional generalisability to environments not seen in training. However, online robot use was deemed infeasible due to the Convolutional Neural Network (CNN) feature representation of each ground-satellite image pair, which required 55 ms to generate on an NVIDIA Titan X.
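The cross-view embedding idea can be sketched with a small two-tower (pseudo-Siamese) PyTorch model trained with a contrastive loss; the tower architecture, embedding size and margin are illustrative assumptions and do not reproduce the network of Kim et al.

```python
# A minimal two-tower sketch of learning a shared ground-aerial embedding space
# with a contrastive loss; architecture and margin are illustrative assumptions.
import torch
import torch.nn as nn

class Tower(nn.Module):
    def __init__(self, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, embed_dim))

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=1)   # unit-length embedding

ground_tower, aerial_tower = Tower(), Tower()

def contrastive_loss(g, a, same, margin=0.5):
    """Pull matched ground-aerial pairs together, push unmatched pairs apart."""
    d = (g - a).norm(dim=1)
    return (same * d**2 + (1 - same) * torch.clamp(margin - d, min=0)**2).mean()

ground = torch.randn(8, 3, 128, 128)                          # stand-in image batches
aerial = torch.randn(8, 3, 128, 128)
labels = torch.randint(0, 2, (8,)).float()                    # 1 = matched pair
loss = contrastive_loss(ground_tower(ground), aerial_tower(aerial), labels)
loss.backward()
print(float(loss))
```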

Unlike both of the methods discussed so far, which perform the matching as a single query over the local image, Tian et al. [82] perform multiple queries over the buildings in the local image to ensure better global consistency. Faster R-CNN [83] is used to segment buildings from the ground image in real time, while a Siamese network is used to find ground-aerial image building matchings. Unlike Kim et al. [80], however, Tian et al. do not employ a particle filter, instead solving a global localisation problem by performing a custom multiple-matching nearest neighbours search between buildings in the local image and reference buildings in the overhead images. This method is also not viable for real-time robot use due to the compounded computational overhead of the Siamese network, segmentation network and nearest neighbours algorithm. Additionally, the accuracy is too poor for precise robot localisation; however, it could be used to inform other GPS-denied pipelines with candidate pose distributions within a wider urban city environment.

2.3 Semantic Segmentation

The goal of semantic segmentation, in the context of images, is to label each pixel of an image with a value representing the semantic class that it belongs to. Semantic segmentation is useful in that it allows for dense symbolic reasoning at the pixel level, which image classification and object detection do not achieve. In this section, we briefly discuss semantic segmentation from a deep-learning perspective. We then review outdoor urban datasets used for training in Section 2.3.1.

A naive approach to semantic segmentation with CNNs is to solve the problem in a similar, but much more computationally expensive, way to that used to solve the image classification problem, whereby identically padded convolutional layers are stacked together to preserve dimensionality throughout the network and generate an output segmentation that matches the input dimensions. Unlike the image classification problem, which only requires coarser outputs representing the "what" in an image, semantic segmentation additionally requires the "where", and on a per-pixel basis. It is because of this that the identical padding between layers is needed, thus leading to networks with a larger number of parameters than corresponding classification networks, which typically apply striding factors to the convolutions or insert pooling layers to downsample the feature maps as we move deeper into the network. This downsampling results in a loss of information, but increases the receptive field and allows for greater context to be gained during the learning procedure, which helps in understanding differences such as roads and lane markings (where lane markings are contextually contained within roads).

As a result of the problems associated with maintaining full input dimensionality throughout the network in order to generate pixel-wise labels, the encoder-decoder network architecture was developed and currently stands as the most general deep learning architecture for semantic segmentation. This architecture works by using two networks: an encoder and a decoder. The encoder, usually a pre-trained classification network, increases the receptive field by reducing the spatial dimensions of the input image using pooling layers and strided convolutions, while the decoder projects the lower resolution features learnt by the encoder back into higher resolution pixel space, thus generating the desired dense pixel-wise classifications. Like the encoder, which uses pooling layers to achieve many-to-one mappings, the decoder can use unpooling layers to achieve one-to-many mappings — using predefined interpolation methods — at the expense of information loss and coarseness. However, the most popular upsampling technique in segmentation networks is the transposed convolution (commonly referred to as deconvolution), which uses learned parameters. Unlike the typical convolution layer, which achieves a many-to-one mapping by taking the dot product between a convolution kernel and the input, deconvolution achieves a one-to-many mapping by multiplying each value in the feature map with all the deconvolution kernel weights, which are then mapped into the higher dimensional feature map.
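The encoder-decoder structure described above can be sketched in a few lines of PyTorch: strided convolutions downsample the input while transposed convolutions (deconvolutions) project the features back to full resolution for per-pixel class scores. The layer sizes and class count below are illustrative assumptions, not any of the architectures discussed in this chapter.

```python
# A minimal encoder-decoder sketch: strided convolutions downsample, transposed
# convolutions upsample back to per-pixel class logits. Sizes are assumptions.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()
        self.encoder = nn.Sequential(                         # 1/4 resolution features
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(                          # back to full resolution
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))                  # per-pixel class logits

x = torch.randn(1, 3, 256, 256)
print(TinySegNet()(x).shape)                                  # -> (1, 8, 256, 256)
```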

Long et al. [84] presented the Fully Convolutional Network (FCN), which used an off-the-shelf image classification network such as AlexNet [85], GoogLeNet [86] or VGG16 [87] for the encoder, while using deconvolutional layers to form the decoder. The FCN pipeline is able to take inputs of any size because the network uses only convolutional and pooling layers and no fully connected layers, which require fixed input sizes. In reality, FCN does not discard the fully connected layers of the encoder, but instead treats them as larger convolutional layers, thus loosening the input dimensionality constraint. However, the FCN network struggles to recover semantic boundaries due to the strong downsampling factor associated with the encoder (up to a factor of 32). To combat this, they formed "skip connections" between earlier, finer layers of the network and later layers of the decoder to provide the details needed to reconstruct finer segmentation boundaries. SegNet [15] is another semantic segmentation architecture, based on FCN, that uses a 13-layer encoder and symmetric 13-layer decoder, but completely does away with the fully connected layers that FCN converted into convolutional layers. This allowed them to reduce the number of parameters in their network by almost a factor of 10, which, at the time, made it the state-of-the-art in lightweight, accurate segmentation. Additionally, unlike FCN, which copies over encoder features to the decoder, SegNet copies the max-pooling indices, allowing for additional boosts in speed. Paszke et al. [17] presented ENet, an even more lightweight architecture with approximately 18x fewer parameters that required 70x fewer FLOPs than SegNet to segment an input of size 640 × 360 × 3, while demonstrating comparable accuracies. This performance boost was mainly achieved by optimising the early stages of their network: performing early downsampling and heavily reducing the input size within the first two blocks of the network. This was claimed to be possible due to spatial redundancy in visual information, which they believed necessitated earlier extraction of good features to inform deeper layers of the network. Since they used an unbalanced encoder-decoder architecture with a much shallower decoder, they avoided excessive downsampling in the later stages of the encoder by using dilated convolutions [88], which allowed for an exponential increase in field of view without decreasing the resolution. This is achieved by applying convolutional kernels with gaps specified by a dilation rate. For example, a dilation rate of one gives a normal convolution kernel, while a rate of two leads to a one-pixel "padding gap" being added between and around each kernel value.
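The effect of a dilation rate can be seen directly in PyTorch: both layers below preserve the input resolution, but the dilated 3x3 kernel covers a 5x5 neighbourhood, enlarging the receptive field without downsampling. The channel counts are arbitrary illustrative values.

```python
# A small sketch of how dilation enlarges the receptive field at constant
# resolution: both layers keep the 64x64 input size, but the dilated 3x3
# kernel spans a 5x5 neighbourhood.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)
standard = nn.Conv2d(16, 16, kernel_size=3, padding=1)              # 3x3 footprint
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)   # 5x5 footprint
print(standard(x).shape, dilated(x).shape)                           # both (1, 16, 64, 64)
```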

While we have briefly introduced the encoder-decoder architecture for semantic segmentation and example networks, the field is rapidly growing and a number of new and improved networks and methods are being developed constantly. For a more thorough survey of semantic segmentation, we refer the reader to [89, 90, 24, 91].

(a) Raw Image (b) Ground Truth Segmented Image

Figure 2-2: An example of an image pair used for training a semantic segmentation network. The annotated image is presented as an RGB visualised image for easy viewing; however, during training an 8-bit greyscale image is used that allows for up to 255 semantic classes and a void class for unwanted categories. Images courtesy of Mapillary Vistas [92].

2.3.1 Datasets

With the advent of deep learning techniques for semantic segmentation, the need for datasets to train and evaluate the various network architectures has become paramount in exploiting their performance. Such datasets are composed of a set of image pairs, with each pair consisting of a raw camera image and a corresponding ground truth semantically labelled image. Unlike object detection datasets, which consist of sparse bounding box labellings, semantic segmentation datasets are more time-consuming and costly to compile due to the need for denser annotations at the pixel level that require manual verification. In this section, we briefly discuss the most significant publicly available semantic segmentation datasets for outdoor urban environments. Many of these datasets contain intersecting subsets of semantic classes, with some of the most common being: road, sky, building, pavement, vegetation, terrain, pedestrian and car.

The first publicly available urban driving dataset for semantic segmentation was the CamVid dataset [93], which consisted of 701 images manually annotated into 32 semantic classes. For several years, this was the dataset of choice; however, it quickly became clear that its limited size, variability and variety were insufficient for training networks that generalise across environments. The Cityscapes dataset [94] was a newer dataset similar to CamVid, but of significantly larger size. It consisted of 5,000 densely and 20,000 coarsely annotated images, with semantic labellings belonging to 30 classes. The dataset was derived from dashcam recordings on board vehicles operating in 50 cities across France, Germany and Switzerland, taken over several months to account for variations in season, albeit during clear daytime weather. Frames were manually selected to include a large number of dynamic objects, varying scene layouts and backgrounds. Although Cityscapes was an improvement over the CamVid dataset due to its size and higher resolution images, the scene variation was still insufficient for generalisable performance across domains.

Since generating real-world datasets is difficult, attempts have been made to generate synthetic datasets from simulated environments. Using simulated environments allows for the automatic generation of pixel-wise ground truth annotations and the ability to easily create variations in season, illumination, weather and texture. The most notable synthetic dataset is the Synthia dataset [95], consisting of 13-class densely annotated images from dashcam driving scenarios. This dataset consists of 13,400 images, with no two images captured within 10 meters of each other, to improve visual variability within the dataset. Typically, datasets such as Synthia are used to initially train networks before fine-tuning on smaller real-world datasets. This sort of procedure learns the domain of the synthetic dataset, but allows the statistics of the real domain to be considered during the second stage of training. With an increasing number of techniques [96, 97, 98, 99] focused on generating more photo-realistic synthetic images from simulated environments, it may one day be sufficient to train using simulated data while addressing the effects of domain shift.

The Mapillary Vistas Dataset [92] can be considered the first dataset aimed at generalisable network performance, consisting of 18,000 test and 2,000 validation densely annotated street-level image pairs with 66 semantic categories. The dataset was selected from a repository of approximately 140 million images, with a final 90:10 composition of images from road or sidewalk views (90%) and highways, rural areas and off-road views (10%). Images with large amounts of distortion were filtered out, in addition to images dominated by parts of the capturing vehicle or device. This dataset also incorporates a variety of seasons, weather, lighting conditions and geographic regions by including images from all six populated continents. Additionally, camera diversity was incorporated by accounting for different resolutions, focal lengths, camera models and aspect ratios.

The two newest datasets are the Berkeley Deep Drive dataset (BDD100K) [100] and the ApolloScape dataset [101]. The former consists of 10,000 densely annotated images sampled from dashcam video sequences capturing varying weather and lighting conditions from four regions within the United States. While the dataset is larger and more varied than Cityscapes, it remains dwarfed in comparison to the Mapillary dataset. The ApolloScape dataset, on the other hand, consists of 143,906 labelled images captured from four regions in China, but lacks variation in lighting conditions since all images were captured during the day. The dataset is broken down into three difficulty categories: easy, moderate and hard, where images are placed into each category based on a heuristic that uses the frequency of movable objects such as vehicles and pedestrians. Although the dashcam method of capture and scene variability of the ApolloScape dataset are similar to those of CamVid, Cityscapes and BDD100K, the sheer size of the dataset may prove promising in training better networks, as has been the case in the context of the object detection problem.

Despite datasets for semantic segmentation becoming larger and more common, issues still remain with regards to consistency. Since each dataset contains a different set of classes, differences in granularity can make using multiple datasets together more challenging.

Chapter 3

Approach

Despite the numerous overhead localisation algorithms presented in Section 2.2, none made direct use of scene semantics at the pixel level, with all but two focusing on a single semantic aspect and disregarding the richness of semantic information that visual sensors uniquely provide. Indeed, a motivating factor for such narrow use of semantics was the difference in observability between local and global images, which led more recent approaches to shift their focus towards learned embeddings that try to learn a common unified feature space in which data association between local and satellite imagery becomes possible.

This chapter presents a novel multi-hypothesis algorithm that enables the explicit use of multiple semantic classes at the pixel level by exploiting advances in depth estimation and semantic segmentation pipelines to allow for pixel-wise data association between local and overhead images within a unified 2D planar representation of the environment. Pose corrections to relative pose changes are estimated online without the use of a GPU, with throughput bottlenecked by the semantic segmentation network.

3.1 Algorithm Outline

This section briefly outlines the semantic overhead localisation pipeline, presented in Figure 3-1 and described more fully in the remainder of this chapter. We recall that the goal of this thesis is to estimate the pose of a moving robot using only relative pose change estimates, a possibly stale overhead image, and the image stream and corresponding depth estimates from an onboard camera system. The pipeline begins offline with the manual segmentation of the overhead image, although automated procedures for doing this could also be employed [31, 33, 34]. Then, the local image stream is segmented using a semantic segmentation deep learning framework (Section 3.3). The segmented local image, along with a corresponding depthmap, is used to generate a semantic 3D point cloud that is collapsed into a planar 2D semantic topview representation (Section 3.4). The semantic topview, together with the semantic overhead and the relative pose change information, is passed into an MCL framework that propagates particles using a standard motion model (Section 3.5.2). Particles are then reweighted by a measurement model that uses Binary Semantic Photometric Error (Section 3.5.3), a novel variant of the standard photometric error method in which sensitivity to lighting and viewing conditions no longer affects the photometric procedure directly, but is instead dealt with at segmentation time. The pose estimate and variance are then computed from a resampled particle set that represents the updated belief distribution (Section 3.5.4), before finally blurring the set to ensure a sufficient spread for continued multi-hypothesis tracking.

Figure 3-1: Semantic Overhead Localisation Pipeline: The overhead image is segmented offline prior to algorithm execution. Online, the local image stream is segmented and semantic topviews are generated using registered depthmaps to project into the camera frame and recover scene structure. Relative pose change input is used to inform the motion model of an MCL pipeline, while the segmented overhead and semantic topview are used in a measurement model to update the belief state of the system through a set of particles, from which a pose estimate is computed and returned.

3.2 Problem Formulation

In this section, we briefly present the notation (Section 3.2.1) used to mathematically describe our semantic overhead localisation approach (Section 3.2.2).

3.2.1 Notation

We represent images of size 푚 × 푛 pixels as functions defined over a 2D domain. The image at discrete time 푡 is represented by 퐼푡 : Ω → R+, where Ω ⊂ N0² represents the image pixel domain. The same definition is also used for the static semantic overhead image, denoted by 푂푉 퐻. The depthmap at discrete time 푡, registered to image 퐼푡, is defined by a scalar function that maps from 2D pixel space to a scalar depth value along the camera's optical axis, such that 퐷푡 : Ω → R+, where Ω ⊂ N0². We denote the intrinsic camera parameters by the matrix

\[
\mathbf{K} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3}, \tag{3.1}
\]

where (푓푥, 푓푦) and (푐푥, 푐푦) denote the (푥, 푦) focal lengths and optical centre in pixels, respectively. We let x̄ = [xᵀ 1]ᵀ represent the homogeneous coordinate of a vector x, such that at time 푡 a pixel u ∈ Ω in the 2D image frame is projected into the 3D camera frame by

\[
\mathbf{p} = \mathbf{K}^{-1} D_t(\mathbf{u}) \bar{\mathbf{u}}. \tag{3.2}
\]
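As a concrete illustration of Equation 3.2 (a Python/NumPy sketch; the intrinsic values are made-up assumptions rather than calibration results), a pixel with a valid depth can be back-projected into the 3D camera frame as follows.

import numpy as np

# Assumed example intrinsics (fx, fy, cx, cy are illustrative values only).
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])

def back_project(u, v, depth, K):
    """Project pixel (u, v) with depth along the optical axis into the camera frame."""
    pixel_h = np.array([u, v, 1.0])              # homogeneous pixel coordinate
    return depth * (np.linalg.inv(K) @ pixel_h)  # p = K^-1 * D(u) * u_bar

point = back_project(400, 300, 2.5, K)
print(point)  # 3D point (X, Y, Z) in the camera frame, with Z == 2.5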

We define the pose of a point p푓1 ∈ R³ in frame 푓1 with respect to 3D frame 푓2 by the transform

\[
\mathbf{T}^{f_2}_{f_1} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0} & 1 \end{bmatrix} \in SE(3), \tag{3.3}
\]

with rotation matrix R ∈ SO(3) and translation vector t ∈ R³, such that p푓1 can be transformed into frame 푓2 by $\bar{\mathbf{p}}_{f_2} = \mathbf{T}^{f_2}_{f_1} \bar{\mathbf{p}}_{f_1}$. It should also be noted that similar definitions can be used for the strictly 2D case (or the same definition if one were to augment all 2D points with a third dimension always set to zero). Finally, we define a semantic indicator function

\[
\mathbb{1}_S(a, b) = \begin{cases} 1 & \text{if } a = b \text{ and } a, b \in S \\ 0 & \text{otherwise,} \end{cases} \tag{3.4}
\]

which, given two semantic labels 푎, 푏 ∈ 푆, where 푆 ⊂ N0 is the set of possible semantic labels, outputs 1 if and only if the two labels match. This definition will be useful later in Section 3.5.3 when defining the binary semantic photometric error.

3.2.2 Problem Statement

Given a prior semantically segmented overhead image of the environment 푂푉 퐻, relative pose change u푡, camera image 퐼푡 and depthmap 퐷푡 from local sensors onboard a moving robot, we would like to estimate the pose x푡 of the robot (assumed equivalent to the camera pose) within the overhead coordinate frame, while also considering scene semantics. The depthmap is used to project the image into 3D space as a point cloud that represents structure in the environment. Our goal is therefore to estimate online:

∙ the semantic topology of the local environment from 퐼푡 and 퐷푡;

∙ the robot pose x푡 within the overhead frame.

Our primary contribution will be to estimate the robot pose over a distribution that incorporates both structural and semantic consistency with the environment as represented by 푂푉 퐻. Since the body of research in this thesis is not focused on depth estimation, and thanks to a variety of software and hardware-based depth estimation pipelines, we assume access to a black box depth estimator of choice, such as [102, 103].

3.3 Semantic Segmentation: Training

In this section, we briefly discuss the training of the ENet [17] segmentation architecture presented in Section 2.3, which we use to generate real-time pixel-wise semantics on images derived from a robot's onboard camera system. The ENet network architecture was selected for semantic segmentation due to its balance between computational overhead and classification accuracy. Additionally, an optimised Caffe [104] implementation of the network that merges batch normalisation and dropout layers into convolutional kernels was already accessible [105], allowing for immediate use alongside other Caffe-based inference pipelines, such as the MobileNetSSD [106] object detector, that are typically found within an autonomy stack. It should be noted that initial attempts were made to use the SegNet [15] architecture; however, it was quickly found that CPU-only use onboard a robot would not be feasible due to its number of parameters and floating point operations.

Since the objective of our measurement model is to perform semantic data association between local and global imagery, the overhead images also require segmentation, which we perform manually, because testing (Chapter 4) will be confined to a single environment in both simulation and physical flight.

Dataset

The first step towards generating a deployable version of the ENet architecture consists of training the network on a dataset large and varied enough to enable segmentation performance that generalises across a variety of unseen urban environments. We use a modified version of the Mapillary Vistas dataset [92], described in Section 2.3.1, to train our network. First, we merge the test and validation sets, consisting of 18,000 and 2,000 densely annotated street-level image pairs respectively, in order to maximise the size of our training set. Next, we merge similar semantic classes together and assign unwanted classes to the undefined set. This enables us to increase the pixel-wise training data available for classes of interest at the expense of increased coarseness. Table 3.1 illustrates the remapping used to generate the ground truth greyscale label images for the final training set, with classes remapped to void omitted for convenience. A total of 16 semantic classes were formed, with their selection intended to capture some of the most common scene semantics in urban street environments. Although the Mapillary Vistas dataset is dwarfed in size when compared to the ApolloScape [101] dataset, we chose it for training due to its rich variation in viewing, lighting, environmental and camera conditions, with almost no temporal consistency between images. Initial training conducted using the Cityscapes [94] training set showed strong qualitative performance on its corresponding test set, but performed extremely poorly when exposed to environments not seen during training. In fact, even crude hybrid training schemes that involved training with the Mapillary dataset and then fine-tuning with the Cityscapes dataset, or vice versa, proved ineffective. We present segmentation results from our trained network in Section 4.1.
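The remapping itself can be implemented as a simple lookup applied to each 8-bit greyscale label image. The sketch below (Python/NumPy) builds a 256-entry lookup table that sends every unmapped label to the void value 255; the original label IDs shown are illustrative placeholders, with the full mapping being the one summarised in Table 3.1.

import numpy as np

VOID = 255

# Partial example mapping from original dataset label IDs to the merged IDs
# of Table 3.1; the original IDs used here are illustrative only.
MERGE_MAP = {
    13: 0,   # e.g. an original road-type label     -> Road
    7: 0,    # e.g. an original crosswalk label     -> Road
    15: 1,   # e.g. an original sidewalk label      -> Sidewalk
    17: 2,   # e.g. an original building label      -> Building
}

def build_lookup(merge_map, void=VOID):
    """Build a 256-entry lookup table; unmapped labels become void."""
    lut = np.full(256, void, dtype=np.uint8)
    for original_id, merged_id in merge_map.items():
        lut[original_id] = merged_id
    return lut

def remap_labels(label_image, lut):
    """Apply the lookup table to an 8-bit greyscale label image."""
    return lut[label_image]

lut = build_lookup(MERGE_MAP)
labels = np.array([[13, 7], [15, 200]], dtype=np.uint8)
print(remap_labels(labels, lut))  # [[0 0] [1 255]]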

Training Setup and Parameters

The ENet network was trained in two stages, with the input dimensions to the network constrained to 640 × 480 × 3 pixels. In the first stage, the encoder was trained for 600,000 iterations with a base learning rate of 0.005 and a multi-step learning policy with step values of 50,000 and 450,000, dictating the iterations at which to reduce the learning rate by a factor of 10. A weight decay of 0.0002, a batch size of 4 and the Adam solver [107] for gradient descent optimisation were used, with a momentum value of 0.999.

For the second stage, we used the output weights from stage one as pretrained weights for training the entire encoder-decoder architecture. The network was trained for 990,000 iterations using mostly the same parameters as stage one, but with the step values changed to 670,000, 850,000 and 950,000. A smaller batch size of 2 was used during this stage, since the memory required for training the entire architecture was roughly double that needed for the encoder alone. Since the batch normalisation layers in ENet shift the input feature maps according to their mean and variance statistics for each mini-batch during training, we computed these statistics over the entire dataset and shifted accordingly, generating an updated set of weights for the network. Finally, the batch normalisation and dropout layers were merged into the convolutional kernels, as done in the original ENet paper, to reduce total inference time. All training was conducted using an NVIDIA Titan X GPU with 11GB of DDR5 RAM, an Intel i7-7700K 4.20GHz quad-core processor and 32GB of DDR4 system RAM. Total training time amounted to approximately nine days.

Mapillary Vistas Semantic Class Remapping

Final Class      Label ID   Merged Classes
Road             0          Road, Bike Lane, Crosswalk, Parking, Rail Track, Service Lane, Manhole, Pothole
Sidewalk         1          Sidewalk, Pedestrian Area
Building         2          Building, Wall
Fence            3          Fence
Pole             4          Pole, Utility Pole
Traffic Light    5          Traffic Light
Traffic Sign     6          Traffic Sign (Back), Traffic Sign (Front)
Vegetation       7          Vegetation
Terrain          8          Terrain
Sky              9          Sky
Pedestrian       10         Person, Bicyclist, Motorcyclist, Other Rider
Car              11         Car
Railing          12         Guard Rail, Barrier
Curb             13         Curb, Curb Cut
Lane Markings    14         Lane Markings (General), Lane Markings (Crosswalk)
Street Light     15         Street Light
Void             255        33 remaining classes

Table 3.1: The final set of semantic classes chosen for training were based on those most expected to be present in a typical urban road environment. Similar semantic classes were merged to form larger "super classes" with more pixel-wise data for training. This was done to try and improve final segmentation accuracy at the expense of coarser classifications.

3.4 Semantic Topview Generation

Visual robot localisation from an overhead image requires a means by which data from the local camera system can be associated with the overhead image. As seen in Section 2.2, most pipelines focus on a single semantic class that displays similar observable characteristics in both local and global imagery, for either feature or geometric matching. In this section, we describe a method used to generate a 2D ground plane representation of the locally observable environment that will allow for direct pixel-wise association with an overhead image, shown later in Section 3.5.3.

Figure 3-2: Semantic Topview Generation Pipeline: The input image is first segmented and then projected into the camera frame using a corresponding registered depthmap to form a 3D point cloud. The point cloud is then collapsed onto the ground plane and a voting mechanism over a discretisation of the plane is conducted to generate a final semantic topview.

The procedure takes as input an image 퐼푡 and corresponding depthmap 퐷푡 at time 푡. It begins by generating a 3D semantic point cloud P푠푒푚,푡: 퐼푡 is fed to the semantic segmentation network trained in Section 3.3 to produce a greyscale label image 퐼푠푒푚,푡, and Equation 3.2 is then applied to project all pixels that contain depth information into the 3D camera frame. For the remainder of this algorithm, P푠푒푚,푡 is treated as a collapsed ground plane 2D point cloud by ignoring the 푌푐푎푚푒푟푎 component of all its points. This is considered equivalent to projecting all points onto the (푋푐푎푚푒푟푎, 푍푐푎푚푒푟푎) plane of the camera frame, thus generating a 2D topological representation of the locally observable 3D environment.

A semantic topview 푇푉푠푒푚,푡 at time 푡 is defined as a discretised 2D grid consisting of cells, each representing an equal subset of the space, with resolution equal to that of the overhead image. Each cell is assigned a value corresponding to the semantic class that dominates the space. For the purposes of efficient computation, we implement the topview as a set of (퐾 + 1) 2D point clouds such that 푇푉푠푒푚,푡 = {푇푉푠푒푚,푡[0], . . . , 푇푉푠푒푚,푡[퐾]}, where the first 퐾 point clouds represent class-specific information over the 퐾 possible semantic classes and the final point cloud represents their union. Unlike a contiguous buffer, point clouds allow for the omission of void cells, while their separation into class-specific containers means that only a single pass over the entire topview is required during the data association step set out in Section 3.5.3. Additionally, rigid transformations can be applied more quickly and easily through the use of the Point Cloud Library (PCL) [108], an important consideration for a procedure that may be applied thousands of times at each iteration of our localisation algorithm (Section 3.5.3).

To determine which semantic class dominates a cell in the topview, we begin by iterating through P푠푒푚,푡 and generating a histogram of the number of points of each semantic class that fall within each cell. Then, for each cell, a point with semantic label 푖 ∈ [0, 퐾 − 1] is added to the semantic topview if, and only if, class 푖 received the largest vote count among all semantic classes for the 3D points that fell into that cell and that count was at least equal to a predefined threshold 훽. If for a given cell no semantic class has a count of at least 훽, the cell is assumed to be undefined and no point corresponding to it is added to the semantic topview. For this thesis, we set 훽 = 1 in simulation and 훽 = 5 for real flight data. A higher 훽 value is used for physical flight data to compensate for noisier depth estimation that does not capture the topology of the local environment as well as in simulation.

For the remainder of this chapter, we simplify the definition of a semantic topview to that of an image, as presented in Section 3.2.1. A topview is thought of as an image of size 푚 × 푛 pixels and is represented by a scalar function 푇푉푠푒푚,푡 : Ω → R+, where Ω ⊂ R² represents the corresponding image pixel domain.
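The voting procedure can be sketched as follows (Python/NumPy; the cell size, class count and threshold are illustrative, and the thesis implementation operates on PCL point clouds rather than Python dictionaries).

import numpy as np

def semantic_topview(points_xz, labels, num_classes, cell_size, beta=1):
    """Collapse labelled points onto the ground plane and vote per cell.

    points_xz : (N, 2) array of (X_camera, Z_camera) coordinates.
    labels    : (N,) array of semantic class indices in [0, num_classes - 1].
    Returns a dict mapping (cell_x, cell_z) -> winning class, keeping only
    cells whose winning class received at least `beta` votes.
    """
    cells = np.floor(points_xz / cell_size).astype(int)
    votes = {}
    for cell, label in zip(map(tuple, cells), labels):
        counts = votes.setdefault(cell, np.zeros(num_classes, dtype=int))
        counts[label] += 1
    topview = {}
    for cell, counts in votes.items():
        winner = int(np.argmax(counts))
        if counts[winner] >= beta:
            topview[cell] = winner
    return topview

pts = np.array([[0.1, 2.0], [0.2, 2.1], [0.9, 2.0]])
print(semantic_topview(pts, np.array([3, 3, 1]), num_classes=5, cell_size=0.5))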

3.5 Monte Carlo Localisation

The underlying goal of our approach is to estimate the belief distribution over the robot's pose. Recalling the definitions of belief (Equation 2.1) and prediction (Equation 2.2), in Section 2.1.3 we demonstrated how to calculate belief from prediction through an algorithm (Algorithm 1) called the Bayes filter, and briefly described parametric and nonparametric methods to estimate the belief distribution through three key distributions: the initial belief 푝(x0), the measurement probability 푝(z푡|x푡) and the state transition probability 푝(x푡|x푡−1, u푡).

In this section, we present an MCL-based localisation framework that maintains a set of particles X푡 = {x푡[푖]; 푖 ∈ [1, 푁]} to estimate the belief distribution at time 푡. Algorithm 2 outlines the high-level structure of our MCL approach, which is used to estimate the robot pose by applying the following three steps recursively: prediction, correction and resampling. An initial belief distribution 푝(x0) must also be defined, and we briefly discuss this in Section 3.5.1. Prediction involves propagating the particle set at time 푡 − 1 to generate a new particle set at time 푡 using the newest control input u푡, according to a motion model (Section 3.5.2) that models the state transition probability 푝(x푡|x푡−1, u푡). During the correction phase, the weighting of each particle in the set {(x푡[푖], 푤푡[푖]); 푖 = 1 to 푁} is readjusted by considering the newest measurement z푡, according to a measurement model (Section 3.5.3) that estimates the measurement probability 푝(z푡|x푡, 푀푎푝). Finally, particles from the updated particle set are resampled with replacement according to a discrete distribution defined by their weights, to form a new particle set of equal size that approximates 푏푒푙(x푡).

In the remainder of this chapter, we present our approach to the prediction (Section 3.5.2) and correction (Section 3.5.3) procedures by defining our motion and measurement models, respectively. We also briefly explain how our pose estimate x푡 and variance measures at time 푡 are calculated from the output particle set X푡 (Section 3.5.4).

3.5.1 Initialisation

Since we are interested in providing pose estimates on top of an existing state estimation framework that provides us with relative pose change inputs, we initialise the particles about some prior pose estimate x0 of the robot in the overhead coordinate frame. The particles are distributed using a multivariate Gaussian with the PDF defined by Equation 2.11 and parameters 휇x0 and Γx0, which we set in Chapter 4. If, however, the prior pose x0 is unknown, this step can be replaced by initialising particles uniformly across the overhead frame, which is the typical approach in global MCL frameworks.
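As a small illustration (a Python/NumPy sketch, not the thesis code), the initial particle set can be drawn from the multivariate Gaussian prior; the covariance shown matches the diagonal values used later in Chapter 4.

import numpy as np

def initialise_particles(mu_x0, cov_x0, num_particles=10000, seed=0):
    """Sample the initial particle set about a prior pose (x, y, theta)."""
    rng = np.random.default_rng(seed)
    particles = rng.multivariate_normal(mu_x0, cov_x0, size=num_particles)
    weights = np.full(num_particles, 1.0 / num_particles)  # uniform initial weights
    return particles, weights

# Example with the prior centred at the origin and a diagonal covariance.
particles, weights = initialise_particles(
    mu_x0=np.zeros(3), cov_x0=np.diag([5.0, 5.0, 0.25]))
print(particles.shape)  # (10000, 3)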

Algorithm 2 Monte Carlo Localisation Algorithm

1: function ParticleFilter(X푡−1, u푡, z푡, 푀푎푝)
2:     // Initialise particle sets
3:     X̄푡 = ∅
4:     X푡 = ∅
5:     for 푖 = 1 to 푁 do
6:         // 1) Prediction
7:         x푡[푖] = SampleMotionModelOdometry(x푡−1[푖], u푡)
8:         // 2.1) Correction (Part 1): Binary Semantic Photometric Error
9:         푤푡[푖] = MeasurementModel(x푡[푖], z푡, 푀푎푝)
10:        X̄푡 = X̄푡 + ⟨x푡[푖], 푤푡[푖]⟩
11:    // 2.2) Correction (Part 2): Final weights calculation
12:    휂푡 = 훾0 · max푖∈[1,푁]({푤푡[푖]})
13:    for 푖 = 1 to 푁 do
14:        푤푡[푖] = 1 / exp(푤푡[푖] / 휂푡)
15:    // 3) Resampling
16:    for 푖 = 1 to 푁 do
17:        draw 푗 with replacement with probability ∝ 푤푡[푗]
18:        X푡 = X푡 + x푡[푗]
19:    return X푡

The segmented overhead image 푂푉 퐻 is loaded into memory and stored as a set of (퐾 + 1) point clouds 푂푉 퐻 = {OVH[0], ..., OVH[퐾]}: one for each of the 퐾 possible semantic classes and a final one for their union.

3.5.2 Prediction: Motion Model

Although our framework is designed to generalise to MAV localisation, it operates within a simplified 2D planar coordinate system represented by SE(2). As a result, we employ the standard motion model described by Thrun et al. [10], with the noise distribution for each of the three state variables modelled as an independent Gaussian and parameters chosen empirically based on algorithm runtime and the expected maximum velocity of the robot. We set these values in Chapter 4 when discussing the experimental setup and results. Modelling robot motion as shown in Figure 3-3 allows us to ignore the intricacies of robot dynamics within a given time interval and simplify the problem into three piecewise components to which noise can be added. This allows us to propagate particles forward in time to estimate the state transition distribution 푝(x푡|x푡−1, u푡).

Figure 3-3: The motion of the robot between two timesteps 푡 and 푡′, 푡 > 푡′, is approximated by three noisy sequential actions: rotation 훿푟표푡1, translation 훿푡푟푎푛푠 and finally rotation 훿푟표푡2, computed using the relative pose change inputs.

3.5.3 Correction: Measurement Model for Particle Weighting

Now that the relative pose change u푡 has been used to propagate the particle set forward in time, we need to use the measurement z푡 to determine how well each particle estimates the robot's pose with respect to the overhead. In this section, we use the semantic topview 푇푉푠푒푚,푡 generated in Section 3.4 as our measurement input and perform data association with the overhead at the pixel level, to determine particle scores that are normalised, fed into a logistic function and renormalised to generate final particle weightings.

Algorithm 3 Sampling Based Motion Model using Relative Pose Changes

1: function SampleMotionModelRPC(x푡−1, u푡)
2:     // x푡−1 = [푥 푦 휃]ᵀ
3:     // u푡 = [x̄푡−1, x̄푡] = [[푥, 푦, 휃]ᵀ, [푥′, 푦′, 휃′]ᵀ]
4:
5:     // Break down the relative pose change into the three motion components
6:     훿푟표푡1 = atan2(푦′ − 푦, 푥′ − 푥)
7:     훿푡푟푎푛푠 = sqrt((푥′ − 푥)² + (푦′ − 푦)²)
8:     훿푟표푡2 = 휃′ − 휃 − 훿푟표푡1
9:
10:    // Add noise sampled from independent normal distributions to each of the three motion components
11:    훿̂푟표푡1 = 훿푟표푡1 − sample(휇푟표푡1, 휎²푟표푡1)
12:    훿̂푡푟푎푛푠 = 훿푡푟푎푛푠 − sample(휇푡푟푎푛푠, 휎²푡푟푎푛푠)
13:    훿̂푟표푡2 = 훿푟표푡2 − sample(휇푟표푡2, 휎²푟표푡2)
14:
15:    // Propagate x푡−1 forward according to the noisy motion
16:    푥′ = 푥 + 훿̂푡푟푎푛푠 · cos(휃 + 훿̂푟표푡1)
17:    푦′ = 푦 + 훿̂푡푟푎푛푠 · sin(휃 + 훿̂푟표푡1)
18:    휃′ = 휃 + 훿̂푟표푡1 + 훿̂푟표푡2
19:
20:    return x푡 = [푥′ 푦′ 휃′]ᵀ

Binary Semantic Photometric Error

Since 푇푉푠푒푚,푡 exists within a similar 2D planar representation as the overhead, a rigid transformation $\mathbf{T}^{\mathbf{x}_t^{[i]}}_{f_{topview}}$ between the topview frame $f_{topview}$ and each particle $\mathbf{x}_t^{[i]} \in \mathcal{X}_t$ can be computed and used to transform the topview so that its origin is centred at, and relative to, $\mathbf{x}_t^{[i]}$. From this, direct pixel-wise comparisons between the transformed topview and the overhead image can now be made. We will now present "binary semantic photometric error", a measure of how well the semantic structure of the topview matches that of the corresponding region of the overhead when positioned at $\mathbf{x}_t^{[i]}$. This error will then be used to determine particle weights using a logistic function.

Recalling the definition of the semantic indicator function $\mathbb{1}_S$ from Equation 3.4, we define the binary semantic photometric error of a particle $\mathbf{x}_t^{[i]} \in \mathcal{X}_t$ by

\[
w_t^{[i]} = \sum_{k=0}^{\|S\|-1} \; \sum_{\mathbf{p} \in TV^{k}_{sem,t}} \alpha_k \cdot \mathbb{1}_S\!\left( TV_{sem,t}\!\left(\mathbf{T}^{\mathbf{x}_t^{[i]}}_{f_{topview}} \bar{\mathbf{p}}\right),\; OVH\!\left(\mathbf{T}^{\mathbf{x}_t^{[i]}}_{f_{topview}} \bar{\mathbf{p}}\right) \right), \tag{3.5}
\]

where 훼푘 is an importance factor for semantic class 푘, which we define based on the ratio with which the class occurs in the overhead map

\[
\alpha_k = \frac{\|OVH^{k}\|}{\|OVH^{\|S\|}\|}. \tag{3.6}
\]

This definition of 훼푘 allows semantic classes that occur less frequently in the environment to provide stronger signals to the particle reweighting scheme. Typical classes that fall into this category are lane markings, curbs and building outlines.
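For illustration, a grid-based version of Equation 3.5 could look like the following sketch (Python/NumPy). It assumes, as a simplification, that both the transformed topview and the overhead are given as dense 2D label arrays over the same grid, with void cells marked as 255; the thesis implementation instead iterates over class-specific point clouds.

import numpy as np

VOID = 255

def binary_semantic_photometric_error(topview, overhead, alphas, void=VOID):
    """Weighted count of cells whose semantic labels match (Equation 3.5).

    topview, overhead : 2D integer label arrays defined over the same grid,
                        with the topview already transformed into the
                        overhead frame for a given particle.
    alphas            : per-class importance factors (Equation 3.6).
    """
    valid = topview != void                  # only cells observed locally
    matches = valid & (topview == overhead)  # indicator 1_S per cell
    error = 0.0
    for k, alpha_k in enumerate(alphas):
        error += alpha_k * np.count_nonzero(matches & (topview == k))
    return error

tv = np.array([[0, 1, VOID], [2, 2, 0]])
ovh = np.array([[0, 1, 1], [2, 0, 0]])
alphas = np.array([0.5, 0.2, 0.3])
print(binary_semantic_photometric_error(tv, ovh, alphas))  # 1.5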

Reweighting Normalisation

Once the binary semantic photometric error has been calculated for all particles, it is converted into a weighting factor by taking the inverse exponential of its scaled normalised value. This is shown on lines 12 to 14 of Algorithm 2. The normalisation is done in order to generate a relative ranking of "goodness", while the scaling by 훾0 is needed to increase the sensitivity of the logistic function and ensure sufficient particle cloud collapse. We set 훾0 = 0.01; however, this value may be varied depending on the desired collapse sensitivity. Once all particle weights are computed, a final normalisation procedure is executed to ensure the sum over all weights is equal to 1, thus allowing us to update the belief state by resampling particles with replacement according to a discrete distribution defined by the weights.
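Lines 12 to 14 of Algorithm 2, together with the final normalisation, can be sketched as follows (Python/NumPy, with 훾0 = 0.01 as stated above).

import numpy as np

def particle_weights(errors, gamma0=0.01):
    """Convert raw per-particle scores into normalised resampling weights."""
    eta = gamma0 * np.max(errors)          # scaled normalisation factor (line 12)
    weights = 1.0 / np.exp(errors / eta)   # inverse exponential (line 14)
    return weights / np.sum(weights)       # final normalisation to sum to 1

scores = np.array([120.0, 80.0, 30.0])
print(particle_weights(scores))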

3.5.4 Pose Estimate Calculations

Once the belief distribution has been updated by resampling particles with repetition based on a discrete distribution defined by the particle weights, statistics over the distribution can be computed to provide a pose estimate and some measure of "quality". We define the pose estimate at time 푡 by x푡 = [휇푥,푡 휇푦,푡 휇휃,푡]ᵀ, which is characterised as the mean pose over the particle set.

We compute 휇푥,푡 and the corresponding variance 휎푥,푡 (and similarly for 푦) by the standard definition for sample mean and variance

\[
\mu_{x,t} = \frac{1}{N} \sum_{i=1}^{N} x_t^{[i]} \tag{3.7}
\]
\[
\sigma_{x,t} = \frac{1}{N} \sum_{i=1}^{N} \left( x_t^{[i]} - \mu_{x,t} \right)^2. \tag{3.8}
\]

Since angles are circular values, standard arithmetic methods for computing the mean and variance cannot be used. Instead, we compute these quantities by employing circular statistical methods [109]. Wrap-around is dealt with by converting from polar coordinates to Cartesian coordinates in the complex plane, where all angles are converted to corresponding points on the unit circle. The mean is then calculated by computing the arithmetic mean over the points on the unit circle and converting back to polar coordinates to extract the resulting angle. The variance, on the other hand, is computed under the assumption that the 휃 distribution of the particle set is approximated by a wrapped normal distribution around the unit circle [110].

\[
S_{\theta,t} = \frac{1}{N} \sum_{i=1}^{N} \sin \theta_t^{[i]} \tag{3.9}
\]
\[
C_{\theta,t} = \frac{1}{N} \sum_{i=1}^{N} \cos \theta_t^{[i]} \tag{3.10}
\]
\[
R_{\theta,t} = \sqrt{S_{\theta,t}^2 + C_{\theta,t}^2} \tag{3.11}
\]
\[
\mu_{\theta,t} = \mathrm{atan2}(S_{\theta,t}, C_{\theta,t}) \tag{3.12}
\]
\[
\sigma_{\theta,t} = -2 \log(R_{\theta,t}) \tag{3.13}
\]
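A direct transcription of Equations 3.9 to 3.13 (a Python/NumPy sketch with example headings chosen near the ±π wrap-around):

import numpy as np

def circular_mean_and_variance(thetas):
    """Circular mean and wrapped-normal variance estimate of particle headings."""
    s = np.mean(np.sin(thetas))        # S_theta (3.9)
    c = np.mean(np.cos(thetas))        # C_theta (3.10)
    r = np.hypot(s, c)                 # R_theta (3.11)
    mean = np.arctan2(s, c)            # mu_theta (3.12)
    variance = -2.0 * np.log(r)        # sigma_theta (3.13)
    return mean, variance

headings = np.array([3.1, -3.1, 3.0])  # angles near the +/- pi wrap-around
print(circular_mean_and_variance(headings))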

3.6 Summary

This chapter described a semantic overhead localisation algorithm in detail. The algorithm maintains a distribution over the robot pose composed of a set of particles in a particle filter. Given a live image stream and depthmaps from an off-the-shelf black box depth estimator of choice, incoming images are fed into a semantic segmentation deep learning network to generate pixel-wise segmentations, which are then used with the corresponding depthmaps to generate 3D semantic point clouds. Each point cloud is then collapsed and used to generate a semantic topview that represents the local sensory perception of the structure of the environment in the same 2D planar representation as the overhead.

Given odometry inputs, particles are propagated forward by a motion model that models noise using independent zero-mean Gaussians. The semantic topview is then transformed for each particle relative to its pose in the overhead coordinate frame. The Binary Semantic Photometric Error is calculated between the topview and the semantic overhead, with a final normalisation using the maximum error over the particle set. The particles are then reweighted by taking the inverse exponential of their respective normalised errors. The mean and variance over the pose parameters of the particle distribution are then computed and returned as the resulting pose estimate and quality indicator, respectively. Finally, a multivariate independent Gaussian blur is applied to the particle distribution to offset the effects of strong particle cloud collapse on the ability to track multiple hypotheses.

This approach allows for semantically consistent pose estimates without the use of traditional feature matching algorithms. A quantitative and qualitative evaluation of the semantic overhead localisation pipeline, in simulation and on real-world MAV flight data respectively, is presented in Chapter 4.

Chapter 4

Results

In this chapter, the performances of the trained segmentation network and semantic localisation pipeline presented in Chapter 3 are evaluated. Section 4.1 evaluates the segmentation network on five different urban driving datasets. Section 4.2 presents quantitative analysis of the full localisation pipeline in a photo-realistic simulation of a real-world environment. Finally, Section 4.3 presents preliminary qualitative analysis using data recorded onboard a MAV during real-world physical flight.

4.1 Semantic Segmentation

The trained network is capable of operating in CPU mode or in GPU mode using a CUDA-enabled GPU. The average prediction time in both modes was assessed by executing the network 100 times in a loop over a single RGB image and taking the average. We defined prediction time as the time taken to load an image, resize it to the VGA (640 × 480 × 3) resolution input size of the network, execute the network over it and finally extract the final greyscale prediction image. The throughput was then calculated by taking the inverse of this average.

In GPU mode, a throughput of 29.7Hz was measured using an NVIDIA Titan X GPU with 11GB of DDR5 dedicated RAM, a 4.2GHz Intel i7-7700K quad-core CPU and 32GB of DDR4 RAM. Approximately 90% of the GPU and 1.6GB of its memory, a full virtual CPU core (thread) and 0.4GB of system RAM were consumed.

[Table 4.1 image grid: Raw Image / Ground Truth / Prediction columns for the Camvid, Cityscapes, BDD100k, Synthia-SF SEQ3 and Synthia-SF SEQ5 test sets]

Table 4.1: Raw image, ground truth and corresponding network prediction labels for a randomly sampled image from each of the five test sets. The network is able to segment lane markings and curbs even when not labelled in a dataset.

In CPU mode, the network was clocked at 0.8Hz and consumed a single CPU thread and 1.6GB of RAM onboard an Intel NUC with a 2.6GHz i7-6700HQ quad-core processor and 16GB of DDR4 RAM. CPU mode was tested using a NUC because it was the processing unit used by the drones that generated the dataset with which we qualitatively evaluate our system in Section 4.3. Although 0.8Hz is far from the standard camera frame rate of 30Hz, in Section 4.2 we observe that this is sufficient for online execution, since our system is intended to generate periodic pose corrections to an already existing pose estimation framework.

The classification performance was evaluated on five datasets, modified to closely match the semantic classes of the trained network. These datasets were selected to evaluate the network's ability to generalise across environments, in the real world and in simulation, in addition to temporal variations such as clear or cloudy weather.

The real-world test sets comprised the Cityscapes [94] and BDD100K [100] validation sets and the full Camvid [93] dataset, consisting of 500, 745 and 701 image pairs respectively. The simulated test sets consisted of the Synthia San Francisco (Synthia-SF) [111] SEQ3 and SEQ5 datasets, consisting of 201 and 366 image pairs respectively. The newer Synthia-SF dataset was chosen over the original Synthia dataset [95] since it included a lane markings class, in addition to more photo-realistic images. We used the two sequences SEQ3 and SEQ5 of the dataset because they were captured during clear and cloudy weather respectively and allowed for an evaluation of generalisability with respect to subtle weather changes. Since each of the five datasets comprised images of different resolutions, all raw images were rescaled to VGA resolution when passed into the network, while prediction outputs were scaled back to the original input size using bilinear interpolation, which preserved pixel values within the set of possible semantic classes. An example of a raw input image, ground truth segmentation and network output for each dataset is presented in Table 4.1.

To align all semantic classes in the test sets as closely as possible with those defined in the training set, class remapping similar to that set out in Table 3.1 was conducted. If a desired semantic class did not exist in a test set, it was ignored in all prediction images during the generation of evaluation metrics. Exceptions, however, were made for the "curb" and "lane markings" classes, which were treated as "sidewalk" and "road" classes respectively in the prediction images at evaluation time. This was due to the assumption that curbs typically occur at the edge of sidewalks and lane markings within roads, while ignoring them completely in the prediction during evaluation would likely lead to a negative bias in the computation of the classification accuracy metrics.

Network classification performance was evaluated using five standard metrics from the semantic segmentation literature: classwise pixel accuracy (cPA), mean pixel accuracy (mPA), classwise intersection over union (cIoU), mean intersection over union (mIoU) and frequency weighted intersection over union (fwIoU). For the remainder of this section, we define 퐾 as the total number of semantic classes that exist in a dataset, 푛푘,푘′ as the number of pixels of class 푘 that are predicted to belong to class 푘′, and 푡푘 as the total number of pixels of class 푘 in the ground truth segmentation set.

Classwise pixel accuracy is a simple metric that computes, for a semantic class 푘, the ratio between the total number of correctly classified pixels and the total number of ground truth pixels for that class in the dataset

\[
\mathrm{cPA}_k = \frac{n_{k,k}}{t_k}, \tag{4.1}
\]

while mean pixel accuracy is simply the mean over all the possible 퐾 semantic classes in the test set

\[
\mathrm{mPA} = \frac{1}{K} \sum_{k=0}^{K-1} \mathrm{cPA}_k. \tag{4.2}
\]

Although the pixel accuracy metric is useful in understanding how well the network can correctly classify pixels that belong to each semantic class, it gives no indication of performance with respect to misclassifying incorrect pixels as belonging to a semantic class. To gain insight into such performance, the intersection over union metric is used. The cIoU computes, for a given class 푘, the ratio between the total number of correctly classified pixels (intersection) and the sum (union) of the total number of pixels that were incorrectly classified as a different class (false negatives), the total number of pixels that were incorrectly classified as the class of interest (false positives) and the total number of correctly classified pixels (intersection)

\[
\mathrm{cIoU}_k = \frac{n_{k,k}}{t_k + \left( \sum_{k'=0}^{K-1} n_{k',k} \right) - n_{k,k}}. \tag{4.3}
\]

The mIoU is, similar to the mPA, the mean of the cIoU over all possible K classes

\[
\mathrm{mIoU} = \frac{1}{K} \sum_{k=0}^{K-1} \mathrm{cIoU}_k. \tag{4.4}
\]
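All of these metrics, together with the frequency weighted variant introduced in Equation 4.5 below, can be computed from a single confusion matrix. The sketch below (Python/NumPy, with an illustrative three-class confusion matrix) follows Equations 4.1 to 4.4 directly.

import numpy as np

def segmentation_metrics(confusion):
    """Compute cPA, mPA, cIoU, mIoU and fwIoU from a KxK confusion matrix.

    confusion[k, k_pred] = number of pixels of ground truth class k predicted
    as class k_pred; t_k is the row sum.
    """
    tp = np.diag(confusion).astype(float)            # n_{k,k}
    t = confusion.sum(axis=1).astype(float)          # t_k (ground truth totals)
    predicted = confusion.sum(axis=0).astype(float)  # pixels predicted as each class
    cpa = tp / t                                     # Equation 4.1
    mpa = cpa.mean()                                 # Equation 4.2
    ciou = tp / (t + predicted - tp)                 # Equation 4.3
    miou = ciou.mean()                               # Equation 4.4
    fwiou = (t * ciou).sum() / t.sum()               # Equation 4.5 (see below)
    return cpa, mpa, ciou, miou, fwiou

conf = np.array([[50, 5, 0],
                 [10, 30, 5],
                 [0, 5, 20]])
cpa, mpa, ciou, miou, fwiou = segmentation_metrics(conf)
print(round(mpa, 3), round(miou, 3), round(fwiou, 3))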

One issue with the mIoU metric is its equal treatment of all classes in generating a score. This, however, does not represent the relative proportions of each semantic class in the dataset. The fwIoU metric does exactly that, by dictating the contribution of each class to the metric by its frequency of occurrence in the dataset:

\[
\mathrm{fwIoU} = \frac{1}{\sum_{k'=0}^{K-1} t_{k'}} \cdot \sum_{k=0}^{K-1} t_k \cdot \mathrm{cIoU}_k. \tag{4.5}
\]

This metric, however, will often give higher values than the mIoU, since larger semantic classes tend to be segmented more accurately, as they are likely to have dominated network training.

Table 4.2: Classwise and mean pixel accuracy performance of the network on the 5 datasets (Cityscapes (VAL), BDD100K (VAL), Camvid (FULL), Synthia-SF (SEQ3) and Synthia-SF (SEQ5)).

Table 4.3: Classwise, mean and frequency weighted IoU performance of the network on the 5 datasets.

Pixel accuracy and intersection over union performance metrics are presented in Table 4.2 and Table 4.3, respectively. The network was able to generalise across the board between synthetic and real data and environments. Performance was observed to be good for the larger classes, however poor for smaller-area classes that typically occupied less image area or appeared less frequently in the training set. In general,

Semantic Class     median freq/freq(semantic class)
Road               0.1569
Sidewalk           0.6685
Building           0.2254
Fence              1.5433
Pole               2.5338
Traffic Light      8.6868
Traffic Sign       5.5858
Vegetation         0.2038
Terrain            1.3383
Sky                0.1059
Pedestrian         4.5923
Car                0.8591
Railing            1.2937
Curb               2.9740
Lane Markings      1.3431
Street Light       46.2001

Table 4.4: Weights that could be used to class-balance the network into performing better on smaller classes. Each weight correlates to a relative measure of how “small” a semantic class is, by considering its ratio of pixel representation in all images that contain it. Smaller values indicate stronger representation in the training set.

the per class IoU metrics were lower than their respective PA counterparts because of how IoU factors in false positives and false negatives. In addition, the fwIoU scores were all significantly higher than their corresponding mIoU scores, which supports the notion that larger classes are represented more in the datasets.

One way to mitigate the poor performance in classifying smaller classes is to generate a more class-balanced network by reweighting each class in the cross-entropy loss function defined by the final softmax layer of the network. Eigen and Fergus [112] suggest reweighting each pixel of class 푘 by median_freq/freq(푘), where (i) freq(푘) is defined as the total number of pixels of class 푘 divided by the total number of pixels in images where class 푘 is present, and (ii) median_freq is the median of these values. Table 4.4 presents these weight factors calculated over the network training set and shows a clear representation bias towards classes that performed better in our test sets.
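The weights in Table 4.4 follow directly from this definition; a sketch of the computation (Python/NumPy, with hypothetical per-class and per-image pixel counts as the assumed inputs):

import numpy as np

def median_frequency_weights(pixel_counts, image_pixels, presence):
    """Median frequency balancing weights of Eigen and Fergus [112].

    pixel_counts : (K,) total pixels of each class over the training set.
    image_pixels : (M,) total pixels of each training image.
    presence     : (M, K) boolean matrix, True where class k appears in image m.
    """
    # freq(k): pixels of class k divided by total pixels of images containing k.
    freq = pixel_counts / (presence * image_pixels[:, None]).sum(axis=0)
    return np.median(freq) / freq

counts = np.array([9000.0, 500.0, 120.0])   # hypothetical class totals
img_px = np.array([10000.0, 10000.0])       # two hypothetical images
present = np.array([[True, True, False],
                    [True, False, True]])
print(median_frequency_weights(counts, img_px, present))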

66 4.2 Simulation

In this section, we evaluate the performance of our semantic overhead localisation pipeline at three different levels of difficulty within a simulated environment. We begin by describing the basic setup, difficulty levels and localisation pipeline parameters in Section 4.2.1 and then present our results in Section 4.2.2.

4.2.1 Setup

The simulated environment used to test our framework was the same one used by the MIT team that participated in the Defense Advanced Research Projects Agency (DARPA) Fast Lightweight Autonomy (FLA) program, which involved a MAV flying at speeds of up to 20m/s in an outdoor urban environment while simultaneously avoiding obstacles and exploring the space. The simulator comprised a photo-realistic environment that closely modelled the Guardian Centers of Georgia — the final test site of the MIT team. Simulated perceptual data, including dense depth, was provided by the Unity game engine, while MAV dynamics were simulated by a 12-state nonlinear quadrotor model [113] using Drake [114]. Our localisation pipeline was implemented as two C++ nodes using the Robot Operating System (ROS) [115]: one for segmentation and the other for overhead localisation. Input images to the segmentation network were 3-channel RGB with a 59.95 degree horizontal field of view, while the MAV's altitude and top speed were capped at approximately 2m and 4m/s, respectively.

To assess the performance of our system, we tested it using three difficulty levels defined by noise parameters that affected the sensory inputs to our system. Noisy relative pose changes were modelled by adding, in an integral fashion, independent Gaussian noise to each of the three ground truth pose variables. For depthmap inputs, we did not affect their density, but instead limited their effective range and applied noise by adding independent Gaussian noise to the inverse of each depth value and then taking the inverse again to generate new noisy depth values. The reason for using inverse depth noise was to mimic the behaviour of typical depth sensors, which experience greater noise when sensing objects farther away.

           푥 noise        푦 noise        휃 noise        inverse depth noise    max depth range
Easy       풩(0, 0.33²)    풩(0, 0.33²)    풩(0, 0.05²)    N/A                    20m
Medium     풩(0, 1.0²)     풩(0, 1.0²)     풩(0, 1.0²)     N/A                    20m
Hard       풩(0, 1.5²)     풩(0, 1.5²)     풩(0, 0.2²)     풩(0, 0.05²)            12m

Table 4.5: The three difficulty levels of the simulation tests are dictated by parameters affecting relative pose change and depth inputs to the localisation framework.
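The depth corruption used for the hard setting can be sketched as follows (Python/NumPy, using the σ = 0.05 inverse depth noise and 12m range cap from Table 4.5; treating out-of-range pixels as invalid zeros is an assumption of this sketch).

import numpy as np

def corrupt_depthmap(depth, inv_sigma=0.05, max_range=12.0, seed=0):
    """Apply inverse-depth Gaussian noise and a maximum range cut-off."""
    rng = np.random.default_rng(seed)
    noisy = depth.copy()
    valid = (depth > 0) & (depth <= max_range)     # drop returns beyond the range cap
    inv = 1.0 / depth[valid]
    inv += rng.normal(0.0, inv_sigma, size=inv.shape)
    noisy[valid] = 1.0 / np.clip(inv, 1e-3, None)  # back to depth, avoid division by zero
    noisy[~valid] = 0.0                            # mark invalid / out-of-range pixels
    return noisy

depths = np.array([[2.0, 8.0], [15.0, 0.0]])
print(corrupt_depthmap(depths))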

Table 4.5 presents the different parametrisations used to define the three difficulty levels: easy, medium and hard. The easy setup applied minimal Gaussian noise to the relative pose change inputs while maintaining noiseless depthmaps at a range of up to 20m. This setup was used to gently test our system while operating in close-to-ideal conditions. The medium setup used the same depth parameters as easy, but employed much noisier relative pose change inputs to simulate larger drifts of magnitudes similar to real-world sensory inputs. Finally, the hard setup further increased the relative pose change noise parameters while also reducing the depth range and applying inverse depth noise. This setup was designed to stress the system and evaluate its ability to recover from highly corrupted motion inputs given less perceptual information.

The simulated environment consisted of nine different semantic classes: terrain, road, lane marking, curb, building, pedestrian, vegetation, car and sky. Despite this, only five of these semantic classes were present in the overhead image and, as a result, a five-class semantic overhead image was generated, as shown in Figure 4-1. While all semantic classes were labelled by the area they occupied in the overhead image, buildings were only labelled along their contours, with their interiors designated to the "unclassified" set. This was done to ensure consistency with what the MAV's depth sensor can perceive. Additionally, while the environment was mostly modelled as static, several moving pedestrians and cars were present along the roads. The differences between the overhead image and the environment were used to subtly introduce occlusions and verify our system's ability to deal with differences between the overhead and local sensory input that are not attributed to noise.

(a) Overhead Image (b) Segmented Overhead Image

Figure 4-1: The overhead image of the simulation environment and its corresponding segmentation. Only five semantic classes are observable in the overhead: terrain (green), road (purple), lane marking (yellow), curb (orange) and building (black), with unclassified pixels labelled in white.

To ensure the repeatability of the localisation pipeline, 10 independent flights were conducted and logged for each of the three difficulty levels. All tests were flown along a flight path very similar to the one shown in Figure 4-2; however, to ensure sufficient randomness, each flight was generated randomly with minor variations, such as the MAV correcting its course when, for example, faced with a nearby moving vehicle. The average distance travelled per flight was measured at approximately 400m, while the average throughput of the localisation pipeline was clocked at 0.72Hz, with the majority of the computation time per iteration of the algorithm consumed by the segmentation network (0.8Hz, Section 4.1).

Finally, we set the parameters of the MCL framework empirically, based on prior runtime performance tests that balanced runtime against localisation performance. The particle set size was fixed at 10,000 particles, while the initialisation distribution parameters (Section 3.5.1) were set as 휇x0 = 0 and Γx0 = 푑푖푎푔(5, 5, 0.25).

Motion model noise parameters defined in Section 3.5.2 were selected as 휇푟표푡1 = 휇푟표푡2 = 휇푡푟푎푛푠 = 0, 휎푟표푡1 = 휎푟표푡2 = 0.1 radians and 휎푡푟푎푛푠 = 0.3m, with the sample distributions modelled by independent Gaussians. Finally, the parameters defining the blurring procedure at the end of each iteration of the MCL algorithm were varied according to the difficulty level of the tests. Easy and medium flights used zero-mean Gaussian blurring with blur strength defined by the diagonal matrix Γ = 푑푖푎푔(1, 1, 0.1), while hard flights also used zero-mean Gaussian blurring, with strength defined by 2·Γ. The reason for this difference in blurring strength was to allow the MCL framework to cope with the greater sensory noise in the hard flights by ensuring sufficient particle spread, so that the framework could recover from larger input drift and perceptual aliasing.

Figure 4-2: All 30 flight tests were conducted with the same start and goal locations, denoted by "S", and flew according to a closed loop trajectory — depicted by the bold black arrows — moving from "S" to intersection "1", followed by "2", "3" and finally back to "S".

4.2.2 Results

To illustrate the ability of our system to provide pose estimate corrections to noisy relative pose change inputs, we first present an example trajectory and corresponding error plots for each of the three simulation difficulties (easy, medium and hard) in Figures 4-4, 4-5 and 4-6, respectively. As expected, relative pose change input drift increased with simulation difficulty due to the larger noise parameters.

           Position MSE [m²]              Orientation MSE [rad²]
           Mean            Stdv           Mean                Stdv
Easy       0.868 (7.78)    0.867 (0.156)  0.00369 (0.0582)    0.00342 (0.000409)
Medium     2.46 (73.2)     1.41 (3.86)    0.00316 (0.0576)    0.00349 (0.00172)
Hard       6.49 (73.37)    5.95 (3.30)    0.00592 (0.230)     0.00382 (0.00448)

Table 4.6: The sample mean and standard deviation over our algorithm's and the noisy sim pose (values in parentheses) MSEs for all 10 trials of each simulation difficulty. Both metrics increase with difficulty, indicating greater uncertainty and reduced localisation accuracy with respect to noise. Our algorithm has substantially smaller mean values, indicative of better estimates.

However, in all three scenarios, our semantic overhead localisation framework was able to overcome the input drift and estimate poses that closely matched the ground truth. From the 푥, 푦 and 휃 error plots, we note that the particle set standard deviations over each of the pose variables spiked at instances where the corresponding pose estimate errors did too, suggesting that particle spread could be used to provide a quality metric for the estimates. Additionally, the hard setup appeared to suffer from greater estimate uncertainty through larger particle set variances. This is illustrated by the appearance of red ellipsoids in Figure 4-6, with height and width defined by the corresponding 푥 and 푦 particle set variances at each timestep, which are not visible in Figures 4-4 and 4-5 due to their significantly smaller magnitudes.

To further evaluate the effectiveness of our localisation approach, we computed the mean squared error (MSE) over position and orientation for each individual simulation trial, for both the noisy simulation pose and our pose estimates, as presented in Figure 4-7. We observed that the position MSE increased with simulation difficulty for both our computed pose estimates and the corrupted simulation poses, with the medium and hard trials experiencing similar input MSEs that were much larger than the easy trials, while the corresponding MSEs from our algorithm increased only marginally by comparison. The orientation MSEs illustrate similar performance. Table 4.6 summarises the mean and standard deviation of the MSEs for our system and the noisy sim poses (in parentheses) across each difficulty level and all 10 trials. Trial 7 of the hard simulation is shown to be an outlier with respect to positional

MSE. Examining the trajectory, the performance can be attributed to semantic perceptual aliasing that occurred at two instances where the MAV had just turned past intersections 1 and 3, shown in Figure 4-2. During these manoeuvres, the MAV was flying directly over the curb on one side of the road and the segmentation network incorrectly classified the curb and surrounding terrain as lane markings and road respectively. Due to the blurring at the end of each iteration of the MCL pipeline, and given these observations, both sides of the road were determined to be likely positions for the robot. In both instances, however, the algorithm was eventually able to correct itself when finally presented with more accurate image segmentations.

Cumulative distribution function (CDF) plots of the absolute position and orientation errors over all pose estimates for each of the three simulation difficulties are presented in Figure 4-8. We observed that the median (50th percentile) position errors for the easy, medium and hard simulations were 0.27, 0.37 and 0.76 meters, respectively, while the 90th percentiles were 1.0, 1.86 and 4.6 meters, respectively. The median orientation errors were 0.0073, 0.0089 and 0.016 radians, respectively, while the 90th percentiles were 0.031, 0.043 and 0.079 radians, respectively. Although the 90th percentile position errors are somewhat similar to GPS in the hard case, the orientation errors are promising and demonstrate our algorithm's ability to use semantics to reliably correct rotational drift. However, it should be noted that this performance may not be observed if the robot is flying for prolonged periods of time over regions of low semantic entropy, such as regions of terrain, in which case performance would likely degrade to that of the control input.

Finally, to briefly demonstrate our algorithm's ability to operate under different flight paths, we present an example of trajectories (Figure 4-9) — for each of the three simulation difficulties — generated from drone flight looping twice along a "figure-eight" path, shown in Figure 4-3. Qualitatively, we observe a trend similar to the previous 30 trials discussed: the semantic overhead localisation algorithm is able to provide significantly better pose estimates than its noisy input, while estimate error increases with setup difficulty.

Figure 4-3: The “figure-eight” path consists of a trajectory with motion between subgoals ordered by: S, 1, 2, 3, 4, 1, 2, 5, S.

4.3 Real Flight Data

In this section, we present preliminary qualitative results of our semantic overhead localisation framework executed on real flight data collected from the MIT FLA vehicle during flight tests at the Guardian Center of Georgia facility. Colour image and depthmap input streams were generated using an Intel Realsense D435 [42], while pose estimates were generated using SAMWISE [116], a sliding-window visual inertial state estimator. We used these estimates to generate the relative pose change inputs for our system. The total flight time and path length were approximately 150 seconds and 300 meters respectively, while flight altitude and speed were maintained at approximately 2.3m and 4m/s respectively. Figure 4-10 shows the semantically labelled subsection of the overhead image that the data was recorded from. Since the satellite image was taken from an oblique angle, building edges were interpolated.

The semantic overhead localisation algorithm was executed on the recorded flight data using the motion model parameters described in Section 4.2.1, in addition to the same blurring parameters as in the “hard” simulation trials. Although the camera manufacturer's specifications state a maximum effective depth range of 10m, we used all depth information up to 20m. This was done to verify our system's robustness to large amounts of inverse depth noise given semantic alignments.
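For illustration, the sketch below shows one way the relative pose change inputs could be derived from consecutive planar pose estimates. It is a sketch under the assumption that poses are reduced to (x, y, θ) in a common world frame; the function name is illustrative and not part of SAMWISE or our pipeline.

    import numpy as np

    def relative_pose_change(prev_pose, curr_pose):
        # prev_pose, curr_pose: (x, y, theta) tuples in a common world frame.
        # Returns the (dx, dy, dtheta) increment expressed in prev_pose's body frame,
        # i.e. the odometry-style input consumed by the particle filter motion model.
        x0, y0, th0 = prev_pose
        x1, y1, th1 = curr_pose
        c, s = np.cos(th0), np.sin(th0)
        dx_w, dy_w = x1 - x0, y1 - y0
        dx = c * dx_w + s * dy_w                           # rotate the world-frame translation into the body frame
        dy = -s * dx_w + c * dy_w
        dth = (th1 - th0 + np.pi) % (2.0 * np.pi) - np.pi  # wrapped heading change
        return dx, dy, dth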

Finally, during the topview generation (Section 3.4) phase, we gave lane marking points greater weight in deciding the semantic label of each cell in the topview (a schematic sketch of this weighting is given below). This was done to compensate for the fact that lane markings in the real-world environment were very thin and often worn out.

Pose trajectories generated by the SAMWISE state estimator and by the semantic overhead localisation algorithm are presented in Figure 4-11. Since the MAV was not fitted with a GPS sensor, a crude approximation of the “ground truth” vehicle trajectory was generated using a priori waypoints, the live video feed from onboard the MAV and human observers, to provide a qualitative, visual understanding of system performance. Despite the SAMWISE pose estimates incurring substantial drift, the semantic localisation algorithm is able to correct for this by ensuring semantic consistency between local observations and the overhead image. Figures 4-12 and 4-13 show the topview semantic reconstructions of the world based on the SAMWISE and semantic localisation pose estimates respectively, overlaid on top of the overhead image. The semantic localisation algorithm generates an observably more consistent semantic reconstruction.
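As referenced above, cell labels in the topview were decided with lane marking points given greater weight. The following is a schematic sketch of such class-weighted voting over the collapsed points that fall into one topview cell; the class names and the weight value are illustrative assumptions, not the parameters used on the real flight data.

    from collections import Counter

    # Illustrative per-class vote weights; lane markings are boosted so that thin,
    # sparsely observed markings can still win a cell.
    CLASS_WEIGHTS = {"road": 1.0, "terrain": 1.0, "building": 1.0, "lane_marking": 3.0}

    def cell_label(point_labels):
        # Weighted vote over the semantic labels of the points collapsed into a single cell.
        votes = Counter()
        for label in point_labels:
            votes[label] += CLASS_WEIGHTS.get(label, 1.0)
        return max(votes, key=votes.get) if votes else "unclassified"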

(a) Pose Trajectories; (b) X Errors and Particles Stdv; (c) Y Errors and Particles Stdv; (d) Orientation Errors and Particles Stdv; (e) Euclidean Position Errors

Figure 4-4: An example trajectory from one of the 10 “easy” simulation trials. The trajectory formed by the noisy relative pose change inputs is subject to drift away from the ground truth trajectory, while our algorithm is able to correct for the input noise and generate a trajectory closer to ground truth.

(a) Pose Trajectories; (b) X Errors and Particles Stdv; (c) Y Errors and Particles Stdv; (d) Orientation Errors and Particles Stdv; (e) Euclidean Position Errors

Figure 4-5: An example trajectory from one of the 10 “medium” simulation trials. The trajectory formed by the noisy relative pose change inputs is subject to observably larger drift away from the ground truth trajectory than in the easy trial. Despite this, our algorithm is able to compute a corrected trajectory.

(a) Pose Trajectories; (b) X Errors and Particles Stdv; (c) Y Errors and Particles Stdv; (d) Orientation Errors and Particles Stdv; (e) Euclidean Position Errors

Figure 4-6: An example trajectory from one of the 10 “hard” simulation trials. Although the trajectory formed by the noisy relative pose change inputs appears similar to that in the medium trial, the orientation errors are up to twice as large. Despite this, our algorithm continues to compute an improved trajectory over the noisy input.

(a) Position Mean Square Error; (b) Orientation Mean Square Error

Figure 4-7: Mean Square Errors (MSEs) for each of the 10 trials show clear reductions in position and orientation errors in our algorithm's pose estimates. Although the MSE for the corrupted simulation pose increases substantially with simulation difficulty, the effect is much weaker with our algorithm.

(a) CDF of Absolute Position Errors; (b) CDF of Absolute Orientation Errors

Figure 4-8: CDFs of absolute position and orientation errors over all pose estimates from our algorithm for each of the 3 simulation difficulties. Larger errors become more likely as sensory input noise increases and depth range decreases.

(a) “Easy”; (b) “Medium”; (c) “Hard”

Figure 4-9: Pose trajectories for each difficulty setup along a “figure-eight” path. The semantic overhead localisation pipeline is able to recover from translational and rotational drifts even as input noise is increased.

Figure 4-10: The flight region of the real-world recorded data consisted of 4 semantic classes. To match what is observable by the MAV's depth sensor, buildings were labelled around their contours with the internal areas designated as unclassified.

Figure 4-11: A crude approximation of the ground truth vehicle trajectory is shown in black, with the approximate start location marked by the “S”. This was generated given the a priori waypoints, live video feed and human observers. The SAMWISE estimates accumulated drift errors of up to approximately 30m and falsely placed the final pose at the start location. The semantic overhead localisation algorithm was able to correct for estimator drift and generate pose estimates closer to the actual vehicle trajectory.

Figure 4-12: The semantic reconstruction of the flight path using the semantic topviews, generated at each timestep of the semantic localisation algorithm and centred at the corresponding SAMWISE poses. Due to the large drift in the state estimator, the semantic reconstruction does not match the overhead image during the last leg of the flight. Road is placed where terrain should be, and the buildings (grey in the semantic topviews) are placed at the edge of the terrain to the left of the start location.

Figure 4-13: The semantic reconstruction of the flight path using the semantic topviews, generated at each timestep of the semantic localisation algorithm and centred at the corresponding semantic overhead localisation poses. The road, lane marking and building reconstructions, although still noisy, are truer than the reconstruction using SAMWISE poses. Terrain is mostly aligned correctly; however, some overlap with buildings can be seen, owing to errors in the pose estimates.

Chapter 5

Conclusion

Despite rapid improvements in semantic segmentation to date and the presence of large satellite image databases, most frameworks that localise using an overhead image do not use more than one semantic class within the environment to perform data association with local visual sensor input; none use semantic segmentation networks, and most focus on feature-matching based methods. This thesis presented research that addressed the overhead localisation problem as a multi-semantic matching problem, aiming to generate more accurate pose estimates by constraining the data association process.

A visual semantic overhead localisation pipeline was presented in Chapter 3 that used a deep learning semantic segmentation framework to generate dense scene semantics from an onboard camera sensor for direct data association with an overhead image at the pixel level. To generate local scene semantics, a segmentation network was trained on a large, publicly available urban dataset and was shown to generalise across unseen environments for the larger semantic classes that are more likely to be present in overhead images. The key contribution is the generation of a semantic topview representing the semantic geometric topology of the locally sensed environment. This topview is used to directly estimate particle weights in the measurement model of an MCL framework by applying pixel-wise semantic match comparisons with a semantic overhead image to generate a semantic photometric error score. Semantic classes that appear less frequently in the overhead image are given greater error weights, allowing them to provide a strong “corrective” signal in cases of estimate drift.
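To make the weighting idea concrete, the following is a minimal sketch of a per-particle score computed by pixel-wise comparison of the local topview against the overhead crop sampled under the particle's hypothesised pose. The class IDs, the error weights, the choice of weighting mismatches by the overhead pixel's class and the exponential mapping to a weight are all illustrative assumptions rather than the exact form used in the pipeline.

    import numpy as np

    # Illustrative per-class error weights: rarer overhead classes (e.g. lane markings)
    # contribute a larger penalty when a pixel's labels disagree.
    # Index order assumed: [unclassified, road, terrain, building, lane_marking].
    CLASS_ERROR_WEIGHTS = np.array([0.0, 1.0, 1.0, 1.0, 4.0])

    def particle_weight(topview_labels, overhead_labels, sigma=50.0):
        # topview_labels: (H, W) int array of local semantic topview labels.
        # overhead_labels: (H, W) int array of overhead labels under the particle's pose.
        mismatch = topview_labels != overhead_labels
        # Penalise each mismatching pixel according to its overhead class weight.
        error = CLASS_ERROR_WEIGHTS[overhead_labels][mismatch].sum()
        return np.exp(-error / sigma)  # map the error score to an unnormalised particle weight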

This approach allows for semantically consistent pose estimates under the assumptions of accurate local image segmentations, dense depth estimation and correctly scaled overhead images.

The semantic localisation pipeline was evaluated in Chapter 4, first in simulation (Section 4.2) and then using real-world flight data (Section 4.3). Simulation testing consisted of flight within a simulated environment with three levels of difficulty parametrised by control input noise, maximum depth range and inverse depth noise. The quantitative analysis illustrates that the semantic localisation pipeline is able to correct translational drift to within GPS-level accuracy, while rotational drift is corrected to within 5 degrees 90% of the time. Qualitative results generated using data recorded onboard a real MAV during a 300m, 150 second flight demonstrated our algorithm's ability to correct the pose estimates of a state-of-the-art state estimator, producing observably better semantic consistency.

While the research presented in this thesis demonstrates the ability to improve pose estimate accuracy using multi-class pixel-wise data association, we note possible extensions that could incorporate the overhead localisation techniques reviewed in Section 2.2. For instance, once the local image has been segmented, its output could be used to produce masks for each semantic class, enabling class-specific features to be detected over the corresponding image regions of each mask. The same procedure could be conducted over the semantic overhead image so that multi-class feature matching or geometric alignment procedures, such as chamfer matching and ICP, could be used to generate semantically consistent solutions. We believe this would prove useful in cases where the robot is operating in regions of low semantic variability and our system's performance degrades to control input accuracy, since segmentation provides semantics at the expense of colour and texture information.

Bibliography

[1] Evan Ackerman. Skydio demonstrates incredible obstacle-dodging full auton- omy with new r1 consumer drone. https://spectrum.ieee.org/automaton/ robotics/drones/skydio-r1-drone. Accessed: 2018-7-23.

[2] Daisuke Wakabayashi. Waymo’s autonomous cars cut out human drivers in road tests. https://www.nytimes.com/2017/11/07/technology/ waymo-autonomous-cars.html. Accessed: 2018-7-23.

[3] Mike Wehner. Robotic bird drones will moni- tor china’s citizens. https://nypost.com/2018/06/27/ robotic-bird-drones-will-monitor-chinas-citizens/. Accessed: 2018- 7-23.

[4] Mary-Ann Russon. Drones to the rescue! https://www.bbc.com/news/ business-43906846. Accessed: 2018-7-23.

[5] Jeff Desjardins. Amazon and ups are betting big on drone delivery. http://www.businessinsider.com/ amazon-and-ups-are-betting-big-on-drone-delivery-2018-3. Accessed: 2018-7-23.

[6] The Economist. Manna from heaven: How e-commerce with drone delivery is taking flight in china. https://www.economist.com/business/2018/06/09/how-e-commerce-with-drone-delivery-is-taking-flight-in-china. Accessed: 2018-7-23.

[7] Google Maps. Google maps. https://maps.google.com. Accessed: 2018-8-27.

[8] Yutaka Masumoto. Global positioning system, May 11 1993. US Patent 5,210,540.

[9] Josef Marek and Ladislav Štěpánek. Accuracy and availability of the satel- lite navigation system gps. In Microwave Techniques (COMITE), 2010 15th International Conference on, pages 121–124. IEEE, 2010.

[10] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. MIT press, 2005.

[11] GISGeography. 15 free satellite imagery data sources. https://gisgeography.com/free-satellite-imagery-data-list/. Accessed: 2018-7-23.

[12] Kalev Leetaru. How google earth and google maps taught us to see the world from above. https://www.forbes.com/sites/kalevleetaru/2018/04/03/ how-google-earth-and-google-maps-taught-us-to-see-the-world-from-above/ #6b61feb576e6. Accessed: 2018-7-23.

[13] Hugh Durrant-Whyte and Tim Bailey. Simultaneous localization and mapping: part i. IEEE robotics & automation magazine, 13(2):99–110, 2006.

[14] Tim Bailey and Hugh Durrant-Whyte. Simultaneous localization and mapping (slam): Part ii. IEEE Robotics & Automation Magazine, 13(3):108–117, 2006.

[15] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep con- volutional encoder-decoder architecture for image segmentation. IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 39:2481–2495, 2017.

[16] Abhishek Chaurasia and Eugenio Culurciello. Linknet: Exploiting encoder representations for efficient semantic segmentation. In Visual Communications and Image Processing (VCIP), 2017 IEEE, pages 1–4. IEEE, 2017.

[17] Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. Enet: A deep neural network architecture for real-time semantic segmentation. CoRR, abs/1606.02147, 2016.

[18] Mostafa Gamal, Mennatullah Siam, and Moemen Abdel-Razek. Shuffleseg: Real-time semantic segmentation network. arXiv preprint arXiv:1803.03816, 2018.

[19] Dong-Ki Kim, Daniel Maturana, Masashi Uenoyama, and Sebastian Scherer. Season-invariant semantic segmentation with a deep multimodal network. In Field and Service Robotics, pages 255–270. Springer, 2018.

[20] Abhinav Valada, Johan Vertens, Ankit Dhall, and Wolfram Burgard. Adap- net: Adaptive semantic segmentation in adverse environmental conditions. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 4644–4651. IEEE, 2017.

[21] Johan Vertens, Abhinav Valada, and Wolfram Burgard. Smsnet: Semantic mo- tion segmentation using deep convolutional neural networks. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 582– 589, 2017.

[22] Nazrul Haque, N. Dinesh Reddy, and K. Madhava Krishna. Joint semantic and motion segmentation for dynamic scenes using deep convolutional networks. CoRR, abs/1704.08331, 2017.

[23] Bi-ke Chen, Chen Gong, and Jian Yang. Importance-aware semantic segmentation for autonomous driving system. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1504–1510. AAAI Press, 2017.

[24] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena- Martinez, and José García Rodríguez. A review on deep learning techniques applied to semantic segmentation. CoRR, abs/1704.06857, 2017.

[25] Mennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, and Martin Jägersand. Rtseg: Real-time semantic segmentation comparative study. CoRR, abs/1803.02758, 2018.

[26] Nikolay Atanasov, Menglong Zhu, Kostas Daniilidis, and George J Pappas. Semantic localization via the matrix permanent. In Robotics: Science and Systems, volume 2, 2014.

[27] Daniel Maturana, Po-Wei Chou, Masashi Uenoyama, and Sebastian Scherer. Real-time semantic mapping for autonomous off-road navigation. In Field and Service Robotics, pages 335–350. Springer, 2018.

[28] Daniel Maturana, Sankalp Arora, and Sebastian Scherer. Looking forward: A semantic mapping system for scouting with micro-aerial vehicles. In Intelli- gent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 6691–6698. IEEE, 2017.

[29] Dong Wook Ko, Chuho Yi, and Il Hong Suh. Semantic mapping and naviga- tion: A bayesian approach. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 2630–2636. IEEE, 2013.

[30] Janice Lin, Wen-June Wang, Sheng-Kai Huang, and Hsiang-Chieh Chen. Learn- ing based semantic segmentation for robot navigation in outdoor environment. In Fuzzy Systems Association and 9th International Conference on Soft Com- puting and Intelligent Systems (IFSA-SCIS), 2017 Joint 17th World Congress of International, pages 1–5. IEEE, 2017.

[31] Alina Marcu. A local-global approach to semantic segmentation in aerial images. CoRR, abs/1607.05620, 2016.

[32] Pascal Kaiser, Jan Dirk Wegner, Aurélien Lucchi, Martin Jaggi, Thomas Hof- mann, and Konrad Schindler. Learning aerial image segmentation from online maps. IEEE Transactions on Geoscience and Remote Sensing, 55(11):6054– 6068, 2017.

[33] Shivaprakash Muruganandham. Semantic segmentation of satellite images using deep learning, 2016.

[34] Dimitrios Marmanis, Jan D Wegner, Silvano Galliani, Konrad Schindler, Mihai Datcu, and Uwe Stilla. Semantic segmentation of aerial images with an ensemble of cnns. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 3:473, 2016.

[35] Dragos Costea and Marius Leordeanu. Aerial image geolocalization from recognition and matching of roads and intersections. arXiv preprint arXiv:1605.08323, 2016.

[36] Turgay Senlet and Ahmed Elgammal. Satellite image based precise robot lo- calization on sidewalks. In Robotics and Automation (ICRA), 2012 IEEE In- ternational Conference on, pages 2647–2653. IEEE, 2012.

[37] Turgay Senlet, Tarek El-Gaaly, and Ahmed Elgammal. Hierarchical semantic hashing: Visual localization from buildings on maps. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 2990–2995. IEEE, 2014.

[38] Turgay Senlet and Ahmed Elgammal. A framework for global vehicle local- ization using stereo images and satellite and road maps. In Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pages 2034–2041. IEEE, 2011.

[39] Turgay Senlet. Visual localization, semantic video segmentation and labeling us- ing satellite maps. Rutgers The State University of New Jersey-New Brunswick, 2015.

[40] Hokuyo. Hokuyo utm-30lx. https://www.hokuyo-aut.jp/search/single. php?serial=169. Accessed: 2018-8-27.

[41] Velodyne. Velodyne vls-128. https://velodynelidar.com/vls-128.html. Accessed: 2018-8-27.

[42] Intel. Intel realsense depth camera d435. https://click.intel.com/intelr-realsensetm-depth-camera-d435.html. Accessed: 2018-8-27.

[43] Dante D'Orazio. Point grey flea3 camera captures 4k video in a 1.2 inch cube form factor for $949. https://www.theverge.com/2012/7/1/3128515/point-grey-flea3-worlds-smallest-4k-video-camera. Accessed: 2018-8-27.

[44] Lauro Ojeda and Johann Borenstein. Personal dead-reckoning system for gps- denied environments. In Safety, Security and Rescue Robotics, 2007. SSRR 2007. IEEE International Workshop on, pages 1–6. IEEE, 2007.

[45] gps.gov. Gps accuracy. https://www.gps.gov/systems/gps/performance/accuracy/. Accessed: 2018-8-10.

[46] Frank van Diggelen and Per Enge. The world's first gps mooc and worldwide laboratory using smartphones. In Proceedings of the 28th International Technical Meeting of The Satellite Division of the Institute of Navigation (ION GNSS+ 2015), pages 361–369, 2015.

[47] Nils Ole Tippenhauer, Christina Pöpper, Kasper Bonne Rasmussen, and Srdjan Capkun. On the requirements for successful gps spoofing attacks. In Proceedings of the 18th ACM conference on Computer and communications security, pages 75–86. ACM, 2011.

[48] Andrew J Kerns, Daniel P Shepard, Jahshan A Bhatti, and Todd E Humphreys. Unmanned aircraft capture and control via gps spoofing. Journal of Field Robotics, 31(4):617–636, 2014.

[49] Bradford W Parkinson, Per Enge, Penina Axelrad, and James J Spilker Jr. Global positioning system: Theory and applications, Volume II. American Institute of Aeronautics and Astronautics, 1996.

[50] Francisco Rovira-Más and Ratul Banerjee. Gps data conditioning for enhancing reliability of automated off-road vehicles. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of automobile engineering, 227(4):521–535, 2013.

[51] Yuanliang Zhang and Dong Pyo Hong. Navigation of mobile robot using low-cost gps. International Journal of Precision Engineering and Manufacturing, 16(4):847–850, 2015.

[52] Rudolph Kalman. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.

[53] Ryan S Kaminsky, Noah Snavely, Steven M Seitz, and Richard Szeliski. Alignment of 3d point clouds to overhead images. In Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009. IEEE Computer Society Conference on, pages 63–70. IEEE, 2009.

[54] H Christopher Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293(5828):133, 1981.

[55] Boris Sofman, Ellie Lin, J Andrew Bagnell, John Cole, Nicolas Vandapel, and Anthony Stentz. Improving robot navigation through self-supervised online learning. Journal of Field Robotics, 23(11-12):1059–1075, 2006.

[56] Michael McHenry, Yang Cheng, and Larry Matthies. Vision-based localization in urban environments. In Unmanned Ground Vehicle Technology VII, volume 5804, pages 359–371. International Society for Optics and Photonics, 2005.

[57] Keith Yu Kit Leung, Christopher M Clark, and Jan P Huissoon. Localization in urban environments by matching ground level video images with an aerial image. In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 551–556. IEEE, 2008.

[58] Christian Landsiedel and Dirk Wollherr. Global localization of 3d point clouds in building outline maps of urban outdoor environments. International journal of intelligent robotics and applications, 1(4):429–441, 2017.

[59] Christian W Landsiedel. Semantic mapping for autonomous robots in urban environments.

[60] Mordechai Haklay and Patrick Weber. Openstreetmap: User-generated street maps. Ieee Pervas Comput, 7(4):12–18, 2008.

[61] Sheraz Khan, Athanasios Dometios, Chris Verginis, Costas Tzafestas, Dirk Wollherr, and Martin Buss. Rmap: a rectangular cuboid approximation framework for 3d environment mapping. Autonomous Robots, 37(3):261–277, 2014.

[62] Jiri Matas, Charles Galambos, and Josef Kittler. Robust detection of lines using the progressive probabilistic hough transform. Computer Vision and Image Understanding, 78(1):119–137, 2000.

[63] Harry G. Barrow, Jay M. Tenenbaum, Robert C. Bolles, and Helen C. Wolf. Parametric correspondence and chamfer matching: Two new techniques for image matching. In IJCAI, 1977.

[64] Ming-Yu Liu, Oncel Tuzel, Ashok Veeraraghavan, and Rama Chellappa. Fast directional chamfer matching. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1696–1703. IEEE, 2010.

[65] Xipeng Wang, Steve Vozar, and Edwin Olson. Flag: Feature-based localization between air and ground. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3178–3184. IEEE, 2017.

[66] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: Binary robust independent elementary features. In European conference on computer vision, pages 778–792. Springer, 2010.

[67] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.

[68] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density- based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.

[69] Oliver Pink. Visual map matching and localization using a global feature map. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW’08. IEEE Computer Society Conference on, pages 1–7. IEEE, 2008.

[70] Paul J Besl and Neil D McKay. Method for registration of 3-d shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, volume 1611, pages 586–607. International Society for Optics and Photonics, 1992.

[71] Kenneth P Bube and Robert T Langan. Hybrid ℓ1/ℓ2 minimization with applications to tomography. Geophysics, 62(4):1183–1195, 1997.

[72] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.

[73] Masafumi Noda, Tomokazu Takahashi, Daisuke Deguchi, Ichiro Ide, Hiroshi Murase, Yoshiko Kojima, and Takashi Naito. Vehicle ego-localization by match- ing in-vehicle camera images to an aerial image. In Asian Conference on Com- puter Vision, pages 163–173. Springer, 2010.

[74] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). Computer vision and image understanding, 110(3):346– 359, 2008.

[75] Anirudh Viswanathan, Bernardo R Pires, and Daniel Huber. Vision based robot localization by ground to satellite matching in gps-denied situations. In Intelligent Robots and Systems (IROS 2014), 2014 IEEE/RSJ International Conference on, pages 192–198. IEEE, 2014.

[76] David G Lowe. Object recognition from local scale-invariant features. In Com- puter vision, 1999. The proceedings of the seventh IEEE international confer- ence on, volume 2, pages 1150–1157. Ieee, 1999.

[77] Tianjia Wu and Zhenjiang Miao. An improved feature image matching algo- rithm based on locality-sensitive hashing. In Signal Processing (ICSP), 2016 IEEE 13th International Conference on, pages 723–728. IEEE, 2016.

[78] Aomei Li, Wanli Jiang, Weihua Yuan, Dehui Dai, Siyu Zhang, and Zhe Wei. An improved fast+surf fast matching algorithm. Procedia Computer Science, 107:306 – 312, 2017. Advances in Information and Communication Technology: Proceedings of 7th International Congress of Information and Communication Technology (ICICT2017).

[79] Hang Chu, Hongyuan Mei, Mohit Bansal, and Matthew R Walter. Accu- rate vision-based vehicle localization using satellite imagery. arXiv preprint arXiv:1510.09171, 2015.

[80] Dong-Ki Kim and Matthew R Walter. Satellite image-based localization via learned embeddings. In Robotics and Automation (ICRA), 2017 IEEE Inter- national Conference on, pages 2073–2080. IEEE, 2017.

[81] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 539–546. IEEE, 2005.

[82] Yicong Tian, Chen Chen, and Mubarak Shah. Cross-view image matching for geo-localization in urban environments. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1998–2006, 2017.

[83] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[84] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

[85] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[86] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[87] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[88] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[89] Martin Thoma. A survey of semantic segmentation. CoRR, abs/1602.06541, 2016.

[90] Hongyuan Zhu, Fanman Meng, Jianfei Cai, and Shijian Lu. Beyond pixels: A comprehensive survey from bottom-up to semantic image segmentation and cosegmentation. Journal of Visual Communication and Image Representation, 34:12–27, 2016.

[91] Qichuan Geng, Zhong Zhou, and Xiaochun Cao. Survey of recent progress in semantic image segmentation with cnns. Science China Information Sciences, 61(5):051101, 2018.

[92] Gerhard Neuhold, Tobias Ollmann, S Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, pages 22–29, 2017.

[93] Gabriel J Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.

[94] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.

[95] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016.

[96] Gregory J Stein and Nicholas Roy. Genesis-rt: Generating synthetic images for training secondary real-world tasks. In IEEE International Conference on Robotics and Automation (ICRA), 2018.

[97] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 1–13, 2018.

[98] Swami Sankaranarayanan, Yogesh Balaji, Arpit Jain, Ser Nam Lim, and Rama Chellappa. Learning from synthetic data: Addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3752–3761, 2018.

[99] Fatemeh Sadat Saleh, Mohammad Sadegh Aliakbarian, Mathieu Salzmann, Lars Petersson, and Jose M Alvarez. Effective use of synthetic data for urban scene semantic segmentation. arXiv preprint arXiv:1807.06132, 2018.

[100] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. CoRR, abs/1805.04687, 2018.

[101] Xinyu Huang, Xinjing Cheng, Qichuan Geng, Binbin Cao, Dingfu Zhou, Peng Wang, Yuanqing Lin, and Ruigang Yang. The apolloscape dataset for autonomous driving. CoRR, abs/1803.06184, 2018.

[102] W Nicholas Greene and Nicholas Roy. Flame: Fast lightweight mesh estimation using variational smoothing on delaunay graphs. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 4696–4704. IEEE, 2017.

[103] Leonid Keselman, John Iselin Woodfill, Anders Grunnet-Jepsen, and Achintya Bhowmik. Intel realsense stereoscopic depth cameras. arXiv preprint arXiv:1705.05548, 2017.

[104] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.

[105] Timo Sämann. Enet for caffe. https://github.com/TimoSaemann/ENet. Accessed: 2018-8-23.

[106] chuanqi305. Mobilenet-ssd. https://github.com/chuanqi305/ MobileNet-SSD. Accessed: 2018-8-23.

[107] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[108] Radu Bogdan Rusu and Steve Cousins. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shang- hai, China, May 9-13 2011.

[109] Nicholas I Fisher. Statistical analysis of circular data. Cambridge University Press, 1995.

[110] Kanti V Mardia and Peter E Jupp. Directional statistics, volume 494. John Wiley & Sons, 2009.

[111] Daniel Hernandez-Juarez, Lukas Schneider, Antonio Espinosa, David Vázquez, Antonio M López, Uwe Franke, Marc Pollefeys, and Juan C Moure. Slanted stixels: Representing san francisco’s steepest streets. arXiv preprint arXiv:1707.05397, 2017.

[112] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.

[113] Daniel Mellinger, Nathan Michael, and Vijay Kumar. Trajectory generation and control for precise aggressive maneuvers with quadrotors. The International Journal of Robotics Research, 31(5):664–674, 2012.

[114] Russ Tedrake and the Drake Development Team. Drake: A planning, control, and analysis toolbox for nonlinear dynamical systems, 2016.

[115] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y Ng. Ros: an open-source robot operating system. In ICRA workshop on open source software, volume 3, page 5. Kobe, Japan, 2009.

[116] Ted J Steiner, Robert D Truax, and Kristoffer Frey. A vision-aided inertial navigation system for agile high-speed flight in unmapped environments: Dis- tribution statement a: Approved for public release, distribution unlimited. In Aerospace Conference, 2017 IEEE, pages 1–10. IEEE, 2017.
