DEEP LEARNING APPROACH FOR VISION NAVIGATION IN FLIGHT

Thesis

Submitted to

The School of Engineering of the

UNIVERSITY OF DAYTON

In Partial Fulfillment of the Requirements for

The Degree of

Master of Science in Electrical Engineering

By

Branden Timothy McNally

Dayton, Ohio

December, 2018

DEEP LEARNING APPROACH FOR VISION NAVIGATION IN FLIGHT

Name: McNally, Branden Timothy

APPROVED BY:

Eric Balster, Ph.D.
Advisor, Committee Chairman
Department Chair; Professor, Department of Electrical and Computer Engineering

Frank Scarpino, Ph.D.
Committee Member
Associate Professor Emeritus, Department of Electrical and Computer Engineering

Donald Venable, Ph.D.
Committee Member
Senior Electronics Engineer, Air Force Research Laboratory

Robert J. Wilkens, Ph.D., P.E.
Associate Dean for Research and Innovation
Professor, School of Engineering

Eddy M. Rojas, Ph.D., M.A., P.E.
Dean, School of Engineering

© Copyright by

Branden Timothy McNally

All rights reserved

2018

ABSTRACT

DEEP LEARNING APPROACH FOR VISION NAVIGATION IN FLIGHT

Name: McNally, Branden Timothy
University of Dayton

Advisor: Dr. Eric Balster

The recent advancements in the field of Deep Learning have fostered solutions to many complex image-based problems such as image classification, object detection, and image captioning. The goal of this work is to apply Deep Learning techniques to the problem of image-based navigation in a flight environment. In situations where GPS is not available, it is important to have alternate navigation systems. An image-based navigation system is potentially a cost-effective alternative during a GPS outage. The current state-of-the-art results are obtained using a perspective-n-point (PnP) approach. The downsides to the PnP approach include carrying a large database of features for matching and the sparse availability of distinct features in all scenes. A deep learning approach allows for a lightweight solution and provides a position estimate for any scene. A variety of published networks are modified for regression and trained to estimate a virtual drone's North and East position as a function of a single input image. The best network tested produces an average Euclidean distance error of 5.643 meters in a 2.5 x 2.5 km virtual environment.

For my family and friends who have supported me through higher education

ACKNOWLEDGMENTS

I would like to thank the Air Force Research Labs (AFRL) and the University of Dayton Research Institute (UDRI) for supporting my research. Specifically, I would like to thank Dr. Don Venable of AFRL and my advisor Dr. Eric Balster for their guidance and mentorship. I would also like to thank Dr. Christopher McGuinness and Jonathan Headlee of UDRI for bringing me on as a full-time employee as my work progressed.

TABLE OF CONTENTS

ABSTRACT
DEDICATION
ACKNOWLEDGMENTS
LIST OF FIGURES
LIST OF TABLES

CHAPTER I. INTRODUCTION
1.1 Overview
1.2 Related Work
1.3 Approach

CHAPTER II. BACKGROUND
2.1 Neural Networks
2.2 Back-Propagation
2.3 Pooling
2.4 Dropout
2.5 L2 Regularization
2.6 Convolutional Neural Networks

CHAPTER III. DATA-SET CREATION
3.1 Coordinate System
3.2 Environment Sampling

CHAPTER IV. NETWORK DESIGN AND TRAINING
4.1 Software Libraries
4.2 Network Architecture
4.2.1 ResNet
4.2.2 DenseNet
4.2.3 MobileNet
4.2.4 Inception
4.2.5 MobileNet Dense
4.3 Training
4.4 Preserving Spatial Information
4.5 Network Modifications

CHAPTER V. RESULTS
5.1 Training and Validation Results
5.2 Inference Analysis
5.2.1 ResNet 50
5.2.2 DenseNet 121
5.2.3 MobileNet
5.2.4 DenseNet 201
5.2.5 Inception V3
5.2.6 MobileNet Dense
5.2.7 All Networks

CHAPTER VI. CONCLUSION
6.1 Summary
6.2 Future/Current Work

BIBLIOGRAPHY

LIST OF FIGURES

2.1 Perceptron
2.2 Artificial Neural Network (Multilayer Perceptron)
2.3 Max Pool with 2x2 Filter and Stride of 2
2.4 Neural Network with Dropout
3.1 The Landscape Mountains Environment available through Unreal Engine
3.2 Aircraft Pose
4.1 Residual Block [1]
4.2 DenseNet Architecture with three dense blocks [2]
4.3 MobileNet Depthwise Separable Convolutions [3]
4.4 Generic Inception Model Replacing n x n Convolution Filter [4]
5.1 Training Results (Loss is Euclidean Distance in Meters)
5.2 Validation Results (Loss is Euclidean Distance in Meters)
5.3 ResNet50 Box Plot of Inference Errors (2560 samples in meters)
5.4 Images of two of the best predictions of ResNet 50. Image (a) with 0.558m error. Image (b) with 0.818m error
5.5 Images of two of the worst predictions of ResNet 50. Image (a) with 418.737m error. Image (b) with 215.439m error
5.6 DenseNet121 Box Plot of Inference Errors (2560 samples in meters)
5.7 Images of two of the best predictions of DenseNet 121. Image (a) with 0.773m error. Image (b) with 1.154m error
5.8 Images of two of the worst predictions of DenseNet 121. Image (a) with 373.674m error. Image (b) with 190.433m error
5.9 MobileNet Box Plot of Inference Errors (2560 samples in meters)
5.10 Images of two of the best predictions of MobileNet. Image (a) with 0.940m error. Image (b) with 1.089m error
5.11 Images of two of the worst predictions of MobileNet. Image (a) with 514.214m error. Image (b) with 404.517m error
5.12 DenseNet201 Box Plot of Inference Errors (2560 samples in meters)
5.13 Images of two of the best predictions of DenseNet 201. Image (a) with 0.271m error. Image (b) with 0.499m error
5.14 Images of two of the worst predictions of DenseNet 201. Image (a) with 191.6m error. Image (b) with 106.223m error
5.15 Inception V3 Box Plot of Inference Errors (2560 samples in meters)
5.16 Images of two of the best predictions of Inception V3. Image (a) with 0.624m error. Image (b) with 0.752m error
5.17 Images of two of the worst predictions of Inception V3. Image (a) with 440.959m error. Image (b) with 156.946m error
5.18 MobileNet Dense Box Plot of Inference Errors (2560 samples in meters)
5.19 Images of two of the best predictions of MobileNet Dense. Image (a) with 0.105m error. Image (b) with 0.205m error
5.20 Images of two of the worst predictions of MobileNet Dense. Image (a) with 148.637m error. Image (b) with 104.907m error

LIST OF TABLES

4.1 Original MobileNet Architecture
4.2 MobileNet Architecture for Regression
5.1 Mean, Standard Deviation, Minimum, 25 Percent, 50 Percent, 75 Percent, and Max Validation Absolute Errors (Meters)

CHAPTER I

INTRODUCTION

1.1 Overview

The development of the Global Positioning System (GPS) has forever changed the way the world navigates. The world depends heavily on this technology during many day-to-day tasks, including navigation while driving, commercial fleet management, cargo tracking, shipping routes, and navigation in aviation. In situations where GPS becomes unavailable, it is important to have backup navigation technologies. This work investigates an image-based solution for aerial navigation using deep learning frameworks. An image-based solution is cost effective and takes advantage of information readily available during flight. An aircraft can easily be fitted with an image sensor and an inference computer as a low-cost navigation alternative.

It is important to note that image-based navigation will never fully replace GPS, given that position estimates rely on an unobstructed view of the ground below. The imagery must also be unique in some way in order to produce an accurate position estimate; for example, an image of a body of water below would provide very little information about the aircraft's exact position. This alternative navigation approach works best on clear days with unique features on the ground below.

Convolutional Neural Networks (CNNs) have quickly become the workhorse behind many computer vision tasks since AlexNet won the ImageNet Challenge in 2012 [5, 6, 7]. Many successful networks have been developed following the groundbreaking results of AlexNet, including VGG Net, GoogLeNet, DenseNet, XceptionNet, ResNet, and MobileNet [8, 9, 2, 10, 1, 3]. These networks are usually designed with the task of image classification in mind. The common method of performance bench-marking is to use a publicly available data-set, such as ImageNet, CIFAR 10, or CIFAR 100 [11]. The classification task, determining the object contained in the image, is one of the most popular applications of CNNs. For example, the CIFAR 10 data-set contains around 60,000 labeled images (cat, dog, etc.). The data-set is broken up into a training batch of 50,000 images and a validation batch of 10,000 images. There are 10 separate classes the networks are trained to differentiate between. An image is fed into the network and the probabilities of belonging to each of the defined categories are determined. In order to use these networks for position estimation, they must be modified to perform regression instead of classification.

1.2 Related Work

The state-of-the-art results in this field are presented in [12]. This work uses a large database of features extracted from reference satellite imagery. The feature extraction algorithms used include Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), and Binary Robust Invariant Scalable Keypoints (BRISK) [13, 14, 15]. Features are extracted from images obtained during flight and compared to the features stored in the reference database. Given matching features and an accurate camera matrix, the position of the aircraft can be calculated. This method provides very accurate position estimates of the aircraft, but requires access to a large database of predefined features and only works when an image contains an acceptable number of features for comparison. The use of machine learning to solve this problem would eliminate the need for a large database of features and would provide a position estimate for all images.

The networks described in the literature discussed above are evaluated for the task of image classification, which is useful in a geo-location task like PlaNet [16], where the authors attempt to classify images into predefined categories around the globe. That work presents promising results: the network accurately classifies landmark buildings as well as images containing only mountain ranges, waterfalls, or beaches. The work used 2.3M geo-tagged Flickr images taken from ground level around the world.

In [17] the 7-Scenes Data-set [18] is used to train a recurrent model for 6 DOF localization. This data-set contains RGB-D image sequences of 7 different indoor environments. The authors use a CNN to extract relevant features from images and a long short-term memory network to incorporate temporal information.

1.3 Approach

In this work, regression networks are trained to estimate the position of a drone in a virtual environment. Modified versions of several networks are trained using a data-set created from the AirSim plugin [19] to the Unreal Game Engine [20]. The Unreal Game Engine provides a state-of-the-art virtual environment, specifically designed for the creation of artificial intelligence (AI) training data. The AirSim plugin is a simulator for drones and cars built upon the Unreal Game Engine. In this work AirSim is used to create a data-set of virtual aerial images for training regression networks. The regression problem in this work is to determine the position of the drone with only 2 degrees of freedom, the north and east positions. Many of the classification networks described in recent literature are modified for this task. A hybrid network is also presented, combining techniques from other works, which outperforms the standard network benchmarks.

The following chapter discusses background information on Neural Networks and Convolutional Neural Networks. Chapter III discusses the creation of the data-set used for training and validation. Chapter IV explains the networks used, the modifications made, and the training methods. Chapter V presents the benchmark position estimation results using the artificial data-set. Chapter VI presents concluding remarks and plans for future research.

CHAPTER II

BACKGROUND

2.1 Neural Networks

Neural Networks have been around for many years. The creation of the perceptron was one of the first attempts to use a computer to model how the brain works [21]. The perceptron in Figure 2.1 takes inputs x_0, x_1, ..., x_m; each input is multiplied by a weight w_0, w_1, ..., w_m and summed,

a = \sum_{i=1}^{m} x_i w_i + b    (2.1)

where the output of the summation, a, is passed through an activation function,

y = f(a)    (2.2)

where y is the output of the perceptron. There are many common activation functions used in practice. These include the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU). The activation function most commonly used in this work is the ReLU function.

Figure 2.1: Perceptron
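As a concrete illustration of Equations 2.1 and 2.2, the short NumPy sketch below computes the output of a single perceptron with a ReLU activation; the input, weight, and bias values are arbitrary and chosen only for illustration.

    import numpy as np

    def relu(a):
        # Rectified linear unit: max(0, a).
        return np.maximum(0.0, a)

    def perceptron(x, w, b):
        # Weighted sum of the inputs (Equation 2.1) followed by the
        # activation function (Equation 2.2).
        a = np.dot(x, w) + b
        return relu(a)

    x = np.array([0.5, -1.2, 3.0])   # example inputs
    w = np.array([0.8, 0.1, -0.4])   # example weights
    b = 0.2                          # bias
    y = perceptron(x, w, b)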

The perceptron is the building block of Neural Networks, also called multilayer perceptrons (MLPs). An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer, as shown in Figure 2.2.

Figure 2.2: Artificial Neural Network (Multilayer Perceptron)

2.2 Back-Propagation

The process of back-propagation attempts to minimize an error function, with respect to the weights in the network, using gradient descent. Back-propagation is a method to compute gradients of an expression through recursive application of the chain rule. After the gradient is calculated using back-propagation, gradient descent is used to minimize the error function. Ideally the error is minimized to a global minimum, not just a local minimum. The mean squared error function is described in Equation 2.3, where Y_i is the truth and \hat{Y}_i is the prediction,

E = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2    (2.3)

The error function is a continuous and differentiable function of the m weights in the network, w_1, w_2, w_3, ..., w_m. The error E can be minimized by the iterative process of gradient descent; first the gradient must be calculated,

\nabla E = \left( \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, ..., \frac{\partial E}{\partial w_m} \right)    (2.4)

The value by which each weight in the network is updated is then calculated,

\Delta w_i = -\gamma \frac{\partial E}{\partial w_i} \quad \text{for } i = 1, ..., m    (2.5)

where \gamma is the learning rate. Finally, each weight is updated based on the calculated value,

w_i = w_i + \Delta w_i    (2.6)
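To make the update rule concrete, the NumPy sketch below performs gradient descent on the mean squared error of a simple linear model; the gradient is computed analytically rather than by full back-propagation, and the data values are made up for illustration.

    import numpy as np

    # Toy data: inputs X and targets Y (made-up values for illustration).
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
    Y = np.array([5.0, 4.0, 11.0, 10.0])

    w = np.zeros(2)   # weights w_1, w_2
    b = 0.0           # bias
    gamma = 0.01      # learning rate

    for step in range(1000):
        Y_hat = X.dot(w) + b                       # predictions
        error = Y_hat - Y
        # Gradient of E = (1/n) * sum((Y_i - Y_hat_i)^2) with respect to w and b.
        grad_w = 2.0 / len(Y) * X.T.dot(error)
        grad_b = 2.0 / len(Y) * error.sum()
        # Weight update of Equations 2.5 and 2.6: w_i = w_i - gamma * dE/dw_i.
        w -= gamma * grad_w
        b -= gamma * grad_b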

2.3 Pooling

Pooling layers are used to decrease the size of feature maps as a network gets deeper. There are a few commonly used pooling methods: max pooling, average pooling, and minimum pooling. Figure 2.3 shows the effect of a 2x2 max pooling filter applied with a stride of 2. The initial feature map is 4x4 and each color represents one application of the max pooling filter. The max pooling filter takes the largest number within the filter, minimum pooling takes the smallest, and average pooling computes the average of all numbers within the filter.

Figure 2.3: Max Pool with 2x2 Filter and Stride of 2
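For reference, the following minimal NumPy sketch applies 2x2 max pooling with a stride of 2 to a 4x4 feature map, mirroring the operation illustrated in Figure 2.3; the input values are arbitrary.

    import numpy as np

    def max_pool_2x2(feature_map):
        # Apply a 2x2 max pooling filter with a stride of 2.
        h, w = feature_map.shape
        pooled = np.zeros((h // 2, w // 2))
        for i in range(0, h, 2):
            for j in range(0, w, 2):
                pooled[i // 2, j // 2] = feature_map[i:i + 2, j:j + 2].max()
        return pooled

    fmap = np.array([[1, 3, 2, 4],
                     [5, 6, 1, 2],
                     [7, 2, 8, 3],
                     [1, 4, 9, 5]], dtype=float)
    print(max_pool_2x2(fmap))   # 2x2 output: [[6, 4], [7, 9]]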

2.4 Dropout

Dropout is a simple and effective regularization technique used when training a neural network [22]. The neural network structure using dropout is shown in Figure 2.4. The dropout technique keeps only a percentage of nodes in the network active during training; the outputs of the remaining nodes are set to zero. The nodes with X's in Figure 2.4 are nodes that are currently inactive; during the training process only the active nodes' weights are altered. This technique can be used alongside other regularization techniques to prevent over-fitting to the training data.

Figure 2.4: Neural Network with Dropout
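In the Keras API used later in this work, dropout is applied as a layer between other layers; a minimal sketch, where the layer sizes are illustrative and the 5 percent rate matches the value used for training in Chapter IV:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    model = Sequential()
    model.add(Dense(128, activation='relu', input_shape=(64,)))
    # Randomly sets 5 percent of the incoming activations to zero during training.
    model.add(Dropout(0.05))
    model.add(Dense(2))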

2.5 L2 Regularization

The L2 norm discourages large weights through an element-wise quadratic penalty over all parameters,

R(W) = \sum_{i=1}^{m} W_i^2    (2.7)

The regularization value calculated from all weights in the network is then added to the error function,

E = \frac{1}{n} \sum_{i=1}^{n} E_i + \lambda R(W)    (2.8)

where E_i is the error of the i-th training example and n is the total number of training examples used. One benefit of L2 regularization is the fact that penalizing large weights tends to lead to better generalization, because no single input dimension can have a large impact on the errors by itself.
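In Keras, this penalty can be attached to individual layers through a kernel regularizer; a minimal sketch, where the weight decay value standing in for λ is an arbitrary example and not a value reported in this work:

    from keras.layers import Dense
    from keras.regularizers import l2

    # Adds lambda * sum(W_i^2) over this layer's weights to the training loss.
    output_layer = Dense(2, kernel_regularizer=l2(1e-4))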

2.6 Convolutional Neural Networks

Traditional neural networks are not ideal for working with images. They don't generalize very well to variations in images, and the number of parameters in a traditional neural network can become very large, within just a few layers, when working with images. The Convolutional Neural Network (CNN) is a much more efficient way to solve image-based problems. Convolution layers with different strides allow the reduction of the feature map sizes, and therefore the reduction of the number of parameters necessary in the network. Convolution layers allow the network to learn spatially relevant features. The early layers learn basic features, like edge detectors. As the network gets deeper it learns more complex features within the image.

CHAPTER III

DATA-SET CREATION

The data-set used in this work is created using the Unreal game engine. The Landscape Mountains environment, shown in Figure 3.1, is the specific environment used for data collection. This environment contains snowy mountains, valleys, frozen lakes, and forests. It is available for free through the Unreal Engine marketplace. The AirSim plugin allows a drone to be placed anywhere on the map, and the AirSim API allows for programmatic control using Python. The standard drone has 5 different default camera views: front center, front right, front left, bottom center, and back center. For this work the bottom center camera is used to acquire images of the ground directly below. The environment is approximately 2.5 kilometers by 2.5 kilometers.

Traditional labeled data-sets are very costly and time consuming to produce. Problems such as object detection require a significant amount of data labeled by humans for training. The use of the Unreal Engine makes the data-set creation for this work unique in the sense that training data can be created very quickly with no additional human effort to produce corresponding labels. Since the drone can be controlled using a Python script, image collection is relatively easy. The AirSim API also provides the current position of the drone within the environment. Each time an image is captured, the position and attitude data of the drone is stored as a corresponding image label.
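A rough sketch of one step of this collection loop is shown below, assuming the AirSim Python client; exact call and camera names vary between AirSim versions, so this should be read as an illustration rather than the exact script used for this work.

    import airsim

    client = airsim.MultirotorClient()
    client.confirmConnection()

    samples = []
    # Request one image from the bottom center camera and record the pose.
    responses = client.simGetImages([
        airsim.ImageRequest("bottom_center", airsim.ImageType.Scene)
    ])
    pose = client.simGetVehiclePose()
    samples.append({
        "image": responses[0].image_data_uint8,   # compressed image bytes
        "north": pose.position.x_val,             # NED convention: x is North
        "east": pose.position.y_val,              # NED convention: y is East
    })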

Figure 3.1: The Landscape Mountains Environment available through Unreal Engine

3.1 Coordinate System

The coordinate system used is a North, East, and Down (NED) coordinate system. The attitude of the aircraft is as shown in Figure 3.2. The origin point, (0,0,0), for the drone's flight is relative to the starting point of the drone. The drone is placed in the environment in reference to a global coordinate system within the Unreal Engine. It is important to make note of the starting position of the drone, so that if the drone's starting location is moved the coordinate system used for training can be recovered.

Figure 3.2: Aircraft Pose

3.2 Environment Sampling

The Landscape Mountains environment is sampled in a grid pattern at 150 meter intervals, in both the north and east directions, at an 800 meter altitude. Image resolution is 224x224 pixels. At each sample point, images from 0 to 360 degrees yaw are collected at 45 degree intervals. The AirSim API provides a gimbal functionality, which is used to ensure the camera points directly at the ground regardless of the pitch or roll of the drone itself. The complete data-set contains about 26,000 images stored in an HDF5 file with the corresponding location and yaw values. Every image in the data-set is unique, in either its location on the sampling grid or the drone rotation at the same sampling point. The data-set and labels are randomly shuffled; 80 percent of the data becomes the training set and the remaining 20 percent becomes the validation and inference set.
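A minimal sketch of how such a data-set could be shuffled and split with h5py and NumPy is shown below; the file name and field names are hypothetical, not the ones used in this work.

    import h5py
    import numpy as np

    # Hypothetical layout: 'images' of shape (N, 224, 224, 3) and
    # 'labels' of shape (N, 2) holding [north, east] in meters.
    with h5py.File("landscape_mountains.h5", "r") as f:
        images = f["images"][:]
        labels = f["labels"][:]

    # Shuffle, then take 80 percent for training and 20 percent for validation/inference.
    rng = np.random.default_rng(seed=0)
    order = rng.permutation(len(images))
    split = int(0.8 * len(images))
    train_x, val_x = images[order[:split]], images[order[split:]]
    train_y, val_y = labels[order[:split]], labels[order[split:]]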

CHAPTER IV

NETWORK DESIGN AND TRAINING

The field of Deep Learning has received a great deal of attention over the past few years; as a result, there have been many different software libraries specifically developed for the field. These libraries include Keras, TensorFlow, Theano, CNTK, Pytorch [23, 24, 25, 26, 27], and many others. The two most popular frameworks appear to be TensorFlow and Pytorch. Each library is very well supported and allows users to develop, train, and test models in a reasonable amount of time. Section 4.1 discusses the libraries used in this work and the reasons they were selected.

4.1 Software Libraries

All network training is done using the Keras API with a TensorFlow back-end. Keras describes itself as "a high level neural network API written in Python capable of running on top of TensorFlow, CNTK, or Theano" [23]. It is designed in a very modular fashion, making it quick and easy to develop new models or modify existing models. The Keras library has open code for many well known published models, including many of the models tested in this work. This enabled the modification of various networks for a regression application.

According to TensorFlow's website, it is an open source software library for high performance numerical computation. It allows for easy deployment of computation to CPUs, GPUs, or TPUs. The library was originally developed by the researchers and engineers on the Google Brain team [24]. There is also strong support for machine learning and deep learning. This makes the TensorFlow library ideal for this research. The Keras front-end allows for quick development of the models and the TensorFlow back-end allows for quick training on GPUs.

4.2 Network Architecture

Most of the networks trained are slightly modified versions of existing networks, along with one network that combines desirable aspects of two different networks. The existing networks used are DenseNet 121, DenseNet 201 [2], Inception V3 [4], ResNet 50 [1], and MobileNet [3]; the network that combines aspects of the MobileNet architecture and the DenseNet architecture is referred to as MobileNet Dense.

4.2.1 ResNet

The ResNet architecture implements shortcut connections in an effort to prevent performance degradation as a network gets deeper. As a traditional network gets deeper, the information about the gradient or input can deteriorate. The problem of performance degradation is an interesting one. In theory, a shallow network can be modified, by adding on an arbitrary number of identity mapping layers, and still obtain the exact same training error. In practice this is not the case, but the use of residual building blocks helps solve this issue. A residual building block, presented in the ResNet paper [1], is shown in Figure 4.1. The shortcut layers perform an identity mapping and their outputs are added to the output of the stacked layers.

Figure 4.1: Residual Block [1]

4.2.2 DenseNet

The DenseNet architecture is another attempt to solve the same problem as ResNets. The authors recognize that it is crucial to pass information contained in early layers to later layers in as short of a path as possible. The architecture proposed in the DenseNet paper is shown in Figure 4.2. Given a dense block with input x_0 and L layers, the output of the l-th layer is described by Equation 4.1,

x_l = H_l([x_0, x_1, ..., x_{l-1}])    (4.1)

The input to the l-th layer is all of the preceding feature maps concatenated together. The authors of the DenseNet paper argue that having a direct connection to previous layers, instead of summing the previous layer output and the identity function like ResNets, will allow better information flow within the network.

Figure 4.2: DenseNet Architecture with three dense blocks [2]

4.2.3 MobileNet

The advantage of the MobileNet architecture is the reduction in computational complexity. The computational savings of the MobileNet architecture come from the fact that depthwise separable convolutions are used instead of standard convolutions. A visualization of a depthwise separable convolution compared to a standard convolution is shown in Figure 4.3. A depthwise convolution kernel is applied to each of the input channels individually, as shown in Figure 4.3 (b); then a pointwise convolution, Figure 4.3 (c), is applied to generate the new feature maps for the next layer. The authors of MobileNet claim that the 3x3 separable convolutions use between 8 and 9 times less computation than standard convolutions with only a slight reduction in accuracy.

The reduction in computational complexity is a very attractive aspect of the MobileNet architecture, given that the end goal of this work is to develop a network to be used on an airplane or drone for navigation. In that environment it is very important to consume as little space, power, and weight as possible. The MobileNet architecture is designed specifically with these types of limitations in mind. The network is designed to run on mobile devices with limited processing capabilities and battery life.

Figure 4.3: MobileNet Depthwise Separable Convolutions [3]

4.2.4 Inception

The Inception architecture is another attempt to reduce the computational complexity of the convolution process in the networks. The authors of the Inception architecture argue that convolution kernels can be broken down into components to improve computational efficiency. The authors argue that filters larger than 3x3 are not particularly useful, since they can always be reduced into a sequence of 3x3 convolutions. It is also argued that an arbitrary nxn convolution can be replaced by a 1xn convolution followed by an nx1 convolution, and the computational savings increase significantly as n is increased.

Figure 4.4: Generic Inception Model Replacing n x n Convolution Filter [4]

4.2.5 MobileNet Dense

The MobileNet Dense architecture combines aspects of the traditional MobileNet architecture and the DenseNet architecture, as the name suggests. The MobileNet architecture is designed to be computationally efficient and the DenseNet architecture allows for better information flow within the network. Any points within the MobileNet architecture containing consecutive feature maps of the same size were made into dense blocks, like those shown in Figure 4.2.

4.3 Training

All networks are trained using back-propagation [28]. Each network is trained for between 250 and 300 epochs. The loss function used for each of the networks is the Mean Squared Error of Equation 4.2, where Y_i is the actual drone location and \hat{Y}_i is the predicted drone location,

MSE = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2    (4.2)

The gradient descent optimization algorithm used is RMSProp [29] with an initial learning rate of 0.0002. Each network is regularized using both L2 regularization and dropout of 5 percent. The learning rate is reduced on plateau using the built-in Keras function, with the patience set to 2 epochs for all networks. A batch size of 32 is used.
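A minimal sketch of this training configuration in Keras is shown below; the model variable and data arrays are placeholders, and the callback arguments beyond the patience value are assumptions rather than settings reported in this work.

    from keras.optimizers import RMSprop
    from keras.callbacks import ReduceLROnPlateau

    # Compile with MSE loss and RMSProp at the initial learning rate of 0.0002.
    model.compile(optimizer=RMSprop(lr=0.0002), loss='mse')

    # Reduce the learning rate when the validation loss plateaus (patience of 2 epochs).
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', patience=2, factor=0.5)

    model.fit(train_x, train_y,
              validation_data=(val_x, val_y),
              batch_size=32, epochs=300,
              callbacks=[reduce_lr])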

4.4 Preserving Spatial Information

In a CNN designed for classification, the presence of certain features is the most important aspect in determining which class the image belongs to. Given an image of a dog, a classification network may identify there are paws, floppy ears, and a snout in the image, hopefully resulting in the image being classified as a dog. The ideal performance of such a network wouldn't care where the features of the dog are in the image, just that the features are present.

Basic intuition tells us the image based navigation problem is the opposite of this. In order to accurately estimate the drone's location, it is important to have access to more detailed information regarding feature locations within the image. The locations of specific features within a frame relative to each other give a great deal of information about location and pose. As humans navigating based solely on the visual information around us, we learn key features and where they are relative to other features to determine our location.

One technique detrimental to maintaining spatial information as the network progresses is the use of pooling layers. Max pooling is a commonly used pooling layer and has been proven to work very well for classification, but it disregards a great deal of information about the image. The amount of information lost depends on the kernel size and stride. Instead, a larger convolution stride should be used to down sample the size of a feature map.
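The two downsampling options can be written side by side in Keras; a minimal sketch with an arbitrary filter count, contrasting a max pooled path with a strided convolution that halves the feature map size while keeping a learned, spatially aware mapping:

    from keras.layers import Conv2D, MaxPooling2D

    # Option 1: convolution followed by max pooling (spatial detail discarded by the pool).
    conv_then_pool = [Conv2D(64, (3, 3), padding='same', activation='relu'),
                      MaxPooling2D(pool_size=(2, 2), strides=2)]

    # Option 2: a single strided convolution that downsamples while learning the mapping.
    strided_conv = [Conv2D(64, (3, 3), strides=2, padding='same', activation='relu')]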

The use of shortcut connections, like those used in the ResNet architecture, or the use of dense blocks, like those used in the DenseNet architecture, helps pass relevant spatial information deeper into the network. As the networks get deeper and feature maps are down sampled, spatial information is lost. The feature maps deeper in a network detect the presence of complex features, not necessarily their location within the image. The use of shortcut connections or dense blocks gives layers deeper within the network access to the earlier feature maps, which ideally will improve the amount of spatial information retained by the network.

4.5 Network Modifications

The modifications made to the networks for regression are relatively simple. Most of the original networks have an average pooling layer followed by a 1000 node fully connected layer, since the ImageNet challenge has a thousand different classes; the fully connected layer is then fed into a softmax activation function. These final layers are stripped from the network and replaced with a flatten layer after the last convolution layer, followed by a dropout layer, followed finally by a fully connected layer with 2 nodes. One node is associated with the north position of the drone, the other with the east position of the drone. The modifications to the MobileNet architecture are shown in Tables 4.1 and 4.2. The same modifications are made to all other networks tested. Notice there is no longer an activation function following the final fully connected layer when performing regression.

Table 4.1: Original MobileNet Architecture

Type / Stride      Filter Shape            Input Size
Conv / s2          3 x 3 x 3 x 32          224 x 224 x 3
Conv dw / s1       3 x 3 x 32 dw           112 x 112 x 32
Conv / s1          1 x 1 x 32 x 64         112 x 112 x 32
Conv dw / s2       3 x 3 x 64 dw           112 x 112 x 64
Conv / s1          1 x 1 x 64 x 128        56 x 56 x 64
Conv dw / s1       3 x 3 x 128 dw          56 x 56 x 128
Conv / s1          1 x 1 x 128 x 128       56 x 56 x 128
Conv dw / s2       3 x 3 x 128 dw          56 x 56 x 128
Conv / s1          1 x 1 x 128 x 256       28 x 28 x 128
Conv dw / s1       3 x 3 x 256 dw          28 x 28 x 256
Conv / s1          1 x 1 x 256 x 256       28 x 28 x 256
Conv dw / s2       3 x 3 x 256 dw          28 x 28 x 256
Conv / s1          1 x 1 x 256 x 512       14 x 14 x 256
5x Conv dw / s1    3 x 3 x 512 dw          14 x 14 x 512
5x Conv / s1       1 x 1 x 512 x 512       14 x 14 x 512
Conv dw / s2       3 x 3 x 512 dw          14 x 14 x 512
Conv / s1          1 x 1 x 512 x 1024      7 x 7 x 512
Conv dw / s2       3 x 3 x 1024 dw         7 x 7 x 1024
Conv / s1          1 x 1 x 1024 x 1024     7 x 7 x 1024
Avg Pool / s1      Pool 7 x 7              7 x 7 x 1024
FC / s1            1024 x 1000             1 x 1 x 1024
Softmax / s1       Classifier              1 x 1 x 1000

Table 4.2: MobileNet Architecture for Regression

Type / Stride      Filter Shape            Input Size
Conv / s2          3 x 3 x 3 x 32          224 x 224 x 3
Conv dw / s1       3 x 3 x 32 dw           112 x 112 x 32
Conv / s1          1 x 1 x 32 x 64         112 x 112 x 32
Conv dw / s2       3 x 3 x 64 dw           112 x 112 x 64
Conv / s1          1 x 1 x 64 x 128        56 x 56 x 64
Conv dw / s1       3 x 3 x 128 dw          56 x 56 x 128
Conv / s1          1 x 1 x 128 x 128       56 x 56 x 128
Conv dw / s2       3 x 3 x 128 dw          56 x 56 x 128
Conv / s1          1 x 1 x 128 x 256       28 x 28 x 128
Conv dw / s1       3 x 3 x 256 dw          28 x 28 x 256
Conv / s1          1 x 1 x 256 x 256       28 x 28 x 256
Conv dw / s2       3 x 3 x 256 dw          28 x 28 x 256
Conv / s1          1 x 1 x 256 x 512       14 x 14 x 256
5x Conv dw / s1    3 x 3 x 512 dw          14 x 14 x 512
5x Conv / s1       1 x 1 x 512 x 512       14 x 14 x 512
Conv dw / s2       3 x 3 x 512 dw          14 x 14 x 512
Conv / s1          1 x 1 x 512 x 1024      7 x 7 x 512
Conv dw / s2       3 x 3 x 1024 dw         7 x 7 x 1024
Conv / s1          1 x 1 x 1024 x 1024     7 x 7 x 1024
Flatten / s1       Flatten 7 x 7 x 1024    7 x 7 x 1024
FC / s1            50176 x 2               1 x 1 x 50176
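A sketch of this modification in Keras is shown below, using the library's MobileNet implementation with its classification head removed; the dropout rate follows the training setup described above, but the exact construction here is illustrative rather than the verbatim code used in this work.

    from keras.applications import MobileNet
    from keras.layers import Flatten, Dropout, Dense
    from keras.models import Model

    # MobileNet body only: the average pool, 1000-way FC, and softmax are not included.
    base = MobileNet(input_shape=(224, 224, 3), include_top=False, weights=None)

    # Regression head: flatten the 7 x 7 x 1024 feature maps, apply dropout,
    # and predict the two outputs (north and east position) with no activation.
    x = Flatten()(base.output)
    x = Dropout(0.05)(x)
    position = Dense(2, activation=None)(x)

    model = Model(inputs=base.input, outputs=position)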

CHAPTER V

RESULTS

In most of the literature, network performances are compared using a well known public data-set, such as CIFAR 10, CIFAR 100, or ImageNet. Since these data-sets are not applicable to the navigation problem presented in this work, a number of existing networks are tested on the custom Landscape Mountains data-set for a fair comparison between all networks. The following sections present the training and validation error results of all networks. Statistical analysis of the inference results, using 5120 image samples, is also shown for each network individually.

5.1 Training and Validation Results

The training results of the networks are displayed in Figure 5.1 and the validation results are displayed in Figure 5.2. The loss metric used in this case is the Euclidean distance between the estimated drone location and the actual drone location in meters. The training results in Figure 5.1 show that most of the networks reached an optimal solution within about 25 epochs. The MobileNet Dense network took significantly longer during training, but after around 110 epochs it reaches the lowest training error out of all networks tested.

Analyzing the training results, most of the networks appear to have similar training error, with ResNet 50 being the worst performer by a decent margin and MobileNet Dense being the best performer by a slight margin.

24 Figure 5.1: Training Results (Loss is Euclidean Distance in Meters)

The validation error results, shown in Figure 5.2, appear to match the training results relatively well. The estimation error is noisier from epoch to epoch for all networks, but there doesn't appear to be much over-fitting, since the training and validation errors end up being about the same. The MobileNet Dense validation results appear surprisingly steady from epoch to epoch compared to the other networks tested.

25 Figure 5.2: Validation Results (Loss is Euclidean Distance in Meters)
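For reference, the Euclidean distance used as the loss in Figures 5.1 and 5.2 can be written as a small Keras backend metric; a minimal sketch:

    from keras import backend as K

    def euclidean_distance(y_true, y_pred):
        # Mean straight-line distance, in meters, between the true and
        # predicted (north, east) positions in a batch.
        return K.mean(K.sqrt(K.sum(K.square(y_true - y_pred), axis=-1)))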

5.2 Inference Analysis

The following subsections detail the inference results of each network for 2560 samples. North and east error values for each network are presented individually in a box plot and discussed.

26 5.2.1 ResNet 50

The results of the ResNet 50 network are shown in Figure 5.3. The mean of the north dimension is -3.61 meters, fifty percent of error values fall between -17.35 and 12.25 meters, and the majority of the error values fall between -61.75 and 56.65 meters. The mean of the east dimension is 3.20 meters, fifty percent of the error values fall between -10.88 and 16.50 meters, and the majority of the error values fall between -51.95 and 57.57 meters.

Figure 5.3: ResNet50 Box Plot of Inference Errors (2560 samples in meters)

In Figure 5.4, two of the images with the best drone position estimates are shown. The prediction error of both of these images is less than 1 meter. The images have distinctive features in them, most notably the position of the lake.

Figure 5.4: Images of two of the best predictions of ResNet 50. Image (a) with 0.558m error. Image (b) with 0.818m error

In Figure 5.5, two of the images with the worst drone position estimates are shown. The prediction error of both is hundreds of meters. The images appear to have few distinctive features and some cloud cover.

Figure 5.5: Images of two of the worst predictions of ResNet 50. Image (a) with 418.737m error. Image (b) with 215.439m error

28 5.2.2 DenseNet 121

The results of the DenseNet 121 network are shown in Figure 5.6. The mean of the north dimension is -1.33 meters, fifty percent of error values fall between -9.67 and 5.77 meters, and the majority of the error values fall between -32.83 and 28.92 meters. The mean of the east dimension is 0.43 meters, fifty percent of the error values fall between -9.78 and 9.00 meters, and the majority of the error values fall between -37.94 and 37.16 meters.

Figure 5.6: DenseNet121 Box Plot of Inference Errors (2560 samples in meters)

In Figure 5.7, two of the images with the best drone position estimates are shown. The prediction errors of both of these images are quite low. The images have distinctive features in them, most notably the position of the lake.

Figure 5.7: Images of two of the best predictions of DenseNet 121. Image (a) with 0.773m error. Image (b) with 1.154m error

In Figure 5.8, two of the images with the worst drone position estimates are shown. The prediction error of both is hundreds of meters. The images appear to have few distinctive features and some cloud cover.

Figure 5.8: Images of two of the worst predictions of DenseNet 121. Image (a) with 373.674m error. Image (b) with 190.433m error

30 5.2.3 MobileNet

The results of the MobileNet network are shown in Figure 5.9. The mean of the north dimension is 0.92 meters, fifty percent of error values fall between -8.52 and 9.18 meters, and the majority of the error values fall between -35.07 and 35.73 meters. The mean of the east dimension is 3.80 meters, fifty percent of the error values fall between -3.94 and 10.85 meters, and the majority of the error values fall between -26.12 and 33.04 meters.

Figure 5.9: MobileNet Box Plot of Inference Errors (2560 samples in meters)

In Figure 5.10, two of the images with the best drone position estimates are shown. The prediction errors of both of these images are low. These images also have the lake as a reference.

Figure 5.10: Images of two of the best predictions of MobileNet. Image (a) with 0.940m error. Image (b) with 1.089m error

In Figure 5.11, two of the images with the worst drone position estimates are shown. The images also appear to have few distinctive features and some cloud cover.

Figure 5.11: Images of two of the worst predictions of MobileNet. Image (a) with 514.214m error. Image (b) with 404.517m error

32 5.2.4 DenseNet 201

The results of the DenseNet 201 network are shown in Figure 5.12. The mean of the north dimension is 0.37 meters, fifty percent of error values fall between -7.05 and 7.84 meters, and the majority of the error values fall between -29.38 and 30.17 meters. The mean of the east dimension is 0.79 meters, fifty percent of the error values fall between -7.08 and 8.20 meters, and the majority of the error values fall between -30.01 and 31.12 meters.

Figure 5.12: DenseNet201 Box Plot of Inference Errors (2560 samples in meters)

In Figure 5.13, two of the images with the best drone position estimates are shown. The prediction errors of both of these images are quite low. These images also have the lake as a reference.

Figure 5.13: Images of two of the best predictions of DenseNet 201. Image (a) with 0.271m error. Image (b) with 0.499m error

In Figure 5.14, two of the images with the worst drone position estimates are shown. The images also appear to have few distinctive features and some cloud cover.

Figure 5.14: Images of two of the worst predictions of DenseNet 201. Image (a) with 191.6m error. Image (b) with 106.223m error

34 5.2.5 Inception V3

The results of the Inception V3 network are shown in Figure 5.15. The mean of the north dimension is -1.18 meters, fifty percent of error values fall between -6.84 and 4.71 meters, and the majority of the error values fall between -24.16 and 22.03 meters. The mean of the east dimension is -0.68 meters, fifty percent of the error values fall between -6.28 and 4.47 meters, and the majority of the error values fall between -22.41 and 20.60 meters.

Figure 5.15: Inception V3 Box Plot of Inference Errors (2560 samples in meters)

In Figure 5.16, two of the images with the best drone position estimates are shown. The prediction errors of both of these images are below 1 meter. These images also have the lake as a reference.

Figure 5.16: Images of two of the best predictions of Inception V3. Image (a) with 0.624m error. Image (b) with 0.752m error

In Figure 5.17, two of the images with the worst drone position estimates are shown. The images also appear to have few distinctive features and some cloud cover.

Figure 5.17: Images of two of the worst predictions of Inception V3. Image (a) with 440.959m error. Image (b) with 156.946m error

36 5.2.6 MobileNet Dense

The results of the MobileNet Dense network are shown in Figure 5.18. The mean of the north dimension is -1.09 meters, fifty percent of error values fall between -2.51 and 1.85 meters, and the majority of the error values fall between -9.05 and 8.39 meters. The mean of the east dimension is -0.54 meters, fifty percent of the error values fall between -2.48 and 1.82 meters, and the majority of the error values fall between -8.93 and 8.27 meters.

Figure 5.18: MobileNet Dense Box Plot of Inference Errors (2560 samples in meters)

In Figure 5.19, two of the images with the best drone position estimates are shown. The prediction errors of both of these images are well below 1 meter. These images also have the lake as a reference.

Figure 5.19: Images of two of the best predictions of MobileNet Dense. Image (a) with 0.105m error. Image (b) with 0.205m error

In Figure 5.20, two of the images with the worst drone position estimates are shown. The images appear to be of peaks of the mountain ranges with few distinct features.

Figure 5.20: Images of two of the worst predictions of MobileNet Dense. Image (a) with 148.637m error. Image (b) with 104.907m error

38 5.2.7 All Networks

Table 5.1 displays statistics of the Euclidean distance error between the estimated and actual drone positions for all networks. The table describes the mean, standard deviation, minimum error, and maximum error in meters. The table also includes the cutoff points at the 25, 50, and 75 percent marks of the error values. For example, 25 percent of the estimation errors from the DenseNet 121 network fall below 9.396 meters, 50 percent below 14.567 meters, and so on.

Table 5.1: Mean, Standard Deviation, Minimum, 25 Percent, 50 Percent, 75 Percent, and Max Validation Absolute Errors (Meters)

Network            Mean     Std      Min     25%     50%     75%     Max
DenseNet-121       17.489   15.299   0.647   9.396   14.567  21.449  373.674
DenseNet-201       15.145   11.373   0.256   7.840   12.970  19.357  194.037
Inception V3       11.378   10.444   0.243   6.171   9.637   14.210  440.961
ResNet 50          29.529   23.002   0.280   16.114  25.493  37.853  505.547
MobileNet          18.001   20.644   0.550   8.714   14.081  21.734  514.214
MobileNet Dense    5.643    10.834   0.079   2.34    3.572   5.533   230.558

The training, validation, and inference errors all show that the best performing network on this data-set is the MobileNet Dense, performing more than twice as well as any of the other networks tested. The DenseNet 121, DenseNet 201, Inception V3, and standard MobileNet networks all performed within a relatively close margin. The worst network tested for this data-set appears to be ResNet 50, with significantly higher inference errors as well as very noisy validation results, as seen in Figure 5.2.

CHAPTER VI

CONCLUSION

6.1 Summary

This work benchmarks six different networks for the use of drone position estimation in a virtual environment. The best individual inference result, from the MobileNet Dense network, is below 0.1 meters of position prediction error, and the best mean absolute error is 5.643 meters. The MobileNet Dense architecture is a fusion of the computational efficiency of MobileNet and the feed forward architecture of DenseNets. The feed forward architecture allows for the preservation of basic spatial information deeper within the network, which is important for the task of navigation. This work shows promising results for the use of CNNs to predict position in a flight environment. The initial testing of the 2 DOF networks shows relatively low position estimation errors, but there is still a lot of work to be done to create a functional real world model.

6.2 Future/Current Work

The work presented here is all done using an artificial data-set. The results of testing on the artificial data-set are promising, and current work is being done to train these regression networks on real world data. The area of interest is an approximately 15 x 15 km region around Dayton, Ohio. These networks are also being modified for regression of 6 degrees of freedom. The problem of predicting North, East, Down, Roll, Pitch, and Yaw is much more complex than just the North and East regression problem presented in this work. Training is being done using many different satellite images of the Dayton area. The current position error using the deep learning approach on Dayton data is currently about 3 times worse than the state of the art method.

Current areas of interest include optimizing the loss function used to train the network, such as in [30], where the authors apply different loss functions for position and attitude. The network is trained on a weighted sum of the two loss functions, where the weight value can either be a learned parameter or manually tuned. Another important area of research moving forward is producing an uncertainty value along with the position and attitude estimate. The current approach to achieve this goal is the use of Mixture Density Networks (MDN) [31]. The MDN can be appended to the output of any of the CNNs discussed in this work. After the network is trained, the MDN represents the conditional probability density function of the target variables conditioned on the input image. This will allow the uncertainty in position and pose estimation to be determined.

BIBLIOGRAPHY

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

[2] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, vol. 1, no. 2, 2017, p. 3.

[3] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.

[4] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.

[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

[10] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," arXiv preprint arXiv:1610.02357, 2017.

[11] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.

[12] D. T. Venable, “Improving real world performance of vision aided navigation in a flight environment,” Air Force Institute of Technology WPAFB, Tech. Rep., 2016.

[13] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.

[14] H. Bay, T. Tuytelaars, and L. Van Gool, "Surf: Speeded up robust features," in European conference on computer vision. Springer, 2006, pp. 404–417.

[15] S. Leutenegger, M. Chli, and R. Y. Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 2548–2555.

[16] T. Weyand, I. Kostrikov, and J. Philbin, “Planet-photo geolocation with convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 37–55.

[17] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen, "Vidloc: A deep spatio-temporal model for 6-dof video-clip relocalization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 3, 2017.

[18] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon, “Scene coordinate regression forests for camera relocalization in rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2930– 2937.

[19] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and service robotics. Springer, 2018, pp. 621–635.

[20] B. Karis and E. Games, “Real shading in unreal engine 4,” Proc. Physically Based Shading Theory Practice, pp. 621–635, 2013.

[21] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review, vol. 65, no. 6, p. 386, 1958.

[22] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[23] F. Chollet et al., “Keras,” https://keras.io, 2015.

[24] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[25] Theano Development Team, "Theano: A Python framework for fast computation of mathematical expressions," arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available: http://arxiv.org/abs/1605.02688

[26] F. Seide and A. Agarwal, “Cntk: Microsoft’s open-source deep-learning toolkit,” in Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16. New York, NY, USA: ACM, 2016, pp. 2135–2135. [Online]. Available: http://doi.acm.org/10.1145/2939672.2945397

[27] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” in NIPS-W, 2017.

[28] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.

[29] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.

[30] A. Kendall, R. Cipolla et al., “Geometric loss functions for camera pose regression with deep learning,” in Proc. CVPR, vol. 3, 2017, p. 8.

[31] C. M. Bishop, “Mixture density networks,” Citeseer, Tech. Rep., 1994.
