
Image-based Perceptual Learning Algorithm for Autonomous Driving

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Yunming Shao, M.S.

Graduate Program in Geodetic Science

The Ohio State University

2017

Dissertation Committee:

Dr. Dorota A. Grejner-Brzezinska, Advisor

Dr. Charles Toth, Co-advisor

Dr. Alper Yilmaz

Dr. Rongjun Qin

Copyrighted by

Yunming Shao

2017

Abstract

Autonomous driving is widely acknowledged as a promising solution to modern traffic problems such as congestion and accidents. It is a complicated system comprising sub-modules such as perception, path planning, and control. During the last few years, research and experimentation have shifted from the academic to the industrial sector.

Different sensor configurations exist depending on the manufacturer, but an imaging component is used by every company. In this dissertation, we mainly focus on innovating and improving the camera perception algorithms using deep learning. In addition, we propose an end-to-end control approach that maps image pixels directly to control commands. This dissertation contributes to the development of autonomous driving in the three following aspects:

Firstly, a novel dynamic object detection architecture using still images is proposed.

Our dynamic detection architecture utilizes a Convolutional Neural Network (CNN) with an end-to-end training approach. In our model, we consider multiple requirements for dynamic object detection in autonomous driving in addition to accuracy, such as inference speed, model size, and energy consumption. These are crucial to the deployment of a detector in a real autonomous vehicle. We determine our final architecture by exploring different pre-trained feature extractors and different combinations of multi-scale feature layers. Our architecture is extensively tested on the KITTI visual benchmark datasets [84] and achieves accuracy comparable to the state-of-the-art approaches in real time.

Secondly, to take advantage of the contextual information in video sequences, we develop a video object detection framework based on the CNN and Long Short Term Memory (LSTM). LSTM is a special kind of Recurrent Neural Network (RNN). The architecture proposed in chapter 3 acts as the still image detector and feature extractor, and the LSTM is responsible for exploiting the temporal information in the video stream. The input to the LSTM can be the visual features of the still image detector, the detection results of an individual frame, or both. We found that a combination of proper visual features and detection results, used as the temporal information to the LSTM, achieves better performance than either one used alone.

Finally, we design an end-to-end control algorithm that takes video sequences as input and directly outputs control commands. We mainly focus on supervised deep learning methods, i.e., convolutional neural networks and recurrent neural networks, and train them using simulated and real road data. As in the video object detection task, the recurrent neural network is designed to take advantage of the temporal information.

Through experiments, we evaluate several proposed network architectures and recommend the one with the best performance.


Acknowledgment

First and foremost, I would like to thank my adviser, Dr. Dorota Grejner-Brzezinska, for her valuable mentorship, patience, and encouragement during my Ph.D. study. Her continuous support as I repeatedly shifted research focus and explored new ideas is greatly appreciated. I would also like to thank my co-adviser, Dr. Charles Toth, for his guidance of my study and his insight into the research topics. In addition, I learned a lot by attending the weekly group meetings he hosted. I also appreciate his patient review of and constructive feedback on my dissertation.

I would also like to give thanks to Dr. Alper Yilmaz and Dr. Rongjun Qin for serving on the committee of my dissertation defense. Both of them provided precious comments and feedback, which helped improve my dissertation. I would further like to thank my fellow students and colleagues for maintaining a professional research environment and for the helpful discussions.

Finally, I would like to thank my family, especially my wife Yan Gao. They have always encouraged and supported me, and I would not have been able to complete this work without all of their love and encouragement.


Vita

2010………………………. B.S. Mapping Engineering, Chinese University of Petroleum

2013………………………. M.S. Mapping Engineering, Chinese Academy of Sciences

2013 to present…………… Graduate Student, School of Earth Sciences, The Ohio State University

Fields of Study

Major Field: Geodetic Science


Table of Contents

Abstract ...... ii

Acknowledgment ...... iv

Vita ...... v

Fields of Study ...... v

Table of Contents ...... vi

List of Acronyms ...... x

List of Tables ...... xii

List of Figures ...... xiv

Chapter 1: Introduction ...... 1

1.1 Motivation ...... 1

1.2 History and Present ...... 3

1.3 System Architecture ...... 6

1.3.1 Sensor Input ...... 7

1.3.2 Perception ...... 10

1.3.3 Planning ...... 13


1.3.4 Control ...... 15

1.4 Contributions...... 17

1.5 Dissertation Organization ...... 19

Chapter 2: Foundation and Literature Review ...... 21

2.1 Computer Vision and Deep Learning ...... 21

2.1.1 Computer Vision ...... 21

2.1.2 Deep Learning ...... 27

2.1.3 Datasets ...... 28

2.2 Neural Network ...... 29

2.2.1 Neurons and Neural Network Architecture ...... 29

2.2.2 Activation Functions ...... 31

2.2.3 Training A Neural Network ...... 34

2.3 Convolutional Neural Network ...... 39

2.3.1 Architecture ...... 39

2.3.2 Convolution ...... 41

2.3.3 Pooling ...... 43

2.3.4 Dropout...... 45

2.3.5 Transfer Learning ...... 46

2.4 Recurrent Neural Network ...... 48


2.4.1 Architecture ...... 49

2.4.2 Training RNNs ...... 50

Chapter 3: Dynamic Object Detection on Road ...... 52

3.1 Related Work...... 52

3.1.1 Traditional Approaches ...... 52

3.1.2 CNNs for Object Detection ...... 54

3.2 Model Architecture ...... 56

3.4 Experiments ...... 64

3.4.1 Quantitative Results ...... 64

3.4.2 Qualitative Results ...... 70

3.5 Conclusion ...... 74

Chapter 4: Video Object Detection for Multiple Tracking ...... 75

4.1 Introduction ...... 75

4.2 Related Work...... 77

4.3 Long Short Term Memory ...... 79

4.4 Methods ...... 83

4.4.1 System Overview ...... 83

4.4.2 LSTM Choice ...... 84

4.4.3 Training ...... 85


4.5 Experiments ...... 86

4.5.1 Quantitative Results ...... 86

4.5.2 Qualitative Results ...... 88

4.6 Conclusion ...... 92

Chapter 5: End to End Learning for Vehicle Control ...... 93

5.1 Related Work...... 93

5.2 Network Architecture ...... 95

5.3 Implementation and Training ...... 101

5.4 Experiments and Results ...... 104

5.4.1 Simulation Study ...... 104

5.4.2 Real Road Test ...... 110

5.5 Summary ...... 114

Chapter 6: Summary and Future Work ...... 116

6.1 Summary and Contributions...... 116

6.2 Future Work ...... 118

References ...... 121


List of Acronyms

ANN Artificial Neural Network

BGD Batch Gradient Descent

BPTT Backpropagation Through Time

CNN Convolutional Neural Network

DPM Deformable Part Model

ELU Exponential Linear Units

FCN Fully Convolutional Network

GPS Global Positioning System

GRU Gated Recurrent Unit

HOG Histogram of Oriented Gradients

INS Inertial Navigation System

IOU Intersection Over Union

LSTM Long Short Term Memory

MDP Markov Decision Processes

MPC Model Predictive Control

NMS Non-Maximum Suppression

PID Proportional, Integral, Derivative


ReLU Rectified Linear Unit

RNN Recurrent Neural Network

RPN Region Proposal Network

SGD Stochastic Gradient Descent

SIFT Scale Invariant Feature Transform

SLAM Simultaneous Localization And Mapping

SSD Single Shot Detector

SVM Support Vector Machine

SVR Support Vector Regression

YOLO You Only Look Once


List of Tables

Table 3.1: Detection accuracy summary for different feature extractors. The mean Average Precision (mAP) is the mean value of the AP over all classes. The mAP and AP are given in percent, and the speed is measured in Frames Per Second (FPS). The AP for each class (Car, Cyclist, Pedestrian) at the different difficulty levels, i.e., Easy (E), Moderate (M), and Hard (H), is also calculated...... 67

Table 3.2: Detection accuracy summary obtained by utilizing different feature layers. The mean Average Precision (mAP) is the mean value of the AP over all classes. The mAP and AP are given in percent, and the speed is measured in Frames Per Second (FPS)...... 70

Table 4.1: Detection performance using VGG16 as the backbone CNN architecture and different inputs to the LSTM units. The mean Average Precision (mAP) is the mean value of the AP over all classes; the mAP and AP are given in percent, and the speed is measured in Frames Per Second (FPS). “Base” means no LSTM units are used; “D” means only detected bounding boxes are input to the LSTM units; “F” means only visual features are input to the LSTM units; “F&D” means both detected bounding boxes and visual features are input to the LSTM units. The AP for each class (Car, Cyclist, Pedestrian) at the different difficulty levels, i.e., Easy (E), Moderate (M), and Hard (H), is also calculated...... 87

Table 4.2: Detection performance using MobileNet as the backbone CNN architecture and different inputs to the LSTM units. The mean Average Precision (mAP) is the mean value of the AP over all classes; the mAP and AP are given in percent, and the speed is measured in Frames Per Second (FPS). “Base” means no LSTM units are used; “D” means only detected bounding boxes are input to the LSTM units; “F” means only visual features are input to the LSTM units; “F&D” means both detected bounding boxes and visual features are input to the LSTM units. The AP for each class (Car, Cyclist, Pedestrian) at the different difficulty levels, i.e., Easy (E), Moderate (M), and Hard (H), is also calculated...... 88

Table 5.1: Mean Absolute Error (MAE) and Standard Deviation (Std) of the steering angles, in radians, for different architectures on the simulated testing datasets. “GT” means that the ground truth values are used as feedback to the next time step; “OP” means that the network’s actual outputs are used as feedback. The L1, L2, L2 Fb, and L2 Fb Skip architectures are illustrated in figures 5.1-5.4...... 108


Table 5.2: Mean Absolute Error (MAE) and Standard Deviation (Std) of the steering angles, in radians, for different architectures on the real road testing datasets. “GT” means that the ground truth values are used as feedback to the next time step; “OP” means that the network’s actual outputs are used as feedback. The L1, L2, L2 Fb, and L2 Fb Skip architectures are illustrated in figures 5.1-5.4...... 112


List of Figures

Figure 1.1: The architecture of the autonomous driving system ...... 7
Figure 1.2: The Flea3 camera from Point Grey (left) and the HDL-64E Lidar from Velodyne (right) used for autonomous vehicles ...... 9
Figure 1.3: Sub-modules of the perception module ...... 11
Figure 1.4: Sub-modules of the planning module ...... 13
Figure 1.5: A deep learning neural network that replaces the perception, planning and control modules ...... 16
Figure 1.6: Relation between modules discussed in different chapters ...... 19
Figure 2.1: An example of vehicle detection using our proposed method. The image is from the KITTI Visual Benchmark dataset ...... 23
Figure 2.2: Semantic segmentation sample from the Cityscapes dataset [65]. Each pixel is assigned to a predefined set of classes and illustrated with a specific color. For example, all the pedestrian instances are recognized and pictured in red ...... 26
Figure 2.3: A cartoon drawing of a biological neuron (left) and its mathematical model (right) [86]. x = [x0, x1, x2] is the input vector; ωi and bi are the weight and bias parameters, respectively, which are learned by training. The activation function allows the model to represent more complicated real-world problems ...... 30
Figure 2.4: A 3-layer Neural Network architecture (left) and its corresponding mathematical equations (right). x is the input vector; Wi and Bi are the weight and bias parameters, respectively, which are learned by training. h1 and h2 are the intermediate vectors after hidden layers 1 and 2 ...... 31
Figure 2.5: The sigmoid function (left) and the hyperbolic tangent function (right) ...... 33
Figure 2.6: The ReLU function (left) and the leaky ReLU function (right) ...... 34
Figure 2.7: The LeNet-5 neural network architecture [45]. It is a typical CNN architecture comprised of convolution, activation, pooling and fully connected layers. It was designed to recognize hand-written letters and digits. The subsampling in this architecture is actually a 2x2 pooling layer ...... 40

Figure 2.8: An example of the convolution operation. The input feature map has spatial dimension 32 × 32 × 3. After applying 6 filters of size 5 × 5 × 3, stride 1, and zero padding 0, the size of the output feature map is 28 × 28 × 6, and the number of parameters in this convolution layer is 456 ...... 43
Figure 2.9: An example of max pooling operating on the 28 × 28 × 6 feature map with a 2 × 2 window and stride 2, resulting in a 14 × 14 × 6 feature map. The pooling operation is applied on each channel of the feature map and thus preserves the channel dimension ...... 44
Figure 2.10: An example of applying dropout to a standard neural network. Hidden layers 1 and 2 in (a) are both fully connected layers. Each circle represents a neuron and the arrow line connecting them represents the axon with data flowing on it. Red circles represent the input vector; green circles represent the intermediate feature map; and blue circles represent the output vector. A red cross in a circle means that there is no data flowing through that neuron ...... 46
Figure 2.11: Transfer learning with CNNs. (a) The pre-trained VGG-16 model [95] on the ImageNet dataset for classification. (b) Transfer learning on a small dataset. (c) Transfer learning on a medium dataset. “conv3-64” denotes 64 filters of size 3x3. “fc-4096” denotes a fully connected layer with output vector size 4096. The layers in the red box are frozen, which means their weight and bias parameters will not be updated. The layers in the blue box will be learned during the transfer learning process ...... 48
Figure 2.12: An example of an RNN. The left part of the equal sign is the recursive description of the RNN, and the right part is the corresponding RNN model unrolled over a time sequence. The red arrow indicates the backpropagation direction at time step t. x_t is the input at time step t, h_t is the output at time step t, and c_t is the cell state at time step t ...... 50
Figure 3.1: Our proposed object detection architecture, including three steps. (1) Feature extraction: the input image is fed into a feature extractor pre-trained on ImageNet, such as VGG16 [95], and multi-scale feature maps are produced. (2) Prediction with a convolution: anchors are created on the selected feature maps and feature map combinations, and a convolution layer is applied to these anchors to predict the offsets to the anchors, the associated confidence and the class probability. (3) Refinement: usually a Non-Maximum Suppression (NMS) is good enough to filter out the repeated bounding boxes ...... 57
Figure 3.2: An example of prediction on a feature map. The feature map of size W × H with P channels has K = 3 anchors produced on each grid cell. By convolving the feature map with (4 + C) × K filters of size 3 × 3 × P, where C is the number of classes to distinguish, and proper zero padding, we get an output map with the same width W and height H but a different channel size (4 + C) × K. The output map can be interpreted as C class probabilities and 4 bounding box offsets for each anchor ...... 59
Figure 3.3: The Intersection-Over-Union (IOU) ratio calculation and Non-Maximum Suppression (NMS). (a) The IOU ratio is the area of the intersection over the area of the union, which indicates how much two bounding boxes overlap. (b) The detection results before and after Non-Maximum Suppression (NMS) ...... 60
Figure 3.4: The precision-recall curves for car, cyclist and pedestrian at the different difficulty levels easy, moderate, and hard ...... 69
Figure 3.5: Example of successful detection on the KITTI test datasets. We use SqueezeNet V2 with three feature maps for the detection. Each color corresponds to an object category ...... 72
Figure 3.6: Example of detection errors on the KITTI test datasets. From top to bottom, the errors are: part of a tree predicted to be a cyclist; a still cyclist predicted to be a pedestrian; a missed detection; and a predicted bounding box that does not fit the object. We use SqueezeNet V2 with three feature maps for the detection. Each color corresponds to an object category ...... 73
Figure 4.1: A standard architecture of LSTM units. Each line carries a vector of data, and the arrow denotes the direction of the data flow. Two merging lines mean data concatenation, while a line separating into two lines means that the data it carries is copied and flows in different directions. The circles represent pointwise operations, such as vector addition or multiplication, while the boxes are sigmoid neural network layers. x_t is the input at time step t, h_t is the output at time step t, and c_t is the cell state at time step t. σ and tanh are the sigmoid function and the hyperbolic tangent function, respectively, defined and described in figure 2.5 ...... 80
Figure 4.2: The gate structure in LSTM units. It is composed of a sigmoid neural network layer σ and a pointwise multiplication operation. It can control whether the information can flow through it ...... 81
Figure 4.3: Overview of our video object detection architecture. The dashed line means that either the feature vector or the still image detection results can be inputs to the Long Short Term Memory (LSTM) units. “Detection” denotes the detection results, i.e., bounding box coordinates and class labels from the Convolutional Neural Network (CNN) or LSTM. “Features” denotes the visual feature vector from the CNN feature extractor ...... 84
Figure 4.4: Video object detection without considering the contextual information. Only the base MobileNet is used to detect objects in each individual frame. Each color corresponds to an object category ...... 90
Figure 4.5: Video object detection considering the contextual information. MobileNet is used to detect objects in each individual frame, and the output visual feature vector and detection results are fed into the LSTM units. Each color corresponds to an object category ...... 91
Figure 5.1: End-to-end control architecture with a one-layer Long Short Term Memory (LSTM), called “L1”. At each time step except the first, both the deep feature vector of the current time step and the output control commands of the last time step are fed into the LSTM unit to update the cell state ...... 97

Figure 5.2: End-to-end control architecture with two layers of Long Short Term Memory (LSTM), called “L2”. The outputs of the first LSTM layer act as the inputs to the second LSTM layer ...... 98
Figure 5.3: End-to-end control architecture with two LSTM layers and a feedback mechanism. We add the outputs of the second LSTM layer at the previous time step as the third input to the first LSTM layer at the current time step. This architecture is called “L2 Fb” ...... 99
Figure 5.4: End-to-end control architecture with two LSTM layers, a feedback mechanism, and a skip structure. We call this architecture “L2 Fb Skip” ...... 100
Figure 5.5: The training diagram of our proposed architecture using the BackPropagation Through Time (BPTT) algorithm ...... 102
Figure 5.6: The architecture at test time. It can produce a steering angle from each frame of the video sequence ...... 104
Figure 5.7: The Udacity self-driving car simulator (https://www.udacity.com) ...... 105
Figure 5.8: An example of the collected video frames from the left, center, and right cameras ...... 106
Figure 5.9: The steering angle distribution before and after data augmentation. Too many steering angles are located in the range of [−0.3, 0] before data augmentation, but the steering angle distribution becomes balanced after data augmentation ...... 107
Figure 5.10: Backbone CNN feature extractor for simulated data. “8-3 × 3” means a convolution layer with eight 3 × 3 filters; “Relu” means a ReLU activation layer that adds non-linearity to the system; “Max 2 × 2” is a max pooling layer with filter size 2 × 2 to reduce the feature map dimension; “Dropout 0.2” is a dropout layer with dropout rate 0.2 to prevent overfitting to the training data; “FC 50” means a fully connected layer with output feature vector length 50 ...... 108
Figure 5.11: Sample frames from the training video sequences ...... 111
Figure 5.12: The predicted steering angles of architecture “L2 Fb Skip GT” on the test datasets ...... 114

Chapter 1: Introduction

1.1 Motivation

People must move from one place to another to satisfy both their physiological and psychological needs. Thus, our ancestors invented various forms of transportation, which became one of the necessities of life. Animals, such as horses, were the main means of transport for a long time, until the 19th century, when travel by railway emerged. Without a doubt, the persistent improvement in production efficiency and the popularization of the automobile during the last century have greatly expanded the boundaries of our daily life activities. People, especially those in developed countries like the United States, rely heavily on cars for activities such as work, shopping, and entertainment.

However, driving by human beings is subject to errors and mistakes that have caused countless deaths over the years. Worldwide, per the Global Road Crash Data [1], traffic crashes are a major cause of death and injury, estimated at 1.3 million fatalities each year, or on average 3,287 deaths per day. In the United States alone, there are over 37,000 deaths and an additional 2.35 million injuries in road crashes each year. Of these, 94% are caused by human error, as reported by research from the U.S. National Highway Traffic Safety Administration (NHTSA) [4]. The cost of traffic crashes is incredibly high, reaching USD $518 billion globally and $230.6 billion in the United States. Unless actions are taken, traffic crashes are predicted to be the fifth leading cause of death by 2030.

In addition to traffic crashes, traffic congestion and the difficulty of parking are a nuisance and inconvenience experienced by many commuters every day. The American Driving Survey [2] from the American Automobile Association (AAA) Foundation is the most current and comprehensive survey on how much Americans drive daily and yearly. It reveals that American drivers spend an average of 17,600 minutes on the road each year, i.e., about 48 minutes each day, which is equivalent to seven 40-hour weeks at the office. That is a substantial amount of time that could be saved. Another problem is parking, which is frustrating, especially in dense urban areas. A study [3] of parking in Los Angeles found that there were 18.6 million parking spaces, occupying 14% of the incorporated L.A. land in 2010. Each of the 5.6 million vehicles in L.A. takes more than 3 spaces, but residents still complain about the lack of parking spaces and the difficulty of finding one.

It is widely believed that autonomous driving is the most promising technology to eliminate the problems described above. Autonomous driving enables cars to sense the environment and navigate without human intervention. First, autonomous driving is expected to be much safer because an autonomous vehicle is not distracted or subject to fatigue, as human drivers are. Most traffic accidents are caused by human error, simply because no one can guarantee 100 percent focus on the road while driving. With multiple sensors installed, autonomous vehicles can better perceive and understand the surrounding environment, thus also improving safety. Second, autonomous driving can reduce and even totally avoid traffic jams because autonomous vehicles have better driving behavior and, most importantly, are connected and can communicate with each other. For example, a bad driver's sudden braking can cause miles of traffic jams on a busy freeway. Autonomous driving is expected to avoid this. Third, an autonomous vehicle can park itself after delivering the passenger to a destination. A more promising scenario is vehicle sharing, which will increase the service time of each vehicle and reduce the total number of vehicles. In this way, a large amount of parking space will no longer be necessary. In addition, autonomous driving will benefit specific groups of people, such as the elderly or disabled. In conclusion, autonomous driving has the potential to benefit the transportation system in multiple ways.

1.2 History and Present

Autonomous driving experiments date back to the 1920s [5], but it was not until the 1980s that significant progress was made by Carnegie Mellon University with its Navlab vehicle, which operated in structured environments. Since then, the European project PROMETHEUS [6], the ARGO project carried out at the University of Parma, Italy, and the California Partners for Advanced Transportation Technology (PATH) program [7] [8] all set their own milestones. However, the real breakthrough was achieved during the 2004 and 2005 Defense Advanced Research Projects Agency (DARPA) Grand Challenges and the 2007 DARPA Urban Challenge. No team finished the 2004 DARPA Grand Challenge, which challenged the competitors to finish a 142-mile course without human intervention and was the first long-distance competition for autonomous vehicles. In the next year, Stanford's robot car "Stanley" [9] completed the 132-mile course within 6 hours and 54 minutes, thus winning the 2005 DARPA Grand Challenge. In 2007, DARPA decided to make the race even tougher by moving the challenge to an urban environment. The urban environment is more challenging because it is more complicated, involves more road users in the traffic, and requires interaction with other autonomous and human-operated vehicles. Carnegie Mellon University's robot car named "BOSS" [10] won the race by finishing all the missions in a little more than 4 hours. Since then, other similar competitions have taken place around the world, which together demonstrate that autonomous driving is feasible. The three DARPA challenges and other competitions further improved the techniques of autonomous driving and, most importantly, successfully attracted the attention of the public to autonomous vehicles.

After the DARPA Grand Challenges, the industry gradually took over from academic research institutes. Among the companies involved, Google has been leading the development and has tested its prototype vehicles for millions of miles. In January 2017, Google launched Waymo, a standalone company under Google's parent company, Alphabet. Other companies have also invested significant resources in autonomous driving. For example, NVIDIA has released its autonomous driving development platforms Drive PX and Drive PX 2 to the market [140]. Other IT giants, such as Apple and Baidu, and traditional car manufacturers such as GM, Ford, and Toyota have all built their own independent departments working on autonomous driving. Small startups such as Otto and nuTonomy are also developing their own systems, and some have been acquired at high prices by bigger players such as Uber and Ford.


An autonomous vehicle is defined by Nevada state law as "a motor vehicle that uses artificial intelligence, sensors and Global Positioning System coordinates to drive itself without the active intervention of a human operator" [11]. Both the Society of Automotive Engineers (SAE) International and the National Highway Traffic Safety Administration (NHTSA) released their own classifications of automation levels, but NHTSA abandoned its classification by adopting the SAE standard in September 2016. The SAE classification standard, SAE J3016 [11], defines six levels of automation based on the amount of human intervention or attentiveness required. In general, the SAE J3016 levels and definitions include:

• Level 0 – No Automation: The full-time performance by the human driver of all aspects of the dynamic driving task, even when enhanced by warning or intervention systems

• Level 1 – Driver Assistance: The driving mode-specific execution by a driver assistance system of either steering or acceleration/deceleration using information about the driving environment and with the expectation that the human driver performs all remaining aspects of the dynamic driving task

• Level 2 – Partial Automation: The driving mode-specific execution by one or more driver assistance systems of both steering and acceleration/deceleration using information about the driving environment and with the expectation that the human driver performs all remaining aspects of the dynamic driving task

• Level 3 – Conditional Automation: The driving mode-specific performance by an Automated Driving System of all aspects of the dynamic driving task with the expectation that the human driver will respond appropriately to a request to intervene

• Level 4 – High Automation: The driving mode-specific performance by an Automated Driving System of all aspects of the dynamic driving task, even if a human driver does not respond appropriately to a request to intervene

• Level 5 – Full Automation: The full-time performance by an Automated Driving System of all aspects of the dynamic driving task under all roadway and environmental conditions that can be managed by a human driver

1.3 System Architecture

The system required to achieve autonomous driving is very complicated and comprises many sub-systems. Here we simplify the system into three primary modules, which are the perception/localization module, the planning module and the control module, as shown in figure 1.1. Many other researchers separate perception and localization into two tasks, making the system composed of four modules; however, we treat them as a single task because they are so closely related. Even though we simplify the system into a single path, it is a closed-loop system due to the dynamic, interactive environment. A closed-loop system means that the outputs of the current time step affect the inputs of the next time step. For example, the steering angle and throttle signals output by the system will navigate the vehicle to a new position and orientation, where the camera will capture a new view of the surroundings as part of the sensor input.


Figure 1.1: The architecture of the autonomous driving system

Perception refers to the ability to collect information and extract relevant knowledge from the environment. For autonomous driving, the perception module is responsible for perceiving and understanding the surrounding environment, taking inputs such as images or point clouds. Some researchers compare the perception module to the human eye. We argue that the eye itself cannot understand the scene it sees; the brain must be involved as well.

The tasks of the perception module include not only the detection of other vehicles and pedestrians and the detection and understanding of traffic signs/lights, but also traffic scene understanding, which is more challenging for a machine. For example, it must be able to recognize emergency vehicles such as fire trucks and understand that they have higher priority. The vehicle localization task is included in the perception module because reliable localization relies on an understanding of the urban environment.

The planning module takes all the perception results and makes decisions about the vehicle's future motion. These tasks can include optimal path planning, lane changing, left or right turns, speeding up or stopping, and so on. The control module then executes these planned decisions by sending the steering angle, throttle and brake strength levels, computed by the optimal control algorithms, to the vehicle's transmission system.
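To make the data flow between the three modules concrete, the sketch below outlines one iteration of the closed loop in Python. The class and function names (PerceptionModule, drive_one_step, vehicle.apply, etc.) are illustrative assumptions introduced for this example only; they do not correspond to components defined elsewhere in this dissertation.

from dataclasses import dataclass
from typing import List

@dataclass
class DetectedObject:
    bbox: tuple        # (x_min, y_min, x_max, y_max) in image coordinates
    label: str         # e.g., "car", "pedestrian", "cyclist"
    score: float       # detection confidence

@dataclass
class ControlCommand:
    steering_angle: float   # radians
    throttle: float         # 0.0 .. 1.0
    brake: float            # 0.0 .. 1.0

def drive_one_step(image, perception, planner, controller, vehicle):
    """One iteration of the perception -> planning -> control cycle."""
    objects: List[DetectedObject] = perception.detect(image)    # perceive surroundings
    trajectory = planner.plan(objects, vehicle.state())         # plan a reference path
    command: ControlCommand = controller.track(trajectory, vehicle.state())
    vehicle.apply(command)   # the new pose changes the next camera view,
                             # which is what closes the loop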

1.3.1 Sensor Input


There is no doubt that multiple sensors should be used on autonomous vehicles to guarantee functionality, reliability and safety. In addition to the GPS/INS system [136, 137], there are four types of commonly used sensors for autonomous driving: Radar, Ultrasonic, Lidar and Camera. Each of them perceives the world in a different way. Usually a low-cost GPS/INS system can localize the vehicle with an accuracy of several meters, which is insufficient for autonomous vehicles. Furthermore, it remains a challenge to maintain even this accuracy in urban areas because of satellite signal blockage by tall buildings. Radar uses radio waves to determine the range, angle and velocity of objects, and works equally well under different lighting and weather conditions such as rain or snow. Its drawback is the low resolution compared with other sensors such as Lidar. Ultrasonic sensors work in a similar way but emit high-frequency sound. Compared with Radar, Ultrasonic sensors can only perceive the environment within a short range. While both work under difficult lighting and weather conditions, their common drawback is the lack of color, contrast and optical-character information perceived from objects.


Figure 1.2: The Flea3 camera from Point Grey (left) and the HDL-64E Lidar from Velodyne (right) used for autonomous vehicles.

There is an ongoing debate among the research and industrial sectors about whether Lidar or Camera should be the primary sensor for autonomous vehicles. Figure 1.2 shows a typical camera and Lidar used for autonomous driving. Google and many other companies are building their autonomous vehicles with Lidar acting as the central sensor. Their assumption is that the sensor price will drop quickly enough to catch up with the development of autonomous driving.

Lidar [138] is short for "Light Detection and Ranging"; it works by emitting laser light and measuring the time the light travels from the emitter to the target, thus measuring the distance. Its measurements include another important value, the reflectivity, which captures the reflectance characteristics of different targets and enables Lidar to classify them. These measurements can be used to generate a 2D/3D point cloud, which is an excellent map representation of the surrounding environment. Therefore, Lidar excels at 3D mapping and vehicle localization. Some of the most popular state-of-the-art localization methods [23], [24], [25] rely mostly on a pre-built map of the environment generated with Lidar. The vehicle localization can then be achieved by correlating the online Lidar measurements with the pre-built map. Mapping using Lidar does not suffer from severe lighting conditions, such as shadows or direct sunlight, and enjoys much higher resolution than Radar.

However, in general, a camera performs much better than Lidar when used as a perception sensor. Generally speaking, the problem with Lidar in the perception task is the lack of color, texture and appearance information, not to mention that Lidar measurements are sparse compared with camera images. Almost all the information regarding the environment can be captured by cameras and stored in images. Color, contrast, and optical-character information give cameras a capability set entirely missing from all other sensors. In addition, cameras are cheap and small enough to be deployed in mass production. Thus, for the perception task, the remaining challenge is developing more advanced computer vision algorithms that can better utilize the camera outputs. In this dissertation, we mainly investigate the perception part of autonomous driving, focusing on learning-based computer vision algorithms that take images and video as input.

1.3.2 Perception

The perception module is the most significant module in the autonomous driving system. It is the foundation and prerequisite for the subsequent modules to function well.

The tasks of this module include mapping/localization, object detection, object tracking, semantic segmentation, scene understanding, and so on. Figure 1.3 describes the tasks that are closely related to autonomous driving. Most of these tasks are within the domain of computer vision, which is our focus in this dissertation.

Figure 1.3: Sub-modules of the perception module

Environment mapping and vehicle localization are mutually dependent problems and can be regarded as a simultaneous localization and mapping problem, known in short as SLAM. SLAM addresses the problem of building a spatial map of an environment while simultaneously localizing the robot relative to this map. SLAM is generally regarded as one of the most important tasks in the perception module. However, SLAM is still mostly solved by traditional Bayes filter based algorithms, such as the Extended Kalman Filter [28] or particle filter [29], and by graph-based optimization techniques [26], [27]. These are also intensive research areas but not our focus. In this dissertation, we focus on learning-based algorithms applied to image data.

While mapping and localization inform the vehicle where it is in its surroundings, object detection and object tracking tell the vehicle where the obstacles are. The object detection sub-module is responsible for locating and classifying all critical objects in the traffic scene, and usually outputs 2D or 3D bounding boxes around the objects. There are two categories of objects: static objects and dynamic objects. Static objects, such as traffic signs/lights and lane markings, are relatively easy to handle because their positions in the environment do not change over time. An immediate solution is therefore to annotate these static objects on the map, provided a reliable mapping solution is available. However, dynamic object detection is more challenging since we can only rely on real-time online observations from the on-board sensors to detect them. In this dissertation, we focus on dynamic object detection and tracking.

The objective of the object tracking sub-module is to associate moving target objects across consecutive video frames. Visual tracking [135] is a challenging task in computer vision due to target deformations, scale changes, partial occlusions, and motion blur. Segmentation is a partition of an image into several "coherent" parts, but without any attempt at understanding what these parts represent. Semantic segmentation is a further partition of an image into several semantically meaningful parts at the pixel level, classifying each part into one of a set of pre-determined classes. It has been demonstrated, for example, for discovering the drivable area in images. Scene understanding aims to understand the meaning and global structure of the scene semantically, which is a high-level task. Its foundation is reliable object detection/tracking and semantic segmentation.

In the past five years, deep learning based methods have dominated computer vision research. Deep learning is so powerful that most of these perception sub-tasks have achieved performance that traditional methods never reached. In addition, deep learning is a very general method that addresses each problem within a unified framework, while in traditional methods very different specific algorithms must be proposed for different tasks. In this dissertation, we mainly explore and propose novel deep learning based algorithms for some of the perception tasks in the application of autonomous driving.

1.3.3 Planning

The planning module is responsible for outputting a drivable reference path or trajectory for the control module to follow. The perception results, such as the location and orientation of other vehicles and the state of the traffic lights, together with prior knowledge about the road network, vehicle dynamics, and sensor models, are used to plan an optimal path.

Planning is difficult because of the high dynamics of the real driving scenario. The planning module in autonomous driving systems is hierarchically separated into three consecutive tasks, which are route planning, behavioral decision making, and motion planning. We demonstrate their relationship in figure 1.4.

Figure 1.4: Sub-modules of the planning module


In the route planning sub-module, a route is planned through the road network from the vehicle's current location to the requested destination. This can be formulated as the problem of finding a minimum-cost path on a road network graph, by representing the road network as a directed graph with edge weights corresponding to the cost of traversing each edge. The cost is usually the length of the route, the time taken to travel it, or a balance of both. The output of the route planning task is a sequence of waypoints through the road network. The route planning problem has attracted significant interest in the transportation research community, and state-of-the-art algorithms, such as [30] and [31], can offer an optimal route on a continent-scale network in milliseconds. Thus, route planning is generally regarded as a solved problem.
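As a minimal illustration of this graph formulation, the following sketch runs Dijkstra's algorithm on a toy road network whose edge weights stand for traversal cost. The graph and node names are invented for the example and are not taken from the cited algorithms [30], [31], which are far more sophisticated.

import heapq

def shortest_route(graph, start, goal):
    """Dijkstra search on a directed road-network graph.
    graph: dict mapping node -> list of (neighbor, edge_cost)."""
    queue = [(0.0, start, [start])]   # (accumulated cost, node, path so far)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path          # sequence of waypoints through the network
        if node in visited:
            continue
        visited.add(node)
        for neighbor, edge_cost in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    return float("inf"), []

# Toy example: costs could be road length, travel time, or a weighted mix of both.
road_network = {"A": [("B", 2.0), ("C", 5.0)], "B": [("C", 1.0), ("D", 4.0)],
                "C": [("D", 1.0)], "D": []}
print(shortest_route(road_network, "A", "D"))   # -> (4.0, ['A', 'B', 'C', 'D'])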

Given the optimal route from the route planning sub-module, the behavioral decision-making task helps the vehicle navigate the selected route and interact with other traffic participants. Given a sequence of waypoints, the behavioral layer is responsible for deciding an appropriate driving behavior based on the behavior of other traffic participants, road conditions, and traffic light states. The behavioral decision-making problem is commonly modeled using probabilistic planning formalisms, such as Markov Decision Processes (MDPs) in [32] and Partially Observable Markov Decision Processes (POMDPs) in [33] and [34].

As the last sub-module of planning, motion planning executes the decided driving behavior, e.g., a left/right turn or a lane change, by creating a path or trajectory that can be followed by the control module. The resulting path or trajectory should be comfortable for the passengers, feasible for the mechanical system of the vehicle, and free of collisions with obstacles detected by the perception module. In practice, numerical approximation methods are usually used due to the high computational cost, such as graph-search approaches [35] that search for the shortest path on a constructed graphical discretization of the vehicle's state space.

1.3.4 Control

In the autonomous driving system, the control module stabilizes the vehicle along the reference trajectory by computing the steering angle and acceleration/brake level. The difficulties lie in the presence of modeling errors and other forms of uncertainty. In previous studies, various theories and methods have been investigated that fully consider the dynamics of the vehicle by compensating for delays and disturbances. These include the PID (Proportional, Integral, Derivative) control method [36] [37], MPC (Model Predictive Control) [38], the fuzzy control method [39], the model reference adaptive method [40], and the SVR (Support Vector Regression) method [41]. The most practical and reliable control method is the PID controller because it is simple, reliable, and easy to implement.
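As a generic illustration of the PID idea, and not a controller used in this dissertation, the sketch below computes a steering correction from the cross-track error between the vehicle and the reference trajectory; the gains and error values are arbitrary example numbers.

class PIDController:
    """Proportional-Integral-Derivative controller for trajectory tracking."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        """error: e.g., lateral offset from the reference trajectory (meters)."""
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        # Control output, e.g., a steering angle correction in radians.
        return self.kp * error + self.ki * self.integral + self.kd * derivative

controller = PIDController(kp=0.3, ki=0.01, kd=0.05)   # example gains, tuned per vehicle
steering = controller.step(error=0.4, dt=0.05)          # 0.4 m cross-track error at 20 Hz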

1.3.5 Deep Learning Involvement

We already mentioned that deep learning is dominating the computer vision research community. In this dissertation, we study how computer vision techniques can improve the perception ability of an autonomous vehicle. It is obvious that deep learning plays a big role in the perception module of our high-level autonomous system. For example, many object detection research solutions compete against each other by designing and utilizing deep convolutional neural networks. In fact, as a very powerful machine learning tool, deep learning is also applied to the planning and control modules in the research and industrial communities.

However, deep learning can play a bigger role in the autonomous driving system than just helping vehicles improve their perception ability. The most aggressive attempt is to replace all three modules with a single deep learning neural network (Figure 1.5), which maps the image or video inputs directly to control commands, i.e., the steering angle and throttle/braking level. The entire system is a black box with no explicit intermediate modules. NVIDIA has already demonstrated that a system implemented with a convolutional neural network is powerful enough to output steering angles to operate in diverse road and weather conditions [42].

Figure 1.5: A deep learning neural network that replaces the perception, planning and control modules
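A minimal sketch of such an end-to-end network is given below, assuming PyTorch and a 66 x 200 RGB input similar in spirit to the NVIDIA setup [42]; the layer sizes are illustrative only and differ from the architectures proposed later in Chapter 5.

import torch
import torch.nn as nn

class EndToEndSteering(nn.Module):
    """Maps a camera frame directly to a steering angle (regression)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(             # convolutional feature extractor
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, 3), nn.ReLU(),
            nn.Conv2d(64, 64, 3), nn.ReLU(),
        )
        self.regressor = nn.Sequential(             # fully connected head
            nn.Flatten(),
            nn.Linear(64 * 1 * 18, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),                       # single output: steering angle
        )

    def forward(self, x):                           # x: (batch, 3, 66, 200)
        return self.regressor(self.features(x))

model = EndToEndSteering()
angle = model(torch.zeros(1, 3, 66, 200))           # dummy frame -> predicted angle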

We can also implement a deep learning neural network to replace the combination of the perception and planning modules, i.e., map the input images to a planned trajectory, or to replace the combination of the planning and control modules, i.e., map the perception results to control commands. In summary, deep learning neural networks are heavily involved in all the modules of the high-level autonomous driving system presented in this dissertation.

1.4 Contributions

With this general architecture of an autonomous driving system in mind, we do not intend to cover everything in this dissertation. In fact, covering every detail of an autonomous driving system in one dissertation is not possible. In this dissertation, we focus on improving the perception ability of an autonomous vehicle from the perspective of still image and video object detection. In addition, we propose an end-to-end control algorithm based on supervised deep learning techniques. The planning module and traditional control methods will not be studied in this dissertation.

In this dissertation, the following contributions are made towards making autonomous driving a reality:

• A novel deep learning-based dynamic object detection architecture using still images is proposed. Our detection architecture utilizes a Convolutional Neural Network (CNN) with an end-to-end training approach that tries to balance detection accuracy and speed. In our model, we also consider the special requirements for dynamic object detection in the autonomous driving application, such as model size and energy consumption, which are also crucial to deploying a detector in a real autonomous vehicle. We determine our final architecture by exploring different pre-trained feature extractors and different combinations of multi-scale feature layers. Unlike traditional detectors, vehicles, pedestrians and cyclists can be detected within the same pipeline in our implementation. Our architecture is extensively tested on the KITTI visual benchmark datasets and achieves accuracy comparable to the state-of-the-art approaches in real time.

• We propose a video object detection framework for multiple object tracking in the autonomous driving application. The architecture proposed in chapter 3 acts as the still image detector and feature extractor, and a special recurrent neural network, the Long Short Term Memory (LSTM), is responsible for taking advantage of the temporal information from consecutive video sequences. The input to the LSTM can be the features from different layers of the still image detector, the detection results of an individual frame, or both. We found that a combination of feature vector and detection results, used as the temporal information to the LSTM, achieves the best video object detection accuracy in real time among our proposed models. We also experiment with our developed tracker on the KITTI visual benchmark datasets and compare it with other algorithms using various criteria.

• We design an end-to-end learning algorithm that takes video sequences as input and directly outputs control commands, such as the steering angle. We mainly focus on designing different architectures using supervised learning methods, i.e., convolutional neural networks and recurrent neural networks, and train them using simulated data and real road data collected by the Udacity Self-Driving Car Engineer Nanodegree (https://www.udacity.com). As in the object tracking task, the recurrent neural network is designed to take advantage of the temporal information.

1.5 Dissertation Organization

Figure 1.6: Relation between modules discussed in different chapters.

Figure 1.6 describes the relationship between chapters and different modules in our autonomous driving architecture. This dissertation is organized as follows:

In Chapter 2, we first review several computer vision tasks that are closely related to the autonomous driving perception system, such as object detection, object tracking and semantic segmentation. Because we rely heavily on deep learning algorithms in our work, we then review the building blocks of a deep learning algorithm, such as convolution layers, activation functions and SGD optimization methods. In addition, two commonly used deep learning architectures, the Convolutional Neural Network and the Recurrent Neural Network, are reviewed.

Chapter 3 investigates state-of-the-art object detectors for still images. Considering the special requirements of the autonomous driving application, we propose a novel deep learning-based detection architecture, aiming to detect the dynamic objects on the road in real time while keeping accuracy comparable to the state-of-the-art approaches.

In Chapter 4, we extend our proposed detection architecture to address the video object detection problem for online multiple object tracking. Specifically, we study how the temporal information in the video can be exploited with the Long Short Term Memory deep learning architecture.

In Chapter 5, the end-to-end autonomous driving approach is explored. We propose and compare several supervised learning architectures by experimenting on simulated and real road data, and select the one that displayed the best performance.

Chapter 6 concludes this dissertation and suggests some potential future work.


Chapter 2: Foundation and Literature Review

In this chapter, we first review several computer vision tasks that are closely related to the autonomous driving perception system, such as object detection, object tracking and semantic segmentation. Because our work relies heavily on deep learning algorithms, we also review the building blocks of a deep learning algorithm, such as convolution layers, activation functions and Stochastic Gradient Descent (SGD) optimization methods. In addition, two commonly used deep learning architectures, the Convolutional Neural Network and the Recurrent Neural Network, are carefully reviewed.

2.1 Computer Vision and Deep Learning

Most of the problems we study in this dissertation belong to the category of computer vision, and we mainly focus on developing deep learning based algorithms to address them. Therefore, we review the basics of computer vision and the deep learning techniques that are related to visual object detection, tracking, and end-to-end learning.

2.1.1 Computer Vision

The ultimate goal of computer vision is to enable a computer to perceive similarly to the human vision system. Computer vision research deals with a number of tasks, and it has been a hot research topic since the 1970s [125]. Here we do not intend to cover all of these tasks but review only those that are relevant to this dissertation's research focus.


These are object detection, object tracking, semantic segmentation and scene understanding. We chose these tasks based on two criteria: (1) they must be closely related to autonomous driving; (2) they can be addressed with learning-based methods, such as deep learning.

2.1.1.1 Object Detection

Reliable object detection is a crucial requirement for realizing autonomous driving because awareness of other traffic participants and obstacles is necessary to avoid accidents. Object detection is difficult due to varying object appearances, shadows, occlusions, etc. The traditional detection pipeline has three steps: region of interest extraction, object classification and refinement. Of course, some preprocessing is necessary before the detection pipeline, such as exposure adjustment, camera calibration, and image rectification.

The naïve region of interest extraction method uses a sliding window over the image at different scales. However, it is very expensive and time consuming. Several alternatives have been proposed to improve the efficiency. For example, Selective Search [53] exploits segmentation to efficiently extract approximate locations instead of performing an exhaustive search over the full image. Classification of the object in a region of interest labels the object with a predefined class. The Support Vector Machine (SVM) [54] combined with Histogram of Oriented Gradients (HOG) features has become one of the most efficient and fastest classification approaches so far. The purpose of the refinement step is to filter out detection results with low confidence and to delete repeated detections.
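The refinement step is commonly implemented with Non-Maximum Suppression (NMS) based on the Intersection-Over-Union (IOU) of candidate boxes. The sketch below is a simplified, generic version of that idea; the 0.5 threshold is an example value, not a setting prescribed by this dissertation.

def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box and drop overlapping duplicate detections."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep   # indices of the retained detections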


With deep learning introduced to solve the object detection problem, CNNs are utilized in an end-to-end fashion and have significantly improved the performance. The most successful and powerful CNN methods are the R-CNN variants of various speeds, including R-CNN [55], Fast R-CNN [56], and Faster R-CNN [57]. They are two-stage CNN architectures, i.e., two networks exist for region proposal and classification, respectively. At the same time, one-stage CNN methods have been proposed that achieve precision and recall comparable with the state-of-the-art algorithms and, most importantly, run much faster than the R-CNN methods.

YOLO [59] and SSD [58] represent the most commonly used one-stage methods. Figure 2.1 shows the vehicle detection results on an image from the KITTI Visual Benchmark dataset using our proposed method. In this figure, all the vehicles are detected and marked with green bounding boxes. The reader is referred to chapter 3 for more details of our object detection methods.

Figure 2.1: An example of vehicle detection using our proposed method. The image is from the KITTI Visual Benchmark dataset.


2.1.1.2 Object Tracking

Tracking is responsible for estimating the states of single or multiple objects over time in a sequence of images. The states include the location, velocity, acceleration and orientation of the objects of interest in each image. In autonomous driving, the states of other traffic participants are important for planning the trajectory and avoiding possible accidents by predicting their future locations. It is particularly difficult to predict the future behaviors of pedestrians or bicyclists because they can change the direction of their movements fairly abruptly. Beyond this, the challenges of object tracking are similar to those of object detection: occlusion, intersection of objects, poor lighting conditions, etc.

If the detection is reliable, object tracking can be formulated as a Bayesian inference problem in a recursive manner. In that formulation, the goal is to estimate the posterior probability density function of the states given the current observation and the previous states. There are two steps in each recursion: a prediction step using a motion model and a correction step using an observation model. Extended Kalman filter and particle filter algorithms [60], [61] are widely used models in this context. Non-recursive approaches, which optimize a global energy function with respect to all trajectories in a temporal window, are also popular and more robust to detection errors. For example, [62] and [63] belong to this approach; both focus on reducing the search space.
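The sketch below illustrates this two-step recursion for a single tracked object with a constant-velocity motion model and a linear Kalman filter. It is a generic textbook example with made-up noise values, not the specific models used in [60], [61].

import numpy as np

dt = 0.1                                   # time between frames (s)
F = np.array([[1, 0, dt, 0],               # constant-velocity motion model
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],                # we only observe the (x, y) position
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 0.01                       # process noise (example value)
R = np.eye(2) * 0.5                        # measurement noise (example value)

def kalman_step(x, P, z):
    """One prediction + correction cycle.
    x: state (x, y, vx, vy), P: covariance, z: detected (x, y) in this frame."""
    # Prediction step with the motion model
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Correction step with the observation model
    y = z - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new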

Tracking by detection is the most popular formulation for solving the tracking problem. In this formulation, a detector is used to detect and classify a certain class of objects in each frame. Recently, researchers have recognized that recurrent neural networks can be used to take advantage of the historical temporal information to reduce detection errors [64], and they have achieved promising tracking results. This differs from traditional Bayesian methods in that it can use not only the historical location information but also the historical feature information. In chapter 4, we propose a novel recurrent neural network architecture to address the object detection in video problem.

2.1.1.3 Semantic Segmentation

While object detection intends to assign a label to each object, the goal of semantic segmentation is to assign each pixel a label from a predefined set of classes. Figure 2.2 illustrates the segmentation results, with all pixels of a certain class painted in a specific color, on a sample from the Cityscapes dataset [65]. Segmentation of street scenes, containing, e.g., vehicles and pedestrians, into semantic regions is essential to autonomous driving. The traditional semantic segmentation problem was posed as maximum a posteriori inference in a conditional random field [66]. However, the success of applying deep learning to classification and detection tasks has extended to the semantic segmentation task, and deep learning based methods outperform traditional approaches in both accuracy and speed [67, 68].


Figure 2.2: Semantic segmentation sample from the Cityscapes dataset by [65]. Each pixel is assigned a label from a predefined set of classes and illustrated with a specific color. For example, all the pedestrian instances are recognized and colored red.

2.1.1.4 Scene Understanding

Scene understanding is a high-level task of computer vision, which involves several sub-tasks including object detection, object tracking, semantic segmentation, depth estimation, etc. Each of these tasks describes a particular aspect of a scene. Complex traffic scene understanding is one of the biggest challenges to realizing fully autonomous driving. For example, urban scenarios comprise many independently moving traffic participants, ambiguous visual features, and illumination changes that together make them complex scenes. The goal of scene understanding is to obtain a rich and compact representation of the scene. Several promising works investigate the urban traffic scene understanding problem, including [69], [70] and [71]. However, researchers in this area agree that improvements in scene understanding rely on the advancement of other computer vision tasks, such as object detection and semantic segmentation.


2.1.2 Deep Learning

Deep learning is an approach that allows a computer to learn multiple levels of representation and abstraction from data such as images, sound, and text. Its main algorithm is inspired by the structure and function of the neurons in the human brain, and is thus called an artificial neural network. In recent years, its popularity and usefulness have grown tremendously, owing to more powerful computers, larger datasets, and techniques to train deeper networks. We focus on reviewing its application and development in computer vision research, especially the computer vision tasks related to autonomous driving.

In 2012, [44] created a deep convolutional neural network applied to image classification and won the 2012 ImageNet Large-Scale Visual Recognition Challenge, which started the revolution of computer vision research using deep learning. Soon researchers discovered ways to apply it to other computer vision tasks, and most of these approaches outperform traditional ones. Deep learning is now applied to object detection [55], [59], semantic segmentation [72, 74], pose estimation [73], depth map estimation [75], and many more. The success lies in the much more powerful feature extraction ability of deep neural networks compared with traditional manually engineered feature methods such as SIFT [76] and HOG [77]. Because our work in this dissertation is largely built on convolutional neural networks, and most of the improvements of deep CNNs were made by modifying the network structure and training pipeline, we review the building blocks of a CNN in section 2.3.


Another type of deep neural network, which we review in detail in section 2.4, is the Recurrent Neural Network, which uses the historical output as part of the current input. It differs from CNNs, which operate in a feed-forward fashion and do not use historical information. The Recurrent Neural Network can naturally take advantage of temporal information and memorize historical states, due to its recurrent input property. Combined with CNNs, it has been applied to many computer vision tasks where video sequences are available as the input, for example, image captioning [78, 79], video object detection [80], and video question answering [81]. In chapter 4, we address the video object detection problem by proposing a novel deep learning pipeline combining CNNs and RNNs to take advantage of the temporal information in video sequences.

2.1.3 Datasets

Datasets and benchmarks play a critical role in deep learning based computer vision research by providing problem specific data with ground truth. A well-defined benchmark provides a standard platform for researchers to compete with each other by proposing more advanced algorithms. A few large-scale and publicly available datasets should be mentioned here, which have had a major impact on deep learning based computer vision research. For example, ImageNet [132], PASCAL VOC [126], and Microsoft COCO [133] are datasets aiming to improve the state-of-the-art in tasks such as object classification, object detection, and semantic segmentation. The PETS [82] and MOTChallenge [83] datasets were presented to address the single and multiple object tracking problems.


For autonomous driving, the KITTI Visual Benchmark Suite dataset [84] was introduced to address stereo, optical flow, visual SLAM, and 2D/3D object detection problems. The dataset was captured from an autonomous driving vehicle with data from multiple sensors such as GPS/INS, stereo cameras, and a 3D laser scanner. In 2013, the dataset was extended to the tasks of object tracking and road/lane detection. Other datasets related to autonomous driving exist; for example, the Caltech Pedestrian Detection Benchmark [85] comprises 250,000 sequential images recorded by a vehicle while driving through regular urban scenarios, and the Cityscapes dataset [86] is provided especially for semantic and instance segmentation of real-world urban traffic scenes. However, the KITTI dataset has established itself as the standard benchmark in the context of autonomous driving applications and has attracted state-of-the-art algorithms to compete with each other on it. Therefore, we also train and fine-tune our deep learning architectures on the KITTI dataset.

2.2 Neural Network

Artificial Neural Networks are at the core of deep learning. One definition of deep learning is the study of neural networks that contain more than one hidden layer.

2.2.1 Neurons and Neural Network Architecture

The concept of the neural network was originally associated with the brain, but nowadays it is more of an engineering construct for deep learning tasks. The neuron is the basic computational unit of the brain; it produces output signals along its axon by receiving input signals from its dendrites. An axon connects to the dendrites of other neurons and transfers the signals it carries to those neurons as input. Figure 2.3 is a cartoon drawing of a biological neuron and its mathematical model. The neuron has a firing mechanism that emits a spike along the axon when the sum of its input signals is above a threshold. Similarly, the activation function in the artificial neuron applies a non-linear operation to the sum of the inputs, and allows the model to represent more complicated real-world problems. Details of commonly used activation functions are reviewed in section 2.2.2.

Figure 2.3: A cartoon drawing of a biological neuron (left) and its mathematical model (right) [86]. x = [x0, x1, x2] is the input vector; ωi and bi are the weight and bias parameters, respectively, and they are learned by training. The activation function allows the model to represent more complicated real-world problems.

After understanding the role of one neuron, we can view a neural network as a graph with stacks of connected neurons. Figure 2.4 describes a 3-layer neural network architecture and its corresponding mathematical equations. In neural networks, neurons are usually organized into distinct layers, and the fully connected layer is the most common type of layer in regular neural networks. Every neuron in one fully connected layer is connected to every neuron in the neighboring fully connected layers, while neurons within the same fully connected layer are not connected. In figure 2.4, each circle represents a neuron and the arrow connecting them represents the axon with data flowing on it. Note that cycles are not allowed in such a neural network because they would produce infinite loops. A simple activation function that changes all negative inputs to zero is used in the mathematical equations. The neural network works by mapping the input to the output. By training the neural network with data, we obtain estimates of all the weight parameters Wi and bias parameters Bi.

Figure 2.4: A 3-layer Neural Network architecture (left) and its corresponding mathematical equations (right). X is the input vector; Wi and Bi are the weight and bias parameters, respectively, and they are learned by training. H1 and H2 are the intermediate vectors after hidden layers 1 and 2.
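As an illustration of the forward pass sketched in figure 2.4, the following minimal example implements a 3-layer fully connected network with a simple activation; the layer sizes and random weights are hypothetical and chosen only for demonstration.

import numpy as np

def relu(x):
    # Simple activation that sets all negative inputs to zero
    return np.maximum(0, x)

# Hypothetical layer sizes: 3 inputs, two hidden layers of 4 neurons, 2 outputs.
rng = np.random.default_rng(0)
W1, B1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, B2 = rng.standard_normal((4, 4)), np.zeros(4)
W3, B3 = rng.standard_normal((2, 4)), np.zeros(2)

def forward(x):
    # Each fully connected layer is a matrix multiply plus bias, followed by the activation.
    h1 = relu(W1 @ x + B1)      # hidden layer 1
    h2 = relu(W2 @ h1 + B2)     # hidden layer 2
    return W3 @ h2 + B3         # output layer (no activation here)

y = forward(np.array([1.0, -2.0, 0.5]))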

2.2.2 Activation Functions

An activation function, or activation layer, is applied to every single number in the previous activation layer, and it introduces non-linearity into the deep learning system. It is used throughout deep neural networks, such as CNNs and Recurrent Neural Networks.


The activation layer is the reason why a deep neural network works; without it, as shown below, the deep neural network collapses to a one-layer neural network.

Assume we have a two-layer neural network without activation functions, and express the first layer as,

H1 = W1 ∙ X + B1    (2.1)

and the second layer as

H2 = W2 ∙ H1 + B2    (2.2)

where X is the input vector; Wi and Bi are the weight and bias parameters, respectively, and they are the parameters learned by training. H1 and H2 are the intermediate vectors after hidden layers 1 and 2.

Substituting the first layer into the second layer, we have,

H2 = W2 ∙ H1 + B2

   = W2 ∙ (W1 ∙ X + B1) + B2    (2.3)

   = (W2 ∙ W1) ∙ X + (W2 ∙ B1 + B2)

which is just a simple one-layer neural network. This shows that the composition of any number of linear layers is again linear, so an activation function must be applied to add non-linearity to the system.

An ideal activation function should be highly non-linear and continuously differentiable. In traditional neural networks, and in the early stage after CNNs were proposed, non-linear functions like tanh and sigmoid (figure 2.5) were used. However, researchers found that they all lead to the vanishing gradient problem [47] as the number of layers increases, i.e., the lower layers of the network train very slowly because the gradient decreases exponentially through the layers. This is a critical problem because we rely on the gradients to update the weight and bias parameters during learning.

Figure 2.5: The sigmoid function (left) and hyperbolic tangent function (right).

To address the vanishing gradient problem, the Rectified Linear Unit (ReLU) (Figure 2.6 left) was proposed by [46]. The ReLU layer applies the function f(x) = max(0, x) to all of the values in the input feature map. In other words, this layer changes all negative activations to 0, and thus increases the non-linear properties of the model. It is also very computationally efficient and converges much faster than the sigmoid in practice. However, ReLU is non-negative and therefore has a mean activation larger than zero. According to the justification in [48], we prefer an activation function with a mean closer to zero to decrease the bias shift effect.

The “Leaky ReLU” (LReLU) (Figure 2.6 right) is such a zero-mean activation function, proposed by [49]; it replaces the negative part of ReLU with a linear function and has been shown to be superior to ReLU. The Parametric Rectified Linear Unit (PReLU) is a generalization of LReLU that learns the slope a of the negative part, and has yielded improved learning behavior on large image benchmarks [50]. Its mathematical form is f(x) = max(ax, x), where a is learned during the training process.

Figure 2.6: The ReLU function (left) and leaky ReLU function (right).

Other alternative activation functions include the Exponential Linear Unit (ELU) [48] and Maxout [51]. In practice, when designing and training our deep neural networks, we first choose ReLU because of its simplicity and computational efficiency. It is also common to try LReLU, ELU, or Maxout, but we avoid tanh and sigmoid because of the vanishing gradient problem they cause.
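For reference, the activation functions discussed above can be written in a few lines. This is only an illustrative sketch; the slope and alpha values are the commonly quoted defaults rather than values prescribed in this dissertation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    # The negative part is replaced by a small linear slope a instead of zero
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    # Exponential Linear Unit: negative saturation pushes the mean activation toward zero
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), elu(x))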

2.2.3 Training A Neural Network

At a high level, a neural network can be seen as a function mapping the input to the output, and the unknown parameters are the weights and biases in the network. We learn these parameters by feeding numerous training data to the neural network, which is called a data-driven approach. If the training data are accompanied by ground truth for a specific problem, it belongs to the supervised learning category. In this dissertation, we only deal with supervised learning problems. For example, both images and the corresponding ground truth bounding boxes and class labels are provided as the training data for the object detection task. In this section, we briefly review the main steps to train a neural network and the key techniques involved in this process.

• Setting up

The first step is to define the problem mathematically and to choose a model structure for the problem. Specifically, we need to determine the number of hidden layers in the network, the number of neurons in each layer, and the activation function used. If it is a Convolutional Neural Network, we must also determine the filter size in each layer, the stride, the padding, and so on. The reader can refer to section 2.3 for more details. One important task in this step is to determine the loss function to be optimized. The loss function is a function with the weights and biases as its parameters. In our case, usually two kinds of loss are used: the cross-entropy loss [87] for classification problems and the L2 loss for regression problems. The L2 loss minimizes the squared differences between the estimated and ground truth values.
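As a small illustration of the two loss types mentioned above, the sketch below computes a cross-entropy loss for a classification output and an L2 loss for a regression output; the example inputs are hypothetical.

import numpy as np

def cross_entropy_loss(probs, label):
    # probs: predicted class probabilities (after softmax); label: ground truth class index
    return -np.log(probs[label] + 1e-12)

def l2_loss(pred, target):
    # Squared differences between the estimated and ground truth values (regression)
    return np.sum((pred - target) ** 2)

print(cross_entropy_loss(np.array([0.7, 0.2, 0.1]), 0))    # small loss: confident and correct
print(l2_loss(np.array([1.0, 2.0]), np.array([1.5, 1.0])))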

• Data preprocessing and augmentation

Two commonly used image data preprocessing methods are mean subtraction and normalization. Data augmentation is commonly used when training on small datasets and serves to create new training data by rotating, translating, and flipping the original image. We can also augment an image by changing its brightness or adding shadows.
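The sketch below shows what these preprocessing and augmentation steps might look like for an image stored as a NumPy array; the shift range and brightness factors are illustrative assumptions only.

import numpy as np

def augment(image, rng):
    """Create a new training sample from an H x W x 3 image array with values in [0, 255]."""
    out = image.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                           # horizontal flip
    shift = rng.integers(-10, 11)
    out = np.roll(out, shift, axis=1)                   # crude horizontal translation
    out = np.clip(out * rng.uniform(0.8, 1.2), 0, 255)  # brightness change
    return out

def preprocess(image, mean):
    # Mean subtraction followed by normalization to a roughly unit range
    return (image.astype(np.float32) - mean) / 255.0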


• Weight initialization

Before training our neural network, all the weight and bias parameters must be initialized. The first idea is to initialize them with small random numbers, for example drawn from a Gaussian with zero mean and 0.01 standard deviation [44]. However, several papers showed that small random numbers are not suited to deep neural networks, because the non-linear activation functions will lead to either very large or vanishing gradients, as explained in [90]. Instead, more advanced initialization methods such as Xavier [88] and LSUV [89] were proposed to account for the non-linearity of the activation function. This is still an active research area, but Xavier initialization is usually good enough to allow the weight parameters to be learned efficiently.
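A minimal sketch of Xavier initialization follows, assuming the uniform variant scaled by the fan-in and fan-out of the layer; the layer sizes are hypothetical.

import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Xavier/Glorot initialization: the variance is scaled by the number of input and output units."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W = xavier_init(fan_in=512, fan_out=256, rng=rng)   # weights for a 512 -> 256 layer
b = np.zeros(256)                                   # biases are commonly initialized to zero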

• Optimization and Backpropagation

Once the loss function is properly defined and the weight W and bias B parameters are initialized, the next step is to find the W and B that minimize the loss function, i.e., optimization. This is the core purpose of training a neural network. Gradient Descent [91] is currently the most effective and popular optimization method for training a deep neural network; it computes the gradient of the loss function with respect to each parameter W, B and updates them based on these gradients. Assume θ denotes the parameters to be updated; the update equation is,

θ = θ − αL′(θ)    (2.4)

where L′(θ) is the gradient of the loss function with respect to θ, and α is the learning rate, which controls how far θ moves in the negative gradient direction.


If the training data are very large, for example the ILSVRC challenge has millions of training images, it is unrealistic to compute the loss function over the entire training set just to perform a single parameter update. A practical approach is to compute the gradient over batches of the training data, called Mini-batch Gradient Descent. In the extreme case, a batch contains only one training example, and the optimization is called Stochastic Gradient Descent (SGD). In practice, a batch size that is a power of 2 is used, such as 4, 16, or 32.
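The following sketch illustrates one epoch of mini-batch gradient descent; grad_fn, the batch size, and the learning rate are placeholders for a concrete model and are not taken from this dissertation.

import numpy as np

def sgd_epoch(theta, X, Y, grad_fn, lr=0.01, batch_size=32, rng=None):
    """One epoch of mini-batch gradient descent: theta <- theta - lr * gradient on each batch.
    grad_fn(theta, x_batch, y_batch) is assumed to return the gradient of the loss w.r.t. theta."""
    rng = rng or np.random.default_rng()
    order = rng.permutation(len(X))            # visit the training data in random order
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        theta = theta - lr * grad_fn(theta, X[idx], Y[idx])
    return theta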

It is difficult to compute the gradients efficiently when the neural network is very deep. Backpropagation [91] is the approach that addresses this; it computes the gradients of expressions through recursive application of the chain rule at each layer. The chain rule tells us

dz/dx = (dz/dy) ∙ (dy/dx)    (2.5)

if z = f(y) and y = g(x), which means that z is a function of y, and y is a function of the parameter x. Here dz/dx is the derivative of z with respect to x, dz/dy is the derivative of z with respect to y, and dy/dx is the derivative of y with respect to x. Backpropagation allows us to efficiently compute the gradients on the connections of the neural network with respect to a loss function.

• Parameter Update and Learning Rate

Once the gradient is computed with backpropagation, it is used to perform a parameter update. There are several approaches for performing the update; the naive one is equation 2.4, repeated here as equation 2.6, where α is the learning rate and is a hyperparameter. In deep learning, hyperparameters are a set of variables set before actually training the system, and they are often chosen by hand or by search algorithms.

θ = θ − αL′(θ) (2.6)

Momentum update [92] is another approach that almost always converges faster in the presence of high curvature and noisy gradients. The update equations are,

v = mu ∙ v − αL′(θ)    (2.7)

θ = θ + v    (2.8)

where v is initialized at zero and mu is another hyperparameter. v is built up in directions that have a consistent gradient. Another variant of the momentum update is proposed in [93] and is reported to perform better. Another important and commonly used method is the Adam update, first proposed in [94]. The Adam update formula adds element-wise scaling of the gradient based on the historical sum of squares in each dimension.

As mentioned above, training neural networks can involve many hyperparameter settings, and the learning rate α is the most important hyperparameter. Training would take too long to converge if α is too small, and the loss will get out of control if α is too large. Usually a learning rate of 0.01 is a good starting point. In deep neural networks, it is usually helpful to reduce the learning rate over time, which is called learning rate decay.
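The update rules above can be sketched as follows; the hyperparameter values (mu, the betas, and the decay schedule) are common defaults used here for illustration only.

import numpy as np

def momentum_update(theta, grad, v, lr=0.01, mu=0.9):
    # The velocity accumulates gradients with decay mu (eq. 2.7-2.8)
    v = mu * v - lr * grad
    return theta + v, v

def adam_update(theta, grad, m, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam keeps running averages of the gradient (m) and its element-wise square (s)
    m = beta1 * m + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)            # bias correction, t starts at 1
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s

def step_decay(lr0, step, drop=0.5, every=10000):
    # Simple learning rate decay: multiply the rate by `drop` every `every` steps
    return lr0 * drop ** (step // every)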

2.3 Convolutional Neural Network

Deep convolutional neural networks (CNNs) have been at the heart of spectacular advances in deep learning, and they constitute a very useful tool for machine learning practitioners. The input to a CNN is multi-dimensional, such as a Red Green Blue (RGB) image, and the spatial information is preserved after convolution. Although CNNs were used as early as the 1990s to solve character recognition tasks [43], their current significant influence is due to much more recent work, when a deep CNN was used to outperform the state-of-the-art methods in the ImageNet image classification challenge [44]. Since then, CNNs have been used extensively in many computer vision and natural language processing problems.

2.3.1 Architecture

A classic CNN architecture is made of a sequence of layers, and each layer is a module that transforms one volume of data into another. Usually, the intermediate result is called an activation layer. There are four main types of layers: the convolutional layer, the activation layer, the pooling layer, and the fully connected layer. A typical CNN architecture looks like this:

Input → Conv → Activation Function → Pooling → Fully Connected Layer

An example is the famous LeNet-5 architecture [45] (figure 2.7), used for handwritten and machine-printed character recognition.


Figure 2.7: The LeNet-5 neural network architecture [45]. It is a typical CNN architecture comprised of convolution layer, activation layer, pooling layer and fully connected layer. It was designed to recognize the hand-written letters and digits. The subsampling in this architecture is actually a 2x2 pooling layer.

Input is the original multi-dimensional data. In computer vision, the input is the raw pixel values of the image with three color channels. Color channels specify the color of each pixel of the image. For example, in RGB (Red, Green, Blue), an image has three numbers for each pixel that directly correspond to the R, G and B elements of the computer display.

The Conv layer slides a filter over the activation layer, computing a dot product between the filter weights and the corresponding values of the activation layer and summing them to a single number;

The Activation layer (included in the convolution layer in LeNet-5) applies a non-linear function to each of the individual values of the activation layer. For example, ReLU, i.e., f(x) = max(0, x), is the most commonly used activation function;


The Pooling layer (i.e., the subsampling layer in LeNet-5) performs a down-sampling operation along each channel;

The Fully Connected layer is the same as the layer in an ordinary neural network, and each element in this layer is connected to all the elements in the previous activation layer.

We will review each of the above layers in the following sections.

2.3.2 Convolution

The convolution is the basic operation in a Convolutional Neural Network. It takes an image or a feature map of dimensions W, H, D, which correspond to the width, height, and channel number of that image or feature map. A square filter of width F convolves with the input feature map in a sliding window fashion, and it must have the same number of channels as the feature map. When it convolves, the values in the filter are multiplied with the corresponding values in the feature map and summed to one single value. F and the number of filters K are the hyperparameters chosen when the filter is designed. There are two other hyperparameters, the stride S and the zero padding number P:

Stride S: The filter moves across the feature map from left to right and top to bottom with step size S. We call this step size the stride. For example, we move the filter one pixel at a time over the feature map when the stride is 1.

Zero padding number P: To control the spatial size of the output feature map and take advantage of the information at the border, it is common to pad the input feature map with zeros around the border. The size of this zero-padding is called the zero padding number.

At this point, we can calculate the size of the output feature map using the following equation:

W′ = (W − F + 2P)/S + 1    (2.9)

H′ = (H − F + 2P)/S + 1    (2.10)

D′ = K    (2.11)

where W′, H′, D′ are the width, height, and channel number of the output feature map, respectively, and W, H, D correspond to the width, height, and channel number of the input layer. Figure 2.8 shows an example of the convolution operation.

Usually we also care about the number of parameters in a convolution operation, because it is closely related to the model size and computational complexity of our model. These are the parameters we need to learn during the training process. The number of parameters N in a convolution layer is:

N = K ∙ (F ∙ F ∙ D + 1)    (2.12)


Figure 2.8: An example of convolution operation. The input feature map has spatial dimension 32 × 32 × 3. After applying 6 filters with size 5 × 5 × 3, stride 1, and zero padding 0, the size of the output feature map is 28 × 28 × 6, and the number of parameters in this convolution layer is 456.
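The output size and parameter count formulas (equations 2.9-2.12) can be checked with a small helper; the example call below reproduces the numbers quoted for figure 2.8.

def conv_output_shape(W, H, D, F, K, S=1, P=0):
    """Output size (eq. 2.9-2.11) and parameter count (eq. 2.12) of a convolution layer."""
    W_out = (W - F + 2 * P) // S + 1
    H_out = (H - F + 2 * P) // S + 1
    D_out = K
    n_params = K * (F * F * D + 1)      # +1 accounts for the bias of each filter
    return (W_out, H_out, D_out), n_params

# Example of figure 2.8: 32x32x3 input, six 5x5x3 filters, stride 1, no padding.
print(conv_output_shape(32, 32, 3, F=5, K=6, S=1, P=0))   # ((28, 28, 6), 456)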

2.3.3 Pooling

Pooling is a simple but important operation in CNNs; it allows small translation invariance and helps reduce the size of the input feature map while retaining the most important information. It is also referred to as a down-sampling layer. The pooling operation is applied to each channel of the feature map, and thus preserves the channel dimension. There are two types of commonly used pooling operations: average pooling and max pooling. Figure 2.9 shows an example of max pooling operating on a 28 × 28 × 6 feature map with a 2 × 2 window and stride 2, resulting in a 14 × 14 × 6 feature map. The stride here has the same meaning as in the convolution layer. The max pooling operation takes a filter (normally of size 2x2) and a stride of the same length, applies it to the input feature map, and outputs the maximum number in every sub-region that the filter covers. Average pooling is the same as max pooling except that it outputs the average of the numbers in the sub-region.

Figure 2.9: An example of max pooling operating on a 28 × 28 × 6 feature map with a 2 × 2 window and stride 2, resulting in a 14 × 14 × 6 feature map. The pooling operation is applied to each channel of the feature map, and thus preserves the channel dimension.

Assume we apply an F × F filter window to a feature map with dimensions W, H, D and stride S; the resulting feature map dimensions W′, H′, D′ are:

W′ = (W − F)/S + 1    (2.13)

H′ = (H − F)/S + 1    (2.14)

D′ = D    (2.15)

In practice, there are only two commonly used pooling operations, both of them max pooling: max pooling with filter size 3 × 3 and stride 2, and max pooling with size 2 × 2 and stride 2. In fact, the latter is much more commonly seen in various CNNs. Note that it is rare to use zero-padding for the pooling operation.

2.3.4 Dropout

After several convolutional, ReLU, and pooling layers, the output represents high-level features of the data. Usually a fully connected layer follows to learn a non-linear combination of these features and output a one-dimensional array. In a fully connected layer, each neuron has full connections to every neuron in the previous layer, as in regular Neural Networks (Figure 2.4). Their activations can be computed with a matrix multiplication followed by a bias offset, as explained in the neural network section above.

In practice, overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. Such a model is too dependent on the training data and will not generalize to new data. For example, deep neural networks often have tens of millions of parameters while only a limited amount of data is available, so they tend to overfit. Dropout is an effective and surprisingly simple technique used in deep learning to prevent overfitting, and it is a vital feature in almost every state-of-the-art neural network implementation. It works by randomly deactivating part of the network during training. The primary idea is to randomly drop neurons from each layer of the neural network with a probability, such as 0.5, by setting these neurons to zero in the training phase (Figure 2.10). In the testing phase, all neurons are used for prediction. Dropout provides another benefit besides avoiding overfitting, which is forcing the neural network to learn more from the data. This is equivalent to training multiple neural networks to learn different aspects of the data and averaging their outputs.

Figure 2.10: An example of applying dropout to a standard neural network. Hidden layers 1 and 2 in (a) are both fully connected layers. Each circle represents a neuron and the arrow connecting them represents the axon with data flowing on it. Red circles represent the input vector; green circles represent the intermediate feature maps; and blue circles represent the output vector. A red cross in a circle means that no data flows through that neuron.
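A minimal sketch of dropout follows, using the inverted-dropout variant that is common in practice (surviving activations are rescaled during training so that no extra scaling is needed at test time); the drop probability is illustrative.

import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=None):
    """Inverted dropout on an activation array h: during training each neuron is zeroed
    with probability p_drop and the survivors are rescaled by 1/(1 - p_drop)."""
    if not train:
        return h                      # at test time, all neurons are used for prediction
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask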

2.3.5 Transfer Learning

We reviewed the normal training protocol and the commonly used techniques to train a regular neural network in section 2.2.3. Training a convolutional neural network is no different from training a regular neural network. However, in practice, very few people train a deep CNN from scratch, i.e., with randomly initialized parameters. Instead, it is very common to pre-train a CNN on a very large dataset such as ImageNet, which contains millions of images and 1000 classes, and use it as a feature extractor or an initialization. We can take a published pre-trained model, such as VGG-16 (Figure 2.11 (a)) [95], which is usually trained for several weeks across multiple GPUs on ImageNet.

If our dataset is very small, for example only several hundred training images, we use the pre-trained model as a feature extractor. Specifically, we take a model pre-trained on ImageNet, replace the last fully connected layer with a new one, and treat the rest of the model as a feature extractor. For example, in Figure 2.11 (b), only the last fully connected layer is trainable and the rest of the VGG-16 model is frozen, i.e., its weight parameters are not updated at training time.

If we have a medium-sized dataset, we have the choice to train more layers of the pre-trained model. Usually we keep some of the earlier layers fixed and only fine-tune some higher-level layers of the network. The reason is that the earlier CNN layers contain more generic features (such as edges, lines, and corners) that should be useful for many tasks, while later CNN layers become more specific to the data in the original dataset. For example, in Figure 2.11 (c), we add more trainable layers compared with Figure 2.11 (b) because we have more training data in this case.


Figure 2.11: Transfer learning with CNNs. (a) The pre-trained VGG-16 model [95] on the ImageNet dataset for classification. (b) Transfer learning on a small dataset. (c) Transfer learning on a medium dataset. “conv3-64” denotes 64 filters with size 3x3. “fc-4096” denotes a fully connected layer with output vector size 4096. The layers in the red box are frozen, which means the weight and bias parameters will not be updated. The layers in the blue box will be learned during the transfer learning process.
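The two transfer learning settings of figure 2.11 (b) and (c) can be sketched with the Keras API shipped with TensorFlow; the input size, the number of classes, and the choice of which block to unfreeze are illustrative assumptions, not the configuration used in this work.

import tensorflow as tf

# Load VGG16 pre-trained on ImageNet, without its fully connected classification head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))

# Small dataset (figure 2.11 (b)): freeze every pre-trained layer and train only the new head.
for layer in base.layers:
    layer.trainable = False

# Medium dataset (figure 2.11 (c)): additionally unfreeze the last convolutional block.
for layer in base.layers:
    if layer.name.startswith("block5"):    # VGG16 layer names run block1_conv1 ... block5_conv3
        layer.trainable = True

# Attach a new head for a hypothetical 10-class problem and compile.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9),
              loss="sparse_categorical_crossentropy")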

2.4 Recurrent Neural Network

So far, we have only discussed feedforward neural networks, in which the connections between neurons do not form a cycle, which means the information can only move in one direction. In contrast, the connections between neurons in a recurrent neural network form a directed cycle, which makes the RNN a natural candidate for processing sequence data such as text, audio, and video. In this section, we review the architecture of the RNN, its optimization methods, and its application to video sequence data.

2.4.1 Architecture

An RNN models a dynamic system, where the output depends not only on the current input but also on the previous state of the system. The system maintains a state vector s_t. At every time step, the system produces an output h_t, updates the state s_t, and sends the state s_t to the next time step. In this way, the system has memory of the historical information. Sequence data, for example time-series data, changes over time but stays consistent across adjacent time steps. An RNN is suitable for processing sequences of data, because the historical information can influence and support the output at the current time step. For example, it is better to know the location of the vehicle in the previous frame if we want to detect the car in the current frame of a video sequence.

Figure 2.12 shows a common form of the RNN architecture. The left part of the equal sign is the recursive description of the RNN architecture, and the right part is the corresponding RNN model unrolled over a time sequence. x_t is the input at time step t, for example, a frame of a video sequence. s_t is the state at time step t, and it is the memory of the network. The state s_t is calculated based on the previous state s_{t−1} and the current input x_t, i.e., s_t = f(s_{t−1}, x_t). The function f is usually a non-linear function such as tanh or ReLU, which were introduced in section 2.2.2. h_t is the output at time step t. For example, if we want to detect a vehicle in the video sequence, it would be the bounding box indicating the location of the vehicle in the frame. Note that the diagram in figure 2.12 has an output at each time step, but this may not be necessary depending on the task.

Figure 2.12. An example of an RNN. The left part of the equal sign is the recursive description of the RNN, and the right part is the corresponding RNN model unrolled over a time sequence. The red arrow indicates the backpropagation direction at time step t. x_t is the input at time step t, h_t is the output at time step t, and s_t is the cell state at time step t.
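A minimal sketch of one vanilla RNN time step follows, assuming a tanh state update and a linear output readout; the dimensions and weight values are hypothetical. Note that the same weight matrices are reused at every time step.

import numpy as np

def rnn_step(x_t, s_prev, W_xs, W_ss, W_sh, b_s, b_h):
    """One vanilla RNN time step: the new state depends on the previous state and the
    current input, s_t = tanh(W_xs x_t + W_ss s_{t-1} + b_s); the output is read from s_t."""
    s_t = np.tanh(W_xs @ x_t + W_ss @ s_prev + b_s)
    h_t = W_sh @ s_t + b_h
    return h_t, s_t

# Hypothetical sizes: 8-dimensional inputs, 16-dimensional state, 4-dimensional outputs.
rng = np.random.default_rng(0)
W_xs, W_ss = rng.standard_normal((16, 8)) * 0.1, rng.standard_normal((16, 16)) * 0.1
W_sh, b_s, b_h = rng.standard_normal((4, 16)) * 0.1, np.zeros(16), np.zeros(4)

s = np.zeros(16)                                  # initial state
for x in rng.standard_normal((5, 8)):             # a short input sequence of 5 steps
    h, s = rnn_step(x, s, W_xs, W_ss, W_sh, b_s, b_h)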

2.4.2 Training RNNs

We have described how to train a regular feedforward neural network in section 2.2.3.

In a feedforward network, backpropagation moves backward from the error (the difference between the estimate and the ground truth in supervised learning) through each hidden layer, and assigns the weights in these hidden layers their share of responsibility for the error by calculating their partial derivatives. These derivatives are then used by the Gradient Descent learning algorithm to adjust the weights in the direction that decreases the error. Training an RNN relies on a generalization of backpropagation called Backpropagation Through Time (BPTT). In the case of an RNN, time is expressed as the ordered series of calculations linking one time step to the next.

In figure 2.12, the red arrow indicates the backpropagation direction at time step t. Through backpropagation, the derivatives of the error at t with respect to the weight parameters in each layer are calculated based on the chain rule. The key difference is that the weights in each layer of an RNN are shared, while the weights in a feedforward network are all different. Because it takes too much time and computer memory to backpropagate through very long sequences, in practice Truncated BPTT is used to reduce the cost by only backpropagating through a limited number of time steps. The drawback is that the RNN cannot learn the long-range dependencies that full BPTT could capture, due to the truncation.

The vanilla RNN architecture described in figure 2.12 can easily suffer from the exploding or vanishing gradient problem [16, 17] when the input sequences are too long. In this case, the RNN cannot learn anything. We explore this problem in detail in chapter 4 and introduce Long Short-Term Memory units, an advanced RNN architecture.


Chapter 3: Dynamic Object Detection on Road

Robust and accurate object detection is one of the most crucial elements of autonomous driving. Although object detection with Lidar data has been explored by several researchers, the camera is the dominant sensor for the object detection task due to the abundant information it produces. It is, of course, ideal to combine Lidar and camera data to obtain more robust and accurate object detection results. However, in this chapter, we focus on computer vision based object detection, and we develop a deep learning based method to detect vehicles, pedestrians, and cyclists in images.

3.1 Related Work

3.1.1 Traditional Approaches

The traditional detection pipeline often consists of region of interest (ROI) extraction, object classification, and refinement steps. ROI extraction recognizes the areas that could contain the objects of interest. The sliding window approach is the simplest: it shifts a box over the image at a fixed step size and various scales. A large number of bounding boxes can be produced by the sliding window approach, and it is too time consuming to feed all of them to the classifier. Fortunately, the number of bounding boxes can be reduced with filtering methods, such as [98], by assuming certain sizes, ratios, and positions of the candidate bounding boxes. Several alternatives were proposed to improve the efficiency. For example, Selective Search [53] exploits segmentation to efficiently extract approximate locations instead of performing an exhaustive search over the full image, and the Edgeboxes approach [99] uses object boundaries as a feature to score candidate bounding boxes extracted from other methods. In the object detection task, we describe the detected object by its location (bounding box coordinates) and category.

The classification step determines which classes the objects in the candidate bounding box areas belong to. The classification can be quite costly due to the large number of candidate bounding boxes. Therefore, the classifier must be able to quickly recognize objects in the bounding boxes. An early classifier, such as [100], improves the classification speed by efficiently discarding candidate boxes in the background region and focusing on more promising regions. With the introduction of the Support Vector Machine (SVM) by [101], which maximizes the margin of all samples from a linear decision boundary, the SVM became the dominant classifier due to its speed and accuracy.

The purpose of the refinement step is to filter out detection results with low confidence and to delete repeated detections. A simple Non-Maximum Suppression (NMS) [102] is usually good enough for the refinement purpose.

Reliable classification relies on robust features learned from the training data. Two of the most commonly used feature descriptors are the Scale Invariant Feature Transform (SIFT) [104] and the Histogram of Oriented Gradients (HOG) [103]. The Support Vector Machine (SVM) [54] combined with HOG features became the most efficient and fast traditional classification algorithm. Rather than learning the appearance of whole objects, the idea of splitting a complex appearance into simple parts is more promising, because this allows more flexibility and requires less training data to learn the appearance of the objects. The Deformable Part Model (DPM) by [105] is the product of this idea.

However, all previous methods depend on hand-crafted features that rely on specific feature descriptors and are difficult to design. With the introduction of deep learning methods to object detection, convolutional neural networks have dominated this task while significantly boosting performance.

3.1.2 CNNs for Object Detection

Convolutional Neural Networks are the leading method for the visual object detection task. We review some of the milestone methods. In 2013, [55] proposed region-based Convolutional Neural Networks (R-CNN) for object detection, which brought dramatic improvements compared with traditional methods. The R-CNN approach uses the traditional Selective Search [53] method to propose regions of interest, and then classifies these regions using a CNN. However, this approach is expensive due to the large number of region proposals, and it contains a lot of duplicated computation from overlapping region proposals. To overcome this problem, Fast R-CNN [56] runs a CNN extractor over the whole image and makes region proposals from the resulting feature map. This significantly reduces the computation time by sharing the computation of feature extraction.

However, both R-CNN and Fast R-CNN still rely on the traditional Selective Search method to propose regions of interest. The Multibox detector [106] and Faster R-CNN [57] are among the first papers to use a CNN to propose the regions of interest. They generate a collection of boxes with different locations, scales, and aspect ratios on a selected intermediate feature map and project them onto the original image. These boxes are called “anchors”, “a priori boxes”, or “default boxes” in different publications. For each anchor, the CNN model is trained to predict a class label (a classification problem) and an offset by which the anchor should be moved to fit the ground truth bounding box (a regression problem). During training, we minimize the sum of both the classification error and the regression error. Since being proposed, the Faster R-CNN style methods have had a significant influence on the object detection community and led to many follow-up methods such as R-FCN [107]. R-FCN is a fully convolutional neural network, in which there are only CNN layers and no fully connected layers. This reduces the number of parameters, making the model smaller and more computationally efficient.

So far, all these methods consist of two stages. The first stage, called the Region Proposal Network (RPN) in Faster R-CNN, pushes the whole image through a feature extractor and proposes the regions of interest. The purpose of this stage is to produce anchors at different spatial locations, scales, and aspect ratios. In the second stage, these anchors are used to crop features from a certain intermediate feature map, and a class label and an offset to the anchor position are predicted. In both stages, the loss function is the sum of the classification error and the regression error. This two-stage mechanism stops these methods from becoming real-time detectors because of the duplicated computation.

Alternatively, the single-stage approach can achieve real-time detection. The single-stage approach utilizes a single feed-forward convolutional network to directly predict the class labels and anchor offsets without a region proposal network, which largely accelerates detection. You Only Look Once (YOLO) [59] divides the input image into a grid, and each grid cell predicts two bounding boxes as the anchors. Each anchor is responsible for directly predicting the location of the bounding box and the class probability. YOLO is the first real-time detector. Recently, the authors of YOLO proposed various improvements drawing on the rapid development of the object detection community; for example, they replaced the fully connected layers with a convolution layer. SSD [58] produces anchors at multiple feature maps at different scales, and predicts class probabilities and offsets using these multi-scale features. In comparison, YOLO uses only the topmost feature layer for the prediction. In this chapter, we propose a one-stage CNN-based object detector that takes into account the special requirements of the autonomous driving application.

3.2 Model Architecture

As described in figure 3.1, we propose a novel one-stage object detection system based on Fully Convolutional Neural Networks (FCNN), designed with the following factors in mind. Firstly, the precision and recall should be as high as possible. Precision and recall are two important measures of the accuracy of object detection; the reader can refer to section 3.4 for their definitions. This is critical because ideally all the objects of interest must be reliably detected. Secondly, the detection speed has to allow real-time operation, which is rarely considered in regular computer vision benchmarking competitions. The speed requirement is important because it is directly related to the time latency of the vehicle control module. Finally, the model size should fit the vehicle's embedded processor. Even though the computing capability of on-board computers has improved greatly, it is beneficial to reduce the model size and energy consumption of the object detection system.

Figure 3.1: Our proposed object detection architecture, consisting of three steps. (1) Feature extraction: the input image is fed into a feature extractor pre-trained on ImageNet, such as VGG16 [95], and multi-scale feature maps are produced. (2) Prediction with a convolution: anchors are created on selected feature maps and feature map combinations, and a convolution layer is applied to these anchors to predict the offsets to the anchors, the associated confidence, and the class probabilities. (3) Refinement: usually a Non-Maximum Suppression (NMS) is good enough to filter out the repeated bounding boxes.

Overall, the object detection system consists of three steps: feature extraction, prediction and refinement.

Feature extraction: this relies on a feature extractor pre-trained on the ImageNet classification task, which is the backbone convolutional neural network behind our detector. VGG16 [95] is a commonly used benchmark extractor in various object detectors, including the R-CNN methods and R-FCN. However, its large model size and computational complexity prevent it from being the ideal feature extractor for real-time object detection.


Recently, several small-size models were proposed whose authors claim comparable accuracy to VGG16 on the ImageNet classification task, including SqueezeNet [110] and MobileNet [111]. The backbone feature extractor constitutes a large part of the model size of our proposed object detection system. We compare popular feature extractors, including those mentioned above, in the experiments and results section 3.4.

Prediction: This is the key step of our object detection system. For a feature map of size W × H with P channels, K anchors (default bounding boxes) with pre-determined shapes and sizes are produced, centered on W × H uniformly distributed spatial grid cells. Each anchor is associated with C class probabilities and 4 bounding box offsets, where C is the number of classes to distinguish. As shown in figure 3.2, by convolving the feature map with (4 + C) × K filters of size 3 × 3 × P and proper zero padding, we get an output map with the same width W and height H but a different channel size (4 + C) × K. The output map can be interpreted as C class probabilities and 4 bounding box offsets for each anchor. We assign the label with the highest class probability to each bounding box.


Figure 3.2: An example of prediction on a feature map. The feature map of size W × H with P channels has K = 3 anchors produced on each grid cell. By convolving the feature map with (4 + C) × K filters of size 3 × 3 × P, where C is the number of classes to distinguish, and proper zero padding, we get an output map with the same width W and height H but a different channel size (4 + C) × K. The output map can be interpreted as C class probabilities and 4 bounding box offsets for each anchor.

Assume the position and shape of an anchor are described as (x̂_i, ŷ_j, ŵ_k, ĥ_k), i ∈ [1, W], j ∈ [1, H], k ∈ [1, K]. Here x̂_i, ŷ_j are the coordinates of the center of the anchor, and ŵ_k, ĥ_k are the width and height of the k-th anchor. For each anchor (i, j, k), four offsets (δx_ijk, δy_ijk, δw_ijk, δh_ijk) are predicted by the convolution. Following the method presented in [57], called Faster R-CNN, the final bounding box coordinates (x_p, y_p, w_p, h_p) are:

x_p = x̂_i + ŵ_k ∙ δx_ijk,

y_p = ŷ_j + ĥ_k ∙ δy_ijk,

w_p = ŵ_k ∙ exp(δw_ijk),

h_p = ĥ_k ∙ exp(δh_ijk).    (3.1)

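A small sketch of the box decoding of equation 3.1, together with its inverse transformation (which appears later as equation 3.5 and is used to build the regression targets), follows; the function names are hypothetical.

import numpy as np

def decode_box(anchor, offsets):
    """Apply the predicted offsets to an anchor following eq. 3.1.
    anchor = (x_hat, y_hat, w_hat, h_hat), offsets = (dx, dy, dw, dh)."""
    x_hat, y_hat, w_hat, h_hat = anchor
    dx, dy, dw, dh = offsets
    x = x_hat + w_hat * dx
    y = y_hat + h_hat * dy
    w = w_hat * np.exp(dw)
    h = h_hat * np.exp(dh)
    return x, y, w, h

def encode_box(anchor, gt):
    """Inverse transformation (eq. 3.5): compute the offsets that map the anchor onto the ground truth box."""
    x_hat, y_hat, w_hat, h_hat = anchor
    xg, yg, wg, hg = gt
    return (xg - x_hat) / w_hat, (yg - y_hat) / h_hat, np.log(wg / w_hat), np.log(hg / h_hat)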

The prediction step of our system differs from the last layer of the Region Proposal Network (RPN) in Faster R-CNN: the RPN is only responsible for predicting bounding box proposals for classification, and the fully connected layers in Faster R-CNN are used to classify and regress the bounding box proposals.

Refinement: Similar to the traditional SVM + HOG detection pipeline, multiple bounding boxes will be detected surrounding an object in the image. Therefore, a Non-Maximum Suppression (NMS) is used to filter out redundant bounding boxes to obtain the final detections. An Intersection-Over-Union (IOU) overlap ratio is calculated between two bounding box detection results, each associated with a confidence score indicating the probability that an object of interest exists in it (Figure 3.3 (a)). If the IOU overlap is higher than a threshold, the bounding box detection result with the lower confidence score is deleted (Figure 3.3 (b)).

Figure 3.3: The Intersection-Over-Union (IOU) ratio calculation and Non-Maximum Suppression (NMS). (a) The IOU ratio is the area of intersection over the area of the union, which indicates how much two bounding boxes overlap. (b) The detection results before and after Non-Maximum Suppression (NMS).
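The IOU computation and the NMS procedure described above can be sketched as follows; the corner-based box format and the 0.5 threshold are illustrative choices.

import numpy as np

def iou(box_a, box_b):
    """Intersection-Over-Union of two boxes given as (x1, y1, x2, y2) corners."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box and drop any remaining box that overlaps it too much."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        order = np.array([i for i in order[1:] if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep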

3.3 Implementation


In the last section, we described our one-stage model architecture. This architecture can work with various backbone CNN models and embraces multiple ways to take advantage of their feature maps. In this section, we describe the key implementation steps, including the bounding box matching strategy, the training loss function, and the choice of anchor shapes.

Bounding Boxes Matching: To train our object detection neural network, we need to determine which anchor is responsible for a given ground truth detection. For example, there are only a few ground truth objects in an image, but there are usually several thousand anchors with different locations and shapes. First, we compute the IOU ratios between each ground truth bounding box and all the anchors, and match the anchor with the highest IOU ratio to that ground truth. The reason is that we always want to select the closest anchor to match the ground truth bounding box, to minimize the transformation needed. To allow our network to predict multiple bounding boxes with high confidence scores, we then match anchors to any ground truth with an IOU ratio larger than a threshold (0.7 in our implementation, chosen after multiple experiments).

Training Loss: By defining a proper training loss, our object detection system can be trained in an end-to-end fashion. For one feature map, we define our multi-task loss function as a weighted sum of the classification loss (Lcls) and the regression loss (Lreg):

Loss = (1/N) ∙ (Lcls + Lreg)    (3.2)

where N is the number of matched anchors.


The classification loss Lcls is the cross-entropy loss, i.e., the log loss over multiple classes:

Lcls = − Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{k=1}^{K} Σ_{c=1}^{C} I_ijk ∙ y_c ∙ log(p_c)    (3.3)

where i ∈ [1, W], j ∈ [1, H], k ∈ [1, K], c ∈ [1, C]. W and H are the width and height of the current feature map, respectively, K is the number of anchors at each grid cell position, and C is the number of classes to distinguish. I_ijk is assigned 1 if the k-th anchor at position (i, j) is matched to a ground truth bounding box, and 0 otherwise. y_c ∈ {0, 1} is the ground truth label and p_c ∈ [0, 1] is the normalized probability output by our neural network.

The regression loss Lreg is the bounding box regression loss:

Lreg = Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{k=1}^{K} I_ijk ∙ [smoothL1(δx_ijk − δx^G_ijk) + smoothL1(δy_ijk − δy^G_ijk) + smoothL1(δw_ijk − δw^G_ijk) + smoothL1(δh_ijk − δh^G_ijk)]    (3.4)

where (δx_ijk, δy_ijk, δw_ijk, δh_ijk) are the anchor offsets output by our neural network, and (δx^G_ijk, δy^G_ijk, δw^G_ijk, δh^G_ijk) are the ground truth bounding box offsets computed with:

δx^G_ijk = (x^G − x̂_i)/ŵ_k,

δy^G_ijk = (y^G − ŷ_j)/ĥ_k,

δw^G_ijk = log(w^G/ŵ_k),

δh^G_ijk = log(h^G/ĥ_k).    (3.5)


where (x^G, y^G, w^G, h^G) are the coordinates of a ground truth bounding box and (x̂_i, ŷ_j, ŵ_k, ĥ_k) are the coordinates of the corresponding anchor. Note that the above equations are the inverse transformation of equation 3.1.

The smooth L1 function, smoothL1, is:

smoothL1(x) = { 0.5x²,       if |x| < 1
              { |x| − 0.5,   otherwise        (3.6)

which was first proposed and used in Fast R-CNN [56], and is less sensitive to outliers than the L2 loss.
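A short sketch of the smooth L1 function of equation 3.6 and its use on a set of offset residuals follows; the numeric values are made up for illustration.

import numpy as np

def smooth_l1(x):
    """Smooth L1 of eq. 3.6: quadratic near zero, linear for |x| >= 1 (less sensitive to outliers)."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

# The regression loss of eq. 3.4 sums smooth L1 over the four offsets of each matched anchor.
pred = np.array([0.1, -0.3, 2.0, 0.05])     # predicted offsets (dx, dy, dw, dh)
target = np.array([0.0, -0.2, 0.5, 0.0])    # ground truth offsets
reg_loss = np.sum(smooth_l1(pred - target))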

Anchor Shape: The shape of an anchor can be can be characterized by its width � and its height ℎ. In Faster R-CNN [57], the anchor shapes are arbitrarily chosen by reshaping a 16 × 16 square box using 3 scales and 3 aspect ratios. However, a better way to choose the anchors is to make their shapes similar to the ground truth bounding boxes.

The problem can be formulated as follows: Given a number of the ground truth bounding box shape, find k anchors such that the sum of the distance between each ground truth bounding box to its nearest anchor is minimized. This problem can be effectively solved by K-means Clustering algorithm [112], which clusters data by trying to separate them into k group of equal variance and minimize a criterion, such as the �2 distance. For example, we can extract the heights and widths of all the ground truth bounding boxes from the KITTI object detection training datasets, run the K-means Clustering algorithm on them, and find k anchor shapes.
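The anchor shape selection can be sketched with scikit-learn's KMeans as below; the placeholder widths and heights stand in for the values extracted from the KITTI labels, and k = 9 is an illustrative choice rather than the value used in this work.

import numpy as np
from sklearn.cluster import KMeans

def anchor_shapes_from_gt(widths, heights, k=9, seed=0):
    """Cluster ground truth box shapes (width, height) into k anchor shapes with K-means."""
    shapes = np.stack([widths, heights], axis=1)
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(shapes)
    return km.cluster_centers_               # k rows of (anchor_width, anchor_height)

# Hypothetical usage with box sizes extracted from the training labels.
rng = np.random.default_rng(0)
w = rng.uniform(20, 200, size=1000)          # placeholder widths in pixels
h = rng.uniform(20, 120, size=1000)          # placeholder heights in pixels
anchors = anchor_shapes_from_gt(w, h, k=9)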


3.4 Experiments

Our system is implemented in Python using TensorFlow [139]. We experiment with and evaluate our architecture on the KITTI detection dataset, which includes 7481 training images. We randomly extract 30% of the images from the training set as validation data. The rule of thumb for splitting a dataset into training and validation sets is to enable the model to learn efficiently from the data without overfitting to it. For a dataset on the order of several thousand images, 70% training/30% validation or 80% training/20% validation are commonly used split ratios, and their influence on the object detection accuracy is negligible. Since the official object detection test server is used for competition and has usage limitations, we primarily report the detector performance on the validation dataset, as is common convention. We use Stochastic Gradient Descent (SGD) with momentum and a mini-batch size of 20 to optimize the loss function defined in section 3.3. The initial learning rate is set to 0.01 and reduced by half every 10000 steps. To compare every model equally, we train them all for a fixed 50000 steps. All timing information is measured on an NVIDIA GeForce GTX Titan Xp GPU with an Intel Xeon E3 3.2GHz CPU and 16GB RAM.

3.4.1 Quantitative Results

The detected objects are located by bounding boxes. The Intersection-Over-Union (IOU) mentioned in section 3.2 is used to measure the overlap between predicted and ground truth bounding boxes. Following the KITTI and PASCAL evaluation criteria [126], we consider a detection correct if IOU > 70% for vehicles. For pedestrians and cyclists, we require IOU > 50% for a correct detection. Taking car detection as an example, we define the following terms:

True Positive (TP): The instance that a car is correctly detected as a car.

False Positive (FP): The instance that a pedestrian or cyclist is detected as a car.

True Negative (TN): The instance that a pedestrian or cyclist is not detected as a car.

False Negative (FN): The instance that a car is not correctly detected as a car.

Precision reflects the percentage of detected cars that are really cars. Precision is defined as:

Precision = (# of TP) / (# of TP + # of FP)    (3.7)

PASCAL and KITTI do not rank algorithms based on recall, but recall is very important for autonomous driving because it reflects the percentage of cars that are correctly detected. Recall is defined as:

Recall = (# of TP) / (# of TP + # of FN)    (3.8)

Average Precision (AP): We evaluate the performance of object detection using the PASCAL criteria [126], which measure detection accuracy by average precision. For each class, the detection results are first ranked by their detection confidence scores, and then the precision/recall curve is calculated. The AP is a way to summarize the shape of the precision/recall curve, and is defined as the average precision at a set of n equally spaced recall levels [1/n, 2/n, …, n/n]:


AP = (1/n) ∙ Σ_{r ∈ {1/n, 2/n, …, n/n}} p(r)    (3.9)

where p(r) is the precision at each recall level r, interpolated by taking the maximum precision measured at any recall that exceeds r:

p(r) = max_{r̃: r̃ ≥ r} p(r̃)    (3.10)

where p(r̃) is the measured precision at recall r̃.
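A sketch of the interpolated AP computation of equations 3.9-3.10 follows; the toy precision/recall values are made up, and n = 11 is just a common choice for the number of recall levels.

import numpy as np

def average_precision(recalls, precisions, n=11):
    """AP as in eq. 3.9-3.10: average the interpolated precision at n equally spaced recall
    levels, where the interpolated precision at level r is the maximum precision observed
    at any recall >= r. recalls/precisions come from the ranked detections."""
    ap = 0.0
    for r in np.linspace(1.0 / n, 1.0, n):
        mask = recalls >= r
        p_interp = np.max(precisions[mask]) if np.any(mask) else 0.0
        ap += p_interp / n
    return ap

# Toy precision/recall curve for one class.
rec = np.array([0.1, 0.2, 0.4, 0.6, 0.8])
prec = np.array([1.0, 0.9, 0.8, 0.6, 0.5])
print(average_precision(rec, prec))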

The mean Average Precision (mAP) is simply the mean of the AP values over all classes. Table 3.1 summarizes the detection performance for different feature extractors, including the mAP, the inference speed measured in Frames Per Second (FPS), and the AP for each class at different difficulty levels. The difficulty levels are defined as follows:

• Easy (E): Bounding box with minimum height of 40 pixels, fully visible object, and maximum truncation level of 15%

• Moderate (M): Bounding box with minimum height of 25 pixels, object is partly occluded, and maximum truncation level of 30%

• Hard (H): Bounding box with minimum height of 25 pixels, object is difficult to see, and maximum truncation level of 50%

In our experiments, we compared six feature extractors, and only one feature map is extracted and utilized for detection. VGG16 [95], ResNet50 [127], and ResNet101 [127] are representative regular feature extractors with feedforward convolutional and pooling layers. They perform very well on the ImageNet classification task; in particular, ResNet101 is the winner of the 2015 ImageNet classification competition. SqueezeNet [128] is built upon the fire module, which is composed of a squeeze layer with 1x1 convolution filters and an expand layer with a mixture of 1x1 and 3x3 convolutional filters. It has two versions with different combinations of squeeze and expand layers, but both effectively reduce the model size and accelerate the classification speed on the ImageNet benchmark. The last feature extractor in our experiments is MobileNet [129], which was designed for mobile and embedded applications. It has a much smaller model size and is thus less powerful at feature extraction. It was applied to explore the speed limit of our object detection architecture.

Feature Extractor      mAP   Speed (FPS)   Car: E / M / H        Cyclist: E / M / H    Pedestrian: E / M / H
VGG16 [95]             85.6  17.3          92.3 / 85.6 / 82.8    85.6 / 76.3 / 75.6    80.4 / 72.3 / 67.4
ResNet50 [127]         86.5  23.5          92.7 / 89.4 / 84.5    85.9 / 79.4 / 77.8    82.3 / 73.1 / 68.2
ResNet101 [127]        89.4  20.4          94.3 / 90.8 / 85.4    87.2 / 80.3 / 79.1    82.1 / 74.2 / 69.0
SqueezeNet V1 [128]    78.2  56.3          92.4 / 85.7 / 75.1    88.7 / 77.9 / 72.8    77.3 / 69.9 / 64.1
SqueezeNet V2 [128]    82.3  33.2          95.2 / 91.2 / 82.1    92.2 / 82.1 / 77.1    81.1 / 72.3 / 67.7
MobileNet [129]        66.3  67.4          70.6 / 72.7 / 63.9    64.3 / 57.3 / 51.5    73.3 / 69.7 / 63.5

Table 3.1: Detection accuracy summary for different feature extractors. The mean Average Precision (mAP) is the mean of the AP values over all classes. The mAP and AP are given in percent, and the speed is measured in Frames Per Second (FPS). The AP for each class (Car, Cyclist, Pedestrian) at the Easy (E), Moderate (M), and Hard (H) difficulty levels is also reported.


As shown in table 3.1, ResNet101 achieved the best mAP but could only run at 20.4 frames per second. VGG16 and ResNet50 run at the same speed level as ResNet101, but with inferior mAP values. While there is no universal definition of real-time detection, in this dissertation we consider detection to be real time if it runs faster than 30 frames per second. For autonomous driving, even a 1% accuracy improvement is meaningful given the extremely high requirements of traffic safety, so our goal is to find the detector with the highest accuracy under the real-time constraint. With MobileNet, we obtained an inference speed of 67.4 frames per second (table 3.1), which is well beyond real time. Considering both accuracy and speed, we conclude that SqueezeNet V2 is the best feature extractor in our experiments. Figure 3.4 illustrates the precision/recall curves for car, cyclist and pedestrian at different difficulty levels.


Figure 3.4: Precision/recall curves for car, cyclist and pedestrian at the easy, moderate, and hard difficulty levels.

Speed: As shown in table 3.1, with a proper feature extractor, several of our proposed models can achieve real-time inference speed on the KITTI dataset. Another factor influencing the inference speed is the number of feature layers used in our model. In table 3.1, only one feature layer is used. We have experimented with different feature layer combinations, and the results are summarized in table 3.2. As expected, the more feature layers used in our architecture, the higher the achieved mAP; however, the corresponding inference speed decreases. We also found that SqueezeNet V2 with three feature layers used for detection can achieve the same level of accuracy as VGG16 with only one feature layer. It is superior to VGG16 because its inference speed is more than 10 frames per second faster.

Feature Extractor      Feature Layers               mAP   Speed (FPS)
VGG16 [95]             Conv5_3                      85.6  17.3
VGG16 [95]             Conv5_3, Conv4_3             88.5  14.6
VGG16 [95]             Conv6, Conv5_3, Conv4_3      89.8  13.8
SqueezeNet V2 [128]    Fire9                        82.3  33.2
SqueezeNet V2 [128]    Fire9, Fire6                 84.9  29.6
SqueezeNet V2 [128]    Fire11, Fire9, Fire6         85.7  28.9

Table 3.2: Detection accuracy summary for different combinations of feature layers. The mean Average Precision (mAP) is the mean of the AP values over all classes. The mAP is given in percent, and the speed is measured in Frames Per Second (FPS).

3.4.2 Qualitative Results

Examples of successful detections on KITTI test datasets are visualized in figure 3.5.

Examples of detection errors of different types on the KITTI test datasets are visualized in figure 3.6. In both figures, we use SqueezeNet V2 with three feature maps for detection, and each color corresponds to an object category. From figure 3.5, we can see that the detection of cars, cyclists and pedestrians succeeds even under strong illumination, heavy occlusion, high truncation, and crowded objects. However, the detection can also fail with different types of errors, such as missed detections and false positives, as illustrated in figure 3.6.


Figure 3.5: Examples of successful detections on the KITTI test datasets. We use SqueezeNet V2 with three feature maps for detection. Each color corresponds to an object category.


Figure 3.6: Examples of detection errors on the KITTI test datasets. From top to bottom, the errors are: part of a tree is predicted to be a cyclist; a stationary cyclist is predicted to be a pedestrian; a missed detection; and a predicted bounding box that does not fit the object. We use SqueezeNet V2 with three feature maps for detection. Each color corresponds to an object category.


3.5 Conclusion

In this chapter, we proposed a novel object detection architecture using still images. Our detection architecture utilizes a Convolutional Neural Network (CNN) with an end-to-end training approach. In our model, we consider multiple requirements for dynamic object detection in autonomous driving applications, including accuracy, inference speed, and model size. These are crucial to deploying a detector in a real autonomous vehicle. We determined our final architecture by exploring different pre-trained feature extractors and different combinations of multi-scale feature layers. Our architecture was intensively tested on the KITTI visual benchmark datasets and achieved accuracy comparable to the state of the art in real time.


Chapter 4: Video Object Detection for Multiple Tracking

In the previous chapter, we developed a CNN-based dynamic object detection framework for still images. However, the visual data from an autonomous vehicle's cameras are image sequences, not discrete still images, so it is essential to develop object detection algorithms for video sequences, called video object detection. Video object detection is also the foundation and the most important part of the online multiple object tracking task. In this chapter, we aim to improve object detection precision and recall by utilizing the contextual information in video with a special kind of Recurrent Neural Network.

4.1 Introduction

Video can be regarded as an image sequence with small changes between adjacent frames, and thus contextual information exists within a video. We can process each video frame individually, for example by running a CNN detector on each frame to detect vehicles and other objects. However, in this way we neglect the contextual information in the video. To better exploit video sequence data, the contextual information must be used, and a promising way to take advantage of it is by utilizing RNNs.

The biggest problem with utilizing RNNs is the difficulty of training them because of the long-term dependency problem, which results in exploding or vanishing gradients. This problem was explored in depth in [16] and [17]. We describe this problem next and provide several ways to address it.

The exploding and vanishing gradients problem is not specific to any particular type of neural network; it arises from gradient-based learning methods combined with certain activation functions introduced in chapter 2. Gradient-based methods learn the parameters, such as the weight parameters W, by understanding how a change in a parameter's value will affect the network's output. The ideal situation is that a small change of the parameter causes a proportionate reaction in the output; in other words, the gradients of the network's output with respect to the weight parameters in each layer should stay at proper values.

The exploding gradients problem happens when the gradients of the network's output with respect to the weight parameters become too large, causing serious instability in the learning procedure. [17] proposed a simple but effective method called gradient clipping to deal with the exploding gradients problem. It prevents gradients from blowing up by rescaling them so that their norm is at most a threshold value.
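As a hedged illustration (not the exact procedure of [17]), the following NumPy sketch rescales a list of gradient arrays whenever their global norm exceeds the threshold; deep learning frameworks such as Tensorflow provide an equivalent built-in operation (e.g., tf.clip_by_global_norm).

    import numpy as np

    def clip_gradients(grads, threshold):
        """Rescale gradients so that their global norm is at most `threshold`."""
        global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if global_norm > threshold:
            grads = [g * (threshold / global_norm) for g in grads]
        return grads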

The vanishing gradients problem happens when the gradients of the network's output with respect to the weight parameters become too small, causing the learning procedure to take too many steps. The vanishing gradients problem depends on the choice of the activation function. Many common activation functions (e.g., sigmoid or tanh) squash their input into a very small output range in a very non-linear fashion. As a result, there are large regions of the input space which are mapped to an extremely small range. This problem becomes worse when we have multiple activation layers.

The vanishing gradients problem is relatively difficult to solve. In feedforward neural networks, such as CNNs, we deal with it by using activation functions that do not squash the input space into a small region. An example of this kind of activation function is the popular Rectified Linear Unit (ReLU), which maps x to max(0, x). We can also design an RNN with ReLU as the only activation function, but a more popular and effective method is to redesign the RNN structure. Long Short Term Memory is the most successful RNN structure; it can learn long-term temporal information while avoiding the vanishing gradients problem.

4.2 Related Work

Tracking by detection is a popular and promising scheme for multi-object tracking [123], [124]. These methods first detect the bounding boxes of objects of specific categories, and then connect the bounding boxes belonging to the same trajectory across time frames using data association algorithms. It is obvious that the tracking-by-detection scheme heavily depends on robust detection in video. We now review the video object detection literature and then propose our method.

Compared with CNNs, the use of RNNs in object detection is not well studied. All the object detection architectures we mentioned in chapter 3 are mainly based on deep CNNs, for example Fast R-CNN [56], Faster R-CNN [57] and YOLO [59]. They all belong to the state-of-the-art methods for still image detection. However, things are changing and researchers have begun to incorporate RNNs into video-based tasks.

Existing video object detection methods incorporate RNNs only on the bounding boxes obtained from the above still image detectors. [119] builds bounding box sequences from nearby high-confidence bounding boxes in neighboring video frames. Bounding box sequences are re-scored to the average confidence score, and other bounding boxes close to a sequence are suppressed. T-CNN [80] defines a tubelet as a sequence of associated bounding boxes across video frames, and generates tubelets by applying tracking algorithms to still image bounding boxes and then re-scoring them based on their classification. In this way, it incorporates temporal and contextual information from these tubelets in the video. Similar approaches taking advantage of contextual information from bounding boxes include [119] and [120].

These methods are built on the assumption that the bounding boxes on individual video frames convey contextual information. We argue that the high-level feature maps of individual video frames also carry contextual information, which is richer than that of bounding boxes. By extracting temporal and contextual information from high-level feature maps, several papers demonstrate improvements on visual tasks including activity recognition [116], human dynamics [121] and video classification [122]. These models generate deep CNN features over tens of seconds, which form the input to an RNN. Inspired by them, we propose a novel video object detection architecture for tracking purposes that learns contextual information from feature maps with LSTM units. In addition, our method can further include bounding box contextual information to improve precision and recall.


Before presenting our methods, we first introduce a special kind of RNN, Long Short Term Memory (LSTM), on which our video object detection framework builds.

4.3 Long Short Term Memory

Long Short Term Memory is designed to address the long-term dependency problem, and thus can learn long-term history information from a sequence of inputs. It avoids the vanishing gradients problem because its backpropagation learning process does not involve repeated matrix multiplication by the weight parameters W. Since it was first introduced by [15], many variations of LSTM have been proposed. We use the original version in our implementation, and here we investigate one LSTM unit in detail. As shown in figure 4.1, there are four neural network layers in one LSTM unit that interact in a special way. At each time step, the LSTM is trained to add new information to the memory, forget some memory, and output its memory.


Figure 4.1: A standard architecture of LSTM units. Each line carries a vector, and the arrow denotes the direction of the data flow. Two merging lines denote data concatenation, while a line separating into two lines means that the data it carries is copied and flows in different directions. The circles represent pointwise operations, such as vector addition or multiplication, while the boxes are neural network layers. x_t is the input at time step t, h_t is the output at time step t, and C_t is the cell state at time step t. σ and tanh are the sigmoid function and hyperbolic tangent function, respectively, defined and described in figure 2.5.

There are two key concepts that need to be understood in LSTM. First, the cell state C_t carries the history information, i.e., the memory, which can be updated each time a new input comes in. The cell state is the core idea behind LSTM and enables the LSTM to learn from temporal information. The other key concept is the gate; as the name indicates, it controls whether information can flow through it. A gate is composed of a sigmoid neural network layer and a pointwise multiplication operation, as shown in figure 4.2. We have introduced and compared various sigmoid layers in chapter 2. The output of the sigmoid layer is a number between zero and one, representing how much of the information can pass through. Next, we will introduce the three different gates in LSTM.

Figure 4.2: The gate structure in LSTM units. It is composed of a sigmoid neural network layer σ and a pointwise multiplication operation. It controls whether information can flow through it.

The inputs to the LSTM unit at time step t are the cell state C_{t-1} at time step t-1, the output h_{t-1} at time step t-1, and the input x_t at the current time step t. The "forget gate layer" updates C_{t-1} by deciding which part of the memory should be forgotten, expressed mathematically by equation 4.1. Its inputs are h_{t-1} and x_t, and its output f_t is a vector with values between zero and one because of the sigmoid function σ. f_t has the same dimension as the cell state C_{t-1}, so it can be pointwise multiplied with C_{t-1}. In this way, part of the memory is forgotten.

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \qquad (4.1)

The second gate in the LSTM, called the "input gate layer", is designed to add new memory to the cell state C_{t-1}. It is expressed mathematically by equation 4.2. The newly produced candidate cell state C̃_t, denoted by equation 4.3 and having the same dimension as C_{t-1}, works as the input vector to update the cell state C_{t-1}. However, how much of the new candidate cell state C̃_t contributes to C_{t-1} is controlled by the input gate layer. It works in the same way as the forget gate layer, with h_{t-1} and x_t as inputs, and its output i_t is multiplied with C̃_t.

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \qquad (4.2)

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \qquad (4.3)

At this point, the new cell state C_t is generated by combining the outputs of the forget and input gate layers, which is expressed as:

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \qquad (4.4)

Finally, the updated cell state C_t is used to generate the output h_t at the current time step t. The "output gate layer", denoted by equation 4.5, works in the same way as the forget and input gate layers, with output o_t. To get the output h_t, an activation function tanh first acts on C_t to produce a vector in the range [-1, 1], and then this vector is pointwise multiplied with the output o_t of the output gate layer:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \qquad (4.5)

h_t = o_t * \tanh(C_t) \qquad (4.6)

In one LSTM unit, the weights and biases of the different layers, i.e., W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o, are the trainable parameters which are learned and updated during the training stage of the network. Once we are satisfied with the losses, we test our network on the validation and test datasets using these weight parameters.
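For clarity, the following is a minimal NumPy sketch of a single LSTM time step implementing equations 4.1-4.6. The dense-matrix parameterization and the concatenation of h_{t-1} and x_t follow the equations above, while the function and variable names are illustrative.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, C_prev,
                  W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
        """One LSTM time step following equations 4.1-4.6."""
        z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
        f_t = sigmoid(W_f @ z + b_f)             # forget gate, eq. 4.1
        i_t = sigmoid(W_i @ z + b_i)             # input gate, eq. 4.2
        C_tilde = np.tanh(W_C @ z + b_C)         # candidate cell state, eq. 4.3
        C_t = f_t * C_prev + i_t * C_tilde       # updated cell state, eq. 4.4
        o_t = sigmoid(W_o @ z + b_o)             # output gate, eq. 4.5
        h_t = o_t * np.tanh(C_t)                 # output, eq. 4.6
        return h_t, C_t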


4.4 Methods

4.4.1 System Overview

The overview of our proposed video object detection architecture is illustrated in figure 4.3. The proposed model is a deep neural network that takes raw video frames as input and returns the bounding boxes and class labels of the objects in each frame. Specifically, (1) a CNN is first chosen to process each individual frame, which is exactly the same as still image detection. This step produces two outputs: robust visual features and the object detection results (i.e., bounding boxes and class labels). (2) We then use LSTM in the next stage to learn the contextual and temporal information, as it is temporally deep and appropriate for sequence processing. In our model, both the visual features and the detection results of each individual frame are inputs to the LSTM units, while other video object detection architectures only use detection results. This is where our proposed pipeline differs from other video object detection architectures. (3) Finally, the object detection results are directly regressed from the LSTM units, and at the same time the cell states of the LSTM units are updated. We will demonstrate the performance improvements in the experiments section.


Figure 4.3: Overview of our video object detection architecture. The dashed line means that either the feature vector or the still image detection results (or both) can be input to the Long Short Term Memory (LSTM) units. "Detection" denotes the detection results, i.e., bounding box coordinates and class labels from the Convolutional Neural Network (CNN) or LSTM. "Features" denotes the visual feature vector from the CNN feature extractor.

4.4.2 LSTM Choice

Since LSTM was first proposed, several variants have been designed, such as [18], [19] and [20]. An important variation of the standard LSTM described above is the Gated Recurrent Unit (GRU), introduced by [21]. It combines the forget and input gates into a single update gate, and merges the cell state and hidden state. In [22], the authors compared eight popular variants of LSTM, including the ones mentioned above, and found that the most commonly used LSTM variants perform reasonably well on various datasets and that using any of these variants does not significantly improve LSTM performance. Therefore, we only implement the standard LSTM in our study, as described in section 4.3.

4.4.3 Training

There are three phases in the end-to-end training of our proposed video object detection system: (1) the pre-training phase of the CNN for feature learning on ImageNet datasets, (2) the traditional CNN-based object detection training phase on individual video frames, and (3) the LSTM training phase for contextual information learning with video sequences. The first phase of training the CNN for feature learning is conducted on the ImageNet dataset with 1000 classes and millions of training examples. Usually we do not train it ourselves but use the so-called transfer learning techniques described in chapter 2, section 2.3.5.

We perform the second and third phases ourselves. The second phase is actually a fine-tuning process, which means we only fine-tune the parameters of the last few layers with our own data. This phase is the same as the training procedure for still images in chapter 3. Finally, we add the LSTM units and train the entire video object detection network to learn the contextual information from video sequences. There are two streams of data flowing into the LSTMs: the robust visual feature vector and the detection information, i.e., bounding boxes and class labels. In previous studies, only the detection information is learned by the LSTM units.


4.5 Experiments

Our system is implemented in Python using Tensorflow [139]. We experiment with and evaluate our architecture on the KITTI object tracking datasets, which include 21 sequences with more than 8000 labeled frames. We split them into 16 training sequences and 5 validation sequences. To better train the CNN module, we also use the KITTI object detection datasets in addition to the individual frames of the tracking datasets. Since the official tracking test server only evaluates tracking results and has usage limitations, we primarily report performance on the tracking validation set, as is common practice for video object detection tasks. All timing information is measured on an NVIDIA GeForce GTX Titan Xp GPU with an Intel Xeon E3 3.2 GHz CPU and 16 GB RAM.

4.5.1 Quantitative Results

Our proposed video object detection architecture must be built on a still image detection model. We use the still image detection pipeline proposed in chapter 3 as the base model. We tested three ways to take advantage of the contextual information hidden in the video sequences. First, we fed only the detection results of individual frames into the LSTM units. Second, we fed only the visual feature vector extracted from individual frames into the LSTM units. Finally, both the detection results and the visual feature vector were fed into the LSTM units. Therefore, there are three ways for the LSTM units to learn the contextual information.


Model          mAP   Speed (FPS)   Car: E / M / H        Cyclist: E / M / H    Pedestrian: E / M / H
VGG16 (Base)   85.6  17.3          92.3 / 85.6 / 82.8    85.6 / 76.3 / 75.6    80.4 / 72.3 / 67.4
VGG16 (D)      86.7  16.5          92.5 / 86.7 / 83.4    85.7 / 76.7 / 75.8    81.4 / 73.5 / 67.9
VGG16 (F)      88.5  12.1          95.4 / 87.9 / 82.6    86.9 / 77.5 / 75.8    82.3 / 75.8 / 68.3
VGG16 (F&D)    89.3  10.2          95.7 / 88.3 / 84.5    86.3 / 78.6 / 76.6    82.5 / 77.8 / 74.2

Table 4.1: Detection performance using VGG16 as the backbone CNN architecture with different inputs to the LSTM units. The mean Average Precision (mAP) and AP are given in percent, and the speed is measured in Frames Per Second (FPS). Base means no LSTM units are used; D means only detected bounding boxes are input to the LSTM units; F means only visual features are input to the LSTM units; F&D means both detected bounding boxes and visual features are input to the LSTM units. The AP for each class (Car, Cyclist, Pedestrian) at the Easy (E), Moderate (M), and Hard (H) difficulty levels is also reported.

We chose VGG16 and MobileNet as the backbone CNN feature extractors in our experiments, because they are representative of a powerful but heavy feature extractor and a less powerful but lightweight feature extractor, respectively. We summarize their performance in tables 4.1 and 4.2. In both tables, Base means no LSTM units are used; D means only detected bounding boxes are input to the LSTM units; F means only visual features are input to the LSTM units; F&D means both detected bounding boxes and visual features are input to the LSTM units. From the two tables, we draw the following observations:


• Both the detection results and the visual feature vector help improve detection accuracy, but the latter provides a larger accuracy increase

• Adding the LSTM units reduces the inference speed

• The contextual information (visual feature vector and detection results) produced by MobileNet provides a more significant accuracy boost

Model             mAP   Speed (FPS)   Car: E / M / H        Cyclist: E / M / H    Pedestrian: E / M / H
MobileNet (Base)  66.3  67.4          70.6 / 72.7 / 63.9    64.3 / 57.3 / 51.5    73.3 / 69.7 / 63.5
MobileNet (D)     68.6  65.9          71.2 / 71.4 / 64.2    65.6 / 58.5 / 52.8    75.1 / 71.4 / 64.7
MobileNet (F)     73.5  61.5          78.6 / 77.5 / 67.5    67.7 / 61.4 / 55.6    77.4 / 75.7 / 69.3
MobileNet (F&D)   77.7  60.4          79.5 / 78.9 / 67.9    68.3 / 62.8 / 57.2    79.4 / 74.9 / 68.4

Table 4.2: Detection performance using MobileNet as the backbone CNN architecture with different inputs to the LSTM units. The mean Average Precision (mAP) and AP are given in percent, and the speed is measured in Frames Per Second (FPS). Base means no LSTM units are used; D means only detected bounding boxes are input to the LSTM units; F means only visual features are input to the LSTM units; F&D means both detected bounding boxes and visual features are input to the LSTM units. The AP for each class (Car, Cyclist, Pedestrian) at the Easy (E), Moderate (M), and Hard (H) difficulty levels is also reported.

4.5.2 Qualitative Results

We described quantitatively in the previous section how the historical feature maps and detection results can improve the video detection accuracy. In this section, we qualitatively demonstrate the same effects. Figure 4.4 displays the object detection results for several video frames from the KITTI tracking test datasets. In this result, only the base MobileNet is used to detect objects in each individual frame, without considering contextual information. We can see that the cyclist was mistakenly predicted to be both a cyclist and a pedestrian in the third frame, which is caused by changes in the background and illumination.

Figure 4.5 shows the detection results for the same frames as figure 4.4. However, in figure 4.5, we utilized the contextual information hidden in the video by using the LSTM units to learn from both the historical visual feature vector and the detection results. The benefit is obvious: there is no detection error in the third frame. This means that the LSTM units learn from the historical information and reject the false prediction that would be made from a single frame.


Figure 4.4: Video object detection without considering the contextual information. Only the base MobileNet is used to detect objects in each individual frame. Each color corresponds to an object category.


Figure 4.5: Video object detection considering the contextual information. MobileNet is used to detect objects in each individual frame, and the output visual feature vector and detection results are fed into the LSTM units. Each color corresponds to an object category.


4.6 Conclusion

In this chapter, we developed a video object detection framework based on CNN and Long Short Term Memory (LSTM). Previous studies learn contextual information only from the individual frame detection results. Our framework can learn the contextual information from both the visual feature vector and the individual frame detection results. The architecture proposed in chapter 3 acts as the still image detector and feature extractor, and the LSTM is responsible for exploiting the contextual information in the video. We evaluated our framework on the KITTI tracking datasets. Through experiments, we demonstrated quantitatively and qualitatively that the contextual information hidden in video sequences can help improve detection accuracy and reduce false detections.


Chapter 5: End to End Learning for Vehicle Control

In the previous chapters, we studied the mapping from image pixels to object classification and location. This information is essential for understanding the traffic scene in order to control the vehicle. However, the power of deep learning goes beyond object recognition. In this chapter, we explore the possibility of mapping image pixels directly to control commands, such as the steering angle, which we call end-to-end vehicle control in this dissertation. This end-to-end learning protocol is attractive because it simulates how humans drive and skips the perception and path planning modules. In this chapter, we propose a deep learning architecture involving both CNNs and LSTM that can map a video sequence to control commands.

5.1 Related Work

Surprisingly, the idea of end-to-end vehicle control was put into practice as early as 1989, as reported in [113]. The Autonomous Land Vehicle In a Neural Network (ALVINN) utilized a shallow 3-layer neural network that could predict the vehicle control command from the input image pixels. To our knowledge, this was the first attempt to map image pixels to vehicle control. Its success implies the potential of neural networks for directly controlling vehicles.


Another important project, DARPA Autonomous Vehicle (DAVE) [114], was conducted by DARPA. In this project, a radio-controlled (RC) car learned to avoid obstacles in a junk-filled off-road setting. The RC car was trained on several hours of human driving data and tested in a different environment. The training data included the video recorded by two cameras and the steering commands sent by the human operator. Because of the limited training data and the shallow network used, its performance was not reliable enough to control the RC car in a complex real-life environment; on average, it crashed every 20 meters, as reported in the paper.

The most recent successful demonstration, NVIDIA's end-to-end learning system for autonomous vehicles [42], implemented an idea similar to ALVINN. This system can control the vehicle in certain real traffic scenarios, such as highway lane following and local road driving. It improves upon the ALVINN system in two aspects. First, its learning architecture is a complex multi-layer Convolutional Neural Network with 27 million connections and 250 thousand parameters. This enables the system to learn more powerful representations from the input image pixels. Second, a hundred hours of driving data, including the images from three cameras and the corresponding control commands, is utilized as training data. This is necessary to train complex CNNs with numerous parameters. In addition, this research also benefits from the computing power provided by massively parallel Graphics Processing Units (GPUs), which significantly accelerates training and pushes the inference speed to real time.

Instead of mapping from image pixels to control commands, Deepdriving [115] proposed mapping from image pixels to a pre-defined affordance representation. This is a small number of perception indicators that represent the traffic scene, such as the distance to lane markings and the distance to other vehicles. These affordance representations are understandable to humans, whereas the process of mapping from image pixels to control commands is like a black box and is not understandable by humans. However, affordance representations are difficult to define in complex, real urban traffic scenes. And, similar to the output of a perception module, the affordance representation must then be fed into a rule-based control algorithm.

So far, the above approaches all work on still image pixels and do not consider the temporal information between adjacent images. However, temporal information is essential to understanding sequence data, such as video and audio. [116] demonstrated a class of deep learning architectures combining Convolutional and Recurrent Neural Networks. These architectures are both spatially and temporally deep, and can be applied to various computer vision tasks that involve image sequence inputs, such as activity recognition and video description. Another work [117] incorporates an FCN-LSTM architecture, which can map a video sequence to discrete driving actions, such as "go straight", "turn left", "turn right" and "stop", and to continuous driving actions such as angular speed. Inspired by this work, we propose a deep architecture involving both CNNs and LSTM that can map a video sequence to control commands. The proposed network architecture only takes the current and previous frames as inputs to predict the control commands, and never incorporates future frame information.

5.2 Network Architecture


In our architecture, both Convolutional and Recurrent Neural Networks are adopted. First, a CNN extracts a deep feature vector from each frame of the video sequence. Temporal information exists between adjacent feature vectors because they are extracted from neighboring frames of the video sequence. To take advantage of this temporal information, a special kind of RNN called Long Short Term Memory (LSTM) is used, because it is good at learning long-term dependency relationships. We described the mechanics of LSTM in chapter 4.

Figure 5.1 depicts our first end-to-end control architecture, called "L1". For each frame, the deep feature vector extracted by the CNN is fed into the LSTM unit as the input. The LSTM unit outputs the predicted control commands, i.e., steering angle and throttle level, and at the same time passes its outputs to the LSTM unit of the next time step. In other words, at each time step except the first, both the deep feature vector of the current time step and the output control commands of the last time step are fed into the LSTM unit to update the cell state. As mentioned in chapter 4, the cell state in the LSTM unit is responsible for learning historical temporal information from the inputs.


Figure 5.1: End-to-end control architecture with a one-layer Long Short Term Memory (LSTM), called "L1". At each time step except the first, both the deep feature vector of the current time step and the output control commands of the last time step are fed into the LSTM unit to update the cell state.

Inspired by [134], and to examine whether stacking more LSTM layers increases performance, we propose a two-layer LSTM architecture, as shown in figure 5.2. Compared with architecture "L1", it stacks one more LSTM layer above the original LSTM, and the outputs of the first LSTM layer act as the inputs to the second LSTM layer. In theory, the first LSTM layer serves as an encoder that encodes the visual features into its cell state, and the second LSTM layer serves as a decoder that regresses the control commands. This increases the network depth in the time dimension.


Figure 5.2: End-to-end control architecture with two Long Short Term Memory (LSTM) layers, called "L2". The outputs of the first LSTM layer act as the inputs to the second LSTM layer.

With the expectation of reducing the model noise, we modify the architecture "L2" by adding a feedback mechanism. Specifically, as described in figure 5.3, we add the outputs of the second LSTM layer at the previous time step as a third input to the first LSTM layer at the current time step. We call this architecture "L2 Fb".


Figure 5.3: End-to-end control architecture with two Long Short Term Memory (LSTM) layers and a feedback mechanism. We add the outputs of the second LSTM layer at the previous time step as a third input to the first LSTM layer at the current time step. This architecture is called "L2 Fb".

We can further tweak the "L2 Fb" architecture by having the deep feature vector skip the first LSTM layer and feed directly into the second LSTM layer, as shown in figure 5.4; we call this architecture "L2 Fb Skip". It is inspired by the image captioning architecture in [116], which is called "factored" in that paper. The purpose is to separate the responsibilities of the first and second LSTM layers, which forces the cell state of the first LSTM layer to be updated independently of the feature vector.


Figure 5.4: End-to-end control architecture with two Long Short Term Memory (LSTM) layers, a feedback mechanism, and a skip structure. We call this architecture "L2 Fb Skip".

Up to now, we have defined four types of end-to-end control architectures: "L1", "L2", "L2 Fb" and "L2 Fb Skip". They can all be trained in an end-to-end fashion. During training, there are two ways to produce the feedback input to the first LSTM layer of the next time step:

• One is to use the ground truth control commands of the current time step as the feedback input to the first layer at the next time step.

• The other is to use the predicted control commands of the current time step as the feedback input to the first layer at the next time step.


This creates two different methods to train the "L2 Fb" and "L2 Fb Skip" architectures. In order to compare the performance of each end-to-end control architecture, we performed intensive experiments on them, which are explained in section 5.4. In the next section, we describe the details of the implementation and the training methods.

5.3 Implementation and Training

All our architectures are implemented using the Python API of the Tensorflow [118] deep learning framework. Similar to CNNs, Tensorflow has its own built-in LSTM layer and other types of RNN layers. All timing information is measured on an NVIDIA GeForce GTX Titan Xp GPU with an Intel Xeon E3 3.2 GHz CPU and 16 GB RAM.

For the real road data, we choose VGG16 [95] as the CNN visual feature extractor in all our end-to-end control architectures. It is pre-trained for object classification on the ImageNet dataset with millions of pictures and 1000 classes, which gives it the ability to extract general features of almost arbitrary objects. The output of its first fully connected layer is a feature vector of size 4096, which represents the high-level visual features. Because the CNN is pre-trained on such a large dataset, we assume it can extract meaningful feature representations without further training on our own small datasets. Therefore, while training our proposed end-to-end control architectures, the weight and bias parameters of the CNN are frozen and only the parameters in the LSTM layers are updated.
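A minimal tf.keras sketch of this setup is shown below, only to illustrate freezing the pre-trained VGG16 and training the LSTM head. The LSTM width (128 units), the optimizer, the mean squared error loss, and the two-dimensional output (steering and throttle) are illustrative assumptions rather than the exact configuration used in our experiments.

    import tensorflow as tf

    # Pre-trained VGG16; its first fully connected layer ("fc1") gives a 4096-d feature.
    vgg = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
    feature_extractor = tf.keras.Model(inputs=vgg.input,
                                       outputs=vgg.get_layer("fc1").output)
    feature_extractor.trainable = False          # freeze CNN weights and biases

    # Only the LSTM and the small output head are trainable.
    frames = tf.keras.Input(shape=(10, 224, 224, 3))             # M = 10 frames per sequence
    feats = tf.keras.layers.TimeDistributed(feature_extractor)(frames)
    h = tf.keras.layers.LSTM(128, return_sequences=True)(feats)  # assumed LSTM width
    commands = tf.keras.layers.Dense(2)(h)                       # steering angle, throttle
    model = tf.keras.Model(frames, commands)
    model.compile(optimizer="adam", loss="mse")                  # assumed optimizer/loss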


We described how to train a general RNN using the backpropagation through time (BPTT) algorithm in chapter 2. The BPTT algorithm can also be used to train our end-to-end control architectures. We use mini-batch SGD optimization within the BPTT algorithm. Figure 5.5 depicts the training diagram of our proposed architecture. The mini-batch data are fed into our proposed CNN/LSTM architecture after data augmentation, which produces a predicted steering angle. At the same time, the recorded steering angle is adjusted, regarded as the ground truth, and compared with the predicted steering angle to produce the steering error. Finally, the BPTT algorithm is used to adjust the weight and bias parameters in the LSTM units.

Figure 5.5: The training diagram of our proposed architecture using the BackPropagation Through Time (BPTT) algorithm.

Next, we explain two further important parts of our training procedure, mini-batch data generation and data augmentation, because they differ from data generation and augmentation for regular discrete still images.


Our goal is to generate mini-batch training data with N sequences and M frames in each sequence. We randomly select N frames from all the training data as the first frames of the N sequences. Since we define the length of each sequence as M, we fill each sequence by adding the M-1 frames following the first frame. The next time a mini batch is generated, we randomly re-select N starting frames and fill each sequence using the same method. For training, the loss for a mini batch is the sum of squared Euclidean distances:

L = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \left\| S_{ij}^{gt} - S_{ij}^{pre} \right\| + \left\| T_{ij}^{gt} - T_{ij}^{pre} \right\| \right) \qquad (5.1)

where N is the number of independent sequences in one batch and M is the length of each sequence. S_{ij}^{gt} and S_{ij}^{pre} are the ground truth and predicted steering angles at sequence i and frame j, respectively. T_{ij}^{gt} and T_{ij}^{pre} are the ground truth and predicted throttle levels at sequence i and frame j, respectively. ‖·‖ is the squared Euclidean norm. In our implementation, for all proposed end-to-end control architectures we choose the number of independent sequences N = 4 and the sequence length M = 10 for each mini batch.
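The following NumPy sketch illustrates this sequence sampling scheme and the loss of equation 5.1; the array shapes and function names are illustrative assumptions.

    import numpy as np

    def sample_minibatch(frames, steering, throttle, n_seq=4, seq_len=10):
        """Pick N random starting frames and take M consecutive frames for each sequence."""
        starts = np.random.randint(0, len(frames) - seq_len, size=n_seq)
        idx = np.stack([np.arange(s, s + seq_len) for s in starts])   # shape (N, M)
        return frames[idx], steering[idx], throttle[idx]

    def minibatch_loss(s_pred, s_gt, t_pred, t_gt):
        """Squared steering and throttle errors averaged over the mini batch (eq. 5.1)."""
        n, m = s_gt.shape
        return np.sum((s_gt - s_pred) ** 2 + (t_gt - t_pred) ** 2) / (n * m)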

The training data must be augmented to mimic the vehicle being at different positions on the road so that the network learns to recover from them. For the simulation and real road data we use different augmentation strategies, such as small random shifts, small random rotations, and horizontal flipping, which are described in the experiments section. However, we must apply the same augmentation strategy to every image of a sequence to preserve the contextual information among its frames. This is different from augmenting discrete still images, to which different augmentation strategies can be applied independently.

Once trained, the architecture can produce a steering angle from each frame of the video sequences.

Figure 5.6: The architecture at test time. It can produce a steering angle from each frame of the video sequences.

5.4 Experiments and Results

5.4.1 Simulation Study

Before training on real road data, we first evaluated our network's performance in simulation. We chose the open source Udacity (https://www.udacity.com) self-driving car simulator (figure 5.7), which has both a training mode and an autonomous mode. In the training mode, we collected training data by manually controlling the simulated vehicle. In the autonomous mode, the simulator feeds the real-time video sequences into our trained network and executes the steering angles it outputs. Note that there are two tracks in the simulator; we collected all the training data from the first track and validated our network performance on the second track. In this way, we showed that our network generalizes rather than memorizing the road.


Figure 5.7: The Udacity self-driving car simulator (https://www.udacity.com).

We collected 6428 video frames as training data from each of the left, center, and right cameras mounted on the simulated vehicle. In addition, we also collected 1607 video frames from each camera as test data. To further test our network performance, we ran the simulator in autonomous mode on the second track. Figure 5.8 shows an example of the collected video frames from the left, center, and right cameras.


Figure 5.8: An example of the collected video frames from the left, center, and right cameras.

Figure 5.9 (a) displays the distribution of steering angles in the training data for the center camera, which clearly shows the imbalance of the training data: too many steering angles lie in the range [−0.3, 0]. To balance the data, we augmented the training data in two ways:

• Use left/right cameras: Images from the left and right cameras are also used, with the steering angle adjusted by 0.25. We should note that adding a constant angle to the steering is a simplified substitute for modeling the actual shift of the left and right cameras, and not the best way; but in the simulation case, this simplification is good enough.

• Horizontal flipping: We horizontally flip the images to account for driving in the opposite direction. This also increases the amount of training data.

The steering angle distribution after augmentation is illustrated in figure 5.9 (b); the training data is more balanced now.
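A minimal sketch of these two augmentation steps follows. The sign convention for the left/right camera correction and the negation of the steering angle when flipping are assumptions of this illustration.

    import numpy as np

    STEER_OFFSET = 0.25   # constant correction applied to left/right camera images

    def augment_sample(center_img, left_img, right_img, steering):
        """Expand one labeled sample into several training samples."""
        samples = [
            (center_img, steering),
            (left_img,  steering + STEER_OFFSET),   # as if the car were further left
            (right_img, steering - STEER_OFFSET),   # as if the car were further right
        ]
        # Horizontal flip: mirror each image and negate its steering angle.
        samples += [(np.fliplr(img), -angle) for img, angle in samples]
        return samples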


Figure 5.9: The steering angle distribution before and after data augmentation. Too many steering angles lie in the range [−0.3, 0] before data augmentation, while the distribution becomes balanced after data augmentation.

As shown in figure 5.8, the simulated scene is a simple one-way road without other traffic participants. It is not necessary to use a complex CNN feature extractor, so we designed a simple version, illustrated in figure 5.10. The input video frames are first resized to 16 × 32 before being fed into the CNN feature extractor. The first layer "8-3 × 3" is a convolution layer with eight 3 × 3 filters, followed by a "Relu" activation layer to add non-linearity to the system. Then a max pooling layer "Max 2 × 2" is used to reduce the feature map dimension. These three layers are repeated once, followed by a dropout layer "Dropout 0.2" with dropout rate 0.2 to prevent overfitting to the training data. Finally, a visual feature vector of length 50 is obtained after a fully connected layer "FC 50".

If we do not intend to take advantage of the contextual information within the video sequences, the above visual feature vector is processed by an additional "Relu" layer and a fully connected layer "FC 1" to output the steering angle value. We call this architecture "Base CNNs". Otherwise, the visual feature vector is fed into the LSTM unit to output the steering angle; in this way, the contextual information is learned by the network architecture.
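As a minimal tf.keras sketch of the "Base CNNs" variant described above; the padding mode, the number of filters in the repeated stage, and the optimizer/loss choice are illustrative assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    # "Base CNNs" head for the simulated data (figure 5.10): 16x32 input, two
    # conv/ReLU/pool stages, dropout, a 50-d feature vector, and one steering output.
    base_cnn = tf.keras.Sequential([
        layers.Conv2D(8, (3, 3), activation="relu", padding="same",
                      input_shape=(16, 32, 3)),                      # "8-3x3" + "Relu"
        layers.MaxPooling2D((2, 2)),                                 # "Max 2x2"
        layers.Conv2D(8, (3, 3), activation="relu", padding="same"), # repeated stage
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.2),                                         # "Dropout 0.2"
        layers.Flatten(),
        layers.Dense(50, activation="relu"),                         # "FC 50" + "Relu"
        layers.Dense(1),                                             # "FC 1": steering angle
    ])
    base_cnn.compile(optimizer="adam", loss="mse")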

Figure 5.10: Backbone CNN feature extractor for simulated data. "8-3 × 3" means a convolution layer with eight 3 × 3 filters; "Relu" means a ReLU activation layer to add non-linearity to the system; "Max 2 × 2" is a max pooling layer with filter size 2 × 2 to reduce the feature map dimension; "Dropout 0.2" is a dropout layer with dropout rate 0.2 to prevent overfitting to the training data; "FC 50" means a fully connected layer with an output feature vector of length 50.

Architecture   Base CNNs   L1      L2      L2 Fb (GT)   L2 Fb (PRE)   L2 Fb Skip (GT)   L2 Fb Skip (PRE)
MAE (radian)   0.036       0.045   0.041   0.039        0.043         0.036             0.046
Std (radian)   0.084       0.081   0.092   0.087        0.080         0.083             0.075

Table 5.1: Steering angle Mean Absolute Error (MAE) and Standard Deviation (Std), in radians, of the different architectures on the simulated test datasets. "GT" means that the ground truth values are used as feedback to the next time step; "PRE" means that the network's own predictions are used as feedback. The L1, L2, L2 Fb, and L2 Fb Skip architectures are illustrated in figures 5.1-5.4.

Using the above CNN feature extractor, we trained our network architectures on the simulated training datasets. After training, the simulated vehicle drives along the track in the autonomous mode, which means the vehicle relies on the steering angles output by our trained network. It turned out that all of the network architectures enable the simulated vehicle to drive without crashing or touching the edge of the road. In fact, one can hardly tell the difference in performance between the different architectures, since all of them result in almost the same driving quality.

To further compare the architectures, we evaluated them quantitatively on the test datasets. We calculate the Mean Absolute Error (MAE) and the standard deviation (Std) of the steering angle difference. The MAE can be calculated with the following equation:

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| S_i^{pre} - S_i^{gt} \right| \qquad (5.2)

where n is the number of predicted steering angles, S_i^{pre} is the i-th predicted steering angle, and S_i^{gt} is the i-th ground truth steering angle.

Std can be computed with:

\bar{e} = \frac{1}{n} \sum_{i=1}^{n} \left( S_i^{pre} - S_i^{gt} \right) \qquad (5.3)

Std = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( S_i^{pre} - S_i^{gt} - \bar{e} \right)^2} \qquad (5.4)

where \bar{e} is the mean of the predicted steering angle errors.
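A short NumPy sketch of these two metrics (equations 5.2-5.4):

    import numpy as np

    def steering_metrics(pred, gt):
        """MAE and standard deviation of the steering error (eqs. 5.2-5.4)."""
        err = pred - gt
        mae = np.mean(np.abs(err))
        std = np.sqrt(np.mean((err - err.mean()) ** 2))   # equivalent to np.std(err)
        return mae, std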

Table 5.1 summarizes the performance of each architecture. It is not obvious that the LSTM-based architectures outperform the CNN-only architecture, even though the LSTM-based architectures have the ability to utilize the contextual information in the video sequences. One possible reason is that the driving scene is too simple, and the CNN-only architecture is powerful enough to learn the mapping from the video frame to the steering angle. Overall, the simulation study verified that our proposed architectures are able to output correct steering angles to control the vehicle.

5.4.2 Real Road Test

After testing in the simulator, we evaluated our network architectures on real road data. Udacity collected video sequence data together with steering angle, speed and throttle levels. In our study, we used 74370 video frames from the cameras and the corresponding steering angles as training data, and 5614 frames as test data. Figure 5.11 shows two sample frames from the training video sequences. Compared with the simulated environment, the images collected on real roads are more complicated: there are other moving vehicles on the road, the lane markings are sometimes unclear, and the lighting conditions change a lot. All these factors make it more difficult for our model to learn to map images to steering angles. Since we have ten times more training data and a much more complicated driving scene than in the simulated case, we use a more powerful feature extractor, VGG16 [95], as the backbone CNN.


Figure 5.11: Sample frames from the training video sequences.

Similar to the simulation study, the training data is unbalanced and too many steering angles concentrate around zero. To balance the real road training data, we augment it in the following ways:

• Brightness augmentation: Because the training data were collected on a cloudy afternoon, when the lighting conditions were not good, we change the brightness of the images to simulate different lighting conditions. Specifically, we convert the image to the HSV color model, scale the V channel up or down, and convert the image back to RGB.

• Horizontal flipping: We horizontally flip the images to account for driving in the opposite direction. This also increases the amount of training data.

• Image shifts: The images are randomly shifted horizontally to simulate the effect of the car being at different positions on the road, and an offset corresponding to the shift is added to the steering angle. We also shift the images vertically by a random amount to simulate the effect of driving up or down a slope.


The brightness augmentation does not contribute to balancing the data, but it helps the architectures perform better quantitatively on the test datasets. Random image shifts and flipping in fact enlarge the training datasets and make them more balanced.
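A minimal OpenCV sketch of the brightness augmentation described above; the scaling range is an illustrative assumption.

    import cv2
    import numpy as np

    def random_brightness(image_rgb, low=0.4, high=1.3):
        """Scale the V channel in HSV space to simulate different lighting conditions."""
        hsv = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2HSV).astype(np.float32)
        hsv[:, :, 2] = np.clip(hsv[:, :, 2] * np.random.uniform(low, high), 0, 255)
        return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2RGB)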

Architecture   Base CNNs   L1      L2      L2 Fb (GT)   L2 Fb (PRE)   L2 Fb Skip (GT)   L2 Fb Skip (PRE)
MAE (radian)   0.140       0.085   0.081   0.069        0.073         0.066             0.069
Std (radian)   0.193       0.121   0.113   0.097        0.102         0.092             0.097

Table 5.2: Steering angle Mean Absolute Error (MAE) and Standard Deviation (Std), in radians, of the different architectures on the real road test datasets. "GT" means that the ground truth values are used as feedback to the next time step; "PRE" means that the network's own predictions are used as feedback. The L1, L2, L2 Fb, and L2 Fb Skip architectures are illustrated in figures 5.1-5.4.

We follow the same training procedure as for the simulated data. After training, we evaluated the pre-defined network architectures on the test datasets. As in the simulation study, we summarize the results in table 5.2, which lists the mean absolute error and standard deviation of the steering angle. From the results, we can conclude that:

• In general, considering contextual information by using LSTM units improves the prediction accuracy: all CNN and LSTM based architectures outperform the CNN-only architecture. However, the architecture L2 has larger MAE and Std values than L1, which means that simply stacking one LSTM layer on top of another does not improve the prediction accuracy.

• The feedback mechanism reduces the prediction error, and ground truth feedback during training is better than prediction feedback.

• The skip mechanism also helps. It is designed to separate the responsibilities of the first and second LSTM layers. This result is in accordance with the conclusion of the image captioning task in [116].

In summary, the architecture “L2 Fb Skip GT” displayed the best performance.

To take a closer look at the predicted steering angles, we plotted figure 5.12, which compares the ground truth steering angles and the steering angles predicted by the "L2 Fb Skip GT" architecture on the test datasets. In general, they match very well, but show larger differences when the ground truth steering angles are large. In other words, the predicted steering angles overshoot the ground truth when its magnitude is large. In addition to the figure, we also generated a video that compares the steering angle predictions while driving.


Figure 5.12: The predicted steering angles of architecture “L2 Fb Skip GT” on the test datasets.

5.5 Summary

In this chapter, we proposed several novel end-to-end vehicle control architectures. End-to-end vehicle control avoids the perception and path planning modules and directly maps the image pixels to a control command, such as the steering angle, using a deep neural network. Previous end-to-end control architectures focus on utilizing different Convolutional Neural Networks and still images. However, the camera data collected from an autonomous vehicle are video sequences, not discrete images, so it is not wise to neglect the potential of the contextual information hidden in the video sequences. In fact, this contextual information is frequently used in other computer vision tasks such as image captioning.

Inspired by this, we propose to incorporate LSTM units into the end-to-end vehicle control architecture to take advantage of the contextual information in the video sequences. We explored different ways to combine LSTM units with our end-to-end control architecture and evaluated them on both simulated and real road datasets. By comparing the different end-to-end control architectures quantitatively, we found that two LSTM layers with skip and feedback mechanisms achieve the best performance.

Due to the simplicity of the simulated driving scene, adding LSTM to the architecture did not produce any obviously better performance there. However, we found that adding LSTM units generally improves the steering angle prediction accuracy for complicated real-world driving scenarios. Therefore, contextual information should be considered in future autonomous driving applications involving video sequence data.


Chapter 6: Summary and Future Work

In this chapter, we summarize the main findings and contributions of the dissertation, and discuss suggestions for future work.

6.1 Summary and Contributions

In this dissertation, several aspects of autonomous driving have been studied. We focused on extracting meaningful information from camera data using various deep learning algorithms. The camera sensor is used because it can capture abundant information about the traffic environment, including color, texture, contrast, appearance, and optical characters. However, sensor fusion is usually preferred because the camera can be influenced by the environment, weather, and lighting conditions. Notably, in the last few years, deep learning techniques have outperformed traditional computer vision algorithms on almost every visual task. In addition, the application of deep learning was also investigated here as an optional vehicle control approach. Witnessing the rapid development and improvement of deep learning techniques, we are confident that they will be more heavily involved in autonomous driving research and practice, especially when image data are used to sense the environment.


Perception is the most important module of an autonomous driving system. It is the foundation and a necessary prerequisite for the subsequent modules. The perception module is in charge of understanding the traffic scene, and is composed of mapping/localization, object detection, object tracking, semantic segmentation, scene understanding, and so on. In chapters 3 and 4, we proposed several deep learning architectures to detect objects from both still images and video sequences. For still images, a one-stage end-to-end object detection pipeline based on CNNs was proposed. Through experiments with different feature extractors and different combinations of feature maps on the KITTI datasets, we found the extractor that achieved the best balance of accuracy and speed. For object detection in video sequences, we insert LSTM units after the feature extractor to take advantage of the contextual information hidden in the video sequences. Our proposed architecture is superior to previously published research, because we enable the LSTM units to learn the contextual information from both feature vectors and detected bounding boxes. We achieved better and smoother video detection results compared to methods, such as R-CNN [55] and Fast R-CNN [56], that do not consider the contextual information in video sequences.

It turns out that the power of deep learning goes far beyond improving the perception ability of autonomous driving. In chapter 5, we designed an end-to-end control algorithm that takes video sequences as input and directly outputs the control commands. We mainly focus on supervised learning methods, i.e., convolutional neural networks and recurrent neural networks, and train them using simulated data and real road data. As in the video object detection task, the recurrent neural network (LSTM) is designed to take advantage of the temporal information. The primary contribution of this research lies in our innovations in the way the LSTM units cooperate with each other, the feedback mechanism, and the skip mechanism in our proposed control architecture.

In summary, this work demonstrated the power of deep learning techniques applied to autonomous driving research problems of image and video object detection, object tracking, and end-to-end control. With enough annotated data and computation resources, deep learning can not only improve the perception ability of an autonomous vehicle, but can also be deeply involved in the whole autonomous driving system.

6.2 Future Work

Overall, the algorithms proposed in this dissertation improve the perceptual performance of an autonomous vehicle in the areas of object detection, object tracking, and end-to-end control. An important but not well-studied topic related to this dissertation is the determination of boundary conditions: for example, what object detection accuracy is needed to guarantee that the path planning and control modules can maneuver the vehicle without any possibility of a crash or accident. The boundary condition for every module of the autonomous driving system needs to be studied further through extensive testing and experimentation under various road conditions.

Each topic discussed in this dissertation can lead to different future research directions.


First of all, we evaluated our still-image object detection architecture on only a few pre-trained feature extractors, and we fine-tuned only the last several layers of each extractor. If more training data is available, fine-tuning more layers of the feature extractor should improve the detection accuracy. In addition, more feature map combinations from different feature extractors should be explored for object detection.

Furthermore, we integrated various ImageNet pre-trained feature extractors, such as VGG16 [95] and SqueezeNet [110], into our object detection architectures, which limits the feature maps we can use. In future work, it would be more flexible to design our own feature extractor and train it on large-scale public datasets.
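
As a concrete example of the fine-tuning strategy discussed above, the following sketch loads an ImageNet pre-trained VGG16 [95] backbone with TensorFlow/Keras, freezes the early layers, and leaves only the last few layers trainable; the input resolution and the number of unfrozen layers are assumptions for illustration, not the settings used in our experiments.

# Minimal sketch: freeze a pre-trained backbone and fine-tune only its last layers.
import tensorflow as tf

backbone = tf.keras.applications.VGG16(include_top=False,
                                       weights="imagenet",
                                       input_shape=(300, 300, 3))

# Freeze everything, then unfreeze only the last few layers for fine-tuning.
for layer in backbone.layers:
    layer.trainable = False
for layer in backbone.layers[-4:]:      # "-4" is an illustrative choice
    layer.trainable = True

trainable = [l.name for l in backbone.layers if l.trainable]
print("Fine-tuned layers:", trainable)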

For video object detection, our ultimate goal is to use the detection results for multiple object tracking. Compared with detection, object tracking can provide more useful information to the path planning and control modules, such as an object's trajectory, speed, and movement direction. Object tracking can also predict the future states of the objects of interest. Current multiple object tracking systems compare the current-frame detections with the historical detections and associate them by comparing distance, color, and other features. A promising research direction is to perform this association with deep learning techniques [130]. In that work, objects are detected and associated using multiple learned cues over a temporal window, which outperforms traditional data association methods on multiple benchmark datasets.
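
For reference, the classical association step mentioned above can be sketched as follows: detections are matched to existing tracks by intersection-over-union (IoU) using the Hungarian algorithm. This is only a baseline illustration of the traditional approach that learned, multi-cue association methods such as [130] aim to improve on; the box coordinates below are illustrative.

# Minimal sketch: IoU-based data association between tracks and detections.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

tracks = np.array([[10, 10, 50, 50], [100, 100, 160, 180]], dtype=float)
detections = np.array([[12, 11, 52, 49], [98, 102, 158, 178]], dtype=float)

# Cost matrix = 1 - IoU; the Hungarian algorithm finds the minimum-cost matching.
cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
row_idx, col_idx = linear_sum_assignment(cost)
for t, d in zip(row_idx, col_idx):
    print(f"track {t} -> detection {d} (IoU = {1.0 - cost[t, d]:.2f})")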

Finally, our end-to-end control architecture is based on supervised deep learning models, namely the convolutional neural network and the recurrent neural network. However, so-called corner cases exist because no training data set can cover all driving scenarios. In theory, an autonomous vehicle may crash when facing a situation that it cannot infer from its previously learned experience. An ideal end-to-end controller also has to be able to learn from unsupervised data online, meaning that it can accumulate knowledge from unlabeled data; this is how humans learn to drive. Therefore, unsupervised deep learning models need to be studied further for application to the end-to-end control system. For example, the authors of [131] successfully trained a simulated agent to generate collision-free motions from unlabeled data with Deep Q-Networks.
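
For illustration, the following is a minimal sketch of the kind of DQN-style update used in [131]: a Q-network maps a perception feature vector to discrete control actions and is trained from reward signals rather than labeled steering angles. The feature size, action set, and toy transition are assumptions, and practical implementations add experience replay and a target network.

# Minimal sketch: one Bellman update of a Q-network for discrete control actions.
import numpy as np
import tensorflow as tf

num_features = 64        # assumed size of a perception feature vector
num_actions = 5          # e.g., discretized steering commands (assumed)
gamma = 0.99             # discount factor

q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(num_features,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_actions),   # one Q-value per action
])
q_net.compile(optimizer="adam", loss="mse")

# One toy transition (state, action, reward, next_state); in practice these come
# from the agent's own driving experience, not from human labels.
state = np.random.rand(1, num_features).astype("float32")
next_state = np.random.rand(1, num_features).astype("float32")
action, reward, done = 2, 1.0, False

# Bellman target: r + gamma * max_a' Q(s', a'), applied to the taken action only.
target_q = q_net.predict(state, verbose=0)
next_q = q_net.predict(next_state, verbose=0)
target_q[0, action] = reward + (0.0 if done else gamma * np.max(next_q))

q_net.train_on_batch(state, target_q)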


References

[1] Association for Safe International Road Travel. Road crash statistics. In http://asirt.org/initiatives/informing-road-users/road-safety-facts/road-crash-statistics.

[2] American Driving Survey 2014–2015. http://publicaffairsresources.aaa.biz/wp-content/uploads/2016/09/AmericanDrivingSurvey2015.pdf

[3] Chester, M., Fraser, A., Matute, J., Flower, C., & Pendyala, R. (2015). Parking infrastructure: A constraint on or opportunity for urban redevelopment? A study of Los Angeles County parking supply and growth. Journal of the American Planning Association, 81(4), 268-286.

[4] National Highway Traffic Safety Administration. Critical reasons for crashes investigated in the National Motor Vehicle Crash Causation Survey, 2015. USA.

[5] 'Phantom Auto' will tour city. The Milwaukee Sentinel. Google News Archive. 8 December 1926. Retrieved 23 July 2013.

[6] Thorpe, C., Hebert, M. H., Kanade, T., & Shafer, S. A. (1988). Vision and navigation for the Carnegie-Mellon Navlab. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(3), 362-373.

[7] Dickmanns, E. D., Behringer, R., Dickmanns, D., Hildebrandt, T., Maurer, M., Thomanek, F., & Schiehlen, J. (1994, October). The seeing passenger car 'VaMoRs-P'. In Intelligent Vehicles' 94 Symposium, Proceedings of the (pp. 68-73). IEEE.

[8] Varaiya, P. (1993). Smart cars on smart roads: problems of control. IEEE Transactions on automatic control, 38(2), 195-207.

[9] Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J., ... & Lau, K. (2006). Stanley: The robot that won the DARPA Grand Challenge. Journal of field Robotics, 23(9), 661-692.

[10] Urmson, C., Anhalt, J., Bagnell, D., Baker, C., Bittner, R., Clark, M. N., ... & Gittleman, M. (2008). Autonomous driving in urban environments: Boss and the urban challenge. Journal of Field Robotics, 25(8), 425-466.

[11] On-road Automated Vehicle Standards Committee. SAE J3016: Taxonomy and Definitions for Terms Related to On-Road Motor Vehicle Automated Driving Systems. SAE International.

[12] Li, B., Zhang, T., & Xia, T. (2016). Vehicle detection from 3D lidar using fully convolutional network. arXiv preprint arXiv:1608.07916.


[13] Engelcke, M., et al. (2016). Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks. arXiv preprint arXiv:1609.06666.

[14] Li, B. (2016). 3D fully convolutional network for vehicle detection in point cloud. arXiv preprint arXiv:1611.08069.

[15] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.

[16] Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2), 157-166.

[17] Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML (3), 28, 1310-1318.

[18] Gers, F. A., & Schmidhuber, J. (2000). Recurrent nets that time and count. In Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on (Vol. 3, pp. 189-194). IEEE.

[19] Koutnik, J., Greff, K., Gomez, F., & Schmidhuber, J. (2014). A clockwork rnn. arXiv preprint arXiv:1402.3511.

[20] Yao, K., Cohn, T., Vylomova, K., Duh, K., & Dyer, C. (2015). Depth-Gated Recurrent Neural Networks. arXiv preprint arXiv:1508.03790.

[21] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[22] Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2016). LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems.

[23] Levinson, J., Montemerlo, M., & Thrun, S. (2007, June). Map-Based Precision Vehicle Localization in Urban Environments. In Robotics: Science and Systems (Vol. 4, p. 1).

[24] Levinson, J., & Thrun, S. (2010, May). Robust vehicle localization in urban environments using probabilistic maps. In Robotics and Automation (ICRA), 2010 IEEE International Conference on (pp. 4372-4378). IEEE.

[25] Wolcott, R. W., & Eustice, R. M. (2015, May). Fast LIDAR localization using multiresolution Gaussian mixture maps. In Robotics and Automation (ICRA), 2015 IEEE International Conference on (pp. 2814-2821). IEEE.

[26] Grisetti, G., Kümmerle, R., Stachniss, C., Frese, U., & Hertzberg, C. (2010, May). Hierarchical optimization on manifolds for online 2D and 3D mapping. In Robotics and Automation (ICRA), 2010 IEEE International Conference on (pp. 273-278). IEEE.

[27] Kaess, M., Johannsson, H., Roberts, R., Ila, V., Leonard, J. J., & Dellaert, F. (2012). iSAM2: Incremental smoothing and mapping using the Bayes tree. The International Journal of Robotics Research, 31(2), 216-235.

[28] Nieto, J., Bailey, T., & Nebot, E. (2006). Scan-SLAM: Combining EKF-SLAM and scan correlation. In Field and service robotics (pp. 167-178). Springer Berlin/Heidelberg.


[29] Doucet, A., De Freitas, N., Murphy, K., & Russell, S. (2000, June). Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence (pp. 176-183). Morgan Kaufmann Publishers Inc..

[30] Goldberg, A. V., & Harrelson, C. (2005, January). Computing the shortest path: A search meets graph theory. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms (pp. 156-165). Society for Industrial and Applied Mathematics.

[31] Geisberger, R., Sanders, P., Schultes, D., & Vetter, C. (2012). Exact routing in large road networks using contraction hierarchies. Transportation Science, 46(3), 388-404.

[32] Brechtel, S., Gindele, T., & Dillmann, R. (2011, October). Probabilistic MDP-behavior planning for cars. In Intelligent Transportation Systems (ITSC), 2011 14th International IEEE Conference on (pp. 1537-1542). IEEE.

[33] Ulbrich, S., & Maurer, M. (2013, October). Probabilistic online POMDP decision making for lane changes in fully automated driving. In Intelligent Transportation Systems-(ITSC), 2013 16th International IEEE Conference on (pp. 2063-2067). IEEE.

[34] Brechtel, S., Gindele, T., & Dillmann, R. (2014, October). Probabilistic decision-making under uncertainty for autonomous driving using continuous POMDPs. In Intelligent Transportation Systems (ITSC), 2014 IEEE 17th International Conference on (pp. 392-399). IEEE.

[35] Dolgov, D., Thrun, S., Montemerlo, M., & Diebel, J. (2010). Path planning for autonomous vehicles in unknown semi-structured environments. The International Journal of Robotics Research, 29(5), 485-501.

[36] Le-Anh, T., & De Koster, M. B. M. (2006). A review of design and control of automated guided vehicle systems. European Journal of Operational Research, 171(1), 1-23.

[37] Cheein, F., De La Cruz, C., Bastos, T., & Carelli, R. (2010). Slam-based cross-a-door solution approach for a robotic wheelchair. International Journal of Advanced Robotic Systems, 7(2), 155-164.

[38] Lenain, R., Thuilot, B., Cariou, C., & Martinet, P. (2005, April). Model predictive control for vehicle guidance in presence of sliding: application to farm vehicles path tracking. In Robotics and Automation, 2005. ICRA 2005. Proceedings of the 2005 IEEE International Conference on (pp. 885-890). IEEE.

[39] Choomuang, R., & Afzulpurkar, N. (2005). Hybrid Kalman filter/fuzzy logic based position control of autonomous mobile robot. International Journal of Advanced Robotic Systems, 2(3), 20.

[40] Wang, W., Nonami, K., & Ohira, Y. (2008). Model reference sliding mode control of small helicopter XRB based on vision. International Journal of Advanced Robotic Systems, 5(3), 26.

[41] Gunter, L., & Zhu, J. (2005). Computing the solution path for the regularized support vector regression. Ann Arbor, 1001, 48109.

[42] Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., ... & Zhang, X. (2016). End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.

[43] Le Cun, Y., Bottou, L., & Bengio, Y. (1997, April). Reading checks with multilayer graph transformer networks. In Acoustics, Speech, and Signal Processing, 1997. ICASSP-97., 1997 IEEE International Conference on (Vol. 1, pp. 151-154). IEEE.

[44] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

[45] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.

[46] Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10) (pp. 807-814).

[47] Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02), 107-116.

[48] Clevert, D. A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289.

[49] Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013, June). Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML (Vol. 30, No. 1).

[50] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).

[51] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks. arXiv preprint arXiv:1302.4389.

[52] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.

[53] Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International journal of computer vision, 104(2), 154-171.

[54] Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1532-1545.

[55] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580-587).

[56] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2016). Region-based convolutional networks for accurate object detection and segmentation. IEEE transactions on pattern analysis and machine intelligence, 38(1), 142-158.

[57] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91-99).


[58] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016, October). SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21-37). Springer International Publishing.

[59] Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779-788).

[60] Giebel, J., Gavrila, D., & Schnörr, C. (2004). A bayesian framework for multi-cue 3d object tracking. Computer Vision-ECCV 2004, 241-252.

[61] Breitenstein, M. D., Reichlin, F., Leibe, B., Koller-Meier, E., & Van Gool, L. (2011). Online multiperson tracking-by-detection from a single, uncalibrated camera. IEEE transactions on pattern analysis and machine intelligence, 33(9), 1820-1833.

[62] Zhang, L., Li, Y., & Nevatia, R. (2008, June). Global data association for multi-object tracking using network flows. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1-8). IEEE.

[63] Andriyenko, A., & Schindler, K. (2011, June). Multi-target tracking by continuous energy minimization. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on (pp. 1265-1272). IEEE.

[64] Kahou, S. E., Michalski, V., & Memisevic, R. (2015). Ratm: Recurrent attentive tracking model. arXiv preprint arXiv:1510.08660.

[65] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., ... & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3213-3223).

[66] He, X., Zemel, R., & Ray, D. (2006). Learning and incorporating top-down cues in image segmentation. Computer Vision–ECCV 2006, 338-351.

[67] Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).

[68] He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. arXiv preprint arXiv:1703.06870.

[69] Ess, A., Müller, T., Grabner, H., & Van Gool, L. J. (2009, September). Segmentation-Based Urban Traffic Scene Understanding. In BMVC (Vol. 1, p. 2).

[70] Geiger, A., Lauer, M., Wojek, C., Stiller, C., & Urtasun, R. (2014). 3d traffic scene understanding from movable platforms. IEEE transactions on pattern analysis and machine intelligence, 36(5), 1012-1025.

[71] Seff, A., & Xiao, J. (2016). Learning from Maps: Visual Common Sense for Autonomous Driving. arXiv preprint arXiv:1611.08583.

[72] Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2012). Scene parsing with multiscale feature learning, purity trees, and optimal covers. arXiv preprint arXiv:1202.2160.


[73] Toshev, A., & Szegedy, C. (2014). Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1653-1660).

[74] Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431-3440).

[75] Zbontar, J., & LeCun, Y. (2016). Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1-32), 2.

[76] Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on (Vol. 2, pp. 1150-1157). IEEE.

[77] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), 91-110.

[78] Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3128-3137).

[79] Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., ... & Lawrence Zitnick, C. (2015). From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1473-1482).

[80] Kang, K., Li, H., Xiao, T., Ouyang, W., Yan, J., Liu, X., & Wang, X. (2017). Object detection in videos with tubelet proposal networks. arXiv preprint arXiv:1702.06355.

[81] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).

[82] Ferryman, J., & Shahrokni, A. (2009, December). Pets2009: Dataset and challenge. In Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on (pp. 1-6). IEEE.

[83] Milan, A., Leal-Taixé, L., Reid, I., Roth, S., & Schindler, K. (2016). Mot16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831.

[84] Geiger, A., Lenz, P., & Urtasun, R. (2012, June). Are we ready for autonomous driving? the kitti vision benchmark suite. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 3354-3361). IEEE.

[85] Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2009, June). Pedestrian detection: A benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (pp. 304-311). IEEE.

[86] Karpathy, A. (2016). Cs231n: Convolutional neural networks for visual recognition. Online Course.

[87] De Boer, P. T., Kroese, D. P., Mannor, S., & Rubinstein, R. Y. (2005). A tutorial on the cross-entropy method. Annals of operations research, 134(1), 19-67.


[88] Glorot, X., & Bengio, Y. (2010, May). Understanding the difficulty of training deep feedforward neural networks. In Aistats (Vol. 9, pp. 249-256).

[89] Mishkin, D., & Matas, J. (2015). All you need is a good init. arXiv preprint arXiv:1511.06422.

[90] Sussillo, D., & Abbott, L. F. (2014). Random walk initialization for training very deep feedforward networks. arXiv preprint arXiv:1412.6558.

[91] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444.

[92] Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17.

[93] Sutskever, I., Martens, J., Dahl, G. E., & Hinton, G. E. (2013). On the importance of initialization and momentum in deep learning. ICML (3), 28, 1139-1147.

[94] Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[95] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[96] Ning, G., Zhang, Z., Huang, C., He, Z., Ren, X., & Wang, H. (2016). Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking. arXiv preprint arXiv:1607.05781.

[97] Gan, Q., Guo, Q., Zhang, Z., & Cho, K. (2015). First step toward model-free, anonymous object tracking with recurrent neural networks. arXiv preprint arXiv:1511.06425.

[98] Broggi, A., Bertozzi, M., Fascioli, A., & Sechi, M. (2000). Shape-based pedestrian detection. In Intelligent Vehicles Symposium, 2000. IV 2000. Proceedings of the IEEE (pp. 215-220). IEEE.

[99] Wang, X., Yang, M., Zhu, S., & Lin, Y. (2013). Regionlets for generic object detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 17-24).

[100] Viola, P., Jones, M. J., & Snow, D. (2003, October). Detecting pedestrians using patterns of motion and appearance. In Proceedings of the Ninth IEEE International Conference on Computer Vision (p. 734). IEEE.

[101] Suykens, J. A., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural processing letters, 9(3), 293-300.

[102] Neubeck, A., & Van Gool, L. (2006, August). Efficient non-maximum suppression. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on (Vol. 3, pp. 850-855). IEEE.

[103] Dalal, N., & Triggs, B. (2005, June). Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on (Vol. 1, pp. 886-893). IEEE.

[104] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), 91-110.


[105] Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9), 1627-1645.

[106] Erhan, D., Szegedy, C., Toshev, A., & Anguelov, D. (2014). Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2147-2154).

[107] Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409.

[108] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.

[109] Redmon, J., & Farhadi, A. (2016). YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242.

[110] Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.

[111] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., ... & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

[112] Arthur, D., & Vassilvitskii, S. (2007, January). k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms (pp. 1027-1035). Society for Industrial and Applied Mathematics.

[113] Pomerleau, D. A. (1989). Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems (pp. 305-313).

[114] Net-Scale Technologies, Inc. Autonomous off-road vehicle control using end-to-end learning, July 2004. Final technical report. URL: http://net-scale.com/doc/net-scale-dave-report.pdf.

[115] Chen, C., Seff, A., Kornhauser, A., & Xiao, J. (2015). Deepdriving: Learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2722-2730).

[116] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2625-2634).

[117] Xu, H., Gao, Y., Yu, F., & Darrell, T. (2016). End-to-end learning of driving models from large-scale video datasets. arXiv preprint arXiv:1612.01079.

[118] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.


[119] Han, W., Khorrami, P., Paine, T. L., Ramachandran, P., Babaeizadeh, M., Shi, H., ... & Huang, T. S. (2016). Seq-nms for video object detection. arXiv preprint arXiv:1602.08465.

[119] Zhu, X., Wang, Y., Dai, J., Yuan, L., & Wei, Y. (2017). Flow-Guided Feature Aggregation for Video Object Detection. arXiv preprint arXiv:1703.10025.

[120] Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., ... & Ouyang, W. (2016). T-cnn: Tubelets with convolutional neural networks for object detection from videos. arXiv preprint arXiv:1604.02532.

[121] Fragkiadaki, K., Levine, S., Felsen, P., & Malik, J. (2015). Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4346-4354).

[122] Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4694-4702).

[123] Xiang, Y., Alahi, A., & Savarese, S. (2015). Learning to track: Online multi-object tracking by decision making. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4705-4713).

[124] Hong, S., You, T., Kwak, S., & Han, B. (2015, June). Online tracking by learning discriminative saliency map with convolutional neural network. In International Conference on Machine Learning (pp. 597-606).

[125] Forsyth, D., & Ponce, J. (2011). Computer vision: a modern approach. Upper Saddle River, NJ; London: Prentice Hall.

[126] Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2), 303-338.

[127] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

[128] Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360.

[129] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., ... & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.

[130] Sadeghian, A., Alahi, A., & Savarese, S. (2017). Tracking the untrackable: Learning to track multiple cues with long-term dependencies. arXiv preprint arXiv:1701.01909.

[131] Sharifzadeh, S., Chiotellis, I., Triebel, R., & Cremers, D. (2016). Learning to Drive using Inverse Reinforcement Learning and Deep Q-Networks. arXiv preprint arXiv:1612.03653.


[132] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (pp. 248-255). IEEE.

[133] Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., ... & Zitnick, C. L. (2014, September). Microsoft coco: Common objects in context. In European conference on computer vision (pp. 740-755). Springer, Cham.

[134] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems(pp. 3104-3112).

[135] Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: A survey. Acm computing surveys (CSUR), 38(4), 13.

[136] Yi, Y., & Grejner-Brzezinska, D. A. (2001, June). Tightly-coupled GPS/INS integration using unscented Kalman filter and particle filter. In Proceedings of the 19th International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS 2006) (pp. 2182-2191).

[137] Grejner-Brzezinska, D. A., Toth, C. K., Sun, H., Wang, X., & Rizos, C. (2011). A robust solution to high-accuracy geolocation: Quadruple integration of GPS, IMU, pseudolite, and terrestrial laser scanning. IEEE Transactions on instrumentation and measurement, 60(11), 3694-3708.

[138] Toth, C. K., Zaletnyik, P., Laky, S., & Grejner-Brzezińska, D. (2011). The potential of full-waveform LiDAR in mobile mapping applications. Archiwum Fotogrametrii, Kartografii i Teledetekcji, 22.

[139] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., ... & Ghemawat, S. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

[140] The AI Car Computer for Autonomous Driving. Retrieved from http://www.nvidia.com/object/drive-px.html
