<<

Computer Vision and Machine Learning for Autonomous Vehicles

by Zhilu Chen

A Dissertation Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Electrical and Computer Engineering

August 2017

APPROVED:

Prof. Xinming Huang, Major Advisor

Prof. Lifeng Lai

Prof. Haibo He

Abstract

The autonomous vehicle is an engineering technology that can improve transportation safety, alleviate traffic congestion and reduce carbon emissions. Research on autonomous vehicles can be categorized by functionality, for example, object detection or recognition, path planning, navigation, lane keeping, speed control and driver status monitoring. The research topics can also be categorized by the equipment or techniques used, for example, image processing, machine learning, and localization. This dissertation primarily reports on computer vision and machine learning and their implementations for autonomous vehicles. The vision-based system can effectively detect and accurately recognize multiple objects on the road, such as traffic signs, traffic lights, and pedestrians. In addition, an autonomous lane keeping system has been proposed using end-to-end learning. In this dissertation, a road simulator is built using data collection and augmentation, which can be used for training and evaluating autonomous driving algorithms. The Graphics Processing Unit (GPU) based traffic sign detection and recognition system can detect and recognize 48 traffic signs. The implementation has three stages: pre-processing, feature extraction, and classification. A highly optimized and parallelized version of Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM) is used. The system can process 27.9 frames per second with active pixels of 1,628 × 1,236 resolution, and with minimal loss of accuracy. In an evaluation using the BelgiumTS dataset, the experimental results indicate that the detection rate is about 91.69% with false positives per window of 3.39 × 10⁻⁵, and the recognition rate is about 93.77%. We report on two traffic light detection and recognition systems.

The first system detects and recognizes red circular lights only, using image processing and SVM. Its performance is better than that of traditional detectors, and it achieves the best performance with 96.97% precision and 99.43% recall. The second system is more complicated. It detects and classifies different types of traffic lights, including green and red lights in both circular and arrow forms. In addition, it employs image processing techniques, such as color extraction and blob detection, to locate the candidates. Subsequently, a pre-trained PCA network is used as a multi-class classifier for obtaining frame-by-frame results. Furthermore, an online multi-object tracking technique is applied to overcome occasional misses and a forecasting method is used to filter out false positives. Several additional optimization techniques are employed to improve the detector performance and to handle the traffic light transitions. A multi-spectral data collection system is implemented for pedestrian detection, which includes a thermal camera and a pair of stereo color cameras. The three cameras are first aligned using the trifocal tensor, and the aligned data are processed by using computer vision and machine learning techniques. Convolutional channel features (CCF) and the traditional HOG+SVM approach are evaluated over the data captured from the three cameras. Through the use of the trifocal tensor and CCF, training becomes more efficient. The proposed system achieves only a 9% log-average miss rate on our dataset. The autonomous lane keeping system employs an end-to-end learning approach for obtaining the proper steering angle to maintain the car in a lane. The convolutional neural network (CNN) model uses raw image frames as input, and it outputs the steering angles corresponding to the input frames. Unlike the traditional approach, which manually decomposes the problem into several parts, such as lane detection, path planning, and steering control, the model learns to extract useful features

on its own and learns to steer from human behavior. More importantly, we find that having a simulator for training and evaluation is important. We then build the simulator using image projection, vehicle dynamics, and vehicle trajectory tracking. The test results reveal that the model trained with augmented data using the simulator has better performance and achieves about a 98% autonomous driving time on our dataset. Furthermore, a vehicle data collection system is developed for building our own datasets from recorded videos. These datasets are used in the above studies and have been released to the public for autonomous vehicle research. The experimental datasets are available at http://computing.wpi.edu/Dataset.html.

Acknowledgements

I would like to express my gratitude to my advisor, Professor Xinming Huang, for the opportunity to do research at WPI and for his guidance throughout my research. Thanks to Professors Haibo He, Lifeng Lai and many other professors for their help; I have learned a lot from them. Thanks to my family and my friends for giving me courage and confidence.

Contents

Abstract i

Acknowledgements iv

Contents ix

List of Tables x

List of Figures xv

List of Abbreviations xvii

1 Introduction 1
1.1 Motivations ...... 1
1.2 Summary of Contributions ...... 3
1.3 Outline ...... 8

2 Background 10
2.1 Datasets ...... 11
2.2 Object detection and recognition ...... 13
2.2.1 Traffic sign ...... 13

2.2.2 Traffic light ...... 14
2.2.3 Pedestrian ...... 15
2.3 Lane keeping ...... 19

3 A GPU-Based Real-Time Traffic Sign Detection and Recognition System 21
3.1 Introduction ...... 22
3.2 Traffic Sign Detection and Recognition System ...... 23
3.2.1 System Overview ...... 23
3.2.2 Pre-processing ...... 24
3.2.3 Traffic Sign Detection ...... 26
3.2.4 Traffic Sign Recognition ...... 29
3.3 Parallelism on GPU ...... 29
3.4 Experimental Results ...... 31
3.5 Conclusions ...... 35

4 Automatic Detection of Traffic Lights Using Support Vector Machine 36
4.1 Introduction ...... 37
4.2 Proposed Method for Traffic Light Detection ...... 38
4.2.1 Locating candidates based on color extraction ...... 38
4.2.2 Traffic light detection using ...... 38
4.2.3 An improved method using SVM ...... 40
4.3 Data Collection and Performance Evaluation ...... 43
4.4 Conclusions ...... 46

5 Accurate and Reliable Detection of Traffic Lights Using Multi-Class Learning and Multi-Object Tracking 48
5.1 Introduction ...... 49
5.2 Data Collection and Experimental Setup ...... 51
5.2.1 Training data ...... 52
5.2.2 Test data ...... 58
5.3 Proposed Method of Traffic Light Detection and Recognition ...... 58
5.3.1 Locating candidates based on color extraction ...... 60
5.3.2 Classification ...... 65
5.3.2.1 PCANet ...... 65
5.3.2.2 Recognizing green traffic lights using PCANet ...... 66
5.3.2.3 Recognizing red traffic lights using PCANet ...... 69
5.3.3 Stabilizing the detection and recognition output ...... 69
5.3.3.1 The problem of frame-by-frame detection ...... 69
5.3.3.2 Tracking and data association ...... 71
5.3.3.3 Forecasting ...... 72
5.3.3.4 Minimizing delays ...... 74
5.4 Performance Evaluation ...... 76
5.4.1 Detection and recognition ...... 76
5.4.2 False positives evaluation ...... 78
5.5 Discussion ...... 79
5.5.1 Comparison with related work ...... 79
5.5.2 Limitation and plausibility ...... 80
5.6 Conclusions ...... 83

6 Pedestrian Detection for Autonomous Vehicle Using Multi-spectral Cameras 84
6.1 Introduction ...... 85
6.2 Data Collection and Experimental Setup ...... 87
6.2.1 Data Collection Equipment ...... 87
6.2.2 Data Collection and Experimental Setup ...... 90
6.3 Proposed Method ...... 90
6.3.1 Overview ...... 90
6.3.2 Trifocal tensor ...... 91
6.3.3 Sliding windows vs. region of interest ...... 93
6.3.4 Detection ...... 97
6.3.5 Information fusion ...... 99
6.3.6 Additional constraints ...... 100
6.3.6.1 Disparity-size ...... 100
6.3.6.2 Road horizon ...... 100
6.4 Performance Evaluation ...... 102
6.5 Discussion ...... 105
6.6 Conclusions ...... 108

7 End-to-End Learning for Lane Keeping of Self-Driving Cars 109
7.1 Introduction ...... 110
7.2 Implementation Details ...... 111
7.2.1 Data pre-processing ...... 111
7.2.2 CNN implementation details ...... 113
7.3 Evaluation ...... 116

7.4 Discussion ...... 119
7.4.1 Evaluation ...... 119
7.4.2 Data augmentation ...... 122
7.5 Conclusions ...... 123

8 Building an Autonomous Lane Keeping Simulator Using Real-World Data and End-to-End Learning 124
8.1 Introduction ...... 125
8.2 Building a Simulator ...... 128
8.2.1 Overview ...... 128
8.2.2 Image projection ...... 130
8.2.3 Vehicle dynamics and vehicle trajectory tracking ...... 134
8.2.4 CNN implementation ...... 142
8.3 Experiment ...... 143
8.3.1 Data collection ...... 143
8.3.2 Data augmentation ...... 147
8.3.3 Evaluation using simulator ...... 147
8.4 Discussion ...... 151
8.5 Conclusions ...... 153

9 Conclusions 154

Bibliography 157

List of Tables

3.1 HOG parameters in our system ...... 28

4.1 Evaluation result based on Rin/Rout for different p values ...... 45
4.2 Evaluation result: precision and recall ...... 46

5.1 Number of training samples of Green ROI-n and Red ROI-n ...... 58
5.2 Information of 23 test sequences ...... 59
5.3 Test result of 17 sequences that contain traffic lights ...... 78
5.4 Number of false positives in traffic-light-free sequences ...... 79
5.5 Results of several recent works on traffic lights detection ...... 81

8.1 Evaluation result using the simulator, with and without augmented data. 149

List of Figures

2.1 Performance results from the Caltech Pedestrian Detection Benchmark. 17

3.1 Three stages in our proposed system. ...... 24
3.2 48 classes of traffic signs can be detected and recognized in our system. 25
3.3 An example of color enhancement. ...... 26
3.4 Selecting ROI from the original image. ...... 27
3.5 Grouping detected windows. ...... 29
3.6 Normal CUDA kernel launches. ...... 30
3.7 CUDA kernel launches using CUDA streams. ...... 30
3.8 HOG time on CPU and GPU. ...... 32
3.9 The total processing time when HOG is computed using OpenCV on GPU. ...... 33
3.10 The total processing time when using our optimized GPU code. ...... 34

4.1 Applying traffic light detector on a candidate. ...... 40
4.2 Are they traffic lights or not? Dark background on the top and bright background at the bottom. ...... 41
4.3 The left traffic light has bright background and the right traffic light has dark background. ...... 42

4.4 Rin/Rout values for true positive candidates (left) and true negative candidates (right). Y-axis is from 0 to 2000. ...... 44

4.5 Rin/Rout values for true positive candidates (left) and true negative candidates (right). Y-axis is from 0 to 20. ...... 45
4.6 Both traffic lights are detected. ...... 47

5.1 Examples of 5 classes of Green ROI-1. ...... 53
5.2 Examples of 5 classes of Green ROI-3. ...... 54
5.3 Examples of 5 classes of Green ROI-4. ...... 55
5.4 Examples of 3 classes of Red ROI-1. ...... 56
5.5 Examples of 3 classes of Red ROI-3. ...... 57
5.6 Flowchart of the proposed method of traffic light detection and recognition. ...... 60
5.7 Color extraction, blob detection and closing operation. ...... 62
5.8 A sample frame from our traffic light dataset. ...... 64
5.9 The structure of two-stage PCANet. ...... 66
5.10 An arrow light in three consecutive frames. The middle one is vague and looks similar to a circular light. A detector often fails on such a vague frame. ...... 70
5.11 All traffic lights are detected and recognized correctly in the frame. ...... 77

6.1 Instrumentation setup with both thermal and stereo cameras mounted on the roof of a vehicle. ...... 88
6.2 Framework of the proposed pedestrian detection method. ...... 92
6.3 Proper alignment of color and thermal images using trifocal tensor. ...... 94
6.4 Examples of pedestrians in color and thermal images. ...... 96

6.5 The relationship between the mean disparity and the height of an object. 101
6.6 Performance of different input data combinations, all using HOG features. ...... 103
6.7 Performance improvement by adding disparity-size and road horizon constraints. ...... 104
6.8 Performance of different input data combinations, all using CCF. ...... 105
6.9 A pedestrian is embedded in the shadow of a color image. ...... 106
6.10 An example thermal image with two pedestrians. ...... 107

7.1 Comparison between the traditional approach and end-to-end learning. 111
7.2 An example of image frame from the dataset. ...... 112
7.3 Histogram of steering angles in training data. ...... 114
7.4 The proposed CNN architecture for ...... 115
7.5 Histogram of error of predicted steering angles during test. ...... 117
7.6 An example frame with the ground truth angle, predicted angle and their respective projected path ...... 118
7.7 Visualization of the results from first two convolutional layers. ...... 119
7.8 An example of the disadvantage of frame by frame evaluation with 5 consecutive frames: the error in the middle frame is false ...... 121

8.1 Comparison between the traditional framework and end-to-end learning.126 8.2 The flowchart of test phase...... 131 8.3 The flowchart of training phase, using original data and augmented data.132

xiii 8.4 Example of original image and generated images given arbitrary camera poses. (a) Original image. A checkerboard pattern on a flat surface. (b) Generated image as if the camera is shifted left by 50 mm. (c) Generated image as if the camera is rotated right by 15.25 degrees. (d) Generated image as if the camera is shifted left by 50 mm and rotated right by 15.25 degrees...... 135 8.5 Camera calibration and ground surface estimation. (a) Selected points in the image taken by the center camera. (b) Cameras and selected points in the world coordinates...... 136 8.6 A virtual bicycle vehicle dynamics...... 138 8.7 Correction of vehicle’s position and orientation using vehicle trajectory tracking. (a) Ground truth and predicted trajectory. (b) Ground truth and predicted orientation. (c) Ground truth and predicted steering wheel angle...... 141 8.8 An example of cropped image frame from the dataset...... 143 8.9 The CNN structure used, slightly modified from NVIDIA’s PilotNet. 144 8.10 Our data collection system, including three forward facing cameras, a USB hub, a laptop and access to OBD-II port...... 145 8.11 Example frames under different weather or lighting condition. (a) Cloudy. (b) Shadowed. (c) Foggy. (d) Sunny...... 146 8.12 Example of original image and augmented images given arbitrary vehi- cle poses. (a) Original image. (b) Augmented image as if the vehicle is shifted right by 0.5 m. (c) Augmented image as if the vehicle is rotated left by 7 degrees. (d) Augmented image as if the vehicle is shifted right by 0.5 m and rotated left by 7 degrees...... 148

8.13 An example of the simulation result, produced by the CNN trained with data augmentation. (a) Overview of the trajectory in a test sequence. (b) Trajectory zoomed-in in the black rectangle in (a). (c) Trajectory zoomed-in in the black rectangle in (b). ...... 150
8.14 An example of failure. The vehicle is going out of lane to the right because another vehicle is changing lane, and lane markings are partially blocked. ...... 152
8.15 An example of failure. The vehicle is going out of lane to the right because of unclear lane markings. ...... 152

List of Abbreviations

ACF Aggregated Channel Features

CCF Convolutional Channel Features

CNN Convolutional Neural Network

FN False Negatives

FOV Field of View

FP False Positives

FPPI False Positives Per Image

FPPW False Positives Per Window

FPS Frames per Second

GPU Graphics Processing Unit

HOG Histograms of Oriented Gradients

LKAS Lane Keeping Assist System

LQR Linear Quadratic Regulator

MOT Multi-object Tracking

MR Miss Rate

ODE Ordinary Differential Equation

PCA Principal Component Analysis

PCANet PCA network

RBF Radial Basis Function

ROI Region of Interest

SLAM Simultaneous Localization and Mapping

SMA Simple Moving Average

SVM Support Vector Machine

TP True Positives

Chapter 1

Introduction

In this chapter, we first introduce the background and discuss the motivations of our work in Section 1.1. The major contributions of our work are summarized in Section 1.2. Finally, the organization of this dissertation is presented in Section 1.3.

1.1 Motivations

Road safety is an important topic. Data from the Insurance Institute for Highway Safety (IIHS) revealed that in 2012, red-light-running crashes caused around 133,000 injuries and 683 deaths on US roads [1]. These injuries and deaths may be reduced or avoided with the introduction of more advanced technologies, and many researchers are dedicated to the area of autonomous vehicles. Therefore, we believe that this topic is meaningful and important. Related to the topic of autonomous vehicles are cameras, which are common in our daily lives and are much cheaper than some other sensors. Vision-based systems are also intuitive, as humans use their eyes to understand

the surrounding environment. In addition, humans can easily interpret the information obtained from images or videos, which makes building manually labeled datasets easier. Therefore, we believe that the vision-based approach is reasonable. In addition to using some public datasets, we design and deploy our own data collection system to build our own datasets, especially when the public datasets are limited or not ideal. Object detection and recognition are important for understanding a road scene. Traffic signs, traffic lights, pedestrians, and many other objects on the road need to be detected and recognized to guide drivers or autonomous driving systems. Our projects witness the evolution of object detection and recognition in computer vision. Initially, hand-crafted features (e.g., HOG) proved their effectiveness in detecting objects with certain shapes or patterns. A classifier, such as SVM or AdaBoost, is often used upon the extracted features. Image processing is often used as a pre-processing or post-processing step, and certain assumptions are often made to improve the detector's performance. Later, researchers found a more generic way of detecting objects, without using hand-crafted features. It is called two-stage training. The first stage performs unsupervised training on all of the training data to determine the best method for extracting features, and the second stage performs supervised training to train the classifiers based on these features. After the two-stage training approach, the one-stage approach became popular again, but with an end-to-end learning Convolutional Neural Network (CNN) instead of hand-crafted features. The CNN takes raw images as input and outputs the classified labels. As the CNN is trained, it learns how to extract information from the raw images and how to classify them. The training is one stage and supervised, and there is no clear boundary between the feature extractor and the classifier in the model. At present, CNNs deliver state-of-the-art performance in object detection and recognition.

Besides object detection and recognition, we are also motivated to look at the lane keeping problem, which is an essential part of autonomous cars. A CNN is used here as well: it takes raw image frames as input and outputs the steering angles corresponding to the input frames, to keep the vehicle within the lane. This is a regression problem instead of a classification problem. A simulator is then built to provide augmented training data and a proper evaluation metric. Knowledge of 3D geometry in computer vision, vehicle dynamics, and vehicle trajectory tracking is also used in the simulator.

1.2 Summary of Contributions

We design and implement a group of systems for autonomous vehicles. Our contributions are listed as follows:

• Design and implement a traffic sign detection and recognition system. Traffic sign detection and recognition are important functions for autonomous vehicles. The detection process identifies the existence of traffic signs and their locations in an image, and the recognition process identifies the types of the detected signs. Our GPU-based traffic sign detection and recognition system is able to detect and recognize 48 traffic signs. The implementation features three stages: pre-processing, feature extraction and classification. A highly optimized and parallelized version of HOG+SVM is used. The system can process 27.9 frames per second with active pixels of 1,628 × 1,236 resolution, and with minimal loss of accuracy. Evaluated on the BelgiumTS dataset, the experimental results indicate that the detection rate is about 91.69% with false positives per window of 3.39 × 10⁻⁵ and the recognition rate is about 93.77%.

We emphasize our contributions in the following aspects:

– Our system is able to detect and recognize 48 traffic signs, with a good detection rate and recognition rate.

– We optimized and parallelized the computation of HOG on GPU, as well as some pre-processing steps and the deployed SVM classifier.

– Our system achieves real-time performance on high-resolution images.

• Design and implement two traffic light detection and recognition systems. Two traffic light detection and recognition systems are presented. The first system detects and recognizes red circular lights only, using image processing and SVM. Its performance is better than that of traditional detectors. The second system detects and classifies different types of traffic lights, including green and red lights in both circular and arrow forms. It combines computer vision and machine learning techniques. Color extraction and blob detection are used to locate the candidates, followed by the PCA network (PCANet) classifiers. The PCANet classifier consists of a PCANet and a linear SVM. Our experimental results suggest that the proposed method is highly effective for detecting both green and red traffic lights. We emphasize our contributions in the following aspects:

– For the first system, we demonstrate that detection using a fixed threshold ratio is not very effective and the SVM-based classification has much better performance.

– For the first system, we empirically add more parameters of a candidate to the SVM input, and this can achieve better performance.

– For the first system, we build a traffic light dataset from the original videos captured while driving on the streets.

– For the second system, we demonstrate that combining image processing and PCANet can help with detecting and recognizing various types of traffic lights, including green and red lights in both circular and arrow forms.

– For the second system, an online multi-object tracking technique is applied to overcome occasional misses, and a forecasting method is used to filter out false positives.

– For the second system, several additional optimization techniques are employed to improve the detector performance and to handle the traffic light transitions.

– For the second system, we build our own dataset of traffic lights from recorded driving videos, including circular lights and arrow lights in various directions.

• Design and implement a pedestrian detection system. Pedestrian detection is a critical feature for self-driving cars or advanced driver assistance systems. Our system consists of a thermal camera and a color stereo camera. Data received from multiple cameras are aligned using the trifocal tensor based on pre-calibrated parameters. In addition, candidates are generated using sliding windows at multiple scales. A reconfigurable detector framework is proposed, in which feature extraction and classification are two separate stages. The input to the detector can be the color image, disparity map, thermal data, or any combination of these. When applied to convolutional channel features, feature extraction uses the first three convolutional layers of a pre-trained convolutional

neural network, cascaded with an AdaBoost classifier. The evaluation results indicate that it significantly outperforms the traditional histogram of oriented gradients features. When combining the color and thermal images, the proposed detector can achieve a 9% log-average miss rate. We emphasize our contributions in the following aspects:

– We design and assemble a multi-spectral camera system mounted on a vehicle to collect data for pedestrian detection.

– We build a dataset for multi-spectral pedestrian detection from on-road driving data. These data contain many complex scenarios that are challenging for detection and classification.

– We propose a machine learning based method for pedestrian detection by combining stereo vision and thermal images. The evaluation results show satisfactory performance.

– An experimental dataset is built by labeling the data collected when driving on the city roads.

• Design and implement a lane keeping system. We present an end-to-end learning approach for obtaining the proper steering angle to maintain the car in the lane. The CNN model uses raw image frames as input and outputs the steering angles accordingly. The model is trained and evaluated using the comma.ai dataset, which contains the front view image frames and the steering angle data captured when driving on the road. Unlike the traditional approach, which manually decomposes the autonomous driving problem into technical components such as lane detection, path planning and

steering control, the end-to-end model can directly steer the vehicle from the front view camera data after training. It learns how to keep the car in the lane from human driving data. Further discussion of this end-to-end approach and its limitations is also provided. We emphasize our contributions in the following aspects:

– We present a working system for lane keeping using the end-to-end learning approach.

– We provide the evaluation results and discussion of this system. The need for building a simulator is discussed.

• Design and implement a simulator for the lane keeping system. In addition to the state-of-the-art end-to-end learning method that predicts the steering wheel angle for the purpose of staying in the lane, a simulator is built using image projection, vehicle dynamics and vehicle trajectory tracking, which can be helpful in both training and evaluation. The simulation results demonstrate the effectiveness and accuracy of the end-to-end learning method and the benefits of using the simulator. We emphasize our contributions in the following aspects:

– We describe the implementation details of building a simulator for vision-based autonomous lane keeping. Although many recent works exist on lane keeping algorithms, comparing and evaluating them is difficult. Built on real-world data, this simulator employs image projection, vehicle dynamics modeling, and vehicle trajectory tracking to predict vehicle movement and its corresponding camera views. The simulator can be used for both training and the evaluation of lane keeping algorithms.

– The end-to-end learning approach produces the proper steering angle from camera image data, with the aim of maintaining the self-driving vehicle in a lane. A highly effective end-to-end learning system is demonstrated using the aforementioned simulator for both training and evaluation. The CNN model trained with augmented data from the simulator performs significantly better than the model trained with recorded data only.

– We build a dataset for autonomous vehicle research. The dataset contains recorded video frames from three forward facing cameras (left, center, and right) as well as steering wheel angle and vehicle speed information.

1.3 Outline

This dissertation is organized as follows. Chapter 2 summarizes the background of autonomous vehicles, especially the computer vision and machine learning techniques related to this dissertation. Chapter 3 presents a GPU-based system for real-time traffic sign detection and recognition that can classify 48 traffic signs included in the library. Chapter 4 presents a method for the automatic detection of circular red traffic lights that integrates both image processing and support vector machine techniques. Chapter 5 presents a novel approach that combines computer vision and machine learning techniques for the accurate detection and classification of different types of traffic lights, including green and red lights in both circular and arrow forms. Chapter 6 presents a novel instrument for pedestrian detection by combining a thermal camera with a color stereo camera. Chapter 7 presents an end-to-end learning approach for obtaining the proper steering

angle to maintain the car in the lane. Chapter 8 presents the implementation of a simulator for the lane keeping system, using image projection, vehicle dynamics and vehicle trajectory tracking, which can be helpful for both training and evaluation. Chapter 9 draws the conclusions.

Chapter 2

Background

Carnegie Mellon University completed the first project involving autonomous vehicles in the US in 1995, which included autonomous driving from Pittsburgh, PA, to San Diego, CA. The vehicle was equipped with a computer, a camera, and a GPS. In 2004, the US Defense Advanced Research Projects Agency (DARPA) started a competition for autonomous vehicles, but none of the teams completed the 150-mile course. In 2005, five teams completed the DARPA challenge, and Stanford University's autonomous car, called Stanley, took first place. In 2007, the DARPA challenge involved a 60-mile course in an urban environment, and Carnegie Mellon University's autonomous car, called Boss, took first place. In 2016, Stanford University's autonomous car called Shelley ran on the track at a speed of nearly 120 mph. Nowadays, many vehicle manufacturers are developing their own autonomous vehicles, including Ford, Mercedes Benz, Volkswagen, Audi, and BMW. In addition, many IT companies have also joined this area, including Google, Uber, NVIDIA, and Tesla. For example, Google started a self-driving car project in 2009, which is now called Waymo. It claims that it drives more than 25,000 autonomous miles each week, mostly

on complex city streets. In other words, autonomous vehicles are being developed rapidly, in both their hardware and their software. This dissertation focuses on the computer vision and machine learning techniques used in this field, such as the detection and recognition of traffic signs, traffic lights and pedestrians, as well as lane keeping for self-driving cars. Many other topics not covered in this dissertation are also important, such as pixel-level segmentation, 3D reconstruction, and Simultaneous Localization and Mapping (SLAM).

2.1 Datasets

Machine learning techniques rely heavily on data. Datasets are often built using real-world data, with manually labeled ground truth. For example, the KITTI dataset [2–5] uses the autonomous driving platform Annieway to capture data from the real world. The sensors mounted on the car are cameras, a 360-degree Velodyne laser scanner and a GPS. The data are manually processed and are divided into several subsets, such as stereo, flow, object, tracking, and road. Furthermore, many datasets are built for specific tasks. For example, the Belgium Traffic Sign Dataset [6] and the German Traffic Sign Benchmark [7] are aimed at detecting and recognizing a group of European traffic signs in images. The Traffic Lights Recognition (TLR) public benchmarks [8] are for the detection of green or red circular traffic lights in images. The INRIA person dataset [9] and the Caltech Pedestrian Detection Benchmark [10] are for the detection of upright persons in images. The comma.ai dataset contains images captured from a forward facing camera, as well as vehicle status such as the speed, gear, and steering. It is used for end-to-end learning of the lane keeping functionality.

The datasets built from real-world data are extremely useful for researchers. However, collecting and labeling these data is tedious and time consuming, and the information obtained is limited to the types of sensors used. Therefore, real-world datasets often have limited amounts of data and focus on certain functionalities. On the other hand, some datasets are built using simulators or game engines, and they can provide much more information with little human effort. For example, a dataset generated from a computer game has been proposed for road scene segmentation [11]. The researchers claim that generating the annotation takes seven seconds per image on average, whereas a human annotator takes 90 minutes per image. In such datasets, the rich information about the 3D scene and object movements is helpful to researchers, and these data can be generated easily. However, whether the models trained on virtual data can be applied in the real world is questionable, as the images from game engines and the real world have inherent differences. Nevertheless, these virtual datasets provide solid alternatives for researchers to try out their new algorithms. An increasing number of datasets are becoming available as researchers keep collecting data and building their own datasets. Using the existing datasets reduces the time and effort needed to verify an algorithm, as collecting and labeling data are very time consuming. It also makes it easier to compare one's work with the existing work of other researchers who use the same dataset [12, 13], because works done on different datasets cannot be compared directly. However, sometimes researchers must collect their own data, if the existing datasets are not ideal or are not available. In addition, the newly built datasets can benefit other researchers.

2.2 Object detection and recognition

Object detection and recognition are important aspects of autonomous vehicles. This dissertation focuses on the detection and recognition of traffic signs, traffic lights and pedestrians. In addition, many other objects not covered in this dissertation can also be detected and recognized to guide drivers or autonomous driving systems, such as vehicles, road markings, and traffic cones.

2.2.1 Traffic sign

Several existing works focused on detecting and recognizing a particular class of traffic signs, such as stop signs or speed limit signs [14, 15]. These designs were optimized and can be highly efficient for detecting and recognizing a specific class of signs, but they are hardly useful for other types of signs. Other research papers attempted to detect and recognize multiple signs using common features such as shapes and colors [6, 16, 17]. Advanced image processing algorithms were proposed and analyzed thoroughly in order to obtain accurate results. However, these previous works primarily focused on the algorithms, and computing time was less of a concern, which prevents those designs from becoming practically useful. Some other works investigate the trade-off between accuracy and computing time [18–20]. Many of them claimed to achieve real-time performance at a high accuracy, but the datasets that they used varied. Without using the same dataset, it is unfair to compare the accuracy of different designs. It is also worth mentioning that the image resolution is another important factor that can affect the processing time as well as accuracy. A higher resolution image can reveal small objects in it. As a result, traffic signs can be detected and recognized even when they are far away,

thus leaving more time for drivers to respond.

2.2.2 Traffic light

Spot light detection [21, 22] is a method based on the fact that a traffic light is much brighter than the lamp holder, which is usually black. A morphological top-hat operator was used to extract the bright areas from gray-scale images, followed by a number of filtering and validating steps. In [23], an interactive multiple-model filter was used in conjunction with the spot light detection. More information was used to improve its performance, such as status switching probability, estimated position and size. The fast radial symmetry transform is a fast variation of the circular Hough transform, which can be used to detect circular traffic lights as demonstrated in [24]. Several other methods also combined the vehicle GPS information. A geometry-based filtering method was proposed to detect traffic lights using mobile devices at low computational cost [25]. The GPS coordinates of all traffic lights were presumably available, and a camera projection model was used. Mapping traffic light locations was introduced in [26] by using tracking, back-projection and triangulation. Google also presented a mapping and detection method in [27] which was capable of recognizing different types of traffic lights. It predicted when traffic lights should become visible with the help of GPS data, followed by classifying possible candidates. Geometric constraints and temporal filtering were then applied during the detection. The inter-frame information is also helpful for detecting traffic lights. A method that used a Hidden Markov Model to improve the accuracy and stability of the results was demonstrated in [28]. The state transition probability of traffic lights was considered, and information from several previous frames was used. Reference [29] introduced a traffic light detector based on template matching. The assumption was that the two

off lamps in the traffic light holder are similar to each other and neither of them looks similar to the surrounding background. Deep learning [30, 31] is a class of machine learning algorithms that uses many layers to extract hidden features. Unlike hand-crafted features such as Histograms of Oriented Gradients (HOG) features [9], it learns features from training data. PCANet is a simple, yet effective deep learning network proposed in [32]. Principal Component Analysis (PCA) is employed to learn the filter banks. It can be used to extract features of faces, handwritten digits and object images. It has been tested on several datasets and delivers surprisingly good results [32]. Using PCANet in traffic light detection or other similar applications has not been researched thus far. Integration of detection and tracking has been used in a few works related to autonomous vehicles. The trajectory of the traffic light was used to validate the theoretical result in [23]. A Kalman filter was employed to predict the traffic sign positions, and it was claimed that the tracking algorithm was able to improve the overall system reliability [33, 34]. Utilizing accumulated classifier decisions from a tracked speed limit sign, a majority voting scheme was proven to be very robust against accidental mis-classifications [14].

2.2.3 Pedestrian

The Caltech Pedestrian Detection Benchmark [10] has been widely used by researchers. It contains frames from a single vision camera with pedestrians annotated. Based on the CVPR2015 snapshot of the results on the Caltech-USA pedestrian benchmark, it was stated in [35] that at ~95% recall, the state-of-the-art detectors made ten times more errors than the human-eye baseline, which is still a huge gap that calls for research attention. Figure 2.1(a) shows some top quality detection methods presented in [36]. Overall, the detector performance has been improved as new

methods were introduced in recent years. Traditional methods such as Viola–Jones (VJ) [37] and Histogram of Oriented Gradients (HOG) [9] were often included as the baseline. A total of 44 methods were listed in [38] for the Caltech-USA dataset, and 30 of them made use of HOG or HOG-like features. Channel features [39] and Convolutional Neural Networks [40–42] also achieved impressive performance on pedestrian detection. The Convolutional Channel Features (CCF) [43], which combine a boosting forest model and low-level features from a CNN, are one of the top performers listed in the Caltech Pedestrian Detection Benchmark, as shown in Figure 2.1(b). Despite the progressive improvement of detection results on the datasets, color cameras still have many limitations. For instance, color cameras are sensitive to the lighting conditions. Most of these detection methods may fail if the image quality is impaired under poor lighting conditions. Thermal cameras can be employed to overcome some limitations of color cameras, because they are not affected by lighting conditions. Several research works using thermal data for pedestrian detection and tracking were summarized in [44]. Background subtraction was applied in [45] for people detection, since the camera was static. HOG features and Support Vector Machine (SVM) were employed for classification [46]. A two-layered representation was described in [47], where the still background layer and the moving foreground layer were separated. The shape cue and appearance cue were used to detect and locate pedestrians. In [48], a window based screening procedure was proposed for potential candidate selection. The Contour Saliency Map (CSM) was used to represent the edges of a pedestrian, followed by AdaBoost classification with adaptive filters. Assuming the region occupied by a pedestrian has a hot spot, candidates were selected based on thermal intensity value [49] and then classified by an SVM. In addition, both Kalman filter prediction and tracking were

(a) Benchmark results of different methods as reported in [36].

(b) Benchmark results of different methods as of May 2016.

Figure 2.1: Performance results from the Caltech Pedestrian Detection Benchmark.

incorporated for further improvement. A new contrast invariant descriptor [50] was introduced for far infrared images, which outperformed HOG features by 7% at 10⁻⁴ FPPW for people detection. The Shape Context Descriptor (SCD) was also used for pedestrian detection in [51], followed by an AdaBoost classifier. The HOG features were considered not suitable for this task because of the small size of the target, variations of pixel intensities and lack of texture information. Probabilistic models for pedestrian detection in far infrared images were presented in [52]. The method in [53] found the head regions at the initial stage, then confirmed the detection of a pedestrian by the histograms of Sobel edges in the region. For ADAS applications, several pedestrian detection research works were summarized in [54], including the use of color cameras and thermal cameras, as well as sensor fusion such as radar and stereo vision cameras. A benchmark for multispectral pedestrian detection was presented in [55] and several methods were analyzed. However, the color-thermal pairs were manually annotated and it is unclear if any automatic point registration algorithms were used. The combination of stereo vision cameras and a thermal camera was used in [56]. The trifocal tensor was used to align the thermal image with the color and disparity images. Candidates were selected based on disparity, and HOG features were extracted from the color, thermal and disparity images. Concatenated HOG features were then fed to a radial basis function (RBF) SVM classifier to obtain the final decision. Furthermore, more sophisticated applications or systems can be built upon pedestrian detection, such as pedestrian tracking across multiple driving recorders [57] and crowd movement analysis [58].

2.3 Lane keeping

Maintaining the vehicle within the lane is important for driving safety. The lane keeping assist system (LKAS) has been studied by many researchers. Lane keeping assist systems [59–62] are able to provide torque to maintain the vehicle within the lane, and often alert the driver with warning messages or sound. Cameras are usually used in such systems, and lane markings must be recognized. In addition, the systems also distinguish intended and unintended lane departure by utilizing more information such as blinker state, braking or steering angle. The LKAS needs to be accurate and robust for autonomous cars. Although industrial companies have achieved a lot in this area, they seldom publicize their technologies. It is therefore necessary for researchers to study the theories, algorithms and implementations of the LKAS. Deep reinforcement learning [63] was used in several research works on autonomous driving [64–66]. The systems learned the optimal policy function given the feedback of the reward. These systems went beyond the basic lane keeping feature, and were able to direct the vehicle to stay on a path and avoid collisions. The vehicle did not necessarily have to stay in a lane, and other vehicles on the road were often involved. The learning and evaluation were often done in a virtual simulator, because the learning requires rich ground truth information and needs to interact with the environment. Inverse reinforcement learning [67], on the other hand, was used to estimate the reward from expert demonstrations. For real-world systems, sensors and algorithms are employed to interpret the surrounding environment, without the rich ground truth information available in a simulator. The vision-based approaches use cameras because they are cost effective. An early research work demonstrated an autonomous vehicle, ALVINN [68], using a neural

network to find the proper direction. The input data came from a camera and a laser range finder, and the input resolution was very small. For large resolution color images, an end-to-end learning approach using a convolutional neural network was demonstrated in [69]. That system was designed for off-road mobile robots, not for autonomous vehicles on the road. An end-to-end learning approach using a convolutional neural network for self-driving cars was demonstrated in [70], and the network was trained and evaluated with the help of a simulator. The idea of building the simulator using image projection and vehicle dynamics was described, but there were few technical details. The network was later named PilotNet, and its effectiveness was validated and visualized in [71, 72]. Our previous work [73] followed this approach using a different dataset and network, and demonstrated the necessity of building the simulator in both the training and evaluation stages. Building the simulator requires knowledge of computer vision, vehicle dynamics and vehicle trajectory tracking. Most autonomous vehicle driving frameworks present a consistent decoupling between low-level control and path planning, while constraining the dynamics of the system to satisfy the vehicle's motion. Typically, the nominal path is obtained by optimization-based methods [74], sampling-based approaches [75] and notable searching algorithms [76]. In terms of the system dynamics and control, Rami et al. [77] proposed linear system dynamics and control for high-speed drifting. Galceran et al. [78] adopted a proportional-derivative (PD) feedback controller for torque-based steering. Approximating the non-linearity of the vehicle dynamics, DeSantis et al. [79] applied Jacobian linearization to the vehicle dynamics to design a path-tracking controller, but this approximation ignores the higher-order polynomial terms of the system dynamics, which can lead to problems in controlling a vehicle when the error is large.

Chapter 3

A GPU-Based Real-Time Traffic Sign Detection and Recognition System

This chapter presents a GPU-based system for real-time traffic sign detection and recognition which can classify 48 different traffic signs included in the library. The proposed design implementation has three stages: pre-processing, feature extraction and classification. For high-speed processing, we propose a window-based histogram of oriented gradients algorithm that is highly optimized for parallel processing on a GPU. For detecting signs of various sizes, the processing is applied at 32 scale levels. For more accurate recognition, multiple levels of support vector machines are employed to classify the traffic signs. The proposed system can process 27.9 frames per second of video with active pixels of 1,628 × 1,236 resolution. Evaluated on the BelgiumTS dataset, the experimental results show that the detection rate is about 91.69% with false positives per window of 3.39 × 10⁻⁵ and the recognition rate is about 93.77%.

3.1 Introduction

Traffic sign detection and recognition are important functions in an Advanced Driver Assistance System (ADAS). The detection process determines the existence of traffic signs in an image and their locations. Accurately detecting the signs also improves the recognition rate by filtering out redundant information while retaining the useful information in an image. Recognition identifies the signs from the detection result. In the real world, knowing the content of the sign is much more important than simply knowing the existence of a sign. Many existing works have been carried out to improve the accuracy of detection and recognition. In practice, processing time and hardware efficiency also need to be considered. A traffic sign detection and recognition system often contains three stages: pre-processing, detection and recognition. The pre-processing stage is optional, but it is usually included in a real-time system. It identifies and selects the regions of interest in the original image frame, which often contains a large number of pixels. Effectively, it reduces the computational tasks and improves the efficiency of the subsequent stages. The second stage detects and locates traffic signs in the selected regions produced by the pre-processing stage. In some systems, the detection stage also identifies the categories of the signs based on shapes, such as round, rectangle, triangle, etc. These categories are called super-classes. The final stage recognizes the detected signs and sends the processing results (i.e., the types of signs and their locations) to the display and control units of an ADAS system. Typically, feature extraction and pattern classification algorithms are computationally intensive. Much research has been done to optimize the algorithms themselves to improve the accuracy, but very little research has been focused on the implementation

to improve the efficiency. In this chapter, we propose to utilize the many-core architecture of a GPU to accelerate the traffic sign detection and recognition algorithms through massive parallel processing. The objective is to reduce the computing time considerably such that the GPU implementation can detect and recognize traffic signs in real time.

3.2 Traffic Sign Detection and Recognition System

3.2.1 System Overview

The proposed system contains three main stages: pre-processing, detection and recognition, as shown in Fig. 3.1. At first, we perform red and blue color extractions respectively and select the regions of interest (ROI). Next, Histograms of Oriented Gradients (HOG) [80] features are extracted on the grayscale image and a sliding window searches the image exhaustively to find the candidates using a linear Support Vector Machine (SVM). Color-based HOG detectors are then applied to these candidates to eliminate false positives, followed by a rectangle grouping operation to locate the detected traffic signs. Finally, the detected signs are delivered to a cascade classifier which contains several linear SVMs. The recognized traffic sign is highlighted with a green rectangle on the image. Furthermore, a standard image of the identified class of traffic sign, scaled to the same size, is placed next to the rectangle, which is used to indicate the actual position and class of the sign. For the proposed system, the BelgiumTS dataset is employed for both training and testing. Our system is able to detect and recognize 48 classes of traffic signs selected from the BelgiumTS dataset [81], as shown in Fig. 3.2. These signs have an aspect ratio of 1:1 with red or blue colors on them.

Figure 3.1: Three stages in our proposed system.

Although HOG and SVM have been commonly used for detecting and recognizing objects, it is still challenging to find a good balance between accuracy and efficiency. In order to reduce the computing latency, we employ linear kernel SVMs in our implementation. In order to obtain better accuracy, we use multiple HOG features and SVMs in our system, as shown in Fig. 3.1.
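
As a rough illustration of this three-stage flow, the following C++ outline sketches how the stages could be wired together. The function names and types are placeholders rather than our actual implementation, and each stage is only stubbed out; the real steps are described in Sections 3.2.2 to 3.2.4.

    // Schematic outline of the three-stage pipeline; names are placeholders.
    #include <opencv2/opencv.hpp>
    #include <vector>

    struct Detection { cv::Rect box; int signClass; };

    // Stage 1 (Section 3.2.2): red/blue color extraction -> candidate ROIs.
    static std::vector<cv::Rect> selectColorROIs(const cv::Mat&) { return {}; }
    // Stage 2 (Section 3.2.3): sliding-window HOG + linear SVMs + grouping.
    static std::vector<cv::Rect> detectSigns(const cv::Mat&,
                                             const std::vector<cv::Rect>&) { return {}; }
    // Stage 3 (Section 3.2.4): cascade of SVMs picking one of the 48 classes.
    static std::vector<Detection> recognizeSigns(const cv::Mat&,
                                                 const std::vector<cv::Rect>&) { return {}; }

    std::vector<Detection> processFrame(const cv::Mat& frame)
    {
        std::vector<cv::Rect> rois  = selectColorROIs(frame);
        std::vector<cv::Rect> signs = detectSigns(frame, rois);
        return recognizeSigns(frame, signs);
    }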

3.2.2 Pre-processing

Color and shape information are commonly used as features of traffic signs. Although road images often contain objects whose color and shape information is similar to that of traffic signs, it is still a simple yet effective way to use such information to identify the ROI. We perform color extraction using an adaptive threshold method proposed in [82]. By using red color enhancement, we obtain an image whose pixel

Figure 3.2: 48 classes of traffic signs can be detected and recognized in our system.

value fR is computed as

fR = max(0, min(xR − xG, xR − xB) / s) (3.1)

s = xR + xG + xB (3.2)

where xR, xG and xB are the pixel values of the red, green and blue channels, respectively. The global threshold is then set to µ + 4 · σ, where µ is the mean and σ is the standard deviation of the red values of the original image pixels. Applying this threshold to

the image results in a binary image IR which is used in the following processing steps. We also perform blue color enhancement and thresholding using the same method and obtain a blue-color-enhanced binary image IB. Fig. 3.3 shows an example of blue color enhancement.

Figure 3.3: An example of color enhancement.

Next, we find contours in the binary images using the algorithm in [83] and then place a bounding box around the contours of each object. Small rectangles whose width or height is less than 32 pixels are ignored to minimize the interference of small objects and color fragments in the image. Bounding boxes that have similar sizes and locations are combined to avoid overlapping. Fig. 3.4 shows an ROI selected from the original image after pre-processing.
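
For concreteness, the following C++/OpenCV sketch shows one way the red-channel pre-processing described above could be written (the blue channel is analogous). It is a minimal illustration rather than our actual code: the statistics for the µ + 4σ threshold are taken over the enhanced red image here, whereas the text's "red values" could also be read as the raw R channel, and the final merging of overlapping boxes is omitted.

    // Sketch of the pre-processing stage for the red channel (Eqs. 3.1-3.2,
    // adaptive threshold, contour bounding boxes). Not the original code.
    #include <opencv2/opencv.hpp>
    #include <vector>

    std::vector<cv::Rect> redColorROIs(const cv::Mat& bgr)
    {
        std::vector<cv::Mat> ch;
        cv::split(bgr, ch);                              // ch[0]=B, ch[1]=G, ch[2]=R
        cv::Mat b, g, r;
        ch[0].convertTo(b, CV_32F);
        ch[1].convertTo(g, CV_32F);
        ch[2].convertTo(r, CV_32F);

        cv::Mat s = r + g + b;                           // Eq. (3.2)
        cv::Mat rg = r - g, rb = r - b, num, fR;
        cv::Mat denom = s + 1e-6f;                       // small epsilon avoids /0
        cv::min(rg, rb, num);                            // min(xR - xG, xR - xB)
        cv::divide(num, denom, fR);                      // Eq. (3.1)
        fR = cv::max(fR, 0.0);

        cv::Scalar mu, sigma;                            // global threshold mu + 4*sigma
        cv::meanStdDev(fR, mu, sigma);
        cv::Mat IR = fR > (mu[0] + 4.0 * sigma[0]);      // binary image I_R (CV_8U)

        std::vector<std::vector<cv::Point>> contours;    // contours -> bounding boxes
        cv::findContours(IR, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

        std::vector<cv::Rect> rois;
        for (const auto& c : contours) {
            cv::Rect box = cv::boundingRect(c);
            if (box.width >= 32 && box.height >= 32)     // drop small fragments
                rois.push_back(box);
        }
        return rois;                                     // overlapping boxes still need merging
    }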

3.2.3 Traffic Sign Detection

In many cases, the selected ROI from pre-processing stage often contains no traffic sign. In order to provide valid inputs to the classification stage, at first traffic signs must be detected accurately. False positives need to be eliminated as much as possible. Applying the HOG method, we compute the HOG features on ROI at different scales and then use a sliding window to search the entire ROI to find traffic signs.

26 Figure 3.4: Selecting ROI from the original image.

The HOG features can be computed from an RGB image or a grayscale image. For an RGB image, horizontal and vertical gradients are computed in the three channels for red, green and blue respectively, and only those that have the maximal magnitude compared with the other two channels are selected for HOG processing. Thus its computational workload is three times that of a grayscale image. We first convert the original RGB image to a grayscale image IGRAY and use it to compute the HOG features that are fed to a linear SVM to determine if there are traffic signs in the image. Although most of the existing work also applied HOG to a grayscale image for traffic sign detection, this approach had a very high false positive rate. In order to reduce the false positive rate, our system also extracts the HOG features from the red image IR and the blue image IB, but only on the frames where the detection is reported positive on IGRAY. Two more SVMs are trained for the red and blue images respectively to eliminate some false positive frames. In addition, these SVMs also classify the detected traffic signs into

Table 3.1: HOG parameters in our system

Parameter          Value
Window Size        32 by 32 pixels
Block Size         8 by 8 pixels
Cell Size          8 by 8 pixels
Window Stride      8 by 8 pixels
Block Stride       8 by 8 pixels
Scaling factor     1.1
Levels             32

several super-classes, such as red circle, red triangle, blue circle, etc. The HOG parameters in our system are shown in Table 3.1. The window size is fixed, but the size of the traffic sign in an image is unknown. Thus, the original image has to be scaled to many different levels, and HOG feature extraction and classification are then performed at each level. The size of the image at each level, Sl, is computed as

Sl = S0/fl (3.3)

where S0 is the original image size, l is the level number and fl is the level scaling factor defined as

fl = 1.1^(l−1) (3.4)

In our design, 32 scaling levels are applied. Thus our system is able to detect traffic signs sized from 32 by 32 pixels up to 614 by 614 pixels. As shown in the central figure of Fig. 3.5, the same traffic sign is detected by multiple windows at different positions and also at different scale levels. To avoid overlapping, we perform a grouping operation that combines these detected traffic signs at the same location into a single box as shown in Fig. 3.5.

Figure 3.5: Grouping detected windows.

3.2.4 Traffic Sign Recognition

The final step of our design is traffic sign recognition. The SVM method is applied to classify the detected traffic signs into the 48 classes listed in Fig. 3.2. Each of the final detected windows is first classified by the SVMs mentioned in 3.2.3 to determine and confirm its category. Once its category is determined, it is classified by a multi-class SVM within that category. The SVMs are trained using k-fold cross-validation to improve the accuracy. It is also worth mentioning that we use the BelgiumTSC dataset to train the SVMs that classify the different classes of traffic signs in each category.

3.3 Parallelism on GPU

Since the pre-processing and HOG algorithms are complex and require extensive computations, in this section we describe the GPU-based acceleration. Pre-processing is a typical point operation, which is well suited for GPU implementation. The HOG computation is more complicated, and we develop several special techniques to handle it. There is a GPU version of HOG in the OpenCV library that accelerates the computation significantly compared to the CPU version. However, we find that there is still room to improve its efficiency. As mentioned in 3.2.3, the HOG features need to be computed at many different scaling levels of the original image, and the gaps between levels can be reduced or eliminated. Once the input data of each level is prepared, there is no data dependency between different levels during the HOG computation. In the OpenCV implementation, each level stalls until the computation of the previous level is done to ensure data synchronization between kernels, as shown in Fig. 3.6. Such stalls are unnecessary and can be avoided by using CUDA streams. As illustrated in Fig. 3.7, kernels can run in multiple CUDA streams at the same time and can be synchronized within a stream without affecting the others. By using CUDA streams, we reduce the gaps between levels significantly and thus improve the efficiency of the HOG computation.
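The idea is sketched below in CUDA C++: the per-level kernels are distributed over a small pool of streams so that levels no longer wait on one another. The kernel itself and the device buffers are placeholders; only the stream handling reflects the scheme described above.

#include <cuda_runtime.h>

// Placeholder kernel: computes HOG block histograms for one pyramid level.
__global__ void computeBlockHistograms(const unsigned char* image,
                                       float* histograms,
                                       int width, int height) {
    // Per-block histogram computation omitted in this sketch.
}

// Launch the per-level kernels on separate CUDA streams so that levels
// overlap instead of stalling on one another.
void launchAllLevels(unsigned char** d_levels, float** d_histograms,
                     const int* widths, const int* heights, int numLevels) {
    const int kNumStreams = 4;
    cudaStream_t streams[kNumStreams];
    for (int i = 0; i < kNumStreams; ++i)
        cudaStreamCreate(&streams[i]);

    dim3 block(16, 16);
    for (int l = 0; l < numLevels; ++l) {
        dim3 grid((widths[l] + 15) / 16, (heights[l] + 15) / 16);
        // Levels are distributed round-robin across the streams; kernels in
        // different streams may execute concurrently.
        computeBlockHistograms<<<grid, block, 0, streams[l % kNumStreams]>>>(
            d_levels[l], d_histograms[l], widths[l], heights[l]);
    }
    for (int i = 0; i < kNumStreams; ++i) {
        cudaStreamSynchronize(streams[i]);  // synchronize only once, at the end
        cudaStreamDestroy(streams[i]);
    }
}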

Figure 3.6: Normal CUDA kernel launches.

Figure 3.7: CUDA kernel launches using CUDA streams.

For better performance, the GPU version of HOG in OpenCV is highly optimized for data re-use. The image is divided into many blocks, and the block histograms are computed only once, even though a block can belong to multiple windows. When extracting the HOG feature of a window, we need to look up the already computed block histograms and line them up. However, after we adjust the detection windows in our system, the locations and sizes of those windows change and their HOG features need to be recomputed. Moreover, those windows can be anywhere in the image, so it is impossible to reuse the block histograms. Computing the HOG features of those windows is inefficient even with the previous GPU design, because there are gaps between windows and the windows cannot be massively parallelized. In order to solve this problem, we propose a window-based HOG solution on the GPU. All windows are extracted and stacked together to form an image whose width is the window width and whose height is the window height multiplied by the number of windows. The newly constructed image is then sent to the GPU for block histogram computation. As a result, the HOG computation for multiple windows runs in parallel on GPU threads. Furthermore, we optimize this method by filtering out blocks that cross two windows, since these blocks are not useful. With our parameter settings, there are 9 blocks in a window and 3 blocks that cross two windows. By filtering out these cross-window blocks, the total computation is reduced by 25%.
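A minimal C++/OpenCV sketch of the window-stacking step is shown below, assuming 32 × 32 windows that lie inside the image; the function name is ours, and the blocks of the stacked image that straddle two window bands would simply be discarded after the GPU pass.

#include <opencv2/imgproc.hpp>
#include <vector>

// Stack all adjusted detection windows into a single tall image (one window
// per 32-pixel row band) so that their block histograms can be computed in
// one GPU pass instead of one launch per window.
cv::Mat stackWindows(const cv::Mat& gray, const std::vector<cv::Rect>& windows)
{
    const cv::Size winSize(32, 32);
    cv::Mat stacked(winSize.height * static_cast<int>(windows.size()),
                    winSize.width, gray.type());
    for (size_t i = 0; i < windows.size(); ++i) {
        cv::Mat patch;
        // Windows may have arbitrary sizes after grouping; resize to 32x32.
        cv::resize(gray(windows[i]), patch, winSize);
        patch.copyTo(stacked.rowRange(static_cast<int>(i) * winSize.height,
                                      static_cast<int>(i + 1) * winSize.height));
    }
    return stacked;
}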

3.4 Experimental Results

The proposed traffic sign detection and recognition algorithms are evaluated on a Tesla K20 GPU platform. The pre-processing stage on the GPU takes about 13 to 17 ms. The detection and recognition stages account for most of the processing time. At first, we compare the HOG computing time on the CPU and GPU at each scaling level. As shown in Fig. 3.8, the speedup from GPU acceleration is significant when the scaling level is small. The original size of the test image is 1,628 by 1,236 pixels, and the parameter settings are as listed in Table 3.1. The OpenCV library is employed for comparing the HOG computing time on CPU and GPU.

Figure 3.8: HOG computing time on CPU and GPU.

Secondly, we test our optimized GPU implementation using 2000 images from the BelgiumTS dataset. Each image is 1,628 by 1,236 pixels. The total execution time of all three stages is compared between the original OpenCV HOG GPU version and our optimized version. Initialization time, such as reading images and SVMs, is ignored. Post-processing time, such as recording and displaying results, is also ignored. Fig. 3.9 shows the total execution time for each frame when using the OpenCV GPU code for HOG computation. Fig. 3.10 shows the execution time of our optimized GPU code. We can see that the overall computing time is reduced and some peaks are suppressed. The average frame rate of the OpenCV version on the GPU is 21.3 fps. Our optimized GPU code achieves an average frame rate of 27.9 fps, which is about 31% faster than the OpenCV version.

Figure 3.9: The total processing time when HOG is computed using OpenCV on GPU.

Finally, we evaluate the detection rate and classification rate of the proposed system using the BelgiumTS dataset [6]. Each test image is 1,628 by 1,236 pixels. We test 1918 images and the detection rate is 91.69%. We also measure the false positive rate using background images provided by the BelgiumTS dataset.

Figure 3.10: The total processing time when using our optimized GPU code.

Based on our HOG parameters described in Table 3.1, we extract over 20 million windows from those images at different scaling levels. The number of false positives is 684. Thus the False Positives Per Window (FPPW) is 3.39 × 10^-5. Similarly, we use the BelgiumTSC dataset to evaluate the classification rate. Each image in the BelgiumTSC dataset contains one traffic sign with some background. We resize each image to our window size of 32 by 32 pixels before computing HOG features and performing SVM classification. We use 4,492 images for training and 2,520 images for testing. All training and test images are from the BelgiumTSC dataset, and the classification rate is 93.77%.

3.5 Conclusions

This chapter presents a real-time traffic sign detection and recognition system on the GPU. It is capable of detecting and recognizing 48 classes of traffic signs of various sizes in each image frame. The detection rate is about 91.69% and the recognition rate is about 93.77%. The system can process 27.9 fps video with active pixels of 1,628 × 1,236 resolution. Since each frame is processed individually, no information from previous frames is required. As part of our future work, information from previous frames will be considered for tracking traffic signs, which is expected to further improve the detection accuracy.

Chapter 4

Automatic Detection of Traffic Lights Using Support Vector Machine

Many traffic accidents at intersections are caused by drivers who miss or ignore the traffic signals. In this chapter, we present a new method for automatic detection of traffic lights that integrates both image processing and support vector machine techniques. An experimental dataset with 21299 samples is built from original videos captured while driving on the streets. When compared to traditional object detection and existing methods, the proposed system provides significantly better performance, with 96.97% precision and 99.43% recall. The system framework is extensible, so users can introduce additional parameters to further improve the detection performance.

4.1 Introduction

Automatic detection of traffic lights should be an essential feature of advanced driver assistance systems and future self-driving vehicles. It is an important road safety issue today that many traffic accidents at intersections are caused by drivers running red lights. Recent data from the Insurance Institute for Highway Safety (IIHS) show that in 2012, on US roads, red-light-running crashes caused about 133,000 injuries and 683 deaths [1]. The introduction of automatic traffic light detection, especially red light detection, therefore has important social and economic impacts. Because road images often contain a complex background and many objects, it is a challenge to develop an algorithm that detects traffic lights precisely. Most of the existing algorithms are based on color, shape and gradient information, but their detections are not very reliable. Since the traffic lights themselves do not have sufficient features, traditional feature-based object detection algorithms also do not work well. In this chapter, we propose a new method that combines computer vision and machine learning techniques in conjunction with inter-frame information. While driving on the road, data have been collected by recording video with a camera mounted behind the front windshield. The data sets are then labeled for training and evaluation of the proposed algorithm. Our experimental results suggest the proposed method is highly effective for detecting red traffic lights. The rest of the chapter is organized as follows. In Section 4.2, we propose an improved method that combines computer vision and machine learning techniques for traffic light detection. Data collection and performance evaluation are presented in Section 4.3, followed by conclusions in Section 4.4.

4.2 Proposed Method for Traffic Light Detection

4.2.1 Locating candidates based on color extraction

In this chapter, we focus on the detection of red circular traffic lights only. Green or yellow lights can be detected by applying similar techniques. At first, we apply color extraction to locate the candidate traffic lights. The images are converted to the hue, saturation, and value (HSV) color space, and the red color is extracted based on the hue values. A flood-filling method is applied for region labeling and blob extraction, and the resulting blobs are considered potential candidates. In many previous works, a variety of morphological filtering techniques were applied to eliminate some candidates for the purpose of reducing false positives. However, any filtering has a possibility of missing the true traffic lights, because the traffic lights are not always clear due to their size in the image and the obscure background. Thus we simply perform an aspect ratio check and keep all blobs that pass the check as candidates for potential traffic lights. The objective of eliminating false positives is addressed in a later part of the proposed method.

4.2.2 Traffic light detection using template matching

Once the candidates are located, we apply a template matching method to detect the traffic lights [29]. Here we consider the traditional and most popular design of a traffic light, in which the red, yellow and green lights are round and vertically positioned in that order. For horizontally positioned traffic lights, the same method can be applied with a few modifications. Typically, only one of the three colored lights is turned on at a time. In the previous step, we located potential candidates of the red lights in the image. When the red light is on, the yellow and green lights are off, and these two off lights look very similar. So we use the yellow light area ROI_ref as the template, shown as the yellow rectangular area in Fig. 4.1. Similarly, the green light area is highlighted as the green rectangular area. We can perform template matching in the green rectangular area with ROI_ref. In fact, we purposely make the green rectangular area slightly larger than the yellow one, which provides more accurate results for template matching. The minimal value among the template matching results is recorded as R_in and the corresponding area is recorded as ROI_in. For the three vertical traffic lights, the assumption is that the two off lights are almost identical and that there should not be any similar objects in the neighboring area. The background areas around the traffic light bounding box are highlighted as blue rectangular areas in Fig. 4.1. Using the same reference ROI_ref as the template, we perform template matching in the blue rectangular areas. The smallest value of the template matching results is R_out and its corresponding area is ROI_out. Since the yellow and green lights are both off, they appear almost identical and the R_in value is small. In contrast, the R_out value is often very large. We can set a threshold value p: if the ratio R_in/R_out < p, a traffic light is detected, and otherwise not. This template matching method does not require high resolution images. It works well even if the candidates are small because the traffic lights are a long way away. In addition, it is effective at eliminating some false positives. As an improvement to the detection method, additional constraints were considered in [29], such as requiring the mean and variance of the pixel values at the positions of the two off lights to be smaller than a certain threshold, because those regions should be dark.

Figure 4.1: Applying the traffic light detector on a candidate.
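The sketch below shows one way the R_in and R_out scores could be computed with cv::matchTemplate; the TM_SQDIFF_NORMED measure is chosen so that smaller values indicate better matches, and the struct and rectangle arguments are illustrative rather than taken from our implementation.

#include <opencv2/imgproc.hpp>
#include <algorithm>
#include <vector>

// Compute R_in (best match of the off-light template inside the expected
// green-light area) and R_out (best match in the background strips around
// the candidate). With TM_SQDIFF_NORMED, smaller values mean better matches.
struct MatchScores { double rIn; double rOut; };

MatchScores scoreCandidate(const cv::Mat& frame,
                           const cv::Rect& roiRef,     // yellow-light area
                           const cv::Rect& roiGreen,   // enlarged green area
                           const std::vector<cv::Rect>& backgroundAreas)
{
    cv::Mat tmpl = frame(roiRef);
    MatchScores s{1.0, 1.0};

    cv::Mat result;
    cv::matchTemplate(frame(roiGreen), tmpl, result, cv::TM_SQDIFF_NORMED);
    cv::minMaxLoc(result, &s.rIn, nullptr, nullptr, nullptr);  // minimum = best

    double best = 1.0;
    for (const cv::Rect& bg : backgroundAreas) {
        cv::matchTemplate(frame(bg), tmpl, result, cv::TM_SQDIFF_NORMED);
        double m;
        cv::minMaxLoc(result, &m, nullptr, nullptr, nullptr);
        best = std::min(best, m);
    }
    s.rOut = best;
    return s;                // a light is reported if s.rIn / s.rOut < p
}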

However, the assumption that R_out is much larger than R_in is not always true. For example, when the traffic lights are small or unclear in the image, the off lights are seen as dark regions. If the background is also dark, such as trees or buildings, the template matching score for the background, R_out, can also be very small. Then R_in/R_out is likely to be above the given threshold p, and as a result the true traffic light is missed. Additional constraints on the mean and variance of the pixel values do not solve this problem either. In addition, it is difficult to choose a universal value for the threshold p. Thus, we propose an improved method that is integrated with machine learning algorithms.

4.2.3 An improved method using SVM

Due to the varied backgrounds and object sizes in the image, it is difficult to manually set a threshold for the R_in/R_out ratio obtained from template matching. So we propose to build a support vector machine (SVM) that can automatically find the optimal settings for the parameters (or features) extracted from the image through machine learning. This requires a large dataset of both positive and negative samples for training

the SVM. For each candidate, we use the R_in and R_out values in conjunction with the pixel

values m_ref, m_in and m_out to form a vector, where m_ref, m_in and m_out are the mean pixel

values of the areas ROI_ref, ROI_in and ROI_out, respectively.

Each vector becomes a sample S_1 for the SVM.

S_1 = {R_in, R_out, m_ref, m_in, m_out}    (4.1)

The SVM is able to automatically adjust its parameters through the training process. As demonstrated in Section 4.3, using the SVM to find the parameters yields a huge leap in detection accuracy compared to manually setting the threshold p. However, we discover that the bounding box of a candidate by itself is not sufficient to determine whether it is a traffic light or not. If candidates are cut out from the original image, sometimes even a human can hardly decide. Fig. 4.2 shows some examples of candidates extracted from road images. The candidates in the first row have a dark background and those in the second row have a bright background. As we can see, it is difficult to identify a traffic light when the background is dark, while it is easier to spot a traffic light against a bright background. Fig. 4.3 gives an example with both scenarios: the left traffic light has a bright background while the right one has a dark background. We also find that the brake lights of black vehicles, which are usually red, are a major contributor to false positives.

Figure 4.2: Are they traffic lights or not? Dark background on the top and bright background at the bottom.

In order to improve the detection performance, we propose to add the location information of the candidate bounding box as additional input to the SVM.

Figure 4.3: The left traffic light has a bright background and the right traffic light has a dark background.

The idea is that the size and ratio of a traffic light, as well as its location, should be consistent among all training samples. For instance, a traffic light cannot be located as low as the vehicle brake lights shown in Fig. 4.2. Each bounding box B has four parameters,

B = {x, y, w, h}    (4.2)

where (x, y) are the coordinates of the upper-left corner (or origin) of the bounding box, and w and h are its width and height, respectively. Intuitively, it is impossible for traffic lights to appear down on the road surface, therefore y should be within a certain range, and so should x. There are implicit relationships between the size and position of a traffic light in an image. Again, it is difficult to capture these relationships explicitly through image processing. We propose to introduce the additional information of the bounding box

B by including it in the SVM input sample. Thus we form a new vector S_2 for

each candidate, where

S_2 = S_1 ∪ B = {R_in, R_out, m_ref, m_in, m_out, x, y, w, h}    (4.3)

As demonstrated later in Section 4.3, the expanded SVM vector shows a significant improvement in detection performance. It is worth noting that the proposed method can be expanded further by including more parameters and features in the SVM. The proposed method is a framework that utilizes the SVM as a machine learning tool to automatically find optimal parameter settings for traffic light detection.
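As an illustration of this framework, the following sketch assembles S_2 feature vectors and trains an RBF-kernel SVM with OpenCV's ml module; the Candidate container and the use of trainAuto for cross-validated parameter search are choices made for the sketch, not details of our implementation.

#include <opencv2/ml.hpp>
#include <vector>

// One traffic-light candidate with its template-matching scores, mean pixel
// values and bounding box (hypothetical container for the S_2 features).
struct Candidate {
    float rIn, rOut, mRef, mIn, mOut;
    cv::Rect box;
    int label;   // 1 = traffic light, 0 = not a traffic light
};

cv::Ptr<cv::ml::SVM> trainDetector(const std::vector<Candidate>& samples)
{
    cv::Mat features(static_cast<int>(samples.size()), 9, CV_32F);
    cv::Mat labels(static_cast<int>(samples.size()), 1, CV_32S);
    for (int i = 0; i < static_cast<int>(samples.size()); ++i) {
        const Candidate& c = samples[i];
        float row[9] = {c.rIn, c.rOut, c.mRef, c.mIn, c.mOut,
                        static_cast<float>(c.box.x), static_cast<float>(c.box.y),
                        static_cast<float>(c.box.width),
                        static_cast<float>(c.box.height)};
        cv::Mat(1, 9, CV_32F, row).copyTo(features.row(i));
        labels.at<int>(i, 0) = c.label;
    }
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::RBF);
    // trainAuto performs k-fold cross-validation over C and gamma.
    svm->trainAuto(cv::ml::TrainData::create(features, cv::ml::ROW_SAMPLE, labels));
    return svm;
}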

4.3 Data Collection and Performance Evaluation

As an experimental setup, we mount a camera behind the front windshield and record videos while driving on the road. We extract traffic light candidates using the process discussed in 4.2.1 and obtain a data set with 21299 candidates from 2706 images. These images are extracted from actual videos covering four independent instances of circular traffic lights. Each image has a resolution of 1920-by-1080 pixels. We compare these candidates with manually labeled ground truth and find that there are 4526 true traffic lights and 16773 negative candidates. This newly constructed dataset is used to evaluate the proposed detection method. In order to compare the performance among different methods, we use two metrics, precision and recall, where

precision = true positives / (true positives + false positives)    (4.4)

Figure 4.4: R_in/R_out values for true positive candidates (left) and true negative candidates (right). The Y-axis ranges from 0 to 2000.

recall = true positives / (true positives + false negatives)    (4.5)

The dataset with 21299 candidates is shuffled randomly. When training the proposed SVM, half of the candidates are used as training data and the remaining half are used for testing. There is no overlap between the training and test data.

We first evaluate the ratio R_in/R_out and its feasibility for detecting traffic lights in the image. Fig. 4.4 shows that the ratio values for true negative candidates are generally larger than those of true positive candidates. But if we zoom into the R_in/R_out values with the Y axis from 0 to 20, as in Fig. 4.5, we can see that many true negative candidates also have small R_in/R_out values. Therefore, choosing a fixed threshold p is not an effective method to separate the positive and negative candidates, because some true positive candidates would be classified as negative and vice versa.

Table 4.1 lists the evaluation results based on R_in/R_out for different p values. TP, FP, TN and FN stand for True Positives, False Positives, True Negatives and False Negatives, respectively. The results show that it is difficult to balance precision and recall with a fixed threshold value. Thus we opt to use an SVM for classification based on the R_in and R_out values.

Figure 4.5: R_in/R_out values for true positive candidates (left) and true negative candidates (right). The Y-axis ranges from 0 to 20.

Table 4.1: Evaluation results based on R_in/R_out for different p values

Threshold | Precision | Recall | TP | FP | TN | FN
p = 1.5 | 47.52% | 96.51% | 4368 | 4824 | 11949 | 158
p = 1.0 | 60.90% | 89.79% | 4064 | 2609 | 14164 | 462
p = 0.5 | 78.48% | 76.14% | 3446 | 945 | 15828 | 1080
p = 0.2 | 95.95% | 45.01% | 2037 | 86 | 16687 | 2489

Table 4.2 shows the performance of different detection methods. We use the classic object detection method with Haar-like features and the AdaBoost algorithm as a baseline, which yields 76.89% precision and 73.40% recall. If we set the threshold p = 0.5 for the R_in/R_out ratio, the detection performance is only slightly better than the baseline.

Next, an SVM with a radial basis function (RBF) kernel is trained with {R_in, R_out} as input for traffic light classification. Table 4.2 shows that the SVM improves recall by 15.14% but precision by only 2.28%, compared with using a fixed threshold.

As proposed in Section 4.2, when we add the pixel values m_ref, m_in and m_out, in addition to the R_in and R_out values, to form the SVM input vector S_1, the detection performance improves significantly to 89.09% precision and 96.60% recall. Furthermore, the origin and geometry information {x, y, w, h} of the bounding box is added to form S_2. The improved method achieves 96.97% precision and 99.43% recall, which is reasonably accurate and reliable.

Table 4.2: Evaluation results: precision and recall

Detection method | Precision | Recall
Haar, AdaBoost | 76.89% | 73.40%
R_in/R_out, p = 0.5 | 78.48% | 76.14%
{R_in, R_out}, SVM | 80.76% | 91.28%
S_1, SVM | 89.09% | 96.60%
S_2, SVM | 96.97% | 99.43%

Fig. 4.6 shows an example of detected traffic lights in an image. Although their backgrounds are drastically different, both traffic lights are detected and marked in the image. Our system is implemented in C++ and executed on an Intel i5-3570K processor at 3.4 GHz. The processing time for each image frame is approximately 60 ms to 90 ms. For a real-time implementation, we are currently migrating the design to an FPGA platform.

4.4 Conclusions

In this chapter, we propose a new method that can detect traffic lights accurately and reliably. Color extraction is applied to locate the candidates. A template matching technique is applied to provide quantitative information about a traffic light and its surrounding areas. We also demonstrate that detection using a fixed threshold ratio is not very effective and that the SVM-based classification has much better performance. In addition, we empirically add more parameters of a candidate to the SVM input, which achieves the best performance of 96.97% precision and 99.43% recall. As an additional contribution, we build a traffic light dataset with 21299 samples from original videos captured while driving on the streets. This dataset can be used by others for computer vision and machine learning research.

Figure 4.6: Both traffic lights are detected.

Chapter 5

Accurate and Reliable Detection of Traffic Lights Using Multi-Class Learning and Multi-Object Tracking

Automatic detection of traffic lights is of great importance to road safety. This chapter presents a novel approach that combines computer vision and machine learning techniques for accurate detection and classification of different types of traffic lights, including green and red lights in both circular and arrow forms. Initially, color extraction and blob detection are employed to locate the candidates. Subsequently, a pre-trained PCA network is used as a multi-class classifier to obtain frame-by-frame results. Furthermore, an online multi-object tracking technique is applied to overcome occasional misses, and a forecasting method is used to filter out false positives. Several additional optimization techniques are employed to improve the detector performance and handle the traffic light transitions. When evaluated on the test video sequences, the proposed system successfully detects the traffic lights in the scene with high accuracy and stable results, and is ready to be integrated into advanced driver assistance systems or self-driving vehicles. We build our own dataset of traffic lights from recorded driving videos, including circular lights and arrow lights in different directions. Our

experimental dataset is available at http://computing.wpi.edu/Dataset.html.

5.1 Introduction

Automatic detection of traffic lights is an essential feature of an advanced driver assistance system or self-driving vehicle. It is a critically important road safety issue today that many traffic accidents at intersections are caused by drivers running red lights. Recent data from the Insurance Institute for Highway Safety (IIHS) show that in 2012, on US roads, red-light-running crashes caused about 133,000 injuries and 683 deaths [1]. The introduction of an automatic traffic light detection system, especially for red light detection, has important social and economic impacts. In addition to detecting traffic lights, it is also important to recognize whether they appear as circular lights or as directional arrow lights. For example, a red left arrow light and a green circular light can appear at the same time. Without recognition, a detection system can get confused because valuable information has been lost. There are few papers in the literature that combine detection and recognition of traffic lights. Based on our survey, there are also very few datasets available for traffic lights. The Traffic Lights Recognition (TLR) public benchmarks [8] contain image sequences with

traffic lights and ground truth. However, the images in that dataset do not have high resolution, and the number of physical traffic lights is limited because the image sequences are converted from a short video. In addition, the dataset only contains circular traffic lights, which is not always the case in real applications. Therefore, we opt to build our own dataset for traffic light detection, including circular lights and arrow lights in all three directions. Our dataset of traffic lights can be used by other researchers in computer vision and machine learning. In this chapter, we propose a new method that combines computer vision and machine learning techniques. Color extraction and blob detection are used to locate the candidates, followed by PCA network (PCANet) [32] classifiers. The PCANet classifier consists of a PCANet and a linear Support Vector Machine (SVM). Our experimental results suggest the proposed method is highly effective for detecting both green and red traffic lights of many types. Despite the effectiveness of PCANet and many outstanding achievements made by computer vision researchers, object detection from a single image still makes frequent errors, which may cause serious problems in safety-critical real-world applications such as Advanced Driver Assistance Systems (ADAS). Traditional frame-by-frame detection methods ignore the inter-frame information in a video. Since the objects in a video are normally in continuous motion, their identities and trajectories are valuable information that can improve the frame-based detection results. Unlike a pure tracking problem, which tracks a marked object from the first frame, tracking-by-detection algorithms involve frame-by-frame detection, inter-frame tracking and data association. In addition, multi-object tracking (MOT) algorithms can be employed to distinguish different objects and keep track of their identities and trajectories. When it becomes a multi-class problem, such as recognizing different types of traffic lights, an additional

procedure such as a voting scheme is often applied. In addition, the method needs to address the situation in which the traffic light status changes suddenly during the detection process. The rest of the chapter is organized as follows. Section 5.2 describes our data collection and experimental setup. In Section 5.3, we propose a method that combines computer vision and machine learning techniques for traffic light detection using PCANet. In Section 5.3.3, we propose a MOT-based method that stabilizes the detection and improves the recognition results. Performance evaluation is presented in Section 5.4, followed by some discussion in Section 5.5 and conclusions in Section 5.6.

5.2 Data Collection and Experimental Setup

In this chapter, we focus on the detection of red and green traffic lights and the recognition of their types. Amber lights can be detected using similar techniques, but we do not consider them here due to lack of data. The recognition of arrow lights requires high resolution input frames; otherwise all lights appear as colored dots or balls in the frame and it is impossible to recognize them. We mount a smartphone behind the front windshield and record videos while driving on the road. Several hours of video are recorded around the city of Worcester, Massachusetts, USA, during both the summer and winter seasons. Subsequently, we select a subset of video frames to build the dataset, since most of the frames do not contain traffic lights. In addition, passing an intersection only takes a few seconds in the case of green lights, and at red lights the frames are almost identical because the vehicle is stopped.

Thus the length of the selected video for each intersection is very short. Several minutes of traffic-light-free frames are retained in our dataset for assessment of false positives. Each image has a resolution of 1920×1080 pixels. To validate the proposed approach and to avoid overlap between training and test data, the data collected in the summer is used for training and the data collected in the winter is used for testing. Our traffic light dataset is made available online at http://computing.wpi.edu/Dataset.html.

5.2.1 Training data

All the training samples are taken from the data collected during the summer. Input data to the classifier are obtained from the candidate selection procedure described in 5.3.1, and the classifier output goes to the tracking algorithm for further processing. Thus evaluation of the classifier is independent of the candidate selection and the post-processing (tracking). The classifier is trained to distinguish true and false traffic lights, and to recognize the types of the traffic lights. OpenCV [84] is used for SVM training, which chooses the optimal parameters by performing 10-fold cross-validation. The positive samples, which contain the traffic lights, are manually labeled and extracted from the dataset images. The negative samples, such as segments of trees and vehicle tail lights, are obtained by applying the candidate selection procedure to the traffic-light-free images. The green lights and red lights are classified separately. For green lights, there are three types of samples based on their aspect ratios. The first type is called Green ROI-1; it contains one green light in each image and its aspect ratio is approximately 1:1. The second type is called Green ROI-3; it contains the traffic light holder area, which has one green light and two off lights, and its aspect ratio is approximately 1:3. The third type is called Green ROI-4. It contains the traffic light

holder area, which has one green round light, one green arrow light, and two off lights, and its aspect ratio is approximately 1:4.

Figure 5.1: Examples of 5 classes of Green ROI-1.

Each type of sample image has several classes. Green ROI-1 and Green ROI-3 both have five classes, including negative samples, as shown in Fig. 5.1 and Fig. 5.2. These 5 classes from top to bottom are Green Negative (GN-1; GN-3), Green Arrow Left (GAL-1; GAL-3), Green Arrow Right (GAR-1; GAR-3), Green Arrow Forward (GAF-1; GAF-3) and Green Circular (GC-1; GC-3). Green ROI-4 also has five classes, including negative samples, as shown in Fig. 5.3. The five classes from top to bottom are Green Negative (GN-4), Green Circular and Green Arrow Left (GCGAL-4), Green Circular and Green Arrow Right (GCGAR-4), Green Arrow Forward and Left (GAFL-4) and Green Arrow Forward and Right (GAFR-4). The Green Negative samples are obtained from traffic-light-free videos by using the color extraction method discussed in Section 5.3.1. For red lights, there are two types of sample images based on their aspect ratios. The first type is called Red ROI-1, as shown in Fig. 5.4. It contains one red light in each image and its aspect ratio is approximately 1:1. The other type is called Red

Figure 5.2: Examples of 5 classes of Green ROI-3.

Figure 5.3: Examples of 5 classes of Green ROI-4.

Figure 5.4: Examples of 3 classes of Red ROI-1.

ROI-3, as shown in Fig. 5.5. It contains the traffic light holder, which holds one red light and two off lights, and its aspect ratio is approximately 1:3. Each type of sample image has three classes: Red Negative (RN-1; RN-3), Red Arrow Left (RAL-1; RAL-3) and Red Circular (RC-1; RC-3). The Red Negative samples are obtained from traffic-light-free videos by using the color extraction method mentioned in 5.3.1. The red lights do not have ROI-4 data because the red light is on top, followed by an amber light and one or two green lights at the bottom. If the red light is on, the amber and green lights beneath it must be off. These three lights form the ROI-3 vertical setting, regardless of the status of the 4th light at the very bottom. Table 5.1 shows the number of training samples of Green ROI-n and Red ROI-n, where n is 1, 3 or 4. The features of a traffic light itself may not be as rich as those of other objects such as a human or a car. For example, a circular light is just a colored blob that looks similar to other objects of the same color. Therefore, it is difficult to distinguish true traffic lights from other false candidates solely based on color analysis. The ROI-3 and ROI-4 samples are images of the holders, which provide additional information for detection and classification. The approach of combining all this information is explained in 5.3.2.2.

Figure 5.5: Examples of 3 classes of Red ROI-3.

Table 5.1: Number of training samples of Green ROI-n and Red ROI-n

Class | n = 1 | n = 3 | n = 4
GN-n | 13218 | 13218 | 13213
GAL-n | 1485 | 835 | -
GAR-n | 1717 | 617 | -
GAF-n | 2489 | 1018 | -
GC-n | 3909 | 3662 | -
GCGAL-n | - | - | 369
GCGAR-n | - | - | 281
GAFL-n | - | - | 749
GAFR-n | - | - | 1005
RN-n | 7788 | 7619 | -
RAL-n | 1214 | 1235 | -
RC-n | 4768 | 5035 | -

5.2.2 Test data

All test images are taken from the data collected in the winter. The ground truths are manually labeled and are used for validating the results. In our proposed method, a tracking technique is used to further improve the performance. However, traffic lights can move out of the image or change state during the tracking process. Therefore the test sequences need to cover many possible scenarios for all types of lights. Detailed information about the test sequences is shown in Table 5.2.

5.3 Proposed Method of Traffic Light Detection and Recognition

Fig. 5.6 shows the flowchart of our proposed method of traffic light detection and recognition, which consists of three stages. Firstly, color extraction and candidate selection are performed on the input image. Secondly, to determine whether the selected candidates are traffic lights and what types of lights they are, they are processed by the PCANet and SVM. Finally, tracking and forecasting techniques are applied to improve the performance and stabilize the final output.

Table 5.2: Information of 23 test sequences

Seq ID | Frames | Traffic lights | Types of traffic lights | Description
1 | 91 | 182 | Green circular×2. | Lights in all frames.
2 | 90 | 180 | Green circular×2. | Lights in all frames.
3 | 61 | 147 | Green arrow left×3. | Lights in all frames.
4 | 48 | 144 | Green circular×3. | Lights in all frames.
5 | 156 | 312 | Red circular×2. | Lights in all frames.
6 | 156 | 211 | Green circular×2. | Lights at start, then move out.
7 | 214 | 428 | Green circular×2. | Lights in all frames.
8 | 76 | 152 | Red circular×2. | Lights in all frames.
9 | 245 | 305 | Green circular×2. | Lights at start, then move out.
10 | 174 | 177 | Green circular×2. | Lights at start, then move out.
11 | 91 | 348 | Red circular×3; green arrow left; green arrow right; green arrow forward; green circular. | Red lights at start, then green lights.
12 | 56 | 280 | Red arrow left; green arrow right×2; green arrow forward×2. | Lights in all frames.
13 | 82 | 70 | Green circular×2. | Lights at start, then move out.
14 | 259 | 518 | Green circular×2. | Lights in all frames.
15 | 65 | 325 | Red arrow left; green arrow right×2; green arrow forward×2. | Lights in all frames.
16 | 185 | 242 | Green circular×2. | Lights at start, then move out.
17 | 93 | 186 | Red circular×2. | Lights in all frames.
18 | 630 | 0 | None. | No traffic lights.
19 | 580 | 0 | None. | No traffic lights.
20 | 416 | 0 | None. | No traffic lights.
21 | 550 | 0 | None. | No traffic lights.
22 | 759 | 0 | None. | No traffic lights.
23 | 3035 | 0 | None. | No traffic lights.
Total | 8112 | 4207 | - | -

Figure 5.6: Flowchart of the proposed method of traffic light detection and recognition.

5.3.1 Locating candidates based on color extraction

To locate the traffic lights, color extraction is applied to locate the Region of Interest (ROI), i.e., the candidates. The images are converted to the hue, saturation, and value (HSV) color space. Compared with the RGB color space, the HSV color space is more robust against illumination variation and is more suitable for segmentation [85]. The desired color is extracted from an image mainly based on the hue values, which results in a binary image. Suppose the HSV value of the ith pixel in an image is

HSV_i = {h_i, s_i, v_i}    (5.1)

In order to extract green pixels, we set the color thresholds based on empirical data:

40 ≤ h_i ≤ 90    (5.2)

60 ≤ s_i ≤ 255    (5.3)

110 ≤ v_i ≤ 255    (5.4)

In order to extract red pixels, besides (5.3) and (5.4), one of the following conditions must hold:

165 ≤ h_i ≤ 180    (5.5)

0 ≤ h_i ≤ 20    (5.6)

These values are adjustable and similar settings can be found in [28]. Note that the threshold values we choose work well in OpenCV [84] and may need proper conversion in order to work with other libraries. Blob detection can be implemented using flood-fill or contour following, and the resulting blobs are considered potential candidates. However, it is possible that an arrow light may be labeled as two different regions, because the head and tail of an arrow

are sometimes separated with a gap between them. When the traffic lights are closer to the camera, it is more likely that the gaps can be clearly seen, which affects the result of blob extraction. To solve this problem, a closing operation is performed on the binary image obtained from color extraction. Closing is a typical morphological operation in image processing: it applies a dilation followed by an erosion, which eliminates gaps and holes in the binary image. Therefore, the arrow light can be detected as a whole, and the candidates after closing are more reliable than the original candidates. Fig. 5.7 shows the original result of color extraction and blob detection (top right), and the result with the closing operation (bottom right).

Figure 5.7: Color extraction, blob detection and closing operation.

The side effect of the closing operation is that it might connect a green light with other green objects in the background such as trees. When the traffic lights are far away from the camera, this problem is more likely to occur because the black
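A minimal sketch of this color extraction and closing step is given below, using the thresholds of (5.2)-(5.6); the function name and the 5 × 5 elliptical structuring element are illustrative choices, and in the actual pipeline both the original masks and the closed masks are kept, as explained next.

#include <opencv2/imgproc.hpp>

// Build binary masks for green and red pixels using the HSV thresholds of
// (5.2)-(5.6), then apply a morphological closing so that the head and tail
// of an arrow light merge into a single blob.
void extractColorMasks(const cv::Mat& bgrFrame, cv::Mat& greenMask,
                       cv::Mat& redMask)
{
    cv::Mat hsv;
    cv::cvtColor(bgrFrame, hsv, cv::COLOR_BGR2HSV);

    // Green: 40 <= h <= 90, 60 <= s <= 255, 110 <= v <= 255.
    cv::inRange(hsv, cv::Scalar(40, 60, 110), cv::Scalar(90, 255, 255), greenMask);

    // Red wraps around hue 0, so two hue ranges are combined.
    cv::Mat redLow, redHigh;
    cv::inRange(hsv, cv::Scalar(0, 60, 110), cv::Scalar(20, 255, 255), redLow);
    cv::inRange(hsv, cv::Scalar(165, 60, 110), cv::Scalar(180, 255, 255), redHigh);
    redMask = redLow | redHigh;

    // Closing = dilation followed by erosion; fills small gaps and holes.
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::morphologyEx(greenMask, greenMask, cv::MORPH_CLOSE, kernel);
    cv::morphologyEx(redMask, redMask, cv::MORPH_CLOSE, kernel);
}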

borders of the traffic light holders are thin. However, when the traffic lights are far away, the gaps are more likely to be filled by the halo of the lights, or become invisible due to the limits of image resolution. Therefore, the original candidates are more reliable than those after closing. It is difficult to decide in advance whether the morphological closing operation should be applied, so we choose to keep both the original candidates and the candidates after the closing operation. In case overlapping candidates are identified through the classification, the candidate with the aspect ratio closest to one is selected. The objective of eliminating false positives is addressed in a later part of the proposed method. Fig. 5.8 shows an example of the road images. In this image, there are four green traffic lights, but 895 green candidates can be extracted using the method mentioned above. This requires our classifier to be very strong in order to filter out the negative candidates while retaining positive ones. However, even if the classifier were able to filter out 99% of the negative candidates, there would still be about 9 false positives remaining in this image, which is an unacceptable result. Therefore, pre-filtering and post-validation steps are necessary in addition to the classifier itself. For red traffic lights, the number of candidates is much smaller than that of the green traffic lights; for example, there are 19 red candidates in Fig. 5.8 after color extraction. In many previous works [21, 22, 25, 29], a variety of morphological filtering techniques were applied to eliminate some candidates for the purpose of reducing false positives. However, any filtering has a possibility of missing the true traffic lights, because the traffic lights are not always clear due to their size and the obscure background in an image. Thus only an aspect ratio check is performed in the proposed method, and all blobs that pass the check are kept as candidates. The aspect ratio ar

Figure 5.8: A sample frame from our traffic light dataset.

is defined as

ar = w/h    (5.7)

where w is the width and h is the height of the candidate. In order to pass the aspect ratio check, the following inequality must hold:

2/3 ≤ ar ≤ 3/2 (5.8)

The aspect ratio check reduces the number of candidates. In Fig. 5.8, the number of green candidates is reduced to 51 and the number of red candidates is reduced to 9 after the aspect ratio check.

5.3.2 Classification

5.3.2.1 PCANet

The PCANet classifier is applied to determine whether a candidate is a traffic light or not. The PCANet classifier consists of a PCA network and a multi-class SVM. The structure of PCANet is simple, consisting of a number of PCA stages followed by an output stage. The number of PCA stages can vary, but the typical value is 2, giving the so-called two-stage PCANet. As shown in [32], a two-stage PCANet outperforms a single-stage PCANet in most cases, but further increasing the number of stages does not necessarily provide better performance, according to their empirical experience. Therefore, a two-stage PCANet is used in our proposed method. The structure of PCANet emulates that of a traditional convolutional neural network [86]: the filter bank consists of PCA filters, the nonlinear layer is binary hashing (quantization), and the pooling layer is the block-wise histogram of binary vectors. There are two parts in a PCA stage: patch mean removal and PCA filter convolution. For each pixel of the input image, there is a patch of pixels of the same size as the filter. The mean is removed from each patch, followed by PCA filter convolution. The PCA filters are obtained by unsupervised learning during the training process. The number of PCA filters can also vary; its impact is discussed in [32]. Generally speaking, more PCA filters lead to better performance. In this chapter, we choose 8 filters for both PCA stages and find this sufficient to deliver good performance. The output stage consists of binary hashing and a block-wise histogram. The outputs of the PCA stages are converted to binary values, with positive values mapped to one and all others to zero. Thus a binary vector is obtained for each patch, and the length of this vector is fixed. This binary vector is then converted to a decimal value. The block-wise histogram of these decimal values forms the output features, which are then fed to the SVM. Fig. 5.9 shows the structure of a two-stage PCANet, where the number of filters in stage 1 is m and in stage 2 is n.

Figure 5.9: The structure of two-stage PCANet.
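The output stage can be sketched as follows for a single stage-1 feature map; real PCANet implementations typically use overlapping blocks and process every stage-1 map, so the non-overlapping block layout and the function name here are simplifications of ours.

#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

// Simplified sketch of the PCANet output stage for one stage-1 feature map:
// the n stage-2 filter responses are binarized (positive -> 1, else 0),
// packed into an n-bit decimal code per pixel, and summarized by block-wise
// histograms that are concatenated into the feature vector fed to the SVM.
std::vector<float> pcanetOutputStage(const std::vector<cv::Mat>& stage2Responses,
                                     int blockSize)
{
    const int n = static_cast<int>(stage2Responses.size());
    cv::Mat code = cv::Mat::zeros(stage2Responses[0].size(), CV_32F);
    for (int k = 0; k < n; ++k) {
        cv::Mat bit = (stage2Responses[k] > 0) / 255;   // binary hashing (0/1)
        cv::Mat bitF;
        bit.convertTo(bitF, CV_32F);
        code += std::pow(2.0f, static_cast<float>(k)) * bitF;   // to decimal
    }

    std::vector<float> features;
    const int numBins = 1 << n;                          // 2^n histogram bins
    for (int y = 0; y + blockSize <= code.rows; y += blockSize) {
        for (int x = 0; x + blockSize <= code.cols; x += blockSize) {
            std::vector<float> hist(numBins, 0.0f);
            cv::Mat block = code(cv::Rect(x, y, blockSize, blockSize));
            for (int r = 0; r < block.rows; ++r)
                for (int c = 0; c < block.cols; ++c)
                    hist[static_cast<int>(block.at<float>(r, c))] += 1.0f;
            features.insert(features.end(), hist.begin(), hist.end());
        }
    }
    return features;
}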

5.3.2.2 Recognizing green traffic lights using PCANet

As mentioned in 5.3.1, due to the large number of green objects in an image, such as trees, street signs, and green vehicles, the classifier must be strong enough to eliminate the potential false positives while maintaining a high detection rate. Using the green areas as candidates is not sufficient. For example, a fragment of tree leaves may occasionally look similar to the green lights in some frames, which causes false positive “flashing” in the video of detection results. To solve this problem, a validation step is applied to the system. It is assumed that the traffic lights always appear in a holder. The traffic light holder contains three

or four lamps that are vertically aligned in our collected data. Note that horizontal traffic lights are also often used and can be processed using the same method if such a dataset is available. In addition, these lamps have certain valid combinations. The traffic light holder area thus contains important information that can help us detect the traffic lights. In a vertical traffic light holder, the bottom lamp is always green. Therefore, the position of a potential traffic light holder can be located according to the green area. The aspect ratio of the green area is approximately 1:1, and the green area is called ROI-1. The traffic light holder area with three lamps is called ROI-3 and the one with four lamps is called ROI-4. Suppose the rectangular bounding box of ROI-1 is RROI−1, where

RROI−1 = {xROI−1, yROI−1, wROI−1, hROI−1} (5.9)

Similarly there are bounding boxes RROI−3 for ROI-3 and RROI−4 for ROI-4 where

RROI−3 = {xROI−3, yROI−3, wROI−3, hROI−3} (5.10)

RROI−4 = {xROI−4, yROI−4, wROI−4, hROI−4} (5.11)

The variables xROI−i, yROI−i are the coordinates of the top-left corner of the bounding box RROI−i , wROI−i is its width and hROI−i is its height. The RROI−3 can be obtained based on RROI−1 as follows, where the coefficients are determined empirically based on the assumption that the lights are vertically aligned and the green light is the lowest light:

xROI−3 = xROI−1 − 0.1 × wROI−1 (5.12)

yROI−3 = yROI−1 − 2.5 × hROI−1 (5.13)

wROI−3 = 1.2 × wROI−1 (5.14)

hROI−3 = 3.6 × hROI−1 (5.15)

In the case of horizontally aligned lights, these coefficients should be changed accordingly. Similarly, the RROI−4 can be obtained based on RROI−1 as follows:

xROI−4 = xROI−1 − 0.1 × wROI−1 (5.16)

yROI−4 = yROI−1 − 3.9 × hROI−1 (5.17)

wROI−4 = 1.2 × wROI−1 (5.18)

hROI−4 = 5.1 × hROI−1 (5.19)
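For illustration, the holder regions can be derived from a detected green blob as in the following sketch of (5.12)-(5.19); the function names are ours, and in practice the resulting rectangles must be clipped to the image boundary.

#include <opencv2/core.hpp>

// Derive the candidate holder regions ROI-3 and ROI-4 from the bounding box
// of a detected green blob (ROI-1), using the empirical coefficients of
// (5.12)-(5.19). The green light is assumed to be the lowest lamp.
cv::Rect greenRoi3(const cv::Rect& roi1)
{
    return cv::Rect(cvRound(roi1.x - 0.1 * roi1.width),
                    cvRound(roi1.y - 2.5 * roi1.height),
                    cvRound(1.2 * roi1.width),
                    cvRound(3.6 * roi1.height));
}

cv::Rect greenRoi4(const cv::Rect& roi1)
{
    return cv::Rect(cvRound(roi1.x - 0.1 * roi1.width),
                    cvRound(roi1.y - 3.9 * roi1.height),
                    cvRound(1.2 * roi1.width),
                    cvRound(5.1 * roi1.height));
}

// In practice the derived rectangle should be intersected with the image
// bounds, e.g. roi3 = greenRoi3(roi1) & cv::Rect(0, 0, img.cols, img.rows).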

All samples of ROI-1 are resized to 10 × 10 pixels, all samples of ROI-3 to 10 × 33 pixels and all samples of ROI-4 to 10 × 43 pixels. Three PCANet classifiers are trained separately for ROI-1, ROI-3 and ROI-4. Each classifier is able to perform multi-class classification, such as distinguishing left arrows, right arrows, circular lights and negative samples. In order to combine the results of these three classifiers, several methods are evaluated using the test dataset. An intuitive solution is the voting strategy: the results of ROI-1, ROI-3 and ROI-4 are voted into several classes and the class that has the most votes is selected as the final result. However, this method is not accurate. The ROI-3 may contain only part of a traffic light holder if it is actually a four-light holder, and the ROI-4 may contain background if it is actually a three-light holder.

Therefore, the positive results of ROI-3 and ROI-4 are both considered possible regions. If any positive result of ROI-1 overlaps with these regions, it is considered a true positive green light. This is a more plausible approach because the two cases mentioned above do contain the traffic light holders that define the possible regions. Although the class types determined by ROI-3 and ROI-4 may be inaccurate, ROI-1 is capable of providing an accurate result.

5.3.2.3 Recognizing red traffic lights using PCANet

Red traffic lights are recognized in a similar way to green lights. The bounding boxes of Red ROI-1 and ROI-3 are expressed in the same way as those of the green lights shown in (5.9) and (5.10). Assuming the lights are vertically aligned and the red light is the top light, the RROI−3 can be obtained based on RROI−1 using Equations 5.12, 5.14, 5.15 and

yROI−3 = yROI−1 − 0.1 × hROI−1 (5.20)

5.3.3 Stabilizing the detection and recognition output

5.3.3.1 The problem of frame-by-frame detection

Frame-by-frame detection is important, but not sufficient to render stable output. The reasons are twofold. One aspect is that no detector can perform perfectly under all possible scenarios. The other aspect is that the input data sometimes is not of good quality. For example, vehicle vibrations may cause the camera to lose focus, making the frames blurry. A red arrow light in such a situation may look identical to a circular red light and can hardly be recognized even by human eyes, as shown in the center image of Fig. 5.10. However, the arrow light is clear in other

frames. If the detector recognizes this arrow light in previous frames and keeps track of it, a correct estimate can be provided in the blurry frame even if the detector gives an incorrect result. In addition, there may be multiple lights in a frame, so the lights need to be distinguished and not confused with each other.

Figure 5.10: An arrow light in three consecutive frames. The middle one is blurry and looks similar to a circular light. A detector often fails on such a frame.

The goal of multi-object tracking is to recover the complete tracks of multiple objects and to estimate their current states. There are two categories of multi-object tracking methods: batch methods and online methods. The batch methods require the detection results of the entire sequence before analyzing the identity and constructing the trajectory of each object, which makes them impractical for real-time applications. The online methods are based on information that is available up to the current frame, and so can provide results in real time. Traffic light detection is a time-critical application that needs to give immediate feedback to the driver or controller, therefore multi-object tracking must be done using an online method. The online methods track objects from previous frames, and associate the tracking result with the detection result of the current frame.

5.3.3.2 Tracking and data association

Here we propose an intuitive approach that is optimized for the traffic light detection application. For a video camera at 30 frames per second (FPS), the motion of the lights between adjacent frames is small. Therefore, an object in the next frame should be found near its location in the previous frame. Since color is an important feature of traffic lights, the mean shift method is employed to locate a traffic light based on its previous position. Given a traffic light in the previous frame, the mean shift procedure calculates the histogram of the hue channel of the HSV color space, and then calculates the histogram back-projection in the current frame in order to locate the light. There are other tracking methods, such as the particle filter, which has been proven to work for tracking multiple people [87]. We do not adopt it for two reasons. One is that traffic lights are small objects in a high resolution image of 1920 × 1080 pixels. This makes it difficult for the particles to locate the traffic lights accurately and may require a large number of particles, which is computationally expensive. The other reason is that the weight of each particle cannot be evaluated effectively. The assumption that the detection confidence of a particle is higher when it gets closer to the actual position of the light is not true: the lights are so small in the image that a small deviation may lose the target completely. In addition, our detector is trained on images of complete traffic lights, so it can neither distinguish partial lights from the background nor give them higher confidence values. For data association, [87] employs greedy data association and observes results similar to the Hungarian algorithm [88]. In our approach, the tracking result is simply associated with the detection result when they overlap. The reason is that the traffic lights are nearly motionless between adjacent frames and mean shift performs well in locating them. In addition, unlike people detection, traffic lights do not intersect with each other and there is no need to consider the identity switch problem, which makes it easier to associate the tracking and detection results. Once the association is established, the detected regions are used for mean shift tracking in the next frame, instead of the regions found by mean shift itself. This solves the scale problem of mean shift, and the detected regions are considered more accurate than the tracking result. Building trajectories of the objects can overcome occasional misses, but still cannot filter out false positives. For example, if a rear light of a car is misclassified as a red traffic light in several frames, its trajectory is very likely to be built by the multi-object tracking algorithm. However, the time series data for each object can be obtained from online multi-object tracking. Since the time series data consist of classification results over time, they can be used to generate the final output using forecasting and time series analysis.
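A minimal sketch of this hue-histogram back-projection and mean shift step is given below; the function name, the 30-bin histogram and the termination criteria are illustrative choices, not parameters taken from our implementation.

#include <opencv2/imgproc.hpp>
#include <opencv2/video/tracking.hpp>

// Track a previously detected light into the current frame: a hue histogram
// of the detected region is back-projected onto the new frame and mean shift
// moves the search window to the densest nearby region.
cv::Rect trackLight(const cv::Mat& prevFrame, const cv::Rect& prevBox,
                    const cv::Mat& currFrame)
{
    const int histSize = 30;
    const float hueRange[] = {0.0f, 180.0f};
    const float* ranges[] = {hueRange};
    const int channels[] = {0};   // hue channel only

    cv::Mat prevHsv, currHsv, hist, backProj;
    cv::cvtColor(prevFrame, prevHsv, cv::COLOR_BGR2HSV);
    cv::cvtColor(currFrame, currHsv, cv::COLOR_BGR2HSV);

    cv::Mat roi = prevHsv(prevBox);
    cv::calcHist(&roi, 1, channels, cv::Mat(), hist, 1, &histSize, ranges);
    cv::normalize(hist, hist, 0, 255, cv::NORM_MINMAX);

    cv::calcBackProject(&currHsv, 1, channels, hist, backProj, ranges);

    cv::Rect window = prevBox;   // lights barely move between frames
    cv::meanShift(backProj, window,
                  cv::TermCriteria(cv::TermCriteria::EPS | cv::TermCriteria::COUNT,
                                   10, 1.0));
    return window;               // associated with a detection if they overlap
}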

5.3.3.3 Forecasting

Given the previous detection or recognition results of a target, the estimate of its current state is the final output. Such a process is called forecasting and time series analysis. Multi-object tracking algorithms focus on building the trajectories and pay little attention to filtering out false positives. The idea is that the accumulated classification results of a false object often have different patterns compared to those of a true object, which can be used to filter out false positives. It is based on the assumption that the detector has the ability to distinguish true positives from false positives to some extent, at least better than random guessing. Otherwise, it is

impossible to filter out the false positives. Some methods can be used to address the false positive problem. In [87], a tracker is only initialized in certain regions of the image, and is deactivated or terminated when there is no associated detection for a certain number of frames. Tracklet confidence is introduced in [89], which is influenced by factors such as length, occlusion and the affinity between tracking and detection. In this chapter, we employ a simple forecasting technique after online multi-object tracking, aiming at stabilizing the imperfect output of traffic light detection and recognition. For each object, there is a binary time series where 1 denotes that the detection result is true and 0 otherwise. The simple moving average (SMA) of the

time series is then calculated. Let n be the window size of the SMA, b_i be the value of the time series in the ith frame, and S_m be the SMA value in frame m. The formula is

S_m = (b_{m-(n-1)} + b_{m-(n-2)} + ... + b_{m-1} + b_m) / n    (5.21)

or alternatively

S_m = S_{m-1} - b_{m-n}/n + b_m/n    (5.22)

This can be interpreted as S_m being propagated from S_{m-1} while replacing the oldest value with the newest value in the sliding window. S_m can be used to determine whether the object is considered positive: a threshold t is set to determine the final output b̂_m as

b̂_m = 1 if S_m ≥ t, and b̂_m = 0 if S_m < t    (5.23)

When b̂_m is positive, a majority voting scheme is used to determine the type of the traffic light. The history labels of this particular light are voted into corresponding

bins, and the bin with the most votes gives the type of the traffic light.

5.3.3.4 Minimizing delays

Forecasting and time series analysis usually introduce delays. As the window size n grows, the delays become more severe. The delay at the head of a trajectory helps avoid picking up false positives, because false positives are expected to be occasional and inconsistent; however, slowly picking up true positives produces misses, i.e., false negatives. On the other hand, the delay at the tail of a trajectory helps avoid dropping true positives, because true positives are expected to be consistent with minimal and temporary errors; however, slowly dropping false positives produces erroneous output and increases the total number of false positives in the sequence. The delays must be balanced so that their side effects are minimized while their useful functions are not compromised. At the head of trajectories, a dynamic threshold and a modified moving average are employed. Suppose in frame m the moving average Ŝ_m is modified as

Ŝ_m = (b_{m-(n-1)} + b_{m-(n-2)} + ... + b_{m-1} + b_m) / n,  if m ≥ n
Ŝ_m = (b_1 + b_2 + ... + b_{m-1} + b_m) / m,                  if m < n    (5.24)

and set the threshold tˆm with a positive constant value α as

  t m ≥ n tˆm = (5.25)  m t + α(1 − n ) m < n

At the beginning, the threshold is high and it drops slowly when more frames are available. The output from the first n frames is suppressed because of insufficient

information to make a reliable decision. In a video at 30 FPS, 5 frames correspond to about 167 ms; according to [90], human reaction time is over a second, so such delays are acceptable. As a result, a true object with high confidence is picked up quickly, while false positives can still be filtered out. At the tail of trajectories, an object that no longer exists needs to be dropped quickly. Traffic lights may change their states or move out of the image during the tracking process. The state transition is sudden: there is usually at most one frame that shows both lights on or both off, indicating the transition is taking place. This particular frame does not exist in many cases, so it is not a reliable indicator of when the transition occurs. However, traffic lights are motionless in adjacent frames, and the last valid position of a currently off light is still useful. When a transition happens, it can be determined whether a detected light belongs to the same traffic light holder as a previously detected light of a different color. Subsequently, the transition is identified and the expired information is dropped. On the other hand, when positive detections of a light near the edge of the image are lost for a few consecutive frames, the object is dropped to avoid erroneous output. Occlusion is not considered in this chapter, because it is not safe to predict the state of a light without actually seeing it completely.
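A minimal sketch of the warm-up behavior in Eqs. (5.24) and (5.25) follows; the window size, base threshold, and \alpha are placeholder values chosen for illustration.

```python
def modified_sma(values, m, n=5):
    """Eq. (5.24): average over the full window once m >= n,
    otherwise over the m frames seen so far (frame index m is 1-based)."""
    if m >= n:
        return sum(values[m - n:m]) / n
    return sum(values[:m]) / m

def dynamic_threshold(m, n=5, t=0.6, alpha=0.4):
    """Eq. (5.25): start with a raised threshold and relax it linearly
    toward t until n frames are available."""
    return t if m >= n else t + alpha * (1 - m / n)

# the threshold relaxes toward t as more frames of the trajectory arrive
for m in range(1, 8):
    print(m, round(dynamic_threshold(m), 3))
```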

5.4 Performance Evaluation

5.4.1 Detection and recognition

Fig. 5.11 shows an example frame with detected traffic lights. Here two metrics named precision and recall are used, where

\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}    (5.26)

\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}    (5.27)

The true positives (TP) are samples that belong to this class and are correctly recognized as this class. The false positives (FP) are samples that do not belong to this class but are incorrectly recognized as this class. The false negatives (FN) are samples that belong to this class but are erroneously recognized as other classes.

Figure 5.11: All traffic lights are detected and recognized correctly in the frame.

The true positives here must be both detected and recognized correctly. A detected but misclassified light does not provide the correct identity of the actual light, which is a false negative; meanwhile, it provides a false identity of another type of light, which is a false positive. Therefore, a detected but misclassified light is counted as both a false positive and a false negative. For example, if a red left-arrow light is detected but recognized as a red circular light, then the number of false negatives and the number of false positives are both incremented by 1. Table 5.3 shows the results of the test sequences with different configurations, such as using HOG or PCANet, with or without tracking. It is clear that PCANet outperforms HOG and that the tracking technique further improves the performance.
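As a worked check of Eqs. (5.26) and (5.27) together with the counting rule above, the numbers for sequence 17 with PCANet and tracking in Table 5.3 reproduce the reported precision and recall:

```python
tp, fn, fp = 178, 0, 1            # sequence 17, PCANet + tracking (Table 5.3)
precision = tp / (tp + fp)        # 178 / 179 ≈ 0.994
recall = tp / (tp + fn)           # 178 / 178 = 1.0
print(f"precision {precision:.1%}, recall {recall:.1%}")   # 99.4%, 100.0%
# a detected but misclassified light would increment both fn and fp by one
```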

Table 5.3: Test results of the 17 sequences that contain traffic lights (each cell: TP / FN / FP / Precision / Recall)

Seq. ID | HOG | HOG + Tracking | PCANet | PCANet + Tracking
1 | 182 / 0 / 9 / 95.3% / 100% | 162 / 12 / 13 / 92.6% / 93.1% | 182 / 0 / 6 / 96.8% / 100% | 162 / 12 / 6 / 96.4% / 93.1%
2 | 179 / 1 / 13 / 93.2% / 99.4% | 171 / 1 / 4 / 97.7% / 99.4% | 180 / 0 / 13 / 93.3% / 100% | 172 / 0 / 15 / 92.0% / 100%
3 | 143 / 4 / 48 / 74.9% / 97.3% | 135 / 4 / 8 / 94.4% / 97.1% | 145 / 2 / 3 / 98.0% / 98.6% | 135 / 4 / 0 / 100% / 97.1%
4 | 140 / 4 / 10 / 93.3% / 97.2% | 132 / 0 / 0 / 100% / 100% | 139 / 5 / 3 / 97.9% / 96.5% | 132 / 0 / 0 / 100% / 100%
5 | 102 / 210 / 0 / 100% / 32.7% | 154 / 150 / 0 / 100% / 50.7% | 298 / 14 / 0 / 100% / 95.5% | 304 / 0 / 0 / 100% / 100%
6 | 211 / 0 / 51 / 80.5% / 100% | 186 / 17 / 41 / 81.9% / 91.6% | 211 / 0 / 42 / 83.4% / 100% | 186 / 17 / 32 / 85.3% / 91.6%
7 | 411 / 17 / 15 / 96.5% / 96.0% | 420 / 0 / 11 / 97.4% / 100% | 428 / 0 / 6 / 98.6% / 100% | 420 / 0 / 0 / 100% / 100%
8 | 136 / 16 / 6 / 95.8% / 89.5% | 420 / 0 / 11 / 97.4% / 100% | 428 / 0 / 6 / 98.6% / 100% | 144 / 0 / 0 / 100% / 100%
9 | 302 / 3 / 374 / 44.7% / 99.0% | 297 / 0 / 128 / 69.9% / 100% | 303 / 2 / 99 / 75.4% / 99.3% | 297 / 0 / 37 / 88.9% / 100%
10 | 168 / 9 / 14 / 92.3% / 94.9% | 169 / 0 / 10 / 94.4% / 100% | 140 / 37 / 6 / 95.9% / 79.1% | 160 / 9 / 5 / 97.0% / 94.7%
11 | 325 / 23 / 18 / 94.8% / 93.4% | 306 / 30 / 22 / 93.3% / 91.1% | 329 / 19 / 2 / 99.4% / 94.5% | 314 / 22 / 3 / 99.1% / 93.5%
12 | 218 / 62 / 33 / 86.9% / 77.9% | 232 / 28 / 11 / 95.5% / 89.2% | 211 / 69 / 33 / 86.5% / 75.4% | 201 / 59 / 29 / 87.4% / 77.3%
13 | 67 / 3 / 5 / 93.1% / 95.7% | 54 / 8 / 17 / 76.1% / 87.1% | 66 / 4 / 1 / 98.5% / 94.3% | 54 / 8 / 17 / 76.1% / 87.1%
14 | 485 / 33 / 83 / 85.4% / 93.6% | 510 / 0 / 144 / 78.0% / 100% | 493 / 25 / 34 / 93.5% / 95.2% | 510 / 0 / 13 / 97.5% / 100%
15 | 282 / 43 / 21 / 93.1% / 86.8% | 295 / 10 / 7 / 97.7% / 96.7% | 280 / 45 / 0 / 100% / 86.2% | 271 / 34 / 0 / 100% / 88.9%
16 | 231 / 11 / 44 / 84.0% / 95.4% | 230 / 4 / 35 / 86.8% / 98.3% | 201 / 41 / 19 / 91.4% / 83.1% | 220 / 14 / 16 / 93.2% / 94.0%
17 | 186 / 0 / 144 / 56.4% / 100% | 178 / 0 / 110 / 61.8% / 100% | 186 / 0 / 12 / 93.9% / 100% | 178 / 0 / 1 / 99.4% / 100%
Total | 3586 / 439 / 879 / 80.3% / 89.1% | 3612 / 253 / 548 / 86.8% / 93.45% | 3752 / 273 / 276 / 93.1% / 93.2% | 3698 / 167 / 168 / 95.7% / 95.7%

The results are not perfect, due to the limited amount of training data and occasional quality issues in the captured video, as shown in Fig. 5.10.

5.4.2 False positives evaluation

The number of false positives is evaluated over several traffic-light-free sequences, as shown in Table 5.4. Again, PCANet outperforms HOG, and the tracking technique improves the performance. The number of false positives increases rapidly if there are mis-recognized objects: a single mis-recognized object produces 30 false positives per second at a 30 FPS frame rate. False positives are not eliminated completely in our proposed method because of the trade-off between precision and recall: eliminating more false positives may cause more false negatives, increasing precision but decreasing recall, or vice versa. Reference [27] argues that false-positive green lights are dangerous and should be eliminated as much as possible, yielding 99% precision and 62% recall.

Table 5.4: Number of false positives in traffic-light-free sequences (each cell: No. / No. per frame)

Seq. ID | HOG | HOG + Tracking | PCANet | PCANet + Tracking
18 | 150 / 0.2381 | 12 / 0.0190 | 39 / 0.0619 | 0 / 0
19 | 45 / 0.0776 | 35 / 0.0603 | 56 / 0.0966 | 26 / 0.0448
20 | 11 / 0.0264 | 0 / 0 | 18 / 0.0433 | 12 / 0.0288
21 | 127 / 0.2309 | 23 / 0.0418 | 37 / 0.0673 | 9 / 0.0164
22 | 280 / 0.3689 | 125 / 0.1647 | 40 / 0.0527 | 6 / 0.0079
23 | 179 / 0.0590 | 85 / 0.0280 | 105 / 0.0346 | 80 / 0.0264
Total | 792 / 0.1327 | 280 / 0.0469 | 295 / 0.0494 | 133 / 0.0223

While such an argument is reasonable for practical applications, we do not perform such adjustments in this chapter. Instead, we demonstrate highly accurate and well-balanced precision and recall results to validate the proposed approach as well as the performance improvements brought by PCANet and tracking.

5.5 Discussion

5.5.1 Comparison with related work

Table 5.5 compares several recent papers on traffic light detection and recognition. However, it is difficult to compare them directly, because different testing data and different evaluation metrics were used. There are benchmarks for object detection and image classification, such as ImageNet [91], but no benchmark has yet been created for multi-class traffic light detection and classification, so researchers use their own collected data in their respective papers. Some papers [25–27] utilize information other than images, such as GPS data and prior knowledge of traffic light locations.

Some papers focus on a specific type of traffic light, while others handle multiple colors and types at the same time. These factors make it difficult to compare their performance appropriately. Efficiency is also hard to compare, since the image sizes in these papers vary. A higher resolution camera can provide clear images of traffic lights that are farther away, whereas a distant traffic light may occupy only a few pixels in a lower resolution image; the system may therefore detect a traffic light slightly earlier, giving the driver additional time to respond. However, a larger image size leads to higher computational cost and longer processing time. Another factor is that different hardware platforms were used in the implementations, such as desktop and on-board systems, and additional hardware modules may also be involved, such as GPS and an inertial measurement unit (IMU) [26].

5.5.2 Limitation and plausibility

This chapter presents a prototype system that can effectively detect several common types of traffic lights in a vertically aligned setting. We would like to emphasize that the proposed system is extendable: the ROI selection can be modified for other types of traffic lights, such as horizontally aligned lights, and the multi-class classification can be trained if sufficient data are provided. We feel confident that the proposed system can be extended to detect all types of traffic lights, and even to other tasks, with some modification. Different lighting conditions, color distortion, motion blur and scene variations may compromise the system performance in the real world. Thus, the robustness of the trained model is a key factor in addition to detection accuracy.

Table 5.5: Results of several recent works on traffic light detection

Paper | Year | Method | Light types | Image size | Timing | Performance
Our approach | 2016 | PCANet; multi-object tracking | Green circular; red circular; green arrow; red arrow | 1920×1080 | 3 Hz | Precision 95.7%; recall 95.7%
[23] | 2014 | Spot light detection; adaptive template matching; multiple model filter; single object tracking | Green circular; red circular; amber circular | - | - | Average accuracy 97.6%; false alarms ignored in detection
[28] | 2014 | Image processing; hidden Markov models | Green circular; red circular; amber circular | 648×488 | 25 frames per second | Overall detection rate 98.33% and 91.34% in different scenarios
[24] | 2014 | Fast radial symmetry transform | Red circular; amber circular | 240×320 | Most time-consuming part ~1.82 s | Precision 84.93%; recall 87.32%
[25] | 2013 | Filtering scheme with GPS information | Green circular; red circular | 720×480 | 15.7 ms per frame | Precision 88.2%; recall 81.5%
[26] | 2011 | Traffic light mapping and localization using GPS information; several probabilistic stages | Green circular; red circular; amber circular | 1.3 megapixel | Real-time; 15 Hz frame input | Accuracy 91.7%
[27] | 2011 | Traffic light mapping and localization using GPS information; onboard system | Green circular; red circular; amber circular; green arrow; red arrow; amber arrow | 2040×1080 | 4 Hz | Precision 99%; recall 62%

The robustness of our trained models can be improved by training with more data collected under all kinds of conditions using different cameras. Researchers in machine learning often focus on investigating better algorithms, but sometimes getting more data beats a clever algorithm [92]. However, detecting traffic lights in severe weather or at night may require different algorithms or even additional sensors, and little research has been done on such topics. This will be part of our future work as more data become available.

The processing time depends on the image size as well as the number of candidates in an image. The image size in our dataset is 1920×1080, which is considerably larger than in most of the other papers in Section 5.5.1. Our implementation is currently a single-threaded version running at approximately 3 Hz on a CPU. It can be accelerated by using multiple CPU threads, GPUs or FPGA hardware. Previously we have successfully employed GPUs to accelerate a traffic sign detection system in [93] and a fast deep learning system in [94]. Using dedicated hardware is another option to accelerate such systems; the most time-consuming part, the PCANet classification, has been accelerated on an FPGA in our latest work [95].

Since the proposed system is based on a camera sensor, its reliability is directly affected by the video quality. Many factors can affect the output images, such as the camera sensor, its configuration, and post-processing procedures. An example of a data quality problem is shown in Fig. 5.10. The proposed method is also not expected to work at night. Traffic lights at night appear in different ways depending on the camera and its configuration: there may be a halo effect around the lights, or the lights may appear white at the center with only thin colored rings at the edge. A solution for one camera may not be suitable for another camera. Therefore, we decided not to investigate the night-time problem.

5.6 Conclusions

In this chapter, we propose a system that can detect multiple types of green and red traffic lights accurately and reliably. Color extraction and blob detection are applied to locate the candidates with proper optimization. A classification and validation method using PCANet is then used for frame-by-frame detection. A multi-object tracking method and a forecasting technique are employed to improve accuracy and produce stable results. As an additional contribution, we build a traffic light dataset from videos captured by a camera mounted behind the windshield. This dataset has been released to the public for computer vision and machine learning research and is available online at http://computing.wpi.edu/Dataset.html.

Chapter 6

Pedestrian Detection for Autonomous Vehicle Using Multi-spectral Cameras

Pedestrian detection is a critical feature of an autonomous vehicle or advanced driver assistance system. This chapter presents a novel instrument for pedestrian detection that combines stereo vision cameras with a thermal camera. A new dataset for vehicle applications is built from data recorded by the test vehicle while driving on city roads. Data received from the multiple cameras are aligned using the trifocal tensor with pre-calibrated parameters. Candidates are generated from each image frame using sliding windows across multiple scales. A reconfigurable detector framework is proposed, in which feature extraction and classification are two separate stages. The input to the detector can be the color image, disparity map, thermal data, or any of their combinations. When applying convolutional channel features, feature extraction utilizes the first three convolutional layers of a pre-trained convolutional neural network, cascaded with

an AdaBoost classifier. The evaluation results show that it significantly outperforms the traditional histogram of oriented gradients features. The proposed pedestrian detector with multi-spectral cameras can achieve a 9% log-average miss rate. The experimental dataset is made available at http://computing.wpi.edu/dataset.html.

6.1 Introduction

Automatic and reliable detection of pedestrians is an important function of an autonomous vehicle or advanced driver assistance system (ADAS). Research on pedestrian detection depends heavily on data, as different data and methods may yield different evaluation results. The most commonly used sensor in data collection is a regular color camera, and many datasets have been built, such as the INRIA person dataset [9] and the Caltech Pedestrian Detection Benchmark [10]. Thermal cameras have also been considered lately, and different methods of pedestrian detection have been developed based on thermal data [44]. It is worth investigating whether methods developed for one type of sensor data are applicable to other types of sensors. A method may no longer work when the nature of the data changes; e.g., finding hot objects by thresholding intensity values in a thermal image is not applicable to a regular color image. Some methods, such as gradient and shape based feature extraction, may still be applicable, since an object has similar silhouettes in both color and thermal images. In addition, data from different sensors may contain complementary information, and combining them may result in better performance. Multiple cameras can form stereo vision, which provides additional disparity and depth information. An example of combining stereo vision color cameras and a thermal camera for pedestrian detection can be found in [56].

The data collection environment is also very important. Unlike static cameras for surveillance applications, cameras mounted on a moving vehicle may observe much more complex backgrounds and pedestrians at varying distances. Therefore, different pedestrian detection algorithms are needed than for surveillance camera applications. To use multiple sensors on a vehicle, a cooperative multi-sensor system needs to be designed and new algorithms that can coherently process multi-sensor data need to be investigated. The contributions of this chapter are listed as follows:

1. A multi-spectral camera instrument is designed and assembled on a moving vehicle to collect data for pedestrian detection.

2. A new dataset for multi-spectral pedestrian detection is built from on-road driving data. These data contain many complex scenarios that are challenging for detection and classification.

3. We propose a machine learning based algorithm for pedestrian detection by combining stereo vision and thermal images. Evaluation results show satisfactory performance.

The rest of the chapter is organized as follows. Section 6.2 describes our instrumental setup for data collection. In Section 6.3, we propose a framework that combines stereo vision color cameras and a thermal camera for pedestrian detection using different feature extraction methods and classifiers. Performance evaluations are presented in Section 6.4, followed by further discussion in Section 6.5 and conclusions in Section 6.6.

6.2 Data Collection and Experimental Setup

6.2.1 Data Collection Equipment

To collect on-road data for pedestrian detection, we design and assemble a custom test equipment rig. This design enables the data collection system to be mobile on the test vehicle as well as maintaining calibration between data collection runs. The completed system can be seen in Figure 6.1. The ZED stereo vision camera from Stereolabs is chosen for providing color images as well as disparity information. The ZED camera can capture high resolution side-by-side video that contains synchronized left and right video streams, and can create a disparity map of the environment in real time using the graphics processing unit (GPU) in the host computer. Furthermore, an easy-to-use SDK is provided, which allows for camera control and output configuration. In addition, the on-board cameras are pre-calibrated and come with known intrinsic parameters, which makes image rectification and disparity map generation easier. The thermal camera is a FLIR Vue Pro, a long wavelength infrared (LWIR) camera. It uses an uncooled vanadium-oxide microbolometer with a 640 × 512 resolution at a full 30 Hz, paired with a 13 mm germanium lens providing a 45 × 35 degree field of view (FOV). This IR camera has a wide −20 to 50 °C operating range, which allows for rugged outdoor use. The thermal camera also provides Bluetooth wireless control and video data recording via its on-board microSD card, as well as an analog video output. Both the stereo vision and thermal cameras must remain fixed relative to each other for consistency of data collection. A threaded rod is custom cut to length and each end is threaded into the respective camera's tripod mounting hole. This provides

Figure 6.1: Instrumentation setup with both thermal and stereo cameras mounted on the roof of a vehicle.

a rigid connection between the color and thermal cameras. An electrical junction box is utilized as an appropriately sized, waterproof enclosure that provides high impact resistance. The top lid is replaced with an impact-resistant clear acrylic sheet so that the stereo vision cameras can be situated safely behind it. A circular hole is cut into the top lid so that the thermal camera lens can fit through, and the camera is mounted via the lens barrel. This is essential, as even clear acrylic would block most, if not all, of the IR spectrum used by the thermal camera. The mounting system is designed, modeled, and built using aluminum extrusions. The entire structure is completely portable and can be mounted to any vehicle with a ski rack: the aluminum extrusions sit between the front and back ski rack hold-downs. Cable management is also crucial in our design, as long cables are needed for communication between the laptop inside the vehicle and the cameras on the roof. To avoid interference and safety issues, the cables must run down the back of the vehicle, through the trunk and into the vehicle cabin, which requires approximately 20 feet of cable. This creates an issue for the ZED stereo vision camera, as it operates on the high speed USB 3.0 protocol, which allows a maximum length of about 10 feet due to signal degradation and loss. To resolve this issue, an active USB extension cable is used. A total of four cables terminated from the camera setup are wrapped together with braided cable sleeves to prevent tangling and ensure robustness. An analog frame grabber is employed to capture the real-time analog output of the IR camera instead of directly recording to the on-board microSD card, in order to ensure proper synchronization between the thermal camera and the stereo vision cameras. With the analog frame grabber, we are able to capture at precisely 30 FPS. AVI files are generated using software provided along with the frame grabber. These AVI files are

then converted into image sequences.

6.2.2 Data Collection and Experimental Setup

Our dataset is made available online at http://computing.wpi.edu/dataset.html. The data are collected while driving on city roads; highway driving data are not collected, since pedestrians are rarely seen on highways. A total of 58 data sequences are extracted from approximately three hours of driving on city roads across multiple days and lighting conditions. There are 4330 frames in total in which a person or multiple people are in clear view and un-occluded, similar to the Caltech-USA reasonable set [36]. However, unlike the Caltech-USA reasonable set, we do not discard small samples. In fact, more than half of the pedestrian samples in our dataset are no more than 50 pixels in height, due to the image resolution and their distance to the cameras, which makes our dataset more challenging. Each frame contains the stereo color images, thermal image and disparity map. Since the cameras have different angles of view and fields of view, the 58 usable sequences are rather short, ensuring that the pedestrians are within the view of all cameras. Furthermore, video frames without any pedestrians are not included in our dataset.

6.3 Proposed Method

6.3.1 Overview

Figure 6.2 shows the flowchart of our proposed pedestrian detection method. Disparity data are generated from the stereo color data. Thermal data are obtained from the thermal camera and reconstructed according to the point registration using the trifocal

tensor. Instead of concatenating the features of different data sources and training a single classifier, feature extraction and classification are performed independently for each data source before the decision fusion stage. The decision fusion stage uses the confidence scores of the classifiers, along with some additional constraints, to make the final decision. The proposed detector system can be reconfigured using different feature extraction and classification methods, such as HOG with SVM or CCF with AdaBoost. The decision fusion stage can utilize information from one or multiple classifiers, and the performance of different configurations can be evaluated and compared.

6.3.2 Trifocal tensor

The three cameras have different angles of view and fields of view, making point registration (pixel-level alignment) essential for windowed detection across multi-spectral images. A simple overlay with fixed pixel offsets does not work, because every object has its own offset values depending on its distance to the camera. Therefore, the trifocal tensor [56, 96] is used for pixel-level alignment of the color and thermal images. The trifocal tensor T is a set of three 3 × 3 matrices that can be denoted as {T_1, T_2, T_3} in matrix notation, or T_i^{jk} in tensor notation [96], with two contravariant and one covariant indices. The idea of the trifocal tensor is that given a point correspondence x ↔ x' ↔ x'' across the three views, there is a relation

[x']_\times \left( \sum_i x^i T_i \right) [x'']_\times = 0_{3 \times 3}.    (6.1)

One method to compute the trifocal tensor T is the normalized linear algorithm.

Figure 6.2: Framework of the proposed pedestrian detection method.

Given a point-point-point correspondence x ↔ x' ↔ x'', there is a relation

x^i x'^j x''^k \epsilon_{jqs} \epsilon_{krt} T_i^{qr} = 0_{st}

where 4 out of the 9 equations (over all choices of s and t) are linearly independent. Therefore, at least 7 point-point-point correspondences are needed to compute the 27 elements of the trifocal tensor. The trifocal tensor can be computed from a set of equations of the form At = 0, using the algorithm for the least-squares solution of a homogeneous system of linear equations.

Given a correct correspondence x ↔ x', it is possible to determine the corresponding point x'' in the third view without reference to image content. It can be denoted as x''^k = x^i l'_j T_i^{jk}, and can be obtained by using the trifocal tensor and the fundamental matrix F_{21}. The line l' goes through x' and is perpendicular to l'_e = F_{21} x. Both the trifocal tensor and the fundamental matrix F_{21} can be pre-computed and only need to be computed once, as long as the placement of the cameras remains unchanged. An alternative method is the epipolar transfer x'' = (F_{31} x) \times (F_{32} x'). However, this method has a serious problem: it fails for all points lying on the trifocal plane. Therefore, the trifocal tensor is a practical solution for point registration.

In our experiment, the cameras are calibrated using a checkerboard. The pattern is made of different materials, making it visible to both the color and thermal cameras. Figure 6.3 shows the use of the trifocal tensor in aligning color and thermal images.
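The point-transfer step can be sketched in Python as follows. This is an illustrative implementation of the construction above, not the dissertation's code; the tensor storage layout (T[i] holding T_i as a 3 × 3 array indexed by (j, k)) and the assumption that points are given in homogeneous coordinates with last entry 1 are our own choices.

```python
import numpy as np

def transfer_to_third_view(x, x_prime, T, F21):
    """Transfer a correspondence (x, x') into the third (thermal) view via
    x''^k = x^i l'_j T_i^{jk}.  T is a (3, 3, 3) array with T[i] = T_i
    (rows indexed by j, columns by k); F21 is the fundamental matrix between
    views 1 and 2; x and x_prime are homogeneous 3-vectors.  Sketch only."""
    le = F21 @ x                       # epipolar line l'_e = F21 x in view 2
    # line through x' perpendicular to l'_e
    l_prime = np.array([le[1], -le[0],
                        -x_prime[0] * le[1] + x_prime[1] * le[0]])
    x_dprime = sum(x[i] * (T[i].T @ l_prime) for i in range(3))
    return x_dprime / x_dprime[2]      # normalize the homogeneous coordinates
```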

6.3.3 Sliding windows vs. region of interest

There are two main methods to locate a pedestrian: sliding window detection and Region of Interest (ROI) extraction. In sliding window detection, a small

93 (a) Color image. (b) Thermal image.

(c) Reconstructed thermal image using trifocal tensor and disparity information. (d) Red-cyan anaglyph of color and reconstructed thermal images.

Figure 6.3: Proper alignment of color and thermal images using trifocal tensor.

sliding window is moved over the entire image, often at multiple scales, to perform an exhaustive search. Each window is classified, followed by some post-processing such as bounding box grouping. ROI extraction first finds potential candidates using pre-processing techniques based on, for example, color or pixel intensity, and then filters out negatives from these candidates using a classifier or other constraints. It is often more efficient, as the number of candidates is much smaller than the number of sliding windows. For pedestrian detection, both ROI extraction and sliding window detection have been employed in the literature. The sliding window detection method is a universal approach but is computationally expensive. On the other hand, ROI extraction is often used for thermal images, because pedestrians are often hotter than the surrounding environment and the ROIs can be segmented based on pixel intensity values. However, we find that ROI extraction on thermal images does not always work well. The assumption that pedestrians are hotter is not always true, for various reasons. For instance, a pedestrian wearing heavy layers of clothing does not appear with distinctively high pixel intensity values in a thermal image, and thus cannot be located by simple morphological operations. As another example, a road surface exposed to intense sunlight can have a higher temperature than human bodies. Although false positives introduced by hot objects such as vehicle engines can be filtered out in later steps, the loss of true positives becomes a serious problem. As a result, we feel the sliding window detection method is more reliable in these complex scenarios: the classifier can analyze the windowed samples thoroughly and make an accurate decision. Figure 6.4 shows some examples of our pedestrian samples in color images and the corresponding thermal images, where rows 1 and 3 are color samples corresponding to the thermal samples in rows 2 and 4, respectively.

Figure 6.4: Examples of pedestrians in color and thermal images.

However, the sliding window detection method also has its own drawbacks besides the much higher computational cost. The total number of windows in an image often reaches 10^5 or more. Even a fair classifier with a False Positives Per Window (FPPW) rate of 10^{-4} would still result in 10 False Positives Per Image (FPPI). Since 2009, the evaluation metric has been changed from FPPW to FPPI [38]. To address this problem, many state-of-the-art CNN-based classifiers have been proposed in recent years. An alternative approach is to combine information from additional sensors. Our proposed approach

of multi-spectral cameras is along this line.

6.3.4 Detection

In this chapter, we only compare the HOG and CCF methods for the task of pedestrian detection, for the following reasons:

1. The HOG method has always been included as a baseline on the Caltech-USA dataset. Among the 44 methods reported on the Caltech-USA dataset [38], 30 employed HOG or HOG-like features.

2. CCF is one of the best-performing methods reported on the Caltech-USA dataset as of May 2016. The idea of combining low-level CNN features and a boosting forest model is promising.

3. The goal of this chapter is to investigate the combination of multi-spectral cameras and the improvement it brings to pedestrian detection. We make our dataset public so that other researchers can continue this study and discover better solutions in the future.

HOG features have been widely used in object detection. The method defines overlapping blocks in a windowed sample, and cells within each block. Histograms of unsigned gradients over several orientation bins are computed in all blocks and concatenated as features. HOG features are often combined with an SVM and the sliding window method for detection at different scales. At the training stage, the positive samples are manually labeled. The initial negative samples are randomly selected from the training images, as long as they do not overlap with the positive samples. All samples are scaled to a standard window size

of 20 × 40 for training; the smallest sample in our data is 11 × 22. After the initial training, the detector is tested on the training set and more false positives are added back to the negative sample set. These false positives are often called hard negatives, and this procedure is often called hard negative mining. It can be repeated a few times until the performance improvement becomes marginal. Once the detector is trained, it is ready to perform detection on the test dataset and give a decision score for each window. Each frame, with an original size of 640 × 480, is scaled into different sizes, and the detector with a fixed size of 20 × 40 is then applied to the scaled images to find pedestrians of various sizes at different locations in a frame.

CCF uses low-level features from a pre-trained CNN model, cascaded with a boosting forest model such as Real AdaBoost [97] as a classifier. The lower-level features from the CNN are considered generic descriptors for objects, which contain richer information than channel features, while the boosting forest model replaces the remaining parts of the CNN. Thus we avoid training a complete end-to-end CNN model for a specific object detection application, which would require substantial computation, storage and time. In our experiment, we apply settings similar to those described in [43], except for the parameters of the scales and the number of octaves, in order to detect far-away pedestrians that are as small as 20 × 40 pixels. The conv3-3 layer of the VGG-16 model is used for feature extraction. The windowed sample size is 128 × 64 instead of 20 × 40, and the feature vector length for a 20 × 40 sample is 1296. The training samples of CCF are taken from the training stage of HOG, similar to the method described in [43], which uses aggregated channel features (ACF) [98] to select training samples for CCF. Caffe [99] is used for CCF feature extraction on a GPU-based computer platform. At the test stage, the CCF method running on the GPU platform is

considerably faster than the HOG method, but it requires more memory and disk space for data storage.
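The multi-scale scanning procedure described above for the fixed-size 20 × 40 detector can be sketched as follows. This is an illustrative sketch: `score_fn` stands in for the trained HOG+SVM (or any window classifier), and the stride, pyramid step, and score threshold are placeholder values. OpenCV's `cv2.resize` is used for rescaling.

```python
import cv2

def detect_multiscale(frame, score_fn, win=(20, 40), stride=4,
                      scale_step=1.2, thresh=0.0):
    """Run a fixed 20x40 window detector over an image pyramid and return
    boxes as (x, y, w, h, score) in the coordinates of the original frame."""
    ww, wh = win
    boxes, scale = [], 1.0
    img = frame
    while img.shape[0] >= wh and img.shape[1] >= ww:
        for y in range(0, img.shape[0] - wh + 1, stride):
            for x in range(0, img.shape[1] - ww + 1, stride):
                s = score_fn(img[y:y + wh, x:x + ww])   # classify one window
                if s > thresh:
                    boxes.append((int(x * scale), int(y * scale),
                                  int(ww * scale), int(wh * scale), s))
        scale *= scale_step                              # next pyramid level
        img = cv2.resize(frame, None, fx=1.0 / scale, fy=1.0 / scale)
    return boxes  # followed in practice by bounding-box grouping
```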

6.3.5 Information fusion

The idea of combining the information from the color image, disparity map and thermal data for decision making is referred to as information fusion. One approach is to concatenate these features together [56]: a single classifier can be trained on the concatenated features, and the final decisions for the test instances can be obtained from that classifier. This approach has the disadvantage that classifier training becomes challenging as the feature dimension increases. Furthermore, if a new type of feature needs to be added or an existing feature needs to be removed, the classifier must be re-trained, which is time consuming. An alternative approach to information fusion is to employ multiple classifiers; an example can be found in [100]. Each classifier makes a decision on a certain type or subset of features, and the final result is obtained using a decision fusion technique such as majority voting or the sum rule [101]. This approach has the advantage that the structure of the system is reconfigurable: without re-training the classifiers, adding or removing different types of features becomes very convenient. Therefore, we choose the latter approach to make our system reconfigurable, so that various settings and methods can be evaluated. Specifically, an SVM is used at the decision fusion stage, with the confidence scores from the classifiers in the previous stage as its inputs, which is more appropriate than commonly used statistical decision fusion methods in the case of multi-source data [102, 103]. The data from different sources are often not equally reliable, and neither are the classifiers, so the confidence scores must be weighted when obtaining the final decision from information fusion.
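A minimal sketch of this score-level fusion stage using scikit-learn's SVC follows; the score values, labels, and the linear kernel are illustrative choices, not the exact configuration used in our system.

```python
import numpy as np
from sklearn.svm import SVC

# Each row holds the confidence scores that the per-modality classifiers
# (color, thermal, disparity) assigned to one candidate window; y marks
# whether that window is truly a pedestrian.  The numbers are made up.
scores_train = np.array([[0.9, 1.2, 0.3],
                         [0.2, -0.5, 0.1],
                         [1.1, 0.8, 0.9],
                         [-0.3, 0.4, -0.2]])
y_train = np.array([1, 0, 1, 0])

fusion = SVC(kernel="linear")        # learns how much weight each source gets
fusion.fit(scores_train, y_train)

new_window_scores = np.array([[0.7, 1.0, 0.0]])
print(fusion.predict(new_window_scores))   # fused decision for a new window
```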

6.3.6 Additional constraints

6.3.6.1 Disparity-size

Besides the features extracted from an image frame, additional constraints can be incorporated into the decision fusion stage to further improve the detector performance. An example is the disparity-size relationship. Figure 6.5 shows the disparity and height relationship of the positive samples in the form of a linear regression line d = [h  1] B, where d is the mean disparity, h is the height of the sample, and B is a 2 × 1 coefficient matrix. Given the mean disparity \hat{d} and height \hat{h} of a sample, the residual r = |\hat{d} - [\hat{h}  1] B| can be used to estimate whether the sample is possibly a pedestrian or not. From Figure 6.5 we can see that a number of samples have very small mean disparity and lie far below the regression line. This is because the disparity information is not accurate when an object is far away from the camera. In fact, the stereo vision camera we use automatically clamps the disparity value beyond a certain distance; objects beyond that distance yield zero disparity, which makes the estimation for small samples inaccurate.
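A sketch of fitting the disparity-size regression and computing the residual is given below; the sample heights and disparities are made-up numbers for illustration only.

```python
import numpy as np

# heights (pixels) and mean disparities of labeled pedestrian samples
# (illustrative values, not measurements from our dataset)
h = np.array([30, 45, 60, 80, 120], dtype=float)
d = np.array([4.0, 6.2, 8.1, 10.9, 16.5])

# fit d = [h 1] B by least squares, matching the regression line of Fig. 6.5
A = np.column_stack([h, np.ones_like(h)])
B, *_ = np.linalg.lstsq(A, d, rcond=None)

def disparity_size_residual(height, mean_disp):
    """Residual r = |d_hat - [h_hat 1] B|: a large residual suggests the
    candidate's size is inconsistent with its disparity (likely not a
    pedestrian), subject to the far-range clamping caveat noted above."""
    return abs(mean_disp - np.array([height, 1.0]) @ B)

print(disparity_size_residual(50, 7.0))
```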

6.3.6.2 Road horizon

During detection, a few reasonable assumptions can be made to filter out more false positives while retaining the true positives. The assumptions vary depending on the application and may involve color, shape, position, etc. One assumption here is that pedestrians stand on the road, i.e., the lower bound of a pedestrian must be below the road horizon. The road horizon can be automatically detected in an image. This kind of simple constraint may or may not improve the detector performance, and

Figure 6.5: The relationship between the mean disparity and the height of an object.

experiments should be carried out to determine its effectiveness.

6.4 Performance Evaluation

There are a total of 58 labeled video sequences in our dataset. We use 39 of them for training and the remaining 19 for testing. Figure 6.6 shows the performance of different settings, including disparity map, color image, thermal data, and their combinations, all based on HOG features. Generally, the more types of information are used, the better the performance. The disparity-only setup performs the worst. The color-image-only setup is better, followed by the combination of color and disparity. Note that the thermal-only setup outperforms the combination of color and disparity; the heat signature of pedestrians seems more recognizable in thermal images. The combination of color, thermal and disparity information achieves the best performance, with about a 36% log-average miss rate (MR).

Figure 6.7 shows the performance of the HOG features with the disparity-size information and road horizon constraint added. The road horizon improves the log-average MR by about 5%. Despite little improvement from adding the disparity-size information alone, the combination of both provides nearly a 7% improvement in log-average MR.

Figure 6.8 shows the performance of different settings using CCF. The performance of disparity only is again the worst, while the thermal image performs very well. However, it is interesting to see that disparity does not provide any improvement when combined with color or thermal; in fact, combining with disparity results in lower performance. This is due to the fact that the CCF implementation accepts 8-bit images as input, so the precision of the disparity values is reduced. In comparison, CCF outperforms HOG

Figure 6.6: Performance of different input data combinations, all using HOG features.

Figure 6.7: Performance improvement by adding disparity-size and road horizon constraints.

Figure 6.8: Performance of different input data combinations, all using CCF.

in almost all settings except for disparity. The best performance comes from CCF with the combination of color and thermal, which achieves a 9% log-average MR. Similarly, we also attempted to add the disparity-size information and road horizon constraint to the CCF method, but the performance changes are negligible.

6.5 Discussion

Although the combination of multi-spectral cameras can improve the performance of pedestrian detection, the performance is still highly dependent on the instrument. Our thermal camera has a resolution of 640 × 480, which is relatively low.

Figure 6.9: A pedestrian is embedded in the shadow of a color image.

To accommodate the resolution and FOV of the thermal camera, the color cameras have to be set to the same resolution. In addition, color cameras are sensitive to the lighting condition, so the image quality sometimes cannot be guaranteed. Figure 6.9 shows an example, with a bounding box drawn on the detected pedestrian in both the color and thermal images. It is obvious that the thermal image provides much better information about the presence of the pedestrian, who is hardly identifiable in the color image due to the shadow. Although thermal images seem to be dominant in our experiment, their reliability still needs improvement. Figure 6.10 shows a thermal image taken on a hot sunny day. The two circled pedestrians are not bright compared to the surroundings, which contradicts the assumption of distinct thermal intensity made in many existing research works. In such cases, methods or operations based on pixel intensity values, such as intensity thresholding or head recognition using hot spots, become unreliable. On the contrary, some shape or gradient based methods may still perform well, such as the HOG and CCF described in this chapter.

Figure 6.10: An example thermal image with two pedestrians.

6.6 Conclusions

In this chapter, a novel pedestrian detection instrumentation is designed using both thermal and RGB-D stereo cameras. Data are collected from on-road driving and an experimental dataset is built with pedestrians labeled as ground truth. A reconfigurable multi-stage detector framework is proposed. Both HOG and CCF based detection methods are evaluated using the multi-spectral dataset with various combinations of thermal, color, and disparity information. The experimental results show that CCF significantly outperforms the HOG features. The combination of color and thermal images using the CCF method results in the best performance of 9% log-average miss rate. For future work, other advanced feature extraction and classification methods will be considered to further improve the pedestrian detector performance.

Chapter 7

End-to-End Learning for Lane Keeping of Self-Driving Cars

Lane keeping is an important feature for self-driving cars. This chapter presents an end-to-end learning approach to obtaining the proper steering angle to maintain the car in the lane. The convolutional neural network (CNN) model takes raw image frames as input and outputs the steering angles accordingly. The model is trained and evaluated using the comma.ai dataset, which contains front view image frames and the steering angle data captured while driving on the road. Unlike the traditional approach, which manually decomposes the autonomous driving problem into technical components such as lane detection, path planning and steering control, the end-to-end model can directly steer the vehicle from the front view camera data after training; it learns how to keep in lane from human driving data. Further discussion of this end-to-end approach and its limitations is also provided.

7.1 Introduction

Lane keeping is a fundamental feature for self-driving cars. Despite the many sensors installed on autonomous cars, such as radar, LiDAR, ultrasonic sensors and infrared cameras, ordinary color cameras are still very important for their low cost and ability to obtain rich information. Given an image captured by the camera, one of the most important tasks for a self-driving car is to find the proper vehicle control input to maintain it in lane. The traditional approach divides the task into several parts, such as lane detection [104, 105], path planning [106, 107] and control logic [108, 109], and they are often researched separately. The lane markings are usually detected by image processing techniques such as color enhancement, the Hough transform and edge detection. Path planning and control logic are then performed based on the lane markings detected in the first stage. In this approach, the performance relies heavily on the feature extraction and interpretation of the image data, and the manually defined features and rules are often not optimal. Errors can also accumulate from one processing stage to the next, leaving the final result inaccurate. On the other hand, an end-to-end learning approach for self-driving cars has been demonstrated in [70] using convolutional neural networks (CNNs). End-to-end learning takes the raw image as input and outputs the control signal automatically. The model is self-optimized based on the training data and there are no manually defined rules. These are the two major advantages of end-to-end learning: better performance and less manual effort. Because the model is self-optimized based on the data to give maximum overall performance, the intermediate parameters are self-adjusted to be optimal. Moreover, there is no need to detect and recognize certain categories of pre-defined objects, to label those objects during training, or to design control logic

based on observation of these objects. As a result, less manual effort is required. Figure 7.1 compares the traditional approach with the end-to-end learning approach.

Figure 7.1: Comparison between the traditional approach and end-to-end learning.

This chapter presents the end-to-end learning approach to produce the proper steering angle from camera image data, aimed at maintaining the self-driving car in lane. The model is trained and evaluated using the comma.ai dataset, which contains image frames and the steering angle data captured while driving. The rest of the chapter is organized as follows. Section 7.2 provides the details of our implementation, including data pre-processing and the CNN architecture. The evaluation results are presented in Section 7.3, followed by discussions in Section 7.4 and conclusions in Section 7.5.

7.2 Implementation Details

7.2.1 Data pre-processing

The data used in this chapter are from the comma.ai driving dataset. The dataset contains 7.25 hours of driving data, including 11 video clips recorded at 20 Hz and other measurements such as steering angle, speed, GPS data, etc. The image frames

are of size 320 × 160 pixels and are cropped from the original video frames; the original frames are not provided by the dataset. An example frame from the dataset is shown in Figure 7.2.

Figure 7.2: An example image frame from the dataset.

For lane keeping, only the image frames and the steering angle data are used. The steering angle data are recorded at 100 Hz, and they are aligned with the image frames using the alignment stamps provided by the dataset. In case multiple steering angle samples correspond to the same image frame, their average is used to form a one-to-one mapping between each image frame and its corresponding steering angle. Before training the CNN model, the data need to be further processed. First of all, to simplify the problem, driving at night is not considered in this chapter, so the four clips recorded at night are excluded. Second, the data contain many scenarios, such as driving forward, changing lanes, making turns, driving on straight or curved roads, and driving at normal speed or moving slowly in a traffic jam. To train a lane keeping model, data that meet the following criteria are selected: driving

at normal speed, no lane changes or turns, and both straight and curved roads. After data selection, the remaining data come from 7 video clips with a total of about 2.5 hours. Finally, five video clips containing 152K frames are used for training and two video clips containing 25K frames are used for testing.

During the training stage, one important issue needs to be addressed: the training data are highly unbalanced, as shown in Figure 7.3. As highway roads tend to be mostly straight and curved roads make up only a small percentage, a model trained on these unbalanced data may tend to drive straight while still achieving a low loss. To remove this bias, the data for curved roads are up-sampled by a factor of five, where curved roads are defined as frames whose absolute steering angle is larger than five degrees. The data are then randomly shuffled before training.
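A sketch of this pre-processing is given below. It assumes `frames` is a NumPy array of image frames and that `frame_ts`/`angle_ts` are sorted timestamp arrays; the nearest-frame assignment stands in for the dataset's own alignment stamps, and the up-sampling factor and 5-degree cutoff follow the description above.

```python
import numpy as np

def align_and_balance(frames, frame_ts, angles, angle_ts, up_factor=5):
    """Average the 100 Hz steering samples that fall on each 20 Hz frame,
    then up-sample frames whose absolute steering angle exceeds 5 degrees."""
    idx = np.searchsorted(frame_ts, angle_ts, side="right") - 1
    per_frame = np.full(len(frames), np.nan)
    for i in range(len(frames)):
        matched = angles[idx == i]
        if matched.size:
            per_frame[i] = matched.mean()      # one angle per frame
    keep = ~np.isnan(per_frame)
    X, y = frames[keep], per_frame[keep]
    curved = np.abs(y) > 5.0                   # curved-road frames
    X = np.concatenate([X] + [X[curved]] * (up_factor - 1))
    y = np.concatenate([y] + [y[curved]] * (up_factor - 1))
    order = np.random.permutation(len(y))      # shuffle before training
    return X[order], y[order]
```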

7.2.2 CNN implementation details

The CNN architecture that we propose is shown in Figure 7.4; it is similar to those in [70] and [110] but much simpler. The loss layer used during training is the Euclidean loss, which computes the sum of squared differences between the predicted

steering angle and the ground truth steering angle:

\frac{1}{2N} \sum_{i=1}^{N} \| x_i^1 - x_i^2 \|_2^2

The CNN model is trained using Caffe [99]. It consists of three convolutional layers and two fully connected layers. The input layer is the raw RGB image, and the output layer is the predicted steering angle for the input image. The first convolutional layer uses a 9×9 kernel and a 4×4 stride, and the following two convolutional layers use a 5×5 kernel and a 2×2 stride. The convolutional layers are mainly for feature extraction and the fully connected layers are mainly for steering angle prediction, but there is no clear boundary between them since the model is trained end-to-end. Dropout layers are used for preventing

Figure 7.3: Histogram of steering angles in training data.

Figure 7.4: The proposed CNN architecture for deep learning.

over-fitting. There are no pooling layers because the feature maps are small. The CNN architecture, as well as the hyper-parameters used, can be further tuned through more experiments. Overall, the CNN architecture is not the major concern of this work, for two reasons. First, we feel the dataset is too small. Although the training and testing data contain more than 170K frames, equal to about 2.5 hours of driving, this is actually insufficient to train a generic lane keeping model that uses raw images as input. The appearance of the roads can be very complex due to different curves, road markings, lighting conditions, etc. In fact, the proportion of data for curved roads is relatively small, with only about 20 minutes of driving. For training a model that outputs a continuous predicted steering angle, this amount of data is not sufficient. The other reason is that tuning a model requires a proper evaluation metric, which is also limited by the current dataset. The details of the evaluation method will be discussed in Section 7.4.
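The description above maps onto a small regression network. The following PyTorch sketch is our approximation only: the original model was trained in Caffe, and the channel counts, fully connected sizes, and dropout rate are not specified in the text, so those values are placeholders. The 7 × 17 feature map size follows from the stated kernels and strides for a 320 × 160 input.

```python
import torch
import torch.nn as nn

class LaneKeepNet(nn.Module):
    """Three conv layers + two fully connected layers, 320x160 RGB input,
    single steering-angle output; channel widths and FC sizes are guesses."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=9, stride=4), nn.ReLU(),   # 9x9, stride 4
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),  # 5x5, stride 2
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),  # 5x5, stride 2
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(64 * 7 * 17, 512), nn.ReLU(),   # 7x17 map for 160x320 input
            nn.Dropout(0.5),
            nn.Linear(512, 1),                        # predicted steering angle
        )

    def forward(self, x):                             # x: (N, 3, 160, 320)
        return self.regressor(self.features(x))

model = LaneKeepNet()
criterion = nn.MSELoss()          # matches the Euclidean loss up to a constant factor
out = model(torch.zeros(2, 3, 160, 320))
loss = criterion(out, torch.zeros(2, 1))
```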

7.3 Evaluation

The trained model is evaluated using two test video clips containing 25K frames. For each frame, the predicted steering angle is compared with the ground truth value. The histogram of the error is shown in Figure 7.5. The standard deviation of the error is 3.26 degrees and the mean absolute error is 2.42 degrees. To better understand the errors, the predicted angle and the ground truth angle are compared in each frame and the results can be visualized. Figure 7.6 shows an example frame along with the ground truth angle and the predicted angle. The projected paths for both angles are plotted using the same approximation as in [110]. The path using the ground truth angle is in blue and the path using

Figure 7.5: Histogram of error of predicted steering angles during test.

Figure 7.6: An example frame with the ground truth angle, the predicted angle and their respective projected paths.

the predicted angle is in green. The simulated steering wheels for both angles are also drawn for better visualization. Figure 7.7 also visualizes the feature maps from the first two convolutional layers. The top-right 4 × 4 cells are results from the first convolutional layer, and the bottom 4 × 8 cells are results from the second convolutional layer. As expected, the convolutional layers automatically learned to extract the lane markings as a kind of feature during training. The model does not use any manually defined or hand-crafted features, since it can learn useful features from the data automatically.

Figure 7.7: Visualization of the results from the first two convolutional layers.

7.4 Discussion

7.4.1 Evaluation

As an evaluation metric, computing the difference between the ground truth angle and the predicted angle is actually questionable. Firstly, the ground truth provided by the human driver is not globally optimal: the human driver cannot maintain the vehicle in the center of the lane all the time. As long as the vehicle stays in lane, the predicted angles are fine and do not have to be exactly the same as the human driver's. Secondly, both the vehicle movement and the steering control are continuous, so frame-by-frame evaluation is not appropriate. Consider two scenarios on a straight road. In the first scenario, the steering angle turns to the left a bit, then quickly turns to the right a bit to maintain the vehicle in the lane, and this process is repeated. In the second scenario, the steering angle turns to the left a bit and stays at

that angle for a period of time, then it turns to the right a bit and stays for a while. In the second scenario, the vehicle would actually drive out of the lane most of the time. In these two scenarios, the histogram of the errors, the mean absolute error and the standard deviation of the error are the same; however, the first scenario is fine while the second one is completely unacceptable. Figure 7.8 shows an example of the disadvantage of this type of frame-by-frame evaluation. The frames and their predicted angles are from the test dataset, and the 5 frames are in chronological order. We can see that the middle frame has a huge error of 10 degrees. However, the recorded ground truth does not seem correct in this frame: by looking at the previous and following frames, we find that the ground truth in this frame is transitioning from left to right. This example shows that evaluating the error frame by frame is not appropriate.

To solve this problem, a simulator is needed to provide feedback based on the predicted angle. The simulator should be able to generate the frames and simulate the vehicle movement realistically, with the frames generated according to the vehicle position and orientation. One way to do so is to use a virtual game engine, as described in [11, 111]. The advantage of using a virtual engine is that physics simulation and other mechanisms are built in, so the vehicle movement simulation and frame generation can be done realistically. Besides, the ground truth information is very rich in the virtual world: information such as vehicle position, orientation and velocity can be easily obtained, as can that of other objects. The disadvantage is that the frames are computer-generated graphics, not real images captured from driving in the real world. Although they look very realistic with a state-of-the-art game engine, the details and variations they provide still cannot match data from the real world. Alternatively, we can generate the next frames according to the control inputs using


Figure 7.8: An example of the disadvantage of frame-by-frame evaluation with 5 consecutive frames: the error in the middle frame is false.

recorded frames, i.e., data captured in the real world. This can be achieved by either a learning approach [112] or a 3D image projection approach [70]. The learning approach trains auto-encoders to embed road frames and learns a transition model in the embedded space; the next few frames can then be generated from the current frame image and the current control inputs. On the other hand, the 3D image projection approach assumes the ground is a flat surface and solves the 3D geometry [113] to generate the next frame from the actual recorded frame, using the predicted camera shift and rotation. The camera shift and rotation can be obtained from vehicle movement simulation, which can be computed using vehicle kinematic or dynamic models [108, 109].
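As one common choice of such a vehicle model, a kinematic bicycle model can provide the pose update that drives the flat-ground image projection described above. The sketch below is illustrative only; the wheelbase, time step, and the example speed and steering values are placeholders, not parameters from the dissertation.

```python
import math

def bicycle_step(x, y, yaw, v, steer, dt=0.05, wheelbase=2.7):
    """One step of a kinematic bicycle model: given the current pose,
    speed v (m/s) and steering angle (rad), return the pose after dt
    seconds.  The resulting camera shift and rotation can then be used
    to project the next frame from the recorded one."""
    x += v * math.cos(yaw) * dt
    y += v * math.sin(yaw) * dt
    yaw += v / wheelbase * math.tan(steer) * dt
    return x, y, yaw

# simulate a short left curve at 20 m/s with a constant 2-degree steer angle
pose = (0.0, 0.0, 0.0)
for _ in range(20):
    pose = bicycle_step(*pose, v=20.0, steer=math.radians(2.0))
print(pose)
```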

7.4.2 Data augmentation

Since we are not supposed to drive off the lane when recording, the data obtained from human driving lack the error correction process. The human driver is able to maintain the vehicle within the lane, but a model trained on such data is not robust to errors and the vehicle may slowly drift away. To train a model that can correct small errors such as vehicle shifts and rotations, error correction data must be provided during training. One solution is to perform data augmentation by randomly creating shifts and rotations and generating the corresponding frames based on the 3D geometry described above. The correction control input can again be computed using the vehicle kinematic or dynamic models. The comma.ai dataset does not contain the original sized frames or camera calibration parameters, so the simulator and data augmentation are not included in this chapter. Our current work collects real-world data using multiple cameras, and all the aforementioned techniques will be incorporated in future work.

7.5 Conclusions

This chapter presents the end-to-end learning approach to lane keeping for self-driving cars, which can automatically produce proper steering angles from image frames captured by the front-view camera. The CNN model is trained and evaluated using the comma.ai dataset, which contains image frames and the steering angle data captured from road driving. The test results show that the model can produce relatively accurate steering of the vehicle. Further discussions on evaluation and data augmentation are also presented for future improvement.

Chapter 8

Building an Autonomous Lane Keeping Simulator Using Real-World Data and End-to-End Learning

Autonomous lane keeping is an important safety feature for intelligent vehicles. This chapter presents a state-of-the-art end-to-end learning method using a convolutional neural network (CNN) that takes front view camera data as input and produces the proper steering wheel angle to keep the vehicle in lane. A novel method of data augmentation is proposed using a vehicle dynamic model and vehicle trajectory tracking, which can create additional training data as if the vehicle drives off-lane at a random displacement and orientation. Real-world driving data are recorded from three front-view cameras on the left, center, and right. A lane keeping simulator is built using the recorded data in conjunction with image projection and vehicle dynamics estimation.

124 tion. Experimental results demonstrate that the end-to-end learning method with augmented data can achieve high accuracy for autonomous lane keeping and very low failure rate. The simulator can serve as a platform for both training and evaluation of vision-based autonomous driving algorithms. The experimental dataset is made available at http://computing.wpi.edu/dataset.html.

8.1 Introduction

Lane keeping is a fundamental feature for intelligent and autonomous vehicles. Despite the many sensors installed on autonomous cars, such as radar, LiDAR, ultrasonic sensors and infrared cameras, ordinary color cameras are still very popular owing to their low cost and ability to obtain rich information. Given the video images from the front-view camera, a vision-based lane keeping system can automatically output the proper steering angles to maintain the vehicle in lane. A traditional framework divides the task into several stages, including lane detection [104, 105], path planning [106, 107] and control logic [108, 109]. Applying image processing techniques such as color enhancement, Hough transform and edge detection, the lane detection stage identifies the lane markings on the road. Path planning and control logic are then employed to provide the proper steering angle adjustment for the vehicle. In this approach, the performance of lane detection relies heavily on the feature extraction and interpretation of image data. Errors can also accumulate from one processing stage to the next, leaving the final control output less accurate. In contrast, an end-to-end learning method has the advantages of better performance and less manual effort. End-to-end learning for self-driving cars has been successfully demonstrated in [70] using convolutional neural networks (CNNs), which take the images from cameras as input and produce the vehicle control output automatically.

Figure 8.1: Comparison between the traditional framework and end-to-end learning.

The model is self-optimized based on the training data and does not need manually defined features. The user does not need to label the detected objects and their categories during the training process. Figure 8.1 is a comparison between the traditional framework and the end-to-end learning approach for vision-based automatic lane keeping. Although the approach of end-to-end learning for lane keeping is not new, the existing work has several deficiencies. For instance, the error difference between the recorded “ground truth” and the predicted steering angle is not the best evaluation metric. Since it is hardly possible for a human driver to maintain the vehicle perfectly in the center of the lane at all times, the recorded angles are not optimal. Thus, the predicted angles do not have to be exactly the same as the ground truth angles recorded from the human driving experience. It is more important to predict the position and orientation of the vehicle in the very next time step given the current vehicle speed and steering angle control. As long as the vehicle stays in lane, the steering angle is acceptable. By using a simulator, the effects of the control input can be simulated and monitored, thereby providing a more reliable evaluation metric.

Furthermore, we need to provide data to train the deep neural network to take appropriate steering angle actions when the vehicle drifts away from the center of the lane. However, the recorded driving data lack this type of action since it is unsafe to drive off the lanes during data collection. To solve this dilemma, we propose a data augmentation method based on a vehicle dynamic model and vehicle trajectory tracking. Given any displacement and orientation, the model can generate a projected trajectory and a sequence of steering angle controls. Correspondingly, we can also create the augmented front views using image projection based on the shifted location and orientation. Therefore, the system becomes a simulator that can not only generate augmented data for training the convolutional neural network but also be used as a platform to evaluate the performance of other vision-based lane keeping algorithms. The main contributions of this chapter are listed as follows:

1. This chapter presents a simulator for vision-based autonomous lane keeping. Although there are many recent works on lane keeping algorithms, it is hard to compare and evaluate them. Built on the recorded driving data, this simulator employs image projection, vehicle dynamics modeling, and vehicle trajectory tracking to predict vehicle movement and its corresponding camera views. The simulator can be used for both training and evaluation of lane keeping algorithms.

2. An end-to-end learning method is proposed that can generate proper steering angles from front-view camera data, which can maintain the vehicle in lane. A highly effective end-to-end learning system is demonstrated using the aforementioned simulator. The CNN model trained with augmented data from the simulator performs significantly better than the model trained with recorded data only.

3. A completely new dataset for autonomous lane keeping is developed and was made available at http://computing.wpi.edu/dataset.html. The dataset contains recorded video frames from three forward-facing cameras (left, center, and right) as well as steering wheel angles and vehicle speed information.

The rest of the chapter is organized as follows. Section 8.2 provides the implementation details of our simulator, including image projection, vehicle dynamics, vehicle trajectory tracking as well as the CNN architecture. The experiment and evaluation results are presented in Section 8.3, followed by discussions in Section 8.4 and conclusions in Section 8.5.

8.2 Building a Simulator

8.2.1 Overview

For the evaluation of vision-based lane keeping algorithms, a simulator is needed to provide feedback based on the predicted angle. The simulator can generate image frames according to the vehicle position and orientation, and it can also simulate the vehicle movement given a steering angle input. Therefore, a simulator for self-driving cars has two important components: a graphics engine and a physics engine. The graphics engine utilizes the information of the surrounding environment, as well as the pose of the camera, to generate images. The physics engine simulates vehicle movement based on the input control actions. A virtual game engine usually contains both graphics and physics engines, and some autonomous driving simulators were built upon it [11, 111].

Vehicle movement simulation and frame generation can be integrated into the game engine. Besides, the ground truth information is very rich in the virtual world: information such as vehicle position, orientation and velocity can be easily obtained, and so can that of other objects. Despite these advantages, a significant drawback of these virtual simulators is that the generated images are still quite different from real-world data. Although they look very realistic with advanced graphics techniques, the details and variations of virtual images still cannot match data from the real world. It is risky to train a model using virtual game engines and then deploy the model for real-world driving. It would be better to build a simulator from real-world data. Different camera views can be generated from recorded video frames by a learning approach [112] or a 3D image projection approach [70]. The learning approach trains auto-encoders to embed road frames and learns a transition model in the embedded space; the next few frames can then be generated based on the current frame image and the current control inputs. On the other hand, the 3D image projection approach assumes the ground is a flat surface and solves the 3D geometry [113] to generate the next frame based on the actual recorded frame. The camera shift and rotation can be obtained from vehicle movement simulation, which can be estimated using vehicle kinematic or dynamic models [108, 109]. In our simulator, the image projection approach is employed for rendering the images. The CNN takes the image as input, and the vehicle dynamics is used to simulate vehicle movement given the control action. Figure 8.2 shows the detailed operations of the simulator when testing the CNN-based lane keeping algorithm. The predicted position is constantly validated against the ground truth position, and a failure is recorded if the error exceeds a threshold value.

More importantly, the simulator can be very useful when training the neural network by providing a large amount of additional training data through augmentation. When using the simulator for training, the vehicle trajectory tracking replaces the CNN controller to provide the control actions that can gradually correct an initial position shift and/or orientation rotation. Practically, assuming an arbitrary shift and rotation of the vehicle from the ground truth, the vehicle trajectory tracking block can produce the proper steering angle control actions. Combined with the generated camera view from the image projection process, augmented data can be generated. Figure 8.3 shows the operation flow of the simulator in the training phase, during which many augmented data can be generated from each ground truth image by an arbitrary shift and rotation of the vehicle.
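To make the test-phase flow concrete, the following is a minimal Python sketch of the closed-loop evaluation in Figure 8.2. The helpers predict_angle (the CNN) and render_view (the image projection of Section 8.2.2) are hypothetical placeholders, a simple kinematic bicycle update stands in for the full vehicle dynamics block, and the parameter values follow the estimates given later in this chapter.

```python
# A minimal sketch of the closed-loop test phase (Figure 8.2), not the exact
# implementation used in this dissertation. `predict_angle` and `render_view`
# are hypothetical stand-ins for the CNN and the image projection step.
import math

L_F, L_R = 1.0, 1.7       # distances from mass center to front/rear axles (m)
STEER_RATIO = 17.8        # steering wheel angle / road wheel turning angle
DT = 0.1                  # 10 Hz frames
FAIL_THRESHOLD = 1.0      # lateral error threshold (m)

def bicycle_step(x, y, psi, v, steer_wheel_deg):
    """One kinematic bicycle update, standing in for the dynamics block."""
    sigma_f = math.radians(steer_wheel_deg) / STEER_RATIO
    beta = math.atan(L_R / (L_F + L_R) * math.tan(sigma_f))
    x += v * math.cos(psi + beta) * DT    # theta = psi + beta
    y += v * math.sin(psi + beta) * DT
    psi += v / L_R * math.sin(beta) * DT
    return x, y, psi

def run_test(sequence, predict_angle, render_view):
    """sequence: ground-truth records with keys 'x', 'y', 'psi', 'v'."""
    x, y, psi = sequence[0]["x"], sequence[0]["y"], sequence[0]["psi"]
    failures = 0
    for gt in sequence:
        frame = render_view(x, y, psi)      # projected front view at the current pose
        steer = predict_angle(frame)        # CNN steering wheel angle (degrees)
        x, y, psi = bicycle_step(x, y, psi, gt["v"], steer)
        dx, dy = x - gt["x"], y - gt["y"]
        lateral = -dx * math.sin(gt["psi"]) + dy * math.cos(gt["psi"])  # horizontal shift only
        if abs(lateral) > FAIL_THRESHOLD:
            failures += 1
            x, y, psi = gt["x"], gt["y"], gt["psi"]   # reset to ground truth after a failure
    return failures
```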

8.2.2 Image projection

Rendering the image according to the vehicle position and orientation is required by the simulator, in order to provide more instances for machine learning and a better evaluation metric. However, without using a gaming engine, data collected in the real world are sparse, often along a single trajectory as the car goes. These data themselves are far from enough to cover all possible positions and orientations. Therefore, these data must be transformed for an arbitrary position and orientation, using computer vision knowledge of image projection based on 3D geometry. Given a point in world coordinates X_w = (x_w, y_w, z_w) and the corresponding point in image coordinates p = (p_1, p_2), we have the relations

$$p^h = X_w^h M_{ex} M_{in} \qquad (8.1)$$

$$X_w^h = c\,(x_w,\; y_w,\; z_w,\; 1)$$

$$p^h = d\,(p_1,\; p_2,\; 1)$$

where p^h and X_w^h are 1 × 3 and 1 × 4 homogeneous coordinates, c and d are arbitrary nonzero constants, M_ex is the 4 × 3 extrinsic matrix and M_in is the 3 × 3 intrinsic matrix. The extrinsic matrix contains a rotation matrix and a translation vector, which define the camera's position and orientation in the world coordinates. Therefore the extrinsic matrix changes if the camera is shifted or rotated. The intrinsic matrix defines the transformation from camera coordinates to image coordinates, including parameters such as focal length, aspect ratio, location of the principal point, etc. The intrinsic matrix stays the same even if the camera is shifted or rotated. The extrinsic and intrinsic matrices can be obtained through a calibration procedure.

Figure 8.2: The flowchart of the test phase.

Figure 8.3: The flowchart of the training phase, using original data and augmented data.

Given an image taken in the real world with known calibration parameters M_ex, M_in and its pixel coordinates p, the new pixel coordinates p̃ need to be found with a new extrinsic matrix M̃_ex when the camera is shifted and rotated. The physical geometry of the 3D scene is required in order to find the projection parameters. In the case of highway lane keeping simulation, we make the assumption that the ground surface is flat, i.e., z_w = 0. According to formula 8.1, the mapping of p to p̃ can then be obtained as follows:

$$X_w^h = p^h M_{in}^{-1} M_{ex}^{-1} \qquad (8.2)$$

$$\tilde{p}^h = X_w^h \tilde{M}_{ex} M_{in} \qquad (8.3)$$

Note that the lens distortion, if any, needs to be corrected before performing such an image projection. Figure 8.4 shows some examples of transforming an original image according to the camera's virtual position and orientation.

The additive black area on the generated image is usually not an issue for vehicle simulation, since the captured images from front-view cameras are often cropped to retain only the middle section as the region of interest. Another challenging task is ground surface estimation during calibration. To estimate the calibration parameters, especially M_ex in formula 8.1 with the assumption z_w = 0 for the ground surface, the three cameras used in our system need to be deployed on the vehicle, and the world coordinates need to be established properly. When calibrating the cameras in the lab, a checkerboard pattern is usually used, as shown in Figure 8.4. However, estimating the ground surface needs a very large pattern, which is hard to craft and deploy. In our experiment, a flat parking lot with existing markings is used for ground surface estimation. Physical dimensions of the markings are measured manually while the corresponding images are captured by the cameras installed on the vehicle. Figure 8.5 shows the selected points in the image taken by the center camera during the calibration. The physical locations of the cameras and the selected points in the world coordinates are also shown in Figure 8.5. Three cameras are installed on the left, center and right of the vehicle, all facing forward, because they provide a better field of view than a single camera. In fact, the camera nearest to the vehicle's virtual position is selected as the source in equations 8.2 and 8.3. Therefore, the generated images have better quality and fewer additive black areas after projection.
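As an illustration of the flat-ground projection in equations 8.2 and 8.3, the sketch below composes the two ground-plane homographies with NumPy and warps the original image with OpenCV. It uses the usual column-vector convention (so the matrices appear transposed relative to the equations above), and the calibration values K, R and t are assumed to come from the procedure just described.

```python
# A sketch of the flat-ground (z_w = 0) image projection of equations 8.2-8.3,
# written in OpenCV's column-vector convention; not the exact code of this work.
import numpy as np
import cv2

def ground_homography(K, R, t):
    """Homography mapping ground-plane points (x_w, y_w, 1) to image pixels."""
    return K @ np.column_stack([R[:, 0], R[:, 1], t])

def reproject(image, K, R_orig, t_orig, R_new, t_new):
    """Generate the view from a shifted/rotated camera, assuming a flat ground."""
    H_orig = ground_homography(K, R_orig, t_orig)   # ground plane -> original image
    H_new = ground_homography(K, R_new, t_new)      # ground plane -> new camera view
    H = H_new @ np.linalg.inv(H_orig)               # original image -> new image
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h))
```

As noted above, lens distortion would be removed (for example with cv2.undistort) before this warp, and the camera nearest to the virtual pose would be chosen as the source image.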

8.2.3 Vehicle dynamics and vehicle trajectory tracking

According to [108], the bicycle vehicle dynamics shown in Figure 8.6 is captured by the following equations:

Figure 8.4: Example of original image and generated images given arbitrary camera poses. (a) Original image. A checkerboard pattern on a flat surface. (b) Generated image as if the camera is shifted left by 50 mm. (c) Generated image as if the camera is rotated right by 15.25 degrees. (d) Generated image as if the camera is shifted left by 50 mm and rotated right by 15.25 degrees.

Figure 8.5: Camera calibration and ground surface estimation. (a) Selected points in the image taken by the center camera. (b) Cameras and selected points in the world coordinates.

$$\begin{aligned}
\dot{x} &= v\cos\theta \\
\dot{y} &= v\sin\theta \\
\dot{\theta} &= \omega \\
\theta &= \psi + \beta \\
\dot{\psi} &= \frac{v}{l_r}\sin\beta \\
\dot{v} &= a \\
\beta &= \arctan\!\left(\frac{l_r}{l_f + l_r}\tan\sigma_f\right)
\end{aligned}$$

where P = (x, y, θ) ∈ R² × S¹ is the state of the position and orientation, v and ω are the linear velocity and angular velocity, respectively, which are also the control inputs, a is the acceleration, and σ_f is the turning angle. l_f and l_r are the distances from the vehicle's mass center to the front and rear axles. In our test vehicle, we use the estimated values l_f = 1 m and l_r = 1.7 m. The dynamics in Figure 8.6 are feedback linearized by introducing a nonlinear mapping from the current nonlinear system to a new linear system with a new state variable z = [x, y, ẋ, ẏ]:

$$\dot{z} = Az + Bu, \qquad
\begin{bmatrix} \dot{x} \\ \dot{y} \\ \ddot{x} \\ \ddot{y} \end{bmatrix}
= A \begin{bmatrix} x \\ y \\ \dot{x} \\ \dot{y} \end{bmatrix} + Bu$$

Figure 8.6: The virtual bicycle vehicle dynamics model.

where the state matrix A, the input matrix B and the input vector u in the new linear system are

$$A = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad
u = \begin{bmatrix} \ddot{x} \\ \ddot{y} \end{bmatrix}.$$

After the feedback linearization, the whole problem is transformed into searching for the proper gain K for the linear system. To solve this optimal control problem, a Linear Quadratic Regulator (LQR) is used to acquire the optimal gain K. The quadratic cost is defined as follows:

$$J = \int_0^{\infty} \left( x^\top Q x + u^\top R u \right) dt \qquad (8.4)$$

  1 0 0 0       0 1 0 0 1 0     where Q =   and R =  , x and u are the state and control effort 0 0 1 0 0 1     0 0 0 1 respectively. Practically, Q and R matrices do not have to be identity matrices but positive definite, and the entries can be tuned to achieve required performance accordingly. Once the gain K is computed, the feedback control law and the ordinary differential equation (ODE) of the new linear system are described as follows:

$$\begin{aligned}
e &= z - z_d \\
u &= -Ke + u_d \\
\dot{e} &= (A - BK)\,e \\
\dot{z} &= \dot{z}_d + \dot{e}
\end{aligned}$$

where e is the error between the true state and the desired state, K is the gain computed based on the cost defined in equation 8.4 with the A and B matrices, u ∈ R² is the input vector, u_d = (ẍ_d, ÿ_d) is the reference input given by the ground truth, and ė, ż, ż_d are the derivatives of the error, the state and the desired state, respectively. A is the 4 × 4 state matrix, and B is the 4 × 2 input matrix.

$$v = \dot{x}\cos\theta + \dot{y}\sin\theta \qquad (8.5)$$

$$\omega = \frac{1}{v}\left(\ddot{y}\cos\theta - \ddot{x}\sin\theta\right) \qquad (8.6)$$

The control input for the nonlinear system can then be calculated by remapping the new input variables of the linear system back to the original inputs of the nonlinear system, i.e., the linear velocity v and the angular velocity ω, as shown in equations 8.5 and 8.6. The results in Figure 8.7 demonstrate the effectiveness and correctness of the vehicle trajectory tracking controller design. A vehicle with the feedback control law has the capability of converging to and following the desired trajectory, even though there exists an initial error. At the beginning, owing to some error between the predicted and actual orientations, the steering angle is positive and large, which helps the vehicle correct its orientation in a short time. After 2 seconds, the predicted orientation and the ground truth converge. The vehicle orientation does not change rapidly for the next few seconds, which matches the fact that the steering angle of the vehicle remains in a small range near zero.
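The trajectory tracking controller can be summarized in a few lines. The sketch below builds the A and B matrices of the linearized system, obtains the LQR gain from the cost of equation 8.4 with SciPy's continuous-time Riccati solver, and remaps the new inputs back to (v, ω) with equations 8.5 and 8.6. It is a simplified sketch rather than the exact implementation; Q and R are taken as identity matrices as in the text.

```python
# A sketch of the feedback-linearized LQR trajectory tracker (Section 8.2.3).
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.block([[np.zeros((2, 2)), np.eye(2)],
              [np.zeros((2, 2)), np.zeros((2, 2))]])   # 4x4 state matrix
B = np.vstack([np.zeros((2, 2)), np.eye(2)])           # 4x2 input matrix
Q, R = np.eye(4), np.eye(2)                            # identity weights as in the text

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)                        # optimal gain, u = -K e + u_d

def tracking_control(z, z_d, u_d, theta):
    """z, z_d: current/desired [x, y, x_dot, y_dot]; u_d: desired [x_ddot, y_ddot]."""
    e = z - z_d
    u = -K @ e + u_d                                   # new linear-system input [x_ddot, y_ddot]
    x_dot, y_dot = z[2], z[3]
    v = x_dot * np.cos(theta) + y_dot * np.sin(theta)          # equation 8.5
    omega = (u[1] * np.cos(theta) - u[0] * np.sin(theta)) / v  # equation 8.6
    return v, omega
```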

Figure 8.7: Correction of vehicle's position and orientation using vehicle trajectory tracking. (a) Ground truth and predicted trajectory. (b) Ground truth and predicted orientation. (c) Ground truth and predicted steering wheel angle.

8.2.4 CNN implementation

Convolutional neural networks (CNNs) [40–42] have achieved impressive performance in image classification. In this chapter, learning the human driver's control is not a classification problem but a regression problem; therefore the loss layer during training is the Euclidean loss, which computes the sum of squared differences between the predicted steering angle and the ground truth steering angle: $\frac{1}{2N}\sum_{i=1}^{N}\lVert x_i^1 - x_i^2\rVert_2^2$, where N is the number of instances, $x_i^1$ is the i-th predicted value and $x_i^2$ is the i-th ground truth value. The CNN is used as a steering angle predictor given the input image. It does not take the entire image frame as input since only the center section is the region of interest for lane keeping. The images are cropped before being fed to the CNN, as shown in Figure 8.8. The proposed CNN architecture is shown in Figure 8.9, and it is based on PilotNet [70, 72]. It has 5 convolutional layers and 3 fully-connected layers. There are no pooling layers because the feature maps are small. The convolutional layers are mainly for feature extraction and the fully connected layers are mainly for steering angle prediction, but there is no clear boundary between them since the model is trained end-to-end. Unlike PilotNet, our input image size is 400 × 150 instead of 200 × 66, and the first convolutional layer has a 4 × 4 stride and 9 × 9 kernel instead of a 2 × 2 stride and 5 × 5 kernel. The PilotNet system uses the vehicle's turning radius r as the steering command and makes the inverse turning radius 1/r the output to avoid infinite numbers when driving straight. Our CNN uses the steering wheel angle as the output, which is more intuitive. The proposed CNN model is trained using our own dataset on the Caffe [99] and Matlab software platforms.
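For illustration, the following PyTorch sketch mirrors the architecture described above: a 400 × 150 input crop, a 9 × 9 first convolution with stride 4, five convolutional layers and three fully connected layers, and a single steering wheel angle output trained with a squared-error loss. The filter counts and the remaining kernel sizes follow PilotNet and are assumptions; the actual model in this chapter was defined and trained in Caffe.

```python
# A hedged PyTorch sketch of the PilotNet-style regressor, not the Caffe model
# actually trained in this work. Layer widths after the first layer are assumed.
import torch
import torch.nn as nn

class LaneKeepingCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=9, stride=4), nn.ReLU(),   # 150x400 -> 36x98
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),  # -> 16x47
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),  # -> 6x22
            nn.Conv2d(48, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 4x20
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 2x18
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 2 * 18, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),                # steering wheel angle (degrees)
        )

    def forward(self, x):
        return self.regressor(self.features(x))

model = LaneKeepingCNN()
loss_fn = nn.MSELoss()                       # Caffe's Euclidean loss differs only by a 1/2 factor
dummy = torch.zeros(1, 3, 150, 400)          # (batch, channels, height, width)
loss = loss_fn(model(dummy), torch.zeros(1, 1))
```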

Figure 8.8: An example of a cropped image frame from the dataset.

8.3 Experiment

8.3.1 Data collection

To capture images, three forward-facing cameras are mounted on the dashboard of the car, from left to right. Because the cameras are not waterproof, installing them on top of the vehicle is not appropriate. To avoid re-calibration each time, the cameras remain stationary once installed. Multi-threaded programming and software triggers are used to synchronize the three cameras to capture images at 10 Hz. The shutter time is set to auto with an upper-bound value to avoid an extremely low frame rate when the lighting condition is too dark. The image resolution is set to 1288 × 968, and captured images are stored as color image sequences. Meanwhile, the steering angle and speed information are recorded by accessing the CAN bus via the OBD-II port. The data from the OBD-II port are decoded by our customized program and then saved with time stamps, in order to synchronize with the image data. The steering wheel angle decoded from the OBD-II port has a precision of 0.07 degree and the speed data has a precision of 1 km/h, or approximately 0.28 m/s.

Figure 8.9: The CNN structure used, slightly modified from NVIDIA's PilotNet.

Figure 8.10: Our data collection system, including three forward-facing cameras, a USB hub, a laptop and access to the OBD-II port.

The steering wheel angle s needs to be converted to the vehicle's turning angle σ_f in Figure 8.6 by dividing by the steering ratio k, i.e., σ_f = s/k, where k has an estimated value of 17.8 in our experiment. Figure 8.10 shows our data collection system on a vehicle, including three forward-facing cameras, a USB hub, a laptop computer and an interface to the OBD-II port. The experimental data were collected on 7 occasions over 6 different days, approximately 1 hour each. Different lighting and weather conditions are included, such as sunny, cloudy and foggy, as shown in Figure 8.11. Night time driving is not included in our data. The collected data are then refined to be used for the task of lane keeping.


Figure 8.11: Example frames under different weather or lighting conditions. (a) Cloudy. (b) Shadowed. (c) Foggy. (d) Sunny.

Recorded data that meet any of the following criteria are discarded: non-highway driving, speed lower than 40 mph, change of lane, extreme lighting conditions, equipment failure, and sequences shorter than 1 minute. After refinement, about 3 hours of driving data are valid. Among the 7 groups of collected data, 4 groups were used for training and the other 3 groups for testing. This is to prevent overlaps between training and test data. Overall, the training data contain 68082 frames, nearly 2 hours at 10 Hz. The test data contain 32053 frames, nearly 1 hour at 10 Hz. The training data sequences are randomly shuffled before being applied to the CNN model.
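The refinement step can be expressed as a simple filter over the recorded sequences. The sketch below assumes each sequence carries per-frame speeds and a few flags for the discard criteria listed above; the field names are illustrative, not the actual format of our recordings.

```python
# A sketch of the data refinement filter described above; field names are assumed.
MIN_SPEED_MPH = 40
MIN_FRAMES = 60 * 10     # one minute of frames at 10 Hz

def refine(sequences):
    kept = []
    for seq in sequences:
        if not seq["is_highway"]:
            continue
        if seq["lane_change"] or seq["extreme_lighting"] or seq["equipment_failure"]:
            continue
        if min(seq["speed_mph"]) < MIN_SPEED_MPH or len(seq["frames"]) < MIN_FRAMES:
            continue
        kept.append(seq)
    return kept
```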

8.3.2 Data augmentation

Ideally, the training dataset should contain some error correction scenarios such that the trained CNN model is capable of handling errors, so that the vehicle stays in the lane instead of drifting away. Such error correction data introduce initial errors in the vehicle's position and/or orientation, and then provide the proper control action to correct such errors and guide the vehicle back to the lane. The original data collected from highway driving lack such error correction data, because of the safety concern of performing such dangerous maneuvers on the highway. Therefore, we propose to apply a data augmentation technique that can generate this type of error correction data virtually. This is one of the important benefits of building a simulator. Once the data are collected and the world coordinates established, it is possible to obtain the ground truth of the vehicle's position and orientation at any given time. For each frame, errors can be added manually to the vehicle's position and orientation. By using the knowledge of image projection and 3D geometry, the augmented images can be generated accordingly. At the same time, the correct control action is provided by the vehicle trajectory tracking algorithm. Therefore, the augmented data can be used as part of the training data to improve the model's robustness. In our experiment, each frame is randomly augmented 10 times by shifting the vehicle position and changing its orientation. Figure 8.3 shows the entire process of data augmentation. Figure 8.12 shows examples of augmented images.
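The augmentation loop itself only needs to sample a perturbation for each frame and query the two simulator components. The sketch below shows this bookkeeping with hypothetical helpers: render_from_pose stands in for the image projection of Section 8.2.2 and corrective_steering for the trajectory tracking controller of Section 8.2.3; the perturbation ranges are illustrative only.

```python
# A sketch of the per-frame augmentation bookkeeping (10 random perturbations per
# frame). The two helper functions are hypothetical stand-ins for the image
# projection and trajectory tracking blocks described earlier in this chapter.
import random

N_AUG = 10
MAX_SHIFT_M = 1.0        # illustrative lateral shift range
MAX_YAW_DEG = 7.0        # illustrative rotation range

def augment_frame(frame, pose, render_from_pose, corrective_steering):
    """Return (image, steering_label) pairs for one recorded frame."""
    samples = [(frame["image"], frame["steering"])]          # keep the original pair
    for _ in range(N_AUG):
        d_lat = random.uniform(-MAX_SHIFT_M, MAX_SHIFT_M)
        d_yaw = random.uniform(-MAX_YAW_DEG, MAX_YAW_DEG)
        image = render_from_pose(frame, pose, d_lat, d_yaw)  # image projection (Sec. 8.2.2)
        label = corrective_steering(pose, d_lat, d_yaw)      # trajectory tracking (Sec. 8.2.3)
        samples.append((image, label))
    return samples
```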

8.3.3 Evaluation using simulator

In our previous work [73], it is shown that the difference between the ground truth angle and the predicted angle is not an effective metric for evaluating the performance of lane keeping systems.


Figure 8.12: Example of an original image and augmented images given arbitrary vehicle poses. (a) Original image. (b) Augmented image as if the vehicle is shifted right by 0.5 m. (c) Augmented image as if the vehicle is rotated left by 7 degrees. (d) Augmented image as if the vehicle is shifted right by 0.5 m and rotated left by 7 degrees.

Hereby we propose a new metric: measuring the percentage of driving time during which the vehicle stays in lane. Our simulator can be employed as an evaluation platform for autonomous lane keeping. The process flow of using the simulator for evaluation is illustrated in Figure 8.2. Given the initial steering angle provided by the CNN model, the vehicle position and orientation are updated by the vehicle dynamics. Subsequently, a front-view camera image is generated through image projection according to the current vehicle position and orientation. The new image is then fed to the CNN model, which produces the steering angle for the next time step. The same process repeats for all frames in a test sequence. At each time step, the amount of position difference from the ground truth is calculated. For simplicity, the longitudinal difference is fixed to zero, and the horizontal shift is compared with a threshold value. If the horizontal shift is larger than the threshold, it is considered a lane keeping failure.

The threshold is set to 1 meter in our experiment. For each failure occurrence, the next 60 frames are automatically marked as a manual driving period. All other frames without failure are considered the autonomous driving period. The final criterion is the percentage of autonomous driving time (autonomy):

$$A = \frac{t_a}{t_a + t_m} \qquad (8.7)$$

where t_a and t_m represent the autonomous time and the manually controlled time, respectively. Figure 8.13 shows an example of the simulation results when comparing the vehicle positions with the ground truth. The steering angles are produced by the CNN model trained with data augmentation. In our experiment, the CNNs trained with and without augmented data are both evaluated using the simulator, and the results are shown in Table 8.1. The error of position is only evaluated when the vehicle is in autonomous driving mode; the data during the manually controlled time in simulation are not evaluated. The percentage of autonomous driving time using the model trained with augmented data is 98.32% and the number of failures is 9, which are significantly better than the results of 82.09% and 98 without augmented data.

Table 8.1: Evaluation result using the simulator, with and without augmented data.

Augmented Data    Autonomy    No. of Failures    Error of Position: Mean (m)    Error of Position: Std. Dev. (m)
Yes               98.32%      9                  0.2179                         0.1813
No                82.09%      98                 0.2670                         0.2071
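The autonomy metric of equation 8.7 can be computed directly from the per-frame lateral errors produced by the simulator. The sketch below applies the 1 m failure threshold and counts the failure frame together with the following 60 frames as manual driving, which is a slight simplification of the rule stated above.

```python
# A sketch of the autonomy metric (equation 8.7) from simulated lateral errors.
FAIL_THRESHOLD_M = 1.0
MANUAL_FRAMES = 60       # frames marked as manual driving after each failure

def autonomy(lateral_errors):
    failures, manual, remaining = 0, 0, 0
    for err in lateral_errors:
        if remaining > 0:                    # inside a manual driving period
            manual += 1
            remaining -= 1
        elif abs(err) > FAIL_THRESHOLD_M:    # new failure: this frame plus the next 60 are manual
            failures += 1
            manual += 1
            remaining = MANUAL_FRAMES
        # otherwise the frame counts as autonomous driving
    autonomous = len(lateral_errors) - manual
    return autonomous / (autonomous + manual), failures
```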

In addition, the simulation results also show that the error of the steering wheel angle is not an effective metric for performance evaluation.

Figure 8.13: An example of the simulation result, produced by the CNN trained with data augmentation. (a) Overview of the trajectory in a test sequence. (b) Trajectory zoomed in on the black rectangle in (a). (c) Trajectory zoomed in on the black rectangle in (b).

The model trained with augmented data has a mean error of 0.3042 degrees and a standard deviation of 1.6029 degrees. The model trained without augmented data has a mean error of 0.3118 degrees and a standard deviation of 1.2043 degrees. One can hardly tell which model is better from the mean error and standard deviation of the steering angles. The deployed simulator with the CNN predictor runs at approximately 13 frames per second (FPS). Considering the input data rate of 10 Hz, the end-to-end lane keeping system is able to run in real time. The hardware platform is a desktop computer with an Intel i5 3570K processor running at 3.4 GHz, 32 GB DDR3 RAM and one NVIDIA GTX 1080 GPU.

8.4 Discussion

It is worth investigating the causes of some failures during evaluation. For example, a failure case is shown in Figure 8.14. The vehicle is moving out of the lane to the right because the front vehicle is changing lanes and the lane markings are partially blocked. Another case is shown in Figure 8.15, with a cast shadow on the road. In most cases, we believe the quality of the input data plays a role in those failures, which can be attributed to factors such as shadows on the road, extreme lighting conditions, camera exposure settings, etc. Because of the complicated scenarios in the real world, the robustness of a model needs to be fully examined prior to deployment. Therefore, a simulator built on real-world data becomes very useful.

Figure 8.14: An example of failure. The vehicle is going out of lane to the right because another vehicle is changing lane, and lane markings are partially blocked.

Figure 8.15: An example of failure. The vehicle is going out of lane to the right because of unclear lane markings.

8.5 Conclusions

This chapter presents an autonomous driving simulator that is built on real-world data, with recordings from three front-view cameras, steering wheel angles and vehicle speed information. A vehicle dynamic model and trajectory tracking are incorporated in the simulator to predict the vehicle movement. With proper calibration, a 3D image projection technique can be applied to generate updated front-view images at the current vehicle position and orientation. The simulator can be used for both training and evaluation of vision-based lane keeping algorithms. Moreover, an end-to-end learning lane keeping system is proposed using a CNN model to predict the steering angle from the front-view camera input. The CNN model trained with augmented data results in significantly better performance than using only the original recorded data, when measured by the percentage of autonomous driving time. This new real-world driving dataset is shared online and can bring benefits to research and education in autonomous vehicle technology.

Chapter 9

Conclusions

This dissertation presents the design and implementation of a group of systems for autonomous vehicles. The real-time GPU-based traffic sign detection and recognition system is capable of detecting and recognizing 48 classes of traffic signs of any size in each image frame. The detection rate is about 91.69% and the recognition rate is about 93.77%. The system can process 27.9 fps video with the active pixels of a 1,628 × 1,236 resolution. Because each frame is processed individually, no information from previous frames is required. As part of our future work, information from previous frames will be considered for tracking traffic signs, which is expected to further improve the detection accuracy. Two traffic light detection and recognition systems are presented. The first system detects and recognizes red circular lights only, using image processing and SVM. The performance is better than that of traditional detectors. The second system is more complicated. It detects and classifies different types of traffic lights, including green and red lights in both circular and arrow forms.

Color extraction and blob detection are applied to locate the candidates with proper optimization. A classification and validation method using PCANet is then used for frame-by-frame detection. The multi-object tracking method and forecasting technique are employed to improve the accuracy and produce stable results. As an additional contribution, we build a traffic light dataset from the videos captured via a camera mounted behind the windshield. A novel pedestrian detection instrumentation is designed using both thermal and RGB-D stereo cameras. Data are collected from on-road driving and an experimental dataset is built with the bounding box labeling of pedestrians as the ground truth. A reconfigurable multi-stage detector framework is proposed. Both HOG and CCF based detection methods are evaluated using data from the multi-spectral cameras and their various combinations. The experimental results indicate that the approach using CCF outperforms that involving HOG features. The combination of color and thermal images using the CCF method can achieve the best performance of about 9% log-average miss rate. For future work, other advanced feature extraction and classification methods will be considered to further improve the detector performance. The lane keeping system employs an end-to-end learning approach to obtain the proper steering angle for maintaining the car in the lane. The CNN model is trained and evaluated using the comma.ai dataset, which contains image frames and the steering angle data captured from road driving. The test results show that the model can produce relatively accurate steering of the vehicle. Further discussions on evaluation and data augmentation are also presented for future improvement. A simulator for the lane keeping system is built using image projection, vehicle dynamics and vehicle trajectory tracking. This is important for data augmentation and evaluation. The test results show that the model trained with augmented data using the simulator has better performance.

Our on-vehicle data collection systems are also implemented and deployed, and our own datasets are built from recorded driving videos. These datasets are used in most of our projects and can benefit other researchers in the future. Our experimental datasets are available at http://computing.wpi.edu/Dataset.html.

Bibliography

[1] “Red light running,” Insurance Institute for Highway Safety. [Online]. Available: http://www.iihs.org/iihs/topics/t/red-light-running/topicoverview

[2] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[3] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” International Journal of Robotics Research (IJRR), 2013.

[4] J. Fritsch, T. Kuehnl, and A. Geiger, “A new performance measure and eval- uation benchmark for road detection algorithms,” in International Conference on Intelligent Transportation Systems (ITSC), 2013.

[5] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Con- ference on Computer Vision and Pattern Recognition (CVPR), 2015.

[6] M. Mathias, R. Timofte, R. Benenson, and L. V. Gool, “Traffic sign recognition - how far are we from the solution?” in Proceedings of IEEE International Joint Conference on Neural Networks (IJCNN 2013), August 2013.

[7] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, “Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark,” in International Joint Conference on Neural Networks, no. 1288, 2013.

[8] “Traffic Lights Recognition public benchmarks.” [Online]. Available: http://www.lara.prd.fr/benchmarks/trafficlightsrecognition

[9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detec- tion,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, June 2005, pp. 886–893 vol. 1.

[10] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in CVPR, June 2009.

[11] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in European Conference on Computer Vision (ECCV), ser. LNCS, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9906. Springer International Publishing, 2016, pp. 102–118.

[12] J. Greenhalgh and M. Mirmehdi, “Real-time detection and recognition of road traffic signs,” Intelligent Transportation Systems, IEEE Transactions on, vol. 13, no. 4, pp. 1498–1506, 2012.

[13] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition,” Neural Networks, no. 0, pp. –, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608012000457

[14] C. Keller, C. Sprunk, C. Bahlmann, J. Giebel, and G. Baratoff, “Real-time recognition of U.S. speed signs,” in Intelligent Vehicles Symposium, 2008 IEEE, June 2008, pp. 518–523.

[15] W. Liu, Y. Wu, J. Lv, H. Yuan, and H. Zhao, “U.s. speed limit sign detection and recognition from image sequences,” in Control Automation Robotics Vision (ICARCV), 2012 12th International Conference on, Dec 2012, pp. 1437–1442.

[16] F. Zaklouta, B. Stanciulescu, and O. Hamdoun, “Traffic sign classification us- ing k-d trees and random forests,” in Neural Networks (IJCNN), The 2011 International Joint Conference on, July 2011, pp. 2151–2155.

[17] P. Sermanet and Y. LeCun, “Traffic sign recognition with multi-scale convolu- tional networks,” in Neural Networks (IJCNN), The 2011 International Joint Conference on, July 2011, pp. 2809–2813.

[18] E. Herbschleb and P. H. N. de With, “Real-time traffic sign detection and recognition,” pp. 72 570A–72 570A–12, 2009. [Online]. Available: http://dx.doi.org/10.1117/12.806171

[19] A. D. L. Escalera, L. E. Moreno, M. A. Salichs, and J. M. Armingol, “Road traffic sign detection and classification,” IEEE Transactions on Industrial Elec- tronics, vol. 44, pp. 848–859, 1997.

[20] K. Par and O. Tosun, “Real-time traffic sign recognition with map fusion on multicore/many-core architectures,” Acta Polytechnica Hungarica, vol. 9, no. 2, 2012.

[21] R. de Charette and F. Nashashibi, “Real time visual traffic lights recognition based on spot light detection and adaptive traffic lights templates,” in Intelligent Vehicles Symposium, 2009 IEEE, June 2009, pp. 358–363.

[22] ——, “Traffic light recognition using image processing compared to learning processes,” in Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ International Conference on, Oct 2009, pp. 333–338.

[23] G. Trehard, E. Pollard, B. Bradai, and F. Nashashibi, “Tracking both pose and status of a traffic light via an interacting multiple model filter,” in Information Fusion (FUSION), 2014 17th International Conference on, July 2014, pp. 1–7.

[24] S. Sooksatra and T. Kondo, “Red traffic light detection using fast radial symme- try transform,” in Electrical Engineering/Electronics, Computer, Telecommu- nications and Information Technology (ECTI-CON), 2014 11th International Conference on, May 2014, pp. 1–6.

[25] T.-P. Sung and H.-M. Tsai, “Real-time traffic light recognition on mobile devices with geometry-based filtering,” in Distributed Smart Cameras (ICDSC), 2013 Seventh International Conference on, Oct 2013, pp. 1–7.

[26] J. Levinson, J. Askeland, J. Dolson, and S. Thrun, “Traffic light mapping, local- ization, and state detection for autonomous vehicles,” in Robotics and Automa- tion (ICRA), 2011 IEEE International Conference on, May 2011, pp. 5784– 5791.

[27] N. Fairfield and C. Urmson, “Traffic light mapping and detection,” in Robotics and Automation (ICRA), 2011 IEEE International Conference on, May 2011, pp. 5421–5426.

[28] A. Gomez, F. Alencar, P. Prado, F. Osorio, and D. Wolf, “Traffic lights detection and state estimation using hidden Markov models,” in Intelligent Vehicles Symposium Proceedings, 2014 IEEE, June 2014, pp. 750–755.

[29] S. Salti, A. Petrelli, F. Tombari, N. Fioraio, and L. Di Stefano, “Traffic sign detection via interest region extraction,” Pattern Recognition, vol. 48(4), pp. 1039–1049, 2015.

[30] G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, July 2006.

[31] I. Arel, D. Rose, and T. Karnowski, “Deep machine learning - a new frontier in artificial intelligence research [research frontier],” Computational Intelligence Magazine, IEEE, vol. 5, no. 4, pp. 13–18, Nov 2010.

[32] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “Pcanet: A simple deep learning baseline for image classification?” arXiv preprint arXiv:1404.3606, 2014.

[33] S. Lafuente-Arroyo, S. Maldonado-Bascon, P. Gil-Jimenez, H. Gomez-Moreno, and F. Lopez-Ferreras, “Road sign tracking with a predictive filter solution,” in IEEE Industrial Electronics, IECON 2006 - 32nd Annual Conference on, Nov 2006, pp. 3314–3319.

[34] S. Lafuente-Arroyo, S. Maldonado-Bascon, P. Gil-Jimenez, J. Acevedo- Rodriguez, and R. Lopez-Sastre, “A tracking system for automated inventory of road signs,” in Intelligent Vehicles Symposium, 2007 IEEE, June 2007, pp. 166–171.

[35] S. Zhang, R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, “How far are we from solving pedestrian detection?” CoRR, vol. abs/1602.01237, 2016. [Online]. Available: http://arxiv.org/abs/1602.01237

[36] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” PAMI, vol. 34, 2012.

[37] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004. [Online]. Available: http://dx.doi.org/10.1023/B:VISI.0000013087.49260.fb

[38] R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, “Ten years of pedestrian detection, what have we learned?” CoRR, vol. abs/1411.4304, 2014. [Online]. Available: http://arxiv.org/abs/1411.4304

[39] P. Dollar, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” pp. 91.1–91.11, 2009, doi:10.5244/C.23.91.

[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

[41] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556

[42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

[43] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Convolutional channel features for pedestrian, face and edge detection,” CoRR, vol. abs/1504.07339, 2015. [Online]. Available: http://arxiv.org/abs/1504.07339

[44] R. Gade and T. B. Moeslund, “Thermal cameras and applications: a survey,” Machine Vision and Applications, vol. 25, no. 1, pp. 245–262, 2014. [Online]. Available: http://dx.doi.org/10.1007/s00138-013-0570-5

[45] W. Li, D. Zheng, T. Zhao, and M. Yang, “An effective approach to pedestrian detection in thermal imagery,” in Natural Computation (ICNC), 2012 Eighth International Conference on, May 2012, pp. 325–329.

[46] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, “Pedestrian detec- tion using infrared images and histograms of oriented gradients,” in 2006 IEEE Intelligent Vehicles Symposium, 2006, pp. 206–212.

[47] C. Dai, Y. Zheng, and X. Li, “Pedestrian detection and tracking in infrared imagery using shape and appearance,” Computer Vision and Image Understanding, vol. 106, no. 2-3, pp. 288–299, 2007, special issue on Advances in Vision Algorithms and Systems beyond the Visible Spectrum. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1077314206001925

[48] J. W. Davis and M. A. Keck, “A two-stage template approach to person detection in thermal imagery,” Applications of Computer Vision and the IEEE Workshop on Motion and Video Computing, IEEE Workshop on, vol. 1, pp. 364–369, 2005.

[49] F. Xu, X. Liu, and K. Fujimura, “Pedestrian detection and tracking with night vision,” IEEE Transactions on Intelligent Transportation Systems, vol. 6, no. 1, pp. 63–71, March 2005.

[50] D. Olmeda, A. de la Escalera, and J. M. Armingol, “Contrast invariant features for human detection in far infrared images,” in Intelligent Vehicles Symposium (IV), 2012 IEEE, June 2012, pp. 117–122.

[51] W. Wang, J. Zhang, and C. Shen, “Improved human detection and classifi- cation in thermal images,” in 2010 IEEE International Conference on Image Processing, Sept 2010, pp. 2313–2316.

[52] M. Bertozzi, A. Broggi, C. H. Gomez, R. I. Fedriga, G. Vezzoni, and M. DelRose, “Pedestrian detection in far infrared images based on the use of probabilistic templates,” in 2007 IEEE Intelligent Vehicles Symposium, June 2007, pp. 327– 332.

[53] T. T. Zin, H. Takahashi, and H. Hama, “Robust person detection using far infrared camera for ,” in Innovative Computing, Information and Control, 2007. ICICIC ’07. Second International Conference on, Sept 2007, pp. 310–310.

[54] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, “Survey of pedestrian detection for advanced driver assistance systems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1239–1258, July 2010.

[55] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, “Multispectral pedes- trian detection: Benchmark dataset and baseline,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1037–1045.

[56] S. J. Krotosky and M. M. Trivedi, “On color-, infrared-, and multimodal-stereo approaches to pedestrian detection,” IEEE Transactions on Intelligent Trans- portation Systems, vol. 8, no. 4, pp. 619–629, Dec 2007.

[57] K. H. Lee and J. N. Hwang, “On-road pedestrian tracking across multiple driv- ing recorders,” IEEE Transactions on Multimedia, vol. 17, no. 9, pp. 1429–1438, Sept 2015.

[58] W. Liu, R. W. H. Lau, X. Wang, and D. Manocha, “Exemplar-amms: Recog- nizing crowd movements from pedestrian trajectories,” IEEE Transactions on Multimedia, vol. 18, no. 12, pp. 2398–2406, Dec 2016.

[59] R. Risack, N. Mohler, and W. Enkelmann, “A video-based lane keeping assis- tant,” in Proceedings of the IEEE Intelligent Vehicles Symposium 2000 (Cat. No.00TH8511), 2000, pp. 356–361.

[60] S. Ishida and J. E. Gayko, “Development, evaluation and introduction of a lane keeping assistance system,” in IEEE Intelligent Vehicles Symposium, 2004, June 2004, pp. 943–944.

[61] J. F. Liu, J. H. Wu, and Y. F. Su, “Development of an interactive lane keeping control system for vehicle,” in 2007 IEEE Vehicle Power and Propulsion Conference, Sept 2007, pp. 702–706.

[62] A. H. Eichelberger and A. T. McCartt, “Toyota drivers’ experiences with dynamic radar cruise control, pre-collision system, and lane-keeping assist,” Journal of Safety Research, vol. 56, pp. 67 – 73, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0022437515001061

[63] Y. Li, “Deep reinforcement learning: An overview,” CoRR, vol. abs/1701.07274, 2017. [Online]. Available: http://arxiv.org/abs/1701.07274

[64] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “End-to-end deep reinforcement learning for lane keeping assist,” CoRR, vol. abs/1612.04340, 2016. [Online]. Available: http://arxiv.org/abs/1612.04340

[65] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” CoRR, vol. abs/1610.03295, 2016. [Online]. Available: http://arxiv.org/abs/1610.03295

[66] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement learning framework for autonomous driving,” CoRR, vol. abs/1704.02532, 2017. [Online]. Available: http://arxiv.org/abs/1704.02532

[67] S. Sharifzadeh, I. Chiotellis, R. Triebel, and D. Cremers, “Learning to drive using inverse reinforcement learning and deep q-networks,” CoRR, vol. abs/1612.03653, 2016. [Online]. Available: http://arxiv.org/abs/1612.03653

[68] D. A. Pomerleau, “Advances in neural information processing systems 1,” D. S. Touretzky, Ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989, ch. ALVINN: An Autonomous Land Vehicle in a Neural Network, pp. 305–313. [Online]. Available: http://dl.acm.org/citation.cfm?id=89851.89891

[69] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp, “Off-road obstacle avoidance through end-to-end learning,” in Proceedings of the 18th International Conference on Neural Information Processing Systems, ser. NIPS’05. Cambridge, MA, USA: MIT Press, 2005, pp. 739–746. [Online]. Available: http://dl.acm.org/citation.cfm?id=2976248.2976341

[70] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-driving cars,” CoRR, vol. abs/1604.07316, 2016. [Online]. Available: http://arxiv.org/abs/1604.07316

[71] M. Bojarski, A. Choromanska, K. Choromanski, B. Firner, L. D. Jackel, U. Muller, and K. Zieba, “Visualbackprop: visualizing cnns for autonomous driving,” CoRR, vol. abs/1611.05418, 2016. [Online]. Available: http://arxiv.org/abs/1611.05418

[72] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. D. Jackel, and U. Muller, “Explaining how a deep neural network trained with end-to-end learning steers a car,” CoRR, vol. abs/1704.07911, 2017. [Online]. Available: http://arxiv.org/abs/1704.07911

[73] Z. Chen and X. Huang, “End-to-end learning for lane keeping of self-driving cars,” in 2017 IEEE Intelligent Vehicles Symposium (IV), June 2017.

[74] J. Hardy and M. Campbell, “Contingency planning over probabilistic obstacle predictions for autonomous road vehicles,” IEEE Transactions on Robotics, vol. 29, no. 4, pp. 913–929, 2013.

[75] E. Frazzoli, M. A. Dahleh, and E. Feron, “Real-time motion planning for agile autonomous vehicles,” in American Control Conference, 2001. Proceedings of the 2001, vol. 1. IEEE, 2001, pp. 43–49.

[76] M. Likhachev and D. Ferguson, “Planning long dynamically feasible maneu- vers for autonomous vehicles,” The International Journal of Robotics Research, vol. 28, no. 8, pp. 933–945, 2009.

[77] R. Y. Hindiyeh, “Dynamics and control of drifting in automobiles,” , March, 2013.

[78] E. Galceran, R. M. Eustice, and E. Olson, “Toward integrated motion planning and control using potential fields and torque-based steering actuation for au- tonomous driving,” in Proceedings of the IEEE Intelligent Vehicle Symposium, Seoul, Korea, June 2015, pp. 304–309.

[79] R. DeSantis, “Path-tracking for articulated vehicles via exact and jacobian lin- earization,” IFAC Proceedings Volumes, vol. 31, no. 3, pp. 159–164, 1998.

[80] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detec- tion,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, June 2005, pp. 886–893 vol. 1.

[81] “BelgiumTS Dataset,” 2010. [Online]. Available: http://btsd.ethz.ch/ shareddata/

168 [82] F. Zaklouta and B. Stanciulescu, “Real-time traffic sign recognition in three stages,” Robotics and Autonomous Systems, vol. 62, no. 1, pp. 16 – 24, 2014, new Boundaries of Robotics. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0921889012001236

[83] S. Suzuki and K. Abe, “Topological structural analysis of digitized binary im- ages by border following.” Computer Vision, Graphics, and Image Processing, vol. 30, no. 1, pp. 32–46, 1985.

[84] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, vol. 25, no. 11, pp. 120–126, 2000.

[85] H. Cheng, X. Jiang, Y. Sun, and J. Wang, “Color image segmentation: advances and prospects,” Pattern Recognition, vol. 34, no. 12, pp. 2259–2281, 2001.

[86] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Pro- cessing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.

[87] M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “On- line multiperson tracking-by-detection from a single, uncalibrated camera,” Pat- tern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 9, pp. 1820–1833, Sept 2011.

[88] H. W. Kuhn, “The Hungarian method for the assignment problem,” in 50 Years of Integer Programming 1958-2008. Springer, 2010, pp. 29–47.

169 [89] S.-H. Bae and K.-J. Yoon, “Robust online multi-object tracking based on track- let confidence and online discriminative appearance learning,” in Computer Vi- sion and Pattern Recognition (CVPR), 2014 IEEE Conference on, June 2014, pp. 1218–1225.

[90] K. Basak, S. N. Hetu, Zhemin Li, C. L. Azevedo, H. Loganathan, T. Toledo, Runmin Xu, Yan Xu, Li-Shiuan Peh, and M. Ben-Akiva, “Modeling reaction time within a traffic simulation model,” in 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), Oct 2013, pp. 302–309.

[91] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.

[92] P. Domingos, “A few useful things to know about machine learning,” Commun. ACM, vol. 55, no. 10, pp. 78–87, Oct. 2012.

[93] Z. Chen, X. Huang, Z. Ni, and H. He, “A gpu-based real-time traffic sign detection and recognition system,” in Computational Intelligence in Vehicles and Transportation Systems (CIVTS), 2014 IEEE Symposium on, Dec 2014, pp. 1–5.

[94] Z. Chen, J. Wang, H. He, and X. Huang, “A fast deep learning system using gpu,” in 2014 IEEE International Symposium on Circuits and Systems (IS- CAS), June 2014, pp. 1552–1555.

[95] Y. Zhou, W. Wang, and X. Huang, “FPGA design for PCANet deep learning network,” in Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on, May 2015, pp. 232–232.

[96] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cam- bridge university press, 2003.

[97] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence- rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999. [Online]. Available: http://dx.doi.org/10.1023/A:1007614523901

[98] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, pp. 1532–1545, Aug 2014.

[99] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadar- rama, and T. Darrell, “Caffe: Convolutional architecture for fast feature em- bedding,” arXiv preprint arXiv:1408.5093, 2014.

[100] M. Rohrbach, M. Enzweiler, and D. M. Gavrila, “High-level fusion of depth and intensity for pedestrian classification,” in Joint Pattern Recognition Symposium. Springer, 2009, pp. 101–110.

[101] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, Mar 1998.

[102] B. Waske and J. A. Benediktsson, “Fusion of support vector machines for classification of multisensor data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 12, pp. 3858–3866, Dec 2007.

[103] R. Pouteau, B. Stoll, and S. Chabrier, “Support vector machine fusion of mul- tisensor imagery in tropical ecosystems,” in Image Processing Theory Tools and Applications (IPTA), 2010 2nd International Conference on, July 2010, pp. 325–329.

[104] J. Zhao, B. Xie, and X. Huang, “Real-time lane departure and front collision warning system on an fpga,” in 2014 IEEE High Performance Extreme Com- puting Conference (HPEC), Sept 2014, pp. 1–5.

[105] A. J. Humaidi and M. A. Fadhel, “Performance comparison for lane detection and tracking with two different techniques,” in 2016 Al-Sadeq International Conference on Multidisciplinary in IT and Communication Science and Appli- cations (AIC-MITCSA), May 2016, pp. 1–6.

[106] C. Li, J. Wang, X. Wang, and Y. Zhang, “A model based path planning algo- rithm for self-driving cars in dynamic environment,” in 2015 Chinese Automa- tion Congress (CAC), Nov 2015, pp. 1123–1128.

[107] S. Yoon, S. E. Yoon, U. Lee, and D. H. Shim, “Recursive path planning us- ing reduced states for car-like vehicles on grid maps,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 5, pp. 2797–2813, Oct 2015.

[108] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dynamic vehicle models for autonomous driving control design,” in 2015 IEEE Intelligent Vehicles Symposium (IV), June 2015, pp. 1094–1099.

[109] D. Wang and F. Qi, “Trajectory planning for a four-wheel-steering vehicle,” in Proceedings 2001 ICRA. IEEE International Conference on Robotics and Automation, vol. 4, 2001, pp. 3320–3325 vol.4.

[110] The comma.ai driving dataset. [Online]. Available: https://github.com/ commaai/research

[111] S. Minhas, A. Hernández-Sabaté, S. Ehsan, K. Díaz-Chito, A. Leonardis, A. M. López, and K. D. McDonald-Maier, LEE: A Photorealistic Virtual Environment for Assessing Driver-Vehicle Interactions in Self-driving Mode. Cham: Springer International Publishing, 2016, pp. 894–900.

[112] E. Santana and G. Hotz, “Learning a driving simulator,” CoRR, vol. abs/1608.01230, 2016. [Online]. Available: http://arxiv.org/abs/1608.01230

[113] R. Szeliski, Computer vision: algorithms and applications. Springer Science & Business Media, 2010.
