UNSUPERVISED LEARNING AND REVERSE OPTICAL FLOW IN MOBILE ROBOTICS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Andrew Lookingbill
May 2011

© 2011 by Andrew Lookingbill. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This dissertation is online at: http://purl.stanford.edu/mz066kz5780

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Sebastian Thrun, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Bernd Girod

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Andrew Ng

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Preface

They say you are supposed to be able to describe your research to a layperson in five minutes: your “elevator pitch.” In that sense, my graduate work is perfect. Whether the audience consists of strangers on a plane, extended family, or curious neighbors, what I do, why I do it, and the basics of how it is done are straightforward to explain. To teach robots to “see,” to operate independently of human supervision, and to learn about the environment without having explicitly labeled data is exciting stuff. So, while this thesis may not have any plot twists or a surprise ending, I hope you find it interesting reading. And who knows? In the coming revolution, I may be partly to blame for any stray toasters.

Acknowledgments

Research, at least at Stanford, is a collaborative effort. I have had the unique good fortune to work with some extraordinarily talented people during my time here. I would like to thank the members of the Stanford Autonomous Helicopter Project for their help in acquiring the video used for testing the multi-object tracking algorithm discussed in Chapter 2. I would also like to thank my collaborators David Lieb, David Stavens, John Rogers, Jim Curry, and Itai Katz for their insights as well as the long hours, late nights, sunburns, and mosquito bites we endured as we wrote and then tested our algorithms in the field. I am indebted to the members of my reading committee, Professors Girod and Ng, my thesis defense committee chair, Professor Widrow, and my advisor, Professor Thrun. Finally, I want to thank my mother, for everything.

Contents

Preface

Acknowledgments

1 Introduction
   1.1 Thesis structure

2 Optical Flow and Reverse Optical Flow
   2.1 Feature Selection
   2.2 Feature Tracking
   2.3 Flow Caching and Traceback
   2.4 Examples
   2.5 Related Work

3 Multi-Object Tracking and Activity Models
   3.1 Learning Activity Maps from a Moving Platform
      3.1.1 Feature Selection and Feature Tracking
      3.1.2 Identifying Moving Objects on the Ground
      3.1.3 Tracking Moving Objects with Particle Filters
      3.1.4 Learning the Activity-Based Ground Model
   3.2 Applications
      3.2.1 Using the Activity Model for Improved Tracking
      3.2.2 Registration Based on Activity Models
   3.3 Results
      3.3.1 Hypotheses
      3.3.2 Methods
      3.3.3 Findings
      3.3.4 Additional Results
      3.3.5 Conclusions
   3.4 Related Work

4 Road Following
   4.1 Adaptive Road Following
   4.2 Results
      4.2.1 Hypotheses
      4.2.2 Methods
      4.2.3 Findings
      4.2.4 Additional Results
      4.2.5 Conclusions
   4.3 Related Work

5 Self-Supervised Navigation
   5.1 Off-Road Navigation Algorithm
      5.1.1 Alternate Approaches
   5.2 Results
      5.2.1 Hypotheses
      5.2.2 Methods
      5.2.3 Findings
      5.2.4 Additional Results
      5.2.5 Conclusions
   5.3 Related Work

6 Conclusions and Future Work
   6.1 Conclusions
   6.2 Future Work

Bibliography

List of Tables

3.1 Single and Multi-Object Tracking Performance

List of Figures

2.1 Features identified using an algorithm by Shi and Tomasi [5].
2.2 Image pyramids, filtered and subsampled, for two consecutive video frames. A sum of squared differences measure is iteratively minimized between the two, moving from coarser to finer levels, to calculate the optical flow for a given feature.
2.3 (a) Optical flow based on a short image sequence for an image containing a moving object (dark car).
2.4 Changes in texture and color appearance with distance
2.5 Changes in specular illumination with distance. Specular illumination depends on the angle of incidence at point P, which differs between robot positions a and b.
2.6 White lines represent the optical flow field in a typical desert driving scene (the length of flow vectors has been scaled by a factor of 2 for clarity)
2.7 Optical flow compressed and stored for a number of frames in the past
2.8 Operations for tracing the location of a feature backwards in time
2.9 (a) Points selected in initial video frame (b) Origin of points 200 frames in the past
2.10 (a) Numbered points selected in initial video frame (b) Corresponding points 200 frames in the past
2.11 Outdoor reverse optical flow example. Frame on right shows point when object interacts with robot's local sensors, frame on left shows image region where reverse optical flow indicates the point originated.

2.12 Outdoor reverse optical flow example. Frame on right shows point when object interacts with robot's local sensors, frame on left shows image region where reverse optical flow indicates the point originated.
2.13 Indoor reverse optical flow example. Frame on right shows point when object interacts with robot's local sensors, frame on left shows image region where reverse optical flow indicates the point originated.
2.14 (a) Points on desert roadway selected in initial video frame (b) Corresponding points 200 frames in the past

3.1 The Stanford Helicopter is based on a Bergen Industrial Twin platform and is outfitted with instrumentation for autonomous flight (IMU, GPS, magnetometer, PC104). In the experiments reported here the onboard laser was replaced with a color camera.
3.2 (a) Optical flow based on a short image sequence, for an image containing a moving object (dark car). (b) The “corrected” flow after compensating for the estimated platform motion, which itself is obtained from the image flow. The reader will notice that this flow is significantly higher for the moving car. These images were acquired with the Stanford helicopter.
3.3 (a) Multiple particle filters, used for tracking multiple moving objects on the ground. Lighter particles have been more heavily weighted in the reward calculation, and are more likely to be selected during the resampling step. Shown here is an example of tracking three moving objects on the ground, a bicyclist and two pedestrians (the truck in the foreground is not moving). (b) The center of each particle filter in a different frame in the sequence clearly identifies all moving objects.
3.4 Two moving objects being tracked in video taken from a helicopter as part of a DARPA demo.

3.5 Example of a learned activity map of an area on campus, using data acquired from a camera platform undergoing unknown motion. The arrows indicate the most likely motion direction modes in each grid cell; their lengths correspond to the most likely velocity of that mode; and the thickness represents the probability of motion. This diagram clearly shows the main traffic flows around the circular object; it also shows the flow of pedestrians that moved through the scene during learning.
3.6 The single-frame alignment of two independent video sequences based on the activity-based models acquired from each. This registration is performed without image pixel information. It uses only activity information from the learning grid.
3.7 Example of two tracks (a) without and (b) with the learned activity map. The track in (a) is incomplete and misses the moving object for a number of time steps. The activity map enables the tracker to track the top object more reliably.
3.8 Selected features in frame from jogging video.
3.9 Features classified as moving in jogging video frame.
3.10 Particles corresponding to a particle filter tracking the jogger. Lighter colored particles have been heavily weighted in this step, darker particles have received a lower weight.
3.11 Dependence of tracking accuracy on number of moving objects in training data.

4.1 Adaptive road following algorithm
4.2 (a) Dark line shows the definition region used in the proposed algorithm. (b)-(d) White lines show the locations in previous frames to which optical flow has traced the definition region.
4.3 Output of horizon detector is the dark line
4.4 (a) Input video frame (b) Visualization of SSD matching response for 10 horizontal templates for this frame.

4.5 Dark circles represent locations of maximum SSD response along each horizontal search line. Light circles are the output of the dynamic programming routine. The gray region is the final output of the algorithm and is calculated using the dynamic programming output. The width of this region is linearly interpolated from the horizontal template widths. Uneven vertical spacing between circles is the result of changing vehicle speeds over time.
4.6 Single frame algorithm output for three Mojave Desert data sets. Each column contains results from one of the three video sequences.
4.7 Typical human-labeled ground-truth image
4.8 Pixel coverage results on the three test video sequences
4.9 (a) Input frame from the video characterized by straight dirt road with scrub brush lining the sides (b) Optical flow technique output (c) MRF classifier output
4.10 (a) Input frame from the video with long shadows and sparse vegetation (b) Optical flow technique output (c) MRF classifier output
4.11 (a) Input frame (b) SSD response
4.12 (a) Input frame from video characterized by changing elevations and gravel colors (b) Optical flow technique output (c) MRF classifier output
4.13 Line coverage results are shown at different distances from the front of the vehicle towards the horizon for the three video sequences.
4.14 Sample optical flow field in frame from winding desert video.
4.15 First figure shows the definition region in front of the vehicle, subsequent figure shows the location the definition region is tracked back to in successively earlier video frames. The last figure is 200 frames in the past.
4.16 Sample frame with roadway position calculated from positions of horizontal template matches.
4.17 Sample frame with roadway position calculated using dynamic programming approach.
4.18 Output of naive color-based road classification algorithm.

4.19 Output of naive texture-based road classification algorithm.
4.20 Comparison of reverse optical flow, color, and texture-based algorithms using the pixel coverage metric.
4.21 Comparison of reverse optical flow, color, and texture-based algorithms using the line coverage metric.

5.1 Off-road navigation algorithm
5.2 The LAGR robot platform. It is equipped with a GPS receiver, infrared bumpers, a physical bumper, and two stereo camera rigs.
5.3 Points corresponding to the good class are depicted by x's, bad class by circles, and lethal class by stars
5.4 Statistics for Gaussian mixture components after a run where the robot interacted with an orange fence, and avoided subsequent orange objects. Each row has the mean and standard deviation for that Gaussian component, followed by the number of good, bad, and lethal votes for that component based on training data.
5.5 (a) STFT during a period of normal robot operation (b) STFT during a period of detected wheel slippage
5.6 (a) Input frame (b) Raw segmentation output (bushes have been classified as obstacles) (c) Output of “bottom finder”
5.7 Top figure shows Gaussian mixture model of the scene (potential obstacles colored red). Bottom figure shows the placement of obstacles in the occupancy grid.
5.8 iRobot ATRV platform
5.9 Trees are classified as obstacles with a texture and color-based segmentation algorithm after interacting with the robot's physical bumper
5.10 Learned optical flow field is used by the robot to determine how to maneuver to push obstacles out of the field of view
5.11 (a) Video frame (b) Hand-labeled obstacle image (c) Segmentation without optical flow (d) Segmentation with optical flow

5.12 (a) Paths taken using data collected without optical flow (b) Paths taken using data collected with optical flow.
5.13 Autonomous navigation results with 95% confidence ellipses. The average run duration (in minutes) is indicated on the y-axis, and the average number of obstacles encountered on the x-axis.
5.14 Planner local and global trajectories
5.15 (a) Input frame (b) Initial classifier output (c) Traversable pixels (d) Polynomial contour (e) Refined classifier output (f) Estimated path

Chapter 1

Introduction

The amount of data available from on-board sensors on mobile robotics platforms is growing rapidly as the resolution of sensors increases and costs decrease. Monocular video streams alone, which provide staggering amounts of training and test data, together with the processing power necessary to pull useful information from them, open up new research opportunities with unique challenges. Unsupervised algorithms, with their ability to produce useful information without large labeled training sets, are an important tool for dealing with, and benefiting from, this abundance of data.

This thesis examines the application of unsupervised learning techniques to three subfields of mobile robotics. The first, tracking multiple moving objects from above, is an area of current interest for unmanned aerial vehicle (UAV) researchers. The second, road following in loosely-structured environments, was made famous by the DARPA Grand Challenge, an autonomous robot race featuring a 100+ mile course through different terrains. The third, autonomous off-road navigation, was the focus of DARPA's Learning Applied to Ground Robots (LAGR) challenge, a competition focusing on computer vision-based navigation.

This thesis describes three novel contributions, one in each of the subfields listed above. First, the ability to build dynamic, activity-based ground models from a moving platform paves the way for improved multi-object tracking (important for coping with real-world data containing multiple objects of interest) and for other applications such as video stream registration in video taken from UAVs.


Second, the combination of optical flow techniques and dynamic programming produces a real-time algorithm for accurately estimating the position of traversable areas in a loosely-structured environment. This allows improved road classification in unpaved driving conditions, in turn allowing higher robot travel speeds. Finally, an extension of these optical flow techniques allows an autonomously navigating robot to improve the quality of its obstacle classification in monocular video, which in turn improves its obstacle avoidance performance.

All the work described in this thesis uses a monocular video camera as the primary sensor. This is challenging in that it lacks the dense 3D range information available from laser scanners or stereo vision. It is useful, however, because a camera is a purely passive sensor, which is desirable in some applications, and because it provides information all the way to the horizon. As the experimental results discussed in this thesis will show, these contributions improve the state of the art in each of these three subfields.

1.1 Thesis structure

The following chapters of this thesis are based on published papers. Chapter 3 is based on a paper published at ICRA with coauthors Lieb, Stavens, and Thrun [1]. Chapter 4 is based on a paper published at RSS with coauthors Lieb and Thrun [2]. Chapter 5 is based on a paper published in IJCV with coauthors Rogers, Lieb, Curry, and Thrun, and on a book chapter in the Springer STAR series with coauthors Lieb and Thrun [3], [4].

All of the approaches discussed in this thesis make use of optical flow techniques. These techniques, which associate the location of an object in a given video frame with its location in an earlier frame of the video, are described in detail in Chapter 2.

Chapter 2

Optical Flow and Reverse Optical Flow

There are two tools from computer vision that play a large part in the work described in Chapters 3, 4, and 5 of this thesis.

Optical Flow: Trucco and Verri define optical flow as “the apparent motion of the image brightness pattern” [7]. The term sparse optical flow is often used to describe this motion for a subset of pixels in an image. The work in Chapter 3 of this thesis combines the techniques in Sections 2.1 and 2.2 to produce a sparse optical flow estimate used for object tracking in a monocular video stream taken from a moving platform. A note on terminology: what I refer to here as sparse optical flow is sometimes referred to in the literature more generally as feature tracking.

Reverse Optical Flow: I will define reverse optical flow as the use of a stored buffer of inter-frame sparse optical flow vectors, for some number of pairs of consecutive video frames in the past, to associate any pixel in the current image with the location of the object it corresponds to in an earlier video frame. This is interesting because, rather than identifying an object ahead of time and tracking its motion, we can pick an object in the current frame which is interesting (perhaps because of an interaction with short-range sensors) and examine its appearance at some point in the past. The work in Chapters 4 and 5 of this thesis combines the techniques in Sections 2.1, 2.2, and 2.3 into a robust implementation of the calculation of reverse optical flow.


2.1 Feature Selection

The first step of the proposed approach involves identifying appropriate features in the camera image and tracking them over multiple frames. The goal is to produce an algorithm that does not use color, texture, shape, or size information to track moving objects while keeping the details of the implementation simple. In addition, no prior assumptions about the nature of the moving objects being tracked are made. In the approach proposed here, features are first identified using an algorithm by Shi and Tomasi [5], which selects unambiguous feature points by finding regions in the image containing large spatial image gradients in two orthogonal directions. The features found and tracked by this algorithm are corners. Using Scale Invariant Feature Transform (SIFT) features would have been another option [6]. In order to find corners, the image is first smoothed using a Gaussian filter (with an 11x11 pixel kernel and standard deviation of 2.15 in both dimensions), and the minimal eigenvalue of the matrix

$$\begin{bmatrix} \sum E_x^2 & \sum E_x E_y \\ \sum E_x E_y & \sum E_y^2 \end{bmatrix} \qquad (2.1)$$

where $E_x = \partial E / \partial x$ is the spatial image gradient in the x direction, is then found at each pixel location [7]. Features whose minimal eigenvalue is smaller than a threshold (0.05 times the largest such eigenvalue in the image, in this case) are dropped. The OpenCV function cvGoodFeaturesToTrack was used to perform the operation discussed here [9]. A sample of features found by this algorithm, in an image acquired by the Stanford Autonomous Helicopter, is shown in Fig. 2.1.
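As a concrete illustration of this step, the sketch below uses the modern OpenCV Python bindings (the successor to the cvGoodFeaturesToTrack call named above). The smoothing kernel, sigma, and 0.05 quality threshold follow the text; the maximum corner count and minimum corner spacing are illustrative assumptions, and the sketch is not the original implementation.

```python
import cv2

def select_features(gray_frame, max_corners=500):
    """Shi-Tomasi corner selection, roughly as described in Section 2.1."""
    # Smooth with an 11x11 Gaussian kernel, sigma = 2.15 (sigmaY defaults to sigmaX).
    smoothed = cv2.GaussianBlur(gray_frame, (11, 11), 2.15)
    # goodFeaturesToTrack keeps corners whose minimal eigenvalue exceeds
    # qualityLevel times the largest minimal eigenvalue in the image (0.05 here).
    # maxCorners and minDistance are assumed values, not given in the text.
    corners = cv2.goodFeaturesToTrack(smoothed,
                                      maxCorners=max_corners,
                                      qualityLevel=0.05,
                                      minDistance=5)
    return corners  # float32 array of shape (N, 1, 2), or None if nothing found
```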

2.2 Feature Tracking

The tracking of the selected features is achieved using a pyramidal implementation of the Lucas-Kanade tracker [8]. This approach forms image pyramids consisting of filtered and subsampled versions of the original images (see Fig. 2.2). The pyramids had five levels; each level was half the size, in each dimension, of the level above.

Figure 2.1: Features identified using an algorithm by Shi and Tomasi [5].

Figure 2.2: Image pyramids, filtered and subsampled, for two consecutive video frames. A sum of squared differences measure is iteratively minimized between the two, moving from coarser to finer levels, to calculate the optical flow for a given feature.

The filtering consisted of smoothing with a 5x5 pixel Gaussian kernel with a standard deviation of 1.25 in both dimensions. The displacement vectors between the feature locations in the two images are found by iteratively minimizing the sum of squared errors over a small window, from the coarsest level up to the original level.

Figure 2.3: (a) Optical flow based on a short image sequence for an image containing a moving object (dark car).

The window used in this work was 3x3 pixels. The result of tracking features is shown in Fig. 2.3. The optical flow (the movement of the pixels corresponding to an object in image space) of a number of features, tracked through consecutive images and indicated by small arrows in the direction of the flow, is shown. This approach has two important benefits: it is robust to fairly large pixel displacements due to the pyramidal structure, and it allows the tracking of a sparse set of features to calculate camera and object motion, yielding a faster implementation than a dense optical flow calculation. Bilinear interpolation of image values at non-integer pixel locations is used to allow sub-pixel tracking accuracy. The tracker exits once either a maximum correlation is achieved or a maximum number of iterations has occurred for each tracked feature; the precision is limited by how quickly one of these exit criteria is met. The tracking was performed using the OpenCV function cvCalcOpticalFlowPyrLK [9]. The output of this stage is a sparse optical flow field. While this field does not have information about the movement of every pixel, it does give a good overview of the motion of the different objects between frames.
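A companion sketch of the tracking step, again using the Python binding of the OpenCV routine named in the text. The five pyramid levels and 3x3 window follow the description above; the termination criteria are assumed values, since the text does not give the maximum iteration count or the correlation threshold.

```python
import cv2

def track_features(prev_gray, next_gray, prev_pts):
    """Pyramidal Lucas-Kanade tracking of the selected features (Section 2.2)."""
    # Five pyramid levels -> maxLevel = 4 (level 0 is the original image);
    # the sum-of-squared-differences window is 3x3 pixels.
    # The iteration count / epsilon below are illustrative exit criteria.
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 20, 0.03)
    next_pts, status, err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_pts, None,
        winSize=(3, 3), maxLevel=4, criteria=criteria)
    good = status.ravel() == 1
    # Sparse flow vectors for the successfully tracked features.
    flow = next_pts[good] - prev_pts[good]
    return prev_pts[good], next_pts[good], flow
```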

Figure 2.4: Changes in texture and color appearance with distance

2.3 Flow Caching and Traceback

Recall that reverse optical flow is the use of stored optical flow vectors between each video frame and its preceding frame to establish correspondences between pixels corresponding to objects in the current frame and their locations in frames from the past. This approach uses optical flow information to track features on objects from the time they appear on screen until they interact with the local sensors of the robot. Classification and segmentation algorithms can then be trained using the appearance of these features at large distances from the robot. The approach is motivated by the example shown in Fig. 2.4. Where traditional monocular image segmentation approaches use the visual characteristics of the tree at short range, shown in the inset on the right side of the figure, after a short-range sensor interaction, the proposed approach uses the characteristics of the tree at a much greater distance, shown in the inset on the left side of the figure.

Figure 2.5: Changes in specular illumination with distance. Specular illumination depends on the angle of incidence at point P, which differs between robot positions a and b.

There are several explanations for why the visual characteristics of obstacles differ so greatly at different distances from the robot. These include possible automatic gain control of the on-board camera; periodic textures that are not visible at great distances due to camera resolution; and changes in the specular component of object illumination, which depends on the viewing angle of the observer with respect to the surface normal of the object, which in turn depends on the distance between the observer and the object (see Fig. 2.5). To combat these distance-dependent changes in visual appearance and still extract useful terrain classification information from monocular images, the approach discussed in this chapter uses the standard optical flow procedures outlined in the preceding sections to assemble a history of inter-frame optical flow in real time as the robot navigates. This information is then used to trace features on any object in the current frame back to their positions in a previous frame.

This optical flow field is populated in the following manner. First, the optical flow between adjacent video frames is calculated as discussed in the preceding two sections of this chapter. A typical optical flow field captured in this manner is shown overlaid on the original frame from a dataset taken in the Mojave Desert in Fig. 2.6. The optical flow field for each consecutive pair of video frames is then subdivided and coarsened by dividing the 720x480 image into a 12x8 grid and averaging the optical flow vectors in grid cells after removing outliers (in this work, vectors reflecting inter-frame displacement of more than 60 pixels were dropped).

Figure 2.6: White lines represent the optical flow field in a typical desert driving scene (the length of flow vectors has been scaled by a factor of 2 for clarity)

The resulting grid, with a mean vector for each cell, is then stored in a ring buffer (for the work discussed in this thesis, a 200-frame buffer was used), a simplified version of which is pictured in Fig. 2.7. A point in the current frame can be traced back to its approximate previous location in any frame in the history buffer, or to the frame where it entered the robot's field of view, by adding, for each frame of the traceback, the offset vector of the grid cell the point currently falls in to the point's coordinates. The diagram shown in Fig. 2.8 illustrates how this is done for a 200-frame traceback. Zero flow is assumed when an optical flow grid cell is empty. Fig. 2.6 gives an idea of the relative density of the optical flow field.
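To make the caching and traceback concrete, the following sketch implements the bookkeeping described above: a 12x8 grid of mean flow vectors per frame pair, a 200-frame ring buffer, and a traceback that adds the stored cell offset at each step (assuming zero flow for empty cells). The class and all names are invented for illustration; this is a sketch rather than the dissertation's code.

```python
from collections import deque
import numpy as np

GRID_COLS, GRID_ROWS = 12, 8                         # 720x480 frame -> 60x60-pixel cells
CELL_W, CELL_H = 720 // GRID_COLS, 480 // GRID_ROWS
HISTORY = 200                                        # frames of flow kept in the ring buffer
MAX_DISP = 60                                        # outlier threshold on inter-frame displacement

class FlowHistory:
    def __init__(self):
        self.buffer = deque(maxlen=HISTORY)          # one coarsened flow grid per frame pair

    def push(self, prev_pts, next_pts):
        """Coarsen one frame pair's sparse flow into a 12x8 grid of mean vectors."""
        grid_sum = np.zeros((GRID_ROWS, GRID_COLS, 2))
        grid_cnt = np.zeros((GRID_ROWS, GRID_COLS))
        for (x0, y0), (x1, y1) in zip(prev_pts.reshape(-1, 2), next_pts.reshape(-1, 2)):
            d = np.array([x0 - x1, y0 - y1])         # offset pointing back in time
            if np.linalg.norm(d) > MAX_DISP:         # drop outlier displacements
                continue
            r, c = int(y1 // CELL_H), int(x1 // CELL_W)
            if 0 <= r < GRID_ROWS and 0 <= c < GRID_COLS:
                grid_sum[r, c] += d
                grid_cnt[r, c] += 1
        grid = np.where(grid_cnt[..., None] > 0,
                        grid_sum / np.maximum(grid_cnt[..., None], 1), 0.0)
        self.buffer.append(grid)

    def trace_back(self, x, y, n_frames=HISTORY):
        """Follow a point in the current frame back n_frames using the cached offsets."""
        for grid in list(self.buffer)[-1:-n_frames - 1:-1]:   # newest to oldest
            r, c = int(y // CELL_H), int(x // CELL_W)
            if 0 <= r < GRID_ROWS and 0 <= c < GRID_COLS:
                dx, dy = grid[r, c]                  # zero flow if the cell was empty
                x, y = x + dx, y + dy
        return x, y
```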

Figure 2.7: Optical flow compressed and stored for a number of frames in the past

2.4 Examples

A few additional examples of reverse optical flow applied in different environments are shown in this section. Fig. 2.9a shows a set of points in an input video frame denoted by white circles, while Fig. 2.9b shows the location of the points in a frame 200 frames earlier in the sequence, calculated with this technique. A similar image pair taken in the same forest, but with each point numbered to aid in visualization, is shown in Fig. 2.10. Additional image pairs showing a point and the corresponding image location it was traced back to using reverse optical flow, on videos taken in the lab and outdoors with a mobile robotic platform, are shown in Fig. 2.11, Fig. 2.12, and Fig. 2.13. Finally, Fig. 2.14 shows the points resulting from the application of reverse optical flow to a video taken along a road in an ill-structured desert environment. The similarities in texture of different areas of the roadway have slightly degraded the quality of the traceback.

Figure 2.8: Operations for tracing the location of a feature backwards in time

While all the points in the current frame lie along a single line, the calculated origin points in a frame 200 frames in the past are scattered slightly in the vertical direction. Even this level of accuracy in traceback can be helpful in learning algorithms, as will be discussed in Chapters 4 and 5.


Figure 2.9: (a) Points selected in initial video frame (b) Origin of points 200 frames in the past


Figure 2.10: (a) Numbered points selected in initial video frame (b) Corresponding points 200 frames in the past

2.5 Related Work

An alternative to the feature tracking discussed in this chapter, which results in a sparse optical flow field, is dense optical flow, where the per-pixel flow field is calculated. This calculation, based on the image constraint equation and handled in Horn and Schunck's seminal paper, can be slower than sparse optical flow and has historically suffered from poor performance in cases involving large displacements or changing illumination [11]. Recent work has addressed many of these problems. The work of Brox and Malik focuses on incorporating local descriptors such as SIFT features into the variational optical flow model by treating it as an optimization problem [10].

Figure 2.11: Outdoor reverse optical flow example. Frame on right shows point when object interacts with robot's local sensors, frame on left shows image region where reverse optical flow indicates the point originated.

Figure 2.12: Outdoor reverse optical flow example. Frame on right shows point when object interacts with robot's local sensors, frame on left shows image region where reverse optical flow indicates the point originated.

The resulting method is capable of dealing robustly with large displacements. Meanwhile, Zang et al. have done work focusing on robust optical flow calculations in the face of brightness variations [12]. Their work focuses on replacing the brightness constancy assumption with local image phase constancy assumptions, and builds on the work of Bruhn et al., whose work, like that of Brox et al., strives to use elements of local optical flow methods to improve the robustness of dense optical flow techniques [13].

Figure 2.13: Indoor reverse optical flow example. Frame on right shows point when object interacts with robot's local sensors, frame on left shows image region where reverse optical flow indicates the point originated.


Figure 2.14: (a) Points on desert roadway selected in initial video frame (b) Corresponding points 200 frames in the past

Other recent work has improved the robustness of dense optical flow by applying RANSAC (RANdom SAmple Consensus)-like techniques to remove outliers [14].

The question of what constitutes a good feature for the purposes of tracking is an open one. Though the approach proposed here uses the simple corner features suggested by Shi and Tomasi, there are many alternatives. These include methods for adaptively selecting features while tracking. Collins et al. select features based on how well they separate sample distributions drawn from the presumed background and foreground [15], while Chen et al. similarly pair adaptive selection of color features with a particle filter for tracking to maximize color histogram differences between the background and foreground [16].

Other approaches, such as that proposed by Neumann and You, seek to combine the stability and robustness of using larger image patches for tracking with the relative transformation-invariance of smaller, point-sized features [17].

Improving the robustness of feature tracking has also been an area of active research in recent years. Zhou et al. chose to work with SIFT features and combine them with a mean-shift algorithm that maximizes the color histogram similarity of regions; using Expectation-Maximization, these two complementary approaches are combined to produce a robust tracker [18]. Dorini and Goldenstein's work on unscented feature tracking represents the uncertainty about the location of features using Gaussian random variables, and then uses this to improve the performance of the Kanade-Lucas-Tomasi feature tracker [19]. Takada and Sugaya improve the robustness of feature tracking by detecting incorrect feature tracks through an affine constraint imposed on feature trajectories [20]. Finally, the work of Ta et al. increases the efficiency of feature tracking by searching for matches within a neighborhood in a 3D image pyramid, rather than the 2D image itself, and by making use of a motion model [21].

Feature tracking is also being applied to novel environments. Rodrigo et al. introduce planar homographies to improve the results of feature tracking indoors [22]. Wagner et al. use a combination of SIFT and Ferns feature tracking to achieve 30 fps feature tracking on mobile phones [23]. Caballero et al. are pushing the limits of SLAM by using feature tracking from UAVs to serve as monocular visual odometry [24].

The term reverse optical flow appeared twice in the literature prior to the publication of the RSS paper on road following [2]. In the work by Fielding and Kam, the term is used to refer to the single inter-frame flow between a video frame and the one preceding it in time, as part of the process of using previously computed disparity maps to improve the quality and speed of disparity map calculation for dynamic stereo [25]. In Benoit's dissertation, the term is used to refer to using the inter-frame flow between a video frame and both the frame preceding it and the frame following it in time to recover scene information in a way that is robust to temporary occlusions [26]. The definition proposed in this chapter differs from these in that flow is calculated and stored for a much larger number of consecutive frames, allowing operations between frames significantly separated in time.

Chapter 3

Multi-Object Tracking and Activity Models

The work detailed in this chapter constitutes a novel contribution to the field: the construction and use of activity-based ground models built from video taken from a moving platform. The thrust of the research discussed in this chapter is the acquisition of activity-based models, which are models that characterize places based on the type of motion activities that occur there. For example, the activities found on roads differ from those found on sidewalks. Even among roads, motion characteristics vary significantly. Accurate activity-based ground models offer a number of potential benefits: they can help us understand traffic flow; they can assist unmanned ground vehicles in navigating autonomously (e.g., guide them to stay off a busy road); and they can help us spot activity-related change and abnormalities. Good activity models also facilitate the tracking of individual moving objects, as will be shown in this chapter.

The problem addressed here is the acquisition of activity-based ground models from a moving platform such as a helicopter. This system has been used with the Stanford helicopter shown in Fig. 3.1. The approach transforms video acquired by the helicopter, and other moving platforms, into probability distributions that characterize the frequency, speeds, and directions of moving objects on the ground, for each x-y location on the ground. To obtain such activity maps, the approach uses a pipeline of techniques for reliably extracting tracks and updating the map statistics.


Figure 3.1: The Stanford Helicopter is based on a Bergen Industrial Twin platform and is outfitted with instrumentation for autonomous flight (IMU, GPS, magnetometer, PC104). In the experiments reported here the onboard laser was replaced with a color camera.

The algorithm performs feature tracking in the image plane, followed by an optical flow analysis that uses Expectation-Maximization (EM) to identify features that are likely moving on the ground. Multiple particle filters, which are spawned, merged, and killed, are then applied to reliably identify multiple moving objects on the ground. The resulting tracks from the particle filters are fed into a histogram that characterizes the probability distribution over speeds and orientations of motions on the ground. This probability histogram constitutes the learned activity map. To illustrate the utility of the activity map, two applications are examined in this chapter: an improved particle filter tracker, and an application to the problem of global image registration.

3.1 Learning Activity Maps from a Moving Platform

3.1.1 Feature Selection and Feature Tracking

To determine which pixels in a given video frame potentially correspond to a moving object, the feature selection and feature tracking methods described in Sections 2.1 and 2.2 are applied.


Figure 3.2: (a) Optical flow based on a short image sequence, for an image containing a moving object (dark car). (b) The “corrected” flow after compensating for the estimated platform motion, which itself is obtained from the image flow. The reader will notice that this flow is significantly higher for the moving car. These images were acquired with the Stanford helicopter.

3.1.2 Identifying Moving Objects on the Ground

The principal difficulty in interpreting the optical flow to identify moving objects arises from the fact that most of the flow is caused by the platform's ego-motion. The flow shown in Fig. 3.2a is largely due to the helicopter's own motion; the only exception is the flow associated with the dark vehicle in the scene.

The proposed approach uses an Expectation-Maximization (EM) algorithm to identify the nature of the flow. Let $\{x_i, y_i, x_i', y_i'\}$ be the set of features returned by Lucas-Kanade, where $(x_i, y_i)$ corresponds to the image coordinates of a feature in one frame, and $(x_i', y_i')$ corresponds to the image coordinates of that feature in the next frame.

The displacement between these two sets of coordinates is proportional to the velocity of a feature relative to the camera plane (but not the ground!). The probability that $\{x_i, y_i, x_i', y_i'\}$ corresponds to a moving object on the ground is now calculated using an EM algorithm [27].

Specifically, let us define the binary variable $c_i$ that indicates whether the i-th feature is moving. Initially, we set $c_i = 0$ for all $i$, meaning that all features are assumed to be non-moving. The flow represented by $\{x_i, y_i, x_i', y_i'\}$ is then used to estimate the image plane transformation that results from ego-motion of the platform. The image plane transformation is represented with an affine model that captures translation, rotation, scaling, and shearing. Due to the small amount of camera motion between individual frames, and the small depth of field of the scene relative to the platform altitude, an affine transformation is a reasonable approximation in most cases. For each point $(x_i, y_i)$, the affine transformation determines its position $(x_i', y_i')$ in the subsequent frame:

$$\begin{bmatrix} x_i' & y_i' \end{bmatrix} = \begin{bmatrix} 1 & x_i & y_i \end{bmatrix} \begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \\ a_3 & b_3 \end{bmatrix} \qquad (3.1)$$

Using the set of feature correspondences $\{x_i, y_i, x_i', y_i'\}$, the linear least squares solution provides the optimal affine parameters $\vec{a}$ and $\vec{b}$. The key to the identification of moving features is now the E-step: based on the estimated image plane transformation, the expectation of the binary variable $c_i$ is then calculated:

$$p(c_i = 1 \mid \vec{a}, \vec{b}) = \eta \cdot \alpha \qquad (3.2)$$
$$p(c_i = 0 \mid \vec{a}, \vec{b}) = \eta \cdot \exp\left(-\frac{1}{2}\, \vec{D}^{\,T} \Sigma_D^{-1} \vec{D}\right) \qquad (3.3)$$

where
$$\vec{D} = \begin{bmatrix} x_i' \\ y_i' \end{bmatrix} - \left( \begin{bmatrix} 1 & x_i & y_i \end{bmatrix} \begin{bmatrix} a_1 & b_1 \\ a_2 & b_2 \\ a_3 & b_3 \end{bmatrix} \right)^T
$$

$\eta$ is a normalization factor added to ensure that the probabilities sum to 1, and $\alpha$ is a constant chosen empirically ($\exp(2)$ in this case).

The matrix $\Sigma_D$ is a diagonal matrix of size 2-by-2, containing variances for the x and y components; in this particular case it was the 2x2 identity matrix. The subsequent M-step iterates the calculation of the model parameters, but is now weighted by the expectations calculated in the E-step.

A small number of iterations then leads to an improved ego-motion estimate and, more importantly, an estimate of the probability that a feature is moving, $p(c_i)$. Fig. 3.2b shows the result of the EM: the flow vectors shown there as small white arrows all correspond with high likelihood to a moving object. In this example, the algorithm correctly identifies the features associated with the vehicle as moving, whereas most features corresponding to static objects have been identified correctly as static (and are therefore omitted in Fig. 3.2b).
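The sketch below shows one way the least-squares affine fit of Eq. (3.1) and the E-step of Eqs. (3.2)-(3.3) could be implemented. The alpha value, the identity Sigma_D, and the initialization with all features assumed static follow the text; the number of iterations and all function names are illustrative assumptions, not the original implementation.

```python
import numpy as np

def fit_affine(pts, pts_next, weights):
    """Weighted least-squares fit of the affine model of Eq. (3.1).

    pts, pts_next: (N, 2) arrays of feature coordinates in two frames.
    """
    A = np.hstack([np.ones((len(pts), 1)), pts])         # rows are [1, x_i, y_i]
    W = np.sqrt(weights)[:, None]
    # Solve (W*A) * M = (W*pts_next) for the 3x2 parameter matrix [a | b].
    M, *_ = np.linalg.lstsq(W * A, W * pts_next, rcond=None)
    return M

def estimate_moving_probabilities(pts, pts_next, alpha=np.exp(2), n_iters=3):
    """EM-style identification of features whose flow disagrees with ego-motion."""
    weights = np.ones(len(pts))                          # start: all features assumed static
    for _ in range(n_iters):
        M = fit_affine(pts, pts_next, weights)           # M-step: ego-motion model
        A = np.hstack([np.ones((len(pts), 1)), pts])
        residual = pts_next - A @ M                      # the D vector of Eqs. (3.2)-(3.3)
        # Sigma_D is the 2x2 identity here, so the quadratic form is just |D|^2.
        p_static = np.exp(-0.5 * np.sum(residual ** 2, axis=1))
        p_moving = alpha * np.ones(len(pts))
        norm = p_moving + p_static                       # eta normalization
        p_moving, p_static = p_moving / norm, p_static / norm
        weights = p_static                               # E-step result re-weights the fit
    return p_moving                                      # p(c_i = 1) per feature
```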

3.1.3 Tracking Moving Objects with Particle Filters

Unfortunately, the data returned by the EM analysis are still too noisy for constructing activity-based maps. The affine model assumes an orthographic projection, and is therefore, in general, insufficient to model all possible platform motion. In addition, some features appear to have a high probability of belonging to moving objects due to association error in the Lucas-Kanade algorithm (if the interframe tracking just discussed confuses one feature for another in the second frame, the estimated motion for the original feature may be large). The resulting activity map would then show high activity in areas where the affine assumption breaks down or Lucas-Kanade errs.

To improve the quality of the tracking, the proposed approach employs multiple particle filters. This approach is capable of tracking a variable number of moving objects, spawning an individual particle filter for each such object. Particle filters in particular provide a way to model a multi-modal distribution. Let $(\vec{s}_k^{[m]}\ \vec{v}_k^{[m]})^T$ be the m-th particle in the k-th particle filter (corresponding to the k-th tracked object).

Throughout this chapter, $\vec{s}_i$ will refer to a feature's coordinates and $\vec{v}_i$ to its velocity. The prediction step for this particle assumes Brownian motion:

$$\begin{bmatrix} \vec{s}_k^{[m]} \\ \vec{v}_k^{[m]} \end{bmatrix} \leftarrow \begin{bmatrix} 1 & \delta t \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \vec{s}_k^{[m]} \\ \vec{v}_k^{[m]} \end{bmatrix} + \begin{bmatrix} 0 \\ \vec{\varepsilon} \end{bmatrix} \qquad (3.4)$$

where $\vec{\varepsilon}$ is a random vector modeling the random changes in vehicle velocity. In this work it was a two-dimensional uniform random vector with zero mean and a range of 30 pixels in each dimension, though a Gaussian random variable might arguably be a better choice.

The importance weights are set according to the motion extracted in the previous step. Specifically,

$$w^{[m]} = \sum_i p(c_i)\, \exp\{\gamma\} \qquad (3.5)$$

where

$$\gamma = -\frac{1}{2} \left( \begin{bmatrix} \vec{s}_k^{[m]} \\ \vec{v}_k^{[m]} \end{bmatrix} - \begin{bmatrix} \vec{s}_i \\ \vec{v}_i \end{bmatrix} \right)^{T} \Sigma_w^{-1} \left( \begin{bmatrix} \vec{s}_k^{[m]} \\ \vec{v}_k^{[m]} \end{bmatrix} - \begin{bmatrix} \vec{s}_i \\ \vec{v}_i \end{bmatrix} \right) \qquad (3.6)$$

and $(\vec{s}_i\ \vec{v}_i)^T$ are the motion tracks extracted as described in the previous section, and $p(c_i)$ are the corresponding expectations. The matrix $\Sigma_w$ is a diagonal matrix of size 4-by-4, with two variances for the noise in location and two for the noise in velocity. This matrix essentially convolves each track $(\vec{s}_i\ \vec{v}_i)^T$ with a Gaussian with covariance $\Sigma_w$.

New particle filters are started if, at the border of the camera field, a large number of features with high probability $p(c_i)$ exist that are not associated with any of the existing particle filters. This operation uses tiled mean-shift operators which begin by spanning the image plane, thereby detecting all large peaks of motion. (In this implementation, the kernel size of the mean shift operator was one twentieth of the frame width by one twentieth of the frame height.) It spawns new particle filters when no existing filters are within a specified distance (30 pixels in this implementation) of each peak in the image plane. Particle filters are discontinued when particle tracks leave the image or when the total sum of all importance weights drops below a user-defined threshold (40, in this case). To help distinguish slowly moving objects from the background, and to increase the disparity between ego-motion and object motion, a full calculation is performed only once every six frames; particle filter position information for the interleaved frames is interpolated.
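A compact sketch of the per-particle prediction and weighting steps of Eqs. (3.4)-(3.6), assuming a simple (M, 4) array layout of [position, velocity] per particle. The uniform velocity noise with a 30-pixel range follows the text; the diagonal entries of Sigma_w are placeholders, since their values are not given in the text, and all names are illustrative.

```python
import numpy as np

def predict(particles, dt=1.0, vel_noise_range=30.0):
    """Brownian-motion prediction of Eq. (3.4).

    particles: array of shape (M, 4) holding [sx, sy, vx, vy] per particle.
    """
    particles = particles.copy()
    particles[:, :2] += dt * particles[:, 2:]                # s <- s + dt * v
    particles[:, 2:] += np.random.uniform(-vel_noise_range / 2,
                                          vel_noise_range / 2,
                                          size=particles[:, 2:].shape)
    return particles

def importance_weights(particles, tracks, p_moving,
                       sigma_w=np.array([10.0, 10.0, 5.0, 5.0])):
    """Importance weights of Eqs. (3.5)-(3.6).

    tracks: (N, 4) feature states [sx, sy, vx, vy] from the EM step.
    p_moving: (N,) array of p(c_i).  sigma_w: assumed diagonal of Sigma_w.
    """
    w = np.zeros(len(particles))
    for j, particle in enumerate(particles):
        diff = tracks - particle                             # (N, 4) differences
        gamma = -0.5 * np.sum(diff ** 2 / sigma_w, axis=1)   # diagonal quadratic form
        w[j] = np.sum(p_moving * np.exp(gamma))              # Eq. (3.5)
    return w
```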



Figure 3.3: (a) Multiple particle filters, used for tracking multiple moving objects on the ground. Lighter particles have been more heavily weighted in the reward calculation, and are more likely to be selected during the resampling step. Shown here is an example of tracking three moving objects on the ground, a bicyclist and two pedestrians (the truck in the foreground is not moving). (b) The center of each particle filter in a different frame in the sequence clearly identifies all moving objects.

Figure 3.4: Two moving objects being tracked in video taken from a helicopter as part of a DARPA demo.

Fig. 3.3 shows the result of the particle filter tracking. Fig. 3.3a shows a situation in which three different particle filters have been spawned, each corresponding to a different object. Fig. 3.3b shows the center of each particle filter. In this example all three moving objects are correctly identified (the large truck in the foreground did not move in the image sequence). Fig. 3.4 shows a shot of tracking video taken from the Stanford helicopter during a demo for the Defense Advanced Research Projects Agency (DARPA). The two moving objects in the video have been correctly identified and tracked from overhead.

3.1.4 Learning the Activity-Based Ground Model

The final step of this approach involves the acquisition of the behavior model. For that, the map is anchored using features in the image plane that, with high likelihood, are not moving. In this way, the activity map refers to a projection of a patch of ground into the camera plane, even when that patch of ground is not presently observable by the camera. This ground plane projection remains static with respect to the ground and does not refer to relative locations in the camera image.

The activity map is then calculated by histogramming the various types of motion observed at different locations. More specifically, the approach learns a 4-dimensional histogram $h(x, y, v, \theta)$, indexed over x-y locations in the projection of the ground in the camera plane and the velocity of the objects observed at these locations, represented by a velocity magnitude $v$ and an orientation of object motion $\theta$ (30 bins are used for speed, 36 for orientation). Specifically, each time the k-th particle filter's state $[\vec{s}\,'\ \vec{v}\,']$, where $\vec{s}\,' = \frac{1}{M} \sum_m \vec{s}_k^{[m]}$ and $\vec{v}\,' = \frac{1}{M} \sum_m \vec{v}_k^{[m]}$ (with $M$ the number of particles), intersects with an x-y cell in the histogram, the counter $h(x, y, v, \theta)$ is incremented, where $x$ corresponds to the x-coordinate of $\vec{s}\,'$, $y$ to its y-coordinate, $v = \|\vec{v}\,'\|_2$ to the magnitude of the velocity vector, and $\theta$ to its orientation.

Fig. 3.5 shows the result of the learning step. Shown here is an activity map overlaid with one of the images acquired during tracking. Blue arrows correspond to the most likely motion modes for each grid cell in the projection of the ground in the camera plane; if no motion has been observed in a cell, no arrow is displayed. Further, the length of each arrow indicates the most likely orientation and velocity at each location, and its thickness corresponds to the frequency of motion. As this image illustrates, the activity models acquired by this approach are informative of the motions that occur on the ground.
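The histogram update itself is simple bookkeeping. The sketch below uses the 30 speed bins and 36 orientation bins described above; the grid dimensions, cell size, and maximum speed are arbitrary illustrative choices, not values from the dissertation.

```python
import numpy as np

GRID_X, GRID_Y = 64, 48          # ground-plane grid (illustrative size)
SPEED_BINS, ANGLE_BINS = 30, 36  # binning described in Section 3.1.4
MAX_SPEED = 60.0                 # pixels/frame mapped to the top speed bin (assumed)

activity = np.zeros((GRID_X, GRID_Y, SPEED_BINS, ANGLE_BINS))

def update_activity(s_mean, v_mean, cell_size=10.0):
    """Increment h(x, y, v, theta) for one particle filter's mean state."""
    x = int(np.clip(s_mean[0] // cell_size, 0, GRID_X - 1))
    y = int(np.clip(s_mean[1] // cell_size, 0, GRID_Y - 1))
    speed = np.linalg.norm(v_mean)
    theta = np.arctan2(v_mean[1], v_mean[0])                     # orientation in [-pi, pi]
    v_bin = int(np.clip(speed / MAX_SPEED * SPEED_BINS, 0, SPEED_BINS - 1))
    t_bin = int((theta + np.pi) / (2 * np.pi) * ANGLE_BINS) % ANGLE_BINS
    activity[x, y, v_bin, t_bin] += 1
```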

Figure 3.5: Example of a learned activity map of an area on campus, using data acquired from a camera platform undergoing unknown motion. The arrows indicate the most likely motion direction modes in each grid cell; their lengths correspond to the most likely velocity of that mode; and the thickness represents the probability of motion. This diagram clearly shows the main traffic flows around the circular object; it also shows the flow of pedestrians that moved through the scene during learning.

3.2 Applications

3.2.1 Using the Activity Model for Improved Tracking

To understand the utility of the activity model, it has been applied to improve the quality of the particle filter tracking.

Specifically, in the improved tracking algorithm the importance weights $w^{[m]}$ are modified to take into account how well a feature's motion matches the motion seen previously in that grid cell, according to the histogram $h$:

$$w^{[m]}_{\mathrm{improved}} = w^{[m]} + k \cdot p(v_k^{[m]}, \theta_k^{[m]} \mid x_k^{[m]}, y_k^{[m]}) \qquad (3.7)$$

The second term represents the probability of each particle's motion, given its location, times a constant scale factor $k$ ($k$ was set to 20 for the results discussed here). This second term was added to, and not multiplied by, the original weights so that no single effect, either the original importance weight or the histogram-based motion reward, dominates. In the rich literature of activity learning, this approach of using learned activity models to improve the accuracy of the motion tracker is unique.
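In code, Eq. (3.7) amounts to a small addition to the weighting routine sketched earlier. The lookup below reuses the binning constants from the histogram sketch above and treats a cell's relative bin frequency as p(v, theta | x, y); all names and the cell size are illustrative assumptions.

```python
import numpy as np

def improved_weight(w, particle, activity, k=20.0, cell_size=10.0):
    """Eq. (3.7): add a histogram-based motion reward to the original weight w."""
    x = int(np.clip(particle[0] // cell_size, 0, activity.shape[0] - 1))
    y = int(np.clip(particle[1] // cell_size, 0, activity.shape[1] - 1))
    speed = np.linalg.norm(particle[2:])
    theta = np.arctan2(particle[3], particle[2])
    v_bin = int(np.clip(speed / MAX_SPEED * SPEED_BINS, 0, SPEED_BINS - 1))
    t_bin = int((theta + np.pi) / (2 * np.pi) * ANGLE_BINS) % ANGLE_BINS
    cell = activity[x, y]
    # p(v, theta | x, y): relative frequency of this motion within the cell.
    p_motion = cell[v_bin, t_bin] / cell.sum() if cell.sum() > 0 else 0.0
    return w + k * p_motion
```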

3.2.2 Registration Based on Activity Models

Learned activity models can also be applied to the problem of global image registration of independent video sequences. It is often desirable to align, or mosaic, two video sequences based upon the content of the recorded scene. Traditional registration techniques estimate the transformation between the two images using feature-based [28] or correlation-based [29] measures derived from the two source images. To demonstrate the accuracy and flexibility of the learned activity models described here, as well as their ability to identify a scene uniquely, they are applied here to the problem of global image registration.

By encoding the major activity modes of each learned grid cell as pixel intensity values, learned activity models can be translated into conventional images. Traditional image registration techniques can then be applied to align the learned activity maps. In this manner, independent activity maps of the same terrain can be merged, and previously acquired activity maps can easily be updated with additional learning data. Furthermore, traditional template matching techniques applied to learned activity maps would enable autonomous systems to characterize and later identify locations on the ground based on the motions observed.

For example, an autonomous helicopter could distinguish a four-way stop from a traffic circle using template matching and, by image registration, orient itself based on the motions of the vehicles it observes.

As a natural extension, these same registration techniques applied to learned activity models can be used to align video sequences based on observed activity (but not the actual image pixel values). One can conceive of a variety of situations in which two aerial video sequences of the same physical terrain need to be aligned but traditional image-based registration is insufficient. Such terrains may lack sufficient image landmarks for traditional registration (e.g., a desert road or maritime shipping channel), or their structural characteristics may change over time (e.g., an urban combat zone in which buildings have been destroyed). Or, the two videos could have been taken at very different times of day. In situations like these, accurate activity-based models would allow video sequences to be aligned solely on the observed motion of objects in the scene.

Fig. 3.6 shows a single-frame alignment of two independent video sequences based only on the activity-based models acquired from each video. Registration has been performed between the image-encoded activity maps of the two sequences (the registration technique used was a brute-force search over possible rotations and translations, using a sum of squared differences criterion), and the resulting transformation has been applied to the individual frames shown. While the alignment is not perfect, surprisingly accurate results can be obtained solely from the observed motion data.
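As a rough sketch of the registration just described: each activity map is encoded as an intensity image and a brute-force search over rotations and translations minimizes a sum-of-squared-differences criterion. The particular intensity encoding (strength of the dominant motion mode per cell) and the search ranges are assumptions, and SciPy's ndimage routines are used here for the geometric transforms.

```python
import numpy as np
from scipy.ndimage import rotate, shift

def activity_to_image(activity):
    """Encode each cell's dominant motion mode as a pixel intensity (assumed encoding)."""
    counts = activity.reshape(activity.shape[0], activity.shape[1], -1)
    img = counts.max(axis=2)                     # strength of the dominant mode per cell
    return img / (img.max() + 1e-9)

def register(map_a, map_b, angles=range(0, 360, 5), shifts=range(-10, 11, 2)):
    """Brute-force SSD search over rotations and translations of map_b onto map_a."""
    img_a, img_b = activity_to_image(map_a), activity_to_image(map_b)
    best, best_params = np.inf, None
    for angle in angles:
        rotated = rotate(img_b, angle, reshape=False, order=1)
        for dx in shifts:
            for dy in shifts:
                moved = shift(rotated, (dy, dx), order=1)
                ssd = np.sum((img_a - moved) ** 2)
                if ssd < best:
                    best, best_params = ssd, (angle, dx, dy)
    return best_params                           # (rotation in degrees, dx, dy) in grid cells
```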

3.3 Results

3.3.1 Hypotheses

The work in this chapter was motivated by two main hypotheses.

Multi-object tracking: First, that the proposed pipeline of feature selection and tracking, ego-motion calculation and subtraction, and multiple particle filters would produce good multiple-object tracking performance in real-world video taken from a moving platform.

Activity model improvement: Second, that the use of the activity-based ground models would boost the performance of the original tracking algorithm.

Figure 3.6: The single-frame alignment of two independent video sequences based on the activity-based models acquired from each. This registration is performed without image pixel information. It uses only activity information from the learning grid.

3.3.2 Methods

Multi-object tracking: The performance of the approach for tracking moving objects was evaluated in terms of true positives, true negatives, false positives, and false negatives.

For the purposes of this chapter, these terms are defined as follows. A true positive occurs when a moving object is correctly identified as such by the algorithm in the current frame. A true negative occurs when there are no moving objects in the current frame, and the algorithm correctly identifies this fact. A false positive occurs when the approach identifies an object as moving in the current frame when it is not, and a false negative occurs when a specific object in the current frame is not identified as moving when, in fact, it is. Evaluation occurred on two video sequences, both shot with the Stanford Helicopter: one of a single moving object (a pedestrian walking), and one of multiple moving objects (cars, pedestrians, and a cyclist).

Activity model improvement: The improvement in performance resulting from the application of the activity-based ground models was measured as follows. On a 2100-frame test data sequence, tracking accuracy is defined as the number of correct tracks, minus the number of false positives, divided by the total number of moving objects. When a red indicator circle was located over a moving object it was counted as a correct track, while a red circle located over a stationary part of the scene was considered a false positive. No segmentation of the moving objects was performed once they had been identified; automatic segmentation of moving objects has been addressed previously in the literature, and the goal of this work is simply to recognize and track moving objects.

3.3.3 Findings

Multi-object tracking: The single and multi-object tracking results on these video sequences can be found in Table 3.1. Not surprisingly, the results clearly indicate that tracking multiple moving objects is more challenging than tracking single objects.

Activity model improvement: On a 2100-frame test data sequence, tracking accuracy (defined as the number of correct tracks, minus the number of false positives, divided by the total number of moving objects) was 0.85 without using the learning data, and 0.89 when using the learning data. This corresponds to roughly a 27% reduction in the number of incorrectly identified or missed moving objects (the error rate drops from 0.15 to 0.11, and 0.04/0.15 ≈ 27%).

Table 3.1: Single and Multi-Object Tracking Performance

                  Single Object   Multi-Object
True Positives         62             140
True Negatives        124              15
False Positives        16              57
False Negatives         5              78

Fig. 3.7 compares the tracking without (top panel) and with (bottom panel) the learned activity map. More specifically, the top diagram is the result of using the standard importance weights to update the particle filters, whereas the bottom diagram uses the learned activity map for tracking, on independent testing data. As is easily seen, the track in the bottom diagram is more complete than the one in the top diagram, illustrating one of the benefits of the learned activity model. In the top diagram, the cyclist would have a different track ID when the algorithm started tracking the second time, making tasks such as statistics gathering (number of moving objects seen at different times of day, etc.) more error-prone.

3.3.4 Additional Results

To illustrate all the intermediate stages of the tracking process in another environment, the following figures show processed frames from a video of a jogger, taken beside the Gates building on the Stanford University campus using a robotic helicopter platform. Fig. 3.8 shows boxes indicating the location of the features chosen for the optical flow calculation. Fig. 3.9 shows only those features which have been determined to be moving after correcting for egomotion (in this case, all lie on the jogger). Fig. 3.10 shows the particles corresponding to a particle filter tracking the jogger. Particles which have received a large weight are light colored; particles receiving a lower weight (less than 0.01) are a darker color.

In order to quantify the effect of using more training data while building the learned activity model on tracking accuracy, the number of objects appearing in the training data (a more useful metric than simply the number of frames) was increased while recording the tracking accuracy on a separate test video sequence. The results are shown in Fig. 3.11.

(a) Tracking without learned activity map

(b) Tracking with learned activity map

Figure 3.7: Example of two tracks (a) without and (b) with the learned activity map. The track in (a) is incomplete and misses the moving object for a number of time steps. The activity map enables the tracker to track the top object more reliably.

Figure 3.8: Selected features in a frame from the jogging video.

3.3.5 Conclusions

This chapter presents a system for learning activity models of outdoor terrain from a moving aerial camera. The approach acquires such models from a camera that is undergoing unknown motion. To identify moving objects on the ground, this approach combines image-based feature tracking with an EM approach for estimating the image transformation caused by the camera's ego-motion. This identifies features whose motion is counter to the flow induced by the estimated ego-motion. Next, multiple particle filters are employed to identify and track moving objects. The object motion is then cached in a histogram that learns the probability distribution of different motions at different places in the world. Applications of the learned activity model include improved tracking and global registration of two different models based on the activity patterns. This approach runs at 20 Hz on a 2.4 GHz PC.

Figure 3.9: Features classified as moving in a frame of the jogging video.

While most elements of this approach are well known in the literature, it defines the state of the art in finding moving objects on the ground from a helicopter platform. Further, the use of learned activity models for tracking and registration is unique. The system has been found to be robust in tracking moving objects and learning useful activity models of ground-based motion, and these models have proven to be applicable to problems of general interest.

Figure 3.10: Particles corresponding to a particle filter tracking the jogger. Lighter colored particles have been heavily weighted in this step; darker particles have received a lower weight.

3.4 Related Work

In recent decades, the problem of acquiring accurate ground models, or maps, has become the focus of a number of different research communities. Photogrammetry investigates the acquisition of models from remote imaging sensors flown on high-altitude aircraft or satellites [30]. Many roboticists concern themselves with the acquisition of maps from the ground, using mobile robots operated indoors [31], outdoors [32], underwater [33], or in the subterranean world [34].

The vast majority of these techniques, however, address the acquisition of static models. Moving entities, such as cars, bicyclists, and pedestrians, are usually considered irrelevant to the mapping problem.

Figure 3.11: Dependence of tracking accuracy on number of moving objects in training data.

The acquisition of activity-related models, however, has received some attention. For example, Makris and Ellis use video from surveillance cameras to develop an activity-based model of entry points, exit points, paths, and junctions within a scene [35]. Unlike the work described here, their approach assumes a static sensor platform, which greatly facilitates the detection and tracking of moving entities. Related work by Wang et al. uses an unsupervised learning algorithm to learn semantic scene models via trajectory analysis [36]. Although this work also uses a static sensor platform, the focus on the importance of constructing a priori distributions to describe different types of behavior occurring in different areas (and its usefulness for identifying unusual behaviors) is similar to the work described in this thesis. Two major differences between their work and this work, besides the moving platform used to acquire the model proposed here, are their use of sources, sinks, and paths to describe motion (vs. the more general histogram suggested here), and their explicit classification of pedestrian and vehicular movement as part of the algorithm.

Stauffer and Grimson also use a static sensor forest to track motion, learn patterns of activity at a site, and classify the observed activities [37]. This approach also allows them to identify abnormal behaviors in the scene. For them, the static sensor forest is a key part of the implementation, since it allows the use of background pixel statistics. More recently, Ermis et al., experimenting with a static sensor forest with partial field-of-view overlap, found that activity features based on the motion of objects in the image plane outperformed SIFT features for tracking objects between cameras [38]. Using visual words to assemble video documents that in turn are fed to a Latent Semantic Analysis algorithm, Varadarajan and Odobez are able to build a model of activity patterns which, like the work of Stauffer and Grimson, recognizes abnormal behaviors in video of automotive traffic [39]. Kembhavi et al. take a different approach, representing an activity model using a Markov Logic Network based on human knowledge, and tackling the problem of interpreting such a 3D model from 2D video [40]. The work of Zhang et al. leverages a static sensor forest that provides multiple viewpoints to learn motion patterns for vehicles and pedestrians once a separate co-trained classifier, using the multiple viewpoints, has classified moving objects as either pedestrians or vehicles; a graph is then developed which clusters the observed motion patterns [41]. Opting instead to use dense optical flow to aggregate pixel-wise motion statistics, Yang et al. utilize a method for describing video clips using a bag-of-words model to achieve high-quality motion pattern detection [42]. Finally, using a very different sensor suite (RFID tags), the work of Wilson and Atkeson also exploits the utility of allowing learned activity models (in this case the movement of monitored patients) to allow the Bayesian filter to recover from ambiguities and improve tracking [43].

Tracking objects from a moving platform has received attention from Burt et al. [44]. Their approach uses foveation as part of a dynamic motion analysis technique. More recent work has harnessed a number of novel techniques to improve the performance of multiple object tracking from moving platforms. Lin and Wolf use Markov Chain Monte Carlo-based sampling to make particle filter tracking robust to sudden position changes resulting from ego-motion [45], while also proposing a new ego-motion estimation method which leverages a genetic algorithm [46].

McIntyre et al. propose an efficient method for tracking moving objects from a potentially moving platform by modeling the potential changes in the objects' appearance in the image space [47]. Ess et al. add a stereo rig and utilize a graphical model in an attempt to deal with the complexity of the multi-object tracking problem [48]. Xiao et al. at Sarnoff Corporation have adopted an approach similar in spirit to the one proposed in this chapter for ego-motion estimation. They use dense optical flow to estimate the dominant motion between video frames, and then outlier regions are segmented out and removed from the calculation [49].

Although the tracking pipeline proposed here is novel, its component parts have been used by other researchers for similar goals. Probabilistic filtering methods (Kalman and particle filters) have been applied previously to tracking moving objects [50, 51]. The use of EM to identify objects on the ground that are moving is similar to the approach taken by Jung and Sukhatme [50]. Finally, the use of multiple particle filters which are spawned, merged, and killed to track moving objects has been proposed by Vermaak, Doucet, and Pérez [52].

Chapter 4

Road Following

The work described in this chapter gives rise to a novel method for identifying traversable areas in a loosely-structured environment that makes use of a combination of optical flow techniques and dynamic programming. Autonomous mobile robot navigation on ill-structured roads presents unique challenges for machine perception. A successful terrain or roadway classifier must be able to learn in a self-supervised manner and adapt to inter- and intra-run changes in the local environment. The approach discussed here was originally developed for long-range sensing in the DARPA Grand Challenge. In this application, the methods described in Chapter 2 are used to produce a stored reverse optical flow history grid for the 200 video frames taken before the current frame. For a representative 7000-frame video sequence, an average of 20% of the grid cells below the horizon are empty, often as a result of a lack of texture in the image. The reverse optical flow information is used to assemble a set of templates based on the appearance of the patch of roadway currently in front of the robot. These templates are used to find the location of the road at different distances from the robot in the current frame.
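The flow caching and traceback machinery of Chapter 2 can be summarized, under stated assumptions, with the minimal sketch below: per-frame mean flow vectors are cached in a fixed-length buffer of grid cells, and a pixel is traced backwards by repeatedly subtracting the cached flow of the cell it falls in. The class name, grid granularity, and the treatment of empty cells (zero flow here) are illustrative assumptions, not the exact implementation.

```python
from collections import deque
import numpy as np

class FlowHistory:
    """Minimal sketch of a reverse optical flow history grid (assumed interface).

    Each frame, the mean optical flow vector observed in every grid cell is
    cached.  A point in the current frame can then be traced back through the
    last `depth` frames by stepping it backwards along the cached per-cell flow.
    Cells with no tracked features are assumed here to contribute zero flow.
    """

    def __init__(self, grid_rows, grid_cols, image_h, image_w, depth=200):
        self.grid_rows, self.grid_cols = grid_rows, grid_cols
        self.cell_h = image_h / grid_rows
        self.cell_w = image_w / grid_cols
        self.history = deque(maxlen=depth)   # the oldest frame is dropped automatically

    def push(self, mean_flow_grid):
        """mean_flow_grid: (grid_rows, grid_cols, 2) array of mean (dx, dy) per cell."""
        self.history.append(np.asarray(mean_flow_grid, dtype=float))

    def trace_back(self, x, y, n_frames):
        """Return the estimated (x, y) location of the point n_frames in the past."""
        n = min(n_frames, len(self.history))
        frames = list(self.history)[len(self.history) - n:]
        for grid in reversed(frames):        # walk from the newest frame to the oldest
            col = int(np.clip(x // self.cell_w, 0, self.grid_cols - 1))
            row = int(np.clip(y // self.cell_h, 0, self.grid_rows - 1))
            dx, dy = grid[row, col]
            x, y = x - dx, y - dy            # step backwards along the cached flow
        return x, y
```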


Figure 4.1: Adaptive road following algorithm

4.1 Adaptive Road Following

The approach discussed here was described in an earlier form in [2] and is composed of the following steps: reverse optical flow, horizontal 1D template matching, and dynamic programming. The algorithmic flow is depicted in Fig. 4.1.

This approach is designed to deal with ill-structured desert roads where traditional highway road following cues such as lane markings and sharp image gradients associated with the shoulder of the road are absent. Regions off the roadway will often be similar enough in texture and color to the roadway itself that traditional segmentation techniques are unable to differentiate between them. This approach requires that the vehicle is currently traveling on the road and then tracks regions similar to the area directly in front of the vehicle.



Figure 4.2: (a) Dark line shows the definition region used in the proposed algorithm. (b)-(d) White lines show the locations in previous frames to which optical flow has traced the definition region.

This region, typically twenty pixels high, which will be called the definition region, is shown in Fig. 4.2a. If the vehicle is currently on the road, the pixels directly in front of the vehicle in the image are representative of roadway. Since the appearance of the definition region at different times in the past is important for the template matching procedure described below, the requirement for the correct functioning of the algorithm is that the vehicle has been traveling and recording video for at least 7 seconds (not necessarily on the road) and is currently on the road.

Template Matching: To determine the location of the road at different heights in the current image, a set of horizontal templates is collected that reflects the best guess about the appearance of the roadway at the current time. These templates are formed by taking the location of the definition region in the current frame. Optical flow is used to find the location of this region in images increasingly far in the past.

Figure 4.3: Output of the horizon detector is the dark line.

This approach is effective because of the difference in appearance of objects as a function of their distance from the robot discussed earlier in this thesis. Figs. 4.2(b-d) show the locations of the templates at different distances from the robot as a result of using reverse optical flow to find the location of the definition region in past frames. By using the set of templates collected for the current definition region, the most likely location of the road at various heights in the image can be estimated with a horizontal template matching algorithm. The vertical search height for a given template in the current image is determined by multiplying the vertical location of the template in the original image by a scale factor. The scale factor is the difference between the height of the horizon in the current frame and its height in the original. The location of the horizon in each frame is determined using a horizon detector based on the work of Ettinger et al. [53]. A 2-D search space parametrized by the height and angle of the horizon in the image is searched using a multi-resolution approach to minimize a criterion. This criterion is the sum of the variances in the blue channel of the pixels labeled as sky and those labeled as ground. An example of the detected horizon is shown in Fig. 4.3. This scaling was necessary to mitigate the effect of changes in the pitch of the vehicle on the performance of the algorithm.
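A hedged sketch of the horizon criterion described above is given below: for a candidate horizon parameterized by its height and angle, the score is the sum of the blue-channel variances of the pixels labeled sky and ground. The coarse-to-fine search over the two parameters is omitted, and the function name and BGR channel order are assumptions rather than details of the implementation in [53].

```python
import numpy as np

def horizon_criterion(image_bgr, height, angle_rad):
    """Sum of blue-channel variances above (sky) and below (ground) a candidate
    horizon line parameterized by its height at the image center and its angle.
    Lower is better; a multi-resolution search over (height, angle) would
    minimize this score.  (Illustrative sketch, not the exact implementation.)"""
    h, w = image_bgr.shape[:2]
    blue = image_bgr[:, :, 0].astype(float)            # BGR channel order assumed
    cols = np.arange(w) - w / 2.0
    line_y = height + np.tan(angle_rad) * cols          # horizon row for each column
    rows = np.arange(h)[:, None]
    sky_mask = rows < line_y[None, :]
    sky, ground = blue[sky_mask], blue[~sky_mask]
    if sky.size == 0 or ground.size == 0:
        return np.inf
    return sky.var() + ground.var()
```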

It is worth noting that while this approach was developed using a video stream without vehicle telemetry, the addition of accurate pitch information would obviate the need for a horizon detector.

Both the templates and the search space are horizontal slices of the image. Templates taken from curved roads therefore appear similar to those taken from straight roads. However, templates taken from curved portions of roadway will be artificially wide. The same effect occurs if the vehicle is undergoing a moderate amount of roll. The template matching measure combined with the dynamic programming approach described below mitigates these problems. Normalized sum of squared differences (SSD) is used as the template matching method. This computes the strength of the template match along each horizontal search line. The normalized SSD measure is defined as follows (where I is the image, T is the template, x', y' range over the template, and x, y range over the image):

R(x, y) = \frac{\sum_{x'} \sum_{y'} \left[ T(x', y') - I(x + x', y + y') \right]^2}{\left[ \sum_{x'} \sum_{y'} T(x', y')^2 \cdot \sum_{x'} \sum_{y'} I(x + x', y + y')^2 \right]^{0.5}} \qquad (4.1)
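A small sketch of Eq. (4.1), evaluated for one template along a single horizontal search line, is shown below; lower values of R indicate better matches, and the function name and array conventions are assumptions. OpenCV's matchTemplate with the TM_SQDIFF_NORMED flag computes the same normalized measure far more efficiently.

```python
import numpy as np

def normalized_ssd_row(image, template, row):
    """Evaluate Eq. (4.1) for one template along a single horizontal search line.

    `image` and `template` are 2-D float arrays; `row` is the top row of the
    horizontal band being searched.  Returns R(x, row) for every horizontal
    offset x.  (Illustrative sketch only.)"""
    th, tw = template.shape
    band = image[row:row + th, :].astype(float)
    t = template.astype(float)
    t_energy = np.sum(t ** 2)
    n_positions = band.shape[1] - tw + 1
    scores = np.empty(n_positions)
    for x in range(n_positions):
        window = band[:, x:x + tw]
        num = np.sum((t - window) ** 2)                      # numerator of Eq. (4.1)
        den = np.sqrt(t_energy * np.sum(window ** 2)) + 1e-12
        scores[x] = num / den
    return scores
```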

Because the search space for each template is only a single horizontal line and the height of the template is typically 20 pixels, this matching measure can be computed very quickly. Fig. 4.4b shows the output of the matching measure for a set of 10 template search lines in the image shown in Fig. 4.4a. The responses in the SSD image have been widened vertically for display purposes, though they appear in the image at the height at which each individual template was checked. White regions indicate stronger matches, while dark regions indicate weaker matches. Strong responses can also be seen in the upper right portions of the scene, where the lack of both vegetation and shadows combine to make template matches to the right of the roadway attractive.

Dynamic Programming: Fig. 4.5 depicts the location of the maximum SSD response along each horizontal search line with dark circles. Sometimes the location of maximum response does not lie on the roadway due to similarities in visual characteristics between the roadway and areas off the road, and because of illumination differences between the search areas and the templates.



Figure 4.4: (a) Input video frame (b) Visualization of SSD matching response for 10 horizontal templates for this frame.

The need to find the globally optimal set of estimated road positions while satisfying the constraint that some configurations are physically impossible for actual roads suggests the use of dynamic programming. The purpose of dynamic programming is then to calculate the estimated position of the road at each search line in the image so that when the positions are taken together they minimize some global cost function. The cost function chosen in this case is the SSD response for each horizontal search line, summed over all search lines. The search lines are processed from the topmost downward, with the cost at each horizontal position computed as the SSD cost at that location plus the minimum cost within a window around the current horizontal position in the search line above.

Figure 4.5: Dark circles represent locations of maximum SSD response along each horizontal search line. Light circles are the output of the dynamic programming routine. The gray region is the final output of the algorithm and is calculated using the dynamic programming output. The width of this region is linearly interpolated from the horizontal template widths. Uneven vertical spacing between circles is the result of changing vehicle speeds over time.

The horizontal position of this minimum cost is also stored as a link. Once the bottommost search line has been processed in this way, the globally optimal solution is found by following the path of stored links, each of which points to the minimum cost position in the search line above. The path traversed represents the center of the estimated position of the road. Given the assumption that the vehicle is currently on the roadway, the finite window restriction serves to enforce a constraint on the maximum expected curvature of the road in the image plane, and reduces the computation time of the optimization. To compute a reasonable window size for a given road heading, the equation below can be used.

W \geq \frac{(\min \Delta_H) \cdot \Delta x}{\Delta y} \qquad (4.2)

Imagine the centerline of the roadway as a two-dimensional curve in the image plane. At the height in the image where the road is undergoing the most curvature, ∆x will be the largest for a given ∆y. The dynamic programming window size W used in the technique just described must be greater than or equal to the minimum vertical separation between template match regions (min ∆H) multiplied by the maximum expected value of ∆x/∆y. For the results discussed in this chapter, a dynamic programming window size of 5 pixels was used.

The output of the dynamic programming module is depicted by the light circles in Fig. 4.5. The gray region in Fig. 4.5 indicates road segmentation. The location of this region was determined by the dynamic programming output, while the width of the segmented region was linearly interpolated from the widths of the horizontal templates.
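The dynamic programming pass just described can be sketched as follows, under the assumption that the per-line match responses have already been computed and are expressed as costs to be minimized; the function name and data layout are illustrative.

```python
import numpy as np

def dp_road_centerline(ssd_rows, window=5):
    """Pick one horizontal position per search line so the summed cost is
    globally minimal, subject to a maximum horizontal jump of `window` pixels
    between adjacent lines (illustrative sketch of the routine described above).

    ssd_rows: list of 1-D arrays, topmost search line first; lower cost = better.
    Returns one index per search line, recovered by following the stored links
    upwards from the bottommost line.
    """
    costs = [np.asarray(ssd_rows[0], dtype=float)]
    links = []
    for row in ssd_rows[1:]:                      # process lines from the top downward
        row = np.asarray(row, dtype=float)
        prev = costs[-1]
        cost = np.empty_like(row)
        link = np.empty(len(row), dtype=int)
        for x in range(len(row)):
            lo, hi = max(0, x - window), min(len(prev), x + window + 1)
            j = lo + int(np.argmin(prev[lo:hi]))  # cheapest predecessor within the window
            cost[x] = row[x] + prev[j]
            link[x] = j
        costs.append(cost)
        links.append(link)
    # Trace the stored links from the bottommost search line back to the top.
    path = [int(np.argmin(costs[-1]))]
    for link in reversed(links):
        path.append(int(link[path[-1]]))
    return path[::-1]                             # topmost line first
```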

4.2 Results

4.2.1 Hypotheses

The hypothesis driving the work described in this chapter was that the combination of reverse optical flow, 1D template matching, and dynamic programming would produce road classification results in ill-structured environments that were superior to a competitive, off-the-shelf, Markov Random Field (MRF) road classification algorithm that made no prior assumptions about the visual appearance of the roadway being followed.

4.2.2 Methods

Single frame results taken from three different 720x480 pixel video sequences shot in the Mojave Desert are shown in Fig. 4.6. Each column of the figure contains images from a different test video sequence.

Figure 4.6: Single frame algorithm output for three Mojave Desert data sets. Each column contains results from one of the three video sequences.

The first video sequence, taken in direct sun, contains footage from a straight dirt road where scrub brush lines the sides of the road. The second sequence comes from a straight road with less vegetation along the sides. Taken late in the afternoon, it has long shadows stretching across the road. The third sequence is from a trip through terrain with changes in elevation and gravel coloration. Between the three sequences there are more than 12 minutes of video.

The details of the implementation of the algorithm discussed above are as follows. The road position estimates are the result of 1D template matching of a set of 10 horizontal templates found using the optical flow procedure. These templates are samples, at different points in time, of the visual characteristics of the roadway area currently in front of the robot. These templates were taken from the past, ranging from 1 frame to 200 frames prior to the current frame. The spacing of the temporal samples was chosen to provide an even vertical spacing in the image plane.

The templates were 20 pixels high. The definition region and templates were refreshed every 10 frames in order to adapt to gradual changes in the appearance of the roadway. A total of 3000 feature correspondences were used to calculate the optical flow fields. The mean flow vectors were stored in a grid of 96 square cells covering the entire image plane.

To quantify the overall performance of the algorithm in this domain, the results of running it on the three 7200-frame data sets described above were evaluated using the two performance metrics described below. The data sets were taken prior to the development of this algorithm. They do not reflect the algorithm exerting any control over the path of the vehicle. For comparison purposes, each of these data sets has also been run through an image segmentation program which uses an MRF with the Metropolis algorithm for classification. The software was written by Kato et al. [54, 55, 56]. This software is publicly available at Zoltan Kato's website, but it has been modified for the purposes of this experiment to take fixed training regions and upgraded to use full color information and covariance matrices. The training regions have been permanently set for every test frame to be a rectangular region in front of the vehicle, which corresponds to the definition region used for the algorithm, and a square region to the left of where the road should appear in the image. These regions were chosen because they provide the MRF with a good example of what the road looks like versus the rest of the scene. The Metropolis algorithm works by iteratively refining a labeling assignment over pixels according to a Single and Double potential to minimize an energy function. The Double potential component of the energy function is minimized when neighboring pixels have the same labeling. The Single potential is minimized when a pixel is labeled as belonging to the most similar class. This procedure produces an image classification which will be compared to the results generated by the optical flow technique using the following two metrics: a pixel coverage metric and a line coverage metric.

Pixel Coverage Metric: The first metric compares pixel overlap between the algorithm output and ground truth images in which the road has been segmented by a human operator, as shown in Fig. 4.7.

Figure 4.7: Typical human-labeled ground-truth image.

The number of pixels in the frame that have been incorrectly labeled as roadway is subtracted from the number of correctly labeled roadway pixels. This number is then divided by the total number of pixels labeled as road by the human operator for that frame. Using the metric proposed here, a score of 1.0 would correspond to correct identification of all the road pixels as lying in the roadway (while not labeling any pixels outside the roadway as road pixels). A score of 0.0 would occur when the number of actual road pixels labeled as roadway is equal to the number of non-roadway pixels incorrectly identified as being in the road. If there were more incorrect than correct roadway pixel labels, the score would be negative. This measure is computed once per second and averaged over the entire video sequence. While this pixel coverage metric is easily visualized and simple to compute, it must be recognized that, due to perspective effects, it is strongly weighted towards regions close to the vehicle.

Line Coverage Metric: The second metric mitigates the distance-related bias of the first metric by comparing pixel overlap separately along a set of horizontal lines in the images. Five evenly spaced horizontal lines are chosen, ranging in vertical position between the road vanishing point and the vehicle hood in the ground-truth image. Success scores are calculated just as in the first metric, except they are reported individually for each of the five lines. The metric returns five sets of success scores computed once per second and averaged over the entire video sequence.
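A minimal sketch of the two metrics, assuming binary road masks for the algorithm output and the human-labeled ground truth, is given below; the function names and mask conventions are illustrative.

```python
import numpy as np

def pixel_coverage_score(pred_mask, truth_mask):
    """Pixel coverage metric: (correct road pixels - incorrectly labeled road
    pixels) / total human-labeled road pixels.  Masks are boolean arrays with
    True = road.  (Illustrative sketch of the metric defined above.)"""
    correct = np.logical_and(pred_mask, truth_mask).sum()
    incorrect = np.logical_and(pred_mask, ~truth_mask).sum()
    total_truth = truth_mask.sum()
    return (correct - incorrect) / total_truth if total_truth else 0.0

def line_coverage_scores(pred_mask, truth_mask, rows):
    """Line coverage metric: the same score evaluated separately on each of a
    set of horizontal evaluation rows (e.g. five rows spaced between the
    vanishing point and the vehicle hood)."""
    return [pixel_coverage_score(pred_mask[r:r + 1, :], truth_mask[r:r + 1, :])
            for r in rows]
```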

Figure 4.8: Pixel coverage results on the three test video sequences

4.2.3 Findings

Fig. 4.8 shows the performance of the algorithm proposed in this chapter on the three different video sequences, evaluated using the pixel coverage metric. These results are compared to the output from the Markov Random Field segmentation program. The MRF classifier assumes that the region directly in front of the vehicle is on the roadway, but it does not use optical flow methods.

It is worth noting again here that this approach performs road following only when the vehicle is currently on the roadway. The scores for both the proposed approach and the MRF-based classification reflect only the ability to follow the road, not to find it. If this were informing a naive planner with no inertia, the results would be sobering. Over the 7200-frame test data, the mean time to failure, where failure is defined as a single frame where the calculated position of the roadway does not overlap the roadway at all, at any height in the image, is 18 seconds (540 frames).

The MRF classifier outperforms the optical flow technique slightly on the first video sequence. In this video, the areas off the roadway are covered with sage brush which differs from the roadway in both color and texture. A representative frame of output from both the MRF classifier and the optical flow technique, and the corresponding input frame, are shown in Fig. 4.9.



Figure 4.9: (a) Input frame from the video characterized by straight dirt road with scrub brush lining the sides (b) Optical flow technique output (c) MRF classifier output

In the second video sequence, the long shadows and sparser vegetation combine to make areas off the road visually more similar to the roadway. Consequently, the performance of the MRF classifier was adversely affected while the optical flow technique was unaffected. Fig. 4.10 shows the test output on a representative frame from this video. An interesting limitation of the approach is its vulnerability to intermittent shadows. If a template taken from the past comes from a region in shadow, and is being matched against a region in the current image which is not in shadow, the resulting SSD response will be biased towards darker regions. This effect is shown in the SSD response and corresponding input frame in Fig. 4.11. The dynamic programming step alleviates the severity of this effect.



Figure 4.10: (a) Input frame from the video with long shadows and sparse vegetation (b) Optical flow technique output (c) MRF classifier output

In the third video sequence, the almost complete absence of vegetation makes the areas off the road visually very similar to the roadway. The performance of the MRF classifier suffers as a result, while the strong vertical texture components in the roadway improve the results of the template matching done using the optical flow technique. This can be seen in the test output in Fig. 4.12. Fig. 4.13 shows the performance of the algorithm on the same three data sets, now evaluated using the line coverage metric. Scores are graphed for a set of five evaluation lines increasingly distant from the vehicle. The performance of the algorithm generally declines as the distance from the vehicle increases. The proposed algorithm achieves very low false positive rates by making no assumptions about the general appearance of the road and classifying only regions that adhere to its learned roadway information.

Videos of the three 7200-frame test sets, showing the results of tracking with the proposed algorithm as well as with the MRF classifier, are available at http://cs.stanford.edu/group/lagr/road following/. The algorithm runs at 3 Hz on a 3.2 GHz PC at 720x480 pixel resolution.


Figure 4.11: (a) Input frame (b) SSD response

4.2.4 Additional Results

To illustrate the performance of the reverse optical flow calculation on a winding desert road, Fig. 4.14 shows the computed sparse optical flow field for a single frame, while Fig. 4.15 shows the calculated location of the definition region in a series of frames going back to a point 200 frames in the past. The importance of the dynamic programming step can be visualized in Fig. 4.16 and Fig. 4.17. In the first, the absolute template match positions are used to calculate the center of the roadway at different heights in the image, while in the second, the output of the dynamic programming step is used.

In addition to comparing the output of the road following algorithm with an off-the-shelf MRF classifier, as noted above, its performance was also evaluated in reference to naive color (HSV space clustering, wherein a pixel was considered a good match if its Hue was within 10 and its Saturation and Value within 20 of the mean values for the template) and texture (a 20x20 pixel template, evaluated using a Sum of Squared Differences criterion) road classification algorithms. These algorithms were fed the same definition region pixels as were used by the road following algorithm described in this chapter.
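A hedged sketch of the naive color baseline is shown below; it assumes OpenCV's 8-bit HSV conventions (Hue in [0, 180), Saturation and Value in [0, 255]), and the function name and definition-region format are illustrative rather than taken from the original implementation.

```python
import cv2
import numpy as np

def naive_hsv_road_mask(frame_bgr, definition_region):
    """Naive color baseline: a pixel matches if its Hue is within 10 and its
    Saturation and Value within 20 of the mean HSV of the definition region.
    (Illustrative sketch; the HSV scale used in the original is assumed.)"""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(float)
    x, y, w, h = definition_region                  # region directly in front of the vehicle
    mean_h, mean_s, mean_v = hsv[y:y + h, x:x + w].reshape(-1, 3).mean(axis=0)
    dh = np.abs(hsv[:, :, 0] - mean_h)
    dh = np.minimum(dh, 180 - dh)                   # hue wraps around
    return ((dh <= 10)
            & (np.abs(hsv[:, :, 1] - mean_s) <= 20)
            & (np.abs(hsv[:, :, 2] - mean_v) <= 20))
```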



Figure 4.12: (a) Input frame from video characterized by changing elevations and gravel colors (b) Optical flow technique output (c) MRF classifier output

Fig. 4.18 shows the output of the color-based algorithm, while Fig. 4.19 shows the output of the texture-based algorithm. The failures of the two algorithms are instructive, given that it is precisely the similarity of texture and color between on-roadway and off-roadway image patches that makes the road following problem difficult. Fig. 4.20 and Fig. 4.21 show the results of comparing these approaches to the reverse optical flow-based road following algorithm using the same test data and the pixel and line coverage metrics described above.

4.2.5 Conclusions

The approach described in this chapter pairs a novel use of optical flow, which harnesses the visual differences between objects observed at different distances, with a dynamic programming approach to produce a road following algorithm with good performance.

Figure 4.13: Line coverage results are shown at different distances from the front of the vehicle towards the horizon for the three video sequences.

Figure 4.14: Sample optical flow field in a frame from the winding desert video.

In this approach, reverse optical flow was used to track a definition region in front of the vehicle at the current time back to its location in previous frames at different points in the past. The appearance of a horizontal template, taken from the discovered position in the past frames, was used to find likely positions of the roadway at different heights in the current image.

Two separate methods for determining the correctness of the road following algorithm were proposed: one based on a total pixel coverage metric, and one based on pixel coverage along a number of lines at different distances from the vehicle, which somewhat ameliorates the bias towards getting the area closest to the vehicle right. The results reported in the section above indicate that the combination of reverse optical flow, 1D template matching, and dynamic programming did produce road classification results in ill-structured environments that were superior to a competitive, off-the-shelf, MRF-based road classification algorithm. Note that although the MRF-based approach is more general, it also requires an initial example of pixels corresponding to the roadway and of pixels off the roadway.

4.3 Related Work

The general problem of road following in loosely-structured environments has received a great deal of attention from roboticists in the last few years.

Some approaches use laser range finders in concert with computer vision, as in the work of Manz et al., where a LIDAR occupancy grid is used along with a color-based visual feature derived from the Hue component of the HSI color space to evaluate different proposed positions of the roadway [57]. Ordonez et al. use lasers in concert with an Extended Kalman Filter to find ruts caused by earlier traversals of a path and determine the vehicle's orientation with respect to these ruts [58]. Other approaches use stereo vision as an additional input to the algorithm. In the method proposed by Guo et al., a planar road surface is assumed, and splines which represent the roadway boundaries are estimated using stereo data and a RANSAC algorithm [59]. Hadsell et al. suggest a more generalized approach, also applicable to offroad navigation and tested on the DARPA Learning Applied to Ground Robotics platform, where stereo vision is used to train a deep hierarchical network which in turn can classify monocular video at distances out to the horizon [60].

Approaches that use only monocular video fall into several different classes as well. The work of Wang et al. uses hybrid features comprised of color, texture, and edge components, and an SVM (Support Vector Machine)-based classifier [61]. The examples included in this experiment, however, featured a large color and texture disparity between on-road and off-road regions. Kong et al., meanwhile, focus on finding the vanishing point of the road in a single video frame using Gabor filters to compute the dominant texture orientations at each pixel, followed by segmentation of the image into road and non-road areas based on dominant edge detection constrained by the location of the vanishing point [62, 63]. Wu and ShuFeng employ the Otsu multi-threshold algorithm (which attempts to minimize the mean squared error between a segmented binary image and the original image) to segment a single image, and then use Canny edge detection to determine the location of the unstructured road [64]. Finally, Yanqing et al. propose using the Otsu algorithm and Canny edge detection as described in the Wu paper, but include a Monte Carlo sampling step to associate the position of the road in different video frames [65]. Interestingly, this work assumed that roads had a very small curvature, and that their boundaries could be represented as straight lines, something the work presented in this chapter does not assume.

The ambitious work of Cheng et al., which proposes a lane detection system capable of handling structured and unstructured roads, uses mean-shift segmentation to divide the image into different regions after a hierarchical system has decided, based on the absence of lane marking-colored pixels, that the vehicle is in an unstructured environment, and then selects road boundaries using Bayes rule [66].

In the same spirit as the work described in this chapter, optical flow techniques have been widely used in mobile robot navigation. Image flow divergence has been used to orient a robot within indoor hallways by estimating time to collision [67]. Differences in optical flow magnitude have been used to classify objects moving at different speeds to simplify road navigation in traffic [68].

While the combination of techniques proposed here is novel, many of the components have appeared before in the literature on this and related topics. Similar template matching techniques have been used to determine lateral offset in previous work for a lane departure warning system [69].1 Dynamic programming has already been used for road detection, both aerial [70] and ground-based [71]. Some implementations use custom, parallelized hardware [72]. Monocular vision-based techniques have been applied with success to road-following in ill-structured environments [73]. In particular, the work of Redmill et al. is similar to the approach proposed here both in its use of dynamic programming for picking an optimal set of centerpoint locations for the roadway being followed and in its a priori assumptions about various characteristics of the curvature of the roadway [74]. In fact, the Redmill et al. paper proposes the use of a Kalman filter for prediction of the roadway's position in the next video frame, a step the work discussed in this thesis lacks. Their work, however, is proposed for following paved, marked roadways, and does not exploit optical flow techniques to improve the quality of the initial guesses for roadway position.

One of the first uses of self-supervised learning was road-following in unstructured environments [75]. Crisman's work, though it relies on the roadway having different visual characteristics from the surrounding areas, is very useful since it does not require that the vehicle currently be on the road, and so it can be used as a bootstrapping method for a more robust road follower.

1 The requirement that the template be compared to an image where the vehicle is known to have been centered in the lane is similar to the bootstrapping requirement of the approach described in this chapter.

While this work was originally developed for use during the DARPA Grand Challenge, it was replaced by the more general vision work of Dahlkamp et al. prior to race day [76].

Figure 4.15: The first panel shows the definition region in front of the vehicle; subsequent panels show the location to which the definition region is tracked back in successively earlier video frames. The last panel is 200 frames in the past.

Figure 4.16: Sample frame with roadway position calculated from positions of horizontal template matches.

Figure 4.17: Sample frame with roadway position calculated using the dynamic programming approach.

Figure 4.18: Output of naive color-based road classification algorithm.

Figure 4.19: Output of naive texture-based road classification algorithm.

Figure 4.20: Comparison of reverse optical flow, color, and texture-based algorithms using the pixel coverage metric.

Figure 4.21: Comparison of reverse optical flow, color, and texture-based algorithms using the line coverage metric.

Chapter 5

Self-Supervised Navigation

The capstone to the work discussed in the previous chapters is the novel application of reverse optical flow and self-supervised learning techniques to the problem of autonomous mobile robot navigation in offroad environments using monocular video as the only long-range sensor. This chapter demonstrates the improvements achieved by augmenting an existing self-supervised image segmentation procedure with an additional supervisory input. Obstacles and roads may differ in appearance at distance because of illumination and texture frequency properties. Therefore, reverse optical flow is added as an input to the image segmentation technique to find examples of a region of interest at previous times in the past. This provides representations of this region at multiple scales and allows the algorithm to determine more accurately where additional examples of this class appear in the image.

Navigation methods that utilize only a single monocular camera for terrain classification are subject to limitations. Due to perspective effects, the sensor resolution per terrain area decreases in a monocular image as the distance from the robot increases. To make intelligent navigation decisions about distant objects, the pixels corresponding to these objects must be correctly classified as early as possible. This requires that the robot infer class information using only a small patch of the image plane.

As discussed in Chapter 2, the specularity of an observed object depends on the viewing angle of the observer with respect to the surface normal of the object, which in turn is dependent on the distance between the observer and the object.


This effect, combined with the periodic nature of textures, means that the visual appearance of an object at a great distance may be different than its appearance when the robot is close enough to detect it with local sensors. Finally, the automatic gain control operations necessary to mitigate the large dynamic range of outdoor scenes will create differences between the appearance of an object when it occupies a large portion of the robot's field of view and its appearance when it constitutes a small part of a larger scene.

These difficulties can be overcome with the use of self-supervised learning and optical flow techniques. Self-supervised learning in this context refers to a kernel of supervised learning which operates in an unsupervised manner on a stream of unlabeled data. The exact nature of the supervised kernel varies according to the application. In the case of the adaptive road following in Chapter 4, the supervision is the assumption that the vehicle is currently on the roadway. In the case of autonomous navigation, the supervision is the assertion that objects or terrain that trigger near-range sensors - such as physical bumpers - are hazardous and are to be avoided in the future. The self-supervised algorithm takes this information and monitors the sensor inputs and incoming video stream to label a data set without human assistance.

By storing a history of observed optical flow vectors in the scene as outlined in Chapter 2, the perception module is able to trace pixels belonging to an object in the current image back to their positions in a given image in the past. In this way, obstacles and terrain types that the robot has interacted with using the short range sensors can be correlated with their appearance when they were first perceived at long range. Then the algorithm learns the visual characteristics of obstacles and easily traversable terrain at distances useful for making early navigation decisions and segmenting a 2D image accurately. This information allows long-range planning and higher traversal speeds. This approach, while applying a pipeline of established techniques, is novel in its use of reverse optical flow for image region correlation and results in improved navigation over a technique using the same image classification model, but no reverse optical flow.

Figure 5.1: Off-road navigation algorithm

This chapter addresses the implementation of such a self-supervised learning system and its advantages and limitations. The algorithms presented here have been developed by the Stanford Artificial Intelligence Lab as part of work on the DARPA Learning Applied to Ground Robotics (LAGR) project. Section 5.1 discusses the learning algorithms used to allow the robot to maneuver in its environment, including training events, reverse optical flow, image segmentation, and autonomous navigation. Section 5.2 discusses the evaluation metrics used to measure the performance of the algorithm, and the outcome of the evaluation.

5.1 Off-Road Navigation Algorithm

The algorithm described here is illustrated in Fig. 5.1. It consists of four major parts: initial pixel clustering, image segmentation, projection of image space information into an occupancy grid, and the handling of training events. This approach was designed to provide long-range terrain classification information for a mobile robot. By placing that information into an occupancy grid, other sensors such as stereo vision and infrared sensors can be used to augment this grid. A D* global path planning algorithm [77] is run on this map to perform rapid navigation. Although the majority of this section deals with this global path planner and the LAGR hardware platform, the section on "Alternate Approaches" discusses other planning and segmentation techniques that use the same combination of self-supervised learning and reverse optical flow.

The initial pixel clustering step takes an RGB input image and runs K-means on it with K set to 16 clusters. The pixel values from each of these 16 clusters are then used to compute a multi-variate Gaussian in RGB color space for each initial cluster. Each of these Gaussians is represented by a mean vector and a covariance matrix. After the initial input image is processed, each subsequent image is segmented based on these Gaussians according to a Maximum Likelihood (ML) strategy. A class label l is chosen for each pixel in order to maximize the probability of the observed color given the class label. The class label l is one of {Good, Bad, Lethal}. Three classes were chosen instead of two to allow the global D* planner to discriminate between terrain that was passable, but potentially hazardous, and lethal obstacles through which the robot could not navigate. For instance, an obstacle that triggered the infrared bumpers might be bad (it could be loose scrub brush, etc.). An obstacle that triggered the physical bumper, however, would be lethal. The choice of near-range sensor suite used for training did not provide enough information to warrant the use of more than three classes. The conditional pdf for a color c, given that it belongs to class l, is

f(c \mid l) = \sum_{j=1}^{M(l)} \frac{1}{M(l)} \, G(c;\, \mu_{l,j}, \Sigma_{l,j}) \qquad (5.1)

where the Gs are the Gaussian pdfs with means \mu and covariance matrices \Sigma, and M(l) is the number of Gaussians labeled as belonging to class l. This model is interesting in this context as an approach with solid performance which can be improved with the use of optical flow. Note that the class Unknown corresponds to the case when a pixel's color falls more than a maximum distance from any of the Gaussians in RGB space. The following Mahalanobis distance measure is used to compute this distance [78]:

Figure 5.2: The LAGR robot platform. It is equipped with a GPS receiver, infrared bumpers, a physical bumper, and two stereo camera rigs.

D_m(\vec{x}) = \sqrt{(\vec{x} - \vec{\mu})^T S^{-1} (\vec{x} - \vec{\mu})} \qquad (5.2)

\vec{x} in this case is the vector of RGB values describing the color of the pixel, \vec{\mu} is the mean of the Gaussian being compared against, and S is the covariance matrix of that Gaussian. In this work, a threshold distance of 0.9 was found empirically to be good at separating objects with new visual characteristics from ones that had colors similar to objects that belonged to known classes.

The critical portion of this approach is the training of the model. When the robot gathers data from the environment through its local sensors, information about which pixels in an image correspond to obstacles or different types of terrain is used to train the Mixture of Gaussians model. The addition of the reverse optical flow method described in Chapter 2 makes it possible to do this training based on the appearance of obstacles and terrain at greater distances from the robot. This improves the performance of classification of image regions corresponding to those obstacles and terrain types at distance and enables higher traversal speeds.
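Putting Eq. (5.1) and Eq. (5.2) together, a per-pixel classification step can be sketched as follows; the data structures and function names are assumptions for illustration, not the LAGR implementation itself.

```python
import numpy as np

def gaussian_pdf(c, mu, cov):
    """Multivariate normal density in 3-D RGB space."""
    d = c - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt(((2.0 * np.pi) ** 3) * np.linalg.det(cov))
    return norm * np.exp(-0.5 * d @ inv @ d)

def mahalanobis(c, mu, cov):
    """Eq. (5.2): Mahalanobis distance of a color to one mixture component."""
    d = c - mu
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

def classify_pixel(c, mixtures, threshold=0.9):
    """`mixtures` maps a label in {"Good", "Bad", "Lethal"} to a list of
    (mean, cov) components.  Returns the maximum-likelihood label per
    Eq. (5.1), or "Unknown" if the pixel is farther than `threshold` from
    every component.  (Illustrative sketch.)"""
    c = np.asarray(c, dtype=float)
    min_dist = min(mahalanobis(c, mu, cov)
                   for comps in mixtures.values() for mu, cov in comps)
    if min_dist > threshold:
        return "Unknown"
    def class_likelihood(comps):
        return sum(gaussian_pdf(c, mu, cov) for mu, cov in comps) / len(comps)  # Eq. (5.1)
    return max(mixtures, key=lambda label: class_likelihood(mixtures[label]))
```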

Figure 5.3: Points corresponding to the good class are depicted by x's, the bad class by circles, and the lethal class by stars.

For the self-supervised learning, the supervisory inputs are the physical bumpers and the infrared range sensors. Blame assignment, while not perfect, is made easier by the fact that the cameras on the platform (the LAGR robot, see Fig. 5.2) point downward and inward. The pixels which correspond to the top of a traffic cone-sized obstacle that triggered either the right or left physical bumper switch have been determined via camera calibration. In the same manner, when the infrared range sensors register an obstacle at a certain range, the pixel correspondences have been determined ahead of time.

When a local sensor such as a physical bumper or an infrared range sensor registers an obstacle, the optical flow procedure is called. The pixels that correspond to where that object lies in the current image are traced back to the point where they first entered the field of view of the robot (or to the location in the oldest frame for which information exists in the optical flow history buffer if a full traceback is not possible; a 200-frame history buffer was used for the work discussed here).

Figure 5.4: Statistics for Gaussian mixture components after a run where the robot interacted with an orange fence, and avoided subsequent orange objects. Each row has the mean and standard deviation for that Gaussian component, followed by the number of good, bad, and lethal votes for that component based on training data.

Pixels corresponding to the object from this frame in the past are then used to train the Mixture of Gaussians classifier. These pixels are incorporated into the Gaussian mixture component whose mean and covariance yield the minimum distance in the Mahalanobis sense. If the Mahalanobis distance to the nearest mean is greater than 0.9, then a new mixture component is created to capture any future examples of this obstacle. Since these Gaussian mixture components are initially trained on K-means output, their covariances model the underlying variability of the color of the object being recognized.
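A minimal sketch of this training step is shown below: each traced-back pixel is assigned to the nearest existing component in the Mahalanobis sense, or a new component is spawned if the nearest distance exceeds 0.9. The component data structure, the seed covariance for a new component, and the omission of a running mean/covariance update are assumptions made for illustration.

```python
import numpy as np

def incorporate_training_pixels(pixels, components, label, threshold=0.9):
    """Fold traced-back training pixels into the mixture (illustrative sketch).

    `components` is a list of dicts with keys 'mean', 'cov', and 'votes'
    (per-class vote counts, as in Fig. 5.4).  Each pixel casts a vote for
    `label` ("Good", "Bad", or "Lethal") on the component it is assigned to.
    """
    for c in np.asarray(pixels, dtype=float):
        dists = [np.sqrt((c - comp["mean"]) @ np.linalg.inv(comp["cov"]) @ (c - comp["mean"]))
                 for comp in components]
        if components and min(dists) <= threshold:
            comp = components[int(np.argmin(dists))]
        else:
            # New visual class: spawn a component (seed covariance is an assumption).
            comp = {"mean": c.copy(), "cov": 25.0 * np.eye(3), "votes": {}}
            components.append(comp)
        comp["votes"][label] = comp["votes"].get(label, 0) + 1
    return components
```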


Figure 5.5: (a) STFT during a period of normal robot operation (b) STFT during a period of detected wheel slippage

This means that the classification is insensitive to the specific value of the threshold Mahalanobis distance. A value of 0.9 empirically allows various types of trees to be classified together, while still permitting new types of objects to be recognized as being different. A point cloud representation of what the different classes look like for data gathered during a test run is shown in Fig. 5.3. The image points used to train the Gaussians are shown in RGB space classed into good, bad, and lethal classes. Fig. 5.4 shows the state of the Gaussian mixture components after a run where the robot interacted with an orange fence, and then learned to avoid similarly-colored objects. The last mixture component did not exist after the initial K-means clustering, and was added after the interaction with the fence.

Other inputs can also be used to trigger reverse optical flow traceback training events. The ability to recognize accurately when the robot's wheels are slipping is of crucial importance. The wheel encoders would report that the robot was making forward progress, while the on-board pose estimate is too coarse to recognize slippage immediately with reliability. Once the high-level navigation software realizes that the wheels are slipping, it is trivial to back up, mark the region as an obstacle in the occupancy grid, and try another route.

The slip detector proposed here is based on the realization that the motor controllers on the LAGR robot platform caused characteristic 3 Hz peaks in the frequency domain when the vehicle was undergoing significant wheel slippage.



Figure 5.6: (a) Input frame (b) Raw segmentation output (bushes have been classified as obstacles) (c) Output of "bottom finder"

A Short Time Fourier Transform (STFT) was employed with a 2-second sliding window. The results were normalized to remove the effects of the DC coefficient, and an empirical threshold (0.25) was used to recognize significant frequency components in the neighborhood of 3 Hz. Fig. 5.5a shows what the STFT looks like during periods of normal operation of the LAGR platform, while Fig. 5.5b shows what it looks like during periods of detected wheel slip. In addition to dropping an obstacle in the map and moving on, the reverse optical flow approach discussed above could be used to train the image region classifier by providing additional examples of the bad class.
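A hedged sketch of such a slip detector follows; the choice of input signal, the Hann window, and the exact band edges around 3 Hz are assumptions, while the 2-second window and the 0.25 threshold come from the description above.

```python
import numpy as np

def detect_wheel_slip(signal, sample_rate, band=(2.5, 3.5), threshold=0.25):
    """Flag wheel slip from the most recent 2-second window of a motor-related signal.

    Computes the magnitude spectrum of the window, removes the DC coefficient,
    normalizes by the total remaining energy, and reports slip if the fraction
    of energy in the neighborhood of 3 Hz exceeds `threshold`.
    (Illustrative sketch; the exact signal and normalization are assumptions.)"""
    n = int(2 * sample_rate)
    window = np.asarray(signal[-n:], dtype=float)
    spectrum = np.abs(np.fft.rfft(window * np.hanning(len(window))))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
    spectrum[0] = 0.0                                  # remove the DC coefficient
    total = spectrum.sum()
    if total == 0:
        return False
    in_band = spectrum[(freqs >= band[0]) & (freqs <= band[1])].sum()
    return (in_band / total) > threshold
```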

Finally, the terrain classification information present in the current segmented video frame must be processed in such a way that the robot's planner can make useful navigation decisions. The pipeline uses look-up tables with information about the point on an assumed flat ground plane which corresponds to every pixel in the image. These tables are constructed using the intrinsic and extrinsic parameters of the robot's cameras determined from a camera calibration. They allow a ray to be calculated for each pixel in the image. The location at which that ray intersects the ground plane is then a simple trigonometric calculation. The information about the location on the ground plane to which each pixel corresponds is used to cast votes for the class each grid cell in the occupancy grid will be assigned.

To locate vertical obstacles in space under the flat ground plane assumption, the algorithm scans vertically from the bottom of the image to find the first obstacle-classified pixels. Since the approach assumes that all good terrain lies flat on the ground plane, the first pixels which do not conform to the good class when processed in this manner locate the intersection of the obstacle with the ground. All additional pixels above this intersection point would be improperly projected with the ground plane table. Instead, their collective influence is represented by summing all of their influence at this intersection. For an example of this process, see Fig. 5.6. For a sample of what the obstacles look like when the occupancy grid is populated, see Fig. 5.7. The occupancy grid used had cells that corresponded to 20x20 cm patches of terrain, and three costs: no cost, high cost, and lethal cost.
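The two projection steps described above, intersecting a calibrated pixel ray with the flat ground plane and the per-column "bottom finder" scan, can be sketched as follows; the coordinate conventions and function names are assumptions made for illustration.

```python
import numpy as np

def ground_point_for_pixel(ray_dir, cam_height):
    """Intersect a camera ray with the flat ground plane z = 0.

    `ray_dir` is the unit ray direction for a pixel in a world frame with z up
    (from the intrinsic/extrinsic calibration); `cam_height` is the camera
    height above the ground.  Returns (x, y) on the ground plane, or None if
    the ray points at or above the horizon.  (Illustrative sketch.)"""
    if ray_dir[2] >= 0:
        return None
    t = cam_height / -ray_dir[2]          # distance along the ray to the ground plane
    return (t * ray_dir[0], t * ray_dir[1])

def bottom_finder(label_image, good_label=0):
    """For each column, scan up from the bottom row and return the row of the
    first pixel not labeled as good terrain; the collective influence of the
    obstacle pixels above is credited to this ground intersection.
    Returns -1 for obstacle-free columns."""
    h, w = label_image.shape
    bottoms = np.full(w, -1, dtype=int)
    for col in range(w):
        for row in range(h - 1, -1, -1):  # from the bottom of the image upwards
            if label_image[row, col] != good_label:
                bottoms[col] = row
                break
    return bottoms
```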

5.1.1 Alternate Approaches

The utility of reverse optical flow is not limited to the global planner and the Mixture of Gaussians classifier described above. The approach has also been integrated with two local planning algorithms. These algorithms also make use of reverse optical flow and self-supervised learning, but they lack the state of a global planner and can get caught in local minima without the addition of higher-level logic.

Figure 5.7: Top figure shows Gaussian mixture model of the scene (potential obstacles colored red). Bottom figure shows the placement of obstacles in the occupancy grid.

Local Image-space Planner

As an experiment related to the work above, an iRobot ATRV platform (seen in Fig. 5.8) was made to navigate autonomously with a minimum of on-board sensors. A single video camera was attached to it, and an onboard computer took input from this camera and the robot's front physical bumper and provided steering signals to the robot's motor controllers.

Figure 5.8: iRobot ATRV platform

The goal of this subproject was to allow the robot to maneuver in an outdoor environment, learning from its interactions with obstacles, but operating without a global goal except proceeding at a fixed velocity without getting stuck. The planner was local in the image space. After obstacles were encountered (trees were the only possible obstacles in this environment), reverse optical flow was used to learn the color and texture of image patches on the obstacles 200 frames before the robot interacted with them. In the current video frame, after scene classification, the planner picked a direction that would steer it towards the part of the current field of view that was the most free of obstacles. Fig. 5.9 shows the classification of the trees as obstacles after one interaction with a tree. The color classifier simply checked whether a pixel was within a threshold distance in HSV color space of the mean of the sample pixels (the threshold was 20 for S and V, and 10 for H), and the texture classifier simply compared a 20x20 pixel template to each pixel location using the Sum of Squared Differences criterion.

Figure 5.9: Trees are classified as obstacles with a texture- and color-based segmentation algorithm after interacting with the robot's physical bumper.

Figure 5.10: The learned optical flow field is used by the robot to determine how to maneuver to push obstacles out of the field of view.

Learned Optical Flow Controller

The reverse optical flow pipeline has been integrated with a system capable of learning not only the visual characteristics of obstacles, but also the control signals necessary to maneuver in a given environment [79]. When the robot is first turned on, the controller commands a series of twists and turns which run through every possible combination of forward and reverse speeds for each of the two front drive wheels.

Using video recorded during this exercise, the algorithm builds an estimate of the optical flow that different parts of the scene will exhibit for a given combination of wheel speeds. The controller then learns about the environment and the relevant obstacle classes using the robot's local sensors, as described earlier in this chapter. After classifying the pixels in each video frame, however, the controller picks wheel speeds that move detected obstacles out of the danger area in front of the robot. To do this, the algorithm takes the current frame, iterates through each possible pair of wheel speeds, and estimates how the image would look once the various regions moved according to the expected optical flows. Fig. 5.10 shows both the set of image-region optical flows learned for a given wheel-speed combination and the part of the input image the controller used to calculate the cost of different wheel-speed choices, given the pixel classification of the current video frame.

This approach is interesting because it dynamically calibrates itself "out-of-the-box," and because it obviates the need for an additional global planner. However, it is precisely this lack of a global planner that makes it vulnerable to local minima (horseshoe-shaped obstacles, for instance).
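A minimal sketch of the wheel-speed selection step follows. The data layout is assumed for illustration: region_flow is taken to be a coarse grid of learned (dy, dx) displacements per image block for one wheel-speed pair, flow_by_speeds a dictionary keyed by (left, right) speed pairs, and danger_region a boolean mask over the image; none of these names come from the original system.

    import numpy as np

    def predict_mask(obstacle_mask, region_flow, cell=40):
        """Shift each coarse block of the obstacle mask by the flow learned for
        one wheel-speed pair (integer shifts; blocks leaving the frame are dropped)."""
        h, w = obstacle_mask.shape
        predicted = np.zeros_like(obstacle_mask)
        for i in range(0, h, cell):
            for j in range(0, w, cell):
                dy, dx = region_flow[i // cell, j // cell]
                src = obstacle_mask[i:i + cell, j:j + cell]
                y0, x0 = i + int(dy), j + int(dx)
                if y0 < 0 or x0 < 0 or y0 + src.shape[0] > h or x0 + src.shape[1] > w:
                    continue
                predicted[y0:y0 + src.shape[0], x0:x0 + src.shape[1]] |= src
        return predicted

    def pick_wheel_speeds(obstacle_mask, flow_by_speeds, danger_region):
        """Choose the (left, right) wheel-speed pair whose learned flow leaves the
        least obstacle mass inside the danger area in front of the robot."""
        costs = {speeds: predict_mask(obstacle_mask, flow)[danger_region].sum()
                 for speeds, flow in flow_by_speeds.items()}
        return min(costs, key=costs.get)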

5.2 Results

This section discusses the results of applying the technique described above. The work was done as part of the Stanford Artificial Intelligence Lab's participation in the DARPA LAGR program; the robot platform for the program is shown in Fig. 5.2. The goal of the program is autonomous off-road navigation between two GPS waypoints while avoiding obstacles, using computer vision as the only long-range sensor. Since the visual obstacles at the remote test sites where the races were run may vary drastically from the obstacles used to test the vehicle on local test courses, self-supervised learning is an attractive approach.

5.2.1 Hypotheses

The investigations detailed in this chapter led to two main hypotheses. First, the use of reverse optical flow improves the scene-segmentation and obstacle-classification performance of the Mixture of Gaussians classifier. Second, the use of reverse optical flow also improves the real-world navigation behavior of a robot running the Mixture of Gaussians classifier and a global D* planner, reducing both the time taken to reach a goal and the number of obstacles that the robot contacts with its short-range sensors.

5.2.2 Methods

Scene Segmentation: To evaluate scene-segmentation accuracy, the following two-part metric was used. If the image-segmentation stage of the proposed algorithm is set to return a binary cost map (hazard or non-hazard), then false negatives are reflected in the percentage of each discrete obstacle that is correctly matched. The number of pixels the algorithm correctly classified as belonging to an obstacle that a human also labeled as belonging to that obstacle is divided by the total number of human-labeled pixels for that obstacle, and this percentage is averaged over all the obstacles in the scene. The number is then corrected for the smaller total number of pixels present in obstacles farther from the robot, which are considered equally important for successful long-range, high-speed navigation.

False positives are reflected by the number of square meters of ground within 25 meters of the robot that are incorrectly classified in a given image, divided by the total number of square meters the algorithm was asked to segment. The number of square meters corresponding to a given pixel was calculated using the ground-plane tables discussed earlier in this chapter, so incorrect classifications of terrain at large distances from the robot have a large effect on this score.

This process is illustrated in Fig. 5.11. Fig. 5.11a shows a sample frame from the video sequence, Fig. 5.11b shows the hand-labeled obstacles, Fig. 5.11c shows the segmentation produced without reverse optical flow, and Fig. 5.11d shows the segmentation produced with reverse optical flow.
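The following sketch shows one way the two scores could be computed from a predicted binary mask and a hand-labeled obstacle image. It is illustrative only: the per-obstacle distance correction mentioned above is not reproduced, the label convention (0 = background, positive integers = obstacle IDs) is assumed, and pixel_area_m2 stands in for the per-pixel ground area derived from the ground-plane tables.

    import numpy as np

    def coverage_score(pred_mask, obstacle_labels):
        """Average, over hand-labeled obstacles, of the fraction of each obstacle's
        pixels that the classifier also marked as obstacle (false-negative term)."""
        ids = [k for k in np.unique(obstacle_labels) if k != 0]
        fractions = [(pred_mask & (obstacle_labels == k)).sum() / (obstacle_labels == k).sum()
                     for k in ids]
        return float(np.mean(fractions)) if fractions else 1.0

    def false_positive_area(pred_mask, obstacle_labels, pixel_area_m2, within_25m_mask):
        """Square meters of traversable ground within 25 m wrongly marked as obstacle,
        divided by the total ground area the algorithm was asked to segment."""
        ground = (obstacle_labels == 0) & within_25m_mask
        wrong = pred_mask & ground
        return float(pixel_area_m2[wrong].sum() / pixel_area_m2[ground].sum())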


Figure 5.11: (a) Video frame (b) Hand-labeled obstacle image (c) Segmentation without optical flow (d) Segmentation with optical flow

The proposed evaluation compares algorithm-labeled images to operator-labeled images using a frame-by-frame, pixel-level metric. There is great research benefit in using standardized benchmarks for natural object recognition algorithms such as MINERVA [80]; however, since the proposed approach requires the entire video stream up until an object interacts with the local sensors, a static database of benchmark images is not appropriate for evaluating this work. The traditional error metric of the percentage of incorrectly classified pixels fails to capture the importance of correctly classifying objects at greater distances from the robot, which subtend fewer pixels. The pixel distance error metric [81] is also not well suited to natural scenes with a large depth of field.

The results shown here come from a set of three test runs in a forest in flat lighting with no sharp shadows. The only obstacles were trees. Two instances of the segmentation module ran at the same time: one used optical flow techniques to determine the characteristics of obstacles the robot interacted with, while the other did not. During each run the robot was guided by an operator at normal traversal speeds towards three or four of the trees in the scene for a local-sensor interaction. After these initial interactions, the robot was driven through the forest without any additional close-range obstacle interactions. During the runs, video frames and their segmented counterparts were logged on the robot at a rate of 10 Hz. After each run these logs were parsed, and every 10th frame was analyzed using the metrics described above. The three runs, taken together, totaled over 3,000 frames, of which 304 were used to compute the results.

Autonomous Navigation: The applicability of the scene segmentation results was tested by evaluating the autonomous performance of the robot with the Gaussian databases collected with and without optical flow traceback during the last data collection run discussed above. The evaluation covered both the overall duration of a given run and the number of obstacles the robot contacted with its near-range sensors. Running all of the planning and sensing modules except stereo vision, the robot was tested on a 70 m course through a different part of the forest containing trees with which the robot had not previously interacted. Learning was turned off during these runs. Ten runs were made, and the robot reached the goal in every one. Half of the runs used the Gaussian database collected without optical flow traceback, while the other half used the database collected with optical flow traceback. The starting position of the robot was randomly varied within a radius of 5 meters for each traceback/no-traceback pair of runs.

5.2.3 Findings

Scene Segmentation: Without optical flow, the average percentage coverage of each obstacle was 81.65%, and the percentage of area incorrectly classified as belonging to an obstacle was 53.75%. With optical flow, the percentage coverage of each obstacle was 88.98% and the percentage of area incorrectly classified was 36.53%. The data show that the percentage of each obstacle in the scene correctly identified by the classifier is higher with optical flow traceback than without. More significant is the decrease in the percentage of area incorrectly classified as belonging to an obstacle class.


Figure 5.12: (a) Paths taken using data collected without optical flow. (b) Paths taken using data collected with optical flow.

Autonomous Navigation: The paths taken by the robot during these ten tests, as well as the locations of the trees on the test course, are shown in Fig. 5.12. The number of interactions with trees that caused the robot to stop and back up (physical and IR bumper hits) was noted for each run, along with the total run time. The average time and the average number of trees the robot contacted at close range were lower for runs using the data collected with the optical flow approach than for runs using the data collected without reverse optical flow. The results, with 95% confidence ellipses, are shown in Fig. 5.13. On average, the robot took ≈4.2 minutes to complete the course and encountered ≈3.5 obstacles when reverse optical flow was not used to train the classifier, and took ≈3 minutes and encountered ≈0.5 obstacles when it was. The algorithm ran at 10 Hz on the robot.

5.2.4 Additional Results

Each of the DARPA LAGR test runs was similar in form. Teams wrote code for perception, planning, and control, and sent a flash drive to DARPA. DARPA would load the drive into a robot at an undisclosed remote location, give the robot a target GPS coordinate, and evaluate its performance as it attempted to reach the goal in different types of off-road terrain. Each test was engineered to focus on different objectives (the ability to function with limited GPS coverage, in conditions with significant wheel slip, in situations with concave obstacles, etc.). The final three tests of Phase I (the results of which decided which teams completed Phase I successfully) consisted of two Learning from Experience tests and one Learning from Example test. The Learning from Experience tests featured a prominent visual feature that the robot was supposed to learn was traversable (in this case a dark red cedar mulch trail). The robot had three opportunities to run the course, saving data between runs as necessary.

The Learning from Experience tests drove the development of a special planner architecture that maintained a rolling evidence grid centered on the location of the robot. The minimum-cost path to the goal was computed using dynamic programming, with a heuristic function predicting the cost from the edge of the local map to the final goal. Local control was done by rolling out short trajectories obeying the non-holonomic constraints of the robot and adding the cost of the best global path from the end of each local trajectory. A screenshot of the planner is shown in Fig. 5.14. The robot is located in the middle of the map and is planning towards a point in the upper left corner (behind the robot); the local plan is a tight right turn, while the global plan avoids previously detected obstacles in the map. Separating the local and global planners allowed a much faster planning cycle than the baseline, enabling faster reactions to obstacles and smoother performance at high velocities. In the first Learning from Experience test the robot successfully learned that the mulch trail was traversable, following it to its end. In the second Learning from Experience test, the robot completed the course all three times with competitive times, the last run being the fastest of the set. Operators could see the robot making use of its previously saved trajectory to eliminate cul-de-sacs during planning.

The Learning from Example test was similar in concept. Before the robot's autonomous runs, a human operator guided the robot to the goal using a remote control. While the algorithm was allowed to save perception information (in the Stanford case, the Mixture of Gaussians describing obstacles and traversable terrain in RGB space), the robot was not allowed to save the GPS coordinates of the path used, or a global map. An example of one of the Stanford in-house training courses can be seen in Fig. 5.15a, and the output of the trained classifier on the same frame is shown in Fig. 5.15b. In addition to the training changes, the output of the classifier was treated differently. The pixels in the current monocular image classified as traversable terrain (shown in Fig. 5.15c) were used to fit a polynomial contour (Fig. 5.15d), which the planner treated as the location of the desired path. An EM step then used the pixels lying along this path as additional training examples for the classifier; a sketch of this refinement appears below. The improved classifier output is shown in Fig. 5.15e, and the final estimate of the location of the path is illustrated by the polynomial contour in Fig. 5.15f. Stanford's performance over the final three tests satisfied the Phase I requirements of the program.
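One plausible reading of the contour-fitting and EM refinement step is sketched below. The choice of fitting a polynomial column = f(row) through per-row centroids of the traversable pixels, the polynomial degree, and the path half-width are illustrative assumptions, not details of the original implementation.

    import numpy as np

    def fit_path_contour(traversable_mask, degree=2):
        """Fit a polynomial column = f(row) through the per-row centroids of the
        pixels currently classified as traversable (the desired-path estimate)."""
        rows, cols = np.nonzero(traversable_mask)
        ys = np.unique(rows)
        if ys.size <= degree:
            return None                      # not enough traversable rows to fit a contour
        centroids = np.array([cols[rows == y].mean() for y in ys])
        return np.polyfit(ys, centroids, degree)

    def pixels_along_path(coeffs, image_shape, half_width=10):
        """Return a mask of pixels within half_width columns of the fitted contour;
        an EM-style step re-trains the terrain classifier on these pixels."""
        h, w = image_shape
        mask = np.zeros((h, w), dtype=bool)
        for y in range(h):
            x = int(round(np.polyval(coeffs, y)))
            if 0 <= x < w:
                mask[y, max(0, x - half_width):min(w, x + half_width + 1)] = True
        return mask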

5.2.5 Conclusions

The 2D monocular-image terrain classifier performs well on its own, and performs even better when optical flow techniques are used to correlate the appearance of objects close to the robot with their appearance at greater distances. Videos illustrating the use of optical flow techniques on the robot to trace features corresponding to specific objects back in time, as well as the performance of the segmentation algorithm, can be found at http://cs.stanford.edu/group/lagr/IJCV IJRR/.

5.3 Related Work

Autonomous mobile robot navigation in challenging, real-world environments is currently an area of interest in the fields of computer vision and robotics. While competition-style events such as the DARPA Grand Challenge have drawn public attention to the difficulties of following roads in ill-structured environments [82], government-funded research projects such as the DARPA LAGR program and the US Department of Defense DEMO I, II, and III programs [83, 84] have emphasized navigation in completely off-road environments. Robust autonomous perception and navigation algorithms benefit both the military and the private sector. As these algorithms improve, automakers will be able to extend the usefulness of current research in driver assistance technologies, such as lane departure detection [85, 86], pedestrian tracking [87], and traffic sign detection and tracking [88], to more varied environments. Robotic exploration and mapping expeditions will also become possible in a wider variety of environments.

Many current approaches to autonomous robot navigation use 3D sensors such as laser range-finders or stereo vision as the basis of their long-range perception systems [89, 90, 91, 92, 93, 94, 95, 96]. These sensors have several disadvantages. While they provide accurate measurements within a certain distance, their perceptive range is limited, which in turn limits the top speed of the robotic platform. Also, active sensors such as lasers tend to cost more, consume more power, and broadcast the position of the robot when it may be advantageous for it to remain undetected [97].

A monocular camera can mitigate these problems by capturing scene information out to the horizon. Color or texture information can be used to classify pixels into groups corresponding to traversable terrain or obstacles. Color dissimilarity and color gradient changes have been used to perform this classification [98, 99]. The use of texture was pioneered by MIT's Polly tour guide for indoor navigation [100] and has since progressed to the point where natural-terrain classifiers that use texture cues can be used on planetary rovers [101]. Techniques that use stereo camera information to tag obstacles in the field of view and then pair that information with a visual appearance-based classifier to determine terrain traversability [102, 103, 104, 105, 106] have met with success in off-road navigation tests. These approaches might benefit from the same reverse optical flow framework to identify potential obstacles at distances beyond the effective range of stereo cameras, although the difference in appearance between potential obstacles and the training examples would likely be less marked than in the monocular video work discussed here, where much shorter-range sensors are used for labeling.

Self-supervised learning has also recently been applied successfully to robot navigation in off-road environments [108, 60, 107]. The work of Sofman et al. was notable for its use of overhead data (aerial imagery, etc.) as the input corpus for classification; in a situation where the robot's global position is not immediately known, or where no appropriate imagery corpus is available, its applicability is limited. The approach of using a mixture of Gaussians to describe the state of the world is similar to that of Manduchi et al. [103]. That work uses laser and stereo sensors to incorporate range information into the obstacle-labeling process, and therefore does not use a method like reverse optical flow to correlate the appearance of obstacles detected near the robot with their appearance at more useful distances from the robot.

Figure 5.13: Autonomous navigation results with 95% confidence ellipses. The average run duration (in minutes) is indicated on the y-axis, and the average number of obstacles encountered on the x-axis.

Figure 5.14: Planner local and global trajectories.


Figure 5.15: (a) Input frame (b) Initial classifier output (c) Traversable pixels (d) Polynomial contour (e) Refined classifier output (f) Estimated path

Chapter 6

Conclusions and Future Work

6.1 Conclusions

This thesis has covered three novel contributions to subfields of mobile robotics that leverage large datasets, unsupervised or self-supervised learning techniques, and computer vision tools. All of the methods discussed rely on optical flow techniques, either to distinguish moving objects from static surroundings (as in Chapter 3) or to allow classification algorithms to train on more representative pixel values than would otherwise be possible, by associating objects in the current frame with their appearance at a larger distance from the robot's sensors. These techniques were presented in detail in Chapter 2.

In Chapter 3, a method based on feature tracking and EM was presented for distinguishing between object motion and camera egomotion, which in turn enabled high-accuracy tracking of multiple moving objects from a moving platform using particle filters. This tracking information (along with the knowledge of which objects in the scene were moving and which were not) was then used to produce activity ground models. These models specify, based on training data, how moving objects are expected to move (speed and direction) given their current location. Intuitively this makes sense: pedestrians, for the most part, travel in different areas from cars, and the direction in which cars travel is often related to which part of a street they are traversing (lanes, etc.). The proposed approach, however, does not need any prior information about the geometry of the area in question. These ground activity models, once constructed, can be used for a variety of applications, including video sequence registration and improved tracking of moving objects.

In Chapter 4, dynamic programming and reverse optical flow were combined in a novel technique to identify where a loosely structured roadway lay in a frame of a video sequence. This approach used the appearance, in previous video frames, of the roadway currently in front of the car to find the globally most likely 1-D template match positions for the roadway at different heights in the image. The algorithm operated at real-time frame rates and, when compared to a hand-labeled ground truth dataset, was shown to be effective where other approaches may break down.

In Chapter 5, an algorithm was presented that uses optical flow techniques to integrate long-range data from monocular camera images with short-range sensor information to improve the quality and utility of terrain classification. In this approach, the optical flow field between successive input frames is coarsened and saved in a ring buffer. Information about the location of features on an object in the current frame can then be used to trace them back to their locations in a previous frame. This allows information about the traversability and danger of objects within range of the robot's physical and IR bumpers to be used to classify similar objects reliably at large distances from the robot, permitting more efficient navigation and higher speeds. The approach resulted in an increase in the percentage of each obstacle correctly segmented in the input image and a decrease in the percentage of traversable terrain incorrectly classified. Finally, using this technique in a situation where monocular vision was the only long-range sensor on a robot platform proved advantageous in terms of both the number of obstacles encountered and the time it took the robot to reach the goal.

There are situations in which the optical flow techniques upon which this approach is based do poorly. Unusually smooth image regions, or regions which are completely saturated or desaturated, will not contain trackable features. This causes the optical flow history grid cells in those areas to be empty or filled with noise, which in turn causes errors when objects are traced back to their original locations. Objects which are in the field of view of the robot, leave the image, and return later (such as objects seen while the robot is executing an S-turn) are currently traced back only to the location where they most recently re-entered the field of view; handling these cases using template matching or particle filter tracking might allow a more complete traceback. Finally, changing illumination conditions can result in unacceptable rates of misclassification.
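For concreteness, the ring buffer of coarsened flow fields and the traceback operation summarized above can be sketched as follows. This is a minimal illustration under stated assumptions: the cell size, buffer depth, class name, and flow sign convention are chosen here for the example, and the actual implementation is the one described in Chapter 2.

    from collections import deque
    import numpy as np

    class FlowHistory:
        """Ring buffer of coarsened optical flow fields used to trace a pixel in the
        current frame back to its location n frames earlier."""
        def __init__(self, depth=200, cell=20):
            self.cell = cell
            self.buffer = deque(maxlen=depth)    # oldest coarse flow fields fall off the end

        def push(self, flow):
            """Store the mean (dx, dy) flow of each cell x cell block of a dense flow field
            of shape (h, w, 2), where flow maps the previous frame into the current one."""
            h, w, _ = flow.shape
            coarse = flow[:h - h % self.cell, :w - w % self.cell].reshape(
                h // self.cell, self.cell, w // self.cell, self.cell, 2).mean(axis=(1, 3))
            self.buffer.append(coarse)

        def trace_back(self, x, y, n_frames):
            """Follow the stored flow backwards from image position (x, y) for n_frames."""
            for coarse in list(self.buffer)[-n_frames:][::-1]:   # newest first
                i = min(int(y) // self.cell, coarse.shape[0] - 1)
                j = min(int(x) // self.cell, coarse.shape[1] - 1)
                dx, dy = coarse[i, j]
                x, y = x - dx, y - dy     # undo the motion that produced the newer frame
            return x, y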

6.2 Future Work

The work in Chapter 3 is of limited utility in building activity models of areas where different travel directions and speeds occur in the same spatial grid cells; multi-purpose trails and time-of-day reversed freeway lanes are examples of such situations. More work will be required to make ground activity models robust to these limitations. It will also be interesting to see how ground activity models might help in other domains. Change detection in automated monitoring of video streams, large-scale surveying to identify intersections or roundabouts that would benefit from rearchitecting, and automated map attribute extraction (one-ways, turn restrictions, and the like) are interesting potential research areas.

Extensions of the work in Chapters 4 and 5 might include the use of more sophisticated machine learning and computer vision techniques for feature selection and scene segmentation once objects have been traced back to their origins in earlier frames. The approach is also applicable to tracking, recognition, and surveillance problems in which correlating the appearance of objects or people at close range with their visual characteristics at long range increases the accuracy of classification.

As the size of the datasets available for training machine learning algorithms continues to increase, clever applications of computer vision and large-scale unsupervised learning will become ever more important.

Bibliography

[1] A. Lookingbill, D. Lieb, D. Stavens, and S. Thrun. “Learning Activity-Based Ground Models from a Moving Helicopter Platform.” Proc. ICRA. 2005.

[2] Lieb, D., Lookingbill, A., and Thrun, S., "Adaptive Road Following using Self-Supervised Learning and Reverse Optical Flow," Proceedings of Robotics: Science and Systems, 2005.

[3] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun. "Reverse Optical Flow for Self-Supervised Adaptive Autonomous Robot Navigation." International Journal of Computer Vision, 74(3), pp. 287-302, 2007.

[4] A. Lookingbill, D. Lieb, and S. Thrun. "Optical Flow Approaches for Self-supervised Learning in Autonomous Mobile Robot Navigation." Autonomous Navigation in Dynamic Environments. STAR 35, Springer, 2007.

[5] J. Shi and C. Tomasi. “Good Features to Track.” Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 593–600, 1994.

[6] D. G. Lowe. "Object Recognition from Local Scale-Invariant Features." Proc. of the Seventh International Conference on Computer Vision (ICCV'99), p. 1150, Volume 2, 1999.


[7] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998.

[8] J. Bouguet. "Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description of the Algorithm." Intel Corporation, Microprocessor Research Labs, 2000. OpenCV Documents.

[9] “OpenCV 2.1 C++ Reference” Online reference: http://opencv.willowgarage.com/documentation/cpp/index.html

[10] T. Brox, and J. Malik, “Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation.” In IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 500-513, Volume 33, 2011.

[11] B. Horn and B. Schunck, “Determining optical flow.” In Artificial Intelligence, pp. 185-203, Volume 17, 1981.

[12] D. Zang, L. Wietzke, C. Schmaltz, and G. Sommer, "Dense Optical Flow Estimation from the Monogenic Curvature Tensor." In Lecture Notes in Computer Science, pp. 239-250, Volume 4485, 2007.

[13] A. Bruhn, J. Weickert, and C. Schnörr, "Lucas/Kanade Meets Horn/Schunck: Combining Local and Global Optic Flow Methods." International Journal of Computer Vision, 61(3), pp. 211-231, 2005.

[14] N. Onkarappa and A. Sappa, “On-Board Monocular Vision System Pose Estimation through a Dense Optical Flow.” In Lecture Notes in Computer Science, pp. 230-239, Volume 6111, 2010.

[15] R. T. Collins, Y. Liu, and M. Leordeanu, "Online Selection of Discriminative Tracking Features." In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.

[16] H. Chen, T. Liu, and C. Fuh, "Probabilistic Tracking with Adaptive Feature Selection." In Proceedings of ICPR 2004, 2004.

[17] U. Neumann and S. You, "Natural Feature Tracking for Augmented Reality." In IEEE Transactions on Multimedia, 1999.

[18] H. Zhou, Y. Yuan, and C. Shi, "Object Tracking Using SIFT Features and Mean Shift." In Computer Vision and Image Understanding, pp. 345-352, Volume 113, 2009.

[19] L. Dorini and S. Goldenstein, "Unscented Feature Tracking." In Computer Vision and Image Understanding, pp. 8-15, Volume 115, 2011.

[20] C. Takada and Y. Sugaya, “Detecting Incorrect Feature Tracking by Affine Space Fitting.” In Lecture Notes in Computer Science, pp. 191-202, Volume 5414, 2009.

[21] D. Ta, W. Chen, N. Gelfand, and K. Pulli, "SURFTrac: Efficient tracking and continuous object recognition using local feature descriptors." In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2937-2944, 2009.

[22] R. Rodrigo, M. Zouqi, C. Zhenhe, and J. Samarabandu, "Robust and Efficient Feature Tracking for Indoor Navigation." In IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, pp. 658-671, Volume 39, 2009.

[23] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg, "Real-Time Detection and Tracking for Augmented Reality on Mobile Phones." In IEEE Transactions on Visualization and Computer Graphics, pp. 355-368, Volume 16, 2010.

[24] F. Caballero, L. Merino, J. Ferruz, and A. Ollero, "Vision-Based Odometry and SLAM for Medium and High Altitude Flying UAVs." In Unmanned Aircraft Systems, pp. 137-161, 2009.

[25] G. Fielding and M. Kam, "Disparity maps for dynamic stereo." In Pattern Recognition, Vol. 34, pp. 531-545, 2001.

[26] Benoit, S. "Towards Direct Motion and Shape Parameter Recovery from Image Sequences." PhD Dissertation, Department of Electrical Engineering, McGill University, 2003.

[27] A.P. Dempster, N.M. Laird, and D.B. Rubin. "Maximum Likelihood from Incomplete Data via the EM Algorithm." Journal of the Royal Statistical Society, Series B, 39(1), pp. 1-38, 1977.

[28] C. Hsu and R. Beuker, "Multiresolution Feature-Based Image Registration." Visual Communications and Image Processing, Proc. of SPIE vol. 4067, 1490-1498, 2000.

[29] J. Kim and J. Fessler “Intensity-based image registration using robust correlation coefficients” IEEE Transactions on Medical Imaging vol. 23, 1430-1444, 2004.

[30] G. Konecny. Geoinformation: Remote Sensing, Photogrammetry and Geographical Information Systems. Taylor & Francis, 2002.

[31] F. Lu and E. Milios. “Globally Consistent Range Scan Alignment for Environment Mapping.” Autonomous Robots, 4:333–349, 1997.

[32] D. Hähnel, W. Burgard, and S. Thrun. "Learning Compact 3D Models of Indoor and Outdoor Environments with a Mobile Robot." Robotics and Autonomous Systems, 44(1), 2003.

[33] S.B. Williams, G. Dissanayake, and H. Durrant-Whyte. “An Efficient Approach to the Simultaneous Localisation and Mapping Problem.” In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 406–411, Washington, DC, 2002.

[34] D. Ferguson, A. Morris, D. Hähnel, C. Baker, Z. Omohundro, C. Reverte, S. Thayer, W. Whittaker, W. Burgard, and S. Thrun. "An Autonomous Robotic System for Mapping Abandoned Mines." In S. Thrun, L. Saul, and B. Schölkopf, editors, Proceedings of Conference on Neural Information Processing Systems (NIPS). MIT Press, 2003.

[35] D. Makris and T. Ellis. “Automatic Learning of an Activity-Based Semantic Scene Model.” IEEE Conf. on Advanced Video and Signal Based Surveillance, pages 183-190, 2003.

[36] X. Wang, K. Tieu, and E. Grimson. "Learning Semantic Scene Models by Trajectory Analysis." Eur. Conf. Computer Vision, p. 110, 2006.

[37] C. Stauffer and W. E. L. Grimson. “Learning Patterns of Activity Using Real-time Tracking.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:747–757, 2000.

[38] E. B. Ermis, P. Clarot, P. M. Jodoin, and V. Saligrama, "Activity Based Matching in Distributed Camera Networks." In IEEE Transactions on Image Processing, Volume 19, pp. 2595-2613, 2010.

[39] J. Varadarajan and J. M. Odobez, "Topic Models for Scene Analysis and Abnormality Detection." In Proceedings of IEEE 12th International Conference on Computer Vision Workshops, 2009.

[40] A. Kembhavi, T. Yeh, and L. Davis, “Why Did the Person Cross the Road (There)? Scene Understanding Using Probabilistic Logic Models and Common Sense Reasoning.” In ECCV, Lecture Notes in Computer Science, Volume 6312, pp. 693-706, 2010.

[41] T. Zhang, H. Lu, and S. Z. Li, “Learning semantic scene models by object classification and trajectory clustering.” In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1940-1947, 2009.

[42] Y. Yang, J. Liu, and M. Shah, "Video Scene Understanding Using Multi-scale Analysis." In Proceedings of International Conference on Computer Vision, 2009.

[43] D. H. Wilson and C. Atkeson. "Simultaneous Tracking and Activity Recognition (STAR) Using Many Anonymous, Binary Sensors." Pervasive Computing, Springer, 2005.

[44] P.J. Burt, J.R. Bergen, R. Hingorani, R. Kolczynski, W.A. Lee, A. Leung, J. Lubin, and H. Shvayster. "Object tracking with a moving camera." Proceedings of the Workshop on Visual Motion, 2-12, 1989.

[45] C. Lin and M. Wolf, "MCMC-based Feature-guided Particle Filtering for Tracking Moving Objects from a Moving Platform." In Proceedings of IEEE 12th International Conference on Computer Vision Workshops, pp. 828-833, 2009.

[46] C. Lin and M. Wolf, “Detecting Moving Objects Using a Camera on a Moving Platform.” In Proceedings of 20th International Conference on Pattern Recognition, pp. 460-463, 2010.

[47] J. McIntyre, A. Church, F. Labrosse, “Efficient Image-based Tracking of Apparently Changing Moving Targets.” In Proceedings of Towards Autonomous Robotics Systems, pp. 119-126, 2009.

[48] A. Ess, B. Leibe, K. Schindler, and L. van Gool, “Robust Multiperson Tracking from a Mobile Platform.” In IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 31, pp. 1831-1846, 2009.

[49] J. Xiao, C. Yang, F. Han, and H. Cheng, “Vehicle and Person Tracking in Aerial Videos.” In Lecture Notes in Computer Science, pp. 203-214, Volume 4625, 2008.

[50] B. Jung and G. Sukhatme. "Detecting Moving Objects using a Single Camera on a Mobile Robot in an Outdoor Environment." 8th Conference on Intelligent Autonomous Systems, pages 980-987, 2004.

[51] J. Kang, I. Cohen, and G. Medioni. "Continuous Tracking Within and Across Camera Streams." CVPR 03, 2003.

[52] J. Vermaak, A. Doucet, and P. Pérez. "Maintaining Multi-Modality Through Mixture Tracking." Int. Conf. on Computer Vision, pages 1110-1116, 2003.

[53] Ettinger, S. M., Nechyba, M. C., Ifju, P. G., and Waszak, M., "Towards Flight Autonomy: Vision-Based Horizon Detection for Micro Air Vehicles," Florida Conference on Recent Advances in Robotics, 2002.

[54] Berthod, M., Kato, Z., Yu, S., and Zerubia, J., "Bayesian Image Classification Using Markov Random Fields." Image and Vision Computing, issue 14, pp. 285-295, 1996.

[55] Kato, Z., Zerubia, J., and Berthod, M., "Satellite Image Classification Using a Modified Metropolis Dynamics." In Proceedings of International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 573-576, March 1992.

[56] Kato, Z., "Modélisations markoviennes multirésolutions en vision par ordinateur. Application à la segmentation d'images SPOT." PhD Thesis, INRIA, Sophia Antipolis, France, December 1994.

[57] M. Manz, M. Himmelsbach, T. Luettel, and H. Wuensche, "Fusing LIDAR and Vision for Autonomous Dirt Road Following Incorporating a Visual Feature into the Tentacles Approach." In Autonome Mobile Systeme, pp. 17-24, 2009.

[58] C. Ordonez, O. Y. Chuy, E. G. Collins, and L. Xiuwen, “Rut tracking and steering control for autonomous rut following.” In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pp. 2775 - 2781, 2009.

[59] G. Chunzhao, S. Mita, and D. McAllester, "Stereovision-based Road Boundary Detection for Intelligent Vehicles in Challenging Scenarios." In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1723-1728, 2009.

[60] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning Long-range Vision for Autonomous Off-road Driving." In Journal of Field Robotics, Volume 26, pp. 120-144, 2009.

[61] J. Wang, Z. Ji, and Y. Su, "Unstructured Road Detection Using Hybrid Features." In Proceedings of the Eighth International Conference on Machine Learning and Cybernetics, pp. 482-486, 2009.

[62] H. Kong, J. Y. Audibert, and J. Ponce, “Vanishing point detection for road detection.” In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 96-103, 2009.

[63] H. Kong, J. Y. Audibert, and J. Ponce, “General Road Detection From a Single Image.” In IEEE Transactions on Image Processing, Volume 19, pp. 2211-2220, 2010.

[64] W. Wu and G. ShuFeng, "Research on Unstructured Road Detection Algorithm Based on the Machine Vision." In Proceedings of Asia-Pacific Conference on Information Processing, Volume 2, pp. 112-115, 2009.

[65] W. Yanqing, C. Deyun, ZS. Chaoxia, and W. Peidong, “Vision-based Road Detection by Monte Carlo Method.” In Information Technology Journal, Volume 9, pp. 481-487, 2010.

[66] H. Y. Cheng, C. C. Yu, C. C. Tseng, K. C. Fan, J. N. Hwang, and B. S. Jeng, "Environment Classification and Hierarchical Lane Detection for Structured and Unstructured Roads." In Institution of Engineering and Technology, Computer Vision, Volume 4, pp. 37-49, 2010.

[67] Coombs, D., Herman, M., Hong, T., and Nashman, M., “Real-Time Obstacle Avoidance Using Central Flow Divergence, and Peripheral Flow,” IEEE Trans. on Robotics and Automation, vol. 14, no. 1, pp. 49-59, 1998.

[68] Giachetti, A., Campani, M., and Torre, V., “The Use of Optical Flow for Road Navigation,” IEEE Trans. on Robotics and Automation, vol. 14, no. 1, pp. 34-48, 1998.

[69] Pomerleau, D. "RALPH: Rapidly Adapting Lateral Position Handler." IEEE Symposium on Intelligent Vehicles, September 25-26, 1995, Detroit, Michigan, USA.

[70] Dal Poz, A. P., and do Vale, G. M., “Dynamic programming approach for semi-automated road extraction from medium- and high-resolution images,” ISPRS Archives, vol. XXXIV, part 3/W8, Sept. 2003.

[71] Kang, D., and Jung, M., "Road Lane Segmentation using Dynamic Programming for Active Safety Vehicles," Elsevier Pattern Recognition Letters, vol. 24, issue 16, pp. 3177-3185, 2003.

[72] Kim, H., Hong, S., Oh, T., and Lee, J., “High Speed Road Boundary Detection with CNN-Based Dynamic Programming,” Advances in Multimedia Information Processing - PCM 2002: Third IEEE Pacific Rim Conference on Multimedia, pp. 806-813, Dec. 2002.

[73] Rasmussen, C. "Combining Laser Range, Color, and Texture Cues for Autonomous Road Following," Proc. IEEE Inter. Conf. on Robotics and Automation, Washington, DC, May 2002.

[74] Redmill, K., Upadhya, S., Krishnamurthy, A., and Ozguner, U., "A Lane Tracking System for Intelligent Vehicle Applications," Proc. IEEE Intelligent Transportation Systems Conference, 2001.

[75] Crisman, J., and Thorpe, C. “UNSCARF, A Color Vision System for the Detection of Unstructured Roads” In Proceedings of International Conference on Robotics and Automation, Sacramento, CA 1991.

[76] Dahlkamp, H., Kaehler, A., Stavens, D., Thrun, S., and Bradski, G., "Self-supervised Monocular Road Detection in Desert Terrain." In Proceedings of the Robotics Science and Systems Conference, 2006.

[77] Stentz, A., “Optimal and Efficient Path Planning for Partially-known Environments,” Proceedings of IEEE International Conference on Robotics and Automation vol. 4, pp. 3310-3317, 1994.

[78] De Maesschalck, R., Jouan-Rimbaud, D., and Massart, D. L., "The Mahalanobis distance," In Chemometrics and Intelligent Laboratory Systems, Vol. 50, pp. 1-18, 2000.

[79] Rogers, J., Lookingbill, A., and Thrun, S., "Learned Optical Flow for Mobile Robot Control." In Proceedings of NIPS 2005 Workshop on Machine Learning Based Robotics in Unstructured Environments, 2005.

[80] Singh, S. and Sharma, M., "Minerva scene analysis benchmark," Proc. 7th Australian and New Zealand Intelligent Information Systems Conference, Perth, 18-21 November, 2001.

[81] Yasnoff, W.A., Mui, J.K., and Bacus, J.W., “Error Measures for Scene Segmentation,” Pattern Recognition, vol 9, No. 4, pp. 217-231, 1977.

[82] Defense Advanced Research Projects Agency (DARPA), Darpa Grand Challenge (DGC), Online source: http://www.grandchallenge.org.

[83] Defense Advanced Research Projects Agency (DARPA), Learning Applied to Ground Robots (LAGR), Online source: http://www.darpa.mil/ipto/programs/lagr/vision.htm.

[84] Shoemaker, C. M. and Bornstein, J. A., “Overview of the Demo III UGV Program,” Proc. Of the SPIE Robotic and Semi-Robotic Ground Vehicle Technology, Vol. 3366, pp. 202-211, 1998.

[85] Pilutti, T. and Ulsoy, A. G., “Decision Making for Road Departure Warning Systems,” Proceedings of the American Control Conference, June 1998.

[86] H. Dahmani, M. Chadli, A. Rabhi, A. E. Hajjaji, “Fuzzy Uncertain Observer with Unknown Inputs for Lane Departure Detection.” In Proceedings of American Control Conference, pp. 688 - 693, July 2010.

[87] J. Ge, Y. Luo, and G. Tei, “Real-Time Pedestrian Detection and Tracking at Nighttime for Driver-Assistance Systems.” In IEEE Transactions on Intelligent Transportation Systems, Volume 10, pp. 283-298, 2009.

[88] V. Prisacariu, R. Timofte, K. Zimmermann, I. Reid, and L. Van Gool, "Integrating Object Detection with 3D Tracking Towards a Better Driver Assistance System." In Proceedings of 20th International Conference on Pattern Recognition, pp. 3344-3347, 2010.

[89] Murray, D. and Little, J., “Using Real-Time Stereo Vision for Mobile Robot Navigation,” Autonomous Robots, vol. 8, Issue 2, pp. 161-171, Apr. 2000.

[90] DeSouza, G. and Kak, A., “Vision for Mobile Robot Navigation: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 237-267, Feb. 2002.

[91] Moorehead, S., Simmons, R., Apostolopoulos, D., and Whittaker, W. L., "Autonomous Navigation Field Results of a Planetary Analog Robot in Antarctica," International Symposium on Artificial Intelligence, Robotics and Automation in Space, June 1999.

[92] Asensio, J. R., Montiel, J. M. M., and Montano, L., “Goal Directed Reactive Robot Navigation with Relocation Using Laser and Vision,” IEEE Proc. of International Conference on Robotics and Automation, vol 4, pp. 2905-2910, 1999.

[93] K. M. Wurm, R. Kummerle, C. Stachniss, and W. Burgard, "Improving Robot Navigation in Structured Outdoor Environments by Identifying Vegetation from Laser Data." In Proceedings of Intelligent Robots and Systems, pp. 1217-1222, Oct. 2009.

[94] F. Maurelli, D. Droeschel, T. Wisspeintner, S. May, H. Surmann, “A 3D Laser Scanner System for Autonomous Vehicle Navigation.” In Proceedings of International Conference on Advanced Robotics, pp. 1-6, 2009.

[95] A. Rankin, A. Huertas, and L. Matthies, “Stereo Vision Based Terrain Mapping for Off-road Autonomous Navigation.” In Proceedings of SPIE, the International Society for Optical Engineering, Volume 7332, 2009.

[96] D. Rosselot, M. Aull, and E. Hall, "Predictive Vision from Stereo Video: Robust Object Detection for Autonomous Navigation Using the Unscented Kalman Filter on Streaming Stereo Images." In Proceedings of SPIE - The International Society for Optical Engineering, Volume 7539, Jan. 2010.

[97] Durrant-Whyte, H., “A critical review of the state-of-the-art in autonomous land vehicle systems and technology,” Sandia Report SAND2001-3685, Sandia National Laboratories, Albuquerque, N.M., 2001.

[98] Ulrich, I. and Nourbakhsh, I., "Appearance-Based Obstacle Detection with Monocular Color Vision," Proceedings of AAAI Conference, pp. 866-871, 2000.

[99] Lorigo, L. M., Brooks, R. A., and Grimson, W. E. L., "Visually-Guided Obstacle Avoidance in Unstructured Environments," In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 373-379, 1997.

[100] Horswill, I., “Polly: A Vision-based Artificial Agent,” Proceedings of AAAI Conference, pp. 824-829, 1993.

[101] Shirkhodaie, A. and Amrani, R., “Visual Terrain Mapping for Traversable Path Planning of Mobile Robots,” Proceedings of SPIE, vol. 5608, pp. 118-127, Oct. 2004.

[102] Bellutta, P., Manduchi, R., Matthies, L., and Rankin, A. “Terrain Perception for DEMO III,” In: Proc. Intelligent Vehicles Conf., Dearborn, MI, 2000.

[103] Manduchi, R., Castano, A., Talukder, A., and Matthies, L. "Obstacle Detection and Terrain Classification for Autonomous Off-Road Navigation," Auton. Robots, 18(1): pp. 81-102, 2005.

[104] Iagnemma, K., and Dubowsky, S., "Terrain Estimation for High-Speed Rough-Terrain Autonomous Vehicle Navigation," Proceedings of the SPIE Conference on Unmanned Ground Vehicle Technology IV, 2002.

[105] J. Chetan, K. Madhava, and C. V. Jawahar, "Fast and Spatially-Smooth Terrain Classification Using Monocular Camera." In Proceedings of 20th International Conference on Pattern Recognition, pp. 4060-4063, 2010.

[106] M. Procopio, J. Mulligan, and G. Grudic, "Learning Terrain Segmentation with Classifier Ensembles for Autonomous Robot Navigation in Unstructured Environments." In Journal of Field Robotics, Volume 26, pp. 145-175, 2009.

[107] P. Moghadam, W. S. Wijesoma, “Online, Self-supervised Vision-based Terrain Classification in Unstructured Environments.” In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pp. 3100-3105, 2009.

[108] Sofman, B., Lin, E., Bagnell, J., Vandapel, N., and Stentz, A. "Improving Robot Navigation Through Self-Supervised Online Learning," In Proceedings of Robotics: Science and Systems, 2006.