FAST TEMPLATE MATCHING FOR VISION-BASED LOCALIZATION

by JASON HARPER

Submitted in partial fulfillment of the requirements for the degree of Master of Science

Thesis Advisor: Wyatt Newman

Department of Electrical Engineering and Computer Science

CASE WESTERN RESERVE UNIVERSITY

May 2009

Table of Contents

List of Figures
List of Tables
1 Introduction
1.1 Localization
1.1.1 Global Visual Localization
1.1.2 Local Visual Localization
1.2 Template Matching
1.2.1 Edge Matching
1.2.2 Pattern-Based Template Matching
1.2.3 Grid Template Matching
1.3 Combining Vision Data with a State Observer
1.4 Contributions
2 Problem Definition
3 Platform Setup
3.1 The Robot Platform
3.2 The Chosen Camera
3.3 Camera Calibration
3.4 Timing Calibration
4 Fast Template Matching
4.1 Overview
4.2 Camera Data Acquisition
4.3 Finding Tiles
4.4 Plan View to Hough Space Transforms
4.5 Transforming Hough Space into Modulo Space
4.6 Normalizing the Transform
4.7 Pose Estimation from Mod-Space for Localization
4.8 Making Template Matching Fast
5 Integration with a Localization System
5.1 Overview
5.2 Variance from Helper Functions
5.3 Credibility Functions
5.4 Setting Credibility Thresholds
6 Results
6.1 System Results
6.2 Algorithm Runtime
6.3 Robustness with Respect to Noise
6.4 Resistance to Distracter Lines
6.5 Integration with Localization System
7 Conclusions
8 Future Work
Appendix A
Appendix B
B.1 Canny Edge Detector
B.2 Laplace Edge Detector
B.3 Weighted Local Variance
Bibliography


List of Figures

Figure 1: ALEN, the robot used for algorithm testing.
Figure 2: Firefly MV from Point Grey Research (Firefly MV, 2008).
Figure 3: Camera view of aligned grid points.
Figure 4: Standard deviation of the i-components manually chosen for the Cartesian-space subset.
Figure 5: Standard deviation of the j-component of the Cartesian-space subset.
Figure 6: Cartesian space image transformed from camera space using the calibration lookup table.
Figure 7: Image of circuit used to provide a timestamp for timing calibration.
Figure 8: Example of binary timestamp 0b00100101101011.
Figure 9: Timing delay of camera to PSO.
Figure 10: Pipeline architecture diagram.
Figure 11: Edge detection of hallway image using the Sobel operator.
Figure 12: Edge detection of hallway image using the Canny algorithm.
Figure 13: Edge detection of hallway image using the Laplace convolution kernel.
Figure 14: Edge detection of hallway image using the weighted local variance method.
Figure 15: Hough space transformed from the Sobel output in Figure 11.
Figure 16: Example of distracting lines.
Figure 17: Illustration of equations 4.5.1 and 4.5.2.
Figure 18: Sample sections of mod space.
Figure 19: Mod space with distracter lines.
Figure 20: Mod space after normalization.
Figure 21: Process of projecting mod space and creating a combined confidence function.
Figure 22: Depiction of creating the x-confidence function.
Figure 23: Depiction of creating the y-confidence function.
Figure 24: Example of the mod space to global heading transform.
Figure 25: Illustration of aliasing.
Figure 26: Scatter plot of data collected to test outlier rejection.
Figure 27: Histogram of RMS error, helper function in x-direction.
Figure 28: Histogram of MSE, helper functions in x-direction.
Figure 29: Histogram of RMS error, helper function in y-direction.
Figure 30: Histogram of MSE, helper function in y-direction.
Figure 31: Histogram of RMS error, helper function for theta.
Figure 32: Histogram of MSE, helper function for theta.
Figure 33: Plot of x component of position estimates separated by the chosen thresholds.
Figure 34: Scatter plot of y values separated using the selected thresholds.
Figure 35: Scatter plot of theta values separated using the selected thresholds.
Figure 36: Comparison between control heading parameter function and bad heading parameter function.
Figure 37: Test loop around Glennan 7th floor.
Figure 38: Expanded views of Figure 40. (a) View of error in the x-direction. (b) View of corner. (c) View of error in the y-direction. (d) View of the start and end points.
Figure 39: Multiple localization-technique results overlaid to compare performance.
Figure 40: Timing diagram of algorithm runtime.
Figure 41: Plan view example of noisy image.
Figure 42: Plan view example of noisy image with overlay lines depicting algorithm output.
Figure 43: Plan view example of environment with distracter lines.
Figure 44: Plan view example of distracter lines with algorithm output overlaid.
Figure 45: Vision output indicating the x values that have a low variance (green) and a high variance (red).
Figure 46: Vision output indicating the y values that have a low variance (green) and a high variance (red).
Figure 47: Vision output indicating the theta values that have a low variance (green) and a high variance (red).
Figure 48: Results of providing a Kalman filter with vision-based localization updates.
Figure 49: Bayer color format from (Furtner, 2009).


List of Tables

Table 1: Thresholds for the helper functions
Table 2: The algorithm’s runtime on a server and typical laptop.


Fast Template Matching for Vision-Based Localization

Abstract

By

JASON WILLIAM HARPER

This thesis presents a novel vision-based localization method that uses fast template-matching techniques with respect to regularly-spaced floor tiles to provide pose information.

To accomplish the template matching, an edge map is created, transformed into Hough space and interpreted modulo the periodicity of the template. In this space, offsets relative to the periodicity of the floor tiles can be found. By separately tracking accumulation of periods, a global pose estimate can be determined that is immune to accumulation of incremental errors. The method is shown to be robust with respect to noise and distracter lines, and it also recognizes when a scene analysis is untrustworthy. The method is suitable for integration within a Kalman filter to contribute to improved localization.


1 Introduction

As humans, knowing where we are located with respect to our environment seems effortless. If we walk across a cluttered room, or down a hallway in a building, we are able to easily avoid obstacles and keep from running into walls. For robots, the problem of localization, or determining the precise location of the robot, is one of the most difficult and fundamental problems in mobile robotics (Fox, Burgard, & Thrun, 1999). Almost every other system in the robot relies on an accurate measure of position, including steering, sensor integration, planning, and mapping. If any of these critical systems were missing, the robot would not be able to function properly.

This thesis introduces a method of using a CCD camera to provide position information. A fast method is proposed to match a grid template to floor tiles in the camera’s field of view. Once the template alignment is found, a pose estimate and a measurement variance are computed and used in a Kalman filter to provide accurate and reliable localization. The following sections give background information on previous work in the areas of localization, template matching, and vision-based variance estimation, and how these topics relate to this thesis.

1.1 Localization

The problem of localization has been approached using many different methods over the years. The most basic form of localization is dead reckoning, which uses only the previous known position and velocity information for a given time sample to determine the


next position (Borenstein, Everett, & Feng, 1994). The lack of a perfect system model and outside disturbances will cause pure kinematics to quickly accumulate errors, and the estimated position of the robot will drift away from its actual position (Franklin, Powell, &

Workman, 1998). To eliminate errors in the position, the principle of closed-loop control can be used. In closed-loop control, the state of the system is sensed and fed back to the kinematics instead of only using the state prediction (Antsaklis & Michel, 2007).

Using feedback to estimate the system state assumes that the state of the system can be measured (Antsaklis & Michel, 2007). As is typically the case, directly measuring the entire robot state is very difficult and impractical, so an observer can be constructed to estimate the states based on available measurements (Antsaklis & Michel, 2007). An observer can be made by utilizing a mathematical model of the system that is induced to reconcile the states with available physical measurements. The model is coerced into at least approximate agreement with sensor measurements using feedback of the discrepancy between model-based estimates and sensor-based measurements. This feedback is weighted with gains that depend on estimates of the statistics that describe uncertainties, including unmodelled disturbances and measurement noise (Hendricks, Jannerup, & Sørensen, 2008).

This results in an improvement over the noise levels of direct measurements, as well as reconstruction of unmeasured states (Hendricks, Jannerup, & Sørensen, 2008). The Kalman filter is a generalized observer that can optimally determine these weights for the input sensors as long as the sensor noise can be described by a Gaussian distribution (Franklin, Powell, &

Workman, 1998). The platform used, as will be discussed in 3.1, has a Kalman filter implemented on a real-time computer that was created by a senior project group. The currently implemented observer works well outdoors where GPS data is present, but indoors, the only sensors currently being used are the wheel encoders. However, even under


conditions with little wheel slippage (which itself can cause large errors), the heading estimate derived from the wheel encoders is very poor (Hardt, Wolf, & Husson, 1996).

Because GPS does not work indoors, finding sensors that can be used to estimate the position of the robot is a common problem in indoor mobile robotics. There is a large range of localization methods using various sensing modalities. Typical approaches have used electronic gyroscopes, laser range finders (LIDAR), and ultrasonic ranging sensors

(Borenstein, Everett, & Feng, 1994). The data from LIDAR or sonar can be matched to a global map to determine an estimate of position (Yeol, 2005) (Morenol, Munoz, Garrido, &

Martin, 2007). The difference between consecutive LIDAR scans can also be used to obtain odometry information (Hellström, Ringdahl, & Siddiqui, 2006). However, these sensors tend to be costly and are easily confused in highly dynamic environments (Dellaert, Burgard,

Fox, & Thrun, 1999).

Another approach that research has taken in recent years is to use CCD cameras because their cost is attractive compared to the other sensors (Dao, You, & Oh, 2005).

Cameras also arguably provide more information than any other sensor currently used for localization (Borenstein, Everett, & Feng, 1994). Vision-based localization algorithms typically fall under one of two categories: global or local (Tamimi & Zell, 2004). Global localization algorithms attempt to match current camera data to a template whose global position is known to discover the current position without previous knowledge of the robot’s position (Tamimi & Zell, 2004). Local methods for localization (which may use perturbation analysis) instead only use data in the current image or recent images along with an estimate of the current position to determine an improved estimate of the robot’s position (Tamimi & Zell, 2004).


1.1.1 Global Visual Localization

Some research has taken ideas from other localization modalities by creating vision-based Simultaneous Localization and Mapping (SLAM) systems. Some approaches to visual SLAM find visual features to track within sequences of images, determining their positions and, therefore, the position of the robot, and storing the features in a global map (Davidson, 2003) (Eade & Drummond, 2006). Other approaches use a nonlinear method to match more substantial portions of an image in subsequent frames (Silveira, Malis, & Rives, 2008). One of the major downfalls of the visual SLAM approach is the time required to perform the calculation. Some algorithms have reported processing times as long as 11.9 seconds per frame (Eade & Drummond, 2006).

Other global visual localization methods have been proposed that do not use the

SLAM algorithm. One method used an a priori map of planar landmarks (typically posters on a wall) that the robot could easily track (Ayala, Hayet, Lerasle, & Devy, 2000). Another method used a predefined, sparse, 3-D edge map to match camera edge data and calculate the corresponding camera pose (Nuske, Roberts, & Wyeth, 2008). The Minerva robot, a museum tour guide, used a map of the museum ceiling to perform global localization

(Dellaert, Burgard, Fox, & Thrun, 1999). These methods and others similar to them are very good at determining a global position; however, they require accurate a priori information which makes using the algorithms difficult in a real-world situation.

A smaller subset of the techniques used for global positioning requires artificial landmarks to be placed in the robot's environment. A robot designed to navigate warehouses made use of reflective barcodes at known locations to provide location information (Everett, Gage, Gilbreath, Laird, & Smurlo, 1994). Other systems have been


developed using active or passive optical beacons; however, they all require retrofitting the environment with beacons, and the robot needs a direct line of sight to the beacon for the method to work properly (Borenstein, Everett, & Feng, 1994).

1.1.2 Local Visual Localization

The other category of visual localization is local localization, which uses the current position estimate and the current camera data to calculate a corrected position. One of the methods used is similar to global localization in that interesting features are extracted from a frame and are tracked in subsequent frames (Lv, Zhou, & Liu, 2006). In this case, however, the features are not stored in a global map. Typically these features are either long, strong lines in the image, or scale-, rotation-, and illumination-invariant features selected using the

SIFT algorithm (Dao, You, & Oh, 2005) (Lv, Zhou, & Liu, 2006). A feature-tracking algorithm was used on NASA’s Spirit and Opportunity robots, which tracked corner features of the terrain (Cheng, Maimone, & Matthies, 2005).

Optical flow is another common algorithm used to extract motion from a scene. It works by finding points that have similar intensity patterns between two subsequent images and using the points to determine a vector field of motion in an image (Russel & Norvig, 2003). This approach has been used by researchers to determine ego-motion, or the position change of the camera with respect to the environment.

Examples of the applications of optical flow for visual localization include determining the velocity of a mobile robot, improving odometry, and determining the 3-D localization of a hopping robot on the surface of an asteroid (Kim, Jeong, & Lee, 2007) (Nagatani,

Tachibana, Sofue, & Tanaka, 2000) (Wai Yan So, Yoshimitsu, & Kubota, 2008). Optical


flow has great potential, as it has been shown to mimic the vision of a fly, but its measurements are inherently noisy (Russel & Norvig, 2003).

1.2 Template Matching

This thesis uses a method of fast template matching for vision-based localization. Techniques for template matching were briefly discussed in the previous sections, but only as part of a broad overview of visual localization. This section therefore gives a more detailed description of the template-matching algorithms that are related to the technique used in this thesis.

1.2.1 Edge Matching

Edge matching techniques start with a 3-D model of the edges in the scene, which are assumed to be available (Kosaka & Kak, 1992). Using the calibration parameters of the camera and the current position estimate, the model edges are transformed to camera space

(Nuske, Roberts, & Wyeth, 2008). The next step is to find edges in the image, which can be done with any edge-detection algorithm. Often the edges are enhanced using an adaptive threshold (Ohya, Kosaka, & Kak, 1998) or, in other cases, using intelligent exposure control

(Nuske, Roberts, & Wyeth, 2008). The edges in the camera space are then compared with the transformed model edges. The error in position is then computed from the discrepancy between the camera edges and the model edges, using the camera calibration parameters (Ohya, Kosaka, & Kak, 1998). Another approach is to use a particle filter to maintain multiple position estimates per frame (Nuske, Roberts, & Wyeth, 2008). This technique, like other global localization techniques, requires an a priori edge map, which can be difficult or costly to obtain.

1.2.2 Pattern-Based Template Matching


Another form of template matching for vision-based localization has borrowed ideas from information theory. This method required the installation of floor tiles in the robot’s environment that have distinctive visual patterns that directly encode the position information of the tile (Cord & Lazic, 1997). The probability of misinterpreting the code can be minimized using redundancy information and other techniques from information theory (Cord & Lazic, 1997). It has been shown that these patterns can provide the necessary information for localization using either deterministic codes or random patterns

(Cord & Lazic, 1997). Another approach to the same problem introduces Penrose tiles, which have asymmetric patterns (Schellwat & Wide, 1997). Using tiles with encoded data, although promising in theory, still requires specialty flooring, precision installation, and position and code data to be recorded for every tile installed. All of these requirements make the idea impractical to implement.

1.2.3 Grid Template Matching

Another template-matching technique that relates to this thesis is the grid template-matching technique. Although the technique has not been reported for mobile robot localization, it has been used to localize chess pieces on a chessboard, and the idea can be adapted for robot localization. In this case, a camera is placed such that it has either a top-down view of the chess board or a view from a low angle (Tam, Lay, & Levy,

2008). The approach assumes that a grid is present in the image. With this assumption, the grid can be detected using one of at least three methods.

The first two methods differ in the initial feature extraction but are very similar in the feature interpretation. The first method is to detect the intersection points of the gridlines using a corner detection method such as the Harris corner detection method or


SUSAN method (Tam, Lay, & Levy, 2008). The results are then used in a Radon transform where the grid lines can be detected (Tam, Lay, & Levy, 2008). The Radon transform introduces a parameter space where a point in the Radon parameter-space describes an integral projection of a parametric line (Buzug, 2008). Alternatively, the second approach is to use edge detection to detect the grid lines instead of the corners (Tam, Lay, & Levy,

2008). This approach also uses a Radon transform to then detect the grid (Tam, Lay, &

Levy, 2008). Once the features have been transformed into Radon space, lines will show up as periodic bright pixels that line up at two columns separated by 90 degrees (Tam, Lay, &

Levy, 2008). Along those columns, the peaks in the accumulator cells will be spaced out by the distance between the lines of the grid (Tam, Lay, & Levy, 2008).

The third approach is typically applied to grids that are not as rigid as a chess board.

In this case, the grid is modeled as a deformable template (Hartelius & Carstensen, 2003).

As a deformable template, the grid corners are allowed to deviate from the expected pattern through the Markov random field deformation (Hartelius & Carstensen, 2003). The grid is then determined through an iterative simulated annealing process which matches the expected grid to the grid corners found in the image (Hartelius & Carstensen, 2003).

1.3 Combining Vision Data with a State Observer

The Kalman filter equations require knowledge about the variance in a measurement

(Franklin, Powell, & Workman, 1998). There have been different approaches to determining the variance of visual data to be used in the Kalman filter. The simplest of the methods is using a static variance estimate. Some research has calculated the variance based on an assumed variance of the pixel intensities (Xuedong, Xinhua, Rongjin, & Jiashan, 2006).

Others have used a simulation with a given amount of noise to estimate the variance based


on the difference of the algorithm’s output state with the known simulation state

(Mammarella, Campa, Napolitano, Fravolini, Gu, & Perhinschi, 2008). Another common approach is to keep the camera at a constant position, take a sample of measurements, and then compute the variance in the system output (Murata & Hirose, 1993). All of these systems assume that the noise in the measurement is Gaussian white noise; however, this may be a poor assumption.

Vision data in cluttered environments may not have normally distributed noise, but instead, may have a multimodal distribution (Isard & Blake, 1998). Other approaches have been proposed to compute a variance based on “features” measured from the camera data

(Warren, 2009). In this approach, measurements of variance are obtained using simulated data where the input state is known exactly, and the mapping from the features to the variance is learned using k-means clustering (Warren, 2009). In this thesis, a method is proposed that expands on the idea used by Warren and uses helper functions to determine whether a measurement should be assigned a low or a high variance.

1.4 Contributions

The method presented in this thesis uses template matching techniques with respect to floor tiles to provide position information. The technique uses the position of floor tile edges as a global reference. It also builds upon grid template matching techniques by introducing a normalized transformation from Hough space into a new grid-parameter space called modulo space. A robust pose estimate can then be efficiently extracted from the new parameter space. Finally, for incorporation in a Kalman filter, a method of determining a measurement variance is proposed that uses two credibility functions to reject outlier pose estimates.


2 Problem Definition

The problem defined by this thesis is to use a template-matching algorithm to detect floor tiles for the purpose of localization. The algorithm requires a current estimate of the robot’s position that is accurate within ±0.5 ft and ±2 degrees. The style of the floor tile will be restricted to being a single color, twelve inches square, and having minimal grouting between the tiles. Specifically, the tiles that will be used are the floor tiles in the Glennan building at Case Western Reserve University. This thesis will also define a method of providing a variance of each position measurement to allow the system to be integrated with a Kalman filter.

Local-based methods of localization use previous position information to give an accurate incremental change in position. In this way, the proposed method of localization is similar to other local-based methods. The proposed algorithm uses a current image from the camera and the last known position of the robot to determine an estimate for the new current position. It is also similar to other proposed vision-based localization algorithms because it tracks features between each frame.

Incremental localization techniques tend to drift away from the actual position because small errors in the estimates accumulate and are magnified over large distances. This can be seen most clearly with the robot's wheel encoders. The encoders give a precise measurement of the change in the angle of the wheel. Converting this measurement of wheel angle to translational movement introduces errors, because the measurements of wheel diameter and wheelbase are less precise and because of errors in the model; these errors can be considered negligible for any single iteration. Over time, however, the negligible errors accumulate and result in a large error in position and heading. The fundamental problem with incremental techniques is that they have no global reference on which to base their incremental change. The proposed method is different from incremental localization techniques because it has a global reference on which to base its measurements. The floor tiles used for localization are at known global positions, and at each instance the incremental measurement is relative to the floor tiles. This feature keeps any errors in a given measurement from accumulating and affecting later measurements.

Previously mentioned local vision-localization algorithms use distinguishable features found in a given scene to match to features in previous scenes. This technique can also provide a global reference to keep errors from accumulating, especially in the case of detecting structural features such as door frames or the edges of walls. The proposed technique is different from these algorithms because, instead of finding features that may or may not be present in subsequent frames, it exploits the expectation that the floor tiles are always present and narrows its search to the grid pattern of the tiles. This not only simplifies the algorithm by eliminating the need to determine which feature in the previous frame corresponds to a feature in the current frame, but also guarantees that more information will be present in each frame. In the case of a large empty room with only weak edges in the image, previous algorithms might fail to provide any localization information, while the proposed algorithm uses the weak edges of the floor tiles to provide an accurate position estimate.


Global-localization algorithms provide estimates whose errors are independent of errors in previous measurements. They use information about the structure of the robot’s environment to provide an estimate of its current position in that environment. Using the knowledge that floor tiles are regularly spaced throughout the robot’s environment makes the proposed algorithm similar to previous global techniques. As opposed to other global-localization methods, the proposed algorithm does not require a surveyed map of the environment or for the robot to map the environment itself. It relies on the regular placement of the tiles to infer their locations instead of storing their positions.

By placing objects in the environment that encode position information, a global-based method can provide a position estimate that does not require a map of the environment. However, these methods require the installation of specialty equipment, which can be expensive or time consuming. The proposed method has the same benefits as these specialty installations, but uses common floor tiles that are already installed in many buildings around the world.


3 Platform Setup

3.1 The Robot Platform

Figure 1: ALEN, the robot used for algorithm testing.

The robot platform used is built on an electric wheelchair base donated by Invacare Corporation. The wheelchair base provides 24 V from two car batteries serially connected and mounted to the frame. The base also has two electric motors that together are capable of moving up to 278 pounds (Invacare, 2009). A frame designed to hold the computer and sensors was built upon the wheelchair base using Bosch framing. To handle computational tasks, the robot has two computers, National Instruments' compactRio (cRio) and a custom-built server, networked together.

The cRio is a rugged controller that handles most of the sensor communication (NI cRIO-9074, 2009). It is built with an industrial real-time processor for reliable processing and an FPGA to handle the I/O and some low-level processing (NI cRIO-9074, 2009). The Kalman filter is run on this computer because of its ability to process data quickly and with real-time guarantees.

The main computer is a 2.33 GHz, dual quad-core, server-style computer with 6 GB of DDR2 RAM. This computer is used to acquire data from the camera and the LIDAR. It was also chosen to run the code for this thesis because any programming language could be used and because of its exceptional processing power and memory.

3.2 The Chosen Camera

Figure 2: Firefly MV from Point Grey Research (Firefly MV, 2008).

The camera chosen in this paper is the Firefly MV made by Point Grey Research.

The camera captures images at 640 x 480 resolution using raw Bayer data (Firefly MV, 2008).

It supports up to 60 frames per second and has features such as automatic gain and automatic shutter speed that are controlled by an onboard FPGA (Firefly MV, 2008). The

Firefly comes with a standard IEEE 1394a interface, which provides power and communication at a rate of 400 Mb/s (Firefly MV, 2008). It also conforms to the IEEE


1394-based Digital Camera specifications allowing for a standard firewire camera library to be used to capture data and set camera parameters.

One of the motivations for using vision-based localization is the cost of CCD cameras. The camera chosen for this thesis is attractive in terms of performance versus cost. The camera is small and uses a standardized interface, making it useful for small- to large-scale robotic applications. With all of these features, the camera is available for under $200 on the current market.

The camera was fitted with a 2.5 mm fixed-iris lens. The camera is mounted on a custom fixed-position mount that holds the camera at -36˚ with respect to the ground plane.

The fixed-iris lens and the fixed mount were chosen to avoid the camera moving after it had been calibrated. The parameters for the focal length, height of the camera, and downward angle were chosen so that a ten-foot by ten-foot area in front of the robot would be in the field of view of the camera.

3.3 Camera Calibration

In order for any computer vision algorithm to extract metric information from a two-dimensional image, it is necessary to first calibrate the camera (Zhang, 2002). Traditional methods of camera calibration consist of determining the internal geometric and optical characteristics of a camera (Tsai, 1987). These characteristics can be used to transform pixels to Cartesian coordinates. A transformation using the geometric and optical parameters can be time-consuming.

The algorithm used in this paper circumvents determining these optical characteristics by associating pixel coordinates to every Cartesian-space coordinate in the camera’s field of view and storing the pairs in a lookup table. This method can be used


instead of the intrinsic parameters if all points are assumed to lie on the X-Y plane, which in this case is the floor.

Manually choosing a pixel coordinate in the image for every Cartesian-space coordinate would have been very time-consuming. Instead, the approach taken in this thesis is to manually pair pixel coordinates with only a subset of the Cartesian points. The rest of the pairs can be found by interpolating pixel coordinates for every point in Cartesian space that was not manually associated.

To accomplish this, the camera was mounted to the frame of the robot using a fixed mount. The robot was then aligned with a grid of visually distinctive points, such that the position of the points with respect to the robot was known. The grid used for the calibration was made using the floor tiles. The subset of points chosen to manually associate to pixel coordinates were Cartesian coordinates located at the corner of the floor tiles. Some of the points on the grid, especially the points farthest away from the camera, were difficult to identify because the grid lines were faint in the image. To make identifying the pixel coordinates easier, a 1 cm square piece of black tape was placed over each corner.

Once the robot was in position and the grid was properly set up, a snapshot image of the grid was saved for a later process. The robot was then moved away from its position and then realigned with the grid. Another snapshot was taken and saved, repeating the process until ten snapshots had been taken. The purpose of repositioning the robot after each snapshot was to measure the repeatability of aligning the robot with the grid and to average out any error associated with the alignment process.


Figure 3: Camera view of aligned grid points.

Once an image had been acquired, it was inspected using an open-source photo-viewing software package called GIMP. The pixel coordinates for each of the intersection points were recorded along with their corresponding Cartesian coordinates in the calibration table for that image. Because the points chosen were regularly spaced on the floor tiles, each point was separated by twelve inches from its neighbors in the x and y directions. This regularity provided an even distribution of manually associated points that made interpolation easier and more accurate than having randomly chosen points. This process was repeated for all ten of the snapshots, producing ten separate calibration tables. Once all of the calibration tables had been completed, a composite calibration table was created by taking the average pixel coordinate for each grid point.

The average of the pixel coordinates from the ten calibration tables was used in the final composite calibration table so that the error in the calibration process due to aligning the robot could be found. The error was estimated by taking the standard deviation of the ten pixel coordinates used to compute each average. The following two figures show the standard deviation of the pixel coordinates manually chosen for the subset of the Cartesian coordinates. The standard deviations are for the i and j components, respectively.

Figure 4: Standard deviation of the i-components manually chosen for the Cartesian-space subset.

Figure 5: Standard deviation of the j-component of the Cartesian-space subset.

(Both figures plot the standard deviation in pixels over the X and Y grid positions in feet.)


These figures show that aligning the robot with the grid and manually choosing pixel coordinates were accurate processes. The maximum standard deviation in both the i and j components was 1.4 pixels. Manually picking the pixel coordinates would most likely have an expected measurement error of one pixel. Therefore, the error due to aligning the robot was minimal.

The pixel coordinates for each of the Cartesian coordinates that were not manually associated were found by interpolating between pixel coordinates in the composite calibration table. Most interpolation methods could have been used to produce the final table. Bilinear interpolation was chosen because the regularly-spaced grid points provided enough information for the interpolation to produce accurate results without needing higher-order methods. Using interpolation, any size of discretization could have been generated for the final calibration table. To get enough resolution to be able to see the grout lines between the tiles, Cartesian space was discretized such that each cell in the lookup table was a 0.01 ft square, which corresponds to about a 3 mm square. This cell size and the ten-foot by nine-foot viewing area in front of the robot made the final calibration table 1000 x 900 cells.
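As an illustration of the interpolation step, the sketch below bilinearly interpolates a pixel coordinate for an arbitrary floor point from the four manually associated tile corners that surround it. This is a minimal sketch rather than the thesis code: the table layout, the array sizes, and the assumption that the query point lies inside a known cell with nonnegative indices are illustrative choices.

```cpp
#include <array>

// A pixel coordinate in the camera image (i = row, j = column).
struct Pixel { double i; double j; };

// Composite table of manually associated tile corners: corners[y][x] holds the
// averaged pixel coordinate of the corner y feet forward and x feet across.
// The sizes are illustrative; the thesis covers roughly a 10 ft x 9 ft area.
using CornerTable = std::array<std::array<Pixel, 11>, 10>;

// Bilinearly interpolate the pixel coordinate of a floor point (xFeet, yFeet)
// from the four tile corners surrounding it.
Pixel interpolatePixel(const CornerTable& corners, double xFeet, double yFeet) {
    const int x0 = static_cast<int>(xFeet);   // corner column at or below xFeet
    const int y0 = static_cast<int>(yFeet);   // corner row at or below yFeet
    const double fx = xFeet - x0;             // fractional position inside the cell
    const double fy = yFeet - y0;

    const Pixel& p00 = corners[y0][x0];
    const Pixel& p01 = corners[y0][x0 + 1];
    const Pixel& p10 = corners[y0 + 1][x0];
    const Pixel& p11 = corners[y0 + 1][x0 + 1];

    // Interpolate along x on the two bounding corner rows, then along y.
    const Pixel a{p00.i + fx * (p01.i - p00.i), p00.j + fx * (p01.j - p00.j)};
    const Pixel b{p10.i + fx * (p11.i - p10.i), p10.j + fx * (p11.j - p10.j)};
    return Pixel{a.i + fy * (b.i - a.i), a.j + fy * (b.j - a.j)};
}
```

Evaluating this once for every 0.01 ft cell of the viewing area and storing the result yields the 1000 x 900 lookup table described above, so that producing a plan view at run time costs only one table lookup per cell.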

The following image shows Figure 3 after it had been calibrated using the generated calibration table. The straight lines between the grid points show that the calibration results were accurate and had removed warping from the lens. Most importantly, all of the edges between the tiles were visible and distinctive. This validated that the calibration table could be used to convert camera images into Cartesian space to be used for later processes.


Figure 6: Cartesian space image transformed from camera space using the calibration lookup table.

3.4 Timing Calibration

Delayed input to a Kalman filter can cause errors in the position estimate because the filter would be trying to update the current position estimate with old information. This will cause the position estimate to lag behind the actual position, especially if the measurement has a low variance associated with it. Delays can be controlled by keeping a short history of the incoming measurements, their variances, and the associated states. If a delayed measurement carries timing information, it can be folded into the stored history at the correct point, and the observer can then recalculate the estimates forward from that point.


Vision processing is inherently slow because of the amount of information it must process. Also, with the chosen system platform, the vision system will communicate with the observer through a network interface. Both of these aspects will cause delays in the position estimate getting to the Kalman filter. Therefore, a timing calibration must be made to determine how old the measurements will be.

A possible area of delay is the difference in time between when the camera takes an image and when the image becomes available to the computer. This is a difficult property to measure because there needs to be a way to timestamp when an image was taken and when it was received. The method chosen to accomplish this was to use the cRio as an external timing source. On the cRio, an integer was incremented every millisecond to serve as a timestamp. Since the cRio is a real time controller, we are guaranteed within reasonable error that this timestamp will be updated exactly every millisecond.

An analog I/O module was used to output the binary representation of the timestamp on a series of LEDs. With the LEDs visible in the camera’s field of view, each image will have a timestamp of when the snapshot was taken. The timestamp must be extracted manually and compared to the received timestamp.

Figure 7: Image of circuit used to provide a timestamp for timing calibration.


The same timestamp is then sent to the server where software acquires the images from the camera. Both the timestamp from the cRio and the image are then stored together for later analysis.

Determining if an LED is on or off can be difficult because of variations in light from the room. To make the LEDs easier to see, bright blue LEDs were used and the shutter speed was decreased.

Figure 8: Example of binary timestamp 0b00100101101011.

The difference between the timestamp in the image and the timestamp sent from the cRio can then be recorded as the frame latency. The data collected can be seen in the following graph:


Figure 9: Timing delay of camera to PSO (time difference in ms versus sample number).

The average delay time was -8 ms with a standard deviation of 14 ms. A negative difference in time indicates that the data from the physical state observer (PSO) is on average 8 ms slower getting to the server than the camera. This is what was expected, since the average round-trip time found using the Ping program was 8 ms. It can be concluded that, on average, the image has no delay once requested.
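The latency statistics quoted above can be reproduced from the recorded samples with a short calculation; the sketch below uses hypothetical sample values purely for illustration, where each sample is the cRio timestamp received over the network minus the LED timestamp decoded from the corresponding image, in milliseconds.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical latency samples (ms): received timestamp minus image timestamp.
    const std::vector<double> diffsMs = {-8, -23, 4, -12, -1, -15, 6, -9, -27, 3};

    double sum = 0.0;
    for (double d : diffsMs) sum += d;
    const double mean = sum / diffsMs.size();

    double sq = 0.0;
    for (double d : diffsMs) sq += (d - mean) * (d - mean);
    const double stddev = std::sqrt(sq / (diffsMs.size() - 1));  // sample std. dev.

    std::printf("mean latency: %.1f ms, std. dev.: %.1f ms\n", mean, stddev);
    return 0;
}
```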


4 Fast Template Matching

4.1 Overview

In Chapter 4, the proposed method of vision-based localization is discussed in detail.

Each section describes a different step in the algorithm, giving background information as necessary. The following sections also cover any problems encountered, and how the proposed algorithm solves those problems.

The algorithm has a pipeline architecture where each stage accepts data from the previous stage, manipulates the data, and sends the data to the next stage. The first stage of the algorithm is to capture the image from the camera and convert it to grayscale for more efficient processing. The second stage uses standard edge extraction methods to decompose the image into a binary image representing the presence of an edge. The third stage uses the calibration lookup table to efficiently transform the gradient image into a Cartesian-space, plan view. The fourth stage of the algorithm transforms strong edges in the plan view into curves in Hough space using the standard Hough transform. The next stage analyzes the

Hough domain to determine maxima used to transform Hough space into a new parameter space called modulo space, or abbreviated mod space. The next stage in the pipeline extracts offset information from the mod space and translates it into the robot’s coordinate frame, giving an estimate of the robot’s pose. A final stage of the pipeline uses data from the different stages to calculate a variance for the three state variables (x, y, and theta).
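As a structural sketch of the pipeline described above, the fragment below chains one function per stage. The type and function names are hypothetical, not the names used in the thesis code, and the stage bodies are stubs that stand in for the processing covered in the sections noted in the comments.

```cpp
#include <vector>

// Hypothetical data types passed between pipeline stages.
struct GrayImage    { int width = 0, height = 0; std::vector<unsigned char> pixels; };
struct EdgeImage    { int width = 0, height = 0; std::vector<unsigned char> edges; };
struct PlanView     { int cellsX = 0, cellsY = 0; std::vector<unsigned char> cells; };
struct HoughSpace   { int nRho = 0, nTheta = 0; std::vector<float> bins; };
struct ModSpace     { int nRho = 0, nTheta = 0; std::vector<float> bins; };
struct PoseEstimate { double x = 0, y = 0, theta = 0, varX = 0, varY = 0, varTheta = 0; };

// Stage stubs; each stage is described in the section noted to the right.
GrayImage    acquireGrayscaleFrame()        { return {}; }  // Section 4.2
EdgeImage    detectEdges(const GrayImage&)  { return {}; }  // Section 4.3
PlanView     toPlanView(const EdgeImage&)   { return {}; }  // Sections 3.3 and 4.4
HoughSpace   toHoughSpace(const PlanView&)  { return {}; }  // Section 4.4
ModSpace     toModSpace(const HoughSpace&)  { return {}; }  // Sections 4.5 and 4.6
PoseEstimate estimatePose(const ModSpace&)  { return {}; }  // Sections 4.7 and 5

// One frame flows from camera data to a pose estimate with per-state variances.
PoseEstimate processFrame() {
    return estimatePose(
        toModSpace(toHoughSpace(toPlanView(detectEdges(acquireGrayscaleFrame())))));
}
```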


Figure 10: Pipeline architecture diagram.

4.2 Camera Data Acquisition

The first step in the pipeline is to gather data from the camera. The camera used conforms to the IEEE 1394 camera specifications to allow for an open-source camera library to be used. The chosen library was the dc1394 library, available as open-source online. C++ wrappers were then created to encapsulate the library calls in an object to make it easier to use for other parts of the code.

The camera is a color camera and uses the Bayer method of creating color images.

The Bayer method of encoding color data and converting it to the RGB color model is described in Appendix A. Grayscale data is needed for the following pipeline stages. To save time, the Bayer color model is converted directly to grayscale, skipping the RGB color model altogether.
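The thesis does not spell out the exact Bayer-to-grayscale weighting, so the sketch below makes the simplest assumption: it averages each 2x2 mosaic cell (one red, two green, and one blue sample, assuming an RGGB layout) directly into a single grayscale value, which also halves the image resolution.

```cpp
#include <vector>

// Convert a raw Bayer mosaic (assumed RGGB layout) directly to grayscale by
// averaging each 2x2 cell into one value, skipping the RGB color model.
// The output is (width / 2) x (height / 2).
std::vector<unsigned char> bayerToGray(const std::vector<unsigned char>& bayer,
                                       int width, int height) {
    std::vector<unsigned char> gray((width / 2) * (height / 2));
    for (int y = 0; y + 1 < height; y += 2) {
        for (int x = 0; x + 1 < width; x += 2) {
            const int sum = bayer[y * width + x]             // R
                          + bayer[y * width + x + 1]         // G
                          + bayer[(y + 1) * width + x]       // G
                          + bayer[(y + 1) * width + x + 1];  // B
            gray[(y / 2) * (width / 2) + (x / 2)] =
                static_cast<unsigned char>(sum / 4);
        }
    }
    return gray;
}
```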

4.3 Finding Tiles

The first task of edge template matching is finding the edges in the image. Edges of images can be found using one of many basic algorithms including the Sobel operator,


Canny edge detector, Laplace, and weighted local variance-based edge detection, to name a few (Cui, Zou, & Song, 2008) (Law & Chung, 2007). All four methods for edge extraction were tested and evaluated based on distinctness of tile lines, the noise from the tile pattern, and processing time. The following figures show an example output of the four algorithms.

Figure 11: Edge detection of hallway image using the Sobel operator.

Figure 12: Edge detection of hallway image using the Canny algorithm.


Figure 13: Edge detection of hallway image using the Laplace convolution kernel.

Figure 14: Edge detection of hallway image using the weighted local variance method.

The Sobel operator was used in Figure 11 with a threshold to remove low-valued gradients. Edges from the floor tiles are almost all visible using this method. Also, a good amount of the noise from the tile pattern was removed by the thresholding. The Sobel operator is also an efficient algorithm because it uses a single-pass convolution.


Results from the Canny edge detector can be seen in Figure 12. Canny edge detection is generally very good at detecting edges, as can be seen from the example. It is also very good at removing noise from the image because it is more intelligent about weeding out spurious edges. The major downfall of the Canny edge detector is its speed. The algorithm consists of multiple passes and is much slower, with only marginal improvement over the simpler

Sobel method.

The Laplace method produced very good edges for the tiles, but it was also very noisy. This is to be expected, since Laplace is a second-derivative operator, which enhances noise (Cui, Zou, & Song, 2008). The Laplace algorithm itself is very efficient since it only involves a one-pass convolution operation; however, the runtime of the pipeline stages that come after edge detection depends on the number of edges found in an image, so the extra noise causes an overall system inefficiency.

The weighted local variance method performed only slightly worse than the Sobel operator on this test image; however, in other images examined, the variance method performed much worse. Also, the variance method must compute a variance for the neighbors of each pixel in the image, which means that there is substantial floating point arithmetic causing the algorithm to be inefficient.

From this analysis, the Sobel operator was chosen as the edge-detection stage of the pipeline. The rest of this section will describe the Sobel operator. A description of the other derivative operators can be found in Appendix B.

The Sobel operator approximates a two-dimensional gradient using partial derivatives in the x and y directions. Mathematically, a two-dimensional gradient can be expressed as¹:

$$\nabla f = \begin{bmatrix} G_x \\ G_y \end{bmatrix} = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix} \qquad (4.3.1)$$

The magnitude of the gradient can be found by the equation:

$$|\nabla f| = \sqrt{G_x^2 + G_y^2} \qquad (4.3.2)$$

This can be approximated by:

$$|\nabla f| \approx |G_x| + |G_y| \qquad (4.3.3)$$

The Sobel operator can approximate the gradient magnitude by convolving the original image with a mask for each of the x and y directions. The masks in equations 4.3.4 and 4.3.5 approximate the partial derivatives of the image in the x and y directions, respectively. The masks used by the Sobel operator are:

$$G_x:\ \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \qquad (4.3.4)$$

$$G_y:\ \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} \qquad (4.3.5)$$

After the Sobel operator is convolved with the image, an upper and a lower threshold are applied to the result. The lower threshold is intended to remove gradients that are too small to be possible edges. The upper threshold is applied to remove edges that are too strong, because the edges needed by this algorithm are created from very thin lines, resulting in only small to medium gradient magnitudes. The thresholds were calculated through analysis of test data.

¹ Equations 4.3.1 through 4.3.5 follow (Gonzalez & Woods, 2002), pages 134-136.

The direction of the gradient can be found by taking the arctangent of the ratio of the y partial derivative and the x partial derivative. This direction has sometimes been used to reduce edges instead of or in conjunction with a threshold (Lang & Fu, 2000); however, the method produced poor results in the present case when used in conjunction with the rest of the pipeline.
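A minimal sketch of this edge-detection stage under the scheme just described: it applies the two Sobel masks, uses the |Gx| + |Gy| approximation of equation 4.3.3, and keeps only pixels whose gradient magnitude falls between the lower and upper thresholds. The threshold values shown are placeholders, not the values derived from the thesis test data.

```cpp
#include <cstdlib>
#include <vector>

// Binary edge map from a grayscale image using the Sobel operator and a band
// threshold: a pixel is kept only if lowT <= |Gx| + |Gy| <= highT.
std::vector<unsigned char> sobelBandThreshold(const std::vector<unsigned char>& gray,
                                              int width, int height,
                                              int lowT = 30, int highT = 200) {
    std::vector<unsigned char> edges(width * height, 0);
    auto at = [&](int x, int y) { return static_cast<int>(gray[y * width + x]); };

    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            // Convolution with the two Sobel masks (equations 4.3.4 and 4.3.5).
            const int gx = -at(x - 1, y - 1) - 2 * at(x, y - 1) - at(x + 1, y - 1)
                           + at(x - 1, y + 1) + 2 * at(x, y + 1) + at(x + 1, y + 1);
            const int gy = -at(x - 1, y - 1) - 2 * at(x - 1, y) - at(x - 1, y + 1)
                           + at(x + 1, y - 1) + 2 * at(x + 1, y) + at(x + 1, y + 1);
            const int mag = std::abs(gx) + std::abs(gy);   // equation 4.3.3
            if (mag >= lowT && mag <= highT) edges[y * width + x] = 255;
        }
    }
    return edges;
}
```

The band threshold is the only departure from textbook edge detection: rather than keeping everything above a single threshold, overly strong gradients are rejected because the tile edges of interest are thin, low-contrast lines.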

4.4 Plan View to Hough Space Transforms

The next stage in the pipeline is to transform the gradient image from the edge stage into Cartesian space, also called the plan view. This process is completed as described in section 3.3 with no further modifications.

Once the image has been transformed into the plan view, it is then passed to the

Hough space transformation stage of the pipeline. The speed of transforming the points from the plan view is increased by ignoring zero valued pixels. This speedup is one of the reasons why the Laplace transform was not used. Since very few pixels with the Laplace transform would have been zero, transforming the Laplace output to Hough space would have caused a dramatic slow-down in the overall system.

The general Hough transform is used to map points in Cartesian space into a parameter space. In his patent, Hough introduced a method of detecting collinear points in

Cartesian space by transforming them into the parameter space of a line (Bhattacharya,

Rosenfeld, & Weiss, 2001). The parameter space originally introduced consisted of the slope


and intercept of the slope-intercept form of a line. The transformation of a point in

Cartesian space to parameter space was then given by (Bhattacharya, Rosenfeld, & Weiss,

2001):

$$b = y_i - m\,x_i \qquad (4.4.1)$$

Using the slope-intercept representation of a line will cause problems because the slope could approach infinity if the line in Cartesian space is parallel to the y-axis. To solve this problem, a “normal parameterization” was proposed in 1972 that instead used the parametric form of a line (Duda & Hart, 1972):

$$\rho = x \cos\theta + y \sin\theta \qquad (4.4.2)$$

In this parameter space, a point in Cartesian space will transform to a sinusoid. Consider a set of points in Cartesian space that are collinear. The corresponding sinusoids for these points will all intersect at a single Hough-space point $(\rho_0, \theta_0)$. In Hough space, the point $(\rho_0, \theta_0)$ represents the parameters of the line on which all of the original Cartesian-space points lie.
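A minimal sketch of the Hough accumulation stage. The plan view is assumed to be a flat array of binary cells and the accumulator a flat (ρ, θ) grid; these layout choices, the bin sizes, and the function name are illustrative, not the thesis implementation. Zero-valued cells are skipped, which is the speedup mentioned above.

```cpp
#include <cmath>
#include <vector>

// Accumulate a discretized Hough space (equation 4.4.2) from a binary plan view.
// planView holds cellsX * cellsY occupancy values (0 = no edge); cellSizeFt is
// the plan-view cell size in feet. Rho is binned from -rhoMaxFt to +rhoMaxFt in
// steps of rhoStepFt, and theta is binned in 1-degree steps.
std::vector<int> houghTransform(const std::vector<unsigned char>& planView,
                                int cellsX, int cellsY, double cellSizeFt,
                                double rhoMaxFt, double rhoStepFt) {
    const int nTheta = 180;
    const int nRho = static_cast<int>(2.0 * rhoMaxFt / rhoStepFt) + 1;
    const double kPi = 3.14159265358979323846;
    std::vector<int> acc(nRho * nTheta, 0);

    for (int cy = 0; cy < cellsY; ++cy) {
        for (int cx = 0; cx < cellsX; ++cx) {
            if (planView[cy * cellsX + cx] == 0) continue;   // skip empty cells
            const double x = cx * cellSizeFt;                 // feet, robot frame
            const double y = cy * cellSizeFt;
            for (int t = 0; t < nTheta; ++t) {
                const double theta = t * kPi / 180.0;
                const double rho = x * std::cos(theta) + y * std::sin(theta);
                const int r = static_cast<int>((rho + rhoMaxFt) / rhoStepFt);
                if (r >= 0 && r < nRho) ++acc[r * nTheta + t];
            }
        }
    }
    return acc;
}
```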

If a set of points in Cartesian space lies on multiple parallel and perpendicular lines forming a grid, then the transformation to Hough space will show up as periodic peaks. A grid configuration would transform into two columns of peaks separated by 90 degrees. In each column, the peaks will be separated by some distance $T$, where $T$ is equal to the width of a tile.


Figure 15: Hough space transformed from the Sobel output in Figure 11 (ρ from -10 ft to 10 ft, θ from 0˚ to 180˚).

The proposed algorithm makes use of this regular spacing as a template for grid matching. Normally, finding lines in Hough space consists of searching for the peaks in Hough space and returning the maxima as candidates for lines. In the case of matching a grid template to the Hough space, we need to find many lines in Hough space. Further, the lines found need to be constrained such that the peaks are either at some angle $\theta_0$ or at $\theta_0 + 90˚$. Also, each peak needs to be at least one foot offset from its neighboring peaks in the case of one-foot square tiles. Under ideal conditions, finding peaks that satisfy these conditions would be simple. Using actual data, the problem becomes much more difficult because the peaks are not guaranteed to line up due to errors in calibration, noise from pixels other than the grid lines, and quantization of Hough space. Also, not all lines in the grid will


be visible after a threshold is applied to the gradient image. Therefore, an efficient algorithm is needed to simultaneously find multiple peaks in the image and constrain those points such that they form a grid.

4.5 Transforming Hough Space into Modulo Space

A possible method of locating the grid would be to find all of the peaks in Hough space and perform a least squares optimization to find the best heading and x and y offsets.

The problem with this method is the uncertainty in the data. The number of grid lines present in the image is highly dependent on the lighting conditions, how clean the floor is, and the gap between the floor tiles. Consequently, a constant number of peaks cannot be assumed.

Further, in a natural environment there are many edges that could exist that are not part of the grid. In the test environment used for this thesis, the joint between the wall and the floor always shows a strong edge, but the line does not always align with the grid lines.

If a least-squares fit were used in this case, the answer would be biased towards the strong edge of the wall. Also, it is common that tires from cleaning carts leave rubber streaks on the tiles. These streaks will have very strong edges but they most certainly do not line up with the grid and thus also bias the answer.


Figure 16: Example of distracting lines.

One of the major difficulties in finding a solution is the fact that the signal-to-noise ratio is low in the environment chosen to test the algorithm. The tiles have a random speckled pattern in which the speckles have nearly the same gradient intensity as the tile edges themselves. If the floors are not properly cleaned, then the dirt will enhance the noise and not the grid lines, further decreasing the signal-to-noise ratio. Hence, detecting global peaks in Hough space and performing a least-squares optimization will not work. A method is needed that uses the periodicity of the grid lines to enhance the signal and suppress the noise so that the signal-to-noise ratio is high enough to confidently pick the offset.

The solution proposed in this thesis is to perform yet another transformation from

Hough space into mod space. Mod space is a new parameter space for the x and y offsets and the orientation of the gridlines with respect to the robot. The rationale for this transformation is that if the robot’s coordinate frame is at some pose $(x_r, y_r, \theta_r)$ in the global coordinate frame, then the closest grid lines in the positive x and y directions will be at the offsets:

$$\Delta x = T - (x_r \,\%\, T) \qquad (4.5.1)$$

$$\Delta y = T - (y_r \,\%\, T) \qquad (4.5.2)$$

In these equations, $\Delta x$ and $\Delta y$ are the distances from the robot’s position to the next tile edges in the global, positive x and y directions, and $T$ is the width of one tile, which in this thesis is one foot. The percent sign represents a modulo operator and is used to give the remainder of the robot’s position divided by the width of the tile. This mod operator gives the robot’s offset from the closest tile edge in the negative direction, so it is subtracted from the width of a tile to get the offset in the positive direction. This can be seen visually:

Figure 17: Illustration of equations 4.5.1 and 4.5.2.

Other floor tile edges will be offset from the values in equations 4.5.1 and 4.5.2 by integer multiples of the width of the tiles. Therefore, the distance to the nth floor tile edge will be at the x and y offsets determined by the following equations:


$$\Delta x_n = \Delta x + nT \qquad (4.5.3)$$

$$\Delta y_n = \Delta y + nT \qquad (4.5.4)$$

where n can be any integer. One could therefore partition Hough space along ρ into equal-sized intervals of width $T$:

$$P_n = \left\{\, \rho_k = k\,\Delta\rho \;:\; nT \le \rho_k < (n+1)T \,\right\} \qquad (4.5.5)$$

In this equation, $\Delta\rho$ is the height of one cell in a discretized Hough space, so each partition contains $T/\Delta\rho$ rows. One could then guarantee that each partition would have at most two peaks caused by grid lines and, further, that those peaks would be separated by 90˚. We can easily show that two peaks would be separated by 90˚ because any two lines in a grid are either parallel (same θ, different ρ) or perpendicular (θ separated by 90˚). To prove this statement we must prove that no two parallel grid lines can have peaks in the same partition.

Proof:

Assume that there are two grid lines with the same angle θ at offsets $\rho_1$ and $\rho_2$, respectively, and that both lines have peaks in the same partition. Also assume that each partition has a size of $T$.

The distance between the two grid lines is therefore $d = |\rho_1 - \rho_2|$. If the two lines are in the same partition, then that implies that $d < T$: if we assume that the lowest value in the partition is $\rho_{min}$, then the maximum value must be less than $\rho_{min} + T$ from equation 4.5.5, so the maximum distance between any two offsets in the same partition is less than $T$. It follows that $d < T$.

However, this contradicts the rule that the height of each partition is equal to the distance between neighboring parallel grid lines, so two distinct parallel grid lines must be separated by at least $T$. □


Using equations 4.5.3 and 4.5.4 with the definition of a partition in equation 4.5.5, one can show that if a peak is found at an offset $\rho'$ above the lower boundary $nT$ of partition n, then there should also be a peak at the same offset $\rho'$ above the lower boundary of every other partition. We can then define a new parameter-space transformation as:

$$M(\rho', \theta) = \sum_{n} H(nT + \rho',\ \theta), \qquad 0 \le \rho' < T \qquad (4.5.6)$$

where H is Hough space and M is the new mod space. This new parameter space is a representation of the grid orientation parameters (x, y, θ) with respect to the robot’s coordinate frame. This reduces the task of searching for the orientation of the grid to finding two peaks in mod space such that the angle between the peaks is 90˚. Not only is this faster than finding the grid orientation in Hough space, but it is also more robust to noise, because the peaks from the grid lines reinforce each other more than white noise does. A windowed example of this parameter space can be seen in the following figure.


Figure 18: Sample sections of mod space. (a) A window near θ ≈ 180˚ (wrapping around toward 0˚); (b) a window near θ = 90˚. In both, the vertical axis is ρ' from 0.0 ft to 1.0 ft.

In Figure 18a, the largest peak can be found at the point (0.1, 176), and in Figure 18b at the point (0.6, 86). These peaks correspond to the grid’s position and orientation with respect to the robot being (0.1, 0.6, 356). The heading is not as straightforward as the positions and will be discussed more in section 4.7.
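A minimal sketch of the fold from Hough space into mod space (equation 4.5.6), before the normalization introduced in the next section. The flat accumulator layout follows the earlier Hough sketch and is an assumption, as is the requirement that the first accumulator row lies on a partition boundary and that rowsPerTile rows span exactly one tile width T.

```cpp
#include <vector>

// Fold a discretized Hough space into mod space (equation 4.5.6, unnormalized):
// rows that are whole tile-widths apart in rho are summed into a single row,
// so mod space spans only 0 <= rho' < T. rowsPerTile = T / (rho cell height).
std::vector<float> houghToModSpace(const std::vector<int>& hough,
                                   int nRho, int nTheta, int rowsPerTile) {
    std::vector<float> mod(rowsPerTile * nTheta, 0.0f);
    for (int r = 0; r < nRho; ++r) {
        const int rMod = r % rowsPerTile;   // offset within the partition
        for (int t = 0; t < nTheta; ++t) {
            mod[rMod * nTheta + t] += static_cast<float>(hough[r * nTheta + t]);
        }
    }
    return mod;
}
```

As described above, locating the grid then reduces to finding the ρ' values of the strongest peak at some angle θ₀ and of the strongest peak near θ₀ + 90˚.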

4.6 Normalizing the Transform

Oddly enough, the most attractive feature of mod space, namely its compression of spatial information, is also one of its biggest downfalls. When the information has been compressed to be in the range $[0, T)$, it loses all but relative spatial information. There is no way of determining whether a peak in mod space comes from one really strong line in the image or from a summation of weaker lines. This information could be very useful if there are strong

“distracter” lines present in the image. As mentioned in the previous section, these distracter lines are not only likely, but common. In some cases, they also have a much stronger gradient value than the grid lines, as is the case with the rubber strip at the base of the walls in the test environment.


The following figure shows a windowed portion of mod space containing distracter lines. In this image, there are strong peaks at ρ near 0.0 ft, 0.7 ft, and 0.99 ft. The peaks at 0.0 ft and 0.99 ft are two halves of the same peak wrapping around the mod space, but the peak at 0.7 ft is a distracter line.


Figure 19: Mod space with distracter lines.

Some method is needed to encode spatial data into the compressed mod space. A naïve approach might be to normalize each partition before transforming to mod space. This would change equation 4.5.6 to have the form:

4.6.1   $M(\theta, \rho_m) = \sum_{n} \dfrac{H(\theta, \rho_m + n\,w)}{f(n)}$

where $H(\theta, \rho)$ is the intensity at a point in Hough space, $w$ is the width of a floor tile, and f(n) is the normalization function:


4.6.2   $f(n) = \max(I_n)$

where $I_n$ is the set of intensities in the nth partition.

Although this equation now encodes some spatial information in the mod space and solves the problem of strong distracter lines, it introduces yet another problem. Equation 4.6.1 would work if we could assume that every partition contains a grid line. This is not a good assumption because the number of grid lines visible in a scene is variable. Since this thesis focuses on indoor localization, a common environment for the robot will be hallways, where the visible range of grid lines parallel to the hallway is typically less than ±1 meter. This means that at most six lines would be visible. However, in open areas of a building, the camera could see lines at a range of ±1.5 meters, resulting in as many as ten lines. Other conditions also affect the visibility of lines, including dirty floors, poor lighting conditions, and obstacles. If no grid line is present, then the partition contains only noise, and the normalization will increase the influence of that noise by giving it the same weight in mod space as a real peak.

The solution taken by this thesis is to instead use a linear combination of a global normalization and a local normalization. Equation 4.6.2 then becomes:

4.6.3   $\dfrac{1}{f(n)} = \dfrac{1}{2}\left(\dfrac{1}{\max(I)} + \dfrac{1}{\max(I_n)}\right)$

where $\max(I)$ is the global maximum intensity in Hough space. This can be simplified:

4.6.4   $f(n) = \dfrac{2\,\max(I)\,\max(I_n)}{\max(I) + \max(I_n)}$

This normalization has some attractive properties that solve the mentioned problems:


1. The global maximum will be normalized to 1.

2. All local maxima will be normalized to a value ≥ 0.5.

3. If a local maximum has a small value globally it will normalize to a value close to 0.5.

The first property is fundamental to the normalization and is what encodes the partial spatial information. Since every normalized value is between zero and one, a value greater than one in mod space must have come from multiple partitions. This is only partial information, since equation 4.6.1 cannot be inverted to determine exactly which partitions contributed to a point in mod space. However, that information is not necessary; knowing that multiple lines were present is adequate.

The second property allows us to solve the distracter line problem. From this property we can see that the sum of any two local maxima will be greater than or equal to any single normalized point. Therefore, if there is one strong distracter line in the image, it will only take two grid lines to override its effects. Another way to look at this normalization is as a nonlinear function that serves to enhance the additive effects of the periodic signal and suppress noise. The results of using the normalization can be seen in the following figure.


Figure 20: Mod space after normalization.

The third property avoids the problem seen in the naïve normalization method. If a partition only includes noise, then its values in Hough space are likely to be small compared to the global maximum. When the noise is then normalized, it will have a value near 0.5 instead of 1. It is still possible for the problem to occur, since multiple partitions of noise could override the grid lines if few grid lines are visible. A way to deal with that problem would be to change the weight of the global normalization to a generic form:

4.6.5   $\dfrac{1}{f(n)} = \dfrac{1-\alpha}{\max(I)} + \dfrac{\alpha}{\max(I_n)}$

Using equation 4.6.5, one could suppress noise further by decreasing α. Doing so, however, would cause a tradeoff between suppressing distracter lines and suppressing noise.

After examining empirical data, it was found that equation 4.6.4 provided the best balance.

Distracter lines are present in almost every image due to the joint between the wall and the floor, as mentioned earlier. Decreasing α would have an adverse effect on the method’s ability to handle these lines. Also, it was observed that the ratio of partitions with grid-line peaks to partitions with pure noise was high enough that the noise would not cause any problems in detecting the grid lines.
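For concreteness, the blended normalization can be sketched as follows. This is a minimal NumPy sketch assuming each partition’s maximum is taken over all of its θ and ρ cells and that α = 0.5 reproduces the balance described above; the names and array layout are illustrative and not the thesis code.

```python
import numpy as np

def normalized_mod_space(hough, tile_cells, alpha=0.5):
    """Mod-space transform with blended global/local normalization (sketch of 4.6.5).

    Each partition's contribution is scaled by a mix of the global Hough maximum
    and that partition's own maximum before the partitions are summed.
    """
    n_theta, n_rho = hough.shape
    n_parts = n_rho // tile_cells
    parts = hough[:, :n_parts * tile_cells].reshape(n_theta, n_parts, tile_cells)

    global_max = max(parts.max(), 1e-9)                    # max(I)
    local_max = parts.max(axis=(0, 2), keepdims=True)      # max(I_n), one per partition
    local_max = np.where(local_max > 0, local_max, 1.0)    # avoid division by zero

    weights = (1.0 - alpha) / global_max + alpha / local_max
    return (parts * weights).sum(axis=1)                   # shape (n_theta, tile_cells)
```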

4.7 Pose Estimation from Mod-Space for Localization

With the mod-space transformation complete and resistant to distracter lines and noise in the image, the grid position information can be extracted. Extracting the position simply requires finding two peaks in mod space constrained to be separated by 90˚.

The first step in the extraction process is to constrain the points to be separated by 90˚. To do that, mod space is projected onto the θ-versus-intensity plane; the projection is a function of θ whose value is the maximum intensity over all ρ at that θ. This function is then added to itself with a phase shift of 90˚, creating a new combined confidence function.

Practically, the combined confidence function has an intensity that depicts the combined strength of the lines separated by 90˚. This can also be interpreted in terms of how well the edges in the original image support that a given angle is the orientation of the grid. One could then determine the heading of the grid by choosing the θ from the combined function which has the maximum intensity.
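A minimal sketch of this projection-and-shift step, assuming mod space is stored as a (θ, ρ) array whose θ axis covers [0˚, 180˚) so that the 90˚ shift wraps around; the function and parameter names are illustrative.

```python
import numpy as np

def combined_theta_confidence(mod_space, deg_per_cell):
    """Project mod space onto theta and add a 90-degree phase-shifted copy.

    mod_space    : 2-D array, shape (n_theta, n_rho_mod)
    deg_per_cell : angular resolution of the theta axis, in degrees
    returns      : (combined confidence per theta cell, index of the best theta)
    """
    theta_conf = mod_space.max(axis=1)                     # max over rho for each theta
    shift = int(round(90.0 / deg_per_cell))                # 90 degrees expressed in cells
    combined = theta_conf + np.roll(theta_conf, -shift)    # add the 90-degree-shifted copy
    return combined, int(np.argmax(combined))
```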

The process of creating the projection into the θ-confidence function and the 90˚ phase shifted θ-confidence function is depicted in Figure 21. The two upper images represent windows in the mod space. These are projected to the two confidence functions below as indicated by the arrows. These two confidence functions are then added resulting in the combined, θ-confidence function to the right.


Figure 21: Process of projecting mod space and creating a combined confidence function.

When the heading of the grid is chosen, finding the offsets at that heading is straightforward. The offsets are located at the values of ρ with the maximum intensity for the chosen θ.

Figure 22: Depiction of creating the x-confidence function.


Figure 23: Depiction of creating the y-confidence function.

Choosing the offset does not give us a global position or heading. The offsets and angles found determine where the grid lines appear in the robot’s coordinate frame. Some care must be taken to translate these values into a global coordinate frame. When the robot has a heading of 0˚, the x offset can be found in Hough space at 0˚ and the y offset at 90˚.

When the robot has a heading of 90˚, in Hough space the x offset is also at 90˚ and the y offset is at 0˚. In this case, however, the x offset is in the negative x direction. With a heading of 180˚, the x offset can be found in Hough space at 180˚ and the y offset at 90˚. In this case the y offset is in the negative y direction. When the robot has a heading of 270˚, the x offset can be found in Hough space at 90˚ and the y offset can be found at 180˚.

From this analysis it can be seen that the x-direction of the global coordinate frame can be found in Hough space at θ equal to the heading of the robot coerced into the interval [0˚, 180˚). Therefore, the angle in Hough space that corresponds to the x-direction of the grid in the global frame will be used to determine the heading of the robot. Also, since Hough space covers only the interval [0˚, 180˚), the chosen angle can correspond to either the positive or the negative x axis of the grid in the global coordinate space. Consequently, we must choose one of two possible headings based on our expectation of what the robot’s heading should be. This can be seen in the following figure. The top axis depicts the angle of the x-direction grid lines in Hough space. The lower two axes show the two possible interpretations of heading from the chosen angle in Hough space. Both are viable answers, but we can use the expected heading of the robot to decide which heading to choose.

Figure 24: Example of the mod space to global heading transform.

We can now derive an equation for the heading transformation from mod space to the global coordinate frame:

4.7.1   $\hat\theta_r = \begin{cases} \theta_m, & |\theta_m - \theta_r| \le 90^\circ \\ \theta_m + 180^\circ, & \text{otherwise} \end{cases}$

where θr is the robot’s estimated heading, θm is the angle determined from mod space, and $\hat\theta_r$ is the new robot heading estimate, with the angular difference taken in the wrapped sense.
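A small sketch of this two-candidate choice, under the assumption that the candidate closest to the expected heading (in wrapped angular distance) is selected; the function and variable names are illustrative.

```python
def heading_from_mod_space(theta_m_deg, theta_est_deg):
    """Pick the global heading implied by a mod-space angle (sketch of equation 4.7.1).

    theta_m_deg   : grid angle from mod space, in [0, 180)
    theta_est_deg : current heading estimate from the state observer
    Returns whichever of theta_m or theta_m + 180 is closest to the estimate.
    """
    def wrapped_error(candidate):
        # smallest absolute angular difference, accounting for wraparound
        return abs((candidate - theta_est_deg + 180.0) % 360.0 - 180.0)

    return min((theta_m_deg, theta_m_deg + 180.0), key=wrapped_error)
```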

Transforming the x and y offsets from mod space to the global coordinate frame must also resolve one of two possible values, because the offset will be in either the positive or the negative direction from the robot. The equations are the inverses of equations 4.5.1 and 4.5.2:


4.7.2   $\hat{x}_r = n\,w + (w - \rho_x)$ for an offset in the positive x direction, or $\hat{x}_r = n\,w + \rho_x$ for an offset in the negative x direction

4.7.3   $\hat{y}_r = n\,w + (w - \rho_y)$ for an offset in the positive y direction, or $\hat{y}_r = n\,w + \rho_y$ for an offset in the negative y direction

Also, because of aliasing, we have to determine which of the global coordinates consistent with the offset found is closest to our estimated position. Figure 25 shows how the correct global coordinate depends on the estimate of the current position.

Figure 25: Illustration of aliasing.

The point that is closest to the expected value can be assumed to be the correct point because the algorithm assumes that an estimate of the position is available within ±0.5 ft. Since the possible points are separated by one-foot increments, only one point will be within the ±0.5 ft limit. Once aliasing is handled in both the x and y directions, the algorithm can return a new estimate of the robot’s global position and heading. This value then needs to be folded into a physical state observer. Since the Kalman filter is a popular observer, a measure of variance needs to be calculated.
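A minimal sketch of the aliasing step under the stated assumptions (one-foot tile spacing and a prior estimate good to ±0.5 ft); the function name and units are illustrative.

```python
def resolve_alias(offset_ft, estimate_ft, tile_ft=1.0):
    """Choose the tile-aliased coordinate closest to the current estimate.

    offset_ft   : offset within one tile recovered from mod space, in [0, tile_ft)
    estimate_ft : prior estimate of the coordinate (assumed accurate to +/- 0.5 ft)
    Returns the candidate offset + n * tile_ft nearest to the estimate.
    """
    n = round((estimate_ft - offset_ft) / tile_ft)    # nearest whole number of tiles
    return offset_ft + n * tile_ft
```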

4.8 Making Template Matching Fast


In general, two major principles were used to speed up all processing: reduce searches based on expectation and sacrifice memory for speed. Some examples of these principles have already been discussed. Transforming Hough space into mod space took advantage of the expectation that the grid lines will be parallel or perpendicular, and that they are spaced in one foot increments. Using this expectation, Hough space was compressed to mod space which was about 20% of the size.

This same principle was also used to compress mod space further. Using the expectation of the heading, the search in mod space was constrained between ±2˚ of the expectation. This compressed mod space to a little more than 2% of its original size.

Compared to the original Hough space, the compressed mod space is merely 0.1% of the original size. This, of course, resulted in a major improvement in the speed of the algorithm.

The other principle of sacrificing memory for speed was mentioned in section 3.3.

Instead of calculating the transformation from camera space into Cartesian space, the transformation was pre-computed and stored in a lookup table. This allowed for the camera image to be transformed into Cartesian space in less than 10ms.

Transforming from Cartesian space into Hough space is usually a slow operation. It is slow because the transformation includes computing a sine and a cosine along with other floating-point arithmetic to fit the result into a discretized space. The transformation speed could be improved by storing a three-dimensional array of Hough space indices, where the three dimensions would be the Cartesian coordinates x and y and the Hough space coordinate θ. At the present time, however, doing so is infeasible. The Cartesian space chosen in this thesis is a 10 ft. x 9 ft. area discretized into 1/100 ft. cells. The Hough space was discretized to the same spatial resolution as the Cartesian space, and the angles were discretized into 0.1˚ increments. A lookup table for the Hough transform would require 1000x900x1800 integers, resulting in a size in RAM of about 6.48 GB. Although this speedup should be possible in the future, current systems cannot handle the size.

Instead, equation 4.4.2 can be broken up into x*cos(θ) and y*sin(θ), and each part stored independently. With this scheme, the size of the two lookup tables will be 1000*1800 doubles which is about 13.7 megabytes each. This is easily feasible for today’s systems.
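A sketch of the split lookup tables, assuming the resolutions quoted above (1/100 ft cells and 0.1˚ angle steps); the array names are illustrative, and the actual implementation may lay the tables out differently.

```python
import numpy as np

# Precomputed partial products for rho = x*cos(theta) + y*sin(theta).
CELL_FT = 0.01                                                # 1/100 ft cells
THETAS = np.deg2rad(np.arange(0.0, 180.0, 0.1))               # 1800 angle bins
X_COS = np.outer(np.arange(1000) * CELL_FT, np.cos(THETAS))   # x * cos(theta) table
Y_SIN = np.outer(np.arange(900) * CELL_FT, np.sin(THETAS))    # y * sin(theta) table

def rho_for_point(ix, iy):
    """Return rho (in feet) for all 1800 theta bins of one edge pixel (ix, iy)."""
    return X_COS[ix] + Y_SIN[iy]
```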


5 Integration with a Localization System

5.1 Overview

The method of determining position using fast template matching proposed in section 4 requires a physical state estimate accurate to within ±0.5 feet and a heading accurate to within ±2˚. As was mentioned in the introduction, one of the most popular methods of tracking the state of a system is to use a Kalman filter due to its optimal properties. For a measurement to be used in the Kalman filter, it must also provide a variance to describe the assumed Gaussian noise. In the following sections, a method will be introduced to determine a variance so that the position estimate can be included in a Kalman filter implementation of a physical state observer.

5.2 Variance from Helper Functions

An approach to associating a variance with measurements based on vision has been to set a static variance based on test data (Xuedong, Xinhua, Rongjin, & Jiashan, 2006) (Mammarella, Campa, Napolitano, Fravolini, Gu, & Perhinschi, 2008) (Murata & Hirose, 1993). This method is acceptable if the scene is not cluttered with distractions. If the background is cluttered, or if other variability causes poor performance in the algorithm, then the measured property can be completely incorrect. Variance is based on a random variable with some distribution (Ross, 2006). If a measurement is completely incorrect instead of just being affected by random noise, then it is inappropriate to assign a variance to it.


The method proposed to accomplish template matching of tiles on a floor will need to work in a cluttered environment. It has been shown that mortar joints in the walls, baseboards and tire marks all present competing lines that can result in gross template mismatches. Also, the tiles themselves have a speckled pattern and can be covered with dirt causing noise that could be stronger than the tile edges. One can then expect that the position information provided by the algorithm will at times be corrupted with data such that the computed position is not merely in error by some Gaussian-distributed noise, but is instead completely incorrect. A method is therefore needed that can evaluate when a template match is credible or not.

A relevant prior approach to assigning variance to vision-based measurements is described in (Warren, 2009). In this work, vision-based measurements of a road were compared to a reference (known displacement with respect to the road), allowing classification of scenes into various ranges of accuracy. Attributes relevant to the vision analysis (e.g., strength of lines) were computed for each scene, and these attributes were used to self-organize the data into clusters. The vision errors associated with each cluster were evaluated, resulting in a means to infer measurement variance from easily-evaluated attributes of a scene. In the present case of matching a template to scenes containing floor tiles, it was obvious that the results were either accurate to within some small (and constant) variance, or the results were completely untrustworthy. To determine if a given vision-based measurement should be believed or not, a variation on the attribute-based classification method was used. Although other approaches may work, the final solution was chosen for its simplicity and good results.


The modified method involves using two helper functions that measure some aspect of the data used to derive the position estimate. Each of the functions is easily evaluated and has some output that should have bearing on whether a template match should be successful or not. The expectation is that some such helper functions can be used to assess whether a template-matching result is credible. Desirably, evaluation of credibility could be as simple as applying thresholds to the helper-function outputs and applying a logical combination of the binary results.

5.3 Credibility Functions

The credibility functions need to measure some aspect of the estimation process so that the measurement can be labeled as good or bad. The algorithm described in this thesis gives an estimate for x, y, and θ separately. The x and y measurements are independent of each other and can therefore have independent credibility assigned to them. Both x and y do, however, rely on the measurement of θ so their credibility will also depend on the credibility of the heading estimate.

The features chosen to evaluate each of the position variables are taken from mod space, since the estimate is based on the intensities in mod space. As described earlier, template matching is performed by finding the peak values within a Hough-transformed image that is post-processed using a modulo with a base equal to the expected line periodicity. In performing template matching, the corresponding x, y, and θ of the robot’s viewpoint are assumed to be the peak values of respective two-dimensional functions, which are derived from respective slices or projections of mod space. θ is evaluated from the θ-confidence function, which is the projection of mod space onto the θ-versus-intensity plane. The x-confidence and y-confidence functions are created from the intensities for all values of ρ at the θ selected from the θ-confidence function. These three confidence functions are used to evaluate each of the three pose parameters. Since the vision measurement depends on the peaks of these intensity functions, the associated credibility functions, used to evaluate the trustworthiness of the result, should evaluate the shapes of the confidence functions to determine if the peak values are likely to represent the robot’s pose.

The credibility functions should map the 2-D intensity functions onto corresponding scalar values that are indicative of the abilities of the intensity functions to imply trustworthy results (i.e. accurate pose estimation).

The first credibility function used was the RMS error of the confidence functions with respect to an ideal intensity function. The RMS error function, in general, can be used to give a measure of the error in some measured value. For each of the three functions, it is desirable for the confidence function to peak strongly at unique values of the pose parameters, thus strongly suggesting a specific solution. The ideal confidence function would look like a Dirac delta function centered at the correct pose (e.g. a strong peak of intensity at the true heading of the robot). The RMS-credibility function was evaluated using the root mean square of the normalized difference between the ideal confidence function and the actual confidence function. Normalization was performed by dividing measurement differences by the maximum intensity value. Doing so normalized the function so that it was independent of the scale of the intensity value and was instead a relative measure of the intensity. This was very useful since the range of intensities of mod space was variable, depending in particular on the number of edges in a scene. The RMS function used has the form:


5.3.1   $\mathrm{RMS} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(\dfrac{f(\rho_i) - \hat{f}(\rho_i)}{\max_j f(\rho_j)}\right)^2}$

where f is the actual mod-space intensity as a function of the presumed line parameter (ρ or θ), and $\hat{f}$ is the “ideal” intensity function, which peaks at $\rho_{peak}$ and is zero elsewhere. This peak value is assigned based on the actual intensity function value at $\rho_{peak}$. We can get a sense of whether this candidate credibility function should be useful in establishing measurement credibility by evaluating some extremal cases. At one limit, the actual intensity function would be identical to the ideal intensity function, peaking at a single value that indicates the corresponding true line parameter. In this extreme case, the difference at each value of ρ will be 0. This will give the entire function the value:

5.3.2   $\mathrm{RMS} = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N} 0} = 0$

The RMS-credibility function will approach 0 as the actual confidence function approaches the ideal confidence function.

The worst confidence function, with respect to identifying line parameters, would have uniform intensity for all ρ or θ. In this case, the normalized differences for all sample points except one will be unity, and the resulting value of equation 5.3.1 will be $\sqrt{(N-1)/N}$, where N is the number of discretized values in the intensity function. Thus the normalized RMS-credibility function approaches unity as the intensity function becomes unrevealing. From these extremal-case analyses, it is shown that equation 5.3.1 will have a value between zero and one.
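A minimal sketch of the RMS-credibility computation as reconstructed here; treating the ideal function as a single spike at the observed peak is an assumption consistent with the description above, and the function name is illustrative.

```python
import numpy as np

def rms_credibility(confidence):
    """RMS distance of a confidence function from an ideal single-peak function.

    Returns a value in [0, 1]: near 0 for a sharply peaked function,
    near 1 for a flat, uninformative one.
    """
    confidence = np.asarray(confidence, dtype=float)
    peak = max(confidence.max(), 1e-12)                # guard against an all-zero input
    ideal = np.zeros_like(confidence)
    ideal[np.argmax(confidence)] = peak                # ideal: single spike at the peak
    return float(np.sqrt(np.mean(((confidence - ideal) / peak) ** 2)))
```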

The first credibility function works well for determining how well defined a peak value is within a confidence function relative to the background. However, this may not indicate the uniqueness of a peak. For example, there might be two strong peaks in a confidence function, and the presumed optimal solution might be very wrong for such a bimodal function, yet the RMS-credibility function would hardly be affected by the second peak, giving the illusion of a good estimate. Another credibility function was needed to determine if there was evidence that the parameter identification might be flawed by choosing the wrong peak value. The second credibility function chosen was the mean squared error function, or the MSE-credibility function. The MSE is the sum of the variance and the squared bias of an estimator T of a random variable (Lefebvre, 2005):

5.3.3   $\mathrm{MSE}(T) = \mathrm{Var}(T) + \mathrm{Bias}(T)^2$

This was used to determine the spread of the possible peaks in the confidence functions.

The function used to evaluate the MSE was:

5.3.4   $\mathrm{MSE} = \dfrac{\sum_{i=1}^{N} f(\rho_i)\,(\rho_i - \rho_{peak})^2}{\sum_{i=1}^{N} f(\rho_i)}$

The same analysis that was used for the previous credibility function was also used to determine the extreme values. In the perfect case, the confidence function would have a form similar to a Dirac delta function centered at the peak value, $\rho_{peak}$. In this case, the numerator of equation 5.3.4 would evaluate to zero for all non-peak values, since the intensity would be zero there. At the peak value, the numerator also evaluates to zero; although the intensity will be non-zero, the difference in ρ will be zero.

Equation 5.3.4 will therefore evaluate to zero in the ideal case.

Another useful case to look at is two strong peaks separated by some distance. In this case equation 5.3.4 becomes:


5.3.5   $\mathrm{MSE} = \dfrac{f(\rho_2)\,(\rho_2 - \rho_1)^2}{f(\rho_1) + f(\rho_2)}$, where $\rho_1$ is the chosen peak and $\rho_2$ the competing peak

When both peaks have similar intensity levels, then equation 5.3.5 evaluates to roughly half of the peak separation squared. Thus if the two peaks are very close to each other, then the chosen peak is still credibly the true peak, within a small error. However, as the two peaks separate, the consequences of choosing the wrong peak become more severe, and this is indicated by the MSE function.

The worst-case confidence function would have the same intensity for every value of ρ. In this case, the function reduces to the average of the squared distances from the chosen peak. Then equation 5.3.4 would reduce to:

5.3.6   $\mathrm{MSE} = \dfrac{1}{N}\sum_{i=1}^{N}(\rho_i - \rho_{peak})^2$

Equation 5.3.6 will be a high value in the worst case.
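A minimal sketch of the MSE-credibility computation as reconstructed above, treating the confidence function as a weight over candidate offsets; the names and the cell-size parameter are illustrative.

```python
import numpy as np

def mse_credibility(confidence, cell_size):
    """Intensity-weighted mean squared distance from the chosen peak (sketch of 5.3.4).

    confidence : 1-D confidence function (intensity per rho or theta cell)
    cell_size  : physical size of one cell (e.g. feet or degrees per cell)
    Small values mean the intensity mass is concentrated at the chosen peak.
    """
    confidence = np.asarray(confidence, dtype=float)
    positions = np.arange(confidence.size) * cell_size
    peak_pos = positions[np.argmax(confidence)]
    weights = confidence / max(confidence.sum(), 1e-12)
    return float(np.sum(weights * (positions - peak_pos) ** 2))
```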

Given the candidate confidence functions, it remains to interpret their values to decide when a template fit should be trusted. A simple approach to assigning trust is to assign thresholds for the two credibility functions, and to require that both tests be passed before a template fit is trusted. Other possible approaches exist, but thresholding is simple and produces good results.

5.4 Setting Credibility Thresholds

To evaluate whether to accept a template fit, one can assign thresholds on the confidence functions. This section concerns how to identify appropriate thresholds.


To assess thresholds empirically, data had to be collected in which the actual position of the robot was known within a small tolerance. Ideally, one would drive the robot through a desired environment and record raw video along with localization results based on vision and an independent precision reference. All instances of excessive discrepancies between the true pose and the estimated pose would be analyzed for clues from the credibility functions suggesting that the template fits in these cases should not be trusted. Unfortunately, a precision reference for the “true” pose was not available, so an alternative test was devised.

The test consisted of precisely centering the robot on a floor tile and fixing the robot’s position. The environment in front of the robot was changed in a way that would present the algorithm with a variety of scenarios, including nearly perfect data, data that might bias the system to choose an erroneous, competing template fit, or extremely noisy data. The first set of data collected was the control data. Position estimates were taken of a clean floor with no visible distracter lines. This data provided a baseline variance against which to benchmark the corrupted data. To simulate biased data (strongly competing incorrect template fits), four long strips of wood were placed parallel on the floor, offset from the floor tiles. Position estimates were then recorded for a period of time with the wood placed in both the vertical and horizontal directions. To simulate noisy position estimates, a blue tarp was laid down in front of the robot to completely cover the grid lines.

As data was recorded, the tarp was shaken so that there was considerable, essentially random variation in the edges.

Figure 26 shows all of the nearly 3000 position estimates from the test data. The area shown on the plot is the same as the area of one floor tile. The robot was placed such that points near (0.5, 0.5) are correct. Although it looks as if there were significantly more noisy points in the data set, this is not the case, because many of the correct points overlap and cannot be seen in the plot.


Figure 26: Scatter plot of data collected to test outlier rejection.

To discover proper threshold values, both credibility functions were evaluated and associated with the corresponding position estimate. Histograms of the credibility-function outputs were created to see if there was a clear distinction between good helper function values and bad. Generally the histograms show a good separation in their values. The two functions were ultimately chosen over other functions because their histograms had good and bad values mostly separable by a single threshold value.


In the following figures, the red vertical line indicates the threshold chosen. When the separation was ambiguous because there were not two clear distributions, the histogram of the function output was plotted for the control group and the noise group separately to be sure where the separation occurred. Some histograms had slightly overlapping partitions, and in these cases the threshold was chosen to minimize the number of bad values that would be labeled as good values.


Figure 27: Histogram of RMS error, helper function in x-direction.


Figure 28: Histogram of MSE, helper function in x-direction.


Figure 29: Histogram of RMS error, helper function in y-direction.


Figure 30: Histogram of MSE, helper function in y-direction.


Figure 31: Histogram of RMS error, helper function for theta.


Figure 32: Histogram of MSE, helper function for theta.

The exact thresholds can be seen in the following table:

        RMS Threshold    MSE Threshold
X       0.5              875
Y       0.5              875
θ       0.85             900

Table 1: Thresholds for the helper functions
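A sketch of how these thresholds might be applied, assuming that values at or below a threshold indicate a credible estimate (low RMS and MSE correspond to sharp, unambiguous peaks); the helper name and the example numbers are illustrative.

```python
def measurement_is_credible(rms, mse, rms_threshold, mse_threshold):
    """Accept a pose component only if both credibility tests pass."""
    return rms <= rms_threshold and mse <= mse_threshold

# Example with the x/y thresholds from Table 1 (the sample RMS/MSE values are made up):
x_ok = measurement_is_credible(rms=0.46, mse=420.0, rms_threshold=0.5, mse_threshold=875.0)
```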

These thresholds can then be applied to the data in Figure 26 to analyze how well they work. In the following images, the points have been separated using the thresholds: green points indicate good position estimates and red points indicate bad position estimates (as determined by the thresholds).


Figure 33: Plot of x component of position estimates separated by the chosen thresholds.


Figure 34: Scatter plot of y values separated using the selected thresholds.


Figure 35: Scatter plot of theta values separated using the selected thresholds.

The results show that the credibility functions did a good job of recognizing when to trust the template-fit results. The most important goal mentioned was that no bad data points be mislabeled as good, and the figures show that this goal was met. In Figure 35, a significant number of theta values that are in the correct range are marked as bad points. At first this looks like an error in the thresholds, but examining the functions used to calculate the error shows why they are labeled as bad points. The confidence functions for these data points have very little differentiation between the selected data points and other data points. In this case, the thresholds have correctly determined that these headings had been chosen using inconclusive data. A comparison of one of these parameter functions with the control group’s parameter function can be seen in the following image.


Figure 36: Comparison between the control heading parameter function and a bad heading parameter function.

In this image, the peak value for each of the functions was at 0˚. However, the data marked red is almost flat compared to the control group data marked green. This means the thresholds properly flagged that the heading identification was not trustworthy.

The final step was to compute a variance that could be assigned to the good measurements. The variances for x, y, and θ were computed using the control data and were 0.00001 m², 0.00000225 m², and 0.0009 degrees², respectively. Since these variances are computed from mostly ideal data, they are lower than might be expected from data in a real scenario, but as will be shown in Chapter 6, these variances produce good results.
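As a sketch of how these variances could be packaged for a Kalman filter update, one might build a diagonal measurement covariance and inflate any component whose credibility tests failed; the matrix layout and the inflation value are assumptions, not the thesis implementation.

```python
import numpy as np

# Control-data variances quoted above (x, y in m^2; heading in degrees^2).
VAR_X, VAR_Y, VAR_THETA = 1e-5, 2.25e-6, 9e-4

def measurement_covariance(x_ok, y_ok, theta_ok, big=1e6):
    """Diagonal R matrix; components that failed the credibility tests get a huge variance."""
    return np.diag([VAR_X if x_ok else big,
                    VAR_Y if y_ok else big,
                    VAR_THETA if theta_ok else big])
```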


6 Results

In this chapter the results of the overall system are given and are compared to other localization methods. It will be shown that the proposed method did meet the design goals from Chapter 2. The first section will describe the general results of the overall system by comparing the system output to other modalities. The sections following analyze how well the system achieved more specific goals including a reasonable runtime, robustness with respect to noise, good performance with distracter lines present, and integration with a localization system.

6.1 System Results

To test the system, data was collected on the 7th floor of the Glennan building at Case Western Reserve University. This floor was initially chosen because it was relatively clutter free, and the floors were mostly clean. Also, the 7th floor has less traffic than lower floors in the building making collecting data easier.

Glennan’s layout also made it an attractive choice. The upper floors consist of one hallway that makes a complete, rectangular loop. This allowed data to be collected for which the path formed a closed contour. By using a closed path, it was easy to visually see if the algorithm was working by measuring the robot’s estimated pose at the start of the path vs. the estimated pose at the end of the path. Since the true distance from the starting point to the ending point was zero, the distance between the robot’s state estimates at start and end constitutes an error measure. This alone is not a perfect measure of success, since there are many paths, correct and incorrect, that could be taken to give the same ending position. Also, since Glennan’s hallways form a rectangular loop, it is easy to check whether the estimated path is rectangular as well.

The results of one test are given in Figure 37. The black lines indicate a floor plan of the walls for the 7th floor. The green line indicates a best estimate of the true path taken, except at the corners. Lacking an accurate measure of the true path while turning corners, right angles are depicted instead. The actual path estimate follows from the experimental approach, in which the robot was pushed along the hallways with a best attempt to keep the robot centered between the walls. The blue line indicates the estimated states of the robot during this excursion, as computed using wheel encoders to compute the incremental change of the robot’s pose between each iteration of the algorithm. This new pose estimate was then used as the base input for the proposed vision localization system which then computed a better estimate of the pose.

The pose estimates plotted in the figure do not reflect the system integrated with the Kalman filter but instead reflect the raw output of the proposed vision-based-localization method. The purpose of this figure is to illustrate the accuracy of the algorithm by itself.


Figure 37: Test loop around Glennan 7th floor.

From Figure 37, it can be seen that the system performed well. In most cases, the estimated path of the robot is indistinguishable from the (presumed) actual path taken, at least at the scale shown. Figure 38 shows zoomed views of Figure 37 to show the scale of the error.


Figure 38: Expanded views of Figure 37. (a) View of error in the x-direction. (b) View of a corner. (c) View of error in the y-direction. (d) View of the start and end points.

These results are remarkable considering that previous work on this robot using a Kalman filter was only able to achieve ±0.5 m and ±1˚ (Harper & Nemeth, 2009).

The results can be further compared to other modalities. Figure 39 shows the result of different localization techniques overlaid including the proposed vision localization technique.


Figure 39: Multiple localization-technique results overlaid to compare performance.

In Figure 39, one of the localization techniques is open-loop kinematics based on wheel encoders, the results of which are shown in green. The encoders tend to drift away from the correct estimate of the heading due to the accumulation of small errors. The results of LIDAR localization are plotted in red. The LIDAR localization algorithm is currently being developed by another student and uses wheel encoders to provide a position estimate to match the LIDAR scan to a floor plan of the walls. It performs well because it uses a global localization technique that avoids accumulating drift. However, in the first corner it has a large error in its position estimate, whereas the proposed algorithm is smoother through the turns, with only small random displacements in the position.

6.2 Algorithm Runtime

Generally, when working with vision data there is a large amount of information to process. In the case of vision-based localization, the algorithm should ideally process the information fast enough that the position information it provides to the state estimator is still current relative to the rate of the state estimator. It was therefore useful to examine the runtime of the algorithm. The following diagram shows the runtime of the algorithm on a server-style computer with dual 2.33 GHz quad-core processors, 6 GB of RAM, running the Ubuntu Server Edition operating system.


Figure 40: Timing diagram of algorithm runtime.


For comparison results, the following table compares the runtime of the algorithm on the server versus a typical laptop computer.

Computer Specs                                                        Mean (s)    Std. Dev. (s)

Intel Xeon 2.33 GHz dual quad-core processor, 2 x 4 MB L2 cache,
1333 MHz front-side bus, 6 GB DDR2 RAM, 65 nm process                 0.036       0.0094
technology, Ubuntu Server Edition

AMD Turion 1.81 GHz dual-core mobile processor, 1 MB L2 cache,
667 MHz front-side bus, 1 GB DDR2 RAM, 90 nm process                  0.218       0.056
technology, Ubuntu

Table 2: The algorithm’s runtime on a server and a typical laptop.

This data shows that the algorithm is capable of running near real time on the server.

It could, on average, run at 20 Hz, which was the same rate at which the physical state observer ran. The reason for the roughly six-times speedup of the server over the laptop is most likely the superior hardware and software configuration. The server had a faster processor with an eight-times-larger L2 cache, and it was made using 65 nm technology, allowing more ALUs per core. The server also had more RAM, which allowed the memory-intensive algorithm to run without paging, and its front-side bus was twice as fast as the laptop’s. Also, the operating system installed on the server did not have a graphical environment; it was completely command-line based and did not display windows or a desktop, thus requiring fewer resources.

6.3 Robustness with Respect to Noise

One of the goals of this thesis was to make the algorithm resistant to noise. The noise could be caused by dirt on the floor, the speckled design of the tiles, or some other factor. The system was designed with this goal in mind. As seen in section 4.5, the mod-space transform was created so the lines belonging to the grid would reinforce each other and suppress the influence of noise. Figures 41 and 42 show an example snapshot affected by noise. The images have been transformed into the calibrated view.

Figure 41: Plan view example of noisy image.


Figure 42: Plan view example of noisy image with overlay lines depicting algorithm output.

In this example, the floor was covered with a significant amount of dirt making many of the grid lines indistinguishable. Also, the image was taken while turning through a corner.

In this particular case, part of the wall is visible to the camera. The wall is heavily textured and includes very strong edges that interfere with the template match.

The output of the algorithm is depicted in Figure 42 with the blue lines indicating the grid lines in the x-direction, and the red lines indicating the y-direction. In this example, the algorithm seems unaffected by the presence of noise. If there were an error in the output, the superimposed red and blue lines would not align with the floor-tile edges. In this example, however, the overlaid lines completely cover the tile edges.

6.4 Resistance to Distracter Lines

Another goal of the proposed algorithm was for the algorithm not to be distracted by lines that are not part of the grid. Rejecting distracter lines is a difficult problem because very strong lines could be close to lining up with the grid but offset enough to cause an error in the estimated position. The distracter lines could, in the worst case, cause the template to snap into a dramatically wrong result, which violates the Kalman filter’s assumption that the data will be zero mean with Gaussian-distributed noise. The normalization proposed in section 4.6 was designed so that mod space would encode spatial information and reject lines that do not align with the grid.

The following images show an example of an extreme case of line distraction. The images contain two parallel lines taped on the floor. The tape is centered on two of the grid lines to remove the support of two of these lines while adding four additional distracter edges that are offset from the original grid by ±0.5 inches. The base of the walls was also offset from the grid, causing two more distracting edges.

Figure 43: Plan view example of environment with distracter lines.


Figure 44: Plan view example of distracter lines with algorithm output overlaid.

The example in Figure 44 illustrates the ability of the template-matching algorithm to overcome significant distractions from unrelated lines. The parts of the image that were most distracting to the algorithm were the taped lines close to the robot (bottom of the image). The taped lines were placed directly on the grid lines, and the algorithm was able to distinguish the difference between the taped lines and the grid lines. This test case shows the algorithm’s robustness to even extreme cases of distracter lines.

6.5 Integration with Localization System

The final goal of this thesis was to provide a method of integrating the data with a Kalman filter. In section 5.4 a method was proposed to determine a corresponding variance for each of the three state variables x, y, and θ. The variance chosen is based on credibility functions, which use the x, y, and θ confidence functions to determine if the measurement is credible. The credibility functions were then applied to real data; the results are displayed in Figures 45, 46, and 47, which show the result of the credibility functions for x, y, and θ, respectively.


In these figures, the green points have been assigned a low variance and the red points have been assigned a high variance.


Figure 45: Vision output indicating the x values that have a low variance (green) and a high variance (red).


Figure 46: Vision output indicating the y values that have a low variance (green) and a high variance (red).


Figure 47: Vision output indicating the θ values which have a low variance (green) and a high variance (red).


As can be seen in Figure 45, the x measurements tend to have a low variance when the robot is moving in the y-direction but a high variance when moving in the x-direction. Figure 46 shows that the opposite is true for the y-direction measurements. One possible explanation is that when the robot is moving, the image has some motion blurring, effectively smoothing the image in the direction parallel to the motion of the robot. When the robot is moving in the y-direction, the lines that provide information for the y displacement are perpendicular to the robot’s motion, so the blurring will reduce their strength, causing the algorithm to be more uncertain about the correct offset. The lines that provide x-displacement information are parallel to the motion of the robot, so blurring has little to no effect on them, which leads to higher confidence in the output measurements.

Figure 47 shows that the heading variance is not affected by the motion blurring. This was expected, since information from both the x and y directions is used in determining the heading. Therefore, in either case, the direction with the lower variance provided good information for determining the heading, which led to a low heading variance.

These results can then be used to provide state updates to a Kalman filter. Figure 48 shows the results of doing so. In this figure, wheel encoders were used as the model for the Kalman equations, and the vision-based localization data was the only measurement that provided updates to the observer.


Figure 48: Results of providing a Kalman filter with vision-based localization updates.

The data plotted in Figure 48 shows that vision data could be successfully integrated with a Kalman filter using the proposed credibility functions to provide variance information. Without the aid of the vision-based position information, the encoders would have accumulated error and drifted from the correct position, as was shown in Figure 39. The combined position estimate was mostly straight when traveling down the hallways and smooth through the turns. This is consistent with the expected path, since the robot’s actual position was not discontinuous. Some sections in Figure 48 do have small jumps in the position, which were caused by the vision data. These jumps are typically less than 4 inches and do not bias the data.


7 Conclusions

This thesis presents a unique method of vision based localization using fast template matching. It specifically exploits matching edges in a scene to a model. In the particular case analyzed, edges were conveniently available in a periodic pattern created by floor-tile edges. The periodicity of the tile edges (both rows and columns) permitted a significant simplification of both the model and template matching.

The present approach to localization is not readily classified as either a differential or a global technique. Within the periodicity of the model, the method is absolute. Each template fit is independent of prior fits, and thus errors do not accumulate (in contrast, e.g., to encoders, inertial sensors, or laser-based odometry). On the other hand, use of a periodic template results in identification of position with an uncertainty of any number of multiples of the template period. Thus, the method cannot be considered absolute (or “global”). An estimate of pose is required to apply the template-fit fine adjustment to the appropriate period. The method may thus be compared to the use of laser interferometry or incremental encoders—means that do not provide absolute measurements but which do not accumulate errors.

The first step of the process presented here was to find the edges in a scene using a standard Sobel operator with a threshold (which was selected manually by examining sample data sets). The next step was to transform the edge map into an equivalent Cartesian-space top-down view; a lookup table was used to make this transformation fast. After this step, the Cartesian-space points were transformed into Hough space so that the grid lines could be detected. To reduce the effect of noise and distracter lines, Hough space was normalized before being transformed again. The cells in Hough space were then transformed into a new parameter space called modulo space. The mod space allowed the grid parameters to be efficiently and effectively extracted from the edge data. Since vision data could easily be confused, credibility functions were used to determine if a given position estimate could be trusted. If the estimate could be trusted, then a known, small, constant variance was associated with the identified position, which enabled fusing this sensory data with other sensory results.

The goal of this thesis was to create a vision-based localization system that would be robust with respect to noise, clutter, and line distractions, and to augment the results with an appropriate error variance estimate. The algorithm was compared to other localization techniques, and the results show that the proposed technique does not accumulate error as wheel encoders do and does not have large discontinuities in its position estimate as the LIDAR localization algorithm did. The data presented further shows that when the system can be trusted, it is accurate to within a centimeter, even in the presence of substantial noise and distracter lines. Results also show that the proposed method can detect when its pose estimates cannot be trusted and can assign the measurement a variance accordingly, allowing it to be integrated with a Kalman filter based localization system. Analyses of the proposed method’s runtime on two different styles of computer show that the algorithm is capable of running in real time, as defined by the rate of the localization system used to integrate the vision data with other sensors.


8 Future Work

Using a camera mounted on a robot to detect floor tiles for accurate localization has proved to be very useful and robust. As with any research, there is always room for improvement.

The low-level processing in particular could be improved by using a smarter method of extracting the tile lines. The current method opted to use a simple Sobel operator with a set threshold because the performance was good in most situations, and the higher-level processing was able to accommodate many distracting lines and noise. However, the processing time grows with the number of edges in the image.

Most of the distracter lines come from door frames and the rubber base on the walls.

These edges could be removed if the robot could detect that they are not points on the ground. A method of doing so might utilize LIDAR data to disregard any pixels beyond the perimeter detected by the LIDAR.

Another method that could be used to eliminate erroneous edges would be to use the estimated heading to reduce the edges caused by noise. An extension of the current algorithm might only consider edges that have a gradient direction either near the heading or near perpendicular to the heading. This would reduce the noise caused by dirt and the speckled pattern on the tiles.

Another area that could be improved, as with most algorithms, is processing time. Although the algorithm was developed with speed in mind, and a great deal of effort went into improving its speed, the runtime could still be reduced. One major improvement could come from creating a complete lookup table for the Hough transformation. As was discussed in section 4.8, current systems cannot handle the size requirement of this lookup table, but since memory sizes have increased in the past, they will certainly increase in the future, eventually making the table possible.

A final very difficult improvement could use expectation of position and heading to improve grid parameter detection in mod space. This might be accomplished using a tracking algorithm or something similar to the CONDENSATION algorithm. Alternatively, the values near the expected position might be weighted in some fashion to give them more support.


Appendix A

Color cameras can encode data using grayscale camera techniques. If a grayscale CCD chip is used, it can be covered with color filters in a specific pattern such that each cell in the CCD will encode data for one color component (Bräunl, 2006). A popular pattern for the color filters is the Bayer format (Bräunl, 2006). The Bayer format used by the chosen camera is RGGB. The following figure illustrates the Bayer color format.

Figure 49: Bayer color format from (Furtner, 2009).

The simplest method of calculating a standard RGB color image from the Bayer format is the following:

For each 2x2 RGGB block, with the red sample at cell (00), the green samples at cells (01) and (10), and the blue sample at cell (11), each output pixel takes

R = I(00),   G = (I(01) + I(10)) / 2,   B = I(11)
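A toy sketch of this per-block scheme; note that it halves the image resolution, which may differ from the conversion actually used with the camera, and the function name is illustrative.

```python
import numpy as np

def demosaic_rggb(bayer):
    """Very simple RGGB demosaic: each 2x2 block becomes one RGB pixel.

    bayer : 2-D array with even dimensions containing raw sensor intensities
    returns an array of shape (H//2, W//2, 3)
    """
    r = bayer[0::2, 0::2]                                  # red cell (0,0) of each block
    g = (bayer[0::2, 1::2] + bayer[1::2, 0::2]) / 2.0      # the two green cells, averaged
    b = bayer[1::2, 1::2]                                  # blue cell (1,1) of each block
    return np.dstack([r, g, b])
```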


Appendix B

This appendix will serve to give a brief overview of the edge detection algorithms tested in section 4.3 but not used.

B.1 Canny Edge Detector

The Canny edge detector was created in 1986 as an optimal edge detector (Middleton & Sivaswamy, 2005). The first step of the Canny algorithm is to smooth the image with a Gaussian filter, followed by a directional derivative (Middleton & Sivaswamy, 2005). The gradient direction can be determined from the derivatives in the x and y directions (Canny, 1986). The algorithm then uses hysteresis, such that a pixel above an upper threshold is definitely an edge point, a pixel below a lower threshold is definitely not an edge candidate, and a pixel between the thresholds may be an edge, depending on its neighbors (Middleton & Sivaswamy, 2005).
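For reference, a hysteresis-thresholded Canny edge map can be produced with OpenCV's built-in detector; the file name and threshold values below are illustrative, not values taken from the thesis.

```python
import cv2

# Smooth first, then let cv2.Canny compute gradients and apply hysteresis thresholding.
gray = cv2.imread("hallway.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
edges = cv2.Canny(blurred, 50, 150)   # lower and upper hysteresis thresholds
```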

B.2 Laplace Edge Detector

The Laplace edge detector uses the discrete form of the Laplace operator (Bradski & Kaehler, 2008):

B.2.1   $\nabla^2 I(x, y) \approx I(x+1, y) + I(x-1, y) + I(x, y+1) + I(x, y-1) - 4\,I(x, y)$

Zero crossings of equation B.2.1 are then marked as edges in an output image.
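A small sketch applying this kernel with SciPy; the kernel matches equation B.2.1 as reconstructed above, and the border mode is an arbitrary choice.

```python
import numpy as np
from scipy.ndimage import convolve

# Discrete Laplacian kernel corresponding to equation B.2.1.
LAPLACE_KERNEL = np.array([[0,  1, 0],
                           [1, -4, 1],
                           [0,  1, 0]], dtype=float)

def laplace_response(image):
    """Convolve a grayscale image with the discrete Laplacian kernel."""
    return convolve(np.asarray(image, dtype=float), LAPLACE_KERNEL, mode="nearest")
```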

B.3 Weighted Local Variance

The weighted local variance is a technique that has been used in some medical image processing (Law & Chung, 2007). The method attempted in this thesis was a simplified version of (Law & Chung, 2006). The method applied a threshold to the following equation:

B.2.2   $\mathrm{WLV}(x) = \dfrac{\sum_{y \in N(x)} w(y)\,\big(I(y) - \bar{I}_w(x)\big)^2}{\sum_{y \in N(x)} w(y)}$

where $N(x)$ is a neighborhood of pixel x, w is the weighting function, and $\bar{I}_w(x)$ is the weighted mean intensity of that neighborhood.


Bibliography

1. Antsaklis, P. J., & Michel, A. N. (2007). A Linear Systems Primer. Boston, MA: Birkhäuser Boston.

2. Ayala, V., Hayet, J., Lerasle, F., & Devy, M. (2000). Visual Localization of a Mobile Robot in Indoor Environments using Planar Landmarks. Proceedings of the 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 275-280). IEEE.

3. Bhattacharya, P., Rosenfeld, A., & Weiss, I. (2001). Point-to-Line Mappings and Hough Transforms. In Digital and Image Geometry: Advanced Lectures (pp. 187-208). Heidelberg: Springer Berlin Heidelberg.

4. Borenstein, J., Everett, H., & Feng, L. (1994). "Where am I?" Sensors and Methods for Mobile Robot Positioning. The University of Michigan.

5. Bradski, G. R., & Kaehler, A. (2008). Learning OpenCV (1st Edition ed.). Sebastopol, CA: O'Reilly Media.

6. Bräunl, T. (2006). Embedded Robotics: Mobile Robot Design and Applications with Embedded Systems. Heidelberg, Berlin: Springer Berlin Heidelberg.

7. Buzug, T. (2008). Computed Tomography: From Photon Statistics to Modern Cone-Beam CT. Berlin, Heidelberg: Springer Berlin Heidelberg.

8. Canny, J. (1986). A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 679-698.


9. Cheng, Y., Maimone, M., & Matthies, L. (2005). Visual Odometry on the Mars Exploration Rovers. Proceedings of the 2005 Conference on Systems, Man and Cybernetics (pp. 54-62). IEEE.

10. Cord, T., & Lazic, D. (1997). Error Control Navigation Codes for Autonomous Mobile Robots. 1997 IEEE International Conference on Intelligent Engineering Systems (pp. 141-146). IEEE.

11. Cui, F.-y., Zou, L.-j., & Song, B. (2008). Edge Feature Extraction Based on Digital Image Processing Techniques. Proceedings of the IEEE International Conference on Automation and Logistics (pp. 2320-2324). Qingdao, China: IEEE.

12. Dao, N., You, B.-J., & Oh, S.-R. (2005). Visual Navigation for Indoor Mobile Robots Using a Single Camera. Intelligent Robots and Systems, 1992-1997.

13. Davison, A. (2003). Real-Time Simultaneous Localisation and Mapping with a Single Camera. Proceedings of the Ninth IEEE International Conference on Computer Vision (pp. 1403-1410). IEEE.

14. Dellaert, F., Burgard, W., Fox, D., & Thrun, S. (1999). Using the CONDENSATION algorithm for robust, vision-based mobile robot localization. Computer Vision and Pattern Recognition, 23-35.

15. Duda, R. O., & Hart, P. E. (1972). Use of the Hough transformation to detect lines and curves in pictures. In Communications of the ACM (pp. 11-15). New York: ACM.


16. Eade, E., & Drummond, T. (2006). Scalable Monocular SLAM. Proceedings of the 2006

IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 469-

476). IEEE.

17. Everett, H., Gage, D., Gilbreath, G., Laird, R., & Smurlo, R. (1994). Real-world

issues in warehouse navigation. Proceedings of the SPIE Conference on Mobile Robots IX,

2352.

18. Firefly MV. (2008). Retrieved March 2, 2009, from Research, Point Gray:

http://www.ptgrey.com/products/fireflymv/index.asp

19. Fox, D., Burgaurd, W., & Thrun, S. (1999). Markov Localization for Reliable Robot

Navigation and People Detection. Journal of Artificial Intelligence Research , 391-427.

20. Franklin, G., Powell, J., & Workman, M. (1998). Digital Control of Dynamic Systems (3rd

Edition ed.). Menlo Park, CA: Addison Wesley Longman, Inc.

21. Furtner, U. (2009). Color processing with Bayer Mosaic sensors. Retrieved March 9, 2009, from MATRIX VISION - industrial image processing articles: http://www.matrix-vision.com/info/articles/pdf/art_bayermosaic_en.pdf

22. Gonzalez, R. C., & Woods, R. E. (2002). Digital Image Processing (Second Edition ed.). Upper Saddle River, New Jersey: Prentice-Hall, Inc.

23. Hardt, H.-J., Wolf, D., & Husson, R. (1996). The Dead Reckoning Localization System of the Wheeled Mobile Robot ROMANE. Multisensor Fusion and Integration for Intelligent Systems, 603-610.

24. Harper, J., & Nemeth, C. (2009). Mobile Robot with Physical State Observer. Senior Project Final Report.

25. Hartelius, K., & Carstensen, J. M. (2003). Bayesian Grid Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 162-173.

26. Hellström, T., Ringdahl, O., & Siddiqui, A. (2006). Path Tracking and Localization in Forest Environment. Umea, Sweden: Umea University.

27. Hendricks, E., Jannerup, O., & Sørensen, P. H. (2008). Linear Systems Control: Deterministic and Stochastic Methods. Heidelberg, Berlin: Springer Berlin Heidelberg.

28. Invacare, C. (2009). Invacare Product Catalog. Retrieved March 7, 2009, from Invacare: http://www.invacare.com/cgi-bin/imhqprd/inv_catalog/prod_cat_detail.jsp?s=0&prodID=3GSEAT&catOID=-536887497

29. Isard, M., & Blake, A. (1998). Condensation - conditional density propagation for visual tracking. International Journal of Computer Vision, 5-28.

30. Kim, S., Jeong, I., & Lee, S. (2007). Mobile Robot Velocity Estimation Using an Array of Optical Flow Sensors. Proceedings of the International Conference on Control, Automation and Systems 2007 (pp. 616-621). Seoul, Korea.

31. Kosaka, A., & Kak, A. (1992). Fast Vision-Guided Mobile Robot Navigation Using Model-Based Reasoning and Prediction of Uncertainties. Proceedings of the 1992 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2177-2186). Raleigh, NC: IEEE.

32. Lang, S. Y., & Fu, Y. (2000). Visual Measurement of Orientation Error for a Mobile Robot. IEEE Transactions on Instrumentation and Measurement, 49(6), 1344-1357.

33. Law, M. W., & Chung, A. C. (2007). Weighted Local Variance-Based Edge Detection and Its Application to Vascular Segmentation in Magnetic Resonance Angiography. IEEE Transactions on Medical Imaging, 1224-1241.

34. Law, W., & Chung, A. (2006). Segmentation of Vessels Using Weighted Local Variances and an Active Contour Model. Conference on Computer Vision and Pattern Recognition Workshop (p. 83). IEEE.

35. Lefebvre, M. (2005). Applied Probability and Statistics. New York: Springer New York.

36. Lv, Q., Zhou, W., & Liu, J. (2006). Realization of Odometry System Using Monocular Vision. Proceedings of the 2006 Conference on Computational Intelligence and Security (pp. 1841-1844). IEEE.

37. Mammarella, M., Campa, G., Napolitano, M. R., Fravolini, M. L., Gu, Y., & Perhinschi, M. G. (2008). Machine Vision/GPS Integration Using EKF for the UAV Aerial Refueling Problem. IEEE Transactions on Systems, Man, and Cybernetics - Part C, 38(6), 791-801.

38. Middleton, L., & Sivaswamy, J. (2005). Hexagonal Image Processing: A Practical Approach. London: Springer London.

39. Moreno, L., Munoz, M. L., Garrido, S., & Martin, F. (2007). E-SLAM solution to the grid-based Localization and Mapping problem. IEEE International Symposium on Intelligent Signal Processing (pp. 1-7). IEEE.

40. Murata, S., & Hirose, T. (1993). Onboard Locating System Using Real-Time Image Processing for a Self-Navigating Vehicle. IEEE Transactions on Industrial Electronics, 40(1), 145-154.

41. Nagatani, K., Tachibana, S., Sofue, M., & Tanaka, Y. (2000). Improvement of Odometry for Omnidirectional Vehicle using Optical Flow Information. Proceedings of the 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 468-473). IEEE.

42. NI cRIO-9074. (2009). Retrieved March 7, 2009, from http://sine.ni.com/nips/cds/view/p/lang/en/nid/203964

43. Nuske, S., Roberts, J., & Wyeth, G. (2008). Visual Localisation in Outdoor Industrial Building Environments. 2008 IEEE International Conference on Robotics and Automation (pp. 544-550). Pasadena, CA: IEEE.

44. Ohya, A., Kosaka, A., & Kak, A. (1998). Vision-Based Navigation by a Mobile Robot with Obstacle Avoidance Using Single-Camera Vision and Ultrasonic Sensing. IEEE Transactions on Robotics and Automation, 969-978.

45. Ross, S. (2006). A First Course in Probability. Upper Saddle River: Pearson Prentice Hall.

46. Russell, S., & Norvig, P. (2003). Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ: Pearson Education, Inc.

47. Schellwat, H., & Wide, P. (1997). Position Detection of an Autonomous Robot by Aperiodic Tilings. IEEE Instrumentation and Measurement Technology Conference (pp. 576-580). Ottawa, Canada: IEEE.

48. Siegwart, R. (2006). Mobile Robots Facing the Real World. In Field and Service Robotics (pp. 21-30). Springer Berlin.

49. Silveira, G., Malis, E., & Rives, P. (2008). An Efficient Direct Approach to Visual SLAM. IEEE Transactions on Robotics, 969-979.

50. Tam, K., Lay, J., & Levy, D. (2008). Automatic Grid Segmentation of Populated Chessboard Taken at a Lower Angle View. Digital Image Computing: Techniques and Applications, 294-299.

51. Tamimi, H., & Zell, A. (2004). Global Visual Localization of Mobile Robots using Kernel Principal Component Analysis. Intelligent Robots and Systems, 1896-1901.

52. Tsai, R. (1987). A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses. IEEE Journal of Robotics and Automation, 3(4), 323-344.

53. Wai Yan So, E., Yoshimitsu, T., & Kubota, T. (2008). Relative Localization of a Hopping Rover on an Asteroid Surface using Optical Flow. SICE Annual Conference 2008 (pp. 1727-1732).

54. Warren, E. (2009). Machine Learning for Road Following by Autonomous Mobile Robots. Case Western Reserve University.

55. Xuedong, W., Xinhua, J., Rongjin, Z., & Jiashan, H. (2006). An Application of Unscented Kalman Filter for Pose and Motion Estimation Based on Monocular Vision. IEEE International Symposium on Industrial Electronics (pp. 2614-2619). IEEE.

56. Yeol, J. W. (2005). An Improved Position Estimation Algorithm for Localization of Mobile Robots by Sonars. Student Conference on Engineering Sciences and Technology (pp. 1-5). IEEE.

57. Zhang, Z. (2002). Computer Vision — ECCV 2002. 7th European Conference on Computer Vision (pp. 161-174). Copenhagen, Denmark: Springer Berlin Heidelberg.
