VISUAL SEARCH FOR OBJECTS WITH STRAIGHT LINES

by

SIMON HAIG MELIKIAN

Submitted for the degree of

Doctor of Philosophy

Case School of Engineering

Electrical Engineering and Computer Science

Thesis Adviser: Prof. Dr. Christos Papachristou

January, 2006

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of

______

candidate for the Ph.D. degree *.

(signed)______(chair of the committee)

______

______

______

______

______

(date) ______

*We also certify that written approval has been obtained for any proprietary material contained therein.

Copyright © 2006 by Simon Haig Melikian

ALL RIGHTS RESERVED

VISUAL SEARCH FOR OBJECTS WITH STRAIGHT LINES

by

SIMON HAIG MELIKIAN

Abstract

I present a new method of visual search for objects that include straight lines. This is usually the case for machine-made objects. I describe existing machine vision search

methods and show how my method of visual search gives better performance on objects

that have straight lines. Inspired from human vision, a two-step process is used. First

straight segments are detected in an image and characterized by their length, mid-

point location, and orientation. Second, hypotheses that a particular straight line segment belongs to a known object are generated and tested. The set of hypotheses is constrained by spatial relationships in the known objects. I discuss implementation of my method and its performance and limitations in real and synthetic images. The speed and robustness of my method make it immediately applicable to many machine vision problems.

Table of Contents

TABLE OF FIGURES ...... 3
ACKNOWLEDGEMENTS ...... 5
CHAPTER 1 ...... 6
1.1 THIS THESIS'S CONTRIBUTIONS ...... 6
1.2 INTRODUCTION ...... 7
1.3 SCOPE OF THIS THESIS ...... 10
CHAPTER 2 ...... 13
2.1 A LOOK AT HUMAN VISION ...... 13
2.1.1 Modular Systems for Features and Hypotheses ...... 13
2.1.2 Are Straight Lines Salient Features? ...... 14
2.2 PRIOR MACHINE VISION WORK ...... 17
2.2.1 Binary Search Methods ...... 17
2.2.2 Normalized Grayscale Correlation ...... 19
2.2.3 Geometric Based Search and Recognition ...... 22
2.2.4 Contour Based Search ...... 24
2.2.5 Affine Invariant Constellation Based Recognition ...... 27
2.2.5.1 Corner Based ...... 28
2.2.5.2 Salient Icons ...... 30
2.2.5.3 Scale Invariant Feature Transform ...... 34
CHAPTER 3 ...... 38
3.1 VISUAL SEARCH WITH STRAIGHT LINES ...... 38
3.1.1 Search Constraints for Machine Vision ...... 38
3.1.2 Using Straight Lines as Icons ...... 39
3.1.3 Search with Lines ...... 41
3.1.4 The Cost of Hypothesis Generation ...... 49
3.1.5 The Cost of Verification ...... 49
3.1.6 Gradient Angle of a Line ...... 50
3.1.7 The Number of Reference Lines Needed for Robust Search ...... 50
3.2 ABSTRACT LOOK AT VISUAL SEARCH WITH STRAIGHT LINES ...... 50
CHAPTER 4 ...... 52
4.1 CURVATURE BASED STRAIGHT LINE EXTRACTION (CBSLE) ...... 52
4.1.1 Effect of Span Value ...... 59
4.2 SPLIT AND MERGE METHOD FOR STRAIGHT LINE EXTRACTION ...... 60
4.3 LINE EXTRACTION PERFORMANCE ...... 64
4.3.1 End Points Position Accuracy ...... 64
4.3.1.1 Effect of Object Size on End Point Accuracy ...... 64
4.3.1.2 Effect of Noise on End Point Accuracy ...... 66
CHAPTER 5 ...... 68
5.1 HYPOTHESIS GENERATION ...... 68
5.2 VERIFICATION ...... 70
CHAPTER 6 ...... 75
6.1 ABSTRACT TESTING ...... 75
6.1.1 Speed vs. Background Straight Lines ...... 76
6.1.2 Speed vs. Target Position ...... 78
6.1.3 Speed vs. Angle ...... 79
6.1.4 Speed vs. Scale ...... 79
6.1.5 Speed vs. Scale and Angle ...... 79
6.2 REAL WORLD TESTING ...... 80
6.3 FEASIBILITY OF TEACHING WITH SYNTHETIC MODEL ...... 90
CONCLUSION ...... 96
APPENDIX 1. COMPARISON WITH OTHER METHODS ...... 97
REFERENCES ...... 99

Table of Figures

Figure 1. Ice cream package lid on a conveyer belt ...... 9
Figure 2. Wheel-rim identification ...... 10
Figure 3. Search test pattern from [Wolfe] ...... 15
Figure 4. Normalized gray-scale correlation ...... 20
Figure 5. Computing the Generalized Hough Transform ...... 23
Figure 6. Computing Contour Based Search ...... 25
Figure 7. Results of contour based search ...... 27
Figure 8. The local neighborhood of a corner (gray circles) gives us local position and orientation information (left-hand dotted box), but not information on the size of the object that the corner belongs to (right-hand dotted box) ...... 29
Figure 9. Finding characteristic scale ...... 30
Figure 10. Salient regions as computed by the method of [Kadir & Brady] ...... 32
Figure 11. An example of a Gaussian (top row) and Laplacian pyramid ...... 35
Figure 12. Each key point is shown as square, with a line from the center to one side of the square indicating orientation [Lowe] ...... 37
Figure 13. A rectangular object and its components of lines and corners ...... 40
Figure 14. Each line contributes information to locate the rectangle (find a reference point) regardless of the rectangle position, angle and size ...... 41
Figure 15. Rectangle pattern for training ...... 41
Figure 16. Shape translated, rotated, and scaled by unknown amounts ...... 42
Figure 17. Reference shape with coordinate system added ...... 42
Figure 18. Straight-line segments extracted from the reference pattern ...... 43
Figure 19. Straight line segments extracted from target pattern ...... 43
Figure 20. Fitting a trained line to a target line ...... 45
(20a) Translation of trained line's center point to target line's center point ...... 45
(20b) Rotation of trained line to overlap with target line ...... 45
(20c) Scaling of trained line to fit the target line ...... 45
(20d) The trained pattern is transformed by the same amounts to overlap the target ...... 45
Figure 21. Several hypotheses are tested to recognize and locate a pattern ...... 48
(21a) The reference pattern with a straight line (focus of attention) shown in green ...... 48
(21b-21e) Attempts to verify the pattern by comparing two non-corresponding-lines ...... 48
(21f) Perfect match verifies that the hypothesis is valid (corresponding lines) ...... 48
Figure 22. Curvature at a point ...... 52
Figure 23. Computing h to find straight line segments ...... 53
Figure 24. h (curvature measure) along a contour, showing curves (a,b,c) and straight line (d) ...... 54
Figure 25. Approximating curve with circular arc ...... 55
Figure 26. Left column: contours. Right column: Extracted straight lines ...... 58
Figure 27. Straight lines extracted using CBSLE method ...... 59
Figure 28. The effect of Span on straight line ends. Notice the slight change in start and end locations of each line segment ...... 60
Figure 29. Split and Merge for straight line extraction ...... 61
Figure 30. Straight line extraction using Split and Merge. Blue crosses are end points of a line segment ...... 61
Figure 31. Result of straight line extraction using Split and Merge method ...... 62
Figure 32. Contour points for an outdoor scene ...... 63
Figure 33. Straight lines from the scene above using CBSLE. Note the compression of information ...... 63
Figure 34. Image used to measure the stability of end point as object size varies ...... 64
Figure 35. CBSLE end point error (in pixels) as a function of scale ...... 65
Figure 36. Split & Merge end point position error (in pixels) as a function of scale ...... 66
Figure 37. The image used to measure the stability of end points to noise ...... 66
Figure 38. The effect of noise (sample images 0 to 19) on end point location ...... 67
Figure 39. Several verification attempts and one successful but not exact where hill climbing is needed ...... 74
Figure 40. Simple pattern used to perform abstract testing ...... 75
Figure 41. Searching for the pattern of Figure 40 in the presence of background lines "distractions" ...... 77
Figure 42. Searching for the pattern of Figure 40 in the presence of more background lines ...... 77
Figure 43. Search time as function of the number of lines in the background ...... 78
Figure 44. Recognizing and locating stop signs. Trained pattern is in upper-left image ...... 82
Figure 45. Recognizing and locating a "Talbots" sign. Top, left image is the training image ...... 83
Figure 46. Tracking the writing on the back of a moving truck. Pattern size is 231 x 135 pixel, and scale range is from 1.0 to 0.55 ...... 84
Figure 47. Recognizing three patterns in package label ...... 85
Figure 48. Straight line extractor applied for lane tracking ...... 86
Figure 49. Recognizing and locating a label's logo on two types of bottles ...... 87
Figure 50. Recognizing and locating manufactured parts – same part at different working distances ...... 88
Figure 51. Recognizing "France" pattern within other straight line rich patterns ...... 89
Figure 52. Synthetic Model of an object as a bitmap image ...... 90
Figure 53. Real image of an object (v-block) to search for that is taught by synthetic bitmap of Figure 52 ...... 91
Figure 54. Result of straight line extractor on the synthetic bitmap. Notice the double straight lines on each "drawing" line ...... 91
Figure 55. Result of the straight line extractor on the real object of the synthetic bitmap. Notice single straight line on the boundary of the object ...... 92
Figure 56. "Filled" synthetic bitmap with matching contrast to the object in the real image (Figure 53 & 55). The object is brighter than the background ...... 93
Figure 57. The synthetic bitmap that is trained to search for the object in the images in Figure 58 ...... 94
Figure 58. Search for v-block that was taught using synthetic bitmap of Figure 57 ...... 95
Figure 59. The Salient icons (Straight lines) of this thesis. It shows the location (mid point), the scale (distance between end points), and angle (the angle of the straight line segment) ...... 98

Acknowledgements

This gives me an opportunity to express my gratitude to my adviser, Prof. Christos Papachristou, for his support and guidance, and for making it possible to pursue this research. Dr. Ben Dawson provided a critical sounding board for most of these ideas, carefully critiqued my drafts, and influenced my thinking on the human aspects of vision. My sincere appreciation goes to my family for their unconditional support and for their help in testing the algorithm.

Chapter 1

1.1 This Thesis’s Contributions

I developed a novel method of visual search – finding patterns or objects in images –

based on different assumptions than are currently used in machine or computer vision. I

assume that the objects we are searching for have some straight lines in their outline or

interior. This is usually the case for machine-made objects and for many person-made

objects.

I show that straight lines can be used as salient features (icons) for rapidly locating and

identifying objects in an image. Previous machine and computer vision work assumed

that straight lines do not have enough information to support visual search, and so has

dismissed this important source of information about objects.

Motivated by human visual search, I developed a two-step process: finding straight lines in a “target” image and testing the hypotheses that these straight lines belong to a known,

trained object.

I implemented these ideas into a robust and practical algorithm that can be used in

industrial machine vision. My contributions here include a fast and accurate straight line

detector and a fast and accurate verifier (hypothesis tester).

As a practical contribution, this method could be used for autonomous guidance of a vehicle convoy. This can be done by having successive vehicles lock onto a pattern on the back of the preceding vehicle. My method can also help guide vehicles by finding the straight edges in the road markings at the middle and at the edge of the road. The method is fast enough that 20 samples per second can be used to close a guidance loop.

I demonstrate the use of my methods in a variety of real world and synthetic search

examples, to show its robustness and limitations. I also demonstrate the feasibility of

using a CAD file (as a synthetic image) to teach objects to be recognized. Using a

“clean” synthetic image improves detection reliability and ease-of-use over teaching

with actual images of objects.

This work is a pioneering contribution to practical, structure-based visual search and has

found immediate application in a wide variety of challenging machine vision problems.

1.2 Introduction

Our ability to visually search for and recognize objects is critical to our survival. We

have to find food, tools, paths, mates, and much more, all the while keeping an “eye out”

for things that could hurt us. It is likely that we have many specialized visual search

mechanisms to do tasks ranging from rapid detection of moving objects to sophisticated

visual search, such as finding a friend in a crowd.

Our goal in machine vision is to design computer systems to replace human vision in

tasks that are fast, repetitive, precise, and must be free of error. A typical machine vision

task is the inspection of manufactured goods for part orientation, dimensions, and flaws.

It is no surprise that search and recognition are also critical components of machine

vision. Each part has to be precisely located before it can be measured and inspected.

Visual search is, therefore, an important research topic and the practical foundation for most machine vision companies. Rapid and accurate visual search is a competitive advantage in biology and business.

Our natural visual environment is complex and so visual search requires extraordinary

computational power, perhaps on the order of 10^15 operations per second¹. Even if this

was possible in a commercial machine vision system, there is a bigger problem: We

don’t know how to do rapid visual search in unconstrained, natural environments.

To make search and recognition tasks tractable in commercial machine vision, we

artificially limit the task’s visual complexity. First, the vision system is set up to view

and recognize only one or a small class of objects. Second, the presentation (position,

orientation, size, view, etc.) of these objects is strictly controlled. We thus reduce the

object variability to the point that we know how to do the search and can implement it

with reasonable cost – in both compute time and money.

¹ 10^11 neurons, each with 10^3 synapses, running at 10^2 cycles per second, with 10% utilization = 10^15. See http://faculty.washington.edu/chudler/facts.html

For example, when packaging ice cream, the vision system has to recognize a package lid

from a small class of lids (Vanilla, Raspberry, etc.). To reduce visual complexity, we use

a uniform light source and present the lids plane parallel to the camera’s sensor to

eliminate perspective distortions (see Figure 1).

Figure 1. Ice cream package lid on a conveyer belt

In Figure 2 the machine vision task is to identify wheel-rims and measure their angular

orientation so that a robot can correctly place the rim. To simplify the search, a spindle

centers the wheel-rim in the camera’s field of view. Diffuse illumination is used to

reduce specular (mirror-like) reflections that can confuse the vision system. The vision system matches a rim to templates (abstract models) of the possible rims to identify it and find its orientation.


Figure 2. Wheel-rim identification

To make an analogy, you make a Google search tractable by constraining the search to be as specific as you can. For example, instead of searching for “dogs” (29,600,000 hits) you search for “Yorkshire Terriers”. Even so, there are 109,000 hits to search through, and so you further restrict your search. You use your extensive a priori knowledge of dogs to constrain the search and sort through the hits.

Machine vision systems lack the knowledge needed to constrain and interpret a general visual search. Therefore, practical machine vision search requires you to drastically restrict what the vision system sees and to add a priori knowledge about what it will see so that it can interpret the result. This takes time, money and specialized knowledge. A major contribution of this thesis is a method that reduces the constraints and knowledge needed to make a practical machine vision system.

1.3 Scope of this Thesis

I think that the continued success of machine vision depends on improving visual search to handle wider variations in part presentation, lighting, and size (scale). As the vision system becomes more robust to object variation, the need to restrict the system’s view by

part positioning (staging) and lighting is reduced, and the development time and costs to put a priori knowledge into the system are reduced.

Machine- and person-made objects have many straight lines. I argue that by using this observation as a general constraint on search tasks, we can build fast and robust search.

Of course, this assumption means that my search method is not applicable to most natural objects, such as vegetables or dogs.

Rather than laboriously testing every possible match of a template (part model) to any possible view (location, orientation and scale) of an object, I take a tip from human vision to argue for a faster, two-step process. First I quickly find straight lines and use these as salient features or icons. A feature is some distinguishable (unique) component of an object and salient means that the feature has high information content. Second, I test these salient features against hypotheses derived from the objects we want to recognize.

This type of search is rapid and robust, as there are a small number of salient features

(straight lines) and hypothesis testing with them is relatively simple.

These ideas are implemented as algorithms in a C++ program. I use this program to show how my methods can rapidly and robustly find and recognize objects in synthetic and natural images, and to explore the method's limitations.

It would be wonderful to have a machine vision system that matches human visual abilities. It could, for example, recognize products without needing barcodes, learn

robotic assembly with little training, read signs on streets and highways, or guide a blind person. Unfortunately, we don't know how to make such a vision system with reasonable cost and required reliability. This thesis makes an incremental step towards human vision's power, by contributing a practical method of searching for and recognizing objects from the large class of objects with straight lines.

This thesis also abstracts the approach so that features other than straight lines could be used for visual search, and perhaps enable further advances in the field of visual search.

Chapter 2

2.1 A Look at Human Vision

I start by looking at some aspects of human vision that motivate this work.

2.1.1 Modular Systems for Features and Hypotheses

Biological vision is complex but appears to be somewhat modular. For example,

[Schneider] provided evidence for separate brain systems for search and recognition.

Humans (and experimental animals) with damage to primary visual cortex could not

recognize objects but could report the object's location, a phenomenon Schneider termed

“blindsight”. Conversely, damage to sub-cortical visual structures left animals able to

recognize objects but not locate them in (perceptual) space.

This observation leads to a conundrum for both human and machine vision. It would

seem you have to find an object before you can recognize it, but you have to recognize it

before you can find it. Yet this work suggests they are separate operations.

A proposed solution, that fits well with this thesis, is that the “where” system recognizes simple salient features of objects such as color, brightness, movement, etc., and directs the “what” system to image areas that are important to process. The “what” system uses the direction and evidence (salient features) provided by the “where” system to formulate and test hypotheses as to what object is there – to recognize the object.

For example, a sudden movement in the periphery of your vision will cause your head

and eyes to turn towards this stimulus and direct your visual cortex to examine the stimulus. This is done reflexively, that is without conscious control, and quite rapidly –

the movement you see might be a tiger headed your way! Once directed, the “what”

visual system can quickly evaluate the stimulus, and perhaps you will run for your life.

There are two key concepts that this thesis uses from this division of labor by the “where”

and “what” visual systems. First, visual attention is a way for an organism or machine to

direct limited computing resources to the important parts of a problem. In this thesis, the

“where” system generates salient features (straight lines) that call attention to image areas

that need to be examined by a “what” system.

The second concept is the generation and testing of hypotheses by the “what” system,

based on a set of salient visual features computed by the “where” system. This is also a

way to reduce computation and so increase recognition speed, as a hypothesis can be abandoned after a few tests against the feature “evidence” fail.

2.1.2 Are Straight Lines Salient Features?

I argue that, in certain environments, straight lines are highly salient features that can be quickly found and that can direct the attention of a matching (hypothesis testing) stage.

There is evidence that straight lines (and the angles between them) are salient features for

human vision search – they drive visual attention and recognition. This evidence

supports the use of straight lines in machine vision search and recognition.

Many neurons in primary visual cortex respond to edge (intensity change) segments or to small bars (lines) at specific orientations [Hubel & Wiesel]. Some “end stopped” cells seem designed to detect curvature (or straightness) in small image segments [Kandel,

Schwartz & Jessell]. This is evidence of extensive neural processing to “extract” straight lines, although typically smaller line segments than used in this thesis.

A more immediate demonstration of the use of straight lines and visual attention in

human vision is provided by an example from [Wolfe]:

Figure 3. Search test pattern from [Wolfe]

When you are asked to find the blue diamond, it almost instantly “pops out” without conscious effort. The blue diamond differs from the blue square only in its outline (lines)

orientation and from the blue rectangles only in its line lengths. If I ask you to look for a cross with a red-vertical element, you will find yourself looking at each of the crosses until you find a match, perhaps because there are no differences in the line structure or orientations.

It seems that straight line patterns and colorings are salient features that direct our

attention, so that processing for recognition is fast and easy. These kinds of features are

probably processed “in parallel” in the brain and the important ones are passed to a

sequential process that tests hypotheses as to what the object is. Perhaps the blue

diamond “pops out” because a “where” system finds line features and is primed to direct

visual attention to areas with the right orientation, aspect ratio and color for a blue

diamond. The cross patterns are distinguished only by details of their colors, and these

subtler differences are apparently not detected by the system that drives attention. So we

must slowly and sequentially examine each cross to find the target.

This and other evidence supports the idea that straight, oriented lines (and other features,

such as area coloring) are salient features for human visual attention and so could be used

for machine vision. The straight-line features focus a slower, hypothesis-testing “what”

module used for recognition of objects based on these features. This is the approach used

in this thesis.

2.2 Prior Machine Vision Work

The first machine vision systems were developed in the 1960s and 1970s. See, for

example, [Agin]. Early machine vision systems were hampered by poor quality video

cameras and by very limited computing power. These systems usually converted the

continuous, gray-scale images from the cameras into binary (two value, black and white)

images.

The capabilities of machine vision systems advanced rapidly in the 1980s, driven by

better computers, the introduction of CCD cameras, and high-quality frame grabbers –

high-speed digitizers, memory, and computer interfaces for acquiring camera images.

The evolution of machine vision in the last twenty years was driven by improvements in

hardware and algorithms. Search algorithms have always been critical for machine

vision because search is the basis for most other machine vision tasks – if you can’t find

it, you can’t recognize or measure it. I will review some key developments in machine

vision search, in approximate chronological order.

2.2.1 Binary Search Methods

Early machine vision systems used binary images, due to hardware and processing

limitations. Gray scale images are converted to binary images by applying a threshold,

and this is difficult unless the objects and lighting are uniform. Objects in these binary images appear as black (or white) blobs – areas of connected pixels – on a white (or black) background.


Early search methods used projections – the integration of pixel values perpendicular to

some line. In a binary image, integration reduces to counting pixels in rows and columns,

and so can be done quickly with very little hardware. These projections form a

distribution that is matched with template curves to determine an object location and to

recognize the object. Unfortunately, multiple objects in the image can confound these

distributions. It also can be impossible to recover object orientation from two projections

– consider the projections of an elliptical blob at 45 and 135 degrees. We can improve recognition sensitivity by using gray scale pixels, whose values are a function of object

intensity, and by adding more projections. The basic problem is that with projections, we

are trying to solve an often difficult inverse reconstruction problem.

A more common approach, and one still in widespread use, is to do blob analysis on the

image. Blob analysis uniquely identifies connected areas (blobs) in the image. We could compute the moments (center of gravity, etc.) of a blob and use these features to locate and recognize it. In practice, you need at least seven moments to recognize an object from its blob representation, and intermediate accumulation values can grow beyond what even double precision floating point can maintain [Gonzalez & Woods].

A better approach is to follow the outline of each blob using an edge crawler – a state

machine that produces a sequence of X,Y locations as it follows the blob’s outline edge.

The sequence of edge locations can be formed into a chain code based on changes in

edge position. This is equivalent to taking a derivative and so the exact location (absolute

X,Y values) of the blob “drops out” of the chain. We can now match a sample chain

code with a template (previously learned) chain code independent of the object’s position in the image. The blob’s position can be computed as the center of gravity of only the blob’s outline points.
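To make the chain-code idea concrete, here is a minimal sketch in C++ (my illustration, not the thesis's implementation): it converts a closed sequence of 8-connected outline points, such as an edge crawler produces, into a chain code of step directions. The point type and the direction numbering are assumptions made for this example.

#include <cstddef>
#include <iostream>
#include <vector>

struct Point { int x, y; };

// Map a unit step between two 8-connected outline points to a direction code 0..7
// (0 = +x, 1 = +x+y, 2 = +y, ... around the compass). The numbering is an arbitrary
// convention; it only has to be used consistently.
static int directionCode(const Point& from, const Point& to) {
    static const int codes[3][3] = {   // indexed by [dy + 1][dx + 1]
        { 5, 6, 7 },
        { 4, -1, 0 },
        { 3, 2, 1 }
    };
    return codes[(to.y - from.y) + 1][(to.x - from.x) + 1];
}

// Build the chain code of a closed outline. The chain stores only the steps between
// successive points (a discrete derivative of position), so the blob's absolute
// position "drops out", giving the translation invariance described in the text.
std::vector<int> chainCode(const std::vector<Point>& outline) {
    std::vector<int> chain;
    for (std::size_t i = 0; i + 1 < outline.size(); ++i)
        chain.push_back(directionCode(outline[i], outline[i + 1]));
    chain.push_back(directionCode(outline.back(), outline.front()));  // close the loop
    return chain;
}

int main() {
    // A tiny square traced clockwise in image coordinates (y grows downward).
    std::vector<Point> square = { {0, 0}, {1, 0}, {1, 1}, {0, 1} };
    for (int c : chainCode(square)) std::cout << c << ' ';   // prints: 0 2 4 6
    std::cout << '\n';
    return 0;
}

Matching a sample chain against a template chain (for example, cyclically) then recognizes the outline independently of where the blob sits in the image.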

The first derivative of an outline curve gives us a directional curve that is translation

(position) invariant. A second derivative gives us the rate of change of direction – the

curvature of the curve – and so makes the outline rotation invariant. Unfortunately, scale

invariance is not as easy. One approach is to code the outline by patterns of curvature

extrema and inflection points (where the curve changes direction). This removes curve

distance information, and hence is scale invariant [Richards, Dawson & Whittington].

Blob measures and methods are highly developed and are widely used. Other methods of

finding and recognizing binary objects include matching Fourier components of the

outline and binary correlation.

2.2.2 Normalized Grayscale Correlation

Normalized grayscale correlation was the basis for the first commercially successful visual search and recognition algorithms in machine vision. Many products still use it and it is considered a “gold standard” against which other search and recognition algorithms are measured.

19 of 106 A template of size J x K is taught, where the template is made of normalized pixel values.

Normalization, in this case, means subtracting the template mean from each pixel and dividing the result by the template standard deviation. During search (and recognition), the template pattern

is scanned across an input image (of size M x N) and at each location the normalized

grayscale correlation value is computed and recorded in a “correlation image” (see Figure

4). Peak values in the correlation image indicate the best match location between the

template and the possible instances of the template in the input image. The correlation

peak value is the match score, and is a measure of the quality or match closeness of the

image and template patterns.


Figure 4. Normalized gray-scale correlation

In more detail: The input image values are f(x,y) and w(x,y) are the template values. The template (of size J x K) is moved stepwise across the input image (of size M x N), and at each point (s,t) the correlation coefficient is computed and output to the correlation image. The correlation coefficient is defined as:

\gamma(s,t) = \frac{\sum_{x}\sum_{y}\left[f(x,y) - \bar{f}(x,y)\right]\left[w(x-s,\,y-t) - \bar{w}\right]}{\left\{\sum_{x}\sum_{y}\left[f(x,y) - \bar{f}(x,y)\right]^{2}\,\sum_{x}\sum_{y}\left[w(x-s,\,y-t) - \bar{w}\right]^{2}\right\}^{1/2}}

where s = 0, 1, 2, ..., M-1 and t = 0, 1, 2, ..., N-1, \bar{w} is the average value of the pixels in w(x,y), \bar{f}(x,y) is the average value of f(x,y) in the region coincident with the current location of w, and the summations are taken over the coordinates common to both f and w. The correlation coefficient \gamma(s,t) is in the range -1 to 1. [Gonzalez & Woods]
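As a concrete illustration (a minimal sketch, not code from this thesis), the coefficient at a single offset (s, t) can be computed directly from this definition; the brute-force loops and the simple image type are kept for clarity, and scanning all offsets would produce the correlation image of Figure 4.

#include <cmath>
#include <vector>

// A simple grayscale image: row-major pixel values, width x height.
struct Image {
    int width, height;
    std::vector<float> pix;
    float at(int x, int y) const { return pix[y * width + x]; }
};

// Normalized grayscale correlation gamma(s,t) between template w and the region of
// image f whose top-left corner is at (s,t). Assumes the template fits inside f at
// that offset. Returns a value in the range [-1, 1].
float ngc(const Image& f, const Image& w, int s, int t) {
    const int J = w.width, K = w.height;

    // Means of the template and of the image region under the template.
    double fMean = 0.0, wMean = 0.0;
    for (int y = 0; y < K; ++y)
        for (int x = 0; x < J; ++x) {
            fMean += f.at(s + x, t + y);
            wMean += w.at(x, y);
        }
    fMean /= J * K;
    wMean /= J * K;

    // Numerator and the two sums of squares that appear in the denominator.
    double num = 0.0, fSS = 0.0, wSS = 0.0;
    for (int y = 0; y < K; ++y)
        for (int x = 0; x < J; ++x) {
            double df = f.at(s + x, t + y) - fMean;
            double dw = w.at(x, y) - wMean;
            num += df * dw;
            fSS += df * df;
            wSS += dw * dw;
        }

    double denom = std::sqrt(fSS * wSS);
    return denom > 0.0 ? static_cast<float>(num / denom) : 0.0f;
}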

There are limitations to this technique. First, it is only invariant to translation (movement

of the target pattern in the input image). If, for example, an object can be presented at any orientation angle, you will need to generate and match against a set of patterns with different orientations. For 10 degrees of angular resolution, you would need 36 template patterns and, as an input pattern turns away from one of these templates, the correlation peak would decrease even though the pattern is a good match when orientation matches.

If the object moves along the camera’s optical axis, such that the pattern’s apparent size

changes, additional template patterns need to be generated and matched to assure a good

correlation score. For a size range of 0.5 to 2 times, I have found that at least 15 different

template pattern sizes are needed.

So, to get rotation and scale invariance, you could end up testing 36 x 15 = 540 different template patterns at each target image point when using normalized grayscale correlation.

I have not even considered the additional computation needed to recognize many different objects or for dealing with natural variation in the target objects!

2.2.3 Geometric Based Search and Recognition

A newer class of algorithms combines the sensitivity and inherent translation and rotation invariance of blob outline methods with integration (in the form of “voting”) to reduce noise. To understand these methods you need to know about the Hough transform

[Hough].

The Hough transform accumulates votes for a parameterized pattern in a parameter space.

The point in the parameter space with the most votes can be “inverted” to give the location and orientation of the pattern. For example, consider the space of straight lines.

We use an edge detector (an algorithm that selects pixel values that fall along an object edge, where an edge is represented in a digital image by a sudden change in intensity) to pick out candidate edge points in the input image. Each candidate edge point votes for all lines that could be drawn through it, by incrementing bins in an accumulator space that is parameterized by line orientation and perpendicular distance to the origin. After all points have voted, we search the parameter space (bins) for peaks, and each peak is a recognized line in the input image. This method is used, for example, to trace the faint path of a satellite (a straight line) moving against a background of stars.

This Hough method can be made more efficient by using the local orientation of the edge point, also derived from the edge detector.
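The following is a minimal sketch of the straight-line Hough transform just described (my own illustration, not code from this thesis): each edge point votes for every (angle, distance) pair it could lie on. Using the local edge orientation, as suggested, would simply restrict the inner loop to a narrow band of angles around the measured gradient direction.

#include <cmath>
#include <vector>

struct EdgePoint { int x, y; };

// Accumulator for lines parameterized as r = x*cos(theta) + y*sin(theta);
// theta covers [0, pi) and r covers [-rMax, +rMax].
struct HoughAccumulator {
    int thetaBins, rBins;
    double rMax;
    std::vector<int> votes;                       // thetaBins x rBins, row-major
    HoughAccumulator(int tb, int rb, double rmax)
        : thetaBins(tb), rBins(rb), rMax(rmax), votes(tb * rb, 0) {}
    int& at(int t, int r) { return votes[t * rBins + r]; }
};

// Each candidate edge point increments every accumulator bin it could belong to.
// After all points have voted, peaks in acc.votes correspond to lines supported
// by many edge points.
void houghVote(const std::vector<EdgePoint>& edges, HoughAccumulator& acc) {
    const double pi = 3.14159265358979323846;
    for (const EdgePoint& p : edges) {
        for (int t = 0; t < acc.thetaBins; ++t) {
            double theta = pi * t / acc.thetaBins;
            double r = p.x * std::cos(theta) + p.y * std::sin(theta);
            // Map r from [-rMax, +rMax] onto a bin index.
            int rBin = static_cast<int>((r + acc.rMax) / (2.0 * acc.rMax) * (acc.rBins - 1) + 0.5);
            if (rBin >= 0 && rBin < acc.rBins)
                ++acc.at(t, rBin);
        }
    }
}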

Ballard generalized Hough's idea to deal with arbitrary curves [Ballard]. As before, we

extract edge points as well as edge (gradient) directions. From an arbitrarily chosen

reference point (xr,yr) within or outside the pattern we draw vectors to each edge point

(xi,yi). The length (ρ) and orientation (φ) of the vector are used to construct an “R-Table”

that represents this pattern (see Figure 5).


Figure 5. Computing the Generalized Hough Transform

The R-Table lists the gradient angle and the length ρ and relative angle φ of each vector.

The R-Table is tabulated such that points with similar gradient angles are put into one

“bin”.
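To make the R-Table concrete, here is a minimal sketch of its construction (an illustration under the conventions just described, not Ballard's or the thesis's code); the structure names and the angle quantization are assumptions made for the example.

#include <cmath>
#include <vector>

// One training edge point with its gradient (edge-normal) angle in radians.
struct Edge {
    double x, y;
    double gradAngle;
};

// An R-Table entry: the vector (length rho, angle phi) from an edge point to the
// arbitrarily chosen reference point (xr, yr).
struct RTableEntry {
    double rho, phi;
};

// Build the R-Table: one bin per quantized gradient angle, each bin holding the
// (rho, phi) vectors of the training edge points whose gradient falls in that bin.
std::vector<std::vector<RTableEntry>> buildRTable(const std::vector<Edge>& edges,
                                                  double xr, double yr,
                                                  int angleBins) {
    const double twoPi = 6.28318530717958647692;
    std::vector<std::vector<RTableEntry>> table(angleBins);
    for (const Edge& e : edges) {
        double dx = xr - e.x, dy = yr - e.y;
        RTableEntry entry;
        entry.rho = std::sqrt(dx * dx + dy * dy);   // length of the vector to the reference point
        entry.phi = std::atan2(dy, dx);             // orientation of that vector

        // Quantize the gradient angle into [0, 2*pi) to select the bin.
        double a = std::fmod(e.gradAngle, twoPi);
        if (a < 0.0) a += twoPi;
        int bin = static_cast<int>(a / twoPi * angleBins) % angleBins;
        table[bin].push_back(entry);
    }
    return table;
}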

Edge-based methods have the advantage of being mostly independent of non-linear

contrast change between the template pattern and the target pattern. Further, in the

Generalized Hough Transform (GHT) we are integrating the “votes” of many edge pixels, so noise sensitivity is greatly reduced.

However, the Generalized Hough Transform approach is only translation invariant. An attempt to make it rotation invariant requires building an R-Table for each orientation angle. Much like grayscale correlation, you again need a large number of pattern “variants”, and the effort to match (find and recognize) patterns quickly becomes too much for practical use. [Melikian et al.] removed the need for an R-Table for each angle, as described next.

2.2.4 Contour Based Search

A method developed by [Melikian, Tyminski & Dawson] overcomes the lack of rotation invariance of the Hough transform by relying on edge contours (ordered sets of edge points) rather than many single, unrelated edge points.

With contours, the local geometrical structure of the outline edge is preserved and we can compute both the curvature at an edge point as well as a unique angle measure; both are translation and rotation invariant. These parameters make it practical to apply Hough transform methods with both translation and rotation invariance.

We select a scale at which to measure the curve contour by selecting pairs of points on the curve that are a set linear distance apart (see Figure 6). One point, (x_{i-j}, y_{i-j}), is called a “back point” and the other contour point, (x_{i+j}, y_{i+j}), is called a “front point”. The line between the back point and the front point is called the “stick”. A perpendicular bisector is erected from the stick to the contour point (x_i, y_i); its length is called h (for height) and is a measure of curvature. The higher the contour curvature, the larger h is. By taking points that are the stick distance apart on the curve, we have smoothed our curvature measure to the scale of the stick.
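A minimal sketch of this curvature measure (my own illustration of the published idea, not the authors' code): the contour is an ordered, closed list of points, span is the index distance j to the front and back points, and h is computed here as the perpendicular distance from the contour point to the stick.

#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

// Curvature measure h at contour index i: the perpendicular distance from contour[i]
// to the "stick" joining the back point contour[i - span] and the front point
// contour[i + span]. The contour is treated as closed; span must be < contour.size().
double curvatureH(const std::vector<Pt>& contour, std::size_t i, std::size_t span) {
    const std::size_t n = contour.size();
    const Pt& back  = contour[(i + n - span) % n];
    const Pt& front = contour[(i + span) % n];
    const Pt& p     = contour[i];

    // The stick vector and its length.
    double sx = front.x - back.x, sy = front.y - back.y;
    double stickLen = std::sqrt(sx * sx + sy * sy);
    if (stickLen == 0.0) return 0.0;

    // Distance from p to the line through back and front (cross-product formula).
    double cross = sx * (p.y - back.y) - sy * (p.x - back.x);
    return std::fabs(cross) / stickLen;   // near zero along straight segments, large on curves
}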


Figure 6. Computing Contour Based Search

As with Ballard, we pick an arbitrary reference point (x_r, y_r). A vector Ω of length ρ is drawn from the contour point (x_i, y_i) to the reference point. Note that φ, the angle between the stick and the vector Ω, is rotationally invariant.

With these definitions and background, the algorithm steps are summarized by:

Training:

i. Extract edge points using a Canny detector or other edge detector [Canny 1983], [Canny 1986].
ii. Crawl around the edge points to extract the contours.
iii. For each contour point, select a “front” and a “back” point that are a “span” apart (in arc length).
iv. Compute h, φ, ρ for every contour point, and store them in an R-table using the h value as the R-table bin. (As a reminder, the bins of the R-table in the GHT are indexed by the gradient angle of an edge point.)

Searching:

v. Extract edge points using a Canny or other edge detector.
vi. Crawl on the edge points to extract the contours.
vii. For every point on a contour, get the “front” and “back” contour points.
viii. Compute h, φ_search, ρ_search for every contour point.
ix. From the R-table, pick up the data (φ_train, ρ_train) from the bin that corresponds to h, compute the angle difference between φ_train and φ_search, rotate Ω by that difference, and increment the accumulator bin where Ω points (a minimal sketch of this voting step follows this list).
x. Find peaks in the accumulator bins, indicating a high probability of a particular object in the image.
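The voting step (ix) is the least obvious one, so here is a minimal sketch of one reading of it (my illustration, not the authors' code): because the stored angle φ is measured relative to the training stick, adding the search stick's angle rotates the stored vector Ω into the search image's frame, and the vote is cast where the rotated vector points. The data structures are assumptions made for this example.

#include <cmath>
#include <vector>

// Data stored per training contour point, binned by its curvature measure h.
struct TrainEntry {
    double phi;   // angle of Omega (contour point -> reference point) relative to the training stick
    double rho;   // length of Omega
};

// A simple 2-D vote accumulator over image coordinates.
struct Accumulator {
    int width, height;
    std::vector<int> bins;
    Accumulator(int w, int h) : width(w), height(h), bins(w * h, 0) {}
    void vote(int x, int y) {
        if (x >= 0 && x < width && y >= 0 && y < height) ++bins[y * width + x];
    }
};

// For one search contour point at (xSearch, ySearch) with curvature bin hBin and a
// stick at angle stickAngleSearch, cast one vote per stored training entry for the
// reference-point location it implies. A peak in the accumulator marks a likely object.
void castVotes(const std::vector<std::vector<TrainEntry>>& rTable, int hBin,
               double xSearch, double ySearch, double stickAngleSearch,
               Accumulator& acc) {
    for (const TrainEntry& e : rTable[hBin]) {
        double dir = stickAngleSearch + e.phi;   // rotate Omega into the search image's frame
        int xr = static_cast<int>(xSearch + e.rho * std::cos(dir) + 0.5);
        int yr = static_cast<int>(ySearch + e.rho * std::sin(dir) + 0.5);
        acc.vote(xr, yr);
    }
}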

Figure 7 shows contours and an object found by this method. The extracted contours are indicated by colors in this figure. Note that edges can have inside and outside contours, indicating objects and “holes”.

Figure 7. Results of contour based search.

2.2.5 Affine Invariant Constellation Based Recognition

An object or pattern in an image can be represented as a set of shapes – called parts,

image patches, or icons – along with the geometric relationship between these icons. The

icons and their geometric relationships are said to make up a constellation. We can use

constellations of icons to locate and recognize objects in an image, and this search and

recognition can be made invariant to most affine transformations.

We start by selecting a type or class of icons, based on what we think are salient

(information rich) features in images. Icon types might be corners (where edges come

together), correlation matches to a set of important features, or perhaps local image

autocorrelations (second moments). We would also like a measure of icon “strength”, so we can examine more important icons first.

The advantages of selecting salient points for search and recognition are first, salient

points are easily computed and second, these salient points can direct hypothesis testing –

they serve as a form of visual attention. We only need to identify a small number of

salient points and then hypothesis test against the object constellation. The key insight is

that a few points and their geometric relationships quickly narrow down the possible

matching objects. To continue the Google search analogy from the introduction, adding additional key words rapidly reduces the number of returned “hits” in the search.

The performance of an icon-based search and recognition system depends on the choice

of icons. The choice of icons, in turn, depends on the class of objects being searched for and the class of images in which these objects are to be found. This a priori knowledge

thus guides the design and implementation of the search and recognition system. Here

are three approaches that have been used to select icons:

2.2.5.1 Corner Based

Corners are important features in our visual environment, so it makes sense to use them

as salient points. Corners are often detected using the moments of a matrix of local image gradients (the second-moment matrix)

[Harris & Stephens]. The machine vision system learns a set of “landmark” corners for

an object. Search and recognition are done by estimating the transformation of the

“landmark” corners that would match the set of corners in the target image.


Corner detection is rotation and translation invariant but has no scale information, so we

cannot estimate a scale transformation that maps the landmarks to the target constellation

(See Figure 8).

Figure 8. The local neighborhood of a corner (gray circles) gives us local position and orientation information (left-hand dotted box), but not information on the size of the object that the corner belongs to (right-hand dotted box)

A solution to this scale problem was proposed by [Mikolajczyk & Schmid]. Candidate corner points (icons) are found with a Harris corner detector [Harris & Stephens].

Around each candidate icon, a measure of scale is computed by using a range (in neighborhood size) of normalized Laplacian (second derivative) operators. The characteristic scale of this corner point is indicated by a local extremum in the curve of derivative strength vs. neighborhood size (see Figure 9). However, this gives only an approximation to the characteristic scale of the corner, so iterative methods are used to improve the measure of the corner’s position, angle, and scale measures.

The drawback of their method is that there is no ranking of icon saliency (quality). You end up with lots of low-information icons, requiring an exhaustive verification process to

identify the object – much like searching for the crosses with color changes as described above. Furthermore, the computation time for the iterative refinement of an icon's affine coefficients makes this method slow for practical machine vision.

Figure 9. Finding characteristic scale

2.2.5.2 Salient Icons

Two related approaches are to select icons based on a direct measure of saliency

(information content), or to measure saliency as a function of scale and to use this curve to select icons with high saliency.

Gilles investigated the use of salient local image patches (icons) for object matching and registering images [Gilles]. He measured saliency in terms of local signal complexity or unpredictability by estimating the Shannon entropy of local features [Shannon]. The distribution of a suitable descriptor within an image patch (neighborhood) allows us to compute the Shannon entropy. Areas with high signal complexity tend to have flatter distributions (less predictable) and hence higher entropy. For example a uniform color

30 of 106 patch has low entropy (predictability) where as one with a wide range of colors has high

entropy and hence is considered to be more salient.

Given a point x, a local neighborhood R_x, and a descriptor d that takes values from D = {d_1, d_2, ..., d_r}, the local neighborhood entropy is defined as:

H_{D,R_x} = -\sum_{d} P_d \log_2 P_d

where P_d = P_{R_x}(d_i) is the probability of the descriptor D taking the value d_i in the local region R_x. For example, if our descriptor were image intensity, D would have values in the range 0..255 and P_d would be the normalized histogram (probability density function, or PDF) of the intensity values in the neighborhood R_x.
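A minimal sketch of this entropy computation for an intensity descriptor (my illustration, not Gilles's code): build the normalized histogram of the neighborhood and sum -P log2 P over the occupied bins.

#include <cmath>
#include <vector>

// Shannon entropy (in bits) of the intensity distribution inside a square neighborhood
// of half-width `radius` centered at (cx, cy). `img` is a row-major 8-bit grayscale
// image of size width x height.
double localEntropy(const std::vector<unsigned char>& img, int width, int height,
                    int cx, int cy, int radius) {
    std::vector<int> hist(256, 0);
    int count = 0;
    for (int y = cy - radius; y <= cy + radius; ++y)
        for (int x = cx - radius; x <= cx + radius; ++x)
            if (x >= 0 && x < width && y >= 0 && y < height) {
                ++hist[img[y * width + x]];
                ++count;
            }

    double entropy = 0.0;
    for (int h : hist)
        if (h > 0) {
            double p = static_cast<double>(h) / count;   // P_d: normalized histogram bin
            entropy -= p * std::log2(p);                 // accumulate -P_d log2 P_d
        }
    return entropy;   // low for a uniform patch, higher for a complex patch
}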

Gilles’s method has limitations. First, it requires specifying the neighborhood size, or

scale, to estimate the local PDF. If we want icons that are salient over scale (and hence

scale invariant), then a fixed neighborhood size will not do. Second, this definition of

saliency assumes that complexity is rare in real images. This is generally true, but the

number of apparently salient points will increase as noise or “clutter” in the image

increases. For a pure noise image, complexity is uniform and independent of scale or

position.

[Kadir & Brady] addressed these problems by defining saliency as the product of two terms.

The first is entropy, as defined above and for some set of features, and the second is the

function of entropy across scale (neighborhood sizes). Peaks of entropy in this function

are used to select the characteristic scale of this neighborhood. The product of these two terms is the icon's saliency value, with higher values indicating more saliency. Figure 10 shows icons extracted with the [Kadir & Brady] method.

Figure 10. Salient regions as computed by the method of [Kadir & Brady]

Here is [Kadir & Brady]’s process in brief:

In an image, I, and for each scale, s in a set of scales S, we compute saliency (entropy),

H_D(s,a), in the area, a, surrounding each point, (x,y), in I:

H_D(s,a) = -\sum_{d \in D} P_{d,s,a} \log_2 P_{d,s,a}

Next we compute a term called the inter-scale saliency, W_D(s,a), by taking the difference between the probability descriptor measures, P, at two successive scales. This term increases with increasing changes in the probability descriptor as a function of scale:

W_D(s,a) = \frac{s^2}{2s-1} \sum_{d \in D} \left| P_{d,s,a} - P_{d,s-1,a} \right|

where the factor s^2/(2s-1) normalizes the measure across scales. For each scale s_p at which the differentiated saliency vs. scale curve attains a peak, we compute a new measure of saliency, Y_D(s_p,a), which is the product of H_D(s,a) and W_D(s,a) at the scale where the peak occurs:

Y_D(s_p,a) = H_D(s_p,a) \times W_D(s_p,a)

Pseudo-code for their algorithm may make this clearer:

For each pixel (x,y) in the image I {
    For each scale, s, between S = {Smin … Smax} {
        Extract a local feature set, Is, from pixels in the neighborhood, a, of (x,y)
        Estimate the local PDF, Pd,s, using the histogram of Is
        Calculate the entropy, HD(s,a), from Pd,s
        Calculate the inter-scale saliency, WD(s,a), between Pd,s and Pd,s-1
    }
    For each scale, sp, at which the differentiated entropy attains a peak {
        YD(sp,a) = HD(sp,a) x WD(sp,a)
    }
}

Here are some limitations in [Kadir & Brady]’s method:

1. The selected scale is descriptor (image feature) dependent. Picking a different set

of features gives a very different set of icons.

2. It favors blob-like objects over objects with extended structure, even though the

extended structure is more useful for localizing the icon.

3. It fails when the image is binary (two-valued) or where complexity is minimal

(low entropy), as you would expect.

4. The inter-scale saliency measure W_D does not provide sufficient geometrical constraints on the icon selection, which results in icons with only one-dimensional signals. Icons with one-dimensional signals cannot be located along the other dimension – they will “slide” in that direction.

2.2.5.3 Scale Invariant Feature Transform

Lowe has developed a key point approach that is scale-invariant [Lowe]. This approach

identifies hundreds of key points in an image that are scale-invariant by picking locations

that are a maximum or minimum of difference-of-Gaussian functions over scales [Marr,

Hildreth, & Crowley]. A Laplacian pyramid data structure is generated by repeatedly convolving the input image with a Gaussian function, subtracting successive convolution images (levels in the pyramid) to get a difference-of-Gaussian image, which approximates a Laplacian, and scaling the filtered images [Burt & Adelson].

To give the flavor of this process, we start with an input image (the “bottom” of the

pyramid), and convolve it with a 7x7 Gaussian to produce image A. Then image B is

generated by convolving A with the same 7x7 Gaussian, image C by convolving B, and so on.

Each convolution “blurs” (low-pass filters) the image more, removing more and

more of the higher spatial frequency components – the characteristic scale (the scale of

the details) of the image gets larger and larger. To approximate a Laplacian convolved

image we subtract successive Gaussian filtered images so, for example, image B from

image A and image C from image B (See Figure 11, from [Burt & Adelson]).


Figure 11. An example of a Gaussian (top row) and Laplacian pyramid

Because we are subtracting two low-pass filtered images, the result is a band-pass filtered image. These Laplacian images therefore have a fairly narrow range of scale (about one octave) or image structure size. Because the details decrease as we repeatedly filter the image, we can sub-sample lower-frequency images and hence get a smaller data set. This sub-sampling is what gives a pyramid “shape” to the data structure.
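To give a concrete flavor of one level of this construction, the sketch below uses the OpenCV library, which is not part of this thesis, and whose parameters here (the 7x7 kernel, a sigma of 1.5, and the file name) are illustrative assumptions: the image is blurred twice with the same Gaussian, the two blurred images are subtracted to get a difference-of-Gaussian (band-pass) level, and the more-blurred image is sub-sampled to seed the next, coarser level.

#include <opencv2/opencv.hpp>

// Build one level of a difference-of-Gaussian (approximate Laplacian) pyramid.
// `level` receives the band-pass image; `next` receives the sub-sampled image that
// seeds the next, coarser pyramid level.
void dogLevel(const cv::Mat& input, cv::Mat& level, cv::Mat& next) {
    cv::Mat blurA, blurB;

    // Two successive blurs with the same kernel: blurB has a larger effective scale.
    cv::GaussianBlur(input, blurA, cv::Size(7, 7), 1.5);
    cv::GaussianBlur(blurA, blurB, cv::Size(7, 7), 1.5);

    // The difference of two low-pass images is a band-pass (approximate Laplacian) image.
    cv::subtract(blurA, blurB, level, cv::noArray(), CV_32F);

    // Little high-frequency content is left in blurB, so it can be sub-sampled by two
    // without losing much; this is what gives the data structure its pyramid shape.
    cv::resize(blurB, next, cv::Size(), 0.5, 0.5, cv::INTER_LINEAR);
}

int main() {
    cv::Mat img = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);
    if (img.empty()) return 1;
    img.convertTo(img, CV_32F, 1.0 / 255.0);

    cv::Mat dog, coarser;
    dogLevel(img, dog, coarser);   // repeat on `coarser` for further pyramid levels
    return 0;
}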

The Gaussian and Laplacian pyramids form a scale-space – we have made the scale or spatial frequency of features in the image explicit in this space (data structure), somewhat like a Fourier transform [Witkin].

Lowe looks for maxima or minima for each image point by comparing each pixel in the

Laplacian pyramid to its neighbors. First, a pixel is compared to its 8 nearest neighbors on the same level of the pyramid. If a maximum or minimum is found then the pixel is compared to the 9 pixels (3x3 pixel area) nearest to it in the next lowest level of the

pyramid. If the pixel is still a maximum or minimum, then this test is repeated with the

next highest level of the pyramid.

The points that “survive” these comparisons are considered to be key points – they show

maximum change (differences) over local space and scale. For each key point, a gradient

value and angle are computed. The gradient angle provides angular (rotation) invariance.

For each key point region image, descriptors are computed that are invariant to rotation

and scale. These descriptors are a way to compress an image region into few numbers

that are more-or-less unique to that region.

Each key point is stored as relative location to a pattern reference point (like the GHT

methods), a scale value, a rotation value and the descriptors. The pattern or template we

will look for is therefore made up of key points and associated information.

At search time, search keys (key points from the target image, to be used for searching

for the trained patterns), are generated and compared with the stored key points and associated data. Once a potential match is identified, a Hough Transform is used to search for keys that vote for a pattern at a position, angle, and scale. Pattern matches are verified by superimposing the model constellation of key points on the image, by transforming the stored key points according to the affine matrix values from the Hough transform. This is a hypothesis verification step. Figure 12 shows key points, characteristic scale, and gradient angle.


The main drawbacks to Lowe's approach are that “key points” are not stable across all locations, scales, and angles, and that there is the computational expense of searching a huge database for similar key points and then performing a Hough transform on these potential key points. The hypothesis verification is a comparatively small computational cost.

Figure 12. Each key point is shown as square, with a line from the center to one side of the square indicating orientation [Lowe]

Appendix 1 compares my method with the methods of Lowe and Schmid, showing the processing steps, the range of applications, and the implementation complexity.

Chapter 3

I will show how straight lines can be used as salient features (icons) to search for and

recognize patterns and objects.

3.1 Visual Search with Straight lines

3.1.1 Search Constraints for Machine Vision

Academic research on visual search and recognition typically uses busy scenes such as

street scenes or images of cluttered offices. Search and recognition in these scenes is a

demanding task, and has helped develop some of the algorithms I have described and

others. Search and recognition in academic research is usually not constrained by time,

accuracy, or reliability.

On the other hand, machine vision requires search and recognition that is fast (perhaps

less than 30 milliseconds), accurate to a fraction of a pixel position, and essentially 100% reliable. A search that takes a few seconds or makes occasional mistakes might be

acceptable for a paper (or thesis) but not for commercial machine vision.

As discussed above, we have to constrain the parts seen by a machine vision system, their

presentation (staging) and their lighting and view in order to get fast, reliable and accurate search and recognition. These constraints are expensive to implement and

require special knowledge that a user must acquire or hire.

Obviously there is a need for search that is fast and accurate, but that can also locate a part regardless of its orientation, position, and scale, so that staging and view constraints can be relaxed, reducing a user's cost and time. We would also like to be able to teach a part from a CAD file rather than presenting parts to the vision system for it to “learn”, as this saves time and is more accurate. I will show that using straight lines as the salient features (icons) for search and recognition is fast, accurate, and robust. Further, by being invariant to translation, rotation, and scale of objects (parts), my method can reduce the staging and view constraints, and make it possible for a robotic arm to track a part as the arm homes in on the part. I will show that my method can be taught from CAD files rather than images of the parts themselves.

3.1.2 Using Straight Lines as Icons

It is widely believed that straight lines are useless for visual search, because they contain only limited, one-dimensional information. For example, [Tuytelaars & Van Gool] argue that icons based on a straight line can “slide” along the line and so will not provide the position information needed to locate patterns and objects.

I have presented evidence from human vision that straight line segments are important features for search and recognition. This argues for considering similar features for machine vision.

Straight line segments have ends, in most cases, and have an angle – simply the angle of the line with respect to some external coordinate system. The length of the line gives its

39 of 106 scale and the center point of the line gives its location. In fact, a straight line has true and

stable angle and scale as to compare with [Lindeberg & Lowe]’s methods that define a

“characteristic” or “canonical” angle and scale.

For example, each of the four lines that make up a rectangle is bounded by two corners.

Each bounded line gives position, orientation, and scale information (see Figure 13) to direct attention and recognize the rectangle shape regardless of the location, angle, and size of the rectangle in the image (see Figure 14).


Figure 13. A rectangular object and its components of lines and corners



Figure 14. Each line contributes information to locate the rectangle (find a reference point) regardless of the rectangle position, angle and size.

3.1.3 Search with Lines

To illustrate my method, let’s use a simple, synthetic pattern – a rectangle. The rectangle

pattern we want to learn is shown in darker gray in Figure 15. The goal is to “train” on this shape and search for it in another image, as in Figure 16. The shape in Figure 15 has been translated, rotated, and scaled by unknown factors to end up as it looks in Figure 16.

Figure 15. Rectangle pattern for training


Figure 16. Shape translated, rotated, and scaled by unknown amounts.

Put a coordinate system on Figure 15, as shown in Figure 17.

Figure 17. Reference shape with coordinate system added

I extract straight-line segments from the reference shape as shown in Figure 18 and save the following information:

1. Contour points

2. End-points, center point, line angle, and line length for each straight line

3. Vector from each line's center point to an object reference point (user-defined

coordinate system center).
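Taken together, the saved training data for each straight-line icon might look like the following sketch (the structure and field names are my own, not those of the thesis's implementation):

#include <vector>

struct Point2D { double x, y; };

// One straight-line icon extracted from the reference (trained) pattern.
struct LineIcon {
    Point2D endA, endB;      // end points of the segment
    Point2D center;          // mid point: gives the icon's location
    double  angle;           // line angle: gives its orientation
    double  length;          // segment length: gives its scale
    Point2D toReference;     // vector from the center to the user-defined reference point
};

// The trained pattern: its contour points plus the extracted line icons.
struct TrainedPattern {
    std::vector<Point2D>  contour;   // used later for hypothesis verification
    std::vector<LineIcon> lines;     // the salient icons used to generate hypotheses
};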

Figure 18. Straight-line segments extracted from the reference pattern

To search for this shape (pattern) I extract the lines shown in Figure 19 from the image shown in Figure 16, saving the same information as with Figure 18 except the reference point – it is unknown.

Figure 19. Straight line segments extracted from target pattern

To match the learned object with the target object (pattern), we want to compute the amount of

translation, rotation, and scaling needed to transform the learned object such that its lines (or its contours) overlap the lines of the target (unknown) object. The coefficients of the transform matrix give the position, orientation, and scale of the target object, and the quality of the fit from learned to target is a measure of recognition.

Consider just the top line in the trained object and the longer top line in the target object.

To bring these two lines into alignment, we (1) translate the center of the trained object’s

line to the center of the target object's line, (2) rotate the trained object's line to match the

angle of the target object’s line, and (3) scale (stretch, in this case) the trained object’s

line (contour) to completely overlap with the target object’s line. This process is shown

in Figure 20.


Figure 20. Fitting a trained line to a target line (20a) Translation of trained line’s center point to target line’s center point (20b) Rotation of trained line to overlap with target line (20c) Scaling of trained line to fit the target line (20d) The trained pattern is transformed by the same amounts to overlap the target

45 of 106 If a trained-line center point is xp, yp, its length is lp, and its angle θp, and for a target-line

xs, ys, ls, θs, the transformation coefficients are:

δx = xp – xs. Translation in x

δy = yp – ys. Translation in y

δθ = θp - θs. Rotation

δs = lp / ls. Scale
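As a numerical illustration (the values here are hypothetical): if the trained line has center (100, 50), length 80 pixels, and angle 30°, and the target line has center (220, 140), length 160 pixels, and angle 75°, then δx = –120, δy = –90, δθ = –45°, and δs = 0.5.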

The transformation coefficients form our hypothesis for recognition and location: if we apply these coefficients to the lines from the trained object and they then match the lines in the target object, this validates the hypotheses that (a) the target object is a transformed version of the learned object (recognition), and (b) the proposed transform is correct (location).

Instead of using only the straight lines in the reference (the trained pattern) and target for

hypothesis verification, I match using a set of contour points in reference and target

objects. You can think of this as “overlaying” the reference object’s contour points onto

the target object. The quality of the fit or match quality is computed by summing the

Euclidean distances between corresponding reference and target contour points. I

transform the match scores so that high match values indicate a better match, as this is

what we are used to from match scores such as correlation coefficients.

Note that only a few points from the contours are needed to quickly verify whether we have a valid match based on the transform (hypothesis) from the pair of straight lines. In the above example I chose corresponding lines between the pattern and the target just to illustrate the idea. If the two lines are not corresponding lines, then the verification procedure produces a low match score and a new line from the target is examined. That is,

a new hypothesis (transform) is generated and verified, and so on until a high match score

is found.

Only one straight line is needed from the reference pattern to compute the transform and

so find and match target objects. The rectangle example reference pattern has four straight lines, any one of which can direct the “visual attention” to find the object in its new location, angle and scale. As with human vision, important features (straight lines in this method) are quickly extracted and then a second “what” stage sequentially tests for matching patterns.

Figure 21 shows a more complex pattern (object) being recognized and located. In this

example, the single straight line shown in green is the focus of our “attention” (21a).

Several hypotheses (transforms) are generated and rejected (21b through 21e) before a

hypothesis is verified (21f).



Figure 21. Several hypotheses are tested to recognize and locate a pattern.

(21a) The reference pattern with a straight line (focus of attention) shown in green.

(21b-21e) Attempts to verify the pattern by comparing two non-corresponding lines.

(21f) Perfect match verifies that the hypothesis is valid (corresponding lines)

3.1.4 The Cost of Hypothesis Generation

Consider the computation cost of generating transformation hypotheses. If a reference pattern (object) has N straight lines (icons) and the scene image has M straight lines (lines belonging to the object(s) we want to recognize and locate, plus noise lines), then the cost of using one reference line to generate hypotheses is O(M), and of using all reference lines is O(N x M).

If we train on K objects, then the computation cost for one reference line is O(M x K) and the total computation cost is O(N x M x K). These costs assume no prior knowledge of the transformation between two compared lines. However, we often do have a priori information that limits the range of transformations. For example, if we know that the target objects will only vary in scale within a range of 0.5 to 2.0, then many possible pairs of line matches (transformations) can be immediately eliminated. Another way to reduce the number of hypotheses is to use additional information about the lines, such as the color difference on each side of a line, or the angle between lines. These require additional assumptions about the objects and images, but can greatly reduce the number of hypotheses that we need to test and hence the computation time.
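As a rough illustration with hypothetical numbers: with N = 5 reference lines, M = 100 scene lines, and K = 2 trained objects, at most N x M x K = 1000 hypotheses are generated, each of which is cheap to test with the rapid verification described next.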

3.1.5 The Cost of Verification

A typical pattern contains 500 to 5000 contour points. Experimentally I find that 10 to

100 contour points are sufficient for rapid verification. Once a pattern with a high verification score is found, all pattern contour points are considered for final verification. The worst-case cost of rapid verification is on the order of 100 calculations – this is very fast, on the order of a few microseconds on today’s computers.

3.1.6 Gradient Angle of a Line

I define the gradient angle of a straight line to be the average gradient angle of the contour points that make up that line segment. The gradient angle is computed from the direction of the image intensity gradient at each contour point. Thus a line angle based on gradient information can range from –π to π. Without gradient information, we don’t know the “direction” of the line, so angles range only from –π/2 to π/2. We need gradient information in order to get the correct angle for the transformation hypothesis. Otherwise, we need to test the hypothesis for two angles, θ and θ + π. This doubles the search time!
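A minimal sketch of this computation, assuming the contour points carry precomputed image gradients (the ContourPoint type and its gx, gy fields are illustrative names, not the thesis code); the angles are averaged through unit vectors to avoid wrap-around at ±π:

#include <cmath>
#include <vector>

// Assumed point type: gx and gy hold the image intensity gradient at the point.
struct ContourPoint { double x, y, gx, gy; };

double lineGradientAngle(const std::vector<ContourPoint>& linePoints)
{
    if (linePoints.empty()) return 0.0;
    double sumCos = 0.0, sumSin = 0.0;
    for (const ContourPoint& p : linePoints) {
        double a = std::atan2(p.gy, p.gx);   // gradient direction of this point, in (-pi, pi]
        sumCos += std::cos(a);
        sumSin += std::sin(a);
    }
    return std::atan2(sumSin, sumCos);        // average direction, in (-pi, pi]
}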

3.1.7 The Number of Reference Lines Needed for Robust Search

Only one line from a pattern is needed to search for matching target objects. However, if the corresponding line in the target is occluded or corrupt, then no match can be found. I found that, for practical purposes, using five lines from the reference pattern gives robust results. My criterion for choosing these five lines is simply to pick the five longest lines in the reference pattern, the reasoning being that longer lines provide more accurate transformations. With five lines, we have sufficient redundancy to make this method robust.

3.2 Abstract Look at Visual Search with Straight Lines

I have shown that you can do practical visual search using straight lines. Looking at this abstractly, the end points of a line are the salient points. They contain all the line

information, except for the line gradient angle – for that the points need to be ordered.

The ordered end points are therefore the “icons” in this method, but we consider them as a line. End points of a straight line solve the combination/correspondence problem.

The object’s contour points (a sampled version) are the descriptor of the object.

This view suggests other search techniques that I have not yet explored.

In the next two chapters we turn to more practical issues of how to quickly extract straight lines from images and how to quickly test transformation hypotheses.

Chapter 4

In this chapter I introduce a new, fast method for extracting straight lines from images. I

call it Curvature-Based Straight Line Extraction (CBSLE). I discuss its performance and

compare it with a standard Split and Merge method.

4.1 Curvature Based Straight Line Extraction (CBSLE)

The curvature at a point on a curve is defined as the change in tangent angle, θ, with respect to distance, s, along the curve: κ = δθ/δs


Figure 22. Curvature at a point

A contour point is considered to belong to a straight line if its curvature value is near zero

or the osculating circle’s radius, 1/κ, is large.

I compute a measure of curvature by the perpendicular distance, h, between a contour

point and a virtual line (called a “stick”) that spans between “before” and “after” points

as shown in Figures 23 and 24. The number of contour points between the chosen contour point and the “before” and “after” points is the same, and is called the “span”. h is a scale dependent measure of curvature on quantized curves. It approximates the analytic definition of curvature as the span distance decreases. Quantization and noise in digital images prevent the use of small spans (scales). Instead, I use the span as a free parameter that sets the scale of measure, where larger spans “average out” more details of the digital curve. This method follows that developed by [Melikian, et al.].

Adjacent contour points with small curvature are grouped to form straight-line segments.

The points in each straight-line segment are then fitted to an analytic straight line (y = mx

+ b) using linear regression. So each straight line segment consists of its individual edge points and an equation (y = mx + b) where the slope, m, is the orientation of the line. The mid-point of the segment is the average value of the segment’s end points, and is taken as the line position.
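A minimal sketch of the per-segment fit, under the assumption that the segment is stored as an ordered list of points (the Pt and SegmentFit types are illustrative, and the fit assumes a non-vertical segment with at least two points):

#include <vector>

struct Pt { double x, y; };

struct SegmentFit { double m, b, midX, midY; };   // y = m*x + b plus the segment mid-point

SegmentFit fitSegment(const std::vector<Pt>& seg)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(seg.size());
    for (const Pt& p : seg) {
        sx += p.x;   sy += p.y;
        sxx += p.x * p.x;   sxy += p.x * p.y;
    }
    SegmentFit f;
    f.m = (n * sxy - sx * sy) / (n * sxx - sx * sx);   // least-squares slope
    f.b = (sy - f.m * sx) / n;                         // least-squares intercept
    f.midX = 0.5 * (seg.front().x + seg.back().x);     // mid-point = average of the end points
    f.midY = 0.5 * (seg.front().y + seg.back().y);
    return f;
}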


Figure 23. Computing h to find straight line segments



Figure 24. h (curvature measure) along a contour, showing curves (a,b,c) and straight line (d).

Here is the computation for h:

δx_i := x_(i+span) − x_(i−span)        δy_i := y_(i+span) − y_(i−span)

δx1_i := x_(i+span) − x_i              δy1_i := y_(i+span) − y_i

θ_i := atan2(δx_i, −δy_i)              λ_i := atan2(δx1_i, −δy1_i)

side_length_i := sqrt((δx1_i)² + (δy1_i)²)        α_i := θ_i − λ_i

h_i := side_length_i · sin(α_i)

h_i is the h value for the contour point at index i; (x_i, y_i) is the contour point being tested for belonging to a straight line.
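The same computation as a minimal code sketch (the Pt type and computeH name are illustrative, not the thesis implementation); the caller must keep i − span and i + span inside the contour:

#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

// h for the contour point at index i, following the formulas above. The result is
// signed, so its magnitude is what gets compared against the threshold.
double computeH(const std::vector<Pt>& contour, std::size_t i, std::size_t span)
{
    const Pt& before = contour[i - span];
    const Pt& point  = contour[i];
    const Pt& after  = contour[i + span];

    double dx  = after.x - before.x, dy  = after.y - before.y;   // the "stick"
    double dx1 = after.x - point.x,  dy1 = after.y - point.y;

    double theta  = std::atan2(dx,  -dy);
    double lambda = std::atan2(dx1, -dy1);
    double sideLength = std::hypot(dx1, dy1);

    return sideLength * std::sin(theta - lambda);   // perpendicular distance h
}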

If we approximate a small contour segment with a circular arc then, as in Figure 25, we can write:

(s/2)² + (R − h)² = R²

Solving for R:

R = (s² + 4h²) / (8h)

Then the curvature κ is equal to:

κ = 8h / (s² + 4h²)

This shows that in the limit, h is sufficient to compute curvature.
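For example, with hypothetical values of s = 10 pixels and h = 0.5 pixels, R = 101/4 = 25.25 pixels and κ = 4/101 ≈ 0.0396, consistent with κ = 1/R.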


Figure 25. Approximating a curve with a circular arc.


The algorithm to extract straight lines using the h method is as follows:

Step I

- Extract contours using an edge detector such as [Canny, 1986]

- Select a span, n, to set the scale of measure

- Select a threshold value for calling a curve straight

- Loop: “crawl” each contour in the object

Loop: for every contour point cpi in a contour

Get cpi-n, cpi+n

Compute h, as above

if h < threshold value

Mark cpi

End

End

Step II

Loop: for every contour in the object

Loop: for every contour point marked in Step I

Collect and save connected marked points as a single line segment.

End

End

Loop: for each line segment

- First and last points are the ends of the segment

- Average of the first and last points is the center (location) of the segment

- Least square fit points in the segment to compute m,b for y = mx + b.

- Compute average gradient angle of all contour points (line orientation)

End
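As an illustration only (not the thesis implementation), the marking and grouping steps might be sketched as follows, reusing the Pt type and computeH function sketched above; span and threshold are the free parameters selected in Step I:

#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

// computeH() is the per-point curvature measure sketched above.
double computeH(const std::vector<Pt>& contour, std::size_t i, std::size_t span);

std::vector<std::vector<Pt>> extractStraightSegments(const std::vector<Pt>& contour,
                                                     std::size_t span, double threshold)
{
    std::vector<std::vector<Pt>> segments;   // each inner vector is one straight-line segment
    std::vector<Pt> run;
    for (std::size_t i = span; i + span < contour.size(); ++i) {
        // Step I: mark a point if |h| is below the threshold.
        bool marked = std::fabs(computeH(contour, i, span)) < threshold;
        if (marked) {
            run.push_back(contour[i]);       // Step II: extend the current run of marked points
        } else if (!run.empty()) {
            segments.push_back(run);         // close the run as a single line segment
            run.clear();
        }
    }
    if (!run.empty()) segments.push_back(run);
    return segments;
}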

The following information is then available for each line segment:

Pend_a = the first end point of line segment (from the direction of crawl)

Pend_b = the last end point of line segment (from the direction of crawl)

Pcenter = the center of line segment = (Pend_a + Pend_b)/2.

Pi = contour points in this line segment

AveGradAngle = the average gradient angle of the contour points, in the range –π to π.

The AveGradAngle is computed from the slope of the line and the direction of the intensity gradient along the line. The slope provides an angle defined from –π/2 to π/2, but AveGradAngle has a range of –π to π. We need AveGradAngle to get the proper transform coefficients.

Figures 26 and 27 show the straight lines extracted by my CBSLE algorithm.

Figure 26. Left column: contours. Right column: Extracted straight lines


Figure 27. Straight lines extracted using CBSLE method.

4.1.1 Effect of Span Value

The Span value is the number of contour points (pixels) to go “backward” or “forward” from a contour point that we are examining for curvature. The length of the Span (the number of pixels in the arc) sets the scale of measurement. As the length of the span increases, details – higher spatial frequencies – of the curve are lost [Gonzalez &

Woods]. In a sense, the Span acts as a low pass filter to reduce digital noise and to set the scale of measurement.

Longer Span values will cause the algorithm to miss short line segments, and shorter Span values will increase the number of short line segments found in a slowly curving contour. In practice, a Span of three pixels (7 contour points from beginning to end of the contour segment) works with most contours.

Changing the Span value affects the locations of the ends of a line segment. Figure 28 shows how the straight line segments “slide” as the Span value varies.

Figure 28. The effect of Span on straight line ends. Notice the slight change in start and end locations of each line segment.

4.2 Split and Merge Method for Straight Line Extraction

I compare the CBSLE method to the Split and Merge method for straight line extraction.

Here is pseudo-code to extract straight lines using the Split and Merge method, as outlined in [Gonzalez & Woods] and shown graphically in Figure 29:

(a) Take a contour and start from its end points

(b) Draw a virtual line between the end points

(c) Loop through every point of the contour and find the maximum perpendicular

distance (MPD) from a contour point and the virtual line.

(d) If the MPD is greater than “fit error” then split the contour into two pieces and

repeat for every piece.

Otherwise

A valid line is present; get its data.
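A minimal sketch of the recursive split step (the merge pass is omitted), under the assumption that the contour is an ordered list of points; the type and function names are illustrative, not the thesis code:

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Pt { double x, y; };

// Perpendicular distance from p to the infinite line through a and b.
static double perpDist(const Pt& p, const Pt& a, const Pt& b)
{
    double dx = b.x - a.x, dy = b.y - a.y;
    double len = std::hypot(dx, dy);
    if (len == 0.0) return std::hypot(p.x - a.x, p.y - a.y);
    return std::fabs(dx * (p.y - a.y) - dy * (p.x - a.x)) / len;
}

// Recursive split: if the maximum perpendicular distance (MPD) exceeds fitError,
// split the piece at the worst point; otherwise report [first, last] as one line.
void splitContour(const std::vector<Pt>& contour, std::size_t first, std::size_t last,
                  double fitError, std::vector<std::pair<std::size_t, std::size_t>>& lines)
{
    std::size_t worst = first;
    double maxD = 0.0;
    for (std::size_t i = first + 1; i < last; ++i) {
        double d = perpDist(contour[i], contour[first], contour[last]);
        if (d > maxD) { maxD = d; worst = i; }
    }
    if (maxD > fitError) {
        splitContour(contour, first, worst, fitError, lines);
        splitContour(contour, worst, last, fitError, lines);
    } else {
        lines.emplace_back(first, last);   // a valid straight-line piece; get its data
    }
}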



Figure 29. Split and Merge for straight line extraction.

Figure 30 shows the straight line segments found by the Split and Merge method. Notice the large number of short line segments along the curved regions.

Figure 30. Straight line extraction using Split and Merge. Blue crosses are end points of a line segment.


What I find interesting about the Split and Merge method is that a curved line generates

many small straight line segments, whose lengths are inversely related to the curvature value.

The curve “modulates” the line segments length – the more curved a line is, the shorter

its line segments are.

Figure 31 shows the Split and Merge method applied to the same image as Figure 27.

Figure 31. Result of straight line extraction using Split and Merge method.

You see that there are more and smaller straight lines around areas of high curvature.

While this provides a more visually “complete” outline of edges (intensity contrasts),

these small lines are not robust due to quantization and noise and there are too many for easy testing – the “compression” of the edge information is low. I conclude that the

CBSLE method gives fewer but more significant lines that are faster and more robust for

object detection and location using hypothesis testing (see Figures 32 and 33).


Figure 32. Contour points for an outdoor scene.

Figure 33. Straight lines from the scene above using CBSLE. Note the compression of information.

4.3 Line Extraction Performance

On a typical 640 x 480 (width x height) image of a part, the CBSLE algorithm takes about 26 milliseconds, and 56 milliseconds on a busy image (2 GHz Pentium III). The timing includes the time to compute the Canny edge image, the edge crawling, and the contour extraction. The time to do the line extraction from the contour points is only a few milliseconds; basically it is looping through a few thousand “marked” points. The Split and Merge line extraction took a similar amount of time.

4.3.1 End Points Position Accuracy

4.3.1.1 Effect of Object Size on End Point Accuracy

As the target object size changes, the scaled position of the straight line end points

changes. To explore the uncertainty in locating line segment end points, I ran the CBSLE

algorithm on the pattern in Figure 34 and recorded the position of an end point (blue

cross) as I changed the size of the object from ½ to twice its initial size. That is, I varied the

scale of the object from 0.5 to 2.0, where a scale of 1.0 is the normal size.

Figure 34. Image used to measure the stability of end point as object size varies

For each size of the object, the line end position (effectively the line length) was

measured and scaled by the size change factor. As expected, the error in end point

position increases with changes in scale away from the normal length (scale) of 1.0. Figure 35 plots the error (in pixels) between expected and actual end point position as a function of scale. At a scale of 2.0 the error is about 5 pixels, and it is about 2.5 pixels at a scale of 0.5. During verification (hypothesis testing), this error (uncertainty in transformation position and scale) is greatly reduced by an optimization step.

Figure 35. CBSLE end point error (in pixels) as a function of scale

I ran the Split and Merge based line extraction on the same set of images (see Figure 36). The error has a similar maximum magnitude but is more randomly distributed than the U-shaped error curve of the CBSLE algorithm. It is therefore unlikely that a simple optimization step (hill climbing) to reduce this error would work well with the Split and Merge method. It is worth mentioning that the image I used has gradually changing curvature at the end of the line; an end point next to a sharp corner would have a more stable location.


Figure 36. Split & Merge end point position error (in pixels) as a function of scale.

4.3.1.2 Effect of Noise on End Point Accuracy

To get a sense of the effect of random noise on end point location, we could add synthetic noise, perhaps Gaussian distributed. Rather than guessing what noise distribution to use,

I captured twenty images of the object shown in Figure 37 in a low-light condition such that the electronic noise of the camera is pronounced. The error in the location of the end point is below a pixel. Figure 38 shows the result.

Figure 37. The image used to measure the stability of end points to noise.


Figure 38. The effect of noise (sample images 0 to 19) on end point location

Chapter 5

This chapter discusses the second step of my method: hypothesis generation and

verification. Verification is a “what” step, in analogy with human vision.

5.1 Hypothesis Generation

As discussed in Chapter 3, a straight line (the reference line) from the reference pattern is

compared to target lines from the target pattern or scene. The transformation required to

match the reference line to the target line is the hypothesis that the reference line and

target line represent the same line feature in an object or pattern that has been translated,

rotated and scaled. The hypothesis is nearly an affine transform.

Here is the transformation code fragment used to generate the hypothesis:

// COMPUTING SCALE
// Scale is scene line-length divided by pattern line-length.
scale = aSceneLine.length / aPattLine.length;

// COMPUTE ROTATION
rotation = aPattLine.trueTheta - aSceneLine.trueTheta;

// COMPUTE TRANSLATION
translationX = aPattLine.xmid - aSceneLine.xmid;
translationY = aPattLine.ymid - aSceneLine.ymid;

// COMPUTING SCENE PATTERN LOCATION
// Translate the pattern mid-point to the scene mid-point, rotate about the scene
// mid-point, and compute where the scene pattern point is.
SceneRefX = (xref - translationX - aSceneLine.xmid)*cos(rotation)
          - (yref - translationY - aSceneLine.ymid)*sin(rotation) + aSceneLine.xmid;
SceneRefY = (xref - translationX - aSceneLine.xmid)*sin(rotation)
          + (yref - translationY - aSceneLine.ymid)*cos(rotation) + aSceneLine.ymid;

// Scale it.
SceneRefX = (SceneRefX - aSceneLine.xmid)*scale + aSceneLine.xmid;
SceneRefY = (SceneRefY - aSceneLine.ymid)*scale + aSceneLine.ymid;

We can reduce the above steps to:

scale = aSceneLine.length / aPattLine.length;              // Scale
rotation = aPattLine.trueTheta - aSceneLine.trueTheta;     // Rotation

// Compute translation
double costrans = cos(rotation) * scale;
double sintrans = sin(rotation) * scale;
double X = (xref - aPattLine.xmid);
double Y = (yref - aPattLine.ymid);
xScene = X*costrans - Y*sintrans + aSceneLine.xmid;
yScene = X*sintrans + Y*costrans + aSceneLine.ymid;
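For illustration, the reduced computation can be wrapped as a function that maps any pattern point into the scene under a line-pair hypothesis; this is the mapping the verification step applies to the pattern’s contour points (the Line struct and function name are assumed here, not the thesis code):

#include <cmath>

struct Line { double xmid, ymid, length, trueTheta; };

void mapPatternPointToScene(const Line& aPattLine, const Line& aSceneLine,
                            double px, double py, double& xScene, double& yScene)
{
    double scale    = aSceneLine.length / aPattLine.length;
    double rotation = aPattLine.trueTheta - aSceneLine.trueTheta;
    double costrans = std::cos(rotation) * scale;
    double sintrans = std::sin(rotation) * scale;
    double X = px - aPattLine.xmid;   // pattern point relative to the pattern line's mid-point
    double Y = py - aPattLine.ymid;
    xScene = X * costrans - Y * sintrans + aSceneLine.xmid;
    yScene = X * sintrans + Y * costrans + aSceneLine.ymid;
}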

During the search phase for the object, the hypothesis generating algorithm selects

(attends to) a line from the reference pattern (starting with the longest straight line of the

pattern) to compare to a line from the scene (also starting with the longest line). This

comparison suggests a transformation hypothesis that could map the reference pattern to

the target pattern. Then this hypothesis is verified and accepted or rejected, as discussed

in the next section. If rejected, another hypothesis is generated and verified, and so on

until a hypothesis is accepted.

In the current implementation, the only constraint applied during hypothesis construction is the length of the line. This is applied first by starting the hypothesis generation from the longest lines and working towards shorter lines, and second by skipping a line pair if the ratio of the lengths of the reference and target lines is outside the expected range.

These two constraints are quite minimal. Stated another way, we are assuming very little

about the structure of the reference and target patterns. Adding constraints to selecting

line pairs could greatly reduce the number of hypotheses that have to be verified, but at

the expense of assuming more about the objects being recognized and located. Here are

some other constraints that could help reduce the number of hypotheses:

1. In color images, use the average color difference across the line as a “label” or tag for

selecting lines for hypothesis generation. This is probably a trick used by human vision.

2. Create a line profile (graph of gray-level pixel values) along the perpendicular bisector

of the straight line. Find the extremum (brightest point) in the profile and use the

distance between the straight line and the extremum (along the line profile) as a “label”

or tag for selecting line pairs for hypothesis generation. The extremum could be

computed by any of the methods suggested by [Brady & Kadir], [Tuytelaars &Van

Gool], or [Tell & Carlsson].

3. In analogy with search methods proposed by Schmid, Lowe, and Brady, use the entire

line profile (as described in 2. above) as the “descriptor” or “key” to use for searching in

a database for matching lines.

5.2 Verification

I define an object purely as contours (connected sets of edge points) at an initial location, angle, and size (uniform scale). When an object translates, rotates, or changes in size, only its contours move. This is unlike Brady, Lowe, Schmid, and others, who use image areas to define an object.


The verification module takes a transformation hypothesis and applies it to the reference

pattern to overlay the pattern on the target edge image. The edge image, as computed by

the Canny operator, has only edge points, not contours (connected sets of edge points).

The verification then computes the match in the fit of the transformed contour points and

the target image edge points. A high match score supports the hypothesis that the target

pattern is a transformed version of the reference pattern.

Here is the verification process in pseudo code:

Set a distance threshold, n (typically 6 or 7)

Match Score = 0

Initialize a Score LUT, of size n

Loop: For all points, p, in the reference contours

Find the distance along the gradient direction, d, from contour point, p, to a target

edge point.

If d <= n then Match Score += Score LUT(d)

End

The values of the Score LUT (Look-Up Table) are empirically determined, typically:

Distance, d    Resulting Score Value
1              1.00
2              0.99
3              0.98
4              0.94
5              0.90
6              0.86
7              0.80

You can see that the scores fall off more rapidly as d increases.

The Match Score thus is larger, the closer the distance between the reference object’s

contour points and the target’s edge points. I scale the Match Score by dividing it by the

total number of points in the reference object’s contour points, to give a percent match.

I first use a small number of pattern contour points – about 10% is sufficient – to quickly

test for a possible match. Once a possible match is found (a match score above 80%), the verification is repeated on the entire set of contours to get a more accurate score for this

reference object and transformation hypothesis.
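A minimal scoring sketch, under the assumption that the distance d has already been measured for each transformed contour point (the function and variable names are illustrative, not the thesis code):

#include <vector>

// 'distances' holds, for each transformed reference contour point, the distance d
// (in pixels, measured along the gradient direction) to the nearest target edge
// point; points with no edge within the threshold can hold any value larger than n.
double matchScore(const std::vector<int>& distances, int n /* threshold, at most 7 */)
{
    // Score LUT from the table above; index 0 is unused.
    static const double scoreLUT[8] = {0.0, 1.00, 0.99, 0.98, 0.94, 0.90, 0.86, 0.80};
    if (distances.empty()) return 0.0;
    double score = 0.0;
    for (int d : distances)
        if (d >= 1 && d <= n) score += scoreLUT[d];
    return score / distances.size();   // scaled to a fraction of all contour points
}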

A final step is needed to exactly match the location and scale of the transformed pattern

with the target pattern. While the angle estimate is precise due to the least squares line

fit, the location has about 5 pixel error (see section 4.3.1.1) and the scale has about 10%

error. To reduce this error to a fraction of a pixel, I use a simple hill climbing approach

to “zero in” on the location and scale.

The hill climbing method searches for a higher match score by stepping left, right, up and

down by 2.5 pixels and by enlarging and reducing (scaling) by 5%. If a better position

and scale is found, the step size for the position and scale are halved and the search

repeated until no improvement is found. This very quickly reaches a match that is within

0.078 pixel in position (less than 1/10 of a pixel) and within 0.078% of the best scale.
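One reading of this refinement, sketched with assumed types and a caller-supplied score function (not the thesis code):

#include <functional>

struct Pose { double x, y, scale; };

Pose refinePose(Pose p, const std::function<double(const Pose&)>& matchScore)
{
    double posStep = 2.5;      // pixels
    double scaleStep = 0.05;   // 5 percent
    double best = matchScore(p);
    for (;;) {
        // Try the six neighbors: left/right, up/down, enlarge/reduce.
        const Pose candidates[] = {
            {p.x + posStep, p.y, p.scale}, {p.x - posStep, p.y, p.scale},
            {p.x, p.y + posStep, p.scale}, {p.x, p.y - posStep, p.scale},
            {p.x, p.y, p.scale * (1.0 + scaleStep)}, {p.x, p.y, p.scale * (1.0 - scaleStep)},
        };
        Pose bestCand = p;
        double bestCandScore = best;
        for (const Pose& c : candidates) {
            double s = matchScore(c);
            if (s > bestCandScore) { bestCandScore = s; bestCand = c; }
        }
        if (bestCandScore <= best) break;   // no improvement: stop
        p = bestCand;
        best = bestCandScore;
        posStep *= 0.5;                     // halve the steps and repeat
        scaleStep *= 0.5;
    }
    return p;
}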

Figure 39 shows verification attempts and the closest match, which then has to be “zeroed in” by the hill climbing algorithm. In fact, comparing the intersection points of two pairs of lines from the reference pattern with the corresponding target lines would eliminate the need for hill climbing.


Figure 39. Several verification attempts and one successful but not exact match, where hill climbing is needed.

Chapter 6

In this chapter, I present the results of testing my algorithm on synthetic images, manufactured parts, and on some unusual machine-made objects. The only constraints the method has are that the parts or objects be rigid (no shear, etc.) and have one or more straight lines to focus attention for transformation hypotheses. This, of course, rules out dogs, grass, and natural scenes in general.

I group the testing into abstract and real world testing. Abstract testing quantifies the behavior and performance (timing) of my algorithm as a function of the number of

“distraction” lines in the image, position of the pattern, rotation of the pattern, scale of the pattern, and combined rotation and scale of the pattern. Real world testing gives a qualitative appreciation of what the method can do and where it breaks down.

6.1 Abstract Testing

I used the simple pattern shown in Figure 40 as a “base” pattern, and focus mainly on the

performance of the hypothesis generator and verification stage of the algorithm.

Figure 40. Simple pattern used to perform abstract testing.


The size of this pattern is 190 x 115 pixels and it has eight straight lines found by the CBSLE algorithm. Hypothesis generation is set to search for all angles and for scales from 0.5 to 2.2 times the base pattern size.

On a 2 GHz Pentium III, the algorithm took a total of 5 milliseconds:

• 2.4 milliseconds for CBSLE – Canny edge, extract contours, and straight lines

• 2.6 milliseconds to compute Hypothesis generation and Verification

These times are without any short cuts, such as early termination if the pattern is found with a score above some level. Turning on early termination for scores above 80% reduces the time for hypothesis generation and verification to less than 1 millisecond.

6.1.1 Speed vs. Background Straight Lines

Next, I ran the same pattern but with more objects, mostly made of straight lines, in the background (see Figures 41 and 42). If I had used objects with no straight lines, then no timing changes would be observed. Increasing the number of background lines mainly increases the hypothesis generation and verification time.


Figure 41. Searching for the pattern of Figure 40 in the presence of background lines “distractions”.

Figure 42. Searching for the pattern of Figure 40 in the presence of more background lines.

Figure 43 shows search time as a function of background noise (number of background lines). The roughly linear increase is expected because the worst case search time cost is O(N x M), where N is the number of straight lines in the reference pattern and M is the number of straight lines in the target image. The time is in milliseconds, and the solid blue line is a linear regression fit to the sample points.

Figure 43. Search time as function of the number of lines in the background

The test is performed using all the lines in the reference pattern and target image, with early termination disabled. When I enable early termination for match scores above 80%, the search time drops to about 2-3 milliseconds.

6.1.2 Speed vs. Target Position

The target position was moved within the image. As expected, this had no effect on search time – the search time is constant with translation.

6.1.3 Speed vs. Angle

I electronically rotated the pattern in Figure 40 through 360 degrees in steps of 10 degrees at constant scale. Search time for rotated target patterns was the same regardless of the angle of rotation.

6.1.4 Speed vs. Scale

I electronically scaled the pattern in Figure 40 through sizes from 0.5 to 2.0 in steps of 0.1 at constant angle. Search time for scaled target patterns was the same regardless of the scale value.

6.1.5 Speed vs. Scale and Angle

I electronically rotated and scaled the reference (trained) pattern, using the same ranges as in sections 6.1.3 and 6.1.4. Search time was the same regardless of the scale and angle of the reference object.

This algorithm has a constant search time over translation, rotation and scale. To bring this into perspective, consider normalized correlation as discussed in Section 2.2.2. If a search based on normalized correlation takes a unit of time to find a pattern at any translation, and if that pattern is allowed to rotate to any angle and scale from 0.5 to 2.0, then a normalized correlation would take 540 units of time to perform. My algorithm takes 1 unit of time, regardless of the position, angle and scale.

6.2 Real World Testing

We now provide a qualitative “feel” for the algorithm’s performance by testing it on

some real, rather than synthetic, images.

Figure 44 shows stop sign images. The top four images are from the same scene and the

bottom two images are from different scenes. The images are taken at different distances from the stop sign to show a range of sizes (scales) from 1.0 to 2.3. The stop sign in the top left image is used as the reference (training) pattern (scale = 1.0). This pattern is

78 x 78 pixels, but any size should do. All images are 640 x 480 pixels. These scenes have many objects with lots of straight lines. The algorithm performs well, as you can see from the overlaid coordinates on the stop signs.

Figure 45 shows a similar type of street scene as in Figure 44 but for a “Talbots” sign.

The reference pattern size is 280 x 142 pixels and the scene images are 640 x 480 pixels.

The “Talbots” signs have a size range from 1.0 to 0.52 times. The algorithm performed

well, as you can see from the overlaid coordinates on the signs.

Figure 46 shows the back of a truck moving on the highway on a foggy day. This test demonstrates the feasibility of using this algorithm in an unmanned vehicle convoy. The algorithm “locks” onto the writing on the truck and tracks it. The size range was from 1.0 to 0.55. I electronically generated additional sets of images to extend the scale of the writing by a

factor of 2.0. These images are not shown, but the algorithm also performed well on

them.


Figure 47 shows images of a label on a package of CD-R disks. The label images show some translation and rotation but mostly changes in scale. This test shows that, using this algorithm, you can teach on one type of label and search for a similar label on a different product with a different size. I trained on three independent patterns from the label and

searched on the entire image. There are very few straight lines in these images but the algorithm performed well.

Figure 48 shows lanes on the highway. This shows how the CBSLE algorithm might be used for lane tracking. The straight line extractor takes 41 milliseconds on these 640 x

480 images (using a 2 GHz, Pentium III laptop). This means this algorithm could be used

to track lanes at about 20 images (samples) per second.

Figure 49 shows labels on soap bottles. The algorithm has no problem recognizing the

logo, even though the bottles in the first three images are physically larger than in the last

three. Using this algorithm could eliminate the need to train on individual product types.

Figure 50 shows images of a metal bracket, as it might be presented on a conveyer belt. The algorithm successfully locates the bracket regardless of its orientation angle and size.

Figure 51 shows logos representing different countries. We train on the “France” logo and the algorithm performs well, even though there are many other similar, straight-line-rich patterns.


Figure 44. Recognizing and locating stop signs. Trained pattern is in upper-left image.


Figure 45. Recognizing and locating a “Talbots” sign. Top, left image is the training image.

Figure 46. Tracking the writing on the back of a moving truck. Pattern size is 231 x 135 pixels, and the scale range is from 1.0 to 0.55.

Figure 47. Recognizing three patterns in a package label.

Figure 48. Straight line extractor applied to lane tracking.

Figure 49. Recognizing and locating a label’s logo on two types of bottles.

Figure 50. Recognizing and locating manufactured parts – same part at different working distances.

Figure 51. Recognizing the “France” pattern within other straight-line-rich patterns.

6.3 Feasibility of Teaching with Synthetic Model

As I indicated in previous chapters, I will demonstrate the feasibility of using a CAD file (as a synthetic model) to teach objects and then search for those objects in real images. I chose a simple object, a v-block, and generated a bitmap of an outline of its CAD file, as in Figure 52.

Figure 52. Synthetic Model of an object as a bitmap image.

The question is: can I teach on this bitmap and search for the real object in a real image, as in Figure 53?


Figure 53. Real image of an object (v-block) to search for, taught from the synthetic bitmap of Figure 52.

Before exploring the results, let’s look at the output of the straight line detector for both the synthetic and the real image (see Figures 54 and 55).

Figure 54. Result of straight line extractor on the synthetic bitmap. Notice the double straight lines on each “drawing” line.


Figure 55. Result of the straight line extractor on the real object corresponding to the synthetic bitmap. Notice the single straight line on the boundary of the object.

Comparing Figure 54 with Figure 55 suggests that if a match did occur, the match score would be about 50%, because the synthetic bitmap has two straight lines per “drawing” line while the real object has only one. The test confirmed this.

This suggests that when teaching on a synthetic bitmap, the user needs to remove one of the “double” lines and keep the other (teaching on the “inside” or the “outside”). The contrast between the object and the background decides which straight line to keep (the inside or the outside).

Another way to solve this problem of double lines is to teach on a “filled” synthetic bitmap whose contrast matches that of the real object against its background, as in Figure 56.


Figure 56. “Filled” synthetic bitmap with matching contrast to the object in the real image (Figure 53 & 55). The object is brighter than the background.

Teaching on the image in Figure 56 and searching on images of the real object gave successful matches with scores of about 94-98%. I trained on the pattern in Figure 57 and searched for it in the images in Figure 58.

I have shown how the algorithm of this thesis can be used to teach on objects in their synthetic form as bitmaps and search for these objects in the real world. Of course, this is possible because of the rotation and scale invariant nature of the algorithm.


Figure 57. The synthetic bitmap that is trained to search for the object in the images in Figure 58.

Figure 58. Search for the v-block that was taught using the synthetic bitmap of Figure 57.

Conclusion

1. I developed a novel method of visual search with a practical implementation. My method is translation, rotation, and scale invariant, with a search time that is invariant to these degrees of freedom.

2. I have demonstrated that the concept of directing attention and testing hypotheses is a viable approach for visual search. We know it is viable for human vision.

3. My approach to visual search is simple and requires minimal assumptions. This is a paradigm shift from existing methods, which require meticulous crafting to make them work.

4. I introduced a new, simple, and practical way (milliseconds of process time) to

extract straight lines from an image.

5. I developed the idea of using straight lines as salient icons. This can lead to using other types of “icons” with geometric/structural relations.

6. I have shown that it is feasible and practical to teach from a synthetic model of an object and search for the real object.

7. As a practical contribution, this method could be used for the autonomous guidance of a vehicle convoy. This can be done by having successive vehicles lock onto a pattern on the back of the preceding vehicle. My method can also help guide vehicles by finding the straight edges of road markings at the middle and at the edge of the road. The method is fast enough that 20 samples per second can be used to close a guidance loop.

Appendix 1. Comparison with Other Methods

Process Steps (columns: Melikian, Lowe, Schmid)

Step 1:
- Melikian: Extract straight lines from virtually any scale. The scale and angle of the line (“icon”) are known now. The end points of a line are true and perspective invariant.
- Lowe: Search over scale and all locations to find hundreds of key points. The characteristic scale and angle are not known yet.
- Schmid: Extract corners as key points using a multi-scale corner detector. The characteristic scale and angle are not known yet.

Step 2:
- Melikian: (no corresponding step)
- Lowe: For every key point, a detailed model is fit to determine the location, angle, and scale (uniform) of each key point.
- Schmid: For every key point, use an iterative algorithm to compute/modify the location, angle, and scale (uniform) of each key point.

Step 3:
- Melikian: (no corresponding step)
- Lowe: For every key point, grab an image from the trained pattern with size corresponding to the scale and angle (from the above step). From that image compute a descriptor to use to recognize key points from scene images.
- Schmid: Same as Lowe.

Step 4:
- Melikian: The pattern model is the location, scale (uniform), and angle of the straight lines (“icons”) and the contour points.
- Lowe: The pattern model is the location, scale (uniform), angle, and descriptor for each key point (“icon”).
- Schmid: Same as Lowe.

Step 5:
- Melikian, Lowe, Schmid: During search, repeat the steps above on the scene image.

Step 6:
- Melikian: (no corresponding step)
- Lowe: For every key point in the model, find the best match among the scene’s key points by comparing their descriptors.
- Schmid: Same as Lowe.

Step 7:
- Melikian: Generate a hypothesis from one line from the reference pattern and one line from the scene.
- Lowe: Generate a hypothesis using Hough, since key points are not stable.
- Schmid: Generate a hypothesis from 2 or 3 pairs of corresponding key points.

Step 8:
- Melikian, Lowe, Schmid: Verify.

Limitations:
- Melikian: Does not work on patterns that do not contain straight lines.
- Lowe: Not expected to work on form-based objects.
- Schmid: Not expected to work on form-based objects.

Applicability:
- Melikian: Works on simple or complex patterns described by either form/shape or area.
- Lowe: Does not work on simple patterns.
- Schmid: Does not work on simple patterns.

Implementation:
- Melikian: Simple.
- Lowe: Complex.
- Schmid: Complex.


Figure 59. The Salient icons (Straight lines) of this thesis. It shows the location (mid point), the scale (distance between end points), and angle (the angle of the straight line segment).

References

Agin, G. J., An Experimental Vision System for Industrial Applications. Stanford Research Institute, Technical Note 103, Menlo Park, CA, (June 1975)

Agarwal, S., Awan A., and Roth D., Learning to Detect Objects in Images via a Sparse, Part-Based Representation. IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, #11:1475-1490 (2004)

Ballard, D.H., Generalizing the Hough Transform to Detect Arbitrary Shapes. Pattern Recognition, vol. 13, #2:111-122 (1981)

Beis, J., and Lowe, D., Indexing Without Invariants in 3D Object Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, #10:1000-1015 (1999)

Beis, J., and Lowe, D., Learning Indexing Functions for 3-D Model-Based Object Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 275-280 (1994)

Beis, J., and Lowe, D., Shape Indexing Using Approximate Nearest-Neighbor Search in High-Dimension Spaces. Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1000-1006 (1997)

Burt, P.J., and Adelson, E.H., The Laplacian Pyramid as a Compact Image Code. IEEE Trans. on Communications, vol. 31, #4:532-540 (1983)

Canny, J.F., Finding Edges and Lines in Images. Master's thesis, MIT, Cambridge, USA, Technical Report AITR-720 (1983)

Canny, J.F., A Computational Approach to Edge Detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 8, #6:679-698 (1986)

Crowley, J.L., and Parker, A.C., A Representation for Shape Based on Peaks and Ridges in the Difference of Low-Pass Transform. IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 6, #2:156-170 (1984)

Crowley, J.L., and Stern, R.M., Fast Computation of the Difference of Low Pass Transform, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 6, #2:212- 222 (1984)

Fergus R., Perona, P., and Zisserman, A., Object Class Recognition by Unsupervised Scale-Invariant Learning. In Proc. Computer Vision Pattern Recognition, vol. II:264-271 (2003)

Fei-Fei, L., Fergus, R., and Perona, P., Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach tested on 101 Object Categories. IEEE Computer Vision and Pattern Recognition 2004, Workshop on Generative-Model Based Vision (2004)

Felzenszwalb, P., Representation and Detection of Deformable Shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence Vol. 27, #2:208-220 (2005)

Ghosh, A., Petkov, N., Robustness of Shape Descriptors to Incomplete Contour Representations. IEEE Trans. on Pattern Analysis and Machine Intelligence Vol. 27, #11:1793-1804 (2005)

Gilles, S., Robust Description and Matching of Images. Ph.D. thesis, University of Oxford, (1998)

Gonzalez, R.C. and Woods, R.E., Digital Image Processing, Addison-Wesley Publishing Co., Reading, MA. Second Edition (2002)

Hall, D., Leibe, B., and Schiele, B., Saliency of Interest Points under Scale Changes, Research, part of the CogVis project

Harris, C.G. and Stephens, M. A Combined Corner and Edge Detector, In 4th Alvey Vision Conference, pages 147—151 (1988)

Hubel, D.H., and Wiesel, T.N., Receptive fields and functional architecture of monkey striate cortex. J. Physiol. 195:215-243 (1968)

Kandel, E., Schwartz, J. and Jessell, T. Principles of Neural Science, Fourth edition, Chapter 25, page 501-504. Chapter 27, pages 533-539

Kadir, T., and Brady, M., Texture Description by Salient Scales, Robotic Research Laboratory, Department of Engineering Science, University of Oxford (2002)

Kadir, T., Zisserman, A., and Brady, M., An Affine Invariant Salient Region Detector. Department of Engineering Science. University of Oxford (2004)

Lindeberg, T., Bretzner L., Real-time Scale Selection in Hybrid Multi-scale Representation (2003)

Lindeberg, T., Direct Estimation of Affine image Deformations Using Visual Front-end Operations with Automatic Scale Selection. IEEE Trans. on Pattern Analysis and Machine Intelligence (1995)

Lindeberg, T., Edge Detection and Ridge Detection with Automatic Scale Selection. IEEE Trans. on Pattern Analysis and Machine Intelligence (1996)


Lindeberg, T., Behavior of Local Extrema and Blobs. Journal of Math. Imaging Vision, vol. 1, pp. 65-99 (1992)

Lindeberg, T., On the Computation of a Scale Space Primal Sketch, Journal of Visual Communication and Image Representation (1991)

Lindeberg, T., Behavior of Image Structures in Scale-Space: Deep Structure. Journal of Mathematical Imaging and Vision. (1992)

Lowe, D., Fitting Parameterized Three-Dimensional Models to Images. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 13, #5 (1991)

Lowe, D., Object Recognition from Local Scale-Invariant Features, In Proc. International Conference on Computer Vision, pages 1150-1157 (1999)

Lowe, D., Local Feature View Clustering for 3D Object Recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence (2001)

Loy, G., Zelinsky, A., Fast Radial Symmetry for Detecting Points of Interest. IEEE Trans. on Pattern Analysis and Machine Intelligence Vol. 25 (2003)

Itti, L., Koch, C., and Niebur, E., A Model of Saliency-Based Visual Attention for Rapid Scene Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence (1998)

Marr, D., and Hildreth, E.C., Theory of Edge Detection. Proc. Roy. Soc. London., B- 207:187-217 (1980)

Marr, D., Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H. Freeman and Company (1992)

Matsakis P., Keller J., Sjahputera O., Marjamaa J., The Use of Force Histograms for Affine-Invariant Relative Position Description. IEEE Trans. on Pattern Analysis and Machine Intelligence Vol. 26 (2004)

Melikian, S.H., Tyminski, D., and Dawson, B. M., System and Method for Pattern Identification, U.S. Patent Application 09/680,052 (2002)

Mikolajczyk, K., and Schmid, C., Scale & Affine Invariant Interest Point Detectors, International Journal of Computer Vision 60(1):63-86 (2004)

Mikolajczyk, K., and Schmid, C., A Performance Evaluation of Local Descriptions, IEEE Trans. on Pattern Analysis and Machine Intelligence (2005)

Mokhtarian, F., Mackworth, A., A Theory of Multiscale, Curvature-Based Shape Representation for Planar Curves. IEEE Trans. on Pattern Analysis and Machine Intelligence (1992)

Mokhtarian, F., Suomela, R., Robust Image Corner Detection Through the Curvature Scale Space. IEEE Trans. on Pattern Analysis and Machine Intelligence (1998)

Mokhtarian F., Abbasi S., Automatic View Selection in Multi-view Object Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence (2000)

Mokhtarian, F., Suomela, R., Curvature Scale Space for Robust Image Corner Detection. International Conference on Pattern Recognition (1998)

Mokhtarian, F., Suomela, R., Shape Similarity Retrieval under Affine Transform: Application to Multi-view Object Representation and Recognition. International Conference on Computer Vision (1998)

Mori, G., Belongie, S., Malik, J., Efficient Shape Matching Using Shape Contexts. IEEE Trans. on Pattern Analysis and Machine Intelligence (2005)

Pope, A., Lowe, D., Learning Object Recognition from Images. 1993 IEEE.

Jain, R., Kasturi, R., and Schunck, B.G., Machine Vision, McGraw Hill, New York (1995)

Richards, W., Dawson, B., and Whittington, D., Encoding Contour Shape by Curvature Extrema. Journal of the Optical Society of America A, #3:1483-1491 (1986)

Robles-Kelly, A., Hancock, E., Graph Edit Distance from Spectral Seriation. IEEE Trans. on Pattern Analysis and Machine Intelligence, Volume 27 (2005)

Rutishauser, U., Walther, D., Koch, C., and Perona, P., Is Bottom-up Attention Useful for Object Recognition? Computation and Neural Systems, California Institute of Technology.

Schneider, G.E., Two visual systems: Brain mechanisms for localization and discrimination are dissociated by tectal and cortical lesions. Science, 163:895-902 (1969)

Shannon, C.E., A mathematical theory of communication. The Bell System Technical Journal, Vol. 27, pp. 379-423 (July), pp. 623-656 (October) (1948)

Softky, W., Unsupervised Pixel-Prediction. NIPS 1995, pp. 809-815

Suk, T., Flusser, J., Projective Moment Invariants. IEEE Trans. on Pattern Analysis and Machine Intelligence, Volume 26, October (2004)

Tuytelaars, T. and Van Gool, L., Matching Widely Separated Views Based on Affine Invariant Regions. International Journal of Computer Vision, 59(1), 61-85, (2004)


Weber, M., Welling, M., and Perona, P., Unsupervised Learning of Models for Recognition. In Proc. European Conference on Computer Vision, June (2000)

Witkin, A.P., Scale-space filtering. In Proc. IJCAI 1983, pp. 1019-1021 (1983)

Wolfe, J. M., Watching Single Cells Pay Attention. Science, 308, pp. 503-504 (2005)
