MASSACHUSETTS INSTITUTE OF TECHNOLOGY

ARTIFICIAL INTELLIGENCE LABORATORY

AI Technical Report No., May

A Trainable System for Object Detection in Images and Video Sequences

Constantine P. Papageorgiou

cpapa@ai.mit.edu

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu.

Abstract

This thesis presents a general, trainable system for object detection in static images and video sequences. The core system finds a certain class of objects in static images of completely unconstrained, cluttered scenes without using motion, tracking, or handcrafted models and without making any assumptions on the scene structure or the number of objects in the scene. The system uses a set of training data of positive and negative example images as input, transforms the pixel images to a Haar wavelet representation, and uses a support vector machine classifier to learn the difference between in-class and out-of-class patterns. To detect objects in out-of-sample images, we do a brute force search over all the subwindows in the image. This system is applied to face, people, and car detection with excellent results.

For our extensions to video sequences, we augment the core static detection system in several ways: extending the representation to five frames, implementing an approximation to a Kalman filter, and modeling detections in an image as a density and propagating this density through time according to measured features. In addition, we present a real-time version of the system that is currently running in a DaimlerChrysler experimental vehicle.

As part of this thesis, we also present a system that, instead of detecting full patterns, uses a component-based approach. We find it to be more robust to occlusions, rotations in depth, and severe lighting conditions for people detection than the full-body version. We also experiment with various other representations, including pixels and principal components, and show results that quantify how the number of features, color, and gray-level affect performance.

© Massachusetts Institute of Technology

Contents

Introduction
Object Detection
Pattern Classification and Machine Learning
Previous Work
Face Detection
People Detection
Car Detection
Our Approach
Thesis Contributions
Outline
The Static Detection System
Architecture
Training Data
Representation
The Wavelet Representation
Support Vector Machines
SVMs and Conditional Densities
Experimental Results
Experimental Results
Test Procedure
Experiments
Pixels, Wavelets, PCA
Signed vs. Unsigned Wavelets
Complete vs. Overcomplete
Color vs. Gray-Level
Faces, People, and Cars
Discussion
Different Classifiers
Feature Selection
Training with Small Data Sets
Detection by Components
Introduction
Classifier Combination Algorithms
Previous Work
System Details
Overview of System Architecture
Details of System Architecture
Results
Experimental Setup
Experimental Results
A Real-Time System and Focus of Attention
A Real-Time Application
Speed Optimizations
Focus of Attention
Integration With The DaimlerChrysler Urban Traffic Assistant
Future Work
Focus of Attention Experiments
Integration Through Time
Pure Pattern Classification Approach
Unimodal Density Tracking
Kalman Filter
Poor Man's Kalman (PMK)
Multimodal Density Tracking
Condensation
Estimating P(z|x)
Results
Discussion of Time Integration Techniques
Pattern Classification Approach
Poor Man's Kalman Heuristic
Condensation-Based Approach
Conclusion
A Wavelets
The Haar Wavelet
2-Dimensional Wavelets
Quadruple Density Transform
B Support Vector Machines
Theory and Mathematics
Practical Aspects
C Raw Wavelet Data

List of Figures

Static image versus video sequences
Examples from the training database of people images
System architecture
Examples from the training database of face images
Images of people from our database of training examples
Examples from the training database of cars
Haar wavelets
Fine scale edges are too noisy to use as features
Coded average values of wavelet features for faces
Coded average values of wavelet features for people
Coded average values of wavelet features for cars
Comparing small versus large margin classifiers
Mapping to a higher dimensional feature space
Summary of detection algorithm
Face detection results
People detection results
Car detection results
Shifting the decision surface to change the accuracy/false positive rate tradeoff
Histogram equalization and scaling coefficient representation
ROC curves for face detection
ROC curves for people detection
ROC curves for car detection
Comparing different classifiers for people detection
Manually selected features overlayed on an example person
ROC curves comparing different numbers and types of features for people detection
ROC curves comparing different positive and negative training set sizes for people detection
Schematic outline of the operation of the component based detection system
Illustration of the importance of geometric constraints in component detection
Geometric constraints placed on different components
Examples of heads, legs, and left and right arms used to train the component classifiers
ROC curves of the component classifiers
Comparison of methods of combining classifiers
Images demonstrating the capability of the component based person detector
Results of the system's application to images of partially occluded people and people whose body parts blend in with the background
Examples of component based people detection in large cluttered images
Side view of the DaimlerChrysler Urban Traffic Assistant (UTA)
View of the cameras in UTA
Quantifying the reduction in performance from using gray-level features
Looking out from UTA
The top salient region in each frame of a video sequence
ROC curves for system using saliency mechanism
Multi-frame training data for pattern classification approach to people detection in video
ROC curves comparing static and pattern classification approaches to people detection in video sequences
Illustration of PMK technique for dynamic detection
ROC curves comparing static and Poor Man's Kalman techniques for people detection in video sequences
Illustration of Condensation in action
SVM output decays smoothly across location and scale
Comparing global thresholding to peak finding with local thresholding
ROC curves comparing static and Condensation-based people detection in video sequences

Chapter 1

Introduction

Until recently, digital information was practically limited to text. Now we are in the midst of an explosion in the amount of digital visual information that is available. In fact, the state of technology is quickly moving from where databases of images are standard to where the proliferation of entire video databases will be de rigueur.

With this increase in the amount of online data available, there has been a corresponding push in the need for efficient, accurate means for processing this information. Search technology thus far has been almost exclusively targeted to processing textual data (Yahoo! Inc., Compaq Computer Corp., Lycos Inc., Excite Inc.); indeed, the amount of online data was heavily weighted towards text, and the automatic processing of image information has significantly lagged. Only recently, as the amount of online image data has exploded, have there been systems that provide indexing and cataloging of image information in mainstream search services (IBM, Virage Inc.). These systems provide various means for searching through mainly static online image libraries but are fairly limited in their capabilities. We can expect that as the standard in available online visual information transitions from images to entire video libraries, effective techniques and systems for searching this data will quickly become imperative. Consider this hypothetical problem: a digital historian is looking for all of the CNN video footage from the past five years that shows Bill Clinton and Monica Lewinsky together in the same scene. Manually searching this data would be a daunting, if not impossible, task. A system that would be able to automatically search for specific objects and situations would be indispensable.

In addition to the Internet as a motivator and forum for video search, the improved, cheaper processing power of computers has opened the door to new applications of image and video searching that were previously infeasible, if not inconceivable. In the near future, we can expect on-board automotive vision systems that inform or alert the driver about people, track surrounding vehicles, and read street signs. Security systems will soon be able to detect when a person is in the field of view and will be able to intelligently ignore benign intruders like domestic animals. Robots will be able to autonomously navigate through complicated terrain by detecting different landmarks and comparing these to internal maps. One day, there may even be smart bombs that are able to home in on specific parts of buildings or bridges by seeing the target and determining where the correct point of impact should be. Devices that implement object detection could also be useful for applications that aid the blind and deaf.

Object Detection

Fast, robust object detection systems are fundamental to the success of these types of next-generation image and video processing systems. Object detection systems are capable of searching for a specific class of objects, such as faces, people, cars, dogs, or airplanes. In contrast, the problem of recognition, which is the ability to identify specific instances of a class, deals with understanding the difference between my face and your face, not the difference between faces and things that are not faces (see Murase and Nayar for relevant work in object recognition).

On the surface, this may seem trivial; people are able to immediately detect objects with little, if any, thought. People, though, have had the benefit of millions of years of evolution and development. In contrast, object detection for computers is a nontrivial task, as we are faced with the problem of giving a mass of wires and MOSFETs the ability to see. We must deal with questions such as: How are images represented digitally? What are the characteristics of people, for instance, that distinguish them from similar looking objects like columns and fire hydrants? How do we tell a computer program to look for this information?

In this thesis, the reader will find an analysis of these questions and several proposed solutions to the problem of automatic object detection in images and video sequences.

The work in this thesis addresses the problem of object and pattern detection in video sequences of cluttered scenes. The general problem of object detection by computers is a difficult one, as there are a number of variables that we cannot account for or model in any but the most contrived situations. The system should not make any assumptions about the scene lighting, the number of objects present, the size or pose of the objects, or motion, among other characteristics.

There are two basic angles this problem could take: static images or video sequences. If we would like to detect objects in static images, the problem becomes a pure pattern classification task; the system must be able to differentiate between the objects of interest and everything else. With the variability in the scene and the uncontrollable conditions identified above, the model of a person must be rich enough to cope with these variations.

On the other hand, if the problem is to detect objects in video sequences, there is a richer set of information available, namely the dynamical information inherent in the video sequence. (These two types of systems often complement one another; the first step in a recognition system is usually to locate the object.)

Figure: An illustration of the inherent differences between doing detection in static images and in video sequences. In static images, the detection task becomes a pure pattern classification problem; dynamical information available in video sequences usually makes the detection problem easier.

However, for a general purpose system that does not make limiting assumptions about the objects, we cannot exclusively rely on motion information per se. What if the particular scene is of a group of people standing at a bus stop? A system that relies on motion to detect people would clearly fail in this case. The figure above contrasts the two types of visual data.

What we need is a technique, or combination of techniques, that uses a model that is rich enough to both (a) describe the object class to the degree that it is able to effectively model any of the possible shapes, poses, colors, and textures of the object for detection in static images, and (b), when we know that we are processing a video sequence, harness the dynamical information inherent in video sequences without making any underlying assumptions about the dynamics.

Pattern Classification and Machine Learning

Pattern classification encompasses a wide variety of problems. We assume that some system is presented with a certain pattern x and a set of possible classes y_i, one of which is the true class of the pattern. The elements of the pattern are individual features that can encode characteristics of the pattern, like height, color, length, pixel value, center of mass, mood, etc.: literally anything that can describe the thing we are trying to classify. The goal of the system is to decide to which class x belongs. Put forth in this manner, it is essentially equivalent to asking what kind of thing the pattern is. In this thesis, we will focus on the two class classification problem, where every pattern falls into exactly one of two possible classes.

There are several ways that the system can encode the knowledge needed to tackle this problem. Using a rule-based approach, a user can describe a set of rules that the system should follow in the decision process. While these systems have been successful for certain types of problems, they typically involve significant effort in engineering the rules and hence are quite expensive to develop. A more promising approach is one where the system learns to classify the patterns. This is exactly the approach we will take.

Machine learning describes a large set of techniques, heuristics, and algorithms that all share a single common characteristic: by using a set of examples, they somehow impart upon a system the ability to do a certain task. In the context of our pattern classification problem, we are interested in presenting the system with a set of example patterns of both classes from a set of training data and having it automatically learn the characteristics that describe each class and differentiate one class from the other. The positive examples are labeled as +1 and the negative as -1. The goal and measure of success is the degree of performance that the trained system achieves on a set of examples that were not present in the training set, called the test set. What we are determining when using the test set to evaluate the performance is how well the system is able to generalize to data it has never seen.

The problem of learning from examples is formulated as one where the system attempts to derive an input-output mapping, or, equivalently, a model of the domain, from a set of training examples. This type of approach is particularly attractive for several reasons. First and foremost, by learning the characteristics of a problem from examples, we avoid the need for explicitly handcrafting a solution to the problem. A handcrafted solution may suffer from the user's imposition of what he thinks are the important features or characteristics of a decision problem. With a learning-based approach, the important features and relationships of a decision problem are automatically abstracted away as a trained model. On the other hand, learning-based approaches may suffer from the problem of poor generalization on account of overfitting, where the model has learned the decision problem too well and is not able to generalize to new data.

For a given learning task, we have a set of N-dimensional labeled training examples

\[ (x_1, y_1), (x_2, y_2), \ldots, (x_\ell, y_\ell), \qquad x_i \in \mathbb{R}^N, \; y_i \in \{-1, +1\}, \]

where the examples have been generated from some unknown pdf P(x, y). We would like the system to learn a decision function f : x → y that minimizes the expected risk,

\[ R = \int |f(x) - y| \, dP(x, y). \]

In most cases we will not know P(x, y); we simply see the data points that the distribution has generated. Thus, direct minimization of the expected risk above is not possible. What we are able to directly minimize is the empirical risk, the actual error over the training set,

\[ R_{emp} = \frac{1}{\ell} \sum_{i=1}^{\ell} |f(x_i) - y_i|. \]
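As a concrete illustration, the empirical risk above can be computed directly from a labeled training set; the following is a minimal Python sketch with hypothetical names, not code from the thesis.

import numpy as np

def empirical_risk(f, X, y):
    """Average loss |f(x_i) - y_i| over a labeled training set.

    f : callable mapping a feature vector to a predicted label in {-1, +1}
    X : array of shape (num_examples, N) holding the training patterns
    y : array of shape (num_examples,) holding labels in {-1, +1}
    """
    predictions = np.array([f(x) for x in X])
    # For labels in {-1, +1}, |f(x_i) - y_i| is 0 on a correct prediction
    # and 2 on an error, so this value is twice the misclassification rate.
    return float(np.mean(np.abs(predictions - y)))

# Toy usage: a trivial classifier that thresholds the first feature.
X = np.array([[0.2, 1.0], [0.8, -0.3], [0.5, 0.1], [0.9, 0.4]])
y = np.array([-1, 1, -1, 1])
f = lambda x: 1 if x[0] > 0.5 else -1
print(empirical_risk(f, X, y))  # 0.0 on this toy set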

Learning engines that minimize empirical risk tend to overfit the training data. This means that, though the error rate on the training data may be extremely low, the error rate on out-of-sample test data could be quite high. There are a variety of techniques to overcome this, including introducing a prior (MacKay; Wolpert), early stopping (Nelson and Illingworth), and using a holdout data set (Bishop).

These methods for improving generalization are largely ad hoc. Recently, a new training technique for classifiers, called Support Vector Machines, has emerged that directly minimizes both the empirical risk and the complexity of the classifier at the same time. We will be exclusively using SVM training for our system. This technique is described in more detail in the section on support vector machines and in Appendix B.

Previous Work

In this section, we describe prior work in object detection in static images and in video sequences that is relevant to our technique. The descriptions are grouped according to application area (faces, people, and cars), from the simpler to the more complex systems, and within each area are very roughly chronological.

The types of technologies used for object detection have largely been dictated by computing power. For instance, early systems for object detection in static images generally used edge detection and simple heuristics, while recent systems that have access to much more storage and processing power are able to use large sets of training data to derive complex models of different objects. The first systems for object detection typically used motion information to segment out the moving object. While assuming the availability of this type of information can be seen as restrictive for certain applications, motion remains a powerful source of information used to both provide better detection and faster processing, even in some of the more recent systems.

Face Detection

Much of the work in object detection in static images has been concentrated in face and people detection. These choices of domains are obvious ones, as detecting faces and people are important steps in most systems where there is some sort of human-computer interaction.

Most of the early systems that find faces in images use simple shapes and constraints. Yang and Huang developed a system that uses local constraints on an image pyramid to detect components of faces. A similar rule-based approach is described in Kotropoulos and Pitas. These rule-based approaches have the benefit of low computational cost and work well for faces on account of the regular interior structure across faces. This concept was extended by Sinha, who introduced the idea of the template ratio, encoding a human face as a set of binary relationships between the average intensities of regions. The assumption behind the template ratio approach was that these relationships will hold regardless of significant changes in illumination direction and magnitude. For example, the eye sockets are almost always darker than the forehead or the cheeks. The success and robustness of the simple rule-based and template ratio approaches for face detection indicate that a representation based on the encoding of differences in average intensities of different regions is a promising direction.

A number of systems take advantage of the regular structure in faces by processing patterns to find fine scale edges corresponding to individual facial features and then fitting the geometry of the components to a deformable template. In these systems (Yuille; Yuille et al.; Kober et al.; Kwon and Lobo; Venkatraman and Govindaraju), the deformable templates are handcrafted based on prior knowledge of the structure of human faces.

Often, these systems are plagued by the noise and spurious information in the fine scale edges they use. One way of countering the effects of inconsistent fine scale edges is to use multiscale gradient features instead of the finer information. In Leung et al., Burl et al., and Burl and Perona, they use local templates based on oriented Gaussian filters to match eye, nose, and mouth features on the human face and determine valid arrangements of these features using random graph matching. Similar systems include Yow and Cipolla, where the features are second derivative Gaussians combined using a Bayesian network, and Guarda et al., where gradient features are combined in a genetic algorithm framework. There have also been efforts that use Laplacian of Gaussian features (Hoogenboom and Lew) or gradient features (Qian and Huang) to compare an entire face pattern to a set of templates instead of individual components.

Wavelets (described in Chapter 2 and Appendix A) provide another formulation for multiscale intensity differences. Some motivation for using wavelets as features is provided in Micheli-Tzanakou and Marsic, where they hypothesize that wavelets may provide excellent features for detection and recognition when combined with a classifier. Initial work showing that the wavelet response peaks over objects in a fairly uniform background is shown in Braithwaite and Bhanu and in Chapa and Raghuveer. In a related area of research, Kruger et al. determine the pose of a face by matching wavelet features to learned 3D head representations.

While many face detection systems have used features based on intensity differences of some sort, it is the regularity in intensity across faces (see Sinha) that makes it possible for approaches using intensity features to succeed for face detection. We will see later that this is not the case with people images, where the intensities have little regularity across an ensemble of examples. Lew and Huang use the Kullback relative information as a measure of closeness between a face candidate and a known template image of a face. Similarly, Colmenarez and Huang present face detection systems using quantized intensity images combined with maximum likelihood and cross entropy decision measures. Lakshmi Ratan et al. describe a system that detects faces by using a dynamic programming approach over quantized intensity values to match a small number of prototype images.

One of the recent focuses in the literature has been trainable systems for face detection that take advantage of increases in computing power and storage. These systems for detecting unoccluded, vertical, frontal views of human faces in images have been developed using example-based approaches: a typically large set of training data is processed by the systems, eventually enabling them to differentiate between faces and non-faces. These view-based approaches can handle detecting faces in cluttered scenes and have shown a reasonable degree of success when extended to handle non-frontal views (Sung; Sung and Poggio; Rowley et al.). These systems can be subdivided into those using density estimation based methods and those that are pure pattern classification techniques.

For the density based techniques, Sung and Poggio, Moghaddam and Pentland, Duta and Jain, and Reiter and Matas essentially model faces in a high dimensional space and do detection by finding where new patterns fall in this density. In these systems, some measure of distance from or closeness to a density is needed; typically, a Mahalanobis-like metric is used. The work of Schneiderman and Kanade develops a face detection system that is also based on density estimation, but they model the joint statistics of local patterns. Rikert et al. accomplish detection by clustering the output of multiscale, multiorientation features and modeling the clusters as mixtures of Gaussians for a set of both in-class and out-of-class data. Of course, all of these approaches rely on the availability of large data sets so that density estimation is possible. The system of Duta and Jain is essentially equivalent to that of Sung and Poggio, and that of Reiter and Matas is based on Moghaddam and Pentland.

Pattern classifiers like neural networks and support vector machines can be viewed as techniques that take as input large sets of labeled data and find a nonlinear decision surface that separates the in-class (face) patterns from the out-of-class (non-face) patterns. The benefit of these types of systems is that no explicit modeling needs to be done. On the other hand, they are typically computationally expensive, especially during training, and it can be difficult to extract intuition from a trained system as to what it has learned internally. Pattern classification approaches to face detection have been developed by Rowley et al., Vaillant et al., and Osuna et al.; Rowley and Vaillant use neural networks with different receptive fields, and Osuna uses a support vector machine to classify the patterns.

There have been several systems that combine the previous ideas and use wavelet features as input to example based systems. While the particular classes of wavelets and learning techniques differ, the general structure of these approaches is similar to ours, but they use very small numbers of features, are developed for a particular domain, and are not rigorously tested on large out-of-sample data sets. Philips uses matching pursuit over Gabor wavelet features to detect features on a face. This technique decomposes an out-of-sample pattern into a linear combination of wavelets and matches the linear coefficients to those of a known face. The systems of Mirhosseini and Yan and of Shams and Spoelstra use a neural network trained on Gabor wavelet features to detect eyes, while Weber and Casasent use a similar approach to detect tanks in IR imagery.

There are other types of information that have been used in face detection systems. While these features are exclusive to faces and do not generalize to other object classes, they nevertheless can be powerful cues for face detection. Several researchers use the fact that the colors of human faces fall into a narrow range. These systems use the prior knowledge that color can be used to segment out skin regions as information for detection (Saber and Tekalp; Wu et al.; Garcia et al.). Another system, by Crowley and Berard, uses the unique time signature of blinks, as well as color, to detect faces.

People Detection

The existing work in people detection has largely not addressed the detection problem per se; rather, most existing systems use some sort of prior knowledge, place heavy restrictions on the scene, assume fixed cameras with known backgrounds, or implement tracking.

The early systems that detect people focused on using motion and simple shapes or constraints; much of this is due to the computational limitations of the time. Tsukiyama and Shirai use simple shape descriptions to determine the location of leg motion against a white background, and a distance measure is utilized to determine the correspondences between moving regions in consecutive images. This system can handle multiple people in an image but requires a stationary camera and only uses leg movement to track people. Leung and Yang use a voting process to determine candidate edges for moving body parts and a set of geometric constraints to determine actual body part locations. This architecture also assumes a fixed camera; another important restriction is that it is only able to deal with a single moving person.

The use of 3D models has been prominent in finding people in video sequences. This type of system, while adequate for particular well-defined domains, involves using domain specific information in the development of the model and is not easily portable to new domains. Hogg describes a system that is based on modeling a human figure as a hierarchical decomposition of 3D cylinders, using dynamic constraints on the movement of the limbs as well. Edge detection is used to determine the possible locations of body parts, and a search tree is used to determine the location that maximizes a plausibility measure indicating the likelihood that there is a person at this location. Rohr develops a system using similar 3D cylindrical models of the human body and kinematic motion data. Model contours are matched with edges that are found in an image using a grid search method. A Kalman filter is used to determine the exact position and pose of the walking person across multiple frames. Both these architectures assume a fixed camera and a single moving person in the image.

A large number of systems use the fact that, if our camera is fixed and we know what the scene background is, we can effectively subtract the background from a new image of the scene and recover the moving objects. They can further restrict the domain to specify that the only objects that move will invariably be people. Once the moving bodies have been segmented out, it is possible to do tracking with blob models of the different body parts assigned by a maximum a posteriori approach, as in Wren et al. A similar system (Sullivan et al.) models and tracks moving people using a simple deformable model. McKenna and Gong cluster the motion information to separate different bodies of motion and subsequently use a Kalman filter to track different people.

This use of background subtraction can be extended to do more than just detection and tracking. Kahn and Swain use motion and color to segment a person, then use prior knowledge about the geometry of the body to determine where the person is pointing. In Haritaoglu et al., they present a real-time surveillance system for recognizing actions that analyzes basic shapes and is able to cope with occlusions and merging bodies; a related approach is described in Takatoo et al. Similarly, Davis and Bobick present a system for recognizing actions based on time-weighted, binarized motion images taken over relatively short sequences. However, that system is not used for detection.

In a different vein, Heisele et al. use clusters of consistent color to track moving objects. Initially, the system computes the color clusters for the first image in a sequence. The system recomputes the cluster centroids for subsequent images, assuming a fixed number of clusters. To track an object, the clusters corresponding to that object are manually labeled in an initial image and are tracked in subsequent frames; the user is, in effect, performing the first detection manually. The authors highlight as future work investigating object detection with this algorithm. An important aspect of this system is that, unlike other systems described in this section, this technique does not assume a stationary camera. This system has been combined with a time delay neural network to detect and recognize pedestrians (Heisele and Wohler).

Another approach that differs from the traditional techniques is to mark specific points on the human body with sensors or lights and to record data off of moving people. This information is more suited to understanding human motion and can subsequently be used for analysis or animation. Campbell and Bobick take a different approach to analyzing human body motion. They present a system for recognizing different body motions using constraints on the movements of different body parts. Motion data is gathered from ballet dancers with different body parts marked with sensors. The system uses correlations between different part motions to determine the best recognizer of the high-level motion. They use this system to classify different ballet motions. Lakany and Hayes also use moving light displays (MLDs), combined with a 2D FFT for feature extraction, to train a neural network to recognize a walker from his or her gait.

While systems relying on motion or background information are prevalent for people detection, there have been recent efforts in developing systems that actually do detection. In Forsyth and Fleck, they describe a system that uses color, texture, and geometry to localize horses and naked people in images. The system can be used to retrieve images satisfying certain criteria from image databases but is mainly targeted towards images containing one object. Methods of learning these body plans (hand coded hierarchies of parts) from examples are described in Forsyth and Fleck. In a direction of work more similar to ours, a recent system by Gavrila and Philomin describes a technique for finding people by matching oriented outlines, generated by an edge detector, to a large set of people template images via a distance transform measure. The system is related to ours in that it looks at some sort of intensity difference information, but unlike ours it only considers outlines composed of fine scale edges and uses an explicit, holistic match metric.

Car Detection

Cars, like people, have been less heavily studied than faces as a domain for object detection. Indeed, most systems for car detection typically impose many of the same restrictive assumptions as for people detection, largely due to the variability in the patterns and shapes and the availability of characteristic motion cues in video sequences. Much of the work relies on background subtraction (Ali and Dagless; Ebbecke et al.). Several systems use some other segmentation method that keys off of the fact that cars occur against fairly constant backgrounds of pavement. For example, Kalinke et al. describe a technique based on local image entropy. Beymer et al. present a traffic monitoring system that has a car detection module. This portion of the system locates corner features in highway sequences and groups features for single cars together by integrating information over time. Betke et al. and Betke and Nguyen use corner features and edge maps combined with template matching to detect cars in highway video scenes. This system can afford to rely on motion since it is designed for a fairly narrow domain, that of highway scene analysis from a vehicle.

Most of these systems apply heuristics that are specific to the fairly limited domains in which they are designed to run, typically from a fixed camera monitoring a piece of road (highway or intersection) or from the front of a vehicle.

A more relevant example of a true trainable car detection system is that of Rajagopalan et al., which clusters the positive data in a high dimensional space and, to classify an unknown pattern, computes and thresholds a distance measure based on the higher order statistics of the distribution. This technique has a good deal in common with the face detection system of Sung and Poggio. A similar detection system is described by Lipson, who uses a deformable template for side view car detection. In this system, the wheels, mid-body region, and regions above the wheels are roughly detected based on photometric and geometric relations; the wheels are then more precisely localized using a Hausdorff match. Processing is confined to high resolution images, which is possibly a restriction for more general detection tasks. This system has been applied to scene classification (Lipson et al.; Ratan and Grimson) and shares some conceptual similarity with that of Sinha.

All these systems have succeeded to varying degrees but have relied on the following restrictive features:

- explicit modeling of the domain
- the assumption that the object is moving
- a stationary camera and a fixed background
- marking of key moving features with sensors or lights
- implementing tracking of objects rather than detection of specific classes

Model-based approaches need a large amount of domain specific knowledge, while marking features is impractical for real world use. The tracking systems have problems handling the entrance of new objects into the scene; to overcome this problem, a tracking system would need to emulate a detection system. This work will overcome these problems by introducing an example-based approach that learns to recognize patterns and avoids the use of motion and explicit segmentation.

Our Approach

The approach taken in this thesis has two components. First, we develop a robust, trainable object detection system for static images that achieves a high degree of performance. We then develop several modifications and enhancements of the original system that allow us to take advantage of dynamical information when we process video sequences.

Figure: Example images of people used in the training database of our static detection system. The examples show variation in pose, color, texture, and background.

The core static detection system is one that learns from examples. We present it with a set of training data that are images of the object class we would like to detect and a set of data that are examples of patterns not in the object class, and the system derives an implicit model from this data. To allow the system to find a better model, we do not directly use the pixel images as training data, since the pixel patterns have a high degree of variability. Rather, we transform the pixel images into a representation that encodes local, oriented, multiscale intensity differences and provides for a more descriptive model of a variety of object classes.

The system learns using patterns of a fixed size, but in general images we do not know what size objects we will be looking for, how many of these objects will be in the scene, or where they will be located. To detect objects at all sizes and locations, we implement a brute force search over the image, looking at all locations and sizes of patterns. We assume that the orientations we are interested in must be expressed in the training data.

When we are processing video sequences, the naive approach is to use our core system directly, applying the static detection system to each frame sequentially. This, however, ignores all the dynamical information inherently available in video sequences. We can take advantage of the facts that objects appearing in one frame typically appear in approximately the same position in subsequent frames and that objects do not usually appear spontaneously in a frame when they were not present in previous frames. This general idea can be coded more rigorously in several ways, each of which uses the core static detection technique as its basis. We will describe and provide empirical results for these different methods.

Thesis Contributions

Representation

From the example images of people shown in the figure above, it is clear that a pixel-based representation is plagued by a high degree of variability. A learning-based approach would have a difficult time finding a consistent definition of a person using this type of representation. We describe a new representation in which the features are derived as the responses of filters that detect oriented intensity differences between local adjacent regions. This is accomplished within the framework of Haar wavelets, and the particular transform we use results in an overcomplete dictionary of these Haar features.

We will also investigate the power of this representation compared with both a pixel representation and principal components analysis (PCA), local and global representations, respectively.

In addition, we present a set of experiments that capture the informational content inherent in color images versus gray-level images by comparing the performance achieved with color and gray-level representations. In other words, these experiments quantify the value of color information in the context of detection performance.

Learning Machine

Traditional training techniques for classifiers, such as multilayer perceptrons, use empirical risk minimization and only guarantee minimum error over the training set. These techniques can result in overfitting of the training data and therefore poor out-of-sample performance. In this thesis, we use a relatively new pattern classification technique, support vector machines, that has recently received a great deal of attention in the literature. The number of applications of SVMs is still quite small, so the presentation of SVMs as the core learning machine represents a significant advancement of the technique in the context of practical applications.

Support Vector Machines (SVMs) (Vapnik; Cortes and Vapnik; Burges) approximate structural risk minimization, which is a well-founded learning method with guaranteed theoretical bounds on the generalization performance. SVMs minimize a bound on the generalization error by simultaneously controlling the complexity of the machine and the performance on the training set, and therefore should perform better on novel data. The SVM framework is characterized by the use of nonlinear kernels that map the data from the original input space into a much higher dimensional feature space in which the learning capability of the machine is significantly increased. In this higher dimensional feature space, an SVM classifier finds the optimal separating hyperplane, that is, the one that maximizes the margin between the two classes.
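For reference, the margin maximization described here corresponds to the standard soft-margin SVM training problem; the following is a textbook statement of that problem, not an equation quoted from this thesis:

\[
\min_{w, b, \xi} \;\; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{\ell} \xi_i
\qquad \text{subject to} \qquad
y_i \left( w \cdot \Phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0,
\]

where \Phi is the feature map induced by the kernel, the slack variables \xi_i allow some training errors at cost C, and minimizing \|w\| maximizes the margin 2/\|w\| between the two classes.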

Faces, People, Cars

Much of the previous work in object detection in static images has focused on the problem of face detection (Sung and Poggio; Rowley et al.; Vaillant et al.; Moghaddam and Pentland; Osuna et al.). While this is an important domain for static object detection systems, we consider frontal face detection to be essentially a solved problem (we note that the more general problem of pose invariant face detection has still not been sufficiently dealt with).

To explore the generality of our system and highlight its performance in less studied domains, we provide in-depth results on people and car detection as well as face detection. To our knowledge, this work is the first exposition of people detection in static images without making any assumption on motion, scene structure, background, or the number of people in the scene. In addition, while there have been several car detection systems developed in the literature, they too typically require the use of dynamical information.

This thesis explores the three object detection domains of faces, people, and cars and shows that our single, general purpose, trainable architecture is able to handle each of these classes of objects with excellent results.

Detection by Components

The most prevalent problem with our static detection system is its difficulty in detecting objects when a portion of the pattern is occluded or there is little contrast between the background and part of the pattern. This is a consequence of the fact that we train our system on the complete patterns of our objects of interest. Many object classes are decomposable into a hierarchy of constituent elements. For instance, we know that a person is composed of a head, left arm, right arm, torso, and legs. When we see these components in the proper configuration, we know that we are looking at a person.

One would expect that if we knew there was a head, left arm, and legs in the proper configuration, but we could not see the right arm, this may still be a person. In other words, if we look for the core building blocks, or components, of an object and allow some leniency, letting one or two of the components that make up the object be missing, this may result in a more robust detection system than our full pattern approach.

This thesis presents a component based framework for object detection and shows that, for a test domain of people detection, the system performs better than the full body detection system.




Dynamical Detection

This thesis presents experiments on object detection in video sequences using several different methods that take advantage of dynamical information in different ways. We will use the brute force technique of applying a static detection system to individual frames of a video sequence as our baseline.

Our first system is a pure pattern classification approach to dynamical object detection; here, we seek to circumvent the need for the extensive engineering that is quite typical in current dynamical detection systems and to avoid assuming particular underlying dynamics. We modify our base static detection approach to represent dynamic information by extending the static representation into the time domain. With this new representation, the system will be able to learn the dynamics of people, with or without motion. The system will learn what a person looks like and what constitutes valid dynamics over short time sequences, without the need for explicit models of either shape or dynamics.

The second system to take advantage of dynamical information is a rule based module that integrates information through time as an approximation to a Kalman filter. Kalman filtering theory assumes an underlying linear dynamical model and, given measurements of the location of a person in one image, yields a prediction of the location of the person in the next image and the uncertainty in this prediction. Our heuristic smooths the information in an image sequence over time by taking advantage of this fundamental a priori knowledge that a person in one image will appear in a similar position in the next image. We can smooth the results through time by automatically eliminating false positives, that is, detections that do not persevere over a small subsequence of frames.
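As a rough illustration of this kind of temporal smoothing (a sketch under assumptions about the detection format and the matching test, not the thesis's Poor Man's Kalman implementation), detections can be kept only when similar detections appear in neighboring frames:

def overlaps(a, b, tol=0.25):
    """Crude test that two detections (x, y, scale) lie near each other."""
    ax, ay, asc = a
    bx, by, bsc = b
    return (abs(ax - bx) <= tol * asc and abs(ay - by) <= tol * asc
            and 0.8 <= asc / bsc <= 1.25)

def smooth_detections(frames, min_support=2, window=3):
    """Keep a detection only if matching detections appear in at least
    min_support of the window frames centered on it.

    frames: list over time; each element is a list of (x, y, scale) detections.
    Returns one filtered detection list per frame.
    """
    filtered = []
    for t, detections in enumerate(frames):
        lo = max(0, t - window // 2)
        hi = min(len(frames), t + window // 2 + 1)
        kept = []
        for d in detections:
            # Count nearby frames (including this one) containing a match.
            support = sum(
                any(overlaps(d, other) for other in frames[u])
                for u in range(lo, hi)
            )
            if support >= min_support:
                kept.append(d)
        filtered.append(kept)
    return filtered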

The third system we will develop uses a new approach to propagating general, multimodal densities through time, based on the so-called Condensation technique (Isard and Blake). This technique has a significant advantage over the Kalman filter, namely that it is not constrained to model a single Gaussian density.
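For intuition, one resample-predict-measure cycle of a generic particle filter of the kind Condensation performs might look like the following sketch; the dynamics and measurement models here are placeholders, not those used in this thesis.

import numpy as np

def condensation_step(particles, weights, drift, measure, rng):
    """One cycle over a set of weighted state hypotheses.

    particles: (n, d) array of hypotheses (e.g., location and scale)
    weights:   (n,) array of normalized weights
    drift:     function applying (possibly stochastic) dynamics
    measure:   function returning a likelihood for each predicted state
    """
    n = len(particles)
    # 1. Resample proportionally to the current weights.
    idx = rng.choice(n, size=n, p=weights)
    predicted = drift(particles[idx], rng)          # 2. Predict.
    new_weights = measure(predicted)                # 3. Measure.
    return predicted, new_weights / new_weights.sum()

# Example with random-walk dynamics and a dummy likelihood peaked at x = 50.
rng = np.random.default_rng(0)
particles = rng.uniform(0, 100, size=(500, 3))      # x, y, scale hypotheses
weights = np.full(500, 1.0 / 500)
drift = lambda s, rng: s + rng.normal(0.0, 2.0, s.shape)
measure = lambda s: np.exp(-((s[:, 0] - 50.0) ** 2) / 200.0) + 1e-9
particles, weights = condensation_step(particles, weights, drift, measure, rng)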

A Practical Application

This thesis presents a real-time application of a particular, optimized version of our static detection system. Our people detection technology has been integrated into a system for driver assistance; the combined system, including our people detection module, is currently deployed live in a DaimlerChrysler S Class demonstration vehicle.

Outline

In Chapter 2, we describe our core trainable object detection system for static images, with details on our wavelet representation and support vector machine classification, and give test results. Chapter 3 investigates the use of alternate representations, including pixels and principal components analysis, for the purposes of object detection. The chapter also describes a manual technique for feature selection, as well as experiments quantifying how training set size affects detection performance. In Chapter 4, we describe the component based approach for object detection and test it on the domain of people detection. Chapter 5 highlights a real application of our system as part of a driver assistance system in a DaimlerChrysler test vehicle and describes experiments using a focus of attention module. In Chapter 6, we extend the static system into the time domain to take advantage of dynamical information when we are processing video sequences; several different approaches are described and tested. Chapter 7 summarizes our results and provides some directions for future work.

Chapter 2

The Static Detection System

Architecture

An architectural overview of our system, as applied to the task of people detection, is shown in the figure below; it covers both the training and testing phases. In the training step, the system takes as input a set of images of the object class that have been aligned and scaled so that they are all in approximately the same position and are the same size, and a set of patterns that are not in our object class. An intermediate representation that encapsulates the important information of our object class is computed for each of these patterns, yielding a set of positive and negative feature vectors. These feature vectors are used to train a pattern classifier to differentiate between in-class and out-of-class patterns.

In the testing phase, we are interested in detecting objects in out-of-sample images. The detection process works as follows. The system slides a fixed size window over an image and uses the trained classifier to decide which patterns show the objects of interest. At each window position, we extract the same set of features as in the training step and feed them into our classifier; the classifier output determines whether or not we highlight that pattern as an in-class object. To achieve multiscale detection, we iteratively resize the image and process each image size using the same fixed size window.
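To make the search procedure concrete, here is a minimal sketch of the multiscale sliding-window loop described above; the window size, scale step, stride, feature extractor, and classifier are stand-ins, not values or code from the thesis.

import numpy as np
from PIL import Image

def detect(image, extract_features, classify, window=(128, 64),
           scale_step=1.2, stride=4):
    """Brute-force multiscale detection over a PIL image.

    extract_features maps a window-sized crop to a feature vector, and
    classify returns True if that vector is an in-class pattern.
    Returns a list of (x, y, scale) detections in original image coordinates.
    """
    win_h, win_w = window
    detections = []
    scale = 1.0
    while True:
        w, h = int(image.width / scale), int(image.height / scale)
        if w < win_w or h < win_h:
            break  # the resized image no longer fits the detection window
        arr = np.asarray(image.resize((w, h)))
        # Slide the fixed-size window over this image size.
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                crop = arr[y:y + win_h, x:x + win_w]
                if classify(extract_features(crop)):
                    # Map the window back to original image coordinates.
                    detections.append((int(x * scale), int(y * scale), scale))
        scale *= scale_step
    return detections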

This section addresses the key issues in the development of our trained pattern classifier: the representation and the learning engine.

Training Data

Our example based approach uses a set of images of an object class to learn what constitutes an in-class and an out-of-class pattern. Here, we take people detection as a sample domain. Since the output of our system will be an image with boxes drawn around the people, the training process needs data that reflects what is and is not a person, so we need our positive data to be examples of people. To ensure that our classification engine will learn on data that has consistent information, we require

that the example images of people be scaled to the same size and aligned so that the body is located in the same position in each image.

Figure: The training and testing phases of our system. (Diagram: the Training and Testing pipelines each pass patterns through the Overcomplete Representation to the SVM Classifier; the test output labels each pattern person or non-person.)

We have developed a simple tool that allows a user to click on a few identifying marks of an object in an image; the tool then automatically cuts, scales, and aligns the pattern for use in training. Negative training data is gathered by randomly sampling patterns in images that do not contain the object of interest.

Figure: Examples from the database of faces used for training. The images are gray level.

Representation

Wavelets

One of the key issues in the development of an object detection system is the representation of the object class. Even within a narrowly defined class of objects, such as faces or people, the patterns can show a great deal of variability in color, texture, and pose, as well as in the lack of a consistent background. Our challenge is to develop a representation that achieves high interclass variability with low intraclass variability.

Figure: Examples from the database of people used for training. The images are color. The examples vary in pose, color, texture, and background.

To motivate our choice of representation, we can start by considering several traditional representations. Pixel based and color region based approaches are likely to fail because of the high degree of variability in color in certain object classes, like people, and the number of spurious patterns. Traditional fine scale edge based representations are also unsatisfactory due to the large degree of variability in these edges.

The representation that we use is an overcomplete dictionary of Haar wavelets, in which there is a large set of features that respond to local intensity differences at several orientations. We present an overview of this representation here; details can be found in Appendix A and in Mallat and Stollnitz et al. The Haar wavelets in their possible orientations are shown in panel (b) of the Haar wavelet framework figure below.

For a given pattern, the wavelet transform computes the responses of the wavelet filters over the image. Each of the three oriented wavelets (vertical, horizontal, and diagonal) is computed at several different scales, allowing the system to represent coarse scale features all the way down to fine scale features. In our object detection systems, we use two consecutive scales of wavelets. In the traditional wavelet transform, the wavelets do not overlap; they are shifted by the size of the support of the wavelet in x and y. To achieve better spatial resolution and a richer set of features, our transform shifts by 1/4 of the size of the support of each wavelet, yielding an overcomplete dictionary of wavelet features (see panel (c) of the Haar wavelet framework figure below). The resulting high dimensional feature vectors are used as training data for our classification engine.

Figure: Examples from the database of cars used for training. The images are color and are normalized so that the front or rear bumper is a fixed number of pixels wide.

There is certain a priori knowledge embedded in our choice of the wavelets. First, we use the absolute values of the magnitudes of the wavelets. This tells the system that a dark object on a light background and a light object on a dark background have the same information content. Second, for color images, we compute the wavelet transform for a given pattern in each of the three color channels and then, for a wavelet of a specific location and orientation, use the one that is largest in magnitude. This allows the system to use the most visually significant features.
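The following minimal sketch illustrates how a single unsigned Haar feature could be computed at one location and orientation, taking the maximum response over the color channels as described above; the region layout and sign conventions are assumptions for illustration, not the thesis code.

import numpy as np

def haar_coefficient(channel, x, y, size, orientation):
    """Signed response of one non-standard 2D Haar wavelet of support
    size x size at top-left corner (x, y): a difference of average
    intensities between adjacent sub-regions."""
    half = size // 2
    block = channel[y:y + size, x:x + size].astype(float)
    if orientation == "vertical":      # responds to vertical boundaries
        return block[:, half:].mean() - block[:, :half].mean()
    if orientation == "horizontal":    # responds to horizontal boundaries
        return block[half:, :].mean() - block[:half, :].mean()
    # diagonal: difference between the two diagonal pairs of quadrants
    return (block[:half, :half].mean() + block[half:, half:].mean()
            - block[:half, half:].mean() - block[half:, :half].mean())

def unsigned_feature(image_rgb, x, y, size, orientation):
    """Take the channel whose response is largest in magnitude, then keep
    only the magnitude (dark-on-light and light-on-dark look the same)."""
    responses = [haar_coefficient(image_rgb[:, :, c], x, y, size, orientation)
                 for c in range(image_rgb.shape[2])]
    return abs(max(responses, key=abs))

# Shifting (x, y) in steps of size // 4 over the pattern produces the
# quadruple-density, overcomplete dictionary described in the text.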

Motivation

Our main motivation for using wavelets is that they capture visually plausible features of the shape and interior structure of objects that are invariant to certain transformations. The result is a compact representation where dissimilar example images from the same object class map to similar feature vectors.

Figure: The Haar wavelet framework: (a) the Haar scaling function and wavelet, (b) the three types of 2-dimensional non-standard Haar wavelets (vertical, horizontal, and diagonal), and (c) the quadruply dense shift used in our overcomplete transform, compared to the shift of the standard transform.

With a pixel representation, what we would be encoding are the actual intensities

of different parts of the patterns; a simple example makes it clear that this encoding does not capture the important features for detection. Take, for instance, two data points of the same class, where one is a dark body on a white background and the other is a white body on a dark background. With an intensity based representation like pixels, each of these examples maps to a completely different feature vector. A representation that encodes local, oriented intensity differences, like Haar wavelets, would yield similar feature vectors, where the features corresponding to uniform regions are zero and those corresponding to boundaries are non-zero. In fact, since in our representation we encode only the magnitude of the intensity difference, the feature vectors for this simple two example case would be identical.

We do not use all the very fine scales of wavelets as features for learning, since these scales capture high frequency details that do not characterize the class well. For instance, in the case of people, the finest scale wavelets may respond to checks, stripes, and other detail patterns, all of which are not features that are characteristic of the entire class. Similarly, the very coarse scale wavelets are not used as features for learning, since their support will be as large as the object and will therefore not encode useful information. So, for the object detection systems we have developed, we throw out the very fine and very coarse wavelets and only use two medium scales of wavelets as features for learning. These scales depend on the object class and the size of the training images and are chosen a priori.

Figure: The top row shows examples of images of people in the training database; the bottom row shows edge detection of the pedestrians. Edge information does not characterize the pedestrian class well.

The Wavelet Representation

The Haar transform provides a multiresolution representation of an image, with wavelet features at different scales capturing different levels of detail. The coarse scale wavelets encode large regions, while the fine scale wavelets describe smaller, local regions. The wavelet coefficients preserve all the information in the original image, but the coding of the visual information differs from the pixel-based representation in two significant ways.

First, the wavelets encode the difference in average intensity between local regions along different orientations, in a multiscale framework. Constraints on the values of the wavelets can express visual features of the object class: strong response from a particular wavelet indicates the presence of an intensity difference, or boundary, at that location in the image, while weak response from a wavelet indicates a uniform area.

Second, the use of an overcomplete Haar basis allows us to propagate constraints between neighboring regions and describe complex patterns. The quadruple-density wavelet transform provides high spatial resolution and results in a rich, overcomplete dictionary of features.

In the following sections we show how our wavelet representation applies to faces, people, and cars. This coding of local intensity differences at several scales provides a flexible and expressive representation that can characterize complex object classes. Furthermore, the wavelet representation is computationally efficient for the task of object detection: we do not need to compute the transform for each image region that is examined, but only once for the whole image, and we then process the image in the space of wavelets.

Analyzing the Face Class

For the face class, we have a training set of gray-scale images of faces. This set consists of a core set of faces, with some small angular rotations added to improve generalization, and a set of non-face patterns. The face images are all scaled to the same dimensions and show the face from above the eyebrows to below the lips; typical images from the database are shown in the corresponding figure. Databases of this size and composition have been used extensively in face detection (Sung; Rowley et al.; Osuna et al.). For the size of patterns our face system uses, we have several wavelet scales at our disposal. Instead of using the entire set of wavelets, we a priori limit the dictionary to the two finest of these scales, since coarser features do not contain significant information for detection purposes. At the finer scale the features are in quadruple density for each wavelet class, and at the coarser scale they are in double density for each class.

The raw value of a coefficient may not necessarily be indicative of a boundary; a weak coefficient in a relatively dark image may still indicate the presence of an intensity difference that is significant for the purposes of classification. To reduce these effects on the features used for classification, we normalize a coefficient's value against the other coefficients in the same area. For the normalization step, we compute the average of each wavelet class, {vertical, horizontal, diagonal} x {the two scales}, over the current pattern and divide the wavelet response at each spatial location by its corresponding class average. We calculate the averages separately for each class, since the power distribution between the different classes may vary. For a given pattern p, the class averages are

avg_{o,s} = \frac{1}{n} \sum_{i \in p} w_{o,s}(i)

where o denotes a fixed orientation, s denotes a fixed scale, i indexes the wavelets in the pattern p, and the w_{o,s}(i) are the n individual wavelet coefficients at orientation o and scale s. The normalization for all wavelets within the pattern p is then

\hat{w}_{o,s}(i) = \frac{w_{o,s}(i)}{avg_{o,s}}

After the normalization, the average value of a coefficient for random patterns should be one. Three classes of feature magnitudes will emerge: ensemble average values much larger than one, which indicate strong intensity-difference features that are consistent across all the examples; values much less than one, which indicate consistent uniform regions; and values close to one, which are associated with inconsistent features or random patterns.
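As an illustration only, the following is a minimal Python sketch of this per-class normalization; it assumes the unsigned wavelet responses of a single pattern are stored per orientation and scale, and the data layout and names are ours, not the system's.

import numpy as np

def normalize_wavelets(responses):
    """Normalize unsigned wavelet responses by their class averages.

    `responses` maps (orientation, scale) -> 2-D array of absolute wavelet
    coefficients for one pattern.  Each coefficient is divided by the mean
    of its own orientation/scale class, so a value near 1 means no
    consistent feature, values well above 1 mark strong intensity
    differences, and values well below 1 mark uniform regions.
    """
    normalized = {}
    for key, w in responses.items():
        avg = w.mean()                 # class average avg_{o,s} over the pattern
        normalized[key] = w / avg      # w_hat_{o,s}(i) = w_{o,s}(i) / avg_{o,s}
    return normalized

# Toy usage: two orientation/scale classes with random positive responses.
rng = np.random.default_rng(0)
pattern = {("vertical", 4): rng.random((16, 16)),
           ("horizontal", 4): rng.random((16, 16))}
norm = normalize_wavelets(pattern)
print(norm[("vertical", 4)].mean())    # approximately 1 after normalization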

To visualize the detected face features, we code the ensemble average of the wavelet coefficients using gray level and draw them in their proper spatial layout, as shown in the figure below. Coefficients with values close to one are plotted in gray, those with values larger than one are darker, and those with values less than one are lighter. It is interesting to observe the patterns that emerge in the facial features. The vertical coefficients capture the sides of the nose, while the horizontal coefficients capture the eye sockets, eyebrows, and tip of the nose. Interestingly, the mouth is a relatively weak feature compared to the others. The diagonal coefficients respond strongly to the endpoints of facial features.

Figure: Ensemble average values of the wavelet features of faces coded using gray level. Coefficients whose values are above the average are darker; those below the average are lighter. Panels (a)-(c) are the vertical, horizontal, and diagonal wavelets at the first of the two scales used; (d)-(f) are the same orientations at the second scale.

Table: Ensemble average of normalized horizontal coefficients at one scale over the set of face images. Meaningful coefficients are the ones with values much larger or much smaller than one; average values close to one indicate no meaningful feature.

Analyzing the People Class

For learning the people class, we have collected a set of color images of people in different poses (shown in the corresponding figure), and we use their mirror images as well, along with a set of non-people patterns. All of the images are normalized to the same dimensions, and the people images are aligned such that the bodies are centered and approximately the same size (the distance from the shoulders to the feet spans a fixed number of pixels). As in the case of faces, to code features at scales appropriate for people detection, that is, at scales at which we expect relevant features of people to emerge, we restrict the system to the wavelets at two scales of support.

In our people detection system, our training database consists of color images. For a given pattern, we compute the quadruple-density Haar transform in each color channel (RGB) separately and take, as the coefficient value at a specific location and orientation, the one largest in absolute value among the three channels. This technique maps the original color image to a pseudo-color channel that gives us the same number of wavelet coefficients as if we had been using gray-level images.
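A minimal sketch of this channel collapse, under the assumption that the per-channel wavelet responses for one orientation and scale are already available as an array (the names and toy dimensions are illustrative only):

import numpy as np

def collapse_color_channels(wavelets_rgb):
    """Collapse per-channel wavelet responses into one pseudo-channel.

    `wavelets_rgb` has shape (3, H, W): the Haar wavelet responses of one
    orientation/scale computed separately on the R, G, and B channels.
    For every location we keep the response that is largest in magnitude
    across the three channels.
    """
    return np.max(np.abs(wavelets_rgb), axis=0)

# Toy usage with random responses for a single orientation and scale.
rng = np.random.default_rng(1)
rgb_responses = rng.normal(size=(3, 8, 8))
pseudo = collapse_color_channels(rgb_responses)
print(pseudo.shape)    # (8, 8): same number of coefficients as gray level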

To visualize the patterns that emerge using this wavelet representation for people, we can code the average values of the coefficients in gray level and display them in the proper spatial layout, as we did for the faces. The corresponding figure shows each average wavelet displayed as a small square, where features close to one are gray, stronger features are darker, and weaker features are lighter. As with faces, we observe that each class of wavelet coefficients is tuned to a different type of structural information. The vertical wavelets capture the sides of the people. The horizontal wavelets respond to the shoulders and to a weaker belt line. The diagonal wavelets are tuned to corner features, i.e., the shoulders, hands, and feet. The finer-scale wavelets provide fine spatial resolution of the body's overall shape, and smaller-scale details such as the head and extremities are clearly evident.

Analyzing the Car Class

The car detection system uses a database of frontal and rear color images of cars, normalized to a common size and aligned such that the front or rear bumper spans a fixed number of pixels across. For training, we use the mirror images as well, together with a set of negative patterns; a few examples from our training database are shown in the corresponding figure. As with faces and people, we use two scales of wavelets for detection. Like the processing for people, we collapse the three color channel features into a single channel by using the maximum wavelet response of each channel at a specific location, orientation, and scale. This gives us the set of wavelet features that are used to train the SVM.

Figure: Ensemble average values of the wavelet features of people coded using gray level. Coefficients whose values are above the average are darker; those below the average are lighter. Panels (a)-(c) are the vertical, horizontal, and diagonal wavelets at the first of the two scales used; (d)-(f) are the same orientations at the second scale.

The average wavelet feature values for cars are coded in gray level in the figure below. The gray-level coding of the average feature values shows that the wavelets respond to the significant visual characteristics of cars. The vertical wavelets respond to the sides of the car; the horizontal wavelets respond to the roof, underside, top of the grille, and bumper area; and the diagonal wavelets respond to the corners of the car's body. At the finer scale, we can even see evidence of what seem to be license plate and headlight structures in the average responses.

Figure: Ensemble average values of the wavelet features of cars coded using gray level. Coefficients whose values are above the average are darker; those below the average are lighter. Panels (a)-(c) are the vertical, horizontal, and diagonal wavelets at the first of the two scales used; (d)-(f) are the same orientations at the second scale.

Discussion

Comparing the database of people to the database of faces illustrates an important, fundamental difference between the two classes. In the case of faces, there are clear patterns within the face, consisting of the eyes, nose, and mouth, and these patterns are common to all the examples. This is not the case with full-body images of people. The people do not share any common color or texture, and the people images contain many spurious details such as jackets, ties, and bags. On the other hand, we would expect that people can be characterized quite well by their fairly similar overall body shape, or silhouette. Our approach treats these two cases, where there is different underlying information content in the object classes, in a uniform manner. Frontal and rear views of cars have both a certain amount of common interior structure (top of grille, license plates, headlights) and fairly uniform outer boundaries. We will see that our wavelet representation is suitable for car detection as well.

There is certain a priori knowledge embedded in our choice of the wavelets. The use of the absolute value of the coefficient is essential in the case of people, since the direction of the intensity difference at a certain feature's orientation is not important: a dark body against a light background and a light body against a dark background should be represented as having the same information content. Furthermore, we compute the wavelet transform for a given pattern in each of the three color channels and then, for a wavelet at a specific location and orientation, we use the one that is largest in magnitude among the three channels. This is based on the observation that there is little consistency in color between different people, and it allows the system to key off of the most visually significant features.

Once we have generated the feature vectors for an object class, and have done the same for a set of images not in our object class, we use a learning algorithm that learns to differentiate between the two classes. The particular learning engine we use is a support vector machine, described below.

Support Vector Machines

Figure: The separating hyperplane in (a) has a small margin; the hyperplane in (b) has a larger margin and should generalize better on out-of-sample data.

Support vector machines (SVMs) are a technique for training classifiers that is well founded in statistical learning theory (Vapnik; Burges). One of the main attractions of using SVMs is that they are capable of learning in high-dimensional spaces with very few training examples. SVMs accomplish this by minimizing a bound on the empirical error and the complexity of the classifier at the same time.

This concept is formalized in the theory of uniform convergence in probability:

R(\alpha) \le R_{emp}(\alpha) + \sqrt{\frac{h\left(\ln\frac{2\ell}{h}+1\right) - \ln\frac{\eta}{4}}{\ell}}

with probability 1 - \eta. Here, R is the expected risk, R_{emp} is the empirical risk, \ell is the number of training examples, h is the VC dimension of the classifier that is being used, and the square-root term is the VC confidence of the classifier. Intuitively, this means that the uniform deviation between the expected risk and the empirical risk decreases with larger amounts of training data and increases with the VC dimension h. This leads us directly to the principle of structural risk minimization, whereby we can attempt to minimize, at the same time, both the actual error over the training set and the complexity of the classifier; this will bound the generalization error as in the expression above. It is exactly this technique that support vector machines approximate.

Figure: (a) The original data set may not be linearly separable. (b) The support vector machine uses a nonlinear kernel to map the data points into a very high-dimensional feature space in which the classes have a much greater chance of being linearly separable.

This simultaneous control of both the training-set error and the classifier's complexity has allowed support vector machines to be successfully applied to very high-dimensional learning tasks: Joachims presents results on SVMs applied to a high-dimensional text categorization problem, and Osuna et al. describe a high-dimensional face detection system.

The separating boundary is, in general, of the form

f(x) = \sum_{i=1}^{\ell} \alpha_i y_i K(x, x_i) + b

where \ell is the number of training data points (x_i, y_i), y_i being the label of training point x_i; the \alpha_i are non-negative parameters learned from the data; and K is a kernel that defines a dot product between projections of the two arguments in some feature space (Vapnik; Wahba), where a separating hyperplane is then found (see the earlier figure). For example, when K(x, y) = x \cdot y is the chosen kernel, the separating surface is a hyperplane in the space of x (input space). The kernel K(x, y) = \exp(-\|x - y\|^2) defines a Gaussian radial basis function (Girosi et al.), and K(x, y) = (x \cdot y + 1)^n describes an n-th degree polynomial. In general, any positive definite function can be used as the kernel (Vapnik; Wahba).

The main feature of the SVM is that, among all possible separating surfaces of this form, it finds the one which maximizes the distance between the two classes of points, as measured in the feature space defined by K. The support vectors are the points nearest to the separating boundary and are the only ones, typically a small fraction of the training data, for which the corresponding \alpha_i in the expansion above is positive.

Using the SVM formulation, the classification step for a pattern x using a polynomial of degree two (the typical classifier we use for our system) is as follows:

f(x) = \theta\left( \sum_{i=1}^{N_s} \alpha_i y_i (x \cdot x_i + 1)^2 + b \right)

where N_s is the number of support vectors, the training data points that define the decision boundary, the \alpha_i are Lagrange parameters, and \theta(\cdot) is a threshold function. If we introduce d(x), which returns a value proportional to the distance of x from the separating hyperplane, then f(x) = \theta(d(x)).

In our case, the feature vector we use is composed of the wavelet coefficients for the pattern we are currently analyzing.
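The following is a minimal Python sketch of this classification step under the definitions above; the support vectors, labels, multipliers, and bias are taken as given, and the variable names are our own.

import numpy as np

def svm_distance(x, support_vectors, labels, alphas, bias):
    """Raw SVM output d(x) for a degree-2 polynomial kernel.

    support_vectors: (N_s, d) array of support vectors x_i
    labels:          (N_s,)  array of labels y_i in {-1, +1}
    alphas:          (N_s,)  array of Lagrange multipliers alpha_i
    bias:            scalar threshold b
    """
    kernel = (support_vectors @ x + 1.0) ** 2           # (x . x_i + 1)^2
    return float(np.sum(alphas * labels * kernel) + bias)

def svm_classify(x, support_vectors, labels, alphas, bias):
    """Thresholded decision f(x) = theta(d(x)): +1 for in-class, -1 otherwise."""
    return 1 if svm_distance(x, support_vectors, labels, alphas, bias) > 0 else -1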

SVMs and Conditional Densities

The raw output of a single SVM classification, d(x), is a real number that is proportional to the distance of the point x from the separating hyperplane. To facilitate comparisons between the outputs of different support vector machines, and to provide a probabilistic interpretation, it is necessary to normalize the outputs somehow; in their raw form they are not directly comparable. Several methods for interpreting the output of an SVM as a conditional density have emerged. While we do not use these methods in our system, they are outlined below for completeness.

Density estimation in the space of d(x)

Niyogi et al. describe a method for converting SVM outputs to probabilities that simply relies on the approximation P(x | y) \approx P(d(x) | y), thereby doing density estimation in the lower-dimensional space of distances from the hyperplane rather than in the full feature space. The posterior is then estimated as

P(y = 1 \mid x) = \frac{P(d(x) \mid y = 1)\,P(y = 1)}{P(d(x) \mid y = 1)\,P(y = 1) + P(d(x) \mid y = -1)\,P(y = -1)}

Maximum likelihood fitting

Dumais et al. and Platt directly fit a sigmoid to the output of an SVM using a regularized maximum likelihood approach. The resulting posterior is

P(y = 1 \mid x) = \frac{1}{1 + e^{A\,d(x) + B}}
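Purely as an illustration (we do not use this in our system), a sketch of evaluating such a sigmoid posterior for already-fitted parameters A and B; the parameter values below are arbitrary:

import math

def sigmoid_posterior(d_x, A, B):
    """P(y = 1 | x) = 1 / (1 + exp(A * d(x) + B)) for fitted A, B."""
    return 1.0 / (1.0 + math.exp(A * d_x + B))

# Example: for a negative A (the typical outcome of the fit), a point
# farther from the hyperplane on the positive side gets a higher posterior.
print(sigmoid_posterior(1.0, -2.0, 0.0), sigmoid_posterior(2.0, -2.0, 0.0))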

Decomposition of feature space

Another method for estimating conditional probabilities from SVMs is presented in Vapnik. Here, the feature space is decomposed into a direction orthogonal to the separating hyperplane and the remaining dimensions. Each of these is parameterized separately, by t (a scaled version of d(x)) and u, respectively, and the density along the orthogonal direction is modeled as

P(y = 1 \mid t, u) = a_0(u) + \sum_{i=1}^{N} a_i(u)\cos(i t)

Experimental Results

An algorithmic summary of the out-of-sample detection process is shown in the figure below. In the subsequent figures we present examples of our trainable object detection system as applied to the domains of face, people, and car detection, respectively. We reiterate that the system makes no a priori assumption on the scene structure or the number of objects present, and it does not use any motion or other dynamical information. The performance of each of these particular instantiations of detection systems could easily be improved by using more training data. We have not sought to push the limits of performance in particular domains; rather, our goal has been to show that this uniform architecture for object detection leads to high performance in several domains.
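As an illustration, here is a minimal Python sketch of this brute-force multiscale search; the pseudocode actually used appears in the figure below. The resize, Haar transform, and classifier routines are passed in as placeholders, and the scale constants are arbitrary defaults, not the values used by the system.

def detect(image, resize, haar_transform, classify, window_shape,
           s_init=0.2, s_final=1.5, s_step=1.1):
    """Brute-force multiscale sliding-window detection (illustrative sketch).

    resize(img, s)       -> the image rescaled by factor s
    haar_transform(img)  -> NumPy array of normalized wavelet features,
                            indexed by (row, col, ...)
    classify(fv)         -> +1 for the object class, -1 otherwise
    """
    detections = []
    scale = s_init
    while scale <= s_final:
        scaled = resize(image, scale)              # rescale the input image
        features = haar_transform(scaled)          # transform once per scale
        rows, cols = features.shape[:2]
        win_r, win_c = window_shape
        for r in range(rows - win_r + 1):
            for c in range(cols - win_c + 1):
                fv = features[r:r + win_r, c:c + win_c].ravel()
                if classify(fv) == 1:              # pattern classified as object
                    detections.append((scale, r, c))
        scale *= s_step                            # move to the next scale
    return detections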

Let I be the image on which we are running the detection system
Let s_r be the rescaling factor for the image
Let s_i be the initial scale
Let s_f be the final scale
Let s_c be the current scale of the image being processed
Let I_c be the input image scaled by s_c
Let H(I) be the Haar transform of image I
Let q be a pattern in wavelet space
Let fv be the feature vector used to classify a pattern

set s_c = s_i
while s_c <= s_f do
    I_c = resize I by s_c
    H(I_c) = compute the Haar wavelet transform of I_c
    for each (r, c) in H(I_c)    (loop over all rows and columns)
        q = pattern at (r, c)
        compute the average response of each type of wavelet
            (each scale, each orientation in {V, H, D}) in pattern q
        fv = normalize each wavelet in q by its class average
        class = classify fv using the SVM classifier
        if class = 1 then
            pattern q is a person, so draw a rectangle around q
        if class = -1 then
            pattern q is not a person, so ignore it
    s_c = s_c * s_r
end

Figure: Algorithm for detecting people in out-of-sample images.

Figure: Results of our face detection system on a set of out-of-sample images. A, C, E, F, G, H, I, J, K, L, M, N are from the test database of Sung and Poggio; B, D are from www.starwars.com; O is from www.corbis.com. Missed faces (B, F, I, J, K, M) are due to significant head rotations that were not present in the training data. False positives (E, F, N) are due to insufficient training data and can be eliminated by using more negative training data. The face detection system processes a large number of candidate patterns in each image.

Figure: Results of people detection on out-of-sample images. A, I, K are from www.starwars.com; B, D, E, F, H, J, N are from www.corbis.com; C, G are from www.cnn.com; L, O, P were taken in Boston and Cambridge; M was provided by DaimlerChrysler. Missed detections occur when the person is too close to the edge of the image (B) or has an uncharacteristic body shape not represented in the training data (I). False positives often look very similar to people (A) or are due to the presence of strong vertical and horizontal intensity differences (D, E, K, L, M, O). The people detection system processes a large number of candidate patterns in each image.

Figure: Results of car detection on out-of-sample images. A is from www.lewistonpd.com; B, C, D, E, F, G, H, J, K, L, M, O are from www.corbis.com; I is from www.enn.com; N is from www.foxglove.com. Missed positive examples are due to occlusions (A, F, O) or a car being too close to the edge of the image (A). False positives (C, J, I, N) are due to insufficient training and can be eliminated with more negative training patterns. The car detection system processes a large number of candidate patterns in each image.

Chapter

Experimental Results

Our criteria for the representation we use are that it be efficiently computable and that it identify local, oriented intensity-difference features. Haar wavelets satisfy these criteria and are perhaps some of the simplest such features with finite support. Thus far, we have focused on the Haar wavelet representation, but we could have considered other possible representations, including pixels and principal components. This chapter empirically quantifies the added value of the wavelet representation as compared with several alternative representations. We also present the results of experiments that compare the effect of using different types of classifiers, including linear, polynomial, and radial basis function classifiers. Finally, we provide empirical evidence that relatively few training examples may be needed to sufficiently train an object detection system.

Test Procedure

To accurately measure our detection system's performance, we use a receiver operating characteristic (ROC) curve, which quantifies the tradeoff between detection accuracy and the rate of false positives. In all of our ROC curves, the y axis is the percentage of correct positive detections and the x axis is the rate of false positives, measured as the number of false positives per negative pattern processed.

To test our detection system, we have developed an automated testing package that gives us a detailed view of how well the system is performing. The testing procedure measures positive and negative pattern performance separately, over different sets of data. The positive test data consist of aligned but unscaled examples of the object class, with a sufficiently large boundary around them that the detection system can run at several different locations and scales. While these images can be of various sizes, the proportion at which the object occurs in the image is constant from test example to test example. This allows us to check whether or not each detection in the image falls on the actual object. We allow tolerances of a few pixels in the detections; the exact tolerances are given in the table below.

Table: Pixel tolerances for the automated testing procedure for faces, people, and cars. The given tolerances are for both the x and y positions, normalized to the sizes of the objects in the training set; they are scaled appropriately to the pattern sizes used for faces, people, and cars.

To generate the false positive rate, we have a set of images of different natural and man-made scenes that do not contain any examples of the objects we are detecting. The false positive rate is simply the number of detections in this set of data divided by the total number of patterns that are examined. Using the actual backgrounds of images containing people might give a more accurate false positive rate, but we ignore this issue, opting for the simplest method of determining the false positive rate. The table below gives the number of positive examples and negative patterns for each of the object detection systems we develop in this thesis.

Table: Summary of the test set sizes (numbers of positive examples and negative patterns) for each of the detection systems: faces, people, and cars.

This technique gives us a single detection/false-positive point. To generate a full ROC curve, we shift the SVM decision surface and obtain detection/false-positive points for various shifts. Shifting the decision surface has the effect of tuning the system to be more strict or more relaxed in its definition of what is and is not an element of the object class. The shifting is accomplished as

f(x) = \sum_{i=1}^{\ell} \alpha_i y_i K(x, x_i) + b + s

where s dictates the magnitude and direction of the shift, as shown in the figure below. For s = 0 we have the decision surface that SVM training provides; s < 0 moves the decision surface towards the positive class, making classification more strict, and s > 0 moves it towards the negative class, making classification more lenient.
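A minimal sketch of generating ROC points by sweeping this shift, assuming a function d(x) that returns the raw SVM output for a pattern; the sweep range is an arbitrary placeholder:

import numpy as np

def roc_points(d, positives, negatives, shifts):
    """Compute (false-positive rate, detection rate) pairs by shifting the bias.

    d(x)       -> raw SVM output for pattern x (distance-proportional score)
    positives  -> list of in-class test patterns
    negatives  -> list of out-of-class test patterns
    shifts     -> iterable of bias shifts s; larger s is more lenient
    """
    pos_scores = np.array([d(x) for x in positives])
    neg_scores = np.array([d(x) for x in negatives])
    points = []
    for s in shifts:
        detection_rate = np.mean(pos_scores + s > 0)    # fraction of positives detected
        false_pos_rate = np.mean(neg_scores + s > 0)    # false positives per negative pattern
        points.append((false_pos_rate, detection_rate))
    return points

# Example sweep from strict (negative shifts) to lenient (positive shifts):
# points = roc_points(d, positives, negatives, np.linspace(-2.0, 2.0, 41))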

Experiments

The dense Haar transform captures a rich set of features that allow the SVM classifier to obtain a powerful class model; the wavelets respond to significant visual features while smoothing away noise. This choice of features is a priori, however, so this section presents the results of many tests comparing different features for object detection.

Figure: Shifting the original decision surface (s = 0) by changing the bias term makes classification more strict if s < 0 and more lenient if s > 0. Each shifted decision boundary yields a detection system with different performance characteristics; we generate ROC curves by plotting the performance obtained for each shifted decision surface.

There are many possible alternative representations that have been used in the literature, including pixels and PCA, and these different representations are compared within our detection framework. Another decision we made was to ignore the sign of the wavelets and use their absolute values; this is tested against the signed values. In addition, for people detection, our training set is in color; we empirically quantify the improvement in performance from using color data as opposed to gray-level data.

In the results presented in this section, our people detection system is trained on positive patterns (frontal and rear people images and their mirror images) and on non-people patterns, and it is tested on images containing people and on a separate set of non-people patterns. The face detection system is trained on face images and non-face patterns, and it is tested on images containing faces and on a separate set of non-face patterns. The car detection system is trained on frontal and rear color images of cars (including their mirror images) and on non-car patterns, and it is tested on images containing cars and on a separate set of non-car patterns.

Figure: Comparing histogram-equalized pixels to histogram-equalized scaling coefficients: (a) raw pixels, (b) histogram-equalized pixels, and (c) histogram-equalized overlapping scaling coefficients.

Pixels, Wavelets, PCA

Our main premise for choosing a wavelet-based representation is that intensity differences between local adjacent regions contain higher-quality information for the purpose of object detection than other traditional representations. Pixel representations capture the most local features; these have been used extensively for face detection, but due to the variability in the people patterns, we would expect pixel representations to fail for people detection. At the other end of the locality spectrum are global representations like PCA, which encodes a class in terms of basis functions that account for the variance in the data set. We can change the class of features to see which yields the best performance. For people, we use overlapping pixel averages instead of the raw pixels, for a fairer comparison that uses a similar number of features. Furthermore, these averages are histogram equalized in the same manner as the pixel representation (see the figure above).

Signed vs. Unsigned Wavelets

The features our system uses do not contain information on the sign of the intensity gradient; they are the absolute values of the wavelet responses. With these features, we are describing only the strength of the intensity differences. For an object class like people, where a dark body on a light background carries the same information as a light body on a dark background and there is little consistency in the intensities, the sign of the gradient should not matter. On the other hand, if we consider face patterns, there is consistent information in the sign of the gradient of the intensity differences: for instance, the eyes are darker than the cheeks and the forehead, and the mouth is darker than the cheeks and the chin. These types of relationships have been explored in Sinha. We might expect that using the sign information would enhance results in this case.

Complete vs. Overcomplete

The motivation for using the overcomplete Haar wavelet representation is to provide a richer set of features over which the system will learn and, ultimately, a more accurate description of a person. We test this against the standard complete Haar representation.

Color vs. Gray Level

For color images, in the case of people detection, we collapse information from the three color channels into a single pseudo-channel that maintains the strongest local intensity differences. Intuitively, color images contain much richer information than the corresponding gray-scale versions. We present experiments that quantify the inherent information content of using color images as opposed to gray level for object detection.

Faces, People, and Cars

Our ROC curves plot the performance of the detection system, measured as accuracy over out-of-sample data, against the rate of false positives, measured as the number of false positives per pattern examined. The ROC curves that compare different representations for the face detection system are shown in the first figure below. The representations used for face detection are raw pixels, histogram-equalized pixels, principal components of histogram-equalized pixels, gray signed wavelets, and gray unsigned wavelets. (The wavelets that we use actually form an undercomplete basis for this space; a more correct characterization of the representation is that the features are the wavelets from the complete basis at the two scales we are using.)

Figure: ROC curves for face detection comparing different features, using pixel features as a benchmark. The graph shows gray unsigned and signed wavelets, raw and histogram-equalized pixels, and PCA of histogram-equalized pixels. The face detection system typically processes a large number of candidate patterns per image.

Figure: ROC curves for people detection comparing different features, using pixel-type features as a benchmark. The graph shows color and gray, signed and unsigned wavelets (overlapping), color unsigned wavelets (non-overlapping, i.e., complete), overlapping pixel averages, and PCA of the overlapping pixel averages. We compare the wavelets against the overlapping pixel averages instead of the pixels themselves in order to use approximately the same number of features for a fairer comparison. The people detection system typically processes a large number of candidate patterns per image.

Figure: Preliminary ROC curve for car detection using wavelet features over color images. The car detection system typically processes a large number of candidate patterns per image.

Gray unsigned wavelets yield the best performance for faces, while gray signed wavelets and histogram-equalized gray pixels lead to the same level of performance, slightly worse than the gray unsigned wavelets. The version using principal components is less accurate than the histogram-equalized pixels. That the unsigned wavelets perform better than the signed wavelets is somewhat counterintuitive: we had postulated that the signs of the wavelets contain important information for face detection, since human faces have consistent patterns. Using the absolute magnitude of the wavelets may result in a representation with less variability than the signed version, while still encoding the important information for detection, allowing the classifier to find a better decision surface. To gauge the performance of the system, we can take a point on the ROC curve and translate the performance into real image terms; for instance, a given detection rate corresponds to a certain rate of false positives per pattern processed and, equivalently, to an approximate number of false positives per image.

The ROC curves for the people detection system are shown in the corresponding figure. Here, using all of the color features performs best; for instance, one operating point on that curve corresponds to about three false positives per image. Gray-level wavelets perform significantly better than the corresponding gray-level averages: here, unlike in the case of face detection, the raw pixel values do not characterize the object class well. When we use the PCA of the averages, the performance is significantly worse. The same figure also supports our hypothesis on the necessity of an overcomplete versus a complete representation: the system starting from a complete representation (non-overlapping color wavelets) underperforms all of the systems based on the overcomplete representation. The signed versions of both the color and gray-level wavelets perform worse than their unsigned versions; we hypothesize that the reason is the same as in the case of faces, namely that the unsigned versions result in more compact representations over which it is easier to learn (see the intuition given earlier).

The preliminary ROC curve for our car detection system, using unsigned wavelet features on color images, is shown in the corresponding figure.

Discussion

There is a simple relation between linear transformations of the original images and kernels that bears mentioning. An image x can be linearly decomposed into a set of features c by c = Ax, where A is a real matrix; we can think of the features c as the result of applying a set of linear filters to the image x.

If the kernel used is a polynomial of degree m, as in the experiments, then

K(x_i, x_j) = (x_i \cdot x_j + 1)^m, while K(c_i, c_j) = (c_i \cdot c_j + 1)^m = (x_i^\top A^\top A\, x_j + 1)^m

So using a polynomial kernel in the c representation is the same as using the kernel (x_i^\top A^\top A\, x_j + 1)^m in the original one. This implies that one can consider any linear transformation of the original images by choosing the appropriate square matrix A^\top A in the kernel K of the SVM.
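A small numerical check of this identity, using random data and a random square matrix A (purely illustrative):

import numpy as np

rng = np.random.default_rng(42)
m = 2                                    # polynomial degree
A = rng.normal(size=(10, 10))            # a square linear transformation
x_i, x_j = rng.normal(size=10), rng.normal(size=10)

c_i, c_j = A @ x_i, A @ x_j              # features c = A x

k_feature  = (c_i @ c_j + 1.0) ** m                  # polynomial kernel on c
k_original = (x_i @ (A.T @ A) @ x_j + 1.0) ** m      # equivalent kernel on x

print(np.isclose(k_feature, k_original))             # True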

As a consequence of this observation, we have a theoretical justification of why the pixel and eigenvector representations should lead to the same performance. In the case of PCA, the matrix A is orthonormal, therefore A^\top A = I, which implies that the SVM should find the same solution in both cases. On the other hand, if we choose only some of the principal components, or if we project the images onto a non-orthonormal set of Haar wavelets, the matrix A is no longer orthonormal, so the performance of the SVM may be different.

This theoretical justification is empirically validated, or at least not contradicted, in the case of faces, where the PCA and pixel representations perform at about the same level. However, for people, our results seem to contradict the theory: the PCA of the local averages performs much worse than the local averages themselves. Why might this be happening?

Closer analysis reveals that the pixel and principal component representations do not, in fact, lead to identical solutions. In the case of pixels, our polynomial kernel has the form

K(x_i, x_j) = (x_i \cdot x_j + 1)^m

In the case of PCA, we actually compute the eigenvectors over the set of mean-normalized pixel images x - \bar{x}, yielding as the kernel

K(c_i, c_j) = (c_i \cdot c_j + 1)^m = \left((x_i - \bar{x})^\top A^\top A (x_j - \bar{x}) + 1\right)^m

where now the c = A(x - \bar{x}) are computed from the mean-normalized pixel features. Since A is orthonormal in the case of PCA, we are actually computing

K(c_i, c_j) = (c_i \cdot c_j + 1)^m = \left((x_i - \bar{x}) \cdot (x_j - \bar{x}) + 1\right)^m

which is strictly not equivalent to the kernel in the case of pixels. Further analysis of these issues is an area of future research. (In general, the observation above holds for any kernel for which only dot products between the input arguments are needed, i.e., also for radial basis functions.)

The number of support vectors in the different solutions is one indicator of how difficult the individual problems are. The table below lists the number of support vectors for each of the different representations we have considered above. One of the most important conclusions we can draw from this table is that the signed representations do indeed result in more complex SVM decision surfaces than their unsigned counterparts, meaning that the signed representations are not as conducive to learning.

Table: Number of support vectors in each of the solutions to the classification problems (faces, people, and cars) for the different representations considered: raw pixels, histogram-equalized pixels, PCA, gray signed wavelets, gray unsigned wavelets, color signed wavelets, and color unsigned wavelets.

Different Classifiers

As we stated earlier, our main motivation for using classifiers trained by the support vector machine algorithm is its ability to find separating hyperplanes, in sparsely populated, high-dimensional feature spaces, that do not overfit. Until now, we have ignored exactly what form the classifier takes and have described the system in the context of a polynomial classifier of degree two. In this section, we present further experiments in which several different types of classifiers are used, and we compare the results over an identical test set.

The general form of the decision surface found through SVM training is

f(x) = \sum_{i=1}^{\ell} \alpha_i y_i K(x, x_i) + b

where \ell is the number of training data points (x_i, y_i), y_i being the label of training point x_i; the \alpha_i are non-negative parameters learned from the data; and K is the kernel that defines a dot product between projections of the two arguments in some feature space. By choosing different kernels K, we define qualitatively different decision surfaces: for instance, K(x, y) = (x \cdot y + 1)^n defines an n-th degree polynomial, and K(x, y) = \exp(-\|x - y\|^2) defines a Gaussian radial basis function.

Here we consider polynomial classifiers of varying degree n, from a perceptron (n = 1) to a degree-5 polynomial; several attempts at using a Gaussian RBF classifier would not converge in SVM training. The figure below shows the performance of each of these systems. From the ROC curves it is immediately evident that there is no appreciable difference among the classifiers. This has important practical ramifications: with a linear classifier, the decision surface can be described as

f(x) = \sum_{i=1}^{\ell} \alpha_i y_i (x \cdot x_i) + b

but since the y_i, \alpha_i, and x_i are all predetermined at run time, we can write the decision function in its more general form

f(x) = w \cdot x + b

since w = \sum_i \alpha_i y_i x_i. This means that we can classify a point with a single dot product.

Figure: ROC curves for people detection with the full feature set, comparing different classifiers: linear and polynomial classifiers of degree 2 through 5, including a polynomial classifier of degree 2 with no bias term. There is no appreciable difference in performance among these classifiers.
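A minimal sketch of this collapse for a linear SVM; the support vectors, labels, and multipliers are assumed given, and the names are illustrative:

import numpy as np

def linear_svm_weights(support_vectors, labels, alphas):
    """Precompute w = sum_i alpha_i * y_i * x_i for a linear SVM."""
    return (alphas * labels) @ support_vectors        # shape (d,)

def classify_linear(x, w, bias):
    """Classify with a single dot product: sign of w . x + b."""
    return 1 if (w @ x + bias) > 0 else -1

# Toy usage with random "support vectors".
rng = np.random.default_rng(3)
sv = rng.normal(size=(5, 8))
y = np.array([1, -1, 1, 1, -1])
a = rng.random(5)
w = linear_svm_weights(sv, y, a)
print(classify_linear(rng.normal(size=8), w, 0.1))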

Feature Selection

The goal of feature selection is to identify which features are important for detection and subsequently to use this subset of features as the representation over which learning occurs. Until now, we have described systems with no feature selection: after deciding on a feature class (pixels, wavelets, or PCA), we use all of the features of that class. Generally, we pay a price for large sets of features in the form of expensive computation. Moreover, it may be the case that certain features are not important for detection; for instance, the wavelets in the background portion of the people patterns are not good indicators of the presence of a person. It may be possible to reduce the dimensionality of the representation through a feature selection step while still preserving most of the system's representational power.

To find the optimal set of features, we would need to determine the performance of our system using every possible subset of features; for the wavelet representation, this would mean checking all \sum_s \binom{N}{s} unique subsets of the N wavelet features, an astronomical number. For a given size s, the selection of the best s features amounts to an integer programming problem, which is NP-complete.

We have implemented a manual feature selection technique that results in much lower-dimensional data, and therefore less expensive classification, while still maintaining a reasonable level of accuracy.

To do manual feature selection, we start with the tables of the average wavelet responses for our training set, broken down by scale and orientation. From these tables, we manually choose the strongest and weakest features, while at the same time ensuring that the features we choose are not overly redundant (i.e., do not overlap). This process is fairly subjective, but it allows for the enforcement of criteria such as picking strong features that span the space in some greedy manner, rather than just picking the n strongest features, some of which may overlap significantly and therefore not provide any improvement.

For the people detection case, the tables of raw averages are shown in Appendix C. The manually chosen features are shown overlaid on an example person in the figure below, and the performance of this reduced, 29-feature system is shown in the corresponding ROC figure. While it underperforms the versions using the full set of 1,326 wavelet features, the system is able to capture much of the structure of the human form using just a few local features, and the resulting performance may be acceptable for certain applications.

Training with Small Data Sets

Figure: The reduced set of manually chosen wavelet features for fast people detection, overlaid on an example image of a person.

Most example-based systems for object detection have relied on large sets of positive and negative examples to train a system to differentiate between the target class and the non-target class. The table below enumerates the number of positive and negative examples used to train different detection systems reported on in the literature.

Gathering positive examples of an object class is an expensive, tedious task: all of these systems require input data in which the objects are aligned in the same position in the image and the images are scaled to the same size. Typically, we have cheap access to an unlimited number of negative examples, while obtaining a large number of positive examples is relatively expensive. In our domain of people detection, we invested significant effort in gathering a large number of positive examples. Furthermore, the domains for object detection that people have tackled are well defined and easily accessible: while it is time consuming to gather more positive examples, there are no inherent limitations in doing so. We know what a face or person looks like, so we simply find or take pictures of examples from these classes; we bring to bear much prior knowledge in engineering object detection systems for these domains.

What if our detection problem were such that we had information about only a small number of elements of the positive class? Let us also assume that we have no prior experience or information about this detection problem. This is not too unbelievable a situation; consider a hypothetical task of detecting certain rare anomalies in images taken with an electron microscope.

Taking this one step further, what if it is also the case that our knowledge of negative examples is severely restricted as well?

Figure: ROC curves for people detection comparing different wavelet features (color, gray unsigned, and gray signed) and different feature set sizes (the full set of 1,326 features versus the reduced set of 29).

Researchers                      Domain
Vaillant, Monrocq & Le Cun       Faces
Sung and Poggio                  Faces
Rowley, Baluja & Kanade          Faces
Moghaddam and Pentland           Faces
Schneiderman and Kanade          Faces
Papageorgiou                     Faces
Papageorgiou                     People

Table: Training set sizes (positive and negative examples) for several object detection systems reported on in the literature.

One of the main attractions of the SVM framework is that it controls both the training error and the complexity of the decision classifier at the same time. This can be contrasted with other training techniques, like back-propagation, that only minimize the training error; since there is no control of the classifier complexity, this type of system will tend to overfit the data and provide poor generalization.

In practical terms, this means that SVMs can find good solutions to classification problems in very high dimensions. In addition to being a theoretically sound property, this capability has been demonstrated empirically in the literature in face detection (Osuna et al.), text categorization (Joachims), and people detection (Papageorgiou et al.). All of these systems, and other object detection systems (Sung and Poggio; Rowley et al.), use a large set of positive examples in addition to a large set of negative examples.

The figure below quantifies the performance of our wavelet-feature color people detection system when the positive training set size is varied (1, 10, 100, and the full set of people plus mirror images) and the negative training set size is varied (10, 100, and the full set of non-people patterns). The experiments with the reduced training set sizes were each run several times with randomly chosen data points; the ROC curves report the average performance.

If we assume that we have unlimited access to negative examples, as represented by panel (a) of the figure, we can see that even with just one positive example the system performs extremely well: at the fully trained system's benchmark false positive rate, the systems trained with one positive example still correctly detect a substantial fraction of the people, and increasing the number of positive examples to 100 raises the detection rate further.

With limited access to negative examples, panels (b) and (c) of the figure show that the system still performs very well: even with as few as 10 negative training examples, the larger positive-training versions reach good accuracy at the benchmark false positive rate.

This leads us to believe that, for the domain of people detection, the choice of representation is more important than gathering large sets of training data. In our case, the transformation from pixels to wavelets compresses the image information into a model that is quite compact, so that even a single positive training example characterizes the class well.

Figure: ROC curves comparing different sized positive training sets for people detection. The full positive training set is compared against sets of 100, 10, and 1 positive examples, each averaged over several random draws. We vary the size of the negative training set from the full set of 11,361, to 100, to 10 negative examples in panels (a), (b), and (c), respectively. Even with one positive example, the system is able to learn a great deal of the structure of the people class.

Chapter

Detection by Components

One of the most prevalent problems with our static people detection system is its difficulty in detecting people when a portion of the body is occluded or there is little contrast between the background and part of the body. One would expect that if we knew there was a head, left arm, and legs in the proper configuration, but we could not see the right arm, this might still be a person. In other words, if we look for the core building blocks, or components, of an object, and allow some leniency in permitting one or two of the components that make up the object to be missing, this may result in a more robust detection system than our full-pattern approach.

This chapter develops a component-based object detection system for static images. Here we apply it to the problem of people detection, but the architecture is quite general and could be applied to, among other objects, faces and cars. The description in this chapter closely follows the material published in Mohan and in Mohan et al. We first introduce the detection-by-components framework and review some relevant work in component-based object detection, then describe the system development and architecture, and finally show results of the system and compare it to the original full-body people detection system.

Introduction

In this type of system, geometric information concerning the physical structure of the human body supplements the visual information present in the image and should thereby improve the overall performance of the system. More specifically, the visual data in an image is used to detect body components, and knowledge of the structure of the human body allows us to determine whether the detected components are proportioned correctly and arranged in a permissible configuration. In contrast, a full-body person detector relies solely on visual information and does not take advantage of the known geometric properties of the human body.

Also, it is sometimes difficult to detect the human body pattern as a whole due to variations in lighting and orientation. The effect of uneven illumination and varying viewpoint on individual body components like the head, arms, and legs is less drastic, and hence they are comparatively easier to identify. Another reason to adopt a component-based approach to people detection is that the framework directly addresses the issue of detecting people who are partially occluded or whose body parts have little contrast with the background: the system may be designed, using an appropriate classifier combination algorithm, to detect people even if not all of their components are detected.

The fundamental design of the system is a two-level hierarchical classifier: there are specialized detectors for finding the different components of a person at the base level, whose results are combined in a top-level classifier. The component-based detection system attempts to detect components of a person's body in an image, i.e., the head, the left and right arms, and the legs, instead of the full body. The system checks to ensure that the detected components are in the proper geometric configuration and then combines them using a classifier. We will show that this approach of integrating components using a classifier increases accuracy compared to the full-body version of our people detection system.

The system introduces a new hierarchical classification architecture for visual data classification. Specifically, it comprises distinct example-based component classifiers, trained to detect different objects, at one level, and a similar example-based combination classifier at the next. This type of architecture, where example-based learning is conducted at two or more levels, is called an Adaptive Combination of Classifiers (ACC). The component classifiers separately detect components of the person object, i.e., heads, legs, and arms. The combination classifier takes the output of the component classifiers as its input and classifies the entire pattern under examination as a person or a non-person.

Classifier Combination Algorithms

Recently, a great deal of interest has been shown in hierarchical classification structures, i.e., pattern classification techniques that are combinations of several other classifiers. In particular, two methods have received considerable attention: bagging and boosting. Both of these algorithms have been shown to increase the performance of certain classifiers for a variety of data sets (Breiman; Freund and Schapire; Quinlan). Despite the well-documented practical success of these algorithms, the reason why bagging and boosting work is still open to debate. One theory, proposed by Schapire (Schapire et al.), likens boosting to support vector machines in that both maximize the minimum margin over the training set; however, his definition of margin differs from Vapnik's. Bauer and Kohavi present a study of such structures, including bagging and boosting, oriented towards determining the circumstances under which these algorithms are successful.

Previous Work

Most of the previous work on component-based detection systems has focused on face detection. An overview of the relevant systems (Yuille; Yuille et al.; Leung et al.; Burl et al.; Burl and Perona; Shams and Spoelstra; Yow and Cipolla; Forsyth and Fleck; Lipson) is presented in the introduction. We highlight several systems here.

Yuille and Yuille et al. describe systems that extract facial features in a framework where the detection problem is cast as an energy minimization problem; hand-crafted deformable templates are used for the individual features. Leung et al. and Burl et al. use local templates to match eye, nose, and mouth features on the human face and determine valid arrangements of these features using random graph matching. Computationally, this amounts to a constrained search through a very large set of candidate face configurations. This system has been shown to have some robustness against occlusions.

The system of Shams and Spoelstra uses a neural network to generate confidences for possible left- and right-eye regions, which are paired together to form all possible combinations. The confidences of these pairings are weighted by their topographic suitability, and the weighted scores are then thresholded to classify the pattern; the weights are defined by a 2D Gaussian.

Yow and Cipolla have also developed a component-based approach to detecting faces. In their system, potential features are categorized into candidate groups based on topographic evidence, and probabilities that these groups are faces are assigned to them. The probabilities are updated using a Bayesian network; if the final probability measure of a group is above a certain threshold, it is declared a detection. The features are initially identified using an image invariance scheme.

While the component-based systems described above take different approaches to detecting objects in images by components, they have two features in common: they all have component detectors that identify candidate components in an image, and they all have a means of integrating these components and determining whether together they define a face.

System Details

In this section, we describe the structure and operation of the component-based object detection system as applied to the domain of people detection.

Overview of System Architecture

This section explains the overall architecture of the system by tracing the detection process when the system is applied to an image; the figure below is a graphical representation of this procedure.

The process of classifying a pattern starts by taking a 128 x 64 window as input. This input is then processed to determine where, and at which scales, the components of a person (head, legs, left arm, and right arm) may be found within the window, using our prior knowledge of the geometry of bodies. All of these candidate regions are processed by the respective component detectors to find the strongest candidate components. There are four distinct component detectors in this system, which operate independently of one another and are trained to find, separately, the four components of the human body: the head, the legs, and the left and right arms.

The component detectors process the candidate regions by applying the quadruple-density Haar wavelet transform to them and then classifying the resultant data vector. The component classifiers are quadratic polynomials trained using the support vector machine algorithm; the training of the component and combination classifiers is described in detail later in this chapter. The strongest candidate component is the one that produces the highest positive raw output, referred to in this thesis as the component score, when classified by the component classifiers. The raw output of an SVM is a rough measure of how well a classified data point fits its designated class, as defined earlier.

The highest component score for each component is fed into the combination classifier, which is a linear classifier. If the highest component score for a particular component is negative, i.e., the component detector in question did not find a component in the geometrically permissible area, then a component score of zero is used instead. The combination classifier processes the set of scores received from the component classifiers to determine whether the pattern is a person.

Since our classifier is not shift and scale invariant, we follow the same brute force search procedure used in the general detection system: the 128 x 64 window is shifted across and down the image, and the image itself is processed at several sizes relative to its original size. These steps allow the system to detect people of various sizes at any location in an image.
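As a rough illustration of this brute force, multi-scale search, the following Python sketch slides a 128 x 64 window over a resized image pyramid. It is a minimal sketch, not the thesis implementation: the scale factors, stride, and the classify_window function are placeholders standing in for the trained component-based classifier.

import numpy as np
from PIL import Image

WINDOW_H, WINDOW_W = 128, 64  # base pattern size used throughout this chapter

def detect_people(image, classify_window, scales=(0.7, 1.0, 1.5), stride=4):
    """Brute force multi-scale search: resize the image, slide a 128x64
    window across and down it, and collect windows the classifier accepts.
    `classify_window` is a placeholder for the trained component-based
    classifier; it should return True for "person"."""
    detections = []
    for s in scales:  # placeholder scale factors, not the thesis values
        resized = image.resize((int(image.width * s), int(image.height * s)))
        arr = np.asarray(resized)
        for top in range(0, arr.shape[0] - WINDOW_H + 1, stride):
            for left in range(0, arr.shape[1] - WINDOW_W + 1, stride):
                window = arr[top:top + WINDOW_H, left:left + WINDOW_W]
                if classify_window(window):
                    # map the detection back to original image coordinates
                    detections.append((int(top / s), int(left / s),
                                       int(WINDOW_H / s), int(WINDOW_W / s)))
    return detections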

Details of System Architecture

This section of the chapter outlines the details of the component detectors and the combination classifier.

Stage One: Identifying Components of People in an Image

When a 128 x 64 window is evaluated by the system, the component detectors are applied only to specific areas of the window and at only particular scales.

Figure: Diagrammatic description of the operation of the system. Areas of the 128 x 64 window where it is possible to detect a head, legs, and arms are identified, and the respective component detectors (quadratic SVMs) are applied to all locations within these permissible areas. The "most suitable" head, legs, and arms are identified by the component detectors, and the component scores (the raw outputs of the component classifiers) are fed into the combination classifier (a support vector machine), which classifies the pattern as a "person" or "non-person". In the output, a solid rectangle outlines the detected person and dashed boxes mark the components.

Figure: It is very important to place geometric constraints on the location and scale of component detections. Even though a detection may be the strongest in a particular window examined, it might not be located properly. In this figure, the shadow of the person's head is detected with a higher score than the head itself. If we did not check for proper configuration and scale, component detections like these would lead to false alarms and/or missed detections of people.

This is because the arms, legs, and head of a person have a defined relative configuration (the head is found above the legs, with the left and right arms to either side), and the components must also be proportioned correctly. This is, in effect, prior information that is being integrated into our detection system. By placing these geometric constraints on the location and scale of the components, we ensure that they are arranged in the form of a human body, and thus improve the performance of the object detection system. This is necessary because even if a component detection is the strongest in a particular window under examination (i.e., it has the highest component score), it does not follow that it is in the correct position, as illustrated in the figure above.

Since the component detectors operate on rectangular areas of the image, the constraints placed on the location and scale of component detections are expressed in terms of the properties of the rectangular region examined. For example, the centroid and boundary of the rectangular area determine the location of a component detection, and the width of the rectangle is a measure of a component's scale. All coordinates are relative to the upper left hand corner of the 128 x 64 window.

We calculated the geometric constraints for each component from a sample of the training images. The constraints themselves, both in location and scale, are tabulated in the table below and shown in the accompanying figure. The values of quantities such as the location of the centroid and the top and bottom boundary edges of a component were determined by taking the statistical mean of those quantities over positive detections in the training set; the tolerances were set to include all positive detections in the training set. Permissible scales were also estimated from the training images.

Table: Geometric constraints placed on each component: the centroid (row and column), the minimum and maximum scale, and other criteria (a bottom edge row for the lower body, and top edge rows for the bent arms). The components are the head and shoulders, the lower body, the right arm (extended and bent), and the left arm (extended and bent). All coordinates are in pixels and relative to the upper left hand corner of the rectangle.

There are two sets of constraints for the arms: one intended for extended arms and the other for bent arms.
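For concreteness, a minimal sketch of checking a candidate component detection against this kind of constraint is shown below. The field layout and the example numbers are illustrative (loosely based on the head-and-shoulders panel of the constraints figure); the exact values used in the thesis did not survive extraction.

from dataclasses import dataclass

@dataclass
class ComponentConstraint:
    """Geometric constraints for one component, in pixel coordinates
    relative to the upper left corner of the 128x64 window."""
    centroid_row: float
    centroid_col: float
    row_tol: float
    col_tol: float
    min_width: int
    max_width: int

def satisfies_constraint(box, c: ComponentConstraint) -> bool:
    """box = (top, left, bottom, right) of a candidate component detection.
    The centroid must lie within the tolerances and the width (a proxy for
    scale) must lie between the permissible minimum and maximum."""
    top, left, bottom, right = box
    row = (top + bottom) / 2.0
    col = (left + right) / 2.0
    width = right - left
    return (abs(row - c.centroid_row) <= c.row_tol and
            abs(col - c.centroid_col) <= c.col_tol and
            c.min_width <= width <= c.max_width)

# Example with illustrative values (not the thesis table entries):
head = ComponentConstraint(centroid_row=23, centroid_col=32,
                           row_tol=2, col_tol=3, min_width=28, max_width=42)
print(satisfies_constraint((10, 18, 38, 46), head))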

For each of the component detectors, the pixel images are processed with the quadruple density Haar wavelet transform described earlier and in Appendix A. We use wavelets at two scales to represent the patterns. As in the case of the full body detection system, we compute the transform in each of the three color channels, and for each scale, location, and orientation of wavelet we use the coefficient that is maximum in absolute value among the three color channels. In this way, the information in the three color channels is collapsed into a single "virtual" color image.
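The following is a minimal sketch of this per-coefficient collapse across the R, G, and B channels; the array shapes are assumptions for illustration, not the layout used in the thesis code.

import numpy as np

def collapse_color_channels(coeffs_rgb: np.ndarray) -> np.ndarray:
    """coeffs_rgb has shape (3, ...) holding Haar wavelet coefficients
    computed independently on the R, G, and B channels.  For every scale,
    location, and orientation we keep the coefficient with the largest
    absolute value among the three channels, producing a single
    "virtual color" coefficient array."""
    channel_idx = np.abs(coeffs_rgb).argmax(axis=0)          # winning channel per coefficient
    return np.take_along_axis(coeffs_rgb,
                              channel_idx[np.newaxis, ...],  # add the channel axis back
                              axis=0).squeeze(0)

# Example: three channels of 2x2 coefficients
c = np.array([[[ 0.1, -0.9], [ 0.3, 0.0]],
              [[-0.5,  0.2], [ 0.1, 0.4]],
              [[ 0.2,  0.1], [-0.6, 0.2]]])
print(collapse_color_channels(c))   # picks -0.5, -0.9, -0.6, 0.4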

We use quadratic polynomial classifiers, trained using support vector machines, to classify the data vectors resulting from the Haar wavelet representation of the components. The optimal hyperplane is computed as a decision surface of the form

f(x) = \mathrm{sgn}(g(x))

where

g(x) = \sum_{i=1}^{l} y_i \alpha_i K(x, x_i^*) + b

In the equation above, K is one of many possible kernel functions, y_i \in \{-1, 1\} is the class label of the data point x_i^*, and \{x_i^*\}_{i=1}^{l} is a subset of the training data set. The x_i^* are called support vectors and are the points from the data set that fall closest to the separating hyperplane. Finally, the coefficients \alpha_i and b are determined by solving a large-scale quadratic programming problem. The kernel function K that is used in the component classifiers is a quadratic polynomial and has the form shown below.

Figure: Geometric constraints placed on the different components. All coordinates are in pixels and relative to the upper left hand corner of the rectangle; dimensions are also expressed in pixels. (a) illustrates the geometric constraints on the head, (b) the lower body, (c) an extended right arm, and (d) a bent right arm. Each panel specifies the minimum and maximum component size, the centroid location and its tolerance, and, where applicable, the permissible top or bottom edge rows.

Figure: The top row shows examples of the heads and shoulders and lower bodies of people that were used to train the respective component detectors. Similarly, the bottom row shows examples of the left arms and right arms that were used for training.

K(x, x_i) = (x \cdot x_i + 1)^2

The value f(x) \in \{-1, 1\} in the decision-surface equation above is referred to as the binary class of the data point x being classified by the SVM; values of 1 and -1 refer to the classes of the positive and negative training examples, respectively. As that equation shows, the binary class of a data point is the sign of the raw output g(x) of the SVM classifier. The raw output of an SVM classifier is the distance of a data point from the decision hyperplane: in general, the greater the magnitude of the raw output, the farther the point is from the decision surface, or the more likely the point belongs to the binary class it is grouped into by the SVM classifier.
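A minimal sketch of evaluating the raw output g(x) and the binary class f(x) from a stored set of support vectors is given below, assuming a degree-2 polynomial kernel of the form (x . x_i + 1)^2 as reconstructed above. The support vectors, coefficients, and bias used in the example are tiny made-up arrays for illustration only.

import numpy as np

def quadratic_kernel(x, xi):
    """Degree-2 polynomial kernel, K(x, x_i) = (x . x_i + 1)^2."""
    return (np.dot(x, xi) + 1.0) ** 2

def svm_raw_output(x, support_vectors, labels, alphas, b):
    """g(x) = sum_i y_i * alpha_i * K(x, x_i*) + b.  The magnitude of g(x)
    reflects the distance of x from the separating hyperplane and is what
    this chapter calls the component score."""
    k = np.array([quadratic_kernel(x, sv) for sv in support_vectors])
    return float(np.sum(labels * alphas * k) + b)

def svm_class(x, support_vectors, labels, alphas, b):
    """f(x) = sgn(g(x)): +1 for in-class patterns, -1 otherwise."""
    return 1 if svm_raw_output(x, support_vectors, labels, alphas, b) >= 0 else -1

# Toy example with made-up support vectors and coefficients:
svs = np.array([[0.2, 0.4], [-0.3, 0.1]])
ys = np.array([1.0, -1.0])
alphas = np.array([0.7, 0.7])
b = -0.1
print(svm_raw_output(np.array([0.25, 0.35]), svs, ys, alphas, b))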

The component classifiers are trained on positive and negative images for their respective classes. The positive examples are of arms, legs, and heads of people in various environments, both indoors and outdoors, and under various lighting conditions; the negative examples are taken from scenes that do not contain any people. Examples of positive images used to train the component classifiers are shown in the training-examples figure above.

Stage Two: Combining the Component Classifiers

Once the component detectors have been applied to all geometrically permissible areas within the 128 x 64 window, the highest component score for each component type is entered into a four-dimensional vector that serves as the input to the combination classifier. The component score is the raw output of the component classifier and is the distance of the test point from the decision hyperplane; this distance is a rough measure of how well a test point fits into its designated class. If the component detector does not find a component in the designated area of the window, then zero is placed in the data vector. A component score of zero corresponds to a test point that is classified as neither a component nor a non-component, because it lies on the hyperplane.
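The sketch below shows how the four component scores might be assembled for the combination classifier, with missing or negative maxima clamped to zero as just described. The score-producing component detectors and the linear weights are placeholders, not the trained models.

import numpy as np

COMPONENTS = ("head", "legs", "left_arm", "right_arm")

def combination_input(candidate_scores):
    """candidate_scores maps each component name to the list of raw SVM
    outputs obtained over its geometrically permissible candidate regions.
    Keep the highest score per component, clamping negative maxima to zero,
    to form the 4-dimensional input to the combination classifier."""
    vec = []
    for name in COMPONENTS:
        scores = candidate_scores.get(name, [])
        best = max(scores) if scores else 0.0
        vec.append(max(best, 0.0))   # a missing/negative component contributes 0
    return np.array(vec)

def classify_person(vec, w, b):
    """A linear combination classifier reduces to the sign of w . vec + b."""
    return float(np.dot(w, vec) + b) > 0.0

# Toy example with made-up scores and weights:
scores = {"head": [0.8, 0.1], "legs": [1.2], "left_arm": [-0.3], "right_arm": [0.05]}
print(combination_input(scores))                 # [0.8, 1.2, 0.0, 0.05]
print(classify_person(combination_input(scores), w=np.array([0.5, 0.6, 0.4, 0.4]), b=-0.7))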

Table: Examples of positive and negative data points used to train the combination classifier. Each data point consists of four component scores (head and shoulders, lower body, right arm, left arm); the component scores of the positive examples are generally higher.

The combination classifier is trained with a linear kernel, which has the following form:

K(x, x_i) = x \cdot x_i + 1

This type of hierarchical classification architecture, where learning occurs at multiple stages, is termed an Adaptive Combination of Classifiers (ACC). Positive examples were generated by processing images of people at one scale and taking the highest component score from the geometrically allowed detections for each component type. The table above shows examples of data vectors that were used to train the combination classifier.

Results

The performance of this system is compared to that of other component-based person detection systems that combine the component classifiers in different ways, as well as to the full-body person detection system. This framework allows us to determine the strengths of the component-based approach to detecting objects in images and the performance of various classifier combination algorithms.

Experimental Setup

All of the component-based detection systems that were tested in this experiment are two-tiered systems: they detect heads, legs, and arms at one level, and at the next level they combine the results of the component detectors to determine whether the pattern in question is a person.

Table: Number of positive and negative examples used to train each of the component classifiers (head and shoulders, lower body, left arm, and right arm).

The component detectors that were used in all of the component-based people detection systems are identical and were described earlier in this section. The positive examples for training these detectors were obtained from the same database that is used in the full-body system. The images of people were taken in Boston and Cambridge, Massachusetts, with different cameras, under different lighting conditions, and in different seasons. This database includes images of people who are rotated in depth and who are walking, in addition to frontal and rear views of stationary people. The positive examples of the lower body include images of women in skirts and people wearing full-length overcoats, as well as people dressed in pants. Similarly, the database of positive examples for the arms was varied in content and included arms at various positions relative to the body. The negative examples were obtained from images of natural scenery and buildings that did not contain any people. The number of positive and negative examples used to train the different component classifiers is given in the table above.

Adaptive Combination of Classifiers Based Systems

Once the component classifiers were trained, the next step in evaluating the Adaptive Combination of Classifiers (ACC) based systems was to train the combination classifier. Positive and negative examples for the combination classifier were collected from the same databases that were used to train the component classifiers. A positive example was obtained by processing each image of a person at a single appropriate scale: the four component detectors were applied to the geometrically permissible areas of the image at the allowable scales, and the greatest positive classifier output for each component was assembled into a vector to form a positive training example. If not all of the component scores were positive, then no vector was formed and the window examined did not yield an example. The negative examples were computed in a similar manner, except that this process was repeated over the entire image and at various scales; the images for the negative examples did not contain people.

We used these positive and negative examples to train the combination classifiers. First, second, third, and fourth degree polynomial classifiers were trained and tested.

The trained system was run over our positive test database of images of people to determine the positive detection rate; there was no overlap between these images and the ones used to train the system. The out-of-sample false alarm rate was obtained by running the system over a negative test database of images that do not contain any people; these images are pictures of natural scenery and buildings. In running the system over these images, windows were examined and classified as in our full-body detection system. We generate ROC curves that measure performance by shifting the decision surface.

Voting Combination of Classifiers Based System

The other method of combining the results of the component detectors that was tested is a Voting Combination of Classifiers (VCC). VCC systems combine classifiers by implementing a voting structure amongst them. One way of viewing this arrangement is that the component classifiers are "weak experts" in the matter of detecting people: VCC systems poll the weak experts and then, based on the results, decide if the pattern is a person. For example, in one possible implementation of VCC, if a majority of the weak experts classify a pattern as a person, then the system declares the pattern to be a person.

One motivating reason for trying VCC as an approach to combining the component classifiers is that, since VCC is one of the simplest classes of classifier combination algorithms, it affords the best opportunity to judge the strengths of a component-based object detection system that is not augmented with a powerful classifier combination method. Similarly, when compared to an ACC based system, one can determine the benefits of more sophisticated classifier combination methods. Since the computational complexity of these methods is known, and the experiment described in this section determines their performance, this framework characterizes the tradeoff between enhanced performance and greater computational complexity for these systems. The person detection systems that are evaluated here are the ACC based systems, the VCC based system, and the baseline full-body system.

In the incarnation of VCC that is implemented and tested in this experiment, a positive detection of the person class results only when all four component classes are detected in the proper configuration. The geometric constraints placed on the components are the same in the ACC and VCC based systems and were described earlier. For each pattern that the system classifies, it must evaluate the logic presented below:

person = Head ∧ Legs ∧ Left arm ∧ Right arm

where a state of true indicates that a pattern belonging to the person class has been detected.
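A minimal sketch of this voting rule is shown below: each component detector "votes" by thresholding its score, and all four must fire for a person to be declared. The threshold values here are placeholders chosen to illustrate the mechanism, not the operating points used in the experiments.

def vcc_person(component_scores, thresholds):
    """Voting combination of classifiers as used in this experiment: the
    pattern is declared a person only if every component detector fires
    (its score exceeds its threshold) within the permissible configuration.
    Both arguments map component names to values; the threshold values are
    placeholders chosen per the desired point on each component's ROC curve."""
    return all(component_scores[name] > thresholds[name]
               for name in ("head", "legs", "left_arm", "right_arm"))

print(vcc_person({"head": 0.4, "legs": 1.1, "left_arm": 0.2, "right_arm": 0.3},
                 {"head": 0.1, "legs": 0.1, "left_arm": 0.1, "right_arm": 0.1}))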

Figure: ROC curves illustrating the ability of the component detectors to correctly detect a person in an image: the complete person detector (baseline), the head and shoulders detector, the lower body detector, and the right arm detector. The positive detection rate is plotted as a percentage against the false alarm rate, which is measured on a logarithmic scale; the false alarm rate is the number of false positive detections per window inspected.

The detection threshold of the VCC based system is determined by selecting appropriate thresholds for the component detectors. The thresholds for the component detectors are chosen such that they all correspond to approximately the same positive detection rate; this information was estimated from the ROC curves of each of the component detectors, shown in the figure above. For example, if one wished to run the VCC based system at a threshold corresponding to a particular positive detection rate, the thresholds for the head, leg, and arm classifiers would each be chosen to yield that rate. These ROC curves were calculated in a manner similar to the procedure described earlier. A point of interest is that these ROC curves indicate how discriminating the individual components of a person are in the process of detecting the full body: the legs perform the best, followed by the arms and the head. The superior performance of the legs may be due to the fact that the background of the lower body in images is usually either street pavement or grass, and hence is relatively clutter free compared to the background of the head and arms.

Experimental Results

An analysis of the ROC curves suggests that a component-based person detection system performs very well, and significantly better than the baseline full-body system at all thresholds. This is noteworthy because the baseline system itself produces very accurate results. It should be emphasized that the baseline system uses the same image representation scheme (Haar wavelets) and classifier (SVM) as the component detectors used in the component-based systems.

Figure: ROC curves comparing the performance of various component-based people detection systems; the systems differ in the method used to combine the classifiers that detect the components of a person's body. The curves shown are for the complete person detector (baseline), the voting combination of classifiers, and adaptive combinations of classifiers using linear, quadratic, cubic, and fourth-degree polynomial SVMs. The positive detection rate is plotted as a percentage against the false alarm rate, which is measured on a logarithmic scale; the false alarm rate is the number of false positive detections per window inspected. The curves indicate that a system in which a linear SVM combines the results of the component classifiers performs best. The baseline system is a full-body person detector similar to the component detectors used in the component-based systems.

All of the component-based systems performed comparably to or better than the baseline system.

For the component-based systems, the ACC approach produces better results than VCC. In particular, the ACC based system that uses a linear classifier to combine the components is the most accurate. During the course of the experiment, the linear SVM system displayed a superior ability to detect people even when one of the components was not detected, in comparison to the higher degree polynomial SVM systems. A possible explanation for this observation is that the higher degree polynomial classifiers place a stronger emphasis on the presence of combinations of components, due to the structure of their kernels (Burges): the second, third, and fourth degree polynomial kernels include terms that are products of up to two, three, and four elements (component scores), which suggests that all of those elements must be person-like for the pattern to be classified as a person. The emphasis placed on the presence of combinations of components increases with the degree of the polynomial classifier, and the results show that the performance of the ACC based systems decreases as the degree of the polynomial classifier increases. In fact, the ROC curve for the ACC based system that employs a fourth degree polynomial classifier is very similar to that of the VCC based system. Interestingly, both of these systems look for all four components in a pattern: the VCC based system explicitly requires the presence of all four components, whereas the ACC based system with the fourth degree polynomial classifier makes this an implicit requirement through the design of its kernel. Two other possible reasons for the decrease in performance with higher degree polynomials are that higher degree classifiers may need more training data, or that they may be overfitting.

It is also worth mentioning that the database of test images used to generate the ROC curves did not include just frontal views of people, but also contained a variety of challenging images. Some of these classes of images portray exactly the situations in which we would expect a component-based approach to improve results: pictures of people walking and running, images in which the person is partially occluded or a part of their body has little contrast with the background, and a few images of people who are slightly rotated in depth. A selection of these images is shown in the figures below.

One of the figures below shows the results obtained when the system was applied to images of people who are partially occluded or whose body parts blend in with the background. In these examples, the system detects the person while running at a threshold that, according to the ROC curve shown earlier, corresponds to a very low rate of false alarms per pattern inspected.

Another figure shows the result of applying the system to sample images with clutter in the background; even under such circumstances the system performs very well. The lower four images were taken with different cameras than the ones used for the training set images, and the conditions and surroundings for these pictures are different as well.

Figure: Samples from the test image database. These images demonstrate the capability of the system: it can detect running people, people who are slightly rotated, people whose body parts blend into the background (bottom row, second from right: the person is detected even though the legs are not), and people under varying lighting conditions (top row, second from left: one side of the face is in light and the other in shadow).

Figure: Results of applying the system to images of partially occluded people and people whose body parts have little contrast with the background. In the first image, the person's legs are not visible; in the second image, her hair blends in with the curtain in the background; and in the last image, her right arm is hidden behind the column.

Figure: Results from the component-based person detection system. The solid boxes outline the complete pedestrian, while the dashed rectangles mark the components.

Chapter

A Real-Time System and Focus of Attention

While the core system is quite effective, its processing time, among other characteristics, limits its use in practical applications. Clearly, considering every pattern in an image is wasteful, and there may be ways of limiting the number of patterns that are processed. Taking people detection as our domain, we view the proper use of our detection system for practical applications as one where some focus of attention mechanism identifies areas in the image where there may be a person, and our detection system then processes patterns in only those regions.

This chapter presents:

- a real-time implementation of our people detection system as part of a driver assistance system, in which an obstacle detection module provides a bounding box for the detection system; and

- an integration of our people detection system with a biologically inspired focus of attention module that identifies salient areas of an image using several feature maps.

A Real-Time Application

As alluded to in the introduction, there are many possible applications of our object detection technology, ranging from automotive assistance systems to surveillance. The only factor inhibiting our system from being used right now in such systems is its relatively slow processing speed. It is important to note that our full system is, for the most part, an unoptimized research tool; we have not invested significant effort in improving its core speed.

As an alternative to dynamic detection strategies, we can use a modified version of our static detection system to achieve real-time performance. This section describes a real-time application of our technology as part of a larger system for driver assistance.

Figure: Side view of the DaimlerChrysler S Class vehicle with the Urban Traffic Assistant (UTA); our people detection system has been integrated into this vehicle.

The combined system, including our people detection module, is currently deployed live in the DaimlerChrysler S Class demonstration vehicle shown in the figure above. The remainder of this section describes the integrated system.

Speed Optimizations

Our original, unoptimized static detection system for people detection in color images processes sequences at a rate on the order of one frame per several minutes, which is clearly inadequate for any real-time automotive application. We have implemented optimizations that have yielded several orders of magnitude worth of speedup.

Subset of features: Instead of using the entire set of 1,326 wavelet features, we use just 29 of the more important features (manually chosen) that encode the outline of the body. This changes the 1,326-dimensional inner product in the decision function into a 29-dimensional dot product. The 29 features are shown overlaid on an example person in the accompanying figure.

Reduced set vectors: From the decision function in Appendix B, we can see that the computation time is also dependent on the number of support vectors, N_s, which in our system is typically large. We use results from Burges to obtain an equivalent decision surface expressed in terms of a small number of synthetic vectors. This method yields a new decision surface that is essentially equivalent to the original one but uses only a small number of reduced set vectors.
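The sketch below illustrates only the evaluation-time consequence of this idea: the decision function is the same weighted kernel sum, so shrinking the set of vectors shrinks the per-window cost. The vectors and weights here are random stand-ins; in the real system the reduced set vectors and their weights are computed offline with Burges' reduced set method (not shown), so that the two surfaces nearly coincide.

import numpy as np

def decision(x, vectors, weights, b, kernel):
    """g(x) = sum_i weights_i * K(x, vectors_i) + b.  For the full SVM the
    vectors are the support vectors with weights y_i * alpha_i; for the
    reduced-set machine they are a much smaller synthetic set with their
    own weights (computed offline, not shown here)."""
    return sum(w * kernel(x, v) for w, v in zip(weights, vectors)) + b

linear = lambda a, c: float(np.dot(a, c))

# Made-up numbers: the point is only that evaluation cost scales with the
# number of vectors, so a smaller set speeds up every window examined.
x = np.array([0.2, -0.1, 0.4])
full_vectors = np.random.randn(1000, 3); full_weights = np.random.randn(1000)
reduced_vectors = np.random.randn(20, 3); reduced_weights = np.random.randn(20)
print(decision(x, full_vectors, full_weights, 0.0, linear))
print(decision(x, reduced_vectors, reduced_weights, 0.0, linear))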

Gray level images: Our use of color images is predicated on the fact that the three color channels (RGB) contain a significant amount of information that gets washed out in gray-level images of the same scene.

Figure: A view of the cameras in UTA.

This use of color information, however, carries a significant computational cost: the resizing and Haar transform operations are performed separately on each color channel. In order to improve system speed, we modify the system to process gray-level intensity images.

The reduction in performance resulting from using gray-level images and 29 features is shown in the figure below: at a fixed false positive rate, the full 1,326-feature color system is noticeably more accurate than the 29-feature gray-level system. While this is a significant reduction in performance, this level of accuracy may still be adequate for certain applications.

Focus of Attention

To further enhance the processing speed of the system, we can use a focus of attention module that concentrates processing only on areas of an image that are likely to contain people. This focus of attention can key off of different characteristics, including motion, distance, local image complexity, shape, color, and so on.

Figure: ROC curves for people detection using color and gray-level images with 1,326 and 29 wavelet features. While going from 1,326 color features to 29 gray-level features results in a significant decrease in accuracy, the performance may still be adequate for some applications.

Integration With The DaimlerChrysler Urban Traffic Assistant

To this end, we have integrated our people detection system with a stereo-based obstacle detection system, in collaboration with DaimlerChrysler AG. DaimlerChrysler has an obvious interest in obstacle detection algorithms for automotive applications, as a means to aid driving and, ultimately, to allow for autonomous driving. One of the important requirements of the system is that it be able to deal with both highway and urban scenes, the latter being much more complex than the former.

The DaimlerChrysler Urban Traffic Assistant (UTA) is a real-time vision system for obstacle detection, recognition, and tracking (Franke and Kutbach; Franke et al.). The car has a binocular stereo vision system mounted on the rear-view mirror, a PowerPC in the trunk, and a flat panel display between the driver and passenger seats on which system processing and output are visualized. UTA currently has several processing modules implemented, including pedestrian motion recognition, lane detection, car detection, sign detection, and an automatic car following system.

Their system relies on 3D position and depth information derived from the stereo cameras. To overcome the expensive correspondence problem, they have developed a feature-based approach to stereo analysis that runs at a high frame rate. The system clusters feature points that correspond to the same object, thereby providing a rectangular bounding box around each obstacle in the scene.

Using this bounding box, which closely outlines the shape of the obstacle, we expand the area to provide a larger region of interest in which we run our people detection system; this is done to alleviate possible misalignments in the bounding box provided by the stereo system. Furthermore, the stereo system provides an accurate estimate of the distance to each object. Using this information, we can constrain the number of sizes at which we look for people to a small number, typically under three scales.

Within these regions of interest, we use our gray-level 29-feature system with the reduced set method that lowers the number of support vectors. In real-world test sequences processed while driving through Esslingen/Stuttgart, Germany, we are able to achieve processing rates of several frames per second. DaimlerChrysler expects to upgrade their onboard system to faster computers soon, so further gains in speed will follow.

Future Work

The portion of the total system time spent in our people detection module is on the order of milliseconds per obstacle. An analysis of how much time is taken by each portion of the people detection module shows that the smallest amount of time is spent in the SVM classification. This bodes well for improving performance: we should be able to use a much richer set of features than the 29 currently used, perhaps on the order of a few hundred features, without significantly degrading the speed of the system.

Focus of Attention Experiments

Our current metaphor for detection is one where we do a brute force search over the entire image; this is clearly inefficient. The integrated system presented in the previous section showcases one method for focusing processing only on important areas of an image, in that case defined by the results of an obstacle detection system. In this section we present another version of our system that uses a focus of attention module to direct processing to only certain areas of the image.

To detect objects in cluttered scenes, the human visual system rapidly focuses its attention on areas where there is compelling visual information that may indicate the presence of an interesting object. The information that such a mechanism keys off of could include features such as intensity, color, and orientation discontinuities. This type of model is developed in Itti et al. and Itti and Koch. Their system decomposes an input image into several feature maps, each of which keys off of different visual information; the visual features the method uses are intensity, color, and orientation (via Gabor filters). These feature maps are integrated into a single saliency map where, at a given point in the image, the individual feature responses are additively combined.

Figure: Looking out from UTA; processing results are shown on the flat panel display between the driver and passenger seats.

The mechanism that changes the focus of attention from one point to another is modeled as a dynamical neural network. The network outputs the ordered salient locations, in simulated time, as (x, y) positions in the image.

As in the case of our DaimlerChrysler integration, we can use this information to focus processing on only the important areas of an image. The saliency-based attention system does not provide information on the size of the salient regions, so for our test we define a fixed-size region centered on each salient point in which we try to find people; this region corresponds to the maximum body size of the people we will look for. Furthermore, we limit the number of regions processed to the top N salient regions per image and test N ∈ {5, 6, 7}.
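The following sketch shows how the top-N salient points could be turned into fixed-size regions of interest for the detector. It is a minimal sketch: the region size is a placeholder standing in for the maximum body size searched for, and the salient points are assumed to come, already ordered, from an external attention module such as that of Itti et al. (not reimplemented here).

def regions_of_interest(salient_points, image_shape, top_n=5, region=(128, 128)):
    """Take the top_n salient (x, y) points (ordered by the attention
    module) and return fixed-size regions, clipped to the image, in which
    the people detector will be run."""
    h, w = image_shape
    rh, rw = region
    boxes = []
    for (x, y) in salient_points[:top_n]:
        top = min(max(int(y - rh / 2), 0), max(h - rh, 0))
        left = min(max(int(x - rw / 2), 0), max(w - rw, 0))
        boxes.append((top, left, top + rh, left + rw))
    return boxes

print(regions_of_interest([(60, 100), (300, 200), (500, 80)], (480, 640), top_n=2))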

The ROC curves for this system compared to the base system are shown in the figure below.

Figure: A sequence showing the single most salient location in each frame, as processed by the system of Itti et al. and Itti and Koch. Our people detection system can use this focus of attention module as a preprocessor to target specific areas of an image.

Figure: ROC curves for people detection comparing the base static system to versions that preprocess an image by focusing attention on the top 5, 6, or 7 salient regions in the image.

The performance of the versions using the focus of attention mechanism is slightly worse than that of the base system; however, far fewer patterns are considered, resulting in a significantly faster system. The ROC curves also suggest that, beyond the first few salient regions, considering additional regions may not affect performance.

Integrating this saliency information into our detection framework is not a perfect fit, on account of the following:

- The center of the salient region is often not at the center of the person, and there is no scale information, so by default a large area around the salient spot is examined.

- While the module usually identifies one of the people in the sequence as the most salient location, the other person may rank much lower in saliency. Combined with the previous point, this means that a large portion of the image still ends up being examined.

Irrespective of these issues, the integration of a focus of attention mechanism results in a significant improvement in processing time. The table below lists the percentage of the total patterns that are processed in the versions of our system using this focus of attention module, as compared to the base system that processes all regions. These results indicate that approximately the same performance can be achieved while processing only about one third of the patterns.

Table: The percentage of the total possible patterns that are examined by each version of the system (base, FOA top 5, FOA top 6, and FOA top 7). The use of a focus of attention mechanism such as the one we consider here can result in a significant increase in processing speed.

Chapter

Integration Through Time

This thesis has focused mainly on the static detection problem, but with the increasing amount of video information available, it is of great interest to see how our system could be applied to video sequences. The straightforward method of applying our static system to video sequences is to analyze each frame as a static image. However, this brute force technique ignores all the dynamical information available, information that could be quite useful for the purposes of detection. The brute force version will be used as our benchmark in this chapter.

In our extensions of the system to handle the time domain, we do not drastically alter the core technology used in the static version. Rather, we either slightly modify it or augment it with various modules that use the static detection system in different ways. In this manner, each of the techniques we present in this chapter can be seen as a more general means of converting a static detection system into one with improved processing capabilities for video sequences.

Our first system is a pure pattern classification approach to dynamical object detection. Here we seek to circumvent the extensive engineering that is quite typical in current dynamical detection systems, and to avoid assuming particular underlying dynamics. We modify our base static detection approach to represent dynamic information by extending the static representation into the time domain. With this new representation, the system is able to learn limited dynamics of people, with or without motion: it learns what a person looks like and what constitutes valid dynamics over short time sequences, without the need for explicit models of either shape or dynamics.

The second system to take advantage of dynamical information is a rule-based module that integrates information through time as an approximation to a Kalman filter. Kalman filtering theory assumes an underlying linear dynamical model and, given measurements of the location of a person in one image, yields a prediction of the location of the person in the next image and the uncertainty in this prediction. Our heuristic smooths the information in an image sequence over time by taking advantage of the fundamental a priori knowledge that a person in one image will appear in a similar position in the next image. We smooth the results through time by automatically eliminating false positives, that is, detections that do not persevere over small subsequences.

Figure: Example image sequences used to train the pure pattern classification version of our dynamic detection system.

The third system we develop uses a new approach to propagating general multimodal densities through time, based on the so-called Condensation technique (Isard and Blake). This technique has a significant advantage over the Kalman filter: it is not constrained to model a single Gaussian density but can effectively model arbitrarily complex densities. Condensation is also able to gracefully handle changes in scene structure (objects entering or exiting the scene), whereas with a Kalman filter we would have to run our detection system every few frames and initialize a new Kalman filter on any newly detected object.

Pure Pattern Classification Approach

In this section, our goal is to develop a detection system for video sequences that makes as few assumptions as possible. We do not want to develop an explicit model of the shape of people or outwardly model their possible motions in any way. Instead of a dynamical model, we use very high dimensional feature vectors to describe patterns through time. These are used to train a support vector machine classifier, which we expect will implicitly model certain dynamics.

We would like a technique that implicitly generates a model of both the shape and the valid dynamical characteristics of people, at the same time, from a set of training data. This should be accomplished without assuming that human motion can be approximated by a linear Kalman filter, described by a hidden Markov model, or captured by any other explicit dynamical model. The only assumption we make is that five consecutive frames of an image sequence contain characteristic information regarding the dynamics of how people appear in video sequences. From a set of training data, the system will learn exactly what constitutes a person and how people typically appear in these short time sequences.

Instead of using a single pattern from one image as a training example, our new approach takes the patterns at a fixed location in five consecutive frames, computes the 1,326 wavelet features for each of these patterns, and concatenates them into a single 6,630-dimensional feature vector for use in the support vector training. We use the images at times t-4, t-3, t-2, t-1, and t, where the person is aligned in the center of the image at time t. The figure above shows several example sequences from the training set; the full training set is composed of both positive and negative example sequences.

The extension to detecting people in new images is straightforward: for each candidate pattern in a new image, we concatenate the wavelet features computed for that pattern with the wavelet features computed at that location in the previous four frames. The full feature vector is subsequently classified by the SVM.
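A minimal sketch of building this concatenated multi-frame feature vector is shown below; extract_wavelet_features is a placeholder for the per-frame Haar wavelet computation at the candidate window location.

import numpy as np

def multiframe_feature_vector(frames, location, extract_wavelet_features):
    """Given the current frame and the previous four frames, compute the
    per-frame wavelet feature vector at the same window location in each
    frame and concatenate them (5 x per-frame features) into the single
    vector that is classified by the SVM."""
    assert len(frames) == 5, "expects frames t-4 ... t"
    per_frame = [extract_wavelet_features(f, location) for f in frames]
    return np.concatenate(per_frame)

# Toy stand-in: 4 features per frame -> a 20-dimensional vector
fake_extract = lambda frame, loc: np.ones(4) * frame
print(multiframe_feature_vector([1, 2, 3, 4, 5], (0, 0), fake_extract))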

We emphasize that it is the implicit ability of the support vector machine classification technique to handle small sets of data that sparsely populate a very high dimensional feature space that allows us to tackle this problem.

In developing this type of representation, we expect that the following dynamical information will be evident in the training data, and therefore encapsulated in the classifier:

- people usually display smooth motion or are stationary;

- people do not spontaneously appear or disappear from one frame to the next;

- camera motion is usually smooth or the camera is stationary.

One of the primary benefits of this technique is that it extends our rich feature set into the time dimension, so the system should be able to detect people with high accuracy while eliminating the transient false positives that would normally appear when using the static detection system.

This is purely a data-driven pattern classification approach to dynamical detection. We compare this approach to the static detection system trained with the individual images corresponding to frame t in each of the sequences, i.e., with the corresponding positive and negative 1,326-dimensional feature vectors. Both the static and dynamic systems are tested on the same out-of-sample sequence.

Figure: ROC curves for the static and pure pattern classification (multi-frame) detection systems. The detection rate is plotted against the false positive rate, measured on a logarithmic scale; the false positive rate is defined as the number of false detections per inspected window.

The figure above shows the ROC curves for the two systems. From these curves, we see that the system incorporating dynamical information performs significantly worse than the static system at most points on the ROC curve. If we look at our training data, we see that right-to-left moving people are significantly underrepresented, while the people in the test sequence are moving from left to right. This bias in the training data, which is not reflected in the test data, may be causing the worse performance.

Regardless, this experiment illustrates that the dynamic system is capable of doing classification in this very high dimensional space with a comparatively small number of training examples. It is important to note that our features are not 3D wavelets in space and time: what we have done is take a set of 2D wavelet features spread through time and use these to develop our model. One extension of our system that would be interesting to pursue is to use true spatiotemporal (3D) wavelets as features; such a system would learn the dynamics as a set of displacements and might therefore generalize better.

Unimodal Density Tracking

This section describes an extension to the static system that incorporates a Kalman-filter-like heuristic to track and predict detections in video sequences using unimodal Gaussian densities.

Kalman Filter

Kalman filters are recursive algorithms that provide optimal estimates of a system's parameters by combining a linear system model prediction with actual measurements. The basic Kalman filter emerged from classical optimal estimation theory and has been heavily used in many fields; for background and derivations of the Kalman filter, see Maybeck.

The goal of the Kalman filter algorithm is to estimate or predict the state of the process one time step into the future, based on the underlying dynamics of the model captured in the state transition matrix, and then, upon receiving the sensory measurements at that next time step, to correct the prediction to yield the optimal estimate, in a linear sense, of the state at that time. In the context of tracking and prediction in computer vision, Kalman filters have been widely used (Broida and Chellappa; Dickmanns and Graefe; Deriche and Faugeras; Harris). In a typical Kalman filter tracking situation, a Kalman filter is initialized over each object of interest. Over time, the Gaussian density function over each object shifts, spreads, and peaks depending on the reinforcement it receives from the measurement model.
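For concreteness, here is a minimal scalar, constant-position Kalman filter sketch showing the predict-then-correct cycle described above. The process and measurement noise variances (q, r) are illustrative values, not parameters from the thesis.

def kalman_step(x_est, p_est, z, q=1e-2, r=1e-1):
    """One predict/correct cycle of a scalar Kalman filter with an
    identity (constant-position) state transition: predict the state and
    its uncertainty, then correct using the new measurement z.
    q and r are illustrative process and measurement noise variances."""
    # predict
    x_pred = x_est            # trivial dynamics: state carried forward
    p_pred = p_est + q        # uncertainty grows by the process noise
    # correct
    k = p_pred / (p_pred + r)             # Kalman gain
    x_new = x_pred + k * (z - x_pred)     # blend prediction and measurement
    p_new = (1.0 - k) * p_pred
    return x_new, p_new

x, p = 0.0, 1.0
for z in [0.9, 1.1, 1.0, 0.95]:
    x, p = kalman_step(x, p, z)
print(x, p)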

Poor Man's Kalman (PMK)

As a zeroth-order approximation to Kalman prediction and tracking for our person detection and tracking task, we approximate a Kalman filter with a simple rule-based heuristic called the Poor Man's Kalman (PMK). Our heuristic smooths the information in an image sequence over time by taking advantage of the fundamental a priori knowledge that a person in one image will appear in a similar position in the next image. We smooth the results through time by automatically eliminating false positives, that is, detections that do not persevere over a small subsequence.

The rule is extremely simple and assumes we are looking at subsequences of n consecutive frames, k, k+1, ..., k+n-1:

    If a pattern is labeled as a person in fewer than a fixed fraction of
    the frames in an n-frame subsequence, then we expect that this pattern
    is a false positive, since true detections persevere through time, so
    eliminate that detection.

That is, if a pattern is labeled as a person fewer than the threshold number of times in any n-frame subsequence, we relabel that pattern as a non-person, thereby eliminating that presumed false positive.

For our tests, we use a small value of n. Since the people and the camera may be moving, we allow some tolerance in the overlap of patterns: for a given tolerance level t, and assuming a pattern that has been rescaled to the base 128 x 64 size, we allow a tolerance proportional to t in the x positions of the top-left and bottom-right corners and a corresponding tolerance in the y positions.

Figure: An illustration of how the Poor Man's Kalman heuristic would eliminate a detection that appears at a certain location for only one frame in a three-frame subsequence (frames t-2, t-1, t). The top row shows the raw hypothesized detections; after processing with the PMK technique, the detection that does not persist is eliminated from the final output.

The tolerances we test are t ∈ {0.1, 0.2, 0.3}. The effect of this technique is illustrated in the figure above. Note that a rule as simple as this will fail when there are large motions of either the camera or the people. This problem could be addressed by trying to predict linear motion in local regions rather than just looking at a single location in consecutive images, but we do not address it here.
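The sketch below captures the PMK relabeling rule over an n-frame subsequence, with detections represented as bounding boxes matched under a position/scale tolerance. The min_fraction parameter is explicitly a placeholder, since the exact fraction used in the thesis did not survive extraction, and the tolerance here is in raw pixels for simplicity.

def boxes_match(a, b, tol):
    """a, b = (x_tl, y_tl, x_br, y_br) in the rescaled 128x64 base frame;
    they are treated as the same detection if every corner coordinate
    agrees to within the tolerance."""
    return all(abs(ca - cb) <= tol for ca, cb in zip(a, b))

def pmk_filter(detections_per_frame, n=3, tol=2, min_fraction=0.5):
    """Poor Man's Kalman heuristic: keep a detection in the last frame of
    an n-frame subsequence only if a matching detection appears in at
    least min_fraction of the n frames (min_fraction is a placeholder).
    True detections persevere; transient false positives are dropped."""
    window = detections_per_frame[-n:]
    kept = []
    for det in window[-1]:
        hits = sum(any(boxes_match(det, d, tol) for d in frame) for frame in window)
        if hits >= min_fraction * n:
            kept.append(det)
    return kept

frames = [[(10, 10, 74, 138)], [(11, 10, 75, 138), (40, 5, 90, 120)], [(10, 11, 74, 139)]]
print(pmk_filter(frames))   # the persistent box survives; the transient box does not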

To compare this system against the static version, and to facilitate comparisons with the other time integration approaches, we train on the same positive and negative subsequences (t-4, t-3, t-2, t-1, t) used to train the pure pattern classification approach. For a fair comparison with the other techniques that are more directly based on static detection, we train a static detection SVM with the positive and negative individual images corresponding to frame t in each of the subsequences. The resulting ROC curve for our PMK technique is shown in the figure below.

The ROC curve shows that, while at higher thresholds PMK performs the same as or slightly worse than the static system, at lower thresholds PMK performs better. The reason is that when the core detection threshold is strict, people patterns that lie close to the decision boundary may not be correctly classified, which can yield a subsequence in which a person appears and then disappears.

Figure: ROC curves for the static and Poor Man's Kalman detection systems. The number associated with each PMK curve (0.1, 0.2, 0.3) indicates the tolerance for persevering detections.

These are exactly the conditions under which the PMK technique dictates the removal of the detection, which in this case is a correct person and not a false positive.

Multimodal Density Tracking

Kalman filtering has been used extensively to track objects in video sequences through time. The key assumption of the Kalman filter is that the state density is Gaussian, which means that a Kalman filter can track only a single object through time. One straightforward way of achieving multi-object tracking with Kalman filters is to instantiate a Kalman filter for each object being tracked. Though this would certainly work, as is shown in our PMK experiments, this strategy is not as appealing as a single unified framework that can track multiple objects. Such an algorithm, called Condensation, has been proposed by Isard and Blake.

Condensation

The two components of Condensation are a model of the state dynamics and a measurement model. We denote the time history of the state density as X_t = \{x_1, x_2, \ldots, x_t\} and the time history of the measurement output as Z_t = \{z_1, z_2, \ldots, z_t\}. We assume that the state dynamics can be modeled as a first order Markov chain:

P(x_t \mid X_{t-1}) = P(x_t \mid x_{t-1})

The measurement component of the Condensation model assumes that the observations z_t are mutually independent and independent of the state dynamics; the observation process is therefore P(z|x). It is important to note that no other assumptions are made about the observation process, so in the general case it could be multimodal.

The prediction, or propagation, of the densities to the next time step is done by applying Bayes' rule, using as the prior for the current step the posterior state density transformed by one step of the underlying dynamical model:

P(x_t \mid Z_t) = k_t \, P(z_t \mid x_t) \, P(x_t \mid Z_{t-1})

where

P(x_t \mid Z_{t-1}) = \int_{x_{t-1}} P(x_t \mid x_{t-1}) \, P(x_{t-1} \mid Z_{t-1}) \, dx_{t-1}

and k_t is a normalization constant. In our case, x_t indicates the position and scale of a person in the image, and z_t is the support vector machine output at the given positions and scales.

The main problem here is to estimate and rebalance the multimodal state density through time as we obtain new measurements. Isard and Blake propose a factored sampling algorithm to efficiently estimate these densities: they first sample points from the prior and apply the dynamical model to them, and the new points are then weighted in proportion to the measured features, i.e., in proportion to P(z|x). This weighted point set serves as an efficient approximation to the posterior P(x|z). Since we are concerned with accuracy in detection and tracking, and less concerned with efficiency, we use a formulation of the Condensation algorithm that does not rely on factored sampling to estimate the state densities: our version represents the density, and takes measurements, at every point in space.

Our version of this density propagation approach is as follows. Using the posterior density from the previous image in the sequence, we directly transfer this density to be the prior for the current image; this imposes the assumption of a trivial underlying dynamical model for people, P(x_t) = P(x_{t-1}). We then run our person detection system over the current image and reinforce or inhibit the density in the locations where the support vector machine output is high or low, respectively. Formally, we use estimates of the density P(z|x), the probability relating support vector machine outputs to the confidence that there is a person at a given location, generated from a set of training data.

The result of this processing is that for each frame we have a posterior density reflecting the likelihood of the presence or absence of people at each location. To accomplish actual detection, we threshold these densities.

The derivation of the density propagation equation above is summarized here:

P(x_t \mid Z_t) = P(x_t \mid z_t, Z_{t-1})
               = \frac{P(z_t \mid x_t, Z_{t-1}) \, P(x_t \mid Z_{t-1})}{P(z_t \mid Z_{t-1})}     (by Bayes' rule)
               = \frac{P(z_t \mid x_t, Z_{t-1}) \, P(x_t \mid Z_{t-1})}{P(z_t)}                  (by independence of the z_i)
               = \frac{P(z_t \mid x_t) \, P(x_t \mid Z_{t-1})}{P(z_t)}                           (since x_t already encapsulates all information about Z_{t-1})
               = k_t \, P(z_t \mid x_t) \, P(x_t \mid Z_{t-1})                                   (since P(x_t \mid Z_t) is a distribution that integrates to 1, we can normalize by k_t)

Furthermore, we are assuming a trivial dynamical model, in which the prediction of a person's position in the next frame is simply the location in the current frame. This means

P(x_t \mid Z_{t-1}) = P(x_{t-1} \mid Z_{t-1})

so

P(x_t \mid Z_t) = k_t \, P(z_t \mid x_t) \, P(x_{t-1} \mid Z_{t-1})

The effect of this type of processing is illustrated in the figure below, where the density is visualized in two dimensions. Unlike Isard and Blake, who in effect obtain measurements from a fixed number of locations in the image, we obtain and measure candidate features at every point in the new image.

Since we are assuming a trivial dynamical model, the only quantity in the propagation equation that is of any real interest is the measurement process P(z_t | x_t). This function propagates the density based on the output of the support vector machine classifier at each location. The problem is that the SVM output is the distance of the pattern from the separating hyperplane, not a probability on x; the next section addresses this issue.

Estimating P(z|x)

In words, the measurement process P(z_t | x_t) denotes the probability of a certain SVM output conditioned on x, the presence of people at each location. We base our measurement process on the SVM output at each location; this value is a distance from the hyperplane that separates the people from the non-people. Assuming we have already estimated P(z|x) at each location in the image, we take the output of the SVM and compute this likelihood.

Figure: An illustration of Condensation in action. The posterior density p(x) from time t-1 is used as the prior at time t. This prior is then acted on by the effect of the SVM measurements at time t: strong SVM output z reinforces the prior and shifts the density. The result of this propagation is the rightmost density, the posterior for time t.

To counter the effects of possibly non-smooth SVM outputs, we average P(z|x) over local regions in our results. One of the key assumptions is that the output of the SVM peaks over a person and then decays relatively smoothly as we move away from the pattern, both in (x, y) and in scale; empirically, the support vector machine raw output does indeed decay fairly smoothly. We have not designed our system to be shift invariant, but in fact there is a small amount of shift invariance that the system seems to learn from the training set, possibly due to small misalignments in the data.

To transform this raw distance value into a distribution conditioned on x, we use a set of out-of-sample images that have been manually annotated with the locations of the people, x^{(i)}. At each location x^{(j)} in the image we obtain the SVM output for that pattern, z^{(j)}. This gives us a version of the density that reflects distances to people, i.e., P(z^{(j)} \mid d(x^{(i)}, x^{(j)})).

When running over a new image, we use this density as follows. Having computed the SVM output z^{(i)} for a given pattern x^{(i)}, we compute P(z^{(i)} | x) by smoothing over a local neighborhood, indexed as A_N with |A_N| = N:

P(z^{(i)} \mid x) = \frac{1}{N} \sum_{j \in A_N} P(z^{(i)} \mid d(x^{(i)}, x^{(j)}))

This gives us the likelihood that serves to reinforce the prior in areas where there are people.

As our function d(a, b), we need a measure that takes into account the differences in both position and scale between a and b. We use

d(a, b) = |a^{tl}_x - b^{tl}_x| + |a^{tl}_y - b^{tl}_y| + |a^{br}_x - b^{br}_x| + |a^{br}_y - b^{br}_y|

where the superscripts tl and br indicate the top-left and bottom-right coordinates, respectively, and the subscripts x and y denote the x and y components of the coordinates. This is a very simple measure that fits our criterion of encapsulating differences in both position and scale; other, more complex distance measures could be used, but we find that this one works effectively.
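The sketch below shows the box distance d(a, b) and the local averaging of P(z|x). The lookup from (SVM output, distance) to probability is a placeholder standing in for the density estimated from the annotated out-of-sample images, and the neighborhood size is an illustrative value.

import numpy as np

def box_distance(a, b):
    """d(a, b): L1 distance between the top-left and bottom-right corners
    of two boxes, capturing differences in both position and scale.
    Boxes are (x_tl, y_tl, x_br, y_br)."""
    return sum(abs(ca - cb) for ca, cb in zip(a, b))

def likelihood_at(x, z, annotated_people, p_z_given_d, n_neighbors=5):
    """Average P(z | d(x, x_j)) over the nearest annotated person boxes x_j
    (the local neighborhood A_N).  `p_z_given_d` is a placeholder for the
    density estimated from manually annotated out-of-sample images; here
    it must be a function of (z, distance).  n_neighbors is illustrative."""
    dists = sorted(box_distance(x, xj) for xj in annotated_people)
    nearest = dists[:min(n_neighbors, len(dists))]
    return float(np.mean([p_z_given_d(z, d) for d in nearest]))

# Toy stand-in for the estimated density: likelihood decays with distance
toy_density = lambda z, d: max(z, 0.0) * np.exp(-d / 50.0)
people = [(10, 10, 74, 138), (200, 40, 264, 168)]
print(likelihood_at((12, 11, 75, 140), z=1.3, annotated_people=people, p_z_given_d=toy_density))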

The motivation for the Condensation algorithm is to propagate densities through time, not to identify distinct locations where an object lies. In other words, the goal is to track densities, not to do pure pattern classification. This leads to a problem, when applying it to a detection task, in dealing with objects with unequal probability masses.

If there are two people in the scene and each is detected equally well (the same number of detections with the same strength by the underlying support vector machine), then the Condensation framework will assign equally strong probability masses to the two people. However, it is extremely unlikely that all the people in the scene will be detected with equal strength. This leads the Condensation algorithm to assign probability masses of varying size to different people. In turn, this means that thresholding the densities to localize the objects of interest may miss weaker peaks in the density.

Our solution to this is to first do a hill-climbing search in the density to find the distinct peaks. Then we individually threshold the peaks by specifying that a constant mass around each peak indicates a detection, which amounts to labeling the points closest to each peak as a detection. The figure on global versus local thresholding below illustrates this concept.
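The peak-finding and local-thresholding step can be sketched as follows on a 1D discretized density; the discretization, function names, and mass value are assumptions for illustration and not the thesis implementation.

```python
# Sketch only: find the distinct peaks (local maxima) of a discretized
# density, then label the points closest to each peak that together account
# for a fixed probability mass as one detection.
import numpy as np

def find_peaks(density: np.ndarray) -> list:
    """Indices that are local maxima of the 1D discretized density."""
    peaks = []
    for i in range(len(density)):
        left = density[i - 1] if i > 0 else -np.inf
        right = density[i + 1] if i < len(density) - 1 else -np.inf
        if density[i] >= left and density[i] >= right and density[i] > 0:
            peaks.append(i)
    return peaks

def local_threshold(density: np.ndarray, peak: int, mass: float) -> list:
    """Starting from a peak, greedily add the nearest points until the
    accumulated probability mass reaches `mass`; these points constitute
    one detection."""
    order = np.argsort(np.abs(np.arange(len(density)) - peak))  # closest first
    selected, accumulated = [], 0.0
    for idx in order:
        selected.append(int(idx))
        accumulated += density[idx]
        if accumulated >= mass:
            break
    return selected

# Usage: each peak is reported as a detection supported by a constant mass,
# regardless of how much total mass that peak carries.
density = np.array([0.02, 0.05, 0.30, 0.08, 0.02, 0.04, 0.25, 0.20, 0.04])
density = density / density.sum()
for p in find_peaks(density):
    print(p, local_threshold(density, p, mass=0.2))
```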

Results

We train and test the system in the same manner as the other time integration systems. The performance of the Condensation-based approach, as compared to the static system, is shown in the ROC figure below. The curves show that the Condensation-based technique significantly outperforms the static version in the low false positive portion of the ROC curve. This is not a surprising result, since the effect of false positives is magnified by our taking a constant volume as the detected patterns: when there are few false positives, the detected patterns are more likely to be people, and these detections will persevere on account of the propagation of the density.

Discussion of Time Integration Techniques

It is of interest to compare the different time integration techniques and to discuss their relative merits and drawbacks.

Pattern Classification Approach

The multi-frame pattern classification approach is the one most directly related to the core static system and is the system that makes the fewest assumptions about dynamical models. Here we assume that any of the dynamics are captured in the five-frame subsequences. The SVM learning engine is able to learn in this high-dimensional input space and is able to generalize quite well, as seen in the ROC curve. The main problem with our particular version is that the training data seems to be heavily biased toward left-moving people.

At run time, this system is quite expensive. The core classification step for each window uses on the order of ns multiplications, where n is the number of features in a single window and s is the number of support vectors. Summed over all the windows in an average image, this leads to a very large number of multiplications per image, with a constant number of accesses per pattern.

Poor Man's Kalman Heuristic

The PMK heuristic approximation to a Kalman filter is the simplest addition to the core static approach that takes advantage of information over time. This module assumes no underlying dynamics, but exploits the fact that the video sequences we typically use are captured at a high enough rate that people do not exhibit significant changes in position and scale from one frame to another. In fact, this is why our simple distance measure, essentially the L1 norm in position and scale, works effectively. In domains where the assumption of smooth motion is not valid, this approach would not work; cases like these are better served by an approach that is able to more directly capture direction and velocity in the model.
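As one plausible illustration of this kind of smoothness heuristic (not the exact PMK rule, which is specified earlier in the thesis; the box representation and threshold are assumptions), detections could be kept only when a nearby detection exists in the previous frame:

```python
# Sketch only: a smoothness heuristic in the spirit of the discussion above.
def l1_box_distance(a, b):
    # a, b are ((x_tl, y_tl), (x_br, y_br)); sum of corner differences in
    # position and scale (the L1 measure used in the previous section).
    (atl, abr), (btl, bbr) = a, b
    return (abs(atl[0] - btl[0]) + abs(atl[1] - btl[1]) +
            abs(abr[0] - bbr[0]) + abs(abr[1] - bbr[1]))

def smooth_detections(current, previous, threshold=16.0):
    """Keep a detection in the current frame only if it lies within a small
    L1 distance, in position and scale, of some detection in the previous
    frame; the threshold value is a placeholder."""
    kept = []
    for det in current:
        if any(l1_box_distance(det, prev) <= threshold for prev in previous):
            kept.append(det)
    return kept
```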

The run-time complexity of this system is again on the order of ns multiplications per window, with a small constant number of accesses per pattern assuming the proper data structures. In all, the core SVM computation accounts for on the order of ns multiplications per window over the image.

Condensation-Based Approach

The Condensation-based extension to the static system is the most elegant, but involves significantly more development and changes to the system. One of the main issues is the estimation of the likelihood P(z|x), which is nontrivial; in our case, this likelihood can be estimated from a set of data.

At run time, the Condensation-based approach is quite a bit more complex than the other techniques. For classification, each pattern needs on the order of ns multiplications, with an additional multiplication for propagating the prior; here, too, there are a constant number of additions per window, since to compute the likelihood we average over a small neighborhood, plus a division to normalize the density. This, combined with the necessary peak-finding step, means a constant number of visits per pattern. Though more complex than the other approaches, Condensation seems to be the method of choice due to its flexibility in modeling multimodal densities. In addition to the computational complexity, another minor drawback is that the framework is developed for tracking and not pure detection; our peak-finding heuristic is one simple way to circumvent this.


Figure: The top image is the original and the bottom images are the color-coded raw support vector machine outputs from the image when processed for four different scales of people. The output of the SVM peaks over the people patterns and decays smoothly as we shift the location and scale of the pattern.

Figure: Global thresholding may miss a peak due to unequally weighted portions of the density. To correct for this, we do a hill-climbing search to find the distinct peaks and then locally threshold the portions of the density around the peaks.

Figure: ROC curves (detection rate versus false positive rate, the latter on a logarithmic scale) for the static and Condensation-based detection systems. The numbers associated with the Condensation curves (1, 2, 5) are the number of points that the algorithm uses at the peaks of the density.

Chapter

Conclusion

This thesis has presented a general trainable framework for object detection that has been successfully applied to face, people, and car detection in static images. Along with some extensions to video sequences, we have investigated different representations and a component-based approach, and have implemented a real-time version of the system that is running in a DaimlerChrysler experimental vehicle. The system is robust, portable, and can be made efficient. While we have not pushed this system to be the best detection system in a particular domain (though this certainly may be possible), we have shown its applicability to a wide range of object classes. We feel that, in practical uses of this system, the proper architecture should combine a focus of attention module with our static detection system; we have presented preliminary results along these lines here as well.

Future work around this system should focus on the following:

- Component-based techniques. In the component-based system we present, the parts and geometries are manually specified. An important next step is to develop a method for automatically selecting the components from a large set of possible components.

- Time integration. This thesis has presented some promising first steps in augmenting the core static detection system with different modules that take advantage of dynamical information. These techniques, and others, should be investigated further and extended. One interesting direction would be to develop a system that uses wavelets in space and time.

- Representations. One of the big leaps we have taken in this thesis is our choice of representation: Haar wavelets. While we have shown that this wavelet representation achieves excellent results when compared to other representations, there are many other possible feature sets that could be used. Subsequent research in this area is imperative.

- Support vector machines. Though support vector machines are well founded in statistics and heavily studied these days, there are a couple of open questions whose solutions would benefit our work as well. First, how can we quickly train support vector machines with very large data sets? Second, are there ways of using support vector machines to do feature selection in a principled manner? We present one idea here, but this is largely a heuristic first step.

Appendix A

Wavelets

A.1 The Haar Wavelet

Wavelets provide a natural mathematical structure for describing our patterns; a more detailed treatment can be found in Mallat. These vector spaces form the foundations of the concept of a multiresolution analysis. We formalize the notion of a multiresolution analysis as the sequence of approximating subspaces

    V^0 ⊂ V^1 ⊂ ... ⊂ V^j ⊂ V^{j+1} ⊂ ...;

the vector space V^{j+1} can describe finer details than the space V^j, but every element of V^j is also an element of V^{j+1}. A multiresolution analysis also postulates that a function approximated in V^j is characterized as its orthogonal projection on the vector space V^j.

As a basis for the vector space V^j, we use the scaling functions

    φ^j_i(x) = √(2^j) φ(2^j x − i),   i = 0, ..., 2^j − 1,

where, for our case of the Haar wavelet,

    φ(x) = 1 for 0 ≤ x < 1,
    φ(x) = 0 otherwise.

Next, we define the vector space W^j that is the orthogonal complement of two consecutive approximating subspaces, V^{j+1} = V^j ⊕ W^j. The W^j are known as wavelet subspaces and can be interpreted as the subspace of details in increasing refinements. The wavelet space W^j is spanned by a basis of functions

    ψ^j_i(x) = √(2^j) ψ(2^j x − i),   i = 0, ..., 2^j − 1,

where, for Haar wavelets,

    ψ(x) = 1 for 0 ≤ x < 1/2,
    ψ(x) = −1 for 1/2 ≤ x < 1,
    ψ(x) = 0 otherwise.

The full set of wavelet functions forms a basis for L²(R). It can be shown, under the standard conditions of multiresolution analysis, that all the scaling functions can be generated from dilations and translations of one scaling function; similarly, all the wavelet functions are dilations and translations of the mother wavelet function. Figure (a) shows the scaling and wavelet functions. The approximation of some function f(x) in the space V^j is found to be

    A_j f(x) = Σ_{k∈Z} λ_{j,k} φ^j_k(x),   with λ_{j,k} = ⟨f(u), φ^j_k(u)⟩,

and similarly the projection of f(x) on W^j is

    D_j f(x) = Σ_{k∈Z} γ_{j,k} ψ^j_k(x),   with γ_{j,k} = ⟨f(u), ψ^j_k(u)⟩.

The structure of the approximating and wavelet subspaces leads to an efficient cascade algorithm for the computation of the scaling coefficients λ_{j,k} and the wavelet coefficients γ_{j,k}:

    λ_{j,k} = Σ_{n∈Z} h_{n−2k} λ_{j+1,n},
    γ_{j,k} = Σ_{n∈Z} g_{n−2k} λ_{j+1,n},

where {h_i} and {g_i} are the filter coefficients corresponding to the scaling and wavelet functions. Using this construction, the approximation of a function f(x) in the space V^j is

    A_j f(x) = √(2^j) Σ_{k∈Z} λ_{j,k} φ(2^j x − k),

and similarly the approximation of f(x) in the space W^j is

    D_j f(x) = √(2^j) Σ_{k∈Z} γ_{j,k} ψ(2^j x − k).

Since we use the Haar wavelet, the corresponding filters h and g are the two-tap averaging and differencing filters: the scaling coefficients at the coarser level are simply the averages of pairs of adjacent coefficients from the finer level, while the wavelet coefficients are the corresponding differences.

It is important to observe that the discrete wavelet transform (DWT) performs downsampling, or decimation, of the coefficients at the finer scales, since the filters h and g are moved in a step size of 2 for each increment of k.
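To make the cascade concrete, the following is a minimal sketch (an illustration, not the thesis implementation) of the 1D Haar decomposition under the averaging/differencing convention above; the exact filter normalization is an assumption and does not affect the structure of the algorithm.

```python
# Sketch only: one full 1D Haar decomposition by repeated averaging and
# differencing of adjacent pairs.
def haar_step(signal):
    """One cascade step: scaling coefficients are averages of adjacent pairs,
    wavelet coefficients are half-differences; stepping by 2 over the input is
    the downsampling/decimation noted above."""
    assert len(signal) % 2 == 0
    scaling = [(signal[2 * k] + signal[2 * k + 1]) / 2.0
               for k in range(len(signal) // 2)]
    wavelet = [(signal[2 * k] - signal[2 * k + 1]) / 2.0
               for k in range(len(signal) // 2)]
    return scaling, wavelet

def haar_dwt(signal):
    """Full decomposition: recurse on the scaling coefficients, collecting the
    wavelet coefficients at every scale (finest first)."""
    coeffs = []
    current = list(signal)
    while len(current) > 1:
        current, detail = haar_step(current)
        coeffs.append(detail)
    return current, coeffs  # final scaling coefficient plus per-scale details

# Usage: an 8-sample signal yields 4 + 2 + 1 wavelet coefficients.
approx, details = haar_dwt([9, 7, 3, 5, 6, 10, 2, 6])
print(approx, details)
```

Each call to haar_step halves the number of coefficients, which is exactly the decimation by 2 noted above.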

A.2 2-Dimensional Wavelet Transform

The natural extension of wavelets to 2D signals is obtained by taking the tensor product of two 1D wavelet transforms. The result is three types of wavelet basis functions. The first type of wavelet is the tensor product of a wavelet by a scaling function, ψ(x, y) = ψ(x) ⊗ φ(y); this wavelet encodes a difference in the average intensity along a vertical border, and we will refer to its value as a vertical coefficient. Similarly, a tensor product of a scaling function by a wavelet, ψ(x, y) = φ(x) ⊗ ψ(y), is a horizontal coefficient, and a wavelet by a wavelet, ψ(x, y) = ψ(x) ⊗ ψ(y), is a diagonal coefficient, since this wavelet responds strongly to diagonal boundaries.

Since the wavelets that the standard transform generates have irregular support, we use the non-standard 2D DWT where, at a given scale, the transform is applied to each dimension sequentially before proceeding to the next scale (Stollnitz et al.). The results are Haar wavelets with square support at all scales, shown in Figure (b).
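A corresponding sketch of one scale of the non-standard 2D decomposition is given below; it is an illustration under the same averaging/differencing convention, with power-of-two image sizes assumed, and the three detail bands play the role of the vertical, horizontal, and diagonal coefficients named above.

```python
# Sketch only: one scale of the non-standard 2D Haar DWT.
import numpy as np

def nonstandard_2d_step(image):
    """Apply one 1D averaging/differencing step along x, then along y.
    Returns the approximation band (recursed on at the next scale) and the
    three detail bands."""
    img = np.asarray(image, dtype=float)
    # step along x: pair adjacent columns
    avg_x = (img[:, 0::2] + img[:, 1::2]) / 2.0
    dif_x = (img[:, 0::2] - img[:, 1::2]) / 2.0
    # step along y: pair adjacent rows of each intermediate band
    approx     = (avg_x[0::2, :] + avg_x[1::2, :]) / 2.0  # averaged in x and y
    vertical   = (dif_x[0::2, :] + dif_x[1::2, :]) / 2.0  # difference along x
    horizontal = (avg_x[0::2, :] - avg_x[1::2, :]) / 2.0  # difference along y
    diagonal   = (dif_x[0::2, :] - dif_x[1::2, :]) / 2.0  # difference in both
    return approx, vertical, horizontal, diagonal

bands = nonstandard_2d_step(np.arange(16.0).reshape(4, 4))
print([b.shape for b in bands])  # four 2x2 bands
```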

A.3 Quadruple Density Transform

For the 1D Haar transform, the distance between two neighboring wavelets at level n (with support of size 2^n) is 2^n. To obtain a denser set of basis functions that provide better spatial resolution, we need a set of redundant basis functions, or an overcomplete dictionary, where the distance between the wavelets at scale n is (1/4)·2^n (Figure (c)). The straightforward approach of shifting the signal and recomputing the DWT will not generate the desired dense sampling. Instead, this can be achieved by modifying the DWT. To generate wavelets with double density, where wavelets of level n are located every (1/2)·2^n pixels, we simply do not downsample in the wavelet-coefficient recursion above. To generate the quadruple density dictionary, we first do not downsample in the scaling-coefficient recursion, giving us double density scaling coefficients. Next, we calculate double density wavelet coefficients on the two sets of scaling coefficients, even and odd, separately. By interleaving the results of the two transforms, we get quadruple density wavelet coefficients. For the next scale, we keep only the even scaling coefficients of the previous level and repeat the quadruple transform on this set only; the odd scaling coefficients are dropped. Since only the even coefficients are carried along at all the scales, we avoid an explosion in the number of coefficients, yet obtain a dense and uniform sampling of the wavelet coefficients at all the scales. As with the regular DWT, the time complexity is O(n) in the number of pixels n. The extension of the quadruple density transform to 2D is straightforward.
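The "do not downsample" modification can be illustrated with the following sketch of a double-density Haar step; the full quadruple-density construction (separate wavelet transforms on the even and odd scaling coefficients, interleaved, with only the even coefficients carried to the next scale) follows the recipe above and is not reproduced here. The normalization is again an assumption.

```python
# Sketch only: an undecimated (double-density) Haar step, where adjacent
# pairs are combined at every position (step 1) instead of every other
# position (step 2), so the number of coefficients is not halved.
def double_density_step(coeffs):
    scaling = [(coeffs[k] + coeffs[k + 1]) / 2.0 for k in range(len(coeffs) - 1)]
    wavelet = [(coeffs[k] - coeffs[k + 1]) / 2.0 for k in range(len(coeffs) - 1)]
    return scaling, wavelet

s, w = double_density_step([9, 7, 3, 5, 6, 10, 2, 6])
print(len(s), len(w))   # 7 and 7: no decimation
```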

Appendix B

Support Vector Machines

The second key component of our system is the use of a trainable pattern classifier that learns to differentiate between patterns in our object class and all other patterns. In general terms, these supervised learning techniques rely on having a set of labeled example patterns from which they derive an implicit model of the domain of interest. The particular learning engine we use is a support vector machine (SVM) classifier.

The support vector machine algorithm is a technique for training classifiers that is well founded in statistical learning theory (Vapnik; Burges). Here we provide some of the mathematical and practical aspects of support vector machines that are relevant to our work.

B.1 Theory and Mathematics

For a given learning task, we have a set of ℓ N-dimensional labeled training examples,

    (x_1, y_1), (x_2, y_2), ..., (x_ℓ, y_ℓ),   x_i ∈ R^N, y_i ∈ {−1, +1},

where the examples have been generated from some unknown pdf P(x, y). We would like the system to learn a decision function f_α : x → y that minimizes the expected risk,

    R(α) = ∫ (1/2) |f_α(x) − y| dP(x, y).

In most cases we will not know P(x, y); we simply see the data points that the distribution has generated. Thus, direct minimization of the expected risk is not possible. What we are able to minimize directly is the empirical risk, the actual error over the training set,

    R_emp(α) = (1/2ℓ) Σ_{i=1}^{ℓ} |f_α(x_i) − y_i|.

This is exactly the functional that many training techniques for classifiers minimize, but it can lead to overfitting the training data and poor generalization. For these reasons, we introduce the theory of uniform convergence in probability:

    R(α) ≤ R_emp(α) + √( ( h (ln(2ℓ/h) + 1) − ln(η/4) ) / ℓ ),

which holds with probability 1 − η. Here R(α) is the expected risk, R_emp(α) is the empirical risk, ℓ is the number of training examples, h is the VC dimension of the classifier that is being used, and the square-root term is the VC confidence of the classifier. Intuitively, what this means is that the uniform deviation between the expected risk and the empirical risk decreases with larger amounts of training data and increases with the VC dimension h. This leads us directly to the principle of structural risk minimization, whereby we can minimize both the actual error over the training set and the complexity of the classifier at the same time; this bounds the generalization error as in the inequality above. It is exactly this technique that support vector machines approximate.

In the simple case of finding the optimal linear hyperplane (w, b) that separates two separable classes, this problem is equivalent to solving

    minimize over (w, b):   (1/2) ||w||²

subject to the constraints

    y_i (w · x_i + b) ≥ 1,   i = 1, ..., ℓ.

Typically, the dual formulation of this quadratic programming problem is solved, leading to a decision surface of the form

    f(x) = Σ_i α_i y_i (x_i · x) + b,

where the α_i are Lagrange variables.

To extend this to the more general case of linearly non-separable data, we add in slack variables ξ_i and a cost C that penalizes misclassifications:

    minimize over (w, b):   (1/2) ||w||² + C Σ_{i=1}^{ℓ} ξ_i

subject to the constraints

    y_i (w · x_i + b) ≥ 1 − ξ_i,   i = 1, ..., ℓ,
    ξ_i ≥ 0,   i = 1, ..., ℓ.

Linear hyperplanes are a fairly restrictive set of decision surfaces, so ultimately we would like to use nonlinear decision surfaces. In this case, support vector machines work by projecting the input data into a higher dimensional feature space and finding the optimal linear separating hyperplane in this space. The data is mapped according to some function Φ as

    x → Φ(x) = (φ_1(x), φ_2(x), ..., φ_n(x)).

This leads to decision surfaces of the form

    f(x) = Σ_i α_i y_i (Φ(x_i) · Φ(x)) + b.

A more compact representation is possible by introducing the nonlinear kernel function K,

    K(x, y) = Φ(x) · Φ(y).

With this formulation, we obtain decision surfaces of the form

    f(x) = Σ_i α_i y_i K(x_i, x) + b.

One of the important characteristics of the solution is that there are typically only a small number of nonzero α_i; the separating hyperplane is therefore a linear combination of a small set of data points, called support vectors. Removing the non-support vectors and training again would yield exactly the same solution.

Using the SVM formulation, the classification rule for a pattern x using a polynomial of degree two, the classifier we use in our detection system, is as follows:

    f(x) = Σ_{i=1}^{N_s} α_i y_i (x · x_i + 1)² + b,

where N_s is the number of support vectors.
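Evaluating this classification rule is straightforward once the support vectors x_i, labels y_i, coefficients α_i, and bias b are available from training; the following sketch uses placeholder values purely for illustration.

```python
# Sketch only: evaluate the degree-two polynomial decision function; the
# numerical values below are placeholders, not trained parameters.
import numpy as np

def svm_decision(x, support_vectors, labels, alphas, b):
    """f(x) = sum_i alpha_i * y_i * (x . x_i + 1)^2 + b; the predicted class
    is the sign of f(x)."""
    kernels = (support_vectors @ x + 1.0) ** 2   # (x . x_i + 1)^2 for each i
    return float(np.sum(alphas * labels * kernels) + b)

# Placeholder data: two support vectors in a 3-dimensional feature space.
sv = np.array([[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]])
y = np.array([1.0, -1.0])
alpha = np.array([0.8, 0.8])
b = -0.05
print(svm_decision(np.array([0.1, 0.0, 0.3]), sv, y, alpha, b))
```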

This controlling of both the training set error and the classifier's complexity has allowed support vector machines to be successfully applied to very high dimensional learning tasks. Joachims presents results on SVMs applied to a high-dimensional text categorization problem, and Osuna et al. (b) show a high-dimensional face detection system.

B.2 Practical Aspects

The SVM package we use is that of Osuna et al., described in Osuna et al. (a) and Osuna et al. (b), which uses the MINOS quadratic programming solver (Murtagh and Saunders). To train a support vector machine, we first need a set of training data; the process by which we gathered our training set is described in the section on training data. We transform each training image into a feature vector of Haar wavelet coefficients, as in the section on the wavelet representation. The data is then ready to be used for training.

In the SVM framework, there are essentially only two tunable parameters: the type of classifier (e.g., linear, polynomial, etc.) and C, the penalty for misclassifications. Most of our experiments are run using a polynomial of degree two, as in the classification rule given in Section B.1. In penalizing misclassifications, we use a larger penalty C_pos for the positive examples than the penalty C_neg for the negative examples, to reflect the relative importance of having a low false negative rate; false negatives are thus penalized more heavily than false positives. Training time with the above settings is on the order of hours for our detection systems, on either an SGI Reality Engine or a PC running Linux. The output of the SVM training is a binary file containing the data for the decision surface, that is, x_i, y_i, and α_i for the data points i that have nonzero α_i, together with b; this data file is subsequently used by the detection system.
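For readers reproducing this setup with current tools, a roughly equivalent configuration (a degree-two polynomial kernel with a heavier misclassification penalty on the positive class) can be expressed, for example, with scikit-learn; this is an assumption for illustration only, not the MINOS-based package described above, and the penalty ratio, feature dimension, and data below are placeholders.

```python
# Sketch only: asymmetric class penalties with a degree-two polynomial kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))   # placeholder feature vectors (e.g. wavelets)
y = np.concatenate([np.ones(50, dtype=int), -np.ones(150, dtype=int)])

# kernel (x . x' + 1)^2; the positive class gets a larger effective C so that
# false negatives cost more than false positives (the 10:1 ratio is arbitrary).
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0,
          class_weight={1: 10.0, -1: 1.0})
clf.fit(X, y)

# The learned decision surface is stored as the support vectors, their dual
# coefficients (alpha_i * y_i), and the bias, analogous to the data file
# written out by the training step described above.
print(clf.support_vectors_.shape, clf.dual_coef_.shape, clf.intercept_)
```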

Appendix C

Raw Wavelet Data

Here we present the raw wavelet responses, averaged over the set of people images and presented in their proper spatial locations. The meaningful coefficients are those whose average values are much larger or much smaller than the overall average; values close to the overall average indicate neither the presence nor the absence of an intensity difference in the average, i.e., these are inconsistent features.

Table C.1: Vertical wavelets.

Table C.2: Horizontal wavelets.

Table C.3: Diagonal wavelets.

Table C.4: Vertical wavelets.

Table C.5: Horizontal wavelets.

Table C.6: Diagonal wavelets.

Bibliography

[Ali and Dagless] A. Ali and E. Dagless. Alternative practical methods for moving object detection. In IEE Colloquium on Electronic Images and Image Processing in Security and Forensic Science.

[Bauer and Kohavi] Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Machine Learning.

[Betke and Nguyen] M. Betke and H. Nguyen. Highway scene analysis from a moving vehicle under reduced visibility conditions. In Proceedings of Intelligent Vehicles, October.

[Betke et al.] M. Betke, E. Haritaoglu, and L. Davis. Highway scene analysis in hard real-time. In Proceedings of Intelligent Transportation Systems. IEEE, July.

[Beymer et al.] D. Beymer, P. McLauchlan, B. Coifman, and J. Malik. A real-time computer vision system for measuring traffic parameters. In Proceedings of Computer Vision and Pattern Recognition. IEEE Computer Society Press.

[Bishop] C. Bishop. Neural Networks for Pattern Recognition. Clarendon Press.

[Braithwaite and Bhanu] R. Braithwaite and B. Bhanu. Hierarchical Gabor filters for object detection in infrared images. In Proceedings of Computer Vision and Pattern Recognition.

[Breiman] L. Breiman. Bagging predictors. Machine Learning.

[Broida and Chellappa] T. Broida and R. Chellappa. Estimation of object motion parameters from noisy images. IEEE Transactions on Pattern Analysis and Machine Intelligence, January.

[Burges] C. Burges. Simplified support vector decision rules. In Proceedings of the International Conference on Machine Learning.

[Burges] C. Burges. A tutorial on support vector machines for pattern recognition. In Usama Fayyad, editor, Proceedings of Data Mining and Knowledge Discovery.

[Burl and Perona] M. Burl and P. Perona. Recognition of planar object classes. In Proceedings of Computer Vision and Pattern Recognition, June.

[Burl et al.] M. Burl, T. Leung, and P. Perona. Face localization via shape statistics. In International Workshop on Automatic Face and Gesture Recognition, June.

[Campbell and Bobick] L. Campbell and A. Bobick. Recognition of human body motion using phase space constraints. Technical Report, MIT Media Laboratory.

[Chapa and Raghuveer] J. Chapa and M. Raghuveer. Matched wavelets: their construction and application to object detection. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.

[Colmenarez and Huang] A. Colmenarez and T. Huang. Maximum likelihood face detection. In Proceedings of the International Conference on Automatic Face and Gesture Recognition.

[Colmenarez and Huang] A. Colmenarez and T. Huang. Face detection with information-based maximum discrimination. In Proceedings of Computer Vision and Pattern Recognition.

[Colmenarez and Huang] A. Colmenarez and T. Huang. Pattern detection with information-based maximum discrimination and error bootstrapping. In Proceedings of the International Conference on Pattern Recognition.

[Compaq Computer Corp.] Compaq Computer Corp. www.altavista.com.

[Cortes and Vapnik] C. Cortes and V. Vapnik. Support vector networks. Machine Learning.

[Crowley and Berard] J. Crowley and F. Berard. Multi-modal tracking of faces for video communications. In Proceedings of Computer Vision and Pattern Recognition.

[Davis and Bobick] J.W. Davis and A.F. Bobick. The representation and recognition of human movement using temporal templates. In Proceedings of Computer Vision and Pattern Recognition. IEEE Computer Society Press.

[Deriche and Faugeras] R. Deriche and O. Faugeras. Tracking line segments. Image and Vision Computing.

[Dickmanns and Graefe] E. Dickmanns and V. Graefe. Applications of dynamic monocular machine vision. Machine Vision and Applications.

[Dumais et al.] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the International Conference on Information and Knowledge Management, November.

[Duta and Jain] N. Duta and A. Jain. Learning the human face concept in black and white images. In Proceedings of the International Conference on Pattern Recognition.

[Ebbecke et al.] M. Ebbecke, M. Ali, and A. Dengel. Real time object detection, tracking and classification in monocular image sequences of road traffic scenes. In Proceedings of the International Conference on Image Processing.

[Excite Inc.] Excite Inc. www.excite.com.

[Forsyth and Fleck] D. Forsyth and M. Fleck. Body plans. In Proceedings of Computer Vision and Pattern Recognition.

[Forsyth and Fleck] D. Forsyth and M. Fleck. Finding naked people. International Journal of Computer Vision, accepted for publication.

[Franke and Kutbach] U. Franke and I. Kutbach. Fast stereo based object detection for stop-and-go traffic. In Proceedings of Intelligent Vehicles.

[Franke et al.] U. Franke, S. Goerzig, F. Lindner, D. Nehren, and F. Paetzold. Steps towards an intelligent vision system for driver assistance in urban traffic. In Intelligent Transportation Systems.

[Franke et al.] U. Franke, D. Gavrila, S. Goerzig, F. Lindner, F. Paetzold, and C. Woehler. Autonomous driving goes downtown. IEEE Intelligent Systems, November-December.

[Freund and Schapire] Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth National Conference. Morgan Kaufmann.

[Garcia et al.] C. Garcia, G. Zikos, and G. Tziritas. Face detection in color images using wavelet packet analysis. In International Conference on Multimedia Computing and Systems.

[Gavrila and Philomin] D. Gavrila and V. Philomin. Real-time object detection for smart vehicles. In Proceedings of the International Conference on Computer Vision.

[Girosi et al.] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation.

[Guarda et al.] A. Guarda, C. le Gal, and A. Lux. Evolving visual features and detectors. In Proceedings of the International Symposium on Computer Graphics, Image Processing and Vision.

[Haritaoglu et al.] I. Haritaoglu, D. Harwood, and L. Davis. W4: who, when, where, what: a real time system for detecting and tracking people. In Face and Gesture Recognition.

[Harris] C. Harris. Tracking with rigid models. In A. Blake and A. Yuille, editors, Active Vision. MIT Press.

[Heisele and Wohler] B. Heisele and C. Wohler. Motion-based recognition of pedestrians. In Proceedings of the International Conference on Pattern Recognition, in press.

[Heisele et al.] B. Heisele, U. Kressel, and W. Ritter. Tracking non-rigid moving objects based on color cluster flow. In Proceedings of Computer Vision and Pattern Recognition.

[Hogg] D. Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing.

[Hoogenboom and Lew] R. Hoogenboom and M. Lew. Face detection using local maxima. In Proceedings of the International Conference on Automatic Face and Gesture Recognition.

[IBM] IBM. wwwqbic.almaden.ibm.com.

[Isard and Blake] M. Isard and A. Blake. Condensation: conditional density propagation for visual tracking. International Journal of Computer Vision.

[Itti and Koch] L. Itti and C. Koch. A comparison of feature combination strategies for saliency-based visual attention systems. In Human Vision and Electronic Imaging, SPIE, January.

[Itti et al.] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[Joachims] T. Joachims. Text categorization with support vector machines. Technical Report LS-8, University of Dortmund, November.

[Kahn and Swain] R. Kahn and M. Swain. Understanding people pointing: the Perseus system. In International Symposium on Computer Vision.

[Kalinke et al.] T. Kalinke, C. Tzomakas, and W. van Seelen. A texture-based object detection and an adaptive model-based classification. In Proceedings of Intelligent Vehicles.

[Kober et al.] R. Kober, J. Schiffers, and K. Schmidt. Model-based versus knowledge-guided representation of non-rigid objects: a case study. In Proceedings of the International Conference on Image Processing.

[Kotropoulos and Pitas] C. Kotropoulos and I. Pitas. Rule-based face detection in frontal views. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.

[Kruger et al.] N. Kruger, M. Potzsch, and C. von der Malsburg. Determination of face position and pose with a learned representation based on labelled graphs. Image and Vision Computing.

[Kwon and Lobo] Y. Kwon and N. Lobo. Face detection using templates. In Proceedings of the International Conference on Image Processing.

[Lakany and Hayes] H. Lakany and G. Hayes. An algorithm for recognising walkers. In J. Bigun, G. Chollet, and G. Borgefors, editors, Audio- and Video-based Biometric Person Authentication. IAPR, Springer.

[Leung and Yang (a)] M.K. Leung and Y.H. Yang. Human body motion segmentation in a complex scene. Pattern Recognition.

[Leung and Yang (b)] M.K. Leung and Y.H. Yang. A region based approach for human body analysis. Pattern Recognition.

[Leung et al.] T.K. Leung, M.C. Burl, and P. Perona. Finding faces in cluttered scenes using random labeled graph matching. In Fifth International Conference on Computer Vision, June.

[Lew and Huang] M. Lew and T. Huang. Optimal supports for image matching. In Digital Signal Processing Workshop Proceedings.

[Lipson et al.] P. Lipson, W. Grimson, and P. Sinha. Configuration based scene classification and image indexing. In Proceedings of Computer Vision and Pattern Recognition.

[Lipson] P. Lipson. Context and Configuration Based Scene Classification. PhD thesis, Massachusetts Institute of Technology.

[Lycos Inc.] Lycos Inc. www.lycos.com.

[MacKay] D. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation.

[Mallat] S.G. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, July.

[Maybeck] P. Maybeck. Stochastic Models, Estimation, and Control. Mathematics in Science and Engineering, Academic Press, Inc.

[McKenna and Gong] S. McKenna and S. Gong. Non-intrusive person authentication for access control by visual tracking and face recognition. In J. Bigun, G. Chollet, and G. Borgefors, editors, Audio- and Video-based Biometric Person Authentication. IAPR, Springer.

[Micheli-Tzanakou and Marsic] E. Micheli-Tzanakou and I. Marsic. Object detection and recognition in a multiresolution representation. In International Joint Conference on Neural Networks.

[Mirhosseini and Yan] A. Mirhosseini and H. Yan. Neural networks for learning human facial features from labeled graph models. In Conference on Intelligent Information Systems.

[Moghaddam and Pentland] B. Moghaddam and A. Pentland. Probabilistic visual learning for object detection. In Proceedings of the International Conference on Computer Vision.

[Mohan et al.] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence, in preparation.

[Mohan] A. Mohan. Robust Object Detection in Images by Components. Master's thesis, Massachusetts Institute of Technology, May.

[Murase and Nayar] H. Murase and S. Nayar. Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision.

[Murtagh and Saunders] B. Murtagh and M. Saunders. MINOS User's Guide. System Optimization Laboratory, Stanford University, February.

[Nelson and Illingworth] M. Nelson and W. Illingworth. A Practical Guide to Neural Nets. Addison-Wesley.

[Niyogi et al.] P. Niyogi, C. Burges, and P. Ramesh. Capacity control in support vector machines and their application to feature detection. To be submitted.

[Osuna et al. (a)] E. Osuna, R. Freund, and F. Girosi. Support Vector Machines: Training and Applications. AI Memo, MIT Artificial Intelligence Laboratory.

[Osuna et al. (b)] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face detection. In Proceedings of Computer Vision and Pattern Recognition.

[Papageorgiou et al.] C. Papageorgiou, T. Evgeniou, and T. Poggio. A trainable pedestrian detection system. In Proceedings of Intelligent Vehicles, October.

[Philips] P. Philips. Matching pursuit filter design. In Proceedings of the International Conference on Pattern Recognition.

[Platt] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, P. Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press.

[Qian and Huang] R. Qian and T. Huang. Object detection using hierarchical MRF and MAP estimation. In Proceedings of Computer Vision and Pattern Recognition.

[Quinlan] J.R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence.

[Rajagopalan et al.] A. Rajagopalan, P. Burlina, and R. Chellappa. Higher order statistical learning for vehicle detection in images. In Proceedings of the International Conference on Computer Vision.

[Ratan and Grimson] A. Lakshmi Ratan and W.E.L. Grimson. Training templates for scene classification using a few examples. In Content-Based Access of Image and Video Libraries.

[Ratan et al.] A. Lakshmi Ratan, W.E.L. Grimson, and W. Wells III. Object detection and localization by dynamic template warping. In Proceedings of Computer Vision and Pattern Recognition, June.

[Reiter and Matas] M. Reiter and J. Matas. Object detection with a varying number of eigenspace projections. In Proceedings of the International Conference on Pattern Recognition.

[Rikert et al.] T. Rikert, M. Jones, and P. Viola. A cluster-based statistical model for object detection. In Proceedings of the International Conference on Computer Vision.

[Rohr] K. Rohr. Incremental recognition of pedestrians from image sequences. In Proceedings of Computer Vision and Pattern Recognition.

[Rowley et al.] H.A. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS, School of Computer Science, Carnegie Mellon University, July/November.

[Rowley et al.] H. Rowley, S. Baluja, and T. Kanade. Rotation invariant neural network-based face detection. Technical Report CMU-CS, School of Computer Science, Carnegie Mellon University, December.

[Rowley et al.] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, January.

[Saber and Tekalp] E. Saber and A. Tekalp. Face detection and facial feature extraction using color, shape and symmetry-based cost functions. In Proceedings of the International Conference on Pattern Recognition.

[Schapire et al.] R.E. Schapire, Y. Freund, P. Bartlett, and W.S. Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, to appear.

[Schneiderman and Kanade] H. Schneiderman and T. Kanade. Probabilistic modeling of local appearance and spatial relationships for object recognition. In Proceedings of Computer Vision and Pattern Recognition.

[Shams and Spoelstra] Ladan Shams and Jacob Spoelstra. Learning Gabor-based features for face detection. In Proceedings of the World Congress in Neural Networks, International Neural Network Society, September.

[Sinha (a)] P. Sinha. Object recognition via image invariants: a case study. In Investigative Ophthalmology and Visual Science, Sarasota, Florida, May.

[Sinha (b)] P. Sinha. Qualitative image-based representations for object recognition. AI Memo, MIT Artificial Intelligence Laboratory.

[Stollnitz et al.] E.J. Stollnitz, T.D. DeRose, and D.H. Salesin. Wavelets for computer graphics: a primer. Technical Report, Department of Computer Science and Engineering, University of Washington, September.

[Sullivan et al.] M. Sullivan, C. Richards, C. Smith, O. Masoud, and N. Papanikolopoulos. Pedestrian tracking from a stationary camera using active deformable models. In Proceedings of Intelligent Vehicles.

[Sung and Poggio] K.K. Sung and T. Poggio. Example-based learning for view-based human face detection. AI Memo, MIT Artificial Intelligence Laboratory, December.

[Sung and Poggio] K.K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, January.

[Sung] K.K. Sung. Learning and Example Selection for Object and Pattern Detection. PhD thesis, MIT Artificial Intelligence Laboratory, December.

[Takatoo et al.] M. Takatoo, C. Onuma, and Y. Kobayashi. Detection of objects including persons using image processing. In Proceedings of the International Conference on Pattern Recognition.

[Tsukiyama and Shirai] T. Tsukiyama and Y. Shirai. Detection of the movements of persons from a sparse sequence of TV images. Pattern Recognition.

[Vaillant et al.] R. Vaillant, C. Monrocq, and Y. Le Cun. Original approach for the localisation of objects in images. IEE Proceedings on Vision, Image, and Signal Processing, August.

[Vapnik] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag.

[Vapnik] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York.

[Venkatraman and Govindaraju] M. Venkatraman and V. Govindaraju. Zero crossings of a non-orthogonal wavelet transform for object location. In Proceedings of the International Conference on Image Processing.

[Virage Inc.] Virage Inc. www.virage.com.

[Wahba] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, SIAM, Philadelphia.

[Weber and Casasent] D. Weber and D. Casasent. Quadratic Gabor filters for object detection. In Proceedings of the Symposium on Communications and Signal Processing.

[Wolpert] D. Wolpert, editor. The Mathematics of Generalization: The Proceedings of the SFI/CNLS Workshop on Formal Approaches to Supervised Learning. Santa Fe Institute Studies in the Sciences of Complexity, Addison-Wesley.

[Wren et al.] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: real-time tracking of the human body. Technical Report, MIT Media Laboratory.

[Wu et al.] H. Wu, Q. Chen, and M. Yachida. Face detection from color images using a fuzzy pattern matching method. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[Yahoo Inc.] Yahoo Inc. www.yahoo.com.

[Yang and Huang] G. Yang and T. Huang. Human face detection in a scene. In Proceedings of Computer Vision and Pattern Recognition.

[Yang and Huang] G. Yang and T. Huang. Human face detection in a complex background. Pattern Recognition.

[Yow and Cipolla (a)] K. Yow and R. Cipolla. A probabilistic framework for perceptual grouping of features for human face detection. In Proceedings of the International Conference on Automatic Face and Gesture Recognition.

[Yow and Cipolla (b)] Kin Choong Yow and R. Cipolla. Detection of human faces under scale, orientation and viewpoint variations. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, October.

[Yow and Cipolla] Kin Choong Yow and R. Cipolla. Feature-based human face detection. Image and Vision Computing, September.

[Yuille et al.] A. Yuille, P. Hallinan, and D. Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision.

[Yuille] A. Yuille. Deformable templates for face recognition. Journal of Cognitive Neuroscience.