Institut für Neuroinformatik, Ruhr-Universität Bochum

Internal Report IRINI

Object Recognition with a Sparse and Autonomously Learned Representation Based on Banana Wavelets

by Norbert Krüger*, Gabriele Peters*, Christoph von der Malsburg*†

* Ruhr-Universität Bochum, Institut für Neuroinformatik, D Bochum, Germany

† University of Southern California, Dept. of Computer Science and Section for Neurobiology, Los Angeles, CA, USA

December, ISSN

Abstract

We introduce an object recognition system that is based on the well-known Elastic Graph Matching (EGM) but includes significant improvements compared to earlier versions. Our basic features are banana wavelets, which are generalized Gabor wavelets. In addition to the qualities frequency and orientation, banana wavelets have the attributes curvature and size. Banana wavelets can be metrically organized. A sparse and efficient representation of object classes is learned utilizing this metric organization. Learning is guided by a sensible amount of a priori knowledge in the form of basic principles. The learned representation is used for fast matching. A significant speed-up can be achieved by hierarchical processing of features. Furthermore, manual construction of ground truth is replaced by an automatic generation of suitable training examples using motor-controlled feedback. We motivate the biological plausibility of our approach by utilizing concepts inspired by brain research, such as hierarchical processing and metrical organization of features, and we criticize a too detailed modelling of biological processing.

Introduction

In this paper we describe a novel object recognition system in which representations of object classes can be learned automatically. The learned representations allow a fast and effective location and identification of objects in complicated scenes. Our object recognition system is based on three pillars. Firstly, our preprocessing is based on the idea of sparse coding. Secondly, effective learning is guided by a priori constraints covering fundamental structure of the visual world. Thirdly, we use Elastic Graph Matching (EGM) for the location and identification of objects.

A sparse representation can be defined as a coding of an object by a small number of binary features taken from a large feature space. A certain feature is only useful for coding a small subset of objects and is not applicable to most of the other objects. Sparse coding has biologically motivated advantages, such as minimizing wiring length for forming associations. Baum et al. point to the increase of associative memory capacity provided by a sparse code. Olshausen and Field argue that the retinal projection of the three-dimensional world has a sparse structure and that a sparse code therefore meets the principle of redundancy reduction by reducing higher-order statistical correlations of the input. As an additional advantage, our matching algorithm achieves a significant speed-up by utilizing the fact that only a small number of features is required in our sparse representation of an object. For a more detailed discussion of sparse coding we refer to

Our representation of a certain view of an object class comprises only important features. These are extracted from different examples (see figure, i-iv). The central assumption of our learning algorithm is the necessity of a priori knowledge, applied to the system in the form of general principles and mechanisms. Learning is inherently faced with the bias-variance dilemma. If the starting configuration of the system is very general, it can learn from and specialize to a wide variety of domains, but it will in general have to pay for this advantage by having many internal degrees of freedom. This is a serious problem, since the number of examples needed to train a system scales very badly with the system's size, quickly leading to totally unrealistic learning times; or else, with a limited set of training examples, the system will trivially adapt to their accidental peculiarities and will fail to generalize properly to new examples. This is the variance problem. On the other hand, if the initial system has few degrees of freedom, it may be able to learn efficiently, but unless the system is designed with much specific insight into the domain at hand (the solution we criticized above), there is great danger that the structural domain spanned by those degrees of freedom does not cover the given domain of application at all (the bias problem).

Supported by grants from the German Ministry for Science and Technology, INE (NEUROS) and MA (Electronic Eye).

Figure (rows a, b; panels i-v): (i-iv) Different examples of cans and faces used for learning. (v) The learned representations.

We propose that a priori knowledge is needed to overcome the bias-variance dilemma. The challenge here is to attain generality and to avoid the extreme of equipping the system with manually constructed, specific domain knowledge, such as geometry and physics in general or even the geometric and physical structure of objects themselves. We have formulated a number of a priori principles to reduce the dimension of the search space and to guide learning, i.e. to handle the variance problem. We assume that we can avoid the bias problem because of the general applicability of those principles. All these principles are concerned with the selection of important features from a predefined feature space (Locality, Invariance, Minimal Redundancy) and with the structure thereof (Local Feature Assumption). In earlier work we have already made use of the following principles:

P Locality: Features referring to different locations are treated as independent.

P Invariance: Features are preferred which are invariant under a wide range of object transformations.

P Minimal Redundancy: Features should be selected for minimal redundancy of information.

Here we introduce a principle P as an important additional constraint:

P Local Feature Assumption: Significant features of a local area of the two-dimensional projection of the visual world are localized curved lines.

We formalize P (the Local Feature Assumption) by extending the concept of Gabor wavelets to banana wavelets. To the parameters frequency and orientation we add curvature and size (see the figure relating Gabor wavelets and banana wavelets). An object can be represented as a configuration of a few of these features (see the figure of learned representations, v) and can therefore be coded sparsely. The space of banana wavelet responses can be understood as a metric space, its metric representing the similarity of features. This metric is utilized for the learning of a representation of objects and for the recognition of these objects during the matching procedure. The banana wavelet responses can be derived from Gabor wavelet responses by hierarchical processing to gain speed and to reduce memory requirements. A set of examples of a certain view of an object class (i-iv in the same figure) is used to learn a sparse representation which contains only the important features, i.e. features which are robust against changes of background and illumination or slight variations in scale and orientation. This sparse representation allows objects to be located quickly and effectively by using EGM.

Our system has certain analogies to the visual system of vertebrates. There is evidence for curvature-sensitive features processed in a hierarchical manner in early stages, sparse coding is discussed as a coding scheme used in the visual system, and metric organization of features seems to play an important role for information processing in the brain. Instead of a detailed modelling of brain areas, we aim to apply some basic concepts inspired by brain research, such as sparse coding, hierarchical processing and metrical organization of features, in our artificial object recognition system. We think a system does not necessarily need to contain neurons or Hebbian plasticity to be called biologically motivated. Maybe we miss the important aspects of information processing in the brain by looking at too detailed a level. After all, humans did not build planes with feathers, but the observation of birds inspired the understanding of the basic principles of flying, which are used by any airplane. For a more detailed discussion of the analogy to biology we refer to

To enable both a rough understanding of the basic ideas of the approach and a detailed description of the algorithm, this paper can be read in two modes. For every subsection we first give a short summary and then a more detailed description, beginning with the phrases "Formally speaking" or "More formally". The reader may skip the latter parts for a rough understanding or a first reading.

Figure: Relation between Gabor wavelets and banana wavelets (axes: frequency, direction, curvature, size). Left: four examples of Gabor wavelets, which differ in frequency and direction only. Right: examples of banana wavelets which are related to the Gabor wavelets on the left. Banana wavelets are described by two additional parameters, curvature and size.

The Banana Space

In this section we describe our realization of principle P (the Local Feature Assumption): a feature generation based on banana wavelets and its metric organization in the banana space. P gives us a significant reduction of the search space. Instead of allowing, e.g., all linear filters as possible features, we restrict ourselves to a small subset. Considering the risk of a wrong feature selection, it is necessary to give good reasons for our decision. We argue that nearly any object can be composed of localized curved lines. Furthermore, the fact that humans can easily handle line drawings of objects strengthens our assumption. We think that a good feature has to have a certain complexity, but an extreme increase of complexity, up to a specialization to a very narrow class of objects, has to be avoided. In any case there is some arbitrariness in the assumption P, and it can therefore only be justified by the final performance of the whole system.

Banana wavelets can be naturally organized in a metric space. Their distance expresses the similarity of qualities of the kernels, such as position, orientation or curvature. This metric organization is essential for the learning algorithm, because it allows us to summarize clusters of similar features by their center of gravity.

Banana Wavelets

Our a priori principle P states that curved lines are important features of the local visual world. A banana wavelet can be understood as a generalized Gabor wavelet. Banana wavelets, like Gabor wavelets, are localized filters which can be derived from a mother wavelet. In contrast to Gabor wavelets, which are characterized by two parameters, the set of all banana wavelets is described by four parameters (see figure).

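A banana wavelet $B^{b}$ is a complex-valued function defined on $\mathbb{R} \times \mathbb{R}$. It is parameterized by a vector $b$ of four variables, $b = (f, \alpha, c, s)$, expressing the attributes frequency ($f$), orientation ($\alpha$), curvature ($c$) and size ($s$). It can be understood as a product of a constant $\gamma^{b}$ with a curved and rotated complex wave function $F^{b}(x, y)$ and a stretched two-dimensional Gaussian $G^{b}(x, y)$, bent and rotated according to $F^{b}$ (see figure, top):

$$B^{b}(x, y) = \gamma^{b} \cdot G^{b}(x, y) \cdot \left( F^{b}(x, y) - DC^{b} \right)$$

with

$$G^{b}(x, y) = \exp\left( -\frac{f^{2}}{2} \left( \sigma_x^{-2} \left( x\cos\alpha + y\sin\alpha + c\,(-x\sin\alpha + y\cos\alpha)^{2} \right)^{2} + \sigma_y^{-2}\, s^{-2} \left( -x\sin\alpha + y\cos\alpha \right)^{2} \right) \right)$$

and

$$F^{b}(x, y) = \exp\left( i f \left( x\cos\alpha + y\sin\alpha + c\,(-x\sin\alpha + y\cos\alpha)^{2} \right) \right).$$

A banana wavelet can be equivalently expressed by a combination of matrix operations $M_{\alpha}$, $M_{s}$ and a nonlinear operation $M_{c}$: $M_{\alpha}$ performs a rotation by the angle $\alpha$, $M_{s}$ stretches the Gaussian, and $M_{c}(x)$ is a nonlinear function bending the coordinate system (see Appendix A).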
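The definition above can be sketched in a few lines of Python. This is only an illustration, not the implementation used for the simulations: it assumes the usual reading of the definition (rotation by the orientation `alpha`, bending of the rotated x coordinate by `c` times the squared rotated y coordinate, Gaussian widths `sigma_x` and `sigma_y`), it omits the constant factor and the DC correction, and the function names, the default widths and the grid size are hypothetical.

```python
import cmath
import math


def banana_wavelet(x, y, f, alpha, c, s, sigma_x=1.0, sigma_y=1.0):
    """Unnormalized banana wavelet G(x, y) * F(x, y) at a single point.

    The constant factor and the DC correction are left out; sigma_x and
    sigma_y are illustrative defaults, not values from the report.
    """
    # Rotate the coordinate system by the orientation alpha.
    xr = x * math.cos(alpha) + y * math.sin(alpha)
    yr = -x * math.sin(alpha) + y * math.cos(alpha)
    # Bend the rotated system: the curvature c shifts xr by c * yr^2.
    xb = xr + c * yr * yr
    # Stretched Gaussian envelope; the size s elongates it along the curve.
    exponent = (xb / sigma_x) ** 2 + (yr / (sigma_y * s)) ** 2
    gauss = math.exp(-0.5 * f * f * exponent)
    # Curved complex wave of frequency f.
    wave = cmath.exp(1j * f * xb)
    return gauss * wave


def sample_kernel(f, alpha, c, s, half_width=8):
    """Sample the wavelet on a square pixel grid centred at the origin."""
    rng = range(-half_width, half_width + 1)
    return [[banana_wavelet(x, y, f, alpha, c, s) for x in rng] for y in rng]
```

With `c = 0` and `s = 1` the function reduces to an ordinary Gabor-type kernel, which is one way to see in what sense banana wavelets generalize Gabor wavelets.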

Figure: Top: The real part of a banana wavelet is the product of a curved Gaussian $G^{b}(x, y)$ and a curved wave function $F^{b}(x, y)$. Bottom: Real and imaginary part of the same banana wavelet, depicted as a grey level picture with white encoding high values.

To ensure DC-freedom of the banana wavelets, i.e. the independence of the filter responses from the mean grey value intensity, we set

$$DC^{b} = \frac{\int G^{b}(\vec{x})\, F^{b}(\vec{x})\, d\vec{x}}{\int G^{b}(\vec{x})\, d\vec{x}} = e^{-\sigma_x^{2}/2}.$$

To compensate differences of filter responses deriving from banana wavelets of different sizes or frequencies, we set the constant $\gamma^{b}$ as the product of the frequency $f$ and a second factor depending on $s$, $s_{max}$, $f$ and $f_{max}$, normalized by $\|B^{b}\|$, where $\|\cdot\|$ represents the $L^{2}$ norm. The factor $f$ compensates the decrease of the power spectrum of natural images. The second factor ensures a more even distribution of the responses of the banana wavelets. It intensifies responses for small size and high frequency.

We define a discrete sampling of the space of banana wavelets by a function

$$E(l, o, b, m) = (f(l), \alpha(o), c(b), s(m)),$$

embedding the discrete grid with integer coordinates $(l, o, b, m)$ in the continuous space $(f, \alpha, c, s)$. In our simulations we only make use of the discrete set of banana wavelets with parameters $(f(l), \alpha(o), c(b), s(m))$. The kernels of two banana wavelets $B^{b_1}$ and $B^{b_2}$ with small Euclidean distance $\|b_1 - b_2\|$ have small $L^{2}$ distance by definition. Accordingly, $E$ has to be chosen such that neighbouring coordinates in the grid correspond to similar kernels. The embedding function $E$ ensures that the features corresponding to the grid $(l, o, b, m)$ are sufficiently separated to avoid redundancy, but also sufficiently dense to ensure a certain completeness of information.

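More formally, we define $E$ through its component functions $f(l)$, $\alpha(o)$, $c(b)$ and $s(m)$ of the grid indices, which are parameterized by $f_{max}$ and $n_l$ for the frequency, $n_o$ for the orientation, $c_{max}$ and $n_b$ for the curvature, and $s_{min}$, $s_{max}$ and $n_m$ for the size.

We refer to a discrete set of banana wavelets with $n_l$ levels, $n_o$ orientations, $n_b$ curvatures and $n_m$ sizes by $\mathcal{B}$ and call it a banana plant (see figure). In our simulations we used the parameter settings shown in the table of standard parameter settings, in the following referred to as standard settings.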

Footnote 1: The parameter $o$ is not restricted to the $n_o$ kernels used for the actual image processing. In case $o$ is larger than $n_o$, $B^{b}$ with $b = (f(l), \alpha(o), c(b), s(m))$ represents the kernel $\mathrm{Conj}(B^{b'})$ for the corresponding parameter vector $b'$ among the first $n_o$ orientations, where $\mathrm{Conj}(B)$ represents the complex conjugated kernel corresponding to $B$. Except for one section, we only make use of the first $n_o$ kernels.

Table: Standard parameter settings. Column groups: Transformation (parameters of the transformation), Approximation (parameters in $\mathcal{W}$ differing from the parameters in $\mathcal{B}$), Banana space (metric of the banana space), Learning (parameters of learning), Matching.

Figure: Banana plant (panels indexed by b = 0, 3, 6; o = 0, 4; m = 0, 1; l = 0, 1). These are some examples of wavelets of a banana plant with l frequencies, o orientations, b curvatures and m magnitudes, which is a standard setting.

The Banana Space

Let $I$ be a given picture and $I(x, y)$ be its value at pixel position $(x, y)$. The six-dimensional space of vectors $c = (x, y, l, o, b, m)$ is called the banana coordinate space (referred to as $C$), where $c$ represents the banana wavelet $B^{(f(l), \alpha(o), c(b), s(m))}$ at pixel position $(x, y)$. The banana coordinate space has $n_l \times n_o \times n_b \times n_m \times x_{res} \times y_{res}$ elements, $x_{res}$ and $y_{res}$ representing the resolution of the image $I$. In the following we define a neighbourhood relation $N(c_1, c_2)$ and a metric $d(c_1, c_2)$ on $C$. Two coordinates $c_1$, $c_2$ are expected to be neighboured, or to have a small distance $d$, when their corresponding kernels are similar. For the coordinates pixel position $(x, y)$, level $l$ and size $m$ we can assume that the similarity of corresponding kernels changes according to the distance of these parameters, i.e. the corresponding kernels can be thought of as arranged in a four-dimensional cube. For the coordinates orientation $o$ and curvature $b$ it is more convenient to arrange the corresponding kernels in a Moebius topology (see figure). Note that a banana wavelet with orientation $\alpha(o)$ and curvature $c(b)$, rotated by $\pi$, produces the same absolute response as a banana wavelet with orientation $\alpha(o)$ and curvature $-c(b)$. We use the neighbourhood relation $N$ for our feature extraction and the metric $d$ in the learning algorithm.

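More formally, we first define a Moebius topology on the subset $(o, b)$: $(o_1, b_1)$ and $(o_2, b_2)$ are called neighboured, $N(c_1, c_2) = \mathrm{True}$, if at least one of the following two conditions holds:

Within topology: $\max\{|o_1 - o_2|, |b_1 - b_2|\} \le 1$ for orientation indices $o_1$, $o_2$ inside the range given by $n_o$.

Border topology: one of $o_1$, $o_2$ is the first and the other the last orientation index, and the curvature indices are adjacent once the sign of one curvature is reversed.

Secondly, we extend the neighbourhood relation to $C$: $c_1$ is neighboured to $c_2$ if $(o_1, b_1)$ is neighboured to $(o_2, b_2)$ and $\max\{|x_1 - x_2|, |y_1 - y_2|, |l_1 - l_2|, |m_1 - m_2|\} \le 1$.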
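A minimal sketch of such a neighbourhood relation is given here. It rests on assumptions that are not stated in the text: indices start at 0, neighbours differ by at most one index step, and the curvatures are sampled symmetrically around zero so that reversing the sign of the curvature with index `b` yields the index `n_b - 1 - b`. The function names are hypothetical.

```python
def flip_curvature(b, n_b):
    """Index of the curvature with opposite sign (assumes symmetric sampling)."""
    return n_b - 1 - b


def neighbours_ob(o1, b1, o2, b2, n_o, n_b):
    """Neighbourhood on the (orientation, curvature) subspace."""
    # Within the topology: adjacent indices in both coordinates.
    if max(abs(o1 - o2), abs(b1 - b2)) <= 1:
        return True
    # Across the border: orientation wraps from n_o - 1 to 0 and the
    # curvature changes sign, which produces the Moebius twist.
    if {o1, o2} == {0, n_o - 1}:
        return abs(b1 - flip_curvature(b2, n_b)) <= 1
    return False


def neighbours(c1, c2, n_o, n_b):
    """Neighbourhood on full coordinates c = (x, y, l, o, b, m)."""
    x1, y1, l1, o1, b1, m1 = c1
    x2, y2, l2, o2, b2, m2 = c2
    close = max(abs(x1 - x2), abs(y1 - y2), abs(l1 - l2), abs(m1 - m2)) <= 1
    return close and neighbours_ob(o1, b1, o2, b2, n_o, n_b)
```

In this sketch a coordinate also counts as its own neighbour; whether that is desired depends on how the relation is used in the feature extraction.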

Figure: Moebius topology. The subspace of orientations and curvatures $(o, b)$ with $n_o$ orientations and $n_b$ curvatures. Top: The banana wavelets on the left are connected by lines to the wavelets with neighbouring indices $(o, b)$ on the right. Connecting the right edge with the left edge according to these neighbourhoods leads to the Moebius topology shown at the bottom.

Figure: Embedding. The discrete banana coordinate space $C$ is embedded in the continuous parameter space $P$. $C$ and $P$ are shown in three dimensions only.

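Now we define a distance measure on $C$ harmonizing with its topology. The mapping

$$\tilde{E}(c) = (x, y, f(l), \alpha(o), c(b), s(m))$$

embeds the discrete space $C$ in a continuous space $P$, in the following called parameter space (see figure). $\tilde{E}$ is a simple extension of $E$, taking also the pixel position $(x, y)$ into account. After defining a metric on the parameter space $P$, we use the embedding function $\tilde{E}$ to translate this metric back to $C$. As for the topology above, we first define a distance in the subspace $(\alpha, c)$ expressing the Moebius topology thereof. Let $(e_x, e_y, e_f, e_\alpha, e_c, e_s)$ be a cube of volume one in $P$. For two points $v_i = (x_i, y_i, f_i, \alpha_i, c_i, s_i) \in P$, let $d_{(\alpha, c)}$ be defined as follows:

$$d_{(\alpha, c)}(v_1, v_2) = \min\left\{ \sqrt{\frac{(\alpha_1 - \alpha_2)^2}{e_\alpha^2} + \frac{(c_1 - c_2)^2}{e_c^2}},\ \sqrt{\frac{(\alpha_1 - \alpha_2 - \pi)^2}{e_\alpha^2} + \frac{(c_1 + c_2)^2}{e_c^2}},\ \sqrt{\frac{(\alpha_1 - \alpha_2 + \pi)^2}{e_\alpha^2} + \frac{(c_1 + c_2)^2}{e_c^2}} \right\}$$

Now we can define a distance measure on $P$:

$$d(v_1, v_2) = \sqrt{\frac{(x_1 - x_2)^2}{e_x^2} + \frac{(y_1 - y_2)^2}{e_y^2} + \frac{(f_1 - f_2)^2}{e_f^2} + \frac{(s_1 - s_2)^2}{e_s^2} + d_{(\alpha, c)}(v_1, v_2)^2}$$

Setting

$$d(c_1, c_2) = d(\tilde{E}(c_1), \tilde{E}(c_2))$$

we can finally extend $C$ to a discrete metric space.

Footnote 2: Our choice of parameters is shown in the table of standard settings (metric of the banana space).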

Banana Wavelet Responses

The basic feature of our object recognition system is the magnitude of the filter response of a banana wavelet $B^{b}$, extracted by a convolution of $B^{b}$ with the image $I$. In the following, $|F(I)(x_0, b)|$ represents the magnitude of the filter response of the banana wavelet $B^{b}$ at pixel position $x_0$ in image $I$. A banana wavelet $B^{b}$ causes a strong response at pixel position $x_0$ when the local structure of the image at that pixel position is similar to $B^{b}$. We call this six-dimensional metric space $A(I)(x_0, b)$ the banana response space associated with image $I$. The very same metric and topology as defined for the coordinate space can be applied to this space. We call the whole construction, consisting of a banana plant, the coordinate space and the response space, the banana space.

More formally, let the operator $F$ symbolize the convolution of an image $I$ with $B^{b}$, for all possible $b$, at a pixel position $x_0$ in the image $I$:

$$F(I)(x_0, b) = \int B^{b}(x_0 - x)\, I(x)\, dx = (B^{b} * I)(x_0)$$

and let $A(I)(x_0, b)$ be the magnitudes of $F(I)(x_0, b)$:

$$A(I)(x_0, b) = |F(I)(x_0, b)|.$$
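The response magnitude at a single pixel can be illustrated by a direct summation. This is a sketch under simplifying assumptions rather than the transformation used in the report: the kernel is a small square array of complex samples centred on its middle element, pixels outside the image are treated as zero, and the function name is hypothetical. A practical implementation would normally compute the convolution in the Fourier domain.

```python
def response_magnitude(image, kernel, x0, y0):
    """Magnitude of the convolution of kernel and image at pixel (x0, y0).

    image  : list of rows of grey values
    kernel : list of rows of complex values with odd side length,
             centred on its middle element
    Pixels outside the image are treated as zero.
    """
    half = len(kernel) // 2
    height, width = len(image), len(image[0])
    total = 0j
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            # Convolution pairs the kernel offset (dx, dy)
            # with the image value at (x0 - dx, y0 - dy).
            dx, dy = kx - half, ky - half
            x, y = x0 - dx, y0 - dy
            if 0 <= x < width and 0 <= y < height:
                total += weight * image[y][x]
    return abs(total)
```

For a kernel whose samples sum to zero, a constant image gives a zero response away from the image border, which is the discrete counterpart of the DC-freedom required of the wavelets.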

The figure shows the complex and absolute responses for an image and a specific banana wavelet.

Figure: Results of a transformation with banana wavelets. Top: real part of a banana wavelet, imaginary part of a banana wavelet, image to be transformed. Bottom: the results of the convolution of the image with the wavelet. From left to right: real part of the convolution result, imaginary part of the convolution result, magnitude of the convolution result. White pixels code high values, so there are local maxima at those parts of the image which show lines or edges of the same orientation, curvature and size as the banana wavelet (here especially the head of the person).

Path Corresponding to a Banana Wavelet

To every banana wavelet $B^{b}$ there can be defined a curve $p^{b}$, called the path corresponding to $B^{b}$ (see figure, a and b). This curve is used to speed up the transformation of an image by hierarchical processing. It also allows the visualization of the learned representation of an object (see figure, c). Therefore the path corresponding to a banana wavelet also represents a transition from a grey level feature, represented by a banana wavelet, to a feature based on line drawings. In the approximation algorithm we apply two qualities connected with a curve $p$: the derivative $p'(t)$ at a certain point $t$, expressing the tangent vector at $p(t)$, and the length $L(p)$ of the curve.

More formally, we define $p^{b}(t) \in \mathbb{R}^{2}$ by

$$p^{b}(t) = \begin{pmatrix} -c \left( \dfrac{\sigma_y s\, t}{f} \right)^{2} \cos\alpha - \dfrac{\sigma_y s\, t}{f} \sin\alpha \\[2ex] -c \left( \dfrac{\sigma_y s\, t}{f} \right)^{2} \sin\alpha + \dfrac{\sigma_y s\, t}{f} \cos\alpha \end{pmatrix}.$$

We can equivalently express $p^{b}(t)$ in our matrix notation (see Appendix A).

Footnote 3: For the concept of curves see e.g.

Figure: Path corresponding to a banana wavelet. a) Arbitrary wavelet. b) Corresponding path. c) Visualization of a representation of an object class. The width of a line segment depends on the parameter $l$: banana wavelets with lower frequencies are represented by line segments with larger width.

Approximation of Banana Wavelets by Gabor Wavelets

The banana response space contains a huge number of features; their generation takes a long time on a sequential computer and requires large memory capacities, e.g. for a transformation with our standard settings on a Sparc Ultra. Here we define an algorithm to approximate banana wavelets from a small set of Gabor wavelets, and banana wavelet responses from Gabor wavelet responses, by hierarchical processing. This approximation can be performed before the matching, or in a virtual mode in which only those features which are actually requested for the matching are evaluated on the fly. Because of the sparseness of our representations of objects, only a small subset of the banana space is actually used during matching and can therefore be evaluated very fast. If all banana wavelets are evaluated before matching, the hierarchical processing already yields a speed-up. In the virtual mode we can accelerate the matching as well and reduce the memory requirements. The reader who is more interested in the learning algorithm may skip this section.

The Approximation Problem

Let $\mathcal{B}$ be a set of banana wavelets. Let $\mathcal{W}$ be a discrete set of banana wavelets $W^{w}$ with zero curvature, one size, and $n_f$ and $n_o$ chosen as for $\mathcal{B}$. The elements of $\mathcal{W}$ can be interpreted as Gabor wavelets because they only have the variable qualities frequency and orientation. Let $W^{x, w}$ be $W^{w}$ translated by the vector $x$. Our aim is to approximate an arbitrary banana wavelet in $\mathcal{B}$ by a weighted sum of translated banana wavelets in $\mathcal{W}$ (see figure). Let

$$J^{b} = \left\{ (x_j^{b}, w_j^{b}) \mid j = 1, \ldots, n^{b} \right\}$$

be a set of positions $x_j^{b}$ and parameter vectors $w_j^{b}$. We calculate the approximation $\tilde{B}^{b}$ of $B^{b}$ by a weighted sum of Gabor wavelets in $\mathcal{W}$:

$$\tilde{B}^{b} = \sum_{(x_j^{b}, w_j^{b}) \in J^{b}} \beta_j^{b}\, W^{x_j^{b}, w_j^{b}}.$$

Figure: Approximation. The banana wavelet on the left is approximated by the weighted sum (weights $\beta_1$, $\beta_2$, $\beta_3$) of Gabor wavelets on the right.

In this approximation problem we have to regulate two different and contradictory entities: the quality of approximation and the number of basis functions used for the approximation. In terms of the quality of approximation we like to minimize $\|B^{b} - \tilde{B}^{b}\|$, where $\|\cdot\|$ represents the $L^{2}$ norm. In terms of the speed of approximation we like to minimize the number of additions and multiplications in the sum, i.e. $|J^{b}|$ (for a set $S$ we define $|S|$ as the number of elements of $S$). Because of the similarity of the Gabor wavelets in $\mathcal{W}$ to a local part of a banana wavelet in $\mathcal{B}$, we expect to get a fairly accurate approximation with a small number of Gabor wavelets (see figure).

Let $F^{W}(I)$ be the complex responses associated with $I$, obtained by a convolution with the elements of $\mathcal{W}$. Given the approximation of the kernels above, we can analogously calculate an approximation $\tilde{A}(I)$ of $A(I)$ by

$$\tilde{A}(I)(x, b) = \left| \sum_{(x_j^{b}, w_j^{b}) \in J^{b}} \beta_j^{b}\, F^{W}(I)(x - x_j^{b}, w_j^{b}) \right|.$$

The single size $s^{W}_{min} = s^{W}_{max}$ of the wavelets in $\mathcal{W}$ is defined relative to the sizes used in $\mathcal{B}$ by a parameter which determines the width of the Gaussian of the Gabor wavelet in y-direction. The number of directions $n_o^{W}$ is chosen independently. A large number of orientations $n_o^{W}$ improves the accuracy of the approximation but presupposes a more time-consuming convolution to obtain $A^{W}(I)$. The approximation can be performed before the later matching stages or in a virtual mode, i.e. it is calculated only if a certain banana response is requested by the matching algorithm. In the first case we achieve a fixed speed-up. In the virtual mode the speed-up depends on the complexity of the representation used for matching. For a typical task we also achieve a speed-up (this figure is expected to increase further).
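The reason why responses can be approximated in the same way as kernels is the linearity of the convolution: the response of a weighted sum of translated kernels equals the weighted sum of correspondingly shifted responses of the basis kernels. The following one-dimensional sketch illustrates this. The kernel representation as a dictionary from offsets to values, the sign convention for translations and all function names are assumptions made for the example.

```python
def conv_at(signal, kernel, x):
    """(kernel * signal)(x) for a kernel given as {offset: value}."""
    return sum(v * signal[x - k]
               for k, v in kernel.items()
               if 0 <= x - k < len(signal))


def translate(kernel, shift):
    """Kernel moved by `shift` samples."""
    return {k + shift: v for k, v in kernel.items()}


def weighted_sum(terms):
    """Kernel built as sum_j beta_j * translate(w_j, x_j).

    terms is a list of (beta_j, x_j, w_j) triples.
    """
    out = {}
    for beta, shift, w in terms:
        for k, v in translate(w, shift).items():
            out[k] = out.get(k, 0) + beta * v
    return out


def approx_response(signal, terms, x):
    """Response of the summed kernel, computed only from responses
    of the basis kernels at shifted positions."""
    return sum(beta * conv_at(signal, w, x - shift)
               for beta, shift, w in terms)
```

In the virtual mode described above, only `approx_response` would be evaluated, and only for those positions and kernels that the matching actually requests; the magnitude of its result plays the role of the approximated banana response.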

Approximation using a Path Corresponding to a Banana Wavelet

We present a solution of the approximation problem defined above by utilizing the path $p^{b}$ corresponding to a banana wavelet. We simply choose as $(x_j^{b}, w_j^{b})$ the Gabor wavelet in $\mathcal{W}$ closest to the tangent on $p^{b}(t_i)$, for equidistantly separated $t_i$ in the interval, and we choose the weight $\beta_i^{b}$ according to the magnitude of $B^{b}$ at the position $p^{b}(t_i)$ (see figure).

Figure: Left: Real part of a banana wavelet. Middle: Approximation of the banana wavelet. Note that the symmetry along the contour line of the original banana wavelet is not conserved in the approximation. This effect increases especially for stronger curvatures. Right: Error of approximation.

Formally speaking, the number $n^{b}$ of Gabor wavelets used to approximate a certain $B^{b}$ is proportional to the length of the path corresponding to $B^{b}$, divided by the length of a path corresponding to the Gabor wavelet with the same frequency $f$ and zero orientation, i.e. $W^{w}$ with $w = (f, 0, 0, s^{W}_{min})$:

$$n^{b} \propto \frac{L(B^{b})}{L(W^{(f, 0, 0, s^{W}_{min})})}.$$

An increase of the proportionality factor leads to a narrowing of the base points of the approximation and therefore to an overlapping of the Gabor wavelets. The centre of the $j$-th Gabor wavelet, $x_j^{b}$, is defined as

$$x_j^{b} = p^{b}(t_j)$$

for equidistantly separated $t_j$, $j = 1, \ldots, n^{b}$. Let $op^{W}(t)$ be the index of the orientation of the $W \in \mathcal{W}$ with associated derivative closest to the derivative $p'^{b}(t)$. Then we set

$$(x_j^{b}, w_j^{b}) = \left( p^{b}(t_j),\ (f, \alpha(op^{W}(t_j)), 0, s^{W}_{min}) \right).$$

Footnote 4: Note that the paths corresponding to Gabor wavelets with a certain frequency have the same length.

Footnote 5: Here $op^{W}(t)$ is not restricted to the first $n_o^{W}$ orientation indices. The imaginary part of a banana wavelet is not axis symmetric; therefore the conjugated Gabor wavelet is needed to cover all curvatures.

Figure (rows a, b; panels i-v): a.i) Banana wavelet. a.ii) Input picture. a.iii) Original transformation of the picture in a.ii) with the banana wavelet in a.i). a.iv) Gabor approximation of the transformation in a.iii). a.v) Difference between a.iv) and a.iii). b.ii) The function $E(I, x_0)$. b.iii-v) The normalized transformation, its Gabor approximation, and the difference of both for the kernel in a.i).

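We calculate the weights in the following way. With $(x_j^{b}, w_j^{b}) = \left( p^{b}(t_j), (f, \alpha(op^{W}(t_j)), 0, s^{W}_{min}) \right)$ as above, we define the preliminary weights

$$\hat{\beta}_j^{b} = \mathrm{real}\left( B^{b}(p^{b}(t_j)) \right)$$

and ensure that $\tilde{B}^{b}$ has the same norm as $B^{b}$ by setting

$$\beta_j^{b} = \hat{\beta}_j^{b} \cdot \frac{\|B^{b}\|}{\left\| \sum_{j} \hat{\beta}_j^{b}\, W^{x_j^{b}, w_j^{b}} \right\|}.$$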
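The selection of base points along the path can be sketched as follows. The sketch makes several assumptions that are not taken from the report: the path follows the ridge of the Gaussian envelope (where the bent coordinate vanishes), the parameter `t` runs over `[-1, 1]`, the `n_o` Gabor orientations are spaced evenly over half a turn, and a Gabor wavelet with orientation `alpha` is elongated along the direction `(-sin(alpha), cos(alpha))`. The default `sigma_y` and all function names are hypothetical.

```python
import math


def path_point(t, f, alpha, c, s, sigma_y=1.0):
    """Point p(t) on the ridge of the wavelet, for t in [-1, 1]."""
    u = sigma_y * s * t / f          # position along the curve
    v = -c * u * u                   # displacement caused by the curvature
    return (v * math.cos(alpha) - u * math.sin(alpha),
            v * math.sin(alpha) + u * math.cos(alpha))


def path_tangent(t, f, alpha, c, s, sigma_y=1.0):
    """Derivative p'(t), up to the positive factor sigma_y * s / f."""
    u = sigma_y * s * t / f
    dv = -2.0 * c * u
    return (dv * math.cos(alpha) - math.sin(alpha),
            dv * math.sin(alpha) + math.cos(alpha))


def closest_orientation(tangent, n_o):
    """Index of the orientation whose line direction best fits the tangent."""
    theta = math.atan2(tangent[1], tangent[0])
    # The elongation direction of orientation alpha is alpha + pi/2;
    # line directions are only defined modulo pi.
    alpha = (theta - math.pi / 2.0) % math.pi
    return int(round(alpha / (math.pi / n_o))) % n_o


def base_points(f, alpha, c, s, n, n_o):
    """n equidistant base points as (position, orientation index) pairs."""
    points = []
    for j in range(n):
        t = -1.0 + 2.0 * j / (n - 1) if n > 1 else 0.0
        tangent = path_tangent(t, f, alpha, c, s)
        points.append((path_point(t, f, alpha, c, s),
                       closest_orientation(tangent, n_o)))
    return points
```

For a wavelet without curvature all base points receive the same orientation index, so the approximation degenerates to a chain of identical Gabor wavelets; with increasing curvature the indices fan out along the path.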

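Quality of Approximation

We can measure the quality of the approximation in the space of filters by calculating the mean $L^{2}$ distance between the banana wavelets and their approximations,

$$q_1(\mathcal{B}, \tilde{\mathcal{B}}) = \frac{1}{|\mathcal{B}|} \sum_{b \in \mathcal{B}} \frac{\|B^{b} - \tilde{B}^{b}\|}{\|B^{b}\|},$$

or in the space of filter responses by evaluating the differences between the transformation using the original kernels and the transformation using the approximation formula,

$$q_2(A, \tilde{A}, \mathcal{I}) = \frac{1}{|\mathcal{I}|\,|\mathcal{B}|} \sum_{I \in \mathcal{I}} \sum_{b \in \mathcal{B}} \frac{\|\tilde{A}(I)(b) - A(I)(b)\|}{\|A(I)(b)\|},$$

where $\mathcal{I}$ is a set of pictures and $A(I)(b)$, respectively $\tilde{A}(I)(b)$, are the functions representing the whole image convolved with $B^{b}$, respectively $\tilde{B}^{b}$. Note that $q_1$ and $q_2$ are not completely dependent (see the caption of the table on the quality of approximation). That table gives the quality of approximation and the speed-up for different settings of $n_o^{W}$ and of the density of base points.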
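A relative error measure of this kind is straightforward to compute. The sketch below shows the filter-space variant for kernels stored as flat lists of samples; the storage format and the function names are assumptions for the example.

```python
import math


def l2_norm(kernel):
    """L2 norm of a kernel given as a flat list of (complex) samples."""
    return math.sqrt(sum(abs(v) ** 2 for v in kernel))


def mean_relative_error(originals, approximations):
    """Mean over all kernels of ||B - B_approx|| / ||B||."""
    errors = []
    for orig, approx in zip(originals, approximations):
        diff = [a - b for a, b in zip(orig, approx)]
        errors.append(l2_norm(diff) / l2_norm(orig))
    return sum(errors) / len(errors)
```

Because each error is divided by the norm of the original kernel, multiplying a kernel and its approximation by the same scalar leaves the measure unchanged.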

Extracting the Important Banana Responses per Instance

Our second stage of preprocessing reduces the number of vectors $c$ in the coordinate space $C$ needed to represent a certain picture $I$ or a local area of $I$. Our aim is to extract the local structure in $I$ in terms of curved lines expressed by banana wavelets. Some of these lines may be important to represent the specific object, but there will also be curved lines representing features which are caused by accidental conditions, e.g. shadows caused by specific illumination, background, or object surface texture. An algorithm which extracts the important features for a class of objects from different pictures of this object, based on the preprocessing described here, is presented in the section on learning.

Footnote 6: The division by $\|B^{b}\|$ ensures that $q_1$ is independent of a simple scalar multiplication of the banana wavelets.

Table: Quality of approximation. Column groups: parameters ($n_o$, $n_b$, $n_m$, $n_o^{W}$ and the density of base points); quality ($q_1$, $q_2$); time in seconds for matching and convolution with the original transformation, the approximated transformation and the virtual transformation. The rows vary one of the two approximation parameters while the other is held constant. $q_1$ and $q_2$ do not attain their minima at the same setting. We assume this effect is caused by the fact that an increase of the density narrows the base points of the approximation. In natural pictures, lines are frequent features. This regularity decreases the necessity of many base points. A further row gives the approximation with a reduced number of curvatures in the first transformation. The transformation without approximation has the largest main memory requirement (the transformation and the Fourier transformed kernel have to be stored); the approximated transformation and the virtual transformation (which stores only the transformation with the kernels in $\mathcal{W}$) require less.

We define an important feature in one image (or per instance) by two qualities, C1 and C2. An important feature per instance

C1: has a strong response;

C2: has to represent a local maximum in the banana space.

C1 represents the requirement that a certain feature or a similar feature is present, whereas C2 allows a more specific characterization of this feature. Banana responses vary smoothly in the coordinate space. Therefore the six-dimensional function $A(I)(x_0, b)$ is expected to have a properly defined set of local maxima. In terms of the analogy to the processing in area V1 of the vertebrate visual system, C1 may be interpreted as the response of a certain column, which indicates the general presence of a feature coded in this column, whereas C2 represents the intercolumnar competition, giving a more specific coding of this feature. The figure showing the result of the second stage of preprocessing displays the significant features per instance, represented by their corresponding paths.

We say a banana wavelet has a strong response at a certain pixel position $x_0$ when its response is larger than an average response. For this average response we consider the average activity in the complete response space, but we also take the average activity of a local area in the response space into account. Therefore a global and a local normalization are performed.

Formally speaking, we define the mean local activity $E(I, x_0)$ at pixel position $x_0$ and the mean total activity $E(I)$ of the banana space by

$$E(I, x_0) = \frac{1}{|A(x_0, r_E)|\,|\mathcal{B}|} \sum_{x \in A(x_0, r_E)} \sum_{b \in \mathcal{B}} A(I)(x, b)$$

and

$$E(I) = \frac{1}{|I|\,|\mathcal{B}|} \sum_{x \in I} \sum_{b \in \mathcal{B}} A(I)(x, b),$$

where $A(x_0, r_E)$ represents the cuboid with center $x_0$ and length of side $r_E$ in the $(x, y)$ space in which the local activity is calculated (see figure). The function $E(I, x_0)$ has high values when there is a lot of structure in the local area around $x_0$. We now define a threshold $T(x_0)$ by a weighted average of these two activities, and we can formalize C1 and C2 as follows. A banana response $A(I)(x_0, b)$ represents a significant feature per instance if

C1: $A(I)(x_0, b) > T(x_0)$

Figure: Result of the second stage of preprocessing. Left column: the original images. Middle column: significant features corresponding to banana wavelets of high frequency, expressed by their corresponding paths. Right column: significant features corresponding to low frequency. The detailed structure of the house and the inner features of the face are best described by elements of the banana space with high frequency. E.g., the eyes of the person are best described by banana wavelets with small size.

Figure Normalization. a) Input picture $I$. b) The local activity is calculated within the small cuboid $A(\vec{x}_0, r_E)$. c) The function $E(I, \vec{x})$.
[Panel labels: $(x_0, y_0)$, $r_E$, $x$, $y$, $(f, \alpha, c, s)$]

C2: $AI(\vec{x}, \vec{b}) \geq AI(\vec{x}_i, \vec{b}_i)$ for all neighbours $(\vec{x}_i, \vec{b}_i)$ of $(\vec{x}, \vec{b})$ as defined in

One parameter regulates the distinctness by which a feature must exceed the average activity to be a candidate for a significant feature per instance; a larger value reduces the number of significant features. A second parameter regulates the influence of the local versus the global activity (our choice of parameters is shown in table ). To reduce the time for calculating the average activities $E(I, \vec{x}_0)$ we approximate them by taking only the banana responses for the smallest size and with zero curvature into account. The responses corresponding to banana wavelets with the same orientation but different curvature or size are highly dependent because they represent similar features. For the calculation of $E(I, \vec{x}_0)$, which just represents some kind of average activity, only one of these similar features has to be taken into account.
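The selection of significant features per instance can be sketched in a few lines. This is only an illustration under stated assumptions: the exact threshold formula and the parameter symbols are not legible in this copy, so the names `theta` (distinctness factor) and `lam` (weight of global versus local activity), the weighted-mean form of the threshold, and the simple neighbourhood used for the local-maximum test are all hypothetical.

```python
from itertools import product


def mean_activities(resp, r_e):
    """Return (global mean, local mean map) for responses resp[b][y][x]."""
    n_b, h, w = len(resp), len(resp[0]), len(resp[0][0])
    # Sum over all wavelets b at each pixel.
    summed = [[sum(resp[b][y][x] for b in range(n_b)) for x in range(w)]
              for y in range(h)]
    e_global = sum(map(sum, summed)) / (n_b * h * w)
    half = r_e // 2
    e_local = [[0.0] * w for _ in range(h)]
    for y, x in product(range(h), range(w)):
        # Square window of side r_e around (x, y), clipped at the image border.
        ys = range(max(0, y - half), min(h, y + half + 1))
        xs = range(max(0, x - half), min(w, x + half + 1))
        total = sum(summed[yy][xx] for yy in ys for xx in xs)
        e_local[y][x] = total / (n_b * len(ys) * len(xs))
    return e_global, e_local


def significant_features(resp, r_e=3, theta=1.5, lam=0.5):
    """Return (x, y, b) triples passing the threshold and local-maximum tests."""
    n_b, h, w = len(resp), len(resp[0]), len(resp[0][0])
    e_global, e_local = mean_activities(resp, r_e)
    found = []
    for b, y, x in product(range(n_b), range(h), range(w)):
        value = resp[b][y][x]
        # C1 (assumed form): exceed a weighted mix of global and local mean.
        threshold = theta * (lam * e_global + (1.0 - lam) * e_local[y][x])
        if value <= threshold:
            continue
        # C2 (simplified): no direct neighbour in x, y or wavelet index is larger.
        neighbours = [(b, y - 1, x), (b, y + 1, x), (b, y, x - 1),
                      (b, y, x + 1), (b - 1, y, x), (b + 1, y, x)]
        if all(value >= resp[nb][ny][nx] for nb, ny, nx in neighbours
               if 0 <= nb < n_b and 0 <= ny < h and 0 <= nx < w):
            found.append((x, y, b))
    return found
```

In this sketch the wavelet index `b` stands in for the four-dimensional banana parameter vector; a faithful version would use the neighbourhood of the banana metric instead of adjacent indices.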

Learning

Here we describe an algorithm to extract invariant local features representing landmarks for a class of objects. We assume the correspondence problem to be solved, i.e., we assume the position of certain landmarks of an object, such as the center of the left eye or the midpoint of the right edge of a can, to be known on pictures of different examples of this object. In some of our simulations we determine corresponding landmarks manually; for the rest we replaced this manual intervention by motor controlled feedback (see section ). For learning it is indispensable to ensure that comparable entities are used as training data; otherwise the effect of learning will decrease because of the noise in the training data. Furthermore, it is advantageous to split a large learning problem, like learning a representation of a face, into smaller subproblems, like learning the representation of the eye region or the top of the head. This learning with comparable and smaller entities is the meaning of our a priori principle P.

In a nutshell the learning algorithm works as follows: We extract the significant features per instance (as described in section ) for different images of an object taken at a certain pose for a specific landmark. For each landmark we collect these features in one bin. We define a certain feature as significant when this feature, or a similar feature according to our metric, occurs often in the bin, i.e., it occurs often in the different images of our training set. We end up with a graph whose nodes are labeled with elements of the banana coordinate space expressing the learned significant features, mostly representing edges of an object or invariant inner features like the eyes or the nose. We refer to such a representation of an object class $O$ as $S^{Rep(O)}$ and to the set of pixels of the coordinate space representing the k-th landmark as $S_k^{Rep(O)}$. Figure illustrates the learning algorithm.

A significant feature should be independent of background, illumination or accidental qualities of a certain example of the object class, i.e., it should be invariant under these transformations of an object class (P). This is realized by measuring the probability of occurrence of features in a local area of the banana space for different examples. The metric of the banana space thus allows the grouping of similar features in one bin, but it also allows the reduction of redundancy of information (P) by avoiding multiple features of small distance in the learned representation.

Figure Schematic explanation of the learning algorithm, shown for the 1st to the n-th landmark. 1) Calculate the convolution of a banana plant with corresponding landmarks in all training images. 2) Extract the significant features per instance for a specific landmark. 3) Collect these features in one bin. 4) Learned significant features for a landmark extracted from all images. 5) Learned representation for an object of a certain view.

Formally speaking, let $I$ be a set of pictures of different examples of a class of objects of a certain orientation and approximately equal size. $I^{jk}$ represents a local area in the j-th image in $I$ with the k-th landmark as its center. Let $s^k_{ij}$ be the i-th significant feature per instance extracted in the area $I^{jk}$. We collect all $s^k_{ij}$ for a specific k in one set $S^k$. Then we apply the LBG vector quantization algorithm to $S^k$ (see figure ). After vector quantization a codebook $C^k$ expresses the vectors $s^k_{ij}$ with a constant number $n_C$ of codebook vectors $c_i \in C^k$, $C^k = \{c_1, \ldots, c_{n_C}\}$. The number $n_C$ depends on the number of entries in $S^k$: $n_C = p\,|S^k|$ (figure b). In case of a large $p$ the initial codebook has a higher density in the training set.
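The vector quantization step can be illustrated with a short sketch. It is not the LBG implementation used by the authors: it uses a plain Lloyd-style refinement, the Euclidean distance as a stand-in for the banana metric, and a deterministic initialisation instead of a random one. The function names and the `fraction` argument (playing the role of the factor that sets the codebook size) are hypothetical.

```python
import math


def nearest_index(codebook, point):
    """Index of the codebook vector closest to point (Euclidean stand-in metric)."""
    return min(range(len(codebook)), key=lambda i: math.dist(codebook[i], point))


def distortion(points, codebook):
    """Mean squared distance between each point and its nearest codebook vector."""
    return sum(math.dist(p, codebook[nearest_index(codebook, p)]) ** 2
               for p in points) / len(points)


def quantize(points, fraction, iterations=20):
    """Lloyd-style refinement of a codebook with round(fraction * len(points)) entries."""
    n_c = max(1, round(fraction * len(points)))
    step = len(points) / n_c
    # Deterministic initialisation, chosen here only to keep the sketch reproducible.
    codebook = [tuple(points[int(i * step)]) for i in range(n_c)]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for p in points:
            cells[nearest_index(codebook, p)].append(p)
        # Move every codebook vector to the centroid of its cell; keep empty cells.
        codebook = [tuple(sum(col) / len(cell) for col in zip(*cell)) if cell else old
                    for cell, old in zip(cells, codebook)]
    return codebook
```

Each refinement pass cannot increase the mean squared distortion, which is the quantity the quantization is meant to reduce.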

The LBG algorithm reduces the distortion error, i.e., the average error occurring when all elements of $S^k$ are replaced by the nearest codebook vector in $C^k$. In case of high densities of elements $s^k_{ij}$ in $S^k$ it may be advantageous in terms of the distortion error to have codebook vectors $c_i$ and $c_j$ with small distance $d(c_i, c_j)$. But the significant features for a certain class of objects are expected to express independent qualities (P), i.e., they are expected to have large distances in the banana space. We therefore construct a smaller codebook $\tilde{C}^k$ in which the $c_i, c_j \in C^k$ with close distances are combined to their centre of gravity. Let $r \in \mathbb{R}$ be fixed. We calculate for all $c \in C^k$ the number of $c' \in C^k$ with distance $d(c, c') < r$ (figure c). If there exists one such $c' \neq c$, we substitute all the codebook vectors in $C^k$ with $d(c, c') < r$ by their center of gravity (figure d). $\tilde{C}^k$ now represents a codebook with fewer or equally many elements than $C^k$, without redundant codebook vectors. Now we can define the important features for the k-th landmark of a certain object as those codebook vectors $c \in \tilde{C}^k$ for which a certain percentage $p$ of the $s^k_{ij}$ exists with $d(c, s^k_{ij}) < r$ (figure e, f). We collect these important features in a set $S_k^{Rep(O)}$, which is our learned representation of the k-th landmark of a certain class of objects.
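The two post-processing steps, merging close codebook vectors and keeping only well-supported ones, can be sketched as follows. The sketch makes several assumptions that the text does not fix: the Euclidean distance replaces the banana metric, merging is done greedily in list order, and support is counted as a fraction of all collected features in the bin. The names `merge_close`, `important_features` and `min_fraction` are hypothetical.

```python
import math


def centroid(vectors):
    """Center of gravity of a non-empty list of equally long vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def merge_close(codebook, r):
    """Replace groups of codebook vectors closer than r by their center of gravity."""
    merged, used = [], set()
    for i, c in enumerate(codebook):
        if i in used:
            continue
        # Greedy grouping: everything not yet merged that lies within r of c.
        group = [j for j, other in enumerate(codebook)
                 if j not in used and math.dist(c, other) < r]
        used.update(group)
        merged.append(centroid([codebook[j] for j in group]))
    return merged


def important_features(codebook, samples, r, min_fraction):
    """Keep codebook vectors that have at least min_fraction of the samples within r."""
    needed = min_fraction * len(samples)
    return [c for c in codebook
            if sum(math.dist(c, s) < r for s in samples) >= needed]
```

For example, with `r = 1.0` the codebook `[(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)]` collapses to two vectors, and a vector supported by only one of four samples is dropped at `min_fraction = 0.5`.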

Figure Clustering. a) Distribution of data. b) Codebook initialization. c) Codebook vectors after learning. d) Substituting sets of codebook vectors with small distance r by their center of gravity. e) Counting the number of elements within radius r. f) Deleting codebook vectors representing insignificant features.

Matching

To use our learned representation for location and classification of objects we have to define a similarity between the extracted representation $S^{Rep(O)}$ and a certain position in the image. A view of an object is characterized by a small number of binary features (a certain banana is present or absent) from a large feature space, the banana space. This sparse coding allows a fast matching because only the presence of a few features has to be checked in the pictures.

Here we define a similarity function of a graph labeled with banana wavelets with a certain size and position in an image. We define a total similarity expressing the system's confidence whether there is a certain object in an image $I$ at a certain position and size. As in it simply averages local similarities expressing the system's confidence whether a node of the graph represents a local feature. A graph is adapted to an image by EGM. The total similarity is optimized in two steps: shifting (global move) and scaling of the graph. The optimal similarity value for a graph gives the quality of its fit to the image. For each stored size of an object we perform a separate match. The graph with the highest similarity determines the size and position of the object within the image, while the positions of its nodes identify the landmarks.
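The search over positions and stored sizes can be pictured as a coarse scan that keeps the best-scoring placement. The following is a minimal sketch, assuming an exhaustive grid scan with a fixed step; the actual optimization schedule of the global move and scaling is not specified here, and `best_match`, `similarity` and `step` are hypothetical names.

```python
def best_match(similarity, width, height, sizes, step=2):
    """Scan positions and sizes; return (score, x, y, size) of the best placement.

    similarity is any callable (x, y, size) -> float, for example a total
    similarity between a learned graph and the image.
    """
    best = None
    for size in sizes:
        for y in range(0, height, step):
            for x in range(0, width, step):
                score = similarity(x, y, size)
                # Keep the first placement that reaches the highest score.
                if best is None or score > best[0]:
                    best = (score, x, y, size)
    return best
```

Because the similarity is passed in as a callable, the same scan works for one graph per stored size or for a single graph that is rescaled.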

In a nutshell the local similarity is defined as follows: For each learned feature in $S_k^{Rep(O)}$ and pixel position in the image we simply check whether the corresponding banana response is high or low, i.e., whether the corresponding feature is present or absent. Because of the sparseness of our representation only a few of these checks have to be made; therefore the matching is very fast. Because we make use only of the important features the matching is very efficient.

Figure The normalization function $N(t, I, \vec{x}_0)$.
[Axis labels: $\theta_1 E$, $\theta_2 E$ and 1]

More formally, we introduce a normalization in the banana space to transform our real-valued filter responses $AI(\vec{x}, \vec{b})$ into quasi-binary features which are comparable to the pixels of the coordinate space in our learned representation. The normalized responses depend less on the exact filter response but represent the presence or absence of a certain feature. Let the sigmoid function

$$N(t, I, \vec{x}_0) := \begin{cases} 0 & \text{for } t \leq \theta_1 E(I, \vec{x}_0) \\ \dfrac{t - \theta_1 E(I, \vec{x}_0)}{(\theta_2 - \theta_1)\, E(I, \vec{x}_0)} & \text{for } \theta_1 E(I, \vec{x}_0) < t < \theta_2 E(I, \vec{x}_0) \\ 1 & \text{for } t \geq \theta_2 E(I, \vec{x}_0) \end{cases}$$

be our normalization function (see figure ). Figure b shows the normalized transformation. The value $N(AI(\vec{x}, \vec{b}))$ represents the system's confidence of the presence of the feature $\vec{b}$ at position $\vec{x}$. This confidence is high when the response exceeds the average activity significantly. The exact value of the response is not of interest. We want to avoid a very strict decision at this stage; therefore we still allow a range of indecision of the system when the response is only slightly above the average activity.
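A normalization of this kind can be sketched as a clipped linear ramp between two multiples of the local mean activity. The ramp shape and the parameter names `theta1` and `theta2` are assumptions made for illustration; the default values are arbitrary and not taken from the report.

```python
def normalize_response(t, local_mean, theta1=1.0, theta2=2.0):
    """Map a raw filter response t to a confidence in [0, 1].

    Responses up to theta1 * local_mean count as "absent" (0.0), responses
    from theta2 * local_mean upwards as "present" (1.0); in between the
    confidence rises linearly, which leaves a range of indecision.
    """
    low, high = theta1 * local_mean, theta2 * local_mean
    if t <= low:
        return 0.0
    if t >= high:
        return 1.0
    return (t - low) / (high - low)
```

With `local_mean = 1.0` and the defaults above, a response of 1.5 maps to a confidence of 0.5, while responses of 0.5 and 3.0 map to 0.0 and 1.0.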

Now we can define a local similarity $Sim(S_k^{Rep(O)}, I^{xy})$ between a node labeled with banana wavelet responses $S_k^{Rep(O)}$ and a pixel position $I^{xy}$ in an image $I$ by simply averaging the normalized filter responses corresponding to the learned representation of the k-th landmark, i.e., $\vec{s}_i = (x_i, y_i, f_i, \alpha_i, c_i, s_i) \in S_k^{Rep(O)}$, in the image at the pixel position $(x, y)$:

$$Sim(S_k^{Rep(O)}, I^{xy}) := \frac{1}{|S_k^{Rep(O)}|} \sum_{\vec{s}_i \in S_k^{Rep(O)}} N\big(AI(x + x_i, y + y_i, f_i, \alpha_i, c_i, s_i)\big)$$

The number of pixels in the coordinate space that a node of the graph $S^{Rep(O)}$ is labeled with is very small; therefore the evaluation of is very fast. In the bunch graph representation in a node is labeled by a large number of vectors (approximately ) of Gabor filter responses, each describing one instance of the landmark of an object in the training set. Therefore the evaluation of the local similarity in takes much longer.

As in the total similarity $Sim(S^{Rep(O)}(x, y, s), I)$ between a graph $S^{Rep(O)}(x, y, s)$ at position $(x, y)$ with size $s$ and the image $I$ is simply defined as the average of the local similarities defined above:

$$Sim(S^{Rep(O)}(x, y, s), I) := \frac{1}{n} \sum_{k=1}^{n} Sim(S_k^{Rep(O)}, I^{x_k y_k})$$

with $n$ representing the number of nodes of the graph and $(x_k, y_k)$ the pixel position of the k-th node.
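The local and total similarity can be sketched directly from these definitions. The data layout is an assumption made for this example: normalized responses are stored as `norm_resp[b][y][x]` with a wavelet index `b` standing in for the banana parameters, a node is a list of `(dx, dy, b)` features relative to the node position, and a graph is a list of `(node_dx, node_dy, features)` entries. Positions outside the image are counted as absent features.

```python
def local_similarity(features, norm_resp, x, y):
    """Average normalized response of a node's few features at pixel (x, y)."""
    total = 0.0
    for dx, dy, b in features:
        px, py = x + dx, y + dy
        inside = 0 <= py < len(norm_resp[b]) and 0 <= px < len(norm_resp[b][0])
        total += norm_resp[b][py][px] if inside else 0.0
    return total / len(features)


def total_similarity(graph, norm_resp, x, y):
    """Average of the local similarities of all nodes of a graph placed at (x, y)."""
    return sum(local_similarity(features, norm_resp, x + nx, y + ny)
               for nx, ny, features in graph) / len(graph)
```

The cost of one evaluation grows only with the number of learned features per node, which is the reason given above for the speed of the matching.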

Simulations

We demonstrate the applicability of our algorithm to a wide range of problems. First we learn representations of cans and faces of different poses. We apply these representations to the problem of locating these objects in complex scenes using the matching algorithm described in section . Finally we demonstrate a classification task: the discrimination of frontal faces and non-frontal faces. If not stated explicitly, we used in our simulations the standard settings defined in table . With these settings the transformation (without the approximation described in section ) of a x picture needs seconds, the extraction of significant features per instance takes approximately seconds per node and picture, and the final learning as described in section takes seconds for each landmark for a training set of examples. All simulations were performed on a Sparc Ultra.

Learning of Representation

First we apply the learning algorithm described in section to data consisting of manually provided landmarks. In subsection we replace this manual intervention by motor controlled feedback.

Learning with manually provided ground truth

Each of our training sets consists of approximately examples of an object viewed in a certain pose. As objects we used cans, frontal faces and half profile faces. Corresponding landmarks are defined manually on the different representatives of a class of objects (see figure ).

Figure Manually defined graphs for a) cans, b) frontal faces and c) half profiles.

Figure shows the significant features per instance for some of the can examples in the training set as well as the learned representations. Figure shows the learned representations for faces using manually defined graphs as shown in figure .

Figure a) Pictures for training. b i–iv) Extracted significant features per instance. c) The learned representation.

In figure a the variability of the representation for different runs of the learning algorithm is demonstrated, caused by the random initialization of the LBG algorithm. The learned representations for different p are shown in figure b. The parameter p determines the fraction of features needed to be present in the training data to define a significant feature. The change of the representation for different sizes of the training set is demonstrated in figure c.

Learning with automatic landmark denition

To avoid the manual generation of ground truth we made use of motor controlled feedback. Our aim is the construction of training data in which a certain object is shown under changing conditions, like different background and different illumination, but without change of the position of the landmarks. Then we can simply apply our learning algorithm, using a rectangular grid, to this data.
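When the object stays at the same image position in every training picture, a fixed rectangular grid can supply the landmarks. The sketch below shows one way this could look: it builds the grid and collects, for every node, the features that fall into a small window around it. The function names, the `spacing` and `radius` parameters and the `(x, y, b)` feature layout are assumptions made for this example.

```python
def rectangular_grid(left, top, columns, rows, spacing):
    """Node positions of a regular grid, listed row by row."""
    return [(left + c * spacing, top + r * spacing)
            for r in range(rows) for c in range(columns)]


def collect_bins(grid, features_per_image, radius):
    """Collect, per grid node, all features within radius of it across all images.

    features_per_image holds one list of (x, y, b) features per training image;
    features are stored relative to the node they are assigned to.
    """
    bins = [[] for _ in grid]
    for features in features_per_image:
        for x, y, b in features:
            for k, (nx, ny) in enumerate(grid):
                if abs(x - nx) <= radius and abs(y - ny) <= radius:
                    bins[k].append((x - nx, y - ny, b))
    return bins
```

Each resulting bin then plays the role of the per-landmark feature set to which the clustering is applied.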

We put a can on a rotating plate and changed background and lighting conditions in a sequence of pictures (see figure ). The whole generation of training data took just about seconds. For the generation of ground truth for frontal faces we recorded a sequence of pictures in which a person is sitting fixed on a chair. Illumination and background are changed as for cans (see figure ). To extract representations for different scales we simply apply the learning algorithm to the very same pictures of the different sequences, scaled accordingly.

Figure Training set and learned representation. Top: half profile faces. Middle: female faces. Bottom: male faces. Note that even the fine differences between male and female faces can be expressed by banana wavelets.

Matching

Table gives the results for various matching tasks: the location of cans and faces in scenes of different complexity. In rows one to four the matching with banana wavelets is compared to the matching with bunch graphs as described in . We tested both approaches on two data sets: rows and give the results with the approach described here, rows and give the results for the bunch graph matching. The first set (set ) contains frontal faces with very controlled illumination in front of a homogeneous background (column gives information about the background; h: homogeneous, nh: non-homogeneous). The faces vary in size between and pixels (column ) and there is a modest pose variation (column ). To handle the size variation we do matching with two graphs labeled with banana wavelets and with two bunch graphs, respectively. Both approaches have comparable performance, but the matching for the banana approach is faster. More interesting are the results for a more complex task (set ). Figure shows some examples of matches and mismatches on this data set. The size variation of the faces is between and pixels. The pose and illumination are much less controlled and the background is non-homogeneous for most of the pictures; therefore this data set represents a very hard task. Rows and give the results for the matching with bananas and row four gives the results for the bunch graph matching. We see a big gap in performance: the bunch graph matching found of the faces, but the matching with banana wavelets

Match Results

Column groups: Data, Repres, Trafo, perf

object  nb  size  rot pl  rot dp  bg  nb reps  rep  mode   approx  sec sec match perf
faces                             h            ag   ban    v
faces                             h            mg   bunch
faces                             nh           ag   ban    v
faces                             nh           mg   bunch
cans                              nh           mg   ban    na
cans                              nh           mg   ban    a
cans                              nh           mg   ban    v
cans                              nh           al   ban    v

Table

Figure Representations learned with different parameters. a) Different representations caused by random initialization of the LBG algorithm. b) Variation of p, the number of features which have to be in a cluster to call its centroid significant (p = 0.3, 0.5, 0.6, 0.7, 1.0). c) Variation when the size of the training data is varied from 15 to 115 examples (15, 30, 60, 115).

Discrimination: The False Positive Test

We applied our representation to the problem of finding a face and classifying it into the classes frontal face and non-frontal. Our test set consists of pictures generated by a face finder based on color and disparity information developed by Hartmut Neven. It consists of non-frontal faces (especially hands found by the color detector, or faces rotated in plane or depth) and frontal faces. The size of the faces varied between and pixels. Our system rejected non-frontals correctly while identifying frontals correctly; frontal faces were not found and non-frontals were characterized as frontals by the system.
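The four outcome counts of such a test can be tallied with a few lines. This is a sketch under an assumption the text does not state: that a candidate is accepted as a frontal face when its best similarity value reaches a fixed threshold. The function name, the dictionary keys and the threshold are hypothetical, and the numbers in the usage note are made up for illustration.

```python
def frontal_test(scores, is_frontal, threshold):
    """Count hits, misses, correct rejections and false positives."""
    counts = {"hits": 0, "misses": 0, "correct_rejections": 0, "false_positives": 0}
    for score, frontal in zip(scores, is_frontal):
        accepted = score >= threshold  # assumed decision rule
        if frontal:
            counts["hits" if accepted else "misses"] += 1
        else:
            counts["false_positives" if accepted else "correct_rejections"] += 1
    return counts
```

For instance, the made-up scores `[0.9, 0.2, 0.4, 0.8]` with labels `[True, False, True, False]` and a threshold of 0.5 give one entry in each of the four categories.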

Figure Automatic generation of ground truth for cans. i–iv) Rotated cans on a rotating table with varying illumination (note the shadow of the can). iii) Rotated cans with rectangular grid. v) Learned representation.

Figure Learned representations for frontal faces with automatically generated ground truth. a) One of the three sequences with a person's face fixed but with different background and illumination. b) Positioning of the grid for learning a representation for the largest size, on one example for each sequence. The faces have approximately the same position and size; therefore the nodes of the fixed grid represent comparable features. c) Representation for medium size. d) Representation for small size.

Comparison with other systems

Comparison with earlier versions of Elastic Graph Matching: In earlier versions EGM was based on the concept of jets, which were used as labels of the graph. A jet gives a local description of an image at a certain pixel position. It is a vector whose coefficients represent Gabor wavelet responses of different orientation and frequency at a certain pixel position; the norm of the jet is set to one by dividing each coefficient by the norm of the array of Gabor wavelet responses. This normalization ensures the jet's independence of the average grey level intensity of an image.
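The jet normalization described above amounts to scaling the coefficient vector to unit length. A minimal sketch, assuming the coefficients are given as plain numbers (real magnitudes or complex values); `normalize_jet` is a hypothetical name:

```python
import math


def normalize_jet(coefficients):
    """Divide each coefficient by the Euclidean norm of the whole jet."""
    norm = math.sqrt(sum(abs(c) ** 2 for c in coefficients))
    if norm == 0:
        return list(coefficients)  # an all-zero jet cannot be normalized
    return [c / norm for c in coefficients]
```

Multiplying all coefficients by a common positive factor leaves the normalized jet unchanged, so `normalize_jet([3.0, 4.0])` and `normalize_jet([6.0, 8.0])` agree.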

The concept of jets is one possible formalization of our a priori principle P (locality), which enables us to handle landmarks of different localities separately. In the locality of jets we see a conceptual advantage compared to systems based on non-local features, e.g., the principal components of a whole image. As a problem of jets we criticise the mixing of features within one jet. E.g., if the top of the head occurs in a certain image in front of a textured background, the texture of this background is inherently part of the corresponding jet coefficients and its separation is a non-trivial task. For learning it is advantageous to represent both qualities separately; then the convex contour of the head can be recognized as an important feature, compared to the varying background representing an accidental feature. Our learning algorithm is able to separate important and accidental features based on the ability of banana wavelets to represent the structure of the background separately from the structure of the head. As an additional advantage compared to the application of Gabor wavelets in we remark that the concept of sparseness found a more convincing and consistent realization in our banana wavelet approach. Already the representation of an image by Gabor wavelets of different orientation and frequency leads to an increase of data: instead of one grey-level image we have the same amount of data for each pair of orientation and frequency. The expansion of the feature space can be found analogously in the visual cortex of vertebrates, and in this aspect is fundamentally different from PCA approaches, in which the data space is reduced in early stages of processing. Our banana wavelet approach enhances the data expansion of the earlier versions based on Gabor wavelets by adding the qualities curvature and size. In contrast to the feature binarity requested in our formulation of sparseness (see introduction), the similarity function in was highly dependent on the exact value of the filter responses. In the banana approach we substituted the filter response value by binary features which are present or absent.

7 Our standard settings were levels of frequency and orientations.

8 Hancock et al. suggest that for the problem of face recognition it is easier for our localized jet features to deal with varying background or symmetry changes than for a PCA-based approach.

Figure Good matches and mismatches for the set . a i–v, b i) Good matches. b ii–v) Mismatches. In b ii) the algorithm found a face-like form in the picture. In b iii) the learned form with shoulders did not fit the person's face with his hands behind the head. In b iv) the face is too small. In b v) the rotation of the head is too large.

In the idea of bunch graphs is introduced to represent the variation of a certain view of an object class. In a bunch graph a landmark is labeled by a bunch of jets, each jet representing an instance of this landmark, extracted from pictures of different persons at the corresponding landmark. E.g., in a bunch graph the left eye of a frontal face is represented as a set of jets extracted from the left eye of frontal faces of different persons. The bunch graph idea has been successfully applied to other object recognition problems, e.g., the discrimination of hand gestures and pose estimation. In each landmark was described by approximately jets, each containing complex values. Especially to represent contour edges hitting the background, a large number of jets would be necessary to cover all possible combinations of this edge and the different backgrounds. With our banana approach we can reduce the data needed to represent a landmark to a few banana wavelets. Furthermore, in and the creation of an object representation is very time consuming, because for each view of an object an object-dependent grid has to be defined and the landmarks have to be positioned manually for the pictures used to create the bunch graphs. First steps towards an automatic generation of an object representation based on the bunch graph approach are made in where important nodes and suitable jets were learned utilizing the principles P, P and P. But still a lot of manual intervention for the generation of ground truth was necessary. In contrast to these manual interventions, here we introduced methods to learn a representation autonomously. Nevertheless, there might be situations in which a landmark cannot be sufficiently described by only one combination of banana wavelets, e.g., in case of an eye with and without glasses or a chair with and without armrests. In those cases a bunch graph of combinations of banana wavelets might be more appropriate to represent this landmark. But still a much smaller amount of data than in the jet-bunch approach should be sufficient.

Comparison with other object recognition systems: There exists a large variety of object recognition systems utilizing different amounts of a priori knowledge. As one extreme we refer to systems which apply learning algorithms directly to the grey level pictures. These algorithms can be called neural, like backpropagation or RBF networks, or they are strategies of classical pattern recognition, like Bayesian estimation methods. These systems apply a very small amount of a priori knowledge, and theoretical statements about their general applicability can be made, presupposing that the number of free variables of those systems grows to infinity. Unfortunately, generalization and learning time are a fundamental restriction of those systems. The lack of a priori knowledge makes them applicable to any kind of problem, be it the prediction of time series, speech recognition or vision, but they pay for this generality with bad generalization properties and unrealistic learning time. In other words, those systems fall into the trap of the variance problem. The variance problem can be reduced by choosing a suitable preprocessing of the data, reducing the search space, but this manual intervention destroys the general applicability by leaving the choice of a suitable preprocessing to the creator of the system. As an extreme on the other side of the bias-variance dilemma there exists a large variety of systems putting a huge amount of a priori knowledge into their system. As only one example, in football players are tracked. As a priori knowledge the structure of the background, i.e., the football field with its strictly regulated lines and signs, is explicitly used. It is unthinkable to use those systems in another surrounding. Having in mind a system in which a large number of different objects can be represented and recognized in complex scenes, we see our system in the middle of the two extremes mentioned above. Our system explicitly makes use of a priori knowledge, but because of the generality of our a priori assumptions we aim to avoid a too narrow specialization of our system.

In an object recognition system based on principal component analysis (PCA) applied to the grey level picture is introduced. PCA leads to a fast reduction of data by a linear transformation. Taking the human visual system as a model of the most successful vision algorithm existing so far, there are no hints for a data compression but a lot of hints for a data spreading in the first stages of visual processing. We assume that this data spreading is needed to allow a sparse coding, which inherently has a lot of advantages for the processing of visual information (see section ). Furthermore, it seems that nonlinear transformations play an important role in visual processing.

Cootes et al. introduce an object recognition system which is also based on line segments; they learn the variation of an object class by applying PCA to different instances of an object class. The line segments are not as local as in our approach, but they describe larger regions, e.g., the contour of the face from the left ear down to the chin and up to the right ear. The representation of objects has to be defined manually. For learning the variation of an object class this representation has to be positioned manually for the different examples. We see a similarity between this and our system in the restriction to local lines to describe objects. As an advantage of our system we regard the locality and metric organization of our features, which enable an autonomous learning of our representations of objects.

Conclusion and Outlook

In section we illustrated the applicability of our system to a wide range of difficult problems in vision. For the problem of face finding we demonstrated a significant improvement of performance compared to the older system based on jet bunch graphs. Our system is able to learn an effective representation of a wide range of objects autonomously; we chose cans (artificial, rigid) and faces (natural, slightly deformable) as two very distinct examples. For faces we demonstrated that our representation is able to cover pose differences and even the fine differences between faces of males and females. We assume that any object locally describable by line segments can be represented with our system. The class of representable objects therefore covers, in principle, most of the objects humans have to deal with. Nevertheless, we have also shown that our system is far from being as powerful as the human visual system, but we argue here that it might be seen as an intermediate step towards a system with even better performance.

Among others, Biederman suggests that it is not a single feature which is important in the representation of an object but the relations of features. At the present stage of our approach only metric relations, expressed in the graph structure, are represented. Banana wavelets represent features of a certain complexity which describe suitable abstract properties (orientation, curvature). In future work we aim to utilize these abstract properties to define Gestalt relations between banana wavelets, like parallelism, symmetry or connectivity. These abstract properties of our features enable the formalization of these relations. Furthermore, sparse coding leads to a decrease of the number of possible relations: for an object description only the relations between the few present features have to be taken into account. Therefore the reduction of the space of relations and the describable abstract properties of these features make the space of those relations manageable. In the reduction of the space of relations we see an additional advantage of sparse coding not mentioned in the literature so far.
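The effect of sparseness on the space of relations can be made concrete with a small count of pairwise relations. The feature counts below are made-up illustration values, not figures from our simulations, and `pair_relations` is a hypothetical helper.

```python
from math import comb


def pair_relations(n_features):
    """Number of unordered pairs among n_features present features."""
    return comb(n_features, 2)


# Illustration with made-up numbers: a dense code with 1000 active features
# versus a sparse code with 10 active features.
dense_pairs = pair_relations(1000)   # 499500 candidate pairwise relations
sparse_pairs = pair_relations(10)    # 45 candidate pairwise relations
```

Since the count grows quadratically with the number of present features, even a moderate reduction of active features shrinks the set of candidate relations considerably.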

In our approach the correspondence problem must be solved before learning can start. In section we used motor controlled feedback to reduce the amount of manual intervention for the generation of ground truth. In future work we would like to apply a robot arm to position landmarks correctly. By moving the robot hand with an object in front of a non-homogeneous background in a surrounding with varying illumination and background, and utilizing the knowledge of the actual position of the robot hand to solve the correspondence problem, we can easily create a large number of training examples automatically. Another mechanism supporting the generation of ground truth can be the continuity of movement. Following an object which is moving continuously is a much easier task than finding an object without any a priori knowledge. Even a primitive representation of an object may solve this task and may be utilized for the generation of ground truth used as training data for the learning of a more sophisticated representation. In the jet-bunch approach is already successfully applied to the problem of tracking a moving object.

An important open question of the object recognition system described here remains its extension from the representation of different 2D views to a powerful representation of the complete three-dimensional object. In faces of different sizes and rotated in depth within a range of degrees are represented by the jet-bunch approach, applying different bunch graphs for three sizes (small, medium and large) and five poses (profile left, half profile left, frontal, half profile right and profile right). In different hand gestures are represented by bunch graphs. Analogously we could apply our banana wavelet representation by learning different representations for different sizes (as already done in some of our simulations) and different poses. In an object is simply represented as a loosely connected set of 2D views of the object. A more structured connection of 2D views is defined in . In this approach the two-dimensional views are connected by complex arrangements of line segments called geons. These geons are presupposed as a priori knowledge and mediate between 2D and 3D representation. We hope that by formalizing Gestalt relations between banana wavelets (see below) we can learn geon-like structures by looking at statistically relevant relations or, in terms of , by extracting non-accidental features.

As a further improvement we intend to introduce task-dependent metrics instead of the constant metric. In our
similarity function we simply look at the filter responses of the banana wavelets in our representation, but we
do not distinguish between the different qualities such as curvature, size or orientation. For example, to tell the top
of a head (expressed by a banana wavelet with horizontal orientation, bent downwards) from a horizontal door beam, it is
not the orientation or size that is important but only the curvature. In our current representation the door beam also
achieves high values, because its horizontal orientation also leads to a strong response of the banana wavelet
representing the top of the head, i.e., it shares the quality "horizontal orientation" with the door beam. For other
tasks, e.g., pose discrimination, the top of the head is not important at all and only the inner face features matter.
In this case not only are certain qualities of a banana wavelet, such as curvature and size, insignificant, but the
importance of the whole banana wavelet has to be reduced in the similarity function applied for this task. In [...] an
algorithm for the learning of metrics is introduced which is based on the principles P[...], P[...] and P[...]. This
algorithm is applied within the framework of the bunch-jet approach, but in future work we intend to adapt it to the
object recognition system described in this paper.
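
The idea of a task-dependent metric can be illustrated with a minimal sketch. It is not the metric-learning algorithm referred to above: the weighted Euclidean form, the quality names, the weights and all parameter values are assumptions chosen only for illustration. The sketch shows how raising the weight of the quality curvature separates the top of a head from a door beam even though both share the same orientation and size.

```python
import math

# The four qualities of a banana wavelet named in the text.
QUALITIES = ("frequency", "orientation", "curvature", "size")


def weighted_distance(b1, b2, weights):
    """Distance between two banana wavelet parameter sets under per-quality weights.

    Assumption: a weighted Euclidean form; the metric used in the paper may differ.
    """
    total = 0.0
    for quality in QUALITIES:
        diff = b1[quality] - b2[quality]
        if quality == "orientation":
            # Wrap angular differences into [-pi, pi).
            diff = (diff + math.pi) % (2.0 * math.pi) - math.pi
        total += weights[quality] * diff * diff
    return math.sqrt(total)


# Hypothetical parameter values: same orientation and size, different curvature.
head_top = {"frequency": 0.5, "orientation": 0.0, "curvature": 0.4, "size": 1.0}
door_beam = {"frequency": 0.5, "orientation": 0.0, "curvature": 0.0, "size": 1.0}

# Constant metric: every quality counts equally.
constant_metric = {q: 1.0 for q in QUALITIES}
# Hypothetical task-dependent metric that emphasizes curvature.
curvature_metric = {"frequency": 1.0, "orientation": 1.0, "curvature": 9.0, "size": 1.0}

d_constant = weighted_distance(head_top, door_beam, constant_metric)
d_task = weighted_distance(head_top, door_beam, curvature_metric)
```

Under these assumed weights the two features are three times further apart than under the constant metric. Setting all weights of a wavelet to a small value would correspond to reducing the importance of the whole wavelet for a task such as pose discrimination.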

In the long run we aim at a system equipped with a small number of mechanisms of low complexity, such as following
moving objects, shifting objects with its arm, and coordinating the camera according to the movement, in order to
initiate learning strategies that represent more complex interrelations underlying the system's experience. We think
that the system described in this paper is a very promising basis and an important step towards this challenging goal.

Acknowledgement

We would like to thank Laurenz Wiskott, Michael Pötzsch and Jan Vorbrüggen for fruitful discussions. Furthermore, we
would like to thank Thomas Maurer for solving the integral in equation [...].

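A Appendix

A.1 A Banana Wavelet Expressed by Matrix Operations

\[
B^{\vec b}(x, y) = \gamma^{\vec b} \, e^{-\frac{f^2}{2}\left(x_G^2 + y_G^2\right)} \left( e^{\, i f x_F} - DC^{\vec b} \right)
\]

with

\[
\vec x_F = \begin{pmatrix} x_F \\ y_F \end{pmatrix} = M_c\!\left(M_\alpha \vec x\right), \qquad
\vec x_G = \begin{pmatrix} x_G \\ y_G \end{pmatrix} = M_s \, \vec x_F, \qquad
\vec x = \begin{pmatrix} x \\ y \end{pmatrix}
\]

and

\[
M_\alpha = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}, \qquad
M_c\!\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x + c\,y^2 \\ y \end{pmatrix}, \qquad
M_s = \begin{pmatrix} \sigma_x^{-1} & 0 \\ 0 & (\sigma_y s)^{-1} \end{pmatrix}
\]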
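
The matrix formulation can be sketched in a few lines of plain Python. The sketch assumes the three steps rotation by the orientation alpha, bending by the curvature c, and scaling by sigma_x and sigma_y * s, followed by a Gaussian envelope and a complex carrier along the bent coordinate. The function name, the default values of sigma_x, sigma_y and gamma, and treating the DC correction as a plain argument are assumptions made for illustration.

```python
import cmath
import math


def banana_wavelet(x, y, f, alpha, c, s,
                   sigma_x=1.0, sigma_y=1.0, gamma=1.0, dc=0.0):
    """Evaluate a banana wavelet at the point (x, y).

    f: frequency, alpha: orientation, c: curvature, s: size.
    sigma_x, sigma_y, gamma and dc are illustrative defaults, not values from the paper.
    """
    # Rotation (M_alpha).
    xr = math.cos(alpha) * x + math.sin(alpha) * y
    yr = -math.sin(alpha) * x + math.cos(alpha) * y
    # Bending (M_c): shift the first coordinate by c times the squared second one.
    xf = xr + c * yr * yr
    yf = yr
    # Scaling (M_s): the size s stretches the wavelet along its length.
    xg = xf / sigma_x
    yg = yf / (sigma_y * s)
    # Gaussian envelope times the DC-corrected complex carrier.
    envelope = math.exp(-0.5 * f * f * (xg * xg + yg * yg))
    carrier = cmath.exp(1j * f * xf)
    return gamma * envelope * (carrier - dc)
```

With c = 0 the function reduces to an ordinary Gabor wavelet, which is one way to check the sketch against the statement that banana wavelets generalize Gabor wavelets.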

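A.2 A Path p Expressed by Matrix Operations

\[
\tilde t = \frac{\sigma_y s}{f} \, t, \qquad
p^{\vec b}(t) = M_\alpha^{-1} \, E \, M_c\!\begin{pmatrix} 0 \\ \tilde t \end{pmatrix}
\]

with

\[
E = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}
\]

and the other matrices as in A.1.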
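
A path of this kind can be sampled with a short sketch. It assumes that the path is the center line of the wavelet, i.e. the set of points at which the bent coordinate of the wavelet vanishes, and that the parameter t is scaled by sigma_y * s / f. Both the scaling and the function names are assumptions made for illustration.

```python
import math


def path_point(t, f, alpha, c, s, sigma_y=1.0):
    """Point on the assumed center line of a banana wavelet for parameter t."""
    # Assumed scaling of the path parameter by the extent of the envelope.
    tt = sigma_y * s * t / f
    # Center line in the rotated frame: the bend is compensated exactly.
    xr = -c * tt * tt
    yr = tt
    # Rotate back into image coordinates (inverse of the rotation by alpha).
    x = math.cos(alpha) * xr - math.sin(alpha) * yr
    y = math.sin(alpha) * xr + math.cos(alpha) * yr
    return x, y


def bent_coordinate(x, y, alpha, c):
    """First coordinate after rotation by alpha and bending by c."""
    xr = math.cos(alpha) * x + math.sin(alpha) * y
    yr = -math.sin(alpha) * x + math.cos(alpha) * y
    return xr + c * yr * yr


# Sample the path of a hypothetical wavelet at a few parameter values.
samples = [path_point(t, f=2.0, alpha=0.7, c=0.3, s=1.5) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

Every sampled point has a vanishing bent coordinate, so the carrier of the corresponding wavelet has zero phase along the whole path.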

References

H. B. Barlow, Possible principles underlying the transformation of sensory messages, in Sensory Communication, W. A. Rosenblith (Ed.), pp., MIT.

E. B. Baum, J. Moody, F. Wilczek, Internal Representation for Associative Memory, Biological Cybernetics, pp.

I. Biederman, Recognition by Components: A Theory of Human Image Understanding, Psychological Review, vol., no.

T. F. Cootes, C. J. Taylor, J. Graham, Active Shape Models: Their Training and Application, Computer Vision and Image Understanding, vol., no.

J. D. Daugman, Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression, IEEE Trans. Acoustics, Speech and Signal Processing, vol., no., pp.

A. Dobbins, S. Zucker, M. S. Cynader, Endstopped neurons in the visual cortex as a substrate for calculating curvature, Nature, vol., pp.

D. J. Field, Relations between the statistics of natural images and the response properties of cortical cells, Journal of the Optical Society of America, vol., no., pp.

D. Field, What is the Goal of Sensory Coding?, Neural Computation, vol., no., pp.

K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, Boston.

S. Geman and R. Doursat, Neural Networks and the Bias/Variance Dilemma, Neural Computation, vol., pp.

P. Hancock, V. Bruce, A. M. Burton, A comparison of two computer-based face identification systems with human perception of faces, submitted to Vision Research.

J. Hertz, A. Krogh, R. G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley.

K. Hornik, Multilayer Feedforward Networks are Universal Approximators, Neural Networks, vol., pp.

D. H. Hubel and T. N. Wiesel, Brain Mechanisms of Vision, Scientific American, vol., pp.

S. S. Intille, A. F. Bobick, Closed-World Tracking, in Proc. of the Int. Conf. Computer Vision, June.

E. Kefalea, O. Rehse, C. v. d. Malsburg, Object Classification based on Contours with Elastic Graph Matching, submitted to 3rd Int. Workshop on Visual Form, Capri, Italy.

N. Krüger, Learning Weights in Discrimination Functions using a priori Constraints, in Mustererkennung, G. Sagerer et al. (Ed.), Springer Verlag, pp.

N. Krüger, M. Pötzsch, C. v. d. Malsburg, Determination of Face Position and Pose with a Learned Representation based on Labeled Graphs, Technical Report IRINI.

N. Krüger, G. Peters, M. Pötzsch, Utilizing Sparse Coding and Metrical Organization of Features for Artificial Object Recognition, in progress.

Y. Linde, A. Buzo, R. M. Gray, An algorithm for vector quantizer design, IEEE Transactions on Communications, vol. COM, pp.

M. Lades, J. C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P. Würtz, W. Konen, Distortion Invariant Object Recognition in the Dynamic Link Architecture, IEEE Transactions on Computers, vol., no., pp.

T. Maurer, C. von der Malsburg, Tracking and Learning Graphs and Pose on Image Sequences of Faces, Proceedings of the 2nd Int. Conf. on Automatic Face and Gesture Recognition.

R. Millman and G. Parker, Elements of Differential Geometry, Prentice-Hall.

H. Neven, personal communication.

B. Olshausen and D. Field, Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?

M. W. Oram and D. I. Perrett, Modeling Visual Recognition from Neurobiological Constraints, Neural Networks, vol., pp.

G. Palm, On associative memory, Biological Cybernetics, vol., pp.

J. Pauli, Learning Operators for View Dependent Object Recognition, Proceedings of the BMVC.

G. Peters, Lernen lokaler Objektmerkmale mit Bananenwavelets, Technical Report IRINI, Diploma Thesis.

M. Pötzsch, N. Krüger, C. von der Malsburg, Improving Object Recognition by Transforming Gabor Filter Responses, Network: Computation in Neural Systems.

D. Swets and J. Weng, SHOSLIF-O: SHOSLIF for Object Recognition and Image Retrieval, Phase, Technical Report CPS, Michigan State University, Department of Computer Science.

K. Tanaka, Neuronal mechanisms of object recognition, Science, vol.

J. Triesch and C. von der Malsburg, Robust Classification of Hand Postures Against Complex Backgrounds, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Vermont.

M. Turk and A. Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, vol., no.

L. Wiskott, J. M. Fellous, N. Krüger, C. von der Malsburg, Face Recognition and Gender Determination, Proceedings of the International Workshop on Automatic Face and Gesture Recognition, Zurich.