The Variable Bandwidth and Data-Driven Scale Selection

Dorin Comaniciu Visvanathan Ramesh Peter Meer

Imaging & Visualization Department Electrical & Computer Engineering Department

Siemens Corp orate Research Rutgers University

755 College Road East, Princeton, NJ 08540 94 Brett Road, Piscataway, NJ 08855

Abstract where the d-dimensional vectors fx g representa

i

i=1:::n

random sample from some unknown density f and the

Wepresenttwo solutions for the scale selection prob-

, K , is taken to be a radially symmetric, non-

lem in . The rst one is completely non-

negative function centered at zero and integrating to

parametric and is based on the the adaptive estimation

one. The terminology xedbandwidth is due to the fact

of the normalized density gradient. Employing the sam-

d

that h is held constant across x 2 R . As a result, the

ple p oint estimator, we de ne the Variable Bandwidth

xed bandwidth pro cedure (1) estimates the densityat

Mean Shift, prove its convergence, and show its sup eri-

eachpoint x by taking the average of identically scaled

orityover the xed bandwidth pro cedure. The second

kernels centered at each of the data p oints.

technique has a semiparametric nature and imp oses a

For p ointwise estimation, the classical measure of the

lo cal structure on the data to extract reliable scale in-

^

closeness of the estimator f to its target value f is the

formation. The lo cal scale of the underlying densityis

mean squared error (MSE), equal to the sum of the

taken as the bandwidth which maximizes the magni-

variance and squared bias

tude of the normalized mean shift vector. Both estima-

i h

2

tors provide practical to ols for autonomous image and

^

f (x ) f (x) MSE(x ) = E

quasi real-time video analysis and several examples are

i  h  

2

shown to illustrate their e ectiveness.

^ ^

f (x ) : (2) f (x ) + Bias = Var

1 Motivation for Variable Bandwidth

Using the multivariate form of the Taylor theorem, the

The ecacy of Mean Shift analysis has b een demon-

bias and the variance are approximated by [20, p.97]

strated in computer vision problems such as tracking

1

2

and segmentation in [5, 6]. However, one of the limi-

h  (K )f (x ) (3) Bias(x) 

2

2

tations of the mean shift pro cedure as de ned in these

and

1 d

pap ers is that it involves the sp eci cation of a scale

Var(x )  n h R (K )f (x) ; (4)

R R

parameter. While results obtained app ear satisfactory,

2

K (z)dz and R (K )= K (z)dz are where  (K )= z

2

1

when the lo cal characteristics of the feature space di ers

kernel dep endent constants, z is the rst comp onent

1

signi cantly across data, it is dicult to nd an opti-

of the vector z, and  is the Laplace op erator.

mal global bandwidth for the mean shift pro cedure. In

The tradeo of bias versus variance can b e observed

this pap er we address the issue of lo cally adapting the

2

in (3) and (4). The bias is prop ortional to h , which

bandwidth. We also study an alternative approachfor

means that smaller bandwidths give a less biased es-

data-driven scale selection which imp oses a lo cal struc-

timator. However, decreasing h implies an increase in

ture on the data. The prop osed solutions are tested in

1 d

the variance which is prop ortional to n h . Thus for

the framework of quasi real-time video analysis.

a xed bandwidth estimator we should cho ose h that

We review rst the intrinsic limitations of the xed

achieves an optimal compromise b etween the bias and

d

bandwidth density estimation metho ds. Then, two of

variance over all x 2 R , i.e., minimizes the mean inte-

the most p opular variable bandwidth estimators, the

grated squared error (MISE)

Z

bal loon and the sample point,areintro duced and their

 

2

^

advantages discussed. We conclude the section byshow-

f (x ) f (x ) dx : (5) MISE(x )=E

ing that, with some precautions, the p erformance of the

Nevertheless, the resulting bandwidth formula (see [17,

sample p oint estimator is sup erior to b oth xed band-

p.85], [20, p.98]) is of little practical use, since it de-

width and ballo on estimators.

pends on the Laplacian of the unknown density being

1.1 Fixed Bandwidth Density Estimation

estimated.

The multivariate xed bandwidth kernel density es-

The b est of the currently available data-driven meth-

timate is de ned by

o ds for bandwidth selection seems to b e the plug-in rule

 

n

X

x x 1

i

[15], which was proven to b e sup erior to least squares

^

K : (1) f (x)=

d

nh h

cross validation and biased cross-validation [11], [16,

i=1

p.46]. A practical one dimensional algorithm based on the sample p oint estimators (7) remain unchanged [8].

this metho d is describ ed in the App endix. For the mul- Various authors [16, p.56], [17, p.101] remarked that

tivariate case, see [20, p.108]. the metho d is insensitive to the ne detail of the pilot

Note that these data-driven bandwidth selectors estimate. The only provision that should b e taken is to

work well for multimo dal data, their only assumption b ound the pilot densityaway from zero.

b eing a certain smo othness in the underlying density. The nal estimate (7) is however in uenced bythe

However, the xed bandwidth a ects the estimation choice of the prop ortionality constant , which divides

p erformance, by undersmo othing the tails and over- the range of densityvalues into low and high densities.

~

smo othing the p eaks of the density. The p erformance When the lo cal density is low, i.e., f (x ) < , h(x )

i i

also decreases when the data exhibits lo cal scale varia- increases relative to h implying more smo othing for

0

~

tions. the p oint x . For data p oints that verify f (x ) >,the

i i

bandwidth b ecomes narrower.

1.2 Ballo on and Sample Point Estimators

A go o d initial choice [17, p.101] is to take  as the ge-

n o

According to expression (1), the bandwidth h can

~

ometric mean of f (x ) . Our exp eriments have

i

be varied in two ways. First, by selecting a di erent

i=1:::n

bandwidth h = h(x ) for each estimation point x , one

shown that for sup erior results, a certain degree of tun-

can de ne the bal loon density estimator

ing is required for . Nevertheless, the sample point

 

n

X

estimator proved to b e almost all the time much b etter

1 x x

i

^

f (x )= K : (6)

1

than the xed bandwidth estimator.

d

nh(x) h(x)

i=1

2 Variable Bandwidth Mean Shift

In this case, the estimate of f at x is the average of

identically scaled kernels centered at eachdatapoint.

Weshow next that starting from the sample p ointes-

Second, by selecting a di erent bandwidth h = h(x )

timator (7) an adaptive estimator of the density's nor-

i

for each data point x we obtain the sample point density

malized gradient can be de ned. The new estimator,

i

estimator

which asso ciates to each data p oint a di erently scaled

 

n

X

kernel, is the basic step for an iterative pro cedure that

1 1 x x

i

^

f (x)= K : (7)

2

weprovetoconverge to a lo cal mo de of the underlying

d

n h(x ) h(x )

i i

i=1

density, when the kernel ob eys some mild constraints.

for which the estimate of f at x is the average of di er-

We called the new pro cedure the Variable Bandwidth

ently scaled kernels centered at each data p oint.

Mean Shift. Due to its excellent statistical prop erties,

While the ballo on estimator has more intuitive ap-

weanticipate the extensive use of the adaptive estima-

p eal, its p erformance improvementover the xed band-

tor by vision applications that require minimal human

width estimator is insigni cant. When the bandwidth

intervention.

h(x)ischosen as a function of the k -th nearest neigh-

2.1 De nitions

2

bor, the bias and variance are still prop ortional to h

1 d

To simplify notations we pro ceed as in [6] by in-

and n h , resp ectively [8]. In addition, the ballo on

tro ducing rst the pro le of a kernel K as a function

estimators usually fail to integrate to one.

2

k : [0; 1) ! R such that K (x) = k (kxk ). We also

The sample p oint estimators, on the other hand, are

denote h  h(x ) for all i =1:::n. Then, the sample

themselves densities, being non-negativeand integrat-

i i

point estimator (7) can b e written as

ing to one. Their most attractive prop erty is that a par-

!

n

2

X

ticular choice of h(x ) reduces considerably the bias. In-

x x 1 1

i

i

^

k ; (9) f (x)=

K

d

deed, when h(x )istaken to b e recipro cal to the square

i

n h

h

i

i

i=1

ro ot of f (x )

i

where the subscript K indicates that the estimator is

 

1=2



based on kernel K .

h(x )=h (8)

i 0

A natural estimator of the gradientof f is the gra-

f (x )

i

^

dientoff (x )

4

K

the bias b ecomes prop ortional to h , while the variance

!

n

2

1 d

X

remains unchanged, prop ortional to n h [1, 8]. In

2 x x x x

i i

0

^

^

rf (x )  rf (x)= k

K K

d+2

(8), h represents a xed bandwidth and  is a prop or-

0

n h

h

i

i

i=1

tionality constant.

!

n

2

X

x x 2 x x

Since f (x ) is unknown it has to b e estimated from

i i

i

= g

d+2

the data. The practical approachis to use one of the

n h

h

i

i

i=1

" !#

metho ds describ ed in Section 1.1 to nd h and an ini-

0

n

2

X

1 x x ~ 2

i

tial estimate (called pilot)off denoted by f . Note that

 g =

d+2

~

n h

by using f instead of f in (8), the nice prop erties of

h

i

i

i=1

 

2 3

2

P

n

2.2 Prop erties of the Adaptive Mean Shift

x xx

i i

g

d+2

i=1

h

i

h

6 7

i

6 7

Equation (12) shows an attractive b ehavior of the

 

 x ; (10)

4 5

2

P

n

xx

1

adaptive estimator. The data p oints lying in large den-

i

g

d+2

i=1

h

i

h

i

sity regions a ect a narrower neighborhood since the

kernel bandwidth h is smaller, but are given a larger

i

where we denoted

d+2

imp ortance, due to the weight 1=h . By contrast,

0

i

g (x)=k (x) ; (11)

the points that corresp ond to the tails of the under-

and assumed that the derivative of pro le k exists for

lying density are smo othed more and receive a smaller

all x 2 [0; 1), except for a nite set of p oints.

weight. The extreme p oints (outliers) receivevery small

The last bracket in (10) represents the variable band-

weights, being thus automatically discarded. Recall

width mean shift vector

 

that the xed bandwidth mean shift [5, 6] asso ciates

2

P

n

x x x

i i

the same kernel for eachdatapoint.

g

d+2

i=1

h

i

h

i

The most imp ortant prop erty of the adaptive esti-

 

M (x )  x (12)

v

2

P

n

mator is the convergence asso ciated with its rep etitive

x x

1

i

g

d+2

i=1

h

i

h

computation. In other words, if we de ne the mean shift

i

procedure recursively as the evaluation of the mean shift To see the signi cance of expression (12), we de ne

vector M (x )followed by the translation of the kernel

rst the kernel G as

v

2

G by M (x), this pro cedure leads to a stationary p oint

v

G(x )=Cg(kxk ); (13)

(zero gradient) of the underlying density. More sp eci -

where C is a normalization constant that forces G to

cally,we will show that the p ointofconvergence repre-

integrate to one.

sents a stationary p oint of the sample p oint estimator

Then, by employing (8), the term that multiplies the

(9). Thus, the sup erior p erformance of the sample p oint

mean shift vector in (10) can b e written as

estimator translates into sup erior p erformance for the

!# " " #

P

n

2 n

~

X

f (x )

x x 2 1 2

adaptive mean shift.

i

i

i=1

^



g = f (x )

G

2

d+2

Wedenoteby y the sequence of successive

n h C nh

h

j

i

0

j =1;2:::

i

i=1

lo cations of the kernel G, where

 

(14)

2

P

y x

i

n

x j

i

g

d+2

i=1

h

i

 

h

where

i

2

P

 

y = ; j =1; 2;::: (17)

n

xx

1

i

j +1

~

2

f (x ) g

P

i

y x

d

i

i=1 n

h

j

h 1

i

i

g

d+2

^

i=1

h

i

h

(15) f (x )  C

P

G

i

n

~

f (x )

i

i=1

is the weighted mean at y computed with kernel G

j

d+2

is nonnegative and integrates to one, representing an

and weights 1=h ,andy is the center of the initial

1

i

estimate of the density of the data p oints weighted by

kernel. The density estimates computed with kernel K

~

the pilot densityvalues f (x ).

i

in the p oints (17) are

n o n o

Finally,by using (10), (12), and (14) it results that

^ ^ ^

f = f (j )  f (y ) : (18)

K K K

2

j

^

 rf (x ) h

j =1;2::: j =1;2:::

K

0

: (16) M (x)=

P

v

n

Weshow in App endix that if the kernel K has a con- 1

~

^

2=C

f (x ) n

f (x)

i

G

i=1

vex and monotonic decreasing pro le and the kernel G

Equation (16) represents a generalization of equation

is de ned according to (11) and (13), the sequences (17)

(13) derived in [6] for the xed bandwidth mean shift.

and (18) are convergent. This means that the adap-

It shows that the adaptive bandwidth mean shift is an

tive mean shift pro cedure initialized at a given lo ca-

estimator of the normalized gradient of the underlying

tion, converges at a nearbypoint where the estimator

density.

(9) has zero gradient. In addition, since the mo des of

The prop ortionality constant, however, dep ends on

the density are p oints of zero gradient, it results that

the value of . When  is increased, the norm of the

the convergence p oint is a mo de candidate.

mean shift vector also increases. On the other hand,

The advantage of using the mean shift rather than

a small value for  implies a small kM k. Due to this

v

the direct computation of (9) followed by a search for lo-

external variability of the mean shift norm, the conver-

calmaximaistwofold. First, the overall computational

gence prop ertyof an iterative pro cedure based on the

complexity of the mean shift is much smaller than that

variable bandwidth mean shift is remarkable. Note also

of the direct metho d. The direct search for maxima

that when  is taken equal to the arithmetic mean of

n o

requires a numb er of density function evaluations that

~

f (x ) , the prop ortionality constant b ecomes as

i

increases exp onentially with the space dimension. Sec-

i=1:::n

in the xed bandwidth case.

ond, for many applications (see for example [6]) weonly

40

need to knowthe mo de asso ciated with a reduced set

35

of data points. In this case, the mean shift pro cedure

30

b ecomes a natural pro cess that follows the trail to the

25 lo cal mo de.

20

The iterative pro cedure for mo de detection based on Histogram Value

15 the variable bandwidth mean shift is summarized be- w. lo 10

5

Variable Bandwidth Mean Shift Algorithm

0

0 50 100 150 200 250

en the data p oints fx g :

Giv x

i

i=1:::n

(a)

Derive a xed bandwidth h and a pilot estimate

1. 250

0

~

f using the plug-in rule (see App endix for the one

200

dimensional plug-in rule).

P

n

1

~

2. Compute log  = n log f (x ).

i =1

i 150

3. For eachdatapoint x compute its adaptiveband-

i

h i

1=2

Number of Points 100

~

width h(x )=h =f (x ) .

i 0 i

50

4. Initialize y with the lo cation of interest and com-

1

pute iteratively (17) till convergence. The conver- 0 0 50 100 150 200 250

x

gence point is a p oint of zero gradient, hence, a (b)

mo de candidate. 250

200

2.3 Performance Comparison

e compared the variable and xed bandwidth mean

W 150

shift algorithms for various multimo dal data sets that

Number of Points 100

exhibited also scale variations. The xed bandwidth

pro cedure was run with a bandwidth h derived from 0

50

the plug-in rule given in App endix.

The plug-in rule was develop ed for density estima- 0 0 50 100 150 200 250

x

tion [15] and since here we are concerned with density

(c)

gradient estimation it is recommended [20, p.49] to use

a larger bandwidth to comp ensate for the inherently in-

Figure 1: A mixture of 200 data points from each

N(5,2), N(17,4), N(37,8), N(70,16), N(145,32). The

creased sensitivity of the estimation pro cess. Wehave

continuous line is a scaled version of the density es-

mo di ed the plug-in rule byhalvening the contribution

timate. The detected mo des are marked prop ortional

of the variance term. This change was maintained for all

to the numb er of data p oints that converged to them.

the exp eriments presented in this pap er. The constant

(a) Histogram of the data. (b) Variable Bandwidth. (c)

 of the adaptive pro cedure was kept as the geometric

Fixed Bandwidth.

n o

~

mean of f (x ) .

i

i=1:::n

scale selection to derive an initial bandwidth h . The

0

As one can see from Figures 1 and 2 the xed band-

criterion for bandwidth selection was a global measure

width mean shift resulted in go o d p erformance for the

(MISE), hence, h achieved an optimal compromise b e-

0

lo cations where the lo cal scale was in the medium range.

tween the integrated squared bias and the integrated

However, the very narrow p eaks were fused, while the

variance. Then, we mo di ed this bandwidth for each

tails were broken into pieces. On the other hand, the

data p oint, according to the lo cal density.

adaptive algorithm showed sup erior p erformance, by

The main problem with this approach is that for

cho osing a prop er bandwidth for eachdatapoint.

multidimensional multimo dal data, it is very dicult

to determine the right h from the sample p oints and

3 Semiparametric Scale Selection

0

many of the practical issues are yet to b e resolved [20,

3.1 Motivation

p.108]. As a consequence, most of practical algorithms

use empirical bandwidth selection rules that are less The previous two sections followed purely nonpara-

dep endentoreven indep endent from the sample data. metric ideas, since no formal structure was assumed

This implies a decrease in their p erformance when the ab out the data. Implying only a certain smo othness of

input statistics is nonstationary, as it happ ens most of the underlying densityweusedavailable algorithms for

60

invariant measure of the go o dness of t is needed.

50

Fortunately, a simple solution exists. It is based on

the following theorem, valid when the number of avail-

40 able samples is large.

30

Histogram Value

Theorem 1 If the true density f is normal with pa-

20

2

rameters  and  =  I , and the xedbandwidth mean

10

shift is computed with a spherical normal kernel of band-

width h , then, the bandwidth normalized norm of the 0 0 0 50 100 150 200 250

x

mean shift vector is maximized when h   .

0 (a)

450

Recall that the xed bandwidth mean shift

400 Pro of

ector computed with kernel G of bandwidth h can b e

350 v 0

300 written as

2

^

h rf (x ) K

250 0

M (x)= : (19)

^

2=C

(x)

200 f G Number of Points

150

Since the true density f is normal with covariance

100 2

^

matrix  =  I it follows that the mean of f (x),

G

i h

50

2 2

^

f (x)  (x ;  + h ) is also a normal surface with E G 0 0 0 50 100 150 200 250

x

2 2

covariance ( + h )I . Likewise, by taking into account

(b)

0

h i

400

2 2

^

(11) wehave E rf (x ) = r(x ;  + h ).

K 0

350

By assuming that the large sample approximation is

300

valid (see [18]) it results that

250

h i

200 ^

2 2 2 2 E rf (x)

K

h h r(x ;  + h )

0 0 0

Number of Points

i h

plim M (x) =

150 =

2

2

2=C 2=C (x ;  + h )

^

0

f (x ) E

100 G

50 2

1 h

0

(x ); (20) =

0 2

0 50 100 150 200 250 2

=C  + h

x 2

0

(c)

where plim denotes probability limit with h held con-

0

Figure 2: A mixture of 200 data p oints from each

stant. This is equivalent to assuming the sample size

exp(3)+25, 2chi2(4)+50, lognormal(2,1)+90, lognor-

suciently large to make the variances of the means

mal(2,1)+90, 190-lognormal(3,1). The continuous line

relatively small.

is a scaled version of the density estimate. The detected

Finally, the norm of the bandwidth normalized mean

mo des are marked prop ortional to the numb er of data

points that converged to them. (a) Histogram of the

shift is

data. (b) Variable Bandwidth. (c) Fixed Bandwidth.

plim M (x) 1 h

0

= kx k ; (21)

2

2

h 2=C  + h

0

0

the time in vision tasks.

a quantity that has a unique p ositive maximum at

3.2 Normalized Mean Shift Based Scale

h =  .

0

Selection

We prop ose in this section a di erent approachfor

Theorem 1 leads to a very simple and accurate scale

bandwidth selection. The idea is to imp ose a lo cal

selection rule: the underlying density has the lo cal scale

structure on the data by assuming that local ly the

equal to the bandwidth that maximizes the norm of the

underlying density is spherical normal with unknown

normalized mean shift vector. We exp ect that a simi-

2

mean  and covariance matrix  =  I .

lar prop erty holds in the case of anisotropic covariance

At a rst lo ok, the task of nding  and  for each

matrices.

data point seems to be very dicult. To lo cally t

3.3 Scale Selection Exp eriments

a normal to the multivariate data one needs a priori

knowledge of the neighborhood size in which the un- Figure 3a shows a data set of size n = 2000, drawn

known parameters are to b e estimated. If the estima- from N(4,10). The bandwidth normalized mean shift is

tion is p erformed for several neighb orho o d sizes, a scale represented in Figure 3b as a function of scale. Observe

40 1.6

ferently for each space. A xed bandwidth was rst

35 1.4

derived for each color comp onent, based on the one di-

30 1.2

mensional plug-in rule. Then, the pilot density has b een

25 1

h pixel, and the adaptive color band-

20 0.8 computed for eac

widths were determined according to (8) for each pixel.

Histogram value 15 0.6

Norm of Mean Shift Vector

pro cess has b een rep eated for di erent scales of 10 0.4 This

5 0.2

the spatial kernel. Finally, the spatial scale has b een

0 0

for each pixel according to the semiparamet- 0 5 10 15 20 25 0 5 10 15 20 25 30 35 selected

x Scale

ric rule. As a result, each pixel received a unique color

(a) (b)

bandwidth for color and a unique spatial bandwidth.

To obtain the segmented image, the adaptive mean

Figure 3: Semiparametric scale selection. (a) Input

shift pro cedure has b een applied in the joint domain.

data. N(10,4), n = 2000. (b) Normalized mean shift

as a function of scale for the p oints with p ositive mean

The blobs were identi ed as groups of pixels that had

shift. The upp er curves corresp ond to the points lo-

the same connected convergence p oints (see [5]). The

cated far from the mean. The curves are maximized for

algorithm is summarized b elow.

h =4.

0

Adaptive Mean Shift Segmentation

Given the image pixels fx ; I 1 ;I2 ;I3 g , and a

i i i i

i=1:::n

the accurate lo cal scale indication by the maxima of the

range of spatial scales r :::r :

1 S

curves. The same accurate results were obtained for two

and three dimensions.

1. Derive h , h , h , a xed bandwidth for eachcolor

1 2 3

feature.

4 Video Data Analysis

2. For the spatial scale r , compute the adaptive

1

bandwidths h (x ; r ), h (x ; r ), h (x ; r ) and

1 i 1 2 i 1 3 i 1

A fundamental task in video data analysis is to de-

determine the magnitude of the normalized mean

tect blobs represented by collections of pixels that are

shift vector M (x ; r ).

i 1

coherent in spatial, range, and time domain [21]. The

two dimensional space of the lattice is known as the

3. Rep eat Step 2. for the spatial scales r :::r .

2 S

spatial domain while the graylevel, color, sp ectral, or

4. Select for each pixel a spatial scale r according texture information is represented in the range domain.

j

to the semiparametric rule. Select also the color

Based on the two new estimators intro duced in Sec-

bandwidths h (x ; r ), h (x ; r ), and h (x ; r ).

tions 2 and 3 we present next an autonomous technique

1 i j 2 i j 3 i j

that segment a video frame into representative blobs de-

5. Run the adaptive mean shift pro cedure, and iden-

tected in the spatial and color domains. The technique

tify the blobs as groups of pixels having the same

can b e naturally extended to incorp orate time informa-

connected convergence p oints.

tion, this b eing one of the sub jects of our currentwork.

We selected the orthogonal features I 1= (R + G +

Although the adaptive algorithm has an in-

B )=3, I 2= (R B )=2 and I 3= (2G R B )=4 from [10]

creased complexity, its careful software implementa-

to represent the color information. Due to the orthogo-

tion with three spatial scales (S=3) runs at ab out 8

nality of the features, the one dimensional plug-in rule

frames/second on a Dual Pentium I I I at 900MHz for a

for bandwidth selection can be applied indep endently

video frame size of 320  240 pixels. Figure4shows four

for each color co ordinate.

examples demonstrating the segmentation of color im-

As in [5], the idea is to apply the mean shift pro ce-

age data with very di erent statistics. Figure 5 shows

dure for the data p oints in the joint spatial-range do-

the stability of the algorithm in segmenting a color se-

main. Each data p oint b ecomes asso ciated to apoint

quence obtained by panning the camera. The identi ed

of convergence which represents the lo cal mo de of the

blobs were maintained very stable, although the scene

density in a d = 2+3 dimensional space (2 spatial

data changed gradually along with the camera gain.

comp onents and 3 color comp onents).

5 Discussion

We employed a spherical kernel for the spatial do-

main and a pro duct kernel for the three color comp o-

The most attractive prop erty of the techniques pro-

nents. The eciency of the pro duct kernel is known to

posed in this pap er is the automatic bandwidth selec-

be very close to that of spherical kernels [20, p.104].

tion in b oth color and spatial domain.

Due to the di erent nature of the two spaces, the The reason we used two di erent bandwidth selec-

problem of bandwidth selection has b een treated dif- tion techniques for the two spaces was not arbitrary.

While the color information can b e collected across the

image, allowing the computation of robust initial band-

width for color, the spatial prop erties of the blobs vary

drastically across the image, requiring lo cal decisions

for spatial scale selection.

The pro cess de ned by the mean shift technique in

the color domain resembles bilateral ltering [19] (see

also [3] for a discussion on the link b etween bilateral l-

tering, anisotropic di usion [12], and adaptive smo oth-

ing [13]). Due to the weighting of the data, the adap-

tive bandwidth mean shift is more related to robust

anisotropic di usion [4].

In the spatial domain, the mean shift is close to mul-

tiscale techniques such as [2], and the semiparametric

scale selection rule resembles in principle to those de-

velop ed in [7,9].

The uni cation of all these ideas is an interesting

sub ject for further research.

Figure 5: Sequence of segmented images used to test

the stability of our algorithm. Frame size: 320  240

pixels.

1=7 5=7

^ ^

5. ^ (h)=1:357fS (a)=T (b)g h .

2 D D

6. Solve the equation in h

2 1=5 1=5

^

[R (K )=f (K )S (^ (h))g] n h =0;

D 2

2

where  (K ) and R (K ) are de ned in (3) and (4),

2

resp ectively.

Figure 4: Segmentation examples. Frame size: 320 

Convergence Pro of for Variable Bandwidth

240 pixels.

Mean Shift

^

Since n is nite the sequence f is b ounded, there-

K

APPENDIX

^

fore, it is sucienttoshow that f is strictly monotonic

K

^ ^

One dimensional plug-in rule [15] increasing, i.e., if y 6= y then f (j ) < f (j +1),

K K

j j +1

for all j =1; 2 :::.

1. Compute ^ = Q Q , the sample interquartile

3 1

By assuming without loss of generality that y = 0

j

range.

we write

1=7 1=9

2. Compute a =0:920 n ^ , b =0:912 n ^ .

^ ^

f (j +1) f (j )=

K K

! !# "

n n

n

2 2

X X

X

y x

x 1 1

i

vi 1

i j +1

1 7

^

3. T (b)=fn(n 1)g b  fb (x x )g

k :(B.1) = k

D i j

d

n h h

h

i i

i

i=1 j =1

i=1

vi

where  is the sixth derivative of the normal ker-

The convexity of the pro le k implies that

nel (see [20] [p.177]).

0

n k (x )  k (x )+k (x )(x x ) (B.2) n

2 1 1 2 1

X X

iv 1

1 5

^

4. S (a)=fn(n 1)g a  fa (x x )g,

D i j

0

for all x ;x 2 [0; 1), x 6= x , and since k = g ,the

1 2 1 2

i=1 j =1

inequality (B.2) b ecomes

iv

where  is the forth derivative of the normal ker-

nel.

k (x ) k (x )  g (x )(x x ): (B.3)

2 1 1 1 2

[3] D. Barash, \Bilateral Filtering and Anisotropic Di u- Using now (B.1) and (B.3) wehave

sion: Towards a Uni ed Viewp oint," Hew lett-Packard

^ ^

f (j +1) f (j ) 

HPL-2000-18(R.1).Available at http: www.hpl.hp.com.

K K

!

n

2

[4] M.J. Black, G. Sapiro, D.H. Marimont, D. Heeger,,

X

 

1 1 x

i

2 2

kx k ky x k g 

\Robust Anisotropic Di usion," Image Processing,

i i

j +1

d+2

n h

h

i

i

i=1

7(3):421{432, 1998.

!

n

2

X

[5] D. Comaniciu, P. Meer, \Mean Shift Analysis and Ap-

 

1 1 x

i

> 2

= 2y x ky k g

i

plications," IEEE Int'l Conf. Comp. Vis., Kerkyra,

j +1 j +1

d+2

n h

h

i

i

i=1

Greece, 1197{1203, 1999.

!

n

2

X

[6] D. Comaniciu, V. Ramesh, P. Meer, \Real-Time Track-

x x 1

i i

>

g 2y =

j +1

ing of Non-Rigid Ob jects using Mean Shift," IEEE

d+2

n h

h

i

i

i=1

Conf. Comp. Vis. Patt. Recogn., Hilton Head, South

!

n

2

X

Carolina, Vol. 2, 142{149, 2000.

1 1 x

i

2

ky k g

j +1

[7] J. Elder, S.W. Zucker, \Lo cal Scale Control for Edge

d+2

n h

h

i

i

i=1

Detection and Blur Estimation," IEEE Trans. Pattern

(B.4)

Anal. Machine Intel l., 20(7):699{716, 1998.

[8] P. Hall, T.C. Hui, J.S. Marron, \Improved Variable

and by employing (17) it results that

Window Kernel Estimates of Probability Densities,"

!

n

2

X

The Annals of Statistics, 23(1):1{10, 1995.

1 x 1

i

2

^ ^

f (j +1) f (j )  : ky k g

K K

j +1

d+2

[9] T. Lindeb erg, \Edge Detection and Ridge Detection

n h h

i

i=1

with Automatic Scale Selection," Int. J. Comp. Vision.,

(B.5)

30(2):117{154, 1998.

0

Since k is monotonic decreasing we have k (x) 

[10] Y. Ohta, T. Kanade, T. Sakai, \Color Information for

g (x)  0 for all x 2 [0; 1). The sum

 

Region Segmentation," Computer Graphics and Image

2

P

n

x

1

i

Processing, 13:222{241, 1980.

g is strictly p ositive, since it was

d+2

i=1

h

h

i

[11] B. Park, J.S. Marron, \Comparison of Data-

assumed to be nonzero in the de nition of the mean

Driven Bandwidth Selectors," J. Am. Statist. Assoc.,

shift vector (12). Thus, as long as y 6= y = 0,the

j +1 j

85(409):66{72, 1990.

^

right term of (B.5) is strictly p ositive, i.e., f (j +1)

K

[12] P. Perona, J. Malik, \Scale-Space and Edge Detec-

^ ^

f (j ) > 0. Hence, the sequence f is convergent.

tion Using Anisotropic Di usion," IEEE Trans. Pattern

K K



Toshow the convergence of the sequence y

Anal. Machine Intel l., 12(7):629{639, 1990.

j

j =1;2:::

[13] P. Saint-Marc, J.S. Chen, G.G. Medioni, \Adaptive

we rewrite (B.5) but without assuming that y = 0.

j

Smo othing: A General Tool for Early Vision", IEEE

After some algebra it results that

!

Trans. PAMI, 13(7):514-529, 1991.

n

2

X

y x

1 1

i

j

2

^ ^

[14] D.W. Scott, Multivariate Density Estimation, New

f (j +1)f (j )  g ky y k

K K

j +1 j

d+2

n h

h

i

York: Wiley, 1992.

i

i=1

(B.6)

[15] S.J. Sheather, M.C. Jones, \A Reliable Data-based

^ ^

Since f (j +1) f (j ) converges to zero, (B.6) implies

Bandwidth Selection Metho d for Kernel Density Esti-

K K



mation," J. R. Statist. Soc. B, 53(3):683{690, 1991.

that ky y k also converges to zero, i.e., y

j +1 j j

j =1;2:::

[16] J.S. Simono , Smoothing Methods in Statistics, New

is a Cauchy sequence. But anyCauchy sequence is con-



York: Springer-Verlag, 1996.

vergentin the Euclidean space, therefore, y

j

j =1;2:::

[17] B.W. Silverman, Density Estimation for Statistics and

is convergent.

Data Analysis, New York: Chapman and Hall, 1986.

Acknowledgment

[18] T.M Sto cker, \Smo othing Bias in Density Derivative

Estimation," American Stat. Assoc., 88(423):855{863,

Peter Meer was supp orted by the NSF under the

1993.

grant IRI 99-87695.

[19] C. Tomasi, R. Manduchi, \Bilateral Filtering for Gray

and Color Images", Int'l Conf. Comp. Vis., Bombay,

References

India, 839{846, 1998.

[1] I.S. Abramson, \On Bandwidth Variation in Kernel Es-

[20] M.P. Wand, M.C. Jones, Kernel Smoothing, London:

timates - A Square Ro ot Law," The Annals of Statistics,

Chapman & Hall, 1995.

10(4):1217{1223, 1982.

[21] C. Wren, A. Azarbayejani, T. Darrell, A. Pentland,

[2] N. Ahuja, \A Transform for Multiscale Image Segmen-

\P nder: Real-Time Tracking of the Human Bo dy,"

tation byIntegrated Edge and Region Detection," IEEE

IEEE Trans. Pattern Analysis Machine Intel l., 19:780{

Trans. Pattern Anal. Machine Intel l., 18:1211{1235,

785, 1997. 1996.