The Variable Bandwidth Mean Shift and Data-Driven Scale Selection
Total Page:16
File Type:pdf, Size:1020Kb
The Variable Bandwidth Mean Shift and Data-Driven Scale Selection Dorin Comaniciu Visvanathan Ramesh Peter Meer Imaging & Visualization Department Electrical & Computer Engineering Department Siemens Corp orate Research Rutgers University 755 College Road East, Princeton, NJ 08540 94 Brett Road, Piscataway, NJ 08855 Abstract where the d-dimensional vectors fx g representa i i=1:::n random sample from some unknown density f and the Wepresenttwo solutions for the scale selection prob- kernel, K , is taken to be a radially symmetric, non- lem in computer vision. The rst one is completely non- negative function centered at zero and integrating to parametric and is based on the the adaptive estimation one. The terminology xedbandwidth is due to the fact of the normalized density gradient. Employing the sam- d that h is held constant across x 2 R . As a result, the ple p oint estimator, we de ne the Variable Bandwidth xed bandwidth pro cedure (1) estimates the densityat Mean Shift, prove its convergence, and show its sup eri- eachpoint x by taking the average of identically scaled orityover the xed bandwidth pro cedure. The second kernels centered at each of the data p oints. technique has a semiparametric nature and imp oses a For p ointwise estimation, the classical measure of the lo cal structure on the data to extract reliable scale in- ^ closeness of the estimator f to its target value f is the formation. The lo cal scale of the underlying densityis mean squared error (MSE), equal to the sum of the taken as the bandwidth which maximizes the magni- variance and squared bias tude of the normalized mean shift vector. Both estima- i h 2 tors provide practical to ols for autonomous image and ^ f (x ) f (x) MSE(x ) = E quasi real-time video analysis and several examples are i h 2 shown to illustrate their e ectiveness. ^ ^ f (x ) : (2) f (x ) + Bias = Var 1 Motivation for Variable Bandwidth Using the multivariate form of the Taylor theorem, the The ecacy of Mean Shift analysis has b een demon- bias and the variance are approximated by [20, p.97] strated in computer vision problems such as tracking 1 2 and segmentation in [5, 6]. However, one of the limi- h (K )f (x ) (3) Bias(x) 2 2 tations of the mean shift pro cedure as de ned in these and 1 d pap ers is that it involves the sp eci cation of a scale Var(x ) n h R (K )f (x) ; (4) R R parameter. While results obtained app ear satisfactory, 2 K (z)dz and R (K )= K (z)dz are where (K )= z 2 1 when the lo cal characteristics of the feature space di ers kernel dep endent constants, z is the rst comp onent 1 signi cantly across data, it is dicult to nd an opti- of the vector z, and is the Laplace op erator. mal global bandwidth for the mean shift pro cedure. In The tradeo of bias versus variance can b e observed this pap er we address the issue of lo cally adapting the 2 in (3) and (4). The bias is prop ortional to h , which bandwidth. We also study an alternative approachfor means that smaller bandwidths give a less biased es- data-driven scale selection which imp oses a lo cal struc- timator. However, decreasing h implies an increase in ture on the data. The prop osed solutions are tested in 1 d the variance which is prop ortional to n h . Thus for the framework of quasi real-time video analysis. a xed bandwidth estimator we should cho ose h that We review rst the intrinsic limitations of the xed achieves an optimal compromise b etween the bias and d bandwidth density estimation metho ds. Then, two of variance over all x 2 R , i.e., minimizes the mean inte- the most p opular variable bandwidth estimators, the grated squared error (MISE) Z bal loon and the sample point,areintro duced and their 2 ^ advantages discussed. We conclude the section byshow- f (x ) f (x ) dx : (5) MISE(x )=E ing that, with some precautions, the p erformance of the Nevertheless, the resulting bandwidth formula (see [17, sample p oint estimator is sup erior to b oth xed band- p.85], [20, p.98]) is of little practical use, since it de- width and ballo on estimators. pends on the Laplacian of the unknown density being 1.1 Fixed Bandwidth Density Estimation estimated. The multivariate xed bandwidth kernel density es- The b est of the currently available data-driven meth- timate is de ned by o ds for bandwidth selection seems to b e the plug-in rule n X x x 1 i [15], which was proven to b e sup erior to least squares ^ K : (1) f (x)= d nh h cross validation and biased cross-validation [11], [16, i=1 p.46]. A practical one dimensional algorithm based on the sample p oint estimators (7) remain unchanged [8]. this metho d is describ ed in the App endix. For the mul- Various authors [16, p.56], [17, p.101] remarked that tivariate case, see [20, p.108]. the metho d is insensitive to the ne detail of the pilot Note that these data-driven bandwidth selectors estimate. The only provision that should b e taken is to work well for multimo dal data, their only assumption b ound the pilot densityaway from zero. b eing a certain smo othness in the underlying density. The nal estimate (7) is however in uenced bythe However, the xed bandwidth a ects the estimation choice of the prop ortionality constant , which divides p erformance, by undersmo othing the tails and over- the range of densityvalues into low and high densities. ~ smo othing the p eaks of the density. The p erformance When the lo cal density is low, i.e., f (x ) < , h(x ) i i also decreases when the data exhibits lo cal scale varia- increases relative to h implying more smo othing for 0 ~ tions. the p oint x . For data p oints that verify f (x ) >,the i i bandwidth b ecomes narrower. 1.2 Ballo on and Sample Point Estimators A go o d initial choice [17, p.101] is to take as the ge- n o According to expression (1), the bandwidth h can ~ ometric mean of f (x ) . Our exp eriments have i be varied in two ways. First, by selecting a di erent i=1:::n bandwidth h = h(x ) for each estimation point x , one shown that for sup erior results, a certain degree of tun- can de ne the bal loon density estimator ing is required for . Nevertheless, the sample point n X estimator proved to b e almost all the time much b etter 1 x x i ^ f (x )= K : (6) 1 than the xed bandwidth estimator. d nh(x) h(x) i=1 2 Variable Bandwidth Mean Shift In this case, the estimate of f at x is the average of identically scaled kernels centered at eachdatapoint. Weshow next that starting from the sample p ointes- Second, by selecting a di erent bandwidth h = h(x ) timator (7) an adaptive estimator of the density's nor- i for each data point x we obtain the sample point density malized gradient can be de ned. The new estimator, i estimator which asso ciates to each data p oint a di erently scaled n X kernel, is the basic step for an iterative pro cedure that 1 1 x x i ^ f (x)= K : (7) 2 weprovetoconverge to a lo cal mo de of the underlying d n h(x ) h(x ) i i i=1 density, when the kernel ob eys some mild constraints. for which the estimate of f at x is the average of di er- We called the new pro cedure the Variable Bandwidth ently scaled kernels centered at each data p oint. Mean Shift. Due to its excellent statistical prop erties, While the ballo on estimator has more intuitive ap- weanticipate the extensive use of the adaptive estima- p eal, its p erformance improvementover the xed band- tor by vision applications that require minimal human width estimator is insigni cant. When the bandwidth intervention. h(x)ischosen as a function of the k -th nearest neigh- 2.1 De nitions 2 bor, the bias and variance are still prop ortional to h 1 d To simplify notations we pro ceed as in [6] by in- and n h , resp ectively [8]. In addition, the ballo on tro ducing rst the pro le of a kernel K as a function estimators usually fail to integrate to one. 2 k : [0; 1) ! R such that K (x) = k (kxk ). We also The sample p oint estimators, on the other hand, are denote h h(x ) for all i =1:::n. Then, the sample themselves densities, being non-negativeand integrat- i i point estimator (7) can b e written as ing to one. Their most attractive prop erty is that a par- ! n 2 X ticular choice of h(x ) reduces considerably the bias. In- x x 1 1 i i ^ k ; (9) f (x)= K d deed, when h(x )istaken to b e recipro cal to the square i n h h i i i=1 ro ot of f (x ) i where the subscript K indicates that the estimator is 1=2 based on kernel K . h(x )=h (8) i 0 A natural estimator of the gradientof f is the gra- f (x ) i ^ dientoff (x ) 4 K the bias b ecomes prop ortional to h , while the variance ! n 2 1 d X remains unchanged, prop ortional to n h [1, 8].