The Variable Bandwidth Mean Shift and Data-Driven Scale Selection

The Variable Bandwidth Mean Shift and Data-Driven Scale Selection Dorin Comaniciu Visvanathan Ramesh Peter Meer Imaging & Visualization Department Electrical & Computer Engineering Department Siemens Corporate Research Rutgers University 755 College Road East, Princeton, NJ 08540 94 Brett Road, Piscataway, NJ 08855 Abstract where the d-dimensional vectors {~i}~=~,,,~represent a We present two solutions for the scale selection prob- random sample from some unknown density f and the lem in computer vision. The first one is completely non- kernel, K, is taken to be a radially symmetric, non- parametric and is based on the the adaptive estimation negative function centered at zero and integrating to of the normalized density gradient. Employing the sam- one. The terminology fixed bandwidth is due to the fact ple point estimator, we define the Variable Bandwidth that h is held constant across x E Rd. As a result, the Mean Shift, prove its convergence, and show its superi- fixed bandwidth procedure (1) estimates the density at ority over the fixed bandwidth procedure. The second each point x by taking the average of identically scaled technique has a semiparametric nature and imposes a kernels centered at each of the data points. local structure on the data to extract reliable scale in- For pointwise estimation; the classical measure of the formation. The local scale of the underlying density is closeness of the estimator f to its target value f is the taken as the bandwidth which maximizes the magni- mean squared error (MSE), equal to the sum of the tude of the normalized mean shift vector. Both estima- variance and squared bias tors provide practical tools for autonomous image and 2 MSE(x) = E - quasi real-time video analysis and several examples are [fw m] shown to illustrate their effectiveness. = Var (f(x)) + [Bias (f(x))]'. (2) 1 Motivation for Variable Bandwidth Using the multivariate form of the Taylor theorem, the The efficacy of Mean Shift analysis has been demon- bias and the variance are approximated by [20, p.971 strated in computer vision problems such as tracking and segmentation in [5, 61. However, one of the limi- Bias(x) x Th2p2(K)Af(x)1 (3) tations of the mean shift procedure as defined in these and papers is that it involves the specification of a scale Var(x) M , (4) parameter. While results obtained appear satisfactory, n-'h-d~(~)f(x) where p2(K) = and R(K)= are when the local characteristics of the feature space differs Sz;K(z)dz jK(z)dz kernel dependent constants, is the first component, significantly across data, it is difficult to find an opti- z1 of the vector z, and A is the Laplace operator. mal global bandwidth for the mean shift procedure. In The tradeoff of bias versus variance can be observed this paper we address the issue of locally adapting the in (3) and (4). The bias is proportional to h2, which bandwidth. We also study an alternative approach for means that smaller bandwidths give a less biased es- data-driven scale selection which imposes a local struc- timator. However, decreasing h implies an increase in ture on the data. The proposed solutions are tested in the variance which is proportional to n-l h-d. Thus for the framework of quasi real-time video analysis. a fixed bandwidth estimator we should choose h that We review first the intrinsic limitations of the fixed achieves an optimal compromise between the bias and bandwidth density estimation methods. Then, two of variance over all x E Rd,i.e., minimizes the mean inte- the most popular variable bandwidth estimators, the grated squared error (MISE) balloon and the sample point, are introduced and their 2 advantages discussed. We conclude the section by show- MISE(x) = E/ (f(x) - f(x)) dx . (5) ing that, with some precautions, the performance of the sample point estimator is superior to both fixed band- Nevertheless, the resulting bandwidth formula (see [17, width and balloon estimators. p.851, [20, p.981) is of little practical use, since it de- pends on the Laplacian of the unknown density being 1.1 Fixed Bandwidth Density Estimation estimated. The multivariate fixed bandwidth kernel density es- The best of the currently available data-driven meth- timate is defined by ods for bandwidth selection seems to be the plug-in ru1.e [15], which was proven to be superior to least squares cross validation and biased cross-validation [ll], [lG, , 438 0-7695-1143-0/01$10.00 0 2001 IEEE p.461. A practical one dimensional algorithm based on the sample point estimators (7) remain unchanged [8]. this method is described in the Appendix. For the mul- Various authors [16, p.561, [17, p.1011 remarked that tivariate case, see [20, p.1081. the method is insensitive to the fine detail of the pilot Note that these data-driven bandwidth selectors estimate. The only provision that should be taken is to work well for multimodal data, their only assumption bound the pilot density away from zero. being a certain smoothness in the underlying density. The final estimate (7) is however influenced by the However, the fixed bandwidth affects the estimation choice of the proportionality constant X, which divides performance, by undersmoothing the tails and over- the range of density values into low and hzgh densities. smoothing the peaks of the density. The performance When the local density is low, i.e., f(xz) < A, h(x,) also decreases when the data exhibits local scale varia- increases relative to ho implying more s-moothing for tions. the point x,. For data points that verify f(x,) > A, the 1.2 Balloon and Sample Point Estimators bandwidth becomes narrower. According to expression (l),the bandwidth h can A good initial choice [17, p.1011 is to take X as the ge- be varied in two ways. First, by selecting a different ometric mean of { f(z,)- . Our experiments have } z=1 n bandwidth h = h(x) for each estimation point x, one shown that for superior results, a certain degree of tun- can define the balloon density estimator ing is required for A. Nevertheless, the sample point estimator proved to be almost all the time much better than the fixed bandwidth estimator. In this case, the estimate of f at x is the average of 2 Variable Bandwidth Mean Shift idenhisally scaled kernels centered at each data point. We show next that starting from the sample point es- Second!, by selkcting a different bandwidth h = h(x,) timator (7) an adaptive estimator of the density's nor- for each data poznt x, we obtain the sample pozmt density malized gradient can be defined. The new estimator, estimator which associates to each data point a differently scaled kernel', is the basic step for an iterative procedure that we prove to converge to a local mode of the underlying density, when the kernel obeys some mild constraints. for which the estimate of f at x is the avemge of differ- We called the new procedure the Varzable Bandwzdth ently scaled kernels centered at each data point. Mean Shzft. Due to its excellent statistical properties, While the balloon estimator has more intuitive ap- we anticipate the extensive use of the adaptive estima- peal, its performance improvement over the fixed band- tor by vision applications that require minimal human width estimator is insignificant. When the bandwidth intervention. h(x) is chosen as a function of the k-th nearest neigh- bor, the bias and variance are still proportional to h' 2.1 Definitions and n-lh-d, respectively [8]. In addition, the balloon To simplify notations we proceed as in [6] by in- estimators usually fail to integrate to one. troducing first the profile of a kernel K as a function The sample point estimators, on the other hand, are k : [0,03) t R such that K(x) = k(ll~11~).We also themselves densities, being non-negative and integrat- denote h, I h(x,) for all i = 1 .,.n. Then, the sample ing to one. Their most attractive property is that a par- point estimator (7) can be written as ticular choice of h(x2)reduces considerably the bias. In- deed, when h(x,) is taken to be reciprocal to the square root of f(x,) where the subscript K indicates that the estimator is based on kernel K. A natural estimator of the gradient of f is the gra- the bias becomes proportional to h4,while the variance dient of j~(x) remains unchanged, proportional to n-'h-d [l, 81. In (8), ho represents a fixed bandwidth and X is a proportionality constant. Since f(x,) is unknown it has to be estimated from the data. The practical approach is to use one of the methods described in Section 1.1 to find ho and an initial estimate (called pilot) off denoted by f. Note that by using f instead of f in (8), the nice properties of 439 2.2 Properties of the Adaptive Mean Shift Equation (12) shows an attractive behavior of the adaptive estimator. The data points lying in large density regions affect a narrower neighborhood since the kernel bandwidth hi is smaller, but are given a larger importance, due to the weight l/h;+'. By contrast, the points that correspond to the tails of the under- and assumed that the derivative of profile k exists for lying density are smoothed more and receive a smaller all 2 E [0, oo), except for a finite set of points. weight. The extreme points (outliers) receive very small The last bracket in (10) represents the variable band- weights, being thus automatically discarded.

The Variable Bandwidth Mean Shift and Data-Driven Scale Selection

Mean Shift Paper by Comaniciu and Meer Presentation by Carlo Lopez-Tello What Is the Mean Shift Algorithm?

Identification Using Telematics

DBSCAN++: Towards Fast and Scalable Density Clustering

Deep Mean-Shift Priors for Image Restoration

Comparative Analysis of Clustering Techniques for Movie Recommendation

The Variable Bandwidth Mean Shift and Data-Driven Scale Selection

A Comparison of Clustering Algorithms for Face Clustering

Generalized Mean Shift with Triangular Kernel Profile Arxiv:2001.02165V1 [Cs.LG] 7 Jan 2020

Application of Deep Learning on Millimeter-Wave Radar Signals: a Review

Machine Learning – En Introduktion Josefin Rosén, Senior Analytical Expert, SAS Institute

Arxiv:1703.09964V1 [Cs.CV] 29 Mar 2017 Denoising, Or Each Magniﬁcation Factor in Super-Resolution

Outline of Machine Learning