Advanced Clustering Methods Meanshift, Medoidshift and Quickshift

Advanced Clustering Methods MeanShift, MedoidShift and QuickShift References: D. Comaniciu, P. Meer: Mean Shift: A Robust Approach toward Feature Space Analysis, IEEE Trans. Pattern Analysis Machine Intel., Vol. 24(5), 603-619, 2002 Yaser Sheikh, Erum Khan, Takeo Kanade, "Mode-seeking via Medoidshifts", IEEE International Conference on Computer Vision, 2007. A. Vedaldi and S. Soatto, "Quick Shift and Kernel Methods for Mode Seeking," in Proceedings of the European Conference on Computer Vision (ECCV), 2008 MeanShift Clustering nonparametric mode seeking Background: Parzen Estimation (Aka Kernel Density Estimation) Consider a mathematical model of how histograms are formed. Assume continuous data points. Parzen Estimation (Aka Kernel Density Estimation) Consider a mathematical model of how histograms are formed. Assume continuous data points. Convolve with box filter of width w (e.g. [1 1 1]) Subsample the results, with spacing w Resulting value at point u represents count of data points falling in range u-w/2 to u+w/2 Increased smoothing h[7] h[8] h[9] h[6] h[5] h[4] h[3] Example Histograms h[0] h[1] h[2] samples: [1 1 1] Box filter [ 1 1 1 1 1 1 1] [ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] Why Formulate it This Way? • Generalize from box filter to other filters (for example Gaussian) • Gaussian acts as a smoothing filter. • Key idea: produces a nonparametric estimate of a continuous distribution from a discrete set of points. Kernel Density Estimation • Parzen windows: Approximate probability density by estimating local density of points (same idea as a histogram) – Convolve points with window/kernel function (e.g., Gaussian) using scale parameter (e.g., sigma) from Hastie et al. Density Estimation at Different Scales • Example: Density estimates for 5 data points with differently-scaled kernels • Scale influences accuracy vs. generality (overfitting) from Duda et al. Smoothing Scale Selection • Unresolved issue: how to pick the scale (sigma value) of the smoothing filter • Answer for now: a user-settable parameter from Duda et al. Mean-Shift The mean-shift algorithm is a hill-climbing algorithm that seeks modes of a density without explicitly computing that density. The density is implicitly represented by raw samples and a kernel function. The density is the one that would be computed if Parzen estimation was applied to the data with the given kernel. What is Mean Shift ? A tool for: Finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in RN PDF in feature space • Color space • Scale space Non-parametric • Actually anyDensity feature Estimation space you can conceive • … Discrete PDF Representation Data Non-parametric Density GRADIENT Estimation (Mean Shift) PDF Analysis Ukrainitz&Sarel, Weizmann Intuitive Description Region of interest Center of mass Mean Shift vector Objective : Find the densest region Distribution of identical billiard balls Ukrainitz&Sarel, Weizmann Intuitive Description Region of interest Center of mass Mean Shift vector Objective : Find the densest region Distribution of identical billiard balls Ukrainitz&Sarel, Weizmann Intuitive Description Region of interest Center of mass Mean Shift vector Objective : Find the densest region Distribution of identical billiard balls Ukrainitz&Sarel, Weizmann Intuitive Description Region of interest Center of mass Mean Shift vector Objective : Find the densest region Distribution of identical billiard balls Ukrainitz&Sarel, Weizmann Intuitive Description Region of interest Center of mass Mean Shift vector Objective : Find the densest region Distribution of identical billiard balls Ukrainitz&Sarel, Weizmann Intuitive Description Region of interest Center of mass Mean Shift vector Objective : Find the densest region Distribution of identical billiard balls Ukrainitz&Sarel, Weizmann Intuitive Description Region of interest Center of mass Objective : Find the densest region Distribution of identical billiard balls Ukrainitz&Sarel, Weizmann MeanShift as Mode Seeking P(x) x1 x2 x3 Parzen density estimation MeanShift as Mode Seeking P(x) y1 y2 Seeking the mode from an initial point y1 Construct a tight convex lower bound b at y1 [b(y1)=P(y1)] Find y2 to maximize b MeanShift as Mode Seeking P(x) P(y2) P(y1) y1 y2 Note, P(y2) >= b(y2) > b(y1) = P(y1) Therefore P(y2) > P(y1) MeanShift as Mode Seeking P(x) y2 y3 y4 y5 y6 Move to y2 and repeat until you reach a mode MeanShift in Equations Parzen estimate using {x1,x2,...,xN} where (r) is a convex, symmetric windowing (weighting) kernel; e.g. (r)=exp(-r) 2 Form lower bound of each j by expanding about r0= ||yold-xj|| note: first-order taylor series expansion f(r) ~ f(r0) + (r-r0)f’(r0) + h.o.t. MeanShift in Equations Lower bound of at r0 now let and substitute into Parzen equation to get a quadratic lower bound (wrt y) on the Parzen density function where MeanShift in Equations Quadratic lower bound on the Parzen density function Want to find y to maximize this. We therefore need to maximize Take derivative wrt y, set to 0, and solve mean-shift update eqn MeanShift Clustering For each data point xi Let yi(0) = xi ; t = 0 Repeat Compute weights w1, w2, ... wN for all data points, using yi(t) Compute updated location yi(t+1) using mean-shift update eqn If ( yi(t) and yi(t+1) are “close enough” ) declare convergence to a mode and exit else t = t + 1; number distinct modes found = number clusters end (repeat) points that converge to “same” mode are labeled as belonging to one cluster What is MedoidShift? Instead of solving for continuous-valued y to maximize the quadratic lower bound on the Parzen density estimate, consider only a discrete set of y locations, taken from the initial set of data points {x1,x2,...,xN} That is, for the first step towards the mode y (0) = x y is now chosen from a i i discrete set of data points MedoidShift Implications of this strategy for clustering: Since we are choosing the best y from among {x1,x2,...,xN}, yi(1) will be one of the original data points. Therefore, we only need to compute a single step starting from each data point, because yi(2) will then be the result of yyi(1)(1) x1 y3(1) x y5(3) 3 y4(1) y (2) y2(1) 5 x4 x2 y5(1) y5(1) x5 y5(0) MedoidShift Different clusters are found as separate connected components (trees, actually... so can do efficient tree traversal to locate all points in each cluster. Also note: no need for heuristic distance thresholds to test for convergence to a mode, or to determine membership in a cluster. x 1 x6 x3 x7 x10 x8 x4 x9 x2 x11 x5 MedoidShift MedoidShift behavior asymptotically approaches MeanShift behavior as number of data samples approaches infinity. Advantages of MedoidShift •Automatically determines the number of clusters (as does mean-shift) •Previous computations can be reused when new samples are added or old samples are deleted (good for incremental clustering applications) •Can work in domains where only distances between samples are defined (e.g. EMD or ISOMAP); there is no need to compute a mean. •No need for heuristic terminating conditions (don’t need to use a distance threshold to determine convergence) MeanShift vs MedoidShift Drawbacks of MedoidShift •O(N3) [although not really – Vedaldi paper] •Can fail to find all the modes of a density, thereby causing over-fragmentation starting point continuous meanshift step goes here but medoidshift has to stay here Iterated MedoidShift •keep a count of how many points move to xj and apply a new iteration of medoidshift with points weighted by that count. •Each iteration is essentially using a smoother Parzen estimate Seeds of QuickShift! Vedaldi and Soatto Contributions: • If using Euclidean distance, MedoidShift can actually be implemented in O(N^2) • generalize to non-Euclidean distances using kernel trick • show that MedoidShift does not always find all modes • propose QuickShift, an alternative approach that is simple, fast, and yields a one-parameter family of clustering results that explicitly represent the tradeoff between over- and under-fragmentation. Complexity of MedoidShift (NxN matrices) note: define NxN matrices D=[Dkj] , F=[Fji] then matrix multiply does the summation over j and argmax is looking for max element in col i Complexity of MedoidShift However, if using Euclidean distance (which is typical), the computation time is better than this. Euclidean MedoidShift then we can write the squared Euclidean distance matrix D as Euclidean MedoidShift Therefore This has constant columns We have to compute this and can be ignored But these quantities can all be computed in O(N2) time Kernel MedoidShift Insight: medoidshift is defined in terms of pairwise distances between data points. Therefore, using the “kernel trick”, we can generalize it to handle nonEuclidean distances defined by a kernel matrix K (similar to how many machine learning algorithms are “kernelized”) Kernel MedoidShift - 2 and we can apply medoidshift using this DF matrix instead Kernel MedoidShift note: But What is QuickShift? A combination of the above kernel trick and... instead of choosing the data point that optimizes the lower bound function b(x) of the Parzen estimate P(x) choose the closest data point that yields a higher value of P(x) directly QuickShift vs MedoidShift •Recall our failure case for medoidshift starting point continuous meanshift goes here medoidshift has to stay here quickshift goes HERE Comparisons meanshift medoidshift quickshift Quickshift Benefits •Similar benefits to medoidshift, plus additional ones: •All points are in one connected tree. You can therefore explore various clusterings by choosing a threshold T and pruning edges such that d(xi,xj) < T •Runs much faster in practice than either meanshift or medoidshift Applications image segmentation by clustering (r,g,b), or (r,g,b,x,y), or whatever your favorite color space is Applications manifold clustering using ISOMAP distance Applications clustering bag-of-features signatures using Chi-squared kernel (distances between histograms) to discover visual “categories” in Caltech-4 clusters in 400D space (visualized by projection to rank 2) 5 discovered categories.

Advanced Clustering Methods Meanshift, Medoidshift and Quickshift

Mean Shift Paper by Comaniciu and Meer Presentation by Carlo Lopez-Tello What Is the Mean Shift Algorithm?

Identification Using Telematics

DBSCAN++: Towards Fast and Scalable Density Clustering

Deep Mean-Shift Priors for Image Restoration

Comparative Analysis of Clustering Techniques for Movie Recommendation

The Variable Bandwidth Mean Shift and Data-Driven Scale Selection

A Comparison of Clustering Algorithms for Face Clustering

Generalized Mean Shift with Triangular Kernel Profile Arxiv:2001.02165V1 [Cs.LG] 7 Jan 2020

Application of Deep Learning on Millimeter-Wave Radar Signals: a Review

Machine Learning – En Introduktion Josefin Rosén, Senior Analytical Expert, SAS Institute

Arxiv:1703.09964V1 [Cs.CV] 29 Mar 2017 Denoising, Or Each Magniﬁcation Factor in Super-Resolution

Outline of Machine Learning