Robert Collins, CSE586

Advanced Clustering Methods: MeanShift, MedoidShift and QuickShift
(nonparametric mode seeking: no need to know the number of clusters in advance!)

References:

D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 24(5), pp. 603-619, 2002.

Y. Sheikh, E. Khan, and T. Kanade, "Mode-seeking via Medoidshifts," IEEE International Conference on Computer Vision (ICCV), 2007.

A. Vedaldi and S. Soatto, "Quick Shift and Kernel Methods for Mode Seeking," European Conference on Computer Vision (ECCV), 2008.

What is Mean Shift?

A tool for finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N.

[diagram: Data -> non-parametric density estimation (a discrete PDF representation) -> non-parametric density GRADIENT estimation (mean shift) -> PDF analysis. Slides: Ukrainitz & Sarel, Weizmann]

Parzen Density Estimation

Given a set of data samples xi, i = 1...n, convolve with a kernel function H to generate a smooth function f(x). This is equivalent to a superposition of multiple kernels, one centered at each data point.
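A minimal Python sketch of such a Parzen (kernel density) estimate, assuming a Gaussian kernel; the function name and the normalization constant are illustrative choices, not taken from the slides:

import numpy as np

def parzen_density(x, samples, sigma=1.0):
    # Superposition of Gaussian kernels, one centered at each data sample.
    r = np.sum((samples - x) ** 2, axis=1)      # squared distances ||xi - x||^2
    k = np.exp(-r / (2 * sigma ** 2))           # kernel response per sample
    d = samples.shape[1]                        # dimensionality of the data
    return k.sum() / (len(samples) * (2 * np.pi * sigma ** 2) ** (d / 2))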

Smoothing Scale Selection

• Unresolved issue: how to pick the scale (sigma value) of the smoothing kernel
• Scale influences accuracy vs. generality (overfitting)
• Answer for now: a user-settable parameter

Density Estimation at Different Scales

• Example: density estimates for 5 data points with differently-scaled kernels
[figures from Duda et al.]
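For instance, evaluating the sketch above on five 1-D points at several sigmas illustrates the accuracy vs. generality tradeoff (the point values are illustrative, not the ones from Duda et al.):

samples = np.array([[1.0], [2.0], [2.5], [5.0], [6.0]])   # five 1-D data points
for sigma in (0.1, 0.5, 2.0):   # small sigma overfits; large sigma oversmooths
    print(sigma, parzen_density(np.array([2.0]), samples, sigma))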

Mean-Shift

The mean-shift algorithm is a hill-climbing algorithm that seeks modes of a density without explicitly computing that density. The density is implicitly represented by the raw samples and a kernel function: it is the density that would be computed if Parzen estimation were applied to the data with the given kernel.

[figure: Parzen density estimate P(x) formed from samples x1, x2, x3]

MeanShift as Mode Seeking

Seeking the mode from an initial point y1:
• Construct a tight convex lower bound b at y1, with b(y1) = P(y1)
• Find y2 to maximize b
• Note that P(y2) >= b(y2) > b(y1) = P(y1); therefore P(y2) > P(y1)

[figure: P(x) with the bound b touching it at y1; y2, the maximizer of b, has P(y2) > P(y1)]

MeanShift as Mode Seeking (continued)

Move to y2 and repeat until you reach a mode.

[figure: successive iterates y2, y3, y4, y5, y6 climbing P(x) toward the mode]

Aside: EM Optimizes a Lower Bound

• The optimization EM performs can also be described as constructing a lower bound, optimizing it, then repeating until convergence.
• In the case of EM the lower bound is constructed by Jensen's inequality:
      log[ (1/N) Σi f(xi) ] >= (1/N) Σi log f(xi)
• The EM bound is not necessarily quadratic, but it is solvable in closed form.

MeanShift in Equations

Parzen density estimate using {x1, x2, ..., xN}:

    P(y) = (1/N) Σj Φ(||y - xj||^2)

where Φ(r) is a convex, symmetric windowing (weighting) kernel applied to the squared distance r = ||y - xj||^2. Examples:

    Epanechnikov: Φ(r) = 1 - r
    Gaussian:     Φ(r) = exp(-r)

Form a lower bound on each Φj by expanding about r0 = ||y_old - xj||^2 using a first-order Taylor series expansion, f(r) ≈ f(r0) + (r - r0) f'(r0) + h.o.t. Because Φ is convex, this tangent line lies below Φ everywhere, so it is a genuine lower bound.

MeanShift in Equations (continued)

Lower bound of Φ at r0:

    Φ(r) >= Φ(r0) + (r - r0) Φ'(r0)

Now let wj = -Φ'(r0j), where r0j = ||y_old - xj||^2, and substitute r = ||y - xj||^2 into the Parzen equation to get a quadratic lower bound (wrt y) on the Parzen density function:

    b(y) = (1/N) Σj [ Φ(r0j) + (r0j - ||y - xj||^2) wj ]

We want to find y to maximize this; we therefore need to maximize -Σj wj ||y - xj||^2. Take the derivative wrt y, set it to 0, and solve:

    Σj wj (y - xj) = 0
    y_new = ( Σj wj xj ) / ( Σj wj )      <- mean-shift update eqn

where wj = -Φ'(||y_old - xj||^2).
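A small Python sketch of one such update step, assuming the Gaussian kernel: for Φ(r) = exp(-r/(2σ^2)), the weight -Φ'(r) is proportional to Φ(r), and the constant of proportionality cancels in the ratio.

def mean_shift_step(y_old, X, sigma=1.0):
    # One mean-shift update: y_new = Σj wj xj / Σj wj, wj = -Φ'(||y_old - xj||^2).
    r = np.sum((X - y_old) ** 2, axis=1)       # r0j = ||y_old - xj||^2
    w = np.exp(-r / (2 * sigma ** 2))          # Gaussian weights (up to a constant)
    return w @ X / w.sum()                     # weighted mean of the data points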

MeanShift in Equations (summary)

• Density estimate: the Parzen estimate P(y)
• Mean-shift update eqn: y_new = ( Σj wj xj ) / ( Σj wj )
• Weights at each iteration: wj = -Φ'(||y_old - xj||^2)
      Epanechnikov kernel -> flat (constant) weights
      Gaussian kernel -> Gaussian weights

Intuitive Description

Objective: find the densest region of a distribution of identical billiard balls. Place a region of interest on the data, compute the center of mass of the points inside it, and translate the region along the mean-shift vector, which points from the region's center to that center of mass.
[slides: Ukrainitz & Sarel, Weizmann]

[the animation repeats the cycle (region of interest -> center of mass -> mean-shift vector -> move) until the region of interest converges on the densest region]

MeanShift Clustering

For each data point xi:
  Let yi(0) = xi; t = 0
  Repeat:
    Compute weights w1, w2, ..., wN for all data points, using yi(t)
    Compute the updated location yi(t+1) using the mean-shift update eqn
    If yi(t) and yi(t+1) are "close enough", declare convergence to a mode and exit
    else t = t + 1
  end (repeat)

The number of distinct modes found = the number of clusters; points that converge to the "same" mode are labeled as belonging to one cluster. A runnable sketch follows.
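A minimal Python sketch of this procedure, using mean_shift_step from above; the tolerance and the mode-merging radius are illustrative heuristics (the "close enough" tests the slides mention), not prescribed values:

def mean_shift_cluster(X, sigma=1.0, tol=1e-5, max_iter=500):
    modes = []
    for y in X:                                    # start one climb from each point
        for _ in range(max_iter):
            y_new = mean_shift_step(y, X, sigma)
            if np.linalg.norm(y_new - y) < tol:    # "close enough": converged
                break
            y = y_new
        modes.append(y)
    labels, centers = [], []                       # merge coincident modes
    for m in modes:
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < 100 * tol:  # heuristic merge radius
                labels.append(k)
                break
        else:
            labels.append(len(centers))
            centers.append(m)
    return np.array(labels), np.array(centers)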

What is MedoidShift?

Instead of solving for a continuous-valued y to maximize the quadratic lower bound on the Parzen density estimate, consider only a discrete set of y locations, taken from the initial set of data points {x1, x2, ..., xN}. That is, for the first step towards the mode, yi(0) = xi as before, but y is now chosen from a discrete set of data points.

MedoidShift

Implications of this strategy for clustering:

Since we are choosing the best y from among {x1, x2, ..., xN}, yi(1) will be one of the original data points. Therefore, we only need to compute a single step starting from each data point, because yi(2) will then be the result of y_{yi(1)}(1).

Different clusters are found as separate connected components (trees, actually, so we can do an efficient tree traversal to locate all the points in each cluster).

Also note: there is no need for heuristic distance thresholds to test for convergence to a mode, or to determine membership in a cluster.

[figure: each point xi links to the data point reached in one step; following links, e.g. y5(0) -> y5(1) -> y5(2) -> y5(3), walks up the tree to the mode]
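A Python sketch of this single-step-per-point idea, again assuming a Gaussian kernel; the argmin formulation of "best discrete y" follows from the quadratic lower bound derived earlier, with constants dropped:

def medoid_shift(X, sigma=1.0):
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # pairwise ||xi - xj||^2
    W = np.exp(-D / (2 * sigma ** 2))                         # weights wij = -Φ'(Dij)
    # Starting at xi, the bound is maximized by the data point xk that
    # minimizes Σj wij ||xk - xj||^2, i.e. the argmin over row i of (W @ D).
    parent = np.argmin(W @ D, axis=1)                         # one discrete step per point
    labels = parent.copy()
    for _ in range(len(X)):                                   # follow links to tree roots
        labels = labels[labels]
    return parent, labels                                     # roots label the clusters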

MedoidShift behavior asymptotically approaches MeanShift behavior as the number of data samples approaches infinity.

Advantages of MedoidShift

• Automatically determines the number of clusters (as does mean-shift)

• Previous computations can be reused when new samples are added or old samples are deleted (good for incremental clustering applications)

• Can work in domains where only distances between samples are defined (e.g. EMD or ISOMAP); there is no need to compute a mean.

• No need for heuristic terminating conditions (we don't need a distance threshold to determine convergence)

Drawbacks of MedoidShift

• O(N^3) complexity [although not really; see the Vedaldi and Soatto paper]
• Can fail to find all the modes of a density, thereby causing over-fragmentation

MeanShift vs MedoidShift

[figure: from a given starting point, a continuous meanshift step moves part-way toward the mode, but medoidshift has to stay at the starting data point]

Iterated MedoidShift (Seeds of QuickShift!)

• Keep a count of how many points move to each xj, and apply a new iteration of medoidshift with the points weighted by those counts.
• Each iteration is essentially using a smoother Parzen estimate.

Vedaldi and Soatto

Contributions:
• If using Euclidean distance, MedoidShift can actually be implemented in O(N^2)
• Generalize to non-Euclidean distances using the kernel trick (similar to how many algorithms are "kernelized")
• Show that MedoidShift does not always find all modes
• Propose QuickShift, an alternative approach that is simple, fast, and yields a one-parameter family of clustering results that explicitly represents the tradeoff between over- and under-fragmentation

Kernel MedoidShift

Insight: medoidshift is defined in terms of pairwise distances between data points. Therefore, using the "kernel trick", we can generalize it to handle non-Euclidean distances defined by a kernel matrix K.
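In kernel form the pairwise squared distances come straight from the kernel matrix via the standard identity d^2(xi, xj) = Kii + Kjj - 2 Kij; a one-function sketch (the slides do not spell this step out):

def kernel_distances(K):
    # Squared distances in the feature space induced by kernel matrix K.
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2 * K

The resulting distance matrix can then be dropped into medoid_shift above in place of the Euclidean D.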

What is QuickShift?

A combination of the above kernel trick and a new move rule: instead of choosing the data point that optimizes the lower bound function b(x) of the Parzen estimate P(x), choose the closest data point that yields a higher value of P(x) directly.

QuickShift vs MedoidShift

• Recall our failure case for medoidshift:

[figure: from the starting point, continuous meanshift goes part-way to the mode, medoidshift has to stay put, but quickshift jumps to the nearest data point of higher density]
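A Python sketch of this rule, under the same Gaussian-kernel assumption as the earlier sketches:

def quick_shift(X, sigma=1.0):
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)  # pairwise squared distances
    P = np.exp(-D / (2 * sigma ** 2)).sum(axis=1)             # Parzen density at each point
    parent = np.arange(len(X))                                # roots point to themselves
    dist = np.full(len(X), np.inf)
    for i in range(len(X)):
        higher = np.where(P > P[i])[0]                        # points of higher density
        if len(higher) > 0:
            j = higher[np.argmin(D[i, higher])]               # the closest such point
            parent[i], dist[i] = j, np.sqrt(D[i, j])
    return parent, dist                                       # edges of the quickshift tree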

Quickshift Benefits

• Similar benefits to medoidshift, plus additional ones:
• All points are in one connected tree. You can therefore explore various clusterings by choosing a threshold T and pruning the edges with d(xi, xj) > T, keeping only edges shorter than T (a sketch follows).
• Runs much faster in practice than either meanshift or medoidshift

Comparisons

[figure: the mode-seeking trajectories produced by meanshift, medoidshift, and quickshift on the same data set]
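Continuing the sketch above, pruning the tree at a threshold T and reading off the clusters (the label-propagation loop is one simple, illustrative way to find the roots):

def prune(parent, dist, T):
    parent = parent.copy()
    cut = dist > T                                 # break edges longer than T
    parent[cut] = np.flatnonzero(cut)              # cut points become their own roots
    for _ in range(len(parent)):                   # propagate links up to the roots
        parent = parent[parent]
    return parent                                  # cluster label = index of tree root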

Applications

• Image segmentation by clustering (r,g,b), or (r,g,b,x,y), or whatever your favorite color space is
• Manifold clustering using ISOMAP distance
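As one concrete (r,g,b,x,y) segmentation route, a sketch using scikit-learn's off-the-shelf MeanShift; the bandwidth value is arbitrary and feature scaling between color and position is left untreated:

from sklearn.cluster import MeanShift

def segment(image, bandwidth=20.0):
    # Cluster pixels in (r, g, b, x, y) feature space; returns a label image.
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    features = np.column_stack([image.reshape(-1, 3),
                                xs.ravel(), ys.ravel()]).astype(float)
    labels = MeanShift(bandwidth=bandwidth).fit(features).labels_
    return labels.reshape(h, w)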

Applications

Clustering images using a Bhattacharyya coefficient on 5D histograms composed of (L,a,b,x,y) values.

Clustering bag-of-features signatures using a Chi-squared kernel (distances between histograms) to discover visual "categories" in Caltech-4.

[figure: clusters in 400D space, visualized by projection to rank 2; 5 discovered categories, with the ground-truth category "airplanes" split into two clusters]
