Machine Learning

10-701/15-781, Spring 2008

Spectral Clustering

Eric Xing

Lecture 23, April 14, 2008


Data Clustering

• Two different criteria:

  • Compactness, e.g., k-means, mixture models

  • Connectivity, e.g., spectral clustering

[Figure: two example datasets, one grouped by compactness, the other by connectivity]


Spectral Clustering

[Figure: from data points to a matrix of pairwise similarities]


Weighted Graph Partitioning

• Some graph terminology:

  • Objects (e.g., pixels, data points) $i \in I$ = vertices of graph $G$

  • Edges $(i,j)$ = pixel pairs with $W_{ij} > 0$

  • Similarity matrix $W = [W_{ij}]$

  • Degrees: $d_i = \sum_{j \in G} W_{ij}$ (degree of node $i$); $d_A = \sum_{i \in A} d_i$ (degree of $A \subseteq G$)

  • $\mathrm{Assoc}(A,B) = \sum_{i \in A} \sum_{j \in B} W_{ij}$

[Figure: weighted graph G with edge weight W_ij between nodes i and j; subsets A, B ⊆ G]


Cuts in a Graph

• (Edge) cut = a set of edges whose removal makes a graph disconnected

• Weight of a cut:

$$\mathrm{cut}(A,B) = \sum_{i \in A} \sum_{j \in B} W_{ij} = \mathrm{Assoc}(A,B)$$

• Normalized cut criterion (instead of the plain minimum cut(A, Ā)): minimize

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}$$

More generally, for a k-way partition:

$$\mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \sum_{r=1}^{k} \frac{\sum_{i \in A_r,\, j \in V \setminus A_r} W_{ij}}{\sum_{i \in A_r,\, j \in V} W_{ij}} = \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$$

Graph-based Clustering

• Data grouping: build a weighted graph $G = \{V, E\}$ over the data, with edge weights

$$W_{ij} = f(d(x_i, x_j))$$

[Figure: data points i, j connected by an edge of weight W_ij]

• Image segmentation

• Affinity matrix: $W = [w_{ij}]$

• Degree matrix: $D = \mathrm{diag}(d_i)$

• Laplacian matrix: $L = D - W$

• (Bipartite) partition vector:

$$x = [x_1, \ldots, x_N] = [1, 1, \ldots, 1, -1, -1, \ldots, -1]$$
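A minimal numpy sketch of the objects listed above (Gaussian affinity W, degree matrix D, Laplacian L = D − W); the sample data and σ = 1 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (5, 2)),     # two well-separated clouds
               rng.normal(3.0, 0.3, (5, 2))])
sigma = 1.0                                      # assumed scale parameter

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2
W = np.exp(-sq_dists / (2 * sigma**2))           # W_ij = f(d(x_i, x_j))
np.fill_diagonal(W, 0.0)                         # no self-affinity

d = W.sum(axis=1)                                # degrees d_i
D = np.diag(d)                                   # degree matrix
L = D - W                                        # graph Laplacian
print(np.allclose(L @ np.ones(len(X)), 0))       # True: L 1 = 0
```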

Affinity Function

$$W_{ij} = e^{-\frac{\|X_i - X_j\|_2^2}{2\sigma^2}}$$

• Affinities grow as σ grows

• How does the choice of σ affect the results?

• What would be the optimal choice for σ?
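A quick numerical illustration of the first bullet: for a fixed pair of points, the affinity climbs toward 1 as σ grows (the distance and σ grid here are arbitrary).

```python
import numpy as np

dist = 1.0                                   # fixed ||X_i - X_j||
for sigma in [0.1, 0.5, 1.0, 5.0]:
    w = np.exp(-dist**2 / (2 * sigma**2))
    print(f"sigma={sigma:<4} W_ij={w:.4f}")  # affinity -> 1 as sigma grows
```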


Clustering via Optimizing Normalized Cut

• The normalized cut:

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}$$

• Computing an optimal normalized cut over all possible partitions y is NP-hard

• Transform the Ncut equation into matrix form (Shi & Malik, 2000):

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\deg(A)} + \frac{\mathrm{cut}(A,B)}{\deg(B)} = \frac{(1+x)^T (D-W)(1+x)}{k\, \mathbf{1}^T D \mathbf{1}} + \frac{(1-x)^T (D-W)(1-x)}{(1-k)\, \mathbf{1}^T D \mathbf{1}} = \ldots, \quad k = \frac{\sum_{x_i > 0} D(i,i)}{\sum_i D(i,i)}$$

$$\min_x \mathrm{Ncut}(x) = \min_y \frac{y^T (D-W) y}{y^T D y} \quad \text{(a Rayleigh quotient)}$$

subject to: $y_i \in \{1, -b\}$, $y^T D \mathbf{1} = 0$

• Still an NP-hard problem

Relaxation

$$\min_x \mathrm{Ncut}(x) = \min_y \frac{y^T (D-W) y}{y^T D y} \quad \text{(Rayleigh quotient)}$$

subject to: $y_i \in \{1, -b\}$, $y^T D \mathbf{1} = 0$

• Instead, relax into the continuous domain by solving a generalized eigenvalue system:

$$\min_y\; y^T (D-W) y, \quad \text{s.t. } y^T D y = 1$$

• Which gives: $(D-W) y = \lambda D y$ (by the Rayleigh quotient theorem)

• Note that $(D - W)\mathbf{1} = 0$, so the first eigenvector is $y_0 = \mathbf{1}$, with eigenvalue 0.

• The eigenvector with the second smallest eigenvalue is the real-valued solution to this relaxed problem!
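A short sketch of this relaxation using scipy's generalized symmetric eigensolver; the helper name second_eigvec is ours, not from the lecture.

```python
import numpy as np
from scipy.linalg import eigh

def second_eigvec(W):
    """Relaxed Ncut solution: second-smallest generalized eigenvector."""
    D = np.diag(W.sum(axis=1))
    evals, evecs = eigh(D - W, D)   # solves (D - W) y = lambda D y, ascending
    # evals[0] ~ 0 with y0 ~ constant (the trivial eigenvector 1);
    # column 1 is the real-valued relaxed cut indicator
    return evals[1], evecs[:, 1]
```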


Algorithm

1. Define a similarity function between two nodes, e.g.:

$$w_{ij} = e^{-\frac{\|X^{(i)} - X^{(j)}\|_2^2}{2\sigma_X^2}}$$

2. Compute the affinity matrix W and the degree matrix D.

3. Solve $(D - W) y = \lambda D y$

  • e.g., via an eigendecomposition of the graph Laplacian $L = D - W$ (equivalently an SVD, since L is symmetric positive semidefinite): $L = V \Lambda V^T \Rightarrow y^*$

4. Use the eigenvector $y^*$ with the second smallest eigenvalue to bipartition the graph:

  • For each threshold k, $A_k = \{i \mid y^*_i \text{ among the } k \text{ largest elements of } y^*\}$, $B_k = \{i \mid y^*_i \text{ among the } n-k \text{ smallest elements of } y^*\}$

  • Compute $\mathrm{Ncut}(A_k, B_k)$

  • Output $A_{k^*}$ and $B_{k^*}$, where $k^* = \arg\min_k \mathrm{Ncut}(A_k, B_k)$
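A sketch of steps 1-4 in numpy/scipy, assuming the Gaussian affinity above; the function names and data interface are illustrative, not prescribed by the lecture.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_value(W, mask):
    """Ncut(A, B) for the partition A = mask, B = ~mask."""
    cut = W[np.ix_(mask, ~mask)].sum()
    return cut / W[mask].sum() + cut / W[~mask].sum()

def ncut_bipartition(X, sigma=1.0):
    # Steps 1-2: affinity matrix W and degree matrix D
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    # Step 3: generalized eigenproblem (D - W) y = lambda D y
    _, evecs = eigh(D - W, D)
    y = evecs[:, 1]                         # second-smallest eigenvector y*
    # Step 4: sweep thresholds along sorted y*, keep the minimum-Ncut split
    order = np.argsort(y)
    best = None
    for k in range(1, len(y)):
        mask = np.zeros(len(y), dtype=bool)
        mask[order[:k]] = True
        nc = ncut_value(W, mask)
        if best is None or nc < best[0]:
            best = (nc, mask)
    return best                             # (Ncut value, boolean partition)
```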


Ideally …

$$\mathrm{Ncut}(A,B) = \frac{y^T (D-W) y}{y^T D y}, \quad \text{with } y_i \in \{1, -b\},\; y^T D \mathbf{1} = 0$$

$$(D-W) y = \lambda D y$$

[Figure: in the ideal case, the entries y_i of the second eigenvector are piecewise constant, taking one value on cluster A and another on its complement]


Example


Poor features can lead to poor outcomes (Xing et al., 2002)


Cluster vs. Cut

$$\mathrm{Degree}(A) = d_A = \sum_{i \in A,\, j \in V} W_{ij}$$

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}$$

[Figure: two partitions (A, B) of the same graph, comparing their degrees and normalized cuts]


Compare to Minimum Cut

• Criterion for partition:

$$\min_{A,B} \mathrm{cut}(A,B) = \min_{A,B} \sum_{i \in A,\, j \in B} W_{ij}$$

Problem! The weight of a cut is directly proportional to the number of edges in the cut, so minimum cut favors chopping off small, isolated sets of nodes.

[Figure: cuts with lesser weight than the ideal cut vs. the ideal cut]

First proposed by Wu and Leahy.


Superior performance?

• K-means and Gaussian mixture methods are biased toward convex clusters


Ncut is superior in certain cases


Why?


General Spectral Clustering

[Figure: from data points to a matrix of pairwise similarities]


Representation

• Partition matrix $X$ (pixels × segments): $X = [X_1, \ldots, X_K]$

• Pair-wise similarity matrix $W$: $W(i,j) = \mathrm{aff}(i,j)$

• Degree matrix $D$: $D(i,i) = \sum_j w_{ij}$

• Laplacian matrix $L$: $L = D - W$


Eigenvectors and blocks

• Block matrices have block eigenvectors:

$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix} \xrightarrow{\text{eigensolver}} \begin{pmatrix} .71 & 0 \\ .71 & 0 \\ 0 & .71 \\ 0 & .71 \end{pmatrix}, \qquad \lambda_1 = 2,\ \lambda_2 = 2,\ \lambda_3 = 0,\ \lambda_4 = 0$$

• Near-block matrices have near-block eigenvectors:

$$\begin{pmatrix} 1 & 1 & .2 & 0 \\ 1 & 1 & 0 & -.2 \\ .2 & 0 & 1 & 1 \\ 0 & -.2 & 1 & 1 \end{pmatrix} \xrightarrow{\text{eigensolver}} \begin{pmatrix} .71 & 0 \\ .69 & -.14 \\ .14 & .69 \\ 0 & .71 \end{pmatrix}, \qquad \lambda_1 = 2.02,\ \lambda_2 = 2.02,\ \lambda_3 = -0.02,\ \lambda_4 = -0.02$$
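The near-block example can be checked numerically; the matrix below is copied from the slide, and the eigenvalues should come out near 2.02 and −0.02.

```python
import numpy as np

W = np.array([[ 1.0,  1.0,  0.2,  0.0],
              [ 1.0,  1.0,  0.0, -0.2],
              [ 0.2,  0.0,  1.0,  1.0],
              [ 0.0, -0.2,  1.0,  1.0]])

evals, evecs = np.linalg.eigh(W)    # ascending eigenvalues
print(np.round(evals[::-1], 2))     # ~ [ 2.02  2.02 -0.02 -0.02]
# Top-2 eigenvectors are near-block (up to sign/rotation within the
# nearly degenerate eigenspace):
print(np.round(evecs[:, -2:], 2))
```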


Spectral Space

• Can put items into blocks by their eigenvector entries:

$$\begin{pmatrix} 1 & 1 & .2 & 0 \\ 1 & 1 & 0 & -.2 \\ .2 & 0 & 1 & 1 \\ 0 & -.2 & 1 & 1 \end{pmatrix} \longrightarrow e_1 = \begin{pmatrix} .71 \\ .69 \\ .14 \\ 0 \end{pmatrix},\quad e_2 = \begin{pmatrix} 0 \\ -.14 \\ .69 \\ .71 \end{pmatrix}$$

[Figure: plotting each item's pair of coordinates in the (e1, e2) plane separates the two blocks]

• Clusters are clear regardless of row ordering:

$$\begin{pmatrix} 1 & .2 & 1 & 0 \\ .2 & 1 & 0 & 1 \\ 1 & 0 & 1 & -.2 \\ 0 & 1 & -.2 & 1 \end{pmatrix} \longrightarrow e_1 = \begin{pmatrix} .71 \\ .14 \\ .69 \\ 0 \end{pmatrix},\quad e_2 = \begin{pmatrix} 0 \\ .69 \\ -.14 \\ .71 \end{pmatrix}$$

[Figure: the same items land in the same two groups in the (e1, e2) plane]
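A quick check of the row-ordering claim: permuting the items permutes the eigenvector entries the same way, so each item's spectral coordinates (and hence the clusters) are unchanged.

```python
import numpy as np

W = np.array([[ 1.0,  1.0,  0.2,  0.0],
              [ 1.0,  1.0,  0.0, -0.2],
              [ 0.2,  0.0,  1.0,  1.0],
              [ 0.0, -0.2,  1.0,  1.0]])
perm = [0, 2, 1, 3]                   # the reordering used on the slide
Wp = W[np.ix_(perm, perm)]            # same graph, rows/columns shuffled

_, V  = np.linalg.eigh(W)
_, Vp = np.linalg.eigh(Wp)
# Each row below is one item's coordinates in spectral space (top-2
# eigenvectors). Rows within a block nearly coincide and rows across
# blocks are nearly orthogonal, whichever ordering we use:
print(np.round(V[:, -2:], 2))
print(np.round(Vp[:, -2:], 2))
```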

Spectral Clustering

• Algorithms that cluster points using eigenvectors of matrices derived from the data

• Obtain a data representation in a low-dimensional space that can be easily clustered

• A variety of methods use the eigenvectors differently (we have seen an example)

• Empirically very successful

• Authors disagree on:

  • Which eigenvectors to use

  • How to derive clusters from these eigenvectors

• Two general methods


Method #1

• Partition using only one eigenvector at a time

• Use the procedure recursively

• Example: image segmentation

  • Uses the eigenvector with the second smallest eigenvalue to define the optimal cut

  • Recursively generates two clusters with each cut (a sketch of this recursion follows below)
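An illustrative sketch of the recursion, assuming the ncut_bipartition helper from the earlier algorithm sketch; the stopping threshold max_ncut and the minimum cluster size are assumptions.

```python
import numpy as np

def recursive_ncut(X, idx=None, max_ncut=0.1, min_size=5):
    """Recursively bipartition; stop when the best cut is no longer cheap."""
    idx = np.arange(len(X)) if idx is None else idx
    if len(idx) < 2 * min_size:
        return [idx]                           # too small to split further
    nc, mask = ncut_bipartition(X[idx])        # one 2-way spectral cut
    if nc > max_ncut:
        return [idx]                           # no sufficiently good cut left
    return (recursive_ncut(X, idx[mask], max_ncut, min_size) +
            recursive_ncut(X, idx[~mask], max_ncut, min_size))
```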


Method #2

• Use k eigenvectors (k chosen by the user)

• Directly compute a k-way partitioning

• Experimentally, this has been seen to work "better"


Spectral Clustering Algorithm (Ng, Jordan, and Weiss, 2003)

• Given a set of points $S = \{s_1, \ldots, s_n\}$

• Form the affinity matrix

$$w_{ij} = e^{-\frac{\|s_i - s_j\|_2^2}{2\sigma^2}} \;\; \forall i \neq j, \qquad w_{ii} = 0$$

• Define the degree matrix $D_{ii} = \sum_k w_{ik}$

• Form the normalized matrix $L = D^{-1/2} W D^{-1/2}$

• Stack the k largest eigenvectors of $L$ to form the columns of a new matrix $X$:

$$X = [x_1\ x_2\ \cdots\ x_k]$$

• Renormalize each row of $X$ to unit length to get a matrix $Y$; cluster the rows of $Y$ as points in $\mathbb{R}^k$
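A compact numpy/scipy rendering of these steps; σ and the use of scipy's kmeans2 for the final clustering step are our choices, not prescribed by the slide.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def njw_spectral_clustering(S, k, sigma=1.0):
    # Affinity matrix with zero diagonal
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    # Normalized matrix L = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Columns of X = the k largest eigenvectors of L
    _, evecs = np.linalg.eigh(L)               # ascending eigenvalues
    X = evecs[:, -k:]
    # Row-normalize X to get Y, then cluster rows of Y in R^k
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    _, labels = kmeans2(Y, k, minit='++')
    return labels
```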


SC vs. K-means


Why does it work?

• It is k-means in the spectral space!


More formally …

• Recall the generalized Ncut:

$$\mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \sum_{r=1}^{k} \frac{\sum_{i \in A_r,\, j \in V \setminus A_r} W_{ij}}{\sum_{i \in A_r,\, j \in V} W_{ij}} = \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$$

• Minimizing this is equivalent to spectral clustering:

$$\min_{A_1, \ldots, A_k} \mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \min \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$$

which relaxes to a trace problem over the (pixels × segments) partition matrix $Y$:

$$\max_Y\ \mathrm{tr}\!\left(Y^T D^{-1/2} W D^{-1/2} Y\right) \quad \text{s.t. } Y^T Y = I$$


Toy examples

Images from Matthew Brand (TR-2002-42)


User’s Prerogative

• Choice of k, the number of clusters

• Choice of the scaling factor $\sigma^2$

  • Realistically, search over $\sigma^2$ and pick the value that gives the tightest clusters (a search sketch follows below)

• Choice of clustering method: k-way or recursive bipartite

• Kernel affinity matrix:

$$w_{ij} = K(s_i, s_j)$$
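One way to act on the "search over σ²" bullet, assuming the njw_spectral_clustering sketch from earlier; the σ grid and the tightness criterion (mean within-cluster spread) are stand-in assumptions.

```python
import numpy as np

def pick_sigma(S, k, sigmas=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0)):
    """Grid-search sigma; keep the value giving the tightest clusters."""
    best = None
    for sigma in sigmas:
        labels = njw_spectral_clustering(S, k, sigma=sigma)
        # Tightness heuristic: mean distance to the assigned cluster mean
        tight = np.mean([np.linalg.norm(S[labels == c] -
                                        S[labels == c].mean(0), axis=1).mean()
                         for c in np.unique(labels)])
        if best is None or tight < best[0]:
            best = (tight, sigma)
    return best[1]
```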


Conclusions

• Good news:

  • Simple and powerful methods to segment images

  • Flexible and easy to apply to other clustering problems

• Bad news:

  • High memory requirements (use sparse matrices)

  • Very dependent on the scale factor $\sigma_X$ for a specific problem:

$$W(i,j) = e^{-\frac{\|X^{(i)} - X^{(j)}\|_2^2}{2\sigma_X^2}}$$

