
Machine Learning 10-701/15-781, Spring 2008
Spectral Clustering
Eric Xing
Lecture 23, April 14, 2008

Data Clustering
- Two different criteria:
  - Compactness, e.g., k-means, mixture models
  - Connectivity, e.g., spectral clustering
- [Figure: compact, convex clusters vs. ring-shaped clusters grouped by connectivity]

Spectral Clustering
- [Figure: raw data points and the matrix of pairwise similarities derived from them]

Weighted Graph Partitioning
- Some graph terminology:
  - Objects (e.g., pixels, data points) i ∈ I = vertices of a graph G
  - Edges (i, j) = pairs of vertices with W_ij > 0
  - Similarity matrix W = [W_ij]
  - Degree d_i = Σ_{j∈G} W_ij; degree of a subset A ⊆ G: d_A = Σ_{i∈A} d_i
  - Assoc(A, B) = Σ_{i∈A} Σ_{j∈B} W_ij

Cuts in a Graph
- (Edge) cut = set of edges whose removal makes the graph disconnected
- Weight of a cut: cut(A, B) = Σ_{i∈A} Σ_{j∈B} W_ij = Assoc(A, B)
- Normalized cut criterion (for the minimum cut(A, Ā)):

  \[ \mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B} \]

- More generally, for a k-way partition:

  \[ \mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \sum_{r=1}^{k} \frac{\sum_{i \in A_r,\, j \in V \setminus A_r} W_{ij}}{\sum_{i \in A_r,\, j \in V} W_{ij}} = \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}} \]

Graph-based Clustering
- Data grouping on a graph G = {V, E} with edge weights W_ij = f(d(x_i, x_j))
- Image segmentation
- Affinity matrix: W = [w_{i,j}]
- Degree matrix: D = diag(d_i)
- Laplacian matrix: L = D − W
- (Bipartite) partition vector: x = [x_1, ..., x_N] = [1, 1, ..., 1, −1, −1, ..., −1]

Affinity Function

  \[ W_{i,j} = e^{-\|X_i - X_j\|_2^2 / \sigma^2} \]

- Affinities grow as σ grows
- How does the choice of σ affect the results?
- What would be the optimal choice for σ?

Clustering via Optimizing Normalized Cut
- The normalized cut:

  \[ \mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B} \]

- Computing an optimal normalized cut over all possible partitions y is NP-hard
- Transform the Ncut objective into matrix form (Shi & Malik, 2000):

  \[ \min_x \mathrm{Ncut}(x) = \min_y \frac{y^T (D - W)\, y}{y^T D\, y} \quad \text{(a Rayleigh quotient)}, \qquad \text{s.t. } y_i \in \{1, -b\},\ y^T D \mathbf{1} = 0 \]

- Sketch of the transformation: with the ±1 indicator vector x for (A, B) and k = (Σ_{x_i > 0} D(i,i)) / (Σ_i D(i,i)),

  \[ 4\,\mathrm{Ncut}(A,B) = 4\left(\frac{\mathrm{cut}(A,B)}{\deg(A)} + \frac{\mathrm{cut}(A,B)}{\deg(B)}\right) = \frac{(1+x)^T (D - W)(1+x)}{k\, \mathbf{1}^T D \mathbf{1}} + \frac{(1-x)^T (D - W)(1-x)}{(1-k)\, \mathbf{1}^T D \mathbf{1}} = \ldots \]

- This is still an NP-hard problem

Relaxation
- Recall:

  \[ \min_x \mathrm{Ncut}(x) = \min_y \frac{y^T (D - W)\, y}{y^T D\, y}, \qquad \text{s.t. } y_i \in \{1, -b\},\ y^T D \mathbf{1} = 0 \]

- Instead, relax into the continuous domain and solve the generalized eigenvalue system

  \[ \min_y\; y^T (D - W)\, y \quad \text{s.t. } y^T D\, y = 1, \]

  which, by the Rayleigh quotient theorem, gives

  \[ (D - W)\, y = \lambda D\, y \]

- Note that (D − W) 1 = 0, so the first eigenvector is y_0 = 1 with eigenvalue 0
- The second-smallest eigenvector is the real-valued solution to this problem!

Algorithm
1. Define a similarity function between two nodes, e.g.

   \[ w_{i,j} = e^{-\|X^{(i)} - X^{(j)}\|_2^2 / 2\sigma_X^2} \]

2. Compute the affinity matrix W and the degree matrix D.
3. Solve (D − W) y = λ D y, e.g., via an eigendecomposition (SVD) of the graph Laplacian L = D − W: L = VᵀΛV ⇒ y*.
4. Use the eigenvector y* with the second-smallest eigenvalue to bipartition the graph:
   - For each threshold k, let A_k = {i | y*_i is among the k largest elements of y*} and B_k = {i | y*_i is among the n − k smallest elements of y*}
   - Compute Ncut(A_k, B_k)
   - Output k* = argmin_k Ncut(A_k, B_k) and the corresponding A_{k*}, B_{k*}
   (a code sketch of this two-way procedure appears after the minimum-cut comparison below)

Ideally …

  \[ \mathrm{Ncut}(A,B) = \frac{y^T (D - W)\, y}{y^T D\, y}, \quad \text{with } y_i \in \{1, -b\},\ y^T D \mathbf{1} = 0, \qquad (D - W)\, y = \lambda D\, y \]

- [Figure: entries of the second eigenvector y plotted against index; the two groups A and B take well-separated values]

Example
- [Figure: example segmentation results]

Poor features can lead to poor outcome (Xing et al., 2002)
- [Figure: examples where uninformative features/affinities yield poor clusterings]

Cluster vs. block matrix

  \[ \mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}, \qquad \mathrm{Degree}(A) = \sum_{i \in A,\, j \in V} W_{i,j} \]

- [Figure: the affinity matrix reordered by cluster membership becomes (nearly) block diagonal, with blocks for A and B]

Compare to Minimum cut
- Criterion for partition:

  \[ \min_{A,B}\; \mathrm{cut}(A,B) = \min_{A,B} \sum_{i \in A,\, j \in B} W_{i,j} \]

- Problem! The weight of a cut is directly proportional to the number of edges in the cut, so minimum cut favors cutting off small, isolated sets of vertices
- [Figure: cuts with lesser weight than the ideal cut vs. the ideal cut]
- First proposed by Wu and Leahy
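A minimal code sketch of the two-way Ncut procedure above. The function name, the choice of σ, and the toy data are illustrative assumptions, not part of the lecture; only NumPy and SciPy are assumed.

```python
# A minimal sketch of the two-way normalized-cut procedure described above.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist


def ncut_bipartition(X, sigma=1.0):
    """Bipartition the rows of X (n x d) by thresholding the second generalized eigenvector."""
    # Steps 1-2: Gaussian affinity and degree matrices.
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))
    D = np.diag(W.sum(axis=1))

    # Step 3: solve the generalized eigenproblem (D - W) y = lambda * D y.
    _, eigvecs = eigh(D - W, D)        # eigenvalues returned in ascending order
    y = eigvecs[:, 1]                  # eigenvector of the second-smallest eigenvalue

    # Step 4: sweep thresholds on y and keep the split with the smallest Ncut.
    degrees = W.sum(axis=1)

    def ncut(in_A):
        cut = W[in_A][:, ~in_A].sum()
        return cut / degrees[in_A].sum() + cut / degrees[~in_A].sum()

    best_mask, best_val = None, np.inf
    for t in np.unique(y)[:-1]:        # thresholds below max(y), so both sides stay non-empty
        mask = y > t
        val = ncut(mask)
        if val < best_val:
            best_mask, best_val = mask, val
    return best_mask, best_val


# Toy usage: two well-separated Gaussian blobs should be split cleanly.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(3.0, 0.3, (20, 2))])
labels, value = ncut_bipartition(pts, sigma=1.0)
```

On data like these toy blobs the second eigenvector takes clearly positive values on one group and clearly negative values on the other, which mirrors the "Ideally …" picture above.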
Superior performance?
- K-means and Gaussian mixture methods are biased toward convex clusters
- [Figure: a non-convex data set where such methods fail]

Ncut is superior in certain cases
- [Figure: examples where Ncut finds the right grouping]

Why?
- [Figure: illustration of why connectivity-based grouping succeeds on these examples]

General Spectral Clustering
- [Figure: data points and the matrix of pairwise similarities derived from them]

Representation
- Partition matrix X (segments × pixels): X = [X_1, ..., X_K]
- Pairwise similarity matrix W: W(i, j) = aff(i, j)
- Degree matrix D: D(i, i) = Σ_j w_{i,j}
- Laplacian matrix: L = D − W

Eigenvectors and blocks
- Block matrices have block eigenvectors. For the block matrix below, the eigenvalues are λ1 = 2, λ2 = 2, λ3 = 0, λ4 = 0, and the two leading eigenvectors are constant within each block:

      W = [ 1   1   0   0 ]        e1 = (.71, .71, 0, 0)
          [ 1   1   0   0 ]        e2 = (0, 0, .71, .71)
          [ 0   0   1   1 ]
          [ 0   0   1   1 ]

- Near-block matrices have near-block eigenvectors (here λ1 = 2.02, λ2 = 2.02, λ3 = −0.02, λ4 = −0.02):

      W = [ 1    1    .2   0  ]    e1 ≈ (.71, .69, .14, 0)
          [ 1    1    0   −.2 ]    e2 ≈ (0, −.14, .69, .71)
          [ .2   0    1    1  ]
          [ 0   −.2   1    1  ]

Spectral Space
- Can put items into blocks by their eigenvector coordinates: plotting each item i at (e1_i, e2_i) for the near-block matrix above separates the two groups
- The clusters are clear regardless of row ordering; permuting the rows and columns of the matrix permutes the eigenvector entries in the same way:

      W = [ 1    .2   1    0  ]    e1 ≈ (.71, .14, .69, 0)
          [ .2   1    0    1  ]    e2 ≈ (0, .69, −.14, .71)
          [ 1    0    1   −.2 ]
          [ 0    1   −.2   1  ]

Spectral Clustering
- Algorithms that cluster points using eigenvectors of matrices derived from the data
- Obtain a data representation in a low-dimensional space that can be easily clustered
- A variety of methods use the eigenvectors differently (we have seen one example)
- Empirically very successful
- Authors disagree on:
  - which eigenvectors to use
  - how to derive clusters from these eigenvectors
- Two general methods

Method #1
- Partition using only one eigenvector at a time
- Apply the procedure recursively
- Example: image segmentation
  - uses the 2nd smallest eigenvector to define the optimal cut
  - recursively generates two clusters with each cut

Method #2
- Use k eigenvectors (k chosen by the user)
- Directly compute a k-way partitioning
- Experimentally has been seen to be "better"

Spectral Clustering Algorithm (Ng, Jordan & Weiss, 2002)
- Given a set of points S = {s_1, ..., s_n}:
  1. Form the affinity matrix with w_{i,j} = e^{−‖s_i − s_j‖² / 2σ²} for i ≠ j and w_{i,i} = 0
  2. Define the diagonal degree matrix D_ii = Σ_k w_{ik}
  3. Form the normalized matrix L = D^{−1/2} W D^{−1/2}
  4. Stack the k largest eigenvectors of L as the columns of a new matrix X = [x_1 x_2 ... x_k]
  5. Renormalize each row of X to unit length to obtain Y, and cluster the rows of Y as points in R^k (a code sketch of this k-way procedure appears after the "User's Prerogative" slide below)

SC vs. K-means
- [Figure: side-by-side clustering results of spectral clustering and k-means]

Why does it work?
- It is k-means in the spectral space!

More formally …
- Recall the generalized Ncut:

  \[ \mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \sum_{r=1}^{k} \frac{\sum_{i \in A_r,\, j \in V \setminus A_r} W_{ij}}{\sum_{i \in A_r,\, j \in V} W_{ij}} = \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}} \]

- Minimizing this is equivalent to spectral clustering: after relaxing the segments × pixels partition matrix Y,

  \[ \min_{A_1, \ldots, A_k} \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}} \;\longleftrightarrow\; \max_Y\; \mathrm{tr}\!\left( Y^T D^{-1/2} W D^{-1/2} Y \right) \quad \text{s.t. } Y^T Y = I \]

Toy examples
- [Figure: toy clustering examples; images from Matthew Brand (MERL TR-2002-42)]

User's Prerogative
- Choice of k, the number of clusters
- Choice of the scaling factor σ²; realistically, search over σ² and pick the value that gives the tightest clusters (a sketch of such a search appears after the conclusions)
- Choice of clustering method: k-way or recursive bipartite
- Choice of kernel for the affinity matrix: w_{i,j} = K(s_i, s_j)
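A minimal code sketch of the k-way algorithm summarized above in the Ng–Jordan–Weiss slide. The function name and default σ are illustrative assumptions; NumPy, SciPy, and scikit-learn are assumed to be available.

```python
# A minimal sketch of the k-way spectral clustering algorithm summarized above
# (Ng, Jordan & Weiss style). Function name and defaults are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans


def njw_spectral_clustering(S, k, sigma=1.0):
    """Cluster the rows of S (n x d) into k groups via the top-k eigenvectors of D^-1/2 W D^-1/2."""
    # Step 1: Gaussian affinity with zero diagonal.
    W = np.exp(-cdist(S, S, "sqeuclidean") / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Steps 2-3: degree matrix and the normalized affinity L = D^-1/2 W D^-1/2.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Step 4: stack the k largest eigenvectors of L as the columns of X.
    _, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    X = eigvecs[:, -k:]

    # Step 5: renormalize the rows of X to unit length and run k-means on the rows of Y.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
```

This is literally "k-means in the spectral space": all of the graph structure enters through the embedding Y, and the final grouping is an ordinary k-means run.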
Conclusions
- Good news:
  - simple and powerful methods to segment images
  - flexible and easy to apply to other clustering problems
- Bad news:
  - high memory requirements (use sparse matrices)
  - very dependent on the scale factor σ_X for a specific problem:

    \[ W(i,j) = e^{-\|X^{(i)} - X^{(j)}\|_2^2 / 2\sigma_X^2} \]
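A rough sketch of the σ search mentioned under "User's Prerogative": try several scale factors and keep the one whose clustering is tightest. The candidate grid, the tightness score (within-cluster scatter in the input space), and the reuse of njw_spectral_clustering from the earlier sketch are all assumptions.

```python
# Try a few scale factors and keep the one giving the tightest clusters.
# njw_spectral_clustering is the sketch defined earlier; the grid and the
# tightness score are illustrative assumptions.
import numpy as np


def pick_sigma(S, k, candidates=(0.1, 0.3, 1.0, 3.0)):
    best_sigma, best_scatter = None, np.inf
    for sigma in candidates:
        labels = njw_spectral_clustering(S, k, sigma=sigma)
        # Sum of squared distances to each cluster's centroid (smaller = tighter).
        scatter = sum(
            ((S[labels == c] - S[labels == c].mean(axis=0)) ** 2).sum()
            for c in range(k)
        )
        if scatter < best_scatter:
            best_sigma, best_scatter = sigma, scatter
    return best_sigma
```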