Machine Learning
10-701/15-781, Spring 2008
Spectral Clustering
Eric Xing
Lecture 23, April 14, 2008
Reading:
Data Clustering
- Two different criteria:
  - Compactness, e.g., k-means, mixture models
  - Connectivity, e.g., spectral clustering
[Figure: two example datasets, labeled "Compactness" and "Connectivity"]
Spectral Clustering
[Figure: data points connected by their pairwise similarities]
Weighted Graph Partitioning
- Some graph terminology:
  - Objects (e.g., pixels, data points) $i \in I$ = vertices of graph $G$
  - Edges $(ij)$ = pairs with $W_{ij} > 0$
  - Similarity matrix $W = [W_{ij}]$
  - Degree: $d_i = \sum_{j \in G} W_{ij}$; for $A \subseteq G$, $d_A = \sum_{i \in A} d_i$ (degree of $A$)
  - $\mathrm{assoc}(A,B) = \sum_{i \in A} \sum_{j \in B} W_{ij}$ (computed in the sketch below)
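These quantities are one-liners in code. A minimal sketch in NumPy (the toy matrix and the candidate split are mine, for illustration; the slides prescribe no library):

```python
import numpy as np

# A toy symmetric similarity matrix W over four vertices.
W = np.array([[0., 1., .2, 0.],
              [1., 0., 0., .2],
              [.2, 0., 0., 1.],
              [0., .2, 1., 0.]])

d = W.sum(axis=1)                  # d_i = sum_{j in G} W_ij
A, B = [0, 1], [2, 3]              # a candidate split of the vertices

d_A = d[A].sum()                   # d_A = sum_{i in A} d_i, degree of A
assoc_AB = W[np.ix_(A, B)].sum()   # assoc(A,B) = sum_{i in A, j in B} W_ij
print(d, d_A, assoc_AB)            # [1.2 1.2 1.2 1.2], 2.4, 0.4
```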
Cuts in a Graph
- (Edge) cut = set of edges whose removal makes a graph disconnected
- Weight of a cut:
  $\mathrm{cut}(A,B) = \sum_{i \in A} \sum_{j \in B} W_{ij} = \mathrm{assoc}(A,B)$
- Normalized cut criterion: minimize $\mathrm{cut}(A,\bar{A})$ relative to the degrees of both sides:
  $\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}$
- More generally, for a $k$-way partition (a code sketch follows):
  $\mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \sum_{r=1}^{k} \frac{\sum_{i \in A_r,\, j \in V \setminus A_r} W_{ij}}{\sum_{i \in A_r,\, j \in V} W_{ij}} = \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$
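A hedged sketch of the k-way Ncut formula, assuming a symmetric affinity matrix `W` and a list of index sets (the helper name `ncut` and the toy data are mine):

```python
import numpy as np

def ncut(W, parts):
    d = W.sum(axis=1)
    total = 0.0
    for A in parts:
        comp = np.setdiff1d(np.arange(W.shape[0]), A)   # V \ A_r
        cut = W[np.ix_(A, comp)].sum()                  # cut(A_r, complement)
        total += cut / d[A].sum()                       # divided by d_{A_r}
    return total

W = np.array([[0., 1., .2, 0.],
              [1., 0., 0., .2],
              [.2, 0., 0., 1.],
              [0., .2, 1., 0.]])
print(ncut(W, [[0, 1], [2, 3]]))   # 0.333: cuts only the weak .2 edges
print(ncut(W, [[0, 2], [1, 3]]))   # 1.667: cuts the strong edges
```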
Graph-based Clustering
- Data grouping on a weighted graph $G = \{V, E\}$: connect objects $i$ and $j$ with edge weight
  $W_{ij} = f(d(x_i, x_j))$
- Image segmentation casts pixels as the vertices of this graph
- Affinity matrix: $W = [w_{i,j}]$
- Degree matrix: $D = \mathrm{diag}(d_i)$
- Laplacian matrix: $L = D - W$
- (Bipartite) partition vector:
  $x = [x_1, \ldots, x_N] = [1, 1, \ldots, 1, -1, -1, \ldots, -1]$
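A small NumPy sketch tying these matrices together; it also checks the standard identity $x^T L x = 4\,\mathrm{cut}(A,B)$ for a $\pm 1$ partition vector, which is the bridge to the next slides:

```python
import numpy as np

W = np.array([[0., 1., .2, 0.],
              [1., 0., 0., .2],
              [.2, 0., 0., 1.],
              [0., .2, 1., 0.]])

D = np.diag(W.sum(axis=1))          # degree matrix D = diag(d_i)
L = D - W                           # graph Laplacian L = D - W

x = np.array([1., 1., -1., -1.])    # bipartite partition vector

# Standard identity: for x in {+1,-1}^N,
#   x^T L x = (1/2) sum_ij W_ij (x_i - x_j)^2 = 4 * cut(A,B).
print(x @ L @ x / 4)                # 0.4, the weight of this cut
```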
Affinity Function
$W_{i,j} = e^{-\|X_i - X_j\|_2^2 / \sigma^2}$

- Affinities grow as $\sigma$ grows
- How does the choice of $\sigma$ affect the results?
- What would be the optimal choice for $\sigma$? (See the sketch below.)
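One way to probe these questions numerically. A sketch with made-up 2-D points and an arbitrary sigma grid (both mine, for illustration):

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [5., 0.], [5., 1.]])  # two tight pairs

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
for sigma in (0.5, 1.0, 5.0):
    W = np.exp(-sq_dists / sigma ** 2)
    np.fill_diagonal(W, 0.)
    # Small sigma: only the within-pair affinities survive; large sigma:
    # everything looks similar and the block structure washes out.
    print(f"sigma={sigma}:\n{W.round(3)}")
```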
Clustering via Optimizing Normalized Cut
- The normalized cut:
  $\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}$
- Computing an optimal normalized cut over all possible partitions $y$ is NP-hard
- Transform the Ncut objective to matrix form (Shi & Malik, 2000):
  $\min_x \mathrm{Ncut}(x) = \min_y \frac{y^T (D - W) y}{y^T D y}$  (a Rayleigh quotient)
  subject to: $y_i \in \{1, -b\}$, $y^T D \mathbf{1} = 0$
- This follows from the identity
  $\mathrm{Ncut}(A,B) = \frac{(\mathbf{1}+x)^T (D-W)(\mathbf{1}+x)}{4k\,\mathbf{1}^T D \mathbf{1}} + \frac{(\mathbf{1}-x)^T (D-W)(\mathbf{1}-x)}{4(1-k)\,\mathbf{1}^T D \mathbf{1}}, \qquad k = \frac{\sum_{x_i > 0} D(i,i)}{\sum_i D(i,i)} = \ldots$
- Still an NP-hard problem, since $y$ remains discrete
Relaxation
$\min_x \mathrm{Ncut}(x) = \min_y \frac{y^T (D - W) y}{y^T D y}$ (Rayleigh quotient), subject to: $y_i \in \{1, -b\}$, $y^T D \mathbf{1} = 0$

- Instead, relax into the continuous domain by solving a generalized eigenvalue system:
  $\min_y\ y^T (D - W) y \quad \text{s.t. } y^T D y = 1$
- Which gives (by the Rayleigh quotient theorem): $(D - W)\, y = \lambda D y$
- Note that $(D - W)\,\mathbf{1} = 0$, so the first eigenvector is $y_0 = \mathbf{1}$ with eigenvalue $0$
- The second smallest eigenvector is the real-valued solution to this problem!
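A minimal sketch of this relaxation using scipy.linalg.eigh, which solves the generalized symmetric problem $(D-W)y = \lambda D y$ directly (the toy W is the same illustrative one as before):

```python
import numpy as np
from scipy.linalg import eigh

W = np.array([[0., 1., .2, 0.],
              [1., 0., 0., .2],
              [.2, 0., 0., 1.],
              [0., .2, 1., 0.]])
D = np.diag(W.sum(axis=1))

vals, vecs = eigh(D - W, D)    # generalized eigenproblem, ascending order
print(vals.round(3))           # first eigenvalue is 0 ...
print(vecs[:, 0].round(3))     # ... with a constant eigenvector (y0 = 1)
print(vecs[:, 1].round(3))     # second-smallest eigenvector: its signs
                               # separate {0,1} from {2,3}, the weak cut
```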
Algorithm
1. Define a similarity function between two nodes, e.g.:
   $w_{i,j} = e^{-\|X_{(i)} - X_{(j)}\|_2^2 / 2\sigma_X^2}$
2. Compute the affinity matrix $W$ and the degree matrix $D$.
3. Solve $(D - W)\, y = \lambda D y$:
   - Do a decomposition (SVD) of the graph Laplacian $L = D - W$: $L = V^T \Lambda V \Rightarrow y^*$
4. Use the eigenvector with the second smallest eigenvalue, $y^*$, to bipartition the graph (a sketch in code follows):
   - For each threshold $k$: $A_k = \{i \mid y_i^* \text{ among the } k \text{ largest elements of } y^*\}$, $B_k = \{i \mid y_i^* \text{ among the } n-k \text{ smallest elements of } y^*\}$
   - Compute $\mathrm{Ncut}(A_k, B_k)$
   - Output $A^*, B^*$ for $k^* = \arg\min_k \mathrm{Ncut}(A_k, B_k)$
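A compact sketch of steps 3-4; the helper names `bipartition` and `ncut_value` are mine, not the slides'. It sweeps every threshold along the sorted second eigenvector and keeps the minimum-Ncut split:

```python
import numpy as np
from scipy.linalg import eigh

def ncut_value(W, A, B):
    d = W.sum(axis=1)
    cut = W[np.ix_(A, B)].sum()
    return cut / d[A].sum() + cut / d[B].sum()

def bipartition(W):
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D)          # step 3: generalized eigenproblem
    y = vecs[:, 1]                    # second-smallest eigenvector y*
    order = np.argsort(-y)            # items sorted by decreasing y_i*
    best = None
    for k in range(1, len(y)):        # sweep all n-1 thresholds
        A, B = order[:k], order[k:]   # k largest vs. n-k smallest
        val = ncut_value(W, A, B)
        if best is None or val < best[0]:
            best = (val, A, B)        # keep the split minimizing Ncut
    return best

# Usage: bipartition(W) returns (Ncut value, indices of A*, indices of B*).
```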
Ideally …
$\mathrm{Ncut}(A,B) = \frac{y^T (D - W) y}{y^T D y}, \quad \text{with } y_i \in \{1, -b\},\ y^T D \mathbf{1} = 0$

$(D - W)\, y = \lambda D y$

[Figure: entries $y_i$ of the second eigenvector plotted against items $i$; ideally the values are piecewise constant, one level per cluster $A$]
Example
Poor features can lead to poor outcome (Xing et al., 2002)
Cluster vs. block matrix
$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}, \qquad \mathrm{degree}(A) = d_A = \sum_{i \in A,\, j \in V} W_{i,j}$

[Figure: with rows and columns ordered by cluster, the similarity matrix is nearly block-diagonal, one block per cluster A, B; a good Ncut corresponds to strong diagonal blocks and weak off-diagonal blocks]
Compare to Minimum Cut
- Criterion for partition:
  $\min_{A,B} \mathrm{cut}(A,B) = \min_{A,B} \sum_{i \in A,\, j \in B} W_{i,j}$
- Problem! The weight of a cut is directly proportional to the number of edges in the cut, so minimum cut favors slicing off small, isolated sets of vertices
- First proposed by Wu and Leahy
[Figure: cuts with lesser weight than the ideal cut vs. the ideal cut]
Superior performance?
- K-means and Gaussian mixture methods are biased toward convex clusters
Ncut is superior in certain cases
Why?
General Spectral Clustering
[Figure: data points connected by their pairwise similarities]
Representation
- Partition matrix $X$ (pixels × segments, one indicator column per segment):
  $X = [X_1, \ldots, X_K]$
- Pairwise similarity matrix $W$: $W(i,j) = \mathrm{aff}(i,j)$
- Degree matrix $D$: $D(i,i) = \sum_j w_{i,j}$
- Laplacian matrix $L$: $L = D - W$
Eigenvectors and blocks
- Block matrices have block eigenvectors (reproduced in the sketch below):

  $W = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \;\xrightarrow{\text{eigensolver}}\; v_1 = \begin{bmatrix} .71 \\ .71 \\ 0 \\ 0 \end{bmatrix},\ v_2 = \begin{bmatrix} 0 \\ 0 \\ .71 \\ .71 \end{bmatrix}, \qquad \lambda_1 = \lambda_2 = 2,\ \lambda_3 = \lambda_4 = 0$

- Near-block matrices have near-block eigenvectors:

  $W = \begin{bmatrix} 1 & 1 & .2 & 0 \\ 1 & 1 & 0 & -.2 \\ .2 & 0 & 1 & 1 \\ 0 & -.2 & 1 & 1 \end{bmatrix} \;\xrightarrow{\text{eigensolver}}\; v_1 = \begin{bmatrix} .71 \\ .69 \\ .14 \\ 0 \end{bmatrix},\ v_2 = \begin{bmatrix} 0 \\ -.14 \\ .69 \\ .71 \end{bmatrix}, \qquad \lambda_1 = \lambda_2 = 2.02,\ \lambda_3 = \lambda_4 = -0.02$
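These two examples are easy to reproduce with NumPy's symmetric eigensolver (keeping in mind that eigenvectors are only defined up to sign, and up to rotation where eigenvalues repeat):

```python
import numpy as np

block = np.array([[1., 1., 0., 0.],
                  [1., 1., 0., 0.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.]])
near = np.array([[1., 1., .2, 0.],
                 [1., 1., 0., -.2],
                 [.2, 0., 1., 1.],
                 [0., -.2, 1., 1.]])

for M in (block, near):
    vals, vecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    print(vals.round(2))             # block: [0 0 2 2]; near: [-0.02 -0.02 2.02 2.02]
    print(vecs[:, -2:].round(2))     # top two eigenvectors: near-indicator
                                     # vectors, up to sign/basis rotation
```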
Spectral Space
- Can put items into blocks by eigenvectors:

  $W = \begin{bmatrix} 1 & 1 & .2 & 0 \\ 1 & 1 & 0 & -.2 \\ .2 & 0 & 1 & 1 \\ 0 & -.2 & 1 & 1 \end{bmatrix}, \quad v_1 = \begin{bmatrix} .71 \\ .69 \\ .14 \\ 0 \end{bmatrix},\ v_2 = \begin{bmatrix} 0 \\ -.14 \\ .69 \\ .71 \end{bmatrix}$

  [Figure: each item $i$ plotted at $(e_1, e_2) = (v_1(i), v_2(i))$; the two blocks land in two tight groups]

- Clusters are clear regardless of row ordering (a sketch of this invariance follows below):

  $W' = \begin{bmatrix} 1 & .2 & 1 & 0 \\ .2 & 1 & 0 & 1 \\ 1 & 0 & 1 & -.2 \\ 0 & 1 & -.2 & 1 \end{bmatrix}, \quad v_1 = \begin{bmatrix} .71 \\ .14 \\ .69 \\ 0 \end{bmatrix},\ v_2 = \begin{bmatrix} 0 \\ .69 \\ -.14 \\ .71 \end{bmatrix}$

  [Figure: the same two groups appear in the $(e_1, e_2)$ plane]
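A sketch of the claimed invariance: permuting rows/columns of W permutes the embedding points but cannot change their relative geometry. Because the top eigenvalue here is (near-)degenerate, the eigenbasis is only defined up to rotation, so the comparison is on pairwise distances in the (e1, e2) embedding rather than raw coordinates:

```python
import numpy as np
from scipy.spatial.distance import pdist

W = np.array([[1., 1., .2, 0.],
              [1., 1., 0., -.2],
              [.2, 0., 1., 1.],
              [0., -.2, 1., 1.]])
perm = np.array([0, 2, 1, 3])        # the interleaved ordering shown above
Wp = W[np.ix_(perm, perm)]

_, V = np.linalg.eigh(W)
_, Vp = np.linalg.eigh(Wp)

# Same inter-point distances => same cluster structure, whatever the order.
print(pdist(V[:, -2:][perm]).round(3))
print(pdist(Vp[:, -2:]).round(3))
```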
Spectral Clustering
- Algorithms that cluster points using eigenvectors of matrices derived from the data
- Obtain a data representation in a low-dimensional space that can be easily clustered
- A variety of methods use the eigenvectors differently (we have seen an example)
- Empirically very successful
- Authors disagree on:
  - Which eigenvectors to use
  - How to derive clusters from these eigenvectors
- Two general methods
Method #1
- Partition using only one eigenvector at a time
- Use the procedure recursively
- Example: image segmentation
  - Uses the second smallest eigenvector to define the optimal cut
  - Recursively generates two clusters with each cut
Method #2
- Use k eigenvectors (k chosen by the user)
- Directly compute a k-way partitioning
- Experimentally has been seen to be "better"
Spectral Clustering Algorithm (Ng, Jordan, and Weiss, 2003)
- Given a set of points $S = \{s_1, \ldots, s_n\}$
- Form the affinity matrix $w_{i,j} = e^{-\|s_i - s_j\|_2^2 / \sigma^2}$ for $i \neq j$, $w_{i,i} = 0$
- Define the diagonal matrix $D_{ii} = \sum_k w_{ik}$
- Form the matrix $L = D^{-1/2} W D^{-1/2}$
- Stack the $k$ largest eigenvectors of $L$ to form the columns of a new matrix $X$:
  $X = [x_1\ x_2\ \cdots\ x_k]$
- Renormalize each of $X$'s rows to have unit length, giving a matrix $Y$; cluster the rows of $Y$ as points in $\mathbb{R}^k$ (a code sketch follows)
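A compact sketch of this procedure; NumPy plus scikit-learn's KMeans for the final step are my choices, not the slides':

```python
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(S, k, sigma):
    # Affinity: w_ij = exp(-||s_i - s_j||^2 / sigma^2), zero diagonal.
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / sigma ** 2)
    np.fill_diagonal(W, 0.)
    # Normalized affinity L = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1. / np.sqrt(W.sum(axis=1))
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Columns of X = the k largest eigenvectors (eigh sorts ascending).
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]
    # Row-normalize to get Y, then k-means on the rows of Y in R^k.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```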
SC vs. K-means
Why does it work?
- K-means in the spectral space!
More formally …
- Recall the generalized Ncut:
  $\mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \sum_{r=1}^{k} \frac{\sum_{i \in A_r,\, j \in V \setminus A_r} W_{ij}}{\sum_{i \in A_r,\, j \in V} W_{ij}} = \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$
- Minimizing this is equivalent to spectral clustering:
  $\min \mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \min \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$
- which relaxes to the trace problem below (note the sign flip: minimizing Ncut maximizes the normalized association; a derivation sketch follows):
  $\max_Y\ \operatorname{tr}\!\big(Y^T D^{-1/2} W D^{-1/2} Y\big) \quad \text{s.t. } Y^T Y = I$
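Why minimizing Ncut becomes a trace maximization: a sketch of the standard argument, writing $Y$ for the $D$-normalized version of the partition matrix $X$ from the Representation slide (the symbol $Y$ and this packaging are mine):

$$Y = D^{1/2} X \big(X^T D X\big)^{-1/2} \ \Rightarrow\ Y^T Y = I, \qquad \operatorname{tr}\!\big(Y^T D^{-1/2} W D^{-1/2} Y\big) = \sum_{r=1}^{k} \frac{\mathrm{assoc}(A_r, A_r)}{d_{A_r}},$$

$$\mathrm{Ncut}(A_1, \ldots, A_k) = \sum_{r=1}^{k} \frac{d_{A_r} - \mathrm{assoc}(A_r, A_r)}{d_{A_r}} = k - \operatorname{tr}\!\big(Y^T D^{-1/2} W D^{-1/2} Y\big),$$

so minimizing Ncut over partitions maximizes the trace; dropping the combinatorial constraint that $Y$ come from a partition matrix, and keeping only $Y^T Y = I$, yields the relaxed problem, solved by the top $k$ eigenvectors of $D^{-1/2} W D^{-1/2}$.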
Toy examples
Images from Matthew Brand (TR-2002-42)
User's Prerogative
- Choice of $k$, the number of clusters
- Choice of the scaling factor $\sigma^2$
  - Realistically, search over $\sigma^2$ and pick the value that gives the tightest clusters (a sketch follows this list)
- Choice of clustering method: k-way or recursive bipartite
- Kernel affinity matrix:
  $w_{i,j} = K(s_i, s_j)$
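A sketch of that sigma search: sweep a grid and keep the value whose spectral embedding yields the tightest k-means clusters (lowest distortion). The function name and grid are mine, for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_sigma(S, k, sigmas):
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)   # squared distances
    best = None
    for sigma in sigmas:
        W = np.exp(-sq / sigma ** 2)
        np.fill_diagonal(W, 0.)
        d = 1. / np.sqrt(W.sum(axis=1))
        _, vecs = np.linalg.eigh(d[:, None] * W * d[None, :])
        Y = vecs[:, -k:]
        Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        inertia = KMeans(n_clusters=k, n_init=10).fit(Y).inertia_
        if best is None or inertia < best[1]:
            best = (sigma, inertia)       # tightest clusters win
    return best[0]
```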
Conclusions
- Good news:
  - Simple and powerful methods to segment images
  - Flexible and easy to apply to other clustering problems
- Bad news:
  - High memory requirements (use sparse matrices)
  - Very dependent on the scale factor for a specific problem:
    $W(i,j) = e^{-\|X_{(i)} - X_{(j)}\|_2^2 / 2\sigma_X^2}$