Machine Learning

10-701/15-781, Spring 2008

Spectral Clustering

Eric Xing

Lecture 23, April 14, 2008


Data Clustering

• Two different criteria:

  • Compactness, e.g., k-means, mixture models

  • Connectivity, e.g., spectral clustering

[Figure: two example datasets, one grouped by compactness, the other by connectivity]


Spectral Clustering

[Figure: from data points to a matrix of pairwise similarities]


Weighted Graph Partitioning

• Some graph terminology:

  • Objects (e.g., pixels, data points) $i \in I$ = vertices of graph $G$

  • Edges $(i,j)$ = pixel pairs with $W_{ij} > 0$

  • Similarity matrix $W = [W_{ij}]$

  • Degrees: $d_i = \sum_{j \in G} W_{ij}$ (degree of node $i$); $d_A = \sum_{i \in A} d_i$ (degree of $A \subseteq G$)

  • $\mathrm{Assoc}(A,B) = \sum_{i \in A} \sum_{j \in B} W_{ij}$

[Figure: weighted graph G with edge weight W_ij between nodes i and j; subsets A, B ⊆ G]


Cuts in a Graph

• (Edge) cut = a set of edges whose removal makes a graph disconnected

• Weight of a cut:

$$\mathrm{cut}(A,B) = \sum_{i \in A} \sum_{j \in B} W_{ij} = \mathrm{Assoc}(A,B)$$

• Normalized cut criterion (instead of the plain minimum cut(A, Ā)): minimize

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}$$

More generally, for a k-way partition:

$$\mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \sum_{r=1}^{k} \frac{\sum_{i \in A_r,\, j \in V \setminus A_r} W_{ij}}{\sum_{i \in A_r,\, j \in V} W_{ij}} = \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$$

Graph-based Clustering

• Data grouping: build a weighted graph $G = \{V, E\}$ over the data, with edge weights

$$W_{ij} = f(d(x_i, x_j))$$

[Figure: data points i, j connected by an edge of weight W_ij]

• Image segmentation

• Affinity matrix: $W = [w_{ij}]$

• Degree matrix: $D = \mathrm{diag}(d_i)$

• Laplacian matrix: $L = D - W$

• (Bipartite) partition vector:

$$x = [x_1, \ldots, x_N] = [1, 1, \ldots, 1, -1, -1, \ldots, -1]$$
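A minimal numpy sketch of the objects listed above (Gaussian affinity W, degree matrix D, Laplacian L = D − W); the sample data and σ = 1 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (5, 2)),     # two well-separated clouds
               rng.normal(3.0, 0.3, (5, 2))])
sigma = 1.0                                      # assumed scale parameter

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2
W = np.exp(-sq_dists / (2 * sigma**2))           # W_ij = f(d(x_i, x_j))
np.fill_diagonal(W, 0.0)                         # no self-affinity

d = W.sum(axis=1)                                # degrees d_i
D = np.diag(d)                                   # degree matrix
L = D - W                                        # graph Laplacian
print(np.allclose(L @ np.ones(len(X)), 0))       # True: L 1 = 0
```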

Affinity Function

$$W_{ij} = e^{-\frac{\|X_i - X_j\|_2^2}{2\sigma^2}}$$

• Affinities grow as σ grows

• How does the choice of σ affect the results?

• What would be the optimal choice for σ?
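A quick numerical illustration of the first bullet: for a fixed pair of points, the affinity climbs toward 1 as σ grows (the distance and σ grid here are arbitrary).

```python
import numpy as np

dist = 1.0                                   # fixed ||X_i - X_j||
for sigma in [0.1, 0.5, 1.0, 5.0]:
    w = np.exp(-dist**2 / (2 * sigma**2))
    print(f"sigma={sigma:<4} W_ij={w:.4f}")  # affinity -> 1 as sigma grows
```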


Clustering via Optimizing Normalized Cut

• The normalized cut:

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}$$

• Computing an optimal normalized cut over all possible partitions y is NP-hard

• Transform the Ncut equation into matrix form (Shi & Malik, 2000):

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\deg(A)} + \frac{\mathrm{cut}(A,B)}{\deg(B)} = \frac{(1+x)^T (D-W)(1+x)}{k\, \mathbf{1}^T D \mathbf{1}} + \frac{(1-x)^T (D-W)(1-x)}{(1-k)\, \mathbf{1}^T D \mathbf{1}} = \ldots, \quad k = \frac{\sum_{x_i > 0} D(i,i)}{\sum_i D(i,i)}$$

$$\min_x \mathrm{Ncut}(x) = \min_y \frac{y^T (D-W) y}{y^T D y} \quad \text{(a Rayleigh quotient)}$$

subject to: $y_i \in \{1, -b\}$, $y^T D \mathbf{1} = 0$

• Still an NP-hard problem

Relaxation

$$\min_x \mathrm{Ncut}(x) = \min_y \frac{y^T (D-W) y}{y^T D y} \quad \text{(Rayleigh quotient)}$$

subject to: $y_i \in \{1, -b\}$, $y^T D \mathbf{1} = 0$

• Instead, relax into the continuous domain by solving a generalized eigenvalue system:

$$\min_y\; y^T (D-W) y, \quad \text{s.t. } y^T D y = 1$$

• Which gives: $(D-W) y = \lambda D y$ (by the Rayleigh quotient theorem)

• Note that $(D - W)\mathbf{1} = 0$, so the first eigenvector is $y_0 = \mathbf{1}$, with eigenvalue 0.

• The eigenvector with the second smallest eigenvalue is the real-valued solution to this relaxed problem!
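A short sketch of this relaxation using scipy's generalized symmetric eigensolver; the helper name second_eigvec is ours, not from the lecture.

```python
import numpy as np
from scipy.linalg import eigh

def second_eigvec(W):
    """Relaxed Ncut solution: second-smallest generalized eigenvector."""
    D = np.diag(W.sum(axis=1))
    evals, evecs = eigh(D - W, D)   # solves (D - W) y = lambda D y, ascending
    # evals[0] ~ 0 with y0 ~ constant (the trivial eigenvector 1);
    # column 1 is the real-valued relaxed cut indicator
    return evals[1], evecs[:, 1]
```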


Algorithm

1. Define a similarity function between two nodes, e.g.:

$$w_{ij} = e^{-\frac{\|X^{(i)} - X^{(j)}\|_2^2}{2\sigma_X^2}}$$

2. Compute the affinity matrix W and the degree matrix D.

3. Solve $(D - W) y = \lambda D y$

  • e.g., via an eigendecomposition of the graph Laplacian $L = D - W$ (equivalently an SVD, since L is symmetric positive semidefinite): $L = V \Lambda V^T \Rightarrow y^*$

4. Use the eigenvector $y^*$ with the second smallest eigenvalue to bipartition the graph:

  • For each threshold k, $A_k = \{i \mid y^*_i \text{ among the } k \text{ largest elements of } y^*\}$, $B_k = \{i \mid y^*_i \text{ among the } n-k \text{ smallest elements of } y^*\}$

  • Compute $\mathrm{Ncut}(A_k, B_k)$

  • Output $A_{k^*}$ and $B_{k^*}$, where $k^* = \arg\min_k \mathrm{Ncut}(A_k, B_k)$
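A sketch of steps 1-4 in numpy/scipy, assuming the Gaussian affinity above; the function names and data interface are illustrative, not prescribed by the lecture.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_value(W, mask):
    """Ncut(A, B) for the partition A = mask, B = ~mask."""
    cut = W[np.ix_(mask, ~mask)].sum()
    return cut / W[mask].sum() + cut / W[~mask].sum()

def ncut_bipartition(X, sigma=1.0):
    # Steps 1-2: affinity matrix W and degree matrix D
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    # Step 3: generalized eigenproblem (D - W) y = lambda D y
    _, evecs = eigh(D - W, D)
    y = evecs[:, 1]                         # second-smallest eigenvector y*
    # Step 4: sweep thresholds along sorted y*, keep the minimum-Ncut split
    order = np.argsort(y)
    best = None
    for k in range(1, len(y)):
        mask = np.zeros(len(y), dtype=bool)
        mask[order[:k]] = True
        nc = ncut_value(W, mask)
        if best is None or nc < best[0]:
            best = (nc, mask)
    return best                             # (Ncut value, boolean partition)
```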


Ideally …

$$\mathrm{Ncut}(A,B) = \frac{y^T (D-W) y}{y^T D y}, \quad \text{with } y_i \in \{1, -b\},\; y^T D \mathbf{1} = 0$$

$$(D-W) y = \lambda D y$$

[Figure: in the ideal case, the entries y_i of the second eigenvector are piecewise constant, taking one value on cluster A and another on its complement]


Example


Poor features can lead to poor outcomes (Xing et al., 2002)


Cluster vs. Cut

$$\mathrm{Degree}(A) = d_A = \sum_{i \in A,\, j \in V} W_{ij}$$

$$\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{d_A} + \frac{\mathrm{cut}(A,B)}{d_B}$$

[Figure: two partitions (A, B) of the same graph, comparing their degrees and normalized cuts]


Compare to Minimum Cut

• Criterion for partition:

$$\min_{A,B} \mathrm{cut}(A,B) = \min_{A,B} \sum_{i \in A,\, j \in B} W_{ij}$$

Problem! The weight of a cut is directly proportional to the number of edges in the cut, so minimum cut favors chopping off small, isolated sets of nodes.

[Figure: cuts with lesser weight than the ideal cut vs. the ideal cut]

First proposed by Wu and Leahy.


Superior performance?

• K-means and Gaussian mixture methods are biased toward convex clusters


Ncut is superior in certain cases


Why?


General Spectral Clustering

[Figure: from data points to a matrix of pairwise similarities]


Representation

• Partition matrix $X$ (pixels × segments): $X = [X_1, \ldots, X_K]$

• Pair-wise similarity matrix $W$: $W(i,j) = \mathrm{aff}(i,j)$

• Degree matrix $D$: $D(i,i) = \sum_j w_{ij}$

• Laplacian matrix $L$: $L = D - W$


Eigenvectors and blocks

• Block matrices have block eigenvectors:

$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix} \xrightarrow{\text{eigensolver}} \begin{pmatrix} .71 & 0 \\ .71 & 0 \\ 0 & .71 \\ 0 & .71 \end{pmatrix}, \qquad \lambda_1 = 2,\ \lambda_2 = 2,\ \lambda_3 = 0,\ \lambda_4 = 0$$

• Near-block matrices have near-block eigenvectors:

$$\begin{pmatrix} 1 & 1 & .2 & 0 \\ 1 & 1 & 0 & -.2 \\ .2 & 0 & 1 & 1 \\ 0 & -.2 & 1 & 1 \end{pmatrix} \xrightarrow{\text{eigensolver}} \begin{pmatrix} .71 & 0 \\ .69 & -.14 \\ .14 & .69 \\ 0 & .71 \end{pmatrix}, \qquad \lambda_1 = 2.02,\ \lambda_2 = 2.02,\ \lambda_3 = -0.02,\ \lambda_4 = -0.02$$
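The near-block example can be checked numerically; the matrix below is copied from the slide, and the eigenvalues should come out near 2.02 and −0.02.

```python
import numpy as np

W = np.array([[ 1.0,  1.0,  0.2,  0.0],
              [ 1.0,  1.0,  0.0, -0.2],
              [ 0.2,  0.0,  1.0,  1.0],
              [ 0.0, -0.2,  1.0,  1.0]])

evals, evecs = np.linalg.eigh(W)    # ascending eigenvalues
print(np.round(evals[::-1], 2))     # ~ [ 2.02  2.02 -0.02 -0.02]
# Top-2 eigenvectors are near-block (up to sign/rotation within the
# nearly degenerate eigenspace):
print(np.round(evecs[:, -2:], 2))
```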


Spectral Space

• Can put items into blocks by their eigenvector entries:

$$\begin{pmatrix} 1 & 1 & .2 & 0 \\ 1 & 1 & 0 & -.2 \\ .2 & 0 & 1 & 1 \\ 0 & -.2 & 1 & 1 \end{pmatrix} \longrightarrow e_1 = \begin{pmatrix} .71 \\ .69 \\ .14 \\ 0 \end{pmatrix},\quad e_2 = \begin{pmatrix} 0 \\ -.14 \\ .69 \\ .71 \end{pmatrix}$$

[Figure: plotting each item's pair of coordinates in the (e1, e2) plane separates the two blocks]

• Clusters are clear regardless of row ordering:

$$\begin{pmatrix} 1 & .2 & 1 & 0 \\ .2 & 1 & 0 & 1 \\ 1 & 0 & 1 & -.2 \\ 0 & 1 & -.2 & 1 \end{pmatrix} \longrightarrow e_1 = \begin{pmatrix} .71 \\ .14 \\ .69 \\ 0 \end{pmatrix},\quad e_2 = \begin{pmatrix} 0 \\ .69 \\ -.14 \\ .71 \end{pmatrix}$$

[Figure: the same items land in the same two groups in the (e1, e2) plane]
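A quick check of the row-ordering claim: permuting the items permutes the eigenvector entries the same way, so each item's spectral coordinates (and hence the clusters) are unchanged.

```python
import numpy as np

W = np.array([[ 1.0,  1.0,  0.2,  0.0],
              [ 1.0,  1.0,  0.0, -0.2],
              [ 0.2,  0.0,  1.0,  1.0],
              [ 0.0, -0.2,  1.0,  1.0]])
perm = [0, 2, 1, 3]                   # the reordering used on the slide
Wp = W[np.ix_(perm, perm)]            # same graph, rows/columns shuffled

_, V  = np.linalg.eigh(W)
_, Vp = np.linalg.eigh(Wp)
# Each row below is one item's coordinates in spectral space (top-2
# eigenvectors). Rows within a block nearly coincide and rows across
# blocks are nearly orthogonal, whichever ordering we use:
print(np.round(V[:, -2:], 2))
print(np.round(Vp[:, -2:], 2))
```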

Spectral Clustering

• Algorithms that cluster points using eigenvectors of matrices derived from the data

• Obtain a data representation in a low-dimensional space that can be easily clustered

• A variety of methods use the eigenvectors differently (we have seen an example)

• Empirically very successful

• Authors disagree on:

  • Which eigenvectors to use

  • How to derive clusters from these eigenvectors

• Two general methods


Method #1

• Partition using only one eigenvector at a time

• Use the procedure recursively

• Example: image segmentation

  • Uses the eigenvector with the second smallest eigenvalue to define the optimal cut

  • Recursively generates two clusters with each cut (a sketch of this recursion follows below)
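An illustrative sketch of the recursion, assuming the ncut_bipartition helper from the earlier algorithm sketch; the stopping threshold max_ncut and the minimum cluster size are assumptions.

```python
import numpy as np

def recursive_ncut(X, idx=None, max_ncut=0.1, min_size=5):
    """Recursively bipartition; stop when the best cut is no longer cheap."""
    idx = np.arange(len(X)) if idx is None else idx
    if len(idx) < 2 * min_size:
        return [idx]                           # too small to split further
    nc, mask = ncut_bipartition(X[idx])        # one 2-way spectral cut
    if nc > max_ncut:
        return [idx]                           # no sufficiently good cut left
    return (recursive_ncut(X, idx[mask], max_ncut, min_size) +
            recursive_ncut(X, idx[~mask], max_ncut, min_size))
```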


Method #2

• Use k eigenvectors (k chosen by the user)

• Directly compute a k-way partitioning

• Experimentally, this has been seen to work "better"


Spectral Clustering Algorithm (Ng, Jordan, and Weiss, 2003)

• Given a set of points $S = \{s_1, \ldots, s_n\}$

• Form the affinity matrix

$$w_{ij} = e^{-\frac{\|s_i - s_j\|_2^2}{2\sigma^2}} \;\; \forall i \neq j, \qquad w_{ii} = 0$$

• Define the degree matrix $D_{ii} = \sum_k w_{ik}$

• Form the normalized matrix $L = D^{-1/2} W D^{-1/2}$

• Stack the k largest eigenvectors of $L$ to form the columns of a new matrix $X$:

$$X = [x_1\ x_2\ \cdots\ x_k]$$

• Renormalize each row of $X$ to unit length to get a matrix $Y$; cluster the rows of $Y$ as points in $\mathbb{R}^k$
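A compact numpy/scipy rendering of these steps; σ and the use of scipy's kmeans2 for the final clustering step are our choices, not prescribed by the slide.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def njw_spectral_clustering(S, k, sigma=1.0):
    # Affinity matrix with zero diagonal
    sq = ((S[:, None, :] - S[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma**2))
    np.fill_diagonal(W, 0.0)
    # Normalized matrix L = D^{-1/2} W D^{-1/2}
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Columns of X = the k largest eigenvectors of L
    _, evecs = np.linalg.eigh(L)               # ascending eigenvalues
    X = evecs[:, -k:]
    # Row-normalize X to get Y, then cluster rows of Y in R^k
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)
    _, labels = kmeans2(Y, k, minit='++')
    return labels
```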


SC vs. K-means


Why does it work?

• It is k-means in the spectral space!


More formally …

• Recall the generalized Ncut:

$$\mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \sum_{r=1}^{k} \frac{\sum_{i \in A_r,\, j \in V \setminus A_r} W_{ij}}{\sum_{i \in A_r,\, j \in V} W_{ij}} = \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$$

• Minimizing this is equivalent to spectral clustering:

$$\min_{A_1, \ldots, A_k} \mathrm{Ncut}(A_1, A_2, \ldots, A_k) = \min \sum_{r=1}^{k} \frac{\mathrm{cut}(A_r, \bar{A}_r)}{d_{A_r}}$$

which relaxes to a trace problem over the (pixels × segments) partition matrix $Y$:

$$\max_Y\ \mathrm{tr}\!\left(Y^T D^{-1/2} W D^{-1/2} Y\right) \quad \text{s.t. } Y^T Y = I$$


Toy examples

Images from Matthew Brand (TR-2002-42)


User’s Prerogative

• Choice of k, the number of clusters

• Choice of the scaling factor $\sigma^2$

  • Realistically, search over $\sigma^2$ and pick the value that gives the tightest clusters (a search sketch follows below)

• Choice of clustering method: k-way or recursive bipartite

• Kernel affinity matrix:

$$w_{ij} = K(s_i, s_j)$$
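One way to act on the "search over σ²" bullet, assuming the njw_spectral_clustering sketch from earlier; the σ grid and the tightness criterion (mean within-cluster spread) are stand-in assumptions.

```python
import numpy as np

def pick_sigma(S, k, sigmas=(0.05, 0.1, 0.2, 0.5, 1.0, 2.0)):
    """Grid-search sigma; keep the value giving the tightest clusters."""
    best = None
    for sigma in sigmas:
        labels = njw_spectral_clustering(S, k, sigma=sigma)
        # Tightness heuristic: mean distance to the assigned cluster mean
        tight = np.mean([np.linalg.norm(S[labels == c] -
                                        S[labels == c].mean(0), axis=1).mean()
                         for c in np.unique(labels)])
        if best is None or tight < best[0]:
            best = (tight, sigma)
    return best[1]
```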


Conclusions

• Good news:

  • Simple and powerful methods to segment images

  • Flexible and easy to apply to other clustering problems

• Bad news:

  • High memory requirements (use sparse matrices)

  • Very dependent on the scale factor $\sigma_X$ for a specific problem:

$$W(i,j) = e^{-\frac{\|X^{(i)} - X^{(j)}\|_2^2}{2\sigma_X^2}}$$

