2 Fully Connected Graphs

Given a set of data points $X = (x_1, \dots, x_n)^T$, $x_i \in \mathbb{R}^m$, $i = 1, 2, \dots, n$, we can form an undirected graph $G = (V, E)$ in which vertex $v_i$ represents $x_i$. Based on the undirected graph $G$, we can construct a weighted adjacency matrix $W = (w_{ij})_{n \times n}$, where $w_{ij} = w_{ji} \geq 0$. To get $W$, we first construct the adjacency graph through the $\epsilon$-neighborhood, $k$-nearest neighbor or other methods, or simply connect all vertices, and then put weights on the connected edges by a similarity function. The most common similarity function is the RBF kernel, where $w_{ij} = \exp\left(-\|x_i - x_j\|^2 / (2\delta^2)\right)$. If $w_{ij} > 0$ for all pairs of vertices $v_i$ and $v_j$, $G$ becomes a fully connected graph in which every vertex connects to every other vertex.
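As a minimal sketch of the fully connected construction, the following computes the RBF weights for a small data set. The function name `rbf_weights`, the toy points, and the bandwidth value are illustrative assumptions, not part of the original text.

```python
import numpy as np

def rbf_weights(X, delta=1.0):
    """Fully connected weighted adjacency matrix W from the RBF kernel."""
    # Squared Euclidean distances ||x_i - x_j||^2 for all pairs (n x n).
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from round-off
    return np.exp(-sq_dists / (2.0 * delta**2))

# Hypothetical toy data: n = 6 points in R^2, forming two loose groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.0],
              [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
W = rbf_weights(X, delta=1.0)
```

Every entry of `W` is strictly positive, so the resulting graph is fully connected. The diagonal entries equal 1 (self-similarity); they cancel in $D - W$ and so do not affect the Laplacian.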

Given the weighted adjacency matrix, the degree matrix is defined as a diagonal matrix $D = \mathrm{diag}(D_{ii})_{n \times n}$, where $D_{ii} = \sum_{j=1}^{n} w_{ij}$ is the degree of vertex $v_i$. Given $W$ and $D$, we have the unnormalized graph Laplacian

$$L = D - W \tag{1}$$

and two common normalized graph Laplacians

$$L^{sym} := D^{-\frac{1}{2}} L D^{-\frac{1}{2}}, \qquad L^{rw} := D^{-1} L \tag{2}$$

where $L^{sym}$ is the symmetric normalized graph Laplacian and $L^{rw}$ is the random walk normalized graph Laplacian.
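The three Laplacians in (1) and (2) can be sketched as follows. The helper name `graph_laplacians` and the 4-vertex weight matrix are hypothetical; the code assumes a symmetric $W$ with strictly positive degrees.

```python
import numpy as np

def graph_laplacians(W):
    """Return (L, L_sym, L_rw) for a symmetric weight matrix W with positive degrees."""
    deg = W.sum(axis=1)                                      # D_ii = sum_j w_ij
    L = np.diag(deg) - W                                     # equation (1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]    # D^{-1/2} L D^{-1/2}
    L_rw = L / deg[:, None]                                  # D^{-1} L
    return L, L_sym, L_rw

# Hypothetical 4-vertex graph with positive weights between all distinct vertices.
W = np.array([[0.0, 0.8, 0.3, 0.1],
              [0.8, 0.0, 0.5, 0.2],
              [0.3, 0.5, 0.0, 0.9],
              [0.1, 0.2, 0.9, 0.0]])
L, L_sym, L_rw = graph_laplacians(W)
```

Both $L$ and $L^{rw}$ map the all-ones vector to zero, while $L^{sym}$ stays symmetric.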

2.1 The Unnormalized Graph Laplacians

The absence of an objective function for the dimension reduction step is why spectral clustering on a fully connected graph appears to work mysteriously. A good objective function for dimension reduction should preserve the similarities between data points. Therefore, the covariance between the similarities of the original data points and the similarities of the mapped data points is a natural choice. Let $Y = (y_1, \dots, y_n)^T$ be the low-dimensional representation, where $y_i \in \mathbb{R}^k$ represents $x_i$, $i = 1, 2, \dots, n$, and let $d_{ij}$ be the dissimilarity between $y_i$ and $y_j$, that is

$$d_{ij} = \|y_i - y_j\|^2. \tag{3}$$

Let the adjacency weight $w_{ij}$ be the similarity between $x_i$ and $x_j$. Taking $-d_{ij}$ as the similarity between the mapped points, the covariance between the similarities becomes

$$\mathrm{cov}(-d, w) = -\frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{n} (d_{ij} - \bar{d})(w_{ij} - \bar{w}) \tag{4}$$

where $w = (w_{11}, w_{12}, \dots, w_{nn})$, $d = (d_{11}, d_{12}, \dots, d_{nn})$, and $\bar{d}$, $\bar{w}$ are the means. Since for any constant $c \neq 0$, we have

$$\mathrm{cov}(cd, w) = c \, \mathrm{cov}(d, w) \tag{5}$$

and only the relative information matters for $Y$, it is natural to constrain the mapped data to have zero mean and fixed norms. That is

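$$\sum_{i=1}^{n} y_{it} = 0, \quad t = 1, \dots, k, \qquad \|y_i\|^2 = c_i, \quad i = 1, \dots, n \tag{6}$$

where $c_i \neq 0$ is a constant. Therefore, the covariance becomes

$$\mathrm{cov}(-d, w) = -\frac{1}{n} \mathrm{tr}(Y^T L Y) + \text{constant} \tag{7}$$

and the optimization problem becomes

$$\begin{aligned} \underset{Y}{\text{minimize}} \quad & \mathrm{trace}(Y^T L Y) \\ \text{subject to} \quad & \|y_i\|^2 = c_i, \quad i = 1, 2, \dots, n \\ & \sum_{i=1}^{n} y_{it} = 0, \quad t = 1, 2, \dots, k. \end{aligned} \tag{8}$$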
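The step from the double sum to the trace form can be checked numerically. The sketch below assumes the $\frac{1}{2n}$ scaling of the double sum and uses hypothetical random weights and a random embedding. It relies on the identity $\sum_{i,j} w_{ij}\|y_i - y_j\|^2 = 2\,\mathrm{tr}(Y^T L Y)$, which leaves $\frac{n}{2}\bar{d}\bar{w}$ as the remaining term; that term is constant once the norms and means of the $y_i$ are fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2

# Hypothetical symmetric positive weights and an arbitrary embedding Y (n x k).
A = rng.random((n, n))
W = (A + A.T) / 2.0
L = np.diag(W.sum(axis=1)) - W
Y = rng.standard_normal((n, k))

# d_ij = ||y_i - y_j||^2 for all pairs.
diff = Y[:, None, :] - Y[None, :, :]
d = np.sum(diff**2, axis=2)

# Double-sum form of the covariance, with the 1/(2n) scaling.
cov_neg_d_w = -np.sum((d - d.mean()) * (W - W.mean())) / (2.0 * n)

# Trace form plus the term that is constant under the constraints.
trace_term = -np.trace(Y.T @ L @ Y) / n
constant_term = n * d.mean() * W.mean() / 2.0

# The mean dissimilarity depends only on the norms of the centered points.
Yc = Y - Y.mean(axis=0)
d_mean_closed_form = 2.0 * np.sum(Yc**2) / n
```

Here `cov_neg_d_w` agrees with `trace_term + constant_term` up to round-off, and `d.mean()` matches `d_mean_closed_form`, which is why fixing the norms of zero-mean points makes the extra term constant.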

Let $\lambda_0, \dots, \lambda_k, \dots, \lambda_{n-1}$ be the eigenvalues of $L$ sorted in ascending order, and $f_0, f_1, \dots, f_k$ be the first $k+1$ corresponding eigenvectors. Since the number of zero eigenvalues of a graph Laplacian equals the number of connected components, and $L$ is the positive semi-definite Laplacian of a fully connected graph, we have

$$0 = \lambda_0 < \lambda_1 \leq \dots \leq \lambda_{n-1}. \tag{9}$$

Let $\vec{1}$ be the $n$-dimensional vector of all ones and $\vec{0}$ be the $n$-dimensional vector of all zeros. Since $L\vec{1} = \vec{0}$, we have

$$f_0 = (c_0, \dots, c_0)^T, \qquad \vec{1} \cdot f_t = 0, \quad t = 1, 2, \dots, k \tag{10}$$

where $c_0$ is a constant. As a constant vector, $f_0$ cannot satisfy the constraints in optimization problem (8) simultaneously. Moreover, since $L$ is positive semi-definite, the solution of optimization problem (8) becomes the first $k$ non-constant eigenvectors, that is

$$Y^* = (f_1, f_2, \dots, f_k). \tag{11}$$

The classical spectral clustering uses the first k eigenvectors, that is

$$Y^{sp} = (f_0, f_1, \dots, f_{k-1}). \tag{12}$$
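The two choices of eigenvectors can be compared directly. This is a small sketch on a hypothetical random fully connected graph; it relies on `numpy.linalg.eigh` returning eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 7, 2

# Hypothetical fully connected graph: strictly positive symmetric weights.
A = rng.random((n, n)) + 0.1
W = (A + A.T) / 2.0
L = np.diag(W.sum(axis=1)) - W

# eigh returns ascending eigenvalues and orthonormal eigenvectors (as columns).
eigvals, F = np.linalg.eigh(L)

f0 = F[:, 0]              # constant vector with eigenvalue 0
Y_star = F[:, 1:k + 1]    # first k non-constant eigenvectors
Y_sp = F[:, :k]           # classical choice: first k eigenvectors, including f0
```

The columns of `Y_star` sum to zero because they are orthogonal to the constant vector `f0`, and `Y_sp` differs from `Y_star` only by carrying `f0` in place of the $k$-th non-constant eigenvector.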

As a constant vector, $f_0$ provides no extra information, which means that removing $f_0$ makes no difference to the clustering result. Therefore, we only need to adjust the number of clusters from $k - 1$ to $k$.

2.2 The Normalized Graph Laplacians

For the symmetric normalized graph Laplacian $L^{sym}$, we have

$$L^{sym}_{ij} = \frac{L_{ij}}{\sqrt{D_{ii} D_{jj}}}. \tag{13}$$

Let

$$z^{sym}_i = \frac{y_i}{\sqrt{D_{ii}}} \tag{14}$$

and

$$d^{sym}_{ij} = \|z^{sym}_i - z^{sym}_j\|^2. \tag{15}$$

Similar to the unnormalized graph, we add constraints to $z^{sym}_i$

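$$\|z^{sym}_i\| = c^{sym}_i, \quad i = 1, 2, \dots, n, \qquad \sum_{i=1}^{n} z^{sym}_{it} = 0, \quad t = 1, 2, \dots, k \tag{16}$$

where $c^{sym}_i \neq 0$ is a constant. Based on the constraints, we have

$$\begin{aligned} \mathrm{cov}(-d^{sym}, w) &= -\frac{1}{2n} \sum_{i=1}^{n} \sum_{j=1}^{n} (d^{sym}_{ij} - \bar{d}^{sym})(w_{ij} - \bar{w}) \\ &= -\frac{1}{n} \mathrm{tr}(Y^T L^{sym} Y) + \text{constant}. \end{aligned} \tag{17}$$

Therefore, the symmetric normalized graph Laplacian optimization problem becomes

$$\begin{aligned} \underset{Y}{\text{minimize}} \quad & \mathrm{trace}(Y^T L^{sym} Y) \\ \text{subject to} \quad & \|z^{sym}_i\| = c^{sym}_i, \quad i = 1, 2, \dots, n \\ & \sum_{i=1}^{n} z^{sym}_{it} = 0, \quad t = 1, 2, \dots, k. \end{aligned} \tag{18}$$

Similar to the unnormalized graph Laplacian, the solution becomes the first $k$ non-constant eigenvectors of $L^{sym}$, that is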

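$$Y^{sym} = (f^{sym}_1, \dots, f^{sym}_k) \tag{19}$$

and

$$Z^{sym} = \Lambda^{-\frac{1}{2}} Y^{sym} \tag{20}$$

where $\Lambda^{-\frac{1}{2}} = \mathrm{diag}\left(\frac{1}{\sqrt{\lambda_{ii}}}\right)_{n \times n}$ is a diagonal matrix. Comparing with (14), its diagonal entries are $\lambda_{ii} = D_{ii}$. It seems that there is a strong connection between the normalized graph Laplacian and the commute time embedding in [1]. The random walk normalized graph Laplacian $L^{rw}$ is handled in the same way as the symmetric normalized graph Laplacian $L^{sym}$: we only need to set $z^{rw}_j = y_j / \sqrt{D_{jj}}$.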
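The substitution $z_i = y_i / \sqrt{D_{ii}}$ is what turns the weighted dissimilarity into a trace involving $L^{sym}$. The following sketch checks that identity numerically on hypothetical random weights and an arbitrary $Y$; it does not enforce the constraints, which only affect the constant term.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 2

# Hypothetical fully connected graph with symmetric positive weights.
A = rng.random((n, n)) + 0.1
W = (A + A.T) / 2.0
deg = W.sum(axis=1)
L = np.diag(deg) - W
L_sym = L / np.sqrt(np.outer(deg, deg))    # entrywise L_ij / sqrt(D_ii D_jj)

Y = rng.standard_normal((n, k))
Z = Y / np.sqrt(deg)[:, None]              # z_i = y_i / sqrt(D_ii)

# Pairwise squared distances between the rescaled points.
diff = Z[:, None, :] - Z[None, :, :]
d_sym = np.sum(diff**2, axis=2)

weighted_dissimilarity = np.sum(W * d_sym)
trace_form = 2.0 * np.trace(Y.T @ L_sym @ Y)
```

The two quantities `weighted_dissimilarity` and `trace_form` agree up to round-off, mirroring the unnormalized case with $L$ replaced by $L^{sym}$.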

3 Multi-Connected Graphs

For a multi-connected graph with $k$ connected components, the weighted adjacency matrix becomes a block diagonal matrix after sorting the vertices according to the connected components they belong to, and the graph Laplacian $L^m$ becomes

$$L^m = \begin{pmatrix} L_1 & & & \\ & L_2 & & \\ & & \ddots & \\ & & & L_k \end{pmatrix} \tag{21}$$

where each block $L_t$, $t = 1, 2, \dots, k$, is the Laplacian of a fully connected subgraph and has one and only one eigenvalue equal to 0. Let $u^m_t$ be the first eigenvector of $L_t$. Then $u^m_t$ is a constant vector whose corresponding eigenvalue equals zero, and the first $k$ eigenvectors of $L^m$ become

$$U^m = \begin{pmatrix} u^m_1 & & & \\ & u^m_2 & & \\ & & \ddots & \\ & & & u^m_k \end{pmatrix}. \tag{22}$$

Therefore, $U^m$ becomes the indicator of the connected components.
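A small numerical sketch of the block structure, using a hypothetical graph with two components. One caveat that matters in practice: because the eigenvalue 0 is repeated $k$ times, a numerical eigensolver may return any orthonormal basis of that eigenspace rather than the block-shaped vectors themselves, so the check below compares subspaces.

```python
import numpy as np

def laplacian(W):
    """Unnormalized graph Laplacian D - W."""
    return np.diag(W.sum(axis=1)) - W

# Hypothetical graph with k = 2 connected components of sizes 3 and 2.
W1 = np.array([[0.0, 1.0, 0.5],
               [1.0, 0.0, 0.7],
               [0.5, 0.7, 0.0]])
W2 = np.array([[0.0, 0.4],
               [0.4, 0.0]])
n1, n2 = W1.shape[0], W2.shape[0]
W = np.zeros((n1 + n2, n1 + n2))
W[:n1, :n1] = W1
W[n1:, n1:] = W2
L_m = laplacian(W)

eigvals, U = np.linalg.eigh(L_m)
U_k = U[:, :2]                      # orthonormal basis of the eigenvalue-0 eigenspace

# Component indicator vectors (block-constant columns).
indicators = np.zeros((n1 + n2, 2))
indicators[:n1, 0] = 1.0
indicators[n1:, 1] = 1.0

# Projecting the indicators onto span(U_k) should reproduce them.
projected = U_k @ (U_k.T @ indicators)
```

Since `projected` equals `indicators`, the first $k$ eigenvectors span exactly the space of component indicators, which is why clustering the rows of `U_k` recovers the components.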

4 The Equivalence To PCA

Principal component analysis (PCA) is a commonly used linear dimension reduction method that attempts to construct a low-dimensional representation preserving as much variance as possible. After removing the means, $X^T X$ can be recognized as the covariance matrix (up to a scaling factor), and the $k$-dimensional representation $Y^{PCA*}$ becomes

$$Y^{PCA*} = X M_k \tag{23}$$

where $M_k$ is an $m \times k$ matrix whose columns are the eigenvectors of $X^T X$ corresponding to the $k$ largest eigenvalues, sorted by eigenvalue in descending order. Denote by $\Delta_{k \times k}$ the diagonal matrix whose diagonal entries are the $k$ largest eigenvalues of $X^T X$ in the same order, thus

$$X X^T Y^{PCA*} = X X^T X M_k = X M_k \Delta_k = Y^{PCA*} \Delta_k. \tag{24}$$
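The chain of equalities in (24) can be verified on random data. The sketch below uses hypothetical dimensions and a plain eigen-decomposition of $X^T X$ rather than any particular PCA library routine.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 10, 4, 2

# Hypothetical data matrix with the column means removed.
X = rng.standard_normal((n, m))
X = X - X.mean(axis=0)

# Eigen-decomposition of X^T X; eigh sorts eigenvalues in ascending order.
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1][:k]     # indices of the k largest eigenvalues
M_k = evecs[:, order]                   # m x k
Delta_k = np.diag(evals[order])         # k x k

Y_pca = X @ M_k                         # k-dimensional PCA representation
G = X @ X.T                             # n x n Gram matrix
```

Multiplying `G @ Y_pca` gives `Y_pca @ Delta_k`, so each column of `Y_pca` is an (unnormalized) eigenvector of the Gram matrix with the same eigenvalue as the corresponding column of `M_k`.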

Therefore, the columns of $Y^{PCA*}$ are the eigenvectors of $G = XX^T$ with the $k$ largest eigenvalues. To make PCA equal to spectral embedding, we need to construct a fully connected graph Laplacian $L^{PCA}$ whose $k$ smallest non-constant eigenvectors equal $Y^{PCA*}$. Based on $G$, we choose the cosine similarity as the similarity function, and to make sure $w_{ij} > 0$, we let $w_{ij} = 2 + \cos\theta_{ij}$. In this way, $L^{PCA}$ becomes a positive semi-definite fully connected graph Laplacian. After applying standardization to $X$, so that each $x_i$ has unit norm and the columns of $X$ have zero mean, we have

$$w_{ij} = 2 + x_i^T x_j \tag{25}$$

and

$$D_{ii} = \sum_{j=1}^{n} w_{ij} = 2n \tag{26}$$

and the Laplacian matrix becomes

$$L^{PCA} = 2nI - 2H - G \tag{27}$$

where $H = (1)_{n \times n}$ is the $n \times n$ all-ones matrix. Let $\beta_0, \dots, \beta_k, \dots, \beta_{n-1}$ be the eigenvalues of $L^{PCA}$ sorted in ascending order, and $u_0, \dots, u_k, \dots, u_{n-1}$ be the corresponding eigenvectors. Then we have

$$0 = \beta_0 < \beta_1 \leq \dots \leq \beta_k \leq \dots \leq \beta_{n-1} \tag{28}$$

and the solution of $L^{PCA}$ becomes

$$Y^{PCA*}_L = (u_1, \dots, u_k). \tag{29}$$

For $t = 1, 2, \dots, n-1$, since $\vec{1} \cdot u_t = 0$, with equation (27), we have

$$G u_t = (2n - \beta_t) u_t. \tag{30}$$

For $\beta_0 = 0$ and $u_0$, since $X$ has zero means and $u_0$ is a constant vector, we have

$$G u_0 = \vec{0}. \tag{31}$$

What's more, since $G$ is positive semi-definite, the eigenvalues of $G$ become

$$2n - \beta_1 \geq \dots \geq 2n - \beta_k \geq \dots \geq 2n - \beta_{n-1} \geq \beta_0 = 0. \tag{32}$$

In this way, $u_1, \dots, u_k$ become the eigenvectors of $G$ with the $k$ largest eigenvalues, which is the solution of $k$-dimensional PCA,

$$Y^{PCA*} = (u_1, \dots, u_k) = Y^{PCA*}_L. \tag{33}$$
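The construction can be checked end to end on synthetic data. The sketch below makes one loud assumption: "standardization" is taken to mean unit-norm rows together with zero column means, and stacking a set of unit vectors with their negatives is just one hypothetical way to obtain both. Eigenvectors are only determined up to sign and scale, so the comparison is made between the subspaces they span.

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, k = 5, 3, 2

# Assumed standardization: unit-norm rows AND zero column means.
# Stacking P and -P is one hypothetical way to get both at once.
P = rng.standard_normal((p, m))
P = P / np.linalg.norm(P, axis=1, keepdims=True)
X = np.vstack([P, -P])
n = X.shape[0]

G = X @ X.T                         # entries are cos(theta_ij) for unit-norm rows
W = 2.0 + G                         # w_ij = 2 + x_i^T x_j
deg = W.sum(axis=1)                 # every degree equals 2n
H = np.ones((n, n))
L_pca = np.diag(deg) - W            # equals 2nI - 2H - G

beta, U = np.linalg.eigh(L_pca)     # ascending eigenvalues
U_L = U[:, 1:k + 1]                 # k smallest non-constant eigenvectors

gamma, V = np.linalg.eigh(G)
V_top = V[:, ::-1][:, :k]           # eigenvectors of the k largest eigenvalues of G

# Compare the spanned subspaces through their orthogonal projectors.
proj_L = U_L @ U_L.T
proj_G = V_top @ V_top.T
```

With this data, `proj_L` and `proj_G` coincide up to round-off, and each non-constant eigenvector $u_t$ satisfies $G u_t = (2n - \beta_t) u_t$, so the spectral embedding and the PCA directions span the same subspace.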

5 Conclusions

In this paper, we explain the mathematics behind spectral clustering by dividing graph Laplacians into two categories based on whether the graph is fully connected or not. For a fully connected graph, we consider spectral clustering as a two-step algorithm that first reduces dimension and then applies a standard clustering algorithm to the lower dimensional representation. We explain the spectral embedding part by offering an objective function, namely the covariance between the similarities of the original data points and the similarities of the mapped data points. For a multi-connected graph, we prove that with a proper $k$, the first $k$ eigenvectors are the indicators of the connected components. This paper also proves the equivalence between spectral embedding and PCA by setting the cosine similarity as the similarity function when constructing the weighted adjacency matrix. Since the choice of similarity function is flexible, spectral embedding should have more equivalent dimension reduction algorithms.

