arXiv:2103.00733v1 [stat.ML] 1 Mar 2021

The Mathematics Behind Spectral Clustering And The Equivalence To PCA

A PREPRINT

T. Shen
[email protected]

February 28, 2021

Abstract

Spectral clustering is a popular algorithm that clusters points using the eigenvalues and eigenvectors of Laplacian matrices derived from the data. For years, spectral clustering has been working mysteriously. This paper explains spectral clustering by dividing graph Laplacians into two categories based on whether the graph is fully connected or not. For a fully connected graph, this paper demonstrates the dimension reduction part by offering an objective function: the covariance between the original data points' similarities and the mapped data points' similarities. For a multi-connected graph, this paper proves that with a proper $k$, the first $k$ eigenvectors are the indicators of the connected components. This paper also proves there is an equivalence between spectral embedding and PCA.

Keywords Spectral Clustering · Graph Laplacian · Spectral Embedding · PCA · Dimension Reduction

1 Introduction

Spectral clustering is a popular clustering algorithm that can be easily solved by standard linear algebra methods. Despite its simplicity, spectral clustering has been working mysteriously, and for years different papers try to explain it from different views. Shi and Malik (2000) [2] use normalized cuts to measure the total dissimilarity between different groups and the total similarity within groups. By relaxing the indicator vectors to real values, the optimization problem becomes a generalized eigenvalue problem. However, there is no guarantee on the quality of the relaxed problem's solution compared to the exact spectral segmentation (von Luxburg, 2007) [6]. Meila and Shi (2001) [4] provide a random walk view of spectral segmentation by interpreting the similarities as edge flows in a Markov random walk, formulating the normalized cut problem as a spectral problem of the transition matrix of the random walk, and prove the equivalence between the normalized cuts method and the eigenvalues/eigenvectors of that transition matrix. Therefore, the random walk view shares the same problem with normalized cuts. Saerens, Fouss, Yen, and Dupont (2007) [1] use the Euclidean Commute Time Distance (ECTD) $n(i, j)$, the average time taken by a random walker starting from node $i$ to reach node $j$ and come back to node $i$, make transformations to an n-dimensional Euclidean space preserving the ECTD by spectral decomposition, and then get the k-dimensional projection by setting the last $n - k$ eigenvectors to zeros. For a fully connected graph, the commute time distance embedding has a strong connection with spectral embedding.

This paper explains spectral clustering by dividing graph Laplacians into two categories based on whether the graph is fully connected or not. For a fully connected graph, this paper considers spectral clustering as a two-step algorithm that first finds a low-dimensional representation based on the Laplacian matrix and then applies a classical clustering algorithm, such as k-means, to the low-dimensional representation. Section 2 focuses on explaining spectral embedding by providing an objective function: the covariance between the original data points' similarities and the mapped data points' similarities. Section 3 proves that for a multi-connected graph, with a proper $k$, the first $k$ eigenvectors are the indicators of the connected components. Section 4 proves there is an equivalence between spectral embedding and PCA. Section 5 gives the conclusions.

2 Fully Connected Graphs

Given a set of data points $X = (x_1, \dots, x_n)^T$, $x_i \in \mathbb{R}^m$, $i = 1, 2, \dots, n$, we can form an undirected graph $G = (V, E)$ in which vertex $v_i$ represents $x_i$. Based on the undirected graph $G$, we can construct a weighted adjacency matrix $W = (w_{ij})_{n \times n}$, where $w_{ij} = w_{ji} \geq 0$. To get $W$, we first construct the adjacency graph through $\epsilon$-neighborhood, k-nearest neighbor or other methods, or simply connect all vertices, and then put weights on the connected edges by a similarity function. The most common similarity function is the RBF kernel, where $w_{ij} = \exp(-\|x_i - x_j\|^2 / (2\delta^2))$. If $w_{ij} > 0$ for all pairs of vertices $v_i$ and $v_j$, $G$ becomes a fully connected graph in which every vertex connects to every other vertex.
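For concreteness, a minimal Python sketch of this construction is given below; the toy data and the bandwidth value $\delta = 1$ are illustrative assumptions, not part of the text.

```python
import numpy as np

def rbf_adjacency(X, delta=1.0):
    """Fully connected weighted adjacency matrix W with the RBF kernel."""
    # Pairwise squared Euclidean distances ||x_i - x_j||^2.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * delta ** 2))  # w_ij > 0 for every pair

# Toy data: six points forming two loose groups in R^2.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])
W = rbf_adjacency(X, delta=1.0)
print(W.shape)  # (6, 6), symmetric, all entries positive
```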

Given the weighted adjacency matrix, the degree matrix is defined as a diagonal matrix $D = \mathrm{diag}(D_{ii})_{n \times n}$, where $D_{ii} = \sum_{j=1}^{n} w_{ij}$ is the degree of vertex $v_i$. Given $W$ and $D$, we have the unnormalized graph Laplacian

$$L = D - W \qquad (1)$$

and two common normalized graph Laplacians

$$L^{sym} := D^{-\frac{1}{2}} L D^{-\frac{1}{2}}, \qquad L^{rw} := D^{-1} L \qquad (2)$$

where $L^{sym}$ is the symmetric normalized graph Laplacian and $L^{rw}$ is the random walk normalized graph Laplacian.
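The three Laplacians can be computed directly from any symmetric non-negative weight matrix $W$; the sketch below is one straightforward way to do it (dense matrices only, for illustration).

```python
import numpy as np

def graph_laplacians(W):
    """Return (L, L_sym, L_rw) for a symmetric non-negative weight matrix W."""
    d = W.sum(axis=1)                        # degrees D_ii = sum_j w_ij
    L = np.diag(d) - W                       # unnormalized Laplacian, eq. (1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt      # symmetric normalized, eq. (2)
    L_rw = np.diag(1.0 / d) @ L              # random walk normalized, eq. (2)
    return L, L_sym, L_rw
```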

2.1 The Unnormalized Graph Laplacians

The absence of an objective function for the dimension reduction part is why spectral clustering appears to work mysteriously for a fully connected graph. A good objective function for dimension reduction should preserve data points' similarities. Therefore, the covariance between the original data points' similarities and the mapped data points' similarities should be a good choice.

Let $Y = (y_1, \dots, y_n)^T$ be the low-dimensional representation, where $y_i \in \mathbb{R}^k$ represents $x_i$, $i = 1, 2, \dots, n$, and let $d_{ij}$ be the dissimilarity between $y_i$ and $y_j$, that is

$$d_{ij} = \|y_i - y_j\|^2. \qquad (3)$$

Letting the adjacency weight $w_{ij}$ be the similarity between $x_i$ and $x_j$, the covariance between similarities becomes

$$\mathrm{cov}(-d, w) = -\frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{n}(d_{ij} - \bar{d})(w_{ij} - \bar{w}) \qquad (4)$$

where $w = (w_{11}, w_{12}, \dots, w_{nn})$, $d = (d_{11}, d_{12}, \dots, d_{nn})$, and $\bar{d}$, $\bar{w}$ are the means. Since for any constant $c \neq 0$ we have

$$\mathrm{cov}(cd, w) = c \cdot \mathrm{cov}(d, w) \qquad (5)$$

and only the relative information matters for $Y$, it is natural to constrain the mapped data's means to zero and fix their norms to constants. That is

$$\sum_{i=1}^{n} y_{it} = 0, \quad t = 1, \dots, k, \qquad \|y_i\|^2 = c_i, \quad i = 1, \dots, n \qquad (6)$$

where $c_i \neq 0$ is a constant. Therefore, the covariance becomes

$$\mathrm{cov}(-d, w) = -\frac{1}{n}\,\mathrm{tr}(Y^T L Y) + \text{constant} \qquad (7)$$

and the optimization problem becomes


$$\begin{aligned}
\underset{Y}{\text{minimize}} \quad & \mathrm{tr}(Y^T L Y) \\
\text{subject to} \quad & \|y_i\|^2 = c_i, \quad i = 1, 2, \dots, n \\
& \sum_{i=1}^{n} y_{it} = 0, \quad t = 1, 2, \dots, k.
\end{aligned} \qquad (8)$$
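The step from (4) to (7) rests on the identity $\sum_{ij} w_{ij}\|y_i - y_j\|^2 = 2\,\mathrm{tr}(Y^T L Y)$; the following sketch checks it numerically, with the random weights and embedding serving only as illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 2
W = rng.random((n, n))
W = (W + W.T) / 2                      # symmetric weights w_ij = w_ji >= 0
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W         # unnormalized Laplacian
Y = rng.standard_normal((n, k))        # arbitrary k-dimensional embedding

# sum_ij w_ij * ||y_i - y_j||^2  equals  2 * tr(Y^T L Y)
d = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
print(np.isclose(np.sum(W * d), 2 * np.trace(Y.T @ L @ Y)))  # True
```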

Let $\lambda_0, \dots, \lambda_k, \dots, \lambda_{n-1}$ be the eigenvalues of $L$ sorted in ascending order, and $f_0, f_1, \dots, f_k$ be the first $k+1$ corresponding eigenvectors. Since the number of zero eigenvalues of a graph Laplacian equals the number of connected components, and $L$ is the positive semi-definite Laplacian of a fully connected graph, we have

$$0 = \lambda_0 < \lambda_1 \leq \dots \leq \lambda_{n-1}. \qquad (9)$$

Let $\vec{1}$ be the n-dimensional vector of all ones and $\vec{0}$ be the n-dimensional vector of all zeros. Since $L\vec{1} = \vec{0}$, we have

$$f_0 = (c_0, \dots, c_0)^T, \qquad \vec{1} \cdot f_t = 0, \quad t = 1, 2, \dots, k \qquad (10)$$

where $c_0$ is a constant. As a constant vector, $f_0$ cannot satisfy both constraints in optimization problem (8) simultaneously. What's more, since $L$ is positive semi-definite, the solution of the optimization problem (8) becomes the first $k$ non-constant eigenvectors, that is

$$Y^{*} = (f_1, f_2, \dots, f_k). \qquad (11)$$

Classical spectral clustering uses the first $k$ eigenvectors, that is

$$Y^{sp} = (f_0, f_1, \dots, f_{k-1}). \qquad (12)$$

As a constant vector, $f_0$ provides no extra information, which means removing $f_0$ makes no difference for the clustering result. Therefore, we only need to adjust the number of clusters from $k - 1$ to $k$.

2.2 The Normalized Graph Laplacians

For the symmetric normalized graph Laplacian $L^{sym}$, we have

$$L^{sym}_{ij} = \frac{L_{ij}}{\sqrt{D_{ii} D_{jj}}}. \qquad (13)$$

Let

$$z^{sym}_i = \frac{y_i}{\sqrt{D_{ii}}} \qquad (14)$$

and

$$d^{sym}_{ij} = \|z^{sym}_i - z^{sym}_j\|^2. \qquad (15)$$

Similar to the unnormalized graph, we add constraints to $z^{sym}_i$

$$\|z^{sym}_i\|^2 = c^{sym}_i, \quad i = 1, 2, \dots, n, \qquad \sum_{i=1}^{n} z^{sym}_{it} = 0, \quad t = 1, 2, \dots, k \qquad (16)$$

where $c^{sym}_i \neq 0$ is a constant. Based on the constraints, we have


$$\begin{aligned}
\mathrm{cov}(-d^{sym}, w) &= -\frac{1}{2n}\sum_{i=1}^{n}\sum_{j=1}^{n}(d^{sym}_{ij} - \bar{d}^{sym})(w_{ij} - \bar{w}) \\
&= -\frac{1}{n}\,\mathrm{tr}(Y^T L^{sym} Y) + \text{constant}
\end{aligned} \qquad (17)$$

Therefore, the symmetric normalized graph Laplacian optimization problem becomes

$$\begin{aligned}
\underset{Y}{\text{minimize}} \quad & \mathrm{tr}(Y^T L^{sym} Y) \\
\text{subject to} \quad & \|z^{sym}_i\|^2 = c^{sym}_i, \quad i = 1, 2, \dots, n \\
& \sum_{i=1}^{n} z^{sym}_{it} = 0, \quad t = 1, 2, \dots, k.
\end{aligned} \qquad (18)$$

Similar to the unnormalized graph Laplacian, the solution becomes the first $k$ non-constant eigenvectors of $L^{sym}$, that is

$$Y^{sym} = (f^{sym}_1, \dots, f^{sym}_k) \qquad (19)$$

and

$$Z^{sym} = \Lambda^{-\frac{1}{2}}\, Y^{sym} \qquad (20)$$

where $\Lambda^{-\frac{1}{2}} = \mathrm{diag}(\frac{1}{\sqrt{\lambda_{ii}}})_{n \times n}$ is a diagonal matrix. It seems that there is a strong connection between the normalized graph Laplacian and the commute time embedding in [1]. The random walk normalized graph Laplacian $L^{rw}$ is similar to the symmetric normalized graph Laplacian $L^{sym}$; we only need to set $z^{rw}_j = \frac{y_j}{\sqrt{D_{jj}}}$.
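Putting Section 2 together, the sketch below is a minimal two-step spectral clustering procedure for a fully connected graph: embed with the first $k$ non-constant eigenvectors of the unnormalized Laplacian (the solution of problem (8)), then cluster the embedding. Using scikit-learn's KMeans for the second step is an implementation choice for this example, not part of the derivation.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Two-step spectral clustering on a fully connected weighted graph W."""
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)            # eigenvalues in ascending order
    Y = eigvecs[:, 1:k + 1]                   # skip the constant eigenvector f_0
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)

# Example with the RBF-weighted toy data from Section 2:
# labels = spectral_clustering(W, k=2)
```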

3 Multi-Connected Graphs

For a multi-connected graph with $k$ connected components, the weighted adjacency matrix will be a block diagonal matrix after sorting the vertices according to the connected components they belong to, and the Laplacian $L^m$ becomes

$$L^m = \begin{pmatrix} L_1 & & & \\ & L_2 & & \\ & & \ddots & \\ & & & L_k \end{pmatrix} \qquad (21)$$

where each subgraph Laplacian $L_t$, $t = 1, 2, \dots, k$, corresponds to a fully connected graph and has one and only one eigenvalue equal to 0. Let $u^m_t$ be the first eigenvector of subgraph $L_t$; $u^m_t$ is a constant vector with the corresponding eigenvalue equal to zero, and the first $k$ eigenvectors of $L^m$ become

$$U^m = \begin{pmatrix} u^m_1 & & & \\ & u^m_2 & & \\ & & \ddots & \\ & & & u^m_k \end{pmatrix}. \qquad (22)$$

Therefore, $U^m$ becomes the indicator of the connected components.
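A small numerical illustration of this fact is sketched below for two fully connected components. Note that a numerical eigensolver may return any orthonormal basis of the zero eigenspace, so the first $k$ eigenvectors are piecewise constant on the components (they span the indicator vectors) rather than literal 0/1 indicators.

```python
import numpy as np

# Two connected components: fully connected blocks of 3 and 4 vertices.
W1 = np.ones((3, 3)) - np.eye(3)
W2 = np.ones((4, 4)) - np.eye(4)
W = np.block([[W1, np.zeros((3, 4))],
              [np.zeros((4, 3)), W2]])

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)
U = eigvecs[:, :2]                             # first k = 2 eigenvectors

print(np.allclose(eigvals[:2], 0.0))           # True: exactly k zero eigenvalues
# Each of the first k eigenvectors is constant within each component.
print(np.allclose(U[:3], U[0]), np.allclose(U[3:], U[3]))  # True True
```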

4 The Equivalence To PCA

Principal component analysis (PCA) is a commonly used linear dimension reduction method that attempts to construct a low-dimensional representation preserving as much variance as possible. After removing the means, $X^T X$ can be recognized as the covariance matrix, and the k-dimensional representation $Y^{PCA*}$ becomes


$$Y^{PCA*} = X M_k \qquad (23)$$

where $M_k$ is an $m \times k$ matrix whose columns are the $k$ largest eigenvectors of $X^T X$. That is, the columns of $M_k$ are the eigenvectors of $X^T X$ whose corresponding eigenvalues are the $k$ largest, and the eigenvectors are sorted according to their eigenvalues in descending order. Denote $\Delta_{k \times k}$ as the diagonal matrix whose diagonal entries are the sorted $k$ largest eigenvalues of $X^T X$; thus

$$XX^T Y^{PCA*} = XX^T X M_k = X M_k \Delta_k = Y^{PCA*} \Delta_k. \qquad (24)$$

Therefore, the columns of $Y^{PCA*}$ are the $k$ largest eigenvectors of $G = XX^T$.

To make PCA equal to spectral embedding, we need to construct a fully connected graph Laplacian $L^{PCA}$ whose smallest $k$ non-constant eigenvectors equal $Y^{PCA*}$. Based on $G$, we choose the cosine similarity as the similarity function, and to make sure $w_{ij} > 0$, we let $w_{ij} = 2 + \cos\theta_{ij}$. In this way, $L^{PCA}$ becomes a positive semi-definite fully connected graph Laplacian. After applying standardization to $X$, we have

$$w_{ij} = 2 + x_i^T x_j \qquad (25)$$

and

$$D_{ii} = \sum_{j=1}^{n} w_{ij} = 2n \qquad (26)$$

and the Laplacian matrix becomes

$$L^{PCA} = 2nI - 2H - G \qquad (27)$$

where $H = (1)_{n \times n}$ is an $n \times n$ all-ones matrix. Let $\beta_0, \dots, \beta_k, \dots, \beta_{n-1}$ be the eigenvalues of $L^{PCA}$ sorted in ascending order, and $u_0, \dots, u_k, \dots, u_{n-1}$ be the corresponding eigenvectors; we have

$$0 = \beta_0 < \beta_1 \leq \dots \leq \beta_k \leq \dots \leq \beta_{n-1} \qquad (28)$$

and the solution of $L^{PCA}$ becomes

$$Y_L^{PCA*} = (u_1, \dots, u_k). \qquad (29)$$

For $t = 1, 2, \dots, n-1$, since $\vec{1} \cdot u_t = 0$, with equation (27) we have

$$G u_t = (2n - \beta_t)\, u_t. \qquad (30)$$

For $\beta_0 = 0$ and $u_0$, since $X$ has zero means and $u_0$ is a constant vector, we have

$$G u_0 = \vec{0}. \qquad (31)$$

What’s more, since G is positive semi-definite, the eigenvalues of G become

$$2n - \beta_1 \geq \dots \geq 2n - \beta_k \geq \dots \geq 2n - \beta_{n-1} \geq \beta_0 = 0. \qquad (32)$$

In this way, $u_1, \dots, u_k$ become the $k$ largest eigenvectors of $G$, which is the solution of k-dimensional PCA,

$$Y^{PCA*} = (u_1, \dots, u_k) = Y_L^{PCA*}. \qquad (33)$$
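The equivalence can be checked numerically. The sketch below centers the columns of a random $X$ and rescales it so that every $w_{ij} = 2 + x_i^T x_j$ stays positive (standing in for the standardization step in the text), builds $L^{PCA}$ from (27), and compares the subspace spanned by its smallest $k$ non-constant eigenvectors with the one spanned by the $k$ largest eigenvectors of $G = XX^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 30, 5, 3

X = rng.standard_normal((n, m))
X -= X.mean(axis=0)                        # zero column means
X /= np.sqrt(np.abs(X @ X.T).max())        # keep every w_ij = 2 + x_i^T x_j positive

G = X @ X.T                                # Gram matrix
H = np.ones((n, n))
L_pca = 2 * n * np.eye(n) - 2 * H - G      # eq. (27); equals D - W for W = 2 + G

_, g_vecs = np.linalg.eigh(G)
U_pca = g_vecs[:, -k:]                     # k largest eigenvectors of G (PCA solution)
_, l_vecs = np.linalg.eigh(L_pca)
U_spec = l_vecs[:, 1:k + 1]                # smallest k non-constant eigenvectors, eq. (29)

# Same k-dimensional subspace (eigenvectors agree up to sign), eq. (33).
print(np.allclose(U_pca @ U_pca.T, U_spec @ U_spec.T))  # True
```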


5 Conclusions

In this paper, we explain the mathematics behind spectral clustering by dividing graph Laplacians into two categories based on whether the graph is fully connected or not. For a fully connected graph, we consider spectral clustering as a two-step algorithm that first reduces the dimension and then applies a standard clustering algorithm to the lower-dimensional representation. We explain the spectral embedding part by offering an objective function, which is the covariance between the original data points' similarities and the mapped data points' similarities. For a multi-connected graph, we prove that with a proper $k$, the first $k$ eigenvectors are the indicators of the connected components. This paper also proves the equivalence between spectral embedding and PCA by setting the cosine similarity as the similarity function when constructing the weighted adjacency graph. Since the choice of similarity function is flexible, spectral embedding is expected to be equivalent to other dimension reduction algorithms as well.

References

[1] F. Fouss, A. Pirotte, J. Renders, and M. Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355–369, 2007.

[2] Jianbo Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[4] M. Meila and J. Shi. A random walks view of spectral segmentation. In T. S. Richardson and T. S. Jaakkola, editors, AISTATS. Society for Artificial Intelligence and Statistics, 2001.

[5] M. Saerens, F. Fouss, L. Yen, and P. Dupont. The principal components analysis of a graph, and its relationships to spectral clustering. In J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, editors, Machine Learning: ECML 2004, pages 371–383, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.

[6] U. von Luxburg. A tutorial on spectral clustering. CoRR, abs/0711.0189, 2007.

[7] Y. Weiss. Segmentation using eigenvectors: a unifying view. In Proceedings of the Seventh IEEE International Conference on Computer Vision, volume 2, pages 975–982, 1999.
