Relational Clustering by Symmetric Convex Coding

Bo Long [email protected] Zhongfei (Mark) Zhang [email protected] Computer Science Dept., SUNY Binghamton, Binghamton, NY 13902

Xiaoyun Wu [email protected] Google Inc, 1600 Amphitheatre, Mountain View, CA 94043

Philip S. Yu [email protected] IBM Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

Abstract

Relational data appear frequently in many machine learning applications. Relational data consist of the pairwise relations (similarities or dissimilarities) between each pair of implicit objects, are usually stored in relation matrices, and typically no other knowledge is available. Although relational clustering can be formulated as graph partitioning in some applications, this formulation is not adequate for general relational data. In this paper, we propose a general model for relational clustering based on symmetric convex coding. The model is applicable to all types of relational data and unifies the existing graph partitioning formulation. Under this model, we derive two alternative bound optimization algorithms to solve the symmetric convex coding under two popular distance functions, Euclidean distance and generalized I-divergence. Experimental evaluation and theoretical analysis show the effectiveness and great potential of the proposed model and algorithms.

1. Introduction

Two types of data are used in unsupervised learning: feature data and relational data. Feature data are in the form of feature vectors; relational data consist of the pairwise relations (similarities or dissimilarities) between each pair of objects, are usually stored in relation matrices, and typically no other knowledge is available. Although feature data are the most common type of data, relational data have become more and more popular in many machine learning applications, such as Web mining, social network analysis, bioinformatics, VLSI design, and task scheduling. Furthermore, relational data are more general in the sense that all feature data can be transformed into relational data under a certain distance function.

The most popular way to cluster similarity-based relational data is to formulate it as a graph partitioning problem, which has been studied for decades. Graph partitioning seeks to cut a given graph into disjoint subgraphs which correspond to disjoint clusters based on a certain edge cut objective. Recently, graph partitioning with an edge cut objective has been shown to be mathematically equivalent to an appropriate weighted kernel k-means objective function (Dhillon et al., 2004; Dhillon et al., 2005). The assumption behind the graph partitioning formulation is that since the nodes within a cluster are similar to each other, they form a dense subgraph. However, in general this is not true for relational data, i.e., the clusters in relational data are not necessarily dense clusters consisting of strongly related objects.

Figure 1 shows relational data with four clusters of two different types. In Figure 1, C1 = {v1, v2, v3, v4} and C2 = {v5, v6, v7, v8} are two traditional dense clusters within which objects are strongly related to each other. However, C3 = {v9, v10, v11, v12} and C4 = {v13, v14, v15, v16} also form two sparse clusters, within which the objects are not related to each other, but they are still "similar" to each other in the sense that they are related to the same set of other nodes. In Web mining, this type of cluster could be a group of music "fan" Web pages which share the same taste in music and are linked to the same set of music Web pages but are not linked to each other (Kumar et al., 1999). Due to the importance of identifying this type of clusters (communities), it has been listed as one of the five algorithmic challenges in Web search engines (Henzinger et al., 2003). Note that the cluster structure of the relational data in Figure 1 cannot be correctly identified by graph partitioning approaches, since they look only for dense clusters of strongly related objects by cutting the given graph into subgraphs; similarly, pure bi-partite graph models cannot correctly identify this type of cluster structure. Note that re-defining the relations between the objects does not solve the problem in this situation, since there exist both dense and sparse clusters.

Figure 1. The graph (a) and relation matrix (b) of the relational data with different types of clusters. In (b), the dark color denotes 1 and the light color denotes 0.

If the relational data are dissimilarity-based, applying graph partitioning approaches to them requires extra effort to appropriately transform them into similarity-based data and to ensure that the transformation does not change the cluster structures in the data. Hence, it is desirable for an algorithm to be able to identify the cluster structures no matter which type of relational data is given. This is even more desirable in the situation where the background knowledge about the meaning of the relations is not available, i.e., we are given only a relation matrix and do not know if the relations are similarities or dissimilarities.

In this paper, we propose a general model for relational clustering based on symmetric convex coding of the relation matrix. The proposed model is applicable to general relational data consisting of only pairwise relations, typically without other knowledge; it is capable of learning both dense and sparse clusters at the same time; and it unifies the existing graph partitioning models to provide a generalized theoretical foundation for relational clustering. Under this model, we derive iterative bound optimization algorithms to solve the symmetric convex coding for two important distance functions, Euclidean distance and generalized I-divergence. The algorithms are applicable to general relational data and at the same time can be easily adapted to learn a specific type of cluster structure. For example, when applied to learning only dense clusters, they provide new efficient algorithms for graph partitioning. The convergence of the algorithms is theoretically guaranteed. Experimental evaluation and theoretical analysis show the effectiveness and great potential of the proposed model and algorithms.

2. Related Work

Graph partitioning (or clustering) is a popular formulation of relational clustering, which divides the nodes of a graph into clusters by finding the best edge cuts of the graph. Several edge cut objectives, such as the average cut (Chan et al., 1993), average association (Shi & Malik, 2000), normalized cut (Shi & Malik, 2000), and min-max cut (Ding et al., 2001), have been proposed. Various spectral algorithms have been developed for these objective functions (Chan et al., 1993; Shi & Malik, 2000; Ding et al., 2001). These algorithms use the eigenvectors of a graph affinity matrix, or of a matrix derived from the affinity matrix, to partition the graph.

Multilevel methods have been used extensively for graph partitioning with the Kernighan-Lin objective, which attempt to minimize the cut in the graph while maintaining equal-sized clusters (Bui & Jones, 1993; Hendrickson & Leland, 1995; Karypis & Kumar, 1998).

Recently, graph partitioning with an edge cut objective has been shown to be mathematically equivalent to an appropriate weighted kernel k-means objective function (Dhillon et al., 2004; Dhillon et al., 2005). Based on this equivalence, the weighted kernel k-means algorithm has been proposed for graph partitioning (Dhillon et al., 2004; Dhillon et al., 2005). Yu et al. (2005) propose graph-factorization clustering for graph partitioning, which seeks to construct a bipartite graph to approximate a given graph. Nasraoui et al. (1999) propose the relational fuzzy maximal density estimator algorithm.

In this paper, our focus is on homogeneous relational data, i.e., the objects in the data are of the same type. There are some efforts in the literature that can be considered as clustering heterogeneous relational data, i.e., different types of objects are related to each other. For example, co-clustering addresses clustering two types of related objects, such as documents and words, at the same time. Dhillon et al. (2003) propose a co-clustering algorithm to maximize the mutual information. A more generalized co-clustering framework is presented by Banerjee et al. (2004), wherein any Bregman divergence can be used in the objective function. Long et al. (2005), Li (2005) and Ding et al. (2006) all model co-clustering as an optimization problem involving a triple matrix factorization.

3. Symmetric Convex Coding

In this section, we propose a general model for relational clustering. Let us first consider the relational data in Figure 1. An interesting observation is that although the different types of clusters look very different in the graph of Figure 1(a), they all demonstrate block patterns in the relation matrix of Figure 1(b) (without loss of generality, we arrange the objects from the same cluster together to make the block patterns explicit). Motivated by this observation, we propose the Symmetric Convex Coding (SCC) model to cluster relational data by learning the block pattern of a relation matrix. Since in most applications the relations are non-negative and undirected, relational data can be represented as non-negative, symmetric matrices. Therefore, the definition of SCC is given as follows.

Definition 3.1. Given a symmetric matrix $A \in \mathbb{R}_+^{n \times n}$, a distance function $D$ and a positive integer $k$, the symmetric convex coding is given by the minimization
$$\min_{\substack{C \in \mathbb{R}_+^{n \times k},\; B \in \mathbb{R}_+^{k \times k} \\ C\mathbf{1} = \mathbf{1}}} D(A, CBC^T). \qquad (1)$$

According to Definition 3.1, the elements of C are between 0 and 1 and the sum of the elements in each row of C equals 1. Therefore, SCC seeks to use convex combinations of the prototype matrix B to approximate the original relation matrix. The factors from SCC have intuitive interpretations. The factor C is the soft membership matrix such that Cij denotes the weight with which the ith object associates with the jth cluster.
The factor B is the prototype matrix such that Bii denotes the connectivity within the ith cluster and Bij denotes the connectivity between the ith cluster and the jth cluster.

SCC provides a general model to learn various cluster structures from relational data. Graph partitioning, which focuses on learning dense cluster structures, can be formulated as a special case of the SCC model. We propose the following theorem to show that various graph partitioning objective functions are mathematically equivalent to a special case of the SCC model. Since most graph partitioning objective functions are based on hard cluster membership, in the following theorem we modify the constraints on C to $C \in \mathbb{R}_+^{n \times k}$ and $C^T C = I_k$, so that C becomes the following cluster indicator matrix,
$$C_{ij} = \begin{cases} \dfrac{1}{|\pi_j|^{1/2}} & \text{if } v_i \in \pi_j \\ 0 & \text{otherwise,} \end{cases}$$
where $|\pi_j|$ denotes the number of nodes in the jth cluster.

Theorem 3.2. The hard version of the SCC model under the Euclidean distance function and $B = rI_k$ for $r > 0$, i.e.,
$$\min_{\substack{C \in \mathbb{R}_+^{n \times k} \\ C^T C = I_k}} \|A - C(rI_k)C^T\|^2, \qquad (2)$$
is equivalent to the maximization
$$\max \;\operatorname{tr}(C^T A C), \qquad (3)$$
where tr denotes the trace of a matrix.

Proof. Let L denote the objective function in Eq. (2).
$$\begin{aligned}
L &= \|A - rCC^T\|^2 & (4)\\
  &= \operatorname{tr}\big((A - rCC^T)^T (A - rCC^T)\big) & (5)\\
  &= \operatorname{tr}(A^T A) - 2r\operatorname{tr}(CC^T A) + r^2 \operatorname{tr}(CC^T CC^T) & (6)\\
  &= \operatorname{tr}(A^T A) - 2r\operatorname{tr}(C^T A C) + r^2 k & (7)
\end{aligned}$$
The above deduction uses the trace property $\operatorname{tr}(XY) = \operatorname{tr}(YX)$. Since $\operatorname{tr}(A^T A)$, $r$ and $k$ are constants, the minimization of L is equivalent to the maximization of $\operatorname{tr}(C^T A C)$. The proof is completed.

Theorem 3.2 states that with the prototype matrix B restricted to be of the form $rI_k$, SCC under Euclidean distance reduces to the trace maximization in (3). Since various graph partitioning objectives, such as ratio association (Shi & Malik, 2000), normalized cut (Shi & Malik, 2000), ratio cut (Chan et al., 1993), and the Kernighan-Lin objective (Kernighan & Lin, 1970), can be formulated as trace maximization (Dhillon et al., 2004; Dhillon et al., 2005), Theorem 3.2 establishes the connection between the SCC model and the existing graph partitioning objective functions. Based on this connection, it is clear that the existing graph partitioning models make an implicit assumption about the cluster structure of the relational data, i.e., the clusters are not related to each other (the off-diagonal elements of B are zeroes) and the nodes within a cluster are related to each other in the same way (the diagonal elements of B are r). This assumption is consistent with the intuition about graph partitioning, which seeks to "cut" the graph into k separate subgraphs corresponding to strongly-related clusters.

With Theorem 3.2 we may put other types of structural constraints on B to derive new graph partitioning models. For example, we may fix B as a general diagonal matrix instead of $rI_k$, i.e., the model fixes the off-diagonal elements of B as zero and learns the diagonal elements of B. This is a more flexible graph partitioning model, since it allows the connectivity within different clusters to differ. More generally, we can use B to restrict the model to learn other types of cluster structures. For example, by fixing the diagonal elements of B as zeros, the model focuses on learning only sparse clusters (corresponding to bi-partite or k-partite subgraphs), which are important for Web community learning (Kumar et al., 1999; Henzinger et al., 2003). In summary, the prototype matrix B not only provides the intuition for the cluster structure of the data, but also provides a simple way to adapt the model to learn specific types of cluster structures.

4. Algorithm Derivation

In this section, we derive efficient algorithms for the SCC model under two popular distance functions, Euclidean distance and generalized I-divergence.

4.1. Algorithm for SCC under Euclidean Distance

We derive an alternating optimization algorithm for SCC under Euclidean distance, i.e., the algorithm alternately updates B and C until convergence.

First we fix B to update C. To deal with the constraint $C\mathbf{1} = \mathbf{1}$ efficiently, we transform it into a "soft" constraint by adding a penalty term, $\alpha\|C\mathbf{1} - \mathbf{1}\|^2$, to the objective function, where $\alpha$ is a positive constant. Therefore, we obtain the following optimization:
$$\min_{C \in \mathbb{R}_+^{n \times k}} \|A - CBC^T\|^2 + \alpha\|C\mathbf{1} - \mathbf{1}\|^2. \qquad (8)$$

The objective function in (8) is quartic with respect to C. We derive an efficient updating rule for C based on the bound optimization procedure (Salakhutdinov & Roweis, 2003; Lee & Seung, 1999). The basic idea is to construct an auxiliary function which is a convex upper bound for the original objective function, based on the solution obtained from the previous iteration. A new solution for the current iteration is then obtained by minimizing this upper bound. The definition of the auxiliary function and a useful lemma (Lee & Seung, 1999) are quoted as follows.

Definition 4.1. $G(S, S^t)$ is an auxiliary function for $F(S)$ if $G(S, S^t) \ge F(S)$ and $G(S, S) = F(S)$.

Lemma 4.2. If G is an auxiliary function, then F is non-increasing under the updating rule $S^{t+1} = \arg\min_S G(S, S^t)$.

We propose an auxiliary function for C in the following lemma.

Lemma 4.3.
$$\begin{aligned}
G(C, \tilde{C}) = \sum_{ij}\Big(A_{ij}^2 + \frac{\alpha}{n}
&- 2\sum_{gh}\Big(A_{ij}\tilde{C}_{ig}B_{gh}\tilde{C}_{jh}\big(1 + 2\log C_{jh} - 2\log\tilde{C}_{jh}\big) + \frac{\alpha}{nk}\tilde{C}_{jh}\big(1 + \log C_{jh} - \log\tilde{C}_{jh}\big)\Big) \\
&+ \sum_{gh}\Big([\tilde{C}B\tilde{C}^T]_{ij}\,\tilde{C}_{ig}B_{gh}\tilde{C}_{jh}\,\frac{C_{jh}^4}{\tilde{C}_{jh}^4} + \frac{\alpha}{2nk}[\tilde{C}\mathbf{1}]_j\,\tilde{C}_{jh}\Big(\frac{C_{jh}^4}{\tilde{C}_{jh}^4} + 1\Big)\Big)\Big)
\end{aligned}$$
is an auxiliary function for
$$F(C) = \|A - CBC^T\|^2 + \alpha\|C\mathbf{1} - \mathbf{1}\|^2. \qquad (9)$$

Proof. For convenience, let $\beta = \frac{\alpha}{nk}$. Expanding F(C) and bounding each term of the expansion from above — using Jensen's inequality, the convexity of the quadratic function, and the inequalities $x^2 + y^2 \ge 2xy$ and $x \ge 1 + \log x$ — gives $G(C, \tilde{C}) \ge F(C)$; direct substitution shows $G(C, C) = F(C)$.

The following theorem provides the updating rule for C.

Theorem 4.4. The objective function F(C) in Eq. (9) is nonincreasing under the updating rule
$$C = \tilde{C} \odot \left(\frac{A\tilde{C}B + \frac{\alpha}{2}\mathbf{1}}{\tilde{C}B\tilde{C}^T\tilde{C}B + \frac{\alpha}{2}\tilde{C}E}\right)^{\frac{1}{4}}, \qquad (10)$$
where C̃ denotes the solution from the previous iteration, 1 denotes an n × k matrix of 1's, E denotes a k × k matrix of 1's, ⊙ denotes the entry-wise product, and the division and the power 1/4 are applied entry-wise.

Proof. Based on Lemma 4.3, take the derivative of G(C, C̃) with respect to C_jh:
$$\frac{\partial G(C,\tilde{C})}{\partial C_{jh}} = \sum_i\Big(-4\sum_g A_{ij}\tilde{C}_{ig}B_{gh}\frac{\tilde{C}_{jh}}{C_{jh}} - \frac{2\alpha}{nk}\frac{\tilde{C}_{jh}}{C_{jh}} + 4\sum_g[\tilde{C}B\tilde{C}^T]_{ij}\tilde{C}_{ig}B_{gh}\frac{C_{jh}^3}{\tilde{C}_{jh}^3} + \frac{2\alpha}{nk}[\tilde{C}\mathbf{1}]_j\frac{C_{jh}^3}{\tilde{C}_{jh}^3}\Big).$$
Setting $\partial G(C,\tilde{C})/\partial C_{jh} = 0$ and solving for $C_{jh}$ yields the entry-wise form of the rule; writing it in matrix form gives (10). By Lemma 4.2, the proof is completed.

Similarly, we present the following theorems to derive the updating rule for B.

Lemma 4.5.
$$G(B, \tilde{B}) = \sum_{ij}\Big(A_{ij}^2 - 2\sum_{gh}A_{ij}C_{ig}B_{gh}C_{jh} + \sum_{gh}[C\tilde{B}C^T]_{ij}\,C_{ig}C_{jh}\,\frac{B_{gh}^2}{\tilde{B}_{gh}}\Big)$$
is an auxiliary function for
$$F(B) = \|A - CBC^T\|^2. \qquad (11)$$

Theorem 4.6. The objective function F(B) in Eq. (11) is nonincreasing under the updating rule
$$B = \tilde{B} \odot \frac{C^T A C}{C^T C \tilde{B} C^T C}. \qquad (12)$$

Following the way to prove Lemma 4.3 and Theorem 4.4, it is not difficult to prove the above theorems; we omit the details here.

We call the resulting algorithm SCC-ED; it is summarized in Algorithm 1. The implementation of SCC-ED is simple, and it is easy to take advantage of distributed computation for very large data sets. The complexity of the algorithm is O(tn²k) for t iterations, and it can be further reduced for sparse data. The convergence of the SCC-ED algorithm is guaranteed by Theorems 4.4 and 4.6.

Algorithm 1 SCC-ED
  Input: A graph affinity matrix A and a positive integer k.
  Output: A community membership matrix C and a community structure matrix B.
  Method:
  1: Initialize B and C.
  2: repeat
  3:    $B = B \odot \dfrac{C^T A C}{C^T C B C^T C}$
  4:    $C = C \odot \left(\dfrac{ACB + \frac{\alpha}{2}\mathbf{1}}{CBC^T CB + \frac{\alpha}{2}CE}\right)^{\frac{1}{4}}$
  5: until convergence

If the task is to learn dense clusters from similarity-based relational data, as graph partitioning does, SCC-ED can achieve this simply by fixing B as the identity matrix and updating only C by (10) until convergence. In other words, updating rule (10) by itself provides a new and efficient graph partitioning algorithm, which is computationally more efficient than the popular spectral graph partitioning approaches, since those involve expensive eigenvector computation (typically O(n³)) and extra post-processing (Yu & Shi, 2003) on the eigenvectors to obtain the clustering. Compared with multilevel approaches such as METIS (Karypis & Kumar, 1998), this new algorithm does not restrict clusters to have an equal size.

Another advantage of the SCC-ED algorithm is that it is very easy to incorporate constraints on B to learn a specific type of cluster structure. For example, if the task is to learn sparse clusters by constraining the diagonal elements of B to be zero, we can enforce this constraint simply by initializing the diagonal elements of B to zeros. The algorithm then automatically updates only the off-diagonal elements of B, and the diagonal elements of B stay 'locked' at zero.

Yet another interesting observation about SCC-ED is that if we set α = 0, the updating rule for C becomes
$$C = \tilde{C} \odot \left(\frac{A\tilde{C}B}{\tilde{C}B\tilde{C}^T\tilde{C}B}\right)^{\frac{1}{4}}, \qquad (13)$$
and the algorithm actually performs symmetric conic coding. This has been touched on in the literature as the symmetric case of nonnegative matrix factorization (Catral et al., 2004; Ding et al., 2005; Long et al., 2005). Therefore, SCC-ED with α = 0 also provides a theoretically sound solution to symmetric nonnegative matrix factorization.
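To make Algorithm 1 concrete, the following NumPy sketch implements the two multiplicative updates (10) and (12). Only the update formulas come from the text; the function name, the random initialization, the small epsilon guarding the denominators, and the stopping test are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def scc_ed(A, k, alpha=1.0, max_iter=200, tol=1e-6, seed=0):
    """Illustrative sketch of SCC-ED (Algorithm 1): alternating multiplicative
    updates of B (rule 12) and C (rule 10) for
    min ||A - C B C^T||^2 + alpha * ||C 1 - 1||^2."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    eps = 1e-12                       # guards entry-wise divisions
    C = rng.random((n, k))            # non-negative random initialization
    B = rng.random((k, k))
    B = (B + B.T) / 2                 # keep the prototype matrix symmetric
    E = np.ones((k, k))
    prev = np.inf
    for _ in range(max_iter):
        # Rule (12): B <- B .* (C^T A C) ./ (C^T C B C^T C)
        B *= (C.T @ A @ C) / (C.T @ C @ B @ C.T @ C + eps)
        # Rule (10): C <- C .* ((A C B + a/2 * 1) ./ (C B C^T C B + a/2 * C E))^(1/4)
        num = A @ C @ B + alpha / 2.0
        den = C @ B @ C.T @ C @ B + (alpha / 2.0) * (C @ E) + eps
        C *= (num / den) ** 0.25
        obj = np.linalg.norm(A - C @ B @ C.T) ** 2 \
            + alpha * np.linalg.norm(C.sum(axis=1) - 1.0) ** 2
        if abs(prev - obj) <= tol * max(prev, 1.0):
            break
        prev = obj
    return C, B
```

A hard clustering can be read off by assigning each object to the largest entry in its row of C; as described above, initializing the diagonal of B to zeros (for sparse clusters), fixing B to the identity (for dense clusters, i.e., graph partitioning), or setting alpha to 0 (symmetric nonnegative matrix factorization) specializes the same loop.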

4.2. Algorithm for SCC under Generalized I-divergence

Under the generalized I-divergence, the SCC objective function is given as
$$D(A \,\|\, CBC^T) = \sum_{ij}\Big(A_{ij}\log\frac{A_{ij}}{[CBC^T]_{ij}} - A_{ij} + [CBC^T]_{ij}\Big). \qquad (14)$$

Similarly, we derive an alternating bound optimization algorithm for this objective function. First, we fix B and derive the updating rule for C; the task is the following optimization:
$$\min_{C \in \mathbb{R}_+^{n \times k}} D(A \,\|\, CBC^T) + \alpha\|C\mathbf{1} - \mathbf{1}\|^2. \qquad (15)$$

The following theorems provide the updating rule for C.

Lemma 4.7.
$$\begin{aligned}
G(C, \tilde{C}) = \sum_{ij}\Big(& A_{ij}\log A_{ij} - A_{ij} + \frac{\alpha}{n}
+ A_{ij}\sum_{gh}\frac{\tilde{C}_{ig}B_{gh}\tilde{C}_{jh}}{[\tilde{C}B\tilde{C}^T]_{ij}}\log\frac{\tilde{C}_{ig}\tilde{C}_{jh}}{[\tilde{C}B\tilde{C}^T]_{ij}} \\
&+ \sum_{gh}\Big(\tilde{C}_{ig}B_{gh}\tilde{C}_{jh} + \frac{\alpha}{nk}[\tilde{C}\mathbf{1}]_j\tilde{C}_{jh}\Big)\frac{C_{jh}^2}{\tilde{C}_{jh}^2} \\
&- 2\sum_{gh}\Big(A_{ij}\frac{\tilde{C}_{ig}B_{gh}\tilde{C}_{jh}}{[\tilde{C}B\tilde{C}^T]_{ij}} + \frac{\alpha}{nk}\tilde{C}_{jh}\Big)\log C_{jh}
- 2\sum_{gh}\frac{\alpha}{nk}\tilde{C}_{jh}\big(1 - \log\tilde{C}_{jh}\big)\Big)
\end{aligned}$$
is an auxiliary function for
$$F(C) = D(A \,\|\, CBC^T) + \alpha\|C\mathbf{1} - \mathbf{1}\|^2. \qquad (16)$$

Theorem 4.8. The objective function F(C) in Eq. (16) is nonincreasing under the updating rule
$$C_{jh} = \tilde{C}_{jh}\left(\frac{\sum_i \frac{A_{ij}}{[\tilde{C}B\tilde{C}^T]_{ij}}[\tilde{C}B]_{ih} + \alpha}{\sum_i[\tilde{C}B]_{ih} + \alpha[\tilde{C}\mathbf{1}]_j}\right)^{\frac{1}{2}}, \qquad (17)$$
where C̃ denotes the solution from the previous iteration.

The following theorems provide the updating rule for B.

Lemma 4.9.
$$G(B, \tilde{B}) = \sum_{ij}\Big(A_{ij}\log A_{ij} - A_{ij} + \sum_{gh}C_{ig}B_{gh}C_{jh} - A_{ij}\sum_{gh}\frac{C_{ig}\tilde{B}_{gh}C_{jh}}{[C\tilde{B}C^T]_{ij}}\Big(\log C_{ig}B_{gh}C_{jh} - \log\frac{C_{ig}\tilde{B}_{gh}C_{jh}}{[C\tilde{B}C^T]_{ij}}\Big)\Big)$$
is an auxiliary function for
$$F(B) = D(A \,\|\, CBC^T). \qquad (18)$$

Theorem 4.10. The objective function F(B) in Eq. (18) is nonincreasing under the updating rule
$$B_{gh} = \tilde{B}_{gh}\,\frac{\sum_{ij}\frac{A_{ij}C_{ig}C_{jh}}{[C\tilde{B}C^T]_{ij}}}{\sum_{ij}C_{ig}C_{jh}}, \qquad (19)$$
where B̃ denotes the solution from the previous iteration.

Due to the space limit, we omit the proofs of the above theorems. We call the algorithm based on updating rules (17) and (19) SCC-GI, which provides another new relational clustering algorithm. Similarly, when applied to similarity-based relational data with dense clusters, SCC-GI provides another new and efficient graph partitioning algorithm.
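The SCC-GI updates (17) and (19) admit a similar sketch. Again, only the multiplicative rules come from the text; the matrix form below relies on A and CBC^T being symmetric, and the helper name, initialization and fixed iteration count are illustrative assumptions.

```python
import numpy as np

def scc_gi(A, k, alpha=1.0, max_iter=200, seed=0):
    """Illustrative sketch of SCC-GI: alternating multiplicative updates (19)
    and (17) for min D(A || C B C^T) + alpha * ||C 1 - 1||^2."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    eps = 1e-12
    C = rng.random((n, k))
    B = rng.random((k, k))
    B = (B + B.T) / 2
    for _ in range(max_iter):
        # Rule (19): B_gh <- B_gh * sum_ij(A_ij C_ig C_jh / [CBC^T]_ij) / sum_ij(C_ig C_jh)
        R = A / (C @ B @ C.T + eps)           # ratio matrix A ./ (C B C^T)
        B *= (C.T @ R @ C) / (np.outer(C.sum(axis=0), C.sum(axis=0)) + eps)
        # Rule (17): C_jh <- C_jh * ((sum_i A_ij [CB]_ih / [CBC^T]_ij + alpha)
        #                             / (sum_i [CB]_ih + alpha [C1]_j))^(1/2)
        R = A / (C @ B @ C.T + eps)
        CB = C @ B
        num = R @ CB + alpha
        den = CB.sum(axis=0)[None, :] + alpha * C.sum(axis=1)[:, None] + eps
        C *= np.sqrt(num / den)
    return C, B
```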

5. Experimental Results

This section provides empirical evidence for the effectiveness of the SCC model and algorithms in comparison with two representative graph partitioning algorithms: a spectral approach, Normalized Cut (NC) (Shi & Malik, 2000), and a multilevel algorithm, METIS (Karypis & Kumar, 1998).

5.1. Data Sets and Parameter Setting

The data sets used in the experiments include synthetic data sets with various cluster structures and real data sets based on various text data from the 20-newsgroups (Lang, 1995), WebACE and TREC (Karypis, 2002) collections.

First, we use synthetic binary relational data to simulate relational data with different types of clusters, such as dense clusters, sparse clusters and mixed clusters. All the synthetic relational data are generated from Bernoulli distributions. The distribution parameters used to generate the graphs are listed in the second column of Table 1 as matrices (the true prototype matrices for the data). In a parameter matrix P, Pij denotes the probability that the nodes in the ith cluster are connected to the nodes in the jth cluster. For example, in data syn3, the nodes in cluster 2 are connected to the nodes in cluster 3 with probability 0.2, and the nodes within a cluster are connected to each other with probability 0. Syn2 is generated as 1 minus syn1; hence, syn1 and syn2 can be viewed as a pair of similarity/dissimilarity data. Data syn4 has ten clusters mixing dense clusters and sparse clusters. Due to the space limit, its distribution parameters are omitted here. In total, syn4 has 5000 nodes and about 2.1 million edges.

Table 1. Summary of the synthetic relational data.
  Graph   Parameter matrix                      n     k
  syn1    [0.5 0 0; 0 0.5 0; 0 0 0.5]           900   3
  syn2    1 - syn1                              900   3
  syn3    [0 0.1 0.1; 0.1 0 0.2; 0.1 0.2 0]     900   3
  syn4    [0, 1]^(10x10)                        5000  10

Graphs based on text data have been widely used to test graph partitioning algorithms (Ding et al., 2001; Dhillon, 2001; Zha et al., 2001). Note that there also exist feature-based algorithms that cluster documents directly based on word features; however, in this study our focus is clustering based on relations instead of features, so graph clustering algorithms are used as comparisons. We use various data sets from the 20-newsgroups (Lang, 1995), WebACE and TREC (Karypis, 2002), which cover data sets of different sizes, different balances and different levels of difficulty. We construct relational data for each text data set such that objects (documents) are related to each other by the cosine similarities between their term-frequency vectors. A summary of all the data sets used to construct relational data in this paper is shown in Table 2, in which n denotes the number of objects in the relational data, k denotes the number of true clusters, and balance denotes the size ratio of the smallest cluster to the largest cluster.

For the number of clusters k, we simply use the number of true clusters. Note that how to choose the optimal number of clusters is a nontrivial model selection problem and beyond the scope of this paper.

For the performance measure, we elect to use the Normalized Mutual Information (NMI) (Strehl & Ghosh, 2002) between the resulting cluster labels and the true cluster labels, which is a standard way to measure cluster quality. The final performance score is the average of ten runs.

Table 2. Summary of relational data based on text data sets.
  Name      n      k   Balance  Source
  tr11      414    9   0.046    TREC
  tr23      204    6   0.066    TREC
  NG17-19   1600   3   0.5      20-newsgroups
  NG1-20    14000  20  1.0      20-newsgroups
  k1b       2340   6   0.043    WebACE
  hitech    2301   6   0.192    TREC
  classic3  3893   3   0.708    MEDLINE/CISI/CRANFILD

5.2. Results and Discussion

Table 3 shows the NMI scores of the four algorithms on the synthetic and real relational data. Each NMI score is the average of ten test runs and the standard deviation is also reported. We observe that although there is no single winner on all the data, for most data sets the SCC algorithms perform better than or close to NC and METIS. In particular, SCC-GI provides the best performance on eight of the eleven data sets.

For the synthetic data syn1, almost all the algorithms provide a perfect NMI score, since the data are generated with very clear dense cluster structures, as can be seen from the parameter matrix in Table 1. For data syn2, the dissimilarity version of syn1, we use exactly the same set of true cluster labels as for syn1 to measure the cluster quality; the SCC algorithms still provide an almost perfect NMI score, whereas METIS totally fails on syn2, since in syn2 the clusters are sparse. The SCC algorithms also perform considerably better than NC and METIS on syn3 and syn4, since they can identify both dense clusters and sparse clusters at the same time.

For the real data based on the text data sets, our task is to find dense clusters, which is consistent with the objectives of graph partitioning approaches. Overall, the SCC algorithms perform better than NC and METIS on the real data sets, and SCC-GI provides the best performance on most of them. The possible reasons for this are as follows. First, the SCC model makes use of any possible block pattern in the relation matrices, whereas the edge-cut based approaches focus on diagonal block patterns; hence, the SCC model is more robust to heavily overlapping cluster structures. For example, on the difficult NG17-19 data set, the SCC algorithms do not fail totally as NC and METIS do. Second, since the edge weights from different graphs may have very different probabilistic distributions, the popular Euclidean distance function, which corresponds to a normal distribution assumption, is not always appropriate. By Theorem 3.2, edge-cut based algorithms are based on Euclidean distance; SCC-GI, on the other hand, is based on generalized I-divergence, corresponding to a Poisson distribution assumption, which is more appropriate for graphs based on text data. Note that how to choose distance functions for specific graphs is non-trivial and beyond the scope of this paper. Third, unlike METIS, the SCC algorithms do not restrict clusters to have an equal size and hence they are more robust to unbalanced clusters.

In the experiments, we observe that the SCC algorithms perform stably and rarely provide unreasonable solutions, though like the other algorithms they provide local optima to the NP-hard clustering problem. We also observe that the order of the actual running times of the algorithms is consistent with the theoretical analysis in Section 4.1, i.e., METIS
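For completeness, the NMI score used throughout this section can be computed directly. The sketch below follows the usual definition with the sqrt(H(X)H(Y)) normalization of Strehl & Ghosh (2002); it is a small reference implementation, not the code used in the experiments.

```python
import numpy as np

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information between two labelings,
    NMI = I(X;Y) / sqrt(H(X) * H(Y))."""
    x, y = np.asarray(labels_true), np.asarray(labels_pred)
    n = x.size
    cx, cy = np.unique(x), np.unique(y)
    # joint distribution of the two labelings
    joint = np.array([[np.sum((x == a) & (y == b)) for b in cy] for a in cx]) / n
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz]))
    hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
    hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0
```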

Table 3. NMI comparisons of NC, METIS, SCC-ED and SCC-GI algorithms.
  Data      NC              METIS           SCC-ED          SCC-GI
  syn1      0.9652 ± 0.031  1.000 ± 0.000   1.000 ± 0.000   1.000 ± 0.000
  syn2      0.8062 ± 0.52   0.000 ± 0.00    0.9038 ± 0.045  0.9753 ± 0.011
  syn3      0.636 ± 0.152   0.115 ± 0.001   0.915 ± 0.145   1.000 ± 0.000
  syn4      0.611 ± 0.032   0.638 ± 0.001   0.711 ± 0.043   0.788 ± 0.041
  tr11      0.629 ± 0.039   0.557 ± 0.001   0.6391 ± 0.033  0.661 ± 0.019
  tr23      0.276 ± 0.023   0.138 ± 0.004   0.335 ± 0.043   0.312 ± 0.099
  NG17-19   0.002 ± 0.002   0.091 ± 0.004   0.1752 ± 0.156  0.225 ± 0.045
  NG1-20    0.510 ± 0.004   0.526 ± 0.001   0.5041 ± 0.156  0.519 ± 0.010
  k1b       0.546 ± 0.021   0.243 ± 0.000   0.537 ± 0.023   0.591 ± 0.022
  hitech    0.302 ± 0.005   0.322 ± 0.001   0.319 ± 0.012   0.319 ± 0.018
  classic3  0.621 ± 0.029   0.358 ± 0.000   0.642 ± 0.043   0.822 ± 0.059

Table 4. The members of cluster 121 in the actor graph.
  Cluster 121: Miranda Otto, Brad Dourif, Ian McKellen, John Rhys-Davies, Bernard Hill, Sala Baker

6. Conclusions

In this paper, we propose symmetric convex coding, a general model for relational clustering. The model is applicable to general relational data with various types of clusters and unifies the existing graph partitioning models. We derive iterative bound optimization algorithms to solve the symmetric convex coding for two important distance functions, Euclidean distance and generalized I-divergence. The algorithms are applicable to general relational data and at the same time they can be easily adapted to learn specific types of cluster structures. The convergence of the algorithms is theoretically guaranteed. Experimental evaluation shows the effectiveness and the great potential of the proposed model and algorithms.

Acknowledgement

We thank the anonymous reviewers for insightful comments. This work is supported in part by NSF (IIS-0535162), AFRL (FA8750-05-2-0284), and AFOSR (FA9550-06-1-0327).
References

Banerjee, A., Dhillon, I. S., Ghosh, J., Merugu, S., & Modha, D. S. (2004). A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. KDD (pp. 509–514).

Bui, T. N., & Jones, C. (1993). A heuristic for reducing fill-in in sparse matrix factorization. PPSC (pp. 445–452).

Catral, M., Han, L., Neumann, M., & Plemmons, R. (2004). On reduced rank nonnegative matrix factorization for symmetric nonnegative matrices. Linear Algebra and Its Applications.

Chan, P. K., Schlag, M. D. F., & Zien, J. Y. (1993). Spectral k-way ratio-cut partitioning and clustering. DAC '93 (pp. 749–754).

Dhillon, I., Guan, Y., & Kulis, B. (2004). A unified view of kernel k-means, spectral clustering and graph cuts (Technical Report TR-04-25). University of Texas at Austin.

Dhillon, I., Guan, Y., & Kulis, B. (2005). A fast kernel-based multilevel algorithm for graph clustering. KDD '05.

Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. KDD (pp. 269–274).

Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-clustering. KDD '03 (pp. 89–98).

Ding, C., He, X., & Simon, H. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. SDM '05.

Ding, C., Li, T., Peng, W., & Park, H. (2006). Orthogonal nonnegative matrix tri-factorizations for clustering. KDD '06.

Ding, C. H. Q., He, X., Zha, H., Gu, M., & Simon, H. D. (2001). A min-max cut algorithm for graph partitioning and data clustering. Proceedings of ICDM 2001 (pp. 107–114).

Hendrickson, B., & Leland, R. (1995). A multilevel algorithm for partitioning graphs. Supercomputing '95 (p. 28).

Henzinger, M., Motwani, R., & Silverstein, C. (2003). Challenges in web search engines. Proc. of the 18th International Joint Conference on Artificial Intelligence (pp. 1573–1579).

Karypis, G. (2002). A clustering toolkit.

Karypis, G., & Kumar, V. (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20, 359–392.

Kernighan, B., & Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49, 291–307.

Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). Trawling the Web for emerging cyber-communities. Computer Networks, 31, 1481–1493.

Lang, K. (1995). NewsWeeder: Learning to filter netnews. ICML.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.

Li, T. (2005). A general model for clustering binary data. KDD '05.

Long, B., Zhang, Z. M., & Yu, P. S. (2005). Co-clustering by block value decomposition. KDD '05.

Nasraoui, O., Krishnapuram, R., & Joshi, A. (1999). Relational clustering based on a new robust estimator with application to web mining. NAFIPS 99.

Salakhutdinov, R., & Roweis, S. (2003). Adaptive overrelaxed bound optimization methods. ICML '03.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888–905.

Strehl, A., & Ghosh, J. (2002). Cluster ensembles – a knowledge reuse framework for combining partitionings. AAAI 2002 (pp. 93–98).

Yu, K., Yu, S., & Tresp, V. (2005). Soft clustering on graphs. NIPS '05.

Yu, S., & Shi, J. (2003). Multiclass spectral clustering. ICCV '03.

Zha, H., Ding, C., Gu, M., He, X., & Simon, H. (2001). Bipartite graph partitioning and data clustering. ACM CIKM '01.