Clustering Scientific Collaboration Networks

Fabricio Murai Haibin Huang Jie Bing

December 15, 2011

Abstract In this work, we evaluate the performance of different clustering methods applied to scientific collaboration networks. In particular, we study two subsets of the coauthorship network we obtain from the DBLP dataset: the network composed of the faculty of the CS department at UMass and the largest connected component of the coauthorship network (containing about 848,000 nodes). We apply the following methods to both weighted and unweighted versions of these networks: Spectral and Hierarchical Clustering, the Kernighan-Lin algorithm, Spectral Partitioning, Edge betweenness and Leading eigenvector community detection. Our results show that methods that are not spectral-based perform better in general, but are clearly not scalable. On the other hand, we empirically show that Spectral Clustering can perform almost as well as non-spectral-based methods, while remaining scalable as long as we use an approximate (but accurate) spectral decomposition. Finally, we include a discussion of how to handle large sparse matrices.

1 Introduction

Most real networks exhibit non-trivial characteristics such as long-tail distributions, small distances and high clustering. In particular, consider scientific collaboration networks, where nodes represent authors and edges between two nodes indicate that they have published at least one paper together. It is well known that these networks are highly clustered, which means that there are groups of scientists that are much more likely to be connected to each other than if they were randomly selected. Although graph visualization techniques can be used to find clusters in very small networks, this becomes impracticable when graphs grow larger. Fortunately, there are many methods proposed for graph clustering in the literature that can deal with thousands of nodes, including the Kernighan-Lin algorithm, hierarchical clustering and edge-betweenness community detection. For even larger graphs (e.g., hundreds of thousands of nodes), some clustering methods cannot be applied directly, while others cannot be used at all. For instance, spectral-based methods include a spectral decomposition step, and hence can only be used if we replace this step by an approximation that can be computed efficiently. However, we will see that some approximations can lead to poor clusterings in comparison to the exact decomposition. It is worth noting that, unlike regression or classification problems, a clustering problem has no single right answer. Nevertheless, one can define the quality of a clustering of a graph with respect to the connectivity (i.e., the relative number of edges) inside clusters (e.g., modularity) or between clusters (e.g., cut set size). In this work, we compare different clustering methods applied to scientific collaboration networks in Computer Science extracted from the DBLP dataset. This dataset comprises more than 1 million authors who together had published more than 2.8 million papers at the time of this work.

variable            description
A                   adjacency matrix of G
n                   number of vertices (nodes) in G
m                   number of edges in G
k_i = \sum_j A_ij   degree of i (in the unweighted graph)
K                   vector such that K_i = k_i
D                   diagonal matrix such that D_ii = k_i
s_i                 cluster to which vertex i belongs
S = (s_1,...,s_n)   partitioning of G
θ(S_a, S_b)         cut set size of the bisection (S_a, S_b) of G
δ(·)                Kronecker delta

Table 1: Notation description.

Regarding the methods, we compare Spectral and Hierarchical Clustering, the Kernighan-Lin algorithm, Spectral Partitioning, Edge betweenness and Leading eigenvector community detection. The clustering is then assessed using the modularity metric.

2 Problem statement

Consider a set of publications and the corresponding authors extracted from the DBLP dataset. We build an undirected graph G where nodes are authors, including an edge between every pair of authors that have published at least one paper together. We work with both weighted and unweighted versions of this graph. In the weighted version, the weight A_ij of the edge between nodes i and j is computed as follows. Let P be the set of papers that i and j published together, and let n_p be the number of authors of paper p. We set

A_{ij} = \sum_{p \in P} 1/n_p.

In the unweighted version, A is simply the adjacency matrix of G. Given the (un)weighted graph G, we want to find a high-quality clustering. In order to do this, we apply different clustering methods and evaluate the results with respect to the modularity metric, which is explained in detail in Section 3.1. The notation we use throughout this document is summarized in Table 1.
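As an illustration, the weight computation above can be expressed directly in R. The sketch below is a minimal example under the assumption that the parsed DBLP data has been flattened into a hypothetical data frame `papers` with one row per (paper, author) pair; this is not the raw DBLP XML format.

# A minimal sketch of building the weighted adjacency matrix A (assumption:
# `papers` is a data frame with columns paper_id and author_id, one row per
# authorship; this input format is hypothetical, not the raw DBLP XML).
library(Matrix)
build_weighted_A <- function(papers) {
  authors <- sort(unique(papers$author_id))
  n <- length(authors)
  A <- Matrix(0, n, n, sparse = TRUE)
  for (p in unique(papers$paper_id)) {
    coauthors <- match(papers$author_id[papers$paper_id == p], authors)
    np <- length(coauthors)                      # number of authors of paper p
    if (np < 2) next
    for (i in coauthors) for (j in coauthors)
      if (i != j) A[i, j] <- A[i, j] + 1 / np    # add 1/np for each coauthor pair
  }
  A
}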

3 Background

3.1 Modularity

The modularity is a metric proposed by Newman [1] to evaluate clustering quality. It measures how many more edges lie inside clusters than would be expected under a null model. In particular, the null model Newman uses is the configuration model, in which the probability of i and j being connected is given by k_i k_j / (2m). The modularity for unweighted graphs is given by

Q(S) = \frac{1}{2m} \sum_{i,j \in V} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(s_i, s_j)    (1)

This metric can be easily extended to weighted graphs, by taking A as the weighted adjacency matrix.
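For small networks, Eq. (1) can be evaluated directly. Below is a minimal R sketch, assuming A is the (possibly weighted) adjacency matrix and s is a vector of cluster labels; for large graphs one would use igraph's modularity() instead of this dense computation.

# A minimal sketch of Eq. (1) for small networks (assumptions: A is the
# (possibly weighted) adjacency matrix and s is a vector of cluster labels).
modularity_Q <- function(A, s) {
  A  <- as.matrix(A)              # dense form is fine for small networks
  k  <- rowSums(A)                # (weighted) degrees k_i
  m2 <- sum(k)                    # 2m: twice the number (total weight) of edges
  same <- outer(s, s, "==")       # Kronecker delta on the cluster labels
  sum((A - outer(k, k) / m2) * same) / m2
}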

4 Methods

4.1 Spectral-based methods

The k-means algorithm uses the Euclidean distance as a similarity measure between observations. However, we cannot directly apply k-means to the adjacency matrix A because nodes are not embedded in a Euclidean space. In fact, if we consider the rows of A as coordinates in a Euclidean space, the distances between pairs of nodes will all be very similar and will not correspond to distances in the graph. Since the k-means algorithm depends mainly on the distance measure, it will not work well if we use the Euclidean distance in the original data space. Hence we use spectral-based methods, which help us translate our data into a suitable vector space.

4.1.1 Spectral clustering

The basic idea of spectral clustering is to map the original data into a vector space spanned by a few eigenvectors and apply the k-means algorithm in that space. The assumption here is that although our data samples are high dimensional, they lie in a low-dimensional subspace of the original space. In the literature, there are several versions of spectral clustering based on different definitions of the graph Laplacian operator. Here we use the spectral clustering proposed by Shi and Malik [2], based on the normalized graph Laplacian. The algorithm is as follows:
Step 1: Compute the normalized matrix M = D^{-1/2} A D^{-1/2}
Step 2: Compute the top k eigenvectors of M
Step 3: Arrange the eigenvectors as the columns of a matrix Y
Step 4: Run k-means on the rows of the embedding matrix Y
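A minimal R sketch of these steps is given below, using the irlba package (see Section 5.2) for the approximate decomposition; treating the leading singular vectors of the symmetric matrix M as its leading eigenvectors is an assumption of this sketch, not part of the original algorithm description.

# A minimal sketch of the four steps above (assumptions: A is a sparse symmetric
# adjacency matrix from the Matrix package and k is chosen beforehand).
library(Matrix)
library(irlba)
spectral_clustering <- function(A, k) {
  d <- 1 / sqrt(rowSums(A))
  M <- Diagonal(x = d) %*% A %*% Diagonal(x = d)  # Step 1: M = D^{-1/2} A D^{-1/2}
  dec <- irlba(M, nv = k)                         # Step 2: approximate top-k decomposition (IRLB)
  Y <- dec$u                                      # Step 3: vectors arranged as columns
  kmeans(Y, centers = k, nstart = 10)$cluster     # Step 4: k-means on the rows of Y
}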

4.1.2 Spectral partitioning

This method attempts to minimize the cut set size of a partitioning of the graph. The optimization problem is defined as a function of the cut set size between two groups:

\arg\min_S \; \frac{1}{2} \sum_{i,j} A_{ij} \, \mathbb{1}(s_i \neq s_j)

It is shown in [3] that if we allow s_i to assume any value in [−1, 1], then the solution to this minimization problem is the eigenvector of the graph Laplacian corresponding to its second smallest eigenvalue, also called the Fiedler vector. This leads to the following algorithm:
Step 1: Compute the graph Laplacian L = D − A
Step 2: Find the Fiedler vector v of L
Step 3: Run k-means on the elements of the eigenvector v
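A dense R sketch of these steps, suitable only for small networks such as the CS UMass one, might look as follows; running k-means on the single Fiedler vector simply follows the step list above.

# A minimal sketch of spectral partitioning via the Fiedler vector (assumption:
# A is a symmetric adjacency matrix small enough to be handled densely).
spectral_partitioning <- function(A, k = 2) {
  A <- as.matrix(A)
  L <- diag(rowSums(A)) - A                # Step 1: graph Laplacian L = D - A
  eig <- eigen(L, symmetric = TRUE)        # eigenvalues are returned in decreasing order
  fiedler <- eig$vectors[, nrow(A) - 1]    # Step 2: eigenvector of the 2nd smallest eigenvalue
  kmeans(fiedler, centers = k, nstart = 10)$cluster  # Step 3
}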

4.1.3 Leading eigenvector

Newman proposes an algorithm to directly maximize the modularity [4]. It is based on the relaxation of the following maximization problem:

\arg\max_s \; Q = \frac{1}{4m} s^T B s

s.t. \sum_i s_i = 0

where s_i is either −1 or 1, and B = A − \frac{K K'}{2m} is called the modularity matrix. In the relaxed version, s_i is allowed to assume any value in [−1, 1].

The solution s to the relaxed optimization problem is the eigenvector corresponding to the largest eigenvalue of the matrix B. Hence we have the Leading eigenvector algorithm:
Step 1: Compute B = A − KK'/(2m)
Step 2: Find the eigenvector v corresponding to the largest eigenvalue of B
Step 3: Run k-means on the elements of the eigenvector v
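For a small network the three steps can be carried out with a dense eigendecomposition, as in the sketch below; Section 6.1 discusses how to avoid forming B explicitly when the graph is large and sparse.

# A minimal sketch of the leading eigenvector method for a small network
# (assumption: A is an adjacency matrix small enough to be handled densely).
leading_eigenvector <- function(A, k = 2) {
  A <- as.matrix(A)
  K <- rowSums(A)
  m <- sum(K) / 2
  B <- A - outer(K, K) / (2 * m)                 # Step 1: modularity matrix B = A - KK'/(2m)
  v <- eigen(B, symmetric = TRUE)$vectors[, 1]   # Step 2: eigenvector of the largest eigenvalue
  kmeans(v, centers = k, nstart = 10)$cluster    # Step 3: k-means on the elements of v
}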

4.2 Non-spectral-based methods

4.2.1 Hierarchical Clustering

Given a similarity measure between pairs of nodes and also between groups, we can perform hierarchical clustering as follows [5]:
Step 1: Evaluate the similarity between all pairs of nodes
Step 2: Assign each node to a group of its own
Step 3: Find the pair of groups with the highest similarity and join them
Step 4: Repeat Step 3 until we have k groups
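These steps map directly onto R's built-in hclust; the sketch below uses 1 − A as a crude node dissimilarity, which is an assumption of this example rather than the only possible choice.

# A minimal sketch using hclust (assumption: the dissimilarity between two
# nodes of the unweighted graph is taken as 1 - A_ij, a deliberately crude choice).
hierarchical_clusters <- function(A, k, method = "ward.D") {
  D  <- as.dist(1 - as.matrix(A))        # Step 1: pairwise dissimilarities
  hc <- hclust(D, method = method)       # Steps 2-3: agglomerate the most similar groups
  cutree(hc, k = k)                      # Step 4: stop at k groups
}
# Agglomeration methods compared in Section 5.3: "ward.D", "single",
# "complete", "average", "mcquitty".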

4.2.2 Kernighan-Lin algorithm

This method [6] bisects the graph repeatedly until we obtain k groups. In each step, it randomly assigns nodes to one of two clusters and swaps pairs of nodes in order to reduce the cut size θ(S1, S2).
Step 1: Randomly divide the network into two groups, S1 and S2, with n1 and n2 nodes, respectively, marking all nodes as untouched
Step 2: For each pair of untouched nodes (i, j), i ∈ S1, j ∈ S2, calculate how much the cut size would change if we swapped i and j
Step 3: Swap the pair (i, j) that leads to the smallest cut size and mark both nodes as touched
Step 4: Among all states (S1, S2) that the network passed through during the swapping procedure, let (S1', S2') be the one with the smallest cut size
Step 5: Go to Step 2 with S1 = S1', S2 = S2'. Stop if no improvement was found
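One improvement pass (Steps 2-4) can be sketched in R as below; the outer repetition of Step 5 and the extension from a bisection to k groups are left out. This is a simplified illustration, not the exact implementation used in the experiments.

# A minimal sketch of one Kernighan-Lin pass (assumptions: A is a dense 0/1
# adjacency matrix and s is a vector of cluster labels 1 or 2).
kl_pass <- function(A, s) {
  cut_size <- function(s) sum(A[s == 1, s == 2])
  untouched <- rep(TRUE, nrow(A))
  best_s <- s; best_cut <- cut_size(s)
  repeat {
    c1 <- which(s == 1 & untouched); c2 <- which(s == 2 & untouched)
    if (length(c1) == 0 || length(c2) == 0) break
    # d(i): external minus internal degree of i under the current partition
    d <- function(i) sum(A[i, s != s[i]]) - sum(A[i, s == s[i]])
    gains <- outer(sapply(c1, d), sapply(c2, d), "+") -
             2 * A[c1, c2, drop = FALSE]          # reduction in cut size if (i, j) are swapped
    best <- which(gains == max(gains), arr.ind = TRUE)[1, ]
    i <- c1[best[1]]; j <- c2[best[2]]
    s[c(i, j)] <- s[c(j, i)]                      # Step 3: swap the best pair
    untouched[c(i, j)] <- FALSE
    if (cut_size(s) < best_cut) { best_cut <- cut_size(s); best_s <- s }
  }
  best_s                                          # Step 4: best state visited during the pass
}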

4.2.3 Edge betweenness

This method assumes that clusters are connected by only a few edges, so many shortest paths between clusters go through these edges (i.e., these edges have high betweenness). By removing edges with high betweenness, we expect to disconnect the clusters. The algorithm is described in [5]:
Step 1: Compute the edge betweenness of all edges
Step 2: Remove the edge with the highest betweenness
Step 3: Stop when k connected components are found; otherwise go to Step 1
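The igraph library used in Section 5.2 provides this algorithm directly, so a usage sketch is short; the graph object g is assumed to be built from the 0/1 adjacency matrix of the unweighted network.

# A minimal sketch using igraph's edge betweenness (Girvan-Newman) community
# detection (assumption: A is the 0/1 adjacency matrix of the unweighted graph).
library(igraph)
g  <- graph_from_adjacency_matrix(A, mode = "undirected")
eb <- cluster_edge_betweenness(g)   # repeatedly removes the edge with highest betweenness
s  <- cut_at(eb, no = 6)            # stop when k = 6 components remain
modularity(g, s)                    # evaluate the resulting clustering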

5 Experimental Results

We conducted a series of experiments to compare the performance of the methods described in Section 4 when clustering a subset of the DBLP dataset.

5.1 Dataset and Preprocessing

We extract two networks from the DBLP dataset. The first is the coauthorship network comprising only faculty members of the CS department at UMass (henceforth referred to as the “CS UMass

Network”) and the second corresponds to a large subset of the dataset (containing approximately 840,000 authors). DBLP provides XML snapshots and also a Java parser that prints simple statistics about the dataset. We modified this parser to obtain two kinds of matrices. The first is the adjacency matrix of the weighted graph described in Section 2, which is used as input for the clustering methods. The second is a matrix C, where C_ij is the number of papers that author i published in conference j. This matrix can be used to compute a metric based on publication venues, described in Section 6.2. It is worth noting that DBLP assigns a unique ID to each author, even if the author appears in the dataset multiple times under slightly different names. Furthermore, we treat a conference held in different years as a single conference by trimming the numbers from the strings that store the conference names. After extracting the network that corresponds to the entire DBLP dataset, we observed that it is composed of one very large component (containing approximately 848 thousand nodes) and many small components (containing up to 29 nodes). Since there is no reason to cluster different components together, we extract only the largest component (henceforth referred to as the “Largest Component of the coauthorship network”) to use in our experiments. In what follows, we evaluate the clustering results using the modularity metric.

5.2 Implementation

We implemented all the described methods using the software R. In order to load large matrices into memory, we use the sparse matrix representation provided by the library “Matrix”. Furthermore, to obtain the eigenvectors corresponding to the largest (smallest) eigenvalues, we run the implicitly restarted Lanczos bidiagonalization (IRLB) algorithm from the library “irlba”. Finally, we also use the library “igraph” to perform hierarchical clustering and edge betweenness community detection, and to compute the modularity metric. Some methods are non-deterministic, including all methods that use k-means (due to the initialization of the centers). For those non-deterministic cases, we report the average and the 95% confidence interval over 10 runs.
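The sketch below illustrates how these three libraries fit together; the edge-list file name and its format (one row per undirected edge with i < j) are assumptions made for the sake of the example.

# A minimal sketch of the toolchain (assumptions: the modified parser writes the
# coauthorship graph to "dblp_edges.txt" with columns i, j, w and one row per
# undirected edge with i < j; file name and format are hypothetical).
library(Matrix)
library(irlba)
library(igraph)
edges <- read.table("dblp_edges.txt", col.names = c("i", "j", "w"))
A <- sparseMatrix(i = edges$i, j = edges$j, x = edges$w, symmetric = TRUE)
dec <- irlba(A, nv = 16)             # IRLB: approximate top eigenvectors/singular vectors
g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
cl <- cluster_leading_eigen(g)       # igraph's own leading eigenvector implementation
modularity(g, membership(cl))        # clustering quality of the result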

5.3 Results: CS UMass Network

5.3.1 Unweighted graph

First, we compare the performance of hierarchical clustering on the unweighted coauthorship network, using different agglomeration methods to evaluate the similarity between groups, namely complete, single and average linkage, Ward and McQuitty. Intuitively, the complete and single methods are likely to find clusters with low modularity, because the minimum and the maximum similarity over all pairs of nodes are not good indicators of whether two groups are similar. Figure 1 shows the modularity computed from the results of each agglomeration method as the number of clusters k varies. As expected, the single and complete methods show poor performance, while the other methods achieve higher modularity values. Observing the results for average linkage and Ward, the modularity appears to be maximal with 5 or 6 clusters. Furthermore, the Ward method exhibits the best performance (Q above 0.2) among the hierarchical clustering experiments. In what follows, we use this specific instance as a representative of the hierarchical clustering methods.

[Plot: modularity Q versus the number of clusters (2 to 10) for HC-ward, HC-single, HC-complete, HC-average and HC-mcquitty.]

Figure 1: Modularity of Unweighted Hierarchical Clustering

[Panel (a): eigenvalues of the graph Laplacian and of the normalized Laplacian. Panel (b): modularity Q versus the number of clusters for Spectral Clustering, Spectral Clustering+Lanczos and Spectral Partitioning.]

Figure 2: Results of spectral-based methods for the unweighted CS UMass network.

Now we compare two spectral-based methods, namely spectral partitioning and spectral clustering. In particular, for spectral clustering we consider both the exact spectral decomposition and an approximation obtained using IRLB. In order to choose the number of eigenvectors k used in spectral clustering, we analyze the spectrum of the (normalized) graph Laplacian shown in Figure 2a. We chose k = 16, even though there are other eigenvalues of significant magnitude, because one usually does not want to compute a number of eigenvectors that is large relative to the matrix size.

We expected to obtain similar results when IRLB is used instead of the exact spectral decomposition. However, Figure 2b shows that this approximation achieves very low modularity, regardless of the number of clusters. Moreover, spectral partitioning performs slightly better than spectral clustering. Thus, we use the former as a representative of the spectral-based methods in what follows.

We then compare Hierarchical Clustering and Spectral Partitioning to the other methods described in Section 4. From Figure 3 we notice that Kernighan-Lin and HC-Ward outperform all other methods. However, it is worth noting that these methods (together with edge betweenness) are not scalable. On the other hand, spectral clustering and leading eigenvector are suited to large networks and do achieve some clustering, even though they do not perform as well as the other methods on small networks. Moreover, we observed that edge betweenness is worse than the other algorithms, especially for small k, because it tends to disconnect degree-one nodes first. Again, the modularity appears to increase with k up to 6 clusters and then starts to decrease.

[Plot: modularity Q versus the number of clusters (2 to 10) for HC-ward, Edge-Betweenness, Kernighan-Lin, Leading-eigenvector and Spectral-Partitioning.]

Figure 3: Modularity of Unweighted Clustering

[Panel (a): modularity Q versus the number of clusters for Spectral Clustering, Spectral Clustering+Lanczos and Spectral Partitioning. Panel (b): Q versus the number of clusters for HC-ward, Edge-Betweenness, Kernighan-Lin, Leading-Eigenvector and spectral clustering with exact eigenvectors.]

Figure 4: Modularity for weighted clustering.

5.3.2 Weighted graph

The results obtained for the weighted coauthorship network are very similar to those for the unweighted one. Although the modularity values in the weighted and unweighted cases cannot be compared directly, the relative performance of the clustering methods is roughly the same. The most significant difference for the weighted graph is that Spectral Clustering performs best among the spectral-based methods (see Figure 4a). Figure 4b shows results similar to those in Figure 3. Note, however, that spectral clustering now achieves modularity values as high as the other methods. Although we could not obtain good results when using IRLB to find an approximate decomposition, this is a somewhat positive result, since spectral-based methods can be regarded as scalable as long as we replace the spectral decomposition step by an accurate approximation. Conversely, the other methods we use are not scalable and could not be applied to the largest component of the coauthorship network.

5.4 Results: Largest Component of the Coauthorship Network

We applied spectral-based methods to find clusters in the largest component of the coauthorship network using the IRLB algorithm. However, the performance we observed was very poor, and we believe this is due to bad approximations of the actual eigenvectors given by IRLB, just as in the case of the CS UMass Network. On the other hand, we could successfully use the Leading Eigenvector method with the Power Method.

[Plots: modularity Q versus the number of clusters (2 to 10) for the Leading Eigenvector method; panel (a) unweighted network, panel (b) weighted network.]

Figure 5: Modularity for the Largest Component of the Coauthorship Network.

This is possible because the clustering algorithm only requires the eigenvector corresponding to the largest eigenvalue of the modularity matrix, which the Power Method can compute efficiently. Figures 5(a-b) show the results for the giant component of the coauthorship network. It is worth noting that the maximum value of the modularity depends on the network structure and is always less than 1. Furthermore, the expected modularity of clusters chosen at random is 0. Thus, we conclude that the Leading Eigenvector method found clusters somewhat successfully.

6 Discussion and future work

6.1 How to deal with very large datasets

There are several challenges one faces when working with very large datasets, starting with loading matrices into memory. In the case of sparse matrices, for instance, special data structures can be used to store only the non-zero elements and to perform operations efficiently. Fortunately, many algorithms initially designed to solve small problems can be easily adapted to handle larger ones. For instance, instead of computing the normalized matrix M of Section 4.1.1 via explicit matrix multiplications, we compute it using "row by element" products, which we denote here by ⋆. More precisely, M can be calculated as M = ((A ⋆ d)' ⋆ d)', where d is a vector such that d_i = 1/√k_i.

Note also that some clustering algorithms we discuss have a spectral decomposition step. Although a complete eigendecomposition is very costly, we usually need only the eigenvectors corresponding to the k largest (smallest) eigenvalues, which can be obtained efficiently by Lanczos methods. In particular, when k = 1 (e.g., the leading eigenvector method), the Power method can efficiently find a good approximation for a sparse matrix. Although the modularity matrix B is very dense, a simple trick makes it possible to apply either IRLB or the Power method: both are based on multiplying B by a vector x, so instead of computing Bx directly we programmed them to compute Ax − K(K'x)/(2m).

Finally, we emphasize that the software to be used must be chosen carefully. Despite the fact that R offers several libraries and is relatively easy to use, almost all implemented methods are serial and very inefficient when handling large matrices. In particular, we had to implement k-means on CUDA. This implementation was able to cluster matrices of size 1M×5 in less than 10 s, while R ran the same task for more than 12 hours without finishing it.
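The two tricks above can be sketched in a few lines of R; the sketch assumes A is a sparse symmetric Matrix object and uses a fixed number of power iterations, which is a simplification of what a production implementation would do.

# A minimal sketch of the sparse-matrix tricks above (assumption: A is a sparse
# symmetric matrix from the Matrix package).
library(Matrix)
K <- rowSums(A)
m <- sum(K) / 2
d <- 1 / sqrt(K)
# "Row by element" products: scale rows and columns of A by d instead of
# multiplying by the diagonal matrix D^{-1/2}; the result stays sparse.
M <- t(t(A * d) * d)                 # equals D^{-1/2} A D^{-1/2}
# Implicit multiplication by the dense modularity matrix B = A - K K' / (2m):
mult_B <- function(x) as.numeric(A %*% x) - K * sum(K * x) / (2 * m)
# Power method for the leading eigenvector of B, without ever forming B.
v <- rnorm(length(K))
for (it in 1:200) {
  v <- mult_B(v)
  v <- v / sqrt(sum(v^2))            # renormalize at every iteration
}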

6.2 Venue metric

Leskovec et al. argue in [7] that information about publication venues can be used as a ground truth for clustering in the DBLP dataset. Based on this idea, we outline a new metric for assessing clustering quality, similar to Eq. (1). Let C be the total number of conferences and C_i be the set of venues in which author i has published. We define the venue metric as

R(S) = \frac{1}{C} \sum_{i,j \in V} \left[ |C_i \cap C_j| - \frac{|C_i| |C_j|}{C} \right] \delta(s_i, s_j).    (2)

Although not very discriminative among methods, our preliminary results show that the venue metric has positive values for the partitioning obtained when we use Spectral Clustering combined with IRLB on the unweighted CS UMass network. This means that this combination may actually be performing better than the modularity indicates.
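For a small network, Eq. (2) can be evaluated directly from the author-by-conference matrix described in Section 5.1; the sketch below assumes that matrix is called Cmat and that s holds the cluster labels, both of which are hypothetical names.

# A minimal sketch of Eq. (2) for small networks (assumptions: Cmat is the
# author-by-conference count matrix of Section 5.1 and s is a vector of
# cluster labels; both names are hypothetical).
venue_metric <- function(Cmat, s) {
  P <- (as.matrix(Cmat) > 0) * 1       # P[i, c] = 1 if author i published at venue c
  nconf <- ncol(P)                     # C: total number of conferences
  inter <- P %*% t(P)                  # |C_i intersect C_j| for every pair of authors
  sizes <- rowSums(P)                  # |C_i|
  expected <- outer(sizes, sizes) / nconf
  same <- outer(s, s, "==")            # delta(s_i, s_j)
  sum((inter - expected) * same) / nconf
}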

6.3 Future work

From the discussion above, our future work can focus mainly on two directions. First, we need to improve the accuracy of the eigenvectors obtained using approximate spectral decomposition methods. We observe from the experimental results that some of our algorithms depend strongly on the accuracy of the eigenvectors. Moreover, the time spent computing the eigenvectors is currently the bottleneck of the algorithms. One possible solution is to use a GPU-based SVD algorithm [8]. Since we are dealing with large sparse matrices, we can also use algorithms based on sampling, such as compressed sensing [9]. Secondly, we can take advantage of content information. The methods we discussed so far only consider features of the network structure such as connections between nodes. In particular, either the paper titles or their content could provide useful information for clustering. Therefore, we could try to combine the graph structure with an LDA model to improve our results. Such a combined model has been shown to be a good way to cluster networks [10].

7 Related Work

Although Saerens et al. do not provide an explicit clustering algorithm, they propose a method for computing the distance between nodes in a weighted undirected graph (the Euclidean Commute Time Distance, ECTD) and show how to obtain a subspace projection of the nodes of the graph that preserves as much variance as possible w.r.t. the ECTD [11]. One could apply k-means on such a projection to find clusters in a graph.

Kulis et al. [12] address the problem of graph clustering by proposing a kernel-based semi-supervised clustering method. This method is semi-supervised because they assume that some constraints are given, namely nodes that must be in the same or in different clusters.

Two spectral-based methods were proposed by White and Smyth [13]. In the first, the authors use an eigenvector formulation to maximize the modularity, seeking the global optimum. This approach uses a variant of the Lanczos method. The second is a greedy version of the first that may not give results as good, but is shown to be significantly faster.

In our work, we used Lanczos methods to perform an approximate spectral decomposition of a matrix M, but there are other methods that could be applied for this purpose. In particular, one can approximate M as a Kronecker product of two smaller matrices B and C. The eigenvectors and eigenvalues of B ⊗ C are then determined from those of B and C [14].

As an alternative to spectral-based methods, Dhillon et al. propose the use of a weighted kernel k-means [15]. In that work, they show how to choose the kernel so that a weighted graph clustering objective, such as measures based on cut sizes, is optimized.

8 Conclusion

We evaluated different clustering methods applied to a small coauthorship network extracted from the DBLP dataset. Our results show that methods such as Hierarchical Clustering and the Kernighan-Lin algorithm perform better than spectral-based methods with respect to the modularity metric. Nevertheless, we show that Spectral Clustering can perform almost as well as non-spectral-based methods, while retaining some scalability. In fact, this method could be applied to the largest component of the coauthorship network if we could find a good approximation of the top eigenvectors. Although we failed to use IRLB for this purpose, we successfully used the Power Method to find an accurate approximation of the eigenvector corresponding to the largest eigenvalue of the modularity matrix B. Hence we could apply the Leading Eigenvector method to a very large network, finding reasonable clusters.

References

[1] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Phys. Rev. E, vol. 69, p. 026113, Feb. 2004.
[2] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE TPAMI, vol. 22, pp. 888-905, 2000.
[3] M. Newman, Networks: An Introduction. Oxford University Press, 2010.
[4] M. E. J. Newman, "Modularity and community structure in networks," PNAS, vol. 103, pp. 8577-8582, June 2006.
[5] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Phys. Rev. E, vol. 69, p. 066133, 2004.
[6] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Sys. Tech. J., vol. 49, no. 2, pp. 291-308, 1970.
[7] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, "Statistical properties of community structure in large social and information networks," in WWW, (New York, NY, USA), pp. 695-704, ACM, 2008.
[8] R. B. Foster, R. Wang, and S. Mahadevan, "A GPU-based approximate SVD algorithm," 9th Intl. Conf. on Parallel Processing and Applied Mathematics, 2011.
[9] E. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, pp. 489-509, Feb. 2006.
[10] Q. Mei, D. Cai, D. Zhang, and C. Zhai, "Topic modeling with network regularization," in WWW, pp. 101-110, 2008.
[11] M. Saerens, F. Fouss, L. Yen, and P. Dupont, "The principal components analysis of a graph, and its relationships to spectral clustering," in Machine Learning: ECML 2004, vol. 3201 of Lecture Notes in Computer Science, pp. 371-383, Springer Berlin / Heidelberg, 2004.
[12] B. Kulis, S. Basu, I. Dhillon, and R. Mooney, "Semi-supervised graph clustering: a kernel approach," in ICML '05, (New York, NY, USA), pp. 457-464, ACM, 2005.
[13] S. White and P. Smyth, "A spectral clustering approach to finding communities in graphs," 2005.

[14] J. Johns, S. Mahadevan, and C. Wang, "Compact spectral bases for value function approximation using Kronecker factorization," in AAAI-07, vol. 1, pp. 559-564, AAAI Press, 2007.
[15] I. Dhillon, Y. Guan, and B. Kulis, "Weighted graph cuts without eigenvectors: a multilevel approach," IEEE TPAMI, vol. 29, pp. 1944-1957, Nov. 2007.
