Clustering Scientific Collaboration Networks

Fabricio Murai Haibin Huang Jie Bing

December 15, 2011

Abstract In this work, we evaluate the performance of different clustering methods applied to scientific collaboration networks. In particular, we study two subsets of the coauthorship network we obtain from the DBLP dataset: the network composed of the faculty of the CS department at UMass and the largest connected component of the coauthorship network (containing about 848,000 nodes). We apply the following methods to both weighted and unweighted versions of these networks: Spectral and Hierarchical Clustering, the Kernighan-Lin algorithm, Spectral Partitioning, Edge betweenness and Leading eigenvector community detection. Our results show that methods that are not spectral-based perform better in general, but are clearly not scalable. On the other hand, we empirically show that Spectral Clustering can perform almost as well as non-spectral-based methods, while remaining scalable as long as we use an approximate (but accurate) spectral decomposition. Finally, we include a discussion of how to handle large sparse matrices.

1 Introduction

Most real networks exhibit non-trivial characteristics such as long-tail distributions, small distances and high clustering. In particular, consider scientific collaboration networks, where nodes represent authors and edges between two nodes indicate that they have published at least one paper together. It is well known that these networks are highly clustered, which means that there are groups of scientists that are much more likely to be connected to each other than if they were randomly selected. Although graph visualization techniques can be used to find clusters in very small networks, this becomes impracticable when graphs grow larger. Fortunately, there are many methods proposed for graph clustering in the literature that can deal with thousands of nodes, including the Kernighan-Lin algorithm, hierarchical clustering and edge-betweenness community detection. For even larger graphs (e.g., hundreds of thousands of nodes), some clustering methods cannot be applied directly, while others cannot be used at all. For instance, spectral-based methods include a spectral decomposition step, and hence can only be used if we replace this step by an approximation that can be computed efficiently. However, we will see that some approximations can lead to poor clusterings in comparison to the exact decomposition. It is worth noting that, unlike regression or classification problems, a clustering problem has no single right answer. Nevertheless, one can define the quality of a clustering of a graph with respect to the connectivity (i.e., the relative number of edges) inside clusters (e.g., modularity) or between clusters (e.g., cut set size). In this work, we compare different clustering methods applied to scientific collaboration networks in Computer Science extracted from the DBLP dataset. This dataset comprises more than 1 million authors who together had published more than 2.8 million papers at the time of this work.

variable            description
A                   adjacency matrix of G
n                   number of vertices (nodes) in G
m                   number of edges in G
k_i = \sum_j A_ij   degree of i (in the unweighted graph)
K                   vector such that K_i = k_i
D                   diagonal matrix such that D_ii = k_i
s_i                 cluster to which vertex i belongs
S = (s_1,...,s_n)   partitioning of G
θ(S_a, S_b)         cut set size of the bisection (S_a, S_b) of G
δ(·)                Kronecker delta

Table 1: Notation description.

Regarding the methods, we compare Spectral and Hierarchical Clustering, the Kernighan-Lin algorithm, Spectral Partitioning, Edge betweenness and Leading eigenvector community detection. The clustering is then assessed using the modularity metric.

2 Problem statement

Consider a set of publications and the corresponding authors extracted from the DBLP dataset. We build an undirected graph G where nodes are authors, including an edge between every pair of authors that have published at least one paper together. We work with both weighted and unweighted versions of this graph. In the weighted version, the weight A_ij of the edge between nodes i and j is computed as follows. Let P be the set of papers that i and j published together, and let n_p be the number of authors of paper p. We set

A_{ij} = \sum_{p \in P} 1/n_p.

In the unweighted version, A is simply the adjacency matrix of G. Given the (un)weighted graph G, we want to find a high-quality clustering. In order to do this, we apply different clustering methods and evaluate the results with respect to the modularity metric, which is explained in detail in Section 3.1. The notation we use throughout this document is summarized in Table 1.
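As an illustration, the weight computation above can be expressed directly in R. The sketch below is a minimal example under the assumption that the parsed DBLP data has been flattened into a hypothetical data frame `papers` with one row per (paper, author) pair; this is not the raw DBLP XML format.

# A minimal sketch of building the weighted adjacency matrix A (assumption:
# `papers` is a data frame with columns paper_id and author_id, one row per
# authorship; this input format is hypothetical, not the raw DBLP XML).
library(Matrix)
build_weighted_A <- function(papers) {
  authors <- sort(unique(papers$author_id))
  n <- length(authors)
  A <- Matrix(0, n, n, sparse = TRUE)
  for (p in unique(papers$paper_id)) {
    coauthors <- match(papers$author_id[papers$paper_id == p], authors)
    np <- length(coauthors)                      # number of authors of paper p
    if (np < 2) next
    for (i in coauthors) for (j in coauthors)
      if (i != j) A[i, j] <- A[i, j] + 1 / np    # add 1/np for each coauthor pair
  }
  A
}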

3 Background

3.1 Modularity

The modularity is a metric proposed by Newman [1] to evaluate clustering quality. It measures how many more edges lie inside clusters than would be expected under a null model. In particular, the null model Newman uses is the configuration model, in which the probability of i and j being connected is given by k_i k_j / (2m). The modularity for unweighted graphs is given by

Q(S) = \frac{1}{2m} \sum_{i,j \in V} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(s_i, s_j)    (1)

This metric can be easily extended to weighted graphs, by taking A as the weighted adjacency matrix.
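For small networks, Eq. (1) can be evaluated directly. Below is a minimal R sketch, assuming A is the (possibly weighted) adjacency matrix and s is a vector of cluster labels; for large graphs one would use igraph's modularity() instead of this dense computation.

# A minimal sketch of Eq. (1) for small networks (assumptions: A is the
# (possibly weighted) adjacency matrix and s is a vector of cluster labels).
modularity_Q <- function(A, s) {
  A  <- as.matrix(A)              # dense form is fine for small networks
  k  <- rowSums(A)                # (weighted) degrees k_i
  m2 <- sum(k)                    # 2m: twice the number (total weight) of edges
  same <- outer(s, s, "==")       # Kronecker delta on the cluster labels
  sum((A - outer(k, k) / m2) * same) / m2
}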

4 Methods

4.1 Spectral-based methods

The k-means algorithm uses the Euclidean distance as a similarity measure between observations. However, we cannot directly apply k-means to the adjacency matrix A because nodes are not embedded in a Euclidean space. In fact, if we consider the rows of A as coordinates in a Euclidean space, the distances between pairs of nodes will all be very similar and will not correspond to distances in the graph. Since the k-means algorithm depends mainly on the distance measure, it will not work well if we use the Euclidean distance in the original data space. Hence we use spectral-based methods, which help us translate our data into a suitable vector space.

4.1.1 Spectral clustering

The basic idea of spectral clustering is to map the original data into a vector space spanned by a few eigenvectors and apply the k-means algorithm in that space. The assumption here is that although our data samples are high dimensional, they lie in a low-dimensional subspace of the original space. In the literature, there are several versions of spectral clustering based on different definitions of the graph Laplacian operator. Here we use the spectral clustering proposed by Shi and Malik [2], based on the normalized graph Laplacian. The algorithm is as follows:
Step 1: Compute the normalized matrix M = D^{-1/2} A D^{-1/2}
Step 2: Compute the top k eigenvectors of M
Step 3: Arrange the eigenvectors as the columns of a matrix Y
Step 4: Run k-means on the rows of the embedding matrix Y
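A minimal R sketch of these steps is given below, using the irlba package (see Section 5.2) for the approximate decomposition; treating the leading singular vectors of the symmetric matrix M as its leading eigenvectors is an assumption of this sketch, not part of the original algorithm description.

# A minimal sketch of the four steps above (assumptions: A is a sparse symmetric
# adjacency matrix from the Matrix package and k is chosen beforehand).
library(Matrix)
library(irlba)
spectral_clustering <- function(A, k) {
  d <- 1 / sqrt(rowSums(A))
  M <- Diagonal(x = d) %*% A %*% Diagonal(x = d)  # Step 1: M = D^{-1/2} A D^{-1/2}
  dec <- irlba(M, nv = k)                         # Step 2: approximate top-k decomposition (IRLB)
  Y <- dec$u                                      # Step 3: vectors arranged as columns
  kmeans(Y, centers = k, nstart = 10)$cluster     # Step 4: k-means on the rows of Y
}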

4.1.2 Spectral partitioning

This method attempts to minimize the cut set size of a partitioning of the graph. The optimization problem is defined as a function of the cut set size between two groups:

\arg\min_S \; \frac{1}{2} \sum_{i,j} A_{ij} \, \mathbb{1}(s_i \neq s_j)

It is shown in [3] that if we allow s_i to assume any value in [−1, 1], then the solution to this minimization problem is the eigenvector of the graph Laplacian corresponding to its second smallest eigenvalue, also called the Fiedler vector. This leads to the following algorithm:
Step 1: Compute the graph Laplacian L = D − A
Step 2: Find the Fiedler vector v of L
Step 3: Run k-means on the elements of the eigenvector v
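A dense R sketch of these steps, suitable only for small networks such as the CS UMass one, might look as follows; running k-means on the single Fiedler vector simply follows the step list above.

# A minimal sketch of spectral partitioning via the Fiedler vector (assumption:
# A is a symmetric adjacency matrix small enough to be handled densely).
spectral_partitioning <- function(A, k = 2) {
  A <- as.matrix(A)
  L <- diag(rowSums(A)) - A                # Step 1: graph Laplacian L = D - A
  eig <- eigen(L, symmetric = TRUE)        # eigenvalues are returned in decreasing order
  fiedler <- eig$vectors[, nrow(A) - 1]    # Step 2: eigenvector of the 2nd smallest eigenvalue
  kmeans(fiedler, centers = k, nstart = 10)$cluster  # Step 3
}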

4.1.3 Leading eigenvector

Newman proposes an algorithm to directly maximize the modularity [4]. It is based on the relaxation of the following maximization problem:

\arg\max_s \; Q = \frac{1}{4m} s^T B s

s.t. \sum_i s_i = 0

where s_i is either −1 or 1, and B = A − \frac{K K'}{2m} is called the modularity matrix. In the relaxed version, s_i is allowed to assume any value in [−1, 1].

The solution s to the relaxed optimization problem is the eigenvector corresponding to the largest eigenvalue of the matrix B. Hence we have the Leading eigenvector algorithm:
Step 1: Compute B = A − KK'/(2m)
Step 2: Find the eigenvector v corresponding to the largest eigenvalue of B
Step 3: Run k-means on the elements of the eigenvector v
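For a small network the three steps can be carried out with a dense eigendecomposition, as in the sketch below; Section 6.1 discusses how to avoid forming B explicitly when the graph is large and sparse.

# A minimal sketch of the leading eigenvector method for a small network
# (assumption: A is an adjacency matrix small enough to be handled densely).
leading_eigenvector <- function(A, k = 2) {
  A <- as.matrix(A)
  K <- rowSums(A)
  m <- sum(K) / 2
  B <- A - outer(K, K) / (2 * m)                 # Step 1: modularity matrix B = A - KK'/(2m)
  v <- eigen(B, symmetric = TRUE)$vectors[, 1]   # Step 2: eigenvector of the largest eigenvalue
  kmeans(v, centers = k, nstart = 10)$cluster    # Step 3: k-means on the elements of v
}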

4.2 Non-spectral-based methods

4.2.1 Hierarchical Clustering

Given a similarity measure between pairs of nodes and also between groups, we can perform hierarchical clustering as follows [5]:
Step 1: Evaluate the similarity between all pairs of nodes
Step 2: Assign each node to a group of its own
Step 3: Find the pair of groups with the highest similarity and join them
Step 4: Repeat Step 3 until we have k groups
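These steps map directly onto R's built-in hclust; the sketch below uses 1 − A as a crude node dissimilarity, which is an assumption of this example rather than the only possible choice.

# A minimal sketch using hclust (assumption: the dissimilarity between two
# nodes of the unweighted graph is taken as 1 - A_ij, a deliberately crude choice).
hierarchical_clusters <- function(A, k, method = "ward.D") {
  D  <- as.dist(1 - as.matrix(A))        # Step 1: pairwise dissimilarities
  hc <- hclust(D, method = method)       # Steps 2-3: agglomerate the most similar groups
  cutree(hc, k = k)                      # Step 4: stop at k groups
}
# Agglomeration methods compared in Section 5.3: "ward.D", "single",
# "complete", "average", "mcquitty".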

4.2.2 Kernighan-Lin algorithm

This method [6] bisects the graph repeatedly until we obtain k groups. In each step, it randomly assigns nodes to one of two clusters and swaps pairs of nodes in order to reduce the cut size θ(S1, S2).
Step 1: Randomly divide the network into two groups, S1 and S2, with n1 and n2 nodes, respectively, marking all nodes as untouched
Step 2: For each pair of untouched nodes (i, j), i ∈ S1, j ∈ S2, calculate how much the cut size would change if we swapped i and j
Step 3: Swap the pair (i, j) that leads to the smallest cut size and mark both nodes as touched
Step 4: Among all states (S1, S2) that the network passed through during the swapping procedure, let (S1', S2') be the one with the smallest cut size
Step 5: Go to Step 2 with S1 = S1', S2 = S2'. Stop if no improvement was found
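One improvement pass (Steps 2-4) can be sketched in R as below; the outer repetition of Step 5 and the extension from a bisection to k groups are left out. This is a simplified illustration, not the exact implementation used in the experiments.

# A minimal sketch of one Kernighan-Lin pass (assumptions: A is a dense 0/1
# adjacency matrix and s is a vector of cluster labels 1 or 2).
kl_pass <- function(A, s) {
  cut_size <- function(s) sum(A[s == 1, s == 2])
  untouched <- rep(TRUE, nrow(A))
  best_s <- s; best_cut <- cut_size(s)
  repeat {
    c1 <- which(s == 1 & untouched); c2 <- which(s == 2 & untouched)
    if (length(c1) == 0 || length(c2) == 0) break
    # d(i): external minus internal degree of i under the current partition
    d <- function(i) sum(A[i, s != s[i]]) - sum(A[i, s == s[i]])
    gains <- outer(sapply(c1, d), sapply(c2, d), "+") -
             2 * A[c1, c2, drop = FALSE]          # reduction in cut size if (i, j) are swapped
    best <- which(gains == max(gains), arr.ind = TRUE)[1, ]
    i <- c1[best[1]]; j <- c2[best[2]]
    s[c(i, j)] <- s[c(j, i)]                      # Step 3: swap the best pair
    untouched[c(i, j)] <- FALSE
    if (cut_size(s) < best_cut) { best_cut <- cut_size(s); best_s <- s }
  }
  best_s                                          # Step 4: best state visited during the pass
}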

4.2.3 Edge betweenness

This method assumes that clusters are connected by only a few edges, so many shortest paths between clusters go through these edges (i.e., these edges have high betweenness). By removing edges with high betweenness, we expect to disconnect the clusters. The algorithm is described in [5]:
Step 1: Compute the edge betweenness of all edges
Step 2: Remove the edge with the highest betweenness
Step 3: Stop when k connected components are found; otherwise go to Step 1
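The igraph library used in Section 5.2 provides this algorithm directly, so a usage sketch is short; the graph object g is assumed to be built from the 0/1 adjacency matrix of the unweighted network.

# A minimal sketch using igraph's edge betweenness (Girvan-Newman) community
# detection (assumption: A is the 0/1 adjacency matrix of the unweighted graph).
library(igraph)
g  <- graph_from_adjacency_matrix(A, mode = "undirected")
eb <- cluster_edge_betweenness(g)   # repeatedly removes the edge with highest betweenness
s  <- cut_at(eb, no = 6)            # stop when k = 6 components remain
modularity(g, s)                    # evaluate the resulting clustering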

5 Experimental Results

We conducted a series of experiments to compare the performance of the methods described in Section 4 when clustering a subset of the DBLP dataset.

5.1 Dataset and Preprocessing

We extract two networks from the DBLP dataset. The first is the coauthorship network comprising only faculty members of the CS department at UMass (henceforth referred to as the “CS UMass

Network”) and the second corresponds to a large subset of the dataset (containing approximately 840,000 authors). DBLP provides XML snapshots and also a Java parser that prints simple statistics about the dataset. We modified this parser to obtain two kinds of matrices. The first is the adjacency matrix of the weighted graph described in Section 2, which is used as input for the clustering methods. The second is a matrix C, where C_ij is the number of papers that author i published in conference j. This matrix can be used to compute a metric based on publication venues, described in Section 6.2. It is worth noting that DBLP assigns a unique ID to each author, even if the author appears in the dataset multiple times under slightly different names. Furthermore, we treat a conference held in different years as a single conference by trimming the numbers from the strings that store the conference names. After extracting the network that corresponds to the entire DBLP dataset, we observed that it is composed of one very large component (containing approximately 848 thousand nodes) and many small components (containing up to 29 nodes). Since there is no reason to cluster different components together, we extract only the largest component (henceforth referred to as the “Largest Component of the coauthorship network”) to use in our experiments. In what follows, we evaluate the clustering results using the modularity metric.

5.2 Implementation

We implemented all the described methods using the software R. In order to load large matrices into memory, we use the sparse matrix representation provided by the library “Matrix”. Furthermore, to obtain the eigenvectors corresponding to the largest (smallest) eigenvalues, we run the implicitly restarted Lanczos bidiagonalization (IRLB) algorithm from the library “irlba”. Finally, we also use the library “igraph” to perform hierarchical clustering and edge betweenness community detection, and to compute the modularity metric. Some methods are non-deterministic, including all methods that use k-means (due to the initialization of the centers). For those non-deterministic cases, we report the average and the 95% confidence interval over 10 runs.
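The sketch below illustrates how these three libraries fit together; the edge-list file name and its format (one row per undirected edge with i < j) are assumptions made for the sake of the example.

# A minimal sketch of the toolchain (assumptions: the modified parser writes the
# coauthorship graph to "dblp_edges.txt" with columns i, j, w and one row per
# undirected edge with i < j; file name and format are hypothetical).
library(Matrix)
library(irlba)
library(igraph)
edges <- read.table("dblp_edges.txt", col.names = c("i", "j", "w"))
A <- sparseMatrix(i = edges$i, j = edges$j, x = edges$w, symmetric = TRUE)
dec <- irlba(A, nv = 16)             # IRLB: approximate top eigenvectors/singular vectors
g <- graph_from_adjacency_matrix(A, mode = "undirected", weighted = TRUE)
cl <- cluster_leading_eigen(g)       # igraph's own leading eigenvector implementation
modularity(g, membership(cl))        # clustering quality of the result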

5.3 Results: CS UMass Network

5.3.1 Unweighted graph

First, we compare the performance of hierarchical clustering on the unweighted coauthorship network, using different agglomeration methods to evaluate the similarity between groups, namely complete, single and average linkage, Ward and McQuitty. Intuitively, the complete and single methods are likely to find clusters with low modularity, because the minimum and the maximum similarity over all pairs of nodes are not good indicators of whether two groups are similar. Figure 1 shows the modularity computed from the results of each agglomeration method as the number of clusters k varies. As expected, the single and complete methods show poor performance, while the other methods achieve higher modularity values. Observing the results for average linkage and Ward, the modularity appears to be maximal with 5 or 6 clusters. Furthermore, the Ward method exhibits the best performance (Q above 0.2) among the hierarchical clustering experiments. In what follows, we use this specific instance as a representative of the hierarchical clustering methods.

[Plot: modularity Q versus the number of clusters (2 to 10) for HC-ward, HC-single, HC-complete, HC-average and HC-mcquitty.]

Figure 1: Modularity of Unweighted Hierarchical Clustering

[Panel (a): eigenvalues of the graph Laplacian and of the normalized Laplacian. Panel (b): modularity Q versus the number of clusters for Spectral Clustering, Spectral Clustering+Lanczos and Spectral Partitioning.]

Figure 2: Results of spectral-based methods for the unweighted CS UMass network.

Now we compare two spectral-based methods, namely spectral partitioning and spectral clustering. In particular, for spectral clustering we consider both the exact spectral decomposition and an approximation obtained using IRLB. In order to choose the number of eigenvectors k used in spectral clustering, we analyze the spectrum of the (normalized) graph Laplacian shown in Figure 2a. We chose k = 16, even though there are other eigenvalues of significant magnitude, because one usually does not want to compute a number of eigenvectors that is large relative to the matrix size.

We expected to obtain similar results when IRLB is used instead of the exact spectral decomposition. However, Figure 2b shows that this approximation achieves very low modularity, regardless of the number of clusters. Moreover, spectral partitioning performs slightly better than spectral clustering. Thus, we use the former as a representative of the spectral-based methods in what follows.

We then compare Hierarchical Clustering and Spectral Partitioning to the other methods described in Section 4. From Figure 3 we notice that Kernighan-Lin and HC-Ward outperform all other methods. However, it is worth noting that these methods (together with edge betweenness) are not scalable. On the other hand, spectral clustering and leading eigenvector are suited to large networks and do achieve some clustering, even though they do not perform as well as the other methods on small networks. Moreover, we observed that edge betweenness is worse than the other algorithms, especially for small k, because it tends to disconnect degree-one nodes first. Again, the modularity appears to increase with k up to 6 clusters and then starts to decrease.

[Plot: modularity Q versus the number of clusters (2 to 10) for HC-ward, Edge-Betweenness, Kernighan-Lin, Leading-eigenvector and Spectral-Partitioning.]

Figure 3: Modularity of Unweighted Clustering

[Panel (a): modularity Q versus the number of clusters for Spectral Clustering, Spectral Clustering+Lanczos and Spectral Partitioning. Panel (b): Q versus the number of clusters for HC-ward, Edge-Betweenness, Kernighan-Lin, Leading-Eigenvector and spectral clustering with exact eigenvectors.]

Figure 4: Modularity for weighted clustering.

5.3.2 Weighted graph

The results obtained for the weighted coauthorship network are very similar to those for the unweighted one. Although the modularity values in the weighted and unweighted cases cannot be compared directly, the relative performance of the clustering methods is roughly the same. The most significant difference for the weighted graph is that Spectral Clustering performs best among the spectral-based methods (see Figure 4a). Figure 4b shows results similar to those in Figure 3. Note, however, that spectral clustering now achieves modularity values as high as the other methods. Although we could not obtain good results when using IRLB to find an approximate decomposition, this is a somewhat positive result, since spectral-based methods can be regarded as scalable as long as we replace the spectral decomposition step by an accurate approximation. Conversely, the other methods we use are not scalable and could not be applied to the largest component of the coauthorship network.

5.4 Results: Largest Component of the Coauthorship Network

We applied spectral-based methods to find clusters in the largest component of the coauthorship network using the IRLB algorithm. However, the performance we observed was very poor, and we believe this is due to bad approximations of the actual eigenvectors given by IRLB, just as in the case of the CS UMass Network. On the other hand, we could successfully use the Leading Eigenvector method with the Power Method.

[Plots: modularity Q versus the number of clusters (2 to 10) for the Leading Eigenvector method; panel (a) unweighted network, panel (b) weighted network.]

Figure 5: Modularity for the Largest Component of the Coauthorship Network.

This is possible because the clustering algorithm only requires the eigenvector corresponding to the largest eigenvalue of the modularity matrix, which the Power Method can compute efficiently. Figures 5(a-b) show the results for the giant component of the coauthorship network. It is worth noting that the maximum value of the modularity depends on the network structure and is always less than 1. Furthermore, the expected modularity of clusters chosen at random is 0. Thus, we conclude that the Leading Eigenvector method found clusters somewhat successfully.

6 Discussion and future work

6.1 How to deal with very large datasets

There are several challenges one faces when working with very large datasets, starting with loading matrices into memory. In the case of sparse matrices, for instance, special data structures can be used to store only the non-zero elements and to perform operations efficiently. Fortunately, many algorithms initially designed to solve small problems can be easily adapted to handle larger ones. For instance, instead of computing the normalized matrix M of Section 4.1.1 via explicit matrix multiplications, we compute it using "row by element" products, which we denote here by ⋆. More precisely, M can be calculated as M = ((A ⋆ d)' ⋆ d)', where d is a vector such that d_i = 1/√k_i.

Note also that some clustering algorithms we discuss have a spectral decomposition step. Although a complete eigendecomposition is very costly, we usually need only the eigenvectors corresponding to the k largest (smallest) eigenvalues, which can be obtained efficiently by Lanczos methods. In particular, when k = 1 (e.g., the leading eigenvector method), the Power method can efficiently find a good approximation for a sparse matrix. Although the modularity matrix B is very dense, a simple trick makes it possible to apply either IRLB or the Power method: both are based on multiplying B by a vector x, so instead of computing Bx directly we programmed them to compute Ax − K(K'x)/(2m).

Finally, we emphasize that the software to be used must be chosen carefully. Despite the fact that R offers several libraries and is relatively easy to use, almost all implemented methods are serial and very inefficient when handling large matrices. In particular, we had to implement k-means on CUDA. This implementation was able to cluster matrices of size 1M×5 in less than 10 s, while R ran the same task for more than 12 hours without finishing it.
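The two tricks above can be sketched in a few lines of R; the sketch assumes A is a sparse symmetric Matrix object and uses a fixed number of power iterations, which is a simplification of what a production implementation would do.

# A minimal sketch of the sparse-matrix tricks above (assumption: A is a sparse
# symmetric matrix from the Matrix package).
library(Matrix)
K <- rowSums(A)
m <- sum(K) / 2
d <- 1 / sqrt(K)
# "Row by element" products: scale rows and columns of A by d instead of
# multiplying by the diagonal matrix D^{-1/2}; the result stays sparse.
M <- t(t(A * d) * d)                 # equals D^{-1/2} A D^{-1/2}
# Implicit multiplication by the dense modularity matrix B = A - K K' / (2m):
mult_B <- function(x) as.numeric(A %*% x) - K * sum(K * x) / (2 * m)
# Power method for the leading eigenvector of B, without ever forming B.
v <- rnorm(length(K))
for (it in 1:200) {
  v <- mult_B(v)
  v <- v / sqrt(sum(v^2))            # renormalize at every iteration
}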

6.2 Venue metric

Leskovec et al. argue in [7] that information about publication venues can be used as a ground truth for clustering in the DBLP dataset. Based on this idea, we outline a new metric for assessing clustering quality, similar to Eq. (1). Let C be the total number of conferences and C_i be the set of venues in which author i has published. We define the venue metric as

R(S) = \frac{1}{C} \sum_{i,j \in V} \left[ |C_i \cap C_j| - \frac{|C_i| |C_j|}{C} \right] \delta(s_i, s_j).    (2)

Although not very discriminative among methods, our preliminary results show that the venue metric has positive values for the partitioning obtained when we use Spectral Clustering combined with IRLB on the unweighted CS UMass network. This means that this combination may actually be performing better than the modularity indicates.
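For a small network, Eq. (2) can be evaluated directly from the author-by-conference matrix described in Section 5.1; the sketch below assumes that matrix is called Cmat and that s holds the cluster labels, both of which are hypothetical names.

# A minimal sketch of Eq. (2) for small networks (assumptions: Cmat is the
# author-by-conference count matrix of Section 5.1 and s is a vector of
# cluster labels; both names are hypothetical).
venue_metric <- function(Cmat, s) {
  P <- (as.matrix(Cmat) > 0) * 1       # P[i, c] = 1 if author i published at venue c
  nconf <- ncol(P)                     # C: total number of conferences
  inter <- P %*% t(P)                  # |C_i intersect C_j| for every pair of authors
  sizes <- rowSums(P)                  # |C_i|
  expected <- outer(sizes, sizes) / nconf
  same <- outer(s, s, "==")            # delta(s_i, s_j)
  sum((inter - expected) * same) / nconf
}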

6.3 Future work

From the discussion above, our future work can focus mainly on two directions. First, we need to improve the accuracy of the eigenvectors obtained using approximate spectral decomposition methods. We observe from the experimental results that some of our algorithms depend strongly on the accuracy of the eigenvectors. Moreover, the time spent computing the eigenvectors is currently the bottleneck of the algorithms. One possible solution is to use a GPU-based SVD algorithm [8]. Since we are dealing with large sparse matrices, we can also use algorithms based on sampling, such as compressed sensing [9]. Secondly, we can take advantage of content information. The methods we discussed so far only consider features of the network structure such as connections between nodes. In particular, either the paper titles or their content could provide useful information for clustering. Therefore, we could try to combine the graph structure with an LDA model to improve our results. Such a combined model has been shown to be a good way to cluster networks [10].

7 Related Work

Although Saerens et al. do not provide an explicit clustering algorithm, they propose a method for computing the distance between nodes in a weighted undirected graph (the Euclidean Commute Time Distance, ECTD) and show how to obtain a subspace projection of the nodes of the graph that preserves as much variance as possible w.r.t. the ECTD [11]. One could apply k-means on such a projection to find clusters in a graph.

Kulis et al. [12] address the problem of graph clustering by proposing a kernel-based semi-supervised clustering method. This method is semi-supervised because they assume that some constraints are given, namely nodes that must be in the same or in different clusters.

Two spectral-based methods were proposed by White and Smyth [13]. In the first, the authors use an eigenvector formulation to maximize the modularity, seeking the global optimum. This approach uses a variant of the Lanczos method. The second is a greedy version of the first that may not give results as good, but is shown to be significantly faster.

In our work, we used Lanczos methods to perform an approximate spectral decomposition of a matrix M, but there are other methods that could be applied for this purpose. In particular, one can approximate M as a Kronecker product of two smaller matrices B and C. The eigenvectors and eigenvalues of B ⊗ C are then determined from those of B and C [14].

As an alternative to spectral-based methods, Dhillon et al. propose the use of a weighted kernel k-means [15]. In that work, they show how to choose the kernel so that a weighted graph clustering objective, such as measures based on cut sizes, is optimized.

8 Conclusion

We evaluated different clustering methods applied to a small coauthorship network extracted from the DBLP dataset. Our results show that methods such as Hierarchical Clustering and the Kernighan-Lin algorithm perform better than spectral-based methods with respect to the modularity metric. Nevertheless, we show that Spectral Clustering can perform almost as well as non-spectral-based methods, while retaining some scalability. In fact, this method could be applied to the largest component of the coauthorship network if we could find a good approximation of the top eigenvectors. Although we failed to use IRLB for this purpose, we successfully used the Power Method to find an accurate approximation of the eigenvector corresponding to the largest eigenvalue of the modularity matrix B. Hence we could apply the Leading Eigenvector method to a very large network, finding reasonable clusters.

References

[1] M. E. J. Newman and M. Girvan, "Finding and evaluating community structure in networks," Phys. Rev. E, vol. 69, p. 026113, Feb. 2004.
[2] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE TPAMI, vol. 22, pp. 888-905, 2000.
[3] M. Newman, Networks: An Introduction. Oxford University Press, 2010.
[4] M. E. J. Newman, "Modularity and community structure in networks," PNAS, vol. 103, pp. 8577-8582, June 2006.
[5] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Phys. Rev. E, vol. 69, p. 066133, 2004.
[6] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell Sys. Tech. J., vol. 49, no. 2, pp. 291-308, 1970.
[7] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, "Statistical properties of community structure in large social and information networks," in WWW, (New York, NY, USA), pp. 695-704, ACM, 2008.
[8] R. B. Foster, R. Wang, and S. Mahadevan, "A GPU-based approximate SVD algorithm," 9th Intl. Conf. on Parallel Processing and Applied Mathematics, 2011.
[9] E. Candes, J. Romberg, and T. Tao, "Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information," IEEE Transactions on Information Theory, vol. 52, pp. 489-509, Feb. 2006.
[10] Q. Mei, D. Cai, D. Zhang, and C. Zhai, "Topic modeling with network regularization," in WWW, pp. 101-110, 2008.
[11] M. Saerens, F. Fouss, L. Yen, and P. Dupont, "The principal components analysis of a graph, and its relationships to spectral clustering," in Machine Learning: ECML 2004, vol. 3201 of Lecture Notes in Computer Science, pp. 371-383, Springer Berlin / Heidelberg, 2004.
[12] B. Kulis, S. Basu, I. Dhillon, and R. Mooney, "Semi-supervised graph clustering: a kernel approach," in ICML '05, (New York, NY, USA), pp. 457-464, ACM, 2005.
[13] S. White and P. Smyth, "A spectral clustering approach to finding communities in graphs," 2005.

[14] J. Johns, S. Mahadevan, and C. Wang, "Compact spectral bases for value function approximation using Kronecker factorization," in AAAI-07, vol. 1, pp. 559-564, AAAI Press, 2007.
[15] I. Dhillon, Y. Guan, and B. Kulis, "Weighted graph cuts without eigenvectors: a multilevel approach," IEEE TPAMI, vol. 29, pp. 1944-1957, Nov. 2007.
