PHYSICAL REVIEW E 98, 042304 (2018) Weight Thresholding on Complex Networks
Xiaoran Yan,1,* Lucas G. S. Jeub,2 Alessandro Flammini,1,2 Filippo Radicchi,2 and Santo Fortunato1,2
1Indiana University Network Science Institute (IUNI), 1001 E SR 45/46 Bypass, Bloomington, Indiana 47408, USA
2School of Informatics, Computing and Engineering, Indiana University, 700 N Woodlawn Ave, Bloomington, Indiana 47408, USA
(Received 21 June 2018; published 8 October 2018)

Weight thresholding is a simple technique that aims at reducing the number of edges in weighted networks that are otherwise too dense for the application of standard graph-theoretical methods. We show that the group structure of real weighted networks is very robust under weight thresholding, as it is maintained even when most of the edges are removed. This appears to be related to the correlation between topology and weight that characterizes real networks. On the other hand, the behavior of other properties is generally system dependent.

DOI: 10.1103/PhysRevE.98.042304

*Corresponding author: [email protected]
2470-0045/2018/98(4)/042304(9) ©2018 American Physical Society

I. INTRODUCTION

Many real networks have weighted edges [1], representing the intensity of the interaction between pairs of vertices. Also, some weighted networks, e.g., financial [2] and brain networks [3], have a high density of edges. Analyzing very dense graphs with tools of network science is often impossible, unless some preprocessing technique is applied to reduce the number of connections. Different recipes of edge pruning, or graph sparsification, have been proposed in recent years [4–10].

In practical applications such as the preprocessing of data about brain, financial, and biological networks [2,11,12], weight thresholding is the most popular approach to sparsification. It consists in removing all edges with weight below a given threshold. Ideally, one would like to eliminate as many edges as possible without drastically altering key features of the original system. A recent study of functional brain networks investigates how the graph changes as a function of the threshold value [13], finding that conventional network properties are usually disrupted early on by the pruning procedure. Further, many standard measures do not behave smoothly under progressive edge removal, and hence are not reliable measures to assess the effective change to the system structure induced by the removal of edges.

By analyzing several synthetic and real weighted networks, we show that, while local and global network features are often quickly lost under weight thresholding, the procedure does not alter the mesoscopic organization of the network [14]: groups (e.g., communities) survive even when most of the edges are removed.

In addition, we introduce a measure, the minimum absolute spectral similarity (MASS), that estimates the variation of spectral properties of the graph when edges are removed. Spectral properties are theoretically related to group structures in general [15–17]. The MASS is stable under weight thresholding for many real networks, making it a potential alternative to expensive community detection algorithms for testing the robustness of communities under thresholding.

II. METHODS

Let us consider a weighted and undirected graph $G$ composed of $N$ vertices and $M$ edges. Edges have positive weights, and the graph topology is described by the symmetric weight matrix $W$, where the generic element $W_{uv} = W_{vu} > 0$ if there is a weighted edge between nodes $u$ and $v$, while $W_{uv} = W_{vu} = 0$ otherwise. Weight thresholding removes all edges with weight lower than a threshold value $\theta$. This means that the resulting graph $\tilde{G}$ has a thresholded weight matrix $\tilde{W}$, whose generic element $\tilde{W}_{uv} = \tilde{W}_{vu} = W_{uv}$ if $W_{uv} \geq \theta$, and $\tilde{W}_{uv} = \tilde{W}_{vu} = 0$ otherwise. The thresholded graph $\tilde{G}$ is therefore a subgraph of $G$ with the same number of nodes.

We first examine synthetic networks generated by the Lancichinetti, Fortunato, and Radicchi (LFR) benchmark [18]. We then extend the analysis to several real networks from different domains: structural brain networks [19], the world trade network [20], the airline network [21], and the coauthorship network of faculty of Indiana University. We treat all networks as undirected, weighted graphs with positive edge weights. We show the variation of several graph properties as a function of the fraction of removed edges. The properties we have chosen include all network-level measures from Ref. [13] as well as mesoscopic structure measures:

(1) Characteristic path length (CPL), the average length of all shortest paths connecting pairs of vertices of the network. Once the network becomes disconnected, CPL is not defined, and we set its value to 0.
(2) Global efficiency, the average inverse distance between all pairs of vertices of the network [22]. Global efficiency remains well defined after network disconnection, as disconnected vertex pairs simply have an inverse distance of 0.
(3) Transitivity, or global clustering coefficient, is the ratio of triangles to triplets in the network, where a triplet is a motif consisting of one vertex and two links incident to the vertex.
(4) Community structure, which we detect with two methods. The first is based on modularity maximization [23] via the Louvain algorithm [24,25]. The second method uses k-means spectral clustering [15] with a constant k equal to the one found with the Louvain algorithm.
(5) Core-periphery structure, the vertices of which we detect using the method introduced for the weighted coreness measure [26].
(6) DeltaCon, a metric indicating the similarity between the original graph and the thresholded one based on graph diffusion properties [27].

For all three partitioning algorithms, i.e., Louvain, k-means, and coreness, we measure the similarity between the partitions of the sparsified graph $\tilde{G}$ and those of $G$ using the adjusted mutual information (AMI) [28]. To overcome the randomness of the partitioning algorithms, we sample 100 different partition outcomes from the original graph, 10 from the sparsified graph, and use the maximum AMI between any pair. These numbers are picked so that the same algorithm returns consistent results (AMI very close to 1) on independent runs on the original graph.

For global efficiency we take the ratio between the value of the measure on the sparsified graph and the corresponding value on the initial graph. CPL grows with edge removal, and we therefore normalize it by its largest value before disconnection. Since DeltaCon scales directly with weights, we normalize each matrix entry such that it reaches 0 on empty graphs. This way all our measures are confined in the interval [0,1], and their trends can be compared (transitivity naturally varies in this range). We also plot the relative size of the largest connected component to keep track of splits of the network during edge removal. More details of how these graph properties are calculated are given in Appendix A.

We also add another measure, capturing the variation of spectral properties of the graph. To define this measure we recall that the Laplacian of $G$ is defined as $L = D - W$, where $D$ is the diagonal matrix of the weighted degrees (strengths), with entries $D_u = \sum_v W_{uv}$. Similarly, for the sparsified graph $\tilde{G}$, $\tilde{L} = \tilde{D} - \tilde{W}$. The Laplacian has the spectral decomposition

$$L = D - W = V \Lambda V^{-1},$$

where the columns of the matrix $V$ are the eigenvectors $v_1, v_2, \ldots, v_N$ of the Laplacian, and the entries in the diagonal matrix $\Lambda$ are the corresponding eigenvalues $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N$.

The difference between the sparsified Laplacian $\tilde{L}$ and the original $L$ can be quantified by the minimum relative spectral similarity (MRSS) [29],

To overcome this issue, we instead propose the absolute spectral similarity with respect to the input vector $x$, defined as

$$\sigma(x) = 1 - \frac{x^T \Delta L \, x}{x^T \lambda_N x}, \qquad (2)$$

where $\lambda_N$ is the largest eigenvalue of the original graph Laplacian, and $\Delta L = \Delta D - \Delta W$ is the graph Laplacian of the difference graph $\Delta G$, whose vertices are the same as in $G$, while the edges are the ones removed by the chosen sparsification procedure. Without loss of generality, we consider only unit length input vectors, $|x| = 1$. In Appendix B, we prove that weight thresholding optimizes the expected value of $\sigma(x)$, if we assume the entries of the vector $x$ are independent identically distributed random variables.

Since the input vector is variable, we again consider the worst case scenario and use the minimum absolute spectral similarity (MASS),

$$\sigma_{\min} = \min_{|x|=1} \left( 1 - \frac{x^T \Delta L \, x}{\lambda_N} \right) = \frac{\lambda_N - \lambda_N^{\Delta}}{\lambda_N}, \qquad (3)$$

where $\lambda_N^{\Delta}$ is the largest eigenvalue of the difference Laplacian $\Delta L$.

MASS is consistent with our intuition that disconnecting a few peripheral vertices, while a bigger change than removing redundant edges, should have a small impact on the organization of the system. Many local and global network features become ill-defined as soon as the network disconnects, and hence they are not reliable measures to assess the effect of thresholding. Mesoscopic properties, like communities, on the other hand, remain meaningful even after the network becomes disconnected, and a reliable measure should be robust in such situations.

A major advantage of MASS is its numerical stability and computational efficiency (see Appendix C). Because it depends only on the ratio of the largest eigenvalues of two graph Laplacians, it can be computed using standard numerical libraries [30–32]. (Readers can find our Matlab implementation at https://github.com/IU-AMBITION/MASS.) In spite of its simplicity, MASS satisfies a series of theoretical properties (see Appendix E).

III. RESULTS

To illustrate the changes of the above graph properties under weight thresholding, we first compare a synthetic weighted network with its binary counterpart; see Fig.
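As an aside on the definitions in Sec. II, the thresholding rule and Eq. (3) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' Matlab implementation: the function names `threshold_weights`, `laplacian`, and `mass` are hypothetical, the graph is assumed to be given as a dense symmetric weight matrix with at least one edge, and the toy triangle graph is made up for the example.

```python
import numpy as np

def threshold_weights(W, theta):
    """Keep entries with W_uv >= theta and set all others to 0."""
    W = np.asarray(W, dtype=float)
    return np.where(W >= theta, W, 0.0)

def laplacian(W):
    """Graph Laplacian L = D - W, with strengths D_u = sum_v W_uv."""
    return np.diag(W.sum(axis=1)) - W

def mass(W, W_sparse):
    """MASS as in Eq. (3): (lambda_N - lambda_N^Delta) / lambda_N.

    Assumes W has at least one edge, so that lambda_N > 0.
    """
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order
    lam_full = np.linalg.eigvalsh(laplacian(W))[-1]
    # The difference graph holds exactly the removed edges
    lam_diff = np.linalg.eigvalsh(laplacian(W - W_sparse))[-1]
    return (lam_full - lam_diff) / lam_full

# Hypothetical toy graph: a weighted triangle
W = np.array([[0.0, 3.0, 1.0],
              [3.0, 0.0, 2.0],
              [1.0, 2.0, 0.0]])
W_thr = threshold_weights(W, theta=2.0)  # drops the weight-1 edge (0, 2)
score = mass(W, W_thr)
print(round(score, 3))
```

Here the difference graph is a single edge of weight 1, whose Laplacian has largest eigenvalue 2, while the full Laplacian has largest eigenvalue 6 + √3, so the score is about 0.74. For large sparse networks one would presumably replace the dense eigensolver with an iterative one that returns only the largest eigenvalue.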