Assessing the Relevance of Node Features for Network Structure

Assessing the relevance of node features for network structure Ginestra Bianconia, Paolo Pinb,c,1, and Matteo Marsilia aAbdus Salam International Center for Theoretical Physics, Strada Costiera 11, 34014 Trieste, Italy; bDipartimento di Economia Politica, Universitá degli Studi di Siena, Piazza San Francesco 7, 53100 Siena, Italy; and cMax Weber Programme, European University Institute, Via Delle Fontanelle 10, 50014 San Domenico di Fiesole, Italy Edited by Peter J. Bickel, University of California, Berkeley, CA, and approved April 21, 2009 (received for review December 1, 2008) Networks describe a variety of interacting complex systems in to ask what the functions or attributes of the nodes tell us about social science, biology, and information technology. Usually the the network than the other way around. In this article we propose nodes of real networks are identified not only by their connec- an indicator Θ that quantifies how much the topology of a network tions but also by some other characteristics. Examples of char- depends on a particular assignment of node characteristics. This acteristics of nodes can be age, gender, or nationality of a per- provides an information bound that can be used as a benchmark son in a social network, the abundance of proteins in the cell for feature-extraction algorithms. This exercise, as we shall see, taking part in protein-interaction networks, or the geographical can also reveal statistical regularities that shed light on possible position of airports that are connected by directed flights. Inte- mechanisms underlying the network’s stability and formation. Θ grating the information on the connections of each node with In the following, we first define , and then we investigate sep- the information about its characteristics is crucial to discriminat- arately the case in which node characteristic assignment induces ing between the essential and negligible characteristics of nodes a community structure on the network and the case in which the assignment corresponds to a position of the nodes in some metric for the structure of the network. In this paper we propose a Θ general indicator Θ, based on entropy measures, to quantify the space. We will calculate for benchmarks and for examples of social, biological, and economics networks. dependence of a network’s structure on a given set of features. We apply this method to social networks of friendships in U.S. Θ APPLIED schools, to the protein-interaction network of Saccharomyces cere- Definition of MATHEMATICS Θ visiae and to the U.S. airport network, showing that the proposed We shall first give a description of our indicator in a simple case measure provides information that complements other known study and then give a general abstract definition. measures. Let us consider the specific problem of evaluating the sig- = nificance of the network community structure q (q1, ..., qN ) ∈{ } entropy | inference | social networks | communities induced by the assignment of a characteristic qi 1, ..., Q ,to ∈{ } each node i 1, ..., N of a network of N nodes. Individual nodes are characterized by their degree k , which is the number etworks have become a general tool for describing the struc- i of links they have to other nodes in the network. The network g ture of interaction or dependencies in such disparate systems N is fully specified by the adjacency matrix taking values g = 1if as cell metabolism, the internet, and society (1–5). Loosely speak- i,j nodes i and j are linked and 0 otherwise. The community struc- ing, the topology of a given network can be thought of as the ture induced by the assignment qi on the network is described byproduct of chance and necessity (6), where functional aspects by a matrix A of elements A(q, q ) indicating the total number of and structural features are selected in a stochastic evolutionary links between nodes with characteristics q and q . A natural mea- process. The issue of separating “chance” from “necessity” in sure of the significance of the induced community structure q on networks has attracted much interest. This entails understand- the network g is provided by the number of graphs g between ing random network ensembles (i.e., chance) and their inherent structural features (7–9) but also developing techniques to infer those individual nodes (characterized by the degree sequence k) structural and functional characteristics on the basis of a given that are consistent with A. The logarithm of this number is the Σ network’s topology. Examples go from inference of gene function entropy k,q (21, 22) of the distribution that assigns equal weight from protein-interaction networks (10) to the detection of com- to each graph g with the same q and k. This number also depends munities in social networks (11, 12). Community∗ detection, for on the degree sequence k and the relative frequency of differ- example, aims at uncovering a hidden classification of nodes, and ent values of q across the population. These systematic effects a variety of methods have been proposed relying on (i) structural are removed considering the entropy Σ π obtained from a ran- properties of the network [betweenness centrality (13), modularity k, (q) dom permutation π(q):i → qπ of the assignments, where (14), spectral decomposition (15), cliques (16), and hierarchi- (i) {π(i), i = 1, ..., N} is a random permutation of the integers cal structure (17)], (ii) statistical methods (18), or (iii) processes i ∈{1, ..., N}. Theindicator Θ is obtained as the standardized defined on the network (9, 19). Implicitly, each of these methods relies on a slightly different understanding of what a community is. Furthermore, there are intrinsic limits to detection; often the outcome depends on the algorithm and a clear assessment of the Author contributions: G.B., P.P., and M.M. designed research; G.B., P.P., and M.M. per- role of chance is possible in only a few cases (see, e.g., refs. 9 formed research; G.B., P.P., and M.M. analyzed data; and G.B., P.P., and M.M. wrote the and 20). paper. As a matter of fact, in several cases, a great deal of additional The authors declare no conflict of interest. information, beyond the network topology, is known about the This article is a PNAS Direct Submission. nodes. This comes in the form of attributes such as age, gender, Freely available online through the PNAS open access option. and ethnic background in social networks or annotations of known ∗A community structure, in general terms, is an assignment of nodes into classes. Com- functions for genes and proteins. Sometimes this information is munity detection aims at partitioning nodes into homogeneous classes, according to incomplete, so it is legitimate to attempt to estimate missing infor- similarity or proximity considerations. mation from the network’s structure. But often, the empirical data 1To whom correspondence should be addressed. E-mail: [email protected] on the network are no more reliable or complete than those on the This article contains supporting information online at www.pnas.org/cgi/content/full/ attributes of the nodes. In such cases, it may be more informative 0811511106/DCSupplemental. www.pnas.org / cgi / doi / 10.1073 / pnas.0811511106 Early Edition 1of6 Downloaded by guest on September 24, 2021 Σ Σ Θ deviation of k,q from the entropy k,π(q) of networks with Besides the value of , our approach also provides more randomized assignments: detailed information. Technically,this is extracted from the saddle point values of the Lagrange multipliers introduced in the calcu- Eπ Σ π −Σ lation of Σφ in order to enforce the constraints (see SI). In Θ = k, (q) k,q (g,q) , [1] k,q 2 the examples discussed below, this information is encoded in the Σ − Σ Eπ k,π(q) Eπ k,π(q) probability that a node i is linked to a node j in an ensemble with a given feature φ(g, q). This is given by π[...] where E stands for the expected value of over random uni- form permutations π(q) of the assignments. In words, Θ measures = zizjW (qi, qj) the specificity of the network g for the particular assignment q, with pij . [4] 1 + zizjW (qi, qj) respect to assignments obtained by a random permutation. The indicator Θ can be similarly defined in a much more general The value of the “hidden variables” z and the statistical weight setting, with the following abstract definition: Let g ∈ G be the N W (q, q ) can be inferred from the real data (21, 22). Therefore network we are interested in, where N is the number of vertices the function W (qi, qj) can shed light on the dependence of the and gi j is the adjacency matrix. GN is the set of all graphs of N , probability of a link between nodes i and j, on their assignments vertices. An assignment is a vector q, such that for each node i, qi and qj. qi ∈ Q is defined on a set Q of possible characteristics, given by the context. Call Q = QN the set of all possible such vectors on Application to Networks with a Community Structure Q. A feature is a mapping φ : GN × Q → Φ, which associates to Θ each graph g and assignment q a graph feature φ(g, q) ∈ Φ. As will In the following, we will describe how to measure for assess- become clear, we do not need any assumption about the topology ing the relevance of a community structure. First, we analyze of the set of features Φ. the behavior of Θ on synthetic datasets. These have been used A simple example of features is those which do not depend on as benchmarks for community detection algorithms (9, 14).

Assessing the Relevance of Node Features for Network Structure

Simplified Computational Model for Generating Bio- Logical Networks

Biological Network Approaches and Applications in Rare Disease Studies

Denoising Large-Scale Biological Data Using Network Filters

Introduction to Graph Theory

Visual Analytics for Multimodal Social Network Analysis: a Design Study with Social Scientists

An Integrative Approach to Modeling Biological Networks 1 Introduction

Communities and Homology in Protein-Protein Interactions

The Structure and Function of Complex Networks∗

'Small-World-Ness': a Quantitative Method for Determining Canonical Network Equivalence

Coupling AI and Network Biology

Network Biology: Understanding the Cell’S Functional Organization

Biological Network Analysis Has Become a Central Component of Computational and Systems Biology