Annotated Hypergraphs: Models and Applications
Chodrow and Mellor, Applied Network Science (2020) 5:9
https://doi.org/10.1007/s41109-020-0252-y

RESEARCH, Open Access

Philip Chodrow1*† and Andrew Mellor2†

*Correspondence: [email protected]
†Philip Chodrow and Andrew Mellor contributed equally to this work.
1Operations Research Center, Massachusetts Institute of Technology, 77 Massachusetts Avenue, 02139 Cambridge, MA, USA
Full list of author information is available at the end of the article

Abstract
Hypergraphs offer a natural modeling language for studying polyadic interactions between sets of entities. Many polyadic interactions are asymmetric, with nodes playing distinctive roles. In an academic collaboration network, for example, the order of authors on a paper often reflects the nature of their contributions to the completed work. To model these networks, we introduce annotated hypergraphs as natural polyadic generalizations of directed graphs. Annotated hypergraphs form a highly general framework for incorporating metadata into polyadic graph models. To facilitate data analysis with annotated hypergraphs, we construct a role-aware configuration null model for these structures and prove an efficient Markov Chain Monte Carlo scheme for sampling from it. We proceed to formulate several metrics and algorithms for the analysis of annotated hypergraphs. Several of these, such as assortativity and modularity, naturally generalize dyadic counterparts. Other metrics, such as local role densities, are unique to the setting of annotated hypergraphs. We illustrate our techniques on six digital social networks, and present a detailed case-study of the Enron email data set.

Keywords: Hypergraphs, Null models, Network science, Statistical inference, Community detection

Introduction
Many data sets of contemporary interest log interactions between sets of entities of varying size.
In collaborations between scholars, legislators, or actors, a single project may involve an arbitrary number of agents. A single email links at least one sender to one or more receivers. A given chemical reaction may require a large set of reagents. The dynamics of processes such as these may depend on these polyadic interactions, and in many cases cannot be equivalently expressed through constituent pairwise interactions. This phenomenon is observed in areas as diverse as knowledge aggregation (Greening Jr et al. 2015), social contagion (de Arruda et al. 2019), and the evolution of cooperation (Tarnita et al. 2009), among many others. Because of this, these networks cannot be represented via the classical paradigm of dyadic graphs without a significant loss of model fidelity. Polyadic data representations such as hypergraphs (Berge 1984; Chodrow 2019a) and simplicial complexes (Young et al. 2017; Carlsson 2009) have therefore emerged as practical modeling frameworks that directly represent interactions between arbitrary sets of agents. As shown, for example, by one of the present authors in Chodrow (2019a), the choice of whether and when to represent polyadic data dyadically can lead to directionally contrasting conclusions when studying common network metrics such as triadic closure and degree-assortativity. Furthermore, the use of polyadic data representations enables the analyst to study measures of higher-order structure which cannot even be defined in the dyadic framework.

© The author(s). 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

In some cases, even polyadic data representations may be inadequate. Increasingly, network data sets incorporate rich metadata over and above topological structure. Models that flexibly incorporate this information can assist analysts in discovering features that may not be apparent without metadata. In this article, we consider an important and relatively general class of metadata in which nodes are assigned roles in each edge. A variety of social data sets involve such roles. For example, research articles have junior and senior authors. Political bills have sponsors and supporters. Movies have starring and supporting actors. Emails have senders, receivers, and carbon copies. Chemical reactions possess reactants, solvents, catalysts and inhibitors. These roles induce asymmetries in edges, and permuting role labels within an edge results in a meaningfully different data set. For example, a movie in which actor A plays a starring role and B a supporting role becomes a different movie if the roles are exchanged.

Metadata, including roles, can be especially important for modeling processes evolving on network substrates. A trivial example is that information cannot flow along an email edge from receiver to sender. A less trivial example comes from a recent study (Rotabi et al. 2017), which found that conventions in scholarly documentation preparation tend to flow along collaborations from more senior authors to more junior ones. In many fields, senior authors will tend to be “last” authors, while junior ones are more likely to be “first” or “middle” authors. The author order therefore carries important information about the spread of conventions along this collaboration network.
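To make the idea of roles within edges concrete, the following is a minimal sketch of one possible representation, not the data structure used in this article. Each edge is stored as a mapping from node to role, so that a role belongs to a node-edge pair rather than to a node or an edge alone. The node names, the toy email data, and the helper functions `role_counts`, `directed_pairs`, and `swap_roles` are all hypothetical and introduced only for illustration.

```python
from collections import Counter

# Hypothetical toy data: each edge maps node -> role within that edge.
emails = [
    {"alice": "sender", "bob": "receiver", "carol": "cc"},
    {"bob": "sender", "alice": "receiver"},
    {"carol": "sender", "alice": "receiver", "bob": "receiver"},
]


def role_counts(edges, node):
    """Count how often `node` plays each role across all edges."""
    return Counter(edge[node] for edge in edges if node in edge)


def directed_pairs(edges, source_role="sender", target_roles=("receiver", "cc")):
    """List (source, target) pairs along which information can flow.

    Assumes information moves only from the source role to the target roles,
    as in the email example; this is an illustrative rule, not a general one.
    """
    pairs = []
    for edge in edges:
        sources = [n for n, r in edge.items() if r == source_role]
        targets = [n for n, r in edge.items() if r in target_roles]
        pairs.extend((s, t) for s in sources for t in targets)
    return pairs


def swap_roles(edge, a, b):
    """Return a copy of `edge` with the roles of nodes `a` and `b` exchanged."""
    swapped = dict(edge)
    swapped[a], swapped[b] = edge[b], edge[a]
    return swapped
```

In this sketch the same node can be a sender in one edge and a receiver in another, and exchanging two role labels within an edge changes the set of directed pairs it induces, which mirrors the asymmetry described above.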
There exists an extensive literature studying graphs and hypergraphs with metadata attached to nodes (Ghoshal et al. 2009; McMorris et al. 1994; Kovanen et al. 2013; Henderson et al. 2012; Peel et al. 2017) and edges (Mucha et al. 2010; Gomez et al. 2013; Battiston et al. 2014). The problem of studying hypergraphs with general roles, however, does not neatly fit into any of these frameworks. This is because roles are not attributes of either nodes or edges, but rather of node-edge pairs. An actress is not (intrinsically) a “lead actress” – she may play a leading role in one film and a supporting role in the next. Contextual metadata is familiar in the context of directed networks. Each edge contains two nodes, one of which possesses the role “source” and the other “target”; however, a node may be the source of some edges and the target of others. In Gallo et al. (1993) and Gallo and Scutella (1998), the authors allow edges to contain arbitrary numbers of nodes, each of which is assigned one of these two roles. This results in directed hypergraphs, which have found some application in the study of cellular networks (Klamt et al. 2009) and routing (Marcotte and Nguyen 1998) problems. Subsequent work generalized further to “multimodal networks” by introducing a relationship of “association” (Heath and Sioson 2009) alongside the source and target roles.

Several extant papers explicitly model roles in interactions. An early pair of papers by Söderberg (Söderberg 2003a; 2003b) define and study a class of inhomogeneous random graphs in which nodes possess colored (role-labeled) stubs denoting their role in a given interaction. This class of models preserves degree distributions in expectation rather than deterministically, and the author develops them only for dyadic graphs.
A later paper by Karrer and Newman (2010) assigns nodes to roles in network motifs – small recurrent subgraphs – and uses these roles to construct configuration-like models. A model closely related to the one we develop here can be obtained by using labeled cliques as the relevant motifs and interpreting these cliques as hyperedges. Most recently, Allard et al. (2015) define a flexible generalization of stub-matching that can reproduce a wide variety of network topologies, including the presence of heterogeneous hyperedges. As we discuss below, stub-matching can often generate graphs with a small number of structural degeneracies. When studying dynamics on graphs, as those authors do, these degeneracies can often be ignored. Since our focus is primarily inferential, we instead choose an approach that explicitly avoids degeneracies.

Our aim in this work is to develop a unified modeling and analysis framework for polyadic data with contextual roles, which can then be flexibly deployed in varied domains. The article is structured as follows. We first define annotated hypergraphs, which naturally generalize the notion of directedness to polyadic data. We then define a configuration model for annotated hypergraphs, and prove a Markov Chain Monte Carlo algorithm for sampling from this model. Next, we formulate a range of role-aware metrics for studying the structure of annotated hypergraphs. Some of these are direct generalizations of familiar tools, including centrality, assortativity, and modularity. Others, such as local role densities, are qualitatively novel. We then bring our methods to bear on a small collection of social network data sets, showing how the framework of annotated hypergraphs allows us to flexibly highlight interpretable features in the data. Additionally, we conduct an extended case-study of the popular Enron email data set.
We conclude with a discussion of our results and suggestions for future work in the modeling of rich, polyadic data. Throughout our development, we emphasize how the incorporation of metadata allows the analyst to ask and answer