Constructing, comparing, and reconstructing networks

by Brennan Klein

B.A. in Cognitive Science & Psychology, Swarthmore College

A dissertation submitted to

The Faculty of the College of Science of Northeastern University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

November 19, 2020

Dissertation Committee

Alessandro Vespignani, Chair
Samuel V. Scarpino, Co-chair
Tina Eliassi-Rad
Laurent Hébert-Dufresne

Acknowledgements

The thanks I give in this section will never—and can never—fully capture the extent of my gratitude. I cherish the friends, mentors, collaborators, and altogether supportive people in my life. Because of them, I have grown immensely as a scientist. Because of them, I have grown even more as a person. Because of them, I can go forth into this next stage of my life, full of a deep faith in what’s to come, supported by a network of endlessly kind and brilliant people. My dissertation committee—Laurent Hébert-Dufresne, Tina Eliassi-Rad, Sam Scarpino, and Alex Vespignani—is a perfect example of this network of support. It has been a privilege to share this dissertation with them over these last few years.

One of the greatest joys in my life has been my friendship with Conor Heins. We grew up together in science, and there are ideas that I simply cannot grasp without his presence. If there is one thing I have learned throughout my short career in science, it is the irreplaceable role that friendship has in driving discovery. To my parents, Marsha and Don, I owe so much. This dissertation would not exist—and I would not be a scientist—without my brother, Jason.

Documents like these are, ironically, never comprehensive enough. I spent months compiling a list of all the people I wanted to thank, all the memories I wanted to share, to reminisce over, to try and inspire through. At the same time, I am writing this document in the midst of a year of devastation from the COVID-19 pandemic. For some strange and sad reason, I cannot bring myself to write the names of every person I want to acknowledge. As a result, this section may seem artificially short or otherwise rushed. In place of a fuller list of acknowledgements, I make this promise to these cherished people in my life: that the acknowledgements will come in person, sporadically, surprisingly, over the next several years of our lives together. I hope we recognize each other.

Abstract of Dissertation

Complex networks are the syntax of complex systems; they are models that allow us to study phenomena across nature and society. And because they are models, the famous “all models are wrong, but some are useful” quotation rings especially true. We need to use the right networks to properly study complex systems, and in order to do so, the methods we use to create and analyze networks must be fit for purpose. This motivation has guided much of my dissertation, and in it, I explore three related themes around constructing, comparing, and reconstructing complex networks.

In the first chapter, I describe a theoretical and computational infrastructure that allows us to ask whether a given network captures the most informative scale to model the dynamics in the system. We see that many real world networks (especially heterogeneous networks) exhibit an information holarchy whereby a coarse-grained, macroscale representation of the network has more effective information than the original microscale network. In the next chapter, I consider the challenging problem of comparing pairs of networks and quantifying their differences. These tools are broadly referred to as “graph ” measures, and there are dozens used throughout . However, unlike in other domains of Network Science where rigorous benchmarks have been established to compare our surplus of tools, there is still no theoretically-grounded benchmark for characterizing these tools. To address this, collaborators and I proposed that simple, well-understood ensembles of random networks are natural benchmarks for network comparison methods. In this chapter, I characterize over 20 different graph distance measures, and I show how this simple within-ensemble graph distance can lead to the development of new tools for studying complex networks. The final chapter is an example of exactly that: I show how the within-ensemble graph distance can be used to characterize and evaluate different techniques for reconstructing networks from time series data. Tying together the original theme of using the “right” network, this chapter addresses one of the most fundamental challenges in Network Science: how to study networks when the network structure is not known. Whether it’s reconstructing the network of neurons from time series of their activity, or identifying whether one stock’s price fluctuations cause changes in another’s, this problem is ubiquitous when studying complex systems; not only that, there are (again) dozens of techniques for transforming time series data into a network. In this chapter, I measure the within-ensemble graph distance between pairs of networks that have been reconstructed from time series data using a given reconstruction technique. What I find is that different reconstruction techniques have characteristic distributions of distances and that certain techniques are either redundant or underspecified given other more comprehensive methods. Ultimately, the goal of this dissertation is to stress the importance of rigorous standards for the suite of tools we have in Network Science, which ultimately becomes an argument about how to make Network Science more useful as a science.

Table of Contents

Acknowledgements...... 2

Abstract of Dissertation...... 3

Table of Contents...... 4

List of Figures...... 7

List of Tables...... 14

Chapter 1 Introduction...... 15
1.1 Science in Network Science...... 16
1.1.1 What makes a science a science?...... 17
1.2 Theory in Network Theory...... 20
1.2.1 Networks as data objects...... 22
1.2.2 Networks as generative models of data...... 22
1.2.3 Networks as hypotheses...... 23
1.3 The current dissertation...... 24

Chapter 2 Constructing: Informative higher scales in complex networks...... 26
2.1 Introduction...... 26
2.2 Results...... 28
2.2.1 Effective information...... 28
2.2.2 Determinism and degeneracy...... 31
2.2.3 Effective information in real networks...... 32
2.2.4 Causal emergence in complex networks...... 34
2.2.5 Network macroscales...... 36
2.2.6 Causal emergence reveals the scale of networks...... 37
2.2.7 Causal emergence in real networks...... 39
2.3 Discussion...... 40
2.4 Materials and Methods...... 42
2.4.1 Selection of real networks...... 42
2.4.2 Creating consistent macro-nodes...... 43
2.4.3 Greedy algorithm for causal emergence...... 43
2.5 Follow-up research: Biological networks...... 44
2.5.1 Background: Noise in biological systems...... 44
2.5.2 Effectiveness of interactomes across the tree of life...... 47
2.5.3 Causal emergence across the tree of life...... 48
2.5.4 Resilience of macroscale interactomes...... 49
2.5.5 Discussion...... 52
2.5.6 Protein interactome data...... 55
2.5.7 Robustness of causal emergence...... 55

Chapter 3 Comparing: The within-ensemble graph distance...... 61
3.1 Introduction...... 61
3.1.1 Formalism of graph distances...... 62
3.1.2 This study...... 64
3.2 Methods...... 64
3.2.1 Ensembles...... 64
3.2.2 Graph distance measures...... 66
3.2.3 Description of experiments...... 67
3.3 Results...... 69
3.3.1 Results for homogeneous graph ensembles...... 69
3.3.2 Results for sparse heterogeneous ensembles...... 75
3.4 Discussion...... 79

Chapter 4 Reconstructing: Comparing ensembles of reconstructed networks...... 82
4.1 Introduction to the netrd package...... 82
4.1.1 Network reconstruction from time series data...... 84
4.1.2 Simulated network dynamics...... 84
4.1.3 Comparing networks using graph distances...... 84
4.1.4 Related software packages...... 84
4.2 Introduction to the (G, D, R) ensemble...... 86
4.2.1 Framing: A distribution of ground truths...... 87
4.2.2 The (G, D, R) ensemble...... 90
4.3 Methods...... 92
4.3.1 The standardized graph distance...... 92
4.3.2 Description of experiments...... 93
4.4 Results...... 93
4.5 Discussion...... 96

Chapter 5 Conclusion...... 100

References...... 102

Appendices...... 118
6.1 Chapter 2 Appendix...... 118
6.1.1 Table of key terms...... 118
6.1.2 Effective information calculation...... 118
6.1.3 Effective information of common network structures...... 121
6.1.4 Network motifs as causal relationships...... 124
6.1.5 Table of network data...... 126
6.1.6 Examples of consistent macro-nodes...... 127
6.1.7 Emergent subgraphs...... 127
6.2 Chapter 3 Appendix...... 130
6.2.1 Jaccard Distance...... 132
6.2.2 Hamming Distance...... 133

6.2.3 Frobenius...... 133
6.2.4 Polynomial Dissimilarity...... 133
6.2.5 Jensen-Shannon Divergence...... 134
6.2.6 Portrait Divergence...... 134
6.2.7 Quantum Spectral Jensen-Shannon Divergence...... 136
6.2.8 Communicability Sequence Entropy Divergence...... 136
6.2.9 Graph Diffusion Distance...... 137
6.2.10 Resistance Perturbation Distance...... 137
6.2.11 NetLSD...... 138
6.2.12 Laplacian Spectrum Distances...... 139
6.2.13 Ipsen-Mikhailov...... 140
6.2.14 Hamming-Ipsen-Mikhailov...... 140
6.2.15 Non-backtracking Spectral Distance...... 141
6.2.16 Distributional Non-backtracking Distance...... 141
6.2.17 D-measure Distance...... 142
6.2.18 DeltaCon...... 143
6.2.19 NetSimile...... 143
6.2.20 Derivation: Jaccard Distance...... 144
6.2.21 Derivation: Hamming Distance...... 145
6.2.22 Derivation: Frobenius...... 146

List of Figures

Figure 2.1: Effective information depends on network structure. (A) In Erdős-Rényi (ER) networks, we see the network’s EI level off at EI = −log2(p) as N, the network’s size, increases (log scale shown). (B) The EI of networks grown under a preferential attachment mechanism, which depends on the preferential attachment exponent, α. Under this network growth model, new nodes add their m edges (here, m = 1) to existing nodes in the network with a probability proportional to k^α. Only sublinear preferential attachment (α < 1.0) allows for the continuous growth of EI with the growth of the network. The ribbons around the data represent standard deviations after 100 simulations of each...... 28

Figure 2.2: Comparing determinism and degeneracy. (A) Left column: three example out-weight vectors, W_i^out, of a given node, v_i. A maximally deterministic vector (top left, where W_A^out corresponds to node A in the inset) is when a random walker on v_i transitions to one of its neighbors with probability 1.0, whereas indeterminism occurs when v_i has a uniform probability of visiting any node in the network in the next time step. Right: three example in-weight vectors to a given node, v_j. A maximally degenerate vector, ⟨W_i^out⟩ (top right, exemplified by the inset network motif), is when every outgoing edge in the network connects to a single node, whereas minimal degeneracy occurs when each value in ⟨W_i^out⟩ is uniformly 1/N. (B) By comparing the determinism and degeneracy of canonical network structures, we find a great deal of heterogeneity in different network models’ ratios between their determinism and degeneracy. High degeneracy is characterized by hub-and-spoke topology, as in the case of the star network. Networks with high determinism are characterized by longer average path lengths, as in the case of a ring lattice...... 30

Figure 2.3: Effective information of real networks. Effectiveness, a network’s EI normalized by log2(N) [87], of 84 real networks from the Konect network database [102], grouped by domain of origin. To look further at the names and domains of the networks in question, see SM 6.1.5. Networks in different categories have varying effectiveness (t-test, comparison of means)...... 33

Figure 2.4: Macro-nodes. (A) The original network, G, along with its adjacency matrix (left). The shaded oval indicates that subgraph S member nodes v_B and v_C will be grouped together, forming a macro-node, μ. All macro-nodes are some transformation of the original adjacency matrix via recasting it as a new adjacency matrix (right). The manner of this recasting depends on the type of macro-node. (B) The simplest form of a macro-node is when W_μ^out is an average of the W_i^out of each node in the subgraph. (C) A macro-node that represents some path-dependency, such as input from A. Here, in averaging to create the W_μ^out, the out-weights of nodes v_B and v_C are weighted by their input from v_A. (D) A macro-node that represents the subgraph’s output over the network’s stationary dynamics. Each node has some associated π_i, which is the probability of v_i in the stationary distribution of the network. The W_μ^out of a μ|π macro-node is created by weighting each W_i^out of the micro-nodes in the subgraph S by π_i / Σ_{k∈S} π_k. (E) A macro-node with a single timestep delay between input μ|j and its output μ|π, each constructed using the same techniques as its components. However, μ|j deterministically outputs to μ|π. See SM 6.1.1 for details about the creation of the W_μ^out of each of the different HOMs shown...... 35

Figure 2.5: The emergence of scale in preferential attachment networks. (A) By repeatedly simulating networks with different degrees of preferential attachment (α values) with m = 1 new edge per each new node, and running them through a greedy algorithm (described in Materials & Methods), we observe a distinctive peak of causal emergence once the degree of preferential attachment is above α = 1, yielding networks that are no longer “scale-free.” (B) The log of the ratio of original network size, N, to the size of the macroscale network, N_M. Networks with higher α values—more star-like networks—show drastic dimension reductions, and in fact all eventually reach the same N_M of 2. Comparatively, random trees (α = 0.0) show essentially no informative dimension reductions...... 38

Figure 2.6: Propensity for causal emergence in real networks. Growing snowball samples of the two network domains that previously showed the greatest divergence in effectiveness: technological and biological networks. At each snowball size, N_s, each network is sampled 20 times. Across these samples, the total amount of causal emergence for a given sample size is significantly different between the two domains (t-test, comparison of means)...... 41

Figure 2.7: Effectiveness of protein interactomes. (A) Effectiveness of all 1840 species with their superphylum association. Interactomes with a lower number of nucleotide substitutions per site tended to be Prokaryota (yellow), while those with higher tended to be Eukaryota (blue). Solid line is a linear regression comparing the effectiveness of Bacteria and Eukaryota (r = −0.40, p < 10^−5), due to the small number of Archaea that passed the threshold for reliable datasets (see Section 2.5.2). (B) The effectiveness of prokaryotic protein interactomes is greater than that of eukaryotic species, indicating that effectiveness might decrease with more nucleotide substitutions per site...... 49

Figure 2.8: Causal emergence in protein interactomes. (A) The protein interactome of each species undergoes a modified spectral analysis in order to identify the scale with EI_max. The total dimension reduction of the network is shown, with a greater effect in Eukaryota as more subgraphs are grouped into macro-nodes. That is, as evolutionary time goes on, the coarse-grained networks become a smaller fraction of their original microscale network size (r = −0.46, p < 10^−6). (B) In order to compare the degree of causal emergence in protein interactomes of different sizes, the total amount of causal emergence is normalized by the size of the network, log2(n), and we see here a positive correlation between evolution and causal emergence (r = 0.457, p < 10^−7). (C) The amount of normalized causal emergence is significantly higher for Eukaryota...... 50

Figure 2.9: Resilience of micro- and macro-nodes following causal emergence in interactomes. The resilience of species’ interactomes changes across the tree of life, as shown in previous research [194]. Using the mapping generated by computing causal emergence (Fig. 2.8B), we calculate the resilience of the network, isolating the calculation to nodes that are either part of the macroscale or the microscale. Points are color-coded according to the evolutionary domain; points with dark outlines are associated with micro-nodes that have been grouped into a macro-node (macroscale), while the points with light outlines have not been grouped into a macro-node (microscale). Nodes at the microscale contribute less to the overall resilience of a given network (0.331) compared to nodes that contribute to macro-nodes (0.543) on average (t-test, p < 1.0 × 10^−10). Note: plotted are the microscale and macroscale resilience values for each interactome in the dataset; the difference in resilience across scales holds even when only including species with more than 10, 100, or 1000 citations...... 51

Figure 2.10: Statistical controls and network robustness tests. (A) As a greater fraction of links are randomly rewired, the resulting networks’ causal emergence decreases (normalized by the causal emergence value of the original network, such that a causal emergence of 1.0 corresponds to the original network’s value). This decrease is independent of evolutionary domain, network size, density, or other network properties. (B) A second statistical control, known as a soft configuration model, assesses whether there is anything intrinsic to the network’s degree distribution that could be driving a given result. Here, we divide the average causal emergence of 10 such configuration model networks by the causal emergence values of the original protein interactome and observe that the null model networks preserve only a small fraction of the original amount of information gain (at most, the configuration models may show 3% of the original causal emergence)...... 56

Figure 2.11: Modularity distribution of Prokaryota and Eukaryota. Modularity of the community partitions of the species’ interactomes (using the Girvan-Newman community detection method [71]). The two distributions are not significantly different from one another (Student’s t-test, comparison of means, p = 0.167)...... 57

Figure 3.1: Mean and standard deviations of the within-ensemble distances for G(n,p) and RGG. By repeatedly measuring the distance between pairs of G(n,p) and RGG networks of the same size and density, we begin to see characteristic behavior in both the graph ensembles as well as the graph distance measures themselves. In each subplot, the mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots above, the standard error is too small to see), while the dashed lines are the standard deviations...... 71

Figure 3.2: Mean and standard deviations of the within-ensemble distances for G(n,⟨k⟩) networks. Here, we generate pairs of ER networks with a given average degree, ⟨k⟩, and measure the distance between them with each distance measure. In each subplot, we highlight ⟨k⟩ = 1 and ⟨k⟩ = 2. In each subplot, the mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots above, the standard error is too small to see), while the dashed lines are the standard deviations...... 72

Figure 3.3: Mean and standard deviations of the within-ensemble distances for Watts-Strogatz networks. Here, we generate pairs of Watts-Strogatz networks with a fixed size and average degree but a variable probability of rewiring random edges, p_r. In each subplot we also plot the clustering and path length curves as in the original Watts-Strogatz paper [184] to accentuate the “small-world” regime with high clustering and low path lengths. The mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots above, the standard error is too small to see), while the dashed lines are the standard deviations...... 73

Figure 3.4: Mean and standard deviations of the within-ensemble distances for soft configuration model networks with varying degree exponent. Here, we generate pairs of networks from a (soft) configuration model, varying the degree exponent, γ, while keeping ⟨k⟩ constant (n = 1000). In each subplot we highlight γ = 3. The mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots above, the standard error is too small to see), while the dashed lines are the standard deviations...... 76

Figure 3.5: Mean and standard deviations of the within-ensemble distances for preferential attachment networks. Here, we generate pairs of preferential attachment networks, varying the preferential attachment kernel, α, while keeping the size and average degree constant. As α → ∞, the networks become more and more star-like, and at α = 1, this model generates networks with power-law degree distributions. The mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots above, the standard error is too small to see), while the dashed lines are the standard deviations...... 77

Figure 4.1: Example network reconstruction from temporal data. In each example above, a system is modeled as having a true network structure (left), and its entities generate observable dynamics (middle) that are used to reconstruct a network structure that resembles the true network (missing edges highlighted in red). Here, each node in the system outputs what appears to be a continuous series of activity, such as a time series of sensor data from EEG recordings on the scalp, fluctuations of stock prices in a stock market, etc...... 83

Figure 4.2: Example of the network reconstruction pipeline. (Top row) A sample network, its adjacency matrix, and an example time series, TS, of node-level activity simulated on the network. (Bottom rows) The outputs of 15 different network reconstruction algorithms, each using TS to create a new adjacency matrix that captures key structural properties of the original network. In Table 4.1, we list the methods that we used in this study, as well as other techniques acquired over the course of this paper...... 85

Figure 4.3: Example of the graph distance measures in netrd. Here, we measure the graph distance between two networks using 20 different distance measures from netrd...... 86

Figure 4.4: Distributions of within/between-ensemble graph distances. Using real data of functional connectomes of human subjects who have and have not been diagnosed with schizophrenia [179] (n = 71 control, n = 56 treatment), we can measure the average within-ensemble graph distance of connectomes in each treatment group. Blue histogram: pairwise graph distances (measured using the Non-backtracking Spectral Distance [175]) between all connectomes in the control group. Of particular interest is the high-density region around D_NBD(G, G′) = 2.0, suggesting that there is a characteristic or expected distance between healthy brains. Red histogram: pairwise graph distances between connectomes in the two different conditions. This distribution is much less concentrated around D_NBD(G, G′) = 2.0; on the contrary, it is characterized by much higher average distances, with much higher variance than the blue histogram...... 89

Figure 4.5: Comparison of two reconstruction techniques on the same underlying network and dynamics. In each subplot, we look at the average standardized graph distance between the networks reconstructed under a given reconstruction technique (Granger causality or free energy minimization) and the ground truth network. In this example, the ground truth networks were preferential attachment networks (as in Section 3.2.1). These plots show how the average standardized distance between the ground truth and reconstructed networks changes as the underlying network structure changes with the kernel of preferential attachment, α. Additionally, we see that different distance measures pick up on different, sometimes divergent, structural properties of the reconstructed networks. The four distance measures included here are IPM, DJS, QJS, and DCN. The horizontal dashed line corresponds to D_s(G, G′) = 1, which is the mean within-ensemble graph distance for pairs of networks sampled from the same ensemble (in this case, from the nonlinear preferential attachment model). This plot includes networks with n = 64, m = 2...... 94

Figure 4.6: Highlighting the effect of dynamics on network reconstruction performance. Here, we isolate a single network reconstruction technique, Granger causality, to highlight the effect of two different dynamics simulated on the same ground truth network (again, nonlinear preferential attachment networks, varying the preferential attachment kernel, α). Again, we see different standardized graph distances between the ground truth and reconstructed networks depending on 1) the graph distance measure used and 2) the dynamics. The four distance measures included here are JAC, POD, DME, and NES. The horizontal dashed line corresponds to D_s(G, G′) = 1, which is the mean within-ensemble graph distance for pairs of networks sampled from the same ensemble (in this case, from the nonlinear preferential attachment model). This plot includes networks with n = 64, m = 2...... 95

Figure 4.7: Distributions of within-ensemble graph distances. Here, we report the distributions of within-ensemble graph distances (using QJS) for pairs of ground truth networks sampled from the same ensemble (light grey bars, D_s(G, G′)) as well as the within-ensemble graph distances of reconstructed networks from time series of Sherrington-Kirkpatrick dynamics simulated upon the ground truth networks (dark grey bars, D_s(G_r, G_r′)). In each experiment, the ground truth networks are sampled from the nonlinear preferential attachment model where n = 64, m = 2. Left: α = 1.0. Right: α = 2.0...... 96

Figure 6.1: Illustration of the calculation of effective information. (A) The adjacency matrix of a network with 1.158 bits of EI (calculation shown in (B)). The rows correspond to W_i^out, a vector of probabilities that a random walker on node v_i at time t transitions to v_j in the following time step, t + 1. ⟨W_i^out⟩ represents the (normalized) input weight distribution of the network, that is, the probabilities that a random walker will arrive at a given node v_j at t + 1, after a uniform introduction of random walkers into the network at t. (B) Each node’s contribution to the EI (EI_i) is the KL divergence of its W_i^out vector from the network’s ⟨W_i^out⟩, known as the effect information...... 120

Figure 6.2: Effective information of stars and rings. As the number of nodes in star networks increases, the EI approaches zero, while the EI of ring lattice networks grows logarithmically as the number of nodes increases...... 122

Figure 6.3: Effective information of network motifs. All directed 3-node subgraphs and their EI...... 124

Figure 6.4: Effectiveness of real networks. Full data behind the results summarized in Fig. 2.3, color-coded in two ways. First by 16 “Domains” (as in Table 6.2), which corresponds to the classification of each network from its source repository (in this case, the Konect database [102] or the Network Repository [149]). The second categorization we report—those used in Fig. 2.3—involves grouping the Domains into four “Categories” (“Cat.” in Table 6.2): Biological, Information, Social, and Technological. These correspond to the colored squares to the right of each network’s name...... 125

Figure 6.5: Typically minimal inconsistency of higher-order macro-nodes. Each inset is of the microscale network, where each node’s color corresponds to the μ|π macro-node it has been mapped to following one instance of the greedy algorithm detailed in the Materials & Methods section. White nodes indicate a micro-node that was not grouped into a new macro-node. Inconsistency is plotted over time...... 126

Figure 6.6: Causal emergence in a simplified community structure. Schematic showing the role of the two relevant parameters—the fraction of nodes in each community (ranging from r = 0.50 to r < 1.0) and the fraction of within-cluster connections (ranging from p = 0.0, a fully bipartite network, to p = 1.0, two disconnected cliques). By repeatedly simulating networks under various combinations of parameters (N = 100 with 100 simulations per combination of parameters), we see combinations that are more apt to produce networks with causal emergence...... 130

Figure 6.7: Causal emergence in Erdős-Rényi networks. (A) As the edge density, p, of ER networks increases and N is held constant, the amount of causal emergence quickly drops to zero. (B) This drop occurs well before pN = ⟨k⟩ = 1, meaning the algorithm for uncovering causal emergence is only grouping small, disconnected, tree-like subgraphs that have yet to form into a giant component. Of note here is the low magnitude of causal emergence even in cases where the random network is not a single large component, and the vanishing of causal emergence after it is...... 131

Figure 6.8: Mean and standard deviations of the within-ensemble distances for G(n,p) and G(n,⟨k⟩) as n increases. Here, we generate pairs of ER networks with either a fixed density, p, or with a fixed average degree, ⟨k⟩, as we increase the network size, n. In each subplot, the mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots above, the standard error is too small to see), while the dashed lines are the standard deviations...... 132

List of Tables

Table 3.1: Graph distances. Distance measures used to systematically compare graphs in this work, as well as their abbreviated labels and their source. Abbreviations: Lap. = Laplacian, Gauss. = Gaussian, Loren. = Lorentzian, JSD = Jensen-Shannon divergence, Euc. = Euclidean distance...... 67

Table 3.2: Experiment parameterization. Here we report the ensembles that were used in these experiments, as well as their parameterizations. For G(n,⟨k⟩) and WS key parameters, we span 100 values, spaced logarithmically, between the values above. Parameter labels: n = network size, p = density, ⟨k⟩ = average degree, p_r = probability that a random edge is randomly rewired, γ = power-law degree exponent, α = preferential attachment kernel. Note: In SI 6.2, we show how the within-ensemble graph distance changes as n increases...... 68

Table 3.3: Summary of key within-ensemble graph distance properties for different ensembles. Each of the ensembles included in this work has characteristic properties that a within-ensemble graph distance may be able to capture. Here we consolidate these various properties into a single table that classifies whether each distance has a given property. Models considered are dense Erdős-Rényi graphs (G(n,p)), random geometric graphs (RGG), sparse Erdős-Rényi graphs (G(n,⟨k⟩)), the Watts-Strogatz model (WS), the soft configuration model with power-law degree distribution (SCM), and general preferential attachment with kernel α (PA). Clarifications: In the WS model, we look at three properties: 1) the mean within-ensemble graph distance is larger for intermediate “small-world” values of p_r than it is when p_r = 1; 2) the within-ensemble graph distance is sensitive to values of p_r where the magnitude of the slope of the L_p/L_0 curve is largest (“path length sensitivity” above); 3) the within-ensemble graph distance is sensitive to values of p_r where the magnitude of the slope of the C_p/C_0 curve is largest (“clustering sensitivity” above). In the PA model, we look at whether high, positive values of α produce greater mean within-ensemble graph distances than lower, negative values of α, and at where the maximum within-ensemble distance occurs...... 70

Table 4.1: Network reconstruction techniques included in the netrd package...... 91

Table 4.2: Network dynamics studied here...... 92

Table 6.1: Table of key terms. Quantities needed in order to calculate EI and create consistent macro-nodes...... 119

Table 6.2: Network datasets. Continued on the following page...... 128

Table 6.2: Network datasets (continued)...... 129

Chapter 1

Introduction

Complex networks are the syntax of complex systems; they are models that allow us to study phenomena across nature and society. And because they are models, the famous “all models are wrong, but some are useful” adage rings especially true. We need to use the right networks to properly study complex systems, and in order to do so, the methods we use to create and analyze networks must be fit for purpose. This motivation has guided much of my dissertation, and in it, I explore three related themes around constructing, comparing, and reconstructing complex networks.

In the first chapter, I describe a theoretical and computational infrastructure that allows us to ask whether a given network captures the most informative scale to model the dynamics in the system. We see that many real-world networks (especially heterogeneous networks) exhibit an information holarchy, whereby a coarse-grained, macroscale representation of the network has more effective information than the original microscale network.
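
To make this concrete, below is a minimal sketch of the effective information calculation developed in Chapter 2 (and illustrated in Figure 6.1): each node's contribution to EI is the KL divergence of its out-weight vector, W_i^out, from the network's average out-weight vector, ⟨W_i^out⟩. This is an illustrative implementation, not the code used in the dissertation's experiments:

```python
import numpy as np
import networkx as nx

def effective_information(G):
    """EI of a network: the average KL divergence of each node's
    out-transition vector (W_i^out) from the network-wide average
    out-transition vector (<W_i^out>), in bits."""
    A = nx.to_numpy_array(G)
    has_output = A.sum(axis=1) > 0
    # Row-normalize: W[i, j] = probability a random walker on i steps to j.
    W = A[has_output] / A[has_output].sum(axis=1, keepdims=True)
    W_avg = W.mean(axis=0)  # <W_i^out>
    with np.errstate(divide="ignore", invalid="ignore"):
        kl = np.where(W > 0, W * np.log2(W / W_avg), 0.0)
    return kl.sum(axis=1).mean()

# Sanity check: for an ER network, EI should level off near -log2(p).
G = nx.erdos_renyi_graph(n=500, p=0.1, seed=1)
print(effective_information(G), -np.log2(0.1))
```

Causal emergence then amounts to asking whether some coarse-graining of G, with noisy subgraphs recast as macro-nodes, yields a higher value of this quantity than the original network does.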

In the next chapter, I consider the challenging problem of comparing pairs of networks and quantifying their differences. These tools are broadly referred to as “graph distance” measures, and there are dozens used throughout Network Science. However, unlike in other domains of Network Science where rigorous benchmarks have been established to compare our surplus of tools, there is still no theoretically-grounded benchmark for characterizing these tools. To address this, collaborators and I proposed that simple, well-understood ensembles of random networks are natural benchmarks for network comparison methods. In this chapter, I characterize over 20 different graph distance measures, and I show how this simple within-ensemble graph distance can lead to the development of new tools for studying complex networks.
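
The procedure itself is simple enough to sketch in a few lines; a worked example follows. Here the Hamming distance stands in for the 20+ measures characterized in the chapter, and the helper names are illustrative:

```python
import numpy as np
import networkx as nx

def hamming_distance(G1, G2):
    """Fraction of node pairs whose edge/non-edge status differs
    between two graphs defined on the same node set."""
    A1, A2 = nx.to_numpy_array(G1), nx.to_numpy_array(G2)
    n = A1.shape[0]
    return np.abs(A1 - A2).sum() / (n * (n - 1))

def within_ensemble_distance(sampler, distance, n_pairs=200):
    """Mean and standard deviation of a graph distance between
    independently sampled pairs of graphs from a single ensemble."""
    d = [distance(sampler(), sampler()) for _ in range(n_pairs)]
    return np.mean(d), np.std(d)

# Example: the G(n, p) ensemble with n = 100 and p = 0.1.
sampler = lambda: nx.gnp_random_graph(100, 0.1)
mean_d, std_d = within_ensemble_distance(sampler, hamming_distance)
print(f"<D> = {mean_d:.3f} +/- {std_d:.3f}")
```

Part of what makes simple ensembles natural benchmarks is that expectations like this one often have closed forms: for G(n, p), two independent samples disagree on any given node pair with probability 2p(1 − p), so ⟨D⟩ ≈ 0.18 in the example above.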

The final chapter is an example of exactly that: I show how the within-ensemble graph

distance can be used to characterize and evaluate different techniques for reconstructing networks from time series data. Tying together the original theme of using the “right” network, this chapter addresses one of the most fundamental challenges in Network Science: how to study networks when the network structure is not known. Whether it’s reconstructing the network of neurons from time series of their activity, or identifying whether one stock’s price fluctuations cause changes in another’s, this problem is ubiquitous when studying complex systems; not only that, there are (again) dozens of techniques for transforming time series data into a network. In this chapter, I measure the within-ensemble graph distance between pairs of networks that have been reconstructed from time series data using a given reconstruction technique. What I find is that different reconstruction techniques have characteristic distributions of distances and that certain techniques are either redundant or underspecified given other more comprehensive methods.
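
As a schematic of this pipeline, the sketch below simulates noisy dynamics on a ground truth network and reconstructs a network from the resulting time series. A naive thresholded-correlation method stands in for the many reconstruction techniques studied in Chapter 4 (and implemented in the netrd package); the dynamics, helper names, and parameters here are illustrative assumptions rather than the experimental setup used in that chapter:

```python
import numpy as np
import networkx as nx

def simulate_dynamics(G, steps=500, noise=0.1, seed=0):
    """Toy dynamics: each node's state relaxes toward the mean state
    of its neighbors, plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    A = nx.to_numpy_array(G)
    A = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
    x = rng.standard_normal(len(G))
    TS = []
    for _ in range(steps):
        x = 0.5 * x + 0.5 * A @ x + noise * rng.standard_normal(len(G))
        TS.append(x.copy())
    return np.array(TS)  # shape: (steps, n_nodes)

def reconstruct_by_correlation(TS, density=0.06):
    """Naive reconstruction: keep the most strongly correlated node
    pairs as edges, at a fixed target edge density."""
    C = np.abs(np.corrcoef(TS.T))
    np.fill_diagonal(C, 0)
    n = C.shape[0]
    n_edges = int(density * n * (n - 1) / 2)
    iu = np.triu_indices(n, k=1)
    threshold = np.sort(C[iu])[-n_edges]
    return nx.from_numpy_array((C >= threshold).astype(int))

# Ground truth from a preferential attachment model (n = 64, m = 2).
G_true = nx.barabasi_albert_graph(n=64, m=2, seed=1)
G_rec = reconstruct_by_correlation(simulate_dynamics(G_true))
```

Repeating this over many sampled ground truths, and then measuring distances between pairs of reconstructed networks exactly as in the previous sketch, yields the within-ensemble distance distributions analyzed in Chapter 4.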

Ultimately, the goal of this dissertation is to stress the importance of rigorous standards for the suite of tools we have in Network Science, which ultimately becomes an argument about how to make Network Science more useful as a science in its own right.

The following introduction is designed to present a philosophical look at Network Science today, an imperfect and surely incomplete discussion of the promise of Network Science as a discipline, and ultimately a proposition for why the research presented here constitutes a useful theoretical contribution in addition to its methodological contributions. I conclude this introduction with the simple argument that Network Science has the potential to be a species-defining field, but its potential is limited by a lack of formal theory; by developing network methods that are motivated by—and explicitly designed to address—broader theories of networks in general, we are better prepared to nurture cross-disciplinary insights that will heighten and sustain Network Science’s impact as a whole.

1.1 Science in Network Science

Stephen Hawking’s famous prediction, “I think the next century will be the century of complexity,” [42] appears to be coming true. In the 21st century, scientists have had to contend with the inherent messiness of the physical universe; wherever we look, we see the ubiquity of complex, interconnected, adaptive systems. Within this ever-growing field of Complex Systems Science, however, one insight has fundamentally changed how we approach these systems: networks. Networks provide a grammar, a mathematical syntax that furnishes us with a rich language to describe our complex world.

Indeed, many believe that the study of networks has grown into a science in its own right, with a rigorous set of tools that researchers now use to study a wide variety of systems, all under a common mathematical syntax. This field stems from early observations that natural systems seem to always be organized into networks—collections of nodes and links that characterize the types of interaction, influence, or communication between the constituent parts of a given system. For example: dense, subcellular networks of proteins and metabolites make

life possible; neuronal networks support consciousness, cognition, and behavior; an intricate web of personal, professional, and incidental ties creates our social world and fosters the spread of ideas, cultural norms, and beliefs. Interconnectedness, it seems, is a fundamental property of the natural world.

But is it accurate to call Network Science a science? While this may be an argument too often overheard in poorly lit pubs after a day of conference plenary talks, and while it may seem tedious or pedantic to bring up, I want to contend with this question in earnest in this dissertation. I certainly will not answer this question—nor is there likely a single answer to this question—but the work I have done over the course of my PhD speaks to a hopeful, prospective interpretation of this question. What this research points to is that yes, it is appropriate to treat Network Science as a science in its own right. Indeed, as I discuss in the following sections, I believe it is important to do so, as this makes it more likely that new, cross-disciplinary insights can emerge, the kind that might bring forth fundamentally new interpretations of our natural world.

However, this future where Network Science is a single, coherent discipline is limited by an inconsistent use of tools and practices across the discipline. Too often, the networks we study are under-informative and ill-suited to describe the systems in question. I discuss this problem and pose one possible solution in Chapter 2, highlighting the importance of constructing networks that are maximally informative; this approach addresses a related question about the proper scale we should use to model different systems. Another challenge that impedes the development of Network Science as a discipline is the inconsistent application and comparison of tools. In Chapters 3 and 4, I try to address the challenging problem of comparing networks—a persistent problem that can lead to the development and use of improper techniques for studying and characterizing networks. If we say that two networks are similar, when in fact they differ along one crucial axis, that type of mischaracterization can grow into a big problem for the maturation of Network Science as a discipline. In Chapter 3, I discuss a meta-comparison tool—a method for comparing comparison techniques known as the within-ensemble graph distance—that is introduced as a benchmark for network comparison. In Chapter 4, I use this benchmark to substantively compare different techniques for reconstructing networks from time series data. Each of these chapters describes one example—of surely many—of the kind of research that I believe will strengthen Network Science’s claim to being a scientific discipline in its own right.

1.1.1 What makes a science a science?

It can seem simple to describe a collection of activities as a science—chemistry is the set of activities that chemists do, physics is what physicists do—but it is less simple to provide a working definition that captures the true essence of a science. It was not until the late 19th century that Biology was thought to be a science in its own right; before that, we see historians of science write about a smattering of expertises ranging from taxidermy, to anatomy, to natural philosophy, medicine, and botany, all separate disciplines without a

unifying label of Biology to group them together. In his Growth of Biological Thought, Ernst Mayr describes this shift [118]:

The word “biology” is a child of the nineteenth century. Prior to that date, there was no such science. When Bacon, Descartes, Leibniz, and Kant wrote about science and its methodology, biology as such did not exist, only medicine (including anatomy and physiology), natural history, and botany (somewhat of a mixture). Anatomy, the dissection of the human body, was until far into the eighteenth century a branch of medicine, and botany likewise was practiced primarily by physicians interested in medicinal herbs. The natural history of animals was studied mainly as part of natural theology, in order to bolster the argument from [intelligent] design. The scientific revolution in the physical sciences had left the biological sciences virtually untouched. The major innovations in biological thinking did not take place until the nineteenth and twentieth centuries. It is not surprising, therefore, that the philosophy of science, when it developed in the seventeenth and eighteenth centuries, was based exclusively on the physical sciences and that it has been very difficult, subsequently, to revise it in such a way as to encompass also the biological sciences.

What, then, were the forces that brought these (at the time) disparate fields together under the same roof? And why is this parable relevant to whether or not Network Science is a science? And why is Network Science’s status as a science (or not) so urgent that it needs discussion in the beginning of this dissertation? After all, we must assume that the mere label of Biology wasn’t what brought these disciplines together, but rather, the similarities between the topics, tools, and methods of each became too apparent to ignore. Likewise, simply asserting that certain fields are or are not in the purview of Network Science does not actually unify these different disciplines.

For the purposes of this dissertation, I see it as perfectly reasonable to treat Network Science as a science; like Biology, it is a conciliatory discipline that unifies fields we once thought were separate. To arrive here, I consider a science to be a lens, a pair of goggles, a point of view, an organized way of generating new knowledge—coordinated among people who largely agree on the tools and methods for bringing about new knowledge. In this view, Biology became a science through the gradual accumulation, convergence, and use of the same tools to study different things in different settings. Here, a tool could be a measurement instrument, an idea, a framework, a definition—anything that, were it not for its presence, would render us ignorant of the process in question. Again, we see Mayr highlight this point when discussing the emergence of Biology as a discipline, emphasizing the notion of a “concept” [118]:

Instead of formulating laws [as physicists have typically done], biologists usually organize their generalizations into a framework of concepts. The claim has been made that laws versus concepts is only a formal difference, since every concept can be translated into one or several laws... Laws lack the flexibility and heuristic

usefulness of concepts... Progress in biological science is, perhaps, largely a matter of the development of these concepts or principles.

The analogy between Biology and Network Science continues to be useful for our discussions here. For example, the introduction of one special concept—the network—has had the effect of bringing together bodies of knowledge that had otherwise little (formal) relation to one another. In Biology, a similar unifying concept might be the notion of life itself, natural selection, or genetics. The effect of introducing and committing to these formal concepts in Biology meant that generic insights gained from studying one system could potentially have direct impacts on completely different systems. These types of “jumps” are an example of a translational insight, using knowledge from one system to seek out knowledge from another.¹

As an example, Nobel Prize-winning biologist Gerald Edelman famously compared the dynamics of an individual’s immune system to the adaptive dynamics of neurons over the course of a lifetime [56; 57]. To us, this comparison might seem intriguing, possibly tenuous, but surely not unbelievable. What we might not realize in this example, however, is that to even make this comparison between immune systems and brains required Edelman to make a conceptual leap that was either impossible or highly unlikely for scientists in a world without a unified Biology. Edelman’s work looked at the role of antibodies in garnering various immunities; crucially, his insights were not limited to human antibodies or mouse antibodies or Drosophila antibodies, but rather, his research applied to antibodies in general. This leap from the specific to the general is one that I suspect most scientists today would take for granted; of course insights about antibodies in mice can tell us about antibodies in humans. What this leap highlights is the essential role that Biology as a science has had on how we organize and generate new knowledge; it turns cross-disciplinary insights into stepping stones, bringing completely novel knowledge just within reach. And we are better for it. In line with these observations, Edelman writes in The Remembered Present [57]:

No scientific theory can expect to explain everything even within a restricted purview. But it should be able to point in new directions, to redefine its subject matter (at least implicitly), to unify observations, to suggest experiments, and to bear on certain philosophical issues that prepare one’s mind for further developments.

The answer to why it is appropriate and important to characterize Network Science as a science lies in the promise of the kinds of insights that this field has to offer. In the coming sections, I will use this assertion to contextualize the research discussed in this dissertation. To summarize, there are three main points that drive this framing: 1) Network Science is indeed a science. This is an important claim because it implicitly offers a permission

¹Note that the language here is around seeking out rather than finding, as these translational insights are not guaranteed to emerge when applying insights from one system to a different system. In fact, it can be quite useful when these jumps are unsuccessful, as this might point to an insufficiency in whatever theory the scientist is adopting or implicitly suggest new experiments to run.

structure for translational insights—seeking knowledge from one domain based on knowledge of another. In this sense, we see several similarities with the emergence of Biology as a single, unifying discipline. 2) An even more appropriate claim is that Network Science in its idealized form is a science. The emphasis on an idealized Network Science is particularly important at what seems to be a pivotal time in the development of this field; in this idealized world, we can seek knowledge in some networks based on our knowledge of others. 3) Currently, there are unanswered methodological questions that limit our ability to do this translational science.

1.2 Theory in Network Theory

Throughout this dissertation, I address a variety of technical problems, which themselves reveal theoretical challenges that I believe Network Science must contend with. In this section I will focus primarily on discussions that arise in Chapters 2 and 4, which deal with the theoretical rationale and tools we use for selecting which networks to study in Network Science. That is, the problems of defining an appropriate scale at which to study a given system (Chapter 2) and of reconstructing networks from time series of activity (Chapter 4) force us to question several assumptions latent throughout the field of Network Science.

First, there is an ever-present question of what is the network? We are often presented with data from complex systems with apparently networked structure—what is the network in these systems? Surely this question is too restrictive; there is not simply a single network that characterizes the structure and behavior of the system in question. However, just as surely, there are certain network representations that are better or more informative or otherwise more useful than others [182]. This point is discussed in Chapter 2, where I describe a coarse-graining procedure that recasts noisy subgraphs at the microscale of a network into single macro-nodes at a higher scale—a phenomenon known as causal emergence. These new macroscale representations of the system are selected because they increase the effective information of the networks, a quantity that has known implications for our understanding of information flow, causation, and effects in a system [172; 120; 146].

When reconstructing networks from time series data, answering the what is the network question becomes paramount, since multiple reconstruction techniques can output very different networks from the same time series input; as I will show in Chapter 3 and later in Chapter 4, the problem of quantifying the performance of any given technique is not trivial. When applied to real systems, these differences are not simply a matter of taste or of researcher preference; they can have dramatic impacts on the way that Network Science is used in, for example, clinical care for psychiatric patients [22; 21].²

²Famously, researchers have demonstrated that the brain activity of a dead salmon resembled that of healthy human subjects when using inadequate corrections for multiple comparisons in correlation-based analyses [22]. While this is a particularly clever example, it is illustrative of the extent to which we may improperly draw theoretical implications from inadequate inference tools.

The second question that these problems—selecting the most informative network scale and (especially) reconstructing networks from time series data—force us to contend with may seem trivial but is surprisingly difficult to answer: if there is in fact a network (or set of networks) that meaningfully represents the system’s underlying connectivity structure, can we even access it? That is, are our measurement tools even equipped to pick up on the connections or interactions taking place in the system? Here again, we are posed with the question of scale, both in space and in time. There are surely networks of molecules interacting with one another, swirling around our faces right now, but 1) their relative positions are transient (temporally) and 2) they are often too small to record (spatially). In this case, in order to extract meaningful measurements or macroscopic insights from this system, we are forced to abandon our search for a network in favor of structures that are more measurable (see Scholtes (2017) for insightful discussion [156]). This is not to suggest that improved measurement tools won’t give us a Network Science of Air Molecules in Rooms, but it is not clear a priori whether representing this structure could be informative or have meaningful utility for the people who would perform such measurements. A more serious example is in the context of neuronal activity in the brain, where the timescale of a single action potential can range from one to one hundred milliseconds [165]. Only recently have our measurement tools given us the ability to accurately measure dynamics at such a small and fast scale. Any application of Network Science, it seems, exists only in the space afforded to us by these measurement tools.

These two questions build to an obvious third: if there is a meaningful network structure that can characterize a given system and if our measurement tools even equip us with a sufficient spatial and temporal resolution from which to extract a network, how would we know? Here we find what I believe is (and indeed should be) the theoretical core of a Network Science: The best way to trust 1) that a network is even an appropriate description of a given system and 2) that our measurement tools even allow us to glimpse that system’s network structure is if we use and test our network representations in novel settings designed to challenge and define the limits of the network’s usefulness. In this sense, the network can be thought of as merely a hypothesis, reminiscent of the role of hypotheses in Karl Popper’s The Logic of Scientific Discovery [141]:

The game of science is, in principle, without end. He who decides one day that scientific statements do not call for any further test, and that they can be regarded as finally verified, retires from the game. Once a hypothesis has been proposed and tested, and has proved its mettle, it may not be allowed to drop out without ‘good reason’. A ‘good reason’ may be, for instance: replacement of the hypothesis by another which is better testable; or the falsification of one of the consequences of the hypothesis.

Hypotheses are introduced in order to be ruled in or ruled out, following rigorous attempts to falsify them. Here, we see that the idea of a network as a hypothesis is quite apt, and it is intriguingly flexible as a tool for bolstering Network Science’s claim as a science. To

elaborate on why this is a useful—if imperfect and temporary—framing, let us consider two alternative conceptualizations of networks that are scattered throughout Network Science. After briefly describing these different interpretations of how networks are used, we will continue trying to address the question of how we would even know that the networks we use are suited to characterize the systems in question.

1.2.1 Networks as data objects

One view of networks common throughout Network Science is that networks are data objects, artifacts that record the state of a system at a given time, which we can analyze and study for their structural properties. Today, a great deal of the research in Network Science is done in this descriptive sense—a cartography of sorts, mapping out the connectivity of systems and studying their structure. For example, road networks map out the shape of our movement patterns and can reflect changes in policy outcomes and conflict [128], urban architectural design [26], or even historical patterns of inequality and segregation [185]. Networks extracted from online social media can describe snapshots of how humans share and receive information [38], reveal the emergence of social movements and uprisings [92], or highlight opinion dynamics and polarization [1]. These types of descriptive networks—networks as data objects—have the potential to offer enormous insights to fields where they are applied (e.g., political science, sociology, etc.). In this sense, Network Science is used as simply a suite of tools that get applied to another discipline’s research questions. Of course, this is a useful endeavor—one that highlights the ubiquity of the network as a conceptual and mathematical tool—but it seems to have its limits in advancing new theories in and about Network Science, per se.

1.2.2 Networks as generative models of data

Another option is to consider networks as generative models of data, rather than the data object itself. In this view, researchers are often faced with data output from a given system, and their task is to generate a model that could—given the appropriate initial conditions—generate similar data to the kind they had collected. This approach to Network Science can take a variety of forms. On the one hand, we often see epidemiologists try to approximate underlying mobility structures within and between urban areas in order to deliver more precise predictions about the spread of disease [9; 10]. On the other hand, researchers often try to extract good generative models from systems without any a priori constraints about the number of entities in the system or how they should interact. Dynamic causal modeling in neuroscience is a good example of this, where researchers combine assumptions about the brain’s hemodynamic response with raw data output from neuroimaging to search a space of models that maximizes the likelihood of observing the data [127]. In both examples above, note that the network is treated as simply the venue for a dynamical process to play out, a tool for making predictions and retrodictions about the system in question.

1.2.3 Networks as hypotheses

I am not claiming that people who study networks are outright taking one side or another, explicitly committing to the notion that either networks are data objects or they are generative models. On the contrary, I believe that this distinction is not discussed enough in Network Science for researchers to even commit to one interpretation or the other in their research. However, these two different interpretations of what networks are situate the idea of the network as a hypothesis as a comfortable middle ground that can weave between the two.

By their very nature, networks are counterfactual objects, almost begging us to ask if-I-did-this-what-would-happen questions. If I removed connections between these neurons, would I see a change in behavior? If I cut off transit from a regional hub, would I observe fewer infections? If I introduced two friends, would they also become friends? This is why it is so hard to uncouple the idea that networks are data objects from the idea that networks are generative models of data, and it is also why the interpretation of networks as hypotheses offers a potential resolution between the two views. Additionally, treating networks as hypotheses allows us to zoom out and ask about the source of the hypothesis in the first place. That is, the presence of a hypothesis implies the presence of a theory, as alluded to by Popper in the passage below, from The Logic of Scientific Discovery [141]:

We may if we like distinguish four different lines along which the testing of a theory could be carried out. First there is the logical comparison of the conclusions among themselves, by which the internal consistency of the system is tested. Secondly, there is the investigation of the logical form of the theory, with the object of determining whether it has the character of an empirical or scientific theory, or whether it is, for example, tautological. Thirdly, there is the comparison with other theories, chiefly with the aim of determining whether the theory would constitute a scientific advance should it survive our various tests. And finally, there is the testing of the theory by way of empirical applications of the conclusions which can be derived from it. The purpose of this last kind of test is to find out how far the new consequences of the theory—whatever may be new in what it asserts—stand up to the demands of practice, whether raised by purely scientific experiments, or by practical technological applications.

Here, Popper refers to hypotheses when he mentions the “conclusions which can be derived from [a theory],” and he describes their purpose as being to test and rule out a given theory. In the context of networks as hypotheses, this is a crucial point: If networks can be thought of as hypotheses, under what theories are they generated? That is, if we are able to address the two questions from earlier in this chapter—conceiving of the right network representation of a given system and having the appropriate measurement tools to even construct the network—we are still left with a question of why we want to use a network (i.e., test a hypothesis) in the first place. Hasok Chang discusses a useful analog to this point in his book, Inventing Temperature: Measurement and Scientific Progress [39]:

The scientific study of heat started with the invention of the thermometer. That is a well-worn cliché, but it contains enough truth to serve as the starting point of our inquiry. And the construction of the thermometer had to start with the establishment of "fixed points." Today we tend to be oblivious to the great challenges that the early scientists faced in establishing the familiar fixed points of thermometry, such as the boiling and freezing points of water.

Chang’s characterization here is useful and highlights a few key similarities in the study of networks. The thermometer is a useful measurement tool. Similarly, data scraping, survey methods, reconstruction techniques, etc. are useful tools for measuring networks in different systems. In this example, the key difference is that today’s thermometers would not have emerged without theory, in particular the theory that there are (approximate) “fixed points” in the phase space of water. The analogy stops here, as I do not know of similar theoretical objects in Network Science. And this is precisely the point. Too often, our approach to Network Science does not consider the useful notion that a network can be thought of as a hypothesis; as such, we are not forced to contend with the logical dissonance that follows (if this network is my hypothesis, then what the heck is my theory?).

1.3 The current dissertation

Recall that I posed this notion of networks as hypotheses as a way to address the third core question at the heart of Network Science: if there is a meaningful network structure that can characterize a given system, and if our measurement tools equip us with a sufficient spatial and temporal resolution from which to extract a network, how would we know? The answer, I believe, lies in how we use our network representations of the systems we study. Much like assessing the usefulness of a given hypothesis, the usefulness of a given network representation rests on whether or not it can generate new hypotheses or inform new interventions into the system.

To the extent that representing a system with a given network structure brings forth new knowledge, it can be thought of as a useful representation. This approach contends with the first two questions—whether we are appropriately representing the network structure of a given system and whether our measurement tools are sufficient for capturing that structure—by assuming that a given network is already a good representation of a system and proceeding as if that is the case. If we commit to representing and understanding a given system based on our network representation of it, and if our interventions into the system are successful (i.e., they generate informative data or resolve statistical disagreements between competing models), then it suggests we are justified in committing to that particular network representation of the system. This does not mean our work is done, nor that our understanding of the system in question is fixed and will not change in response to new data; it simply means that the particular network is an informative lens through which to understand and generate new hypotheses about the system. (This same point can be expressed in the negative: what happens if a given network representation of a system does not bestow new insights or generate useful information? If repeated attempts to extract network-based insights from a given system fail to generate informative data or fail to distinguish competing models of the system's behavior, it suggests that the suite of tools available is not equipped to be applied to the system in question.)

Ultimately, I believe that the core problem in defining Network Science as a science rests here. In a sense, our story is cyclical:

1. In order for Network Science to be a useful, actionable science, our networks must be informative and equipped to represent the systems they are designed to represent.

2. In order for our networks to represent the systems they are intended to represent, we must be able to accurately measure and construct networks at the scale where these informative network structures presumably exist.

3. In order to have confidence in our measurements of the systems in question, we have to test the networks that we extract from them, using these networks to generate predictions and insights about the system, which iteratively bring about new knowledge.

4. In order to design proper interventions and tests for these networks, there has to be a theoretical infrastructure that can generate competing hypotheses and experimental settings in which to test them.

In short, we need advances in theory (#4 above) in order to ultimately address and improve the technical questions of whether certain methods are suited to study a given system (#1 and #2 above). Both processes bootstrap one another, ever climbing the rungs of scientific progress.

The discussion throughout this introduction is not meant to police, rule in, or rule out what is or is not Network Science. As such a young field, Network Science benefits from new, divergent interpretations and ideas. My hope is that this discussion motivates the following three chapters of this dissertation, contextualizing why constructing, comparing, and reconstructing are three core challenges when we think about the future of Network Science. Network Science has the potential to be a species-defining field, but its potential is limited by a lack of formal theory; by developing network methods that are motivated by—and explicitly designed to address—broader theories of networks in general, we are better prepared to nurture the cross-disciplinary insights that will heighten and sustain Network Science's impact as a whole.

Chapter 2

Constructing: Informative higher scales in complex networks

Summary: The connectivity structure between nodes in a network contains information about their interactions, associations, or dependencies; this information is implicitly counterfactual, in that it constrains what would happen if we were to intervene on a given node. In this chapter, we show that this information can be analyzed by measuring the uncertainty (and certainty) in paths along nodes and links in a network, using a quantity known as effective information. Networks with higher effective information contain more information about potential interactions or influence between nodes; as models, these networks are more informative and are thus more useful for describing the system in question. We then show how subgraphs of nodes can be grouped into macro-nodes, reducing the size of a network while increasing its effective information (a phenomenon known as causal emergence). In sum, this work provides a suite of theoretically-inspired tools that can be used to study higher scales in different systems.

2.1 Introduction

Networks provide a powerful syntax for representing a wide range of systems, from the trivially simple to the highly complex [13; 130; 4]. It is common to characterize networks based on structural properties like their degree distribution or clustering, and the study of such properties has been crucial for the growth of Network Science. Yet there remains a gap in our treatment of the information contained in the relationships between nodes in a network, particularly in networks that have both weighted connections and feedback, which are hallmarks of complex systems [98; 144]. As we will show, analyzing this information allows for modeling the network at the most appropriate, informative scale. This is especially critical for networks that describe interactions or dependencies between nodes, such as contact networks in epidemiology [140], neuronal and functional networks in the brain [21], or interaction networks among cells, genes, or drugs [18], as these networks can often be analyzed at multiple different scales.

Here we introduce information-theoretic measures that capture the information contained in the connectivity of a network, which can be used to identify when these networks possess informative higher scales. To do so, we focus on the out-weight vector, $W_i^{out}$, of each node, $v_i$, in a network. This vector consists of weights $w_{ij}$ between $v_i$ and its neighbors, $v_j$, with $w_{ij} = 0$ if there is no edge from $v_i$ to $v_j$. For each $W_i^{out}$ we assume $\sum_j w_{ij} = 1$, which means $w_{ij}$ can be interpreted as the probability $p_{ij}$ that a random walker on $v_i$ will transition to $v_j$ in the next time step, where a random walker might represent the passing of a signal, an interaction, or a state-transition [116]. The information contained in a network's connectivity can be characterized by the uncertainty among its nodes' out-weights and in-weights. The total information in the relationships between nodes is a function of this uncertainty and can be derived from two properties.

The first is the uncertainty of a node's outputs, which is the Shannon entropy [158] of its out-weights, $H(W_i^{out})$. The average of this entropy across all nodes, $\langle H(W_i^{out}) \rangle$, is the amount of noise present in the network's relationships. Only if $\langle H(W_i^{out}) \rangle = 0$ is the network deterministic.

The second property is how weight is distributed across the whole network, captured by the vector $\langle W_i^{out} \rangle$. This vector is composed of elements that are the sum of the in-weights $w_{ji}$ to each node $v_i$ from each of its incoming neighbors, $v_j$ (then normalized by the total weight of the network). Its entropy, $H(\langle W_i^{out} \rangle)$, reflects how certainty is distributed across the network. If all nodes link only to the same node, then $H(\langle W_i^{out} \rangle) = 0$, and the network is totally degenerate, since all nodes lead to the same node.

The effective information (EI) of a network is the difference between these two quantities:

$$EI = H(\langle W_i^{out} \rangle) - \langle H(W_i^{out}) \rangle \qquad (1)$$

The entropy of the distribution of out-weights in the network forms an upper bound of the amount of unique information in the network’s relationships, from which the information lost due to the uncertainty of those relationships is subtracted. Networks with high EI contain more certainty in the relationships between nodes in the network (since the links represent less uncertain dependencies, unique associations, or deterministic transitions), whereas networks with low EI contain less certainty. Note that EI can be interpreted simply as a structural property of random walkers on a network and their behavior, similar to other common network measures [116].
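As a concrete illustration of Eq. (1), the following minimal Python sketch computes EI from a network's weighted adjacency matrix. This is an illustrative sketch based on the definitions above, not the code released with this work; it assumes numpy and networkx, and that every node with outgoing edges can have its out-weights normalized to sum to 1.

```python
import numpy as np
import networkx as nx

def entropy(p):
    """Shannon entropy (in bits) of a probability vector, ignoring zeros."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def effective_information(G):
    """EI = H(<W_i^out>) - <H(W_i^out)>, per Eq. (1)."""
    W = nx.to_numpy_array(G)                 # weighted adjacency matrix
    out_strength = W.sum(axis=1)
    W = W[out_strength > 0]                  # keep only nodes with outputs
    W = W / W.sum(axis=1, keepdims=True)     # row-normalize: the W_i^out vectors
    avg_out = W.mean(axis=0)                 # <W_i^out>: network-wide weight distribution
    return entropy(avg_out) - np.mean([entropy(row) for row in W])
```

On a directed ring, for example, effective_information(nx.cycle_graph(8, create_using=nx.DiGraph)) returns 3.0 bits, i.e. $\log_2(8)$: every node's transition is deterministic and no two nodes share a target.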

Here, we use this measure to develop a general classification of networks (key terms can be found in Supplementary Materials, SM 6.1.1). Furthermore, we show how the connectivity and different growth rules of a network have a deep relationship to that network's EI. This also provides a principled means of quantifying the amount of information among the micro-, meso-, and macroscale dependencies in a network. We introduce a formalism for finding and assessing the most informative scale of a network: the scale that minimizes the uncertainty in the relationships between nodes. For some networks, a macroscale description of the network can be more informative in this manner, demonstrating a phenomenon known as causal emergence [87; 86], which here we generalize to complex networks. This provides a rigorous means of identifying when networks possess an informative higher scale.

Figure 2.1: Effective information depends on network structure. (A) In Erdős-Rényi (ER) networks we see the network's EI level off at $EI = -\log_2(p)$ as $N$, the network's size, increases (log scale shown). (B) The EI of networks grown under a preferential attachment mechanism depends on the preferential attachment exponent, $\alpha$. Under this network growth model, new nodes add their $m$ edges (here, $m = 1$) to existing nodes in the network with a probability proportional to $k^\alpha$. Only sublinear preferential attachment ($\alpha < 1.0$) allows for the continuous growth of EI with the growth of the network. The ribbons around the data represent standard deviations after 100 simulations of each.

2.2 Results

2.2.1 Effective information

This work extends to networks previous research that used effective information to measure the amount of information in the causal relationships between the mechanisms or states of a system. Originally, EI was introduced to capture the causal influence between two subsets of neurons as a step in the calculation of integrated information in the brain [172]. Later, a system-wide version of EI was shown to capture fundamental causal properties in Boolean networks of logic gates, particularly their determinism and degeneracy [87].

Our current derivation from first principles of an EI for networks is equivalent to this system-wide definition (SM 6.1.2), which was based originally on interventions upon system states. For example, if a system in a particular state, A, always transitions to state B, the causal relationship between A and B can be represented by a node-link diagram wherein the two nodes—A and B—are connected by a directed arrow, indicating that B depends on A. This might be a node pair in a "causal diagram" (often represented as a directed acyclic graph, or DAG) such as those used in [137; 136] to represent interventions and causal relationships. In such a case, the information in the causal relationship between A and B can be assessed by intervening to randomize A ($do(A = H^{max})$) and measuring the effects on B. The EI would be the mutual information between A and B under such randomization: $I(do(A = H^{max}), B)$ [85].

To expand this framework to networks in general, we relax this intervention requirement by assuming that the elements in $W_i^{out}$ sum to 1. In this case, an "intervention" can be interpreted as dropping a random walker on the network. For example, if the network represents a DAG or Markov chain, then dropping a random walker on a node $v_i$ would be equivalent to $do(v_i)$. The entropy of the transitions of the random walkers and how those transitions are distributed defines the EI of a network. In this generalized formulation, only in networks where the nodes and edges actually represent dynamics, interactions, or couplings does EI indicate information about causation. In the case where edges represent correlations, or where what nodes or edges represent is undefined, EI is merely a structural property of the information contained in the behavior of hypothetical random walkers (however, this situation is no different from other analysis methods that rely on random walkers).

Here we describe how this generalized structural EI property behaves in common network models, asking basic questions about the relationship between a network’s EI and its size, density, and structure. These inquiries allow for the exhaustive classification and quantification of the information contained in the connectivity of real networks. It is intuitive that the EI of a network will increase as the network grows in size. In general, adding more nodes should increase the entropy, which should in turn increase the amount of information. However, in cases of randomness rather than structure, EI should reflect this randomness. We found this is indeed the case.

Fig. 2.1a shows the relationship between a network's EI and its size under several parameterizations of Erdős-Rényi (ER) random graphs [61; 28]. As the size of an ER network increases (while keeping constant the probability that any two nodes will be connected, $p$), its EI converges to a value of $-\log_2(p)$. That is, in random networks EI is dominated solely by the probability that any two nodes are connected, a key finding which demonstrates that, after a certain point, a random network structure does not contain more information as its size increases. This shift occurs in ER networks at approximately $\langle k \rangle = \log_2(N)$, which is also the point at which we can expect all nodes to be in a giant component [13]. This finding illustrates that network connectivity must be non-random to increase the amount of information in the relationships between nodes (see SM 6.1.3 for derivation). Note that if a network is maximally dense (i.e. a fully connected network, with self-loops), $EI = 0.0$.


Figure 2.2: Comparing determinism and degeneracy. (A) Left column: three example out-weight vectors, $W_i^{out}$, of a given node, $v_i$. A maximally deterministic vector (top left, where $W_A^{out}$ corresponds to node A in the inset network motif) is when a random walker on $v_i$ transitions to one of its neighbors with probability 1.0, whereas indeterminism occurs when $v_i$ has a uniform probability of visiting any node in the network in the next time step. Right column: three example in-weight vectors to a given $v_j$. A maximally degenerate vector, $\langle W_i^{out} \rangle$ (top right, exemplified by the inset network motif), is when every outgoing edge in the network connects to a single node, whereas minimal degeneracy occurs when each value in $\langle W_i^{out} \rangle$ is uniformly $1/N$. (B) By comparing the determinism and degeneracy of canonical network structures, we find a great deal of heterogeneity in different network models' ratios between their determinism and degeneracy. High degeneracy is characterized by hub-and-spoke topology, as in the case of the star network. Networks with high determinism are characterized by longer average path lengths, as in the case of a ring lattice.

However, we expect such dense low-EI structures to be uncommon, since network structures found in nature and society tend to be sparse [50].
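As a quick numerical check of the ER behavior described above, one could sample growing ER networks at a fixed $p$. This sketch reuses the effective_information() helper from the earlier block; the parameter values here are arbitrary, chosen only for illustration.

```python
import numpy as np
import networkx as nx

p = 0.05
for N in [200, 800, 3200]:
    G = nx.erdos_renyi_graph(N, p, directed=True, seed=0)
    print(N, round(effective_information(G), 3))
# As N grows with p fixed, the printed EI should approach
# -log2(p), which is about 4.32 bits for p = 0.05.
```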

We report another key relationship between a network's connectivity and its EI in Fig. 2.1b. We again compare the EI of a network to its size, focusing on networks grown under different parameterizations of a preferential attachment model [101; 16]. Under a preferential attachment growth model, a new node is added to the network at each time step, contributing $m$ new edges to the network; these $m$ edges connect to nodes already in the network, $v_j$, with a probability proportional to $k_j^\alpha$. Here $k_j$ is the degree of node $v_j$ and $\alpha$ tunes the amount of preferential attachment. A value of $\alpha = 0.0$ corresponds to each node having an equal chance of receiving a new node's link (i.e., no preferential attachment). The classic Barabási-Albert network corresponds to linear preferential attachment, $\alpha = 1.0$ [16]. Superlinear preferential attachment, $\alpha > 1.0$, creates networks that have less and less EI, eventually resembling star-like structures (see SM 6.1.3 for derivation). As shown in Fig. 2.1b, only in cases of sublinear preferential attachment, $\alpha < 1.0$, does the network's EI continue to increase with its size. When $\alpha = 0.0$—creating a random tree—the network's EI increases logarithmically as its size increases.

The maximum possible EI in a network of $N$ nodes is $\log_2(N)$. This can be seen in the case of a directed ring network where each node has one incoming link and one outgoing link, each with a weight of 1.0, so each node has one node uniquely connecting to it. In such a network, each node contributes zero uncertainty, since $H(W_i^{out}) = 0.0$, while $H(\langle W_i^{out} \rangle) = \log_2(N)$, and therefore its EI is always $\log_2(N)$. In general, the EI of an undirected lattice is fixed entirely by its size and dimension (i.e. $d = 1$ is an undirected ring, $d = 2$ is a torus, etc. [184]), so for such lattices $EI = \log_2(N) - \log_2(2d)$ (see SM 6.1.3 for derivation).
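As a worked instance of the lattice formula (the values here are chosen purely for illustration):

```latex
% Worked example with illustrative values: an undirected ring (d = 1)
% with N = 64 nodes. Each node has 2d = 2 neighbors, each receiving
% out-weight 1/2, so H(W_i^out) = 1 bit for every node and the
% degeneracy term vanishes:
EI = \log_2(N) - \log_2(2d) = \log_2(64) - \log_2(2) = 6 - 1 = 5 \text{ bits}
```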

The picture that emerges is that EI is inextricably linked to a network's connectivity and growth (a relationship that extends even to network motifs, as shown in SM 6.1.4) and therefore to the fundamentals of Network Science. Random networks have a fixed amount of EI, and scale-freeness ($\alpha = 1.0$) represents the critical bound for the growth of EI. In general, dense networks and star-like networks have less EI. The next section explores how EI's components explain these associations.

2.2.2 Determinism and degeneracy

Determinism and degeneracy are the two fundamental components of EI [87]. They are based on a network's connectivity (see Figure 2.2a for a visual explanation), specifically the degree of overlapping weight in the network. Determinism and degeneracy are derived from the uncertainty over outputs and the uncertainty in how those outputs are distributed, respectively:

$$\text{determinism} = \log_2(N) - \langle H(W_i^{out}) \rangle \qquad (2)$$
$$\text{degeneracy} = \log_2(N) - H(\langle W_i^{out} \rangle) \qquad (3)$$

In a maximally deterministic network wherein all nodes have a single output, $w_{ij} = 1.0$, the determinism is $\log_2(N)$ because $\langle H(W_i^{out}) \rangle = 0.0$. Conceptually, this means that a random walker will move deterministically starting from any node. Degeneracy is the amount of information in the connectivity lost via an overlap in input weights (e.g., if multiple nodes output to the same node). In a perfectly non-degenerate system where all nodes have equal input weights, the degeneracy is zero since $H(\langle W_i^{out} \rangle) = \log_2(N)$. Together, determinism and degeneracy can be used to define EI:

$$EI = \text{determinism} - \text{degeneracy} \qquad (4)$$

These two quantities provide clear explanations for why different networks have the EI they do. For example, as the size of an Erdős-Rényi random network increases, its degeneracy approaches zero, which means the EI of a random network is driven only by the determinism of the network, which is in turn the negative log of the probability of connection, $p$. Similarly, in $d$-dimensional ring lattice networks, the degeneracy term is always zero, which means the EI of a ring lattice structure also reduces to the determinism of that structure. Ring networks with an average degree $\langle k \rangle$ will have a higher EI than ER networks with the same average degree because ring networks will have a higher determinism value. In the case of star networks, the degeneracy term alone governs the decay of the EI, such that hub-and-spoke-like structures quickly become uninformative in terms of cause and effect (see SM 6.1.3 for derivations concerning these cases). In general, this means that canonical networks can be characterized by their ratio of determinism to degeneracy (see Fig. 2.2b).
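The decomposition in Eqs. (2)-(4) is a small extension of the earlier EI sketch. Again, this is illustrative rather than the released implementation; it assumes a row-stochastic out-weight matrix W and the entropy() helper defined above.

```python
import numpy as np

def determinism_degeneracy(W):
    """Eqs. (2)-(3) for a row-stochastic out-weight matrix W (N x N)."""
    N = W.shape[0]
    det = np.log2(N) - np.mean([entropy(row) for row in W])   # Eq. (2)
    deg = np.log2(N) - entropy(W.mean(axis=0))                # Eq. (3)
    return det, deg

# EI = det - deg, per Eq. (4); dividing that EI by log2(N)
# gives the effectiveness of Eq. (5) below.
```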

2.2.3 Effective information in real networks

So far, we have been agnostic as to the origin of the network under analysis. As described previously, to measure the EI of a network one can create each $W_i^{out}$ by normalizing each node's out-weight vector to sum to 1.0. Regardless of what the relationships between the nodes represent, the network's determinism reflects how targeted the out-weights of the nodes are (networks with more targeted links possess higher EI), while the degeneracy captures the overlap in the targeting of nodes. High EI reflects greater specificity in the connectivity, whereas low EI indicates a lack of specificity (as in Fig. 2.2a). This generalizes our results to multiple types of representations, although the origin of the normalized network should be kept in mind when interpreting the value of the measure.

Since the EI of a network will change depending on the network's size, we use a normalized form of EI known as effectiveness in order to compare the EI of real networks. Effectiveness ranges from 0.0 to 1.0 and is defined as:

$$\text{effectiveness} = \frac{EI}{\log_2(N)} \qquad (5)$$

As the determinism of a network decreases toward its minimum and its degeneracy increases toward its maximum, the effectiveness of that network will trend to 0.0. Regardless of its size, a network wherein each node has a deterministic output to a unique target has an effectiveness of 1.0. In Fig. 2.3, we examine the effectiveness of 84 different networks corresponding to data from real systems. These networks were selected primarily from the Konect Network Database [102], which was used because its networks are publicly available, range in size from dozens to tens of thousands of nodes, often have a reasonable interpretation as being based on interactions between nodes, and are diverse, ranging from social networks, to power networks, to metabolic networks. We defined four categories of interest: biological, social, informational, and technological. We selected our networks by using all the available networks (under 40,000 nodes, due to computational constraints) in the domains corresponding to each category within the Konect database and, where it was appropriate, the Network Repository as well [149]. See the Materials & Methods section and SM Table 6.2 for a full description of this selection process.

Figure 2.3: Effective information of real networks. Effectiveness, a network's EI normalized by $\log_2(N)$ [87], of 84 real networks from the Konect network database [102], grouped by domain of origin. To look further at the names and domains of the networks in question, see SM 6.1.5. Networks in different categories have varying effectiveness (t-test, comparison of means).

Lower effectiveness values correspond to structures that either have high degeneracy (as in the right column of Fig. 2.2a), low determinism (as in the left column of Fig. 2.2a), or a combination of both. In the networks we measured, biological networks on average have lower effectiveness values, whereas technological networks on average have the highest effectiveness. This finding aligns intuitively with what we know about the relationship between EI and network structure, and it also supports long-standing hypotheses about the role of redundancy, degeneracy, and noise in biological systems [58; 174]. On the other hand, technological networks like power grids, autonomous systems, or airline networks are on average associated with higher effectiveness values. One explanation for this difference is that efficiency in human-made technological networks tends to create sparser, non-degenerate networks with higher effectiveness on average, wherein the nodes' relationships are more specific in their targeting.

It might be surprising to find that evolved networks have such low effectiveness. But, as we will show, low effectiveness can actually indicate that there is informative higher-scale (macroscale) connectivity in the system. That is, low effectiveness can reflect the fact that biological systems often contain higher-scale structure, which we demonstrate in the following section.

2.2.4 Causal emergence in complex networks

This new global network measure, EI, offers a principled way to answer an important question: what is the scale that best captures the connectivity of a complex system? The resolution of this question is important because science analyzes the structure of different systems at different spatiotemporal scales, often preferring to intervene on and observe systems at levels far above the microscale [86]. This is likely because relationships at the microscale can be extremely noisy and therefore uninformative, and coarse-graining can minimize this noise [87]. Indeed, this noise minimization is grounded in Claude Shannon's noisy-channel coding theorem [158], wherein dimension reductions can operate like codes that use more of a channel's capacity [85]. Higher-level causal relationships often perform error-correction on the lower-level relationships, thus generating extra effective information at those higher scales. Measuring this difference provides a principled means of deciding when higher scales are more informative (emergence) or when higher scales are extraneous, epiphenomenal, or lossy (reduction).

Bringing these issues to network science, we can now ask: what representation will minimize the uncertainty present in a network? We do this by examining causal emergence, which occurs when a dimensionally-reduced network contains more informative connectivity, in the form of a higher EI, than the original network. Note that, as discussed, EI can be interpreted solely as a general structural property of networks. Therefore, while we still call this phenomenon "causal emergence" because it has the same mathematical formalization as previous work in Boolean networks and Markov chains [87; 85; 86], here we focus on how it can be used to identify the informative higher scales of networks regardless of what those networks represent.

Notably, the phenomenon can be measured by recasting networks at higher scales and observing how the EI changes, a process which identifies whether the network’s higher scales add information above and beyond lower scales.

Figure 2.4: Macro-nodes. (A) The original network, $G$, along with its adjacency matrix (left). The shaded oval indicates that subgraph $S$ member nodes $v_B$ and $v_C$ will be grouped together, forming a macro-node, $\mu$. All macro-nodes are some transformation of the original adjacency matrix, recasting it as a new adjacency matrix (right). The manner of this recasting depends on the type of macro-node. (B) The simplest form of a macro-node is when $W_\mu^{out}$ is an average of the $W_i^{out}$ of each node in the subgraph. (C) A macro-node that represents some path-dependency, such as input from $v_A$. Here, in averaging to create the $W_\mu^{out}$, the out-weights of nodes $v_B$ and $v_C$ are weighted by their input from $v_A$. (D) A macro-node that represents the subgraph's output over the network's stationary dynamics. Each node has some associated $\pi_i$, which is the probability of $v_i$ in the stationary distribution of the network. The $W_\mu^{out}$ of a $\mu|\pi$ macro-node is created by weighting each $W_i^{out}$ of the micro-nodes in the subgraph $S$ by $\pi_i / \sum_{k \in S} \pi_k$. (E) A macro-node with a single timestep delay between its input $\mu|j$ and its output $\mu|\pi$, each constructed using the same techniques as its components; however, $\mu|j$ deterministically outputs to $\mu|\pi$. See SM 6.1.1 for details about the creation of the $W_\mu^{out}$ of each of the different HOMs shown.

2.2.5 Network macroscales

First we must introduce how to recast a network, $G$, at a higher scale. This recast network is represented by a new network, $G_M$. Within $G_M$, a micro-node is a node that was present in the original $G$, whereas a macro-node is a node, $\mu$, that represents a subgraph, $S_i$, from the original $G$ (replacing the subgraph within the network). Since the original network has been dimensionally reduced by grouping nodes together, $G_M$ will always have fewer nodes than $G$.

A macro-node $\mu$ is defined by some $W_\mu^{out}$, derived from the edge weights of the various nodes within the subgraph it represents. One can think of a macro-node as a summary statistic of the underlying subgraph's behavior, a statistic that takes the form of a single node. Ultimately there are many ways of representing a subgraph, that is, of building a macro-node, and some are more consistent than others in capturing the subgraph's behavior, depending on the connectivity. We highlight here that macroscales of networks should in general be consistent with their underlying microscales in terms of their dynamics. While this has never been assessed within networks or systems generally, previous research has asked whether the macroscales of structural equation models are consistent with the effects of all possible interventions [152].

Here, to decide whether or not a macro-node is a consistent summary of its underlying subgraph, we formalize consistency as a measure of whether random walkers behave identically on $G$ and $G_M$. We do this because random walks are often used to represent dynamics on networks [116], and therefore many important analyses and algorithms—such as PageRank for determining a node's importance [132] or InfoMap for community discovery [150]—are based on random walks.

Specifically, we define the inconsistency of a macroscale as the Kullback-Leibler divergence [46] between the expected distribution of random walkers on $G$ vs. $G_M$, given some identical initial distribution on each. The expected distribution over $G$ at some future time, $t$, is $P_m(t)$, while the distribution over $G_M$ at some future time $t$ is $P_M(t)$. To compare the two, the distribution $P_m(t)$ is summed over the same nodes as in the macroscale $G_M$, resulting in the distribution $P_{M|m}(t)$ (the microscale distribution viewed at the macroscale). We can then define the macroscale inconsistency over some series of time steps $T$ as:

$$\text{inconsistency} = \sum_{t=0}^{T} D_{KL}\left[P_M(t) \,\|\, P_{M|m}(t)\right] \qquad (6)$$

This consistency measure addresses the extent to which a random dynamical process on the microscale topology will be recapitulated on a dimensionally-reduced topology (for how this is applied in our analysis, see Materials & Methods).
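Below is a sketch of Eq. (6) under simplifying assumptions; all names here are illustrative, not from the text. It assumes Wm and WM are row-stochastic transition matrices for $G$ and $G_M$, that `partition` maps each micro-node to its node in $G_M$, and that $P_{M|m}$ assigns nonzero probability wherever $P_M$ does (so the KL divergence stays finite).

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL[p || q] in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def inconsistency(Wm, WM, partition, p_micro, p_macro, T=100):
    """Sum of D_KL[P_M(t) || P_{M|m}(t)] for t = 0..T, per Eq. (6)."""
    total = 0.0
    pm, pM = p_micro.copy(), p_macro.copy()
    for _ in range(T + 1):
        pM_given_m = np.zeros_like(pM)
        for micro_node, macro_node in enumerate(partition):
            pM_given_m[macro_node] += pm[micro_node]   # project micro onto macro
        total += kl_divergence(pM, pM_given_m)
        pm, pM = pm @ Wm, pM @ WM                      # one random-walk step on each
    return total
```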

What constitutes a consistent macroscale depends on the connectivity of the subgraph that gets grouped into a macro-node, as shown in Fig. 2.4. The $W_\mu^{out}$ can be constructed based on the collective $W_i^{out}$ of the subgraph (shown in Fig. 2.4a). For instance, in some cases one could just coarse-grain a subgraph by using its average $W_i^{out}$ as the $W_\mu^{out}$ of some new macro-node $\mu$ (as in Fig. 2.4b; a sketch of this construction follows below). However, it may be that the subgraph has dependencies not captured by such a coarse-grain. Indeed, this is similar to the recent discovery that, when constructing networks from data, it is often necessary to explicitly model higher-order dependencies by using higher-order nodes so that the dynamics of random walks stay true to the original data [190]. We therefore introduce higher-order macro-nodes (HOMs), which draw on similar techniques to consistently represent subgraphs as single nodes [190].
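Here is a minimal sketch of the simplest construction just described (Fig. 2.4b), in which a subgraph $S$ is replaced by one macro-node whose out-weights are the average of the member nodes' out-weight vectors. The names and the matrix convention (row-stochastic W, macro-node placed last) are assumptions for illustration.

```python
import numpy as np

def coarse_grain(W, S):
    """Recast a row-stochastic matrix W so the nodes in S become one macro-node."""
    S = list(S)
    keep = [i for i in range(W.shape[0]) if i not in S]
    # out-weights: kept rows unchanged, macro row = average of the rows in S
    rows = np.vstack([W[keep], W[S].mean(axis=0, keepdims=True)])
    # in-weights: edges into any node of S are summed into the macro column
    W_M = np.hstack([rows[:, keep], rows[:, S].sum(axis=1, keepdims=True)])
    return W_M / W_M.sum(axis=1, keepdims=True)   # re-normalize each row
```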

Different subgraph connectivities require different types of HOMs to consistently represent them. For instance, HOMs can be based on the input weights to the macro-node, which take the form $\mu|j$. In these cases the $W_{\mu|j}^{out}$ is a weighted average of each node's $W_i^{out}$ in the subgraph, where the weight is based on the input weight to each node in the subgraph (Fig. 2.4c). Another type of HOM that generally leads to consistent macro-nodes over time is when the $W_\mu^{out}$ is based on the stationary output from the subgraph to the rest of the network, which we represent as $\mu|\pi$ (Fig. 2.4d). These types of HOMs may have minor inconsistencies given some initial state, but will almost always trend toward perfect consistency as the network approaches its stationary dynamics (outlined in Section 2.4).

Subgraphs with complex internal dynamics can require a more complex type of HOM in order to preserve the macro-node's consistency. For instance, in cases where subgraphs have a delay between their inputs and outputs, this can be represented by a combination of $\mu|j$ and $\mu|\pi$, which when combined captures that delay (Fig. 2.4e). In these cases the macro-node $\mu$ has two components, one of which acts as a buffer over a timestep. This means that macro-nodes can possess memory even when constructed from networks that are memoryless at the microscale, and in fact this type of HOM is sometimes necessary to consistently capture the microscale dynamics.

We present these types of macro-nodes not as an exhaustive list of all possible HOMs, but rather as examples of how to construct higher scales in a network by representing subgraphs as nodes, sometimes using higher-order dependencies to ensure those nodes are consistent. This approach offers a complete generalization of previous work on coarse-grains [87] and also black boxes [5; 85; 112], while simultaneously solving the previously unresolved issue of macroscale consistency by using higher-order dependencies. The types of macro-nodes formed by subgraphs also provide substantive information about the network, such as whether the macroscale of a network possesses memory or path-dependency.

2.2.6 Causal emergence reveals the scale of networks

A network has an informative macroscale when a recast network, $G_M$ (a macroscale), has more EI than the original network, $G$ (the microscale). In general, networks with lower effectiveness (low EI given their size) have a higher potential for such emergence, since they can be recast to reduce their uncertainty. Searching across groupings allows the identification or approximation of a macroscale that maximizes the EI.


Figure 2.5: The emergence of scale in preferential attachment networks. (A) By repeatedly simulating networks with different degrees of preferential attachment ($\alpha$ values), with $m = 1$ new edge per each new node, and running them through a greedy algorithm (described in Materials & Methods), we observe a distinctive peak of causal emergence once the degree of preferential attachment is above $\alpha = 1$, yielding networks that are no longer "scale-free." (B) The log of the ratio of original network size, $N$, to the size of the macroscale network, $N_M$. Networks with higher $\alpha$ values—more star-like networks—show drastic dimension reductions, and in fact all eventually reach the same $N_M$ of 2. Comparatively, random trees ($\alpha = 0.0$) show essentially no informative dimension reductions.

Checking all possible groupings is computationally intractable for all but the smallest networks. Therefore, in order to find macro-nodes which increase the EI, we use a greedy algorithm that groups nodes together and checks if the EI increases. By choosing a node and then pairing it iteratively with its surrounding nodes we can grow macro-nodes until pairings no longer increase the EI, and then move on to a new node (see the Materials & Methods section for details on this algorithm).

By generating undirected preferential attachment networks and varying the degree of preferential attachment, $\alpha$, we observe a crucial relationship between preferential attachment and causal emergence. One of the central results in network science has been the identification of "scale-free" networks [16]. Our results show that networks that are not "scale-free" can be further separated into micro-, meso-, and macroscales depending on their connectivity. This scale can be identified based on their degree of causal emergence (Fig. 2.5a). In cases of sublinear preferential attachment ($\alpha < 1.0$), networks lack higher scales. Linear preferential attachment ($\alpha = 1.0$) produces networks that are scale-free, which is the zone of preferential attachment right before the network develops higher scales. Such higher scales only exist in cases of superlinear preferential attachment ($\alpha > 1.0$). And past $\alpha > 3.0$ the network begins to converge to a macroscale where almost all the nodes are grouped into a single macro-node. The greatest amount of causal emergence is found in mesoscale networks, which is when $\alpha$ is between 1.5 and 3.0 and networks possess a rich array of macro-nodes. Note that the increase in EI following macroscale groupings for $\alpha > 1.0$ shown in Fig. 2.5a resembles the decrease in EI with higher $\alpha$ that we observe in Fig. 2.1b. This is because after $\alpha > 1.0$ the decreasing EI of the microscale leaves room for improvement of the EI at the macroscale, following a grouping of nodes.

Correspondingly, the size of $G_M$ decreases as $\alpha$ increases and the network develops an informative higher scale, which can be seen in the ratio of macroscale network size, $N_M$, to the original network size, $N$ (Fig. 2.5b). As discussed previously, networks generated with higher values of $\alpha$ will be more and more star-like. Star-like networks have higher degeneracy and thus less EI, and because of this, we expect that there are more opportunities to increase the network's EI through grouping nodes into macro-nodes. Indeed, the ideal grouping of a star network is when $N_M = 2$ and $EI = 1$ bit. This result is similar to recent advances in spectral coarse-graining, which also find that the ideal coarse-graining of a star network is to collapse it into a two-node network, grouping all the spokes into a single macro-node [105], which is what happens to star networks that are recast as macroscales.
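The two-node result for star networks can be checked by hand:

```latex
% Worked check: in a star with one hub and N-1 spokes, grouping all
% spokes into a single macro-node yields a deterministic two-node cycle
% (hub -> spoke-macro-node -> hub). Every H(W_i^out) = 0 and
% H(<W_i^out>) = log_2(2), so the recast network has
EI = \log_2(2) - 0 = 1 \text{ bit}
```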

Our results also offer a principled and general approach to community detection, by asking when there is an informational gain from replacing a subgraph with a single node. Therefore we can define causal communities as being when a cluster of nodes, or some subgraph, forms a viable macro-node (note that this assumes the connections in the network actually represent possible causal interactions; absent that assumption, it is merely a topological property). Fundamentally, causal communities represent noise at the microscale. The closer a subgraph is to complete noise, the greater the gain in EI from replacing it with a macro-node (see SM 6.1.7). Minimizing the noise in a given network also identifies the optimal scale at which to represent that network. However, there must be some structure that can be revealed by noise minimization in the first place. In cases of random networks that form a single large component and lack any such structure, causal emergence does not occur (as shown in SM 6.1.7).

2.2.7 Causal emergence in real networks

The presence and informativeness of macroscales should vary across real networks, depending on connectivity. Here we investigate the disposition toward causal emergence of real networks across different domains. We draw from the same set of networks that were analyzed in Fig. 2.3, the selection process and details of which are outlined in the Materials & Methods section. The network sizes span up to 40,000 nodes, making it unfeasible to find the best macroscales for each of them. Therefore, we focus specifically on the two categories that previously showed the greatest divergence in terms of the EI: biological and technological. Since we are interested in the general question of whether biological or technological networks show a greater disposition or propensity for causal emergence, we approximate causal emergence by calculating the causal emergence of sampled subgraphs of growing sizes. Each sample is found using a "snowball sampling" procedure, wherein a node is chosen randomly and then a weakly connected subgraph of a specified size is found around it [81] (see the sketch below). This subgraph is then analyzed using the previously described greedy algorithmic approach to find macro-nodes that maximize the EI in each network. Each available network is sampled 20 times for each sample size. In Fig. 2.6, we show how the causal emergence of these real networks differentiates as we increase the sampled subgraph size, in a sequence of 50, 100, 150, and finally 200 nodes per sample. Networks of these sizes previously provided ample evidence of causal emergence in simulated networks, as in Fig. 2.5a. Comparing the two categories of real networks, we observe a significantly greater propensity for causal emergence in biological networks, and this difference is more articulated the larger the samples are. Note that constructing a random null model of these networks (e.g., a configuration model) would tend to create networks with minimal or negligible causal emergence, as is the case for ER networks (Fig. 6.7 in SM 6.1.7).
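A sketch of the snowball-sampling step described above (an illustrative breadth-first version, not necessarily the exact procedure of [81]):

```python
import random
import networkx as nx

def snowball_sample(G, size, seed=None):
    """Grow a weakly connected subgraph of roughly `size` nodes around a random seed node."""
    rng = random.Random(seed)
    start = rng.choice(list(G.nodes))
    visited, frontier = {start}, [start]
    while frontier and len(visited) < size:
        node = frontier.pop(0)
        for neighbor in nx.all_neighbors(G, node):   # in- and out-neighbors
            if neighbor not in visited and len(visited) < size:
                visited.add(neighbor)
                frontier.append(neighbor)
    return G.subgraph(visited).copy()
```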

That subsets of biological systems show a high disposition toward causal emergence is consistent with, and even explanatory of, many long-standing hypotheses surrounding the existence of noise and degeneracy in biological systems [173]. It also explains the difficulty of understanding how the causal structure of biological systems functions, since these systems are cryptic: they contain certainty at one level and uncertainty at another.

2.3 Discussion

We have shown that the information in the relationships between nodes in a network is a function of the uncertainty intrinsic to their connectivity, as well as how that uncertainty is distributed. To capture this information we adapted a measure, effective information (EI), for use in networks, and analyzed what it reveals about common network structures that have been studied by network scientists for decades. For example, the EI of an ER random network tends to $-\log_2(p)$, and whether the EI of a preferential attachment network grows or shrinks as new nodes are added is a function of whether its degree of preferential attachment, $\alpha$, is greater or less than 1.0. In networks where the mechanisms or transitions are unknown, but the structure is known, EI captures the degree of unique targeting in the network. In real networks, we showed that the EI of biological networks tends to be much lower than that of technological networks.

We also illustrated that what has been called "causal emergence" can occur in networks. This is the gain in EI that occurs when a network, $G$, is recast as a new network, $G_M$. Finding this sort of informative higher scale means balancing the minimization of uncertainty while simultaneously maximizing the number of nodes in the network. These methods may be useful in improving scientific experimental design, the compression and search of big data, model choice, and even machine learning.


Figure 2.6: Propensity for causal emergence in real networks. Growing snowball samples of the two network domains that previously showed the greatest divergence in effectiveness: technological and biological networks. At each snowball size, $N_s$, each network is sampled 20 times. Across these samples, the total amount of causal emergence for a given sample size is significantly different between the two domains (t-test, comparison of means).

Importantly, not every recast network, $G_M$, will have a higher EI than the $G$ that it represents; that is, these same techniques can identify cases of reduction. Ultimately, this is because comparing the EI of different network representations provides a ground for comparing the effectiveness of any two network representations of the same complex system. These techniques allow for the formal identification of the scale of a network. Scale-free networks can be thought of as possessing a fractal pattern of connectivity [17], and our results show that the scale of a network is the breaking of that fractal in one direction or the other.

The study of higher-order structures in networks is an increasingly rich area of research [150; 23; 156; 192; 103], often focusing on constructing networks that better capture the data they represent. Here we introduce a formal and generalized way to recast networks at higher scales while preserving random walk dynamics. In many cases, a macroscale of a network can be just as consistent in terms of random walk dynamics and also possess greater EI. Some macro-nodes in a macroscale may be of different types, with different higher-order properties. In other words, we show how to turn a lower-order network into a higher-order network. One noteworthy and related aspect of our work is demonstrating how a system that is memoryless at the microscale can actually possess memory at the macroscale, indicating that whether a system has memory is a function of scale.

While some [163] have previously recast subgraphs as individual nodes, as we do here, they have not done so in ways that are based on noise minimization and maximizing consistency, focusing instead on gains to algorithmic speed via compression. Explicitly creating macro-nodes to minimize noise brings the dependencies of the network into focus. This means that causal emergence in networks has a direct relationship to community detection, a vast sub-discipline that treats dense subgraphs within a network as representing shared properties, membership, or functions [65; 143]. However, the relationship between causal emergence and traditional community detection is not as direct as it may seem. For one, causal emergence is high in networks with high degeneracy (i.e. networks with high-degree hubs, as we show in Fig. 2.5a). Community detection algorithms do not typically select for such structural properties, instead focusing on dense subgraphs that connect more highly within the subgraph than outside [65]. In SI Fig. 6.6, we show a landscape of stochastic block model networks and their associated values for causal emergence. Indeed, in networks that would have high modularity [131] (e.g. two disconnected cliques), we do observe causal emergence, but only when the two disconnected cliques are of different sizes. This distinction is key and situates networks that display causal emergence in a meaningful place in the study of complex networks. In light of this, macro-nodes offer a sort of community detection where the micro-nodes that make up a macro-node form a community, and ultimately can be replaced by a macro-node that summarizes their behavior while reducing the subgraph's noise. Under this interpretation, a community is characterized by noise rather than shared memberships.

2.4 Materials and Methods

2.4.1 Selection of real networks

Networks were chosen to represent the four categories of interest: social, informational, biological, and technological (see SM Fig. 6.4, where we detail the same information as in Fig. 2.3, but also include the source of the network data in addition to the effectiveness value of each network). We used all the available networks under 40,000 nodes (due to computational constraints) within all the domains in the Konect database that reflected our categories of interest. For our social category we used the domains Human Contact, Human Social, Social, and Communication. For our information category we used the domains Citations, Co-authorship, Hyperlinks, Lexical, and Software. For our biological category we used the domains Trophic and Metabolic. Due to overlaps between the Konect database and the Network Repository [149] in these domains, and the paucity of other biological data in the Konect database, we also included the Brains domain and the Ecology domain from the Network Repository to increase our sample size (again, all networks within these domains under 40,000 nodes were included). For our technological category, we used the domains Computer and Infrastructure from the Konect database. Again due to overlap between the Konect database and the Network Repository, we also included the Technological and Power Networks domains from the Network Repository. For a full table of the networks used in this study, along with their source and categorization, see Table 6.2.

2.4.2 Creating consistent macro-nodes

Previously we outlined methods for creating consistent macro-nodes of different types. Here we explore their implementation, which requires deciding which macroscales are consistent. Inconsistency is measured as the Kullback-Leibler divergence between the expected distribution of random walkers on both the microscale ($G$) and the macroscale ($G_M$), given an initial distribution, as in Eq. (6).

To measure the inconsistency we use an initial maximum entropy distribution on the shared nodes between $G$ and $G_M$, that is, only the set of nodes that are left ungrouped in $G_M$. Similarly, we only analyze the expected distribution over that same set of micro-nodes. Since such distributions are only over a portion of the network, to normalize each distribution to 1.0 we include a single probability that represents all the non-shared nodes between $G$ and $G_M$ (representing when a random walker is on a macro-node).

We focus on the shared nodes between $G$ and $G_M$ for the inconsistency measure because: a) it is easy to calculate, which is necessary during an algorithmic search; b) except for unusual circumstances, the inconsistency over the shared nodes still reflects the network as a whole; and c) even in cases of the most extreme macroscales (such as when $\alpha > 4$ in Fig. 2.5), there are still nodes shared between $G$ and $G_M$. Here we examine our methods of using higher-order dependencies in order to demonstrate that this creates consistent macro-nodes. We use 1000 simulated preferential attachment networks, chosen as a uniform random sample of the parameters $\alpha = 1.0$ to $2.0$, $n = 25$ to $35$, and with either $m = 1$ or $2$. These networks were then grouped via the algorithm described in the following section. All macro-nodes were of the $\mu|\pi$ type and their inconsistency was checked over 1000 timesteps. These macro-nodes generally have consistent dynamics, either because they start that way or because they trend toward consistency over time; of the 1000 networks, only 4 had any divergence greater than 0 after 1000 timesteps. In Fig. 6.5 in SM 6.1.6, we show 15 of these simulated networks, along with their parameters, numbers of macro-nodes, and consistencies. Note that even in the cases with early nonzero inconsistency, the inconsistency is always very low in absolute terms of bits, and of the randomly chosen 15, all trend toward consistency over time. In our observations most macro-nodes converge before 500 timesteps; therefore, in analyzing the real-world networks using the $\mu|\pi$ macro-node, we check all macro-nodes for consistency and only reject those that are inconsistent at 500 timesteps. More details about the algorithmic approach to finding causal emergence can be found in the following section.

2.4.3 Greedy algorithm for causal emergence

The greedy algorithm used for finding causal emergence in networks is structured as follows: for each node, vi, in the shuffled node list of the original network, collect a list of neighboring nodes, {vj} ∈ Bi, where Bi is the Markov blanket of vi (in graphical models, the Markov blanket, Bi, of a node, vi, corresponds to the "parents", the "children", and the "parents of the children" of vi [67]). This means that {vj} ∈ Bi consists of nodes with outgoing edges leading into vi, nodes that the outgoing edges from vi lead into, and nodes that have outgoing edges leading into the out-neighbors of vi. For each node in {vj}, the algorithm calculates the EI of a macroscale network after vi and vj are combined into a macro-node, vM, according to one of the macro-node types in Fig. 2.4. If the resulting network has a higher EI value, the algorithm stores this structural change and, if necessary, supplements the queue of nodes, {vj}, with any new neighboring nodes from vj's Markov blanket that were not already in {vj}. If a node, vj, has already been combined into a macro-node via a grouping with a previous node, vi, then it will not be included in new queues, {v′j}, of later nodes to check. The algorithm iteratively combines such pairs of nodes until every node, vj, in every node vi's Markov blanket is tested.
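The sketch below condenses this procedure, assuming a networkx DiGraph and some effective_information(G) function (for instance, the one in the einet package linked later in this chapter). For brevity it contracts node pairs directly rather than constructing the µ|π macro-node types of Fig. 2.4, so it should be read as a simplified outline of the search, not the full algorithm.

```python
import random
import networkx as nx

def markov_blanket(G, v):
    """Parents, children, and parents-of-children of v in a DiGraph."""
    parents = set(G.predecessors(v))
    children = set(G.successors(v))
    coparents = {p for c in children for p in G.predecessors(c)}
    return (parents | children | coparents) - {v}

def greedy_causal_emergence(G, effective_information):
    G = G.copy()
    grouped = set()                      # nodes already absorbed into a macro-node
    best_ei = effective_information(G)
    nodes = list(G.nodes())
    random.shuffle(nodes)                # the algorithm shuffles the node list
    for vi in nodes:
        if vi in grouped or vi not in G:
            continue
        queue = list(markov_blanket(G, vi))
        while queue:
            vj = queue.pop()
            if vj in grouped or vj not in G or vj == vi:
                continue
            trial = nx.contracted_nodes(G, vi, vj, self_loops=False)
            trial_ei = effective_information(trial)
            if trial_ei > best_ei:       # keep the merge only if EI increases
                best_ei, G = trial_ei, trial
                grouped.add(vj)
                # vj's old blanket may expose new testable neighbors of vi
                queue.extend(u for u in markov_blanket(G, vi)
                             if u not in queue and u not in grouped)
    return G, best_ei
```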

2.5 Follow-up research: Biological networks

In the following subsections, I present and discuss results from follow-up work on causal emergence in biological systems [84].

Summary The internal workings of biological systems are notoriously difficult to understand. Due to the prevalence of noise and degeneracy in evolved systems, in many cases the workings of everything from gene regulatory networks to protein-protein interactome networks remain black boxes. One consequence of this black-box nature is that it is unclear at which scale to analyze biological systems to best understand their function. We analyzed the protein interactomes of over 1800 species, containing in total 8,782,166 protein-protein interactions, at different scales. We demonstrate the emergence of higher-order "macroscales" in these interactomes and show that these biological macroscales are associated with lower noise and degeneracy and therefore lower uncertainty. Moreover, the nodes in the interactomes that make up the macroscale are more resilient compared to nodes that do not participate in the macroscale. These effects are more pronounced in interactomes of Eukaryota than in those of Prokaryota. This points to a plausible evolutionary adaptation for macroscales: biological networks evolve informative macroscales to gain the benefits of both being uncertain at lower scales, which boosts their resilience, and being "certain" at higher scales, which increases their effectiveness at information transmission. Our work explains some of the difficulty in understanding the workings of biological networks, since they are often most informative at a hidden higher scale, and provides the tools to make these informative higher scales explicit.

2.5.1 Background: Noise in biological systems

Interactions in biological systems are noisy and degenerate in their functions, making them fundamentally different from those in engineered systems [58; 176].

The sources of noise in biology are nearly ubiquitous and vary widely. Noise may exist in a gene regulatory network, wherein a gene might upregulate another gene but only probabilistically, or in protein binding, wherein a protein may bind randomly across a set of possible pairings. There are numerous sources of such indeterminism in cells and tissues, ranging from how cell molecules are buffeted by Brownian motion [59], to the stochastic opening and closing of ion channels [45], and even to the chaotic dynamics of neural activity [20].

There are also numerous sources of degeneracy within the cellular, developmental, and genetic operation of organisms [31]. Degeneracy is when an end state or output, like a phenotype, can come from a large number of possible states or inputs [174].

Due to this indeterminism and degeneracy, the dynamics and function of biological systems are often uncertain. This hampers control of system-level properties for biomedicine and synthetic bioengineering, and it hampers the understanding of modelers and experimentalists who wish to build "big data" approaches to biology like interactomes, connectomes, and molecular pathway maps [52; 114]. While there have been many attempts to characterize and understand this uncertainty in biological systems [174], the explanations typically do not extend beyond the advantages of redundancy in these systems [186].

How do noise and uncertainty span the tree of life? Here we examine this question in biological networks, a common type of model for biological systems [3; 30]. Specifically, we examine protein-protein interactomes from organisms across a wide range of taxa to investigate whether the noise and uncertainty in biological networks increases or decreases across evolution. In order to quantify this noise and uncertainty we make use of the effective information (EI), an information-theoretic network quantity based on the entropy of random walker behavior on a network. A lower EI indicates greater noise and uncertainty. Indeed, the EI of biological networks has already been shown in Section 2.2.7 to be lower than that of technological networks, which opens the question of why this is the case.

To see how EI changes across evolution, we examined networks of protein-protein interactions (PPIs) from organisms across the tree of life. The dataset consists of interactomes from 1840 species (1,539 Bacteria, 111 Archaea, and 190 Eukaryota) derived from the STRING database [168; 169]. These interactomes have previously been used to study the evolution of resilience, where researchers found that species tended to have higher values of network resilience with increasing evolution (wherein "evolution" was defined as the number of nucleotide substitutions per site and can—very loosely—be interpreted as a time variable) [194]. In our work, we take a similar approach, highlighting changes in interactome properties as evolution progresses.

Additionally, we focus on identifying when interactomes have informative macroscales. A macroscale refers to some dimension reduction, such as an aggregation, coarse-graining, or grouping, of states or elements of the biological system. In networks this takes the form of replacing subgraphs of the network with individual nodes (macro-nodes). A network has an

informative macroscale when subgraphs of the network can be grouped into macro-nodes such that the resulting dimensionally-reduced network gains EI [97]. When such grouping leads to an increase in EI, we describe the resulting macro-node as being part of an informative macroscale. Following previous work, we refer to any gain in EI at the macroscale as causal emergence [87]. With these techniques, we can identify which PPIs have informative macroscales and which do not. By correlating this property with where (in time) each species lies in the evolutionary tree, we show that informative macroscales tend to emerge later in evolution, being associated more with Eukaryota than Prokaryota (such as Bacteria).

What is the evolutionary advantage of having informative higher scales? This question is important because higher scales minimize noise or uncertainty in biological networks. Yet such uncertainty or noise represents a fundamental paradox. The more noisy a network is, the more uncertain and the less effective that network is (effectiveness here meaning the ability to reliably transform inputs to outputs, such as upregulating a particular gene in response to a detected chemical in the environment). Therefore, we might expect evolved networks to be highly effective. Yet this is the opposite of what we observe. Instead we observe that the effectiveness of lower scales decreases later in evolution, as higher scales that are effective emerge.

We argue here that this multi-scale behavior is the resolution to a paradox: there are advantages to being effective, but there are also advantages to being less effective and therefore more uncertain or noisy. For instance, less effective networks might be more resistant to attack or node failure due to redundancy. The paradox is that networks that are certain are effective yet are vulnerable to attacks or node failures, while networks that are uncertain are less effective but are resilient in the face of attacks or node failures. We argue that biological networks have evolved to resolve this "certainty paradox" by having informative higher scales. Specifically, we propose that the macroscales of a network evolve to have high effectiveness, while their underlying microscales may have low effectiveness, therefore making the system resilient without paying the price of a low effectiveness.

In a biological sense, node failures or attacks in a cellular network may represent certain mutations in proteins or other biochemical entities, which in turn may prevent regular functioning of the system [15]. Biological networks should then, over the course of evolution, develop degeneracy and noise at lower scales to maintain regular functioning, while at the same time developing effectiveness at a higher level. This transformation can be achieved by the action of both neutral and selective processes in evolution. Neutral processes such as pre-suppression, aided by mutations, increase the number of interactions [110] and can therefore decrease network effectiveness. On the other hand, selective processes can weed out the noise that interferes with the functioning and efficiency of the system [33]. An interplay of these evolutionary processes can lead to a resolution of the "certainty paradox" in cellular networks through the development of informative macroscales.

This work therefore presents an explanation for the observed trend of increasing resilience through evolution [194]: informative macroscales make networks more resilient. Finally, we

offer insights into biological processes at the molecular level that might be responsible for the emergence of informative macroscales in protein-protein interaction networks, specifically looking at the differences between Bacteria, which have a low rate of nucleotide substitutions per site, and Eukaryota, which exhibit a higher rate. Understanding the basic principles governing the differences in efficiency and uncertainty between these major divisions of life can help us comprehend the trade-offs involved in information processing in PPIs across evolution.

2.5.2 Effectiveness of interactomes across the tree of life

Effective information is a network property reflecting the certainty (or uncertainty) contained in that network's connectivity [97]. It is a structural property of a network calculated by quantifying the uncertainty in subsequent states of random walkers on a network. In PPI networks, the nodes are individual proteins and the edges of the network are interactions, generally describing the possibility of binding between two proteins. Therefore, the uncertainty we analyze is uncertainty as to which protein(s) a given protein might interact (or bind) with. Each node in the network has out-weights, which are represented by a vector, Wi^out. For instance, protein A might share an edge with protein B and also protein C. Therefore, WA^out is [1/2, 1/2]. Since most protein interactomes are undirected, the edges are normalized for each node (such that the sum of Wi^out for each node is 1.0).¹ The uncertainty associated with each protein can be captured by examining the entropy of the outputs of a node, H(Wi^out), such that a higher entropy indicates more uncertainty as to its interactions [158]. The entropy of the distribution of weight across the entire network, H(⟨Wi^out⟩), reflects the spread of uncertainty across the network. A lower H(⟨Wi^out⟩) means that information is distributed only over a small number of nodes. A higher H(⟨Wi^out⟩) signifies that information is dispersed throughout the network. The EI of a network can then be defined as the entropy of the distribution of weights over the network minus the average uncertainty inherent in the weight of each node, or:

EI = H(⟨Wi^out⟩) − ⟨H(Wi^out)⟩

as in Equation 1. EI can itself be further decomposed into the degeneracy and indeterminism of a network [97; 85], each of which indicates a lack of specificity in the network's connectivity or interactions. Degeneracy indicates a lack of specificity in targeting nodes (many nodes target the same node), while indeterminism indicates a lack of specificity in targeted nodes

¹Note that this process of normalization implies that the probability of binding is uniform across the different possible interactions. This transformation into a directed network makes the networks amenable to standard tools of network science, such as analyzing random walk dynamics, and it is also necessary to calculate the EI of the network. Additionally, the uniform distribution of 1/n is the simplest a priori assumption. However, the actual probability of binding is dependent on biological background conditions, such as protein prevalence, that are not included in most open-source models, and therefore our analysis could change if such detailed probabilities were known.

(nodes target many nodes). Note that, if networks are considered deterministic in the physical sense, the indeterminism term of EI still reflects the uniqueness of targets in the network.

A network where all the nodes target a single node will have zero EI (since it has maximum degeneracy), as will a network where all nodes target all other nodes (complete indeterminism). EI will only be maximal if every node has a unique output. This forces the EI of a network to be bounded by log2(n), where n is the number of nodes in the network (see Section 6.1.3 for a more in-depth look at this property). Therefore, in order to compare networks of different sizes, we use the effectiveness, as in Equation 5.
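As a concrete illustration of Equation 1 and the effectiveness normalization of Equation 5, a minimal computation might look like the following, assuming a networkx graph; this mirrors, in simplified form, the calculation implemented in the einet package (github.com/jkbren/einet).

```python
import numpy as np
import networkx as nx

def effective_information(G):
    """EI = H(<Wi_out>) - <H(Wi_out)>, computed from the out-weights."""
    W = nx.to_numpy_array(nx.DiGraph(G))      # adjacency rows as out-weights
    W = W[W.sum(axis=1) > 0]                  # drop nodes with no outputs
    W = W / W.sum(axis=1, keepdims=True)      # normalize each row to 1.0

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    avg_out = W.mean(axis=0)                  # <Wi_out>, the mean out-weight vector
    return entropy(avg_out) - np.mean([entropy(row) for row in W])

def effectiveness(G):
    """EI normalized by its log2(n) upper bound, as in Equation 5."""
    return effective_information(G) / np.log2(G.number_of_nodes())

# e.g., a star network is highly degenerate (all leaves target the hub)
# and so has a low effectiveness: effectiveness(nx.star_graph(10))
```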

To explore the change in effectiveness of biological networks, we examined the protein interactomes of 1840 species divided between Archaea, Bacteria, and Eukaryota (see below for details on the origin and nature of these protein interactomes). We found a clear pattern in the effectiveness of the networks based on where they are located in the tree of life (Fig. 2.7), the position of which is based on each species' small subunit ribosomal RNA gene sequence information [88]. Overall, we found that the mean effectiveness of protein interactomes actually decreases later in the tree of life as nucleotide substitutions occurred. Specifically, Bacteria were found to have a greater effectiveness (0.77) compared to Eukaryota (0.72) on average (Student's t-test, p < 10⁻⁸). Following Zitnik et al. (2019), we restricted further statistical analysis to interactomes with more than 1000 citations, in order to use the most well-founded protein interactomes; the directionality and significance of the result is unchanged when only those above 100 citations are included, and likewise when all interactomes are included (Student's t-test, p < 10⁻¹¹). Due to the small number of Archaea interactomes with more than 1000 citations, we did not include those samples in Fig. 2.7B.

2.5.3 Causal emergence across the tree of life

At first, the higher effectiveness in Prokaryota interactomes as compared to that of Eukaryota (as shown in Fig. 2.7) may seem counter-intuitive. One might naively expect the effectiveness of cellular machinery, including or especially interactomes, to increase over evolutionary time, instead of decreasing as we have shown.

One hypothesis to explain these results is that, while protein interactomes become less effective at their microscales over evolutionary time, the interactomes are nonetheless able to remain effective due to the emergence of informative macroscales as evolution proceeds. Results from this analysis support our initial hypothesis that effectiveness is actually being transitioned to the macroscales of biological networks in Eukaryota over evolutionary time, even though the microscales become noisier and less effective. The total amount of causal emergence (the gain of EI by grouping subgraphs into macro-nodes) was identified for each protein interactome from each species, normalized by the total size of that protein interactome (Fig. 2.8B). Across the tree of life we observe that Eukaryota have more informative macroscales and show a significant difference from Prokaryota in the percentage of microscale nodes that get grouped into macro-nodes (Fig. 2.8A).


Figure 2.7: Effectiveness of protein interactomes. (A) Effectiveness of all 1840 species with their superphylum association. Interactomes with a lower number of nucleotide substitutions per site tended to be Prokaryota (yellow), while those with higher numbers tended to be Eukaryota (blue). Solid line is a linear regression comparing the effectiveness of Bacteria and Eukaryota (r = −0.40, p < 10⁻⁵); Archaea are excluded due to the small number that passed the threshold for reliable datasets (see Section 2.5.2). (B) The effectiveness of prokaryotic protein interactomes is greater than that of eukaryotic species, indicating that effectiveness might decrease with more nucleotide substitutions per site.

2.5.4 Resilience of macroscale interactomes

Why might biological networks evolve over time to have informative macroscales? As previously discussed, one answer might be that having multi-scale structure provides benefits that networks with only a single scale lack. All networks face a "certainty paradox." The paradox is that uncertainty in connectivity is desirable since it is protective against node failures. For instance, a node failure could be the removal of a protein due to a nonsense mutation, or the inability to express a certain protein due to an environmental effect, such as a lack of resources, or even a viral attack. In turn, this could lead to a loss of biological function, the development of disease, or even cell death. A protein interactome may be resilient to such node failures by being highly uncertain or degenerate in its protein-protein interactions. However, this comes at a cost. High uncertainty can lead to problems with reliability, uniqueness, and control in terms of effects, such as an inability for a particular protein to deterministically bind with another protein. For instance, in a time of environmental restriction of resources, certain protein-protein interactions may be necessary for continued cellular function, but if there is large-scale uncertainty, even significant upregulation of the genes controlling expression may not lead reliably to a certain interaction. Here we explore these issues by examining the network resilience of protein interactomes in response to node removals, which represent either attacks or general node failures.

In order to measure the resilience of the network in response to a node removal we follow


Figure 2.8: Causal emergence in protein interactomes. (A) The protein interactome of each species undergoes a modified spectral analysis in order to identify the scale with EImax. The total dimension reduction of the network is shown, with a greater effect in Eukaryota as more subgraphs are grouped into macro-nodes. That is, as evolutionary time goes on, the coarse-grained networks become a smaller fraction of their original microscale network size (r = −0.46, p < 10⁻⁶). (B) In order to compare the degree of causal emergence in protein interactomes of different sizes, the total amount of causal emergence is normalized by the size of the network, log2(n); we see here a positive correlation between evolution and causal emergence (r = 0.457, p < 10⁻⁷). (C) The amount of normalized causal emergence is significantly higher for Eukaryota.

[194], using the change in the Shannon entropy of the component size distribution of the network following random node removal. Here, pc is the probability that a randomly selected node is in connected component c ∈ C following the removal of a fraction f of the nodes in the network; the entropy associated with the component size distribution, H(Gf), is:

H(Gf) = − (1/log2(N)) Σ_{c=1}^{nc} pc log2(pc)     (3)

where nc is the number of connected components remaining after a fraction f of nodes has been removed (note: "removed" here indicates that the nodes become isolates, still contributing to the component size distribution though not retaining any of the original links). The change in entropy, H(Gf), as f increases from 0.0 to 1.0 corresponds to the resilience of the network in question. Specifically, this resilience is defined as follows:

Resilience(G) = 1 − Σ_{f=0}^{1} H(Gf)/rf     (4)

where rf is the rate of node removal (i.e., the number of increments by which the fraction f increases from 0.0 to 1.0). In this work, we default to a value of rf = 100, which means that the calculation of a network's resilience involves iteratively removing 1%, 2%, ..., 100% of the nodes in the network. For each value of f, we simulate the node removal process 20 times.
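A minimal sketch of Eqs. (3) and (4) might look as follows, assuming an undirected networkx graph; "removal" here isolates nodes rather than deleting them, as described above, and the helper names are illustrative.

```python
import random
import numpy as np
import networkx as nx

def component_entropy(G, N):
    """H(Gf): normalized entropy of the component size distribution, Eq. (3)."""
    sizes = np.array([len(c) for c in nx.connected_components(G)])
    p = sizes / N                        # isolates count as size-1 components
    return -np.sum(p * np.log2(p)) / np.log2(N)

def resilience(G, steps=100, trials=20):
    """Resilience(G) = 1 - sum_f H(Gf)/rf, Eq. (4), with rf = steps."""
    N = G.number_of_nodes()
    vals = []
    for _ in range(trials):
        acc = 0.0
        for s in range(1, steps + 1):
            H = G.copy()
            removed = random.sample(list(G.nodes()), int(N * s / steps))
            H.remove_edges_from(list(H.edges(removed)))  # isolate, keep nodes
            acc += component_entropy(H, N) / steps
        vals.append(1.0 - acc)
    return float(np.mean(vals))          # average over the 20 removal trials
```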


Figure 2.9: Resilience of micro- and macro-nodes following causal emergence in interactomes. The resilience of species' interactomes changes across the tree of life, as shown in previous research [194]. Using the mapping generated by computing causal emergence (Fig. 2.8B), we calculate the resilience of the network, isolating the calculation to nodes that are either part of the macroscale or the microscale. Points are color-coded according to the evolutionary domain; points with dark outlines are associated with micro-nodes that have been grouped into a macro-node (macroscale), while points with light outlines have not been grouped into a macro-node (microscale). Nodes at the microscale contribute less to the overall resilience of a given network (0.331) compared to nodes that contribute to macro-nodes (0.543) on average (t-test, p < 10⁻¹⁰). Note: plotted are the microscale and macroscale resilience values for each interactome in the dataset; the difference in resilience across scales holds even when only including species with more than 10, 100, or 1000 citations.

Our hypothesis is that biological networks deal with this "certainty paradox" by maintaining uncertainty at their microscale. This gives a pool of noise and degeneracy, leading to resilience. Meanwhile, at the macroscale, the networks can develop a high effectiveness, wherein sets of proteins deterministically and non-degenerately interact. To explore this hypothesis, we compare the network's resilience to removing micro-nodes that are members of subgraphs grouped into macro-nodes against the network's resilience to removing micro-nodes that remain ungrouped (shown in Fig. 2.9).

By isolating the calculation of network resilience to only the micro- or macro-nodes of a network, we see a stark trend emerge wherein nodes inside highly informative macro-nodes are more resilient than nodes outside. That is, nodes in the original interactome that were grouped into a macro-node contribute more to the overall resilience of the interactome. This not only supports our hypothesis that biological networks resolve the "certainty paradox" by building multi-scale structure, but also provides further explanation and contextualization for the recent findings of increasing resilience across evolutionary time [194].

2.5.5 Discussion

In this follow-up work, we analyzed how the informativeness of protein interactomes changed over evolutionary time. Specifically, we made use of the effective information to analyze the amount of uncertainty (or noise) in the connectivity of protein interactomes. We found that the effectiveness (the normalized EI) of protein interactomes decreased over evolutionary time, indicating that uncertainty in the connectivity of the interactomes was increasing. However, we discovered that this was due to eukaryotic protein interactomes possessing informative higher scales, such that they had more EI when recast as a coarse-grained network—a phenomenon known as causal emergence. This lower effectiveness and higher causal emergence in eukaryotic species was due to the indeterminism and degeneracy in the network structure of their protein-protein interactions.

We used a dataset from the STRING database [168; 169] that spans more than 1800 species (1,539 Bacteria, 111 Archaea, and 190 Eukaryota), which has been shown to have considerable advantages compared to previous collections of protein interactomes [194]. However, we cannot rule out the possibility that biases might exist in the specific manner of data collection, such as the under-representation of specific types of difficult-to-detect interactions, which could potentially introduce errors in the calculations of effectiveness in eukaryotic interactomes. As such, we conducted a series of statistical robustness tests that accounted for potential biases in both the data collection and network structures of interactomes in our dataset (see Fig. 2.10 for further details about these statistical tests). In short, the results we observed in this study cannot be explained by two plausible sources of bias: 1) random rewiring of network edges does not produce similar results, and 2) network null models of each interactome in this study produce only a fraction of the observed causal emergence in our dataset (the maximum causal emergence value for a species' network null model only reached 3% of the causal emergence of the original interactome). Notwithstanding these statistical tests, as technology and methods continue to improve, these results and hypotheses must be rigorously re-tested.

To analyze why macroscales of biological networks evolved, we calculated how resilience differed for nodes inside of or outside of macro-nodes. We found that the resilience of nodes left outside the macroscale was far lower, on average, than the resilience of nodes grouped into macro-nodes. This indicates that there are benefits to having macroscales, such as increased resilience, and that systems with informative macroscales can still have a high effectiveness

but also maintain the benefits of having low effectiveness at a microscale. This is in line with the existing research showing that resilience increases with evolution [194].

These findings present evidence that biological systems are sensitive to the tradeoff between effectiveness and robustness, examined here via whether evolution brings about multi-scale structure in biological networks. Systems with a single level of function face an irresolvable paradox: uncertainty in the connections and interactions between nodes leads to resilience to attack and robustness to node failures, but this decreases the effectiveness of that network. However, multi-scale systems, defined as those with an informative higher scale, can solve this "certainty paradox" by having high uncertainty in their connectivity at the microscale while having high certainty in their connectivity at the macroscale. The tradeoff between being effective at the microscale (typical of prokaryotes, e.g. Bacteria) and being noisy at the microscale while transitioning the information to higher scales (Eukaryota) might have played a key role in evolutionary dynamics. Indeed, the drive from a prokaryotic ancestor to a eukaryotic one might have occurred based on this trade-off; however, explaining such a phenomenon is outside the scope of the current work.

While we have illuminated many of the advantages of biological macroscales and posited a functional reason for their existence as the solution to the “certainty paradox,” what are the biological mechanisms behind the evolution of multi-scale structure? We offer here a few hypotheses about biological mechanisms that are concordant with the hypothesis of multi-scale advantages in terms of having both effectiveness and robustness.

Notably, evolution can proceed both via neutral processes and in selection-based contexts. A well-known neutral process that affects interactions at the cellular scale, such as those between proteins, is pre-suppression (also termed constructive neutralism) [33]. This refers to the complexity arising in the dependencies between interacting molecules in the absence of positive selection [110]. Simply put, the likelihood of maintaining independence between partners is less than that of moving away from the original state (by accumulating changes); therefore, random changes can increase the number of interactions between proteins in a system by chance alone and result in "noisiness" in the interactions. This may offer a biological mechanism behind the observed low effectiveness of an interactome. Because Eukaryota have both a larger number of proteins and a higher substitution rate than Bacteria [194], eukaryotic interactomes might be expected to undergo a greater number of neutral processes, all of which would combine to make interaction networks noisier and less effective. One hypothesis is that neutral evolution specifically drives the noise at the microscale but not the macroscale. At the macroscale, interactomes would be trimmed and evolved under evolutionary constraints and selective pressures [93], which would eventually reinforce beneficial relationships, thinning out those that can cause negative effects on survival or growth [33]. These processes may lead to the formation of sub-groups of proteins in the network with more and stronger interactions within the group compared to fewer or weaker interactions between those in different subgroups [33; 113]—thereby leading to the emergence of modular, macroscale structures in these networks, which we hypothesize to be correlated with organismal function [3].

Another possible biological mechanism behind the observed decrease in effectiveness is that prokaryotes are more metabolically diverse than eukaryotes, possessing more metabolic processing pathways [36]. Together with changed usage patterns (such as carbon catabolite repression in Bacteria), this specificity of metabolite processing reduces energy demand and allows for more effective usage of resources [73]. These processes would make biochemical inputs and outputs more streamlined and efficient in prokaryotes, which in turn should increase the effectiveness of their protein interactomes, given energy and genomic size constraints [70]. In contrast, Eukaryota, as a group, are less constrained by energy than prokaryotes [104] but must contend with a constrained number of metabolites, channelizing them to perform cellular functions in morphologically more complex environments [36; 104]. Eukaryotic cells are about three orders of magnitude larger than prokaryotes [104], requiring more and different sets of controls and organizational processes. Prokaryotes depend on free diffusion for intracellular transport, whereas Eukaryota have elaborate mechanisms for targeted transfers [47]. This reliance on cellular transport mechanisms can lead to more modular (and thus more degenerate or indeterministic) structure in protein interactomes and other intracellular entities, which, as we show here, can be associated with less noise at higher scales of interaction. These higher-scale inter-module transfer mechanisms ensure the proper and less noisy flow of important molecules among these modules (such as protein or metabolite transport among organelles) [3]. Each of these larger-scale processes, such as transport among organelles, relies on only a handful of inputs and outputs from outside its module, as compared to the much more diverse interactions within the modules themselves [3], which arise due to both functional and neutral processes. In terms of networks, this hierarchical organizational structure is apt to lead to a higher network effectiveness score at the module/process scale compared to the microscale.

Such mechanistic biological explanations for why we might observe these differences in effectiveness are in line with the theoretical reasoning that biological systems need to resolve the paradox they face at individual scales and therefore construct multi-scale structure. We seek to tie the "certainty paradox" directly to the notion of scale in biological systems and provide a means for researchers to reduce the "black box" nature of these systems by searching across scales for models with low uncertainty. Understanding the mechanics of information transfer and noise in biological systems, and how they affect functionality, remains a major challenge in biology today. One can imagine that the drive from unicellular to multicellular life was based on some form of similar trade-offs, like those between prokaryotes and eukaryotes, that allowed multicellular life to operate via effective macro-states while reserving a pool of noise and degeneracy. Thus, understanding the information structure of these interactomes lends us an eye into the inner workings of long-term evolutionary processes and trade-offs that might have resulted in the two biggest phenotypic splits in evolutionary history—that of prokaryotic and eukaryotic cells, and of unicellular and multicellular life. We hope this framework is applied to other interactomes and other biological networks, such as gene regulatory networks or even functional brain networks, to examine how uncertainty plays a role in robustness, how informative higher scales change across evolution, and what fundamental tradeoffs biological systems face.

2.5.6 Protein interactome data

Protein interactomes are models of intracellular activity, often based on high-throughput experiments [151; 145]. Here, protein interactomes formed from a curated set of high-quality interactions between proteins (protein-to-protein interactions, or PPIs) are taken from the STRING database [168; 169], the curation of which is outlined in [194]. In this curation, the STRING database (Search Tool for the Retrieval of Interacting Genes/Proteins, found at http://string-db.org) is used to derive a protein interactome for each species. Each PPI in a protein interactome is an undirected edge, where the edges are based on experimentally documented physical interactions in the species itself or on human expert-curated interactions (e.g., no interactions are based on text-mining or associations). The dataset is curated to only include interactions derived from direct biophysical protein-protein interactions, metabolic pathway interactions, regulatory protein-DNA interactions, and kinase-substrate interactions. Details on the curation of these interactomes are found in [194].

The evolutionary history of the set of PPIs was obtained by [194] and is derived from a high-resolution phylogenetic tree [88]. The tree is composed of Archaea, Bacteria, and Eukaryota and captures a diversity of species in each lineage. The phylogenetic tree is used to characterize the evolution of each species based on the total branch length (which takes the form of nucleotide substitutions per site) from the root of the tree to the leaf of the species. The phylogenetic taxonomy, the names of species, and the lineages of each species were taken from the NCBI Taxonomy database [62]. Details of how this is associated with each species can be found at http://snap.stanford.edu/tree-of-life, and we refer to [194] for further specifics on how each species was assigned an average nucleotide substitution rate. Ultimately, these protein interactomes are incomplete models that may change as time goes on. Because we do not wish to bias our results, our statistical analyses were performed only over the interactomes of species based on more than 1000 citations in the literature.

2.5.7 Robustness of causal emergence

To ensure that the differences observed in the causal emergence values of the protein-protein interaction networks were not merely a statistical artefact, we conducted a series of robustness tests of our analysis. These tests were necessary for two key reasons. First, interaction data in biology are inherently difficult to obtain. While many of the tools we use to collect, clean, and interpret data on biological systems are sophisticated, they are nonetheless subject to potential biases. If there were systematic biases in the network construction process for the protein interactomes used in this study (for example, if the interaction networks of eukaryotic species systematically over-estimated certain interactions), randomization procedures should clarify the extent to which the results we observed are truly a property of the species themselves.

Second, these robustness tests offer insights into whether there is anything intrinsic to the network structures of the eukaryotic or prokaryotic species that could be contributing to their


causal emergence values. For example, the protein interaction network of the eukaryote Rattus norvegicus (the common sewer rat) has a certain amount of causal emergence. Would an arbitrary, simulated network with the same number of nodes and edges, connected randomly, also have a similar amount of causal emergence? By performing a series of robustness tests on the protein interaction networks in our study, we can get closer to answering whether or not there is anything intrinsic to the protein interaction network of Rattus norvegicus, or any other species, that makes it particularly prone to displaying higher-scale informative structures.

Figure 2.10: Statistical controls and network robustness tests. (A) As a greater fraction of links are randomly rewired, the resulting networks' causal emergence decreases (normalized by the causal emergence value of the original network, such that a causal emergence of 1.0 corresponds to the original network's value). This decrease is independent of evolutionary domain, network size, density, or other network properties. (B) A second statistical control, known as a soft configuration model, assesses whether there is anything intrinsic to the network's degree distribution that could be driving a given result. Here, we divide the average causal emergence of 10 such configuration model networks by the causal emergence values of the original protein interactome and observe that the null model networks preserve only a small fraction of the original amount of information gain (at most, the configuration models may show 3% of the original causal emergence).

To address the two concerns above, we performed two separate but similar robustness tests. The first uses a network null model known as the configuration model in order to randomize the connectivity of the protein interactomes while also preserving the number of nodes, edges, and distribution of node degree [69]. The second robustness test involves random edge rewiring [96]. For each network in our study, we iteratively increased the fraction of random edges to rewire in the network; an edge, eij, that connects nodes vi and vj, becomes re-connected to a new node, vk, forming a new edge, eik, instead of the original eij. We do


this with an iteratively-increasing fraction of edges, starting with 1% of edges and increasing until 100% of the network's edges are rewired.

Figure 2.11: Modularity distribution of Prokaryota and Eukaryota. Modularity of the community partitions of the species' interactomes (using the Girvan-Newman community detection method [71]). The two distributions are not significantly different from one another (Student's t-test, comparison of means p = 0.167).
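The following sketch illustrates this rewiring control, assuming some causal_emergence(G) function (e.g., from the einet package); as described above, the rewiring step keeps one endpoint of each selected edge fixed.

```python
import random
import networkx as nx

def rewire_fraction(G, frac, seed=None):
    """Reattach a fraction `frac` of edges e_ij to a random new node v_k."""
    rng = random.Random(seed)
    H = G.copy()
    edges = list(H.edges())
    nodes = list(H.nodes())
    for (vi, vj) in rng.sample(edges, int(frac * len(edges))):
        vk = rng.choice(nodes)
        if vk not in (vi, vj) and not H.has_edge(vi, vk):
            H.remove_edge(vi, vj)
            H.add_edge(vi, vk)           # the edge e_ij becomes e_ik
    return H

# sweep from 1% to 100% rewired, normalizing by the original value:
# base = causal_emergence(G)
# curve = [causal_emergence(rewire_fraction(G, f / 100)) / base
#          for f in range(1, 101)]
```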

If the causal emergence values of the networks in this study decrease following the robustness tests above—and in particular if they decrease differently for Eukaryota and Prokaryota—then the differences we observe are unlikely to have arisen simply from chance, noisy/biased data, or otherwise coincidental, ad hoc network properties. Instead, these tests of the robustness of our analysis lend credence to the main finding of this paper, which is that species that emerged later in evolutionary time are associated with more informative macroscale protein interaction networks.

In Fig. 2.10A, we show how the causal emergence of Archaea, Bacteria, and Eukaryota interactomes all decreases as a greater and greater fraction of network edges is rewired, indicating that random rewiring has a similar effect on all datasets. This analysis suggests that if there were significant noise in the network data itself (i.e., connections between proteins where there otherwise should not be, or a lack of connections where there should be), we should not expect to see the magnitude of causal emergence values that we indeed do see. This adds evidence that the inherent noise in the data collection process is not sufficient to produce the results we see.

In Fig. 2.10B, we show that random null models of the networks used in this study are

characteristically unlikely to have causal emergence values that are at all similar to those of the original interactomes. On the contrary, the maximum average causal emergence value for any of the null-model networks used here reaches only 3% of the original network's value. This suggests not only that random null models of networks are less likely to contain higher-scale structure, but also that the observed differences in the causal emergence values of prokaryotic and eukaryotic species are unlikely to be driven merely by basic properties like edge density or degree distribution.

While it is impossible to exhaust all possible sources of bias or confounding variables in biological networks, the two statistical controls performed here get us closer to validating the hypothesis underlying this work: that evolution brings about informative higher scales in protein networks.

Acknowledgements: A very special thanks to Conor Heins, Harrison Hartle, and Alessandro Vespignani for their insights about notation and formalism of effective information.

Data and software availability: All data used in this work were retrieved from the Konect Database [102] and the Network Repository [149], both of which are publicly available. Software for calculating EI in networks and for finding causal emergence in networks is available by request or at https://github.com/jkbren/einet.

Broader impact: Associated manuscripts and related work

Most of the work in this chapter has been published or preprinted; its primary contribution so far has been in the context of brain networks and biological networks, which are characterized by noisy interactions at the microscale (neurons, genes, etc.) leading to higher-scale behaviors.

Personal research and collaboration

Published: Klein, B. & Hoel, E. (2020). The emergence of informative higher scales in complex networks. Complexity. 8932526, 12 pages. doi: 10.1155/2020/8932526.

Under revision: Griebenow, R., Klein, B., & Hoel, E. (under revision, Journal of Physics: Complexity). Finding the right scale of a network: Efficient identification of causal emergence in preferential attachment networks through spectral clustering. arXiv: 1908.07565.

Hoel, E., Klein, B., Swain, A., Griebenow, R., & Levin, M. (under revision, Integrative Biology). Evolution leads to emergence: An analysis of protein interactomes

across the tree of life. bioRxiv: 10.1101/2020.05.03.074419v1.

Klein, B., Holmér, L., Smith, K., Johnson, M., Swain, A., Stolp, L., Teufel, A., & Kleppe, A. (under revision, Communications Biology). Resilience and evolvability of protein-protein interaction networks. bioRxiv: 10.1101/2020.07.02.184325v1.

Open software and replication materials: Byrum, T., Swain, A., Klein, B., & Fagan, W. (2020). einet: Effective Information and Causal Emergence. R package version 0.1.0. CRAN.R-project.org/package=einet.

einet: https://github.com/jkbren/einet

Grant applications (received): Templeton Foundation: Toward a teleology of complex networks. Klein, B. (Co-I), Vespignani, A. (PI), & Scarpino, S.V. (Co-I); December, 2020 – November, 2023.

Selected related research building on Klein & Hoel (2020)

Consciousness studies and neurobiology: Chang, A., Biehl, M., Yu, Y., & Kanai, R. (2020). Information closure theory of consciousness. Frontiers in Psychology, 11, 1504. doi: 10.3389/fpsyg.2020.01504.

Varley, T.F., Denny, V., Sporns, O., & Patania, A. (2020). Topological analysis of differential effects of ketamine and propofol anesthesia on brain dynamics. bioRxiv: 10.1101/2020.04.04.025437v1

Safron, A. (2020). Integrated World Modeling Theory (IWMT): Towards reverse engineering consciousness with the Free Energy Principle and active inference. preprint, 10.31234/osf.io/paz5j.

Turkheimer, F.E., Rosas, F.E., Dipasquale, O., Martins, D., Fagerholm, E.D., Expert, P., Vasa, F., Lord, L., & Leech, R. (2020). A complex systems perspective on neuroimaging studies of behaviour and its disorders. preprint, 2020080654.

Theoretical biology: Hoel, E. & Levin, M. (2020). Emergence of informative higher scales in biological systems: A computational toolkit for optimal prediction and control. Communicative & Integrative Biology, 13:1, 108-118, doi: 10.1080/19420889.2020.1802914.

Information theory and machine learning: Rosas, F. E., Mediano, P. A., Jensen, H. J., Seth, A. K., Barrett, A. B., Carhart-Harris, R. L., & Bor, D. (2020). Reconciling emergences: An information-theoretic approach

to identify causal emergence in multivariate data. arXiv: 2004.08220.

Chvykov, P. & Hoel, E. (2020). Causal geometry. arXiv: 2010.09390.

Mattsson, S., Michaud, E. J., & Hoel, E. (2020). Examining the causal structures of deep neural networks using information theory. arXiv: 2010.13871.

Chapter 3

Comparing: The within-ensemble graph distance

Summary: Quantifying the differences between networks is a challenging and ever-present problem in Network Science. In recent years many diverse, ad hoc solutions to this problem have been introduced. Here we propose that simple and well-understood ensembles of random networks are natural benchmarks for network comparison methods. We show that the expected distance between two networks independently sampled from a generative model is a useful property that encapsulates many key features of that model. To illustrate our results, we calculate this within-ensemble graph distance and related quantities for classic network models (and several parameterizations thereof) using 20 distance measures commonly used to compare graphs. The within-ensemble graph distance provides a new framework for developers of graph distances to better understand their creations and for practitioners to better choose an appropriate tool for their particular task.

3.1 Introduction

Quantifying the extent to which two finite graphs structurally differ from one another is a common, important problem in the study of networks. We see attempts to quantify the dissimilarity of graphs in both theoretical and applied contexts, ranging from the comparison of social networks [25; 99; 177], to time-evolving networks [6; 54; 115; 175; 126], biological networks [54], power grids and infrastructure networks [155], object recognition [189], video indexing [34], and much more. Together, these network comparison studies all seek to define a notion of dissimilarity or distance between two networks and to then use such a measure to gain insights about the networks in question.

However, it is often unclear which network features a given graph distance will or will not capture. For this reason, rigorous benchmarks must be established in order to better

understand the tendencies and biases of these distances. We adopt the perspective that ensembles are the appropriate tool to achieve this task. Specifically, by sampling pairs of graphs from within a given random ensemble with the same parameterization and measuring the graph distance between them, we create a benchmark that allows us to better understand the sensitivity of a given graph distance to known statistical features of an ensemble. Ultimately, a good benchmark would characterize the behavior of graph distances between graphs sampled both from within an ensemble and between different ensembles. We tackle the former in this paper, noting a rich diversity of behaviors among commonly used graph distance measures. Even though this work focuses on within-ensemble graph distances, these results guide our understanding of how any two sets of networks structurally differ from each other, regardless of whether those sets are generated by the same random ensemble or another network-generating process. Put simply, the approach introduced in this work is general and can be used to develop a number of graph distance benchmarks.

There are many approaches used to quantify the dissimilarity between two graphs, and we highlight 20 different ones here. Given the large number of algorithms considered in this work, we find it useful to systematically characterize each of these measures. We do so by breaking them down into “description-distance” pairs. That is, every graph distance measure can be thought of as 1) computing some description or property of two graphs and 2) quantifying the difference between those descriptions using some distance metric.

3.1.1 Formalism of graph distances

Graph descriptors

Definition 1 A graph description Ψ is a mapping from a set of graphs G to a space D,

Ψ : G → D.     (3.1)

The set G is that of all finite labeled simple graphs, and the space D is known as the graph descriptor space. Typically, D is R^(l×m) for integers l, m, or is a space of probability distributions. Given a description Ψ, the descriptor of graph G, denoted ψG, is the element of D to which G is mapped; ψG = Ψ(G).

Descriptor distances

Definition 2 A distance maps a pair of descriptors to a nonnegative real value,

d : D × D → R+     (3.2)

and satisfies the following properties for all x, y ∈ D:

1. d(x, y) = d(y, x) (Symmetry)

2. d(x, x) = 0 (Identity Law)

The properties listed in this definition are general; they do not restrict the large space of measures we might use, while still providing a clean separation between how we choose to describe graphs and how we calculate the differences between those descriptions. A common property when considering distance measures is the triangle inequality; however, we have not included it in the list above, as not all commonly used graph distances obey this property [24]. As in the case of pseudometrics, d(x, y) = 0 does not always imply x = y [175].¹

Graph distances

Definition 3 Given a set of graphs M ⊆ G, a graph description Ψ, its descriptor space D, and a distance d on D, the associated graph distance measure D : M × M → R+ is a function defined by

D(G, G′) = d(ψG, ψG′).     (3.3)

Every graph distance quantifies some notion of dissimilarity between two graphs.²
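As a toy instance of Definitions 1-3, the sketch below uses the degree distribution as the description Ψ and the Jensen-Shannon distance as d; this particular pairing is only illustrative (the measures characterized in this chapter are those implemented in netrd).

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import jensenshannon

def degree_descriptor(G, kmax):
    """Psi(G): the empirical degree distribution, padded out to degree kmax."""
    hist = np.bincount([d for _, d in G.degree()], minlength=kmax + 1)
    return hist / hist.sum()

def degree_js_distance(G1, G2):
    """D(G, G') = d(psi_G, psi_G') with d = Jensen-Shannon distance."""
    kmax = max(max(d for _, d in G.degree()) for G in (G1, G2))
    return jensenshannon(degree_descriptor(G1, kmax),
                         degree_descriptor(G2, kmax), base=2)

# symmetric, and zero for identical descriptors; note this is a
# pseudometric: two graphs with the same degree sequence are at distance 0.
# degree_js_distance(nx.gnp_random_graph(100, 0.1),
#                    nx.gnp_random_graph(100, 0.1))
```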

Network spaces

Definition 4 Given a distance d and description Ψ on descriptor space D, and a set of graphs M ⊆ G, the associated network space, denoted (d, Ψ, M), is the set of descriptors mapped to by Ψ from graphs in M, equipped with d as a distance measure.

The network space (d, Ψ, M) consists of |M| points in D, namely {ψG}_{G∈M} ⊆ D, giving rise to |M|(|M| + 1)/2 distance values, one for each pair of descriptions of elements of M. Fundamental questions naturally arise. Does a network space capture known properties of a given ensemble of graphs? This question we can begin to answer by considering sets of graphs with known properties: i.e., random graph models.

¹For example, two cospectral but non-identical graphs would have distance zero according to any spectral distance measure.

²Throughout this paper, we use the term "graph distance" or "distance" to refer to a dissimilarity measure between two graphs satisfying the properties we detail in Section 3.1.1. This language is somewhat imprecise from a mathematical perspective; many graph distances do not meet all the criteria of distance metrics. We have chosen to keep the term "graph distance" at the cost of some informality to maintain consistency with much of the existing literature we draw upon.

63 Models

Definition 5 A model Mα is a process which generates a probability distribution Pα over a set of graphs M ⊆ G, where α is a vector of parameters needed by the model to generate the distribution.

Models are typically stochastic processes that take some parameters as inputs and generate sets of graphs. The probability distribution of model Mα is then defined over the set of graphs that have non-zero probability of being generated given the model and its parameters α. For many well-known models, we have a deep understanding of how the structure of sampled graphs is influenced by the parameter values. Using our knowledge of how parameters affect graph structure, we can see how well the expected features of a given model are reflected by the structure of each network space.

3.1.2 This study

Herein, we apply a variety of graph distances to pairs of independently and identically sampled networks from a variety of random network models, over a range of parameter values for each, and consider the within-ensemble distance distribution as a function of the type of graph and model parameters. While our focus is on the means of the distance distributions, we also include the standard deviations in each figure. Ultimately, we report the within-ensemble graph distances for 20 different graph distances from the software package netrd³ [119]. To our knowledge, this is the largest systematic comparison of graph distances to date.
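A sketch of the basic experiment might look like the following, using the netrd package cited above; the class and method names follow netrd's distance API, in which each measure exposes a dist(G1, G2) method, and the two measures chosen here are arbitrary examples.

```python
import numpy as np
import networkx as nx
import netrd

def within_ensemble_distance(sampler, distance, n_pairs=100):
    """Mean and std of D(G, G') over i.i.d. pairs from one ensemble."""
    draws = [distance.dist(sampler(), sampler()) for _ in range(n_pairs)]
    return np.mean(draws), np.std(draws)

# e.g., G(n, <k>) with n = 500 and <k> = 4, under two different measures:
sampler = lambda: nx.gnp_random_graph(500, 4 / 500)
for distance in [netrd.distance.PortraitDivergence(),
                 netrd.distance.JaccardDistance()]:
    mu, sigma = within_ensemble_distance(sampler, distance, n_pairs=20)
    print(type(distance).__name__, round(mu, 4), round(sigma, 4))
```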

3.2 Methods

3.2.1 Ensembles

We study the behavior of (d, Ψ, M) for sets of graphs sampled from Mα under a variety of parameterizations. There are many graph ensembles that one could use to compute within-ensemble graph distances, and we begin by focusing on two broad classes: ensembles that produce graphs with homogeneous degree distributions and those that produce graphs with heterogeneous degree distributions. In total, we study the within-ensemble graph distance for five different ensembles.

³Note: this software package includes several more distances that were not included in these analyses, and as it is an open-source project, we anticipate that it will be updated with new distance measures as they continue to be developed.

Erdős-Rényi random graphs

Graphs sampled from the Erdős-Rényi model (ER), also known as G(n,p), have (undirected) edges among n nodes, with each pair being connected with probability p [61; 27]. This model is commonly used as a benchmark or a null model to compare with observed properties of real-world network data from nature and society. In our case, it allows us to explore the behavior of graph distance measures on dense and homogeneous graphs without any structure. In fact, this model maximizes entropy subject to a global constraint on expected edge density, p.

One well-studied construction of this ensemble is when p = ⟨k⟩/n, in which n nodes are connected uniformly at random such that nodes in the resulting graph have an average degree of ⟨k⟩. This ensemble is particularly useful for identifying which graph distance measures are able to capture key structural transitions that happen as the average degree increases. For convenience, we will refer to this ensemble as G(n,⟨k⟩).

Random geometric graphs

We work with random geometric graphs of n nodes and edge density p, generated by sprinkling n coordinates uniformly onto a one-dimensional ring of circumference 1 and connecting all pairs of nodes whose coordinate distance (arc length) is less than or equal to p/2. Compared to G(n,p), this model produces graphs that have a high average local clustering coefficient, which is a property commonly found in real network data. Note that setting the connection distance to p/2 means that p parameterizes the edge density exactly as in G(n,p) [48; 139].
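A minimal construction of this one-dimensional ring RGG might look as follows (our own sketch; netrd is not involved here):

```python
import numpy as np
import networkx as nx

def ring_rgg(n, p, rng=None):
    """Random geometric graph on a ring of circumference 1: connect every
    pair of nodes whose arc-length distance is at most p/2, so that p is
    also the expected edge density."""
    rng = rng or np.random.default_rng()
    coords = rng.uniform(0, 1, size=n)  # node positions on the ring
    G = nx.empty_graph(n)
    for i in range(n):
        for j in range(i + 1, n):
            d = abs(coords[i] - coords[j])
            if min(d, 1 - d) <= p / 2:  # shorter of the two arcs
                G.add_edge(i, j)
    return G
```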

Watts-Strogatz graphs

Watts-Strogatz (WS) graphs allow us to study the effects that random, long-range connections have on otherwise large-world regular lattices. A WS graph is initialized as a one-dimensional regular ring lattice, parameterized by the number of nodes n and the even-integer degree of every node ⟨k⟩ (each node connects to the ⟨k⟩/2 closest other nodes on either side). Each edge in the network is then randomly rewired with probability p_r, which generates graphs with both relatively high average clustering and relatively short average path lengths for a wide range of p_r ∈ (0, 1) [184].
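This generator ships with networkx; a short usage example:

```python
import networkx as nx

# n = 500 nodes, each initially wired to its k = 8 nearest ring neighbors,
# with each edge rewired with probability p_r = 0.01.
G = nx.watts_strogatz_graph(n=500, k=8, p=0.01)
```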

(Soft) Configuration model with power-law degree distribution

We generate expected degree sequences from distributions with power-law tails with a mean of ⟨k⟩. We construct an instance of a "soft" configuration model, the maximum entropy network ensemble with a given sequence of expected degrees, by connecting node pairs with probabilities determined via the method of Lagrange multipliers [135; 69; 44]. Through this method, we are able to construct networks with a tunable degree exponent, γ. The degree exponents that we test range from those that skew the distribution heavily, resulting in a

highly heterogeneous, ultra-small-world network (γ ∈ (2, 3)), to those that generate more homogeneous networks (γ > 3). In contrast to the homogeneous ensembles we tested, all of which have homogeneous degree distributions, the requirement of heterogeneity in these graphs constrains the possible edge densities to be vanishingly small. Otherwise, in the high edge density regime, degrees cannot fluctuate to appreciably larger-than-average values, and we have a natural degree scale imposed by the network size.
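As a rough stand-in for this construction (a Chung–Lu-style sketch of our own, not the exact Lagrange-multiplier method used in the chapter), one can sample expected degrees from a Pareto tail and connect pairs proportionally to their degree product:

```python
import numpy as np
import networkx as nx

def soft_cm_powerlaw(n, gamma, avg_k, rng=None):
    """Approximate soft configuration model with power-law expected degrees.
    numpy's pareto(a) has tail exponent a + 1, so we set a = gamma - 1."""
    rng = rng or np.random.default_rng()
    kappa = 1.0 + rng.pareto(gamma - 1, size=n)  # expected degrees, >= 1
    kappa *= avg_k / kappa.mean()                # rescale to mean <k>
    G = nx.empty_graph(n)
    norm = kappa.sum()
    for i in range(n):
        for j in range(i + 1, n):
            # Chung-Lu connection probability ~ kappa_i * kappa_j / sum(kappa)
            if rng.random() < min(1.0, kappa[i] * kappa[j] / norm):
                G.add_edge(i, j)
    return G

G = soft_cm_powerlaw(n=1000, gamma=2.5, avg_k=12)
```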

Nonlinear preferential attachment

The final ensemble of networks included here is grown under a degree-based nonlinear preferential attachment mechanism [14; 2; 101]. A network of n nodes is grown as follows: each new node is added to the network sequentially, connecting its m edges to nodes v_i ∈ V already in the network with probability Π_i = k_i^α / Σ_j k_j^α, where k_i is the degree of node v_i and α modulates the probability that a given node already in the network will collect new edges. When α = 1, this model generates networks with a power-law degree distribution (with degree exponent γ = 3), and a condensation regime emerges as n → ∞ when α > 2, producing a star network with O(n) nodes all connected to a main hub node [101].
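The growth mechanism itself is only a few lines; the following is a hedged sketch (our own implementation, seeded with a small complete graph so that every node has nonzero degree):

```python
import numpy as np
import networkx as nx

def nonlinear_pa(n, m, alpha, rng=None):
    """Grow a network under nonlinear preferential attachment: each new node
    attaches its m edges to existing nodes v_i with probability
    k_i**alpha / sum_j k_j**alpha."""
    rng = rng or np.random.default_rng()
    G = nx.complete_graph(m + 1)  # seed graph; all degrees start at m
    for new_node in range(m + 1, n):
        nodes = np.array(G.nodes())
        weights = np.array([G.degree(v) for v in nodes], dtype=float) ** alpha
        targets = rng.choice(nodes, size=m, replace=False, p=weights / weights.sum())
        G.add_edges_from((new_node, int(t)) for t in targets)
    return G

G = nonlinear_pa(n=500, m=2, alpha=1.0)  # alpha = 1 recovers linear (BA-like) growth
```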

3.2.2 Graph distance measures

The study of network similarity and graph distance has yielded many approaches for comparing two graphs [54]. Typically, these methods involve comparing simple descriptors based on either aggregate statistical properties of two graphs, such as their degree or average path length distributions [6], or intrinsic spectral properties of the two graphs, such as the eigenvalues of their adjacency matrices or of other matrix representations [94]. The distances applied to these descriptors also tend to fall into two broad categories: either classic norms or distances based on statistical divergence. While different approaches are better suited to capturing differences between certain types of graphs, they are nonetheless expected to share several properties.

The simplest graph distances aggregate element-wise comparisons between the adjacency matrices of two graphs [72; 91; 76; 68], and extensions thereof [183]; these methods depend explicitly on the node labeling scheme (and hence are not invariant under graph isomorphism [41]), which may limit their utility when comparing graphs with unknown labels (e.g., graphs sampled from random graph ensembles, as we do here). Several measures collect empirical distributions [37] or a "signature" vector [25] from each graph and take the distance between them (using the Jensen-Shannon divergence, Canberra distance, earth mover's distance, etc.⁴), which, among other things, facilitates comparison of differently sized graphs [6; 121]. Another family of approaches compares spectral properties of certain matrices characterized

⁴From our preliminary analyses, the particular choice of metric can dramatically change the distance values, though we do not report this here. For an extensive description of distance metrics in general, see [51; 60].

by the graphs [95], such as the non-backtracking matrix [175; 122] or the Laplacian matrix [94]. The relevant spectral properties associated with these distances are invariant under graph isomorphism [178; 41]. Some graph distances have been shown to be metrics (i.e., they satisfy properties such as the triangle inequality) [24], whereas others have not. These are not exhaustive descriptions of every graph distance in use today, but they represent coarse similarities between the various methods. We summarize the 20 graph distances we consider in Table 3.1 and define them more extensively in Supplemental Information (SI) 6.2.

     Graph distance                             Label
 1   Jaccard [91]                               JAC
 2   Hamming [76]                               HAM
 3   Hamming-Ipsen-Mikhailov [95]               HIM
 4   Frobenius [72]                             FRO
 5   Polynomial dissimilarity [54]              POD
 6   Degree JSD [37]                            DJS
 7   Portrait divergence [6]                    POR
 8   Quantum spectral JSD [49]                  QJS
 9   Communicability sequence [40]              CSE
10   Graph diffusion distance [77]              GDD
11   Resistance-perturbation [126]              REP
12   NetLSD [177]                               LSD
13   Lap. spectrum; Gauss. kernel, JSD [94]     LGJ
14   Lap. spectrum; Loren. kernel, Euc. [94]    LLE
15   Ipsen-Mikhailov [89]                       IPM
16   Non-backtracking eigenvalue [175]          NBD
17   Distributional non-backtracking [122]      DNB
18   D-measure distance [155]                   DMD
19   DeltaCon [99]                              DCN
20   NetSimile [25]                             NES

Table 3.1: Graph distances. Distance measures used to systematically compare graphs in this work, as well as their abbreviated labels and their sources. Abbreviations: Lap. = Laplacian, Gauss. = Gaussian, Loren. = Lorentzian, JSD = Jensen-Shannon divergence, Euc. = Euclidean distance.

3.2.3 Description of experiments

See Table 3.2 for the full parameterization of these sampled graphs. In each experiment, we generate N = 10³ pairs of graphs for every combination of parameters. With these sampled random graphs, we measure the distance between pairs from the same parameterization of the same model, M~α, and report statistical properties of the resulting vectors of distances.

    Ensemble     Fixed parameter(s)       Key parameter
    G(n,p)       n = 500                  p ∈ {0.02, 0.06, ..., 0.98}
    RGG          n = 500                  p ∈ {0.02, 0.06, ..., 0.98}
    G(n,⟨k⟩)     n = 500                  ⟨k⟩ ∈ {10⁻⁴, ..., n}
    WS           n = 500, ⟨k⟩ = 8         p_r ∈ {10⁻⁴, ..., 10⁰}
    SCM          n = 1000, ⟨k⟩ = 12       γ ∈ {2.01, 2.06, ..., 6.01}
    PA           n = 500, ⟨k⟩ = 4         α ∈ {−5, −4.95, ..., 5}

Table 3.2: Experiment parameterization. Here we report the ensembles that were used in these experiments, as well as their parameterizations. For the G(n,⟨k⟩) and WS key parameters, we span 100 values, spaced logarithmically, between the bounds above. Parameter labels: n = network size, p = density, ⟨k⟩ = average degree, p_r = probability that an edge is randomly rewired, γ = power-law degree exponent, α = preferential attachment kernel. Note: in SI 6.2, we show how the within-ensemble graph distance changes as n increases.

In other words, our experiments consist of calculating mean within-ensemble graph distances,

$$\langle D \rangle = \sum_{G, G' \in \mathcal{G}} D(G, G')\, P_{\vec{\alpha}}(G)\, P_{\vec{\alpha}}(G'), \qquad (3.4)$$

where $P_{M,\vec{\alpha}} : \mathcal{G} \to [0, 1]$ (or $P_{\vec{\alpha}}$ when its meaning is unambiguous) is the graph probability distribution for model $M_{\vec{\alpha}}$. This is estimated by sampling $N \gg 1$ graph pairs $\{(G_i, G_i')\}_{i=1}^{N}$ and computing

$$\langle D \rangle \approx \frac{1}{N} \sum_{i=1}^{N} D(G_i, G_i'). \qquad (3.5)$$

We then study the behavior of ⟨D⟩ for various M~α. The error on the mean within-ensemble graph distance is estimated from the standard error of the mean, σ_⟨D⟩ ≈ σ_D/√N, where σ_D is the standard deviation of the within-ensemble graph distance D, also estimated by sampling. For all experiments, we used N = 10³ pairs of graphs, which is sufficient in general, as can be seen from the small standard error relative to the mean in all figures. In each plot, we also include the standard deviations σ_D of the within-ensemble graph distances, and we highlight when the standard deviation offers particularly notable insights into the behavior of certain distances.
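Equations (3.4) and (3.5) translate directly into a sampling loop. The sketch below assumes the netrd interface (distance objects exposing a dist(G1, G2) method); the specific ensemble, distance, and function name are illustrative choices of ours:

```python
import numpy as np
import networkx as nx
import netrd

def within_ensemble_distance(sampler, distance, n_pairs=1000):
    """Estimate <D> (Eq. 3.5) and its standard error sigma_D / sqrt(N) by
    drawing n_pairs i.i.d. pairs of graphs from `sampler`."""
    d = np.array([distance.dist(sampler(), sampler()) for _ in range(n_pairs)])
    return d.mean(), d.std(ddof=1) / np.sqrt(n_pairs)

sampler = lambda: nx.gnp_random_graph(500, 0.1)
mean_D, sem_D = within_ensemble_distance(sampler, netrd.distance.Hamming(), n_pairs=100)
```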

Lastly, there are several distances that assume alignment of the node labels of G and G′. Because we are sampling from random graph ensembles, the networks we study here are not node-aligned, and as such, care should be taken when interpreting the output of these graph distances. For every description of a graph distance in SI 6.2, we note if node alignment is assumed.

3.3 Results

In the following sections, we broadly describe the behavior of the mean within-ensemble graph distance (in general denoted ⟨D⟩) for the distance measures tested. The general structure of this section is motivated by critical properties of the ensembles studied here. We highlight features of the within-ensemble graph distance for two broad characterizations of networks: homogeneous and heterogeneous graph ensembles, focusing on specific ensembles within each category.

All of the main results from the experiments described below are summarized in Table 3.3, which practitioners may find especially useful when considering which tools to use for comparing networks with particular structures. When relevant, we highlight certain distance measures to emphasize interesting within-ensemble graph distance behaviors.

3.3.1 Results for homogeneous graph ensembles

Dense graph ensembles

Here, we present our results for the two models that produce homogeneous and dense graphs.

The G(n,p) model possesses three notable features that we might expect graph distance measures to recover. Note that while we might expect graph distances to recover these features, we are not asserting that every graph distance measure should capture these properties.

1. The size of the ensemble shrinks to a single isomorphism class in the limits p → 0 and p → 1, corresponding respectively to an empty and a complete graph of size n. In both limits, we might therefore expect ⟨D(M_{n,p})⟩ to go to zero for any method that considers unlabelled graphs.

2. The G(n,p) model creates ensembles of graphs and graph complements that are symmetric under the change of variable p′ = 1 − p. By definition, every graph G has a complement Ḡ such that every edge that does (or does not) exist in G does not (or does) exist in Ḡ. Therefore, for every graph in G(n,p), one can expect to find its complement occurring with the same probability in G(n,1−p). We might expect ⟨D(M_{n,p})⟩ = ⟨D(M_{n,1−p})⟩ if graph distances can capture this symmetry (see the numerical sketch following this list).

3. A density of p = 1/2 produces the G(n,p) ensemble with maximal entropy (all graph configurations have an equal probability). As a result, we might also expect ⟨D(M_{n,p})⟩ to have a global maximum at p = 1/2.
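Property 2 is easy to probe numerically; the following sketch (under the same hedged assumption about the netrd distance interface as elsewhere in this chapter) compares mean within-ensemble Hamming distances at densities p and 1 − p:

```python
import numpy as np
import networkx as nx
import netrd

def mean_hamming(n, p, n_pairs=50):
    """Mean within-ensemble Hamming distance for G(n,p)."""
    dist = netrd.distance.Hamming()
    return np.mean([dist.dist(nx.gnp_random_graph(n, p),
                              nx.gnp_random_graph(n, p))
                    for _ in range(n_pairs)])

# Complement symmetry: the two estimates should agree within sampling error.
print(mean_hamming(100, 0.2), mean_hamming(100, 0.8))
```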

The RGG model shares features 1 and 3 with the G(n,p) model, but not feature 2. Moreover, the most significant difference between the two models is that edges are not independent in the RGG model. Correlations between edges lead to local structure (i.e., higher-order structures like triangles) and to correlations in the joint degree distribution.

[Table 3.3 appears here: a matrix whose rows are the ensemble properties listed below and whose 20 columns are the distance measures JAC through NES, with each cell marking whether that distance captures the property. Properties assessed: for G(n,p), complement symmetry and the sign of the derivative of ⟨D⟩ with network size n; for RGG, a maximum at p ≈ 1/2; for G(n,⟨k⟩), detection of the giant 1-core, detection of the giant 2-core, and the sign of the derivative of ⟨D⟩ with network size n; for WS, small-world > random, path-length sensitivity, and clustering sensitivity; for SCM, a maximum at 2 < γ < 3 and monotonic decay as γ grows; for PA, heterogeneous > homogeneous, a maximum at α ≈ 0 (uniform), a maximum at α ≈ 1 (linear), and a maximum at 1 < α ≤ 2. Legend: X = captures a given property through a global maximum/minimum in its within-ensemble graph distance curve; ∼ = non-monotonic relationship between network size and within-ensemble graph distance; X* = potentially captures a given property (via local maxima in the mean or standard deviation, a change in slope, etc.); X† = monotonic decay beyond a very small value of γ (γ ≈ 2) where there is an apparent maximum (for SCM).]

Table 3.3: Summary of key within-ensemble graph distance properties for different ensembles. Each of the ensembles included in this work has characteristic properties that a within-ensemble graph distance may be able to capture. Here we consolidate these various properties into a single table that classifies whether each distance has a given property. Models considered are dense Erdős–Rényi graphs (G(n,p)), random geometric graphs (RGG), sparse Erdős–Rényi graphs (G(n,⟨k⟩)), the Watts-Strogatz model (WS), the soft configuration model with power-law degree distribution (SCM), and general preferential attachment with kernel α (PA). Clarifications: In the WS model, we look at three properties: 1) the mean within-ensemble graph distance is larger for intermediate "small-world" values of p_r than it is when p_r = 1; 2) the within-ensemble graph distance is sensitive to values of p_r where the magnitude of the slope of the L_p/L_0 curve is largest ("path length sensitivity" above); 3) the within-ensemble graph distance is sensitive to values of p_r where the magnitude of the slope of the C_p/C_0 curve is largest ("clustering sensitivity" above). In the PA model, we look at whether high, positive values of α produce greater mean within-ensemble graph distances than lower, negative values of α, and at where the maximum within-ensemble distance occurs.

We therefore do not expect distance measures focused on the degree distribution to produce exactly the same mean within-ensemble distance curve for RGG as for G(n,p). Conversely, any distance measure that does produce the exact same within-ensemble distance curve for RGG and G(n,p) either fails to account for these correlations, or the effect of these correlations on the overall distance between two graphs drawn from the ensemble is negligible. This is the case for HAM, HIM, and FRO.

Our results for homogeneous graph ensembles are shown in Figure 3.1. Only 5 out of 20 graph distances capture all of the features discussed above, namely HAM, HIM, FRO, POD, and DJS. Notably, these are some of the simplest methods considered. In fact, they include two for which theoretical predictions for ER graphs precisely match the observed results for both ER graphs and RGGs, despite no consideration of RGGs having been included in such calculations. In one such case (FRO), ER graphs and RGGs behave identically, yet there is also an n-dependence (see SI Figure 6.8).

[Figure 3.1 appears here: within-ensemble graph distances for G(n,p) and RGG (n = 500), one panel per distance measure (JAC, HAM, HIM, FRO, POD, DJS, POR, QJS, CSE, GDD, REP, LSD, LGJ, LLE, IPM, NBD, DNB, DMD, DCN, NES), plotting means and standard deviations of D(G, G′) against the density p.]

Figure 3.1: Mean and standard deviations of the within-ensemble distances for G(n,p) and RGG. By repeatedly measuring the distance between pairs of G(n,p) and RGG networks of the same size and density, we begin to see characteristic behavior in both the graph ensembles as well as the graph distance measures themselves. In each subplot, the mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots, the standard error is too small to see), while the dashed lines are the standard deviations.

Sparse graph ensembles

While the previous section highlighted dense RGG and ER networks, we now turn to the within-ensemble graph distance of sparse homogeneous graphs sampled from G(n,p), such that p = ⟨k⟩/n. In the case of sparse graphs, the edge density decays to zero in the limit n → ∞ as the mean degree ⟨k⟩ remains fixed.

[Figure 3.2 appears here: within-ensemble graph distances for G(n,⟨k⟩) (n = 500), one panel per distance measure, plotting means and standard deviations against the average degree ⟨k⟩, with ⟨k⟩ = 1 and ⟨k⟩ = 2 highlighted.]

Figure 3.2: Mean and standard deviations of the within-ensemble distances for G(n,⟨k⟩) networks. Here, we generate pairs of ER networks with a given average degree, ⟨k⟩, and measure the distance between them with each distance measure. In each subplot, we highlight ⟨k⟩ = 1 and ⟨k⟩ = 2. The mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots, the standard error is too small to see), while the dashed lines are the standard deviations.

We found it important to distinguish this sparse setting from the dense G(n,p) case because of the critical transitions that take place as ⟨k⟩ increases. For network scientists, these early transition points in sparse networks are foundational, with implications for a number of network phenomena (e.g., the occurrence of outbreaks in disease models [125]).

In fact, the presence of such critical transitions in random graph models underscores the utility of this approach for studying graph distance measures.

[Figure 3.3 appears here: within-ensemble graph distances for Watts-Strogatz networks (n = 500, k = 8), one panel per distance measure, plotting means and standard deviations against the rewiring probability p_r, alongside the L_p/L_0 and C_p/C_0 curves.]

Figure 3.3: Mean and standard deviations of the within-ensemble distances for Watts-Strogatz networks. Here, we generate pairs of Watts-Strogatz networks with a fixed size and average degree but a variable probability of rewiring random edges, p_r. In each subplot, we also plot the clustering and path-length curves as in the original Watts-Strogatz paper [184] to accentuate the "small-world" regime with high clustering and low path lengths. The mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots, the standard error is too small to see), while the dashed lines are the standard deviations.

That is, a sudden change in the within-ensemble graph distance signals abrupt changes in the probability distribution over the set of graphs in the ensemble (i.e., the emergence of novel graph structures that are markedly different from the greater population of graphs in the ensemble). This may show up as a local or global maximum in the within-ensemble graph distance near the parameter values at which the transition occurs. Conversely, if a sudden decrease in the within-ensemble graph distance is observed, then there may be a sudden disappearance or reduction of largely dissimilar graphs in the ensemble.

In the case of G(n,p) where p = ⟨k⟩/n, which we will refer to with the shorthand G(n,⟨k⟩), the following critical transitions emerge:

4. At ⟨k⟩ = 1, we see the emergence of a giant component in ER networks (likewise, a giant 2-core emerges at ⟨k⟩ = 2). We might expect, for example, a within-G(n,⟨k⟩) graph distance to have a local maximum at such values.

Ultimately, we observe that distance measures that are fundamentally associated with flow-based properties of the network (i.e., measures based on a graph's Laplacian matrix, communicability, or other properties important to diffusion, such as path-length distributions) are the ones most sensitive to this property (Figure 3.2)⁵.

What Figure 3.2 highlights, which the dense ensembles in Figure 3.1 could not, is the rich and varied behavior characteristic of sparse graphs. For example, the distance measures with maxima at p = 1/2 (HAM, HIM, FRO, POD, DJS, etc.) are still seen in Figure 3.2, but the emphasis is instead on the average degree as opposed to the edge density; given that most real-world networks are sparse [50], this view of the same parameter is especially informative.

Importantly, while the qualitative behaviors discussed here are general features of the models and distances, the quantitative value of the average within-ensemble graph distance also depends on network size. There are no specific structural transitions to discuss around this dependency, but it can be an important problem when comparing networks of different sizes without a good understanding of how network distances might behave. Interested readers can find our results in SI 6.2 where we use G(n,hki) to vary network size while keeping all other features fixed.

Small-world graphs

The final homogeneous graph ensemble studied here is the Watts-Strogatz model. This model generates networks that are initialized as ring lattices, whose edges are then randomly rewired with probability p_r. At certain values of p_r, we see two key phenomena occur:

5. "Entry" into the small-world regime: Even as the edges in the network are minimally rewired, the average path length quickly decreases relative to its initial (longer) value. This is highlighted by the blue curve in Figure 3.3, corresponding to L_p/L_0, where L_0 is the average path length before any edges have been rewired. For the parameterizations used in this study, the largest (negative) slope of this curve is at p_r ≈ 2 × 10⁻³. We might expect a within-ensemble graph distance to be sensitive to this or nearby values of p_r, as this region corresponds to changes in the graphs' common structural features.

6. "Exit" from the small-world regime: After enough edges have been rewired, the network loses whatever clustering it had from originally being a lattice, reducing to

⁵Note that the two distance measures based on the non-backtracking matrix (NBD & DNB) are undefined for graphs without a 2-core, restricting their range in Figure 3.2.

approximately the clustering of an ER graph. This is highlighted by the violet curve in Figure 3.3, corresponding to C_p/C_0, where C_0 is the average clustering before any edges have been rewired. For the parameterizations used in this study, the largest (negative) slope of this curve is at p_r ≈ 3 × 10⁻¹. Again, we might expect a within-ensemble graph distance to be sensitive to this large decrease in clustering.

Together, the above features characterize Watts-Strogatz networks. Importantly, we are interested in whether a distance measure is sensitive to these "entry" and "exit" values of p_r; "sensitive" here is deliberately broadly defined. For instance, in the case of CSE, we observe a reduction in the within-ensemble graph distance at a rate that almost exactly resembles the rate at which C_p/C_0 decays. Alternatively, a distance measure can be sensitive to these critical points by having a local maximum at or around the critical point. In the case of POR, we see that the within-ensemble graph distance is maximized at approximately the same point as the largest (negative) slope of the L_p/L_0 curve.

Here, insensitivity to these critical points is also an informative property to highlight in a distance measure. As one example, HAM appears to be otherwise unaffected by the "exit" from the small-world regime, with distances increasing steadily despite the model generating networks with dramatic structural differences.

Lastly, we ask whether the within-ensemble graph distance of random networks (i.e., when p_r → 1) is greater than that of small-world networks; this is indicated by a within-ensemble graph distance curve that is higher at p_r = 1 than between 10⁻³ < p_r < 10⁻¹ in Figure 3.3. This property holds for distance measures that depend on node labeling (e.g., JAC, HAM, HIM, FRO, POD), but also for DJS, which is intuitive, since more noise increases the variance of the degree distribution, as well as for a few puzzling distances: QJS, DCN, and the two based on the non-backtracking matrix, NBD and DNB.

3.3.2 Results for sparse heterogeneous ensembles

The sparse graph setting is much closer to that of real networks, which often also have heavy-tailed degree distributions [129]. This motivated the selection of the following two heterogeneous, sparse ensembles.

Soft configuration model: heavy-tailed degree distribution

We study these graphs using a (soft) configuration model with a power-law expected degree distribution; i.e., the expected degree κ of a node is drawn with probability proportional to κ^(−γ). From this model, we expect two important features that graph distance measures could recover:

7. For γ < 3, we know the variance of the degree distribution diverges in the limit of large graph size n [129]. Since there should be large variations in the degree sequences of two finite instances, we might also expect the graph distances to produce a maximal distance ⟨D⟩.

[Figure 3.4 appears here: within-ensemble graph distances for the soft configuration model (n = 1000, ⟨k⟩ = 12), one panel per distance measure, plotting means and standard deviations against the degree exponent γ, with γ = 3 highlighted.]

Figure 3.4: Mean and standard deviations of the within-ensemble distances for soft configuration model networks with varying degree exponent. Here, we generate pairs of networks from a (soft) configuration model, varying the degree exponent, γ, while keeping ⟨k⟩ constant (n = 1000). In each subplot, we highlight γ = 3. The mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots, the standard error is too small to see), while the dashed lines are the standard deviations.

8. We might also expect a monotonic decay in the within-ensemble graph distance as γ increases. For large γ, most expected node degrees will be approximately the average degree, making the network as a whole structurally similar to an ER graph. On the other hand, when γ is small (especially when γ ≤ 3), there is a wide diversity in the degrees of nodes within a graph, and in the expected degrees of nodes across graphs (since expected degrees are i.i.d. samples from a Pareto distribution).

Out of the 20 distances studied, most capture both of these features.

[Figure 3.5 appears here: within-ensemble graph distances for preferential attachment networks (n = 500, m = 2), one panel per distance measure, plotting means and standard deviations against the attachment kernel α, with α = 1 and α = 2 highlighted.]

Figure 3.5: Mean and standard deviations of the within-ensemble distances for preferential attachment networks. Here, we generate pairs of preferential attachment networks, varying the preferential attachment kernel, α, while keeping the size and average degree constant. As α → ∞, the networks become more and more star-like, and at α = 1, this model generates networks with power-law degree distributions. The mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots, the standard error is too small to see), while the dashed lines are the standard deviations.

Since γ tunes the degree heterogeneity (larger γ yielding more homogeneous graphs), and for the reasons outlined in feature 8 above, a reasonable expectation is that pairs of graphs on average become farther apart as γ

is decreased. This is observed for many distances, with the exceptions of QJS and REP, each of which instead exhibits a maximum at some finite value of γ > 2. Additionally, several distances (HAM, POR, NBD, and NES) appear to decay monotonically beyond some very small value of γ, below which they take slightly smaller values; this could be a finite-size effect or an artifact of other implementation details, since fluctuations become highly pronounced as γ → 2.

Only one graph distance produces completely unexpected behavior: DCN yields a ⟨D⟩ that monotonically increases with the scale exponent γ of the degree distribution, and its standard deviation is minimized when γ ≈ 3. We will expand upon this in the following section.

Nonlinear preferential attachment

The final ensemble we include here is the nonlinear preferential attachment growth model. By varying the preferential attachment kernel, parameterized by α, we can capture a range of network properties:

9. As α → −∞, this model generates networks with maximized average path lengths, whereby each new node connects its m links to the existing nodes with the smallest degrees; conversely, α → ∞ generates star-like networks [101], an effect known as condensation.

10. At α = 1 (linear preferential attachment), we see the emergence of scale-free networks [14], whereas uniform attachment (α = 0) gives each node an equal chance of receiving the incoming node's links.

When α = 1, this ensemble theoretically generates networks with power-law degree distributions (with degree exponent γ = 3 [2]), which is reminiscent of the results in Figure 3.4, where we measure the within-ensemble graph distances while varying γ.

Various mean within-ensemble distances are maximized in the range α ∈ [1, 2], which is indicative of the diversity of possible graphs that the preferential attachment mechanism can produce in this range of α. For α ≪ 0, newly arriving nodes connect primarily to the lowest-degree existing nodes (for example, leading to long chains of degree-2 nodes when m = 1), leading many distance measures to record i.i.d. pairs of graphs as similar. For α ≫ 0, new nodes tend to connect to the highest-degree existing node, leaving a star-like network; then, likewise, many graph pairs are deemed very similar. In the intermediate range (e.g., linear preferential attachment, α = 1), a much wider variety of possible graphs can arise. Thus, on average, i.i.d. pairs are (usually) measured as farthest apart in that range.

For preferential attachment networks, we again see curious behavior from DCN where, unlike most other distance measures, heterogeneous graphs with 1 ≤ α < 2 have smaller within-ensemble graph distances than more homogeneous graphs with α < 0. Upon closer examination, we know why this happens, and to conclude this section, we will walk through the anatomy of DCN and show why its behavior is often different from that of the other distance measures studied

here, especially for heterogeneous networks.

The descriptor ψ_G that DCN is based on is an affinity matrix of the graph (constructed from a belief propagation algorithm; see SI 6.2.18 for the full methodology), while the distance is calculated using the Matusita distance (similar to the Euclidean distance). The authors note that they selected this distance because they found that it gave more desirable results: "...it 'boosts' the node affinities and, therefore, detects even small changes in the graphs (other distance measures, including [Euclidean distance], suffer from high similarity scores no matter how much the graphs differ)" [99]. What the choice of the Matusita distance has apparently obscured, however, is a greater specificity for distinguishing heterogeneous networks. We know this from preliminary experiments in which the Matusita distance is swapped out for a Jensen-Shannon divergence (as in, for example, CSE); the resulting within-ensemble graph distance is maximized for heterogeneous networks (1 < α < 2). Finally, as we note in Section 3.3.1, we are not asserting that a graph distance measure should detect the unique behavior of linear preferential attachment (α = 1). Nor are we advocating that practitioners abandon the use of DCN. What we are claiming, however (and why we chose to focus on DCN in this section), is that we need useful benchmarks for understanding the effects of choosing one descriptor-distance pairing over another. Furthermore, this benchmark should be based on the within-ensemble graph distances of well-known ensembles.
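To make the role of the descriptor-distance pairing concrete, the sketch below implements the Matusita distance alongside a Jensen-Shannon divergence on generic nonnegative descriptor vectors; this illustrates the swap described above and is not the netrd implementation of DCN:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def matusita(x, y):
    """Matusita distance: sqrt( sum_i (sqrt(x_i) - sqrt(y_i))**2 )."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((np.sqrt(x) - np.sqrt(y)) ** 2))

x = np.array([0.70, 0.20, 0.10])  # toy affinity descriptors
y = np.array([0.50, 0.30, 0.20])
print(matusita(x, y))             # same descriptors, two different distances
print(jensenshannon(x, y) ** 2)   # scipy returns the square root of the JSD
```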

3.4 Discussion

Graph ensembles are core to the characterization and broader study of networks. Graphs sampled from a given ensemble will highlight certain observable features of the ensemble itself, and in this work, we have used the notion of graph distance to further characterize several commonly studied graph ensembles. The present study focused on one of the simplest quantities to construct given a distance measure and a graph ensemble, namely the mean within-ensemble distance ⟨D⟩. Note, however, that there are many other ensembles for which the present methods could be repeated, as well as more graph distance measures, and infinitely many other statistics that could be examined from the within-ensemble distance distribution. Despite examining the within-ensemble graph distances for only five different ensembles, we observed a richness and variety of behaviors among the various distance measures tested. We view this work as the starting point for more inquiries into the relationship between graph ensembles and graph distances.

One promising future direction for the study of within-ensemble graph distances is the prospect of deriving functional forms for various distance measures, as we do for JAC, HAM, and FRO in SI 6.2.20, 6.2.21, and 6.2.22. Other distance measures, such as DJS, likely admit approximate analytical expressions for certain graph ensembles.

We have here only studied the behavior of graphs within a given ensemble and parameteriza- tion, which is essentially the simplest possible choice. This leaves wide open any questions

regarding distances between graphs sampled from different ensembles, or even from two different parameterizations of the same ensemble. These will be the topic of follow-up works. Nevertheless, such follow-ups will likewise only cover a very small fraction of all possible combinations.

We hope that our approach will provide a foundation for researchers to clarify several aspects of the network comparison problem. First, we expect that practitioners will be able to use the within-ensemble graph distance in order to rule out sub-optimal distance measures that do not pick up on meaningful differences between networks in their domain of interest (e.g., what is an informative “description-distance” comparison between brain networks may not be as informative when comparing, for example, infection trees in epidemiology). Second, we expect that this work will provide a foundation for researchers looking to develop new graph distance measures (or hybrid distance measures, such as HIM) that are more appropriate for their particular application areas.

There were 20 different graph distances used in this work, and there are undoubtedly more that we have not included. Each of these measures seeks to address the same problem: quantifying the dissimilarity of pairs of networks. We see the current work as an attempt to consolidate all such methods into a coherent framework, namely, casting each distance measure as a mapping of two graphs into a common descriptor space, followed by the application of a distance measure within that space. Not only that, we also suggest that stochastic, generative graph models, because of their known structural properties and certain critical transition points in their parameter space, are the ideal tool for characterizing and benchmarking graph distance measures.

Classic random graph models can fill an important gap by providing well-understood benchmarks on which to test distance measures before using them in applications. Much like in other domains of network science, having effective and well-calibrated comparison procedures is vital, especially given the great diversity of graph ensembles under study and of networks in nature.

Acknowledgements: A very special thanks to Tina Eliassi-Rad, Dima Krioukov, Johannes Wachs, and Leo Torres for helpful comments about this work throughout.

Data and software availability: All the experiments in this paper were conducted using the netrd Python package [119]. A repository with replication materials can be found at https://github.com/jkbren/wegd.

Broader impact: Associated manuscripts and related work

The core contribution of this chapter is a manuscript published in Proceedings of the Royal Society A, but its broader impact will be in its introduction of a rigorous, formal benchmark

for quantifying the differences between pairs of networks. From this, new tools will be (and have already been) invented, and we will deepen our understanding of the measures we already use.

Personal research and collaboration

Published: Hartle, H., Klein, B., McCabe, S., Daniels, A., St-Onge, G., Murphy, C., & Hébert-Dufresne, L. (2020). Network comparison and the within-ensemble graph distance. Proceedings of the Royal Society A, 476: 20190744. Included in special feature: A Generation of Network Science. doi: 10.1098/rspa.2019.0744.

Open software and replication materials: wegd: https://github.com/jkbren/wegd

Related research building on Hartle, Klein, et al. (2020)

Daniels, A. & Hébert-Dufresne, L. (NetSci 2020). Constructing a compact metric space of network ensembles using several graph comparison measures.

Chapter 4

Reconstructing: Comparing ensembles of reconstructed networks

Summary: Across many disciplines, we analyze networks that have been reconstructed or inferred from time series data using a number of different methods. Different reconstruction techniques can output different networks, and as such, practitioners are often uncertain about whether their approach is suitable for describing the system in question. As with other tools in Network Science, it appears that no single technique is universally optimal for inferring network structure from time series data. As such, the goal of this chapter is twofold¹: 1) to describe a software package that allows researchers to compare these many techniques themselves, and 2) to introduce a novel theoretical interpretation of the network reconstruction problem, casting reconstructed networks as samples from an ensemble of graphs that is itself defined by an ensemble of "ground-truth" networks, a dynamical process that runs on those networks and produces time series data, and a reconstruction technique that turns the time series data into an estimate of the original network. This ensemble is dubbed the (𝒢, 𝒟, ℛ) ensemble, and it can be used to advance the study of network reconstruction, giving insights into biases, trends, and structural differences induced by our choice of methods used to study complex systems.

4.1 Introduction to the netrd package

Complex systems throughout nature and society are often best represented as networks. Over the last two decades, alongside the increased availability of large network datasets, we have witnessed the rapid rise of Network Science [4; 180; 130; 13]. This field is built around the idea that an increased understanding of the complex structural properties of a variety of systems will allow us to better observe, predict, and even control the behavior of these systems.

¹Note that this chapter also introduces for the first time the notion of a standardized graph distance; while this object deserves further treatment in forthcoming work, it is not the focus of this chapter.

[Figure 4.1 appears here: three panels showing a true network structure, time series data (node ID versus time), and a reconstructed network.]

Figure 4.1: Example network reconstruction from temporal data. In each example above, a complex system is modeled as having a true network structure (left), and its entities generate observable dynamics (middle) that are used to reconstruct a network structure that resembles the true network (missing edges highlighted in red). Here, each node in the system outputs what appears to be a continuous series of activity, such as a time series of sensor data from EEG recordings on the scalp, fluctuations of stock prices in a stock market, etc.

However, for many systems, the data we have access to is not a direct description of the underlying network. More and more, we see the drive to study networks that have been inferred or reconstructed from non-network data—in particular, using time series data from the nodes in a system to infer likely connections between them [32; 153]. Selecting the most appropriate technique for this task is a challenging problem in Network Science. Different reconstruction techniques usually have different assumptions, and their performance varies from system to system in the real world. One way around this problem could be to use several different reconstruction techniques and compare the resulting networks. However, network comparison is also not an easy problem, as it is not obvious how best to quantify the differences between two networks, in part because of the diversity of tools for doing so.

The netrd Python package seeks to address these two parallel problems in Network Science by providing, to our knowledge, the most extensive collection of both network reconstruction techniques and network comparison techniques (often referred to as graph distances) in a single library (https://github.com/netsiphd/netrd/). In this article, we detail the two main functionalities of the netrd package; along the way, we describe some of its other useful features. This package builds on commonly used Python packages (e.g., networkx [74], numpy [79], scipy [181]) and is already a widely used resource for network scientists and other multidisciplinary researchers. With ongoing open-source development, we expect it to continue serving researchers across disciplines for years to come.

4.1.1 Network reconstruction from time series data

Given time series data, TS, of the behavior of N nodes / components / sensors of a system over the course of L timesteps, and given the assumption that the behavior of every node, v_i, may have been influenced by the past behavior of other nodes, v_j, there are dozens of techniques that can be used to infer which connections, e_ij, are likely to exist between the nodes. That is, we can use one of many network reconstruction techniques to create a network representation, G_r, that attempts to best capture the relationships between the time series of every node in TS. netrd is a Python package that lets users perform this network reconstruction task using 17 different techniques, meaning that many different networks can be created from a single time series dataset. For example, in Figure 4.2 we show the outputs of 15 different reconstruction techniques applied to time series data generated from an example network [166; 123; 83; 159; 66; 55; 193; 53; 19; 106; 164; 138].
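A minimal usage sketch, assuming the netrd interface at the time of writing (reconstructor classes in netrd.reconstruction exposing a fit method that takes an N × L time series array and returns a networkx graph); the toy data here are random and purely illustrative:

```python
import numpy as np
from netrd.reconstruction import CorrelationMatrix

TS = np.random.rand(8, 1000)  # toy data: N = 8 nodes observed over L = 1000 steps
recon = CorrelationMatrix()   # one of the package's 17 reconstruction techniques
G_r = recon.fit(TS)           # G_r is a networkx graph inferred from TS
```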

4.1.2 Simulated network dynamics

Practitioners often apply these network reconstruction algorithms to real time series data. For example, in neuroscience, researchers often try to reconstruct functional networks from time series readouts of neural activity [123]. In economics, researchers can infer networks of influence between companies based on time series of changes in companies' stock prices [162]. At the same time, it is often quite helpful to have the freedom to simulate arbitrary time series dynamics on randomly generated networks. This provides a controlled setting in which to assess the performance of network reconstruction algorithms. For this reason, the netrd package also includes a number of different techniques for simulating dynamics on networks.
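For instance, under the same hedged reading of the netrd API (dynamics classes in netrd.dynamics exposing a simulate(G, L) method), one can pair a known ground-truth graph with simulated node activity:

```python
import networkx as nx
from netrd.dynamics import SherringtonKirkpatrickIsing

G_true = nx.karate_club_graph()           # a known ground-truth network
dynamics = SherringtonKirkpatrickIsing()  # spin dynamics, as in Figure 4.2
TS = dynamics.simulate(G_true, 2000)      # N x L array of simulated node states
```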

4.1.3 Comparing networks using graph distances

A common goal when studying networks is to describe and quantify how different two networks are. This is a challenging problem, as there are countless axes along which two networks can differ; as such, a number of graph distance measures have emerged over the years to address this problem. As is the case for many hard problems in Network Science, it can be difficult to know which (of many) measures are suited to a given setting. In netrd, we consolidate over 20 different graph distance measures into a single package [91; 76; 95; 72; 54; 37; 6; 49; 40; 77; 126; 177; 94; 89; 175; 122; 155; 99; 25]. Figure 4.3 shows an example of just how different these measures can be when comparing two networks, G_1 and G_2. This submodule of netrd has already been used in recent work offering a novel characterization of the graph distance literature [80].
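Computing any of these distances follows the same two-line pattern (again assuming netrd distance classes with a dist method):

```python
import networkx as nx
import netrd

G1 = nx.karate_club_graph()
G2 = nx.gnp_random_graph(34, 0.14)  # a random graph of the same size
D = netrd.distance.NetSimile().dist(G1, G2)
print(D)                            # a single dissimilarity score
```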

4.1.4 Related software packages

In the network reconstruction literature, there are often software repositories that detail a single technique or a few related ones. For example, Lizier (2014) implemented a Java package

[Figure 4.2 appears here: a ground-truth network, its adjacency matrix, and Sherrington-Kirkpatrick Ising time series dynamics (top row), followed by the adjacency matrices produced by 15 reconstruction techniques: Convergent Cross Mapping, Correlation Matrix, Exact Mean Field, Free Energy Minimization, Granger Causality, Graphical Lasso, Marchenko-Pastur, Maximum Likelihood Estimation, Mutual Information Matrix, Naive Mean Field, Ornstein-Uhlenbeck Inference, Partial Correlation Matrix, Regularized Correlation Matrix, Thouless-Anderson-Palmer, and Transfer Entropy.]

Figure 4.2: Example of the network reconstruction pipeline. (Top row) A sample network, its adjacency matrix, and an example time series, TS, of node-level activity simulated on the network. (Bottom rows) The outputs of 15 different network reconstruction algorithms, each using TS to create a new adjacency matrix that captures key structural properties of the original network. In Table 4.1, we list the methods used in this study, as well as other techniques collected over the course of this work.

(portable to Python, Octave, R, Julia, Clojure, and MATLAB) that uses information-theoretic approaches for inferring network structure from time series data [109]; Runge et al. (2019) created a Python package that combines linear or nonlinear conditional independence tests with a causal discovery algorithm to reconstruct causal networks from large-scale time series datasets [154]. These are two examples of powerful and widely used packages, though neither includes as wide a range of techniques as netrd (nor was either explicitly designed to). In the graph distance literature, the same trend broadly holds: many one-off software repositories exist for specific measures.

[Figure 4.3 appears here: the distance D(G_1, G_2) between two example networks, computed with all 20 netrd distance measures: 01 Jaccard, 02 Hamming, 03 HammingIpsenMikhailov, 04 Frobenius, 05 PolynomialDissimilarity, 06 DegreeDivergence, 07 PortraitDivergence, 08 QuantumJSD, 09 CommunicabilityJSD, 10 GraphDiffusion, 11 ResistancePerturbation, 12 NetLSD, 13 LaplacianSpectralGJS, 14 LaplacianSpectralEUL, 15 IpsenMikhailov, 16 NonBacktrackingSpectral, 17 DistributionalNBD, 18 DMeasure, 19 DeltaCon, 20 NetSimile.]

Figure 4.3: Example of the graph distance measures in netrd. Here, we measure the graph distance between two networks using 20 different distance measures from netrd.

However, there are some packages that do include multiple graph distances; for example, Wills (2017) created the NetComp package, which includes several variants of a few of the distance measures included here [187].

4.2 Introduction to the (𝒢, 𝒟, ℛ) ensemble

The acceleration of Network Science research over the last twenty years has been propelled by the recent availability of large network datasets [4; 180; 130; 13]. Through these network datasets, we have gained a rich understanding of the behavior and activity of a variety of systems, from societies to genomes. Much of the network data available today are accurate descriptions of the structure of these systems; they are collected by observing true connections between entities, encoding little uncertainty about the network structure itself. For example, if a researcher had friendship data from Facebook, it would not take sophisticated inference techniques to acquire the network structure; trivially, the data structure (e.g., an edgelist or an adjacency matrix) encodes this information.

However, when studying complex systems where the network structure is not obviously or immediately accessible, the adjacency matrix must instead be inferred by estimating likely connections, co-occurrence, or influence between the entities of a system. This is commonly done using time series data of a system's activity, leveraging the differences between activity at one time step and activity at another as the basis for inferring interactions between nodes. Many techniques to reconstruct networks from time series data have been developed over the last several years, drawing from information theory, machine learning, dynamical systems, and random matrix theory [12; 63; 161; 100; 157; 170; 107; 43; 78; 191; 153; 162; 32; 138; 166; 123; 83; 159; 66; 55; 193; 53; 19; 106; 164], each trying different approaches for capturing features of the underlying network that generated the time series data. Ultimately, these techniques have allowed ideas from complex systems and Network Science to spread to other disciplines.

Despite these many techniques, there appears to be little consensus about which methods to use on which types of data. In this chapter, we will not attempt to make claims about the relative performance of each algorithm we are testing. Instead, we aggregate many common network reconstruction techniques into one comparison scheme, quantifying the similarity (using tools known as graph distances) between the networks output by different reconstruction techniques.

For any given random graph ensemble 𝒢, dynamical process 𝒟, and reconstruction technique ℛ, a new ensemble of graphs naturally arises: the graphs produced by application of ℛ to the time series data generated by dynamics 𝒟 running on a graph sampled from 𝒢. While this network ensemble may appear contrived or ad hoc at first, it is in fact well suited to describe networks across many disciplines. For example, in network neuroscience, the graph ensemble 𝒢 consists of the space of possible neuronal networks, and the dynamics 𝒟 may be some variant of integrate-and-fire [90], the output of which can be observed via neuroimaging. Given a graph sampled from 𝒢 and time series generated by simulating 𝒟 on the sampled network, a wide variety of new networks can be generated, depending on the choice of reconstruction technique ℛ. Here, we will define a framework for creating and characterizing ensembles of graphs based on this triplet, (𝒢, 𝒟, ℛ). We then calculate the mean graph distance between pairs of independently sampled graphs from the same ensemble [80], as well as between networks sampled from a pair of (𝒢, 𝒟, ℛ) ensembles that differ only in which reconstruction method was used.
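Under the same hedged assumptions about the netrd API used above, one draw from a (𝒢, 𝒟, ℛ) ensemble, and the within-ensemble distance between two such draws, can be sketched as:

```python
import networkx as nx
from netrd.dynamics import SherringtonKirkpatrickIsing
from netrd.reconstruction import CorrelationMatrix
from netrd.distance import Hamming

def sample_gdr(n=50, p=0.1, L=500):
    """One draw from a (G, D, R) ensemble: sample a ground truth from G,
    run dynamics D on it, and reconstruct with R from the time series."""
    G_true = nx.gnp_random_graph(n, p)                       # G: ground-truth ensemble
    TS = SherringtonKirkpatrickIsing().simulate(G_true, L)   # D: dynamics -> time series
    return CorrelationMatrix().fit(TS)                       # R: reconstructed network

# Within-(G, D, R)-ensemble graph distance between two independent draws:
print(Hamming().dist(sample_gdr(), sample_gdr()))
```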

The ultimate aim of this article is twofold: 1) This work will serve as a reminder that the analysis of network data—especially network data that has been inferred from time series activity—must take into account the method that was used for creating the network. We encourage those who frequently use inferred or reconstructed networks to revisit the various methods across their respective fields, looking critically at whether differences in results or their interpretations could be due to the researchers’ choice of reconstruction algorithm. 2) We hope this research will spur new insights about possible latent similarities between the mathematical, algorithmic, or theoretical aspects of different network reconstruction techniques and of different network similarity measures. In doing so, this will inform the development of new, effective techniques for each.

4.2.1 Framing: A distribution of ground truths

This chapter concerns networks that have been reconstructed from time series data. To frame this chapter, it is useful to consider an apocryphal but helpful story as our guide. The framing consists of a set of scientific claims: simple assumptions that will be built upon until we can address them with more theoretical rigor. While the example used below discusses brains and brain networks specifically, the theoretical approach is intended to hold for any system for which we reconstruct networks from time series data.

The first step is to imagine the space of all possible human genotypes; these specific configurations of DNA define all the various forms that humans can take. As such, they also constrain the form that human brains can take. Tautologically put, human brains resemble human brains, and if any particular brain deviates too far from a relatively small space of possible human brains, then it either ceases to be human or ceases to be a viable brain. The structure of these human-like brains occupies a small sliver of the space of all possible brain structures. Building from discussions in Chapter 3, this space of brain-like structures can be thought of as an ensemble of brains, from which new human brains are sampled, much like a particular network can be sampled from an ensemble or generative model of networks. That is, just as a particular network, G, sampled from G(n=100, p=0.2), is a single instance of a network sampled from an ensemble of possible networks, we can describe an individual brain, B, as an instance of a brain sampled from a specifically-parameterized ensemble of brains, $\mathcal{B}(\vec{\alpha})$ (where $\vec{\alpha}$ is a vector of parameters that endow these brains with human-like properties). In summary:

Assumption 1: There is a space of all possible human brains, and we can describe this space as an ensemble of brains.

The second assumption in this story is that brains sampled from $\mathcal{B}(\vec{\alpha})$ can be represented as networks, as they are intrinsically made up of interconnected parts where activity in one part can influence the activity in another. Put another way, we assume that we can represent a given brain, B, as a brain network, G, and for shorthand, we can describe the space of all possible human brains, $\mathcal{B}(\vec{\alpha})$, as an ensemble of human brain networks, $\mathcal{G}(\vec{\alpha})$. In summary:

Assumption 2: Brains can be usefully represented as networks, and as such, we can describe these networks as being sampled from an ensemble of possible brain networks.

The third assumption is that mere structure is insufficient to produce human-like behavior; a crucial ingredient in functional brains is the dynamics at play. On its face, this is obvious; living brains are characterized by brain activity. The specific ways that neurons in human brains interact with one another, however, constrain the space of possible actions that can be performed in a brain.

Assumption 3: There exists a space of all possible dynamics that can take place in brains; these dynamics are constrained by the structure of the brain itself.

The last point to this framing is the following:

Assumption 4: We do not know exactly what ensemble of brain networks a given brain is sampled from, nor do we know the specific dynamical processes governing brain activity; we can (and do) build models to approximate both.

These four assumptions may seem tedious, and surely human neuroscience is not this simple. However, moving forward, they offer a useful scaffolding upon which we can define a new way to characterize techniques for reconstructing networks from time series data. Specifically, they define an arena for scientists to introduce, test, and compare competing hypotheses.

[Figure 4.4 appears here: "Graph distances between connectomes." Overlapping density histograms of the nonbacktracking spectral distance, $D_{NBD}(G, G')$, comparing pairwise distances within the healthy control group to distances between the healthy and unhealthy groups.]

Figure 4.4: Distributions of within/between-ensemble graph distances. Using real data of functional connectomes of human subjects who have and have not been diagnosed with schizophrenia [179] (n = 71 control, n = 56 treatment), we can measure the average within-ensemble graph distance of connectomes in each treatment group. Blue histogram: pairwise graph distances (measured using the Nonbacktracking Spectral Distance [175]) between all connectomes in the control group. Of particular interest is the high density region around $D_{NBD}(G, G') = 2.0$, suggesting that there is a characteristic or expected distance between healthy brains. Red histogram: pairwise graph distances between connectomes in the two different conditions. This distribution is much less concentrated around $D_{NBD}(G, G') = 2.0$; on the contrary, it is characterized by much higher average distances, with much higher variance than the blue histogram.

To conclude this example, consider a fictional world where scientists have uncovered a highly realistic model of brain network structure; scientists in this world would have researched a variety of structural properties of real brain networks and used these insights to define a generative model that produces networks with remarkably similar structure to that of real human brain networks. Alongside this improved understanding of brain network structure, scientists in this fictional world also defined a highly realistic model of neuronal dynamics, such that simulating this particular dynamical process on these brain-like networks can produce time series data that remarkably resembles those from real brains. It is not clear that this fictional world is even possible, but imagining this idealized setting gives us an environment for defining a new set of criteria for evaluating the effectiveness of network reconstruction techniques.

Figure 4.4 shows an example of this intuition, applied to real data from the brains of 56 treatment and 71 control group participants [179]. Here, the assumption is that healthy connectomes have all been sampled from the ensemble of all possible healthy brain networks (Assumption 2); in this idealized environment, these 71 connectomes would approximate common structural properties of graphs sampled from that ensemble. The key observation here is that by measuring the distribution of pairwise graph distances within that sample of healthy human brain networks (blue histogram in Figure 4.4), we are able to better understand deviation from that sample. The red histogram corresponds to the distribution of graph distances between connectomes in the control and treatment conditions. Note that not only is the mean distance different; there is also much more variance in the distances between treatment connectomes and control connectomes. On one level, this is quite obvious: a graph distance measure picked up on differences between networks that were, a priori, known to be different. As such, this example is not intended as a formal analysis of the differences between healthy and unhealthy brain networks; it simply shows that there is meaningful information in the shapes of the distributions of within- and between-ensemble graph distances.

Despite only being an illustrative example (and not directly involving network reconstruction techniques), Figure 4.4 highlights an important point in this chapter. If we imagine collecting neuroimaging data—time series of brain activity—from the 71 healthy brains in the sample above (a sample with a characteristic distribution of within-group graph distances—blue histogram in Figure 4.4), and if we aim to use a network reconstruction technique to infer the connectivity of those original 71 connectomes from the time series data, then our selected reconstruction technique should generate networks with the same (or a similar) distribution of within-group graph distances. Put another way, the mean and variance of the within-ensemble graph distance between reconstructed networks should also match the distribution of the within-ensemble graph distances between graphs sampled from whatever original graph ensemble generated the "ground truth" networks.
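To make this intuition concrete, the comparison in Figure 4.4 can be sketched in a few lines of Python. The snippet below assumes netrd's distance interface (each distance class exposes a dist(G1, G2) method returning a scalar), and it uses placeholder random graphs in place of the connectome data; the class name NonBacktrackingSpectral is an assumption about how the package labels the nonbacktracking spectral distance.

from itertools import combinations, product

import networkx as nx
import netrd

# Placeholder graphs standing in for the control and treatment connectomes.
control = [nx.erdos_renyi_graph(64, 0.1) for _ in range(10)]
treatment = [nx.erdos_renyi_graph(64, 0.1) for _ in range(10)]

distance = netrd.distance.NonBacktrackingSpectral()

# Within-group distances: all pairs of control connectomes (blue histogram).
within = [distance.dist(G1, G2) for G1, G2 in combinations(control, 2)]

# Between-group distances: every control/treatment pair (red histogram).
between = [distance.dist(G1, G2) for G1, G2 in product(control, treatment)]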

Effective techniques for reconstructing networks from time series data should not only generate networks with similar properties to the original network; they should also produce ensembles of networks with similar properties to the ensemble from which the original network was sampled. In the fictional world imagined earlier in this section, the presence of these highly realistic models of brain network structure and dynamics means that the reconstruction technique that is able to satisfy this condition of ensemble similarity should be the one that gets used in application settings. In the following sections, we will more formally define this criterion and the family of graph ensembles that emerge from it, illustrating its utility in a few different settings.

4.2.2 The $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensemble

Given a random graph ensemble, $\mathcal{G}$, and a particular dynamical process, $\mathcal{D}$, we are able to construct a wide variety of graphs based on the time series that are created when simulating dynamics on graphs, G, sampled from $\mathcal{G}$. The properties of the reconstructed graph depend on the choice of reconstruction technique, $\mathcal{R}$. Here, we will define a framework for creating and characterizing ensembles of graphs based on this triplet, $(\mathcal{G}, \mathcal{D}, \mathcal{R})$. Sampling from a particular $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensemble occurs in three steps:

     Network reconstruction           Label   Source
 1   Convergent cross-mapping         CCM     [166]
 2   Correlation matrix               COR     [123]
 3   Exact mean field                 EMF     [83]
 4   Free energy minimization         FEM     [83]
 5   Granger causality                GRA     [159]
 6   Graphical lasso                  GLA     [66]
 7   Marchenko-Pastur                 MAR     [55]
 8   Maximum likelihood estimation    MLE     [193]
 9   Mutual information matrix        MUT     [53]
10   Naive mean field                 NMF     [83]
11   Ornstein-Uhlenbeck inference     OUI     [19]
12   Partial correlation matrix       PCM     [106]
13   Stochastic block model           SBM     [138]
14   Thouless-Anderson-Palmer         TAP     [83]
15   Transfer entropy                 TRE     [164]

Table 4.1: Network reconstruction techniques included in the netrd package.

1. Sample a network from the graph ensemble, $\mathcal{G}$. This can be any number of models, from Erdős-Rényi random graphs [61] to Barabási-Albert graphs [16] to random geometric graphs [48], and more. The usefulness of selecting one ensemble over another is domain-specific and could in theory help illustrate larger points about the effectiveness of a given network reconstruction technique.

2. Under a given dynamical process, $\mathcal{D}$, simulate node-level activity for a pre-specified length of time. Here again, we see many possible ways to generate node-level dynamics, from simple random walk dynamics to complicated integrate-and-fire dynamics often used in simulations of neural activity [90; 123; 170].

3. Using the time series data generated in Step 2, use $\mathcal{R}$ to create an adjacency matrix in an attempt to recapitulate the network structure of the original graph from Step 1. This task is fundamentally challenging and full of assumptions that may or may not be justified in different contexts (a rough sketch of all three steps follows below).
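As an illustration of these three steps, the following sketch assumes netrd's class-based interface, in which dynamics objects expose a simulate(G, L) method returning an array of node activities and reconstruction objects expose a fit(TS) method returning a networkx graph; the specific class names below are assumptions and may differ from the released API.

import networkx as nx
import netrd

n, L = 64, 500

# Step 1: sample a ground truth network G from the graph ensemble.
G = nx.erdos_renyi_graph(n, p=0.1)

# Step 2: simulate a dynamical process D on G for L time steps.
dynamics = netrd.dynamics.SherringtonKirkpatrickIsing()
TS = dynamics.simulate(G, L)

# Step 3: use a reconstruction technique R to recover an adjacency
# structure from the time series alone.
reconstructor = netrd.reconstruction.FreeEnergyMinimization()
G_r = reconstructor.fit(TS)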

In the following sections, we explore the differences between graphs within and between different $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensembles.² We introduce several ways to compare graphs sampled from different $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensembles, including conditions for evaluating the effectiveness of a given technique $\mathcal{R}$ based on network comparison tools known as graph distances.

²Note that the graphs sampled from a given $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensemble also depend on the number of nodes, n, in the network and the length, L, of the time series. We omit these parameters from our notation for clarity.

     Network dynamics                        Label   Source
 1   Sherrington-Kirkpatrick Ising model    SK      [160]
 2   Branching process model                BP      [75; 142]

Table 4.2: Network dynamics studied here.

4.3 Methods

This work involves network reconstruction from time series data and quantifying the differences between such reconstructed networks using tools known as graph distances. These two endeavors have generated much scholarship in the Network Science community [12; 63; 161; 100; 157; 170; 107; 43; 78; 191; 153; 162; 32; 138; 166; 123; 83; 159; 66; 55; 193; 53; 19; 106; 164], and as such, there are many different ways to approach each. See Sections 4.1.1 and 4.1.3 for an introduction and motivation. In Table 4.1, we list the methods used in this study, as well as other techniques collected over the course of this project.

4.3.1 The standardized graph distance

We use $D(G_1, G_2)_\mathrm{label}$ to denote the distance between graphs $G_1$ and $G_2$ computed using the graph distance technique with the given label. In Table 3.1 we show the graph distances used in this work along with their labels.

A graph distance measure can be a useful tool for quantifying the dissimilarity between pairs of networks, and using multiple graph distance measures paints a more comprehensive, multi-dimensional picture of the differences between two networks. For the same two networks, $G_1$ and $G_2$, different measures may assign wildly different distance values. Comparing across different distance measures, however, remains a challenging task. For example, knowing that $D(G_1, G_2)_i = 1.35$ and $D(G_1, G_2)_j = 2205.6$ tells us that the two different distance measures, $D_i$ and $D_j$, have assigned different scalar values to the distance between the two graphs, but it provides no intrinsic or intuitive way to compare across distance measures.

For this reason, we introduce a useful standardization technique that gets past the problem above. This standardized graph distance is based on the mean within-ensemble graph distance [80] for graphs sampled from a given ensemble. The intuition behind this standardization is that we can use well-understood ensembles of random networks to benchmark "close" and "far" distance values; as such, we can select any number of ensembles for our standardization. Take, for example, $G(n, p)$, the Erdős-Rényi model, which generates networks of n nodes in which each pair of nodes is connected with probability p. Under a given distance measure with label d, the standardized graph distance between two networks, $G_1$ and $G_2$, is defined as

$$D_s(G_1, G_2)_d = \frac{D(G_1, G_2)_d}{\left\langle D\big(G(n_1, p_1), G(n_2, p_2)\big)_d \right\rangle} \qquad (4.1)$$

where $n_1, n_2$ and $p_1, p_2$ are respectively the sizes and densities of $G_1$ and $G_2$, and the angled brackets denote a sample average. The values of the distances reported in this work will usually be standardized.

This standardized graph distance has several useful properties. If $D_s(G_1, G_2)_d = 1$, then $G_1$ and $G_2$ are the same distance apart as one would expect two randomly sampled Erdős-Rényi (in this case) graphs to be. When $D_s(G_1, G_2)_d < 1$, graphs $G_1$ and $G_2$ are closer than the expected distance between two Erdős-Rényi graphs of the same size and density; likewise, if $D_s(G_1, G_2)_d > 1$, they are further apart. This threshold of $D_s(G_1, G_2)_d = 1$ may at first seem arbitrary, but in a field with so much variability among so many tools, such a benchmark is useful for standardizing distance values.
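Equation (4.1) can be estimated directly by Monte Carlo. The sketch below assumes a netrd-style distance object with a dist(G1, G2) method; the denominator is approximated by averaging the distance over sampled pairs of Erdős-Rényi graphs matched to each network's size and density.

import networkx as nx

def standardized_distance(G1, G2, distance, samples=100):
    """Estimate D_s(G1, G2)_d of Eq. (4.1) against an Erdos-Renyi baseline."""
    n1, n2 = G1.number_of_nodes(), G2.number_of_nodes()
    p1, p2 = nx.density(G1), nx.density(G2)

    # Estimate <D(G(n1, p1), G(n2, p2))_d> from `samples` pairs of ER graphs.
    baseline = sum(
        distance.dist(nx.erdos_renyi_graph(n1, p1),
                      nx.erdos_renyi_graph(n2, p2))
        for _ in range(samples)
    ) / samples

    return distance.dist(G1, G2) / baseline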

4.3.2 Description of experiments

Much like the experiments performed in Section 3.2.3, we will be sampling ground truth networks from a given ensemble (in this case, the nonlinear preferential attachment model). The difference between these experiments and the ones in Chapter 3 is that now we also select a dynamical process, $\mathcal{D}$, to simulate on the nodes in the sampled ground truth network (Table 4.2). This generates a time series of activity, which we use to reconstruct a new graph, $G_r$, under multiple reconstruction techniques, $\mathcal{R}$ (Table 4.1). In some experiments, we will be studying the mean standardized graph distance between networks $G_r$ and $G_r'$ sampled from the same $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensemble. In other experiments, we measure the distance between the reconstructed and ground truth networks.
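For concreteness, here is one simple way to sample ground truth networks from a nonlinear preferential attachment ensemble: each incoming node attaches m edges to existing nodes chosen with probability proportional to degree raised to the power α (so α = 1 recovers linear, Barabási-Albert-style attachment). This is an illustrative sketch under those assumptions, not the exact generator used in these experiments.

import random

import networkx as nx

def nonlinear_preferential_attachment(n, m, alpha):
    # Start from a small complete seed so every node has degree >= m.
    G = nx.complete_graph(m + 1)
    for new_node in range(m + 1, n):
        nodes = list(G.nodes())
        # Attachment probability proportional to degree**alpha.
        weights = [G.degree(v) ** alpha for v in nodes]
        targets = set()
        while len(targets) < m:
            targets.add(random.choices(nodes, weights=weights)[0])
        G.add_edges_from((new_node, t) for t in targets)
    return G

# e.g., one ground truth network matching the figures: n = 64, m = 2
G = nonlinear_preferential_attachment(n=64, m=2, alpha=2.0)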

4.4 Results

We focus on two goals: 1) measuring the standardized within-ensemble graph distance between networks from a given $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensemble, and 2) measuring the standardized between-ensemble graph distance between networks sampled from different ensembles, $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ and $(\mathcal{G}, \mathcal{D}, \mathcal{R}')$. In doing so, we report and unpack the contribution that the choice of ground-truth network ensemble (the $\mathcal{G}$) has on the within/between-ensemble graph distances, as well as the role played by the node-level dynamics (the $\mathcal{D}$).

Before reporting the results below, we must first reiterate a number of disclaimers. We encourage the reader to spend time carefully looking through and familiarizing themselves with the figures here. Because the endeavor of this work is comparative in nature, there will be many subplots that may appear redundant or look similar, and we will try to thoroughly point the reader's attention to what we found to be key differences between the various methods compared here.

Lastly, there is an enormous number of possible experiments one could run to characterize and assess the differences and performance of the various network reconstruction measures tested here. We will not attempt to be exhaustive. Instead, Figures 4.5, 4.6, and 4.7 are meant to be examples of the kinds of insights to be gained when approaching the problem of network reconstruction as we do here.

[Figure 4.5 appears here: "Standardized graph distance between reconstructed network $G_r$ and original network G (Sherrington-Kirkpatrick Ising model dynamics, comparing reconstruction techniques)." Four panels (Ipsen-Mikhailov, Degree Divergence, Quantum JSD, DeltaCon) plot $D_s(G_r, G)$ against the preferential attachment kernel, α.]

Figure 4.5: Comparison of two reconstruction techniques on the same underlying network and dynamics. In each subplot, we look at the average standardized graph distance between the ground truth network and the networks reconstructed under a given reconstruction technique (Granger causality or free energy minimization). In this example, the ground truth networks were preferential attachment networks (as in 3.2.1). These plots show how the average standardized distance between the ground truth and reconstructed networks changes as the underlying network structure changes with the kernel of preferential attachment, α. Additionally, we see that different distance measures pick up on different, sometimes divergent structural properties of the reconstructed networks. The four distance measures included here are IPM, DJS, QJS, and DCN. The horizontal dashed line corresponds to $D_s(G, G') = 1$, which is the mean within-ensemble graph distance for pairs of networks sampled from the same ensemble (in this case, from the nonlinear preferential attachment model). This plot includes networks with n = 64, m = 2.

The first key finding in this section concerns the effect of the underlying network's degree heterogeneity on the performance of the reconstruction technique. This is exemplified in Figure 4.5, where we compare the mean standardized distance between the ground truth network and networks reconstructed from time series data (from Sherrington-Kirkpatrick Ising model dynamics). Here, we use the nonlinear preferential attachment model to generate the underlying ground truth networks, varying the amount of preferential attachment with the parameter α. We compare two reconstruction techniques (GRA and FEM) and see that, depending on the graph distance measure selected, the two reconstruction techniques can show drastically different, even divergent, behavior. In each subplot in Figure 4.5, the horizontal dashed line corresponds to a standardized graph distance $D_s(G, G') = 1$, which is simply the mean within-ensemble graph distance for ground truth networks under that parameterization. As α increases (horizontal axis, from left to right), the ground truth networks' degree distribution becomes more heterogeneous, with α = 1 signifying the onset of a scale-free regime [16]. What Figure 4.5 begins to show is that, under certain distance measures, the two network reconstruction techniques can behave similarly for most values of α (as is the case for Degree Divergence), but for others (i.e., Ipsen-Mikhailov and Quantum JSD), their performance diverges for heterogeneous networks with α > 1.

In Figure 4.6, we hold the reconstruction technique constant (Granger causality) while varying the dynamical process that is simulated on the ground truth networks. Here, too, this can generate very different networks, depending on the structure of the ground truth network as well as the choice of dynamics.

Lastly, to address the framing from Section 4.2.1, we are interested in network reconstruction techniques that generate networks where the distribution of the within-ensemble graph distances between pairs of reconstructed networks is similar to that of the ground truth networks. Figure 4.7 plots histograms of within-ensemble graph distances for ground truth networks where α = 1.0 and α = 2.0, comparing these distributions to those of the reconstructed networks under the free energy minimization reconstruction approach. When α = 2.0, we see two key differences: for one, the variance of both the ground truth and reconstructed networks' within-ensemble graph distance distributions increases. Additionally, the mean within-ensemble graph distances of the reconstructed and ground truth networks are closer to one another than when α = 1.0. According to the criteria laid out in 4.2.1, this is a particularly encouraging sign, as it suggests that the shape of the ensemble of networks being reconstructed from time series data somewhat resembles the shape of the ensemble of original ground truth networks.

[Figure 4.6 appears here: "Standardized graph distance between reconstructed network $G_r$ and original network G (Granger causality network reconstruction, comparing dynamical processes)." Four panels (Jaccard, Polynomial Dissimilarity, D-Measure, NetSimile) plot $D_s(G_r, G)$ against the preferential attachment kernel, α.]

Figure 4.6: Highlighting the effect of dynamics on network reconstruction performance. Here, we isolate a single network reconstruction technique, Granger causality, to highlight the effect of two different dynamics simulated on the same ground truth networks (again, nonlinear preferential attachment networks, varying the preferential attachment kernel, α). Again, we see different standardized graph distances between the ground truth and reconstructed networks depending on 1) the graph distance measure used and 2) the dynamics. The four distance measures included here are JAC, POD, DME, and NES. The horizontal dashed line corresponds to $D_s(G, G') = 1$, which is the mean within-ensemble graph distance for pairs of networks sampled from the same ensemble (in this case, from the nonlinear preferential attachment model). This plot includes networks with n = 64, m = 2.

[Figure 4.7 appears here: "Within-ensemble graph distance distributions (ground truth and reconstructed networks) (free energy minimization network reconstruction, Sherrington-Kirkpatrick dynamics)." Two panels of density histograms over the within-ensemble graph distance: left, nonlinear preferential attachment with α = 1.0; right, α = 2.0.]

Figure 4.7: Distributions of within-ensemble graph distances. Here, we report the distributions of within-ensemble graph distances (using QJS) for pairs of ground truth networks sampled from the same ensemble (light grey bars, $D_s(G, G')$), as well as the within-ensemble graph distances of networks reconstructed from time series of Sherrington-Kirkpatrick dynamics simulated upon the ground truth networks (dark grey bars, $D_s(G_r, G_r')$). In each experiment, the ground truth networks are sampled from the nonlinear preferential attachment model where n = 64, m = 2. Left: α = 1.0. Right: α = 2.0.

4.5 Discussion

Every network analysis begins with two simple questions: What are the nodes? What are the links? While the questions themselves are simple, the endeavor of identifying, inferring, or otherwise selecting appropriate nodes and links is far from trivial. Here, we have proposed a way to systematically compare the wide array of techniques for inferring or reconstructing a network's structure (i.e., its edgelist) from time series data of the dynamics of a given system. To do this, we defined an ensemble of graphs, the $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensemble, from which we sampled thousands of graphs and quantified their differences using a suite of tools known as graph distances.

Specifically, we looked at typical distances between graphs sampled from within the same ensemble, $(\mathcal{G}, \mathcal{D}, \mathcal{R}_i)$, and between graphs sampled from different ensembles, $(\mathcal{G}, \mathcal{D}, \mathcal{R}_i)$ and $(\mathcal{G}, \mathcal{D}, \mathcal{R}_j)$, in a similar vein to Hartle et al. [80]. This article's primary contributions are twofold: the first is the introduction of a new mathematical framework for comparing techniques for reconstructing networks; the second is the set of tools to perform these comparisons, most notably the standardized graph distance.

The analyses we include in this report are a small fraction of the possible ways that one might approach the systematic study of a given $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensemble. By varying the underlying ensemble of "ground truth" graphs, $\mathcal{G}$, we can sample graphs that more closely resemble whatever system is being studied (i.e., assuming there is prior evidence or domain precedent that points to the system having a certain network structure). Perhaps more importantly, the same systematic approach can be used to vary the form of the dynamical processes simulated on the network, $\mathcal{D}$, as we do in Figure 4.6. One can imagine using the $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensemble across a number of fields, and its utility rests on its explicit isolation of three major assumptions used whenever we induce networks from time series data. This provides a fertile ground for scientific discussion, disagreement, and progress. That is, researchers who use reconstructed networks as data objects should explicitly consider (and provide evidence for) their assumptions about the possible underlying $\mathcal{G}$ and $\mathcal{D}$ of the system in question, which in turn informs the choice of $\mathcal{R}$.

Take, for example, network reconstruction in the context of firm-to-firm interaction dynamics in the stock market [134]. Economists and researchers in financial Network Science have written extensively about the hypothesized network structure of this system, the possible types of interactions that nodes engage in, as well as higher-order structural properties like clustering in this network. The intuition behind using a network to study this system is simple: there is evidence to suggest that the behavior of one firm is not conditionally independent from the behavior of another firm, and that we can access and analyze such dependencies using tools from Network Science. By creating an infrastructure to "plug-and-play" different hypothesized mechanisms for the dynamics in this system, $\mathcal{D}$, and coupling this with several candidate ground truth network structures, $\mathcal{G}$, we can generate data that more closely resembles the true data that we record from this system, either statistically or phenomenologically.

The approach laid out in this chapter raises the question: how dispersed should the space of reconstructed networks be for a given system? If we try reconstructing a network using a dozen different techniques for $\mathcal{R}$, what if they are all similar to one another? Or conversely, how different should the various reconstructed networks be from one another, given the time series data? Given that different techniques for $\mathcal{R}$ attend to certain signals in the data over others, do we also suspect that there is enough variability in the time series data to warrant, for example, such a divergent space of reconstructed networks? These questions are orthogonal to typical questions about the mere performance of network reconstruction techniques and, we argue, important for the broader use of Network Science in situations where the adjacency matrix is not known a priori.

Where does the $(\mathcal{G}, \mathcal{D}, \mathcal{R})$ ensemble fit in the broader toolkit of methods for reconstructing networks from time series data? Here, we hope to clarify two key points and close with a series of questions as well as a larger point about the way forward. First, the ensemble itself is now more accessible than ever, as the netrd Python package allows users to test a variety of combinations of dynamics, reconstruction methods, and ground truth networks, and to compare the resulting networks using all sorts of network comparison methods. Second, by providing this framework and software, we can collate a series of outstanding questions that can be systematically answered, in an open-sourced manner, over the coming years.

• What does a node represent? Is it a readout from a sensor? Is it a specific entity in the system? Is it a macro-grouping of micro-level entities?

• What does a link between two nodes represent? Is it directed? Is it weighted? Does it represent (effective) causation? Does it signify co-activation? Should the reconstructed network be bipartite?

• Are there self-loops or self-excitation in the system in question? Is there spontaneous firing? Is every link excitatory, or are there also inhibitory relationships?

• Is the time series periodic in nature? Is it chaotic? Is it random? Are there multi-order correlations between the signals? If we slowly increased the amount of noise in the time series, would we want the reconstruction to get closer and closer to a random reconstruction?

• How long are the time series data? Is this the correct time scale at which to examine this system? Has the data been sampled at a timescale that allows for reconstruction? What is the impact of noisy measurements on the network reconstruction? If the system were sampled for a massive amount of time, would the reconstruction technique accurately recapitulate the original network structure?

• What features should be prioritized by a network reconstruction algorithm? Is it important to identify the presence and absence of links? Is the degree distribution an important feature? Modular structure? Average path length? Sparsity?

Acknowledgements: A very special thanks to the Network Science Institute at Northeastern University, as well as to Rose Yu, Dima Krioukov, Laurent Hébert-Dufresne, Patrick Desrosiers, Juniper Lovato, and several members of the Complex Networks Winter Workshop 2019 for early discussions about graph distances and network reconstruction techniques.

Data and software availability: All the experiments in this paper were conducted using the netrd Python package [119], which is available at https://github.com/netsiphd/netrd.

Broader impact: Associated manuscripts and related work

In addition to the theoretical contribution of defining and characterizing ensembles of reconstructed networks, the primary contribution from this chapter is a Python software package called netrd. This package has grown into an international open source project with dozens of contributors, and as of 11/19/2020, it has been downloaded over 14,000 times (https://pepy.tech/project/netrd).

Personal research and collaboration

Under revision

McCabe, S., Torres, L., LaRock, T., Haque, S. A., Yang, C-H., Hartle, H., & Klein, B. (under revision, Journal of Open Source Software). netrd: A library for network reconstruction and graph distances. arXiv: 2010.16019.

In preparation

Klein, B., McCabe, S., Hartle, H., Torres, L., Yang, C-H., LaRock, T., Shugars, S., Gallagher, R.J., Sakharov, T., Davis, J.T., Robertson, R., Mattsson, C., St-Onge, G., Murphy, C., Saffo, D., Mistry, D., Heins, C., Almeida, L., Haque, S., Towlson, E., Zhang, Q., Shrestha, M., Ruf, S., Gates, A., Chinazzi, M., Coronges, K., Riedl, C., Dunne, C., Lippner, G., Eliassi-Rad, T., Vespignani, A., & Scarpino, S.V. (in prep.). Comparing ensembles of networks reconstructed from time series data.

Open software and replication materials

netrd: https://github.com/netsiphd/netrd

Chapter 5

Conclusion

Over the last generation, many scientific advances have emerged from the observation that networks across nature, society, and scale often show similar structural or dynamical properties [4; 180; 130; 13]. Despite differences in size, age, domain, and function, many complex systems appear to be more similar than they are different from each other. Fundamentally, this forces us to ask why. To date, however, we have seen few theoretical advances in Network Science that have been able to rigorously characterize this apparently purposive behavior that we so often observe in networks. We are instead left with reductive, mechanistic answers, which are themselves useful but limited in what they can explain.

This challenge highlights the fundamental difference between descriptive and normative science. For example, we know that the electrochemical exchange of information in our brains leads to conscious thought, but we may not know why. We know that ecosystems are composed of specific sets of species, but we may not know why. We know that complex networks across nature and society tend to show remarkably similar properties, but we may not know why.

These questions have long driven my research: to come up with theories that can bridge disciplinary divides and highlight the crucial role that Network Science can have in how we approach our complex world. Indeed, one of the most attractive features of Network Science as a discipline has been its ability to use insights gained from one domain (e.g., the flow of information in brain networks [111]) and apply them to another entirely different domain (e.g., the flow of money in economic networks [117]). If we have a better set of tools for building Network Science-specific theories, then we are more equipped to generate these cross-disciplinary, translational insights. In this dissertation, I have approached three different methodological questions with the express goal of building computational tools that can bring about theoretical gain. This has been the case whether it is developing computational tools for uncovering higher scales in complex networks (Chapter 2), introducing standardized ways to compare networks (Chapter 3), or consolidating an unruly and complicated set of tools into a neater, simpler framework (Chapter 4).

I conclude this dissertation exhausted after several years of work, worn down from almost a year watching a pandemic grow into an ungraspable tragedy, buoyed by a solidarity that sprouts from this shared sadness, inspired by the humanity of those I have worked with, inspired by their brilliance, inspired by their tireless pursuit to transform their knowledge into a better world, assured that this path—this Network Science—is the right path to take, looking out hopeful across a vast, uncertain future, but feeling ready, not knowing for what.

References

[1] L. A. Adamic and N. Glance. The political blogosphere and the 2004 U.S. election: Divided they blog. In Proceedings of the 3rd International Workshop on Link Discovery, LinkKDD '05, pages 36–43, 2005. doi: 10.1145/1134271.1134277.

[2] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47–97, 2002. doi: 10.1103/RevModPhys.74.47.

[3] U. Alon. Biological networks: The tinkerer as an engineer. Science, 301(5641): 1866–1867, 2003. doi: 10.1126/science.1089072.

[4] L. A. N. Amaral and J. M. Ottino. Complex networks. European Physical Journal B - Condensed Matter, 38(2):147–162, 2004. doi: 10.1140/epjb/e2004-00110-5.

[5] W. R. Ashby. An Introduction to Cybernetics. Chapman and Hall Ltd., 1957.

[6] J. P. Bagrow and E. M. Bollt. An information-theoretic, all-scales approach to comparing networks. Applied Network Science, 4(45):1–15, 2019. doi: 10.1007/s41109-019-0156-x.

[7] J. P. Bagrow, E. M. Bollt, J. D. Skufca, and D. Ben-Avraham. Portraits of complex networks. Europhysics Letters, 81(6), 2008. doi: 10.1209/0295-5075/81/68004.

[8] L. Bai, L. Rossi, A. Torsello, and E. R. Hancock. A quantum Jensen-Shannon graph kernel for unattributed graphs. Pattern Recognition, 48(2):344–355, 2015. doi: 10.1016/j.patcog.2014.03.028.

[9] D. Balcan, V. Colizza, B. Gonçalves, H. Hu, J. J. Ramasco, and A. Vespignani. Multiscale mobility networks and the spatial spreading of infectious diseases. Proceedings of the National Academy of Sciences, 106(51):21484–21489, 2009. doi: 10.1073/pnas.0906910106.

[10] D. Balcan, B. Gonçalves, H. Hu, J. J. Ramasco, V. Colizza, and A. Vespignani. Modeling the spatial spread of infectious diseases: The GLobal Epidemic and Mobility computational model. Journal of Computational Science, 1(3):132–145, 2010. doi: 10.1016/j.jocs.2010.07.002.

[11] D. Balduzzi and G. Tononi. Qualia: The geometry of integrated information. PLoS Computational Biology, 5(8), 2009. doi: 10.1371/journal.pcbi.1000462.

[12] M. Bansal and V. Belcastro. How to infer gene networks from expression profiles. Molecular Systems Biology, 3(August):78, 2007. doi: 10.1038/msb4100120.

[13] A.-L. Barabási. Network Science. Cambridge University Press, 2016. ISBN 1107076269. doi: 10.1177/0094306116681814.

[14] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286 (October):509–512, 1999. doi: 10.1126/science.286.5439.509.

[15] A.-L. Barabási and Z. N. Oltvai. Network biology: Understanding the cell's functional organization. Nature Reviews Genetics, 5(2):101, 2004. doi: 10.1038/nrg1272.

[16] A.-L. Barabási, R. Albert, and H. Jeong. Mean-field theory for scale-free random networks. Physica A, 272(1):173–187, 1999. doi: 10.1016/S0378-4371(99)00291-5.

[17] A.-L. Barabási, E. Ravasz, and T. Vicsek. Deterministic scale-free networks. Physica A, 299:559–564, 2001. doi: 10.1016/S0378-4371(01)00369-7.

[18] A.-L. Barabási, N. Gulbahce, and J. Loscalzo. Network medicine: A network-based approach to human disease. Nature Reviews Genetics, 12(1):56–68, 2011. doi: 10.1038/nrg2918.

[19] P. Barucca. Localization in covariance matrices of coupled heterogenous Ornstein-Uhlenbeck processes. Physical Review E, 90(6):1–5, 2014. doi: 10.1103/PhysRevE.90.062129.

[20] E. Başar. Chaos in Brain Function. Springer Science and Business Media, 2012.

[21] D. S. Bassett and O. Sporns. Network neuroscience. Nature Neuroscience, 20(3): 353–364, 2017. doi: 10.1038/nn.4502.

[22] C. M. Bennett, M. B. Miller, and G. L. Wolford. Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction. Neuroimage, 47:S125, 2009. doi: 10.1016/S1053-8119(09)71202-9.

[23] A. R. Benson, D. F. Gleich, and J. Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016. doi: 10.1126/science.aad9029.

[24] J. Bento and S. Ioannidis. A family of tractable graph metrics. Applied Network Science, 4(1):1–27, 2019. doi: 10.1007/s41109-019-0219-z.

[25] M. Berlingerio, D. Koutra, T. Eliassi-Rad, and C. Faloutsos. NetSimile: A scalable approach to size-independent network similarity. arXiv:1209.2684, 2012. URL https://arxiv.org/abs/1209.2684.

[26] G. Boeing. OSMnx: New methods for acquiring, constructing, analyzing, and visualizing complex street networks. Computers, Environment and Urban Systems, 65:126–139, 2017. doi: 10.1016/j.compenvurbsys.2017.05.004.

[27] B. Bollobás. A probabilistic proof of an asymptotic formula for the number of labelled regular graphs. European Journal of Combinatorics, 1(4):311–316, 1980. ISSN 01956698. doi: 10.1016/S0195-6698(80)80030-8.

[28] B. Bollobás. The evolution of random graphs. Transactions of the American Mathematical Society, 286(1):257, 1984. doi: 10.2307/1999405.

[29] P. Bonacich. Power and centrality: A family of measures. American Journal of Sociology, 92(5):1170–1182, 1987. doi: 10.1086/228631.

[30] D. Bray. Molecular networks: The top-down view. Science, 301:1864–5, 2003. doi: 10.1126/science.1089118.

[31] M. D. Brennan, R. Cheong, and A. Levchenko. How information theory handles cell signaling and uncertainty. Science, 338(6105):334–335, 2012. doi: 10.1126/science.1227946.

[32] I. Brugere, B. Gallagher, and T. Y. Berger-Wolf. Network structure inference, a survey. ACM Computing Surveys, 51(2):1–39, 2018. doi: 10.1145/3154524.

[33] T. D. P. Brunet and W. F. Doolittle. The generality of constructive neutral evolution. Biology and Philosophy, 33(1-2):2, 2018. doi: 10.1007/s10539-018-9614-6.

[34] H. Bunke and K. Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3-4):255–259, 1998. doi: 10.1016/S0167-8655(97)00179-7.

[35] G. T. Cantwell and M. E. J. Newman. Message passing on networks with loops. Proceedings of the National Academy of Sciences, 116(47):23398–23403, 2019. doi: 10.1073/pnas.1914893116.

[36] M. Carlile. Prokaryotes and eukaryotes: Strategies and successes. Trends in Biochemical Sciences, 7(4):128–130, 1982. doi: 10.1016/0968-0004(82)90199-2.

[37] L. C. Carpi, O. A. Rosso, P. M. Saco, and M. G. Ravetti. Analyzing complex networks evolution through Information Theory quantifiers. Physics Letters A, 375(4):801–804, 2011. doi: 10.1016/j.physleta.2010.12.038.

[38] M. Cha, F. Benevenuto, H. Haddadi, and K. P. Gummadi. The world of connections and information flow in Twitter. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, 42(4), 2012. doi: 10.1109/TSMCA.2012.2183359.

[39] H. Chang. Inventing temperature: Measurement and scientific progress. Oxford University Press, 2004. ISBN 978-0195337389.

[40] D. Chen, D. D. Shi, M. Qin, S. M. Xu, and G. J. Pan. Complex network comparison based on communicability sequence entropy. Physical Review E, 98(1):1–8, 2018. doi: 10.1103/PhysRevE.98.012319.

[41] S. Chowdhury and F. Mémoli. Distances and isomorphism between networks and the stability of network invariants. arXiv:1708.04727, 2017. URL https://arxiv.org/abs/1708.04727.

[42] G. Chui. "Unified theory" is getting closer, Hawking predicts. San Jose Mercury News, Jan 2000.

[43] G. Cimini, T. Squartini, D. Garlaschelli, and A. Gabrielli. Systemic risk analysis on reconstructed economic and financial networks. Scientific Reports, 5:1–12, 2015. doi: 10.1038/srep15758.

[44] G. Cimini, T. Squartini, F. Saracco, D. Garlaschelli, A. Gabrielli, and G. Caldarelli. The statistical physics of real-world networks. Nature Reviews Physics, 1(1):58–71, 2019. doi: 10.1038/s42254-018-0002-6.

[45] D. Colquhoun and A. Hawkes. On the stochastic properties of single ion channels. Proceedings of the Royal Society B, 211(1183):205–235, 1981. doi: 10.1098/rspb.1981.0003.

[46] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, 2012. ISBN 9780471241959. doi: 10.1002/047174882X.

[47] J. B. Dacks, A. A. Peden, and M. C. Field. Evolution of specificity in the eukaryotic endomembrane system. The International Journal of Biochemistry & Cell Biology, 41(2):330–340, 2009. doi: 10.1016/j.biocel.2008.08.041.

[48] J. Dall and M. Christensen. Random geometric graphs. Physical Review E, 66:016121, 2002. doi: 10.1103/PhysRevE.66.016121.

[49] M. De Domenico and J. Biamonte. Spectral entropies as information-theoretic tools for complex network comparison. Physical Review X, 6(4):34–37, 2016. doi: 10.1103/PhysRevX.6.041062.

[50] C. I. Del Genio, T. Gross, and K. E. Bassler. All scale-free networks are sparse. Physical Review Letters, 107(17):1–4, 2011. doi: 10.1103/PhysRevLett.107.178701.

[51] M. M. Deza and E. Deza. Encyclopedia of Distances. In Encyclopedia of Distances, pages 1–583. Springer, 2009.

[52] K. Dolinski and O. G. Troyanskaya. Implications of Big Data for cell biology. Molecular Biology of the Cell, 26(14):2575–2578, 2015. ISSN 19394586. doi: 10.1091/mbc.E13-12-0756.

[53] J. F. Donges, Y. Zou, N. Marwan, and J. Kurths. The backbone of the climate network. Europhysics Letters, 87(4), 2009. doi: 10.1209/0295-5075/87/48007.

[54] C. Donnat and S. Holmes. Tracking network dynamics: A survey using graph distances. Annals of Applied Statistics, 12(2):971–1012, 2018. doi: 10.1214/18-AOAS1176.

[55] A. Edelman and N. R. Rao. Random matrix theory. Acta Numerica, 14:233–297, 2005. doi: 10.1017/S0962492904000236.

[56] G. M. Edelman. Neural Darwinism: The Theory of Neuronal Group Selection. Basic books, 1987. ISBN 0465049346.

[57] G. M. Edelman. The Remembered Present: A Biological Theory of Consciousness. Basic books, 1989. ISBN 046506910X.

[58] G. M. Edelman and J. A. Gally. Degeneracy and complexity in biological systems. Proceedings of the National Academy of Sciences, 98(24):13763–13768, 2001. doi: 10.1073/pnas.231499798.

[59] A. Einstein. On the theory of the Brownian movement. Ann. Phys, 19(4):371–381, 1906.

[60] F. Emmert-Streib, M. Dehmer, and Y. Shi. Fifty years of graph matching, network alignment and network comparison. Information Sciences, 346-347:180 – 197, 2016. ISSN 0020-0255. doi: 10.1016/j.ins.2016.01.074.

[61] P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae, 6:290–297, 1959.

[62] S. Federhen. The NCBI Taxonomy database. Nucleic Acids Research, 40(D1):136–143, 2012. doi: 10.1093/nar/gkr1178.

[63] A. M. Feist, M. J. Herrgård, I. Thiele, J. L. Reed, and B. Palsson. Reconstruction of biochemical networks in microorganisms. Nature Reviews Microbiology, 7(2):129–143, 2009. doi: 10.1038/nrmicro1949.

[64] R. A. Fisher. The design of experiments. The American Mathematical Monthly, 43(3): 180, 1936. doi: 10.2307/2300364.

[65] S. Fortunato and D. Hric. Community detection in networks: A user guide. Physics Reports, 659:1–44, 2016. doi: 10.1016/j.physrep.2016.09.002.

[66] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008. doi: 10.1093/biostatistics/kxm045.

[67] K. Friston. Life as we know it. Journal of the Royal Society Interface, 10(86):20130475, 2013. doi: 10.1098/rsif.2013.0475.

[68] X. Gao, B. Xiao, D. Tao, and X. Li. A survey of graph edit distance. Pattern Analysis and applications, 13(1):113–129, 2010. doi: 10.1007/s10044-008-0141-y.

[69] D. Garlaschelli and M. I. Loffredo. Maximum likelihood: Extracting unbiased information from complex networks. Physical Review E, 78:015101, 2008. doi: 10.1103/PhysRevE.78.015101.

[70] S. J. Giovannoni, J. C. Thrash, and B. Temperton. Implications of streamlining theory for microbial ecology. The ISME Journal, 8(8):1553, 2014. doi: 10.1038/ismej.2014.60.

[71] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–6, 2002. doi: 10.1073/pnas.122653799.

[72] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, fourth edition, 2013. ISBN 9781421407944. URL http://www.cs.cornell.edu/cv/GVL4/golubandvanloan.htm.

[73] B. Görke and J. Stülke. Carbon catabolite repression in bacteria: Many ways to make the most out of nutrients. Nature Reviews Microbiology, 6(8):613, 2008. doi: 10.1038/nrmicro1932.

[74] A. A. Hagberg, D. A. Schult, and P. J. Swart. Exploring network structure, dynamics, and function using NetworkX. In G. Varoquaux, T. Vaught, and J. Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11 – 15, 2008. URL http://conference.scipy.org/proceedings/SciPy2008/paper_2/.

[75] C. Haldeman and J. M. Beggs. Critical branching captures activity in living neural networks and maximizes the number of metastable states. Physical Review Letters, 94: 058101, 2005. doi: 10.1103/PhysRevLett.94.058101.

[76] R. W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147–160, 1950. doi: 10.1016/S0016-0032(23)90506-5.

[77] D. K. Hammond, Y. Gur, and C. R. Johnson. Graph diffusion distance: A difference measure for weighted graphs based on the graph Laplacian exponential kernel. In 2013 IEEE Global Conference on Signal and Information Processing, pages 419–422, 2013. doi: 10.1109/GlobalSIP.2013.6736904.

[78] X. Han, Z. Shen, W. X. Wang, Y. C. Lai, and C. Grebogi. Reconstructing direct and indirect interactions in networked public goods game. Scientific Reports, 6:1–12, 2016. doi: 10.1038/srep30241.

[79] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, 2020. doi: 10.1038/s41586-020-2649-2.

[80] H. Hartle, B. Klein, S. McCabe, G. St-Onge, C. Murphy, A. Daniels, and L. Hébert-Dufresne. Network comparison and the within-ensemble graph distance. Proceedings of the Royal Society A, 476:1–18, 2020. doi: 10.1098/rspa.2019.0744.

[81] D. D. Heckathorn and C. Cameron. Network sampling: From snowball and multiplicity to respondent-driven sampling. Annual Review of Sociology, 2017. doi: 10.1146/annurev-soc-060116-053556.

[82] K. Henderson, B. Gallagher, L. Li, L. Akoglu, T. Eliassi-Rad, H. Tong, and C. Faloutsos. It’s who you know: Graph mining using recursive structural features. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, page 663–671, New York, NY, USA, 2011. Association for Computing Machinery. doi: 10.1145/2020408.2020512.

[83] D.-T. Hoang, J. Song, V. Periwal, and J. Jo. Network inference in stochastic systems from neurons to currencies: Improved performance at small sample size. Physical Review E, 99:023311, 2019. doi: 10.1103/PhysRevE.99.023311.

[84] E. Hoel, B. Klein, A. Swain, R. Griebenow, and M. Levin. Evolution leads to the emergence of higher scales: An analysis of macro-nodes in protein interactomes across the tree of life. bioRxiv:10.1101/2020.05.03.074419, 2019.

[85] E. P. Hoel. When the map is better than the territory. Entropy, 19(5):188, 2017. doi: 10.3390/e19050188.

[86] E. P. Hoel. Agent above, atom below: How agents causally emerge from their underlying microphysics. In A. Aguirre, B. Foster, and Z. Merali, editors, Wandering Towards a Goal: How Can Mindless Mathematical Laws Give Rise to Aims and Intention?, pages 63–76. Springer International Publishing, 2018. doi: 10.1007/978-3-319-75726-1_6.

[87] E. P. Hoel, L. Albantakis, and G. Tononi. Quantifying causal emergence shows that macro can beat micro. Proceedings of the National Academy of Sciences, 110(49): 19790–5, 2013. doi: 10.1073/pnas.1314922110.

[88] L. A. Hug, B. J. Baker, K. Anantharaman, C. T. Brown, A. J. Probst, C. J. Castelle, C. N. Butterfield, A. W. Hernsdorf, Y. Amano, K. Ise, Y. Suzuki, N. Dudek, D. A. Relman, K. M. Finstad, R. Amundson, B. C. Thomas, and J. F. Banfield. A new view of the tree of life. Nature Microbiology, 1(5):1–6, 2016. doi: 10.1038/nmicrobiol.2016.48.

[89] M. Ipsen and A. S. Mikhailov. Evolutionary reconstruction of networks. Physical Review E, 66(4):4, 2002. doi: 10.1103/PhysRevE.66.046109.

[90] E. M. Izhikevich. Dynamical Systems in Neuroscience. MIT press, 2007. ISBN 9780262514200.

[91] P. Jaccard. Étude de la distribution florale dans une portion des Alpes et du Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901. doi: 10.5169/seals-266450.

[92] S. J. Jackson, M. Bailey, and B. Foucault Welles. #GirlsLikeUs: Trans advocacy and community building online. New Media & Society, page 146144481770927, 2017. doi: 10.1177/1461444817709276.

[93] R. Jain, M. C. Rivera, and J. A. Lake. Horizontal gene transfer among genomes: The complexity hypothesis. Proceedings of the National Academy of Sciences, 96(7): 3801–3806, 1999. doi: 10.1073/pnas.96.7.3801.

[94] G. Jurman, R. Visintainer, and C. Furlanello. An introduction to spectral distances in networks. In N. N. W. B. Apolloni et al., editors, Neural Nets WIRN10: Proceedings of the 20th Italian Workshop on Neural Nets, volume 226, pages 227–234, 2011. doi: 10.3233/978-1-60750-692-8-227.

[95] G. Jurman, R. Visintainer, M. Filosi, S. Riccadonna, and C. Furlanello. The HIM glocal metric and kernel for network comparison and classification. Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015, pages 1–10, 2015. doi: 10.1109/DSAA.2015.7344816.

[96] B. Karrer, E. Levina, and M. E. J. Newman. Robustness of community structure in networks. Physical Review E, 77(4):1–9, 2007. doi: 10.1103/PhysRevE.77.046119.

[97] B. Klein and E. P. Hoel. The emergence of informative higher scales in complex networks. Complexity, 2020. doi: 10.1155/2020/8932526.

[98] A. Koseska and P. I. H. Bastiaens. Cell signaling as a cognitive process. The EMBO Journal, 36(5):568–582, 2017. doi: 10.15252/embj.201695383.

[99] D. Koutra, J. T. Vogelstein, and C. Faloutsos. DELTACON: A principled massive-graph similarity function. ACM Transactions on Knowledge Discovery from Data, 10(3):162–170, 2016. doi: 10.1145/2824443.

[100] M. A. Kramer, U. T. Eden, S. S. Cash, and E. D. Kolaczyk. Network inference with confidence from multivariate time series. Physical Review E, 79(061916):1–13, 2009. doi: 10.1103/PhysRevE.79.061916.

[101] P. L. Krapivsky, S. Redner, and F. Leyvraz. Connectivity of growing random networks. Physical Review Letters, 85(21):4629–4632, 2000. doi: 10.1103/PhysRevLett.85.4629.

[102] J. Kunegis. KONECT - the Koblenz network collection. Proceedings of the 22nd International Conference on World Wide Web Companion, pages 1343–1350, 2013. doi: 10.1145/2487788.2488173.

[103] R. Lambiotte, M. Rosvall, and I. Scholtes. From networks to optimal higher-order models of complex systems. Nature Physics, 2019. doi: 10.1038/s41567-019-0459-y.

[104] N. Lane. Energetics and genetics across the prokaryote-eukaryote divide. Biology Direct, 6(1):35, 2011. doi: 10.1186/1745-6150-6-35.

[105] E. Laurence, N. Doyon, L. J. Dubé, and P. Desrosiers. Spectral dimension reduction of complex dynamical networks. Physical Review X, 9(1):011042, 2019. doi: 10.1103/PhysRevX.9.011042.

[106] O. Ledoit and M. Wolf. Honey, I shrunk the sample covariance matrix. SSRN, pages 1–22, 2003. doi: 10.2139/ssrn.433840.

[107] X. S. Liang. Unraveling the cause-effect relation between time series. Physical Review E, 90(5), 2014. doi: 10.1103/PhysRevE.90.052150.

[108] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991. doi: 10.1109/18.61115.

[109] J. T. Lizier. JIDT: An information-theoretic toolkit for studying the dynamics of complex systems. Frontiers in Robotics and AI, pages 1–11, 2014. doi: 10.3389/frobt.2014.00011.

[110] J. Lukeš, J. M. Archibald, P. J. Keeling, W. F. Doolittle, and M. W. Gray. How a neutral evolutionary ratchet can build cellular complexity. IUBMB Life, 63(7):528–537, 2011. doi: 10.1002/iub.489.

[111] C. W. Lynn, L. Papadopoulos, A. E. Kahn, and D. S. Bassett. Human information processing in complex networks. Nature Physics, 16(9):965–973, 2020. doi: 10.1038/s41567-020-0924-7.

[112] W. Marshall, L. Albantakis, and G. Tononi. Black-boxing and cause-effect power. PLoS Computational Biology, 14(4):1–21, 2018. doi: 10.1371/journal.pcbi.1006114.

[113] W. Martin and E. V. Koonin. Introns and the origin of nucleus–cytosol compartmentalization. Nature, 440(7080):41–45, 2006. doi: 10.1038/nature04531.

[114] V. Marx. The big challenges of big data. Nature, 498(7453):255–260, 2013. doi: 10.1038/498255a.

[115] N. Masuda and P. Holme. Detecting sequences of system states in temporal networks. Scientific Reports, 9(1):1–11, 2019. doi: 10.1038/s41598-018-37534-2.

[116] N. Masuda, M. A. Porter, and R. Lambiotte. Random walks and diffusion on networks. Physics Reports, 716-717:1–58, 2017. doi: 10.1016/j.physrep.2017.07.007.

[117] C. Mattsson. Networks of monetary flow at native resolution. arXiv:1910.05596, 2019. URL https://arxiv.org/abs/1910.05596.

[118] E. Mayr. The Growth of Biological Thought: Diversity, Evolution, and Inheritance. Belknap Press of Harvard University Press, 1982. ISBN 0674364465.

[119] S. McCabe, L. Torres, T. LaRock, S. A. Haque, C.-H. Yang, H. Hartle, and B. Klein. netrd: A library for network reconstruction and graph distances. arXiv:2010.16019, 2020. URL http://arxiv.org/abs/2010.16019.

[120] P. A. Mediano, F. Rosas, R. L. Carhart-Harris, A. K. Seth, and A. B. Barrett. Beyond integrated information: A taxonomy of information dynamics phenomena. arXiv:1909.02297, 2019. URL https://arxiv.org/abs/1909.02297.

[121] M. Meilă. Comparing clusterings: An information based distance. Journal of Multivariate Analysis, 98(5):873–895, 2007. doi: 10.1016/j.jmva.2006.11.013.

[122] A. Mellor and A. Grusovin. Graph comparison via the nonbacktracking spectrum. Physical Review E, 99:052309, 2019. doi: 10.1103/PhysRevE.99.052309.

[123] Y. Mishchenko, J. T. Vogelstein, and L. Paninski. A Bayesian approach for inferring neuronal connectivity from calcium fluorescent imaging data. Annals of Applied Statistics, 5(2B):1229–1261, 2011. doi: 10.1214/09-AOAS303.

[124] C. Moler and C. Van Loan. Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review, 45(1):3–49, 2003. doi: 10.1137/S00361445024180.

[125] M. Molloy and B. Reed. A critical point for random graphs with a given degree sequence. Random Structures and Algorithms, 6(2-3):161–180, 1995. doi: 10.1002/rsa.3240060204.

[126] N. D. Monnig and F. G. Meyer. The resistance perturbation distance: A metric for the analysis of dynamic networks. Discrete Applied Mathematics, 236:347–386, 2018. doi: 10.1016/j.dam.2017.10.007.

[127] R. Moran, D. A. Pinotsis, and K. J. Friston. Neural masses and fields in dynamic causal modeling. Frontiers in Computational Neuroscience, 7(May):1–12, 2013. doi: 10.3389/fncom.2013.00057.

[128] C. Müller-Crepon, P. Hunziker, and L.-E. Cederman. Roads to rule, roads to rebel: Relational state capacity and conflict in Africa. Journal of Conflict Resolution, October 2020. doi: 10.1177/0022002720963674.

[129] M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5):323–351, 2005. doi: 10.1080/00107510500052444.

[130] M. E. J. Newman. Networks: An Introduction. Oxford University Press, second edition, 2018. ISBN 9780191594175. doi: 10.1093/acprof:oso/9780199206650.001.0001.

[131] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):1–15, 2004. doi: 10.1103/PhysRevE.69.026113.

[132] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. In Proceedings of the 7th International World Wide Web Conference, pages 161–172, 1998. URL citeseer.nj.nec.com/page98pagerank.html.

[133] V. Y. Pan and Z. Q. Chen. The complexity of the matrix eigenproblem. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, page 507–516, New York, NY, USA, 1999. Association for Computing Machinery. ISBN 1581130678. doi: 10.1145/301250.301389.

[134] F. Parisi, T. Squartini, and D. Garlaschelli. A faster horse on a safer trail: Generalized inference for the efficient reconstruction of weighted networks. New Journal of Physics, 22(5):053053, 2020. doi: 10.1088/1367-2630/ab74a7.

[135] J. Park and M. E. J. Newman. Statistical mechanics of networks. Physical Review E, 70:066117, 2004. doi: 10.1103/PhysRevE.70.066117.

[136] J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669, 1995. doi: 10.1093/biomet/82.4.669.

[137] J. Pearl. Causality. New York: Cambridge, 2000. doi: 10.1017/CBO9780511803161.

[138] T. P. Peixoto. Network reconstruction and community detection from dynamics. Physical Review Letters, 123:128301, 2019. doi: 10.1103/PhysRevLett.123.128301.

[139] M. Penrose. Random Geometric Graphs. Oxford University Press, 2003. doi: 10.1007/978-3-319-20565-6_9.

[140] N. Perra, B. Gonçalves, R. Pastor-Satorras, and A. Vespignani. Activity driven modeling of time varying networks. Scientific Reports, 2:1–7, 2012. doi: 10.1038/srep00469.

[141] K. R. Popper. The Logic of Scientific Discovery. Hutchinson and Co., London and New York, 1934. ISBN 0203994620.

[142] V. Priesemann, M. Wibral, M. Valderrama, R. Pröpper, M. Le Van Quyen, T. Geisel, J. Triesch, D. Nikolić, and M. H. J. Munk. Spike avalanches in vivo suggest a driven, slightly subcritical brain state. Frontiers in Systems Neuroscience, 8:108, 2014. doi: 10.3389/fnsys.2014.00108.

[143] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. Proceedings of the National Academy of Sciences, 101(9):2658–2663, 2004. doi: 10.1073/pnas.0400054101.

[144] F. A. Rodrigues, T. K. D. M. Peron, P. Ji, and J. Kurths. The Kuramoto model in complex networks. Physics Reports, 610:1–98, 2016. doi: 10.1016/j.physrep.2015.10.008.

[145] T. Rolland, M. Taşan, B. Charloteaux, S. J. Pevzner, Q. Zhong, N. Sahni, S. Yi, I. Lemmens, C. Fontanillo, R. Mosca, A. Kamburov, S. D. Ghiassian, X. Yang, L. Ghamsari, D. Balcha, B. E. Begg, P. Braun, M. Brehme, M. P. Broly, A. R. Carvunis, D. Convery-Zupan, R. Corominas, J. Coulombe-Huntington, E. Dann, M. Dreze, A. Dricot, C. Fan, E. Franzosa, F. Gebreab, B. J. Gutierrez, M. F. Hardy, M. Jin, S. Kang, R. Kiros, G. N. Lin, K. Luck, A. Macwilliams, J. Menche, R. R. Murray, A. Palagi, M. M. Poulin, X. Rambout, J. Rasla, P. Reichert, V. Romero, E. Ruyssinck, J. M. Sahalie, A. Scholz, A. A. Shah, A. Sharma, Y. Shen, K. Spirohn, S. Tam, A. O. Tejeda, S. A. Trigg, J. C. Twizere, K. Vega, J. Walsh, M. E. Cusick, Y. Xia, A. L. Barabási, L. M. Iakoucheva, P. Aloy, J. De Las Rivas, J. Tavernier, M. A. Calderwood, D. E. Hill, T. Hao, F. P. Roth, and M. Vidal. A proteome-scale map of the human interactome network. Cell, 159(5):1212–1226, 2014. doi: 10.1016/j.cell.2014.10.050.

[146] F. E. Rosas, P. A. Mediano, H. J. Jensen, A. K. Seth, A. B. Barrett, R. L. Carhart-Harris, and D. Bor. Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data. arXiv:2004.08220, 2020. URL https://arxiv.org/abs/2004.08220.

[147] L. Rossi, A. Torsello, E. R. Hancock, and R. C. Wilson. Characterizing graph symmetries through quantum Jensen-Shannon divergence. Physical Review E, 88(3):032806, 2013. doi: 10.1103/PhysRevE.88.032806.

[148] L. Rossi, A. Torsello, and E. R. Hancock. Measuring graph similarity through continuous-time quantum walks and the quantum Jensen-Shannon divergence. Physical Review E, 91(2):022815, 2015. doi: 10.1103/PhysRevE.91.022815.

[149] R. A. Rossi and N. K. Ahmed. NetworkRepository: An interactive data repository with multi-scale visual analytics. SIGKDD Exploration Newsletter, 17(2):37–41, 2016. ISSN 1931-0145. doi: 10.1145/2897350.2897355.

[150] M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4):1118–23, 2008. doi: 10.1073/pnas.0706851105.

[151] J. F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. F. Berriz, F. D. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, N. Klitgord, C. Simon, M. Boxem, S. Milstein, J. Rosenberg, D. S. Goldberg, L. V. Zhang, S. L. Wong, G. Franklin, S. Li, J. S. Albala, J. Lim, C. Fraughton, E. Llamosas, S. Cevik, C. Bex, P. Lamesch, R. S. Sikorski, J. Vandenhaute, H. Y. Zoghbi, A. Smolyar, S. Bosak, R. Sequerra, L. Doucette-Stamm, M. E. Cusick, D. E. Hill, F. P. Roth, and M. Vidal. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062):1173–1178, 2005. doi: 10.1038/nature04209.

[152] P. K. Rubenstein, S. Weichwald, S. Bongers, J. M. Mooij, D. Janzing, M. Grosse-Wentrup, and B. Schölkopf. Causal consistency of structural equation models. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI). Association for Uncertainty in Artificial Intelligence (AUAI), 2017. URL http://auai.org/uai2017/proceedings/papers/11.pdf.

[153] J. Runge. Causal network reconstruction from time series: From theoretical assumptions to practical estimation. Chaos, 28(7), 2018. doi: 10.1063/1.5025050.

[154] J. Runge, P. Nowack, M. Kretschmer, S. Flaxman, and D. Sejdinovic. Detecting and quantifying causal associations in large nonlinear time series datasets. Science Advances, 5(11):eaau4996, 2019. doi: 10.1126/sciadv.aau4996.

[155] T. A. Schieber, L. C. Carpi, A. Díaz-Guilera, P. M. Pardalos, C. Masoller, and M. G. Ravetti. Quantification of network structural dissimilarities. Nature Communications, 8(13928):1–10, 2017. doi: 10.1038/ncomms13928.

[156] I. Scholtes. When is a network a network? Multi-order graphical model selection in pathways and temporal networks. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1037–1046, 2017. doi: 10.1145/3097983.3098145.

[157] S. G. Shandilya and M. Timme. Inferring network topology from complex dynamics. New Journal of Physics, 13, 2011. doi: 10.1088/1367-2630/13/1/013004.

[158] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 1948. doi: 10.1145/584091.584093.

[159] A. Sheikhattar, S. Miran, J. Liu, J. B. Fritz, S. A. Shamma, P. O. Kanold, and B. Babadi. Extracting neuronal functional network dynamics via adaptive Granger causality analysis. Proceedings of the National Academy of Sciences, page 201718154, 2018. doi: 10.1073/pnas.1718154115.

[160] D. Sherrington and S. Kirkpatrick. Solvable model of a spin-glass. Physical Review Letters, 35:1792–1796, 1975. doi: 10.1103/PhysRevLett.35.1792.

[161] M. Small, J. Zhang, and X. Xu. Transforming time series into complex networks. Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, 5(2):2078–2089, 2009. doi: 10.1007/978-3-642-02469-6_84.

[162] T. Squartini, A. Gabrielli, D. Garlaschelli, T. Gili, A. Bifone, and F. Caccioli. Complexity in neural and financial systems: From time-series to networks. Complexity, pages 1–2, 2018. doi: 10.1155/2018/3132940.

[163] N. Stanley, R. Kwitt, M. Niethammer, and P. J. Mucha. Compressing networks with super nodes. Scientific Reports, 8(10892):1–14, 2018. doi: 10.1038/s41598-018-29174-3.

[164] O. Stetter, D. Battaglia, J. Soriano, and T. Geisel. Model-free reconstruction of excitatory neuronal connectivity from calcium imaging signals. PLoS Computational Biology, 8(8), 2012. doi: 10.1371/journal.pcbi.1002653.

[165] C. F. Stevens. Neurophysiology: A primer. John Wiley & Sons, 1966. ISBN 9780471824367.

[166] G. Sugihara, R. May, H. Ye, C.-h. Hsieh, E. Deyle, M. Fogarty, and S. Munch. Detecting causality in complex ecosystems. Science, 338:496–500, 2012. doi: 10.1126/science.1227079.

[167] W. A. Sutherland. Introduction to Metric and Topological Spaces. Oxford University Press, 2009.

[168] D. Szklarczyk, A. Franceschini, M. Kuhn, M. Simonovic, A. Roth, P. Minguez, T. Doerks, M. Stark, J. Muller, P. Bork, L. J. Jensen, and C. Von Mering. The STRING database in 2011: Functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Research, 39(SUPPL. 1):561–568, 2011. doi: 10.1093/nar/gkq973.

[169] D. Szklarczyk, J. H. Morris, H. Cook, M. Kuhn, S. Wyder, M. Simonovic, A. Santos, N. T. Doncheva, A. Roth, P. Bork, L. J. Jensen, and C. Von Mering. The STRING database in 2017: Quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Research, 45(D1):D362–D368, 2017. doi: 10.1093/nar/gkw937.

[170] M. Timme and J. Casadiego. Revealing networks from dynamics: An introduction. Journal of Physics A, 47(34), 2014. doi: 10.1088/1751-8113/47/34/343001.

[171] G. Tononi. An information integration theory of consciousness. BMC Neuroscience, 5:1–22, 2004. doi: 10.1186/1471-2202-5-42.

[172] G. Tononi and O. Sporns. Measuring information integration. BMC Neuroscience, 4(1):31, 2003. doi: 10.1186/1471-2202-4-31.

[173] G. Tononi, G. M. Edelman, and O. Sporns. Complexity and coherency: Integrating information in the brain. Trends in Cognitive Sciences, 2(12):474–484, 1998. doi: 10.1016/S1364-6613(98)01259-5.

[174] G. Tononi, O. Sporns, and G. M. Edelman. Measures of degeneracy and redundancy in biological networks. Proceedings of the National Academy of Sciences, 96(6):3257–3262, 1999. doi: 10.1073/pnas.96.6.3257.

[175] L. Torres, P. Suárez-Serrato, and T. Eliassi-Rad. Non-backtracking cycles: Length spectrum theory and graph mining applications. Applied Network Science, 4(1):41, 2019. doi: 10.1007/s41109-019-0147-y.

[176] L. S. Tsimring. Noise in biology. Reports on Progress in Physics, 77(2), 2014. doi: 10.1088/0034-4885/77/2/026601.

[177] A. Tsitsulin, D. Mottin, P. Karras, A. Bronstein, and E. Müller. NetLSD: Hearing the shape of a graph. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2347–2356, 2018. doi: 10.1145/3219819.3219991.

[178] M. van Steen. Graph Theory and Complex Networks: An Introduction. Maarten van Steen, 2010. ISBN 978-90-815406-1-2.

[179] F. Váša, E. T. Bullmore, and A. X. Patel. Probabilistic thresholding of functional connectomes: Application to schizophrenia. NeuroImage, 172:326–340, 2018. doi: 10.1016/j.neuroimage.2017.12.043.

[180] A. Vespignani, M. Barthélemy, and A. Barrat. Dynamical Processes on Complex Networks. Cambridge University Press, 2008. doi: 10.1080/00107510903084036.

[181] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.

[182] A. Wald. An essentially complete class of admissible decision functions. The Annals of Mathematical Statistics, 18(4):549–555, 1947. ISSN 0003-4851. doi: 10.1214/aoms/1177730345.

[183] W. Wallis, P. Shoubridge, M. Kraetz, and D. Ray. Graph distances using graph union. Pattern Recognition Letters, 22:701–704, 2001. doi: 10.1016/S0167-8655(01)00022-8.

[184] D. J. Watts and S. H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998. doi: 10.1038/30918.

[185] D. J. Weiss, A. Nelson, H. S. Gibson, W. Temperley, S. Peedell, A. Lieber, M. Hancher, E. Poyart, S. Belchior, N. Fullman, B. Mappin, U. Dalrymple, J. Rozier, T. C. D. Lucas, R. E. Howes, L. S. Tusting, S. Y. Kang, E. Cameron, D. Bisanzio, K. E. Battle, S. Bhatt, and P. W. Gething. A global map of travel time to cities to assess inequalities in accessibility in 2015. Nature, 553(7688):333–336, 2018. doi: 10.1038/nature25181.

[186] J. M. Whitacre. Degeneracy: A link between evolvability, robustness and complexity in biological systems. Theoretical Biology and Medical Modelling, 7(1):1–17, 2010. doi: 10.1186/1742-4682-7-6.

[187] P. Wills. NetComp, 2017. URL https://github.com/peterewills/NetComp.

[188] P. Wills and F. G. Meyer. Metrics for graph comparison: A practitioner’s guide. PLoS ONE, 15(2):1–54, 2020. doi: 10.1371/journal.pone.0228728.

[189] R. C. Wilson and P. Zhu. A study of graph spectra for comparing graphs and trees. Pattern Recognition, 41(9):2833–2841, 2008. doi: 10.1016/j.patcog.2008.03.011.

[190] J. Xu, T. L. Wickramarathne, and N. V. Chawla. Representing higher-order dependencies in networks. Science Advances, 2(5):e1600028, 2016. doi: 10.1126/sciadv.1600028.

[191] Y. Yang, T. Luo, Z. Li, X. Zhang, and P. S. Yu. A robust method for inferring network structures. Scientific Reports, 7(1):1–12, 2017. doi: 10.1038/s41598-017-04725-2.

[192] H. Yin, A. R. Benson, and J. Leskovec. Higher-order clustering in networks. Physical Review E, 97(5):1–11, 2018. doi: 10.1103/PhysRevE.97.052306.

[193] H. L. Zeng, M. Alava, E. Aurell, J. Hertz, and Y. Roudi. Maximum likelihood reconstruction for Ising models with asynchronous updates. Physical Review Letters, 110(21):1–5, 2013. doi: 10.1103/PhysRevLett.110.210601.

[194] M. Zitnik, R. Sosič, M. W. Feldman, and J. Leskovec. Evolution of resilience in protein interactomes across the tree of life. Proceedings of the National Academy of Sciences, 116(10):4426–4433, 2019. doi: 10.1073/pnas.1818013116.

Appendix

6.1 Chapter 2 Appendix

6.1.1 Table of key terms

A table of key terms can be found in Table 6.1.

6.1.2 Effective information calculation

Mathematically, EI has been expressed in a number of previous ways. The first was as the mutual information between two subsets of a system (while injecting noise into one), originally proposed as a step in the calculation of integrated information between neuron-like elements [171; 11]. More recently, it was pointed out that in general an intervention distribution, ID, defined as a probability distribution over the do(x) operator (as in [137]), creates some resultant effect distribution, ED. Then the EI is the mutual information, I(ID; ED), between the two, when the interventions are done like a randomized trial to reveal the dependencies (i.e., at maximum entropy [64; 85]).

In order to calculate the total information contained in the causal relationships of a system, EI is applied to the system as a whole [87]. There, EI was defined over the set of all states of a system and its state transitions. Because the adjacency matrix of a network can be cast as a transition matrix (as in Fig. 6.1a), the EI of a network can be expressed as:

$$EI = \frac{1}{N}\sum_{i=1}^{N} D_{KL}\!\left[W_i^{out}\,\Big\|\,\langle W_i^{out}\rangle\right] \tag{7}$$

where EI is the average of the effect information, EI_i, of each node (see Table 6.1 and Fig. 6.1b).

Term | Description | Notation
Network size | the number of nodes in the network | N
Out-weight vector (v_i) | a vector of probabilities w_ij that a random walker on node v_i will transition to v_j | W_i^out = {w_i1, w_i2, ..., w_ij, ..., w_iN}
Effective information (network) | the total information in a causal structure, in bits | EI = H(⟨W_i^out⟩) − ⟨H(W_i^out)⟩
Determinism (v_i) | how certain about next steps is a random walker on v_i | det_i = log2(N) − H(W_i^out)
Degeneracy (network) | how distributed the certainty is over the nodes of the network | degeneracy = log2(N) − H(⟨W_i^out⟩)
Effect information (v_i) | the contribution of each node v_i to the network's EI | EI_i = D_KL[W_i^out ‖ ⟨W_i^out⟩]
Micro-nodes in a macro-node | a set of micro-nodes grouped into a macro-node in the new network, G_M | S = {v_i, v_j, ...}, of length N_S
Macro-node out-weights | out-weights from macro-node, μ, to its neighbors | W_μ^out = Σ_{i∈S} W_i^out · (1/N_S)
Macro-node out-weights given input weights | out-weights from macro-node, μ, to its neighbors, conditioned on input weights to the micro-nodes, v_i ∈ S | W_{μ|j}^out = Σ_{i∈S} W_i^out · (w_ji / Σ_{j→k∈S} w_jk)
Macro-node out-weights given the stationary distribution | out-weights from macro-node, μ, to its neighbors, conditioned on the stationary probabilities, π_i, of micro-nodes, v_i ∈ S | W_{μ|π}^out = Σ_{i∈S} W_i^out · (π_i / Σ_{k∈S} π_k)

Table 6.1: Table of key terms. Quantities needed in order to calculate EI and create consistent macro-nodes.

This is equivalent to our derivation of EI from first principles in Eq. 1, since:

$$EI = \frac{1}{N}\sum_{i=1}^{N} D_{KL}\!\left[W_i^{out}\,\Big\|\,\langle W_i^{out}\rangle\right] = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\log_2\!\left(\frac{w_{ij}}{W_j}\right)$$

$$= \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\log_2\!\left(w_{ij}\right) - \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}\log_2\!\left(W_j\right) \tag{8}$$

[Figure 6.1 data. Panel (a): the network's transition matrix from time t (rows) to t + 1 (columns), with column sums and the resulting ⟨W_i^out⟩:

        A      B      C      D      E
A    | 0.00   0.00   0.00   0.50   0.50 | = W_A^out
B    | 0.33   0.00   0.33   0.33   0.00 | = W_B^out
C    | 0.00   0.50   0.00   0.50   0.00 | = W_C^out
D    | 0.00   0.00   0.00   0.00   1.00 | = W_D^out
E    | 0.50   0.00   0.00   0.50   0.00 | = W_E^out
Σ_i w_ij:   0.83   0.50   0.33   1.83   1.50
⟨W_i^out⟩:  0.17   0.10   0.07   0.37   0.30

Panel (b): the effect information of each node, D_KL[W_i^out ‖ ⟨W_i^out⟩] = 0.592 (A), 1.061 (B), 1.385 (C), 1.737 (D), and 1.016 (E), so that EI = (0.592 + 1.061 + 1.385 + 1.737 + 1.016)/5 = 1.158 bits.]

Figure 6.1: Illustration of the calculation of effective information. (A) The adjacency matrix of a network with 1.158 bits of EI (calculation shown in (B)). The rows correspond to W_i^out, a vector of probabilities that a random walker on node v_i at time t transitions to v_j in the following time step, t + 1. ⟨W_i^out⟩ represents the (normalized) input weight distribution of the network, that is, the probabilities that a random walker will arrive at a given node v_j at t + 1, after a uniform introduction of random walkers into the network at t. (B) Each node's contribution to the EI (EI_i) is the KL divergence of its W_i^out vector from the network's ⟨W_i^out⟩, known as the effect information.

Note that for a given node, v_i, the term in the first summation in Eq. 8 above, Σ_{j=1}^{N} w_ij log2(w_ij), is equivalent to the negative entropy of the out-weights from v_i, −H(W_i^out). Also note that W_j, the j-th element in the ⟨W_i^out⟩ vector, is the normalized sum of the incoming weights to v_j from its neighbors, v_i, such that W_j = (1/N) Σ_{i=1}^{N} w_ij. We substitute these two terms into Eq. 8 above such that:

$$EI = \frac{1}{N}\sum_{i=1}^{N}\left[-H(W_i^{out})\right] - \sum_{j=1}^{N} W_j\log_2\!\left(W_j\right) \tag{9}$$

This is equivalent to the formulation of EI from Eq. 1, since H(⟨W_i^out⟩) = −Σ_{j=1}^{N} W_j log2(W_j):

$$EI = H\!\left(\langle W_i^{out}\rangle\right) - \langle H(W_i^{out})\rangle \tag{1}$$

In the derivations of SM 6.1.3 we adopt the relative entropy formulation of EI from Eq. 7 for ease of derivation. For a visual intuition behind the calculations involved in this formulation of EI, see how the network in Fig. 6.1a is used to calculate its EI (Fig. 6.1b), by calculating the mean effect information, EI_i, of nodes in the network.
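To make this computation concrete, here is a minimal sketch in Python (NumPy only; the function name effective_information is illustrative and is not the netrd API). It reproduces the five-node example of Fig. 6.1:

```python
import numpy as np

def effective_information(A):
    # Row-normalize the adjacency matrix so that row i is W_i^out (Eq. 7).
    W = A / A.sum(axis=1, keepdims=True)
    W_avg = W.mean(axis=0)  # <W_i^out>, the average out-weight vector
    # Effect information of each node: EI_i = D_KL[W_i^out || <W_i^out>]
    ei_i = [sum(w * np.log2(w / W_avg[j]) for j, w in enumerate(row) if w > 0)
            for row in W]
    return float(np.mean(ei_i))  # EI is the mean effect information

# The network from Fig. 6.1a, which should give EI = 1.158 bits:
A = np.array([[0, 0, 0, 1, 1],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 0, 0, 1],
              [1, 0, 0, 1, 0]], dtype=float)
print(round(effective_information(A), 3))  # 1.158
```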

6.1.3 Effective information of common network structures

Here we inspect the EI for iconic graphical structures, and in doing so, we see several interesting relationships between a network structure and its EI. First, however, we will introduce key terminology and assumptions.

Let ⟨k⟩ be the average degree of a network, G, where each node, v_i, has degree k_i. In directed graphs each v_i has an in-degree, k_i^in, and an out-degree, k_i^out. These correspond to the number of edges leading in to v_i and edges going out from v_i. The total number of edges in G is represented by E. In undirected Erdős-Rényi (ER) networks, the total number of edges is given by E = p·N(N−1)/2, where p represents the probability that any two nodes, v_i and v_j, will be connected. In the following subsections, we derive the EI of several prototypical network structures, from random graphs to ring lattices to star networks. Note that for the following derivations we proceed from the relative entropy formalism from SM 6.1.2, and note that therefore N is the number of nodes with output, N = N_out.

Derivation: effective information of ER networks

In Erdős-Rényi networks, EI does not depend on the number of nodes in the network, N. Instead, the network's EI reaches its maximum at −log2(p). This is because in ER networks, each node is expected to connect to ⟨k⟩ = pN neighboring nodes, such that every value in W_i^out is 1/⟨k⟩ and every value in ⟨W_i^out⟩ is ⟨k⟩/(N⟨k⟩) = 1/N, which can be represented as:

$$EI_{ER} = \frac{1}{N}\sum_i D_{KL}\!\left[\left(\frac{1}{\langle k\rangle}, \ldots, \frac{1}{\langle k\rangle}\right)\Big\|\left(\frac{1}{N}, \ldots, \frac{1}{N}\right)\right]$$

Each node in an ER network is expected to be identical to all other nodes in the network, and calculating the expected effect information, EI_i, is equivalent to calculating the network's EI. As such, we observe:

$$EI_i = \sum_{j=1}^{\langle k\rangle}\frac{1}{\langle k\rangle}\cdot\log_2\!\left(\frac{1/\langle k\rangle}{1/N}\right) = \log_2\!\left(\frac{N}{pN}\right)$$

$$EI_{ER} = \frac{1}{N}\sum_i -\log_2(p) = -\log_2(p) \tag{10}$$


Figure 6.2: Effective information of stars and rings. As the number of nodes in star networks increases, the EI approaches zero, while the EI of ring lattice networks grows logarithmically as the number of nodes increases.

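As a quick numerical sanity check of Eq. 10 (a sketch not present in the original text; it samples one dense directed ER network and assumes no node is isolated):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 0.1
A = (rng.random((n, n)) < p).astype(float)  # directed ER adjacency matrix
np.fill_diagonal(A, 0)

W = A / A.sum(axis=1, keepdims=True)        # W_i^out for every node
W_avg = W.mean(axis=0)                      # <W_i^out>
mask = W > 0
EI = (W[mask] * np.log2(W[mask] / np.broadcast_to(W_avg, W.shape)[mask])).sum() / n
print(EI, -np.log2(p))  # both approximately 3.32 bits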

Derivation: effective information of ring-lattice and star networks

Here, we compare two classes of networks with the same average degree—ring lattice networks and star, or hub-and-spoke, graphs (see Fig. 6.2). In each network, we assume an average degree ⟨k⟩ = 2d, with d being the dimension. The EI of a star network, EI_star, approaches 0.0 as N gets larger, while the EI of ring lattices approaches log2(N) − log2(2d). These derivations are shown below, first for the d-dimensional ring lattice, EI_d.

As every node in a ring lattice is connected to its 2d neighbors, each element of W_i^out is 1/(2d) and each element of ⟨W_i^out⟩ is 2d/(2d × N) = 1/N.

$$EI_d = \frac{1}{N}\sum_i D_{KL}\!\left[\left(\frac{1}{2d}, \ldots, \frac{1}{2d}\right)\Big\|\left(\frac{1}{N}, \ldots, \frac{1}{N}\right)\right] \tag{11}$$

Each node in a d-dimensional ring lattice is expected to be identical, so calculating the expected effect information, EI_i, is equivalent to calculating the network's EI. As such, we observe:

$$EI_i = \sum_{j=1}^{2d}\frac{1}{2d}\cdot\log_2\!\left(\frac{1/(2d)}{1/N}\right) = \log_2\!\left(\frac{N}{2d}\right)$$

$$EI_d = \log_2(N) - \log_2(2d) \tag{12}$$

Note: the EI of ring lattice networks reduces to simply the determinism of the network. The EI of ring lattice networks scales logarithmically with the size of the network, which is contrasted by the behavior of EI in star networks. Star networks have a hub-and-spoke structure, where N − 1 nodes of degree k_spoke = 1 are connected to a hub node, which itself has degree k_hub = N − 1. For star networks, EI approaches 0.0 as the number of nodes increases. This derivation is shown below.

$$EI_{star} = \frac{1}{N}\left[\sum_{i=1}^{N-1} D_{KL}\!\left[W_{spoke}^{out}\,\Big\|\,\langle W_i^{out}\rangle\right] + D_{KL}\!\left[W_{hub}^{out}\,\Big\|\,\langle W_i^{out}\rangle\right]\right]$$

Every spoke has an out-weight vector W_i^out with N − 1 elements of w_ij = 0.0 and one with w_ij = 1.0. The single hub, however, has N − 1 elements of w_ij = 1/(N − 1) with a single w_ij = 0.0. Similarly, ⟨W_i^out⟩ consists of N − 1 elements with values 1/(N(N − 1)).

$$EI_{star} = \frac{1}{N}\left[\sum_{i=1}^{N-1} D_{KL}\!\left[1\,\Big\|\,\frac{N-1}{N}\right] + D_{KL}\!\left[\frac{1}{N-1}\,\Big\|\,\frac{1}{N(N-1)}\right]\right] \tag{13}$$


Figure 6.3: Effective Information of network motifs. All directed 3-node subgraphs and their EI.

Using the same techniques as above, this equation reduces to:

$$EI_{star} = \frac{1}{N}\left[(N-1)\cdot\log_2\!\left(\frac{1}{(N-1)/N}\right) + \log_2\!\left(\frac{1/(N-1)}{1/(N(N-1))}\right)\right]$$

$$EI_{star} = \frac{N-1}{N}\cdot\log_2\!\left(\frac{N}{N-1}\right) + \frac{1}{N}\cdot\log_2\!\left(N\right)$$

$$\lim_{N\to\infty} EI_{star} = 0.0 \tag{14}$$
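As a worked numerical illustration of Eqs. 12 and 14 (an added example, not part of the original derivation): for N = 1000 nodes, a one-dimensional ring lattice (d = 1, so ⟨k⟩ = 2) has

$$EI_d = \log_2(1000) - \log_2(2) \approx 9.97 - 1 = 8.97 \text{ bits},$$

while the star network of the same size has

$$EI_{star} = \frac{999}{1000}\log_2\!\left(\frac{1000}{999}\right) + \frac{1}{1000}\log_2(1000) \approx 0.0014 + 0.0100 \approx 0.011 \text{ bits},$$

consistent with the two curves in Fig. 6.2.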

6.1.4 Network motifs as causal relationships

It is important to understand why certain motifs have more EI while others have less. In Fig. 6.3, we show the EI in 13 directed three-node network motifs. The connectivity of each motif drastically influences the EI. Motif 09—the directed cycle—is the motif with the highest EI. Intuitively, this fits with our definition of EI: the amount of certainty in the network (notably, each link in Motif 09, if taken to represent a causal relationship, is both necessary and sufficient). A random walker in this system has zero entropy (even if the direction of its path were reversed), whereas every other three-node motif does not contain that degree of certainty. Second, we see that Motif 04—a system with a "sink" node—has no EI, suggesting that a causal structure with that architecture is not informative, since all causes lead to the same effect. Similarly, because there are no outputs from two nodes in Motif 01, we see an EI value of zero.


Figure 6.4: Effectiveness of real networks. Full data behind the results summarized in Fig. 2.3, color-coded in two ways. First by 16 “Domains” (as in Table 6.2), which corresponds to the classification of each network from its source repository (in this case, the Konect database [102] or the Network Repository [149]). The second categorization we report—those used in Fig. 2.3—involves grouping the Domains into four “Categories” (“Cat.” in Table 6.2): Biological, Information, Social, and Technological. These correspond to the colored squares to the right of each network’s name.


Figure 6.5: Typically minimal inconsistency of higher-order macro-nodes. Each inset is of the microscale network, where each node's color corresponds to the μ|π macro-node it has been mapped to following one instance of the greedy algorithm detailed in the Materials & Methods section. White nodes indicate a micro-node that was not grouped into a new macro-node. Inconsistency is plotted over time.

6.1.5 Table of network data

In Table 6.2, we report the name, domain, source, category, and description of each of the 84 networks used in our comparison of EI in real networks. These networks were selected primarily from the Konect database [102], with supplemental datasets added from NetworkRepository [149] when the Konect database lacked a sufficient number of datasets in a given category, since the two databases already significantly overlapped. In many cases, the interactions among nodes in these networks (i.e., their edges) can reasonably be interpreted as causal, directed influence, or dependencies, such that the behavior of a node, v_i, at a given time can be thought to impact the behavior of its neighbors, v_j. By instituting relatively minimal requirements for selecting the above networks, we are able to assess the EI in a variety of complex systems across different domains. However, while we can measure the EI of any given network, the further interpretation of this EI depends also on what the nodes and edges of a network represent. In a case where the nodes represent states of a system, such as a Markov process, the EI directly captures the information in the causal structure. In the case where the nodes represent merely dependencies or influence, EI can still be informative as a metric to compare different networks. In a network specifically composed of non-causal correlations, EI is merely a structural property of the network's connectivity.

6.1.6 Examples of consistent macro-nodes

In Fig. 6.5 we display 15 different parameterizations of small networks grown under degree-based preferential attachment. Each plot shows the inconsistency of the mapping from the microscale to the macroscale, in bits, which corresponds to the KL divergence between the distribution of random walkers on microscale nodes and the same distribution at the macroscale. Each of these networks is consistent after 1000 timesteps, with eight showing full consistency from the start. These 15 example networks also show the range of causal emergence values that is found in networks.

6.1.7 Emergent subgraphs

What sort of subgraph connectivity leads to causal emergence? To explore this we take two independent subgraphs and couple them together while varying their size, moving from clique-like to bipartite connectivity. We then check to see if grouping those clusters into macro-nodes leads to causal emergence (Fig. 6.6). Specifically, we simulate many small unweighted, undirected networks (N = 100) from a stochastic block model with two clusters, and we vary the probability of within-cluster edges (from 0.0 to 1.0) as well as the size-asymmetry of the two clusters (illustrated around the border of Fig. 6.6). In each simulation, we group the microscale network into two macro-nodes, each corresponding to one cluster. What we observe is a causal emergence landscape with several important characteristics (Fig. 6.6). First, in these networks we observe causal emergence when the fraction of within-cluster connections is either very high or very low (right and left sides of the heatmap in Fig. 6.6). These are the conditions in which there is a large amount of uncertainty, or noise, in that subgraph. Not only that, however: causal emergence is most likely when there is a size asymmetry between the two clusters, suggesting that macroscales that maximize a network's EI often do so by creating a more evenly distributed ⟨W_i^out⟩. In general, however, the space of subgraphs leans toward causal reduction (a loss of EI after grouping), which fits with the success of reduction historically and explains why researchers and modelers should generally be biased toward reduction.

In cases of complete noise, with no asymmetries or differences between intra- or inter-connectivity between subgraphs, we should expect causal emergence to be impossible. Indeed, this is what we see for many parameterizations of Erdős-Rényi networks of various sizes (Fig. 6.7). This result follows from insights in Fig. 2.1a, where the EI of ER networks converges to a fixed value of −log2(p) as the size of the network increases. Here, we observe some causal emergence in ER networks but only when the networks are very small. Importantly, the amount of causal emergence is also very small, especially relative to the causal emergence in networks with preferential attachment. This further suggests that causal emergence moves the existent structure of the network into focus by examining the network at a certain scale, rather than creating that structure from nothing.

Network name | Domain | Source | Cat. | Description
HEP-th citations | Citation | Konect | Inf. | high-energy physics (HEP) citations - theory
HEP-ph citations | Citation | Konect | Inf. | HEP citations - phenomenology
Cora citations | Citation | Konect | Inf. | citations from the Cora database
DBLP citations | Citation | Konect | Inf. | database of scientific publications
Astro-ph coauthorships | Coauthorship | Konect | Inf. | coauthors on astronomy arXiv papers
HEP-th coauthorships | Coauthorship | Konect | Inf. | coauthors on HEP-theory arXiv papers
HEP-ph coauthorships | Coauthorship | Konect | Inf. | coauthors on HEP-phenomenology arXiv papers
Tarragona univ. emails | Communication | Konect | Soc. | emails from the University Rovira i Virgili
Dem. Nat. Comm. emails | Communication | Konect | Soc. | 2016 Democratic National Committee email leak
Digg user-user replies | Communication | Konect | Soc. | reply network from the social news website Digg
UC Irvine messages | Communication | Konect | Soc. | messages between students at UC Irvine
Manufacturing emails | Communication | Konect | Soc. | internal emails between employees at a company
CAIDA autonomous systems | Computer | Konect | Tec. | autonomous systems network from CAIDA, 2007
Route views autonomous systems | Computer | Konect | Tec. | autonomous systems network
Internet autonomous systems | Computer | Konect | Tec. | connected IP routing
Haggle RFID contact | Human Contact | Konect | Soc. | human proximity, via carried wireless devices
Reality mining RFID | Human Contact | Konect | Soc. | RFID data from 100 MIT students' interactions
California windsurfers | Human Contact | Konect | Soc. | contacts between windsurfers California, 1986
Train terrorists | Human Contact | Konect | Soc. | contacts between Madrid train bombing suspects
Hypertext conference | Human Contact | Konect | Soc. | face-to-face contacts at the ACM Hypertext 2009
Infectious conference | Human Contact | Konect | Soc. | face-to-face contacts at INFECTIOUS, 2009
Jazz musicians | Human Social | Konect | Soc. | collaboration network between Jazz musicians
Adolescent health | Human Social | Konect | Soc. | surveyed students list their best friends
Physicians | Human Social | Konect | Soc. | innovation spread network among 246 physicians
Resident hall | Human Social | Konect | Soc. | friendship ratings between students in a dorm
Sampson cloister | Human Social | Konect | Soc. | relations between monks in a monastery
Seventh graders | Human Social | Konect | Soc. | proximity ratings between seventh grade students
Taro gift-giving | Human Social | Konect | Soc. | gift-givings (taro) between households
Dutch college | Human Social | Konect | Soc. | friendship ratings between university freshmen
Highland tribes | Human Social | Konect | Soc. | tribes in the Gahuku-Gama alliance structure
Illinois school | Human Social | Konect | Soc. | friendships between boys at an Illinois highschool
Free online dict. | Hyperlink | Konect | Inf. | cross references in Free Online Dict. of Computing
Political blogs | Hyperlink | Konect | Inf. | hyperlinks between blogs, 2004 US election
Google internal | Hyperlink | Konect | Inf. | hyperlink network from pages within Google.com
Air traffic control | Infrastructure | Konect | Tec. | USA's FAA, Preferred Routes Database
OpenFlights v1 | Infrastructure | Konect | Tec. | flight network between airports, OpenFlights.org
OpenFlights v2 | Infrastructure | Konect | Tec. | flight network between airports, OpenFlights.org
Contiguous U.S. | Infrastructure | Konect | Tec. | 48 contiguous states and D.C. of the U.S.
European roads | Infrastructure | Konect | Tec. | international E-road network, mainly in Europe
Chicago roads | Infrastructure | Konect | Tec. | road transportation network of the Chicago region
West U.S. powergrid | Infrastructure | Konect | Tec. | power grid of the Western U.S.
U.S. Airports | Infrastructure | Konect | Tec. | flights between US airports in 2010
David Copperfield | Lexical | Konect | Inf. | network of common noun and adjective adjacencies
Edinburgh thesaurus | Lexical | Konect | Inf. | word association network, collected experimentally
King James Bible | Lexical | Konect | Inf. | co-occurrence between nouns in the Bible

Table 6.2: Network datasets. Continued on the following page.


Network name | Domain | Source | Cat. | Description
C. elegans metabolic | Metabolic | Konect | Bio. | metabolic network of the C. elegans roundworm
Human protein (Figeys) | Metabolic | Konect | Bio. | interactions network of proteins in Humans
PDZbase protein | Metabolic | Konect | Bio. | protein-protein interactions from PDZBase
Human protein (Stelzl) | Metabolic | Konect | Bio. | interactions network of proteins in Humans
Human protein (Vidal) | Metabolic | Konect | Bio. | proteome-scale map of Human protein interactions
Yeast protein | Metabolic | Konect | Bio. | protein interactions contained in yeast
Reactome humans | Metabolic | Konect | Bio. | protein interactions, from the Reactome project
Avogato | Social | Konect | Soc. | trust network for users of Advogato
Google+ | Social | Konect | Soc. | Google+ user-user connections
Hamsterster | Social | Konect | Soc. | friendships between users of hamsterster.com
Twitter lists | Social | Konect | Soc. | Twitter user-user following network
Facebook NIPS | Social | Konect | Soc. | Facebook user-user friendship network
Linux dependency | Software | Konect | Inf. | Linux source code dependency network
J.D.K. dependency | Software | Konect | Inf. | software class dependencies, JDK 1.6.0.7
JUNG/javax dependency | Software | Konect | Inf. | software class dependencies, JUNG 2.0.1 & javax
Florida ecosystem - dry | Trophic | Konect | Bio. | food web in the Florida wetlands (dry season)
Florida ecosystem - wet | Trophic | Konect | Bio. | food web in the Florida wetlands (wet season)
Little Rock Lake ecosystem | Trophic | Konect | Bio. | food web of Little Rock Lake, Wisconsin
WHOIS protocol | Technological | NetworkRepository | Tec. | dataset of internet routing registries
PGP protocol | Technological | NetworkRepository | Tec. | trust protocol of private keys of internet users
Routers RF | Technological | NetworkRepository | Tec. | traceroute network between routers via Rocketfuel
Cat brain 1 | Brain | NetworkRepository | Bio. | fiber tracts between brain regions of a cat
Drosophila medulla | Brain | NetworkRepository | Bio. | neuronal network from the medulla of a fly
Rhesus brain 1 | Brain | NetworkRepository | Bio. | collation of tract tracing studies in primates
Rhesus brain 2 | Brain | NetworkRepository | Bio. | inter-areal cortical networks from a primate
Macaque cerebral | Brain | NetworkRepository | Bio. | connections between cerebral cortex of a primate
Macaque interareal | Brain | NetworkRepository | Bio. | inter-areal cortical networks from a primate
Mouse Kasthuri | Brain | NetworkRepository | Bio. | neuronal network of a mouse
Mouse brain 1 | Brain | NetworkRepository | Bio. | calcium imaging of neuronal networks in a mouse
Mouse retina 1 | Brain | NetworkRepository | Bio. | electron microscopy of neurons in mouse retina
Mouse visual 1 | Brain | NetworkRepository | Bio. | electron microscopy of visual cortex of a mouse
Mouse visual 2 | Brain | NetworkRepository | Bio. | electron microscopy of visual cortex of a mouse
Power 1138BUS | Powergrid | NetworkRepository | Tec. | power system admittance, via Harwell-Boeing
Power 494BUS | Powergrid | NetworkRepository | Tec. | power system admittance, via Harwell-Boeing
Power 662BUS | Powergrid | NetworkRepository | Tec. | power system admittance, via Harwell-Boeing
Power 685BUS | Powergrid | NetworkRepository | Tec. | power system admittance, via Harwell-Boeing
U.S. power grid | Powergrid | NetworkRepository | Tec. | electricity / power transmission network in the U.S.
Power pcspwr 09 | Powergrid | NetworkRepository | Tec. | BCSPWR09 powergrid data via Harwell-Boeing
Power pcspwr 10 | Powergrid | NetworkRepository | Tec. | BCSPWR10 powergrid data via Harwell-Boeing
Power ERIS1176 | Powergrid | NetworkRepository | Tec. | powergrid data via Erisman, 1973

Table 6.2: Network datasets (continued).


Figure 6.6: Causal emergence in a simplified stochastic block model. Schematic showing the role of the two relevant parameters—the fraction of nodes in each community (ranging from r = 0.50 to r < 1.0) and the fraction of within-cluster connections (ranging from p = 0.0, a fully bipartite network, to p = 1.0—two disconnected cliques). By repeatedly simulating networks under various combinations of parameters (N = 100 with 100 simulations per combination of parameters), we see combinations that are more apt to produce networks with causal emergence.

6.2 Chapter 3 Appendix

Within-ensemble graph distance as size increases

In Figures 3.1, 3.2, 3.3, 3.4, and 3.5, we plot the within-ensemble graph distances of networks with a fixed size. However, one important behavior of graph distance measures is how they change as networks increase in size.

As an example, the Jensen-Shannon divergence between the degree distributions (DJS) of two ER graphs will decrease as n → ∞, since the empirical degree distributions get closer and closer to a binomial distribution. On the other hand, for graph distances that are explicitly accompanied by a size-normalizing term (e.g., HAM), we would expect that the mean within-ensemble graph distance does not change as network size increases.


Figure 6.7: Causal emergence in Erdős-Rényi networks. (A) As the edge density, p, of ER networks increases and N is held constant, the amount of causal emergence quickly drops to zero. (B) This drop occurs well before pN = ⟨k⟩ = 1, meaning the algorithm for uncovering causal emergence is only grouping small, disconnected, tree-like subgraphs that have yet to form into a giant component. Of note here is the low magnitude of causal emergence even in cases where the random network is not a single large component, and the vanishing of causal emergence after it is.

In Figure 6.8, we show how the within-ensemble graph distance changes as n increases, both for a fixed density in G(n,p) as well as for a fixed average degree in G(n,⟨k⟩).
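The experiment summarized in Figure 6.8 can be sketched with networkx and the netrd package [119]. The class name JaccardDistance and the dist method here are assumptions based on the netrd documentation; any of the distances described below could be substituted:

```python
import networkx as nx
import numpy as np
import netrd

# Within-ensemble graph distance: sample independent pairs from G(n, p)
# and measure a chosen distance between the members of each pair.
distance = netrd.distance.JaccardDistance()  # assumed class name
n, p, n_pairs = 100, 0.1, 200
samples = [distance.dist(nx.gnp_random_graph(n, p),
                         nx.gnp_random_graph(n, p))
           for _ in range(n_pairs)]
print(np.mean(samples), np.std(samples))  # <D> and its spread
```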

Descriptions of graph distance measures

Throughout the appendix, we assume graphs G and G′ are undirected and unweighted so that the adjacency matrices are binary and symmetric. We first consider several projections for distances given a description which is the full adjacency sequence or matrix, followed by projections involving statistical and ad-hoc descriptions. The list of graph distances used in this work is {JAC, HAM, HIM, FRO, POD, DJS, POR, QJS, CSE, GDD, REP, LSD, LGJ, LLE, IPM, NBD, DNB, DMD, DCN, NES}.

[Figure 6.8 contains 20 panels, one per graph distance measure, each plotting D(G, G′) against n for G(n, p = 0.1) and G(n, ⟨k⟩ = 6): Jaccard dissimilarity [JAC], Hamming distance [HAM], Hamming-Ipsen-Mikhailov [HIM], Frobenius norm [FRO], Polynomial dissimilarity [POD], Degree distribution Jensen-Shannon divergence [DJS], Portrait divergence [POR], Quantum density matrix Jensen-Shannon divergence [QJS], Communicability sequence entropy [CSE], Graph diffusion distance [GDD], Resistance perturbation [REP], NetLSD [LSD], Laplacian (Gaussian kernel) Jensen-Shannon divergence [LGJ], Laplacian (Lorenzian kernel) Euclidean distance [LLE], Ipsen-Mikhailov distance [IPM], Nonbacktracking spectral distance [NBD], Distributional nonbacktracking spectral distance [DNB], D-measure distance [DMD], DeltaCon [DCN], and NetSimile [NES].]

Figure 6.8: Mean and standard deviations of the within-ensemble distances for G(n,p) and G(n,⟨k⟩) as n increases. Here, we generate pairs of ER networks with either a fixed density, p, or with a fixed average degree, ⟨k⟩, as we increase the network size, n. In each subplot, the mean within-ensemble graph distance is plotted as a solid line with a shaded region around it for the standard error (⟨D⟩ ± σ_⟨D⟩; note that in most subplots above, the standard error is too small to see), while the dashed lines are the standard deviations.

6.2.1 Jaccard Distance

The Jaccard measure is computed using the adjacency matrix ψ_G = A ∈ {0, 1}^{n×n}. For two vertex-labeled graphs G and G′,

$$D_{JAC}(G, G') = d_{JAC}(A, A') = 1 - \frac{|S|}{|T|} \tag{6.1}$$

where S_ij = A_ij A′_ij represents the intersection of edge sets between graphs G and G′, while T_ij = S_ij + (1 − A′_ij)A_ij + (1 − A_ij)A′_ij represents the union of edge sets between graphs. Here, |S| is the sum over the S_ij and similarly for |T|. The computational complexity of the Jaccard distance is O(|E| + |E′|) when using unordered sets to get the union and intersection sets and their cardinality. This is what is done in the netrd package [119].

Since nearly empty graphs likely have nearly zero edges in common, the ratio |S|/|T| will be nearly zero for p close to 0, so that D_JAC approaches 1 at low p.
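A minimal NumPy sketch of Eq. 6.1 for binary adjacency matrices (the function name is illustrative, and assigning distance 0 to a pair of empty graphs is a convention assumed here; netrd's implementation may differ):

```python
import numpy as np

def jaccard_distance(A1, A2):
    S = A1 * A2                            # edges present in both graphs
    T = S + (1 - A2) * A1 + (1 - A1) * A2  # edges present in either graph
    # Assumed convention: two empty graphs are given distance 0.
    return 0.0 if T.sum() == 0 else 1.0 - S.sum() / T.sum()
```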

6.2.2 Hamming Distance

Similarly, the Hamming measure may also be computed using the adjacency matrix A ∈ {0, 1}^{n×n}. For two vertex-labeled graphs G and G′, the Hamming distance counts the number of elementwise differences between ψ_G = A and ψ_G′ = A′:

$$D_{HAM}(G, G') := \frac{1}{\binom{n}{2}}\sum_{1\le i<j\le n}\left|A_{ij} - A'_{ij}\right|. \tag{6.2}$$
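Eq. 6.2 can be sketched in the same style (again a minimal illustration, not the netrd implementation):

```python
import numpy as np

def hamming_distance(A1, A2):
    # Fraction of unordered node pairs whose edge status differs (Eq. 6.2).
    n = A1.shape[0]
    iu = np.triu_indices(n, k=1)  # each pair (i, j) with i < j, counted once
    return np.abs(A1[iu] - A2[iu]).sum() / (n * (n - 1) / 2)
```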

6.2.3 Frobenius

The Frobenius distance D_FRO is simply the Frobenius norm of the difference between the two adjacency matrices, so that:

$$D_{FRO}(G, G') := \sqrt{\sum_{i,j}\left|A_{ij} - A'_{ij}\right|^2} \tag{6.3}$$

Note that for binary adjacency matrices, |A_ij − A′_ij|² = |A_ij − A′_ij|, and A_ii = A′_ii = 0 ∀i, given that there are no self-loops. Note that, because the distance operates on the adjacency matrices directly, it implicitly assumes the graphs are vertex-labeled. FRO has the same computational complexity as the Hamming distance due to their similarity. It is O(n²) if one compares all entries, as is done in the netrd package, but it could be improved to O(|E| + |E′|).
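For completeness, Eq. 6.3 reduces to a one-line NumPy call (a sketch, comparing all entries as noted above):

```python
import numpy as np

def frobenius_distance(A1, A2):
    # Eq. 6.3: square root of the sum of squared entry-wise differences.
    return np.linalg.norm(A1 - A2)
```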

6.2.4 Polynomial Dissimilarity

The polynomial dissimilarity, POD, between two unweighted, vertex-labeled graphs is based on the eigenvalue decompositions of the two adjacency matrices of the graphs, G and G′ [54].

To compute the polynomial dissimilarity between two graphs, first decompose A as Q_A Λ_A Q_A^T, where Q_A is an orthogonal matrix and Λ_A is the diagonal matrix of eigenvalues. Second, construct matrices P(A) and P(A′) for each graph, where P(A) = Q_A W_A Q_A^T and

$$W_A = \Lambda_A + \frac{1}{(n-1)^{\alpha}}\Lambda_A^2 + \ldots + \frac{1}{(n-1)^{\alpha(K-1)}}\Lambda_A^K.$$

The polynomial dissimilarity, then, is calculated as the Frobenius norm between P(A) and P(A′):

$$D_{POD}(G, G') = \frac{1}{n^2}\left\|P(A) - P(A')\right\|. \tag{6.4}$$

In this work, we consider a default value of K = 5 in order to accommodate potentially informative higher-order interactions in each of the graphs. Here, α = 1 by default, though in [54], α = 0.9 is commonly considered.

The computational complexity of POD is O(n^3) in practice, which arises from it requiring two n × n matrix eigendecompositions; this is O(n^3) for general matrices with a method based on the QR algorithm [133], as used in the netrd package. Note that recent techniques based on message-passing can give fast and exact results for sparse networks with short loops in O(n log n) [35] and could be used to reduce the computational complexity of spectral graph distances.
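A sketch of the construction above and Eq. 6.4 (assuming symmetric adjacency matrices so that np.linalg.eigh applies; defaults K = 5 and α = 1 as in the text):

```python
import numpy as np

def polynomial_dissimilarity(A1, A2, K=5, alpha=1.0):
    # Sketch of Eq. 6.4; netrd's implementation may differ in details.
    def P(A):
        n = A.shape[0]
        lam, Q = np.linalg.eigh(A)  # A = Q diag(lam) Q^T for symmetric A
        W = sum(np.diag(lam ** k) / (n - 1) ** (alpha * (k - 1))
                for k in range(1, K + 1))
        return Q @ W @ Q.T
    return np.linalg.norm(P(A1) - P(A2)) / A1.shape[0] ** 2
```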

6.2.5 Degree Distribution Jensen-Shannon Divergence

A simple graph distance measure is the Jensen-Shannon divergence [108] between the empirical degree distributions of two graphs. In this case, for an n-node graph G the descriptor ψ_G is the empirical degree distribution encoded in the set of numbers {p_k(G)}_{k≥0} := p, given by p_k(G) := n_k(G)/n, where n_k(G) = Σ_{i=1}^{n} 1{k_i = k}, with 1{·} being the indicator function and k_i = Σ_{j=1}^{n} A_ij being the degree of node i in terms of the adjacency matrix A of G. The Jensen-Shannon divergence between two such distributions [37] is the degree Jensen-Shannon divergence or DJS distance between the graphs:

$$D_{DJS}(G, G') = H[p_+] - \frac{1}{2}\left(H[p] + H[p']\right), \tag{6.5}$$

where p_+ = {(p_k + p′_k)/2}_{k≥0} is a mixture distribution and H[p] = −Σ_k p_k ln p_k is the Shannon entropy.

The computational complexity of DJS is O(n), which arises from computing two degree distributions (which is O(n)) and then comparing them (which is O(k_+), with k_+ < n being the maximum degree in either network).
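A minimal sketch of Eq. 6.5, computing entropies in nats as in the definition above (function names are illustrative):

```python
import numpy as np

def degree_js_divergence(A1, A2):
    # JSD between empirical degree distributions (Eq. 6.5), in nats.
    def degree_dist(A):
        k = A.sum(axis=1).astype(int)
        return np.bincount(k, minlength=A.shape[0] + 1) / A.shape[0]
    def H(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()
    p, q = degree_dist(A1), degree_dist(A2)
    m = max(len(p), len(q))                 # pad to a common length
    p = np.pad(p, (0, m - len(p)))
    q = np.pad(q, (0, m - len(q)))
    return H((p + q) / 2) - (H(p) + H(q)) / 2
```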

6.2.6 Portrait Divergence

The portrait divergence, POR, compares, using the JSD, a description of each of two graphs called their network portrait [7]. The network portrait is a matrix B with elements B_lk such that

$$B_{lk} \equiv \text{number of nodes with } k \text{ nodes at distance } l. \tag{6.6}$$

Alternatively stated, B_lk is the kth entry of the empirical histogram of l-th neighborhood sizes. These elements are computed using a breadth-first search or similar method. The portrait divergence of G and G′ is the JSD of probability distributions associated with their portraits, B and B′ [6]. Note that each row in B can be interpreted as the probability distribution that there will be k nodes at a distance of l away from a randomly chosen node, such that:

$$P(k\,|\,l) = \frac{B_{l,k}}{N} \tag{6.7}$$

which can be normalized by the number of paths of length l such that the probability distribution is the probability that two randomly selected nodes are at a distance l away from each other:

$$P(l) = \frac{\sum_{k=0}^{n} k\,B_{l,k}}{\sum_c n_c^2} \tag{6.8}$$

where n_c is the number of nodes within a connected component, c. The joint probability of choosing a pair of nodes at a distance, l, away from each other and that one node has k nodes in total at distance, l, away is:

$$P(k, l) = P(k\,|\,l)\,P(l) = \frac{B_{l,k}}{n}\cdot\frac{\sum_{k'=0}^{n} k'\,B_{l,k'}}{\sum_c n_c^2} \tag{6.9}$$

There is now a P_B(k, l) and P_B′(k, l) for each portrait, B and B′, as well as a "mixed" distribution for both, which is specified as P* = (P_B(k, l) + P_B′(k, l))/2. The portrait divergence between G and G′ is the JSD between their portraits as follows

$$D_{POR}(G, G') = JSD\!\left(P_B(k, l), P_{B'}(k, l)\right) = \frac{1}{2}D_{KL}\!\left(P_B(k, l), P^*\right) + \frac{1}{2}D_{KL}\!\left(P_{B'}(k, l), P^*\right) \tag{6.10}$$

where D_KL is the Kullback-Leibler divergence. Note that √D_POR satisfies the properties of a metric (satisfies the triangle inequality, is positive-definite, symmetric) [167].

The computational complexity of POR is O(n(n + |E|) log n), which comes from the requirement of computing shortest paths between all pairs of nodes in the network. In our implementation, computing the shortest path between a source and all nodes is done with Dijkstra's algorithm with a binary heap, which takes O((n + |E|) log n) operations in the worst case. Constructing the portrait and calculating the JSD between the associated distributions has a lower computational complexity.
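The portrait itself (Eq. 6.6) can be sketched with networkx as follows; the divergence then applies Eqs. 6.7 to 6.10 to two such matrices (a minimal illustration, not the netrd implementation):

```python
import numpy as np
import networkx as nx

def portrait(G):
    # Portrait matrix B (Eq. 6.6): B[l, k] = number of nodes that have
    # exactly k nodes at shortest-path distance l.
    n = G.number_of_nodes()
    B = np.zeros((n, n + 1), dtype=int)
    for _, dists in nx.all_pairs_shortest_path_length(G):
        # shells[l] = number of nodes at distance l from the source node
        shells = np.bincount(list(dists.values()), minlength=n)
        for l in range(n):
            B[l, shells[l]] += 1
    return B
```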

6.2.7 Quantum Spectral Jensen-Shannon Divergence

This method compares graphs via the Jensen-Shannon divergence (JSD) between probability distributions associated with density matrices of two graphs G and G′ [8; 147; 148; 115], denoted ρ and ρ′ respectively, defined by

$$\rho = \frac{e^{-\beta L(G)}}{Z} \tag{6.11}$$

where L(G) is the Laplacian matrix of graph G, and the constant Z ≡ Σ_{i=1}^{n} e^{−βλ_i(L)}, with λ_i(L) being the ith eigenvalue of L. The description-distance pair (ρ, JSD) yields the "Quantum Spectral Jensen-Shannon Divergence" (QJS) [49], which compares two graphs by the entropy of the eigenvalue spectra of their density matrices ρ. Treating the spectrum {λ_i}_{i=1}^{n} as a normalized probability distribution, the spectral Rényi entropy of order q is given by

$$S_q = \frac{1}{1-q}\log_2\sum_{i=1}^{n}\lambda_i(\rho)^q, \tag{6.12}$$

which, if q = 1, reduces to the Von Neumann entropy:

$$S_1 = -\sum_{i=1}^{n}\lambda_i(\rho)\log_2\lambda_i(\rho). \tag{6.13}$$

The QJS distance between two graphs is defined to be:

$$D_{QJS}(G, G') = S_q\!\left(\frac{\rho + \rho'}{2}\right) - \frac{1}{2}\left[S_q(\rho) + S_q(\rho')\right]. \tag{6.14}$$

For default parameter values, we use $\beta = 0.1$ and $q = 1.0$, based on the explanations in [49]. QJS requires computing the Laplacian matrix spectra of the two graphs and comparing them, which yields a computational complexity of $O(n^3)$ (see Appendix 6.2.4).
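As an illustration of Eqs. (6.11)-(6.14) at $q = 1$, the following sketch builds each density matrix and takes the JSD of Von Neumann entropies; function names are ours, and this is a sketch rather than the exact implementation used here.

```python
# An illustrative QJS computation (Eq. 6.14) with q = 1 (Von Neumann
# entropy) and beta = 0.1, assuming networkx graphs G1, G2.
import numpy as np
import networkx as nx
from scipy.linalg import expm, eigvalsh

def density_matrix(G, beta=0.1):
    L = nx.laplacian_matrix(G).toarray().astype(float)
    rho = expm(-beta * L)
    return rho / np.trace(rho)  # normalize so Tr(rho) = 1

def von_neumann_entropy(rho):
    lam = eigvalsh(rho)
    lam = lam[lam > 0]  # drop numerically zero eigenvalues
    return -np.sum(lam * np.log2(lam))

def qjs_distance(G1, G2, beta=0.1):
    r1, r2 = density_matrix(G1, beta), density_matrix(G2, beta)
    mix = (r1 + r2) / 2
    return von_neumann_entropy(mix) - 0.5 * (von_neumann_entropy(r1)
                                             + von_neumann_entropy(r2))
```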

6.2.8 Communicability Sequence Entropy Divergence

The communicability sequence entropy divergence CSE between two graphs, $G$ and $G'$, is the JSD between the communicability distributions of $G$ and $G'$. To obtain a communicability distribution, we first construct the communicability matrix, which is an

$n \times n$ matrix whose entries quantify the communicability between pairs of nodes, $v_i$ and $v_j$:

$$C = e^{A} = \sum_{k=0}^{\infty} \frac{1}{k!} A^k \qquad (6.15)$$

In other words, the communicability matrix, $C$, is computed as a matrix exponentiation of the adjacency matrix. The elements $C_{ij}$, $i \leq j$, are stored in a vector of length $\binom{n}{2}$ and normalized to create the communicability sequences, $P$ and $P'$, for each graph. The Shannon entropy of $P$ is $H[P] = -\sum_i P_i \log_2 P_i$, and the communicability sequence entropy divergence is calculated as the JSD between $P$ and $P'$, where $M = \frac{1}{2}(P + P')$ is the mixed sequence of $P$ and $P'$:

$$D_{CSE}(G, G') = JSD(P, P') = H[M] - \frac{1}{2}\left(H[P] + H[P']\right). \qquad (6.16)$$

The computational complexity of CSE is $O(n^3)$, with the computationally intensive step being the computation of the matrix exponential of both adjacency matrices $A$ and $A'$. Our implementation uses Padé approximants through the SciPy package to perform this step, which takes $O(n^3)$ operations [124].
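A minimal sketch of the communicability sequence (Eq. 6.15), assuming a networkx graph and graphs of equal size so that the two sequences can be compared entrywise; the function name is illustrative.

```python
# A minimal sketch of building the communicability sequence P
# (Eqs. 6.15-6.16), assuming an undirected networkx graph G.
import numpy as np
import networkx as nx
from scipy.linalg import expm

def communicability_sequence(G):
    A = nx.adjacency_matrix(G).toarray().astype(float)
    C = expm(A)                    # communicability matrix (Eq. 6.15)
    iu = np.triu_indices_from(C)   # entries C_ij with i <= j
    P = C[iu]
    return P / P.sum()             # normalized communicability sequence
```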

6.2.9 Graph Diffusion Distance

The graph diffusion distance [77] GDD between two graphs, $G$ and $G'$, is a distance measure based on the notion of flow within each graph. As such, this measure uses the unnormalized Laplacian matrices of both graphs, $L$ and $L'$, to construct time-varying Laplacian exponential diffusion kernels, $e^{-tL}$ and $e^{-tL'}$, effectively simulating a diffusion process for $t$ timesteps (as a default, $t = 1000$), creating a column vector of node-level activity at each timestep.

The distance $D_{GDD}(G, G')$ is defined via the Frobenius norm between the two diffusion kernels at the timestep $t^{*}$ where the two kernels are maximally different:

$$D_{GDD}(G, G') = \sqrt{\left\| e^{-t^{*}L} - e^{-t^{*}L'} \right\|_F^2} \qquad (6.17)$$

The computational complexity is $O(n^3)$ since a spectral decomposition of the Laplacian matrices is used (see Appendix 6.2.4).
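A sketch of the search for $t^{*}$ follows; the time grid is an assumption of this sketch (a direct scan rather than the spectral shortcut), and the function name is ours.

```python
# A sketch of the graph diffusion distance (Eq. 6.17): scan candidate
# times t and keep the largest squared Frobenius norm between the two
# diffusion kernels. Assumes equal-size networkx graphs.
import numpy as np
import networkx as nx
from scipy.linalg import expm

def graph_diffusion_distance(G1, G2, times=np.linspace(0.01, 10, 200)):
    L1 = nx.laplacian_matrix(G1).toarray().astype(float)
    L2 = nx.laplacian_matrix(G2).toarray().astype(float)
    best = 0.0
    for t in times:
        xi = np.linalg.norm(expm(-t * L1) - expm(-t * L2), ord="fro") ** 2
        best = max(best, xi)  # xi(t*) at the maximally different time
    return np.sqrt(best)
```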

6.2.10 Resistance Perturbation Distance

The resistance perturbation distance RES between two vertex-labeled graphs, $G$ and $G'$, is the $p$-norm of the difference between their resistance matrices [126]. The resistance perturbation distance changes if either graph is relabeled (it is not invariant under graph isomorphism), so node labels should be consistent between the two graphs being compared. The distance is not normalized.

The resistance matrix of a graph G is calculated as

$$R = \mathrm{diag}(\mathcal{L}^{\dagger}) \mathbf{1}^T + \mathbf{1}\, \mathrm{diag}(\mathcal{L}^{\dagger})^T - 2\mathcal{L}^{\dagger}, \qquad (6.18)$$

where $\mathcal{L}^{\dagger}$ is the Moore-Penrose pseudoinverse of the Laplacian of $G$. The resistance perturbation graph distance of $G$ and $G'$ is calculated as the $p$-norm (the $p$th root of the sum of the $p$th powers of elements) of the difference between their resistance matrices, $R$ and $R'$:

$$D_{RES}(G, G') = \left[\sum_{i,j \in V} \left| R_{i,j} - R'_{i,j} \right|^p \right]^{1/p}. \qquad (6.19)$$

The default value chosen in experiments is $p = 2$. The computational complexity of RES is $O(n^3)$ for our implementation, since we need to compute the Moore-Penrose pseudoinverse of the Laplacian matrix of both graphs, which is $O(n^3)$. Note that low-rank approximations can be used to reduce the computational complexity [126].
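A back-of-the-envelope sketch of Eqs. (6.18)-(6.19), assuming vertex-labeled networkx graphs on the same node set; function names are illustrative.

```python
# A sketch of the resistance perturbation distance (Eqs. 6.18-6.19),
# default p = 2, assuming consistently labeled networkx graphs.
import numpy as np
import networkx as nx

def resistance_matrix(G):
    L = nx.laplacian_matrix(G).toarray().astype(float)
    Lp = np.linalg.pinv(L)          # Moore-Penrose pseudoinverse of L
    d = np.diag(Lp).reshape(-1, 1)  # diag(L^dagger) as a column vector
    return d + d.T - 2 * Lp         # Eq. 6.18

def resistance_perturbation(G1, G2, p=2):
    R1, R2 = resistance_matrix(G1), resistance_matrix(G2)
    return (np.abs(R1 - R2) ** p).sum() ** (1 / p)
```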

6.2.11 NetLSD

The NetLSD distance LSD between two graphs, $G$ and $G'$, is the Frobenius norm between the heat trace signatures of the normalized Laplacians $L$ and $L'$ [177]. The heat kernel matrix is calculated as

$$H_t = e^{-tL} = \sum_{j=1}^{n} e^{-t\lambda_j} \phi_j \phi_j^T. \qquad (6.20)$$

The $ij$-th element of $H_t$ contains the amount of heat transferred from node $v_i$ to node $v_j$ at time $t$ (default of 256 log-spaced time intervals between $10^{-2}$ and $10^{2}$). From the heat kernel matrix $H_t$, the heat trace $h_t$ is defined as

$$h_t = \mathrm{Tr}(H_t) = \sum_{j=1}^{n} e^{-t\lambda_j}. \qquad (6.21)$$

The heat trace signature of graph $G$ is the set $\{h_t\}_{t \geq 0}$. Upon computing the heat trace signatures of both $G$ and $G'$, they are compared via a Frobenius norm:

$$D_{LSD}(G, G') = d_{FRO}\left(\{h_t\}_{t \geq 0}, \{h'_t\}_{t \geq 0}\right). \qquad (6.22)$$

The computational complexity of LSD is $O(n^3)$ due to the spectral decomposition of the Laplacian matrices of both graphs (see Appendix 6.2.4).
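An illustrative sketch of the heat trace signature (Eq. 6.21) over the default 256 log-spaced times, assuming a networkx graph; function names are ours.

```python
# A sketch of the NetLSD heat trace signature (Eqs. 6.21-6.22),
# assuming an undirected networkx graph G.
import numpy as np
import networkx as nx
from scipy.linalg import eigvalsh

def heat_trace_signature(G, times=np.logspace(-2, 2, 256)):
    L = nx.normalized_laplacian_matrix(G).toarray()
    lam = eigvalsh(L)  # spectrum of the normalized Laplacian
    # h_t = sum_j exp(-t * lambda_j), evaluated at each sampled time
    return np.array([np.exp(-t * lam).sum() for t in times])

def netlsd_distance(G1, G2):
    h1, h2 = heat_trace_signature(G1), heat_trace_signature(G2)
    return np.linalg.norm(h1 - h2)  # Frobenius norm of the difference (Eq. 6.22)
```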

6.2.12 Laplacian Spectrum Distances

Many distances between two graphs, $G$ and $G'$, use a direct comparison of their Laplacian spectra. For all the methods below, we use the eigenvalues $\{\lambda_1 = 0 \leq \lambda_2 \leq \cdots \leq \lambda_n\}$ of the normalized Laplacian matrices $L$ and $L'$. To perform the comparison, a subset of the whole spectrum can be used, e.g. the $k$ smallest [188] or largest [94; 115] in magnitude. Unless specified, we used all eigenvalues for comparison ($k = n$).

The distances compare the continuous spectra $\rho(\lambda)$ and $\rho'(\lambda)$ associated with the graphs $G$ and $G'$. A continuous spectrum is obtained by the convolution of the discrete spectrum $\sum_i \delta(\lambda - \lambda_i)$ with a kernel $g(\lambda, \lambda^{*})$:

$$\rho(\lambda) = \frac{1}{Z} \sum_{i=1}^{n} \int_{0}^{2} g(\lambda, \lambda^{*})\, \delta(\lambda^{*} - \lambda_i)\, d\lambda^{*}, \qquad (6.23)$$

where $Z$ is a normalization factor. Different types of distribution can be used for the kernel, for instance a Lorentzian distribution [89]

$$g(\lambda, \lambda^{*}) = \frac{\gamma}{\pi\left[\gamma^2 + (\lambda - \lambda^{*})^2\right]}, \qquad (6.24)$$

or a Normal distribution

$$g(\lambda, \lambda^{*}) = \frac{\exp\left[-(\lambda - \lambda^{*})^2 / 2\sigma^2\right]}{\sqrt{2\pi\sigma^2}}. \qquad (6.25)$$

Different types of metrics can then be used to compare the spectra, such as the Euclidean metric

$$d(\rho, \rho') = \sqrt{\int_{0}^{2} \left[\rho(\lambda) - \rho'(\lambda)\right]^2 d\lambda}, \qquad (6.26)$$

or the square root of the JSD, $d(\rho, \rho') = \sqrt{JSD(\rho, \rho')}$, written as

$$JSD(\rho, \rho') = \frac{1}{2} D_{KL}(\rho \,\|\, \bar{\rho}) + \frac{1}{2} D_{KL}(\rho' \,\|\, \bar{\rho}), \qquad (6.27)$$

where $\bar{\rho} = (\rho + \rho')/2$. Various combinations of kernels and metrics yield the following distinct distance measures:

• Laplacian spectrum: Gaussian kernel, JSD distance (LGJ)
• Laplacian spectrum: Lorentzian kernel, Euclidean distance (LLE)

For both kernels, we use a half width at half maximum of 0.011775 (which means the standard deviation for the Gaussian kernel is $\approx 0.01$).

While we only focus on the two specific distances above, we note again that there is a world of possible descriptor-distance combinations for comparing graphs. We selected these two because their within-ensemble graph distance curves differed the most (as opposed to, e.g., the Gaussian kernel / Euclidean distance or the Lorentzian kernel / JSD combinations). The computational complexity of this suite of graph distances is $O(n^3)$ due to the spectral decomposition of the Laplacian matrices of both graphs (see Appendix 6.2.4).
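A sketch of a Gaussian-kernel smoothed spectral density and the Euclidean comparison (Eqs. 6.23, 6.25, 6.26) follows; the grid resolution is an assumption of this sketch, and the function names are ours.

```python
# A sketch of a kernel-smoothed Laplacian spectral density compared
# with the Euclidean metric (Eq. 6.26), Gaussian kernel with sigma = 0.01.
import numpy as np
import networkx as nx
from scipy.linalg import eigvalsh

GRID = np.linspace(0, 2, 1000)  # normalized-Laplacian eigenvalues lie in [0, 2]

def spectral_density(G, sigma=0.01, grid=GRID):
    lam = eigvalsh(nx.normalized_laplacian_matrix(G).toarray())
    # One Gaussian kernel per eigenvalue, summed, then normalized (the Z factor).
    rho = np.exp(-(grid[:, None] - lam[None, :]) ** 2 / (2 * sigma ** 2)).sum(axis=1)
    return rho / np.trapz(rho, grid)

def euclidean_spectral_distance(G1, G2, grid=GRID):
    r1, r2 = spectral_density(G1, grid=grid), spectral_density(G2, grid=grid)
    return np.sqrt(np.trapz((r1 - r2) ** 2, grid))  # Eq. 6.26
```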

6.2.13 Ipsen-Mikhailov

The Ipsen-Mikhailov distance [89] IPM between two graphs, $G$ and $G'$, is a spectral comparison of their Laplacian matrices, $L$ and $L'$. This approach treats the nodes of $G$ and $G'$ as molecules connected elastically, which casts the distance measurement between $G$ and $G'$ as the solution to a set of differential equations for the vibrational frequencies of the nodes. The vibrational frequencies, $\omega_i$, are related to the eigenvalues, $\lambda_i$, of $L$ such that $\lambda_i = \omega_i^2$. With this, one can construct a spectral density for each graph as a sum of Lorentz distributions as follows

$$\rho(\omega) = \frac{1}{Z} \sum_{i=1}^{n-1} \frac{\gamma}{(\omega - \omega_i)^2 + \gamma^2}, \qquad (6.28)$$

where $Z$ is a normalization term, and $\gamma$ is a fixed scaling term that controls the width of the Lorentz distributions (as in [89], we use $\gamma = 0.08$ as a default). The distance between $G$ and $G'$ is then calculated as

$$D_{IPM}(G, G') = d(\rho, \rho') = \sqrt{\int_{0}^{\infty} \left[\rho(\omega) - \rho'(\omega)\right]^2 d\omega}. \qquad (6.29)$$

The computational complexity of IPM is $O(n^3)$ due to the spectral decomposition of the Laplacian matrices of both graphs (see Appendix 6.2.4).
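A minimal Ipsen-Mikhailov sketch (Eqs. 6.28-6.29): the $[0, \infty)$ integral is approximated on a finite grid, an assumption of this sketch, and the function names are illustrative.

```python
# A minimal Ipsen-Mikhailov sketch: Lorentz densities over vibrational
# frequencies omega_i = sqrt(lambda_i), compared in the Euclidean sense.
import numpy as np
import networkx as nx
from scipy.linalg import eigvalsh

GRID = np.linspace(0, 5, 2000)  # finite proxy for the [0, inf) integral

def vibrational_density(G, gamma=0.08, grid=GRID):
    lam = eigvalsh(nx.laplacian_matrix(G).toarray().astype(float))
    omega = np.sqrt(np.abs(lam[1:]))  # drop the zero mode lambda_1 = 0
    rho = (gamma / ((grid[:, None] - omega[None, :]) ** 2 + gamma ** 2)).sum(axis=1)
    return rho / np.trapz(rho, grid)  # the normalization Z

def ipsen_mikhailov(G1, G2, grid=GRID):
    r1 = vibrational_density(G1, grid=grid)
    r2 = vibrational_density(G2, grid=grid)
    return np.sqrt(np.trapz((r1 - r2) ** 2, grid))  # Eq. 6.29
```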

6.2.14 Hamming-Ipsen-Mikhailov

The Hamming-Ipsen-Mikhailov distance HIM between two vertex-labeled graphs, $G$ and $G'$, is expressed as a weighted combination of the IPM distance (Section 6.2.13) and a normalized HAM distance (Section 6.2.2) [95]. The parameter $\gamma$ for the IPM is fixed such that $D_{IPM}(\mathcal{E}_n, \mathcal{F}_n) = 1$, where $\mathcal{E}_n$ and $\mathcal{F}_n$ are the empty and complete graphs on $n$ nodes. The HIM distance is defined as follows

$$D_{HIM}(G, G') = \frac{1}{\sqrt{1 + \xi}} \sqrt{D_{IPM}(G, G')^2 + \xi\, D_{HAM}(G, G')^2} \qquad (6.30)$$

We default to $\xi = 1$, as in [95]. The computational complexity of HIM is $O(n^3)$, with the computationally intensive part being the computation of the IPM distance.

6.2.15 Non-backtracking Spectral Distance

The non-backtracking spectral distance NBD between two graphs, $G$ and $G'$, compares the eigenvalues of the non-backtracking matrix of each graph, $\mathcal{B}$ and $\mathcal{B}'$ [175]. This distance is based on the length spectrum and the set of non-backtracking cycles of a graph (i.e., closed walks that do not immediately return to the node they just left) and is calculated as the earth mover's distance (EMD) between the eigenvalues of $\mathcal{B}$ and $\mathcal{B}'$. The eigenvalues of $\mathcal{B}$ and $\mathcal{B}'$ are expressed as $\lambda_k = a_k + i b_k$ and $\lambda'_k = a'_k + i b'_k$, respectively, and $\mathrm{EMD}(\lambda_{\mathcal{B}}, \lambda_{\mathcal{B}'})$ is the solution to an optimization problem: finding the minimum amount of work required to move the coordinates of $\lambda_{\mathcal{B}}$ to the positions of $\lambda_{\mathcal{B}'}$.

$$D_{NBD}(G, G') = \mathrm{EMD}(\lambda_{\mathcal{B}}, \lambda_{\mathcal{B}'}). \qquad (6.31)$$

Note that the Ihara determinant formula can be used to obtain the non-backtracking eigenvalues different from $\pm 1$ using a $2n \times 2n$ matrix [175]. If one were to use the whole non-backtracking spectra to compute the distance, the computational complexity would be $O(n^3)$ [175]. Instead of using the whole spectrum of the non-backtracking matrices, for graph $G$ we compute only the $r$ eigenvalues larger in magnitude than $\sqrt{\lambda_1}$, where $\lambda_1$ is the largest eigenvalue of $\mathcal{B}$ [175]. The computational complexity of our implementation of NBD is $O(\max(r, r')\, n^2)$ for general graphs, where $r$ and $r'$ are the number of eigenvalues larger in magnitude than $\sqrt{\lambda_1}$ and $\sqrt{\lambda'_1}$, respectively, for graphs $G$ and $G'$. To compute these eigenvalues, an implicitly restarted Arnoldi method is used. For sparse graphs the computation is even more efficient.
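As an illustration of the Ihara reduction mentioned above, here is one common $2n \times 2n$ block-matrix form whose spectrum contains the non-backtracking eigenvalues different from $\pm 1$; the block layout is a standard convention we assume here, not necessarily the exact form used in [175].

```python
# A sketch of a 2n x 2n Ihara-form matrix for a simple undirected
# networkx graph G: eigenvectors (x, y) with x = lam * y satisfy
# lam^2 * y = lam * A y - (D - I) y, the Ihara determinant relation.
import numpy as np
import networkx as nx

def ihara_matrix(G):
    A = nx.adjacency_matrix(G).toarray().astype(float)
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))  # diagonal degree matrix
    I = np.eye(n)
    # Block matrix [[A, I - D], [I, 0]]
    return np.block([[A, I - D], [I, np.zeros((n, n))]])

# Usage: the complex non-backtracking eigenvalues of the karate club graph.
eigs = np.linalg.eigvals(ihara_matrix(nx.karate_club_graph()))
```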

6.2.16 Distributional Non-backtracking Distance

Similar to the NBD distance, the DNB distance [122] leverages spectral properties of the non-backtracking matrices, $\mathcal{B}$ and $\mathcal{B}'$, of two graphs, $G$ and $G'$, in order to calculate their dissimilarity.

Unlike the NBD distance, the DNB involves a comparison of the (re-scaled) distributions of eigenvalues of $\mathcal{B}$ and $\mathcal{B}'$, which are compared using either the Euclidean distance or the Chebyshev distance (here, we use the Euclidean distance). We also use the whole spectrum for this distance. Therefore, the computational complexity of DNB is $O(n^3)$ due to the spectral decomposition of the two $2n \times 2n$ matrices (see Appendix 6.2.15).

6.2.17 D-measure Distance

The D-measure distance [155] DMD between two graphs, $G$ and $G'$, combines three properties of the graphs being compared: the network node dispersion (NND), the node distance distribution ($\mu$), and the $\alpha$-centrality. For a full explanation and justification of each component of this distance, we refer the reader to the original article [155], but we briefly summarize it below.

In order to compute the NND of a graph, each node, $v_i$, is assigned a probability vector, $P_i$, whose elements are the fractions of nodes connected to $v_i$ at each distance $j \leq d$, where $d$ is the diameter of the network. The NND, then, is defined as

$$\mathrm{NND}(G) = \frac{JSD(P_1, P_2, \ldots, P_n)}{\log(d + 1)} \qquad (6.32)$$

where $JSD(P_1, P_2, \ldots, P_n)$ is the Jensen-Shannon divergence of each $P_i$ from the whole network's average node-distance distribution at every distance $j$, which we denote $\mu_j$. The collection of $\mu_j$ for all distances $j \leq d$ in a graph, $G$, we denote $\mu_G$. The final step before the calculation of the D-measure distance is to find the $\alpha$-centrality [29] of each network, $G$ and $G'$, as well as the $\alpha$-centrality of the complement of each network, $G^c$ and $G^{c\prime}$. The $\alpha$-centralities of the original networks are denoted $P_{\alpha G}$ and $P_{\alpha G'}$, while the $\alpha$-centralities of their complements are $P_{\alpha G^c}$ and $P_{\alpha G^{c\prime}}$.

Ultimately, the D-measure distance, DDMD, between two graphs is as follows:

$$D_{DMD}(G, G') = w_1 \sqrt{\frac{JSD(\mu_G, \mu_{G'})}{\log 2}} + w_2 \left| \sqrt{\mathrm{NND}(G)} - \sqrt{\mathrm{NND}(G')} \right| + \frac{w_3}{2} \left( \sqrt{\frac{JSD(P_{\alpha G}, P_{\alpha G'})}{\log 2}} + \sqrt{\frac{JSD(P_{\alpha G^c}, P_{\alpha G^{c\prime}})}{\log 2}} \right) \qquad (6.33)$$

where $w_1 + w_2 + w_3$ must equal 1.0. To calculate the final distance value, we adopt the convention used in [155]: $w_1 = 0.45$, $w_2 = 0.45$, $w_3 = 0.1$.

According to Ref. [155], the computational complexity of DMD is $O(|E| + n \log n)$. However, one needs to compute the shortest paths between all pairs of nodes, which suggests a more computationally intensive calculation. Our implementation instead has a computational complexity of $O(n(n + |E|)\log n)$, using Dijkstra's algorithm with a binary heap (see Appendix 6.2.6).

6.2.18 DeltaCon

The DeltaCon distance DCN between two graphs, $G$ and $G'$, is the Matusita distance between the affinity matrices, $S$ and $S'$, of $G$ and $G'$. The affinity matrices are constructed using Fast Belief Propagation, which is expressed as

$$\left[I + \epsilon^2 D - \epsilon A\right] \vec{s}_i = \vec{e}_i \qquad (6.34)$$

where $I$ is the $n \times n$ identity matrix, $D$ is the diagonal degree matrix, $A$ is the adjacency matrix, $\vec{e}_i$ is a vector indicating the initial node $v_i$ from which a random walk process is initiated, and $\vec{s}_i$ is a column vector consisting of $s_{ij}$, the affinity of node $v_j$ with respect to node $v_i$. The affinity matrices, $S$ and $S'$, are defined as $S = \left[I + \epsilon^2 D - \epsilon A\right]^{-1}$. The distance between $G$ and $G'$ according to DeltaCon is as follows

$$D_{DCN}(G, G') = d(S, S') = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \left( \sqrt{s_{ij}} - \sqrt{s'_{ij}} \right)^2} \qquad (6.35)$$

The computational complexity of our implementation of DCN is $O(n^3)$, since we obtain $S$ by direct matrix inversion. However, note that it is possible to improve the algorithm to an $O(n^2)$ computational complexity using a power method, or even $O(|E|)$ by approximating the distance [99].
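An illustrative sketch of Eqs. (6.34)-(6.35) via direct inversion follows; the choice $\epsilon = 1/(1 + \max_i d_i)$ is a commonly used small-epsilon heuristic that we assume here, not necessarily the value used in our experiments.

```python
# An illustrative DeltaCon computation (Eqs. 6.34-6.35) via direct
# matrix inversion, assuming networkx graphs on the same node set.
import numpy as np
import networkx as nx

def fabp_affinity(G, eps=None):
    A = nx.adjacency_matrix(G).toarray().astype(float)
    D = np.diag(A.sum(axis=1))
    n = A.shape[0]
    if eps is None:
        eps = 1.0 / (1.0 + D.max())  # small epsilon keeps affinities nonnegative
    return np.linalg.inv(np.eye(n) + eps ** 2 * D - eps * A)

def deltacon_distance(G1, G2):
    S1, S2 = fabp_affinity(G1), fabp_affinity(G2)
    # Matusita (root-Euclidean) distance between the affinity matrices
    return np.sqrt(((np.sqrt(S1) - np.sqrt(S2)) ** 2).sum())
```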

6.2.19 NetSimile

NetSimile NES is a method for comparing two graphs, $G$ and $G'$, that is based on statistical features of the two graphs. It is invariant to graph labels and is able to compare graphs of different sizes [25]. It is calculated as the Canberra distance between the $7 \times 5$ feature matrices, $p$ and $p'$, of each graph. To construct the $p$ and $p'$ feature matrices, first a $7 \times n$ matrix is constructed for each graph, with each column, $j$, consisting of the following seven node-level quantities:

1. degree, $k_j = \sum_i A_{ij}$

2. clustering coefficient, $c_j = (A^3)_{jj} / \binom{k_j}{2}$

3. average neighbor degree, $k_j^{(nn)} = \frac{1}{k_j} \sum_i k_i A_{ij}$

4. average clustering coefficient of the nodes in the ego network, $c_j^{(ego)} = \frac{1}{k_j} \sum_i c_i A_{ij}$

5. number of edges within the ego network, $T_j = \sum_{l,m} A_{jl} A_{lm} A_{mj}$

6. number of outgoing edges from the ego network, $O_j = \sum_i A_{ij} k_i - T_j = k_j k_j^{(nn)} - T_j$

7. number of neighbors of the ego network, $nn_j^{(ego)} = \sum_i \mathbb{1}\{\exists\, l \in N_j : i \sim l,\ i \not\sim j\}$

These features are then summarized into $p$ and $p'$, which are $7 \times 5$ signature matrices consisting of the median, mean, standard deviation, skewness, and kurtosis of each feature. NetSimile uses the Canberra distance to arrive at a final scalar distance.

$$D_{NES}(G, G') = d(p, p') = \sum_{i} \frac{|p_i - p'_i|}{|p_i| + |p'_i|} \qquad (6.36)$$

where the sum runs over the entries of the flattened signature matrices.

The computational complexity of NES depends on two parts: feature extraction and feature aggregation. The features are all locally defined, hence their extraction takes $O(qn)$, where $q$ is the average degree of a node reached by selecting a random edge and choosing an endpoint [82]. Feature aggregation is $O(n \log n)$ [25], hence the overall complexity is $O(qn + n \log n)$.
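A condensed sketch of the NetSimile pipeline follows, abbreviated to the first three features of the list above for brevity; the function names are ours, and the remaining four ego-network features would be appended analogously.

```python
# A condensed NetSimile sketch: per-node features are aggregated into a
# signature (median, mean, std, skewness, kurtosis per feature), and
# signatures are compared with the Canberra distance (Eq. 6.36).
import numpy as np
import networkx as nx
from scipy.stats import skew, kurtosis

def netsimile_signature(G):
    nodes = list(G)
    deg = np.array([G.degree(v) for v in nodes], dtype=float)
    clust = np.array([nx.clustering(G, v) for v in nodes])
    nbr_deg = np.array([nx.average_neighbor_degree(G, nodes=[v])[v] for v in nodes])
    features = [deg, clust, nbr_deg]  # + the four ego-network features
    return np.array([[np.median(f), f.mean(), f.std(), skew(f), kurtosis(f)]
                     for f in features]).ravel()

def canberra(p1, p2):
    denom = np.abs(p1) + np.abs(p2)
    mask = denom > 0  # skip 0/0 terms
    return (np.abs(p1 - p2)[mask] / denom[mask]).sum()
```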

Analytical derivations

6.2.20 Derivation: Jaccard Distance

We can directly calculate $\langle d_{JAC}(A, A') \rangle_{G(n,p)}$, the expected Jaccard distance between two graphs sampled from $G(n,p)$. Both $|S|$ and $|T|$ (the sizes of the intersection and union, respectively, of the two edge sets) are distributed binomially, as each is the sum of $\binom{n}{2}$ Bernoulli variables arising with probability $p^2$ and $2p(1-p) + p^2$, respectively. Since binomial distributions are sharply peaked (for large values of $n$), we can approximate the expected value of the ratio $|S|/|T|$ by the ratio of the expected values of $|S|$ and $|T|$. Thus we have,

$$\langle d_{JAC}(A, A') \rangle_{G(n,p)} = 1 - \left\langle \frac{|S|}{|T|} \right\rangle \approx 1 - \frac{\langle |S| \rangle}{\langle |T| \rangle} = 1 - \frac{p^2 \binom{n}{2}}{\left(2p(1-p) + p^2\right)\binom{n}{2}} = \frac{1 - p}{1 - \frac{p}{2}}, \qquad (6.37)$$

which agrees precisely with simulations. Note that in the limit $p \to 1$ we have, by Taylor expansion,

$$\begin{aligned} \langle d_{JAC}(A, A') \rangle_{G(n,p \approx 1)} &= \left[\frac{1-p}{1-\frac{p}{2}}\right]_{p=1} + (p-1)\, \frac{d}{dp}\left[\frac{1-p}{1-\frac{p}{2}}\right]_{p=1} + \ldots \\ &= 0 + (p-1)\left[\frac{-1}{1-\frac{p}{2}} + \frac{(1-p)\left(\frac{1}{2}\right)}{\left(1-\frac{p}{2}\right)^2}\right]_{p=1} + \ldots \\ &= (p-1)\left(\frac{-1}{1-\frac{1}{2}}\right) + 0 + \ldots \\ &= 2(1-p) + \ldots \end{aligned} \qquad (6.38)$$

Similarly—as we show in Section 6.2.21—the Hamming distance ($d_{HAM}$) behaves in this region as

$$\begin{aligned} \langle d_{HAM}(A, A') \rangle_{G(n,p \approx 1)} &= 2p(1-p)\big|_{p=1} + (p-1)\left(2(1-p) - 2p\right)\big|_{p=1} + \ldots \\ &= 0 + (p-1)(0 - 2) + \ldots \\ &= 2(1-p) + \ldots, \end{aligned} \qquad (6.39)$$

which is exactly the same. Indeed, we observe this equivalence in Figure 3.1 in the region $p \approx 1$. This finding makes intuitive sense: in the region $p \approx 1$, the "union graph", $T$, is essentially a complete graph, and $d_{JAC}$ simply measures the fraction of edges/non-edges that are not in agreement between $G$ and $G'$, which is precisely what $d_{HAM}$ does for all $p$ given an adjacency description.

6.2.21 Derivation: Hamming Distance

The Hamming measure is simply the fraction of mismatched entries between $A$ and $A'$. Due to this simplicity, we can again analytically predict the mean within-ensemble graph distance for graphs sampled from $G(n,p)$:

$$\langle d_{HAM}(A, A') \rangle_{G(n,p)} = \frac{1}{\binom{n}{2}} \sum_{1 \leq i < j \leq n} P\left(|A_{ij} - A'_{ij}| = 1\right) = 2p(1-p). \qquad (6.40)$$

The function $2p(1-p)$ is $n$-independent and has a maximum at $p = \frac{1}{2}$; it matches simulations precisely. Interestingly, while this calculation was done for $G(n,p)$, the results in Figure 3.1 show an equivalent result for RGGs of the same density, $p$.

6.2.22 Derivation: Frobenius

As a back-of-the-envelope calculation, note that the sum of elementwise differences is binomially distributed, with mean $\left\langle \sum_{i,j} |A_{ij} - A'_{ij}| \right\rangle = n(n-1)\, 2p(1-p)$. Using sharp-peakedness, we can thus state, approximately,

$$\langle d_{FRO}(A, A') \rangle_{G(n,p)} = \left\langle \sqrt{\sum_{i,j} |A_{ij} - A'_{ij}|^2} \right\rangle \approx \sqrt{\left\langle \sum_{i,j} |A_{ij} - A'_{ij}| \right\rangle} \simeq n\sqrt{2p(1-p)}, \qquad (6.41)$$

where the middle step uses the fact that $|A_{ij} - A'_{ij}|^2 = |A_{ij} - A'_{ij}|$ for binary adjacency entries. This expression exhibits a maximum at $p = \frac{1}{2}$ for any given $n$, but grows linearly with $n$; the latter two observations are qualitatively borne out in simulations.
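To make the comparison with simulations concrete, the following Monte Carlo check (with illustrative parameter choices $n = 100$, $p = 0.3$, which are ours) compares sample means against the three analytic predictions above.

```python
# A quick Monte Carlo check of the three analytic predictions:
# <d_JAC> = (1 - p)/(1 - p/2), <d_HAM> = 2p(1 - p), <d_FRO> ~ n*sqrt(2p(1-p)).
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 100, 0.3, 200
jac, ham, fro = [], [], []
for _ in range(trials):
    # Upper-triangular adjacency entries of two independent G(n, p) draws
    A1 = np.triu(rng.random((n, n)) < p, 1)
    A2 = np.triu(rng.random((n, n)) < p, 1)
    union = np.logical_or(A1, A2).sum()
    inter = np.logical_and(A1, A2).sum()
    mismatch = (A1 ^ A2).sum()
    jac.append(1 - inter / union)
    ham.append(mismatch / (n * (n - 1) / 2))
    fro.append(np.sqrt(2 * mismatch))  # symmetric matrix doubles the triangle count

print(np.mean(jac), (1 - p) / (1 - p / 2))          # Jaccard, Eq. 6.37
print(np.mean(ham), 2 * p * (1 - p))                # Hamming, Eq. 6.40
print(np.mean(fro), n * np.sqrt(2 * p * (1 - p)))   # Frobenius, Eq. 6.41
```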
