Correlation in Complex Networks
Total Page:16
File Type:pdf, Size:1020Kb
Correlation in Complex Networks by George Tsering Cantwell A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Physics) in the University of Michigan 2020 Doctoral Committee: Professor Mark Newman, Chair Professor Charles Doering Assistant Professor Jordan Horowitz Assistant Professor Abigail Jacobs Associate Professor Xiaoming Mao George Tsering Cantwell [email protected] ORCID iD: 0000-0002-4205-3691 © George Tsering Cantwell 2020 ACKNOWLEDGMENTS First, I must thank Mark Newman for his support and mentor- ship throughout my time at the University of Michigan. Further thanks are due to all of the people who have worked with me on projects related to this thesis. In alphabetical order they are Eliz- abeth Bruch, Alec Kirkley, Yanchen Liu, Benjamin Maier, Gesine Reinert, Maria Riolo, Alice Schwarze, Carlos Serván, Jordan Sny- der, Guillaume St-Onge, and Jean-Gabriel Young. ii TABLE OF CONTENTS Acknowledgments .................................. ii List of Figures ..................................... v List of Tables ..................................... vi List of Appendices .................................. vii Abstract ........................................ viii Chapter 1 Introduction .................................... 1 1.1 Why study networks?...........................2 1.1.1 Example: Modeling the spread of disease...........3 1.2 Measures and metrics...........................8 1.3 Models of networks............................ 11 1.4 Inference.................................. 13 1.5 Contributions of this thesis........................ 15 2 Mixing Patterns and Individual Differences .................. 18 2.1 Individual preferences and patterns of connection.......... 20 2.1.1 Preference-based network model................ 21 2.1.2 Inferring individual preferences................ 23 2.2 Distributions of preferences....................... 24 2.3 Measures of assortativity and variation of preferences........ 27 2.4 Examples.................................. 30 2.4.1 High school friendships and ethnicity............. 32 2.4.2 Word adjacencies......................... 32 2.5 Discussion................................. 33 3 Inference and Individual Differences ...................... 35 3.1 Missing data and community detection................ 36 3.1.1 Monte Carlo algorithm to sample from @ 6 ......... 39 3.1.2 Performance............................¹ º 41 3.2 Theoretical analysis............................ 42 3.2.1 Derivation of the belief propagation.............. 44 3.2.2 Fourier transform and convolutions.............. 47 iii 3.2.3 Verification of the belief propagation............. 48 3.2.4 Symmetry breaking phase transition.............. 49 3.3 Discussion................................. 55 4 Edge Correlation and Thresholding ....................... 57 4.1 Model specification............................ 60 4.1.1 Thresholding locally correlated data.............. 60 4.1.2 Sampling from the model.................... 64 4.2 Network properties............................ 65 4.2.1 Edge density........................... 65 4.2.2 Triangles, clustering, and degree variance........... 65 4.2.3 Degree distribution........................ 67 4.2.4 Giant component......................... 70 4.2.5 Shortest path lengths....................... 71 4.3 Discussion................................. 73 5 Message Passing for Complex Networks .................... 74 5.1 Message passing with loops....................... 76 5.2 Applications................................ 77 5.2.1 Percolation............................ 77 5.2.2 Matrix spectra........................... 83 5.3 Discussion................................. 89 6 Entropy and Partition Functions of Complex High-Dimensional Models . 91 6.1 Tree ansatz................................. 95 6.1.1 Evaluating one- and two-point marginal distributions with belief propagation........................ 96 6.1.2 Example: the Ising model.................... 98 6.2 Neighborhood ansatz........................... 98 6.2.1 Evaluating the neighborhood marginal distributions with belief propagation........................ 102 6.2.2 Example: the Ising model.................... 103 6.3 Discussion................................. 104 7 Conclusion ..................................... 107 Appendices ...................................... 110 Bibliography ..................................... 124 iv LIST OF FIGURES 1.1 An example graph................................1 1.2 Simulations of the SIR model on a network..................6 1.3 Moreno’s sociograms.............................. 10 2.1 Histogram for :8B :8 in the model of Eq. (2.3)................ 24 2.2 Friendship preferences/ by race or ethnicity in a US high school...... 31 2.3 “Preferences” of different parts of speech to be followed by nouns.... 33 3.1 Test of the belief propagation algorithm of Sec. 3.2.1............ 49 3.2 Propagation of a small message perturbation................ 53 3.3 Performance in the group label recovery task with synthetic data..... 56 4.1 Thresholding relational data to obtain networks............... 61 4.2 Clustering and triangles per node ) for thresholded normal data... 67 4.3 The variance of degree for thresholded normal data............ 68 4.4 Degree distributions for thresholded normal data.............. 69 4.5 Degree histograms for three real-world networks.............. 70 4.6 Giant component phase transition for thresholded normal data...... 71 4.7 Average shortest paths in thresholded normal data............. 72 5.1 Constructing A-neighborhoods......................... 78 5.2 Percolation simulations and equation based results............. 82 5.3 An example of an excursion.......................... 85 5.4 Matrix spectra for two real-world networks.................. 88 6.1 Bethe ansatz for the Ising model........................ 99 6.2 Neighborhoods and various related quantities for node 8 in an example network...................................... 100 6.3 Neighborhood ansatz for the Ising model................... 104 v LIST OF TABLES 2.1 Estimates of normalized preference assortativity ' and preference vari- ance + for a selection of networks with known group assignments.... 30 3.1 Missing data recovery for six example networks............... 42 vi LIST OF APPENDICES A Estimating Mixing Parameters .......................... 110 B Technical Details for Normal Distributions .................. 113 C Derivation of the Message Passing Equations ................. 117 D Monte Carlo Algorithm for Percolation Message Passing .......... 121 vii ABSTRACT Network representations are now ubiquitous across science. Indeed, they are the natural representation for complex systems—systems composed of large num- bers of interacting components. Occasionally systems can be well represented by simple, regular networks, such as lattices. Usually, however, the networks them- selves are complex—highly structured but with no obvious repeating pattern. In this thesis I examine the effects of correlation and interdependence on network phenomena, from three different perspectives. First, I consider patterns of mixing within networks. Nodes within a network frequently have more connections to others that are similar to themselves than to those that are dissimilar. However, nodes can (and do) display significant hetero- geneity in mixing behavior—not all nodes behave identically. This heterogeneity manifests as correlations between individuals’ connections. I show how to identify and characterize such patterns, and how this correlation can be used for practical tasks such as imputation. Second, I look at the effects of correlation on the structure of networks. If edges within a relational data set are correlated with each other, and if we construct a network from this data, then several of the properties commonly associated with real-world complex networks naturally emerge, namely heavy-tailed degree distributions, large numbers of triangles, short path lengths, and large connected components. Third, I develop a family of technical tools for calculations about networks. If viii you are using a network representation, there’s a good chance you wish to calculate something about the network—for example, what will happen when a disease spreads across it. An important family of techniques for network calculations assume that the networks are free of short loops, which means that the neighbors of any given node are conditionally independent. However, real-world networks are clustered and clumpy, and this structure undermines the assumption of conditional independence. I consider a prescription to deal with this issue, opening up the possibility for many more analyses of realistic and real-world data. ix CHAPTER 1 Introduction The study of networks is now a well established field, complete with dedicated textbooks, journals, international conferences, popular writings, and research cen- ters [1–14]. Networks are used across scientific disciplines including in biology, neuroscience, public health and medicine, ecology, sociology, political science, en- gineering and economics [15–33]. In this thesis I will present contributions I have made over the last four years to the theoretical study of networks. First, we should discuss what a network actually is. Central to the concept of a network is the mathematical notion of a graph [34]. A graph is a mathematical object constructed from a set of nodes (or vertices), and a set of edges between nodes. One writes = +, to represent that graph contains