Networks in Nature: Dynamics, Evolution, and Modularity
Total Page:16
File Type:pdf, Size:1020Kb
Networks in Nature: Dynamics, Evolution, and Modularity Sumeet Agarwal Merton College University of Oxford A thesis submitted for the degree of Doctor of Philosophy Hilary 2012 2 To my entire social network; for no man is an island Acknowledgements Primary thanks go to my supervisors, Nick Jones, Charlotte Deane, and Mason Porter, whose ideas and guidance have of course played a major role in shaping this thesis. I would also like to acknowledge the very useful sug- gestions of my examiners, Mark Fricker and Jukka-Pekka Onnela, which have helped improve this work. I am very grateful to all the members of the three Oxford groups I have had the fortune to be associated with: Sys- tems and Signals, Protein Informatics, and the Systems Biology Doctoral Training Centre. Their companionship has served greatly to educate and motivate me during the course of my time in Oxford. In particular, Anna Lewis and Ben Fulcher, both working on closely related D.Phil. projects, have been invaluable throughout, and have assisted and inspired my work in many different ways. Gabriel Villar and Samuel Johnson have been col- laborators and co-authors who have helped me to develop some of the ideas and methods used here. There are several other people who have gener- ously provided data, code, or information that has been directly useful for my work: Waqar Ali, Binh-Minh Bui-Xuan, Pao-Yang Chen, Dan Fenn, Katherine Huang, Patrick Kemmeren, Max Little, Aur´elienMazurie, Aziz Mithani, Peter Mucha, George Nicholson, Eli Owens, Stephen Reid, Nico- las Simonis, Dave Smith, Ian Taylor, Amanda Traud, and Jeffrey Wrana. I would also like to thank the Clarendon Fund for providing the scholar- ship that enabled me to come to Oxford, and Merton College and Oxford Physics for hosting me and enabling me to meet many wonderful people and take advantage of many helpful resources. Finally, I must express my gratitude towards my parents and my family, without whom my existence would be neither possible nor meaningful. 5 Abstract In this thesis we propose some new approaches to the study of complex networks, and apply them to multiple domains, focusing in particular on protein-protein interaction networks. We begin by examining the roles of individual proteins; specifically, the influential idea of `date' and `party' hubs. It was proposed that party hubs are local coordinators whereas date hubs are global connectors. We show that the observations underlying this proposal appear to have been largely illusory, and that topological proper- ties of hubs do not in general correlate with interactor co-expression, thus undermining the primary basis for the categorisation. However, we find significant correlations between interaction centrality and the functional similarity of the interacting proteins, indicating that it might be useful to conceive of roles for protein-protein interactions, as opposed to individual proteins. The observation that examining just one or a few network properties can be misleading motivates us to attempt to develop a more holistic method- ology for network investigation. A wide variety of diagnostics of network structure exist, but studies typically employ only small, largely arbitrarily selected subsets of these. Here we simultaneously investigate many net- works using many diagnostics in a data-driven fashion, and demonstrate how this approach serves to organise both networks and diagnostics, as well as to relate network structure to functionally relevant characteris- tics in a variety of settings. These include finding fast estimators for the solution of hard graph problems, discovering evolutionarily significant aspects of metabolic networks, detecting structural constraints on par- ticular network types, and constructing summary statistics for efficient model-fitting to networks. We use the last mentioned to suggest that duplication-divergence is a feasible mechanism for protein-protein inter- action evolution, and that interactions may rewire faster in yeast than in larger genomes like human and fruit fly. Our results help to illuminate protein-protein interaction networks in mul- tiple ways, as well as providing some insight into structure-function rela- tionships in other types of networks. We believe the methodology outlined here can serve as a general-purpose, data-driven approach to aid in the understanding of networked systems. Contents 1 Introduction 1 1.1 Networks . 1 1.1.1 A brief history . 1 1.1.2 Basic concepts and terminology . 3 1.1.2.1 Hubs . 6 1.1.3 Communities in networks . 6 1.1.3.1 Modularity . 7 1.1.3.2 Multi-resolution community detection . 8 1.1.3.3 Optimisation . 9 1.1.3.4 Infomap . 9 1.1.3.5 Other methods . 10 1.1.4 Network diagnostics and summary statistics . 11 1.1.4.1 Connectivity measures . 11 1.1.4.2 Node or link centrality . 13 1.1.4.3 Paths and distances . 16 1.1.4.4 Clustering . 17 1.1.4.5 Motifs . 18 1.1.4.6 Graph complexity . 18 1.1.4.7 Spectral diagnostics . 21 1.1.4.8 Community-based . 22 1.1.4.9 Network energy and entropy . 25 1.1.4.10 Sampling . 25 1.1.5 Types of real-world networks . 27 1.1.6 Generative models for networks . 28 1.1.6.1 Erd}os-R´enyi . 28 1.1.6.2 Random geometric graphs . 29 1.1.6.3 Preferential attachment and `scale-free' structure . 29 1.1.6.4 Watts-Strogatz networks . 30 i 1.1.6.5 Community detection benchmark networks . 30 1.1.6.6 Exponential random graph models . 31 1.1.6.7 Duplication-divergence for gene/protein evolution . 32 1.2 Interactomics . 32 1.2.1 Proteins in biology . 33 1.2.2 Data sources . 35 1.2.2.1 Protein-protein interaction data . 35 1.2.2.2 Gene expression data . 36 1.2.2.3 The Gene Ontology . 37 1.2.3 Protein interaction networks . 38 1.3 Machine learning . 41 1.3.1 Basics . 41 1.3.2 Supervised learning . 42 1.3.2.1 Classification . 43 1.3.2.2 Regression . 45 1.3.3 Unsupervised learning . 46 1.4 Overview . 49 2 Roles in Protein Interaction Networks 52 2.1 Background and motivation . 52 2.2 Materials and methods . 56 2.2.1 Protein interaction data sets . 56 2.2.2 Functional homogeneity of communities . 59 2.2.3 Jaccard distance . 59 2.2.4 Functional similarity . 60 2.3 Revisiting date and party hubs . 60 2.4 Topological properties and node roles . 68 2.5 Data incompleteness and experimental limitations . 75 2.6 The roles of interactions . 77 2.7 Discussion . 81 3 High-Throughput Analysis of Networks 84 3.1 Motivation . 84 3.2 Data sets and algorithms . 87 3.3 Organisation of networks and features . 90 3.3.1 Network data . 90 3.3.2 Isomap and network clustering . 93 ii 3.3.3 Network classification . 100 3.3.4 Communities of features . 103 3.4 Hardness regression . 106 3.4.1 TSP solvers . 108 3.4.2 Network feature correlations . 109 3.4.3 Effect of network density . 115 3.5 Phylogeny regression . 117 3.5.1 Data . 118 3.5.2 Model fitting . 120 3.6 Discussion . 132 4 Feature Degeneracies and Network Entropies 135 4.1 Background . 135 4.2 Network feature degeneracies . 137 4.2.1 Granular contact networks . 138 4.3 Thermodynamic entropy of network ensembles . 146 4.4 Statistical entropy in feature space . 149 4.5 Entropy comparisons . 152 4.5.1 Erd}os-R´enyi networks . 152 4.5.2 Modular networks . 154 4.5.3 Watts-Strogatz networks . 157 4.6 Discussion . 159 5 Bayesian Model-Fitting for Networks 161 5.1 Background and motivation . 161 5.2 Bayesian inference for model-fitting . 162 5.3 Approximate Bayesian computation . 164 5.4 Data-driven parameterisation for ABC . 166 5.4.1 Automated network summary statistics . 166 5.4.2 Definition of error prior . 168 5.4.3 Algorithm . 170 5.5 Fitting network models . 171 5.5.1 Synthetic data . 172 5.5.2 Protein interaction networks . 177 5.5.2.1 Estimating rewiring rates . 180 5.6 Discussion . 184 iii 6 Conclusions 188 6.1 Key results . 188 6.2 Roles in protein interaction networks . 189 6.3 High-throughput analysis of networks . 191 6.4 Feature degeneracies and network entropies . 194 6.5 Bayesian model-fitting for networks . 195 6.6 Summary . 197 A List of Network Features 198 B Set of 192 Real-World Networks 203 C Approximate Analytic Expressions for Thermodynamic Network Entropy 208 C.1 Modular networks . 208 C.2 Watts-Strogatz networks . 209 Bibliography 211 iv List of Figures 1.1 The Seven Bridges of K¨onigsberg problem. ........... 2 1.2 Date and party hubs. ......................... 40 1.3 Example classifiers. .......................... 44 2.1 Variation in hub avPCC distribution. 63 2.2 Effects of hub deletion on network connectivity. 65 2.3 Hub deletion effects for AP/MS-only, Y2H-only, and bottle- necks data sets. ............................. 66 2.4 Community structure in the largest connected component of the FYI network. ............................ 69 2.5 Topological node role assignments and relation with avPCC. 72 2.6 Topological node role assignments and relation with avPCC. 73 2.7 Rolewise hub avPCC distributions. 74 2.8 Relating interaction betweenness, co-expression, and func- tional similarity. ............................ 79 3.1 Network-feature matrices. ...................... 94 3.2 Network clustering via Isomap dimensionality reduction. 96 3.3 Residual variance as the number of Isomap dimensions is in- creased. .................................. 99 3.4 Scatter plot in the space of 3 selected features. 102 3.5 Feature correlations on a set of 192 real-world networks. 105 3.6 Network features correlate significantly with outputs from a cross-entropy TSP solver.