Estimating the Size of the Human Interactome
Michael P. H. Stumpf†‡§, Thomas Thorne†, Eric de Silva†, Ronald Stewart†, Hyeong Jun An¶, Michael Lappe¶, and Carsten Wiuf§‖

†Division of Molecular Biosciences, Imperial College London, Wolfson Building, London SW7 2AZ, United Kingdom; ‡Institute of Mathematical Sciences, Imperial College London, London SW7 2AZ, United Kingdom; ‖Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany; and ¶Bioinformatics Research Center, University of Aarhus, 8000 Aarhus C, Denmark

Edited by Burton H. Singer, Princeton University, Princeton, NJ, and approved February 19, 2008 (received for review August 27, 2007)

After the completion of the human and other genome projects it emerged that the number of genes in organisms as diverse as fruit flies, nematodes, and humans does not reflect our perception of their relative complexity. Here, we provide reliable evidence that the size of protein interaction networks in different organisms appears to correlate much better with their apparent biological complexity. We develop a stable and powerful, yet simple, statistical procedure to estimate the size of the whole network from subnet data. This approach is then applied to a range of eukaryotic organisms for which extensive protein interaction data have been collected and we estimate the number of interactions in humans to be ≈650,000. We find that the human interaction network is one order of magnitude bigger than the Drosophila melanogaster interactome and ≈3 times bigger than in Caenorhabditis elegans.

evolutionary systems biology | network inference | network sampling theory | network evolution

Author contributions: M.P.H.S., M.L., and C.W. designed research; M.P.H.S., T.T., E.d.S., M.L., and C.W. performed research; M.P.H.S., T.T., E.d.S., R.S., H.J.A., and C.W. analyzed data; and M.P.H.S., M.L., and C.W. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
See Commentary on page 6795.
§To whom correspondence may be addressed. E-mail: [email protected] or [email protected].
This article contains supporting information online at www.pnas.org/cgi/content/full/0708078105/DCSupplemental.
© 2008 by The National Academy of Sciences of the USA
www.pnas.org/cgi/doi/10.1073/pnas.0708078105 PNAS | May 13, 2008 | vol. 105 | no. 19 | 6959–6964

One of the perhaps most surprising results of the genome-sequencing projects was that the number of genes is much lower than had been expected and is, in fact, surprisingly similar for very different organisms (1, 2). For example, the nematode Caenorhabditis elegans appears to have a similar number of genes as humans, whereas rice and maize appear to have even more genes than humans. It was then quickly suggested that the biological complexity of organisms is not reflected merely by the number of genes but by the number of physiologically relevant interactions (1, 3). In addition to alternative splice variants (4), posttranslational processes (5), and other (e.g., genetic) factors influencing gene expression (6, 7), the structure of the interactome is one of the crucial factors underlying the complexity of biological organisms. Here, we focus on the wealth of available protein interaction data and demonstrate that it is possible to arrive at a reliable statistical estimate for the size of these interaction networks. This approach is then used to assess the complexity of protein interaction networks in different organisms from present incomplete and noisy protein interaction datasets.

There are now fairly extensive protein interaction network (PIN) datasets in a number of species, including humans (8, 9). These have been generated by a variety of experimental techniques (as well as some in silico inferences). Although these techniques and the resulting data are (i) notoriously prone to false positives and negatives (10, 11), and (ii) result in highly idealized and averaged network structures (12), such interaction datasets are increasingly turning into useful tools for the analysis of the functional (e.g., ref. 13) and evolutionary properties (14) of biological systems. In particular, in Saccharomyces cerevisiae we are beginning to have a fairly complete description of the protein interaction network that is accessible with current experimental technologies; the recent high-quality literature-curated dataset of Reguly et al. (15) provides us with a dataset that should be almost completely free from false positives. For most other organisms, however, interaction data are still far from complete and it has recently been shown that subnetworks, in general, have qualitatively different properties from the true network (16–18). Although the importance of network-sampling properties has only been realized relatively recently, this aspect of most systems biology data is increasingly being recognized (11, 19) as important.

There are, however, some properties of the true network that can be inferred even from subnet data, and here we show that the total network size is one property for which this is the case. Present protein-interaction datasets enable us to estimate the size of the interactomes in different species by using graph theoretical invariants. This is particularly interesting for species where more than one experimental dataset is available. Below we first describe a robust and very general estimator of network size from partial network data that overcomes this problem. We then apply it to available PIN data in a range of eukaryotic organisms. In supporting information (SI) Text we demonstrate the power of this approach by using extensive simulation studies.

Estimating Interactome Size

Here, we develop an approach for estimating the size of a network from incomplete data. We will show below (and by using extensive simulations in SI Text) that, for a given species, different independent datasets (generated by different methods such as yeast-two-hybrid and TAP tagging) yield estimates for the interactome size that are in excellent agreement.

We are concerned with a true network, $\mathcal{N}$, which has $N_{\mathcal{N}}$ nodes and $M_{\mathcal{N}}$ edges. The sets of nodes and edges are given by $V_{\mathcal{N}}$ and $E_{\mathcal{N}}$, respectively; these define the graph representation of the true network:

\[ G_{\mathcal{N}} = (V_{\mathcal{N}}, E_{\mathcal{N}}). \tag{1} \]

We pick a subset of nodes $V_{\mathcal{S}} \subseteq V_{\mathcal{N}}$ and study properties of the subgraph $G_{\mathcal{S}}$ induced by the nodes in $V_{\mathcal{S}}$,

\[ G_{\mathcal{S}} = (V_{\mathcal{S}}, E_{\mathcal{S}}), \tag{2} \]

where the set of edges observed in the subnet $\mathcal{S}$ is a subset of the total set of edges, $E_{\mathcal{S}} \subseteq E_{\mathcal{N}}$. Our aim is to predict the number of interactions in the true network $G_{\mathcal{N}}$ based on the available data in the subnet, $G_{\mathcal{S}}$.

We assume that the network, $G_{\mathcal{N}}$, is generated according to some (unknown) model characterized by a parameter (vector) $\theta$, and subsequently the observed network, $G_{\mathcal{S}}$, is sampled from it. Then

\[ P_{\theta,p}(G_{\mathcal{S}}) = \sum_{G_{\mathcal{N}} \supseteq G_{\mathcal{S}}} P_p(G_{\mathcal{S}} \mid G_{\mathcal{N}})\, P_\theta(G_{\mathcal{N}}), \tag{3} \]

where it is assumed that the sampling is independent of the network-generating model. The parameter $p$ refers to a general sampling process, and not only independent node sampling. Furthermore, we assume the order $N_{\mathcal{N}}$ of the network is known and allow nodes to be annotated with information not related to the wiring of the network (e.g., GO terms or protein family classes). Consequently, the sum is over networks, $G_{\mathcal{N}}$, with $N_{\mathcal{N}}$ (labeled) nodes only. For convenience, we take labeling information to be included in $N_{\mathcal{N}}$ and $N_{\mathcal{S}}$ (the order of $G_{\mathcal{S}}$).

If sampling only depends on the nodes in the network and not on their connections, then $P_p(G_{\mathcal{S}} \mid G_{\mathcal{N}})$ splits into a product of two terms,

\[ P_{\theta,p}(G_{\mathcal{S}}) = Q_p(N_{\mathcal{S}}) \sum_{G_{\mathcal{N}} \supseteq G_{\mathcal{S}}} q(G_{\mathcal{S}}, G_{\mathcal{N}})\, P_\theta(G_{\mathcal{N}}), \tag{4} \]

where $Q_p(N_{\mathcal{S}})$ is a term denoting the probability of sampling the nodes in the observed PIN and $q(G_{\mathcal{S}}, G_{\mathcal{N}})$ denotes how many ways this can be done given the (labeled) nodes in $G_{\mathcal{N}}$; by assumption, labeling of nodes is the same in all possible $G_{\mathcal{N}}$s. For example, if the nodes are unlabeled and have degree zero, then $q(G_{\mathcal{S}}, G_{\mathcal{N}}) = \binom{N_{\mathcal{N}}}{N_{\mathcal{S}}}$. If all nodes have degree one, a similar factor can be derived based on the number of degree one ($d_1$) and degree zero ($d_0$) nodes that are observed in the PIN: $q(G_{\mathcal{S}}, G_{\mathcal{N}})$ is the number of ways one can choose $d_0$ and $d_1$ out of the $N_{\mathcal{N}}/2$ pairs of connected nodes in $G_{\mathcal{N}}$. If all nodes are labeled then

sampling of nodes is not too restrictive and should apply to many, in particular, high-throughput, experimental studies.

So far we have assumed that the number of ORFs, $N_{\mathcal{N}}$, in an organism is known from genome surveys. Total genome size is, however, still not precisely known in most organisms. Uncertainty in $N_{\mathcal{N}}$ is, however, easily incorporated. Assume that the value $N_{\mathcal{N}}$ is associated with an error or uncertainty $\epsilon$ (i.e., if the genome contains $N_0$ protein-coding genes of which $N_{\mathcal{N}}$ are known, then $\epsilon = (N_0 - N_{\mathcal{N}})/N_0$). Then let $N_{\mathcal{N}} := N_0 (1 \pm \epsilon)$ and for $\epsilon \lesssim 0.1$ we have

\[ \tilde{p} = \frac{N_{\mathcal{S}}}{N_0} \approx \frac{N_{\mathcal{S}}}{N_{\mathcal{N}}} (1 \pm \epsilon) = (1 \pm \epsilon)\, \hat{p}. \tag{9} \]

Replacing $\hat{p}$ in Eq. 8 with $\tilde{p}$ yields the error-corrected estimate for the true network size

\[ \tilde{M}_{\mathcal{N}} = \frac{M_{\mathcal{S}}}{\tilde{p}^{2}} \approx (1 \pm 2\epsilon)\, \frac{M_{\mathcal{S}}}{\hat{p}^{2}}. \tag{10} \]

Thus, an uncertainty of $\epsilon$ in the number of nodes in the true network results in an uncertainty of $2\epsilon$ for the number of edges in the true network.

To assess the variability of the estimator we can construct approximate bootstrap confidence intervals (CI) (20).