Biological Network Representations and Databases Basic Topological
Total Page:16
File Type:pdf, Size:1020Kb
Biological Network Representations and Databases Transcriptional Regulatory Networks. In transcription regulatory networks the nodes represent transcrip- tion factors and genes while the links describe regulation-based interactions. Each link is directed, point- ing from the transcription factor to the gene that it regulates. To compare the topological properties of transcription-regulatory networks to those of metabolic and protein-protein interaction networks, we have stud- ied their undirected version where each directed link is replaced by an undirected one. The transcription net- work data of Escherichia coli has been reported in [2] and it is available from www.weizmann.ac.il/mcb/UriAlon. The transcription network data of Saccharomyces cerevisiae has been reported in [8], and it is available from http://www.weizmann.ac.il/mcb/UriAlon/Papers/networkMotifs/sma.html. Metabolic Networks. Metabolic networks represent the sum of all biochemical reactions taking place in an or- ganism. Different network representations are possible, depending on the purpose [4]. In our work we associate nodes with metabolites. Undirected links connect each substrate to the products of the reaction in which the sub- strate participates. The metabolic networks of E. coli and S. cerevisiae where obtained from the WIT database (http://igweb.integratedgenomics.com/IGwit), and they are also available www.nd.edu/∼networks . Protein-Protein Interaction Networks. In the protein-protein interaction network nodes represent proteins and undirected links describe the pairwise interactions between them. We focus on the protein-protein interaction network of S. cerevisiae. The data source is the Database of Interacting Proteins (DIP) available at http://dip.doe- mbi.ucla.edu. For a comparison between the different protein interaction databases see [1]. In summary, we constructed an undirected network representation of the transcription, the metabolic and the protein-protein interaction networks of E. coli and S. cerevisiae. In order to have a coherent basis of comparison we have reduced the original networks by removing self-loops, double links (i.e., if we have simultaneous links from node A to node B and from B to A, we draw only one undirected link joining the two A and B) and isolated nodes. The basic quantities characterizing the resulting networks are summarized in Tab. 3. Basic Topological Properties We denote by N the number of nodes in the network, usually referred to as the network size, and by E the number of links (or edges). Each node i is characterized by the degree ki and the clustering coefficient Ci. The degree ki counts the number of interactions that node i has with the other nodes in the network. By averaging ki over all nodes in the network we obtain the average degree N 1 2E hki ≡ k = . (1) N i N i=1 X The last equality follows from the fact that each link connects two nodes, and hence contributing to the degree of two nodes. The values of N, E and hki for the investigated biological networks are given in Table 3. Note that for each network hki ¿ N, meaning that biological networks are sparse, i.e. on average each node is connected to only a few other nodes. The clustering coefficient Ci is a measure of the fraction of connected neighbors of node i. A node with ki links ki can have at most 2 = ki(ki − 1)/2 pairs of its neighbors connected to each other. If we denote by ti the number of links among the neighbors of node i, then the clustering coefficient is defined as ¡ ¢ 2ti Ci ≡ . (2) ki(ki − 1) The average of Ci over all nodes in the network is N 1 hCi ≡ C . (3) N i i=1 X The values of hCi for the analyzed biological networks are given in Table 3. For comparison we also list the clustering coefficient of a random network with the same parameters N and E. Note that hCi À hki /N; therefore, biological networks are characterized by a high degree of local clustering [5]. The average values hki and hCi do not characterize exhaustively the topology since biological networks exhibit marked degree fluctuations [3, 4]. Indeed, the probability P (k) for a node to have degree k is given by the power law − P (k) = Ak γ , (4) where A is a constant and γ is the degree exponent. The highly inhomogeneous nature of P (k) prompted the study of the clustering coefficient as a function of the degree. Instead of taking the average over all nodes in the network, as in Eq. (3), the average is taken over nodes with the same degree, providing N i=1 Ciδkik C(k) = N , (5) P i=1 δkik where δkik = 1 if ki = k, and zero otherwise. For biologicalP networks C(k) follows a power law −α C(k) = C0k , (6) where C0 is a constant and α is the hierarchical exponent. The hierarchical exponent characterizes the network’s overall modularity, indicating the presence of many small, highly interconnected modules forming larger, less cohesive topological modules. The degree and hierarchical exponents, γ and α respectively, rather than the average degree hki and the average clustering coefficient hCi, are the two important topological quantities characterizing biological networks. Their values for the studied biological networks are summarized in Table I. Statistical Properties of Subgraphs Number of Triangles Passing by a Node. To develop an intuition for the statistical properties of general subgraphs we first study the occurrence of triangles. Since each connected pair of neighbors forms a triangle with the central node i, then ti defined in (2) also represents the number of triangles passing by node i. The average of ti over nodes of same degree k, denoted by T (k), is k T (k) = C(k) . (7) 2 µ ¶ k 2 Using 2 ∼ k /2, we find ¡ ¢ T (k) ∼ kβ , (8) with β = 2 − α . (9) Combining Eqs. (4) and (8), we predict that the probability that a randomly selected node participates in T triangles follows a power law − P (T ) ∼ T δ , (10) where γ − 1 δ = 1 + . (11) β P (T ) and T (k) for the studied biological networks are plotted in Fig. 4 together with the power law fits. The β and δ exponents provided by the fit are listed in Table I, indicating that the predicted values are in good agreement with the measured ones. Number of Small Subgraphs Passing by a Node. The above analysis based on triangles can be extended to other small subgraphs as well. We consider subgraphs with n nodes and m links that can be decomposed into a central node and n − 1 neighbors. More precisely, we have n − 1 interactions from the central node to its neighbors and t = m − (n − 1) interactions among the n − 1 neighbors. In the paper we have used m together with n to classify each motif; here, to simplify the calculations we use t instead of m. The total number of n-node subgraphs that can pass by a node k with degree k is n−1 , which tells us in how many different ways we can choose a group of n − 1 nodes out of the total k. Each of these n-node subgraphs can have at most n = (n − 1)(n − 2)/2 links between the n − 1 neighbors of ¡ ¢ p the central node. The probability that there is a link between two neighbors of a node with degree k is by definition C(k), and the probability that two neighbors are not connected is 1 − C(k). Therefore, the probability to obtain t connected pairs and np − t disconnected pairs is given by the binomial distribution n − b (k) = p C(k)t[1 − C(k)]np t . (12) nt t µ ¶ The average number of subgraphs formed by n − 1 neighbors and t interactions among them and centered at a node with degree k is given by k N (k) = b (k) . (13) nt n − 1 nt µ ¶ We readily obtain that if k is large, then Nnt(k) scales as βnt Nnt(k) ∼ k , (14) where βnt = n − 1 − αt = n − 1 − α(m − n + 1) . (15) Using Eqs. (4) and (14) we find that the probability for a randomly selected node to participate in Tnt subgraphs (n, t) scales as −δnt P (Tnt) ∼ Tnt , (16) where γ − 1 δnt = 1 + . (17) βnt In particular, triangles correspond to n = 3 and t = 1 (i.e., m = 3), leading to β31 = 2 − α and δ31 =1+(γ − 1)/δ, and we thus recover the expressions (9) and (11) derived earlier for triangles. Note that this approach can be generalized to higher order cycles, such as squares, pentagons and others. Over- (Type I) and Under-represented (Type II) Subgraphs. Here we use the scaling relations derived above to address the abundance of (n, t) subgraphs in scale-free and hierarchical networks. The expected number of (n, t) subgraphs in a network is given by Eq. (13) kmax Nnt = gntN P (k)Nnt(k) , (18) k X=1 where kmax is the maximum degree and the factor gnt takes into account that the same subgraph can have more than one node as a center (see Fig. 5). Eqs. (4), (6) and (12-18) lead to kmax n − − k − − N = NAg p Ct k γ αt [1 − C k α]np t . (19) nt nt t 0 n − 1 0 k µ ¶ X=1 µ ¶ We recognize the existence of two regimes. There is a regime where the sum in the above expression is dominated by k n−1 −α the large k region; this means that we can approximate n−1 by k /(n − 1)! and 1 − C0k by 1, being left with ¡ ¢ np t kmax Agnt C − − − N ≈ N t 0 kn 1 γ αt .