Pramana – journal of physics, © Indian Academy of Sciences, Vol. 84, No. 2, February 2015, pp. 285–293

Importance of randomness in biological networks: A random analysis

SARIKA JALAN (1,2)
(1) Complex Systems Lab, Discipline of Physics, Indian Institute of Technology Indore, IET-DAVV Campus, Khandawa Road, Indore 452 017, India
(2) Centre for Bio-Sciences and Bio-Engineering, Indian Institute of Technology Indore, IET-DAVV Campus, Khandawa Road, Indore 452 017, India
E-mail: [email protected]

DOI: 10.1007/s12043-015-0940-9; ePublication: 29 January 2015

Abstract. Random matrix theory, initially proposed to understand the complex interactions in nuclear spectra, has demonstrated its success in diverse domains of science ranging from quantum chaos to galaxies. We demonstrate the applicability of random matrix theory to networks, providing a new dimension to complex systems research. We show that, in spite of the huge differences between these interaction networks, which represent real-world systems, and random matrix models, the spectral properties of the underlying matrices of these networks follow random matrix theory, bringing them into the same universality class. We further demonstrate the importance of randomness in interactions for deducing crucial properties of the underlying system. This paper provides an overview of the importance of the random matrix framework in complex systems research, with biological systems as examples.

Keywords. Network theory; biological systems; spectra of matrices

PACS Nos 64.60.aq; 02.10.Ud; 87.19.xj

1. Introduction

The field of network analysis helps us look at every individual and its interactions as part of a complex social structure [1]. It yields explanations for various phenomena in a wide variety of disciplines ranging from physics to psychology to and attempts to explain the formation of specific network ties, or the importance of an individual's position in a network in determining the opportunities and constraints that the individual encounters, which in turn affect its outcome [2,3]. Causal relations between structural attributes and success factors, which seemed thoroughly random to the eyes of a researcher until a decade ago, have been analysed under the network theory framework [4].

The post-genomic era aims to understand the role of proteomics and genomics in human health and diseases [5]. The ample availability of data in functional genomics and proteomics has been made possible by the development of high-throughput data-collection techniques, which has accompanied a move from the traditional gene-based molecular approach to a systems approach of network biology [6,7]. It has been increasingly realized that dissecting the genetic and chemical circuitry prevents us from further understanding the biological processes as a whole [8–10]. In order to understand the complexities involved, all reactions and processes should be analysed together [11]. Network biology provides such a framework, in which biological processes are considered as complex networks of interactions between numerous components of the cell rather than as independent interactions involving only a few molecules [12].

In this paper we provide an overview of recent developments in understanding complex biological systems achieved through random matrix analysis of the underlying networks. Random matrix theory (RMT), proposed by Wigner to explain the statistical properties of nuclear spectra, has achieved remarkable success in understanding complex systems, which include disordered systems, quantum chaotic systems, spectra of large complex atoms, etc. [13,14]. Further studies illustrate the usefulness of RMT in understanding the statistical properties of the empirical cross-correlation matrices used in the study of multivariate time series of price fluctuations in the stock market [15], EEG data of the brain [16], variation of various atmospheric parameters [17], etc. In this paper, we review the recent extension of this theory to biological networks.

The spectrum of any real-world network can be divided into three parts: the first consists of the extremal eigenvalues at both ends of the spectrum, the second comprises the smooth middle region, and the third consists of degenerate eigenvalues, mostly found at the values 0 and −1. In the following, we explore the properties of these three segments of the spectra and their corresponding eigenvectors in detail, in order to gain a deeper understanding of biological systems under a mathematical framework.

2. Methods and techniques

2.1 Construction of networks

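A network consists of nodes (or vertices) which are connected through edges (or links). The adjacency matrix A of a network is constructed as

\[ A_{ij} = \begin{cases} 1, & \text{if } i \sim j, \\ 0, & \text{otherwise,} \end{cases} \qquad (1) \]

where i ∼ j denotes that nodes i and j are connected by an edge.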
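As an illustration of eq. (1), the following minimal sketch builds the adjacency matrix of a small undirected network from an edge list and computes its spectrum. The function name `adjacency_matrix`, the toy edge list and the choice to ignore self-loops are assumptions made for this example and are not taken from the works reviewed here.

```python
import numpy as np

def adjacency_matrix(n_nodes, edges):
    """Return the symmetric 0/1 adjacency matrix of an undirected network."""
    A = np.zeros((n_nodes, n_nodes), dtype=int)
    for i, j in edges:
        if i == j:
            continue  # assumption: self-loops are ignored
        A[i, j] = 1
        A[j, i] = 1   # undirected network: A is symmetric
    return A

# Hypothetical toy network: a triangle (0, 1, 2) with a pendant node 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
A = adjacency_matrix(4, edges)

# A is real and symmetric, so its eigenvalues are real;
# eigvalsh returns them in ascending order.
eigenvalues = np.linalg.eigvalsh(A)
```

Because the diagonal of A is zero, the eigenvalues sum to zero, which gives a quick sanity check on the construction.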

Apart from the simple construction described above, different types of networks can be constructed depending on the nature of the connections. For example, ref. [11] considers a gene coexpression network generated from the gene coexpression data of six brain regions relevant to Alzheimer's disease. A binary network is created in the following manner. Pearson's product–moment correlation is calculated for the expression levels of each probe-set pair on the microarray (where one gene is represented by one or more probe sets), and a threshold value is chosen. If the coexpression strength is greater than the threshold, the corresponding matrix element is set to one; otherwise it is set to zero. Thresholding generates a network with far fewer edges, which may result in many disconnected components; in such cases one analyses the properties of the largest connected component.
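The thresholding procedure can be sketched as follows. This is only an illustration under stated assumptions: the expression data are synthetic, the threshold of 0.8 is arbitrary, the signed Pearson correlation (rather than its absolute value) is compared with the threshold, and the helper names `coexpression_network` and `largest_connected_component` are hypothetical.

```python
import numpy as np

def coexpression_network(expression, threshold):
    """Binary network from an expression array of shape (n_genes, n_samples)."""
    corr = np.corrcoef(expression)        # Pearson correlation between rows
    A = (corr > threshold).astype(int)    # 1 where coexpression exceeds the threshold
    np.fill_diagonal(A, 0)                # no self-links
    return A

def largest_connected_component(A):
    """Return the sorted node indices of the largest connected component."""
    n = A.shape[0]
    seen = np.zeros(n, dtype=bool)
    best = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:                      # depth-first search from `start`
            node = stack.pop()
            component.append(node)
            for nb in np.flatnonzero(A[node]):
                if not seen[nb]:
                    seen[nb] = True
                    stack.append(int(nb))
        if len(component) > len(best):
            best = component
    return sorted(best)

# Synthetic data (assumption): genes 0-3 share a common signal, genes 4-6 do not.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
coexpressed = base + 0.1 * rng.normal(size=(4, 200))
independent = rng.normal(size=(3, 200))
expression = np.vstack([coexpressed, independent])

A = coexpression_network(expression, threshold=0.8)
lcc = largest_connected_component(A)
A_lcc = A[np.ix_(lcc, lcc)]               # adjacency matrix restricted to that component
```

Only `A_lcc` would then be passed on to the spectral analysis, in line with the restriction to the largest connected component described above.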

2.2 Nearest-neighbour spacing distribution (NNSD)

The spectrum of the corresponding adjacency matrix is denoted by $\lambda_i$, $i = 1, \ldots, N$, with $\lambda_1 < \lambda_2 < \lambda_3 < \cdots < \lambda_N$. Random matrix studies of eigenvalue spectra consider two kinds of properties: (1) global properties, such as the spectral density distribution of eigenvalues, defined as

\[ \rho(\lambda) = \frac{1}{N} \sum_{j=1}^{N} \delta(\lambda - \lambda_j), \quad \text{where } \delta(x) = \begin{cases} 1, & \text{if } x = 0, \\ 0, & \text{if } x \neq 0 \end{cases} \qquad (2) \]

is the delta function, and (2) local properties, such as eigenvalue fluctuations around ρ(λ). In order to calculate local properties in RMT, it is customary to unfold the data by a transformation $\bar{\lambda}_i = \bar{N}(\lambda_i)$, where

\[ \bar{N}(\lambda) = \int_{\lambda_{\min}}^{\lambda} \rho(\lambda')\, d\lambda' \]

is the averaged integrated eigenvalue density [14]. In the absence of an analytical form for $\bar{N}$, we unfold the spectrum numerically by polynomial curve fitting. Using the unfolded spectra, we calculate the distribution of the nearest-neighbour spacings $s^{(i)} = \bar{\lambda}_{i+1} - \bar{\lambda}_i$ (NNSD) and fit it by the Brody distribution, characterized by the parameter β [18]:

\[ P_\beta(s) = A s^{\beta} \exp\left(-\alpha s^{\beta+1}\right), \qquad (3) \]

where A and α are determined by the parameter β as

\[ A = (1 + \beta)\alpha, \qquad \alpha = \left[\Gamma\left(\frac{\beta + 2}{\beta + 1}\right)\right]^{\beta + 1}. \]

As β goes from zero to one, the Brody formula changes smoothly from the Poisson statistics, $P(s) = \exp(-s)$, to the Gaussian orthogonal ensemble (GOE) statistics, characterized by $P(s) = (\pi/2)\, s \exp(-\pi s^2/4)$. The GOE represents a universality class of chaotic systems with time-reversal symmetry, yielding level repulsion at small spacings and a Gaussian fall-off at large spacings. The Brody distribution does not model pseudointegrable systems, which are non-integrable as well as non-chaotic.
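The unfolding and fitting steps of this subsection can be sketched as below. Several choices here are assumptions made for illustration and are not specified in the text: the fraction of eigenvalues trimmed at each end, the polynomial degree, the rescaling of the spacings to unit mean, and the use of a maximum-likelihood grid search for β (the text only says that the NNSD is fitted by eq. (3)). The function names are hypothetical.

```python
import math
import numpy as np

def brody_alpha(beta):
    """alpha of the Brody distribution, fixed by beta."""
    return math.gamma((beta + 2.0) / (beta + 1.0)) ** (beta + 1.0)

def brody_pdf(s, beta):
    """Brody distribution of eq. (3), with A = (1 + beta) * alpha."""
    alpha = brody_alpha(beta)
    return (1.0 + beta) * alpha * s ** beta * np.exp(-alpha * s ** (beta + 1.0))

def unfolded_spacings(eigenvalues, trim=0.1, degree=5):
    """Unfold the bulk of a spectrum by a polynomial fit to the staircase."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))
    k = int(trim * lam.size)
    lam = lam[k:lam.size - k]                     # assumption: drop 10% at each end
    staircase = np.arange(1, lam.size + 1)        # integrated level count
    smooth = np.polynomial.Polynomial.fit(lam, staircase, degree)
    s = np.diff(smooth(lam))                      # spacings of the unfolded levels
    s = s[s > 0]                                  # guard against a non-monotonic fit
    return s / s.mean()                           # enforce unit mean spacing

def fit_brody(spacings, grid=None):
    """Grid-search maximum-likelihood estimate of beta (assumed fitting method)."""
    s = np.asarray(spacings, dtype=float)
    grid = np.linspace(0.0, 1.5, 151) if grid is None else np.asarray(grid)
    log_s_sum = np.log(s).sum()
    best_beta, best_ll = float(grid[0]), -np.inf
    for beta in grid:
        alpha = brody_alpha(beta)
        ll = (s.size * math.log((1.0 + beta) * alpha)
              + beta * log_s_sum - alpha * np.sum(s ** (beta + 1.0)))
        if ll > best_ll:
            best_beta, best_ll = float(beta), ll
    return best_beta

rng = np.random.default_rng(1)
n = 500
M = rng.normal(size=(n, n))
goe = (M + M.T) / 2.0                             # real symmetric Gaussian matrix
beta_goe = fit_brody(unfolded_spacings(np.linalg.eigvalsh(goe)))

uncorrelated = rng.uniform(size=n)                # independent levels
beta_poisson = fit_brody(unfolded_spacings(uncorrelated))
```

For the Gaussian matrix the fitted β is expected to lie close to one, and for independent levels close to zero. At β = 1 the constants reduce to α = π/4 and A = π/2, which reproduces the GOE form quoted above, and at β = 0 they reduce to α = A = 1, the Poisson form.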

2.3 $\Delta_3(L)$ statistics

We analyse the long-range correlations in eigenvalues using the $\Delta_3(L)$ statistic, which measures the least-square deviation of the spectral staircase function, representing the averaged integrated eigenvalue density $\bar{N}(\bar{\lambda})$, from the best-fitted straight line over a finite interval of length L of the spectrum, given by [18]

\[ \Delta_3(L; x) = \frac{1}{L} \min_{a,b} \int_{x}^{x+L} \left[\bar{N}(\bar{\lambda}) - a\bar{\lambda} - b\right]^2 d\bar{\lambda}, \qquad (4) \]

where a and b are regression coefficients obtained from the least-square fit. Averaging over several choices of x gives $\Delta_3(L)$, the spectral rigidity. For GOE statistics, $\Delta_3(L)$ depends logarithmically on L:

\[ \Delta_3(L) \sim \frac{1}{\pi^2} \ln L. \qquad (5) \]
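A numerical estimate of eq. (4) can be sketched as follows. The integral is approximated on a uniform grid inside each window and the minimization over a and b is done by an ordinary least-squares line fit; the number of windows, the grid resolution and the function name `delta3` are assumptions of this sketch. The input is taken to be an already unfolded spectrum with unit mean spacing.

```python
import numpy as np

def delta3(unfolded, L, n_windows=50, points_per_unit=200):
    """Estimate Delta_3(L) of eq. (4), averaged over several window positions x."""
    lam = np.sort(np.asarray(unfolded, dtype=float))
    starts = np.linspace(lam[0], lam[-1] - L, n_windows)
    values = []
    for x in starts:
        grid = np.linspace(x, x + L, int(L * points_per_unit))
        # Staircase function: number of levels less than or equal to each grid point.
        staircase = np.searchsorted(lam, grid, side="right")
        a, b = np.polyfit(grid, staircase, 1)          # best-fitting straight line
        # Mean squared residual on a uniform grid approximates (1/L) * integral.
        values.append(np.mean((staircase - a * grid - b) ** 2))
    return float(np.mean(values))

# Two reference spectra with unit mean spacing (illustrative data):
rigid = np.arange(300.0)                               # equally spaced levels
rng = np.random.default_rng(2)
poisson = np.cumsum(rng.exponential(size=2000))        # uncorrelated levels

d3_rigid = delta3(rigid, 20.0)
d3_poisson = delta3(poisson, 20.0)
```

Equally spaced levels give a value close to 1/12 regardless of L, whereas uncorrelated levels give a value that grows roughly linearly with L. A GOE-like spectrum falls between these two cases, with the logarithmic growth of eq. (5).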

2.4 Inverse participation ratio (IPR)

Let $u_k^l$ be the lth component of the kth eigenvector $u^k$. The eigenvector components of a GOE random matrix are Gaussian-distributed random variables. The distribution of $r = |u_k^l|^2$, in the limit of large matrix dimension, is given by the Porter–Thomas distribution [19]. The inverse participation ratio (IPR) of an eigenvector is defined as

\[ I_k = \sum_{l=1}^{N} \left[u_k^l\right]^4. \qquad (6) \]

The meaning of $I_k$ is illustrated by two limiting cases: (1) a vector with identical components $u_k^l \equiv 1/\sqrt{N}$ has $I_k = 1/N$, whereas (2) a vector with one component $u_k^1 = 1$ and the remainder zero has $I_k = 1$. Thus, the IPR quantifies the reciprocal of the number of eigenvector components that contribute significantly. For a vector with components following the Porter–Thomas distribution, the IPR takes the value 3/N.
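The two limiting cases and the random-vector value can be checked numerically with a short sketch. The function name `ipr` is hypothetical, and the explicit normalization step is an assumption added so that eq. (6) can be applied to any non-zero vector.

```python
import numpy as np

def ipr(vector):
    """Inverse participation ratio of eq. (6) for a vector normalized to unit length."""
    v = np.asarray(vector, dtype=float)
    v = v / np.linalg.norm(v)        # assumption: normalize before applying eq. (6)
    return float(np.sum(v ** 4))

n = 10_000
uniform = np.ones(n)                 # identical components: fully delocalized
single = np.zeros(n)
single[0] = 1.0                      # one non-zero component: fully localized
gaussian = np.random.default_rng(0).normal(size=n)   # Gaussian components

ipr_uniform = ipr(uniform)           # equals 1/N
ipr_single = ipr(single)             # equals 1
ipr_gaussian = ipr(gaussian)         # close to 3/N for large N
```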

3. Universal spacing distribution

All undirected networks have real eigenvalues. The density distribution ρ(λ), calculated using eq. (2), for most biological networks, including those considered here, resembles a triangular distribution with a peak at the zero eigenvalue. The scale-free degree distribution of the underlying networks is known to be one of the reasons for the triangular shape of the spectral density of the corresponding matrices [20]. Further, the sparseness of real-world networks has been argued to produce high degeneracy at the zero eigenvalue [21,22]. While calculating the NNSD of the networks, we exclude the flat region of the spectra as well as the extremal eigenvalues and analyse only the smooth part of the spectra. For each real-world network, we analyse the eigenvalues less than zero and those greater than zero separately, and present the average properties of these two datasets. If the NNSD of a network, when fitted with eq. (3), yields β ∼ 1, then the spectrum follows the universal GOE statistics of RMT [23–25]. The fitted values of the Brody parameter show that the NNSD of the protein–protein interaction networks investigated for six different species (C. elegans, D. melanogaster, H. pylori, H. sapiens, S. cerevisiae and E. coli), presented in table 1, follows the universal GOE statistics [26]. This universality in spectral behaviour predicted by RMT across these species belonging to different levels of , is not a trivial result: in spite of genetic differences and differences in internal environment, biological activities and modes of functioning, all of which affect their protein–protein interactions, these species exhibit similar universal behaviour. Covariance matrices of amino acid displacement have also been investigated through the RMT framework and shown to exhibit the universal spacing distribution [27]. The spacing distributions for the gene coexpression networks of Alzheimer's disease [11] and of Zebrafish subjected to various toxic perturbations [28] also follow the universal GOE statistics of RMT. NNSD following GOE statistics indicates that the underlying system is complex and possesses some minimal amount of randomness, which is evident in the case of the protein–protein interaction networks.

Table 1. NNSD of the protein–protein interaction networks for six different species. N and β refer to the size of the largest connected component and the value of the Brody parameter, respectively, for each species. L is the length of the spectrum up to which the statistics comply with RMT, listed as % L/N. Values of L are not included for the species that do not follow the $\Delta_3(L)$ statistic [26].

Species            N       β       % L/N
C. elegans         2386    1.01    –
D. melanogaster    7321    0.96    0.38
H. pylori          709     0.97    1.83
H. sapiens         2138    1.02    –
S. cerevisiae      5019    0.96    0.35
E. coli            2209    1.05    0.41

The NNSD changes from Poisson to GOE with a very small increase in rewiring probability while progressing from a regular lattice to a random network, and the transition to GOE (β = 1) takes place exactly at the onset of the small-world [29] transition [24]. We remark explicitly that this transition from Poisson to GOE proceeds through eigenvalue repulsion combined with a Gaussian fall-off, not an exponential one. This establishes that β ∼ 1 is essential for some drastic changes in the structural properties of the underlying network. In the biological scenario, randomness in networks might be considered to arise from nonsense mutations [30] occurring in the underlying system. While in dynamical systems randomness may be related to the unpredictable nature of time evolution, e.g., in chaotic systems [31], in networks randomness refers to random connections between nodes [32]. In biological systems, such randomness might have emerged in the course of evolution by chance and not because of any particular functional importance of some connection. For instance, the emergence of modular structure in networks, known to be motivated by the specific functional role of modules in evolution [33], is probably linked with random connections, perhaps resulting from mutations [30].
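The selection of eigenvalues described earlier in this section (discarding the extremal eigenvalues and the flat, degenerate region, then treating negative and positive eigenvalues separately) can be sketched as follows. The number of extremal eigenvalues to drop, the tolerance used to detect degenerate values and the function name `smooth_parts` are assumptions of this example; the degenerate values 0 and −1 follow the description of the spectrum in the Introduction.

```python
import numpy as np

def smooth_parts(eigenvalues, n_extremal=5, degenerate_at=(0.0, -1.0), tol=1e-8):
    """Split the smooth part of a spectrum into its negative and positive branches."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))
    lam = lam[n_extremal:lam.size - n_extremal]    # drop extremal eigenvalues at both ends
    keep = np.ones(lam.size, dtype=bool)
    for value in degenerate_at:                    # drop the flat (degenerate) region
        keep &= np.abs(lam - value) > tol
    lam = lam[keep]
    return lam[lam < 0], lam[lam > 0]

# Hypothetical spectrum with degenerate eigenvalues at 0 and -1.
spectrum = [-9.0, -3.0, -2.5, -1.0, -1.0, -1.0, -0.5,
            0.0, 0.0, 0.0, 0.4, 1.2, 2.0, 8.0]
negative, positive = smooth_parts(spectrum, n_extremal=1)
```

Each branch would then be unfolded and fitted separately, and the two resulting values of β averaged, in line with the procedure described above.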

4. Varying amounts of randomness

The NNSD accounts only for the short-range correlations in eigenvalues and does not quantify the amount of randomness present in different networks. The second most insightful step in RMT is the analysis of long-range correlations in eigenvalues using the spectral rigidity test, usually conducted with the $\Delta_3(L)$ statistic given by eq. (4). Except for C. elegans and H. sapiens, all species among the six investigated in [26] conform to the $\Delta_3$ statistic of GOE (eq. (5)) up to a certain range L and thereafter deviate from the universal GOE statistics, indicating that the eigenvalues are correlated only up to this range. Two of the six species, namely C. elegans and H. sapiens, cease to follow RMT predictions beyond short-range correlations in eigenvalues (table 1), which raises a few intriguing questions. For instance, how, in the course of evolution, does the interaction network 'understand' that it has attained randomness sufficient to introduce short-range correlations in the corresponding spectra? Since favourable mutation variants captured in the course of natural selection [34] might be responsible for spreading sufficient randomness, deviation from GOE might lead to apparent suppression of random mutations. The applicability of RMT to protein interaction networks, and the deviation from universality, help one to interpret the deterministic nature of protein interaction networks together with its [32].

It is also observed that in the gene coexpression network constructed from Alzheimer's disease data, the $\Delta_3$ statistic agrees well with the GOE statistics up to a very long range of L. A deviation in spectral rigidity is observed beyond this value of L, indicating a possible breakdown of universality and implying that, besides randomness, the underlying network possesses certain specific features [11]. This means that the gene coexpression network has a sufficient amount of randomness, which may be important for the robustness of the system. Mixtures of random connections and regular structure have been emphasized in biological systems, where they act as key components of the structural stability of the underlying network owing to interactions between various levels of organization [35]. For instance, information processing in the brain is considered to happen through random connections among different modular structures [36]. Further, the importance of randomness in the establishment and conservation of complexity in social structure has been emphasized by investigating the interaction dynamics of a wild house mice population [37].

We remark that, for model networks, the deviation of the $\Delta_3$ statistic from random matrix predictions has been shown to follow deformed GOE statistics [38]. This can be extended to analyse the deviation of long-range correlations in biological networks, which might give more insight into the interactions [39].

5. Eigenvector localization

The deviation from the universal RMT predictions might provide clues for identifying system-specific, non-random properties of the system under investigation. In this regard, we resort to eigenvector analysis to extract system-dependent information pertaining to important nodes and their interactions. The component l of a particular eigenvector refers to the contribution of the lth node of the network to that eigenvector. Thus the distribution of the eigenvector components contains information about the number of genes contributing to the localization of a specific eigenvector. The IPR defined in eq. (6) distinguishes between two kinds of eigenvectors: one with all components approximately equal (delocalized eigenvector) and another with a small number of large components (localized eigenvector). According to the RMT predictions, the top contributing nodes in the localized eigenvectors might have some important functional significance in the underlying system, or important functional relations among them [11]. The eigenvectors can be analysed in two segments based on the values of the IPR, one comprising delocalized eigenvectors with IPR close to the RMT prediction [19] and the other consisting of the localized eigenvectors. RMT indicates that the corresponding network has a mixture of random connections, yielding the delocalized eigenvectors of the former segment, and structural features corresponding to functional performance, leading to the localized second segment [28]. Based on the eigenvector localization, the spectrum of the network can be divided into three sections: (1) the non-degenerate part of the spectrum that follows RMT, (2) the non-degenerate extremal and intermediate eigenvalues, which deviate from RMT and whose eigenvectors are expected to contain crucial information about the important nodes of the network, and (3) the degenerate eigenvalues. The nodes corresponding to the top contributing components of the most localized eigenvectors (eigenvectors having higher IPR values) might be important in terms of the functionality of the whole network. While analysing the gene coexpression network of Alzheimer's disease [11], it was found that the top contributing nodes of the most localized eigenvectors revealed through the IPR have comparatively low degree and do not lie in the list of highly connected hubs that arise from the power-law degree distribution [40]. It is worth mentioning that the genes which are hubs, or which connect different communities, are also important, as shown by several earlier studies carried out under the network framework [1,2,41], but the work discussed here aimed to look for important genes implicated in Alzheimer's disease beyond these structural measures. It has also been realized that the choice of threshold value plays a driving role in the analysis of gene coexpression networks through eigenvector localization, as the set of top contributing nodes changes entirely on varying the threshold value used for network construction.

Proceeding further with eigenvector localization studies, the localized part of the spectrum of the gene coexpression network of Zebrafish, generated for different environmental perturbations, is divided into three regions corresponding to (1) the lower eigenvalue regime, (2) the middle region near the degenerate eigenvalues and (3) eigenstates with larger eigenvalues [28]. It is observed that the top contributing nodes of the eigenvectors belonging to region (1) have high degree. The eigenvectors belonging to region (2) have very few (one or two) top contributing nodes. The top contributing nodes of the eigenvectors belonging to region (3) have degree close to the average degree of the network, and these eigenvectors do not have distinguished nodes contributing much more than the rest. For a finite-dimensional matrix, deviation from randomness determines the localization length of the eigenvectors [42]. On plotting the square of the components of the most localized eigenvectors, most of the top localized eigenvectors were seen to correspond to eigenstates in region (1), i.e., those with negative eigenvalues. It is important to mention that the eigenstate corresponding to the largest eigenvalue ($\lambda_N$) has exponentially decaying components and is not localized on a few nodes. According to RMT, the localized eigenvectors distinguish 'genuine correlations' from 'apparent correlations' [15], which, in terms of gene coexpression networks, can be interpreted as functionally important and random correlations between the genes, respectively.
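The identification of top contributing nodes of the most localized eigenvectors can be sketched as follows. This is a minimal illustration on a toy network rather than the procedure of refs [11,28]: the function name `most_localized`, the number of eigenvectors and nodes reported, and the toy adjacency matrix are all assumptions.

```python
import numpy as np

def most_localized(A, n_vectors=3, n_nodes=2):
    """Rank eigenvectors of a symmetric matrix by IPR, most localized first.

    Returns a list of (eigenvalue, IPR, top contributing node indices).
    """
    eigenvalues, eigenvectors = np.linalg.eigh(A)   # columns are unit-norm eigenvectors
    iprs = np.sum(eigenvectors ** 4, axis=0)        # eq. (6) for every eigenvector
    report = []
    for k in np.argsort(iprs)[::-1][:n_vectors]:
        weights = eigenvectors[:, k] ** 2           # contribution of each node
        top = np.argsort(weights)[::-1][:n_nodes]
        report.append((float(eigenvalues[k]), float(iprs[k]), [int(i) for i in top]))
    return report

# Hypothetical toy network: a dimer (0-1) and a three-node path (2-3-4).
A = np.zeros((5, 5))
for i, j in [(0, 1), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

report = most_localized(A, n_vectors=5, n_nodes=1)
```

In this toy case the eigenvector with the largest eigenvalue is dominated by the central node of the path. In a real coexpression network, the node lists obtained in this way would be compared with node degrees, since the studies discussed above report that top contributing nodes need not be hubs.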

6. Conclusion

Biological systems probed using random matrix techniques are found to possess some minimal amount of randomness, which might be crucial for their robustness and functional performance. We use properties of random matrices to explain the intricacies of biological systems which, despite being complex, are deterministic in nature and governed by physical laws. The non-random behaviour of real systems is directly reflected in the deviation of the long-range correlations from randomness after the attainment of a minimal amount of randomness. This deviation can be probed further to extract system-dependent information. For example, the part of the spectrum deviating from RMT predictions has been shown to provide information on important genes in Zebrafish under diverse toxic perturbations. This strategy could be adopted to develop a future chip for the detection of pollutants and diagnosis of diseases. Further, the extremal eigenvalues have been shown to behave completely differently from the GOE statistics, and indeed can be modelled using generalized extreme value (GEV) statistics [43], providing a promising platform to analyse extreme eigenvalues of biological systems under the GEV framework. The random matrix analysis of biological networks can henceforth be extended to investigate different diseases and to predict various structural and functional aspects of interactions [44], which in addition may help to compose novel multidrug targets [45]. Designing a single-drug target might not always give satisfactory results, as a backup system might exist which replaces the function of the inhibited target protein. The localization studies help to identify the set of genes which might possess functionally important interactions amongst themselves. By developing multitarget drugs against these genes, one can decrease the functionality of the entire signalling cascade, yielding more effective results. The links connecting hubs of the protein–protein interaction network, intermodular links through nodes having high betweenness centrality, or nodes in the overlap of numerous network modules, might also act as potential multidrug targets [46,47].
The results presented in this paper about spectral properties of the underlying networks depicted through random matrix analyses add a new dimension to the understanding of complex biological systems and further work in this direction may lead to the extraction of useful information about such complicated systems.

Acknowledgements

The author thanks Department of Science and Technology (DST), Govt. of India grant SR/FTP/PS-067/2011 and Council of Scientific and Industrial Research (CSIR), Govt. of India grant 25(02205)/12/EMR-II for financial support. SJ is grateful to Arul Lakshminarayan, V K B Kota and Sudhir Jain for insightful discussions on random matrix theory.

References

[1] R Albert and A-L Barabási, Rev. Mod. Phys. 74, 47 (2002)
[2] S Boccaletti, V Latora, Y Moreno, M Chavez and D U Hwang, Phys. Rep. 424, 175 (2006)
[3] S P Borgatti, A Mehra, D J Brass and G Labianca, Science 323, 892 (2009)
[4] S Jalan, C Sarkar, A Madhusudanan and S K Dwivedi, PLoS ONE 9(2), e88249 (2014)
[5] J C Venter et al, Science 16, 1304 (2001)
[6] A-L Barabási, N Gulbahce and J Loscalzo, Nat. Rev. Genet. 12, 56 (2011)
[7] L H Hartwell, J J Hopfield, S Leibler and A W Murray, Nature 402, 47 (1999)
[8] H Kitano, Science 295, 1662 (2002)
[9] X Zhu, M Gerstein and M Snyder, Genes and Dev. 21, 1010 (2007)
[10] A-L Barabási and Z N Oltvai, Nat. Rev. Genet. 5, 101 (2004)
[11] S Jalan, N Solymosi, G Vattay and B Li, Phys. Rev. E 81, 046118 (2010)
[12] T Ideker, T Galitski and L Hood, Annu. Rev. Genomics Hum. Genet. 2, 343 (2001)
[13] T Guhr, A Müller-Groeling and H A Weidenmüller, Phys. Rep. 299, 189 (1998)
[14] T Papenbrock and H A Weidenmüller, Rev. Mod. Phys. 79, 997 (2007)
[15] V Plerou, P Gopikrishnan, B Rosenow, L A N Amaral and H E Stanley, Phys. Rev. Lett. 83, 1471 (1999)
[16] P Seba, Phys. Rev. Lett. 91, 198104 (2003)
[17] M S Santhanam and P K Patra, Phys. Rev. E 64, 016102 (2001)
[18] M L Mehta, Random matrices, 2nd edn (Academic Press, New York, 1991)
[19] K Życzkowski, Quantum chaos edited by H A Cerdeira, R Ramaswamy, M C Gutzwiller and G Casati (World Scientific, 1991)
[20] I J Farkas, I Derényi, A-L Barabási and T Vicsek, Phys. Rev. E 64, 026704 (2001)
[21] S N Dorogovtsev, A V Goltsev, J F Mendes and A N Samukhin, Phys. Rev. E 68, 046109 (2003)
[22] M A M de Aguiar and Y Bar-Yam, Phys. Rev. E 71, e016106 (2005)
[23] T A Brody, Lett. Nuovo Cimento 7, 482 (1973)
[24] J N Bandyopadhyay and S Jalan, Phys. Rev. E 76, 026109 (2007)
[25] S Jalan and J N Bandyopadhyay, Phys. Rev. E 76, 046107 (2007)
[26] A Agarwal, C Sarkar, S K Dwivedi, N Dhasmana and S Jalan, Physica A 404, 359 (2014)
[27] R Potestio, F Caccioli and P Vivo, Phys. Rev. Lett. 103, 268101 (2009)
[28] S Jalan, C Y Ung, J Bhojwani, B Li, L Zhang, S H Lan and Z Gong, Europhys. Lett. 99, 48004 (2012)
[29] D J Watts and S H Strogatz, Nature 393, 440 (1998)
[30] S Clancy, Nature Education 1, 187 (2008)
[31] F M Atay, S Jalan and J Jost, Complexity 15, 29 (2009)
[32] S Jalan and J N Bandyopadhyay, Europhys. Lett. 87, 48010 (2009)
[33] E Ravasz, A L Somera, D A Mongru, Z N Oltvai and A-L Barabási, Science 297, 1551 (2002)
[34] D A Petrov, Genetica 115, 81 (2002)
[35] M Buiatti and G Longo, Theory Biosci. arXiv:1104.1110 (2011)
[36] J D Cohen and F Tong, Science 293, 2405 (2001)
[37] N Perony, C J Tessone, B König and F Schweitzer, PLoS Comput. Biol. 8, e1002786 (2012)
[38] S Jalan, Phys. Rev. E 80, 046101 (2009)
[39] J X de Carvalho, S Jalan and M S Hussein, Phys. Rev. E 79, 056222 (2009)
[40] A-L Barabási and R Albert, Science 286, 509 (1999)
[41] R Guimerá and L A N Amaral, Nature (London) 433, 895 (2005)
[42] F Evers and A D Mirlin, Rev. Mod. Phys. 80, 1355 (2008)
[43] S K Dwivedi and S Jalan, Phys. Rev. E 87, 042714 (2013)
     S Jalan and S K Dwivedi, Phys. Rev. E 89, 062718 (2014)
[44] Y Le and P A Agarwal, PLoS ONE 4, e4346 (2009)
[45] M A Yildirim, Nature Biotech. 25, 1119 (2007)
[46] P Csermely, V Ágoston and S Pongor, Trends Pharmacol. Sci. 26, 178 (2005)
     T Korcsmáros, M S Szalay, C Böde, I A Kovács and P Csermely, Expert Opinion on Drug Discovery 2, 1 (2007)
     G R Zimmermann, J Lehár and C T Keith, Drug Discovery Today 12, 34 (2007)
     M Antal, C Böde and P Csermely, Curr. Protein Pept. Sci. 10, 161 (2009)
     P Csermely, Trends Biochem. Sci. 33, 569 (2008)
[47] G I Simkó, D Gyurkó, D V Veres, T Nánási and P Csermely, Genome Medicine 1, 90 (2009)
