EURASIP Journal on Bioinformatics and Systems Biology
Information Theoretic Methods for Bioinformatics
Guest Editors: Jorma Rissanen, Peter Grünwald, Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu

Copyright © 2007 Hindawi Publishing Corporation. All rights reserved.
This is a special issue published in volume 2007 of “EURASIP Journal on Bioinformatics and Systems Biology.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief
I. Tabus, Tampere University of Technology, Finland
Associate Editors
Jaakko Astola, Finland
Junior Barrera, Brazil
Michael L. Bittner, USA
Michael R. Brent, USA
Yidong Chen, USA
Paul Dan Cristea, Romania
Aniruddha Datta, USA
Bart De Moor, Belgium
Edward R. Dougherty, USA
J. Garcia-Frias, USA
Debashis Ghosh, USA
John Goutsias, USA
Roderic Guigó, Spain
Yufei Huang, USA
Seungchan Kim, USA
John Quackenbush, USA
Jorma Rissanen, Finland
Stéphane Robin, France
Paola Sebastiani, USA
Erchin Serpedin, USA
Ilya Shmulevich, USA
A. H. Tewfik, USA
Sabine Van Huffel, Belgium
Z. Jane Wang, Canada
Yue Wang, USA

Contents
Information Theoretic Methods for Bioinformatics, Jorma Rissanen, Peter Grünwald, Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu Volume 2007, Article ID 79128, 2 pages
Compressing Proteomes: The Relevance of Medium Range Correlations, Dario Benedetto, Emanuele Caglioti, and Claudia Chica Volume 2007, Article ID 60723, 8 pages
A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification, Chris Hemmerich and Sun Kim Volume 2007, Article ID 87356, 9 pages
Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates, Hasan Metin Aktulga, Ioannis Kontoyiannis, L. Alex Lyznik, Lukasz Szpankowski, Ananth Y. Grama, and Wojciech Szpankowski Volume 2007, Article ID 14741, 11 pages
Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information, Arvind Rao, Alfred O. Hero III, David J. States, and James Douglas Engel Volume 2007, Article ID 13853, 13 pages
Splitting the BLOSUM Score into Numbers of Biological Significance, Francesco Fabris, Andrea Sgarro, and Alessandro Tossi Volume 2007, Article ID 31450, 18 pages
Aligning Sequences by Minimum Description Length, John S. Conery Volume 2007, Article ID 72936, 14 pages
MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress, Scott C. Evans, Antonis Kourtidis, T. Stephen Markham, Jonathan Miller, Douglas S. Conklin, and Andrew S. Torres Volume 2007, Article ID 43670, 16 pages
Variation in the Correlation of G + C Composition with Synonymous Codon Usage Bias among Bacteria, Haruo Suzuki, Rintaro Saito, and Masaru Tomita Volume 2007, Article ID 61374, 7 pages
Information-Theoretic Inference of Large Transcriptional Regulatory Networks, Patrick E. Meyer, Kevin Kontos, Frederic Lafitte, and Gianluca Bontempi Volume 2007, Article ID 79879, 9 pages
NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks, Petri Kontkanen, Hannes Wettig, and Petri Myllymäki Volume 2007, Article ID 90947, 11 pages

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 79128, 2 pages
doi:10.1155/2007/79128
Editorial
Information Theoretic Methods for Bioinformatics
Jorma Rissanen,1,2 Peter Grünwald,3 Jukka Heikkonen,4 Petri Myllymäki,2,5 Teemu Roos,2,5 and Juho Rousu5
1 Computer Learning Research Center, University of London, Royal Holloway TW20 0EX, UK
2 Helsinki Institute for Information Technology, University of Helsinki, P.O. Box 68, 00014 Helsinki, Finland
3 Centrum voor Wiskunde en Informatica (CWI), P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
4 Laboratory of Computational Engineering, Helsinki University of Technology, P.O. Box 9203, 02015 HUT, Finland
5 Department of Computer Science, University of Helsinki, P.O. Box 68, 00014 Helsinki, Finland
Received 24 December 2007; Accepted 24 December 2007
Copyright © 2007 Jorma Rissanen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The ever-ongoing growth in the amount of biological data, the development of genome-wide measurement technologies, and the gradual, inevitable shift in molecular biology from the study of individual genes to the systems view: all these factors contribute to the need to study biological systems by statistical and computational means. In this task, we are facing a dual challenge: on the one hand, biological systems and hence their models are inherently complex, and on the other hand, the measurement data, while being genome-wide, are typically scarce in terms of sample sizes (the “large p, small n” problem) and noisy.

This means that the traditional statistical approach, where the model is viewed as a distorted image of something called a true distribution which the statisticians are trying to estimate, is poorly justified. This lack of rationality is particularly striking when one tries to learn the structure of the data by testing for the truth of a hypothesis in a collection where none of them is true. Similarly, the Bayesian approaches that require prior knowledge, which is either nonexistent or vague and difficult to express in terms of a distribution for the parameters, are subject to modeling assumptions which may bias the results in an unintended manner.

It was the editors’ intent and hope to encourage applications of techniques for model fitting influenced by information theory, originally created for communication theory but more recently expanded to cover algorithmic information theory and applicable to statistical modeling. In this view, the objective in modeling is to learn structures and properties in data by simply fitting models without requiring any of them to be “true”. The performance is not measured by any distance to the nonexisting “truth” but in terms of the probability they assign to the data, which is equivalent to the codelength with which the data can be encoded, taking advantage of the regular features the model prescribes to the data. This task requires information and coding theoretic means. Similarly, the frequently used distance measures like the Kullback-Leibler divergence and the mutual information express mean codelength differences.

D. Benedetto et al. study correlations and compressibility of proteome sequences. They identify dependencies at the range of 10 to 100 amino acids. The source of such dependencies is not entirely clear; one contributing factor in the case of interprotein dependencies is likely to be sequence duplication. The dependencies can be exploited in compression of proteome sequences. Furthermore, they seem to have a role in evolutionary and structural analysis of proteomes.

C. M. Hemmerich and S. Kim also use information theory for studying the correlations in protein sequences. They base their method on computing the mutual information of nonadjacent residues lying at a fixed distance d apart, where the distance is varied from zero to a fixed upper bound. The mutual information vector formed by these statistics is used to train a nearest-neighbor classifier to predict membership in protein families, with results indicating that the correlations between nonadjacent residues are predictive of protein family.

H. M. Aktulga et al. detect statistically dependent genomic sequences. Their paper addresses two applications. First, they identify different parts of a gene (maize zmSRp32) that are mutually dependent without appealing to the usual assumption that dependencies are revealed by a considerable amount of exact matches. It is discovered that dependencies exist between the 5′ untranslated region and its alternatively spliced exons. As a second application, they discover short tandem repeats which are useful in, for instance, genetic profiling. In both cases, the used techniques are based on mutual information.

The objective in the paper by A. Rao et al. is to discover long-range regulatory elements (LREs) that determine tissue-specific gene expression. Their methodology is based on the concept of directed information, a variant of mutual information introduced originally in the 1970s. It is shown that directed information can be successfully used for selecting motifs that discriminate between tissue-specific and nonspecific LREs. In particular, the performance of directed information is better than that of mutual information.

F. Fabris et al. present an in-depth study of BLOSUM (block substitution matrix) scores. They propose a decomposition of the BLOSUM score into three components: the mutual information of two compared sequences, the divergence of observed amino acid co-occurrence frequencies from the probabilities in the substitution matrix, and the background frequency divergence measuring the stochastic distance of the observed amino acid frequencies from the marginals in the substitution matrix. The authors show how the result of the decomposition, called BLOSpectrum, can be used to analyze questions about the correctness of the chosen BLOSUM matrix, the degree of typicality of compared sequences or their alignment, and the presence of weak or concealed correlations in alignments with low BLOSUM scores.

The paper by J. Conery presents a new framework for biological sequence alignment that is based on describing pairs of sequences by simple regular expressions. These regular expressions are given in terms of right-linear grammars, and the best grammar is found by use of the MDL principle. Essentially, when two sequences contain similar substrings, this similarity can be exploited to describe the sequences with fewer bits. The precise codelengths are determined with a substitution matrix that provides conditional probabilities for the event that a particular symbol is replaced by another particular symbol. One advantage of such a grammar-based approach is that gaps are not needed to align sequences of varying length. The author experimentally compares the alignments found by his method with those found by CLUSTALW. In a second experiment, he measures the accuracy of his method on pairwise alignments taken from the BAliBASE benchmark.

S. C. Evans et al. explore miRNA sequences based on MDLcompress, an MDL-based grammar inference algorithm that is an extension of the optimal symbol compression ratio (OSCR) algorithm published earlier. Using MDLcompress, they analyze the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Their results suggest that MDLcompress outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity.

The partially redundant third position of codons (protein-coding nucleotide triplets) tends to have a strongly biased distribution. The amount of bias is known to be correlated with G+C (guanine-cytosine) composition in the genome. In their paper, H. Suzuki et al. quantify the correlation of G+C composition with synonymous codon usage bias, where the bias is measured by the entropy of the third codon position. They show that the correlation depends on various genomic features and varies among different species. This raises several interesting questions about the different evolutionary forces causing the codon usage bias.

The paper by P. E. Meyer et al. tackles the challenging problem of inferring large gene regulatory networks using information theory. Their MRNET method extends the maximum relevance/minimum redundancy (MRMR) feature selection technique to networks by formulating the network inference problem as a series of input/output supervised gene selection procedures. Empirical results are competitive with the state-of-the-art methods.

P. Kontkanen et al. study the problem of computing the normalized maximum likelihood (NML) universal model for Bayesian networks, which are important tools for modeling discrete data in biological applications. The most advanced MDL method for model selection between such networks is based on comparing the NML distributions for each network under consideration, but the naive computation of these distributions requires exponential time with respect to the given data sample size. Utilizing certain computational tricks, and building on earlier work with multinomial and Naive Bayes models, the authors show how the computation can be performed efficiently for tree-structured Bayesian networks.

ACKNOWLEDGMENTS

We thank the Editor-in-Chief for the opportunity to prepare this special issue, and the staff of Hindawi for their assistance. The greatest credit is of course to the authors, who submitted contributions of the highest quality. We also thank the reviewers who have had a crucial role in the selection and editing of the ten papers appearing in the special issue.

Jorma Rissanen
Peter Grünwald
Jukka Heikkonen
Petri Myllymäki
Teemu Roos
Juho Rousu

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 60723, 8 pages
doi:10.1155/2007/60723
Research Article
Compressing Proteomes: The Relevance of Medium Range Correlations
Dario Benedetto,1 Emanuele Caglioti,1 and Claudia Chica2
1 Dipartimento di Matematica, Università di Roma “La Sapienza”, Piazzale Aldo Moro 5, 00185 Roma, Italy
2 Structural and Computational Biology Unit, EMBL Heidelberg, Meyerhofstraße 1, 69117 Heidelberg, Germany
Received 14 January 2007; Revised 28 May 2007; Accepted 10 September 2007
Recommended by Teemu Roos
We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at short and medium range, more specifically, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to capture the information contained in protein sequences and thus achieve good compression rates. Finally, we propose that the cause of this redundancy is related to the evolutionary origin of proteomes and protein sequences.
Copyright © 2007 Dario Benedetto et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION

Protein sequences have been considered for a long time as nearly random or highly complex sequences, from the informational content point of view. The main reason for this is the local complexity of amino acid composition, that is, the type and number of amino acids found in a sequence segment, especially inside the globular domains [1]. This complexity could be related to the so-called randomness of coding sequences in DNA, already pointed out in a pioneering work [2] and explained by evolutionary models [3]. Studies on protein sequence compression show that proteins behave as sequences of independent characters and have a very low compressibility, around 1% [4]. The ordered set of protein sequences belonging to one organism, the proteome, was also considered to be not compressible due to this little Markov dependency [5]. Improvements are obtained by [6, 7]. However, later studies [8–10] suggest that proteomes contain different sources of regularities, and can be compressed to rates around 30%. For a relevant discussion on the validity of these results see Cao et al. [7].

In this work, we focus on the statistical study of proteome sequences, using the concept of entropy brought into information theory by Shannon [11]. The Shannon entropy is related to the amount of information of a sequence emitted by a certain source. The entropy h of a sequence is the limit of the average amount of information per character, when the length of the sequence tends to infinity. In particular, for a finite sequence of length L, the informational content in bits is approximately Lh, and so Lh is the minimum length in bits of any sequence that contains the same information. In this way Lh provides a theoretical lower bound for the sequence’s compression. A compression algorithm is intended to code a sequence into a shorter one, from which it is possible to obtain unequivocally the former. In practice, one cannot compress at a rate equal to the Shannon entropy for the given sequence. Nonetheless, it is possible to approximate such a limit, using an efficient compression algorithm.

Statistical compression algorithms achieve their goal by assigning shorter code words to the most probable characters; their efficiency depends on the accuracy of the model used to estimate each character’s probability. Models try to take advantage of the correlations between characters considering, for example, how the preceding characters, that is, the character’s context, determine the probability of the next one, as in the prediction by partial matching (PPM) scheme [12].

Most successful algorithms for proteome compression are based on the identification of duplicated sequences or repeats. The compress protein (CP) algorithm [5], for example, considers that duplicated sequences in proteomes are similar but not identical because of mutation and evolutionary divergence. CP uses a modified PPM that includes the probability of amino acid substitutions when estimating each residue probability. The ProtComp algorithm [8] optimises the use of approximate repeats by updating the amino acid substitution matrix as the repeated similar blocks appear along the sequence. The context-tree weighting (CTW) [13] is another context-based method that has been applied for biological sequence compression. In [6] the authors present a CTW-based algorithm that predicts the probability of a character by weighting the importance of short and long contexts, considering as well the occurrence of approximate repeats or palindromes in those contexts. The XM algorithm [7] is a statistical algorithm which combines, via a Bayesian average, the probability of an amino acid calculated on a local scale with the probability of that same residue being part of a duplicated region of the proteome.

Nonstatistical approaches, based on the Burrows-Wheeler transform (BWT) [9], have also been used for identifying overlapping and distant repeats in proteomes, and efficiently use them in compression. Even simpler models, that rely on a block code representation of the protein sequences [10], have proved to be successful in some cases.

All the algorithms commented on above put into evidence the existence and importance of redundancy in proteome sequences. Here we present a purely statistical study of 8 eukaryotic and prokaryotic proteomes. Firstly, we analyse the correlation function of the whole sequences and find evidence of medium range correlations, between amino acids located 100 residues apart. Then we calculate the amino acid correlations considering the protein boundaries and identify the role of the intra/interprotein scale in determining the medium range correlations. Furthermore, we generate groups of amino acids using their pair correlations at distance 100, that reveal the structural meaning of the medium range correlations. Using the results of proteome correlations, we propose a statistical model for the distribution of amino acids in 4 proteomes: Haemophilus influenzae (bacteria), Methanococcus jannaschii (bacteria), Saccharomyces cerevisiae (eukarya), and Homo sapiens (eukarya), and we estimate their compression rate to compare our results against previous works.

The sources of nonrandomness studied fall into two scales: the medium range correlations between amino acids of the same and neighboring sequences, at distances of order 100, and the short range Markovian correlations between the contiguous residues up to distance 10. Previous studies [9] show that proteomes present repeated subsequences at very long distances (50–300). In this article, we do not consider these long-range correlations of the order of the proteome length. Protein length range correlations are in agreement with the process of sequence duplication, as has been previously suggested for long-range correlations [9]; in addition to that, we show that they also contain information about the three-dimensional structure of the proteins. Short range correlations might instead relate to the local constraints on amino acid distribution due to secondary structure requirements.

2. RESULTS AND DISCUSSION

For our statistical analysis, we used the proteomes of 4 prokaryotic and 4 eukaryotic organisms shown in Table 1. They were retrieved from the database of the Integr8 web portal [14], with the exception of the Hi, Mj, Sc, and Hs proteomes, which were obtained from the protein corpus in [15] for the sake of comparison of our compression rate results with previous studies on the same proteomes. The proteomes are not complete (in particular the version of Hs in the protein corpus), but they represent a natural set of proteins where the redundancy has a biological meaning. It is important to remark that the sequence of the proteins in the proteome files of the Integr8 database is not the natural one. Those files are not useful for our analysis. Nevertheless, using the additional information available in the database, it is possible to order the proteins as they are found in the chromosomes. The proteome files of the protein corpus do not present this problem, but the sequence of the proteins is not available. Therefore, for the analysis shown in Table 2 and in Figure 2, we have used the version of Hi, Mj, Sc in the Integr8 database. For the same reason, the data for Hs is missing in Table 2, since the protein order is not obtainable at the Integr8 site.

Table 1: Proteome sequences.

Abbreviation  Organism                   Proteome length  Number of proteins
Mj            Methanococcus jannaschii         448 779          1680
Hi            Haemophilus influenzae           509 519          1657
Vc            Vibrio cholerae                  870 500          2988
Ec            Escherichia coli               1 578 496          5339
Sc            Saccharomyces cerevisiae       2 900 352          5835
Dm            Drosophila melanogaster        5 818 330        11 592
Ce            Caenorhabditis elegans         6 874 562        17 456
Hs            Homo sapiens                   3 295 751          5733

2.1. Correlations

As a first approximation to the general trends in residue distribution, we study the cooccurrence of amino acids. More precisely, we calculate the pair correlations at different distances, that is, the average number of times equal residues a appear at distance k along the whole sequence

    C^k = \frac{1}{20} \sum_a C^k_{aa}.    (1)
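For concreteness, this pair-correlation statistic can be sketched in a few lines of Python. This is an illustrative implementation of ours, not code from the paper; it assumes the definition C^k_{aa} = (1/(N−k)) Σ_i χ(σ_i = a) χ(σ_{i+k} = a) − f_a², with f_a the relative frequency of residue a, as given in this section.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the standard 20-letter alphabet

def pair_correlation(seq, k):
    """C^k = (1/20) * sum_a C^k_aa, where C^k_aa is the frequency of the
    same residue a occurring at positions i and i+k, minus f_a**2."""
    n = len(seq)
    freq = Counter(seq)                       # occurrences of each residue
    f = {a: freq[a] / n for a in AMINO_ACIDS}
    # count the positions i where seq[i] == seq[i+k], split by residue
    matches = Counter(seq[i] for i in range(n - k) if seq[i] == seq[i + k])
    c_aa = {a: matches[a] / (n - k) - f[a] ** 2 for a in AMINO_ACIDS}
    return sum(c_aa.values()) / 20.0

# Sanity check on a toy periodic "proteome": equal residues recur at
# distance 4, so C^4 is positive, while adjacent residues never match.
toy = "ACDE" * 500
print(pair_correlation(toy, 4))  # positive
print(pair_correlation(toy, 1))  # negative
```

On real proteomes the values are small (of order 10⁻⁴, cf. the vertical scale of Figure 1), so it is the sign and the distance profile of C^k, rather than its magnitude, that carry the signal.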
Here

    C^k_{aa} = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi(\sigma_i = a)\,\chi(\sigma_{i+k} = a) - f_a^2,    (2)

where N is the sequence length, \chi(\sigma_i = a) is the characteristic function of finding residue a at position i, and f_a is the relative frequency of amino acid a in the proteome. According to this definition, a positive correlation means that, for a distance k, the number of pairs of equal amino acids is more frequent than expected due to their frequency in the proteome. The resulting correlation function for the 8 proteomes we studied (Figure 1) shows that eukaryotic sequences have stronger correlations than prokaryotic ones. Moreover, for all the proteomes, the correlation remains positive at a medium range, for values of k bigger than 800 or 1000, depending on the proteome. We notice that the natural order of proteins in the proteomes, given by the succession of genes in the chromosomes, is relevant: when we randomly permute proteins, the medium range correlations are lost, both in eukaryotes and prokaryotes.

Figure 1: Correlation function for the 8 proteomes. Notice that the function remains positive for distances up to 1000 and that eukaryotic proteomes (continuous lines) tend to present higher values.

The medium range correlations imply that, in proteomes, the amino acid distribution of neighboring proteins tends to be more similar than that of distant ones. This fact can be related to the process of duplication, recognised as the dominant force in the evolution of protein function [16]. As protein repeats have been related to duplication at different scales (genome, gene, or exon) [17], it is possible that the amino acid patterns responsible for the observed medium range correlation have the same evolutionary origin.

Due to the correlation definition used, the medium range correlations could be caused either by pairs of amino acids belonging to the same protein, or to different ones. Therefore, we split the nonlocal correlation into two groups and analyse them separately: interprotein correlations (between 2 contiguous proteins) and intraprotein correlations (inside the same protein sequence). In Table 2, we present the results for the intraprotein correlation between the two halves of the same protein and the interprotein correlation between corresponding and noncorresponding halves of two contiguous proteins: first half with first half (corr^{--}) and second half with first half (corr^{+-}).

Table 2: Intra- and interprotein correlation. Intraprotein correlation is always higher than interprotein correlation, and correlation between matching halves (−−) is higher than that of noncorresponding halves (+−).

Proteome  Intraprot corr  Interprot corr^{--}  Interprot corr^{+-}
Mj              0.271914             0.050381             0.050231
Hi              0.265803             0.045588             0.039246
Vc              0.256386             0.063712             0.041780
Ec              0.271597             0.080064             0.069980
Sc              0.270560             0.032501             0.018606
Dm              0.295940             0.095722             0.056176
Ce              0.288071             0.122692             0.077690

These correlations are defined as follows. Let N_p be the number of proteins, let \rho_i^-(a) and \rho_i^+(a) be the relative frequency of the residue a in the first and the second half of the ith protein, respectively, and let \rho(a) be the corresponding mean value. We define

    \sigma_{i,j}^{\pm\pm} = \frac{1}{20} \sum_a \left(\rho_i^\pm(a) - \rho(a)\right)\left(\rho_j^\pm(a) - \rho(a)\right),    (3)

for instance,

    \sigma_{i,j}^{+-} = \frac{1}{20} \sum_a \left(\rho_i^+(a) - \rho(a)\right)\left(\rho_j^-(a) - \rho(a)\right).    (4)

We also define

    \sigma_i^+ = \sigma_{i,i}^{++},  \qquad  \sigma_i^- = \sigma_{i,i}^{--}.    (5)

The intraprotein correlation is

    C_{intra} = \frac{1}{N_p} \sum_{i=1}^{N_p} \frac{\sigma_{i,i}^{-+}}{\sqrt{\sigma_i^- \sigma_i^+}}.    (6)

The two interprotein correlations are

    C_{inter}^{--} = \frac{1}{N_p - 1} \sum_{i=1}^{N_p - 1} \frac{\sigma_{i,i+1}^{--}}{\sqrt{\sigma_i^- \sigma_{i+1}^-}},
    \qquad
    C_{inter}^{+-} = \frac{1}{N_p - 1} \sum_{i=1}^{N_p - 1} \frac{\sigma_{i,i+1}^{+-}}{\sqrt{\sigma_i^+ \sigma_{i+1}^-}}.    (7)

The correlation values in Table 2 have the same trend for all the proteomes: intraprotein correlation is always higher than interprotein correlation.

The correlations defined by means of \sigma_{i,j}^{\pm\pm} are different from the traditional correlation C^k_{aa}, which is the correlation of the symbol a at distance k, where k is the number of residues: we have calculated the correlation function of the
frequencies of the amino acids at the distance of one protein. In Figure 2, we also analyse how the interprotein correlations between matching and nonmatching protein halves vary with the number k of proteins separating the two halves. We compare

    C^{--}(k) = \frac{1}{N_p - k} \sum_{i=1}^{N_p - k} \frac{\sigma_{i,i+k}^{--}}{\sqrt{\sigma_i^- \sigma_{i+k}^-}},
    \qquad
    C^{+-}(k) = \frac{1}{N_p - k} \sum_{i=1}^{N_p - k} \frac{\sigma_{i,i+k}^{+-}}{\sqrt{\sigma_i^+ \sigma_{i+k}^-}}.    (8)

Figure 2: Correlation function, at distance of k proteins, between amino acids belonging to corresponding (corr^{--}) and noncorresponding (corr^{+-}) halves; S. cerevisiae proteome. Correlation between corresponding halves is higher, suggesting that structural requirements modulate the evolution of protein sequences, by maintaining certain amino acid patterns.

As an extension of the results in Table 2, we find that the correlation between matching halves is kept higher than that of noncorresponding halves along the proteome. Analogous results to Table 2 and Figure 2 hold for second-second and first-second halves.

Gene duplication can explain both the existence and order dependence of interprotein correlation, but it is not enough to justify why intraprotein correlations remain high, because high interprotein correlations can also appear in a low intraprotein correlations context. Indeed, the presence of intraprotein correlations indicates a nonrandom distribution of amino acids at a protein length scale. This nonrandomness can be related to segmental duplication, that is, duplication of segments inside the same protein; likewise, it can reflect the maintenance of amino acid patterns during the protein divergence that follows gene duplication, as a consequence of the structural constraints imposed upon protein sequences.

As an example, extensive searches of protein databases [18] reveal the high frequency of tandemly repeated sequences of approximately 50 amino acids, ARM and HEAT, in eukaryotic proteins. Moreover, those repeats present a core of strongly conserved hydrophobic residues even when the other residues start to differ at several other positions.

The evidence obtained from the correlation analysis does not allow us to clarify the nature of the structural constraints measured: do they reflect the modular repetition of secondary structure elements, caused by duplication, or, perhaps, do they depend on the conservation of higher order tertiary structure units like domains? We try to address this question by defining amino acid groups as explained in the next section.

2.2. Grouping of amino acids

In a previous study [4], the complexity of large sets of nonredundant protein sequences was measured using a reduced alphabet approximation, that is, using groups of amino acids defined by an a priori classification. The Shannon entropy was then estimated from the entropies of the blocks of n characters. The authors did not find enough evidence to support the existence of short range correlations between the amino acids of protein sequences.

Conversely, given the above evidence of medium range correlations in proteome sequences, we build groups of correlated amino acids using the correlations between the 20 amino acids. We calculate C^k_{ab}, the correlation between all amino acid pairs ab at distance k, in the same way we calculate C^k_{aa} in the previous section:

    C^k_{ab} = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi(\sigma_i = a)\,\chi(\sigma_{i+k} = b) - f_a f_b.    (9)

A quick look at the resulting 20 × 20 matrix for k = 100 (Figure 3), which presumably includes both intraprotein and interprotein correlation, puts in evidence that the signs of the matrix elements, and thus the positive and negative correlations, are not distributed randomly among residues but, instead, in a grouped fashion: some amino acids present positive or negative correlations with the same subset of residues. Then, we construct groups of amino acids in such a way that they maximise the positive medium range correlation; in practical terms it means that amino acids which are more likely to appear at distances of order 100 would be grouped together.

For a given partition of the set of amino acids in N_g groups, we calculate the sum of the correlation function between any pair of residues ab belonging to a same group. More precisely, groups are obtained by maximising the following quantity:

    F(G) = \sum_{i=1}^{N_g} \sum_{a,b \in g_i} \sum_{k=1}^{200} C^k_{ab},    (10)

which is a function of a partition G of the amino acids in N_g disjoint sets g_i. Due to the huge number of possible choices for the groups, we maximise this value using a simulated annealing algorithm. This is a Monte Carlo algorithm used for optimisation [19]. For a given partition G, we construct a new partition G′ choosing at random a residue and changing
synonymous relationships. It is well known that mutations between amino acids sharing geometrical and/or physico- CYP chemical properties are the basis of neutral evolution at a T molecular level [20]; this fact also explains why there is not a one-to-one relationship between protein sequences and structures [21]. Moreover, structurally neighboring residues have been found to distribute differentially (proxi- mally/distally) in the protein sequences, depending on their physico-chemical properties [22]. Indeed, the groups defined from the pair correlations at amediumrange(Table 3) almost correspond with the natu- ral classification based on their physico-chemical properties: hydrophobic, polar, charged, small, and ambiguous. In par- ticular, the fact that hydrophobic amino acids group together allows us to think that the correlation function is gathering
VLIMFWNQHKRDEGAS some of the three-dimensional information contained in the VLIMFWNQHKRDEGASTCYP protein sequence, more precisely tertiary structure informa- Figure 3: Correlation between the 20 amino acids for Hi. Posi- tion, as hydrophobic interactions are considered the driving tive (black) and negative (grey) correlations determine amino acid forces of the protein folding process [23]. groups. Therefore, the reason why intraprotein correlations re- main high is not only related to the repetition of secondary structure units, but is also the conservation of the amino Table 3: Groups of amino acids determined by maximisation of acids responsible for the protein tertiary structure. the positive medium range correlation. Amino acids that are more Beside this, it is important to notice that, even if the likely to appear at 200 residues distance are grouped together. amino acid usage in eukaryotes and prokaryotes is very sim- Proteome Groups ilar [24], the amino acid correlations are not, as they col- LIFWSY lect part of the structural information, contained in the se- quences. The number of groups is also different: 3 for H. in- Hi VMGATP fluenzae and M. jannaschii,4forS. cerevisiae and H. sapiens. NQHKRDEC This could indicate a higher interchangeability of residues in LIFWNSY some proteomes, but further analysis is needed to confirm Mj VMQHGATCP this hypothesis. KRDE LIMFWCY 2.3. Sequence entropy estimation NQHSTP Sc KRDE In order to quantify the capability that a statistical model has VGA to identify the nonrandomness of a sequence, one can use it to construct an arithmetic coding compressor [25]. We es- VLIMFWNY timate the compression rate of such a compressor with the HSTC Hs sequence entropy QKDE RGAP N =−1 S log 2 pi(σi), (11) N i its group. If F(G ) >F(G), the algorithm accepts the new par- using the model to calculate the probability Pi σi of charac- tition. Iterating this procedure we would reach a local max- ter σi at position i. 
The better is the model, the lower is the imum which may not be the absolute maximum. In order estimated value of the sequence entropy. We construct three to avoid being trapped in a local maximum, the algorithm models to estimate the probability of each character, consid- accepts, with a small probability P, a new partition G for ering the previous ones and taking into account both short which F(G) ≤ F(G). The value of this probability P slowly and medium range correlations. For each model, we find pa- decreases to zero as the number of iterations increases in such rameters that minimise the sequence entropy. The Smin value a way that the convergence of the algorithm to the absolute obtained is taken as an estimate of the compression rate of maximum of F is guaranteed. a running arithmetic codification [25] of the proteomes and The number and the structure of the groups chosen have is used to compare our results with other compression algo- the highest value of F(G) and represent an equilibrated par- rithms (Table 4). tition of the 20 amino acids, that is, groups with only one Previous works on protein sequence compression like [5] element are not accepted. are based on short range Markovian models. In those models, The idea behind our grouping scheme is to simplify the probability of each amino acid is calculated as a function the amino acid pattern mining by taking advantage of their of the context in which it appears, considering the frequency 6 EURASIP Journal on Bioinformatics and Systems Biology
Table 4: Compression rate in bits per character for the studied proteomes. One-character entropy is the entropy of the sequences considering that their residues are independently distributed.

Algorithm                                  Hi      Mj      Sc      Hs
One-character entropy                      4.155   4.068   4.165   4.133
CP, Nevill-Manning and Witten 1999 [5]     4.143   4.051   4.146   4.112
lza-CTW, Matsumoto et al. 2000 [6]         4.118   4.028   3.951   3.920
ProtComp, Cao et al. 2007 [7]              4.108   4.008   3.938   3.824
XM, Cao et al. 2007 [7]                    4.102   4.000   3.885   3.786
Model 1*                                   4.111   4.017   3.963   3.978
Model 2*                                   4.102   4.005   3.948   3.933
Model 3*                                   4.100   4.002   3.945   3.931
ProtComp, Hategan and Tabus 2004 [8]†      2.330   3.910   3.440   3.910
BWT/SCP, Adjeroh and Nan 2006 [9]†         2.546   2.273   3.111   3.435

* Estimation. † Results obtained with a different set of proteomes.

with which this amino acid happens to be after the l previous residues.

Following this idea, we start our statistical description of proteome sequences taking into account the information given by the neighboring residues, using a variation of the interpolated Markov models [26]. In order to predict the probability of the ith character, we consider the contexts up to a length N_c (number of contexts) that precede it, that is, the substrings σ_{i−k} ··· σ_{i−1} for k = 0, ..., N_c. For any character a, we count the number F_k^i(a) of previous occurrences of the substring σ_{i−k} ··· σ_{i−1}a. The conditional frequency of finding character a after the context σ_{i−k} ··· σ_{i−1} is obtained by dividing by the sum over all amino acids b at position i:

F_k^i(a) / Σ_b F_k^i(b).    (12)

Our model 1 predicts the probability of character a at position i with

Model 1:  p_i(a) = [1 + Σ_{k=0}^{N_c} λ_k F_k^i(a)] / Σ_b [1 + Σ_{k=0}^{N_c} λ_k F_k^i(b)].    (13)

We remark that the main difference between our short range approach and CTW is that we give a weight to the different contexts, while in [6] a weight is given to their corresponding conditional probabilities. We find that the most informative positions were the previous 8; this length is in qualitative agreement with the results found in [6]. Model 1 in Table 4 indicates the results obtained considering only the short range correlations for N_c = 8.

The model depends on the parameters λ_k, which are optimised, using standard algorithms for minimisation, in order to achieve the best estimate of the compression rate. This "entropy minimisation" stage is very time expensive. In a real compression procedure, those parameters should be specified and therefore would contribute to the estimated entropy. In our case this contribution is negligible.

The short range correlations support the existence of periodic patterns in protein sequences. They can be caused by the alternation of alpha-beta secondary structure units, as argued in other works on latent periodicity of protein sequences [27, 28]. From the point of view of protein sequence evolution, the short range parameters can also reflect the existence of constraints on the distribution of residues. Protein sequences are modified by mutation, but still have to cope with folding requirements that determine a nonrandom positioning of key residues, depending on their geometrical and physico-chemical properties. In fact, structural alphabets derived from hidden Markov models denote that local conformations of protein structures have different sequence specificity [29].

The intra/interprotein correlations identified in previous sections suggest that the frequencies of the single residues have nonnegligible fluctuations on the medium range. We take these fluctuations into account in our second model (model 2 in Table 4):

Model 2:  p_i(a) = [1 + μ R_L^i(a) + Σ_{k=0}^{N_c} λ_k F_k^i(a)] / Σ_b [1 + μ R_L^i(b) + Σ_{k=0}^{N_c} λ_k F_k^i(b)].    (14)

Here we added

R_L^i(a) = (i/L) × (number of occurrences of a in σ_{i−L} ··· σ_{i−1}).    (15)

This quantity is proportional to the frequency of the amino acid a in the subsequence of length L, with L a distance of medium scale, starting from the position i − L. The factor i/L guarantees that Σ_a R_L^i(a) = i, so that it increases with i in the same way as the other terms of the sum (e.g., Σ_a F_0^i(a) = i). The parameter μ is optimised as the λ_k are. The optimal values for L found during the entropy minimisation stage are 190 for Hi, 163 for Mj, 105 for Sc, and 115 for Hs.

Finally, in model 3, we use the groups found in Section 2.2 (see Table 3). In particular, a contribution to the probability of a given residue is obtained by computing the probability of the residue to belong to a certain group and then the conditional probability of the residue once the group is given:

Model 3:  p_i(a) = [1 + μ G_L^i(g_a) f^i(a) + Σ_{k=0}^{N_c} λ_k F_k^i(a)] / Σ_b [1 + μ G_L^i(g_b) f^i(b) + Σ_{k=0}^{N_c} λ_k F_k^i(b)],    (16)
where g_a is the group of a, f^i(a) is the relative frequency of a in its group, as measured up to the position i − 1, and

G_L^i(g) = (i/L) × (number of amino acids of the group g in σ_{i−L} ··· σ_{i−1}).    (17)

For this model, the optimal values of the parameter L are 129 for Hi, 94 for Mj, 77 for Sc, and 100 for Hs.

As one can see in Table 4, the capability of our statistical model to represent the nonrandom information contained in proteomes is comparable to that of models that consider repeated amino acid patterns at both short and medium scale [6, 7].

The improvement in the performance of models 2 and 3 is due to the fact that they identify the short range correlations and separate them from the fluctuations of amino acid frequencies at a protein length range. This demonstrates that both correlation types are informative and that the statistical significance of repetitions at those scales is enough to model the amino acid probabilities.

The compression rate achieved when the medium range correlations are modelled with the frequency of amino acid groups (model 3) is almost equivalent to the compression rate of model 2. From a biological perspective, this indicates that groups of amino acids are meaningful, and that the redundant information at medium scale has a structural component that might be coming from the three-dimensional structure constraints.

According to our results, there is an important difference in the compressibility rates of the eukaryotic and prokaryotic proteomes, which is in agreement with the correlation function in Figure 1. The sequences of S. cerevisiae and H. sapiens are more redundant, and thus more compressible, than those of H. influenzae and M. jannaschii; correspondingly, the correlation functions of Sc and Hs remain positive for longer distances than those of Hi and Mj. This additional redundancy could be related to the presence, in eukaryotic proteomes, of paralogous proteins with very similar distributions of synonymous amino acids, but different functions. There is evidence suggesting that paralogous genes have been recruited during the evolution of different metabolic pathways and are related to the organism's adaptability to environmental changes [16]. On the other hand, the lower compressibility of the Hi and Mj proteomes is in agreement with the reduction of prokaryotic genome size as an adaptation to fast metabolic rates [30, 31].

3. CONCLUSIONS

In this article, we show that the correlation function gathers evolutionary and structural information of proteomes. Even if proteins are highly complex sequences, at a proteome scale it is possible to identify correlations between characters at short and medium ranges. This confirms that protein sequences are not completely random; indeed, they present repeated amino acid patterns at those two scales. The alternation of secondary structure units can determine the local redundancy. This was already known and generally modelled using Markov models. In our opinion, sequence duplication is a reasonable explanation for the interprotein correlation. However, it does not account for the intraprotein correlations; these can instead be related to the maintenance of the amino acid patterns responsible for the three-dimensional structure, as the segregation between hydrophobic and polar amino acids indicates. More elaborately, the sampling of the space of structures during proteome evolution is determined by the duplication processes but is highly constrained by the structural and functional requirements that protein sequences have to meet inside a living system.

Prokaryotic proteomes show lower correlation values, especially for distances under 100 residues, and a smaller compressibility than eukaryotic proteomes. These characteristics point at a higher redundancy of eukaryotic proteome sequences, and suggest that the increase of proteome size does not imply de novo generation of protein sequences with completely different amino acid distributions.

ACKNOWLEDGMENTS

The authors would like to thank Toby Gibson for reading and commenting on the manuscript, and the reviewers for their constructive criticism that helped to improve the quality of the paper.

REFERENCES

[1] J. C. Wootton, "Non-globular domains in protein sequences: automated segmentation using complexity measures," Computers & Chemistry, vol. 18, no. 3, pp. 269–285, 1994.
[2] B. E. Blaisdell, "A prevalent persistent global nonrandomness that distinguishes coding and non-coding eucaryotic nuclear DNA sequences," Journal of Molecular Evolution, vol. 19, no. 2, pp. 122–133, 1983.
[3] Y. Almirantis and A. Provata, "An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome," BioEssays, vol. 23, no. 7, pp. 647–656, 2001.
[4] O. Weiss, M. A. Jiménez-Montaño, and H. Herzel, "Information content of protein sequences," Journal of Theoretical Biology, vol. 206, no. 3, pp. 379–386, 2000.
[5] C. G. Nevill-Manning and I. H. Witten, "Protein is incompressible," in Proceedings of the Data Compression Conference (DCC '99), pp. 257–266, Snowbird, Utah, USA, March 1999.
[6] T. Matsumoto, K. Sadakane, and H. Imai, "Biological sequence compression algorithms," Genome Informatics, vol. 11, pp. 43–52, 2000.
[7] M. D. Cao, T. I. Dix, L. Allison, and C. Mears, "A simple statistical algorithm for biological sequence compression," in Proceedings of the Data Compression Conference (DCC '07), pp. 43–52, Snowbird, Utah, USA, March 2007.
[8] A. Hategan and I. Tabus, "Protein is compressible," in Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG '04), pp. 192–195, Espoo, Finland, June 2004.
[9] D. Adjeroh and F. Nan, "On compressibility of protein sequences," in Proceedings of the Data Compression Conference (DCC '06), pp. 422–434, Snowbird, Utah, USA, March 2006.
[10] G. Sampath, "A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae," in Proceedings of the IEEE Bioinformatics Conference (CSB '03), pp. 287–293, Stanford, Calif, USA, August 2003.
[11] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379–423 and 623–656, 1948.
[12] J. Cleary and I. Witten, "Data compression using adaptive coding and partial string matching," IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402, 1984.
[13] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: basic properties," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653–664, 1995.
[14] Integr8 web portal, ftp://ftp.ebi.ac.uk/pub/databases/integr8/, 2006.
[15] J. Abel, "The data compression resource on the internet," http://www.datacompression.info/, 2005.
[16] C. A. Orengo and J. M. Thornton, "Protein families and their evolution—a structural perspective," Annual Review of Biochemistry, vol. 74, pp. 867–900, 2005.
[17] J. Heringa, "The evolution and recognition of protein sequence repeats," Computers & Chemistry, vol. 18, no. 3, pp. 233–243, 1994.
[18] M. A. Andrade, C. Petosa, S. I. O'Donoghue, C. W. Müller, and P. Bork, "Comparison of ARM and HEAT protein repeats," Journal of Molecular Biology, vol. 309, no. 1, pp. 1–18, 2001.
[19] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, 1983.
[20] L. A. Mirny and E. I. Shakhnovich, "Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function," Journal of Molecular Biology, vol. 291, no. 1, pp. 177–196, 1999.
[21] M. A. Huynen, P. F. Stadler, and W. Fontana, "Smoothness within ruggedness: the role of neutrality in adaptation," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 1, pp. 397–401, 1996.
[22] S. Karlin, "Statistical signals in bioinformatics," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 38, pp. 13355–13362, 2005.
[23] K. A. Dill, "Dominant forces in protein folding," Biochemistry, vol. 29, no. 31, pp. 7133–7155, 1990.
[24] B. Rost, "Did evolution leap to create the protein universe?" Current Opinion in Structural Biology, vol. 12, no. 3, pp. 409–416, 2002.
[25] J. Rissanen and G. G. Langdon Jr., "Arithmetic coding," IBM Journal of Research and Development, vol. 23, no. 2, pp. 149–162, 1979.
[26] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White, "Microbial gene identification using interpolated Markov models," Nucleic Acids Research, vol. 26, no. 2, pp. 544–548, 1998.
[27] V. P. Turutina, A. A. Laskin, N. A. Kudryashov, K. G. Skryabin, and E. V. Korotkov, "Identification of latent periodicity in amino acid sequences of protein families," Biochemistry (Moscow), vol. 71, no. 1, pp. 18–31, 2006.
[28] E. V. Korotkov and M. A. Korotkova, "Enlarged similarity of nucleic acid sequences," DNA Research, vol. 3, no. 3, pp. 157–164, 1996.
[29] A. C. Camproux and P. Tufféry, "Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity," Biochimica et Biophysica Acta, vol. 1724, no. 3, pp. 394–403, 2005.
[30] S. D. Bentley and J. Parkhill, "Comparative genomic structure of prokaryotes," Annual Review of Genetics, vol. 38, pp. 771–791, 2004.
[31] J. Raes, J. O. Korbel, M. J. Lercher, C. von Mering, and P. Bork, "Prediction of effective genome size in metagenomic samples," Genome Biology, vol. 8, no. 1, p. R10, 2007.

Hindawi Publishing Corporation, EURASIP Journal on Bioinformatics and Systems Biology, Volume 2007, Article ID 87356, 9 pages, doi:10.1155/2007/87356
Research Article
A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification
Chris Hemmerich1 and Sun Kim2
1 Center for Genomics and Bioinformatics, Indiana University, 1001 E. 3rd Street, Bloomington 47405-3700, Indiana, USA
2 School of Informatics, Center for Genomics and Bioinformatics, Indiana University, 901 E. 10th Street, Bloomington 47408-3912, Indiana, USA
Received 28 February 2007; Revised 22 June 2007; Accepted 31 July 2007
Recommended by Juho Rousu
We investigate methods of estimating residue correlation within protein sequences. We begin by using the mutual information (MI) of adjacent residues, and improve our methodology by defining the mutual information vector (MIV) to estimate long range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than residue-specific interactions. Finally, in protein family classification experiments, the modeling power of MIV was shown to be significantly better than that of the classic MI method, reaching the level where proteins can be classified without alignment information.
Copyright © 2007 C. Hemmerich and S. Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION

A protein can be viewed as a string composed from the 20-symbol amino acid alphabet or, alternatively, as the sum of its structural properties, for example, residue-specific interactions or hydropathy (hydrophilic/hydrophobic) interactions. Protein sequences contain sufficient information to construct secondary and tertiary protein structures. Most methods for predicting protein structure rely on primary sequence information by matching sequences representing unknown structures to those with known structures. Thus, researchers have investigated the correlation of amino acids within and across protein sequences [1–3]. Despite all this, in terms of character strings, proteins can be regarded as slightly edited random strings [1].

Previous research has shown that residue correlation can provide biological insight, but that MI calculations for protein sequences require careful adjustment for sampling errors. An information-theoretic analysis of amino acid contact potential pairings with a treatment of sampling biases has shown that the amount of amino acid pairing information is small, but statistically significant [2]. Another recent study by Martin et al. [3] showed that normalized mutual information can be used to search for coevolving residues.

From the literature surveyed, it was not clear what significance the correlation of amino acid pairings holds for protein structure. To investigate this question, we used the family and sequence alignment information from Pfam-A [4]. To model sequences, we defined and used the mutual information vector (MIV), where each entry represents the MI estimation for amino acid pairs separated by a particular distance in the primary structure. We studied two different properties of sequences: amino acid identity and hydropathy.

In this paper, we report three important findings.

(1) MI scores for the majority of 1000 real protein sequences sampled from Pfam are statistically significant (as defined by a P value cutoff of .05) as compared to random sequences of the same character composition, see Section 4.1.
(2) MIV has significantly better modeling power of proteins than MI, as demonstrated in the protein sequence classification experiment, see Section 5.2.
(3) The best classification results are provided by MIVs containing scores generated from both the amino acid alphabet and the hydropathy alphabet, see Section 5.2.

In Section 2, we briefly summarize the concept of MI and a method for normalizing MI content. In Section 3, we formally define the MIV and its use in characterizing protein sequences. In Section 4, we test whether MI scores for protein sequences sampled from the Pfam database are statistically significant compared to random sequences of the same residue composition. We test the ability of MIV to classify sequences from the Pfam database in Section 5, and in Section 6, we examine correlation with MIVs and further investigate the effects of alphabet size in terms of information theory. We conclude with a discussion of the results and their implications.

2. MUTUAL INFORMATION (MI) CONTENT

We use MI content to estimate correlation in protein sequences to gain insight into the prediction of secondary and tertiary structures. Measuring correlation between residues is problematic because sequence elements are symbolic variables that lack a natural ordering or underlying metric [5]. Residues can be ordered by certain properties such as hydropathy, charge, and molecular weight. Weiss and Herzel [6] analyzed several such correlation functions.

MI is a measure of correlation from information theory [7] based on entropy, which is a function of the probability distribution of residues. We can estimate entropy by counting residue frequencies. Entropy is maximal when all residues appear with the same frequency. MI is calculated by systematically extracting pairs of residues from a sequence and calculating the distribution of pair frequencies weighted by the frequencies of the residues composing the pairs.

By defining a pair as adjacent residues in the protein sequence, MI estimates the correlation between the identities of adjacent residues. We later define pairs using nonadjacent residues, and physical properties rather than residue identities.

MI has proven useful in multiple studies of biological sequences. It has been used to predict coding regions in DNA [8], and to detect coevolving residue pairs in protein multiple sequence alignments [3].

2.1. Mutual information

The entropy of a random variable X, H(X), represents the uncertainty of the value of X. H(X) is 0 when the identity of X is known, and H(X) is maximal when all possible values of X are equally likely. The mutual information of two variables, MI(X, Y), represents the reduction in uncertainty of X given Y, and conversely, MI(Y, X) represents the reduction in uncertainty of Y given X:

MI(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X).    (1)

When X and Y are independent, H(X | Y) simplifies to H(X), so MI(X, Y) is 0. The upper bound of MI(X, Y) is the lesser of H(X) and H(Y), representing complete correlation between X and Y:

H(X | Y) = H(Y | X) = 0.    (2)

We can measure the entropy of a protein sequence S as

H(S) = −Σ_{i∈Σ_A} P(x_i) log2 P(x_i),    (3)

where Σ_A is the alphabet of amino acid residues and P(x_i) is the marginal probability of residue i. In Section 3.3, we discuss several methods for estimating this probability.

From the entropy equations above, we derive the MI equation for a protein sequence X = (x_1, ..., x_N):

MI = Σ_{i∈Σ_A} Σ_{j∈Σ_A} P(x_i, x_j) log2 [P(x_i, x_j) / (P(x_i) P(x_j))],    (4)

where the pair probability P(x_i, x_j) is the frequency of two residues being adjacent in the sequence.

2.2. Normalization by joint entropy

Since MI(X, Y) represents a reduction in H(X) or H(Y), the value of MI(X, Y) can be altered significantly by the entropy in X and Y. The MI score we calculate for a sequence is also affected by the entropy in that sequence. Martin et al. [3] propose a method of normalizing the MI score of a sequence using its joint entropy. The joint entropy, H(X, Y), can be defined as

H(X, Y) = −Σ_{i∈Σ_A} Σ_{j∈Σ_A} P(x_i, x_j) log2 P(x_i, x_j)    (5)

and is related to MI(X, Y) by the equation

MI(X, Y) = H(X) + H(Y) − H(X, Y).    (6)

The complete equation for our normalized MI measurement is

MI(X, Y) / H(X, Y) = −[Σ_{i∈Σ_A} Σ_{j∈Σ_A} P(x_i, x_j) log2 (P(x_i, x_j) / (P(x_i) P(x_j)))] / [Σ_{i∈Σ_A} Σ_{j∈Σ_A} P(x_i, x_j) log2 P(x_i, x_j)].    (7)

3. MUTUAL INFORMATION VECTOR (MIV)

We calculate the MI of a sequence to characterize the structure of the resulting protein. The structure is affected by different types of interactions, and we can modify our methods to consider different biological properties of a protein sequence. To improve our characterization, we combine these different methods to create a vector of MI scores.

Using the flexibility of MI and existing knowledge of protein structures, we investigate several methods for generating MI scores from a protein sequence. We can calculate the pair probability P(x_i, x_j) using any relationship that is defined for all amino acid identities i, j ∈ Σ_A. In particular, we examine distance between residue pairings, different types of residue-residue interactions, classical and normalized MI scores, and three methods of interpreting gap symbols in Pfam alignments.

3.1. Distance MI vectors

Protein exists as a folded structure, allowing nonadjacent residues to interact. Furthermore, these interactions help to determine that structure. For this reason, we use MIV to characterize nonadjacent interactions. Our calculation of MI for adjacent pairs of residues is a specific case of a more general relationship, separation by exactly d residues in the sequence.
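As an illustration of Equations (4)-(7), the MI of adjacent residues and its joint-entropy normalization can be estimated directly from pair counts. The following is a minimal sketch under the paper's default choice of per-sequence residue frequencies as marginals; it is not the authors' implementation.

```python
from collections import Counter
from math import log2

def normalized_mi(seq):
    """Normalized MI of adjacent residues, MI(X, Y) / H(X, Y).

    Pair probabilities are adjacent-pair frequencies (eq. (4));
    marginals are residue frequencies of the sequence itself.
    Returns 0.0 when the joint entropy vanishes.
    """
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        return 0.0
    p_pair = {pr: c / len(pairs) for pr, c in Counter(pairs).items()}
    p_res = {r: c / len(seq) for r, c in Counter(seq).items()}
    # eq. (4): sum over observed pairs (unobserved pairs contribute 0)
    mi = sum(p * log2(p / (p_res[a] * p_res[b]))
             for (a, b), p in p_pair.items())
    # eq. (5): joint entropy used as the normalizer in eq. (7)
    h_joint = -sum(p * log2(p) for p in p_pair.values())
    return mi / h_joint if h_joint else 0.0
```

For a single-letter sequence such as "AAAA" both terms vanish and the score is 0; any sequence with correlated adjacent pairs yields a positive score.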
Definition 1. For a sequence S = (s_1, ..., s_N), the mutual information of distance d, MI(d), is defined as

MI(d) = Σ_{i∈Σ_A} Σ_{j∈Σ_A} P_d(x_i, x_j) log2 [P_d(x_i, x_j) / (P(x_i) P(x_j))].    (8)

The pair probabilities, P_d(x_i, x_j), are calculated using all combinations of positions s_m and s_n in sequence S such that

m + (d + 1) = n,  n ≤ N.    (9)

A sequence of length N will contain N − (d + 1) pairs. Table 1 shows how to extract pairs of distance 3 from the sequence DEIPCPFCGC.

Table 1: MI(3)—residue pairings of distance 3 for the sequence DEIPCPFCGC (each of the six pairings links positions m and m + 4):

(1) DEIPCPFCGC    (4) DEIPCPFCGC
(2) DEIPCPFCGC    (5) DEIPCPFCGC
(3) DEIPCPFCGC    (6) DEIPCPFCGC

Definition 2. The mutual information vector of length k for a sequence X, MIV(X), is defined as a vector of k entries, MI(0), ..., MI(k − 1).

3.2. Sequence alphabets

The alphabet chosen to represent the protein sequence has two effects on our calculations. First, by defining the alphabet, we also define the type of residue interactions we are measuring. By using the full amino acid alphabet, we are only able to find correlations based on residue-specific interactions. If we instead use an alphabet based on hydropathy, we find correlations based on hydrophilic/hydrophobic interactions. Second, altering the size of our alphabet has a significant effect on our MI calculations. This effect is discussed in Section 6.2.

In our study, we used two different alphabets: the set of 20 amino acid residues, Σ_A, and a hydropathy-based alphabet, Σ_H, derived from grammar complexity and syntactic structure of protein sequences [9] (see Table 2 for the mapping of Σ_A to Σ_H).

Table 2: Amino acid partition primarily based on hydropathy.

Hydropathy     Amino acids
Hydrophobic:   C, I, M, F, W, Y, V, L
Hydrophilic:   R, N, D, E, Q, H, K, S, T, P, A, G

3.3. Estimating residue marginal probabilities

To calculate the MIV for a sequence, we estimate the marginal probabilities for the characters in the sequence alphabet. The simplest method is to use residue frequencies from the sequence being scored. This is our default method. Unfortunately, the quality of the estimation suffers from the short length of protein sequences.

Our second method is to use a common prior probability distribution for all sequences. Since all of our sequences are part of the Pfam database, we use residue frequencies calculated from Pfam as our prior. In our results, we refer to this method as the Pfam prior. The large sample size allows the frequency to more accurately estimate the probability. However, since Pfam contains sequences from many organisms, the probability distribution is less accurate.

3.4. Interpreting gap symbols

The Pfam sequence alignments contain gap information, which presents a challenge for our MIV calculations. The gap character does not represent a physical element of the sequence, but it does provide information on how to view the sequence and compare it to others. Because of this contradiction, we compared three strategies for processing gap characters in the alignments.

The strict method

This method removes all gap symbols from a sequence before performing any calculations, operating on the protein sequence rather than an alignment.

The literal method

Gaps are a proven tool in creating alignments between related sequences and searching for relationships between sequences. This method expands the sequence alphabet to include the gap symbol. For Σ_A we define and use a new alphabet:

Σ′_A = Σ_A ∪ {−}.    (10)

MI is then calculated for Σ′_A. Σ_H is transformed to Σ_G using the same method.

The hybrid method

This method is a compromise of the previous two methods. Gap symbols are excluded from the sequence alphabet when calculating MI. Occurrences of the gap symbol are still considered when calculating the total number of symbols. For a sequence containing one or more gap symbols,

Σ_{i∈Σ_A} P_i < 1.    (11)

Pairs containing any gap symbols are also excluded, so for a gapped sequence,

Σ_{i,j∈Σ_A} P_{ij} < 1.    (12)

These adjustments result in a negative MI score for some sequences, unlike classical MI, where a minimum score of 0 represents independent variables.
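Definition 1, Definition 2, and the strict gap method above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; it uses the default per-sequence residue frequencies of Section 3.3 as marginals.

```python
from collections import Counter
from math import log2

def mi_d(seq, d):
    """MI(d) of Definition 1: pairs (s_m, s_n) with m + (d + 1) = n."""
    pairs = list(zip(seq, seq[d + 1:]))   # yields N - (d + 1) pairs
    if not pairs:
        return 0.0
    p_pair = {pr: c / len(pairs) for pr, c in Counter(pairs).items()}
    p_res = {r: c / len(seq) for r, c in Counter(seq).items()}
    return sum(p * log2(p / (p_res[a] * p_res[b]))
               for (a, b), p in p_pair.items())

def miv(seq, k, strict=True):
    """MIV of Definition 2: the vector (MI(0), ..., MI(k - 1)).

    strict=True applies the "strict" method of Section 3.4: all gap
    symbols are removed before any counting.
    """
    if strict:
        seq = seq.replace("-", "")
    return [mi_d(seq, d) for d in range(k)]
```

For the Table 1 example, `mi_d("DEIPCPFCGC", 3)` extracts the six pairings of distance 3, and `miv("DEIPCPFCGC", 4)` returns the four entries MI(0) through MI(3).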
Table 3: Example MIVs calculated for four sequences from Pfam. All methods used literal gap interpretation.
d | Globin: ΣA, ΣH | Ferrochelatase: ΣA, ΣH | DUF629: ΣA, ΣH | Big 2: ΣA, ΣH
0 | 1.34081, 0.42600 | 0.95240, 0.13820 | 0.70611, 0.04752 | 1.26794, 0.21026
1 | 1.20553, 0.23740 | 0.93240, 0.03837 | 0.63171, 0.00856 | 0.92824, 0.05522
2 | 1.07361, 0.12164 | 0.90004, 0.02497 | 0.63330, 0.00367 | 0.95326, 0.07424
3 | 0.92912, 0.02704 | 0.87380, 0.03133 | 0.66955, 0.00575 | 0.99630, 0.04962
4 | 0.97230, 0.00380 | 0.90400, 0.02153 | 0.62328, 0.00587 | 1.00100, 0.08373
5 | 0.91082, 0.00392 | 0.78479, 0.02944 | 0.68383, 0.00674 | 0.98737, 0.03664
6 | 0.90658, 0.01581 | 0.81559, 0.00588 | 0.63120, 0.00782 | 1.06852, 0.05216
7 | 0.87965, 0.02435 | 0.91757, 0.00822 | 0.67433, 0.00172 | 1.04627, 0.12002
8 | 0.83376, 0.01860 | 0.87615, 0.01247 | 0.63719, 0.00495 | 1.00784, 0.05221
9 | 0.88404, 0.01000 | 0.90823, 0.00721 | 0.61597, 0.00411 | 0.97119, 0.04002
10 | 0.88685, 0.01353 | 0.89673, 0.00611 | 0.60790, 0.00718 | 1.02660, 0.02240
11 | 0.90792, 0.01719 | 0.94314, 0.02195 | 0.66750, 0.00867 | 0.92858, 0.02261
12 | 0.95955, 0.00231 | 0.87247, 0.01027 | 0.64879, 0.00805 | 0.98879, 0.03156
13 | 0.88584, 0.01387 | 0.85914, 0.00733 | 0.66959, 0.00607 | 1.09997, 0.04766
14 | 0.93670, 0.01490 | 0.88250, 0.00335 | 0.66033, 0.00106 | 1.06989, 0.01286
15 | 0.86407, 0.02052 | 0.94592, 0.00548 | 0.62171, 0.01363 | 1.27002, 0.06204
16 | 0.89004, 0.04024 | 0.92664, 0.01398 | 0.63445, 0.00314 | 1.05699, 0.03154
17 | 0.91409, 0.01706 | 0.80241, 0.00108 | 0.67801, 0.00536 | 1.06677, 0.02136
18 | 0.89522, 0.01691 | 0.85366, 0.00719 | 0.65903, 0.00898 | 1.05439, 0.03310
19 | 0.92742, 0.03319 | 0.90928, 0.01334 | 0.70176, 0.00151 | 1.17621, 0.01902
3.5. MIV examples

Table 3 shows eight examples of MIVs calculated from the Pfam database. A sequence was taken from each of four random families, and the MIV was calculated using the literal gap method for both ΣH and ΣA. All scores are in bits. The scores generated from ΣA are significantly larger than those from ΣH. We investigate this observation further in Sections 4.1 and 6.2.

3.6. MIV concatenation

The previous sections have introduced several methods for scoring sequences that can be used to generate MIVs. Just as we combined MI scores to create an MIV, we can further concatenate MIVs. Any number of vectors calculated by any methods can be concatenated in any order. However, for two vectors to be comparable, they must be the same length and must agree on the feature stored at every index.

Definition 3. Any two MIVs, MIVj(A) and MIVk(B), can be concatenated to form MIVj+k(C).

4. ANALYSIS OF CORRELATION IN PROTEIN SEQUENCES

In [1], Weiss states that "protein sequences can be regarded as slightly edited random strings." This presents a significant challenge for successfully classifying protein sequences based on MI. In theory, a random string contains no correlation between characters, so we expect a "slightly edited random string" to exhibit little correlation. In practice, finite random strings usually have a nonzero MI score. This overestimation of MI in finite sequences is a function of the length of the string, the alphabet size, and the frequency of the characters that make up the string. We investigated the significance of this error for our calculations, along with methods for reducing or correcting it.

To confirm the significance of our MI scores, we used a permutation-based technique. We compared known coding sequences to random sequences in order to generate a P value signifying the chance that our observed MI score or higher would be obtained from a random sequence of residues. Since MI scores depend on sequence length and residue frequency, we used the shuffle command from the HMMER package to conserve these parameters in our random sequences.

We sampled 1000 sequences from our subset of Pfam-A. A simple random sample was performed without replacement from all sequences between 100 and 1000 residues in length. We calculated MI(0) for each sequence sampled. We then generated 10 000 shuffled versions of each sequence and calculated MI(0) for each.

We used three scoring methods to calculate MI(0):
(1) ΣA with literal gap interpretation,
(2) ΣA normalized by joint entropy with literal gap interpretation,
(3) ΣH with literal gap interpretation.
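The shuffle-based significance test can be sketched as follows. This is an assumption-laden stand-in for the paper's pipeline: the authors used HMMER's shuffle command, while here Python's `random.shuffle` plays that role (both preserve length and residue composition), and only the plain MI(0) score is tested.

```python
import random
from collections import Counter
from math import log2

def mi0(seq):
    """Plug-in MI between adjacent residues (MI(0) in the paper's notation)."""
    pairs = list(zip(seq, seq[1:]))
    p = {a: c / len(seq) for a, c in Counter(seq).items()}
    pd = {ab: c / len(pairs) for ab, c in Counter(pairs).items()}
    return sum(pab * log2(pab / (p[a] * p[b])) for (a, b), pab in pd.items())

def shuffle_p_value(seq, n_shuffles=1000, rng=random.Random(0)):
    """Estimate P = x / N: the fraction of composition-preserving shuffles
    whose MI(0) is at least as large as that of the original sequence."""
    observed = mi0(seq)
    chars = list(seq)
    x = 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)  # preserves length and residue frequencies
        if mi0("".join(chars)) >= observed:
            x += 1
    return x / n_shuffles
```

A strongly ordered string such as "ABAB..." yields a P value near 0, since almost no shuffle reproduces its adjacent-pair structure.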
[Figure 1: Mean MI(0) of shuffled sequences, in bits, plotted against sequence length (residue count) for ΣA literal; ΣA literal, normalized; and ΣH literal.]

[Figure 2: Normalized MI(0) of shuffled sequences: the mean MI(0) of the shuffles divided by the MI(0) of the source sequence, plotted against sequence length for the same three methods.]

In all three cases, the MI(0) score for a shuffled sequence of infinite length would be 0; therefore, the calculated scores represent the error introduced by sample-size effects. Figure 1 shows the average shuffled-sequence scores (i.e., the sampling error) in bits for each method. As expected, the sampling error tends to decrease as the sequence length increases.

4.1. Significance of MI(0) for protein sequences

To compare the amount of error in each method, we normalized the mean MI(0) scores from Figure 1 by dividing the mean MI(0) score by the MI(0) score of the sequence used to generate the shuffles. This ratio estimates the portion of the sequence MI(0) score attributable to sample-size effects.

Figure 2 compares the effectiveness of our two corrective methods in minimizing the sample-size effects. It shows that normalization by joint entropy is not as effective as Figure 1 suggests: despite a large reduction in bits, in most cases the portion of the score attributed to sampling effects shows only a minor improvement. ΣH still shows a significant reduction in sample-size effects for most sequences.

Figures 1 and 2 provide insight into trends for the three methods, but do not answer our question of whether or not the MI scores are significant. For a given sequence S, we estimated the P value as

P = x/N, (13)

where N is the number of random shuffles and x is the number of shuffles whose MI(0) was greater than or equal to MI(0) for S. For this experiment, we chose a significance cutoff of .05. For a sequence to be labeled significant, no more than 50 of the 10 000 shuffled versions may have an MI(0) score equal to or larger than that of the original sequence. We repeated this experiment for MI(1), MI(5), MI(10), and MI(15) and summarized the results in Table 4.

These results suggest that despite the low MI content of protein sequences, we are able to detect significant MI in a majority of our sampled sequences at MI(0). The number of significant sequences decreases for MI(d) as d increases. The results for the classic MI method are significantly affected by sampling error. Normalization by joint entropy reduces this error slightly for most sequences, and using ΣH is a much more effective correction.

5. MEASURING MIV PERFORMANCE THROUGH PROTEIN CLASSIFICATION

We used sequence classification to evaluate the ability of MI to characterize protein sequences and to test our hypothesis that an MIV characterizes a protein sequence better than MI alone. As such, our objective is to measure the difference in accuracy between the methods, rather than to reach a specific classification accuracy.

We used the Pfam-A dataset to carry out this comparison. The families contained in the Pfam database vary in sequence count and sequence length. We removed all families containing any sequence of fewer than 100 residues due to complications with calculating MI for short strings. We also limited our study to families with more than 10 and at most 200 sequences. After filtering Pfam-A on these requirements, we were left with 2392 families to consider in the experiment.

Sequence similarity is the most widely used method of family classification; BLAST [10] is a popular tool incorporating this method. Our method differs significantly, in that classification is based on a vector of numerical features rather than on the protein's residue sequence.
Table 4: Sequence significance calculated for a significance cutoff of .05. Entries give the number of significant sequences (of 1000).

Scoring method | MI(0) | MI(1) | MI(5) | MI(10) | MI(15)
Literal-ΣA | 762 | 630 | 277 | 103 | 54
Normalized literal-ΣA | 777 | 657 | 309 | 106 | 60
Literal-ΣH | 894 | 783 | 368 | 162 | 117

Classification of feature vectors is a well-studied problem with many available strategies. A good introduction to many methods is available in [11], and the method chosen can significantly affect performance. Since the focus of this experiment is to compare methods of calculating MIVs, we only used the well-established and versatile nearest neighbor classifier in conjunction with Euclidean distance [12].

5.1. Classification implementation

For classification, we used the WEKA package [11]. WEKA uses the instance-based 1 (IB1) algorithm [13] to implement nearest neighbor classification. This is an instance-based learning algorithm derived from the nearest neighbor pattern classifier and is more efficient than the naive implementation.

The results of this method can differ from the classic nearest neighbor classifier in that the range of each attribute is normalized. This normalization ensures that each attribute contributes equally to the calculation of the Euclidean distance. As shown in Table 3, MI scores calculated from ΣA have a larger magnitude than those calculated from ΣH; the normalization allows the two alphabets to be used together.

5.2. Sequence classification with MIV

In this experiment, we explore the effectiveness of classifications made using the correlation measurements outlined in Section 3.

Each experiment was performed on a random sample of 50 families from our subset of the Pfam database. We then used leave-one-out cross-validation [14] to test each of our classification methods on the chosen families. In leave-one-out validation, the sequences from all 50 families are placed in a training pool. In turn, each sequence is extracted from this pool and the remaining sequences are used to build a classification model. The extracted sequence is then classified using this model. If the sequence is placed in the correct family, the classification is counted as a success. Accuracy for each method is measured as

accuracy = (no. of correct classifications) / (no. of classification attempts). (14)

We repeated this process 100 times, using a new sample of 50 families from Pfam each time. Results are reported for each method as the mean accuracy over these repetitions. For each of the 24 combinations of scoring options outlined in Section 3, we evaluated classification based on MI(0) as well as on MIV20. The results of these experiments are summarized in Table 5.

All MIV20 methods were more accurate than their MI(0) counterparts. The best method was ΣH with hybrid gap scoring, with a mean accuracy of 85.14%. The eight best performing methods used ΣH, with the best method based on ΣA having a mean accuracy of only 66.69%. Another important observation is that strict gap interpretation performs poorly in sequence classification: the best strict method had a mean accuracy of 29.96%, much lower than the other gap methods.

Our final classification attempts were made using concatenations of previously generated MIV20 scores. We evaluated all combinations of methods; the five combinations most accurate at classification are shown in Table 6. The best method combinations are over 90% accurate, with the best reaching 90.99%. The classification power of ΣH with hybrid gap interpretation is demonstrated, as this method appears in all five results. Surprisingly, two strict scoring methods appear in the top five, despite their poor performance when used alone.

Based on our results, we made the following observations.

(1) The correlation of nonadjacent pairs as measured by MIV is significant. Classification based on every method improved significantly for MIV compared to MI(0). The highest accuracy achieved for MI(0) was 26.73%, and for MIV it was 85.14% (see Table 5).

(2) Normalization had an insignificant effect on scores generated from ΣH. Both methods reduce the sample-size error in estimating entropy and MI for sequences. A possible explanation for the lack of further improvement through normalization is that ΣH is a more effective corrective measure than normalization. We explore this possibility further in Section 6.2, where we consider entropy for both alphabets.

(3) For the most accurate methods, using the Pfam prior decreased accuracy. Despite our concerns about using the frequency of a short sequence to estimate the marginal residue probabilities, the results show that these estimations better characterize the sequences than the Pfam prior probability distribution. However, four of the five best combinations contain a method utilizing the Pfam prior, showing that the two methods for estimating marginal probabilities are complementary.

(4) As with sequence-based classification, introducing gaps improves accuracy. For all methods, removing gap characters with the strict method drastically reduced accuracy. Despite this, two of the five best combinations included a strict scoring method.

(5) The best scoring concatenated MIVs included both alphabets. The inclusion of ΣA is significant: all eight nonstrict ΣH methods scored better than any ΣA method (see Table 5). The inclusion shows that ΣA provides information not included in ΣH and strengthens our assertion that the different alphabets characterize different forces affecting protein structure.
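The evaluation protocol of Section 5.2 (range-normalized attributes, a 1-nearest-neighbor classifier with Euclidean distance, and leave-one-out accuracy) can be sketched without WEKA. The function names are ours, and this is a minimal stand-in for IB1, not the WEKA implementation.

```python
def normalize_columns(vectors):
    """Rescale each attribute to [0, 1] so that all MIV entries contribute
    comparably to the Euclidean distance, as the paper describes for IB1."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    span = [max(c) - min(c) or 1.0 for c in cols]  # avoid divide-by-zero
    return [[(v - l) / s for v, l, s in zip(vec, lo, span)] for vec in vectors]

def loo_accuracy(vectors, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    vecs = normalize_columns(vectors)
    correct = 0
    for i, query in enumerate(vecs):
        # Nearest neighbor among all other instances (squared distance).
        dist, nearest = min(
            (sum((a - b) ** 2 for a, b in zip(query, other)), labels[j])
            for j, other in enumerate(vecs) if j != i
        )
        correct += nearest == labels[i]
    return correct / len(vecs)
```

With well-separated families the sketch classifies perfectly; with MIV20 feature vectors it would reproduce the kind of comparison reported in Table 5.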
Table 5: Classification results for MI(0) and MIV20 methods. SD represents the standard deviation of the experiment accuracies.
MIV20 rank | Method | MI(0) accuracy mean | MI(0) SD | MIV20 accuracy mean | MIV20 SD
1 Hybrid-ΣH 26.73% 2.59 85.14% 2.06
2 Normalized hybrid-ΣH 26.20% 4.16 85.01% 2.19
3 Literal-ΣH 22.92% 3.41 79.51% 2.79
4 Normalized literal-ΣH 23.45% 3.88 78.86% 2.79
5 Normalized hybrid-ΣH w/Pfam prior 26.31% 3.95 77.21% 2.94
6 Literal-ΣH w/Pfam prior 22.73% 4.90 76.89% 2.91
7 Normalized literal-ΣH w/Pfam prior 22.45% 4.89 76.29% 2.96
8 Hybrid-ΣH w/Pfam prior 22.81% 2.97 71.57% 3.15
9 Normalized literal-ΣA 17.76% 3.21 66.69% 4.14
10 Hybrid-ΣA 17.16% 3.06 64.09% 4.36
11 Normalized literal-ΣA w/Pfam prior 19.60% 3.67 63.39% 4.05
12 Literal-ΣA 16.36% 2.84 61.97% 4.32
13 Literal-ΣA w/Pfam prior 19.95% 2.84 61.82% 4.12
14 Hybrid-ΣA w/Pfam prior 23.09% 3.36 58.07% 4.28
15 Normalized hybrid-ΣA 18.10% 3.08 41.76% 4.59
16 Normalized hybrid-ΣA w/Pfam prior 23.32% 3.65 40.46% 4.04
17 Strict-ΣH w/Pfam prior 12.97% 2.85 29.96% 3.89
18 Normalized strict-ΣH w/Pfam prior 13.01% 2.72 29.81% 3.87
19 Normalized strict-ΣA w/Pfam prior 19.77% 3.52 29.73% 3.93
20 Normalized strict-ΣA 18.27% 2.92 29.20% 3.65
21 Strict-ΣH 11.22% 2.33 29.09% 3.60
22 Normalized strict-ΣH 11.15% 2.52 28.85% 3.58
23 Strict-ΣA w/Pfam prior 19.25% 3.38 28.44% 3.91
24 Strict-ΣA 16.27% 2.75 25.80% 3.60
Table 6: Top scoring combinations of MIV methods. All combinations of two MIV methods were tested; the five most accurate combinations are shown. SD represents the standard deviation of the experiment accuracies.
Rank First method Second method Mean accuracy SD
1 Hybrid-ΣH Normalized hybrid-ΣA w/Pfam prior 90.99% 1.44
2 Hybrid-ΣH Normalized strict-ΣA w/Pfam prior 90.66% 1.47
3 Hybrid-ΣH Literal-ΣA w/Pfam prior 90.30% 1.48
4 Hybrid-ΣH Literal-ΣA 90.24% 1.73
5 Hybrid-ΣH Strict-ΣA w/Pfam prior 90.08% 1.57
6. FURTHER MIV ANALYSIS

In this section, we examine the results of our different methods of calculating MIVs for Pfam sequences. We first use correlation within the MIV as a metric to compare several of our scoring methods. We then take a closer look at the effect of reducing the alphabet size when translating from ΣA to ΣH.

6.1. Correlation within MIVs

We calculated MIVs for 120 276 Pfam sequences using each of our methods and measured the correlation within each method using Pearson's correlation. The results of this analysis are presented in Figure 3. Each method is represented by a 20 × 20 grid containing each pairing of entries within that MIV.

The results strengthen our observations from the classification experiment. Methods that performed well in classification exhibit less redundancy between MIV indexes. In particular, the advantage of the methods using ΣH is clear: in each case, correlation decreases as the distance between indexes increases. For short distances, ΣA methods exhibit this to a lesser degree; however, after index 10, the scores are highly correlated.

6.2. Effect of alphabets

Not all intraprotein interactions are residue specific. Cline [2] explored information attributed to hydropathy, charge, disulfide bonding, and burial. Hydropathy, an alphabet composed of two symbols, was found to contain half as much information as the 20-element amino acid alphabet.
[Figure 3: eight 20 × 20 Pearson correlation grids (color scale roughly 0.2 to 0.8). (a) Literal-ΣA, Normalized literal-ΣA, Hybrid-ΣA, Normalized hybrid-ΣA. (b) Literal-ΣH, Normalized literal-ΣH, Hybrid-ΣH, Normalized hybrid-ΣH.]
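The finite-sample entropy measurements reported in Table 7 of Section 6.2 can be sketched as follows. The 8/12 hydropathy split below is a hypothetical stand-in for the paper's Table 2 mapping (which is not reproduced in this section), chosen only to match the stated class sizes; the function names are ours.

```python
import random
from collections import Counter
from math import log2

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# Hypothetical two-way partition with 8 hydrophobic / 12 hydrophilic residues.
HYDROPHOBIC = set("AVLIMFWC")

def entropy(seq):
    """Plug-in (maximum-likelihood) entropy estimate in bits."""
    return -sum((c / len(seq)) * log2(c / len(seq))
                for c in Counter(seq).values())

def mean_entropy_bias(n_seqs=1000, length=100, rng=random.Random(0)):
    """Mean measured entropy for uniform random 20-letter sequences and for
    their 2-letter hydropathy translations (cf. Table 7)."""
    full, hyd = 0.0, 0.0
    for _ in range(n_seqs):
        s = "".join(rng.choice(AMINO) for _ in range(length))
        full += entropy(s)
        hyd += entropy("".join("h" if a in HYDROPHOBIC else "p" for a in s))
    return full / n_seqs, hyd / n_seqs
```

Run over many 100-residue sequences, the 20-letter estimate falls noticeably short of the theoretical log2(20) ≈ 4.322 bits, while the 2-letter estimate sits very close to its theoretical value, the underestimation effect the section discusses.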
Figure 3: Pearson's correlation analysis of scoring methods. Note the reduced correlation in the methods based on ΣH, which all performed very well in classification tests.

However, with only two symbols, the alphabet should be more resistant to the underestimation of entropy and the overestimation of MI caused by finite sequence effects [15]. For this method, a protein sequence is translated using the process given in Section 3.2. It is important to remember that the scores generated for entropy and MI are estimates based on finite samples. Because of the reduced alphabet size of ΣH, we expected to see increased accuracy in entropy and MI estimations. To confirm this, we examined the effects of converting random sequences of 100 residues (a length representative of those found in the Pfam database) into ΣH.

We generated each sequence from a Bernoulli scheme. Each position in the sequence is selected independently of any residues selected before it, and all selections are made randomly from a uniform distribution; therefore, for every position in the sequence, all residues are equally likely to occur. By sampling residues from a uniform distribution, the Bernoulli scheme maximizes entropy for the alphabet size N:

H = −log2(1/N). (15)

Since all positions are independent of the others, MI is 0. Knowing the theoretical values of both entropy and MI, we can compare the calculated estimates for a finite sequence to the theoretical values to determine the magnitude of finite sequence effects.

We estimated entropy and MI for each of these sequences and then translated the sequences to ΣH. The translated sequences are no longer uniform Bernoulli sequences because the residue partitioning is not equal: eight residues fall into one category and twelve into the other. Therefore, we estimated the expected entropy for the new alphabet using this probability distribution. The positions remain independent, so the expected MI remains 0.

Table 7: Comparison of measured entropy to expected entropy values for 1000 amino acid sequences. Each sequence is 100 residues long and was generated by a Bernoulli scheme.

Alphabet | Alphabet size | Theoretical entropy | Mean measured entropy
ΣA | 20 | 4.322 | 4.178
ΣH | 2 | 0.971 | 0.964

Table 7 shows the measured and expected entropies for both alphabets. The entropy for ΣA is underestimated by .144, while the entropy for ΣH is underestimated by only .007. The effect of ΣH on MI estimation is much more pronounced: Figure 4 shows the dramatic overestimation of MI in ΣA and the high standard deviation around the mean, while the overestimation of MI for ΣH is negligible in comparison.

[Figure 4: Comparison of MI overestimation in protein sequences generated from Bernoulli schemes for gap distances from 0 to 19 residues. Curves: mean MIV for ΣH and for ΣA; y-axis MI(d), x-axis residue distance d. The full residue alphabet greatly overestimates MI, while reducing the alphabet to two symbols approximates the theoretical value of 0.]

7. CONCLUSIONS

We have shown that residue correlation information can be used to characterize protein sequences. To model sequences, we defined and used the mutual information vector (MIV), where each entry represents the mutual information content between residues at the corresponding distance. We have shown that the MIVs of proteins are significantly different from those of random sequences of the same character composition when the distance between residues is considered. Furthermore, we have shown that the MIV values of proteins are significant enough to determine the family membership of a protein sequence with an accuracy of over 90%. What we have shown is simply that the MIV score of a protein is significant enough for family classification; MIV is not a practical alternative to similarity-based family classification methods.

There are a number of interesting questions to be answered. In particular, it is not clear how to interpret a vector of mutual information values. It would also be interesting to study the effect of distance in computing mutual information in relation to protein structures, especially in terms of secondary structures. In our experiment (see Table 4), we have observed that normalized MIV scores exhibit more information content than nonnormalized MIV scores. However, in the classification task, normalized MIV scores did not always achieve better classification accuracy than nonnormalized MIV scores. We hope to investigate this issue in the future.

ACKNOWLEDGMENTS

This work is partially supported by NSF DBI-0237901 and the Indiana Genomics Initiative (INGEN). The authors also thank the Center for Genomics and Bioinformatics for the use of computational resources.

REFERENCES

[1] O. Weiss, M. A. Jiménez-Montaño, and H. Herzel, "Information content of protein sequences," Journal of Theoretical Biology, vol. 206, no. 3, pp. 379–386, 2000.
[2] M. S. Cline, K. Karplus, R. H. Lathrop, T. F. Smith, R. G. Rogers Jr., and D. Haussler, "Information-theoretic dissection of pairwise contact potentials," Proteins: Structure, Function and Genetics, vol. 49, no. 1, pp. 7–14, 2002.
[3] L. C. Martin, G. B. Gloor, S. D. Dunn, and L. M. Wahl, "Using information theory to search for co-evolving residues in proteins," Bioinformatics, vol. 21, no. 22, pp. 4116–4124, 2005.
[4] A. Bateman, L. Coin, R. Durbin, et al., "The Pfam protein families database," Nucleic Acids Research, vol. 32, Database issue, pp. D138–D141, 2004.
[5] W. R. Atchley, W. Terhalle, and A. Dress, "Positional dependence, cliques, and predictive motifs in the bHLH protein domain," Journal of Molecular Evolution, vol. 48, no. 5, pp. 501–516, 1999.
[6] O. Weiss and H. Herzel, "Correlations in protein sequences and property codes," Journal of Theoretical Biology, vol. 190, no. 4, pp. 341–353, 1998.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 1991.
[8] I. Grosse, H. Herzel, S. V. Buldyrev, and H. E. Stanley, "Species independence of mutual information in coding and noncoding DNA," Physical Review E, vol. 61, no. 5, pp. 5624–5629, 2000.
[9] M. A. Jiménez-Montaño, "On the syntactic structure of protein sequences and the concept of grammar complexity," Bulletin of Mathematical Biology, vol. 46, no. 4, pp. 641–659, 1984.
[10] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
[11] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[12] T. M. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
[13] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, no. 1, pp. 37–66, 1991.
[14] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 2, pp. 1137–1145, Montréal, Québec, Canada, August 1995.
[15] H. Herzel, A. O. Schmitt, and W. Ebeling, "Finite sample effects in sequence analysis," Chaos, Solitons & Fractals, vol. 4, no. 1, pp. 97–113, 1994.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 14741, 11 pages
doi:10.1155/2007/14741
Research Article Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates
Hasan Metin Aktulga,1 Ioannis Kontoyiannis,2 L. Alex Lyznik,3 Lukasz Szpankowski,4 Ananth Y. Grama,1 and Wojciech Szpankowski1
1 Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA 2 Department of Informatics, Athens University of Economics & Business, Patission 76, 10434 Athens, Greece 3 Pioneer Hi-Bred International, Johnston, IA, USA 4 Bioinformatics Program, University of California, San Diego, CA 92093, USA
Received 26 February 2007; Accepted 25 September 2007
Recommended by Petri Myllymäki
Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the 5′ untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's combined DNA index system (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling.
Copyright © 2007 Hasan Metin Aktulga et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION

Questions of quantification, representation, and description of the overall flow of information in biosystems are of central importance in the life sciences. In this paper, we develop statistical tools based on information-theoretic ideas, and demonstrate their use in identifying informative parts in biomolecules. Specifically, our goal is to detect statistically dependent segments of biosequences, hoping to reveal potentially important biological phenomena. It is well known [1–3] that various parts of biomolecules, such as DNA, RNA, and proteins, are significantly (statistically) correlated. Formal measures and techniques for quantifying these correlations are topics of current investigation. The biological implications of these correlations are deep, and they themselves remain unresolved. For example, statistical dependencies between exons carrying protein coding sequences and noncoding introns may indicate the existence of as-yet unknown error correction mechanisms or structural scaffolds. Thus motivated, we propose to develop precise and reliable methodologies for quantifying and identifying such dependencies, based on the information-theoretic notion of mutual information.

Biomolecules store information in the form of monomer strings such as deoxyribonucleotides, ribonucleotides, and amino acids. As a result of numerous genome and protein sequencing efforts, vast amounts of sequence data are now available for computational analysis. While basic tools such as BLAST provide powerful computational engines for identification of conserved sequence motifs, they are less suitable for detecting potential hidden correlations without experimental precedence (higher-order substitutions).

The application of analytic methods for finding regions of statistical dependence through mutual information has been illustrated through a comparative analysis of the 5′ untranslated regions of DNA coding sequences [4]. It has been known that eukaryotic translational initiation requires the consensus sequence around the start codon defined as the Kozak motif [5]. By screening at least 500 sequences, an unexpected correlation between positions −2 and −1 of the Kozak sequence was observed, thus implying a novel translational initiation signal for eukaryotic genes. This pattern was discovered using mutual information, and was not detected by analyzing single-nucleotide conservation. In other relevant work, neighbor-dependent substitution matrices were applied to estimate the average mutual information content of the core promoter regions from five different organisms [6, 7]. Such comparative analyses verified the importance of TATA-boxes and transcriptional initiation. A similar methodology elucidated patterns of sequence conservation at the 3′ untranslated regions of orthologous genes from human, mouse, and rat genomes [8], making them potential targets for experimental verification of hidden functional signals.

In a different kind of application, statistical dependence techniques find important applications in the analysis of gene expression data. Typically, the basic underlying assumption in such analyses is that genes expressed similarly under divergent conditions share functional domains of biological activity. Establishing dependency or potential relationships between sets of genes from their expression profiles holds the key to the identification of novel functional elements. Statistical approaches to the estimation of mutual information from gene expression datasets have been investigated in [1].

Protein engineering is another important area where statistical dependency tools are utilized. Reliable predictions of protein secondary structures based on long-range dependencies may enhance functional characterizations of proteins [9]. Since secondary structures are determined by both short- and long-range interactions between single amino acids, the application of comparative statistical tools based on consensus sequence algorithms or short amino acid sequences centered on the prediction sites is far from optimal. Analyses that incorporate mutual information estimates may provide more accurate predictions.

In this work we focus on developing reliable and precise information-theoretic methods for determining whether two biosequences are likely to be statistically dependent. Our main goal is to develop efficient algorithmic tools that can be easily applied to large data sets, mainly (though not exclusively) as a rigorous exploratory tool. In fact, as discussed in detail below, our findings are not the final word on the experiments we performed, but, rather, the first step in the process of identifying segments of interest. Another motivating factor for this project, which is more closely related to ideas from information theory, is the question of determining whether there are error correction mechanisms built into large molecules, as argued by Battail; see [10] and the references therein. We choose to work with protein coding exons and noncoding introns. While exons are well-conserved parts of DNA, introns have much greater variability. They are dispersed on strings of biopolymers and still they have to be precisely identified in order to produce biologically relevant information. It seems that there is no external source of information but the structure of RNA molecules themselves to generate functional templates for protein synthesis. Determining potential mutual relationships between exons and introns may justify additional search for still unknown factors affecting RNA processing.

The complexity and importance of the RNA processing system is emphasized by the largely unexplained mechanisms of alternative splicing, which provide a source of substantial diversity in gene products. The same sequence may be recognized as an exon or an intron, depending on a broader context of splicing reactions. The information that is required for the selection of a particular segment of RNA molecules is very likely embedded into either exons or introns, or both. Again, it seems that the splicing outcome is determined by structural information carried by RNA molecules themselves, unless the fundamental dogma of biology (the unidirectional flow of information from DNA to proteins) is to be questioned.

Finally, the constant evolution of genomes introduces certain polymorphisms, such as tandem repeats, which are an important component of genetic profiling applications. We also study these forms of statistical dependencies in biological sequences using mutual information.

In Section 2 we develop some theoretical background, and we derive a threshold function for testing statistical significance. This function admits a dual interpretation either as the classical log-likelihood ratio from hypothesis testing, or as the "empirical mutual information."

Section 3 contains our experimental results. In Section 3.1 we present our empirical findings for the problem of detecting statistical dependency between different parts in a DNA sequence. Extensive numerical experiments were carried out on certain regions of the maize zmSRp32 gene [11], which is functionally homologous to the human ASF/SF2 alternative splicing factor. The efficiency of the empirical mutual information in this context is demonstrated. Moreover, our findings suggest the existence of a biological connection between the 5′ untranslated region in zmSRp32 and its alternatively spliced exons.

Finally, in Section 3.2, we show how the empirical mutual information can be utilized in the difficult problem of searching DNA sequences for short tandem repeats (STRs), an important task in genetic profiling. We extend the simple hypothesis test of the previous sections to a methodology for testing a DNA string against different "probe" sequences, in order to detect STRs both accurately and efficiently. Experimental results on DNA sequences from the FBI's combined DNA index system (CODIS) are presented, showing that the empirical mutual information can be a powerful tool in this context as well.

2. THEORETICAL BACKGROUND

In this section, we outline the theoretical basis for the mutual information estimators we will later apply to biological sequences. Suppose we have two strings of unequal lengths,

X_1^n = X_1, X_2, ..., X_n,
Y_1^M = Y_1, Y_2, Y_3, ..., Y_M. (1)
where $M \geq n$, taking values in a common finite alphabet $A$. In most of our experiments, $M$ is significantly larger than $n$; typical values of interest are $n \approx 80$ and $M \approx 300$. Our main goal is to determine whether or not there is some form of statistical dependence between the two strings. Specifically, we assume that the string $X_1^n$ consists of independent and identically distributed (i.i.d.) random variables $X_i$ with common distribution $P(x)$ on $A$, and that the random variables $Y_i$ are also i.i.d. with a possibly different distribution $Q(y)$. Let $\{W(y \mid x)\}$ be a family of conditional distributions, or "channel," with the property that, when the input distribution is $P$, the output has distribution $Q$; that is, $\sum_{x \in A} P(x) W(y \mid x) = Q(y)$ for all $y$. We wish to differentiate between the following two scenarios:

(i) independence: $X_1^n$ and $Y_1^M$ are independent;

(ii) dependence: first $X_1^n$ is generated, then an index $J \in \{1, 2, \ldots, M - n + 1\}$ is chosen in an arbitrary way, and $Y_J^{J+n-1}$ is generated as the output of the discrete memoryless channel $W$ with input $X_1^n$; that is, for each $j = 1, 2, \ldots, n$, the conditional distribution of $Y_{j+J-1}$ given $X_1^n$ is $W(y \mid X_j)$. Finally, the rest of the $Y_i$'s are generated i.i.d. according to $Q$. (To avoid the trivial case where the two scenarios are identical, we assume that the rows of $W$ are not all equal to $Q$, so that in the second scenario $X_1^n$ and $Y_J^{J+n-1}$ are actually not independent.)

It is important at this point to note that, although neither of these two cases is biologically realistic as a description of the elements in a genomic sequence, this set of assumptions turns out to provide a good operational starting point: the experimental results reported in Section 3 clearly indicate that, in practice, the statistical methods obtained under the present assumptions can provide accurate and biologically relevant information. Of course, the natural next step in any application is the careful examination of the corresponding findings, either through purely biological considerations or through further testing.

To distinguish between (i) and (ii), we look at every possible alignment of $X_1^n$ with $Y_1^M$, and we estimate the mutual information between them. Recall that for two random variables $X$, $Y$ with marginal distributions $P(x)$, $Q(y)$, respectively, and joint distribution $V(x, y)$, the mutual information between $X$ and $Y$ is defined as

$$I(X; Y) = \sum_{x,y \in A} V(x, y) \log \frac{V(x, y)}{P(x) Q(y)}. \tag{2}$$

Recall also that $I(X; Y)$ is always nonnegative, and that it equals zero if and only if $X$ and $Y$ are independent. The logarithms above and throughout the paper are taken to base 2, so that $I(X; Y)$ can be interpreted as the number of bits of information that each of these two random variables carries about the other (cf. [12]).

In order to distinguish between the two scenarios above, we compute the empirical mutual information between $X_1^n$ and each contiguous substring of $Y_1^M$ of length $n$: for each $j = 1, 2, \ldots, M - n + 1$, let $p_j(x, y)$ denote the joint empirical distribution of $(X_1^n, Y_j^{j+n-1})$; that is, let $p_j(x, y)$ be the proportion of the $n$ positions in $(X_1, Y_j), (X_2, Y_{j+1}), \ldots, (X_n, Y_{j+n-1})$ where $(X_i, Y_{j+i-1})$ equals $(x, y)$. Similarly, let $p(x)$ and $q_j(y)$ denote the empirical distributions of $X_1^n$ and $Y_j^{j+n-1}$, respectively. We define the empirical (per-symbol) mutual information $I_j(n)$ between $X_1^n$ and $Y_j^{j+n-1}$ by applying (2) to the empirical instead of the true distributions, so that

$$I_j(n) = \sum_{x,y \in A} p_j(x, y) \log \frac{p_j(x, y)}{p(x) q_j(y)}. \tag{3}$$

The law of large numbers implies that, as $n \to \infty$, we have $p(x) \to P(x)$, $q_j(y) \to Q(y)$, and $p_j(x, y)$ converges to the true joint distribution of $X$, $Y$. Clearly, this implies that in scenario (i), where $X_1^n$ and $Y_1^M$ are independent, $I_j(n) \to 0$ for any fixed $j$ as $n \to \infty$. On the other hand, in scenario (ii), $I_J(n)$ converges to $I(X; Y) > 0$, where the two random variables $X$, $Y$ are such that $X$ has distribution $P$ and the conditional distribution of $Y$ given $X = x$ is $W(y \mid x)$.

In passing, we should point out that there are other methods of checking statistical (in)dependence, for instance, the randomization or permutation tests discussed in [13, 14].

2.1. An independence test based on mutual information

We propose the following simple test for detecting dependence between $X_1^n$ and $Y_1^M$. Choose and fix a threshold $\theta > 0$, and compute the empirical mutual information $I_j(n)$ between $X_1^n$ and each contiguous substring $Y_j^{j+n-1}$ of length $n$ from $Y_1^M$. If $I_j(n)$ is larger than $\theta$ for some $j$, declare that the strings $X_1^n$ and $Y_j^{j+n-1}$ are dependent; otherwise, declare that they are independent.

Before examining the issue of selecting the value of the threshold $\theta$, we note that this statistic is identical to the (normalized) log-likelihood ratio between the above two hypotheses. To see this, observe that, expanding the definition of $p_j(x, y)$ in $I_j(n)$, we can rewrite

$$I_j(n) = \sum_{x,y \in A} \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{(X_i, Y_{j+i-1})\}}(x, y) \log \frac{p_j(x, y)}{p(x) q_j(y)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{x,y \in A} \mathbb{I}_{\{(X_i, Y_{j+i-1})\}}(x, y) \log \frac{p_j(x, y)}{p(x) q_j(y)}, \tag{4}$$

where the indicator function $\mathbb{I}_{\{(X_i, Y_{j+i-1})\}}(x, y)$ equals 1 if $(X_i, Y_{j+i-1}) = (x, y)$ and equals zero otherwise. Then, carrying out the inner sum over $(x, y)$, which simply evaluates the logarithm at $(x, y) = (X_i, Y_{j+i-1})$, we obtain

$$I_j(n) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{p_j\bigl(X_i, Y_{j+i-1}\bigr)}{p\bigl(X_i\bigr)\, q_j\bigl(Y_{j+i-1}\bigr)} = \frac{1}{n} \log \prod_{i=1}^{n} \frac{p_j\bigl(X_i, Y_{j+i-1}\bigr)}{p\bigl(X_i\bigr)\, q_j\bigl(Y_{j+i-1}\bigr)}, \tag{5}$$

which is exactly the normalized logarithm of the ratio between the joint empirical likelihood $\prod_{i=1}^{n} p_j(X_i, Y_{j+i-1})$ of the two strings and the product of their empirical marginal likelihoods $\bigl[\prod_{i=1}^{n} p(X_i)\bigr]\bigl[\prod_{i=1}^{n} q_j(Y_{j+i-1})\bigr]$.
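As an illustration of how this statistic might be computed in practice, the following minimal Python sketch (ours, not part of the original paper; the function names `empirical_mi` and `dependence_scan` are hypothetical) implements the per-symbol empirical mutual information of (3) and the sliding-window threshold test of Section 2.1:

```python
import math
from collections import Counter

def empirical_mi(x, y):
    # Per-symbol empirical mutual information (in bits, base-2 logs)
    # between two equal-length strings, as in (3).
    n = len(x)
    joint = Counter(zip(x, y))        # counts for p_j(a, b)
    px, qy = Counter(x), Counter(y)   # counts for p(a) and q_j(b)
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (qy[b] / n)))
               for (a, b), c in joint.items())

def dependence_scan(x, y, theta):
    # Slide x over every contiguous length-n window of the longer
    # string y; report the alignments j (0-based) whose empirical
    # mutual information exceeds the threshold theta.
    n = len(x)
    return [j for j in range(len(y) - n + 1)
            if empirical_mi(x, y[j:j + n]) > theta]

x = "ACGT" * 20                          # n = 80, a typical size in the paper
y = "A" * 30 + x + "C" * 30              # a copy of x planted inside y
print(dependence_scan(x, y, theta=1.0))  # alignments flagged as dependent
```

The planted copy is flagged at alignment $j = 30$; because this toy $x$ is periodic, windows overlapping it at shifts of the period may also exceed the threshold, which is one reason the choice of $\theta$, discussed next, matters.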
2.2. Probabilities of error

There are two kinds of errors this test can make: declaring that two strings are dependent when they are not, and vice versa. The actual probabilities of these two types of errors depend on the distribution of the statistic $I_j(n)$. Since this distribution is independent of $j$, we take $j = 1$ and write $I(n)$ for the normalized log-likelihood ratio $I_1(n)$. The next two subsections present some classical asymptotics for $I_1(n)$.

Scenario (i): independence

We already noted that in this case $I(n)$ converges to zero as $n \to \infty$, and below we shall see that this convergence takes place at a rate of approximately $1/n$. Specifically, $I(n) \to 0$ with probability one, and a standard application of the multivariate central limit theorem for the joint empirical distribution yields the asymptotic distribution of $I(n)$.

Scenario (ii): dependence

In this scenario, $I(n)$ converges to the true value $I = I(X; Y)$ of the mutual information but, as we show below, the rate of this convergence is slower than the $1/n$ rate of scenario (i): here, $I(n) \to I$ with probability one, but only at rate $1/\sqrt{n}$, in that $\sqrt{n}\,[I(n) - I]$ converges in distribution to a Gaussian,

$$\sqrt{n}\,\bigl(I(n) - I\bigr) \xrightarrow{\mathcal{D}} T \sim N\bigl(0, \sigma^2\bigr), \tag{10}$$

where the resulting variance $\sigma^2$ is given by

$$\sigma^2 = \operatorname{Var}\left(\log \frac{W(Y \mid X)}{Q(Y)}\right) = \sum_{x,y \in A} P(x) W(y \mid x) \left(\log \frac{W(y \mid x)}{Q(y)} - I\right)^2. \tag{11}$$

An outline of the proof of (10) is given below; for another derivation see [19]. Therefore, for any fixed threshold θ
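For a concrete sense of the quantities appearing in (10) and (11), the following small sketch (ours, not from the paper; the binary alphabet, uniform input, and symmetric channel are assumed purely for illustration) computes $I = I(X;Y)$ and the asymptotic variance $\sigma^2$ directly from $P$ and $W$:

```python
import math

def output_dist(P, W):
    # Q(y) = sum_x P(x) W(y | x)
    return [sum(P[x] * W[x][y] for x in range(len(P)))
            for y in range(len(W[0]))]

def mutual_information(P, W):
    # I(X; Y) in bits for input distribution P and channel W.
    Q = output_dist(P, W)
    return sum(P[x] * W[x][y] * math.log2(W[x][y] / Q[y])
               for x in range(len(P)) for y in range(len(Q)) if W[x][y] > 0)

def sigma_sq(P, W):
    # Asymptotic variance of (11): the variance of log2(W(Y|X)/Q(Y)),
    # a random variable whose mean is exactly I(X; Y).
    Q = output_dist(P, W)
    I = mutual_information(P, W)
    return sum(P[x] * W[x][y] * (math.log2(W[x][y] / Q[y]) - I) ** 2
               for x in range(len(P)) for y in range(len(Q)) if W[x][y] > 0)

# Illustrative binary symmetric channel with crossover probability 0.1
# and a uniform input distribution.
P = [0.5, 0.5]
W = [[0.9, 0.1],
     [0.1, 0.9]]
print(mutual_information(P, W))  # about 0.531 bits
print(sigma_sq(P, W))            # about 0.904
```

In this toy case $I \approx 0.531$ bits while $\sigma^2 \approx 0.904$, so under dependence the statistic fluctuates around $I$ with standard deviation roughly $\sigma/\sqrt{n}$, consistent with the $1/\sqrt{n}$ rate in (10).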
[Figure: DNA structure of zmSRp32]